0.1.1 • Published yesterday

@komaa/msteams-voice

Licence: MIT · Version: 0.1.1 · Deps: 4 · Size: 205 kB · Vulns: 0 · Weekly: 0

A self-contained Microsoft Teams voice agent (CVI) for OpenClaw: an AI assistant that joins Teams calls as a real participant, using either realtime speech-to-speech or a streaming STT → agent → TTS pipeline. It supports continuous vision, "speak-only-when-addressed" gating, outbound call-backs with voicemail, avatar lip-sync, and meeting recap.

Full documentation: docs.komaa.com — setup walkthroughs, the Teams worker, configuration reference, and troubleshooting. This README is the quick start.

It's one plugin depending only on the published openclaw plugin-sdk + api.runtime — no fork, no vendored runtime.

Features

  • Realtime speech-to-speech (e.g. OpenAI Realtime) or streaming STT → agent → TTS (any provider)
  • Continuous vision — the agent can "look at" screen-share / camera frames, with a per-minute budget
  • Group-call gating — only answers when addressed by a wake phrase (silent otherwise)
  • Outbound call-backs + voicemail — place a call, deliver a message, or open a conversation
  • Meeting recap — a .docx of minutes with per-speaker attribution
  • Bilingual (Arabic / English) · DTMF · barge-in / echo guard · HMAC-signed media bridge + caller allowlist
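
As a rough illustration of the group-call gating listed above, the sketch below answers only when a wake phrase is heard, then keeps answering for a short follow-up window. It is not the plugin's code: createAddressGate and its substring matching are hypothetical, and the reading of followUpWindowMs (each accepted utterance reopens the window) is an assumption. Only the config key names (requireAddress, wakePhrases, followUpWindowMs) come from this README.

```typescript
interface GroupCallConfig {
  requireAddress: boolean;
  wakePhrases: string[];
  followUpWindowMs?: number;
}

// Hypothetical helper: returns a function that decides, per utterance,
// whether the agent should answer or stay silent.
function createAddressGate(cfg: GroupCallConfig) {
  let lastAcceptedAt: number | null = null;

  return function shouldRespond(transcript: string, nowMs: number): boolean {
    // With gating off, every utterance is answered.
    if (!cfg.requireAddress) return true;

    const text = transcript.toLowerCase();
    const addressed = cfg.wakePhrases.some((p) => text.includes(p.toLowerCase()));

    // Assumption: after an accepted utterance, follow-ups within the window
    // do not need the wake phrase again.
    const windowMs = cfg.followUpWindowMs ?? 0;
    const inFollowUp = lastAcceptedAt !== null && nowMs - lastAcceptedAt <= windowMs;

    if (addressed || inFollowUp) {
      lastAcceptedAt = nowMs;
      return true;
    }
    return false;
  };
}
```

With wakePhrases set to ["assistant"] and an 8000 ms window, "Assistant, what time is it" is answered, a follow-up four seconds later is answered too, and unrelated chatter well after the window is ignored.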

Requirements

  • An OpenClaw install (host ≥ 2026.6.9).
  • A Microsoft Teams worker (Azure Bot) that bridges the call audio to this plugin's media WebSocket — see docs.komaa.com.
  • For realtime mode: a realtime voice provider + key. For streaming mode: your openclaw-configured STT/TTS/agent (no realtime key needed).

Install

openclaw plugins install clawhub:@komaa/msteams-voice
cd extensions/msteams-voice && pnpm install && pnpm build

Two modes

| | realtime | streaming |
| --- | --- | --- |
| How it talks | speech-to-speech realtime model | your openclaw STT → agent/model → TTS |
| Needs a realtime provider | yes (realtime.provider + key) | no |
| Latency | lowest | higher (per-turn) |
| Vision | continuous push (live) | attached to each agent turn |
| Use it when | you have a realtime voice model | any STT/TTS/model, or lower cost |

Mode selection: set mode to "realtime" or "streaming". If omitted, the runtime auto-selects realtime when a realtime provider resolves, else streaming. Both modes honor the inbound allowlist, outbound call-backs, recording gate, and sessionScope agent memory.
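
The selection rule above can be sketched as a small function. This is only an illustration under stated assumptions: resolveMode is a hypothetical name, and the boolean realtimeProviderResolves stands in for whatever check the runtime actually performs. How an explicit "realtime" setting behaves when no provider resolves is not covered here.

```typescript
type VoiceMode = "realtime" | "streaming";

interface ModeConfig {
  // Explicit choice from plugins.entries."msteams-voice".config.mode, if any.
  mode?: VoiceMode;
}

// Hypothetical helper mirroring the documented rule:
// explicit mode if set, else realtime when a provider resolves, else streaming.
function resolveMode(config: ModeConfig, realtimeProviderResolves: boolean): VoiceMode {
  if (config.mode !== undefined) {
    return config.mode; // an explicit setting is used as-is in this sketch
  }
  return realtimeProviderResolves ? "realtime" : "streaming";
}
```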

Configuration

Config lives under plugins.entries."msteams-voice".config in your OpenClaw config. sharedSecret must match the Teams worker that connects to this plugin's media WebSocket.

Realtime mode (speech-to-speech)
{
  "plugins": { "entries": { "msteams-voice": { "config": {
    "enabled": true,
    "mode": "realtime",
    "port": 9442,
    "path": "/voice/msteams/stream",
    "sharedSecret": "<same secret as the Teams worker>",
    "requireRecordingStatus": true,
    "inboundPolicy": "allowlist",
    "allowFrom": ["<caller AAD object id or phone number>"],
    "inboundGreeting": "Hello, this is the assistant.",
    "maxConcurrentCalls": 4,
    "maxDurationSeconds": 3600,
    "groupCall": { "requireAddress": true, "wakePhrases": ["assistant"], "followUpWindowMs": 8000 },
    "maxVisionPerMinute": 30,
    "meetingRecap": true,
    "bilingual": true,
    "realtime": {
      "provider": "openai",
      "providers": { "openai": { "apiKey": "<key>", "model": "gpt-realtime" } },
      "instructions": "You are a helpful Teams meeting assistant.",
      "toolPolicy": "safe-read-only",
      "suppressInputDuringPlayback": true,
      "echoSuppressionWindowMs": 250,
      "echoBargeInRms": 0.02
    }
  } } } }
}
Streaming mode (STT → agent → TTS, no realtime model)
{
  "plugins": { "entries": { "msteams-voice": { "config": {
    "enabled": true,
    "mode": "streaming",
    "port": 9442,
    "path": "/voice/msteams/stream",
    "sharedSecret": "<same secret as the Teams worker>",
    "requireRecordingStatus": true,
    "inboundPolicy": "allowlist",
    "allowFrom": ["<caller id>"],
    "inboundGreeting": "Hello, this is the assistant.",
    "maxConcurrentCalls": 4,
    "groupCall": { "requireAddress": true, "wakePhrases": ["assistant"] },
    "maxVisionPerMinute": 30,
    "meetingRecap": true,
    "stt": {
      "provider": "<your-stt-provider>",
      "providers": { "<your-stt-provider>": { "apiKey": "<key>" } }
    }
  } } } }
}

In streaming mode, TTS and the agent/model come from your openclaw configuration. STT uses a live transcription session — selected by stt.provider/stt.providers if set, else your openclaw-configured transcription provider; if none resolves it falls back to VAD-segmented file transcription. No realtime provider/key needed. The realtime.* block is ignored except the echo-guard knobs (suppressInputDuringPlayback, echoSuppressionWindowMs, echoBargeInRms), which apply in both modes. Group-call gating, DTMF, and vision work in streaming mode too.
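
The STT fallback order described above can be pictured as follows. This is a minimal sketch, not the plugin's implementation: resolveSttSource and the SttSource type are hypothetical, and treating stt.provider as "set" only when it has a matching entry in stt.providers is an assumption.

```typescript
type SttSource =
  | { kind: "live"; provider: string; origin: "plugin" | "openclaw" }
  | { kind: "vad-file" };

interface SttConfig {
  provider?: string;
  providers?: Record<string, { apiKey?: string }>;
}

// Hypothetical helper. openclawProvider stands in for the host's configured
// transcription provider, if one resolves.
function resolveSttSource(
  stt: SttConfig | undefined,
  openclawProvider: string | undefined,
): SttSource {
  // 1. Plugin-level stt.provider / stt.providers, if set.
  const name = stt?.provider;
  if (name && stt?.providers?.[name]) {
    return { kind: "live", provider: name, origin: "plugin" };
  }
  // 2. Otherwise the openclaw-configured transcription provider.
  if (openclawProvider) {
    return { kind: "live", provider: openclawProvider, origin: "openclaw" };
  }
  // 3. Nothing resolved: VAD-segmented file transcription.
  return { kind: "vad-file" };
}
```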

Outbound call-backs (optional, either mode)
"outbound": {
  "enabled": true,
  "workerBaseUrl": "https://<your-teams-worker>",
  "tenantId": "<aad-tenant-id>",
  "answerTimeoutMs": 120000,
  "defaultMode": "notify"        // "notify" delivers a message then ends; "conversation" opens a turn
}

placeCall(userObjectId, { message, mode }) is implemented on the runtime; a call that goes unanswered or is declined ends in the voicemail/no-answer path. The outbound block enables it. Triggering it currently requires a host call into the runtime; a built-in agent tool or endpoint is a small follow-up.
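
A host-side call might look roughly like the sketch below. Only the call signature placeCall(userObjectId, { message, mode }) and the outbound keys come from this README; the OutboundRuntime interface, the callBack and resolveOutboundMode helpers, the Promise return type, and the fallback to "notify" when no default is configured are all assumptions for illustration.

```typescript
type OutboundMode = "notify" | "conversation";

interface OutboundConfig {
  enabled: boolean;
  defaultMode?: OutboundMode;
}

// Hypothetical shape of the runtime surface; the real return type is not
// documented here, so it is left as unknown.
interface OutboundRuntime {
  placeCall(
    userObjectId: string,
    options: { message: string; mode: OutboundMode },
  ): Promise<unknown>;
}

// An explicit per-call mode wins, then outbound.defaultMode.
// Falling back to "notify" when neither is set is an assumption.
function resolveOutboundMode(outbound: OutboundConfig, requested?: OutboundMode): OutboundMode {
  return requested ?? outbound.defaultMode ?? "notify";
}

// Hypothetical host helper: refuses to dial when outbound is disabled.
async function callBack(
  runtime: OutboundRuntime,
  outbound: OutboundConfig,
  userObjectId: string,
  message: string,
  requested?: OutboundMode,
): Promise<unknown> {
  if (!outbound.enabled) {
    throw new Error("outbound call-backs are disabled in config");
  }
  const mode = resolveOutboundMode(outbound, requested);
  return runtime.placeCall(userObjectId, { message, mode });
}
```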

Key reference
| Key | Applies | Meaning |
| --- | --- | --- |
| enabled | both | master on/off |
| mode | both | "realtime" or "streaming" (auto if omitted) |
| port / bindAddress / path | both | media WebSocket server the Teams worker connects to |
| sharedSecret | both | HMAC secret; must match the worker (secret input) |
| requireRecordingStatus | both | only engage once Teams reports recording active |
| inboundPolicy | both | disabled, allowlist, pairing, or open; enforced on inbound |
| allowFrom | both | allowlisted caller ids (Teams aadId or phone digits) |
| inboundGreeting | both | opening line |
| sessionScope | both | per-phone, per-call, or per-thread agent-memory scope |
| maxConcurrentCalls / maxDurationSeconds / staleCallReaperSeconds | both | capacity + reaper |
| groupCall.{requireAddress,wakePhrases,followUpWindowMs} | both | speak-only-when-addressed gating |
| maxVisionPerMinute | both | vision spend cap |
| meetingRecap / bilingual | both | post-call minutes / Arabic-English |
| realtime.{provider,providers,instructions,toolPolicy,…} | realtime (echo knobs: both) | realtime voice provider + behavior; provider key is a secret input |
| stt.{provider,providers} | streaming | live transcription provider (else openclaw STT / file fallback); provider key is a secret input |
| outbound.{enabled,workerBaseUrl,tenantId,answerTimeoutMs,defaultMode} | both | outbound call-backs / voicemail |
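
To make the maxVisionPerMinute cap concrete, here is one way such a per-minute budget could work: a sliding one-minute window over frame timestamps. createVisionBudget is a hypothetical helper, and whether the plugin uses a sliding window, a fixed window, or something else is not stated in this README.

```typescript
const WINDOW_MS = 60_000;

// Hypothetical helper: returns a function that reports whether one more
// vision frame may be sent at time nowMs, and records it if so.
function createVisionBudget(maxPerMinute: number) {
  const stamps: number[] = [];

  return function tryConsume(nowMs: number): boolean {
    // Drop timestamps that have aged out of the one-minute window.
    while (stamps.length > 0) {
      const oldest = stamps[0];
      if (oldest === undefined || nowMs - oldest < WINDOW_MS) break;
      stamps.shift();
    }
    if (stamps.length >= maxPerMinute) {
      return false; // budget spent; skip this frame
    }
    stamps.push(nowMs);
    return true;
  };
}
```

With a budget of 2, frames at 0 ms and 10 ms are accepted, a frame at 20 ms is skipped, and a frame at 60000 ms is accepted again because the first one has aged out.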

Architecture

The hard parts come from OpenClaw — this plugin is intentionally small:

  • Realtime audio bridge: openclaw/plugin-sdk/realtime-voice (createRealtimeVoiceBridgeSession, consultRealtimeVoiceAgent, resolveConfiguredRealtimeVoiceProvider).
  • Agent / TTS / STT / media / state / config / logging: api.runtime.
  • Owned code: src/call-lifecycle.ts (~500 LOC) + thin adapters + the Teams CVI logic.
  • The entry registers a host-managed service (api.registerService({ id, start, stop })).

Built by Komaa.com · MIT licensed
