Conversation Mode - Voice Loop with Push-to-Talk
You now have access to both text-to-speech (claude-say) AND speech-to-text (claude-listen) for a complete voice conversation.
Architecture
Uses Push-to-Talk (PTT) mode with VAD auto-stop AND auto-start:
- •Recording starts automatically after the welcome message (no key press needed!)
- •Recording stops automatically when you stop speaking (VAD detection)
- •Recording restarts automatically after Claude finishes speaking
- •The conversation flows naturally without any key presses!
- •Silence threshold: 1.5 seconds
- •Echo prevention delay: 400ms after TTS
- •Mic indicator appears in menu bar when recording is active
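The timing values above can be pictured as a small configuration object. This is only an illustrative sketch: `PttTiming`, `restart_time_ms`, and `should_stop` are hypothetical names, not part of claude-listen. Only the 1.5 s silence threshold and the 400 ms echo delay come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PttTiming:
    """Hypothetical container for the timing values listed above."""
    vad_silence_ms: int = 1500  # continuous silence that ends a recording
    echo_delay_ms: int = 400    # pause after TTS before the mic reopens


def restart_time_ms(tts_end_ms: int, timing: PttTiming = PttTiming()) -> int:
    """Earliest time (in ms) at which recording may restart after TTS ends."""
    return tts_end_ms + timing.echo_delay_ms


def should_stop(silence_ms: int, timing: PttTiming = PttTiming()) -> bool:
    """True once continuous silence reaches the VAD threshold."""
    return silence_ms >= timing.vad_silence_ms
```

The echo delay exists so the microphone does not pick up the tail of Claude's own speech, which is why the restart time is computed from the end of TTS rather than from the end of the previous recording.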
Available MCP Tools
claude-listen (STT - Push-to-Talk with VAD)
Synchronous mode (blocking) - RECOMMENDED:
| Tool | Description |
|---|---|
| start_ptt_mode(key?, auto_stop?, vad_silence_ms?, auto_start?, echo_delay_ms?) | Start PTT mode. Use auto_stop=True, auto_start=True for seamless conversation |
| stop_ptt_mode() | Stop PTT mode |
| get_ptt_status() | Get PTT state (includes "auto_stop, auto_start" indicators if enabled) |
| get_segment_transcription(wait?, timeout?) | Wait for transcription (default timeout: 120s). Returns status: [Ready], [Recording...], [Transcribing...] |
Background mode (non-blocking) - Alternative:
| Tool | Description |
|---|---|
| start_ptt_background(key?) | Start PTT in background process |
| check_transcription() | Check for new transcription (non-blocking) |
| stop_ptt_background() | Stop background PTT |
claude-say (TTS)
| Tool | Description |
|---|---|
| speak(text, voice?, speed?) | Queue text, returns immediately (preferred for natural flow) |
| speak_and_wait(text, voice?, speed?) | Speak and wait for completion (use when expecting response) |
| stop_speaking() | Stop immediately |
TTS Backends
The TTS backend is configured in ~/.mcp-claude-say/.env:
| Backend | Description |
|---|---|
| macos | Native macOS say command (default, instant, offline) |
| kokoro | Kokoro MLX - 54 neural voices, 9 languages, runs locally on Apple Silicon |
| google | Google Cloud TTS - neural voices, requires API key |
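As a rough illustration of how a backend setting could be read, the sketch below parses .env-style text for the `TTS_BACKEND` variable and falls back to `macos`. It is not the loader claude-say actually uses: `read_tts_backend` and `VALID_BACKENDS` are hypothetical names, and the parsing rules are an assumption.

```python
# Backend names taken from the table above.
VALID_BACKENDS = {"macos", "kokoro", "google"}


def read_tts_backend(env_text: str, default: str = "macos") -> str:
    """Return the TTS_BACKEND value from .env-style text, or the default."""
    for raw in env_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "TTS_BACKEND":
            backend = value.strip().strip('"').strip("'").lower()
            if backend not in VALID_BACKENDS:
                raise ValueError(f"unknown TTS backend: {backend!r}")
            return backend
    return default
```

For example, `read_tts_backend("TTS_BACKEND=kokoro")` selects the Kokoro backend, while text with no `TTS_BACKEND` line yields the `macos` default.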
Kokoro Voices (if TTS_BACKEND=kokoro)
Pass voice ID as the voice parameter to use a specific voice:
| Language | Voice Examples |
|---|---|
| American English | af_heart (default), af_nova, am_adam, am_echo |
| British English | bf_emma, bf_alice, bm_george, bm_daniel |
| French | ff_siwis |
| Spanish | ef_dora, em_alex |
| Italian | if_sara, im_nicola |
| Portuguese | pf_dora, pm_alex |
| Japanese | jf_alpha, jm_kumo |
| Chinese | zf_xiaoxiao, zm_yunxi |
| Hindi | hf_alpha, hm_omega |
Example: speak("Bonjour!", voice="ff_siwis") for French.
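One way to choose a voice automatically is a lookup from a language tag to the first voice listed for that language in the table above. The language tags (`"fr"`, `"en-GB"`, and so on) and the `pick_voice` helper are assumptions made for this sketch; only the voice IDs come from the table.

```python
# First voice listed per language in the table above.
# The language tags used as keys are an assumption of this sketch.
DEFAULT_VOICES = {
    "en-US": "af_heart",
    "en-GB": "bf_emma",
    "fr": "ff_siwis",
    "es": "ef_dora",
    "it": "if_sara",
    "pt": "pf_dora",
    "ja": "jf_alpha",
    "zh": "zf_xiaoxiao",
    "hi": "hf_alpha",
}


def pick_voice(language: str, fallback: str = "af_heart") -> str:
    """Return a Kokoro voice ID for a language tag, or the fallback."""
    if language in DEFAULT_VOICES:
        return DEFAULT_VOICES[language]
    # Retry with the primary subtag, e.g. "fr-CA" -> "fr".
    primary = language.split("-")[0]
    return DEFAULT_VOICES.get(primary, fallback)
```

The result can then be passed as the voice parameter, for instance `speak("Bonjour!", voice=pick_voice("fr"))`.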
When to use which TTS tool
IMPORTANT - Natural Speech Pattern:
- •speak(): Use for normal responses. One single speak() call with your complete answer is the default.
- •speak_and_wait(): ONLY use when you have a VERY LONG response broken into multiple parts. Put speak_and_wait() at the END to ensure all speech completes before listening.
- •Default speed: Always use speed=1 (1.0) for natural pacing.
Best practice - use speak() for normal responses:
# For typical responses, use ONE speak() call:
speak("I understand completely. The function you're looking for handles authentication and it's located in the auth module. It validates tokens and manages user sessions.", speed=1)
Only use speak_and_wait() for very long multi-part explanations:
# For very long responses that must be split:
speak("First part of a very detailed explanation that covers the initial concept.", speed=1)
speak("Second part that continues with more details.", speed=1)
speak_and_wait("Final part that concludes the explanation.", speed=1) # Only the last one waits
Why this matters: speak() returns immediately without blocking. speak_and_wait() blocks until speech completes, which is only needed when breaking long responses into parts to ensure proper sequencing.
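The sequencing rule can be summarized as "queue everything except the last part". The helper below is a hypothetical sketch (`plan_speech` is not a claude-say tool) that pairs each part of a response with the name of the tool to call.

```python
from typing import List, Tuple


def plan_speech(parts: List[str]) -> List[Tuple[str, str]]:
    """Pair each part with a tool name: queue all parts but the last."""
    parts = [p.strip() for p in parts if p.strip()]
    if len(parts) <= 1:
        # A single complete answer needs only one non-blocking call.
        return [("speak", p) for p in parts]
    plan = [("speak", p) for p in parts[:-1]]
    # Only the final part blocks, so all queued speech finishes first.
    plan.append(("speak_and_wait", parts[-1]))
    return plan
```

A one-part answer maps to a single speak() call, and a three-part answer maps to two speak() calls followed by one speak_and_wait() call.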
Important: First Message Latency
The first message in a session may take 2-3 seconds longer than usual. This is normal and expected because:
- •VAD Model Loading: The Silero Voice Activity Detection model loads on first use (~2MB)
- •Audio Baseline Calibration: The system learns your ambient noise level
Subsequent messages will be much faster. This is a one-time delay per session.
How It Works
┌─────────────────────────────────────────────────────────┐
│  Seamless Conversation with Auto-Stop + Auto-Start      │
│                                                         │
│  /conversation → Welcome TTS                            │
│        │                                                │
│  [TTS complete] → [400ms delay] → Auto-start recording  │
│        │                                                │
│        │   🎤 Mic indicator in menu bar                 │
│        │   (user speaks...)                             │
│        │                                                │
│  [1.5s silence] → Auto-stop → Transcribe                │
│        │                                                │
│        ↓                                                │
│  Claude responds vocally (TTS)                          │
│        │                                                │
│        ↓                                                │
│  [TTS complete] → [400ms delay] → Auto-start recording  │
│        │                                                │
│        │   (user speaks... loop continues!)             │
│        │                                                │
└─────────────────────────────────────────────────────────┘
- •User types /conversation to start
- •Claude plays welcome message (TTS)
- •After TTS → 400ms delay → recording auto-starts (mic indicator in menu bar)
- •User speaks when mic is active
- •VAD detects 1.5s of silence → auto-stops recording
- •Audio is transcribed with the configured STT engine
- •Claude processes and responds vocally
- •After TTS completes → 400ms delay → auto-starts recording
- •Conversation flows until user says "fin de session"
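The steps above behave like a small state machine. The sketch below models it with a transition table; the state and event names are invented for illustration and do not correspond to identifiers in claude-listen or claude-say.

```python
from enum import Enum, auto


class State(Enum):
    SPEAKING = auto()      # TTS is playing (welcome message or a reply)
    ECHO_DELAY = auto()    # 400 ms pause after TTS
    RECORDING = auto()     # mic is open, indicator shown in menu bar
    TRANSCRIBING = auto()  # STT engine is processing the audio
    RESPONDING = auto()    # Claude is preparing a reply
    ENDED = auto()         # user ended the session


# (current state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    (State.SPEAKING, "tts_complete"): State.ECHO_DELAY,
    (State.ECHO_DELAY, "delay_elapsed"): State.RECORDING,
    (State.RECORDING, "silence_detected"): State.TRANSCRIBING,
    (State.TRANSCRIBING, "text_ready"): State.RESPONDING,
    (State.TRANSCRIBING, "end_phrase"): State.ENDED,
    (State.RESPONDING, "reply_ready"): State.SPEAKING,
}


def step(state: State, event: str) -> State:
    """Return the next state, or raise if the event is not valid here."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not allowed in {state.name}") from None
```

Starting from `State.SPEAKING`, the events `tts_complete`, `delay_elapsed`, `silence_detected`, `text_ready`, and `reply_ready` lead back to `State.SPEAKING`, which is the loop shown in the diagram.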
Starting Conversation Mode
# 1. Start PTT mode with VAD auto-stop AND auto-start for seamless conversation
start_ptt_mode(auto_stop=True, auto_start=True)
# 2. Welcome message - recording starts AUTOMATICALLY after TTS completes!
# IMPORTANT: Use the user's language! Include first-message latency notice.
# With auto_start=True, recording begins right after speak_and_wait() - NO KEY PRESS NEEDED!
# The mic indicator appears in the menu bar when recording is active.
# Examples:
# - English: "Ready. Speak when the mic activates. The first message may take a moment."
# - French: "Prêt. Parle quand le micro s'active. Le premier message peut prendre un moment."
speak_and_wait("Ready. Speak when the mic activates. The first message may take a moment.")
# Recording auto-starts immediately after TTS completes - mic indicator shows in menu bar
# 3. Wait for transcription (auto-stops when silence detected)
transcription = get_segment_transcription(wait=True, timeout=120)
# 4. Process and respond (use speak() for natural flow, speak_and_wait() at the end)
speak("Here's what I found.")
speak("The first point is this.")
speak_and_wait("What would you like to know next?") # After this, recording auto-starts!
# 5. Loop back to step 3 - fully automatic flow, no key presses!
Conversation Loop
# Start with VAD auto-stop AND auto-start for seamless conversation
start_ptt_mode(auto_stop=True, auto_start=True)
# Welcome message - recording starts automatically after TTS!
# French: "Prêt. Parle quand le micro s'active. Le premier message peut prendre un moment."
# English: "Ready. Speak when the mic activates. The first message may take a moment."
speak_and_wait("Ready. Speak when the mic activates. The first message may take a moment.")
# Recording auto-starts after TTS - mic indicator appears in menu bar!
# Main loop - fully automatic, no key presses needed!
while True:
    # Wait for transcription (VAD auto-stops when user finishes speaking)
    text = get_segment_transcription(wait=True, timeout=120)
    # Check for end command
    if "fin de session" in text.lower():
        break
    # Check for timeout
    if "Timeout" in text:
        speak_and_wait("Tu es toujours là?")  # Recording auto-starts after this!
        continue
    # Process and respond - use speak() for flow, speak_and_wait() at end
    speak("I understand your question.")
    speak("Let me explain.")
    speak_and_wait("Does that make sense?")  # After this, recording auto-starts!
    # Conversation flows naturally - no key presses at all!
# End session
stop_ptt_mode()
speak_and_wait("Désactivé.")
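To dry-run the loop logic without audio hardware, the tools can be replaced by plain callables. The sketch below is a hypothetical test harness: `run_conversation`, its parameters, and the `max_turns` guard are not part of the MCP tools, and the exact text of a timeout result is assumed only to contain "Timeout", as in the loop above.

```python
from typing import Callable

END_PHRASE = "fin de session"  # end command used in the loop above


def run_conversation(
    listen: Callable[[], str],
    say: Callable[[str], None],
    respond: Callable[[str], str],
    max_turns: int = 100,
) -> int:
    """Run the loop with injected tools; return the number of answered turns."""
    answered = 0
    for _ in range(max_turns):  # guard against an endless loop in tests
        text = listen()
        if END_PHRASE in text.lower():
            break
        if "Timeout" in text:
            say("Tu es toujours là?")  # "Are you still there?"
            continue
        say(respond(text))
        answered += 1
    return answered
```

In a test, `listen` can be a function that returns scripted strings one at a time, and `say` can be the `append` method of a list, which makes the spoken output easy to inspect.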
Ending Conversation Mode
When user says "fin de session" (or similar):
stop_ptt_mode()
speak_and_wait("Désactivé.")
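Because transcriptions may vary in capitalization, punctuation, and accents, matching the end command on normalized text can be more forgiving than a plain substring test. The sketch below is one possible approach, not the project's implementation: `normalize`, `is_end_command`, and every phrase other than "fin de session" are hypothetical.

```python
import unicodedata

# "fin de session" comes from the text above; the other phrases are
# hypothetical examples of "similar" commands.
END_PHRASES = ("fin de session", "end session", "end of session")


def normalize(text: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in no_accents)
    return " ".join(cleaned.split())


def is_end_command(text: str) -> bool:
    """True if the normalized text contains any known end phrase."""
    normalized = normalize(text)
    return any(phrase in normalized for phrase in END_PHRASES)
```

Substring matching keeps the check tolerant of filler words around the command, at the cost of also triggering when the phrase appears inside a longer sentence.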
Background Mode (Non-Blocking) - Alternative
Background mode uses polling instead of blocking. Use this if you need Claude to do other tasks while waiting for speech.
Starting Background Mode
# 1. Start background PTT
start_ptt_background() # Returns immediately
# 2. Confirm vocally
speak_and_wait("Prêt.")
# 3. Poll for transcriptions (non-blocking)
result = check_transcription()
# Returns: transcription text, or status like "[Ready...]", "[Recording...]"
Background Conversation Loop
import time
while True:
    # Non-blocking check
    result = check_transcription()
    # Check if it's actual transcription (not status message)
    if not result.startswith("["):
        # Got real transcription!
        if "fin de session" in result.lower():
            break
        # Respond
        speak_and_wait(f"Tu as dit: {result}")
    # Small delay before next check
    time.sleep(0.5)
# End session
stop_ptt_background()
speak_and_wait("Désactivé.")
When to use Background Mode
- •When you need Claude to perform other tasks while waiting
- •When synchronous mode times out frequently
- •Note: Creates more visible tool calls in the interface
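The polling pattern can be wrapped in a helper that separates status messages from real transcriptions and gives up after a deadline. This is a sketch under assumptions: `wait_for_speech` and `is_status` are hypothetical names, `check` stands in for check_transcription(), and treating an empty result as "nothing yet" is a choice made here rather than documented behavior.

```python
import time
from typing import Callable, Optional


def is_status(result: str) -> bool:
    """Status messages are bracketed, e.g. "[Recording...]"."""
    return result.strip().startswith("[")


def wait_for_speech(
    check: Callable[[], str],
    timeout_s: float = 120.0,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Poll until a real transcription arrives; return None on timeout."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        result = check()
        # Assumption: an empty string means no transcription yet.
        if result and not is_status(result):
            return result
        sleep(interval_s)
    return None
```

Injecting `sleep` and `clock` keeps the helper testable without real delays; other work can be done between calls by using a short `timeout_s` and calling the helper again later.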
Important Rules
- •Use speak() for natural flow - Queue multiple sentences without blocking
- •Use speak_and_wait() at the end - Only when you need to wait for user response
- •No code vocally - Never read code, paths, or logs aloud
- •Match language - Respond in the same language as the user
- •Detailed responses by default - Give thorough, complete explanations naturally. Technical topics, concepts, and questions deserve full answers. Don't artificially shorten responses.
- •Execute directly - Don't announce actions, just do them and report results
- •Minimal activation messages - Use ONE word only for activation ("Ready", "Prêt", etc.) and deactivation ("Disabled", "Désactivé", etc.) in the user's language
- •Show visual content proactively - When explaining concepts, processes, or technical topics, display diagrams, tables, code snippets, or structured lists on screen. Voice mode doesn't mean voice-only - use the screen as a visual aid. If something would be clearer with a diagram or example, show it while explaining verbally.
Error Handling
- •If timeout (no speech): speak_and_wait("Tu es toujours là?") (French for "Are you still there?")
- •If transcription unclear: speak_and_wait("Je n'ai pas compris, peux-tu répéter?") (French for "I didn't understand, can you repeat?")
Available Keys for PTT
| Key | Name |
|---|---|
| cmd_r | Right Command (default, recommended) |
| cmd_l+s | Left Command + S |
| cmd_r+m | Right Command + M |
| cmd_l | Left Command |
| alt_r | Right Option |
| alt_l | Left Option |
| ctrl_r | Right Control |
| f13, f14, f15 | Function keys |
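Before passing a value as the key parameter, it can be checked against the table above. The helper below is a hypothetical sketch (`parse_ptt_key` is not part of claude-listen); it accepts only the key names listed in the table and splits a combination such as `cmd_l+s` into its two parts.

```python
from typing import Optional, Tuple

# Key names copied from the table above.
ALLOWED_KEYS = {
    "cmd_r", "cmd_l+s", "cmd_r+m", "cmd_l",
    "alt_r", "alt_l", "ctrl_r", "f13", "f14", "f15",
}


def parse_ptt_key(spec: str = "cmd_r") -> Tuple[str, Optional[str]]:
    """Validate a key name; return (base key, extra key or None)."""
    key = spec.strip().lower()
    if key not in ALLOWED_KEYS:
        raise ValueError(f"unsupported PTT key: {spec!r}")
    base, _, extra = key.partition("+")
    return base, (extra or None)
```

For example, `parse_ptt_key("cmd_l+s")` yields the pair `("cmd_l", "s")`, and calling it with no argument yields the default Right Command key.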