
conversation

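Bidirectional voice conversation with Push-to-Talk. Use when the user says "conversation mode", "let's talk", "parlons", "voice conversation", "dialogue vocal", or "PTT mode", or wants to speak WITH Claude (not just listen). For one-way TTS (Claude speaks, user types), use /speak instead.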

SKILL.md
--- frontmatter
name: conversation
description: "Bidirectional voice conversation with Push-to-Talk. Use when user says: 'conversation mode', 'let's talk', 'parlons', 'voice conversation', 'dialogue vocal', 'PTT mode', or wants to speak WITH Claude (not just listen). For one-way TTS (Claude speaks, user types), use /speak instead."
user_invocable: true

Conversation Mode - Voice Loop with Push-to-Talk

You now have access to both text-to-speech (claude-say) AND speech-to-text (claude-listen) for a complete voice conversation.

Architecture

Conversation mode uses Push-to-Talk (PTT) with Voice Activity Detection (VAD) auto-stop and auto-start:

  • Recording starts automatically after the welcome message (no key press needed)
  • Recording stops automatically when you stop speaking (detected by VAD)
  • Recording restarts automatically after Claude finishes speaking
  • The conversation flows naturally without any key presses
  • Silence threshold: 1.5 seconds
  • Echo prevention delay: 400ms after TTS
  • Mic indicator appears in menu bar when recording is active
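
The auto-stop rule above can be sketched as a small pure function. This is only an illustration of the idea, not the claude-listen implementation: the function name, the 30 ms frame size, and the requirement that speech must be heard before silence counts are all assumptions. Only the 1.5 second threshold comes from the text.

```python
from typing import Iterable, Optional


def auto_stop_index(
    speech_flags: Iterable[bool],
    frame_ms: int = 30,          # assumed frame size, not from the skill
    vad_silence_ms: int = 1500,  # 1.5 s silence threshold from the text
) -> Optional[int]:
    """Return the frame index at which recording would auto-stop.

    speech_flags holds one boolean per audio frame (True = speech detected).
    Recording stops once speech has been heard at least once and the
    trailing run of silent frames reaches vad_silence_ms.
    Returns None if that never happens.
    """
    heard_speech = False
    silent_ms = 0
    for index, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silent_ms = 0  # any speech resets the silence counter
        elif heard_speech:
            silent_ms += frame_ms
            if silent_ms >= vad_silence_ms:
                return index
    return None
```

With 30 ms frames, 50 consecutive silent frames after speech are needed to reach the threshold, so short pauses inside a sentence do not end the recording.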

Available MCP Tools

claude-listen (STT - Push-to-Talk with VAD)

Synchronous mode (blocking) - RECOMMENDED:

| Tool | Description |
| --- | --- |
| start_ptt_mode(key?, auto_stop?, vad_silence_ms?, auto_start?, echo_delay_ms?) | Start PTT mode. Use auto_stop=True, auto_start=True for seamless conversation |
| stop_ptt_mode() | Stop PTT mode |
| get_ptt_status() | Get PTT state (includes "auto_stop, auto_start" indicators if enabled) |
| get_segment_transcription(wait?, timeout?) | Wait for transcription (default timeout: 120s). Returns status: [Ready], [Recording...], [Transcribing...] |

Background mode (non-blocking) - Alternative:

| Tool | Description |
| --- | --- |
| start_ptt_background(key?) | Start PTT in background process |
| check_transcription() | Check for new transcription (non-blocking) |
| stop_ptt_background() | Stop background PTT |

claude-say (TTS)

| Tool | Description |
| --- | --- |
| speak(text, voice?, speed?) | Queue text, returns immediately (preferred for natural flow) |
| speak_and_wait(text, voice?, speed?) | Speak and wait for completion (use when expecting response) |
| stop_speaking() | Stop immediately |

TTS Backends

The TTS backend is configured in ~/.mcp-claude-say/.env:

| Backend | Description |
| --- | --- |
| macos | Native macOS say command (default, instant, offline) |
| kokoro | Kokoro MLX - 54 neural voices, 9 languages, runs locally on Apple Silicon |
| google | Google Cloud TTS - neural voices, requires API key |
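
As a rough sketch of how the backend setting might be read, the helper below parses .env-style text and returns the TTS_BACKEND value, falling back to macos. The function name and the parsing rules are assumptions for illustration. Only the variable name TTS_BACKEND, the three backend names, and the macos default come from this document.

```python
KNOWN_BACKENDS = {"macos", "kokoro", "google"}


def read_tts_backend(env_text: str, default: str = "macos") -> str:
    """Return the TTS_BACKEND value found in .env-style text.

    Hypothetical helper: skips blank lines and comments, accepts optional
    quotes around the value, and falls back to the default backend.
    """
    for raw_line in env_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "TTS_BACKEND":
            backend = value.strip().strip('"').strip("'").lower()
            if backend not in KNOWN_BACKENDS:
                raise ValueError(f"unknown TTS backend: {backend!r}")
            return backend
    return default
```

A line such as TTS_BACKEND=kokoro in ~/.mcp-claude-say/.env would then select the Kokoro backend, while an empty file keeps the macOS default.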

Kokoro Voices (if TTS_BACKEND=kokoro)

Pass voice ID as the voice parameter to use a specific voice:

| Language | Voice Examples |
| --- | --- |
| American English | af_heart (default), af_nova, am_adam, am_echo |
| British English | bf_emma, bf_alice, bm_george, bm_daniel |
| French | ff_siwis |
| Spanish | ef_dora, em_alex |
| Italian | if_sara, im_nicola |
| Portuguese | pf_dora, pm_alex |
| Japanese | jf_alpha, jm_kumo |
| Chinese | zf_xiaoxiao, zm_yunxi |
| Hindi | hf_alpha, hm_omega |

Example: speak("Bonjour!", voice="ff_siwis") for French.
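
The voice IDs listed above appear to follow a two-letter prefix pattern: the first letter for the language and the second for the voice gender. The sketch below decodes an ID under that assumption. The mapping is inferred from the examples in this document, and describe_voice is a hypothetical helper, not part of claude-say.

```python
# Prefix letters inferred from the voice examples in this document.
LANGUAGE_PREFIXES = {
    "a": "American English",
    "b": "British English",
    "f": "French",
    "e": "Spanish",
    "i": "Italian",
    "p": "Portuguese",
    "j": "Japanese",
    "z": "Chinese",
    "h": "Hindi",
}
GENDER_PREFIXES = {"f": "female", "m": "male"}  # assumed meaning of 2nd letter


def describe_voice(voice_id: str) -> tuple[str, str, str]:
    """Split an ID like 'ff_siwis' into (language, gender, name)."""
    prefix, separator, name = voice_id.partition("_")
    if not separator or len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice id format: {voice_id!r}")
    language = LANGUAGE_PREFIXES.get(prefix[0])
    gender = GENDER_PREFIXES.get(prefix[1])
    if language is None or gender is None:
        raise ValueError(f"unknown voice prefix: {prefix!r}")
    return language, gender, name
```

This can help pick a voice that matches the user's language before passing the ID as the voice parameter.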

When to use which TTS tool

IMPORTANT - Natural Speech Pattern:

  • speak(): Use for normal responses. One single speak() call with your complete answer is the default.
  • speak_and_wait(): ONLY use when you have a VERY LONG response broken into multiple parts. Put speak_and_wait() at the END to ensure all speech completes before listening.
  • Default speed: Always use speed=1 (1.0) for natural pacing.

Best practice - use speak() for normal responses:

python
# For typical responses, use ONE speak() call:
speak("I understand completely. The function you're looking for handles authentication and it's located in the auth module. It validates tokens and manages user sessions.", speed=1)

Only use speak_and_wait() for very long multi-part explanations:

python
# For very long responses that must be split:
speak("First part of a very detailed explanation that covers the initial concept.", speed=1)
speak("Second part that continues with more details.", speed=1)
speak_and_wait("Final part that concludes the explanation.", speed=1)  # Only the last one waits

Why this matters: speak() returns immediately without blocking. speak_and_wait() blocks until speech completes, which is only needed when breaking long responses into parts to ensure proper sequencing.
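
To make the difference concrete, here is a toy model of the queueing behavior described above. FakeTTS is a stand-in invented for this sketch. It is not the claude-say implementation, and the short sleep merely simulates audio playback.

```python
import queue
import threading
import time


class FakeTTS:
    """Toy model: speak() enqueues, speak_and_wait() drains the queue."""

    def __init__(self, seconds_per_item: float = 0.01) -> None:
        self._queue: "queue.Queue[str]" = queue.Queue()
        self._delay = seconds_per_item
        self.spoken: list[str] = []
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self) -> None:
        while True:
            text = self._queue.get()
            time.sleep(self._delay)   # stands in for audio playback
            self.spoken.append(text)
            self._queue.task_done()

    def speak(self, text: str) -> None:
        self._queue.put(text)         # returns immediately

    def speak_and_wait(self, text: str) -> None:
        self._queue.put(text)
        self._queue.join()            # blocks until every queued item played


tts = FakeTTS()
tts.speak("First part.")
tts.speak("Second part.")
tts.speak_and_wait("Final part.")     # only this call blocks
```

Because the queue preserves order, the final blocking call guarantees that all earlier parts have finished before listening resumes.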

Important: First Message Latency

The first message in a session may take 2-3 seconds longer than usual. This is normal and expected because:

  1. VAD Model Loading: The Silero Voice Activity Detection model loads on first use (~2MB)
  2. Audio Baseline Calibration: The system learns your ambient noise level

Subsequent messages will be much faster. This is a one-time delay per session.

How It Works

code
┌─────────────────────────────────────────────────────────┐
│    Seamless Conversation with Auto-Stop + Auto-Start    │
│                                                         │
│  /conversation → Welcome TTS                            │
│       │                                                 │
│  [TTS complete] → [400ms delay] → Auto-start recording  │
│       │                                                 │
│       │     🎤 Mic indicator in menu bar                │
│       │     (user speaks...)                            │
│       │                                                 │
│  [1.5s silence] → Auto-stop → Transcribe                │
│       │                                                 │
│       ↓                                                 │
│  Claude responds vocally (TTS)                          │
│       │                                                 │
│       ↓                                                 │
│  [TTS complete] → [400ms delay] → Auto-start recording  │
│       │                                                 │
│       │     (user speaks... loop continues!)            │
│       │                                                 │
└─────────────────────────────────────────────────────────┘
  1. User types /conversation to start
  2. Claude plays welcome message (TTS)
  3. After TTS → 400ms delay → recording auto-starts (mic indicator in menu bar)
  4. User speaks when mic is active
  5. VAD detects 1.5s of silence → auto-stops recording
  6. Audio is transcribed with the configured STT engine
  7. Claude processes and responds vocally
  8. After TTS completes → 400ms delay → auto-starts recording
  9. Conversation flows until user says "fin de session"
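
The steps above can be read as a small state machine. The sketch below is one possible encoding. The state and event names are invented for illustration and do not correspond to anything exposed by claude-listen or claude-say.

```python
from enum import Enum, auto


class State(Enum):
    SPEAKING = auto()      # Claude's TTS is playing
    ECHO_DELAY = auto()    # 400 ms pause so the mic does not hear the TTS
    RECORDING = auto()     # mic indicator visible, user speaks
    TRANSCRIBING = auto()  # audio handed to the STT engine
    STOPPED = auto()       # session ended


# (current state, event) -> next state; event names are hypothetical.
TRANSITIONS = {
    (State.SPEAKING, "tts_complete"): State.ECHO_DELAY,
    (State.ECHO_DELAY, "delay_elapsed"): State.RECORDING,
    (State.RECORDING, "silence_detected"): State.TRANSCRIBING,
    (State.TRANSCRIBING, "transcript_ready"): State.SPEAKING,
}


def next_state(state: State, event: str) -> State:
    """Advance the conversation loop by one event."""
    if event == "end_session":
        return State.STOPPED
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not valid in {state.name}") from None
```

One full turn walks SPEAKING, ECHO_DELAY, RECORDING, TRANSCRIBING and back to SPEAKING, which matches the loop in the diagram.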

Starting Conversation Mode

python
# 1. Start PTT mode with VAD auto-stop AND auto-start for seamless conversation
start_ptt_mode(auto_stop=True, auto_start=True)

# 2. Welcome message - recording starts AUTOMATICALLY after TTS completes!
# IMPORTANT: Use the user's language! Include first-message latency notice.
# With auto_start=True, recording begins right after speak_and_wait() - NO KEY PRESS NEEDED!
# The mic indicator appears in the menu bar when recording is active.
# Examples:
# - English: "Ready. Speak when the mic activates. The first message may take a moment."
# - French: "Prêt. Parle quand le micro s'active. Le premier message peut prendre un moment."
speak_and_wait("Ready. Speak when the mic activates. The first message may take a moment.")
# Recording auto-starts immediately after TTS completes - mic indicator shows in menu bar

# 3. Wait for transcription (auto-stops when silence detected)
transcription = get_segment_transcription(wait=True, timeout=120)

# 4. Process and respond (use speak() for natural flow, speak_and_wait() at the end)
speak("Here's what I found.")
speak("The first point is this.")
speak_and_wait("What would you like to know next?")  # After this, recording auto-starts!

# 5. Loop back to step 3 - fully automatic flow, no key presses!

Conversation Loop

python
# Start with VAD auto-stop AND auto-start for seamless conversation
start_ptt_mode(auto_stop=True, auto_start=True)

# Welcome message - recording starts automatically after TTS!
# French: "Prêt. Parle quand le micro s'active. Le premier message peut prendre un moment."
# English: "Ready. Speak when the mic activates. The first message may take a moment."
speak_and_wait("Ready. Speak when the mic activates. The first message may take a moment.")
# Recording auto-starts after TTS - mic indicator appears in menu bar!

# Main loop - fully automatic, no key presses needed!
while True:
    # Wait for transcription (VAD auto-stops when user finishes speaking)
    text = get_segment_transcription(wait=True, timeout=120)

    # Check for end command
    if "fin de session" in text.lower():
        break

    # Check for timeout
    if "Timeout" in text:
        speak_and_wait("Tu es toujours là?")  # Recording auto-starts after this!
        continue

    # Process and respond - use speak() for flow, speak_and_wait() at end
    speak("I understand your question.")
    speak("Let me explain.")
    speak_and_wait("Does that make sense?")  # After this, recording auto-starts!
    # Conversation flows naturally - no key presses at all!

# End session
stop_ptt_mode()
speak_and_wait("Désactivé.")
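
The checks inside the loop can also be factored into a single classifier, which makes each branch easier to test. This is a sketch under assumptions: classify_result is a hypothetical helper, and the exact strings returned by get_segment_transcription are not specified here beyond the bracketed status values and the "Timeout" substring used above.

```python
def classify_result(text: str, end_phrase: str = "fin de session") -> str:
    """Classify a transcription result as one of:
    'empty', 'end', 'timeout', 'status', or 'speech'.

    Order follows the loop above: end command first, then timeout.
    """
    stripped = text.strip()
    if not stripped:
        return "empty"
    if end_phrase in stripped.lower():
        return "end"
    if "Timeout" in stripped:
        return "timeout"
    if stripped.startswith("["):
        return "status"   # e.g. [Ready], [Recording...], [Transcribing...]
    return "speech"
```

The main loop can then branch on the returned label instead of repeating substring checks inline.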

Ending Conversation Mode

When user says "fin de session" (or similar):

python
stop_ptt_mode()
speak_and_wait("Désactivé.")

Background Mode (Non-Blocking) - Alternative

Background mode uses polling instead of blocking. Use this if you need Claude to do other tasks while waiting for speech.

Starting Background Mode

python
# 1. Start background PTT
start_ptt_background()  # Returns immediately

# 2. Confirm vocally
speak_and_wait("Prêt.")

# 3. Poll for transcriptions (non-blocking)
result = check_transcription()
# Returns: transcription text, or status like "[Ready...]", "[Recording...]"

Background Conversation Loop

python
import time

while True:
    # Non-blocking check
    result = check_transcription()

    # Check if it's actual transcription (not status message)
    if not result.startswith("["):
        # Got real transcription!
        if "fin de session" in result.lower():
            break

        # Respond
        speak_and_wait(f"Tu as dit: {result}")

    # Small delay before next check
    time.sleep(0.5)

# End session
stop_ptt_background()
speak_and_wait("Désactivé.")
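
The loop above polls forever. A bounded variant might look like the sketch below, which gives up after a deadline. poll_for_speech is a hypothetical helper: the check argument stands in for check_transcription, and the injectable sleep and clock parameters exist only to make the function testable.

```python
import time
from typing import Callable, Optional


def poll_for_speech(
    check: Callable[[], str],
    timeout_s: float = 120.0,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Poll check() until it returns real text or the deadline passes.

    Bracketed values such as "[Recording...]" are treated as status
    messages and skipped. Returns None on timeout.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        result = check()
        if result and not result.startswith("["):
            return result
        sleep(interval_s)
    return None
```

Returning None on timeout lets the caller decide whether to prompt the user again or end the session.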

When to use Background Mode

  • When you need Claude to perform other tasks while waiting
  • When synchronous mode times out frequently
  • Note: Creates more visible tool calls in the interface

Important Rules

  1. Use speak() for natural flow - Queue multiple sentences without blocking
  2. Use speak_and_wait() at the end - Only when you need to wait for user response
  3. No code vocally - Never read code, paths, or logs aloud
  4. Match language - Respond in the same language as the user
  5. Detailed responses by default - Give thorough, complete explanations naturally. Technical topics, concepts, and questions deserve full answers. Don't artificially shorten responses.
  6. Execute directly - Don't announce actions, just do them and report results
  7. Minimal activation messages - Use ONE word only for activation ("Ready", "Prêt", etc.) and deactivation ("Disabled", "Désactivé", etc.) in the user's language
  8. Show visual content proactively - When explaining concepts, processes, or technical topics, don't hesitate to display diagrams, tables, code snippets, or structured lists on screen. Voice mode doesn't mean text-only - use the screen as a visual aid. If something would be clearer with a diagram or example, show it while explaining verbally.

Error Handling

  • If timeout (no speech): speak_and_wait("Tu es toujours là?")
  • If transcription unclear: speak_and_wait("Je n'ai pas compris, peux-tu répéter?")
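
Since responses should match the user's language, the recovery prompts can be kept in a small lookup table. The sketch below is illustrative only: error_prompt is a hypothetical helper, the French phrases come from the list above, and the English phrases are assumed translations.

```python
# French phrases are from this document; English ones are assumed translations.
PROMPTS = {
    "timeout": {
        "fr": "Tu es toujours là?",
        "en": "Are you still there?",
    },
    "unclear": {
        "fr": "Je n'ai pas compris, peux-tu répéter?",
        "en": "I didn't catch that, could you repeat?",
    },
}


def error_prompt(kind: str, language: str = "fr") -> str:
    """Return the recovery phrase for an error kind in the given language.

    Falls back to English when the language has no entry.
    """
    by_language = PROMPTS[kind]
    return by_language.get(language, by_language["en"])
```

The returned phrase would then be passed to speak_and_wait() so that recording resumes once it has been spoken.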

Available Keys for PTT

| Key | Name |
| --- | --- |
| cmd_r | Right Command (default, recommended) |
| cmd_l+s | Left Command + S |
| cmd_r+m | Right Command + M |
| cmd_l | Left Command |
| alt_r | Right Option |
| alt_l | Left Option |
| ctrl_r | Right Control |
| f13, f14, f15 | Function keys |