voice-setup

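Add free voice functionality (TTS + STT) to OpenClaw using Edge TTS and whisper-cpp. Use this skill when: (1) adding voice or audio capabilities to an application; (2) setting up speech-to-text transcription; (3) configuring text-to-speech synthesis; (4) enabling voice messages on Telegram/WhatsApp; (5) a user asks about free TTS/STT solutions that need no API key.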

SKILL.md
--- frontmatter
name: voice-setup
description: Set up free voice functionality (TTS + STT) for OpenClaw using Edge TTS and whisper-cpp. Use when: (1) User wants to add voice/audio capabilities, (2) Setting up speech-to-text transcription, (3) Configuring text-to-speech synthesis, (4) Enabling voice messages on Telegram/WhatsApp, (5) User asks about free TTS/STT solutions without API keys.

Voice Setup Skill

This skill configures free voice capabilities for OpenClaw: Edge TTS for speech synthesis and whisper-cpp for local transcription. Neither requires an API key.

Overview

| Component | Solution | Cost |
|---|---|---|
| TTS (Text-to-Speech) | Edge TTS (Microsoft) | Free, no API key |
| STT (Speech-to-Text) | whisper-cpp | Free, local processing |

Prerequisites

Check for and install the dependencies. The commands below assume Homebrew is available:

bash
# Check for Homebrew (command -v is the portable way to test for a command)
command -v brew >/dev/null || echo "Install Homebrew first: https://brew.sh"

# Install whisper-cpp (STT)
brew install whisper-cpp

# Install ffmpeg (audio conversion)
brew install ffmpeg

# Download a whisper model (base is the recommended balance of speed and accuracy)
mkdir -p ~/.openclaw/models
# --fail makes curl exit non-zero on an HTTP error instead of saving the error page as the model
curl --fail -L -o ~/.openclaw/models/ggml-base.bin \
  "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.bin"

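Before moving on to configuration, it can help to confirm that everything installed above is actually in place. The following is a minimal sketch of such a check; `check_voice_deps` is a hypothetical helper name, not an OpenClaw command, and the model path simply mirrors the download location used above.

```shell
# Sketch: report which prerequisites are missing.
# check_voice_deps is an illustrative name, not part of OpenClaw.
# Usage: check_voice_deps MODEL_PATH COMMAND...
check_voice_deps() {
  local model="$1"; shift
  local missing=0 cmd
  for cmd in "$@"; do
    # command -v succeeds only if the command can be found
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing command: $cmd"
      missing=1
    fi
  done
  # -s is true only for a file that exists and is not empty
  if [ ! -s "$model" ]; then
    echo "missing or empty model file: $model"
    missing=1
  fi
  return "$missing"
}
```

A typical call would be `check_voice_deps ~/.openclaw/models/ggml-base.bin ffmpeg whisper-cli`, which prints nothing and returns 0 when all prerequisites are present.
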
Configuration

Apply this config patch to enable voice:

json
{
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "edge",
      "edge": {
        "enabled": true,
        "voice": "zh-CN-XiaoxiaoNeural",
        "lang": "zh-CN",
        "outputFormat": "audio-24khz-48kbitrate-mono-mp3"
      }
    }
  },
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 20971520,
        "models": [
          {
            "type": "cli",
            "command": "bash",
            "args": ["-c", "ffmpeg -i '{{MediaPath}}' -ar 16000 -ac 1 /tmp/whisper-input.wav -y >/dev/null 2>&1 && whisper-cli -m ~/.openclaw/models/ggml-base.bin -l auto -f /tmp/whisper-input.wav 2>/dev/null | grep -E '^\\[' | sed 's/\\[.*\\] //'"],
            "timeoutSeconds": 120
          }
        ]
      }
    }
  }
}
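
The configured command converts every message to the same `/tmp/whisper-input.wav`, so two voice messages processed at the same time could overwrite each other's input. One possible way around this is a small wrapper that works in a private temporary directory. The sketch below reuses the ffmpeg and whisper-cli flags from the config above; the function names and the `WHISPER_MODEL`, `FFMPEG_BIN`, and `WHISPER_BIN` variables are assumptions made for illustration, not OpenClaw settings.

```shell
# Sketch only: names and environment variables here are hypothetical.

# Keep only timestamped transcript lines and drop the leading
# "[start --> end]" prefix plus the whitespace that follows it.
strip_timestamps() {
  grep -E '^\[' | sed -E 's/^\[[^]]*\][[:space:]]*//'
}

# Convert one audio file to 16 kHz mono WAV in a private temp directory,
# then transcribe it. Returns non-zero if the conversion fails; otherwise
# returns the status of the filter pipeline.
transcribe_audio() {
  local input="$1"
  local model="${WHISPER_MODEL:-$HOME/.openclaw/models/ggml-base.bin}"
  local workdir status=0
  workdir="$(mktemp -d)" || return 1
  "${FFMPEG_BIN:-ffmpeg}" -i "$input" -ar 16000 -ac 1 "$workdir/input.wav" -y >/dev/null 2>&1 &&
    "${WHISPER_BIN:-whisper-cli}" -m "$model" -l auto -f "$workdir/input.wav" 2>/dev/null | strip_timestamps
  status=$?
  rm -rf "$workdir"   # clean up whether or not transcription succeeded
  return "$status"
}
```

Saved as a script that ends with `transcribe_audio "$1"`, this could stand in for the inline `bash -c` pipeline, with `{{MediaPath}}` passed as the first argument.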

TTS Voice Options

Chinese Voices

  • zh-CN-XiaoxiaoNeural - Female, lively (recommended)
  • zh-CN-YunxiNeural - Male, natural
  • zh-CN-XiaoyiNeural - Female, gentle

English Voices

  • en-US-JennyNeural - Female, warm
  • en-US-GuyNeural - Male, natural
  • en-GB-SoniaNeural - British female

Other Languages

  • ja-JP-NanamiNeural - Japanese female
  • ko-KR-SunHiNeural - Korean female
  • de-DE-KatjaNeural - German female

List all voices: npx node-edge-tts --list-voices
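
If the voice should follow a language code rather than being fixed, the voices listed above can be picked with a simple lookup. This is only a sketch: `voice_for_lang` is a hypothetical helper, the mapping covers just the voices named in this document, and falling back to the recommended voice is an assumption.

```shell
# Sketch: map a language code to one of the voices listed above.
# voice_for_lang is an illustrative name, not an OpenClaw or node-edge-tts command.
voice_for_lang() {
  case "$1" in
    zh*)    echo "zh-CN-XiaoxiaoNeural" ;;
    en-GB*) echo "en-GB-SoniaNeural" ;;   # must come before the generic en* pattern
    en*)    echo "en-US-JennyNeural" ;;
    ja*)    echo "ja-JP-NanamiNeural" ;;
    ko*)    echo "ko-KR-SunHiNeural" ;;
    de*)    echo "de-DE-KatjaNeural" ;;
    *)      echo "zh-CN-XiaoxiaoNeural" ;; # assumed fallback: the recommended voice
  esac
}
```

For example, `voice_for_lang en-GB` prints `en-GB-SoniaNeural`, which can then be used as the `voice` value in the config.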

TTS Modes

| Mode | Behavior |
|---|---|
| off | TTS disabled |
| inbound | Reply with voice only if user sent voice (recommended) |
| always | Always reply with voice |
| tagged | Only when reply contains [[tts]] tags |
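
The four modes amount to a small decision rule. The sketch below restates the mode descriptions as a shell function to make the differences concrete; `should_speak` and its arguments are hypothetical and do not reflect how OpenClaw implements the `auto` setting internally.

```shell
# Sketch: decide whether a reply should be spoken, following the mode descriptions.
# Arguments: MODE, whether the inbound message was voice ("yes"/"no"), REPLY_TEXT.
# Returns 0 for "speak", 1 for "stay silent", 2 for an unknown mode.
should_speak() {
  local mode="$1" inbound_is_voice="$2" reply="$3"
  case "$mode" in
    off)     return 1 ;;
    always)  return 0 ;;
    inbound) [ "$inbound_is_voice" = "yes" ] ;;
    tagged)
      # The quoted pattern matches the literal text [[tts]] anywhere in the reply
      case "$reply" in
        *'[[tts]]'*) return 0 ;;
        *)           return 1 ;;
      esac ;;
    *)
      echo "unknown TTS mode: $mode" >&2
      return 2 ;;
  esac
}
```

With `inbound`, a text message gets a text reply and a voice message gets a voice reply, which is why it is the recommended default.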

Whisper Models

| Model | Size | Speed | Accuracy |
|---|---|---|---|
| ggml-tiny.bin | 75 MB | Fastest | Basic |
| ggml-base.bin | 142 MB | Fast | Good (recommended) |
| ggml-small.bin | 466 MB | Medium | Better |
| ggml-medium.bin | 1.5 GB | Slow | Best |
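
Switching models means downloading a different file and pointing the `-m` argument in the config at it. The sketch below generalizes the download command from the Prerequisites section to the four models in the table; the function names are hypothetical, and it assumes the other model files sit at the same Hugging Face path as `ggml-base.bin`.

```shell
# Sketch: build the download URL for one of the models in the table.
# whisper_model_url and download_whisper_model are illustrative names.
whisper_model_url() {
  case "$1" in
    tiny|base|small|medium)
      echo "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-$1.bin" ;;
    *)
      echo "unsupported model: $1" >&2
      return 1 ;;
  esac
}

# Download a model into DIR (default: ~/.openclaw/models).
# --fail makes curl return non-zero on an HTTP error.
download_whisper_model() {
  local name="$1" dir="${2:-$HOME/.openclaw/models}" url
  url="$(whisper_model_url "$name")" || return 1
  mkdir -p "$dir" &&
    curl --fail -L -o "$dir/ggml-$name.bin" "$url"
}
```

For example, `download_whisper_model small` would fetch `ggml-small.bin` into `~/.openclaw/models`.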

Testing

Test TTS

bash
npx node-edge-tts -t '你好!语音测试' -v 'zh-CN-XiaoxiaoNeural' -f /tmp/test.mp3

Test STT

bash
ffmpeg -i input.ogg -ar 16000 -ac 1 /tmp/test.wav -y
whisper-cli -m ~/.openclaw/models/ggml-base.bin -l auto -f /tmp/test.wav

Troubleshooting

"whisper-cli not found"

bash
brew install whisper-cpp

"ffmpeg not found"

bash
brew install ffmpeg

Audio file format error

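Telegram sends voice messages as OGG/Opus, while this setup feeds whisper-cli 16 kHz mono WAV. The ffmpeg step in the config performs that conversion automatically, so a format error usually means ffmpeg is missing or the conversion failed.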
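
When it is unclear what kind of file actually reached the transcription step, looking at its first bytes gives a rough answer. The sketch below is a heuristic under stated assumptions, not a full format probe: `audio_container` is a hypothetical helper, and it only distinguishes the Ogg capture pattern `OggS` from the `RIFF` header used by WAV files.

```shell
# Sketch: guess the container of an audio file from its first four bytes.
# audio_container is an illustrative name, not an OpenClaw or ffmpeg command.
audio_container() {
  local magic
  # tr strips NUL bytes so the shell does not warn about them in $(...)
  magic="$(head -c 4 "$1" | tr -d '\0')"
  case "$magic" in
    OggS) echo "ogg" ;;      # Ogg pages start with the capture pattern "OggS"
    RIFF) echo "wav" ;;      # RIFF header, as used by WAV files
    *)    echo "unknown" ;;
  esac
}
```

A result of `ogg` for the file passed to whisper-cli would suggest that the ffmpeg conversion step was skipped.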

Model not found

bash
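# --fail makes curl exit non-zero on an HTTP error instead of saving the error page as the model
mkdir -p ~/.openclaw/models
curl --fail -L -o ~/.openclaw/models/ggml-base.bin \
  "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.bin"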