AgentSkillsCN

pocket-tts

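Read text aloud using local TTS. Use this skill when the user wants to hear text read aloud, spoken, voiced, narrated, or rendered as audio. It is triggered by requests such as "read this aloud", "say this", "speak", "say it out loud", "tell me verbally", "narrate", "voice this", "read to me", "TTS", or any other request to render text as audio.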

SKILL.md
--- frontmatter
name: pocket-tts
description: >
  Speak text aloud using local TTS. Use when the user wants to hear something read out loud, spoken, voiced, narrated, or audibly rendered. Triggers on: "read this aloud", "say this", "speak", "out loud", "tell me [verbally]", "narrate", "voice this", "hear this", "read to me", "TTS", or any request to audibly render text.

Pocket TTS

Local text-to-speech via the pocket-tts server. Audio is streamed, so playback starts with low latency. macOS only, because the fallback player is afplay.

Prerequisites: run pip install pocket-tts and brew install ffmpeg (ffmpeg provides ffplay).

Quick Reference

bash
# Ensure server is running (do this first)
curl -s http://localhost:8321/health > /dev/null 2>&1 || {
  pocket-tts serve --voice ~/.config/pocket-tts/default-voice.wav --port 8321 > /dev/null 2>&1 &
  sleep 4
}

# Speak with streaming playback (audio starts immediately)
curl -s -X POST http://localhost:8321/tts -F "text=Hello world" -o - | ffplay -nodisp -autoexit -loglevel quiet -

# Or with temp file (if ffplay unavailable)
curl -s -X POST http://localhost:8321/tts -F "text=Hello world" -o /tmp/speak.wav && afplay /tmp/speak.wav && rm /tmp/speak.wav
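The fixed sleep 4 above is a guess at how long the model takes to load. As a minimal sketch, the startup step could poll the health endpoint instead. The helper names wait_until, server_is_up, and ensure_server and the 15-second timeout are assumptions for illustration, not part of pocket-tts. The -f flag makes curl fail on HTTP error responses.

```shell
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Retries COMMAND about once per second until it succeeds or the timeout elapses.
wait_until() {
  local timeout_s elapsed
  timeout_s=$1
  shift
  elapsed=0
  until "$@" > /dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}

# Succeeds when the health endpoint from this section answers.
server_is_up() {
  curl -sf http://localhost:8321/health > /dev/null 2>&1
}

# Hypothetical helper: start the server only if it is not already answering,
# then wait up to 15 s (an assumed limit) for it to come up.
ensure_server() {
  server_is_up && return 0
  pocket-tts serve --voice ~/.config/pocket-tts/default-voice.wav --port 8321 > /dev/null 2>&1 &
  wait_until 15 server_is_up
}
```

Calling ensure_server before the first request returns as soon as the server answers, and returns non-zero if it never does.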

Architecture

Always use the server: it keeps the model and voice embedding warm in memory.

  • Port: 8321
  • Default voice: ~/.config/pocket-tts/default-voice.wav (loaded once at server start)
  • Streaming: /tts returns chunked WAV. Pipe to ffplay for immediate playback during generation.
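The points above can be wrapped in one shell function. This is a sketch under assumptions: pick_player and speak are hypothetical names, and the server is assumed to be running already. It streams through ffplay when available and otherwise falls back to a temporary file played with afplay. It uses curl's --form-string instead of -F so that text beginning with @ or < is sent literally rather than read from a file.

```shell
TTS_URL="http://localhost:8321/tts"   # port taken from this section

# Print the first available player, or fail if neither is installed.
pick_player() {
  if command -v ffplay > /dev/null 2>&1; then
    echo ffplay
  elif command -v afplay > /dev/null 2>&1; then
    echo afplay
  else
    return 1
  fi
}

# speak TEXT: stream to ffplay when possible, else play a temp file with afplay.
speak() {
  local text player tmpdir status
  text=$1
  player=$(pick_player) || { echo "speak: no audio player found" >&2; return 1; }
  if [ "$player" = ffplay ]; then
    curl -s -X POST "$TTS_URL" --form-string "text=$text" -o - |
      ffplay -nodisp -autoexit -loglevel quiet -
  else
    tmpdir=$(mktemp -d "${TMPDIR:-/tmp}/speak.XXXXXX") || return 1
    curl -s -X POST "$TTS_URL" --form-string "text=$text" -o "$tmpdir/speak.wav" &&
      afplay "$tmpdir/speak.wav"
    status=$?
    rm -rf "$tmpdir"   # clean up even when playback failed
    return "$status"
  fi
}
```

With the server up, speak "Hello world" plays the sentence through whichever player was found.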

Changing Voices

Per request (the server keeps the default voice warm but can generate with another voice passed as voice_url):

bash
curl -s -X POST http://localhost:8321/tts -F "text=Hello" -F "voice_url=jean" -o - | ffplay -nodisp -autoexit -loglevel quiet -

Built-in voices: alba, marius, javert, jean, fantine, cosette, eponine, azelma

Custom: pass any http://, https://, or hf:// URL as voice_url.

To change the default, restart the server with a different --voice.
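A small client-side check can catch a mistyped voice before a request is sent. The sketch below mirrors only the values listed in this section. is_supported_voice is a hypothetical helper, and whether the server accepts anything beyond these values is not confirmed here.

```shell
# Succeeds for the built-in voice names and for http://, https://, or hf:// URLs.
is_supported_voice() {
  case "$1" in
    alba|marius|javert|jean|fantine|cosette|eponine|azelma) return 0 ;;
    http://?*|https://?*|hf://?*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A caller might guard the request with is_supported_voice "$voice" || echo "unknown voice: $voice" >&2.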

Creating Custom Voices

bash
# Extract 30s clip from source (pocket-tts truncates to 30s anyway)
ffmpeg -y -ss START_SECONDS -t 30 -i input.mp3 -ar 24000 -ac 1 ~/.config/pocket-tts/default-voice.wav
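The extraction command can be wrapped so that a missing input file or a non-numeric start offset fails before ffmpeg runs. This is a sketch: make_voice_clip and is_seconds are hypothetical names, and the optional third argument for the output path is an assumption. The ffmpeg flags are the ones shown above.

```shell
# Succeeds when the argument looks like a non-negative number of seconds, e.g. 12 or 12.5.
is_seconds() {
  case "$1" in
    ''|.|*[!0-9.]*|*.*.*) return 1 ;;
    *) return 0 ;;
  esac
}

# make_voice_clip INPUT START_SECONDS [OUTPUT]
# Writes a 30 s, 24 kHz mono clip; OUTPUT defaults to the default voice path.
make_voice_clip() {
  local input start out
  input=$1
  start=$2
  out=${3:-$HOME/.config/pocket-tts/default-voice.wav}
  [ -f "$input" ] || { echo "make_voice_clip: no such file: $input" >&2; return 1; }
  is_seconds "$start" || { echo "make_voice_clip: start must be in seconds" >&2; return 1; }
  mkdir -p "$(dirname "$out")" || return 1
  ffmpeg -y -ss "$start" -t 30 -i "$input" -ar 24000 -ac 1 "$out"
}
```

For example, make_voice_clip input.mp3 45 replaces the default voice with the 30 seconds that start at 0:45. The server must be restarted to pick up the new default.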

Troubleshooting

Server not responding: Check whether the server process has died, then restart it with the pocket-tts serve command from Quick Reference.

Slow first response: The server needs about 4 s to load the model when it first starts.

No audio: Ensure ffplay (from ffmpeg) or afplay (macOS built-in) is available
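The checks above can be run in one pass. The following is a rough sketch, with report_cmd and tts_doctor as hypothetical helper names. It reports which tools are on PATH and whether the health endpoint on port 8321 answers.

```shell
# Print "ok: NAME" if NAME is on PATH, otherwise "missing: NAME" and fail.
report_cmd() {
  if command -v "$1" > /dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Run every check and return non-zero if any of them failed.
tts_doctor() {
  local failures
  failures=0
  report_cmd curl || failures=$((failures + 1))
  report_cmd pocket-tts || failures=$((failures + 1))
  # Only one of the two players is required.
  report_cmd ffplay || report_cmd afplay || failures=$((failures + 1))
  if curl -s http://localhost:8321/health > /dev/null 2>&1; then
    echo "ok: server answering on port 8321"
  else
    echo "down: server not answering on port 8321"
    failures=$((failures + 1))
  fi
  [ "$failures" -eq 0 ]
}
```

Running tts_doctor prints one line per check, which makes it easy to see whether the problem is a missing player or a stopped server.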