Pocket TTS

Local text-to-speech via a pocket-tts server. Audio is streamed, so playback starts with low latency. macOS only (the fallback player is afplay).

Prerequisites: pip install pocket-tts and brew install ffmpeg (ffmpeg provides ffplay)

Quick Reference

bash

# Ensure server is running (do this first)
curl -s http://localhost:8321/health > /dev/null 2>&1 || {
  pocket-tts serve --voice ~/.config/pocket-tts/default-voice.wav --port 8321 > /dev/null 2>&1 &
  sleep 4
}

# Speak with streaming playback (audio starts immediately)
curl -s -X POST http://localhost:8321/tts -F "text=Hello world" -o - | ffplay -nodisp -autoexit -loglevel quiet -

# Or with a temp file (if ffplay is unavailable); the file is removed even if playback fails
curl -s -X POST http://localhost:8321/tts -F "text=Hello world" -o /tmp/speak.wav && afplay /tmp/speak.wav; rm -f /tmp/speak.wav
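
The fixed sleep 4 in the startup snippet can be replaced by polling the health endpoint, which returns as soon as the server answers and reports failure if it never does. A minimal sketch: wait_until_ready is a hypothetical helper name, and the attempt count and delay in the usage comment are assumptions, not pocket-tts defaults.

```shell
# Poll a readiness command until it succeeds or the attempts run out.
# Usage: wait_until_ready MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# (hypothetical helper, not part of pocket-tts)
wait_until_ready() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Run the probe command; any zero exit status counts as "ready".
    if "$@" > /dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  return 1
}

# Example (not run here): wait up to about 10 s for the server.
# wait_until_ready 20 0.5 curl -s http://localhost:8321/health
```

Passing the probe as arguments keeps the loop independent of curl, so the same helper works with any readiness check.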

Architecture

Always use the server — it keeps the model and voice embedding warm in memory.

• Port: 8321
• Default voice: ~/.config/pocket-tts/default-voice.wav (loaded once at server start)
• Streaming: /tts returns chunked WAV. Pipe to ffplay for immediate playback during generation.
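
The port and streaming endpoint above can be wrapped in a small shell function so callers do not repeat the full pipeline. This is a sketch under assumptions: speak and tts_url are hypothetical names, and POCKET_TTS_PORT is an assumed override variable, not a pocket-tts setting. It uses curl's --form-string instead of -F so that text beginning with @ or < is sent literally rather than treated as a file reference.

```shell
# Build the /tts URL; POCKET_TTS_PORT is an assumed override, default 8321.
tts_url() {
  printf 'http://localhost:%s/tts\n' "${POCKET_TTS_PORT:-8321}"
}

# Stream speech for the text given as the first argument.
speak() {
  # --form-string sends the value literally (no @file or <file handling).
  curl -s -X POST "$(tts_url)" --form-string "text=$1" -o - |
    ffplay -nodisp -autoexit -loglevel quiet -
}

# Usage (requires the server to be running):
# speak "Hello world"
```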

Changing Voices

Per request: pass voice_url to use a different voice for a single call. The server keeps the default voice warm but can generate with others:

bash

curl -s -X POST http://localhost:8321/tts -F "text=Hello" -F "voice_url=jean" -o - | ffplay -nodisp -autoexit -loglevel quiet -

Built-in voices: alba, marius, javert, jean, fantine, cosette, eponine, azelma

Custom: any http://, https://, or hf:// URL, passed as voice_url

To change the default, restart the server with a different --voice.
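
Because a voice can be either a built-in name or a URL, a script may want to catch typos before sending a request. The sketch below classifies a value using only the names and URL schemes listed above; voice_kind is a hypothetical helper, and how the server itself handles an unrecognized value is not covered here.

```shell
# Classify a voice_url value. Prints "builtin", "url", or "unknown".
# (hypothetical helper; the lists mirror this document, not the server code)
voice_kind() {
  case "$1" in
    alba|marius|javert|jean|fantine|cosette|eponine|azelma) echo builtin ;;
    http://*|https://*|hf://*) echo url ;;
    *) echo unknown ;;
  esac
}

# Usage:
# voice_kind jean    # prints "builtin"
```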

Creating Custom Voices

bash

# Extract a 30 s clip from the source audio (pocket-tts truncates to 30 s anyway).
# Replace START_SECONDS with the offset where the clip should begin.
ffmpeg -y -ss START_SECONDS -t 30 -i input.mp3 -ar 24000 -ac 1 ~/.config/pocket-tts/default-voice.wav
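
The same extraction can be wrapped in a function that checks the input and creates the destination directory first. A sketch only: make_voice_clip is a hypothetical name, and the default output path simply reuses the default voice location from this document.

```shell
# Cut a 30-second, 24 kHz mono clip for use as a voice sample.
# Usage: make_voice_clip INPUT START_SECONDS [OUTPUT]
# (hypothetical helper, not part of pocket-tts)
make_voice_clip() {
  input=$1
  start=$2
  output=${3:-"$HOME/.config/pocket-tts/default-voice.wav"}
  if [ ! -f "$input" ]; then
    echo "make_voice_clip: input not found: $input" >&2
    return 1
  fi
  # Create the destination directory if it does not exist yet.
  mkdir -p "$(dirname "$output")" || return 1
  ffmpeg -y -ss "$start" -t 30 -i "$input" -ar 24000 -ac 1 "$output"
}

# Usage (requires ffmpeg):
# make_voice_clip input.mp3 12
```

Restart the server afterwards if the clip replaces the default voice, since the default is loaded once at server start.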

Troubleshooting

Server not responding: check whether the process died, then restart it with the pocket-tts serve command

Slow first response: the server needs about 4 s to load the model after it starts

No audio: Ensure ffplay (from ffmpeg) or afplay (macOS built-in) is available
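
To diagnose the no-audio case quickly, a script can check which player is on the PATH before attempting playback. A minimal sketch; pick_player is a hypothetical helper that prefers ffplay (streaming) and falls back to afplay.

```shell
# Print the first available audio player, or fail if neither is installed.
# (hypothetical helper; checks only that the command exists on PATH)
pick_player() {
  for player in ffplay afplay; do
    if command -v "$player" > /dev/null 2>&1; then
      echo "$player"
      return 0
    fi
  done
  echo "no audio player found: install ffmpeg (for ffplay) or use macOS afplay" >&2
  return 1
}

# Usage:
# pick_player || exit 1
```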