speak - Talk to your Claude!
Give your agent the ability to speak to you in real time. Local text-to-speech (TTS), voice cloning, and audio generation on Apple Silicon.
Prerequisites
| Requirement | Check | Install |
|---|---|---|
| Apple Silicon Mac | uname -m → arm64 | Intel not supported |
| macOS 12.0+ | sw_vers | - |
| sox | which sox | brew install sox |
| ffmpeg | which ffmpeg | brew install ffmpeg |
| poppler (PDF) | which pdftotext | brew install poppler |
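The checks in the table can be run in one pass. The following is a minimal sketch; `missing_tools` and `is_apple_silicon` are hypothetical helper names used for illustration and are not part of speak.

```shell
# missing_tools: print each named command that is not on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# is_apple_silicon: succeed only when the machine reports arm64.
is_apple_silicon() {
  [ "$(uname -m)" = "arm64" ]
}
```

Running `missing_tools sox ffmpeg pdftotext` prints nothing when all three are installed; any name it prints can be installed with the matching `brew install` command from the table.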
Input Sources
| Source | Example |
|---|---|
| Text file | speak article.txt |
| Markdown | speak doc.md |
| Direct string | speak "Hello" |
| Clipboard | pbpaste \| speak |
| Stdin | cat file.txt \| speak |
Web Articles
lynx -dump -nolist "https://example.com/article" | speak --output article.wav
Converting Formats
| Format | Convert Command |
|---|---|
| PDF | pdftotext doc.pdf doc.txt |
| DOCX | textutil -convert txt doc.docx |
| HTML | pandoc -f html -t plain doc.html > doc.txt |
Output Modes
| Goal | Command |
|---|---|
| Save for later | speak text.txt --output file.wav |
| Listen now (streaming) | speak text.txt --stream |
| Listen now (complete) | speak text.txt --play |
| Both | speak text.txt --stream --output file.wav |
Default Behavior
    speak article.txt   # → ~/Audio/speak/article.wav (no playback)
    speak "Hello"       # → ~/Audio/speak/speak_<timestamp>.wav
Directory Auto-Creation
| Directory | Auto-Created? |
|---|---|
| ~/Audio/speak/ | ✓ Yes |
| ~/.chatter/voices/ | ✗ No |
| Custom directories | ✗ No |
Always create custom directories first:
    mkdir -p ~/.chatter/voices/
    mkdir -p ~/Audio/custom/
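Because custom directories are not created automatically, a wrapper can create the parent directory of any output path before generation. This is a small sketch; `ensure_parent_dir` is a hypothetical helper, not a speak command.

```shell
# ensure_parent_dir PATH: create the directory that will contain PATH.
# mkdir -p is a no-op when the directory already exists.
ensure_parent_dir() {
  mkdir -p "$(dirname "$1")"
}
```

For example, calling `ensure_parent_dir ~/Audio/custom/talk.wav` before `speak notes.txt --output ~/Audio/custom/talk.wav` avoids the missing-directory error.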
Voice Cloning
Voice cloning generates speech that matches your vocal characteristics (pitch, tone, cadence) from a short recording.
Quality Expectations
- Output captures general voice characteristics but is not a perfect replica
- Quality depends heavily on sample quality
- 15-25 seconds is optimal (10s minimum, 30s maximum)
Recording Your Voice
Using QuickTime:
- Open QuickTime Player → File → New Audio Recording
- Record 20 seconds of clear speech
- File → Export As → Audio Only (.m4a)
- Convert to WAV (see below)
Using sox (command line):
    # -d = use default microphone
    # Recording starts immediately and stops after 25 seconds
    sox -d -r 24000 -c 1 ~/.chatter/voices/my_voice.wav trim 0 25
Converting to Required Format
Voice samples MUST be: WAV, 24000 Hz, mono, 10-30 seconds.
    # From MP3
    ffmpeg -i voice.mp3 -ar 24000 -ac 1 voice.wav

    # From M4A (QuickTime)
    ffmpeg -i voice.m4a -ar 24000 -ac 1 voice.wav

    # Trim to 25 seconds
    ffmpeg -i long.wav -t 25 -ar 24000 -ac 1 trimmed.wav

    # Check sample properties
    ffprobe -i voice.wav 2>&1 | grep -E "Duration|Stream"
    # Should show: Duration ~15-25s, 24000 Hz, mono
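The stated requirements (24000 Hz, mono, 10-30 seconds) can also be checked programmatically instead of by eye. The sketch below validates three values that you supply; `voice_spec_ok` is a hypothetical helper, and it only checks numbers, it does not read the file itself.

```shell
# voice_spec_ok RATE CHANNELS SECONDS
# Succeeds when the values match the stated requirements:
# 24000 Hz, 1 channel, duration between 10 and 30 seconds inclusive.
voice_spec_ok() {
  [ "$1" = "24000" ] || return 1
  [ "$2" = "1" ] || return 1
  # awk handles fractional durations such as 22.5
  awk -v d="$3" 'BEGIN { exit !(d >= 10 && d <= 30) }'
}
```

One way to obtain the three values is `ffprobe -v error -show_entries stream=sample_rate,channels:format=duration -of default=noprint_wrappers=1 voice.wav`, which prints them as `key=value` lines.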
Using Your Voice
    # Create directory
    mkdir -p ~/.chatter/voices/

    # Move sample
    mv voice.wav ~/.chatter/voices/my_voice.wav

    # Test
    speak "Testing my voice" --voice ~/.chatter/voices/my_voice.wav --stream

    # Use for content
    speak notes.txt --voice ~/.chatter/voices/my_voice.wav --output presentation.wav
Path requirements:
- ✓ Works: ~/.chatter/voices/my_voice.wav (tilde expanded by shell)
- ✓ Works: /Users/name/.chatter/voices/my_voice.wav
- ✗ Fails: my_voice.wav (relative path)
- ✗ Fails: ./voices/my_voice.wav (relative path)
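Since relative voice paths fail, a script can convert a path to absolute form before passing it to `--voice`. This is a minimal sketch; `abs_path` is a hypothetical helper and it does not check that the file exists.

```shell
# abs_path PATH: print PATH unchanged if it is already absolute,
# otherwise prefix it with the current working directory.
abs_path() {
  case "$1" in
    /*) printf '%s\n' "$1" ;;
    *)  printf '%s/%s\n' "$PWD" "$1" ;;
  esac
}
```

A typical call would be `speak "Test" --voice "$(abs_path my_voice.wav)" --stream`. Note that a tilde inside quotes is not expanded by the shell, so `"~/.chatter/voices/x.wav"` would be treated as a relative path here; use `"$HOME/.chatter/voices/x.wav"` when quoting.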
Voice Sample Tips
| Good Sample | Bad Sample |
|---|---|
| Quiet room | Background noise |
| Natural pace | Rushed or monotone |
| Clear diction | Mumbling |
| Varied content | Repetitive phrases |
Default Voice
When --voice is omitted, a built-in default voice is used:
speak "Hello world" --stream # Uses default voice
Emotion Tags
Tags produce audible effects (actual sounds), not spoken words:
    speak "[sigh] Monday again." --stream
    # Output: (sigh sound) "Monday again."
| Tag | Effect |
|---|---|
| [laugh] | Laughter |
| [chuckle] | Light chuckle |
| [sigh] | Sighing |
| [gasp] | Gasping |
| [groan] | Groaning |
| [clear throat] | Throat clearing |
| [cough] | Coughing |
| [crying] | Crying |
| [singing] | Sung speech |
NOT supported: [pause], [whisper] (ignored)
For pauses: Use punctuation: "Wait... let me think."
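Text written for other tools may already contain the unsupported tags. The sketch below rewrites them before the text reaches speak: `[pause]` becomes an ellipsis, following the punctuation advice above, and `[whisper]` is dropped. `strip_unsupported_tags` is a hypothetical filter, not a speak feature.

```shell
# strip_unsupported_tags: read text on stdin, write cleaned text on stdout.
# [pause] (with surrounding spaces) becomes "... "; [whisper] is removed.
# Supported tags such as [sigh] pass through untouched.
strip_unsupported_tags() {
  sed -e 's/ *\[pause] */... /g' -e 's/\[whisper] *//g'
}
```

It can sit in a pipeline, for example `cat script.txt | strip_unsupported_tags | speak --stream`.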
Batch Processing
    mkdir -p ~/Audio/book/
    speak ch01.txt ch02.txt ch03.txt --output-dir ~/Audio/book/
    # Creates: ch01.wav, ch02.wav, ch03.wav

    # With auto-chunking (for long files)
    speak chapters/*.txt --output-dir ~/Audio/book/ --auto-chunk

    # Skip completed files
    speak chapters/*.txt --output-dir ~/Audio/book/ --skip-existing
Auto-Chunk Behavior
When using --auto-chunk with batch processing:
- Each input file is chunked independently
- Chunks are generated and automatically concatenated per file
- Final output: one .wav per input file (e.g., ch01.wav)
- Intermediate chunks deleted (unless --keep-chunks)
You don't need to concatenate chunks manually; only concatenate the final chapter files.
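To get a rough idea of how many chunks a file will produce, divide its character count by the chunk size and round up. The sketch below assumes a plain fixed-size split at the `--chunk-size` default of 6000 listed in the Options Reference; the tool's actual splitting rules are not documented here, so treat the result as an approximation. `chunk_count` is a hypothetical helper.

```shell
# chunk_count CHARS CHUNK_SIZE: ceiling of CHARS / CHUNK_SIZE
# using integer arithmetic only.
chunk_count() {
  echo $(( ($1 + $2 - 1) / $2 ))
}
```

For a real file, the character count can come from `wc -m`, as in `chunk_count "$(wc -m < ch01.txt)" 6000`.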
Concatenating Audio
    # Explicit order (recommended)
    speak concat ch01.wav ch02.wav ch03.wav --output book.wav

    # Glob pattern (REQUIRES zero-padded filenames)
    speak concat audiobook/*.wav --output book.wav
Zero-Padding Rules
Critical for correct concatenation order:
| Files | Correct | Wrong |
|---|---|---|
| 1-9 | 01, 02, ..., 09 | 1, 2, ..., 9 |
| 10-99 | 01, 02, ..., 99 | 1, 10, 2, ... |
| 100+ | 001, 002, ..., 999 | 1, 100, 2, ... |
Why: Shell glob expansion sorts names character by character, so unpadded names come out as 1, 10, 2 while zero-padded names come out as 01, 02, 10.
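The padding rules above can be applied mechanically when generating filenames. This is a minimal sketch; `zero_pad` and `padded_name` are hypothetical helpers, and the width is taken from the number of digits in the total file count.

```shell
# zero_pad NUMBER WIDTH: print NUMBER left-padded with zeros to WIDTH digits.
zero_pad() {
  # Strip existing leading zeros first so that 08 is not parsed as octal.
  n=${1#"${1%%[!0]*}"}
  printf "%0${2}d\n" "${n:-0}"
}

# padded_name PREFIX NUMBER TOTAL EXT: build a filename whose digit count
# matches TOTAL (2 digits for totals of 10-99, 3 digits for 100-999).
padded_name() {
  printf '%s%s.%s\n' "$1" "$(zero_pad "$2" "${#3}")" "$4"
}
```

For a 12-chapter book, `padded_name ch 7 12 txt` yields `ch07.txt`, which sorts before `ch10.txt` under glob expansion.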
PDF to Audiobook (Complete Workflow)
Step 1: Find Chapter Boundaries
    # Preview table of contents
    pdftotext -f 1 -l 5 textbook.pdf toc.txt
    cat toc.txt   # Note chapter page numbers

    # Or search for "Chapter" markers
    pdftotext textbook.pdf - | grep -n "Chapter"
Step 2: Extract Chapters (Zero-Padded!)
    # For 100-page book with ~10 chapters
    pdftotext -f 1 -l 12 -layout textbook.pdf ch01.txt
    pdftotext -f 13 -l 25 -layout textbook.pdf ch02.txt
    pdftotext -f 26 -l 38 -layout textbook.pdf ch03.txt
    # ... continue for all chapters
Step 3: Estimate Time
    speak --estimate ch*.txt
    # Shows: total audio duration, generation time, storage needed

    # Quick estimates:
    # 1 page ≈ 2 min audio ≈ 1 min generation
    # 100 pages ≈ 200 min audio ≈ 100 min generation ≈ 500 MB
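The quick estimates above reduce to simple arithmetic: 2 minutes of audio and 1 minute of generation per page, and about 2.5 MB per minute of audio. The sketch below only reproduces those rules of thumb; `estimate_pages` is a hypothetical helper and is no substitute for `speak --estimate`.

```shell
# estimate_pages PAGES: apply the rule-of-thumb figures from this guide.
estimate_pages() {
  pages=$1
  audio_min=$(( pages * 2 ))            # 1 page ≈ 2 min audio
  gen_min=$(( pages * 1 ))              # 1 page ≈ 1 min generation
  storage_mb=$(( audio_min * 5 / 2 ))   # ≈ 2.5 MB per minute of audio
  printf 'audio_min=%d gen_min=%d storage_mb=%d\n' \
    "$audio_min" "$gen_min" "$storage_mb"
}
```

For a 100-page book this gives 200 minutes of audio, 100 minutes of generation, and 500 MB, matching the figures quoted above.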
Step 4: Generate Audio
    mkdir -p audiobook/
    speak ch01.txt ch02.txt ch03.txt --output-dir audiobook/ --auto-chunk
    # Creates: audiobook/ch01.wav, audiobook/ch02.wav, audiobook/ch03.wav
Step 5: Concatenate
    speak concat audiobook/ch01.wav audiobook/ch02.wav audiobook/ch03.wav --output complete_audiobook.wav

    # Or with glob (only if zero-padded):
    speak concat audiobook/ch*.wav --output complete_audiobook.wav
PDF Troubleshooting
| Issue | Solution |
|---|---|
| Empty/garbled text | Scanned PDF — use OCR: brew install tesseract |
| Wrong encoding | Try: pdftotext -enc UTF-8 doc.pdf |
| Check word count | pdftotext doc.pdf - \| wc -w (should be >100) |
Multi-Voice Content
    mkdir -p podcast/scripts podcast/wav

    echo "Welcome to the show." > podcast/scripts/01_host.txt
    echo "Thanks for having me." > podcast/scripts/02_guest.txt

    speak podcast/scripts/01_host.txt --voice ~/.chatter/voices/host.wav --output podcast/wav/01.wav
    speak podcast/scripts/02_guest.txt --voice ~/.chatter/voices/guest.wav --output podcast/wav/02.wav

    speak concat podcast/wav/01.wav podcast/wav/02.wav --output podcast.wav
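With more than a couple of lines, the voice for each script can be derived from its filename rather than typed by hand. The sketch below assumes the naming scheme used in the example above (`NN_speaker.txt`, with a matching `speaker.wav` in `~/.chatter/voices/`); `voice_for_script` is a hypothetical helper.

```shell
# voice_for_script SCRIPT: map podcast/scripts/01_host.txt to the
# absolute voice path $HOME/.chatter/voices/host.wav.
voice_for_script() {
  base=$(basename "$1" .txt)   # e.g. 01_host
  speaker=${base#*_}           # drop everything up to the first underscore
  printf '%s/.chatter/voices/%s.wav\n' "$HOME" "$speaker"
}
```

Inside a loop over `podcast/scripts/*.txt`, the result can be passed straight to `--voice`, for example `speak "$f" --voice "$(voice_for_script "$f")" --output "podcast/wav/$(basename "$f" .txt).wav"`.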
Options Reference
| Option | Description | Default |
|---|---|---|
| --stream | Stream as it generates | false |
| --play | Play after complete | false |
| --output <path> | Output file | ~/Audio/speak/ |
| --output-dir <dir> | Batch output directory | - |
| --voice <path> | Voice sample (full path) | default |
| --timeout <sec> | Timeout per file | 300 |
| --auto-chunk | Split long documents | false |
| --chunk-size <n> | Chars per chunk | 6000 |
| --resume <file> | Resume from manifest | - |
| --keep-chunks | Keep intermediate files | false |
| --skip-existing | Skip if output exists | false |
| --estimate | Show duration estimate | false |
| --dry-run | Preview only | false |
| --quiet | Suppress output | false |
Commands
| Command | Description |
|---|---|
| speak setup | Set up environment |
| speak health | Check system status |
| speak models | List TTS models |
| speak concat | Concatenate audio |
| speak daemon kill | Stop TTS server |
| speak config | Show configuration |
Performance
| Metric | Value |
|---|---|
| Cold start | ~4-8s |
| Warm start | ~3-8s |
| Speed | 0.3-0.5x RTF (faster than real-time) |
| Storage | ~2.5 MB/min, ~150 MB/hour |
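The speed figure can be turned into a time estimate. The sketch below assumes RTF (real-time factor) means generation time divided by audio duration, which is consistent with the "faster than real-time" note in the table; `generation_seconds` is a hypothetical helper.

```shell
# generation_seconds AUDIO_SECONDS RTF: estimated generation time in
# seconds, rounded to the nearest whole second.
generation_seconds() {
  awk -v a="$1" -v r="$2" 'BEGIN { printf "%.0f\n", a * r }'
}
```

Under that assumption, one hour of audio (3600 seconds) at an RTF of 0.3 to 0.5 takes roughly 18 to 30 minutes to generate.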
Resume Capability
For interrupted long generations:
    # Single file with auto-chunk: use --resume
    speak long.txt --auto-chunk --output book.wav
    # If interrupted, manifest saved at ~/Audio/speak/manifest.json
    speak --resume ~/Audio/speak/manifest.json

    # Batch processing: use --skip-existing
    speak ch*.txt --output-dir audiobook/ --auto-chunk
    # If interrupted, re-run same command:
    speak ch*.txt --output-dir audiobook/ --auto-chunk --skip-existing
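Before re-running an interrupted batch, it can help to see which inputs still lack an output file. The sketch below assumes each input maps to `<output-dir>/<basename>.wav`, as in the batch examples in this guide; `pending_inputs` is a hypothetical helper that mirrors what `--skip-existing` is described as doing, not the tool's own logic.

```shell
# pending_inputs OUTDIR FILE...: print each input file whose
# corresponding OUTDIR/<name>.wav does not exist yet.
pending_inputs() {
  outdir=$1
  shift
  for input in "$@"; do
    name=$(basename "$input")
    [ -e "$outdir/${name%.*}.wav" ] || printf '%s\n' "$input"
  done
}
```

For example, `pending_inputs audiobook ch*.txt` lists the chapters that a re-run with `--skip-existing` would still need to generate.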
Common Errors
| Error | Cause | Solution |
|---|---|---|
| "Voice file not found" | Relative path | Use full path: ~/.chatter/voices/x.wav |
| "Invalid WAV format" | Wrong specs | Convert: ffmpeg -i in.wav -ar 24000 -ac 1 out.wav |
| "Voice sample too short" | <10 seconds | Record 15-25 seconds |
| "Output directory doesn't exist" | Not created | mkdir -p dirname/ |
| "sox not found" | Not installed | brew install sox |
| Scrambled concat order | Non-zero-padded | Use 01, 02, not 1, 2 |
| Timeout | >5 min generation | Use --auto-chunk or --timeout 600 |
| "Server not running" | Stale daemon | speak daemon kill && speak health |
Setup
    speak "test"    # Auto-setup on first run (downloads model ~500MB)
    speak setup     # Or manual setup
    speak health    # Verify everything works
Server Management
Server auto-starts and shuts down after 1 hour idle.
    speak health        # Check status
    speak daemon kill   # Stop manually