Voice Cloning
Generate speech and clone voices locally without API costs.
When to Use
- Need text-to-speech without paying for ElevenLabs/OpenAI
- Want to clone a voice from a sample
- Creating podcasts, voiceovers, or audio content
- Privacy-sensitive applications (no data leaves your machine)
Quick Start
Option 1: Coqui TTS (Best Quality)
bash
# Install
pip install TTS
# List available models
tts --list_models
# Generate speech
tts --text "Hello, this is a test." --out_path output.wav
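# Use a specific model (recommended: XTTS v2)
# XTTS v2 is multilingual and multi-speaker, so it needs a reference voice and a language
tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
--text "Hello world" \
--speaker_wav voice_sample.wav \
--language_idx en \
--out_path output.wav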
Option 2: Bark (Most Natural)
bash
# Install
pip install git+https://github.com/suno-ai/bark.git
# Use via Python
python skills/voice-cloning/scripts/bark-generate.py "Your text here" output.wav
Option 3: Piper (Fastest)
bash
# Install
pip install piper-tts
# Generate (very fast, good for bulk)
echo "Hello world" | piper --model en_US-lessac-medium --output_file output.wav
Voice Cloning (XTTS v2)
Clone a voice from a clean audio sample of at least 6 seconds:
bash
python skills/voice-cloning/scripts/clone-voice.py \
--sample voice_sample.wav \
--text "Text to speak in cloned voice" \
--output cloned_output.wav
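The contents of `clone-voice.py` are not shown in this guide. The following is a minimal sketch of what such a script could look like, assuming Coqui's Python API (`TTS.api.TTS` and its `tts_to_file` method) and a PCM WAV reference sample. The function names, the `--language` flag, and its `en` default are illustrative assumptions, not the script's actual code.

```python
import argparse
import wave

MODEL_NAME = "tts_models/multilingual/multi-dataset/xtts_v2"
MIN_SAMPLE_SECONDS = 6.0  # minimum sample length stated in this guide


def sample_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def parse_args(argv=None):
    """Parse the same flags the command above uses."""
    parser = argparse.ArgumentParser(description="Clone a voice with XTTS v2")
    parser.add_argument("--sample", required=True, help="reference WAV file")
    parser.add_argument("--text", required=True, help="text to speak")
    parser.add_argument("--output", default="cloned_output.wav")
    # Assumption: the real script may not expose a language flag.
    parser.add_argument("--language", default="en")
    return parser.parse_args(argv)


def clone_voice(sample, text, output, language="en"):
    """Check the sample length, then synthesize `text` in the sampled voice."""
    duration = sample_duration_seconds(sample)
    if duration < MIN_SAMPLE_SECONDS:
        raise ValueError(
            f"sample is {duration:.1f}s; need at least {MIN_SAMPLE_SECONDS:.0f}s"
        )
    # Imported lazily so parsing and validation work without the model installed.
    import torch
    from TTS.api import TTS

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tts = TTS(MODEL_NAME).to(device)
    tts.tts_to_file(
        text=text, speaker_wav=sample, language=language, file_path=output
    )


def main(argv=None):
    args = parse_args(argv)
    clone_voice(args.sample, args.text, args.output, args.language)
```

Adding `if __name__ == "__main__": main()` at the bottom turns the sketch into a script that accepts the flags shown above. The duration check relies on the standard-library `wave` module, so it only handles uncompressed WAV files.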
Available Scripts
scripts/coqui-generate.py
Basic TTS generation with Coqui.
scripts/bark-generate.py
Natural-sounding speech with Bark (slower but more expressive).
scripts/clone-voice.py
Clone a voice from an audio sample using XTTS v2.
scripts/batch-tts.py
Generate multiple audio files from a text file (one line = one file).
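The "one line = one file" rule used by `batch-tts.py` can be sketched as follows. This is an illustration under assumptions, not the script's actual code: the helper names `plan_batch` and `run_piper`, the `line_001.wav` naming scheme, and the choice to skip blank lines are all hypothetical. The Piper invocation reuses the flags from the Quick Start.

```python
import subprocess
from pathlib import Path


def plan_batch(text, out_dir="out", prefix="line"):
    """Map each non-empty line of `text` to an output WAV path."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # Zero-pad the index so files sort in input order.
    width = max(3, len(str(len(lines))))
    return [
        (Path(out_dir) / f"{prefix}_{i:0{width}d}.wav", line)
        for i, line in enumerate(lines, start=1)
    ]


def run_piper(out_path, line, model="en_US-lessac-medium"):
    """Synthesize one line by piping it to the piper CLI."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["piper", "--model", model, "--output_file", str(out_path)],
        input=line.encode("utf-8"),
        check=True,
    )
```

A batch run would then loop over the plan, for example `for out_path, line in plan_batch(Path("script.txt").read_text()): run_piper(out_path, line)`, where `script.txt` is a placeholder input file.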
Model Comparison
| Model | Quality | Speed | Voice Clone | Languages |
|---|---|---|---|---|
| XTTS v2 | ★★★★★ | Slow | ✅ Yes | 17 |
| Bark | ★★★★★ | Very Slow | ❌ No | 13 (best in EN) |
| Piper | ★★★☆☆ | Very Fast | ❌ No | 30+ |
Tips
- For quality: Use XTTS v2 or Bark
- For speed: Use Piper
- For cloning: XTTS v2 is the only one of these three models that supports it
- GPU recommended: Bark and XTTS v2 are slow on CPU
Limitations
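- The first run downloads model weights (1-4 GB, depending on the model)
- A GPU is recommended for XTTS v2 and Bark; Piper runs well on CPU
- Voice cloning needs a clean sample of at least 6 seconds (single speaker, no background noise or music)
- Bark can hallucinate on long texts, so split long input into short chunks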
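One common way to work around Bark's trouble with long texts is to synthesize sentence-sized chunks and join the audio afterwards. The sketch below assumes Bark's documented Python API (`preload_models`, `generate_audio`, `SAMPLE_RATE`) plus NumPy and SciPy. The helper names and the 200-character chunk limit are illustrative assumptions, so tune the limit by ear.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Group whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept as its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def bark_generate_long(text, out_path, max_chars=200):
    """Generate each chunk with Bark and write the joined audio to a WAV file."""
    chunks = split_into_chunks(text, max_chars)
    if not chunks:
        raise ValueError("no text to synthesize")
    # Imported lazily so the chunking helper works without Bark installed.
    import numpy as np
    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()
    pieces = [generate_audio(chunk) for chunk in chunks]
    write_wav(out_path, SAMPLE_RATE, np.concatenate(pieces))
```

Splitting on sentence boundaries keeps each request short without cutting words in half. The voice may still vary slightly between chunks, since each one is generated independently.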