Text2Speech Skill
Generate high-quality text-to-speech audio using Qwen3-TTS models.
Prerequisites
- Python 3.8+
- requests package
- Access to TTSWeb API (https://mc.agaii.org/TTS)
Installation
Via npm (Node.js)
npm install -g @catfishw/text2speech-skill
Via pip (Python)
pip install git+https://github.com/CatfishW/TTSAgentSkill.git
Direct Usage
python3 -m text2speech_skill.cli --help
Quick Start
Speak with Preset Speaker
text2speech speak "Hello world" -s vivian -o hello.wav
Design Custom Voice
text2speech design "Welcome to the future" \
  -d "futuristic female AI assistant, clear and professional" \
  -o welcome.wav
Clone Voice from Audio
text2speech clone "This is my cloned voice speaking" \
  -a reference.wav \
  -r "original transcript of reference audio" \
  -o cloned.wav
Clone with Preset Timbre
text2speech clone "Hello" -t ryan -o output.wav
Commands
speak
Text-to-speech with preset speaker voices.
text2speech speak <text> [options]

Options:
  -s, --speaker    Speaker name (default: vivian)
  -l, --language   Language code (default: Auto)
  -i, --instruct   Style instruction (e.g., "speak cheerfully")
  -o, --output     Output audio file (required)
Speakers: vivian, ryan, aiden, dylan, eric, ono_anna, serena, sohee, uncle_fu
Examples:
text2speech speak "Hello" -s vivian -o hello.wav
text2speech speak "Bonjour" -s serena -l French -o bonjour.wav
text2speech speak "Hi" -s ryan -i "speak like a news anchor" -o hi.wav
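When calling speak from a script, it can help to assemble the argument list in one place. The following is a minimal sketch, not part of the package: build_speak_command is a hypothetical helper that only uses the flags documented above (-s, -l, -i, -o) and assumes the text2speech executable is on PATH.

```python
import shlex
from typing import List, Optional


def build_speak_command(
    text: str,
    output: str,
    speaker: str = "vivian",
    language: Optional[str] = None,
    instruct: Optional[str] = None,
) -> List[str]:
    """Return an argv list for `text2speech speak` (hypothetical helper)."""
    cmd = ["text2speech", "speak", text, "-s", speaker]
    if language:
        cmd += ["-l", language]   # omitted when None so the CLI default applies
    if instruct:
        cmd += ["-i", instruct]   # optional style instruction
    cmd += ["-o", output]         # output file is required by the CLI
    return cmd


# Show the shell-quoted command line without running it.
cmd = build_speak_command("Bonjour", "bonjour.wav", speaker="serena", language="French")
print(shlex.join(cmd))
```

To actually run the command, the returned list can be passed to subprocess.run(cmd, check=True). Passing a list rather than a single string avoids shell-quoting problems with text that contains spaces or quotes.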
design
Create voice from natural language description.
text2speech design <text> -d <description> [options]

Options:
  -d, --description   Voice description (required)
  -l, --language      Language code
  -o, --output        Output audio file (required)
Examples:
text2speech design "Hello" -d "old man with raspy voice" -o oldman.wav
text2speech design "Welcome" -d "young energetic female, enthusiastic" -o welcome.wav
clone
Clone voice from reference audio or preset timbre.
text2speech clone <text> [options]

Options:
  -a, --audio           Reference audio file
  -t, --timbre          Preset timbre speaker (alternative to audio)
  -r, --ref-text        Reference transcript (for ICL mode)
  -x, --x-vector-only   Use x-vector only mode
  -i, --instruct        Style instruction
  -l, --language        Language code
  -o, --output          Output audio file (required)
Examples:
# Clone from audio with transcript (ICL mode)
text2speech clone "Hello" -a ref.wav -r "original text" -o out.wav

# Clone from audio (x-vector only, faster)
text2speech clone "Hello" -a ref.wav -x -o out.wav

# Clone using preset timbre
text2speech clone "Hello" -t ryan -o out.wav
batch-speak
Batch process multiple text files.
text2speech batch-speak <input_dir> <output_dir> [options]

Options:
  -s, --speaker    Speaker name (default: vivian)
  -l, --language   Language code
  -i, --instruct   Style instruction
Input: Directory containing .txt files
Output: Audio files + batch_report.json
Example:
mkdir -p texts output
echo "Hello" > texts/1.txt
echo "World" > texts/2.txt
text2speech batch-speak texts/ output/ -s vivian
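The same input directory can be prepared from Python when the texts come from a program rather than from the shell. This is a rough sketch under stated assumptions: write_batch_inputs is a hypothetical helper, and the numbered file names (1.txt, 2.txt, ...) simply mirror the example above; the only documented requirement is that the directory contains .txt files.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def write_batch_inputs(texts: Iterable[str], input_dir: str) -> List[Path]:
    """Write each text to <input_dir>/<n>.txt and return the created paths."""
    directory = Path(input_dir)
    directory.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    for index, text in enumerate(texts, start=1):
        path = directory / f"{index}.txt"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths


# Demonstration in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    created = write_batch_inputs(["Hello", "World"], tmp)
    print([p.name for p in created])
```

After writing the files to a real directory such as texts/, run text2speech batch-speak texts/ output/ -s vivian as shown in the example.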
batch-clone
Batch clone voice for multiple texts.
text2speech batch-clone <input_dir> <output_dir> -a <audio> [options]

Options:
  -a, --audio      Reference audio (required)
  -r, --ref-text   Reference transcript
  -l, --language   Language code
Example:
text2speech batch-clone texts/ output/ -a reference.wav -r "transcript"
encode
Encode audio to tokens (tokenizer).
text2speech encode <audio> [-o output.json]
Example:
text2speech encode audio.wav -o tokens.json
cat tokens.json | jq '.count'
decode
Decode tokens to audio.
text2speech decode <tokens_file> -o <output>
Example:
text2speech decode tokens.json -o output.wav
status
Check service status.
text2speech status
Shows:
- API health
- GPU availability
- Loaded models
- Speaker count
speakers
List available preset speakers.
text2speech speakers
languages
List supported languages.
text2speech languages
API Configuration
Default API: https://mc.agaii.org/TTS/api/v1
To use a local backend, change API_BASE in text2speech_skill/cli.py:
API_BASE = "http://localhost:24536/api/v1"
Voice Cloning Modes
ICL Mode (In-Context Learning)
- Requires reference transcript (--ref-text)
- Higher quality, follows reference prosody
- Default mode when transcript provided
X-Vector Mode
- Use the --x-vector-only flag
- Faster, captures only speaker characteristics
- No transcript needed
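The choice between the two modes can be made mechanically from whether a transcript is available. The sketch below illustrates that decision; build_clone_command is a hypothetical helper, not part of the package. It assumes that --audio and --timbre are mutually exclusive (the option list describes timbre as an alternative to audio) and falls back to --x-vector-only when no transcript is given, which is one way to avoid the "ref_text required" failure.

```python
from typing import List, Optional


def build_clone_command(
    text: str,
    output: str,
    audio: Optional[str] = None,
    timbre: Optional[str] = None,
    ref_text: Optional[str] = None,
) -> List[str]:
    """Return an argv list for `text2speech clone` (hypothetical helper)."""
    # Assumption: exactly one voice source, reference audio or preset timbre.
    if (audio is None) == (timbre is None):
        raise ValueError("pass exactly one of audio or timbre")

    cmd = ["text2speech", "clone", text]
    if timbre is not None:
        cmd += ["-t", timbre]
    else:
        cmd += ["-a", audio]
        if ref_text:
            cmd += ["-r", ref_text]  # transcript available: ICL mode
        else:
            cmd += ["-x"]            # no transcript: x-vector only mode
    cmd += ["-o", output]
    return cmd


print(build_clone_command("Hello", "out.wav", audio="ref.wav", ref_text="original text"))
print(build_clone_command("Hello", "out.wav", audio="ref.wav"))
```

The returned list can be handed to subprocess.run(cmd, check=True) once the text2speech executable is installed.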
Tips
- Use @file.txt syntax to read text from a file: text2speech speak @input.txt -o out.wav
- Reference audio should be clear and 5-30 seconds long for best cloning
- ICL mode produces better results than x-vector mode when the transcript is accurate
- Batch operations save a batch_report.json with results
Troubleshooting
Job fails with "ref_text required" → Add --ref-text with transcript or use --x-vector-only
Audio quality is poor → Use clearer reference audio, or try different speaker/timbre
Timeout on long text → Break into smaller chunks, or use batch mode
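For the long-text case, one possible way to break the input into smaller chunks is to split at sentence boundaries and pack sentences up to a size limit. This is only a sketch: chunk_text is a hypothetical helper, and the 300-character default is an arbitrary assumption, not a limit documented by the service.

```python
import re
from typing import List


def chunk_text(text: str, max_chars: int = 300) -> List[str]:
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    # Split after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is cut at fixed width.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("One. Two. Three.", max_chars=10))
```

Each chunk can then be written to its own .txt file and processed with batch-speak, or synthesized one at a time with speak.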