Base directory for this skill: /Users/de-shiouhuang/Dropbox/code/tezign/tts-test/.claude/skills/gemini-tts-fast
Gemini Text-to-Speech Skill (1.2x Speed)
Convert text to speech using Google Gemini's TTS API with fixed 1.2x playback speed. Automatically generates standard WAV audio files optimized for video narration.
Features
- •Converts text to natural-sounding speech using Gemini 2.5 Pro TTS
- •Fixed 1.2x speed for more dynamic video narration
- •Outputs standard WAV format (24kHz, 16-bit, mono)
- •Supports multiple voice styles
- •Handles both English and Chinese text
- •Automatic format conversion using ffmpeg
Requirements
- •Python 3.x with the google-genai package
- •ffmpeg (for audio format conversion)
- •GOOGLE_API_KEY environment variable (stored in the .env file)
Usage
When the user requests text-to-speech conversion, the skill runs the following steps and always applies 1.2x speed:
- •Load environment: Source the .env file to get the API key
- •Parse arguments:
  - •Text to convert (required) - supports multiple languages
  - •--output=filename.wav (optional, default: output.wav)
  - •--voice=VoiceName (optional, default: Puck)
  - •Speed is automatically set to 1.2x (no need to specify)
- •Generate audio: Run the script to create the WAV file at 1.2x speed
- •Confirm success: Report the output file location and size
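The argument handling described above can be sketched with argparse. This is a minimal illustration, not the contents of tts_cli.py, which is not shown here; the function names build_parser and parse_args are hypothetical, and only the flags and defaults listed above are taken from this document.

```python
import argparse

# Assumption: the skill always passes this value, per the command template.
FIXED_SPEED = 1.2


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented arguments (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="tts_cli.py",
        description="Convert text to speech and write a WAV file.",
    )
    parser.add_argument("text", help="Text to convert (any supported language)")
    parser.add_argument("--output", default="output.wav", help="Output WAV path")
    parser.add_argument("--voice", default="Puck", help="Voice name")
    parser.add_argument("--speed", type=float, default=FIXED_SPEED,
                        help="Playback speed multiplier")
    return parser


def parse_args(argv: list[str]) -> argparse.Namespace:
    """Parse an explicit argument list instead of sys.argv, for easy testing."""
    return build_parser().parse_args(argv)


# Example: only the text is given, so the documented defaults apply.
args = parse_args(["Hello world"])
print(args.output, args.voice, args.speed)
```

Parsing an explicit list rather than sys.argv keeps the sketch testable without invoking the real script.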
Available Voices
- •Puck (default) - Neutral, clear
- •Charon - Deep, authoritative
- •Kore - Warm, friendly
- •Fenrir - Strong, dynamic
- •Aoede - Smooth, expressive
Command Template
The skill now includes its own tts_cli.py script in the skill directory, making it fully portable:
set -a && source .env && set +a && \
source venv/bin/activate && \
python .claude/skills/gemini-tts-fast/tts_cli.py "<text>" --output="<filename>" --voice="<voice>" --speed=1.2
Note: Users need to:
- •Install Python dependencies: pip install google-genai
- •Create a .env file with GOOGLE_API_KEY=your-key
- •Install ffmpeg: brew install ffmpeg (macOS) or equivalent
- •Create a Python virtual environment: python -m venv venv
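For callers that cannot source the .env file from a shell, the same KEY=value format can be read in Python. The sketch below is an assumption-laden illustration: parse_env is a hypothetical helper, not part of the skill, and it handles only simple lines with an optional "export " prefix, comments, and surrounding quotes.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines from .env-style text (hypothetical helper)."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        # Drop surrounding single or double quotes, if present.
        values[key.strip()] = value.strip().strip("'\"")
    return values


sample = "# API credentials\nGOOGLE_API_KEY=your-key\n"
print(parse_env(sample))
```

The returned dictionary could then be merged into os.environ before the script starts, which is roughly what set -a && source .env achieves in the shell.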
Error Handling
- •If GOOGLE_API_KEY is missing from .env, instruct user to add it
- •If ffmpeg is not installed, instruct user to install it (brew install ffmpeg)
- •If script fails, show the error message
- •If model is unavailable, suggest checking Gemini API status
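The first two checks above can be run before any API call is made. The following is a minimal sketch of such a preflight step; preflight_problems is a hypothetical name, and the env and which parameters exist only so the check can be exercised without touching the real environment.

```python
import os
import shutil


def preflight_problems(env=None, which=shutil.which) -> list[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("GOOGLE_API_KEY"):
        problems.append(
            "GOOGLE_API_KEY is missing: add GOOGLE_API_KEY=your-key to .env"
        )
    # shutil.which returns None when the executable is not on PATH.
    if which("ffmpeg") is None:
        problems.append(
            "ffmpeg is not installed: run 'brew install ffmpeg' (macOS) or equivalent"
        )
    return problems


for problem in preflight_problems():
    print(problem)
```

Collecting every problem before reporting lets the user fix a missing key and a missing ffmpeg install in one pass.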
Examples
Simple usage (English):
/gemini-tts-fast "Hello world"
→ Generates output.wav at 1.2x speed
With custom output:
/gemini-tts-fast "Welcome to our app" --output=welcome.wav
→ Generates welcome.wav at 1.2x speed
With custom voice:
/gemini-tts-fast "Thank you for listening" --output=thanks.wav --voice=Aoede
→ Generates thanks.wav at 1.2x speed with Aoede voice
Chinese text:
/gemini-tts-fast "你好世界" --output=hello_cn.wav --voice=Kore
→ Generates hello_cn.wav at 1.2x speed with Kore voice
Processing script.json:
/gemini-tts-fast @script.json
→ Automatically processes all narration scenes from script.json at 1.2x speed
Technical Details
- •Input: Raw PCM data from Gemini API
- •Processing: Converts to WAV and applies 1.2x speed using the ffmpeg atempo filter
- •Output format: RIFF WAVE, 24000 Hz, mono, 16-bit PCM
- •Playback speed: Fixed at 1.2x (shortens duration by ~17%)
- •Temporary files are automatically cleaned up
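The pipeline above can be approximated with the standard library plus an ffmpeg invocation. This is a sketch under stated assumptions, not the skill's actual implementation: the helper names are hypothetical, the WAV wrapping uses Python's wave module, and the ffmpeg arguments are one plausible way to apply atempo (the real script's flags are not shown in this document).

```python
import io
import wave

SAMPLE_RATE = 24000   # Hz, per the output format above
SAMPLE_WIDTH = 2      # bytes per sample (16-bit PCM)
CHANNELS = 1          # mono


def pcm_to_wav_bytes(pcm: bytes) -> bytes:
    """Wrap raw 16-bit mono PCM in a RIFF WAVE container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)
    return buf.getvalue()


def atempo_command(src: str, dst: str, speed: float = 1.2) -> list[str]:
    """Build an ffmpeg argument list that changes tempo without changing pitch."""
    return ["ffmpeg", "-y", "-i", src, "-filter:a", f"atempo={speed}", dst]


def sped_up_duration(seconds: float, speed: float = 1.2) -> float:
    """Duration after speeding up; at 1.2x the clip is about 17% shorter."""
    return seconds / speed


one_second_of_silence = b"\x00\x00" * SAMPLE_RATE
wav_bytes = pcm_to_wav_bytes(one_second_of_silence)
print(len(wav_bytes), atempo_command("raw.wav", "output.wav"))
```

The command list could be passed to subprocess.run(..., check=True). The duration helper reflects the arithmetic behind the ~17% figure: 1 - 1/1.2 is roughly 0.167.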
Why 1.2x Speed?
1.2x speed works well for video narration because it:
- •Maintains natural speech clarity
- •Keeps the content engaging and dynamic
- •Reduces video length without sounding rushed
- •Matches a common pace for professional video voiceovers