Base directory for this skill: /Users/de-shiouhuang/Dropbox/code/tezign/tts-test/.claude/skills/gemini-tts-fast

Gemini Text-to-Speech Skill (1.2x Speed)

Convert text to speech using Google Gemini's TTS API with fixed 1.2x playback speed. Automatically generates standard WAV audio files optimized for video narration.

Features

•Converts text to natural-sounding speech using Gemini 2.5 Pro TTS
•Fixed 1.2x speed for more dynamic video narration
•Outputs standard WAV format (24kHz, 16-bit, mono)
•Supports multiple voice styles
•Handles both English and Chinese text
•Automatic format conversion using ffmpeg

Requirements

•Python 3.x with google-genai package
•ffmpeg (for audio format conversion)
•GOOGLE_API_KEY environment variable (stored in .env file)

Usage

When the user requests text-to-speech conversion, run the following steps. The 1.2x speed is always applied:

•Load environment: Source the .env file to get the API key
•Parse arguments:
  •Text to convert (required), in any supported language
  •--output=filename.wav (optional, default: output.wav)
  •--voice=VoiceName (optional, default: Puck)
  •Speed is automatically set to 1.2x (no need to specify)
•Generate audio: Run the script to create the WAV file at 1.2x speed
•Confirm success: Report the output file location and size
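
The argument handling described above could be sketched with argparse as follows. The actual tts_cli.py is not shown in this document, so the function names (build_parser, parse_args) and the restriction of --voice to the five voices listed here are assumptions for illustration only.

```python
import argparse

# Voice names taken from this document; the real script may accept others.
VOICES = ["Puck", "Charon", "Kore", "Fenrir", "Aoede"]
FIXED_SPEED = 1.2  # this skill always narrates at 1.2x


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented arguments (hypothetical layout)."""
    parser = argparse.ArgumentParser(
        prog="tts_cli.py",
        description="Convert text to a WAV file at a fixed playback speed.",
    )
    parser.add_argument("text", help="Text to convert (required)")
    parser.add_argument("--output", default="output.wav", help="Output WAV path")
    parser.add_argument("--voice", default="Puck", choices=VOICES, help="Voice name")
    # The command template passes --speed=1.2 explicitly; default to the same value.
    parser.add_argument("--speed", type=float, default=FIXED_SPEED, help="Playback speed")
    return parser


def parse_args(argv):
    """Parse a list of command-line tokens into a namespace."""
    return build_parser().parse_args(argv)


if __name__ == "__main__":
    args = parse_args(["Hello world", "--voice=Aoede"])
    print(args.text, args.output, args.voice, args.speed)
```

Limiting --voice to a fixed list makes a misspelled voice fail before any API call is made. Whether the real script does this is not stated here.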

Available Voices

•Puck (default) - Neutral, clear
•Charon - Deep, authoritative
•Kore - Warm, friendly
•Fenrir - Strong, dynamic
•Aoede - Smooth, expressive

Command Template

The skill now includes its own tts_cli.py script in the skill directory, making it fully portable:

bash

set -a && source .env && set +a && \
source venv/bin/activate && \
python .claude/skills/gemini-tts-fast/tts_cli.py "<text>" --output="<filename>" --voice="<voice>" --speed=1.2

Note: Before the first run, users need to:

•Install ffmpeg: brew install ffmpeg (macOS) or equivalent
•Create a Python virtual environment: python -m venv venv
•Activate the environment and install Python dependencies: source venv/bin/activate, then pip install google-genai
•Create a .env file with GOOGLE_API_KEY=your-key

Error Handling

•If GOOGLE_API_KEY is missing from .env, instruct user to add it
•If ffmpeg is not installed, instruct user to install it (brew install ffmpeg)
•If script fails, show the error message
•If model is unavailable, suggest checking Gemini API status
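
The first two checks above can run before any API call is made. The following is a minimal sketch, not the skill's actual code: preflight_errors is a hypothetical helper, and its env and which parameters exist only so the check can be exercised without touching the real environment.

```python
import os
import shutil


def preflight_errors(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready to run."""
    env = os.environ if env is None else env
    problems = []
    # The API key is expected to have been exported from .env before the script starts.
    if not env.get("GOOGLE_API_KEY"):
        problems.append(
            "GOOGLE_API_KEY is not set. Add GOOGLE_API_KEY=your-key to the .env file."
        )
    # shutil.which returns None when the executable is not on PATH.
    if which("ffmpeg") is None:
        problems.append(
            "ffmpeg was not found on PATH. Install it, e.g. brew install ffmpeg on macOS."
        )
    return problems


if __name__ == "__main__":
    for problem in preflight_errors():
        print(problem)
```

Collecting every problem before reporting lets the user fix a missing key and a missing ffmpeg in one pass.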

Examples

Simple usage (English):

code

/gemini-tts-fast "Hello world"

→ Generates output.wav at 1.2x speed

With custom output:

code

/gemini-tts-fast "Welcome to our app" --output=welcome.wav

→ Generates welcome.wav at 1.2x speed

With custom voice:

code

/gemini-tts-fast "Thank you for listening" --output=thanks.wav --voice=Aoede

→ Generates thanks.wav at 1.2x speed with Aoede voice

Chinese text:

code

/gemini-tts-fast "你好世界" --output=hello_cn.wav --voice=Kore

→ Generates hello_cn.wav at 1.2x speed with Kore voice

Processing script.json:

code

/gemini-tts-fast @script.json

→ Automatically processes all narration scenes from script.json at 1.2x speed

Technical Details

•Input: Raw PCM data from Gemini API
•Processing: Converts to WAV and applies 1.2x speed using ffmpeg atempo filter
•Output format: RIFF WAVE, 24000 Hz, mono, 16-bit PCM
•Playback speed: Fixed at 1.2x (shortens duration by ~17%)
•Temporary files are automatically cleaned up
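
The pipeline above can be illustrated with the standard library alone. This sketch assumes the API returns headerless 16-bit little-endian mono PCM at 24 kHz, as the output format suggests. The helper names (pcm_to_wav_bytes, atempo_command, sped_up_duration) are hypothetical, and the ffmpeg invocation is one plausible way to apply the atempo filter, not necessarily the one tts_cli.py uses.

```python
import io
import wave

SAMPLE_RATE = 24000  # Hz
SAMPLE_WIDTH = 2     # bytes per sample (16-bit)
CHANNELS = 1         # mono
SPEED = 1.2


def pcm_to_wav_bytes(pcm: bytes) -> bytes:
    """Wrap raw PCM samples in a RIFF WAVE container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(CHANNELS)
        wav_file.setsampwidth(SAMPLE_WIDTH)
        wav_file.setframerate(SAMPLE_RATE)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


def atempo_command(src: str, dst: str, speed: float = SPEED) -> list:
    """Build an ffmpeg argument list that changes tempo without changing pitch."""
    return ["ffmpeg", "-y", "-i", src, "-filter:a", f"atempo={speed}", dst]


def sped_up_duration(pcm_len_bytes: int, speed: float = SPEED) -> float:
    """Duration in seconds after the speed-up, computed from the PCM byte count."""
    seconds = pcm_len_bytes / (SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS)
    return seconds / speed


if __name__ == "__main__":
    one_second = b"\x00\x00" * SAMPLE_RATE  # one second of silence
    print(len(pcm_to_wav_bytes(one_second)), "bytes of WAV data")
    print(atempo_command("raw.wav", "output.wav"))
    print(round(sped_up_duration(len(one_second)), 3), "seconds after speed-up")
```

The ~17% figure follows from the speed factor: the sped-up duration is 1/1.2 ≈ 0.833 of the original, a reduction of about 16.7%.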

Why 1.2x Speed?

1.2x speed works well for video narration because it:

•Maintains natural speech clarity
•Keeps the content engaging and dynamic
•Reduces video length without sounding rushed
•Is a common choice for video voiceovers