Speech Generation
Automate spoken audio generation for narration, product demos, IVR prompts, accessibility reads, and batch speech jobs using the OpenAI Audio API and the bundled CLI.
Overview
This skill automates text-to-speech generation for developers and content creators. It runs the bundled CLI (scripts/text_to_speech.py) whenever the user needs spoken audio generated from text input.
What it automates:
- •Single clip generation from text with voice direction
- •Batch generation of multiple audio files from JSONL input
- •Instruction augmentation from user intent into structured voice directions
- •Dependency installation and environment setup
Time saved: roughly 5-15 minutes per clip compared with making API calls, tuning parameters, and managing files by hand
Triggers
When to Run
This automation should run when:
- •User asks to generate speech, narration, voiceover, or audio from text
- •User needs batch audio generation for multiple prompts
- •User asks about text-to-speech, TTS, or the OpenAI Audio API
- •User wants accessibility reads or IVR phone prompts
Decision Tree
- •Multiple lines, prompts, or many outputs requested --> batch mode
- •Single text or one clip --> single mode
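The decision tree above can be sketched as a small helper. This is an illustration only: the function name `choose_mode` is hypothetical, and the returned strings are the two CLI subcommands used later in this document.

```python
def choose_mode(texts: list[str]) -> str:
    """Return the CLI subcommand for the given texts (one clip vs. many)."""
    if not texts:
        raise ValueError("at least one text is required")
    # More than one text or output means batch mode; otherwise single mode.
    return "speak-batch" if len(texts) > 1 else "speak"
```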
Process
Step 1: Collect Inputs
Gather from the user before generating:
- •Text: The exact words to speak (verbatim, do not paraphrase)
- •Voice: Preferred voice (default: cedar; brighter: marin)
- •Style: Delivery style, tone, pacing preferences
- •Format: Output format (mp3, wav, opus, aac, flac, pcm)
- •Constraints: Speed, pronunciation needs, emphasis points
If the user provides only text, proceed with defaults. Ask only when a critical detail is ambiguous and blocks success.
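One way to hold the collected inputs is a small record that applies the defaults and rejects out-of-range values early. This is a minimal sketch: the class name `SpeechRequest` and its field names are hypothetical, while the defaults, format list, and speed range are taken from this document.

```python
from dataclasses import dataclass

ALLOWED_FORMATS = {"mp3", "wav", "opus", "aac", "flac", "pcm"}


@dataclass
class SpeechRequest:
    """Collected inputs for one clip (hypothetical container, not part of the CLI)."""
    text: str                      # spoken verbatim, never paraphrased
    voice: str = "cedar"           # default voice; "marin" is brighter
    style: str = ""                # free-form delivery notes from the user
    response_format: str = "mp3"   # default output format
    speed: float = 1.0             # documented range: 0.25-4.0

    def __post_init__(self) -> None:
        if not self.text.strip():
            raise ValueError("text must not be empty")
        if self.response_format not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported format: {self.response_format}")
        if not 0.25 <= self.speed <= 4.0:
            raise ValueError("speed must be between 0.25 and 4.0")
```

If the user provides only text, `SpeechRequest("Welcome to the demo.")` carries the defaults through unchanged.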
Step 2: Check Environment
Verify prerequisites before making API calls:
- •OPENAI_API_KEY must be set. If missing, guide the user:
- •Create a key at https://platform.openai.com/api-keys
- •Set it as an environment variable
- •Never ask the user to paste the key in chat
- •openai package must be installed:
  uv pip install openai
  If uv is unavailable:
  python3 -m pip install openai
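The two prerequisite checks can be run without touching the network. The sketch below is illustrative: `missing_prerequisites` is a hypothetical helper, and it only reports whether the key is present, never its value.

```python
import importlib.util
import os


def missing_prerequisites(env=None) -> list[str]:
    """Return a list of human-readable problems; empty means ready to go."""
    env = os.environ if env is None else env
    problems = []
    # Check presence only; the key itself is never printed or logged.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    # find_spec returns None when the package cannot be imported.
    if importlib.util.find_spec("openai") is None:
        problems.append("openai package is not installed")
    return problems
```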
Step 3: Augment Instructions
Convert user direction into a structured voice spec. Only make implicit details explicit; do not invent new requirements.
Include only relevant lines from this template:
Voice Affect: <overall character and texture>
Tone: <attitude, formality, warmth>
Pacing: <slow, steady, brisk>
Emotion: <key emotions to convey>
Pronunciation: <words to enunciate or emphasize>
Pauses: <where to add intentional pauses>
Emphasis: <key words or phrases to stress>
Delivery: <cadence or rhythm notes>
Rules:
- •Keep it to 4-8 short lines; avoid conflicting guidance
- •Do not rewrite the input text
- •If the user says "narration for a demo", you may add implied constraints (clear pacing, friendly tone)
- •Do not introduce a persona, accent, or emotional style the user did not request
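As a rough illustration of the template and rules above, the helper below assembles an instruction string from only the fields that were actually supplied, in template order. The name `build_instructions` is hypothetical; the output is a single line so it can be passed directly to `--instructions`.

```python
TEMPLATE_ORDER = [
    "Voice Affect", "Tone", "Pacing", "Emotion",
    "Pronunciation", "Pauses", "Emphasis", "Delivery",
]


def build_instructions(spec: dict[str, str]) -> str:
    """Join the supplied template fields in order; skip anything left blank."""
    unknown = set(spec) - set(TEMPLATE_ORDER)
    if unknown:
        # Reject labels outside the template rather than inventing new ones.
        raise ValueError(f"unknown template fields: {sorted(unknown)}")
    parts = [
        f"{key}: {spec[key].strip()}"
        for key in TEMPLATE_ORDER
        if spec.get(key, "").strip()
    ]
    return " ".join(parts)
```

For example, `build_instructions({"Tone": "Friendly.", "Voice Affect": "Warm and composed."})` yields `Voice Affect: Warm and composed. Tone: Friendly.`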
Step 4: Run the CLI
Use the bundled CLI for all generation. Never create one-off scripts.
Single clip:
python scripts/text_to_speech.py speak \
  --input "Your text here" \
  --voice cedar \
  --instructions "Voice Affect: Warm and composed. Tone: Friendly." \
  --response-format mp3 \
  --out output.mp3
Batch (JSONL):
mkdir -p tmp/speech
# Write JSONL with one job per line
python scripts/text_to_speech.py speak-batch \
  --input tmp/speech/jobs.jsonl \
  --out-dir out \
  --rpm 50
# Delete the JSONL when done
rm -f tmp/speech/jobs.jsonl
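The "write JSONL" step could look like the sketch below. The job field names (`input`, `voice`, `response_format`) are an assumption made by analogy with the CLI flags and the dry-run output; the actual schema is defined by the bundled CLI, so confirm it in references/cli.md before relying on this. The helper names are hypothetical.

```python
import json
from pathlib import Path


def to_jsonl(jobs: list[dict]) -> str:
    """Serialize jobs as JSONL: one JSON object per line."""
    return "".join(json.dumps(job, ensure_ascii=False) + "\n" for job in jobs)


def write_jobs(jobs: list[dict], path: str = "tmp/speech/jobs.jsonl") -> Path:
    """Write the batch file, creating the parent directory if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(to_jsonl(jobs), encoding="utf-8")
    return target


# Hypothetical job fields; confirm the real schema in references/cli.md.
jobs = [
    {"input": "Thank you for calling. Please hold.",
     "voice": "cedar", "response_format": "wav"},
    {"input": "For sales, press 1. For support, press 2.",
     "voice": "cedar", "response_format": "wav"},
]
```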
Dry run (no API call, no key needed):
python scripts/text_to_speech.py speak --input "Test" --dry-run
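When the command is assembled programmatically, building an argument list avoids shell-quoting mistakes in the text and instructions. The sketch below uses only the flags shown in this document; `build_speak_command` is a hypothetical helper, and the bundled CLI still does all generation.

```python
import shlex


def build_speak_command(text, out=None, voice="cedar", instructions=None,
                        response_format="mp3", dry_run=False) -> list[str]:
    """Build the argv for a single-clip run of the bundled CLI."""
    cmd = [
        "python", "scripts/text_to_speech.py", "speak",
        "--input", text,
        "--voice", voice,
        "--response-format", response_format,
    ]
    if instructions:
        cmd += ["--instructions", instructions]
    if out:
        cmd += ["--out", out]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


# shlex.join renders a copy-pasteable shell line for the final report.
preview = shlex.join(build_speak_command("Hello world", dry_run=True))
```

The list can be passed to `subprocess.run(cmd, check=True)` as-is, and `preview` gives the exact CLI flags to report in Step 7.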
Step 5: Validate Output
For important clips, check:
- •Intelligibility and clarity
- •Pacing matches intent
- •Pronunciation of names, acronyms, and numbers
- •Adherence to user constraints
Step 6: Iterate if Needed
Make one targeted change per iteration:
- •Change voice, speed, or instructions (not all at once)
- •Repeat invariant constraints to reduce drift (e.g., "keep pacing steady")
- •Re-validate after each change
Step 7: Deliver
Save final outputs and report:
- •Output file path(s)
- •Final text, instructions, and CLI flags used
- •Voice and format used
Defaults and Rules
| Parameter | Default | Notes |
|---|---|---|
| Model | gpt-4o-mini-tts-2025-12-15 | Only change if user requests |
| Voice | cedar | Use marin for brighter tone |
| Format | mp3 | Use wav for video sync or IVR |
| Speed | 1.0 | Range: 0.25-4.0 |
| RPM cap | 50 | Maximum 50 requests/minute |
| Input limit | 4096 chars | Split longer text into chunks |
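For text over the 4096-character input limit, one possible splitting strategy is to pack whole sentences into chunks and hard-split only when a single sentence exceeds the limit. This is a sketch under that assumption, not the CLI's behavior; `split_text` is a hypothetical helper and the sentence regex is deliberately simple.

```python
import re

MAX_CHARS = 4096


def split_text(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters, on sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single over-long sentence is cut at the limit as a last resort.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk then becomes its own job, which makes long inputs a natural fit for batch mode.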
Hard rules:
- •Built-in voices only. Custom voices are out of scope.
- •instructions only work with GPT-4o mini TTS models, not tts-1 or tts-1-hd.
- •Always require OPENAI_API_KEY before live API calls.
- •Disclose to end users that the voice is AI-generated.
- •Use the OpenAI Python SDK; do not use raw HTTP.
- •Never modify scripts/text_to_speech.py. If something is missing, ask the user first.
- •Prefer the bundled CLI over writing new one-off scripts.
Use-Case Quick Reference
| Use Case | Voice | Format | Speed | Reference |
|---|---|---|---|---|
| Narration / explainer | cedar | mp3 | 1.0 | references/use-case-defaults.md |
| Product demo / voiceover | cedar or marin | wav / mp3 | 1.0 | references/use-case-defaults.md |
| IVR / phone prompts | cedar or marin | wav | 0.9-1.0 | references/use-case-defaults.md |
| Accessibility reads | cedar | mp3 / wav | 0.95-1.0 | references/use-case-defaults.md |
Verification
Success Indicators
- •Audio file exists at the specified output path
- •File size is non-zero and format matches the requested format
- •Playback sounds intelligible with correct pacing and voice
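The first two indicators can be checked mechanically; intelligibility still needs a listen. The sketch below assumes the file extension reflects the requested format, which is a simplification (it does not inspect the audio data). `check_output` is a hypothetical helper.

```python
from pathlib import Path


def check_output(path: str, expected_format: str) -> list[str]:
    """Return a list of problems with the output file; empty means it looks fine."""
    target = Path(path)
    if not target.is_file():
        return [f"{target} does not exist"]
    problems = []
    if target.stat().st_size == 0:
        problems.append(f"{target} is empty")
    # Extension check only; the audio container itself is not parsed.
    if target.suffix.lower().lstrip(".") != expected_format.lower():
        problems.append(f"{target} does not have a .{expected_format} extension")
    return problems
```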
Failure Handling
| Error | Cause | Resolution |
|---|---|---|
| OPENAI_API_KEY is not set | Missing API key | Guide user to set the environment variable |
| Input text exceeds 4096 characters | Text too long | Split into smaller chunks |
| openai SDK not installed | Missing dependency | Run uv pip install openai |
| Output already exists | File conflict | Use --force to overwrite |
| Rate limit (429) | Too many requests | CLI retries automatically with backoff |
| Network error | No connectivity | Check network; see references/codex-network.md |
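To illustrate what "retries with backoff" generally means for the rate-limit row, here is a generic exponential backoff sketch. It is not the CLI's actual retry policy: the base delay, cap, retry count, and helper names are all assumptions chosen for the example.

```python
import time


def backoff_delays(retries: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Delays that double each attempt, capped at `cap` seconds."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


def call_with_retry(fn, is_retryable, retries: int = 4, sleep=time.sleep):
    """Call fn(); on a retryable error, wait and try again, up to `retries` waits."""
    for delay in backoff_delays(retries):
        try:
            return fn()
        except Exception as exc:
            if not is_retryable(exc):
                raise
            sleep(delay)
    # Final attempt: any error now propagates to the caller.
    return fn()
```

Production implementations often add random jitter to the delays so that parallel jobs do not retry in lockstep.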
Examples
Example 1: Single Narration Clip
User: Generate a narration for "Welcome to the demo. Today we'll show how it works."
Assistant: I'll generate a narration clip with warm, friendly delivery.
Running:
python scripts/text_to_speech.py speak \
  --input "Welcome to the demo. Today we'll show how it works." \
  --voice cedar \
  --instructions "Voice Affect: Warm and composed. Tone: Friendly and confident. Pacing: Steady and moderate. Emphasis: Stress 'demo' and 'show'." \
  --response-format mp3 \
  --out demo-narration.mp3
Done. Wrote demo-narration.mp3
Voice: cedar | Format: mp3 | Speed: 1.0
Example 2: Batch IVR Prompts
User: Generate IVR prompts for our phone system: "Thank you for calling. Please hold." and "For sales, press 1. For support, press 2."
Assistant: I'll create both prompts as a batch job with IVR-appropriate settings.
Writing JSONL to tmp/speech/jobs.jsonl and running batch:
python scripts/text_to_speech.py speak-batch \
  --input tmp/speech/jobs.jsonl \
  --out-dir out \
  --rpm 50
Done. Generated 2 files in out/
Cleaning up tmp/speech/jobs.jsonl
Example 3: Dry Run
User: Preview what the CLI would do for "Hello world" without calling the API.
Assistant: Running dry-run (no API call, no key needed):
python scripts/text_to_speech.py speak --input "Hello world" --dry-run
Output:
{
"input": "Hello world",
"model": "gpt-4o-mini-tts-2025-12-15",
"response_format": "mp3",
"voice": "cedar"
}
Would write speech.mp3
Safety
Idempotency
This automation is idempotent with --force: re-running with the same inputs produces an equivalent output file. Without --force, it refuses to overwrite existing files.
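The overwrite rule described above amounts to a simple guard before writing. The sketch below is an illustration of that rule rather than the CLI's own code; `resolve_output` is a hypothetical name.

```python
from pathlib import Path


def resolve_output(path: str, force: bool = False) -> Path:
    """Return the output path, refusing to clobber an existing file unless forced."""
    out = Path(path)
    if out.exists() and not force:
        raise FileExistsError(f"Output already exists: {out} (use --force to overwrite)")
    return out
```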
Reversibility
Generated audio files can be deleted. No destructive changes are made to existing files. The CLI never modifies source files.
Prerequisites
Before running, ensure:
- •OPENAI_API_KEY is set as an environment variable
- •openai Python package is installed
- •Network access is available for API calls (see references/codex-network.md for sandbox environments)
File and Directory Conventions
- •Temporary files: tmp/speech/ (delete JSONL batches after use)
- •Final outputs: output/speech/ or user-specified path via --out / --out-dir
- •Keep filenames stable and descriptive
Reference Map
- •references/audio-api.md: API endpoint, parameters, voice list, format options
- •references/cli.md: CLI commands, flags, recipes, guardrails
- •references/voice-directions.md: Instruction template, best practices, example direction blocks
- •references/use-case-defaults.md: Per-use-case defaults for narration, voiceover, IVR, accessibility
- •references/codex-network.md: Sandbox and network approval troubleshooting