happy-audio-gen
Turns text into speech across six providers through one CLI. All providers are synchronous (TTS is fast, typically under 10 seconds) except Bailian's voice-design flow, which is still supported but uses a longer poll window.
Quick usage
    # Shortest path: OpenAI default voice
    bun scripts/main.ts --text "Hello, world" --out ./hello.mp3

    # Chinese, MiniMax
    bun scripts/main.ts --provider minimax --text "大家好" --voice male-qn-qingse --out ./hello.mp3

    # Long-form, Bailian (auto-splits by sentence)
    bun scripts/main.ts --provider bailian --textfiles ./script.md --out ./narration.mp3
When to invoke this skill
- User asks to synthesize speech / TTS / read aloud / narrate / dub / make a voice-over.
- User asks to convert script / text / article into audio.
- User names a TTS voice or model.
Do not route here when the user wants to transcribe audio → text (that's STT, different domain), or edit / mix audio files (use a dedicated audio editor).
Step 0: Preflight (BLOCKING)
- Locate EXTEND.md:
  - ./.happy-skills/happy-audio-gen/EXTEND.md
  - $XDG_CONFIG_HOME/happy-skills/happy-audio-gen/EXTEND.md
  - ~/.happy-skills/happy-audio-gen/EXTEND.md

  If none found, run bun scripts/main.ts --setup and walk the user through references/config/first-time-setup.md.
- Verify at least one provider has credentials (env var or 1Password reference).
- Verify Bun is available. Fallback: npx -y bun.
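The EXTEND.md lookup above can be sketched as a small pure function. This is only an illustration, not the skill's actual implementation: the names `Locations`, `extendCandidates`, and `findExtendMd` are hypothetical, and two behaviors are assumptions (first match wins in the listed order, and the `$XDG_CONFIG_HOME` candidate is skipped when that variable is unset).

```typescript
// Hypothetical sketch of the EXTEND.md lookup; not the skill's real code.
interface Locations {
  cwd: string;            // project directory (the "." in the first path)
  home: string;           // user home directory (the "~" in the last path)
  xdgConfigHome?: string; // value of $XDG_CONFIG_HOME, if set
}

const SUFFIX = "happy-skills/happy-audio-gen/EXTEND.md";

// Build the candidate paths in the order they are listed in Step 0.
function extendCandidates(loc: Locations): string[] {
  const paths = [`${loc.cwd}/.${SUFFIX}`];
  // Assumption: when $XDG_CONFIG_HOME is unset, this candidate is skipped.
  if (loc.xdgConfigHome) paths.push(`${loc.xdgConfigHome}/${SUFFIX}`);
  paths.push(`${loc.home}/.${SUFFIX}`);
  return paths;
}

// Return the first candidate that exists, or null (meaning: run --setup).
function findExtendMd(
  loc: Locations,
  exists: (path: string) => boolean,
): string | null {
  return extendCandidates(loc).find(exists) ?? null;
}
```

Under Bun or Node, the inputs would typically come from `process.cwd()`, `os.homedir()`, `process.env.XDG_CONFIG_HOME`, and `fs.existsSync`; they are injected here so the lookup order can be checked without touching the file system.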
Step 1: Choose provider
Preference order:
- --provider <id>
- EXTEND.md default_provider
- Auto-detect env vars: openai > elevenlabs > bailian > minimax > siliconflow > playht
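The preference order above amounts to a three-level fallback. A minimal sketch follows, assuming a hypothetical `hasCredentials` callback that stands in for the env-var check (the actual variable names live in references/providers.md and are not reproduced here); `resolveProvider` and `ResolveInput` are illustrative names as well.

```typescript
type Provider =
  | "openai" | "elevenlabs" | "bailian"
  | "minimax" | "siliconflow" | "playht";

// Auto-detect order, copied from the list above.
const AUTO_DETECT_ORDER: Provider[] = [
  "openai", "elevenlabs", "bailian", "minimax", "siliconflow", "playht",
];

interface ResolveInput {
  cliProvider?: Provider;     // value of --provider, if given
  defaultProvider?: Provider; // default_provider from EXTEND.md, if set
  // Hypothetical: true when credentials for the provider are present.
  hasCredentials: (p: Provider) => boolean;
}

// Explicit flag first, then the EXTEND.md default, then auto-detection.
function resolveProvider(input: ResolveInput): Provider | null {
  if (input.cliProvider) return input.cliProvider;
  if (input.defaultProvider) return input.defaultProvider;
  return AUTO_DETECT_ORDER.find(input.hasCredentials) ?? null;
}
```

A `null` result corresponds to the preflight failure case where no provider has credentials.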
Pick by language / voice intent:
- English, natural + fast → openai (gpt-4o-mini-tts / tts-1).
- Multilingual, voice cloning → elevenlabs.
- Chinese, long-form → bailian (qwen-tts auto-chunks long scripts) or minimax.
- Chinese dialect / voice design → bailian (voice-design with qwen3-tts-vd) or siliconflow (CosyVoice2).
- Ultra-realistic, short-form → playht (2.0).
Step 2: Fill parameters
- --text or --textfiles: input. Always quote.
- --out <path>: REQUIRED. Extension determines format (.mp3 / .wav / .ogg / .flac).
- --voice <id>: provider-specific. See references/voices.md for the short list of well-known voices.
- --rate 0.5..2.0: speaking rate.
- --instruction "...": voice direction (only openai gpt-4o-mini-tts and siliconflow honor this).
- --language <code>: en, zh, ja (only a few providers honor this explicitly).
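Two of the parameters above have constraints that are easy to check before invoking the CLI: the output extension and the rate range. The sketch below is illustrative only; `formatFromOut` and `checkRate` are hypothetical helpers, and treating the extension case-insensitively is an assumption.

```typescript
type AudioFormat = "mp3" | "wav" | "ogg" | "flac";
const FORMATS: AudioFormat[] = ["mp3", "wav", "ogg", "flac"];

// Derive the output format from the --out path's extension.
// Assumption: the extension is matched case-insensitively.
function formatFromOut(outPath: string): AudioFormat {
  const dot = outPath.lastIndexOf(".");
  const ext = dot >= 0 ? outPath.slice(dot + 1).toLowerCase() : "";
  const match = FORMATS.find((f) => f === ext);
  if (!match) {
    throw new Error(`--out must end in .mp3, .wav, .ogg, or .flac (got "${outPath}")`);
  }
  return match;
}

// Reject --rate values outside the documented 0.5..2.0 range.
function checkRate(rate: number): number {
  if (!Number.isFinite(rate) || rate < 0.5 || rate > 2.0) {
    throw new Error(`--rate must be between 0.5 and 2.0 (got ${rate})`);
  }
  return rate;
}
```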
Step 3: Run
    bun scripts/main.ts \
      --provider openai \
      --model gpt-4o-mini-tts \
      --voice alloy \
      --text "..." \
      --out ./out.mp3
JSON mode:
    {
      "success": true,
      "provider": "openai",
      "model": "gpt-4o-mini-tts",
      "voice": "alloy",
      "output": "/abs/out.mp3",
      "size_bytes": 76032,
      "format": "mp3"
    }
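A caller consuming the JSON output could type it as follows. The field names are taken from the sample above; the `GenResult` and `parseResult` names are hypothetical, and the shape of a failure payload is not documented here, so this sketch simply rejects anything that does not report `success: true` with an `output` path.

```typescript
// Field names mirror the sample JSON output.
interface GenResult {
  success: boolean;
  provider: string;
  model: string;
  voice: string;
  output: string;     // absolute path of the written audio file
  size_bytes: number;
  format: string;
}

// Parse the CLI's JSON output and fail loudly on anything but success.
// Assumption: failures either set success to false or omit `output`.
function parseResult(stdout: string): GenResult {
  const data = JSON.parse(stdout) as Partial<GenResult>;
  if (data.success !== true || typeof data.output !== "string") {
    throw new Error("happy-audio-gen did not report success");
  }
  return data as GenResult;
}
```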
Step 4: Long text handling
- happy-audio-gen automatically splits long input for providers that cap per-call length (Bailian ≤ 200 Chinese chars per call). Chunks are concatenated byte-for-byte on output.
- For best fidelity with concatenated MP3s, stitch the segments with ffmpeg afterward rather than relying on byte concat.
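To make the splitting idea concrete, here is one way sentence-based chunking under a per-call cap could work. It is a sketch under stated assumptions, not the tool's actual splitter: `chunkBySentence` is a hypothetical name, sentences are assumed to end at `。！？.!?`, length is counted in Unicode code points, and a single sentence longer than the cap is hard-split.

```typescript
// Hypothetical splitter: packs whole sentences into chunks of at most
// `maxChars` code points, hard-splitting only oversized sentences.
function chunkBySentence(text: string, maxChars = 200): string[] {
  // Each match is a sentence with its terminal punctuation, or the
  // unterminated remainder at the end of the text.
  const sentences =
    text.match(/[^。！？.!?]*[。！？.!?]+|[^。！？.!?]+$/g) ?? [];
  const chunks: string[] = [];
  let current: string[] = []; // code points of the chunk being built

  const flush = () => {
    if (current.length > 0) chunks.push(current.join(""));
    current = [];
  };

  for (const sentence of sentences) {
    let points = Array.from(sentence);
    // Start a new chunk if this sentence would overflow the current one.
    if (current.length + points.length > maxChars) flush();
    // `current` is empty whenever this loop runs: hard-split the sentence.
    while (points.length > maxChars) {
      chunks.push(points.slice(0, maxChars).join(""));
      points = points.slice(maxChars);
    }
    current.push(...points);
  }
  flush();
  return chunks;
}
```

Because no characters are dropped, joining the chunks reproduces the input text. The sketch treats every `.` as a sentence end, so decimals and abbreviations may be split earlier than a reader would expect.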
Step 5: Errors
- [openai] OpenAI TTS 400 with invalid voice → the voice name is not supported by the model. Use one of alloy, ash, coral, echo, fable, onyx, nova, sage, shimmer.
- [minimax] ... 2049 invalid api key → try MINIMAX_BASE_URL=https://api.minimaxi.com/v1 (different region).
- [bailian] ... 400 DataInspectionFailed → Aliyun content filter. Surface to the user.
- [elevenlabs] 401 → key invalid or subscription expired.
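The error table above can be turned into a small lookup that maps a CLI error message to a suggested fix. The exact message text is elided in the list, so the patterns below are deliberately loose guesses; `HINTS` and `hintFor` are hypothetical names.

```typescript
interface Hint {
  match: RegExp; // loose pattern; the full message format is an assumption
  hint: string;
}

const HINTS: Hint[] = [
  {
    match: /\[openai\].*400.*invalid voice/i,
    hint: "Voice not supported by the model. Use one of alloy, ash, coral, echo, fable, onyx, nova, sage, shimmer.",
  },
  {
    match: /\[minimax\].*2049/,
    hint: "Try MINIMAX_BASE_URL=https://api.minimaxi.com/v1 (different region).",
  },
  {
    match: /\[bailian\].*DataInspectionFailed/,
    hint: "Aliyun content filter. Surface to the user.",
  },
  {
    match: /\[elevenlabs\].*401/,
    hint: "Key invalid or subscription expired.",
  },
];

// Return the first matching hint, or null for unrecognized errors.
function hintFor(message: string): string | null {
  return HINTS.find((h) => h.match.test(message))?.hint ?? null;
}
```

Unrecognized messages return `null`, in which case references/error_codes.md is the next place to look.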
References
- references/providers.md: per-provider env vars, default models, voice lists.
- references/voices.md: curated voices for each provider.
- references/error_codes.md: common errors and fixes.
- references/config/first-time-setup.md
- references/config/extend-schema.md
- assets/EXTEND.template.md