TTS Generation Skill

Converts panel dialogue into speech audio using Gemini TTS (gemini-2.5-flash-preview-tts).

Key Features

•Multi-speaker support: Up to 2 distinct voices per conversation
•Consistent character voices: Same voice for same character across all clips
•Duration tracking: Returns exact audio duration for Veo sync
•Expressive speech: Control tone through natural language prompts

When to Use

•After manga generation, to voice the dialogue for each panel
•Before Veo video generation (need audio duration first)
•When creating animated stories with character conversations

Audio Format

Parameter	Value
Sample Rate	24000 Hz
Channels	1 (mono)
Sample Width	2 bytes (16-bit)
Format	WAV (PCM)

Available Voices

Voice	Style	Good For
Kore	Firm	Confident characters
Puck	Upbeat	Cheerful characters
Charon	Informative	Narration
Fenrir	Excitable	Energetic pets
Aoede	Breezy	Calm characters
Zephyr	Bright	Young characters
Leda	Youthful	Kids, cute pets
Sulafat	Warm	Friendly characters

Inputs

Input	Type	Required	Default	Description
`dialogue`	str	Yes	—	The dialogue text to voice
`character_name`	str	Yes	—	Character speaking (for multi-speaker)
`voice_name`	str	No	Auto-select	Specific voice to use
`emotion`	str	No	"neutral"	Emotion hint (cheerful, sad, excited)

Multi-Speaker Input

For conversations between 2 characters:

python

dialogues = [
    {"character": "Mochi", "text": "Hi there!", "emotion": "cheerful"},
    {"character": "Hero", "text": "Hey Mochi!", "emotion": "excited"},
]

Outputs

Output	Type	Description
`audio_path`	Path	Path to generated WAV file
`duration_seconds`	float	Exact audio duration
`voice_used`	str	Voice name that was used

Implementation Contract

python

@dataclass
class TTSResult:
    audio_path: Path
    duration_seconds: float
    voice_used: str
    character_name: str

class TTSGenerator:
    async def generate_dialogue(
        self,
        dialogue: str,
        character_name: str,
        voice_name: str = None,
        emotion: str = "neutral",
    ) -> TTSResult:
        """
        Generate speech for a single character's dialogue.

        Args:
            dialogue: Text to speak
            character_name: Who is speaking
            voice_name: Specific voice (or auto-select)
            emotion: Emotional tone hint

        Returns:
            TTSResult with audio path and duration
        """
        ...

    async def generate_conversation(
        self,
        dialogues: list[dict],
        voice_mapping: dict[str, str] = None,
    ) -> list[TTSResult]:
        """
        Generate speech for a multi-character conversation.

        Maintains voice consistency across the conversation.
        Max 2 speakers per call (Gemini TTS limit).

        Args:
            dialogues: List of {character, text, emotion} dicts
            voice_mapping: Optional {character_name: voice_name} mapping

        Returns:
            List of TTSResult, one per dialogue line
        """
        ...

    async def generate_panel_audio(
        self,
        panel_dialogues: list[str],
        character_names: list[str],
        voice_mapping: dict[str, str] = None,
    ) -> tuple[Path, float]:
        """
        Generate combined audio for a single manga panel.

        Multiple characters speaking in one panel get combined
        into a single audio file.

        Returns:
            (combined_audio_path, total_duration_seconds)
        """
        ...

Example Usage

python

from skills.generate_tts import TTSGenerator

tts = TTSGenerator()

# Single dialogue
result = await tts.generate_dialogue(
    dialogue="Let's go on an adventure!",
    character_name="Mochi",
    emotion="excited"
)
print(f"Audio: {result.audio_path}, Duration: {result.duration_seconds}s")

# Multi-character conversation
results = await tts.generate_conversation(
    dialogues=[
        {"character": "Mochi", "text": "What's that?", "emotion": "curious"},
        {"character": "Hero", "text": "It's a treasure map!", "emotion": "excited"},
    ],
    voice_mapping={
        "Mochi": "Leda",   # Youthful voice for pet
        "Hero": "Kore",    # Firm voice for protagonist
    }
)

Voice Selection Strategy

When no voice is specified, auto-select based on character analysis:

Character Type	Suggested Voice	Why
Pet (cat/dog)	Leda / Fenrir	Youthful, excitable
Child	Zephyr / Leda	Bright, youthful
Adult male	Kore / Charon	Firm, informative
Adult female	Aoede / Sulafat	Breezy, warm

Integration with Video Pipeline

code

Panel Dialogue → TTS → duration_seconds → Veo (duration param)
                  ↓
             audio_path → FFmpeg (combine with video)

The duration from TTS drives Veo video length to ensure sync.

Error Handling

Error	Cause	Recovery
`ValueError`	Empty dialogue	Skip or use placeholder
`APIError`	TTS API failure	Retry with backoff
`VoiceError`	Unknown voice name	Fall back to default