
TTS Generation

Converts dialogue into speech using Gemini TTS. Supports multi-speaker conversations, with a consistent voice for each character.

SKILL.md
--- frontmatter
name: TTS Generation
description: Generates speech audio from dialogue using Gemini TTS. Supports multi-speaker conversations with consistent voices per character.
triggers:
  - Story panels have dialogue that needs to be voiced
  - Agent needs audio for video generation
keywords:
  - text to speech
  - voice
  - dialogue
  - audio

TTS Generation Skill

Converts panel dialogue into speech audio using Gemini TTS (gemini-2.5-flash-preview-tts).

Key Features

  • Multi-speaker support: Up to 2 distinct voices per conversation
  • Consistent character voices: Same voice for same character across all clips
  • Duration tracking: Returns exact audio duration for Veo sync
  • Expressive speech: Control tone through natural language prompts

When to Use

  • After manga generation, to voice the dialogue for each panel
  • Before Veo video generation, because the audio duration is needed first
  • When creating animated stories with character conversations

Audio Format

| Parameter | Value |
|---|---|
| Sample Rate | 24000 Hz |
| Channels | 1 (mono) |
| Sample Width | 2 bytes (16-bit) |
| Format | WAV (PCM) |
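
With these parameters, the clip duration follows directly from the raw PCM byte count. The following is a minimal sketch using Python's standard `wave` module; the helper names `pcm_duration_seconds` and `write_wav` are illustrative and not part of the skill.

```python
import wave
from pathlib import Path

SAMPLE_RATE = 24000   # Hz
CHANNELS = 1          # mono
SAMPLE_WIDTH = 2      # bytes per sample (16-bit)


def pcm_duration_seconds(pcm: bytes) -> float:
    """Duration of raw PCM data in the format listed above."""
    bytes_per_second = SAMPLE_RATE * CHANNELS * SAMPLE_WIDTH
    return len(pcm) / bytes_per_second


def write_wav(path: Path, pcm: bytes) -> float:
    """Wrap raw PCM bytes in a WAV container and return the duration."""
    with wave.open(str(path), "wb") as wf:
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(SAMPLE_WIDTH)
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(pcm)
    return pcm_duration_seconds(pcm)
```

One second of audio in this format is 48000 bytes (24000 samples × 2 bytes × 1 channel).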

Available Voices

| Voice | Style | Good For |
|---|---|---|
| Kore | Firm | Confident characters |
| Puck | Upbeat | Cheerful characters |
| Charon | Informative | Narration |
| Fenrir | Excitable | Energetic pets |
| Aoede | Breezy | Calm characters |
| Zephyr | Bright | Young characters |
| Leda | Youthful | Kids, cute pets |
| Sulafat | Warm | Friendly characters |

Inputs

| Input | Type | Required | Default | Description |
|---|---|---|---|---|
| dialogue | str | Yes | | The dialogue text to voice |
| character_name | str | Yes | | Character speaking (for multi-speaker) |
| voice_name | str | No | Auto-select | Specific voice to use |
| emotion | str | No | "neutral" | Emotion hint (cheerful, sad, excited) |
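
Because tone is controlled through natural-language prompts, the `emotion` hint has to be folded into the text sent to the model. The sketch below shows one possible way to do that; `build_tts_prompt` and the exact instruction wording are assumptions for illustration, not the skill's actual implementation.

```python
def build_tts_prompt(dialogue: str, emotion: str = "neutral") -> str:
    """Fold the emotion hint into a natural-language instruction (hypothetical helper)."""
    text = dialogue.strip()
    if not text:
        # Mirrors the ValueError listed for empty dialogue under Error Handling.
        raise ValueError("dialogue must not be empty")
    if emotion == "neutral":
        # No style instruction needed; send the line as-is.
        return text
    return f"Say the following in a tone that is {emotion}: {text}"
```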

Multi-Speaker Input

For conversations between 2 characters:

python
dialogues = [
    {"character": "Mochi", "text": "Hi there!", "emotion": "cheerful"},
    {"character": "Hero", "text": "Hey Mochi!", "emotion": "excited"},
]
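
Before sending such a list to the API, it helps to check the two-speaker limit and flatten the lines into a single transcript. The sketch below assumes a `Name: text` transcript layout; `speakers_in` and `to_transcript` are hypothetical helpers.

```python
def speakers_in(dialogues: list[dict]) -> list[str]:
    """Distinct character names, in order of first appearance."""
    seen: list[str] = []
    for line in dialogues:
        if line["character"] not in seen:
            seen.append(line["character"])
    return seen


def to_transcript(dialogues: list[dict], max_speakers: int = 2) -> str:
    """Render dialogue dicts as 'Name: text' lines, enforcing the speaker limit."""
    speakers = speakers_in(dialogues)
    if len(speakers) > max_speakers:
        raise ValueError(
            f"{len(speakers)} speakers given, at most {max_speakers} supported per call"
        )
    return "\n".join(f'{d["character"]}: {d["text"]}' for d in dialogues)
```

A conversation with more than two characters would need to be split into several calls, each covering at most two speakers.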

Outputs

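| Output | Type | Description |
|---|---|---|
| audio_path | Path | Path to generated WAV file |
| duration_seconds | float | Exact audio duration |
| voice_used | str | Voice name that was used |
| character_name | str | Character the audio belongs to |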

Implementation Contract

python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class TTSResult:
    audio_path: Path
    duration_seconds: float
    voice_used: str
    character_name: str

class TTSGenerator:
    async def generate_dialogue(
        self,
        dialogue: str,
        character_name: str,
        voice_name: str | None = None,
        emotion: str = "neutral",
    ) -> TTSResult:
        """
        Generate speech for a single character's dialogue.

        Args:
            dialogue: Text to speak
            character_name: Who is speaking
            voice_name: Specific voice (or auto-select)
            emotion: Emotional tone hint

        Returns:
            TTSResult with audio path and duration
        """
        ...

    async def generate_conversation(
        self,
        dialogues: list[dict],
        voice_mapping: dict[str, str] | None = None,
    ) -> list[TTSResult]:
        """
        Generate speech for a multi-character conversation.

        Maintains voice consistency across the conversation.
        Max 2 speakers per call (Gemini TTS limit).

        Args:
            dialogues: List of {character, text, emotion} dicts
            voice_mapping: Optional {character_name: voice_name} mapping

        Returns:
            List of TTSResult, one per dialogue line
        """
        ...

    async def generate_panel_audio(
        self,
        panel_dialogues: list[str],
        character_names: list[str],
        voice_mapping: dict[str, str] | None = None,
    ) -> tuple[Path, float]:
        """
        Generate combined audio for a single manga panel.

        Multiple characters speaking in one panel get combined
        into a single audio file.

        Returns:
            (combined_audio_path, total_duration_seconds)
        """
        ...
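
The contract leaves open how `generate_panel_audio` merges clips into one file. One possible approach, sketched here with the standard `wave` module, is to concatenate the per-line WAV clips in order; `concat_wavs` is a hypothetical helper and assumes all clips share the same audio format.

```python
import wave
from pathlib import Path


def concat_wavs(inputs: list[Path], output: Path) -> float:
    """Concatenate WAV clips that share one format; return total seconds."""
    if not inputs:
        raise ValueError("no clips to combine")
    params = None
    frames = []
    for path in inputs:
        with wave.open(str(path), "rb") as wf:
            fmt = (wf.getnchannels(), wf.getsampwidth(), wf.getframerate())
            if params is None:
                params = fmt
            elif fmt != params:
                raise ValueError(f"{path} has a different audio format")
            frames.append(wf.readframes(wf.getnframes()))
    channels, width, rate = params
    data = b"".join(frames)
    with wave.open(str(output), "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(width)
        wf.setframerate(rate)
        wf.writeframes(data)
    # Duration follows from the byte count and the shared format.
    return len(data) / (channels * width * rate)
```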

Example Usage

python
import asyncio

from skills.generate_tts import TTSGenerator


async def main() -> None:
    tts = TTSGenerator()

    # Single dialogue
    result = await tts.generate_dialogue(
        dialogue="Let's go on an adventure!",
        character_name="Mochi",
        emotion="excited",
    )
    print(f"Audio: {result.audio_path}, Duration: {result.duration_seconds}s")

    # Multi-character conversation
    results = await tts.generate_conversation(
        dialogues=[
            {"character": "Mochi", "text": "What's that?", "emotion": "curious"},
            {"character": "Hero", "text": "It's a treasure map!", "emotion": "excited"},
        ],
        voice_mapping={
            "Mochi": "Leda",   # Youthful voice for pet
            "Hero": "Kore",    # Firm voice for protagonist
        },
    )
    # One TTSResult per dialogue line
    for r in results:
        print(f"{r.character_name} ({r.voice_used}): {r.duration_seconds}s")


# The generator methods are coroutines, so run them inside an event loop
asyncio.run(main())

Voice Selection Strategy

When no voice is specified, auto-select based on character analysis:

| Character Type | Suggested Voice | Why |
|---|---|---|
| Pet (cat/dog) | Leda / Fenrir | Youthful, excitable |
| Child | Zephyr / Leda | Bright, youthful |
| Adult male | Kore / Charon | Firm, informative |
| Adult female | Aoede / Sulafat | Breezy, warm |
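
The table can be turned into a small lookup that also keeps each character's voice stable across clips. This is a sketch only: the `VoiceCasting` class, the character-type keys, and the choice of `Kore` as the default are assumptions, since the source does not specify how character analysis or the default voice works.

```python
SUGGESTED_VOICES = {
    "pet": ["Leda", "Fenrir"],
    "child": ["Zephyr", "Leda"],
    "adult_male": ["Kore", "Charon"],
    "adult_female": ["Aoede", "Sulafat"],
}
DEFAULT_VOICE = "Kore"  # assumption: the source does not name a default voice


class VoiceCasting:
    """Assigns each character one voice and reuses it on later calls."""

    def __init__(self) -> None:
        self._assigned: dict[str, str] = {}

    def voice_for(self, character_name: str, character_type: str) -> str:
        # A character that already has a voice always keeps it.
        if character_name in self._assigned:
            return self._assigned[character_name]
        candidates = SUGGESTED_VOICES.get(character_type, [DEFAULT_VOICE])
        taken = set(self._assigned.values())
        # Prefer a candidate no other character uses yet.
        voice = next((v for v in candidates if v not in taken), candidates[0])
        self._assigned[character_name] = voice
        return voice
```

Keeping the assignment in one object means a second pet in the same story gets `Fenrir` rather than sharing `Leda` with the first.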

Integration with Video Pipeline

code
Panel Dialogue → TTS → duration_seconds → Veo (duration param)
                  ↓
             audio_path → FFmpeg (combine with video)

The duration from TTS drives Veo video length to ensure sync.
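
For the final FFmpeg step, the command could be assembled as below. This is a hedged sketch: `build_mux_command` is a hypothetical helper, and the choice of AAC audio with a copied video stream is an assumption about the desired output, not something the source specifies.

```python
from pathlib import Path


def build_mux_command(video: Path, audio: Path, output: Path) -> list[str]:
    """Argument list for muxing a video clip with its TTS audio via ffmpeg."""
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it exists
        "-i", str(video),
        "-i", str(audio),
        "-map", "0:v:0",         # video stream from the first input
        "-map", "1:a:0",         # audio stream from the second input
        "-c:v", "copy",          # keep the video stream as-is
        "-c:a", "aac",           # encode the WAV audio for an MP4 container
        "-shortest",             # stop at the shorter of the two streams
        str(output),
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine with `ffmpeg` installed.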

Error Handling

| Error | Cause | Recovery |
|---|---|---|
| ValueError | Empty dialogue | Skip or use placeholder |
| APIError | TTS API failure | Retry with backoff |
| VoiceError | Unknown voice name | Fall back to default |
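
Two of these recoveries can be sketched in a few lines. The `APIError` class here is a local placeholder for whatever exception the TTS client raises, and `with_retry`, `resolve_voice`, and the `Kore` default are illustrative assumptions rather than the skill's actual code.

```python
import asyncio
from typing import Optional

KNOWN_VOICES = {
    "Kore", "Puck", "Charon", "Fenrir", "Aoede", "Zephyr", "Leda", "Sulafat",
}


class APIError(Exception):
    """Placeholder for the TTS API failure named in the table."""


async def with_retry(call, attempts: int = 3, base_delay: float = 1.0):
    """Await call(), retrying on APIError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return await call()
        except APIError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Delays grow as base_delay * 1, 2, 4, ...
            await asyncio.sleep(base_delay * 2 ** attempt)


def resolve_voice(voice_name: Optional[str], default: str = "Kore") -> str:
    """Fall back to the default when the voice name is missing or unknown."""
    return voice_name if voice_name in KNOWN_VOICES else default
```

`call` is any zero-argument function returning an awaitable, for example `lambda: tts.generate_dialogue(...)`.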