AgentSkillsCN

gemini-live-api

使用Google的Gemini Live API构建实时语音和视频应用。在实现双向音频/视频流、语音助手、带有中断处理的对话式AI,或任何需要与Gemini模型低延迟多模态交互的应用时使用。涵盖WebSocket流、语音活动检测(VAD)、对话期间函数调用、会话管理/恢复,以及用于安全客户端连接的临时令牌。

SKILL.md
--- frontmatter
name: gemini-live-api
description: Build real-time voice and video applications with Google's Gemini Live API. Use when implementing bidirectional audio/video streaming, voice assistants, conversational AI with interruption handling, or any application requiring low-latency multimodal interaction with Gemini models. Covers WebSocket streaming, voice activity detection (VAD), function calling during conversations, session management/resumption, and ephemeral tokens for secure client-side connections.

Gemini Live API

Real-time bidirectional streaming API for voice/video conversations with Gemini.

Quick Start

python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")
config = types.LiveConnectConfig(response_modalities=["AUDIO"])

async with client.aio.live.connect(
    model="gemini-2.5-flash-preview-native-audio-dialog",
    config=config
) as session:
    # Send audio
    await session.send_realtime_input(
        audio=types.Blob(data=audio_bytes, mime_type="audio/pcm;rate=16000")
    )
    # Receive responses
    async for response in session.receive():
        if response.data:
            play_audio(response.data)

Core Patterns

Audio Chat (Mic + Speaker)

Use scripts/audio_chat.py for complete microphone-to-speaker implementation with PyAudio.

Text Chat via Live API

Use scripts/text_chat.py for text-based streaming conversations.

Function Calling

Use scripts/function_calling.py for tool integration:

python
config = types.LiveConnectConfig(
    response_modalities=["TEXT"],
    tools=[{
        "function_declarations": [{
            "name": "get_weather",
            "description": "Get weather for location",
            "parameters": {"type": "object", "properties": {"location": {"type": "string"}}}
        }]
    }]
)
# Handle tool_call in response, send result via session.send_tool_response()

Ephemeral Tokens (Client-Side Auth)

Use scripts/generate_token.py for secure browser/mobile connections:

python
token = client.auth_tokens.create(config={
    "uses": 1,
    "expire_time": now + timedelta(minutes=30),
    "new_session_expire_time": now + timedelta(minutes=1)
})
# Client uses token.name as API key

Key Configuration

SettingOptions
response_modalities["AUDIO"] or ["TEXT"] (not both)
Audio input16-bit PCM, 16kHz, mono
Audio output24kHz
Session limit15 min audio-only, 2 min with video

Voice Selection

python
speech_config=types.SpeechConfig(
    voice_config=types.VoiceConfig(
        prebuilt_voice_config=types.PrebuiltVoiceConfig(
            voice_name="Puck"  # Aoede, Charon, Fenrir, Kore, Puck
        )
    )
)

Interruption Handling (VAD)

Automatic by default. Check response.server_content.interrupted for interruptions.

Session Resumption

Save response.session_resumption_update.handle, pass to new session within 2 hours.

Resources

  • scripts/audio_chat.py - Full mic/speaker streaming example
  • scripts/text_chat.py - Text-based Live API chat
  • scripts/function_calling.py - Tool/function calling pattern
  • scripts/generate_token.py - Ephemeral token generation
  • references/api-reference.md - Complete configuration options, models, audio specs