AgentSkillsCN

speech-use

利用谷歌的生成式AI技术与Cloud Speech SDK,实现语音合成(TTS)、语音转文字(STT)以及声音克隆功能。支持Gemini-TTS、Chirp 3及即时自定义语音等模型。

SKILL.md
--- frontmatter
name: speech-use
description: "Generate (TTS), Transcribe (STT), and Clone voices using Google's GenAI and Cloud Speech SDKs. Supports Gemini-TTS, Chirp 3, and Instant Custom Voice."

Speech Use

Use this skill to perform Text-to-Speech (TTS), Speech-to-Text (STT), and Voice Cloning operations.

This skill uses portable Python scripts managed by uv.

Prerequisites

  1. Environment Variables:

    • GOOGLE_API_KEY (for TTS via Gemini)
    • GOOGLE_CLOUD_PROJECT (Required for STT and Voice Cloning)
    • GOOGLE_APPLICATION_CREDENTIALS (Recommended for STT/Voice Cloning)
  2. APIs Enabled:

    • Text-to-Speech API (texttospeech.googleapis.com)
    • Speech-to-Text API (speech.googleapis.com)

Usage

1. Generate Speech (TTS)

Generate audio from text using Gemini-TTS.

Standard Voice:

bash
uv run skills/speech-use/scripts/generate_speech.py "Hello world, this is a test." --voice Puck --output hello.wav

Custom Voice (Cloned):

bash
uv run skills/speech-use/scripts/generate_speech.py "This is my custom voice speaking." --voice-cloning-key "YOUR_KEY_HERE" --output custom.wav

2. Create Custom Voice (Voice Cloning)

Generate a voiceCloningKey from a reference audio file and a consent file.

Requirements:

  • reference.wav: 10-30s of clear speech (the voice to clone).
  • consent.wav: The speaker saying: "I am the owner of this voice and I consent to Google using this voice to create a synthetic voice model."
bash
uv run skills/speech-use/scripts/create_custom_voice.py --reference-audio reference.wav --consent-audio consent.wav

Save the output key to use with generate_speech.py.

3. Transcribe Audio (STT)

Transcribe audio files using Chirp 3.

bash
uv run skills/speech-use/scripts/transcribe_audio.py audio.wav --language en-US --output transcript.txt

Options

generate_speech.py

  • --voice: Prebuilt voice (e.g., Kore, Puck, Fenrir, Aoede).
  • --voice-cloning-key: Key from create_custom_voice.py.
  • --model: Default gemini-2.5-flash-preview-tts.

transcribe_audio.py

  • --model: Default chirp_3.
  • --language: Default auto.
  • --location: Cloud region (default us).