AgentSkillsCN

text-to-speech

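Translate between English and Indian languages using Sarvam AI's Mayura model. Use when the user needs to translate content, localize an application, or convert text between Hindi, Tamil, Bengali, Telugu, and 7 other Indian languages. Supports bidirectional translation, script control, and code-mixed text.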

SKILL.md
--- frontmatter
name: text-to-speech
description: Convert text to natural speech using Sarvam AI's Bulbul model. Use when the user needs to generate audio from text, create voiceovers, build voice interfaces, or synthesize Indian language speech. Supports 11 Indian languages with multiple voices, controllable pitch/pace/loudness, and real-time streaming. Returns base64-encoded audio.
license: Apache-2.0
metadata:
  author: sarvam-ai
  version: "1.0"
  model: bulbul:v2

Text-to-Speech with Bulbul

Bulbul is Sarvam AI's text-to-speech model that generates natural-sounding speech in Indian languages with support for voice customization and streaming.

Installation

bash
pip install sarvamai

Quick Start

python
from sarvamai import SarvamAI
from sarvamai.play import save

client = SarvamAI()

response = client.text_to_speech.convert(
    text="नमस्ते, आप कैसे हैं?",
    target_language_code="hi-IN",
    model="bulbul:v2",
    speaker="anushka"
)

# Response contains base64-encoded audio
save(response, "output.wav")

Base64 Audio Response

The API returns audio as base64-encoded strings in the audios array:

json
{
    "request_id": "abc123",
    "audios": [
        "UklGRiQAAABXQVZFZm10IBAAAAABAAEA..."
    ]
}

Decode Manually

python
import base64

from sarvamai import SarvamAI

client = SarvamAI()

response = client.text_to_speech.convert(
    text="Hello world",
    target_language_code="en-IN",
    model="bulbul:v2",
    speaker="anushka"
)

# Decode the first base64 string in the audios array to raw bytes
audio_bytes = base64.b64decode(response.audios[0])

# Write the decoded bytes to a WAV file
with open("output.wav", "wb") as f:
    f.write(audio_bytes)
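
Because `audios` is an array, a response may carry more than one clip (for example, when several `inputs` are sent in one request). The following is a minimal sketch of decoding every entry rather than only the first. `save_audios` and its parameters are hypothetical names for illustration, not part of the sarvamai SDK.

```python
import base64
from pathlib import Path


def save_audios(audios, stem="output", suffix=".wav", directory="."):
    """Decode each base64 string in `audios` and write it to its own file.

    Hypothetical helper: files are named <stem>_<index><suffix>.
    Returns the list of paths that were written.
    """
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, encoded in enumerate(audios):
        # validate=True rejects characters outside the base64 alphabet
        audio_bytes = base64.b64decode(encoded, validate=True)
        path = out_dir / f"{stem}_{index}{suffix}"
        path.write_bytes(audio_bytes)
        paths.append(path)
    return paths
```

With a response object like the one above, this would be called as `save_audios(response.audios)`.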

Supported Languages

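| Code | Language | Code | Language |
|------|----------|------|----------|
| hi-IN | Hindi | ta-IN | Tamil |
| bn-IN | Bengali | te-IN | Telugu |
| kn-IN | Kannada | ml-IN | Malayalam |
| mr-IN | Marathi | gu-IN | Gujarati |
| pa-IN | Punjabi | or-IN | Odia |
| en-IN | English (Indian) | | |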
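
Checking a language code before sending a request can surface typos early. This is a sketch that assumes the table above is the complete list. `SUPPORTED_LANGUAGES` and `normalize_language_code` are hypothetical names, not SDK features.

```python
# Codes copied from the Supported Languages table in this document.
SUPPORTED_LANGUAGES = {
    "hi-IN": "Hindi",
    "ta-IN": "Tamil",
    "bn-IN": "Bengali",
    "te-IN": "Telugu",
    "kn-IN": "Kannada",
    "ml-IN": "Malayalam",
    "mr-IN": "Marathi",
    "gu-IN": "Gujarati",
    "pa-IN": "Punjabi",
    "or-IN": "Odia",
    "en-IN": "English (Indian)",
}


def normalize_language_code(code):
    """Return the canonical form (e.g. 'hi-IN') or raise ValueError."""
    # Split "hi-in" into language and region, then fix the casing of each part
    lang, _, region = code.strip().partition("-")
    canonical = f"{lang.lower()}-{region.upper()}"
    if canonical not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language code: {code!r}")
    return canonical
```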

Available Voices

| Voice | Type | Best For |
|-------|------|----------|
| anushka | Female | General, warm tone |
| manisha | Female | Professional, clear |
| vidya | Female | Friendly, conversational |
| arjun | Male | Authoritative, news |
| amol | Male | Casual, storytelling |
| amartya | Male | Deep, formal |

Voice Control

Customize pitch, pace, and loudness:

python
response = client.text_to_speech.convert(
    text="यह एक परीक्षण है।",
    target_language_code="hi-IN",
    model="bulbul:v2",
    speaker="anushka",
    pitch=0.2,          # -1.0 to 1.0 (higher = higher pitch)
    pace=1.2,           # 0.5 to 2.0 (higher = faster)
    loudness=1.5        # 0.5 to 2.0 (higher = louder)
)
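
When these values come from user input, it can help to keep them inside the ranges noted in the comments above. Here is a minimal sketch, assuming out-of-range values should be clamped on the client side rather than rejected. `VOICE_RANGES` and `clamp_voice_settings` are hypothetical names, not part of the sarvamai SDK.

```python
# Ranges taken from the comments in the example above.
VOICE_RANGES = {
    "pitch": (-1.0, 1.0),
    "pace": (0.5, 2.0),
    "loudness": (0.5, 2.0),
}


def clamp_voice_settings(**settings):
    """Clamp pitch, pace, and loudness to their documented ranges."""
    clamped = {}
    for name, value in settings.items():
        if name not in VOICE_RANGES:
            raise ValueError(f"Unknown voice setting: {name!r}")
        low, high = VOICE_RANGES[name]
        # Pull the value up to the minimum, then down to the maximum
        clamped[name] = min(max(float(value), low), high)
    return clamped
```

The result can be unpacked into the call, for example `client.text_to_speech.convert(..., **clamp_voice_settings(pitch=0.2, pace=1.2))`.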

Audio Formats

Set the output format with output_audio_codec:

| Format | Description |
|--------|-------------|
| wav | Uncompressed (default) |
| mp3 | MPEG Layer-3 |
| aac | Advanced Audio Coding |
| opus | Optimized for speech |
| flac | Lossless |
| linear16 | Raw PCM |
| mulaw | Telephony (8-bit) |
| alaw | Telephony (8-bit) |

python
response = client.text_to_speech.convert(
    text="Hello",
    target_language_code="en-IN",
    model="bulbul:v2",
    speaker="anushka",
    output_audio_codec="mp3"
)
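
The earlier examples write to output.wav, but with another codec the file name should match the format requested. This sketch maps each codec in the table to a file extension. The extensions chosen for the raw formats (`.pcm`, `.ulaw`, `.alaw`) are assumptions, and `CODEC_EXTENSIONS` and `output_filename` are hypothetical names.

```python
# Codec names come from the Audio Formats table; extensions for the
# headerless formats (linear16, mulaw, alaw) are illustrative assumptions.
CODEC_EXTENSIONS = {
    "wav": ".wav",
    "mp3": ".mp3",
    "aac": ".aac",
    "opus": ".opus",
    "flac": ".flac",
    "linear16": ".pcm",
    "mulaw": ".ulaw",
    "alaw": ".alaw",
}


def output_filename(codec="wav", stem="output"):
    """Return a file name whose extension matches the requested codec."""
    try:
        return stem + CODEC_EXTENSIONS[codec]
    except KeyError:
        raise ValueError(f"Unknown codec: {codec!r}") from None
```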

Sample Rates

| Rate | Use Case |
|------|----------|
| 8000 | Telephony |
| 16000 | Voice assistants |
| 22050 | Standard audio |
| 24000 | High quality (default) |

python
response = client.text_to_speech.convert(
    text="Hello",
    target_language_code="en-IN",
    model="bulbul:v2",
    speaker="anushka",
    sample_rate=8000  # For phone systems
)
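
To confirm that a clip came back at the requested rate, the WAV header can be read with Python's standard `wave` module. This sketch assumes the `wav` codec was used, so the decoded bytes begin with a RIFF/WAVE header. `wav_info` is a hypothetical helper name.

```python
import base64
import io
import wave


def wav_info(encoded_audio):
    """Decode a base64 WAV clip and report its basic header fields."""
    audio_bytes = base64.b64decode(encoded_audio)
    # wave.open accepts any seekable file-like object, so no temp file is needed
    with wave.open(io.BytesIO(audio_bytes), "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "duration_seconds": wav.getnframes() / rate,
        }
```

For a response requested with `sample_rate=8000`, `wav_info(response.audios[0])["sample_rate"]` would be expected to report 8000 if the service honored the setting.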

JavaScript

javascript
import { SarvamAI } from "sarvamai";
import fs from "fs";

const client = new SarvamAI();

const response = await client.textToSpeech.convert({
  text: "नमस्ते",
  targetLanguageCode: "hi-IN",
  model: "bulbul:v2",
  speaker: "anushka"
});

// Decode base64 and save
const audioBuffer = Buffer.from(response.audios[0], "base64");
fs.writeFileSync("output.wav", audioBuffer);

cURL

bash
curl -X POST "https://api.sarvam.ai/text-to-speech" \
  -H "api-subscription-key: $SARVAM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": [
        "नमस्ते, कैसे हो?"
    ],
    "target_language_code": "hi-IN",
    "model": "bulbul:v2",
    "speaker": "anushka"
}'
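
The same request can be built without the SDK. Below is a sketch using only Python's standard library, mirroring the URL, headers, and JSON body of the cURL command above. `build_tts_request` and `synthesize` are hypothetical helper names, and error handling is omitted.

```python
import json
import os
import urllib.request

API_URL = "https://api.sarvam.ai/text-to-speech"


def build_tts_request(inputs, target_language_code, speaker,
                      model="bulbul:v2", api_key=None):
    """Build (but do not send) a POST request matching the cURL example."""
    payload = {
        "inputs": list(inputs),
        "target_language_code": target_language_code,
        "model": model,
        "speaker": speaker,
    }
    # Fall back to the same environment variable the cURL command uses
    key = api_key or os.environ["SARVAM_API_KEY"]
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "api-subscription-key": key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(request):
    """Send the request and return the parsed JSON response body."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any network call is made.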

Parameters

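| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| text / inputs | string/array | Yes | Text to synthesize |
| target_language_code | string | Yes | BCP-47 language code |
| model | string | Yes | bulbul:v2 or bulbul:v1 |
| speaker | string | Yes | Voice name |
| pitch | float | No | -1.0 to 1.0 |
| pace | float | No | 0.5 to 2.0 |
| loudness | float | No | 0.5 to 2.0 |
| output_audio_codec | string | No | Audio format |
| sample_rate | int | No | Output sample rate |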
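
One way to keep required and optional parameters apart is a small dataclass that drops unset options before the call. This is an illustrative sketch only. `TTSParams` and `to_kwargs` are hypothetical names, and the field names follow the table above.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TTSParams:
    # Required parameters (marked "Yes" in the table)
    text: str
    target_language_code: str
    model: str
    speaker: str
    # Optional parameters; None means "leave it to the service default"
    pitch: Optional[float] = None
    pace: Optional[float] = None
    loudness: Optional[float] = None
    output_audio_codec: Optional[str] = None
    sample_rate: Optional[int] = None

    def to_kwargs(self):
        """Return only the parameters that were explicitly set."""
        return {k: v for k, v in asdict(self).items() if v is not None}
```

The resulting dictionary can then be unpacked into the SDK call shown earlier, as in `client.text_to_speech.convert(**params.to_kwargs())`.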

Response

json
{
    "request_id": "20241115_abc123",
    "audios": [
        "UklGRiQAAABXQVZFZm10IBAAAAABAAEA..."
    ]
}

See references/voices.md for voice samples and recommendations.