speech-to-text

Transcribes audio to text using Sarvam AI's Saarika model. Use it when the user needs to convert speech to text, transcribe audio files, build voice interfaces, or process Indian-language audio.

SKILL.md
--- frontmatter
name: speech-to-text
description: Transcribe audio to text using Sarvam AI's Saarika model. Use when the user needs to convert speech to text, transcribe audio files, build voice interfaces, or process Indian language audio. Supports 11 Indian languages plus English with automatic language detection, code-mixing, speaker diarization, and word-level timestamps.
license: Apache-2.0
metadata:
  author: sarvam-ai
  version: "1.0"
  model: saarika:v2.5

Speech-to-Text with Saarika

Saarika is Sarvam AI's speech recognition model optimized for Indian languages with support for code-mixing (Hindi-English etc.) and multi-speaker scenarios.

Installation

```bash
pip install sarvamai
```

Quick Start

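```python
from sarvamai import SarvamAI

client = SarvamAI()

# Open the audio in binary mode; the context manager closes the file afterwards
with open("audio.wav", "rb") as audio_file:
    response = client.speech_to_text.transcribe(
        file=audio_file,
        model="saarika:v2.5",
        language_code="hi-IN"
    )

print(response.transcript)
```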

Supported Languages

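| Code | Language | Code | Language |
|------|----------|------|----------|
| hi-IN | Hindi | ta-IN | Tamil |
| bn-IN | Bengali | te-IN | Telugu |
| kn-IN | Kannada | ml-IN | Malayalam |
| mr-IN | Marathi | gu-IN | Gujarati |
| pa-IN | Punjabi | or-IN | Odia |
| en-IN | English (Indian) | auto | Auto-detect |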

API Options

REST API (≤30 seconds)

For short audio clips:

```python
# Assumes `client` was created as in Quick Start
with open("short_clip.wav", "rb") as audio_file:
    response = client.speech_to_text.transcribe(
        file=audio_file,
        model="saarika:v2.5",
        language_code="auto",           # Auto-detect language
        with_timestamps=True,           # Word-level timestamps
        with_diarisation=True           # Speaker identification
    )

print(response.transcript)
print(response.language_code)       # Detected language
print(response.words)               # Timestamped words
print(response.speaker_segments)    # Speaker turns
```

Batch API (≤1 hour)

For long recordings:

```python
# Assumes `client` was created as in Quick Start
with open("long_recording.mp3", "rb") as audio_file:
    response = client.speech_to_text.transcribe_batch(
        file=audio_file,
        model="saarika:v2.5",
        language_code="hi-IN"
    )
```
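The 30-second and 1-hour limits in the headings above suggest routing each clip by its length. The following is a minimal sketch, assuming WAV input so that the standard-library `wave` module can read the duration. `wav_duration_seconds` and `pick_api` are hypothetical helper names and are not part of the sarvamai SDK.

```python
import wave

REST_LIMIT_SECONDS = 30      # REST API limit stated in this section
BATCH_LIMIT_SECONDS = 3600   # Batch API limit stated in this section (1 hour)


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def pick_api(duration_seconds: float) -> str:
    """Map a clip duration to the API tier described in this section."""
    if duration_seconds <= REST_LIMIT_SECONDS:
        return "rest"
    if duration_seconds <= BATCH_LIMIT_SECONDS:
        return "batch"
    raise ValueError("audio is longer than 1 hour; split it before submitting")
```

A caller could then use `transcribe` when `pick_api(wav_duration_seconds(path))` returns `"rest"` and `transcribe_batch` when it returns `"batch"`.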

WebSocket Streaming (Real-time)

Use WebSocket streaming for live transcription. Audio must be sent as base64-encoded strings.

```python
import asyncio
import base64
from sarvamai import AsyncSarvamAI

async def stream_audio():
    client = AsyncSarvamAI()

    async with client.speech_to_text_streaming.connect(
        language_code="hi-IN",
        model="saarika:v2.5",
        high_vad_sensitivity=True
    ) as ws:
        # Read and encode audio to base64
        with open("audio.wav", "rb") as f:
            audio_base64 = base64.b64encode(f.read()).decode("utf-8")

        # Send base64 encoded audio
        await ws.transcribe(
            audio=audio_base64,
            encoding="audio/wav",
            sample_rate=16000
        )

        # Receive transcription
        response = await ws.recv()
        print(response)

asyncio.run(stream_audio())
```

WebSocket streaming accepts only wav, pcm_s16le, pcm_l16, and pcm_raw. MP3, AAC, and OGG are not supported for streaming.
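Because streaming rejects compressed formats and expects base64 strings, it can help to check the file and read its sample rate before opening a connection. The sketch below is one way to do that with the standard library only. `prepare_wav_for_streaming` is a hypothetical helper, not an SDK function, and it covers only the WAV case.

```python
import base64
import wave


def prepare_wav_for_streaming(path: str) -> tuple[str, int]:
    """Return (base64 audio, sample rate) for a WAV file.

    Raises ValueError if the file is not a WAV container, for example
    an MP3 that was renamed to .wav.
    """
    try:
        with wave.open(path, "rb") as wav:
            sample_rate = wav.getframerate()
    except (wave.Error, EOFError) as exc:
        raise ValueError(f"{path} is not a valid WAV file: {exc}") from exc

    # Encode the whole file, header included, as the streaming example does
    with open(path, "rb") as f:
        audio_base64 = base64.b64encode(f.read()).decode("utf-8")
    return audio_base64, sample_rate
```

The two return values correspond to the `audio` and `sample_rate` arguments passed to `ws.transcribe` in the streaming example, so the sample rate no longer has to be hard-coded.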

JavaScript

```javascript
import { SarvamAI } from "sarvamai";
import fs from "fs";

const client = new SarvamAI();

const response = await client.speechToText.transcribe({
  file: fs.createReadStream("audio.wav"),
  model: "saarika:v2.5",
  languageCode: "hi-IN",
  withTimestamps: true
});

console.log(response.transcript);
```

cURL

```bash
curl -X POST "https://api.sarvam.ai/speech-to-text" \
  -H "api-subscription-key: $SARVAM_API_KEY" \
  -F "file=@audio.wav" \
  -F "model=saarika:v2.5" \
  -F "language_code=hi-IN"
```

Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| file | File | Yes | Audio file (wav, mp3, flac, ogg, webm) |
| model | string | Yes | `saarika:v2.5` or `saarika:v2` |
| language_code | string | Yes | BCP-47 code or `auto` |
| with_timestamps | bool | No | Return word timestamps |
| with_diarisation | bool | No | Enable speaker identification |
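The required parameters can be checked locally before a request is sent. The sketch below copies the allowed values from the tables in this document; the service itself may accept values not listed here. `validate_request` is a hypothetical helper, not part of the SDK.

```python
from pathlib import Path

# Values copied from the tables in this document (an assumption that they are complete)
AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg", ".webm"}
MODELS = {"saarika:v2.5", "saarika:v2"}
LANGUAGE_CODES = {
    "hi-IN", "ta-IN", "bn-IN", "te-IN", "kn-IN", "ml-IN",
    "mr-IN", "gu-IN", "pa-IN", "or-IN", "en-IN", "auto",
}


def validate_request(path: str, model: str, language_code: str) -> list[str]:
    """Return a list of problems; an empty list means the inputs look valid."""
    problems = []
    if Path(path).suffix.lower() not in AUDIO_EXTENSIONS:
        problems.append(f"unsupported audio format: {path}")
    if model not in MODELS:
        problems.append(f"unknown model: {model}")
    if language_code not in LANGUAGE_CODES:
        problems.append(f"unsupported language code: {language_code}")
    return problems
```

Returning a list rather than raising on the first failure lets a caller report every problem in one pass.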

Response

```json
{
    "request_id": "abc123",
    "transcript": "नमस्ते, आप कैसे हैं?",
    "language_code": "hi-IN",
    "words": [
        {
            "word": "नमस्ते",
            "start": 0.0,
            "end": 0.5
        },
        {
            "word": "आप",
            "start": 0.6,
            "end": 0.8
        }
    ],
    "speaker_segments": [
        {
            "speaker": "SPEAKER_00",
            "start": 0.0,
            "end": 2.5
        }
    ]
}
```
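The `words` and `speaker_segments` arrays can be combined into per-speaker text. The sketch below works on a plain dict with the JSON shape shown above; the SDK may return a typed object rather than a dict, so treat the dict access as an assumption. `words_by_speaker` is a hypothetical helper name.

```python
def words_by_speaker(response: dict) -> list[tuple[str, str]]:
    """Pair each speaker segment with the words that start inside it."""
    turns = []
    for segment in response.get("speaker_segments", []):
        spoken = [
            w["word"]
            for w in response.get("words", [])
            if segment["start"] <= w["start"] < segment["end"]
        ]
        turns.append((segment["speaker"], " ".join(spoken)))
    return turns


# Sample data copied from the response shown in this section
sample = {
    "transcript": "नमस्ते, आप कैसे हैं?",
    "words": [
        {"word": "नमस्ते", "start": 0.0, "end": 0.5},
        {"word": "आप", "start": 0.6, "end": 0.8},
    ],
    "speaker_segments": [
        {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.5},
    ],
}

for speaker, text in words_by_speaker(sample):
    print(f"{speaker}: {text}")
```

Words are assigned by their start time, so a word that straddles a segment boundary goes to the segment in which it begins.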

See references/streaming.md for detailed WebSocket documentation.