LiveKit Self-Hosted STT Plugin

Build self-hosted speech-to-text APIs and LiveKit voice agent plugins using Hugging Face models.

Overview

This skill provides templates and guidance for:

•Building a self-hosted STT API server using FastAPI + Whisper/HF models
•Creating a LiveKit plugin that connects to your self-hosted API
•Deploying and scaling in production

Quick Start

Option 1: Build Both (API + Plugin)

When the user wants a complete setup:

•Create API Server:

bash

python scripts/setup_api_server.py my-stt-server --model openai/whisper-medium
cd my-stt-server
pip install -r requirements.txt
python main.py

•Create Plugin:

bash

python scripts/setup_plugin.py custom-stt
cd livekit-plugins-custom-stt
pip install -e .

•Use in LiveKit Agent:

python

from livekit.plugins import custom_stt

stt = custom_stt.STT(api_url="ws://localhost:8000/ws/transcribe")

Option 2: API Server Only

When the user only needs the API server:

•Use scripts/setup_api_server.py with the desired model
•See references/api_server_guide.md for implementation details
•The template is in assets/api-server/

Option 3: Plugin Only

When the user has an existing API and needs a LiveKit plugin:

•Use scripts/setup_plugin.py with the plugin name
•See references/plugin_implementation.md for details
•The template is in assets/plugin-template/

Model Selection

Help the user choose the right model:

Use Case	Recommended Model	Rationale
Best accuracy	`openai/whisper-large-v3`	SOTA quality, requires GPU
Production balance	`openai/whisper-medium`	Good quality, reasonable speed
Real-time/fast	`openai/whisper-small`	Fast, acceptable quality
CPU-only	`openai/whisper-tiny`	Can run without GPU
English-only	`facebook/wav2vec2-large-960h`	Optimized for English

For detailed comparison and optimization tips, see references/models_comparison.md.

Implementation Workflow

Building the API Server

•Use the template: Start with assets/api-server/main.py
•Key components:
  - FastAPI app with WebSocket endpoint
  - Model loading at startup (kept in memory)
  - Audio buffer management
  - WebSocket protocol for streaming
•Customization points:
  - Model selection (change MODEL_ID in .env)
  - Audio processing parameters
  - Batch size and optimization
  - Error handling

For the complete implementation guide, see references/api_server_guide.md.
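The audio buffer management component listed above can be illustrated with a minimal sketch. It assumes the server receives 16-bit, 16 kHz mono PCM as binary frames and transcribes roughly one second at a time. The `AudioBuffer` class and its method names are hypothetical and are not taken from the template in assets/api-server/.

```python
class AudioBuffer:
    """Accumulate raw PCM bytes and hand them out in fixed-size chunks."""

    def __init__(self, sample_rate: int = 16_000, bytes_per_sample: int = 2,
                 chunk_seconds: float = 1.0) -> None:
        # With the defaults: 16,000 samples/s * 2 bytes * 1 s = 32,000 bytes.
        self.chunk_bytes = int(sample_rate * bytes_per_sample * chunk_seconds)
        self._buf = bytearray()

    def add(self, frame: bytes) -> list[bytes]:
        """Append one binary frame and return any complete chunks."""
        self._buf.extend(frame)
        chunks = []
        while len(self._buf) >= self.chunk_bytes:
            chunks.append(bytes(self._buf[:self.chunk_bytes]))
            del self._buf[:self.chunk_bytes]
        return chunks

    def flush(self) -> bytes:
        """Return the remaining partial chunk (e.g. on "end") and reset."""
        rest = bytes(self._buf)
        self._buf.clear()
        return rest
```

In a WebSocket handler, each binary message would go through `add()`, each returned chunk would be passed to the model, and `flush()` would be called when the client signals the end of the stream.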

Building the LiveKit Plugin

•Use the template: Start with assets/plugin-template/
•Required implementations:
  - _recognize_impl() - Non-streaming recognition
  - stream() - Return SpeechStream instance
  - SpeechStream class - Handle streaming
•Key considerations:
  - Audio format conversion (16kHz, mono, 16-bit PCM)
  - WebSocket connection management
  - Event emission (interim/final transcripts)
  - Error handling and cleanup

For the complete implementation guide, see references/plugin_implementation.md.
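The audio format conversion step can be sketched with the standard library alone. This is an illustration under stated assumptions, not the plugin template's code: the input is interleaved 16-bit PCM in the machine's native byte order (little-endian on typical hardware), and the helper names `stereo_to_mono` and `resample_linear` are hypothetical. Linear interpolation is used for brevity and applies no anti-aliasing filter, so a proper resampler is preferable when audio quality matters.

```python
from array import array


def stereo_to_mono(pcm: bytes) -> bytes:
    """Average interleaved left/right int16 samples into one channel."""
    samples = array("h")
    samples.frombytes(pcm)
    mono = array("h", ((samples[i] + samples[i + 1]) // 2
                       for i in range(0, len(samples) - 1, 2)))
    return mono.tobytes()


def resample_linear(pcm: bytes, src_rate: int, dst_rate: int = 16_000) -> bytes:
    """Resample mono int16 audio to dst_rate by linear interpolation."""
    src = array("h")
    src.frombytes(pcm)
    if src_rate == dst_rate or len(src) == 0:
        return pcm
    n_out = len(src) * dst_rate // src_rate
    out = array("h")
    for j in range(n_out):
        pos = j * src_rate / dst_rate      # fractional index into the source
        i = int(pos)
        frac = pos - i
        nxt = src[min(i + 1, len(src) - 1)]  # clamp at the last sample
        out.append(round(src[i] * (1 - frac) + nxt * frac))
    return out.tobytes()
```

For example, a 48 kHz stereo frame would be passed through `stereo_to_mono` and then `resample_linear(mono, 48_000)` before being sent to the API.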

Deployment

Development

bash

# API Server
uvicorn main:app --host 0.0.0.0 --port 8000 --reload

# WebSocket endpoint to test against (a URL, not a shell command):
# ws://localhost:8000/ws/transcribe

Production

Docker (Recommended):

bash

docker-compose up

Kubernetes: Use the manifests in the deployment guide

Cloud Platforms: AWS ECS, GCP Cloud Run, Azure Container Instances

For the complete deployment guide, including scaling, monitoring, and security, see references/deployment.md.

WebSocket Protocol

Client → Server

•Audio: Binary (16-bit PCM, 16kHz)
•Config: {"type": "config", "language": "en"}
•End: {"type": "end"}

Server → Client

•Interim: {"type": "interim", "text": "..."}
•Final: {"type": "final", "text": "...", "language": "en"}
•Error: {"type": "error", "message": "..."}
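The JSON messages above can be built and parsed with a few small helpers. The following is a minimal sketch that uses only the message shapes listed in this section. The function names and the `Transcript` dataclass are hypothetical, and raising an exception on an error message is one possible design choice rather than the plugin's documented behavior.

```python
import json
from dataclasses import dataclass
from typing import Optional


def config_message(language: str = "en") -> str:
    """Client -> server: set the transcription language."""
    return json.dumps({"type": "config", "language": language})


def end_message() -> str:
    """Client -> server: signal that no more audio will follow."""
    return json.dumps({"type": "end"})


@dataclass
class Transcript:
    text: str
    is_final: bool
    language: Optional[str] = None


def parse_server_message(raw: str) -> Transcript:
    """Server -> client: decode an interim/final message or raise on error."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "error":
        raise RuntimeError(msg.get("message", "unknown server error"))
    if kind not in ("interim", "final"):
        raise ValueError(f"unexpected message type: {kind!r}")
    return Transcript(
        text=msg.get("text", ""),
        is_final=(kind == "final"),
        language=msg.get("language"),
    )
```

Audio itself is not wrapped in JSON: it travels as binary frames, so a client would interleave raw PCM frames with these text messages.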

Common Tasks

Change Model

Edit .env:

bash

MODEL_ID=openai/whisper-small  # Faster model

Add Language Support

In plugin usage:

python

stt = custom_stt.STT(language="es")  # Spanish
stt = custom_stt.STT(detect_language=True)  # Auto-detect

Enable GPU

In API server:

bash

DEVICE=cuda:0  # Use GPU
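
One way the server might read the MODEL_ID and DEVICE settings is sketched below. It reads from the process environment, so a .env file would first have to be loaded by whatever mechanism the server uses. The `Settings` class, the `load_settings` function, and the default values are assumptions for illustration and are not taken from the template.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    model_id: str
    device: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read MODEL_ID and DEVICE, falling back to assumed defaults."""
    env = os.environ if env is None else env
    return Settings(
        # Assumed defaults: the "production balance" model, running on CPU.
        model_id=env.get("MODEL_ID", "openai/whisper-medium"),
        device=env.get("DEVICE", "cpu"),
    )
```

Passing a plain dictionary instead of relying on `os.environ` keeps the function easy to test, for example `load_settings({"DEVICE": "cuda:0"})`.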

Scale Horizontally

Deploy multiple API server instances behind a load balancer. See references/deployment.md for Nginx configuration.

Troubleshooting

Out of Memory

•Use a smaller model (whisper-small or whisper-tiny)
•Reduce batch_size in the pipeline
•Set low_cpu_mem_usage=True when loading the model

Slow Transcription

•Ensure the GPU is enabled (DEVICE=cuda:0)
•Use FP16 precision (automatic on GPU)
•Increase batch_size
•Use a smaller model

Connection Issues

•Verify that the load balancer supports WebSocket connections
•Check firewall rules
•Increase timeout settings

Scripts

•scripts/setup_api_server.py - Generate API server from template
•scripts/setup_plugin.py - Generate LiveKit plugin from template

References

Load these as needed for detailed information:

•references/api_server_guide.md - Complete API implementation guide
•references/plugin_implementation.md - LiveKit plugin development
•references/models_comparison.md - Model selection and optimization
•references/deployment.md - Production deployment best practices

Assets

Ready-to-use templates:

•assets/api-server/ - Complete FastAPI server with Whisper
•assets/plugin-template/ - LiveKit STT plugin structure

Best Practices

•Keep models in memory - Load once at startup, not per request
•Use appropriate model size - Balance quality vs. speed for your use case
•Process audio in chunks - 1-second chunks work well for streaming
•Implement proper cleanup - Close WebSocket connections gracefully
•Monitor metrics - Track latency, throughput, GPU utilization
•Use Docker - Ensures consistent deployments
•Enable authentication - Secure production APIs
•Scale horizontally - Use a load balancer for high availability