Speech-to-Text Transcription
Comprehensive speech-to-text capabilities using multiple STT engines. Record audio, transcribe files, real-time processing, speaker identification, and multi-language support.
Quick Start
When asked to transcribe audio:
- •Choose engine: Whisper (local/free), Google, Azure, or AssemblyAI
- •Record or load: Capture audio or use existing file
- •Transcribe: Convert speech to text
- •Format: Output as plain text, SRT, VTT, or JSON
- •Enhance: Add timestamps, speaker labels, punctuation
Prerequisites
System Requirements
- •Python 3.8+
- •Microphone (for recording)
- •Audio file support: WAV, MP3, M4A, FLAC, OGG
Install Dependencies
Core (required):
pip install sounddevice soundfile numpy --break-system-packages
Whisper (OpenAI - local, free):
pip install openai-whisper --break-system-packages # For faster processing with GPU: pip install openai-whisper torch --break-system-packages
Google Speech (requires API key):
pip install google-cloud-speech --break-system-packages
Azure Speech (requires API key):
pip install azure-cognitiveservices-speech --break-system-packages
AssemblyAI (requires API key):
pip install assemblyai --break-system-packages
Optional enhancements:
pip install pydub webrtcvad --break-system-packages # Audio processing pip install pyaudio --break-system-packages # Alternative audio backend
See reference/setup-guide.md for detailed installation.
STT Engine Comparison
| Engine | Cost | Speed | Quality | Features | Best For |
|---|---|---|---|---|---|
| Whisper | Free | Medium | High | Multilingual, local | Privacy, offline, free |
| Pay-per-use | Fast | High | Punctuation, diarization | Real-time, accuracy | |
| Azure | Pay-per-use | Fast | High | Translation, custom | Enterprise integration |
| AssemblyAI | Pay-per-use | Medium | Very High | Diarization, sentiment | Analysis, insights |
Whisper (Recommended for most users)
- •✅ Free and local - No API costs, runs offline
- •✅ High quality - State-of-the-art accuracy
- •✅ Multilingual - 99+ languages
- •⚠️ Speed - Slower than cloud services (depends on hardware)
- •⚠️ Resources - Needs decent CPU/GPU
Google Cloud Speech
- •✅ Fast - Real-time capable
- •✅ Accurate - Excellent for English
- •✅ Features - Automatic punctuation, speaker diarization
- •⚠️ Cost - $0.006 per 15 seconds (~$1.44/hour)
- •⚠️ Privacy - Audio sent to Google
Azure Speech
- •✅ Enterprise - Microsoft integration
- •✅ Translation - Real-time translation
- •✅ Custom - Train custom models
- •⚠️ Cost - $1 per audio hour
- •⚠️ Setup - More complex configuration
AssemblyAI
- •✅ Features - Speaker diarization, sentiment analysis
- •✅ Quality - Very accurate
- •✅ Developer-friendly - Simple API
- •⚠️ Cost - $0.00025 per second (~$0.90/hour)
Core Operations
Record Audio
Simple recording:
# Record 30 seconds python scripts/record_audio.py --duration 30 --output recording.wav # Record until stopped (Ctrl+C) python scripts/record_audio.py --output recording.wav # Record with voice activity detection python scripts/record_audio.py --vad --output recording.wav
Advanced recording:
# Choose microphone python scripts/list_devices.py # List available mics python scripts/record_audio.py --device 1 --output recording.wav # Specify quality python scripts/record_audio.py \ --sample-rate 48000 \ --channels 2 \ --output recording.wav
Transcribe Files
Using Whisper (local, free):
# Basic transcription python scripts/transcribe_whisper.py --file recording.wav # Choose model size (tiny, base, small, medium, large) python scripts/transcribe_whisper.py \ --file recording.wav \ --model medium # With timestamps python scripts/transcribe_whisper.py \ --file recording.wav \ --timestamps \ --output transcript.json # Multiple languages python scripts/transcribe_whisper.py \ --file recording.wav \ --language es # Spanish
Using Google Cloud:
# Export API key export GOOGLE_APPLICATION_CREDENTIALS="path/to/credentials.json" # Transcribe python scripts/transcribe_google.py \ --file recording.wav \ --language en-US # With speaker diarization python scripts/transcribe_google.py \ --file recording.wav \ --diarization \ --speakers 2
Using Azure:
# Set credentials export AZURE_SPEECH_KEY="your-key" export AZURE_SPEECH_REGION="westus" # Transcribe python scripts/transcribe_azure.py --file recording.wav # Real-time python scripts/transcribe_azure_realtime.py --microphone
Using AssemblyAI:
# Set API key export ASSEMBLYAI_API_KEY="your-key" # Transcribe with features python scripts/transcribe_assemblyai.py \ --file recording.wav \ --diarization \ --sentiment \ --topics
Real-Time Transcription
Stream from microphone:
# Whisper streaming (chunked) python scripts/stream_whisper.py # Google streaming python scripts/stream_google.py # Azure continuous recognition python scripts/stream_azure.py
Format Output
Plain text:
python scripts/transcribe_whisper.py --file audio.wav --output transcript.txt
JSON with metadata:
python scripts/transcribe_whisper.py \ --file audio.wav \ --format json \ --output transcript.json # Output includes: # - Text segments # - Timestamps # - Confidence scores # - Language detection
SRT subtitles:
python scripts/transcribe_whisper.py \ --file video.mp4 \ --format srt \ --output subtitles.srt
VTT subtitles:
python scripts/transcribe_whisper.py \ --file video.mp4 \ --format vtt \ --output subtitles.vtt
Common Workflows
Workflow 1: Meeting Transcription
Scenario: Record and transcribe meeting with speaker labels
# 1. Record meeting python scripts/record_audio.py \ --output meeting.wav \ --vad # Stop on silence # 2. Transcribe with speaker diarization python scripts/transcribe_google.py \ --file meeting.wav \ --diarization \ --speakers 4 \ --output meeting.json # 3. Format for readability python scripts/format_transcript.py \ --input meeting.json \ --format markdown \ --output meeting.md # Result: Formatted transcript with speaker labels and timestamps
Workflow 2: Voice Notes to Markdown
Scenario: Quick voice note → markdown document
# Record voice note python scripts/quick_note.py # (Records audio, transcribes with Whisper, saves as markdown) # Output: voice-note-2025-01-20-14-30.md
Workflow 3: Batch Transcription
Scenario: Transcribe multiple audio files
# Batch process folder python scripts/batch_transcribe.py \ --input ./recordings/ \ --output ./transcripts/ \ --engine whisper \ --model base # Progress shown for each file
Workflow 4: Video Subtitles
Scenario: Generate subtitles for video
# Extract audio from video python scripts/extract_audio.py --video lecture.mp4 --output audio.wav # Generate subtitles python scripts/transcribe_whisper.py \ --file audio.wav \ --format srt \ --output lecture.srt # Embed in video (requires ffmpeg) python scripts/embed_subtitles.py \ --video lecture.mp4 \ --subtitles lecture.srt \ --output lecture-subbed.mp4
Workflow 5: Multi-Language Support
Scenario: Transcribe and translate
# Transcribe Spanish audio python scripts/transcribe_whisper.py \ --file spanish-audio.wav \ --language es \ --output transcript-es.txt # Translate to English python scripts/transcribe_whisper.py \ --file spanish-audio.wav \ --task translate \ --output transcript-en.txt
Whisper Model Sizes
| Model | Parameters | Size | Speed | VRAM | Accuracy |
|---|---|---|---|---|---|
| tiny | 39M | ~75MB | ~32x | ~1GB | Good |
| base | 74M | ~142MB | ~16x | ~1GB | Better |
| small | 244M | ~466MB | ~6x | ~2GB | Great |
| medium | 769M | ~1.5GB | ~2x | ~5GB | Excellent |
| large | 1550M | ~2.9GB | 1x | ~10GB | Best |
Recommendation:
- •Casual use:
tinyorbase(fast, good enough) - •Quality needed:
smallormedium(balanced) - •Professional:
large(best accuracy, slower) - •GPU available: Use
mediumorlarge - •CPU only: Use
tinyorbase
Language Support
Whisper supports 99+ languages:
# Common languages en # English es # Spanish fr # French de # German it # Italian pt # Portuguese nl # Dutch pl # Polish ru # Russian ja # Japanese ko # Korean zh # Chinese ar # Arabic hi # Hindi
Full list: reference/language-codes.md
Speaker Diarization
Identify who said what:
# Google (best diarization) python scripts/transcribe_google.py \ --file meeting.wav \ --diarization \ --speakers 3 # Hint: 3 speakers expected # AssemblyAI python scripts/transcribe_assemblyai.py \ --file meeting.wav \ --diarization # Output format: # Speaker 1: Hello everyone, let's begin # Speaker 2: Thanks for joining # Speaker 1: Today's agenda includes...
Post-process with names:
python scripts/label_speakers.py \ --transcript meeting.json \ --labels "Alice,Bob,Charlie" \ --output meeting-labeled.txt
Audio Processing
Enhance audio quality:
# Reduce noise python scripts/denoise_audio.py \ --input noisy.wav \ --output clean.wav # Normalize volume python scripts/normalize_audio.py \ --input quiet.wav \ --output normalized.wav # Convert format python scripts/convert_audio.py \ --input audio.m4a \ --output audio.wav
Timestamps and Segments
Transcript with timestamps:
{
"segments": [
{
"start": 0.0,
"end": 3.5,
"text": "Welcome to today's meeting.",
"confidence": 0.95
},
{
"start": 3.5,
"end": 7.2,
"text": "Let's review the quarterly results.",
"confidence": 0.92
}
]
}
Search by timestamp:
# Find text at specific time python scripts/find_at_time.py \ --transcript meeting.json \ --time "5:30" # 5 minutes 30 seconds # Extract time range python scripts/extract_range.py \ --transcript meeting.json \ --start "2:00" \ --end "5:00" \ --output excerpt.txt
API Cost Comparison
Per hour of audio:
- •Whisper: Free (local processing)
- •Google: ~$1.44 (60 min × $0.024/min)
- •Azure: ~$1.00 (standard pricing)
- •AssemblyAI: ~$0.90 (3600 sec × $0.00025/sec)
Free tiers:
- •Google: $300 credit (first 90 days)
- •Azure: 5 hours/month free
- •AssemblyAI: 3 hours free on signup
Scripts Reference
Recording:
- •
record_audio.py- Record from microphone - •
list_devices.py- List audio devices - •
test_microphone.py- Test mic input
Transcription:
- •
transcribe_whisper.py- Whisper transcription - •
transcribe_google.py- Google Cloud STT - •
transcribe_azure.py- Azure Speech - •
transcribe_assemblyai.py- AssemblyAI
Real-time:
- •
stream_whisper.py- Whisper streaming - •
stream_google.py- Google streaming - •
stream_azure.py- Azure continuous
Processing:
- •
batch_transcribe.py- Batch processing - •
format_transcript.py- Format output - •
extract_audio.py- Extract from video - •
denoise_audio.py- Noise reduction
Utilities:
- •
quick_note.py- Record + transcribe - •
label_speakers.py- Add speaker names - •
find_at_time.py- Search by timestamp - •
convert_audio.py- Format conversion
Best Practices
- •Start with Whisper - Free, offline, good quality
- •Test different models - Balance speed vs accuracy
- •Use VAD - Voice Activity Detection for cleaner recording
- •Enhance audio first - Denoise for better results
- •Appropriate model size - Don't use large models for quick notes
- •Speaker diarization - Essential for meetings
- •Save raw audio - Keep original for re-processing
- •Add context - Language hints improve accuracy
Troubleshooting
"No module named 'whisper'"
pip install openai-whisper --break-system-packages
"Microphone not working"
# List devices python scripts/list_devices.py # Test specific device python scripts/test_microphone.py --device 1
"Out of memory" (Whisper)
# Use smaller model python scripts/transcribe_whisper.py --file audio.wav --model tiny # Or process in chunks python scripts/transcribe_chunked.py --file large-audio.wav
"Poor transcription quality"
- •Use larger Whisper model (medium/large)
- •Enhance audio first (denoise, normalize)
- •Specify correct language
- •Check microphone quality
"API authentication failed"
# Google export GOOGLE_APPLICATION_CREDENTIALS="/path/to/credentials.json" # Azure export AZURE_SPEECH_KEY="your-key" export AZURE_SPEECH_REGION="region" # AssemblyAI export ASSEMBLYAI_API_KEY="your-key"
Integration Examples
See examples/ for complete workflows:
- •examples/meeting-minutes.md - Meeting transcription
- •examples/podcast-notes.md - Podcast processing
- •examples/lecture-subtitles.md - Video subtitles
- •examples/voice-journal.md - Voice note system
Reference Documentation
- •reference/setup-guide.md - Detailed setup
- •reference/engine-comparison.md - STT engine details
- •reference/language-codes.md - Supported languages
- •reference/api-keys.md - Getting API credentials
- •reference/audio-formats.md - Format specifications