fal.ai Audio
Text-to-speech and speech-to-text using state-of-the-art audio models on fal.ai.
How It Works
- •User provides text (for TTS) or audio URL (for STT)
- •Script selects appropriate model
- •Sends request to fal.ai API
- •Returns audio URL (TTS) or transcription text (STT)
Text-to-Speech Models
| Model | Notes |
|---|---|
fal-ai/minimax/speech-2.6-hd | Best quality |
fal-ai/minimax/speech-2.6-turbo | Fast, good quality |
fal-ai/elevenlabs/eleven-v3 | Natural voices |
fal-ai/chatterbox/multilingual | Multi-language, fast |
fal-ai/kling-video/v1/tts | For video sync |
Text-to-Music Models
| Model | Notes |
|---|---|
fal-ai/minimax-music/v2 | Best quality |
fal-ai/minimax-music/v1.5 | Fast |
fal-ai/lyria2 | Google's model |
fal-ai/elevenlabs/music | Song generation |
fal-ai/sonauto/v2 | Instrumental |
fal-ai/ace-step | Short clips |
fal-ai/beatoven | Background music |
Speech-to-Text Models
| Model | Features | Speed |
|---|---|---|
fal-ai/whisper | Multi-language, timestamps | Fast |
fal-ai/elevenlabs/scribe | Speaker diarization | Medium |
Usage
Text-to-Speech
bash
bash /mnt/skills/user/fal-audio/scripts/text-to-speech.sh [options]
Arguments:
- •
--text- Text to convert to speech (required) - •
--model- TTS model (defaults tofal-ai/minimax/speech-2.6-turbo) - •
--voice- Voice ID or name (model-specific)
Examples:
bash
# Basic TTS (fast, good quality) bash /mnt/skills/user/fal-audio/scripts/text-to-speech.sh \ --text "Hello, welcome to the future of AI." # High quality with MiniMax HD bash /mnt/skills/user/fal-audio/scripts/text-to-speech.sh \ --text "This is premium quality speech." \ --model "fal-ai/minimax/speech-2.6-hd" # Natural voices with ElevenLabs bash /mnt/skills/user/fal-audio/scripts/text-to-speech.sh \ --text "Natural sounding voice generation" \ --model "fal-ai/elevenlabs/eleven-v3" # Multi-language TTS bash /mnt/skills/user/fal-audio/scripts/text-to-speech.sh \ --text "Bonjour, bienvenue dans le futur." \ --model "fal-ai/chatterbox/multilingual"
Speech-to-Text
bash
bash /mnt/skills/user/fal-audio/scripts/speech-to-text.sh [options]
Arguments:
- •
--audio-url- URL of audio file to transcribe (required) - •
--model- STT model (defaults tofal-ai/whisper) - •
--language- Language code (optional, auto-detected)
Examples:
bash
# Transcribe with Whisper bash /mnt/skills/user/fal-audio/scripts/speech-to-text.sh \ --audio-url "https://example.com/audio.mp3" # Transcribe with speaker diarization bash /mnt/skills/user/fal-audio/scripts/speech-to-text.sh \ --audio-url "https://example.com/meeting.mp3" \ --model "fal-ai/elevenlabs/scribe" # Transcribe specific language bash /mnt/skills/user/fal-audio/scripts/speech-to-text.sh \ --audio-url "https://example.com/spanish.mp3" \ --language "es"
MCP Tool Alternative
Text-to-Speech
javascript
mcp__fal-ai__generate({
modelId: "fal-ai/minimax/speech-2.6-turbo",
input: {
text: "Hello, welcome to the future of AI."
}
})
Speech-to-Text
javascript
mcp__fal-ai__generate({
modelId: "fal-ai/whisper",
input: {
audio_url: "https://example.com/audio.mp3"
}
})
Output
Text-to-Speech Output
code
Generating speech... Model: fal-ai/minimax/speech-2.6-turbo Speech generated! Audio URL: https://v3.fal.media/files/abc123/speech.mp3 Duration: 5.2s
Speech-to-Text Output
code
Transcribing audio... Model: fal-ai/whisper Transcription complete! Text: "Hello, this is the transcribed text from the audio file." Duration: 12.5s Language: en
Present Results to User
For TTS:
code
Here's the generated speech: [Download audio](https://v3.fal.media/files/.../speech.mp3) • Duration: 5.2s | Model: Maya TTS
For STT:
code
Here's the transcription: "Hello, this is the transcribed text from the audio file." • Duration: 12.5s | Language: English
Model Selection Guide
Text-to-Speech
MiniMax Speech 2.6 HD (fal-ai/minimax/speech-2.6-hd)
- •Best for: Premium quality requirements
- •Quality: Highest
- •Speed: Medium
MiniMax Speech 2.6 Turbo (fal-ai/minimax/speech-2.6-turbo)
- •Best for: General use with good quality
- •Quality: High
- •Speed: Fast
ElevenLabs v3 (fal-ai/elevenlabs/eleven-v3)
- •Best for: Natural, realistic voices
- •Quality: High
- •Features: Many voice options
Chatterbox Multilingual (fal-ai/chatterbox/multilingual)
- •Best for: Multi-language support
- •Quality: Good
- •Speed: Fast
Text-to-Music
MiniMax Music v2 (fal-ai/minimax-music/v2)
- •Best for: High quality music generation
- •Quality: Highest
Lyria2 (fal-ai/lyria2)
- •Best for: Google's music model
- •Quality: High
Speech-to-Text
Whisper (fal-ai/whisper)
- •Best for: General transcription, timestamps
- •Languages: 99+ languages
- •Features: Word-level timestamps
ElevenLabs Scribe (fal-ai/elevenlabs/scribe)
- •Best for: Multi-speaker recordings
- •Features: Speaker diarization
- •Quality: Professional-grade
Troubleshooting
Empty Audio
code
Error: Generated audio is empty Check that your text is not empty and contains valid content.
Unsupported Audio Format
code
Error: Audio format not supported Supported formats: MP3, WAV, M4A, FLAC, OGG Convert your audio to a supported format.
Language Detection Failed
code
Warning: Could not detect language, defaulting to English Specify the language explicitly with --language option.