Media Configuration
Configure how DevClaw processes images, videos, and audio received from messaging channels (WhatsApp, Telegram, Discord, WebUI).
Settings live in config.yaml under the media: section and can also be changed via WebUI → Configuração (Settings).
Vision (Image/Video Understanding)
Controls how images and video frames are described before being added to conversation context.
yaml
media:
  vision_enabled: true
  vision_model: ""          # empty = use main chat model
  vision_detail: "auto"     # auto | low | high
  max_image_size: 20971520  # 20MB
Available Vision Models
| Provider | Model | Notes |
|---|---|---|
| Z.AI | glm-4.6v | Flagship, 128K, native tool use |
| Z.AI | glm-4.6v-flashx | Lightweight, affordable |
| Z.AI | glm-4.6v-flash | Free tier |
| Z.AI | glm-4.5v | 106B MOE, thinking mode |
| OpenAI | gpt-4o | Best quality |
| OpenAI | gpt-4o-mini | Fast, cheap |
| OpenAI | gpt-4.1 | Latest |
| OpenAI | gpt-4.1-mini | Balanced |
| Anthropic | claude-sonnet-4-20250514 | Claude Sonnet 4 |
| Anthropic | claude-opus-4-20250514 | Claude Opus 4 |
| Anthropic | claude-3-5-haiku-20241022 | Fast, cheap |
| Google | gemini-3-pro | Multimodal flagship |
| Google | gemini-3-flash | Fast multimodal |
| Google | gemini-2.5-pro | Strong reasoning |
If vision_model is empty, the main model from config is used. Set a dedicated vision model when:
- The main model doesn't support images (e.g. text-only)
- You want a cheaper model for image understanding
- You want the best vision quality regardless of chat model
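The fallback rule above can be sketched as a small helper. This is an illustration only: `resolve_vision_model` and the dictionary layout are hypothetical names, not DevClaw's internal API. Only the `vision_model` key and the "empty means main model" rule come from this page.

```python
def resolve_vision_model(media_cfg: dict, main_model: str) -> str:
    """Return the model that would handle images (hypothetical helper)."""
    # An empty or missing vision_model means "use the main chat model".
    dedicated = (media_cfg.get("vision_model") or "").strip()
    return dedicated or main_model


# No dedicated model configured: the main model is used.
print(resolve_vision_model({"vision_model": ""}, "glm-4.6v"))

# Text-only or expensive main model: a dedicated vision model takes over.
print(resolve_vision_model({"vision_model": "gpt-4o-mini"}, "claude-sonnet-4-20250514"))
```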
Vision Detail
- auto: Let the API decide based on image size
- low: Faster, fewer tokens (~85 tokens per image)
- high: Detailed analysis, more tokens (~1590 tokens for a 1092×1092 image)
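To get a feel for the cost difference between detail levels, here is a rough estimate built only from the approximate figures quoted above. The function name is hypothetical, the high figure applies to a 1092×1092 image, and actual token usage depends on the provider and the image dimensions.

```python
# Approximate per-image costs quoted above; real usage varies by provider and image size.
APPROX_TOKENS_PER_IMAGE = {"low": 85, "high": 1590}


def estimate_image_tokens(image_count: int, detail: str) -> int:
    """Rough token estimate for a batch of images (illustrative only)."""
    if detail not in APPROX_TOKENS_PER_IMAGE:
        # "auto" is resolved by the API, so no fixed estimate is possible here.
        raise ValueError(f"no fixed estimate for detail={detail!r}")
    return image_count * APPROX_TOKENS_PER_IMAGE[detail]


print(estimate_image_tokens(10, "low"))   # 850
print(estimate_image_tokens(10, "high"))  # 15900
```

At these figures, ten images at high detail cost roughly 19 times as many tokens as the same ten at low detail, which is why the budget-oriented setups favor low.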
Audio Transcription
Controls how voice messages and audio files are converted to text.
yaml
media:
  transcription_enabled: true
  transcription_model: "whisper-1"
  transcription_base_url: "https://api.openai.com/v1"
  transcription_api_key: ""     # empty = use main API key
  max_audio_size: 26214400      # 25MB
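Both size limits are given in bytes, using binary megabytes (1 MB = 1024 × 1024 bytes). The sketch below shows how the two default values are derived and how a size check might look. `within_limit` is a hypothetical helper, and treating the limit as inclusive is an assumption, not documented DevClaw behavior.

```python
MIB = 1024 * 1024

MAX_IMAGE_SIZE = 20 * MIB  # 20971520, the default max_image_size
MAX_AUDIO_SIZE = 25 * MIB  # 26214400, the default max_audio_size


def within_limit(size_bytes: int, limit_bytes: int) -> bool:
    """True if a file of size_bytes fits under the configured limit."""
    # Assumption: a file exactly at the limit is still accepted.
    return 0 <= size_bytes <= limit_bytes


print(MAX_IMAGE_SIZE, MAX_AUDIO_SIZE)
print(within_limit(30 * MIB, MAX_AUDIO_SIZE))  # a 30 MB voice note is too large
```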
Available Transcription Models
| Provider | Model | Base URL | Notes |
|---|---|---|---|
| Z.AI | glm-asr-2512 | https://api.z.ai/api/paas/v4 | Multilingual, CER 0.07, max 25MB |
| OpenAI | whisper-1 | https://api.openai.com/v1 | Legacy, widely compatible |
| OpenAI | gpt-4o-transcribe | https://api.openai.com/v1 | Best quality, logprobs support |
| OpenAI | gpt-4o-mini-transcribe | https://api.openai.com/v1 | Lighter, fast |
| Groq | whisper-large-v3 | https://api.groq.com/openai/v1 | 189x realtime speed, $0.11/hr |
| Groq | whisper-large-v3-turbo | https://api.groq.com/openai/v1 | 216x speed, $0.04/hr |
Choosing a Transcription Provider
- Z.AI GLM-ASR-2512: Best if you already use Z.AI as the main provider. Low CER, supports Chinese, English, and dialects.
- OpenAI GPT-4o Transcribe: Best quality; a diarization variant is also available.
- Groq Whisper: Fastest (189–216x realtime), cheapest, OpenAI-compatible endpoint.
- OpenAI Whisper-1: Reliable fallback, broadest output format support (SRT, VTT, verbose JSON).
Using a Different Transcription Provider
When the main LLM provider doesn't support audio transcription (e.g. Anthropic, xAI), set transcription_base_url and transcription_api_key:
yaml
# Main provider is Anthropic, transcription via Groq
api:
  provider: anthropic
  api_key: ${DEVCLAW_API_KEY}
media:
  transcription_enabled: true
  transcription_model: whisper-large-v3
  transcription_base_url: https://api.groq.com/openai/v1
  transcription_api_key: ${DEVCLAW_GROQ_API_KEY}
Store the separate API key in the vault:
code
vault_save groq_api_key gsk_xxxx
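As a rough illustration of how these settings combine, the sketch below resolves the endpoint and key for a transcription request. `resolve_transcription` is a hypothetical helper, not part of DevClaw. The `/audio/transcriptions` path follows the OpenAI-style API and is assumed to apply to any OpenAI-compatible provider, and the keys shown are placeholders.

```python
def resolve_transcription(media_cfg: dict, main_api_key: str) -> dict:
    """Pick the endpoint and key for transcription (hypothetical helper)."""
    base_url = media_cfg["transcription_base_url"].rstrip("/")
    # An empty transcription_api_key means "use the main API key".
    api_key = media_cfg.get("transcription_api_key") or main_api_key
    return {
        # OpenAI-style path; assumed to apply to any compatible provider.
        "url": f"{base_url}/audio/transcriptions",
        "model": media_cfg["transcription_model"],
        "headers": {"Authorization": f"Bearer {api_key}"},
    }


groq_cfg = {
    "transcription_model": "whisper-large-v3",
    "transcription_base_url": "https://api.groq.com/openai/v1",
    "transcription_api_key": "gsk_xxxx",  # placeholder, not a real key
}
settings = resolve_transcription(groq_cfg, main_api_key="main-key-placeholder")
print(settings["url"])
```

With a separate key set, the main provider's key is never sent to the transcription endpoint; with the key left empty, the main key is reused.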
Quick Setup Examples
Z.AI Full Stack (Vision + Audio)
yaml
media:
  vision_enabled: true
  vision_model: glm-4.6v
  vision_detail: auto
  transcription_enabled: true
  transcription_model: glm-asr-2512
  transcription_base_url: https://api.z.ai/api/paas/v4
OpenAI Full Stack
yaml
media:
  vision_enabled: true
  vision_model: gpt-4o
  vision_detail: auto
  transcription_enabled: true
  transcription_model: gpt-4o-transcribe
  transcription_base_url: https://api.openai.com/v1
Budget Setup (Cheap Vision + Fast Audio)
yaml
media:
  vision_enabled: true
  vision_model: gpt-4o-mini
  vision_detail: low
  transcription_enabled: true
  transcription_model: whisper-large-v3-turbo
  transcription_base_url: https://api.groq.com/openai/v1
  transcription_api_key: ${DEVCLAW_GROQ_API_KEY}