---
name: media-config
description: "Configure audio transcription and image/video understanding for channels"
---

Media Configuration

Configure how DevClaw processes images, videos, and audio received from messaging channels (WhatsApp, Telegram, Discord, WebUI).

Settings live in config.yaml under the media: section and can also be changed via WebUI → Configuração (Settings).

Vision (Image/Video Understanding)

Controls how images and video frames are described before being added to conversation context.

yaml
media:
  vision_enabled: true
  vision_model: ""          # empty = use main chat model
  vision_detail: "auto"     # auto | low | high
  max_image_size: 20971520  # 20MB
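
The size limit above is expressed in bytes. As a quick sanity check, the sketch below shows where 20971520 comes from and how an incoming image could be compared against it. The helper names are hypothetical, and whether DevClaw treats the limit as inclusive or rejects versus downsizes oversized images is not stated here.

```python
MIB = 1024 * 1024  # bytes per mebibyte


def mib_to_bytes(mib: int) -> int:
    """Convert a size in MiB to the byte count used in config.yaml."""
    return mib * MIB


def within_limit(size_bytes: int, max_bytes: int) -> bool:
    """Assumption: empty files are rejected and the limit is inclusive."""
    return 0 < size_bytes <= max_bytes


MAX_IMAGE_SIZE = mib_to_bytes(20)  # 20971520, the value shown in the config

print(within_limit(5 * MIB, MAX_IMAGE_SIZE))   # True
print(within_limit(21 * MIB, MAX_IMAGE_SIZE))  # False
```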

Available Vision Models

| Provider | Model | Notes |
|---|---|---|
| Z.AI | glm-4.6v | Flagship, 128K context, native tool use |
| Z.AI | glm-4.6v-flashx | Lightweight, affordable |
| Z.AI | glm-4.6v-flash | Free tier |
| Z.AI | glm-4.5v | 106B MoE, thinking mode |
| OpenAI | gpt-4o | Best quality |
| OpenAI | gpt-4o-mini | Fast, cheap |
| OpenAI | gpt-4.1 | Latest |
| OpenAI | gpt-4.1-mini | Balanced |
| Anthropic | claude-sonnet-4-20250514 | Claude Sonnet 4 |
| Anthropic | claude-opus-4-20250514 | Claude Opus 4 |
| Anthropic | claude-3-5-haiku-20241022 | Fast, cheap |
| Google | gemini-3-pro | Multimodal flagship |
| Google | gemini-3-flash | Fast multimodal |
| Google | gemini-2.5-pro | Strong reasoning |

If vision_model is empty, the main model from config is used. Set a dedicated vision model when:

  • The main model doesn't support images (e.g. text-only)
  • You want a cheaper model for image understanding
  • You want the best vision quality regardless of chat model
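
The fallback rule described above (empty vision_model means "use the main chat model") can be sketched as a small function. This is an illustration only: resolve_vision_model is a hypothetical helper, not DevClaw's actual code.

```python
def resolve_vision_model(media: dict, main_model: str) -> str:
    """Return the model that would describe images.

    An empty or missing vision_model falls back to the main chat model,
    as the documentation above describes.
    """
    configured = (media.get("vision_model") or "").strip()
    return configured or main_model


# Empty setting: the main model handles images.
print(resolve_vision_model({"vision_model": ""}, "gpt-4.1"))             # gpt-4.1
# Dedicated (cheaper) vision model overrides the main model.
print(resolve_vision_model({"vision_model": "gpt-4o-mini"}, "gpt-4.1"))  # gpt-4o-mini
```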

Vision Detail

  • auto: Let the provider API choose the detail level based on image size
  • low: Faster, fewer tokens (~85 tokens per image)
  • high: Detailed analysis, more tokens (~1590 tokens for a 1092×1092 image)
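
To get a feel for the cost difference between low and high, the rough per-image figures quoted above can be multiplied out. This is a back-of-the-envelope sketch: real token counts depend on the provider and the image dimensions, and auto is left out because the API decides per image.

```python
# Approximate per-image token figures taken from the list above.
APPROX_TOKENS_PER_IMAGE = {"low": 85, "high": 1590}


def estimate_image_tokens(image_count: int, detail: str) -> int:
    """Rough token budget for a batch of images at a fixed detail level."""
    if detail not in APPROX_TOKENS_PER_IMAGE:
        raise ValueError(f"no fixed estimate for vision_detail={detail!r}")
    return image_count * APPROX_TOKENS_PER_IMAGE[detail]


print(estimate_image_tokens(10, "low"))   # 850
print(estimate_image_tokens(10, "high"))  # 15900
```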

Audio Transcription

Controls how voice messages and audio files are converted to text.

yaml
media:
  transcription_enabled: true
  transcription_model: "whisper-1"
  transcription_base_url: "https://api.openai.com/v1"
  transcription_api_key: ""    # empty = use main API key
  max_audio_size: 26214400     # 25MB
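
To show how these settings fit together, here is a sketch of how a request to an OpenAI-compatible transcription endpoint could be assembled from them. It assumes the client appends the OpenAI-style /audio/transcriptions path to transcription_base_url and sends a Bearer token; build_transcription_request is a hypothetical helper and nothing is actually sent.

```python
def build_transcription_request(media: dict, main_api_key: str) -> dict:
    """Assemble URL, headers, and form fields for a transcription call."""
    base_url = media["transcription_base_url"].rstrip("/")
    # Empty transcription_api_key means "use the main API key".
    api_key = media.get("transcription_api_key") or main_api_key
    return {
        "url": f"{base_url}/audio/transcriptions",  # assumed OpenAI-style path
        "headers": {"Authorization": f"Bearer {api_key}"},
        "form": {"model": media["transcription_model"]},
    }


media = {
    "transcription_model": "whisper-1",
    "transcription_base_url": "https://api.openai.com/v1",
    "transcription_api_key": "",
}
request = build_transcription_request(media, main_api_key="sk-main")
print(request["url"])  # https://api.openai.com/v1/audio/transcriptions
```

The audio file itself would be attached as a multipart form field alongside model.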

Available Transcription Models

| Provider | Model | Base URL | Notes |
|---|---|---|---|
| Z.AI | glm-asr-2512 | https://api.z.ai/api/paas/v4 | Multilingual, CER 0.07, max 25MB |
| OpenAI | whisper-1 | https://api.openai.com/v1 | Legacy, widely compatible |
| OpenAI | gpt-4o-transcribe | https://api.openai.com/v1 | Best quality, logprobs support |
| OpenAI | gpt-4o-mini-transcribe | https://api.openai.com/v1 | Lighter, fast |
| Groq | whisper-large-v3 | https://api.groq.com/openai/v1 | 189x realtime speed, $0.11/hr |
| Groq | whisper-large-v3-turbo | https://api.groq.com/openai/v1 | 216x speed, $0.04/hr |

Choosing a Transcription Provider

  • Z.AI GLM-ASR-2512: Best if you already use Z.AI as the main provider. Low CER, supports Chinese, English, and dialects.
  • OpenAI GPT-4o Transcribe: Best quality; a variant with speaker diarization is also available.
  • Groq Whisper: Fastest (189–216x realtime), cheapest, OpenAI-compatible endpoint.
  • OpenAI Whisper-1: Reliable fallback, broadest format support (SRT, VTT, verbose JSON).
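
The Groq speed and price figures above translate into concrete numbers for a given amount of audio. The sketch below does that arithmetic using only the values quoted in this document; it ignores upload time, network latency, and any minimum billing increments, so treat the results as rough estimates.

```python
# Speed factor (x realtime) and price per audio hour, as quoted above.
GROQ_MODELS = {
    "whisper-large-v3": {"speed_factor": 189, "usd_per_hour": 0.11},
    "whisper-large-v3-turbo": {"speed_factor": 216, "usd_per_hour": 0.04},
}


def estimate_transcription(model: str, audio_seconds: float) -> dict:
    """Estimate processing time and cost for a given audio duration."""
    spec = GROQ_MODELS[model]
    return {
        "processing_seconds": audio_seconds / spec["speed_factor"],
        "cost_usd": audio_seconds / 3600 * spec["usd_per_hour"],
    }


for name in GROQ_MODELS:
    result = estimate_transcription(name, audio_seconds=3600)  # one hour of audio
    print(name, round(result["processing_seconds"], 1), round(result["cost_usd"], 2))
```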

Using a Different Transcription Provider

When the main LLM provider doesn't offer audio transcription (e.g. Anthropic, xAI), point transcription_base_url and transcription_api_key at a provider that does:

yaml
# Main provider is Anthropic, transcription via Groq
api:
  provider: anthropic
  api_key: ${DEVCLAW_API_KEY}

media:
  transcription_enabled: true
  transcription_model: whisper-large-v3
  transcription_base_url: https://api.groq.com/openai/v1
  transcription_api_key: ${DEVCLAW_GROQ_API_KEY}

Store the separate API key in the vault:

code
vault_save groq_api_key gsk_xxxx
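
The ${...} values in the config above read like environment-variable placeholders. How DevClaw actually resolves them, and how vault entries map onto them, is not specified here; the sketch below only illustrates the general idea, assuming a plain environment lookup. expand_env is a hypothetical helper.

```python
import os
import re

# Matches ${NAME} where NAME is a typical environment-variable identifier.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value: str, env=None) -> str:
    """Replace ${NAME} placeholders with values from the environment."""
    env = os.environ if env is None else env

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(replace, value)


fake_env = {"DEVCLAW_GROQ_API_KEY": "gsk_example"}
print(expand_env("${DEVCLAW_GROQ_API_KEY}", fake_env))  # gsk_example
```

Failing loudly on a missing variable, rather than substituting an empty string, avoids sending requests with a blank API key.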

Quick Setup Examples

Z.AI Full Stack (Vision + Audio)

yaml
media:
  vision_enabled: true
  vision_model: glm-4.6v
  vision_detail: auto
  transcription_enabled: true
  transcription_model: glm-asr-2512
  transcription_base_url: https://api.z.ai/api/paas/v4

OpenAI Full Stack

yaml
media:
  vision_enabled: true
  vision_model: gpt-4o
  vision_detail: auto
  transcription_enabled: true
  transcription_model: gpt-4o-transcribe
  transcription_base_url: https://api.openai.com/v1

Budget Setup (Cheap Vision + Fast Audio)

yaml
media:
  vision_enabled: true
  vision_model: gpt-4o-mini
  vision_detail: low
  transcription_enabled: true
  transcription_model: whisper-large-v3-turbo
  transcription_base_url: https://api.groq.com/openai/v1
  transcription_api_key: ${DEVCLAW_GROQ_API_KEY}
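
Before saving any of the setups above, it can help to check the media: section for obvious mistakes. The sketch below validates a config that has already been loaded into a dict. The key names and the allowed vision_detail values come from this document; the specific rules (such as requiring a base URL whenever transcription is enabled) are assumptions, not DevClaw's own validation.

```python
from urllib.parse import urlparse

VISION_DETAILS = {"auto", "low", "high"}


def validate_media(media: dict) -> list:
    """Return a list of human-readable problems; empty means no issues found."""
    problems = []
    if media.get("vision_enabled"):
        detail = media.get("vision_detail", "auto")
        if detail not in VISION_DETAILS:
            problems.append(f"vision_detail must be auto, low, or high (got {detail!r})")
    if media.get("transcription_enabled"):
        if not media.get("transcription_model"):
            problems.append("transcription_model is empty")
        url = urlparse(media.get("transcription_base_url", ""))
        if url.scheme not in ("http", "https") or not url.netloc:
            problems.append("transcription_base_url is not a valid http(s) URL")
    for key in ("max_image_size", "max_audio_size"):
        if key in media and (not isinstance(media[key], int) or media[key] <= 0):
            problems.append(f"{key} must be a positive number of bytes")
    return problems


budget = {
    "vision_enabled": True,
    "vision_model": "gpt-4o-mini",
    "vision_detail": "low",
    "transcription_enabled": True,
    "transcription_model": "whisper-large-v3-turbo",
    "transcription_base_url": "https://api.groq.com/openai/v1",
}
print(validate_media(budget))  # []
```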