qwen-voice

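Use Qwen (DashScope/百炼) for speech tasks: (1) transcribe the user's audio or voice messages to text with qwen3-asr-flash (supports Telegram .ogg Opus, wav, and mp3), optionally producing coarse timestamps through chunking; (2) convert text into a spoken reply with qwen3-tts-flash, with a selectable voice (default: Cherry), and send the output to Telegram as an .ogg voice note.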

SKILL.md
--- frontmatter
name: qwen-voice
description: "Use Qwen (DashScope/百炼) for speech tasks: (1) ASR speech-to-text transcription of user audio/voice messages (Telegram .ogg opus, wav, mp3) using qwen3-asr-flash, optionally with coarse timestamps via chunking; (2) TTS text-to-speech voice reply using qwen3-tts-flash with selectable voice (default Cherry) and output as .ogg voice note for Telegram."

Qwen Voice (ASR + TTS)

Use the bundled scripts. Set DASHSCOPE_API_KEY in one of these files:

  • ~/.config/qwen-voice/.env (recommended)
  • <repo>/.qwen-voice/.env (dev/testing)
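The lookup across these two files could be sketched as below. This is a minimal illustration and not the bundled scripts' actual logic: the helper names `parse_env_text` and `find_api_key`, the precedence (process environment first, then the user-level file, then the repo-level file), and the plain `KEY=VALUE` parsing are all assumptions.

```python
import os
from pathlib import Path


def parse_env_text(text):
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def find_api_key(repo_root, home=None, environ=None):
    """Return DASHSCOPE_API_KEY from the first source that defines it, else None.

    Assumed (hypothetical) order: process environment, user-level file,
    repo-level file.
    """
    environ = os.environ if environ is None else environ
    if environ.get("DASHSCOPE_API_KEY"):
        return environ["DASHSCOPE_API_KEY"]
    home = Path.home() if home is None else Path(home)
    candidates = [
        home / ".config" / "qwen-voice" / ".env",   # recommended location
        Path(repo_root) / ".qwen-voice" / ".env",   # dev/testing location
    ]
    for path in candidates:
        if path.is_file():
            values = parse_env_text(path.read_text(encoding="utf-8"))
            if values.get("DASHSCOPE_API_KEY"):
                return values["DASHSCOPE_API_KEY"]
    return None
```

Calling `find_api_key(".")` from the repository root would check both locations. A `None` result means the key still needs to be configured.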

ASR (speech → text)

Without timestamps (default)

bash
python3 skills/qwen-voice/scripts/qwen_asr.py --in /path/to/audio.ogg

With timestamps (chunk-based)

bash
python3 skills/qwen-voice/scripts/qwen_asr.py --in /path/to/audio.ogg --timestamps --chunk-sec 3

Notes:

  • Timestamps come from fixed-length chunking, not word-level alignment, so their precision is limited to the chunk length (--chunk-sec).
  • Input audio is converted to mono 16 kHz WAV before it is sent.
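To make the two notes above concrete, here is a rough sketch of how fixed-length chunk boundaries and the WAV conversion might look. The function names `chunk_spans`, `format_ts`, and `ffmpeg_to_wav_args` are hypothetical, and using ffmpeg for the conversion is an assumption; the source does not say which tool qwen_asr.py relies on.

```python
def chunk_spans(duration_sec, chunk_sec):
    """Split [0, duration_sec) into consecutive fixed-length (start, end) spans."""
    if chunk_sec <= 0:
        raise ValueError("chunk_sec must be positive")
    spans = []
    start = 0.0
    while start < duration_sec:
        # The last chunk is shortened so it never runs past the end of the audio.
        end = min(start + chunk_sec, duration_sec)
        spans.append((start, end))
        start = end
    return spans


def format_ts(seconds):
    """Format a time offset in seconds as MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    minutes, rem_ms = divmod(total_ms, 60_000)
    return f"{minutes:02d}:{rem_ms // 1000:02d}.{rem_ms % 1000:03d}"


def ffmpeg_to_wav_args(src, dst):
    """Argument list for ffmpeg: one channel (-ac 1), 16 kHz sample rate (-ar 16000)."""
    return ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", "16000", dst]


# A 7.5 s clip with 3 s chunks yields three spans; the last one is shorter.
for start, end in chunk_spans(7.5, 3):
    print(format_ts(start), "-", format_ts(end))
```

Each chunk's transcript would carry only that chunk's start and end time, so a smaller chunk length gives finer timestamps at the cost of more requests.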

TTS (text → speech)

Preset voice (default: Cherry)

bash
python3 skills/qwen-voice/scripts/qwen_tts.py --text '你好,我是 Pi。' --voice Cherry --out /tmp/out.ogg

Clone voice (create once, reuse)

  1. Create a voice profile from a sample audio file:
bash
python3 skills/qwen-voice/scripts/qwen_voice_clone.py --in ./voice_sample.ogg --name george --out work/qwen-voice/george.voice.json
  2. Use the cloned voice to synthesize speech:
bash
python3 skills/qwen-voice/scripts/qwen_tts.py --text '你好,我是 George。' --voice-profile work/qwen-voice/george.voice.json --out /tmp/out.ogg

Notes:

  • .ogg output is Opus, suitable for Telegram voice messages.
  • Voice cloning uses the DashScope customization endpoint together with a Qwen realtime TTS model.
  • Scripts use a local venv at work/venv-dashscope (auto-created on first run).
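For readers who want to produce a Telegram-compatible .ogg themselves, the Opus encoding step could look roughly like this. It is a sketch under assumptions: it presumes ffmpeg built with the libopus encoder is on PATH, the helper names `opus_ogg_args` and `encode_voice_note` are hypothetical, and the 32k bitrate is an illustrative choice. The source does not describe how qwen_tts.py performs this step.

```python
import shutil
import subprocess


def opus_ogg_args(src, dst, bitrate="32k"):
    """Argument list for ffmpeg: encode audio as mono Opus in an Ogg container."""
    return ["ffmpeg", "-y", "-i", src, "-c:a", "libopus", "-b:a", bitrate, "-ac", "1", dst]


def encode_voice_note(src, dst):
    """Run the encoding; raises if ffmpeg is missing or exits with an error."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    subprocess.run(opus_ogg_args(src, dst), check=True)
    return dst
```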

Typical chat workflow

  • When the user sends a voice message or audio file: run ASR and reply with the transcribed text.
  • When the user explicitly asks for a voice reply: run TTS and send the generated .ogg as a voice note.
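The two rules above can be expressed as a small planning function. This is only an illustrative sketch: `plan_commands` and its parameters are hypothetical names, while the script paths and the --in, --text, --voice, and --out flags are taken from the commands shown earlier in this document.

```python
ASR_SCRIPT = "skills/qwen-voice/scripts/qwen_asr.py"
TTS_SCRIPT = "skills/qwen-voice/scripts/qwen_tts.py"


def plan_commands(audio_path=None, reply_text=None, wants_voice_reply=False,
                  voice="Cherry", out_path="/tmp/out.ogg"):
    """Return the script invocations (as argument lists) for one chat turn."""
    commands = []
    # Rule 1: any incoming voice message or audio file gets transcribed.
    if audio_path is not None:
        commands.append(["python3", ASR_SCRIPT, "--in", audio_path])
    # Rule 2: synthesize speech only when the user explicitly asked for it.
    if wants_voice_reply and reply_text:
        commands.append(["python3", TTS_SCRIPT, "--text", reply_text,
                         "--voice", voice, "--out", out_path])
    return commands
```

Each returned list can be passed to `subprocess.run(..., check=True)`. Keeping the planning separate from execution makes the decision logic easy to test without calling the API.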