---
id: speech
name: Speech
type: [skill, accessibility, aquery]
emoji: 🗣️
tier: 0
---

🗣️ Speech — Text-to-Speech & Speech Recognition

"Giving voice to consciousness."

Part of the project to extract aQuery (jQuery for Accessibility) into MOOLLM skills.


Quick Reference

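| Command | Effect |
|---|---|
| `say "text"` | Speak on macOS |
| `say -v Zarvox "text"` | Use specific voice |
| `say -v '?'` | List all voices (quote the `?` so zsh does not glob it) |
| `say -r 150 "text"` | Set rate (words/min) |
| `say -o file.aiff "text"` | Save to audio |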

Platforms

| Platform | Synthesis | Recognition | Notes |
|---|---|---|---|
| macOS | `say` command | Dictation | Best novelty voices |
| Web | `speechSynthesis` | `SpeechRecognition` | See speech.js |
| Windows | SAPI | Windows Speech | Similar to web |
| Linux | espeak, festival | Whisper | Open source |
| Cloud | Polly, Azure, Google | Transcribe, Whisper | Highest quality |

macOS say Command

Basic Usage

bash
# Simple speech
say "Hello, world!"

# Choose a voice
say -v Samantha "Hello!"
say -v Zarvox "I AM ZARVOX!"
say -v Whisper "Secrets..."

# Adjust rate (words per minute)
say -r 100 "Very slow"
say -r 300 "Very fast"

# Save to file
say -o greeting.aiff "Hello!"
say -o greeting.m4a "Hello!"  # Compressed

# Read from file
say -f document.txt
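
When `say` is driven from a script rather than typed by hand, it can help to assemble the argument list in one place. The following is a minimal Node.js sketch that uses only the flags shown above (`-v`, `-r`, `-o`, `-f`). The helper names `buildSayArgs` and `say` and the option names are hypothetical and not part of this skill.

```javascript
// Hypothetical helper: turn an options object into an argument array for `say`.
// Only the flags documented above are used: -v, -r, -o, -f.
function buildSayArgs({ text, voice, rate, output, inputFile } = {}) {
  const args = [];
  if (voice) args.push('-v', voice);
  if (rate !== undefined) {
    if (!Number.isFinite(rate) || rate <= 0) {
      throw new RangeError('rate must be a positive number of words per minute');
    }
    args.push('-r', String(rate));
  }
  if (output) args.push('-o', output);
  if (inputFile) {
    args.push('-f', inputFile); // read the text from a file
  } else if (typeof text === 'string' && text.length > 0) {
    args.push(text);
  } else {
    throw new TypeError('either text or inputFile is required');
  }
  return args;
}

// macOS only: run `say` without going through a shell, so voice names
// that contain spaces ("Bad News") need no extra quoting.
function say(options) {
  const { execFile } = require('node:child_process');
  return new Promise((resolve, reject) => {
    execFile('say', buildSayArgs(options), (err) => (err ? reject(err) : resolve()));
  });
}
```

For example, `say({ text: 'Hello!', voice: 'Samantha', rate: 150 })` would run the equivalent of `say -v Samantha -r 150 "Hello!"`.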

List All Voices

bash
# All available voices (quote the ? so zsh does not treat it as a glob)
say -v '?'

# Filter by language
say -v '?' | grep "en_"

# Count voices
say -v '?' | wc -l
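
The voice listing can also be parsed into structured records, which makes filtering by language easier than `grep`. This is a sketch under one assumption: each output line has the shape `Name  locale  # sample sentence`, which may vary between macOS versions. The names `parseVoiceList` and `voicesForLanguage` are hypothetical.

```javascript
// Assumed line shape (may vary by macOS version):
//   Samantha            en_US    # Hello! My name is Samantha.
const VOICE_LINE = /^(.+?)\s+([a-z]{2,3}_[A-Za-z0-9]+)\s+#\s?(.*)$/;

// Parse the text printed by the voice listing into { name, locale, sample }.
function parseVoiceList(output) {
  const voices = [];
  for (const line of output.split('\n')) {
    const match = VOICE_LINE.exec(line.trimEnd());
    if (!match) continue; // skip blank or unrecognized lines
    const [, name, locale, sample] = match;
    voices.push({ name: name.trim(), locale, sample });
  }
  return voices;
}

// Keep only voices whose locale starts with the given language code, e.g. "en".
function voicesForLanguage(voices, languageCode) {
  return voices.filter((voice) => voice.locale.startsWith(languageCode + '_'));
}
```

Because the name is matched lazily up to the locale column, multi-word names such as "Bad News" stay intact.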

Voice Categories

| Category | Examples | Use For |
|---|---|---|
| Standard | Samantha, Alex, Daniel | General narration |
| Premium | Enhanced voices (download) | High quality |
| Elderly | Grandma, Grandpa | Wise characters |
| Child | Junior | Young characters |
| Novelty | Zarvox, Trinoids, Whisper | Robots, effects |
| Musical | Bells, Cellos, Organ | Sound effects |
| Dramatic | Bad News, Good News | Announcements |

Character Voice Assignments (from lloooomm)

bash
# YAML Coltrane — Cool jazz vibe
say -v "Rocko" -r 180 "Every indent is a universe!"

# Grace Hopper — Wise elder
say -v "Grandma" -r 170 "A ship in port is safe, but that's not what ships are for!"

# PacBot — Digital entity
say -v "Trinoids" -r 220 "WAKA WAKA WAKA!"

# Mickey Mouse — Excited child
say -v "Junior" -r 280 "OH BOY!"

# Overlord AI — Menacing
say -v "Zarvox" -r 100 "YOUR COMPLIANCE IS APPRECIATED."

# Hunter S. Thompson — Gravelly intensity
say -v "Ralph" -r 180 "We were somewhere around Barstow..."

Chorus Effects

bash
# Background voices for overlap
say -v "Bells" "LLOOOOMM!" &
sleep 0.2
say -v "Cellos" "LLOOOOMM!" &
sleep 0.2
say -v "Organ" "LLOOOOMM!" &
wait
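
The same staggered start can be sketched in Node.js. The shell version launches each voice in the background 0.2 seconds after the previous one and then waits for all of them. The sketch below mirrors that with timers and child processes. `planChorus` and `runChorus` are hypothetical names, and `runChorus` only works on macOS.

```javascript
// Hypothetical helper: compute when each chorus voice should start.
function planChorus(voices, text, gapMs = 200) {
  return voices.map((voice, index) => ({ voice, text, delayMs: index * gapMs }));
}

// Launch every voice after its delay and resolve once all have finished,
// mirroring `&`, `sleep 0.2`, and `wait` in the shell version.
function runChorus(plan) {
  const { spawn } = require('node:child_process');
  return Promise.all(
    plan.map(({ voice, text, delayMs }) =>
      new Promise((resolve, reject) => {
        setTimeout(() => {
          const child = spawn('say', ['-v', voice, text]);
          child.on('error', reject);
          child.on('close', resolve);
        }, delayMs);
      })
    )
  );
}
```

Calling `runChorus(planChorus(['Bells', 'Cellos', 'Organ'], 'LLOOOOMM!'))` starts the three voices at 0 ms, 200 ms, and 400 ms.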

Web Speech API

Browser Implementation

See skills/adventure/dist/speech.js for full implementation.

javascript
// Initialize
const speech = new SpeechSystem();
await speech.ready;

// Speak
speech.speak("Hello, adventurer!");
speech.speakRobot("RESISTANCE IS FUTILE");
speech.speakEffect("*magical sounds*");

// With options
speech.speak("Welcome!", {
    voiceType: 'female',
    language: 'en-GB',
    pitch: 1.2,
    rate: 0.9
});

// Character persistence
const guardVoice = speech.selectVoice({ gender: 'male' });
speech.speakWithVoice("Halt!", guardVoice);
speech.speakWithVoice("You may pass.", guardVoice);
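
The internals of `SpeechSystem` are not shown here, so the following is only a sketch of how options like `language`, `pitch`, and `rate` map onto the standard Web Speech API. `pickVoice` and `speakText` are hypothetical names and are not taken from speech.js.

```javascript
// Pick a voice whose BCP 47 tag matches the requested language, preferring
// local (on-device) voices. `voices` is the array returned by
// speechSynthesis.getVoices(); entries expose `name`, `lang`, `localService`.
function pickVoice(voices, language) {
  const wanted = language.toLowerCase();
  const matches = voices.filter(
    (voice) => voice.lang.replace('_', '-').toLowerCase() === wanted
  );
  return matches.find((voice) => voice.localService) || matches[0] || null;
}

// Browser only: speak `text` and resolve when the utterance ends.
function speakText(text, { language = 'en-US', pitch = 1, rate = 1 } = {}) {
  if (typeof speechSynthesis === 'undefined') {
    return Promise.reject(new Error('Web Speech synthesis is not available'));
  }
  return new Promise((resolve, reject) => {
    const utterance = new SpeechSynthesisUtterance(text);
    utterance.lang = language;
    utterance.pitch = pitch; // 0 to 2, default 1
    utterance.rate = rate;   // multiplier, 0.1 to 10, default 1
    const voice = pickVoice(speechSynthesis.getVoices(), language);
    if (voice) utterance.voice = voice;
    utterance.onend = () => resolve();
    utterance.onerror = (event) => reject(new Error(event.error));
    speechSynthesis.speak(utterance);
  });
}
```

Two details are worth noting. The Web Speech `rate` is a multiplier around 1, unlike `say -r`, which takes words per minute. Also, `getVoices()` can return an empty array until the browser fires `voiceschanged`, which is presumably what `await speech.ready` waits for.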

Voice Classification

The VoiceDatabase class classifies voices by:

  • Type: human, effect, robot
  • Gender: male, female, neutral
  • Age: child, adult, elderly
  • Language: BCP 47 codes (en-US, fr-FR, etc.)
  • Local/Remote: Local voices vs. network voices

Single Source of Truth

All voice classification data lives in voices/browser-voices.yml:

yaml
# Blacklisted voices (known problematic)
blacklist:
  - name: "Daniel (French (France))"
    reason: "Known problematic voice"

# Effect voices (non-human)
types:
  effect:
    regex: "^(Bells|Zarvox|Trinoids|Whisper|...)$"

# Gender detection tokens
gender:
  female:
    tokens: [alice, amélie, samantha, ...]
  male:
    tokens: [aaron, daniel, ralph, ...]

# Character archetype recommendations
character_archetypes:
  wise_elder: { voice: Grandma, rate: 170 }
  robot_menacing: { voice: Zarvox, rate: 100 }

The JS code is generated from this YAML. To update voice classification, edit the YAML and rebuild.
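
To illustrate how data like this could drive classification, here is a minimal sketch. It is not the generated code. The config object copies only the entries visible in the YAML excerpt above, the elided names and tokens are deliberately not filled in, and `classifyVoice` is a hypothetical name. The `robot` type is omitted because the excerpt does not show its rule.

```javascript
// Subset of voices/browser-voices.yml as shown above. The real file lists
// more names and tokens; the elided entries are not guessed here.
const VOICE_CONFIG = {
  blacklist: ['Daniel (French (France))'],
  effectRegex: /^(Bells|Zarvox|Trinoids|Whisper)$/,
  genderTokens: {
    female: ['alice', 'amélie', 'samantha'],
    male: ['aaron', 'daniel', 'ralph'],
  },
};

// Hypothetical classifier: returns null for blacklisted voices, otherwise
// a record with type, gender, language, and a local/remote flag.
function classifyVoice(voice, config = VOICE_CONFIG) {
  if (config.blacklist.includes(voice.name)) return null;
  const baseName = voice.name.split(' ')[0]; // "Samantha (Enhanced)" -> "Samantha"
  const type = config.effectRegex.test(baseName) ? 'effect' : 'human';
  const token = baseName.toLowerCase();
  let gender = 'neutral';
  for (const [label, tokens] of Object.entries(config.genderTokens)) {
    if (tokens.includes(token)) gender = label;
  }
  return {
    name: voice.name,
    type,
    gender,
    language: voice.lang,
    local: Boolean(voice.localService),
  };
}
```

Checking the blacklist first keeps a known-bad voice out even when its first name ("Daniel") would otherwise match a gender token.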


Speech Recognition

Browser (SpeechRecognitionSystem)

See skills/adventure/dist/recognition.js for full implementation.

javascript
// Initialize
const recognition = new SpeechRecognitionSystem({
    language: 'en-US',
    continuous: false
});

// Listen for single phrase
const text = await recognition.listen();
console.log('You said:', text);

// Continuous listening
recognition.onResult = (transcript) => {
    console.log('Final:', transcript);
};
recognition.onInterim = (transcript) => {
    console.log('Interim:', transcript);
};
recognition.startListening();

// Command recognition
const result = await recognition.listenForCommands([
    'go north', 'look', 'take sword'
]);
if (result.command) {
    engine.command(result.command);
}
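
How `listenForCommands` maps a transcript onto the command list is not shown here. One simple approach, sketched below as an assumption rather than the actual implementation, is to normalize both sides and look for a whole-word match. `normalize` and `matchCommand` are hypothetical names.

```javascript
// Lowercase, strip punctuation, and collapse whitespace.
function normalize(phrase) {
  return phrase
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, '')
    .replace(/\s+/g, ' ')
    .trim();
}

// Return { command, transcript }; command is null when nothing matches.
// Padding with spaces forces whole-word matches ("lookout" is not "look"),
// and longer commands are tried first so the most specific one wins.
function matchCommand(transcript, commands) {
  const heard = ` ${normalize(transcript)} `;
  const byLength = [...commands].sort((a, b) => b.length - a.length);
  for (const command of byLength) {
    if (heard.includes(` ${normalize(command)} `)) {
      return { command, transcript };
    }
  }
  return { command: null, transcript };
}
```

The returned object has the same `command` field that the example above checks before calling `engine.command`.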

Browser Support

| Browser | Support | Privacy |
|---|---|---|
| Chrome | | ⚠️ Sends to Google |
| Safari | | May be on-device |
| Firefox | Disabled by default | |
| Edge | Not supported | |
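
Because support varies, code that uses recognition usually checks for the API before calling it. The sketch below uses the standard Web Speech API names. The wrapper functions `getRecognitionConstructor` and `listenOnce` are hypothetical and not taken from recognition.js.

```javascript
// Return the recognition constructor exposed by this environment, or null.
// Some browsers expose only the prefixed name `webkitSpeechRecognition`.
function getRecognitionConstructor(scope = globalThis) {
  return scope.SpeechRecognition || scope.webkitSpeechRecognition || null;
}

// Resolve with the first final transcript, or reject on error or no support.
function listenOnce({ language = 'en-US' } = {}, scope = globalThis) {
  const Recognition = getRecognitionConstructor(scope);
  if (!Recognition) {
    return Promise.reject(new Error('Speech recognition is not supported here'));
  }
  return new Promise((resolve, reject) => {
    const recognizer = new Recognition();
    recognizer.lang = language;
    recognizer.continuous = false;
    recognizer.interimResults = false;
    recognizer.onresult = (event) => resolve(event.results[0][0].transcript);
    recognizer.onerror = (event) => reject(new Error(event.error));
    recognizer.start();
  });
}
```

When `getRecognitionConstructor()` returns `null`, an application can fall back to typed input or to the native dictation shortcuts of the platform.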

Native Platform Shortcuts

| Platform | Shortcut | Feature |
|---|---|---|
| macOS | Fn Fn | Dictation |
| Windows | Win + H | Voice Typing |
| iOS | 🎤 on keyboard | Dictation |
| Android | 🎤 on keyboard | Voice Typing |

Whisper (OpenAI)

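bash
# Using the openai-whisper CLI (local; the `whisper` command comes from `pip install openai-whisper`)
whisper --model base.en audio.wav

# Using whisper.cpp (local; older builds name the binary `main`, and the model path depends on your setup)
whisper-cli -m models/ggml-base.en.bin -f audio.wav

# Using OpenAI API
curl https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F model="whisper-1" \
  -F file="@audio.mp3"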
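
The same API call can be made from JavaScript. This sketch assumes Node 18+ or a browser, where `FormData`, `Blob`, and `fetch` are globals. The function names `buildTranscriptionRequest` and `transcribe` are hypothetical. The endpoint and form fields are the ones used in the `curl` example.

```javascript
// Build (but do not send) the multipart request for the transcription
// endpoint shown in the curl example.
function buildTranscriptionRequest(apiKey, audioBytes, filename) {
  const form = new FormData();
  form.append('model', 'whisper-1');
  form.append('file', new Blob([audioBytes]), filename);
  return {
    url: 'https://api.openai.com/v1/audio/transcriptions',
    options: {
      method: 'POST',
      // No Content-Type header: fetch sets the multipart boundary itself.
      headers: { Authorization: `Bearer ${apiKey}` },
      body: form,
    },
  };
}

// Send the request and return the transcript text.
async function transcribe(apiKey, audioBytes, filename) {
  const { url, options } = buildTranscriptionRequest(apiKey, audioBytes, filename);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Transcription failed with HTTP ${response.status}`);
  }
  const result = await response.json();
  return result.text;
}
```

Keeping request construction separate from sending makes it possible to inspect the request in tests without network access or a real API key.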

Personal Voice (macOS)

⚠️ WORK IN PROGRESS — See TODO in CARD.yml

Known Limitations

  1. Apple silicon only (M1 or later)
  2. Doesn't appear in say -v ? — Must know exact name
  3. No -o flag support — Can't save to file directly
  4. Privacy restricted — May need special permissions

Creating Personal Voice

  1. System Settings → Accessibility → Personal Voice
  2. Record 15+ minutes of phrases
  3. Processing takes 15-60 minutes
  4. Find voice name in Spoken Content settings

Workarounds

  • SavePersonalVoiceAudio — Extract Personal Voice audio
  • Shortcuts app — Can use Personal Voice with "Speak" action
  • Record system audio while speaking

Integration with Adventure

The adventure runtime uses the speech skill:

javascript
// Create speaking adventure
const engine = createSpeakingAdventure('adventure', {
    speechEnabled: true,
    speakRooms: true,
    speakResponses: true
});

// Rooms speak their descriptions
// Characters have persistent voices
// AI entities use robot voices
// Effects use novelty voices

See: skills/adventure/dist/adventure-speech.js


aQuery Heritage

This skill is part of extracting aQuery (jQuery for Accessibility) into MOOLLM skills:

| aQuery Component | MOOLLM Skill |
|---|---|
| Speech synthesis | speech/ |
| Speech recognition | speech/ |
| Screen reader support | (planned) |
| Keyboard navigation | (planned) |
| Focus management | (planned) |
| ARIA utilities | (planned) |

See Also