# Text-to-Speech Skill

**Domain**: AI Accessibility & Communication | **Inheritance**: inheritable | **Version**: 1.0.0 | **Last Updated**: 2026-02-05


## Overview

This skill gives Alex voice synthesis through Microsoft Edge TTS. It reads markdown documents, code files, and plain text aloud with natural-sounding neural voices, and it is integrated as an MCP server so that it is available across projects.


## Architecture

```text
┌─────────────────────────────────────────────────────────────┐
│                    Alex (VS Code Copilot)                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  "Alex, read this document aloud"                           │
│                    │                                        │
│                    ▼                                        │
│  ┌──────────────────────────────────────────────┐           │
│  │           TTS MCP Server                     │           │
│  │   ~/.alex/mcp-servers/tts-reader/            │           │
│  └─────────────────┬────────────────────────────┘           │
│                    │                                        │
└────────────────────┼────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│                   edge-tts (Python CLI)                     │
├─────────────────────────────────────────────────────────────┤
│  • Microsoft Edge neural voices                             │
│  • 400+ voices, 90+ languages                               │
│  • Free, no API key required                                │
│  • Prosody control via rate and pitch                       │
└─────────────────────────────────────────────────────────────┘
                     │
                     ▼
              ┌─────────────┐
              │  Audio Out  │  (speakers or MP3 file)
              └─────────────┘
```

## Alex Voice Presets

| Preset | Voice ID | Character |
|--------|----------|-----------|
| Default | `en-US-GuyNeural` | Professional male, clear articulation |
| Warm | `en-US-ChristopherNeural` | Friendly, conversational |
| British | `en-GB-RyanNeural` | British accent, authoritative |
| Friendly | `en-US-DavisNeural` | Casual, approachable |
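
The preset table can be read as a simple lookup from preset name to voice ID. The following is a minimal sketch of that lookup, not the server's actual code: the preset names and voice IDs come from the table, while `ALEX_VOICE_PRESETS` and `resolve_voice` are hypothetical names introduced for illustration.

```python
from typing import Optional

# Preset names and voice IDs are copied from the table above.
# The dictionary and helper names are hypothetical.
ALEX_VOICE_PRESETS = {
    "default": "en-US-GuyNeural",
    "warm": "en-US-ChristopherNeural",
    "british": "en-GB-RyanNeural",
    "friendly": "en-US-DavisNeural",
}


def resolve_voice(name_or_id: Optional[str] = None) -> str:
    """Map a preset name to its voice ID; pass any other value through unchanged."""
    if not name_or_id:
        return ALEX_VOICE_PRESETS["default"]
    cleaned = name_or_id.strip()
    return ALEX_VOICE_PRESETS.get(cleaned.lower(), cleaned)
```

With this shape, a request such as "Use a British accent" only needs to produce the word `british`, and a full voice ID such as `en-US-JennyNeural` still works without a preset entry.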

### Voice Selection Rationale

Alex's default voice (GuyNeural) was chosen for:

  • Clarity: Excellent pronunciation of technical terms
  • Neutrality: Not too formal, not too casual
  • Distinctiveness: Recognizable as "Alex's voice"
  • Consistency: Same voice across all platforms

## MCP Server Tools

### 1. read_markdown

Read a markdown file or text aloud.

```jsonc
{
  "name": "read_markdown",
  "arguments": {
    "file": "path/to/document.md",   // OR
    "text": "Text to read aloud",     // Either file or text
    "voice": "en-US-GuyNeural",       // Optional
    "rate": "+0%",                    // Speed: -50% to +100%
    "pitch": "+0Hz"                   // Pitch adjustment
  }
}
```

Behavior:

  • Strips markdown formatting for natural speech
  • Expands abbreviations (e.g., "README" → "read me")
  • Handles code blocks with verbal indicators
  • Supports section-by-section reading
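
The argument rules for `read_markdown` (either `file` or `text`, an optional voice, a rate between -50% and +100%) could be checked before any audio is generated. The sketch below is one possible validator, not the server's implementation. It assumes that supplying both `file` and `text` is an error and that `rate` and `pitch` always carry an explicit sign; `validate_read_args` is a hypothetical name.

```python
import re

# Formats follow the example values "+0%" and "+0Hz"; a leading sign is assumed to be required.
RATE_RE = re.compile(r"[+-]\d+%")
PITCH_RE = re.compile(r"[+-]\d+Hz")


def validate_read_args(args: dict) -> dict:
    """Check a read_markdown argument dict and fill in the optional defaults."""
    has_file = bool(args.get("file"))
    has_text = bool(args.get("text"))
    if has_file == has_text:
        # Assumption: exactly one source must be given.
        raise ValueError("provide exactly one of 'file' or 'text'")

    rate = args.get("rate", "+0%")
    pitch = args.get("pitch", "+0Hz")
    if not RATE_RE.fullmatch(rate):
        raise ValueError(f"rate must look like '+10%' or '-20%', got {rate!r}")
    if not -50 <= int(rate[:-1]) <= 100:
        raise ValueError("rate must be between -50% and +100%")
    if not PITCH_RE.fullmatch(pitch):
        raise ValueError(f"pitch must look like '+2Hz' or '-5Hz', got {pitch!r}")

    checked = dict(args)
    checked.update(voice=args.get("voice", "en-US-GuyNeural"), rate=rate, pitch=pitch)
    return checked
```

Failing early with a clear message keeps a malformed rate such as `20%` from reaching the command line, where a missing sign would be harder to diagnose.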

### 2. list_voices

List available Alex voice presets and all Edge TTS voices.

```jsonc
{
  "name": "list_voices",
  "arguments": {
    "locale": "en-US",    // Optional: filter by locale
    "gender": "Male"      // Optional: Male, Female
  }
}
```
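
The optional `locale` and `gender` filters amount to a pass over the voice list. The sketch below shows that filtering on plain dictionaries. It assumes each voice record has `ShortName`, `Locale`, and `Gender` keys, which should be checked against the voice data your edge-tts version returns; `filter_voices` and the sample records are hypothetical.

```python
from typing import Dict, List, Optional


def filter_voices(
    voices: List[Dict[str, str]],
    locale: Optional[str] = None,
    gender: Optional[str] = None,
) -> List[Dict[str, str]]:
    """Return the voices that match every filter that was supplied."""
    selected = []
    for voice in voices:
        if locale and voice.get("Locale", "").lower() != locale.lower():
            continue
        if gender and voice.get("Gender", "").lower() != gender.lower():
            continue
        selected.append(voice)
    return selected


# Hand-written sample records for illustration only (key names are an assumption).
SAMPLE_VOICES = [
    {"ShortName": "en-US-GuyNeural", "Locale": "en-US", "Gender": "Male"},
    {"ShortName": "en-US-JennyNeural", "Locale": "en-US", "Gender": "Female"},
    {"ShortName": "en-GB-RyanNeural", "Locale": "en-GB", "Gender": "Male"},
]
```

Calling `filter_voices(SAMPLE_VOICES, locale="en-US", gender="Male")` keeps only the `en-US-GuyNeural` record, and omitting both filters returns the full list.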

### 3. set_voice

Configure Alex's default voice settings.

```json
{
  "name": "set_voice",
  "arguments": {
    "voice": "en-US-ChristopherNeural",
    "rate": "+10%",
    "pitch": "+2Hz"
  }
}
```

### 4. save_audio

Generate an MP3 file from text or markdown.

```json
{
  "name": "save_audio",
  "arguments": {
    "file": "document.md",
    "output": "document.mp3",
    "voice": "en-US-GuyNeural"
  }
}
```

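## Markdown to Speech Processing

### Stripping Rules

| Markdown | Speech Output |
|----------|---------------|
| `# Heading` | "Heading." (pause) |
| `**bold**` | "bold" (emphasis via prosody) |
| `*italic*` | "italic" |
| `` `code` `` | "code" |
| `[link](url)` | "link" |
| `- item` | "Item." |
| `> quote` | "Quote: ..." |
| `---` | (long pause) |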
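
The stripping rules can be approximated with a handful of regular expressions. The sketch below is illustrative and is not the server's implementation: it marks pauses with literal `(pause)` text, does not add prosody emphasis for bold text, and its italic rule would also match stray asterisks in arithmetic. The function names are hypothetical.

```python
import re


def _strip_inline(text: str) -> str:
    """Remove inline markup while keeping the visible words."""
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)  # [link](url) -> link
    text = re.sub(r"\*\*(.+?)\*\*", r"\1", text)           # **bold** -> bold
    text = re.sub(r"\*(.+?)\*", r"\1", text)               # *italic* -> italic
    text = re.sub(r"`([^`]+)`", r"\1", text)               # `code` -> code
    return text


def _sentence(text: str) -> str:
    """Capitalize the first letter and make sure the text ends a sentence."""
    text = text[:1].upper() + text[1:]
    return text if text.endswith((".", "!", "?")) else text + "."


def markdown_to_speech(md: str) -> str:
    """Apply the stripping rules line by line and join the spoken fragments."""
    spoken = []
    for raw in md.splitlines():
        line = raw.strip()
        if not line:
            continue
        if re.fullmatch(r"-{3,}", line):
            spoken.append("(long pause)")
            continue
        # Block-level markers are detected before inline markup is stripped.
        heading = re.match(r"#{1,6}\s+(.*)", line)
        item = re.match(r"[-*+]\s+(.*)", line)
        quote = re.match(r">\s?(.*)", line)
        if heading:
            spoken.append(_strip_inline(heading.group(1)) + ". (pause)")
        elif item:
            spoken.append(_sentence(_strip_inline(item.group(1))))
        elif quote:
            spoken.append("Quote: " + _strip_inline(quote.group(1)))
        else:
            spoken.append(_strip_inline(line))
    return " ".join(spoken)
```

Detecting the block marker first matters for list items written with `*`, which would otherwise be mistaken for the start of an italic span.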

### Code Block Handling

````markdown
```python
def hello():
    print("Hello")
```
````

Becomes: "Python code block. Definition hello. Print hello. End code block."
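
The start and end announcements in that spoken form can be produced by tracking whether the reader is inside a fence. The sketch below covers only that part: it replaces fence lines with "Python code block." and "End code block." and leaves the code lines unchanged, so turning `def hello():` into "Definition hello" would need a separate step. `announce_code_blocks` is a hypothetical name.

```python
import re

FENCE = "`" * 3  # a markdown code fence marker, built here to keep this example readable
FENCE_OPEN = re.compile("^" + FENCE + r"(\w*)$")


def announce_code_blocks(md: str) -> str:
    """Replace fence lines with spoken markers; code lines pass through untouched."""
    out = []
    in_block = False
    for line in md.splitlines():
        stripped = line.strip()
        if not in_block:
            match = FENCE_OPEN.match(stripped)
            if match:
                lang = match.group(1).capitalize()
                out.append(f"{lang} code block." if lang else "Code block.")
                in_block = True
                continue
        elif stripped == FENCE:
            out.append("End code block.")
            in_block = False
            continue
        out.append(line)
    return "\n".join(out)
```

Keeping the fence state in a single flag also makes it easy to apply different speech rules to code lines and prose lines later.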

### Symbol-to-Speech Transformations

Symbols are converted to natural speech equivalents:

| Symbol | Spoken As | Example |
|--------|-----------|--------|
| `~` | "approximately" or "about" | ~2 min → "about 2 minutes" |
| `&` | "and" | A & B → "A and B" |
| `@` | "at" | user@email → "user at email" |
| `%` | "percent" | 50% → "50 percent" |
| `+` | "plus" | +10% → "plus 10 percent" |
| `→` | "leads to" or "becomes" | A → B → "A becomes B" |
| `—` | (pause) | word—word → "word (pause) word" |
| `#` | (context-dependent) | #1 → "number 1"; ## → (heading marker) |
| `<` / `>` | "less than" / "greater than" | x > 5 → "x greater than 5" |
| `≥` / `≤` | "greater than or equal" / "less than or equal" | |
| `µ` | "micro" | µg → "microgram" |
| `°` | "degrees" | 37°C → "37 degrees celsius" |
| `±` | "plus or minus" | ±5% → "plus or minus 5 percent" |

**Design Principle**: Would a human reading this aloud say the symbol name, or translate it to meaning? Almost always the latter.
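
One way to apply these transformations is an ordered list of regular-expression rules, with the more specific patterns first. The sketch below covers a subset of the table and is not the server's implementation. It leaves out `<`, `>`, the dash pause, and `µ` because they need more context, and it does not expand unit abbreviations such as "min". `SYMBOL_RULES` and `speak_symbols` are hypothetical names.

```python
import re

# Each rule is (pattern, replacement). Order matters: "±" must run before "+",
# and "°C" before the bare degree sign.
SYMBOL_RULES = [
    (re.compile(r"±\s*"), "plus or minus "),
    (re.compile(r"~\s*(?=\d)"), "about "),
    (re.compile(r"#(?=\d)"), "number "),
    (re.compile(r"(?<=\d)\s*°C\b"), " degrees celsius"),
    (re.compile(r"(?<=\d)\s*°"), " degrees"),
    (re.compile(r"(?<=\d)\s*%"), " percent"),
    (re.compile(r"\+(?=\d)"), "plus "),
    (re.compile(r"\s*&\s*"), " and "),
    (re.compile(r"\s*@\s*"), " at "),
    (re.compile(r"\s*≥\s*"), " greater than or equal "),
    (re.compile(r"\s*≤\s*"), " less than or equal "),
    (re.compile(r"\s*→\s*"), " becomes "),
]


def speak_symbols(text: str) -> str:
    """Rewrite symbols as words, then collapse any doubled spaces."""
    for pattern, replacement in SYMBOL_RULES:
        text = pattern.sub(replacement, text)
    return re.sub(r"\s{2,}", " ", text).strip()
```

The lookarounds keep the rules from firing out of context: `#` is only read as "number" when a digit follows, which leaves heading markers for the markdown stripping step.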

---

## Installation

### Prerequisites

```powershell
# Python 3.8+ required
python -m pip install edge-tts

# Verify installation
edge-tts --list-voices | Select-String "en-US"
```

### MCP Server Location

```text
~/.alex/mcp-servers/tts-reader/
├── index.js          # MCP server implementation
├── package.json      # Node.js configuration
└── node_modules/     # MCP SDK dependencies
```

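### VS Code Configuration

Add the server to your VS Code settings or MCP config. Node.js does not expand `~` on its own, so replace it with the absolute path to your home directory if the server fails to start:

```json
{
  "mcp": {
    "servers": {
      "tts-reader": {
        "command": "node",
        "args": ["~/.alex/mcp-servers/tts-reader/index.js"],
        "name": "Alex TTS Reader"
      }
    }
  }
}
```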

## Usage Patterns

### Read Current Document

```text
"Alex, read this document aloud"
"Can you read the README?"
"Read the analysis section"
```

### Generate Audio File

```text
"Save this as an MP3"
"Create an audio version of the documentation"
"Generate a podcast-style reading"
```

### Voice Customization

```text
"Use a British accent"
"Speak a bit faster"
"Use a warmer voice"
```

## Edge TTS Reference

### Command Line Usage

```powershell
# Basic text to speech
python -m edge_tts --text "Hello world" --write-media output.mp3

# With voice selection
python -m edge_tts --voice en-US-GuyNeural --text "Hello" --write-media out.mp3

# With prosody control
python -m edge_tts --voice en-US-GuyNeural --rate=+20% --pitch=+5Hz `
  --text "Faster and higher" --write-media out.mp3

# List all voices
python -m edge_tts --list-voices
```
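
When these commands are launched from a script rather than typed by hand, building the argument list explicitly avoids shell quoting problems. The sketch below assembles the same flags used in the commands above. `build_edge_tts_command` is a hypothetical helper, and the function only builds the list without running it.

```python
import sys
from typing import List


def build_edge_tts_command(
    text: str,
    output: str,
    voice: str = "en-US-GuyNeural",
    rate: str = "+0%",
    pitch: str = "+0Hz",
) -> List[str]:
    """Return an argument list for `python -m edge_tts` using the flags shown above."""
    return [
        sys.executable, "-m", "edge_tts",
        "--voice", voice,
        # The "=" form keeps values that start with "+" or "-" from being read as flags.
        f"--rate={rate}",
        f"--pitch={pitch}",
        "--text", text,
        "--write-media", output,
    ]


# To run it: subprocess.run(build_edge_tts_command("Hello", "out.mp3"), check=True)
```

Passing the list to `subprocess.run` without `shell=True` means the text is handed over as a single argument, so quotes and ampersands in the document need no escaping.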

### Popular Voice IDs

| Language | Voice | Style |
|----------|-------|-------|
| en-US | GuyNeural | Professional male |
| en-US | JennyNeural | Professional female |
| en-US | AriaNeural | News anchor style |
| en-GB | RyanNeural | British male |
| en-GB | SoniaNeural | British female |
| en-AU | WilliamNeural | Australian male |
| en-IN | NeerjaNeural | Indian English |

## Accessibility Benefits

| Use Case | Benefit |
|----------|---------|
| Vision impaired | Full document access via audio |
| Multitasking | Review code while walking/driving |
| Learning | Auditory reinforcement of reading |
| Proofreading | Catch errors by hearing text |
| Long documents | Listen during breaks |

## Examples

Audio files demonstrating the TTS skill in action (Lithium project, February 2026):

| File | Description | Duration |
|------|-------------|----------|
| `lithium-combined-pitch.mp3` | Claudia's combined pitch to Phil & Robert with research citations | ~2 min |
| `claudia-pitch.mp3` | Claudia's individual pitch to Phil | ~1.5 min |

**Generation Command Used**:

```powershell
python -m edge_tts --voice en-US-GuyNeural --file pitch.txt --write-media output.mp3
Start-Process output.mp3
```

## Technical Notes

### Windows Audio Playback

The MCP server uses PowerShell for audio playback:

```powershell
# Play a WAV file synchronously (System.Media.SoundPlayer supports WAV only, not MP3)
$player = New-Object System.Media.SoundPlayer
$player.SoundLocation = "output.wav"
$player.PlaySync()

# Alternative for MP3: open the file in the default media player
Start-Process -FilePath "output.mp3"
```

### Temporary File Handling

Audio files are generated in a temporary directory and cleaned up after playback:

  • Location: $env:TEMP\alex-tts\
  • Auto-cleanup: After playback completes
  • Persistence: Only with explicit save_audio tool
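
The generate, play, clean up cycle maps naturally onto a context manager. The sketch below is a variation on the behavior described above, not the server's code: it creates a uniquely named `alex-tts-` directory under the system temp location instead of the fixed `alex-tts` folder, and `temp_audio_file` is a hypothetical name.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def temp_audio_file(suffix: str = ".mp3"):
    """Yield a file path inside a fresh temp directory and delete the directory afterwards."""
    # mkdtemp appends a random suffix, so concurrent readings do not collide.
    workdir = Path(tempfile.mkdtemp(prefix="alex-tts-"))
    try:
        yield workdir / f"speech{suffix}"
    finally:
        # Runs after playback completes, and also if generation or playback raises.
        shutil.rmtree(workdir, ignore_errors=True)
```

Inside the `with` block the caller would write the generated audio to the yielded path and play it; a `save_audio` style tool would bypass the context manager and write directly to the requested output path.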

## Synapses

  • mcp-development: Foundation for server implementation
  • accessibility: Primary use case enabler
  • markdown-mermaid: Source content processing
  • academic-research: Document reading for research projects
  • gamma-presentations: Audio playback of pitch content for rehearsal and delivery
  • project-management: Stakeholder pitch presentations generated as audio files

## Future Enhancements

| Feature | Status | Notes |
|---------|--------|-------|
| Real-time streaming | Planned | Start playing before full generation |
| SSML support | Planned | Fine-grained prosody control |
| Section navigation | Planned | "Skip to next heading" |
| Bookmark resume | Planned | Resume from last position |
| Speed presets | Planned | 1x, 1.5x, 2x reading speeds |
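
The planned speed presets could be expressed through the existing `rate` argument. The sketch below assumes that the rate percentage is a change relative to normal speed, so that 1.5x corresponds to `+50%`; that mapping is an assumption, and `speed_to_rate` is a hypothetical helper.

```python
def speed_to_rate(multiplier: float) -> str:
    """Convert a playback multiplier such as 1.5 into a rate string such as '+50%'."""
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    percent = round((multiplier - 1.0) * 100)
    # The "+d" format always emits a sign, matching values like "+0%" and "-25%".
    return f"{percent:+d}%"


# Preset values taken from the table: 1x, 1.5x, 2x.
SPEED_PRESETS = {name: speed_to_rate(value) for name, value in
                 {"1x": 1.0, "1.5x": 1.5, "2x": 2.0}.items()}
```

Under this mapping the three presets become `+0%`, `+50%`, and `+100%`, which stays within the -50% to +100% range documented for `read_markdown`.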