Chichi Speech Service

This skill provides a FastAPI-based REST service for Qwen3 TTS. It is configured to reuse a single high-quality reference audio prompt, which keeps voice cloning efficient and consistent. The service is packaged as an installable CLI.

Installation

Prerequisites: uv and Python >= 3.10.

If uv is not installed, install it first:

bash

command -v uv || brew install uv || (curl -LsSf https://astral.sh/uv/install.sh | sh)

Install chichi-speech

bash

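# Keep the virtual environment outside the repository ($HOME also works off macOS)
export CHICHI_SPEECH_VENV="$HOME/envs/chichi-speech"
uv venv "$CHICHI_SPEECH_VENV" --allow-existing
source "$CHICHI_SPEECH_VENV/bin/activate"
# Run from the repository root: installs the package in editable mode
uv pip install -e .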

Usage

1. Start the Service

The service runs on port 9090 by default.

bash

# Production: Use the included script to run as a daemon with auto-restart
./scripts/start_server.sh --ref-audio src/assets/coco.wav --ref-text src/assets/coco.txt

# To stop:
# ./scripts/stop_server.sh

# Logs are written to /tmp/chichi_server.log

2. Verify Service is Running

Wait about 15 seconds for the model to load, then check that the docs page responds:

bash

sleep 15 && curl http://localhost:9090/docs
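
A fixed 15-second sleep may be too short on slower machines. As a minimal sketch (not part of the chichi-speech package), the same check could poll the docs URL until it answers. The names `http_ok` and `wait_until_ready`, the `probe` parameter, and the timeout values are hypothetical and chosen only for illustration; the port and path are the ones used above.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_ready(url, timeout=60.0, interval=2.0, probe=http_ok):
    """Poll `url` until `probe` succeeds or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_until_ready("http://localhost:9090/docs")` before the first request would replace the fixed `sleep 15`. It returns `False` instead of raising if the service never answers.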

3. Generate Speech

If the "text" is longer than 10 sentences, split it into smaller chunks (10 sentences per chunk) and synthesize each chunk separately. Set "language" to match the input "text". Available languages:

•Chinese
•English
•Japanese
•Korean
•German
•French
•Russian
•Portuguese
•Spanish
•Italian

bash

curl -X POST "http://localhost:9090/synthesize" \
     -H "Content-Type: application/json" \
     -d '{
           "text": "Nice to meet you",
           "language": "auto",
           "format": "ogg"
         }' \
     --create-dirs --output output/nice_to_meet.ogg
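
The 10-sentence rule above can be sketched in Python as follows. This is an illustrative heuristic, not code from the service: `split_sentences` and `chunk_text` are hypothetical helper names, and the sentence boundary detection is deliberately simple (it splits after `.`, `!`, `?` followed by whitespace, and after the CJK marks `。`, `！`, `？`), so abbreviations such as "Dr. Smith" will be split as well.

```python
import re

# Boundary after Latin punctuation followed by whitespace, or directly after
# CJK sentence-ending punctuation (which is usually not followed by a space).
_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])\s*")


def split_sentences(text):
    """Split `text` into sentences, dropping empty pieces."""
    return [s.strip() for s in _BOUNDARY.split(text) if s.strip()]


def chunk_text(text, max_sentences=10):
    """Group sentences into chunks of at most `max_sentences` each."""
    sentences = split_sentences(text)
    # Sentences are re-joined with a single space, including for CJK text.
    return [
        " ".join(sentences[i:i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]
```

Each returned chunk would then be sent as the "text" of its own request, and the resulting audio files played or concatenated in order.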

Functionality

•Endpoint: POST /synthesize
•Default Port: 9090
•Voice Cloning: Uses a pre-computed voice prompt from reference files to ensure the cloned voice is consistent and generation is fast.
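
For callers that prefer Python to curl, a request to the endpoint might look like the sketch below. It uses only the standard library. The JSON field names ("text", "language", "format") and the URL are taken from the curl example; the helper names `build_request` and `synthesize_to_file`, the 120-second timeout, and the assumption that the response body is the raw audio bytes are illustrative and not confirmed by the service's code.

```python
import json
import urllib.request

SYNTHESIZE_URL = "http://localhost:9090/synthesize"


def build_request(text, language, audio_format="ogg", url=SYNTHESIZE_URL):
    """Build a POST request carrying the same JSON body as the curl example."""
    payload = {"text": text, "language": language, "format": audio_format}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, language, out_path, audio_format="ogg"):
    """Send one request and write the response body to `out_path`.

    Assumes the body is the encoded audio itself, as the curl
    `--output` usage suggests. urlopen raises on non-2xx responses.
    """
    request = build_request(text, language, audio_format)
    with urllib.request.urlopen(request, timeout=120) as resp:
        audio = resp.read()
    with open(out_path, "wb") as fh:
        fh.write(audio)
    return out_path
```

With the service running, `synthesize_to_file("Nice to meet you", "English", "nice_to_meet.ogg")` would write the audio to the current directory.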

Requirements

•Python 3.10+
•qwen-tts (Qwen3 model library)
•Access to a reference audio file for voice cloning.
  •By default, it uses public sample audio from Qwen3.
  •CRITICAL: You can provide your own reference audio using the --ref-audio and --ref-text flags.