A community-driven registry for Claude, Cursor, Windsurf, Cline & more. Not affiliated with Anthropic.
Are you the author? Sign in to claim
🗣️ Qwen3-TTS All-in-One Docker: Web UI + REST API + MCP Server | 10 Languages | Voice Clone/Design/Custom | GPU Auto-Ma
Production-ready Docker deployment for Qwen3-TTS with a modern Web UI and full REST API. All models baked into a single Docker image — no downloads at runtime.
| Custom Voice | Voice Design | Voice Clone |
|---|---|---|
![]() | ![]() | ![]() |
/docs.pt file, reuse across sessions# Pull the all-in-one image (~30GB with models)
docker pull neosun/qwen3-tts:2.0.0
# Run with GPU
docker run -d --name qwen3-tts \
--gpus '"device=0"' \
-p 8766:8766 \
neosun/qwen3-tts:2.0.0
git clone https://github.com/neosun100/Qwen3-TTS.git
cd Qwen3-TTS
# Edit .env to set your GPU
echo "NVIDIA_VISIBLE_DEVICES=0" > .env
docker compose up -d
git clone https://github.com/neosun100/Qwen3-TTS.git
cd Qwen3-TTS
bash start.sh
Then open: http://localhost:8766
| Endpoint | Description |
|---|---|
http://localhost:8766 | Web UI |
http://localhost:8766/docs | Swagger API Docs |
http://localhost:8766/health | Health Check |
curl -X POST http://localhost:8766/api/tts/custom-voice \
-H "Content-Type: application/json" \
-d '{
"text": "Hello, this is a test.",
"language": "English",
"speaker": "Ryan",
"instruct": "Speak with excitement"
}' -o output.wav
curl -X POST http://localhost:8766/api/tts/voice-design \
-H "Content-Type: application/json" \
-d '{
"text": "Hello world!",
"language": "English",
"instruct": "Young female voice, cheerful and bright"
}' -o output.wav
curl -X POST http://localhost:8766/api/tts/voice-clone \
-F "text=Hello from a cloned voice" \
-F "language=English" \
-F "ref_text=This is the reference transcript" \
-F "ref_audio=@reference.wav" \
-o output.wav
| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Health check |
| GET | /api/speakers | List speakers with details |
| GET | /api/languages | List supported languages |
| GET | /api/models | List available models |
| GET | /api/gpu-status | GPU memory and model status |
| GET | /api/sample-texts | Sample texts per language |
| POST | /api/gpu-offload | Manually offload GPU memory |
| POST | /api/tts/custom-voice | Custom voice TTS (JSON) |
| POST | /api/tts/voice-design | Voice design TTS (JSON) |
| POST | /api/tts/voice-clone | Voice clone TTS (FormData) |
| POST | /api/tts/voice-clone-from-prompt | TTS from saved voice prompt |
| POST | /api/voice-prompt/save | Save voice clone prompt (.pt) |
| POST | /api/tts/custom-voice/stream | PCM streaming custom voice |
| POST | /api/tts/voice-design/stream | PCM streaming voice design |
| POST | /api/tts/voice-clone/stream | PCM streaming voice clone |
| POST | /api/tokenizer/encode | Encode audio to tokens |
| POST | /api/tokenizer/decode | Decode tokens to audio |
TTS endpoints return timing headers for performance monitoring:
| Header | Description |
|---|---|
X-Time-Load | Model loading time (seconds) |
X-Time-Gen | Audio generation time (seconds) |
X-Time-Total | Total processing time (seconds) |
X-Audio-Duration | Generated audio duration (seconds) |
| Speaker | Gender | Native Language | Description |
|---|---|---|---|
| Vivian | Female | Chinese | Bright, slightly edgy young female |
| Serena | Female | Chinese | Warm, gentle young female |
| Uncle_Fu | Male | Chinese | Seasoned male, low mellow timbre |
| Dylan | Male | Chinese (Beijing) | Youthful Beijing male, clear natural |
| Eric | Male | Chinese (Sichuan) | Lively Chengdu male, slightly husky |
| Ryan | Male | English | Dynamic male, strong rhythmic drive |
| Aiden | Male | English | Sunny American male, clear midrange |
| Ono_Anna | Female | Japanese | Playful Japanese female, light nimble |
| Sohee | Female | Korean | Warm Korean female, rich emotion |
Important: Qwen3-TTS does not support true incremental audio streaming at the model level. The official Qwen team confirmed this in Issue #10:
"Streaming inference is supported at the model architecture level. Currently, our qwen-tts package primarily focuses on enabling quick demos... For streaming capabilities, ongoing development will be mainly driven by the vLLM-Omni community."
The /stream endpoints in this project generate the full audio first, then deliver it as chunked PCM data. This enables progressive playback via Web Audio API but does not reduce time-to-first-byte (TTFB).
For true streaming, monitor:
| Variable | Default | Description |
|---|---|---|
PORT | 8766 | Server port |
NVIDIA_VISIBLE_DEVICES | 0 | GPU device ID |
CUDA_DEVICE | cuda:0 | PyTorch device (always cuda:0 inside container) |
GPU_IDLE_TIMEOUT | 600 | Auto-offload after N seconds idle |
QWEN_TTS_MODEL_DIR | /app/models | Model directory path |
HF_HUB_OFFLINE | 1 | Disable HuggingFace downloads |
If you want to build the Docker image yourself:
git clone https://github.com/neosun100/Qwen3-TTS.git
cd Qwen3-TTS
# Download models first (~14GB total)
pip install -U "huggingface_hub[cli]"
mkdir -p models
huggingface-cli download Qwen/Qwen3-TTS-Tokenizer-12Hz --local-dir ./models/Qwen3-TTS-Tokenizer-12Hz
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --local-dir ./models/Qwen3-TTS-12Hz-1.7B-CustomVoice
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign --local-dir ./models/Qwen3-TTS-12Hz-1.7B-VoiceDesign
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-Base --local-dir ./models/Qwen3-TTS-12Hz-1.7B-Base
# Build (uses pre-compiled flash-attn wheel — fast!)
docker build -t neosun/qwen3-tts:2.0.0 .
Building flash-attn from source requires CUDA development headers and takes 40+ minutes even with ninja. This Dockerfile downloads a pre-compiled wheel matching Python 3.12 + CUDA 12 + PyTorch 2.9, reducing install time to seconds.
The wheel is from the official flash-attention releases.
| Model | Size | Description |
|---|---|---|
| Qwen3-TTS-12Hz-1.7B-CustomVoice | 4.3GB | 9 preset speakers with instruction control |
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | 4.3GB | Design new voices from text descriptions |
| Qwen3-TTS-12Hz-1.7B-Base | 4.3GB | Voice cloning from 3-second audio |
| Qwen3-TTS-Tokenizer-12Hz | 651MB | Audio tokenizer (encode/decode) |
This project is licensed under the Apache 2.0 License. The Qwen3-TTS models are subject to the Qwen License.
MCP server integration for DaVinci Resolve Studio
Run Claude Code as an MCP server so any agent can delegate coding tasks to it
Browser automation using accessibility snapshots instead of screenshots
A Jetbrains IDE IntelliJ plugin aimed to provide coding agents the ability to leverage intelliJ's indexing of the codeba