Fish Speech: Fast Open Source TTS Model Guide
Complete guide to Fish Speech, the lightweight open source TTS model. Learn installation, voice cloning, streaming, and production deployment.
Fish Speech is a lightweight, fast text-to-speech model designed for real-time applications. With excellent multilingual support and minimal resource requirements, it's ideal for edge deployment and scenarios where speed matters most.
Overview
Fish Speech prioritizes speed and efficiency:
- Ultra-fast inference: ~150ms latency on consumer GPUs
- Lightweight: Runs on 4GB VRAM
- 13+ languages: Multilingual support out of the box
- Voice cloning: Quick speaker adaptation with minimal audio
- Streaming: True streaming synthesis for real-time applications
Why Fish Speech?
In the landscape of open source TTS, Fish Speech fills a specific niche:
| Priority | Fish Speech | Qwen3-TTS | Moshi |
|---|---|---|---|
| Speed | ★★★★★ | ★★★☆☆ | ★★★★☆ |
| Quality | ★★★★☆ | ★★★★★ | ★★★★★ |
| VRAM | 4GB | 8GB+ | 10GB+ |
| Languages | 13+ | 29+ | 2 |
Hardware Requirements
| Configuration | GPU | VRAM | Real-time Factor |
|---|---|---|---|
| Minimum | GTX 1060 | 4GB | ~0.5x |
| Recommended | RTX 3060 | 8GB | ~0.2x |
| Optimal | RTX 4070+ | 12GB | ~0.1x |
Fish Speech can even run on CPU for non-real-time applications.
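The real-time factor (RTF) figures above are synthesis time divided by audio duration, so lower is better: an RTF of 0.2 means ten seconds of audio takes about two seconds to generate. A minimal helper for interpreting benchmark numbers (not part of the Fish Speech API):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: time spent synthesizing divided by audio duration.

    RTF < 1.0 means synthesis is faster than playback, which is what
    real-time and streaming applications need.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds

# Example: 2 s to synthesize 10 s of audio gives an RTF of 0.2,
# matching the "Recommended" row in the table above.
```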
Installation
PyPI
```bash
pip install fish-speech
```

From Source
```bash
git clone https://github.com/fishaudio/fish-speech
cd fish-speech
pip install -e .
```

Docker
```bash
docker pull fishaudio/fish-speech:latest
docker run --gpus all -p 8080:8080 fishaudio/fish-speech:latest
```

Quick Start
Basic Synthesis
```python
from fish_speech import FishSpeech

# Initialize
model = FishSpeech()

# Generate speech
audio = model.synthesize("Hello, this is Fish Speech!")

# Save to file
audio.save("output.wav")
```

Streaming Synthesis
For real-time applications:
```python
from fish_speech import FishSpeech

model = FishSpeech()

# Stream audio chunks as they're generated
for chunk in model.synthesize_stream("This is a longer text that will be streamed in real-time."):
    play_audio(chunk)
```

Multilingual Support
Fish Speech supports 13+ languages:
```python
# English
audio_en = model.synthesize("Hello world", language="en")

# Chinese
audio_zh = model.synthesize("你好世界", language="zh")

# Japanese
audio_ja = model.synthesize("こんにちは世界", language="ja")

# Spanish
audio_es = model.synthesize("Hola mundo", language="es")

# German
audio_de = model.synthesize("Hallo Welt", language="de")
```

Auto Language Detection
```python
# Automatic language detection
audio = model.synthesize("Bonjour le monde", language="auto")
```

Voice Cloning
Clone a voice with minimal reference audio (3-10 seconds):
```python
# Clone from reference
voice = model.clone_voice(
    reference_audio="speaker.wav",
    reference_text="This is the reference text."  # Optional but improves quality
)

# Use cloned voice
audio = model.synthesize(
    "Now speaking in the cloned voice!",
    voice=voice
)
```

Zero-Shot Cloning
Clone without reference text:
```python
voice = model.clone_voice_zero_shot("speaker.wav")
audio = model.synthesize("Speaking in cloned voice.", voice=voice)
```

Voice Customization
Prosody Control
```python
audio = model.synthesize(
    "This is an exciting announcement!",
    speed=1.2,   # 0.5 - 2.0
    pitch=1.1,   # 0.5 - 2.0
    energy=1.15  # 0.5 - 2.0
)
```

SSML Support
```python
ssml_text = """
<speak>
  <prosody rate="slow">Welcome to Fish Speech.</prosody>
  <break time="500ms"/>
  <prosody pitch="high">It's fast and efficient!</prosody>
</speak>
"""

audio = model.synthesize_ssml(ssml_text)
```

API Server
FastAPI Server
```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from fish_speech import FishSpeech

app = FastAPI()
model = FishSpeech()

@app.post("/synthesize")
async def synthesize(text: str, language: str = "en"):
    audio = model.synthesize(text, language=language)
    return StreamingResponse(
        audio.to_wav_stream(),
        media_type="audio/wav"
    )

@app.post("/synthesize/stream")
async def synthesize_stream(text: str):
    async def generate():
        for chunk in model.synthesize_stream(text):
            yield chunk.to_bytes()
    return StreamingResponse(generate(), media_type="audio/wav")
```

Built-in Server
```bash
fish-speech serve --host 0.0.0.0 --port 8080
```

Production Deployment
Docker Compose
```yaml
version: '3.8'
services:
  fish-speech:
    image: fishaudio/fish-speech:latest
    runtime: nvidia
    ports:
      - "8080:8080"
    environment:
      - CUDA_VISIBLE_DEVICES=0
    volumes:
      - ./voices:/app/voices  # Custom voices
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

CPU Deployment
For non-GPU environments:
```yaml
services:
  fish-speech:
    image: fishaudio/fish-speech:cpu
    ports:
      - "8080:8080"
    environment:
      - OMP_NUM_THREADS=4
```

Performance Optimization
Batching
```python
texts = [
    "First sentence.",
    "Second sentence.",
    "Third sentence."
]

# Batch synthesis for better throughput
audios = model.synthesize_batch(texts)
```

Caching
```python
# Enable voice embedding cache
model = FishSpeech(cache_embeddings=True)

# First call computes embedding
audio1 = model.synthesize("Text 1", voice=custom_voice)

# Subsequent calls use cached embedding
audio2 = model.synthesize("Text 2", voice=custom_voice)  # Faster
```

Quantization

```python
# INT8 quantization for reduced memory
model = FishSpeech(quantization="int8")
```

Edge Deployment
Fish Speech is optimized for edge devices:
Raspberry Pi
```bash
# Install optimized build
pip install fish-speech[edge]

# Run with CPU optimizations
fish-speech serve --device cpu --threads 4
```

Mobile (ONNX Export)
```python
# Export to ONNX for mobile deployment
model.export_onnx("fish_speech.onnx")
```

Use Cases
Notification Systems
```python
def notify(message: str):
    audio = model.synthesize(message, speed=1.1)
    play_audio(audio)

notify("You have a new message from John.")
```

Real-time Captioning
```python
async def caption_stream(text_stream):
    async for text in text_stream:
        async for audio_chunk in model.synthesize_stream_async(text):
            yield audio_chunk
```

Audiobook Generation
```python
def generate_audiobook(chapters: list[str], output_dir: str):
    for i, chapter in enumerate(chapters):
        audio = model.synthesize(chapter)
        audio.save(f"{output_dir}/chapter_{i:03d}.wav")
```

Comparison with Alternatives
| Feature | Fish Speech | Qwen3-TTS | Coqui TTS |
|---|---|---|---|
| Latency | ~150ms | ~200ms | ~300ms |
| Min VRAM | 4GB | 8GB | 6GB |
| Languages | 13+ | 29+ | 20+ |
| Voice Clone | Yes | Yes | Yes |
| Edge-ready | Yes | Limited | Limited |
For full comparison, see Open Source Voice AI Models.
When to Use Fish Speech
Choose Fish Speech when you need:
- Fastest possible inference
- Edge or mobile deployment
- Low VRAM requirements
- Quick voice cloning
Consider alternatives when you need:
- Maximum language coverage → Qwen3-TTS
- Full-duplex conversation → Moshi or PersonaPlex-7B
- Highest quality → Qwen3-TTS
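As a rough rule of thumb, the decision points above can be condensed into a small chooser. The thresholds and priority ordering here are illustrative only, not official guidance from any of these projects:

```python
def pick_tts_model(vram_gb: float, languages_needed: int,
                   edge_deployment: bool, quality_first: bool) -> str:
    """Illustrative model chooser based on the trade-offs discussed above."""
    if languages_needed > 13:
        return "Qwen3-TTS"    # widest language coverage
    if edge_deployment or vram_gb < 8:
        return "Fish Speech"  # lowest VRAM footprint, edge-ready
    if quality_first:
        return "Qwen3-TTS"    # highest quality when resources allow
    return "Fish Speech"      # fastest inference by default
```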
Conclusion
Fish Speech proves that high-quality TTS doesn't require massive resources. Its combination of speed, efficiency, and quality makes it the go-to choice for real-time applications and edge deployment. Whether you're building a notification system, mobile app, or IoT device, Fish Speech delivers.
This article is part of our Open Source Voice AI Models series.