Fish Speech: Fast Open Source TTS Model Guide
Complete guide to Fish Speech, the lightweight open source TTS model. Learn installation, voice cloning, streaming, and production deployment.
Fish Speech is a lightweight, fast text-to-speech model designed for real-time applications. With excellent multilingual support and minimal resource requirements, it's ideal for edge deployment and scenarios where speed matters most.
Overview
Fish Speech prioritizes speed and efficiency:
- Ultra-fast inference: ~150ms latency on consumer GPUs
- Lightweight: Runs on 4GB VRAM
- 13+ languages: Multilingual support out of the box
- Voice cloning: Quick speaker adaptation with minimal audio
- Streaming: True streaming synthesis for real-time applications
Why Fish Speech?
In the landscape of open source TTS, Fish Speech fills a specific niche:
| Priority | Fish Speech | Qwen3-TTS | Moshi |
|---|---|---|---|
| Speed | ★★★★★ | ★★★☆☆ | ★★★★☆ |
| Quality | ★★★★☆ | ★★★★★ | ★★★★★ |
| VRAM | 4GB | 8GB+ | 10GB+ |
| Languages | 13+ | 29+ | 2 |
Hardware Requirements
| Configuration | GPU | VRAM | Real-time Factor |
|---|---|---|---|
| Minimum | GTX 1060 | 4GB | ~0.5x |
| Recommended | RTX 3060 | 8GB | ~0.2x |
| Optimal | RTX 4070+ | 12GB | ~0.1x |
Fish Speech can even run on CPU for non-real-time applications.
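The real-time factor (RTF) figures above are synthesis time divided by audio duration, so lower is better: an RTF of 0.2 means ten seconds of audio takes about two seconds to generate. A minimal helper for interpreting benchmark numbers (not part of the Fish Speech API):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: time spent synthesizing divided by audio duration.

    RTF < 1.0 means synthesis is faster than playback, which is what
    real-time and streaming applications need.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds

# Example: 2 s to synthesize 10 s of audio gives an RTF of 0.2,
# matching the "Recommended" row in the table above.
```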
Installation
PyPI
```bash
pip install fish-speech
```

From Source
```bash
git clone https://github.com/fishaudio/fish-speech
cd fish-speech
pip install -e .
```

Docker
```bash
docker pull fishaudio/fish-speech:latest
docker run --gpus all -p 8080:8080 fishaudio/fish-speech:latest
```

Quick Start
Basic Synthesis
```python
from fish_speech import FishSpeech

# Initialize
model = FishSpeech()

# Generate speech
audio = model.synthesize("Hello, this is Fish Speech!")

# Save to file
audio.save("output.wav")
```

Streaming Synthesis
For real-time applications:
```python
from fish_speech import FishSpeech

model = FishSpeech()

# Stream audio chunks as they're generated
for chunk in model.synthesize_stream("This is a longer text that will be streamed in real-time."):
    play_audio(chunk)
```

Multilingual Support
Fish Speech supports 13+ languages:
```python
# English
audio_en = model.synthesize("Hello world", language="en")

# Chinese
audio_zh = model.synthesize("你好世界", language="zh")

# Japanese
audio_ja = model.synthesize("こんにちは世界", language="ja")

# Spanish
audio_es = model.synthesize("Hola mundo", language="es")

# German
audio_de = model.synthesize("Hallo Welt", language="de")
```

Auto Language Detection
```python
# Automatic language detection
audio = model.synthesize("Bonjour le monde", language="auto")
```

Voice Cloning
Clone a voice with minimal reference audio (3-10 seconds):
```python
# Clone from reference
voice = model.clone_voice(
    reference_audio="speaker.wav",
    reference_text="This is the reference text."  # Optional but improves quality
)

# Use cloned voice
audio = model.synthesize(
    "Now speaking in the cloned voice!",
    voice=voice
)
```

Zero-Shot Cloning
Clone without reference text:
```python
voice = model.clone_voice_zero_shot("speaker.wav")
audio = model.synthesize("Speaking in cloned voice.", voice=voice)
```

Voice Customization
Prosody Control
```python
audio = model.synthesize(
    "This is an exciting announcement!",
    speed=1.2,   # 0.5 - 2.0
    pitch=1.1,   # 0.5 - 2.0
    energy=1.15  # 0.5 - 2.0
)
```

SSML Support
```python
ssml_text = """
<speak>
  <prosody rate="slow">Welcome to Fish Speech.</prosody>
  <break time="500ms"/>
  <prosody pitch="high">It's fast and efficient!</prosody>
</speak>
"""

audio = model.synthesize_ssml(ssml_text)
```

API Server
FastAPI Server
```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from fish_speech import FishSpeech

app = FastAPI()
model = FishSpeech()

@app.post("/synthesize")
async def synthesize(text: str, language: str = "en"):
    audio = model.synthesize(text, language=language)
    return StreamingResponse(
        audio.to_wav_stream(),
        media_type="audio/wav"
    )

@app.post("/synthesize/stream")
async def synthesize_stream(text: str):
    async def generate():
        for chunk in model.synthesize_stream(text):
            yield chunk.to_bytes()
    return StreamingResponse(generate(), media_type="audio/wav")
```

Built-in Server
```bash
fish-speech serve --host 0.0.0.0 --port 8080
```

Production Deployment
Docker Compose
```yaml
version: '3.8'
services:
  fish-speech:
    image: fishaudio/fish-speech:latest
    runtime: nvidia
    ports:
      - "8080:8080"
    environment:
      - CUDA_VISIBLE_DEVICES=0
    volumes:
      - ./voices:/app/voices  # Custom voices
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

CPU Deployment
For non-GPU environments:
```yaml
services:
  fish-speech:
    image: fishaudio/fish-speech:cpu
    ports:
      - "8080:8080"
    environment:
      - OMP_NUM_THREADS=4
```

Performance Optimization
Batching
```python
texts = [
    "First sentence.",
    "Second sentence.",
    "Third sentence."
]

# Batch synthesis for better throughput
audios = model.synthesize_batch(texts)
```

Caching
```python
# Enable voice embedding cache
model = FishSpeech(cache_embeddings=True)

# First call computes embedding
audio1 = model.synthesize("Text 1", voice=custom_voice)

# Subsequent calls use cached embedding
audio2 = model.synthesize("Text 2", voice=custom_voice)  # Faster
```

Quantization

```python
# INT8 quantization for reduced memory
model = FishSpeech(quantization="int8")
```

Edge Deployment
Fish Speech is optimized for edge devices:
Raspberry Pi
```bash
# Install optimized build
pip install fish-speech[edge]

# Run with CPU optimizations
fish-speech serve --device cpu --threads 4
```

Mobile (ONNX Export)
```python
# Export to ONNX for mobile deployment
model.export_onnx("fish_speech.onnx")
```

Use Cases
Notification Systems
```python
def notify(message: str):
    audio = model.synthesize(message, speed=1.1)
    play_audio(audio)

notify("You have a new message from John.")
```

Real-time Captioning
```python
async def caption_stream(text_stream):
    async for text in text_stream:
        async for audio_chunk in model.synthesize_stream_async(text):
            yield audio_chunk
```

Audiobook Generation
```python
def generate_audiobook(chapters: list[str], output_dir: str):
    for i, chapter in enumerate(chapters):
        audio = model.synthesize(chapter)
        audio.save(f"{output_dir}/chapter_{i:03d}.wav")
```

Comparison with Alternatives
| Feature | Fish Speech | Qwen3-TTS | Coqui TTS |
|---|---|---|---|
| Latency | ~150ms | ~200ms | ~300ms |
| Min VRAM | 4GB | 8GB | 6GB |
| Languages | 13+ | 29+ | 20+ |
| Voice Clone | Yes | Yes | Yes |
| Edge-ready | Yes | Limited | Limited |
For full comparison, see Open Source Voice AI Models.
When to Use Fish Speech
Choose Fish Speech when you need:
- Fastest possible inference
- Edge or mobile deployment
- Low VRAM requirements
- Quick voice cloning
Consider alternatives when you need:
- Maximum language coverage → Qwen3-TTS
- Full-duplex conversation → Moshi or PersonaPlex-7B
- Highest quality → Qwen3-TTS
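As a rough rule of thumb, the decision points above can be condensed into a small chooser. The thresholds and priority ordering here are illustrative only, not official guidance from any of these projects:

```python
def pick_tts_model(vram_gb: float, languages_needed: int,
                   edge_deployment: bool, quality_first: bool) -> str:
    """Illustrative model chooser based on the trade-offs discussed above."""
    if languages_needed > 13:
        return "Qwen3-TTS"    # widest language coverage
    if edge_deployment or vram_gb < 8:
        return "Fish Speech"  # lowest VRAM footprint, edge-ready
    if quality_first:
        return "Qwen3-TTS"    # highest quality when resources allow
    return "Fish Speech"      # fastest inference by default
```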
Conclusion
Fish Speech proves that high-quality TTS doesn't require massive resources. Its combination of speed, efficiency, and quality makes it the go-to choice for real-time applications and edge deployment. Whether you're building a notification system, mobile app, or IoT device, Fish Speech delivers.
This article is part of our Open Source Voice AI Models series.