Fish Speech: Fast Open Source TTS Model Guide

Complete guide to Fish Speech, the lightweight open source TTS model. Learn installation, voice cloning, streaming, and production deployment.

Tags: fish speech, open source tts, fast text to speech, voice cloning

Fish Speech is a lightweight, fast text-to-speech model designed for real-time applications. With excellent multilingual support and minimal resource requirements, it's ideal for edge deployment and scenarios where speed matters most.

Overview

Fish Speech prioritizes speed and efficiency:

  • Ultra-fast inference: ~150ms latency on consumer GPUs
  • Lightweight: Runs on 4GB VRAM
  • 13+ languages: Multilingual support out of the box
  • Voice cloning: Quick speaker adaptation with minimal audio
  • Streaming: True streaming synthesis for real-time applications

Why Fish Speech?

In the landscape of open source TTS, Fish Speech fills a specific niche:

| Priority   | Fish Speech | Qwen3-TTS | Moshi  |
|------------|-------------|-----------|--------|
| Speed      | ★★★★★       | ★★★☆☆     | ★★★★☆  |
| Quality    | ★★★★☆       | ★★★★★     | ★★★★★  |
| VRAM       | 4GB         | 8GB+      | 10GB+  |
| Languages  | 13+         | 29+       | 2      |

Hardware Requirements

| Configuration | GPU       | VRAM | Real-time Factor |
|---------------|-----------|------|------------------|
| Minimum       | GTX 1060  | 4GB  | ~0.5x            |
| Recommended   | RTX 3060  | 8GB  | ~0.2x            |
| Optimal       | RTX 4070+ | 12GB | ~0.1x            |

Fish Speech can even run on CPU for non-real-time applications.
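The real-time factor (RTF) in the table above is synthesis time divided by output audio duration, so values below 1.0 mean faster than real time. A small helper for computing it from your own benchmarks (a hypothetical utility, not part of the Fish Speech API):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return RTF: synthesis time / audio duration. Below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds

# An RTX 3060 producing 10 s of audio in 2 s runs at RTF 0.2.
rtf = real_time_factor(2.0, 10.0)
```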

Installation

PyPI

pip install fish-speech

From Source

git clone https://github.com/fishaudio/fish-speech
cd fish-speech
pip install -e .

Docker

docker pull fishaudio/fish-speech:latest
docker run --gpus all -p 8080:8080 fishaudio/fish-speech:latest

Quick Start

Basic Synthesis

from fish_speech import FishSpeech
 
# Initialize
model = FishSpeech()
 
# Generate speech
audio = model.synthesize("Hello, this is Fish Speech!")
 
# Save to file
audio.save("output.wav")

Streaming Synthesis

For real-time applications:

from fish_speech import FishSpeech
 
model = FishSpeech()
 
# Stream audio chunks as they're generated
for chunk in model.synthesize_stream("This is a longer text that will be streamed in real-time."):
    play_audio(chunk)
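If you want to persist the stream instead of playing it, the chunks can be assembled into a WAV container with the standard library. The sketch below assumes each chunk is raw 16-bit mono PCM at 24 kHz; the actual chunk format and sample rate depend on the model configuration, so check before reusing this:

```python
import io
import wave

def chunks_to_wav(chunks, sample_rate=24000):
    """Assemble raw 16-bit mono PCM chunks into in-memory WAV bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)     # mono
        wav.setsampwidth(2)     # 16-bit samples
        wav.setframerate(sample_rate)
        for chunk in chunks:
            wav.writeframes(chunk)
    return buf.getvalue()

# Simulated chunks standing in for model.synthesize_stream(...) output
fake_chunks = [b"\x00\x00" * 240, b"\x01\x00" * 240]
wav_bytes = chunks_to_wav(fake_chunks)
```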

Multilingual Support

Fish Speech supports 13+ languages:

# English
audio_en = model.synthesize("Hello world", language="en")
 
# Chinese
audio_zh = model.synthesize("你好世界", language="zh")
 
# Japanese
audio_ja = model.synthesize("こんにちは世界", language="ja")
 
# Spanish
audio_es = model.synthesize("Hola mundo", language="es")
 
# German
audio_de = model.synthesize("Hallo Welt", language="de")

Auto Language Detection

# Automatic language detection
audio = model.synthesize("Bonjour le monde", language="auto")
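Automatic detection generally works by classifying the script and vocabulary of the input text. As a toy illustration only (this is not Fish Speech's actual detector, and real detectors handle far more cases, e.g. all-kanji Japanese), a Unicode-range check looks like this:

```python
def guess_language(text: str) -> str:
    """Toy script-based guess: kana maps to ja, CJK ideographs to zh, else en."""
    for ch in text:
        code = ord(ch)
        if 0x3040 <= code <= 0x30FF:   # Hiragana / Katakana
            return "ja"
        if 0x4E00 <= code <= 0x9FFF:   # CJK Unified Ideographs
            return "zh"
    return "en"
```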

Voice Cloning

Clone a voice with minimal reference audio (3-10 seconds):

# Clone from reference
voice = model.clone_voice(
    reference_audio="speaker.wav",
    reference_text="This is the reference text."  # Optional but improves quality
)
 
# Use cloned voice
audio = model.synthesize(
    "Now speaking in the cloned voice!",
    voice=voice
)

Zero-Shot Cloning

Clone without reference text:

voice = model.clone_voice_zero_shot("speaker.wav")
audio = model.synthesize("Speaking in cloned voice.", voice=voice)

Voice Customization

Prosody Control

audio = model.synthesize(
    "This is an exciting announcement!",
    speed=1.2,           # 0.5 - 2.0
    pitch=1.1,           # 0.5 - 2.0
    energy=1.15          # 0.5 - 2.0
)

SSML Support

ssml_text = """
<speak>
    <prosody rate="slow">Welcome to Fish Speech.</prosody>
    <break time="500ms"/>
    <prosody pitch="high">It's fast and efficient!</prosody>
</speak>
"""
 
audio = model.synthesize_ssml(ssml_text)
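Rather than hand-writing SSML strings, you can generate them with the standard library's XML tools, which guarantees well-formed markup and proper escaping. A minimal hypothetical builder (the tag vocabulary follows the SSML example above):

```python
import xml.etree.ElementTree as ET

def build_ssml(parts):
    """Build an SSML <speak> document from ("text", str) and ("break", duration) tuples."""
    speak = ET.Element("speak")
    for kind, value in parts:
        if kind == "text":
            prosody = ET.SubElement(speak, "prosody", rate="medium")
            prosody.text = value
        elif kind == "break":
            ET.SubElement(speak, "break", time=value)
    return ET.tostring(speak, encoding="unicode")

ssml = build_ssml([("text", "Welcome."), ("break", "500ms"), ("text", "Done.")])
```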

API Server

FastAPI Server

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from fish_speech import FishSpeech
 
app = FastAPI()
model = FishSpeech()
 
@app.post("/synthesize")
async def synthesize(text: str, language: str = "en"):
    audio = model.synthesize(text, language=language)
    return StreamingResponse(
        audio.to_wav_stream(),
        media_type="audio/wav"
    )
 
@app.post("/synthesize/stream")
async def synthesize_stream(text: str):
    async def generate():
        for chunk in model.synthesize_stream(text):
            yield chunk.to_bytes()
 
    return StreamingResponse(generate(), media_type="audio/wav")

Built-in Server

fish-speech serve --host 0.0.0.0 --port 8080

Production Deployment

Docker Compose

version: '3.8'
services:
  fish-speech:
    image: fishaudio/fish-speech:latest
    runtime: nvidia
    ports:
      - "8080:8080"
    environment:
      - CUDA_VISIBLE_DEVICES=0
    volumes:
      - ./voices:/app/voices  # Custom voices
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

CPU Deployment

For non-GPU environments:

services:
  fish-speech:
    image: fishaudio/fish-speech:cpu
    ports:
      - "8080:8080"
    environment:
      - OMP_NUM_THREADS=4

Performance Optimization

Batching

texts = [
    "First sentence.",
    "Second sentence.",
    "Third sentence."
]
 
# Batch synthesis for better throughput
audios = model.synthesize_batch(texts)
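When you have more texts than fit in one batch, grouping them into fixed-size batches keeps GPU utilization steady. A hypothetical helper (the batch size that works best depends on your VRAM):

```python
def make_batches(texts, batch_size=8):
    """Group texts into fixed-size batches for synthesize_batch-style calls."""
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

# Ten sentences with batch_size=4 yield batches of 4, 4, and 2.
batches = make_batches([f"Sentence {n}." for n in range(10)], batch_size=4)
```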

Caching

# Enable voice embedding cache
model = FishSpeech(cache_embeddings=True)
 
# First call computes embedding
audio1 = model.synthesize("Text 1", voice=custom_voice)
 
# Subsequent calls use cached embedding
audio2 = model.synthesize("Text 2", voice=custom_voice)  # Faster
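The caching idea itself is simple memoization: compute each voice's embedding once and reuse it on later calls. A self-contained sketch of the pattern (not the library's internal implementation; the lambda stands in for a real speaker encoder):

```python
class EmbeddingCache:
    """Memoize per-voice embeddings so repeat synthesis skips recomputation."""

    def __init__(self, compute_fn):
        self._compute = compute_fn   # expensive encoder call
        self._cache = {}
        self.hits = 0

    def get(self, voice_id):
        if voice_id in self._cache:
            self.hits += 1           # served from cache, no recompute
        else:
            self._cache[voice_id] = self._compute(voice_id)
        return self._cache[voice_id]

cache = EmbeddingCache(lambda vid: [0.0] * 4)   # stand-in embedding
cache.get("narrator")    # first call computes
cache.get("narrator")    # second call hits the cache
```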

Quantization

# INT8 quantization for reduced memory
model = FishSpeech(quantization="int8")

Edge Deployment

Fish Speech is optimized for edge devices:

Raspberry Pi

# Install optimized build
pip install fish-speech[edge]
 
# Run with CPU optimizations
fish-speech serve --device cpu --threads 4

Mobile (ONNX Export)

# Export to ONNX for mobile deployment
model.export_onnx("fish_speech.onnx")

Use Cases

Notification Systems

def notify(message: str):
    audio = model.synthesize(message, speed=1.1)
    play_audio(audio)
 
notify("You have a new message from John.")

Real-time Captioning

async def caption_stream(text_stream):
    async for text in text_stream:
        async for audio_chunk in model.synthesize_stream_async(text):
            yield audio_chunk

Audiobook Generation

def generate_audiobook(chapters: list[str], output_dir: str):
    for i, chapter in enumerate(chapters):
        audio = model.synthesize(chapter)
        audio.save(f"{output_dir}/chapter_{i:03d}.wav")
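Long chapters may exceed what you want to synthesize in one call, so it can help to split each chapter at paragraph boundaries first. A hypothetical splitter (paragraphs longer than the limit are kept whole rather than cut mid-sentence):

```python
def split_chapter(text: str, max_chars: int = 1000) -> list[str]:
    """Split chapter text into chunks at paragraph (blank-line) boundaries."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

# Three 400-char paragraphs under a 1000-char limit split into two chunks.
demo = split_chapter("\n\n".join(["x" * 400] * 3), max_chars=1000)
```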

Comparison with Alternatives

| Feature     | Fish Speech | Qwen3-TTS | Coqui TTS |
|-------------|-------------|-----------|-----------|
| Latency     | ~150ms      | ~200ms    | ~300ms    |
| Min VRAM    | 4GB         | 8GB       | 6GB       |
| Languages   | 13+         | 29+       | 20+       |
| Voice Clone | Yes         | Yes       | Yes       |
| Edge-ready  | Yes         | Limited   | Limited   |

For full comparison, see Open Source Voice AI Models.

When to Use Fish Speech

Choose Fish Speech when you need:

  • Fastest possible inference
  • Edge or mobile deployment
  • Low VRAM requirements
  • Quick voice cloning

Consider alternatives when you need:

  • Maximum output quality (Qwen3-TTS and Moshi rate higher in the comparison above)
  • Broader language coverage (Qwen3-TTS supports 29+ languages)

Conclusion

Fish Speech proves that high-quality TTS doesn't require massive resources. Its combination of speed, efficiency, and quality makes it the go-to choice for real-time applications and edge deployment. Whether you're building a notification system, mobile app, or IoT device, Fish Speech delivers.


This article is part of our Open Source Voice AI Models series.
