
Moshi Voice Model: Kyutai's Full-Duplex AI Guide

Complete guide to Moshi, Kyutai's open source full-duplex voice AI model. Learn installation, deployment, and how it compares to alternatives.


Moshi, developed by Kyutai, is a groundbreaking full-duplex speech model that enables truly bidirectional conversation. It can listen and speak simultaneously, creating one of the most natural conversational experiences available in open source voice AI.

Overview

Moshi represents a fundamental shift in voice AI architecture:

  • True full-duplex: Listen and respond simultaneously, not turn-by-turn
  • Ultra-low latency: ~200ms theoretical latency
  • Emotion-aware: Understands and expresses emotional nuance
  • Open source: Available under a permissive license
  • Compact: Runs on consumer GPUs

How Moshi Works

Unlike traditional ASR → LLM → TTS pipelines, Moshi processes audio streams directly:

Traditional Pipeline:
Audio → [ASR] → Text → [LLM] → Text → [TTS] → Audio
        (300ms)        (500ms)        (200ms)
        Total: ~1000ms

Moshi:
Audio Stream ←→ [Moshi Model] ←→ Audio Stream
                 (~200ms)
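The latency budget above is easy to sanity-check: a cascaded pipeline pays each stage's cost in sequence, while an end-to-end model pays a single one. (The stage numbers come from the diagram, not from measured benchmarks.)

```python
# Illustrative latency budget: stage numbers are taken from the
# diagram above, not from measured benchmarks.
cascaded_stages_ms = {"ASR": 300, "LLM": 500, "TTS": 200}
cascaded_total_ms = sum(cascaded_stages_ms.values())  # 1000 ms

moshi_total_ms = 200  # one end-to-end model, one latency cost

print(f"Cascaded pipeline: {cascaded_total_ms} ms")
print(f"Moshi:             {moshi_total_ms} ms")
print(f"Speedup:           {cascaded_total_ms / moshi_total_ms:.0f}x")
```

The point is structural: shaving a stage in a cascade helps linearly, but removing the cascade entirely removes the serialization.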

Hardware Requirements

| Configuration | GPU      | VRAM  | Expected Latency |
|---------------|----------|-------|------------------|
| Minimum       | RTX 3080 | 10GB  | ~400ms           |
| Recommended   | RTX 4090 | 24GB  | ~200ms           |
| Production    | A100     | 40GB+ | ~150ms           |

Installation

PyPI Install

pip install moshi

From Source

git clone https://github.com/kyutai-labs/moshi
cd moshi
pip install -e .

Docker

docker pull kyutai/moshi:latest
docker run --gpus all -p 8998:8998 kyutai/moshi:latest

Quick Start

Basic Conversation

from moshi import Moshi
 
# Initialize model
model = Moshi.from_pretrained("kyutai/moshi-7b")
 
# Create a session
session = model.create_session()
 
# Process audio and get response
response = session.process(input_audio)

Real-Time Streaming

import asyncio
from moshi import Moshi, AudioIO
 
async def run_conversation():
    model = Moshi.from_pretrained("kyutai/moshi-7b")
    session = model.create_session()
 
    # Bidirectional audio streaming
    audio_io = AudioIO()
 
    async with session.stream() as stream:
        async for input_chunk in audio_io.input():
            response_chunk = await stream.process(input_chunk)
            await audio_io.output(response_chunk)
 
asyncio.run(run_conversation())

Full-Duplex in Action

Moshi's full-duplex capability means it doesn't wait for you to finish speaking:

session = model.create_session(
    # Full-duplex settings
    duplex_mode="full",          # "full", "half", or "adaptive"
    overlap_handling="natural",   # How to handle overlapping speech
 
    # Timing controls
    min_response_gap=50,         # ms before responding
    barge_in_sensitivity=0.7,    # How easily user can interrupt
)
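The `barge_in_sensitivity` knob suggests some interruption detector under the hood. Moshi's real mechanism is internal to the model, but a toy energy-based version conveys the idea: higher sensitivity lowers the loudness threshold at which user speech interrupts playback. This sketch is entirely illustrative, not Moshi's implementation.

```python
import math

def should_barge_in(samples: list[float], sensitivity: float,
                    max_rms: float = 0.3) -> bool:
    """Return True when user audio is loud enough to interrupt playback.

    Higher sensitivity lowers the energy threshold, so quieter speech
    triggers an interruption. Purely illustrative logic.
    """
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    threshold = (1.0 - sensitivity) * max_rms
    return rms > threshold

# One 20 ms frame at 24 kHz = 480 samples.
loud, quiet = [0.25] * 480, [0.01] * 480
print(should_barge_in(loud, 0.7), should_barge_in(quiet, 0.7))  # True False
```

In practice a model-level detector also weighs semantics (is the user actually taking the turn, or just backchanneling?), which raw energy cannot capture.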

Backchanneling

Moshi can provide natural conversational feedback:

session = model.create_session(
    backchanneling=True,  # Enable "mhm", "I see", etc.
    backchannel_frequency="natural"  # or "minimal", "frequent"
)

Voice Options

Built-in Voices

# Available voices
voices = model.list_voices()
# ['moshi-default', 'moshi-calm', 'moshi-energetic']
 
session = model.create_session(voice="moshi-calm")

Language Support

Moshi currently supports English and French:

session = model.create_session(
    language="en",  # or "fr"
    accent="neutral"  # or "british", "american", "french"
)

System Prompts

Guide Moshi's behavior:

session = model.create_session(
    system_prompt="""You are a friendly language tutor helping users practice French.
    Speak slowly and clearly. Gently correct pronunciation mistakes.
    Encourage the user and keep the conversation flowing naturally."""
)

WebSocket Server

Deploy Moshi as a WebSocket server:

from moshi import Moshi, WebSocketServer
 
model = Moshi.from_pretrained("kyutai/moshi-7b")
 
server = WebSocketServer(
    model=model,
    host="0.0.0.0",
    port=8998
)
 
server.run()

JavaScript Client

const ws = new WebSocket('ws://localhost:8998/ws');
 
// Set up audio context. Note: ScriptProcessorNode is deprecated in
// modern browsers; prefer an AudioWorklet for production use.
const audioContext = new AudioContext({ sampleRate: 24000 });
const processor = audioContext.createScriptProcessor(2048, 1, 1);
 
// Send audio
processor.onaudioprocess = (e) => {
    const audioData = e.inputBuffer.getChannelData(0);
    ws.send(audioData.buffer);
};
 
// Receive audio
ws.onmessage = async (event) => {
    const audioData = await event.data.arrayBuffer();
    playAudio(audioData);
};
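Two details worth checking in the client above: 2048 samples at 24 kHz means each chunk carries about 85 ms of audio, which sets a floor on how granular the exchange can be; and the browser hands you Float32 samples while many audio endpoints expect 16-bit PCM. Whether this particular server wants float32 or PCM16 is an assumption to verify against its docs; a small helper shows the conversion either way.

```python
import struct

SAMPLE_RATE = 24_000   # matches the AudioContext above
CHUNK = 2048           # samples per ScriptProcessor buffer

chunk_ms = CHUNK / SAMPLE_RATE * 1000
print(f"Each chunk spans {chunk_ms:.1f} ms of audio")

def float32_to_pcm16(samples: list[float]) -> bytes:
    """Clamp float samples to [-1, 1] and pack as little-endian int16."""
    clamped = (max(-1.0, min(1.0, s)) for s in samples)
    return struct.pack(f"<{len(samples)}h",
                       *(int(s * 32767) for s in clamped))

pcm = float32_to_pcm16([0.0, 0.5, -1.0])  # 3 samples -> 6 bytes
```

Smaller buffers reduce per-chunk latency at the cost of more socket messages; 2048 is a reasonable middle ground.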

Production Deployment

Docker Compose

version: '3.8'  # optional in recent Docker Compose versions
services:
  moshi:
    image: kyutai/moshi:latest
    runtime: nvidia
    ports:
      - "8998:8998"
    environment:
      - CUDA_VISIBLE_DEVICES=0
      - MODEL_ID=kyutai/moshi-7b
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: moshi
spec:
  replicas: 2
  selector:
    matchLabels:
      app: moshi
  template:
    metadata:
      labels:
        app: moshi
    spec:
      containers:
      - name: moshi
        image: kyutai/moshi:latest
        resources:
          limits:
            nvidia.com/gpu: 1
        ports:
        - containerPort: 8998

Performance Tuning

Quantization

model = Moshi.from_pretrained(
    "kyutai/moshi-7b",
    quantization="int8"  # Reduces VRAM by ~50%
)
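The "~50%" figure follows from arithmetic: int8 stores one byte per parameter instead of fp16's two. A back-of-envelope estimate (weights only; the KV-cache and activations add runtime overhead on top) also explains the hardware table's tiers:

```python
# Rough weight-memory estimate for a 7B-parameter model.
# Weights only; runtime adds KV-cache and activation overhead.
PARAMS = 7e9

def weight_gib(params: float, bytes_per_param: int) -> float:
    return params * bytes_per_param / 1024**3

fp16_gib = weight_gib(PARAMS, 2)  # ~13 GiB -> wants a 24 GB card
int8_gib = weight_gib(PARAMS, 1)  # ~6.5 GiB -> fits a 10 GB card
print(f"fp16: {fp16_gib:.1f} GiB, int8: {int8_gib:.1f} GiB")
```

This is why int8 quantization is effectively what makes the 10GB "Minimum" configuration viable.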

Batch Processing

For multiple concurrent sessions:

model = Moshi.from_pretrained(
    "kyutai/moshi-7b",
    max_batch_size=4,  # Handle 4 conversations concurrently
    dynamic_batching=True
)

Limitations

  • Language support: Currently English and French only
  • Voice cloning: Limited compared to TTS-specific models
  • Fine-tuning: Community fine-tuning tools still maturing
  • Documentation: As a newer model, documentation is evolving

Comparison with Alternatives

| Feature       | Moshi  | PersonaPlex-7B | GPT-4o Realtime |
|---------------|--------|----------------|-----------------|
| Full-Duplex   | Yes    | Yes            | Yes             |
| Latency       | ~200ms | ~300ms         | ~300ms          |
| Open Source   | Yes    | Yes            | No              |
| Languages     | 2      | 1              | 50+             |
| Self-hostable | Yes    | Yes            | No              |

For more comparisons, see our Open Source Voice AI Models guide.

When to Use Moshi

Choose Moshi when you need:

  • Lowest possible latency
  • True full-duplex conversation
  • Self-hosted deployment
  • English or French language support

Consider alternatives when you need:

  • Support for languages beyond English and French
  • Extensive voice cloning or voice customization
  • A fully managed API rather than self-hosted infrastructure

Conclusion

Moshi sets a new standard for open source full-duplex voice AI. Its ability to truly listen and speak simultaneously creates conversational experiences that feel remarkably natural. For developers building interactive voice applications, Moshi is a compelling option worth exploring.


This article is part of our Open Source Voice AI Models series.
