Moshi Voice Model: Kyutai's Full-Duplex AI Guide
Complete guide to Moshi, Kyutai's open source full-duplex voice AI model. Learn installation, deployment, and how it compares to alternatives.
Moshi, developed by Kyutai, is a groundbreaking full-duplex speech model that enables truly bidirectional conversation. It can listen and speak simultaneously, creating one of the most natural conversational experiences available in open source voice AI.
Overview
Moshi represents a fundamental shift in voice AI architecture:
- True full-duplex: Listen and respond simultaneously, not turn-by-turn
- Ultra-low latency: ~200 ms in practice (160 ms theoretical)
- Emotion-aware: Understands and expresses emotional nuance
- Open source: Available under permissive license
- Compact: Runs on consumer GPUs
How Moshi Works
Unlike traditional ASR → LLM → TTS pipelines, Moshi processes audio streams directly:
Traditional Pipeline:
Audio → [ASR] → Text → [LLM] → Text → [TTS] → Audio
        (300ms)        (500ms)        (200ms)
Total: ~1000ms
Moshi:
Audio Stream ←→ [Moshi Model] ←→ Audio Stream
(~200ms)
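To make the difference concrete, the stage budgets in the diagrams above can be summed in a few lines (the numbers are the illustrative figures from the diagrams, not measurements):

```python
# Illustrative latency budgets from the diagrams above (milliseconds)
pipeline_stages = {"ASR": 300, "LLM": 500, "TTS": 200}
pipeline_total = sum(pipeline_stages.values())

moshi_total = 200  # single model, no hand-offs between stages

print(f"pipeline: {pipeline_total} ms, moshi: {moshi_total} ms")
print(f"speedup: {pipeline_total / moshi_total:.0f}x")  # -> speedup: 5x
```

Cutting the hand-offs between separate models is where most of the savings come from: each stage in the pipeline must finish (or at least emit a chunk) before the next one starts.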
Hardware Requirements
| Configuration | GPU | VRAM | Expected Latency |
|---|---|---|---|
| Minimum | RTX 3080 | 10GB | ~400ms |
| Recommended | RTX 4090 | 24GB | ~200ms |
| Production | A100 | 40GB+ | ~150ms |
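When sizing hardware programmatically, the table can be encoded as data and queried. A toy helper (names and structure are illustrative; the figures are copied from the table above):

```python
# VRAM requirements and expected latency, copied from the table above
CONFIGS = [
    # (name, min_vram_gb, expected_latency_ms)
    ("Minimum (RTX 3080)", 10, 400),
    ("Recommended (RTX 4090)", 24, 200),
    ("Production (A100)", 40, 150),
]

def best_config(available_vram_gb: float):
    """Return the highest tier whose VRAM requirement fits, or None."""
    fitting = [c for c in CONFIGS if c[1] <= available_vram_gb]
    return max(fitting, key=lambda c: c[1]) if fitting else None

print(best_config(12))  # -> ('Minimum (RTX 3080)', 10, 400)
print(best_config(24))  # -> ('Recommended (RTX 4090)', 24, 200)
```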
Installation
PyPI Install
pip install moshi
From Source
git clone https://github.com/kyutai-labs/moshi
cd moshi
pip install -e .
Docker
docker pull kyutai/moshi:latest
docker run --gpus all -p 8998:8998 kyutai/moshi:latest
Quick Start
Basic Conversation
from moshi import Moshi
# Initialize model
model = Moshi.from_pretrained("kyutai/moshi-7b")
# Create a session
session = model.create_session()
# Process audio and get response
response = session.process(input_audio)
Real-Time Streaming
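Streaming interfaces consume audio in small fixed-size chunks rather than whole utterances. Independent of Moshi's actual API, a self-contained sketch of slicing a PCM sample buffer into frames (80 ms at 24 kHz = 1920 samples; the helper name is illustrative):

```python
SAMPLE_RATE = 24000  # Moshi operates on 24 kHz audio

def chunk_samples(samples, frame_ms=80, sample_rate=SAMPLE_RATE):
    """Slice a flat sequence of PCM samples into fixed-size frames.
    The final partial frame, if any, is dropped."""
    frame_len = sample_rate * frame_ms // 1000  # 1920 samples at 80 ms
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

# One second of silence -> 12 full frames of 1920 samples each
frames = chunk_samples([0.0] * SAMPLE_RATE)
print(len(frames), len(frames[0]))  # -> 12 1920
```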
import asyncio
from moshi import Moshi, AudioIO

async def run_conversation():
    model = Moshi.from_pretrained("kyutai/moshi-7b")
    session = model.create_session()
    # Bidirectional audio streaming
    audio_io = AudioIO()
    async with session.stream() as stream:
        async for input_chunk in audio_io.input():
            response_chunk = await stream.process(input_chunk)
            await audio_io.output(response_chunk)

asyncio.run(run_conversation())
Full-Duplex in Action
Moshi's full-duplex capability means it doesn't wait for you to finish speaking:
session = model.create_session(
    # Full-duplex settings
    duplex_mode="full",          # "full", "half", or "adaptive"
    overlap_handling="natural",  # How to handle overlapping speech
    # Timing controls
    min_response_gap=50,         # ms before responding
    barge_in_sensitivity=0.7,    # How easily user can interrupt
)
Backchanneling
Moshi can provide natural conversational feedback:
session = model.create_session(
    backchanneling=True,             # Enable "mhm", "I see", etc.
    backchannel_frequency="natural"  # or "minimal", "frequent"
)
Voice Options
Built-in Voices
# Available voices
voices = model.list_voices()
# ['moshi-default', 'moshi-calm', 'moshi-energetic']
session = model.create_session(voice="moshi-calm")
Language Support
Moshi currently supports English and French:
session = model.create_session(
    language="en",     # or "fr"
    accent="neutral"   # or "british", "american", "french"
)
System Prompts
Guide Moshi's behavior:
session = model.create_session(
    system_prompt="""You are a friendly language tutor helping users practice French.
Speak slowly and clearly. Gently correct pronunciation mistakes.
Encourage the user and keep the conversation flowing naturally."""
)
WebSocket Server
Deploy Moshi as a WebSocket server:
from moshi import Moshi, WebSocketServer
model = Moshi.from_pretrained("kyutai/moshi-7b")
server = WebSocketServer(
    model=model,
    host="0.0.0.0",
    port=8998
)
server.run()
JavaScript Client
const ws = new WebSocket('ws://localhost:8998/ws');
ws.binaryType = 'arraybuffer';
// Set up audio context (Moshi works at 24 kHz)
const audioContext = new AudioContext({ sampleRate: 24000 });
// Note: ScriptProcessorNode is deprecated; AudioWorklet is the modern replacement
const processor = audioContext.createScriptProcessor(2048, 1, 1);
// Send audio
processor.onaudioprocess = (e) => {
  const audioData = e.inputBuffer.getChannelData(0);
  ws.send(audioData.buffer);
};
// Receive audio
ws.onmessage = (event) => {
  playAudio(event.data);
};
// Play one chunk, assuming raw mono float32 PCM at 24 kHz
function playAudio(arrayBuffer) {
  const samples = new Float32Array(arrayBuffer);
  const buffer = audioContext.createBuffer(1, samples.length, 24000);
  buffer.copyToChannel(samples, 0);
  const source = audioContext.createBufferSource();
  source.buffer = buffer;
  source.connect(audioContext.destination);
  source.start();
}
Production Deployment
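Before wiring clients to a deployed instance, it helps to confirm the server port is actually reachable. A minimal stdlib-only check (the host and port are the defaults assumed throughout this guide):

```python
import socket

def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    print("moshi reachable:", port_is_open("localhost", 8998))
```

This only verifies TCP reachability, not that the model is loaded; a real readiness probe should complete a WebSocket handshake.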
Docker Compose
version: '3.8'
services:
  moshi:
    image: kyutai/moshi:latest
    runtime: nvidia
    ports:
      - "8998:8998"
    environment:
      - CUDA_VISIBLE_DEVICES=0
      - MODEL_ID=kyutai/moshi-7b
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: moshi
spec:
  replicas: 2
  selector:
    matchLabels:
      app: moshi
  template:
    metadata:
      labels:
        app: moshi
    spec:
      containers:
        - name: moshi
          image: kyutai/moshi:latest
          resources:
            limits:
              nvidia.com/gpu: 1
          ports:
            - containerPort: 8998
Performance Tuning
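The knobs below are easier to reason about with rough weight-memory math: each parameter of a 7B model takes 2 bytes at fp16 and 1 byte at int8, so (back-of-the-envelope, ignoring activations and caches):

```python
PARAMS = 7e9  # 7B parameters

def weight_gb(bytes_per_param: float) -> float:
    """Approximate weight memory in GB (decimal), ignoring activations."""
    return PARAMS * bytes_per_param / 1e9

print(f"fp16: {weight_gb(2):.0f} GB")  # -> fp16: 14 GB
print(f"int8: {weight_gb(1):.0f} GB")  # -> int8: 7 GB
```

This is why int8 quantization roughly halves VRAM and why the 10 GB minimum in the hardware table implies a quantized model.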
Quantization
model = Moshi.from_pretrained(
    "kyutai/moshi-7b",
    quantization="int8"  # Reduces VRAM by ~50%
)
Batch Processing
For multiple concurrent sessions:
model = Moshi.from_pretrained(
    "kyutai/moshi-7b",
    max_batch_size=4,  # Handle 4 conversations concurrently
    dynamic_batching=True
)
Limitations
- Language support: Currently English and French only
- Voice cloning: Limited compared to TTS-specific models
- Fine-tuning: Community fine-tuning tools still maturing
- Documentation: As a newer model, documentation is evolving
Comparison with Alternatives
| Feature | Moshi | PersonaPlex-7B | GPT-4o Realtime |
|---|---|---|---|
| Full-Duplex | Yes | Yes | Yes |
| Latency | ~200ms | ~300ms | ~300ms |
| Open Source | Yes | Yes | No |
| Languages | 2 | 1 | 50+ |
| Self-hostable | Yes | Yes | No |
For more comparisons, see our Open Source Voice AI Models guide.
When to Use Moshi
Choose Moshi when you need:
- Lowest possible latency
- True full-duplex conversation
- Self-hosted deployment
- English or French language support
Consider alternatives when you need:
- More languages → Qwen3-TTS
- Voice cloning → PersonaPlex-7B
- Simple TTS → Fish Speech
- Managed API → PersonaPlex API
Conclusion
Moshi sets a new standard for open source full-duplex voice AI. Its ability to truly listen and speak simultaneously creates conversational experiences that feel remarkably natural. For developers building interactive voice applications, Moshi is a compelling option worth exploring.
This article is part of our Open Source Voice AI Models series.