Google Cloud Text-to-Speech: Complete Developer Guide
Learn how to use Google Cloud TTS API with WaveNet and Neural2 voices. Includes code examples, SSML usage, pricing, and best practices.
Google Cloud Text-to-Speech brings Google's AI expertise to voice synthesis. With WaveNet and Neural2 voices, extensive SSML support, and a generous free tier, it's a compelling choice for developers.
This guide covers everything you need to integrate Google Cloud TTS into your application.
Why Google Cloud TTS?
Google Cloud TTS offers several advantages:
- Voice Quality: WaveNet voices are among the most natural-sounding
- SSML Support: Fine-grained control over pronunciation and prosody
- Languages: 40+ languages with multiple voice options
- Free Tier: 4 million characters/month for Standard voices, 1 million for WaveNet/Neural2
- Integration: Native integration with Google Cloud services
Getting Started
Setup
- Create a Google Cloud project
- Enable the Text-to-Speech API
- Create a service account and download credentials
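The client library locates the downloaded key file through the GOOGLE_APPLICATION_CREDENTIALS environment variable, which the export command in this section sets. Before creating a client, a small check like the following can catch a missing or malformed key file early. This is a minimal sketch using only the standard library; the function name `credentials_problem` is hypothetical and not part of any Google library.

```python
import json
import os
from typing import Mapping, Optional


def credentials_problem(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a description of what is wrong, or None if the key file looks usable."""
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return "GOOGLE_APPLICATION_CREDENTIALS is not set"
    if not os.path.isfile(path):
        return f"credentials file not found: {path}"
    try:
        with open(path, encoding="utf-8") as fh:
            info = json.load(fh)
    except (OSError, json.JSONDecodeError) as exc:
        return f"credentials file is not readable JSON: {exc}"
    # Service account key files carry a "type" field with this value.
    if not isinstance(info, dict) or info.get("type") != "service_account":
        return "credentials file is not a service account key"
    return None


problem = credentials_problem()
print(problem or "credentials look usable")
```

The check only inspects the file locally; it does not confirm that the key is still valid or that the Text-to-Speech API is enabled for the project.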
export GOOGLE_APPLICATION_CREDENTIALS="path/to/credentials.json"
Basic Text-to-Speech
from google.cloud import texttospeech
client = texttospeech.TextToSpeechClient()
input_text = texttospeech.SynthesisInput(text="Hello, this is Google Cloud TTS.")
voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    name="en-US-Neural2-D"
)
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3
)
response = client.synthesize_speech(
    input=input_text,
    voice=voice,
    audio_config=audio_config
)
with open("output.mp3", "wb") as out:
    out.write(response.audio_content)
Streaming Synthesis
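For real-time applications:
from google.cloud import texttospeech_v1beta1 as texttospeech
client = texttospeech.TextToSpeechClient()
# Streaming is limited to certain voice families (Chirp 3: HD at the time of
# writing), so a Neural2 voice name is likely to be rejected here.
streaming_config = texttospeech.StreamingSynthesizeConfig(
    voice=texttospeech.VoiceSelectionParams(
        language_code="en-US",
        name="en-US-Chirp3-HD-Charon"
    )
)
# The first request carries only the configuration
config_request = texttospeech.StreamingSynthesizeRequest(
    streaming_config=streaming_config
)
def request_generator():
    yield config_request
    # Subsequent requests carry the text to synthesize
    yield texttospeech.StreamingSynthesizeRequest(
        input=texttospeech.StreamingSynthesisInput(
            text="This audio streams as it generates."
        )
    )
responses = client.streaming_synthesize(request_generator())
# Streamed chunks are raw PCM by default rather than MP3; verify the format
# for your library version before choosing a file extension.
with open("output.pcm", "wb") as out:
    for response in responses:
        out.write(response.audio_content)
Voice Types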
Google Cloud offers three voice tiers:
| Type | Quality | Latency | Price |
|---|---|---|---|
| Standard | Good | Fast | $4/1M chars |
| WaveNet | Excellent | Medium | $16/1M chars |
| Neural2 | Excellent | Medium | $16/1M chars |
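The voice names used throughout this guide, such as en-US-Neural2-D, appear to encode the language code, the tier, and a variant letter. A small parser can make that structure explicit, for example to look up a tier before estimating cost. This is a sketch that assumes the four-part shape seen in this guide; `VoiceName` and `parse_voice_name` are hypothetical helpers, and voice families with longer names would need their own handling.

```python
from typing import NamedTuple


class VoiceName(NamedTuple):
    language_code: str  # e.g. "en-US"
    tier: str           # e.g. "Standard", "Wavenet", "Neural2"
    variant: str        # e.g. "D"


def parse_voice_name(name: str) -> VoiceName:
    """Split a name like 'en-US-Neural2-D' into its parts.

    Assumes the '<lang>-<REGION>-<Tier>-<Variant>' shape; anything else
    raises ValueError instead of guessing.
    """
    parts = name.split("-")
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"unexpected voice name format: {name!r}")
    lang, region, tier, variant = parts
    return VoiceName(f"{lang}-{region}", tier, variant)


print(parse_voice_name("en-US-Wavenet-D"))
```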
Standard Voices
Basic TTS, suitable for high-volume applications where cost matters:
voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    name="en-US-Standard-A"
)
WaveNet Voices
DeepMind's WaveNet technology produces highly natural speech:
voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    name="en-US-Wavenet-D"
)
Neural2 Voices
Latest generation with improved quality and efficiency:
voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    name="en-US-Neural2-D"
)
SSML Support
Google Cloud has excellent SSML (Speech Synthesis Markup Language) support for fine-grained control.
Basic SSML
ssml = """
<speak>
Hello! <break time="500ms"/>
Welcome to <emphasis level="strong">Google Cloud</emphasis> TTS.
</speak>
"""
input_text = texttospeech.SynthesisInput(ssml=ssml)
Pronunciation Control
ssml = """
<speak>
The word <phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme>
can be pronounced differently.
</speak>
"""
Prosody Control
ssml = """
<speak>
<prosody rate="slow" pitch="-2st">
This is spoken slowly with a lower pitch.
</prosody>
<prosody rate="fast" pitch="+2st">
This is spoken quickly with a higher pitch.
</prosody>
</speak>
"""
Audio Markers
Track position in generated audio:
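# Timepoints are exposed by the v1beta1 API and are requested through a
# SynthesizeSpeechRequest; synthesize_speech() has no enable_time_pointing keyword.
from google.cloud import texttospeech_v1beta1 as texttospeech
client = texttospeech.TextToSpeechClient()
ssml = """
<speak>
<mark name="intro"/>Welcome to our app.
<mark name="main"/>Here is the main content.
<mark name="end"/>Thank you for listening.
</speak>
"""
request = texttospeech.SynthesizeSpeechRequest(
    input=texttospeech.SynthesisInput(ssml=ssml),
    voice=texttospeech.VoiceSelectionParams(
        language_code="en-US",
        name="en-US-Neural2-D"
    ),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    ),
    enable_time_pointing=[
        texttospeech.SynthesizeSpeechRequest.TimepointType.SSML_MARK
    ],
)
response = client.synthesize_speech(request=request)
# Access timepoints
for timepoint in response.timepoints:
    print(f"Mark '{timepoint.mark_name}' at {timepoint.time_seconds}s")
Audio Configuration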
Output Formats
# MP3 (compressed, good for web)
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3
)
# LINEAR16 (uncompressed, for processing)
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.LINEAR16,
    sample_rate_hertz=24000
)
# OGG_OPUS (efficient streaming)
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.OGG_OPUS
)
Audio Effects
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3,
    speaking_rate=1.0,    # 0.25 to 4.0
    pitch=0.0,            # -20.0 to 20.0 semitones
    volume_gain_db=0.0,   # -96.0 to 16.0
    effects_profile_id=["small-bluetooth-speaker-class-device"]
)
Device Profiles
Optimize for specific playback devices:
profiles = [
    "wearable-class-device",
    "handset-class-device",
    "headphone-class-device",
    "small-bluetooth-speaker-class-device",
    "medium-bluetooth-speaker-class-device",
    "large-home-entertainment-class-device",
    "large-automotive-class-device",
    "telephony-class-application"
]
Pricing
Google Cloud TTS uses character-based pricing:
| Voice Type | Price per 1M characters |
|---|---|
| Standard | $4.00 |
| WaveNet | $16.00 |
| Neural2 | $16.00 |
Free Tier
- Standard voices: 4 million characters/month
- WaveNet/Neural2: 1 million characters/month
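To see how the free tier and per-character prices combine, the figures above can be turned into a rough monthly estimate. This sketch uses only the numbers quoted in this guide and assumes each voice type has its own free allowance that is subtracted before billing; `estimate_monthly_cost` is a hypothetical helper, and current Google Cloud pricing should be checked before relying on the result.

```python
# Prices and free-tier allowances copied from the tables in this guide.
PRICE_PER_MILLION_USD = {"standard": 4.00, "wavenet": 16.00, "neural2": 16.00}
FREE_CHARS_PER_MONTH = {
    "standard": 4_000_000,
    "wavenet": 1_000_000,
    "neural2": 1_000_000,
}


def estimate_monthly_cost(voice_type: str, characters: int) -> float:
    """Return the estimated monthly bill in USD for one voice type."""
    key = voice_type.lower()
    if key not in PRICE_PER_MILLION_USD:
        raise ValueError(f"unknown voice type: {voice_type!r}")
    # Assumption: only characters beyond the free allowance are billed.
    billable = max(0, characters - FREE_CHARS_PER_MONTH[key])
    return billable / 1_000_000 * PRICE_PER_MILLION_USD[key]


print(estimate_monthly_cost("Standard", 10_000_000))  # 6M billable -> 24.0
print(estimate_monthly_cost("Neural2", 10_000_000))   # 9M billable -> 144.0
```

At 10 million characters a month, the same workload costs six times as much on Neural2 as on Standard under these assumptions, which is why the voice tier is usually the first lever for cost control.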
Best Practices
Optimize for Latency
- Use Standard voices for real-time applications
- Enable streaming for faster time-to-first-audio
- Keep each request within the API's 5,000-byte input limit and split longer text into smaller chunks
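The chunking advice above can be sketched as a function that groups sentences until a size budget is reached. This is one possible approach, not the library's own mechanism: `chunk_text` is a hypothetical helper, the sentence split is a simple punctuation heuristic, and size is measured in UTF-8 bytes on the assumption that the limit is counted in bytes rather than characters.

```python
import re

MAX_REQUEST_BYTES = 5000  # assumed per-request input limit; confirm against current quotas


def chunk_text(text: str, max_bytes: int = MAX_REQUEST_BYTES) -> list[str]:
    """Group sentences into chunks whose UTF-8 size stays within max_bytes."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate.encode("utf-8")) <= max_bytes:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # The sentence did not fit: start a new chunk and fill it word by
        # word, so an over-long sentence is split on word boundaries.
        # A single word larger than max_bytes is kept whole as its own chunk.
        current = ""
        for word in sentence.split():
            candidate = f"{current} {word}" if current else word
            if len(candidate.encode("utf-8")) <= max_bytes:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = word
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("One. Two. Three.", max_bytes=10))
```

Each chunk can then be sent as its own synthesis request and the resulting audio concatenated in order.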
Improve Quality
- Use Neural2 or WaveNet voices
- Add SSML for natural pacing
- Use appropriate device profiles
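One way to add SSML for natural pacing is to insert a short break between sentences and escape the text so that stray characters cannot break the markup. The helper below is a sketch under those assumptions; `paced_ssml` and its default pause length are illustrative choices, not values recommended by Google.

```python
import re
from xml.sax.saxutils import escape


def paced_ssml(text: str, pause_ms: int = 400) -> str:
    """Wrap text in <speak>, inserting a <break> between sentences."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    pause = f'<break time="{pause_ms}ms"/>'
    # escape() replaces &, < and > so plain text cannot be read as markup.
    body = pause.join(escape(s) for s in sentences)
    return f"<speak>{body}</speak>"


print(paced_ssml("Welcome back. You have 3 new messages."))
```

The returned string can be passed as the `ssml` argument of `SynthesisInput` in place of plain text.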
Cost Optimization
- Use Standard voices when quality is less critical
- Cache frequently generated audio
- Stay within free tier for development
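Caching frequently generated audio can be as simple as keying files on the exact inputs that determine the output. The sketch below assumes a hypothetical `synthesize` callable that wraps the API call and returns audio bytes; the helper names and the on-disk layout are illustrative, not part of the client library.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cache_key(text: str, voice_name: str, encoding: str) -> str:
    """Stable key: the same text, voice and encoding always map to one file."""
    raw = "\x1f".join((voice_name, encoding, text)).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def cached_synthesize(
    text: str,
    voice_name: str,
    encoding: str,
    synthesize: Callable[[str, str, str], bytes],
    cache_dir: Path,
) -> bytes:
    """Return cached audio if present, otherwise synthesize and store it."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(text, voice_name, encoding)}.bin"
    if path.exists():
        return path.read_bytes()
    # Cache miss: call the (hypothetical) API wrapper once and keep the result.
    audio = synthesize(text, voice_name, encoding)
    path.write_bytes(audio)
    return audio
```

Any setting that changes the audio, such as speaking rate or a device profile, would need to be folded into the key as well; otherwise stale audio could be served for a different configuration.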
Language Support
Google Cloud TTS supports 40+ languages. Popular options:
| Language | Code | Voices |
|---|---|---|
| English (US) | en-US | 20+ |
| English (UK) | en-GB | 10+ |
| Spanish | es-ES | 10+ |
| French | fr-FR | 10+ |
| German | de-DE | 10+ |
| Japanese | ja-JP | 5+ |
| Mandarin | cmn-CN | 5+ |
List Available Voices
voices = client.list_voices(language_code="en-US")
for voice in voices.voices:
    print(f"{voice.name} - {voice.ssml_gender}")
Comparison with Alternatives
| Feature | Google Cloud | OpenAI TTS | ElevenLabs |
|---|---|---|---|
| Voice Quality | Very Good | Very Good | Excellent |
| SSML Support | Excellent | None | Limited |
| Voice Cloning | Custom Voice* | No | Yes |
| Free Tier | Generous | No | Limited |
| Latency | ~200ms | ~400ms | ~300ms |
*Custom Voice requires additional setup and cost.
For a detailed comparison, see our Voice AI API Comparison Guide.
When to Choose Google Cloud TTS
Google Cloud TTS is ideal when:
- You need fine-grained SSML control
- You're already using Google Cloud
- Free tier is important for your project
- You need good multilingual support
Consider alternatives if:
- Voice quality is the top priority (ElevenLabs)
- You need voice cloning (ElevenLabs)
- You need lowest latency (Amazon Polly)
Conclusion
Google Cloud Text-to-Speech offers an excellent balance of quality, features, and pricing. The combination of WaveNet/Neural2 voices, comprehensive SSML support, and a generous free tier makes it a strong choice for many applications.
This article is part of our Voice AI API Comparison series. Explore guides for ElevenLabs, Amazon Polly, and more.