Google Cloud Text-to-Speech: Complete Developer Guide

Google Cloud Text-to-Speech brings Google's AI expertise to voice synthesis. With WaveNet and Neural2 voices, extensive SSML support, and a generous free tier, it's a compelling choice for developers.

This guide covers setup, basic and streaming synthesis, voice types, SSML, audio configuration, pricing, and best practices for integrating Google Cloud TTS into your application.

Why Google Cloud TTS?

Google Cloud TTS offers several advantages:

Voice Quality: WaveNet voices are among the most natural-sounding
SSML Support: Fine-grained control over pronunciation and prosody
Languages: 40+ languages with multiple voice options
Free Tier: 4 million characters/month free (standard voices)
Integration: Native integration with Google Cloud services

Getting Started

Setup

Create a Google Cloud project
Enable the Text-to-Speech API
Create a service account and download credentials

export GOOGLE_APPLICATION_CREDENTIALS="path/to/credentials.json"
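Before creating a client, it can help to confirm that the variable is set and points at a usable key file. The following is a minimal sketch, not part of the Google client library: the helper name check_credentials is hypothetical, and the check on the "type" field assumes you are using a service account key as described in the steps above.

```python
import json
import os


def check_credentials(env_var: str = "GOOGLE_APPLICATION_CREDENTIALS") -> str:
    """Return the credentials path, or raise RuntimeError with a clear message."""
    path = os.environ.get(env_var)
    if not path:
        raise RuntimeError(f"{env_var} is not set")
    if not os.path.isfile(path):
        raise RuntimeError(f"{env_var} points to a missing file: {path}")
    with open(path, encoding="utf-8") as fh:
        info = json.load(fh)  # raises json.JSONDecodeError on malformed JSON
    # Assumption: the file is a service account key, which carries this field.
    if not isinstance(info, dict) or info.get("type") != "service_account":
        raise RuntimeError(f"{path} does not look like a service account key")
    return path
```

Calling check_credentials() once at startup turns a confusing authentication failure later on into an immediate, readable error.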

Basic Text-to-Speech

from google.cloud import texttospeech
 
client = texttospeech.TextToSpeechClient()
 
input_text = texttospeech.SynthesisInput(text="Hello, this is Google Cloud TTS.")
 
voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    name="en-US-Neural2-D"
)
 
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3
)
 
response = client.synthesize_speech(
    input=input_text,
    voice=voice,
    audio_config=audio_config
)
 
with open("output.mp3", "wb") as out:
    out.write(response.audio_content)

Streaming Synthesis

For real-time applications:

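from google.cloud import texttospeech_v1beta1 as texttospeech

client = texttospeech.TextToSpeechClient()

# Streaming is limited to certain voices (Chirp 3: HD at the time of
# writing), so check the current docs before picking a voice name.
streaming_config = texttospeech.StreamingSynthesizeConfig(
    voice=texttospeech.VoiceSelectionParams(
        language_code="en-US",
        name="en-US-Chirp3-HD-Charon"
    )
)

# The first request in the stream carries only the configuration.
config_request = texttospeech.StreamingSynthesizeRequest(
    streaming_config=streaming_config
)

def request_generator():
    yield config_request
    # Each following request carries a piece of text to synthesize.
    yield texttospeech.StreamingSynthesizeRequest(
        input=texttospeech.StreamingSynthesisInput(
            text="This audio streams as it generates."
        )
    )

responses = client.streaming_synthesize(request_generator())

# Streaming does not return MP3; without an explicit audio config the
# chunks are raw PCM, so write them to a raw file rather than .mp3.
with open("output.pcm", "wb") as out:
    for response in responses:
        out.write(response.audio_content)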
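If the streamed chunks arrive as raw 16-bit PCM, most players will not open the result until it is wrapped in a container. The sketch below uses only the standard library wave module. The helper name write_pcm_as_wav is hypothetical, and the 24000 Hz mono format is an assumption that you should match to whatever your streaming configuration actually returns.

```python
import wave
from typing import Iterable


def write_pcm_as_wav(chunks: Iterable[bytes], path: str,
                     sample_rate_hz: int = 24000) -> int:
    """Write raw 16-bit mono PCM chunks to a WAV file; return PCM bytes written."""
    total = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)               # mono (assumed)
        wav.setsampwidth(2)               # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate_hz)  # assumed sample rate
        for chunk in chunks:
            wav.writeframes(chunk)
            total += len(chunk)
    return total
```

With a response iterator like the one above, usage would look like write_pcm_as_wav((r.audio_content for r in responses), "output.wav").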

Voice Types

This guide focuses on three of Google Cloud's voice tiers:

Type	Quality	Latency	Price
Standard	Good	Fast	$4/1M chars
WaveNet	Excellent	Medium	$16/1M chars
Neural2	Excellent	Medium	$16/1M chars

Standard Voices

Basic TTS, suitable for high-volume applications where cost matters:

voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    name="en-US-Standard-A"
)

WaveNet Voices

DeepMind's WaveNet technology produces highly natural speech:

voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    name="en-US-Wavenet-D"
)

Neural2 Voices

A newer generation of neural voices with improved quality and efficiency:

voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    name="en-US-Neural2-D"
)

SSML Support

Google Cloud has excellent SSML (Speech Synthesis Markup Language) support for fine-grained control.

Basic SSML

ssml = """
<speak>
    Hello! <break time="500ms"/>
    Welcome to <emphasis level="strong">Google Cloud</emphasis> TTS.
</speak>
"""
 
input_text = texttospeech.SynthesisInput(ssml=ssml)
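SSML is XML, so characters such as & and < in user-supplied text must be escaped before they are placed inside a speak element. The sketch below is one way to do that with the standard library; build_ssml is a hypothetical helper, not part of the client library.

```python
from xml.sax.saxutils import escape


def build_ssml(sentences, pause_ms: int = 500) -> str:
    """Join plain-text sentences into one <speak> document with pauses between them."""
    pause = f'<break time="{pause_ms}ms"/>'
    # escape() converts &, < and > so arbitrary text cannot break the markup.
    body = pause.join(escape(sentence) for sentence in sentences)
    return f"<speak>{body}</speak>"
```

The returned string can be passed as the ssml argument of SynthesisInput, as in the example above.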

Pronunciation Control

ssml = """
<speak>
    The word <phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme>
    can be pronounced differently.
</speak>
"""

Prosody Control

ssml = """
<speak>
    <prosody rate="slow" pitch="-2st">
        This is spoken slowly with a lower pitch.
    </prosody>
    <prosody rate="fast" pitch="+2st">
        This is spoken quickly with a higher pitch.
    </prosody>
</speak>
"""

Audio Markers

Track position in generated audio:

ssml = """
<speak>
    <mark name="intro"/>Welcome to our app.
    <mark name="main"/>Here is the main content.
    <mark name="end"/>Thank you for listening.
</speak>
"""
 
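# Timepoints require the v1beta1 client and an explicit request object.
from google.cloud import texttospeech_v1beta1 as tts_beta

beta_client = tts_beta.TextToSpeechClient()

request = tts_beta.SynthesizeSpeechRequest(
    input=tts_beta.SynthesisInput(ssml=ssml),
    voice=tts_beta.VoiceSelectionParams(language_code="en-US", name="en-US-Neural2-D"),
    audio_config=tts_beta.AudioConfig(audio_encoding=tts_beta.AudioEncoding.MP3),
    enable_time_pointing=[
        tts_beta.SynthesizeSpeechRequest.TimepointType.SSML_MARK
    ],
)
response = beta_client.synthesize_speech(request=request)

# Access timepoints
for timepoint in response.timepoints:
    print(f"Mark '{timepoint.mark_name}' at {timepoint.time_seconds}s")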
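Once you have the timepoints, a common next step is to work out which section is playing at a given playback position. The following is a minimal sketch that works on plain (mark_name, time_seconds) pairs rather than on the response object itself; mark_at is a hypothetical helper, and it assumes the pairs are sorted by time.

```python
from bisect import bisect_right


def mark_at(timepoints, position_s: float):
    """Return the name of the last mark at or before position_s, or None.

    timepoints is a list of (mark_name, time_seconds) pairs sorted by time.
    """
    times = [time_s for _, time_s in timepoints]
    idx = bisect_right(times, position_s)  # count of marks at or before position
    return timepoints[idx - 1][0] if idx else None
```

The pairs can be built from a response with [(tp.mark_name, tp.time_seconds) for tp in response.timepoints].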

Audio Configuration

Output Formats

# MP3 (compressed, good for web)
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3
)
 
# LINEAR16 (uncompressed, for processing)
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.LINEAR16,
    sample_rate_hertz=24000
)
 
# OGG_OPUS (efficient streaming)
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.OGG_OPUS
)

Audio Effects

audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3,
    speaking_rate=1.0,  # 0.25 to 4.0
    pitch=0.0,  # -20.0 to 20.0 semitones
    volume_gain_db=0.0,  # -96.0 to 16.0
    effects_profile_id=["small-bluetooth-speaker-class-device"]
)
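Values outside the allowed ranges are rejected by the API, so user-controlled settings are worth clamping first. The sketch below uses the ranges quoted in the comments above; treat them as this guide's figures and confirm them against the current API reference. clamp_audio_settings is a hypothetical helper.

```python
# Ranges as quoted in this guide; verify against the current API reference.
RANGES = {
    "speaking_rate": (0.25, 4.0),
    "pitch": (-20.0, 20.0),           # semitones
    "volume_gain_db": (-96.0, 16.0),  # decibels
}


def clamp_audio_settings(speaking_rate=1.0, pitch=0.0, volume_gain_db=0.0):
    """Return the three settings, each clamped into its range in RANGES."""
    values = {
        "speaking_rate": speaking_rate,
        "pitch": pitch,
        "volume_gain_db": volume_gain_db,
    }
    return {
        name: max(RANGES[name][0], min(RANGES[name][1], value))
        for name, value in values.items()
    }
```

The resulting dictionary can be unpacked into the constructor, for example AudioConfig(audio_encoding=..., **clamp_audio_settings(speaking_rate=user_rate)).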

Device Profiles

Optimize for specific playback devices:

profiles = [
    "wearable-class-device",
    "handset-class-device",
    "headphone-class-device",
    "small-bluetooth-speaker-class-device",
    "medium-bluetooth-speaker-class-device",
    "large-home-entertainment-class-device",
    "large-automotive-class-device",
    "telephony-class-application"
]

Pricing

Google Cloud TTS uses character-based pricing:

Voice Type	Price per 1M characters
Standard	$4.00
WaveNet	$16.00
Neural2	$16.00

Free Tier

Standard voices: 4 million characters/month
WaveNet/Neural2: 1 million characters/month
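To see how the price table and free tier combine, here is a rough monthly estimate. It is a sketch built on the figures in this guide, which may not match current pricing. It also assumes the free allowance applies separately to each voice type, which is a simplification; estimate_monthly_cost is a hypothetical helper.

```python
# Figures copied from the tables in this guide; check the pricing page before relying on them.
PRICE_PER_MILLION = {"Standard": 4.00, "WaveNet": 16.00, "Neural2": 16.00}
FREE_CHARS = {"Standard": 4_000_000, "WaveNet": 1_000_000, "Neural2": 1_000_000}


def estimate_monthly_cost(voice_type: str, chars: int) -> float:
    """Estimated USD cost for one month of usage with a single voice type."""
    # Assumption: the free allowance is counted per voice type.
    billable = max(0, chars - FREE_CHARS[voice_type])
    return billable * PRICE_PER_MILLION[voice_type] / 1_000_000
```

For example, 2.5 million Neural2 characters leave 1.5 million billable characters, which comes to $24.00 under these figures.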

Best Practices

Optimize for Latency

Use Standard voices for real-time applications
Enable streaming for faster time-to-first-audio
Keep each request's text small (the API caps input at 5,000 bytes per request)
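Long text has to be split before it is sent. The sketch below splits plain text at sentence boundaries and keeps each chunk under a byte budget, on the assumption that the limit is measured in UTF-8 bytes. chunk_text is a hypothetical helper; it does not handle SSML, and it raises an error rather than cutting a single oversized sentence.

```python
import re


def chunk_text(text: str, max_bytes: int = 5000):
    """Split text into chunks of at most max_bytes UTF-8 bytes at sentence ends."""
    chunks, current = [], ""
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if not sentence:
            continue
        if len(sentence.encode("utf-8")) > max_bytes:
            raise ValueError("a single sentence exceeds max_bytes")
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate.encode("utf-8")) > max_bytes:
            chunks.append(current)  # current chunk is full, start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized in its own request and the audio joined in order.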

Improve Quality

Use Neural2 or WaveNet voices
Add SSML for natural pacing
Use appropriate device profiles

Cost Optimization

Use Standard voices when quality is less critical
Cache frequently generated audio
Stay within free tier for development
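Caching can be as simple as storing audio on disk under a key derived from everything that affects the output. The following is a minimal sketch under that assumption. The names cache_key and get_or_synthesize are hypothetical, and synthesize stands for any function you supply that takes text and returns audio bytes (for example a thin wrapper around client.synthesize_speech).

```python
import hashlib
import os


def cache_key(text: str, voice_name: str, encoding: str) -> str:
    """Stable key: the same text, voice, and encoding always give the same name."""
    raw = "\x1f".join([voice_name, encoding, text]).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def get_or_synthesize(text, voice_name, encoding, synthesize, cache_dir="tts_cache"):
    """Return cached audio bytes, calling synthesize(text) only on a cache miss."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, cache_key(text, voice_name, encoding))
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return fh.read()
    audio = synthesize(text)  # billed API call happens only here
    with open(path, "wb") as fh:
        fh.write(audio)
    return audio
```

If you also vary speaking rate, pitch, or device profile, those values belong in the key too, since they change the audio.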

Language Support

Google Cloud TTS supports 40+ languages. Popular options:

Language	Code	Voices
English (US)	en-US	20+
English (UK)	en-GB	10+
Spanish	es-ES	10+
French	fr-FR	10+
German	de-DE	10+
Japanese	ja-JP	5+
Mandarin	cmn-CN	5+

List Available Voices

voices = client.list_voices(language_code="en-US")
 
for voice in voices.voices:
    print(f"{voice.name} - {voice.ssml_gender}")
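The returned list mixes voice tiers, so grouping names by tier makes it easier to pick one. The sketch below relies on the tier appearing as a segment of the voice name, as it does in the names used in this guide (for example en-US-Wavenet-D). That naming pattern is an assumption, not a documented contract, and group_by_tier is a hypothetical helper.

```python
from collections import defaultdict

# Tier segments as they appear in voice names used in this guide.
TIERS = ("Standard", "Wavenet", "Neural2")


def group_by_tier(voice_names):
    """Group voice names by tier; names matching no known tier go under 'Other'."""
    groups = defaultdict(list)
    for name in voice_names:
        tier = next((t for t in TIERS if f"-{t}-" in name), "Other")
        groups[tier].append(name)
    return dict(groups)
```

With the listing above, usage would be group_by_tier(v.name for v in voices.voices).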

Comparison with Alternatives

Feature	Google Cloud	OpenAI TTS	ElevenLabs
Voice Quality	Very Good	Very Good	Excellent
SSML Support	Excellent	None	Limited
Voice Cloning	Custom Voice*	No	Yes
Free Tier	Generous	No	Limited
Latency	~200ms	~400ms	~300ms

*Custom Voice requires additional setup and cost.

For a detailed comparison, see our Voice AI API Comparison Guide.

When to Choose Google Cloud TTS

Google Cloud TTS is ideal when:

You need fine-grained SSML control
You're already using Google Cloud
Free tier is important for your project
You need good multilingual support

Consider alternatives if:

Voice quality is the top priority (ElevenLabs)
You need voice cloning (ElevenLabs)
You need lowest latency (Amazon Polly)

Conclusion

Google Cloud Text-to-Speech offers an excellent balance of quality, features, and pricing. The combination of WaveNet/Neural2 voices, comprehensive SSML support, and a generous free tier makes it a strong choice for many applications.

This article is part of our Voice AI API Comparison series. Explore guides for ElevenLabs, Amazon Polly, and more.
