Technology · 6 min read

Google Cloud Text-to-Speech: Complete Developer Guide

Learn how to use Google Cloud TTS API with WaveNet and Neural2 voices. Includes code examples, SSML usage, pricing, and best practices.

Google Cloud Text-to-Speech brings Google's AI expertise to voice synthesis. With WaveNet and Neural2 voices, extensive SSML support, and a generous free tier, it's a compelling choice for developers.

This guide covers everything you need to integrate Google Cloud TTS into your application.

Why Google Cloud TTS?

Google Cloud TTS offers several advantages:

  • Voice Quality: WaveNet voices are among the most natural-sounding
  • SSML Support: Fine-grained control over pronunciation and prosody
  • Languages: 40+ languages with multiple voice options
  • Free Tier: 4 million characters/month free (standard voices)
  • Integration: Native integration with Google Cloud services

Getting Started

Setup

  1. Create a Google Cloud project
  2. Enable the Text-to-Speech API
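  3. Create a service account and download its JSON key
  4. Install the Python client library and point it at the key file
pip install google-cloud-texttospeech
export GOOGLE_APPLICATION_CREDENTIALS="path/to/credentials.json"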
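
Before making the first API call, it can help to confirm that the environment variable really points at a readable service account key. The following is a minimal sketch using only the standard library; `check_credentials` is a hypothetical helper name, not part of the Google client library.

```python
import json
import os


def check_credentials(env=None):
    """Return (ok, message) describing the GOOGLE_APPLICATION_CREDENTIALS setup."""
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return False, "GOOGLE_APPLICATION_CREDENTIALS is not set"
    if not os.path.isfile(path):
        return False, f"credentials file not found: {path}"
    try:
        with open(path, encoding="utf-8") as fh:
            info = json.load(fh)
    except (OSError, ValueError) as exc:
        return False, f"credentials file is not readable JSON: {exc}"
    # Service account key files carry a "type" field set to "service_account".
    if not isinstance(info, dict) or info.get("type") != "service_account":
        return False, "file does not look like a service account key"
    return True, "credentials look usable"


if __name__ == "__main__":
    ok, message = check_credentials()
    print(message)
```

This only checks the file locally; it does not verify that the key is valid or that the API is enabled for the project.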

Basic Text-to-Speech

from google.cloud import texttospeech
 
client = texttospeech.TextToSpeechClient()
 
input_text = texttospeech.SynthesisInput(text="Hello, this is Google Cloud TTS.")
 
voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    name="en-US-Neural2-D"
)
 
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3
)
 
response = client.synthesize_speech(
    input=input_text,
    voice=voice,
    audio_config=audio_config
)
 
with open("output.mp3", "wb") as out:
    out.write(response.audio_content)

Streaming Synthesis

For real-time applications:

from google.cloud import texttospeech_v1beta1 as texttospeech
 
client = texttospeech.TextToSpeechClient()
 
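# Note: at the time of writing, Google documents streaming synthesis only for
# Chirp 3: HD voices, so a Neural2 voice name is likely to be rejected here.
streaming_config = texttospeech.StreamingSynthesizeConfig(
    voice=texttospeech.VoiceSelectionParams(
        language_code="en-US",
        name="en-US-Chirp3-HD-Charon"
    )
)
 
config_request = texttospeech.StreamingSynthesizeRequest(
    streaming_config=streaming_config
)
 
def request_generator():
    # The first request carries only the configuration; later ones carry text.
    yield config_request
    yield texttospeech.StreamingSynthesizeRequest(
        input=texttospeech.StreamingSynthesisInput(
            text="This audio streams as it generates."
        )
    )
 
responses = client.streaming_synthesize(request_generator())
 
# Streamed chunks are raw audio (headerless 16-bit PCM by default), not MP3,
# so save them as raw samples rather than as an .mp3 file.
with open("output.pcm", "wb") as out:
    for response in responses:
        out.write(response.audio_content)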
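
When streamed chunks arrive as headerless PCM, most players need a container around them. The sketch below wraps the chunks in a WAV file with the standard `wave` module. It assumes 16-bit mono samples at 24 kHz, which you should check against the voice and streaming configuration you actually use; `pcm_chunks_to_wav` is a hypothetical helper name.

```python
import io
import wave


def pcm_chunks_to_wav(chunks, sample_rate_hz=24000, channels=1, sample_width=2):
    """Join raw PCM chunks and wrap them in a WAV container (returned as bytes)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate_hz)
        for chunk in chunks:
            wav.writeframes(chunk)
    return buffer.getvalue()
```

With the streaming loop above, this could be called as `pcm_chunks_to_wav(r.audio_content for r in responses)` and the result written to a `.wav` file.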

Voice Types

This guide covers three of Google Cloud's voice tiers:

Type      Quality    Latency  Price
Standard  Good       Fast     $4/1M chars
WaveNet   Excellent  Medium   $16/1M chars
Neural2   Excellent  Medium   $16/1M chars

Standard Voices

Basic TTS, suitable for high-volume applications where cost matters:

voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    name="en-US-Standard-A"
)

WaveNet Voices

DeepMind's WaveNet technology produces highly natural speech:

voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    name="en-US-Wavenet-D"
)

Neural2 Voices

A newer generation than WaveNet, with improved quality and efficiency:

voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    name="en-US-Neural2-D"
)

SSML Support

Google Cloud has excellent SSML (Speech Synthesis Markup Language) support for fine-grained control.

Basic SSML

ssml = """
<speak>
    Hello! <break time="500ms"/>
    Welcome to <emphasis level="strong">Google Cloud</emphasis> TTS.
</speak>
"""
 
input_text = texttospeech.SynthesisInput(ssml=ssml)
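
If the spoken text comes from users or a database, characters such as `&` and `<` need escaping before they go inside `<speak>`, or the markup becomes invalid. A minimal sketch, assuming a list of plain-text sentences joined by fixed pauses; `build_ssml` is a hypothetical helper, not part of the client library.

```python
from xml.sax.saxutils import escape


def build_ssml(sentences, pause_ms=500):
    """Wrap plain-text sentences in <speak>, separated by <break> tags."""
    pause = f'<break time="{int(pause_ms)}ms"/>'
    # escape() replaces &, < and > so the text cannot break the markup.
    body = pause.join(escape(sentence) for sentence in sentences)
    return f"<speak>{body}</speak>"
```

The returned string can be passed as `texttospeech.SynthesisInput(ssml=...)` in the same way as the hand-written SSML above.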

Pronunciation Control

ssml = """
<speak>
    The word <phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme>
    can be pronounced differently.
</speak>
"""

Prosody Control

ssml = """
<speak>
    <prosody rate="slow" pitch="-2st">
        This is spoken slowly with a lower pitch.
    </prosody>
    <prosody rate="fast" pitch="+2st">
        This is spoken quickly with a higher pitch.
    </prosody>
</speak>
"""

Audio Markers

Track position in generated audio:

ssml = """
<speak>
    <mark name="intro"/>Welcome to our app.
    <mark name="main"/>Here is the main content.
    <mark name="end"/>Thank you for listening.
</speak>
"""
 
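# Timepoints are exposed by the v1beta1 API, and enable_time_pointing must be
# set on the request object rather than passed as a keyword argument.
# Build voice and audio_config from this same v1beta1 module.
from google.cloud import texttospeech_v1beta1 as texttospeech
 
client = texttospeech.TextToSpeechClient()
 
request = texttospeech.SynthesizeSpeechRequest(
    input=texttospeech.SynthesisInput(ssml=ssml),
    voice=voice,
    audio_config=audio_config,
    enable_time_pointing=[
        texttospeech.SynthesizeSpeechRequest.TimepointType.SSML_MARK
    ],
)
response = client.synthesize_speech(request=request)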
 
# Access timepoints
for timepoint in response.timepoints:
    print(f"Mark '{timepoint.mark_name}' at {timepoint.time_seconds}s")
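
Mark timestamps are often more useful as segments, for example to highlight the section currently playing. The sketch below turns any objects exposing `mark_name` and `time_seconds` into `(name, start, end)` tuples. The total audio duration is not part of the timepoints, so it is passed in by the caller here as an assumption; `mark_segments` is a hypothetical helper.

```python
from types import SimpleNamespace


def mark_segments(timepoints, total_seconds):
    """Return (mark_name, start, end) tuples; each ends where the next mark starts."""
    ordered = sorted(timepoints, key=lambda tp: tp.time_seconds)
    segments = []
    for index, tp in enumerate(ordered):
        if index + 1 < len(ordered):
            end = ordered[index + 1].time_seconds
        else:
            end = total_seconds  # the last segment runs to the end of the audio
        segments.append((tp.mark_name, tp.time_seconds, end))
    return segments


if __name__ == "__main__":
    # Stand-in objects; in practice pass response.timepoints.
    demo = [
        SimpleNamespace(mark_name="intro", time_seconds=0.0),
        SimpleNamespace(mark_name="main", time_seconds=1.5),
    ]
    print(mark_segments(demo, total_seconds=4.0))
```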

Audio Configuration

Output Formats

# MP3 (compressed, good for web); the encoding must always be set explicitly
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3
)
 
# LINEAR16 (uncompressed, for processing)
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.LINEAR16,
    sample_rate_hertz=24000
)
 
# OGG_OPUS (efficient streaming)
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.OGG_OPUS
)

Audio Effects

audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3,
    speaking_rate=1.0,  # 0.25 to 4.0
    pitch=0.0,  # -20.0 to 20.0 semitones
    volume_gain_db=0.0,  # -96.0 to 16.0
    effects_profile_id=["small-bluetooth-speaker-class-device"]
)
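
When these values come from user settings, it may be worth clamping them before building the config so that an out-of-range slider value does not turn into a failed request. A minimal sketch that uses the ranges from the comments above; `LIMITS` and `clamp_audio_params` are hypothetical names.

```python
# Ranges taken from the comments in the AudioConfig example above.
LIMITS = {
    "speaking_rate": (0.25, 4.0),
    "pitch": (-20.0, 20.0),
    "volume_gain_db": (-96.0, 16.0),
}


def clamp_audio_params(**params):
    """Clamp known AudioConfig numeric fields to their allowed ranges."""
    clamped = {}
    for name, value in params.items():
        if name not in LIMITS:
            raise ValueError(f"unknown parameter: {name}")
        low, high = LIMITS[name]
        clamped[name] = min(max(float(value), low), high)
    return clamped
```

The result is a plain dict, so it can be unpacked into the constructor, for example `texttospeech.AudioConfig(audio_encoding=..., **clamp_audio_params(speaking_rate=1.2))`.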

Device Profiles

Optimize for specific playback devices:

profiles = [
    "wearable-class-device",
    "handset-class-device",
    "headphone-class-device",
    "small-bluetooth-speaker-class-device",
    "medium-bluetooth-speaker-class-device",
    "large-home-entertainment-class-device",
    "large-automotive-class-device",
    "telephony-class-application"
]
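
Profile IDs are long strings that are easy to mistype. One option is to validate them locally before sending a request. The sketch below checks an ID against the list above and suggests the closest match; `KNOWN_PROFILES` and `check_profile` are hypothetical names, and the list is only as current as this guide.

```python
import difflib

KNOWN_PROFILES = (
    "wearable-class-device",
    "handset-class-device",
    "headphone-class-device",
    "small-bluetooth-speaker-class-device",
    "medium-bluetooth-speaker-class-device",
    "large-home-entertainment-class-device",
    "large-automotive-class-device",
    "telephony-class-application",
)


def check_profile(profile_id):
    """Return profile_id if known, else raise ValueError with a close-match hint."""
    if profile_id in KNOWN_PROFILES:
        return profile_id
    hint = difflib.get_close_matches(profile_id, KNOWN_PROFILES, n=1)
    suffix = f"; did you mean {hint[0]!r}?" if hint else ""
    raise ValueError(f"unknown effects profile {profile_id!r}{suffix}")
```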

Pricing

Google Cloud TTS uses character-based pricing:

Voice Type  Price per 1M characters
Standard    $4.00
WaveNet     $16.00
Neural2     $16.00

Free Tier

  • Standard voices: 4 million characters/month
  • WaveNet/Neural2: 1 million characters/month

Best Practices

Optimize for Latency

  1. Use Standard voices for real-time applications
  2. Enable streaming for faster time-to-first-audio
  3. Keep each request's input under the API limit of 5,000 bytes (about 5,000 characters of plain English text)
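
Splitting long text before synthesis can be sketched as follows. The function breaks on sentence boundaries first and falls back to word boundaries, measuring size in UTF-8 bytes rather than characters. `chunk_text` is a hypothetical helper; a single word longer than the limit is left as its own oversized chunk, and whitespace inside an over-long sentence is normalized.

```python
import re


def chunk_text(text, max_bytes=5000):
    """Split text into chunks whose UTF-8 size stays within max_bytes."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate.encode("utf-8")) <= max_bytes:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # Rebuild the sentence word by word so an over-long sentence is split too.
        current = ""
        for word in sentence.split():
            candidate = f"{current} {word}" if current else word
            if len(candidate.encode("utf-8")) <= max_bytes:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = word
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized in its own request and the audio concatenated in order.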

Improve Quality

  1. Use Neural2 or WaveNet voices
  2. Add SSML for natural pacing
  3. Use appropriate device profiles

Cost Optimization

  1. Use Standard voices when quality is less critical
  2. Cache frequently generated audio
  3. Stay within free tier for development
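
Caching can be as simple as keying audio files by a hash of the text and voice. The sketch below is one possible shape, not a prescribed design: `cache_key` and `cached_synthesize` are hypothetical helpers, and `synthesize` stands for any function you supply that takes `(text, voice_name)` and returns audio bytes, for example a thin wrapper around `client.synthesize_speech`.

```python
import hashlib
from pathlib import Path


def cache_key(text, voice_name, encoding="MP3"):
    """Build a stable key from everything that changes the audio."""
    payload = "\x1f".join((voice_name, encoding, text)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def cached_synthesize(text, voice_name, synthesize, cache_dir="tts_cache", encoding="MP3"):
    """Return audio bytes, calling synthesize(text, voice_name) only on a cache miss."""
    directory = Path(cache_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{cache_key(text, voice_name, encoding)}.{encoding.lower()}"
    if path.exists():
        return path.read_bytes()
    audio = synthesize(text, voice_name)
    path.write_bytes(audio)
    return audio
```

If you also vary speaking rate, pitch, or device profile, those values would need to be part of the key as well.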

Language Support

Google Cloud TTS supports 40+ languages. Popular options:

Language      Code    Voices
English (US)  en-US   20+
English (UK)  en-GB   10+
Spanish       es-ES   10+
French        fr-FR   10+
German        de-DE   10+
Japanese      ja-JP   5+
Mandarin      cmn-CN  5+

List Available Voices

voices = client.list_voices(language_code="en-US")
 
for voice in voices.voices:
    print(f"{voice.name} - {voice.ssml_gender}")
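
The voice list gets long, so grouping names by tier can make it easier to scan. This sketch assumes names follow the `<language>-<REGION>-<Tier>-<Variant>` pattern used by the examples in this guide (such as `en-US-Neural2-D`); names that do not fit are grouped under "Unknown". `voice_tier` and `group_by_tier` are hypothetical helpers.

```python
from collections import defaultdict


def voice_tier(voice_name):
    """Extract the tier from a name such as 'en-US-Neural2-D'."""
    parts = voice_name.split("-")
    # Expected shape: <language>-<REGION>-<Tier>-<Variant>
    return parts[2] if len(parts) >= 4 else "Unknown"


def group_by_tier(voice_names):
    """Map each tier to the list of voice names that belong to it."""
    groups = defaultdict(list)
    for name in voice_names:
        groups[voice_tier(name)].append(name)
    return dict(groups)
```

With the listing above, it could be called as `group_by_tier(v.name for v in voices.voices)`.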

Comparison with Alternatives

Feature        Google Cloud   OpenAI TTS  ElevenLabs
Voice Quality  Very Good      Very Good   Excellent
SSML Support   Excellent      None        Limited
Voice Cloning  Custom Voice*  No          Yes
Free Tier      Generous       No          Limited
Latency        ~200ms         ~400ms      ~300ms

*Custom Voice requires additional setup and cost.

For a detailed comparison, see our Voice AI API Comparison Guide.

When to Choose Google Cloud TTS

Google Cloud TTS is ideal when:

  • You need fine-grained SSML control
  • You're already using Google Cloud
  • Free tier is important for your project
  • You need good multilingual support

Consider alternatives if:

Conclusion

Google Cloud Text-to-Speech offers an excellent balance of quality, features, and pricing. The combination of WaveNet/Neural2 voices, comprehensive SSML support, and a generous free tier makes it a strong choice for many applications.


This article is part of our Voice AI API Comparison series. Explore guides for ElevenLabs, Amazon Polly, and more.
