Amazon Polly: Complete Developer Guide
Learn how to use Amazon Polly for text-to-speech with neural voices. Includes code examples, SSML usage, pricing, and AWS integration tips.
Amazon Polly is AWS's text-to-speech service, known for its low latency, extensive language support, and deep integration with the AWS ecosystem. If you're building on AWS, Polly is a natural choice.
This guide covers everything you need to integrate Amazon Polly into your application.
Why Amazon Polly?
Amazon Polly stands out for several reasons:
- Low Latency: Fastest response times among major providers (~100ms)
- Neural Voices: High-quality neural TTS with natural intonation
- Speaking Styles: Newscaster and conversational styles
- AWS Integration: Seamless with Lambda, S3, and other services
- SSML Support: Full control over speech synthesis
Getting Started
Setup
- Create an AWS account
- Create IAM credentials with Polly access
- Configure AWS CLI or SDK
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_DEFAULT_REGION="us-east-1"  # the region variable that boto3 and the AWS CLI both read

Basic Text-to-Speech
import boto3

polly = boto3.client('polly')
response = polly.synthesize_speech(
    Text="Hello, this is Amazon Polly.",
    OutputFormat="mp3",
    VoiceId="Joanna",
    Engine="neural"
)
with open("output.mp3", "wb") as f:
    f.write(response['AudioStream'].read())

Streaming Synthesis
For real-time applications:
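import boto3

polly = boto3.client('polly')
response = polly.synthesize_speech(
    Text="This audio streams directly from Polly.",
    OutputFormat="pcm",
    VoiceId="Joanna",
    Engine="neural"
)
# Stream to audio player: read the body in chunks instead of all at once
audio_stream = response['AudioStream']
for chunk in audio_stream.iter_chunks(chunk_size=4096):
    ...  # hand each chunk of raw PCM to your audio player here

Voice Types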
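This guide covers two of Amazon Polly's engine types, standard and neural (the newer long-form and generative engines are outside its scope):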
| Engine | Quality | Latency | Price |
|---|---|---|---|
| Standard | Good | ~50ms | $4/1M chars |
| Neural | Very Good | ~100ms | $16/1M chars |
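Not every voice supports both engines, and requesting an engine a voice does not support fails. The sketch below shows one way to fall back gracefully. It assumes voice records shaped like the entries returned by `describe_voices` (with a `SupportedEngines` list); `pick_engine` is a hypothetical helper, not part of boto3, and the sample records are hand-written for illustration.

```python
def pick_engine(voice, preferred="neural", fallback="standard"):
    """Return the preferred engine if the voice supports it, else the fallback.

    `voice` is assumed to be a dict shaped like one entry of
    describe_voices()['Voices'], e.g. {"Id": "...", "SupportedEngines": [...]}.
    """
    engines = voice.get("SupportedEngines", [])
    if preferred in engines:
        return preferred
    if fallback in engines:
        return fallback
    raise ValueError(
        f"Voice {voice.get('Id')!r} supports neither {preferred} nor {fallback}"
    )


# Hand-written sample records for illustration only (not live API output).
sample_both = {"Id": "VoiceA", "SupportedEngines": ["neural", "standard"]}
sample_standard_only = {"Id": "VoiceB", "SupportedEngines": ["standard"]}

print(pick_engine(sample_both))
print(pick_engine(sample_standard_only))
```

The returned value can be passed straight to the `Engine` parameter of `synthesize_speech`.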
Standard Voices
Fast and cost-effective:
response = polly.synthesize_speech(
    Text="Using standard voice.",
    OutputFormat="mp3",
    VoiceId="Joanna",
    Engine="standard"
)

Neural Voices
Higher quality with natural intonation:
response = polly.synthesize_speech(
    Text="Using neural voice.",
    OutputFormat="mp3",
    VoiceId="Joanna",
    Engine="neural"
)

Speaking Styles
Polly offers unique speaking styles for certain voices:
Newscaster Style
Perfect for news, podcasts, and professional content:
ssml = """
<speak>
    <amazon:domain name="news">
        Today's top story: AI voice technology continues to advance rapidly.
    </amazon:domain>
</speak>
"""
response = polly.synthesize_speech(
    TextType="ssml",
    Text=ssml,
    OutputFormat="mp3",
    VoiceId="Matthew",
    Engine="neural"
)

Conversational Style
Natural dialogue for assistants:
ssml = """
<speak>
    <amazon:domain name="conversational">
        Hey there! How can I help you today?
    </amazon:domain>
</speak>
"""

Supported voices for styles:
- Newscaster: Matthew, Joanna, Lupe (US), Amy (UK)
- Conversational: Matthew, Joanna
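Because styles are limited to specific voices, it can help to validate the pairing before building the SSML. The following is a minimal sketch, not an official API: `styled_ssml` and `STYLE_VOICES` are hypothetical names, and the voice sets are simply copied from the list above, so check them against the current AWS documentation before relying on them.

```python
from xml.sax.saxutils import escape

# Voice lists copied from this guide; treat them as an assumption and
# verify against the current AWS documentation.
STYLE_VOICES = {
    "news": {"Matthew", "Joanna", "Lupe", "Amy"},
    "conversational": {"Matthew", "Joanna"},
}


def styled_ssml(text, style, voice_id):
    """Wrap plain text in an <amazon:domain> tag after checking the voice."""
    if voice_id not in STYLE_VOICES.get(style, set()):
        raise ValueError(f"{voice_id} is not listed for the {style!r} style")
    # Escape &, < and > so the text cannot break the SSML document.
    body = escape(text)
    return f'<speak><amazon:domain name="{style}">{body}</amazon:domain></speak>'


print(styled_ssml("Markets rose 2% & bonds fell.", "news", "Amy"))
```

The result is meant to be passed as `Text` with `TextType="ssml"`, as in the examples above.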
SSML Support
Amazon Polly has comprehensive SSML support.
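Support does vary by engine: AWS documents a few tags, such as `<emphasis>`, the whispered effect, and the `pitch` attribute of `<prosody>`, as unavailable for neural voices. The sketch below is one way to catch these before sending a request. The tag list is an assumption based on AWS's published tag-support table at the time of writing, and `neural_incompatibilities` is a hypothetical helper that uses simple pattern matching rather than a full SSML parser.

```python
import re

# Assumption: features the neural engine does not accept, taken from AWS's
# SSML tag-support table at the time of writing. Verify before relying on it.
NEURAL_UNSUPPORTED = {
    "emphasis": re.compile(r"<emphasis\b"),
    "whispered effect": re.compile(r'<amazon:effect\s+name="whispered"'),
    "prosody pitch": re.compile(r"<prosody\b[^>]*\bpitch="),
}


def neural_incompatibilities(ssml):
    """Return the names of features in `ssml` assumed to be neural-incompatible."""
    return [
        name
        for name, pattern in NEURAL_UNSUPPORTED.items()
        if pattern.search(ssml)
    ]


ssml = '<speak>Welcome to <emphasis level="strong">Amazon Polly</emphasis>.</speak>'
issues = neural_incompatibilities(ssml)
# Fall back to the standard engine when any flagged feature is present.
engine = "standard" if issues else "neural"
print(issues, engine)
```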
Basic SSML
ssml = """
<speak>
    Hello! <break time="500ms"/>
    Welcome to <emphasis level="strong">Amazon Polly</emphasis>.
</speak>
"""
response = polly.synthesize_speech(
    TextType="ssml",
    Text=ssml,
    OutputFormat="mp3",
    VoiceId="Joanna",
    Engine="standard"  # <emphasis> is not supported by neural voices
)

Pronunciation Control
ssml = """
<speak>
    You say <phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme>,
    I say <phoneme alphabet="ipa" ph="təˈmɑːtəʊ">tomato</phoneme>.
</speak>
"""

Prosody Control
# Note: neural voices do not support the pitch attribute; use a standard voice for pitch changes
ssml = """
<speak>
    <prosody rate="slow" pitch="-10%">
        Speaking slowly with a lower pitch.
    </prosody>
    <prosody rate="fast" pitch="+10%">
        Speaking quickly with a higher pitch.
    </prosody>
</speak>
"""

Whispered Speech
# Note: the whispered effect is available for standard voices, not neural ones
ssml = """
<speak>
    <amazon:effect name="whispered">
        This is a secret message.
    </amazon:effect>
</speak>
"""

Long-Form Synthesis
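For content longer than the synchronous `synthesize_speech` limit (3,000 billed characters per request at the time of writing), use asynchronous synthesis, which writes the finished audio to an S3 bucket: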
import boto3
import time

polly = boto3.client('polly')

# Start async task; Polly writes the result to the given S3 bucket
response = polly.start_speech_synthesis_task(
    Text="Very long text content here...",
    OutputFormat="mp3",
    VoiceId="Joanna",
    Engine="neural",
    OutputS3BucketName="your-bucket",
    OutputS3KeyPrefix="audio/"
)
task_id = response['SynthesisTask']['TaskId']
# Poll for completion
while True:
    task = polly.get_speech_synthesis_task(TaskId=task_id)
    status = task['SynthesisTask']['TaskStatus']
    if status == 'completed':
        output_uri = task['SynthesisTask']['OutputUri']
        print(f"Audio available at: {output_uri}")
        break
    elif status == 'failed':
        reason = task['SynthesisTask'].get('TaskStatusReason', 'unknown reason')
        print(f"Task failed: {reason}")
        break
    time.sleep(5)

Speech Marks
Track word timing in generated audio:
import json

response = polly.synthesize_speech(
    Text="Hello, this is Amazon Polly speaking.",
    OutputFormat="json",
    VoiceId="Joanna",
    Engine="neural",
    SpeechMarkTypes=["word", "sentence"]
)
# Parse speech marks: the stream holds one JSON object per line
lines = response['AudioStream'].read().decode().splitlines()
marks = [json.loads(line) for line in lines if line]
for mark in marks:
    print(f"{mark['type']}: '{mark.get('value', '')}' at {mark['time']}ms")

Pricing
Amazon Polly uses character-based pricing:
| Engine | Price per 1M characters |
|---|---|
| Standard | $4.00 |
| Neural | $16.00 |
Free Tier
- 5 million characters/month (Standard)
- 1 million characters/month (Neural)
- Valid for 12 months from signup
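To make the pricing concrete, here is a small cost estimate. It is a sketch that uses only the rates and free-tier allowances quoted above, which may change, so confirm them on the AWS pricing page. `monthly_cost` is a hypothetical helper.

```python
# Rates (USD per 1M characters) and free-tier allowances copied from this guide.
RATES = {"standard": 4.00, "neural": 16.00}
FREE_TIER = {"standard": 5_000_000, "neural": 1_000_000}


def monthly_cost(characters, engine, free_tier=True):
    """Estimate one month's synthesis cost in USD for a single engine."""
    allowance = FREE_TIER[engine] if free_tier else 0
    billable = max(0, characters - allowance)
    return billable / 1_000_000 * RATES[engine]


print(monthly_cost(3_000_000, "neural"))         # 32.0 (2M billable characters)
print(monthly_cost(3_000_000, "neural", False))  # 48.0 (free tier expired)
print(monthly_cost(3_000_000, "standard"))       # 0.0 (within the free tier)
```

At the same volume, the neural engine costs four times as much as standard once the free tier no longer applies.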
AWS Integration Examples
Lambda Function
import boto3
import base64

def lambda_handler(event, context):
    polly = boto3.client('polly')
    text = event.get('text', 'Hello from Lambda!')
    response = polly.synthesize_speech(
        Text=text,
        OutputFormat="mp3",
        VoiceId="Joanna",
        Engine="neural"
    )
    audio_base64 = base64.b64encode(
        response['AudioStream'].read()
    ).decode('utf-8')
    return {
        'statusCode': 200,
        'headers': {'Content-Type': 'audio/mpeg'},
        'body': audio_base64,
        'isBase64Encoded': True
    }

S3 + CloudFront
Store and serve generated audio:
import boto3
import hashlib
from botocore.exceptions import ClientError

polly = boto3.client('polly')
s3 = boto3.client('s3')

def get_or_create_audio(text, voice_id="Joanna"):
    # Generate cache key (MD5 serves only as a fingerprint here, not for security)
    cache_key = hashlib.md5(f"{text}{voice_id}".encode()).hexdigest()
    s3_key = f"audio/{cache_key}.mp3"
    # Check if exists
    try:
        s3.head_object(Bucket="your-bucket", Key=s3_key)
        return f"https://your-cdn.cloudfront.net/{s3_key}"
    except ClientError as err:
        # A 404 means the audio is not cached yet; re-raise anything else
        if err.response['Error']['Code'] != '404':
            raise
    # Generate new audio
    response = polly.synthesize_speech(
        Text=text,
        OutputFormat="mp3",
        VoiceId=voice_id,
        Engine="neural"
    )
    # Upload to S3
    s3.put_object(
        Bucket="your-bucket",
        Key=s3_key,
        Body=response['AudioStream'].read(),
        ContentType="audio/mpeg"
    )
    return f"https://your-cdn.cloudfront.net/{s3_key}"

Language Support
Amazon Polly supports 30+ languages:
| Language | Voices | Neural Support |
|---|---|---|
| English (US) | 8+ | Yes |
| English (UK) | 4+ | Yes |
| Spanish | 6+ | Yes |
| French | 4+ | Yes |
| German | 4+ | Yes |
| Japanese | 2+ | Yes |
| Portuguese | 4+ | Yes |
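Voice counts change as AWS adds voices, so a table like this is best regenerated from live data. The sketch below shows one way to summarize voice records by language. It assumes records shaped like `describe_voices` entries (with `LanguageName` and `SupportedEngines` fields); `summarize_voices` is a hypothetical helper, and the sample records are hand-written for illustration rather than real API output.

```python
from collections import defaultdict


def summarize_voices(voices):
    """Group voice records by language: voice count and neural availability."""
    summary = defaultdict(lambda: {"voices": 0, "neural": False})
    for voice in voices:
        row = summary[voice["LanguageName"]]
        row["voices"] += 1
        if "neural" in voice.get("SupportedEngines", []):
            row["neural"] = True
    return dict(summary)


# Hand-written sample records for illustration only (not live API output).
sample = [
    {"Id": "VoiceA", "LanguageName": "US English", "SupportedEngines": ["neural", "standard"]},
    {"Id": "VoiceB", "LanguageName": "US English", "SupportedEngines": ["standard"]},
    {"Id": "VoiceC", "LanguageName": "German", "SupportedEngines": ["standard"]},
]
for language, row in summarize_voices(sample).items():
    print(f"{language}: {row['voices']} voice(s), neural={row['neural']}")
```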
List Available Voices
response = polly.describe_voices(LanguageCode="en-US")
for voice in response['Voices']:
    engines = voice.get('SupportedEngines', [])
    print(f"{voice['Id']} - {voice['Gender']} - Engines: {engines}")

Comparison with Alternatives
| Feature | Amazon Polly | Google Cloud | OpenAI TTS |
|---|---|---|---|
| Voice Quality | Good | Very Good | Very Good |
| Latency | ~100ms | ~200ms | ~400ms |
| Speaking Styles | Yes | No | No |
| SSML Support | Full | Full | None |
| Free Tier | Generous | Generous | None |
For a detailed comparison, see our Voice AI API Comparison Guide.
When to Choose Amazon Polly
Amazon Polly is ideal when:
- Low latency is your priority
- You're building on AWS
- You need speaking styles (newscaster, conversational)
- Cost efficiency at scale matters
Consider alternatives if:
- Voice quality is the top priority (ElevenLabs)
- You need maximum language coverage (Azure Speech)
- You want the simplest integration (OpenAI TTS)
Conclusion
Amazon Polly offers the lowest latency among major TTS providers, with strong AWS integration and unique speaking styles. While voice quality trails ElevenLabs, it's excellent for real-time applications and AWS-native projects.
This article is part of our Voice AI API Comparison series. Explore guides for Google Cloud TTS, Azure Speech, and more.