Technology · 6 min read

Amazon Polly: Complete Developer Guide

Learn how to use Amazon Polly for text-to-speech with neural voices. Includes code examples, SSML usage, pricing, and AWS integration tips.

Tags: amazon polly, aws text to speech, polly tts api, aws polly neural, amazon voice api

Amazon Polly is AWS's text-to-speech service, known for its low latency, extensive language support, and deep integration with the AWS ecosystem. If you're building on AWS, Polly is a natural choice.

This guide covers everything you need to integrate Amazon Polly into your application.

Why Amazon Polly?

Amazon Polly stands out for several reasons:

  • Low Latency: Among the fastest response times of the major providers (roughly 100ms in our comparison)
  • Neural Voices: High-quality neural TTS with natural intonation
  • Speaking Styles: Newscaster and conversational styles
  • AWS Integration: Seamless with Lambda, S3, and other services
  • SSML Support: Full control over speech synthesis

Getting Started

Setup

  1. Create an AWS account
  2. Create IAM credentials with Polly access
  3. Configure AWS CLI or SDK
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_REGION="us-east-1"

Basic Text-to-Speech

import boto3
 
polly = boto3.client('polly')
 
response = polly.synthesize_speech(
    Text="Hello, this is Amazon Polly.",
    OutputFormat="mp3",
    VoiceId="Joanna",
    Engine="neural"
)
 
with open("output.mp3", "wb") as f:
    f.write(response['AudioStream'].read())

Streaming Synthesis

For real-time applications:

import boto3
 
polly = boto3.client('polly')
 
response = polly.synthesize_speech(
    Text="This audio streams directly from Polly.",
    OutputFormat="pcm",
    VoiceId="Joanna",
    Engine="neural"
)
 
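# AudioStream is a botocore StreamingBody; read it in chunks as bytes arrive
audio_stream = response['AudioStream']
for chunk in audio_stream.iter_chunks(chunk_size=4096):
    pass  # hand each chunk of raw PCM (16-bit, mono) to your audio player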
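Polly's PCM output is headerless 16-bit signed mono audio (16 kHz by default), so most players need it wrapped in a container first. The sketch below shows one way to read a stream in chunks and wrap the result as WAV using only the standard library. The helper names iter_chunks and pcm_to_wav are hypothetical, and an in-memory buffer of silence stands in for response['AudioStream'].

```python
import io
import wave

def iter_chunks(stream, chunk_size=4096):
    """Yield successive byte chunks from a file-like object until it is exhausted."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk

def pcm_to_wav(pcm_bytes, sample_rate=16000):
    """Wrap raw 16-bit mono PCM in a WAV container and return the WAV bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)          # mono
        wav.setsampwidth(2)          # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)
    return buf.getvalue()

# Stand-in for response['AudioStream']: 0.1 s of silence at 16 kHz (assumed data)
fake_stream = io.BytesIO(b"\x00\x00" * 1600)
pcm = b"".join(iter_chunks(fake_stream, chunk_size=1024))
wav_bytes = pcm_to_wav(pcm)
```

In a real-time player you would pass each chunk to the audio device as it arrives instead of joining them first.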

Voice Types

This guide focuses on two of Polly's engine types, standard and neural (AWS also offers long-form and generative engines):

Engine | Quality | Latency | Price
Standard | Good | ~50ms | $4/1M chars
Neural | Very Good | ~100ms | $16/1M chars

Standard Voices

Fast and cost-effective:

response = polly.synthesize_speech(
    Text="Using standard voice.",
    OutputFormat="mp3",
    VoiceId="Joanna",
    Engine="standard"
)

Neural Voices

Higher quality with natural intonation:

response = polly.synthesize_speech(
    Text="Using neural voice.",
    OutputFormat="mp3",
    VoiceId="Joanna",
    Engine="neural"
)

Speaking Styles

Polly offers unique speaking styles for certain voices:

Newscaster Style

Perfect for news, podcasts, and professional content:

ssml = """
<speak>
    <amazon:domain name="news">
        Today's top story: AI voice technology continues to advance rapidly.
    </amazon:domain>
</speak>
"""
 
response = polly.synthesize_speech(
    TextType="ssml",
    Text=ssml,
    OutputFormat="mp3",
    VoiceId="Matthew",
    Engine="neural"
)

Conversational Style

Natural dialogue for assistants:

ssml = """
<speak>
    <amazon:domain name="conversational">
        Hey there! How can I help you today?
    </amazon:domain>
</speak>
"""

Supported voices for styles:

  • Newscaster: Matthew, Joanna (US English), Lupe (US Spanish), Amy (UK English)
  • Conversational: Matthew, Joanna
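Because SSML is XML, reserved characters such as & and < in user-supplied text need escaping before the text is wrapped in a style tag. A minimal sketch of that step follows. The helper name build_domain_ssml is hypothetical and not part of boto3.

```python
from xml.sax.saxutils import escape

def build_domain_ssml(text, domain="news"):
    """Wrap plain text in an <amazon:domain> tag, escaping XML-reserved characters."""
    return (
        f'<speak><amazon:domain name="{domain}">'
        f"{escape(text)}"
        "</amazon:domain></speak>"
    )

ssml = build_domain_ssml("Markets & tech: shares rose <1% today.")
print(ssml)
```

The resulting string can be passed as Text with TextType="ssml", as in the newscaster example above.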

SSML Support

Amazon Polly has comprehensive SSML support, though neural voices accept only a subset of the tags that standard voices do.

Basic SSML

ssml = """
<speak>
    Hello! <break time="500ms"/>
    Welcome to <emphasis level="strong">Amazon Polly</emphasis>.
</speak>
"""
 
response = polly.synthesize_speech(
    TextType="ssml",
    Text=ssml,
    OutputFormat="mp3",
    VoiceId="Joanna",
    Engine="standard"  # <emphasis> is not available with neural voices
)

Pronunciation Control

ssml = """
<speak>
    You say <phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme>,
    I say <phoneme alphabet="ipa" ph="təˈmɑːtəʊ">tomato</phoneme>.
</speak>
"""

Prosody Control

# Note: neural voices support rate and volume but not pitch; use a standard voice for pitch
ssml = """
<speak>
    <prosody rate="slow" pitch="-10%">
        Speaking slowly with a lower pitch.
    </prosody>
    <prosody rate="fast" pitch="+10%">
        Speaking quickly with a higher pitch.
    </prosody>
</speak>
"""

Whispered Speech

# The whispered effect is available for standard voices only
ssml = """
<speak>
    <amazon:effect name="whispered">
        This is a secret message.
    </amazon:effect>
</speak>
"""

Long-Form Synthesis

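For content longer than the synthesize_speech limit (3,000 billed characters per request), use asynchronous synthesis, which accepts up to 100,000 billed characters and writes the audio to S3: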

import boto3
import time
 
polly = boto3.client('polly')
s3 = boto3.client('s3')
 
# Start async task
response = polly.start_speech_synthesis_task(
    Text="Very long text content here...",
    OutputFormat="mp3",
    VoiceId="Joanna",
    Engine="neural",
    OutputS3BucketName="your-bucket",
    OutputS3KeyPrefix="audio/"
)
 
task_id = response['SynthesisTask']['TaskId']
 
# Poll for completion
while True:
    task = polly.get_speech_synthesis_task(TaskId=task_id)
    status = task['SynthesisTask']['TaskStatus']
 
    if status == 'completed':
        output_uri = task['SynthesisTask']['OutputUri']
        print(f"Audio available at: {output_uri}")
        break
    elif status == 'failed':
        reason = task['SynthesisTask'].get('TaskStatusReason', 'unknown')
        print(f"Task failed: {reason}")
        break
 
    time.sleep(5)
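If you would rather stay on the synchronous API, another option is to split long text into pieces that each fit within the per-request limit and synthesize them one by one. The sketch below shows one possible splitter that breaks at sentence boundaries. The function name split_text is hypothetical, and using len() is an approximation, since Polly counts billed characters rather than raw string length.

```python
import re

def split_text(text, max_chars=3000):
    """Split text into chunks of at most max_chars, breaking at sentence ends where possible."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that is itself longer than the limit
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

print(split_text("One. Two. Three.", max_chars=10))  # ['One. Two.', 'Three.']
```

Each chunk can then go through synthesize_speech, with the resulting audio segments joined afterwards.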

Speech Marks

Track word timing in generated audio:

response = polly.synthesize_speech(
    Text="Hello, this is Amazon Polly speaking.",
    OutputFormat="json",
    VoiceId="Joanna",
    Engine="neural",
    SpeechMarkTypes=["word", "sentence"]
)
 
# Parse speech marks: the stream holds one JSON object per line
import json
raw = response['AudioStream'].read().decode()
marks = [json.loads(line) for line in raw.splitlines() if line]
 
for mark in marks:
    print(f"{mark['type']}: '{mark.get('value', '')}' at {mark['time']}ms")
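A common use for word marks is highlighting text in sync with playback, which needs a start and end time for each word. The sketch below derives those intervals by treating the next word's start as the current word's end. The helper names and the sample payload are made up for illustration, and the payload only mimics the line-delimited JSON shape of a speech marks response.

```python
import json

def parse_speech_marks(raw):
    """Parse line-delimited JSON speech marks (bytes) into a list of dicts."""
    return [json.loads(line) for line in raw.decode("utf-8").splitlines() if line.strip()]

def word_intervals(marks, total_ms=None):
    """Return (word, start_ms, end_ms) tuples; the last word ends at total_ms."""
    words = [m for m in marks if m["type"] == "word"]
    intervals = []
    for i, mark in enumerate(words):
        end = words[i + 1]["time"] if i + 1 < len(words) else total_ms
        intervals.append((mark["value"], mark["time"], end))
    return intervals

# Illustrative payload, not real Polly output
SAMPLE = (
    b'{"time":0,"type":"sentence","start":0,"end":13,"value":"Hello, world."}\n'
    b'{"time":6,"type":"word","start":0,"end":5,"value":"Hello"}\n'
    b'{"time":410,"type":"word","start":7,"end":12,"value":"world"}\n'
)

marks = parse_speech_marks(SAMPLE)
print(word_intervals(marks, total_ms=900))
```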

Pricing

Amazon Polly uses character-based pricing:

Engine | Price per 1M characters
Standard | $4.00
Neural | $16.00

Free Tier

  • 5 million characters/month (Standard)
  • 1 million characters/month (Neural)
  • Valid for 12 months from signup
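To turn these numbers into a rough monthly estimate, the sketch below multiplies billable characters by the per-million rate and optionally subtracts the free-tier allowance. The function name estimate_monthly_cost is hypothetical, and the rates are the ones quoted in this guide, so check current AWS pricing before relying on the result.

```python
# Rates and allowances as quoted in this guide (USD per 1M characters)
PRICE_PER_MILLION = {"standard": 4.00, "neural": 16.00}
FREE_TIER_CHARS = {"standard": 5_000_000, "neural": 1_000_000}

def estimate_monthly_cost(chars, engine="neural", free_tier=False):
    """Estimate the monthly cost in USD for a given number of characters."""
    billable = chars
    if free_tier:
        billable = max(0, chars - FREE_TIER_CHARS[engine])
    return billable / 1_000_000 * PRICE_PER_MILLION[engine]

print(estimate_monthly_cost(2_000_000, "neural"))                  # 32.0
print(estimate_monthly_cost(2_000_000, "neural", free_tier=True))  # 16.0
```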

AWS Integration Examples

Lambda Function

import boto3
import base64
 
def lambda_handler(event, context):
    polly = boto3.client('polly')
 
    text = event.get('text', 'Hello from Lambda!')
 
    response = polly.synthesize_speech(
        Text=text,
        OutputFormat="mp3",
        VoiceId="Joanna",
        Engine="neural"
    )
 
    audio_base64 = base64.b64encode(
        response['AudioStream'].read()
    ).decode('utf-8')
 
    return {
        'statusCode': 200,
        'headers': {'Content-Type': 'audio/mpeg'},
        'body': audio_base64,
        'isBase64Encoded': True
    }
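The handler above reads text from the top level of the event, which suits direct invocation. Behind an API Gateway proxy integration the request payload arrives as a JSON string in event['body'] instead. The sketch below shows one way to accept both shapes. The helper name extract_text is hypothetical.

```python
import base64
import json

def extract_text(event, default="Hello from Lambda!"):
    """Return the text to synthesize from a direct or API Gateway proxy event."""
    if "text" in event:            # direct invocation: {"text": "..."}
        return event["text"]
    body = event.get("body")       # proxy integration: body is a JSON string
    if not body:
        return default
    try:
        if event.get("isBase64Encoded"):
            body = base64.b64decode(body).decode("utf-8")
        return json.loads(body).get("text", default)
    except (ValueError, AttributeError):
        # Malformed base64/JSON, or JSON that is not an object
        return default

print(extract_text({"body": '{"text": "from API Gateway"}'}))  # from API Gateway
```

Inside lambda_handler, text = extract_text(event) would replace the event.get call.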

S3 + CloudFront

Store and serve generated audio:

import boto3
import hashlib
from botocore.exceptions import ClientError
 
polly = boto3.client('polly')
s3 = boto3.client('s3')
 
def get_or_create_audio(text, voice_id="Joanna"):
    # Generate cache key
    cache_key = hashlib.md5(f"{text}{voice_id}".encode()).hexdigest()
    s3_key = f"audio/{cache_key}.mp3"
 
    # Check if exists
    try:
        s3.head_object(Bucket="your-bucket", Key=s3_key)
        return f"https://your-cdn.cloudfront.net/{s3_key}"
    except ClientError as err:
        # A 404 means the audio is not cached yet; re-raise anything else
        if err.response["Error"]["Code"] != "404":
            raise
 
    # Generate new audio
    response = polly.synthesize_speech(
        Text=text,
        OutputFormat="mp3",
        VoiceId=voice_id,
        Engine="neural"
    )
 
    # Upload to S3
    s3.put_object(
        Bucket="your-bucket",
        Key=s3_key,
        Body=response['AudioStream'].read(),
        ContentType="audio/mpeg"
    )
 
    return f"https://your-cdn.cloudfront.net/{s3_key}"
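The cache key in get_or_create_audio concatenates text and voice ID, so ("ab", "c") and ("a", "bc") hash to the same key, and the engine is not part of the key at all. The sketch below shows one possible alternative that serializes every synthesis parameter before hashing. The function name audio_cache_key and the key layout are assumptions, not something the original code requires.

```python
import hashlib
import json

def audio_cache_key(text, voice_id="Joanna", engine="neural", output_format="mp3"):
    """Build an S3 key from all parameters that affect the synthesized audio."""
    # json.dumps keeps field boundaries unambiguous, unlike plain concatenation
    payload = json.dumps([text, voice_id, engine, output_format], ensure_ascii=False)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"audio/{digest}.{output_format}"

key = audio_cache_key("Hello, this is Amazon Polly.")
print(key)
```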

Language Support

Amazon Polly supports 30+ languages:

Language | Voices | Neural Support
English (US) | 8+ | Yes
English (UK) | 4+ | Yes
Spanish | 6+ | Yes
French | 4+ | Yes
German | 4+ | Yes
Japanese | 2+ | Yes
Portuguese | 4+ | Yes

List Available Voices

response = polly.describe_voices(LanguageCode="en-US")
 
for voice in response['Voices']:
    engines = voice.get('SupportedEngines', [])
    print(f"{voice['Id']} - {voice['Gender']} - Engines: {engines}")
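Since not every voice supports every engine, it can help to filter the list before choosing a VoiceId. The sketch below filters on the SupportedEngines field. The helper name voices_for_engine is hypothetical, and the sample entries are placeholders shaped like the 'Voices' list, not real voice data.

```python
def voices_for_engine(voices, engine="neural"):
    """Return the sorted Ids of voices whose SupportedEngines include the engine."""
    return sorted(v["Id"] for v in voices if engine in v.get("SupportedEngines", []))

# Placeholder data shaped like response['Voices'] from describe_voices
sample_voices = [
    {"Id": "VoiceA", "Gender": "Female", "SupportedEngines": ["neural", "standard"]},
    {"Id": "VoiceB", "Gender": "Male", "SupportedEngines": ["standard"]},
    {"Id": "VoiceC", "Gender": "Male", "SupportedEngines": ["neural"]},
]

print(voices_for_engine(sample_voices, "neural"))  # ['VoiceA', 'VoiceC']
```

With a live client, pass response['Voices'] in place of sample_voices.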

Comparison with Alternatives

Feature | Amazon Polly | Google Cloud | OpenAI TTS
Voice Quality | Good | Very Good | Very Good
Latency | ~100ms | ~200ms | ~400ms
Speaking Styles | Yes | No | No
SSML Support | Full | Full | None
Free Tier | Generous | Generous | None

For a detailed comparison, see our Voice AI API Comparison Guide.

When to Choose Amazon Polly

Amazon Polly is ideal when:

  • Low latency is your priority
  • You're building on AWS
  • You need speaking styles (newscaster, conversational)
  • Cost efficiency at scale matters

Consider alternatives if:

  • Voice quality is your top priority (Polly's voices trail ElevenLabs)

Conclusion

Amazon Polly offers the lowest latency among major TTS providers, with strong AWS integration and unique speaking styles. While voice quality trails ElevenLabs, it's excellent for real-time applications and AWS-native projects.


This article is part of our Voice AI API Comparison series. Explore guides for Google Cloud TTS, Azure Speech, and more.
