Azure Speech Service: Complete Developer Guide
Azure Speech Service is Microsoft's comprehensive speech platform, offering some of the widest language coverage in the industry with 140+ languages and variants. That breadth makes it a go-to choice for global applications and enterprises.
This guide covers everything you need to integrate Azure Speech into your application.
Why Azure Speech?
Azure Speech offers compelling advantages:
- Language Coverage: 140+ languages and regional variants
- Custom Neural Voice: Train custom voices on your data
- Enterprise Ready: SOC 2, HIPAA, GDPR compliance
- SSML Support: Comprehensive speech control
- Avatar Support: Generate talking avatars (preview)
Getting Started
Setup
- Create an Azure account
- Create a Speech resource in Azure Portal
- Get your subscription key and region
```bash
export AZURE_SPEECH_KEY="your-subscription-key"
export AZURE_SPEECH_REGION="eastus"
```
Basic Text-to-Speech
```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription="your-key",
    region="eastus"
)
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async("Hello, this is Azure Speech.").get()

if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Speech synthesized successfully")
```
Save to File
```python
audio_config = speechsdk.audio.AudioOutputConfig(filename="output.wav")

synthesizer = speechsdk.SpeechSynthesizer(
    speech_config=speech_config,
    audio_config=audio_config
)
result = synthesizer.speak_text_async("Saving to file.").get()
```
Streaming Synthesis
For real-time applications:
```python
def stream_callback(evt):
    if evt.result.reason == speechsdk.ResultReason.SynthesizingAudio:
        audio_data = evt.result.audio_data
        # Process audio chunk
        print(f"Received {len(audio_data)} bytes")

synthesizer = speechsdk.SpeechSynthesizer(
    speech_config=speech_config,
    audio_config=None  # Disable default output
)
synthesizer.synthesizing.connect(stream_callback)
synthesizer.speak_text_async("This audio streams as it generates.").get()
```
Voice Types
Azure offers multiple voice tiers:
| Type | Quality | Languages | Use Case |
|---|---|---|---|
| Neural | Excellent | 140+ | Production apps |
| Standard | Good | Limited | Legacy support |
| Custom Neural | Excellent | Custom | Brand voices |
Neural Voices
High-quality voices with natural intonation:
```python
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"
```
Voice Styles
Many neural voices support different speaking styles:
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
<voice name="en-US-JennyNeural">
<mstts:express-as style="cheerful">
This is spoken in a cheerful style!
</mstts:express-as>
</voice>
</speak>
"""
result = synthesizer.speak_ssml_async(ssml).get()Available styles (voice-dependent):
- cheerful, sad, angry, fearful
- friendly, hopeful, excited
- newscast, customerservice
- narration-professional, documentary-narration
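Hand-writing SSML strings gets repetitive when you switch styles often. A small helper can wrap plain text in an express-as block; this is a sketch using only the standard library, where `build_style_ssml` is a hypothetical name and the element and attribute names follow the examples in this guide:

```python
from xml.sax.saxutils import escape

def build_style_ssml(text, voice="en-US-JennyNeural", style="cheerful"):
    """Wrap plain text in an SSML document with an mstts:express-as style."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">'
        f'<voice name="{voice}">'
        f'<mstts:express-as style="{style}">{escape(text)}</mstts:express-as>'
        '</voice></speak>'
    )

ssml = build_style_ssml("Great news, everyone!", style="excited")
```

The result can be passed to `speak_ssml_async` exactly like the hand-written examples; `escape` keeps user-supplied text from breaking the XML.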
Role Play
Some voices can adopt different personas:
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
<voice name="en-US-JennyNeural">
<mstts:express-as role="YoungAdultFemale">
Speaking as a young adult.
</mstts:express-as>
</voice>
</speak>
"""SSML Support
Azure has comprehensive SSML support with Microsoft extensions.
Basic SSML
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-JennyNeural">
Hello! <break time="500ms"/>
Welcome to <emphasis level="strong">Azure Speech</emphasis>.
</voice>
</speak>
"""
result = synthesizer.speak_ssml_async(ssml).get()Prosody Control
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-JennyNeural">
<prosody rate="-20%" pitch="-5%">
Speaking slowly with a lower pitch.
</prosody>
<prosody rate="+20%" pitch="+5%">
Speaking quickly with a higher pitch.
</prosody>
</voice>
</speak>
"""Audio Effects
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
<voice name="en-US-JennyNeural">
<mstts:audioduration value="5s"/>
This text will be spoken over exactly 5 seconds.
</voice>
</speak>
"""Background Audio
Mix speech with background music:
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
<voice name="en-US-JennyNeural">
<mstts:backgroundaudio src="https://example.com/music.mp3"
volume="0.3" fadein="1000" fadeout="1000">
This speech plays over background music.
</mstts:backgroundaudio>
</voice>
</speak>
"""Custom Neural Voice
Create unique brand voices with your own data.
Requirements
- 300+ utterances (professional recording recommended)
- Clean audio without background noise
- Consistent speaker and recording conditions
- Azure Speech Studio access
Process
- Prepare training data (audio + transcripts)
- Upload to Azure Speech Studio
- Train custom voice model
- Deploy and use via API
```python
# Using custom voice: set both the deployed endpoint ID and the voice name
speech_config.endpoint_id = "your-deployment-endpoint-id"
speech_config.speech_synthesis_voice_name = "your-custom-voice-name"

synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async("Speaking with custom voice.").get()
```
Pricing
Azure Speech uses tiered pricing:
Standard Tier
| Feature | Price |
|---|---|
| Neural TTS | $16/1M characters |
| Standard TTS | $4/1M characters |
| Custom Neural Voice | $24/1M characters |
Free Tier
- 500,000 characters/month (Neural)
- 5 million characters/month (Standard)
- 10,000 transactions for Custom Voice
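As a sanity check, the rates above translate into a simple cost estimator. This is a sketch: the per-character rates and free-tier allowances are taken from the tables in this article, so verify them against current Azure pricing before budgeting.

```python
# Rates in USD per 1M characters and free-tier allowances, as listed above
RATES_PER_MILLION = {"neural": 16.0, "standard": 4.0, "custom_neural": 24.0}
FREE_CHARS = {"neural": 500_000, "standard": 5_000_000}

def estimate_monthly_cost(chars, tier="neural"):
    """Estimate monthly TTS cost in USD after subtracting the free tier."""
    billable = max(0, chars - FREE_CHARS.get(tier, 0))
    return billable / 1_000_000 * RATES_PER_MILLION[tier]

# 2M neural characters: 1.5M billable at $16/1M
print(estimate_monthly_cost(2_000_000, "neural"))  # → 24.0
```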
Best Practices
Optimize for Latency
- Use regional endpoints close to users
- Enable streaming for real-time apps
- Keep SSML simple when latency matters
Improve Quality
- Use appropriate speaking styles
- Add SSML for natural pacing
- Test different voices for your content
Cost Optimization
- Cache frequently generated audio
- Use Standard voices for non-critical content
- Batch requests when possible
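The caching advice above can be as simple as keying audio bytes by a hash of the text and voice. A minimal file-based sketch follows; the `synthesize` parameter stands in for whatever function actually calls Azure, and all names here are illustrative:

```python
import hashlib
from pathlib import Path

def cached_tts(text, voice, synthesize, cache_dir="tts_cache"):
    """Return cached audio bytes if present; otherwise synthesize and store."""
    key = hashlib.sha256(f"{voice}:{text}".encode("utf-8")).hexdigest()
    path = Path(cache_dir) / f"{key}.audio"
    if path.exists():
        return path.read_bytes()
    audio = synthesize(text, voice)  # e.g. a wrapper around speak_text_async
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(audio)
    return audio
```

Repeated calls with the same text and voice then hit the disk cache instead of the paid API.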
Language Support
Azure leads with 140+ languages:
| Region | Languages |
|---|---|
| Americas | English, Spanish, Portuguese, French |
| Europe | German, French, Italian, Dutch, Polish, etc. |
| Asia | Chinese, Japanese, Korean, Hindi, etc. |
| Middle East | Arabic, Hebrew, Turkish |
| Africa | Afrikaans, Swahili, Amharic |
List Available Voices
```python
import requests

key = "your-subscription-key"
region = "eastus"

url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/voices/list"
headers = {"Ocp-Apim-Subscription-Key": key}

response = requests.get(url, headers=headers)
voices = response.json()

# Filter by language
english_voices = [v for v in voices if v['Locale'].startswith('en-')]
for voice in english_voices[:5]:
    print(f"{voice['ShortName']} - {voice['Gender']} - {voice['VoiceType']}")
```
Integration Examples
REST API
```python
import requests

key = "your-subscription-key"
region = "eastus"

url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
headers = {
    "Ocp-Apim-Subscription-Key": key,
    "Content-Type": "application/ssml+xml",
    "X-Microsoft-OutputFormat": "audio-16khz-128kbitrate-mono-mp3"
}
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-JennyNeural">Hello from the REST API!</voice>
</speak>
"""

response = requests.post(url, headers=headers, data=ssml.encode("utf-8"))
response.raise_for_status()
with open("output.mp3", "wb") as f:
    f.write(response.content)
```
JavaScript/Web
```javascript
const sdk = require('microsoft-cognitiveservices-speech-sdk');

const key = 'your-subscription-key';
const region = 'eastus';

const speechConfig = sdk.SpeechConfig.fromSubscription(key, region);
speechConfig.speechSynthesisVoiceName = 'en-US-JennyNeural';

const synthesizer = new sdk.SpeechSynthesizer(speechConfig);
synthesizer.speakTextAsync(
  'Hello from JavaScript!',
  result => {
    if (result.reason === sdk.ResultReason.SynthesizingAudioCompleted) {
      console.log('Synthesis completed');
    }
    synthesizer.close();
  },
  error => {
    console.error(error);
    synthesizer.close();
  }
);
```
Comparison with Alternatives
| Feature | Azure Speech | Google Cloud | Amazon Polly |
|---|---|---|---|
| Voice Quality | Very Good | Very Good | Good |
| Languages | 140+ | 40+ | 30+ |
| Custom Voice | Yes | Limited | No |
| Speaking Styles | Extensive | No | Limited |
| Latency | ~150ms | ~200ms | ~100ms |
For a detailed comparison, see our Voice AI API Comparison Guide.
When to Choose Azure Speech
Azure Speech is ideal when:
- You need maximum language coverage
- You want custom brand voices
- Enterprise compliance is required
- You need speaking styles and emotions
Consider alternatives if:
- Voice quality is the top priority (ElevenLabs)
- Lowest latency is critical (Amazon Polly)
- You want simpler integration (OpenAI TTS)
Conclusion
Azure Speech Service offers unmatched language coverage and enterprise features. The combination of 140+ languages, custom neural voice training, and comprehensive SSML support makes it the choice for global and enterprise applications.
This article is part of our Voice AI API Comparison series. Explore guides for Amazon Polly, OpenAI TTS, and more.