
Azure Speech Service: Complete Developer Guide

Learn how to use Azure Speech for text-to-speech with 140+ languages. Includes code examples, custom neural voice, SSML usage, and best practices.



Azure Speech Service is Microsoft's comprehensive speech platform, offering the widest language coverage in the industry with 140+ languages and variants. It's the go-to choice for global applications and enterprises.

This guide covers everything you need to integrate Azure Speech into your application.

Why Azure Speech?

Azure Speech offers compelling advantages:

  • Language Coverage: 140+ languages and regional variants
  • Custom Neural Voice: Train custom voices on your data
  • Enterprise Ready: SOC 2, HIPAA, GDPR compliance
  • SSML Support: Comprehensive speech control
  • Avatar Support: Generate talking avatars (preview)

Getting Started

Setup

  1. Create an Azure account
  2. Create a Speech resource in Azure Portal
  3. Get your subscription key and region

export AZURE_SPEECH_KEY="your-subscription-key"
export AZURE_SPEECH_REGION="eastus"
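
With the variables exported, a small helper can load them before building a `SpeechConfig` (a minimal sketch; the variable names match the export step above, and the `eastus` fallback is an assumption):

```python
import os

def load_speech_credentials():
    """Read the Azure Speech key and region from the environment."""
    key = os.environ.get("AZURE_SPEECH_KEY")
    region = os.environ.get("AZURE_SPEECH_REGION", "eastus")  # assumed default
    if not key:
        raise RuntimeError("Set AZURE_SPEECH_KEY before creating a SpeechConfig")
    return key, region
```

Reading credentials this way keeps keys out of source control once you move past the hard-coded placeholders used in the snippets below.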

Basic Text-to-Speech

import azure.cognitiveservices.speech as speechsdk
 
speech_config = speechsdk.SpeechConfig(
    subscription="your-key",
    region="eastus"
)
 
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"
 
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
 
result = synthesizer.speak_text_async("Hello, this is Azure Speech.").get()
 
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Speech synthesized successfully")
elif result.reason == speechsdk.ResultReason.Canceled:
    details = result.cancellation_details
    print(f"Synthesis canceled: {details.reason}, {details.error_details}")

Save to File

audio_config = speechsdk.audio.AudioOutputConfig(filename="output.wav")
 
synthesizer = speechsdk.SpeechSynthesizer(
    speech_config=speech_config,
    audio_config=audio_config
)
 
result = synthesizer.speak_text_async("Saving to file.").get()

Streaming Synthesis

For real-time applications:

def stream_callback(evt):
    if evt.result.reason == speechsdk.ResultReason.SynthesizingAudio:
        audio_data = evt.result.audio_data
        # Process audio chunk
        print(f"Received {len(audio_data)} bytes")
 
synthesizer = speechsdk.SpeechSynthesizer(
    speech_config=speech_config,
    audio_config=None  # Disable default output
)
 
synthesizer.synthesizing.connect(stream_callback)
synthesizer.speak_text_async("This audio streams as it generates.").get()
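
If you also need the complete audio after streaming finishes, the chunks delivered to the callback can be accumulated (plain Python, no SDK calls; `bytearray` keeps repeated appends cheap):

```python
class AudioCollector:
    """Accumulate streamed audio chunks into one contiguous buffer."""

    def __init__(self):
        self._buffer = bytearray()

    def add_chunk(self, chunk: bytes) -> None:
        self._buffer.extend(chunk)

    def to_bytes(self) -> bytes:
        return bytes(self._buffer)
```

Call `add_chunk(evt.result.audio_data)` inside `stream_callback`, then write `to_bytes()` to disk once synthesis completes.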

Voice Types

Azure offers multiple voice tiers:

| Type | Quality | Languages | Use Case |
| --- | --- | --- | --- |
| Neural | Excellent | 140+ | Production apps |
| Standard | Good | Limited | Legacy support |
| Custom Neural | Excellent | Custom | Brand voices |

Neural Voices

High-quality voices with natural intonation:

speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

Voice Styles

Many neural voices support different speaking styles:

ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
    xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="en-US-JennyNeural">
        <mstts:express-as style="cheerful">
            This is spoken in a cheerful style!
        </mstts:express-as>
    </voice>
</speak>
"""
 
result = synthesizer.speak_ssml_async(ssml).get()

Available styles (voice-dependent):

  • cheerful, sad, angry, fearful
  • friendly, hopeful, excited
  • newscast, customerservice
  • narration-professional, documentary-narration
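
Rather than hand-writing SSML for every style, a small builder can wrap plain text in the `mstts:express-as` element shown above (a sketch; `xml.sax.saxutils.escape` guards against `<`, `>`, and `&` in user-supplied text):

```python
from xml.sax.saxutils import escape

def styled_ssml(text, voice="en-US-JennyNeural", style="cheerful", lang="en-US"):
    """Build SSML that speaks `text` in the given style."""
    return (
        f'<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="{lang}">'
        f'<voice name="{voice}">'
        f'<mstts:express-as style="{style}">{escape(text)}</mstts:express-as>'
        f'</voice></speak>'
    )
```

Usage: `synthesizer.speak_ssml_async(styled_ssml("Great news!", style="excited")).get()`.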

Role Play

Some voices can adopt different personas:

ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
    xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="en-US-JennyNeural">
        <mstts:express-as role="YoungAdultFemale">
            Speaking as a young adult.
        </mstts:express-as>
    </voice>
</speak>
"""

SSML Support

Azure has comprehensive SSML support with Microsoft extensions.

Basic SSML

ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-JennyNeural">
        Hello! <break time="500ms"/>
        Welcome to <emphasis level="strong">Azure Speech</emphasis>.
    </voice>
</speak>
"""
 
result = synthesizer.speak_ssml_async(ssml).get()

Prosody Control

ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-JennyNeural">
        <prosody rate="-20%" pitch="-5%">
            Speaking slowly with a lower pitch.
        </prosody>
        <prosody rate="+20%" pitch="+5%">
            Speaking quickly with a higher pitch.
        </prosody>
    </voice>
</speak>
"""

Audio Duration

ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
    xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="en-US-JennyNeural">
        <mstts:audioduration value="5s"/>
        This text will be spoken over exactly 5 seconds.
    </voice>
</speak>
"""

Background Audio

Mix speech with background music. Note that `mstts:backgroundaudio` is a direct child of `speak`, not of `voice`, and applies to the whole document:

ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
    xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
    <mstts:backgroundaudio src="https://example.com/music.mp3"
        volume="0.3" fadein="1000" fadeout="1000"/>
    <voice name="en-US-JennyNeural">
        This speech plays over background music.
    </voice>
</speak>
"""

Custom Neural Voice

Create unique brand voices with your own data.

Requirements

  • 300+ utterances (professional recording recommended)
  • Clean audio without background noise
  • Consistent speaker and recording conditions
  • Azure Speech Studio access

Process

  1. Prepare training data (audio + transcripts)
  2. Upload to Azure Speech Studio
  3. Train custom voice model
  4. Deploy and use via API

# Using a deployed custom voice: point the config at your deployment
speech_config.endpoint_id = "your-deployment-id"
speech_config.speech_synthesis_voice_name = "your-custom-voice-name"
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async("Speaking with custom voice.").get()

Pricing

Azure Speech uses tiered pricing:

Standard Tier

| Feature | Price |
| --- | --- |
| Neural TTS | $16 per 1M characters |
| Standard TTS | $4 per 1M characters |
| Custom Neural Voice | $24 per 1M characters |

Free Tier

  • 500,000 characters/month (Neural)
  • 5 million characters/month (Standard)
  • 10,000 transactions for Custom Voice
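
The prices and free allowances above are enough for a back-of-envelope monthly estimate (a sketch; the figures mirror the tables here and may change, so check current Azure pricing before budgeting):

```python
# Standard-tier prices per 1M characters and monthly free allowances,
# taken from the tables above.
PRICE_PER_MILLION = {"neural": 16.0, "standard": 4.0, "custom": 24.0}
FREE_CHARS = {"neural": 500_000, "standard": 5_000_000}

def estimate_monthly_cost(chars: int, tier: str = "neural") -> float:
    """Estimated USD cost for `chars` characters in one month."""
    billable = max(0, chars - FREE_CHARS.get(tier, 0))
    return billable / 1_000_000 * PRICE_PER_MILLION[tier]
```

For example, 1.5M neural characters in a month bills 1M after the free allowance, i.e. $16.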

Best Practices

Optimize for Latency

  1. Use regional endpoints close to users
  2. Enable streaming for real-time apps
  3. Keep SSML simple when latency matters
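
Before optimizing, measure: wrapping the blocking synthesis call in a timer gives you a baseline to compare regions and SSML variants against (plain Python; works with any callable):

```python
import time

def timed(fn, *args, **kwargs):
    """Run `fn` and return (result, elapsed_ms)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms
```

Usage: `result, ms = timed(lambda: synthesizer.speak_text_async("Hi").get())`.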

Improve Quality

  1. Use appropriate speaking styles
  2. Add SSML for natural pacing
  3. Test different voices for your content

Cost Optimization

  1. Cache frequently generated audio
  2. Use Standard voices for non-critical content
  3. Batch requests when possible
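
Caching can be as simple as hashing the text and voice into a filename (a sketch; the `tts_cache` directory and `.mp3` extension are assumptions, and `synthesize_fn` stands in for whichever synthesis call you use):

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path("tts_cache")  # assumed cache location

def cache_key(text: str, voice: str) -> str:
    # The same text + voice pair always maps to the same file name.
    return hashlib.sha256(f"{voice}:{text}".encode()).hexdigest() + ".mp3"

def get_or_synthesize(text: str, voice: str, synthesize_fn) -> bytes:
    """Return cached audio bytes, calling synthesize_fn only on a cache miss."""
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / cache_key(text, voice)
    if path.exists():
        return path.read_bytes()
    audio = synthesize_fn(text, voice)
    path.write_bytes(audio)
    return audio
```

Repeated prompts (greetings, menus, error messages) then cost one synthesis call ever, rather than one per playback.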

Language Support

Azure leads with 140+ languages:

| Region | Languages |
| --- | --- |
| Americas | English, Spanish, Portuguese, French |
| Europe | German, French, Italian, Dutch, Polish, etc. |
| Asia | Chinese, Japanese, Korean, Hindi, etc. |
| Middle East | Arabic, Hebrew, Turkish |
| Africa | Afrikaans, Swahili, Amharic |

List Available Voices

import requests
 
# Key and region from your Speech resource
key = "your-subscription-key"
region = "eastus"
 
url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/voices/list"
headers = {"Ocp-Apim-Subscription-Key": key}
 
response = requests.get(url, headers=headers)
voices = response.json()
 
# Filter by language
english_voices = [v for v in voices if v['Locale'].startswith('en-')]
for voice in english_voices[:5]:
    print(f"{voice['ShortName']} - {voice['Gender']} - {voice['VoiceType']}")

Integration Examples

REST API

import requests
 
# Reuses the same key and region as the earlier examples
url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
 
headers = {
    "Ocp-Apim-Subscription-Key": key,
    "Content-Type": "application/ssml+xml",
    "X-Microsoft-OutputFormat": "audio-16khz-128kbitrate-mono-mp3"
}
 
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-JennyNeural">Hello from the REST API!</voice>
</speak>
"""
 
response = requests.post(url, headers=headers, data=ssml)
 
with open("output.mp3", "wb") as f:
    f.write(response.content)
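
The REST call above writes whatever comes back, even an error body. A small checker that recognizes the common failure codes helps (a sketch; 401 and 429 are the standard auth and throttling responses, and the function only assumes a `requests`-style response object):

```python
def check_tts_response(response) -> bytes:
    """Return audio bytes, or raise a descriptive error for common failures."""
    if response.status_code == 401:
        raise RuntimeError("Unauthorized: check your subscription key and region")
    if response.status_code == 429:
        raise RuntimeError("Rate limited: back off and retry")
    response.raise_for_status()  # any other non-2xx status
    return response.content
```

Usage: `audio = check_tts_response(requests.post(url, headers=headers, data=ssml))`.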

JavaScript/Web

const sdk = require('microsoft-cognitiveservices-speech-sdk');
 
const speechConfig = sdk.SpeechConfig.fromSubscription(key, region);
speechConfig.speechSynthesisVoiceName = 'en-US-JennyNeural';
 
const synthesizer = new sdk.SpeechSynthesizer(speechConfig);
 
synthesizer.speakTextAsync(
    'Hello from JavaScript!',
    result => {
        if (result.reason === sdk.ResultReason.SynthesizingAudioCompleted) {
            console.log('Synthesis completed');
        }
        synthesizer.close();
    },
    error => {
        console.error(error);
        synthesizer.close();
    }
);

Comparison with Alternatives

| Feature | Azure Speech | Google Cloud | Amazon Polly |
| --- | --- | --- | --- |
| Voice Quality | Very Good | Very Good | Good |
| Languages | 140+ | 40+ | 30+ |
| Custom Voice | Yes | Limited | No |
| Speaking Styles | Extensive | No | Limited |
| Latency | ~150ms | ~200ms | ~100ms |

For a detailed comparison, see our Voice AI API Comparison Guide.

When to Choose Azure Speech

Azure Speech is ideal when:

  • You need maximum language coverage
  • You want custom brand voices
  • Enterprise compliance is required
  • You need speaking styles and emotions

Consider alternatives if:

  • You need the lowest possible latency (Amazon Polly leads in the comparison above)
  • You only need a few widely supported languages
  • You don't need custom voices or speaking styles

Conclusion

Azure Speech Service offers unmatched language coverage and enterprise features. The combination of 140+ languages, custom neural voice training, and comprehensive SSML support makes it the choice for global and enterprise applications.


This article is part of our Voice AI API Comparison series. Explore guides for Amazon Polly, OpenAI TTS, and more.
