Azure Speech Service: Complete Developer Guide
Azure Speech Service is Microsoft's comprehensive speech platform, offering some of the widest language coverage in the industry with 140+ languages and variants. That breadth makes it a go-to choice for global applications and enterprises.
This guide covers everything you need to integrate Azure Speech into your application.
Why Azure Speech?
Azure Speech offers compelling advantages:
- Language Coverage: 140+ languages and regional variants
- Custom Neural Voice: Train custom voices on your data
- Enterprise Ready: SOC 2, HIPAA, GDPR compliance
- SSML Support: Comprehensive speech control
- Avatar Support: Generate talking avatars (preview)
Getting Started
Setup
- Create an Azure account
- Create a Speech resource in Azure Portal
- Get your subscription key and region
```bash
export AZURE_SPEECH_KEY="your-subscription-key"
export AZURE_SPEECH_REGION="eastus"
```
Basic Text-to-Speech
```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription="your-key",
    region="eastus"
)
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async("Hello, this is Azure Speech.").get()

if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Speech synthesized successfully")
```
Save to File
```python
audio_config = speechsdk.audio.AudioOutputConfig(filename="output.wav")

synthesizer = speechsdk.SpeechSynthesizer(
    speech_config=speech_config,
    audio_config=audio_config
)
result = synthesizer.speak_text_async("Saving to file.").get()
```
Streaming Synthesis
For real-time applications:
```python
def stream_callback(evt):
    if evt.result.reason == speechsdk.ResultReason.SynthesizingAudio:
        audio_data = evt.result.audio_data
        # Process audio chunk
        print(f"Received {len(audio_data)} bytes")

synthesizer = speechsdk.SpeechSynthesizer(
    speech_config=speech_config,
    audio_config=None  # Disable default output
)
synthesizer.synthesizing.connect(stream_callback)
synthesizer.speak_text_async("This audio streams as it generates.").get()
```
Voice Types
Azure offers multiple voice tiers:
| Type | Quality | Languages | Use Case |
|---|---|---|---|
| Neural | Excellent | 140+ | Production apps |
| Standard | Good | Limited | Legacy support |
| Custom Neural | Excellent | Custom | Brand voices |
Neural Voices
High-quality voices with natural intonation:
```python
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"
```
Voice Styles
Many neural voices support different speaking styles:
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
<voice name="en-US-JennyNeural">
<mstts:express-as style="cheerful">
This is spoken in a cheerful style!
</mstts:express-as>
</voice>
</speak>
"""
result = synthesizer.speak_ssml_async(ssml).get()Available styles (voice-dependent):
- cheerful, sad, angry, fearful
- friendly, hopeful, excited
- newscast, customerservice
- narration-professional, documentary-narration
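Hand-writing SSML strings gets repetitive when you switch styles often. A small helper can wrap plain text in an express-as block; this is a sketch using only the standard library, where `build_style_ssml` is a hypothetical name and the element and attribute names follow the examples in this guide:

```python
from xml.sax.saxutils import escape

def build_style_ssml(text, voice="en-US-JennyNeural", style="cheerful"):
    """Wrap plain text in an SSML document with an mstts:express-as style."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">'
        f'<voice name="{voice}">'
        f'<mstts:express-as style="{style}">{escape(text)}</mstts:express-as>'
        '</voice></speak>'
    )

ssml = build_style_ssml("Great news, everyone!", style="excited")
```

The result can be passed to `speak_ssml_async` exactly like the hand-written examples; `escape` keeps user-supplied text from breaking the XML.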
Role Play
Some voices can adopt different personas:
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
<voice name="en-US-JennyNeural">
<mstts:express-as role="YoungAdultFemale">
Speaking as a young adult.
</mstts:express-as>
</voice>
</speak>
"""SSML Support
Azure has comprehensive SSML support with Microsoft extensions.
Basic SSML
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-JennyNeural">
Hello! <break time="500ms"/>
Welcome to <emphasis level="strong">Azure Speech</emphasis>.
</voice>
</speak>
"""
result = synthesizer.speak_ssml_async(ssml).get()Prosody Control
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-JennyNeural">
<prosody rate="-20%" pitch="-5%">
Speaking slowly with a lower pitch.
</prosody>
<prosody rate="+20%" pitch="+5%">
Speaking quickly with a higher pitch.
</prosody>
</voice>
</speak>
"""Audio Effects
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
<voice name="en-US-JennyNeural">
<mstts:audioduration value="5s"/>
This text will be spoken over exactly 5 seconds.
</voice>
</speak>
"""Background Audio
Mix speech with background music:
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
<voice name="en-US-JennyNeural">
<mstts:backgroundaudio src="https://example.com/music.mp3"
volume="0.3" fadein="1000" fadeout="1000">
This speech plays over background music.
</mstts:backgroundaudio>
</voice>
</speak>
"""Custom Neural Voice
Create unique brand voices with your own data.
Requirements
- 300+ utterances (professional recording recommended)
- Clean audio without background noise
- Consistent speaker and recording conditions
- Azure Speech Studio access
Process
- Prepare training data (audio + transcripts)
- Upload to Azure Speech Studio
- Train custom voice model
- Deploy and use via API
```python
# Using custom voice: set both the deployed endpoint ID and the voice name
speech_config.endpoint_id = "your-deployment-endpoint-id"
speech_config.speech_synthesis_voice_name = "your-custom-voice-name"

synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async("Speaking with custom voice.").get()
```
Pricing
Azure Speech uses tiered pricing:
Standard Tier
| Feature | Price |
|---|---|
| Neural TTS | $16/1M characters |
| Standard TTS | $4/1M characters |
| Custom Neural Voice | $24/1M characters |
Free Tier
- 500,000 characters/month (Neural)
- 5 million characters/month (Standard)
- 10,000 transactions for Custom Voice
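As a sanity check, the rates above translate into a simple cost estimator. This is a sketch: the per-character rates and free-tier allowances are taken from the tables in this article, so verify them against current Azure pricing before budgeting.

```python
# Rates in USD per 1M characters and free-tier allowances, as listed above
RATES_PER_MILLION = {"neural": 16.0, "standard": 4.0, "custom_neural": 24.0}
FREE_CHARS = {"neural": 500_000, "standard": 5_000_000}

def estimate_monthly_cost(chars, tier="neural"):
    """Estimate monthly TTS cost in USD after subtracting the free tier."""
    billable = max(0, chars - FREE_CHARS.get(tier, 0))
    return billable / 1_000_000 * RATES_PER_MILLION[tier]

# 2M neural characters: 1.5M billable at $16/1M
print(estimate_monthly_cost(2_000_000, "neural"))  # → 24.0
```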
Best Practices
Optimize for Latency
- Use regional endpoints close to users
- Enable streaming for real-time apps
- Keep SSML simple when latency matters
Improve Quality
- Use appropriate speaking styles
- Add SSML for natural pacing
- Test different voices for your content
Cost Optimization
- Cache frequently generated audio
- Use Standard voices for non-critical content
- Batch requests when possible
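The caching advice above can be as simple as keying audio bytes by a hash of the text and voice. A minimal file-based sketch follows; the `synthesize` parameter stands in for whatever function actually calls Azure, and all names here are illustrative:

```python
import hashlib
from pathlib import Path

def cached_tts(text, voice, synthesize, cache_dir="tts_cache"):
    """Return cached audio bytes if present; otherwise synthesize and store."""
    key = hashlib.sha256(f"{voice}:{text}".encode("utf-8")).hexdigest()
    path = Path(cache_dir) / f"{key}.audio"
    if path.exists():
        return path.read_bytes()
    audio = synthesize(text, voice)  # e.g. a wrapper around speak_text_async
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(audio)
    return audio
```

Repeated calls with the same text and voice then hit the disk cache instead of the paid API.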
Language Support
Azure leads with 140+ languages:
| Region | Languages |
|---|---|
| Americas | English, Spanish, Portuguese, French |
| Europe | German, French, Italian, Dutch, Polish, etc. |
| Asia | Chinese, Japanese, Korean, Hindi, etc. |
| Middle East | Arabic, Hebrew, Turkish |
| Africa | Afrikaans, Swahili, Amharic |
List Available Voices
```python
import requests

key = "your-subscription-key"
region = "eastus"

url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/voices/list"
headers = {"Ocp-Apim-Subscription-Key": key}

response = requests.get(url, headers=headers)
voices = response.json()

# Filter by language
english_voices = [v for v in voices if v['Locale'].startswith('en-')]
for voice in english_voices[:5]:
    print(f"{voice['ShortName']} - {voice['Gender']} - {voice['VoiceType']}")
```
Integration Examples
REST API
```python
import requests

key = "your-subscription-key"
region = "eastus"

url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
headers = {
    "Ocp-Apim-Subscription-Key": key,
    "Content-Type": "application/ssml+xml",
    "X-Microsoft-OutputFormat": "audio-16khz-128kbitrate-mono-mp3"
}
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-JennyNeural">Hello from the REST API!</voice>
</speak>
"""

response = requests.post(url, headers=headers, data=ssml.encode("utf-8"))
response.raise_for_status()
with open("output.mp3", "wb") as f:
    f.write(response.content)
```
JavaScript/Web
```javascript
const sdk = require('microsoft-cognitiveservices-speech-sdk');

const key = 'your-subscription-key';
const region = 'eastus';

const speechConfig = sdk.SpeechConfig.fromSubscription(key, region);
speechConfig.speechSynthesisVoiceName = 'en-US-JennyNeural';

const synthesizer = new sdk.SpeechSynthesizer(speechConfig);
synthesizer.speakTextAsync(
  'Hello from JavaScript!',
  result => {
    if (result.reason === sdk.ResultReason.SynthesizingAudioCompleted) {
      console.log('Synthesis completed');
    }
    synthesizer.close();
  },
  error => {
    console.error(error);
    synthesizer.close();
  }
);
```
Comparison with Alternatives
| Feature | Azure Speech | Google Cloud | Amazon Polly |
|---|---|---|---|
| Voice Quality | Very Good | Very Good | Good |
| Languages | 140+ | 40+ | 30+ |
| Custom Voice | Yes | Limited | No |
| Speaking Styles | Extensive | No | Limited |
| Latency | ~150ms | ~200ms | ~100ms |
For a detailed comparison, see our Voice AI API Comparison Guide.
When to Choose Azure Speech
Azure Speech is ideal when:
- You need maximum language coverage
- You want custom brand voices
- Enterprise compliance is required
- You need speaking styles and emotions
Consider alternatives if:
- Voice quality is the top priority (ElevenLabs)
- Lowest latency is critical (Amazon Polly)
- You want simpler integration (OpenAI TTS)
Conclusion
Azure Speech Service offers unmatched language coverage and enterprise features. The combination of 140+ languages, custom neural voice training, and comprehensive SSML support makes it the choice for global and enterprise applications.
This article is part of our Voice AI API Comparison series. Explore guides for Amazon Polly, OpenAI TTS, and more.