Speech: Recognition & Synthesis

How does AI handle speech?

Simple explanation

Speech AI works in two directions: listening and talking.

Listening (speech recognition) is like a court reporter — it hears what you say and types it out. Your phone’s voice typing, meeting transcription, and voice assistants all use this.

Talking (speech synthesis) is like a narrator — it reads text and speaks it aloud in a natural voice. GPS directions, audiobook readers, and accessibility tools all use this.

Modern speech AI is accurate enough for everyday use: it copes with a range of accents, moderate background noise, and recordings with several speakers.

Speech recognition (speech-to-text)

Converts spoken language into written text.

Key speech recognition features in Azure AI Speech
Feature	How It Works
Real-time transcription	Converts speech to text as it happens — ideal for live meetings, captions
Batch transcription	Processes pre-recorded audio files — ideal for call centre recordings, podcasts
Speaker diarisation	Identifies who is speaking — 'Speaker 1 said... Speaker 2 replied...'
Custom speech models	Fine-tune recognition for industry terms, accents, or noisy environments
Multi-language support	100+ languages and regional variants supported
Pronunciation assessment	Evaluates pronunciation accuracy — useful for language learning apps

DataFlow Corp scenario: DataFlow transcribes 10,000 customer support calls per day. They use:

Batch transcription to process call recordings overnight
Speaker diarisation to separate agent and customer speech
A custom speech model trained on their product names and technical terms
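
As a rough illustration of how DataFlow's three choices could fit into one job, the sketch below builds the JSON body for a batch transcription request without sending it. The field names (`contentUrls`, `properties.diarizationEnabled`, `model.self`) are assumed from the v3.1 batch transcription REST API and should be checked against the current reference. The URLs and the `build_batch_request` helper are hypothetical.

```python
import json


def build_batch_request(audio_urls, locale="en-GB", custom_model_url=None):
    """Build the JSON body for one batch transcription job (nothing is sent)."""
    body = {
        "displayName": "Overnight support calls",
        "locale": locale,
        "contentUrls": list(audio_urls),
        "properties": {
            # Assumed property: ask the service to label speakers
            # so agent and customer speech can be separated.
            "diarizationEnabled": True,
            "wordLevelTimestampsEnabled": True,
        },
    }
    if custom_model_url:
        # Assumed shape: point the job at a custom speech model
        # instead of the default base model.
        body["model"] = {"self": custom_model_url}
    return body


request_body = build_batch_request(
    ["https://example.com/calls/call-0001.wav"],  # hypothetical recording URL
    custom_model_url="https://example.com/models/dataflow-terms",  # hypothetical
)
print(json.dumps(request_body, indent=2))
```

A real job would send this body to the service with an authenticated HTTP request and then poll for the result files. That part is left out here.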

What is speaker diarisation?

Diarisation (also spelled “diarization”) is the process of identifying who spoke when in an audio recording with multiple speakers.

Without diarisation:

“Hello, I need help with my account. Sure, I can help you with that.”

With diarisation:

Speaker 1: “Hello, I need help with my account.” Speaker 2: “Sure, I can help you with that.”

This is essential for meeting transcription, interview analysis, and customer support quality review.
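
The labelling step can be sketched in a few lines of Python. This is not how a speech service works internally. It only shows how an application might turn diarised segments into the labelled transcript above. The `Segment` type and the sample timings are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: int   # speaker label assigned by diarisation (1, 2, ...)
    start: float   # start time in seconds
    text: str      # recognised words for this segment


def format_transcript(segments):
    """Merge consecutive segments from the same speaker into labelled turns."""
    turns = []
    for seg in sorted(segments, key=lambda s: s.start):
        if turns and turns[-1][0] == seg.speaker:
            # Same speaker is still talking: extend the current turn.
            turns[-1][1].append(seg.text)
        else:
            # Speaker changed: start a new turn.
            turns.append((seg.speaker, [seg.text]))
    return "\n".join(f"Speaker {spk}: {' '.join(parts)}" for spk, parts in turns)


segments = [
    Segment(1, 0.0, "Hello,"),
    Segment(1, 0.6, "I need help with my account."),
    Segment(2, 2.9, "Sure, I can help you with that."),
]
print(format_transcript(segments))
```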

Speech synthesis (text-to-speech)

Converts written text into natural-sounding spoken audio.

Feature	Description
Neural voices	AI-generated voices that sound remarkably human (not robotic)
Custom neural voice	Create a unique brand voice from training data
SSML control	Fine-tune pronunciation, speed, pitch, and emphasis using Speech Synthesis Markup Language
Multi-language	Generate speech in 100+ languages
Speaking styles	Adjust tone: cheerful, empathetic, newscast, customer service
Audio output formats	MP3, WAV, OGG, and streaming output

MediSpark scenario: MediSpark builds an accessibility feature for visually impaired patients. Their app uses speech synthesis to read appointment details, medication instructions, and lab results aloud — using a calm, empathetic speaking style.
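
To make the SSML and speaking-style rows more concrete, here is a minimal sketch of the kind of SSML MediSpark's app might generate. The voice name `en-US-AriaNeural`, the `empathetic` style, and the slightly slower rate are assumptions chosen for illustration. Not every voice supports every style, so check the voice list before relying on them. The `build_ssml` helper is hypothetical and only builds the markup; it does not call the speech service.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, voice="en-US-AriaNeural", style="empathetic", rate="-10%"):
    """Return an SSML document that reads `text` in the given style and rate."""
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">'
        f"<voice name={quoteattr(voice)}>"
        # express-as selects the speaking style; prosody adjusts the speed.
        f"<mstts:express-as style={quoteattr(style)}>"
        f"<prosody rate={quoteattr(rate)}>{escape(text)}</prosody>"
        "</mstts:express-as></voice></speak>"
    )


ssml = build_ssml("Your appointment is at 9:30 am on Tuesday. Take one tablet with food.")
print(ssml)
```

The text is escaped before it goes into the markup, so characters such as `<` or `&` in a lab result cannot break the XML.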

Speech translation

Azure AI Speech can also translate spoken language in real time:

Input: spoken English
Step 1: Speech-to-text (English text)
Step 2: Translation (English text → Spanish text)
Step 3: Text-to-speech (Spanish audio)
Output: spoken Spanish
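
The three steps above can be sketched as a chain of functions. Everything here is a stand-in: the "audio" is just text encoded as bytes, the translator is a one-entry phrasebook, and none of the function names come from a real SDK. The point is only the shape of the pipeline, where each step's output is the next step's input.

```python
def speech_to_text(audio: bytes) -> str:
    """Stand-in for step 1; a real system would call a recognition service."""
    return audio.decode("utf-8")  # pretend the audio bytes are already words


def translate(text: str, phrasebook: dict) -> str:
    """Stand-in for step 2: look the whole sentence up in a tiny phrasebook."""
    return phrasebook[text]


def text_to_speech(text: str) -> bytes:
    """Stand-in for step 3; a real system would return synthesised audio."""
    return text.encode("utf-8")


def translate_speech(audio: bytes, phrasebook: dict) -> bytes:
    """Chain the three steps: recognise, translate, then synthesise."""
    source_text = speech_to_text(audio)
    target_text = translate(source_text, phrasebook)
    return text_to_speech(target_text)


EN_TO_ES = {"Good morning": "Buenos días"}  # hypothetical phrasebook
spanish_audio = translate_speech(b"Good morning", EN_TO_ES)
print(spanish_audio.decode("utf-8"))
```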

GreenLeaf scenario: GreenLeaf’s field workers speak multiple languages. During team meetings, Azure Speech Translation provides real-time translation — each participant hears the meeting in their preferred language.

Flashcards

Question

What is the difference between speech recognition and speech synthesis?

Answer

Speech recognition (STT) converts spoken audio into written text. Speech synthesis (TTS) converts written text into spoken audio. They run in opposite directions and are often used together, for example in a voice assistant.

Question

What is speaker diarisation?

Answer

The process of identifying who spoke when in a multi-speaker audio recording. The output labels each segment with the speaker: 'Speaker 1 said X, Speaker 2 replied Y.'

Question

What are neural voices?

Answer

AI-generated voices that sound remarkably human and natural. Unlike older robotic TTS, neural voices can express emotion, vary pace, and handle pronunciation naturally. They are available in Azure AI Speech.

Question

What is SSML?

Answer

Speech Synthesis Markup Language — an XML-based language that lets you control pronunciation, speed, pitch, pauses, and emphasis in text-to-speech output.

Knowledge Check

DataFlow Corp needs to transcribe 10,000 recorded customer support calls each night and identify which parts were spoken by the agent vs the customer. Which combination of features do they need?

Knowledge Check

MediSpark builds an accessibility feature that reads appointment details aloud to visually impaired patients in a calm, empathetic tone. Which Azure AI Speech capability is this?

Next up: Computer Vision — how AI sees and understands images.