Domain 1: AI Concepts and Capabilities

Speech: Recognition & Synthesis

AI can listen and talk. Speech recognition converts voice to text; speech synthesis converts text to voice. Learn how both work, when to use each, and what Azure offers.

How does AI handle speech?

Simple explanation

Speech AI works in two directions: listening and talking.

Listening (speech recognition) is like a court reporter: it hears what you say and types it out. Your phone’s voice typing, meeting transcription, and voice assistants all use this.

Talking (speech synthesis) is like a narrator: it reads text and speaks it aloud in a natural voice. GPS directions, audiobook readers, and accessibility tools all use this.

Modern speech AI is accurate enough for everyday use: it copes with accents, background noise, and multiple speakers, although accuracy still drops when audio quality is poor or the vocabulary is highly specialised.

Speech recognition (speech-to-text)

Converts spoken language into written text.

Key speech recognition features in Azure AI Speech
| Feature | How It Works |
|---|---|
| Real-time transcription | Converts speech to text as it happens; ideal for live meetings and captions |
| Batch transcription | Processes pre-recorded audio files; ideal for call centre recordings and podcasts |
| Speaker diarisation | Identifies who is speaking: 'Speaker 1 said... Speaker 2 replied...' |
| Custom speech models | Fine-tune recognition for industry terms, accents, or noisy environments |
| Multi-language support | 100+ languages and regional variants supported |
| Pronunciation assessment | Evaluates pronunciation accuracy; useful for language learning apps |

DataFlow Corp scenario: DataFlow transcribes 10,000 customer support calls per day. They use:

  • Batch transcription to process call recordings overnight
  • Speaker diarisation to separate agent and customer speech
  • Custom speech model trained on their product names and technical terms
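
As a rough sketch of how the three choices above could come together, the snippet below builds the JSON body for one batch transcription job. The field names (`contentUrls`, `locale`, `displayName`, `properties.diarizationEnabled`, `model.self`) follow my understanding of the Speech to text REST API and should be checked against the current API reference. The URLs, the job name, and the `build_batch_request` helper are hypothetical placeholders, and nothing is sent over the network.

```python
import json


def build_batch_request(recording_urls, custom_model_url=None):
    """Build a JSON body for a batch transcription job (illustrative only)."""
    body = {
        "displayName": "Nightly support calls",  # hypothetical job name
        "locale": "en-US",                       # assumed language of the calls
        "contentUrls": list(recording_urls),     # recordings to transcribe
        "properties": {
            "diarizationEnabled": True,          # separate agent and customer speech
        },
    }
    if custom_model_url:
        # Point the job at a custom speech model trained on product terms.
        body["model"] = {"self": custom_model_url}
    return body


request_body = build_batch_request(
    ["https://example.com/recordings/call-0001.wav"],               # placeholder URL
    custom_model_url="https://example.com/models/dataflow-custom",  # placeholder URL
)
print(json.dumps(request_body, indent=2))
```

In practice this body would be posted to the service with a subscription key, and the results collected once the job completes.
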

What is speaker diarisation?

Diarisation (also spelled “diarization”) is the process of identifying who spoke when in an audio recording with multiple speakers.

Without diarisation:

“Hello, I need help with my account. Sure, I can help you with that.”

With diarisation:

Speaker 1: “Hello, I need help with my account.”
Speaker 2: “Sure, I can help you with that.”

This is essential for meeting transcription, interview analysis, and customer support quality review.
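
To make the idea concrete, here is a minimal sketch of what an application might do with diarised output. It assumes the service returns short segments, each tagged with a speaker number and a start time; the `Segment` class, the `format_transcript` helper, and the timestamps are made up for illustration and are not an Azure API.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: int   # speaker label assigned by diarisation (1, 2, ...)
    start: float   # start time in seconds (sample values only)
    text: str


def format_transcript(segments):
    """Merge consecutive segments from the same speaker and label each turn."""
    turns = []
    for seg in sorted(segments, key=lambda s: s.start):
        if turns and turns[-1][0] == seg.speaker:
            # Same speaker is still talking: extend the current turn.
            turns[-1] = (seg.speaker, turns[-1][1] + " " + seg.text)
        else:
            # Speaker changed: start a new turn.
            turns.append((seg.speaker, seg.text))
    return "\n".join(f"Speaker {spk}: {text}" for spk, text in turns)


segments = [
    Segment(1, 0.0, "Hello,"),
    Segment(1, 0.6, "I need help with my account."),
    Segment(2, 2.4, "Sure, I can help you with that."),
]
print(format_transcript(segments))
# Speaker 1: Hello, I need help with my account.
# Speaker 2: Sure, I can help you with that.
```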

Speech synthesis (text-to-speech)

Converts written text into natural-sounding spoken audio.

| Feature | Description |
|---|---|
| Neural voices | AI-generated voices that sound remarkably human (not robotic) |
| Custom neural voice | Create a unique brand voice from training data |
| SSML control | Fine-tune pronunciation, speed, pitch, and emphasis using Speech Synthesis Markup Language |
| Multi-language | Generate speech in 100+ languages |
| Speaking styles | Adjust tone: cheerful, empathetic, newscast, customer service |
| Audio output formats | MP3, WAV, OGG, and streaming output |

MediSpark scenario: MediSpark builds an accessibility feature for visually impaired patients. Their app uses speech synthesis to read appointment details, medication instructions, and lab results aloud in a calm, empathetic speaking style.
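
A scenario like this could be expressed in SSML roughly as follows. This is a sketch under assumptions: the voice name `en-US-AriaNeural`, the `empathetic` style, and the `-10%` rate are example values that should be checked against the current voice list, and `build_ssml` is a hypothetical helper. The code only builds and validates the markup; it does not call the speech service.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def build_ssml(text, voice="en-US-AriaNeural", style="empathetic", rate="-10%"):
    """Wrap plain text in SSML that requests a speaking style and a slower pace."""
    return (
        '<speak version="1.0" xml:lang="en-US" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts">'
        f'<voice name="{voice}">'
        f'<mstts:express-as style="{style}">'
        # escape() protects characters such as & and < in the spoken text.
        f'<prosody rate="{rate}">{escape(text)}</prosody>'
        '</mstts:express-as>'
        '</voice>'
        '</speak>'
    )


ssml = build_ssml("Your appointment is on Tuesday at 10 am.")
ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed XML
print(ssml)
```

The resulting string is what an application would hand to the text-to-speech service in place of plain text.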

Speech translation

Azure AI Speech can also translate spoken language in real time:

  • Input: spoken English
  • Step 1: Speech-to-text (English text)
  • Step 2: Translation (English text → Spanish text)
  • Step 3: Text-to-speech (Spanish audio)
  • Output: spoken Spanish
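
The three steps above can be sketched as a simple chain of functions. Every function here is a hypothetical stub so the example runs offline: the "audio" is just text encoded as bytes, and the translation is a one-entry lookup table. A real system would replace each stub with a call to a speech or translation service.

```python
# Hypothetical stand-ins for the three service calls.
def speech_to_text(audio: bytes) -> str:
    return audio.decode("utf-8")      # pretend the audio "contains" its text


def translate(text: str, target: str) -> str:
    phrasebook = {("Good morning", "es"): "Buenos días"}  # toy lookup table
    return phrasebook.get((text, target), text)           # fall back to the input


def text_to_speech(text: str) -> bytes:
    return text.encode("utf-8")       # pretend these bytes are synthesised audio


def translate_speech(audio: bytes, target: str) -> bytes:
    """Chain the three stages: recognise, translate, synthesise."""
    recognised = speech_to_text(audio)          # Step 1: English text
    translated = translate(recognised, target)  # Step 2: Spanish text
    return text_to_speech(translated)           # Step 3: Spanish audio


spanish_audio = translate_speech(b"Good morning", "es")
print(spanish_audio.decode("utf-8"))  # Buenos días
```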

GreenLeaf scenario: GreenLeaf’s field workers speak multiple languages. During team meetings, Azure Speech Translation provides real-time translation, so each participant hears the meeting in their preferred language.

Flashcards

Question

What is the difference between speech recognition and speech synthesis?

Answer

Speech recognition (STT) converts spoken audio into written text. Speech synthesis (TTS) converts written text into spoken audio. They are complementary technologies that work in opposite directions.

Question

What is speaker diarisation?

Answer

The process of identifying who spoke when in a multi-speaker audio recording. The output labels each segment with the speaker: 'Speaker 1 said X, Speaker 2 replied Y.'

Question

What are neural voices?

Answer

AI-generated voices that sound remarkably human and natural. Unlike older robotic TTS, neural voices can express emotion, vary pace, and handle pronunciation naturally. Available in Azure AI Speech.

Question

What is SSML?

Answer

Speech Synthesis Markup Language: an XML-based language that lets you control pronunciation, speed, pitch, pauses, and emphasis in text-to-speech output.

Knowledge Check

DataFlow Corp needs to transcribe 10,000 recorded customer support calls each night and identify which parts were spoken by the agent vs the customer. Which combination of features do they need?

MediSpark builds an accessibility feature that reads appointment details aloud to visually impaired patients in a calm, empathetic tone. Which Azure AI Speech capability is this?


Next up: Computer Vision, or how AI sees and understands images.