Domain 2 β€” Module 8 of 15 53%
19 of 26 overall
Domain 2: Implement AI Solutions Using Foundry Free ⏱ ~14 min read

Azure Speech in Foundry Tools

Build a lightweight speech app using Azure AI Speech β€” the dedicated service for speech recognition, synthesis, and translation within Foundry Tools.

Building with Azure AI Speech

Simple explanation

Azure AI Speech is like giving your app ears and a voice.

In the last module, you used GPT-4o to process audio directly. This module uses Azure AI Speech β€” a dedicated service that’s optimised specifically for speech tasks. It’s faster for pure transcription, supports 100+ languages, and gives you fine-grained control over voice output.

Think of it as the specialist vs the generalist: GPT-4o can do everything, but Azure Speech does speech tasks better and cheaper.

Building a speech-to-text app

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription="your-speech-key",
    region="your-region"
)
speech_config.speech_recognition_language = "en-NZ"

# Recognise from microphone
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)

print("Speak now...")
result = recognizer.recognize_once()

if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print(f"You said: {result.text}")
elif result.reason == speechsdk.ResultReason.NoMatch:
    print("Speech not recognised")

Building a text-to-speech app

speech_config = speechsdk.SpeechConfig(
    subscription="your-speech-key",
    region="your-region"
)

# Choose a neural voice
speech_config.speech_synthesis_voice_name = "en-NZ-MollyNeural"

synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

result = synthesizer.speak_text("Welcome to MediSpark. Your appointment is confirmed for Tuesday at 2 PM.")

Using SSML for fine-grained control

SSML (Speech Synthesis Markup Language) lets you control how the AI speaks:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-NZ">
  <voice name="en-NZ-MollyNeural">
    <prosody rate="slow" pitch="+5%">
      Welcome to MediSpark.
    </prosody>
    <break time="500ms"/>
    Your appointment is confirmed for
    <emphasis level="strong">Tuesday at 2 PM</emphasis>.
  </voice>
</speak>
SSML ElementWhat It Controls
prosodySpeed (rate), pitch, and volume
breakPauses between phrases
emphasisStress on specific words
voiceWhich neural voice to use
say-asHow to pronounce dates, numbers, addresses

Combining speech with AI

The most powerful pattern combines Azure Speech with an LLM:

User speaks β†’ Azure Speech (STT) β†’ GPT-4o (reasoning) β†’ Azure Speech (TTS) β†’ AI responds aloud

MediSpark scenario: MediSpark builds a voice-enabled patient assistant:

  1. Patient speaks: β€œWhen is my next appointment?”
  2. Azure Speech transcribes the question
  3. GPT-4o queries the appointment system and generates a response
  4. Azure Speech reads the response aloud in a warm, empathetic neural voice
Continuous recognition for long conversations

recognize_once() listens for a single phrase. For ongoing conversations, use continuous recognition:

recognizer.start_continuous_recognition()
# ... recognizer fires events as speech is detected
recognizer.stop_continuous_recognition()

This is essential for meeting transcription, live captioning, and voice-controlled applications where the user speaks continuously.

🎬 Video walkthrough

Flashcards

Question

What Python package provides the Azure AI Speech SDK?

Click or press Enter to reveal answer

Answer

azure-cognitiveservices-speech β€” provides SpeechConfig, SpeechRecognizer (for STT), and SpeechSynthesizer (for TTS).

Click to flip back

Question

What is SSML and when do you use it?

Click or press Enter to reveal answer

Answer

Speech Synthesis Markup Language β€” XML-based control for text-to-speech output. Use it to adjust speaking rate, pitch, add pauses, emphasise words, and control pronunciation. Essential when you need fine-grained voice control.

Click to flip back

Question

What is continuous recognition?

Click or press Enter to reveal answer

Answer

A speech recognition mode that listens continuously and fires events as speech is detected β€” unlike recognize_once() which listens for a single phrase. Used for meeting transcription and live captioning.

Click to flip back

Knowledge Check

Knowledge Check

MediSpark wants their patient assistant to speak appointment confirmations in a calm, slow pace with emphasis on the date and time. Which Azure Speech feature enables this level of control?

Knowledge Check

DataFlow Corp needs to transcribe a 2-hour recorded meeting, identifying who said what. Which Azure Speech features do they need?


Next up: Visual Prompts β€” sending images to AI and getting intelligent responses.