Domain 2: Implement AI Solutions Using Foundry

Multimodal: Responding to Speech

Modern AI models can hear you. Learn how to send spoken prompts to a multimodal model and get intelligent responses, combining speech recognition with AI reasoning.

AI that listens and responds

Simple explanation

Instead of typing your question, you can speak it, and the AI responds intelligently.

Think of it like talking to a voice assistant, but much smarter. You say: "Look at this chart and tell me what the trend is." The AI hears your voice, understands the question, looks at the chart, and gives you a thoughtful answer, all in one step.

This is possible because multimodal models like GPT-4o can process audio input alongside text and images.

Two approaches to speech + AI

Traditional speech pipeline vs multimodal audio

| Feature | Traditional Pipeline | Multimodal (GPT-4o) |
|---|---|---|
| How it works | Speech → Text → LLM → Text → Speech (3 separate steps) | Speech → GPT-4o → Response (direct audio processing) |
| Services needed | Azure Speech + Azure OpenAI (separate services) | GPT-4o multimodal (one model) |
| Latency | Higher (multiple API calls) | Lower (single call) |
| Nuance | Loses tone, emphasis, and emotion in transcription | Preserves audio nuance: can understand tone and intent |
| Best for | When you need the transcript AND the AI response | When you want natural, conversational AI interaction |
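
To make the "Traditional Pipeline" column concrete, here is a minimal sketch of the three stages chained together. The `transcribe`, `ask_llm`, and `synthesize` callables are hypothetical placeholders, not real Azure SDK functions; in a real app each one would wrap a call to a speech or model service. The sketch only illustrates the shape of the flow: the transcript exists as an intermediate value you can store, at the cost of three calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineResult:
    transcript: str     # text produced by the speech-to-text stage
    reply_text: str     # text produced by the LLM stage
    reply_audio: bytes  # audio produced by the text-to-speech stage


def run_traditional_pipeline(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    ask_llm: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> PipelineResult:
    """Chain the three stages; each callable stands in for one service call."""
    transcript = transcribe(audio)        # Speech -> Text
    reply_text = ask_llm(transcript)      # Text -> LLM -> Text
    reply_audio = synthesize(reply_text)  # Text -> Speech
    return PipelineResult(transcript, reply_text, reply_audio)


# Stand-in stages (hypothetical) so the sketch runs without any cloud service.
def fake_transcribe(audio: bytes) -> str:
    return "what were our top-selling products last quarter"


def fake_llm(prompt: str) -> str:
    return f"You asked: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


result = run_traditional_pipeline(b"\x00\x01", fake_transcribe, fake_llm, fake_tts)
print(result.transcript)
```

Because `result.transcript` is available after the first stage, it can be saved for records or compliance before the model is ever called, which is the main reason to accept the extra steps.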

Using GPT-4o with audio input

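import base64

# Assumption: `chat` is a chat completions client created earlier
# (for example, an azure-ai-inference ChatCompletionsClient).

# Read the audio file and base64-encode it so it can travel inside the JSON request
with open("question.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

response = chat.complete(
    model="gpt4o-deployment",  # name of your audio-capable GPT-4o deployment
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Respond naturally to spoken questions."},
        {"role": "user", "content": [
            # The audio is sent as a content part instead of a text prompt
            {"type": "input_audio", "input_audio": {"data": audio_data, "format": "wav"}}
        ]}
    ]
)

# Print the model's text reply
print(response.choices[0].message.content)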

What's happening:

  • The audio file is encoded as base64 and sent directly to GPT-4o
  • The model processes the audio natively, with no separate speech-to-text step
  • The response is text (or can be audio in supported configurations)
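
The encoding step can be tried locally without calling any model. The sketch below is illustrative only: `make_silent_wav` and `build_audio_message` are hypothetical helper names, and the silent clip stands in for a real recording. It builds a short WAV in memory with Python's standard `wave` module, then wraps it in the same user-message shape used in the example above.

```python
import base64
import io
import wave


def make_silent_wav(seconds: float = 0.1, sample_rate: int = 16000) -> bytes:
    """Create a short mono 16-bit PCM WAV in memory (stand-in for a recording)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buffer.getvalue()


def build_audio_message(audio_bytes: bytes, audio_format: str = "wav") -> dict:
    """Wrap raw audio bytes in a user message containing one input_audio part."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {"data": encoded, "format": audio_format},
            }
        ],
    }


wav_bytes = make_silent_wav()
message = build_audio_message(wav_bytes)
print(message["content"][0]["type"])
```

Base64 is used because the request body is JSON, which cannot carry raw binary; the encoded string decodes back to exactly the original bytes on the service side.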

DataFlow Corp scenario: DataFlow builds a voice-enabled analytics dashboard. Managers speak queries like "What were our top-selling products last quarter?" GPT-4o understands the spoken question, queries the data, and responds with the answer.

When to use traditional pipeline vs multimodal

Use the traditional pipeline (Speech + LLM) when:

  • You need the transcript for records or compliance
  • You need custom speech recognition (industry terms, accents)
  • You need speech translation between languages
  • Budget is tight (dedicated speech service can be cheaper)

Use multimodal (GPT-4o) when:

  • You want the simplest possible architecture
  • Tone and emotional context matter for the response
  • Low latency is critical
  • You're already using GPT-4o for other modalities (text, images)
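
As a rule of thumb, the two lists above can be collapsed into a small decision helper. This is a deliberately simplified sketch, and `choose_speech_approach` is a hypothetical name: it treats any pipeline-only requirement as decisive, whereas a real design decision would also weigh cost, latency, and accuracy measurements.

```python
def choose_speech_approach(
    needs_transcript: bool = False,
    needs_custom_recognition: bool = False,
    needs_translation: bool = False,
    budget_is_tight: bool = False,
) -> str:
    """Return 'traditional' if any pipeline-only requirement applies, else 'multimodal'."""
    pipeline_only_requirements = (
        needs_transcript,          # transcript needed for records or compliance
        needs_custom_recognition,  # industry terms, accents
        needs_translation,         # speech translation between languages
        budget_is_tight,           # dedicated speech service can be cheaper
    )
    if any(pipeline_only_requirements):
        return "traditional"
    return "multimodal"


# A team that must store transcripts is steered to the pipeline.
print(choose_speech_approach(needs_transcript=True))

# With no pipeline-only requirement, the simpler multimodal route is chosen.
print(choose_speech_approach())
```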

Flashcards

Question

How can GPT-4o process spoken questions?

Answer

GPT-4o is multimodal: it can accept audio as a direct input modality. The audio is encoded as base64 and sent in the messages array. The model processes the audio waveform directly without needing a separate speech-to-text service.

Question

What is the advantage of multimodal audio over a traditional speech pipeline?

Answer

Lower latency (a single API call instead of three), preserved audio nuance (tone, emphasis), and a simpler architecture. The traditional pipeline chains separate speech-to-text → LLM → text-to-speech services.

Question

When should you use the traditional speech pipeline instead of multimodal?

Answer

When you need the transcript for records, need custom speech recognition for industry terms, need speech translation, or when budget is tight (dedicated speech service can be cheaper).

Knowledge Check

MediSpark's doctors want to dictate clinical notes and have AI summarise them. They also need the raw transcript saved to patient records. Which approach should they use?

DataFlow Corp wants a voice-enabled dashboard where managers ask spoken questions and get instant answers. Tone of voice should influence the response style. What's the best approach?


Next up: Azure Speech in Foundry Tools, building apps with dedicated speech services.