Multimodal: Responding to Speech

AI that listens and responds

Simple explanation

Instead of typing your question, you can speak it — and the AI responds intelligently.

Think of it like talking to a voice assistant, but much smarter. You say: “Look at this chart and tell me what the trend is.” The AI hears your voice, understands the question, looks at the chart, and gives you a thoughtful answer — all in one step.

This is possible because multimodal models like GPT-4o can process audio input alongside text and images.

Two approaches to speech + AI

Traditional speech pipeline vs multimodal audio
Feature	Traditional Pipeline	Multimodal (GPT-4o)
How it works	Speech → Text → LLM → Text → Speech (3 separate steps)	Speech → GPT-4o → Response (direct audio processing)
Services needed	Azure Speech + Azure OpenAI (separate services)	GPT-4o multimodal (one model)
Latency	Higher (multiple API calls)	Lower (single call)
Nuance	Loses tone, emphasis, emotion in transcription	Preserves audio nuance — can understand tone and intent
Best for	When you need the transcript AND the AI response	When you want natural, conversational AI interaction
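
To make the comparison concrete, here is a minimal sketch of the traditional pipeline as three chained steps. The three callables are hypothetical stand-ins for whichever speech-to-text, LLM, and text-to-speech services you use; they are not real SDK calls. The transcript exists as an intermediate result, which is why this approach suits cases where you need to keep it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineResult:
    transcript: str      # text produced by the speech-to-text step
    reply_text: str      # text produced by the LLM step
    reply_audio: bytes   # audio produced by the text-to-speech step


def run_speech_pipeline(
    audio: bytes,
    speech_to_text: Callable[[bytes], str],
    ask_llm: Callable[[str], str],
    text_to_speech: Callable[[str], bytes],
) -> PipelineResult:
    """Chain the three steps: Speech -> Text -> LLM -> Text -> Speech."""
    transcript = speech_to_text(audio)        # step 1: one service call
    reply_text = ask_llm(transcript)          # step 2: a second service call
    reply_audio = text_to_speech(reply_text)  # step 3: a third service call
    return PipelineResult(transcript, reply_text, reply_audio)


# Hypothetical stubs so the sketch runs without any cloud service.
def fake_speech_to_text(audio: bytes) -> str:
    return "what is the trend in this chart"


def fake_llm(prompt: str) -> str:
    return f"You asked: {prompt}"


def fake_text_to_speech(text: str) -> bytes:
    return text.encode("utf-8")


result = run_speech_pipeline(
    b"\x00\x01", fake_speech_to_text, fake_llm, fake_text_to_speech
)
print(result.transcript)
print(result.reply_text)
```

In a real application each stub would be replaced by a network call, which is where the extra latency in the table comes from.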

Using GPT-4o with audio input

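import base64

# Assumption: `chat` is a chat completions client created earlier in your app
# (for example, azure.ai.inference.ChatCompletionsClient), and
# "gpt4o-deployment" is a deployment of an audio-capable GPT-4o model.

# Read the audio file and encode it as a base64 string
with open("question.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("ascii")

# Send the audio as the user's message; there is no separate transcription call
response = chat.complete(
    model="gpt4o-deployment",
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Respond naturally to spoken questions."},
        {"role": "user", "content": [
            {"type": "input_audio", "input_audio": {"data": audio_data, "format": "wav"}}
        ]}
    ]
)

# The model's text reply is in the first choice
print(response.choices[0].message.content)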

What’s happening:

The audio file is encoded as base64 and sent directly to GPT-4o
The model processes the audio natively — no separate speech-to-text step
The response is text (or can be audio in supported configurations)
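
The encoding step can be tried without a recording or a model call. This sketch uses only the Python standard library to generate a short silent WAV clip in memory and wrap it in the same message shape as the request above. The helper names make_silent_wav and build_audio_message are illustrative and not part of any SDK.

```python
import base64
import io
import wave


def make_silent_wav(duration_ms: int = 100, sample_rate: int = 16000) -> bytes:
    """Create a short mono, 16-bit PCM WAV clip of silence in memory."""
    num_frames = sample_rate * duration_ms // 1000
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)       # mono
        wav_file.setsampwidth(2)       # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * num_frames)
    return buffer.getvalue()


def build_audio_message(audio_bytes: bytes, audio_format: str = "wav") -> dict:
    """Wrap raw audio bytes in a user message with an input_audio part."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {"data": encoded, "format": audio_format},
            }
        ],
    }


clip = make_silent_wav()
message = build_audio_message(clip)
payload = message["content"][0]["input_audio"]

# Decoding the base64 string gives back the original bytes unchanged.
assert base64.b64decode(payload["data"]) == clip
print(f"{len(clip)} raw bytes -> {len(payload['data'])} base64 characters")
```

Base64 makes the payload roughly a third larger than the raw audio, which is worth keeping in mind for longer recordings.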

DataFlow Corp scenario: DataFlow builds a voice-enabled analytics dashboard. Managers speak queries like “What were our top-selling products last quarter?” GPT-4o interprets the spoken question, the dashboard retrieves the matching data, and the model responds with the answer.

When to use traditional pipeline vs multimodal

Use the traditional pipeline (Speech + LLM) when:

You need the transcript for records or compliance
You need custom speech recognition (industry terms, accents)
You need speech translation between languages
Budget is tight (dedicated speech service can be cheaper)

Use multimodal (GPT-4o) when:

You want the simplest possible architecture
Tone and emotional context matter for the response
Low latency is critical
You’re already using GPT-4o for other modalities (text, images)
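
As a rough way to apply the two checklists, the sketch below encodes them as a small helper. The class and function names are made up for illustration, and treating every pipeline criterion as a hard requirement that overrides the multimodal preferences is an assumption of this sketch, not official guidance.

```python
from dataclasses import dataclass


@dataclass
class VoiceRequirements:
    needs_transcript: bool = False
    needs_custom_recognition: bool = False
    needs_translation: bool = False
    tight_budget: bool = False
    tone_matters: bool = False
    latency_critical: bool = False


def choose_approach(req: VoiceRequirements) -> tuple[str, list[str]]:
    """Return the suggested approach and the reasons behind it."""
    pipeline_checks = [
        (req.needs_transcript, "transcript required"),
        (req.needs_custom_recognition, "custom speech recognition required"),
        (req.needs_translation, "speech translation required"),
        (req.tight_budget, "budget is tight"),
    ]
    pipeline_reasons = [reason for flag, reason in pipeline_checks if flag]
    if pipeline_reasons:
        # Assumption: any pipeline criterion wins over multimodal preferences.
        return "traditional pipeline", pipeline_reasons

    multimodal_checks = [
        (req.tone_matters, "tone and emotional context matter"),
        (req.latency_critical, "low latency is critical"),
    ]
    multimodal_reasons = [reason for flag, reason in multimodal_checks if flag]
    return "multimodal", multimodal_reasons or ["simplest architecture"]


approach, reasons = choose_approach(
    VoiceRequirements(needs_translation=True, latency_critical=True)
)
print(approach, reasons)
```

Real projects often have to weigh these criteria against each other, so treat the output as a starting point for discussion and not as a verdict.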

Flashcards

Question

How can GPT-4o process spoken questions?

Answer

GPT-4o is multimodal — it can accept audio as a direct input modality. The audio is encoded as base64 and sent in the messages array. The model processes the audio waveform directly without needing a separate speech-to-text service.

Question

What is the advantage of multimodal audio over a traditional speech pipeline?

Answer

Lower latency (a single API call instead of three), preserved audio nuance (tone, emphasis), and a simpler architecture. The traditional pipeline chains separate speech-to-text, LLM, and text-to-speech services.

Question

When should you use the traditional speech pipeline instead of multimodal?

Answer

When you need the transcript for records, need custom speech recognition for industry terms, need speech translation, or when budget is tight (dedicated speech service can be cheaper).

Knowledge Check

MediSpark's doctors want to dictate clinical notes and have AI summarise them. They also need the raw transcript saved to patient records. Which approach should they use?

Knowledge Check

DataFlow Corp wants a voice-enabled dashboard where managers ask spoken questions and get instant answers. Tone of voice should influence the response style. What's the best approach?

Next up: Azure Speech in Foundry Tools — building apps with dedicated speech services.