Domain 2: Implement AI Solutions Using Foundry (~12 min read)

Visual Prompts: Images as Input

Modern AI can see. Send an image alongside your text prompt, and the AI analyses what's in it. Learn how to use visual input with multimodal models in Foundry.

Sending images to AI

Simple explanation

You can show a picture to AI and ask questions about it, just like showing a photo to a friend.

"What's in this image?" "Is there anything unusual?" "Read the text on this sign." "How many people are in this photo?" The AI looks at the image and gives you an intelligent answer.

This works because multimodal models like GPT-4o can process both text AND images simultaneously.

Sending an image with your prompt

import base64

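# Assumption: `chat` is an already-authenticated chat completions client
# created earlier (for example, azure.ai.inference.ChatCompletionsClient),
# and "gpt4o-deployment" is the name of your own GPT-4o deployment.

# Read the image file and base64-encode it so it can be embedded in the request
with open("xray.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()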

response = chat.complete(
    model="gpt4o-deployment",
    messages=[
        {"role": "system", "content": "You are a medical image analysis assistant. Describe what you observe but never provide diagnoses."},
        {"role": "user", "content": [
            {"type": "text", "text": "What do you observe in this chest X-ray?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_data}"}}
        ]}
    ]
)

print(response.choices[0].message.content)

What’s happening:

  • The user message contains BOTH text and an image
  • The image is base64-encoded and embedded in the message
  • GPT-4o processes both together, understanding the question AND the visual content
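
The message structure described above can be wrapped in a small helper. This is a minimal sketch, not part of the course code: the function names `image_to_data_url` and `build_user_message` are hypothetical, and the MIME type is guessed from the file extension rather than hard-coded to PNG.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Return a data URL (data:<mime>;base64,<payload>) for a local image file."""
    # Guess the MIME type from the file extension, e.g. ".png" -> "image/png".
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {path!r}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_user_message(question: str, image_path: str) -> dict:
    """Build a user message whose content holds one text part and one image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_to_data_url(image_path)}},
        ],
    }
```

The returned dictionary can be placed in the `messages` list after the system message, in the same position as the hand-written user message in the example above.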

Image input methods

| Method | How It Works | Best For |
|---|---|---|
| Base64 encoding | Embed the image data directly in the API call | Local files, private images |
| URL reference | Provide a public URL to the image | Publicly accessible images, web content |

# Method 2: URL reference
{"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}}
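
One way to choose between the two methods at runtime is to inspect the image source. The sketch below assumes that anything starting with `http://` or `https://` is a publicly reachable URL and that everything else is a local file path; the helper name `image_part` is hypothetical.

```python
import base64
import mimetypes
from pathlib import Path
from urllib.parse import urlparse


def image_part(source: str) -> dict:
    """Return an image_url content part for a public URL or a local file path."""
    if urlparse(source).scheme in ("http", "https"):
        # Method 2: pass the public URL through unchanged.
        url = source
    else:
        # Method 1: read the local file and embed it as a base64 data URL.
        mime_type = mimetypes.guess_type(source)[0] or "application/octet-stream"
        payload = base64.b64encode(Path(source).read_bytes()).decode("ascii")
        url = f"data:{mime_type};base64,{payload}"
    return {"type": "image_url", "image_url": {"url": url}}
```

Private images should still go through the base64 branch even when they are hosted somewhere, because the URL method only works if the service can fetch the image without credentials.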

What you can do with visual prompts

| Task | Example Prompt | Use Case |
|---|---|---|
| Describe | "What's in this image?" | Accessibility, cataloguing |
| Analyse | "What trends do you see in this chart?" | Business intelligence, reporting |
| Read text | "Read all the text in this document" | OCR alternative, document processing |
| Compare | "What's different between these two images?" | Quality control, before/after analysis |
| Count | "How many people are in this photo?" | Event monitoring, crowd analysis |
| Classify | "Is this a defective or normal product?" | Manufacturing quality control |
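
The Compare task needs more than one image in the same user message. A minimal sketch follows, assuming the deployed model accepts several `image_url` parts in one content array (per-request image limits vary by model and are not covered here). The helper name `build_comparison_message` and the example URLs are hypothetical.

```python
def build_comparison_message(question: str, image_urls: list[str]) -> dict:
    """Build one user message containing a question followed by several images."""
    if len(image_urls) < 2:
        raise ValueError("A comparison needs at least two images")
    # The text part comes first, then one image part per URL, in the given order.
    content = [{"type": "text", "text": question}]
    content += [{"type": "image_url", "image_url": {"url": url}} for url in image_urls]
    return {"role": "user", "content": content}


# Hypothetical before/after photos; data URLs work in the same position.
message = build_comparison_message(
    "What's different between these two images?",
    ["https://example.com/before.png", "https://example.com/after.png"],
)
```

Keeping the images in a stable order (for example, oldest first) makes it easier to refer to them in the prompt as "the first image" and "the second image".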

GreenLeaf scenario: GreenLeaf farmers photograph their crops and ask the AI:

  • "Are there signs of disease in this tomato plant?"
  • "What type of pest damage do you see?"
  • "Compare this week's growth to last week's photo"

Limitations of visual prompts

Visual prompts are powerful but have limitations:

  • Not a medical diagnostic tool: the model can describe what it sees, but shouldn't make diagnoses
  • May misidentify fine details: small text, distant objects, or subtle differences may be missed
  • No real-time video: it processes individual images, not live video streams
  • Token cost: images consume tokens, and higher-resolution images use more of them
  • Content filtering: images that the content filters flag as harmful or sensitive may be blocked
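
To make the token-cost point concrete, here is a rough estimator. It is only a sketch under stated assumptions: it follows the tile-based accounting that OpenAI has published for GPT-4o-class models (a base of 85 tokens plus 170 tokens per 512-pixel tile after resizing, with a flat 85 tokens when the optional `detail` field on `image_url` is set to `"low"`). The constants and resize rules are assumptions taken from that published description, not from this module, and a given Foundry deployment or newer model may count differently.

```python
import math

# Assumed constants from OpenAI's published GPT-4o image accounting.
BASE_TOKENS = 85
TOKENS_PER_TILE = 170


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Roughly estimate the input tokens one image consumes."""
    if detail == "low":
        # Low detail is billed at a flat rate regardless of image size.
        return BASE_TOKENS
    # Step 1: shrink (never enlarge) so the image fits inside 2048 x 2048.
    scale = min(1.0, 2048 / max(width, height))
    width, height = round(width * scale), round(height * scale)
    # Step 2: shrink (never enlarge) so the shortest side is at most 768 pixels.
    scale = min(1.0, 768 / min(width, height))
    width, height = round(width * scale), round(height * scale)
    # Step 3: count the 512 x 512 tiles needed to cover the resized image.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles


print(estimate_image_tokens(1024, 1024))         # 765 under these assumptions
print(estimate_image_tokens(2048, 4096))         # 1105 under these assumptions
print(estimate_image_tokens(2048, 4096, "low"))  # 85 under these assumptions
```

Under these assumptions, resizing photos before upload or requesting low detail is the main lever for controlling cost when fine detail is not needed.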

Exam tip: The exam may test your understanding of when visual prompts are appropriate vs when a dedicated vision service (Azure AI Vision) is better.

Flashcards

Question

How do you send an image to GPT-4o for analysis?

Answer

Include it in the user message as a content array item with type 'image_url'. The image can be base64-encoded (for local files) or referenced by URL (for public images). The model processes both text and image together.

Question

What are the two methods for providing images to a multimodal model?

Answer

Base64 encoding (embed image data directly in the API call, best for local/private images) and URL reference (provide a public URL, best for web-accessible images).

Question

What types of tasks can visual prompts handle?

Answer

Image description, chart/diagram analysis, text reading (OCR), image comparison, object counting, classification, and visual question answering.

Knowledge Check

MediSpark wants doctors to upload X-ray images and get a description of what the AI observes. The system prompt should ensure the AI never provides diagnoses. Which implementation is correct?

GreenLeaf wants to process 10,000 field photos per day to detect crop disease. The analysis needs to be fast and cost-effective with a simple 'healthy/diseased' classification. What's the best approach?


Next up: Generating Images with AI, which covers creating new visual content from text descriptions using GPT-image.