Visual Prompts: Images as Input
Modern AI can see. Send an image alongside your text prompt, and the AI analyses what's in it. Learn how to use visual input with multimodal models in Foundry.
Sending images to AI
You can show a picture to an AI model and ask questions about it, just like showing a photo to a friend.
"What's in this image?" "Is there anything unusual?" "Read the text on this sign." "How many people are in this photo?" The model examines the image and answers based on what it sees.
This works because multimodal models like GPT-4o can process both text AND images simultaneously.
Sending an image with your prompt
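```python
import base64
import os

from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential

# Assumes the azure-ai-inference SDK; the environment variable names are placeholders
chat = ChatCompletionsClient(
    endpoint=os.environ["AZURE_AI_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["AZURE_AI_KEY"]),
)

# Read the image file and base64-encode it as text
with open("xray.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

response = chat.complete(
    model="gpt4o-deployment",  # your GPT-4o deployment name
    messages=[
        {"role": "system", "content": "You are a medical image analysis assistant. Describe what you observe but never provide diagnoses."},
        {"role": "user", "content": [
            {"type": "text", "text": "What do you observe in this chest X-ray?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_data}"}},
        ]},
    ],
)
print(response.choices[0].message.content)
```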
What's happening:
- The user message contains BOTH text and an image
- The image is base64-encoded and embedded in the message
- GPT-4o processes both together, understanding the question AND the visual content
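Embedding the image has a size cost. Base64 turns every 3 bytes into 4 ASCII characters, so the request body grows by roughly a third compared with the raw file. The following standard-library sketch checks that arithmetic. The function name `base64_length` is illustrative only.

```python
import base64


def base64_length(num_bytes: int) -> int:
    """Length of standard (padded) base64 output for num_bytes of input."""
    # Every started group of 3 input bytes becomes 4 output characters
    return 4 * ((num_bytes + 2) // 3)


# Check the formula against the real encoder for a few sizes
for n in (0, 1, 2, 3, 10, 1_000_000):
    assert len(base64.b64encode(b"\x00" * n)) == base64_length(n)

print(base64_length(3_000_000))  # 4000000: a 3 MB image becomes ~4 MB of text
```

Request-size limits vary by service and deployment, so check the documented limit before embedding large images.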
Image input methods
| Method | How It Works | Best For |
|---|---|---|
| Base64 encoding | Embed the image data directly in the API call | Local files, private images |
| URL reference | Provide a public URL to the image | Publicly accessible images, web content |
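```python
# Method 2: URL reference (use this content part in place of the base64 one;
# the service fetches the image itself, so the URL must be publicly reachable)
{"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}}
```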
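Both methods produce the same `image_url` content part. Only the `url` value differs. The sketch below shows one way to build that part from either source. It assumes anything starting with `http://` or `https://` is passed by reference and anything else is a local file to embed. The helper names `image_part` and `user_message` are hypothetical and not part of any SDK.

```python
import base64
import mimetypes


def image_part(source: str) -> dict:
    """Build an image_url content part from a public URL or a local file path."""
    if source.startswith(("http://", "https://")):
        # URL reference: the service downloads the image, so it must be public
        url = source
    else:
        # Base64 encoding: embed the local file's bytes as a data URL
        mime_type, _ = mimetypes.guess_type(source)
        if mime_type is None or not mime_type.startswith("image/"):
            raise ValueError(f"Cannot infer an image MIME type for {source!r}")
        with open(source, "rb") as f:
            payload = base64.b64encode(f.read()).decode("ascii")
        url = f"data:{mime_type};base64,{payload}"
    return {"type": "image_url", "image_url": {"url": url}}


def user_message(question: str, source: str) -> dict:
    """Pair a text question with one image in a single user message."""
    return {"role": "user", "content": [{"type": "text", "text": question}, image_part(source)]}


msg = user_message("What trends do you see in this chart?", "https://example.com/chart.png")
```

Picking the MIME type from the file extension avoids hard-coding `image/png` for JPEG or other formats.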
What you can do with visual prompts
| Task | Example Prompt | Use Case |
|---|---|---|
| Describe | "What's in this image?" | Accessibility, cataloguing |
| Analyse | "What trends do you see in this chart?" | Business intelligence, reporting |
| Read text | "Read all the text in this document." | OCR alternative, document processing |
| Compare | "What's different between these two images?" | Quality control, before/after analysis |
| Count | "How many people are in this photo?" | Event monitoring, crowd analysis |
| Classify | "Is this a defective or normal product?" | Manufacturing quality control |
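The model replies in free text. For the Classify row, it therefore helps to constrain the answer to a fixed label set and then validate the reply before acting on it. The sketch below is minimal. The label names, prompt wording, and `parse_label` helper are hypothetical, and the model call itself is omitted.

```python
from typing import Optional

ALLOWED_LABELS = ("defective", "normal")  # hypothetical label set

# System prompt that asks for exactly one of the allowed labels
SYSTEM_PROMPT = (
    "You are a product inspection assistant. "
    f"Reply with exactly one word: {' or '.join(ALLOWED_LABELS)}."
)


def parse_label(reply: str) -> Optional[str]:
    """Normalise the model's reply; return None if it is not an allowed label."""
    cleaned = reply.strip().strip(".!\"'").lower()
    return cleaned if cleaned in ALLOWED_LABELS else None
```

A `None` result signals that the model ignored the format. Routing those cases to a human reviewer is safer than guessing.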
GreenLeaf scenario: GreenLeaf farmers photograph their crops and ask the AI:
- "Are there signs of disease in this tomato plant?"
- "What type of pest damage do you see?"
- "Compare this week's growth to last week's photo."
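The weekly comparison needs two images in one request. The `content` list of a user message can hold several image parts alongside the text. The sketch below assumes both photos are publicly reachable; the URLs and the `comparison_message` helper are hypothetical. Labelling each image with a short text part is a common practice rather than a requirement. It lets the model refer to "Image 1" and "Image 2" unambiguously.

```python
def comparison_message(question: str, before_url: str, after_url: str) -> dict:
    """Build one user message that labels and attaches two images in order."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "text", "text": "Image 1 (last week):"},
            {"type": "image_url", "image_url": {"url": before_url}},
            {"type": "text", "text": "Image 2 (this week):"},
            {"type": "image_url", "image_url": {"url": after_url}},
        ],
    }


msg = comparison_message(
    "Compare this week's growth to last week's photo.",
    "https://example.com/tomato-week1.jpg",  # hypothetical URL
    "https://example.com/tomato-week2.jpg",  # hypothetical URL
)
```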
Limitations of visual prompts
Visual prompts are powerful but have limitations:
- Not a medical diagnostic tool: the model can describe what it sees, but it should not be used to make diagnoses
- May miss fine details: small text, distant objects, or subtle differences can be misread or overlooked
- No real-time video: the model processes individual images, not live video streams
- Token cost: images consume input tokens, and higher-resolution images consume more of them
- Content filtering: harmful or sensitive images may be blocked by content filters
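To make the token-cost point concrete, OpenAI has documented a tile-based formula for GPT-4o image inputs. It is controlled by the `detail` field of the `image_url` part. In that formula, `"low"` costs a flat 85 tokens, and `"high"` costs 85 tokens plus 170 per 512-pixel tile after resizing. The sketch below implements that published formula as an estimate only. Treat the constants as an assumption to re-check against current documentation, because other models and deployments may use different figures. The sketch also assumes small images are not upscaled.

```python
import math


def estimate_gpt4o_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate input tokens for one image using OpenAI's published GPT-4o tile formula."""
    if detail == "low":
        return 85  # flat cost, regardless of image size
    if detail != "high":
        raise ValueError("detail must be 'low' or 'high'")
    # Step 1: fit within a 2048 x 2048 square, keeping the aspect ratio
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: shrink so the shortest side is 768 px (assumption: never upscale)
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: count the 512 x 512 tiles needed to cover the resized image
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


print(estimate_gpt4o_image_tokens(1024, 1024))         # 765
print(estimate_gpt4o_image_tokens(1024, 1024, "low"))  # 85
```

For tasks that don't depend on fine detail, low detail or smaller images can cut the per-image cost substantially.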
Exam tip: The exam may test whether you know when visual prompts are appropriate and when a dedicated vision service (Azure AI Vision) is the better choice.
Knowledge Check
MediSpark wants doctors to upload X-ray images and get a description of what the AI observes. The system prompt should ensure the AI never provides diagnoses. Which implementation is correct?
GreenLeaf wants to process 10,000 field photos per day to detect crop disease. The analysis needs to be fast and cost-effective with a simple 'healthy/diseased' classification. What's the best approach?
Next up: Generating Images with AI (creating new visual content from text descriptions using GPT-image).