Building a Vision App

Building with Azure AI Vision

Simple explanation

Module 20 used GPT-4o to answer questions about images. This module uses Azure AI Vision — a dedicated service that’s faster and cheaper for specific vision tasks.

Think of the difference: GPT-4o is like a brilliant friend who can discuss anything about an image. Azure AI Vision is like a specialist tool — it’s optimised for reading text (OCR), detecting objects, and classifying images with high speed and accuracy.

Azure AI Vision vs GPT-4o for vision

Feature	Azure AI Vision	GPT-4o Visual Prompts
Best for	High-volume classification, OCR, object detection	Complex visual reasoning, open-ended questions
Output	Structured JSON (tags, objects, text)	Natural language response
Cost	Lower per-transaction	Higher per-token
Custom models	Yes — Custom Vision service for your own classifiers	No — uses general knowledge
Speed	Fast — optimised for vision	Slower — processes full LLM pipeline
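
The "Best for" row of the table can be read as a simple routing rule. The sketch below is only an illustration of that rule: the task labels and the `choose_service` helper are hypothetical names, not part of any Azure SDK.

```python
# Task categories taken from the comparison table; the string labels are hypothetical.
SPECIALISED_TASKS = {"ocr", "object_detection", "classification"}
OPEN_ENDED_TASKS = {"visual_reasoning", "open_question"}


def choose_service(task):
    """Return which service the table suggests for a given task label."""
    if task in SPECIALISED_TASKS:
        return "Azure AI Vision"
    if task in OPEN_ENDED_TASKS:
        return "GPT-4o"
    raise ValueError(f"Unrecognised task: {task!r}")


for task in ("ocr", "open_question"):
    print(task, "->", choose_service(task))
```

In a real app the decision would also weigh volume and cost, which the table lists but this sketch leaves out.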

Image analysis with the SDK

from azure.ai.vision.imageanalysis import ImageAnalysisClient
from azure.ai.vision.imageanalysis.models import VisualFeatures
from azure.core.credentials import AzureKeyCredential

client = ImageAnalysisClient(
    endpoint="https://your-vision-resource.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("your-key")
)

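# Analyse an image from a public URL.
# analyze_from_url() takes image_url; analyze() expects raw bytes via image_data.
result = client.analyze_from_url(
    image_url="https://example.com/field-photo.jpg",
    visual_features=[
        VisualFeatures.CAPTION,
        VisualFeatures.TAGS,
        VisualFeatures.OBJECTS,
        VisualFeatures.READ
    ]
)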

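# Caption (the attribute is None if the feature was not returned)
if result.caption is not None:
    print(f"Caption: {result.caption.text} (confidence: {result.caption.confidence:.2f})")

# Tags
if result.tags is not None:
    for tag in result.tags.list:
        print(f"Tag: {tag.name} ({tag.confidence:.2f})")

# Objects detected (each object carries its own list of tags)
if result.objects is not None:
    for obj in result.objects.list:
        print(f"Object: {obj.tags[0].name} at [{obj.bounding_box}]")

# Text (OCR): blocks contain lines, lines contain the recognised text
if result.read is not None:
    for block in result.read.blocks:
        for line in block.lines:
            print(f"Text: {line.text}")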
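
Because the service returns structured results, it can be convenient to flatten them into a plain dictionary before storing or logging them. The sketch below assumes the attribute layout used in the code above (`caption.text`, `tags.list`, `objects.list`, `read.blocks`). The helper name `summarize_result` is hypothetical, and the `SimpleNamespace` object with made-up values is only a stand-in so the example runs without an Azure resource.

```python
import json
from types import SimpleNamespace


def summarize_result(result):
    """Flatten an analysis result into a JSON-serialisable dict.

    Any feature whose attribute is missing or None is skipped.
    """
    summary = {}

    caption = getattr(result, "caption", None)
    if caption is not None:
        summary["caption"] = {
            "text": caption.text,
            "confidence": round(caption.confidence, 2),
        }

    tags = getattr(result, "tags", None)
    if tags is not None:
        summary["tags"] = [tag.name for tag in tags.list]

    objects = getattr(result, "objects", None)
    if objects is not None:
        # Each detected object has its own tag list; keep the first tag name.
        summary["objects"] = [obj.tags[0].name for obj in objects.list if obj.tags]

    read = getattr(result, "read", None)
    if read is not None:
        summary["text_lines"] = [
            line.text for block in read.blocks for line in block.lines
        ]

    return summary


# Stand-in result with hypothetical values, mimicking the shape used above.
fake_result = SimpleNamespace(
    caption=SimpleNamespace(text="a field of maize", confidence=0.87),
    tags=SimpleNamespace(list=[SimpleNamespace(name="plant", confidence=0.99)]),
    objects=None,
    read=SimpleNamespace(
        blocks=[SimpleNamespace(lines=[SimpleNamespace(text="PLOT 7")])]
    ),
)

print(json.dumps(summarize_result(fake_result), indent=2))
```

With a live client, the same helper could be passed the `result` returned by the analysis call instead of `fake_result`.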

Visual features explained

Feature	What It Returns	Use Case
CAPTION	A natural language description of the image	Accessibility, image cataloguing
TAGS	List of keywords describing the content	Search indexing, content tagging
OBJECTS	Detected objects with bounding boxes	Quality control, inventory counting
READ	Extracted text (OCR)	Document processing, sign reading
PEOPLE	Detected people with positions	Crowd analysis, security
SMART_CROPS	Suggested crop regions for thumbnails	Social media, responsive images
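
One way to use the table is to translate what the app needs into the feature names to request. The sketch below is a minimal illustration: the `FEATURE_FOR_NEED` mapping, its keys, and the `features_for` helper are hypothetical, and the values are simply the feature names from the table written as strings.

```python
# Hypothetical lookup from a plain-language need to a visual feature name.
FEATURE_FOR_NEED = {
    "describe": "CAPTION",
    "keywords": "TAGS",
    "locate_objects": "OBJECTS",
    "extract_text": "READ",
    "find_people": "PEOPLE",
    "thumbnails": "SMART_CROPS",
}


def features_for(needs):
    """Return the de-duplicated feature names for the given needs, in order."""
    selected = []
    for need in needs:
        if need not in FEATURE_FOR_NEED:
            raise ValueError(f"Unknown need: {need!r}")
        feature = FEATURE_FOR_NEED[need]
        if feature not in selected:
            selected.append(feature)
    return selected


print(features_for(["thumbnails", "keywords", "thumbnails"]))
# → ['SMART_CROPS', 'TAGS']
```

Assuming the enum member names match the table, the strings could then be converted with `[VisualFeatures[name] for name in features_for(...)]` before calling the client.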

GreenLeaf scenario: GreenLeaf builds a crop health monitoring app:

Farmer uploads field photo via mobile app
TAGS — identifies plant types, soil conditions
OBJECTS — counts individual plants, locates problem areas
CAPTION — generates a description for the report
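
The plant-counting step in this scenario can be sketched as a small post-processing function over the OBJECTS output. The example assumes each detection has already been reduced to a dict with `name` and `confidence` keys (for instance from `obj.tags[0]`); the `count_objects` helper, the 0.5 threshold, and the labels and scores below are all made up for illustration.

```python
from collections import Counter


def count_objects(detections, min_confidence=0.5):
    """Count detections per label, ignoring those below min_confidence."""
    counts = Counter()
    for det in detections:
        if det["confidence"] >= min_confidence:
            counts[det["name"]] += 1
    return dict(counts)


# Hypothetical detections standing in for a real OBJECTS result.
detections = [
    {"name": "plant", "confidence": 0.91},
    {"name": "plant", "confidence": 0.84},
    {"name": "plant", "confidence": 0.32},  # dropped: below the threshold
    {"name": "weed", "confidence": 0.77},
]

print(count_objects(detections))
# → {'plant': 2, 'weed': 1}
```

Raising `min_confidence` trades missed plants for fewer false detections, so the threshold would need tuning against real field photos.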

Flashcards

Question

What Python package provides the Azure AI Vision image analysis SDK?

Answer

azure-ai-vision-imageanalysis: it provides ImageAnalysisClient, whose analyze() (image bytes) and analyze_from_url() (image URL) methods accept visual features like CAPTION, TAGS, OBJECTS, and READ.

Question

What visual features can you request from Azure AI Vision?

Answer

CAPTION (description), TAGS (keywords), OBJECTS (with bounding boxes), READ (OCR text), PEOPLE (detected persons), and SMART_CROPS (thumbnail suggestions).

Question

When should you use Azure AI Vision instead of GPT-4o for image tasks?

Answer

When you need high-volume processing, structured JSON output, custom classifiers, or lower per-transaction cost. GPT-4o is better for complex visual reasoning and open-ended questions about images.

Knowledge Check

GreenLeaf wants to build an app that processes 5,000 field photos daily, tagging each with the type of crop visible. Which approach is most cost-effective?

Knowledge Check

DataFlow Corp receives scanned business documents. They need to: 1) extract all text, 2) identify what objects appear in any embedded photos, and 3) generate a description of each page. Which visual features do they request?

Next up: Content Understanding: Documents & Forms — extracting structured data from invoices, receipts, and forms.