Domain 2: Implement AI Solutions Using Foundry

Building a Vision App

Combine image analysis capabilities into a complete application. Use Azure AI Vision to classify images, detect objects, and read text, all from Python.

Building with Azure AI Vision

Simple explanation

Module 20 used GPT-4o to answer questions about images. This module uses Azure AI Vision, a dedicated service that’s faster and cheaper for specific vision tasks.

Think of the difference: GPT-4o is like a brilliant friend who can discuss anything about an image. Azure AI Vision is like a specialist tool: it’s optimised for reading text (OCR), detecting objects, and classifying images with high speed and accuracy.

Azure AI Vision vs GPT-4o for vision

Azure AI Vision vs GPT-4o for image tasks

| Feature | Azure AI Vision | GPT-4o Visual Prompts |
| --- | --- | --- |
| Best for | High-volume classification, OCR, object detection | Complex visual reasoning, open-ended questions |
| Output | Structured JSON (tags, objects, text) | Natural language response |
| Cost | Lower per-transaction | Higher per-token |
| Custom models | Yes: Custom Vision service for your own classifiers | No: uses general knowledge |
| Speed | Fast: optimised for vision | Slower: processes full LLM pipeline |
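
The comparison above can be read as a simple routing rule. The sketch below is only an illustration of that rule, not part of either service: the task labels and the `pick_service` helper are hypothetical names chosen for this example.

```python
# Hypothetical task labels; the split mirrors the "Best for" row of the table.
STRUCTURED_TASKS = {"classification", "ocr", "object_detection"}


def pick_service(task, needs_open_ended_reasoning=False):
    """Suggest a service for an image task, following the comparison table."""
    # Open-ended questions or unfamiliar tasks go to the general-purpose model.
    if needs_open_ended_reasoning or task not in STRUCTURED_TASKS:
        return "GPT-4o visual prompts"
    # High-volume, structured-output tasks go to the dedicated vision service.
    return "Azure AI Vision"


choice_bulk = pick_service("ocr")
choice_reasoning = pick_service("classification", needs_open_ended_reasoning=True)
print(choice_bulk, "|", choice_reasoning)
```

In a real application the decision would likely also weigh volume and budget, which this sketch leaves out.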

Image analysis with the SDK

from azure.ai.vision.imageanalysis import ImageAnalysisClient
from azure.ai.vision.imageanalysis.models import VisualFeatures
from azure.core.credentials import AzureKeyCredential

client = ImageAnalysisClient(
    endpoint="https://your-vision-resource.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("your-key")
)

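# Analyse an image from a publicly accessible URL
# (analyze() takes raw image bytes via image_data; URLs go through analyze_from_url())
result = client.analyze_from_url(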
    image_url="https://example.com/field-photo.jpg",
    visual_features=[
        VisualFeatures.CAPTION,
        VisualFeatures.TAGS,
        VisualFeatures.OBJECTS,
        VisualFeatures.READ
    ]
)

# Caption
print(f"Caption: {result.caption.text} (confidence: {result.caption.confidence:.2f})")

# Tags
for tag in result.tags.list:
    print(f"Tag: {tag.name} ({tag.confidence:.2f})")

# Objects detected
for obj in result.objects.list:
    print(f"Object: {obj.tags[0].name} at [{obj.bounding_box}]")

# Text (OCR)
for block in result.read.blocks:
    for line in block.lines:
        print(f"Text: {line.text}")
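
A feature that was not requested comes back as `None` on the result, so printing code like the above can fail if the feature list changes. The following is a minimal sketch of a defensive helper that flattens a result into plain Python values. The `summarize` function is a hypothetical name, and the `SimpleNamespace` object is a stand-in that only mimics the attribute shape used above so the example runs without a network call.

```python
from types import SimpleNamespace


def summarize(result):
    """Flatten an analysis result into plain Python values.

    Assumes the attribute shape used in this module's code:
    caption.text, tags.list, objects.list, read.blocks[].lines[].text.
    A feature that is missing or None yields an empty value.
    """
    caption = getattr(result, "caption", None)
    tags = getattr(result, "tags", None)
    objects = getattr(result, "objects", None)
    read = getattr(result, "read", None)

    object_names = []
    if objects is not None:
        # Skip detections that carry no tag at all.
        object_names = [o.tags[0].name for o in objects.list if o.tags]

    text_lines = []
    if read is not None:
        text_lines = [line.text for block in read.blocks for line in block.lines]

    return {
        "caption": caption.text if caption is not None else None,
        "tags": [t.name for t in tags.list] if tags is not None else [],
        "objects": object_names,
        "text": text_lines,
    }


# Hypothetical stand-in for a real result, so the sketch runs offline.
fake_result = SimpleNamespace(
    caption=SimpleNamespace(text="a field of wheat", confidence=0.91),
    tags=SimpleNamespace(list=[SimpleNamespace(name="wheat", confidence=0.98)]),
    objects=None,  # pretend OBJECTS was not requested
    read=SimpleNamespace(
        blocks=[SimpleNamespace(lines=[SimpleNamespace(text="Plot 7")])]
    ),
)

summary = summarize(fake_result)
print(summary)
```

With a real client, the same helper could be called on the value returned by the analysis call in place of `fake_result`.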

Visual features explained

| Feature | What It Returns | Use Case |
| --- | --- | --- |
| CAPTION | A natural language description of the image | Accessibility, image cataloguing |
| TAGS | List of keywords describing the content | Search indexing, content tagging |
| OBJECTS | Detected objects with bounding boxes | Quality control, inventory counting |
| READ | Extracted text (OCR) | Document processing, sign reading |
| PEOPLE | Detected people with positions | Crowd analysis, security |
| SMART_CROPS | Suggested crop regions for thumbnails | Social media, responsive images |
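
One way to keep feature selection readable is to map what the app needs onto the feature names in the table. This is a sketch under assumptions: the task labels and the `features_for` helper are hypothetical, and the feature names are kept as plain strings so the example runs without the SDK installed.

```python
# Feature names are taken from the table above; the task labels are hypothetical.
FEATURE_FOR_TASK = {
    "describe": "CAPTION",
    "keywords": "TAGS",
    "locate_objects": "OBJECTS",
    "extract_text": "READ",
    "find_people": "PEOPLE",
    "thumbnails": "SMART_CROPS",
}


def features_for(tasks):
    """Return feature names for the given tasks, in order, without duplicates."""
    names = []
    for task in tasks:
        name = FEATURE_FOR_TASK[task]  # raises KeyError for an unknown task
        if name not in names:
            names.append(name)
    return names


selected = features_for(["keywords", "thumbnails", "keywords"])
print(selected)
```

Assuming `VisualFeatures` behaves as a standard Python enum, the names could then be converted with `[VisualFeatures[name] for name in selected]` before being passed as `visual_features`.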

GreenLeaf scenario: GreenLeaf builds a crop health monitoring app:

  1. Farmer uploads field photo via mobile app
  2. TAGS: identifies plant types, soil conditions
  3. OBJECTS: counts individual plants, locates problem areas
  4. CAPTION: generates a description for the report
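
The counting step in this scenario could be sketched as follows. This is an illustration only: the `count_objects` helper, the 0.5 confidence threshold, and the sample detections are all hypothetical, and the stand-in objects only mimic the `tags[0].name` and `tags[0].confidence` shape of detected objects used earlier in this module.

```python
from collections import Counter
from types import SimpleNamespace


def count_objects(objects, min_confidence=0.5):
    """Count detected objects by their top tag, ignoring weak detections."""
    counts = Counter()
    for obj in objects:
        if not obj.tags:
            continue  # detection without a label: nothing to count
        top = obj.tags[0]
        if top.confidence >= min_confidence:
            counts[top.name] += 1
    return counts


def detection(name, confidence):
    """Build a hypothetical stand-in for one detected object."""
    return SimpleNamespace(tags=[SimpleNamespace(name=name, confidence=confidence)])


# Made-up detections for one field photo.
detections = [
    detection("plant", 0.92),
    detection("plant", 0.81),
    detection("plant", 0.30),  # below the threshold, so not counted
    detection("tractor", 0.74),
]

counts = count_objects(detections)
print(counts)
```

With a real result, `result.objects.list` would take the place of `detections`, and each object's `bounding_box` could be used to locate problem areas.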

Flashcards

Question

What Python package provides the Azure AI Vision image analysis SDK?

Answer

azure-ai-vision-imageanalysis. It provides ImageAnalysisClient, whose analyze() method (image bytes) and analyze_from_url() method (image URL) accept visual features like CAPTION, TAGS, OBJECTS, and READ.

Question

What visual features can you request from Azure AI Vision?

Answer

CAPTION (description), TAGS (keywords), OBJECTS (with bounding boxes), READ (OCR text), PEOPLE (detected persons), and SMART_CROPS (thumbnail suggestions).

Question

When should you use Azure AI Vision instead of GPT-4o for image tasks?

Answer

When you need high-volume processing, structured JSON output, custom classifiers, or lower per-transaction cost. GPT-4o is better for complex visual reasoning and open-ended questions about images.

Knowledge Check

GreenLeaf wants to build an app that processes 5,000 field photos daily, tagging each with the type of crop visible. Which approach is most cost-effective?

DataFlow Corp receives scanned business documents. They need to: 1) extract all text, 2) identify what objects appear in any embedded photos, and 3) generate a description of each page. Which visual features do they request?


Next up: Content Understanding: Documents & Forms (extracting structured data from invoices, receipts, and forms).