Domain 1: Plan and Manage an Azure AI Solution (Module 6 of 8)

Quotas, Scaling & Cost

AI workloads can get expensive fast. Learn how to manage quotas, rate limits, scaling, and cost footprints, plus how to monitor model performance and detect drift before users notice.

Managing AI costs and limits

Simple explanation

AI models are like electricity: powerful, but you pay for every unit you use, and there's a limit to how much you can draw at once.

Quotas set how much capacity you're allowed. Rate limits cap how fast you can use it. Scaling adjusts capacity up or down as demand changes. And cost management stops you from getting a surprise bill at the end of the month.

The exam tests whether you can keep AI workloads running smoothly without burning through budget or hitting walls.

Quotas and rate limits

Concept | Scope | What It Limits | How to Increase
Subscription quota | Entire Azure subscription | Total TPM available for a model in a region | Request increase via Azure portal
Deployment rate limit | Single model deployment | RPM and TPM for that specific deployment | Adjust within subscription quota
Provisioned capacity | Reserved deployment | Fixed compute capacity (PTU) guaranteeing a model-specific TPM rate | Purchase more PTU

Exam tip: Quota vs rate limit

The exam distinguishes between these:

  • Quota = your budget ceiling (subscription level). Example: 300K TPM for GPT-4o in East US.
  • Rate limit = how fast one deployment can spend. Example: Deployment "prod-chat" limited to 80K TPM.
  • Multiple deployments share the quota. If quota is 300K and you have 3 deployments, their combined rate limits can't exceed 300K.
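
The shared-quota rule can be sketched as a small check. This is a minimal illustration, not an Azure API: the function name `remaining_quota` and the deployments other than "prod-chat" are hypothetical, and the figures reuse the 300K TPM quota from the example.

```python
def remaining_quota(quota_tpm: int, deployment_limits: dict[str, int]) -> int:
    """Return the unallocated TPM left in a subscription quota.

    Raises ValueError if the deployments' combined rate limits exceed the quota.
    """
    allocated = sum(deployment_limits.values())
    if allocated > quota_tpm:
        raise ValueError(
            f"Combined rate limits ({allocated} TPM) exceed quota ({quota_tpm} TPM)"
        )
    return quota_tpm - allocated


# The 300K quota and "prod-chat" come from the exam tip;
# the other two deployments are made up for illustration.
limits = {"prod-chat": 80_000, "batch-summaries": 120_000, "dev-test": 50_000}
print(remaining_quota(300_000, limits))  # 50000
```

Here 50K TPM is still unallocated, so any one deployment's rate limit could be raised by up to that amount without requesting a quota increase.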

Cost management strategies

Strategy | How It Saves Money | Best For
Right-size the model | Use SLMs for simple tasks instead of LLMs | High-volume, low-complexity workloads
Prompt caching | Reuse cached prefills for repeated system prompts | Apps with long, stable system prompts
Batch processing | Process requests in bulk at lower priority | Non-real-time workloads (report generation, analysis)
Token budgeting | Set max_tokens to prevent runaway responses | All deployments
Model Router | Auto-route to cheapest capable model | Variable complexity workloads
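
To see why token budgeting matters, it can help to compute the worst-case bill for a given max_tokens cap. The sketch below is only an illustration: the function name, the request volume, and the per-million-token prices are placeholder assumptions, not real Azure rates.

```python
def worst_case_monthly_cost(
    requests: int,
    input_tokens: int,
    max_tokens: int,
    input_price_per_1m: float,
    output_price_per_1m: float,
) -> float:
    """Upper bound on monthly spend, assuming every response uses the full max_tokens cap."""
    total_input = requests * input_tokens
    total_output = requests * max_tokens  # worst case: each response hits the cap
    return (
        (total_input / 1_000_000) * input_price_per_1m
        + (total_output / 1_000_000) * output_price_per_1m
    )


# Placeholder prices (dollars per 1M tokens); substitute your model's actual rates.
for cap in (2_000, 500):
    cost = worst_case_monthly_cost(100_000, 1_000, cap, 2.0, 8.0)
    print(cap, cost)  # 2000 -> 1800.0, 500 -> 600.0
```

Under these assumed prices, lowering the cap from 2,000 to 500 tokens cuts the worst-case bill to a third, because output tokens dominate the total.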

Monitoring model performance

Beyond cost, you need to monitor whether your models are performing well:

Metric | What to Watch | Red Flag
Groundedness | Are responses based on retrieved data? | Responses contain information not in the source documents
Relevance | Do responses answer the actual question? | Users rephrase and retry frequently
Safety events | Are safety filters triggering? | Spike in blocked requests or user complaints
Drift | Has model behaviour changed over time? | Quality scores declining without code changes
Latency | Response time per request | P95 latency exceeding SLA thresholds
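
Two of these red flags are easy to check in code. The following is a rough sketch under stated assumptions: the function names, the nearest-rank percentile method, the 1-5 quality scale, and the tolerance value are all illustrative choices, not part of any Azure monitoring API.

```python
from statistics import mean


def p95(latencies_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of a list of latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def quality_drifted(baseline: list[float], recent: list[float], tolerance: float) -> bool:
    """True if the mean recent score fell more than `tolerance` below the baseline mean."""
    return mean(baseline) - mean(recent) > tolerance


latencies = list(range(1, 101))      # 1 ms .. 100 ms, synthetic samples
print(p95(latencies) > 90)           # True: P95 is 95 ms, above a 90 ms SLA

baseline_scores = [4.5, 4.4, 4.6]    # hypothetical evaluation scores on a 1-5 scale
recent_scores = [3.9, 4.0, 3.8]
print(quality_drifted(baseline_scores, recent_scores, tolerance=0.3))  # True
```

In practice the scores would come from a recurring evaluation run over a fixed test set, so that a drop reflects the model rather than a change in the questions.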

Real-world example: Atlas Financial's cost controls

Atlas Financial processes 100,000 compliance reviews monthly. Their cost strategy:

  • Provisioned throughput for the compliance agent (predictable cost, guaranteed capacity)
  • Serverless for the internal FAQ chatbot (low, variable usage)
  • Phi-4-mini for email classification (50,000 emails/day; the SLM saves 80% vs GPT-4o)
  • Batch API for monthly regulatory report generation (not time-sensitive)
  • Token budget of 2,000 tokens max on all customer-facing responses

Result: 60% cost reduction compared to running everything on GPT-4o serverless.

Key terms

Question

What is TPM (Tokens Per Minute)?

Answer

The rate limit unit for AI model deployments. It caps how many tokens (input + output) a deployment can process per minute. Both subscription quotas and deployment rate limits are measured in TPM.

Question

What is model drift?

Answer

When a model's behaviour changes over time without any code changes on your side. It can happen due to model updates, data distribution shifts, or changes in user query patterns, and is detected through ongoing evaluation metrics.

Question

What is prompt caching?

Answer

A cost-saving feature where repeated system prompts are cached and reused, reducing token costs. Most effective when your application uses long, stable system prompts that don't change between requests.

Question

What is groundedness in AI evaluation?

Answer

A metric measuring whether the model's response is based on the retrieved source data. Low groundedness means the model is generating information not supported by the provided context, a form of hallucination.

Knowledge check

NeuralMed's patient chatbot is hitting rate limit errors during peak hours (9-11 AM) but usage is low overnight. Their subscription quota has available capacity. What should they do?

MediaForge notices their content generation agent's quality scores have declined over the past 2 weeks, but no code changes were deployed. What is the most likely cause?