Domain 2: Develop AI solutions by using Azure data management services

Cosmos DB Performance: Indexing, RUs, Consistency

Where Cosmos DB costs really come from: Request Units, indexing policies for AI workloads, and the five consistency levels. Pick the wrong one and AI memory feels broken.

RUs: the only unit Cosmos cares about

Simple explanation

Cosmos DB doesn’t bill in CPU, IO, or rows. It bills in Request Units (RUs). Every read, every write, every query has an RU cost. You pay for a budget of RUs per second; if your workload exceeds it, requests get throttled (HTTP 429).

Three rules of thumb worth memorising:

  • A point read on a 1 KB document costs ~1 RU
  • A write of a 1 KB document costs ~5–10 RUs (more with indexing)
  • A complex query that reads many docs costs RUs roughly in proportion to the docs it touches

The exam tests two RU-saving levers: smarter indexing policies (don’t index what you never query) and looser consistency levels (don’t pay for guarantees you don’t need).
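The rules of thumb above can be turned into a rough throughput estimate. The sketch below is only back-of-envelope arithmetic: the per-operation costs are the approximate figures from this section (the write cost is taken as the midpoint of the 5–10 range), and the workload numbers and the `headroom` parameter are illustrative assumptions, not measurements.

```python
# Rough RU/s estimate from the rules of thumb above.
# Per-operation costs are approximations for ~1 KB documents.
POINT_READ_RU = 1.0   # ~1 RU per 1 KB point read
WRITE_RU = 7.5        # assumed midpoint of the ~5-10 RU range for a 1 KB write


def estimate_ru_per_second(reads_per_sec: float, writes_per_sec: float,
                           headroom: float = 0.2) -> float:
    """Return an RU/s budget covering the given rates plus a safety margin."""
    base = reads_per_sec * POINT_READ_RU + writes_per_sec * WRITE_RU
    return base * (1 + headroom)


# Hypothetical chat workload: 200 point reads/s and 40 writes/s.
budget = estimate_ru_per_second(200, 40)
print(round(budget))  # 200*1 + 40*7.5 = 500, plus 20% headroom = 600
```

Real costs depend on document size, indexing, and consistency, so treat the result as a starting point and check it against the request charge the service reports.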

Indexing policies: pay only for what you query

Cosmos automatically indexes every property by default. That’s convenient, but for write-heavy AI workloads (chat history, telemetry) it’s wasteful: every write pays to index properties you never query.

// Default policy: index everything
{
  "indexingMode": "consistent",
  "includedPaths": [{ "path": "/*" }],
  "excludedPaths": [{ "path": "/_etag/?" }]
}

For a chat-history container where you only filter by userId and createdAt, a tighter policy slashes write RU cost:

{
  "indexingMode": "consistent",
  "includedPaths": [
    { "path": "/userId/?" },
    { "path": "/createdAt/?" }
  ],
  "excludedPaths": [
    { "path": "/*" }
  ]
}

Translation: “index only userId and createdAt; ignore everything else.” Writes get noticeably cheaper; queries on indexed paths stay fast.
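If the policy is built in application code rather than pasted as JSON, a small helper keeps the included paths in one place. This is a minimal sketch: `tight_indexing_policy` is a hypothetical helper name, and it only handles top-level scalar properties such as `userId` and `createdAt`.

```python
import json


def tight_indexing_policy(queried_properties):
    """Build a policy that indexes only the given top-level properties."""
    return {
        "indexingMode": "consistent",
        "includedPaths": [{"path": f"/{name}/?"} for name in queried_properties],
        # Excluding the root path turns off indexing for everything not listed.
        "excludedPaths": [{"path": "/*"}],
    }


policy = tight_indexing_policy(["userId", "createdAt"])
print(json.dumps(policy, indent=2))

# With the azure-cosmos package, a dict like this is typically passed as the
# indexing_policy argument when creating the container.
```

The printed JSON matches the tight policy shown above, so the two can be compared directly.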

Three indexing strategies. Most production AI containers move from default to tight included paths.
| Feature | Default policy | Tight included paths | Lazy indexing |
| --- | --- | --- | --- |
| Write RU cost (1 KB doc) | ~7-10 RUs | ~5 RUs (3-5 fewer paths to index) | ~5 RUs (deferred) |
| Query coverage | Every property indexed | Only listed paths | Only listed paths |
| Risk | Wastes RUs on unused properties | Forgot to add a path → unindexed query, slow | Reads can return slightly stale data |
| Best for | Most apps starting out | Write-heavy production after access patterns are clear | Bulk ingestion with eventual queries |

Composite indexes for ORDER BY queries

Without a composite index, queries with multi-property ORDER BY blow up RUs. Composite indexes precompute the sort order:

{
  "indexingMode": "consistent",
  "includedPaths": [{ "path": "/*" }],
  "excludedPaths": [{ "path": "/_etag/?" }],
  "compositeIndexes": [
    [
      { "path": "/userId", "order": "ascending" },
      { "path": "/createdAt", "order": "descending" }
    ]
  ]
}

Now SELECT * FROM c WHERE c.userId = @u ORDER BY c.createdAt DESC can stream results in index order for minimal RUs. It runs as a single-partition query when userId is also the container's partition key.
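A composite index only helps an ORDER BY whose paths and directions line up with it. The sketch below encodes the matching rule as it is commonly documented: same paths in the same sequence, with directions either all identical or all reversed. `order_by_is_served` is a hypothetical helper for reasoning about policies, not part of any SDK.

```python
def order_by_is_served(composite_index, order_by):
    """Check an ORDER BY clause against one composite index definition.

    composite_index: list of {"path": ..., "order": "ascending" | "descending"}
    order_by: list of (path, "ascending" | "descending") tuples
    """
    index_paths = [entry["path"] for entry in composite_index]
    if [path for path, _ in order_by] != index_paths:
        return False  # different paths, different sequence, or different length
    index_orders = [entry["order"] for entry in composite_index]
    query_orders = [order for _, order in order_by]
    flipped = ["descending" if o == "ascending" else "ascending" for o in index_orders]
    # The index can be walked forwards or backwards, but not in a mixed direction.
    return query_orders == index_orders or query_orders == flipped


index = [
    {"path": "/userId", "order": "ascending"},
    {"path": "/createdAt", "order": "descending"},
]
print(order_by_is_served(index, [("/userId", "ascending"), ("/createdAt", "descending")]))  # True
print(order_by_is_served(index, [("/userId", "ascending"), ("/createdAt", "ascending")]))   # False
```

A False result means the query needs its own composite index entry or a changed ORDER BY.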

Consistency levels: five options, one trade-off

Strong and bounded staleness cost ~2× the read RU of session, consistent prefix, and eventual; those three weaker levels cost the same baseline.
| Feature | Strong | Bounded staleness | Session (default) | Consistent prefix | Eventual |
| --- | --- | --- | --- | --- | --- |
| Read sees the latest write | Always | Within K versions or T seconds | Within your own session | Reads never see out-of-order writes; latest may lag | Eventually |
| Multi-region read latency | Highest (synchronous replication) | High | Low (read your own writes locally) | Low | Lowest |
| Read RU cost | ~2× a session read | ~2× a session read | ~baseline | ~baseline (same as session) | ~baseline (same as session) |
| Best for | Financial transfers, locks, single-writer correctness | Auditable freshness windows | Most apps: read your own writes, otherwise relaxed | Replays of ordered events | Counts, recommendations, telemetry |
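
from azure.cosmos import CosmosClient, ConsistencyLevel

# url and credential are your account endpoint and key (or token credential).
# A client can only relax the account's default consistency, never strengthen it.
client = CosmosClient(
    url, credential,
    consistency_level=ConsistencyLevel.Eventual,   # downgrade for cheap reads
)

# Or per-request override (when SDK supports it).
# container is a ContainerProxy obtained from the client above.
container.read_item(
    item="msg-789",
    partition_key="user-42",
    request_options={"consistencyLevel": "Eventual"},
)
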
Exam tip: 'My agent loses chat history that just happened'

When the symptom is “I just wrote a message and the next read doesn’t see it”, the consistency level is too weak. Session consistency (the default) is usually correct: it guarantees a client reads its own writes. Falling back to Eventual saves RUs but creates exactly this bug.

When the symptom is “writes are slow / latency is high in our other region”, consistency is too strong. Strong consistency requires synchronous replication; downgrade to Bounded Staleness or Session if your workload tolerates short windows.
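The read-your-own-writes guarantee behind the first symptom can be illustrated with a toy model. This is not how Cosmos DB replicates data internally; `Replica`, `eventual_read`, and `session_read` are hypothetical names, and the "session token" here is just the sequence number of the client's last write.

```python
class Replica:
    """Holds the data plus the highest write sequence number it has applied."""

    def __init__(self):
        self.lsn = 0
        self.items = {}

    def apply(self, lsn, key, value):
        self.items[key] = value
        self.lsn = lsn


def eventual_read(replica, key):
    # Eventual: take whatever this replica has, even if it is lagging.
    return replica.items.get(key)


def session_read(replicas, key, session_token):
    # Session: only accept a replica that has caught up to the client's last write.
    for replica in replicas:
        if replica.lsn >= session_token:
            return replica.items.get(key)
    raise RuntimeError("no replica has caught up yet; a real client would retry")


fresh, lagging = Replica(), Replica()
fresh.apply(1, "msg-789", "hello")   # the write has reached only one replica so far
session_token = 1                    # the writer remembers its own write's sequence number

print(eventual_read(lagging, "msg-789"))                          # None
print(session_read([lagging, fresh], "msg-789", session_token))   # hello
```

The eventual read lands on the lagging replica and misses the message, which is the "agent loses chat history" bug in miniature; the session read skips that replica because it is behind the client's token.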

Throttling and the 429 dance

When provisioned RUs are exhausted, Cosmos returns 429 Too Many Requests with a header telling you how long to wait:

HTTP/1.1 429 Too Many Requests
x-ms-retry-after-ms: 187
x-ms-request-charge: 8.32

The SDKs implement exponential back-off + retry automatically. You can configure the retry policy:

client = CosmosClient(
    url, credential,
    retry_total=9,
    retry_backoff_max=30,
    request_timeout=10,
)
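What the retry policy does on your behalf can be sketched in a few lines. This is a simplified illustration, not the SDK's actual implementation: it waits for the server-suggested delay rather than computing its own back-off, and `Throttled`, `call_with_retry`, and `fake_read` are hypothetical names standing in for a real 429 response and a real Cosmos call.

```python
import time


class Throttled(Exception):
    """Hypothetical stand-in for a 429 response carrying x-ms-retry-after-ms."""

    def __init__(self, retry_after_ms):
        super().__init__(f"429, retry after {retry_after_ms} ms")
        self.retry_after_ms = retry_after_ms


def call_with_retry(operation, max_retries=9, sleep=time.sleep):
    """Run operation(), waiting for the server-suggested delay after each 429."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Throttled as err:
            if attempt == max_retries:
                raise  # retry budget exhausted: surface the 429 to the caller
            sleep(err.retry_after_ms / 1000)


# Usage with a fake operation that is throttled twice, then succeeds.
responses = [Throttled(187), Throttled(50), "ok"]


def fake_read():
    result = responses.pop(0)
    if isinstance(result, Exception):
        raise result
    return result


waits = []  # record the waits instead of actually sleeping
print(call_with_retry(fake_read, sleep=waits.append))  # ok
print(waits)  # [0.187, 0.05]
```

Passing `sleep` as a parameter keeps the example fast and testable; in real code the default `time.sleep` applies.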

For sustained 429s, the fix is one of:

  • Increase RU/s on the container
  • Move to autoscale (10–100% range)
  • Tighten indexing to reduce per-write cost
  • Spread load across more partitions (better partition-key choice)

Provisioned vs autoscale vs serverless

| Mode | Bill | Best for |
| --- | --- | --- |
| Provisioned (manual) | Per RU/s reserved | Steady, predictable load |
| Autoscale | 10% of max minimum, scales up to max | Spiky workloads (AI inference traffic, user activity) |
| Serverless | Per RU consumed | Dev/test, very low traffic, sporadic AI demos |

Autoscale isn’t always cheaper than manual: its per-RU rate is about 1.5× the manual rate. The win comes during sustained idle periods, where you pay for 10% of max instead of 100%.
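The trade-off is easier to judge with numbers. The sketch below uses the 1.5× multiplier and 10% floor from this section and assumes autoscale bills each hour for the highest RU/s reached in that hour. The unit price `rate` and the sample day are placeholders, not real pricing.

```python
AUTOSCALE_MULTIPLIER = 1.5   # from the text: ~1.5x the manual per-RU rate
AUTOSCALE_FLOOR = 0.10       # from the text: billed for at least 10% of max


def manual_cost(provisioned_rus, hours, rate_per_ru_hour):
    """Manual throughput: pay for the full provisioned RU/s every hour."""
    return provisioned_rus * hours * rate_per_ru_hour


def autoscale_cost(max_rus, hourly_peak_rus, rate_per_ru_hour):
    """Autoscale: each hour bills the peak RU/s reached, never below 10% of max."""
    floor = max_rus * AUTOSCALE_FLOOR
    billed = sum(max(floor, min(peak, max_rus)) for peak in hourly_peak_rus)
    return billed * AUTOSCALE_MULTIPLIER * rate_per_ru_hour


# Hypothetical day: 4 busy hours at 10,000 RU/s and 20 idle hours.
rate = 1.0  # placeholder unit price per RU/s per hour
peaks = [10_000] * 4 + [0] * 20
print(manual_cost(10_000, 24, rate))        # 240000.0
print(autoscale_cost(10_000, peaks, rate))  # 90000.0
```

With the same function, a day that sits at the maximum for all 24 hours costs 1.5× the manual figure, which is why steady load favours manual provisioning.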

Key terms

Question

What is a Cosmos DB Request Unit (RU)?

Answer

The normalised cost abstraction: 1 RU is roughly a point read of a 1 KB doc. Every operation has an RU cost driven by doc size, query complexity, indexing, and consistency. You provision RU/s budgets and pay for them; exceeding them returns HTTP 429.

Question

How can a Cosmos DB indexing policy reduce write RU cost?

Answer

Cosmos indexes every property by default. Excluding properties you never query (`excludedPaths`) skips work on every write, lowering the per-write RU cost. Tighten the policy when access patterns are stable.

Question

What does Session consistency guarantee?

Answer

A client always reads its own writes within the same session, and also gets monotonic reads and writes from that client's view. It's the default in Cosmos because it's strong enough for most apps yet cheaper than Strong consistency. Different sessions may see different orderings.

Question

When should you use Strong consistency in Cosmos DB?

Answer

When correctness requires every read to see the latest committed write across all regions: financial ledgers, distributed locks, leader-election state. The trade-off is higher latency (synchronous cross-region replication) and roughly 2× the read RU cost vs Session.

Question

What is a composite index in Cosmos DB?

Answer

An index over multiple property paths in a specified order. Required for efficient multi-property `ORDER BY` and equality+range queries. Without one, those queries either fail or burn many RUs.

Knowledge check

Mira's chat-history container has default indexing. The queries only ever filter by `c.userId` and order by `c.timestamp`. Writes are 4× too expensive. What's the cleanest fix?

Theo's clinical assistant writes a new message and immediately reads it back to verify storage. Sometimes the read returns 'not found'. The Cosmos account is single-region with default consistency. What's the most likely cause?

Lin's container holds a multi-tenant SaaS app with bursts at the top of every hour and idle the rest of the time. Which throughput mode is most cost-effective?