Cosmos DB Performance: Indexing, RUs, Consistency

RUs — the only unit Cosmos cares about

Simple explanation

Cosmos DB doesn’t bill in CPU, IO, or rows. It bills in Request Units (RUs). Every read, every write, every query has an RU cost. You pay for a budget of RUs per second; if your workload exceeds it, requests get throttled (HTTP 429).

Three rules of thumb worth memorising:

A point read on a 1 KB document costs ~1 RU
A write of a 1 KB document costs ~5–10 RUs (more with indexing)
A query's cost grows with the number and size of documents it touches, so a complex query over many docs can cost far more than a point read
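
To make the rules of thumb concrete, here is a minimal sketch of a back-of-the-envelope RU/s estimate. The function name and the 7.5 RU write cost (the midpoint of the range above) are illustrative assumptions, not measured values; real costs depend on document size and indexing.

```python
# Rough per-operation costs taken from the rules of thumb (assumptions, not measurements).
POINT_READ_RU = 1.0   # ~1 RU per 1 KB point read
WRITE_RU = 7.5        # assumed midpoint of the ~5-10 RU range for a 1 KB write


def estimate_ru_per_second(reads_per_sec, writes_per_sec,
                           read_ru=POINT_READ_RU, write_ru=WRITE_RU):
    """Return a rough RU/s budget for a simple point-read/write mix."""
    return reads_per_sec * read_ru + writes_per_sec * write_ru


budget = estimate_ru_per_second(reads_per_sec=200, writes_per_sec=40)
print(budget)  # 200*1 + 40*7.5 = 500.0
```

If the provisioned RU/s sits below an estimate like this, expect 429s at peak.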

The exam tests two RU-saving levers: smarter indexing policies (don’t index what you never query) and looser consistency levels (don’t pay for guarantees you don’t need).

Indexing policies — pay only for what you query

Cosmos automatically indexes every property by default. That’s convenient, but for write-heavy AI workloads (chat history, telemetry) it’s wasteful — every write pays to index properties you never query.

// Default policy: index everything
{
  "indexingMode": "consistent",
  "includedPaths": [{ "path": "/*" }],
  "excludedPaths": [{ "path": "/\"_etag\"/?" }]
}

For a chat-history container where you only filter by userId and createdAt, a tighter policy slashes write RU cost:

{
  "indexingMode": "consistent",
  "includedPaths": [
    { "path": "/userId/?" },
    { "path": "/createdAt/?" }
  ],
  "excludedPaths": [
    { "path": "/*" }
  ]
}

Translation: “index only userId and createdAt; ignore everything else.” Writes get noticeably cheaper; queries on indexed paths stay fast.
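
A tight policy like the one above can be generated from the list of properties you actually query. This is a minimal sketch; build_tight_indexing_policy is a hypothetical helper, and it assumes every queried property is a top-level scalar.

```python
import json


def build_tight_indexing_policy(queried_properties):
    """Index only the given top-level scalar properties and exclude everything else."""
    return {
        "indexingMode": "consistent",
        # "/name/?" indexes the scalar value stored at that property.
        "includedPaths": [{"path": f"/{name}/?"} for name in queried_properties],
        # The root path must appear somewhere; excluding it drops all other properties.
        "excludedPaths": [{"path": "/*"}],
    }


policy = build_tight_indexing_policy(["userId", "createdAt"])
print(json.dumps(policy, indent=2))
```

With the azure-cosmos Python SDK, a dictionary of this shape can be passed as the indexing_policy argument when creating a container.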

Three indexing strategies. Most production AI containers move from default to tight included paths. Lazy indexing is listed for completeness: Microsoft discourages it, new containers cannot select it without an exemption, and serverless accounts do not support it.
Feature	Default policy	Tight included paths	Lazy indexing
Write RU cost (1 KB doc)	~7-10 RUs	~5 RUs (fewer paths to index)	~5 RUs (index updates deferred)
Query coverage	Every property indexed	Only listed paths	Paths in the policy, but the index can lag behind writes
Risk	Wastes RUs on unused properties	Forgot to add a path → unindexed query, slow	Queries can return inconsistent or incomplete results while the index catches up
Best for	Most apps starting out	Write-heavy production after access patterns are clear	Bulk ingestion with eventual queries

Composite indexes for ORDER BY queries

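A query with a multi-property ORDER BY needs a composite index whose paths and sort orders match; without one, Cosmos DB rejects the query. A composite index also cuts RUs for queries that filter on one property and sort on another, because it stores items pre-sorted: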

{
  "indexingMode": "consistent",
  "includedPaths": [{ "path": "/*" }],
  "excludedPaths": [{ "path": "/\"_etag\"/?" }],
  "compositeIndexes": [
    [
      { "path": "/userId", "order": "ascending" },
      { "path": "/createdAt", "order": "descending" }
    ]
  ]
}

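With this index in place, SELECT * FROM c WHERE c.userId = @u ORDER BY c.userId ASC, c.createdAt DESC can be served from the composite index and stream results in order. Listing the filter property first in ORDER BY is what lets the engine use the composite index; if userId is also the partition key, the query stays within a single partition at minimal RU.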
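
A composite index serves a multi-property ORDER BY only when the clause lists the same paths in the same sequence, with every sort direction either identical to the index or inverted on all paths. The check below is a minimal sketch of that rule; order_by_uses_composite is a hypothetical helper, not an SDK function.

```python
def order_by_uses_composite(composite, order_by):
    """Return True when order_by can be served by the composite index.

    Both arguments are lists of (path, direction) pairs, where direction
    is "ascending" or "descending".
    """
    if len(composite) != len(order_by):
        return False  # the clause must cover exactly the indexed paths
    if [path for path, _ in composite] != [path for path, _ in order_by]:
        return False  # same paths, same sequence
    pairs = list(zip(composite, order_by))
    same = all(c_dir == o_dir for (_, c_dir), (_, o_dir) in pairs)
    inverted = all(c_dir != o_dir for (_, c_dir), (_, o_dir) in pairs)
    return same or inverted


index = [("/userId", "ascending"), ("/createdAt", "descending")]
print(order_by_uses_composite(index, [("/userId", "ascending"), ("/createdAt", "descending")]))
print(order_by_uses_composite(index, [("/userId", "ascending"), ("/createdAt", "ascending")]))
```

The first call matches the index exactly; the second mixes directions, so it would need a second composite index.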

Consistency levels — five options, one trade-off

Strong and bounded staleness cost ~2× the read RU of session, consistent prefix, and eventual — those three weaker levels cost the same baseline.
Feature	Strong	Bounded staleness	Session (default)	Consistent prefix	Eventual
Read sees the latest write	Always	Within K versions or T seconds	Within your own session	Reads never see out-of-order writes; latest may lag	Eventually
Multi-region read latency	Highest (synchronous replication)	High	Low (read your own writes locally)	Low	Lowest
Read RU cost	~2× a session read	~2× a session read	~baseline	~baseline (same as session)	~baseline (same as session)
Best for	Financial transfers, locks, single-writer correctness	Auditable freshness windows	Most apps — read your own writes, otherwise relaxed	Replays of ordered events	Counts, recommendations, telemetry
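
The read-cost row of the table can be turned into a quick estimate. This is a minimal sketch using the approximate 2× multiplier from the table; the dictionary and function names are illustrative assumptions.

```python
# Approximate read-cost multipliers relative to a session read (from the table above).
READ_RU_MULTIPLIER = {
    "Strong": 2.0,
    "BoundedStaleness": 2.0,
    "Session": 1.0,
    "ConsistentPrefix": 1.0,
    "Eventual": 1.0,
}


def estimate_read_ru(baseline_ru, level):
    """Scale a baseline (session-level) read cost by the consistency multiplier."""
    return baseline_ru * READ_RU_MULTIPLIER[level]


# Example: 1,000 point reads per second at ~1 RU each.
for level in READ_RU_MULTIPLIER:
    print(level, estimate_read_ru(1.0, level) * 1000)
```

At this volume, Strong and Bounded Staleness need roughly twice the read throughput of the three weaker levels.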

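from azure.cosmos import ConsistencyLevel, CosmosClient

# url and credential are your account endpoint and key (or token credential).
# A client can only relax the account's default consistency, never strengthen it.
client = CosmosClient(
    url, credential,
    consistency_level=ConsistencyLevel.Eventual,   # downgrade for cheap reads
)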

# Or per-request override (when SDK supports it):
container.read_item(
    item="msg-789",
    partition_key="user-42",
    request_options={"consistencyLevel": "Eventual"},
)

Exam tip: 'My agent loses chat history that just happened'

When the symptom is “I just wrote a message and the next read doesn’t see it”, the consistency level is too weak. Session consistency (the default) is usually correct: it guarantees a client reads its own writes. Falling back to Eventual saves RUs but creates exactly this bug.

When the symptom is “writes are slow / latency is high in our other region”, consistency is too strong. Strong consistency requires synchronous replication across regions; downgrade to Bounded Staleness or Session if your workload tolerates short staleness windows.

Throttling and the 429 dance

When provisioned RUs are exhausted, Cosmos returns 429 Too Many Requests with a header telling you how long to wait:

HTTP/1.1 429 Too Many Requests
x-ms-retry-after-ms: 187
x-ms-request-charge: 8.32

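The SDKs retry throttled requests automatically, waiting for the interval the service returns in x-ms-retry-after-ms (by default up to 9 attempts and 30 seconds of cumulative wait). You can configure the retry policy: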

client = CosmosClient(
    url, credential,
    retry_total=9,
    retry_backoff_max=30,
    request_timeout=10,
)
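
For intuition about what the SDK does on a 429, here is a minimal sketch of a retry loop that honours x-ms-retry-after-ms. It is not the SDK's implementation: call_with_throttle_retry and the send callable (returning a status code, headers, and body) are hypothetical, and the limits mirror the values configured above.

```python
import time


def call_with_throttle_retry(send, max_attempts=9, max_total_wait_s=30.0,
                             sleep=time.sleep):
    """Call send() until it stops returning 429, waiting as the service asks.

    send is a hypothetical callable returning (status_code, headers, body).
    """
    waited = 0.0
    for attempt in range(max_attempts + 1):
        status, headers, body = send()
        if status != 429:
            return status, body
        # The service says how long to back off, in milliseconds.
        delay = float(headers.get("x-ms-retry-after-ms", 1000)) / 1000.0
        if attempt == max_attempts or waited + delay > max_total_wait_s:
            break
        sleep(delay)
        waited += delay
    raise RuntimeError("still throttled after retries")


# Demo with a fake sender: throttled once, then successful. Delays are recorded, not slept.
fake_responses = iter([(429, {"x-ms-retry-after-ms": "187"}, None), (200, {}, "ok")])
recorded_delays = []
print(call_with_throttle_retry(lambda: next(fake_responses), sleep=recorded_delays.append))
print(recorded_delays)
```

Injecting sleep keeps the sketch testable; real code would rely on the SDK's built-in retry policy rather than reimplementing it.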

For sustained 429s, the fix is one of:

Increase RU/s on the container
Move to autoscale (10–100% range)
Tighten indexing to reduce per-write cost
Spread load across more partitions (better partition-key choice)

Provisioned vs autoscale vs serverless

Mode	Bill	Best for
Provisioned (manual)	Per RU/s reserved	Steady, predictable load
Autoscale	Per hour for the highest RU/s reached, with a floor of 10% of the configured max	Spiky workloads (AI inference traffic, user activity)
Serverless	Per RU consumed	Dev/test, very low traffic, sporadic AI demos

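Autoscale is not always cheaper than manual throughput: its rate per RU/s is about 1.5× the manual rate. The win comes during sustained idle periods, when you are billed for 10% of the max instead of 100%. As a rough break-even, autoscale is cheaper only when the billed throughput averages below about two-thirds of the max.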
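
The autoscale trade-off can be checked with a little arithmetic. This is a minimal sketch assuming the 1.5× rate and 10% floor quoted above; the names are illustrative, and actual pricing varies by region and account configuration.

```python
AUTOSCALE_RATE_MULTIPLIER = 1.5  # autoscale rate per RU/s relative to manual (assumed)
AUTOSCALE_FLOOR = 0.10           # billed minimum as a fraction of the configured max


def autoscale_cost_ratio(peak_utilisation):
    """Cost of one autoscale hour relative to manual throughput provisioned at the same max.

    peak_utilisation is the highest RU/s reached in the hour, as a fraction of max.
    """
    billed_fraction = max(AUTOSCALE_FLOOR, min(1.0, peak_utilisation))
    return AUTOSCALE_RATE_MULTIPLIER * billed_fraction


for utilisation in (0.0, 0.5, 1.0):
    print(utilisation, round(autoscale_cost_ratio(utilisation), 2))

# Utilisation at which both modes cost the same.
break_even = 1.0 / AUTOSCALE_RATE_MULTIPLIER
print(round(break_even, 2))
```

A ratio below 1 means autoscale is the cheaper choice for that hour; above 1, manual throughput at the same max would have cost less.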