Cosmos DB Performance: Indexing, RUs, Consistency
Why Cosmos DB costs what it costs: Request Units, indexing policies for AI workloads, and the five consistency levels. Pick the wrong level and AI memory feels broken.
RUs: the only unit Cosmos cares about
Cosmos DB doesn't bill in CPU, IO, or rows. It bills in Request Units (RUs). Every read, every write, every query has an RU cost. You pay for a budget of RUs per second; if your workload exceeds it, requests get throttled (HTTP 429).
Three rules of thumb worth memorising:
- A point read on a 1 KB document costs ~1 RU
- A write of a 1 KB document costs ~5-10 RUs (more with indexing)
- A complex query that reads many documents costs RUs roughly in proportion to the documents it touches
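The rules of thumb above can be turned into a rough sizing estimate. This is a minimal sketch, not a pricing tool: `estimate_ru_per_second` is a hypothetical helper, the per-operation costs are the approximate figures from the list (about 1 RU per 1 KB point read, 5 to 10 RUs per 1 KB write), and the 20% headroom is an assumption. Real charges should be read from the request charge on actual responses.

```python
def estimate_ru_per_second(reads_per_sec, writes_per_sec,
                           read_ru=1.0, write_ru=7.5, headroom=1.2):
    """Rough RU/s budget for a workload of 1 KB point reads and writes.

    read_ru and write_ru are the approximate per-operation costs from the
    rules of thumb; headroom is an assumed safety margin against 429s.
    """
    base = reads_per_sec * read_ru + writes_per_sec * write_ru
    return base * headroom


# Example: 200 point reads/s and 40 writes/s of 1 KB documents.
budget = estimate_ru_per_second(reads_per_sec=200, writes_per_sec=40)
print(f"Provision roughly {budget:.0f} RU/s")
```

Writes dominate the budget even at a fifth of the read rate, which is why the indexing policy matters so much for write-heavy containers.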
The exam tests two RU-saving levers: smarter indexing policies (don't index what you never query) and looser consistency levels (don't pay for guarantees you don't need).
Indexing policies: pay only for what you query
Cosmos automatically indexes every property by default. That's convenient, but for write-heavy AI workloads (chat history, telemetry) it's wasteful: every write pays to index properties you never query.
// Default policy: index everything
{
"indexingMode": "consistent",
"includedPaths": [{ "path": "/*" }],
"excludedPaths": [{ "path": "/_etag/?" }]
}
For a chat-history container where you only filter by userId and createdAt, a tighter policy slashes write RU cost:
{
"indexingMode": "consistent",
"includedPaths": [
{ "path": "/userId/?" },
{ "path": "/createdAt/?" }
],
"excludedPaths": [
{ "path": "/*" }
]
}
Translation: "index only userId and createdAt; ignore everything else." Writes get noticeably cheaper; queries on indexed paths stay fast.
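If you manage several containers, the tight policy can be generated from the list of properties you actually query. The sketch below assumes top-level scalar properties only; `tight_indexing_policy` is a hypothetical helper, and the container name and partition key in the commented call are illustrative.

```python
import json


def tight_indexing_policy(queried_properties):
    """Build a policy that indexes only the given top-level properties."""
    return {
        "indexingMode": "consistent",
        "includedPaths": [{"path": f"/{name}/?"} for name in queried_properties],
        "excludedPaths": [{"path": "/*"}],  # everything else stays unindexed
    }


policy = tight_indexing_policy(["userId", "createdAt"])
print(json.dumps(policy, indent=2))

# With the azure-cosmos SDK the dict can be passed at container creation.
# `database` is assumed to be an existing DatabaseProxy:
# database.create_container_if_not_exists(
#     id="chat-history",
#     partition_key=PartitionKey(path="/userId"),
#     indexing_policy=policy,
# )
```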
| Feature | Default policy | Tight included paths | Lazy indexing |
|---|---|---|---|
| Write RU cost (1 KB doc) | ~7-10 RUs | ~5 RUs (fewer paths to index) | ~5 RUs (indexing deferred) |
| Query coverage | Every property indexed | Only listed paths | Only listed paths |
| Risk | Wastes RUs on unused properties | Forget to add a path and the query runs unindexed and slow | Queries can return incomplete or stale results while the index catches up |
| Best for | Most apps starting out | Write-heavy production after access patterns are clear | Bulk ingestion with eventual queries |
Composite indexes for ORDER BY queries
A query that sorts on more than one property needs a composite index to run at all, and a query that filters on one property and sorts on another costs far fewer RUs with one. Composite indexes precompute the sort order:
{
"indexingMode": "consistent",
"includedPaths": [{ "path": "/*" }],
"excludedPaths": [{ "path": "/_etag/?" }],
"compositeIndexes": [
[
{ "path": "/userId", "order": "ascending" },
{ "path": "/createdAt", "order": "descending" }
]
]
}
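Now SELECT * FROM c WHERE c.userId = @u ORDER BY c.userId ASC, c.createdAt DESC can use the composite index and, assuming /userId is the partition key, runs as a single-partition query that streams results in order, with minimal RU cost. The filter property is repeated in ORDER BY so that the clause matches the index definition.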
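Whether an ORDER BY clause can use a composite index comes down to a matching rule: the clause must list the same paths in the same sequence, with either the same sort orders or all of them inverted. The sketch below is a simplified model of that rule for checking a policy before deploying it. `order_by_uses_composite` is a hypothetical helper, not an SDK call, and it ignores filters.

```python
def order_by_uses_composite(composite_index, order_by):
    """Return True if the ORDER BY clause matches the composite index.

    composite_index: list of {"path": ..., "order": ...} dicts, as in the policy.
    order_by: list of (path, "ascending" | "descending") tuples.
    """
    index_spec = [(entry["path"], entry["order"]) for entry in composite_index]
    # The same index also serves the fully inverted sort order.
    inverted = [
        (path, "descending" if order == "ascending" else "ascending")
        for path, order in index_spec
    ]
    return list(order_by) in (index_spec, inverted)


chat_index = [
    {"path": "/userId", "order": "ascending"},
    {"path": "/createdAt", "order": "descending"},
]

exact = order_by_uses_composite(
    chat_index, [("/userId", "ascending"), ("/createdAt", "descending")])
flipped = order_by_uses_composite(
    chat_index, [("/userId", "descending"), ("/createdAt", "ascending")])
mixed = order_by_uses_composite(
    chat_index, [("/userId", "ascending"), ("/createdAt", "ascending")])
print(exact, flipped, mixed)
```

Only the exact and fully inverted orderings match. Mixing one flipped direction, reordering the paths, or sorting on a single path means defining another composite index.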
Consistency levels: five options, one trade-off
| Feature | Strong | Bounded staleness | Session (default) | Consistent prefix | Eventual |
|---|---|---|---|---|---|
| Read sees the latest write | Always | Within K versions or T seconds | Within your own session | Reads never see out-of-order writes; latest may lag | Eventually |
| Multi-region read latency | Highest (synchronous replication) | High | Low (read your own writes locally) | Low | Lowest |
| Read RU cost | ~2× a session read | ~2× a session read | ~baseline | ~baseline (same as session) | ~baseline (same as session) |
| Best for | Financial transfers, locks, single-writer correctness | Auditable freshness windows | Most apps: read your own writes, otherwise relaxed | Replays of ordered events | Counts, recommendations, telemetry |
from azure.cosmos import CosmosClient, ConsistencyLevel
# url and credential are your account endpoint and key (or token credential)
client = CosmosClient(
url, credential,
consistency_level=ConsistencyLevel.Eventual, # relax below the account default
)
# Or per-request override (when SDK supports it):
container.read_item(
item="msg-789",
partition_key="user-42",
request_options={"consistencyLevel": "Eventual"},
)
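To see what a consistency change does to the read bill, the multipliers from the table above can be applied to a read rate. This is a rough sketch: `read_ru_per_second` is a hypothetical helper, the multipliers are the approximate values from the table, and the dictionary keys are the level names as the SDK spells them.

```python
# Approximate read-cost multipliers relative to a Session read (from the table).
READ_COST_MULTIPLIER = {
    "Strong": 2.0,
    "BoundedStaleness": 2.0,
    "Session": 1.0,
    "ConsistentPrefix": 1.0,
    "Eventual": 1.0,
}


def read_ru_per_second(reads_per_sec, consistency, baseline_ru=1.0):
    """Estimated read RU/s for 1 KB point reads at a given consistency level."""
    return reads_per_sec * baseline_ru * READ_COST_MULTIPLIER[consistency]


for level in ("Strong", "Session", "Eventual"):
    print(level, read_ru_per_second(500, level))
```

Under these figures, moving from Strong to Session halves the read cost, while moving from Session to Eventual changes nothing on the read bill and only weakens the guarantee.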
Exam tip: 'My agent loses chat history that just happened'
When the symptom is "I just wrote a message and the next read doesn't see it", the consistency level is too weak. Session consistency (the default) is usually correct: it guarantees a client reads its own writes. Dropping to Eventual gives up that guarantee and creates exactly this bug, and per the table above it does not lower read RU cost relative to Session.
When the symptom is "writes are slow / latency is high in our other region", consistency is too strong. Strong consistency requires synchronous replication across regions; downgrade to Bounded Staleness or Session if your workload tolerates short staleness windows.
Throttling and the 429 dance
When provisioned RUs are exhausted, Cosmos returns 429 Too Many Requests with a header telling you how long to wait:
HTTP/1.1 429 Too Many Requests
x-ms-retry-after-ms: 187
x-ms-request-charge: 8.32
The SDKs implement exponential back-off + retry automatically. You can configure the retry policy:
client = CosmosClient(
url, credential,
retry_total=9,
retry_backoff_max=30,
request_timeout=10,
)
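For intuition about what the retry policy does, here is a minimal sketch of a loop that honors the retry-after hint from a 429. It does not use the SDK: `ThrottledError` is a hypothetical stand-in for a throttled response carrying `x-ms-retry-after-ms`, and the defaults simply mirror the `retry_total=9` and `retry_backoff_max=30` values shown above.

```python
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a 429 response with x-ms-retry-after-ms."""

    def __init__(self, retry_after_ms):
        super().__init__(f"429 Too Many Requests, retry after {retry_after_ms} ms")
        self.retry_after_ms = retry_after_ms


def call_with_retry(operation, max_attempts=9, max_total_wait_s=30.0,
                    sleep=time.sleep):
    """Run operation(), waiting as instructed by the server after each 429."""
    waited = 0.0
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ThrottledError as err:
            delay = err.retry_after_ms / 1000.0
            # Give up on the last attempt or when the wait budget is spent.
            if attempt == max_attempts or waited + delay > max_total_wait_s:
                raise
            sleep(delay)
            waited += delay


# Usage: a fake write that is throttled twice before succeeding.
attempts = []


def flaky_write():
    attempts.append(1)
    if len(attempts) < 3:
        raise ThrottledError(retry_after_ms=187)
    return "written"


sleeps = []  # record delays instead of actually sleeping
result = call_with_retry(flaky_write, sleep=sleeps.append)
print(result, sleeps)
```

Retries hide short bursts, but every retried request still consumes time and RUs, so sustained 429s need one of the structural fixes rather than a larger retry budget.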
For sustained 429s, the fix is one of:
- Increase RU/s on the container
- Move to autoscale (10-100% range)
- Tighten indexing to reduce per-write cost
- Spread load across more partitions (better partition-key choice)
Provisioned vs autoscale vs serverless
| Mode | Bill | Best for |
|---|---|---|
| Provisioned (manual) | Per RU/s reserved | Steady, predictable load |
| Autoscale | Per RU/s actually scaled to, between a floor of 10% of max and the max | Spiky workloads (AI inference traffic, user activity) |
| Serverless | Per RU consumed | Dev/test, very low traffic, sporadic AI demos |
Autoscale isn't always cheaper than manual: its per-RU rate is about 1.5× that of the equivalent manual provisioned throughput. The win comes during sustained idle periods, where you pay for 10% of max instead of 100%.
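The trade-off can be checked with simple arithmetic. The sketch below compares the two modes in relative units (RU/s-hours at the manual rate), using the 1.5× rate and 10% floor from the text. It also assumes autoscale bills each hour for the highest RU/s reached in that hour, which is not stated above. The function names and traffic profiles are hypothetical.

```python
def manual_cost(provisioned_ru, hours):
    """Manual throughput: the full reservation is billed every hour."""
    return provisioned_ru * hours


def autoscale_cost(max_ru, hourly_peak_ru, rate_multiplier=1.5):
    """Autoscale: each hour bills its peak RU/s, floored at 10% of max."""
    floor = 0.1 * max_ru
    billed = (min(max_ru, max(floor, peak)) for peak in hourly_peak_ru)
    return rate_multiplier * sum(billed)


# A bursty day: 2 hours at full load, 22 hours idle.
bursty_day = [10_000] * 2 + [0] * 22
# A steady day: 90% utilisation around the clock.
steady_day = [9_000] * 24

print("bursty:", autoscale_cost(10_000, bursty_day), "vs", manual_cost(10_000, 24))
print("steady:", autoscale_cost(10_000, steady_day), "vs", manual_cost(10_000, 24))
```

Under these assumptions autoscale wins clearly on the bursty profile and loses on the steady one. The break-even sits where the average billed throughput is about two thirds of the max.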
Knowledge check
Mira's chat-history container has default indexing. The queries only ever filter by `c.userId` and order by `c.timestamp`. Writes are 4× too expensive. What's the cleanest fix?
Theo's clinical assistant writes a new message and immediately reads it back to verify storage. Sometimes the read returns 'not found'. The Cosmos account is single-region with default consistency. What's the most likely cause?
Lin's container holds a multi-tenant SaaS app with bursts at the top of every hour and idle the rest of the time. Which throughput mode is most cost-effective?