Cosmos DB for Vectors: Embeddings + Similarity Search
Storing embeddings next to the documents that produced them, then finding the closest vectors with VectorDistance. Vector index types, dimensions, and the one-line RAG pattern.
What "vector search in Cosmos" actually means
An embedding is a list of numbers that represents the meaning of a piece of text (or image). Two pieces of text with similar meanings have embeddings that are close to each other in mathematical space.
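To make "close to each other" concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors and their names are invented for illustration; real embedding models emit hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (not produced by any real model).
insulin_storage = [0.9, 0.1, 0.0]
insulin_shelf_life = [0.8, 0.2, 0.1]
parking_rules = [0.0, 0.1, 0.9]

close = cosine_similarity(insulin_storage, insulin_shelf_life)
far = cosine_similarity(insulin_storage, parking_rules)
print(close > far)  # the two insulin texts point in nearly the same direction
```

The two related vectors score much higher than the unrelated pair, which is the property a similarity search relies on.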
Cosmos DB lets you store those embeddings inside the same documents as the source data (your chat message, your support ticket, your knowledge-base article) and search by similarity using a SQL function called VectorDistance. Combined with TOP and ORDER BY, it returns the documents closest to a given query vector.
That's the foundation of semantic retrieval and RAG (Retrieval-Augmented Generation): store knowledge as embeddings, find the relevant ones at query time, ground the LLM in those results.
The data model
{
"id": "kb-article-42",
"tenantId": "tidewater",
"title": "Insulin storage policy v3",
"body": "Refrigerate at 2-8°C. Discard after 28 days at room temperature...",
"embedding": [0.012, -0.034, 0.567, ..., 0.089], // 1536 floats (text-embedding-3-small)
"language": "en",
"updatedAt": "2026-04-15T09:30:00Z"
}
The vector lives on /embedding. It's just a JSON array of floats; Cosmos doesn't care how you generated it (Azure OpenAI, a self-hosted model, anything).
Container configuration: two policies
Vector embedding policy
{
"vectorEmbeddings": [
{
"path": "/embedding",
"dataType": "float32",
"distanceFunction": "cosine",
"dimensions": 1536
}
]
}
This declares: the embedding lives at /embedding, is a 1536-dimensional float32 vector, and we measure similarity with cosine distance.
| Setting | Common choices | When to pick which |
|---|---|---|
| dataType | float32, float16, int8, uint8 | float32 is the default; lower precisions trade accuracy for storage |
| distanceFunction | cosine, dotproduct, euclidean | cosine suits most LLM embeddings (text-embedding-ada-002, text-embedding-3-*) |
| dimensions | 1536 (Azure OpenAI text-embedding-3-small), 3072 (text-embedding-3-large), 768 (sentence-transformers) | Match your embedding model exactly |
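Because the dimensions setting must match the embedding model exactly, a client-side check before writing a document can catch mismatches early. The sketch below is one way to do that; `check_embedding` is a hypothetical helper, not part of any Cosmos SDK, and it only handles top-level paths such as /embedding.

```python
VECTOR_EMBEDDING_POLICY = {
    "vectorEmbeddings": [
        {
            "path": "/embedding",
            "dataType": "float32",
            "distanceFunction": "cosine",
            "dimensions": 1536,
        }
    ]
}


def check_embedding(doc, policy=VECTOR_EMBEDDING_POLICY):
    """Return a list of problems; an empty list means the document matches the policy."""
    problems = []
    for spec in policy["vectorEmbeddings"]:
        field = spec["path"].lstrip("/")  # assumption: top-level property only
        vec = doc.get(field)
        if not isinstance(vec, list):
            problems.append(f"{spec['path']}: missing or not a JSON array")
            continue
        if len(vec) != spec["dimensions"]:
            problems.append(
                f"{spec['path']}: expected {spec['dimensions']} dimensions, got {len(vec)}"
            )
        if not all(isinstance(x, (int, float)) and not isinstance(x, bool) for x in vec):
            problems.append(f"{spec['path']}: contains non-numeric values")
    return problems


doc = {"id": "kb-article-42", "tenantId": "tidewater", "embedding": [0.0] * 768}
print(check_embedding(doc))  # reports the 768-vs-1536 mismatch
```

A typical cause of such a mismatch is switching embedding models (for example from a 768-dimension model to a 1536-dimension one) without updating the container policy.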
Vector indexing policy
{
"indexingMode": "consistent",
"includedPaths": [{ "path": "/*" }],
"excludedPaths": [{ "path": "/_etag/?" }, { "path": "/embedding/*" }],
"vectorIndexes": [
{
"path": "/embedding",
"type": "diskANN"
}
]
}
Two important things:
- excludedPaths lists /embedding/* so the regular index doesn't waste effort indexing each float
- vectorIndexes is a separate array; that's where the vector index goes
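The two policies are supplied together when the container is created. The following is a sketch using the azure-cosmos Python SDK, assuming a version recent enough to accept the `vector_embedding_policy` keyword on `create_container_if_not_exists`. The helper names and the container id passed in are hypothetical; the /tenantId partition key mirrors the data model in this article.

```python
def build_policies(path="/embedding", dimensions=1536, index_type="diskANN"):
    """Build the vector embedding policy and the matching indexing policy."""
    vector_embedding_policy = {
        "vectorEmbeddings": [
            {
                "path": path,
                "dataType": "float32",
                "distanceFunction": "cosine",
                "dimensions": dimensions,
            }
        ]
    }
    indexing_policy = {
        "indexingMode": "consistent",
        "includedPaths": [{"path": "/*"}],
        # Keep the raw floats out of the regular index.
        "excludedPaths": [{"path": "/_etag/?"}, {"path": path + "/*"}],
        # The vector index is declared separately and must use the same path.
        "vectorIndexes": [{"path": path, "type": index_type}],
    }
    return vector_embedding_policy, indexing_policy


def create_vector_container(database, container_id, partition_key_path="/tenantId"):
    """Create the container on an azure.cosmos DatabaseProxy (needs the SDK installed)."""
    from azure.cosmos import PartitionKey  # imported lazily; only needed for this call

    vector_embedding_policy, indexing_policy = build_policies()
    return database.create_container_if_not_exists(
        id=container_id,
        partition_key=PartitionKey(path=partition_key_path),
        indexing_policy=indexing_policy,
        vector_embedding_policy=vector_embedding_policy,
    )
```

Generating both policies from one function keeps the path in the embedding policy, the excluded path, and the vector index in sync, which is the easiest thing to get wrong when editing the JSON by hand.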
| Feature | flat | quantizedFlat | diskANN |
|---|---|---|---|
| Algorithm | Brute-force scan, exact | Compressed brute-force, near-exact | Approximate nearest neighbour (ANN) |
| Best for sizes | Up to ~10k vectors per partition | 10k-100k vectors per partition | 100k-millions of vectors per partition |
| Recall | 100% (exact) | Very high (~99%) | High but tunable (~95-99%) |
| RU cost | Highest | Medium | Lowest at scale |
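The size guidance in the table can be written down as a small chooser. This is only a sketch of the table's rough thresholds (10k and 100k vectors per partition); `suggest_index_type` is a hypothetical helper, and real workloads should be measured rather than decided by a cut-off alone.

```python
def suggest_index_type(vectors_per_partition):
    """Map a per-partition vector count to an index type, using the table's rough bands."""
    if vectors_per_partition < 0:
        raise ValueError("vector count cannot be negative")
    if vectors_per_partition <= 10_000:
        return "flat"           # exact brute-force scan
    if vectors_per_partition <= 100_000:
        return "quantizedFlat"  # compressed brute-force, near-exact
    return "diskANN"            # approximate nearest neighbour


for count in (5_000, 30_000, 1_200_000):
    print(count, suggest_index_type(count))
```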
The query: VectorDistance
-- Top-5 most similar articles to a query embedding, scoped to a tenant
SELECT TOP 5
c.id, c.title,
VectorDistance(c.embedding, @queryVec) AS score
FROM c
WHERE c.tenantId = 'tidewater'
ORDER BY VectorDistance(c.embedding, @queryVec)
# Python: generate the query embedding, then run the search
from azure.cosmos import CosmosClient
from openai import AzureOpenAI
oa = AzureOpenAI(...)
# Assumes `container` is an existing azure.cosmos container client for the articles container
query_text = "How long can insulin sit out at room temperature?"
query_vec = oa.embeddings.create(
model="text-embedding-3-small",
input=query_text,
).data[0].embedding
results = container.query_items(
query="""
SELECT TOP @k c.id, c.title,
VectorDistance(c.embedding, @vec) AS score
FROM c
WHERE c.tenantId = @tid
ORDER BY VectorDistance(c.embedding, @vec)
""",
parameters=[
{"name": "@k", "value": 5},
{"name": "@vec", "value": query_vec},
{"name": "@tid", "value": "tidewater"},
],
partition_key="tidewater",
)
Key things:
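- VectorDistance(field, vector) returns a similarity score whose direction depends on the configured distanceFunction: with euclidean, smaller is closer; with cosine and dotproduct, larger is closer
- ORDER BY VectorDistance(...) triggers the vector index and returns the most similar documents first
- Always include the partition-key filter when possible (turns it into a single-partition vector search)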
Hybrid search: vectors plus metadata filters
Where Cosmos shines: filter by metadata BEFORE running vector similarity.
SELECT TOP 10 c.id, c.title, VectorDistance(c.embedding, @vec) AS score
FROM c
WHERE c.tenantId = @tid
AND c.language = 'en'
AND c.updatedAt > '2026-01-01'
AND c.docType IN ('policy', 'guideline')
ORDER BY VectorDistance(c.embedding, @vec)
The metadata predicates run first (against the regular index), narrowing the candidate set. The vector index then ranks the remaining candidates. That's significantly faster than a separate "vector search, then metadata filter" pipeline.
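In application code the metadata filters are usually optional, so the query text is often assembled at run time. Here is a sketch of a builder that keeps every value parameterized; `build_filtered_vector_query` and its parameter names are hypothetical, and the property names (language, updatedAt, docType) follow the query above.

```python
def build_filtered_vector_query(query_vec, tenant_id, k=10, language=None,
                                updated_after=None, doc_types=None):
    """Return (query_text, parameters) for a tenant-scoped, filtered vector search."""
    clauses = ["c.tenantId = @tid"]
    params = [
        {"name": "@k", "value": k},
        {"name": "@vec", "value": query_vec},
        {"name": "@tid", "value": tenant_id},
    ]
    if language is not None:
        clauses.append("c.language = @lang")
        params.append({"name": "@lang", "value": language})
    if updated_after is not None:
        clauses.append("c.updatedAt > @since")
        params.append({"name": "@since", "value": updated_after})
    if doc_types:
        # One parameter per allowed value, e.g. c.docType IN (@dt0, @dt1)
        names = [f"@dt{i}" for i in range(len(doc_types))]
        clauses.append(f"c.docType IN ({', '.join(names)})")
        params.extend({"name": n, "value": v} for n, v in zip(names, doc_types))
    query = (
        "SELECT TOP @k c.id, c.title, VectorDistance(c.embedding, @vec) AS score "
        "FROM c WHERE " + " AND ".join(clauses) + " "
        "ORDER BY VectorDistance(c.embedding, @vec)"
    )
    return query, params


query, params = build_filtered_vector_query(
    [0.1, 0.2], "tidewater", language="en",
    updated_after="2026-01-01", doc_types=["policy", "guideline"],
)
print(query)
```

The returned pair can be passed as the `query` and `parameters` arguments of `container.query_items`, together with the tenant as `partition_key`.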
Real-world example: Theo's clinical RAG
Tidewater Health stores 30,000 clinical guidelines per tenant. Each is embedded with text-embedding-3-large (3072 dim) and stored on /embedding. The container is partitioned by /tenantId.
Theo's RAG query at clinician question time:
- Embed the clinician's question (Azure OpenAI)
- Single Cosmos query that filters by tenant, language, and recency, then ranks by vector distance
- Top 5 results pass to the LLM as RAG context
All in one SDK round-trip, all in one billing source, all with one identity. Without hybrid search Theo would need a separate vector store and a join: twice the operational surface, twice the secrets, twice the latency.
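The three steps can be sketched as a single retrieval function. This is an illustration under stated assumptions, not Theo's actual code: `retrieve_context`, `build_prompt`, the default filter values, and the prompt wording are all hypothetical, and the embedding call and container are injected so the sketch runs with stand-ins instead of Azure credentials.

```python
def retrieve_context(question, embed, container, tenant_id,
                     language="en", updated_after="2026-01-01", k=5):
    """Embed the question, run one filtered vector query, return passages for the prompt."""
    query_vec = embed(question)  # step 1: question -> embedding
    items = container.query_items(  # step 2: filter by metadata, rank by similarity
        query="""
            SELECT TOP @k c.id, c.title, c.body,
                   VectorDistance(c.embedding, @vec) AS score
            FROM c
            WHERE c.tenantId = @tid
              AND c.language = @lang
              AND c.updatedAt > @since
            ORDER BY VectorDistance(c.embedding, @vec)
        """,
        parameters=[
            {"name": "@k", "value": k},
            {"name": "@vec", "value": query_vec},
            {"name": "@tid", "value": tenant_id},
            {"name": "@lang", "value": language},
            {"name": "@since", "value": updated_after},
        ],
        partition_key=tenant_id,
    )
    return [f"{item['title']}\n{item['body']}" for item in items]  # step 3: RAG context


def build_prompt(question, passages):
    """Join the retrieved passages into a grounded prompt (wording is illustrative)."""
    context = "\n\n".join(passages)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"


# Stand-ins so the sketch runs without Azure (both hypothetical).
class FakeContainer:
    def query_items(self, query, parameters, partition_key):
        return [{"id": "kb-article-42", "title": "Insulin storage policy v3",
                 "body": "Refrigerate; discard after 28 days at room temperature."}]


def fake_embed(text):
    return [0.0] * 3072  # same length as text-embedding-3-large output


question = "How long can insulin sit out at room temperature?"
passages = retrieve_context(question, fake_embed, FakeContainer(), "tidewater")
prompt = build_prompt(question, passages)
```

In a real deployment `embed` would wrap the Azure OpenAI embeddings call and `container` would be the Cosmos container client; injecting both keeps the retrieval logic testable.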
Storage and RU economics for vectors
Vector storage and indexing aren't free:
- A 1536-dim float32 vector is 6 KB. 100,000 documents = 600 MB just in vectors.
- Quantising to int8 or float16 cuts storage 4× or 2× respectively, with small recall loss.
- diskANN indexes have an upfront build cost; insert RU cost rises while the index ingests.
- The first vector query after a long idle period may be slow (the index is paged in from storage).
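The storage figures above are simple arithmetic, sketched below. The byte sizes assume the raw numeric width of each data type (4 bytes for float32, 2 for float16, 1 for int8 and uint8) and ignore JSON and index overhead, so treat the results as lower bounds; the helper names are hypothetical.

```python
BYTES_PER_COMPONENT = {"float32": 4, "float16": 2, "int8": 1, "uint8": 1}


def vector_bytes(dimensions, data_type="float32"):
    """Raw size of one vector in bytes (no JSON or index overhead)."""
    return dimensions * BYTES_PER_COMPONENT[data_type]


def total_vector_megabytes(doc_count, dimensions, data_type="float32"):
    """Raw size of all vectors in decimal megabytes."""
    return doc_count * vector_bytes(dimensions, data_type) / 1_000_000


print(vector_bytes(1536))                             # 6144 bytes, i.e. 6 KB
print(total_vector_megabytes(100_000, 1536))          # about 614 MB, roughly the 600 MB quoted
print(total_vector_megabytes(100_000, 1536, "int8"))  # about 154 MB, a 4x reduction
```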
For very large vector sets where storage matters, consider:
- int8 quantisation
- quantizedFlat index for partial precision
- Splitting embeddings into a separate container partitioned the same way as source data, so vector queries don't carry full document weight
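The last option, splitting embeddings into their own container, amounts to writing two documents that share an id and partition key. A minimal sketch follows, assuming the /tenantId partition key and /embedding path used throughout this article; `split_document` is a hypothetical helper and the two target containers are left to the caller.

```python
def split_document(doc, vector_field="embedding", partition_field="tenantId"):
    """Split one document into (source_doc, vector_doc) sharing id and partition key."""
    source_doc = {k: v for k, v in doc.items() if k != vector_field}
    vector_doc = {
        "id": doc["id"],                          # same id, so results map back to the source
        partition_field: doc[partition_field],    # same partition key value in both containers
        vector_field: doc[vector_field],
    }
    return source_doc, vector_doc


article = {
    "id": "kb-article-42",
    "tenantId": "tidewater",
    "title": "Insulin storage policy v3",
    "embedding": [0.012, -0.034, 0.567],  # shortened for illustration
}
source_doc, vector_doc = split_document(article)
```

The trade-off is an extra point read: the vector query returns ids from the vector container, and the full documents are then fetched from the source container by id and partition key.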
Knowledge check
Mira stores 1.2 million product-image embeddings per tenant in Cosmos. Each tenant has its own partition. Top-5 similarity queries are slow at this scale. Which vector index should she configure?
Theo's vector search query is `SELECT TOP 5 c.id, VectorDistance(c.embedding, @vec) FROM c ORDER BY VectorDistance(c.embedding, @vec)`. The container is partitioned by `/tenantId` and there are 12 tenants. He sees high RU cost and high latency. What's the simplest fix?
Lin's container holds 30,000 docs per partition with 1536-dim embeddings. Storage is the limiting factor: the embedding alone is 6 KB per doc, and there's other content too. Which option offers the best storage-to-recall trade-off without sacrificing useful results?