Cosmos DB for Vectors: Embeddings + Similarity Search

What “vector search in Cosmos” actually means

Simple explanation

An embedding is a list of numbers that represents the meaning of a piece of text (or image). Two pieces of text with similar meanings have embeddings that are close to each other in mathematical space.
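To make "close to each other in mathematical space" concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are made up for illustration; real embedding models emit hundreds or thousands of dimensions, and the variable names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy "embeddings" (illustrative values only, not produced by any model).
insulin_storage = [0.9, 0.1, 0.0]
insulin_fridge = [0.8, 0.2, 0.1]
parking_policy = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(insulin_storage, insulin_fridge)
sim_unrelated = cosine_similarity(insulin_storage, parking_policy)
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

The two insulin vectors point in nearly the same direction and score close to 1, while the unrelated vector scores close to 0.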

Cosmos DB lets you store those embeddings inside the same documents as the source data — your chat message, your support ticket, your knowledge-base article — and search by similarity using a SQL function called VectorDistance. It returns the closest documents to a given query vector.

That’s the foundation of semantic retrieval and RAG (Retrieval-Augmented Generation): store knowledge as embeddings, find the relevant ones at query time, ground the LLM in those results.

The data model

{
  "id": "kb-article-42",
  "tenantId": "tidewater",
  "title": "Insulin storage policy v3",
  "body": "Refrigerate at 2-8°C. Discard after 28 days at room temperature...",
  "embedding": [0.012, -0.034, 0.567, ..., 0.089],  // 1536 floats (text-embedding-3-small)
  "language": "en",
  "updatedAt": "2026-04-15T09:30:00Z"
}

The vector lives on /embedding. It’s just a JSON array of floats; Cosmos doesn’t care how you generated it (Azure OpenAI, a self-hosted model, anything).
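Because the vector is an ordinary JSON array, writing a document is just building a dict. The sketch below uses a hypothetical helper, make_kb_document, that refuses vectors whose length does not match the expected dimension count; the all-zero vector is a stand-in for real model output.

```python
def make_kb_document(doc_id, tenant_id, title, body, embedding, expected_dims=1536):
    """Build a Cosmos-ready item, rejecting vectors of the wrong length."""
    if len(embedding) != expected_dims:
        raise ValueError(
            f"expected {expected_dims} dimensions, got {len(embedding)}"
        )
    return {
        "id": doc_id,
        "tenantId": tenant_id,
        "title": title,
        "body": body,
        # Coerce to plain floats so the item serialises as a JSON array of numbers.
        "embedding": [float(x) for x in embedding],
    }


doc = make_kb_document(
    "kb-article-42",
    "tidewater",
    "Insulin storage policy v3",
    "Refrigerate at 2-8°C.",
    [0.0] * 1536,  # placeholder vector; a real one comes from your embedding model
)
# With a real azure-cosmos ContainerProxy you would then call:
#     container.upsert_item(doc)
```

Checking the length on the client side surfaces a mismatched embedding model early, before the write reaches the service.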

Container configuration: two policies

Vector embedding policy

{
  "vectorEmbeddings": [
    {
      "path": "/embedding",
      "dataType": "float32",
      "distanceFunction": "cosine",
      "dimensions": 1536
    }
  ]
}

This declares: the embedding lives at /embedding, is a 1536-dimensional float32 vector, and we measure similarity with cosine distance.

Setting	Common choices	When to pick which
`dataType`	`float32`, `float16`, `int8`, `uint8`	float32 — default; lower precisions trade accuracy for storage
`distanceFunction`	`cosine`, `dotproduct`, `euclidean`	cosine — most LLM embeddings (text-embedding-ada-002, text-embedding-3-*)
`dimensions`	1536 (Azure OpenAI text-embedding-3-small), 3072 (text-embedding-3-large), 768 (e.g. sentence-transformers all-mpnet-base-v2)	Match your embedding model exactly

Vector indexing policy

{
  "indexingMode": "consistent",
  "includedPaths": [{ "path": "/*" }],
  "excludedPaths": [{ "path": "/\"_etag\"/?" }, { "path": "/embedding/*" }],
  "vectorIndexes": [
    {
      "path": "/embedding",
      "type": "diskANN"
    }
  ]
}

Two important things:

excludedPaths lists /embedding/* so the regular index doesn’t waste effort indexing each float
vectorIndexes is a separate array — that’s where the vector index goes
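Both policies are supplied when the container is created. The following is a sketch, assuming a recent azure-cosmos release whose create_container accepts a vector_embedding_policy keyword; the helper names and the default values are hypothetical, and the database argument is a DatabaseProxy you already hold.

```python
def build_vector_policies(path="/embedding", dimensions=1536, index_type="diskANN"):
    """Return (vector_embedding_policy, indexing_policy) as plain dicts."""
    vector_embedding_policy = {
        "vectorEmbeddings": [
            {
                "path": path,
                "dataType": "float32",
                "distanceFunction": "cosine",
                "dimensions": dimensions,
            }
        ]
    }
    indexing_policy = {
        "indexingMode": "consistent",
        "includedPaths": [{"path": "/*"}],
        # Keep the regular index from indexing every float in the vector.
        "excludedPaths": [{"path": '/"_etag"/?'}, {"path": f"{path}/*"}],
        "vectorIndexes": [{"path": path, "type": index_type}],
    }
    return vector_embedding_policy, indexing_policy


def create_vector_container(database, container_id, partition_key_path="/tenantId"):
    """Create a container with both policies (requires `pip install azure-cosmos`)."""
    from azure.cosmos import PartitionKey  # imported lazily so the dict helper runs without the SDK

    vector_embedding_policy, indexing_policy = build_vector_policies()
    return database.create_container(
        id=container_id,
        partition_key=PartitionKey(path=partition_key_path),
        indexing_policy=indexing_policy,
        vector_embedding_policy=vector_embedding_policy,
    )


embedding_policy, index_policy = build_vector_policies()
```

Keeping the path in one place avoids the easy mistake of declaring the vector at one path and indexing another.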

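Pick the simplest index type that fits your vector count: flat and quantizedFlat scan every candidate, while diskANN scales to millions of vectors at the cost of approximate results. Note that flat indexes only support vectors of up to 505 dimensions, so the 1536-dimension example here needs quantizedFlat or diskANN.
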
Feature	flat	quantizedFlat	diskANN
Algorithm	Brute-force scan, exact	Compressed brute-force, near-exact	Approximate nearest neighbour (ANN)
Best for sizes	Up to ~10k vectors per partition	10k-100k vectors per partition	100k-millions of vectors per partition
Recall	100% (exact)	Very high (~99%)	High but tunable (~95-99%)
RU cost	Highest	Medium	Lowest at scale
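The size bands in the table can be turned into a small rule-of-thumb helper. This is only a sketch: the thresholds are the approximate ones from the table, the function name is hypothetical, and the 505-dimension cap on flat indexes reflects my understanding of the current service limit, so verify it against the Azure documentation before relying on it.

```python
def pick_vector_index(vectors_per_partition, dimensions, flat_max_dims=505):
    """Suggest an index type from rough size bands (heuristic, not a guarantee)."""
    if vectors_per_partition <= 10_000 and dimensions <= flat_max_dims:
        return "flat"            # exact brute-force scan
    if vectors_per_partition <= 100_000:
        return "quantizedFlat"   # compressed brute-force scan
    return "diskANN"             # approximate nearest neighbour


print(pick_vector_index(5_000, 384))
print(pick_vector_index(30_000, 3072))
print(pick_vector_index(2_000_000, 1536))
```

Treat the result as a starting point and measure recall and RU charge on your own data before committing to an index type.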

The query — VectorDistance

-- Top-5 most similar articles to a query embedding, scoped to a tenant
SELECT TOP 5
  c.id, c.title,
  VectorDistance(c.embedding, @queryVec) AS score
FROM c
WHERE c.tenantId = 'tidewater'
ORDER BY VectorDistance(c.embedding, @queryVec)

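# Python: generate the query embedding, then run the search
import os

from azure.cosmos import CosmosClient
from openai import AzureOpenAI

# The environment variable names and the database/container names below are
# placeholders (assumptions); substitute your own.
oa = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version=os.environ["AZURE_OPENAI_API_VERSION"],
)
cosmos = CosmosClient(os.environ["COSMOS_ENDPOINT"], credential=os.environ["COSMOS_KEY"])
container = cosmos.get_database_client("kb").get_container_client("articles")

query_text = "How long can insulin sit out at room temperature?"
query_vec = oa.embeddings.create(
    model="text-embedding-3-small",  # on Azure OpenAI this is your deployment name
    input=query_text,
).data[0].embedding

results = container.query_items(
    query="""
        SELECT TOP @k c.id, c.title,
            VectorDistance(c.embedding, @vec) AS score
        FROM c
        WHERE c.tenantId = @tid
        ORDER BY VectorDistance(c.embedding, @vec)
    """,
    parameters=[
        {"name": "@k", "value": 5},
        {"name": "@vec", "value": query_vec},
        {"name": "@tid", "value": "tidewater"},
    ],
    partition_key="tidewater",  # keeps the search inside one logical partition
)

# query_items returns a lazy iterable; nothing is fetched until you iterate.
for item in results:
    print(item["id"], item["title"], item["score"])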

Key things:

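VectorDistance(field, vector) returns a score computed with the container's distanceFunction. For cosine (used here) and dotproduct a higher score means more similar; for euclidean a smaller value means closer
ORDER BY VectorDistance(...) is what engages the vector index, and it returns the most similar documents first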
Always include the partition-key filter when possible (turns it into a single-partition vector search)

Hybrid search — vectors plus metadata filters

Where Cosmos shines: filter by metadata BEFORE running vector similarity.

SELECT TOP 10 c.id, c.title, VectorDistance(c.embedding, @vec) AS score
FROM c
WHERE c.tenantId = @tid
  AND c.language = 'en'
  AND c.updatedAt > '2026-01-01'
  AND c.docType IN ('policy', 'guideline')
ORDER BY VectorDistance(c.embedding, @vec)

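The metadata predicates are evaluated against the regular index and narrow the candidate set; the vector ranking then applies only to documents that pass the filters. (docType is an extra metadata field, not shown in the sample document earlier.) Doing both in one query avoids the over-fetching and second round trip of a separate “vector search → metadata filter” pipeline.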
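When the filters vary per request, the query text can be assembled from optional predicates while every value stays parameterised. The helper below is a sketch with a hypothetical name, build_hybrid_query; it uses ARRAY_CONTAINS with an array parameter in place of a literal IN list, and inlines TOP as a validated integer.

```python
def build_hybrid_query(tenant_id, query_vec, top_k=10,
                       language=None, updated_after=None, doc_types=None):
    """Return (query_text, parameters) for a filtered vector search."""
    top_k = int(top_k)
    if top_k < 1:
        raise ValueError("top_k must be a positive integer")

    filters = ["c.tenantId = @tid"]
    params = [
        {"name": "@tid", "value": tenant_id},
        {"name": "@vec", "value": query_vec},
    ]
    if language is not None:
        filters.append("c.language = @lang")
        params.append({"name": "@lang", "value": language})
    if updated_after is not None:
        filters.append("c.updatedAt > @since")
        params.append({"name": "@since", "value": updated_after})
    if doc_types:
        # Parameterised equivalent of: c.docType IN ('policy', 'guideline')
        filters.append("ARRAY_CONTAINS(@docTypes, c.docType)")
        params.append({"name": "@docTypes", "value": list(doc_types)})

    where_clause = " AND ".join(filters)
    query = (
        f"SELECT TOP {top_k} c.id, c.title, "
        "VectorDistance(c.embedding, @vec) AS score "
        f"FROM c WHERE {where_clause} "
        "ORDER BY VectorDistance(c.embedding, @vec)"
    )
    return query, params


query, params = build_hybrid_query(
    "tidewater", [0.1, 0.2], top_k=10, language="en",
    updated_after="2026-01-01", doc_types=["policy", "guideline"],
)
print(query)
```

The returned pair can be passed to container.query_items(query=query, parameters=params, partition_key=tenant_id).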

Real-world example: Theo's clinical RAG

Tidewater Health stores 30,000 clinical guidelines per tenant. Each is embedded with text-embedding-3-large (3072 dim) and stored on /embedding. The container is partitioned by /tenantId.

Theo’s RAG query at clinician question time:

Embed the clinician’s question (Azure OpenAI)
Single Cosmos query that filters by tenant, language, and recency, then ranks by vector distance
Top 5 results pass to the LLM as RAG context

All in one SDK round-trip, all in one billing source, all with one identity. Without hybrid search Theo would need a separate vector store and a join — twice the operational surface, twice the secrets, twice the latency.
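The three steps above can be sketched as one function that takes the embedding call, the Cosmos query, and the LLM call as plain callables. All names here are hypothetical, and the fake_* functions are toy stand-ins so the sketch runs without any cloud services; in practice search would wrap container.query_items and would need to project c.body as well as the id and title.

```python
def answer_with_rag(question, embed, search, generate, top_k=5):
    """Embed the question, retrieve top_k documents, and ground the LLM in them."""
    query_vec = embed(question)          # step 1: embed the question
    hits = search(query_vec, top_k)      # step 2: filtered vector query
    context = "\n\n".join(
        f"[{hit['id']}] {hit['title']}\n{hit['body']}" for hit in hits
    )
    prompt = (
        "Answer using only the sources below and cite their ids.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt), [hit["id"] for hit in hits]  # step 3: grounded answer


# Toy stand-ins (not real services).
def fake_embed(text):
    return [float(len(text)), 0.0]


def fake_search(query_vec, top_k):
    corpus = [
        {"id": "kb-article-42", "title": "Insulin storage policy v3",
         "body": "Refrigerate at 2-8°C."},
    ]
    return corpus[:top_k]


def fake_generate(prompt):
    return f"(model reply to a {len(prompt)}-character prompt)"


answer, source_ids = answer_with_rag(
    "How long can insulin sit out?", fake_embed, fake_search, fake_generate
)
print(source_ids)
```

Returning the source ids alongside the answer makes it straightforward to show clinicians which guidelines the response was grounded in.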

Storage and RU economics for vectors

Vector storage and indexing aren’t free:

A 1536-dim float32 vector is about 6 KB of raw data (1536 × 4 bytes). 100,000 documents is roughly 600 MB in vectors alone, before JSON and index overhead.
Quantising to int8 or float16 cuts storage 4× or 2× respectively, with small recall loss.
diskANN indexes have an upfront build cost; insert RU cost rises while the index ingests.
The first vector query after a long idle period may be slow (the index is paged in from storage).
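The storage figures above are simple arithmetic, sketched here for the data types named in the embedding policy. The numbers are raw component sizes only; JSON encoding and index overhead are not included, and the helper name is hypothetical.

```python
BYTES_PER_COMPONENT = {"float32": 4, "float16": 2, "int8": 1, "uint8": 1}


def raw_vector_bytes(dimensions, data_type="float32", count=1):
    """Raw size of `count` vectors, excluding JSON and index overhead."""
    return dimensions * BYTES_PER_COMPONENT[data_type] * count


print(raw_vector_bytes(1536))                          # 6144 bytes, about 6 KB
print(raw_vector_bytes(1536, "float32", 100_000))      # 614400000 bytes, about 600 MB
print(raw_vector_bytes(1536, "float16", 100_000))      # 307200000 bytes, 2x smaller
print(raw_vector_bytes(1536, "int8", 100_000))         # 153600000 bytes, 4x smaller
```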

For very large vector sets where storage matters, consider:

int8 quantisation
A quantizedFlat index, which searches compressed (quantised) copies of the vectors
Splitting embeddings into a separate container partitioned the same way as source data, so vector queries don’t carry full document weight
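To give a feel for what int8 quantisation trades away, here is a minimal sketch of symmetric scalar quantisation on the client side. This is one simple scheme chosen for illustration, not a description of how Cosmos DB compresses vectors internally; the function names are hypothetical.

```python
def quantize_int8(vec):
    """Map floats to integers in [-127, 127] using one scale per vector."""
    scale = max(abs(x) for x in vec) / 127 or 1.0  # fall back to 1.0 for an all-zero vector
    quantized = [max(-127, min(127, round(x / scale))) for x in vec]
    return quantized, scale


def dequantize(quantized, scale):
    """Approximate reconstruction of the original floats."""
    return [q * scale for q in quantized]


original = [0.012, -0.034, 0.567, 0.089]
quantized, scale = quantize_int8(original)
restored = dequantize(quantized, scale)
max_error = max(abs(a - b) for a, b in zip(original, restored))
print(quantized, f"max error {max_error:.5f}")
```

Each component is off by at most half a quantisation step (scale / 2), which is the source of the small recall loss mentioned above. If you store quantised vectors, the query vector needs the same treatment so both sides are comparable.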