Domain 2 β€” Module 6 of 8 75%
14 of 27 overall
Domain 2: Develop AI solutions by using Azure data management services Free ⏱ ~14 min read

pgvector + RAG: Retrieval-Augmented Apps

How a SQL database does vector search. pgvector data types, HNSW vs ivfflat, hybrid filters, the canonical RAG SQL, and why metadata filters live in WHERE β€” not after.

What pgvector adds

Simple explanation

pgvector is a PostgreSQL extension that adds a vector data type and similarity operators. You store an embedding the same way you’d store any other column. You search with a SQL query that includes a similarity operator β€” Postgres pulls the closest matches, ranked.

The big win over standalone vector databases: you keep all your other tables and joins. You can filter by tenant, by language, by recency, by user permissions β€” all in one transactional SQL query, not a two-step β€œvector search β†’ join β†’ filter” dance.

The exam expects you to choose between HNSW (fast, recommended) and IVFFlat (older, lower memory) indexes, write the canonical similarity query, and apply metadata filters efficiently.

Setup β€” enable the extension

-- Server-level: add 'vector' to azure.extensions (via portal, CLI, or ARM)
-- Then in the database:
CREATE EXTENSION IF NOT EXISTS vector;

The data type and operators

CREATE TABLE article_embeddings (
  article_id  uuid PRIMARY KEY,
  embedding   vector(1536) NOT NULL,
  model       text NOT NULL,
  tenant_id   text NOT NULL,
  language    text NOT NULL,
  updated_at  timestamptz NOT NULL DEFAULT now()
);

-- Insert one
INSERT INTO article_embeddings (article_id, embedding, model, tenant_id, language)
VALUES ($1, $2, 'text-embedding-3-small', 'tidewater', 'en');
OperatorDistanceUse when your embedding model expects
<->Euclidean (L2)General-purpose, normalised embeddings
<=>CosineMost LLM embeddings (text-embedding-ada-002, text-embedding-3-*)
<#>Negative inner productSome specialised models β€” rare in AI-200

For Azure OpenAI text embeddings, cosine (<=>) is the right choice.

Indexes β€” HNSW vs IVFFlat

HNSW is the default recommendation. IVFFlat survives mainly when memory is the binding constraint.
FeatureHNSWIVFFlat
AlgorithmLayered graph (Hierarchical Navigable Small World)Inverted file with k-means clustering
Build timeSlow on large datasetsFast
Query latencyLowest β€” usually a few msHigher than HNSW for the same recall
MemoryHigher (graph in memory)Lower
Recall tuning`m`, `ef_construction`, `ef_search``lists`, `probes`
Recommended forProduction AI apps β€” best speed/recall trade-offConstrained memory, very large rebuild needs
-- HNSW index for cosine distance
CREATE INDEX article_embeddings_hnsw_idx
ON article_embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- IVFFlat alternative
CREATE INDEX article_embeddings_ivfflat_idx
ON article_embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 200);   -- typical: rows / 1000

The _ops suffix matches the operator class to your distance function:

  • vector_cosine_ops for <=>
  • vector_l2_ops for <->
  • vector_ip_ops for <#>

If you build the index for the wrong operator class, queries won’t use it β€” they’ll fall back to a sequential scan.

Exam tip: tune ef_search at query time, not at index time

HNSW has a runtime knob, hnsw.ef_search, that controls the breadth of the search. Higher values find more neighbours (higher recall) at the cost of latency. You can change it per session:

SET hnsw.ef_search = 200;  -- default 40

This lets you tune the recall/latency trade-off per query type β€” small for a quick lookup, large for a one-off high-recall analytic query.

The canonical RAG query

-- Top-5 most similar articles for a tenant, in a language, recently updated
SELECT
  a.id, a.title, a.body,
  e.embedding <=> $1 AS distance
FROM article_embeddings e
JOIN knowledge_articles a USING (article_id)
WHERE
  e.tenant_id = $2
  AND e.language = $3
  AND a.updated_at > $4
ORDER BY e.embedding <=> $1
LIMIT 5;

Three things to notice:

  1. Metadata filters in WHERE before sorting by distance. The planner uses the regular indexes on tenant_id, language, updated_at to narrow first, then ranks survivors by similarity.
  2. Same operator in ORDER BY as in the SELECT β€” both <=>. The HNSW index covers ORDER BY when the operator and operator class match.
  3. Bind parameters via prepared statements ($1, $2, …). Avoid building SQL strings; psycopg / Npgsql handle vector binding cleanly.

Hybrid search β€” vector plus full-text

-- Combine semantic and lexical relevance
WITH semantic AS (
  SELECT article_id, e.embedding <=> $1 AS dist
  FROM article_embeddings e
  WHERE e.tenant_id = $2
  ORDER BY e.embedding <=> $1
  LIMIT 50
),
lexical AS (
  SELECT id AS article_id, ts_rank(to_tsvector('english', body), plainto_tsquery($3)) AS rank
  FROM knowledge_articles
  WHERE tenant_id = $2
  ORDER BY rank DESC
  LIMIT 50
)
SELECT a.id, a.title,
       coalesce(s.dist, 1.0) AS dist,
       coalesce(l.rank, 0.0) AS rank,
       (1 - coalesce(s.dist, 1.0)) * 0.7 + coalesce(l.rank, 0.0) * 0.3 AS hybrid_score
FROM knowledge_articles a
LEFT JOIN semantic s USING (article_id)
LEFT JOIN lexical l USING (article_id)
WHERE s.article_id IS NOT NULL OR l.article_id IS NOT NULL
ORDER BY hybrid_score DESC
LIMIT 10;

Hybrid scoring blends semantic similarity with classic full-text relevance. Keyword-heavy queries (drug names, error codes) often match better lexically; conceptual queries match better semantically. The blend covers both.

RAG end-to-end β€” the canonical flow

1. User asks a question
2. App embeds the question (Azure OpenAI text-embedding-3-small)
3. SQL query against pgvector with WHERE for tenant + language + recency
   ORDER BY embedding <=> @query_vec LIMIT 5
4. Results pass to the LLM as RAG context (prompt template includes them)
5. LLM generates a grounded answer, optionally citing source IDs
6. Audit log writes the question + retrieved IDs + answer to a separate table

Steps 2-5 happen in three round-trips: one to Azure OpenAI for embedding, one to PostgreSQL for retrieval, one to Azure OpenAI for generation. Total p50 latency on a well-tuned database: ~1.2 s.

Performance gotchas

SymptomLikely causeFix
Vector queries fall back to Seq ScanWrong operator class in the index, or planner stats staleMatch _ops to the operator (vector_cosine_ops for <=>); run ANALYZE table_name;
HNSW index build runs out of memorymaintenance_work_mem too lowBump maintenance_work_mem for the build (e.g., SET maintenance_work_mem = '4GB';)
Recall is too lowef_search too smallIncrease hnsw.ef_search (or probes for IVFFlat)
Large embedding columns slowing readsEmbeddings selected when not neededDon’t SELECT *; only select the embedding column when you need it
Long INSERTs after creating a vector indexIndex maintenance on every writeDrop and recreate the index for bulk loads, then put it back

Key terms

Question

What is pgvector?

Click or press Enter to reveal answer

Answer

A PostgreSQL extension that adds the `vector` data type, distance operators (`<->` L2, `<=>` cosine, `<#>` negative inner product), and two index types (HNSW and IVFFlat) for vector similarity search. Lets you do vector search inside the same SQL query as your relational filters.

Click to flip back

Question

When should you choose HNSW over IVFFlat?

Click or press Enter to reveal answer

Answer

HNSW is the default recommendation for production AI workloads β€” better recall/latency trade-off, no clustering pre-step. Choose IVFFlat only when memory is the binding constraint or when index build speed matters more than query speed.

Click to flip back

Question

What is `vector_cosine_ops`?

Click or press Enter to reveal answer

Answer

The operator class that tells pgvector an index is built for cosine distance (`<=>`). Match the `_ops` suffix to your distance operator: `vector_cosine_ops` for `<=>`, `vector_l2_ops` for `<->`, `vector_ip_ops` for `<#>`. Mismatch β†’ planner won't use the index.

Click to flip back

Question

Why apply metadata filters in WHERE rather than after a vector search?

Click or press Enter to reveal answer

Answer

The query planner uses regular B-tree / GIN indexes on metadata columns to narrow the candidate set first, then runs vector similarity on the survivors. Filtering after retrieval forces the database to over-fetch candidates and discard them β€” slower and burns memory.

Click to flip back

Question

How do you tune recall on an HNSW index at query time?

Click or press Enter to reveal answer

Answer

Set `hnsw.ef_search` per session. Default is 40. Higher values (e.g., 100, 200) explore more nodes β€” higher recall, higher latency. Use `SET hnsw.ef_search = N;` before the query (or in a connection pool config). No index rebuild required.

Click to flip back

Knowledge check

Knowledge Check

Theo's RAG query is `SELECT id FROM article_embeddings ORDER BY embedding <=> $1 LIMIT 5`. He created an HNSW index with `vector_l2_ops` (the default operator class he saw in a sample). Queries are slow. Why?

Knowledge Check

Mira wants top-5 product images similar to a query, restricted to one tenant, English-language only, less than 30 days old. Which query shape is correct?

Knowledge Check

Lin's HNSW index returns lower recall than expected on a small but high-stakes query. What's the simplest tunable to try?