pgvector + RAG: Retrieval-Augmented Apps

What pgvector adds

Simple explanation

pgvector is a PostgreSQL extension that adds a vector data type and similarity operators. You store an embedding the same way you’d store any other column. You search with a SQL query that includes a similarity operator — Postgres pulls the closest matches, ranked.

The big win over standalone vector databases: you keep all your other tables and joins. You can filter by tenant, by language, by recency, by user permissions — all in one transactional SQL query, not a two-step “vector search → join → filter” dance.

The exam expects you to choose between HNSW (fast, recommended) and IVFFlat (older, lower memory) indexes, write the canonical similarity query, and apply metadata filters efficiently.

Setup — enable the extension

-- Server-level: add 'vector' to azure.extensions (via portal, CLI, or ARM)
-- Then in the database:
CREATE EXTENSION IF NOT EXISTS vector;

The data type and operators

CREATE TABLE article_embeddings (
  article_id  uuid PRIMARY KEY,
  embedding   vector(1536) NOT NULL,
  model       text NOT NULL,
  tenant_id   text NOT NULL,
  language    text NOT NULL,
  updated_at  timestamptz NOT NULL DEFAULT now()
);

-- Insert one
INSERT INTO article_embeddings (article_id, embedding, model, tenant_id, language)
VALUES ($1, $2, 'text-embedding-3-small', 'tidewater', 'en');

Operator	Distance	Use when your embedding model expects
`<->`	Euclidean (L2)	General-purpose, normalised embeddings
`<=>`	Cosine	Most LLM embeddings (text-embedding-ada-002, text-embedding-3-*)
`<#>`	Negative inner product	Some specialised models — rare in AI-200

For Azure OpenAI text embeddings, cosine (<=>) is the right choice.

Indexes — HNSW vs IVFFlat

HNSW is the default recommendation. IVFFlat survives mainly when memory is the binding constraint.
Feature	HNSW	IVFFlat
Algorithm	Layered graph (Hierarchical Navigable Small World)	Inverted file with k-means clustering
Build time	Slow on large datasets	Fast
Query latency	Lowest — usually a few ms	Higher than HNSW for the same recall
Memory	Higher (graph in memory)	Lower
Recall tuning	`m`, `ef_construction`, `ef_search`	`lists`, `probes`
Recommended for	Production AI apps — best speed/recall trade-off	Constrained memory, very large rebuild needs

-- HNSW index for cosine distance
CREATE INDEX article_embeddings_hnsw_idx
ON article_embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- IVFFlat alternative
CREATE INDEX article_embeddings_ivfflat_idx
ON article_embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 200);   -- typical: rows / 1000

The _ops suffix matches the operator class to your distance function:

vector_cosine_ops for <=>
vector_l2_ops for <->
vector_ip_ops for <#>

If you build the index for the wrong operator class, queries won’t use it — they’ll fall back to a sequential scan.

Exam tip: tune ef_search at query time, not at index time

HNSW has a runtime knob, hnsw.ef_search, that controls the breadth of the search. Higher values find more neighbours (higher recall) at the cost of latency. You can change it per session:

SET hnsw.ef_search = 200;  -- default 40

This lets you tune the recall/latency trade-off per query type — small for a quick lookup, large for a one-off high-recall analytic query.

The canonical RAG query

-- Top-5 most similar articles for a tenant, in a language, recently updated
SELECT
  a.id, a.title, a.body,
  e.embedding <=> $1 AS distance
FROM article_embeddings e
JOIN knowledge_articles a USING (article_id)
WHERE
  e.tenant_id = $2
  AND e.language = $3
  AND a.updated_at > $4
ORDER BY e.embedding <=> $1
LIMIT 5;

Three things to notice:

Metadata filters in WHERE before sorting by distance. The planner uses the regular indexes on tenant_id, language, updated_at to narrow first, then ranks survivors by similarity.
Same operator in ORDER BY as in the SELECT — both <=>. The HNSW index covers ORDER BY when the operator and operator class match.
Bind parameters via prepared statements ($1, $2, …). Avoid building SQL strings; psycopg / Npgsql handle vector binding cleanly.

Hybrid search — vector plus full-text

-- Combine semantic and lexical relevance
WITH semantic AS (
  SELECT article_id, e.embedding <=> $1 AS dist
  FROM article_embeddings e
  WHERE e.tenant_id = $2
  ORDER BY e.embedding <=> $1
  LIMIT 50
),
lexical AS (
  SELECT id AS article_id, ts_rank(to_tsvector('english', body), plainto_tsquery($3)) AS rank
  FROM knowledge_articles
  WHERE tenant_id = $2
  ORDER BY rank DESC
  LIMIT 50
)
SELECT a.id, a.title,
       coalesce(s.dist, 1.0) AS dist,
       coalesce(l.rank, 0.0) AS rank,
       (1 - coalesce(s.dist, 1.0)) * 0.7 + coalesce(l.rank, 0.0) * 0.3 AS hybrid_score
FROM knowledge_articles a
LEFT JOIN semantic s USING (article_id)
LEFT JOIN lexical l USING (article_id)
WHERE s.article_id IS NOT NULL OR l.article_id IS NOT NULL
ORDER BY hybrid_score DESC
LIMIT 10;

Hybrid scoring blends semantic similarity with classic full-text relevance. Keyword-heavy queries (drug names, error codes) often match better lexically; conceptual queries match better semantically. The blend covers both.

RAG end-to-end — the canonical flow

1. User asks a question
2. App embeds the question (Azure OpenAI text-embedding-3-small)
3. SQL query against pgvector with WHERE for tenant + language + recency
   ORDER BY embedding <=> @query_vec LIMIT 5
4. Results pass to the LLM as RAG context (prompt template includes them)
5. LLM generates a grounded answer, optionally citing source IDs
6. Audit log writes the question + retrieved IDs + answer to a separate table

Steps 2-5 happen in three round-trips: one to Azure OpenAI for embedding, one to PostgreSQL for retrieval, one to Azure OpenAI for generation. Total p50 latency on a well-tuned database: ~1.2 s.

Performance gotchas

Symptom	Likely cause	Fix
Vector queries fall back to Seq Scan	Wrong operator class in the index, or planner stats stale	Match `_ops` to the operator (`vector_cosine_ops` for `<=>`); run `ANALYZE table_name;`
HNSW index build runs out of memory	`maintenance_work_mem` too low	Bump `maintenance_work_mem` for the build (e.g., `SET maintenance_work_mem = '4GB';`)
Recall is too low	`ef_search` too small	Increase `hnsw.ef_search` (or `probes` for IVFFlat)
Large embedding columns slowing reads	Embeddings selected when not needed	Don’t `SELECT *`; only select the embedding column when you need it
Long INSERTs after creating a vector index	Index maintenance on every write	Drop and recreate the index for bulk loads, then put it back