pgvector + RAG: Retrieval-Augmented Apps
How a SQL database does vector search. pgvector data types, HNSW vs ivfflat, hybrid filters, the canonical RAG SQL, and why metadata filters live in WHERE β not after.
What pgvector adds
pgvector is a PostgreSQL extension that adds a vector data type and similarity operators. You store an embedding the same way youβd store any other column. You search with a SQL query that includes a similarity operator β Postgres pulls the closest matches, ranked.
The big win over standalone vector databases: you keep all your other tables and joins. You can filter by tenant, by language, by recency, by user permissions β all in one transactional SQL query, not a two-step βvector search β join β filterβ dance.
The exam expects you to choose between HNSW (fast, recommended) and IVFFlat (older, lower memory) indexes, write the canonical similarity query, and apply metadata filters efficiently.
Setup β enable the extension
-- Server-level: add 'vector' to azure.extensions (via portal, CLI, or ARM)
-- Then in the database:
CREATE EXTENSION IF NOT EXISTS vector;
The data type and operators
CREATE TABLE article_embeddings (
article_id uuid PRIMARY KEY,
embedding vector(1536) NOT NULL,
model text NOT NULL,
tenant_id text NOT NULL,
language text NOT NULL,
updated_at timestamptz NOT NULL DEFAULT now()
);
-- Insert one
INSERT INTO article_embeddings (article_id, embedding, model, tenant_id, language)
VALUES ($1, $2, 'text-embedding-3-small', 'tidewater', 'en');
| Operator | Distance | Use when your embedding model expects |
|---|---|---|
<-> | Euclidean (L2) | General-purpose, normalised embeddings |
<=> | Cosine | Most LLM embeddings (text-embedding-ada-002, text-embedding-3-*) |
<#> | Negative inner product | Some specialised models β rare in AI-200 |
For Azure OpenAI text embeddings, cosine (<=>) is the right choice.
Indexes β HNSW vs IVFFlat
| Feature | HNSW | IVFFlat |
|---|---|---|
| Algorithm | Layered graph (Hierarchical Navigable Small World) | Inverted file with k-means clustering |
| Build time | Slow on large datasets | Fast |
| Query latency | Lowest β usually a few ms | Higher than HNSW for the same recall |
| Memory | Higher (graph in memory) | Lower |
| Recall tuning | `m`, `ef_construction`, `ef_search` | `lists`, `probes` |
| Recommended for | Production AI apps β best speed/recall trade-off | Constrained memory, very large rebuild needs |
-- HNSW index for cosine distance
CREATE INDEX article_embeddings_hnsw_idx
ON article_embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- IVFFlat alternative
CREATE INDEX article_embeddings_ivfflat_idx
ON article_embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 200); -- typical: rows / 1000
The _ops suffix matches the operator class to your distance function:
vector_cosine_opsfor<=>vector_l2_opsfor<->vector_ip_opsfor<#>
If you build the index for the wrong operator class, queries wonβt use it β theyβll fall back to a sequential scan.
Exam tip: tune ef_search at query time, not at index time
HNSW has a runtime knob, hnsw.ef_search, that controls the breadth of the search. Higher values find more neighbours (higher recall) at the cost of latency. You can change it per session:
SET hnsw.ef_search = 200; -- default 40This lets you tune the recall/latency trade-off per query type β small for a quick lookup, large for a one-off high-recall analytic query.
The canonical RAG query
-- Top-5 most similar articles for a tenant, in a language, recently updated
SELECT
a.id, a.title, a.body,
e.embedding <=> $1 AS distance
FROM article_embeddings e
JOIN knowledge_articles a USING (article_id)
WHERE
e.tenant_id = $2
AND e.language = $3
AND a.updated_at > $4
ORDER BY e.embedding <=> $1
LIMIT 5;
Three things to notice:
- Metadata filters in WHERE before sorting by distance. The planner uses the regular indexes on
tenant_id,language,updated_atto narrow first, then ranks survivors by similarity. - Same operator in ORDER BY as in the SELECT β both
<=>. The HNSW index covers ORDER BY when the operator and operator class match. - Bind parameters via prepared statements ($1, $2, β¦). Avoid building SQL strings; psycopg / Npgsql handle vector binding cleanly.
Hybrid search β vector plus full-text
-- Combine semantic and lexical relevance
WITH semantic AS (
SELECT article_id, e.embedding <=> $1 AS dist
FROM article_embeddings e
WHERE e.tenant_id = $2
ORDER BY e.embedding <=> $1
LIMIT 50
),
lexical AS (
SELECT id AS article_id, ts_rank(to_tsvector('english', body), plainto_tsquery($3)) AS rank
FROM knowledge_articles
WHERE tenant_id = $2
ORDER BY rank DESC
LIMIT 50
)
SELECT a.id, a.title,
coalesce(s.dist, 1.0) AS dist,
coalesce(l.rank, 0.0) AS rank,
(1 - coalesce(s.dist, 1.0)) * 0.7 + coalesce(l.rank, 0.0) * 0.3 AS hybrid_score
FROM knowledge_articles a
LEFT JOIN semantic s USING (article_id)
LEFT JOIN lexical l USING (article_id)
WHERE s.article_id IS NOT NULL OR l.article_id IS NOT NULL
ORDER BY hybrid_score DESC
LIMIT 10;
Hybrid scoring blends semantic similarity with classic full-text relevance. Keyword-heavy queries (drug names, error codes) often match better lexically; conceptual queries match better semantically. The blend covers both.
RAG end-to-end β the canonical flow
1. User asks a question
2. App embeds the question (Azure OpenAI text-embedding-3-small)
3. SQL query against pgvector with WHERE for tenant + language + recency
ORDER BY embedding <=> @query_vec LIMIT 5
4. Results pass to the LLM as RAG context (prompt template includes them)
5. LLM generates a grounded answer, optionally citing source IDs
6. Audit log writes the question + retrieved IDs + answer to a separate table
Steps 2-5 happen in three round-trips: one to Azure OpenAI for embedding, one to PostgreSQL for retrieval, one to Azure OpenAI for generation. Total p50 latency on a well-tuned database: ~1.2 s.
Performance gotchas
| Symptom | Likely cause | Fix |
|---|---|---|
| Vector queries fall back to Seq Scan | Wrong operator class in the index, or planner stats stale | Match _ops to the operator (vector_cosine_ops for <=>); run ANALYZE table_name; |
| HNSW index build runs out of memory | maintenance_work_mem too low | Bump maintenance_work_mem for the build (e.g., SET maintenance_work_mem = '4GB';) |
| Recall is too low | ef_search too small | Increase hnsw.ef_search (or probes for IVFFlat) |
| Large embedding columns slowing reads | Embeddings selected when not needed | Donβt SELECT *; only select the embedding column when you need it |
| Long INSERTs after creating a vector index | Index maintenance on every write | Drop and recreate the index for bulk loads, then put it back |
Key terms
Knowledge check
Theo's RAG query is `SELECT id FROM article_embeddings ORDER BY embedding <=> $1 LIMIT 5`. He created an HNSW index with `vector_l2_ops` (the default operator class he saw in a sample). Queries are slow. Why?
Mira wants top-5 product images similar to a query, restricted to one tenant, English-language only, less than 30 days old. Which query shape is correct?
Lin's HNSW index returns lower recall than expected on a small but high-stakes query. What's the simplest tunable to try?