PostgreSQL Tuning: Compute, Memory, Connections

Sizing in three numbers

Simple explanation

The three knobs that decide PostgreSQL performance are vCores, memory, and IOPS. vCores set how much CPU the planner has, memory controls how much of your data fits in cache (and how big your HNSW index can be), and IOPS controls how fast the disk delivers when you exceed the cache.

For vector workloads (pgvector with HNSW indexes), memory is usually the binding constraint. The HNSW graph wants to live in RAM. If it doesn’t fit, every query reads from disk and gets slow.

Connection pooling is the second big lever. Each PostgreSQL connection costs ~10 MB of memory; running 1000 idle connections from your container app eats gigabytes for nothing. Use PgBouncer (Flexible Server has it built in) or pool inside your app.

Sizing for vector workloads

Pick Memory Optimized when your HNSW index is large enough to thrash General Purpose.
Feature	Burstable	General Purpose	Memory Optimized
Memory per vCore	~2 GB	~4 GB	~8 GB
Best for	Dev/test, tiny POCs	OLTP-style workloads, light vector use	Large vector indexes, RAG production
Cost shape	Cheapest, throttles under load	Steady, mid-range	Most per vCore, but vector workloads stop being I/O-bound

A rough sizing rule for HNSW:

HNSW index size ≈ rows × dimension × 4 bytes × m / 8

For 1M rows, 1536 dim, m=16: ~12 GB just for the index. Add the heap, the planner, the connections — you want a server with 32 GB+ RAM, putting you in Memory Optimized territory.

Memory parameters that matter

Parameter	What it controls	Tuning hint
`shared_buffers`	Postgres’s main page cache	25% of system RAM (default in Flexible Server is reasonable)
`effective_cache_size`	Hint to the planner about OS + Postgres cache	50–75% of system RAM
`work_mem`	Per-operation sort/hash memory	Start at 16-64 MB; bump if EXPLAIN shows external sorts
`maintenance_work_mem`	Memory for VACUUM, CREATE INDEX	1-4 GB during HNSW index builds
`max_connections`	Hard cap on connections	Keep low (200-500); use a pooler for more

For HNSW index builds:

SET maintenance_work_mem = '4GB';
SET max_parallel_maintenance_workers = 4;
CREATE INDEX article_embeddings_hnsw_idx
  ON article_embeddings
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
RESET maintenance_work_mem;

Building a 1M-row HNSW index with 4 GB of maintenance_work_mem and 4 parallel workers is roughly an order of magnitude faster than building with defaults.

Connection pooling — PgBouncer is your friend

Each PostgreSQL connection is a backend process consuming ~10 MB. A Container App scaling out to 50 replicas, each holding 20 connections, means 1000 connections — usually overkill, often hitting max_connections.

Two patterns:

Pattern	How	Best for
Built-in PgBouncer (Flexible Server)	Enable in server parameters; clients connect to the same host on port 6432	Most production workloads — Microsoft runs the pooler
Application-side pool (psycopg pool, Npgsql pool)	Each replica caches a small number of connections	Layered with PgBouncer for best results

# Python with psycopg_pool — connection pool inside the worker
from psycopg_pool import ConnectionPool

pool = ConnectionPool(
    conninfo="host=roo-pg.postgres.database.azure.com user=... ...",
    min_size=2,
    max_size=10,
    timeout=5,
)

with pool.connection() as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")

Two pooler modes (PgBouncer):

Mode	What it does	Use for
Transaction pooling	Connection is reused per transaction	Most workloads — highest concurrency
Session pooling	Connection is reused per session	When you need session-state features (LISTEN/NOTIFY, prepared statements with session lifetime, temporary tables)

Exam tip: 'connection storms' are pooling questions

When the symptom is “thousands of connection errors during traffic spike” or “Postgres is hitting max_connections”, the answer is almost always to add a connection pooler — built-in PgBouncer or an application-side pool. Increasing max_connections is a band-aid; PostgreSQL backends are heavyweight.

IOPS and storage tuning

Premium SSD v2 (the default for new Flexible Server deployments) decouples size from IOPS:

Workload	IOPS target	Why
OLTP, mostly cache hits	3,000–6,000	Most reads from `shared_buffers`; IOPS only on writes/checkpoints
Heavy vector ingest + reindex	6,000–20,000	Large bulk loads + index builds saturate IO
Large analytical queries	12,000+	Sequential scans hit storage hard

WAL (write-ahead log) is on the same volume as data on Flexible Server. High write workloads benefit from more IOPS even when reads cache well.

Read replicas — when to add one

A read replica is a standby that streams WAL from the primary and serves read-only queries. Use cases:

Use case	Why
Offload reporting / analytics	Big aggregate queries on the primary slow OLTP — move them to a replica
Geographic read locality	Cross-region replica serves distant readers with lower latency
RAG retrieval reads	Vector queries can run on a replica while the primary handles ingest

Caveats: replicas lag the primary by a few hundred ms typically; for “must read your own write” semantics, query the primary.