Azure Managed Redis: Caching + Vector Search

What Azure Managed Redis is

Simple explanation

Azure Managed Redis is the new managed Redis service on Azure and the successor to Azure Cache for Redis Enterprise. It’s fast (sub-millisecond reads), fully managed, and supports the Redis Stack modules, including RediSearch, which gives you vector search.

Two main jobs in AI-200:

Caching: store the result of an expensive operation (LLM call, vector search, computed feature) for a short time so repeated requests don’t pay the cost again
Vector similarity over hot data: store embeddings that need sub-millisecond search in Redis, behind a vector index

The exam tests TTLs (expiration), invalidation patterns, and the basics of vector indexing in Redis.

Caching essentials — TTL, expiration, invalidation

The core caching pattern:

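import hashlib
import json

import openai
import redis

r = redis.Redis(
    host="roo.redis.azure.net",
    port=10000,
    password="<entra-token-or-key>",
    ssl=True,
)

EMBEDDING_MODEL = "<embedding-model>"  # placeholder: your model or deployment name

def hash_text(text):
    # Stable digest so identical text always maps to the same cache key
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def get_embedding(text):
    cache_key = f"embedding:{hash_text(text)}"
    cached = r.get(cache_key)
    if cached is not None:
        return json.loads(cached)  # cache hit: decode the stored JSON

    # Cache miss: compute
    response = openai.embeddings.create(model=EMBEDDING_MODEL, input=text)
    embedding = response.data[0].embedding  # plain list of floats, JSON-serialisable
    r.set(cache_key, json.dumps(embedding), ex=3600)  # 1 hour TTL
    return embedding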

Three classic cache freshness strategies. Most production AI systems combine fixed TTL with explicit invalidation.
Feature	Fixed TTL	Sliding TTL	Explicit invalidation
How	`SET key value EX 3600`	On every read, refresh: `EXPIRE key 3600`	On data change, `DEL key` (or pub/sub event)
Best for	Predictable freshness windows (hourly, daily)	Hot keys you want to keep alive while in use	Source-of-truth changes — invalidate when the underlying data changes
Risk	Cold cache after expiry (one slow request)	Stale data if nothing's invalidated and the source changed	Forgetting to invalidate on every write path
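
The sliding-TTL and explicit-invalidation columns can be sketched as two small helpers. This is a minimal illustration rather than a prescribed implementation: `r` is assumed to be any redis-py style client exposing `get`, `expire`, and `delete`, and `write_to_source` is a hypothetical callable that persists the change to the system of record.

```python
SLIDING_TTL_SECONDS = 3600  # assumed window, matching the table above


def get_with_sliding_ttl(r, key, ttl=SLIDING_TTL_SECONDS):
    """Read a key and, on a hit, push its expiry out by another `ttl` seconds."""
    value = r.get(key)
    if value is not None:
        r.expire(key, ttl)  # sliding TTL: every read renews the window
    return value


def update_source_and_invalidate(r, key, write_to_source):
    """Write to the source of truth first, then drop the cached copy."""
    write_to_source()  # hypothetical callable supplied by the caller
    r.delete(key)      # the next reader misses and repopulates from fresh data
```

On Redis 6.2 and later, `GETEX` (`r.getex(key, ex=ttl)` in redis-py) performs the read and the refresh in a single round trip.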

Exam tip: 'cache stampede' and the dogpile problem

When a hot cache key expires under heavy traffic, dozens of replicas all miss simultaneously and all try to recompute — the cache “stampedes”. Two classic mitigations:

Probabilistic early expiration — refresh the key probabilistically a bit before TTL ends, so misses happen one at a time
Lock-and-recompute — first miss takes a short Redis lock, recomputes, sets the value; others read the new value once the lock is released
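
Probabilistic early expiration is often implemented with the decision rule sketched below (sometimes called XFetch). The function and parameter names are assumptions made for illustration: `expires_at` is the key's expiry as a Unix timestamp, `recompute_seconds` is roughly how long a recompute takes, and `beta` above 1 makes early refreshes more likely.

```python
import math
import random
import time


def should_refresh_early(expires_at, recompute_seconds, beta=1.0,
                         now=None, rng=random.random):
    """Decide whether this caller should recompute before the key expires."""
    now = time.time() if now is None else now
    u = 1.0 - rng()  # in (0, 1], so log(u) is always defined
    # -log(u) is usually small and occasionally large, so only a few callers
    # volunteer early, and the chance rises as expiry approaches.
    return now - recompute_seconds * beta * math.log(u) >= expires_at
```

A caller that gets `True` recomputes and resets the key while everyone else keeps reading the still-valid cached value.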

AI scenarios with expensive recompute (LLM calls, large vector searches) are particularly vulnerable to stampedes.
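
Lock-and-recompute can be sketched with a single `SET ... NX EX` lock. This is a simplified illustration under stated assumptions: `r` is a redis-py style client, `recompute` is a hypothetical callable returning the value to cache, and the `lock:` key prefix and timing defaults are arbitrary choices.

```python
import time
import uuid


def get_or_recompute(r, key, recompute, ttl=3600, lock_ttl=30,
                     poll_interval=0.05, max_wait=5.0):
    """Return the cached value, letting only one caller recompute on a miss."""
    value = r.get(key)
    if value is not None:
        return value

    lock_key = f"lock:{key}"
    token = uuid.uuid4().hex
    # SET NX EX succeeds for exactly one caller and expires on its own
    # if that caller crashes before releasing the lock.
    if r.set(lock_key, token, nx=True, ex=lock_ttl):
        try:
            value = recompute()
            r.set(key, value, ex=ttl)
            return value
        finally:
            # Release only a lock we still hold. This check-then-delete is not
            # atomic; a Lua script would close that gap in production.
            if r.get(lock_key) in (token, token.encode()):
                r.delete(lock_key)

    # Another caller holds the lock: poll for its result instead of recomputing.
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        time.sleep(poll_interval)
        value = r.get(key)
        if value is not None:
            return value
    return recompute()  # waited too long; serve a fresh value without caching it
```

The lock TTL should comfortably exceed the expected recompute time, otherwise a second caller can acquire the lock while the first is still working.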

Eviction policies — what gets dropped when memory fills

Managed Redis evicts when you exceed the maxmemory budget. Choose a policy that matches your access pattern:

Policy	What it evicts	Best for
`noeviction`	Nothing; writes fail with an OOM error once memory is full	Data that must never be silently dropped (Redis used as a store rather than a disposable cache)
`allkeys-lru`	Least Recently Used across all keys	General caching — the safe default
`allkeys-lfu`	Least Frequently Used	When some keys are perennially hot regardless of recency
`volatile-ttl`	Keys with the soonest TTL first, among keys with TTLs	Mixed workloads where some keys are persistent and others ephemeral

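allkeys-lru is a common choice for pure cache use. Check the policy actually configured on the instance rather than assuming it, since Azure's Redis offerings have historically defaulted to volatile-lru. volatile-ttl makes sense when you store a mix of cache-with-TTL plus permanent state in the same instance. Microsoft's documentation has also required noeviction on instances that use the RediSearch module, so verify the current requirement before pairing an eviction policy with a vector index.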
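
To make the table concrete, here is a toy model of which key each policy would drop first. It is only an illustration: real Redis approximates LRU and LFU by sampling a handful of keys instead of scanning all of them, and the `last_access`, `hits`, and `ttl` fields are hypothetical bookkeeping invented for this sketch.

```python
def pick_victim(keys, policy):
    """Return the key name a policy would evict first, or None if it evicts nothing.

    `keys` maps name -> {"last_access": float, "hits": int, "ttl": float or None}.
    """
    if policy == "noeviction":
        return None  # nothing is evicted; the write fails instead
    if policy == "allkeys-lru":
        return min(keys, key=lambda k: keys[k]["last_access"])
    if policy == "allkeys-lfu":
        return min(keys, key=lambda k: keys[k]["hits"])
    if policy == "volatile-ttl":
        volatile = {k: v for k, v in keys.items() if v["ttl"] is not None}
        if not volatile:
            return None  # no key has a TTL, so nothing is eligible
        return min(volatile, key=lambda k: volatile[k]["ttl"])
    raise ValueError(f"unknown policy: {policy}")


sample = {
    "session": {"last_access": 100.0, "hits": 50, "ttl": 30.0},
    "config": {"last_access": 10.0, "hits": 500, "ttl": None},
    "report": {"last_access": 90.0, "hits": 2, "ttl": 600.0},
}
```

With `sample`, LRU drops `config` (oldest access) even though it is the most frequently used key, LFU drops `report`, and volatile-ttl drops `session` while never touching the TTL-less `config`.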

Vector indexing — RediSearch

Redis Stack’s RediSearch module supports vector fields with HNSW or FLAT indexes. The pattern is conceptually similar to pgvector, but the API is Redis-flavoured.

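import time

import numpy as np
from redis.commands.search.field import NumericField, TagField, VectorField
# Older redis-py releases name this module `indexDefinition`
from redis.commands.search.index_definition import IndexDefinition, IndexType
from redis.commands.search.query import Query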

# Create the index once
r.ft("idx:embeddings").create_index(
    [
        VectorField(
            "embedding",
            "HNSW",
            {
                "TYPE": "FLOAT32",
                "DIM": 1536,
                "DISTANCE_METRIC": "COSINE",
                "M": 16,
                "EF_CONSTRUCTION": 64,
            },
        ),
        TagField("tenant"),
        TagField("language"),
        NumericField("updated_at"),
    ],
    definition=IndexDefinition(prefix=["doc:"], index_type=IndexType.HASH),
)

# Insert a doc
r.hset(
    f"doc:{doc_id}",
    mapping={
        "embedding": np.array(vec, dtype="float32").tobytes(),
        "tenant": "tidewater",
        "language": "en",
        "updated_at": int(time.time()),
        "title": "Insulin storage policy v3",
    },
)

# Query top-5 nearest with metadata filter
res = r.ft("idx:embeddings").search(
    Query("(@tenant:{tidewater} @language:{en})=>[KNN 5 @embedding $vec AS score]")
    .sort_by("score").return_fields("title", "score").dialect(2),
    query_params={"vec": np.array(query_vec, dtype="float32").tobytes()},
)

Two index types:

Type	What it does	Best for
`FLAT`	Brute-force, exact	Small datasets where exactness matters
`HNSW`	Approximate nearest neighbour, low latency	Production AI workloads — usually the right choice

Distance metrics: COSINE, IP (inner product), L2.
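
For intuition, a FLAT index with the COSINE metric computes roughly what the pure-Python sketch below does: score every stored vector against the query and keep the k smallest distances. The function names are made up for illustration, and the distance is taken as 1 minus cosine similarity, so lower scores mean closer matches. HNSW avoids this full scan by walking a graph, trading a little recall for much lower latency.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: 0.0 for identical direction, 1.0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def flat_knn(query, docs, k):
    """Exact k-nearest neighbours by brute force over every stored vector."""
    scored = [(doc_id, cosine_distance(query, vec)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]
```

Cost grows linearly with the number of stored vectors, which is why FLAT suits small datasets.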

Hybrid search in Redis

The @field:{value} filter syntax inside a vector query is RediSearch’s hybrid search. Filter first, KNN second:

(@tenant:{tidewater} @language:{en} @updated_at:[1714521600 +inf])=>[KNN 5 @embedding $vec]

Reads like: “among docs whose tenant is tidewater AND language is en AND updated_at >= 1714521600, return the 5 nearest to the query vector.”
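
A query string of that shape can be assembled programmatically. The helper below is a hypothetical sketch, not part of redis-py: it assumes tag values need no escaping (values with spaces or punctuation do, which this sketch does not handle) and that the numeric field is named `updated_at` as in the index above.

```python
def build_hybrid_query(tag_filters, k, vector_field="embedding",
                       updated_since=None, vector_param="vec"):
    """Build '(<filters>)=>[KNN k @field $param]' from simple inputs."""
    clauses = [f"@{field}:{{{value}}}" for field, value in tag_filters.items()]
    if updated_since is not None:
        clauses.append(f"@updated_at:[{int(updated_since)} +inf]")
    # '*' matches every document, giving a pure KNN query with no pre-filter
    prefilter = f"({' '.join(clauses)})" if clauses else "*"
    return f"{prefilter}=>[KNN {int(k)} @{vector_field} ${vector_param}]"
```

Calling `build_hybrid_query({"tenant": "tidewater", "language": "en"}, 5, updated_since=1714521600)` reproduces the query shown above; the result is passed to `Query(...)` with `.dialect(2)` as in the earlier search example.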

Authentication

Mode	When	Notes
Microsoft Entra ID	Recommended	Container Apps / AKS managed identity authenticates; map identities to Redis access policies
Access keys	Legacy / quick start	Two keys per cache; rotate one at a time

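# Attach a user-assigned managed identity to the cache.
# Note: `az redis` manages Azure Cache for Redis; Azure Managed Redis
# resources are managed through the `az redisenterprise` command group.
az redis identity assign \
  --name roo-redis -g roo-prod \
  --mi-user-assigned $UAI_RESOURCE_ID

# Grant the principal the "Redis Cache Contributor" role. This is a
# management-plane role; data access (for example Data Owner) comes from
# a Redis access policy assignment, not from this role.
az role assignment create \
  --assignee $PRINCIPAL_ID \
  --role "Redis Cache Contributor" \
  --scope $(az redis show -n roo-redis -g roo-prod --query id -o tsv)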

When to pick Managed Redis vs Cosmos vs PostgreSQL

Three Azure-native data services for AI. Pick based on volume, latency, and data shape.
Feature	Cosmos DB NoSQL	PostgreSQL + pgvector	Azure Managed Redis
Latency target	Single-digit ms	10-50 ms (depending on tuning)	Sub-millisecond
Volume per instance	TBs across partitions	TBs per server	GBs (RAM-bound)
Best for	Document store + vectors at scale	Relational + vectors with joins	Hot cache + ultra-low-latency vectors
Persistence	Persistent, replicated	Persistent, replicated	RAM-first; persistence available but watch trade-offs
Use as	Primary store for chat / RAG sources	Primary store with strong relational shape	Cache layer + tier 1 vector retrieval

Real-world example: Priya's two-tier vector setup

BeanCraft’s menu-Q&A bot serves 240 stores. Priya’s data architecture:

PostgreSQL + pgvector — 80,000 menu items + dietary docs + brand standards (durable, joined to product catalog)
Managed Redis — top 2,000 most-asked items embedded in a RediSearch HNSW index, refreshed nightly

The bot first searches Redis (sub-ms). On a miss, it falls back to PostgreSQL (~30 ms). 92% of queries hit Redis. Average retrieval latency: 4 ms — fast enough that customers feel the bot is “instant”.
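
The two-tier lookup and its latency arithmetic can be sketched as follows. Everything here is an assumption for illustration: `search_hot` and `search_durable` are hypothetical callables wrapping the Redis and PostgreSQL queries, a "miss" is taken to mean that no Redis result falls within `max_distance`, and each hit is assumed to be a dict with a `score` field holding the distance.

```python
def two_tier_search(query_vec, search_hot, search_durable, max_distance=0.2):
    """Try the hot tier first; fall back to the durable tier on a miss."""
    hits = [h for h in search_hot(query_vec) if h["score"] <= max_distance]
    if hits:
        return "hot", hits
    return "durable", search_durable(query_vec)


def expected_latency_ms(hit_rate, hot_ms, durable_ms):
    """Average latency when a miss pays for the hot lookup plus the fallback."""
    miss_ms = hot_ms + durable_ms
    return hit_rate * hot_ms + (1.0 - hit_rate) * miss_ms
```

With a 92% hit rate, an assumed 1 ms hot lookup, and a 30 ms fallback, the estimate is 0.92 × 1 + 0.08 × 31, or about 3.4 ms, in the same range as the 4 ms average quoted above.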