Domain 3 β€” Module 4 of 7 57%
25 of 28 overall
Domain 3: Implement AI Capabilities in Database Solutions Free ⏱ ~13 min read

Vector Data: Types, Indexes, and Storage

Design your database for vector data β€” understand the VECTOR data type, choose index types and distance metrics, and use VECTOR_NORMALIZE and VECTORPROPERTY.

Storing vectors in SQL

Simple explanation

Think of GPS coordinates for every piece of data.

Regular data has an address (a primary key). Vector data has GPS coordinates in a high-dimensional space β€” except instead of latitude/longitude (2 dimensions), embeddings have 256 to 3,072 dimensions. SQL Server 2025 adds a native VECTOR data type to store these coordinates efficiently, and vector indexes to find nearby points quickly.

The VECTOR data type

-- Declare a vector column with specific dimensions
CREATE TABLE Products (
    ProductId INT NOT NULL PRIMARY KEY,
    Name NVARCHAR(200) NOT NULL,
    DescriptionEmbedding VECTOR(1536) NULL  -- 1536-dimension vector
);

-- Insert a vector value (JSON array format)
INSERT INTO Products (ProductId, Name, DescriptionEmbedding)
VALUES (1, 'Wireless Mouse', '[0.0123, -0.0456, 0.0789, ...]');

Choosing vector dimensions

Embedding ModelDimensionsStorage Per VectorQuality
text-embedding-3-small1536~6 KBGood for most use cases
text-embedding-3-large3072~12 KBHigher quality, more storage
text-embedding-ada-0021536~6 KBLegacy, still widely used
Custom/smaller models256-7681-3 KBFaster search, lower quality

Note: Standard VECTOR supports max 1998 dimensions with float32. For 3072 dimensions (e.g., text-embedding-3-large), use the PARAMETERS option to request reduced dimensions from the model, or use float16 preview support.

Vector functions

VECTOR_NORMALIZE

Normalises a vector to unit length (magnitude = 1). Essential before using cosine distance.

-- Normalise vectors for consistent distance calculations
UPDATE Products
SET DescriptionEmbedding = VECTOR_NORMALIZE(DescriptionEmbedding, 'norm2');

VECTORPROPERTY

Inspects vector metadata:

-- Get the number of dimensions
SELECT VECTORPROPERTY(DescriptionEmbedding, 'Dimensions') AS Dims
FROM Products WHERE ProductId = 1;
-- Returns: 1536

VECTOR_DISTANCE

Calculates the distance between two vectors:

-- Find the 5 most similar products to a query vector
SELECT TOP 5 ProductId, Name,
    VECTOR_DISTANCE('cosine', DescriptionEmbedding, @queryVector) AS Distance
FROM Products
ORDER BY Distance ASC;  -- Lower distance = more similar

Distance metrics

Distance metrics β€” cosine is the default for text embeddings
MetricRangeInterpretationBest For
Cosine0 to 20 = identical direction, 2 = oppositeText embeddings (most common)
Euclidean (L2)0 to infinity0 = same point, larger = fartherWhen magnitude matters (image features)
Dot productDepends on normalisationSmaller = more similar (negative dot product in SQL Server)Normalised vectors, high-performance search

Vector indexes

Without an index, vector search requires scanning every row (brute force). Vector indexes enable fast approximate nearest neighbour (ANN) search.

-- Create a vector index
CREATE VECTOR INDEX IX_Products_Embedding
ON Products (DescriptionEmbedding)
WITH (
    METRIC = 'cosine',
    TYPE = 'DISKANN'
);

DiskANN index

SQL Server uses DiskANN (Disk-based Approximate Nearest Neighbours) β€” Microsoft Research’s algorithm optimised for SSD-based storage. Key properties:

  • Handles billions of vectors
  • Works with data larger than memory
  • Configurable accuracy vs speed trade-off
Exam tip: Vector index sizing

Vector data is large. Plan storage carefully:

  • 1 million rows x 1536 dimensions x 4 bytes = ~6 GB just for vectors
  • Add the DiskANN index overhead (~20-30% additional)
  • Total: ~8 GB for 1M vectors with 1536 dimensions

If the exam asks about storage planning for vector data, remember to account for both the vector column AND the index overhead. Smaller dimensions (256-768) significantly reduce storage requirements.

Question

What is the VECTOR data type in SQL Server 2025?

Click or press Enter to reveal answer

Answer

A native data type that stores fixed-dimension floating-point arrays (embeddings). Declared as VECTOR(N) where N is the number of dimensions (e.g., VECTOR(1536)). Values are stored in an optimised binary format and inserted as JSON arrays.

Click to flip back

Question

Why should you normalise vectors before searching?

Click or press Enter to reveal answer

Answer

VECTOR_NORMALIZE converts vectors to unit length (magnitude = 1). This ensures cosine distance works correctly β€” without normalisation, vectors with different magnitudes may produce misleading distance values even if they point in the same direction.

Click to flip back

Question

What distance metric should you use for text embeddings?

Click or press Enter to reveal answer

Answer

Cosine distance is the standard for text embeddings. It measures the angle between vectors (direction), ignoring magnitude. A cosine distance of 0 means identical direction (most similar), 2 means opposite. Most embedding models are optimised for cosine similarity.

Click to flip back

Question

What is DiskANN?

Click or press Enter to reveal answer

Answer

DiskANN (Disk-based Approximate Nearest Neighbours) is Microsoft Research's vector index algorithm used in SQL Server. It is optimised for SSD storage, handles datasets larger than memory, and provides configurable accuracy vs speed trade-offs. It is the default vector index type.

Click to flip back

Knowledge Check

Priya at Vault Bank is designing a vector search system for 5 million customer support tickets. Each ticket has a 1536-dimension embedding. She needs sub-second search latency. What should she create?

Next up: Vector Search: Distance, ANN, and ENN β€” implement vector search with approximate and exact nearest neighbours.