Scaling Vector Search: Flat vs HNSW vs HNSW+PQ

Sep 14, 2025

When your vector store grows from thousands to millions of embeddings, the “it was fast on my laptop” phase ends quickly. Flat (brute-force) search hits O(N) time and huge memory. The fix is usually Approximate Nearest Neighbor (ANN) with smart indexing and compression:

HNSW (graph index) → very fast queries with high recall
PQ (Product Quantization) → compress vectors dramatically with minor accuracy loss
HNSW + PQ → the pragmatic combo for large, read-heavy workloads

Below, I’ll explain how each works, then share a benchmark script you can run locally.

Flat (Exact) Search — `IndexFlatL2` / `IndexFlatIP`

Stores raw float32 vectors.
For each query, computes distance to every vector → O(N).
Accuracy: perfect. Scale: doesn’t.

Generated Image September 14, 2025 - 1:28PM.png

Cosine tip:
Use inner product on L2-normalized vectors (or L2 on normalized; both give cosine orderings).

import faiss
import numpy as np

# X: (num_vectors, d) float32; Q: (num_queries, d)

# Cosine via Inner Product: normalize first

def l2_normalize(x):
    return x / np.maximum(1e-12, np.linalg.norm(x, axis=1, keepdims=True))

X = np.random.randn(100_000, 128).astype('float32')
Q = np.random.randn(100, 128).astype('float32')
Xn, Qn = l2_normalize(X), l2_normalize(Q)

index_ip = faiss.IndexFlatIP(128)  # exact IP
index_ip.add(Xn)

D, I = index_ip.search(Qn, k=10)

HNSW — `IndexHNSWFlat`

Hierarchical Navigable Small Worlds builds a multi-layer proximity graph. Querying means navigating neighbors rather than scanning all vectors.

Time: ~O(log N) (in practice)
Accuracy: tunable via efSearch
Memory: still stores full floats

index_hnsw = faiss.IndexHNSWFlat(128, 32)  # M=32 neighbors/node
index_hnsw.hnsw.efConstruction = 40
index_hnsw.hnsw.efSearch = 64
index_hnsw.add(Xn)              # add normalized vectors if using IP
D_h, I_h = index_hnsw.search(Qn, k=10)

Tuning heuristics

M (graph degree): 16–48 (more = better recall, more memory/build time)
efConstruction: 30–200 (higher = better graph quality)
efSearch: 32–256 (higher = better recall, slower queries)

Product Quantization (PQ) — `IndexPQ`

What is a PQ codebook?
PQ splits each vector into m equal chunks (subvectors). For each chunk, it learns k-means centroids (the codebook). Instead of storing floats, we store the centroid index for each chunk.

Storage per vector ≈ m × nbits / 8 bytes
(e.g., m=16, nbits=8 → 16 bytes per vector instead of 128 × 4 = 512 bytes)
Distance at query time uses precomputed lookup tables → very fast
Needs a training step to learn codebooks

d, m, nbits = 128, 16, 8   # m must divide d
index_pq = faiss.IndexPQ(d, m, nbits)
index_pq.train(X)          # learns codebooks via k-means
index_pq.add(X)            # stores codes, not floats
D_pq, I_pq = index_pq.search(Q, k=10)

HNSW + PQ — `IndexHNSWPQ`

Now combine graph navigation (fast candidate search) with compressed vectors (tiny memory + fast lookups). This is a go-to setting for large, read-heavy workloads (RAG, semantic search, logs).

Speed: HNSW traversal + PQ distance lookups
Memory: much lower than HNSW-Flat
Accuracy: slightly lower than HNSW-Flat, tunable with efSearch, m, nbits

m, nbits = 16, 8
index_hnsw_pq = faiss.IndexHNSWPQ(128, m, nbits)
index_hnsw_pq.hnsw.efConstruction = 40
index_hnsw_pq.hnsw.efSearch = 64
index_hnsw_pq.train(X)      # train codebooks
index_hnsw_pq.add(X)        # store compressed codes
D_hpq, I_hpq = index_hnsw_pq.search(Q, k=10)

Trade-offs at a glance

Rule of thumb: if you’re memory-bound or at 10M+ vectors, HNSW+PQ is usually the sweet spot.

Generated Image September 14, 2025 - 2:00PM.png

Benchmarks

I stress-tested three FAISS index types on a 24 GB RAM machine using ~20 GB worth of synthetic 384-dim embeddings.

Here’s the high-level view of how the benchmarks were calculated:

Index Build

Flat (Exact Search) → stored all vectors in float32.
HNSW → built a multi-layer graph with M=32, efConstruction=40, efSearch=64.
HNSW+PQ → trained PQ (m=24, nbits=8) on a large sample, then built an HNSW graph over compressed codes.

Queries

200 random queries against the dataset.
For cosine similarity, vectors were L2-normalized and searched using Inner Product (IP).
Measured total query latency and average per query latency.

Metrics

Search Time: Wall-clock time to answer all 200 queries.
Memory Usage: ΔRAM observed during index build + queries (approx. process-level memory).
Recall@10: For each approximate method, overlap of retrieved neighbors vs. exact Flat search (gold standard).

Results (Stress Run)

IndexSearch Time (200Q)Memory Δ (GB)Recall@10Flat12.4 s18.7 GB1.000HNSW0.95 s13.2 GB0.962HNSW+PQ0.42 s6.1 GB0.902

🔑 Takeaways

Flat is exact but doesn’t scale (O(N)).
HNSW achieves ~10× faster queries with minimal recall loss.
HNSW+PQ further cuts memory and speeds up distance lookups, trading a bit more accuracy.
On a 24 GB machine, PQ allows fitting many more vectors while keeping queries sub-ms.

The Data Shot

Discussion about this post

Ready for more?