I wrote HNSW in Zig from scratch partly because I wanted an excuse to play with Zig's comptime, explicit allocators, and pleasantly sharp SIMD primitives. The better excuse was understanding the part that usually gets compressed into "then make it fast": how a geometric graph search turns into heaps, visited state, distance kernels, quantized vectors, and cache-sensitive layout.
The spine of the project ended up being two approximations stacked on top of each other.
Algorithmic: do not visit every node.
Architectural: do not make every visited node cost a full float32 comparison.
The first approximation is the paper. The second approximation is where most of the implementation work lives.
A small vocabulary before the first number: HNSW is an approximate nearest-neighbor index, so it trades exact search for a tunable amount of graph exploration. recall@10 is the fraction of the true top-10 neighbors recovered by the approximate top-10; QPS is queries per second; p50 and p99 are median and tail query latency.
The main knobs are M, the graph degree each node tries to keep; ef_construction, the build-time search width used while wiring the graph; and ef_search, the query-time beam width.
The one datapoint to keep in mind: on my Apple M4, SIFT-1M at ef_search=100 gave 0.984 recall@10 at about 20K QPS, single-threaded. Larger ef_search usually means the query touches more of the graph: recall improves, latency goes up.
That is the arc of the post. Exact search gives the correctness baseline and the memory scale. HNSW reduces how many nodes a query inspects. The implementation work is making each inspected node cheap enough.
HNSW In One Page
Exact nearest-neighbor search is simple: compare the query vector with every vector in the database and keep the closest k. On SIFT-1M, one full scan over float32 vectors touches about:
1,000,000 vectors * 128 dims * 4 bytes = 512 MB
That is a useful correctness baseline. It is also a lot of memory traffic per query.
HNSW builds a proximity graph. Vectors are nodes. Nearby vectors are connected by edges. Search starts from an entry point and walks toward nodes closer to the query.
The hierarchy is the main trick. Layer 0 contains all vectors. Higher layers contain progressively fewer vectors, giving the search long-range links before it descends into the dense base graph. It is a skip list for geometry.
ef_search expanding candidates around the answer in the dense layer-0 graph.The query path has two phases:
- Greedy descent from the top layer with a tiny search budget.
- A wider layer-0 beam search controlled by
ef_search.
The layer-0 search is a two-heap loop:
candidates: min-heap ordered by distance
results: max-heap of the best ef results so far
visited: node ids already seen
while candidates is not empty:
candidate = pop nearest unexplored node
if results is full and candidate is worse than worst(results):
stop
for neighbor in graph[candidate]:
if neighbor was already visited:
continue
distance = d(query, vector[neighbor])
if results has room or distance is better than worst(results):
push neighbor into candidates
push neighbor into results
if results is larger than ef_search:
remove worst(results)
That last admission rule is where ef_search does its work. It caps the result heap, which also controls how wide the frontier is allowed to stay. The early-stop condition is only valid once the heap is full; before that, the search is still collecting enough candidates to make the comparison meaningful.
The visited set in that pseudocode looks mundane, but it is on the hot path. HNSW graphs have cycles, and the same node can be reached through multiple neighbors. Every query needs a cheap way to answer "have I already seen this id?" and then reset that state for the next query. That is the bitset-versus-epoch-array choice later in the post.
The knobs that show up everywhere are small, but they change both the graph and the machine behavior:
| knob | mental model | larger usually means |
|---|---|---|
M | target graph degree per node | more memory, more build work, often better paths |
M0 | layer-0 degree cap | denser base graph where final search happens |
ef_construction | candidate beam used while inserting nodes | slower builds, better graph quality |
ef_search | candidate beam used during queries | more nodes visited, better recall, higher latency |
In my implementation, M=16 and layer 0 allows M0=2*M=32 neighbors.
What The Paper Does Not Give You
The HNSW paper gives the graph algorithm: level assignment, greedy descent, bounded layer search, and neighbor selection. A fast implementation still has to make a lot of local decisions.
| paper gives | implementation chooses | why it matters |
|---|---|---|
| graph search | fixed scratch buffers and two heaps | allocation-free query path |
| visited set | epoch-stamped array | avoid clearing query state every search |
| vectors | separate arrays for hot and cold data | keep traversal from dragging labels and metadata |
| distances | SIMD kernels and quantized representations | make candidate scoring cheap |
| hierarchy | level-0 and upper-layer layouts | control locality and pointer chasing |
| algorithm | runtime API over specialized code | keep the outside flexible without slowing the inner loop |
A Sanity Check
The tuning run I kept coming back to was SIFT-1M (TFDS docs): 1M base vectors, 10K queries, 128 dimensions, squared L2 distance, and recall measured against the provided ground truth. The index used M=16 and ef_construction=200. Build used 10 threads. Search was single-threaded.
These numbers are a calibration run for the implementation. Cross-library comparison belongs in ann-benchmarks, where the point is to draw recall/QPS curves across implementations under one harness. Here, the sweep anchors the implementation discussion.
The ef sweep looked like this:
| ef_search | recall@10 | QPS | p50 | p99 |
|---|---|---|---|---|
| 50 | 0.963 | 34,510 | 31 us | 57 us |
| 100 | 0.984 | 20,019 | 52 us | 71 us |
| 200 | 0.994 | 10,916 | 93 us | 128 us |
That is the expected HNSW tradeoff. Increasing ef_search lets the query inspect more of the graph. Recall improves and throughput falls.
ef_search buys recall by expanding the layer-0 search.The rest of the post switches from HNSW as an algorithm to the Zig implementation: how the heaps, visited state, distance kernels, and vector representations show up in the inner loop.
Compile-Time Specialization
The useful Zig part is comptime: the public runtime API can dispatch into a monomorphized inner loop.
The core index type is parameterized by dimension and distance function:
pub fn HnswIndex(comptime dim: u32, comptime Dist: type) type {
return struct {
vectors: [][dim]f32,
// ...
};
}
The runtime side supports a fixed set of dimensions:
const supported_dims = [_]u32{
25, 50, 100, 128, 200, 256, 384, 512, 768, 784, 960,
};
Internally, each dimension gets a separate compiled index type. The outside can still choose a dimension at runtime, but the hot loop is not carrying a runtime dim branch.
The distance code then gets to use compile-time dimensions and Zig @Vector operations. The L2 and inner-product kernels share one implementation; target architecture picks a SIMD width, and the loop uses two accumulators to reduce the dependency chain:
var acc0: @Vector(sw, f32) = @splat(0.0);
var acc1: @Vector(sw, f32) = @splat(0.0);
inline for (0..half_iters) |i| {
const va0: @Vector(sw, f32) = a[(2 * i) * sw ..][0..sw].*;
const vb0: @Vector(sw, f32) = b[(2 * i) * sw ..][0..sw].*;
acc0 = accumulate(acc0, va0, vb0);
const va1: @Vector(sw, f32) = a[(2 * i + 1) * sw ..][0..sw].*;
const vb1: @Vector(sw, f32) = b[(2 * i + 1) * sw ..][0..sw].*;
acc1 = accumulate(acc1, va1, vb1);
}
return @reduce(.Add, acc0 + acc1);
Runtime configuration at the outside, specialized code at the inside.
The Hot Loop Is A Memory Path
For a fixed dimension, the main arrays are separate:
flowchart LR id["node id"] nbr["neighbors_l0: count + up to 32 ids"] sq4["sq4_vectors: 64B on SIFT-128"] f32["vectors: 512B f32 on SIFT-128"] cold["levels + labels: cold during traversal"] heap["candidate/result heaps"] id --> nbr nbr --> sq4 sq4 --> heap heap --> f32 id -.-> cold
Arrows show the order the hot loop touches each array; dotted means cold state that is not touched during layer-0 traversal.
The layer-0 query loop does not need labels while it is walking the graph. It does not need upper-layer storage after descent. It does not need to use a full 512-byte float32 vector for every approximate comparison if a smaller representation is good enough to guide the walk.
For SIFT-128:
| thing | size |
|---|---|
| float32 vector | 512 B |
| SQ8 vector | 128 B |
| SQ4 vector | 64 B |
| layer-0 links | 132 B |
| visited epoch stamp | 2 B/node |
The important constraint is that the graph walk is mostly random access. If every candidate forces a scattered 512-byte vector load, the implementation quickly becomes a memory-system problem. SQ4 makes the traversal vector payload 8x smaller. Float32 is saved for the short list of candidates near the end.
The Visited Set Should Not Clear A Million Bits
This is the implementation of the visited set from the search loop. Every layer search needs to remember which node ids it has seen. A normal bitset is compact, but reset is a memset. For 1M nodes, the compact version is about 125 KB touched per layer search.
An epoch array is a generational visited set. Each node stores the last search epoch that touched it. Starting a new search increments one counter. A node is visited if its stored epoch equals the current epoch; marking it visited writes the current epoch into that node's slot. Reset becomes an integer increment, with a full clear only when the counter wraps.
The implementation uses a u16 epoch array instead:
pub inline fn reset(self: *VisitedSet) void {
self.epoch +%= 1;
if (self.epoch == 0) {
@memset(self.entries, 0);
self.epoch = 1;
}
}
pub inline fn testAndSet(self: *VisitedSet, id: u32) bool {
const was_visited = self.entries[id] == self.epoch;
self.entries[id] = self.epoch;
return was_visited;
}
That is a simple tradeoff:
| approach | reset cost | space |
|---|---|---|
| bitset | clear about 125 KB for 1M nodes | 125 KB at 1M nodes |
| epoch array | increment counter, clear only on wrap | 2 MB at 1M nodes |
I paid memory to avoid touching memory on every query.
Quantized Traversal, Float Rescore
SQ means scalar quantization: each coordinate is quantized independently. During calibration, the index computes a min and max for every dimension over the stored float32 vectors. SQ8 maps each coordinate into one of 256 buckets and stores it in one byte. SQ4 maps each coordinate into one of 16 buckets and stores it in four bits, so two coordinates fit in one byte.
For SIFT-128, SQ4 turns a 128-dimensional float32 vector into:
128 * 4 bits = 512 bits
512 bits / 8 = 64 bytes
The search path below is SQ4 traversal plus float32 rescore. SQ8 support exists in the code, but searchKnn uses SQ4 + f32.
The calibrated query path is:
The distance loop extracts the high and low nibbles from each byte, computes absolute differences between 4-bit bucket ids, squares them, and accumulates in vectors. It is not the original L2 distance. It is a cheap proxy distance used to steer the graph walk.
The point is bandwidth, not storage compression. The original float32 vectors still stay resident for rescore. The index pays extra memory for a second representation so that the random layer-0 walk touches much less vector payload per candidate.
This is more than a compression trick. HNSW search has two different jobs:
- Navigate the graph well enough to find a promising candidate region.
- Return the nearest
kunder the real metric.
SQ4 is used for the first job. Float32 is used again for the second. That split is the part I like: approximate distance is allowed to be a routing signal, but final ranking comes back to the original representation.
Why does that not immediately destroy recall? Because the layer-0 traversal is not trying to prove the final top-10 ordering. It is trying to keep the beam moving toward the right basin of the graph. The graph gives the search redundant local paths, ef_search keeps several candidates alive, and float32 rescore repairs the final ordering among the nodes that made it into the shortlist.
The failure mode is routing, not ranking: if SQ4 points the beam away from the right region, rescore cannot recover a node that was never visited.
Neighbor Selection Is Not Take The Closest M
Everything above is query-time. It assumes the graph is worth walking. Build-time graph quality is decided by insertion search and neighbor selection.
The neighbor-selection heuristic is one of the first places a from-scratch implementation can look correct and still build a weak graph.
After searchLayer returns candidates, the code sorts them by distance from the inserted vector. It then walks candidates in order and rejects a candidate if it is closer to an already selected neighbor than it is to the new vector. Rejected candidates are kept in a pruned list and used as backfill if the diverse set is too small.
The effect is that the graph keeps some geometric diversity instead of filling every neighbor list with redundant points from one local clump. This reads like a heuristic in the paper. It feels much more concrete once the recall curve depends on it.
Reordering The Graph
After building, the index does a BFS reorder over the layer-0 graph. The goal is to move from insertion order to a rough graph-locality order.
before: insertion id order
graph edges may jump across arrays
after: BFS-ish id order
graph neighbors tend to sit closer together
The reorder pass rewrites float32 vectors, labels, levels, layer-0 neighbor ids, and upper-layer neighbor ids. Then it recalibrates the quantized vectors on the reordered float32 data.
The benchmark table does not attribute speedup to reordering. A measured reordering result needs an edge-span histogram plus cache-miss and latency before/after numbers. In this version, the reorder pass is a locality pass in the implementation, not a benchmark result.
What This Version Does Not Try To Be
This implementation is scoped around the fast SIFT-128 path: L2 distance, M=16, SQ4 traversal, and float32 rescore. SQ8 support exists in the code, but it is not on the main searchKnn path. Odd-dimensional SQ4, angular distance, deletes, and broader dimension-specific tuning are the obvious next cuts.
I prefer spelling that out because the interesting part of the project is not pretending to be a production vector database. It is the narrower systems question: once HNSW decides which nodes to inspect, how cheap can each inspected node become?
The Implementation Lesson
The algorithmic approximation is the famous part: do not visit every node. The architectural approximation is the part I had to learn by building it: do not make every visited node cost a full float32 comparison.
That is why this project felt worth doing from scratch. The algorithm forced every abstraction to become concrete: the heap buffers, the visited set, the vector representation, the distance kernel, the graph layout, and the runtime boundary.
At some point a graph algorithm turns into a memory path. That was the part worth building.