hnsw in zig: graph search as a memory problem

I wrote HNSW in Zig from scratch partly because I wanted an excuse to play with Zig's comptime, explicit allocators, and pleasantly sharp SIMD primitives. The better excuse was understanding the part that usually gets compressed into "then make it fast": how a geometric graph search turns into heaps, visited state, distance kernels, quantized vectors, and cache-sensitive layout.

The spine of the project ended up being two approximations stacked on top of each other.

Algorithmic: do not visit every node.

Architectural: do not make every visited node cost a full float32 comparison.

The first approximation is the paper. The second approximation is where most of the implementation work lives.

A small vocabulary before the first number: HNSW is an approximate nearest-neighbor index, so it trades exact search for a tunable amount of graph exploration. recall@10 is the fraction of the true top-10 neighbors recovered by the approximate top-10; QPS is queries per second; p50 and p99 are median and tail query latency.

The main knobs are M, the graph degree each node tries to keep; ef_construction, the build-time search width used while wiring the graph; and ef_search, the query-time beam width.

The one datapoint to keep in mind: on my Apple M4, SIFT-1M at ef_search=100 gave 0.984 recall@10 at about 20K QPS, single-threaded. Larger ef_search usually means the query touches more of the graph: recall improves, latency goes up.

That is the arc of the post. Exact search gives the correctness baseline and the memory scale. HNSW reduces how many nodes a query inspects. The implementation work is making each inspected node cheap enough.

HNSW In One Page

Exact nearest-neighbor search is simple: compare the query vector with every vector in the database and keep the closest k. On SIFT-1M, one full scan over float32 vectors touches about:

1,000,000 vectors * 128 dims * 4 bytes = 512 MB

That is a useful correctness baseline. It is also a lot of memory traffic per query.

HNSW builds a proximity graph. Vectors are nodes. Nearby vectors are connected by edges. Search starts from an entry point and walks toward nodes closer to the query.

The hierarchy is the main trick. Layer 0 contains all vectors. Higher layers contain progressively fewer vectors, giving the search long-range links before it descends into the dense base graph. It is a skip list for geometry.

A query starts at the current entry point, follows long-range links in sparse upper layers, then spends most of ef_search expanding candidates around the answer in the dense layer-0 graph.

The query path has two phases:

Greedy descent from the top layer with a tiny search budget.
A wider layer-0 beam search controlled by ef_search.

The layer-0 search is a two-heap loop:

candidates: min-heap ordered by distance
results:    max-heap of the best ef results so far
visited:    node ids already seen

while candidates is not empty:
    candidate = pop nearest unexplored node
    if results is full and candidate is worse than worst(results):
        stop

    for neighbor in graph[candidate]:
        if neighbor was already visited:
            continue
        distance = d(query, vector[neighbor])

        if results has room or distance is better than worst(results):
            push neighbor into candidates
            push neighbor into results
            if results is larger than ef_search:
                remove worst(results)

That last admission rule is where ef_search does its work. It caps the result heap, which also controls how wide the frontier is allowed to stay. The early-stop condition is only valid once the heap is full; before that, the search is still collecting enough candidates to make the comparison meaningful.

The visited set in that pseudocode looks mundane, but it is on the hot path. HNSW graphs have cycles, and the same node can be reached through multiple neighbors. Every query needs a cheap way to answer "have I already seen this id?" and then reset that state for the next query. That is the bitset-versus-epoch-array choice later in the post.

The knobs that show up everywhere are small, but they change both the graph and the machine behavior:

knob	mental model	larger usually means
`M`	target graph degree per node	more memory, more build work, often better paths
`M0`	layer-0 degree cap	denser base graph where final search happens
`ef_construction`	candidate beam used while inserting nodes	slower builds, better graph quality
`ef_search`	candidate beam used during queries	more nodes visited, better recall, higher latency

In my implementation, M=16 and layer 0 allows M0=2*M=32 neighbors.

What The Paper Does Not Give You

The HNSW paper gives the graph algorithm: level assignment, greedy descent, bounded layer search, and neighbor selection. A fast implementation still has to make a lot of local decisions.

paper gives	implementation chooses	why it matters
graph search	fixed scratch buffers and two heaps	allocation-free query path
visited set	epoch-stamped array	avoid clearing query state every search
vectors	separate arrays for hot and cold data	keep traversal from dragging labels and metadata
distances	SIMD kernels and quantized representations	make candidate scoring cheap
hierarchy	level-0 and upper-layer layouts	control locality and pointer chasing
algorithm	runtime API over specialized code	keep the outside flexible without slowing the inner loop

A Sanity Check

The tuning run I kept coming back to was SIFT-1M (TFDS docs): 1M base vectors, 10K queries, 128 dimensions, squared L2 distance, and recall measured against the provided ground truth. The index used M=16 and ef_construction=200. Build used 10 threads. Search was single-threaded.

These numbers are a calibration run for the implementation. Cross-library comparison belongs in ann-benchmarks, where the point is to draw recall/QPS curves across implementations under one harness. Here, the sweep anchors the implementation discussion.

recall@100.984

QPS20,019

p5052 us

p9971 us

The ef sweep looked like this:

ef_search	recall@10	QPS	p50	p99
50	0.963	34,510	31 us	57 us
100	0.984	20,019	52 us	71 us
200	0.994	10,916	93 us	128 us

That is the expected HNSW tradeoff. Increasing ef_search lets the query inspect more of the graph. Recall improves and throughput falls.

Single-threaded SIFT-1M sweep on an Apple M4. Higher ef_search buys recall by expanding the layer-0 search.

The rest of the post switches from HNSW as an algorithm to the Zig implementation: how the heaps, visited state, distance kernels, and vector representations show up in the inner loop.

Compile-Time Specialization

The useful Zig part is comptime: the public runtime API can dispatch into a monomorphized inner loop.

The core index type is parameterized by dimension and distance function:

pub fn HnswIndex(comptime dim: u32, comptime Dist: type) type {
    return struct {
        vectors: [][dim]f32,
        // ...
    };
}

The runtime side supports a fixed set of dimensions:

const supported_dims = [_]u32{
    25, 50, 100, 128, 200, 256, 384, 512, 768, 784, 960,
};

Internally, each dimension gets a separate compiled index type. The outside can still choose a dimension at runtime, but the hot loop is not carrying a runtime dim branch.

The distance code then gets to use compile-time dimensions and Zig @Vector operations. The L2 and inner-product kernels share one implementation; target architecture picks a SIMD width, and the loop uses two accumulators to reduce the dependency chain:

var acc0: @Vector(sw, f32) = @splat(0.0);
var acc1: @Vector(sw, f32) = @splat(0.0);

inline for (0..half_iters) |i| {
    const va0: @Vector(sw, f32) = a[(2 * i) * sw ..][0..sw].*;
    const vb0: @Vector(sw, f32) = b[(2 * i) * sw ..][0..sw].*;
    acc0 = accumulate(acc0, va0, vb0);

    const va1: @Vector(sw, f32) = a[(2 * i + 1) * sw ..][0..sw].*;
    const vb1: @Vector(sw, f32) = b[(2 * i + 1) * sw ..][0..sw].*;
    acc1 = accumulate(acc1, va1, vb1);
}

return @reduce(.Add, acc0 + acc1);

Runtime configuration at the outside, specialized code at the inside.

The Hot Loop Is A Memory Path

For a fixed dimension, the main arrays are separate:

flowchart LR
  id["node id"]
  nbr["neighbors_l0: count + up to 32 ids"]
  sq4["sq4_vectors: 64B on SIFT-128"]
  f32["vectors: 512B f32 on SIFT-128"]
  cold["levels + labels: cold during traversal"]
  heap["candidate/result heaps"]

  id --> nbr
  nbr --> sq4
  sq4 --> heap
  heap --> f32
  id -.-> cold

Arrows show the order the hot loop touches each array; dotted means cold state that is not touched during layer-0 traversal.

The layer-0 query loop does not need labels while it is walking the graph. It does not need upper-layer storage after descent. It does not need to use a full 512-byte float32 vector for every approximate comparison if a smaller representation is good enough to guide the walk.

For SIFT-128:

thing	size
float32 vector	512 B
SQ8 vector	128 B
SQ4 vector	64 B
layer-0 links	132 B
visited epoch stamp	2 B/node

The important constraint is that the graph walk is mostly random access. If every candidate forces a scattered 512-byte vector load, the implementation quickly becomes a memory-system problem. SQ4 makes the traversal vector payload 8x smaller. Float32 is saved for the short list of candidates near the end.

The Visited Set Should Not Clear A Million Bits

This is the implementation of the visited set from the search loop. Every layer search needs to remember which node ids it has seen. A normal bitset is compact, but reset is a memset. For 1M nodes, the compact version is about 125 KB touched per layer search.

An epoch array is a generational visited set. Each node stores the last search epoch that touched it. Starting a new search increments one counter. A node is visited if its stored epoch equals the current epoch; marking it visited writes the current epoch into that node's slot. Reset becomes an integer increment, with a full clear only when the counter wraps.

The implementation uses a u16 epoch array instead:

pub inline fn reset(self: *VisitedSet) void {
    self.epoch +%= 1;
    if (self.epoch == 0) {
        @memset(self.entries, 0);
        self.epoch = 1;
    }
}

pub inline fn testAndSet(self: *VisitedSet, id: u32) bool {
    const was_visited = self.entries[id] == self.epoch;
    self.entries[id] = self.epoch;
    return was_visited;
}

That is a simple tradeoff:

approach	reset cost	space
bitset	clear about 125 KB for 1M nodes	125 KB at 1M nodes
epoch array	increment counter, clear only on wrap	2 MB at 1M nodes

I paid memory to avoid touching memory on every query.

Quantized Traversal, Float Rescore

SQ means scalar quantization: each coordinate is quantized independently. During calibration, the index computes a min and max for every dimension over the stored float32 vectors. SQ8 maps each coordinate into one of 256 buckets and stores it in one byte. SQ4 maps each coordinate into one of 16 buckets and stores it in four bits, so two coordinates fit in one byte.

For SIFT-128, SQ4 turns a 128-dimensional float32 vector into:

128 * 4 bits = 512 bits
512 bits / 8 = 64 bytes

The search path below is SQ4 traversal plus float32 rescore. SQ8 support exists in the code, but searchKnn uses SQ4 + f32.

The calibrated query path is:

Upper layers still use the original float32 representation. Once the search reaches layer 0, the query is quantized and most candidate scoring happens against SQ4 vectors. Only the short candidate set is rescored with the original float32 vectors.

The distance loop extracts the high and low nibbles from each byte, computes absolute differences between 4-bit bucket ids, squares them, and accumulates in vectors. It is not the original L2 distance. It is a cheap proxy distance used to steer the graph walk.

The point is bandwidth, not storage compression. The original float32 vectors still stay resident for rescore. The index pays extra memory for a second representation so that the random layer-0 walk touches much less vector payload per candidate.

This is more than a compression trick. HNSW search has two different jobs:

Navigate the graph well enough to find a promising candidate region.
Return the nearest k under the real metric.

SQ4 is used for the first job. Float32 is used again for the second. That split is the part I like: approximate distance is allowed to be a routing signal, but final ranking comes back to the original representation.

Why does that not immediately destroy recall? Because the layer-0 traversal is not trying to prove the final top-10 ordering. It is trying to keep the beam moving toward the right basin of the graph. The graph gives the search redundant local paths, ef_search keeps several candidates alive, and float32 rescore repairs the final ordering among the nodes that made it into the shortlist.

The failure mode is routing, not ranking: if SQ4 points the beam away from the right region, rescore cannot recover a node that was never visited.

Neighbor Selection Is Not Take The Closest M

Everything above is query-time. It assumes the graph is worth walking. Build-time graph quality is decided by insertion search and neighbor selection.

The neighbor-selection heuristic is one of the first places a from-scratch implementation can look correct and still build a weak graph.

After searchLayer returns candidates, the code sorts them by distance from the inserted vector. It then walks candidates in order and rejects a candidate if it is closer to an already selected neighbor than it is to the new vector. Rejected candidates are kept in a pruned list and used as backfill if the diverse set is too small.

The effect is that the graph keeps some geometric diversity instead of filling every neighbor list with redundant points from one local clump. This reads like a heuristic in the paper. It feels much more concrete once the recall curve depends on it.

Reordering The Graph

After building, the index does a BFS reorder over the layer-0 graph. The goal is to move from insertion order to a rough graph-locality order.

before: insertion id order
        graph edges may jump across arrays

after:  BFS-ish id order
        graph neighbors tend to sit closer together

The reorder pass rewrites float32 vectors, labels, levels, layer-0 neighbor ids, and upper-layer neighbor ids. Then it recalibrates the quantized vectors on the reordered float32 data.

The benchmark table does not attribute speedup to reordering. A measured reordering result needs an edge-span histogram plus cache-miss and latency before/after numbers. In this version, the reorder pass is a locality pass in the implementation, not a benchmark result.

What This Version Does Not Try To Be

This implementation is scoped around the fast SIFT-128 path: L2 distance, M=16, SQ4 traversal, and float32 rescore. SQ8 support exists in the code, but it is not on the main searchKnn path. Odd-dimensional SQ4, angular distance, deletes, and broader dimension-specific tuning are the obvious next cuts.

I prefer spelling that out because the interesting part of the project is not pretending to be a production vector database. It is the narrower systems question: once HNSW decides which nodes to inspect, how cheap can each inspected node become?

The Implementation Lesson

The algorithmic approximation is the famous part: do not visit every node. The architectural approximation is the part I had to learn by building it: do not make every visited node cost a full float32 comparison.

That is why this project felt worth doing from scratch. The algorithm forced every abstraction to become concrete: the heap buffers, the visited set, the vector representation, the distance kernel, the graph layout, and the runtime boundary.

At some point a graph algorithm turns into a memory path. That was the part worth building.

from exact search to heaps, epochs, SIMD, quantized traversal, and layout.

Published