Latency Engineering

Why std::unordered_map Can Be Bad For Latency

An interactive article about the cost of node-heavy hash maps, cache locality, flat hash maps, prefetch, and where Rust changes the default story.

May 3, 202614 min read

The Obvious Hash Map

Start with the design that looks perfectly reasonable on paper: a bucket array whose collisions turn into linked lists.

If we only care about asymptotic complexity, the requirements are simple. We want a key/value store with average O(1) insert, get, and erase. We also need some way to handle collisions, because two different keys can absolutely land in the same bucket.

The straightforward answer is a bucket array plus linked lists. Hash the key, reduce the hash into a bucket index, walk that bucket's chain, and either append or return the matching entry. It is compact, teachable, and easy to trust.

I only care about insert and get in this walkthrough, because erase does not change the performance story we care about. The important part is the shape of the data structure.

Bucketed chaining

A bucket points into a chain, and the chain points to the next node.

The mechanics are easy to explain because the structure is literally a set of links. That same pointer-heavy shape is also why this design gets unpleasant once latency starts to matter.

Live step

Start with the incoming pair (kiwi, 11).

incoming pair
key = "kiwi"
value = 11
hash -> bucket 5
next = ?
0
1
2
3
4
5
6
7
head node
hash -> bucket 5
"grape"
value = 4
next -> peach
linked node
hash -> bucket 5
"peach"
value = 9
next -> null

From a data-structures class point of view, we are basically done. The map works. It handles collisions. The implementation is not scary. That is exactly why it is so easy to miss what comes next. The problem is not correctness. The problem is hardware.

Cache Locality Is The Real Constraint

If you care about latency, the shape of memory matters as much as the shape of the algorithm.

Cache locality is the property that the data you need next is physically near the data you just touched. CPUs love that. Once a cache line is loaded, nearby bytes become cheap to access. If the next thing you need is somewhere else entirely, the core has to wait while deeper caches or main memory catch up.

The exact numbers depend on the machine, but the shape is stable: L1 is extremely fast, L2 is slower, L3 slower again, and DRAM is a cliff. A single trip to main memory can cost as much as many tight, predictable instructions.

In an HFT market-making system, that difference shows up in places that sound almost boring: looking up an order by id, finding per-symbol state, checking position limits, or locating pricing data before changing an order. None of those operations are complicated on their own. The problem is that they happen on a path that may run for every market update, every trade, and every price change.

Cache hierarchy

A cache miss gets expensive because the data lives farther away.

This picture intentionally collapses the cache stack into one box. The visual job is just to show that missing nearby data forces the request to travel further before the CPU can move on.

CPU

load request

Cache

line 0
line 1
requested line
line 3

Memory

line 120
line 121
requested line
line 123
line 124

Start with the nearby cache

The CPU sends the request to the nearest cache first because that is the cheapest place the data could already be.

Illustrative latency table

L1~1 nsTiny, extremely fast, and where you want hot lookup state to live.
L2~4 nsStill fast, but already meaningfully slower than L1.
L3~12 nsShared, bigger, and much more expensive than the inner caches.
DRAM~80 nsThis is where latency-sensitive code goes to suffer.

Why std::unordered_map Can Be Rough On Latency

The CPU does not experience a lookup as O(1). It experiences it as a sequence of memory touches.

The problem with the classic bucket-plus-node design is not that it is incorrect. The problem is that every lookup can become a little pointer-chasing expedition. First you load the bucket head. Then you load one node. Then maybe another. Then maybe another. Each node may live on a different cache line, with unrelated allocator state sitting in between.

That means the cost of get("orange") is not just hashing plus a few key comparisons. It is hashing plus a bucket load plus several opportunities to miss cache and stall.

In a market maker, that can mean the table you use to track live orders, order handles, or per-instrument state is quietly sitting on the hot path between receiving a market update and deciding whether to change a price in the book. When spreads are tight and queue position matters, repeated little stalls stop being theoretical very quickly.

Why locality is bad

A lookup turns into a linked tour through far-apart heap allocations.

One bucket lights up, then the useful nodes are discovered far away from each other.

headnextnext

CPU core

hash("orange")

Bucket array

the compact part ends here

Scattered heap nodes

the useful nodes are lit, but they are far apart

The important visual is the spacing. The lookup does not move through one dense block of useful state. It lands on a bucket, then on one node, then on another node, with unrelated memory sitting between each stop.

That is why a node-based table can feel fine in algorithm notes and frustrating on a hot path. Each pointer reveals the next address late. The machine keeps discovering the future one miss at a time.

Important nuance

I am using std::unordered_map here as shorthand for the classic node-heavy hash-map shape that many C++ hot paths end up benchmarking against. The precise standard guarantees are not the point. The locality profile is.

Solution, Flat Hash Map

If latency matters, you usually want a flatter table that keeps probes close together.

A flat hash map changes the picture by keeping the table inside one dense array of slots. Instead of jumping from bucket head to separately allocated node to separately allocated node, the lookup starts at a home position inside that array and probes nearby slots.

The real mechanics are simple once you stop imagining linked nodes. First, hash the key. That hash picks a home group, which is just a small fixed-size window of slots in the flat table. The lookup checks that window first instead of following a pointer into the heap.

Then the probe walks forward through nearby slots until one of two things happens: either it finds the matching key, or it hits an empty slot, which proves the key is not present. Insert follows the same path and writes into the first available slot it finds.

In a SwissTable-style design, there is an extra optimization layer: each slot also carries a tiny control byte or fingerprint. The table can scan those bytes quickly, reject most slots immediately, and only do full key comparisons for the few candidates that survive that filter.

The demo below intentionally hides that control-byte detail and shows the simpler memory story first. The important part is that the walk stays local. If you care about low tail latency, this is usually the right default direction: a SwissTable-style or otherwise flat, open-addressing map rather than a node-oriented one.

That is the HFT view of the choice. You are not picking a flat map because it is academically cleaner. You are picking it because order lookups, price bookkeeping, and symbol-state reads should feel like walking one small neighborhood of memory, not going on a scavenger hunt through the heap.

Group probing

The probe stays inside one nearby run of slots.

Hashing picks where probing starts, then the walk stays inside one contiguous stretch instead of bouncing across distant nodes.

home group

the hash chooses where probing starts

contiguous row

the probe moves forward here until match or empty

That still does not magically remove misses. What it changes is what one miss buys you. When a cache line arrives, you get a batch of nearby slots, and in real implementations usually a batch of control bytes too, so the CPU can often do several meaningful checks before it has to fetch anything else.

A flatter layout

One loaded line gives you several nearby places to check.

The same path now lands on a contiguous run, so one fetch brings multiple useful squares close to the CPU.

CPU core

hash("orange")

Home group

this points into one nearby run

Cache lines

Contiguous slot array

the useful squares live together in one contiguous run

One More Step: Prefetch

Prefetch is not a miracle cure. It is the extra gain you get after the layout is already predictable.

Prefetch helps when you can guess the next memory location early enough to ask for it before you need it. That is difficult with a linked structure whose next pointer is only discovered after the current miss resolves.

With a flatter probing scheme, the future is much less mysterious. If the home group does not match, there is a reasonable next place to look. That makes prefetch a useful optimization instead of wishful thinking.

That is the comparison to watch in the demo below. In the node-based version, the next hop is discovered late. In the flat version, the next group is already a reasonable guess. In the prefetched version, that next group can be on the move before the current one is done.

One more step: prefetch

Prefetch helps when the next hop is obvious before you need it.

Linked flow

next hop requested early
prefetch early

group 48

current work

group 49

already incoming

group 50

later

Hit patternprefetched next line

Latency comparison

node-based map82 ns
flat map31 ns
flat map + prefetch23 ns

These are illustrative nanosecond costs, not benchmark claims for a specific machine.

Does Every Use Case Need A Flat Map?

No. Most software is not an exchange, and you should not cargo-cult a low-latency data structure into a cold path.

Use a flatter map first when lookups sit on a critical request path, market-data path, simulation loop, or any other place where the operation happens constantly and tail latency matters.

That includes a lot of market-making infrastructure: order-id maps, per-instrument state tables, and the small bookkeeping lookups that sit around price generation, order updates, and trade handling.

A regular general-purpose map is fine when the lookup is off the hot path, the surrounding work dominates the budget, or you care more about API familiarity and semantics than squeezing out cache misses.

Profile before rewriting because key shape, hash function cost, load factor, allocator behavior, and mutation patterns can all move the bottleneck around.

Rust Changes The Default Story

If you are writing Rust, the standard-library answer already starts much closer to the flat-map camp.

Rust's default std::collections::HashMap is already documented as using quadratic probing with SIMD lookup, which is a very different starting point from the mental model of buckets plus linked lists. The central critique in this article maps much more directly to classic node-based hash maps than it does to Rust's standard one.

That does not mean Rust code is automatically fast. Hashing still costs time, cache misses still hurt, and your access pattern still matters. It just means Rust's default table design is already playing a smarter memory-layout game.

Rust's default hasher also prioritizes robustness against adversarial inputs. If a hot path profiles poorly, table layout and hash choice are two separate questions worth measuring independently.

When std::unordered_map Is A Bad Choice

Not always. But often enough that it should stop being the automatic answer for latency-sensitive C++.

It is a bad choice when a lookup happens for every single message, order, or packet and you care about the slowest few microseconds as much as the average.

It is a bad choice when your keys and values are small enough that memory indirection becomes the dominant cost.

It is a bad choice when the structure is making the CPU rediscover scattered memory over and over again.

And it is usually fine when none of those are true.

Big-O tells you the lookup is constant on average. The CPU wants to know whether the next piece of data is already nearby. That is why the same abstract data structure can feel completely different in production. If latency is the job, locality is the design.

Hash mapsCache localityLow latencyHFTC++Rust