The Obvious Hash Map
Start with the design that looks perfectly reasonable on paper: a bucket array whose collisions turn into linked lists.
If we only care about asymptotic complexity, the requirements are simple. We want a key/value store with average O(1) insert, get, and erase. We also need some way to handle collisions, because two different keys can absolutely land in the same bucket.
The straightforward answer is a bucket array plus linked lists. Hash the key, reduce the hash into a bucket index, walk that bucket's chain, and either append or return the matching entry. It is compact, teachable, and easy to trust.
I only care about insert and get in this walkthrough, because erase does not change the performance story we care about. The important part is the shape of the data structure.
Bucketed chaining
A bucket points into a chain, and the chain points to the next node.
The mechanics are easy to explain because the structure is literally a set of links. That same pointer-heavy shape is also why this design gets unpleasant once latency starts to matter.
Live step
Start with the incoming pair (kiwi, 11).
From a data-structures class point of view, we are basically done. The map works. It handles collisions. The implementation is not scary. That is exactly why it is so easy to miss what comes next. The problem is not correctness. The problem is hardware.
Cache Locality Is The Real Constraint
If you care about latency, the shape of memory matters as much as the shape of the algorithm.
Cache locality is the property that the data you need next is physically near the data you just touched. CPUs love that. Once a cache line is loaded, nearby bytes become cheap to access. If the next thing you need is somewhere else entirely, the core has to wait while deeper caches or main memory catch up.
The exact numbers depend on the machine, but the shape is stable: L1 is extremely fast, L2 is slower, L3 slower again, and DRAM is a cliff. A single trip to main memory can cost as much as many tight, predictable instructions.
In an HFT market-making system, that difference shows up in places that sound almost boring: looking up an order by id, finding per-symbol state, checking position limits, or locating pricing data before changing an order. None of those operations are complicated on their own. The problem is that they happen on a path that may run for every market update, every trade, and every price change.
Cache hierarchy
A cache miss gets expensive because the data lives farther away.
This picture intentionally collapses the cache stack into one box. The visual job is just to show that missing nearby data forces the request to travel further before the CPU can move on.
CPU
Cache
Memory
Start with the nearby cache
The CPU sends the request to the nearest cache first because that is the cheapest place the data could already be.
Illustrative latency table
Why std::unordered_map Can Be Rough On Latency
The CPU does not experience a lookup as O(1). It experiences it as a sequence of memory touches.
The problem with the classic bucket-plus-node design is not that it is incorrect. The problem is that every lookup can become a little pointer-chasing expedition. First you load the bucket head. Then you load one node. Then maybe another. Then maybe another. Each node may live on a different cache line, with unrelated allocator state sitting in between.
That means the cost of get("orange") is not just hashing plus a few key comparisons. It is hashing plus a bucket load plus several opportunities to miss cache and stall.
In a market maker, that can mean the table you use to track live orders, order handles, or per-instrument state is quietly sitting on the hot path between receiving a market update and deciding whether to change a price in the book. When spreads are tight and queue position matters, repeated little stalls stop being theoretical very quickly.
Why locality is bad
A lookup turns into a linked tour through far-apart heap allocations.
One bucket lights up, then the useful nodes are discovered far away from each other.
CPU core
Bucket array
the compact part ends here
Scattered heap nodes
the useful nodes are lit, but they are far apart
The important visual is the spacing. The lookup does not move through one dense block of useful state. It lands on a bucket, then on one node, then on another node, with unrelated memory sitting between each stop.
That is why a node-based table can feel fine in algorithm notes and frustrating on a hot path. Each pointer reveals the next address late. The machine keeps discovering the future one miss at a time.
Important nuance
I am using std::unordered_map here as shorthand for the classic node-heavy hash-map shape that many C++ hot paths end up benchmarking against. The precise standard guarantees are not the point. The locality profile is.
Solution, Flat Hash Map
If latency matters, you usually want a flatter table that keeps probes close together.
A flat hash map changes the picture by keeping the table inside one dense array of slots. Instead of jumping from bucket head to separately allocated node to separately allocated node, the lookup starts at a home position inside that array and probes nearby slots.
The real mechanics are simple once you stop imagining linked nodes. First, hash the key. That hash picks a home group, which is just a small fixed-size window of slots in the flat table. The lookup checks that window first instead of following a pointer into the heap.
Then the probe walks forward through nearby slots until one of two things happens: either it finds the matching key, or it hits an empty slot, which proves the key is not present. Insert follows the same path and writes into the first available slot it finds.
In a SwissTable-style design, there is an extra optimization layer: each slot also carries a tiny control byte or fingerprint. The table can scan those bytes quickly, reject most slots immediately, and only do full key comparisons for the few candidates that survive that filter.
The demo below intentionally hides that control-byte detail and shows the simpler memory story first. The important part is that the walk stays local. If you care about low tail latency, this is usually the right default direction: a SwissTable-style or otherwise flat, open-addressing map rather than a node-oriented one.
That is the HFT view of the choice. You are not picking a flat map because it is academically cleaner. You are picking it because order lookups, price bookkeeping, and symbol-state reads should feel like walking one small neighborhood of memory, not going on a scavenger hunt through the heap.
Group probing
The probe stays inside one nearby run of slots.
Hashing picks where probing starts, then the walk stays inside one contiguous stretch instead of bouncing across distant nodes.
home group
the hash chooses where probing starts
contiguous row
the probe moves forward here until match or empty
That still does not magically remove misses. What it changes is what one miss buys you. When a cache line arrives, you get a batch of nearby slots, and in real implementations usually a batch of control bytes too, so the CPU can often do several meaningful checks before it has to fetch anything else.
A flatter layout
One loaded line gives you several nearby places to check.
The same path now lands on a contiguous run, so one fetch brings multiple useful squares close to the CPU.
CPU core
Home group
this points into one nearby run
Cache lines
Contiguous slot array
the useful squares live together in one contiguous run
One More Step: Prefetch
Prefetch is not a miracle cure. It is the extra gain you get after the layout is already predictable.
Prefetch helps when you can guess the next memory location early enough to ask for it before you need it. That is difficult with a linked structure whose next pointer is only discovered after the current miss resolves.
With a flatter probing scheme, the future is much less mysterious. If the home group does not match, there is a reasonable next place to look. That makes prefetch a useful optimization instead of wishful thinking.
That is the comparison to watch in the demo below. In the node-based version, the next hop is discovered late. In the flat version, the next group is already a reasonable guess. In the prefetched version, that next group can be on the move before the current one is done.
One more step: prefetch
Prefetch helps when the next hop is obvious before you need it.
Linked flow
next hop requested earlygroup 48
current work
group 49
already incoming
group 50
later
Latency comparison
These are illustrative nanosecond costs, not benchmark claims for a specific machine.
Does Every Use Case Need A Flat Map?
No. Most software is not an exchange, and you should not cargo-cult a low-latency data structure into a cold path.
Use a flatter map first when lookups sit on a critical request path, market-data path, simulation loop, or any other place where the operation happens constantly and tail latency matters.
That includes a lot of market-making infrastructure: order-id maps, per-instrument state tables, and the small bookkeeping lookups that sit around price generation, order updates, and trade handling.
A regular general-purpose map is fine when the lookup is off the hot path, the surrounding work dominates the budget, or you care more about API familiarity and semantics than squeezing out cache misses.
Profile before rewriting because key shape, hash function cost, load factor, allocator behavior, and mutation patterns can all move the bottleneck around.
Rust Changes The Default Story
If you are writing Rust, the standard-library answer already starts much closer to the flat-map camp.
Rust's default std::collections::HashMap is already documented as using quadratic probing with SIMD lookup, which is a very different starting point from the mental model of buckets plus linked lists. The central critique in this article maps much more directly to classic node-based hash maps than it does to Rust's standard one.
That does not mean Rust code is automatically fast. Hashing still costs time, cache misses still hurt, and your access pattern still matters. It just means Rust's default table design is already playing a smarter memory-layout game.
Rust's default hasher also prioritizes robustness against adversarial inputs. If a hot path profiles poorly, table layout and hash choice are two separate questions worth measuring independently.
When std::unordered_map Is A Bad Choice
Not always. But often enough that it should stop being the automatic answer for latency-sensitive C++.
It is a bad choice when a lookup happens for every single message, order, or packet and you care about the slowest few microseconds as much as the average.
It is a bad choice when your keys and values are small enough that memory indirection becomes the dominant cost.
It is a bad choice when the structure is making the CPU rediscover scattered memory over and over again.
And it is usually fine when none of those are true.
Big-O tells you the lookup is constant on average. The CPU wants to know whether the next piece of data is already nearby. That is why the same abstract data structure can feel completely different in production. If latency is the job, locality is the design.