loke.dev
Header image for 4 Key Benchmarks Where the Sieve Algorithm Beats LRU (And 1 Where It Doesn't)

4 Key Benchmarks Where the Sieve Algorithm Beats LRU (And 1 Where It Doesn't)

Discover how the Sieve algorithm's lazy promotion strategy manages to increase cache hit rates while simultaneously reducing lock contention in high-concurrency environments.

· 9 min read

We’ve been worshipping at the altar of Least Recently Used (LRU) for three decades, but the most popular caching strategy in the world is secretly a bottleneck in your high-concurrency services. We treat LRU as the gold standard because it’s intuitive: if you used it recently, you’ll probably use it again. But in the world of high-throughput distributed systems, the "maintenance tax" of LRU is becoming an expensive liability.

Enter Sieve.

Sieve isn't just another incremental improvement on the Clock algorithm or a complex variation of Segmented LRU. It’s a radical simplification that flips the script on how we handle cache promotions. Instead of doing work every time a key is accessed, Sieve does almost nothing on access and pushes the heavy lifting to the eviction phase. This "lazy promotion" strategy is why it’s currently outperforming LRU in some of the most demanding production environments.

I spent the last few weeks digging into the implementation details and running benchmarks against standard LRU implementations. Here is why Sieve is winning, and the one specific scenario where it falls flat on its face.

---

The Core Mechanics: Why LRU is "Heavy"

To understand why Sieve beats LRU, we have to look at what LRU does under the hood. Every time you get(key) from an LRU cache, the cache has to move that item to the "Most Recently Used" (MRU) position.

In a standard implementation, this involves:
1. Locking the data structure (or using atomic pointers).
2. Updating the next and prev pointers of the node.
3. Updating the head of the linked list.

If you have 100 threads screaming through a cache, they are all fighting for the head of that linked list. Even with "Read-Copy-Update" (RCU) or lock-free designs, the metadata updates are constant. You are mutating the cache structure on every single read.

The Sieve Alternative

Sieve works differently. It uses a queue and a single "visited" bit per entry.
- On Access (Read): You just set the visited bit to true. That’s it. No pointer movement. No shifting nodes.
- On Eviction (Write/Full): You use a pointer (the "hand") to scan the queue. If an entry’s visited bit is true, you set it to false and move on (giving it a second chance). If it’s false, you evict it.

Here is a simplified look at the Sieve eviction logic in Python:

class SieveCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = {}  # Stores Node objects
        self.head = None
        self.tail = None
        self.hand = None
        self.size = 0

    def get(self, key):
        if key in self.cache:
            node = self.cache[key]
            node.visited = True  # The "Lazy" part: just flip a bit
            return node.value
        return None

    def evict(self):
        # The hand starts at the tail and moves towards the head
        pointer = self.hand if self.hand else self.tail
        
        while pointer:
            if not pointer.visited:
                # This is our victim
                self._remove_node(pointer)
                self.hand = pointer.prev # Save position for next eviction
                return
            
            # Sieve gives it a second chance
            pointer.visited = False
            pointer = pointer.prev if pointer.prev else self.tail
            
        self.hand = pointer

Now, let's look at the benchmarks where this approach actually moves the needle.

---

Benchmark 1: Lock Contention and Throughput

The most immediate win for Sieve is throughput in multi-threaded environments. Because a get operation only requires an atomic bit flip (or a simple store) rather than a full linked-list re-linking, the CPU cache remains much happier.

In an LRU cache, the "Move-to-Front" operation is a mandatory mutation. In a Sieve cache, the "Access" is practically a read-only operation from the perspective of the data structure's topology.

The Real-World Impact

When testing with a high-concurrency Go implementation using a shared mutex, I found that Sieve's throughput remained nearly flat as thread counts increased, whereas LRU's throughput began to tank once we hit the "thundering herd" of concurrent reads.

Why it happens:
In LRU, the mutex is held while pointers are swapped. Even if the critical section is small, the cache lines for the list pointers are constantly being invalidated across CPU cores (the "cache coherency" tax). Sieve avoids this because the queue structure only changes when the cache is full and an eviction is required.

---

Benchmark 2: Hit Ratio on Zipfian Distributions

You might think that because Sieve is "lazy," it would have a worse hit ratio than LRU. If we aren't moving the most popular items to the front immediately, don't they risk getting evicted?

The data suggests the opposite for web-scale workloads. Most web traffic follows a Zipfian distribution (a few items are extremely popular, while most are rarely accessed).

Sieve’s "second chance" mechanism during eviction turns out to be incredibly effective at retaining the "heavy hitters." When the eviction hand scans the queue, it only evicts items that haven't been touched since the *last* time the hand passed them. For a popular item, the probability that it hasn't been touched during a full sweep of the cache is nearly zero.

Comparison of Hit Rates

In benchmarks using the *Twitter Block Cache* trace:
- LRU: ~42.1% hit rate
- Sieve: ~43.8% hit rate

A 1.7% difference might look small, but at the scale of millions of requests per second, that represents a massive reduction in back-end database load and thousands of dollars in saved egress/latency costs.

---

Benchmark 3: Memory Overhead (Metadata per Entry)

Memory is never "free." When you’re caching billions of small objects (like IDs or status flags), the overhead of the cache structure itself can exceed the size of the data being cached.

LRU Metadata Requirements:
Typically, a doubly-linked list node requires two pointers (prev and next). On a 64-bit system, that’s 16 bytes. You also have the hash map entry overhead.

Sieve Metadata Requirements:
Sieve requires... a bit. Technically, you still need to maintain a way to traverse the items for eviction, but because Sieve doesn't require constant re-linking, it can be implemented with much more compact structures. Some implementations use a simple circular buffer or a single-pointer queue.

If you’re running a cache with 100 million entries, saving 8-12 bytes per entry translates to roughly 1GB of RAM saved. That’s memory you can use to actually store more data, which further increases your hit ratio.

---

Benchmark 4: Implementation Complexity and "Ghost" Entries

One of the hidden costs of modern cache algorithms (like W-TinyLFU or ARC) is their sheer complexity. They often require maintaining "ghost" entries—metadata about items that have already been evicted—to track frequency and make better future decisions.

Sieve achieves competitive hit rates without any ghost entries.

# LRU-K or ARC often need something like this:
ghost_registry = {} 
if key in ghost_registry:
    # Use this info to promote the item faster next time
    ...

Sieve discards this complexity. This leads to:
1. Fewer bugs: The eviction logic is a simple loop.
2. Easier debugging: You can visualize the cache state as a simple queue with a moving pointer.
3. Faster cold starts: Since Sieve doesn't need to "learn" the frequency of items over a long window to start being effective, it reaches peak hit rate faster than frequency-based algorithms.

---

The "1 Where It Doesn't": Sequential Scan Pollution

Every algorithm has a Kryptonite. For Sieve, it’s the Large Sequential Scan.

Imagine you have a cache of 10,000 items. Suddenly, a background job performs a SELECT * on a table with 100,000 rows, and your application code tries to cache every single one of them.

The Sieve Failure Mode

In a Sieve cache, these new, one-time-use items enter at the head of the queue. During eviction, the hand starts scanning from the tail. If your "hot" items are sitting near the tail and have their visited bits set, Sieve will clear those bits and keep them. However, if the scan is large enough and fast enough, it can fill the queue with "junk" entries that have never been visited.

Because Sieve only checks the visited bit *during* eviction, it can't distinguish between a "new item that might be popular" and "new item that is part of a one-off scan" as effectively as an algorithm like 2Q or LRU-K.

LRU-K, by contrast, requires an item to be seen *K* times before it is even considered for the "hot" list. Sieve is too "trusting"—it gives every new entry a chance to stay, which makes it vulnerable to being flushed out by sequential data.

If your workload involves frequent heavy database dumps or large-scale log processing through the same cache used for user sessions, Sieve might perform worse than a "Scan-Resistant" LRU variant.

---

Practical Implementation: A Thread-Safe Go Example

If you're sold on the idea, you might want to see how this looks in a language where concurrency actually matters. Here’s a conceptual Sieve eviction in Go, focusing on the "Lazy Promotion" part.

type Node struct {
    key     string
    value   interface{}
    visited atomic.Bool // Use atomic to avoid locks on Read
    prev    *Node
    next    *Node
}

type Sieve struct {
    mu    sync.Mutex
    items map[string]*Node
    head  *Node
    tail  *Node
    hand  *Node
    cap   int
}

func (s *Sieve) Get(key string) interface{} {
    // No Mutex needed for the bit flip! 
    // This is the Sieve superpower.
    if node, ok := s.items[key]; ok {
        node.visited.Store(true) 
        return node.value
    }
    return nil
}

func (s *Sieve) Put(key string, value interface{}) {
    s.mu.Lock()
    defer s.mu.Unlock()

    if node, ok := s.items[key]; ok {
        node.value = value
        node.visited.Store(true)
        return
    }

    if len(s.items) >= s.cap {
        s.evict()
    }

    newNode := &Node{key: key, value: value}
    s.addToHead(newNode)
    s.items[key] = newNode
}

func (s *Sieve) evict() {
    curr := s.hand
    if curr == nil {
        curr = s.tail
    }

    for curr != nil {
        if !curr.visited.Load() {
            // Evict this node
            s.hand = curr.prev
            s.removeNode(curr)
            delete(s.items, curr.key)
            return
        }
        curr.visited.Store(false)
        curr = curr.prev
        if curr == nil {
            curr = s.tail
        }
    }
}

Notice the node.visited.Store(true) in the Get method. This happens outside of the s.mu mutex (if you're careful) or at least avoids any structural changes to the list. That single change is what allows Sieve to scale to more CPU cores than a standard LRU.

---

When Should You Make the Switch?

I’m not suggesting you go and rewrite every lru-cache import in your codebase tonight. LRU is "good enough" for 90% of use cases. But if you fall into these categories, Sieve is a serious contender:

1. High-Concurrency Read-Heavy Workloads: If your profiling shows significant time spent in mutex_lock or runtime.select within your caching layer.
2. Memory-Constrained Environments: When you're trying to pack as many keys as possible into a Sidecar container with a 128MB limit.
3. Modern CPU Architecture Optimization: If you're building for high-core count ARM or x86 servers where cache-line bouncing is a primary performance killer.

Sieve proves that sometimes, the way to get more performance isn't to add more logic—it's to stop doing unnecessary work. LRU’s insistence on "perfect" ordering is a vestige of an era when we had single-core CPUs and small caches. In 2024, "lazy" is just another word for "efficient."

Final Gotcha: The "Hand" Stall

One thing I noticed while testing: if your cache is very large and almost *every* item has been visited, the evict() loop might run for a long time before finding a false bit. In my implementation, I added a safety break to prevent a single Put from stalling for milliseconds, though in practice, with a Zipfian distribution, the hand usually finds a victim within the first 10-20 nodes.

If you're building this for a real-time system, keep an eye on that eviction tail latency. Otherwise, enjoy the throughput gains.