
The Cache-Line War: How Atomic Contention Sabotages WebAssembly Performance
A deep dive into the 'False Sharing' phenomenon in browser-based shared memory and how it can make multi-threaded WebAssembly slower than a single-threaded loop.
Have you ever spent days meticulously parallelizing a heavy computational task in WebAssembly, only to find that the eight-threaded version is actually *slower* than the original single-threaded loop?
It feels like a betrayal. You’ve done everything right: you’ve moved the heavy lifting into Web Workers, you’ve initialized a SharedArrayBuffer to avoid expensive data cloning, and you’re using Atomics to ensure thread safety. On paper, throughput should scale almost linearly with the number of cores. Instead, your CPU fans are screaming while your throughput stays flat—or worse, plummets.
The culprit is usually not your logic, but a hardware-level phenomenon known as False Sharing. In the world of high-performance WebAssembly (Wasm), this is the invisible "Cache-Line War" that sabotages even the most elegant multithreaded designs.
The Illusion of Granularity
When we write code, we think in terms of variables. An i32 is four bytes. A f64 is eight. We imagine that if Thread A is writing to an integer at memory address 0x100 and Thread B is writing to an integer at 0x104, they are operating in total isolation.
The CPU hardware sees things differently.
Modern CPUs don't fetch data from main memory one byte or one integer at a time. That would be incredibly inefficient. Instead, they fetch data in "Cache Lines"—typically 64 bytes at a time. When your Wasm code accesses a single 4-byte integer, the CPU pulls that entire 64-byte chunk into its L1 cache.
This 64-byte chunk is the smallest unit of ownership in the CPU’s cache coherency protocol. This is where the war begins.
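To make the granularity concrete, here is a tiny sketch (using the hypothetical addresses from the example above) showing that two "independent" integers can map to the same 64-byte line:

```rust
// Which cache line does an address belong to? Just integer-divide by
// the line size. Addresses here are the hypothetical ones from the text.
const CACHE_LINE: usize = 64;

fn cache_line_index(addr: usize) -> usize {
    addr / CACHE_LINE
}

fn main() {
    let thread_a_var = 0x100; // Thread A's 4-byte integer
    let thread_b_var = 0x104; // Thread B's 4-byte integer, 4 bytes away
    // Both fall on cache line 4 (0x100 / 64), so the hardware treats
    // them as one unit of ownership.
    assert_eq!(cache_line_index(thread_a_var), cache_line_index(thread_b_var));
}
```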
The MESI Protocol: The Rules of Engagement
To understand why your Wasm threads are fighting, you need to understand the MESI (Modified, Exclusive, Shared, Invalid) protocol.
When Thread A (on Core 1) reads a memory address, that cache line is marked as Shared. If Thread B (on Core 2) reads the same address, it also has a Shared copy. But the moment Thread A uses an Atomics.store or Atomics.compareExchange to update its value, it must become the Exclusive owner of that entire 64-byte line.
To do this, Core 1 sends an "Invalidate" signal to Core 2. Core 2 must then dump its copy of that cache line. If Thread B then wants to update its own (totally different!) variable that happens to live within those same 64 bytes, it must wait for Core 1 to write its data back to a lower cache level, then fetch the line again, invalidating Core 1’s copy in the process.
This is False Sharing. Two threads are fighting for ownership of a cache line even though they aren't actually touching the same data. They are just neighbors living in the same 64-byte apartment building, and only one person is allowed to have the key at a time.
Seeing the Sabotage in Code
Let’s look at a "naive" implementation of a multi-threaded counter in Rust, intended to be compiled to wasm32-unknown-unknown (or wasm32-wasi-threads).
Imagine we want to track two different metrics—say, how many "Red" pixels and "Blue" pixels we’ve processed in an image.
use std::sync::atomic::{AtomicU32, Ordering};

// This struct is 8 bytes total.
// red_count and blue_count will almost certainly reside on the same cache line.
pub struct PixelStats {
    pub red_count: AtomicU32,
    pub blue_count: AtomicU32,
}

// Imagine this is called by multiple Web Workers simultaneously
pub fn increment_stats(stats: &PixelStats, is_red: bool) {
    if is_red {
        stats.red_count.fetch_add(1, Ordering::SeqCst);
    } else {
        stats.blue_count.fetch_add(1, Ordering::SeqCst);
    }
}

In a single-threaded environment, this is perfect. In a multi-threaded Wasm environment where Thread 1 is incrementing red_count and Thread 2 is incrementing blue_count, the two threads will spend the bulk of their time "bouncing" the cache line back and forth between CPU cores.
The AtomicU32 operations, which are supposed to be fast hardware instructions, become bottlenecked by the interconnect bus between cores.
Why is this worse in the Browser?
You might wonder why we talk about this specifically in the context of WebAssembly.
In native C++ or Rust development on Linux/Windows, we have powerful tools like perf or Intel VTune that can literally point to a line of code and say, "You have a 40% cache-miss rate here due to contention."
In the browser, we are blind.
The Chrome DevTools or Firefox Profiler will show you that your Web Workers are "Busy" and consuming 100% CPU, but they won't tell you *why*. You’ll see high execution time on the atomic instruction, and you might assume the math is just slow. Because Wasm runs in a sandbox, we don't have access to the hardware performance counters that reveal cache-line bouncing.
This makes False Sharing a "silent killer" of Wasm performance.
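Without hardware counters, one pragmatic workaround is a differential microbenchmark: run the same atomic workload once with adjacent counters and once with padded ones, and compare. A large gap is strong circumstantial evidence of false sharing. The sketch below uses native std::thread for brevity—in the browser you would spawn Web Workers over a SharedArrayBuffer, but the principle is identical. All names are illustrative, and exact ratios vary by machine.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::time::{Duration, Instant};

#[repr(align(64))]
struct Padded(AtomicU32);

// Time `iters` increments on each of two counters from two threads.
fn bench(a: &AtomicU32, b: &AtomicU32, iters: u32) -> Duration {
    let start = Instant::now();
    std::thread::scope(|s| {
        s.spawn(|| {
            for _ in 0..iters {
                a.fetch_add(1, Ordering::Relaxed);
            }
        });
        s.spawn(|| {
            for _ in 0..iters {
                b.fetch_add(1, Ordering::Relaxed);
            }
        });
    });
    start.elapsed()
}

fn main() {
    const N: u32 = 1_000_000;

    // Adjacent counters: almost certainly on the same cache line.
    let pair = [AtomicU32::new(0), AtomicU32::new(0)];
    let adjacent = bench(&pair[0], &pair[1], N);

    // Padded counters: guaranteed different cache lines.
    let (p, q) = (Padded(AtomicU32::new(0)), Padded(AtomicU32::new(0)));
    let padded = bench(&p.0, &q.0, N);

    println!("adjacent: {:?}, padded: {:?}", adjacent, padded);
}
```

On most multi-core machines the adjacent version is measurably slower; if the two timings are close, contention is probably not your bottleneck.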
The Solution: Strategic Padding
The fix for False Sharing is to ensure that variables modified by different threads reside on different cache lines. This means we need to "waste" some memory to gain speed—a classic trade-off.
In Rust, we can use the #[repr(align(64))] attribute to force a variable or a struct to start at the beginning of a new cache line and occupy at least 64 bytes.
Here is the "War-Ready" version of our stats counter:
use std::sync::atomic::{AtomicU32, Ordering};

#[repr(align(64))]
pub struct PaddedAtomic {
    val: AtomicU32,
}

impl PaddedAtomic {
    pub fn new(initial: u32) -> Self {
        Self { val: AtomicU32::new(initial) }
    }
}

pub struct OptimizedPixelStats {
    // Each of these now occupies 64 bytes of space.
    // They are guaranteed to be on different cache lines.
    pub red_count: PaddedAtomic,
    pub blue_count: PaddedAtomic,
}

pub fn increment_optimized(stats: &OptimizedPixelStats, is_red: bool) {
    if is_red {
        stats.red_count.val.fetch_add(1, Ordering::SeqCst);
    } else {
        stats.blue_count.val.fetch_add(1, Ordering::SeqCst);
    }
}

By adding this padding, Thread 1 can hammer red_count on Core 1 while Thread 2 hammers blue_count on Core 2. Since they own different cache lines, the hardware doesn't need to synchronize anything between the cores. They run at full speed.
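You can verify that the attribute does what you expect with a quick layout check—#[repr(align(64))] raises the alignment, and Rust rounds a type's size up to a multiple of its alignment, so each padded value really does own a full line:

```rust
use std::mem::{align_of, size_of};
use std::sync::atomic::AtomicU32;

#[repr(align(64))]
pub struct PaddedAtomic {
    val: AtomicU32,
}

fn main() {
    // Alignment is forced to 64, and size is rounded up to match,
    // so consecutive PaddedAtomics can never share a cache line.
    assert_eq!(align_of::<PaddedAtomic>(), 64);
    assert_eq!(size_of::<PaddedAtomic>(), 64);
}
```

Putting assertions like these in a unit test is cheap insurance against a refactor silently shrinking the padding.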
Real-World Performance Impact
I recently worked on a Wasm-based physics engine. We had an array of "Joint" constraints, and we tried to process them in parallel. Initially, the joints were stored in a tight array:
// The "Slow" Way
struct Joint {
    tension: AtomicF32,     // Note: std has no atomic floats; our AtomicF32
    compression: AtomicF32, // came from a helper crate wrapping AtomicU32 bits.
}

let joints: Vec<Joint> = ...;

When we ran this across 4 threads, it was 1.4x slower than the single-threaded version. The contention was so high that the CPU was spending more time shuffling cache lines than doing physics. After restructuring the data to ensure that joints processed by different threads were separated by 64-byte boundaries, the performance jumped to 3.2x faster than the single-threaded version.
That is a massive swing based entirely on how data is laid out in memory, without changing a single line of logic.
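The restructuring idea can be sketched without any engine internals: instead of interleaving threads across adjacent Joints, give each thread a contiguous slice whose length is rounded up to a whole number of cache lines, so no two threads ever write into the same 64-byte region. The names and numbers below are illustrative, not the actual engine code.

```rust
const CACHE_LINE: usize = 64;

#[derive(Clone, Copy, Default)]
struct Joint {
    tension: f32,     // 8 bytes per Joint, so 8 Joints fit in
    compression: f32, // one 64-byte cache line.
}

// How many Joints should each thread own, rounded up so every
// thread's chunk starts and ends on a cache-line boundary?
fn chunk_len(total: usize, threads: usize) -> usize {
    let per_line = CACHE_LINE / std::mem::size_of::<Joint>(); // 8
    let raw = (total + threads - 1) / threads;                // ceil division
    ((raw + per_line - 1) / per_line) * per_line              // round up to lines
}

fn main() {
    // 100 joints across 4 threads: 25 each, rounded up to 32 (4 full lines).
    assert_eq!(chunk_len(100, 4), 32);
}
```

With chunks sized this way, each thread's writes stay inside lines it owns exclusively, and the interleaving contention disappears.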
The JavaScript Side of the Fence
If you aren't using a language like Rust or C++ that handles alignment for you, and you're working directly with SharedArrayBuffer in JavaScript, you have to do the math manually.
If you have a Uint32Array backed by a SharedArrayBuffer, don't put thread-specific globals at indices 0, 1, 2, 3. Instead, space them out.
// BAD: Indices 0 and 1 are in the same 64-byte chunk (offset 0 and 4)
const shared = new Uint32Array(new SharedArrayBuffer(1024));
const THREAD_1_COUNT_IDX = 0;
const THREAD_2_COUNT_IDX = 1;
// GOOD: Space them by 16 elements (16 * 4 bytes = 64 bytes)
const THREAD_1_COUNT_IDX = 0;
const THREAD_2_COUNT_IDX = 16;
const THREAD_3_COUNT_IDX = 32;
// Now, when Worker 1 does this:
Atomics.add(shared, THREAD_1_COUNT_IDX, 1);
// It won't invalidate the cache line for Worker 2 doing this:
Atomics.add(shared, THREAD_2_COUNT_IDX, 1);

A Gotcha: The Browser's Internal Padding
It is worth noting that some modern JS engines and Wasm runtimes are starting to get smarter about layout, but you cannot rely on them. Furthermore, the 64-byte line is standard on x86 and most ARM cores, but it is not universal: Apple's M-series chips use 128-byte cache lines, and some x86 parts prefetch adjacent 64-byte lines in pairs, which can couple neighboring lines under contention.
If you want to be truly future-proof, aligning to 128 bytes is the safest, albeit more memory-expensive, bet.
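The 128-byte variant is the same attribute with a bigger number. (If you would rather not hardcode a width, the crossbeam-utils crate's CachePadded wrapper picks a per-target padding for you; the type below is a plain-std sketch.)

```rust
use std::sync::atomic::AtomicU32;

// 128-byte padding: covers Apple M-series (128-byte lines) as well as
// x86 parts whose prefetchers treat adjacent 64-byte lines as a pair.
#[repr(align(128))]
pub struct WidePadded {
    pub val: AtomicU32,
}

fn main() {
    // As with align(64), the size is rounded up to the alignment.
    assert_eq!(std::mem::size_of::<WidePadded>(), 128);
}
```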
When Should You Care?
You don't need to pad every single variable in your Wasm module. Alignment is only critical when:
1. The data is frequently written to (not just read).
2. The data is shared across multiple threads.
3. The variables are adjacent in memory but logically independent.
If you have a large buffer that one thread reads and another writes (like a producer/consumer queue), you should pad the "Head" and "Tail" pointers of that queue. These are the classic hotspots for atomic contention.
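A padded queue header might look like the following hypothetical sketch: the producer only ever writes the tail index and the consumer only ever writes the head index, so giving each its own cache line means the two sides never invalidate each other's hot counter. (Slot storage and full/empty checks are omitted; names are illustrative.)

```rust
use std::sync::atomic::{AtomicU32, Ordering};

#[repr(align(64))]
struct PaddedIndex(AtomicU32);

// Header for a single-producer/single-consumer ring buffer.
pub struct RingHeader {
    head: PaddedIndex, // consumer-owned cache line
    tail: PaddedIndex, // producer-owned cache line
}

impl RingHeader {
    pub fn new() -> Self {
        Self {
            head: PaddedIndex(AtomicU32::new(0)),
            tail: PaddedIndex(AtomicU32::new(0)),
        }
    }

    // Producer side: claim the next slot index.
    pub fn advance_tail(&self) -> u32 {
        self.tail.0.fetch_add(1, Ordering::AcqRel)
    }

    // Consumer side: claim the next slot index.
    pub fn advance_head(&self) -> u32 {
        self.head.0.fetch_add(1, Ordering::AcqRel)
    }
}

fn main() {
    let h = RingHeader::new();
    assert_eq!(h.advance_tail(), 0);
    assert_eq!(h.advance_tail(), 1);
    // Two 64-byte-aligned fields make the whole header 128 bytes.
    assert_eq!(std::mem::size_of::<RingHeader>(), 128);
}
```

The cost is 120 wasted bytes per header—nothing next to the cache-line ping-pong it prevents on every push and pop.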
Conclusion: Designing for the Metal
WebAssembly allows us to bring near-native performance to the browser, but it also brings native-level responsibilities. We can no longer ignore the physical reality of the CPU.
The next time you're profiling a multithreaded Wasm application and the numbers aren't making sense, stop looking at your algorithms and start looking at your memory layout. Are your threads fighting over the same 64-byte slice of reality?
Padding might feel like "wasting" memory, but in the cache-line war, it’s the only way to broker a peace treaty that actually lets your code run fast. Memory is cheap; synchronization is expensive. Choose wisely.


