
The Memory Layout of a Branch

How the memory address of your JIT-compiled functions can trigger CPU branch mispredictions, and the hidden cost of code-space fragmentation in V8.


I used to lose sleep over microbenchmarks that made no sense. I’d have two functions—logically identical, both JIT-optimized by V8, both running the same loop—but one would consistently lag 15% behind the other. I checked for hidden closures, I checked for deoptimizations, I checked for GC pressure. Nothing. It wasn't until I started looking at the raw assembly addresses that the ghost in the machine revealed itself: the performance of your code isn't just about *what* the instructions are, but *where* they live in memory.

It turns out that the CPU’s branch predictor is a bit like a distracted librarian. If two books with similar call numbers get assigned to the same slot on the shelf, the librarian starts mixing them up. In the world of high-performance JavaScript, this manifests as "aliasing" in the Branch Target Buffer (BTB), and it’s a direct consequence of how V8 lays out code in memory.

The Architecture of a Guess

Modern CPUs are incredibly fast because they are optimistic. They don't wait for a conditional statement (like an if or a while) to resolve before they start executing the next set of instructions. They guess. This is speculative execution.

To make these guesses, the CPU uses the Branch Target Buffer (BTB). Think of the BTB as a specialized, high-speed cache. When the CPU encounters a branch instruction (like jne—jump if not equal), it looks up the address of that instruction in the BTB to see where it went last time.

But here is the catch: the BTB is finite. It’s a hardware component with limited entries. To keep things fast, the CPU doesn't use the full 64-bit memory address as a key. Instead, it uses a hash—often just the lower bits of the instruction's memory address.

If two different branches in two different parts of your JIT-compiled code happen to have addresses that hash to the same BTB entry, they "alias." One branch overwrites the prediction history of the other. They start fighting for the same slot, leading to branch mispredictions, pipeline flushes, and a significant performance tax that looks invisible in your JavaScript source.
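To make that "fighting for the same slot" concrete, here is a toy simulation of a direct-mapped, one-bit predictor. The table, the 1-bit history, and the slot numbers are all made up for illustration; real predictors are far more sophisticated, but the failure mode is the same:

```javascript
// Toy model: a direct-mapped predictor table where each slot remembers
// only the last outcome it saw. Branch A is always taken, branch B is
// never taken. If they index distinct slots, each predicts perfectly
// after warm-up; if they alias, they clobber each other every time.
function simulate(slotA, slotB, iterations = 1000) {
  const table = new Map(); // slot index -> last outcome seen
  let mispredicts = 0;
  for (let i = 0; i < iterations; i++) {
    for (const [slot, outcome] of [[slotA, true], [slotB, false]]) {
      if (table.has(slot) && table.get(slot) !== outcome) mispredicts++;
      table.set(slot, outcome);
    }
  }
  return mispredicts;
}

console.log(simulate(0x1f0, 0x2f0)); // distinct slots: 0 mispredictions
console.log(simulate(0x1f0, 0x1f0)); // aliased: 1999 of 2000 lookups mispredict
```

The aliased case mispredicts on essentially every lookup, even though each branch on its own is perfectly predictable. That is the whole problem in miniature.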

When JIT Goes Hunting for Space

In a language like C++, the linker decides the memory layout at compile time. In JavaScript, the memory layout is a moving target. V8’s JIT compiler (TurboFan) generates machine code at runtime and stuffs it into a specialized region of the heap called Code Space.

When you run a function for the first time, it’s interpreted. When it gets "hot," TurboFan compiles it. If it gets even hotter, or if the types change, it might be re-compiled. Every time this happens, V8 has to find a contiguous block of memory to store the new machine code.

// A simple hot function
function compute(arr) {
  let sum = 0;
  for (let i = 0; i < arr.length; i++) {
    // This 'if' is a branch in machine code
    if (arr[i] % 2 === 0) {
      sum += arr[i];
    } else {
      sum -= arr[i];
    }
  }
  return sum;
}

const data = new Int32Array(10000).map(() => Math.random() * 100);

// Warm it up to trigger TurboFan
for (let i = 0; i < 10000; i++) {
  compute(data);
}

In the example above, the if (arr[i] % 2 === 0) becomes a conditional jump in assembly. If this code lives at memory address 0x...A1F0, the CPU uses the low bits of A1F0 to index the BTB.

Now, imagine your application grows. You load more modules, you create more closures, and V8’s code space starts looking like a game of Tetris. If a new function is compiled and happens to land at 0x...B1F0, the lower bits match. If both functions are running in a tight loop, they will constantly kick each other out of the branch predictor.
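You can check that collision by hand. Assuming a 4096-entry BTB indexed by the low 12 bits of the address (the exact index width is an assumption; real CPUs vary), 0x...A1F0 and 0x...B1F0 are indistinguishable:

```javascript
// Low-bit indexing: 0xA1F0 and 0xB1F0 differ only above bit 12, so a
// predictor keyed on the low 12 bits cannot tell them apart.
const btbIndex = (addr) => addr & 0xfff; // 4096 entries -> 12 index bits

console.log(btbIndex(0xa1f0).toString(16)); // "1f0"
console.log(btbIndex(0xb1f0).toString(16)); // "1f0" -- same slot: aliasing
```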

The Hidden Cost of Code-Space Fragmentation

V8 tries to keep code together, but fragmentation is inevitable. When code is deoptimized (perhaps a hidden class changed), the old machine code is discarded, leaving a hole.

V8’s MemoryAllocator manages the CodeRange. Inside this range, memory is divided into Pages. When the engine needs to "write" a new JITed function, it looks for a free chunk. If your application has been running for a long time, these chunks are scattered.

This fragmentation doesn't just cause instruction cache misses; it increases the probability of BTB aliasing. Because the JIT generates code non-deterministically based on execution timing, you can end up in a "bad" memory layout purely by chance.

I’ve seen cases where restarting a Node.js process fixes a performance regression simply because the second time around, the hot functions landed in "luckier" memory addresses that didn't alias with each other or with core library functions.

Visualizing the Instruction Pointer

To see this in action, you have to go beneath the surface. You can use V8 flags to see where your code is landing.

If you run Node.js with:
node --trace-opt --print-opt-code --code-comments my-script.js

You’ll see the actual assembly and, crucially, the memory addresses.

--- Optimized code ---
optimization_id: 0
source_position: 0
kind: OPTIMIZED_FUNCTION
name: compute
...
Instructions (size = 320):
0x3504082440c0     0  push rbp
0x3504082440c1     1  mov rbp,rsp
...
0x35040824411a    5a  test al,0x1
0x35040824411c    5c  jnz 0x350408244130  <-- This is our branch!

That address—0x35040824411c—is what the CPU sees. If another hot branch in your application lands at an address with the same low bits (one also ending in 11c), you are in for a world of hurt.

Why "Small" Changes Cause Huge Swings

One of the most frustrating aspects of this is how a seemingly unrelated change—like adding a few lines of logging or importing a new utility library—can slow down a completely different part of your app.

When you add code, you shift the offsets. If you add 16 bytes of code to a function that precedes your hot loop in the JIT buffer, your hot loop's branches move by 16 bytes.

1. Before: the branch sits at 0x...400 and maps to one BTB slot.
2. You add 16 bytes of logic above it.
3. After: the branch sits at 0x...410 and maps to a different slot.

Suddenly, your code is hitting a different part of the hardware. If the new slot was already being used by a high-frequency internal V8 branch (like a feedback vector update), your performance drops. This is why we sometimes see "performance cliffs" where adding one line of code causes a 20% slowdown that doesn't make sense logically.
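The shift itself is purely mechanical. With the same assumed 12-bit index as before:

```javascript
// Adding 16 bytes of code above a branch moves it 16 bytes forward,
// landing it in a different predictor slot. (12-bit indexing is an
// assumed simplification, not a documented V8 or CPU constant.)
const btbIndex = (addr) => addr & 0xfff;
const before = 0x244400;
const after = before + 0x10; // 16 bytes of new code shifted it

console.log(btbIndex(before).toString(16)); // "400"
console.log(btbIndex(after).toString(16));  // "410" -- different slot
```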

Practical Code: Forcing the Layout?

In JavaScript, we can't manually place code at specific addresses like we can with a linker script in C. However, we can observe how "bloating" a function can change its performance characteristics by shifting internal offsets.

Consider this experimental setup:

function createWork(paddingSize) {
  // We use the Function constructor to create a new "identity"
  // for the JIT, ensuring it gets its own space in memory.
  // Each padding statement needs a unique variable name: repeating
  // `let a = 0;` would throw a redeclaration SyntaxError.
  const padding = Array.from({ length: paddingSize }, (_, i) => `let p${i} = 0;`).join('\n');
  
  return new Function('arr', `
    let sum = 0;
    ${padding}
    for (let i = 0; i < arr.length; i++) {
      if (arr[i] % 2 === 0) {
        sum += arr[i];
      } else {
        sum -= arr[i];
      }
    }
    return sum;
  `);
}

const data = new Int32Array(10000).map(() => Math.floor(Math.random() * 100));

// Test different paddings
for (let p = 0; p < 50; p += 5) {
  const fn = createWork(p);
  // Warm up
  for(let i=0; i<1000; i++) fn(data);
  
  const start = performance.now();
  for(let i=0; i<10000; i++) fn(data);
  const end = performance.now();
  
  console.log(`Padding ${p}: ${(end - start).toFixed(4)}ms`);
}

When you run this, you'll often see the execution time oscillate. It’s not a linear increase. You might find that Padding 10 is faster than Padding 0, and Padding 15 is slower than both. This is the physical reality of memory alignment and branch prediction interference showing its face.

Mitigation Strategies (or: How to live with it)

Since we can't control the BTB directly, how do we write code that is resilient to memory layout issues?

1. Reduce Branch Density

The fewer branches you have, the fewer entries you need in the BTB. This is why "branchless" programming is so popular in high-frequency trading and game engines. In JavaScript, you can sometimes replace conditionals with math.

Bad (Branchy):

if (val > 10) {
  result += val;
}

Better (Branchless-ish):

result += (val > 10) * val;

Note: Modern JITs are smart, but sometimes they need help to avoid generating actual jump instructions.
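For a self-contained comparison, here is a runnable sketch of both variants. Whether the branchless form actually wins depends on your CPU, how predictable the data is, and what machine code TurboFan emits, so treat any timing you measure as machine-specific:

```javascript
function branchy(arr) {
  let result = 0;
  for (let i = 0; i < arr.length; i++) {
    if (arr[i] > 10) result += arr[i]; // conditional jump in machine code
  }
  return result;
}

function branchless(arr) {
  let result = 0;
  for (let i = 0; i < arr.length; i++) {
    result += (arr[i] > 10) * arr[i]; // boolean coerces to 0 or 1
  }
  return result;
}

// Random data is the worst case for a branch predictor.
const vals = Int32Array.from({ length: 100000 }, () => (Math.random() * 20) | 0);
console.log(branchy(vals) === branchless(vals)); // true: same answer either way
```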

2. Monomorphic Call Sites

V8 uses "Inline Caches" (IC) to handle function calls. If a call site becomes "polymorphic" (sees many different shapes of objects), V8 generates a "megamorphic" stub. These stubs are full of branches that are highly susceptible to aliasing. Keeping your functions monomorphic (always passing the same type of object) reduces the complexity of the generated machine code and saves BTB entries.
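A minimal sketch of the difference (the function and object shapes here are invented for illustration):

```javascript
function getX(point) {
  return point.x;
}

// Monomorphic: every call sees the same object shape {x, y}.
// V8's inline cache records one shape and one fast path.
getX({ x: 1, y: 2 });
getX({ x: 3, y: 4 });

// Polymorphic drift: new shapes at the same call site force V8 to
// emit a shape-check branch per shape seen, i.e. more code and more
// branches competing for predictor slots.
getX({ x: 5 });             // shape {x}
getX({ x: 6, y: 7, z: 8 }); // shape {x, y, z}

console.log(getX({ x: 42, y: 0 })); // 42
```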

3. Function Size Matters

Large functions generate large blocks of machine code. The more instructions between branches, the less likely you are to have "dense" aliasing within a single function. However, very large functions might not be inlined. The sweet spot is usually small, inlinable functions that TurboFan can stitch together into a coherent block.

4. Be Wary of "Micro-benchmarking"

This is the most important takeaway. If you are benchmarking a tiny snippet of code, the results are heavily influenced by where that code landed in memory during that specific run. Always run benchmarks multiple times in different process life cycles, or use a tool like benchmark.js that attempts to account for these variances.

The V8 Perspective: Pointer Compression and Alignment

In recent versions of V8 (and Chrome/Node), a feature called Pointer Compression was introduced. To save memory, V8 uses 32-bit offsets instead of 64-bit pointers for objects on the heap. This changes how memory is addressed and, consequently, how it’s cached.

Furthermore, V8 tries to align instructions on 16-byte or 64-byte boundaries. Alignment is the CPU’s best friend. When code is aligned, the instruction fetcher can grab a whole "line" of code from the L1 cache at once. If your hot branch happens to straddle two different cache lines because of where it was placed in the JIT buffer, you’ll pay a double-load penalty.
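The straddling check is simple arithmetic. Assuming 64-byte cache lines (typical for x86, though not guaranteed everywhere):

```javascript
// Does an instruction starting at `addr` with `len` bytes cross a
// 64-byte cache line boundary?
const LINE = 64;
const straddles = (addr, len) =>
  Math.floor(addr / LINE) !== Math.floor((addr + len - 1) / LINE);

console.log(straddles(0x3c, 8)); // true:  bytes 0x3C..0x43 cross the 0x40 boundary
console.log(straddles(0x40, 8)); // false: bytes 0x40..0x47 fit in one line
```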

The Takeaway

We like to think of JavaScript as a high-level, abstract language where we don't have to care about the metal. For 99% of applications, that’s true. But when you are pushing the limits of Node.js or a browser-based engine, you aren't just writing logic; you are orchestrating an intricate dance of electrons inside a silicon chip.

The next time you see a performance dip that defies logic, don't just rewrite your loops. Consider that your code might just be in a bad neighborhood in memory. We can't always move the code, but understanding *why* the neighborhood matters makes us better at diagnosing the truly weird bugs that live at the bottom of the stack.

Memory layout isn't just a C++ problem. It’s a reality of every language that eventually touches a CPU. In the world of V8, your branches aren't just logic—they have a physical address, and those addresses have consequences.