
A Technical Post-Mortem of the Draw Call: 3 Lessons in WebGPU Render Bundle Scaling

Analyze the hidden costs of draw call submission and the architectural shifts required to achieve 100,000 active objects in the browser.


If you loop through 100,000 objects in a standard JavaScript requestAnimationFrame loop and issue a passEncoder.draw() for each, your frame budget won't just disappear—it will explode. Even with WebGPU's reduced driver overhead compared to WebGL, the "CPU-to-GPU tax" remains a physical reality. At this scale, the bottleneck isn't the GPU's ability to rasterize triangles; it's the JavaScript engine's ability to scream instructions at the graphics API fast enough to keep the hardware fed.

To hit the 100,000 active object mark without your browser tab turning into a slide show, you have to stop thinking about "drawing" and start thinking about "recording." This shift requires moving away from immediate-mode-style execution toward GPURenderBundles.

Here is a post-mortem on why raw draw calls fail at scale and the three specific lessons learned from re-architecting a renderer for massive object counts.

Lesson 1: The Validation Bottleneck is the Silent Killer

In WebGPU, every command you issue via a GPUCommandEncoder undergoes a validation check. The browser needs to ensure that your bind groups match the pipeline layout, that your vertex buffers have enough memory for the draw range, and that you aren't trying to write to a depth texture that’s currently bound as a sampler.

While WebGPU's design pushes much of this validation to "pipeline creation time," the draw() call itself still incurs a cost. Multiply that by 100,000, and you’re spending 10ms just on browser-side overhead before a single bit of data hits the GPU.
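The arithmetic is sobering even under generous assumptions (the ~100 ns per-call figure below is an assumption, not a measurement):

```javascript
// Rough budget math: if each draw() costs ~100 ns of CPU-side
// validation and encoding, 100,000 calls consume 10 ms of a
// 16.6 ms frame before the GPU rasterizes anything.
const callsPerFrame = 100_000;
const overheadPerCallNs = 100; // ~100 ns per draw() (assumed figure)
const totalOverheadMs = (callsPerFrame * overheadPerCallNs) / 1e6;
console.log(totalOverheadMs); // 10
```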

The Naive Approach (The Frame Killer)

// This will crawl at 100k objects
const renderPass = commandEncoder.beginRenderPass(renderPassDescriptor);
renderPass.setPipeline(pipeline);

for (let i = 0; i < objects.length; i++) {
    // Re-binding per object is a massive overhead
    renderPass.setBindGroup(0, objects[i].bindGroup); 
    renderPass.setVertexBuffer(0, objects[i].vertexBuffer);
    renderPass.draw(objects[i].vertexCount);
}

renderPass.end();

In this loop, the JavaScript thread is doing a massive amount of work every single frame. The solution is the Render Bundle. A GPURenderBundle allows you to pre-record the commands (the setPipeline, setBindGroup, and draw calls) once and then "replay" that recorded chunk with a single call in your main render pass.

The Bundle Approach

const bundleEncoder = device.createRenderBundleEncoder({
    colorFormats: [format],
    depthStencilFormat: 'depth24plus',
});

bundleEncoder.setPipeline(pipeline);
for (let i = 0; i < staticObjects.length; i++) {
    bundleEncoder.setBindGroup(0, staticObjects[i].bindGroup);
    bundleEncoder.setVertexBuffer(0, staticObjects[i].vertexBuffer);
    bundleEncoder.draw(staticObjects[i].vertexCount);
}
const staticBundle = bundleEncoder.finish();

// Inside the main render loop:
const renderPass = commandEncoder.beginRenderPass(renderPassDescriptor);
renderPass.executeBundles([staticBundle]); // ONE call for 100,000 objects
renderPass.end();

The Takeaway: Bundles move the validation cost from the *render loop* to the *initialization phase*. If your scene is relatively static, bundles are a free performance lunch.

---

Lesson 2: Bind Group Thrashing and the "Uniform Atlas"

Even with bundles, if you have 100,000 unique bind groups (one for each object's transform matrix), you will run into memory management issues and cache misses on the GPU. The sheer volume of setBindGroup commands inside the bundle still creates a performance ceiling.

I found that the most effective way to scale was to move away from "One Bind Group Per Object" and toward "One Bind Group Per 1,000 Objects." This is where the concept of a Uniform Atlas (or a massive Storage Buffer) comes in.

Instead of each object having its own GPUBuffer for a transform matrix, you pack all 100,000 matrices into a single GPUBuffer with usage: GPUBufferUsage.STORAGE.

The Optimized Storage Buffer Structure

// WGSL Shader
struct ObjectData {
    modelMatrix: mat4x4<f32>,
    color: vec4<f32>,
};

@group(0) @binding(0) var<storage, read> allObjects: array<ObjectData>;

@vertex
fn vs_main(
    @builtin(instance_index) instanceIdx: u32,
    @location(0) position: vec3<f32>,
) -> @builtin(position) vec4<f32> {
    let data = allObjects[instanceIdx];
    // View/projection omitted for brevity; apply the per-object transform.
    return data.modelMatrix * vec4<f32>(position, 1.0);
}

By using the @builtin(instance_index), you can draw thousands of objects with a single draw call. This is technically "instancing," but when combined with bundles, it allows you to group objects by material or pipeline.
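On the CPU side, the matching layout is one flat Float32Array: 16 floats of matrix plus 4 floats of color, 80 bytes per object. A minimal packing sketch, assuming each object carries `modelMatrix` and `color` arrays (illustrative names, not a fixed API):

```javascript
// CPU-side layout matching the WGSL ObjectData struct:
// mat4x4<f32> (16 floats) + vec4<f32> (4 floats) = 20 floats per object.
const FLOATS_PER_OBJECT = 20;

function packObjectData(objects) {
  const data = new Float32Array(objects.length * FLOATS_PER_OBJECT);
  for (let i = 0; i < objects.length; i++) {
    const base = i * FLOATS_PER_OBJECT;
    data.set(objects[i].modelMatrix, base);  // 16 floats of transform
    data.set(objects[i].color, base + 16);   // 4 floats of color
  }
  return data;
}

// Upload once (or only dirty ranges) instead of 100,000 tiny buffers:
// device.queue.writeBuffer(storageBuffer, 0, packObjectData(allObjects));
```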

Why not just use one big draw call?

In a real-world scenario, your 100,000 objects aren't all the same. They have different meshes or different textures. You use Bundles to group these. For example:
- Bundle A: 20,000 "Rocks" drawn with a single instanced drawIndexed call.
- Bundle B: 50,000 "Grass Blades" using a different pipeline.
- Bundle C: 30,000 "Debris" pieces.

This reduces your executeBundles call to just three items, while the GPU handles the heavy lifting of indexing into the storage buffer.
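The grouping step itself is plain JavaScript. A sketch that buckets objects by a combined pipeline/material key (`pipelineId` and `materialId` are assumed fields on your objects), so each bucket can be recorded into its own bundle:

```javascript
// Bucket objects by pipeline + material so each bucket can become
// one GPURenderBundle containing a single instanced draw.
function groupByBatch(objects) {
  const groups = new Map(); // key -> array of objects
  for (const obj of objects) {
    const key = `${obj.pipelineId}:${obj.materialId}`; // assumed fields
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(obj);
  }
  return groups;
}

// Per bucket, you would then record roughly:
// bundleEncoder.setPipeline(pipelines[pipelineId]);
// bundleEncoder.draw(vertexCount, group.length); // one instanced draw
```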

---

Lesson 3: The Cost of Dirty States and Incremental Updates

One of the biggest gotchas with GPURenderBundle is that bundles are immutable. If a single object out of your 100,000 moves, you theoretically have to re-record the entire bundle. This is the "Bundle Tax."

To solve this, I had to implement a Tiered Culling and Bundling Strategy.

Instead of one giant bundle of 100k objects, I split the world into spatial chunks (quadtrees or octrees). Each chunk gets its own bundle. If an object moves within a chunk, only that chunk's bundle is marked as "dirty" and re-recorded.

The Bundle Manager Logic

class BundleManager {
    constructor() {
        this.chunks = new Map(); // Map<ChunkID, GPURenderBundle>
        this.dirtyChunks = new Set();
    }

    updateObject(obj) {
        // An object that crosses a chunk boundary dirties both the
        // chunk it left and the chunk it entered.
        const newChunkId = this.getChunkId(obj.position);
        if (obj.chunkId !== undefined && obj.chunkId !== newChunkId) {
            this.dirtyChunks.add(obj.chunkId);
        }
        obj.chunkId = newChunkId;
        this.dirtyChunks.add(newChunkId);
    }

    getRenderBundles() {
        for (const chunkId of this.dirtyChunks) {
            this.chunks.set(chunkId, this.reRecordBundle(chunkId));
        }
        this.dirtyChunks.clear();
        return Array.from(this.chunks.values());
    }
}
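The getChunkId helper referenced above is just spatial quantization. A minimal sketch, assuming a flat XZ grid and a fixed chunk size (both the size and the string-key format are arbitrary choices):

```javascript
const CHUNK_SIZE = 32; // world units per chunk cell (assumed value)

// Quantize a world position to a stable string key for the chunk map.
// Math.floor keeps negative coordinates in the correct cell.
function getChunkId(position) {
  const cx = Math.floor(position.x / CHUNK_SIZE);
  const cz = Math.floor(position.z / CHUNK_SIZE);
  return `${cx},${cz}`;
}
```

Smaller chunks make each re-record cheaper when something moves, but add more entries to executeBundles per frame; tune the size against your scene density.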

The "Hidden" GPU-to-CPU Sync

Here’s a hard-earned lesson: Don't try to read back visibility data from the GPU to decide what to put in your bundle.

If you use a Compute Shader to do frustum culling (which you should at 100k objects), and then try to bring that visibility list back to JavaScript to record a new GPURenderBundle, you will introduce a "stall." The CPU will wait for the GPU to finish the compute pass, killing your parallelism.

Instead, the most advanced WebGPU architectures use Indirect Drawing.

With drawIndexedIndirect, the draw arguments (including how many instances to render) live in a GPUBuffer. The JavaScript code doesn't even know how many objects are being drawn; it just tells the GPU: "Read the draw parameters at this offset in this buffer and execute them."

// Drawing 100,000 objects without JS knowing which ones are visible
renderPass.setPipeline(pipeline);
renderPass.setBindGroup(0, globalBindGroup);
renderPass.setVertexBuffer(0, masterVertexBuffer);
renderPass.setIndexBuffer(masterIndexBuffer, 'uint32');
// The draw command parameters are fetched from a buffer on the GPU
renderPass.drawIndexedIndirect(indirectBuffer, 0);
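The indirect buffer itself is tiny: drawIndexedIndirect reads five 32-bit values starting at the given offset. A sketch of building the CPU-side template that a culling compute pass later fills in (buffer names are illustrative):

```javascript
// drawIndexedIndirect argument layout, five 32-bit values:
// [indexCount, instanceCount, firstIndex, baseVertex, firstInstance]
function makeIndexedIndirectArgs(indexCount) {
  // instanceCount starts at 0; the culling compute shader
  // increments it on the GPU for each visible object.
  return new Uint32Array([indexCount, 0, 0, 0, 0]);
}

// const indirectBuffer = device.createBuffer({
//   size: 5 * 4, // five u32s
//   usage: GPUBufferUsage.INDIRECT | GPUBufferUsage.STORAGE |
//          GPUBufferUsage.COPY_DST,
// });
// device.queue.writeBuffer(indirectBuffer, 0,
//   makeIndexedIndirectArgs(meshIndexCount));
```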

Summary: The Architectural Shift

Achieving 100,000 active objects in WebGPU isn't about writing a faster loop; it's about avoiding the loop entirely.

1. Eliminate the overhead: Use GPURenderBundle to bake your draw calls so the browser doesn't have to re-validate them every frame.
2. Flatten your data: Stop using individual uniform buffers. Use one massive Storage Buffer and index into it using instance_index.
3. Stay on the GPU: Use Compute Shaders for culling and drawIndirect to keep the visibility logic on the hardware.

The move from WebGL to WebGPU is a move from "telling the GPU what to do" to "setting up a system where the GPU tells itself what to do." If your JavaScript is doing more than just passing a few orchestration buffers per frame, you aren't yet tapping into the real power of the API. Focus on the bundles, respect the validation cost, and let the storage buffers carry the weight.