loke.dev

Why Your WebGPU Compute Shader Is Silently Triggering a GPU Process Crash

Understand the fragile relationship between the operating system's Timeout Detection and Recovery (TDR) and the browser's watchdog timer to prevent high-performance AI tasks from killing your app.


I spent three days debugging a matrix multiplication kernel that worked perfectly on 1024x1024 matrices but turned my browser into a stuttering mess the second I bumped the dimensions to 4096x4096. There were no syntax errors, no validation warnings in the console, and my logic was sound. Then, suddenly, the screen would flicker, the browser would go white, and a little notification would pop up saying "GPU Process Crashed." It felt like I was being gaslit by my own hardware. I eventually realized that I wasn't writing bad code; I was just being too greedy with the GPU's time.

The Invisible Executioner: TDR and Watchdogs

When you run a compute shader, you aren't just "using the GPU"—you are sharing a precious, high-stakes resource with the rest of your operating system. Your OS needs the GPU to draw the start menu, and your browser needs it to render every other tab you have open.

If your shader takes too long to execute—say, by trying to process a massive AI model in a single dispatch—the operating system starts to panic. On Windows, this is governed by TDR (Timeout Detection and Recovery). If the GPU doesn't respond for about 2 seconds, Windows assumes the driver is hung and resets it.
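If you need a longer leash while debugging on your own Windows machine, the TDR window is controlled by the documented TdrDelay registry value (requires admin rights and a reboot; this is a local debugging convenience, never something to ask your users to do):

```shell
REM Raise the GPU timeout from the 2-second default to 10 seconds.
REM Local debugging only -- remember to remove it afterwards.
reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrDelay /t REG_DWORD /d 10 /f
```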

Chrome and Edge add their own leash on top of the OS. The browser's GPU process runs a watchdog timer, and if the GPU thread stops responding for too long, the browser decides your tab is a rogue process and kills the GPU connection to save the rest of the system.

The "Silent" Part of the Crash

The most annoying part? WebGPU handles this "gracefully" from its perspective, which means "silently" from yours. There is no thrown exception: if you don't explicitly listen for it, your queue.submit() calls appear to succeed, but nothing will ever happen on the GPU again.

Here is how you actually catch the culprit in JavaScript:

// This is your safety net. Don't leave home without it.
device.lost.then((info) => {
    console.error(`💔 GPU device was lost: ${info.message}`);
    console.error(`Reason: ${info.reason}`); // usually 'destroyed' or 'unknown'

    if (info.reason !== 'destroyed') {
        // This is where you realize your shader was too heavy
        alert("The GPU process crashed. Your shader likely triggered a timeout.");
    }
});

If you see info.reason as anything other than "destroyed" (which happens when you manually call device.destroy()), you’ve likely tripped the watchdog.
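Once you detect the loss, the only way forward is to request a fresh device and rebuild everything. Here's a hedged recovery sketch; the helper name and shape are mine, not a standard API, and `gpu` is injected (you'd pass navigator.gpu in the browser):

```javascript
// Hypothetical recovery helper: acquire a device and wire up a callback
// for any loss we didn't cause ourselves via device.destroy().
async function acquireDevice(gpu, onLost) {
  const adapter = await gpu.requestAdapter();
  if (!adapter) throw new Error("WebGPU is not supported here");
  const device = await adapter.requestDevice();
  device.lost.then((info) => {
    // 'destroyed' means we tore the device down on purpose
    if (info.reason !== "destroyed") onLost(info);
  });
  return device;
}
```

In the callback you have to recreate buffers, pipelines, and bind groups from scratch; every resource created from the lost device is unusable.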

The "Death Loop" Shader

Here is a classic example of a shader that looks fine but is actually a bomb. It tries to do too much work in a single thread, keeping the GPU busy for way too long.

// WARNING: This is a crash-starter
@group(0) @binding(0) var<storage, read_write> data: array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) grid: vec3<u32>) {
    let index = grid.x;
    
    // Doing a billion iterations in one thread is a one-way ticket
    // to TDR-ville.
    var value = data[index];
    for (var i = 0u; i < 1000000000u; i++) {
        value = sqrt(value + f32(i));
    }
    data[index] = value;
}

On a modern RTX 4090, this *might* finish. On an integrated Intel chip? You've just killed the GPU process.

How to Stop the Killing

If your workload is legitimately heavy—like training a neural network or path tracing—you have to stop thinking about your work as one giant task. You need to "slice" it.

1. The Multi-Dispatch Strategy

Instead of encoding one giant dispatchWorkgroups(1024, 1024), break the work into many small dispatches and, crucially, submit them as separate command buffers. Many drivers treat a single command buffer as one scheduling unit, so separate submissions give the GPU scheduler natural preemption points and give the browser's compositor a window of time to breathe and update the UI between chunks.

// Slice the work into chunks, one submission per chunk
for (let i = 0; i < totalChunks; i++) {
    const commandEncoder = device.createCommandEncoder();
    const pass = commandEncoder.beginComputePass();
    pass.setPipeline(pipeline);
    // In practice you'd also bind a per-chunk offset, e.g. via a uniform buffer
    pass.setBindGroup(0, bindGroup);
    pass.dispatchWorkgroups(64, 1, 1);
    pass.end();
    device.queue.submit([commandEncoder.finish()]);
}
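Where does totalChunks come from? You derive it from your problem size and a per-submission workgroup budget. A small sizing sketch (the helper name and the budget idea are mine, not part of WebGPU):

```javascript
// Hypothetical sizing helper: given a total element count, the shader's
// workgroup size, and how many workgroups you're willing to run per
// submission, compute how many chunks you need and the size of the last one.
function planChunks(totalElements, workgroupSize, workgroupsPerChunk) {
  const totalWorkgroups = Math.ceil(totalElements / workgroupSize);
  const totalChunks = Math.ceil(totalWorkgroups / workgroupsPerChunk);
  // The final chunk usually carries the remainder
  const lastChunk = totalWorkgroups - (totalChunks - 1) * workgroupsPerChunk;
  return { totalWorkgroups, totalChunks, lastChunk };
}
```

For a million elements with @workgroup_size(64) and 64 workgroups per submission, that's 15,625 workgroups split across 245 submissions.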

2. Use onSubmittedWorkDone()

If you are running a heavy loop in JavaScript that submits work to the GPU, don't just hammer the queue. Wait for the GPU to give you the "thumbs up" that it finished the previous batch.

async function heavyCompute() {
    for (let i = 0; i < 100; i++) {
        const encoder = device.createCommandEncoder();
        // ... build your pass ...
        device.queue.submit([encoder.finish()]);
        
        // This is the magic button. It returns a promise that 
        // resolves when the GPU has finished all work submitted so far.
        await device.queue.onSubmittedWorkDone();
        
        console.log(`Finished chunk ${i}`);
    }
}
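Because you now know how long each awaited chunk took on the wall clock, you can make the chunk size adaptive instead of guessing. A hedged heuristic sketch (the function, thresholds, and growth policy are all my own invention, not a WebGPU API):

```javascript
// Hypothetical adaptive sizing: shrink the chunk when it blew the time
// budget, grow it cautiously when it finished with lots of headroom.
function nextChunkSize(current, elapsedMs, budgetMs) {
  if (elapsedMs > budgetMs) return Math.max(1, Math.floor(current / 2));
  if (elapsedMs < budgetMs / 2) return current * 2;
  return current;
}
```

Wrap each chunk in performance.now() calls around the await, feed the elapsed time back in, and the loop settles on a chunk size that stays safely under the watchdog on whatever hardware it lands on.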

The Gotcha: Indirect Dispatches

There’s a specific edge case involving dispatchWorkgroupsIndirect. Since the dispatch dimensions are stored in a GPU buffer, the browser cannot know ahead of time how long the shader will run. If your buffer accidentally contains a massive number (like 0xFFFFFFFF due to an underflow), you will trigger a GPU reset that is incredibly hard to debug because the "input" looked fine on the CPU side.

Always clamp your indirect dispatch parameters:

// Inside a 'cleanup' or 'pre-compute' shader, where indirect_buffer is the
// storage buffer that dispatchWorkgroupsIndirect will read from
indirect_buffer[0] = min(calculated_width, 4096u);

Final Thoughts

WebGPU gives us "near-native" power, but it also gives us the native responsibility of not hogging the hardware. If your app is dying silently, stop looking for a logic error and start looking at your execution time.

Keep your dispatches small, listen for device.lost, and remember: the GPU is a shared resource. If you try to take it all for yourself, the OS will take it away from you by force.