
4 Invisible Padding Rules That Will Sabotage Your WebGPU Uniform Buffers
Master the frustrating math of the std140 layout and learn how to align your JavaScript memory for the GPU without silent failures.
Most developers assume that a Float32Array in JavaScript is a transparent window into GPU memory. You see an array of ten numbers, you send it to the GPU, and the GPU sees ten numbers. This is a lie. In WebGPU, your data isn't just a list of values; it’s a meticulously partitioned grid of memory governed by a strict, often unintuitive layout called std140.
If you treat a Uniform Buffer like a standard JavaScript array, your shaders won't crash. They just won't work. Your colors will be black, your transforms will be identity matrices, and your lights will be positioned at the origin. You’ll spend three hours staring at your math code only to realize the math was perfect—your data was just four bytes to the left of where the GPU expected it to be.
Here are the four invisible padding rules of the std140 layout that will sabotage your WebGPU uniform buffers if you don't account for them manually.
1. The Vec3 Trap: The "Ghost" Fourth Component
This is the most common point of failure for beginners. In WGSL, a vec3<f32> seems like it should take up 12 bytes (3 components × 4 bytes each), but std140 gives it the alignment of a vec4. If you define a struct in WGSL like this:

```wgsl
struct SceneData {
  lightPosition: vec3<f32>,
  lightColor: vec3<f32>,
}
```

You might logically assume your JavaScript buffer should look like this:

```javascript
// This looks right, but it is WRONG
const data = new Float32Array([
  1.0, 2.0, 3.0, // lightPosition
  1.0, 0.9, 0.8  // lightColor
]);
```

It fails because std140 dictates that a vec3 has the same 16-byte alignment as a vec4. Even though you only care about X, Y, and Z, the next vec3 can't start until the next 16-byte boundary. The GPU looks for lightColor at byte offset 16, but your tightly packed array put it at offset 12, so the red channel lands in the padding slot and everything after it shifts by one float.

The Fix

You must manually pad your JavaScript arrays to match the 16-byte alignment. Every vec3 in a uniform buffer is effectively a vec4 with a useless w component.

```javascript
// The correct way to align two consecutive vec3s
const data = new Float32Array([
  1.0, 2.0, 3.0, 0.0, // X, Y, Z, [PADDING]
  1.0, 0.9, 0.8, 0.0  // R, G, B, [PADDING]
]);
```

One subtlety: a lone scalar declared immediately after a vec3 (say, a lightIntensity: f32) does get packed into that fourth slot, because an f32 only needs 4-byte alignment. I've seen people lean on this to get clever — declaring the position as a vec4 and smuggling an unrelated float into the fourth slot to save space. While technically possible, it makes your WGSL code a nightmare to read because you're constantly accessing lightPosition.w to get your lightIntensity. Just accept the wasted 4 bytes; your sanity is worth more than a single float.
2. Array Strides: The "16-Byte Minimum" Law
When you create an array of a type in a uniform buffer, you might think the elements are packed tightly together. They aren't. In std140, the stride between array elements is the element's size rounded up to a multiple of 16 bytes. Small elements get padded out to a full 16-byte slot; types that are already a multiple of 16 (like a mat4x4) are unaffected.
Consider this WGSL struct:
```wgsl
struct PhysicsObject {
  velocity: f32,
  mass: f32,
}
struct Scene {
  objects: array<PhysicsObject, 3>,
}
```

A PhysicsObject only contains 8 bytes of data (two 4-byte floats). You might expect the total size of objects to be 24 bytes (8 bytes × 3).
Nope. Because of the array stride rules, each PhysicsObject must occupy 16 bytes. (WGSL is strict about this: a uniform-buffer array whose element stride isn't a multiple of 16 fails validation outright, so in practice you pad the struct explicitly — for example with a @size(16) attribute on its last member.) The GPU sees:
- Object 0: velocity, mass, [8 bytes of padding]
- Object 1: velocity, mass, [8 bytes of padding]
- Object 2: velocity, mass, [8 bytes of padding]
If you send a Float32Array([v1, m1, v2, m2, v3, m3]), the GPU will read v1 and m1 correctly for the first object, but when it looks for the second object 16 bytes later, it will find whatever garbage happened to be in your memory.
The Code Reality
When building this buffer in JS, you have to account for those "dead zones."
```javascript
const bufferData = new Float32Array(3 * 4); // 3 elements, each taking 4 float slots (16 bytes)
for (let i = 0; i < 3; i++) {
  const offset = i * 4;
  bufferData[offset + 0] = velocities[i];
  bufferData[offset + 1] = masses[i];
  // offset + 2 and offset + 3 are left as 0.0 (the padding)
}
```

This is why many WebGPU developers prefer using vec4 for everything. If you just make your struct array<vec4<f32>, 3>, the padding is explicit, and you aren't surprised when your data alignment goes sideways.
3. Nested Structs and the "Largest Member" Rule
Alignment isn't just about the members inside a struct; it's about the struct itself. A struct's alignment is the largest alignment among its members, and in a uniform buffer it is rounded up to at least 16 bytes. This is a recursive headache.
If you have a small struct that you nest inside a larger one, the smaller struct might suddenly require massive amounts of padding at the start or end to satisfy the alignment of its internal components.
Look at this example:
```wgsl
struct Inner {
  rotation: f32,
}
struct Outer {
  position: vec4<f32>, // offset 0, alignment 16
  data: Inner,         // offset 16
  scale: f32,          // offset 32
}
```

Inner itself only needs 4-byte alignment, but in a uniform buffer a nested struct (and the space reserved after it) is rounded up to a 16-byte boundary, so even this one-float struct claims a full 16-byte slot and pushes scale out to offset 32. It gets worse when Inner grows:
```wgsl
struct Inner {
  transform: mat4x4<f32>, // Alignment: 16
}
struct Outer {
  id: f32, // Offset 0
  // [12 bytes of padding go here!]
  data: Inner, // Offset 16
}
```

Because Inner contains a mat4x4, its base alignment becomes 16. Even though id only takes up 4 bytes, the GPU refuses to start data until the next 16-byte boundary.
I’ve wasted entire afternoons debugging "shaking" meshes because I didn't realize a nested struct was being shifted 12 bytes to the right by the compiler. When you are writing your JavaScript writeBuffer calls, you have to calculate these offsets manually.
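The compiler's placement rule is mechanical: each member lands at the current offset rounded up to that member's alignment. You can mimic it in a few lines — a sketch under stated assumptions; `computeLayout` is a made-up name, and it ignores the extra rules uniform buffers apply to nested structs and arrays:

```javascript
// Mimic the compiler's placement rule: each member starts at the current
// offset rounded up to that member's alignment.
// computeLayout is a hypothetical helper, not a real API.
function computeLayout(members) {
  const roundUp = (n, a) => Math.ceil(n / a) * a;
  let cursor = 0;
  const offsets = {};
  for (const { name, size, align } of members) {
    cursor = roundUp(cursor, align);
    offsets[name] = cursor;
    cursor += size;
  }
  return { offsets, size: cursor };
}

// The id + mat4x4 example from above:
const layout = computeLayout([
  { name: "id",   size: 4,  align: 4  },
  { name: "data", size: 64, align: 16 }, // the mat4x4 forces 16-byte alignment
]);
// layout.offsets.data === 16 → 12 bytes of padding after id
```

Feeding your struct's members through a function like this is also a quick way to sanity-check hand-written offset tables.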
Calculating Offsets Manually
If you aren't using a helper library, I recommend creating a small "Map" of your offsets so you don't lose track:
```javascript
const UNIFORM_LAYOUT = {
  id: 0,             // float index 0 (byte offset 0)
  inner_transform: 4 // float index 4 (byte offset 16, after 12 bytes of padding)
};
const buffer = new Float32Array(20); // 4 + 12 (padding) + 64 bytes = 80 bytes = 20 floats
// ... fill buffer ...
```

4. The Tail End: Padding to the Minimum Buffer Size
Even if your internal struct alignment is perfect, the total size of your uniform buffer must be a multiple of 16 bytes. If your struct technically only uses 20 bytes, WebGPU will often require you to allocate 32 bytes for the actual GPUBuffer.
But there's a second, more insidious rule: The `minUniformBufferOffsetAlignment`.
While not strictly a std140 rule, it's a hardware limit that affects how you use these buffers. The byte offset of a uniform buffer binding must be a multiple of device.limits.minUniformBufferOffsetAlignment, and the default (and most common) value of that limit is 256 bytes.
If you are trying to pack data for 10 different objects into one large buffer and use dynamicOffsets to switch between them, you can't just space them by the size of your struct.
```javascript
// This will likely throw a validation error on many systems
device.createBindGroup({
  layout: layout,
  entries: [{
    binding: 0,
    resource: {
      buffer: myLargeBuffer,
      offset: 64, // Struct size is 64, but this must be a multiple of 256!
      size: 64
    }
  }]
});
```

The Strategy
Don't just calculate your struct size. Calculate your aligned stride.
```javascript
const minAlignment = device.limits.minUniformBufferOffsetAlignment; // Usually 256
const structSize = 64;
const alignedStride = Math.ceil(structSize / minAlignment) * minAlignment;
// Now your buffer is (numObjects * 256) bytes large
```

Even if your data is only 64 bytes, you leave 192 bytes of empty space before the next object's data begins. It feels wasteful, but it's the only way to satisfy the hardware's requirement for efficient memory fetching.
Why Does This Exist? (The "Why")
It’s easy to get angry at these rules. Why can't the GPU just read the memory like a CPU does?
The answer is Memory Throughput. CPUs are optimized for low-latency, random access. GPUs are massive parallel processing engines. To keep thousands of cores fed with data, the memory controller fetches data in wide "chunks" (usually 128 or 256 bits at a time).
If a vec4 is split across two of these chunks, the GPU has to perform two memory fetches and then execute extra logic to stitch the bits back together. By forcing std140 alignment, the hardware guarantees that any basic type (like a vec4 or mat4) can be retrieved in a single, clean memory transaction. The "wasted" space is the price we pay for 10 teraflops of performance.
A Practical "Checklist" for Debugging
When your uniform data looks like garbage on the screen, run through this mental checklist:
1. Count your floats: Is your total Float32Array.length a multiple of 4? If not, you've definitely missed some padding.
2. Look for `vec3`: Did you leave a 4-byte gap after every single vec3?
3. Check array members: Arrays are the strictest case. Even an array<f32, 10> or an array<vec2<f32>, 10> needs a 16-byte stride per element in a uniform buffer — WGSL will refuse to validate the tightly packed version. Are you spacing those elements 16 bytes apart?
4. The Matrix Rule: Remember that a mat4x4 is essentially four vec4s. It takes 64 bytes and requires 16-byte alignment.
5. Use `size` and `offset` explicitly: When calling device.queue.writeBuffer, don't just pass the whole array if you're unsure. Use the parameters to specify exactly which byte you are starting at.
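For the first two checklist items, it helps to print your buffer the way the GPU reads it: one 16-byte row per line. A quick debugging sketch (`dumpVec4Rows` is a made-up helper name):

```javascript
// Render a Float32Array as 16-byte rows, mirroring how the GPU fetches it.
// dumpVec4Rows is a hypothetical debugging helper.
function dumpVec4Rows(data) {
  const rows = [];
  for (let i = 0; i < data.length; i += 4) {
    rows.push(Array.from(data.slice(i, i + 4)).join(", "));
  }
  return rows;
}

const rows = dumpVec4Rows(new Float32Array([1, 2, 3, 0, 0.5, 0, 0, 0]));
// rows[0] → "1, 2, 3, 0"   (the padded vec3)
// rows[1] → "0.5, 0, 0, 0" (the value the shader reads at offset 16)
```

If a value you expect at the start of a row shows up at the end of the previous one, you've found your missing padding.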
Tooling: Don't Do This by Hand Forever
While it is vital to understand these rules, you shouldn't be doing this math in your head for every project. The risk of a typo is too high.
I highly recommend reaching for a layout helper rather than hand-rolling offset math. In particular, check out [WebGPU-Utils](https://github.com/greggman/webgpu-utils) by Gregg Tavares. It can parse a WGSL struct straight from your shader source and hand back views with the exact byte offsets and sizes you need.
```javascript
import { makeShaderDataDefinitions, makeStructuredView } from 'webgpu-utils';

const code = `
struct Scene {
  resolution: vec2<f32>,
  time: f32,
};
@group(0) @binding(0) var<uniform> scene: Scene;
`;
const defs = makeShaderDataDefinitions(code);
const views = makeStructuredView(defs.uniforms.scene);
views.set({ resolution: [1920, 1080], time: 1.5 });
// Now views.arrayBuffer is perfectly aligned for the GPU
```

Summary
The std140 layout is a ghost in the machine. It doesn't show up in your console logs, and it doesn't stop your code from compiling. It just silently shifts your data until your shaders are reading zeros or the wrong variables entirely.
1. `vec3` is `vec4` for the purposes of memory.
2. Array elements are padded to a 16-byte stride.
3. Structs align to their largest member.
4. Uniform offsets usually need 256-byte alignment.
Respect the alignment, and your GPU will finally see the data you actually sent it. Ignore it, and you'll be debugging "black screens" until the end of time.
