Your Middleware Is Too Late: Scaling Rate Limiting with eBPF and XDP
Move your security logic into the kernel to drop malicious traffic before it ever incurs the performance penalty of a context switch or memory allocation.
You’ve been told that middleware is the best place to handle rate limiting. It’s convenient, it’s close to your business logic, and you can write it in the same language as your API. But if you’re building for high-scale or defending against malicious traffic, your middleware is already too late. By the time your Ruby, Python, or Go code decides to return a 429 Too Many Requests, the damage is done. Your CPU has already spent thousands of cycles on context switches, your kernel has already processed the TCP handshake, and your memory is already cluttered with expensive sk_buff structures.
If you want to actually scale protection, you have to stop acting at the application layer and start acting at the metal. You need to move your rate limiting into the kernel using eBPF and XDP.
The Invisible Tax of User Space
To understand why application-level rate limiting fails under pressure, we have to look at how a packet travels through Linux.
When a packet hits your Network Interface Card (NIC), it triggers an interrupt. The kernel driver picks it up, allocates a data structure called an sk_buff (socket buffer), and passes it up through the networking stack. The kernel does IP validation, routing, firewalling (iptables/nftables), and finally associates it with a socket. Only then is the process woken up, the data copied from kernel space to user space, and your middleware finally gets a look at the headers.
Even if your middleware is "fast," the overhead of getting the packet to your code is immense. In a DDoS scenario or a massive traffic spike, your kernel spends all its time processing packets that you're just going to throw away anyway. This is receive "livelock": a state where the system is doing work but making no progress, because it is overwhelmed by the overhead of processing input it will ultimately discard.
Enter XDP: The Fast Path
XDP (eXpress Data Path) is a framework within the Linux kernel that allows you to run eBPF (extended Berkeley Packet Filter) programs directly at the earliest possible point in the software stack: the network driver's receive ring.
With XDP, we can make a decision about a packet—XDP_DROP, XDP_PASS, or XDP_TX (reflect it back)—before the kernel has even allocated an sk_buff. We are talking about dropping malicious traffic at 10 million packets per second on a single core without breaking a sweat.
Building a Kernel-Level Rate Limiter
Let’s get our hands dirty. We’re going to build a basic rate limiter that tracks IP addresses and drops traffic that exceeds a certain threshold. This requires two parts: the C code that runs in the kernel and a loader/controller that runs in user space.
The Kernel Program (C)
Our kernel program needs a way to remember how many packets we’ve seen from a specific IP. For this, we use an eBPF Map.
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
// We'll use an LRU Hash map so the kernel automatically
// evicts old IPs when the map gets full.
struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __uint(max_entries, 10000);
    __type(key, __u32);   // IPv4 address
    __type(value, __u64); // Packet count
} ip_stats_map SEC(".maps");

SEC("xdp")
int xdp_rate_limit(struct xdp_md *ctx) {
    void *data_end = (void *)(long)ctx->data_end;
    void *data = (void *)(long)ctx->data;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;

    if (eth->h_proto != __constant_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *iph = data + sizeof(struct ethhdr);
    if ((void *)(iph + 1) > data_end)
        return XDP_PASS;

    __u32 ip_src = iph->saddr;
    __u64 *count = bpf_map_lookup_elem(&ip_stats_map, &ip_src);
    if (count) {
        // Use an atomic increment to avoid race conditions between CPU cores
        __sync_fetch_and_add(count, 1);

        // If they've sent more than 1000 packets, drop them.
        // In a real app, you'd integrate time-based logic.
        if (*count > 1000) {
            return XDP_DROP;
        }
    } else {
        __u64 initial_count = 1;
        bpf_map_update_elem(&ip_stats_map, &ip_src, &initial_count, BPF_ANY);
    }

    return XDP_PASS;
}
char _license[] SEC("license") = "GPL";

Why this is different
Look at that return XDP_DROP;. When this line executes, the packet is gone. It doesn't go to the TCP stack. No memory is allocated for it. No context switch to your Go/Node/Rust app occurs. It's the most efficient way to ignore someone in the history of computing.
The State Problem: Maps and User Space
The kernel code above is a bit "dumb"—it counts total packets forever. In a real-world scenario, you want a sliding window (e.g., 100 requests per second).
Doing complex floating-point math or tracking timestamps inside an eBPF program is possible but tricky because the eBPF verifier is very strict about loops and complexity. Usually, the best architecture is a Hybrid Approach:
1. Kernel Space (XDP): Increments counters and drops packets if a "block" flag is set in a map.
2. User Space (Control Plane): Reads the counters every second, calculates the rate, and updates a "blacklist" map if an IP is behaving badly.
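As a sketch of step 2, the rate decision itself is simple arithmetic. The `shouldBlock` helper and the `limitPPS` threshold below are illustrative names, not part of any real API; the control plane would run something like this over every entry it reads from `ip_stats_map` and write the offenders into the blocklist map that XDP consults:

```go
package main

import "fmt"

// limitPPS is an illustrative threshold: allowed packets per second.
const limitPPS = 100.0

// shouldBlock reports whether a counter sampled over a window of
// `seconds` exceeds the allowed rate. The control plane would call
// this per map entry, then flag offenders in a "blacklist" map that
// the XDP program checks before anything else.
func shouldBlock(packets uint64, seconds float64) bool {
	rate := float64(packets) / seconds
	return rate > limitPPS
}

func main() {
	fmt.Println(shouldBlock(1500, 1.0)) // 1500 pps: block
	fmt.Println(shouldBlock(50, 1.0))   // 50 pps: allow
}
```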
Here is how you might write the user-space side in Go using the cilium/ebpf library:
package main

import (
	"encoding/binary"
	"log"
	"net"
	"time"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
)

func main() {
	// Load the compiled eBPF ELF file. loadObjects here stands in for the
	// loader generated by bpf2go; you could also use ebpf.LoadCollectionSpec
	// and CollectionSpec.LoadAndAssign by hand.
	objs := struct {
		IPStatsMap *ebpf.Map     `ebpf:"ip_stats_map"`
		XdpProg    *ebpf.Program `ebpf:"xdp_rate_limit"`
	}{}
	if err := loadObjects(&objs, nil); err != nil {
		log.Fatalf("loading objects: %v", err)
	}
	defer objs.XdpProg.Close()

	// Attach to the eth0 interface; look up the index instead of hardcoding it.
	iface, err := net.InterfaceByName("eth0")
	if err != nil {
		log.Fatalf("looking up eth0: %v", err)
	}
	l, err := link.AttachXDP(link.XDPOptions{
		Program:   objs.XdpProg,
		Interface: iface.Index,
	})
	if err != nil {
		log.Fatalf("could not attach XDP: %v", err)
	}
	defer l.Close()

	// Control loop: reset counters every second to create a "rate"
	ticker := time.NewTicker(1 * time.Second)
	for range ticker.C {
		var (
			key   uint32
			value uint64
		)
		iter := objs.IPStatsMap.Iterate()
		for iter.Next(&key, &value) {
			// If we see high traffic, we could log it or alert
			if value > 500 {
				// The key is the address in network byte order; rebuild it for logging.
				ip := make(net.IP, 4)
				binary.NativeEndian.PutUint32(ip, key)
				log.Printf("IP %v sent %d packets in the last second", ip, value)
			}
			// Reset for the next window
			newVal := uint64(0)
			objs.IPStatsMap.Update(key, newVal, ebpf.UpdateExist)
		}
	}
}
Atomic Operations and the "Thundering Herd"
You might have noticed I used __sync_fetch_and_add in the C code. This is vital. eBPF programs run on every CPU core simultaneously. If a single IP is flooding you with packets, multiple CPU cores will be trying to update that same map entry at the exact same time. Without atomic operations, you'd have a classic race condition where increments are lost, making your rate limiter inaccurate exactly when you need it most (under heavy load).
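You can reproduce the same hazard in plain user-space Go. This illustrative program increments one counter with a racy read-modify-write and another with `sync/atomic` (the analogue of `__sync_fetch_and_add`); the atomic total is always exact, while the plain one typically comes up short under contention:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// race runs `workers` goroutines that each bump both counters n times:
// one with a plain (racy) read-modify-write, one atomically, like
// __sync_fetch_and_add in the XDP program.
func race(workers, n int) (atomicTotal, plainTotal uint64) {
	var plain uint64
	var exact atomic.Uint64

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < n; i++ {
				plain++ // non-atomic: concurrent increments can be lost
				exact.Add(1)
			}
		}()
	}
	wg.Wait()
	return exact.Load(), plain
}

func main() {
	atomicTotal, plainTotal := race(4, 100000)
	fmt.Println("atomic:", atomicTotal) // always 400000
	fmt.Println("plain: ", plainTotal)  // frequently less under contention
}
```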
However, atomic operations have a performance cost. They force cache synchronization across cores. If you are aiming for absolute maximum throughput, you might use BPF_MAP_TYPE_PERCPU_HASH. This gives each CPU core its own private map. The user-space program then sums up the values from all cores. It's faster for the kernel but more work for the control plane.
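A minimal sketch of that control-plane aggregation, assuming cilium/ebpf's convention of returning one value per possible CPU when you look up a key in a per-CPU map (the live lookup appears only in a comment, since it needs an attached map):

```go
package main

import "fmt"

// sumPerCPU aggregates the per-core counters that a lookup on a
// BPF_MAP_TYPE_PERCPU_HASH returns. With cilium/ebpf the lookup is
// roughly:
//
//	var perCPU []uint64
//	err := statsMap.Lookup(&key, &perCPU) // one slot per possible CPU
//
// Each core only ever touched its own slot, so no atomics were needed
// on the hot path; the cost moves here, into the control plane.
func sumPerCPU(perCPU []uint64) uint64 {
	var total uint64
	for _, v := range perCPU {
		total += v
	}
	return total
}

func main() {
	// Illustrative values, as if read from a 4-CPU machine.
	fmt.Println(sumPerCPU([]uint64{120, 80, 0, 300})) // 500
}
```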
The "Gotchas" You Will Encounter
I've made this sound like a silver bullet, but XDP has constraints that will frustrate you if you aren't prepared.
1. The Packet is Raw
At the XDP layer, you don't have "HTTP Headers." You have an array of bytes. If you want to rate limit based on a User-Agent or an Authorization header, you have to manually parse the Ethernet header, then the IP header, then the TCP header, find the data offset, and then search for strings in the payload. It’s tedious and error-prone. This is why XDP is usually used for L3/L4 (IP/Port) rate limiting, while application middleware still handles L7 (Business Logic) limiting.
2. The Verifier is Your New Nemesis
The eBPF verifier ensures your code won't crash the kernel. It forbids loops unless they are bounded and can be proven to terminate. It prevents you from accessing memory that hasn't been bounds-checked. If you try to read the IP address without first checking if the packet is long enough to actually contain an IP address, the verifier will reject your program.
3. Driver Support
XDP works best when the NIC driver supports it ("Native XDP"). Most modern drivers for Intel, Mellanox, and AWS (ENA) do. If your driver doesn't support it, Linux falls back to "Generic XDP," which runs after the sk_buff is allocated. It still saves you the user-space context switch, but you lose the massive performance gains of the early drop.
When Should You Actually Use This?
Don't go deleting your Express.js rate-limit middleware just yet.
Stick to Middleware if:
- You need to limit based on User IDs or API keys.
- Your traffic volume is manageable (under 10k requests/sec).
- You need to return helpful JSON error messages to the client.
Move to XDP if:
- You are seeing high "Softirq" CPU usage during traffic spikes.
- You are defending against volumetric attacks (DDoS).
- You are running a high-throughput proxy or load balancer.
- You need to drop traffic from "known bad" IP lists without taxing your application.
Strategy: The Multi-Layer Defense
The most robust architecture I’ve seen doesn’t pick one; it uses both.
Use XDP as a "Shield." It’s your heavy-duty filter that drops the garbage, the scanners, and the flooders. It’s cheap and fast. Then, let the "clean" traffic pass through to your Middleware, where you can do the expensive, high-context rate limiting—like checking if a specific user has exceeded their monthly quota for a specific feature.
By moving the bulk of the "No" decisions into the kernel, you ensure that when your application code finally gets a packet, it’s actually worth the CPU cycles it’s about to consume. Your middleware isn't just late; it's expensive. Start dropping at the door.


