
The Kernel Doesn't Care About Your Restart Script

Simon · 7 min read

Building a production BPF/XDP scanner is an exercise in humility.

We build autonomous network defence at NullRabbit. Part of that means scanning infrastructure at the packet level -- XDP for ingress, TC for egress, eBPF maps for state, all running in-kernel at microsecond resolution. The theory is elegant. The practice will ruin your week. It has ruined four of mine so far.

This is a collection of things that broke, why they broke, and what we did about them. If you're building anything that touches BPF in production, maybe this saves you a few days.

XDP Doesn't Leave When You Do

The first lesson: XDP programs survive process death.

We run our timing scanner (limpet-timing) as a systemd service. It attaches XDP to the network interface, collects packet-level timing data, does its work. When the process exits -- clean shutdown, crash, OOM kill, doesn't matter -- the XDP program stays attached.

We use link-based XDP attachment, so closing the BPF link's file descriptor should detach the program. It does. Eventually. The kernel defers FD cleanup, and "deferred" in kernel time can mean your systemctl restart finds XDP still attached and fails.

So now you have a dead process and an orphaned XDP program sitting on your interface. No scanner running, no data flowing, and the next startup fails because the interface is already claimed.

Our fix is blunt: a cleanup script that runs as both ExecStartPre and ExecStopPost in the systemd unit. It kills lingering processes, then polls ip link set dev $IFACE xdp off with retries. Ten attempts, one second apart. If XDP is still attached after that, the unit fails hard rather than running without packet capture.
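The wiring looks roughly like this. This is a sketch, not our actual unit file -- the script name and paths are illustrative, and the real script also clears the TC qdisc described below:

```ini
# Hypothetical unit fragment for limpet-timing.
[Service]
ExecStartPre=/usr/local/bin/limpet-cleanup.sh
ExecStart=/usr/local/bin/limpet-timing
# "-" prefix: a failed post-stop cleanup shouldn't mask the stop itself.
ExecStopPost=-/usr/local/bin/limpet-cleanup.sh

# limpet-cleanup.sh, at its core:
#   for i in $(seq 1 10); do
#     ip link set dev "$IFACE" xdp off && exit 0
#     sleep 1
#   done
#   exit 1   # fail hard rather than run without packet capture
```

Because ExecStartPre gates ExecStart, a persistently stuck XDP attachment fails the whole unit instead of letting the scanner start blind.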

Not elegant. Works every time.

The Ghosts in the Traffic Control Layer

XDP handles ingress. For egress visibility, we use TC with a clsact qdisc. Same problem, different layer.

Stale TC qdiscs from previous runs block new BPF filter attachments. The cleanup script now also runs tc qdisc del dev $IFACE clsact before every start. If you're wondering whether we discovered this at 2am on a scanner that had been silently failing for hours -- yes. Obviously yes.

BPF and Async Rust: A Horror Story

Our scanner is async Rust. Tokio runtime, the usual. BPF operations are not async. They're blocking system calls that talk to the kernel. This means every BPF scan runs on tokio's blocking thread pool via spawn_blocking with Handle::block_on inside.

Here's what happens with one worker thread: deadlock. The blocking task needs the runtime to poll futures. The runtime has one thread. That thread is blocked. Nothing moves.

We hardcoded worker_threads = 2. This is not a real fix. Under extreme load the blocking pool can still saturate. But it stopped the deadlock at normal operating load, and we moved on because there were fourteen other things on fire.

The deeper problem is that our BPF library held a mutex during pacing sleeps. The pacing profile controls probe timing -- you don't want to hammer a target with packets. But std::thread::sleep while holding a mutex means every other concurrent scan is blocked waiting for one scan's pacing delay to finish. We fixed the library to release the mutex during sleep. Obvious in hindsight. Invisible until you're watching twelve scans queued behind one slow target.
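The shape of the bug, reduced to std primitives. Pacer and both function names are stand-ins -- the real library's types differ -- but the before/after pattern is the actual fix:

```rust
use std::sync::Mutex;
use std::thread;
use std::time::{Duration, Instant};

// Stand-in for the library's shared pacing state.
struct Pacer {
    last_probe: Instant,
}

// Anti-pattern: sleeping with the lock held serializes every concurrent scan.
fn probe_holding_lock(pacer: &Mutex<Pacer>, delay: Duration) {
    let mut p = pacer.lock().unwrap();
    thread::sleep(delay); // every other scan queues here
    p.last_probe = Instant::now();
}

// Fix: read what you need under the lock, drop the guard, then sleep.
fn probe_releasing_lock(pacer: &Mutex<Pacer>, delay: Duration) {
    {
        let _p = pacer.lock().unwrap();
        // compute the pacing delay from shared state, then release
    }
    thread::sleep(delay); // no lock held during the pacing delay
    pacer.lock().unwrap().last_probe = Instant::now();
}
```

The rule is mechanical: the lock protects the state, not the wait. Read under the lock, sleep outside it.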

Crash Loops at Kernel Speed

When BPF fails to attach -- wrong kernel version, missing capabilities, interface already claimed -- it fails fast. Systemd sees the exit, restarts the service, BPF fails again. Restart. Fail. Restart. Fail. At systemd's default restart interval, you're burning CPU and flooding logs with identical errors.

We built a BpfHealthGuard that detects repeated BPF attachment failures. On detection, it publishes one MQTT health alert (using a compare-and-swap guard so exactly one task sends it, not twelve), triggers a 30-second cooldown, then exits cleanly. The cooldown breaks the tight loop. The alert means someone knows about it.
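The "exactly one alert" part is a single compare-and-swap on an atomic flag. BpfHealthGuard's internals aren't public, so the names here are illustrative, but the mechanism is this:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Illustrative stand-in for the alert-dedup part of BpfHealthGuard.
pub struct AlertOnce {
    fired: AtomicBool,
}

impl AlertOnce {
    pub fn new() -> Self {
        Self { fired: AtomicBool::new(false) }
    }

    /// Returns true for exactly one caller, however many tasks race.
    pub fn try_claim(&self) -> bool {
        self.fired
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }
}
```

The winning task publishes the MQTT alert; every other task sees `false` and falls through to the cooldown path.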

Docker Capabilities: The Privilege Escalation You Actually Need

Running BPF from a container requires capabilities that make security teams nervous:

CAP_NET_RAW, CAP_NET_ADMIN, CAP_BPF, and CAP_SYS_ADMIN -- plus LimitMEMLOCK=infinity, which is a resource limit rather than a capability, but just as non-negotiable.

There's no way around this. BPF needs to load programs into the kernel. XDP needs to attach to network interfaces. You need raw socket access. You need admin over network configuration. And BPF maps are locked memory, so you need unlimited MEMLOCK or your maps fail silently when they exceed the default limit.
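In compose terms, the grant list looks something like this. A sketch only -- the service and image names are hypothetical, and the memlock ulimit is the container-world equivalent of systemd's LimitMEMLOCK=infinity:

```yaml
# Illustrative compose fragment, not our production file.
services:
  limpet-timing:
    image: nullrabbit/limpet-timing   # hypothetical image name
    network_mode: host                # XDP has to see the real interface
    cap_add:
      - NET_RAW
      - NET_ADMIN
      - BPF
      - SYS_ADMIN
    ulimits:
      memlock: -1                     # BPF maps live in locked memory
```

Every line in that fragment is something a security review will ask about, which is part of why the container story didn't last.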

We eventually moved the scanner from Docker to bare systemd units precisely because the capability dance in containers was fragile. The Docker image also shipped without iproute2 for a while, which meant the cleanup script's ip commands failed silently. The scanner ran -- it just couldn't clean up after itself, which meant the next restart failed. Containers and BPF are not friends.

The Binary That Wasn't

This one cost us three days.

We deploy scanner binaries via CI. The deploy script copies the new binary to the target machine and restarts the service. Simple. Except cp fails when the target binary is still running -- the kernel refuses to overwrite a file that's currently being executed (ETXTBSY) -- and the deploy script didn't check the exit code. The copy doesn't happen. No error surfaces. The old binary keeps running.

Scanner-1 ran a stale binary from March 6th until March 9th. We didn't notice because the service was "healthy" -- it was running, responding to health checks, accepting work. It just wasn't running the code we thought it was.

Fix: rm -f before cp, a 2-second sleep after systemctl stop, and SHA256 verification post-deploy. If the hash doesn't match the release artifact, the deploy fails loudly.

Concurrency Without Limits

Our early timing scanner had no concurrency control. Every incoming scan request spawned a tokio task. Thirty requests meant thirty concurrent BPF scans, each with its own kernel-level packet capture, its own map entries, its own memory allocation.

This is fine at low volume. At campaign scale -- scanning hundreds of hosts across a validator network -- it's resource exhaustion. BPF maps have finite space. Kernel memory has limits even with unlimited MEMLOCK. The scanner would degrade gracefully until it didn't, then fall off a cliff.

We added semaphores. MAX_CONCURRENT_TIMING (default 10) bounds timing scans. MAX_CONCURRENT_DISCOVERY (default 10) bounds discovery scans. Both configurable. The orchestrator queues the rest. Simple and boring, which is what you want from infrastructure.
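The bounding idea, sketched with std primitives. The scanner itself uses an async semaphore in the tokio world; this is the same logic in blocking form, with `permits` standing in for MAX_CONCURRENT_TIMING:

```rust
use std::sync::{Condvar, Mutex};

// Minimal counting semaphore over a Mutex + Condvar.
struct Bound {
    free: Mutex<usize>,
    cv: Condvar,
}

impl Bound {
    fn new(permits: usize) -> Self {
        Self { free: Mutex::new(permits), cv: Condvar::new() }
    }

    // Blocks until a permit is available, then takes it.
    fn acquire(&self) {
        let mut free = self.free.lock().unwrap();
        while *free == 0 {
            free = self.cv.wait(free).unwrap();
        }
        *free -= 1;
    }

    // Returns a permit and wakes one waiter.
    fn release(&self) {
        *self.free.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}
```

The queueing falls out for free: callers beyond the limit block in acquire until a running scan releases its permit, which is exactly the "orchestrator queues the rest" behaviour.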

What All of This Means

None of these problems are individually hard. XDP cleanup is a shell script. Mutex contention is a library fix. Binary deployment is basic ops. Concurrency control is a semaphore.

But they compound. A stale XDP program prevents restart. A failed restart means no scanner. No scanner means no data. No data means the system that's supposed to observe and learn -- the entire premise of behavioural security -- is blind.

Every one of these issues was discovered in production. Not because we don't test -- we do -- but because BPF in production on ephemeral infrastructure across multiple cloud providers surfaces edge cases that lab environments simply can't reproduce. The kernel version matters. The network driver matters. Whether your NIC supports XDP native mode or falls back to generic mode matters. Whether the previous tenant's iptables rules left something weird in the netfilter tables matters.

Building packet-level scanning that actually works in production is mostly not about the scanning. It's about everything around the scanning: cleanup, lifecycle, deployment, concurrency, monitoring. The BPF program itself is the easy part. Keeping it alive, across restarts, across deploys, across cloud provider quirks -- that's the work.

We're building this because we believe network defence needs to operate at packet speed, not dashboard speed. XDP gives us that. But XDP also gives us a front-row seat to every way the kernel can make your life difficult.

Worth it. Mostly. Soon, we'll show you what the vectors look like.

The limpet source is available at https://github.com/NullRabbitLabs/limpet-trs

We'd appreciate a star or two.
