NullRabbit

Validator packet loss, AF_XDP drops and IOPS saturation: reading host-level saturation signals

Rising packet loss, AF_XDP drop counters and disk IOPS saturation on a validator host are three readings of the same condition: traffic and state are arriving faster than the pipeline behind them can drain. Clients that take packets through AF_XDP move ingest off the kernel's normal path for speed, which also moves the drop point into the application's own rings, so AF_XDP drops mean the validator itself, not the kernel, failed to keep up. IOPS saturation is the storage half of the same story.
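As a minimal sketch of how the AF_XDP side can be read, the counters are easier to interpret as per-interval deltas than as raw totals. The field names below mirror Linux's `struct xdp_statistics`, which a process can fetch from its own AF_XDP socket with `getsockopt` and the `XDP_STATISTICS` option. The `XdpStats` class, the `classify` helper and its labels are hypothetical, and the samples are supplied by hand rather than read from a live socket.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class XdpStats:
    """Mirror of the u64 counters in Linux `struct xdp_statistics`."""
    rx_dropped: int = 0
    rx_invalid_descs: int = 0
    tx_invalid_descs: int = 0
    rx_ring_full: int = 0
    rx_fill_ring_empty_descs: int = 0
    tx_ring_empty_descs: int = 0


def delta(prev: XdpStats, curr: XdpStats) -> dict:
    """Per-counter increase between two samples taken from the same socket."""
    return {f.name: getattr(curr, f.name) - getattr(prev, f.name)
            for f in fields(XdpStats)}


def classify(d: dict) -> str:
    """Hypothetical reading of one interval's deltas (not an official taxonomy)."""
    if d["rx_fill_ring_empty_descs"] > 0:
        # The application did not hand buffers back to the fill ring in time.
        return "fill_ring_empty"
    if d["rx_ring_full"] > 0:
        # The application did not consume received descriptors in time.
        return "rx_ring_full"
    if d["rx_dropped"] > 0:
        return "other_drop"
    return "ok"


prev = XdpStats(rx_ring_full=10)
curr = XdpStats(rx_ring_full=25)
print(classify(delta(prev, curr)))
```

Either ring-related label points at the consumer side of the socket, which is the sense in which the validator, not the kernel, failed to keep up.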

Common causes

  • Burst traffic (shreds, repair and gossip during load events) arrives faster than the receive rings are drained. The fill ring runs empty, so frames are dropped at the NIC boundary.
  • Ring sizes, queue counts or core pinning do not match the NIC's interrupt layout, leaving some queues hot while others idle.
  • The accounts database and ledger share an NVMe device with the OS or with snapshot writes, and replay stalls on write latency rather than throughput.
  • External flood traffic is consuming ingest capacity before any of it is classified, which is the condition inline filtering exists to prevent.
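For the storage cause above, the distinction between write latency and throughput can be checked from two samples of `/proc/diskstats`. The sketch below is an illustration under stated assumptions: it relies on the standard field order of that file (writes completed, milliseconds spent writing, milliseconds spent doing I/O), while the device name `nvme0n1`, the sample lines and the helper names are hypothetical.

```python
def parse_diskstats_line(line: str) -> dict:
    """Extract the write-side counters from one /proc/diskstats line."""
    parts = line.split()
    return {
        "name": parts[2],
        "writes": int(parts[7]),        # writes completed
        "ms_writing": int(parts[10]),   # total ms spent on writes
        "ms_doing_io": int(parts[12]),  # ms the device had I/O in flight
    }


def write_pressure(prev: dict, curr: dict, interval_s: float) -> dict:
    """Write IOPS, mean write latency and busy fraction over one interval."""
    d_writes = curr["writes"] - prev["writes"]
    d_ms = curr["ms_writing"] - prev["ms_writing"]
    d_busy = curr["ms_doing_io"] - prev["ms_doing_io"]
    return {
        "write_iops": d_writes / interval_s,
        "avg_write_ms": d_ms / d_writes if d_writes else 0.0,
        "util": d_busy / (interval_s * 1000.0),
    }


# Two hand-written samples taken one second apart (illustrative values only).
t0 = parse_diskstats_line("259 0 nvme0n1 1000 0 8000 500 2000 0 16000 4000 0 3000 4500")
t1 = parse_diskstats_line("259 0 nvme0n1 1100 0 8800 550 3000 0 24000 24000 5 3950 24550")
print(write_pressure(t0, t1, interval_s=1.0))
```

A modest IOPS figure paired with a high mean write latency is the pattern the shared-device cause describes. On NVMe devices, which serve many requests in parallel, the busy fraction is only a rough indicator and is best read alongside latency.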

System-level mechanism

The ingest and replay pipeline is a chain of queues, and saturation anywhere backs up everything upstream of it. When replay stalls on disk, receive buffers fill; when receive rings overflow, shreds are lost; lost shreds trigger repair, which adds more traffic to the interface that is already dropping. The feedback loop is the dangerous part, because it converts a marginal overload into a falling-behind event. Kernel-bypass networking gives up the kernel's buffering safety margin in exchange for lower latency, which is the right trade for a validator. It also makes the drop counters the earliest honest signal of trouble, and it places the only effective intervention point at the NIC boundary itself, where traffic can be dropped before it costs anything.
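The feedback loop can be illustrated with a toy model. This is a sketch, not a description of any real validator: a single bounded ring is drained at a fixed rate, and every dropped frame produces `repair_factor` extra frames of repair traffic on the next tick. All names and numbers are hypothetical.

```python
def simulate(base_rate: float, drain_rate: float, ring_size: float,
             repair_factor: float, ticks: int) -> list:
    """Return the number of frames dropped on each tick."""
    queue = 0.0
    repair = 0.0
    drops = []
    for _ in range(ticks):
        arriving = base_rate + repair
        # Frames waiting once this tick's arrivals and drain are accounted for.
        backlog = max(0.0, queue + arriving - drain_rate)
        dropped = max(0.0, backlog - ring_size)
        queue = min(backlog, ring_size)
        # Lost frames come back as repair traffic on the same interface.
        repair = repair_factor * dropped
        drops.append(dropped)
    return drops


# A 5% overload against a ring that holds half a tick of drain capacity.
no_feedback = simulate(105, 100, 50, repair_factor=0.0, ticks=30)
with_feedback = simulate(105, 100, 50, repair_factor=1.0, ticks=30)
print(no_feedback[-1], with_feedback[-1])
```

Without feedback the drop rate settles at the overload margin once the ring is full. With one repair frame per lost frame it keeps climbing on every tick, which is how a marginal overload becomes a falling-behind event in this model.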

What this indicates

Drops and IOPS saturation rising together under external load indicate the host is absorbing pressure it should be shedding at ingress. Rising IOPS with quiet NICs indicates a storage layout problem instead. The counters localise the bottleneck precisely, and the remedy (queue tuning, disk isolation, or filtering at the XDP layer) follows from which counter moved first.
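One way to make "which one moved first" concrete is to compare when each signal first crossed a threshold. The sketch below assumes two time-aligned series, NIC drops per second and disk busy fraction. The thresholds, labels and function names are hypothetical choices for illustration, not values taken from the text.

```python
def first_crossing(samples, threshold):
    """Index of the first sample at or above threshold, or None if never."""
    for i, value in enumerate(samples):
        if value >= threshold:
            return i
    return None


def localise(nic_drops_per_s, disk_util, drop_threshold=1.0, util_threshold=0.9):
    """Label the bottleneck from two aligned series (assumed thresholds)."""
    nic_at = first_crossing(nic_drops_per_s, drop_threshold)
    disk_at = first_crossing(disk_util, util_threshold)
    if nic_at is None and disk_at is None:
        return "healthy"
    if nic_at is None:
        # Rising IOPS with quiet NICs: look at the storage layout.
        return "storage-layout"
    if disk_at is None:
        # Drops without disk pressure: look at rings, queues and pinning.
        return "ingest-only"
    # Both moved: the host is absorbing pressure; report which led.
    return "both:nic-first" if nic_at <= disk_at else "both:disk-first"


print(localise([0, 40, 90], [0.3, 0.5, 0.95]))
```

Returning the leader rather than a single verdict keeps the choice of remedy with the operator, who can weigh it against what is known about external load at the time.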

Related issues

Slot distance increasing and vote credits dropping as the pipeline falls behind; a restarted node stuck catching up because the same disks now carry snapshot restore as well; cluster-wide slowdowns when many hosts hit this state at once.
