NullRabbit

Solana validator stuck catching up and snapshot download slow: why it happens

A validator stuck catching up is losing a race: it must replay slots faster than the cluster produces them, starting from a snapshot that is ageing while it downloads. The cluster advances at roughly two and a half slots per second, so a node that replays at or below that rate will never close the gap, and solana catchup reports the distance growing instead of shrinking. A slow snapshot download makes the race longer; slow replay makes it unwinnable.

Common causes

  • The snapshot is being fetched from a slow or distant RPC peer, and the node does not abort and retry a faster one, so the snapshot is already thousands of slots old when replay starts.
  • Replay throughput is capped by disk: the accounts database and ledger are on shared or throttled NVMe, and untarring the snapshot competes with replay for the same IOPS.
  • CPU is undersized or shared, so the replay stage cannot exceed cluster pace even with healthy disks.
  • The node restarts into catchup during a period of high cluster load, when blocks are fuller and each slot costs more to replay.
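To put a rough number on the first cause, the sketch below estimates how many slots the cluster produces while a snapshot is still downloading. The 50 GB snapshot size and both peer throughputs are illustrative assumptions, not measurements, and the function name is hypothetical. It also ignores how old the snapshot already was when the peer published it, so it understates the real gap.

```python
def slots_behind_after_download(snapshot_bytes, bytes_per_second, cluster_rate=2.5):
    """Slots the cluster produces while the snapshot downloads.

    cluster_rate is the article's rough figure of 2.5 slots per second.
    """
    if bytes_per_second <= 0:
        raise ValueError("throughput must be positive")
    download_seconds = snapshot_bytes / bytes_per_second
    return download_seconds * cluster_rate


# Hypothetical 50 GB snapshot fetched from a slow peer and from a fast peer.
SNAPSHOT_BYTES = 50 * 10**9
slow_peer = slots_behind_after_download(SNAPSHOT_BYTES, 20 * 10**6)   # 20 MB/s
fast_peer = slots_behind_after_download(SNAPSHOT_BYTES, 200 * 10**6)  # 200 MB/s

print(f"slow peer: {slow_peer:.0f} slots behind before replay starts")
print(f"fast peer: {fast_peer:.0f} slots behind before replay starts")
```

Under these assumed figures the slow peer leaves the node thousands of slots behind before replay has processed a single slot, which is the situation the first bullet describes.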

System-level mechanism

Catchup is bounded by the slowest stage of a pipeline that runs entirely on one host: download bandwidth, snapshot decompression, accounts index rebuild, then replay. Each stage has a different bottleneck, which is why the same symptom appears on hosts that fail for different reasons. The race framing matters because it makes the arithmetic explicit: if replay manages three slots per second against the cluster's two and a half, a one-hour-old snapshot still costs around five hours of catchup. Operators who only provision for steady-state validation discover the gap during recovery, which is exactly when stake is offline and the cost is visible.
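The arithmetic behind that five-hour figure can be written out as a minimal sketch. The function name is hypothetical, and the model assumes constant replay and cluster rates, which real hosts will not hold exactly.

```python
def catchup_seconds(snapshot_age_seconds, replay_rate, cluster_rate=2.5):
    """Estimated seconds to close the gap, or None if the gap never closes.

    Rates are in slots per second. The gap only shrinks by the difference
    between the replay rate and the cluster rate.
    """
    gap_slots = snapshot_age_seconds * cluster_rate
    closing_rate = replay_rate - cluster_rate
    if closing_rate <= 0:
        return None  # replay at or below cluster pace: the race is unwinnable
    return gap_slots / closing_rate


# One-hour-old snapshot, replay at 3 slots/s against the cluster's 2.5.
estimate = catchup_seconds(3600, 3.0)
print(estimate / 3600)  # 5.0 hours

# Replay at exactly cluster pace never converges.
print(catchup_seconds(3600, 2.5))  # None
```

The one-hour gap is 9,000 slots, and a replay rate of three slots per second gains only half a slot per second on the cluster, so closing it takes 18,000 seconds.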

What this indicates

A catchup that converges slowly indicates marginal provisioning; one that diverges indicates a hard bottleneck, usually disk, and no amount of waiting will fix it. Measure snapshot peer throughput and replay slots per second separately before changing anything, because the remedies are different: peer selection for the first, hardware isolation for the second.
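One way to tell converging from diverging is to sample the slot distance twice and derive the rates from the difference. The sketch below assumes the samples are collected by hand, for example by noting the distance that solana catchup reports at two points in time. The function name and the sample format are hypothetical.

```python
def classify_catchup(samples, cluster_rate=2.5):
    """Classify a catchup from (unix_seconds, slots_behind) samples, oldest first.

    Uses only the first and last sample. cluster_rate is the article's rough
    figure of 2.5 slots per second, so replay_rate is an estimate.
    """
    (t0, d0), (t1, d1) = samples[0], samples[-1]
    elapsed = t1 - t0
    if elapsed <= 0:
        raise ValueError("samples must span a positive time interval")
    closing_rate = (d0 - d1) / elapsed         # slots per second gained on the cluster
    replay_rate = cluster_rate + closing_rate  # implied replay throughput
    converging = closing_rate > 0
    return {
        "verdict": "converging" if converging else "diverging",
        "replay_rate": replay_rate,
        "eta_seconds": d1 / closing_rate if converging else None,
    }


# Ten minutes apart: the distance fell from 9,000 to 8,700 slots.
print(classify_catchup([(0, 9000), (600, 8700)]))
# Ten minutes apart: the distance grew from 9,000 to 9,300 slots.
print(classify_catchup([(0, 9000), (600, 9300)]))
```

A "diverging" verdict says nothing about which stage is the bottleneck. It only confirms that waiting will not help, and that disk and CPU should be checked next.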

Related issues

Slot distance increasing on a running node; IOPS saturation and packet loss under load; vote credits dropping once the node finally rejoins.

Deep references

  • We're securing validators at the wrong layer covers why the infrastructure layer underneath consensus, including recovery paths like this one, gets the least attention and absorbs the most failure.
  • Expensive work before authentication covers the load on the RPC nodes that serve snapshots, which is the other half of a slow download.
  • slashr.dev shows delinquency windows across networks, which is where extended catchup time becomes publicly visible.