
Why ML Detection on Validator Infrastructure Keeps Reporting ROC = 1.000

Simon · 5 min read


Public-internet-reachable validator infrastructure has been racking up resource-consumption findings across Sui, Solana, Ethereum, Aptos, and Cosmos-family chains since 2024. The disclosure pipelines work. Operators want monitoring on this surface. The detection literature has obliged, and it almost uniformly reports near-perfect ROC.

That is the problem.

I've put a working draft on GitHub: nr-substrate-paper. It documents a methodology for ML detection of validator-infrastructure attacks that takes Sommer & Paxson and the Dos and Don'ts of ML in Security seriously enough to assume its own first results are wrong, and to build audit machinery that proves it.

The first TrainReport said 1.000

V1 of the trainer ran against a 1,092-bundle Sui validator-DoS corpus. ROC = 1.000 across all 17 leave-one-primitive-out folds. Two minutes of audit found why: random-Gaussian features were also scoring 1.000 (LOPO benign-holdout contamination), and resp.count alone was scoring 1.000 (capture-pipeline co-linearity). The model was learning the corpus, not the attack.
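The two probes that caught V1 can be sketched generically. This is a minimal reconstruction, not the paper's harness: the model choice (a random forest), feature counts, and function names are all assumptions, and only the mechanism is taken from the text. The key detail is that the noise is drawn per bundle before the split, so any bundle leaking across the train/test boundary carries its noise vector with it, and benign-holdout contamination shows up as a random-feature ROC near 1.0. The single-feature sweep exposes capture-pipeline co-linearity of the resp.count kind.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def gaussian_probe(y, train_idx, test_idx, n_features=20, seed=0):
    """Leakage probe: train on pure noise features.

    Noise is assigned once per bundle, BEFORE the split, so a bundle
    that appears on both sides of the boundary keeps its noise vector.
    A clean split scores ~0.5; holdout contamination pushes it toward 1.0.
    """
    rng = np.random.default_rng(seed)
    X_noise = rng.standard_normal((len(y), n_features))
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(X_noise[train_idx], y[train_idx])
    scores = clf.predict_proba(X_noise[test_idx])[:, 1]
    return roc_auc_score(y[test_idx], scores)

def single_feature_sweep(X, y, train_idx, test_idx, feature_names, seed=0):
    """Per-feature ROC sweep: a lone feature scoring ~1.0 flags
    capture-pipeline co-linearity rather than a real detection signal."""
    out = {}
    for j, name in enumerate(feature_names):
        clf = RandomForestClassifier(n_estimators=100, random_state=seed)
        clf.fit(X[train_idx][:, [j]], y[train_idx])
        scores = clf.predict_proba(X[test_idx][:, [j]])[:, 1]
        out[name] = roc_auc_score(y[test_idx], scores)
    return out
```

Either probe scoring near 1.0 means the number is an artefact of the corpus or the split, which is exactly the V1 failure mode.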

The first run looked perfect, and "perfect" is exactly what tells you something is broken.

The interesting question is not how to suppress that result. It is how to build a process where the next result, and the next, and the next, are forced to surface their own version of the same failure before the headline number stabilises.

Iterative leak-surface peeling

The methodology has three pieces that have to hold together:

Pre-registration. Before each training cycle, a STEP-N-DESIGN.md registers numerical thresholds, structural predictions, and outcome-band composition rules. If audit fires, the cycle stops. There is no iterating within a cycle. v_N+1 is a fresh registration with the audit findings as inputs.

Falsifiability discriminators. Random-Gaussian-feature ROC, single-feature ROC sweeps, KDE-overlap floors, mode-mix spread targets. Each cycle bakes in the next one's audit triggers. The harness is designed to fail loudly on artefacts that previously slipped through.

Corpus immutability. Iterations land on Spaces increment-only: corpus_v1.0, v1.1, v1.2. Nothing is rewritten. The history of what the model saw is auditable.

Commit to what counts as success before you see the numbers, build a test rig designed to embarrass the model, and freeze the data so anyone can replay it.

Across V1 to V7-narrow on a 2,103-bundle Sui+Solana corpus, this surfaced and closed eight distinct leak surfaces in sequence. Each one looked like a result before audit. Each one was a methodology bug.

Eight times in a row, what looked like progress was the harness lying to us. The methodology caught it.

Two things that are open, two things that are closed

The paper rests on two pillars that are independent of any single training outcome:

  • Bundle v1, the multi-modal capture format with controlled-vocabulary provenance fields, has now absorbed three schema-additive extensions across primitive families and chains without modification to the core layout. That is the test for whether a format generalises.
  • The chain-agnostic family taxonomy of ten attack families plus benign has classified post-cycle attacks without extension: Solana primitives slotted straight into the Sui-derived families.
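The controlled-vocabulary idea can be illustrated with a toy validator. Every field name and vocabulary value below is invented for illustration; the authoritative schema lives in nr-bundle-spec. The mechanism is what matters: extensions add values to a registered vocabulary, not fields to the core layout, which is why three schema-additive extensions could land without touching it.

```python
# Hypothetical vocabularies -- the real ones are defined in nr-bundle-spec.
CHAIN_VOCAB = {"sui", "solana"}
FAMILY_VOCAB = {"benign", "conn-flood", "payload-amplification"}

def validate_provenance(bundle: dict) -> list[str]:
    """Reject free-text provenance: every value must come from a
    registered vocabulary.  Extending a vocabulary is schema-additive;
    adding a field would modify the core layout."""
    errors = []
    if bundle.get("chain") not in CHAIN_VOCAB:
        errors.append(f"chain {bundle.get('chain')!r} not in vocabulary")
    if bundle.get("family") not in FAMILY_VOCAB:
        errors.append(f"family {bundle.get('family')!r} not in vocabulary")
    return errors
```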

Open format, closed corpus. The schema and reference parsers are public. The 2,103-bundle corpus is not. That is the Hugging Face-style analogue for security data: anyone can produce bundles in this format; the curated training corpus is the moat.

The format is a public standard anyone can build on. The labelled attack data stays proprietary.

The V7-narrow finding worth lingering on

The cipher-agnostic claim asked whether models trained on cleartext traffic retain accuracy when the same attacks are run over TLS. The cross-chain mechanism claim asked whether features that detect an attack family on Sui detect the same family on Solana.

V7-narrow falsified the cross-chain mechanism claim at the rate-invariant 13-feature manifest layer. It looked like the cross-chain story was over.

It wasn't. The Step-11 V8 retrain, after removing one feature (pcap.mean_packet_size) implicated in an architectural fidelity-tier limit, recovered 124 to 191 percent retention against the cleartext baseline across both chains and both LOPO regimes. Cross-chain mechanism non-transfer in this domain is feature-localised, not whole-feature-space: one feature was carrying the failure, and the rest of the manifest generalised.

Equally important: the encryption boundary itself does not introduce additional accuracy degradation beyond what the cleartext baseline already captures. That is the cipher-agnostic claim landing.
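To make the retention numbers concrete: the paper's exact retention formula is not spelled out in this post, so the definition below, signal above chance relative to the cleartext baseline, is an assumption. It is one natural definition under which a variant can exceed 100 percent, as the 124 to 191 percent range does, when the variant lands closer to a perfect score than the baseline did.

```python
def retention_pct(cleartext_auc: float, variant_auc: float,
                  chance: float = 0.5) -> float:
    """Retention of discriminative signal, in percent, measured above
    chance (AUC 0.5) relative to the cleartext baseline.  Illustrative
    definition only; the paper's metric may differ."""
    return 100.0 * (variant_auc - chance) / (cleartext_auc - chance)
```

Under this definition a cleartext baseline of 0.9 and a TLS or cross-chain variant of 0.996 yields 124 percent retention.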

Two questions: does encryption break detection, and does cross-chain detection work? The pre-committed gate said cross-chain failed; one bad feature was carrying it. Drop the feature and both claims land: TLS doesn't blind the detector, and detectors trained on Sui generalise to Solana.

Why this matters outside this paper

Most ML-in-security papers that report ROC = 1.000 are not lying. They are running a process that does not have the apparatus to detect that they are wrong. The contribution here is not a model. The model is downstream. The contribution is the apparatus: the registration discipline, the audit floors, the v_N transitions as first-class artefacts.

If you are doing ML on network telemetry and your numbers look too good, they are. The methodology in this paper is one way to find out why before someone else does.

The deliverable isn't a clever model. It's the process that stops you fooling yourself with one.

What's next

V1 of the preprint ships when the arXiv submission lands. Companion artefacts: nr-bundle-spec (public after 2026-06-05, when the Solana coordinated-disclosure window closes), and nr-bundles-public on Hugging Face with sample bundles spanning multiple primitives and chains.

The architectural companion, Earned Autonomy, is the production-deployment side of the same picture: this paper is the data layer, that paper is the governance layer.

Working draft is here: github.com/NullRabbitLabs/nr-substrate-paper. Issues and pull requests welcome.
