NullRabbit
Research · June 9, 2026

The 99% was wrong. So was the 0.32.

Simon Morley·5 min read

We build detectors for network attacks against blockchain validators. An attack is captured as a structured bundle of packets, host metrics and RPC responses, a classifier is trained on the corpus, and the model tells an operator what it is looking at: a state-sync flood, a response amplification, a rate-limiter bypass. The format is chain-agnostic by design, and the central claim is that a detector trained on it generalises across chains.

This post started as a write-up about a comfortable accuracy number hiding an uncomfortable one. While writing it we found that the uncomfortable number was also wrong, and we only found out because we stopped writing and ran the test. There are two lessons here, and the second one is the better one.

The 99%

The shipped model reported 99% accuracy. The number is real but close to meaningless. It comes from random k-fold cross-validation on a corpus where each attack primitive contributes fifty or more near-identical capture bundles. Split that corpus randomly into train and test, and near-duplicates of every test bundle sit in the training set. The 99% measures memorisation of a corpus, not detection of an attack.
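The leak described above is easy to reproduce on a toy corpus. The sketch below is illustrative only: the corpus shape (ten primitives, fifty bundles each) and the helper name `leaked_fraction` are assumptions, not the real pipeline. It measures how many test bundles have a sibling from the same attack primitive in the training set after a purely random split.

```python
import random

def leaked_fraction(groups, test_fraction=0.2, seed=0):
    """Fraction of test items whose group also appears in training
    when the split is purely random (no grouping)."""
    rng = random.Random(seed)
    indices = list(range(len(groups)))
    rng.shuffle(indices)
    n_test = max(1, int(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train_groups = {groups[i] for i in train_idx}
    leaked = sum(1 for i in test_idx if groups[i] in train_groups)
    return leaked / len(test_idx)

# Hypothetical corpus: 10 attack primitives, 50 near-identical bundles each.
groups = [f"primitive_{p}" for p in range(10) for _ in range(50)]
fraction = leaked_fraction(groups)
print(f"test bundles with a sibling in training: {fraction:.0%}")
```

With fifty siblings per primitive, a test bundle avoids the leak only if all fifty land in the test set together, which a random split will essentially never produce.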

The fix is standard, and when your differentiation is methodological honesty it is non-negotiable: hold out the natural unit of generalisation, which for a cross-chain claim is the whole chain, not a random slice. So we went looking for the leave-one-chain-out number to report instead.
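A leave-one-chain-out split can be sketched in a few lines. This is a minimal illustration, assuming every capture bundle carries a chain label; the function name and the `chain_a`/`chain_b`/`chain_c` labels are hypothetical.

```python
from collections import defaultdict

def leave_one_chain_out(chains):
    """Yield (held_out_chain, train_indices, test_indices), one fold per chain.
    Every bundle from the held-out chain goes to test; none stay in training."""
    by_chain = defaultdict(list)
    for idx, chain in enumerate(chains):
        by_chain[chain].append(idx)
    for held_out in sorted(by_chain):
        test_idx = by_chain[held_out]
        train_idx = [i for c, idxs in by_chain.items() if c != held_out for i in idxs]
        yield held_out, sorted(train_idx), test_idx

# Hypothetical chain label for each bundle in the corpus.
chains = ["chain_a", "chain_a", "chain_b", "chain_c", "chain_b"]
folds = list(leave_one_chain_out(chains))
for held_out, train_idx, test_idx in folds:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

The property that matters is that no chain ever appears on both sides of a fold, so a score on the test side cannot come from recognising the chain.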

The 0.32

We had one to hand. An earlier evaluation with an entire chain held out had produced a ROC AUC of 0.32, worse than a coin flip. It fit the story perfectly: the model does not transfer across chains, it memorised one chain's fingerprint. We wrote that up and nearly published it.
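For readers less used to the metric: ROC AUC is the probability that a randomly chosen attack sample scores higher than a randomly chosen benign one, so 0.5 is a coin flip and anything below it means the ranking is inverted. A minimal pairwise sketch, with made-up labels and scores:

```python
def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative.
    Ties count as half a win. Quadratic in sample count; fine for a sketch."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical scores in which benign traffic tends to score higher than attacks.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.2, 0.4, 0.7, 0.3, 0.6, 0.9]
auc = roc_auc(labels, scores)
flipped = roc_auc(labels, [-s for s in scores])
print(round(auc, 3), round(flipped, 3))
```

Negating the scores turns an AUC of a into 1 - a, which is why a figure well below 0.5 deserves a second look before it is read as "no signal".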

Then we did the thing we tell everyone else to do and ran the experiment cleanly, on the corpus we already had, before putting it in front of anyone. Train on one chain, test on a chain the model has never seen:

Question | Held-out-chain score
Is this traffic an attack? (binary, full features) | ROC AUC 0.95
Which family of attack is it? (multiclass, full features) | macro-F1 0.17

The 0.32 was real but not representative. It came from a deliberately stripped-down feature subset, thirteen features chosen to be invariant to traffic rate and encryption, and that subset carries no cross-chain signal. The full feature space had never been tested under chain holdout at all. It had only been tested holding out one attack at a time, which leaks chain identity through every other attack on the same chain. Tested properly, binary detection transferred across chains at 0.95.
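The difference between the two holdouts can be checked mechanically. The sketch below assumes each record carries a chain and an attack label (the field names and values are hypothetical) and reports which chains appear on both sides of a split.

```python
def chains_on_both_sides(records, is_test):
    """Return the set of chains present in both the train and test partitions."""
    test_chains = {r["chain"] for r in records if is_test(r)}
    train_chains = {r["chain"] for r in records if not is_test(r)}
    return test_chains & train_chains

# Hypothetical corpus records.
records = [
    {"chain": "chain_a", "attack": "state_sync_flood"},
    {"chain": "chain_a", "attack": "rate_limiter_bypass"},
    {"chain": "chain_b", "attack": "state_sync_flood"},
    {"chain": "chain_b", "attack": "response_amplification"},
]

# Holding out one attack: its chain is still in training via the other attacks.
attack_holdout = chains_on_both_sides(
    records, lambda r: r["attack"] == "rate_limiter_bypass")

# Holding out one chain: nothing from that chain is left in training.
chain_holdout = chains_on_both_sides(records, lambda r: r["chain"] == "chain_a")

print(attack_holdout, chain_holdout)
```

A non-empty result means the model has already seen the test chain, whatever else was held out.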

We had reached for the most dramatic honest number rather than the most representative one. That is a subtler failure than headlining a leaked metric, and a more tempting one, because 0.32 felt like rigour.

What was actually true

The clean run did more than correct us. It found the real gap, which neither headline had named.

Detection generalises. Host CPU and memory pressure, packet rates, response-amplification ratios: the shape of something being wrong is genuinely chain-agnostic, and a detector trained on one chain flags attacks on an unseen chain at 0.95.

Attribution does not. With a whole chain held out, the model's ability to name the family of attack collapses to macro-F1 0.17 across seven families, barely above the random floor of 0.14. It knows the traffic is hostile. It cannot tell you whether it is a memory exhaustion or a rate-limiter bypass on a chain it has not seen.
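The random floor can be sanity-checked by simulation. This sketch assumes seven equally sized families and a guesser that picks uniformly at random; under those assumptions macro-F1 lands near 1/7, about 0.14. The sample sizes and the seed are arbitrary.

```python
import random

def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

classes = list(range(7))                       # seven attack families
rng = random.Random(0)
y_true = [c for c in classes for _ in range(2000)]   # balanced classes (assumption)
y_pred = [rng.choice(classes) for _ in y_true]       # uniform random guesses
floor = macro_f1(y_true, y_pred, classes)
print(f"simulated floor: {floor:.3f}, 1/7 = {1/7:.3f}")
```

If the real families are imbalanced, the floor shifts, so the comparison should be rerun with the actual class frequencies.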

That distinction is the engineering roadmap, and we could not see it from either headline. The binary detector is shippable now. The family classifier needs a specific kind of work: examples of the same family across many different chains, so the model can learn what a rate-limiter bypass looks like in general rather than what it looks like on one chain. The lever is data of a particular shape, not a fancier model and not better features. We checked the feature hypothesis directly: dropping the chain-specific columns did not change the result at all, because the model was already ignoring them. The wrong number had us about to spend a week on the wrong fix.

The two lessons

The first is the familiar one. Do not headline a number that leaked; hold out the real unit of generalisation.

The second is the one we actually learned this week: your honest number can be honestly wrong. Picking the most alarming defensible figure is a bias of its own, dressed as rigour. The defence is to run the clean experiment yourself, on data you already have, before the number goes in a document, especially when the number flatters your sense of intellectual honesty. We were one edit away from publishing a true sentence, that cross-chain ROC AUC was 0.32, which would have left every reader with a false belief, that the detector does not transfer. What caught it was not more scepticism. It was thirty lines of evaluation code and an afternoon.

For anyone building a cross-domain detector

Three questions, in this order. Is your evaluation split leaking? If near-duplicates of your test items sit in training, your headline is memorisation, so hold out the whole domain. Is the honest number you reached for the representative one? A real figure from a narrow regime can mislead worse than an inflated one from a broad regime. Have you split detect from classify? They generalise differently, often by a lot, and reporting them fused hides exactly the gap you most need to find.
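The third question is cheap to act on. The sketch below scores detection and attribution separately from one set of predictions. It uses plain accuracy for brevity, not the ROC AUC and macro-F1 reported above, and the labels and the `split_scores` helper are hypothetical.

```python
def split_scores(y_true, y_pred, benign="benign"):
    """Score detection (attack vs benign) and attribution (which family) separately."""
    # Detection: did the model agree on attack-vs-benign, whatever family it named?
    detect_hits = sum((t != benign) == (p != benign) for t, p in zip(y_true, y_pred))
    # Attribution: among true attacks only, did it name the right family?
    attacks = [(t, p) for t, p in zip(y_true, y_pred) if t != benign]
    family_hits = sum(t == p for t, p in attacks)
    return {
        "detect_accuracy": detect_hits / len(y_true),
        "family_accuracy_on_attacks": family_hits / len(attacks) if attacks else float("nan"),
    }

# Hypothetical held-out-chain predictions: hostile traffic is flagged,
# but the family is usually wrong.
y_true = ["benign", "flood", "bypass", "amplification", "flood", "benign"]
y_pred = ["benign", "bypass", "flood", "flood", "flood", "benign"]
scores = split_scores(y_true, y_pred)
print(scores)
```

A single fused multiclass score over these predictions would sit somewhere between the two numbers and describe neither.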

The discipline behind all three is the same: pre-register the evaluation, run it before you write the conclusion, and be as suspicious of the number that confirms your humility as of the one that confirms your hype.

security-research · methodology · machine-learning · detection · evaluation
