Open data for blockchain validator security: the first multi-modal dataset for infrastructure attacks
There is no MITRE ATT&CK for blockchain validators.
There's MITRE ATT&CK for enterprise IT. There's MITRE ATLAS for ML systems. There's SWC and the OWASP Smart Contract Top 10 for application-layer blockchain bugs. There is nothing for the infrastructure layer where validators actually run: the RPC surface, the gossip layer, the consensus daemon, the indexer. No shared vocabulary, no shared taxonomy, no shared data.
This is not because the attacks don't exist. It's because the apparatus for naming them doesn't.
I noticed this the slow way. Independent security research on Sui turned up real architectural vulnerabilities. Disclosed properly. The response was an institutional shrug. Not malice. There's just no framework for receiving infrastructure-layer security research from outside the project's own engineers. Bug bounty programmes have carve-outs for "RPC DoS" that close out architectural findings, because the carve-out can't distinguish "you flooded the endpoint" from "you found that simulateTransaction wedges the entire async runtime." Both look like DoS to a triage form.
So we built the apparatus.
Key facts
- What: nr-bundles-public v0.1, the first open multi-modal dataset for blockchain validator infrastructure attacks.
- Where: huggingface.co/datasets/NullRabbit/nr-bundles-public.
- How much: 31 schema-pinned bundles, attack and benign workloads.
- Which chains: Sui and Solana, with Cosmos and CometBFT next.
- Which families: 7 of 10 attack families populated, including response_amp, compute_amp, and four others.
- Licence: CC-BY-4.0 on the dataset, MIT on the bundle spec.
- Format: open (nr-bundle-spec). Taxonomy: open. Corpus: proprietary.
What just shipped
Today we published nr-bundles-public on Hugging Face. Thirty-one multi-modal observations of validator infrastructure under attack and benign workloads. Two chains, Sui and Solana. Seven vulnerability families. CC-BY-4.0.
Each entry is a bundle: a single observation of a validator running one workload, captured across five modalities simultaneously. Network packets, host telemetry, application metrics, consensus signals, RPC responses. Schema-pinned, provenance-hashed, replayable. The format is open and lives at nr-bundle-spec under MIT. Anyone can produce bundles for any chain.
The families are mechanism-defined, not chain-named. response_amp describes an asymmetric output:input byte ratio, whether you're looking at Solana, Sui, Cosmos, or anything else with an RPC surface. compute_amp describes an asymmetric work:request CPU ratio across the same. That's the bit that makes cross-chain ML actually tractable: a model trained on Sui bundles for family X generalises to Solana bundles for the same family, because the format and the mechanism are the same.
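To make "mechanism-defined" concrete, here is a minimal sketch of what labelling by ratio rather than by chain looks like. The field names and thresholds below are illustrative assumptions, not the actual nr-bundle-spec schema or our production labeller:

```python
# Illustrative only: field names and thresholds are hypothetical,
# not the nr-bundle-spec schema or the production labeller.

def amplification_ratios(rpc_obs: dict, host_obs: dict) -> dict:
    """Compute the two ratios behind response_amp and compute_amp.

    rpc_obs  -- per-request byte counts from the RPC modality
    host_obs -- CPU time attributed to serving those requests
    """
    bytes_in = sum(r["request_bytes"] for r in rpc_obs["requests"])
    bytes_out = sum(r["response_bytes"] for r in rpc_obs["requests"])
    cpu_seconds = host_obs["cpu_seconds_validator_process"]
    n_requests = len(rpc_obs["requests"])

    return {
        # response_amp: asymmetric output:input byte ratio
        "byte_ratio": bytes_out / max(bytes_in, 1),
        # compute_amp: asymmetric work:request CPU ratio
        "cpu_per_request": cpu_seconds / max(n_requests, 1),
    }

def label_family(ratios: dict, byte_threshold=100.0, cpu_threshold=0.5) -> str:
    # Chain-agnostic: the same ratios apply to Sui, Solana, or Cosmos,
    # because the mechanism (the asymmetry) is what gets named, not the chain.
    if ratios["byte_ratio"] > byte_threshold:
        return "response_amp"
    if ratios["cpu_per_request"] > cpu_threshold:
        return "compute_amp"
    return "benign_or_other"
```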
The taxonomy has ten families. Seven are populated in this release. The other three are queued.
What is a bundle?
A bundle is a single, time-bounded observation of a validator running one workload, captured across five modalities simultaneously and pinned to a schema. The five modalities:
- Network packets at the wire.
- Host telemetry (CPU, memory, scheduler latency, syscalls).
- Application metrics (the validator process itself).
- Consensus signals (rounds, votes, forks, finality).
- RPC responses (request, response, byte counts, error codes).
Every bundle is hash-provenanced and replayable. That's what makes cross-modal ML possible: the model can attend to packet shape, CPU saturation, and consensus drift inside the same labelled observation, then learn that a particular attack family always looks like a particular joint signature.
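As a sketch of what "hash-provenanced and replayable" means in practice, here is a minimal verification pass over a hypothetical bundle manifest. The layout below is an assumption for illustration; the authoritative schema lives in nr-bundle-spec:

```python
# Illustrative sketch only: the manifest layout is hypothetical.
# The authoritative schema lives in the nr-bundle-spec repository.
import hashlib
import json
from pathlib import Path

def verify_bundle(bundle_dir: Path) -> bool:
    """Check each modality capture against the hash recorded in the manifest."""
    manifest = json.loads((bundle_dir / "manifest.json").read_text())

    # One capture file per modality: packets, host, app metrics, consensus, rpc.
    for modality, entry in manifest["modalities"].items():
        data = (bundle_dir / entry["file"]).read_bytes()
        digest = hashlib.sha256(data).hexdigest()
        if digest != entry["sha256"]:
            print(f"{modality}: provenance mismatch")
            return False
    return True
```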
Why Hugging Face
A reasonable question. Hugging Face is mostly LLMs and image generators and people fine-tuning Mistral derivatives. Security datasets on it are either application-layer smart contract bugs or generic enterprise traffic captures.
Two answers.
The boring one: GitHub is storage. Hugging Face is a discovery interface. A dataset on GitHub is invisible to researchers not already looking for it. A dataset on Hugging Face surfaces in tag-specific feeds, dataset search, citation graphs. The model, when it goes up, gets its own page, its own viewer, its own "datasets this model trains on" backlink. Researchers find the work without us pushing.
The interesting one: Hugging Face is where the people who train models against unusual data find each other. The conversation about how to train robust detectors against multi-modal, schema-pinned, provenance-tracked corpora is going to happen somewhere, and it's not going to happen on Twitter. It's going to happen in dataset community tabs and model discussion threads. We want to be in that room.
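That discovery story also has a practical side: for a researcher, pulling the bundles into a notebook should be a single call. The snippet below assumes the standard datasets-library layout and a "train" split; check the dataset card for the actual structure:

```python
# The dataset name is real; the split and column layout are assumptions.
# Consult the dataset card on Hugging Face for the actual structure.
from datasets import load_dataset

ds = load_dataset("NullRabbit/nr-bundles-public", split="train")
print(ds)             # schema and row count as exposed by the datasets library
print(ds[0].keys())   # inspect whatever fields a single bundle exposes
```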
What this means for validator operators
If you operate a validator on any chain, you currently buy security in one of three ways. You pay a firm to audit your smart contracts. You run intrusion detection on your perimeter. You hope nothing exotic happens at the protocol layer.
The third one is doing a lot of work in that sentence.
Validator infrastructure has an attack surface that perimeter IDS doesn't see and smart contract audits don't cover. The JSON-RPC endpoint exposes architectural footguns. The gossip layer has amplification primitives. The consensus daemon has computational asymmetries that survive every form of rate limiting because the asymmetry is in the ratio, not the rate.
Detecting these attacks requires data that doesn't exist yet at public scale. We are publishing the first piece of that data today. The detectors that ship next week and the autonomous defensive systems that ship after that all anchor on this data. Open format, open taxonomy, closed corpus. Hugging Face for security infrastructure ML.
What this means for the world (and the part where I refrain from getting carried away)
The horizon here is bigger than blockchain. The bundle format is chain-agnostic by design, but the design pattern is infrastructure-agnostic by intent. A central bank running a digital currency on validator-style nodes has the same shape of attack surface. A national-grid SCADA system speaking a custom protocol over IP has the same shape of attack surface. The governance layer, the question of when machines should be authorised to act on defensive decisions without a human in the loop, has been published separately as the earned autonomy framework.
The category we're defining has a name: autonomous defence for decentralised networks. Today it's blockchain validators. In eighteen months it could be a lot more.
I am aware this sounds like the kind of thing every founder says when they hit a milestone. I'd dismiss it too. But the thing about defining a category that didn't exist before is that nobody's particularly inclined to take your word for it; the work has to do the work. That's why the format is published, the taxonomy is published, the dataset is published. The corpus we keep proprietary. The methodology we keep visible. The discipline we keep boring.
How does this compare to other security datasets?
| Dataset class | Layer | Multi-modal | Chain-agnostic | Provenance-pinned |
|---|---|---|---|---|
| OWASP / SWC | Application (smart contract) | No | No | No |
| Enterprise PCAP corpora | Network only | No | N/A | Partial |
| MITRE ATLAS | ML systems | No | N/A | No |
| nr-bundles-public | Validator infrastructure | Yes (5 modalities) | Yes (open spec) | Yes (hash-provenanced) |
No prior public dataset covers the validator infrastructure layer with joint capture across packets, host, application, consensus, and RPC. That gap is what nr-bundles-public fills.
What's next
Two models follow this dataset within the next fortnight: the V8 cipher-agnostic detector, which demonstrates our pre-registration methodology on the byte-amplification family, and the multi-class softmax detector that backs an interactive Space. Upload a bundle, see what the model thinks of it. Both ship with model cards that say what the models can do and, more importantly, what they demonstrably cannot do.
Cosmos and CometBFT enter the corpus next. The substrate paper that documents the methodology, the four-layer falsifiability framework, and the iterative leak-surface peeling pattern is in preparation. Coordinated disclosures continue on their own track.
If you're a researcher working on adjacent problems, the dataset is open and the format is open. If you're an operator wondering whether infrastructure-layer security is something you should be thinking about, the disclosures we've published this year suggest yes. If you're a journalist or analyst writing about this space, get in touch. There is more here than fits in a blog post.
We're at nullrabbit.ai. The dataset is at huggingface.co/datasets/NullRabbit/nr-bundles-public. The bundle spec is at github.com/NullRabbitLabs/nr-bundle-spec. The earned autonomy paper is at doi.org/10.5281/zenodo.18406828.
Most attacks have names. The ones nobody had a name for now have data.
- Simon
