Open data for blockchain validator security: the first multi-modal dataset for infrastructure attacks
There is no MITRE ATT&CK for blockchain validators.
There's MITRE ATT&CK for enterprise IT. There's MITRE ATLAS for ML systems. There's SWC and the OWASP Smart Contract Top 10 for application-layer blockchain bugs. There is nothing for the infrastructure layer where validators actually run: the RPC surface, the gossip layer, the consensus daemon, the indexer. No shared vocabulary, no shared taxonomy, no shared data.
This is not because the attacks don't exist. It's because the apparatus for naming them doesn't.
I noticed this the slow way. Independent security research on Sui turned up real architectural vulnerabilities. Disclosed properly. The response was an institutional shrug. Not malice. There's just no framework for receiving infrastructure-layer security research from outside the project's own engineers. Bug bounty programmes have carve-outs for "RPC DoS" that close out architectural findings, because the carve-out can't distinguish "you flooded the endpoint" from "you found that simulateTransaction wedges the entire async runtime." Both look like DoS to a triage form.
So we built the apparatus.
Key facts
- What: nr-bundles-public v0.1, the first open multi-modal dataset for blockchain validator infrastructure attacks.
- Where: huggingface.co/datasets/NullRabbit/nr-bundles-public.
- How much: 31 schema-pinned bundles, attack and benign workloads.
- Which chains: Sui and Solana, with Cosmos and CometBFT next.
- Which families: 7 of 10 attack families populated, including response_amp, compute_amp, and four others.
- Licence: CC-BY-4.0 on the dataset, MIT on the bundle spec.
- Format: open (nr-bundle-spec). Taxonomy: open. Corpus: proprietary.
What just shipped
Today we published nr-bundles-public on Hugging Face. Thirty-one multi-modal observations of validator infrastructure under attack and benign workloads. Two chains, Sui and Solana. Seven vulnerability families. CC-BY-4.0.
Each entry is a bundle: a single observation of a validator running one workload, captured across five modalities simultaneously. Network packets, host telemetry, application metrics, consensus signals, RPC responses. Schema-pinned, provenance-hashed, replayable. The format is open and lives at nr-bundle-spec under MIT. Anyone can produce bundles for any chain.
The families are mechanism-defined, not chain-named. response_amp describes an asymmetric output:input byte ratio, whether you're looking at Solana, Sui, Cosmos, or anything else with an RPC surface. compute_amp describes an asymmetric work:request CPU ratio across the same. That's the bit that makes cross-chain ML actually tractable: a model trained on Sui bundles for family X generalises to Solana bundles for the same family, because the format and the mechanism are the same.
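To make "mechanism-defined" concrete, here is a minimal sketch of what labelling by ratio rather than by chain looks like. The field names and thresholds below are illustrative assumptions, not the actual nr-bundle-spec schema or our production labeller:

```python
# Illustrative only: field names and thresholds are hypothetical,
# not the nr-bundle-spec schema or the production labeller.

def amplification_ratios(rpc_obs: dict, host_obs: dict) -> dict:
    """Compute the two ratios behind response_amp and compute_amp.

    rpc_obs  -- per-request byte counts from the RPC modality
    host_obs -- CPU time attributed to serving those requests
    """
    bytes_in = sum(r["request_bytes"] for r in rpc_obs["requests"])
    bytes_out = sum(r["response_bytes"] for r in rpc_obs["requests"])
    cpu_seconds = host_obs["cpu_seconds_validator_process"]
    n_requests = len(rpc_obs["requests"])

    return {
        # response_amp: asymmetric output:input byte ratio
        "byte_ratio": bytes_out / max(bytes_in, 1),
        # compute_amp: asymmetric work:request CPU ratio
        "cpu_per_request": cpu_seconds / max(n_requests, 1),
    }

def label_family(ratios: dict, byte_threshold=100.0, cpu_threshold=0.5) -> str:
    # Chain-agnostic: the same ratios apply to Sui, Solana, or Cosmos,
    # because the mechanism (the asymmetry) is what gets named, not the chain.
    if ratios["byte_ratio"] > byte_threshold:
        return "response_amp"
    if ratios["cpu_per_request"] > cpu_threshold:
        return "compute_amp"
    return "benign_or_other"
```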
The taxonomy has ten families. Seven are populated in this release. The other three are queued.
What is a bundle?
A bundle is a single, time-bounded observation of a validator running one workload, captured across five modalities simultaneously and pinned to a schema. The five modalities:
- Network packets at the wire.
- Host telemetry (CPU, memory, scheduler latency, syscalls).
- Application metrics (the validator process itself).
- Consensus signals (rounds, votes, forks, finality).
- RPC responses (request, response, byte counts, error codes).
Every bundle is hash-provenanced and replayable. That's what makes cross-modal ML possible: the model can attend to packet shape, CPU saturation, and consensus drift inside the same labelled observation, then learn that a particular attack family always looks like a particular joint signature.
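As a sketch of what "hash-provenanced and replayable" means in practice, here is a minimal verification pass over a hypothetical bundle manifest. The layout below is an assumption for illustration; the authoritative schema lives in nr-bundle-spec:

```python
# Illustrative sketch only: the manifest layout is hypothetical.
# The authoritative schema lives in the nr-bundle-spec repository.
import hashlib
import json
from pathlib import Path

def verify_bundle(bundle_dir: Path) -> bool:
    """Check each modality capture against the hash recorded in the manifest."""
    manifest = json.loads((bundle_dir / "manifest.json").read_text())

    # One capture file per modality: packets, host, app metrics, consensus, rpc.
    for modality, entry in manifest["modalities"].items():
        data = (bundle_dir / entry["file"]).read_bytes()
        digest = hashlib.sha256(data).hexdigest()
        if digest != entry["sha256"]:
            print(f"{modality}: provenance mismatch")
            return False
    return True
```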
Why Hugging Face
A reasonable question. Hugging Face is mostly LLMs and image generators and people fine-tuning Mistral derivatives. Security datasets on it are either application-layer smart contract bugs or generic enterprise traffic captures.
Two answers.
The boring one: GitHub is storage. Hugging Face is a discovery interface. A dataset on GitHub is invisible to researchers not already looking for it. A dataset on Hugging Face surfaces in tag-specific feeds, dataset search, citation graphs. The model, when it goes up, gets its own page, its own viewer, its own "datasets this model trains on" backlink. Researchers find the work without us pushing.
The interesting one: Hugging Face is where the people who train models against unusual data find each other. The conversation about how to train robust detectors against multi-modal, schema-pinned, provenance-tracked corpora is going to happen somewhere, and it's not going to happen on Twitter. It's going to happen in dataset community tabs and model discussion threads. We want to be in that room.
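That discovery story also has a practical side: for a researcher, pulling the bundles into a notebook should be a single call. The snippet below assumes the standard datasets-library layout and a "train" split; check the dataset card for the actual structure:

```python
# The dataset name is real; the split and column layout are assumptions.
# Consult the dataset card on Hugging Face for the actual structure.
from datasets import load_dataset

ds = load_dataset("NullRabbit/nr-bundles-public", split="train")
print(ds)             # schema and row count as exposed by the datasets library
print(ds[0].keys())   # inspect whatever fields a single bundle exposes
```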
What this means for validator operators
If you operate a validator on any chain, you currently buy security in one of three ways. You pay a firm to audit your smart contracts. You run intrusion detection on your perimeter. You hope nothing exotic happens at the protocol layer.
The third one is doing a lot of work in that sentence.
Validator infrastructure has an attack surface that perimeter IDS doesn't see and smart contract audits don't cover. The JSON-RPC endpoint exposes architectural footguns. The gossip layer has amplification primitives. The consensus daemon has computational asymmetries that survive every form of rate limiting because the asymmetry is in the ratio, not the rate.
Detecting these attacks requires data that doesn't exist yet at public scale. We are publishing the first piece of that data today. The detectors that ship next week and the autonomous defensive systems that ship after that all anchor on this data. Open format, open taxonomy, closed corpus. Hugging Face for security infrastructure ML.
What this means for the world (and the part where I refrain from getting carried away)
The horizon here is bigger than blockchain. The bundle format is chain-agnostic by design, but the design pattern is infrastructure-agnostic by intent. A central bank running a digital currency on validator-style nodes has the same shape of attack surface. A national-grid SCADA system speaking a custom protocol over IP has the same shape of attack surface. The governance layer, the question of when machines should be authorised to act on defensive decisions without a human in the loop, has been published separately as the earned autonomy framework.
The category we're defining has a name: autonomous defence for decentralised networks. Today it's blockchain validators. In eighteen months it could be a lot more.
I am aware this sounds like the kind of thing every founder says when they hit a milestone. I'd dismiss it too. But the thing about defining a category that didn't exist before is that nobody's particularly inclined to take your word for it; the work has to do the work. That's why the format is published, the taxonomy is published, the dataset is published. The corpus we keep proprietary. The methodology we keep visible. The discipline we keep boring.
How does this compare to other security datasets?
| Dataset class | Layer | Multi-modal | Chain-agnostic | Provenance-pinned |
|---|---|---|---|---|
| OWASP / SWC | Application (smart contract) | No | No | No |
| Enterprise PCAP corpora | Network only | No | N/A | Partial |
| MITRE ATLAS | ML systems | No | N/A | No |
| nr-bundles-public | Validator infrastructure | Yes (5 modalities) | Yes (open spec) | Yes (hash-provenanced) |
No prior public dataset covers the validator infrastructure layer with joint capture across packets, host, application, consensus, and RPC. That gap is what nr-bundles-public fills.
What's next
Two models follow this dataset within the next fortnight: the V8 cipher-agnostic detector, which demonstrates our pre-registration methodology on the byte-amplification family, and the multi-class softmax detector that backs an interactive Space. Upload a bundle, see what the model thinks of it. Both ship with model cards that say what the models can do and, more importantly, what they demonstrably cannot do.
Cosmos and CometBFT enter the corpus next. The substrate paper that documents the methodology, the four-layer falsifiability framework, and the iterative leak-surface peeling pattern is in preparation. Coordinated disclosures continue on their own track.
If you're a researcher working on adjacent problems, the dataset is open and the format is open. If you're an operator wondering whether infrastructure-layer security is something you should be thinking about, the disclosures we've published this year suggest yes. If you're a journalist or analyst writing about this space, get in touch. There is more here than fits in a blog post.
We're at nullrabbit.ai. The dataset is at huggingface.co/datasets/NullRabbit/nr-bundles-public. The bundle spec is at github.com/NullRabbitLabs/nr-bundle-spec. The earned autonomy paper is at doi.org/10.5281/zenodo.18406828.
Most attacks have names. The ones nobody had a name for now have data.
- Simon
