
Introducing Substrate: An Open Format for Validator Threat Intelligence

Simon · 7 min read

The substrate that doesn't exist yet

Threat intelligence sharing in blockchain validator infrastructure is broken in a specific, boring way: there is no agreed-upon format for capturing what an attack against a validator actually looks like.

Not the postmortem. Not the CVE summary. The thing itself. The packets. The host telemetry. The application metrics. The validator's response. Time-aligned, labelled, reproducible, shareable.

This format does not exist. There are adjacent formats - STIX for enterprise threat intel, MISP for IOC sharing, plain pcap for network forensics, the CVE registry for vulnerabilities - but none of them describe an attack against a validator in a way that another defender can pick up, replay, and learn from. The result is the current state of the field: informal Discord backchannels, ad-hoc disclosures in DMs, vendor reports that summarise incidents in prose, foundation security teams independently rediscovering the same primitives (or, honestly, not bothering).

This is the gap. We're filling it.

What's missing, more precisely

Threat intelligence has institutional hubs in the rest of the security world. Enterprise security has ISACs, MITRE, CISA. Cloud security has the CSPs themselves acting as central reporters. Web application security has bug bounty platforms with structured disclosure pipelines. None of these exist for validator infrastructure. There is no Project Zero for the systems-software layer of blockchain validators. There is no MITRE ATT&CK for validator abuse. There is no CVE-equivalent that captures the systems-software vulnerabilities specific to this domain - the gas-free RPC loops, the asymmetric query amplifications, the gossip-layer abuses, the consensus abuses that don't fit into any existing taxonomy because the taxonomy doesn't exist.

The audit firms are adjacent but wrong. They look at smart contracts, not infrastructure. The systems security firms are adjacent but wrong. They look at infrastructure, but not validator-specific infrastructure, and they miss the consensus-layer surface entirely. The chain foundations themselves have security teams, but each operates in isolation - Sui's security work doesn't inform Solana's, Cosmos doesn't inform Ethereum, and the institutional knowledge that should accumulate across the ecosystem instead evaporates after each incident. This is not decentralisation.

Anyone who's tried to share a validator-specific finding has run into this - there's nowhere structured to put it.

What we mean by substrate

Three things, layered.

A bundle format: a contract-validated multi-modal capture of a single attack instance. Pcap of the network exchange. Host telemetry from the targeted node. Application metrics from the validator process. The response payloads. A reserved slot for vector embeddings so the bundle can be indexed against learned representations later. Everything time-aligned. Every bundle labelled with the attack primitive and family it represents.

A taxonomy: ten vulnerability families covering the systems-software layer of validator infrastructure. Response amplification. Compute amplification. Memory amplification. Connection exhaustion. Consensus abuse. Gossip abuse. Auth bypass. Rate limiter bypass. Service misconfiguration. Reconnaissance. Family identifiers are chain-agnostic; primitive implementations within each family are chain-specific. A "response amplification" bundle for Sui and a "response amplification" bundle for Solana share a schema and family label but differ in the primitive that triggers them.
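Because family identifiers are chain-agnostic, the taxonomy reduces to a fixed enumeration. The identifier strings below are assumptions (the spec may spell them differently), but the ten families are the ones listed above:

```python
from enum import Enum

class Family(Enum):
    """The ten vulnerability families. String values are illustrative;
    the published spec may use different identifiers."""
    RESPONSE_AMPLIFICATION = "response_amplification"
    COMPUTE_AMPLIFICATION = "compute_amplification"
    MEMORY_AMPLIFICATION = "memory_amplification"
    CONNECTION_EXHAUSTION = "connection_exhaustion"
    CONSENSUS_ABUSE = "consensus_abuse"
    GOSSIP_ABUSE = "gossip_abuse"
    AUTH_BYPASS = "auth_bypass"
    RATE_LIMITER_BYPASS = "rate_limiter_bypass"
    SERVICE_MISCONFIGURATION = "service_misconfiguration"
    RECONNAISSANCE = "reconnaissance"
```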

A corpus: the bundles themselves, validated, labelled, organised. Corpus v1.0 just landed: 1,092 multi-modal bundles across 19 attack primitives in 9 of the 10 families, contract-validated, zero quarantined. The reconnaissance family is being populated now.

The format and taxonomy are open. The corpus is proprietary.

This is not an accident; it's the strategy.

Open the format, keep the data

The pattern is familiar: STIX is open and the threat intelligence built on top of it is not. Hugging Face's model card spec is open and the models are licensed individually. The OCI image format is open and the registries that host the images compete on data and operations.

For us, this maps directly. The bundle format will be specified publicly. The taxonomy will be specified publicly. Anyone building defensive systems for validator infrastructure can adopt both, contribute to both, build pipelines that consume bundles from anywhere. We want this. The format is more valuable the more it's adopted.

The corpus is something else. Building it has been expensive - adversarial research against live validator infrastructure across multiple chains, contract-validated capture across four telemetry modalities, hand-labelled primitives, schema iteration across a year of work. The model is the commodity. The labelled adversarial corpus is the asset. We will release reference subsets to seed adoption. We will not release the full corpus.

This is the same shape every serious security data company eventually settles into. The novel part is applying it to a domain that has no incumbent doing it.

The numbers

Corpus v1.0 contains 1,092 bundles. Across 19 attack primitives. Across 9 of 10 vulnerability families. Every bundle passes contract validation - schema-validated against a Pydantic contract at the moment of capture, with bundles that fail the contract quarantined rather than included. Zero bundles quarantined means every capture across every primitive met the schema; the schema has been stable across all 19 primitives, which is the kind of finding you only get by trying to break it, repeatedly, against a wide enough surface that drift would have shown up by now.
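The validate-or-quarantine step can be sketched as follows. The post says the real contract is a Pydantic schema; this stand-in uses a hypothetical set of required keys purely to show the admission logic - fail the contract and the bundle is quarantined, never silently included:

```python
def validate_or_quarantine(bundle: dict, corpus: list, quarantine: list) -> bool:
    """Admit a bundle to the corpus only if it satisfies the capture contract.

    The required keys below are illustrative stand-ins for the real
    (unpublished) Pydantic contract.
    """
    required = {"family", "primitive", "pcap", "host_telemetry",
                "app_metrics", "responses", "t0_ns"}
    missing = required - bundle.keys()
    if missing or not bundle.get("pcap"):
        quarantine.append(bundle)  # failed captures are kept aside, not dropped
        return False
    corpus.append(bundle)
    return True
```

"Zero quarantined" in corpus v1.0 means this branch was never taken: every capture across all 19 primitives satisfied the contract.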

The four-modality capture means each bundle includes pcap, host telemetry, application metrics, and validator response - time-aligned to a common clock. The reserved vector slot is for downstream embedding work; the bundle is useful as a reproduction artefact whether or not anyone trains on it.

The bundle is what an attack reproduction looks like in this domain. Training is one downstream use. Reproducible disclosure is the more fundamental one - and the one the field has been missing.

Connecting to earned autonomy

The bundle format and the corpus answer "what does an attack look like?" They don't answer "what should a defender do about it?" That's the question earned autonomy was written to address.

We published a paper earlier this year on earned autonomy: a governance framework for delegating defensive authority to machines based on demonstrated competence rather than vendor assertion. The argument is that autonomous defensive action is legitimate when the system has rehearsed on real traffic, produced a counterfactual record, and met an explicit threshold - and that the legitimacy is continuous, not granted once.

The paper described the governance side. The substrate is the data side that makes it operationally real.

Earned autonomy without a corpus is theory. The framework specifies that machines must demonstrate competence on actual attack traffic before being permitted to act, but if the only attack traffic available to a defender is whatever happens to hit their own infrastructure, the demonstration is bounded by what the adversary chooses to send. Most defenders never see the full surface. Competence cannot be earned against a surface that hasn't been observed.

A corpus without earned autonomy is data without a use. Bundles are interesting on their own - reproducible, shareable, taxonomically organised - but the framework that says what to do with them is what makes them defensively useful.

Together, they're a deployable system. The corpus is what a defender uses to demonstrate competence on the full surface, before being trusted on their own. The framework is what specifies when that demonstration is sufficient. The bundles are the evidence; earned autonomy is the rule that turns evidence into authority.

This is why we've been building both. They've always been one thing.

What's coming

A specification document for the bundle format and taxonomy is in draft. It will be published openly when it's ready. We're not committing to a date because the schema is still being exercised against new primitives and we'd rather ship a stable spec late than a fragile one on time.

A reference subset of the corpus will eventually live on Hugging Face and we'll publish reproduction notebooks alongside it.

The reconnaissance family is being populated now. When it lands, the corpus will cover all ten families. The framework will continue to be exercised against real findings as they come out of our adversarial research pipeline; some of those findings will be disclosed publicly, with bundles attached, on the timelines the disclosure process allows.

This is the work. The substrate is what we've been building all along; the post is the part where we stop building it quietly.

If you operate validator infrastructure, run security at a chain foundation, or work in the systems-software layer of blockchain networks, the format will be stronger with your input. Reach us at nullrabbit.ai.
