Poison Piggy - After Action Report (Thanks Pi)
Summary
- Charles Hoskinson discusses a community postmortem by Pi Lanningham, co-founder of Sundae Labs, titled "Poison Piggy After-Action Report."
- On November 21, 2025, Cardano experienced a 14-hour degradation of service due to a serialization bug that caused a unidirectional soft fork.
- The bug led to transaction delays of up to 400 seconds, with some transactions failing to be included in the dominant chain.
- The incident involved collaboration among founding entities and community projects to identify and resolve the issue without centralized intervention.
- Affected transactions were analyzed, and it was determined that roughly 3.3% of transactions did not make it into the dominant chain.
- The incident highlighted the importance of decentralized decision-making, as the community chose which fork to endorse organically.
- The root cause was traced to a bug that allowed overly long hashes to be accepted, which was exploited during the attack.
- Hoskinson emphasized the robustness of Cardano's architecture and the decentralized response to the incident, while also acknowledging areas for improvement in user experience and testing rigor.
- Recommendations for future improvements include better communication strategies for stake pool operators (SPOs) and the development of AI tools to assist with upgrades.
- The overall conclusion was that while the incident was serious, Cardano's infrastructure demonstrated resilience and the ability to recover effectively.
Full Transcript
Hi, this is Charles Hoskinson broadcasting live from warm, sunny Colorado. I wanted to make a video to go over something that came from the community. This is from Pi Lanningham. Pi is one of the founders of Sundae Labs. You may know him from SundaeSwap and all the work he does in Acropolis, Pragma, and other places.
He wrote something that’s so cool that I want to read it out verbatim and comment a little bit on it because it is a phenomenal postmortem. I think it accomplishes more than just being a postmortem; it establishes terminology that we can use again and again in the future. So, I’m going to share my screen. There’s a link to this. The title is "Poison Piggy After-Action Report."
" He just published it, and we’re going to read it out loud. If you don’t like that, well, go find someone else to do it. It’s my channel; I get to do whatever I want. **Introduction** On November 21st, 2025, Cardano suffered a severe degradation of service that lasted roughly 14 hours. This blog post serves as my independent record of what happened, my assessment of how serious it was, and how the network chain and community responded to it.
My goals with this post are to present the cold facts and my expert opinion in sufficiently approachable language that anyone should be able to form their own opinion. I hope I am successful in confronting the seriousness of the incident head-on, making no excuses simply because I’ve chosen to build on Cardano. On the other hand, I hope to effectively explain the nuance around this event, avoid catastrophizing, and highlight what went wrong. You, as the reader, will have to judge whether I did this effectively. There will likely be endless, exhausting, and pointless debate about what terms to use to describe this.
I think this kind of discussion is ultimately masturbatory and stems from people eager to see Cardano’s no-downtime claim dethroned and, on the other side, those clinging to the unbroken streak. What matters is impact. I will describe the real-world impact and the taxonomy I consistently use to classify these events. If, after that, you decide for yourself that Cardano "went down," I won’t begrudge you your opinion. I’m not precious about that label, and I think reasonable people can disagree.
What I would ask is that if you want to operate in good faith, you form this opinion on your own, not from some Twitter talking head who has pre-decided what this event means. Or, if you are one of those Twitter talking heads, at least do your audience the courtesy of defining your thresholds for whatever label you use and apply them consistently to the other similar events I’ll describe below. Hopefully, this kind of reasonable effort is not too much to ask from Twitter, but it probably is. **TL;DR** First, a serialization bug caused a unidirectional soft fork of Cardano, first on the preview testnet, then on mainnet. A fix was released shortly after the discovery on testnet, before the issue occurred on mainnet.
This fork seriously degraded the quality of the chain but not as severely as you might think. Transactions submitted via robust infrastructure were delayed by up to 400 seconds. This is very important. People in the Twitter sphere were running around saying Cardano stopped, that it was dead, and that you couldn’t do anything with Cardano that day. We were still able to relay transactions, but it could take up to 400 seconds, which is obviously not very timely.
The time between blocks on the now-dominant chain grew to a maximum of 16 minutes, which is very problematic too. Some infrastructure may have created longer delays due to unrelated fragility. Roughly 3.3% (479 out of 14,383) of transactions didn’t make it into the fork that is now dominant. Some systems, particularly bridges or exchanges, were exposed to replay/double-spend risk.
Analysis of the transactions on the discarded fork is still underway to determine if any of that risk materialized as actual damages. When I say it’s going to take a few weeks to clean it up, that’s basically what I mean: certain actors, bridges, exchanges, and DeFi applications were particularly vulnerable to a long-chain reorganization, and it’s going to take some time. This constitutes a serious degradation of service for users but is within expected bounds for high availability of service. The three founding entities, plus Intersect and countless community projects, including Sundae Labs, collaborated to identify and fix the issue and made a recommendation to the SPO community on which fork to choose and which versions to run. That’s very important too.
A lot of people will tell you that we had to manually reset the network and rebuild everything. We didn’t. We said, "Hey, SPOs, as an independent group, you should go with this fork." You have to make a decision, and that was a decentralized process. The SPO community organically endorsed that recommendation, upgrading their nodes and leading to a full organic recovery in roughly 14 hours.
No centralized authority had to restart the chain or intervene. These are the facts. **Timeline of Events** What happened here is, as best as I’ve been able to put together, a complete timeline of the relevant events on the chain. All times are in my time zone, EST. I was in London at the time, so I’m a little behind.
On November 20th, 2025, at approximately 15:39, someone submitted a transaction to the preview testnet, which was accepted by some nodes and rejected by others. This was likely accidental, though we don’t know for sure who or for what reason this transaction was submitted. This is really important because this is the threshold of responsible disclosure. If you break the testnet, the entire purpose of the testnet is to be tested. It’s meant to be broken and stress-tested.
If somebody breaks that, that’s a white hat; that’s what people are supposed to do. They’re supposed to find clever things. When they do, they get bug bounties and accolades, and everybody says, "Oh man, that’s a really bad one. We’ve got to fix that." Whether that person’s known or not is not particularly material.
That’s a good actor; that’s the white zone. The whole reason the testnet exists is to be a safe space to have a discussion about whether the network is reliable or not. At approximately 18:19, about three hours later, I was tagged by Homer J in the SPO testnets channel, where the issue was being triaged. Homer J is the attacker on the mainnet; we’ll get to that in a little bit.
You’ll notice that the testnet was broken. We already knew that something had happened, there was a fork on the testnet, and triaging was already underway because one of the blocks around that time contained SundaeSwap transactions. Why did Homer tag Pi? Homer was a scooper on SundaeSwap, so he had a lot of sophisticated technical knowledge and was watching all these things in real time. This would ultimately be a red herring, but it made me aware of the issue.
I checked SundaeSwap infrastructure and shared some of the relevant information, such as what versions we were running and which fork we were on. As a policy, we avoid upgrading our infrastructure until we are forced to via hard fork or a security vulnerability. So, I believe Sundae was actually running the old software, which is why there was a fork. People on the old software couldn’t talk to the people on the new software. Early diagnostics indicated Cardano DB Sync was crashing due to a serialization error of some kind.
We thought perhaps this was the root cause rather than a network partition or a chain fork issue since it would impact what we see on Chain Explorers nearly universally. As far as I know, all blockchain explorers are built off querying Cardano DB Sync. This meant we were flying blind as the tools we usually use to inspect the state of the chain were either stuck or following a very low-density fork of the chain. So that’s a resilience issue. Remember in the video I did earlier today, I talked about detection.
This is where we’re not really sure what’s going on, but we have suspicion because something’s not right. After comparing the tips from multiple chains, it became clear that there was indeed a fork in the chain, and it was correlated to node versions. Sundae Labs was running an older version, which followed what would ultimately be considered the healthy fork. I coordinated with Ashish from Cardanoscan to open our relays to his infrastructure to read from my fork, which restored our eyes. I began directly messaging with Sam Leathers, chair of the Cardano product committee.
He had been able to quickly code up a diagnostic tool that found the divergence between the chains. Side note: a big win for the Amaru team here, as the libraries we produce served as a foundation for this tool. That’s an example of where node diversity helps out. Using this tool, he was able to furnish the raw bytes of the transaction, and he also invited me to join the Google call between core IOG and CF engineers. At around the same time, Andrew Westberg and I identified the faulty hash in the transaction: a delegation certificate to a pool identifier that was twice as big as it should be.
Instead of delegating to the pool "easy1," it delegated to "easy1 easy1." Note that the actual delegation certificate references the pool by hash, but I used the ticker here to illustrate the point. The ledger team was able to quickly track down the bug, when it was introduced, and produce a fix. Everything up to this point, steps 1 through 8, happened in the need-to-know circles, meaning people who are highly sophisticated, people who know what’s going on, people who follow the testnet, people who build on Cardano, core developers of Cardano, and the core entities of Cardano: the Cardano Foundation, Intersect, Pragma, and Input Output. All this stuff is happening, and none of you would give a damn about any of this if that was all there was to the story, because we would just fix the testnet, patch mainnet, and move on.
It’s happened before. Then something very different happened. A fix was introduced, and this gentleman right here was made aware of that fix: Homer J. This is where it got problematic. I was made aware early in the process.
I was at Milos, a nice Greek restaurant in London. We were having an off-site with some workshops. I was talking to Aggelos and all these other people when Jerem leaned in, kind of like when Bush was informed of 9/11, and said, "Mr. President, we have an issue." I said, "What is it?"
" He said, "Well, the testnet has just forked." I said, "Oh god, do we know how to fix it?" They said, "Yeah, we already have a patch in the works and we’re working on it." So, I knew that we had a race condition during this period to get out that patch as quickly as possible. Now, you don’t go to Twitter and broadcast that because you’re just broadcasting that the mainnet is vulnerable.
You go through the normal disclosure regimes, responsible disclosure, and talk to all the SPOs and other people and say, "Look, we have an issue. Either downgrade or upgrade, but we need to get to a node version that’s not going to cause a fork." But at that moment, nothing was vulnerable. This is where a problem occurred. This is when the iceberg hit.
On November 21st, 2025, at approximately 3:02, a similar transaction was submitted to mainnet. This one delegated to "rats." "Rats" is Charles Hoskinson’s personal stake pool. What Homer J did, by his own admission, is follow all of this and be in direct communication with all the people. He knew that everybody was trying to issue a patch and had full understanding of what it did to the testnet, knowing full well that the same conditions were true on the mainnet.
Any reasonable person with six or seven years of experience in the Cardano ecosystem would know that this attack would cause a fork inside the system, and he also knew that a fix had already been propagated. An SPO’s duty is to get this fixed before someone leverages the attack. Instead, what he did was reverse engineer it, figure out how to make it work on the mainnet, maybe with AI, maybe not, and then made the decision to deploy that attack onto mainnet. Instead of choosing his own pool, for some reason, he decided to pick the pool of a person he doesn’t like very much, as evidenced by the fact that he was in the fake Fred Discord. This is my own exposition, not Pi’s, but I just want to point it out to provide some meta commentary for all of you who live in the "he’s innocent, didn’t know what he was doing" camp.
You’ll notice that steps 1 through 8 were treated as a high-severity incident, and we were in triage-and-fix mode. The war room had already started when we realized that there was a mainnet vulnerability, and we were all working together as one team to try to get this resolved as quickly as possible, which is why some people didn’t go to bed. They were working from the prior night all the way through to the end of the incident. I woke up around this time by pure coincidence. I was suffering from a fairly bad cold and was very restless that night.
I saw the initial messages reporting that mainnet was suffering from a similar issue. This is where Pi had the same experience as me. He thought, "Oh, fuck." I decided there wasn’t much I could contribute, as Sundae Labs infrastructure was already on the old version without the bug, but the following day would need people to pick up the mantle with renewed energy. So, I made the decision to try to get more sleep while others worked on the remediation.
Overnight, IOG, the CF, Intersect, many exchanges, and many SPOs all upgraded nodes to the patch version. This was the call fest. Everybody got together in the war room. We were calling people like crazy, saying, "Guys, guys, guys, you’ve got to upgrade. We’ve got to get this done.
" Collectively, the official recommendation was to choose the fork that was more restrictive. In theory, choosing the more relaxed chain would have resolved the fork faster since it had the majority of the stake, but it would have created two main challenges. First, a fix was already released while the impact was only on the testnet, and trying to change direction would have muddled the communications, creating confusion about what the actual recommendation should be. More importantly, it was judged that off-chain tooling, wallets, and exchanges would likely have been disrupted for much longer. While the chain itself might have recovered quickly, the ecosystem would have been impacted much longer.
Every single Daedalus user would have to manually reset their wallets. All of the off-chain infrastructure would not work. Your explorers wouldn’t show anything for days to weeks, and it would be very difficult for anything to work on Cardano. Effectively, it would have shut Cardano’s off-chain ecosystem off for a week, and everybody would have experienced a massive disruption. That’s an understatement of understatements.
It’s why we went with the old fork. The other chain was nicknamed the "pig chain" because it accepted the fat pool delegation, and the officially recommended chain was nicknamed the "chicken chain." So, you have a race between the pig and the chicken, and believe me, this pig is not a good pig. This is not Nike; this is the worst pig of all. The chicken, on the other hand, that’s a good chicken.
We’re trying to get back to the chicken. At around 4:00 a.m., existing infrastructure monitoring tools were successfully updated to begin collecting metrics on the pig chain so we could monitor the progress and the likely outcome. We were sitting in the war room, and we had a monitor for the pig chain and a monitor for the chicken chain.
We saw them racing against each other. In the background, several people began trying to identify who the culprit was, mostly to determine if we should expect further attempts to disrupt the chain’s recovery.
From 4:00 a.m. to 10:00 a.m., the chain was divergent, and the projection was that the faulty transaction on the pig chain would become fully immutable before the chicken chain overtook it. If this happened, all nodes following the pig chain would need manual intervention to recover the longer chain. They would need to be stopped, their databases truncated, and replayed up to the longest fork. We have this parameter called k in Cardano, and after 12 hours, give or take, things go from mutable to immutable, because we assume there will not be a long-chain reorganization deeper than that.
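To make the arithmetic concrete, here is a rough sketch of my own (not from the post), using mainnet’s published parameters: the security parameter k = 2160 blocks, 1-second slots, and an active slot coefficient of 0.05, i.e., one block roughly every 20 seconds.

```haskell
-- Back-of-the-envelope calculation of the rollback horizon described above.
securityParamK :: Double
securityParamK = 2160

secondsPerBlock :: Double
secondsPerBlock = 1 / 0.05  -- ~20 seconds between blocks at full density

-- Hours until a block is buried under k blocks, given the fraction of
-- normal chain density actually being achieved (1.0 = healthy chain).
hoursToImmutability :: Double -> Double
hoursToImmutability densityFraction =
  securityParamK * secondsPerBlock / densityFraction / 3600

main :: IO ()
main = do
  print (hoursToImmutability 1.0)  -- 12.0 hours on a healthy chain
  print (hoursToImmutability 0.5)  -- 24.0 hours when a fork halves density
```

Note the second case: this is the same self-regulation Pi praises later, where a fork lowers chain density and thereby stretches the horizon, buying time for recovery.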
To roll back past that point, you actually have to manually purge it using a script on your system. It’s not necessarily unrecoverable; we actually wrote that script in the war room, got it done, and pushed it through.
By 11:00 a.m., a beam of sunshine broke through the clouds. The estimated time for the poison transaction to reach immutability was now longer than the time it would take for the other fork to catch up. At this point, we knew the chain would very likely fully recover within 18 hours. That was about 7:00 p.m.
my time, I believe. I’d have to check, but it was late. We were all exhausted because a lot of people had been up late at night, and we were watching these things in real time. Pi actually has some graphs. At this point, we largely just monitored the chain, ensuring that we had full data dumps for analysis for after-action reports such as this one.
The time of recovery inched down as more and more SPOs made the upgrade, and at roughly 17:16, the chicken chain overtook the pig chain. Since this chain was still valid according to those following the pig chain, they switched to the chicken chain on their own, and the chain fully recovered. It’s over. It was hard, but it’s over. **Root Cause** In case you missed it, here’s an explicit restating of the root cause of the bug.
When parsing hashes such as pool identifiers, transaction hashes, and addresses, each type of hash has an expected length: 28 bytes, 32 bytes, etc. In theory, providing a hash that is too short or too long should be considered incorrect. Older versions of the node correctly rejected hashes that were too long. A commit made about a year ago, on November 24th, 2024, introduced an obscure code path that would accept hashes that were too long and just truncate them to the correct length, discarding the bytes that aren’t needed. That was the root cause of all of it: a very small, obscure bug. Then it became activated, and the attacker took advantage of it.
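To illustrate the class of bug, here is a minimal sketch of my own (not the actual node code; pool identifiers really are 28-byte blake2b-224 hashes), contrasting a strict decoder with a lenient one that truncates:

```haskell
import qualified Data.ByteString as BS

newtype PoolId = PoolId BS.ByteString deriving (Show, Eq)

-- Pool identifiers are 28-byte (blake2b-224) hashes.
expectedLen :: Int
expectedLen = 28

-- Correct behavior: reject anything that is not exactly 28 bytes.
parsePoolIdStrict :: BS.ByteString -> Either String PoolId
parsePoolIdStrict bs
  | BS.length bs == expectedLen = Right (PoolId bs)
  | otherwise = Left ("bad pool id length: " ++ show (BS.length bs))

-- Buggy behavior: silently truncate an overlong hash down to 28 bytes.
-- A doubled, 56-byte "easy1 easy1"-style payload is accepted here but
-- rejected by strict nodes, so the two camps disagree on block validity
-- and the chain forks.
parsePoolIdLenient :: BS.ByteString -> Either String PoolId
parsePoolIdLenient bs
  | BS.length bs >= expectedLen = Right (PoolId (BS.take expectedLen bs))
  | otherwise                   = Left "hash too short"
```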
**Analysis and Background Reading** I suggest reading my previous description of hard forks versus soft forks. In a hard fork, each side considers the other’s blocks invalid, which prevents the nodes from ever converging without manual intervention. This is what happened to Solana in its historic forks: situations occurred that created circumstances where the chains could not come back together. They were broken apart, and manual intervention was required to glue them together.
In the case of a soft fork, this actually occurs with Bitcoin, if you think about orphan blocks. So, miner A wins a block, and miner B has a block around the same time. You now have a race condition for the next block, and let’s say miner B wins it. The block that A found is now orphaned and gone. One block is a superset of the other.
These categories are sorted from what I consider the most serious to the least, though the ordering at the end is far more arguable and negotiable. First, you have sovereignty violation. Any bug or exploit that leads to remote code execution, forged signatures, or repossessed funds in the ledger is a violation of the core sovereignty and safety of the chain itself. The consequence is a complete loss of trust, likely irrecoverable. There are no known examples on major L1s, but Bitcoin came close in 2018 with CVE-2018-17144, which would have allowed duplicate inputs, and in 2012 with the duplicate coinbase bug, which had a theoretical signature forgery attack vector.
A future quantum attack on ECDSA would likely qualify and would impact many chains. Basically, I can steal your money, take your key, and do all kinds of crazy things. That’s as bad as it gets; your system dies. I didn’t mention this, but the bug that was in Zcash many years ago allowed for the counterfeiting of tokens, and you couldn’t even discover it because of the way zero-knowledge systems work. So, that was another example of a potential sovereignty violation bug, but it wasn’t clear if anybody ever utilized it; probably not.
Ledger violation refers to any bug that leads to a violation of the core ledger guarantees, for any length of time, such that any human would reasonably conclude it violates the intent of the chain. This was what I was talking about earlier in my video about the constitution and what the intent on-chain is. The consequence is a severe loss of trust, recoverable on a case-by-case basis, but it should involve extreme changes to engineering discipline to regain trust. For example, in August 2010, Bitcoin suffered a value overflow bug. This created 184.4 billion Bitcoin.
When they say there’s only ever going to be 21 million Bitcoin, well, in August 2010, there was at one point 184.4 billion Bitcoin, roughly 9,000 times the intended 21 million supply cap. No reasonable person could argue this was in line with the spirit of the protocol. But I thought code is law.
This required a soft fork where miners upgraded to a node version that rejected transactions with the overflow, and the chain recovered 19 hours after the faulty transaction. It actually took longer for Bitcoin to recover than it did for Cardano; we were 14 hours, and it was 19. The consequence of their issue was that there were 184 billion Bitcoin in circulation that shouldn’t have been, and that history was abandoned. It was orphaned out, with the old node version switching to the longest chain as it overtook the faulty one. Consensus violation hard fork refers to any bug that results in a central decision-making body mandating, or effectively mandating through outsized leverage, a violation of the chain consensus, usually via a node version that forces a rollback to before some bad event or a coordinated truncation of the chain.
In other words, you’re editing the chain. The consequence is excusable and recoverable early in the blockchain’s life cycle, but less excusable as time progresses. The mother of all examples here is the Ethereum DAO hack. That was a smart contract exploit that resulted in a special ledger rule to force a hard fork to a different consensus. The Ethereum Foundation and exchanges heavily influenced the miners to choose this fork.
They strong-armed them, basically saying, “You’ll get delisted, and your token will go to zero.” That’s when I got involved in the whole Ethereum Classic and Ethereum thing. This is basically manually editing the chain because you didn’t like the history. Large chain reorganization soft fork refers to any bug or network condition that results in an extended divergence of consensus on-chain but ultimately heals itself organically or through social consensus and has no explicit lasting impact on the node implementation. It exposes certain application types, such as bridges or exchanges, to financial risk through replay attacks or from being on the wrong fork.
For example, a bonding scheme could be exposed this way. The consequence is recoverable, with heavy learnings from the incident, provided it is extremely rare. Polygon experienced several decently sized reorganizations in 2022 and 2023. The incident described above also falls into this category. Then you have a smart contract exploit, which is a bug in a smart contract of a popular, widespread protocol that results in a large monetary loss to users. It’s mostly the fault of the smart contract author, but usually, some blame lies with the language constructs that made the bug difficult to avoid.
It is recoverable for the layer 1, with concern depending on the bug, and there are millions of those; something like $80 million a day. Full consensus halt refers to any bug or network condition that results in an extended, complete halt of consensus. No nodes can make progress without manual human intervention. It is recoverable with concern, and unacceptable if it happens repeatedly and regularly. Several examples come from Solana, the Binance Smart Chain bridge exploit, and the Avalanche gossip bug.
Basically, it just stops, and you have to start it again. You have to kick it and restart it. In some cases, you have to edit it to restart. That’s just unacceptable. You lose liveness, and it doesn’t come back on no matter how much you blow on the cartridge.
Degradation of service refers to any bug or network condition that results in a widespread degradation of quality of service for end users. The consequences are recoverable. Examples are the Ethereum Shanghai denial-of-service attack and the SundaeSwap launch hiccup on January 23rd. Given the above taxonomy, this incident was a large chain reorganization and soft fork that self-repaired. It was serious but not existential.
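For reference, Pi’s taxonomy condenses neatly into a type. The encoding below is my own summary of his categories, ordered from most to least severe as in the post:

```haskell
-- Pi's incident taxonomy as a Haskell type; names are mine, categories his.
data IncidentClass
  = SovereigntyViolation        -- RCE, forged signatures, repossessed funds
  | LedgerViolation             -- core ledger guarantees broken (BTC 2010 overflow)
  | ConsensusViolationHardFork  -- centrally mandated rollback (ETH DAO fork)
  | LargeReorgSoftFork          -- extended divergence that heals organically
  | SmartContractExploit        -- app-level bug with large monetary loss
  | FullConsensusHalt           -- chain stops; manual restart required
  | DegradationOfService        -- degraded quality of service for end users
  deriving (Show, Eq, Ord, Enum, Bounded)

-- The Poison Piggy incident, per the post's own classification:
poisonPiggy :: IncidentClass
poisonPiggy = LargeReorgSoftFork
```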
There will be plenty of people eager to pounce on Cardano’s misfortune, mainly because the community has a tendency to be a bit pretentious when it comes to our uptime. Fair enough. I won’t waste my breath in futile narrative wars with them. The primary concern outside of degradation of service for users was in systems that were unaware of the chain fork and opened themselves up to replay attacks across both chains. For example, suppose an exchange is on the wrong fork and allows users to deposit funds, sell them for another token, and withdraw them to another chain.
When the chain reorganizes, the user would have extracted value from the exchange. Analysis is still underway to determine if this happened, but given the rapid response from the exchanges and the small number of transactions on the pig chain and not the chicken chain, this currently seems unlikely to me. I think so too, but we’ll know in a few weeks. What matters in the end is the following: Did the chain continue to make progress? Yes.
Was service degraded? Yes. Were funds at risk potentially for a subset of people? Yes. Did the Cardano network recover under essentially the worst-case conditions?
Yes. Would I have confidence to build my business on top of infrastructure that exhibited this level of robustness? Absolutely. Now, let’s talk about the good and the bad. The good is that throughout this, Cardano’s engineering excellence and decentralization shine.
Many decisions that some thought were overly paranoid proved their worth. I personally learned a number of things about how Cardano operates and inoculates itself against forks like this, which impressed me. Number one, I’m extremely glad that the node was implemented in a language that takes memory safety extremely seriously. The particular kind of bug here, deserializing out of buffers with improper bounds checking, could have been very serious in any other language. It would have been very easy in languages like C and C++ for such a bug to enable arbitrary code execution.
It’s not hard to imagine someone extracting the public and private keys from many SPOs, gaining complete control of the chain and any funds accessible from those keys. In other words, if we wrote it in C, we’d have been dead—straight up dead. The chain would have broken. With Haskell, we actually survived that. Rust would have survived it too.
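A toy illustration of the point (mine, not the node’s code): in Haskell, reading past the end of a buffer is a catchable error rather than a silent read of adjacent process memory.

```haskell
import qualified Data.ByteString as BS
import Control.Exception (SomeException, evaluate, try)
import Data.Word (Word8)

main :: IO ()
main = do
  let buf = BS.replicate 28 0xAB  -- a 28-byte buffer, the size of a pool id
  -- In C, buf[100] would read whatever happens to live past the allocation,
  -- potentially leaking key material. Here it throws a catchable exception:
  result <- try (evaluate (BS.index buf 100)) :: IO (Either SomeException Word8)
  case result of
    Left err -> putStrLn ("out-of-bounds read rejected: " ++ show err)
    Right w  -> print w
```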
I’m extremely grateful that IOG’s emphasis on high-assurance engineering largely prevented much more serious bugs. The fact that one slipped through after eight years doesn’t nullify all the effort that went into eliminating these kinds of bugs before they surface. Remember, no piece of software is perfect, and we only see the planes that survive. So, eight years, one major bug—not bad. I’m extremely impressed with the reporting and monitoring infrastructure maintained by the founding entities.
Being able to track the independent forks, the versions that people were on, and the rates at which blocks were being produced, and to write tools on demand, was where Amaru came in handy. The Canary transaction infrastructure provided critical and actionable insights throughout the fork that allowed us to reach out to specific SPOs and shape our community strategy. I believe it can be better, but this is absolutely true; it was good. I’m also impressed with the elegance of Ouroboros. By the way, it was nice being with Aggelos, because we designed Ouroboros together, and for us to sit there and watch this protocol that was on paper in 2016 actually survive an attack that we imagined could occur was remarkable.
I always knew about the 2,160 rollback horizon, which equated to about 12 hours' worth of blocks. But the common wisdom had been to wait three times as long—36 hours—for true finality. That always confused me until this event. In the case of a partition, the chain density on both forks goes down. That means that each chain will take longer to reach a depth of 2,160 blocks.
This creates a natural self-regulation in times of stress, extending the length of the rollback horizon in proportion to the severity of the fork. That’s just mathematically elegant, isn’t it? Time works differently, and it always works in your favor when a crisis occurs. I’m also impressed by the design of the networking stack for a number of reasons. One, the hot-warm-cold peer system led to most nodes quickly finding peers that agreed with them.
As chain density on the pig fork dropped, they reverted to a safe mode. In safe mode, nodes are more cautious about which peers they trust, reverting to bootstrap peers, which are still Emurgo and the Cardano Foundation, and to local trusted roots. By the way, Bitcoin has bootstrap nodes and bootstrap peers too; take a look, they’re hardcoded. On Cardano, each SPO is able to specify their bootstrap peers and local trusted roots, putting them in control of the trusted backbone of the network. Since these nodes were more likely to have been upgraded in this case, it would stop the spread of pig chain blocks, lowering chain density further and allowing the Cardano network to heal and slow the pig chain down.
The independence of many protocols meant that even throughout the fork, transactions nearly always propagated widely to all peers. This meant that even while the fork was ongoing, most user transactions could be included in both forks, and service was largely uninterrupted. This emphasizes the importance of certain application types responding defensively to network partitions, chain forks, and lower density. There will likely be discussions in the coming days about how to expose these metrics and design patterns better to guard against double-spend risk.
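As a sketch of what such a defensive pattern could look like (my own illustration; txDepth and recentDensity are hypothetical queries, not real Cardano APIs): an exchange credits a deposit only once it is buried deep enough, and halts entirely when chain density looks abnormal.

```haskell
-- Hypothetical chain view an exchange's indexer might maintain.
data ChainView = ChainView
  { txDepth       :: String -> Maybe Int  -- blocks on top of a given tx id
  , recentDensity :: Double               -- observed blocks/slot (~0.05 normally)
  }

safeToCredit :: ChainView -> String -> Bool
safeToCredit chain txId =
  case txDepth chain txId of
    Nothing    -> False                     -- tx not on the chain we follow
    Just depth -> depth >= requiredDepth && not forkSuspected
  where
    requiredDepth = 30                      -- policy choice: ~10 minutes of blocks
    -- Circuit breaker: density well below the expected 0.05 blocks per slot
    -- suggests a partition or fork, so stop crediting deposits entirely.
    forkSuspected = recentDensity chain < 0.04
```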
I was impressed by the decentralization throughout our community, not just as a buzzword. Our diversity of node versions was paradoxically both the source of the problem and helpful in the recovery. This problem wouldn’t have occurred if all nodes were identical, but it’s impractical to expect total conformity in a distributed system of this scale. In the absence of that, the nodes still on the older, desired version helped to stop the spread of the pig chain and accelerate the recovery. In the future, diversity of implementations will have a similar effect. There’s a higher risk that there’s a difference in behavior, but a lower risk that a bug in any one implementation could take out the whole network or chain.
The communication infrastructure meant that thousands of SPOs were able to be notified through a variety of different channels: Discord, Slack, Telegram, email, Twitter. The war room was populated by members of many different organizations, including the founding entities and technical thought leaders like Andrew Westberg, who was a hero there. He did a lot of great work, and representatives from exchanges were involved. No one entity made a unilateral decision about which fork to endorse; many were involved in that discussion. There’s a lot more to be impressed by, but these were the standouts for me.
That being said, I promised I wouldn’t shy away from where Cardano fell short. Here are places where the incident highlighted areas for focus. Number one, anecdotally, many users had more frustrating experiences than the canary test would imply. I’d love to see the Cardano Foundation improve this and partner with ecosystem infrastructure like Blockfrost and wallets to track the typical user experience over time. We actually need a canary, and we can work on that.
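A canary in this sense is simple in outline. Here is a minimal sketch, with hypothetical submitCanaryTx and isOnChain functions standing in for real infrastructure:

```haskell
import Control.Concurrent (threadDelay)
import Data.Time.Clock (NominalDiffTime, diffUTCTime, getCurrentTime)

-- Stand-ins for real infrastructure; these are not actual Cardano APIs.
submitCanaryTx :: IO String         -- submit a trivial tx, return its id
submitCanaryTx = undefined
isOnChain :: String -> IO Bool      -- has the tx appeared in a block yet?
isOnChain = undefined

-- Submit one canary and measure end-to-end inclusion latency. Run this
-- periodically and alert when latency exceeds a threshold (say, 400 s).
measureInclusion :: IO NominalDiffTime
measureInclusion = do
  start <- getCurrentTime
  txId  <- submitCanaryTx
  let poll = do
        included <- isOnChain txId
        if included
          then (`diffUTCTime` start) <$> getCurrentTime
          else threadDelay 1000000 >> poll  -- poll once per second
  poll
```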
It seems like most or all of our chain explorers depend on Cardano DB Sync, which can blind us when a bug impacts a single component. I’d love to see more diversity in architecture from these tools. That’s going to happen in 2026 when all the other nodes come. The fact that the bug appeared at all is a failure of our testing rigor. This is one area where Cardano has historically shined.
But I suspect—and I’m purely speculating here—that the intense ramping up of pressure to deliver governance and other features over the last year and a half has pulled attention away from these efforts. That’s fair to speculate, and there’s some truth to that. But there’s also the reality that there’s a probability distribution: the longer you go on with software, the higher the probability that there will be at least one bug within a given window of time. So, one bug per eight years is there, but there’s a whole post-mortem that has to be done, and we have to fix it. New tools and procedures and policies have to be brought in.
Pressure was one area, but remember, this bug came before the governance pressure; it came in 2022. Sebastien Guillemot from dcSpark, now the CTO of Midnight, has the right idea regarding doing code generation and fuzzing from specifications rather than generating a specification from the implementation. You should never generate a specification from the implementation; you’re going the wrong direction. That’s something that Project Blueprint is trying to solve. This will be extra pertinent to two different projects Sundae is working on: Amaru and Acropolis.
We now have direct evidence of the importance of testing rigor, and we’ll need to be on high alert for very subtle divergences between implementations. Such divergences—this was one of the reasons why I was very bearish on prematurely creating alternative implementations—have a high chance of being hard forks rather than soft forks. Once that happens, it’s no bueno. It’s very hard to put the network back together because of the completely different stack of decisions. Here’s Sebastian with an A, who leads the great Cardano Blueprint initiative in an effort to document and specify what it means to be a Cardano node.
Before you create alternative nodes, you create the formal specifications, the testing strategy, and a certification program, along with your canary net to ensure you don’t have hard forks. As mentioned above, I’d like to see node-to-client mini-protocols that allow DApps, wallets, and exchanges to have better insight into the health of the chain network. They should have easy APIs and official design patterns around circuit breakers and finality windows to make their apps even more robust to such events. That’ll come. While the social consensus for which fork to choose was robust, I think it could have been better.
I imagine that many SPOs upgraded blindly, trusting the recommendations of the founding entities alone. Without a clear understanding of the implications, in some obscure hypothetical scenarios one could imagine this being used to fork the chain maliciously. I think we’d only benefit from improved education about how it works, what the implications are, and ensuring that as many SPOs as possible can make informed decisions on tight deadlines. One of the best things we can do is create an AI system that we give to every single stake pool operator that acts as an upgrade sentinel. If they get an upgrade request, the AI helps them reason about a series of questions and tests they should conduct before they blindly upgrade.
This will compensate for the fact that not every SPO is Aggelos or Duncan Coutts or Pi, with a detailed, intimate, core-developer-level understanding of how the protocol works. I think we can build that rather quickly. If anybody in the community wants to volunteer to do that, it would be a really fun experiment. I think it’s worth treasury funding, and it can be done rather quickly and constantly upgraded. It’ll get rather superhuman, and this will add an additional layer to quickly inoculate the system.
To the above point, in a recent video update, Charles Hoskinson called for a built-in pub/sub architecture. For those unaware, this is an extra protocol built into the network that enables broadcasting messages for a variety of use cases, leveraging the incredibly connected global network of Cardano nodes. Such a network would have allowed Intersect and others to broadcast to SPOs on an emergency channel, alerting them of the urgent need to upgrade. I also believe this would have been a good addition and would unlock many opportunities for Cardano as well.
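To make the idea concrete, here is a hypothetical sketch of what an emergency broadcast on such a pub/sub layer might carry; none of these types exist in the Cardano node today:

```haskell
data Severity = Info | Warning | Emergency
  deriving (Show, Eq, Ord)

-- A signed broadcast message gossiped across the node network.
data BroadcastMsg = BroadcastMsg
  { topic    :: String    -- e.g. "spo-alerts"
  , severity :: Severity
  , payload  :: String    -- e.g. "Upgrade to the patched node version now"
  , signerId :: String    -- key hash of the publisher, e.g. Intersect's
  }

-- Operators choose which publishers to trust, keeping the channel
-- decentralized: a message raises an alarm only if it is urgent and comes
-- from a locally configured trusted publisher. (Signature verification
-- itself is elided in this sketch.)
shouldAlertOperator :: [String] -> BroadcastMsg -> Bool
shouldAlertOperator trustedPublishers msg =
  severity msg == Emergency && signerId msg `elem` trustedPublishers
```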
In the end, I believe you should assume every chain is likely one bad transaction away from a similar disruption. It’s 2 p.m. Do you know where your children are? Do you know who your children’s friends are? One bad transaction, if you can find the right incantation. Ask yourself, not for the sake of some Twitter argument but for your own peace of mind: how would your favorite chain have stood up to such a breakdown in consensus? Personally, I’m coming away both impressed by Cardano and with a personal to-do list of where I can put in hard work to improve things even further. Well, first off, Pi, thank you so much for writing that up. It’s really a lot of work you put into it, and you’re a great guy who does a lot of cool things for the network.
It was a long day, and it’s been a long week, but we’re just pushing through. We only survived because we put in the homework early and put in a lot of effort, and I’m very proud of the fact that we survived. This could have killed other chains. So, links are there if you want to read it yourselves, but this was a live video. My commentary went from a 26-minute read to a 48-minute read.