Cardano Stall
Summary
- Charles Hoskinson provided a postmortem on a network stall that occurred on January 22, 2023, affecting mainnet nodes.
- The issue was first recognized by Smaug Pool, with the network stalling for about two minutes before recovering.
- A team including Sam, Jared, Arno, and SPOs like Markus Gufler analyzed logs but couldn't identify the exact cause due to the lack of a rolling PCAP.
- The stall was deemed a transient issue, likely caused by a combination of factors, making reproducibility unlikely.
- Cardano's system successfully self-healed during the incident, with no transactions or blocks lost, demonstrating its resilience.
- The team is investigating a potential bug in a Haskell library or in the balanceR implementation used for optimization.
- The incident highlighted the decentralized nature of Cardano, with SPOs quickly mobilizing to assist in debugging.
- Future monitoring will include sentinel nodes with raw PCAP dumps to analyze incoming and outgoing traffic for similar issues.
- The network's performance metrics remained stable throughout the incident, and no emergency actions were required from exchanges or consensus groups.
- Hoskinson expressed gratitude to the SPO community and his team for their dedication during the incident.
Full Transcript
Hi, this is Charles Hoskinson broadcasting live from warm, sunny Colorado. Today is January 23rd, 2023. I wanted to give a little postmortem on the weekend. We had an interesting issue, so let me go ahead and share my screen. If you look over here, this is the GitHub issue.
It says mainnet nodes with incoming connections unexpectedly shut down with a failure in Data.Map.balanceR. I think Smaug Pool was the first to recognize this, but a lot of people noticed that the network stalled for about two minutes. Everything's great now; it came back. It's funny how it always breaks on Sunday morning or Monday morning, late at night, while everybody's sleeping—that's just how things work. The long and short of it is that it seems to be a transient issue, probably a combination of several things happening at the same time, which means that reproducibility is unlikely.
We pulled the team together Sunday night—Sam, Jared, Arno, and myself—and we talked to the SPOs, including Markus Gufler and a few others who were willing to come on. We took a look at all the logs we had, but because we don’t have a rolling PCAP, we couldn’t actually see what transaction or block caused this trip of the balanceR code that was put in for optimization. We know where the error was raised in the program, but it’s hard to dig through and figure out what triggered that particular issue. It doesn’t seem to be reproducible, so we’ll keep looking into what caused the stall. The good news is that Cardano did exactly what it was supposed to do when a stall occurs.
The system recovers itself and heals, so the nodes came back up, and everything's great now. That’s exactly what we designed the nodes to do. It’s not very satisfying because ideally, every time a distributed system has a blip or a stall, you’d like to know the exact cause. But the problem is that distributed systems sometimes create what are called emergent bugs. Locally, it’s not reproducible, but a collection of things can create a collective global state that for some reason triggers something in the system, causing it to stop for some people.
As I mentioned, there’s a team working on it. They’re looking into it and chipping away at it. We’ll do a postmortem inside the GitHub issue. If there’s a bug in either a Haskell library or in the implementation of balanceR, which was put in to optimize Cardano, it will get patched. The problem is that fixing bugs requires an understanding of what triggered them, and that’s what we’re looking for right now.
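For context on why the trigger is so hard to pin down: if this is the same balanceR as in Haskell's standard containers library, it's an internal rebalancing step for Data.Map's size-balanced trees, and the "Failure in Data.Map.balanceR" error is a guard that only fires after the tree's invariants have already been corrupted by something else. The sketch below is purely illustrative, not the actual Cardano failure path: it hand-builds an invalid tree through the Data.Map.Internal module, something no public API call can do, and trips the same error.

```haskell
import qualified Data.Map.Internal as MI

-- A hand-corrupted tree: the node claims size 9 but holds a single
-- entry. The public Data.Map API can never produce this shape; some
-- upstream bug has to break the invariants first.
corrupt :: MI.Map Int Char
corrupt = MI.Bin 9 5 'r' MI.Tip MI.Tip

main :: IO ()
main =
  -- balanceR sees a right subtree that looks far heavier than the
  -- left, tries to rotate, finds only Tip children to rotate with,
  -- and aborts with: Failure in Data.Map.balanceR
  print (MI.balanceR (0 :: Int) 'k' (MI.Bin 1 (-1) 'l' MI.Tip MI.Tip) corrupt)
```

The point of the sketch is that by the time balanceR throws, the real damage happened earlier and elsewhere, which is exactly why knowing where the exception was raised doesn't tell you what the causal event was.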
It’s going to take some time to get there. In some cases, you can’t figure these things out, and they come up once every five or six years. You just say, “Okay, the network recovered; we move on.” However, sometimes you can figure it out, but you never want to drive yourself nuts over it. In any distributed system, whether it be Googleplex, Microsoft Azure, or Amazon Web Services, stuff crashes all the time.
Systems go down for a variety of reasons—hardware failures, software failures, hacking attempts, cosmic rays—all kinds of stuff. There’s a famous cosmic ray issue with Google Bigtable. That’s okay because the second thing you do when you build a large-scale distributed system is build it for resilience and self-healing. So when these types of things happen, the system recovers, which is exactly what occurred here. It was really cool to see how the SPOs came together and how quickly they were on top of this.
They’re still looking into it. If in fact it’s a race condition, everybody seems to be looking for it. I’m really proud of Smaug, Andrew Westberg, and all the others who showed up literally moments after this occurred in the middle of the night, collecting large amounts of information and trying to debug the issue. It just shows you how decentralized and resilient Cardano has become. When I know more, of course, I’ll share it with you guys.
As I mentioned, we have a skunk works team looking into it, trying to figure out a way to reproduce it. But it might not be possible. This is the point when we talk about high assurance, formal methods, self-healing, and resilience. You design these systems so that they can recover in the event that something bad happens like this. Nobody noticed; transactions were not lost, blocks were not dropped, and no money was lost.
The network didn’t halt; it stalled for a little bit and recovered, but it was still moving forward and progressing. It’s not like panic calls had to go to exchanges saying, “Upgrade to this emergency node,” or that all the consensus people had to get together on Discord to figure out how to kick and restart Cardano. It self-healed, and that’s the point of a decentralized, distributed, resilient system—it should have this capability to self-heal and rebuild itself in the event of a stall from a transient event. I’m proud that this happened. It’s frustrating because we’d like to know everything.
We’d like this to be deterministic so that anytime an incident occurs, we know exactly why it occurred and where it happened. We know what part of the code was impacted; we know where the error handler exception was thrown. But what we don’t know is the causal event that knocked the system into that particular state. We could probably spend weeks or months, perhaps even years, chasing down something that may only come up every five years, and whose impact is that it stalls the network a little bit.
This particular issue will likely never come up again in this form, especially if it’s an emergent transient error. The code is a moving target: as optimizations come in, they may actually rework that part of the code. When Genesis comes in, there’s a different operating model, and network upgrades bring different operating models. Obviously, input endorsers are going to completely change everything. That’s the other part of debugging: while you’re trying to find something, you may actually just change the code that had the problem as you upgrade the module to a different state, never even knowing that you fixed the particular bug that caused the issue.
In any event, we’ll be monitoring things. We’re setting up some sentinel nodes that have a raw PCAP dump so we can actually see all the incoming and outgoing data. We’ll keep that on probably a one- to two-week rolling basis. If there was a transaction or block that caused this, we can zoom in on that time period, break the transaction apart, run it, and replay it. We did all the usual things; we were able to fully sync nodes.
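As a concrete sketch of what such a sentinel could run (the interface, port, file sizes, and retention window here are assumptions, not the team's actual configuration), here's a small Haskell wrapper that shells out to tcpdump with a ring buffer, so old capture files age out on their own:

```haskell
import System.Process (callProcess)

-- Rolling raw-packet capture for a sentinel node: tcpdump fills a
-- ring of fixed-size PCAP files, keeping roughly the last one to two
-- weeks of traffic on disk for later dissection and replay.
main :: IO ()
main = callProcess "tcpdump"
  [ "-i", "eth0"          -- capture interface (assumption)
  , "-w", "sentinel.pcap" -- base filename; tcpdump numbers each file
  , "-C", "1024"          -- roll to a new file after ~1 GB
  , "-W", "336"           -- keep 336 files, overwriting the oldest
  , "port", "3001"        -- typical cardano-node relay port (assumption)
  ]
```

The ring buffer keeps disk usage bounded while still making it possible to cut out the exact two-minute window around a stall and dissect what came over the wire.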
Whatever the block or transaction or event that caused it, it obviously wasn’t something persistent in the chain itself, because fully syncing didn’t recreate the stalled state. The network is running, all the metrics look good, and the community came together. Thanks to the SPO community for keeping Cardano running, and thank you to Jared, Sam, and Arno for working the weekend with me. It’s kind of tough to be up late at night, thinking you’re about to put the kids to bed, and then you get a ping saying, “Hey, can you work a few hours?” until you’re so tired you can’t see straight.
But they did; they went above and beyond the call of duty. Nothing more to report for now. Thank you all for listening. I’ll have more probably toward the end of the week, and we’ll see if we get to the bottom of this. Cheers!