Prevention versus Recovery in Distributed Systems [Whiteboard]

Hi everyone, this is Charles Hoskinson broadcasting live from warm, sunny Colorado. Today is January 23rd, and I'd like to make a whiteboard video to talk a little bit about network operations and the balance between prevention and recovery in distributed systems. So hold on to your butts; it's going to be a fun video. Let me go ahead and share my screen. As many of you have mentioned, the system had a little bit of a blip on Sunday.

Although it wasn't very significant, there was a lovely tweet from a stake pool operator that illustrated the difference between a crash and a minor issue. One of the things that should be brought up is that when you think about the operations of a distributed system, you have to balance the prevention of issues with recovery from them. People often say, "What do you mean, Charles? We should build things that work 100% of the time and never have any issues." That is the goal, but the problem is that you introduce certain challenges into your system.

One is the desire for speed and performance. Two is open access and accessibility. Three is permissionless operation and Byzantine actors. There are other challenges, but when you introduce these, you lose a lot of your prevention mechanisms. What happens is that when you address one issue, like open access and accessibility, it unfortunately comes at the cost of speed and performance.

You can reduce the set of people who have operational access and make it permissioned, which could optimize throughput and potentially improve accessibility to the general public while removing Byzantine actors and thus preventing issues. However, then you've centralized the network a bit, which is why any good distributed system must find a balance between prevention and recovery. Recovery is crucial when something goes wrong, and we need to restore the system. This could involve checkpoints, rollbacks, restarts, and other standard tools and techniques. These are mechanisms that allow the system to recover from particular issues.
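As a rough illustration of the checkpoint-and-rollback idea mentioned above, here is a minimal Python sketch. The `CheckpointedLedger` class and the `apply_batch` helper are hypothetical names invented for this example. They are not part of any Cardano component, and a real node's recovery path is far more involved.

```python
import copy


class CheckpointedLedger:
    """Toy balance ledger with checkpoint and rollback (illustrative only)."""

    def __init__(self):
        self.balances = {}
        self._checkpoints = []  # stack of saved ledger states

    def checkpoint(self):
        # Save a deep copy of the current state so it can be restored later.
        self._checkpoints.append(copy.deepcopy(self.balances))

    def discard_checkpoint(self):
        # The work since the last checkpoint succeeded; drop the saved copy.
        self._checkpoints.pop()

    def rollback(self):
        # Restore the most recently saved state.
        if not self._checkpoints:
            raise RuntimeError("no checkpoint to roll back to")
        self.balances = self._checkpoints.pop()

    def apply(self, account, delta):
        # Reject any change that would make a balance negative.
        new_balance = self.balances.get(account, 0) + delta
        if new_balance < 0:
            raise ValueError(f"insufficient funds for {account}")
        self.balances[account] = new_balance


def apply_batch(ledger, batch):
    """Apply every (account, delta) pair, or none of them.

    Returns True if the whole batch was applied, and False if it failed
    and the ledger was rolled back to its state before the batch.
    """
    ledger.checkpoint()
    try:
        for account, delta in batch:
            ledger.apply(account, delta)
    except ValueError:
        ledger.rollback()
        return False
    ledger.discard_checkpoint()
    return True
```

If a batch such as `[("alice", -5), ("bob", -1)]` fails halfway through, `apply_batch` returns `False` and the ledger looks exactly as it did before the batch started. That is the recovery half of the balance: the bad input is not prevented, but its effects are undone.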

In 2022, we introduced several new concepts, including diffusion pipelining. This was popular because it made Cardano better, faster, and cheaper. However, the problem with diffusion pipelining is that when you introduce it, many of the validation checks for transaction propagation are removed to allow for faster network propagation. For example, if a user named Bob submits a bad transaction (let's call it txb), that transaction will be processed by a relay. Prior to diffusion pipelining, if that bad transaction could crash the relay, the relay would die during validation before transmitting it to anyone else.

With diffusion pipelining, you remove some of those safeguards to allow for faster propagation. This doesn't necessarily make the network less safe, but it does increase the number of potential failures from one to n, where n is the number of connections that particular relay has. In practice, transactions may drop and relays may restart; you gain speed and performance, but you give up some prevention against Byzantine actors. This is just one of the trade-offs that exist in distributed systems. We've been writing papers to understand these trade-offs.
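To make the one-to-n point concrete, here is a toy Python model. It is a sketch under strong assumptions, not a description of how the Cardano node actually works: the `crashed_nodes` function and the `peers` map are hypothetical, and the model assumes a single transaction that crashes every node that validates it.

```python
from collections import deque


def crashed_nodes(peers, entry, forward_before_validate):
    """Return the set of nodes that crash on a transaction that kills validators.

    peers maps a node name to the list of nodes it forwards to.
    entry is the node that first receives the bad transaction.
    forward_before_validate=False models "validate, then forward".
    forward_before_validate=True models "forward, then validate".
    """
    crashed = set()
    seen = {entry}
    queue = deque([entry])
    while queue:
        node = queue.popleft()
        if forward_before_validate:
            # The transaction leaves the node before validation runs.
            for peer in peers.get(node, []):
                if peer not in seen:
                    seen.add(peer)
                    queue.append(peer)
        # Validation of the bad transaction crashes this node. In the
        # validate-first case the node dies here and never forwards.
        crashed.add(node)
    return crashed


# A relay with n = 4 connections (names are made up for the example).
star = {"relay": ["a", "b", "c", "d"]}
validate_first = crashed_nodes(star, "relay", forward_before_validate=False)
forward_first = crashed_nodes(star, "relay", forward_before_validate=True)
print(len(validate_first), len(forward_first))  # prints "1 5"
```

With validation first, only the relay is lost. With forwarding first, the relay and all four of its connections are exposed to the same bad transaction, which is the widened failure surface described above.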

For instance, we wrote "Ouroboros Genesis," which aims to recover from the Genesis block to today without a checkpoint. This is slated for delivery this year, and combined with peer-to-peer technology, it should create a robust recovery mechanism for Cardano. We also wrote a protocol paper called "The Generals' Scuttlebutt: Byzantine-Resilient Gossip Protocols," which discusses these scenarios. The abstract states that one of the most successful applications of peer-to-peer communication networks is in blockchain protocols, which, in Satoshi Nakamoto's own words, rely on the nature of information being easy to spread and hard to stifle. Significant efforts have been made over the last decade to analyze the security of these protocols, but real-world implementations often rely on ad hoc attack mitigation strategies that leave gaps between idealized communication layers and actual security.

We bridge that gap by presenting a Byzantine-resilient network layer for blockchain protocols. For the first time, we quantify the problem of network-layer attacks in the context of blockchain security models and develop a design that thwarts resource-restricted adversaries. Many of these concepts have already been integrated into the Cardano network stack, which is why it is quite resilient. For example, when Bob tries to send a malformed transaction, we already have good recovery and prevention ideas in place. We also have a solid mechanism for graceful recovery and rollback.

Generally, when you have a cascade of failures, they don't tend to propagate throughout the entire network. However, a bug could arise where a chain of bad transactions creates an emergent state that leads to a crash scenario. In such cases, you might see a rippling effect, where it goes from Bob to the relay and then to multiple nodes, resulting in a significant propagation of failures. You can try to prevent this state, but recovery is equally important. Implementing sentinel code and restart mechanisms is essential.
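The sentinel-and-restart idea can be sketched in a few lines of Python. This is an illustrative assumption about what such a wrapper might look like, not code from the Cardano node: `supervise`, `validate`, and the `"txb"` marker are all hypothetical, and the "restart" is simulated by counting crashes rather than relaunching a process.

```python
def validate(tx):
    """Hypothetical handler: the transaction named "txb" crashes validation."""
    if tx == "txb":
        raise ValueError("malformed transaction")
    return "ok:" + tx


def supervise(handler, inbox, max_restarts=3):
    """Run handler over inbox, surviving crashes up to max_restarts times.

    A crashing item is dropped (quarantined) so it cannot crash the
    handler again, and processing resumes with the next item.
    Returns (processed results, dropped items, number of restarts).
    """
    processed, dropped, restarts = [], [], 0
    for item in inbox:
        try:
            processed.append(handler(item))
        except Exception:
            dropped.append(item)
            restarts += 1
            if restarts > max_restarts:
                # Too many crashes: stop retrying and escalate.
                raise RuntimeError("too many restarts; escalate to an operator")
    return processed, dropped, restarts
```

Calling `supervise(validate, ["tx1", "txb", "tx2"])` processes `tx1` and `tx2`, drops `txb`, and reports one restart. The cap on restarts matters: without it, a steady stream of bad input would keep a node restarting forever instead of surfacing the problem.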

Why am I bringing this up? Because as we look to the future, we have concepts like input endorsers, and the Ouroboros Leios paper is scheduled for release in the first half of 2023. This design will significantly improve speed and performance, allowing for higher throughput and more participants in consensus. However, introducing this new structure also brings a new layer of potential Byzantine attacks. It's possible that parts of the network could collapse when a bad actor or transaction state emerges.

That's acceptable as long as the general network operations remain stable. This is an example of a trade-off where you lose the ideal of a perfectly functioning machine, trading it off for the need for super-fast performance and low-cost transactions. The first trade-off example was in 2022 with diffusion pipelining, where something that would have only affected a few relays now impacts many more. Prevention is still there, and this is where formal methods and modeling are powerful. The complexity of distributed systems introduces emergent challenges, and the more people participate, the more accessibility and expressiveness you introduce, which can lead to potential failures.

This is why Bitcoin does not have smart contracts. The more expressiveness you allow, the greater the chance that someone could find a bizarre transaction or chain of transactions that could lead to a failure mode requiring a reset. This is why we designed extended UTXO and Plutus, to create formal guarantees that prevent users from writing transactions that could break the system. For the most part, it works as intended. However, prevention often reduces expressiveness in the system.

You'll see many smart contract developers express frustration over missing features that exist in Ethereum. We intentionally left those out to avoid creating open hazards for the system. Ethereum is trying to achieve sharding, but they suffer from protocols that aren't well formally modeled, leading to a lack of understanding of how to scale effectively. In contrast, we designed our system to be significantly more open without a bonded slashing system. This required us to work hard on protocol design to prevent and recover from issues without such measures.

While Ethereum's approach could lead to users losing their money due to unforeseen issues, Cardano's design prevents that from happening. Every system has its own set of design trade-offs, including how they accommodate Byzantine actors, open access, expressiveness, and speed and performance. The optimizations you make for speed and performance can significantly impact your ability to prevent attacks. In some cases, you need a formal model of recovery. What we're experiencing now is a natural growing pain in distributed systems design.

The fact that we were able to recover in under two minutes indicates a good balance between prevention and recovery in the system, given the challenges we face: throughput, permissionless access, accessibility, Byzantine actors, and expressiveness. As Cardano evolves into a system with billions of users and trillions of transactions, some parts of the network will likely be in a failed state at any given time. However, just as a human body has bad cells and must maintain homeostasis, a distributed system must also purge bad states as it scales. The added complexity increases the probability of encountering issues, but Cardano has found a beautiful balance that does not require punitive economic measures to compensate for poor protocol design. I'm pleased that we have a formal model to balance the need for prevention and the ability to recover from issues as we scale and shard Cardano.

This relies heavily on extended UTXO and Mithril, technologies we understand well and have demonstrated at scale. While we can implement better network monitoring and be proactive about preventing collapses, the reality is that distributed systems will always face challenges. In 2017, systems collapsed more often, but in 2023, we've been running for five years with minimal issues. The recent blip is a good example of how the community can come together to address challenges. I hope this whiteboard video has been helpful in explaining our design philosophy and where we stand.

I'm continually amazed by how remarkable Cardano is as an ecosystem and how truly decentralized we've become, capable of recovering from occasional blips. Thank you all for listening.
