Data Center Failures and CRDB Replication

David Lukens, 2021-04-03

Executive Summary

The purpose of this document is to explain how replication of data within CockroachDB can be used to tolerate various failure scenarios, and how those scenarios affect query execution times. The simulation explores the effect of different replication factors on a database cluster that spans multiple Availability Zones in multiple regions.
Overall, more replicas allow for more nodes to be down at the same time, and both R=5 and R=7 allow for the loss of an entire region. But, counterintuitively, R=7 only allows for a single Availability Zone to be lost, while R=5 allows for two.

Description

In a discussion, the question came up, “What happens to geographically diverse replicas when something fails?” The result was a simulation of geo-partitioned replicas and a visual representation to help understand it. The simulation used a 15-node CockroachDB cluster with five Availability Zones (AZs, or data centers) spread across three regions, and compared a replication factor of five against a replication factor of seven. Simple reads (SELECTs) and writes (INSERTs) were used to illustrate the latencies involved with network connections that span long distances. Assume that the computational time for a read on a single node is 2ms and a write is 4ms.
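For reference, the replication factor in CockroachDB is controlled through zone configurations. Here is a minimal sketch of how a setup like this might be configured, assuming an illustrative database named testdb (the simulation's actual schema isn't shown in this post):

    -- Default replication factor for every range in the cluster.
    ALTER RANGE default CONFIGURE ZONE USING num_replicas = 5;

    -- Or scope the setting to a single database.
    ALTER DATABASE testdb CONFIGURE ZONE USING num_replicas = 5;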

As background, CockroachDB prides itself on its high availability and its distributed nature. This means ensuring that writes are flushed to disk by a majority of the nodes holding that data before proclaiming to the client that the write has been committed, which provides guaranteed consistency and the ability to tolerate a failure. To do this, the leaseholder (the node that acts as the coordinator for a given range of data) will ensure that a quorum of the other nodes that contain that piece of data have committed and acknowledged the write before returning a message to the client indicating the write was successful.

Five Replicas

Let’s look at an initial state of three regions, five AZs, and a total of 15 nodes with a replication factor of five. Our regions are US West, US East, and the EU. Preferences were set to keep the leaseholder of a particular range of data in the region it represents. That is to say the leaseholder for US West data would prefer to stay in the US West region if at all possible.
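In CockroachDB terms, that kind of preference is expressed through a zone configuration on the geo-partitioned data. A minimal sketch, assuming a hypothetical table named accounts with a us_west partition and nodes started with a region=us-west locality (these names are illustrative, not taken from the simulation):

    -- Prefer to place the leaseholder for the us_west partition on nodes
    -- in the US West region, without restricting where replicas may live.
    ALTER PARTITION us_west OF TABLE accounts
        CONFIGURE ZONE USING lease_preferences = '[[+region=us-west]]';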


As we can see here, with five replicas of data, each AZ contains one replica of each region’s data and the leaseholder for a given region’s data resides in that region. This is our best case scenario. Now let’s assume that we are on the West coast and need to query East coast data.
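You can check placement like this yourself by asking the database where each range's replicas and leaseholder live. A sketch using the same hypothetical table; in the CockroachDB versions current when this was written, the output includes leaseholder and replica locality columns:

    -- Show each range of the table, which node holds its lease, and the
    -- localities of its replicas.
    SHOW RANGES FROM TABLE accounts;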


In order to make that query, we need to contact the leaseholder for the East coast data. It coordinates activity on the range and then returns the information to the client on the West coast. This takes 62ms in the simulation, which lines up closely with our round-trip packet time of 58ms plus the read itself.
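Concretely, the read is just an ordinary SELECT issued from a West coast gateway node against East-coast-homed rows; a hypothetical example (the table, columns, and values are made up):

    -- Issued against a gateway node in US West; the gateway forwards the
    -- request to the East coast leaseholder and relays the rows back.
    SELECT balance FROM accounts WHERE region = 'us-east' AND id = 1234;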

Now let’s look at a write from the West coast of the US to the EU.


Here, the client on the West coast issues an update, which is sent to the EU leaseholder to coordinate. The leaseholder in the EU distributes the write request to the other four nodes that contain replicas, one in each of the other AZs, but cannot commit the write until a quorum (three nodes out of five in this case) confirms the operation. The leaseholder is one of those nodes and will wait for two more to confirm before considering the write committed. Given the shorter round-trip time from the EU to the East coast than to the West coast, the two East coast nodes are expected to respond before those on the West coast.

So the minimum total time to handle this request is 210ms: the initial request from the West coast to the EU (half of the 132ms ping time, or 66ms), the round-trip write request to the East coast (78ms), and finally the response from the EU back to the client on the West coast (66ms). That 210ms represents just the network latencies involved; additional time is required to actually execute the query. You can see how much network latency is involved when we are crossing a continent or jumping to another one. Unfortunately, this is due to the limits of the speed of light, and nobody has figured out how to push electrons any faster yet.
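From the client's point of view the write itself is equally unremarkable; a hypothetical example, again with made-up names:

    -- Issued against a gateway node in US West; the EU leaseholder
    -- coordinates the write and acknowledges it only after a quorum
    -- (three of the five replicas) has committed it.
    UPDATE accounts SET balance = balance + 100
        WHERE region = 'eu' AND id = 5678;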

Overall, CockroachDB will attempt to keep an equal number of replicas of a given piece of data in each of the available regions, and within each region it will try to spread those replicas evenly across that region's AZs.

What happens when an AZ goes down? Let’s take out the Northern AZ on the East coast.


When that AZ fails, the rest of the cluster recognizes that some ranges are now under-replicated. By default, after five minutes the database concludes that the lost nodes are not coming back, and the remaining nodes up-replicate those ranges. In order to keep the regions balanced in this simulation, the lost replicas were re-created in the other US East AZ. The leaseholder for the East coast data disappeared as well, and after ~5 seconds without a leaseholder the remaining nodes elected a new one. Since our preference is to keep that leaseholder on the East coast if possible, it was relocated to the other AZ on the East coast.
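That five-minute window is the cluster setting server.time_until_store_dead, which defaults to five minutes and can be tuned; a sketch:

    -- How long the cluster waits before declaring an unresponsive store
    -- dead and re-replicating its ranges elsewhere (defaults to 5 minutes).
    SHOW CLUSTER SETTING server.time_until_store_dead;

    -- Example: wait 10 minutes before giving up on the lost nodes.
    SET CLUSTER SETTING server.time_until_store_dead = '10m0s';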

When we look at the latencies, since all of the regions still exist, they don't change compared to our first scenario where everything was functioning properly.

Now, what happens if we lose an entire region?


In this case, if we start from our normal state with five replicas and both AZs on the East coast fail, we have a different outcome. As you can see, with two regions still running, the DB tries to put an equal number of replicas in each of the two remaining regions. With five replicas, we end up with a 3-2 split for each of the three sets of data. The significant difference here, with respect to latency, is that if we write East coast data from the West coast, a quorum (three of the five replicas) is now local to the West coast, and the response time is 5ms.


Looking at the original functional state and considering the possible failures, this configuration of three regions, five AZs, 15 nodes, and R=5 allows us to tolerate the failure of an entire region. With any region out of the mix, we still have at least three copies of each range in play to generate quorum.

It also allows us to guarantee that we can tolerate the loss of any two AZs, or the simultaneous loss of any two nodes, and continue to serve data. Beyond that point, the loss of even one additional node can leave us with only two replicas of some data remaining, and with only two replicas out of five the DB cannot achieve quorum.

Seven Replicas

“What if we increase the number of replicas? That’ll make things faster and allow us to gracefully handle more failures? Right?”

The overall answer is both no and yes. And the “yes” part of the equation comes at the cost of having to store 40% more data than before and coordinate 40% more copies of each range. Since the infrastructure is not free, that cost must be taken into consideration: if we are working with a DB that is tens of TB in size, cloud-based or enterprise-grade storage isn't cheap.

Here is our nominal state with seven replicas.


Again, CockroachDB tries to balance the number of replicas across the regions, and again within each region. Since we have fewer AZs than replicas, some AZs now hold replicas of a given range on more than one node.
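Moving from five to seven replicas is the same kind of zone-configuration change sketched earlier, using the same illustrative database name:

    -- Raise the replication factor; CockroachDB up-replicates existing
    -- ranges in the background until each has seven replicas.
    ALTER DATABASE testdb CONFIGURE ZONE USING num_replicas = 7;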

Let’s repeat our first failure scenario and knock out the NorthEast AZ of the US. After all of the under-replicated ranges have been up-replicated, the result looks like this.


As you can see here, additional replicas of the West coast data were created in the SouthEast and EU AZs, the East coast data was up-replicated in the NorthWest AZ, and the EU data was up-replicated in the SouthEast AZ. Through this, the DB is still trying to keep an equal number of replicas in each region and to balance the AZs within a region, which results in a 3-2-2 split. And again, the latencies for queries don't change, as we still have at least one AZ in each region.

Let’s start from our completely functional seven replica state and knock out the entire East coast. After replicating the lost ranges, the situation looks like this.


Four replicas were created for the under-replicated data on the West coast, and four more in the EU. Much like the R=5 situation where we lost the East coast, the latencies for queries are nearly the same as in that scenario: the same network hops need to be made with R=7 as with R=5, there are just more of them happening in parallel.

“But I thought you said we’d be more resilient with seven replicas?”

Looking at the diagram of when everything was working properly….


We have the ability to tolerate the failure of any one region, as no region contains more than three replicas of a given range. That still leaves four in the other regions to make quorum. This is the same as our situation with five replicas.

Now if we look at the loss of AZs, things are counterintuitive. In the situation with five replicas, we could lose any two AZs as three of the five replicas would remain and we could still achieve quorum. With seven replicas distributed as in the image above, what if we lose the SouthEast AZ and the EU AZ? How many replicas of East coast data remain? There are only three left, which is not enough for a quorum any longer.

From the standpoint of losing nodes, we can lose at least three nodes and still keep functioning: even if all three nodes hold replicas of the same range, the remaining four replicas of that range will continue to operate and can still form a quorum.

Conclusion

As we’ve walked through above, writes to our distributed DB are significantly affected by the latency of our network connections. These transmission times increase the further away our AZs are from each other.

The loss of an AZ, with replication set to either five or seven, does not significantly impact the latencies of queries in our distributed DB. The loss of an entire region can impact the latency of our queries.

The failures we can tolerate gracefully with our distributed DB do change between a replication factor of five and seven. In both cases we can handle losing an entire region. With R=5, we can handle the loss of any two AZs, but with R=7 we can only guarantee handling the loss of one AZ. For node failures, with R=5 we can tolerate the loss of any two nodes, and with R=7 we can tolerate the loss of any three nodes in a worst-case scenario. This comes at the cost of the 40% more storage required for R=7.

