CockroachDB and precision clocks
I was recently working with a deployment where CockroachDB nodes were running as VMs on VMware hosts. The difficulty was that whenever a VM went through a vMotion, its node would start flapping once the vMotion completed, and the flapping could last for up to 20 minutes. Obviously, having nodes bounce up and down is not desirable: it reduces the computational resources available to the cluster, and it could lead to unavailability of data if other maintenance activities, such as a repave or an upgrade, are happening concurrently. If a node could not successfully rejoin the cluster within five minutes, the remainder of the cluster would start to up-replicate any data that existed on the down node. This puts additional load on the remaining nodes as the cluster tries to self-heal. Historically, the VMs running CockroachDB used NTPD on the guest OS, synchronizing every 11 minutes, to keep the clocks reasonably well align...