The night of November 9, 2017 was not a good night. I was putting my son to bed when my phone started buzzing to let me know that one of the main database servers that runs ZeroTier Central had gone down. My first thought was, "No worries. We have two more in two other datacenters. There's no way more than one datacenter will go down at once."
Shortly thereafter, another one went out.
This was the great OVH outage of November 2017. After everything recovered, we decided that it was time to further diversify and shore up the architecture of ZeroTier Central to ensure that nothing short of Europe getting wiped off the face of the earth would cause this to happen again.
For those who don't already know, ZeroTier Central is our hosted network management interface that allows our users to manage, configure, and authorize devices on their networks. To store all that data we use RethinkDB. Behind the scenes, RethinkDB uses the Raft consensus algorithm to ensure data consistency across all of the machines running Central. This requires at least three instances of RethinkDB, a majority of which must be online. Since we lost two of our three datacenters on November 9th, only one instance was left, and Raft's required quorum could not be fulfilled.
In the week after that outage, we spun up two additional instances of Central in two different datacenters hosted by a second hosting provider. That brings us to five instances, so our database can now lose two datacenters at once and still keep a majority online.
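The majority rule is easy to check with a little arithmetic. Here is a minimal sketch, assuming every instance counts as one voting member; the function names `quorum` and `survives` are made up for illustration and are not part of RethinkDB.

```python
def quorum(voters: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    return voters // 2 + 1


def survives(voters: int, failed: int) -> bool:
    """True if the voters still online can form a majority."""
    return voters - failed >= quorum(voters)


# Compare the old three-instance layout with the new five-instance one.
for voters, failed in [(3, 1), (3, 2), (5, 2), (5, 3)]:
    status = "quorum kept" if survives(voters, failed) else "quorum lost"
    print(f"{voters} instances, {failed} down: {status}")
```

With three instances, losing two leaves one, which is short of the two needed. With five, losing two leaves three, which is exactly a majority.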
Prior to December 14, 2017, Central itself orchestrated the network controllers. This was less than ideal, as it meant that whenever we updated Central, the controllers had to be stopped, causing a short service interruption for our users. It also meant we had to handle orchestration ourselves, which turned out to be a stickier problem than we originally thought. The system sometimes got into states where too many controllers were on the same machine, or where more than one machine was hosting the same controller.
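To make those two bad states concrete, here is a rough sketch of a check for them. It is not our actual code: the `placement_problems` function, the `(controller, machine)` pair format, and the per-machine limit are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict


def placement_problems(placements, max_per_machine):
    """Return a list of violations for (controller, machine) pairs."""
    # Collect the set of machines each controller is running on.
    machines_by_controller = defaultdict(set)
    for controller, machine in placements:
        machines_by_controller[controller].add(machine)

    # Count the distinct controllers hosted by each machine.
    load = Counter()
    for machines in machines_by_controller.values():
        for machine in machines:
            load[machine] += 1

    problems = []
    # Bad state 1: the same controller is hosted by more than one machine.
    for controller, machines in sorted(machines_by_controller.items()):
        if len(machines) > 1:
            problems.append(f"{controller} runs on {sorted(machines)}")
    # Bad state 2: a machine hosts more controllers than the limit allows.
    for machine, count in sorted(load.items()):
        if count > max_per_machine:
            problems.append(f"{machine} hosts {count} controllers")
    return problems


# Hypothetical cluster state: c1 is duplicated and m1 is overloaded.
state = [("c1", "m1"), ("c2", "m1"), ("c3", "m1"), ("c1", "m2")]
for problem in placement_problems(state, max_per_machine=2):
    print(problem)
```

Detecting these states is the easy part. Keeping a cluster out of them while machines come and go is what makes orchestration sticky, and it is the job a scheduler takes off your hands.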
Fortunately, a lot of really smart people have come up with very good solutions to this problem. Our only issue was that all the most popular solutions seemed to require us to containerize ALL THE THINGS! Now, we don't have anything against containers. We just didn't feel they were necessary for our use case, and we didn't want the added complexity of bringing containers into the mix.
That led us to Nomad, a scheduler that can run plain binaries as well as containers. Nomad gives us a lot of new flexibility we didn't have before. It took a little work on our end to make the controller talk directly to RethinkDB instead of the Central application, but that was a small price to pay and ultimately helps stability in the long run. We can now bring down the Central web application without affecting your current networks or devices. We can also roll out upgrades to our network controller binaries one controller at a time, keeping the downtime to a few seconds for each controller.
Controllers now run on separate machines from Central, in two separate datacenters run by two separate cloud providers. We're also overprovisioned enough to survive losing an entire datacenter without the controllers being down for a noticeable period of time.
We rolled out Nomad management of network controllers on December 14, 2017. If you have an account on ZeroTier Central, you probably got an email about some planned downtime for that day. This is what it was for. We hope the downtime didn't cause you any major issues.
This is just the start of our use of Nomad. We have lots of behind-the-scenes plans for it in the future. We can easily grow and bring new systems and datacenters into our cluster. We can even use containers in the future if we decide that is the best path forward. With Nomad's help, we hope you don't notice a thing and your networks keep chugging along smoothly!