We've been working on a complete rewrite of the Central backend for a few months now. We are pleased to announce that it's open for public beta testing at: https://my-beta.zerotier.com!
You won't notice a significant user interface difference as that part hasn't changed much (yet). Everything behind the scenes has been completely rewritten. Come test it, create and join networks, point your automation scripts at it, try to crash it (and let us know if you do!). Please note: your accounts have not yet been moved over to the beta site, and any networks created during the beta period will be deleted when the beta period is over. This site is for testing only and is not a replacement for the regular my.zerotier.com.
Since ZeroTier Central first went live we have grown by almost two orders of magnitude. As Jeff Dean once put it, "design for 10x growth, but plan to rewrite before ~100x". Well, we're approaching 100x now, and we're finding this advice to be correct.
We've loved using RethinkDB for the past couple of years. Its support for clustering out of the box is great and made it easy for us as a smaller company to build a lot of redundancy into our infrastructure quickly. It performs well, is easy to use, and has mostly served us well. We have run into a few issues with it under extremely high loads, however, and with the company behind RethinkDB going out of business, it's hard to find help when an issue arises. There is an open source community supporting the database but it's small, and maintaining a complex distributed database is no small task.
The first issue is that our workload causes a memory leak in RethinkDB. As a result, every day at 9am and 9pm Pacific Time, we've been doing a rolling restart of our database servers. Things mostly stay running, but you're much more likely to get a 502 error when this is going on. We would like to thank Sam Hughes for his help trying to track this issue down. Unfortunately we were never able to fix it completely.
We ran into another issue with RethinkDB Proxy, which we used on the servers the ZeroTier Network Controllers run on. This issue has unfortunately been very visible to a lot of users. We've had two instances where some networks experienced members being deauthorized unexpectedly. We traced this to the RethinkDB proxy not always completely connecting to the main database cluster. When this happens, it doesn't give us a warning, and then doesn't always return all of the rows we request when a network controller starts up. This put a controller into a state where the first time it saw a node after startup was when the node contacted the controller, rather than when loading stored data from the database. On a private network, the controller assumes in this case that it's a new node trying to join the network, and sets it in the database as unauthorized.
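To make that failure mode concrete, here is a toy model of it in Go. This is only a sketch of the sequence described above, not our actual controller code: the `Member` and `Controller` types, the `Start` and `HandleNode` methods, and the node IDs are all hypothetical, and a plain map stands in for the database.

```go
package main

import "fmt"

// Member is a hypothetical, simplified record of a node on a private network.
type Member struct {
	NodeID     string
	Authorized bool
}

// Controller is a toy stand-in for a network controller backed by a database.
type Controller struct {
	db    map[string]Member // stored data (the source of truth)
	known map[string]Member // what the controller loaded at startup
}

// Start loads members from the rows the database returned. If the result is
// incomplete, as with a partially connected proxy, some members are missing.
func (c *Controller) Start(rows []Member) {
	c.known = make(map[string]Member)
	for _, m := range rows {
		c.known[m.NodeID] = m
	}
}

// HandleNode is called when a node contacts the controller. A node that was
// not loaded at startup is treated as new and stored as unauthorized.
func (c *Controller) HandleNode(id string) Member {
	if m, ok := c.known[id]; ok {
		return m
	}
	m := Member{NodeID: id, Authorized: false}
	c.known[id] = m
	c.db[id] = m // overwrites the previously authorized record
	return m
}

func main() {
	db := map[string]Member{
		"aaaa": {NodeID: "aaaa", Authorized: true},
		"bbbb": {NodeID: "bbbb", Authorized: true},
	}
	c := &Controller{db: db}

	// Simulate an incomplete result: only one of the two rows comes back.
	c.Start([]Member{db["aaaa"]})

	fmt.Println(c.HandleNode("aaaa").Authorized) // true: loaded at startup
	fmt.Println(c.HandleNode("bbbb").Authorized) // false: mistaken for a new node
	fmt.Println(db["bbbb"].Authorized)           // false: the stored record was overwritten
}
```

The point of the sketch is that nothing in `HandleNode` can tell "never seen before" apart from "missing from an incomplete load", which is why a silent partial result was so damaging.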
After searching far and wide for which database to switch to, we decided upon the tried and true PostgreSQL. It's rock solid, has excellent tooling, replication & HA support, and it's fast! Adam and I have also worked with it extensively on several projects over the past 20 years.
Unfortunately, there seems to be some point where dynamic typing begins to cause more trouble than it's worth. Bugs crawl through some cracks where something is expected to be one type, but a few old records from previous revisions of the code deep in the database have it as another. Things are still fairly easy to write, but it sure would be nice to have a compiler to find many of these issues up front, wouldn't it?
Well, Go has matured a lot since we first started ZeroTier. We took a good look at everything available out there and decided to Go with it (pun intended). It's easy to learn and use, type safe, has great tooling, has built-in support for our database of choice, and it's been a breeze to get our API put into it. Another advantage is that it will make it easier to distribute Central to those users who wish to run their own instance. It'll be a single binary to run rather than a Docker container.
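As a rough illustration of the type-safety point, the sketch below decodes two stored records into a typed Go struct. The `Member` struct, its field names, the `decodeMember` helper, and the sample JSON are assumptions made up for this example, not our real schema. A compiler can't inspect data already sitting in a database, but a typed struct does force a mismatched old record to surface as an explicit error at the decode boundary instead of slipping deeper into the code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Member is a hypothetical record shape; the field names are illustrative.
type Member struct {
	NodeID     string `json:"nodeId"`
	Authorized bool   `json:"authorized"`
}

// decodeMember parses a stored JSON document into a typed Member.
// A field with the wrong JSON type makes json.Unmarshal return an error.
func decodeMember(raw string) (Member, error) {
	var m Member
	err := json.Unmarshal([]byte(raw), &m)
	return m, err
}

func main() {
	current := `{"nodeId":"aaaa","authorized":true}`
	// An older revision of the code stored the flag as a string.
	legacy := `{"nodeId":"bbbb","authorized":"true"}`

	if m, err := decodeMember(current); err == nil {
		fmt.Println("decoded current record, authorized =", m.Authorized)
	}
	if _, err := decodeMember(legacy); err != nil {
		fmt.Println("rejected legacy record:", err)
	}
}
```

Once the record is decoded, the compiler guarantees that everything downstream sees `Authorized` as a `bool`, so the legacy string form has to be handled once, deliberately, at the edge.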