ZeroTier is growing, and to facilitate that growth we've just made a significant upgrade to our root server infrastructure.
While the ZeroTier network is peer to peer, it relies upon a set of designated core nodes whose role and responsibility is very close to that of the DNS root name servers, hence the similar terminology. For those who are curious about the technical background of this design, this old blog post by ZeroTier's original author explains the reasoning process that led to the system as it is today. In short: while devices encrypt end-to-end and communicate directly whenever they can, the root server infrastructure is critical to facilitating initial setup.
Since ZeroTier's launch, its root server roles have been filled by four geographically distributed independent nodes. For some time now these have been located in San Francisco, New York, Paris, and Singapore. Each ZeroTier device announces its existence to all four of these and can choose to query whichever seems to offer the lowest latency. While this setup is simple, robust, and has served us well, it presented a scalability problem: as the network grows, how do we add more root servers without requiring clients to make more announcements? It's always possible to just make those roots bigger, but growth and robustness will eventually demand the ability to scale elastically and to add more locations.
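To make the client side of that design concrete, here is a minimal Python sketch of picking the lowest-latency root from a set of measurements. It is not ZeroTier's code (the real client is written in C++); the function name `pick_root` and the round-trip figures are invented purely for illustration.

```python
from typing import Dict, Optional


def pick_root(rtt_ms: Dict[str, Optional[float]]) -> Optional[str]:
    """Return the name of the root with the lowest measured round trip.

    A value of None means the root did not answer and is skipped.
    Returns None when no root answered at all.
    """
    reachable = {name: rtt for name, rtt in rtt_ms.items() if rtt is not None}
    if not reachable:
        return None
    return min(reachable, key=reachable.get)


# Hypothetical measurements as seen by a client somewhere in Europe.
measured = {
    "San Francisco": 182.0,
    "New York": 110.5,
    "Paris": 38.2,
    "Singapore": None,  # no reply
}
print(pick_root(measured))
```

The sketch also shows the scaling problem: every extra root is another entry each client has to announce to and measure.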
Since our user base is quite global, we also wanted to cover more of the world. Those four locations are great for America and Europe, but they left much of Asia, Australia, Africa, the Middle East, India, and South America with 200ms+ ping times to the nearest root. That's not nearly as bad as it would be if ZeroTier were a completely centralized "back-haul all traffic to the cloud" protocol, but it still meant perceptibly slower connection setup and sign-on times for many users.
Early in 2015 we started to explore the possibility of placing our root servers behind IPv4 and IPv6 anycast addresses. ZeroTier is a UDP-based protocol like DNS and SIP, two other protocols that are frequently deployed this way. Global anycast would allow us to put the whole root infrastructure behind one or maybe two IPs and then add as much actual capacity as we want behind that facade. The advantages at first seemed obvious: a single IP would play well with the need to work behind NAT, and the global Internet infrastructure and BGP would (we thought) take care of geographic route optimization for us.
We went to the trouble of heavily researching the topic and even obtaining a provisional allotment of address space. But as we conversed with numerous cloud providers and ISPs, it quickly became apparent that actually deploying anycast and doing it well is painful and expensive. Hundreds of gigabits of cloudy goodness can today be had for a few hundred dollars per month, but with the complexities of anycast and BGP sessions the cost quickly balloons to thousands of dollars per month for significantly less actual capacity. Chief among the cost drivers (and elasticity-killers) is the requirement by nearly all ISPs that anycast hosts be dedicated servers. BGP is also a bit more finicky in practice than it seems in theory. Conversations with experienced BGP administrators convinced us that actually getting anycast announcement and high-availability fail-over to work reliably and to deliver optimal routing across the world is deceptively hard: easy to prototype and get operational, but hard to tweak, optimize, and make truly robust.
BGP also introduces a security concern: if our entire root infrastructure is behind a single IP block with a single ASN, then a single BGP hijack or poisoning attack could take it down. "Legitimate" attacks become easier too: it's a lot easier for a censorship-happy regime to blacklist one IP block than to maintain a list of addresses scattered across blocks belonging to multiple cloud providers.
While we can see uses for anycast in the future for other potential services, a deep study of the topic made us start thinking about other options.
Then we remembered we were a software defined networking company.
Clustering (also known as multi-homing) has been on the ZeroTier feature queue for quite some time. Since ZeroTier's virtual addressing is defined entirely by cryptographic tokens, there is nothing that glues an address to a physical endpoint. That's why I can switch WiFi networks and be back on my virtual networks in less than a minute. It also means that nothing prevents a single ZeroTier "device" from being reachable via more than one IP address. IPs are just paths and they're completely ephemeral and interchangeable.
While the idea's been around for a while, implementation is tricky. How are peers to determine the best path when faced with a set of possibilities? How are new paths added and old ones removed?
The thing we were stuck on is the idea that the initiator of the link should be the one making these decisions. As we were mulling over potential alternatives to anycast, a simple thought came: why doesn't the recipient do that? It has more information.
The contacting peer knows the recipient's ZeroTier address and at least one IP endpoint. The recipient on the other hand can know all its endpoints as well as the endpoints of the contacting peer and the health and system load of all its cluster nodes. From that information it can decide which of its available endpoints the peer should be talking to, and can use already-existing protocol messages in the ZeroTier protocol (ones designed for NAT traversal and LAN location) to send the contacting peer there. Several metrics could be used to determine the best endpoint for communication.
After a few days of coding an early prototype was running. Turns out it took only about 1500 lines of C++ to implement clustering in its entirety, including cluster state management and peer handoff. Right now this code is not included in normal clients; build 1.1.0 or newer with ZT_ENABLE_CLUSTER=1 to enable it. The current version uses a geo-IP database to determine where peers should be sent. This isn't flawless but in practice it's good enough, and we can improve it in the future using other metrics like direct latency measurements and network load.
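As a rough illustration of recipient-side selection driven by geography, the Python sketch below picks the cluster member nearest to a peer's estimated location and skips members that are marked down. This is not the actual implementation (that is the C++ code mentioned above): the names `great_circle_km` and `choose_member`, the approximate city coordinates, and the idea of passing in an already-resolved peer location in place of a real geo-IP lookup are all assumptions made for the example.

```python
import math
from typing import Dict, Optional, Set, Tuple

LatLon = Tuple[float, float]  # (latitude, longitude) in degrees

EARTH_RADIUS_KM = 6371.0


def great_circle_km(a: LatLon, b: LatLon) -> float:
    """Haversine distance between two points on the Earth's surface."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def choose_member(peer: LatLon,
                  members: Dict[str, LatLon],
                  alive: Set[str]) -> Optional[str]:
    """Pick the live cluster member geographically closest to the peer."""
    candidates = {name: loc for name, loc in members.items() if name in alive}
    if not candidates:
        return None
    return min(candidates, key=lambda n: great_circle_km(peer, candidates[n]))


# Approximate coordinates for a hypothetical six-member cluster.
MEMBERS = {
    "Amsterdam": (52.37, 4.90),
    "Johannesburg": (-26.20, 28.05),
    "New York": (40.71, -74.01),
    "Sao Paulo": (-23.55, -46.63),
    "San Francisco": (37.77, -122.42),
    "Singapore": (1.35, 103.82),
}

nairobi = (-1.29, 36.82)  # where a geo-IP lookup might place the peer
print(choose_member(nairobi, MEMBERS, set(MEMBERS)))
print(choose_member(nairobi, MEMBERS, set(MEMBERS) - {"Johannesburg"}))
```

Because the choice is made on the cluster side, swapping distance for measured latency or member load later would only change the key function, not anything the contacting peer has to know.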
Clustering allows us to create a global elastically-scalable root server infrastructure with all the same characteristics that we initially sought out through anycast, but without BGP-related security bottlenecks or management overhead and using cheap commodity cloud services. Right now the clustering code has only been heavily tested for this specific deployment scenario, but in the near future we plan to introduce it as something you can use as well. The same clustering code that now powers Alice and Bob could be used to create geographically diverse high-availability clustered services on virtual networks. Any form of clustering is challenging to use with TCP, but UDP based protocols should be a breeze as long as they can be backed by distributed databases. We'll also be introducing clustering for network controllers, making them even more scalable and robust (though they can already be made very stable with simple fail-over).
Using our new clustering code we created two new root "servers": Alice and Bob. The names Alice and Bob seemed a natural fit since these two names are used as examples in virtually every text on cryptography.
If clustering lets us spread each of these new root servers out across as many physical endpoints as we want, then why two? Answer: even greater redundancy. The old design had four completely independent shared-nothing roots. That means that a problem on one would be very unlikely to affect the others, and all four would have to go down for the net to experience problems. Introducing clustering means introducing shared state; cluster members are no longer shared-nothing or truly independent. To preserve the same level of true systematic redundancy it's important that there always be more than one. That way we can do things like upgrade one, wait, then upgrade the other once we've confirmed that nothing is wrong. If one experiences serious problems clients will switch to the other and the network will continue to operate normally.
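The "upgrade one, wait, then upgrade the other" routine can be sketched in a few lines of Python. This is only an illustration of the idea, not our deployment tooling: `rolling_upgrade` and the `upgrade` and `healthy` callbacks are hypothetical stand-ins for whatever actually performs and verifies an upgrade.

```python
from typing import Callable, List


def rolling_upgrade(roots: List[str],
                    upgrade: Callable[[str], None],
                    healthy: Callable[[str], bool]) -> List[str]:
    """Upgrade roots one at a time and stop at the first unhealthy one.

    Returns the roots that were upgraded and confirmed healthy. Any root
    after a failed one is left untouched on the old version, so clients
    still have a known-good root to fall back to.
    """
    confirmed = []
    for root in roots:
        upgrade(root)
        if not healthy(root):
            break
        confirmed.append(root)
    return confirmed


# Simulate a bad release: "alice" fails its health check after upgrading.
touched = []
result = rolling_upgrade(["alice", "bob"], touched.append, lambda r: r != "alice")
print(result, touched)
```

In the simulated failure only the first root is ever touched, which is exactly the property a second, independent root is there to provide.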
A goal in our new infrastructure was to offer sub-100ms round trip latency to almost everyone on Earth. Alice and Bob are both (as of the time of this writing) six node clusters spread out across at least three continents.
|Alice||Bob|
|Amsterdam / Netherlands||Dallas / USA|
|Johannesburg / South Africa||Frankfurt / Germany|
|New York / USA||Paris / France|
|São Paulo / Brazil||Sydney / Australia|
|San Francisco / USA||Tokyo / Japan|
|Singapore||Toronto / Canada|
We've done some latency measurements, and the locations above bring us pretty close. There's a gap in the Middle East and perhaps Northern India and China where latencies are likely to be higher, but they're still going to be lower now than they were before. If we see more users in those areas we'll try to find a good hosting provider to add a presence there.
Alice and Bob are alive now. They took over two of the previous root servers' identities, allowing existing clients to use them with no updates or configuration changes. While clients back to 1.0.1 will work, we strongly recommend upgrading to 1.1.0 for better performance. Upgrading will also give you full dual-stack IPv4 and IPv6 capability, which the new root servers share at every one of their locations.
Before taking the infrastructure live we tested it with 50,000 Docker containers on 200 hosts at four Amazon EC2 availability zones making rapid-fire HTTP requests to and from one another. The results were quite solid, and showed that the existing infrastructure should be able to handle up to 10-15 million devices without significant upgrades. Further projections show that the existing cluster architecture and code could theoretically handle hundreds of millions of devices using larger member nodes. Going much beyond that is likely to require something a bit more elaborate, such as a shift from simple member-polling to something like the Raft consensus algorithm (or an underlying database that uses it). But if we have hundreds of millions of devices, that's something we'll have more than enough resources to tackle.
Necessity really is the mother of invention. If we were a giant company we'd have just gone the anycast route, since it seemed initially like the "right solution." But being a comparatively poorer startup, we balked at the cost and the management overhead. Instead we set ourselves to thinking about whether we could replace the big sexy "enterprise" way of anycast with something more clever that ran on top of the same commodity cloud and VPS services we'd used so successfully in the past. In the end that led to a system that is smarter, more scalable, faster, more robust, more secure, and at least an order of magnitude cheaper. We also got a new technology we can soon make available to end users to allow them to create multi-homed geo-clustered services on virtual networks. (More on that soon!)
That's one of the reasons startups are key to technology innovation. Big players will do things the established way because they can afford it, both in money and in personnel. Smaller players are required to substitute brains for brawn.