About a week ago many users may have noticed instability on our hosted ZeroTier Central. Network controllers would flicker on and off, and eventually the whole service needed to be restarted across our cluster. 500 errors and timeouts were a thing.
This is the story of one of the worst bugs I've ever investigated in my entire career. This is the story of...
This was no ordinary memory leak. No unbalanced malloc/free or new/delete was this. No circular references with reference counting pointers. No queues not being emptied. No file descriptors being dropped. When we program we assume certain things. We assume that we live in an orderly universe that obeys physical laws.
We assume that the gods are sane.
It was a bright cloudless day, not a dark and stormy night (we're in Southern California). We'd just pushed out a series of updates to ZeroTier Central to address one of our largest customers' performance concerns, not to mention dealing with the high load that comes with growth in general. I'm a believer in trying to fix performance issues before throwing more hardware at the problem, so we worked hard to factor out a fragile and CPU-intensive coupling between our controller microservice (written in C++) and our backend (written in NodeJS) in favor of a faster and simpler one.
After uttering those famous last words of "it worked in dev and staging," we pushed it to live. Everything came back up and all remained calm. Load dropped a bit. Things seemed to be working well. We went home for the weekend. (We usually try to avoid shipping on Fridays, but this one was kind of critical.)
Sometime after the witching hour, poltergeist activity began to afflict our mobile phones. Bzz, bzz, ... Something was wrong. Very wrong. Since customers were being impacted, we restarted stuff. (Please don't judge. If you've ever run anything at scale you've been there.) It happened again, and again, and finally our devops person put a shim in place to restart the service automatically via a cron job (stop judging us!) every few hours. This kept things stable enough while we could diagnose the problem.
Nothing made any sense until we noticed the controller microservice's memory consumption. A service that should be using perhaps a few hundred megabytes at most was using gigabytes and growing... and growing... and growing... and growing...
A leak. Le sigh. Time to look through the commit log. What did I do? Hmm... nothing looks like it could account for this. We're doing almost everything the RAII way using managed structures that are well tested, and any places with a new/delete are old code that hasn't changed in ages.
Next step involved finding a way to duplicate it in dev. Eventually I was able to do so using siege, a command line utility for load testing web servers. Make tons and tons of network member changes and the leak appears, meaning it must be something in either the new coupling code or the (very simple) database inside the controller microservice.
I worked until about three o'clock in the morning selectively commenting out regions of the code. Eventually I narrowed the problem down to a region of code inside the method that handles creating the actual serialized network configurations that are sent to ZeroTier network members. This gets called when members request updates and when members are changed to push them out.
Unfortunately the region of code in question made absolutely no sense. Everything happening there involved simple non-cyclic structures built on a very well tested JSON library we'd been using for ages and C++ STL data structures that are used everywhere. What made even less sense was that the rate at which the leak occurred could be changed by changing the order of certain lines of code in ways that had no impact on actual logic.
When it's long after midnight and you can't think of anything more to do other than obsessively eyeball the code, it's time to go to bed.
In every haunted house flick there's always the rational one who dismisses everything. The wife, the husband, the memory debugger. We tried valgrind, dmalloc, the Microsoft Visual Studio memory profiler (yes we tried running it in Windows via an elaborate shim just to use this), and glibc's built-in memory tracing.
"Leak? You don't have a leak! There's no such thing as leaks! Maybe you should see a doctor," said these tools.
I knew I wasn't crazy. Memory use doesn't increment by itself. Then I remembered reading something long ago about the presence of optimizations inside certain C++ STL structures like std::string that are designed to reduce memory copying and re-allocation when sub-strings are extracted or certain other operations are performed. I started to suspect that maybe our JSON library with all its slinging around of strings and other STL containers could be triggering some kind of weird edge case, or maybe even creating what amounted to a hidden circular reference due to strings reusing their memory and passing it around to their kinfolk.
At this point I was acting out another haunted house flick cliche: frantically digging up the basement floor in search of bones. Have you ever actually looked at C++ STL code? After giving up trying to fathom hyper-optimized C++ template origami, I tried to rule this possibility out by introducing code everywhere that forced strings to be re-created from plain C pointers and that stringified and then re-created JSON objects to ensure that they weren't holding onto any memory under the hood. This would use more CPU but if it made the leak go away it would validate this hypothesis.
Nothing worked. It's leaking but it's not. It's leaking but the debugger says no memory was lost. It's leaking in ways that are dependent on irrelevant changes to the ordering of mundane operations. This can't be happening.
I decided to background this task and let my subconscious work on it while I enjoyed the rest of my weekend. It worked. Sometime late Sunday night a novel thought arrived: "memory fragmentation."
To achieve high throughput under heavy load, the controller microservice creates a number of worker threads and passes off requests to them. That way things like ECC certificate signatures can happen without blocking the main loop. Years ago while reading some forgotten lore I had read of memory fragmentation and of how this demon of chaos can be summoned by complex programs and multiple threads. I'd been working under the assumption that the wizards of operating systems and language runtimes had long ago banished this beast, like many other ancient demons from the time of creation, to the pit. Modern memory allocators use thread-local pools and object size bins and stuff, right?
I searched a bit and read things that led me to believe that this might not entirely be the case. The default allocator in the standard C library is designed for an acceptable trade-off between memory use and performance under ordinary workloads, but it doesn't always perform ideally in aggressively multithreaded or very high throughput applications.
Luckily there are very highly regarded drop-in replacements like jemalloc. Trembling with anticipation that maybe... just maybe... I'd found the answer... I dropped in jemalloc and ran the test.
CPU usage dropped but otherwise this had no effect.
Then I tried something stupid that for some reason had not yet occurred to me: only create one worker thread. This also had no effect.
The jemalloc library has its own memory debugging features, so I decided to try those and see if they'd reveal something the other debuggers couldn't see. Like the others it stubbornly denied the existence of a leak, but I did notice something curious. This allocator like many other high performance allocators creates a series of memory pools of geometrically increasing size to rapidly service small allocation requests. C++ code that makes extensive use of containers should be creating a huge number of small objects, but instead I saw memory use creeping up in bins of larger size and in the un-binned "huge allocations" category. Some allocations were much larger than anything ZeroTier should need. That made no sense.
My thoughts once again returned to the C++ STL and its rumored under-the-hood memory optimizations. I dug up the basement some more, then grabbed a sledgehammer and took to the walls. It has to be here! It has to be here!
Nothing. Nothing but dirt and drywall and C++ templates.
Defeated, broken, exhausted, curled on the floor in a fetal position, and... wait... I hadn't looked behind operator new! I picked up my hammer, marched purposefully up to the only remaining intact wall, and started whacking away.
Most operators in C++, including its memory allocation and deletion operators, can be overloaded. Indeed this one was. In some C++ STL libraries the overloads for new and delete just hand the task off to malloc and free, but not this one. Behind the gaping hole I tore in the wall leered the hideous moldy corpse of another memory allocator. It had been there all along, probably since Victorian times, silently waiting, brooding, sealed behind the wall by a jealous maintainer...
Since "malloc is slow," libstdc++ "helpfully" adds its own memory allocator layer between you and the C library. This one implements its own caching and pooling, and searching around the web yields many examples of people complaining about it.
It turns out that there is a somewhat convoluted way to disable it globally: set the environment variable "GLIBCPP_FORCE_NEW". After doing this, CPU use increased slightly but memory use stabilized. Recalling jemalloc I now once again tried sticking it under the controller in place of glibc's malloc and both CPU load and memory use dropped to substantially less than either stock configuration. More importantly everything became stable once again.
I don't know what we did in our most recent changes to anger the spirit of this forgotten allocator, but we'd given it a proper burial and our home was once again at peace.
We now have code in production (in ZeroTier Central) that force disables libstdc++ allocation pools via the above environment variable and ensures that jemalloc is preloaded. This lets us use stock binaries while avoiding this problem. We're considering trying to find a way to do this for the precompiled versions we ship, or maybe building clang's libc++ and statically linking that instead.
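For the curious, here is a minimal sketch of what such a launcher shim could look like. It is not our production code. The `controller_env` and `launch` helpers are hypothetical names, the jemalloc path is whatever your distribution uses, and we set both spellings of the variable that come up in this post since an unrecognized one is simply ignored.

```python
import os

def controller_env(base_env, jemalloc_path):
    """Return a copy of base_env with the allocator workaround applied."""
    env = dict(base_env)
    # The post quotes GLIBCPP_FORCE_NEW and also mentions GLIBCXX_FORCE_NEW;
    # setting both is harmless if one of them is ignored.
    env["GLIBCPP_FORCE_NEW"] = "1"
    env["GLIBCXX_FORCE_NEW"] = "1"
    # Put jemalloc first so the dynamic loader resolves malloc/free to it.
    existing = env.get("LD_PRELOAD", "")
    env["LD_PRELOAD"] = jemalloc_path + (":" + existing if existing else "")
    return env

def launch(argv, jemalloc_path):
    """Replace the current process with argv, run under the patched environment."""
    os.execvpe(argv[0], argv, controller_env(os.environ, jemalloc_path))
```

Calling `launch(["some-service"], "/path/to/libjemalloc.so")` from a wrapper script would start an unmodified stock binary with both settings in place, which is the whole point of doing it through the environment.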
We use RethinkDB and it's also written in C++, so I decided to take a look and see if setting the same environment variable and preloading jemalloc might improve our database performance. Lo and behold, but it would appear that the binary RPMs for RethinkDB already do this and link against jemalloc. Looks like they discovered this problem too.
From what I can find on the web, issues with GNU libstdc++ allocation pools have been discussed for a while, but as far as we can tell the issue persists in the very latest versions. We can duplicate this readily on Debian "stretch," which is pretty much bleeding edge. This is pretty unforgivable. How many other C++ developers on Linux are banging their heads on the table right now as they search in futility for memory leaks that do not exist?
The right answer to "malloc is slow" is to make it faster. This way regular C programs and programs written in other languages can also benefit. Adding wheels to the wheel is sometimes forgivable when dealing with closed systems that you can't fix, but libstdc++ and glibc are both open source GNU projects. The jemalloc allocator works very well, so why not ship that or something very much like it?
If you're looking to duplicate this issue pull the latest ZeroTierOne repo, set up a simple network controller, and then bang heavily on the JSON API with POST requests using siege or another web stress testing tool. Just don't do it in production or your phone might tremble in the night.
The plot thickens. I've received several messages from people claiming the likely problematic C++ allocator, known as mt_alloc, hasn't been the default for a very long time and isn't in CentOS 7. To investigate I tried doing a string search of all binaries in /lib, /usr/lib, /lib64, etc., for "GLIBCPP_FORCE_NEW" and "GLIBCXX_FORCE_NEW" and variations thereof and... came up empty.
Yet setting this environment variable makes the problem go away. I repeated the test and confirmed. Then I tried stupid things like setting "GLIBCXX_FARCE_NOO" and no, the problem remains.
I tried to create a simpler C++ program that used the same JSON library and did similar kinds of JSON schlepping stuff to see if I could create a test case and was unable to do so. Compiled with the same compiler, same options, etc.
The next step is to use a fully instrumented debug build and trace and determine who or what is looking at that environment variable. I still think the problem is somewhere in the C++ stack, but why it's there is mysterious. Our software doesn't include very many things and is low-dependency in general. We're not running some kind of crazy turtles all the way down stack.
Will update when time permits, but unfortunately we're too busy with other things (and we have a work-around) to deeply investigate this issue right now. Perhaps some mysteries are meant only for distribution and core library developers and should not be pondered by mere mortals.
This post got way way way more hits than any of us thought it would. Programmers are craftspeople and every craftsperson loves a good from-the-trenches story and to praise and/or complain about their tools. On an amusing side note, we are apparently "Hipster-Bullshitbingo-Startup-Klitsche" according to a German language web site. LOL.
It's been a while since we published any performance numbers, so today we decided to benchmark the pre-release of ZeroTier 1.2.4 against IPSec and OpenVPN.
Our benchmark setup consisted of two single-core Linux (CentOS 7) virtual machines running on VMware Workstation on the same Core i7 at 2.8 GHz. Benchmarking on the same physical host means that we're only measuring the CPU-constrained impact of each tested virtual network stack. Since there is no actual physical network there are no other factors. By assigning each virtual machine a single core we ensure that they do not compete with one another. (The host CPU has four physical cores.)
Testing was performed using iperf3 in TCP mode transferring a gigabyte of random data. Random payload prevents data compression from impacting transfer speed, though the sender's attempt at compression (if enabled) still contributes to CPU overhead.
| Software | Encryption / Compression | Speed |
|---|---|---|
| Nothing (VMware bridge) | -- | 4760 Mbps |
| IPSec / Linux 3.10.0 / libreswan 3.15 | AES-128-CBC / None | 497 Mbps |
| ZeroTier 1.2.3 (pre-1.2.4) | Salsa20 / LZ4 (default) | 484 Mbps |
| OpenVPN 2.4.1 | AES-256-CBC / None | 309 Mbps |
| OpenVPN 2.4.1 | AES-256-CBC / LZO | 290 Mbps |
| OpenVPN 2.4.1 | Blowfish-CBC / None | 234 Mbps |
| OpenVPN 2.4.1 | Blowfish-CBC / LZO | 221 Mbps |
We didn't expect to beat OpenVPN by such a margin, and we expected IPSec to be at least 10% faster. IPSec's main encapsulation path lives in the kernel, avoiding two kernel/user mode context switches and at least two rounds of memory copying. It also makes use of CPU AES-NI instructions for encryption. Despite these factors ZeroTier clocked nearly identical transfer speeds. We repeated the test several times and with slightly different iperf3 modes and flags and got the same or similar results.
These results tell us ZeroTier's encryption and encapsulation path must be faster than IPSec by enough of a margin to compensate for the cost of kernel/user mode context switching and additional memory copying. Either that or the two are equivalent and we're over-estimating kernel/user mode costs. IPSec turns out to be a little under 3% faster, so maybe that's the overhead of not living in the kernel.
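The percentages here come straight from the table. A quick way to check the arithmetic, using only the measured figures above (the `percent_faster` helper is just for illustration):

```python
# Throughput figures in megabits per second, copied from the benchmark table.
results = {"bridge": 4760, "ipsec": 497, "zerotier": 484, "openvpn_best": 309}

def percent_faster(a, b):
    """How much faster a is than b, expressed as a percentage of b."""
    return (a - b) / b * 100.0

ipsec_vs_zt = percent_faster(results["ipsec"], results["zerotier"])
zt_vs_openvpn = percent_faster(results["zerotier"], results["openvpn_best"])

print(f"IPSec over ZeroTier: {ipsec_vs_zt:.1f}%")               # 2.7%
print(f"ZeroTier over fastest OpenVPN: {zt_vs_openvpn:.1f}%")  # 56.6%
```

So "a little under 3%" is 13 Mbps out of 484, and the gap to the fastest OpenVPN configuration is more than half again.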
This also means ZeroTier would likely beat IPSec by 5-15% if we ported it to the kernel. We have no plans to do so in the immediate future, but if our users start demanding higher performance we have at least one path forward.
Needless to say we are very happy with these numbers! Our performance is almost identical to IPSec, which is the standard for "enterprise" network tunnels.
ZeroTier would like to welcome Travis LaDuke to our team!
He's joining us to specialize in front-end web UI development, and will be helping to vastly improve ZeroTier Central and add new UI features for new products. He has a diverse background that includes programming control systems and managing network infrastructure for the entertainment industry. To ZeroTier he brings a very relevant combination of skills: front-end UI development expertise combined with extensive enterprise networking experience.
Travis brings the size of our engineering team to four. We're still quite small and we plan to stay that way as long as we can. Nine women can't make a baby in one month. Our small and agile team is an asset. We're proud of what we have achieved and the efficiency with which we've achieved it.
The next major release of ZeroTier's network virtualization engine (1.2.0) is a huge milestone. In addition to other improvements, our virtual networks will be getting a lot smarter. It will now be possible to set fine-grained rules and permissions and implement security monitoring at the network level with all of this being managed via the network controller and enforced cooperatively by all network participants. This brings us to near feature parity with in-data-center SDN systems and virtual private cloud backplanes like Amazon VPC.
This post describes the basic design of the ZeroTier rules engine by way of the reasoning process that led to it. As of the time of this writing (late August, 2016), a working but not quite production ready implementation is taking shape in the "dev" branch of our GitHub repository. ETA for this release is mid to late September, but if you are brave you are welcome to pull "dev" and take a look. Start with "controller/README.md". Note that there have been other changes to the controller too, so don't try to drop this into a production deployment!
In designing our rules engine we took inspiration from OpenFlow, Amazon VPC, and many other sources, but in the end we decided to do something a little bit different. Our mission here at ZeroTier is to "directly connect the world's devices" by in effect placing them all in the same cloud. The requirements implied by this mission rule out (pun intended?) many of the approaches used by conventional LAN-oriented SDN switches and endpoint firewall management solutions.
ZeroTier is designed to run on small devices. That means we can't push big rules tables. The size of the rule set pushed to each device has to be kept under control. Meanwhile the latency and unreliability of the global Internet vs on-premise networks excludes any approach that requires constant contact between endpoints and network controllers. This means we can't use the OpenFlow approach of querying the controller when an unrecognized "flow" is encountered. That would be slow and unreliable.
At the same time we wanted to ship a truly flexible rules engine capable of handling the complex security, monitoring, and micro-segmentation needs of large distributed organizations.
We've wanted to add these capabilities for a long time. The delay has come from the difficulty of designing a system that delivers on all our objectives.
To solve hard problems it helps to first take a step back and think about them conceptually and in terms of first principles. We've had many discussions with users about rules and micro-segmentation, and have also spent a good amount of time perusing the literature and checking out what other systems can do. Here's a rough summary of what we came up with:
Those are the high level goals that informed our design. Here's what we did in response to them.
Once fine-grained permissions, per-device rules, and device group rules are conceptually separated from the definition of global network behavior it becomes practical to limit the global rules table size to something modest enough to accommodate small devices. While there might occasionally be use cases that require more, we think something on the order of a few hundred rules that apply globally to an entire network is probably enough to address most sane requirements. This is enough space to describe in great detail exactly what traffic a network will carry and to implement complex security monitoring patterns.
So that's what we did. Keep in mind that at this stage we are intentionally ignoring the need for fine-grained per-device stuff. We'll pull out some bigger guns to deal with that later.
Now on to security monitoring. If the goal is near-omniscience, there is no substitute for placing a man in the middle and just proxying everything. Unfortunately that's a scalability problem even inside busy data centers, let alone across wide area networks. But it's a case we wanted to at least support for those who want it and are willing to take the performance hit.
To support this we added a REDIRECT action to our rules engine. Our redirect operates at the ZeroTier VL1 (virtual layer 1) layer, which is actually under the VL2 virtual Ethernet layer. That means you can send all traffic matching a given set of criteria to a specific device without in any way altering its Ethernet or IP address headers. That device can then silently observe this traffic and send it along to its destination. The fact that this can be done only for certain traffic means the hit need only be taken when desired. Traffic that does not match redirection rules can still flow directly.
Now what about a lower overhead option? For that we took some inspiration from Linux's iptables and its massive suite of capabilities. Among these are its --tee option, which allows packet cloning to remote observers.
We therefore added our own TEE, and like REDIRECT it has the advantage of operating at VL1. With our packet cloning action every packet matching a set of criteria, or even the first N bytes of every such packet, can be sent to an observer. Criteria include TCP options. This lets network administrators do efficient and scalable things like clone every TCP SYN and FIN to an observer to watch every TCP connection on the network without having to handle connection payload. This allows a lot of network insight with very minimal overhead. A slightly higher overhead option would involve sending, say, the first 64 bytes of every packet to an observer. That would allow the observation of all Ethernet and IP header information with less of a performance hit than full proxying.
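To make the SYN/FIN idea concrete, here is a toy sketch of the decision a TEE rule of that kind boils down to. It is not the engine's code: `tee_copy` is a hypothetical helper, and we assume the caller has already parsed the TCP flags byte (0 for non-TCP traffic). The flag values themselves are the standard TCP ones.

```python
TCP_FIN = 0x01  # standard TCP flag bits
TCP_SYN = 0x02

def tee_copy(frame: bytes, tcp_flags: int, max_bytes: int = 0):
    """Return the bytes to clone to the observer, or None if nothing is cloned.

    max_bytes == 0 means "send the whole frame"; a positive value keeps only
    the first max_bytes bytes, e.g. 64 to capture headers without payload.
    """
    if not (tcp_flags & (TCP_SYN | TCP_FIN)):
        return None  # mid-connection traffic is not of interest here
    return frame[:max_bytes] if max_bytes > 0 else frame
```

The observer only ever sees connection setup and teardown, so its load scales with the number of connections rather than the number of bytes moved.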
But what about endpoint compromise?
Our security monitoring capabilities can never be quite as inescapable as a hardware tap on a physical network. That's because ZeroTier is a distributed system that relies upon endpoint devices to correctly follow and enforce their rules. If an endpoint device is compromised its ZeroTier service could be patched to bypass any network policy. But the fact that rules are evaluated and enforced on both sides of every interaction allows us to do the next best thing. By matching on the inbound/outbound flag in our rules engine and using other clever rule design patterns it's possible to detect cases where one side of an interaction stops abiding by our redirect and packet tee'ing policies. That means an attacker must now compromise both sides of a connection to avoid being observed, and if they've done that... well... you have bigger problems. (A detailed discussion of how to implement this will be forthcoming in future rules engine documentation.)
Global rules take care of global network behavior and security instrumentation, but what if we want to get nit-picky and start setting policies on a per-device or per-device-group basis?
Let's say we have a large company with many departments and we want to allow people to access ports 137, 139, and 445 (SMB/CIFS) only within their respective groups. There are numerous endpoint firewall managers that can do this at the local OS firewall level, but what if we want to embed these rules right into the network?
Powerful (and expensive) enterprise switches and SDN implementations can do this, but under the hood this usually involves the compilation and management of really big tables of rules. Every single switch port and/or IP address or other identifier must get its own specific rules to grant it the desired access, and on big networks a combinatorial explosion quickly ensues. Good UIs can hide this from administrators, but that doesn't fix the rules table bloat problem. In OpenFlow deployments that support transport-triggered (or "reactive") rule distribution to smart switches this isn't a big deal, but as we mentioned up top we can't do things that way because our network controllers might be on the other side of the world from an endpoint.
On the theory side of information security we found a concept that seems to capture the majority if not all of these cases: capability based security. From the article:
A capability (known in some systems as a key) is a communicable, unforgeable token of authority. It refers to a value that references an object along with an associated set of access rights. A user program on a capability-based operating system must use a capability to access an object.
Now let's do a bit of conceptual search and replace. Object (noun) becomes network behavior (verb), and access rights become the right to engage in that behavior on the network. We might then conclude by saying that a user device on a capability-based network must use a capability to engage in a given behavior.
It turns out there's been a little bit of work in this area sponsored by DARPA and others (PDF), but everything we could find still talked in terms of routers and firewalls and other middle-boxes that do not exist in the peer to peer ZeroTier paradigm. But ZeroTier does include a robust cryptosystem, and as we see in systems like Bitcoin cryptography can be a powerful tool to decentralize trust.
For this use case nothing anywhere near as heavy as a block chain is needed. All we need are digital signatures. ZeroTier network controllers already sign network configurations, so why can't they sign capabilities? By doing that it becomes possible to avoid the rules table bloat problem by only distributing capabilities to the devices to which they are assigned. These devices can then lazily push capabilities to each other on an as-needed basis, and the recipient of any capability can verify that it is valid by checking its signature.
But what is a capability in this context?
If we were achieving micro-segmentation with a giant rules table, a capability would be a set of rules. It turns out that can work here too. A ZeroTier network capability is a bundle of cryptographically signed rules that allow a given action and that can be presented ahead of relevant packets when that action is performed.
It works like this. When a sender evaluates its rules it first checks the network's global rules table. If there is a match, appropriate action(s) are taken and rule evaluation is complete. If there is no match the sender then evaluates the capabilities that it has been assigned by the controller. If one of these matches, the capability is (if necessary) pushed to the recipient ahead of the action being performed. When the recipient receives a capability it checks its signature and timestamp and if these are valid it caches it and associates it with the transmitting member. Upon receipt of a packet the recipient can then check the global rules table and, if there is no match, proceed to check the capabilities on record for the sender. If a valid pushed capability permits the action, the packet is accepted.
... or in plainer English: since capabilities are signed by the controller, devices on the network can use them to safely inform one another of what they are allowed to do.
All of this happens "under" virtual Ethernet and is therefore completely invisible to layer 2 and above.
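The flow above can be sketched in a few dozen lines. This is a toy model under loud assumptions: an HMAC with a shared demo key stands in for the controller's real public-key signature, a capability is reduced to a set of allowed ports, the `max_age` timestamp check is our own invention, and every name here (`Member`, `issue_capability`, `global_rules`) is hypothetical.

```python
import hashlib
import hmac

CONTROLLER_KEY = b"demo-controller-key"  # stand-in for a real signing key

def _sign(cap_id, ports, ts):
    payload = repr((cap_id, sorted(ports), ts)).encode()
    return hmac.new(CONTROLLER_KEY, payload, hashlib.sha256).digest()

def issue_capability(cap_id, ports, ts):
    """Controller side: bundle the rule data with a signature over it."""
    return {"id": cap_id, "ports": frozenset(ports), "ts": ts,
            "sig": _sign(cap_id, ports, ts)}

def capability_valid(cap, now, max_age):
    """Recipient side: recompute the signature and check the timestamp."""
    expected = _sign(cap["id"], cap["ports"], cap["ts"])
    return hmac.compare_digest(expected, cap["sig"]) and 0 <= now - cap["ts"] <= max_age

def global_rules(port):
    """Toy global table: accept HTTPS for everyone, no opinion otherwise."""
    return "accept" if port == 443 else None

class Member:
    def __init__(self, name, capabilities=()):
        self.name = name
        self.capabilities = list(capabilities)  # assigned by the controller
        self.peer_caps = {}                     # sender name -> verified capabilities

    def send(self, recipient, port, now):
        verdict = global_rules(port)
        if verdict is None:
            cap = next((c for c in self.capabilities if port in c["ports"]), None)
            if cap is None:
                return False  # nothing permits this, so the default is drop
            recipient.receive_capability(self.name, cap, now)  # push ahead of the packet
        elif verdict != "accept":
            return False
        return recipient.receive_packet(self.name, port)

    def receive_capability(self, sender, cap, now, max_age=3600):
        cached = self.peer_caps.setdefault(sender, [])
        if capability_valid(cap, now, max_age) and cap not in cached:
            cached.append(cap)

    def receive_packet(self, sender, port):
        verdict = global_rules(port)
        if verdict is not None:
            return verdict == "accept"
        return any(port in c["ports"] for c in self.peer_caps.get(sender, []))
```

A member holding a signed SMB capability can reach a peer on port 445, a member without one cannot, and a member that edits the port list of a capability it was given gets nowhere because the signature no longer matches.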
Capabilities alone still don't efficiently address the full scope of the "departments" use case above. If a company has dozens of departments we don't want to have to create dozens and dozens of nearly identical capabilities that do the same thing, and without some way of grouping endpoints a secondary scoping problem begins to arise. IP addresses could be used for this purpose but we wanted something more secure and easier to manage. Having to renumber IP networks every time something's permissions change is terribly annoying.
To solve this problem we introduced a third and final component to our rules engine system: tags. A tag is a tiny cryptographically signed numeric key/value pair that can (like capabilities) be replicated opportunistically. The value associated with each tag ID can be matched in either global or capability scope rules.
This lets us define a single capability called (for example) SAMBA/CIFS that permits communication on ports 137, 139, and 445 and then include a rule in that capability that makes it apply only if both sides' "department" tags match.
Think of network tags as being analogous to topic tags on a forum system or ACLs in a filesystem that supports fine-grained permissions. They're values that can be used inside rule sets to categorize endpoints.
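As a rough illustration of the SAMBA/CIFS example, the check such a capability performs might look like the sketch below. The tag ID, the dictionary representation of a member's tags, and the choice to refuse when either side lacks the tag are all assumptions made for the example.

```python
SMB_PORTS = {137, 139, 445}
DEPARTMENT = 1  # hypothetical tag ID for "department"

def smb_capability_allows(dest_port, sender_tags, receiver_tags):
    """True if the packet targets an SMB/CIFS port and both departments match."""
    if dest_port not in SMB_PORTS:
        return False
    a = sender_tags.get(DEPARTMENT)
    b = receiver_tags.get(DEPARTMENT)
    # Assumption: a member without the tag never matches.
    return a is not None and b is not None and a == b
```

One capability plus one tag per member covers every department, instead of one nearly identical capability per department.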
A rule in our system consists of a series of zero or more MATCH entries followed by one ACTION. Matches are ANDed together (evaluated until one does not match) and their sense can be inverted. An action with no preceding matches is always taken. The default action if nothing matches is ACTION_DROP.
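Those semantics are compact enough to sketch directly. The following is a simplified model, not the real implementation: packets are plain dictionaries, each match is an arbitrary predicate, and only the two terminal actions (accept and drop) are modeled, leaving out TEE and REDIRECT.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Match:
    test: Callable[[dict], bool]
    invert: bool = False  # "their sense can be inverted"

    def matches(self, packet: dict) -> bool:
        return self.test(packet) != self.invert

@dataclass
class Rule:
    action: str  # "accept" or "drop"
    matches: List[Match] = field(default_factory=list)

def evaluate(rules: List[Rule], packet: dict) -> str:
    for rule in rules:
        # all() stops at the first failing match, and all([]) is True,
        # so an action with no preceding matches is always taken.
        if all(m.matches(packet) for m in rule.matches):
            return rule.action
    return "drop"  # default action if nothing matches
```

For example, a table that drops anything that is not IPv4 and then accepts TCP to port 22 is two rules: one with a single inverted ethertype match, one with two ANDed matches.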
Here's a list of the actions and matches currently available:
| Action or match | Description |
|---|---|
| ACTION_DROP | Drop this packet (halts evaluation) |
| ACTION_ACCEPT | Accept this packet (halts evaluation) |
| ACTION_TEE | Send this packet to an observer and keep going (optionally only first N bytes) |
| ACTION_REDIRECT | Redirect this packet to another ZeroTier address (all headers preserved) |
| MATCH_SOURCE_ZEROTIER_ADDRESS | Originating VL1 address (40-bit ZT address) |
| MATCH_DEST_ZEROTIER_ADDRESS | Destination VL1 address (40-bit ZT address) |
| MATCH_ETHERTYPE | Ethernet frame type |
| MATCH_MAC_SOURCE | L2 MAC source address |
| MATCH_MAC_DEST | L2 MAC destination address |
| MATCH_IPV4_SOURCE | IPv4 source (with mask, does not match if not IPv4) |
| MATCH_IPV4_DEST | IPv4 destination (with mask, does not match if not IPv4) |
| MATCH_IPV6_SOURCE | IPv6 source (with mask, does not match if not IPv6) |
| MATCH_IPV6_DEST | IPv6 destination (with mask, does not match if not IPv6) |
| MATCH_IP_TOS | IP type of service field |
| MATCH_IP_PROTOCOL | IP protocol (e.g. UDP, TCP, SCTP) |
| MATCH_ICMP | ICMP type (V4 or V6) and optionally code |
| MATCH_IP_SOURCE_PORT_RANGE | Range of IPv4 or IPv6 ports (inclusive) |
| MATCH_IP_DEST_PORT_RANGE | Range of IPv4 or IPv6 ports (inclusive) |
| MATCH_CHARACTERISTICS | Bit field of packet characteristics that include TCP flags, whether this is inbound or outbound, etc. |
| MATCH_FRAME_SIZE_RANGE | Range of Ethernet frame sizes (inclusive) |
| MATCH_TAGS_DIFFERENCE | Difference between two tags is <= value (use 0 to test equality) |
| MATCH_TAGS_BITWISE_AND | Bitwise AND of tags equals value |
| MATCH_TAGS_BITWISE_OR | Bitwise OR of tags equals value |
| MATCH_TAGS_BITWISE_XOR | Bitwise XOR of tags equals value |
Detailed documentation is coming soon. Keep an eye on the "dev" branch.
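Until then, here is how we read the four tag matches from the list above, written out as plain functions. Treat this as an interpretation rather than a specification: in particular, taking the difference as an absolute value is our assumption, and the function names are made up for the example.

```python
def tags_difference(a: int, b: int, value: int) -> bool:
    """MATCH_TAGS_DIFFERENCE: the two tag values are within `value` of each other."""
    return abs(a - b) <= value  # assumption: the difference is unsigned

def tags_bitwise_and(a: int, b: int, value: int) -> bool:
    """MATCH_TAGS_BITWISE_AND: bitwise AND of the two tag values equals `value`."""
    return (a & b) == value

def tags_bitwise_or(a: int, b: int, value: int) -> bool:
    """MATCH_TAGS_BITWISE_OR: bitwise OR of the two tag values equals `value`."""
    return (a | b) == value

def tags_bitwise_xor(a: int, b: int, value: int) -> bool:
    """MATCH_TAGS_BITWISE_XOR: bitwise XOR of the two tag values equals `value`."""
    return (a ^ b) == value
```

With a value of 0, `tags_difference` is an equality test, and so is `tags_bitwise_xor`, since XOR is zero only when every bit agrees.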
The ZeroTier rules engine is stateless to control CPU overhead and memory consumption, and stateless firewalls have certain shortcomings. Most of these issues have work-arounds but sometimes these are not obvious. "Design patterns" for working around common issues will eventually be documented along with the rules engine, and we'll be building a rule editor UI into ZeroTier Central that will help as well.
We also have not addressed the problem of QoS and traffic priority. That presents additional challenges: because ZeroTier is a virtualization system that abstracts away the physical network, it is hard for it to know the physical topology. One option we're considering is to implement QoS field mirroring, allowing ZeroTier's transport packets to inherit QoS fields from the packets they are encapsulating. That would allow physical edge switches to prioritize traffic. We're still exploring in this domain and hope to have something in the future.
As of the time of this post our rules engine is still under active development. If you have additional thoughts or input please feel free to head over to our community forums and start a thread. Intelligent input is much appreciated since now would be the most convenient time to address any holes in the approach that we've outlined above.
As of this week it is possible to connect desktop and mobile apps to virtual networks with the ZeroTier SDK. With our SDK, applications can communicate peer to peer with other instances of themselves, other apps, and devices using standard network protocols and with only minimal changes to existing network code. (On some platforms no changes at all are required.) The SDK repository at GitHub contains documentation and example integrations for iOS, Android, and the Unity game engine for in-game peer to peer networking using ZeroTier.
The ZeroTier SDK is an evolution of what we formerly called Network Containers, and still supports the same Linux network stack interposition use case. It's still beta so do not expect perfection. We are innovating here so excuse the dust.
Most existing P2P apps either engineer their own special-purpose protocols from the ground up or use one or more P2P networking libraries, but in both cases P2P communication is done using a protocol stack and deployment that is peculiar to the app and can only easily interoperate with other instances of the same app. This extends the "WIMP model" of computing ("weakly interacting massive programs," a play on the hypothetical "weakly interacting massive particle" from physics) into network space yielding programs that cannot interoperate directly.
This makes it hard to build true ecosystems where many programs can combine to provide exponentially increasing value at higher levels. It also means that peer to peer networking is a "special snowflake" in your development process, requiring special code, special protocols, etc. that are wholly different from the ones your app uses to communicate with the cloud. This is one reason many apps simply skip peer to peer. Why build the app's networking features twice?
The ZeroTier SDK takes a different approach. It combines a lightweight TCP/IP stack (currently LWIP) with the ZeroTier network virtualization core to yield a "P2P library" that tries to be invisible. Our SDK allows apps to connect to each other using the same protocols (and usually the same code) they might use to connect to a cloud server or anything else on a TCP/IP network.
Since ZeroTier also runs on servers, desktops, laptops, mobile devices, containers, embedded/IoT "things," etc., an app using the ZeroTier SDK can also freely communicate peer to peer with all of these.
The ZeroTier SDK lives entirely in the app. No elevated permissions, kernel-mode code or drivers, or other special treatment by the operating system is needed. This means that P2P apps speaking standard interoperable native network protocols can be shipped in mobile app stores without special entitlements or other hassles.
To understand what the ZeroTier SDK enables it helps to imagine how it might be used.
Let's start by imagining an augmented reality game similar to Pokémon Go. The app is built with the ZeroTier SDK, and all instances of the app join a public virtual network and communicate directly to one another using HTTP and a RESTful API internal to the app. This allows instances of the app to exchange state information directly with lower latency (and at lower cost to the app's developer) than relaying it through cloud servers.
Now the maker of the game does something interesting: they document the game's peer to peer RESTful API and allow third party clients to obtain authentication tokens to communicate with running game instances.
Since the application peer to peer network runs ZeroTier, anything else can join it. This includes but is not limited to servers, desktops, laptops, IoT devices, and so on. Developers can now build scoreboards, web apps, secondary or "meta" games, team communication software, or virtually anything else they can imagine. Since these apps can communicate with actual instances of the game in real time, interoperation with the game ecosystem can be extremely fast and extremely rich. Since the game developer does not have to carry all this traffic over a proprietary cloud this introduces no additional cost burden.
Fast forward a year and there are IoT light bulbs that light up when players are near (with sub-50ms responsiveness), new PC games that extend the augmented reality experience provided by the mobile app into virtual reality worlds, and advanced players have written their own software to help their teams organize and cooperate together.
The ZeroTier SDK is in beta and we're still working to perfect integration on a variety of platforms. Right now we are looking for app and game developers who are interested in working with us. If you just want to take a look feel free to pull the code, but if your interest is more serious drop an e-mail to email@example.com and we'd be happy to work with you and help you out.
ZeroTier is growing, and to facilitate that growth we've just made a significant upgrade to our root server infrastructure.
While the ZeroTier network is peer to peer, it relies upon a set of designated core nodes whose role and responsibility is very close to that of the DNS root name servers, hence the similar terminology. For those who are curious about the technical background of this design, this old blog post by ZeroTier's original author explains the reasoning process that led to the system as it is today. In short: while devices encrypt end-to-end and communicate directly whenever they can, the root server infrastructure is critical to facilitating initial setup.
Since ZeroTier's launch its root server roles have been filled by four geographically distributed independent nodes. For some time now these have been located in San Francisco, New York, Paris, and Singapore. Each ZeroTier device announces its existence to all four of these and can choose to query whichever seems to offer the lowest latency. While this setup is simple, robust, and has served us well, it presented a scalability problem: as the network grows how do we add more root servers without requiring clients to make more announcements? It's always possible to just make those roots bigger, but scaling and robustness will eventually demand the ability to scale elastically and to add more locations.
Since our user base is quite global, we also wanted to cover more of the world. Those four locations are great for America and Europe, but they left much of Asia, Australia, Africa, the Middle East, India, and South America with 200ms+ ping times to the nearest root. That's not nearly as bad as it would be if ZeroTier were a completely centralized "back-haul all traffic to the cloud" protocol, but it still meant perceptibly slower connection setup and sign-on times for many users.
Early in 2015 we started to explore the possibility of placing our root servers behind IPv4 and IPv6 anycast addresses. ZeroTier is a UDP-based protocol like DNS and SIP, two other protocols that are frequently deployed this way. Global anycast would allow us to put the whole root infrastructure behind one or maybe two IPs and then add as much actual capacity as we want behind that facade. The advantages at first seemed obvious: a single IP would play well with the need to work behind NAT, and the global Internet infrastructure and BGP would (we thought) take care of geographic route optimization for us.
We went to the trouble of heavily researching the topic and even obtaining a provisional allotment of address space. But as we conversed with numerous cloud providers and ISPs, it quickly became apparent that actually deploying anycast and doing it well is painful and expensive. Hundreds of gigabits of cloudy goodness can today be had for a few hundred dollars per month, but with the complexities of anycast and BGP sessions the cost quickly balloons to thousands of dollars per month for significantly less actual capacity. Chief among the cost drivers (and elasticity-killers) is the requirement by nearly all ISPs that anycast hosts be dedicated servers. BGP is also a bit more finicky in practice than it seems in theory. Conversations with experienced BGP administrators convinced us that actually getting anycast announcement and high availability fail-over to work reliably and to deliver optimal routing across the world is deceptively hard: easy to prototype and get operational but hard to tweak and optimize and get truly robust.
BGP also introduces a security concern: if our entire root infrastructure is behind a single IP block with a single ASN, then a single BGP hijack or poisoning attack could take it down. "Legitimate" attacks become easier too: it's a lot easier for a censorship-happy regime to blacklist one IP block than to maintain a list of addresses scattered across blocks belonging to multiple cloud providers.
While we can see uses for anycast in the future for other potential services, a deep study of the topic made us start thinking about other options.
Then we remembered we were a software defined networking company.
Clustering (also known as multi-homing) has been on the ZeroTier feature queue for quite some time. Since ZeroTier's virtual addressing is defined entirely by cryptographic tokens, there is nothing that glues an address to a physical endpoint. That's why I can switch WiFi networks and be back on my virtual networks in less than a minute. It also means that nothing prevents a single ZeroTier "device" from being reachable via more than one IP address. IPs are just paths and they're completely ephemeral and interchangeable.
While the idea's been around for a while, implementation is tricky. How are peers to determine the best path when faced with a set of possibilities? How are new paths added and old ones removed?
The thing we were stuck on is the idea that the initiator of the link should be the one making these decisions. As we were mulling over potential alternatives to anycast, a simple thought came: why doesn't the recipient do that? It has more information.
The contacting peer knows the recipient's ZeroTier address and at least one IP endpoint. The recipient on the other hand can know all its endpoints as well as the endpoints of the contacting peer and the health and system load of all its cluster nodes. From that information it can decide which of its available endpoints the peer should be talking to, and can use already-existing protocol messages in the ZeroTier protocol (ones designed for NAT traversal and LAN location) to send the contacting peer there. Several metrics could be used to determine the best endpoint for communication.
After a few days of coding an early prototype was running. Turns out it took only about 1500 lines of C++ to implement clustering in its entirety, including cluster state management and peer handoff. Right now this code is not included in normal clients; build 1.1.0 or newer with ZT_ENABLE_CLUSTER=1 to enable it. The current version uses a geo-IP database to determine where peers should be sent. This isn't flawless but in practice it's good enough, and we can improve it in the future using other metrics like direct latency measurements and network load.
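As a rough illustration of recipient-side selection, the sketch below picks the cluster member geographically closest to a contacting peer, given a location that a geo-IP lookup is assumed to have already produced. This is not the clustering code itself. The member names, coordinates, and function names are hypothetical, and the redirect message that would follow is not shown.

```python
import math

EARTH_RADIUS_KM = 6371.0


def great_circle_km(a, b):
    """Haversine distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def pick_member(peer_location, members):
    """Return the name of the cluster member nearest to the peer."""
    return min(members, key=lambda name: great_circle_km(peer_location, members[name]))


# Hypothetical cluster members with approximate city coordinates.
MEMBERS = {
    "frankfurt": (50.11, 8.68),
    "new-york": (40.71, -74.01),
    "tokyo": (35.68, 139.69),
}

if __name__ == "__main__":
    london = (51.51, -0.13)  # what a geo-IP lookup might return for a peer
    print(pick_member(london, MEMBERS))
```

Other metrics mentioned above, such as measured latency or member load, could replace or be blended into the key function without changing the overall shape.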
Clustering allows us to create a global elastically-scalable root server infrastructure with all the same characteristics that we initially sought out through anycast, but without BGP-related security bottlenecks or management overhead and using cheap commodity cloud services. Right now the clustering code has only been heavily tested for this specific deployment scenario, but in the near future we plan to introduce it as something you can use as well. The same clustering code that now powers Alice and Bob could be used to create geographically diverse high-availability clustered services on virtual networks. Any form of clustering is challenging to use with TCP, but UDP based protocols should be a breeze as long as they can be backed by distributed databases. We'll also be introducing clustering for network controllers, making them even more scalable and robust (though they can already be made very stable with simple fail-over).
Using our new clustering code we created two new root "servers": Alice and Bob. The names Alice and Bob seemed a natural fit since these two names are used as examples in virtually every text on cryptography.
If clustering lets us spread each of these new root servers out across as many physical endpoints as we want, then why two? Answer: even greater redundancy. The old design had four completely independent shared-nothing roots. That means that a problem on one would be very unlikely to affect the others, and all four would have to go down for the net to experience problems. Introducing clustering means introducing shared state; cluster members are no longer shared-nothing or truly independent. To preserve the same level of true systematic redundancy it's important that there always be more than one. That way we can do things like upgrade one, wait, then upgrade the other once we've confirmed that nothing is wrong. If one experiences serious problems clients will switch to the other and the network will continue to operate normally.
A goal in our new infrastructure was to offer sub-100ms round trip latency to almost everyone on Earth. Alice and Bob are both (as of the time of this writing) six node clusters spread out across at least three continents.
|Amsterdam / Netherlands||Dallas / USA|
|Johannesburg / South Africa||Frankfurt / Germany|
|New York / USA||Paris / France|
|São Paulo / Brazil||Sydney / Australia|
|San Francisco / USA||Tokyo / Japan|
|Singapore||Toronto / Canada|
We've done some latency measurements, and the locations above bring us pretty close. There's a gap in the Middle East and perhaps Northern India and China where latencies are likely to be higher, but they're still going to be lower now than they were before. If we see more users in those areas we'll try to find a good hosting provider to add a presence there.
Alice and Bob are alive now. They took over two of the previous root servers' identities, allowing existing clients to use them with no updates or configuration changes. While clients back to 1.0.1 will work, we strongly recommend upgrading to 1.1.0 for better performance. Upgrading will also give you full dual-stack IPv4 and IPv6 capability, which the new root servers share at every one of their locations.
Before taking the infrastructure live we tested it with 50,000 Docker containers on 200 hosts at four Amazon EC2 availability zones making rapid-fire HTTP requests to and from one another. The results were quite solid, and showed that the existing infrastructure should be able to handle up to 10-15 million devices without significant upgrades. Further projections show that the existing cluster architecture and code could theoretically handle hundreds of millions of devices using larger member nodes. Going much beyond that is likely to require something a bit more elaborate, such as a shift from simple member-polling to something like the Raft consensus algorithm (or an underlying database that uses it). But if we have hundreds of millions of devices, that's something we'll have more than enough resources to tackle.
Necessity really is the mother of invention. If we were a giant company we'd have just gone the anycast route, since it seemed initially like the "right solution." But being a comparatively poorer startup, we balked at the cost and the management overhead. Instead we set ourselves to thinking about whether we could replace the big sexy "enterprise" way of anycast with something more clever that ran on top of the same commodity cloud and VPS services we'd used so successfully in the past. In the end that led to a system that is smarter, more scalable, faster, more robust, more secure, and at least an order of magnitude cheaper. We also got a new technology we can soon make available to end users to allow them to create multi-homed geo-clustered services on virtual networks. (More on that soon!)
That's one of the reasons startups are key to technology innovation. Big players will do things the established way because they can afford it, both in money and in personnel. Smaller players are required to substitute brains for brawn.
TL;DR: If you're going to put the network in user space, then put the network in user space.
For the past six months we've been heads-down at ZeroTier, completely buried in code. We've been working on several things: Android and iOS versions of the ZeroTier One network endpoint service (Android is out, iOS coming soon), a new web UI that is now live for ZeroTier hosted networks and will soon be available for on-site enterprise use as well, and a piece of somewhat more radical technology we call Network Containers.
We've been at Hashiconf in Portland this week. Network Containers isn't quite ready for a true release yet, but all the talk of multi-everything agile deployment around here motivated us to put together an announcement and a preview so users can get a taste of what's in store.
We've watched the Docker networking ecosystem evolve for the past two or more years. There are many ways to connect containers, but as near as we can tell all of them can be divided into two groups: user-space overlays that use tun/tap or pcap to create or emulate a virtual network port, and kernel-mode solutions like VXLAN and Open vSwitch that must be configured on the Docker host itself. The former are flexible and can live inside the container, but they still often require elevated privileges and suffer from performance problems. The latter are faster but far less convenient to deploy, requiring special configuration of the container host and root access.
It's been possible to use ZeroTier One in a Docker container since it was released, but only by launching with options like "--device=/dev/net/tun --cap-add=NET_ADMIN". That gives it many of the same down-sides as other user-mode network overlays. We wanted to do something new, something specifically designed not only for how containers are used today but for how they'll probably be used in the future.
A popular phrase among container-happy devops folks today is "cattle, not pets." If containers are the "cattle" approach to infrastructure then container hosts should be like generic cattle pens, not doggie beds with names embroidered on them. They should be pieces of metal that host "stuff" with no special application specific configuration at all.
All kernel-mode networking solutions require kernel-level configuration. This must be performed on the host as 'root', and can't (easily) be shipped out with containers. It also means if a host is connected to networks X and Y it can't host containers that need networks A and Z, introducing additional constraints for resource allocation that promote fragmentation and bin-packing problems.
We wanted our container networking solution to be contained in the container. That means no kernel, no drivers, no root, and no host configuration requirements.
User-space network virtualization and VPN software usually presents itself to the system through a virtual network port (tun/tap), or by using libpcap to effectively emulate one by capturing and injecting packets on an existing real or dummy network device. The former is the approach used by ZeroTier One and by most VPN software, while the latter is used (last we checked) by Weave and perhaps a few others. The pcap "hack" has the advantage of eliminating the need for special container launch arguments and elevated permissions, but otherwise suffers from the same drawbacks as tun/tap.
User-mode network overlays that still rely on the kernel to perform TCP/IP encapsulation and other core network functions require your data to make an epic journey, passing through the kernel's rather large and complex network stack twice. We call this the double-trip problem.
First, data exits the application by way of the socket API and enters the kernel's TCP/IP stack. Then after being encapsulated there it's sent to the tun/tap port or captured via pcap. Next, it enters the network virtualization service where it is further processed, encapsulated, encrypted, etc. Then the overlay-encapsulated or VPN traffic (usually UDP) must enter the kernel again, where it once again must traverse iptables, possible NAT mapping, and other filters and queues. Finally it exits the kernel by way of the network card driver and goes over the wire. This imposes two additional kernel/user mode context switches as well as several memory copy, handoff, and queueing operations.
The double-trip problem makes user-mode network overlays inherently slower than solutions that live in the kernel. But kernel-mode solutions are inflexible. They require access to the metal and root privileges, two things that aren't convenient in any world and aren't practical at all in the coming world of multi-tenant container hosting.
We think user-mode overlays that use tun/tap or pcap occupy a kind of "uncanny valley" between kernel and user mode: by relying on a kernel-mode virtual port they inherit some of the kernel's inflexibility and limitation, but lose its performance. That's okay for VPNs and end-user access to virtual networks, but for high performance enterprise container use we wanted something better. Network Containers is an attempt to escape this uncanny valley not by going back to the kernel but by moving the other direction and going all-in on user-mode. We've taken our core ZeroTier virtual network endpoint and coupled it directly to a lightweight user-mode TCP/IP stack.
This alternative network path is presented to applications via a special dynamic library that intercepts calls to the Linux socket API. This is the same strategy used by proxy wrappers like socksify and tsocks and requires no changes to applications or recompilation. It's also used by high-performance kernel-bypassing bare metal network stacks that are deployed in areas with minimum latency requirements like high frequency trading and industrial process control. It's difficult to get right but so far we've tested Apache, NodeJS, Java, Go binaries, sshd, proftpd, nginx, and numerous other applications with considerable success.
You might be thinking about edge cases, and so are we. Socket APIs are crufty and in some cases poorly specified. It's likely that even a well-tested intercept library will clash with someone's network I/O code somewhere. The good news is that containers come to the rescue here by making it possible to test a specific configuration and then ship with confidence. Edge case issues are much less likely in a well-tested single-purpose microservice container running a fixed snapshot of software than in a heterogeneous constantly-shifting environment.
We believe this approach could combine the convenience of in-container user-mode networking with the performance of kernel-based solutions. In addition to eliminating quite a bit of context switch, system call, and memory copy overhead, a private TCP/IP stack per container has the potential to offer throughput advantages on many-core host servers. Since each container has its own stack, a host running sixteen containers effectively has sixteen completely independent TCP threads. Other advantages include the potential to handle huge numbers of TCP connections per container by liberating running applications from kernel-related TCP scaling constraints. With shared memory IPC we believe many millions of TCP connections per service are feasible. Indeed, bare metal user-mode network stacks have demonstrated this in other use cases.
Here's a comparison of the path data takes in the Network Containers world versus conventional tun/tap or pcap based network overlays. The application sees the virtual network, while the kernel sees only encapsulated packets.
Network Containers is still under heavy development. We have a lot of polish, stability testing, and performance tuning to do before posting an alpha release for people to actually try with their own deployments. But to give you a taste, we've created a Docker container image that contains a pre-built and pre-configured instance. You can spin it up on any Docker host that allows containers to access the Internet and test it from any device in the world with ZeroTier One installed.
Don't expect it to work perfectly, and don't expect high performance. While we believe Network Containers could approach or even equal the performance of kernel-mode solutions like VXLAN+IPSec (but without the hassle), so far development has focused on stability and supporting a wide range of application software and we haven't done much of any performance tuning. This build is also a debug build with a lot of expensive tracing enabled.
Here are the steps if you want to give it a try:
Step 1: If you don't have it, download ZeroTier One and install it on whatever device you want to use to access the test container. This could be your laptop, a scratch VM, etc.
Step 2: Join 8056c2e21c000001 (Earth), an open public network that we often use for testing. (If you don't want to stay there don't worry. Leaving a network is as easy as joining one. Just leave Earth when you're done.) The Network Containers demo is pre-configured to join Earth at container start.
Step 3: Run the demo!
docker run zerotier/netcon-preview
The container will output something like this:
    ***
    *** ZeroTier Network Containers Preview
    *** https://www.zerotier.com/
    ***
    *** Starting ZeroTier network container host...
    *** Waiting for initial identity generation...
    *** Waiting for network config...
    *** Starting Apache...
    ***
    *** Up and running at 28.##.##.## -- join network 8056c2e21c000001 and try:
    *** > ping 28.##.##.##
    *** > curl http://28.##.##.##/
    ***
    *** Be (a little) patient. It'll probably take 1-2 minutes to be reachable.
    ***
    *** Follow https://www.zerotier.com/blog for news and release announcements!
    ***
While you're waiting for the container to start and to print out its Earth IP address, try pinging earth.zerotier.net (18.104.22.168) from the host running ZeroTier One to test your connectivity. Joining a network usually takes less than 30 seconds, but might take longer if you're behind a highly restrictive firewall or on a slow Internet connection. If the ping to earth.zerotier.net succeeds, you're online.
Once it's up and running try pinging it and fetching the web page it hosts. In most cases it'll be online in under 30 seconds, but may take a bit longer.
We're planning to ship an alpha version of Network Containers that you can package and deploy yourself in the next few months. We're also planning an integration with Docker's libnetwork API, which will allow it to be launched without modifying the container image. In the end it will be possible to use Network Containers in two different ways: by embedding it into the container image itself so that no special launch options are needed, or by using it as a libnetwork plugin to network-containerize unmodified Docker images.
Docker's security model isn't quite ready for multi-tenancy but it's coming, and when it does we'll see large-scale bare metal multi-tenant container hosts that will offer compute as a pure commodity. You'll be able to run containers anywhere on any provider with a single command and manage them at scale using solutions like Hashicorp's Terraform, Atlas, and Nomad. The world will become one data center, and we're working to provide a simple plug-and-play VLAN solution at global scale.
Hat tip to Joseph Henry, who has been lead developer on this particular project. A huge number of commits from him will be merged shortly!
If you're looking at networks on the control panel, you might have noticed a new feature: below IPv4 address management configuration there is now an IPv6 option.
IPv6 has always worked over ZeroTier, both link-local and any other addressing schemes assigned to ZeroTier devices. But so far the ZeroTier network configuration UI hasn't contained any IPv6-related options for address management. Now we have one: ZeroTier-Mapped RFC4193 addressing.
The IPv6 address space is large: 128 bits per IP. ZeroTier network IDs are 64 bits, and device addresses are 40. 64 + 40 is 104 bits, which is less than 128. This allows us to use network IDs and ZeroTier device IDs to create static globally unique private IPv6 addresses. If you enable this option, within a few minutes the devices on your networks will be assigned IPv6 addresses like:
fd80:56c2:e21c:0000:0199:9389:e92c:eee5
We've enabled this addressing on Earth, our test public network. Earth's ZeroTier network ID is 8056c2e21c000001, and if you look closely you'll see it there inside the IP after the IPv6 private prefix of 0xfd. After that is 0x99 and 0x93, two arbitrary bytes of padding, and then there's a device ID of 89e92ceee5.
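The layout just described (one prefix byte of 0xfd, the 8-byte network ID, the bytes 0x99 and 0x93, then the 5-byte device ID) adds up to exactly 16 bytes, so the address can be built by simple concatenation. The Python sketch below does that with the standard `ipaddress` module. The function name is hypothetical and the byte order is taken from the description above rather than from ZeroTier's source.

```python
import ipaddress


def zt_rfc4193_address(network_id_hex, device_id_hex):
    """Build the address from a 64-bit network ID and a 40-bit device ID (hex strings)."""
    network_id = bytes.fromhex(network_id_hex)
    device_id = bytes.fromhex(device_id_hex)
    if len(network_id) != 8 or len(device_id) != 5:
        raise ValueError("expected a 16-digit network ID and a 10-digit device ID")
    # 1 prefix byte + 8 network ID bytes + 2 padding bytes + 5 device ID bytes = 16
    raw = b"\xfd" + network_id + b"\x99\x93" + device_id
    return ipaddress.IPv6Address(raw)


if __name__ == "__main__":
    addr = zt_rfc4193_address("8056c2e21c000001", "89e92ceee5")
    print(addr.exploded)  # fd80:56c2:e21c:0000:0199:9389:e92c:eee5
```

Printing the exploded form keeps every hex digit visible, which makes it easy to spot the network ID and device ID inside the address by eye.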
This provides a very nice, semantically meaningful scheme for static IPv6 addressing that guarantees unique addressing across all networks. It also opens the doorway to a mode of operation that could be very good for mobile and Internet-of-things applications.
For very low power operation as well as for very large networks, it would be beneficial to do away with multicast and broadcast. While multicast is useful on normal LANs, in these areas of application it imposes additional power consumption, memory, and bandwidth requirements that aren't really needed. By using an IPv6 scheme that embeds both network and device ID semantically into the address, it becomes possible to emulate IPv6 NDP (IPv6's equivalent of ARP) and instantly resolve IPv6 addresses to MAC addresses without multicast queries.
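The first half of such an emulated lookup is purely local: given a target address, check that it follows the ZeroTier-mapped layout and pull the network ID and device ID back out. Here is a small Python sketch of that step, assuming the byte layout described earlier in this post. The function name is hypothetical, and the final step of turning a device ID into a MAC address is deliberately left out because this post does not describe it.

```python
import ipaddress


def split_zt_rfc4193(address):
    """Return (network_id_hex, device_id_hex), or None if the layout does not match."""
    raw = ipaddress.IPv6Address(address).packed
    # Byte 0 is the 0xfd prefix; bytes 9 and 10 are the fixed 0x99 0x93 padding.
    if raw[0] != 0xFD or raw[9:11] != b"\x99\x93":
        return None
    return raw[1:9].hex(), raw[11:16].hex()


if __name__ == "__main__":
    print(split_zt_rfc4193("fd80:56c2:e21c:0000:0199:9389:e92c:eee5"))
    print(split_zt_rfc4193("fe80::1"))  # link-local, not a mapped address
```

Since the answer comes straight from the bits of the address, no query has to leave the device, which is where the power and bandwidth savings would come from.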
So if you've wanted to try IPv6 addressing, give it a shot. It won't interfere with IPv4.
Edit: It appears that adding IPv6 addressing does cause issues with the Android client. Since Android VPN endpoints are limited to a single IP, we will likely force Android to ignore IPv6 for now unless we can find a better solution. So for now if you are using Android devices on your network, you should probably wait to try this addressing mode. Fixed!
For those who don't know, NAT stands for Network Address Translation. If you're on a typical network, your system probably has an IP address like 10.1.2.3 or 192.168.0.66. These are private IPs. They are not your real Internet IP address. Between you and the Internet there is a device called a "NAT router" that performs intelligent address translation back and forth.
NAT was invented because IPv4, the IP scheme that still runs most Internet sites, has an address space that is too small to allow all devices to have "real" addresses. Its successor, known as IPv6, does not have this limitation but is still fairly early in its adoption curve. Migrating a system as huge as the Internet to a new protocol version takes a very long time.
ZeroTier One runs over a peer to peer network, which means that allowing devices to communicate directly is central to how it operates (at scale and with acceptable performance). Since most users are behind NAT devices, people often wonder how exactly peer to peer connectivity is established.
In reading the Internet chatter on this subject I've been shocked by how many people don't really understand this, hence the reason this post was written. Lots of people think NAT is a show-stopper for peer to peer communication, but it isn't. More than 90% of NATs can be traversed, with most being traversable in reliable and deterministic ways.
At the end of the day anywhere from 4% (our numbers) to 8% (an older number from Google) of all traffic over a peer to peer network must be relayed to provide reliable service. Providing relaying for that small a number is fairly inexpensive, making reliable and scalable P2P networking that always works quite achievable.
The most common and effective technique for NAT traversal is known as UDP hole punching.
UDP stands for User Datagram Protocol. It's sort of TCP's smaller and simpler cousin, a protocol that allows a piece of software to send a single discrete packet from its own address to another IP and port.
A number of Internet protocols use UDP, such as DNS, many games, and media streaming protocols. For these to be usable behind NAT, NAT routers must implement a concept of "UDP connections." They do this by watching for outgoing UDP packets and, when one is seen, creating a mapping that says "private IP:port UDP <-> public IP:port UDP." Any further packets leaving the private network from that address and port will be remapped in the same way, and replies from the external system contacted will be remapped in the opposite direction. This allows a two-way UDP conversation to be initiated by a device behind NAT.
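To make the mapping idea concrete, here's a minimal sketch of the kind of table a NAT router might keep. It's a toy model, not real router code: the class name `ConeNat`, its method names, and the starting port 40000 are all made up for illustration, and it ignores timeouts and filtering entirely.

```python
class ConeNat:
    """Toy model of a NAT router's UDP mapping table (illustrative only)."""

    def __init__(self, public_ip, first_port=40000):
        self.public_ip = public_ip
        self.next_port = first_port   # assumed: ports handed out one after another
        self.out = {}                 # (private_ip, private_port) -> public_port
        self.back = {}                # public_port -> (private_ip, private_port)

    def outbound(self, private_ip, private_port):
        """A packet leaves the private network; return its rewritten public source."""
        key = (private_ip, private_port)
        if key not in self.out:
            # First packet from this private endpoint: create the mapping.
            self.out[key] = self.next_port
            self.back[self.next_port] = key
            self.next_port += 1
        return (self.public_ip, self.out[key])

    def inbound(self, public_port):
        """A packet arrives from outside; return the private destination, or None."""
        return self.back.get(public_port)


nat = ConeNat("203.0.113.7")
src = nat.outbound("10.1.2.3", 5000)
print(src)                   # ('203.0.113.7', 40000)
print(nat.inbound(src[1]))   # ('10.1.2.3', 5000)
print(nat.inbound(41000))    # None
```

The last line is the important one: with no outbound packet there is no mapping, so unsolicited inbound traffic has nowhere to go.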
UDP hole punching exploits this concept of a "conversation." It works like this. (Slight variations on this procedure exist, but this is the basic idea.)
1. Alice and Bob are both behind NAT. They both know about a third party -- let's call him Ziggy -- that is not behind NAT. Alice and Bob periodically send UDP messages to Ziggy, who records their existence and their public (Internet-side of NAT) IP addresses and ports.
2. When Alice wants to talk to Bob, she sends a message to Ziggy that says "hey I want to call Bob." Ziggy then sends a message to both Alice and Bob. The message to Alice contains Bob's public IP and port, and the message to Bob contains Alice's.
3. Alice and Bob simultaneously (upon receipt) send messages to each other. Alice's NAT router sees a message leave Alice for Bob's public IP and port, while Bob's sees the same thing in the direction of Alice. Both NAT routers create a mapping entry as described above. Each NAT router then interprets the other party's initialization packet as a reply in a two-way UDP "connection," and interprets further packets likewise. As long as Alice and Bob send keepalive messages to one another frequently enough (about every 120 seconds for typical routers), this conversation can be kept going indefinitely.
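The three steps above can be sketched as an in-memory simulation. Nothing here touches a real network: `Nat`, `Rendezvous`, and every address and port are hypothetical, and the NAT is modeled as the simple "only accept packets from endpoints I've sent to" kind.

```python
class Nat:
    """Toy NAT: inbound packets are accepted only from endpoints we have sent to."""

    def __init__(self, public_ip, public_port):
        self.public = (public_ip, public_port)
        self.allowed = set()   # remote endpoints with an open "conversation"

    def send(self, remote):
        # An outbound packet creates (or refreshes) the mapping for this remote.
        self.allowed.add(remote)
        return self.public     # the source address the remote side sees

    def accepts(self, remote):
        return remote in self.allowed


class Rendezvous:
    """Plays Ziggy: remembers public endpoints and introduces peers."""

    def __init__(self):
        self.seen = {}

    def register(self, name, public_endpoint):
        self.seen[name] = public_endpoint

    def introduce(self, a, b):
        # Each peer is told the other's public endpoint.
        return {a: self.seen[b], b: self.seen[a]}


ZIGGY = ("198.51.100.1", 9000)            # arbitrary example address
alice_nat = Nat("203.0.113.10", 40001)
bob_nat = Nat("203.0.113.20", 50002)
ziggy = Rendezvous()

# Step 1: both peers check in with Ziggy, who records their public endpoints.
ziggy.register("alice", alice_nat.send(ZIGGY))
ziggy.register("bob", bob_nat.send(ZIGGY))
blocked_before = not bob_nat.accepts(alice_nat.public)

# Step 2: Alice asks for Bob; Ziggy tells each side where the other is.
targets = ziggy.introduce("alice", "bob")

# Step 3: both send to each other, which opens a mapping on each NAT.
alice_nat.send(targets["alice"])
bob_nat.send(targets["bob"])
punched = alice_nat.accepts(bob_nat.public) and bob_nat.accepts(alice_nat.public)

print(blocked_before, punched)   # True True
```

Before the introduction Bob's NAT would drop anything from Alice. After both sides have sent one packet, each NAT treats the other's traffic as a reply.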
Two points about this. First, the folks who say NAT makes P2P impossible are almost right. NAT does make true serverless peer to peer virtually impossible without incredibly difficult and often unreliable methods. But in practice setting up a triangle relationship like the one above is easy, and since the messages are small the server's bandwidth requirements are not large. A single relatively inexpensive cloud server instance can easily provide NAT traversal services for millions of devices. Second, if you're thinking "wow that's an ugly hack" you are correct. It is indeed ugly and a hack, but so is NAT.
The scenario above is exactly how ZeroTier works except for one minor wrinkle: the message from Alice to Ziggy that says "I want to call Bob" is her first message to Bob. ZeroTier's servers also act as relays. When they see traffic being sent from one peer to another, they relay it and periodically send a message called VERB_RENDEZVOUS that tells each party to attempt NAT traversal. If it works, they stop relaying and start talking directly. But if NAT traversal never works, both peers can just keep relaying forever. This provides connections that start working instantly and always work for everyone. In common programming parlance we can call this "lazy NAT traversal." It's the secret to how ZeroTier One provisions connections so quickly.
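Here's a rough sketch of the lazy idea: relay by default, and switch to the direct path only once a traversal attempt has worked. This is not ZeroTier's actual code. `LazyLink`, `on_rendezvous`, and the two send callbacks are hypothetical names, and the probe result is passed in as a boolean rather than measured.

```python
class LazyLink:
    """Sends via the relay until a direct path has been shown to work."""

    def __init__(self, relay_send, direct_send):
        self.relay_send = relay_send     # callable that sends via the relay
        self.direct_send = direct_send   # callable that sends peer to peer
        self.direct_ok = False

    def on_rendezvous(self, probe_succeeded):
        """The relay suggested trying NAT traversal; record whether it worked."""
        self.direct_ok = self.direct_ok or probe_succeeded

    def send(self, payload):
        # No setup phase: the very first packet already has a working path.
        if self.direct_ok:
            return self.direct_send(payload)
        return self.relay_send(payload)


relayed, direct = [], []
link = LazyLink(relayed.append, direct.append)

link.send("hello")            # relayed: nothing is known about the peer yet
link.on_rendezvous(False)     # traversal attempt failed, keep relaying
link.send("still relayed")
link.on_rendezvous(True)      # traversal attempt worked
link.send("now direct")

print(relayed)   # ['hello', 'still relayed']
print(direct)    # ['now direct']
```

If `on_rendezvous(True)` never happens, the link simply keeps using the relay, which is the "always works for everyone" property.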
Lazy traversal also simplifies things. Most peer to peer protocols perform a complicated endpoint characterization step prior to initiating connectivity. That requires a complicated state machine and a lot of state transitions that must be coordinated to determine things like "am I behind NAT" and "what kind of NAT am I behind?" The lazy method just skips all that. Connection setup is stateless.
There are four major types of NAT encountered in the field. The terminology used is somewhat confusing... I'm not really sure what is meant by a "cone." But here they are: full cone, address-restricted cone, port-restricted cone, and symmetric. The important thing here is that the UDP hole punching technique works for only the first three. The fourth type, symmetric NAT, is problematic.
If one host is behind symmetric NAT, traversal can still occur. That's because the host that isn't restricted in this manner can reply to the initial packet from the host that is and in so doing "learn" its per-destination IP and port mapping. But if both hosts are behind symmetric NAT, hole punching can't work... at least not reliably.
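A small sketch shows why symmetric NAT is awkward. The defining behavior is that the public port depends on the destination, so the port a third party observes is not the one another peer needs. `SymmetricNat` and all addresses below are made up, and sequential port assignment is an assumption of this toy model.

```python
class SymmetricNat:
    """Toy symmetric NAT: a separate public port per (source, destination) pair."""

    def __init__(self, public_ip, first_port=40000):
        self.public_ip = public_ip
        self.next_port = first_port   # assumed: ports are assigned sequentially
        self.mappings = {}            # (private_endpoint, remote_endpoint) -> public_port

    def outbound(self, private_endpoint, remote_endpoint):
        key = (private_endpoint, remote_endpoint)
        if key not in self.mappings:
            self.mappings[key] = self.next_port
            self.next_port += 1
        return (self.public_ip, self.mappings[key])


ZIGGY = ("198.51.100.1", 9000)
BOB = ("203.0.113.20", 50002)
alice = ("10.1.2.3", 5000)

nat = SymmetricNat("203.0.113.10")
seen_by_ziggy = nat.outbound(alice, ZIGGY)
seen_by_bob = nat.outbound(alice, BOB)

print(seen_by_ziggy)   # ('203.0.113.10', 40000)
print(seen_by_bob)     # ('203.0.113.10', 40001)
```

Ziggy would tell Bob to aim at port 40000, but Alice's packets to Bob actually leave from 40001. If Bob is not restricted this way he can just read the real source port off Alice's first packet. If both sides behave like this, neither can learn the other's port in advance.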
There is one even scarier hack technique that can occasionally work even if both parties are behind symmetric NAT. Many symmetric NATs assign port numbers sequentially. So in addition to trying Alice's first IP and port, Bob can also try Alice at IP:port+1 and possibly also IP:port+2.
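The guessing step is tiny if you write it down. A sketch, with `candidate_endpoints` and its `extra` parameter being names invented here:

```python
def candidate_endpoints(ip, observed_port, extra=2):
    """Return the observed endpoint plus the next few sequential ports to try."""
    ports = [observed_port + i for i in range(extra + 1)]
    # Drop anything past the end of the 16-bit port range.
    return [(ip, p) for p in ports if p <= 65535]


print(candidate_endpoints("203.0.113.10", 40000))
# [('203.0.113.10', 40000), ('203.0.113.10', 40001), ('203.0.113.10', 40002)]
```

The overhead is a couple of extra probe packets per attempt, and it only helps when the NAT really does hand out ports in order and nothing else grabbed the next one first.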
ZeroTier doesn't do this yet, but it probably will in a future release. Before enabling it we'd like to do a little more field testing and try to figure out just how often this works and whether it's worth the trouble and small amount of overhead.
But returning to the numbers cited above: only 4-8% of users cannot establish direct links even though as many as 99% are behind NAT. The situation for a peer to peer protocol in the wild is far from hopeless.
Another difficult situation arises if two peers are actually on the same local network behind the same NAT router.
If that NAT router is traversable, NAT traversal will almost always work as usual. But this isn't optimal. It imposes a small performance penalty, as traffic must now pop its head out of the LAN and back into it and traverse the router twice. What we'd ideally want is for traffic on the same LAN to simply go directly from host A to B.
The simplest solution is to use UDP broadcasts. This is what many applications, including ZeroTier, do. A packet that says "here I am" is sent to the broadcast address every 60 seconds or so. Other peers see it and, after a proper cryptographic handshake to verify identity, establish a direct connection over the LAN.
It might be tempting for peers to encode their private IP address and send it to the intermediate server and/or to the other peer. I thought about this when writing ZeroTier. On the plus side, this would work even on large segmented networks where UDP broadcasts don't make it everywhere. But the problem is that this exposes internal configuration details about the network to potentially random external peers. Many network security administrators are not going to like that, so I decided against it. I tried to think of a way to anonymize the data but couldn't, since the IPv4 private address space is so small that no form of hashing will adequately protect against exhaustive search.
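To see why hashing doesn't help, here's a sketch of the exhaustive search. The whole private IPv4 space (10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16) is only about 17.9 million addresses, so an attacker can just hash every one. The function names are made up, SHA-256 is only an example hash, and the demo searches a single /16 to keep it quick.

```python
import hashlib
import ipaddress


def hash_ip(ip):
    """Stand-in for any attempt to 'anonymize' an address by hashing it."""
    return hashlib.sha256(ip.encode("ascii")).hexdigest()


def recover_ip(digest, network="192.168.0.0/16"):
    """Brute force: hash every address in the network until one matches."""
    for addr in ipaddress.ip_network(network):
        if hash_ip(str(addr)) == digest:
            return str(addr)
    return None


leaked = hash_ip("192.168.0.66")
print(recover_ip(leaked))   # 192.168.0.66
```

A stronger hash function doesn't change this. The weakness is the size of the input space, not the hash.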
Some NAT devices support various methods of intentionally opening ports from the inside. The most common of these is called Universal Plug-and-Play (UPnP).
ZeroTier doesn't support it since typically it's only found in small router devices such as home routers. These usually implement some form of "full-cone" NAT that can be traversed using ordinary hole punching, rendering UPnP unnecessary for our use case. It'll probably be supported eventually, since there are sure to be a few exceptions to that rule and the goal is to support traversal in as many scenarios as can possibly be achieved.
UPnP is a fairly ugly semi-standard. It's not likely to be supported any more widely than it is already.
Due to IPv4 limitations, NAT is deployed on most networks. IPv6 is really the only way around this. Yet most don't realize the cost in kittens. Every time a NAT device remaps an IP address, a kitten dies. This amounts to trillions upon trillions of needless kitten fatalities every day. NAT traversal techniques do not avoid the carnage. They only hide it from the user.
In all seriousness though: NAT is an awful thing. It's an ugly workaround to a fundamental limitation, and the sooner it's rendered obsolete by IPv6 the sooner we can start really deploying a whole new generation of Internet protocols.
Other than the obvious downside of increased software complexity, the worst thing about NAT is the inherent resource overhead it imposes on protocols. This is true even in conventional client-server protocol designs. Because NAT is almost always stateful, frequent keepalive packets are required to hold all connections open. This is true for TCP as well as UDP. If you don't send a packet about once every 120 seconds (for typical NATs), your connection will be forgotten and will reset. Users behind NATs who use SSH have likely discovered this when attempting to leave SSH sessions open for a long time, and SSH (like most protocols) has a protocol keepalive option available as a workaround.
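The keepalive bookkeeping itself is simple. A sketch, assuming the 120 second figure above and an arbitrary safety margin of one half; `Keepalive`, `note_sent`, and `due` are hypothetical names, and time is passed in explicitly so the logic is easy to follow:

```python
class Keepalive:
    """Decides when a connection needs a keepalive to hold its NAT mapping open."""

    def __init__(self, nat_timeout=120.0, safety=0.5):
        # Send well before the NAT would forget the mapping.
        self.interval = nat_timeout * safety
        self.last_sent = None

    def note_sent(self, now):
        """Record that some packet (data or keepalive) went out at time `now`."""
        self.last_sent = now

    def due(self, now):
        """True if nothing has been sent recently enough to keep the mapping alive."""
        return self.last_sent is None or now - self.last_sent >= self.interval


ka = Keepalive()
print(ka.due(0.0))    # True: nothing sent yet
ka.note_sent(0.0)
print(ka.due(30.0))   # False: mapping is still fresh
print(ka.due(60.0))   # True: time to send another packet
```

The cost isn't in the logic. It's that every idle connection has to wake up and transmit on this schedule, forever.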
For desktop/laptop and server systems these tiny messages don't matter much, but for small mobile devices they're a battery life killer. They also make implementing peer to peer anything on a mobile device very difficult. In the near term, porting ZeroTier's protocol to mobile without significantly impacting battery life will require some fairly heroic hacks around using platform-provided "push" notifications and other cleverness. None of that would be necessary without NAT, as peers could simply notify one another of their new IP:port locations whenever those locations changed. Usually that's not very frequent, maybe once or twice a day at maximum, and would not impose much overhead.
So if you want to see truly efficient, scalable, and simple Internet protocols in the future, by all means use IPv6 and encourage others to do the same. It's not just about IPv4 address exhaustion. It's also about fundamentally sound protocol design that dispenses with the need for any number of awful hacks like the ones discussed in this post.
... not to mention the kittens.