The next major release of ZeroTier's network virtualization engine (1.2.0) is a huge milestone. In addition to other improvements, our virtual networks will be getting a lot smarter. It will now be possible to set fine-grained rules and permissions and to implement security monitoring at the network level, all managed via the network controller and enforced cooperatively by all network participants. This brings us to near feature parity with in-data-center SDN systems and virtual private cloud backplanes like Amazon VPC.
This post describes the basic design of the ZeroTier rules engine by way of the reasoning process that led to it. As of this writing (late August 2016), a working but not quite production-ready implementation is taking shape in the "dev" branch of our GitHub repository. The ETA for this release is mid-to-late September, but if you are brave you are welcome to pull "dev" and take a look. Start with "controller/README.md". Note that there have been other changes to the controller too, so don't try to drop this into a production deployment!
In designing our rules engine we took inspiration from OpenFlow, Amazon VPC, and many other sources, but in the end we decided to do something a little bit different. Our mission here at ZeroTier is to "directly connect the world's devices" by in effect placing them all in the same cloud. The requirements implied by this mission rule out (pun intended?) many of the approaches used by conventional LAN-oriented SDN switches and endpoint firewall management solutions.
ZeroTier is designed to run on small devices. That means we can't push big rules tables: the size of the rule set pushed to each device has to be kept under control. Meanwhile, the latency and unreliability of the global Internet compared with on-premises networks rule out any approach that requires constant contact between endpoints and network controllers. This means we can't use the OpenFlow approach of querying the controller when an unrecognized "flow" is encountered. That would be slow and unreliable.
At the same time we wanted to ship a truly flexible rules engine capable of handling the complex security, monitoring, and micro-segmentation needs of large distributed organizations.
We've wanted to add these capabilities for a long time. The delay has come from the difficulty of designing a system that delivers on all our objectives.
To solve hard problems it helps to first take a step back and think about them conceptually and in terms of first principles. We've had many discussions with users about rules and micro-segmentation, and have also spent a good amount of time perusing the literature and checking out what other systems can do. Here's a rough summary of what we came up with:
Those are the high-level goals that informed our design. Here's what we did in response to them.
Once fine-grained permissions, per-device rules, and device group rules are conceptually separated from the definition of global network behavior it becomes practical to limit the global rules table size to something modest enough to accommodate small devices. While there might occasionally be use cases that require more, we think something on the order of a few hundred rules that apply globally to an entire network is probably enough to address most sane requirements. This is enough space to describe in great detail exactly what traffic a network will carry and to implement complex security monitoring patterns.
So that's what we did. Keep in mind that at this stage we are intentionally ignoring the need for fine-grained per-device stuff. We'll pull out some bigger guns to deal with that later.
Now on to security monitoring. If the goal is near-omniscience, there is no substitute for placing a man in the middle and just proxying everything. Unfortunately that's a scalability problem even inside busy data centers, let alone across wide area networks. But it's a case we wanted to at least support for those who want it and are willing to take the performance hit.
To support this we added a REDIRECT action to our rules engine. Our redirect operates at the ZeroTier VL1 (virtual layer 1) layer, which is actually under the VL2 virtual Ethernet layer. That means you can send all traffic matching a given set of criteria to a specific device without in any way altering its Ethernet or IP address headers. That device can then silently observe this traffic and send it along to its destination. The fact that this can be done only for certain traffic means the hit need only be taken when desired. Traffic that does not match redirection rules can still flow directly.
Now what about a lower-overhead option? For that we took some inspiration from Linux's iptables and its massive suite of capabilities. Among these is its --tee option, which allows packet cloning to remote observers.
We therefore added our own TEE, and like REDIRECT it has the advantage of operating at VL1. With our packet cloning action, every packet matching a set of criteria, or even just the first N bytes of every such packet, can be sent to an observer. Criteria include TCP flags. This lets network administrators do efficient and scalable things like clone every TCP SYN and FIN to an observer to watch every TCP connection on the network without having to handle connection payload. This allows a lot of network insight with minimal overhead. A slightly higher-overhead option would involve sending, say, the first 64 bytes of every packet to an observer. That would allow the observation of all Ethernet and IP header information with less of a performance hit than full proxying.
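The selection and truncation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `is_syn_or_fin` and `tee_copy` are hypothetical names, not part of ZeroTier, and the sketch works on an already-parsed TCP flags byte rather than on real frames. The flag bit values are the standard TCP header ones.

```python
from typing import Optional

# Standard TCP header flag bits.
TCP_FIN = 0x01
TCP_SYN = 0x02


def is_syn_or_fin(tcp_flags: int) -> bool:
    """True if the segment opens (SYN) or closes (FIN) a connection."""
    return bool(tcp_flags & (TCP_SYN | TCP_FIN))


def tee_copy(frame: bytes, selected: bool,
             max_bytes: Optional[int] = None) -> Optional[bytes]:
    """Hypothetical helper: return the bytes to clone to the observer.

    Returns None when the frame was not selected by the match criteria.
    When max_bytes is given, only the first max_bytes of the frame are cloned.
    """
    if not selected:
        return None
    return frame if max_bytes is None else frame[:max_bytes]
```

Cloning only SYN and FIN segments would correspond to `tee_copy(frame, is_syn_or_fin(flags))`, while the "first 64 bytes of every packet" variant would be `tee_copy(frame, True, 64)`.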
But what about endpoint compromise?
Our security monitoring capabilities can never be quite as inescapable as a hardware tap on a physical network. That's because ZeroTier is a distributed system that relies upon endpoint devices to correctly follow and enforce their rules. If an endpoint device is compromised its ZeroTier service could be patched to bypass any network policy. But the fact that rules are evaluated and enforced on both sides of every interaction allows us to do the next best thing. By matching on the inbound/outbound flag in our rules engine and using other clever rule design patterns it's possible to detect cases where one side of an interaction stops abiding by our redirect and packet tee'ing policies. That means an attacker must now compromise both sides of a connection to avoid being observed, and if they've done that... well... you have bigger problems. (A detailed discussion of how to implement this will be forthcoming in future rules engine documentation.)
Global rules take care of global network behavior and security instrumentation, but what if we want to get nit-picky and start setting policies on a per-device or per-device-group basis?
Let's say we have a large company with many departments and we want to allow people to access ports 137, 139, and 445 (SMB/CIFS) only within their respective groups. There are numerous endpoint firewall managers that can do this at the local OS firewall level, but what if we want to embed these rules right into the network?
Powerful (and expensive) enterprise switches and SDN implementations can do this, but under the hood this usually involves the compilation and management of really big tables of rules. Every single switch port and/or IP address or other identifier must get its own specific rules to grant it the desired access, and on big networks a combinatorial explosion quickly ensues. Good UIs can hide this from administrators, but that doesn't fix the rules table bloat problem. In OpenFlow deployments that support transport-triggered (or "reactive") rule distribution to smart switches this isn't a big deal, but as we mentioned up top we can't do things that way because our network controllers might be on the other side of the world from an endpoint.
On the theory side of information security we found a concept that seems to capture most if not all of these cases: capability-based security. From the article:
A capability (known in some systems as a key) is a communicable, unforgeable token of authority. It refers to a value that references an object along with an associated set of access rights. A user program on a capability-based operating system must use a capability to access an object.
Now let's do a bit of conceptual search and replace. Object (noun) becomes network behavior (verb), and access rights become the right to engage in that behavior on the network. We might then conclude by saying that a user device on a capability-based network must use a capability to engage in a given behavior.
It turns out there's been a little bit of work in this area sponsored by DARPA and others (PDF), but everything we could find still talked in terms of routers, firewalls, and other middle-boxes that do not exist in the peer-to-peer ZeroTier paradigm. But ZeroTier does include a robust cryptosystem, and as we see in systems like Bitcoin, cryptography can be a powerful tool to decentralize trust.
For this use case nothing anywhere near as heavy as a blockchain is needed. All we need are digital signatures. ZeroTier network controllers already sign network configurations, so why can't they sign capabilities? By doing that it becomes possible to avoid the rules table bloat problem by distributing capabilities only to the devices to which they are assigned. These devices can then lazily push capabilities to each other on an as-needed basis, and the recipient of any capability can verify that it is valid by checking its signature.
But what is a capability in this context?
If we were achieving micro-segmentation with a giant rules table, a capability would be a set of rules. It turns out that can work here too. A ZeroTier network capability is a bundle of cryptographically signed rules that allow a given action and that can be presented ahead of relevant packets when that action is performed.
It works like this. When a sender evaluates its rules it first checks the network's global rules table. If there is a match, appropriate action(s) are taken and rule evaluation is complete. If there is no match the sender then evaluates the capabilities that it has been assigned by the controller. If one of these matches, the capability is (if necessary) pushed to the recipient ahead of the action being performed. When the recipient receives a capability it checks its signature and timestamp and if these are valid it caches it and associates it with the transmitting member. Upon receipt of a packet the recipient can then check the global rules table and, if there is no match, proceed to check the capabilities on record for the sender. If a valid pushed capability permits the action, the packet is accepted.
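The flow above can be sketched roughly as follows. This is a toy model, not ZeroTier's implementation: every name here (`Capability`, `sender_decide`, `Recipient`) is hypothetical, a rule set is modeled as a plain function returning `"accept"`, `"drop"`, or `None` for "no match", the signature check is a caller-supplied stub (`verify`) rather than real cryptography, and the timestamp check is assumed to be a simple maximum-age test.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# A rule set maps a packet to "accept", "drop", or None (no match).
RuleSet = Callable[[dict], Optional[str]]


@dataclass
class Capability:
    cap_id: int
    rules: RuleSet
    timestamp: float   # issue time set by the controller
    signature: bytes   # controller signature; its format is not modeled here


def sender_decide(global_rules: RuleSet, my_caps: List[Capability],
                  packet: dict) -> Tuple[bool, Optional[Capability]]:
    """Return (allowed, capability to push ahead of the packet, if any)."""
    verdict = global_rules(packet)
    if verdict is not None:            # global match ends evaluation
        return verdict == "accept", None
    for cap in my_caps:                # otherwise try assigned capabilities
        if cap.rules(packet) == "accept":
            return True, cap
    return False, None                 # nothing matched: drop


class Recipient:
    def __init__(self, global_rules: RuleSet,
                 verify: Callable[[Capability], bool],
                 max_age_s: float = 3600.0):   # assumed freshness window
        self.global_rules = global_rules
        self.verify = verify
        self.max_age_s = max_age_s
        self.caps_by_sender: Dict[str, Dict[int, Capability]] = {}

    def receive_capability(self, sender: str, cap: Capability,
                           now: Optional[float] = None) -> bool:
        """Cache a pushed capability only if signature and timestamp check out."""
        now = time.time() if now is None else now
        if not self.verify(cap) or now - cap.timestamp > self.max_age_s:
            return False
        self.caps_by_sender.setdefault(sender, {})[cap.cap_id] = cap
        return True

    def receive_packet(self, sender: str, packet: dict) -> bool:
        verdict = self.global_rules(packet)
        if verdict is not None:
            return verdict == "accept"
        caps = self.caps_by_sender.get(sender, {}).values()
        return any(cap.rules(packet) == "accept" for cap in caps)
```

Note that the recipient keys its cache by sender, so a capability pushed by one member never grants anything to another member.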
... or in plainer English: since capabilities are signed by the controller, devices on the network can use them to safely inform one another of what they are allowed to do.
All of this happens "under" virtual Ethernet and is therefore completely invisible to layer 2 and above.
Capabilities alone still don't efficiently address the full scope of the "departments" use case above. If a company has dozens of departments we don't want to have to create dozens and dozens of nearly identical capabilities that do the same thing, and without some way of grouping endpoints a secondary scoping problem begins to arise. IP addresses could be used for this purpose but we wanted something more secure and easier to manage. Having to renumber IP networks every time something's permissions change is terribly annoying.
To solve this problem we introduced a third and final component to our rules engine system: tags. A tag is a tiny cryptographically signed numeric key/value pair that can (like capabilities) be replicated opportunistically. The value associated with each tag ID can be matched in either global or capability scope rules.
This lets us define a single capability called (for example) SAMBA/CIFS that permits communication on ports 137, 139, and 445 and then include a rule in that capability that makes it apply only if both sides' "department" tags match.
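As a rough sketch of what the rule inside such a capability checks, assuming tags are stored per member as a mapping from numeric tag ID to numeric value: the tag ID, the function name, and the treatment of a missing tag as "no match" are all assumptions made for illustration.

```python
from typing import Dict

SMB_PORTS = {137, 139, 445}
DEPARTMENT_TAG_ID = 1  # hypothetical tag ID chosen for this example


def smb_capability_allows(dest_port: int,
                          sender_tags: Dict[int, int],
                          receiver_tags: Dict[int, int]) -> bool:
    """Allow SMB/CIFS ports only when both sides carry the same department tag."""
    if dest_port not in SMB_PORTS:
        return False
    a = sender_tags.get(DEPARTMENT_TAG_ID)
    b = receiver_tags.get(DEPARTMENT_TAG_ID)
    if a is None or b is None:
        return False  # assumption: a missing tag never matches
    # A tag difference of at most 0 is the same as equality.
    return abs(a - b) <= 0
```

One capability written this way serves every department: adding a department means issuing a new tag value to its members, not writing a new capability.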
Think of network tags as being analogous to topic tags on a forum system or ACLs in a filesystem that supports fine-grained permissions. They're values that can be used inside rule sets to categorize endpoints.
A rule in our system consists of a series of zero or more MATCH entries followed by one ACTION. Matches are ANDed together (evaluated until one does not match) and their sense can be inverted. An action with no preceding matches is always taken. The default action if nothing matches is ACTION_DROP.
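A minimal sketch of these evaluation semantics follows. The class and function names are hypothetical and packets are modeled as plain dictionaries. Only drop, accept, and tee are modeled; redirect is left out because its effect on evaluation is not described here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Match:
    test: Callable[[dict], bool]
    invert: bool = False          # the sense of a match can be inverted

    def hits(self, packet: dict) -> bool:
        return self.test(packet) != self.invert


@dataclass
class Rule:
    matches: List[Match]          # zero or more matches, ANDed together
    action: str                   # "drop", "accept", or "tee"


def evaluate(rules: List[Rule], packet: dict) -> Tuple[str, int]:
    """Return (final verdict, number of tee actions taken along the way)."""
    teed = 0
    for rule in rules:
        # all() stops at the first failing match; an empty match list is
        # always true, so an action with no matches is always taken.
        if all(m.hits(packet) for m in rule.matches):
            if rule.action == "tee":
                teed += 1         # tee does not halt evaluation
                continue
            return rule.action, teed   # drop and accept halt evaluation
    return "drop", teed           # default when nothing matches
```

For example, a rule with a single inverted match on the IPv4 ethertype followed by a drop action discards everything that is not IPv4, and a final rule with no matches and an accept action turns the default into "accept".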
Here's a list of the actions and matches currently available:
| Action or match | Description |
|---|---|
| ACTION_DROP | Drop this packet (halts evaluation) |
| ACTION_ACCEPT | Accept this packet (halts evaluation) |
| ACTION_TEE | Send this packet to an observer and keep going (optionally only first N bytes) |
| ACTION_REDIRECT | Redirect this packet to another ZeroTier address (all headers preserved) |
| MATCH_SOURCE_ZEROTIER_ADDRESS | Originating VL1 address (40-bit ZT address) |
| MATCH_DEST_ZEROTIER_ADDRESS | Destination VL1 address (40-bit ZT address) |
| MATCH_ETHERTYPE | Ethernet frame type |
| MATCH_MAC_SOURCE | L2 MAC source address |
| MATCH_MAC_DEST | L2 MAC destination address |
| MATCH_IPV4_SOURCE | IPv4 source (with mask, does not match if not IPv4) |
| MATCH_IPV4_DEST | IPv4 destination (with mask, does not match if not IPv4) |
| MATCH_IPV6_SOURCE | IPv6 source (with mask, does not match if not IPv6) |
| MATCH_IPV6_DEST | IPv6 destination (with mask, does not match if not IPv6) |
| MATCH_IP_TOS | IP type of service field |
| MATCH_IP_PROTOCOL | IP protocol (e.g. UDP, TCP, SCTP) |
| MATCH_ICMP | ICMP type (V4 or V6) and optionally code |
| MATCH_IP_SOURCE_PORT_RANGE | Range of source ports, IPv4 or IPv6 (inclusive) |
| MATCH_IP_DEST_PORT_RANGE | Range of destination ports, IPv4 or IPv6 (inclusive) |
| MATCH_CHARACTERISTICS | Bit field of packet characteristics that include TCP flags, whether this is inbound or outbound, etc. |
| MATCH_FRAME_SIZE_RANGE | Range of Ethernet frame sizes (inclusive) |
| MATCH_TAGS_DIFFERENCE | Difference between two tags is <= value (use 0 to test equality) |
| MATCH_TAGS_BITWISE_AND | Bitwise AND of tags equals value |
| MATCH_TAGS_BITWISE_OR | Bitwise OR of tags equals value |
| MATCH_TAGS_BITWISE_XOR | Bitwise XOR of tags equals value |
Detailed documentation is coming soon. Keep an eye on the "dev" branch.
The ZeroTier rules engine is stateless in order to control CPU overhead and memory consumption, and stateless firewalls have certain shortcomings. Most of these issues have workarounds, but they are not always obvious. We will eventually document "design patterns" for working around common issues alongside the rules engine, and we'll be building a rule editor UI into ZeroTier Central that will help as well.
We also have not addressed QoS and traffic priority. That presents additional challenges: because ZeroTier is a virtualization system that abstracts away the physical network, it is hard for it to know the physical topology. One option we're considering is QoS field mirroring, which would let ZeroTier's transport packets inherit QoS fields from the packets they encapsulate. That would allow physical edge switches to prioritize traffic. We're still exploring this domain and hope to have something in the future.
As of the time of this post our rules engine is still under active development. If you have additional thoughts or input please feel free to head over to our community forums and start a thread. Intelligent input is much appreciated since now would be the most convenient time to address any holes in the approach that we've outlined above.