Research Notes on 2.x Cryptography

Since releasing ZeroTier 1.4 and LF we have shifted most of our effort over to developing our upcoming 2.0 release. Not much has been announced about this release, though in the LF documentation we hinted at how it will be used with 2.0 roots.

In 2.0 we’re improving numerous things: performance, multicast efficiency, the ability to run independent infrastructure, and cryptography. This post discusses the last of these.

Since its first release ZeroTier has relied on simple and quite boring cryptography that uses Salsa20 as a stream cipher, Poly1305 for message authentication, and Curve25519/Ed25519 for key agreement and signatures. Our straightforward implementation has proven itself to be quite secure but it lacks some advanced features, most notably forward secrecy / ephemeral keys.

Omitting ephemeral keys has been controversial. Our main rationale has been simplicity and statelessness, as discussed in an issue on the topic at GitHub. It’s also important to note that a large fraction of the traffic people send over ZeroTier is SSH or SSL encrypted already, providing defense in depth and a second layer that usually (in modern implementations) does support forward secrecy and ephemeral keys.

Nevertheless we have always intended to implement ephemeral keys some day, and now is that day.

Salsa20 was a good choice of cipher in 2011 to achieve wide platform support with high performance. Back then many mobile and even smaller and older desktop CPUs did not possess hardware AES acceleration. Without hardware acceleration Salsa20 is quite a bit faster than AES, especially if vector based implementations are used.

Today AES is generally faster than Salsa20. In our benchmarks AES-GCM can be up to twice as fast as Salsa20 with Poly1305 on x86-64 chips with AES hardware acceleration. Given this progress in AES support, and given that some of our customers have inquired about the ability to link ZeroTier against FIPS-compliant cryptographic libraries, we plan to introduce AES symmetric encryption support in 2.0 and make it the default for communication with other 2.0+ nodes.

Ephemeral Keys

One of the reasons we left ephemeral keys out of ZeroTier is our desire to offer fast stateless session setup. In current ZeroTier a peer that knows another peer’s identity can simply execute key agreement and send a message. No state negotiation is needed. This keeps a lot of things simple and means that connections typically start working as soon as a device has network access and establishes links to root servers.

We plan to keep things this way in 2.0 when it comes to initial communication but also introduce “lazy” ephemeral key negotiation. Once a peer starts talking to another peer it will start pushing ephemeral keys. If it negotiates one, it will stop using the long-lived key in favor of the new key. Ephemeral keys will be re-negotiated fairly often. We’re thinking two minutes is probably a good default.
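A minimal sketch of the "lazy" key selection described above, assuming a per-peer session object. The names `PeerSession`, `install_ephemeral`, `sending_key`, and `should_offer_new_key` are hypothetical and not taken from the ZeroTier code base, and the 120 second interval is just the tentative default mentioned in the text. The key exchange itself is left out; only the choice of key and the renewal timing are shown.

```python
from dataclasses import dataclass
from typing import Optional

# Tentative default from the text ("two minutes"); not a final value.
REKEY_INTERVAL_SECONDS = 120.0


@dataclass
class PeerSession:
    """Hypothetical per-peer state; field names are illustrative only."""
    identity_key: bytes                    # long-lived key from identity key agreement
    ephemeral_key: Optional[bytes] = None  # set once a negotiation succeeds
    ephemeral_since: float = 0.0           # time the current ephemeral key was installed

    def install_ephemeral(self, key: bytes, now: float) -> None:
        """Record a freshly negotiated ephemeral key."""
        self.ephemeral_key = key
        self.ephemeral_since = now

    def sending_key(self) -> bytes:
        """Use the ephemeral key once one exists, else the long-lived key."""
        if self.ephemeral_key is not None:
            return self.ephemeral_key
        return self.identity_key

    def should_offer_new_key(self, now: float) -> bool:
        """True while no ephemeral key exists or the current one is due for renewal."""
        if self.ephemeral_key is None:
            return True
        return now - self.ephemeral_since >= REKEY_INTERVAL_SECONDS


session = PeerSession(identity_key=b"long-lived")
key_before = session.sending_key()            # long-lived key: no setup needed
session.install_ephemeral(b"ephemeral-1", now=0.0)
key_after = session.sending_key()             # ephemeral key takes over
```

The first packet can still be sent immediately under the long-lived key, which is what keeps session setup stateless.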

ZeroTier’s VL1 peer to peer network layer carries two kinds of traffic: control traffic and data traffic. Control traffic includes things like network access certificates, configuration requests, and IP connectivity information to assist in the establishment of peer-to-peer links. Data traffic consists of unicast and multicast virtual Ethernet frames.

Ephemeral keys and forward secrecy are less important for control traffic. If someone decrypts your control traffic they learn things like network IDs, virtual network IP addresses, physical endpoint IPs, and possibly meta-data like host names. Forging control traffic is more problematic, but to do this would require compromise of ZeroTier identity keys which would allow someone to impersonate the node anyway.

Where re-keying and forward secrecy are really desired is for data traffic. Data traffic is what you want to keep private. There’s also a lot more of it, making it more likely that an adversary would be able to gather enough of it to mount an attack.

This leads us to the final element of our ephemeral keying design: network-level ephemeral key requirement settings. The administrators of networks (at the network controller) will be able to set a network flag indicating whether or not a network requires ephemeral key establishment before data traffic will be accepted or sent. Turning this on means data traffic will always be subject to forward secrecy. Keeping it off means peers will eventually switch to ephemeral keys but some small amount of data traffic may be encrypted via long-lived identity keys instead.
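The effect of the network flag can be written out as a small decision function. This is only a sketch: `Traffic`, `KeyChoice`, and `choose_key` are hypothetical names, and returning `None` to mean "hold or drop the frame until an ephemeral key exists" is an assumption, since the text does not say what happens to data traffic that arrives too early.

```python
from enum import Enum
from typing import Optional


class Traffic(Enum):
    CONTROL = "control"
    DATA = "data"


class KeyChoice(Enum):
    EPHEMERAL = "ephemeral"
    IDENTITY = "identity"


def choose_key(kind: Traffic, have_ephemeral: bool,
               network_requires_ephemeral: bool) -> Optional[KeyChoice]:
    """Pick the key for one packet, or None if it must wait (assumed behavior)."""
    if have_ephemeral:
        # Once negotiated, the ephemeral key replaces the long-lived key.
        return KeyChoice.EPHEMERAL
    if kind is Traffic.DATA and network_requires_ephemeral:
        # Flag on: data traffic never goes out under the long-lived key.
        return None
    # Control traffic, or data on a network with the flag off.
    return KeyChoice.IDENTITY


early_data_strict = choose_key(Traffic.DATA, False, True)
early_data_relaxed = choose_key(Traffic.DATA, False, False)
early_control = choose_key(Traffic.CONTROL, False, True)
```

With the flag off, the only data frames protected by the long-lived key are the ones sent before the first negotiation completes.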

We think this design allows us to introduce forward secrecy without compromising performance, ease of use, or simplicity.

Hardening AES-GCM Against Sudden Death

Authenticated encryption is great. It means you can know who you’re talking to without clunky out-of-band mechanisms and it categorically blocks numerous classes of attacks that rely on the ability for an attacker to change ciphertext or craft bogus encrypted packets and ask your node to attempt to process them.

Just about all current authenticated encryption mechanisms, including both Poly1305 and GCM, have an issue though: if you ever re-use an initialization vector (commonly called a “nonce”) with the same key, sudden death occurs.

Stream ciphers like Salsa20 and AES in CTR mode (which GCM uses internally) also have this property. That’s because they’re cryptographic random number generators that use the random numbers they generate as a one time pad XORed against the plaintext. A duplicate IV/nonce under the same key makes them output an identical stream of random values, and since XOR is its own inverse, XORing two ciphertexts encrypted with that stream cancels it out and leaves the XOR of the two plaintexts. From there, known or guessable content in one message reveals the corresponding content of the other.
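The key stream problem is easy to reproduce with a toy cipher. The sketch below uses SHA-256 in counter mode as a stand-in key stream generator (an assumption made so the example runs with only the Python standard library; it is not Salsa20 or AES-CTR), but any cipher that XORs a nonce-derived key stream into the plaintext behaves the same way.

```python
import hashlib


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy key stream: SHA-256 over key, nonce, and a block counter."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encrypt(key: bytes, nonce: bytes, message: bytes) -> bytes:
    return xor(message, keystream(key, nonce, len(message)))


key = b"k" * 32
nonce = b"n" * 12                      # the same nonce is used twice below
m1 = b"attack at dawn!!"
m2 = b"retreat at noon!"

c1 = encrypt(key, nonce, m1)
c2 = encrypt(key, nonce, m2)

# The key stream cancels: the attacker learns m1 XOR m2 without the key.
leak = xor(c1, c2)
# Knowing (or guessing) one message then reveals the other.
recovered = xor(leak, m1)
```

Here `leak` equals `xor(m1, m2)` and `recovered` equals `m2`, even though the key was never used by the attacker.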

This has always bothered us. ZeroTier is a protocol used to exchange large amounts of data, making the odds of IV re-use higher than they would be for e.g. an instant messaging protocol. Our users also place a high priority on performance, making the use of really large IVs/nonces undesirable for the packet overhead they would add.

There are other modes of operation out there like AES-GCM-SIV that mitigate this risk. In this mode re-using an IV/nonce is not catastrophic. A duplicated IV/nonce with an identical message will result in an identical encrypted message, revealing that two of the same message were sent, but otherwise nothing breaks. The CAESAR competition is also soliciting authenticated stream cipher modes that are not vulnerable to accidental IV/nonce repetition attacks.

Unfortunately none of these new-fangled modes are FIPS or NSA certifiable, and we have customers that want that.

But… what if we could use AES-GCM in the standard way and then do something after the fact to harden it against “sudden death” on accidental nonce/IV duplication?

To do this we invented something we call DDS for Data Dependent Scrambling. It’s just an idea at this stage and is not yet set in stone. If you have a better idea or a criticism, please leave it under this issue at GitHub.

DDS works by first applying AES-GCM in the usual way and then encrypting the result again using both the random IV/nonce and the authentication tag (a.k.a. MAC) generated by GCM as a secret key.

The steps in AES-GCM-DDS are as follows:

  1. Encrypt the plaintext with AES-GCM in the standard manner using a random nonce / initialization vector, generating ciphertext and an authentication tag.
  2. Use AES in simple ECB mode (same AES key is likely fine) to encrypt the nonce/IV and the GCM authentication tag from step 1 together such that they are both encrypted and mixed. This is the first of two AES encryptions of the IV and the tag.
  3. Use the resulting encrypted combined IV and tag from step 2 as a key (for a simple symmetric cipher TBD, not GCM) to encrypt the ciphertext output by GCM again to yield final ciphertext.
  4. Encrypt the encrypted IV and tag from step 2 again in ECB mode to yield a final object we call a combined tag.

(Those familiar with block cipher modes of operation will note that ECB mode is not typically used as it can reveal coarse underlying structure in plaintext. In this case the thing being encrypted is itself either random or a cryptographic identifier with no structure to reveal.)

Decryption basically goes 4, 3, 2, 1, with AES ECB decryption taking the place of encryption in steps 2 and 4.
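The data flow of the four steps can be sketched in runnable form. Everything cryptographic here is a stand-in chosen so the example needs only the Python standard library: an HMAC-based key stream plus HMAC tag takes the place of AES-GCM, a 4-round Feistel permutation takes the place of AES in ECB mode, the step 3 cipher (which is still TBD) is a simple keyed XOR stream, and the nonce is 16 bytes rather than GCM's usual 12. All function names are hypothetical. This illustrates the structure only and is not a secure or proposed implementation.

```python
import hashlib
import hmac
import os

NONCE_LEN = 16  # assumption: 16 bytes so nonce||tag splits into two equal halves
TAG_LEN = 16


def prf(key: bytes, label: bytes, data: bytes, n: int) -> bytes:
    """n keyed pseudo-random bytes from HMAC-SHA256 in counter mode."""
    out = b""
    counter = 0
    while len(out) < n:
        block_input = label + counter.to_bytes(4, "big") + data
        out += hmac.new(key, block_input, hashlib.sha256).digest()
        counter += 1
    return out[:n]


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def make_tag(key: bytes, nonce: bytes, ct: bytes) -> bytes:
    return hmac.new(key, b"tag" + nonce + ct, hashlib.sha256).digest()[:TAG_LEN]


def aead_encrypt(key: bytes, nonce: bytes, plaintext: bytes):
    """Stand-in for AES-GCM: key stream XOR plus a tag over nonce and ciphertext."""
    ct = xor(plaintext, prf(key, b"stream", nonce, len(plaintext)))
    return ct, make_tag(key, nonce, ct)


def aead_decrypt(key: bytes, nonce: bytes, ct: bytes, tag: bytes) -> bytes:
    if not hmac.compare_digest(tag, make_tag(key, nonce, ct)):
        raise ValueError("authentication failed")
    return xor(ct, prf(key, b"stream", nonce, len(ct)))


def block_encrypt(key: bytes, block: bytes) -> bytes:
    """Stand-in for AES-ECB over nonce||tag: a 4-round Feistel permutation."""
    left, right = block[:16], block[16:]
    for rnd in range(4):
        left, right = right, xor(left, prf(key, b"feistel", bytes([rnd]) + right, 16))
    return left + right


def block_decrypt(key: bytes, block: bytes) -> bytes:
    left, right = block[:16], block[16:]
    for rnd in reversed(range(4)):
        left, right = xor(right, prf(key, b"feistel", bytes([rnd]) + left, 16)), left
    return left + right


def dds_encrypt(key: bytes, plaintext: bytes, nonce: bytes = None):
    nonce = os.urandom(NONCE_LEN) if nonce is None else nonce
    ct, tag = aead_encrypt(key, nonce, plaintext)         # step 1
    mixed = block_encrypt(key, nonce + tag)               # step 2
    final_ct = xor(ct, prf(mixed, b"dds", b"", len(ct)))  # step 3, keyed by step 2 output
    combined_tag = block_encrypt(key, mixed)              # step 4
    return final_ct, combined_tag


def dds_decrypt(key: bytes, final_ct: bytes, combined_tag: bytes) -> bytes:
    mixed = block_decrypt(key, combined_tag)                    # undo step 4
    ct = xor(final_ct, prf(mixed, b"dds", b"", len(final_ct)))  # undo step 3
    nonce_and_tag = block_decrypt(key, mixed)                   # undo step 2
    nonce, tag = nonce_and_tag[:NONCE_LEN], nonce_and_tag[NONCE_LEN:]
    return aead_decrypt(key, nonce, ct, tag)                    # undo step 1


key = os.urandom(32)
reused_nonce = os.urandom(NONCE_LEN)   # deliberately reused for both messages
m1, m2 = b"attack at dawn!!", b"retreat at noon!"
c1, t1 = dds_encrypt(key, m1, reused_nonce)
c2, t2 = dds_encrypt(key, m2, reused_nonce)
```

With the nonce reused, the two combined tags still differ and XORing `c1` with `c2` no longer yields the XOR of the plaintexts, because the step 3 key depends on the tag and therefore on the message content.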

So what does this ritual accomplish other than to burn a few more CPU cycles?

Consider AES-GCM in the duplicate IV/nonce case. A duplicate IV/nonce both exposes GCM itself to attack and yields an identical one time pad making both messages encrypted with this duplicate IV insecure.

Yet in both cases GCM will yield an authentication tag that is data dependent in that it varies based on the content of the plaintext being encrypted.

Encrypting both the IV/nonce and authentication tag together (and in a way that mixes their bits) yields an opaque encrypted object that also becomes data-dependent. Unless the plaintext is also identical an attacker can no longer see that an IV was re-used. (Duplicating both IV and plaintext results in an absolutely identical encrypted message which only reveals that two of the same message were sent. This is an extremely low probability event with low to zero impact.)

Remember however that a duplicate IV/nonce with GCM (which uses CTR internally) yields an identical key stream. While an attacker can no longer watch the IV field for duplicates, he or she can still XOR messages together and look for known plaintext or low entropy in the result. This is why the ciphertext from GCM is also encrypted a second time using a data-dependent key. The final ciphertext is “infinitely scrambled” in that a single bit difference in the plaintext will change the entire ciphertext. XORing ciphertexts to look for collisions becomes useless.

The secondary ciphertext encryption algorithm used in step 3 remains an open question. The simple answer is to use AES again perhaps in ECB mode (as the input is ciphertext with no structure to reveal) and perhaps with a 128-bit key for a little bit better performance, but is that really necessary? Performance is important here and the only role of this cipher is to make XORing messages to look for duplicate key streams useless. Also note that keys are random and one time use and chosen plaintext attacks are essentially impossible. Could a much faster but weaker cipher like reduced round AES-128 be used? We’d love it if anyone with a deeper knowledge of cryptography could provide insight.

The performance of this scheme (using AES-128 in ECB mode for step 3) on systems with AES hardware acceleration remains higher than the performance of Salsa20 with Poly1305. A single core of a Core i7 running at 2.9 GHz can run this entire construction at over 10 Gbps.

NIST ECC Curve Support

Last, and for most users probably least, we are adding support for a new identity type that includes both NIST (P-384) and Curve25519 elliptic curve keys. We introduced this to support linking against FIPS-compliant libraries.

Our new identity type (1) includes both Curve25519 and NIST P-384 keys, with the former included to allow new type 1 identities to execute key agreement with older type 0 identities. It also allows type 0 identities to be upgraded without changing a node’s ZeroTier address, though this won’t happen automatically to avoid messing up people’s configurations.

Signatures use P-384 ECDSA but include the Curve25519/Ed25519 public keys with the input to be signed, ensuring that the two keys cannot be de-coupled. Key agreement uses both Curve25519 and NIST P-384 (ECDH) and then hashes the resulting shared secrets together. This also ensures the keys can’t be de-coupled as well as mitigating some of the concerns people have about NIST curve security.
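A rough sketch of the two coupling mechanisms, under several assumptions: the hash is taken to be SHA-384, the inputs are simply concatenated in a fixed order, and the function names `combine_shared_secrets` and `signing_input` are hypothetical. The text specifies none of these details. The ECDH and ECDSA operations themselves are omitted and the secrets and public keys are placeholder bytes.

```python
import hashlib

C25519_SECRET_LEN = 32  # size of a Curve25519 shared secret
P384_SECRET_LEN = 48    # size of a P-384 ECDH shared secret (x coordinate)


def combine_shared_secrets(c25519_secret: bytes, p384_secret: bytes) -> bytes:
    """Hash both ECDH results together (SHA-384 and ordering are assumptions)."""
    if len(c25519_secret) != C25519_SECRET_LEN or len(p384_secret) != P384_SECRET_LEN:
        raise ValueError("unexpected shared secret length")
    return hashlib.sha384(c25519_secret + p384_secret).digest()


def signing_input(message: bytes, c25519_public: bytes, ed25519_public: bytes) -> bytes:
    """Bytes handed to P-384 ECDSA: the message plus the other public keys."""
    return message + c25519_public + ed25519_public


# Placeholder values standing in for real key agreement outputs.
secret_a = bytes(range(32))
secret_b = bytes(range(48))
session_key = combine_shared_secrets(secret_a, secret_b)
```

Because both secrets feed one hash, an attacker would need to break both curves to recover the session key, and a signature made over `signing_input` is only valid for that specific pairing of keys.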

(We doubt the NIST curves are backdoored because NSA continues to approve them for top secret data post-Snowden. Snowden showed how easy it is for insiders to leak secrets, and the leak of a backdoor in the encryption used by US intelligence and military services for top secret data would be catastrophic. We’re not sure the NSA would risk suddenly granting Russia, Iran, Syria, North Korea, and ISIS the ability to decrypt and take over all kinds of military communications in exchange for a bit of signals intelligence. The most likely explanation for the lack of rigidity in the NIST curve constants is that nobody was thinking about rigidity back then. The unexplained hashes used to generate them are probably hashes of the name of some NSA employee’s cat or something, and nobody probably remembers.)

We have not decided yet whether this new type will be the default for newly generated identities or whether it will need to be selected. It doesn’t decrease security but also probably doesn’t increase it much if at all. We’ll likely leave the default type set to 0 for now.