Thursday, May 29, 2014

Optimizing the Illumos Kernel Crypto Framework

Recently I've had some motivation to look into the KCF on Illumos and discovered that, unbeknownst to me, we already had an AES-NI implementation that was automatically enabled when running on Intel and AMD CPUs with AES-NI support. This work was done back in 2010 by Dan Anderson.
This was great news, so I set out to test its performance on Illumos in a VM on my Mac with a Core i5-3210M (2.5 GHz base, 3.1 GHz turbo). For a sense of scale, I first ran the OpenSSL suite's "speed" utility (version 1.0.1f):
$ openssl speed -evp aes-128-[ecb|ctr|cbc|gcm]
And here are the results (I've omitted the small-block results, I'm only concerned with best-case throughput here):
OpenSSL encryption performance (8 KB blocks):

  AES-128/ECB    3464 MB/s
  AES-128/CTR    3185 MB/s
  AES-128/CBC     586 MB/s
  AES-128/GCM    1026 MB/s
Those are some pretty impressive numbers for single threaded performance, so let's analyze them a bit.

Block Cipher Modes Of Operation

Although this article isn't really about encryption modes, I'd like to elaborate briefly on the performance values we saw above and how they arise.
Symmetric block ciphers like AES work on chunks of data that are, quite obviously, called blocks and are of a fixed length (128 bits in the case of AES). The cipher is a little black box: you throw in an unencrypted (plaintext) block together with the secret encryption key, and out pops an encrypted (ciphertext) block of the same length. For decryption, the reverse is the case: you give it the ciphertext block and the same key (hence the term "symmetric" - same key for both encryption and decryption), and out pops the plaintext block.
This would be great if all messages we ever sent were of exactly this block length and never repeated (since the same plaintext block encrypted with the same key always produces the same ciphertext), but in reality that's not how things tend to work. So we need a method of extending a block cipher to arbitrary-length messages and providing some additional protection so that identical messages cannot be easily identified by an attacker. This is called a block cipher mode of operation.
ECB (Electronic CodeBook) mode is the simplest and works by just running the encryption on each block by itself. Useful as a correctness experiment and for gauging maximum algorithm performance, but not really usable for anything in the real world, as this screenshot from Wikipedia's article on block cipher modes nicely shows:
[Image: the ECB-encrypted Tux from Wikipedia's article. Doesn't really conceal much, does it? Original image credit: Larry Ewing]
The rest of the modes discussed are actually secure modes (when used properly).
CTR (Counter) mode costs a little bit for the counter calculations, but is capable of approaching the theoretical maximum throughput of ECB due to there being no inter-block dependencies.
CBC (Cipher Block Chaining) mode encryption, on the other hand, is inherently slow due to the way it continuously stalls the CPU instruction pipeline. This is because the encryption input of subsequent blocks depends on the encryption results of previous blocks, so the algorithm "interlocks" the pipeline all the time. Decryption fortunately does not suffer from this problem, so it's much faster (CTR-level fast).
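To make the pipeline argument concrete, here's a toy sketch (plain C, nothing to do with the actual KCF code) of the data dependencies in the two modes; encrypt_block() is abstracted behind a function pointer standing in for one AES block encryption:

#include <stdint.h>
#include <string.h>

#define	BLOCKLEN	16	/* AES block size in bytes */

/* Stand-in for a single AES block encryption. */
typedef void (*blk_enc_t)(const void *key, const uint8_t *in, uint8_t *out);

static void
xor_block(uint8_t *dst, const uint8_t *src)
{
	for (int i = 0; i < BLOCKLEN; i++)
		dst[i] ^= src[i];
}

/* CBC: block i can't start until block i-1 is done, a serial chain. */
static void
cbc_encrypt(blk_enc_t enc, const void *key, const uint8_t *iv,
    const uint8_t *pt, uint8_t *ct, size_t nblk)
{
	const uint8_t *prev = iv;

	for (size_t i = 0; i < nblk; i++) {
		uint8_t tmp[BLOCKLEN];

		memcpy(tmp, pt + i * BLOCKLEN, BLOCKLEN);
		xor_block(tmp, prev);		/* depends on previous output */
		enc(key, tmp, ct + i * BLOCKLEN);
		prev = ct + i * BLOCKLEN;
	}
}

/* CTR: each keystream block depends only on the counter, so blocks are independent. */
static void
ctr_encrypt(blk_enc_t enc, const void *key, uint64_t ctr,
    const uint8_t *pt, uint8_t *ct, size_t nblk)
{
	for (size_t i = 0; i < nblk; i++) {
		uint8_t cb[BLOCKLEN] = { 0 }, ks[BLOCKLEN];
		uint64_t c = ctr + i;		/* no inter-block dependency */

		memcpy(cb + 8, &c, sizeof (c));	/* simplified counter layout */
		enc(key, cb, ks);
		memcpy(ct + i * BLOCKLEN, pt + i * BLOCKLEN, BLOCKLEN);
		xor_block(ct + i * BLOCKLEN, ks);
	}
}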
GCM (Galois/Counter Mode) is a modification of the CTR mode, so it parallelizes very well. The added cost (and dip in performance) is caused by the fairly expensive GHASH function, but you do get something in return: an authentication tag, so you don't need to compute a separate MAC using something like HMAC-SHA anymore.
From this we can draw the following conclusions:
  • ECB is a nice tool for testing maximum performance, but is not really useful for anything that needs any security.
  • Use CTR when you're dealing with data that's already been authenticated, or GCM when the data hasn't been authenticated.
  • Avoid CBC like the plague.

Vanilla KCF Performance

So now comes the test for the KCF. I wrote a quick'n'dirty crypto test module that just performed a bunch of encryption operations and timed the results:
KCF encryption performance (128 KB operations):

  AES-128/ECB    117 MB/s
  AES-128/CTR    110 MB/s
  AES-128/CBC    112 MB/s
  AES-128/GCM     56 MB/s
What the hell is that?! This is just plain unacceptable. Obviously we must have hit some nasty performance snag somewhere, because this is comical. And sure enough, we did. When looking around in the AES-NI implementation I came across this bit in aes_intel.s:
#define PROTECTED_CLTS \
 CLTS
#define CLEAR_TS_OR_PUSH_XMM0_XMM1(tmpreg) \
 push %rbp; \
 mov %rsp, %rbp; \
 movq %cr0, tmpreg; \
 testq $CR0_TS, tmpreg; \
 jnz 1f; \
 and $-XMM_ALIGN, %rsp; \
 sub $[XMM_SIZE * 2], %rsp; \
 movaps %xmm0, 16(%rsp); \
 movaps %xmm1, (%rsp); \
 jmp 2f; \
1: \
 PROTECTED_CLTS; \
2:

/* ... snip ... */

ENTRY_NP(aes_encrypt_intel)
 CLEAR_TS_OR_PUSH_XMM0_XMM1(%r10)
        ...
This is a problem:
3.1.2 Instructions That Cause VM Exits Conditionally 
• CLTS. The CLTS instruction causes a VM exit if the bits in position 3 (corresponding to CR0.TS) are set in both the CR0 guest/host mask and the CR0 read shadow.
The CLTS instruction signals to the CPU that we're about to use FPU registers (which is needed for AES-NI), which in VMware causes an exit into the hypervisor. And we've been doing it for every single AES block! Needless to say, performing the equivalent of a very expensive context switch every 16 bytes is going to hurt encryption performance a bit.
The reason the kernel issues CLTS is that, for performance reasons, it doesn't save and restore FPU register state on kernel thread context switches. So whenever we need to use FPU registers inside the kernel, we must bracket that work with kpreempt_disable() and kpreempt_enable() and save and restore the FPU register state manually. During this time we cannot be descheduled (because if we were, some other thread might clobber our FPU registers), so if a thread stays in this state for too long, it can lead to unexpected latency bubbles (not to mention deadlocks if you happen to block in this context while running on a uniprocessor).
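In code, that pattern (including the chunking the fix below introduces) looks roughly like this minimal sketch; kpreempt_disable()/kpreempt_enable() are the real illumos calls, while save_fpu_state(), restore_fpu_state() and aes_ni_encrypt_blocks() are hypothetical stand-ins for the actual routines:

#include <sys/types.h>

#define	AES_CHUNK_BYTES	(32 * 1024)	/* interval between preemption barriers */

static void
encrypt_chunked(const void *key, const uint8_t *in, uint8_t *out, size_t len)
{
	while (len > 0) {
		size_t chunk = (len < AES_CHUNK_BYTES) ? len : AES_CHUNK_BYTES;

		kpreempt_disable();	/* we must not be descheduled now */
		save_fpu_state();	/* hypothetical: preserve XMM regs */
		aes_ni_encrypt_blocks(key, in, out, chunk);
		restore_fpu_state();	/* hypothetical */
		kpreempt_enable();	/* latency bubble ends here */

		in += chunk;
		out += chunk;
		len -= chunk;
	}
}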
The solution was to restructure the AES and KCF block crypto implementations so that encryption is executed in reasonably small chunks. I opted for 32 KB, for reasons I'll explain below. Unfortunately, this restructuring was a bit more complicated than one would imagine, because in the KCF the AES encryption algorithm and the block cipher modes live in two separate modules that interact through an internal API, and that API wasn't really conducive to high performance (we'll get to that later). Anyway, having fixed the issue here and running the code at near-native speed, this is what I get:
KCF encryption performance with CLTS fix (128 KB operations):

  AES-128/ECB    790 MB/s
  AES-128/CTR    439 MB/s
  AES-128/CBC    483 MB/s
  AES-128/GCM    252 MB/s
Not disastrous anymore, but still very, very bad. Of course, you've got to keep in mind that the thing we're comparing against, OpenSSL, is no slouch. It's got hand-written, highly optimized inline assembly implementations of most of these encryption functions and their specific modes, for lots of platforms. That's a ton of code to maintain and optimize, but I'll be damned if I let this kind of performance gap persist.
Fixing this, however, is not so trivial anymore. It pertains to how the KCF's block cipher mode API interacts with the cipher algorithms. It is beautifully designed and implemented in a fashion that creates minimum code duplication, but that also means it's inherently inefficient. Some of the reasons the KCF API is such an unfortunate design from a performance perspective:
  • Rather than expecting nicely aligned input and output data, KCF will eat almost any garbage you feed it, be it a single block, a list of unaligned iovec's or a linked list of disparate message blocks (mblk's).
  • In order to support this crazy array of input and output options and keep the implementation pretty, the KCF block cipher modes internally do a lot of copying of blocks around into and out of temp buffers and invoke a callback encryption function sequentially for each and every one of them.
  • The implementation is built with no regard to auto-vectorization or even any attempt at exploiting wide register files inside of modern CPUs to prevent pipeline stalls (as is evident in the CTR mode implementation).
  • The GCM decryption implementation is almost criminal in that it copies all input into a temporary buffer and consumes it all in the final() call. If you try to decrypt and/or authenticate a few GB worth of data using a single AES key, the kernel is almost guaranteed to soil itself trying to gobble up all of your input data before starting to produce any output (luckily the GCM algorithm is limited to 64GB of data per key, so at least there is an upper bound to this nonsense).
So I set about fixing all of these (and many more) problems, one by one.

Convoluted API

The solution here was to design "fast path" shortcuts we can take when conditions allow for it. For example, while the caller can pass us variously unaligned data and scattered output buffers, in most performance-sensitive situations, they won't, so we can optimize for that. So I've added conditions to all of the major block encryption calls to detect situations when we can shortcut the horribly slow loops with lots of copying and instead perform the work with much less effort:
/*
 * CBC encryption fastpath requirements:
 * - fastpath is enabled
 * - algorithm-specific acceleration function is available
 * - input is block-aligned
 * - output is a single contiguous region or the user requested that
 *   we overwrite their input buffer (input/output aliasing allowed)
 */
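To give a flavor of what such a check looks like, here's a hedged sketch; crypto_data_t, CRYPTO_DATA_RAW and AES_BLOCK_LEN are real KCF/AES definitions, while the ctx fields used below are illustrative approximations rather than the exact ones:

/*
 * Sketch of the fastpath test described in the comment above; the
 * fastpath_enabled and encrypt_blocks fields are illustrative only.
 */
static boolean_t
cbc_fastpath_ok(const cbc_ctx_t *ctx, size_t len, const crypto_data_t *out)
{
	return (ctx->fastpath_enabled &&		/* fastpath is enabled */
	    ctx->encrypt_blocks != NULL &&		/* accelerated function exists */
	    (len & (AES_BLOCK_LEN - 1)) == 0 &&		/* block-aligned input */
	    (out == NULL ||				/* in-place output, or */
	    out->cd_format == CRYPTO_DATA_RAW));	/* one contiguous buffer */
}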

Cipher Algorithm-Specific Fastpath Functions

ECB, CBC and CTR gained the ability to pass an algorithm-specific "fastpath" implementation of the block cipher mode, because these functions benefit greatly from pipelining multiple cipher calls into a single place.
ECB, CTR and CBC decryption benefit enormously from being able to exploit the wide XMM register file on Intel to perform encryption/decryption operations on 8 blocks at the same time in a non-interlocking manner. The performance gains here are on the order of 5-8x.
CBC encryption benefits from not having to copy the previously encrypted ciphertext blocks into memory and back into registers to XOR them with the subsequent plaintext blocks, though here the gains are more modest, around 1.3-1.5x.
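As an illustration of why the interleaving helps (this is not the KCF code itself, which processes eight blocks at a time in assembly), here's a toy AES-NI intrinsics routine that pushes four independent ECB blocks through each round together, so their aesenc results don't form one long dependency chain:

/* Requires AES-NI; compile with -maes. rk[] holds the 11 AES-128 round keys. */
#include <wmmintrin.h>

void
aes128_ecb_encrypt4(const __m128i rk[11], __m128i blk[4])
{
	int r, i;

	for (i = 0; i < 4; i++)			/* initial AddRoundKey */
		blk[i] = _mm_xor_si128(blk[i], rk[0]);
	for (r = 1; r < 10; r++)		/* rounds 1-9, interleaved */
		for (i = 0; i < 4; i++)
			blk[i] = _mm_aesenc_si128(blk[i], rk[r]);
	for (i = 0; i < 4; i++)			/* final round */
		blk[i] = _mm_aesenclast_si128(blk[i], rk[10]);
}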

GHASH Acceleration Ported From OpenSSL

The GCM algorithm benefits greatly from having the most expensive part of the GHASH function performed in hardware using the PCLMULQDQ carry-less multiplication instruction. The existing KCF implementation already used that, but it suffered from the same CLTS performance regression as AES-NI did, and the horribly slow general-purpose block cipher mode code path in gcm.c meant that it couldn't really shine. Plus, there has been recent progress on parallelizing GHASH using pipelined Karatsuba multipliers, which the KCF didn't have. So I added a fastpath implementation to GCM which does an initial fast CTR pass over the data and then computes GHASH from the encryption output using the optimized GHASH implementation ported over from OpenSSL.
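Structurally, the encryption-side fastpath looks roughly like the sketch below; gcm_ctx_t and CRYPTO_SUCCESS are real KCF names, but the field names and the ctr_fastpath()/ghash_update() helpers are hypothetical stand-ins for the accelerated AES-NI and PCLMULQDQ routines:

/* Structural sketch only, not the actual gcm.c code. */
static int
gcm_encrypt_fastpath(gcm_ctx_t *ctx, const uint8_t *pt, uint8_t *ct,
    size_t len)
{
	/* One fast CTR pass produces all of the ciphertext. */
	ctr_fastpath(ctx->gcm_keysched, ctx->gcm_cb, pt, ct, len);

	/* Then GHASH is computed over the encryption output. */
	ghash_update(ctx->gcm_ghash, ctx->gcm_H, ct, len);

	return (CRYPTO_SUCCESS);
}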

So How Does It All Stack Up?

After all of this work, this is how the results now look on Illumos, even inside of a VM:
KCF with performance fixes (128 KB operations):

  AES-128/ECB    3764 MB/s
  AES-128/CTR    3121 MB/s
  AES-128/CBC     691 MB/s
  AES-128/GCM    1053 MB/s
On the decryption side of things, CBC decryption also jumped from 627 MB/s to 3011 MB/s. Seeing these performance numbers, you can see why I chose 32 KB for the operation size in between kernel preemption barriers. Even on the slowest hardware with AES-NI we can expect at least 300-400 MB/s per core of throughput, so even in the worst case we'll be hogging the CPU for at most ~0.1 ms per run (32 KB at 300 MB/s is roughly 0.1 ms).
Pretty bar graphs of our new overall position:
[Bar graph: KCF vs. OpenSSL AES-128 throughput after the fixes]
Overall, we're even a little bit faster than OpenSSL in some tests, though that's probably down to us encrypting 128k blocks vs 8k in the "openssl speed" utility. Anyway, having fixed this monstrous atrocity of a performance bug, I can now finally get some sleep.

Testing

To verify that nothing has been messed up by these changes, I've written a small kernel module that exercises these algorithms against a set of known good test vectors. The module is available on GitHub.
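For reference, the simplest such check is a known-answer test against a published vector; the sketch below uses the AES-128 example from FIPS-197 Appendix C, with kcf_encrypt_one_block() as a hypothetical helper standing in for however the module actually drives the KCF:

#include <stdint.h>
#include <string.h>

static const uint8_t kat_key[16] = {
	0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
	0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f
};
static const uint8_t kat_pt[16] = {
	0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77,
	0x88, 0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff
};
static const uint8_t kat_ct[16] = {		/* expected AES-128/ECB output */
	0x69, 0xc4, 0xe0, 0xd8, 0x6a, 0x7b, 0x04, 0x30,
	0xd8, 0xcd, 0xb7, 0x80, 0x70, 0xb4, 0xc5, 0x5a
};

static int
aes128_ecb_kat(void)
{
	uint8_t out[16];

	kcf_encrypt_one_block(kat_key, kat_pt, out);	/* hypothetical helper */
	return (memcmp(out, kat_ct, sizeof (out)) == 0 ? 0 : -1);
}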
