Thursday, May 29, 2014

Optimizing the Illumos Kernel Crypto Framework

Optimizing the Illumos Kernel Crypto Framework

Recently I've had some motivation to look into the KCF on Illumos and discovered that, unbeknownst to me, we already had an AES-NI implementation that was automatically enabled when running on Intel and AMD CPUs with AES-NI support. This work was done back in 2010 by Dan Anderson.
This was great news, so I set out to test the performance in Illumos in a VM on my Mac with a Core i5 3210M (2.5GHz normal, 3.1GHz turbo). For a sense of scale I first used OpenSSL suite's "speed" utility (version 1.0.1f):
$ openssl speed -evp aes-128-[ecb|ctr|cbc|gcm]
And here are the results (I've omitted the small-block results, I'm only concerned with best-case throughput here):
OpenSSL encryption performance
8k ops
3464 MB/s
3185 MB/s
586 MB/s
1026 MB/s
Those are some pretty impressive numbers for single threaded performance, so let's analyze them a bit.

Block Cipher Modes Of Operation

Although this article isn't really about encryption modes, I'd like to elaborate briefly on the performance values we saw above and how they arise.
Symmetric block ciphers like AES work on chunks of data that are, quite obviously, called blocks and are of a fixed length (128 bits in the case of AES). The cipher is a little black box, you throw in an unencrypted (plaintext) block together with the secret encryption key and out pops an encrypted (ciphertext) block of the same length. For decryption, the reverse is the case. You give it the ciphertext block and the same key (hence the term "symmetric" - same key for both encryption and decryption) and out pops the plaintext block. This would be great if all messages we ever sent were of this fixed block length and never repeated (since the same plaintext block with the same key will always produce the same ciphertext), but in reality this is not how things tend to work. So we need a method of extending a block cipher to work on arbitrary length messages and provide some additional security padding so that identical messages cannot be easily identified by an attacker. This is called a block cipher mode of operation.
ECB (Electronic CodeBook) mode is the simplest and works by just running the encryption on each block by itself. Useful as a correctness experiment and for gauging maximum algorithm performance, but not really usable for anything in the real world, as this screenshot from Wikipedia's article on block cipher modes nicely shows:
Doesn't really conceal much, does it? [original image credit: Larry Ewing]
The rest of the modes discussed are actually secure modes (when used properly).
CTR (Counter) mode costs a little bit for the counter calculations, but is capable of approaching the theoretical maximum throughput of ECB due to there being no inter-block dependencies.
CBC (Cipher Block Chaining) mode encryption, on the other hand, is inherently slow due to the way it continuously stalls the CPU instruction pipeline. This is because the encryption input of subsequent blocks depends on the encryption results of previous blocks, so the algorithm "interlocks" the pipeline all the time. Decryption fortunately does not suffer from this problem, so it's much faster (CTR-level fast).
GCM (Galois/Counter Mode) is a modification of the CTR mode, so it parallelizes very well. The added cost (and dip in performance) is caused by the fairly expensive GHASH function, but you do get something in return: an authentication tag, so you don't need to compute a separate MAC using something like HMAC-SHA anymore.
From this we can draw the following conclusions:
  • ECB is a nice tool for testing maximum performance, but is not really useful for anything that needs any security.
  • Use CTR when you're dealing with data that's already been authenticated, or GCM when the data hasn't been authenticated.
  • Avoid CBC like the plague.

Vanilla KCF Performance

So now comes the test for the KCF. I wrote a quick'n'dirty crypto test module that just performed a bunch of encryption operations and timed the results:
KCF encryption performance
128k ops
117 MB/s
110 MB/s
112 MB/s
56 MB/s
What the hell is that?! This is just plain unacceptable. Obviously we must have hit some nasty performance snag somewhere, because this is comical. And sure enough, we did. When looking around in the AES-NI implementation I came across this bit in aes_intel.s:
#define CLEAR_TS_OR_PUSH_XMM0_XMM1(tmpreg) \
 push %rbp; \
 mov %rsp, %rbp; \
 movq %cr0, tmpreg; \
 testq $CR0_TS, tmpreg; \
 jnz 1f; \
 and $-XMM_ALIGN, %rsp; \
 sub $[XMM_SIZE * 2], %rsp; \
 movaps %xmm0, 16(%rsp); \
 movaps %xmm1, (%rsp); \
 jmp 2f; \
1: \

/* ... snip ... */

This is a problem:
3.1.2 Instructions That Cause VM Exits Conditionally 
• CLTS. The CLTS instruction causes a VM exit if the bits in position 3 (corresponding to CR0.TS) are set in both the CR0 guest/host mask and the CR0 read shadow.
The CLTS instruction signals to the CPU that we're about to use FPU registers (which is needed for AES-NI), which in VMware causes an exit into the hypervisor. And we've been doing it for every single AES block! Needless to say, performing the equivalent of a very expensive context switch every 16 bytes is going to hurt encryption performance a bit.
The reason why the kernel is issuing CLTS is because for performance reasons, the kernel doesn't save and restore FPU register state on kernel thread context switches. So whenever we need to use FPU registers inside the kernel, we must disable kernel thread preemption via a call to kpreempt_disable() and kpreempt_enable() and save and restore FPU register state manually. During this time, we cannot be descheduled (because if we were, some other thread might clobber our FPU registers), so if a thread does this for too long, it can lead to unexpected latency bubbles (not to mention deadlocks if you happen to block in this context and you're running on a uniprocessor).
The solution was to restructure the AES and KCF block crypto implementations in such a way that we execute encryption in meaningfully small chunks. I opted for 32k bytes, for reasons which I'll explain below. Unfortunately, doing this restructuring work was a bit more complicated than one would imagine, since in the KCF the implementation of the AES encryption algorithm and the block cipher modes is separated into two separate modules that interact through an internal API, which wasn't really conducive to high performance (we'll get to that later). Anyway, having fixed the issue here and running the code at near native speed, this is what I get:
KCF encryption performance with CLTS fix
128k ops
790 MB/s
439 MB/s
483 MB/s
252 MB/s
Not disastrous anymore, but still, very, very bad. Of course, you've got keep in mind, the thing we're comparing it to, OpenSSL, is no slouch. It's got hand-written highly optimized inline assembly implementations of most of these encryption functions and their specific modes, for lots of platforms. That's a ton of code to maintain and optimize, but I'll be damned if I let this kind of performance gap persist.
Fixing this, however, is not so trivial anymore. It pertains to how the KCF's block cipher mode API interacts with the cipher algorithms. It is beautifully designed and implemented in a fashion that creates minimum code duplication, but this also means that it's inherently inefficient. Some of the reasons for which the KCF API is a very unfortunate design from a performance perspective, are:
  • Rather than expecting nicely aligned input and output data, KCF will eat almost any garbage you feed it, be it a single block, a list of unaligned iovec's or a linked list of disparate message blocks (mblk's).
  • In order to support this crazy array of input and output options and keep the implementation pretty, the KCF block cipher modes internally do a lot of copying of blocks around into and out of temp buffers and invoke a callback encryption function sequentially for each and every one of them.
  • The implementation is built with no regard to auto-vectorization or even any attempt at exploiting wide register files inside of modern CPUs to prevent pipeline stalls (as is evident in the CTR mode implementation).
  • The GCM decryption implementation is almost criminal in that it copies all input into a temporary buffer and consumes it all in the final() call. If you try to decrypt and/or authenticate a few GB worth of data using a single AES key, the kernel is almost guaranteed to soil itself trying to gobble up all of your input data before starting to produce any output (luckily the GCM algorithm is limited to 64GB of data per key, so at least there is an upper bound to this nonsense).
So I set about fixing all of these (and many more) problems, one by one.

Convoluted API

The solution here was to design "fast path" shortcuts we can take when conditions allow for it. For example, while the caller can pass us variously unaligned data and scattered output buffers, in most performance-sensitive situations, they won't, so we can optimize for that. So I've added conditions to all of the major block encryption calls to detect situations when we can shortcut the horribly slow loops with lots of copying and instead perform the work with much less effort:
 * CBC encryption fastpath requirements:
 * - fastpath is enabled
 * - algorithm-specific acceleration function is available
 * - input is block-aligned
 * - output is a single contiguous region or the user requested that
 *   we overwrite their input buffer (input/output aliasing allowed)

Cipher Algorithm-Specific Fastpath Functions

ECB, CBC and CTR gained the ability to pass an algorithm-specific "fastpath" implementation of the block cipher mode, because these functions benefit greatly from pipelining multiple cipher calls into a single place.
ECB, CTR and CBC decryption benefit enormously from being able to exploit the wide XMM register file on Intel to perform encryption/decryption operations on 8 blocks at the same time in a non-interlocking manner. The performance gains here are on the order of 5-8x.
CBC encryption benefits from not having to copy the previously encrypted ciphertext blocks into memory and back into registers to XOR them with the subsequent plaintext blocks, though here the gains are more modest, around 1.3-1.5x.

GHASH Acceleration Ported From OpenSSL

The GCM algorithm benefits greatly from having the most expensive part of the GHASH function performed in hardware using the new PCLMULQDQ carry-less multiplication instruction. The existing implementation of the KCF already used that, but it suffered from the same CLTS performance regression as AES-NI did and the horribly slow implementation inside of the general-purpose block cipher mode code path in gcm.c meant that the implementation couldn't really shine. Plus, there's been recent progress on parallelization of GHASH done using pipelined Karatsuba multipliers, which wasn't in the KCF. So I added a fastpath implementation to GCM which does an initial fast CTR pass over the data and then computes GHASH from the encryption output using the optimized GHASH implementation ported over from OpenSSL.

So How Does It All Stack Up?

After all of this work, this is how the results now look on Illumos, even inside of a VM:
KCF with performance fixes
128k ops
3764 MB/s
3121 MB/s
691 MB/s
1053 MB/s
On the decryption side of things, CBC decryption also jumped from 627 MB/s to 3011 MB/s. Seeing these performance numbers, you can see why I chose 32k for the operation size in between kernel preemption barriers. Even on the slowest hardware with AES-NI, we can expect at least 300-400 MB/s/core of throughput, so even in the worst case, we'll be hogging the CPU for at most ~0.1ms per run.
Pretty bar graphs of our new overall position:

Overall, we're even a little bit faster than OpenSSL in some tests, though that's probably down to us encrypting 128k blocks vs 8k in the "openssl speed" utility. Anyway, having fixed this monstrous atrocity of a performance bug, I can now finally get some sleep.


To verify that nothing has been messed up by these changes, I've written a small kernel module that exercises these algorithms against a set of known good test vectors. The module is available on GitHub.

Thursday, May 22, 2014

Building a ZFS Storage Appliance (part 2)

So it's been a long time in the making, but here's the second part to making a ZFS storage appliance. This part concerns itself mostly with how to get COMSTAR and iSCSI targets up and running. It's kinda short, seeing as most of the information is already included in the stmf-ha manpage, but I'd still like to take this opportunity to reiterate the fundamentals.


So COMSTAR is this great COMmon Scsi TARget subsystem in Illumos that allows you to turn the box into a true SAN array. It has interconnect support for iSCSI, FC, SRP and iSER, but for our purposes I'm just going to focus on iSCSI, since that's the one I'm most familiar with.
iSCSI is really just a method of sending SCSI commands over TCP/IP, allowing you to provide storage services to other devices on a TCP/IP network. This article isn't primarily intended to teach you all the ins and outs of iSCSI, so if you want to know more, I suggest you head over to your friendly professor Wikipedia and learn all about iSCSI.
The primary problem with COMSTAR is that its configuration is kind of, well, let's say "clumsy". The configuration store is part of the SMF service database (which is stored in SQLite), and even if we could get at it by using the svccfg(1M) command, the contents itself is a bunch of packed nvlists and various binary blobs. This is further complicated by the fact that we can't just write out a slightly modified configuration to the SMF service store and have the kernel pick up the differences easily. What the COMSTAR administration commands do is they actually tell the kernel to set up each portion of the stored configuration using specific ioctl() calls. This makes programmatic modification of only portions of the running configuration on a system very complicated.


To circumvent this, I've resorted to a different approach. Instead of keeping the stored COMSTAR configuration as authoritative and then attempt to somehow programmatically modify it and then hope to get its run-time reconfiguration right via the tons of undocumented or poorly documented ioctl() interfaces, I've resorted to ignore the stored configuration entirely. Luckily COMSTAR supports a "no persistence" option in the service configuration, so that any configuration commands issued don't actually modify the persistent configuration in the SMF configuration store. This pretty much means that any time the machine is rebooted, the COMSTAR configuration will be entirely empty and the machine won't try to do anything. That's good and what we want, because in the next step we're going to tell it what to do from our cluster control software. This is similar to what we do in the Heartbeat resource script for ZFS, which explicitly ignores the ZFS cache file to avoid machines auto-importing pools at boot up.
The next step involved writing a program that is capable of using the standard COMSTAR administration commands to set up a running state in COMSTAR to our liking. Naturally, we need to store the desired SCSI target and LU configuration in some place, and what better place to choose than the ZFS pool from which we'll be exporting volumes and migrating between clustered machines. That's why I wrote a simple(ish) shell script called stmf-ha that can be invoked by cluster control software to reconstruct the running state of COMSTAR when we want to import the pool and tear it down when want to we export the pool.

Integrating stmf-ha into the cluster

In part 1 of this guide we've set up Heartbeat and Pacemaker to provide clustering services to our storage array. We've installed the custom ZFS resource script from stmf-ha into Heartbeat to teach our clustering software how to import & export ZFS pools and then we've set up one or more ZFS pools to work on. For NFS this would have been enough, since the NFS configuration is stored on the pool itself and Illumos automatically restores it at pool import, but for COMSTAR we need do this ourselves.
The stmf-ha package includes a script called zfs-helper. Copy this file into the /opt/ha/lib/ocf/lib/heartbeat/helpers directory (create it if necessary) and of course the stmf-ha script itself into some place where it can be invoked with a standard PATH environment variable for root (e.g. /usr/sbin) - alternatively, you can modify the STMF_HA variable in the zfs-helper script to point to where you've placed stmf-ha. The helper script is invoked by the ZFS resource script in Heartbeat to perform additional setup and teardown operations before and after pool import and export. The helper script then invokes stmf-ha after import has succeeded and just prior to export, passing it the pool name we're manipulating.

Configuring iSCSI resources in stmf-ha

So now that we've got stmf-ha installed and integrated into the clustering software, we can begin to create ZFS volumes and exporting them via iSCSI to initiators. I will assume you are familiar with general iSCSI nomenclature and the principles of how to configure iSCSI in COMSTAR.
The simplest way to start testing is to simply create an empty "stmf-ha.conf" file in the root of the ZFS pool. This simply tells stmf-ha that you want to export all of the ZFS volumes on that pool as iSCSI LUs under a default iSCSI target without any access restrictions. This is good for testing, but once you get things going, you'll probably want to lock the setup down a little bit better.
See the manpage of stmf-ha(1M) (copy stmf-ha.1m to /usr/share/man/man1m on your machine and then type "man stmf-ha") - it explains all the special cases and methods of how to configure your pool, your target portal groups and various other access criteria. Also have a look at the sample configuration file which will help you get started fairly quickly.
Once a pool is imported, you can also make changes to both the stmf-ha configuration and to the list of exported ZFS volumes. To reload configuration changes to the script, or e.g. when creating a new ZFS volume you want to export, simply issue the "stmf-ha start <poolname>" command again. The stmf-ha script will re-read the configuration file, the running state of COMSTAR and the pool state and reapply things so that everything that should be exported is exported. Again, please read the manpage, there's lots of info there on what stmf-ha can do and where you'll have to nurse it a bit.

Performance considerations

Please keep in mind that stmf-ha and COMSTAR configuration takes some time. This is especially evident when trying to fail over a pool that's taking a lot of load, since offlining a heavily loaded LU takes a few seconds while we wait for I/O to the LU to cease. In most cases this shouldn't be an issue, especially if your initiators know how to handle targets that go away for a while to do some cluster fail-over (e.g. VMware will hold VM I/O for up to ~120s before declaring the datastore inaccessible), but keep in mind to test, test, test prior to deployment in production - try pulling power cords, network links, hard drives and killing the odd process on the box to simulate out-of-memory conditions. Ultimately there's nothing you can do to prepare yourself for every eventuality out there, but you at least want to understand and verify how the system behaves in the most common failure scenarios. In clustering, predictability is the name of the game, so when you're unsure what's going on, don't change anything.