Thursday, May 29, 2014

Optimizing the Illumos Kernel Crypto Framework


Recently I've had some motivation to look into the KCF (Kernel Crypto Framework) on Illumos and discovered that, unbeknownst to me, we already had an AES-NI implementation that was automatically enabled when running on Intel and AMD CPUs with AES-NI support. This work was done back in 2010 by Dan Anderson.
This was great news, so I set out to test the performance in Illumos in a VM on my Mac with a Core i5 3210M (2.5GHz normal, 3.1GHz turbo). For a sense of scale I first used OpenSSL suite's "speed" utility (version 1.0.1f):
$ openssl speed -evp aes-128-[ecb|ctr|cbc|gcm]
And here are the results (I've omitted the small-block results, I'm only concerned with best-case throughput here):
OpenSSL encryption performance
Algorithm/Mode    Throughput (8 KiB blocks)
AES-128/ECB       3464 MB/s
AES-128/CTR       3185 MB/s
AES-128/CBC        586 MB/s
AES-128/GCM       1026 MB/s
Those are some pretty impressive numbers for single threaded performance, so let's analyze them a bit.

Block Cipher Modes Of Operation

Although this article isn't really about encryption modes, I'd like to elaborate briefly on the performance values we saw above and how they arise.
Symmetric block ciphers like AES work on chunks of data that are, quite obviously, called blocks and are of a fixed length (128 bits in the case of AES). The cipher is a little black box: you throw in an unencrypted (plaintext) block together with the secret encryption key and out pops an encrypted (ciphertext) block of the same length. For decryption, the reverse applies: you give it the ciphertext block and the same key (hence the term "symmetric" - same key for both encryption and decryption) and out pops the plaintext block. This would be great if all messages we ever sent were exactly this fixed block length and never repeated (since the same plaintext block with the same key will always produce the same ciphertext), but in reality this is not how things tend to work. So we need a method of extending a block cipher to work on arbitrary-length messages and of adding some variability so that identical messages cannot be easily identified by an attacker. This is called a block cipher mode of operation.
ECB (Electronic CodeBook) mode is the simplest and works by just running the encryption on each block by itself. Useful as a correctness experiment and for gauging maximum algorithm performance, but not really usable for anything in the real world, as this screenshot from Wikipedia's article on block cipher modes nicely shows:
Doesn't really conceal much, does it? [original image credit: Larry Ewing]
The rest of the modes discussed are actually secure modes (when used properly).
CTR (Counter) mode costs a little bit for the counter calculations, but is capable of approaching the theoretical maximum throughput of ECB due to there being no inter-block dependencies.
CBC (Cipher Block Chaining) mode encryption, on the other hand, is inherently slow due to the way it continuously stalls the CPU instruction pipeline. This is because the encryption input of subsequent blocks depends on the encryption results of previous blocks, so the algorithm "interlocks" the pipeline all the time. Decryption fortunately does not suffer from this problem, so it's much faster (CTR-level fast).
GCM (Galois/Counter Mode) is a modification of the CTR mode, so it parallelizes very well. The added cost (and dip in performance) is caused by the fairly expensive GHASH function, but you do get something in return: an authentication tag, so you don't need to compute a separate MAC using something like HMAC-SHA anymore.
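To make the inter-block dependency concrete, here's a minimal C sketch of the two encryption loops (illustration only - aes_encrypt_block() stands in for a hypothetical single-block AES primitive, it is not an actual KCF or OpenSSL call):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical single-block AES primitive, for illustration only. */
void aes_encrypt_block(const void *key, const uint8_t in[16], uint8_t out[16]);

/* CBC: block i cannot start until block i-1 has been encrypted. */
void
cbc_encrypt(const void *key, const uint8_t *pt, uint8_t *ct, size_t nblk,
    const uint8_t iv[16])
{
	uint8_t prev[16], tmp[16];

	memcpy(prev, iv, 16);
	for (size_t i = 0; i < nblk; i++) {
		for (int j = 0; j < 16; j++)
			tmp[j] = pt[i * 16 + j] ^ prev[j];	/* needs previous ciphertext */
		aes_encrypt_block(key, tmp, &ct[i * 16]);
		memcpy(prev, &ct[i * 16], 16);			/* serializes the loop */
	}
}

/* CTR: each block depends only on the counter value, so the iterations are
 * independent and can be pipelined or vectorized freely. */
void
ctr_encrypt(const void *key, const uint8_t *pt, uint8_t *ct, size_t nblk,
    uint64_t ctr)
{
	for (size_t i = 0; i < nblk; i++) {
		uint8_t cb[16] = { 0 }, ks[16];
		uint64_t c = ctr + i;		/* simplified counter layout */

		memcpy(&cb[8], &c, sizeof (c));
		aes_encrypt_block(key, cb, ks);
		for (int j = 0; j < 16; j++)
			ct[i * 16 + j] = pt[i * 16 + j] ^ ks[j];
	}
}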
From this we can draw the following conclusions:
  • ECB is a nice tool for testing maximum performance, but is not really useful for anything that needs any security.
  • Use CTR when you're dealing with data that's already been authenticated, or GCM when the data hasn't been authenticated.
  • Avoid CBC like the plague.

Vanilla KCF Performance

So now comes the test for the KCF. I wrote a quick'n'dirty crypto test module that just performed a bunch of encryption operations and timed the results:
KCF encryption performance
Algorithm/Mode    Throughput (128 KiB blocks)
AES-128/ECB       117 MB/s
AES-128/CTR       110 MB/s
AES-128/CBC       112 MB/s
AES-128/GCM        56 MB/s
What the hell is that?! This is just plain unacceptable. Obviously we must have hit some nasty performance snag somewhere, because this is comical. And sure enough, we did. When looking around in the AES-NI implementation I came across this bit in aes_intel.s:
#define PROTECTED_CLTS \
 CLTS
#define CLEAR_TS_OR_PUSH_XMM0_XMM1(tmpreg) \
 push %rbp; \
 mov %rsp, %rbp; \
 movq %cr0, tmpreg; \
 testq $CR0_TS, tmpreg; \
 jnz 1f; \
 and $-XMM_ALIGN, %rsp; \
 sub $[XMM_SIZE * 2], %rsp; \
 movaps %xmm0, 16(%rsp); \
 movaps %xmm1, (%rsp); \
 jmp 2f; \
1: \
 PROTECTED_CLTS; \
2:

/* ... snip ... */

ENTRY_NP(aes_encrypt_intel)
 CLEAR_TS_OR_PUSH_XMM0_XMM1(%r10)
        ...
This is a problem:
3.1.2 Instructions That Cause VM Exits Conditionally 
• CLTS. The CLTS instruction causes a VM exit if the bits in position 3 (corresponding to CR0.TS) are set in both the CR0 guest/host mask and the CR0 read shadow.
The CLTS instruction signals to the CPU that we're about to use FPU registers (which is needed for AES-NI), which in VMware causes an exit into the hypervisor. And we've been doing it for every single AES block! Needless to say, performing the equivalent of a very expensive context switch every 16 bytes is going to hurt encryption performance a bit.
The reason why the kernel is issuing CLTS is because for performance reasons, the kernel doesn't save and restore FPU register state on kernel thread context switches. So whenever we need to use FPU registers inside the kernel, we must disable kernel thread preemption via a call to kpreempt_disable() and kpreempt_enable() and save and restore FPU register state manually. During this time, we cannot be descheduled (because if we were, some other thread might clobber our FPU registers), so if a thread does this for too long, it can lead to unexpected latency bubbles (not to mention deadlocks if you happen to block in this context and you're running on a uniprocessor).
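In code terms, the contract looks roughly like the sketch below. kpreempt_disable() and kpreempt_enable() are the real illumos calls; fpu_state_t, save_fpu_regs(), restore_fpu_regs() and aes_ni_encrypt_chunk() are placeholder names I'm using purely for illustration, not actual KCF symbols. The key point is that the protected region has to be kept short, which is what the restructuring described next accomplishes.

/*
 * Illustration only - not the actual KCF code. The FPU helpers and the
 * AES-NI worker are hypothetical; only kpreempt_disable()/kpreempt_enable()
 * are real illumos kernel calls.
 */
static void
encrypt_fpu_protected(void *key, const uint8_t *in, uint8_t *out, size_t len)
{
	fpu_state_t fpu;

	kpreempt_disable();	/* no thread switch can clobber our XMM state now */
	save_fpu_regs(&fpu);	/* preserve whatever was in the FPU/XMM registers */

	aes_ni_encrypt_chunk(key, in, out, len);	/* uses XMM registers */

	restore_fpu_regs(&fpu);
	kpreempt_enable();	/* preemption point - keep 'len' small so we never
				 * hog the CPU for too long */
}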
The solution was to restructure the AES and KCF block crypto implementations in such a way that we execute encryption in meaningfully small chunks. I opted for 32k bytes, for reasons which I'll explain below. Unfortunately, doing this restructuring work was a bit more complicated than one would imagine, since in the KCF the implementation of the AES encryption algorithm and the block cipher modes is separated into two separate modules that interact through an internal API, which wasn't really conducive to high performance (we'll get to that later). Anyway, having fixed the issue here and running the code at near native speed, this is what I get:
KCF encryption performance with CLTS fix
Algorithm/Mode    Throughput (128 KiB blocks)
AES-128/ECB       790 MB/s
AES-128/CTR       439 MB/s
AES-128/CBC       483 MB/s
AES-128/GCM       252 MB/s
Not disastrous anymore, but still very, very bad. Of course, you've got to keep in mind that the thing we're comparing it to, OpenSSL, is no slouch. It's got hand-written, highly optimized inline assembly implementations of most of these encryption functions and their specific modes, for lots of platforms. That's a ton of code to maintain and optimize, but I'll be damned if I let this kind of performance gap persist.
Fixing this, however, is not so trivial anymore. It pertains to how the KCF's block cipher mode API interacts with the cipher algorithms. It is beautifully designed and implemented in a fashion that creates minimum code duplication, but this also means that it's inherently inefficient. Some of the reasons the KCF API is such an unfortunate design from a performance perspective are:
  • Rather than expecting nicely aligned input and output data, KCF will eat almost any garbage you feed it, be it a single block, a list of unaligned iovec's or a linked list of disparate message blocks (mblk's).
  • In order to support this crazy array of input and output options and keep the implementation pretty, the KCF block cipher modes internally do a lot of copying of blocks around into and out of temp buffers and invoke a callback encryption function sequentially for each and every one of them.
  • The implementation is built with no regard to auto-vectorization or even any attempt at exploiting wide register files inside of modern CPUs to prevent pipeline stalls (as is evident in the CTR mode implementation).
  • The GCM decryption implementation is almost criminal in that it copies all input into a temporary buffer and consumes it all in the final() call. If you try to decrypt and/or authenticate a few GB worth of data using a single AES key, the kernel is almost guaranteed to soil itself trying to gobble up all of your input data before starting to produce any output (luckily the GCM algorithm is limited to 64GB of data per key, so at least there is an upper bound to this nonsense).
So I set about fixing all of these (and many more) problems, one by one.

Convoluted API

The solution here was to design "fast path" shortcuts we can take when conditions allow for it. For example, while the caller can pass us variously unaligned data and scattered output buffers, in most performance-sensitive situations, they won't, so we can optimize for that. So I've added conditions to all of the major block encryption calls to detect situations when we can shortcut the horribly slow loops with lots of copying and instead perform the work with much less effort:
/*
 * CBC encryption fastpath requirements:
 * - fastpath is enabled
 * - algorithm-specific acceleration function is available
 * - input is block-aligned
 * - output is a single contiguous region or the user requested that
 *   we overwrite their input buffer (input/output aliasing allowed)
 */
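In rough terms, the dispatch then looks like the sketch below (the predicate and the two worker functions are placeholder names for illustration, not the actual KCF symbols):

static int
cbc_encrypt_dispatch(cbc_ctx_t *ctx, crypto_data_t *in, crypto_data_t *out)
{
	if (cbc_fastpath_possible(ctx, in, out)) {
		/*
		 * Block-aligned input and a single contiguous (or aliased)
		 * output buffer: hand the whole thing to the accelerated
		 * routine in one call, with no per-block copying.
		 */
		return (cbc_encrypt_fastpath(ctx, in, out));
	}
	/* Otherwise fall back to the generic block-by-block copy loop. */
	return (cbc_encrypt_generic(ctx, in, out));
}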

Cipher Algorithm-Specific Fastpath Functions

ECB, CBC and CTR gained the ability to pass an algorithm-specific "fastpath" implementation of the block cipher mode, because these functions benefit greatly from pipelining multiple cipher calls into a single place.
ECB, CTR and CBC decryption benefit enormously from being able to exploit the wide XMM register file on Intel to perform encryption/decryption operations on 8 blocks at the same time in a non-interlocking manner. The performance gains here are on the order of 5-8x.
CBC encryption benefits from not having to copy the previously encrypted ciphertext blocks into memory and back into registers to XOR them with the subsequent plaintext blocks, though here the gains are more modest, around 1.3-1.5x.
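To illustrate why the "many blocks at a time" trick works for decryption but not for CBC encryption, here's a schematic C version of an 8-way CBC decryption pass - the real fastpath is hand-written AES-NI assembly, and aes_decrypt_block() is again just a hypothetical single-block primitive:

void
cbc_decrypt_8way(const void *key, const uint8_t *ct, uint8_t *pt, size_t nblk,
    const uint8_t iv[16])
{
	uint8_t tmp[8][16];
	size_t i;

	for (i = 0; i + 8 <= nblk; i += 8) {
		/* All 8 ciphertext blocks are already known, so these
		 * decryptions are independent and can fill the pipeline. */
		for (int b = 0; b < 8; b++)
			aes_decrypt_block(key, &ct[(i + b) * 16], tmp[b]);
		/* The chaining XOR is applied after the fact. */
		for (int b = 0; b < 8; b++) {
			const uint8_t *prev = (i + b == 0) ?
			    iv : &ct[(i + b - 1) * 16];
			for (int j = 0; j < 16; j++)
				pt[(i + b) * 16 + j] = tmp[b][j] ^ prev[j];
		}
	}
	/* A scalar tail loop would handle the remaining (< 8) blocks. */
}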

GHASH Acceleration Ported From OpenSSL

The GCM algorithm benefits greatly from having the most expensive part of the GHASH function performed in hardware using the new PCLMULQDQ carry-less multiplication instruction. The existing implementation of the KCF already used that, but it suffered from the same CLTS performance regression as AES-NI did and the horribly slow implementation inside of the general-purpose block cipher mode code path in gcm.c meant that the implementation couldn't really shine. Plus, there's been recent progress on parallelization of GHASH done using pipelined Karatsuba multipliers, which wasn't in the KCF. So I added a fastpath implementation to GCM which does an initial fast CTR pass over the data and then computes GHASH from the encryption output using the optimized GHASH implementation ported over from OpenSSL.

So How Does It All Stack Up?

After all of this work, this is how the results now look on Illumos, even inside of a VM:
KCF with performance fixes
Algorithm/Mode    Throughput (128 KiB blocks)
AES-128/ECB       3764 MB/s
AES-128/CTR       3121 MB/s
AES-128/CBC        691 MB/s
AES-128/GCM       1053 MB/s
On the decryption side of things, CBC decryption also jumped from 627 MB/s to 3011 MB/s. These numbers also show why I chose 32 KiB for the operation size in between kernel preemption barriers: even on the slowest hardware with AES-NI we can expect at least 300-400 MB/s per core of throughput, and at 300 MB/s a 32 KiB chunk takes roughly 0.1 ms, so even in the worst case we'll be hogging the CPU for at most ~0.1 ms per run.
Pretty bar graphs of our new overall position:

Overall, we're even a little bit faster than OpenSSL in some tests, though that's probably down to us encrypting 128k blocks vs 8k in the "openssl speed" utility. Anyway, having fixed this monstrous atrocity of a performance bug, I can now finally get some sleep.

Testing

To verify that nothing has been messed up by these changes, I've written a small kernel module that exercises these algorithms against a set of known good test vectors. The module is available on GitHub.

Thursday, May 22, 2014

Building a ZFS Storage Appliance (part 2)

So it's been a long time in the making, but here's the second part to making a ZFS storage appliance. This part concerns itself mostly with how to get COMSTAR and iSCSI targets up and running. It's kinda short, seeing as most of the information is already included in the stmf-ha manpage, but I'd still like to take this opportunity to reiterate the fundamentals.

COMSTAR Summary

So COMSTAR is this great COMmon Scsi TARget subsystem in Illumos that allows you to turn the box into a true SAN array. It has interconnect support for iSCSI, FC, SRP and iSER, but for our purposes I'm just going to focus on iSCSI, since that's the one I'm most familiar with.
iSCSI is really just a method of sending SCSI commands over TCP/IP, allowing you to provide storage services to other devices on a TCP/IP network. This article isn't primarily intended to teach you all the ins and outs of iSCSI, so if you want to know more, I suggest you head over to your friendly professor Wikipedia and learn all about iSCSI.
The primary problem with COMSTAR is that its configuration is kind of, well, let's say "clumsy". The configuration store is part of the SMF service database (which is stored in SQLite), and even if we could get at it by using the svccfg(1M) command, the contents themselves are a bunch of packed nvlists and various binary blobs. This is further complicated by the fact that we can't just write out a slightly modified configuration to the SMF service store and have the kernel pick up the differences easily. What the COMSTAR administration commands actually do is tell the kernel to set up each portion of the stored configuration using specific ioctl() calls. This makes programmatic modification of only portions of the running configuration on a system very complicated.

stmf-ha

To circumvent this, I've resorted to a different approach. Instead of keeping the stored COMSTAR configuration as authoritative, then attempting to somehow programmatically modify it and hoping to get its run-time reconfiguration right via the tons of undocumented or poorly documented ioctl() interfaces, I've opted to ignore the stored configuration entirely. Luckily COMSTAR supports a "no persistence" option in the service configuration, so that any configuration commands issued don't actually modify the persistent configuration in the SMF configuration store. This pretty much means that any time the machine is rebooted, the COMSTAR configuration will be entirely empty and the machine won't try to do anything. That's good and what we want, because in the next step we're going to tell it what to do from our cluster control software. This is similar to what we do in the Heartbeat resource script for ZFS, which explicitly ignores the ZFS cache file to avoid machines auto-importing pools at boot-up.
The next step involved writing a program that is capable of using the standard COMSTAR administration commands to set up a running state in COMSTAR to our liking. Naturally, we need to store the desired SCSI target and LU configuration in some place, and what better place to choose than the ZFS pool from which we'll be exporting volumes and migrating between clustered machines. That's why I wrote a simple(ish) shell script called stmf-ha that can be invoked by cluster control software to reconstruct the running state of COMSTAR when we want to import the pool and tear it down when we want to export the pool.

Integrating stmf-ha into the cluster

In part 1 of this guide we've set up Heartbeat and Pacemaker to provide clustering services to our storage array. We've installed the custom ZFS resource script from stmf-ha into Heartbeat to teach our clustering software how to import & export ZFS pools and then we've set up one or more ZFS pools to work on. For NFS this would have been enough, since the NFS configuration is stored on the pool itself and Illumos automatically restores it at pool import, but for COMSTAR we need to do this ourselves.
The stmf-ha package includes a script called zfs-helper. Copy this file into the /opt/ha/lib/ocf/lib/heartbeat/helpers directory (create it if necessary) and of course the stmf-ha script itself into some place where it can be invoked with a standard PATH environment variable for root (e.g. /usr/sbin) - alternatively, you can modify the STMF_HA variable in the zfs-helper script to point to where you've placed stmf-ha. The helper script is invoked by the ZFS resource script in Heartbeat to perform additional setup and teardown operations before and after pool import and export. The helper script then invokes stmf-ha after import has succeeded and just prior to export, passing it the pool name we're manipulating.
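For reference, the installation boils down to something like this (the source paths depend on where you unpacked the stmf-ha bundle, so treat these as an example):

# cd /path/to/stmf-ha-bundle
# mkdir -p /opt/ha/lib/ocf/lib/heartbeat/helpers
# cp zfs-helper /opt/ha/lib/ocf/lib/heartbeat/helpers/
# cp stmf-ha /usr/sbin/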

Configuring iSCSI resources in stmf-ha

So now that we've got stmf-ha installed and integrated into the clustering software, we can begin creating ZFS volumes and exporting them via iSCSI to initiators. I will assume you are familiar with general iSCSI nomenclature and the principles of how to configure iSCSI in COMSTAR.
The simplest way to start testing is to create an empty "stmf-ha.conf" file in the root of the ZFS pool. This tells stmf-ha that you want to export all of the ZFS volumes on that pool as iSCSI LUs under a default iSCSI target without any access restrictions. This is good for testing, but once you get things going, you'll probably want to lock the setup down a little bit better.
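For example, assuming a pool named vmpool01 mounted at /vmpool01 and a test volume (the names are illustrative):

# zfs create -V 100G vmpool01/testvol01
# touch /vmpool01/stmf-ha.conf

With the empty configuration file in place, the next pool import (or an explicit "stmf-ha start vmpool01", see below) exports testvol01 as an iSCSI LU under the default target.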
See the manpage of stmf-ha(1M) (copy stmf-ha.1m to /usr/share/man/man1m on your machine and then type "man stmf-ha") - it explains all the special cases and methods of how to configure your pool, your target portal groups and various other access criteria. Also have a look at the sample configuration file which will help you get started fairly quickly.
Once a pool is imported, you can also make changes to both the stmf-ha configuration and to the list of exported ZFS volumes. To reload configuration changes to the script, or e.g. when creating a new ZFS volume you want to export, simply issue the "stmf-ha start <poolname>" command again. The stmf-ha script will re-read the configuration file, the running state of COMSTAR and the pool state and reapply things so that everything that should be exported is exported. Again, please read the manpage, there's lots of info there on what stmf-ha can do and where you'll have to nurse it a bit.

Performance considerations

Please keep in mind that stmf-ha and COMSTAR configuration takes some time. This is especially evident when trying to fail over a pool that's taking a lot of load, since offlining a heavily loaded LU takes a few seconds while we wait for I/O to the LU to cease. In most cases this shouldn't be an issue, especially if your initiators know how to handle targets that go away for a while to do some cluster fail-over (e.g. VMware will hold VM I/O for up to ~120s before declaring the datastore inaccessible), but keep in mind to test, test, test prior to deployment in production - try pulling power cords, network links, hard drives and killing the odd process on the box to simulate out-of-memory conditions. Ultimately there's nothing you can do to prepare yourself for every eventuality out there, but you at least want to understand and verify how the system behaves in the most common failure scenarios. In clustering, predictability is the name of the game, so when you're unsure what's going on, don't change anything.

Monday, June 3, 2013

Building a ZFS Storage Appliance (part 1)

Introduction

The company I work for has this "cloud computing" thing (i.e. selling virtualized computing resources) and it's all based around running traditional hypervisors (VMware at the moment). And if you want to do that right, you're basically looking at centralized storage.
With most hypervisors, there are multiple storage protocols you can exploit to do it:
  • The most typical entry level is to run something like NFS or CIFS over your network to file servers which hold VM files.
  • You can use block-oriented storage protocols like iSCSI or FC to simulate entire block devices over the network. Here you have the option of going over standard Ethernet (iSCSI, FCoE, AoE, etc.) or run dedicated storage infrastructure over specialized hardware (FC, IB/SRP, etc.)
  • You can go totally crazy and take something like SAS and wire everything up in a sort of "local but simultaneous access" manner with SAS switches etc.
The point of all centralized storage is to make the data available to multiple nodes in a decentralized network. The typical entry into the centralized storage world is done by using NFS. It's easily available in the OS of your choice, it can run over your existing network infrastructure and is capable of quite good scalability. In fact, one shouldn't think that simply because it is the easiest to start with, it is also the lowest performing. There are some pretty serious NFS deployments out there and the whole NAS vs. SAN debate is as hot as ever.
My company has been going the route of building Linux storage boxes running NFS over dedicated gigabit Ethernet infrastructure for quite some time now. The performance is decent, the software is stable and there was already quite a bit of expertise in-house to do that.
However, this approach does scale only so far. And so a couple of months back we found ourselves in the marketplace looking for a new upgrade to our storage infrastructure. There were multiple requirements that needed to be met:
  1. The solution needed to be established in the marketplace. We weren't going to be toying with experimental nonsense.
  2. Deployment had to be done as soon as possible.
  3. Future scalability needed to be considered - we didn't want to get stuck in a dead end again.
  4. The price per GB must be competitive - this translates directly into service costs to our customers.
Especially the last point basically meant that all the "big enterprise" vendors' solutions (Dell EqualLogic, EMC Clariion, etc.) went out the window at first glance. There's just no way our customers are going to pay DRAM-type $/GB prices for HDD storage. That is not to diss Dell's or EMC's customers - if their product works for you, then great. It just didn't work for us.
Being a ZFS guy and having made some non-trivial ZFS deployments in our IPTV infrastructure before, I was brought onto the project to help out. Long story short, after a somewhat painful (but surprisingly short) road of discovery and experimentation, we have our new solution up and running. The rest of this article will be a guide on how you can replicate what I did and deploy it in your environment.

What We Wanted

So here are our goals and the scope of what we set out to build:
  • High-performance ZFS-based storage system on SuperMicro hardware (good, reliable, affordable).
  • Fully redundant everything - the network, head units, enclosures, pools - you name it. If it performed a vital function in the storage system, it was to be capable of surviving a failure without (significant) downtime or any data loss.
  • Connectivity to the system over iSCSI on 10 Gigabit Ethernet. NFS is an option, but not the primary protocol. FC kept in our back pocket if the need arises (unlikely, given the per-port cost of FC and performance compared to iSCSI on 10 GbE).
    The primary reason to go with iSCSI over NFS was that iSCSI supports native multi-path.

The Fault Tolerance and Scalability

Looking at things from a high-level structure, we've decided to go with a fully-meshed network structure at each subsystem boundary:


Hardware-wise, the construction is:
  • Two dedicated 24-port 10 GbE switches (with 40 GbE uplinks for future expansion), each of which represents a fully isolated SAN. There is no stacking or synchronization going on between these switches, they really are fully independent of each other. Protocols such as virtual multi-chassis stacking and port bundling introduce more fragility into the network and that rarely helps uptime.
  • Two storage controllers, also called "heads". These are the boxes that actually run ZFS and the iSCSI target software (COMSTAR). They are the brain of the entire operation and that's where we'll focus most of the time in this article. The OS of choice for us was OmniOS.
    To save on complexity, we purchased a pair of 2U SuperMicro servers with SuperMicro's new X9DRW-7TPF+ motherboards which feature an on-board integrated Intel 82599-based controller with dual 10 GbE SFP+ connectors. These, together with another on-board dual-port 1 GbE Intel i350-based NIC and a dedicated IPMI/KVM port take care of all of the front-end connectivity to the head units, leaving all of the PCI-Express slots open for whatever monkey business we want to stuff in there.
  • A pair of SuperMicro SC847 4U JBOD boxes with dual-redundant SAS expanders and 45x 3.5'' front- and rear-accessible hot-swap drive bays, giving us a total capacity of 90x 3.5'' drives. While not the highest density available on the market, they are very reliable, cheap and overall easy to work with (no need to slide them out of the rack to manipulate drives - just pop them right in from the front or rear).
  • An initial "load" of 32x 2TB Toshiba 7k2 SAS hard drives and four OCZ Vertex 4 256GB SSDs for L2ARC. In a RAID-10 configuration this gives a total usable raw storage capacity of ~29 TiB and 1 TiB of L2ARC. The head nodes are each equipped with 128 GiB of DRAM.
  • All of the above can be scaled up by adding switches, NICs, drives, enclosures, SSDs and DRAM as our needs grow.
Best of all, all of this infrastructure (including the network switches) cost us less than $30k and in terms of brute performance is miles ahead of what any proprietary storage vendor could even hope to offer at this price point.

Software

Any storage box is useless without the appropriate software equipment - in fact, that's what you're actually paying for when you buy a proprietary vendor's boxes, as the hardware is mostly stock x86 server components anyway. There are three important software subsystems that are part of the above storage system (besides the OS itself):
  1. The storage backend, which is ZFS itself. No sense in playing with anything less than the state of the art.
  2. COMSTAR to provide the iSCSI (and optional FC) target functionality on top of ZFS volumes.
  3. The in-kernel NFS server to provide the filing capabilities on top of ZFS filesystems.

Clustering Software

Since we've got two boxes accessing shared storage and providing a set of shared-access services to the outside world, this necessarily implies clustering. Clustering is one of those words that's easy to say, but very difficult to do (or at least, do right). I've written some cluster control programs and they have one thing in common: they are notoriously complex and difficult to debug. Therefore, writing my own thing for this deployment just wasn't an option.
So we turned to what the market has to offer. When it comes to Illumos, there are a few systems to choose from, both proprietary and open-source. Of the commercial ones, the best of breed currently is RSF-1. It features ZFS integration, SAS persistent reservations for fail-safe data consistency - the lot. But it's expensive, way more expensive than we were willing to pay.
One of the most common open-source systems to deploy for clustering, and the one I finally went with, is Heartbeat with the Pacemaker cluster resource manager (CRM). Unfortunately, there are no pre-built packages of Heartbeat and Pacemaker for Illumos, so I had to make my own. To make this as easy as possible, I have uploaded the product of my work for all to download and improve upon, should you want to do so:
The prebuilt packages sit in the prebuilt_packages subdirectory - unpack and install using the familiar pkgadd commands. Once installed, we will need to proceed with configuring Heartbeat before starting it up.

Heartbeat Configuration

If you're using Heartbeat from the above package, it expects a configuration file in /opt/ha/etc/ha.d/ha.cf. Here's mine, nicely documented:
# Master Heartbeat configuration file
# This file must be identical on all cluster nodes

# GLOBAL OPTIONS
use_logd        yes             # Logging done in separate process to
                                # prevent blocking on disk I/O
baud            38400           # Run the serial link at 38.4 kbaud
realtime        on              # Enable real-time scheduling and lock
                                # heartbeat into memory to prevent its
                                # pages from ever being swapped out

apiauth cl_status gid=haclient uid=hacluster

# NODE LIST SETUP
# Node names depend on the machine's host name. To protect against
# accidental joins from nodes that are part of other zfsstor clusters
# we do not allow autojoins (plus we use shared-secret authentication).
node            head1
node            head2
autojoin        none

# COMMUNICATION CHANNEL SETUP
mcast   igb0    239.51.12.1 694 1 0     # management network
mcast   igb1    239.51.12.1 694 1 0     # dedicated NIC between nodes
mcast   ixgbe0  239.51.12.1 694 1 0     # SAN interface #0
mcast   ixgbe1  239.51.12.1 694 1 0     # SAN interface #1
serial  /dev/cua/a                      # hardwire serial interface

# NODE FAILURE DETECTION
keepalive       1       # Heartbeats every 1 second
warntime        5       # Start issuing warnings after 5 seconds
deadtime        10      # After 10 seconds, a node is considered dead
initdead        60      # Hold off declaring nodes dead for 60 seconds
                        # after Heartbeat startup.
# Enable the Pacemaker CRM with maximum channel compression.
# This is to make sure Pacemaker packets pass over the 38.4kbaud
# serial link without too much delay.
crm                     on
compression             bz2
traditional_compression yes
A few important notes on the above configuration file:
  • We do logging in a separate process called ha_logd. To use ha_logd, put "logfacility daemon" into /opt/ha/etc/logd.cf and start ha_logd up via "svcadm enable ha_logd" (see the example after this list).
  • To make absolutely sure that heartbeat won't be preempted or swapped out, we run it in realtime mode.
  • To help make the cluster behave predictably, we explicitly list all nodes that are part of it and disable auto-joining. This way each node knows who else it should see as part of the cluster, which also allows us to deploy fencing as needed.
  • Since I'm somewhat paranoid about data corruption, we configure our communication links over all available Ethernet links plus a direct null-modem hardwire link between the motherboards.
    Serial links have the benefit of being essentially zero-configuration devices - there is no networking stack to accidentally misconfigure (all you need to do is make sure both ends expect the same link speed). In addition, your typical DE-9 connector has screws on it, allowing you to screw it tight to prevent accidental disconnection. In a subsequent article I will show how we monitor the health of all cluster links and raise an alarm in case one of them goes down.
  • We turn on the Pacemaker CRM and enable maximum channel compression - this is important since the serial link is quite slow and we don't want to introduce a large delay when communicating over it (the Ethernet interfaces aren't a problem).
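Setting up ha_logd as described in the first bullet above amounts to the following (a minimal logd.cf):

# echo "logfacility daemon" > /opt/ha/etc/logd.cf
# svcadm enable ha_logd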
Heartbeat also provides a set of OCF resource scripts in /opt/ha/lib/ocf/resource.d/heartbeat. Resource scripts are simply programs that the CRM invokes to get things done (e.g. starting a resource on a node, stopping a resource on a node, monitoring a resource's state, etc.). Think init scripts - that's what they are.
By default, Heartbeat does not ship with support for ZFS, so we need to add it in. I wrote a simple OCF script which manages ZFS pools and is part of the stmf-ha bundle. Copy the file in heartbeat/ZFS into the /opt/ha/lib/ocf/resource.d/heartbeat directory on your cluster nodes. This will teach your CRM how to import and export clustered ZFS pools on your nodes.
The last step that needs to be done is to generate an authentication token for Heartbeat (all Heartbeat channel messages are authenticated) and make sure it is identical on both nodes. To do so, simply execute the following command on one node and copy the resulting file (/opt/ha/etc/ha.d/authkeys) onto the second node:
# ( echo -ne "auth 1\n1 sha1 "; openssl rand -rand /dev/random \
  -hex 16 2> /dev/null ) > /opt/ha/etc/ha.d/authkeys
# chmod 400 /opt/ha/etc/ha.d/authkeys
That's it, all you need to do now is enable Heartbeat via "svcadm enable heartbeat". If Heartbeat reboots your machine, it is having trouble starting up Pacemaker. Change the "crm on" line in Heartbeat's configuration to "crm respawn" and watch your syslog (/var/adm/messages) for errors which might help you debug what's going wrong.

Identifying Cluster Resources

In the previous step we've set up Heartbeat, but we haven't actually told our cluster to do anything. All Heartbeat does is let all the cluster nodes talk to each other. However, the point of the cluster is to manage a set of services, devices, ZFS pools, etc. In clustering-speak, we want to define what resources will be part of the cluster. A resource is any useful software entity that your cluster can work with. It can be an IP address that migrates across nodes, it can be a web service, a mounted filesystem, etc.
Going over the previous enumeration of software subsystems, we can identify what parts of them we want to manage as cluster resources:
  • The ZFS subsystem primarily consists of the pool we want to attach to a given node, so we will want our cluster to automatically manage importing and exporting a pool. The filesystems and volumes on top of a ZFS pool are automatically managed by the ZFS automounter.
  • NFS exports are automatically managed by the ZFS "sharenfs" property, which is handled by the share(1M) utility. No need to explicitly do anything here, ZFS already does this as part of pool import/export.
  • The COMSTAR configuration of iSCSI targets and LUs needs to be migrated - this links the data on ZFS volumes to outside consumers. Ideally, we would want to tie this to pool import and export as in the case of NFS.
  • Finally, to make cluster migration transparent to outside users, we will want to make sure we migrate the IP addresses used for storage traffic. In case of a cluster-failover event outside users will see a reset on their storage session, which will simply force them to re-connect and reinitialize (which should be transparent for the most part).
So far so good, we've identified three broad areas to migrate: the ZFS pool, the COMSTAR configuration pertaining to the pool and the IP addresses. Two of these, the ZFS pool and the IP addresses, are pretty easily done and I will elaborate on them below. Migrating COMSTAR configurations is a lot more hairy and I will detail it in a future blog post.

Pacemaker Configuration

Pacemaker is the cluster resource manager, i.e. the actual brain of the cluster. It is the stuff that makes policy decisions on which resources to run where, when to fence a node and how to go about the general business of being a cluster. If you used the above packages to install Pacemaker, you can use the crm shell to configure it (simply run "crm" to start it up in interactive mode).
One of the nice features of Pacemaker is that no matter on which node you invoke the crm shell, it will always talk to the currently active cluster controller instance and automatically sync persistent configuration changes to all nodes in the cluster. I won't delve into the details of how to use "crm" in this article, just the basics to get our setup going. If you've ever used a Cisco router, the crm shell will be somewhat familiar. Tab completion works as expected and you configure Pacemaker by first entering the configuration mode (using the "configure" command).
When you initially run crm, your current configuration should be rather barren:
root@head1:~# crm
crm(live)# configure  
crm(live)configure# show
node $id="3df6dfa2-107b-6608-ba91-d17cd30c0d78" head1
node $id="71270ebc-ac7c-4096-9654-95e4f19d622b" head2
property $id="cib-bootstrap-options" \
    dc-version="1.0.11-6e010d6b0d49a6b929d17c0114e9d2d934dc8e04" \
    cluster-infrastructure="Heartbeat" \
    stonith-enabled="true" \
    last-lrm-refresh="1369414428" \
    stonith-action="reboot" \
    maintenance-mode="false"
The two node lines are automatically filled in by Pacemaker by talking to Heartbeat to discover all the cluster nodes. The property line just tells you a bunch of global Pacemaker property settings.

Node Fencing And STONITH

There are times when the cluster needs to recover from an unexpected situation (such as a node failing) and since we're dealing with a shared-storage cluster here, if the cluster can't execute this kind of recovery reliably, data corruption is a likely outcome. Clustering software deals with this by employing a technique called "fencing", where failed nodes are forcibly taken offline to guarantee that they can't be accessing shared resources. Sometimes this is also lovingly referred to as STONITH or "Shoot The Other Node In The Head". Running a shared-storage cluster without STONITH is generally considered a "Bad Idea(tm)" - you're basically asking for trouble.
The STONITH methods available in any given environment depend on the available hardware for implementing it. In our case, we have machines with an integrated IPMI processor running over a dedicated NIC, which enables us to use the IPMI protocol to remotely power down a node which might need fencing. If the STONITH operation fails, Pacemaker will not try to take over the resources, transitioning into a "maintenance" state needing administrator intervention (which is bad, but better than corrupting data).
For instance, assuming the above two nodes ("head1" and "head2"), you can configure STONITH resources using the following set of crm commands:
crm(live)configure# primitive head1-stonith stonith:external/ipmi \
    params hostname="head1" ipaddr="<head1's_IPMI_IP>" \
    userid="<IPMI_user>" passwd="<IPMI_passwd>" interface="lanplus" \
    op start start-delay="10"
crm(live)configure# primitive head2-stonith stonith:external/ipmi \
    params hostname="head2" ipaddr="<head2's_IPMI_IP>" \
    userid="<IPMI_user>" passwd="<IPMI_passwd>" interface="lanplus"
This tells Pacemaker that there are STONITH resources it can use for fencing. The "start-delay" operation parameter tells Pacemaker to delay executing the head1-stonith action by 10 seconds. It is no mistake that this is only present on the first STONITH resource item. IPMI has a certain delay to executing a power control action, and in case of a split-brain problem, we don't want both nodes killing each other, leaving us with no nodes up and running. Instead, we want to kill the head2 node immediately, and wait a little before killing head1 - on the off chance that a split brain did occur, only one of these will succeed (the other node will have been killed before it could execute its STONITH operations).
Next we need to put a location constraint on them to make sure that a machine's STONITH resource isn't running on the machine itself (which would kind of defeat the purpose - if the machine is misbehaving, we want the other nodes to kill it):
crm(live)configure# location head1-stonith-pref head1-stonith -inf: head1
crm(live)configure# location head2-stonith-pref head2-stonith -inf: head2
One last setting that I found useful was to change the default STONITH action from "reboot" to "poweroff" - this guarantees that in case the cluster gets split and one node kills the other, only one node will remain running (the other won't boot up again and try to take the resources over) - this gives you time to come in and diagnose problems.
crm(live)configure# property stonith-action=poweroff
Now just commit the configuration and we're done:
crm(live)configure# commit
With STONITH in place, we can go on to build our actual cluster resources.

IP Addresses

We need to make sure that when migrating iSCSI resources, the clients don't have to go on a SCSI target hunt to discover where their targets have gone (to which IP address) - this can most easily be achieved by moving around IP addresses to the machine that is currently running the iSCSI targets.
Let's designate a pair of IP addresses per head node that will preferably run on the respective head node, with failover to the other node if need be. In my network these sit in different IP subnets. I'm running both head nodes in an active-active fashion, with all storage resources split between two ZFS pools, vmpool01 and vmpool02. This allows me to utilize both storage head nodes simultaneously for maximum performance while imposing the slight inconvenience of having the storage split into two "halves" (something I can live with). If you want to run in an active-passive way, you will only need two SAN IP addresses and you can remove all of the location constraints below (but keep those for STONITH above).
  • The SAN01 interface on head1 will be 192.168.141.207/24 on ixgbe0
  • The SAN02 interface on head1 will be 192.168.142.207/24 on ixgbe1
  • The SAN01 interface on head2 will be 192.168.141.208/24 on ixgbe0
  • The SAN02 interface on head2 will be 192.168.142.208/24 on ixgbe1
To allow for communication on the ixgbe interfaces even when no cluster IP addresses are configured on them, we will also add a few hard-coded IP addresses which always sit on the respective machine (these are the ones Heartbeat will bind to for cluster messaging):
root@head1:~# ipadm create-addr -T static -a 192.168.141.205/24 ixgbe0/v4
root@head1:~# ipadm create-addr -T static -a 192.168.142.205/24 ixgbe1/v4
root@head2:~# ipadm create-addr -T static -a 192.168.141.206/24 ixgbe0/v4
root@head2:~# ipadm create-addr -T static -a 192.168.142.206/24 ixgbe1/v4
Let's configure the resources for each of the clustered IP addresses in the crm shell:
crm(live)configure# primitive head1_san01_IP ocf:heartbeat:IPaddr \
  params ip="192.168.141.207" cidr_netmask="255.255.255.0" nic="ixgbe0"
crm(live)configure# primitive head1_san02_IP ocf:heartbeat:IPaddr \
  params ip="192.168.142.207" cidr_netmask="255.255.255.0" nic="ixgbe1"
crm(live)configure# primitive head2_san01_IP ocf:heartbeat:IPaddr \
  params ip="192.168.141.208" cidr_netmask="255.255.255.0" nic="ixgbe0"
crm(live)configure# primitive head2_san02_IP ocf:heartbeat:IPaddr \
  params ip="192.168.142.208" cidr_netmask="255.255.255.0" nic="ixgbe1"
Next we want to tie them to their respective machines by awarding them a preferential score of "100":
crm(live)configure# location head1_san01_IP_pref head1_san01_IP 100: head1
crm(live)configure# location head1_san02_IP_pref head1_san02_IP 100: head1
crm(live)configure# location head2_san01_IP_pref head2_san01_IP 100: head2
crm(live)configure# location head2_san02_IP_pref head2_san02_IP 100: head2
After a "commit" of this configuration, you should see the following on your head nodes:
root@head1:~# ipadm show-addr | grep ixgbe
ixgbe0/v4         static   ok           192.168.141.205/24
ixgbe0/_a         static   ok           192.168.141.207/24
ixgbe1/v4         static   ok           192.168.142.205/24
ixgbe1/_a         static   ok           192.168.142.207/24
root@head2:~# ipadm show-addr | grep ixgbe
ixgbe0/v4         static   ok           192.168.141.206/24
ixgbe0/_a         static   ok           192.168.141.208/24
ixgbe1/v4         static   ok           192.168.142.206/24
ixgbe1/_a         static   ok           192.168.142.208/24

ZFS Pools

Now that we got our IP addresses up and running, we can configure ZFS pools to migrate across the machines. The configuration is going to be fairly similar to what we did for IP addresses.
First create your ZFS pools on the machines using the standard "zpool create" syntax with one twist. By default, ZFS pools are imported in a persistent manner, meaning, their configuration is cached in the /etc/zfs/zpool.cache file. At next boot, the machine will attempt to import this pool automatically. That is not what we want. We want Pacemaker to control pool import and export, and so we need to suppress this behavior. Luckily, this is easy to do - simply set the "cachefile" property to "none" while creating the pool. The pool will be imported in a non-persistent way (i.e. it will not be placed in the cache file).
root@head1:~# zpool create -o cachefile=none vmpool01 <pool-vdevs>
root@head2:~# zpool create -o cachefile=none vmpool02 <pool-vdevs>
Please note that the cachefile parameter needs to be overridden at each pool import to make sure it isn't cached (otherwise it will be).
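For example, when importing such a pool by hand (the ZFS OCF resource script from stmf-ha, declared below, does the same thing for you when the cluster imports the pool):

# zpool import -o cachefile=none vmpool01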
Next we declare the basic resources for the ZFS pools in the crm shell:
crm(live)configure# primitive vmpool01 ocf:heartbeat:ZFS \
  params pool="vmpool01" \
  op start timeout="90" \
  op stop timeout="90"
crm(live)configure# primitive vmpool02 ocf:heartbeat:ZFS \
  params pool="vmpool02" \
  op start timeout="90" \
  op stop timeout="90"
Above we used the ZFS OCF script shipped with stmf-ha. This automatically imports a pool non-persistently, making sure proper clustering semantics are maintained. We also set a rather generous start and stop timeout for this resource. By default Pacemaker expects a resource to stop within 20 seconds, and ZFS exports and imports can take longer than that, so we need to set our limits higher here.
Now we also need to place an identical location constraint on these resources as we did on the IPs:
crm(live)configure# location vmpool01_pref vmpool01 100: head1
crm(live)configure# location vmpool02_pref vmpool02 100: head2
crm(live)configure# colocation vmpool01_with_IPs \
  inf: vmpool01 head1_san01_IP head1_san02_IP
crm(live)configure# colocation vmpool02_with_IPs \
  inf: vmpool02 head2_san01_IP head2_san02_IP
This will make sure that the ZFS pools will always migrate together with the respective SAN IPs.
There is one final twist to this configuration. By default, Pacemaker will stop and start resources in parallel, meaning, if a fail-over does need to occur, your SAN IPs can migrate over much faster than your ZFS pool can. This is obviously bad, as your SAN clients could attempt to connect to the target node's IP addresses (which are already up), but none of the storage resources will be there (since the ZFS pool might still be exporting/importing). So you need to tell Pacemaker that it has to do stuff in a particular order, like so:
crm(live)configure# order vmpool01_order \
  inf: vmpool01 ( head1_san01_IP head1_san02_IP )
crm(live)configure# order vmpool02_order \
  inf: vmpool02 ( head2_san01_IP head2_san02_IP )
This tells Pacemaker that you want the pool migrated first and then, once it's ready, the IP addresses can be migrated over in parallel. Obviously, Pacemaker understands that stopping the resources on the original node takes place in reverse order.
Now just commit your configuration and you can test voluntary fail-over out by doing a "svcadm disable heartbeat" on one of the nodes (the node will first voluntarily give up its resources, transition them over to another node and only then Heartbeat will shut down). Looking at "crm_mon -1" you should see the following sequence of events take place (assuming you're stopping Heartbeat on head1):
  1. The SAN IPs are stopped on head1
  2. vmpool01 is exported from head1
  3. vmpool01 is imported on head2
  4. head1's SAN IPs are started on head2
In my environment, the whole operation (including iSCSI target offlining) takes place in about 40 seconds - YMMV.

In Part 2

In the next part of this blog post we will be talking about how to configure iSCSI targets in a manner that will allow them to migrate across the different head nodes transparently and under the control of Pacemaker. If you only want to do NFS, the above steps are enough for you to set your NFS services up fully (since NFS shares are configured as ZFS filesystem properties). COMSTAR's configuration isn't so simple and so needs a little more attention.