Introduction
The company I worked for has this "cloud computing" thing (i.e. selling virtualized computing resources) and it's all based around running traditional hypervisors (VMware at the moment). And if you want to do that right, you're basically looking at centralized storage.
With most hypervisors, there are multiple storage protocols you can exploit to do it:
- The most typical entry level is to run something like NFS or CIFS over your network to file servers which hold VM files.
- You can use block-oriented storage protocols like iSCSI or FC to simulate entire block devices over the network. Here you have the option of going over standard Ethernet (iSCSI, FCoE, AoE, etc.) or running dedicated storage infrastructure over specialized hardware (FC, IB/SRP, etc.)
- You can go totally crazy and take something like SAS and wire everything up in a sort of "local but simultaneous access" manner with SAS switches etc.
My company has been going the route of building Linux storage boxes running NFS over dedicated Gigabit Ethernet infrastructure for quite some time now. The performance is decent, the software is stable and there was already quite a bit of expertise in-house to do that.
However, this approach only scales so far. And so a couple of months back we found ourselves in the marketplace looking for an upgrade to our storage infrastructure. There were multiple requirements that needed to be met:
- The solution needed to be established in the marketplace. We weren't going to be toying with experimental nonsense.
- Deployment had to be done as soon as possible.
- Future scalability needed to be considered - we didn't want to get stuck in a dead end again.
- The price per GB must be competitive - this translates directly into service costs to our customers.
Being a ZFS guy and having made some non-trivial ZFS deployments in our IPTV infrastructure before, I was brought onto the project to help out. Long story short, after a somewhat painful (but surprisingly short) road of discovery and experimentation, we have our new solution up and running. The rest of this article will be a guide on how you can replicate what I did and deploy it in your environment.
What We Wanted
So here are our goals and the scope of what we set out to build:
- High-performance ZFS-based storage system on SuperMicro hardware (good, reliable, affordable).
- Fully redundant everything - the network, head units, enclosures, pools - you name it. If it performed a vital function in the storage system, it was to be capable of surviving a failure without (significant) downtime or any data loss.
- Connectivity to the system over iSCSI on 10 Gigabit Ethernet. NFS is an option, but not the primary protocol. FC kept in our back pocket if the need arises (unlikely, given the per-port cost of FC and performance compared to iSCSI on 10 GbE).
The primary reason to go with iSCSI over NFS was that iSCSI supports native multi-path.
The Fault Tolerance and Scalability
Looking at things from a high-level structure, we've decided to go with a fully-meshed network structure at each subsystem boundary.
Hardware-wise, the construction is:
- Two dedicated 24-port 10 GbE switches (with 40 GbE uplinks for future expansion), each of which represents a fully isolated SAN. There is no stacking or synchronization going on between these switches; they really are fully independent of each other. Protocols such as virtual multi-chassis stacking and port bundling introduce more fragility into the network and that rarely helps uptime.
- Two storage controllers, also called "heads". These are the boxes that actually run ZFS and the iSCSI target software (COMSTAR). They are the brain of the entire operation and that's where we'll focus most of the time in this article. The OS of choice for us was OmniOS.
To save on complexity, we purchased a pair of 2U SuperMicro servers with SuperMicro's new X9DRW-7TPF+ motherboards which feature an on-board integrated Intel 82599-based controller with dual 10 GbE SFP+ connectors. These, together with another on-board dual-port 1 GbE Intel i350-based NIC and a dedicated IPMI/KVM port take care of all of the front-end connectivity to the head units, leaving all of the PCI-Express slots open for whatever monkey business we want to stuff in there.
- A pair of SuperMicro SC847 4U JBOD boxes with dual-redundant SAS expanders and 45x 3.5'' front- and rear-accessible hot-swap drive bays, giving us a total capacity of 90x 3.5'' drives. While not the highest density available on the market, they are very reliable, cheap and overall easy to work with (no need to slide them out of the rack to manipulate drives - just pop them right in from the front or rear).
- An initial "load" of 32x 2TB Toshiba 7k2 SAS hard drives and four OCZ Vertex 4 256GB SSDs for L2ARC. In a RAID-10 configuration this gives a total usable storage capacity of ~29 TiB and roughly 1 TB (0.93 TiB) of L2ARC. The head nodes are each equipped with 128 GiB of DRAM.
- All of the above can be scaled up by adding switches, NICs, drives, enclosures, SSDs and DRAM as our needs grow.
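As a quick sanity check on the capacity figures above, the arithmetic can be sketched in a few lines of shell. The function name usable_tib is made up for this illustration, and drive sizes are assumed to be decimal (vendor) gigabytes, which is why the results come out lower in TiB than the marketing numbers suggest.

```shell
# Usable capacity in TiB for a set of identical drives.
# $1 = number of drives
# $2 = drive size in decimal GB (10^9 bytes), as printed on the label
# $3 = number of copies of each block (2 for two-way mirrors, 1 for no redundancy)
# usable_tib is a hypothetical helper, not part of any ZFS tooling.
usable_tib() {
    awk -v n="$1" -v gb="$2" -v copies="$3" 'BEGIN {
        tib = 1024 * 1024 * 1024 * 1024        # bytes in one TiB
        bytes = n * gb * 1000000000 / copies   # raw bytes divided by redundancy
        printf "%.2f\n", bytes / tib
    }'
}

usable_tib 32 2000 2   # 32x 2TB drives in RAID-10: prints 29.10
usable_tib 4 256 1     # four 256GB L2ARC SSDs:     prints 0.93
```

The same function can be reused when planning expansion, for example to see what filling all 90 bays would yield.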
Software
Any storage box is useless without the appropriate software equipment - in fact, that's what you're actually paying for when you buy a proprietary vendor's boxes, as the hardware is mostly stock x86 server components anyway. There are three important software subsystems that are part of the above storage system (besides the OS itself):
- The storage backend, which is ZFS itself. No sense in playing with anything less than the state of the art.
- COMSTAR to provide the iSCSI (and optional FC) target functionality on top of ZFS volumes.
- The in-kernel NFS server to provide the filing capabilities on top of ZFS filesystems.
Clustering Software
Since we've got two boxes accessing shared storage and providing a set of shared-access services to the outside world, this necessarily implies clustering. Clustering is one of those words that's easy to say, but very difficult to do (or at least, do right). I've written some cluster control programs and they have one thing in common: they are notoriously complex and difficult to debug. Therefore, writing my own thing for this deployment just wasn't an option.
So we turned to what the market has to offer. When it comes to Illumos, there are a few systems to choose from, both proprietary and open-source. Of the commercial ones, the best of breed currently is RSF-1. It features ZFS integration, SAS persistent reservations for fail-safe data consistency - the lot. But it's expensive, way more expensive than we were willing to pay.
One of the most common open-source systems to deploy for clustering, and the one I finally went with, is Heartbeat with the Pacemaker cluster resource manager (CRM). Unfortunately, there are no pre-built packages of Heartbeat and Pacemaker for Illumos, so I had to make my own. To make this as easy as possible, I have uploaded my product for all to download and improve upon, should you want to do so:
Heartbeat Configuration
If you're using Heartbeat from the above package, it expects a configuration file in /opt/ha/etc/ha.d/ha.cf. Here's mine, nicely documented:

    # Master Heartbeat configuration file
    # This file must be identical on all cluster nodes

    # GLOBAL OPTIONS
    use_logd    yes         # Logging done in separate process to
                            # prevent blocking on disk I/O
    baud        38400       # Run the serial link at 38.4 kbaud
    realtime    on          # Enable real-time scheduling and lock
                            # heartbeat into memory to prevent its
                            # pages from ever being swapped out
    apiauth     cl_status gid=haclient uid=hacluster

    # NODE LIST SETUP
    # Node names depend on the machine's host name. To protect against
    # accidental joins from nodes that are part of other zfsstor clusters
    # we do not allow autojoins (plus we use shared-secret authentication).
    node        head1
    node        head2
    autojoin    none

    # COMMUNICATION CHANNEL SETUP
    mcast igb0   126.96.36.199 694 1 0    # management network
    mcast igb1   188.8.131.52 694 1 0     # dedicated NIC between nodes
    mcast ixgbe0 184.108.40.206 694 1 0   # SAN interface #0
    mcast ixgbe1 220.127.116.11 694 1 0   # SAN interface #1
    serial /dev/cua/a                     # hardwire serial interface

    # NODE FAILURE DETECTION
    keepalive   1           # Heartbeats every 1 second
    warntime    5           # Start issuing warnings after 5 seconds
    deadtime    10          # After 10 seconds, a node is considered dead
    initdead    60          # Hold off declaring nodes dead for 60 seconds
                            # after Heartbeat startup.

    # Enable the Pacemaker CRM with maximum channel compression.
    # This is to make sure Pacemaker packets pass over the 38.4kbaud
    # serial link without too much delay.
    crm                     on
    compression             bz2
    traditional_compression yes

A few important notes on the above configuration file:
- We do logging in a separate process called ha_logd. To use ha_logd, put "logfacility daemon" into /opt/ha/etc/logd.cf and start ha_logd up via "svcadm enable ha_logd".
- To make absolutely sure that heartbeat won't be preempted or swapped out, we run it in realtime mode.
- To help make the cluster behave predictably, we explicitly list all nodes that are part of it and disable auto-joining. This way each node knows who else it should see as part of the cluster, which allows us to deploy fencing as needed.
- Since I'm somewhat paranoid about data corruption, we configure our communication links over all available Ethernet links plus a direct null-modem hardwire link between the motherboards.
Serial links have the benefit of being essentially zero-configuration devices - there is no networking stack to accidentally misconfigure (all you need to do is make sure both ends expect the same link speed). In addition, your typical DE-9 connector has screws on it, allowing you to screw it tight to prevent accidental disconnection. In a subsequent article I will show how we monitor the health of all cluster links and raise an alarm in case one of them goes down.
- We turn on the Pacemaker CRM and enable maximum channel compression - this is important since the serial link is quite slow and we don't want to introduce a large delay when communicating over it (the Ethernet interfaces aren't a problem).
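The failure-detection timers in ha.cf only make sense if they are ordered sensibly, so a small pre-flight check can catch typos before Heartbeat is started. The following is a minimal sketch, not something shipped with Heartbeat: the helper names ha_value and check_ha_timing are hypothetical, the timers are assumed to be whole seconds as in the file above, and the "initdead at least twice deadtime" rule is the commonly recommended guideline rather than a hard requirement.

```shell
# Print the value of a single-valued ha.cf directive ($2) found in file $1.
# Only the second field is read, so trailing "# ..." comments are ignored.
ha_value() {
    awk -v key="$2" '$1 == key { print $2; exit }' "$1"
}

# Succeed only if keepalive < warntime < deadtime and initdead >= 2 * deadtime.
# Assumes all four directives are present and given as integer seconds.
check_ha_timing() {
    cf=$1
    keepalive=$(ha_value "$cf" keepalive)
    warntime=$(ha_value "$cf" warntime)
    deadtime=$(ha_value "$cf" deadtime)
    initdead=$(ha_value "$cf" initdead)
    [ "$keepalive" -lt "$warntime" ] &&
    [ "$warntime" -lt "$deadtime" ] &&
    [ "$initdead" -ge $((deadtime * 2)) ]
}
```

Running check_ha_timing /opt/ha/etc/ha.d/ha.cf and inspecting its exit status on both nodes is enough; with the values shown above (1, 5, 10 and 60 seconds) the check passes.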
By default, Heartbeat does not ship with support for ZFS, so we need to add it in. I wrote a simple OCF script which manages ZFS pools and is part of the stmf-ha bundle. Copy the file in heartbeat/ZFS into the /opt/ha/lib/ocf/resource.d/heartbeat directory on your cluster nodes. This will teach your CRM how to import and export clustered ZFS pools on your nodes.
The last step that needs to be done is to generate an authentication token for Heartbeat (all Heartbeat channel messages are authenticated) and make sure it is identical on both nodes. To do so, simply execute the following command on one node and copy the resulting file (/opt/ha/etc/ha.d/authkeys) onto the second node:

    # ( echo -ne "auth 1\n1 sha1 "; openssl rand -rand /dev/random \
        -hex 16 2> /dev/null ) > /opt/ha/etc/ha.d/authkeys
    # chmod 400 /opt/ha/etc/ha.d/authkeys

That's it. All you need to do now is enable Heartbeat via "svcadm enable heartbeat". If Heartbeat reboots your machine, it is having trouble starting up Pacemaker. In that case, change the "crm on" line in Heartbeat's configuration to "crm respawn" and watch your syslog (/var/adm/messages) for errors which might help you debug what's going wrong.
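If the nodes refuse to talk to each other, a malformed or mismatched authkeys file is one of the first things worth ruling out. The sketch below assumes the two-line format produced by the command above (an "auth 1" line followed by "1 sha1" and 32 hexadecimal digits); the function names check_authkeys and same_authkeys are hypothetical helpers, not part of Heartbeat.

```shell
# Succeed if file $1 has the two-line shape generated above:
#   auth 1
#   1 sha1 <32 lowercase hex digits>
check_authkeys() {
    [ "$(sed -n '1p' "$1")" = "auth 1" ] &&
    sed -n '2p' "$1" | grep -Eq '^1 sha1 [0-9a-f]{32}$'
}

# Succeed if the two files $1 and $2 are byte-for-byte identical.
same_authkeys() {
    cmp -s "$1" "$2"
}
```

To compare nodes, copy the second node's file to a temporary location on the first (readable only by root) and pass both paths to same_authkeys.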
Identifying Cluster Resources
In the previous step we've set up Heartbeat, but we haven't actually told our cluster to do anything. All Heartbeat does is let all the cluster nodes talk to each other. However, the point of the cluster is to manage a set of services, devices, ZFS pools, etc. In clustering-speak, we want to define what resources will be part of the cluster. A resource is any useful software entity that your cluster can work with. It can be an IP address that migrates across nodes, it can be a web service, a mounted filesystem, etc.
Going over the previous enumeration of software subsystems, we can identify what parts of them we want to manage as cluster resources:
- The ZFS subsystem primarily consists of the pool we want to attach to a given node, so we will want our cluster to automatically manage importing and exporting a pool. The filesystems and volumes on top of a ZFS pool are automatically managed by the ZFS automounter.
- NFS exports are automatically managed by the ZFS "sharenfs" property, which is handled by the share(1M) utility. No need to explicitly do anything here, ZFS already does this as part of pool import/export.
- The COMSTAR configuration of iSCSI targets and LUs needs to be migrated - this links the data on ZFS volumes to outside consumers. Ideally, we would want to tie this to pool import and export as in the case of NFS.
- Finally, to make cluster migration transparent to outside users, we will want to make sure we migrate the IP addresses used for storage traffic. In case of a cluster-failover event outside users will see a reset on their storage session, which will simply force them to re-connect and reinitialize (which should be transparent for the most part).
Pacemaker Configuration
Pacemaker is the cluster resource manager, i.e. the actual brain of the cluster. It is the stuff that makes policy decisions on which resources to run where, when to fence a node and how to go about the general business of being a cluster. If you used the above packages to install Pacemaker, you can use the crm shell to configure it (simply run "crm" to start it up in interactive mode).
One of the nice features of Pacemaker is that no matter on which node you invoke the crm shell, it will always talk to the currently active cluster controller instance and automatically sync persistent configuration changes to all nodes in the cluster. I won't delve into the details of how to use "crm" in this article, just the basics to get our setup going. If you've ever used a Cisco router, the crm shell will be somewhat familiar. Tab completion works as expected and you configure Pacemaker by first entering the configuration mode (using the "configure" command).
When you initially run crm, your current configuration should be rather barren:

    root@head1:~# crm
    crm(live)# configure
    crm(live)configure# show
    node $id="3df6dfa2-107b-6608-ba91-d17cd30c0d78" head1
    node $id="71270ebc-ac7c-4096-9654-95e4f19d622b" head2
    property $id="cib-bootstrap-options" \
        dc-version="1.0.11-6e010d6b0d49a6b929d17c0114e9d2d934dc8e04" \
        cluster-infrastructure="Heartbeat" \
        stonith-enabled="true" \
        last-lrm-refresh="1369414428" \
        stonith-action="reboot" \
        maintenance-mode="false"

The two node lines are automatically filled in by Pacemaker by talking to Heartbeat to discover all the cluster nodes. The property line just tells you a bunch of global Pacemaker property settings.
Node Fencing And STONITH
There are times when the cluster needs to recover from an unexpected situation (such as a node failing) and since we're dealing with a shared-storage cluster here, if the cluster can't execute this kind of recovery reliably, data corruption is a likely outcome. Clustering software deals with this by employing a technique called "fencing", where failed nodes are forcibly taken offline to guarantee that they can't be accessing shared resources. Sometimes this is also lovingly referred to as STONITH or "Shoot The Other Node In The Head". Running a shared-storage cluster without STONITH is generally considered a "Bad Idea(tm)" - you're basically asking for trouble.
The STONITH methods available in any given environment depend on the available hardware for implementing it. In our case, we have machines with an integrated IPMI processor running over a dedicated NIC, which enables us to use the IPMI protocol to remotely power down a node which might need fencing. If the STONITH operation fails, Pacemaker will not try to take over the resources, transitioning into a "maintenance" state needing administrator intervention (which is bad, but better than corrupting data).
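Since a failed STONITH operation leaves the cluster waiting for an administrator, it is worth confirming by hand that each node's IPMI controller is reachable from the other node before relying on it. The sketch below assumes the ipmitool utility is installed and uses the "lanplus" transport; the wrapper name ipmi_power_status and the IPMITOOL override variable are inventions for this example.

```shell
# Query a node's power state over IPMI using the "lanplus" transport.
# $1 = IPMI IP address, $2 = IPMI user, $3 = IPMI password.
# IPMITOOL may be overridden (e.g. IPMITOOL=echo) to print the arguments
# that would be passed instead of contacting any hardware.
ipmi_power_status() {
    "${IPMITOOL:-ipmitool}" -I lanplus -H "$1" -U "$2" -P "$3" chassis power status
}
```

Run it from head1 against head2's IPMI address and vice versa. Only the read-only "status" query is used here, so the check itself cannot power anything off.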
For instance, assuming the above two nodes ("head1" and "head2"), you can configure STONITH resources using the following set of crm commands:

    crm(live)configure# primitive head1-stonith stonith:external/ipmi \
        params hostname="head1" ipaddr="<head1's_IPMI_IP>" \
        userid="<IPMI_user>" passwd="<IPMI_passwd>" interface="lanplus" \
        op start start-delay="10"
    crm(live)configure# primitive head2-stonith stonith:external/ipmi \
        params hostname="head2" ipaddr="<head2's_IPMI_IP>" \
        userid="<IPMI_user>" passwd="<IPMI_passwd>" interface="lanplus"

This tells Pacemaker that there are STONITH resources it can use for fencing. The "start-delay" operation parameter tells Pacemaker to delay executing the head1-stonith action by 10 seconds. It is no mistake that this is only present on the first STONITH resource item. IPMI has a certain delay to executing a power control action, and in case of a split-brain problem, we don't want both nodes killing each other, leaving us with no nodes up and running. Instead, we want to kill the head2 node immediately, and wait a little before killing head1 - on the off chance that a split brain did occur, only one of these will succeed (the other node will have been killed before it could execute its STONITH operations).
Next we need to put a location constraint on them to make sure that a machine's STONITH resource isn't running on the machine itself (which would kind of defeat the purpose - if the machine is misbehaving, we want the other nodes to kill it):

    crm(live)configure# location head1-stonith-pref head1-stonith -inf: head1
    crm(live)configure# location head2-stonith-pref head2-stonith -inf: head2

One last setting that I found useful was to change the default STONITH action from "reboot" to "poweroff" - this guarantees that in case the cluster gets split and one node kills the other, only one node will remain running (the other won't boot up again and try to take the resources over) - this gives you time to come in and diagnose problems.

    crm(live)configure# property stonith-action=poweroff

Now just commit the configuration and we're done:

    crm(live)configure# commit

With STONITH in place, we can go on to build our actual cluster resources.
IP Addresses
We need to make sure that when migrating iSCSI resources, the clients don't have to go on a SCSI target hunt to discover where their targets have gone (to which IP address) - this can most easily be achieved by moving around IP addresses to the machine that is currently running the iSCSI targets.
Let's designate a pair of IP addresses per head node that will be preferably running on the respective head node with failover to the other node if need be. In my network these sit in different IP subnets. I'm running both head nodes in an active-active fashion, with all storage resources split between two ZFS pools, vmpool01 and vmpool02. This allows me to utilize both storage head nodes simultaneously for maximum performance while imposing the slight inconvenience of having the pool split into two "halves" (something I can live with). If you want to run in an active-passive way, you will only need two SAN IP addresses and you can remove all of the location constraints below (but keep those for STONITH above).
- The SAN01 interface on head1 will be 192.168.141.207/24 on ixgbe0
- The SAN02 interface on head1 will be 192.168.142.207/24 on ixgbe1
- The SAN01 interface on head2 will be 192.168.141.208/24 on ixgbe0
- The SAN02 interface on head2 will be 192.168.142.208/24 on ixgbe1
Pacemaker adds these as additional addresses on top of each interface's fixed, node-local address, so first give every SAN interface a static address of its own (.205 on head1, .206 on head2):

    root@head1:~# ipadm create-addr -T static -a 192.168.141.205/24 ixgbe0/v4
    root@head1:~# ipadm create-addr -T static -a 192.168.142.205/24 ixgbe1/v4
    root@head2:~# ipadm create-addr -T static -a 192.168.141.206/24 ixgbe0/v4
    root@head2:~# ipadm create-addr -T static -a 192.168.142.206/24 ixgbe1/v4

Let's configure the resources for each of the clustered IP addresses in the crm shell:

    crm(live)configure# primitive head1_san01_IP ocf:heartbeat:IPaddr \
        params ip="192.168.141.207" cidr_netmask="255.255.255.0" nic="ixgbe0"
    crm(live)configure# primitive head1_san02_IP ocf:heartbeat:IPaddr \
        params ip="192.168.142.207" cidr_netmask="255.255.255.0" nic="ixgbe1"
    crm(live)configure# primitive head2_san01_IP ocf:heartbeat:IPaddr \
        params ip="192.168.141.208" cidr_netmask="255.255.255.0" nic="ixgbe0"
    crm(live)configure# primitive head2_san02_IP ocf:heartbeat:IPaddr \
        params ip="192.168.142.208" cidr_netmask="255.255.255.0" nic="ixgbe1"

Next we want to tie them to their respective machines by awarding them a preferential score of "100":

    crm(live)configure# location head1_san01_IP_pref head1_san01_IP 100: head1
    crm(live)configure# location head1_san02_IP_pref head1_san02_IP 100: head1
    crm(live)configure# location head2_san01_IP_pref head2_san01_IP 100: head2
    crm(live)configure# location head2_san02_IP_pref head2_san02_IP 100: head2

After a "commit" of this configuration, you should see the following on your head nodes:

    root@head1:~# ipadm show-addr | grep ixgbe
    ixgbe0/v4         static   ok           192.168.141.205/24
    ixgbe0/_a         static   ok           192.168.141.207/24
    ixgbe1/v4         static   ok           192.168.142.205/24
    ixgbe1/_a         static   ok           192.168.142.207/24
    root@head2:~# ipadm show-addr | grep ixgbe
    ixgbe0/v4         static   ok           192.168.141.206/24
    ixgbe0/_a         static   ok           192.168.141.208/24
    ixgbe1/v4         static   ok           192.168.142.206/24
    ixgbe1/_a         static   ok           192.168.142.208/24

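Checking the "ipadm show-addr" output by eye gets tedious during failover tests, so the check can be scripted. This is a minimal sketch that assumes the address is the last column of each output line, as in the listing above; the function name has_addr and the IPADM override variable are hypothetical.

```shell
# Succeed if IP address $1 (without the /prefix) is configured on this node.
# Matches the whole address, so 192.168.141.20 does not match 192.168.141.207.
# IPADM may be overridden with a stand-in command for testing.
has_addr() {
    "${IPADM:-ipadm}" show-addr | awk -v ip="$1" '
        { split($NF, parts, "/"); if (parts[1] == ip) found = 1 }
        END { exit !found }'
}
```

On head1 in normal operation, both "has_addr 192.168.141.207" and "has_addr 192.168.142.207" should succeed, while the .208 addresses should only show up there after head2's resources have failed over.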
ZFS Pools
Now that we've got our IP addresses up and running, we can configure ZFS pools to migrate across the machines. The configuration is going to be fairly similar to what we did for IP addresses.
First create your ZFS pools on the machines using the standard "zpool create" syntax with one twist. By default, ZFS pools are imported in a persistent manner, meaning, their configuration is cached in the /etc/zfs/zpool.cache file. At next boot, the machine will attempt to import this pool automatically. That is not what we want. We want Pacemaker to control pool import and export, and so we need to suppress this behavior. Luckily, this is easy to do - simply set the "cachefile" property to "none" while creating the pool. The pool will be imported in a non-persistent way (i.e. it will not be placed in the cache file).

    root@head1:~# zpool create -o cachefile=none vmpool01 <pool-vdevs>
    root@head2:~# zpool create -o cachefile=none vmpool02 <pool-vdevs>

Please note that the cachefile parameter needs to be overridden at each pool import to make sure it isn't cached (otherwise it will be).
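For the occasional manual import (during maintenance, say), the override looks like the sketch below. The "-o cachefile=none" option to "zpool import" and the "zpool get cachefile" query are standard zpool usage; the wrapper names import_uncached and pool_cachefile and the ZPOOL override variable are made up for this example, and the parsing assumes the usual NAME/PROPERTY/VALUE/SOURCE column layout.

```shell
# Import pool $1 without recording it in /etc/zfs/zpool.cache.
# ZPOOL may be overridden (e.g. ZPOOL=echo) for a dry run.
import_uncached() {
    "${ZPOOL:-zpool}" import -o cachefile=none "$1"
}

# Print the current cachefile setting of pool $1.
# Reads the VALUE column of the first data row; "none" is what we want.
pool_cachefile() {
    "${ZPOOL:-zpool}" get cachefile "$1" | awk 'NR == 2 { print $3 }'
}
```

After any manual import, pool_cachefile vmpool01 should print "none"; anything else means the pool may be picked up automatically at the next boot, behind Pacemaker's back.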
Next we declare the basic resources for the ZFS pools in the crm shell:

    crm(live)configure# primitive vmpool01 ocf:heartbeat:ZFS \
        params pool="vmpool01" \
        op start timeout="90" \
        op stop timeout="90"
    crm(live)configure# primitive vmpool02 ocf:heartbeat:ZFS \
        params pool="vmpool02" \
        op start timeout="90" \
        op stop timeout="90"

Above we used the ZFS OCF script shipped with stmf-ha. This automatically imports a pool non-persistently, making sure proper clustering semantics are maintained. We also set a rather generous start and stop timeout for this resource. By default Pacemaker expects a resource to stop within 20 seconds and ZFS exports and imports can take longer than that, so we need to set our limits higher here.
Now we also need to place location constraints on these resources, just as we did for the IPs, plus colocation constraints that tie each pool to its SAN IPs:

    crm(live)configure# location vmpool01_pref vmpool01 100: head1
    crm(live)configure# location vmpool02_pref vmpool02 100: head2
    crm(live)configure# colocation vmpool01_with_IPs \
        inf: vmpool01 head1_san01_IP head1_san02_IP
    crm(live)configure# colocation vmpool02_with_IPs \
        inf: vmpool02 head2_san01_IP head2_san02_IP

This will make sure that the ZFS pools will always migrate together with the respective SAN IPs.
There is one final twist to this configuration. By default, Pacemaker will stop and start resources in parallel, meaning, if a fail-over does need to occur, your SAN IPs can migrate over much faster than your ZFS pool can. This is obviously bad, as your SAN clients could attempt to connect to the target node's IP addresses (which are already up), but none of the storage resources will be there (since the ZFS pool might still be exporting/importing). So you need to tell Pacemaker that it has to do stuff in a particular order, like so:

    crm(live)configure# order vmpool01_order \
        inf: vmpool01 ( head1_san01_IP head1_san02_IP )
    crm(live)configure# order vmpool02_order \
        inf: vmpool02 ( head2_san01_IP head2_san02_IP )

This tells Pacemaker that you want the pool migrated first and then, once it's ready, the IP addresses can be migrated over in parallel. Obviously, Pacemaker understands that stopping the resources on the original node takes place in reverse order.
Now just commit your configuration and you can test voluntary fail-over out by doing a "svcadm disable heartbeat" on one of the nodes (the node will first voluntarily give up its resources, transition them over to another node and only then Heartbeat will shut down). Looking at "crm_mon -1" you should see the following sequence of events take place (assuming you're stopping Heartbeat on head1):
- The SAN IPs are stopped on head1
- vmpool01 is exported from head1
- vmpool01 is imported on head2
- head1's SAN IPs are started on head2