Monday, June 3, 2013

Building a ZFS Storage Appliance (part 1)

Introduction

The company I worked for has this "cloud computing" thing (i.e. selling virtualized computing resources) and it's all based around running traditional hypervisors (VMware at the moment). And if you want to do that right, you're basically looking at centralized storage.
With most hypervisors, there are multiple storage protocols you can exploit to do it:
  • The most typical entry level is to run something like NFS or CIFS over your network to file servers which hold VM files.
  • You can use block-oriented storage protocols like iSCSI or FC to simulate entire block devices over the network. Here you have the option of going over standard Ethernet (iSCSI, FCoE, AoE, etc.) or running dedicated storage infrastructure over specialized hardware (FC, IB/SRP, etc.).
  • You can go totally crazy and take something like SAS and wire everything up in a sort of "local but simultaneous access" manner with SAS switches etc.
The point of all centralized storage is to make the data available to multiple nodes in a decentralized network. The typical entry into the centralized storage world is done by using NFS. It's easily available in the OS of your choice, it can run over your existing network infrastructure and is capable of quite good scalability. In fact, one shouldn't think that simply because it is the easiest to start with, it also is the least performing. There are some pretty serious NFS deployments out there and the whole NAS vs. SAN debate is as hot as ever.
My company has been going the route of building Linux storage boxes running NFS over dedicated Gigabit Ethernet infrastructure for quite some time now. The performance is decent, the software is stable and there was already quite a bit of expertise in-house to do that.
However, this approach scales only so far. And so a couple of months back we found ourselves in the marketplace looking for an upgrade to our storage infrastructure. There were multiple requirements that needed to be met:
  1. The solution needed to be established in the marketplace. We weren't going to be toying with experimental nonsense.
  2. Deployment had to be done as soon as possible.
  3. Future scalability needed to be considered - we didn't want to get stuck in a dead end again.
  4. The price per GB must be competitive - this translates directly into service costs to our customers.
Especially the last point basically meant that all the "big enterprise" vendors' solutions (Dell EqualLogic, EMC Clariion, etc.) went out the window at first glance. There's just no way our customers are going to pay DRAM-type $/GB prices for HDD storage. That is not to diss Dell's or EMC's customers - if their product works for you, then great. It just didn't work for us.
Being a ZFS guy and having made some non-trivial ZFS deployments in our IPTV infrastructure before, I was brought onto the project to help out. Long story short, after a somewhat painful (but surprisingly short) road of discovery and experimentation, we have our new solution up and running. The rest of this article will be a guide on how you can replicate what I did and deploy it in your environment.

What We Wanted

So here are our goals and the scope of what we set out to build:
  • High-performance ZFS-based storage system on SuperMicro hardware (good, reliable, affordable).
  • Fully redundant everything - the network, head units, enclosures, pools - you name it. If it performed a vital function in the storage system, it was to be capable of surviving a failure without (significant) downtime or any data loss.
  • Connectivity to the system over iSCSI on 10 Gigabit Ethernet. NFS is an option, but not the primary protocol. FC kept in our back pocket if the need arises (unlikely, given the per-port cost of FC and performance compared to iSCSI on 10 GbE).
    The primary reason to go with iSCSI over NFS was that iSCSI supports native multi-path.

The Fault Tolerance and Scalability

Looking at things from a high-level structure, we've decided to go with a fully-meshed network structure at each subsystem boundary:


Hardware-wise, the construction is:
  • Two dedicated 24-port 10 GbE switches (with 40 GbE uplinks for future expansion), each of which represents a fully isolated SAN. There is no stacking or synchronization going on between these switches, they really are fully independent of each other. Protocols such as virtual multi-chassis stacking and port bundling introduce more fragility into the network and that rarely helps uptime.
  • Two storage controllers, also called "heads". These are the boxes that actually run ZFS and the iSCSI target software (COMSTAR). They are the brain of the entire operation and that's where we'll focus most of the time in this article. The OS of choice for us was OmniOS.
    To save on complexity, we purchased a pair of 2U SuperMicro servers with SuperMicro's new X9DRW-7TPF+ motherboards which feature an on-board integrated Intel 82599-based controller with dual 10 GbE SFP+ connectors. These, together with another on-board dual-port 1 GbE Intel i350-based NIC and a dedicated IPMI/KVM port take care of all of the front-end connectivity to the head units, leaving all of the PCI-Express slots open for whatever monkey business we want to stuff in there.
  • A pair of SuperMicro SC847 4U JBOD boxes with dual-redundant SAS expanders and 45x 3.5'' front- and rear-accessible hot-swap drive bays, giving us a total capacity of 90x 3.5'' drives. While not the highest density available on the market, they are very reliable, cheap and overall easy to work with (no need to slide them out of the rack to manipulate drives - just pop them right in from the front or rear).
  • An initial "load" of 32x 2TB Toshiba 7k2 SAS hard drives and four OCZ Vertex 4 256GB SSDs for L2ARC. In a RAID-10 configuration this gives a total usable storage capacity of ~29 TiB (32 TB of mirrored capacity) and roughly 1 TB of L2ARC. The head nodes are each equipped with 128 GiB of DRAM.
  • All of the above can be scaled up by adding switches, NICs, drives, enclosures, SSDs and DRAM as our needs grow.
Best of all, all of this infrastructure (including the network switches) cost us less than $30k and in terms of brute performance is miles ahead of what any proprietary storage vendor could even hope to offer at this price point.
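The capacity figures above can be double-checked with a little arithmetic. The sketch below assumes two-way mirrors (RAID-10), drive sizes in decimal units as printed on the label, a POSIX awk, and an L2ARC striped across all four SSDs. The function names are made up for illustration.

```shell
#!/bin/sh
# Rough capacity arithmetic for the initial drive load.
# Assumption: vendor sizes are decimal (1 TB = 10^12 bytes, 1 GB = 10^9 bytes).

# mirror_capacity_tib DRIVES SIZE_TB: usable TiB of a pool of two-way mirrors.
mirror_capacity_tib() {
    awk -v n="$1" -v tb="$2" 'BEGIN { printf "%.1f\n", (n / 2) * tb * 1e12 / 2^40 }'
}

# stripe_capacity_tib DEVICES SIZE_GB: TiB of a plain stripe (cache devices are not mirrored).
stripe_capacity_tib() {
    awk -v n="$1" -v gb="$2" 'BEGIN { printf "%.2f\n", n * gb * 1e9 / 2^40 }'
}

mirror_capacity_tib 32 2      # 32x 2TB hard drives -> 29.1
stripe_capacity_tib 4 256     # 4x 256GB SSDs       -> 0.93
```

Filesystem overhead and reserved space are ignored here, so treat the results as an upper bound.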

Software

Any storage box is useless without the appropriate software equipment - in fact, that's what you're actually paying for when you buy a proprietary vendor's boxes, as the hardware is mostly stock x86 server components anyway. There are three important software subsystems that are part of the above storage system (besides the OS itself):
  1. The storage backend, which is ZFS itself. No sense in playing with anything less than the state of the art.
  2. COMSTAR to provide the iSCSI (and optional FC) target functionality on top of ZFS volumes.
  3. The in-kernel NFS server to provide the filing capabilities on top of ZFS filesystems.

Clustering Software

Since we've got two boxes accessing shared storage and providing a set of shared-access services to the outside world, this necessarily implies clustering. Clustering is one of those words that's easy to say, but very difficult to do (or at least, do right). I've written some cluster control programs and they have one thing in common: they are notoriously complex and difficult to debug. Therefore, writing my own thing for this deployment just wasn't an option.
So we turned to what the market has to offer. When it comes to Illumos, there are a few systems to choose from, both proprietary and open-source. Of the commercial ones, the best of breed currently is RSF-1. It features ZFS integration, SAS persistent reservations for fail-safe data consistency - the lot. But it's expensive, way more expensive than we were willing to pay.
One of the most common open-source systems to deploy for clustering, and the one I finally went with, is Heartbeat with the Pacemaker cluster resource manager (CRM). Unfortunately, there are no pre-built packages of Heartbeat and Pacemaker for Illumos, so I had to make my own. To make this as easy as possible, I have uploaded my product for all to download and improve upon, should you want to do so:
The prebuilt packages sit in the prebuilt_packages subdirectory - unpack and install using the familiar pkgadd commands. Once installed, we will need to proceed with configuring Heartbeat before starting it up.

Heartbeat Configuration

If you're using Heartbeat from the above package, it expects a configuration file in /opt/ha/etc/ha.d/ha.cf. Here's mine, nicely documented:
# Master Heartbeat configuration file
# This file must be identical on all cluster nodes

# GLOBAL OPTIONS
use_logd        yes             # Logging done in separate process to
                                # prevent blocking on disk I/O
baud            38400           # Run the serial link at 38.4 kbaud
realtime        on              # Enable real-time scheduling and lock
                                # heartbeat into memory to prevent its
                                # pages from ever being swapped out

apiauth cl_status gid=haclient uid=hacluster

# NODE LIST SETUP
# Node names depend on the machine's host name. To protect against
# accidental joins from nodes that are part of other zfsstor clusters
# we do not allow autojoins (plus we use shared-secret authentication).
node            head1
node            head2
autojoin        none

# COMMUNICATION CHANNEL SETUP
mcast   igb0    239.51.12.1 694 1 0     # management network
mcast   igb1    239.51.12.1 694 1 0     # dedicated NIC between nodes
mcast   ixgbe0  239.51.12.1 694 1 0     # SAN interface #0
mcast   ixgbe1  239.51.12.1 694 1 0     # SAN interface #1
serial  /dev/cua/a                      # hardwire serial interface

# NODE FAILURE DETECTION
keepalive       1       # Heartbeats every 1 second
warntime        5       # Start issuing warnings after 5 seconds
deadtime        10      # After 10 seconds, a node is considered dead
initdead        60      # Hold off declaring nodes dead for 60 seconds
                        # after Heartbeat startup.
# Enable the Pacemaker CRM with maximum channel compression.
# This is to make sure Pacemaker packets pass over the 38.4kbaud
# serial link without too much delay.
crm                     on
compression             bz2
traditional_compression yes
A few important notes on the above configuration file:
  • We do logging in a separate process called ha_logd. To use ha_logd, put "logfacility daemon" into /opt/ha/etc/logd.cf and start ha_logd up via "svcadm enable ha_logd".
  • To make absolutely sure that heartbeat won't be preempted or swapped out, we run it in realtime mode.
  • To help make the cluster behave predictably, we explicitly list all nodes that are part of it and disable auto-joining. This way each node knows who else it should see as part of the cluster, which allows us to deploy fencing as needed.
  • Since I'm somewhat paranoid about data corruption, we configure our communication links over all available Ethernet links plus a direct null-modem hardwire link between the motherboards.
    Serial links have the benefit of being essentially zero-configuration devices - there is no networking stack to accidentally misconfigure (all you need to do is make sure both ends expect the same link speed). In addition, your typical DE-9 connector has screws on it, allowing you to screw it tight to prevent accidental disconnection. In a subsequent article I will show how we monitor the health of all cluster links and raise an alarm in case one of them goes down.
  • We turn on the Pacemaker CRM and enable maximum channel compression - this is important since the serial link is quite slow and we don't want to introduce a large delay when communicating over it (the Ethernet interfaces aren't a problem).
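The failure-detection timers in the configuration file only make sense in a certain order: heartbeats must be more frequent than the warning threshold, which in turn must be below the dead time. A minimal sketch of a sanity check follows, assuming all four values are given in plain seconds as in my file (Heartbeat also accepts other units, which this does not handle). The function name check_hb_timers is hypothetical.

```shell
#!/bin/sh
# check_hb_timers FILE: verify keepalive < warntime < deadtime <= initdead
# in a Heartbeat ha.cf. Assumes plain integer seconds (no "ms" suffixes).
check_hb_timers() {
    awk '
        $1 == "keepalive" { k = $2 }
        $1 == "warntime"  { w = $2 }
        $1 == "deadtime"  { d = $2 }
        $1 == "initdead"  { i = $2 }
        END {
            if (k == "" || w == "" || d == "" || i == "") {
                print "missing timer directive"; exit 1
            }
            if (!(k + 0 < w + 0 && w + 0 < d + 0 && d + 0 <= i + 0)) {
                print "inconsistent timers"; exit 1
            }
            print "timers ok"
        }
    ' "$1"
}
```

Running it as "check_hb_timers /opt/ha/etc/ha.d/ha.cf" on each node before enabling the service catches a mistyped value early.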
Heartbeat also provides a set of OCF resource scripts in /opt/ha/lib/ocf/resource.d/heartbeat. Resource scripts are simply programs that the CRM invokes to get things done (e.g. starting a resource on a node, stopping a resource on a node, monitoring a resource's state, etc.). Think init scripts - that's what they are.
By default, Heartbeat does not ship with support for ZFS, so we need to add it in. I wrote a simple OCF script which manages ZFS pools and is part of the stmf-ha bundle. Copy the file in heartbeat/ZFS into the /opt/ha/lib/ocf/resource.d/heartbeat directory on your cluster nodes. This will teach your CRM how to import and export clustered ZFS pools on your nodes.
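To give a feel for what such a resource script does, here is a heavily simplified sketch of the core logic. It is not the script from stmf-ha. It assumes the usual OCF conventions (parameters arrive as OCF_RESKEY_<name> environment variables, exit code 0 means success and 7 means "not running"), and the zfs_agent function and the ZPOOL override variable are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Hypothetical minimal OCF-style logic for a ZFS pool (not the stmf-ha script).
# The CRM passes resource parameters as OCF_RESKEY_<name> environment variables.
ZPOOL=${ZPOOL:-zpool}          # overridable so the logic can be tried with a stub

OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_ERR_UNIMPLEMENTED=3
OCF_NOT_RUNNING=7

zfs_agent() {
    pool=${OCF_RESKEY_pool:?pool parameter is required}
    case "$1" in
        monitor)
            # An imported pool shows up in "zpool list".
            $ZPOOL list -H -o name "$pool" >/dev/null 2>&1 || return $OCF_NOT_RUNNING
            ;;
        start)
            # Starting an already imported pool is a no-op.
            zfs_agent monitor && return $OCF_SUCCESS
            # cachefile=none keeps the import out of /etc/zfs/zpool.cache.
            $ZPOOL import -o cachefile=none "$pool" || return $OCF_ERR_GENERIC
            ;;
        stop)
            # Stopping a pool that is not imported must succeed as well.
            zfs_agent monitor || return $OCF_SUCCESS
            $ZPOOL export "$pool" || return $OCF_ERR_GENERIC
            ;;
        *)
            return $OCF_ERR_UNIMPLEMENTED
            ;;
    esac
    return $OCF_SUCCESS
}
```

A real agent would also have to answer the meta-data and validate-all actions and would end with something like zfs_agent "$1"; exit $?. The point here is only that start and stop are idempotent and that monitor reports the pool's state through its exit code.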
The last step that needs to be done is to generate an authentication token for Heartbeat (all Heartbeat channel messages are authenticated) and make sure it is identical on both nodes. To do so, simply execute the following command on one node and copy the resulting file (/opt/ha/etc/ha.d/authkeys) onto the second node:
# ( echo -ne "auth 1\n1 sha1 "; openssl rand -rand /dev/random \
  -hex 16 2> /dev/null ) > /opt/ha/etc/ha.d/authkeys
# chmod 400 /opt/ha/etc/ha.d/authkeys
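After copying the file to the second node, it is worth checking on both nodes that it still looks the way the command above wrote it. The following is a sketch of such a check, assuming the exact two-line layout produced above (key id 1, sha1, 32 hex digits) and owner-only permissions. The function name check_authkeys is hypothetical.

```shell
#!/bin/sh
# check_authkeys FILE: sanity-check a Heartbeat authkeys file that was
# generated as "auth 1" followed by "1 sha1 <32 hex digits>".
check_authkeys() {
    f=$1
    # Line 1 selects the active key, line 2 defines it.
    [ "$(sed -n '1p' "$f")" = "auth 1" ] || { echo "bad auth line"; return 1; }
    sed -n '2p' "$f" | grep -Eq '^1 sha1 [0-9a-f]{32}$' || { echo "bad key line"; return 1; }
    # Only the owner may have access, nothing for group or others.
    case "$(ls -l "$f" | cut -c1-10)" in
        -r--------|-rw-------) ;;
        *) echo "permissions too open"; return 1 ;;
    esac
    echo "authkeys ok"
}
```

Comparing a checksum of the file on both nodes (for example with cksum) then confirms that the two copies are identical.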
That's it, all you need to do now is enable Heartbeat via "svcadm enable heartbeat". If Heartbeat reboots your machine at this point, it is having trouble starting up Pacemaker. In that case, change the "crm on" line in Heartbeat's configuration to "crm respawn" and watch your syslog (/var/adm/messages) for errors which might help you debug what's going wrong.

Identifying Cluster Resources

In the previous step we've set up Heartbeat, but we haven't actually told our cluster to do anything. All Heartbeat does is let all the cluster nodes talk to each other. However, the point of the cluster is to manage a set of services, devices, ZFS pools, etc. In clustering-speak, we want to define what resources will be part of the cluster. A resource is any useful software entity that your cluster can work with. It can be an IP address that migrates across nodes, it can be a web service, a mounted filesystem, etc.
Going over the previous enumeration of software subsystems, we can identify what parts of them we want to manage as cluster resources:
  • The ZFS subsystem primarily consists of the pool we want to attach to a given node, so we will want our cluster to automatically manage importing and exporting a pool. The filesystems and volumes on top of a ZFS pool are automatically managed by the ZFS automounter.
  • NFS exports are automatically managed by the ZFS "sharenfs" property, which is handled by the share(1M) utility. No need to explicitly do anything here, ZFS already does this as part of pool import/export.
  • The COMSTAR configuration of iSCSI targets and LUs needs to be migrated - this links the data on ZFS volumes to outside consumers. Ideally, we would want to tie this to pool import and export as in the case of NFS.
  • Finally, to make cluster migration transparent to outside users, we will want to make sure we migrate the IP addresses used for storage traffic. In case of a cluster-failover event outside users will see a reset on their storage session, which will simply force them to re-connect and reinitialize (which should be transparent for the most part).
So far so good, we've identified three broad areas to migrate: the ZFS pool, the COMSTAR configuration pertaining to the pool and IP addresses. Two of these, the ZFS pool and the IP addresses are pretty easily done and I will elaborate on these below. Migrating COMSTAR configurations is a lot more hairy and I will detail it in a future blog post.

Pacemaker Configuration

Pacemaker is the cluster resource manager, i.e. the actual brain of the cluster. It is the stuff that makes policy decisions on which resources to run where, when to fence a node and how to go about the general business of being a cluster. If you used the above packages to install Pacemaker, you can use the crm shell to configure it (simply run "crm" to start it up in interactive mode).
One of the nice features of Pacemaker is that no matter on which node you invoke the crm shell, it will always talk to the currently active cluster controller instance and automatically sync persistent configuration changes to all nodes in the cluster. I won't delve into the details of how to use "crm" in this article, just the basics to get our setup going. If you've ever used a Cisco router, the crm shell will be somewhat familiar. Tab completion works as expected and you configure Pacemaker by first entering the configuration mode (using the "configure" command).
When you initially run crm, your current configuration should be rather barren:
root@head1:~# crm
crm(live)# configure  
crm(live)configure# show
node $id="3df6dfa2-107b-6608-ba91-d17cd30c0d78" head1
node $id="71270ebc-ac7c-4096-9654-95e4f19d622b" head2
property $id="cib-bootstrap-options" \
    dc-version="1.0.11-6e010d6b0d49a6b929d17c0114e9d2d934dc8e04" \
    cluster-infrastructure="Heartbeat" \
    stonith-enabled="true" \
    last-lrm-refresh="1369414428" \
    stonith-action="reboot" \
    maintenance-mode="false"
The two node lines are automatically filled in by Pacemaker by talking to Heartbeat to discover all the cluster nodes. The property line just tells you a bunch of global Pacemaker property settings.

Node Fencing And STONITH

There are times when the cluster needs to recover from an unexpected situation (such as a node failing) and since we're dealing with a shared-storage cluster here, if the cluster can't execute this kind of recovery reliably, data corruption is a likely outcome. Clustering software deals with this by employing a technique called "fencing", where failed nodes are forcibly taken offline to guarantee that they can't be accessing shared resources. Sometimes this is also lovingly referred to as STONITH or "Shoot The Other Node In The Head". Running a shared-storage cluster without STONITH is generally considered a "Bad Idea(tm)" - you're basically asking for trouble.
The STONITH methods available in any given environment depend on the available hardware for implementing it. In our case, we have machines with an integrated IPMI processor running over a dedicated NIC, which enables us to use the IPMI protocol to remotely power down a node which might need fencing. If the STONITH operation fails, Pacemaker will not try to take over the resources, transitioning into a "maintenance" state needing administrator intervention (which is bad, but better than corrupting data).
For instance, assuming the above two nodes ("head1" and "head2"), you can configure STONITH resources using the following set of crm commands:
crm(live)configure# primitive head1-stonith stonith:external/ipmi \
    params hostname="head1" ipaddr="<head1's_IPMI_IP>" \
    userid="<IPMI_user>" passwd="<IPMI_passwd>" interface="lanplus" \
    op start start-delay="10"
crm(live)configure# primitive head2-stonith stonith:external/ipmi \
    params hostname="head2" ipaddr="<head2's_IPMI_IP>" \
    userid="<IPMI_user>" passwd="<IPMI_passwd>" interface="lanplus"
This tells Pacemaker that there are STONITH resources it can use for fencing. The "start-delay" operation parameter tells Pacemaker to delay executing the head1-stonith action by 10 seconds. It is no mistake that this is only present on the first STONITH resource item. IPMI has a certain delay to executing a power control action, and in case of a split-brain problem, we don't want both nodes killing each other, leaving us with no nodes up and running. Instead, we want to kill the head2 node immediately, and wait a little before killing head1 - on the off chance that a split brain did occur, only one of these will succeed (the other node will have been killed before it could execute its STONITH operations).
Next we need to put a location constraint on them to make sure that a machine's STONITH resource isn't running on the machine itself (which would kind of defeat the purpose - if the machine is misbehaving, we want the other nodes to kill it):
crm(live)configure# location head1-stonith-pref head1-stonith -inf: head1
crm(live)configure# location head2-stonith-pref head2-stonith -inf: head2
One last setting that I found useful was to change the default STONITH action from "reboot" to "poweroff" - this guarantees that in case the cluster gets split and one node kills the other, only one node will remain running (the other won't boot up again and try to take the resources over) - this gives you time to come in and diagnose problems.
crm(live)configure# property stonith-action=poweroff
Now just commit the configuration and we're done:
crm(live)configure# commit
With STONITH in place, we can go on to build our actual cluster resources.
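Because the STONITH primitives and their location constraints follow the same pattern for every node, they are easy to mistype when entered by hand. As a sketch, the lines can be generated from a node list instead, which keeps the resource names consistent. The stonith_config function is a hypothetical helper, the placeholders in angle brackets still have to be filled in, and the output is meant to be reviewed and then pasted into the crm configure prompt.

```shell
#!/bin/sh
# stonith_config NODE...: print crm configure lines for IPMI fencing.
# Only the first node's fencing resource gets a start delay, so that in a
# split brain the two nodes do not shoot each other at the same moment.
stonith_config() {
    first=yes
    for node in "$@"; do
        printf 'primitive %s-stonith stonith:external/ipmi \\\n' "$node"
        printf '    params hostname="%s" ipaddr="<%s_IPMI_IP>" \\\n' "$node" "$node"
        if [ "$first" = yes ]; then
            printf '    userid="<IPMI_user>" passwd="<IPMI_passwd>" interface="lanplus" \\\n'
            printf '    op start start-delay="10"\n'
            first=no
        else
            printf '    userid="<IPMI_user>" passwd="<IPMI_passwd>" interface="lanplus"\n'
        fi
        # A node must never run the resource that is meant to fence it.
        printf 'location %s-stonith-pref %s-stonith -inf: %s\n' "$node" "$node" "$node"
    done
}

stonith_config head1 head2
```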

IP Addresses

We need to make sure that when migrating iSCSI resources, the clients don't have to go on a SCSI target hunt to discover where their targets have gone (to which IP address) - this can most easily be achieved by moving around IP addresses to the machine that is currently running the iSCSI targets.
Let's designate a pair of IP addresses per head node that will be preferably running on the respective head node with failover to the other node if need be. In my network these sit in different IP subnets. I'm running both head nodes in an active-active fashion, with all storage resources split between two ZFS pools, vmpool01 and vmpool02. This allows me to utilize both storage head nodes simultaneously for maximum performance while imposing the slight inconvenience of having the pool split into two "halves" (something I can live with). If you want to run in an active-passive way, you will only need two SAN IP addresses and you can remove all of the location constraints below (but keep those for STONITH above).
  • The SAN01 interface on head1 will be 192.168.141.207/24 on ixgbe0
  • The SAN02 interface on head1 will be 192.168.142.207/24 on ixgbe1
  • The SAN01 interface on head2 will be 192.168.141.208/24 on ixgbe0
  • The SAN02 interface on head2 will be 192.168.142.208/24 on ixgbe1
To allow for communication on the ixgbe interfaces even when no cluster IP addresses are configured on them, we will also add a few hard-coded IP addresses which always sit on the respective machine (these are the ones Heartbeat will bind to for cluster messaging):
root@head1:~# ipadm create-addr -T static -a 192.168.141.205/24 ixgbe0/v4
root@head1:~# ipadm create-addr -T static -a 192.168.142.205/24 ixgbe1/v4
root@head2:~# ipadm create-addr -T static -a 192.168.141.206/24 ixgbe0/v4
root@head2:~# ipadm create-addr -T static -a 192.168.142.206/24 ixgbe1/v4
Let's configure the resources for each of the clustered IP addresses in the crm shell:
crm(live)configure# primitive head1_san01_IP ocf:heartbeat:IPaddr \
  params ip="192.168.141.207" cidr_netmask="255.255.255.0" nic="ixgbe0"
crm(live)configure# primitive head1_san02_IP ocf:heartbeat:IPaddr \
  params ip="192.168.142.207" cidr_netmask="255.255.255.0" nic="ixgbe1"
crm(live)configure# primitive head2_san01_IP ocf:heartbeat:IPaddr \
  params ip="192.168.141.208" cidr_netmask="255.255.255.0" nic="ixgbe0"
crm(live)configure# primitive head2_san02_IP ocf:heartbeat:IPaddr \
  params ip="192.168.142.208" cidr_netmask="255.255.255.0" nic="ixgbe1"
Next we want to tie them to their respective machines by awarding them a preferential score of "100":
crm(live)configure# location head1_san01_IP_pref head1_san01_IP 100: head1
crm(live)configure# location head1_san02_IP_pref head1_san02_IP 100: head1
crm(live)configure# location head2_san01_IP_pref head2_san01_IP 100: head2
crm(live)configure# location head2_san02_IP_pref head2_san02_IP 100: head2
After a "commit" of this configuration, you should see the following on your head nodes:
root@head1:~# ipadm show-addr | grep ixgbe
ixgbe0/v4         static   ok           192.168.141.205/24
ixgbe0/_a         static   ok           192.168.141.207/24
ixgbe1/v4         static   ok           192.168.142.205/24
ixgbe1/_a         static   ok           192.168.142.207/24
root@head2:~# ipadm show-addr | grep ixgbe
ixgbe0/v4         static   ok           192.168.141.206/24
ixgbe0/_a         static   ok           192.168.141.208/24
ixgbe1/v4         static   ok           192.168.142.206/24
ixgbe1/_a         static   ok           192.168.142.208/24
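Checking that listing by eye gets tedious once you start testing fail-over. A small sketch that automates it follows: it reads "ipadm show-addr" output on standard input and prints every expected address that is not present in the "ok" state. It assumes the four-column layout shown above, and missing_addrs is a hypothetical helper name.

```shell
#!/bin/sh
# missing_addrs ADDR...: read "ipadm show-addr" output on stdin and print
# every expected address that is not listed with state "ok".
# Assumed columns: ADDROBJ TYPE STATE ADDR.
missing_addrs() {
    listing=$(cat)
    for want in "$@"; do
        printf '%s\n' "$listing" |
            awk -v want="$want" '$3 == "ok" && $4 == want { found = 1 } END { exit !found }' ||
            printf '%s\n' "$want"
    done
}
```

For example, on head1 you would run: ipadm show-addr | missing_addrs 192.168.141.207/24 192.168.142.207/24. Empty output means both clustered addresses are up on that node.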

ZFS Pools

Now that we have our IP addresses up and running, we can configure ZFS pools to migrate across the machines. The configuration is going to be fairly similar to what we did for IP addresses.
First create your ZFS pools on the machines using the standard "zpool create" syntax with one twist. By default, ZFS pools are imported in a persistent manner, meaning, their configuration is cached in the /etc/zfs/zpool.cache file. At next boot, the machine will attempt to import this pool automatically. That is not what we want. We want Pacemaker to control pool import and export, and so we need to suppress this behavior. Luckily, this is easy to do - simply set the "cachefile" property to "none" while creating the pool. The pool will be imported in a non-persistent way (i.e. it will not be placed in the cache file).
root@head1:~# zpool create -o cachefile=none vmpool01 <pool-vdevs>
root@head2:~# zpool create -o cachefile=none vmpool02 <pool-vdevs>
Please note that the cachefile parameter needs to be overridden at each pool import to make sure it isn't cached (otherwise it will be).
Next we declare the basic resources for the ZFS pools in the crm shell:
crm(live)configure# primitive vmpool01 ocf:heartbeat:ZFS \
  params pool="vmpool01" \
  op start timeout="90" \
  op stop timeout="90"
crm(live)configure# primitive vmpool02 ocf:heartbeat:ZFS \
  params pool="vmpool02" \
  op start timeout="90" \
  op stop timeout="90"
Above we used the ZFS OCF script shipped with stmf-ha. This automatically imports a pool non-persistently, making sure proper clustering semantics are maintained. We also set a rather generous start and stop timeout for this resource. By default Pacemaker expects a resource to stop within 20 seconds, and ZFS exports and imports can take longer than that, so we need to set our limits higher here.
Now we also need to place the same kind of location constraint on these resources as we did on the IPs, plus a colocation constraint that binds each pool to its SAN IPs:
crm(live)configure# location vmpool01_pref vmpool01 100: head1
crm(live)configure# location vmpool02_pref vmpool02 100: head2
crm(live)configure# colocation vmpool01_with_IPs \
  inf: vmpool01 head1_san01_IP head1_san02_IP
crm(live)configure# colocation vmpool02_with_IPs \
  inf: vmpool02 head2_san01_IP head2_san02_IP
This will make sure that the ZFS pools will always migrate together with the respective SAN IPs.
There is one final twist to this configuration. By default, Pacemaker will stop and start resources in parallel, meaning, if a fail-over does need to occur, your SAN IPs can migrate over much faster than your ZFS pool can. This is obviously bad, as your SAN clients could attempt to connect to the target node's IP addresses (which are already up), but none of the storage resources will be there (since the ZFS pool might still be exporting/importing). So you need to tell Pacemaker that it has to do stuff in a particular order, like so:
crm(live)configure# order vmpool01_order \
  inf: vmpool01 ( head1_san01_IP head1_san02_IP )
crm(live)configure# order vmpool02_order \
  inf: vmpool02 ( head2_san01_IP head2_san02_IP )
This tells Pacemaker that you want the pool migrated first and then, once it's ready, the IP addresses can be migrated over in parallel. Obviously, Pacemaker understands that stopping the resources on the original node takes place in reverse order.
Now just commit your configuration and you can test voluntary fail-over out by doing a "svcadm disable heartbeat" on one of the nodes (the node will first voluntarily give up its resources, transition them over to another node and only then Heartbeat will shut down). Looking at "crm_mon -1" you should see the following sequence of events take place (assuming you're stopping Heartbeat on head1):
  1. The SAN IPs are stopped on head1
  2. vmpool01 is exported from head1
  3. vmpool01 is imported on head2
  4. head1's SAN IPs are started on head2
In my environment, the whole operation (including iSCSI target offlining) takes place in about 40 seconds - YMMV.

In Part 2

In the next part of this blog post we will be talking about how to configure iSCSI targets in a manner that will allow them to migrate across the different head nodes transparently and under the control of Pacemaker. If you only want to do NFS, the above steps are enough for you to set your NFS services up fully (since NFS shares are configured as ZFS filesystem properties). COMSTAR's configuration isn't so simple and so needs a little more attention.

36 comments:

  1. Great post, Saso. Looking to the next one, and hoping that you'll address advanced ZFS configuration, like compression and volblocksize choices!

  2. Hi Saso, nice blog you have here.
    Would you mind explaining more about ZFS tweaks for VMware, especially when using iSCSI? Thanks

    Replies
    1. Hi Reski. I'll dig into iSCSI in the next blog post that will deal specifically with COMSTAR and how to set things up for fail-over. ZFS itself, though, needs few if any tweaks for VMware. Sure you need to select a good volblocksize, but other than that the thing just works.

  3. Hi, Saso!

    Well, I'm going to implement this on fibre channel, when I understand the move of the COMSTAR configuration...? How the f--k would I do that? If I got two heads, active-active, as you say you got, I will need one COMSTAR config on each head. How do I then move the other heads config, without interrupting/restarting the COMSTAR service? Or can I perhaps run COMSTAR in multiple instances?
    Perhaps I need to wait until you publish part two...?

    Rgrds Johan

    Replies
    1. Hi Anders,
      Thanks for your comment.
      The configuration can be migrated over using stmf-ha (you'll find it on github, complete with manpage and examples), but that only supports iSCSI (since I didn't have any FC hardware available when I developed it). That means you'll need to extend it to get FC support. I'd appreciate if you can share your patches afterwards, so that they can be integrated upstream and so other FC users can use it too.
      I'll try to get to writing part two this weekend - sort of kept it on the back burner, but seeing as there is interest in this, I'll try to get myself to find the time to do.
      Cheers.

  4. Hi Saso, thanks for the writeup. I'm trying almost exactly what you are doing as well, on a supermicro SBB.

    Some comments:

    1) after doing the pkgadd -d ..., I had to also configure my .profile to add the following:
    export PYTHONPATH=/opt/ha/lib/python2.6/site-packages
    export PATH=/opt/ha/bin:/opt/ha/sbin:$PATH
    export OCF_ROOT=/opt/ha/lib/ocf
    export OCF_AGENTS=/opt/ha/lib/ocf/resource.d/heartbeat

    2) I also had to pkg install ipmitool

    3) occasionally when running crm commands, I get this error:
    crm(live)configure# verify
    ps: unknown output format: -o command
    usage: ps [ -aAdefHlcjLPyZ ] [ -o format ] [ -t termlist ]
    [ -u userlist ] [ -U userlist ] [ -G grouplist ]
    [ -p proclist ] [ -g pgrplist ] [ -s sidlist ]
    [ -z zonelist ] [-h lgrplist]
    'format' is one or more of:
    addr args c class comm ctid dmodel etime f fname gid group lgrp lwp
    nice nlwp opri osz pcpu pgid pid pmem ppid pri project projid pset psr
    rgid rgroup rss ruid ruser s sid stime taskid time tty uid user vsz
    wchan zone zoneid

    Not sure what this is in relation to.

    4) and my main problem: after I create the SAN IPs (e.g. head1_san01_IP ), I don't see them created in output from ipadm show-addr, and doing a "crm resource status" shows the headX-stonith and headX_sanX_IP resources as "Stopped" (I'm using node01 rather than head01, as per output below):

    root@node01:~# crm resource status
    node01-stonith (stonith:external/ipmi) Stopped
    node02-stonith (stonith:external/ipmi) Stopped
    node01_san01_IP (ocf::heartbeat:IPaddr) Started FAILED
    node02_san01_IP (ocf::heartbeat:IPaddr) Started FAILED

    (and then after a while, it shows "Stopped" rather than "FAILED")

    I notice on the console that I see errors "ERROR: error executing ipmitool: Error: Unable to establish IPMI v2 / RMCP+ session Unable to get Chassis Power Status".

    If I run e.g. "root@node01:~# ipmitool -I lan -H 10.25.69.127 -U -A MD5 -P power status" I do get a good response e.g. "Chassis Power is on".

    Is that related to using lanplus versus lan versus e.g. open, or any ideas about this one?

    Replies
    1. Sorry, the line:
      If I run e.g. "root@node01:~# ipmitool -I lan -H 10.25.69.127 -U -A MD5 -P power status" I do get a good response e.g. "Chassis Power is on".

      should have read (I think the blog engine stripped the characters):
      If I run e.g. "root@node01:~# ipmitool -I lan -H 10.25.69.127 -U username -A MD5 -P password power status" I do get a good response e.g. "Chassis Power is on".

    2. Just for further testing, if I run e.g. "root@node01:~# ipmitool -I lanplus -H 10.25.69.127 -U username -P password power status" I do also get "Chassis Power is on". So doesn't look like a lan/lanplus issue?

    3. Hi,

      To fix 3), you need to patch crm's utils.py in the Python site-packages directory under /opt/ha:

      perl -pi -e 's#ps -e -o pid,command#ps -e -o pid,comm#' /opt/ha/lib/python2.6/site-packages/crm/utils.py
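The perl one-liner above is a plain string substitution. As a minimal sketch of the same edit in Python (the helper name `patch_ps_format` is made up for illustration, and the sample line is invented rather than copied from crm's utils.py), the replacement looks roughly like this:

```python
# Hypothetical helper: mirrors the perl substitution as a plain string replace.
OLD = "ps -e -o pid,command"  # format keyword rejected by the ps in the error above
NEW = "ps -e -o pid,comm"     # "comm" is in the list of formats that ps prints


def patch_ps_format(source_text):
    """Return source_text with the unsupported ps format replaced."""
    return source_text.replace(OLD, NEW)


# Invented sample line, only to show the effect of the substitution.
sample = 'cmd = "ps -e -o pid,command | grep crmd"'
patched = patch_ps_format(sample)
print(patched)
```

Running the substitution a second time changes nothing, since the patched text no longer contains the old format string.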

    4. I have exactly the same problem with Stopped status of IPaddr resource, and I'm not able to create virtual IP address. Did you find any solution?

    5. Yeah, I have the same issue. Along with the inconsistency in the prebuilt packages (related to the obsolete crm_standby, etc.), I also get this:

      Failed actions:
      failover-ip_start_0 (node=omnios-ha2, call=3, rc=5, status=complete): not installed
      failover-ip_start_0 (node=omnios-ha1, call=3, rc=5, status=complete): not installed

      and it doesn't work for me either. The only thing I can think of is that something has changed since Saso built and tested it all, since clearly it does not work. Disappointing, since neither FreeBSD nor ZoL has acceptable write performance to NFS datastores when using sync mode (even with a high-performance SSD as SLOG). But if I can't use HA clustering, I may as well stick with OmniOS and sync=disabled and wait for ZoL to implement AIO (their NFS write perf sucks, since the kernel nfsd is breaking up large writes into multiple synchronous 4KB writes). Sigh...

  5. Any news on part 2? I am looking into doing almost this exact thing.

  6. Hi Saso

    This is exactly what I'm looking for: a free, open-source recipe for centralized storage units (in my case two) making data available to multiple nodes (in my case two) in a decentralized network, just like your picture.

    I've been waiting for Part 2 of your blog e.g. the COMSTAR configuration secret sauce.

    Once you publish this, I plan to give your solution a try, beat on it on real HW, and wrap up the maintenance commands in idiot-proof scripts for adding, deleting, and migrating pools from the cluster.

    If it works out to my satisfaction (useful in a production deployment), then I will integrate it into the napp-it GUI.

    Thanks

    Jon Strabala

  7. Hi Saso

    I am one more waiting for the Part 2 :)

  8. One more waiting for part 2 :)

  9. Hi, in your second (and may I add, very informative) post, you mention the following: Luckily COMSTAR supports a "no persistence" option in the service configuration, so that any configuration commands issued don't actually modify the persistent configuration in the SMF configuration store.
    How exactly do you implement this? An example of the commands or a reference to a web page would help.

  10. A couple of questions about the authkey file. In the command you tell us to use, it says 'auth 1' and '1 sha1', but the openssl command says 'openssl md5'. Shouldn't the latter be 'openssl sha1'? Also, with the pipe input, it shows as '1 sha1 (stdin)= 425a501057066dc4fe47b14427535805'. I think we want to remove the '(stdin)=', no?

    Replies
    1. I can understand the confusion when seeing "sha1" and "md5" on a single line, but these two words mean different things:
      - the "sha1" word is a flag for heartbeat to use the "sha1" hash for authenticating its messages to its peers
      - the "md5" word is just a flag to openssl to compute an MD5 hash from some random garbage we read from /dev/random and so give us a random but ASCII printable secret value for use in heartbeat
      You can pass "sha1" or "sha256" or whatever hash method your openssl installation supports. It just so happens that "md5" is supported on pretty much every version and produces a string that is long enough to be secure as a password (32 hex characters).
      As for the "(stdin)=" thing - that's just an artifact of your using openssl 1.0.1. The older openssl 0.9.8 which I used doesn't print that. Anyway, just erase that part, it's garbage we don't want in that file.
      Thanks for bringing these up though, I'll add a note to the blog post.
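To make the distinction concrete, here is a minimal Python sketch of the same idea. It is an illustration under stated assumptions, not the procedure from the blog post: it uses os.urandom instead of reading /dev/random through openssl, and the function names are made up. The MD5 step only turns random bytes into a printable secret, while "sha1" is the method named in the file for heartbeat to use.

```python
import hashlib
import os


def random_secret():
    """Hash random bytes with MD5 purely to get a printable hex string."""
    return hashlib.md5(os.urandom(64)).hexdigest()


def strip_stdin_prefix(openssl_output):
    """Drop the '(stdin)= ' prefix that newer openssl versions print."""
    text = openssl_output.strip()
    prefix = "(stdin)= "
    if text.startswith(prefix):
        return text[len(prefix):]
    return text


def authkeys_text(secret, method="sha1"):
    """Build the two lines discussed above: 'auth 1' and '1 <method> <secret>'."""
    return f"auth 1\n1 {method} {secret}\n"


secret = random_secret()
print(authkeys_text(secret), end="")
```

If the secret was already generated with openssl 1.0.1, strip_stdin_prefix shows the cleanup described above: only the hex digits belong in the file.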

    2. Okay, thanks. BTW, I noticed that when I try to run crm in ongoing (top style) mode, it exits after one iteration and says:

      Defaulting to one-shot mode
      You need to have curses available at compile time to enable console mode

  11. Hmmm, so I have my two virtual machines set up so I can test a basic 2-node HA cluster. The heartbeat stuff comes up fine. No resources (yet). So I tried to test node failover with the documented 'crm' followed by 'node' and then 'standby', and I get:

    Error performing operation: The object/attribute does not exist
    scope=status name=standby value=off

    'online' fails the same way? crm_standby command ditto (I assume it just does the lower level stuff as a wrapper?)

  12. Just a follow-up. My cluster has no resources or anything, but I can't imagine why that would break node failover (and the message is completely non-helpful.) Unfortunately, unlike the curses issue, this one is a show-stopper, since an HA system that won't failover is kind of useless :(

  13. Weird, I found a similar issue with FreeBSD back in 2010 via Google, and instead of 'node standby xxx' you need to say 'node attribute xxx set standby on'?

  14. No comments? I assume this was all tested, so the manual failover is one of those things not normally done? Or...

    Replies
    1. Dan, I understand your frustration, but I can't figure out exactly what's going wrong in your install. In my case I tested manual and automatic failover and it all just worked. Unfortunately I don't have the time right now to engage in some heavy debug, so you're gonna have to try to figure this one out yourself. Perhaps ask around on the Pacemaker lists, because this issue doesn't seem to be Illumos or ZFS related.

    2. Very weird. Composed a whole reply and it apparently went to /dev/null. Anyway, I'm not saying this is illumos related, but rather a bug in the built packages (not your fault, it seems, but a window where they were inconsistent), e.g. the docs seem to say to do 'crm node standby', but that fails with the error I listed. You need to say 'attribute NODENAME set standby on'. So either you do manual failover some other way, or you do it the above way and forgot, or alpha particles (honestly, I just DL'ed your prebuilts and installed them...)

  15. Saso, sorry for the lack of detail. What I found was that the 'standby' and 'online' commands in the crm 'node' menu don't work. They throw the above errors. If you say 'node attribute xxx set standby on', that does what 'standby' used to do. I'm not saying this is a ZFS or illumos bug; my apologies if that was not clear. This mailing list entry references the same issue on FreeBSD: http://lists.linux-ha.org/pipermail/linux-ha/2010-December/042167.html. Note the part where it refers to an apparent change in the wrappers. My point here was: if you go into 'crm' and then 'node' and say 'standby' or 'online', does it work for you? If so, I'm baffled, as I downloaded and installed per this blog. If it doesn't work for you, there is an inconsistency in the pre-built packages that you never saw because you don't do manual failover that way (if not, how do you do it when you say 'I tested manual and automatic failover...'?). I guess what I am saying is that I am fine with this the way it is (even if what I described is a bug/glitch), if this is not something I need to deal with.
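For anyone scripting the workaround described in this comment, here is a small Python sketch that only builds the command line. The helper name is made up, and two things are assumptions rather than tested facts: that the workaround can be passed to crm as one-shot arguments in the same way as 'crm node standby unknown', and that setting the attribute to 'off' brings the node back online.

```python
def standby_command(node, enable=True):
    """Build the argv for the 'node attribute <node> set standby on|off' workaround."""
    # Assumption: "off" reverses standby; only "on" is reported in the comment.
    value = "on" if enable else "off"
    return ["crm", "node", "attribute", node, "set", "standby", value]


# "node01" is just an example node name.
print(" ".join(standby_command("node01")))
```

The returned list can be handed to subprocess.run on a cluster node; the sketch itself does not execute anything.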

    Replies
    1. Very strange that the reply of 7:58 didn't show up until I did the later reply. Hmmm...

  16. Okay, easy reproducible test case here. Did a minimal OmniOS install. Extracted HA.tar.bz2 and unpacked it. Installed the four packages and created a minimal HA config (only one node needed). Fire up heartbeat and wait for things to settle down. And then try this:

    root@unknown:~# crm node standby unknown
    Error performing operation: The object/attribute does not exist
    scope=status name=standby value=off

    This is an absolutely vanilla install following the blog instructions. As I said in an earlier post, apparently the syntax has changed, but the help builtin is out of date (or something like that.) So again, if you do manual failover, how do you do it?

  17. Saso, can you please answer my question about how you do manual failover? As I said, what the command help says does not jibe with how the command actually works. I am stuck right now, since my choices seem to be: OmniOS with your HA port, vs. zfsonlinux, whose HA works out of the box (installed on a standard CentOS install) but has a sync mode NFS write penalty even with a good SSD for SLOG, due to the kernel NFS server apparently breaking up large writes into a hundred or more 4KB writes.

  18. Wow, okay, never mind then. On to some other solution...

  19. If anyone would like to evaluate RSF-1 to provide full HA (including multi-node capability) on OmniOS (and most Solaris/Unix/Linux/FreeBSD variants), please drop us a line at www.high-availability.com and we'll be happy to work with you.

  20. Hello,

    Is it possible to run this setup as an active/active cluster? Or without disconnects from VMware NFS shares?

    Replies
    1. If by active-active you mean both head nodes are handling NFS traffic at the same time, then the answer is only if you have two zpools, one per head, which the other node can take over if the primary node for a particular zpool fails.
      As for without disconnects (during failover, I assume), the answer is unfortunately "no". NFS traffic from ESX is TCP and so a failover will always result in a TCP RESET of the connection. VMware should, however, be able to survive that and reconnect (the VMs will experience a temporary delay in IO as ESX buffers the requests while it waits for the storage backend to come back up).
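The two-zpool arrangement described above can be sketched as a small ownership table. This is only an illustration of the idea, not how Pacemaker is configured: the pool and head names are hypothetical, and the function simply shows which head serves each pool after one head fails.

```python
def owners_after_failure(preferred, failed_head):
    """Map each pool to the head serving it once failed_head is down."""
    heads = sorted(set(preferred.values()))
    survivors = [h for h in heads if h != failed_head]
    if not survivors:
        raise RuntimeError("no surviving head to take over the pools")
    # Pools on the failed head move to a survivor; the others stay put.
    return {
        pool: (head if head != failed_head else survivors[0])
        for pool, head in preferred.items()
    }


# Hypothetical layout: one zpool per head in normal operation.
preferred = {"pool1": "head1", "pool2": "head2"}
print(owners_after_failure(preferred, "head1"))
```

In normal operation each head serves its own pool, which is what makes the setup active-active; after a failure the surviving head carries both pools until the other one returns.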

    2. Thanks for your answer, and sorry for my bad English.

      But how do NetApp or EqualLogic do it with their SANs? They have two controllers (which are normally compact servers), and I can remove one controller and everything keeps working fine.

      How can I get this with OmniOS/ZFS?

  21. Hi, I had this running on OmniOS r10, updated it to r12, and the Python path is no longer recognized. Do you know how to fix it?
    Joe

  22. Hi Sašo,

    thanks for a great post.

    I am trying to build something slightly similar, although my use case is not VMware and hypervisors but large data storage with very modest performance requirements and very large data volumes (100s of TB).

    Due to a strict budget, I am considering using SATA rather than SAS drives (the new "archival" 8TB Seagate drives @ ~250 USD/drive seem like an attractive option for this).

    However, the usage of SATA drives in SAS expanders seems to be a known pain point.

    There seem to be 2 strategies. The first uses SAS-to-SATA interposers (one for each drive), which allow running the dual connection all the way down to the drive level.

    The 2nd strategy would be to use the redundancy up to the expander connection level and then a single path to each drive (a quick browse through the Supermicro manual for the enclosure seems to suggest this might be possible but I might be wrong).

    Would you or anyone else have any experience in using SATA drives in this dual expander enclosure?

    The objective is to still have 2 heads with a failover - we could live with active/passive rather than active/active if it would help things.

    Any help or suggestion from anyone would be highly appreciated.

    Petr
