OpenIndiana Hipster 2019.04 is here openindiana

We have released a new OpenIndiana Hipster snapshot, 2019.04. The most notable changes:

  • Firefox was updated to 60.6.3 ESR
  • Virtualbox packages were added (including guest additions)
  • Mate was updated to 1.22
  • IPS has received updates from OmniOS CE and Oracle IPS repos, including automatic boot environment naming
  • Some OI-specific applications have been ported from Python 2.7/GTK 2 to Python 3.5/GTK 3

More information can be found in the 2019.04 release notes, and new installation media can be downloaded from

OpenIndiana Confluence security incident openindiana

Our infrastructure went too long without attention. As a result, Confluence was compromised. Confluence runs on, together with the main site and exim/mailman.

  1. We know Confluence was compromised (the suspected date of infection is April 17). The malware was detected on May 8 at about 21:00 UTC. It looked like the one described in .
    The root cause was a security vulnerability in Confluence – .
    We believe it wasn’t a targeted attack – just some more-or-less stupid bot aiming at Confluence, mining bitcoins (this part didn’t work in an illumos zone, as expected) and perhaps recruiting a botnet.
  2. This zone contains the mailing list software and web server; other services were not impacted (it didn’t have access to git repos or packages).

Resolution steps.

  1. The malware was completely removed and the affected plugins were disabled on May 9 at about 4:00 UTC.
  2. Updating to the latest Confluence version took some time, because:
    • our Confluence license had expired and I had to contact Atlassian to get a new one;
    • it required updating Apache and the JDK.
  3. The update to Confluence 6.15.4 was completed on May 10 at about 08:00 UTC.

WordPress at was updated to 5.2.

Further steps.

Given that nobody has really been looking after this infrastructure server, we suggest the following steps.

  1. Move all valuable information from the wiki to and the docs directory.
  2. Preserve the wiki for now, for development purposes only.
  3. Completely decommission the wiki once the migration to oi-docs is finished.

OmniOS Community Edition r151030 LTS OmniOS Community Edition

The OmniOS Community Edition Association is proud to announce the general availability of OmniOS - r151030.

OmniOS is published on a six-month release cycle. r151030 LTS takes over from r151028, published in November 2018, and, since it is an LTS release, it also takes over from r151022. The r151030 LTS release will be supported for three years. It is the first LTS release published by the OmniOS CE Association since taking over the reins from OmniTI in 2017. The next LTS release is scheduled for May 2021. The old stable r151026 release is now end-of-life. See the release schedule for further details.

This is only a small selection of the new features and bug fixes in the new release; review the release notes for full details.

If you upgrade from r22 and want to see all new features added since then, make sure to also read the release notes for r24, r26 and r28.

New Features (since r28)

Before upgrading, make sure to review the upgrade notes in the release notes.

The OmniOS team and the illumos community have been very active in creating new features and improving existing ones over the last 6 months.

System Features

  • Support for the SMB 2.1 client protocol has been added (illumos issue 9735).

  • The console now has full framebuffer support with variable resolution, more colours and unicode fonts. This is also visible in the boot loader.

  • Several 32-bit only packages have been moved to 64-bit only.

  • OmniOS userland is now built with gcc version 8.

  • A default installation now includes ntpsec in place of ntp; the package can still be removed if not required.

  • A set of system default parameters is now installed in /etc/system.d/_omnios:system:defaults. These can be overridden if necessary by creating additional local files under /etc/system.d/.
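As an illustration, an override goes in a local file whose name sorts after the shipped defaults; the tunable below is a made-up example, not a recommendation:

```
* Hypothetical local override file: /etc/system.d/site:local
* Files under /etc/system.d/ are assembled in lexical order, so this
* file sorts after (and overrides) _omnios:system:defaults.
set zfs:zfs_arc_max = 0x100000000
```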

Commands and Command Options

  • The ipadm and dladm commands now show IP and link information if invoked without arguments.

  • dladm show-vnic now shows the zone to which each VNIC is assigned.

  • The default behaviour of recursive chown and chgrp has changed and these commands are now safer with respect to following symbolic links. If only the -R parameter is provided then these utilities now behave as if -P was also specified. Refer to the chown(1) and chgrp(1) manual pages for more information.

  • The /usr/lib/fm/fmd/fmtopo command has improved support for enumerating USB topology.


Zones

  • The defaults for new zones have changed. Creating a new zone now initially sets brand=lipkg and ip-type=exclusive.

  • Zone brand templates are available allowing zones to be created within zonecfg via: create -t <type>.
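For example, a zone could be created from one of these templates in a short zonecfg session; the zone name and path here are made up, and the template name is assumed to match the brand:

```
# zonecfg -z web0
zonecfg:web0> create -t pkgsrc
zonecfg:web0> set zonepath=/zones/web0
zonecfg:web0> commit
zonecfg:web0> exit
```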

  • pkgsrc branded zones are now available; these are sparse zones with pkgsrc pre-installed.

  • illumos branded zones are now available; these run an independent illumos distribution under the shared OmniOS kernel. Subject to the constraints imposed by the shared kernel, this can be used to run a different version of the OmniOS userland or even a different illumos distribution.

  • Zone VNICs and networking information can now be dynamically managed as part of the zone configuration. Refer to for more details.

  • A firewall policy can now be enforced on a non-global zone by creating ipf configuration files under <zoneroot>/etc/. Rules defined in these files cannot be viewed nor overridden from inside the zone. Additional rules can be defined within the zone. This works for all zone types apart from kvm zones; it is even possible to define a global firewall policy for a bhyve zone.
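A minimal sketch of such a policy, assuming a zone root of /zones/web0 (the path and rules are illustrative only):

```
# /zones/web0/etc/ipf.conf -- written from the global zone; the zone
# itself cannot view or override these rules, only add its own.
block in all
pass in quick proto tcp from any to any port = 22 keep state
pass out quick all keep state
```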

  • The memory footprint of zones has been reduced by disabling unnecessary services.


ZFS

  • Support for importing pools using a temporary name.

  • Support for variable-sized dnodes.

Package Management

  • pkg verify has gained an option to verify individual files:

        # chown sys /var
        # pkg verify -p /var
        PACKAGE                                                            STATUS
        pkg://omnios/SUNWcs                                                 ERROR
              dir: var
                      ERROR: Owner: 'sys (3)' should be 'root (0)'
  • Individual origins for a publisher can be enabled and disabled using -g to specify the origin:

        # pkg set-publisher -g --disable omnios
        # pkg publisher
        omnios       origin   online   F
        omnios       origin   disabled F
  • Package manifests now include SHA-2 hashes for objects, and extended hash information for binary objects, alongside the existing SHA-1 information for backwards compatibility with older pkg versions.

  • Automatic boot-environment names can now be based on the current date and time as well as the publication date of the update. Refer to the pkg(5) man page for more information. Example:

                # pkg set-property auto-be-name time:omnios-%Y.%m.%d

Hardware Support

  • Support for modern AMD and Intel systems.

  • New para-virtualisation drivers for running OmniOS under Microsoft Hyper-V/Azure (beta). These are delivered by the new driver/hyperv/pv package.

  • New bnx (Broadcom NetXtreme) network driver.

  • Improved support for USB 3.1.

Release Notes and Upgrade Instructions

Full release notes, including upgrade instructions, can be found at

OmniOSce Newsletter

Since the start of the OmniOS Community Edition project, we have predominantly announced our new releases via Twitter. We are now also offering a newsletter with announcements of updates, bug fixes and new releases. You can subscribe here.

Commercial Support

Have you ever wondered how OmniOS development gets financed? You may have noticed that there is no big company bankrolling it all. The way we keep afloat is by the companies who rely on OmniOS powered servers taking out support contracts for their hardware. How about you? Visit for more details and to generate a quote. If you aren’t in a position to take a support contract, please consider becoming an OmniOS patron to help secure its future -

About OmniOS Community Edition Association - this Swiss Association is responsible for the ongoing development and maintenance of OmniOS, having been established in Summer 2017 after OmniTI announced their withdrawal from the project.

OmniOSce Association Aarweg 17, 4600 Olten, Switzerland

OmniOS Community Edition r151028z, r151026az, r151022cx OmniOS Community Edition

OmniOS Community Edition weekly releases for w/c 29th of April 2019 are now available.

The following security fixes are available for all supported releases:

For r151028 only, perl has also been upgraded to version 5.28.2.

For further details, please see

Any problems or questions, please get in touch.

HA PostgreSQL on Tribblix with Patroni The Trouble with Tribbles...

When it comes to managing PostgreSQL replication, there are a number of options available.

I looked at stolon, but it's not the only game in town. In terms of a fully managed system, there's also patroni.

In terms of overall functionality, stolon and patroni are pretty similar. They both rely on etcd (or something similar) for storing state; they both take care of running the postgres server with the right options, and reconfiguring it as necessary; they'll both promote a replica to continue service if the master fails.
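The mechanism both tools share is a leased leader key in etcd: whichever node holds the key is the master, and a node that stops refreshing its lease loses the role. Here is a minimal sketch of that logic in Python, using a plain dict with expiry timestamps in place of a real etcd client, purely for illustration:

```python
# Sketch of the leader-election pattern patroni and stolon build on top
# of etcd. A real deployment uses etcd's atomic compare-and-swap with a
# TTL; here a dict of (holder, expires_at) stands in for the store.

class FakeStore:
    """Stands in for etcd: a leader key with a time-to-live."""
    def __init__(self):
        self._data = {}  # key -> (holder, expires_at)

    def acquire(self, key, holder, ttl, now):
        """Take (or refresh) the key if it is free, expired, or ours."""
        cur = self._data.get(key)
        if cur is None or cur[1] <= now or cur[0] == holder:
            self._data[key] = (holder, now + ttl)
            return True
        return False

store = FakeStore()
assert store.acquire("leader", "node1", ttl=5, now=0)      # node1 is master
assert not store.acquire("leader", "node2", ttl=5, now=3)  # lease still valid
assert store.acquire("leader", "node1", ttl=5, now=4)      # master refreshes
assert store.acquire("leader", "node2", ttl=5, now=10)     # lease expired: failover
```

In the real systems this loop runs continuously in the sentinel/keeper processes, which is why killing the master's process leads to a replica being promoted once the lease times out.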

So, here's how to set up a HA PostgreSQL cluster using patroni.

Before starting on anything like this with Tribblix, it's always a good idea to

zap refresh

so that you're up to date in terms of packages and overlays.

First create 3 zones, just like before:

zap create-zone -z node1 -t whole \
  -o base -O patroni -x

zap create-zone -z node2 -t whole \
  -o base -O patroni -x

zap create-zone -z node3 -t whole \
  -o base -O patroni -x

Building the zones like this, with the patroni overlay, will ensure that all the required packages are installed in the zones so you don't need to mess around with installing packages later.

Then zlogin to each node and run the etcd commands as before, to create the user and start etcd.

Now create a user to run postgres on each node

zlogin node1 (and 2 and 3)
useradd -u 11799 -g staff -s /bin/bash -d /export/home/pguser pguser
passwd -N pguser
mkdir -p /export/home/pguser
chown -hR pguser /export/home/pguser

Now you need to create YAML files containing the configuration for each node. See for the sample files I've used here.
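For orientation, each node's file roughly follows this shape; the addresses (192.0.2.x, from the documentation range), ports and passwords below are placeholders, not the actual sample files:

```
scope: my-ha-cluster            # must match on every node
name: node1                     # unique per node

restapi:
  listen: 0.0.0.0:8008          # REST API, also used for health checks
  connect_address: 192.0.2.11:8008

etcd:
  host: 192.0.2.11:2379

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 192.0.2.11:5432
  data_dir: /export/home/pguser/data/node1
  authentication:
    superuser:
      username: postgres
      password: secret
    replication:
      username: replicator
      password: secret
```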

Log in to each node in turn

pfexec zlogin -l pguser node1

/usr/versions/python-2.7/bin/patroni ${HOME}/node1.yaml

And it initializes a cluster, with just the one node for now; that node will start off as the master.

Now the 2nd node

pfexec zlogin -l pguser node2

/usr/versions/python-2.7/bin/patroni ${HOME}/node2.yaml

And it sets it up as a secondary, replicating from node1.

What do things look like right now? You can check that with:

/usr/versions/python-2.7/bin/patronictl \
  -d etcd:// \
  list my-ha-cluster

Now the third node:

pfexec zlogin -l pguser node3

/usr/versions/python-2.7/bin/patroni ${HOME}/node3.yaml

You can force a failover by killing (or ^C) the patroni process on the master, which should be node1. You'll see one of the replicas coming up as master, and replication on the other replica change to use the new master. One thing I did notice is that patroni initiates the failover process pretty much instantly, whereas stolon waits a few seconds to be sure.

You can initiate a planned failover too:

/usr/versions/python-2.7/bin/patronictl \
  -d etcd:// \
  failover my-ha-cluster

It will ask you for the new master node, and for confirmation, and then you'll have a new master.

But you're not done yet. There's nothing to connect to. For that, patroni doesn't supply its own component (as stolon does with its proxy) but depends on a haproxy instance. The overlay install we used when creating the zone will have made sure that haproxy is installed in each zone; all we have to do is configure and start it.

zlogin to each node, as root, and

wget -O /etc/haproxy.cfg
svcadm enable haproxy

You don't have to set up the haproxy stats page, but it's a convenient way to see what's going on. If you go to the stats page

Then you can see that it's got the active backend up and the secondaries marked as down - haproxy is checking the patroni REST api which is only showing the active postgres instance as up, so haproxy will route all connections through to the master. And, if you migrate the master, haproxy will follow it.
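The configuration fetched above will be something along these lines; the addresses and ports are placeholders (patroni's REST API defaults to 8008), so treat this as a sketch rather than the exact file:

```
defaults
    mode tcp
    timeout connect 4s
    timeout client 30m
    timeout server 30m

listen stats                  # optional stats page mentioned above
    mode http
    bind *:7000
    stats enable
    stats uri /

listen postgres
    bind *:5000
    option httpchk            # patroni returns 200 only on the master
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server node1 192.0.2.11:5432 maxconn 100 check port 8008
    server node2 192.0.2.12:5432 maxconn 100 check port 8008
    server node3 192.0.2.13:5432 maxconn 100 check port 8008
```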

Which to choose? That's always a matter of opinion, and to be honest while there are a few differences, they're pretty much even.
  • stolon is in go, and comes as a tiny number of standalone binaries, which makes it easier to define how it's packaged up
  • patroni is in python, so needs python and a whole bunch of modules as dependencies, which makes deployment harder (which is why I created an overlay - there are 32 packages in there, over 2 dozen python modules)
  • stolon has its own proxy, rather than relying on a 3rd-party component like haproxy
As a distro maintainer, it doesn't make much difference - dealing with those differences and dependencies is part and parcel of daily life. For standalone use, I think I would probably tend towards stolon, simply because of the much smaller packaging effort.

(It's not that stolon necessarily has fewer dependencies, but remember that in go these are all resolved at build time rather than runtime.)

HA PostgreSQL on Tribblix with stolon The Trouble with Tribbles...

I wrote about setting up postgres replication, and noted there that while it did what it said it did - ensured that your data was safely sent off to another system - it wasn't a complete HA solution, requiring additional steps to actually make any use of the hot standby.

What I'm going to describe here is one way to create a fully-automatic HA configuration, using stolon. There's a longer article about stolon, roughly explaining the motivations behind the project.

Stolon uses etcd (or similar) as a reliable, distributed configuration store. So this article follows on directly from setting up an etcd cluster - I'm going to use the same zones, the same names, the same IP addresses, so you will need to have got the etcd cluster running as described there first.

We start off by logging in to each zone using zlogin (with pfexec if you set your account up as the zone administrator when creating the zone):

pfexec zlogin node1 (and node2 and node3)

Followed by installing stolon and postgres on each node, and creating an account for them to use:

zap refresh
zap install TRIBblix-postgres11 TRIBblix-stolon TRIBtext-locale
useradd -u 11799 -g staff -s /bin/bash -d /export/home/pguser pguser
passwd -N pguser
mkdir -p /export/home/pguser
chown -hR pguser /export/home/pguser

In all the following commands I'm assuming you have set your PATH correctly so it contains the postgres and stolon executables. Either add /opt/tribblix/postgres11/bin and /opt/tribblix/stolon/bin to the PATH, or prefix the commands with

env PATH=/opt/tribblix/postgres11/bin:/opt/tribblix/stolon/bin:$PATH

Log in to the first node as pguser.

pfexec zlogin -l pguser node1

Configure the cluster (do this just the once):

stolonctl --cluster-name stolon-cluster \
  --store-backend=etcdv3 init

It's saving the metadata to etcd. Although it's just a single key to mark the stolon cluster as existing at this point.

Now we need a sentinel.

stolon-sentinel --cluster-name stolon-cluster \

It complains that there are no keepers, so zlogin to node1 in another window and start one of those up too:

stolon-keeper --cluster-name stolon-cluster \
  --store-backend=etcdv3 \
  --uid postgres0 --data-dir data/postgres0 \
  --pg-su-password=fancy1 \
  --pg-repl-username=repluser \
  --pg-repl-password=replpassword \

After a little while, a postgres instance appears. Cool!

Note that you have to explicitly specify the listen address. That's also the address that other parts of the cluster use, so you can't use "localhost" or '*', you have to use the actual address.

You also specify the postgres superuser password, and the account for replication and its password. Obviously these ought to be the same for all the nodes in the cluster, so they can all talk to each other successfully.

And now we can add a proxy, after another zlogin to node1:

stolon-proxy --cluster-name stolon-cluster \
  --store-backend=etcdv3 --port 25432

If now you point your client (such as psql) at port 25432 you can talk to the database through the proxy.

Just having one node doesn't meet our desire to build a HA cluster, so let's add some more nodes.

Right, go to the second node,

pfexec zlogin -l pguser node2

and add a sentinel and keeper there:

stolon-sentinel --cluster-name stolon-cluster \

stolon-keeper --cluster-name stolon-cluster \
  --store-backend=etcdv3 \
  --uid postgres1 --data-dir data/postgres1 \
  --pg-su-password=fancy1 \
  --pg-repl-username=repluser \
  --pg-repl-password=replpassword \

What you'll then see happening on the second node is that stolon will automatically set the new postgres instance up as a replica of the first one (it assumes the first one you run is the master).

Then set up the third node:

pfexec zlogin -l pguser node3

with another sentinel and keeper:

stolon-sentinel --cluster-name stolon-cluster \

stolon-keeper --cluster-name stolon-cluster \
  --store-backend=etcdv3 \
  --uid postgres2 --data-dir data/postgres2 \
  --pg-su-password=fancy1 \
  --pg-repl-username=repluser \
  --pg-repl-password=replpassword \

You can also run a proxy on the second and third nodes (or on any other node you might wish to use, come to that). Stolon will configure the proxy for you so that it's always connecting to the master.

At this point you can play around, create a table, insert some data.

And you can test failover. This is the real meat of the problem.

Kill the master (^C its keeper). It takes a while, because it wants to be sure there's actually a problem before taking action, but what you'll see is one of the slaves being promoted to master. And if you run psql against the proxies, they'll send your queries off to the new master. Everything works as it should.

Even better, if you restart the old failed master (as in, restart its keeper), then it successfully sets the old master up as a slave. No split-brain, you get your redundancy back.

I tried this a few more times, killing the new master aka the original slave, and it fails across again.

I'm actually mighty impressed with stolon.

Setting up an etcd cluster on Tribblix The Trouble with Tribbles...

Using etcd to store configuration data is a common pattern, so how might you set up an etcd cluster on Tribblix?

I'll start by creating 3 zones to create a 3-node cluster. For testing these could all be on the same physical system, for production you would obviously want them on separate machines.

As root:

zap refresh

zap create-zone -z node1 -t whole -o base -x

zap create-zone -z node2 -t whole -o base -x

zap create-zone -z node3 -t whole -o base -x

If you add the -U flag with your own username then you'll be able to use zlogin via pfexec from your own account, rather than always running it as root (in other words, subsequent invocations of zlogin could be pfexec zlogin.)

Then zlogin to node1 (and node2 and node3) to install etcd, and create
a user to run the service.

zlogin node1

zap install TRIBblix-etcd
useradd -u 11798 -g staff -s /bin/bash -d /export/home/etcd etcd
passwd -N etcd
mkdir -p /export/home/etcd
chown -hR etcd /export/home/etcd

I'm going to use static initialization to create the cluster. See the
clustering documentation.

You need to give each node a name (I'm going to use the zone name) and the cluster a name, here I'll use pg-cluster-1 as I'm going to use it for some PostgreSQL clustering tests. Then you need to specify the URLs that will be used by this node, and the list of URLs used by the cluster as a whole - which means all 3 machines. For this testing I'm going to use unencrypted connections between the nodes, in practice you would want to run everything over ssl.

zlogin -l etcd node1

/opt/tribblix/etcd/bin/etcd \
  --name node1 \
  --initial-advertise-peer-urls \
  --listen-peer-urls \
  --listen-client-urls, \
  --advertise-client-urls \
  --initial-cluster-token pg-cluster-1 \
  --initial-cluster node1=,node2=,node3= \
  --initial-cluster-state new
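As a concrete sketch, with placeholder addresses from the documentation range (192.0.2.11–13; substitute your own) and etcd's default ports (2380 for peers, 2379 for clients), a full node1 invocation looks like:

```
/opt/tribblix/etcd/bin/etcd \
  --name node1 \
  --initial-advertise-peer-urls http://192.0.2.11:2380 \
  --listen-peer-urls http://192.0.2.11:2380 \
  --listen-client-urls http://192.0.2.11:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://192.0.2.11:2379 \
  --initial-cluster-token pg-cluster-1 \
  --initial-cluster node1=http://192.0.2.11:2380,node2=http://192.0.2.12:2380,node3=http://192.0.2.13:2380 \
  --initial-cluster-state new
```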

The same again for node2, with the same cluster list, but its own name and URLs:

zlogin -l etcd node2

/opt/tribblix/etcd/bin/etcd \
  --name node2 \
  --initial-advertise-peer-urls \
  --listen-peer-urls \
  --listen-client-urls, \
  --advertise-client-urls \
  --initial-cluster-token pg-cluster-1 \
  --initial-cluster node1=,node2=,node3= \
  --initial-cluster-state new

And for node3:

zlogin -l etcd node3

/opt/tribblix/etcd/bin/etcd \
  --name node3 \
  --initial-advertise-peer-urls \
  --listen-peer-urls \
  --listen-client-urls, \
  --advertise-client-urls \
  --initial-cluster-token pg-cluster-1 \
  --initial-cluster node1=,node2=,node3= \
  --initial-cluster-state new

OK, that gives you a 3-node cluster. Initially you'll see complaints about being unable to connect to the other nodes, but it will settle down once they've all started.

And that's basically it. I think in an ideal world this would be an SMF service, with svccfg properties defining the cluster. Something I ought to implement for Tribblix at some point.

One useful tip, while discussing etcd. How do you see what's been stored in etcd? Obviously if you know what the keys in use are, you can just look them up, but if you just want to poke around you don't know what to look for. Also, etcdctl ls has been removed, which is how we used to do it. So to simply list all the keys:

etcdctl get "" --prefix --keys-only

There you have it.

Setting up replicated PostgreSQL on Tribblix The Trouble with Tribbles...

When you're building systems, it's nice to build in some level of resilience. After all, failures will happen.

So, we use PostgreSQL quite a bit. We actually use a fairly traditional replication setup - the whole of the data is pushed using zfs send and receive to a second system. Problem at the source? We just turn on the DR site, and we're done.

One of the reasons for that fairly traditional approach is that PostgreSQL has, historically, not had much built in support for replication. Or, at least, not in a simple and straightforward manner. But it's getting a lot better.

Many of the guides you'll find are rather dated, and show old, rather clunky, and quite laborious ways to set up replication. With current versions of PostgreSQL it's actually pretty trivial to get streaming replication running, so here's how to demo it if you're using Tribblix.

First set up a couple of zones. The idea is that pg1 is the master, pg2 the replica. Run, as root:

zap create-zone -z pg1 -t whole -o base -x -U ptribble

zap create-zone -z pg2 -t whole -o base -x -U ptribble

This simply creates a fairly basic zone, without much in the way of extraneous software installed. Adjust the IP addresses to suit, of course. And I've set them up so that I can use zlogin from my own account.

Then login to each zone, install postgres, and create a user.

zlogin pg1
zap install TRIBblix-postgres11 TRIBtext-locale
useradd -u 11799 -g staff -s /bin/bash -d /export/home/pguser pguser
passwd -N pguser
mkdir /export/home/pguser
chown -hR pguser /export/home/pguser

And the same for pg2.

Then log in to the master as pguser.

zlogin -l pguser pg1

Now initialise a database, and start it up:

env LANG=en_GB.UTF-8 /opt/tribblix/postgres11/bin/initdb -E UTF8 -D ~/db
env LANG=en_GB.UTF-8 /opt/tribblix/postgres11/bin/postgres -D ~/db

The next thing to do is create a PostgreSQL user that will run the streaming replication.

/opt/tribblix/postgres11/bin/psql -d postgres
set password_encryption = 'scram-sha-256';
\password replicate
Enter new password: my_secret_password

Then you need to edit postgresql.conf (in the db directory) with the following settings:

listen_addresses = '*'
wal_level = replica
max_wal_senders = 3 # or whatever
wal_keep_segments = 64 # or whatever
hot_standby = on

And set up authentication so that the user you just created can actually access the database remotely, by adding the following line to pg_hba.conf

host   replication   replicate    scram-sha-256

Ideally we would use hostssl so the connection is encrypted, but that's out of scope for this example.

Then restart the master.

Now log in to the slave.

zlogin -l pguser pg2

And all you have to do to replicate the data is run pg_basebackup:

/opt/tribblix/postgres11/bin/pg_basebackup \
  -h -c fast -D ~/db -R -P \
  -U replicate --wal-method=stream

It will prompt you for the super-secret password you entered earlier. Once that's completed you can start the slave:

env LANG=en_GB.UTF-8 /opt/tribblix/postgres11/bin/postgres -D ~/db

And that's it. Data you insert on the master will be written to the slave. In this mode, you can connect to the replica and issue queries to read the data, but you can't change anything.

Note that this doesn't do anything like manage failover or send client connections to the right place. It just makes sure the data is available. You need an extra layer of management to manage the master and slave(s) to actually get HA.

There are a whole variety of ways to do that, but the simplest way to test it is to stop the database on the master, which will make the slave unhappy. But if you know the master isn't coming back, you can promote the replica:

/opt/tribblix/postgres11/bin/pg_ctl promote -D ~/db

and it will spring into action, and you can point all your clients at it. The old master is now useless to you; the only course of action you have at this point is to wipe it, and then create a new replica based on the new master.

A Return to Form mirrorshades

Configs and dotfiles

Before we get into anything else, here’s a repo with my configs and dotfiles for running FreeBSD as a workstation.

A lot of the good stuff was cribbed from Jessie Frazelle’s dotfiles repo, or from her post on running Linux on a MacBook. Check it out.


I’ve been using Mac OS X since the Public Beta, though not as a daily driver for a few years after it was initially released.

Pismo, PowerMac, PowerBook (12" albook!), MacBook, iMac… It’s been a long road, really, and until the last few years I’ve been very happy.

Before OS X, I used Linux or OpenBSD as a desktop (and before that Windows 3.1), and was generally pretty happy for many years, but I did spend a lot of time screwing around with X11 configs and dist-upgrade breaking my world, and all the usual stuff you might remember from 2003 or 2004.

Around then OS X started getting pretty usable, especially given my relatively simple professional needs:

  • A terminal
  • OpenSSH

Really, that’s it, for work.

Obviously a web browser, and a media player were nice to have, but as a sysadmin most of my day is spent at a shell prompt on some other machine. Very little of the development work I do is on my local machine, but is instead housed in a zone on some compute node in some datacenter somewhere.

So for a long time OS X was good. Each release was faster on my existing hardware (no, really!) and less buggy, and sometimes had useful features I cared about.

Mostly I didn’t use the new features, though, and as time went on the number of features, and of things it was doing that I’d prefer it not do, skyrocketed.

I used Terminal (and later iTerm2) and ssh. Safari got slow so I switched to Firefox, and then Chrome. All my media is in iTunes now, and yeah, it’s kind of awful, but I don’t spend a lot of time actually looking at it.

Circa OS X 10.6, I was really happy with OS X. It was fast, stable, and never got in my way.

Around OS X 10.9, though, things started going wrong. 10.10 improved a few of these things, but overall it just kept degrading. It’s slower, and there are a lot of really distracting “features” I can’t seem to reliably disable. It’s tied into my phone, and my wife’s phone, so when she adds events I get duplicate notifications (“deliver once” being a fallacy, I suppose), disrupting me from my work. I disable this, but …

It harasses me every day to upgrade. It desperately wants to just upgrade whenever it wants. More and more it acts like the Windows machines I’ve had to support over the last 20 years, which is deeply frustrating.

It regularly does things in the background without asking, consuming all my bandwidth (again: most of my work is remote, so I’m particularly sensitive to latency.)

And yeah: I’ve disabled all these things. They keep getting re-enabled, and so it’s not hard to take the hint.

Periodically when I try to log into OS X, it will just hang on me, which is sort of beyond the pale.

Upgrades have gone sideways because there are files in /usr/local of all things. And now SIP in 10.11 breaks all sorts of stuff.

I’ve just been fed up for a while. I recognize that mostly it’s because the thing that was working so well for me has now moved on – it’s this digital life hub thing, instead of a nice UX that let me run ssh reliably. So maybe it’s not you, it’s not me, it’s just us. And that’s ok.


OS X has become an obnoxious puppy (“I WANTED TO LOVE YOU SO MUCH I PEED ALL OVER THE FLOOR”), and as I’ve grown older, I’ve become more of a cat person (“…you do your thing, I’ll do mine. Cool? Cool.”)

It was more about the constant distractions than the cost of maintenance.

So What Then

I spent some time defining the problem and how I might solve it.

The primary issue was I spent too much time telling OS X to shut up and leave me alone. Some of this is me and how poorly my brain operates these days – I’ve had 3 kids in the last 5 years. I don’t like to play the twins card, but having twins does something to your brain. The doubled up sleep deprivation and long, long periods of stress altered me in non-trivial ways.

Time is precious and my mental state is constantly fragile. Focus is incredibly difficult to achieve and impossible to maintain if something distracts me.

I needed something that wouldn’t bother me, or consume my time pointlessly. One way or another I was going to have to build this thing. Something off the shelf like what OS X 10.6 was for me was unlikely to simply exist.

I tried running various Linux distributions on my MacBook, but discovered that everything I hate about managing Linux on a server platform is in fact amplified in a desktop context. It was less bad than I remembered from 10 years ago, but it was still a poor comparison to when OS X was good – again, for my requirements.

I tried to think of an extreme that wasn’t too extreme – something that was minimalistic, but not egregiously so (like just rocking a console and screen and no GUI at all.) Something that had features and technologies I cared about, without wandering off into the weeds periodically to bring me back a decapitated bird.

OpenBSD was my first choice. I’ve used it for many many years in firewalling or routing contexts, I used to use it as a desktop – but the upgrade process sort of killed it for me. I didn’t want to deal with patching and recompiling my OS, or remembering to look at a web page for errata (or writing a script to do that for me.)

I’ve used FreeBSD for projects in the past (circa FreeBSD 5.x) but not in years. However, they’ve integrated ZFS and DTrace from illumos (which is where I spend the vast majority of my time, logged into things), pf from OpenBSD, and they have both binary OS updates and packages.

If you poke around the FreeBSD site you’ll notice two things which I also found to be extremely commendable: The FreeBSD Handbook is an amazing piece of documentation. It’s far shorter than the Solaris System Admin Guides I knew and loved when I was getting started with Solaris in 2005, but it’s a really well crafted piece of work.

The other is the FreeBSD Code of Conduct. CoCs are (rightfully!) becoming more popular with projects and conferences, and I don’t know when they added theirs – but it’s well thought out, and I appreciate the effort there.

The Workstation

I didn’t want to install FreeBSD on my iMac. My wife uses it for photo and video editing, my music is all tied to OS X, sometimes I have an hour to play the latest Shadowrun game. Dual booting would just be a pain.

I decided to build a computer for home use. I hadn’t done this in at least ten years – I’ve built plenty of servers in that time, but at home I’d been happy to just have my Macs.

I was careful to find hardware that was slightly older, and well supported. I didn’t want to end up with some device that was generally okay but periodically got flaky. I spent a lot of time reading forums, driver man pages, and so on. In retrospect I probably didn’t need to do this much due diligence, but I was being paranoid about the time I might have to spend later fighting with the fallout of a poor purchasing choice.

I ended up purchasing:

So… for what I need, a quad-core box with 32GB RAM is pretty ridiculous. But I’m hoping I won’t have to do this again for quite a few years. The case has plenty of room for disks, if that becomes a thing I need over time, as well. (Because, ZFS.)

(I include the CPU fan because it made me laugh while I was installing it. So ridiculous.)

My 4 (and a half she’d be quick to point out at the time) year old and I put it together one night after her sisters went to bed. She sat with me through the installation and I set up her first user account for her. For posterity, kid:

    12:37:43 gaea:~$ uname
12:37:47 gaea:~$ id nora
uid=1002(nora) gid=1002(nora) groups=1002(nora)

FreeBSD Installation

I had done testing in VMs before I got the hardware, so I knew what to expect. And honestly – there’s not much to say here. I hit enter a bunch of times and then I had a system with ZFS on root, running FreeBSD. I was really pleased with how simple this was.

Both FreeBSD and OpenBSD have made huge strides in installation.

FreeBSD Packaging

pkg(8) is a nice tool. The ports tree is massive, and I haven’t found any software I needed that didn’t exist there.

In a server context I’d have the same problems I do with any other distro packaging setup: missing compile-time options on some things, etc. But for home use it’s been very smooth sailing.

FreeBSD Upgrading

The freebsd-update(8) tool just works. Sometimes it works so well I wonder if it actually did anything. I walked releases from 10.1-RELEASE, through security patches, up to 10.2’s most recent patch level. Two commands each time, and zero problems with any of it.

    freebsd-update fetch
freebsd-update install

As A Desktop

There is a distribution of FreeBSD called PC-BSD that’s targeted at desktops. However, given my goal was a minimal, stay-out-of-my-way sort of environment, I didn’t want a bunch of user-friendly add-ons trying to help me out. I wanted a simple environment that would help me focus.

This guide at was absolutely perfect for me. I had zero problems following it, and it got me a working GUI in under a half hour. The big kicker with X11 for me has always been fonts: They’ve historically been a pain to manage and setup. Either something has changed, or my threshold has moved, because with the exception of Google in Firefox I’m pretty happy with how things look.

The next decision I had to make required some experimentation: which Desktop Environment to use? After poking at KDE, GNOME, LXDE, WindowMaker (yup), and Enlightenment 17 (less said the better, and I ran E16 for ages back in the day), someone on Twitter (I can’t remember who, now!) mentioned i3.

Back around 2003 I had co-workers and friends who were using ion and ratpoison, but I never really got tiling window managers for whatever reason.

With i3, it clicked immediately. Loved it.

I stole most of Jessie Frazelle’s dotfiles to make i3 and urxvt nice to look at.

(Initially I was using a terminal emulator called sakura, and it was nice enough. After I pulled in Jessie’s dotfiles I figured I may as well just use urxvt like everyone else.)

I spent a few evenings getting things set up how I wanted. With a new OS X install there was very little I’d do beyond customize my shell environment and change the window highlight from Aqua to Grey. (And enable FileVault, firewall, etc.) Spending time on configuring i3 didn’t feel like a waste of time, though. It was iterative (as you can reload the configuration live) and the effects on my workflow felt immediate.

I really like i3.

This was back in April, and I’ve been happily using this setup for the last 7 months. I switch between the FreeBSD box and my iMac, depending on what I’m doing, but when I need to focus I’m in i3wm.

On a Laptop

(My notes.)

After several months, I decided to try it on a laptop. I love the MacBook chassis, but the OS is making me nuts. FreeBSD doesn’t support the wireless NIC in mine yet, and of course I can’t swap it out.

A few hours of research later (looking at modern netbooks that appeared to be compatible) I decided to get a Thinkpad X220 based on a thread on misc@openbsd.

(Someone on that thread coincidentally mentioned that X220s could take 16GB RAM, regardless of the documented max being 8GB.)

Looking at used gear on eBay, I decided to get something with as little RAM as possible but the fastest CPU I could find. I ordered the laptop from eBay, and a 120GB Intel SSD and two sticks of RAM from Amazon.

I haven’t used a non-Apple laptop in over a decade. Simply speaking from a physical, mechanical perspective, they’re amazing machines. Incredibly well-engineered. They don’t feel flimsy. I’d grown to like the chiclet keyboard, and never felt like the touchpad got confused or would activate incorrectly.

The touchpad on the X220 is pretty crap. The nub mouse is going to take some time to get used to. The keyboard is nice, though I keep hitting the nub instead of the B key.

xset b off and hw.syscons.bell=0 took me a few minutes to remember, but oof. So necessary. The bell on this thing is like someone taking a ball peen hammer to a piece of tin inside your skull.
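A sketch of making both settings persistent, assuming the usual file locations (and root for the sysctl change):

```shell
# X11 bell off for every X session (per-user):
echo 'xset b off' >> ~/.xinitrc

# Console bell off at boot (needs root):
echo 'hw.syscons.bell=0' >> /etc/sysctl.conf
```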


I couldn’t get ZFS on root to work with the X220’s buggy UEFI firmware. I tried various things (legacy mode, MBR, manually writing zfsboot after install, a FreeBSD 11 snapshot – which potentially has a fix for this, but the snap I tried was a bit buggy, and I don’t want to be running HEAD anyway; it goes against the “don’t break” requirement).

Finally, I split the SSD into 40GB UFS for the OS and another partition for a zpool.

In reality this doesn’t affect me much. ZFS on root is nice to have, but I’ve been living without it on OS X, and I can live without it on this laptop for a while longer.

Beyond that, the install worked just fine.


Wifi support was the big reason I chose the X220, so I was a bit frustrated when the system kept reassociating with the network. After an hour or so of debugging and googling, I found a post that described the problem. The workaround was changing ifconfig_wlan0 in /etc/rc.conf to

    ifconfig_wlan0="-ht WPA DHCP"

-ht disables 802.11n (see ifconfig(8).)

Once that was in place, everything worked great.
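For completeness, the surrounding /etc/rc.conf context looks something like this. The iwn(4) driver name is my assumption for the X220’s Intel wireless; adjust for your hardware:

```shell
# /etc/rc.conf (fragment): create wlan0 on top of the Intel NIC
# (iwn(4) assumed) and associate with 802.11n disabled:
wlans_iwn0="wlan0"
ifconfig_wlan0="-ht WPA DHCP"
```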

Sleeping and Locking

FreeBSD won’t sleep the laptop if you close the lid. Kind of a big deal.

I felt sort of silly while I was writing it – because I’ve been using Macs for so long – but I wrote an i3 keybind to call i3lock and then acpiconf -s 3. It’s similar to my keybind for just calling i3lock, so I imagine I’ll sometimes screw it up, but it’s easy for me to remember either way:

    bindsym Control+Mod1+l exec i3lock -c 111111
bindsym Shift+Mod1+l exec i3lock -c 111111 && sudo acpiconf -s 3

(Requires sudo be configured for that command with NOPASSWD.)
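The sudoers entry is a one-liner. A sketch, assuming sudo from ports and my username; edit with visudo and substitute your own:

```shell
# Fragment for /usr/local/etc/sudoers: allow suspending without a password.
bdha ALL=(root) NOPASSWD: /usr/sbin/acpiconf -s 3
```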

Hardware Upgrade

Swapping out the disk and RAM in the machine was trivial. You can add an mSATA drive if you want to mirror or get extra storage but for my use cases (basically a dumb terminal with all data I care about elsewhere) I decided not to bother.


Both on the laptop and desktop, FreeBSD swims. To be fair, both machines have a ridiculous amount of RAM for what I’m doing (16GB and 32GB respectively; I mean c'mon), and an SSD – and…

    13:17:22 gaea:~$ ps -U bdha | grep -v ssh | grep -v bash
 986  -  Ss   0:00.39 /usr/local/bin/i3
 993  -  S    0:00.88 i3bar --bar_id=bar-0 --socket=/tmp/i3-bdha.TvyJg3/ipc-socket.986
 995  -  S    0:00.06 i3status --config /home/bdha/.i3/status.conf
1000  -  S    0:00.77 urxvt -ls
1061  -  S    0:00.63 urxvt -ls
2847  -  S    0:07.91 urxvt -ls
2896  -  S    0:00.32 urxvt -ls
2964  -  I    0:00.00 /usr/local/bin/dbus-launch --autolaunch 2adb3a5ba60595820f094822554267df --binary-syntax --close-stderr
2965  -  Ss   0:00.00 /usr/local/bin/dbus-daemon --fork --print-pid 5 --print-address 7 --session
3053  -  S    0:00.45 urxvt -ls
3350  -  S    0:00.00 /usr/local/bin/xclip -in -selection clipboard
3356  -  S    0:00.68 firefox
3022  4  I+   0:07.98 vim freebsd_osx_migration.txt
3359  6  R+   0:00.00 ps -U bdha

So it’s hardly fair to compare it to anything else… but realtalk. It’s fast.

It’ll remain fast.

The Problems

The biggest one I have is password sharing. All of my personal passwords are in 1Password. I’ve copied a bunch of them over to KeePassX (which is what we use at work and on illumos infra), but syncing is definitely a pain point. I could stop using 1p, but honestly – it’s so convenient I’m used to not having to jump through an extra hoop to log into something.

Copy and paste is still a bit frustrating. Sometimes urxvt will get confused; it won’t select things properly. I’m not happy with the hotkey I have set up for it. I imagine I’ll put a bit more time into figuring out what’s up here, because copying chunks of text shouldn’t be something I ever have to think about.

On the laptop, I sometimes forget that I have actual pgup/pgdown keys. Will have to get used to that. :-)


This was a long post, given my requirements were “doesn’t break” but also “doesn’t waste my time.” It seems a bit odd that I spent so much time on customizing configuration here, like I would have done in (forgive me) my youth.

However, for me, this was a one-time cost and I’m getting a lot out of it.

I am less frustrated and more focused working on this setup. A big chunk of that is that, even setting aside the constant popups in OS X, there’s simply less to be distracted by.

I’ve gone so far as to literally switch a cable to move between machines (as opposed to using a KVM), to help train my brain into a different context.

Overall I’m quite happy with the choices I made here.

FreeBSD and i3wm are simple (in the best ways), fast, reliable, and most importantly for me – non-invasive.

Installing Chef on Joyent's SmartOS mirrorshades

This is effectively the same procedure as installing Chef on Solaris with pkgsrc or OpenSolaris/OpenIndiana with IPS. We’ll be using Joyent’s provided pkgsrc setup and pkgin installer.

If you’re using SmartOS on the Joyent Cloud, you’ll have pkgin already available. If you’re running SmartOS yourself, you’ll need to install it.

@benjaminws threw up a bootstrap template for the following as well!

The bits:

pkgin install gcc-compiler gcc-runtime gcc-tools-0 ruby19 scmgit-base scmgit-docs gmake sun-jdk6

tar -xzf rubygems-1.8.10.tgz
cd rubygems-1.8.10
ruby setup.rb --no-format-executable

gem install --no-ri --no-rdoc chef

mkdir /etc/chef

cat <<EOF >> /etc/chef/client.rb
log_level        :info
log_location     STDOUT
chef_server_url  ""
validation_client_name "chef-validator"
node_name ""
EOF

Drop your validation.pem in /etc/chef, and then run chef-client.

Booting OpenIndiana on Amazon EC2 mirrorshades

Since OpenSolaris was axed, we haven’t had an option for running a Solaris-based distribution on EC2. Thanks to the hard work of Andrzej Szeszo, commissioned by Nexenta, now we do.

This should be considered a proof of concept, and is perhaps not ready for production. Use at your own risk!

Spin up ami-4a0df023 as m1.large or bigger. This is an install of OpenIndiana oi_147.


The image doesn’t currently import your EC2 keypairs, so you’ll need to log in with a password. root is a role in this image, so you’ll need to log in as the oi user.

The oi user’s password is “oi”.

    # ssh oi@
The authenticity of host ' (' can't be established.
RSA key fingerprint is da:b9:0e:73:20:81:4f:a2:a7:91:0d:7d:3c:4b:cb:80.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '' (RSA) to the list of known hosts.
Last login: Sun Apr 17 06:54:24 2011 from domu-3-4-5-6
OpenIndiana     SunOS 5.11      oi_147  September 2010
oi@openindiana:~$ pfexec su -
OpenIndiana     SunOS 5.11      oi_147  September 2010

To make root a real user again, use the following commands:

    $ pfexec rolemod -K type=normal root
$ pfexec perl -pi -e 's/PermitRootLogin no/PermitRootLogin yes/' /etc/ssh/sshd_config
$ pfexec svcadm restart ssh

You can now log in as root as you’d expect. This behavior is changing all over the place (including Solaris 11 proper), but I don’t mind being a dinosaur.


There are some limitations, however.

Boot Environments

  • You have no console access
  • If an upgrade fails, you can’t choose a working BE from grub
  • For the same reason, you won’t be able to boot failsafe

Networking

  • You won’t be able to pull an IP from EC2 for your zones
  • Only one Elastic IP can be assigned to an instance, so you won’t be able to forward a public IP to an internal zone IP
  • You’ll be able to forward ports from the global zone to zones, of course, but this is less useful than zones having unique IPs associated with them

Devices

  • There is a bug in devfsadmd which doesn’t like device IDs over 23. I describe how to deal with this below.

Boot Volumes

There are two EBS volumes assigned with this AMI. The 8GB one is the root pool device. The 1GB one is where the boot loader (pv-grub) lives.

Triskelios joked earlier that if your instance got into a hosed state, you could mount the 1GB volume elsewhere, munge your grub config, then assign it back to the busted instance. This theoretically gets around not having console access. It’s also hilarious. But could work.

Upgrading to oi_148

oi_147 has some known bugs, so we want to get up to oi_148. You could also update to the OpenIndiana illumos build (dev-il), but we’ll stick with 148 for now.

The old publisher is still available, as there is software on it not available in OpenIndiana’s repo. However, we need to set that publisher non-sticky so it doesn’t hold back package upgrades. If you don’t set the repo non-sticky, you won’t get a complete upgrade. You’ll be running a 148 kernel, but lots of 147 packages. One symptom of this is zones won’t install.

    root@openindiana:~# pkg publisher
PUBLISHER                             TYPE     STATUS   URI          (preferred)  origin   online                       origin   online

root@openindiana:~# pkg set-publisher --non-sticky

root@openindiana:~# pkg publisher
PUBLISHER                             TYPE     STATUS   URI          (preferred)  origin   online          (non-sticky) origin   online

Once that’s done, we update the current image.

    root@openindiana:~# pkg image-update
                Packages to remove:     4
               Packages to install:     7
                Packages to update:   531
           Create boot environment:   Yes
DOWNLOAD                                  PKGS       FILES    XFER (MB)
Completed                              542/542 11254/11254  225.5/225.5

PHASE                                        ACTIONS
Removal Phase                              1882/1882
Install Phase                              2382/2382
Update Phase                             19953/19953

PHASE                                          ITEMS
Package State Update Phase                 1073/1073
Package Cache Update Phase                   535/535
Image State Update Phase                         2/2

A clone of openindiana exists and has been updated and activated.
On the next boot the Boot Environment openindiana-1 will be mounted on '/'.
Reboot when ready to switch to this updated BE.

NOTE: Please review release notes posted at:

A new BE has been created for us, and is slated to be active on reboot.

    root@openindiana:~# beadm list
BE            Active Mountpoint Space Policy Created      
--            ------ ---------- ----- ------ -------      
openindiana   N      /          94.0K static 2011-04-04 23:00
openindiana-1 R      -          3.90G static 2011-04-17 07:20

root@openindiana:~# reboot

Once the instance comes back up, log in:

    # ssh oi@
Last login: Sun Apr 17 06:54:25 2011 from domu-
OpenIndiana     SunOS 5.11      oi_148  November 2010
oi@openindiana:~$ pfexec su -
OpenIndiana     SunOS 5.11      oi_148  November 2010

As you can see, we’re running oi_148.

To be sure the upgrade is happy and we don’t have any sticky 147 bits left:

    root@openindiana:~# pkg list | grep 147


Adding EBS Volumes

Create some EBS volumes and attach them to the instance. You’ll need to specify a Linux-style device path for the volume. The older OpenSolaris AMI required a numeric device, as Solaris expects device IDs to be 0..23. This either seems to have been broken at some point in the last year, or doesn’t work with pv-grub. Regardless, we can work around it.

    $ ./aws create-volume --size 128 --zone us-east-1b
$ ./aws create-volume --size 128 --zone us-east-1b
$ ./aws create-volume --size 128 --zone us-east-1b
$ ./aws create-volume --size 128 --zone us-east-1b

$ ./aws attach-volume vol-vvvvvvvv -i i-iiiiiiii -d /dev/sdc
$ ./aws attach-volume vol-vvvvvvvv -i i-iiiiiiii -d /dev/sdd
$ ./aws attach-volume vol-vvvvvvvv -i i-iiiiiiii -d /dev/sde
$ ./aws attach-volume vol-vvvvvvvv -i i-iiiiiiii -d /dev/sdf

Once the volumes are available, you’ll see messages like this in dmesg:

    failed to lookup dev name for /xpvd/xdf@2128
disk_link: invalid disk device number (2128)

Which is the devfsadmd bug I mentioned above. Solaris expects device IDs to be 0..23, and devfsadm doesn’t know how to deal with anything higher.

There’s very likely a way to automate this, but I just wrote a stupid script that creates links in /dev/dsk and /dev/rdsk for the devices we’ve attached to the instance. Until the devices have the proper links, you won’t see them in format or iostat. And cfgadm doesn’t work in a Xen guest, so.

The device IDs are consistent, however. The first two disks in the system (the rpool and the pv-grub volumes) are 2048 and 2064. The device IDs increment by 16:

    root@openindiana:~# format < /dev/null
Searching for disks...
Failed to inquiry this logical diskdone

       0. c0t0d0 <drive type unknown>
       1. c0t1d0 <??????HH???X?[??????? cyl 4095 alt 0 hd 128 sec 32>
Specify disk (enter its number):
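That arithmetic is simple enough to sketch as a tiny helper (the function name is mine, not from the script mentioned above):

```shell
#!/bin/sh
# Map a disk's ordinal position to its xpvd/xdf device number,
# per the observed pattern: first disk = 2048, step = 16.
xdf_id() {
    echo $((2048 + 16 * $1))
}

xdf_id 0   # rpool volume   -> 2048
xdf_id 1   # pv-grub volume -> 2064
xdf_id 5   # sixth disk     -> 2128
```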

So now we link in the new devices:

    root@openindiana:~# ./ c0t2d0 2080
root@openindiana:~# ./ c0t3d0 2096
root@openindiana:~# ./ c0t4d0 2112
root@openindiana:~# ./ c0t5d0 2128

root@openindiana:~# format < /dev/null
Searching for disks...
Failed to inquiry this logical diskFailed to inquiry this logical diskFailed to inquiry this logical diskFailed to inquiry this logical diskFailed to inquiry this logical diskdone

       0. c0t0d0 <drive type unknown>
       1. c0t1d0 <??????HH???X?[??????? cyl 4095 alt 0 hd 128 sec 32>
       2. c0t2d0 <??????HH???X?[??????? cyl 16709 alt 0 hd 255 sec 63>
       3. c0t3d0 <??????HH???X?[??????? cyl 16709 alt 0 hd 255 sec 63>
       4. c0t4d0 <??????HH???X?[??????? cyl 16709 alt 0 hd 255 sec 63>
       5. c0t5d0 <??????HH???X?[??????? cyl 16709 alt 0 hd 255 sec 63>
Specify disk (enter its number):

Create our ZFS pool:

    root@openindiana:~# zpool create tank mirror c0t2d0 c0t3d0 mirror c0t4d0 c0t5d0
root@openindiana:~# zpool list tank
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
tank   254G    80K   254G     0%  1.00x  ONLINE  -

Once the EFI labels have been written to the disks, format stops throwing errors on them, as well:

    root@openindiana:~# format < /dev/null
Searching for disks...
Failed to inquiry this logical diskdone

       0. c0t0d0 <drive type unknown>
       1. c0t1d0 <??????HH???X?[??????? cyl 4095 alt 0 hd 128 sec 32>
       2. c0t2d0 <Unknown-Unknown-0001-128.00GB>
       3. c0t3d0 <Unknown-Unknown-0001-128.00GB>
       4. c0t4d0 <Unknown-Unknown-0001-128.00GB>
       5. c0t5d0 <Unknown-Unknown-0001-128.00GB>
Specify disk (enter its number):

And, just for fun:

    root@openindiana:~# dd if=/dev/urandom of=/tank/random bs=1024 count=204800
204800+0 records in
204800+0 records out
209715200 bytes (210 MB) copied, 9.14382 s, 22.9 MB/s

So now you have ZFS on EBS, with the ability to do anything ZFS can do. Snapshots will be much, much faster than EBS snapshots (though they are not complete copies, and will obviously be lost if your pool is lost, whereas EBS snapshots are complete copies of the volume and can be cloned and mounted out of band), and you can enable compression, dedup (though this would probably be terrifyingly slow on EC2), and so on. The script is available here.

Virtual Networking

This is fodder for another post, but something I’ve done elsewhere is to use Crossbow to create a virtual network with zones and VirtualBox VMs. The global zone runs OpenVPN, giving clients access to these private resources. This model seems perfectly suited to EC2, given the IP assignment limitations noted above. Unfortunately I don’t imagine VirtualBox is an option here, but even just a private network of zones would be extremely useful.

And perhaps someday EC2 will let you assign multiple Elastic IPs to an instance.


While there are still a few bugs to be worked out, this proof of concept AMI does work and is ready to have its tires kicked.

I’m pretty stoked to have Solaris available on EC2. Thanks Andrzej and Nexenta both!

Building python26@pkgsrc on Solaris mirrorshades

Python fails to build its socket and ssl modules. This is fixed in 2.7, but won’t be backported to 2.6.

This recent post to pkg@netbsd details the problem and links to the patch:

Building DBD::mysql with SUNWspro and pkgsrc on Solaris 11 Express mirrorshades

I use resmon, which is a pretty nice system metric aggregator. It relies on the system Perl specifically for Solaris::Kstat, and so you don’t have to install pieces of the CPAN to get it running. Earlier tonight I decided to point its default MySqlStatus module at our MySQL master and ran into a few annoyances.

Historically, getting DBD::mysql installed with the system Perl has proven somewhat painful.

We use the packages from, whose libmysqlclient is not built shared, so you can’t build DBD::mysql against them. I could have installed pkg:/database/mysql-5? but I already have a MySQL install via pkgsrc.

So, simply:

Make sure /opt/SUNWspro/bin/cc is first in your path, and:

    # /bin/perl Makefile.PL --libs="-L/usr/pkg/lib/mysql -R/usr/pkg/lib/mysql -lmysqlclient -lz" --cflags="-I/usr/pkg/include/mysql -I/usr/include -m32"
# make && make install

And huzzah. DBD::mysql.

Perl 5.12.2, Solaris, Sun Studio, -m64, -Dvendorprefix woes. mirrorshades

****UPDATE**** Nick Clark dug into this and determined it’s a bug in Sun Studio 12.1. Use 12.2 to build Perl. If anyone at Oracle wants to buy him some beers, send someone from Sun with them. ****UPDATE****

I spent a fair chunk of yesterday afternoon (between diaper changing, swaddling, swinging, singing, and so forth) debugging a weird problem with Perl 5.12.2 on Solaris.

I had been deploying an updated pkgsrc build with ABI=64 and Sun^WSolaris Studio 12.1 for a new project, and ran into perl@pkgsrc segfaulting on certain modules. Extremely weird. I pulled the source and built that without issue, adding only -Dcc=cc -Accflags='-m64' -Aldflags='-m64' to build it 64bit with Studio.

This particular project requires deploying Perl modules in tiers, and I thought I would use vendor_perl for stuff I want installed by default that may not necessarily need to live in site_perl. As soon as I rebuilt Perl with -Dvendorprefix the same modules started throwing segv at me.

About five hours of rebuilds later (works fine with gcc, Studio and 32bit, 64bit on Linux with gcc, etc), and here’s the bug report.

Having narrowed it down to that, I just decided to use APPLLIB_EXT and site_perl.

Very weird.

Building nginx@pkgsrc on Solaris/sspro w mirrorshades

Hosed by default. See this post.

For amd64, you’ll want to use

CONFIGURE_ENV+= NGX_AUX=" src/os/unix/"

You can also just add that to the Makefile.

The First Law of Systems Administration mirrorshades

This post details an outage I caused this week by making several poor decisions.

Each point contains lessons I have learned over the past 10 years, and in this instance studiously ignored. Things I am typically very careful to avoid doing. My record for not breaking things is actually pretty decent, but when I do break things it tends to occur under the same set of circumstances (I’m tired and in a hurry).

Even with a decade of experience and a process that mitigates failures, I managed to do something really, really dumb.

A couple months ago I attended Surge in Baltimore, a conference whose focus is on scalability and dealing with failures. The best talks came down to “this is how we broke stuff, and this is how we recovered.”

Hopefully illuminating this particular failure will not just help someone else recover from something similar, but remind my fellow sysadmins that sometimes you just need to take a nap.

The First Law

Backups. Never do anything unless you have backups.

Stupidity the First

A few weeks ago I added an OCZ Vertex 2 SSD to a ZFS pool as a write cache. These are low-end devices, with not a great MTBF, but my research suggested they would fit our needs.

The pool in question is configured as an array of mirrors. The system was running Solaris 10 U7, which does not have support for import recovery (-F), import with a missing log device (-m), or removal of log devices.

I had tested the SSD for about a week, burning it in.

The SSD was added without a mirror.

I was quite pleased with myself: The performance increase was obvious and immediate. Good job, me, for making stuff better.

A week after being added to the pool, the SSD died. The exact error from the Solaris SCSI driver was “Device is gone.”

The zpool hung, necessitating a hard system reset. The system came back up, with the SSD being seen as UNAVAIL. We lost whatever writes were in-flight through the ZIL, but given the workload, that was going to be either minor or recoverable.

I made myself a bug to RMA the SSD and order a pair of new ones, and stopped thinking about it, annoyed that a brand new device died after less than a month.

The stupid: Adding a single point of failure to a redundant system.

Bonus stupid: Not more than a month ago I argued on a mailing list that you should always have a mirrored ZIL, regardless of whether or not your system supported import -F or -m. Yup. I ignored my own advice, because I wanted an immediate performance increase.

Extra bonus stupid: Not fixing a problem relating to storage immediately. Sysadmins wrangle data. It’s what we do and when we do it well, it’s why people love us. Leaving a storage system in a hosed, if working, state, is just asking for pain later. Begging for it.

The Second Law

You are not a computer.

Sometimes you are just too tired to work.

Never do anything when your judgement is impaired. In particular, never make major decisions without confirmation when you are overtired (and had, perhaps, just gotten a flu shot). It leads to calamities.

As sysadmins we often have to work on little sleep in non-optimal situations or environments. We sometimes take it as a point of pride that we can do incredibly complex things when we’re barely functional.

At some point you are going to screw yourself, though.

One thing I know about myself: I get really stupid when I’m too tired. If I get woken up at 0300 by a page, I can muscle-memory and squint my way to a fix. If I’ve been up for 14-16 hours and I’ve been getting say, maybe, four hours of sleep a night for the past two months?

I’m going to do something dumb.

Stupidity the Second

I have been upgrading systems to U9 over the last few weeks. The system with the UNAVAIL SSD came up on the rotation. With U9 I’d be able to remove the dead log device. We announced a 30m outage.

And here is where impaired judgement comes in. If the following two thoughts are in your head:

  • I am exhausted
  • I just want to get this done

Stop whatever it is you’re doing. Go take a nap. Wait until a co-worker is around so they can tell you “holy crap, why are you eating live scorpions covered in glass? Stop that stupid thing you are doing!”

My wife is well aware that I do stupid things when I’m tired and tells me “do that later. Go to bed.” Listen to my wife.

I decided to go ahead and upgrade the system with the DEGRADED pool. I have rolling backups for everything on the system except the dataset containing our spam indexes (which are required so customers can view spam we have discarded for them, and release false positives).

Rather than wait to sync that dataset off-system (3-4 hours, and why hadn’t I just started a rolling sync earlier that day? Or had one for the last two years?) I decided to go ahead and upgrade the system.

The stupid: Why would you ever put unique data at risk like this?

Bonus stupid: Why is the data unique? There is no reason for it to be so. Replicating ZFS is trivial. Oversights happen, but this is still dumb.

(My systems all live in pairs. With very few exceptions there are no snowflake services. I take snapshots of my MySQL master. I replicate them, so I can clone and boot them to restore data quickly. I have MySQL replication set up so I can do hot failovers. I have zones replicated via ZFS, I have backups of /etc and /usr/pkg/etc even though the configs are all in git. I replicate all other big datasets to cross-site failover systems with standby zones. I do backups. So why, in my big table of datasets, does this one thing have a big TODO in the replicate column?)
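Trivial is right. A sketch, with made-up dataset and host names, of a helper that just prints the incremental send/recv pipeline it would run (dry-run style):

```shell
#!/bin/sh
# Dry-run replication helper: prints the incremental zfs send/recv
# pipeline. The dataset, snapshot names, and standby host below are
# made-up examples, not from the original setup.
repl_cmd() {
    ds=$1; prev=$2; cur=$3; host=$4
    echo "zfs send -i ${ds}@${prev} ${ds}@${cur} | ssh ${host} zfs recv -F ${ds}"
}

repl_cmd tank/spamindex 2010-11-10 2010-11-11 standby1
```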

Postpone the maintenance window. It’s ok. Sometimes scheduling conflicts come up. Sometimes you aren’t as prepared as you thought you were. Your customers won’t care that the 30 minutes of downtime they were warned about tonight happens tomorrow night instead.

Really. Get some sleep. Wake up tomorrow and feel lucky you didn’t totally break something and potentially lose unrecoverable data.

The Third Law

Don’t make a problem worse. Especially if you caused it.

Never do anything to disks which contain data you need, even if that data is currently inaccessible. Move the workload somewhere else. Hope you think of something.

You are already eating live scorpions covered in glass, don’t go setting them on fire too.

Stupidity the Third

I exported the pool and restarted the system. It Jumpstarted happily. I logged in and…

    # zpool import
  pool: tank
    id: 17954631541182524316
 state: UNAVAIL
status: One or more devices are missing from the system.
action: The pool cannot be imported. Attach the missing
        devices and try again.

        tank        UNAVAIL  missing device
          mirror-0  ONLINE
            c0t2d0  ONLINE
            c0t3d0  ONLINE
          mirror-1  ONLINE
            c0t4d0  ONLINE
            c0t5d0  ONLINE

        Additional devices are known to be part of this pool, though their
        exact configuration cannot be determined.
# zpool import -F tank
cannot import 'tank': one or more devices is currently unavailable
        Destroy and re-create the pool from
        a backup source.

At this point there was a fair amount of cursing.

The thing is, I knew the pool was fragile. I knew that reinstalling the system was going to blow away /etc/zfs/zpool.cache, which is likely the only reason U7 was happy to import the pool after the SSD died initially and it got rebooted.

But my judgement was impaired: I was making really stupid decisions.

The stupid: Doing something irrevocably destructive to a fragile, unique system.

Regretful Morning

At this point I was screwed. I couldn’t import the pool. I had no backups.

I got critical zones back up on other systems (using data that had been replicating off the now-hosed box), so services would not be unduly affected. Everything was back up, but customers couldn’t see the messages we had discarded for them, and so couldn’t release important mail that had been improperly discarded.

After an hour of trying various things (like logfix, and booting newer instances of Solaris) I gave up. At 0430, I woke up my co-worker Rik, and explained I had totally screwed us.

“That does sound pretty bad.”

I stood up another zone so we could start importing the last seven days of messages from the message queue (which we keep as a hedge in case something just like this happens, though I doubt anyone expected me to be the cause). While that ran, Rik rewrote the reindexing system, making it an order of magnitude faster: the refill went from taking 2 days to 6 hours.

The Road to Recovery

Once the refill was running my body shut down for five hours.

My brain working slightly better, I started thinking: I had a copy of the old zpool.cache, which contained configuration about the now-defunct tank pool. But how could I turn that into something useful?

Keep in mind: My data was on the disk. No corruption had occurred. It was just my version of ZFS that didn’t want to import the pool with a missing log device. How could I force it to?

I had thought about several things before crashing: The logfix tool basically replaces a missing log device with another by walking the ZFS metadata tree, replacing the device path and GUID with those of another device or a file. Okay, I could try something like that, right? But the code needs Nevada headers, or Nevada itself.
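For the curious, the on-disk labels a tool like logfix rewrites can be inspected read-only with zdb (the device path here is illustrative, not one of mine):

```shell
# Dump the ZFS labels from one pool member (read-only, safe to run).
# Each label holds the vdev tree, device paths, and GUIDs -- exactly
# the data logfix walks and rewrites. Device name is made up.
zdb -l /dev/rdsk/c0t2d0s0
```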

I came back up to James McPherson having built a logfix binary for Solaris 10. Unfortunately it didn’t work (but also didn’t eat anything, so props to James).

So if logfix wasn’t going to work, I was going to have to do something really complicated. Digging around with zdb. Terrifying.

James got me in touch with George Wilson, who had written the zpool import recovery code in the first place. He suggested some things, including:

    # zpool import -V -c /etc/zfs/zpool.cache.log tank
cannot open 'tank': no such pool

Well, that’s not good. zpool import by itself can see the pool, but can’t import it.

Specifying the secret recovery flag (-V) doesn’t help; with the alternative cache file that holds the log device’s configuration, it claims to not even see the pool!


    # zpool import -V -c /etc/zfs/zpool.cache.log

  pool: tank
    id: 17954631541182524316
 state: DEGRADED
status: One or more devices are missing from the system.
action: The pool can be imported despite missing or damaged devices.  The
        fault tolerance of the pool may be compromised if imported.

        tank        DEGRADED
          mirror-0  ONLINE
            c0t2d0  ONLINE
            c0t3d0  ONLINE
          mirror-1  ONLINE
            c0t4d0  ONLINE
            c0t5d0  ONLINE
          c0t6d0p1  UNAVAIL  cannot open

Okay, so I can see the pool using the old configuration data, but I can’t import it. And it’s seen as DEGRADED, not UNAVAIL. It’s importable. That suggests I don’t need to go digging around with zdb or a hex editor. George also started with the import command, not a hex editor, which implied he thought it was recoverable.

(That sinking feeling: something you thought was going to be really complicated and dangerous is, in fact, trivial, and you’ve realized it long, long after you should have.)

So: -V is the old import switch. I bet that would work on U7. U9 has an actual recovery mechanism now. Maybe…

    # zpool import -F -c /etc/zfs/zpool.cache.log tank
Pool tank returned to its state as of Thu Nov 04 01:25:50 2010.
# zpool list
rpool   136G  2.05G   134G     1%  ONLINE  -
tank    272G   132G   140G    48%  DEGRADED  -

Twelve hours later, there is much more cursing.

Ghost of the Arcane

A lot of UNIX comes down to reading documentation and determining which switches are going to solve your immediate problem. Here, it’s two: -F and -c. That’s it. Let’s assume that twelve hours previous I was well-rested but still astoundingly dumb, and had managed to get myself into the situation where my pool was UNAVAIL.

Because I was well-rested, I would have read the docs, understood them, and recovered the pool within a few minutes. Instead, I lost hours recharging my brain, created a lot of work for my co-workers, and annoyed my customers. Good job!

Ok. Now I want to get rid of the busted log device. The newly imported degraded pool is on ZFS v10. I need to get it to at least v19, which is when log device removal was added. Thankfully U9 supports v22.

    # zpool upgrade tank
This system is currently running ZFS pool version 22.

Successfully upgraded 'tank' from version 10 to version 22

And get rid of the dead log device:

    # zpool remove tank c0t6d0p1
# zpool status -v tank
  pool: tank
 state: ONLINE
 scrub: none requested

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c0t2d0  ONLINE       0     0     0
            c0t3d0  ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            c0t4d0  ONLINE       0     0     0
            c0t5d0  ONLINE       0     0     0

errors: No known data errors

And the pool is back online in a usable state.

Before making any changes to the newly recovered pool, I take a snapshot and send it to another system. This takes a few hours. It means that if the new indexer has a bug that interacts badly with the existing index, we’ll be able to go back to the pristine data.

I also start up the rolling replication script on the dataset. The first send takes a few hours; incrementals 20-30 minutes.

Both of those things should have already been in place.
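Neither safeguard is much typing. A minimal sketch, with made-up pool, dataset, and host names:

```shell
# One-time pristine copy: snapshot, then a full send to a standby box.
# Names (tank/data, standby, backup/data) are illustrative.
zfs snapshot tank/data@pristine
zfs send tank/data@pristine | ssh standby zfs recv -F backup/data

# Rolling replication: each pass sends only the delta since the previous
# snapshot, so incrementals run in minutes rather than hours.
PREV=pristine
NOW="roll-$(date +%Y%m%d%H%M)"
zfs snapshot "tank/data@$NOW"
zfs send -i "@$PREV" "tank/data@$NOW" | ssh standby zfs recv backup/data
```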

That How They Do

Shortly before I got the pool back online, the 7 day import had finished and we had announced to customers they could get back to seeing their discarded messages.

Well, now I had the last 30 days of spam, and all the metadata that went with it. Rebuilding the other 23 days on the new index was going to be both non-trivial and slow. We would have to pull the information off disk for each message (around 2TB of spam), and some data was only stored in the index.

The decision was made to revert to the original dataset. I pointed the new index refiller at it, and 9 minutes later we had the last 12 hours of spam indexed. We swapped around, merged the metadata from the temporary dataset into the original one, and we were back online.

We made the announcement, wrote a blog post, and everything was good again.

Almost as if I had never done anything incredibly stupid.



Maybe you are the lone SA at a small company, but you still have resources to ask for advice. There are certainly people on IRC whose opinion I value. Your boss and co-workers may not know as much about systems as you do, but they can probably recognize a three-legged chair when it’s in front of them.

It is easy to do stupid shit in a vacuum. Talking to people about it is probably enough for you to recognize if it’s a bad idea.

I’ll have another post coming up (with pretty graphs, hopefully) about hybrid storage pools and their impact on performance. Two SSDs just came in to act as a mirror for this host, so it should be interesting.

Your Co-workers

You have broken something. You feel dumb and defensive, and pissed off at yourself. Don’t take it out on the people who are helping you get the system back online.

When you break something and can’t fix it, you create work for other people. Make sure you thank them and apologize. Act like a professional, or even just a regular human being.

I can think of a few instances where Ricardo Signes has had to save my bacon in the last few years, but probably nothing so major as this case. I had to wake him up at 0430 to give me a hand, and while he’s paid to do it, it’s unfortunate how rare it is to find people as pleasant and professional as he is.

Over the years I’ve worked with lots of smart people, but few as smart and even-tempered as rjbs. Manhug!

Wheaton’s Law

A brief tangent.

Sysadmins are admittedly used to other people breaking things and wanting us to fix it. Treat your co-workers, customers, and users with respect. Do not call them lusers, do not make them feel bad. It is extremely aggravating at times, but they are not a puppy who just had an accident on your new carpet. They are adults, and your colleagues.

At some point you may find yourself on the other side of the table: You have done something and now they can’t get any work done. Hopefully they will recall that when they screwed up, you did not berate them, and will afford you the same courtesy.

Educate them after you have solved their problem.

Don’t be a dick.


Special thanks to James McPherson of Oracle/Sun and George Wilson of Delphix (previously of Oracle/Sun) for giving me a hand. George pointed me to -V and -c which finally helped me realize just how dumb I was being and got my pool back online.

Vendor Support

Once I realized I was screwed and got the immediate booms out of the way, I opened a case with Oracle. P1, at 0700. A rep got back to me around 1900. Nearly 12 hours later. For a “system down” event, affecting many customers, on a paid support contract.

Andre van Eyssen says: If you have a P1 problem, call it in. Don’t use SunSolve. Make the call.

A support contract is not a panacea.

Design your systems to be redundant and resilient.

And don’t do stupid shit when you’re tired.

textproc/libxslt on pkgsrc/solaris mirrorshades

Requires this patch or you get symbol errors on compile.

Need this for devel/hg-git. Working on getting illumos-gate pushed into github (#105).

apr bug on pkgsrc/Solaris x86 mirrorshades

While working on the illumos infrastructure roll-out I ran into an issue with apache22 segfaulting, but mostly working, when SSL was enabled. Disabling SSL seemed to fix the issue. Other SSL-enabled programs were not affected.

Turns out it’s an apr bug, and kind of deeply weird. Described here.

On software. mirrorshades

(grumpy face)

< bdha> I see its value.
< bdha> I just hate it.
< bdha> As a sysadmin I feel I am allowed to feel that way.

Segfaults with SSL-enabled software on pkgsrc2010qN/solaris mirrorshades

Starting with 2010q1 I noticed that anything built with openssl, including, well, openssl, would segfault. To fix this, compile the openssl package with Sun Studio (sunpro). Don’t forget to modify your mk.conf:

PKGSRC_COMPILER=        sunpro

# For sunpro
CC=     cc
CXX=    CC
CPP=    cc -E

I previously tried to compile everything with pkgsrc’s gcc, but perhaps I’ll change that policy now.

Fix defined here.

nginx on pkgsrc2010q2/solaris. mirrorshades

To compile the nginx package, you need to remove the patch-aa patch from distfiles. You also need to create and add the following to it:

.if (${PKGSRC_COMPILER} == sunpro)
.if (${MACHINE_ARCH} == i386)
CONFIGURE_ENV+= NGX_AUX=" src/os/unix/"
.endif
.endif

Problem report and fix here.

Building Postfix on OpenSolaris >=b130 mirrorshades

NIS was finally sent into the cornfield around ONNV b130. Postfix does not have a definition for OpenSolaris (arguably 5.11), just Solaris 5.10. When building, it attempts to compile dict_nis and can’t, unsurprisingly.

To build, remove “#define HAS_NIS” from src/util/sys_defs.h in the “#ifdef SUNOS5” section.

With pkgsrc 2010q1, apply this diff to pkgsrc/mail/postfix/patches/patch-ag.

The checksum is 3ea7ecaec06b0ff30fe1a1b2f5197def0219bd6b.

Elsechan… mirrorshades


< bda> I guess the X2270 is coming in soon. The power cables are here.

< e^ipi> unless it's the computer columbian drug lords and the
 power cable is a warning

< e^ipi> like a toe

ZFS and iSCSI mirrorshades

I was asked to share out the pool on the X4500 via NFS and iSCSI. NFS I was familiar with, and have used a fair amount. iSCSI, for all its new-hotness factor, I’d never touched.

I was unsurprised, but pleased, by how trivial it is to set up. On the server (the target):

x4500# zfs create tank/iscsi
x4500# zfs set shareiscsi=on tank/iscsi
x4500# zfs create -s -V 25g tank/iscsi/vol001
x4500# zfs create -s -V 25g tank/iscsi/vol002
x4500# zfs create -s -V 25g tank/iscsi/vol003
x4500# zfs create -s -V 25g tank/iscsi/vol004
x4500# zfs create -s -V 25g tank/iscsi/vol005
x4500# zfs create -s -V 25g tank/iscsi/vol006
x4500# zfs list tank
tank                      1.39G  13.3T  53.3K  /tank
tank/iscsi                1.54M  13.3T  44.8K  /tank/iscsi
tank/iscsi/vol001          246K  13.3T   246K  -
tank/iscsi/vol002          246K  13.3T   246K  -
tank/iscsi/vol003          247K  13.3T   247K  -
tank/iscsi/vol004          262K  13.3T   262K  -
tank/iscsi/vol005          263K  13.3T   263K  -
tank/iscsi/vol006          264K  13.3T   264K  -
tank/nfs                  1.39G  13.3T  1.39G  /tank/nfs

This will start the iSCSI target daemon (iscsitgtd) and share not only the parent volume (tank/iscsi) but all the children as well.

Accessing and using the disks on the client (the initiator) is just as easy:

client# iscsiadm modify discovery --sendtargets enable
client# iscsiadm add discovery-address
client# svcadm enable initiator
client# iscsiadm list target
        Alias: tank/iscsi/vol001
        TPGT: 1
        ISID: 4000002a0000
        Connections: 1
        Alias: tank/iscsi/vol002
        TPGT: 1
        ISID: 4000002a0000
        Connections: 1
        Alias: tank/iscsi/vol003
        TPGT: 1
        ISID: 4000002a0000
        Connections: 1
        Alias: tank/iscsi/vol004
        TPGT: 1
        ISID: 4000002a0000
        Connections: 1
        Alias: tank/iscsi/vol005
        TPGT: 1
        ISID: 4000002a0000
        Connections: 1
        Alias: tank/iscsi/vol006
        TPGT: 1
        ISID: 4000002a0000
        Connections: 1
client# format < /dev/null
Searching for disks...done
       0. c0d0 <DEFAULT cyl 4174 alt 2 hd 255 sec 63>
       1. c1t600144F04B66BAA30000144F21056400d0 <DEFAULT cyl 3261 alt 2 hd 255 sec 63>
       2. c1t600144F04B66BAA40000144F21056400d0 <DEFAULT cyl 3261 alt 2 hd 255 sec 63>
       3. c1t600144F04B66BAA50000144F21056400d0 <DEFAULT cyl 3261 alt 2 hd 255 sec 63>
       4. c1t600144F04B66BAA60000144F21056400d0 <DEFAULT cyl 3261 alt 2 hd 255 sec 63>
       5. c1t600144F04B66BAA80000144F21056400d0 <DEFAULT cyl 3261 alt 2 hd 255 sec 63>
       6. c1t600144F04B66BAA90000144F21056400d0 <DEFAULT cyl 3261 alt 2 hd 255 sec 63>

client# zpool create tank \
raidz c1t600144F04B66BAA30000144F21056400d0 c1t600144F04B66BAA40000144F21056400d0 c1t600144F04B66BAA50000144F21056400d0 \
raidz c1t600144F04B66BAA60000144F21056400d0 c1t600144F04B66BAA80000144F21056400d0 c1t600144F04B66BAA90000144F21056400d0
client#  zpool status -v tank
  pool: tank
 state: ONLINE
 scrub: none requested

        NAME                                       STATE     READ WRITE CKSUM
        tank                                       ONLINE       0     0     0
          raidz1                                   ONLINE       0     0     0
            c1t600144F04B66BAA30000144F21056400d0  ONLINE       0     0     0
            c1t600144F04B66BAA40000144F21056400d0  ONLINE       0     0     0
            c1t600144F04B66BAA50000144F21056400d0  ONLINE       0     0     0
          raidz1                                   ONLINE       0     0     0
            c1t600144F04B66BAA60000144F21056400d0  ONLINE       0     0     0
            c1t600144F04B66BAA80000144F21056400d0  ONLINE       0     0     0
            c1t600144F04B66BAA90000144F21056400d0  ONLINE       0     0     0

errors: No known data errors

Very nice.

X4500 and ZFS pool configuration mirrorshades

UPDATED: 2010-01-31 2245

RAS guru Richard Elling notes a couple of bad assumptions I’ve made in this post:

Much thanks to Richard for correcting me!

I was recently asked to help install and configure a Sun X4500 (“Thumper”). The system has dual Opterons, 16GB RAM, and 48 500GB SATA disks. The largest pool I’d configured before this project was a Sun J4200: 24 disks.

The Thumper controller/disk setup looks like this:

That’s six controllers, with 46 disks available for data.

The mirrored ZFS rpool is on c5t0 and c4t0. Placing the mirror halves across controllers allows the operating system to survive a controller failure.

ZFS supports two basic redundancy types: Mirroring (RAID1) and RAIDZ (akin to RAID5, but More Gooder). RAIDZ1 is single parity, and RAIDZ2 double. I decided to go with RAIDZ2 as the added redundancy is worth more than capacity: The 500GB disks can trivially be swapped out for 1TB or 2TB disks, but the pool cannot be easily reconfigured after creation.

From the ZFS Best Practices and ZFS Configuration guides, the suggested RAIDZ2 pool configurations are:

  • 4x(9+2), 2 hot spares, 18.0 TB
  • 5x(7+2), 1 hot spare, 17.5 TB
  • 6x(5+2), 4 hot spares, 15.0 TB
  • 7x(4+2), 4 hot spares, 12.5 TB

ZFS pools consist of virtual devices (vdev), which can then be configured in various ways. In the first configuration you are making 4 RAIDZ vdevs of 11 disks each, leaving 2 spares.

(ZFS pools are quite flexible: You could set up mirrors of RAIDZs, three-way mirrors, etc. In addition to single and dual parity RAIDZ, RAIDZ3 was recently released: Triple parity!)

Distributing load across the controllers is an important performance consideration but limits possible pool configurations. Preferably you want each vdev to have the same number of members. Surviving a single controller failure is also required.

RAIDZ2 is double parity, so you lose the “+2” disks noted above, but each vdev can sustain two disk losses; the entire pool can then survive the loss of up to two disks per vdev. This is pretty important given the size of the suggested vdevs (6-11 disks). The ZFS man page recommends that RAIDZ vdevs not exceed 9 disks because you start losing reliability: more disks, and not enough parity to go around. Consider the likelihood of losing more than two disks in a vdev with 30 members, for instance.

The goal is to balance number of vdev members with parity.

A vdev can be grown by replacing every member disk. Once all disks have been replaced, the vdev grows, and the pool’s total capacity increases. If the goal is to incrementally increase space by replacing individual vdevs, it can be something of a hassle if you have too many (say, 11) disks to replace before you get any benefit.

The process for growing a vdev is: replace a disk, wait for the new disk to resilver, replace the next disk, wait for it to resilver, and so on. It is somewhat time consuming; not something you want to do very often.
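Sketched out, the disk-by-disk swap looks something like this (device names are illustrative; on this vintage of Solaris the extra capacity shows up once the last member has resilvered, possibly after an export/import):

```shell
# Grow a raidz2 vdev by swapping each member for a larger disk, one at
# a time. 'zpool replace' with a single device argument assumes the new
# disk went into the same physical slot. Device names are made up.
for disk in c0t1d0 c1t1d0 c6t1d0 c7t1d0 c4t1d0 c5t1d0 c5t4d0; do
    zpool replace tank "$disk"
    # Never touch the next disk until the resilver completes.
    while zpool status tank | grep -q 'resilver in progress'; do
        sleep 60
    done
done
```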

So the tradeoff here is really:

  • More initial space (18TB) but less trivial upgrade (11 disks), and ok performance
  • Less initial space (15TB) but more trivial upgrade (7 disks), good performance
  • Less initial space (12.5TB) but more trivial upgrade (6 disks), best performance

With the 6x(7) or 7x(6) configurations we have four free disks. One or two can be assigned to the pool as a hot spare. The other two or three disks can be used for:

  • A mirror for tasks requiring dedicated I/O
  • Replaced with 3.5" SAS or SSDs for cache devices

I’ll discuss Hybrid Storage Pools (ZFS pools with cache devices consisting of SSD or SAS drives) in another post. They greatly affect pool behavior and performance. Major game-changers.

Unsurprisingly the 12.5TB configuration has the highest RAS and the best performance. It loads disks across controllers evenly, has the best write throughput, is easiest to upgrade, etc.

Sacrificing 6TB of capacity for better redundancy and performance may not be in line with your vision of the system’s purpose.

The 15TB configuration seems like a good compromise. High RAS: Tolerance to failure, good performance, good flexibility, not an incredibly painful upgrade path, and 15TB isn’t anything to sneer at.

(Note: After parity and metadata, 13.3TB is actually useable.)

The full system configuration looks like this:

    pool: rpool
state: ONLINE
scrub: none requested

    rpool         ONLINE       0     0     0
      mirror      ONLINE       0     0     0
        c5t0d0s0  ONLINE       0     0     0
        c4t0d0s0  ONLINE       0     0     0 

errors: No known data errors

pool: tank
state: ONLINE
scrub: none requested

    tank        ONLINE       0     0     0
      raidz2    ONLINE       0     0     0
        c0t7d0  ONLINE       0     0     0
        c1t7d0  ONLINE       0     0     0
        c6t7d0  ONLINE       0     0     0
        c7t7d0  ONLINE       0     0     0
        c4t7d0  ONLINE       0     0     0
        c5t7d0  ONLINE       0     0     0
        c0t4d0  ONLINE       0     0     0
      raidz2    ONLINE       0     0     0
        c0t3d0  ONLINE       0     0     0
        c1t3d0  ONLINE       0     0     0
        c6t3d0  ONLINE       0     0     0
        c7t3d0  ONLINE       0     0     0
        c4t3d0  ONLINE       0     0     0
        c5t3d0  ONLINE       0     0     0
        c1t4d0  ONLINE       0     0     0
      raidz2    ONLINE       0     0     0
        c0t6d0  ONLINE       0     0     0
        c1t6d0  ONLINE       0     0     0
        c6t6d0  ONLINE       0     0     0
        c7t6d0  ONLINE       0     0     0
        c4t6d0  ONLINE       0     0     0
        c5t6d0  ONLINE       0     0     0
        c6t4d0  ONLINE       0     0     0
      raidz2    ONLINE       0     0     0
        c0t2d0  ONLINE       0     0     0
        c1t2d0  ONLINE       0     0     0
        c6t2d0  ONLINE       0     0     0
        c7t2d0  ONLINE       0     0     0
        c4t2d0  ONLINE       0     0     0
        c5t2d0  ONLINE       0     0     0
        c7t4d0  ONLINE       0     0     0
      raidz2    ONLINE       0     0     0
        c0t5d0  ONLINE       0     0     0
        c1t5d0  ONLINE       0     0     0
        c6t5d0  ONLINE       0     0     0
        c7t5d0  ONLINE       0     0     0
        c4t5d0  ONLINE       0     0     0
        c5t5d0  ONLINE       0     0     0
        c4t4d0  ONLINE       0     0     0
      raidz2    ONLINE       0     0     0
        c0t1d0  ONLINE       0     0     0
        c1t1d0  ONLINE       0     0     0
        c6t1d0  ONLINE       0     0     0
        c7t1d0  ONLINE       0     0     0
        c4t1d0  ONLINE       0     0     0
        c5t1d0  ONLINE       0     0     0
        c5t4d0  ONLINE       0     0     0
      c0t0d0    AVAIL   
      c1t0d0    AVAIL   

errors: No known data errors

The command to create the tank pool:

    zpool create tank \
raidz2 c0t7d0 c1t7d0 c6t7d0 c7t7d0 c4t7d0 c5t7d0 c0t4d0 \ 
raidz2 c0t3d0 c1t3d0 c6t3d0 c7t3d0 c4t3d0 c5t3d0 c1t4d0 \
raidz2 c0t6d0 c1t6d0 c6t6d0 c7t6d0 c4t6d0 c5t6d0 c6t4d0 \
raidz2 c0t2d0 c1t2d0 c6t2d0 c7t2d0 c4t2d0 c5t2d0 c7t4d0 \
raidz2 c0t5d0 c1t5d0 c6t5d0 c7t5d0 c4t5d0 c5t5d0 c4t4d0 \
raidz2 c0t1d0 c1t1d0 c6t1d0 c7t1d0 c4t1d0 c5t1d0 c5t4d0 \
spare c0t0d0 c1t0d0

This leaves c6t0d0 and c7t0d0 available for use as more spares, for another pool or as cache devices.

I feel the configuration makes for a good compromise. If it doesn’t prove successful or we’ve misjudged the workload for the machine, we have the ability to add cache devices without compromising the pool’s redundancy.

That said, I’ll be quite interested in seeing how it performs!

Documenting infrastructure changes over time is useful for... mirrorshades

Documenting infrastructure changes over time is useful for spotting trends in your knowledge. It’s also helpful when identifying areas that are lacking in investment. Keep track of the major changes to your platform; this table is nice, but you can see there was a large chunk of time I didn’t update it. Things were still changing, but they weren’t documented. Going back a year and digging through our ticketing system looking for major changes wouldn’t necessarily be trivial, either. Keep your documentation up to date or it’s useless!

In addition to a ticketing system, we utilize a CHANGELOG mailing list, to which summaries of major operations, development, or policy changes are sent, tagged appropriately. We only started doing this in April, though, so populating the missing year is still hard.

On our wiki, I also have brief notes for the content of each release.

Versioned Will Enforcement (You Can Too!) mirrorshades

If security is a never-ending process, operations is the systematic refutation of entropy.

When I first started out, I did everything by hand. Installs, configuration, every aspect of management. Eventually I started writing scripts. Scripts would install things for me, and copy configuration files, and whatever else. But the scripts were stupid. If they were run more than once, bad things might happen.

The day I did my first automated network install was an epiphany. No more hitting enter or clicking next until my eyes bled a merry pattern on the keyboard.

The weird thing is, my first job involved using Norton Ghost to install entire labs of workstations with an operating system image. But it never occurred to me, until many years later, that a similar thing could be had for servers. A major hole in my experience.

So then I started using images to install new systems. Of course, the problem with using images is that as soon as you build them, they’re out of date. What’s in the image is not actually representative of what you have in production. The image has new stuff the production boxes won’t, or the production systems were changed in some undocumented way that is not reflected in the image, or… Anyway, then you end up writing more scripts. To keep things in sync. Only they aren’t perfect, because by this point every system is just slightly different enough that you can’t find all the edge cases until they cause a boom.

Two years ago I discovered Puppet. I had seen change management before, but in the form of cfengine, and it didn’t really grab me. Its syntax didn’t make my life any easier. It didn’t offer a mental model for how the different pieces of my infrastructure interacted. Puppet did. Maybe Luke just explained it properly in the videos I watched while researching change management tools.

The joy of change management comes from documenting your infrastructure, and then enforcing that singular vision across it with a minimum of effort.

When you install a new host (presumably using Jumpstart/JET, or FAI, or Cobbler), you install Puppet. A few minutes later, that host is now configured with the same base as the rest of your installed hosts. They’re all the same. File permissions, users, directories, services, cron jobs…

If a service needs to be installed on a group of hosts, you write the service class, include it in the service group, and Puppet does the rest.

There’s no more “Oh, right, we changed how that works, but I guess this system we never think about didn’t get updated, and now we’ve totally screwed ourselves in some really unexpected way.”

There’s no more “Hm, someone changed something on this box, and I don’t know why, but I’d better not touch it,” because your Puppet classes are in a versioned repository. You always know who, and why, something was done. (If someone does make a local change, well, too bad for them, because Puppet is bloody well going to change it back until they create an auditable configuration trail.)

I think there’s a threshold: Once you hit a certain number of hosts, you can’t keep them all in your head. I have 20 physical hosts and 87 virtual ones. When I bring up a new Solaris zone, I don’t want to have to run some script that configures it. Heck, I don’t even want to bring it up myself. I just tell Puppet to do it, and then Puppet enables itself in the zone, and then the zoned Puppet configures the zone and suddenly whatever service I wanted to be running is.

I don’t want to have my installation method add a bunch of users. What if I have new users? Now I need to make sure my user-adding scripts, and my post-installation scripts, will do the right thing! No, I think I’ll just let Puppet ensure, every 20 minutes, that users who are supposed to exist, do, and those who shouldn’t, don’t. (Not to mention that Puppet makes sure each user’s environment is always set up. No more having to copy your dot-files around, or checking them out from your version control system, or…)

Once you reach a certain amount of platform complexity, you need to abstract management into something you can keep in your head. Otherwise you end up spinning repetitively instead of focusing on newer, more interesting work.

It isn’t even really that much of a paradigm shift. We always end up writing scripts to manage our systems for us. Taking the next step and writing classes and functions in Puppet’s declarative language really isn’t a leap.

Once a codebase reaches a certain amount of complexity, it has to be refactored. It has to be abstracted. Otherwise it becomes unmaintainable. As with development, so too for operations.

If you’ve been at this game for a number of years, and you find yourself performing the same tasks over and over; or like me you are administering a moderate number of hosts; or you have thousands upon thousands of systems, and you aren’t using some form of versioned change management: Consider this an intervention.

Dude. You’re doing it wrong.

Console Cowboy Wrangles Himself Out of Work mirrorshades

I’ve been at my current job for three years and change. In many respects, it has been the biggest learning experience of my career (which started in 1999). My previous job had been doing network security for a decently sized university, and the experience almost drove me crazy (go find pictures of my mad scientist hair from that year; you will not question my unstable mental state again, I assure you). When a friend mentioned his employer was hiring, and did I know any system admins looking for work, I said, yeah. Me.

The infrastructure was full of legacy: 10 year old code, four year old Linux boxes, crufty hardware… It wasn’t all doom and gloom; lots of new code existed and worked very well. The R&D side was populated with two very smart people, with good plans on how to fix their side of the shop.

The ops side was a bit of a mess. It was the end result of programmers shoved into administration. There had been no dedicated systems administrator in several years; the admin work had been doled out to the programmers, who understandably had little interest in systems.

After a year digging through and learning as much as I could about the setup, I decided the best solution was to rip it all down and build anew. It’s a testament to either my salesmanship (unlikely) or a willingness and trust by both development and management to try something new and, hopefully, better. Given the state of affairs, though, it probably wasn’t much of a leap. My arguments were sound, and the testing I had done backed them up even more. It wasn’t going to be easy, but migrating from Linux to Solaris 10 was definitely where we wanted to go.

Of course, the changes were rolled out incrementally. In February of 2007 I rolled out our first Solaris 10 box, on a Sun Fire X2100. A little entry-level system, but when you’re being disruptive to a complex ecosystem, it’s good to work incrementally. Otherwise people start asking why the frogs have all suddenly died off.

The subsequent two years saw a lot of changes. All our core services moved onto bigger and better Sun systems running Solaris 10 (the biggest currently being four X4170s that I love). We went from 50 Linux boxes, to a dozen or so Solaris systems.

Consolidation was the first order of business, which is sort of amusing. When I started, each MX ran not only an MTA, a lot of Perl dispatching services, and cached RBL data, but also a complete replica of the database. The first thing I did to improve MX performance was to get MySQL off the MXes onto a dedicated replica, and have each set of site MXes use that. If I remember right, the improvement was something like 50-75%.

So when I started consolidating services into Solaris Zones, the irony didn’t escape me. I had started out separating services onto dedicated hardware, and now I was stuffing a bunch of random toys into the same box again. (Of course, the databases are still on dedicated hardware; and well. New dual CPU quad core Xeons and Nehalems with SAS disks and 32GB of RAM kind of beat the pants off the dual Athlons we had been using…)

After consolidation came change management; Puppet proved to be an excellent choice, and I’ve been happy with it since. Puppet manages almost every aspect of our services. If it isn’t managed, it’s a bug, and a task gets made to fix it.

After consolidation came standardization; in addition to keeping all the systems near the same patch and release level, I rolled out pkgsrc across both our Solaris and Linux platforms. Having the same version of a package on both made life easier in a lot of ways.

We went through several iterations of both installation and management techniques. I had never admin’d Solaris before, so it was a learning experience both for me and (perhaps less so) for our developers. We had to port a lot of code that relied on Linuxisms, and one of our devs built a framework around the CPAN which would keep all our Perl modules in sync across any number of platforms (right now, just two: Debian Linux and Solaris 10, both on x86). We’re a big Perl shop; if you use Perl email modules from the CPAN, you probably use code we developed or maintain.

In addition to the operations turmoil, we went through several changes in how we scheduled and managed our actual work. We finally settled on two week iterations. Each iteration is planned in advance, at the end of the previous iteration. We use Liquid Planner for this, and it has really worked out.

My major regret in rolling out Solaris was not using Live Upgrade until far too late. It wasn’t until two months ago that I actually sat down and took the fifteen minutes to read the documentation and do a test upgrade. For the previous two years I had been patching and upgrading systems stupidly and with as much tedium as was possible. Live Upgrade is one of Solaris’s killer features, right up there with Zones, ZFS, DTrace, mdb, and SMF. I wasted a lot of time I needn’t have if I had been using it.

But… after two years, the infrastructure is stable. We no longer have a monitor that fires when a system boots (uptime.monitor), because systems don’t randomly reboot. If a host does fall offline, the monitors that watch the services the host provides fire instead (and, of course, the ICMP checks). Services live in discrete containers, and it’s easy to tell what’s causing problems at a glance; and if glancing doesn’t work, well, there’s the DTrace Toolkit. Every system’s configuration is enforced by Puppet. Everything from users, to services, to ZFS filesystems, to zones, is versioned and managed (I’ll expand on this in a later post, because I’ve come to believe if you aren’t using change management, You’re Doing It Wrong).

Last week I went away for five days, with no Internet access, and I received no harried phone calls from the developers or support staff. No one even emailed me any questions (not that I would have seen it); the systems just did what they’re meant to: Work.

It’s been percolating for a while, but that really was the clincher. When the lone admin can disappear for a business week and the world doesn’t notice, what becomes of him?

All the basic infrastructural problems have been solved. The foundation is now sound.

For the last two years that was my goal, and it’s been the core focus of every day I do work. All of my plans, from moving our fileservers from mirrored SATA drives in SuperMicros running reiserfs (I try not to think about how many nightmares that filesystem caused me) to Dell 210S JBODs on ZFS, to finally Sun J4200s, to… well. To everything. The websites, MX policy servers, spam storage, DNS, SASL, the build system, the development environment, support and billing… Putting out each of those fires was as far as I could see.

There are plenty of things left to do on the operations side, certainly: Better monitoring and visualization (Reconnoiter?), refactoring our Puppet classes so they’re not horrible, code instrumentation and log searching that aren’t wrappers around grep, fixing the build and push systems so they’re not rsync and Makefile, Rakefiles, and things we call Bakefiles but are, in fact, not.

And that’s all really important stuff. But what we have works. It’s not falling over. It doesn’t cause a crisis. None of it is on fire.

Looking back at the last ten years, when I’m not in crisis mode, tearing stuff down and rebuilding it, I get bored. I get bored and I find another shop that is on fire.

I really like my job. I don’t much want to find another. I’ve come to enjoy going to bed at a reasonable hour and getting a reasonable amount of sleep. I’ve just turned 30. There are white streaks in my beard.

Firefighting is for younger people, with less experience but more energy.

Now I have to figure out what a systems administrator does, when the world isn’t actually on fire. When things are, on the whole, ticking along pretty well, in fact. In many respects this is where sysadmins always say they want to end up. Where their job is to sit around playing Nethack, because the thing they have designed Just Works. That would drive me mad. If I’m not designing and implementing something to improve the things I’m responsible for, I get really unhappy. My joy circuit ceases to fire. I have no aspirations for supreme slack.

My shop is no longer on fire.

So: Now what?

A teeny bug in jkstat char handling The Trouble with Tribbles...

While messing about with illuminate, I noticed an interesting oddity in the disk display:

See that "Revision" on the end of the product string? It shouldn't be there, and iostat -En doesn't show it. The display comes from my JKstat code, so where have I gone wrong?

This comes from the sderr kstat, which is a named kstat of the device_error class.

A named kstat is just a map of keys and values. The key is a string, the value is a union so you need to know what type the data is in order to be able to interpret the bits (and the data type of each entry is stored in the kstat, so that's fine).

For that Product field, it's initialized like this:

kstat_named_init(&stp->sd_pid, "Product", KSTAT_DATA_CHAR);

OK, so it's of type KSTAT_DATA_CHAR. The relevant entry in the union here is value.c, which is actually defined as a char c[16] - the field is 16 bytes in size, long enough to hold up to 128-bit ints - most of the numerical data doesn't take that much space.

(For longer, arbitrary length strings, you can stick a pointer to the string in that union instead.)
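To make the layout concrete, here's a simplified mock of the named-kstat structure. This is not the real definition from sys/kstat.h (which has more union members and a struct for the pointer/length string case); it's just enough to show the 16-byte inline value that matters here.

```c
#include <assert.h>

/* Simplified mock of kstat_named_t: a fixed-size key plus a value
 * union. The real illumos definition in <sys/kstat.h> has more
 * members; this sketch keeps only the parts relevant to the bug. */
typedef struct {
    char name[32];              /* the key, e.g. "Product" */
    unsigned char data_type;    /* e.g. KSTAT_DATA_CHAR */
    union {
        char c[16];             /* inline chars - also big enough
                                 * to hold a 128-bit integer */
        long long i64;
        unsigned long long ui64;
        /* ... longer strings go via a pointer member instead ... */
    } value;
} mock_kstat_named_t;
```

The key point is that value.c is a fixed 16-byte array inside the union, with no separate length field: anything interpreting it as a C string has to know the 16-byte limit itself.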

Back to iostat data. For a SCSI device (something using the sd driver), the device properties are set up in the sd_set_errstats() function in the sd driver. This does a SCSI enquiry, and then copies the Product ID straight out of the right part of the SCSI enquiry string:

strncpy(stp->sd_pid.value.c, un->un_sd->sd_inq->inq_pid, 16);

(If you're interested, you can see the structure of the SCSI enquiry string in the /usr/include/sys/scsi/generic/inquiry.h header file. The inq_pid comes from bytes 16-31, and is 16 bytes long.)

You can see the problem. The strncpy() just copies 16 bytes into a character array that's 16 bytes long. It fits nicely, but there's a snag - because it fits exactly, there's no trailing null!

The problem with JKstat here is that it is (or was, anyway) using NewStringUTF() to convert the C string into a Java String, and that doesn't have any concept of length associated with it. So it starts from the pointer to the beginning of the c[] array, and keeps going until it finds the null to terminate the string.

And if you look at the sd driver, the Revision entry comes straight after the Product entry in memory, so what JNI is doing here is reading past the end of the Product value, and it keeps going until it finds the null at the end of the next name, "Revision", and takes the whole lot. It is, I suppose, fortunate that there is something vaguely sensible for it to find.

There doesn't appear to be a way of doing the right thing in JNI itself; the fix has to be to copy the correct amount of the value into a temporary string that does have the trailing null added.
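The fix can be sketched as a small helper on the C side of the JNI code. This is my illustration, not the actual JKstat patch: copy at most the 16 value bytes into a buffer one byte larger, force the terminator, and only then hand the buffer to NewStringUTF().

```c
#include <string.h>

/* Hypothetical helper: copy up to 'len' bytes of a possibly
 * unterminated KSTAT_DATA_CHAR value into 'buf', guaranteeing a
 * trailing null so the result is safe for NewStringUTF(). */
static void kstat_char_to_cstr(const char *value, size_t len,
                               char *buf, size_t bufsize) {
    size_t n = (len < bufsize - 1) ? len : bufsize - 1;
    memcpy(buf, value, n);
    buf[n] = '\0';
    /* Values shorter than 16 bytes are null-padded in the kstat,
     * so an embedded null still terminates the string early. */
}
```

In the JNI glue it would be used along these lines: declare char tmp[17], call kstat_char_to_cstr(kn->value.c, 16, tmp, sizeof tmp), then pass tmp to (*env)->NewStringUTF(env, tmp) - the string can never run past the 16 value bytes.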

(And all the system tools written in C are fine, because they do have a way to just limit the read to 16 characters.)

OmniOS Community Edition r151028v, r151026av, r151022ct OmniOS Community Edition

OmniOS Community Edition weekly releases for w/c 1st of April 2019 are now available.

The following security fixes are available for all supported releases:

We recommend that you apply these updates as soon as possible if you use the in-kernel SMB/CIFS feature, or provide zone root access to untrusted users.

For further details, please see

Any problems or questions, please get in touch.