OmniOS Community Edition r151030be, r151032ae, r151034e OmniOS Community Edition

This week’s update for all stable OmniOS versions corrects a bug in the pkgdepend command which is important for illumos developers.

For r151034, there is also a fix for cdrom support in kvm zones.

For further details, please see

Any problems or questions, please get in touch.

OmniOS Community Edition r151034d OmniOS Community Edition

The first update for the r151034 stable version of OmniOS is now available.

This update requires a reboot


  • Update Intel CPU microcode to 20200520

  • Update timezone data to 2020a

  • The bhyve zone brand now supports vnc=wait to pause VM execution until a VNC client has connected

  • KVM zone shutdown did not work reliably

  • The bhyve and kvm zone brands now support multiple cdrom entries. These should be configured as properties named cdrom0, cdrom1, etc.

  • The default-recurse pkg property was erroneously affecting the pkg install command; this has been fixed

  • pkg update with a new boot environment could occasionally report pkg: Unable to clone the current boot environment

  • onu to illumos-gate now works with installed zones

  • Fix for a kernel panic when running an NFS client within an lx-branded zone

  • Fix for regression in ftello64() behaviour

  • Fix for (rare) crash in bhyve with some Linux guests

  • Fix for potential lz4 compression failure in vtfontcvt

  • Improve TSO support in vioif driver

For further details, please see

Any problems or questions, please get in touch.

Flight Planning My Cruise Power Josef "Jeff" Sipek

When I was working on my private pilot certificate, there was one thing that was never satisfactorily explained to me: how to select the “right” line of the cruise performance table in the POH. Now that I’m a few years older and wiser, I thought I’d write up an explanation for those who, like me six years ago, aren’t getting a good enough answer from their CFIs.

I did my training in a Cessna 172SP, and so the table was relatively simple:

Reading it is trivial. Pick your cruise altitude, then pick the RPM that the instructor told you to use for cruising (e.g., 2200). Now, read across to figure out what your true airspeed and fuel flow will be. That is all there is to it.

When I got checked out in the club’s 182T, things got more confusing. The table itself got split across multiple pages of the POH because of the addition of a new variable: manifold pressure (MP).

The table works much the same way as before. First, select the table based on which altitude you’ll be cruising at, then pick the RPM and manifold pressure, and read across to find the true airspeed and fuel flow.

On the surface (bad pun intended), this seems like a reasonable explanation. But if you look closely, there are multiple combinations of RPM and MP which give you the same performance. For example, in the above table both 2200/21” and 2400/20” give more or less the same performance. When I asked how to choose between them, all I got was a reminder to “keep the MP at or below the RPM.” It was thoroughly unsatisfying. So, I stuck with something simple like 2300/23”.

Fast forward to today. I fly a fixed gear Cessna Cardinal (177B). Its manual contains a table much like the one above for a 182. Here is a sample for 4000’:

As before, I started with something simple like 2300/23”, but eventually I had a moment of clarity. When flying the 172 and 182, I paid for Hobbs time. In other words, it was in my best interest to cruise as fast as possible without much regard for which exact RPM/MP combination I used (all within club and manufacturer limitations, of course).

My bill for the Cardinal is different—it is based on tach time. This means that the lower the RPM, the slower I’m spending money. So, like any other optimization problem, I want to find the right spot where my bill, my cruise speed, and my fuel flow (and therefore endurance) are all acceptable.

If the tach timer is calibrated to run at full speed at 2700 RPM, running the engine at only 2300 equates to 85% while using 2400 equates to 88.9%.

So, say I’m flying for two hours. If I use 2400 RPM, I’ll be paying for 1.78 hours. On the other hand, if I use 2300 RPM at the same power output, I’ll be paying for 1.70 hours. Not a big difference, but after about 27 hours at 2300 instead of 2400, I would have saved a full hour of tach time.
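The arithmetic above can be sketched in a few lines of Python. This is only a sketch: it assumes, as the text does, a tach timer that advances linearly with RPM and runs at full speed at 2700 RPM, and the function names are made up for illustration.

```python
# Sketch of the tach-time arithmetic, assuming the tach timer advances
# linearly with RPM and runs at full speed at 2700 RPM.

TACH_REFERENCE_RPM = 2700  # assumption taken from the text, not from a manual


def tach_fraction(rpm, reference_rpm=TACH_REFERENCE_RPM):
    """Fraction of a tach hour accumulated per clock hour at the given RPM."""
    return rpm / reference_rpm


def tach_hours(flight_hours, rpm, reference_rpm=TACH_REFERENCE_RPM):
    """Tach hours accumulated over flight_hours at a constant RPM."""
    return flight_hours * tach_fraction(rpm, reference_rpm)


if __name__ == "__main__":
    for rpm in (2300, 2400):
        print("%d RPM: %.1f%% of clock time, %.2f tach hours for a 2 hour flight"
              % (rpm, 100 * tach_fraction(rpm), tach_hours(2, rpm)))
```

A real tach timer may not be perfectly linear, so treat the results as estimates.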

I don’t yet have enough data to verify these figures, but collecting it is on my todo list.

While composing this post, I happened to find an article by Mike Busch about why lower RPM is better. He makes a number of compelling points—reduced noise, better propeller efficiency, and fewer revolutions the engine has to make (which should improve the engine’s lifetime and therefore the overall cost). I have to admit that Mike’s points seem more compelling than the small savings I’ve calculated above.

OmniOS Community Edition r151032ab, r151030bb OmniOS Community Edition

OmniOS weekly releases for w/c 11th of May 2020 are now available.

This update requires a reboot

Security Fixes

  • Fix for a kernel panic when running an NFS client within an lx-branded zone

Other Changes

  • w and whodo produced error messages about processes in non-global zones

  • onu to illumos-gate now works with installed zones

  • pkglint now detects duplicated pkg attributes in legacy actions

For further details, please see

Any problems or questions, please get in touch.

A Simple Pibell Staring at the C

With all this free time I finally got around to installing a doorbell at home. I had no interest in Ring or the like: what I really wanted was a simple push doorbell that fit the (Victorian) house but would also somehow notify me if I was downstairs…

There are several documented projects on splicing in a Raspberry Pi into existing powered doorbell systems, but that wasn’t what I wanted either.

Instead, the doorbell is a simple contact switch feeding into the Pi’s GPIO pins. It’s extremely simple, but I couldn’t find a step-by-step guide, so this is the write-up I could have done with reading.

I bought the Pi, a case, a power supply, an SD card, and a USB speaker:

  • Raspberry Pi 3 A+
  • Pibow Coupé case
  • Pi power supply
  • NOOBS pre-installed SD Card
  • USB speaker

And the doorbell itself plus wiring:

  • Brass push doorbell
  • Bell wire
  • Crimping pins
  • Crimp Housing

I bought a pre-installed Raspbian SD card as I don’t have an SD card caddy. After some basic configuration (which required HDMI over to a monitor) I started playing with how to set up the Pi.

Of course the Pi is absurdly over-powered for this purpose, but I wanted something simple to play with. And anyway, it’s running Pihole too.

The wiring itself is simple: bell wire over through a hole in the door frame to the back of the doorbell (which is a simple contact push). The other ends of the wires are connected to the Pi’s GPIO pin 18 and ground. The pin is pulled up and we trigger the event when we see a falling edge.

Actually connecting the wires was a bit fiddly: the bell wire is too thin for the 0.1” connector, and lacking a proper crimping tool I had to bodge it with needle-nose pliers. But once in the pins the housing connection is solid enough.

At first I tried to connect it to Alexa but soon gave up on that idea. There’s no way to “announce” via any API, and it kept disconnecting when used as a Bluetooth speaker. And Alexa has that infuriating “Now playing from…” thing you can’t turn off as well.

During fiddling with this I removed PulseAudio from the Pi as a dead loss.

Nor could I use an Anker Soundcore as a Bluetooth speaker: the stupid thing has some sleep mode that means it misses off the first 3 seconds or so of whatever’s playing.

Instead I have the crappy USB speaker above. It’s not great but is enough to be heard from outside and inside.

Aside from playing whatever through the speaker, the bell emails me in case I can’t hear it. Here’s the somewhat crappy script it’s running:

#!/usr/bin/python -u

# The Pi is wired up such that pin 18 goes through the switch to ground.
# The on-pin pull-up resistor is enabled (so .input() is normally True).
# When the circuit completes, it goes to ground and hence we get a
# falling edge and .input() becomes False.
# I get the occasional phantom still so we wait for settle_time before
# thinking it's real.

from email.mime.text import MIMEText
from subprocess import Popen, PIPE
from datetime import datetime
import subprocess
import RPi.GPIO as GPIO
import signal
import time
import os

# Assumption: "pin 18" is the Broadcom (BCM) GPIO number, not the
# physical header pin; use GPIO.BOARD instead if it is the latter.
GPIO.setmode(GPIO.BCM)
GPIO.setup(18, GPIO.IN, pull_up_down=GPIO.PUD_UP)

# in seconds
settle_time = 0.1
bounce_time = 1

def notify():
    print('notifying at %s' % time.time())

    msg = MIMEText("At %s" % datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
    msg["From"] = "doorbell <>"
    msg["To"] = "John Levon <>"
    msg["Cc"] = "John Levon <>"
    msg["Subject"] = "Someone is ringing the doorbell"

    # Hand the message to sendmail on its standard input.
    p = Popen(["/usr/sbin/sendmail", "-f", "", "-t", "-oi"], stdin=PIPE)
    p.communicate(msg.as_string().encode())

    # Keep ringing until the button has been released.
    while True:
        os.system('aplay -D plughw:1,0 doorbell.wav')
        input_state = GPIO.input(18)
        if input_state:
            break

def settle():
    global settle_time
    # Wait, then re-read the pin: a phantom edge will have gone back high.
    time.sleep(settle_time)
    input_state = GPIO.input(18)
    print('input state now %s' % input_state)
    return not input_state

def falling_edge(channel):
    input_state = GPIO.input(18)
    print('got falling edge, input_state %s' % input_state)
    if settle():
        notify()

GPIO.add_event_detect(18, GPIO.FALLING, callback=falling_edge, bouncetime=(bounce_time * 1000))

# The callback runs on a background thread; sleep here until a signal arrives.
while True:
    signal.pause()



2020-05-06 Josef "Jeff" Sipek

OpenMCT — While I’m not a fan of web-based UIs, this is a rather neat “dashboard” framework by NASA.

Wideband spectrum received in JO32KF — Over 5 years of HF spectrum waterfall in Enschede, NL.

10 Most(ly dead) Influential Programming Languages

PACELC theorem — An extension of the CAP theorem.

Learn Rust the Dangerous Way — Finally a Rust tutorial that speaks to people comfortable in C.

Interferometry and Synthesis in Radio Astronomy — An open access book.

Aviation Formulary — Great circle math applied to various aviation problems for those too lazy to derive the formulas themselves.

Papírová platidla Československa 1918-1993, České republiky a Slovenské republiky 1993-2016 — Complete list of all bank notes used in Czechoslovakia, Czech Republic, and Slovak Republic.

NOAA GOES Image Viewer — GOES weather satellite imagery.

OpenIndiana Hipster 2020.04 is here openindiana

We have released a new OpenIndiana Hipster snapshot 2020.04. The noticeable changes:

  • All remaining OI-specific applications have been ported from Python 2.7 to 3.5, including Caiman (slim_source) installer.
  • Installation images now don’t ship Python 2.7, however some software can still depend on it.
  • GCC 7 is used as the main system compiler now.
  • Libreoffice 6.4 was added.
  • PKG was updated to use rapidjson instead of simplejson for json processing which reduced memory consumption on operations with large package catalogues.
  • A lot of packages were updated.

More information can be found in the 2020.04 release notes, and new media can be downloaded from

OmniOS Community Edition r151022ey, r151032z, r151030az OmniOS Community Edition

OmniOS weekly releases for w/c 27th of April 2020 are now available.

This update requires a reboot for r151030 and r151032.

  • For all supported OmniOS releases, git has been updated to fix two vulnerabilities (CVE-2020-5260 and CVE-2020-11008)

For r151030 and r151032 additionally:

  • OpenSSL updated to 1.1.1g

  • Fixes for a buffer overflow in the w and whodo commands

  • Python updated to 2.7.18, the last python 2.7 release

  • Boot hang caused by x2apic probe using incorrect local apic id

  • lx futex called with NULL timeout causes panic

For r151030 only:

  • Fix zpool history unbounded memory usage

For further details, please see

Any problems or questions, please get in touch.

Building FreeBSD Binary Packages Josef "Jeff" Sipek

On my laptop, I use the binary packages provided by FreeBSD ports. Sometimes, however, I want to rebuild a package because I want to change an option (for example, recently I wanted to set DEBUG=on for mutt).

While this is very easy, for whatever reason I can never find a doc with a concise set of steps to accomplish it.

So, for the next time I need to do this:

# portsnap fetch
# portsnap update
# cd /usr/ports/some/thing
# make showconfig
# make rmconfig   # to reset config, if needed
# make clean      # as needed
# make package
# pkg install work/pkg/*.txz

That’s all there is to it.

Unleashed: The Birth (and Death) of an Operating System Fork Josef "Jeff" Sipek

I realize that I haven’t mentioned Unleashed on my blahg before. The major reason for it was that whenever I thought of doing that, I ended up working on Unleashed instead. Oops! :)

For those that don’t know, Unleashed was an operating system fork created by me and Lauri in 2016. A fork that we’ve maintained till now. I said was, because earlier this month, we made the last release. Instead of trying to summarize everything again, let me just quote the relevant parts of the announcement email I sent out on April 4th:

This is the fifth and final release of Unleashed—an operating system fork of illumos. For more information about Unleashed itself and the download links, see our website [1].

That is right, we are pulling the plug on this project. What began as a hobby project for two developers never grew much beyond being a hobby project for two developers. But after nearly 4 years of work, we proved that the illumos code base can be cleaned up significantly, its APIs modernized, and the user experience improved in general.

While we’ve made too many changes [2] to list them all here, I’d like to highlight what I think are some of our major accomplishments:

  • shrinking the codebase by about 25% (~5.6 million lines of code) even though we imported major components (e.g., libressl, openbsd ksh)
  • reducing build times drastically (from ~40 minutes to ~16 minutes on a 2012-era 4 core CPU)
  • changing uname to Unleashed and amd64 (from SunOS 5.11 i86pc)

In addition to the projects we finished, we were well on the way to several other improvements that we simply haven’t gotten around to completing. Some of the more notable ones include:

  • page cache rewrite (~3/5 done)
  • modernizing the build system with bmake / removing dmake (~1/5 done)
  • everything 64-bit only (~4/5 done)

All that we’ve accomplished is just the tip of the iceberg of what remains to be done. Unfortunately for Unleashed, we both decided that we wanted to spend our free time on different projects.

I know I’m biased, but I think we’ve made some good changes in Unleashed and I’d love to see at least some of them make their way into illumos-gate or one of the illumos distros. So, I hope that someone will have the time and interest to integrate some of these changes.

Finally, I’d like to encourage anyone who may be interested not to be afraid of forking an open source project. Even if it doesn’t work out, it is extremely educational.

Happy hacking!



What Unleashed was or wasn’t is described on the website, the README, the features file, and the mailing list archives. The history leading up to Unleashed is essentially undocumented. I am dedicating the rest of this post to that purpose.

Jeffix (2015–2016)

Before Unleashed there was Jeffix.

I made an extremely indirect mention of Jeffix on Twitter in 2015, and one direct mention in a past blahg post in 2016.

So, what exactly was Jeffix? Was it a distro? Not quite. It was more of an overlay for OpenIndiana Hipster. You weren’t able to install Jeffix. Instead, you had to install Hipster and then add the Jeffix package repository (at a higher priority), followed by an upgrade.

Why make it? At the time I used OpenIndiana (an illumos distro) on my laptop. While that was great for everyday work, being an illumos developer meant that I wanted to test my changes before they made it upstream. For about a year, I was running various pre-review and pre-RTI bits. This meant that the set of improvements I had available changed over time. Sometimes this got frustrating since I wouldn’t have certain changes just because they didn’t make it through the RTI process yet and I was testing a different change.

So, in October 2015, I decided to change that. I made a repo with an assortment of changes. Some are mine while others were authored by other developers in the community. I called this modified illumos Jeffix. Very creative, I know.

I kept this up until May 2016. Today, I have no idea why I stopped building it then.

To Fork or Not To Fork

For the three years leading up to Unleashed, I spent a considerable amount of time thinking about what would make illumos better—not just on the technical side but also the community side.

It has always been relatively easy to see that the illumos community was not (and still is not) very big. By itself this isn’t a problem, but it becomes one when you combine it with the other issues in the community.

The biggest problem is the lack of clear vision for where the project should go.

Over the years, the only idea that I’ve seen come up consistently could be summarized as “don’t break compatibility.” What exactly does that mean? Well, ask 10 people and you’ll get 12 different opinions.

Provably hostile

There have been several times where I tried to clean up an interface in the illumos kernel, and the review feedback amounted to “doesn’t this break on $some-sparc-hardware running kernel modules from Solaris?”

In one instance (in August 2015), when I tried to remove an ancient specialized kernel module binary compatibility workaround that Sun added in Solaris 9 (released in 2002), I was asked about what turned out to be a completely ridiculous situation—you’d need to try to load a kernel module built against Solaris 8 that used this rather specialized API, an API that has been changed twice in incompatible ways since it was added (once in 2005 by Sun, and once in April 2015 by Joyent).

Unsurprisingly, that “feedback” completely derailed a trivial cleanup. To this day, illumos has that ugly workaround for Solaris 8 that does absolutely nothing but confuse anyone that looks at that code. (Of course, this cleanup was one of the first commits in Unleashed.)

While this was a relatively simple (albeit ridiculous) case—I just had to find a copy of old Solaris headers to see when things changed—it nicely demonstrates the “before we take your change, please prove that it doesn’t break anything for anyone on any system” approach the community takes.

Note that this is well beyond the typical “please do due diligence” where the reviewers tend to help out with the reasoning or even testing. The approach the illumos community takes feels more like a malicious “let’s throw every imaginable thought at the contributor, maybe some of them stick.” Needless to say, this is a huge motivation killer that makes contributors leave—something that a small-to-begin-with community cannot afford.


In the past there have been a number of people in the illumos community that were, in my opinion, outright toxic for the project. I’m happy to say that a number of them have left the community. Good riddance.

What do I mean by toxic?

Well, for instance, there was the time in 2014 when someone decided to contribute to a thread about removing SunOS 4.x compatibility code (that is binary compatibility with an OS whose last release was in 1994) with only one sentence: “Removing stuff adds no value.”

Elsewhere in the same thread, another person chimed in with his typical verbiage that could be summarized as “why don’t you do something productive with your time instead, and work on issues that I think are important.” While his list of projects was valid, being condescending to anyone willing to spend their free time to help out your project or telling them that they’re wasting their time unless they work on something that scratches your or your employer’s itch is…well…stupid. Yet, this has happened many times on the mailing list and on IRC over the years.

Both of these examples come from the same thread simply because I happened to stumble across it while looking for another email, but rest assured that there have been plenty of other instances of this sort of toxic behavior.

The Peanut Gallery

Every project with enough people on the mailing list ends up with some kind of a peanut gallery. The one in illumos is especially bad. Every time a removal of something antique is mentioned, people that haven’t contributed a single line of code would come out of the woodwork with various forms of “I use it” or even a hypothetical “someone could be using it to do $foo”.

It takes a decent amount of effort to deal with those. For new contributors it is even worse as they have no idea if the feedback is coming from someone that has spent a lot of time developing the project (and should be taken seriously) or if it is coming from an obnoxiously loud user or even a troll (and should be ignored).


All this combined results in a potent mix that drives contributors away. Over the years, I’ve seen people come, put in reasonable effort to attempt to contribute, hit this wall of insanity, and quietly leave.

As far as I can tell, some of the insanity is better now—many of the toxic people left, some of the peanut gallery members started to contribute patches to remove dead code, etc.—but a lot of problems still remain (e.g., changes still seem to get stuck in RTI).

So, why did I write so many negative things about the illumos community? Well it documents the motivation for Unleashed. :) Aside from that, I think there is some good code in illumos that should live on but it can only do that if there is a community taking care of it—a community that can survive long term. Maybe this post will help with opening some eyes.


In July 2016, I visited Helsinki for a job interview at Dovecot. Before the visit, I contacted Lauri to see if he had any suggestions for what to see in Helsinki. In addition to a variety of sightseeing ideas, he suggested that we meet up for a beer.

At this point, the story continues as Lauri described it on the mailing list — while we were lamenting, Jouni suggested we fork it. It was obvious that he was (at least partially) joking, but having considered forking, it resonated with me. After I got home, I thought about it for a while but ultimately decided that it was a good enough idea.

That’s really all there is to the beginning of Unleashed itself. While the decision to fork was definitely instigated by Jouni, the thought was certainly on my mind for some time before that.


With Unleashed over, what am I going to do next?

I have plenty of fun projects to work on—ranging from assorted DSP attempts to file system experiments. They’ll get developed on my FreeBSD laptop. And no, I will not resume contributing to illumos. Maybe I’m older and wiser (and probably grumpier), but I want to spend my time working on code that is appreciated.

With all that said, I made some friends in illumos and I will continue chatting with them about anything and everything.

Nekoware build system for IRIX Minimal Solaris

Besides Solaris, I love IRIX and, as the owner of an Octane, I'd like to have modern versions of at least the basic utilities. Almost all of the existing packages were installed a long time ago, but time marches on, so why not build new versions yourself? Here you can find the initial scripts that allow you to build nekoware tardists. I have focused on "first-aid" stuff so far, like bash, awk, sed and grep, but I am going to add more components later. If anyone else wants to participate, just ping me with a pull request.

URLs in gnome-terminal and mutt Staring at the C

For some time now, gnome-terminal amongst others has had a heuristic that guesses at URLs, and allows you to control-click to open them directly. However, this was easily foxed by applications doing line-wrapping instead of letting the terminal do so.

A few years ago, gnome-terminal gained ANSI escape sequences for URL highlighting. It requires applications to output the necessary escape codes, but works far more reliably.

Annoyingly, you still need to control-click, but that is easily fixed. I rebuilt Ubuntu’s build with this change like so:

sudo apt build-dep gnome-terminal
apt source gnome-terminal
cd gnome-terminal-3.28.2
dpkg-buildpackage --no-sign -b
sudo dpkg -i ../gnome-terminal_3.28.2-1ubuntu1~18.04.1_amd64.deb

This would be most useful if mutt supported the sequences, but unfortunately its built-in pager is stuck behind libncurses and can’t easily get out from under it. Using an external pager with mutt is not great either, as you lose all the integration.

There’s also no support in w3m. Even though it thankfully avoids libncurses, it’s a bit of a pain to implement, as instead of just needing to track individual bits for bold on/off or whatever, there’s a whole URL target that needs mapping onto the (re)drawn screen lines.

So instead there’s the somewhat ersatz:

$ grep email-html ~/.muttrc
macro pager,index,attach k "<pipe-message>email-html<Enter>"


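$ cat email-html
#!/bin/sh
# Extract the HTML part of the piped message, render it with w3m, and
# wrap each URL in hyperlink escape sequences before paging.

dir=$(mktemp -d -p /tmp)

ripmime -i - -d "$dir" --name-by-type

cat "$dir"/text-html* | w3m -no-mouse -o display_link \
    -o display_link_number -T text/html | \
    sed 's!https*://.*!\x1B]8;;&\x1B\\&\x1B]8;;\x1B\\!g' | less -rX

rm -rf "$dir"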

It’ll have to do.
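For reference, the sed expression above wraps each URL in the OSC 8 hyperlink sequence: ESC ] 8 ; ; URL ESC \ opens the link, the visible text follows, and ESC ] 8 ; ; ESC \ closes it. A rough Python equivalent, purely as a sketch (the function names are invented here, and the URL pattern is deliberately as naive as the sed one):

```python
import re

ESC = "\x1b"
OSC8 = ESC + "]8;;"  # introduces a hyperlink; the target URL follows
ST = ESC + "\\"      # string terminator that ends each OSC sequence

# Same loose pattern as the sed expression: "http", any number of "s",
# "://", then everything up to the end of the line.
URL_RE = re.compile(r"https*://.*")


def link(url, text=None):
    """Wrap text (by default the URL itself) in an OSC 8 hyperlink."""
    if text is None:
        text = url
    return OSC8 + url + ST + text + OSC8 + ST


def linkify(line):
    """Replace every URL match in line with its hyperlinked form."""
    return URL_RE.sub(lambda m: link(m.group(0)), line)


if __name__ == "__main__":
    # In a terminal that supports OSC 8 the URL becomes clickable.
    print(linkify("[1] https://example.org/"))
```

Like the sed version, this treats everything from the scheme to the end of the line as the URL, which is good enough for w3m's numbered link list but not for URLs embedded in running text.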

Migrated Blog Staring at the C

With my Coronavirus-related CFT I finally got around to migrating off Blogger. I lost comments, but I think I’ll probably keep it like that: there’s Twitter, and Blogger’s anti-spam facilities were pretty much hopeless.

My first attempt used jekyll. I suppose this works best with GitHub Pages, because I gave up on it pretty quickly: various irritating Ruby version incompatibilities, random tracebacks from modules, import not working well at all, etc.

Next stop was hugo which was much, much nicer. Importing was still a little tedious, though: there’s not really any integration, so you need third-party tools such as blog2md, which is the one I used to import the Blogger content.

The base theme I ended up using was Strange Case. Having battled with impenetrable WordPress themes in the past, it was refreshing to be able to modify something so eminently hackable, and being based on the familiar bootstrap was a big plus as well.

It took me a while to fix up a few things (like making Recent Posts show only posts, instead of all pages), and getting used to the way hugo searches the layout files took a bit of time, but it was all in all a good experience.

It seemed a little tricky to create all the necessary 301 Redirect directives for the old Blogger-style permalinks, so I crapped out and just manually added a few that I know people might actually want to find via Google.

I spent far too long trying to find an Atom feed importer for my old Sun blog. Seems like there isn’t a general one, so I threw roller2hugo together instead, which works just enough.

Old web content Staring at the C

I think it's important that everyone should endeavour to maintain existing web content, even if it's not currently relevant.

Enabling xVM on OpenSolaris Staring at the C

Another significant usability improvement that landed in build 126 is Gary and Bill's work on enabling Xen. Now, running xVM should be as simple as:

# pkg install xvm-gui
# echo 'set zfs:zfs_arc_max = 0x10000000' >>/etc/system # yes, you still need this, sadly
# svcadm enable -r milestone/xvm
# reboot

There's also a new Visual Panel for doing this if you prefer a graphical method. More in the flag day message.


Dry-run migration Staring at the C

As part of our ongoing work on improving the ease of use of xVM, the newly available build 126 of OpenSolaris has my putback for:

6878952 Would like dry-run migration

This feature is useful for doing a simple check as to whether a guest can successfully migrate to another dom0 host. For example, domu-221 here is using a disk path that doesn't exist on the remote host hiss:

# virsh migrate --dryrun domu-221 xen:/// hiss    
error: POST operation failed: xend_post: error from xen daemon:
(xend.err 'Remote server error: Access to vbd:768 failed: error: "/iscsi/nevada-hvm" is not a valid block device.')

This works both with running and shutdown guests. Currently, the checks are fairly limited: are disks of the same path available on the remote host (note there is no checking of GUIDs or whatever to verify they really are the same piece of shared storage); is there enough memory on the remote host; and is the remote host the same CPU vendor. We expect these checks to improve both in scope and in reliability in the future.
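To make the three checks concrete, here is a purely illustrative Python sketch. The Host and Guest types, their fields, and the dry_run_checks function are hypothetical and invented for this example; they do not reflect how xend actually implements the feature.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Host:
    """Hypothetical description of a dom0 host (not a real xend type)."""
    name: str
    cpu_vendor: str
    free_memory_mb: int
    block_devices: Set[str] = field(default_factory=set)


@dataclass
class Guest:
    """Hypothetical description of a guest domain."""
    name: str
    memory_mb: int
    disk_paths: List[str] = field(default_factory=list)


def dry_run_checks(guest: Guest, source: Host, target: Host) -> List[str]:
    """Return a list of problems; an empty list means no check failed."""
    problems = []
    # 1. Disks: only the path is compared, so this cannot prove that both
    #    hosts see the same piece of shared storage.
    for path in guest.disk_paths:
        if path not in target.block_devices:
            problems.append('"%s" is not a valid block device on %s'
                            % (path, target.name))
    # 2. Memory: the target must have room for the guest.
    if guest.memory_mb > target.free_memory_mb:
        problems.append("%s has only %d MB free, %s needs %d MB"
                        % (target.name, target.free_memory_mb,
                           guest.name, guest.memory_mb))
    # 3. CPU: source and target must come from the same vendor.
    if source.cpu_vendor != target.cpu_vendor:
        problems.append("CPU vendor mismatch: %s vs %s"
                        % (source.cpu_vendor, target.cpu_vendor))
    return problems


if __name__ == "__main__":
    local = Host("local", "GenuineIntel", 8192, {"/iscsi/nevada-hvm"})
    hiss = Host("hiss", "GenuineIntel", 4096, set())
    guest = Guest("domu-221", 1024, ["/iscsi/nevada-hvm"])
    for problem in dry_run_checks(guest, local, hiss):
        print("error:", problem)
```

Collecting every failure rather than stopping at the first is a design choice of this sketch; the real command may report only one error.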


xVM and COMSTAR iSCSI Staring at the C

I recently had cause to try out COMSTAR for the first time, and I thought I'd write up the steps needed. Unfortunately, it's considerably more complex than the fall-over-easy shareiscsi=on ZFS feature.

Configuring the COMSTAR server

First install the storage-server packages and enable the services:

# svcadm enable -r stmf
# svcadm enable -r iscsi/target

We want to create a target group for each of our xVM guests, each of which will have one LUN in it. After creating the LUN, we define a "view" that allows that LUN to be visible for that target group:

# stmfadm create-tg domu-226
# zfs create -V 15G export/domu-226
# stmfadm create-lu /dev/zvol/rdsk/export/domu-226
Logical unit created: 600144F0C73ABF0F00004AD75DF2001A
# stmfadm add-view -t domu-226 600144F0C73ABF0F00004AD75DF2001A

Now we need to create the iSCSI target for this target group, that has our single LUN in it.

# itadm create-target -l domu-226
Target successfully created

Here (finally) is our iSCSI Alias we can use in the clients. But we're not done yet. By default, this target will be able to see all LUNs not in a target group. So we need to make it a member of our domu-226 target group:

# stmfadm add-tg-member -g domu-226
# stmfadm list-tg -v
Target Group: domu-226

Configuring the iSCSI initiator (client)

We do this in the usual manner:

# svcadm enable -r svc:/network/iscsi/initiator:default
# iscsiadm add discovery-address
# iscsiadm modify discovery --sendtargets enable

Installing a guest onto the LUN

We went through the above gymnastics so we can have a human-readable Alias for each of the domu's root LUNs. So now we can do:

# virt-install --paravirt --name domu-226 --ram 1024 --os-type solaris --os-variant opensolaris \
  --location nfs: --network bridge,mac=00:14:4f:0f:b5:3e \
  --disk path=/alias/domu-226,driver=phy,subdriver=iscsi \


OpenSolaris 2009.06 guest domain on a Linux dom0 Staring at the C

Just a quick note: you can follow the instructions I provided for the 2008.11 release, with one change. On a 64-bit machine, replace any instances of /boot/x86.microroot with /boot/amd64/x86.microroot. As of 2009.06, the boot archive is split into 32-bit and 64-bit variants. If you get a message like this:

krtld: failed to open '/platform/i86xpv/kernel/amd64/unix'

Then you've probably given the wrong combination of unix and microroot.

By the way, in my previous entry, I mentioned we were working on upstreaming our virt-install changes. During the Xen 3.3 work (more on which soon), I updated to the latest versions and got the needed parts into the upstream version. We've still some ZFS changes to push, but if you're running a recent enough version of Xen on Linux, you may well be able to use virt-install and skip all this horrible hacking!

Begone, trailing spaces! Staring at the C

I read my work email with mutt on a Solaris 9 box. For a while it's been irritating me that when you attempt to cut and paste, it will include trailing spaces on each line instead of stopping at the last "real" character. Some Googling suggested this was because of the lack of the BCE attribute in my xterm-color terminfo definition. Rather than learn how to compile terminfo entries (I've done it before, but I don't want to learn again!), I took the lazier approach: copy /usr/share/terminfo/s/screen-256color-bce from a Fedora 8 box into /home/johnlev/.terminfo/s/, and start mutt with TERM and TERMINFO set appropriately. Now I can cut and paste sanely again.


OpenSolaris 2008.11 as a dom0 Staring at the C

UPDATE: the canonical location for this information is now here - please check there, as it will be updated as necessary, unlike this blog entry.

As a final part to my entries on OpenSolaris and Xen, let's go through the steps needed to turn OpenSolaris into a dom0. Thanks to Trevor O for documenting this for 2008.05. And as before, expect this process to get much, much, easier soon!

I'm going to do the work in a separate BE, so if we mess up, we shouldn't have broken anything. So, first we create our BE:

$ pfexec beadm create -a -d xvm xvm

Next, let's install the packages. If you've updated to the development version, a simple pkg install xvm-gui will work, but let's assume you haven't:

$ pfexec beadm mount xvm /tmp/xvm-be
$ pfexec pkg -R /tmp/xvm-be install SUNWvirt-manager SUNWxvm SUNWvdisk SUNWvncviewer
$ pfexec beadm umount xvm

Now we need to actually reboot into Xen. Unfortunately beadm is not yet aware of how to do this, so we'll have to hack it up. We're going to run some awk over the menu.lst file which controls grub:

$ awk '
/^title/ { xvm=0; }
/^title.xvm$/ { xvm=1; }
/^(splashimage|foreground|background)/ {
    if (xvm == 1) next
}
/^kernel\$/ {
    if (xvm == 1) {
       print("kernel\$ /boot/\$ISADIR/xen.gz")
       sub("^kernel\\$", "module$")
       gsub("console=graphics", "console=text")
       gsub("i86pc", "i86xpv")
       $2=$2 " " $2
    }
}
{ print }' /rpool/boot/grub/menu.lst >/var/tmp/menu.lst.xvm

Let's check that the awk script (my apologies) worked properly:

$ tail /var/tmp/menu.lst.xvm 
#============ End of LIBBE entry =============
title xvm
findroot (pool_rpool,0,a)
bootfs rpool/ROOT/xvm
kernel$ /boot/$ISADIR/xen.gz
module$ /platform/i86xpv/kernel/$ISADIR/unix /platform/i86xpv/kernel/$ISADIR/unix -B $ZFS-BOOTFS,console=text
module$ /platform/i86pc/$ISADIR/boot_archive
#============ End of LIBBE entry =============
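If you'd rather not trust your eyes, the same check can be scripted. This is a minimal sketch: the function name check_xvm_menu is hypothetical, and it only looks for the two lines the awk script is supposed to produce, not for a fully valid grub entry.

```shell
# Succeed only if the generated menu has a xen.gz kernel$ line and a
# module$ line pointing at the i86xpv unix.
check_xvm_menu() {
    grep -q '^kernel\$ /boot/\$ISADIR/xen\.gz$' "$1" &&
    grep -q '^module\$ /platform/i86xpv/kernel/\$ISADIR/unix ' "$1"
}
```

For example, check_xvm_menu /var/tmp/menu.lst.xvm && echo OK before moving the file into place.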

Looks good. We'll move it into place, and reboot:

$ pfexec cp /rpool/boot/grub/menu.lst /rpool/boot/grub/menu.lst.saved
$ pfexec mv /var/tmp/menu.lst.xvm /rpool/boot/grub/menu.lst
$ pfexec reboot

This should boot you into xVM. If everything worked OK, let's enable the services:

$ svcadm enable -r xvm/virtd ; svcadm enable -r xvm/domains

At this point, you should be able to merrily go ahead and install domains!

Update: Todd Clayton pointed out the issue I've filed here: SUNWxvm needs to depend on SUNWvdisk. I've updated the instructions above with the workaround.

Update update: Rich Burridge has fixed it. Nice!


OpenSolaris 2008.11 guest domain on a Linux dom0 Staring at the C

My previous blog post described how to install OpenSolaris 2008.11 on a Solaris dom0 under Xen. This also works with a Linux dom0. However, since upstream is missing some of our dom0 fixes, it's unfortunately more complicated. In particular, we can't use virt-install, as it doesn't know about Solaris ISOs, and later on, we can't use pygrub to boot from ZFS, since it doesn't know how to read such a filesystem. Bear with me, this gets a little awkward.

This example is using a 32-bit Fedora 8 installation. Your mileage is likely to vary if you're using a different version, or another Linux distribution. First, some of the configuration parameters you might want to change:

export name="domu-224"
export iso="/isos/osol-2008.11.iso"
export dompath="/export/guests/2008.11"
export rootdisk="$dompath/root.img"
export unixfile="/platform/i86xpv/kernel/unix"

If you're on 64-bit Linux, set unixfile="/platform/i86xpv/kernel/amd64/unix" instead. We need to create ourselves a 10GB root disk:

mkdir -p $dompath
dd if=/dev/zero count=1 bs=$((1024 * 1024)) seek=10230 of=$rootdisk
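The dd invocation above writes a single 1MB block at an offset of 10230MB, so the file ends up 10231MB long but consists almost entirely of holes. Here's the same trick as a small function, as a sketch only: the name make_sparse_image is made up, and unlike the command above it seeks one block short so that the file comes out at exactly the requested size.

```shell
# Create a sparse image file.
#   $1 = image path
#   $2 = size in MiB (must be at least 1)
# Only the final 1 MiB block is actually written; everything before it
# is left as a hole on file systems that support sparse files.
make_sparse_image() {
    dd if=/dev/zero of="$1" bs=1048576 count=1 seek=$(( $2 - 1 )) 2>/dev/null
}
```

So make_sparse_image $rootdisk 10240 would give a root disk of exactly 10240MB.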

Now let's use the configuration we need to install OpenSolaris:

cat >/tmp/domain-$name.xml <<EOF
<domain type='xen'>
 <bootloader_args>--kernel=/platform/i86xpv/kernel/unix --ramdisk=/boot/x86.microroot</bootloader_args>
  <interface type='bridge'>
   <source bridge='eth0' />
   <!--
       If you have a static DHCP setup, add the domain's MAC address here
       <mac address='00:16:3e:1b:e8:18' />
   -->
  </interface>
  <disk type='file' device='cdrom'>
   <driver name='file' />
   <source file='$iso' />
   <target dev='xvdc:cdrom' />
  </disk>
  <disk type='file' device='disk'>
   <driver name='file' />
   <source file='$rootdisk' />
   <target dev='xvda' />
  </disk>
</domain>
EOF

And start up the domain:

virsh create /tmp/domain-$name.xml
virsh console $name

Now you're dropped into the domain's console, and you can use the VNC trick I described to do the install. Answer the questions, wait for the domain to DHCP, then:

domid=`virsh domid $name`
ip=`/usr/bin/xenstore-read /local/domain/$domid/ipaddr/0`
port=`/usr/bin/xenstore-read /local/domain/$domid/guest/vnc/port`
/usr/bin/xenstore-read /local/domain/$domid/guest/vnc/passwd
vncviewer $ip:$port

At this point, you can proceed with the installation as normal. Before you reboot though, we need to do some tricks, due to the lack of ZFS support mentioned above. Whilst still in the live CD environment, bring up a terminal. We need to copy the new kernel and ramdisk to the Linux dom0. We can automate this via a handy script:



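#!/bin/sh
# Save as /tmp/update_dom0 in the live CD environment and make it executable.
# Arguments: <dom0 host> <guest directory on dom0> [amd64]
dom0=$1
dompath=$2
unixfile=/platform/i86xpv/kernel/$3/unix

root=`pfexec beadm list -H |  grep ';N*R;' | cut -d \; -f 1`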
mkdir /tmp/root
pfexec beadm mount $root /tmp/root 2>/dev/null
mount=`pfexec beadm list -H $root | cut -d \; -f 4`
pfexec bootadm update-archive -R $mount
scp $mount/$unixfile root@$dom0:$dompath/kernel.$root
scp $mount/platform/i86pc/$3/boot_archive root@$dom0:$dompath/ramdisk.$root
pfexec beadm umount $root 2>/dev/null
echo "Kernel and ramdisk for $root copied to $dom0:$dompath"
echo "Kernel cmdline should be:"
echo "$unixfile -B zfs-bootfs=rpool/ROOT/$root,bootpath=/xpvd/xdf@51712:a"

For example, we might do:

/tmp/update_dom0 linux-dom0 /export/guests/2008.11
or on 64-bit:
/tmp/update_dom0 linux-dom0 /export/guests/2008.11 amd64
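The least obvious line in the script is the one that works out which boot environment to copy. It relies on beadm list -H printing one BE per line as semicolon-separated fields, with the name first and the active flags third, where R means "active on reboot" (possibly alongside N for "active now"). Here is that pipeline pulled out on its own so it can be tried against canned input; the function name is hypothetical and the sample lines in the comment are made up.

```shell
# Read "beadm list -H" style output on stdin and print the name of the
# boot environment that will be active on the next reboot.
# Example input lines (illustrative only):
#   opensolaris;;N;/;2.1G;static;1228000000
#   opensolaris-1;;R;-;3.0G;static;1229000000
be_active_on_reboot() {
    grep ';N*R;' | cut -d ';' -f 1
}
```

With the two example lines above as input, this picks opensolaris-1, which is the BE whose kernel and ramdisk need to be copied to dom0.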

Now, you can finish the installation by clicking the reboot button. This will shut down the domain, ready to run. But first we need the configuration file for running the domain:

cat >$dompath/$name.xml <<EOF
<domain type='xen'>
  <cmdline>$unixfile -B zfs-bootfs=rpool/ROOT/opensolaris,bootpath=/xpvd/xdf@51712:a</cmdline>
  <interface type='bridge'>
   <source bridge='eth0'/>
  </interface>
  <disk type='file' device='disk'>
   <driver name='file' />
   <source file='$rootdisk' />
   <target dev='xvda' />
  </disk>
</domain>
EOF

virsh define $dompath/$name.xml
virsh start $name
virsh console $name

It should be booting, and you're (finally) done!

Updating the guest

Unfortunately we're not quite out of the woods yet. What we have works fine, but if we update the guest via pkg image-update, we'll need to make changes in dom0 to boot the new boot environment. The update_dom0 script above will do a fine job of copying out the new kernel and ramdisk for the BE that's active on reboot, but you also need to edit the config file. For example, if I wanted to boot into the new BE called opensolaris-1, I'd replace these lines:

<cmdline>$unixfile -B zfs-bootfs=rpool/ROOT/opensolaris,bootpath=/xpvd/xdf@51712:a</cmdline>

with these:

<cmdline>$unixfile -B zfs-bootfs=rpool/ROOT/opensolaris-1,bootpath=/xpvd/xdf@51712:a</cmdline>

then re-configure the domain (whilst it's shut down) via virsh undefine $name ; virsh define $dompath/$name.xml.
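If you do this often, the edit itself is easy to script. A minimal sketch, assuming the config file contains a zfs-bootfs=rpool/ROOT/... argument as in the examples above; the function name set_bootfs is made up, and it writes the result to standard output rather than editing the file in place.

```shell
# Print the domain config with the boot environment name replaced.
#   $1 = domain XML file
#   $2 = new boot environment name (must not contain "|" or "&")
set_bootfs() {
    sed "s|zfs-bootfs=rpool/ROOT/[^,]*|zfs-bootfs=rpool/ROOT/$2|" "$1"
}
```

For example, set_bootfs $dompath/$name.xml opensolaris-1 >/tmp/$name.xml, then check the result and move it into place before the undefine and define steps.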

Yes, we're aware this is rather over-complicated. We're trying to find the time to send our changes to virt-install upstream, as well as ZFS support. Eventually this will make it much easier to use a Linux dom0.


OpenSolaris 2008.11 as a para-virtual Xen guest Staring at the C

UPDATE: the canonical location for this information is now here - please check there, as it will be updated as necessary, unlike this blog entry.

As well as obviously working with VirtualBox, OpenSolaris can also run as a guest domain under Xen. The installation CD ships with the paravirtual extensions so you can run it as a fully para-virtualized guest. This provides a significant advantage over fully-virtualized guests, or even guests with para-virtual drivers like Solaris 10 Update 6. Of course, if you choose to, you can still run OpenSolaris fully-virtualized (a.k.a. HVM mode), but there's little advantage to doing so.

One slight wrinkle is that Solaris guests don't yet implement the virtual framebuffer that the Xen infrastructure supports. Since OpenSolaris doesn't yet have a text-mode install, this means that to install such a PV guest, we need a way to bring up a graphical console.

With 2008.11, this is considerably easier. Presuming we're running a Solaris dom0 (either Nevada or OpenSolaris, of course), let's start an install of 2008.11:

# zfs create rpool/zvol
# zfs create -V 10G rpool/zvol/domu-220-root
# virt-install --nographics --paravirt --ram 1024 --name domu-220 -f /dev/zvol/dsk/rpool/zvol/domu-220-root -l /isos/osol-2008.11.iso

This will drop you into the console for the guest to ask you the two initial questions. Since they're not really important in this circumstance, you can just choose the defaults. This example presumes that you have a DHCP server set up to give out dynamic addresses. If you only hand out addresses statically based on MAC address, you can also specify the --mac option. As OpenSolaris more-or-less assumes DHCP, it's recommended to set one up.

Now we need a graphical console in order to interact with the OpenSolaris installer. If the guest domain successfully finished booting the live CD, a VNC server should be running. It has recorded the details of this server in XenStore. This is essentially a name/value config database used for communicating between guest domains and the control domain (dom0). We can start a VNC session as follows:

# domid=`virsh domid domu-220`
# ip=`/usr/lib/xen/bin/xenstore-read /local/domain/$domid/ipaddr/0`
# port=`/usr/lib/xen/bin/xenstore-read /local/domain/$domid/guest/vnc/port`
# /usr/lib/xen/bin/xenstore-read /local/domain/$domid/guest/vnc/passwd
# vncviewer $ip:$port

At the VNC password prompt, enter the given password, and this should bring up a VNC session, and you can merrily install away.
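The two XenStore lookups can be wrapped up in a function if you do this a lot. This is a sketch under a couple of assumptions: the function name vnc_target is made up, and XENSTORE_READ is a hypothetical override so that the path to xenstore-read can be changed (on a Linux dom0, say) or stubbed out.

```shell
# Print "ip:port" for the VNC server of the guest with the given domain ID.
# Fails if either value is missing from XenStore.
vnc_target() {
    xs=${XENSTORE_READ:-/usr/lib/xen/bin/xenstore-read}
    ip=`$xs /local/domain/$1/ipaddr/0` || return 1
    port=`$xs /local/domain/$1/guest/vnc/port` || return 1
    [ -n "$ip" ] && [ -n "$port" ] || return 1
    echo "$ip:$port"
}
```

Usage would then be along the lines of vncviewer `vnc_target $domid`. The password still has to be read separately, as above.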


The live CD runs a transient SMF service system/xvm/vnc-config. If it finds itself running on a live CD, it will generate a random VNC password, configure application/x11/x11-server to start Xvnc, and write the values above to XenStore. When application/graphical-login/gdm starts, it will read these service properties and start up the VNC server. The service system/xvm/ipagent tracks the IPv4 address given to the first running interface and writes it to XenStore.

By default, the VNC server is configured not to run post-installation due to security concerns. This can be changed though, as follows:

# svccfg -s x11-server
setprop options/xvm_vnc = "true"

Please remember that VNC is not secure. Since you need elevated privileges to read the VNC password from XenStore, that's sufficiently protected, as long as you always run the VNC viewer locally on the dom0, or via SSH tunnelling or some other secure method.

Note that this works even with a Linux dom0, although you can't yet use virt-install, as the upstream version doesn't yet "know about" OpenSolaris (more on this later).


Building OpenSolaris ISOs Staring at the C

I've recently been figuring out how to build OpenSolaris ISOs (from SVR4 packages). It's surprisingly easy, but at least the IPS part is not well documented, so I thought I'd write up how I do it.

There are three main things you're most likely to want to do: build IPS itself, populate an IPS repository, and build an install ISO based on that repository. First, you'll want a copy of the IPS gate:

hg clone ssh:// pkg-gate

For some of my testing, I wanted to test some changed packages. So I mounted a Nevada DVD on /mnt/, then, using mount -F lofs, replaced some of the package directories with ones I'd built previously with my fixes. This effectively gave me a full Nevada DVD with my fixes in, avoiding the horrors of making one. I then cd pkg-gate, and run something like this:

$ cat build-ips
export WS=$1
export REPO=http://localhost:$2
unset http_proxy || true
set -e
echo "START `date`"
cd $WS/src
make install packages
cd $WS/src/util/distro-import
export NONWOS_PKGS="/net/paradise/export/integrate_dock/nv/nv_osol0811/all \
export WOS_PKGS="/mnt/Solaris_11/Product/"
export PYTHONPATH=$WS/proto/root_i386/usr/lib/python2.4/vendor-packages/
export PATH=$WS/proto/root_i386/usr/bin/:$WS/proto/root_i386/usr/lib:$PATH
nohup pkg.depotd -p $2 -d /var/tmp/$USER/repo &
sleep 5
make -e 99/slim_import
echo "END `date`"
$ ./build-ips `pwd` 10023

In fact, since I was running on an older version of Nevada (89, precisely), I had to stop after the make install and change src/pyOpenSSL-0.7/ to pick up OpenSSL from /usr/sfw:

IncludeDirs =  [ '/usr/sfw/include' ]
LibraryDirs =  [ '/usr/sfw/lib' ]

(If /usr/bin/openssl exists, you don't need this.) So, after this step, which builds the IPS tools (and the SVR4 package for them), the script moves into the "distro-import" directory. This is really a completely different thing from IPS itself, but for convenience it lives in the IPS gate. Its job is to take a set of SVR4 packages (that is, the old Solaris package format) and upload them to a given IPS network repository: in this case, http://localhost:10023.

So, making sure we use the IPS tools we just built, we point a couple of environment variables to the package locations. "WOS" stands for, charmingly, "Wad Of Stuff", and in this context means "packages delivered to Solaris Nevada". There's also some extra packages used for OpenSolaris, listed here as NONWOS_PKGS. I'm not sure where external people can get them from, though.

The core of distro-import is the script, which does the job of transliterating from SVR4-speak into pkgsend(1)-speak. As well as a straight translation, though, a small number of customisations to the existing packages are also made to account for OpenSolaris differences. These are done by dropping the original file contents and picking them up from an ad-hoc SUNWfixes SVR4 package built in the same directory.

Of course, each build has its differences, so they're separated out into sub-directories. As you can see above, to run the import, we make a 99/slim_import target. This basically runs the import script for every package listed in the file 99/slim_cluster. This list is more or less what makes up the contents of the live CD. Also of interest is the redist_import target, which builds every package available. By the way, watch out for distro-import/README: it's not quite up to date.

Another super useful environment variable is JUST_THESE_PKGS: this will only build and import the packages listed. Very useful if you're tweaking a package and don't want to re-import the whole cluster!

At the end of this build, we now have a populated IPS repository living at http://localhost:10023. If we already have an installed OpenSolaris, we could easily use this to install individual new packages, or do an image update (where ipshost is the remote name of your build machine):

# pkg set-authority -P -O http://ipshost:10023 myipsrepo
# pkg install SUNWmynewpackage # or...
# pkg image-update

If we want to test installer or live CD changes, though, we'll need to build an ISO. I did this for the first time today, and it's fall-over easy. First you need an OpenSolaris build machine, and type:

# pkg install SUNWdistro-const

Modify slim_cd.xml to point to your repository, as described here. It's not immediately obvious, but you can specify your URL as http://ipshost:10023 if you're not using the standard port, like me. Then:

# distro_const build ./slim_cd.xml

And that's it: you'll have a fully-working OpenSolaris ISO in /export/dc_output/ (I understand it's a different location after build 99, though). I never knew building an install ISO could be so simple!


Direct mounting of files Staring at the C

As part of my work on Least Privilege for xVM, I worked on implementing direct file mounts. The idea is that we'd modify the Solaris support in virt-install to use these direct mounts, instead of the more laborious older method required.

A long-standing peeve of Solaris users is that in order to mount a file system image (in particular a DVD ISO image), it's a two-step process. This was less than ideal, as many other UNIX OSes made it simple to do: you'd just pass the file to the mount command, along with a special option or two, and it mounts it directly.

With my putback of 6384817 Need persistent lofi based mounts and direct mount(1m) support for lofi, this is now possible (in fact, a little easier) in Solaris. Instead of doing this:

# device=`lofiadm -a /export/solarisdvd.iso`
# mount -F hsfs $device /mnt/iso
# umount /mnt/iso
# lofiadm -d /export/solarisdvd.iso

it's just:

# mount -F hsfs /export/solarisdvd.iso /mnt/iso
# umount /export/solarisdvd.iso

Under the hood, this still uses the lofi driver, it's just automatically used at mount and unmount time. There's no need for an -o loop option as on Linux.

This is supported for most of the file systems you might need in Solaris, namely ufs, hsfs, udfs, and pcfs. This doesn't work for ZFS, as this has its own method for mounting file system images.

I was asked a couple of times why I implemented this in the kernel at all (which meant requiring file system support via vfs_get_lofi()). This was primarily to allow non-root users to access file mounts; in fact this was the primary motivation for implementing this feature from the point of view of the xVM work. In particular, if you have PRIV_SYS_MOUNT, you can do direct file mounts as well as normal mounts. This is important for virt-install, which we want to avoid running as root, but needs to be able to mount DVDs to grab the booting information when installing a guest.

As always, there's more work that could be done. mount is not smart about relative paths, and should notice (and correct) early if you try to pass a relative path as the first argument. Solaris has always (rather annoyingly) required an -F option to identify what kind of file system you're mounting, which is particularly pedantic of it. Equally the lofi driver doesn't comprehend fdisk or VTOC layouts.
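Until mount grows those smarts, a small wrapper can paper over two of the rough edges: refusing file system types that don't support direct file mounts, and turning a relative image path into an absolute one. This is a sketch only. The function name direct_mount_cmd is made up, and it just prints the command it would run rather than running it.

```shell
# Print the mount command for a direct file mount.
#   $1 = file system type, $2 = image file, $3 = mount point
direct_mount_cmd() {
    # Only these file systems support direct file mounts.
    case "$1" in
    ufs|hsfs|udfs|pcfs) ;;
    *) echo "direct file mounts not supported for $1" >&2; return 1 ;;
    esac
    # mount wants an absolute path to the image.
    case "$2" in
    /*) image=$2 ;;
    *)  image=$PWD/$2 ;;
    esac
    echo "mount -F $1 $image $3"
}
```

Run from /export, direct_mount_cmd hsfs solarisdvd.iso /mnt/iso prints the same command as the example above.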

RSS feed Staring at the C

For reasons beyond my ken, doesn't actually list an RSS feed anywhere I can find, but it's at

Update: it's now grown an RSS icon. Thanks!

xVM Under The Hood: seg_mf Staring at the C

An occasional series wherein I'll describe a part of the xVM implementation. Today, I'll be talking about seg_mf. You may want to read through my explanation of live migration and MMU virtualization first.

The control domain (dom0) often needs access to memory pages that belong to a running guest domain. The most obvious example of this is in constructing the domain during boot, but it's also needed for mapping the shared virtual guest console page, generating guest domain core dumps, etc.

This is implemented via the privcmd driver. Each process that needs to map some area of a guest domain's memory maps a range of anonymous virtual memory. The process then sends a request to the driver to map in a given range or set of machine frames into the given virtual address range. The two requests (IOCTL_PRIVCMD_MMAP and IOCTL_PRIVCMD_MMAP_BATCH) are more or less the same, although the latter allows the user to track MFNs that couldn't be mapped (see below).

Both ioctl()s hook into the seg_mf code. This is a normal Solaris segment driver (see Solaris Internals) with a hook that's used to store the arrays of MFN values that each VA range is to be backed by. This segment driver is a little unusual though: it does not support demand faulting. That is, every page in the segment is faulted in (and locked in) at the time of the ioctl(). This is needed to support the error-reporting interface described below, but it also helps simplify the driver significantly.

To fault the range, we go through each page-size chunk in the mapping. We need to establish a mapping from the virtual address of the chunk to the actual machine frame holding the page owned by the guest domain. This happens in segmf_faultpage(). The HAT isn't used to our strange request, so we load a temporary mapping at the given VA, and replace that with a mapping to the real underlying MFN via HYPERVISOR_update_va_mapping_otherdomain().

Normally, the MFNs given via the ioctl() should be mappable. One exception is HVM live migration. This was implemented, somewhat confusingly, to use the same interfaces but pass GMFNs not MFNs. In particular, for HVM guests, a guest MFN (what a guest thinks is a real machine frame number) is actually a pseudo-physical frame number. As a result, due to ballooning, or PV drivers, etc., this GMFN may not have a real MFN backing it, so the attempt to map it will fail. We mark the MFN as failed in the outgoing array of IOCTL_PRIVCMD_MMAP_BATCH and let the client deal with it. This is generally OK, since the iterative nature of live migration means we can still get to all the pages we need.

One nice enhancement would be to extend pmap to recognise such mappings. In particular qemu-dm has a bunch of such mappings. It'd be relatively easy to mark such mappings as coming from seg_mf. Extra marks for listing the MFN ranges too, though that's a little harder :)


DTrace on xenstored Staring at the C

DTrace support for xenstored has just been merged in the upstream community version of Xen. Why is it useful?

The daemon xenstored runs in dom0 userspace, and implements a simple 'store' of configuration information. This store is used for storing parameters used by running guest domains, and interacts with dom0, guest domains, qemu, xend, and others. These interactions can easily get pretty complicated as a result, and visualizing how requests and responses are connected can be non-obvious.

The existing community solution was a 'trace' option to xenstored: you could restart the daemon and it would record every operation performed. This worked reasonably well, but was very awkward: restarting xenstored means a reboot of dom0 at this point in time. By the time you've set up tracing, you might not be able to reproduce whatever you're looking at any more. Besides, it's extremely inconvenient.

It was obvious that we needed to make this dynamic, and DTrace USDT (Userspace Statically Defined Tracing) was the obvious choice. The patch adds a couple of simple probes for tracking requests and responses; as usual, they're activated dynamically, so have (next to) zero impact when they're not used. On top of these probes I wrote a simple script called xenstore-snoop. Here's a couple of extracts of the output I get when I start a guest domain:

# /usr/lib/xen/bin/xenstore-snoop 
DOM  PID      TX     OP
0    100313   0      XS_GET_DOMAIN_PATH: 6 -> /local/domain/6
0    100313   0      XS_TRANSACTION_START:  -> 930
0    100313   930    XS_RM: /local/domain/6 -> OK
0    100313   930    XS_MKDIR: /local/domain/6 -> OK
6    0        0      XS_READ: /local/domain/0/backend/vbd/6/0/state -> 4
6    0        0      XS_READ: device/vbd/0/state -> 3
0    0        -      XS_WATCH_EVENT: /local/domain/6/device/vbd/0/state FFFFFF0177B8F048
6    0        -      XS_WATCH_EVENT: device/vbd/0/state FFFFFF00C8A3A550
6    0        0      XS_WRITE: device/vbd/0/state 4 -> OK
0    0        0      XS_READ: /local/domain/6/device/vbd/0/state -> 4
6    0        0      XS_READ: /local/domain/0/backend/vbd/6/0/feature-barrier -> 1
6    0        0      XS_READ: /local/domain/0/backend/vbd/6/0/sectors -> 16777216
6    0        0      XS_READ: /local/domain/0/backend/vbd/6/0/info -> 0
6    0        0      XS_READ: device/vbd/0/device-type -> disk
6    0        0      XS_WATCH: cpu FFFFFFFFFBC2BE80 -> OK
6    0        -      XS_WATCH_EVENT: cpu FFFFFFFFFBC2BE80
6    0        0      XS_READ: device/vif/0/state -> 1
6    0        0      [ERROR] XS_READ: device/vif/0/type -> ENOENT

This makes the interactions immediately obvious. We can observe the Xen domain that's doing the request, the PID of the process (this only applies to dom0 control tools), the transaction ID, and the actual operations performed. This has already proven of use in several investigations.
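Because the output is plain columns, it's also easy to post-process. As a sketch, assuming the layout shown in the extract above (a header line, then DOM, PID, TX and the operation, with failed operations prefixed by [ERROR]), this hypothetical function counts how many times each operation type appears:

```shell
# Read xenstore-snoop style output on stdin and print a count per
# operation type, sorted by operation name.
snoop_op_counts() {
    awk '
    NR == 1 && $1 == "DOM" { next }          # skip the header line
    {
        # The operation is the 4th column, or the 5th after "[ERROR]".
        op = ($4 == "[ERROR]") ? $5 : $4
        sub(/:$/, "", op)                    # drop the trailing colon
        count[op]++
    }
    END { for (op in count) print count[op], op }' | sort -k2
}
```

Something like /usr/lib/xen/bin/xenstore-snoop >/tmp/snoop.out while a guest starts, then snoop_op_counts </tmp/snoop.out, gives a quick picture of which operations dominate.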

Of course this being DTrace, this is only part of the story. We can use these probes to correlate system behaviour: for example, xenstored transactions are currently rather heavyweight, as they involve copying a large file; these probes can help demonstrate this. Using Python's DTrace support, we can look at which stack traces in xend correspond to which requests to the store; and so on.

This feature, whilst relatively minor, is part of an ongoing plan to improve the observability and RAS of Xen and the solutions Sun are building on top of it. It's very important to us to bring Solaris's excellent observability features to the virtualization space: you've seen the work with zones in this area, and you can expect a lot more improvements for the Xen case too.


I meant to say: after my previous post, I resurrected #opensolaris-dev: if you'd like to talk about OpenSolaris development in a non-hostile environment, please join!


#opensolaris Staring at the C

When OpenSolaris got started, #solaris was a channel filled with pointless rants about GNU-this and Linux-that. Besides complete wrong-headedness, it was a total waste of time and extremely hostile to new people. #opensolaris, in contrast, was actually pretty nice (for IRC!) - sure, the usual pointless discussions but it certainly wasn't hateful.

Recently I'm sad to say #opensolaris has become a really hostile, unpleasant place. I've seen new people arrive and be bullied by a small number of poisonous people until they went away (nice own goal, people!). So if anyone's looking for me for xVM stuff or whatever, I'll be in #onnv-scm or #solaris-xen as usual. And if you do so, please try to keep a civil tongue in your head - it's not hard.

Xen compatibility with Solaris Staring at the C

Maintaining the compatibility of hardware virtualization solutions can be tricky. Below I'll talk about two bugs that needed fixes in the Xen hypervisor. Both of them have unfortunate implications for compatibility, but thankfully, the scope was limited.

6616864 amd64 syscall handler needs fixing for xen 3.1.1

Shortly after the release of 3.1.1, we discovered that all 64-bit processes in a Solaris domain would segfault immediately. After much debugging and head-scratching, I eventually found the problem. On AMD64, 64-bit processes trap into the kernel via the syscall instruction. Under Xen, this will obviously trap to the hypervisor. Xen then 'bounces' this back to the relevant OS kernel.

On real hardware, %rcx and %r11 have specific meanings. Prior to 3.1.1, Xen happened to maintain these values correctly, although the layout of the stack is very different from real hardware. This was broken in the 3.1.1 release: as a result, the %rflags of each process was corrupted, and the process segfaulted almost immediately. We fixed the bug in Solaris, so we would still work with 3.1.1. This was also fixed (restoring the original semantics) in Xen itself in time for the 3.1.2 release. So there's a small window (early Solaris xVM releases and community versions of Xen 3.1.1) where we're broken, but thankfully, we caught this pretty early. The lesson to be drawn? Clear documentation of the hypervisor ABI would have helped, I think.

6618391 64-bit xVM lets processes fiddle with kernelspace, but Xen bug saves us

Around the same time, I noticed during code inspection that we were still setting PT_USER in PTE entries on 64-bit. This had some nasty implications, but first, some background.

On 32-bit x86, Xen protects itself via segmentation: it carves out the top 64MB, and refuses to let any of the domains load a segment selector that allows read or write access to that part of the address space. Each domain kernel runs in ring 1 so can't get around this. On 64-bit, this hack doesn't work, as AMD64 does not provide full support for segmentation (given what a legacy technique it is). Instead, and somewhat unfortunately, we have to use page-based permissions via the VM system. Since page table entries only have a single bit ("user/supervisor") instead of being able to say "ring 1 can read, but ring 3 cannot", the OS kernel is forced into ring 3. Normally, ring 3 is used for userspace code. So every time we switch between the OS kernel and userspace, we have to switch page tables entirely - otherwise, the process could use the kernel page tables to write to kernel address-space.

Unfortunately, this means that we have to flush the TLB every time, which has a nasty performance cost. To help mitigate this problem, in Xen 3.0.3, an incompatible change was made. Previously, so that the kernel (running in ring 3, remember) could access its address space, it had to set PT_USER in its kernel page table entries (PTEs). With 3.0.3, this was changed: now, the hypervisor would automatically do that. Furthermore, if Xen did see a PTE with PT_USER set, then it assumed this was a userspace mapping. Thus, it also set PT_GLOBAL, a hardware feature - if such a bit is set, then a corresponding TLB entry is not flushed. This meant that switching between userspace and the OS kernel was much faster, as the TLB entries for userspace were no longer flushed.
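To make the 3.0.3 rule concrete, here is a deliberately simplified model of it in shell arithmetic. It is only an illustration of the behaviour described above, not Xen's code: the function name is made up, and the bit values are the standard x86 ones (user/supervisor is bit 2, global is bit 8).

```shell
PT_USER=4       # 0x004, x86 user/supervisor bit
PT_GLOBAL=256   # 0x100, x86 global bit

# Model of what Xen 3.0.3 and later does to a 64-bit PV guest's PTE.
#   $1 = PTE flags as written by the guest kernel (decimal or 0x hex)
xen_adjust_pte() {
    pte=$(( $1 ))
    if [ $(( pte & PT_USER )) -ne 0 ]; then
        # Guest set PT_USER: assumed to be a userspace mapping, so it is
        # made global and survives the kernel/user page table switch.
        pte=$(( pte | PT_GLOBAL ))
    else
        # Kernel mapping: Xen adds PT_USER itself, since the guest
        # kernel runs in ring 3.
        pte=$(( pte | PT_USER ))
    fi
    printf '0x%x\n' "$pte"
}
```

A kernel PTE written as 0x3 comes back as 0x7, which is fine. The same PTE written with PT_USER already set (0x7) comes back as 0x107: it is now global, so its TLB entry is not flushed on the switch to userspace.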

Unfortunately, in our kernel, we missed this change in some crucial places, and until we fixed the bug above, we were setting PT_USER even on kernel mappings. This was fairly obviously A Bad Thing: if you caught things just right, a kernel mapping would still be present in the TLB when a user-space program was running, allowing userspace to read from the kernel! And indeed, some simple testing showed this:

dtrace -qn 'fbt:genunix::entry /arg0 > `kernelbase/ { printf("%p ", arg0); }' | \
    xargs -n 1 ~johnlev/bin/i386/readkern | while read ln; do echo $ln::whatis | mdb -k ; done

With the above use of DTrace, MDB, and a little program that attempts to read addresses, we can see output such as:

ffffff01d6f09c00 is ffffff01d6f09c00+0, allocated as a thread structure
ffffff01c8c98438 is ffffff01c8c983e8+50, bufctl ffffff01c8ebf8d0 allocated from as_cache
ffffff01d6f09c00 is ffffff01d6f09c00+0, allocated as a thread structure
ffffff01d44d7e80 is ffffff01d44d7e80+0, bufctl ffffff01d3a2b388 allocated from kmem_alloc_40
ffffff01d44d7e80 is ffffff01d44d7e80+0, bufctl ffffff01d3a2b388 allocated from kmem_alloc_40

Thankfully, the fix was simple: just stop adding PT_USER to our kernel PTE entries. Or so I thought. When I did that, I noticed during testing that the userspace mappings weren't getting PT_GLOBAL set after all (big thanks to MDB's ::vatopfn, which made this easy to see).

Yet more investigation revealed the problem to be in the hypervisor. Unlike certain other popular OSes used with Xen, we set PTE entries in page tables using atomic compare and swap operations. Remember that under Xen, page tables are read-only to ensure safety. When an OS kernel tries to write a PTE, a page fault happens in Xen. Xen recognises the write as an attempt to update a PTE and emulates it. However, since it hadn't been tested, this emulation path was broken: it wasn't doing the correct mangling of the PTE entry to set PT_GLOBAL. Once again, the actual fix was simple.

By the way, that same putback also had the implementation of:

6612324 ::threadlist could identify taskq threads

I'd been doing an awful lot of paging through ::threadlist output recently, and always having to jump through all the (usually irrelevant) taskq threads was driving me insane. So now you can just specify ::threadlist -t and get a much, much, shorter list.


OpenSolaris xVM now available in SX:CE Staring at the C

Build 75 of Solaris Express Community Edition is now out, and it includes our bits. So go ahead, install build 75, select the xVM entry in grub and play around! We're still working on updating the documentation on our community page; in the meantime, you have manpages - start at xVM(5) (and note that the forthcoming build 76 has much improved versions of those docs).

You might be wondering if your machine is capable of running Windows or other operating systems under HVM. Joe Bonasera has a simple program you can run that will tell you. Alternatively, if you're already running with our bits, running 'virt-install' will tell you - if it asks you about creating a fully-virtualized domain, then it should work, and you can end up with a desktop like Russell Blaine's.

Nils, meanwhile, describes how we've improved the RAS of the hypervisor by integrating it with Solaris crash dumps here. This feature has saved our lives numerous times during development as those of us who've done the "hex dump" debugging thing know very well.

Of course, we're not done yet - we have bugs to fix and rough edges to smooth out, and we have significant features to implement. One of the major items we're working on in the near future is the upgrade to Xen 3.1.1 (or possibly 3.1.2, depending on timelines!). This will give us the ability to do live migration of HVM domains, along with a host of other features and improvements.