Old web content Staring at the C

I think it's important that everyone should endeavour to maintain existing web content, even if it's not currently relevant.

Enabling xVM on OpenSolaris Staring at the C

Another significant usability improvement that landed in build 126 is Gary and Bill's work on enabling Xen. Now, running xVM should be as simple as:

# pkg install xvm-gui
# echo 'set zfs:zfs_arc_max = 0x10000000' >>/etc/system # yes, you still need this, sadly
# svcadm enable -r milestone/xvm
# reboot

There's also a new Visual Panel for doing this if you prefer a graphical method. More in the flag day message.


Dry-run migration Staring at the C

As part of our ongoing work on improving the ease of use of xVM, the newly available build 126 of OpenSolaris has my putback for:

6878952 Would like dry-run migration

This feature is useful for doing a simple check as to whether a guest can successfully migrate to another dom0 host. For example, domu-221 here is using a disk path that doesn't exist on the remote host hiss:

# virsh migrate --dryrun domu-221 xen:/// hiss    
error: POST operation failed: xend_post: error from xen daemon:
(xend.err 'Remote server error: Access to vbd:768 failed: error: "/iscsi/nevada-hvm" is not a valid block device.')

This works both with running and shutdown guests. Currently, the checks are fairly limited: are disks of the same path available on the remote host (note there is no checking of GUIDs or whatever to verify they really are the same piece of shared storage); is there enough memory on the remote host; and is the remote host the same CPU vendor. We expect these checks to improve both in scope and in reliability in the future.


xVM and COMSTAR iSCSI Staring at the C

I recently had cause to try out COMSTAR for the first time, and I thought I'd write up the steps needed. Unfortunately, it's considerably more complex than the fall-over-easy shareiscsi=on ZFS feature.

Configuring the COMSTAR server

First install the storage-server packages and enable the services:

# svcadm enable -r stmf
# svcadm enable -r iscsi/target

We want to create a target group for each of our xVM guests, each of which will have one LUN in it. After creating the LUN, we define a "view" that allows that LUN to be visible for that target group:

# stmfadm create-tg domu-226
# zfs create -V 15G export/domu-226
# stmfadm create-lu /dev/zvol/rdsk/export/domu-226
Logical unit created: 600144F0C73ABF0F00004AD75DF2001A
# stmfadm add-view -t domu-226 600144F0C73ABF0F00004AD75DF2001A

Now we need to create the iSCSI target for this target group, that has our single LUN in it.

# itadm create-target -l domu-226
Target iqn.1986-03.com.sun:02:b8596bb9-9bb9-40e9-8cda-add6073ece46 successfully created

Here (finally) is our iSCSI Alias we can use in the clients. But we're not done yet. By default, this target will be able to see all LUNs not in a target group. So we need to make it a member of our domu-226 target group:

# stmfadm add-tg-member -g domu-226 iqn.1986-03.com.sun:02:b8596bb9-9bb9-40e9-8cda-add6073ece46
# stmfadm list-tg -v
Target Group: domu-226
        Member: iqn.1986-03.com.sun:02:b8596bb9-9bb9-40e9-8cda-add6073ece46

Configuring the iSCSI initiator (client)

We do this in the usual manner:

# svcadm enable -r svc:/network/iscsi/initiator:default
# iscsiadm add discovery-address
# iscsiadm modify discovery --sendtargets enable

Installing a guest onto the LUN

We went through the above gymnastics so we can have a human-readable Alias for each of the domu's root LUNs. So now we can do:

# virt-install --paravirt --name domu-226 --ram 1024 --os-type solaris --os-variant opensolaris \
  --location nfs: --network bridge,mac=00:14:4f:0f:b5:3e \
  --disk path=/alias/domu-226,driver=phy,subdriver=iscsi \


OpenSolaris 2009.06 guest domain on a Linux dom0 Staring at the C

Just a quick note: you can follow the instructions I provided for the 2008.11 release, with one change. On a 64-bit machine, replace any instances of /boot/x86.microroot with /boot/amd64/x86.microroot. As of 2009.06, the boot archive is split into 32-bit and 64-bit variants. If you get a message like this:

krtld: failed to open '/platform/i86xpv/kernel/amd64/unix'

Then you've probably given the wrong combination of unix and microroot.

By the way, in my previous entry, I mentioned we were working on upstreaming our virt-install changes. During the Xen 3.3 work (more on which soon), I updated to the latest versions and got the needed parts into the upstream version. We've still some ZFS changes to push, but if you're running a recent enough version of Xen on Linux, you may well be able to use virt-install and skip all this horrible hacking!

Begone, trailing spaces! Staring at the C

I read my work email with mutt on a Solaris 9 box. For a while it's been irritating me that when you attempt to cut and paste, it will include trailing spaces on each line instead of stopping at the last "real" character. Some Googling suggested this was because of the lack of the BCE attribute in my xterm-color terminfo definition. Rather than learn how to compile terminfo entries (I've done it before, but I don't want to learn again!), I took the lazier approach: copy /usr/share/terminfo/s/screen-256color-bce from a Fedora 8 box into /home/johnlev/.terminfo/s/, and start mutt with TERM and TERMINFO set appropriately. Now I can cut and paste sanely again.


OpenSolaris 2008.11 as a dom0 Staring at the C

UPDATE: the canonical location for this information is now here - please check there, as it will be updated as necessary, unlike this blog entry.

As a final part to my entries on OpenSolaris and Xen, let's go through the steps needed to turn OpenSolaris into a dom0. Thanks to Trevor O for documenting this for 2008.05. And as before, expect this process to get much, much, easier soon!

I'm going to do the work in a separate BE, so if we mess up, we shouldn't have broken anything. So, first we create our BE:

$ pfexec beadm create -a -d xvm xvm
First, let's install the packages. If you've updated to the development version, a simple pkg install xvm-gui will work, but let's assume you haven't:
$ pfexec beadm mount xvm /tmp/xvm-be
$ pfexec pkg -R /tmp/xvm-be install SUNWvirt-manager SUNWxvm SUNWvdisk SUNWvncviewer
$ pfexec beadm umount xvm

Now we need to actually reboot into Xen. Unfortunately beadm is not yet aware of how to do this, so we'll have to hack it up. We're going to run some awk over the menu.lst file which controls grub:

$ awk '
/^title/ { xvm=0; }
/^title.xvm$/ { xvm=1; }
/^(splashimage|foreground|background)/ {
    if (xvm == 1) next
/^kernel\$/ {
    if (xvm == 1) {
       print("kernel\$ /boot/\$ISADIR/xen.gz")
       sub("^kernel\\$", "module$")
       gsub("console=graphics", "console=text")
       gsub("i86pc", "i86xpv")
       $2=$2 " " $2
{ print }' /rpool/boot/grub/menu.lst >/var/tmp/menu.lst.xvm

Let's check that the awk script (my apologies) worked properly:

$ tail /var/tmp/menu.lst.xvm 
#============ End of LIBBE entry =============
title xvm
findroot (pool_rpool,0,a)
bootfs rpool/ROOT/xvm
kernel$ /boot/$ISADIR/xen.gz
module$ /platform/i86xpv/kernel/$ISADIR/unix /platform/i86xpv/kernel/$ISADIR/unix -B $ZFS-BOOTFS,console=text
module$ /platform/i86pc/$ISADIR/boot_archive
#============ End of LIBBE entry =============

Looks good. We'll move it into place, and reboot:

$ pfexec cp /rpool/boot/grub/menu.lst /rpool/boot/grub/menu.lst.saved
$ pfexec mv /var/tmp/menu.lst.xvm /rpool/boot/grub/menu.lst
$ pfexec reboot

This should boot you into xVM. If everything worked OK, let's enable the services:

$ svcadm enable -r xvm/virtd ; svcadm enable -r xvm/domains

At this point, you should be able to merrily go ahead and install domains!

Update: Todd Clayton pointed out the issue I've filed here: SUNWxvm needs to depend on SUNWvdisk. I've updated the instructions above with the workaround.

Update update: Rich Burridge has fixed it. Nice!


OpenSolaris 2008.11 guest domain on a Linux dom0 Staring at the C

My previous blog post described how to install OpenSolaris 2008.11 on a Solaris dom0 under Xen. This also works on with a Linux dom0. However, since upstream is missing some of our dom0 fixes, it's unfortunately more complicated. In particular, we can't use virt-install, as it doesn't know about Solaris ISOs, and later on, we can't use pygrub to boot from ZFS, since it doesn't know how to read such a filesystem. Bear with me, this gets a little awkward.

This example is using a 32-bit Fedora 8 installation. Your milage is likely to vary if you're using a different version, or another Linux distribution. First some of the configuration parameters you might want to change:

export name="domu-224"
export iso="/isos/osol-2008.11.iso"
export dompath="/export/guests/2008.11"
export rootdisk="$dompath/root.img"
export unixfile="/platform/i86xpv/kernel/unix"

If you're on 64-bit Linux, set unixfile="/platform/i86xpv/kernel/amd64/unix" instead. We need to create ourselves a 10Gb root disk:

mkdir -p $dompath
dd if=/dev/zero count=1 bs=$((1024 * 1024)) seek=10230 of=$rootdisk

Now let's use the configuration we need to install OpenSolaris:

cat >/tmp/domain-$name.xml <<EOF
<domain type='xen'>
 <bootloader_args>--kernel=/platform/i86xpv/kernel/unix --ramdisk=/boot/x86.microroot</bootloader_args>
  <interface type='bridge'>
   <source bridge='eth0' />
       If you have a static DHCP setup, add the domain's MAC address here
       <mac address='00:16:3e:1b:e8:18' />
  <disk type='file' device='cdrom'>
   <driver name='file' />
   <source file='$iso' />
   <target dev='xvdc:cdrom' />
  <disk type='file' device='disk'>
   <driver name='file' />
   <source file='$rootdisk' />
   <target dev='xvda' />

And start up the domain:

virsh create /tmp/domain-$name.xml
virsh console $name

Now you're dropped into the domain's console, and you can use the VNC trick I described to do the install. Answer the questions, wait for the domain to DHCP, then:

domid=`virsh domid $name`
ip=`/usr/bin/xenstore-read /local/domain/$domid/ipaddr/0`
port=`/usr/bin/xenstore-read /local/domain/$domid/guest/vnc/port`
/usr/bin/xenstore-read /local/domain/$domid/guest/vnc/passwd
vncviewer $ip:$port

At this point, you can proceed with the installation as normal. Before you reboot though, we need to do some tricks, due to the lack of ZFS support mentioned above. Whilst still in the live CD environment, bring up a terminal. We need to copy the new kernel and ramdisk to the Linux dom0. We can automate this via a handy script:



root=`pfexec beadm list -H |  grep ';N*R;' | cut -d \; -f 1`
mkdir /tmp/root
pfexec beadm mount $root /tmp/root 2>/dev/null
mount=`pfexec beadm list -H $root | cut -d \; -f 4`
pfexec bootadm update-archive -R $mount
scp $mount/$unixfile root@$dom0:$dompath/kernel.$root
scp $mount/platform/i86pc/$3/boot_archive root@$dom0:$dompath/ramdisk.$root
pfexec beadm umount $root 2>/dev/null
echo "Kernel and ramdisk for $root copied to $dom0:$dompath"
echo "Kernel cmdline should be:"
echo "$unixfile -B zfs-bootfs=rpool/ROOT/$root,bootpath=/xpvd/xdf@51712:a"

For example, we might do:

/tmp/update_dom0 linux-dom0 /export/guests/2008.11
or on 64-bit:
/tmp/update_dom0 linux-dom0 /export/guests/2008.11 amd64

Now, you can finish the installation by clicking the reboot button. This will shut down the domain, ready to run. But first we need the configuration file for running the domain:

cat >/$dompath/$name.xml <<EOF
<domain type='xen'>
  <cmdline>$unixfile -B zfs-bootfs=rpool/ROOT/opensolaris,bootpath=/xpvd/xdf@51712:a</cmdline>
  <interface type='bridge'>
   <source bridge='eth0'/>
  <disk type='file' device='disk'>
   <driver name='file' />
   <source file='$rootdisk' />
   <target dev='xvda' />

virsh define $dompath/$name.xml
virsh start $name
virsh console $name

It should be booting, and you're (finally) done!

Updating the guest

Unfortunately we're not quite out of the woods yet. What we have works fine, but if we update the guest via pkg image-update, we'll need to make changes in dom0 to boot the new boot environment. The update_dom0 script above will do a fine job of copying out the new kernel and ramdisk for the BE that's active on reboot, but you also need to edit the config file. For example, if I wanted to boot into the new BE called opensolaris-1, I'd replace these lines:

<cmdline>$unixfile -B zfs-bootfs=rpool/ROOT/opensolaris,bootpath=/xpvd/xdf@51712:a</cmdline>

with these:

<cmdline>$unixfile -B zfs-bootfs=rpool/ROOT/opensolaris-1,bootpath=/xpvd/xdf@51712:a</cmdline>

then re-configure the domain (whist it's shut down) via virsh undefine $name ; virsh define $dompath/$name.xml.

Yes, we're aware this is rather over-complicated. We're trying to find the time to send our changes to virt-install upstream, as well as ZFS support. Eventually this will make it much easier to use a Linux dom0.


OpenSolaris 2008.11 as a para-virtual Xen guest Staring at the C

UPDATE: the canonical location for this information is now here - please check there, as it will be updated as necessary, unlike this blog entry.

As well obviously working with VirtualBox, OpenSolaris can also run as a guest domain under Xen. The installation CD ships with the paravirtual extensions so you can run it as a fully para-virtualized guest. This provides a significant advantage over fully-virtualized guests, or even guests with para-virtual drivers like Solaris 10 Update 6. Of course, if you choose to, you can still run OpenSolaris fully-virtualized (a.k.a. HVM mode), but there's little advantage to doing so.

One slight wrinkle is that Solaris guests don't yet implement the virtual framebuffer that the Xen infrastructure supports. Since OpenSolaris doesn't yet have a text-mode install, this means that to install such a PV guest, we need a way to bring up a graphical console.

With 2008.11, this is considerably easier. Presuming we're running a Solaris dom0 (either Nevada or OpenSolaris, of course), let's start an install of 2008.11:

# zfs create rpool/zvol
# zfs create -V 10G rpool/zvol/domu-220-root
# virt-install --nographics --paravirt --ram 1024 --name domu-220 -f /dev/zvol/dsk/rpool/zvol/domu-220-root -l /isos/osol-2008.11.iso

This will drop you into the console for the guest to ask you the two initial questions. Since they're not really important in this circumstance, you can just choose the defaults. This example presumes that you have a DHCP server set up to give out dynamic addresses. If you only hand out addresses statically based on MAC address, you can also specify the --mac option. As OpenSolaris more-or-less assumes DHCP, it's recommended to set one up.

Now we need a graphical console in order to interact with the OpenSolaris installer. If the guest domain successfully finished booting the live CD, a VNC server should be running. It has recorded the details of this server in XenStore. This is essentially a name/value config database used for communicating between guest domains and the control domain (dom0). We can start a VNC session as follows:

# domid=`virsh domid domu-220`
# ip=`/usr/lib/xen/bin/xenstore-read /local/domain/$domid/ipaddr/0`
# port=`/usr/lib/xen/bin/xenstore-read /local/domain/$domid/guest/vnc/port`
# /usr/lib/xen/bin/xenstore-read /local/domain/$domid/guest/vnc/passwd
# vncviewer $ip:$port

At the VNC password prompt, enter the given password, and this should bring up a VNC session, and you can merrily install away.


The live CD runs a transient SMF service system/xvm/vnc-config. If it finds itself running on a live CD, it will generate a random VNC password, configure application/x11/x11-server to start Xvnc, and write the values above to XenStore. When application/graphical-login/gdm starts, it will read these service properties and start up the VNC server. The service system/xvm/ipagent tracks the IPv4 address given to the first running interface and writes it to XenStore.

By default, the VNC server is configured not to run post-installation due to security concerns. This can be changed though, as follows:

# svccfg -s x11-server
setprop options/xvm_vnc = "true"

Please remember that VNC is not secure. Since you need elevated privileges to read the VNC password from XenStore, that's sufficiently protected, as long as you always run the VNC viewer locally on the dom0, or via SSH tunnelling or some other secure method.

Note that this works even with a Linux dom0, although you can't yet use virt-install, as the upstream version doesn't yet "know about" OpenSolaris (more on this later).


Building OpenSolaris ISOs Staring at the C

I've recently been figuring out to build OpenSolaris ISOs (from SVR4 packages). It's surprisingly easy, but at least the IPS part is not well documented, so I thought I'd write up how I do it.

There are three main things you're most likely to want to do: build IPS itself, populate an IPS repository, and build an install ISO based on that repository. First, you'll want a copy of the IPS gate:

hg clone ssh://anon@hg.opensolaris.org/hg/pkg/gate pkg-gate
For some of my testing, I wanted to test some changed packages. So I mounted a Nevada DVD on /mnt/, then, using mount -F lofs, replaced some of the package directories with ones I'd built previously with my fixes. This effectively gave me a full Nevada DVD with my fixes in, avoiding the horrors of making one. I then cd pkg-gate, and run something like this:
$ cat build-ips
export WS=$1
export REPO=http://localhost:$2
unset http_proxy || true
set -e
echo "START `date`"
cd $WS/src
make install packages
cd $WS/src/util/distro-import
export NONWOS_PKGS="/net/paradise/export/integrate_dock/nv/nv_osol0811/all \
export WOS_PKGS="/mnt/Solaris_11/Product/"
export PYTHONPATH=$WS/proto/root_i386/usr/lib/python2.4/vendor-packages/
export PATH=$WS/proto/root_i386/usr/bin/:$WS/proto/root_i386/usr/lib:$PATH
nohup pkg.depotd -p $2 -d /var/tmp/$USER/repo &
sleep 5
make -e 99/slim_import
echo "END `date`"
$ ./build-ips `pwd` 10023

In fact, since I was running on an older version Nevada (89, precisely), I had to stop after the make install and change src/pyOpenSSL-0.7/setup.py to pick up OpenSSL from /usr/sfw:

IncludeDirs =  [ '/usr/sfw/include' ]
LibraryDirs =  [ '/usr/sfw/lib' ]

(If /usr/bin/openssl exists, you don't need this). So, after this step, which build the IPS tools (and SVR4 package for it), it moves into the "distro-import" directory. This is really a completely different thing from IPS itself, but for convenience it lives in the IPS gate. Its job is to take a set of SVR4 packages (that is, the old Solaris package format) and upload them to a given IPS network repository: in this case, http://localhost:10023.

So, making sure we use the IPS tools we just built, we point a couple of environment variables to the package locations. "WOS" stands for, charmingly, "Wad Of Stuff", and in this context means "packages delivered to Solaris Nevada". There's also some extra packages used for OpenSolaris, listed here as NONWOS_PKGS. I'm not sure where external people can get them from, though.

The core of distro-import is the solaris.py script, which does the job of transliterating from SVR4-speak into pkgsend(1)-speak. As well as a straight translation, though, a small number of customisations to the existing packages are also made to account for OpenSolaris differences. These are done by dropping the original file contents and picking them up from an ad-hoc SUNWfixes SVR4 package built in the same directory.

Of course, each build has its differences, so they're separated out into sub-directories. As you can see above, to run the import, we make a 99/slim_import target. This basically runs solaris.py for every package listed in the file 99/slim_custer. This list is more or less what makes up the contents of the live CD. Also of interest is the redist_import target, which builds every package available (see http://pkg.opensolaris.org). By the way, watch out for distro-import/README: it's not quite up to date.

Another super useful environment variable is JUST_THESE_PKGS: this will only build and import the packages listed. Very useful if you're tweaking a package and don't want to re-import the whole cluster!

At the end of this build, we now have a populated IPS repository living at http://localhost:10023. If we already have an installed OpenSolaris, we could easily use this to install individual new packages, or do an image update (where ipshost is the remote name of your build machine):

# pkg set-authority -P -O http://ipshost:10023 myipsrepo
# pkg install SUNWmynewpackage # or...
# pkg image-update

If we want to test installer or live CD changes, though, we'll need to build an ISO. I did this for the first time today, and it's fall-over easy. First you need an OpenSolaris build machine, and type:

# pkg install SUNWdistro-const

Modify slim_cd.xml to point to your repository, as described here. It's not immediately obvious, but you can specify your URL as http://ipshost:10023 if you're not using the standard port, like me. Then:

# distro_const build ./slim_cd.xml

And that's it: you'll have a fully-working OpenSolaris ISO in /export/dc_output/ (I understand it's a different location after build 99, though). I never knew building an install ISO could be so simple!


Direct mounting of files Staring at the C

As part of my work on Least Privilege for xVM, I worked on implementing direct file mounts. The idea is that we'd modify the Solaris support in virt-install to use these direct mounts, instead of the more laborious older method required.

A long-standing peeve of Solaris users is that in order to mount a file system image (in particular a DVD ISO image), it's a two-step process. This was less than ideal, as many other UNIX OS's made it simple to do: you'd just pass the file to the mount command, along with a special option or two, and it mounts it directly.

With my putback of 6384817 Need persistent lofi based mounts and direct mount(1m) support for lofi, this is now possible (in fact, a little easier) in Solaris. Instead of doing this:

# device=`lofiadm -a /export/solarisdvd.iso`
# mount -F hsfs $device /mnt/iso
# umount /mnt/iso
# lofiadm -d /export/solarisdvd.iso

it's just:

# mount -F hsfs /export/solarisdvd.iso /mnt/iso
# umount /export/solarisdvd.iso

Under the hood, this still uses the lofi driver, it's just automatically used at mount and unmount time. There's no need for an -o loop option as on Linux.

This is supported for most of the file systems you might need in Solaris, namely ufs, hsfs, udfs, and pcfs. This doesn't work for ZFS, as this has its own method for mounting file system images.

I was asked a couple of times why I implemented this in the kernel at all (which meant requiring file system support via vfs_get_lofi(). This was primarily to allow non-root users to access file mounts; in fact this was the primary motivation for implementing this feature from the point of view of the xVM work. In particular, if you have PRIV_SYS_MOUNT, you can do direct file mounts as well as normal mounts. This is important for virt-install, which we want to avoid running as root, but needs to be able to mount DVDs to grab the booting information for when installing a guest.

As always, there's more work that could be done. mount is not smart about relative paths, and should notice (and correct) early if you try pass a relative path as the first argument. Solaris has always (rather annoyingly) required an -F option to identify what kind of file system you're mounting, which is particularly pedantic of it. Equally the lofi driver doesn't comprehend fdisk or VTOC layouts.


blogs.sun.com RSS feed Staring at the C

For reasons beyond my ken, blogs.sun.com doesn't actually list an RSS feed anywhere I can find, but it's at http://blogs.sun.com/main/feed/entries/rss.

Update:: it's now grown an RSS icon. Thanks!

xVM Under The Hood: seg_mf Staring at the C

An occasional series wherein I'll describe a part of the xVM implementation. Today, I'll be talking about seg_mf. You may want to read through my explanation of live migration and MMU virtualization first.

The control domain (dom0) often needs access to memory pages that belong to a running guest domain. The most obvious example of this is in constructing the domain during boot, but it's also needed for mapping the shared virtual guest console page, generating guest domain core dumps, etc.

This is implemented via the privcmd driver. Each process that needs to map some area of a guest domain's memory maps a range of anonymous virtual memory. The process then sends a request to the driver to map in a given range or set of machine frames into the given virtual address range. The two requests (IOCTL_PRIVCMD_MMAP and IOCTL_PRIVCMD_MMAP_BATCH) are more or less the same, although the latter allows the user to track MFNs that couldn't be mapped (see below).

Both ioctl()s hook into the seg_mf code. This is a normal Solaris segment driver (see Solaris Internals) with a hook that's used to store the arrays of MFN values that each VA range is to be backed by. This segment driver is a little unusual though: it does not support demand faulting. That is, every page in the segment is faulted in (and locked in) at the time of the ioctl(). This is needed to support the error-reporting interface described below, but it also helps simplify the driver significantly.

To fault the range, we go through each page-size chunk in the mapping. We need to establish a mapping from the virtual address of the chunk to the actual machine frame holding the page owned by the guest domain. This happens in segmf_faultpage(). The HAT isn't used to our strange request, so we load a temporary mapping at the given VA, and replace that with a mapping to the real underlying MFN via HYPERVISOR_update_va_mapping_otherdomain().

Normally, the MFNs given via the ioctl() should be mappable. One exception is HVM live migration. This was implemented, somewhat confusingly, to use the same interfaces but pass GMFNs not MFNs. In particular, for HVM guests, a guest MFN (what a guest thinks is a real machine frame number) is actually a pseudo-physical frame number. As a result, due to ballooning, or PV drivers, etc., this GMFN may not have a real MFN backing it, so the attempt to map it will fail. We mark the MFN as failed in the outgoing array of IOCTL_PRIVCMD_MMAP_BATCH and let the client deal with it. This is generally OK, since the iterative nature of live migration means we can still get to all the pages we need.

One nice enhancement would be to extend pmap to recognise such mappings. In particular qemu-dm has a bunch of such mappings. It'd be relatively easy to mark such mappings as coming from seg_mf. Extra marks for listing the MFN ranges too, though that's a little harder :)


DTrace on xenstored Staring at the C

DTrace support for xenstored has just been merged in the upstream community version of Xen. Why is it useful?

The daemon xenstored runs in dom0 userspace, and implements a simple 'store' of configuration information. This store is used for storing parameters used by running guest domains, and interacts with dom0, guest domains, qemu, xend, and others. These interactions can easily get pretty complicated as a result, and visualizing how requests and responses are connected can be non-obvious.

The existing community solution was a 'trace' option to xenstored: you could restart the daemon and it would record every operation performed. This worked reasonably well, but was very awkward: restarting xenstored means a reboot of dom0 at this point in time. By the time you've set up tracing, you might not be able to reproduce whatever you're looking at any more. Besides, it's extremely inconvenient.

It was obvious that we needed to make this dynamic, and DTrace USDT (Userspace Statically Defined Tracing) was the obvious choice. The patch adds a couple of simple probes for tracking requests and responses; as usual, they're activated dynamically, so have (next to) zero impact when they're not used. On top of these probes I wrote a simple script called xenstore-snoop. Here's a couple of extracts of the output I get when I start a guest domain:

# /usr/lib/xen/bin/xenstore-snoop 
DOM  PID      TX     OP
0    100313   0      XS_GET_DOMAIN_PATH: 6 -> /local/domain/6
0    100313   0      XS_TRANSACTION_START:  -> 930
0    100313   930    XS_RM: /local/domain/6 -> OK
0    100313   930    XS_MKDIR: /local/domain/6 -> OK
6    0        0      XS_READ: /local/domain/0/backend/vbd/6/0/state -> 4
6    0        0      XS_READ: device/vbd/0/state -> 3
0    0        -      XS_WATCH_EVENT: /local/domain/6/device/vbd/0/state FFFFFF0177B8F048
6    0        -      XS_WATCH_EVENT: device/vbd/0/state FFFFFF00C8A3A550
6    0        0      XS_WRITE: device/vbd/0/state 4 -> OK
0    0        0      XS_READ: /local/domain/6/device/vbd/0/state -> 4
6    0        0      XS_READ: /local/domain/0/backend/vbd/6/0/feature-barrier -> 1
6    0        0      XS_READ: /local/domain/0/backend/vbd/6/0/sectors -> 16777216
6    0        0      XS_READ: /local/domain/0/backend/vbd/6/0/info -> 0
6    0        0      XS_READ: device/vbd/0/device-type -> disk
6    0        0      XS_WATCH: cpu FFFFFFFFFBC2BE80 -> OK
6    0        -      XS_WATCH_EVENT: cpu FFFFFFFFFBC2BE80
6    0        0      XS_READ: device/vif/0/state -> 1
6    0        0      [ERROR] XS_READ: device/vif/0/type -> ENOENT

This makes the interactions immediately obvious. We can observe the Xen domain that's doing the request, the PID of the process (this only applies to dom0 control tools), the transaction ID, and the actual operations performed. This has already proven of use in several investigations.

Of course this being DTrace, this is only part of the story. We can use these probes to correlate system behaviour: for example, xenstored transactions are currently rather heavyweight, as they involve copying a large file; these probes can help demonstrate this. Using Python's DTrace support, we can look at which stack traces in xend correspond to which requests to the store; and so on.

This feature, whilst relatively minor, is part of an ongoing plan to improve the observability and RAS of Xen and the solutions Sun are building on top of it. It's very important to us to bring Solaris's excellent observability features to the virtualization space: you've seen the work with zones in this area, and you can expect a lot more improvements for the Xen case too.


I meant to say: after my previous post, I resurrected #opensolaris-dev: if you'd like to talk about OpenSolaris development in a non-hostile environment, please join!


#opensolaris Staring at the C

When OpenSolaris got started, #solaris was a channel filled with pointless rants about GNU-this and Linux-that. Beside complete wrong-headedness, it was a total waste of time and extremely hostile to new people. #opensolaris, in contrast, was actually pretty nice (for IRC!) - sure, the usual pointless discussions but it certainly wasn't hateful.

Recently I'm sad to say #opensolaris has become a really hostile, unpleasant place. I've seen new people arrive and be bullied by a small number of poisonous people until they went away (nice own goal, people!). So if anyone's looking for me for xVM stuff or whatever, I'll be in #onnv-scm or #solaris-xen as usual. And if you do so, please try to keep a civil tongue in your head - it's not hard.

Xen compatibility with Solaris Staring at the C

Maintaining the compatibility of hardware virtualization solutions can be tricky. Below I'll talk about two bugs that needed fixes in the Xen hypervisor. Both of them have unfortunate implications for compatibility, but thankfully, the scope was limited.

6616864 amd64 syscall handler needs fixing for xen 3.1.1

Shortly after the release of 3.1.1, we discovered that all 64-bit processes in a Solaris domain would segfault immediately. After much debugging and head-scratching, I eventually found the problem. On AMD64, 64-bit processes trap into the kernel via the syscall instruction. Under Xen, this will obviously trap to the hypervisor. Xen then 'bounces' this back to the relevant OS kernel.

On real hardware, %rcx and %r11 have specific meanings. Prior to 3.1.1, Xen happened to maintain these values correctly, although the layout of the stack is very different from real hardware. This was broken in the 3.1.1 release: as a result, the %rflags of each process was corrupted, and segfaulted almost immediately. We fixed the bug in Solaris, so we would still work with 3.1.1. This was also fixed (restoring the original semantics) in Xen itself in time for the 3.1.2 release. So there's a small window (early Solaris xVM releases and community versions of Xen 3.1.1) where we're broken, but thankfully, we caught this pretty early. The lesson to be drawn? Clear documentation of the hypervisor ABI would have helped, I think.

6618391 64-bit xVM lets processes fiddle with kernelspace, but Xen bug saves us

Around the same time, I noticed during code inspection that we were still setting PT_USER in PTE entries on 64-bit. This had some nasty implications, but first, some background.

On 32-bit x86, Xen protects itself via segmentation: it carves out the top 64Mb, and refuses to let any of the domains load a segment selector that allows read or write access to that part of the address space. Each domain kernel runs in ring 1 so can't get around this. On 64-bit, this hack doesn't work, as AMD64 does not provide full support for segmentation (given what a legacy technique it is). Instead, and somewhat unfortunately, we have to use page-based permissions via the VM system. Since page table entries only have a single bit ("user/supervisor") instead of being able to say "ring 1 can read, but ring 3 cannot", the OS kernel is forced into ring 3. Normally, ring 3 is used for userspace code. So every time we switch between the OS kernel and userspace, we have to switch page tables entirely - otherwise, the process could use the kernel page tables to write to kernel address-space.

Unfortunately, this means that we have to flush the TLB every time, which has a nasty performance cost. To help mitigate this problem, in Xen 3.0.3, an incompatible change was made. Previously, so that the kernel (running in ring 3, remember) could access its address space, it had to set PT_USER int its kernel page table entries (PTEs). With 3.0.3, this was changed: now, the hypervisor would automatically do that. Furthermore, if Xen did see a PTE with PT_USER set, then it assumed this was a userspace mapping. Thus, it also set PT_GLOBAL, a hardware feature - if such a bit is set, then a corresponding TLB entry is not flushed. This meant that switching between userspace and the OS kernel was much faster, as the TLB entries for userspace were no longer flushed.

Unfortunately, in our kernel, we missed this change in some crucial places, and until we fixed the bug above, we were setting PT_USER even on kernel mappings. This was fairly obviously A Bad Thing: if you caught things just right, a kernel mapping would still be present in the TLB when a user-space program was running, allowing userspace to read from the kernel! And indeed, some simple testing showed this:

dtrace -qn 'fbt:genunix::entry /arg0 > `kernelbase/ { printf("%p ", arg0); }' | \
    xargs -n 1 ~johnlev/bin/i386/readkern | while read ln; do echo $ln::whatis | mdb -k ; done

With the above use of DTrace, MDB, and a little program that attempts to read addresses, we can see output such as:

ffffff01d6f09c00 is ffffff01d6f09c00+0, allocated as a thread structure
ffffff01c8c98438 is ffffff01c8c983e8+50, bufctl ffffff01c8ebf8d0 allocated from as_cache
ffffff01d6f09c00 is ffffff01d6f09c00+0, allocated as a thread structure
ffffff01d44d7e80 is ffffff01d44d7e80+0, bufctl ffffff01d3a2b388 allocated from kmem_alloc_40
ffffff01d44d7e80 is ffffff01d44d7e80+0, bufctl ffffff01d3a2b388 allocated from kmem_alloc_40

Thankfully, the fix was simple: just stop adding PT_USER to our kernel PTE entries. Or so I thought. When I did that, I noticed during testing that the userspace mappings weren't getting PT_GLOBAL set after all (big thanks to MDB's ::vatopfn, which made this easy to see).

Yet more investigation revealed the problem to be in the hypervisor. Unlike certain other popular OSes used with Xen, we set PTE entries in page tables using atomic compare and swap operations. Remember that under Xen, page tables are read-only to ensure safety. When an OS kernel tries to write a PTE, a page fault happens in Xen. Xen recognises the write as an attempt to update a PTE and emulates it. However, since it hadn't been tested, this emulation path was broken: it wasn't doing the correct mangling of the PTE entry to set PT_GLOBAL. Once again, the actual fix was simple.

By the way, that same putback also had the implementation of:

6612324 ::threadlist could identify taskq threads

I'd been doing an awful lot of paging through ::threadlist output recently, and always having to jump through all the (usually irrelevant) taskq threads was driving me insane. So now you can just specify ::threadlist -t and get a much, much, shorter list.


OpenSolaris xVM now available in SX:CE Staring at the C

Build 75 of Solaris Express Community Edition is now out, and it includes our bits. So go ahead, install build 75, select the xVM entry in grub and play around! We're still working on updating the documentation on our community page; in the meantime, you have manpages - start at xVM(5) (and note that the forthcoming build 76 has much improved versions of those docs).

You might be wondering if your machine is capable of running Windows or other operating systems under HVM. Joe Bonasera has a simple program you can run that will tell you. Alternatively, if you're already running with our bits, running 'virt-install' will tell you - if it asks you about creating a fully-virtualized domain, then it should work, and you can end up with a desktop like Russell Blaine's.

Nils, meanwhile, describes how we've improved the RAS of the hypervisor by integrating it with Solaris crash dumps here. This feature has saved our lives numerous times during development as those of us who've done the "hex dump" debugging thing know very well.

Of course, we're not done yet - we have bugs to fix and rough edges to smooth out, and we have significant features to implement. One of the major items we're working on in the near future is the upgrade to Xen 3.1.1 (or possibly 3.1.2, depending on timelines!). This will give us the ability to do live migration of HVM domains, along with a host of other features and improvements.


Automatic start/stop of Xen domains Staring at the C

After answering a query, I said I'd write a blog entry describing what changes we've made to support clean shutdown and start of Xen domains.

Bernd refers to an older method of auto-starting Xen domains used on Linux. In fact, this method has been replaced with the configuration parameters on_xend_start and on_xend_stop. Setting these can ensure that a Xen domain is cleanly shut down when the host (dom0) is shut down, and started automatically as needed. For somewhat obvious reasons, we'd like to have the same semantics as used with zones, if not quite the same implementation (yet, at least).

When I started looking at this, I realised that the community solution had some problems:

Clean shutdown wasn't the default

It seems obvious that by default I'd like my operating systems to shut down cleanly. Only in unusual circumstances would I be happy with an OS being unceremoniously destroyed. We modified our Xen gate to default to on_xend_stop=shutdown.

Suspend on shutdown was dangerous

It is possible to specify on_xend_stop=suspend; this will save the running state to an image file and then destroy the domain (like xm save). However, there is not corresponding on_xend_start setting, nor any logic to ensure that the values match. This is both apparently useless and even dangerous, since starting a new domain but with old file-system state from a suspended domain could be problematic. We've disabled this functionality.

Actions are tied into xend

This was the biggest problem for us: as modelled, if somebody stops xend, then all the domains would be shut down. Similarly, if xend restarts for whatever reason (say, a hardware error), it would start domains again. We've modified this on Solaris. Instead of xend operating on these values, we introduce a new SMF service, system/xctl/domains, that auto-starts/stops domains as necessary. This service is pretty similar to system/zones. We've set up the dependencies such that a restart of the Xen daemons won't cause any running domains to be restarted. For this to work properly within the SMF framework, we also had to modify xend to wait for all domains to finish their state transitions.

You can find our changes here. And yes, we still need to take system/xctl/domains to PSARC.

Clean shutdown implementation

You might be wondering how the dom0 even asks the guest domains to shut down cleanly. This is done via a xenstore entry, control/shutdown. The control tools write a string into this entry, which is being "watched" by the domain. The kernel then reads the value and responds appropriately (xen_shutdown()), triggering a user-space script via the sysevent framework. If nothing happens for a while, it's possible that the script couldn't run for whatever reason. In that case, we time-out and force a "dirty" shutdown from within the kernel.


Solaris Xen update Staring at the C

After an undesirably long time, I'm happy to say that another drop of Solaris on Xen is available here. Sources and other sundry parts are here. Documentation can be found at our community site, and you can read Chris Beal describe how to get started with the new bits.

As you might expect, there's been a massive amount of change since the last OpenSolaris release. This time round, we are based on Xen 3.0.4 and build 66 of Nevada. As always, we'd love to hear about your experiences if you try it out, either on the mailing list or the IRC channel.

In many ways, the most significant change is the huge effort we've put in to stabilize our codebase; a significant number of potential hangs, crashes, and core dumps have been resolved, and we hope we're converging on a good-quality release. We've started looking seriously at performance issues, and filling in the implementation gaps. Since the last drop, notable improvements include:

PAE support
By default, we now use PAE mode on 32-bit, aiding compatibility with other domain 0 implementations; we also can boot under either PAE or non-PAE, if the Xen version has 'bi-modal' support. This has probably been the most-requested change missing from our last release.
HVM support
If you have the right CPU, you can now run fully-virtualized domains such as Windows using a Solaris dom0! Whilst more work is needed here, this does seem to work pretty well already. Mark Johnson has some useful tips on using HVM domains.
New management tools
We have integrated the virt- suite of management tools. virt-manager provides a simple GUI for controlling guest domains on a single host. virt-install and virsh are simple CLIs for installing and managing guest domains respectively. Note that parts of these tools are pre-alpha, and we still have a significant amount of work to do on them. Nonetheless, we appreciate any comments...
PV framebuffer
Solaris dom0 now supports the SDL-based paravirt framebuffer backend, which can be used with domUs that have PV framebuffer support.
Virtual NIC support
The Ethernet bridge used in the previous release has been replaced with virtual NICs from the Crossbow project. This enables future work around smart NICs, resource controls, and more.
Simplified Solaris guest domain install
It's now easy to install a new Solaris guest domain using the DVD ISO. The temporary tool in the last release, vbdcfg, has disappeared now as a result. William Kucharski has a walk-through.
Better SMF usage
Several of the xend configuration properties are now controlled using the SMF framework.
Managed domain support
We now support xend-managed domain configurations instead of using .py configuration files. Certain parts of this don't work too well yet (unfortunately all versions of Xen have similar problems), but we are plugging in the gaps here one by one.
Memory ballooning support
Otherwise known as support for dynamic xm mem-set, this allows much greater flexibility in partitioning the physical memory on a host amongst the guest domains. Ryan Scott has more details.
Vastly improved debugging support
Crash dump analysis and debugging tools have always been a critical feature for Solaris developers. With this release, we can use Solaris tools to debug both hypervisor crashes and problems with guest domains. I talk a little bit about the latter feature below.
xvbdb has been renamed
To simply be xdb. This was a very exciting change for certain members of our team.

We're still working hard on finishing things up for our phase 2 putback into Nevada (where "phase 1" was the separate dboot putback). As well as finishing this work, we're starting to look at further enhancements, in particular some features that are available in other vendors' implementations, such as a hypervisor-copy based networking device, blktap support, para-virtualized drivers for HVM domains (a huge performance fix), and more.

Debugging guest domains

Here I'll talk a little about one of the more minor new features that has nonetheless proven very useful. The xm dump-core command generates an image file of a running domain. This file is a dump of all memory owned by the running domain, so it's somewhat similar to the standard Solaris crash dump files. However, dump-core does not require any interaction with the domain itself, so we can grab such dumps even if the domain is unable to create a crash dump via the normal method (typically, it hangs and can't be interacted with), or something else prevents use of the standard Solaris kernel debugging facilities such as kmdb (an in-kernel debugger isn't very useful if the console is broken).

However, this also means that we have no control over the format used by the image file. With Xen 3.0.4, it's rather basic and difficult to work with. This is much improved in Xen 3.1, but I haven't yet written the support for the new format.

To add support for debugging such image files of a Solaris domain, I modified mdb(1) to understand the format of the image file (the alternative, providing a conversion step, seemed unneccessarily awkward, and would have had to throw away information!). As you can see if you look around usr/src/cmd/mdb in the source drop, mdb(1) loads a module called mdb_kb when debugging such image files. This provides simple methods for reading data from the image file. For example, to read a particular virtual address, we need to use the contents of the domain's page tables in the image file to resolve it to a physical page, then look up the location of that page in the file. This differs considerably from how libkvm works with Solaris crash dumps: there, we have a big array of address translations, which is used directly, instead of the page table contents.

In most other respects, debugging a kernel domain image is much the same as a crash dump:

# xm dump-core solaris-domu core.domu
# mdb core.domu
mdb: warning: dump is from SunOS 5.11 onnv-johnlev; dcmds and macros may not match kernel implementation
Loading modules: [ unix genunix specfs dtrace xpv_psm scsi_vhci ufs ... sppp ptm crypto md fcip logindmux nfs ]
> ::status
debugging domain crash dump core.domu (64-bit) from sxc16
operating system: 5.11 onnv-johnlev (i86pc)
> ::cpuinfo
  0 fffffffffbc4b7f0  1b   40    9 169  yes   yes t-1408926 ffffff00010bfc80 sched
> ::evtchns
Type          Evtchn IRQ IPL CPU ISR(s)
evtchn        1      257 1   0   xenbus_intr
evtchn        2      260 9   0   xenconsintr
virq:debug    3      256 15  0   xen_debug_handler
virq:timer    4      258 14  0   cbe_fire
evtchn        5      259 5   0   xdf_intr
evtchn        6      261 6   0   xnf_intr
evtchn        7      262 6   0   xnf_intr
> ::cpustack -c 0
dispatch_hilevel+0x1f(102, 0)
do_interrupt+0x11d(ffffff00010bfaf0, fffffffffbc86f98)
xen_callback_handler+0x42b(ffffff00010bfaf0, fffffffffbc86f98)
dispatch_softint+0x38(9, 0)
do_interrupt+0x140(ffffff0001593520, fffffffffbc86048)
xen_callback_handler+0x42b(ffffff0001593520, fffffffffbc86048)
turnstile_wakeup+0x16e(fffffffec33a64d8, 0, 1, 0)
xenconswput+0x64(fffffffec42cb658, fffffffecd6935a0)
putnext+0x2f1(fffffffec42cb3b0, fffffffecd6935a0)
ldtermrmsg+0x235(fffffffec42cb2b8, fffffffec3480300)
ldtermrput+0x43c(fffffffec42cb2b8, fffffffec3480300)
putnext+0x2f1(fffffffec42cb560, fffffffec3480300)

Note that both ::cpustack and ::cpuregs are capable of using the actual register set at the time of the dump (since the hypervisor needs to store this for scheduling purposes). You can also see the ::evtchns dcmd in action here; this is invaluable for debugging interrupt problems (and we've fixed a lot of those over the past year or so!).

Currently, mdb_kb only has support for image files of para-virtualized Solaris domains. However, that's not the only interesting target: in particular, we could support mdb in live crash dump mode against a running Solaris domain, which opens up all sorts of interesting debugging possibilities. With a small tweak to Solaris, we can support debugging of fully-virtualized Solaris instances. It's not even impossible to imagine adding Linux kernel support to mdb(1), though it's hard to imagine there would be a large audience for such a feature...


Python and DTrace in build 65 Staring at the C

A significant portion of the Xen control tools are written in Python, in particular xend. It's been somewhat awkward to observe what the daemon is doing at times, necessitating an endless cycle of 'add printf; restart' cycles. A while ago I worked on adding DTrace support to the Python packages we ship in OpenSolaris, and these changes have now made it into the latest build, 65.

As is the case with the other providers people have worked on such as Ruby and Perl, there's two simple probes for function entry and function exit. arg0 contains the filename, arg1 the function name, and arg2 has the line number. So given this simple script to trace the functions called by a particular function invocation, restricted to a given module name:

#!/usr/sbin/dtrace -ZCs

#pragma D option quiet

    /copyinstr(arg1) == $2 && strstr(copyinstr(arg0), $1) != NULL/ {
        self->trace = 1;

    /copyinstr(arg1) == $2 && strstr(copyinstr(arg0), $1) != NULL/ {
        self->trace = 0;

    /self->trace && strstr(copyinstr(arg0), $3) != NULL/ {
        printf("%s %s (%s:%d)\n", probename == "function-entry" ? "->" : "<-",
            copyinstr(arg1), copyinstr(arg0), arg2);

We can run it as follows and get some useful results:

# ./pytrace.d \"hg.py\" \"clone\" \"mercurial\" -c 'hg clone /tmp/test.hg'
-> clone (build/proto/lib/python/mercurial/hg.py:65)
-> repository (build/proto/lib/python/mercurial/hg.py:54)
-> _lookup (build/proto/lib/python/mercurial/hg.py:31)
-> _local (build/proto/lib/python/mercurial/hg.py:16)
-> __getattribute__ (build/proto/lib/python/mercurial/demandload.py:56)
-> module (build/proto/lib/python/mercurial/demandload.py:53)

Of course, this being DTrace, we can tie all of this into general system activity as usual. I also added "ustack helper" support. This is significantly more tricky to implement, but enormously useful for following the path of Python code. For example, imagine we want to look at what's causing write()s in the clone operation above. As usual:

#!/usr/sbin/dtrace -Zs

syscall::write:entry /pid == $target/
        @[jstack(20)] = count();

        trunc(@, 2);

Note that we're using jstack() to make sure we have enough space allocated for the stack strings reported. Now as well as the C stack, we can see what Python functions are involved in the user stack trace:

# ./writes.d -c 'hg clone /tmp/test.hg'
                [ build/proto/lib/python/mercurial/transaction.py:49 (add) ]
                [ build/proto/lib/python/mercurial/revlog.py:1137 (addgroup) ]
                [ build/proto/lib/python/mercurial/localrepo.py:1849 (addchangegroup) ]
                [ build/proto/lib/python/mercurial/localrepo.py:1345 (pull) ]

                [ build/proto/lib/python/mercurial/localrepo.py:1849 (addchangegroup) ]
                [ build/proto/lib/python/mercurial/localrepo.py:1345 (pull) ]
                [ build/proto/lib/python/mercurial/localrepo.py:1957 (clone) ]

Creating a ustack helper

As anyone who's come across the Java dtrace helper source will know, creating a ustack helper is rather a black art.

When a ustack helper is present, it is called in-kernel for each entry in a stack when the ustack() action occurs (source). The D instructions in the helper action are executed such that the final string value is taken as the result of the helper. Typically for Java, there is no associated C function symbol for the PC value at that point in the stack, so the result of the helper is used directly in the stack trace. However, this is not true for Python, so that's why you see a different format above: the normal stack entry, plus the result of the helper in annotated form where it returned a result (in square brackets).

The helper is given two arguments: arg0 is the PC value of the stack entry, and arg1 is the frame pointer. The helper is expected to construct a meaningful string from just those values. In Python, the PyEval_EvalFrame function always has a PyFrameObject * as one of its arguments. By having the helper look at this pointer value and dig around the structures, we can find pointers to the strings containing the file name and function, as well as the line number. We can copy these strings in, and, using alloca() to give ourselves some scratch space, build up the annotation string you see above.

Debugging helpers isn't particularly easy, since it lives and runs in probe context. You can use mdb's DTrace debugging facilities to find out what happened, and some careful mapping between the failing D instructions and the helper source can pinpoint the problem. Using this method it was relatively easy to get a working helper for x86 32-bit. Both SPARC and x86 64-bit proved more troublesome though. The problems were both related to the need to find the PyFrameObject * given the frame pointer. On amd64, the function we needed to trace was passing the arguments in registers, as defined architecturally, so the argument wasn't accessible on the stack via the frame pointer. On SPARC, the pointer we need was stored in a register that was subsequently re-used as a scratch register. Both problems were solved, rather cheesily, by modifying the way the function was called.


Booting para-virtualised OS instances Staring at the C

Whilst I'm waiting for my home directory to reappear, I thought I'd mention some of the work I've done to support easy booting of domains in Xen.

For dom0 to be able to boot a para-virtualised domU, it needs to be able to bootstrap it. In particular, it needs to be able to read the kernel file and its associated ramdisk so it can hand off control to the kernel's entry point when the domain is created. And we must somehow make these files accessible in the dom0. Previously, you had to somehow copy out the files from the domU filesystem into dom0. This was often difficult (consider getting files off an ext2 filesystem in a Solaris dom0), and was obviously prone to errors such as forgetting to update the copies when upgrading the kernel.

For a while now Xen has had support for a bootloader. This runs in userspace and is responsible for copying out the files (that specified by kernel and ramdisk in the domain's config file) to a temporary directory in dom0; the files are then passed on to the domain builder. Xen has shipped with a bootloader called pygrub. Whilst somewhat confusingly named, it essentially emulated the grub menu. It had backends for a couple of Linux filesystems written in Python and worked by searching for a grub.conf file, then presenting a lookalike grub menu for the user to interact with. When an entry was selected, the specified files would be read off the filesystem and passed back to the builder.

This worked reasonably well for Linux, but we felt there was a number of problems. First, the interactive menu only worked for first boot; subsequent reboots would automatically choose an entry without allowing user interaction (though this is now fixed in xen-unstable). Its interactive nature seemed quite a stumbling block for things like remote domain management; you really don't want to babysit domain creation. Also, the implementation of the filesystem backends wasn't ideal; there was only limited Linux filesystem support, and it didn't work very well.

We've adapted pygrub to help with some of these issues. First, we replaced the filesystem code with a C library called libfsimage. The intention here is to provide a stable API for accessing filesystem images from userspace. Thus it provides a simple interface for reading files from a filesystem image and a plugin architecture to provide the filesystem support. This plugin API is also stable, allowing filesystems past, present and future to be transparently supported. Currently there are plugins for ext2, reiserfs, ufs and iso9660, and we expect to have a zfs plugin soon. We borrowed the grub code for all of these plugins to simplify the implementation, but the API allows for any implementation.

Some people were suggesting solutions involving loopback mounts. This was problematic for us for two main reasons. First, filesystem support in the different dom0 OS's is far from complete; for example, Solaris has no ext2 support, and Linux has no (real) ZFS support. Second, and more seriously, it exposes a significant gap in terms of isolation: the dom0 kernel FS code must be entirely resilient against a corrupt domU filesystem image. If we are to consider domU's as untrusted, it doesn't make sense to leave this open as an attack vector.

Another simple change we made was to allow operation without a grub.conf at all. You can specify a kernel and ramdisk and make pygrub automatically load them from the domU filesystem. Even easier, you can leave out all configuration altogether, and a Solaris domU will automatically boot the correct kernel and ramdisk. This makes setting up your config for a domU much easier.

pygrub understands both fdisk partitions and Solaris slices, so simply specifying the disk will cause the bootloader to look for the root slice and grab the right files to boot.

There's more work we can do yet, of course.


64-bit Python in Nevada build 53 Staring at the C

Coming to you in build 53 of OpenSolaris is 64-bit Python, which I worked on with Laszlo Peter of JDS fame. This means Python modules can make use of 64-bit versions of libraries, as well as 64-bit plugins. You can run the 64-bit version of Python via the path /usr/bin/amd64/python (on x86). This path isn't quite set in stone yet, so don't rely on it.

This facility didn't previously exist on any OS, so we had to make some innovations in terms of how Python lets modules build and load. In particular the Makefile used by Python previously hard-coded certain compiler flags etc. We had to make this dynamic. Also, we had to make some modifications to where Python looks for .so files when loading modules. Previously it would just assume that, say, /usr/lib/python2.4/foo.so was of the correct word size. Now, if it's running 64-bit, it will look for /usr/lib/python2.4/64/foo.so.

Similarly, building a Python module using the 64-bit Python will automagically install the .so file in the right place. Thanks to their architecture-independence, we don't need the same tricks for the .pyc files.

The need for this arose from the continuing work on Solaris dom0's running under Xen. The kernel/hypervisor interfaces provided are not 64-bit clean in the sense that 32-bit tools cannot deal with 64-bit domains; as a result, we need to run (the Python-based) xend as a native binary.

As an added bonus, Laca has also upgraded to Python 2.4.4, which finally enables the curses module on Solaris; also fixed are some niggling problems with accidental regeneration of .pyc files.


Save/restore of MP Solaris domUs Staring at the C

In honour of our new release of OpenSolaris on Xen, here's some details on the changes I've made to support save/resume (and hence migration and live migration) with MP Solaris domUs. As before, to actually see the code I'm describing, you'll need to download the sources - sorry about that.

Under Xen, the suspend process is somewhat unusual, in that only the CPU context for (virtual) CPU0 is stored in the state file. This implies that the actual suspend operation must be performed on CPU0, and we have to find some other way of capturing CPU context (that is, the register set) for the other CPUs.

In the Linux implementation, Xen suspend/resume works by using the standard CPU hotplug support for all CPUs other than CPU0. Whilst this works well for Linux, this approach is more troublesome for Solaris. Hot-unplugging the other CPUs doesn't match well with the mapping between Xen and Solaris notions of "offline" CPUs (the interested can read the big comment on line 406 of usr/src/uts/i86xen/os/mp_xen.c for a description of how this mapping currently works). In particular, offline CPUs on Solaris still participate in IPI interrupts, whilst a "down" VCPU in Xen cannot.

In addition, the standard CPU offlining code in Solaris is not built for this purpose; for example, it will refuse to offline a CPU with bound threads, or the last CPU in a partition.

However, all we really need to do is get the other CPUs into a known state which we can recover during the resume process. All the dispatcher data structures etc. associated with the CPUs can remain in place. To this end, we can use pause_cpus() on the other CPUs. By replacing the pause handler with a special routine (cpu_pause_suspend()), we can store the CPU context via a setjmp(), waiting until all CPUs have reached the barrier. We need to disable interrupts (or rather, Xen's virtualized equivalent of interrupts), as we have to tear down all the interrupts as part of the suspend process, and we need to ensure none of the CPUS go wandering off.

Once all CPUs are blocked at the known synchronisation point, we can tell Xen to "down" the other VCPUs so they can no longer run, and complete the remaining cleanup we need to do before we tell Xen we're ready to stop via HYPERVISOR_suspend().

On resume, we will come back on CPU0, as Xen stored the context for that CPU itself. After redoing some of the setup we tore down during suspend, we can move on to resuming the other CPUs. For each CPU, we call mach_cpucontext_restore(). We use the same Xen call used to create the CPUs during initial boot. In this routine, we fiddle a little bit with the context saved in the jmpbuf by setjmp(); because we're not actually returning via a normal longjmp() call, we need to emulate it. This means adjusting the stack pointer to simulate a ret, and pretending we've returned 1 from setjmp() by setting the %eax or %rax register in the context.

When each CPU's context is created, it will look as if it's just returned from the setjmp() in cpu_pause_suspend(), and will continue upon its merry way.

Inevitably, being a work-in-progress, there are still bugs and unresolved issues. Since offline CPUs won't participate in a cpu_pause(), we need to make sure that those CPUs (which will typically be sitting in the idle loop) are safe; currently this isn't being done. There are also some open issues with 64-bit live migration, and suspending SMP domains with virtual disks, which we're busy working on.


Fun with stack corruption Staring at the C

Today, we were seeing a very odd crash in some xen code. The core dump wasn't of great use, since both %eip and %ebp were zeroed, which means no backtrace. Instead I attached mdb to the process and started stepping through to see what was happening. It soon transpired that we were crashing after successfully executing a C function called xspy_introduce_domain(), but before we got back to the Python code that calls into it. After a little bit of head-scratching, I looked closer at the assembly for this function:

xs.so`xspy_introduce_domain:            pushl  %ebp
xs.so`xspy_introduce_domain+1:          movl   %esp,%ebp
xs.so`xspy_introduce_domain+0x57:       subl   $0xc,%esp
xs.so`xspy_introduce_domain+0x5a:       leal   -0xc(%ebp),%eax
xs.so`xspy_introduce_domain+0x5d:       pushl  %eax
xs.so`xspy_introduce_domain+0x5e:       leal   -0x8(%ebp),%eax
xs.so`xspy_introduce_domain+0x61:       pushl  %eax
xs.so`xspy_introduce_domain+0x62:       leal   -0x2(%ebp),%eax
xs.so`xspy_introduce_domain+0x65:       pushl  %eax
xs.so`xspy_introduce_domain+0x66:       pushl  $0xc4b13114
xs.so`xspy_introduce_domain+0x6b:       pushl  0xc(%ebp)
xs.so`xspy_introduce_domain+0x6e:       call   +0x43595220      
xs.so`xspy_introduce_domain+0x11f:      leave
xs.so`xspy_introduce_domain+0x120:      ret

Seems OK - we're pushing three pointers onto the stack (+0x5a-0x65) and two other arguments. Let's look at the sources:

static PyObject *xspy_introduce_domain(XsHandle *self, PyObject *args)
    domid_t dom;
    unsigned long page;
    unsigned int port;

    struct xs_handle *xh = xshandle(self);
    bool result = 0;

    if (!xh)
        return NULL;
    if (!PyArg_ParseTuple(args, "ili", &dom, &page, &port))
        return NULL;

Looking up the definition of PyArg_ParseTuple() gave me the clue as to the problem. The format string specifies that we're giving the addresses of an int, long, and int. Yet in the assembly, the offsets of the leal instructions indicate we're pushing addresses to two 32-bit storage slots, and one 16-bit slot. So when PyArg_ParseTuple() writes its 32-bit quantity, it's going to overwrite two more bytes than it should.

As it happens, we're at the very top of the local stack storage space (-0x2(%ebp)). So those two bytes actually end up over-writing the bottom two bytes of the old %ebp we pushed at the start of the function. Then we pop that corrupted value back into the %ebp register via the leave. This has no effect until our caller calls leave itself. We move %ebp into %esp, then attempt to pop from the top of this stack into %ebp again. As it happens, the memory pointed to by the corrupt %ebp is zeroed; thus, we end up setting %ebp to zero. Finally, our caller does a ret, which pops another zero, but this time into %eip. Naturally this isn't a happy state of affairs, and we find ourselves with a core dump as described earlier.

Presumably this bug happened in the first place because someone didn't notice that domid_t was a 16-bit quantity. What's amazing is that nobody else has been hitting this problem!


A brief tour of i86xen Staring at the C

In this post, I'm going to give a quick walk through the major changes we've made so far in doing our port of Solaris to the Xen "platform". As we've only supplied a tarball of the source tree so far, I can't hyperlink to the relevant bits - sorry about that. As our code is still under heavy development, you can expect some of this code organisation to change significantly; nonetheless I thought this might be useful for those interested in peeking into the internals of what we've done so far.

As you might expect, the vast majority of the changes we've made reside in the kernel. To support booting Solaris under Xen (both domU and dom0, though as we've said the latter is still in the very early stages of development), we've introduced a new platform based on i86pc called i86xen. Wherever possible, we've tried to share common code by using i86pc's sources. There's still some cleanup we can do in this area.

Within usr/src/uts/i86xen, there are a number of Xen-specific source files:

Contains the PSM ("Platform-Specific Module") module for Xen. This mirrors the PSM provided by i86pc, but deals with the hypervisor-provided features such as the clock timer and the events system.
This contains the virtual root nexus driver "xendev". All of the virtual frontend drivers are connected to this.
The virtual block driver. It's currently non-functional with the version of Xen we're working with; we're working hard on getting it functional.
The guts of the kernel/hypervisor code. Amongst other things, it provides interfaces for dealing with events in evtchn.c and hypervisor_machdep.c (the hypervisor version of virtual interrupts, which hook into Solaris's standard interrupt system), the grant table in gnttab.c (used for providing access/transfer of pages between frontend and backend, suspend/resume in xen_machdep.c, and support routines for the debugger and the MMU code (mach_kdi.c and xen_mmu.c respectively).

As mentioned we use the i86pc code where possible, occasionally using #ifdefs where minor differences are found. In particular we re-use the i86pc HAT (MMU management) code found in i86pc/vm. You can also find code for the new boot method described by Joe Bonasera in i86pc/dboot and i86pc/boot.

A number of drivers that are needed by Xen but aren't i86xen specific live under usr/src/uts/common:

common/io/xenbus_*.c common/io/xenbus/
"xenbus" is a simple transport for configuration data provided by domain0; for example, it provides a node control/shutdown which will notify the domainU that the user has requested the domain to be shutdown (or suspended) from domain0's management tools. This code provides this support.
The virtual console frontend driver.
The virtual net device frontend driver.

As you might expect, the userspace changes we've needed to make so far have been reasonably minimal. Despite supporting the new i86xen platform definition, the only significant changes have been to usr/src/cmd/mdb/, where we've added some changes to better support debugging of the Xen-style x86 MMU.


Live migration of Solaris instances Staring at the C

Today we released our current source tree for our Solaris Xen port; for more details and the downloads see the Xen community on OpenSolaris.

One of the most useful features of Xen is its ability to package up a running OS instance (in Xen terminology, a "domainU", where "U" stands for "unprivileged"), plus all of its state, and take it offline, to be resumed at a later time. Recently we performed the first successful live migration of a running Solaris instance between two machines. In this blog I'll cover the various ways you can do this.

Para-virtualisation of the MMU

Typical "full virtualisation" uses a method known as "shadow page tables", whereby two sets of pagetables are maintained: the guest domain's set, which aren't visible to the hardware via cr3, and page tables visible to the hardware which are maintained by the hypervisor. As only the hypervisor can control the page tables the hardware uses to resolve TLB misses, it can maintain the virtualisation of the address space by copying and validating any changes the guest domain makes to its copies into the "real" page tables.

All these duplicates pages come at a cost of course. A para-virtualisation approach (that is, one where the guest domain is aware of the virtualisation and complicit in operating within the hypervisor) can take a different tack. In Xen, the guest domain is made aware of a two-level address system. The domain is presented with a linear set of "pseudo-physical" addresses comprising the physical memory allocated to the domain, as well as the "machine" addresses for each corresponding page. The machine address for a page is what's used in the page tables (that is, it's the real hardware address). Two tables are used to map between pseudo-physical and machine addresses. Allowing the guest domain to see the real machine address for a page provides a number of benefits, but slightly complicates things, as we'll see.


The simplest form of "packaging" a domain is suspending it to a file in the controlling domain (a privileged OS instance known as "domain 0"). A running domain can be taken offline via an xm save command, then restored at a later time with xm restore, without having to go through a reboot cycle - the domain state is fully restored.

xm save xen-7 /tmp/domain.img

An xm save notifies the domain to suspend itself. This arrives via the xenbus watch system on the node control/shutdown, and is handled via xen_suspend_domain(). This is actually remarkably simple. First we leverage Solaris's existing suspend/resume subsystem, CPR, to iterate through the devices attached to the domain's device nexus. This calls each of the virtual drivers we use (the network, console, and block device frontends) with a DDI_SUSPEND argument. The virtual console, for example, simply removes its interrupt handler in xenconsdetach(). As a guest domain, this tears down the Xen event channel used to communicate with the console backend. The rest of the suspend code deals with tearing down some of the things we use to communicate with the hypervisor and domain 0, such as the grant table mappings. Additionally we convert a couple of stored MFN (the frame numbers of machine addresses) values into pseudo-physical PFNs. This is because the MFNs are free to change when we restore the guest domain; as the PFNs aren't "real", they will stay the same. Finally we call HYPERVISOR_suspend() to call into the hypervisor and tell it we're ready to be suspended.

Now the domain 0 management tools are ready to checkpoint the domain to the file we specified in the xm save command. Despite the name, this is done via xc_linux_save(). Its main task is to convert any MFN values that the domain still has into PFN values, then write all its pages to the disk. These MFN values are stored in two main places; the PFN->MFN mapping table managed by the domain, and the actual pages of the page tables.

During boot, we identified which pages store the PFN->MFN table (see xen_relocate_start_info()), and pointed to that structure in the "shared info" structure, which is shared between the domain and the hypervisor. This is used to map the table in xc_linux_save().

The hypervisor keeps track of which pages are being used as page tables. Thus, after domain 0 has mapped the guest domain's pages, we write out the page contents, but modify any pages that are identified as page tables. This is handled by canonicalize_pagetable(); this routine replaces all PTE entries that contain MFNs with the corresponding PFN value.

There are a couple of other things that need to be fixed too, such as the GDT.

xm restore /tmp/domain.img

Restoring a domain is essentially the reverse operation: the data for each page is written into one of the machine addresses reserved for the "new" domain; if we're writing a saved page table, we replace each PTE's PFN value with the new MFN value used by the new instance of the domain.

Eventually the restored domain is given back control, coming out from the HYPERVISOR_suspend() call. Here we need to rebuild the event channel setup, and anything else we tore down before suspending. Finally, we return back from the suspend handler and continue on our merry way.


xm migrate xen-7 remotehost

A normal save/restore cycle happens on the same machine, but migrating a domain to a separate machine is a simple extension of the process. Since our save operation has replaced any machine-specific frame number value with the pseudo-physical frames, we can easily do the restore on a remote machine, even though the actual hardware pages given to the domainU will be different. The remote machine must have the Xen daemon listening on the HTTP port, which is a simple change in its config file. Instead of writing each page's contents to a file, we can transmit it across HTTP to the Xen daemon running on a remote machine. The restore is done on that machine in the same manner as described above.

Live Migration

xm migrate --live xen-7 remotehost

The real magic happens with live migration, which keeps the time the domain isn't kept running to a bare minimum (on the order of milliseconds). Live migration relies on the empirically observed data that an OS instance is unlikely to modify a large percentage of its pages within a certain time frame; thus, by iteratively copying over modified domain pages, we'll eventually reach a point where the remaining data to be copied is small enough that the actual downtime for a domainU is minimal.

In operation, the domain is switched to use a modified form of the shadow page tables described above, known as "log dirty" mode. In essence, a shadow page table is used to notify the hypervisor if a page has been written to, by keeping the PTE entry for the page read-only: an attempt to write to the page causes a page fault. This page fault is used to mark the domain page as "dirty" in a bitmap maintained by the hypervisor, which then fixes up the domain's page fault and allows it to continue.

Meanwhile, the domain management tools iteratively transfers unmodified pages to the remote machine. It reads the dirty page bitmap and re-transmits any page that has been modified since it was last sent, until it reaches a point where it can finally tell the domain to suspend, and switch over to running it on the remote machine. This process is described in more detail in Live Migration of Virtual Machines.

Whilst transmitting all the pages takes a while, the actual time between suspension and resume is typically very small. Live migration is pretty fun to watch happen; you can be logged into the domain over ssh and not even notice that the domain has migrated to a different machine.

Further Work

Whilst live migration is currently working for our Solaris changes, there's still a number of improvements and fixes that need to be made.

On x86, we usually use the TSC register as the basis for a high-resolution timer (heavily used by the microstate accounting subsystem). We don't directly use any virtualisation of the TSC value, so when we restore a domain, we can see a large jump in the value, or even see it go backwards. We handle this OK (once we fixed bug 6228819 in our gate!), but don't yet properly handle the fact that the relationship between TSC ticks and clock frequency can change between a suspend and resume. This screws up our notion of timing.

We don't make any effort to release physical pages that we're not currently using. This makes suspend/resume take longer than it should, and it's probably worth investigating what can be done here.

Currently many hardware-specific instructions and features are enabled at boot by patching in instructions if we discover the CPU supports it. For example we discovered a domain that died badly when it was migrated to a host that didn't support the sfence instruction. If such a kernel is migrated to a machine with different CPUs, the domain will naturally fail badly. We need to investigate preventing incompatible migrations (the standard Xen tools currently do no verification), and also look at whether we can adapt to some of these changes when we resume a domain.


Generating assembly structure offset values with CTF Staring at the C

The Solaris kernel contains a fair amount of assembly, and this often needs to access C structures (and in particular know the size of such structures, and the byte offsets of their members). Since the assembler can't grok C, we need to provide constant values for it to use. This also applies to the C library and kmdb.

In the kernel, the header assym.h provides these values; for example:

#define T_STACK 0x4
#define T_SWAP  0x68
#define T_WCHAN 0x44

These values are the byte offset of certain members into struct _kthread. For each of the types we want to reference from assembly, a template is provided in one of the offsets.in files. For the above, we can see in usr/src/uts/i86pc/ml/offsets.in:

_kthread        THREAD_SIZE
        t_pcb                   T_LABEL
        t_stk                   T_STACK
        t_lwpchan.lc_wchan      T_WCHAN
        t_flag                  T_FLAGS

This file contains structure names as well their members. Each of the members listed (which do not have to be in order, nor does the list need to be complete) cause a define to be generated; by default, an uppercase version of the member name is used. As can be seen, this can be overridden by specifying a #define name to be used. The THREAD_SIZE define corresponds to the bytesize of the entire structure (it's also possible to generate a "shift" value, which is log2(size)).

To generate the header with the right offset and size values we need, a script is used to generate CTF data for the needed types, which then uses this data to output the assym.h header. This is a Perl script called genoffsets, and the build invokes it with a command line akin to:

genoffsets -s ctfstabs -r ctfconvert cc < offsets.in > assym.h

The hand-written offsets.in file serves as input to the script, and it generates the header we need. The script takes the following steps:

  1. Two temporary files are generated from the input. One is a C file consisting of #includes and any other pre-processor directives. The other contains the meat of the offsets file.
  2. The C file containing all the includes is built with the compile line given (I have stripped the compiler options above for readability).
  3. ctfconvert is run on the built .o file.
  4. The pre-processor is run across the second file (the temporary offsets file)
  5. This pre-processed file is passed to ctfstabs along with the .o file.

ctfstabs reads the input offsets file, and for each entry, looks up the relevant value in the CTF data contained in the .o file passed to it. It has two output modes (which I'll come to shortly), and in this case we are using the genassym driver to output the C header. As you can see, this is a fairly simple process of processing each line of the input and looking up the type data in the CTF contained in the .o file.

A similar process is used for generating forth debug files for use when debugging the kernel via the SPARC PROM. This takes a different format of offsets file more appropriate to generating the forth debug macros, described in the forth driver.

To finish off the output header, the output from a small program called genassym (or, on SPARC, genconst) is appended. It contains a bunch of printfs of constants. A lot of those don't actually need to be there since they're simple constant defines, and the assembly file could just include the right header, but others are still there for reasons such as:

  • The macros which hide assembler syntax differences such as _MUL aren't implemented for the C compiler
  • The value is an enum type, which ctfstabs doesn't support
  • The constant is a complicated composed macro that the assembler can't grok

and other reasons. Whilst a lot of these could be cleaned up and removed from these files, it's probably not worth the development effort except as a gradual change.


Resource management of services Staring at the C

SMF introduced the notion of a service as a first-order object in the Solaris OS. Thus, you have administration interfaces capable of dealing with services (as opposed to the implicit service represented by a set of processes, for example). It doesn't seem very well known, but as Stephen Hahn mentions, this also applies to the resource management facilities of Solaris.

A service can be bound to a project (as well as a resource pool, which I won't go into here). This allows us to add resource controls to the project which will apply to the service as a whole, which is significantly more reliable and usable than trying to deal with individual daemons etc. Unfortunately, it's not as obvious to set up as it should be (of which more later), so here's a simple walkthrough.

We're going to set up a simple 'forkbomb' service, which simply runs this program:

#include <unistd.h>
#include <stdlib.h>

int main()
        int first = 1;
        while (1) {
                if (fork() > 0 && first)
                first = 0;

If you try running this program in an environment lacking resource controls, don't expect to be able to do much to your box except reboot it. Note the first parent does an exit(0) so that SMF doesn't think the service has failed (since we'll be a standard contract service). Here's the SMF manifest for our service:

<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<service_bundle type='manifest' name='forkbomb'>
<service name='application/forkbomb' type='service' version='1'>
                <method_context project='forkbomb'>
                        <method_credential user='root' />

        <instance name='default' enabled='false' />

Note that as well as setting the project in the method context, we've set a method credential; this is a workaround for a problem I'll come to later. Now we need to create the 'forkbomb' project for the service:

# projadd -K 'project.max-lwps=(privileged,100,deny)' forkbomb

Alternatively we could create a new user for the service to use, set the method credential to use that user, then change our 'forkbomb' project to allow the user to join it. It's important to note that this still works even for root, though, so that's what we've done here.

Finally, we can import the manifest as a service, then temporarily enable it (so it won't start next time we boot!):

# svccfg import /opt/forkbomb/manifest/forkbomb.xml
# svcadm enable -t forkbomb

The forkbomb is now running flat out, but under the constraints of the resource controls we set on its project. Thus we still have a running system, and have enough resources to disable our 'mis-behaving' service. Let's have a look at prstat:

Total: 148 processes, 266 lwps, load averages: 68.06, 20.50, 10.75
 21145 root      992K  244K run      1    0   0:00:03 1.4% forkbomb/1
 21132 root      992K  244K run     49    0   0:00:03 1.2% forkbomb/1
 21128 root      992K  244K run     31    0   0:00:03 1.1% forkbomb/1
 21113 root      992K  244K run     31    0   0:00:03 1.1% forkbomb/1
 21176 root      992K  244K run     33    0   0:00:03 1.1% forkbomb/1
 21124 root      992K  244K run     53    0   0:00:03 1.1% forkbomb/1
 21119 root      992K  244K run     52    0   0:00:03 1.1% forkbomb/1
 21156 root      992K  244K run     53    0   0:00:03 1.0% forkbomb/1
 21088 root      992K  244K run     52    0   0:00:03 1.0% forkbomb/1
 21136 root      992K  244K run     43    0   0:00:03 1.0% forkbomb/1
 21133 root      992K  244K run     44    0   0:00:03 1.0% forkbomb/1
 21097 root      992K  244K run     52    0   0:00:03 1.0% forkbomb/1
 21103 root      992K  244K run     56    0   0:00:03 1.0% forkbomb/1
 21092 root      992K  244K run     52    0   0:00:03 1.0% forkbomb/1
 21183 root      992K  244K run     53    0   0:00:03 1.0% forkbomb/1
   100      100   97M   24M   0.6%   0:04:47  95% forkbomb
     1        5   11M 8268K   0.3%   0:00:00 0.0% user.root
    10        3   18M 8060K   0.3%   0:00:00 0.0% group.staff
     0       40  135M   83M   2.6%   0:00:17 0.0% system
Total: 148 processes, 266 lwps, load averages: 70.60, 21.80, 11.24

As we might expect, there's a high system load (since our fork-bomb is ignoring the errors from fork() when it hits its resource limit). Note that the 'forkbomb' project has been clamped to a maximum of 100 LWPs, as you can see in the NPROC field. But most importantly, the system is still usable, and we can stop the troublesome service:

# svcadm disable forkbomb

After a while for the stop method to finish (or time out, both of which will kill all processes in the service contract), we're done!

I mentioned above that we needed to specify a method credential to work around a bug. This is bug 5093847. The way the property lookup works currently, if the use_profile property on the service isn't found, then none of the rest of the method context is examined. Setting the method credential has the side-effect of creating this property, so things work properly. This bug would also be nice to fix since we could directly set the project property via svccfg if the properties for the method context were always created. Any interested parties are strongly encouraged to have a go at fixing it - it's not currently being worked on, and I'd happy to help :)


Reducing CTF overhead Staring at the C

CTF (Compact C Type Format) encapsulates a reduced form of debugging information similar to DWARF and the venerable stabs. It describes types (structures, unions, typedefs etc.) and function prototypes, and is carefully designed to take a minimum of space in the ELF binaries. The kernel binaries that Sun ship have this data embedded as an ELF section (.SUNW_ctf) so that tools like mdb and dtrace can understand types. Of course, it would have been possible to use existing formats such as DWARF, but they typically have a large space overhead and are more difficult to process.

The CTF data is built from the existing stabs/DWARF data generated by the compiler's -g option, and replaces this existing debugging information in the output binary (ctfconvert performs this job).

For the sake of kmdb and crash dumps, the CTF data for each kernel binary is present in the memory image of a booted kernel. This implies it's paramount that the amount of CTF data is minimised. Since each kernel module will have references to common types such as cpu_t, there's a lot of duplicated type data in all the CTF sections. To help avoid this duplication, the kernel build uses a process known rather fancifully as 'uniquification'.


Each type in the CTF data has an integer ID associated with it. Observe that the main genunix kernel module has a large number of the common types I mention above in its CTF data. We can remove the duplicate data found in other modules by replacing the type data with references to the type data in CTF. This process is uniquification. Consider the bmc driver. After building and linking the bmc object, we want to add CTF for its types, but we also uniquify against the genunix binary, like so:

ctfmerge -L VERSION -d ../../intel/genunix/debug64/genunix -o debug64/bmc debug64/bmc_fe.o debug64/bmc_kcs.o

This command takes the CTF data in the objects comprising bmc (previously converted from stabs/DWARF by ctfconvert) and merges them together (removing any shared duplicates between the two different objects). Then it passes through this CTF data, and looks for any types that match ones in the uniqfile (which we specified with the -d option). For each matching type (for example, cpu_t), we replace any references to the local type definition with a reference to genunix's copy of the type data. Remember that type references are simply integer IDs, so this is just a matter of changing the type ID to the one found in genunix's CTF. Let's use ctfdump to look at the results:

$ ctfdump $SRC/uts/i86pc/bmc/debug64/bmc >bmc.ctf
$ ggrep -C2 bmc_kcs_send bmc.ctf
- Types ----------------------------------------------------------------------

  <32769> STRUCT bmc_kcs_send (3 bytes)
        fnlun type=113 off=0
        cmd type=113 off=8
        data type=5287 off=16

Here we see the first member of the struct bmc_kcs_send has a type ID of 113. Since this type ID isn't in the CTF, it must belong to our parent. We look for our parent, then find the type ID we're looking for:

$ grep cth_parname bmc.ctf
  cth_parname  = genunix
$ ctfdump $SRC/uts/intel/genunix/debug64/genunix >genunix.ctf
$ grep '<113>' genunix.ctf
  <113> TYPEDEF uint8_t refers to 86

This manual process is similar to how the CTF lookup actually happens. This uniquification process saves us a significant amount of CTF data, although it causes us some problems, which we'll discuss next.

CTF labels and additive merges

As noted above, all our uniquified modules will have type ID's that refer to the genunix shipped along with them. This means, of course, that if any of the types in genunix itself changes without these modules changing too, all the type references to genunix types will be wrong, since it works by type ID. So, what happens when we need to release kernel changes?

Since we obviously don't want to ship all these modules every time genunix needs to change, we have to keep the existing type IDs in the new genunix binary. But also, we want to have any new or changed types present and correct too. So, instead of doing a full merge and rewriting the existing CTF data in genunix, we perform an "additive merge". This retains the existing CTF types (and IDs) so that references from unchanged modules still point to the right types, and adds on new types.

To do an additive merge, we need to pass a 'withfile' to ctfmerge via its -w option. This first takes all the CTF in the withfile and adds it into the output CTF. Then the CTF from the objects passed to ctfmerge are uniquified against this data. Any remaining types after uniquification are then added on top of the withfile data. This preserves the existing type IDs for any older modules that uniquified against this genunix, whilst also adding the new types.

This 'withfile' is the previous version of genunix. When it was built the first time, we passed -L VERSION to ctfmerge. This adds a label with the value of the environment variable $VERSION. Typically this is something like Generic. When we do the additive merge, we pass in a different label equal to the patch ID of the build, and the additional types are marked with this label. For example, on a Solaris 9 system's genunix:

- Label Table ----------------------------------------------------------------

   5001 Generic
   5981 112233-12

Labels are nothing but a mapping from a string to a particular type ID. So here we see that the original types are numbered from 1 to 5001, and we've done an additive merge on top with the label "112233-12", which added more types.

CTF from the ip module

The genunix module contains many common types, but the ip module also contains a lot of types used by many kernel modules, but not found in genunix. To further reduce the amount of CTF in these modules, we merge in the CTF data found in ip into the genunix CTF. The modules can then uniquify against this combined data, removing many more duplicate types. Note that we don't do this for patch builds, as the ip module might not ship in a patch. Unfortunately this can cause problems (notably bug 6347000, though this isn't yet accessible from opensolaris.org).

Further reading


VisitorVille Staring at the C