vfio-user client in QEMU 10.1 Staring at the C

The recent release of QEMU 10.1 comes with its very own vfio-user client. You can try this out yourself relatively easily - please give it a go! [1]

vfio-user is a framework that allows implementing PCI devices in userspace. Clients (such as QEMU) talk the vfio-user protocol over a UNIX socket to a device server; it looks something like this:

[Figure: vfio-user architecture]

To implement a virtual device for a guest VM, there are generally two parts required: “frontend” driver code in the guest VM, and a “backend” device implementation.

The driver is usually - but by no means always - implemented in the guest OS kernel, and can be the same driver real hardware uses (such as a SATA controller), or something special for a virtualized platform (such as virtio-blk).

The job of the backend device implementation is to emulate the device in various ways: respond to register accesses, handle mappings, inject interrupts, and so on.

An alternative to virtual devices is the so-called "passthrough" device, which provides a thin virtualization layer on top of a real physical device, such as an SR-IOV Virtual Function from a physical NIC. For PCI devices, these are typically handled via the VFIO framework.

Other backend implementations can live in all sorts of different places: the host kernel, the emulator process, a hardware device, and so on.

For various reasons, we might want a userspace software device implementation, but not as part of the VMM process (such as QEMU) itself.

The rationale

For virtio-based devices, such “out of process device emulation” is usually done via vhost-user. This allows a device implementation to exist in a separate process, shuttling the necessary messages, file descriptors, and shared mappings between QEMU and the server.

However, this protocol is specific to virtio devices such as virtio-net and so on. What if we wanted a more generic device implementation framework? This is what vfio-user is for.

It is explicitly modelled on the vfio interface used for communication between QEMU and the Linux kernel vfio driver, but it has no kernel component: it’s all done in userspace. One way to think of vfio-user is that it smushes vhost-user and vfio together.

In the diagram above, we would expect much of the device setup and management to happen via vfio-user messages on the UNIX socket connecting the client to the SPDK server process: this part of the system is often referred to as the “control plane”. Once a device is set up, it is ready to handle I/O requests - the “data plane”. For performance reasons, this is often done via sharing device memory with the VM, and/or guest memory with the device. Both vhost-user and vfio-user support this kind of sharing, by passing file descriptors to mmap() across the UNIX socket.
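
To make the fd-passing part concrete, here is a rough, hypothetical sketch (not code from either project) of how one process can hand a memory file descriptor to a peer over a UNIX socket so that both can mmap() the same pages; the send_one_fd()/recv_and_map() names are purely illustrative:

/* Illustrative only: SCM_RIGHTS fd passing plus mmap() gives two
 * processes a shared view of the same memory. Error handling is minimal. */
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send one file descriptor as SCM_RIGHTS ancillary data on a UNIX socket. */
static int send_one_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = { 0 };

    memset(&u, 0, sizeof(u));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive the descriptor and map the shared region into our address space. */
static void *recv_and_map(int sock, size_t size)
{
    char byte;
    int fd;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = { 0 };

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return MAP_FAILED;
    memcpy(&fd, CMSG_DATA(CMSG_FIRSTHDR(&msg)), sizeof(int));

    /* Writes by either side are now visible to the other. */
    return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}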

libvfio-user

While it’s entirely possible to implement a vfio-user server from scratch, we have implemented a C library to make this easier: this handles the basics of implementing a typical PCI device, allowing device implementers to focus on the specifics of the emulation.
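
As a rough idea of what this looks like from the device implementer's side, here's a stripped-down sketch loosely following the library's documented example; treat the exact signatures as approximate and check libvfio-user.h for the real API:

/* Sketch of a minimal libvfio-user PCI device server; not a drop-in,
 * and signatures may differ from the current library headers. */
#include <stdbool.h>
#include <sys/types.h>
#include <err.h>
#include "libvfio-user.h"

/* BAR0 access callback: the library calls this when the client reads or
 * writes the region. */
static ssize_t
bar0_access(vfu_ctx_t *ctx, char *buf, size_t count, loff_t offset,
            bool is_write)
{
    /* Emulate the device registers here. */
    return count;
}

int main(void)
{
    vfu_ctx_t *ctx = vfu_create_ctx(VFU_TRANS_SOCK, "/tmp/my-device.sock", 0,
                                    NULL, VFU_DEV_TYPE_PCI);
    if (ctx == NULL)
        err(1, "vfu_create_ctx");

    /* Set up PCI config space and one small BAR. */
    vfu_pci_init(ctx, VFU_PCI_TYPE_CONVENTIONAL, PCI_HEADER_TYPE_NORMAL, 0);
    vfu_setup_region(ctx, VFU_PCI_DEV_BAR0_REGION_IDX, 4096, &bar0_access,
                     VFU_REGION_FLAG_RW, NULL, 0, -1, 0);

    if (vfu_realize_ctx(ctx) < 0)
        err(1, "vfu_realize_ctx");

    /* Wait for a client (e.g. QEMU) to connect, then service requests. */
    if (vfu_attach_ctx(ctx) < 0)
        err(1, "vfu_attach_ctx");
    while (vfu_run_ctx(ctx) >= 0)
        ;

    vfu_destroy_ctx(ctx);
    return 0;
}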

SPDK

At Nutanix, one of the main reasons we were interested in building all this was to implement virtual storage using the NVMe protocol. To do this we make use of SPDK. SPDK’s NVMe support was originally designed for use in a storage server context (NVMe over Fabrics). As it happens, there are lots of similarities between such a server, and how an NVMe PCI controller needs to work internally.

By re-using this nvmf (NVMe over Fabrics) subsystem in SPDK, alongside libvfio-user, we can emulate a high-performance virtualized NVMe controller for use by a VM. From the guest VM’s operating system, it looks just like a “real” NVMe card, but on the host, it’s using the vfio-user protocol along with memory sharing, ioeventfds, irqfds, etc. to talk to an SPDK server.

The Credits

While I was responsible for getting QEMU’s vfio-user client upstreamed, I was by no means the only person involved. My series was heavily based upon previous work at Oracle by John Johnson and others, and the original work on vfio-user in general was done by Thanos Makatos, Swapnil Ingle, and several others. And big thanks to Cédric Le Goater for all the reviews and help getting the series merged.

Further Work

While the current implementation is working well in general, there’s an awful lot more we could be doing. The client side has enough implemented to cover our immediate needs, but undoubtedly there are other implementations that will need extensions. The libvfio-user issues tracker captures a lot of the generic protocol work as well as some library-specific issues. In terms of virtual NVMe itself, we have lots of ideas for how to improve the SPDK implementation, across performance, correctness, and functionality.

There is an awful lot more I could talk about here about how this all works “under the hood”; perhaps I will find time to write some more blog posts…


  [1] Unfortunately, due to a late-breaking regression, you’ll need to use something a little bit more recent than the actual 10.1 release.

It's Always DNS Staring at the C

The meme is real, but I think this particular case is sort of interesting, because it turned out, ultimately, not to be due to DNS configuration, but to an honest-to-goodness bug in glibc.

As previously mentioned, I heavily rely on email-oauth2-proxy for my work email. Every now and then, I’d see a failure like this:

    Email OAuth 2.0 Proxy: Caught network error in IMAP server at [::]:1993 (unsecured) proxying outlook.office365.com:993 (SSL/TLS) - is there a network connection? Error type <class 'socket.gaierror'> with message: [Errno -2] Name or service not known

This always coincided with a change in my network, but - and this is the issue - the app never recovered. Even though other processes - even Python ones - could happily resolve outlook.office365.com, this long-running daemon remained stuck until it was restarted.

A bug in the proxy?

My first suspect here was this bit of code:

1761     def create_socket(self, socket_family=socket.AF_UNSPEC, socket_type=socket.SOCK_STREAM):
1762         # connect to whichever resolved IPv4 or IPv6 address is returned first by the system
1763         for a in socket.getaddrinfo(self.server_address[0], self.server_address[1], socket_family, socket.SOCK_STREAM):
1764             super().create_socket(a[0], socket.SOCK_STREAM)
1765             return

We’re looping across the getaddrinfo() results, but returning after the first one: there’s no attempt to account for the first address being unreachable while a later one would work.

Makes no sense, right? My guess was that somehow getaddrinfo() was returning IPv6 results first in this list, as at the time, the IPv6 configuration on the host was a little wonky. Perhaps I needed to tweak gai.conf?
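
One quick way to sanity-check that theory is a tiny standalone program (mine, not part of the proxy) that prints the getaddrinfo() results in the order the resolver returns them:

/* Print getaddrinfo() results in order, to see whether IPv6 or IPv4
 * addresses come back first for a given host. */
#include <stdio.h>
#include <string.h>
#include <netdb.h>
#include <sys/socket.h>

int main(int argc, char **argv)
{
    const char *host = argc > 1 ? argv[1] : "outlook.office365.com";
    struct addrinfo hints, *res, *a;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;

    int rc = getaddrinfo(host, "993", &hints, &res);
    if (rc != 0) {
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc));
        return 1;
    }

    for (a = res; a != NULL; a = a->ai_next) {
        char addr[128];
        getnameinfo(a->ai_addr, a->ai_addrlen, addr, sizeof(addr),
                    NULL, 0, NI_NUMERICHOST);
        printf("%s %s\n", a->ai_family == AF_INET6 ? "IPv6" : "IPv4", addr);
    }

    freeaddrinfo(res);
    return 0;
}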

However, while this was a proxy bug, it was not the cause of my issue.

DNS caching?

Perhaps, then, this is a local DNS cache issue? Other processes worked OK, even Python test programs, so it didn’t seem likely to be the system-level resolver caching stale results. Python itself doesn’t seem to cache results.

This case triggered (sometimes) when my VPN connection died. The openconnect vpnc script had correctly updated /etc/resolv.conf back to the original configuration, and as there was no caching in the way, the overall system state looked correct. But somehow, this process still had wonky DNS?

A live reproduction

I was not going to get any further until I had a live reproduction and the spare time to investigate it before restarting the proxy.

Once the proxy was in this state, the failure could be triggered easily by waking up fetchmail, which made it much easier to investigate what was happening each time.

So what was the proxy doing at line 1763 above? Here’s an strace snippet:

[pid  1552] socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 7
[pid  1552] setsockopt(7, SOL_IP, IP_RECVERR, [1], 4) = 0
[pid  1552] connect(7, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("ELIDED")}, 16) = 0
[pid  1552] poll([{fd=7, events=POLLOUT}], 1, 0) = 1 ([{fd=7, revents=POLLOUT}])
[pid  1552] sendto(7, "\250\227\1 \0\1\0\0\0\0\0\1\7outlook\toffice365\3c"..., 50, MSG_NOSIGNAL, NULL, 0) = 50
[pid  1552] poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLERR}])
[pid  1552] close(7)                    = 0

As we might expect, we’re opening a socket, connecting over UDP to port 53, and sending out a request to the DNS server.

This indicated the proximal issue: the DNS server IP address was wrong - the DNS servers used were still the ones originally set up by openconnect. The process wasn’t caching DNS results incorrectly; it was caching the DNS servers. Forever.

Nameserver configuration itself is not something that applications typically control, so the next question was: how does this work normally? When I update /etc/resolv.conf - or use any of the thousand other ways to configure name resolution on a modern Linux system - what makes getaddrinfo() continue to work?

/etc/resolv.conf and glibc

So, how does glibc account for changes in resolver configuration?

The /etc/resolv.conf file is the canonical location for DNS server addresses for processes (like Python ones) using the standard glibc resolver. Logically then, there must be a way for updates to the file to affect running processes.

In glibc, such configuration is represented by struct resolv_context. This is lazily initialized via __resolv_context_get()->maybe_init(), which looks like this:

 68 /* Initialize *RESP if RES_INIT is not yet set in RESP->options, or if
 69    res_init in some other thread requested re-initializing.  */
 70 static __attribute__ ((warn_unused_result)) bool
 71 maybe_init (struct resolv_context *ctx, bool preinit)
 72 {
 73   struct __res_state *resp = ctx->resp;
 74   if (resp->options & RES_INIT)
 75     {
 76       if (resp->options & RES_NORELOAD)
 77         /* Configuration reloading was explicitly disabled.  */
 78         return true;
 79
 80       /* If there is no associated resolv_conf object despite the
 81          initialization, something modified *ctx->resp.  Do not
 82          override those changes.  */
 83       if (ctx->conf != NULL && replicated_configuration_matches (ctx))
 84         {
 85           struct resolv_conf *current = __resolv_conf_get_current ();
 86           if (current == NULL)
 87             return false;
 88
 89           /* Check if the configuration changed.  */
 90           if (current != ctx->conf)
...

Let’s take a look at __resolv_conf_get_current():

123 struct resolv_conf *
124 __resolv_conf_get_current (void)
125 {
126   struct file_change_detection initial;
127   if (!__file_change_detection_for_path (&initial, _PATH_RESCONF))
128     return NULL;
129
130   struct resolv_conf_global *global_copy = get_locked_global ();
131   if (global_copy == NULL)
132     return NULL;
133   struct resolv_conf *conf;
134   if (global_copy->conf_current != NULL
135       && __file_is_unchanged (&initial, &global_copy->file_resolve_conf))

This is the file change detection code we’re looking for: _PATH_RESCONF is /etc/resolv.conf, and __file_is_unchanged() compares cached values such as the file’s mtime against the file currently on disk.

If it has in fact changed, then maybe_init() is supposed to go down the “reload configuration” path.

Now, in my case, this wasn’t happening. And the reason for this is line 83 above: the replicated_configuration_matches() call.

Resolution options

We already briefly mentioned gai.conf. There is also, as the resolver(3) man page says, this interface:

The resolver routines use configuration and state information
contained in a __res_state structure (either passed as the statep
argument, or in the global variable _res, in the case of the
older nonreentrant functions).  The only field of this structure
that is normally manipulated by the user is the options field.

So an application can dynamically alter options too, outside of whatever static configuration there is. And (I think) that’s why we have the replicated_configuration_matches() check:

static bool
replicated_configuration_matches (const struct resolv_context *ctx)
{
  return ctx->resp->options == ctx->conf->options
    && ctx->resp->retrans == ctx->conf->retrans
    && ctx->resp->retry == ctx->conf->retry
    && ctx->resp->ndots == ctx->conf->ndots;
}

The idea being, if the application has explicitly diverged its options, it doesn’t want them to be reverted just because the static configuration changed. Our Python application isn’t changing anything here, so this should still work as expected.
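
As a concrete (and entirely hypothetical) example of the kind of divergence that check is meant to respect, an application using the legacy interface might do something like this, after which glibc rightly leaves its options alone:

/* Hypothetical example of an application deliberately diverging its
 * resolver options from the static /etc/resolv.conf configuration. */
#include <resolv.h>
#include <netdb.h>

int main(void)
{
    /* Initialise the global resolver state (_res)... */
    res_init();

    /* ...and explicitly request single-request behaviour. From now on,
     * this process's options no longer match the replicated configuration,
     * so resolv.conf reloads won't clobber them. */
    _res.options |= RES_SNGLKUP;

    struct hostent *h = gethostbyname("example.com");
    return h == NULL;
}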

In fact, though, we find that it’s returning false: the dynamic configuration has somehow acquired the extra options RES_SNGLKUP and RES_SNGLKUPREOP. We’re now very close to the source of the problem!

A hack that bites

So what could possibly set these flags? It turns out the send_dg() function does:

 999                   {
1000                     /* There are quite a few broken name servers out
1001                        there which don't handle two outstanding
1002                        requests from the same source.  There are also
1003                        broken firewall settings.  If we time out after
1004                        having received one answer switch to the mode
1005                        where we send the second request only once we
1006                        have received the first answer.  */
1007                     if (!single_request)
1008                       {
1009                         statp->options |= RES_SNGLKUP;
1010                         single_request = true;
1011                         *gotsomewhere = save_gotsomewhere;
1012                         goto retry;
1013                       }
1014                     else if (!single_request_reopen)
1015                       {
1016                         statp->options |= RES_SNGLKUPREOP;
1017                         single_request_reopen = true;
1018                         *gotsomewhere = save_gotsomewhere;
1019                         __res_iclose (statp, false);
1020                         goto retry_reopen;
1021                       }

Now, I don’t believe the relevant nameservers have such a bug. Rather, what seems to be happening is that when the VPN connection drops, making the servers inaccessible, we hit this path. These flags are then treated by maybe_init() as if the client application had set them, and had thus deliberately diverged from the static configuration. As the application itself has no control over these options being set like this, this seemed like a real glibc bug.

The fix

I originally reported this to the list back in March; I was not confident in my analysis but the maintainers confirmed the issue. More recently, they fixed it. The actual fix was pretty simple: apply the workaround flags to statp->_flags instead, so they don’t affect the logic in maybe_init(). Thanks DJ Delorie!

Scroll wheel behaviour in vim with gnome-terminal Staring at the C

I intentionally have mouse support disabled in vim, as I find it unergonomic not to be able to select text the same way as in any other terminal screen.

However, this has an annoying consequence for a libvte / gnome-terminal user: on switching to an “alternate screen” application like vim that has mouse support disabled, the terminal “helpfully” maps scroll wheel events to arrow up/down events.

This is possibly fine, except I use the scroll wheel click as middle-button paste, and I’m constantly accidentally pasting something in the wrong place as a result.

This is unfixable from within vim, since it only sees normal arrow key presses (not ScrollWheelUp and so on).

However, you can turn this off in libvte with the magic escape sequence:

echo -ne '\e[?1007l'

Also known as XTERM_ALTBUF_SCROLL. This is mentioned in passing in this ticket. Documentation in general is - at best - sparse, but you can always go to the source.

A Headless Office 365 Proxy Staring at the C

As I mentioned in my last post, I’ve been experimenting with replacing davmail with Simon Robinson’s super-cool email-oauth2-proxy, and hooking fetchmail and mutt up to it. As before, here’s a specific rundown of how I configured O365 access using this.

Configuration

We need some small tweaks to the shipped configuration file. It’s used for both permanent configuration and acquired tokens, but the static part looks something like this:

[email@yourcompany.com]
permission_url = https://login.microsoftonline.com/common/oauth2/v2.0/authorize
token_url = https://login.microsoftonline.com/common/oauth2/v2.0/token
oauth2_scope = https://outlook.office365.com/IMAP.AccessAsUser.All https://outlook.office365.com/POP.AccessAsUser.All https://outlook.office365.com/SMTP.Send offline_access
redirect_uri = https://login.microsoftonline.com/common/oauth2/nativeclient
client_id = facd6cff-a294-4415-b59f-c5b01937d7bd
client_secret =

We’re re-using davmail’s client_id again.

Updated 2023-10-10: emailproxy now supports a proper headless mode, as discussed below.

Updated 2022-11-22: you also want to set delete_account_token_on_password_error to False: otherwise, a typo will delete the tokens, and you’ll need to re-authenticate from scratch.

We’ll configure fetchmail as follows:

poll localhost protocol IMAP port 1993
 auth password username "email@yourcompany.com"
 is localuser here
 keep
 sslmode none
 mda "/usr/bin/procmail -d %T"
 folders INBOX

and mutt like this:

set smtp_url = "smtp://email@yourcompany.com@localhost:1587/"
unset smtp_pass
set ssl_starttls=no
set ssl_force_tls=no

When you first connect, you will get a GUI pop-up and you need to interact with the tray menu to follow the authorization flow. After that, the proxy will refresh tokens as necessary.

Running in systemd

Here’s the service file I use, slightly modified from the upstream README:

$ cat /etc/systemd/system/emailproxy.service
[Unit]
Description=Email OAuth 2.0 Proxy

[Service]
ExecStart=/usr/bin/python3 /home/localuser/src/email-oauth2-proxy/emailproxy.py --external-auth --no-gui --config-file /home/localuser/src/email-oauth2-proxy/my.config
Restart=always
User=joebloggs
Group=joebloggs

[Install]
WantedBy=multi-user.target

Headless operation

Typically, only initial authorizations require the GUI, so you could easily do the initial dance then use the above systemd service.

Even better, with current versions of email-oauth2-proxy, you can operate in an entirely headless manner! With the above --external-auth and --no-gui options, the proxy will prompt on stdin with a URL you can copy into your browser; pasting the response URL back in will authorize the proxy, and store the necessary access and refresh tokens in the config file you specify.

For example:

$ sudo systemctl stop emailproxy

$ python3 ./emailproxy.py --external-auth --no-gui --config-file /home/localuser/src/email-oauth2-proxy/my.config

# Now connect from mutt or fetchmail.

Authorisation request received for email@yourcompany.com (external auth mode)
Email OAuth 2.0 Proxy No-GUI external auth mode: please authorise a request for account email@yourcompany.com
Please visit the following URL to authenticate account email@yourcompany.com: https://login.microsoftonline.com/common/oauth2/v2.0/authorize?client_id=...

Copy+paste or press [↵ Return] to visit the following URL and authenticate account email@yourcompany.com: https://login.microsoftonline.com/common/oauth2/v2.0/authorize?client_id=...
then paste here the full post-authentication URL from the browser's address bar (it should start with https://login.microsoftonline.com/common/oauth2/nativeclient):

# Paste the updated URL bar contents from your browser in response:

https://login.microsoftonline.com/common/oauth2/nativeclient?code=...

SMTP (localhost:1587; email@yourcompany.com) [ Successfully authenticated SMTP connection - releasing session ]
^C
$ sudo systemctl start emailproxy

Obviously, you’ll need to do this interactively from the terminal, then restart in daemon mode.

email-oauth2-proxy

If you find the above details useful, consider donating to support Simon’s sterling work on email-oauth2-proxy.

Fetchmail and Office 365 Staring at the C

I previously described accessing Office365 email (and in particular its oauth2 flow) via davmail, allowing me to continue using fetchmail, procmail and mutt. As davmail is Java, it’s a pain to have around, so I thought I’d give some details on how to do this more directly in fetchmail, as all the available docs I found were a little vague, and it’s quite easy to screw up.

As it happens, I came across a generally better solution shortly after writing this post, on which more later.

Fetchmail 7

Unfortunately there is little interest in releasing a Fetchmail version with oauth2 support - the maintainer is taking a political stance against integrating it - so you’ll need to check out the next branch from git:

cd ~/src/
git clone -b next git@gitlab.com:fetchmail/fetchmail.git fetchmail-next
cd fetchmail-next
./autogen.sh && ./configure --prefix=/opt/fetchmail7 && make && sudo make install

I used the branch as of 43c18a54 Merge branch 'legacy_6x' into next. Given that the maintainer warns us they might remove oauth2 support, you might need this exact hash…

Generate a token

We need to go through the usual flow for getting an initial token. There’s a helper script for this, but first we need a config file:

user=email@yourcompany.com
client_id=facd6cff-a294-4415-b59f-c5b01937d7bd
client_secret=
refresh_token_file=/home/localuser/.fetchmail-refresh
access_token_file=/home/localuser/.fetchmail-token
imap_server=outlook.office365.com
smtp_server=outlook.office365.com
scope=https://outlook.office365.com/IMAP.AccessAsUser.All https://outlook.office365.com/POP.AccessAsUser.All https://outlook.office365.com/SMTP.Send offline_access
auth_url=https://login.microsoftonline.com/common/oauth2/v2.0/authorize
token_url=https://login.microsoftonline.com/common/oauth2/v2.0/token
redirect_uri=https://login.microsoftonline.com/common/oauth2/nativeclient

Replace email@yourcompany.com and localuser in the above, and put it at ~/.fetchmail.oauth2.cfg. It’s rare to find somebody mention this, but O365 does not need a client_secret, and we’re just going to borrow davmail’s client_id - it’s not a secret in any way, and trying to get your own is a royal pain. Also, if you see a reference to tenant_id anywhere, ignore it - common is what we need here.

Run the flow:

$ # This doesn't get installed...
$ chmod +x ~/src/fetchmail-next/contrib/fetchmail-oauth2.py
$ # Sigh.
$ sed -i 's+/usr/bin/python+/usr/bin/python3+' ~/src/fetchmail-next/contrib/fetchmail-oauth2.py
$ ~/src/fetchmail-next/contrib/fetchmail-oauth2.py -c ~/.fetchmail.oauth2.cfg --obtain_refresh_token_file
To authorize token, visit this url and follow the directions:
  https://login.microsoftonline.com/common/oauth2/v2.0/authorize?...
Enter verification code:

Unlike davmail, this needs just the code, not the full returned URL, so you’ll need to be careful to dig out just the code from the response URL (watch out for any session_state parameter at the end!).

This will give you an access token that will last for around an hour.

Fetchmail configuration

Now we need an oauthbearer .fetchmailrc like this:

set daemon 60
set no bouncemail
poll outlook.office365.com protocol IMAP port 993
 auth oauthbearer username "email@yourcompany.com"
 passwordfile "/home/localuser/.fetchmail-token"
 is localuser here
 keep
 sslmode wrapped sslcertck
 folders INBOX
 mda "/usr/bin/procmail -d %T"

Replace email@yourcompany.com and localuser.

At this point, hopefully starting /opt/fetchmail7/bin/fetchmail will work!

Refresh tokens

As per the OAUTH2 README, fetchmail itself does not take care of refreshing the token, so you need something like this in your crontab:

*/2 * * * * $HOME/src/fetchmail-next/contrib/fetchmail-oauth2.py -c $HOME/.fetchmail.oauth2.cfg --auto_refresh

#opensolaris Staring at the C

When OpenSolaris got started, #solaris was a channel filled with pointless rants about GNU-this and Linux-that. Besides the complete wrong-headedness, it was a total waste of time and extremely hostile to new people. #opensolaris, in contrast, was actually pretty nice (for IRC!) - sure, the usual pointless discussions, but it certainly wasn't hateful.

Recently I'm sad to say #opensolaris has become a really hostile, unpleasant place. I've seen new people arrive and be bullied by a small number of poisonous people until they went away (nice own goal, people!). So if anyone's looking for me for xVM stuff or whatever, I'll be in #onnv-scm or #solaris-xen as usual. And if you do so, please try to keep a civil tongue in your head - it's not hard.

$HOME Staring at the C

I've not been able to access my homedir (and hence my work mail) all day. I suspect this was a planned outage I've forgotten about, but it's still a big problem. And what kind of planned outage lasts all day?

Link Staring at the C

Our $100M Series B Oxide Computer Company Blog

We don’t want to bury the lede: we have raised a $100M Series B, led by a new strategic partner in USIT with participation from all existing Oxide investors. To put that number in perspective: over the nearly six year lifetime of the company, we have raised $89M; our $100M Series B more than doubles our total capital raised to date — and positions us to make Oxide the generational company that we have always aspired it to be.

If this aspiration seems heady now, it seemed absolutely outlandish when we were first raising venture capital in 2019. Our thesis was that cloud computing was the future of all computing; that running on-premises would remain (or become!) strategically important for many; that the entire stack — hardware and software — needed to be rethought from first principles to serve this market; and that a large, durable, public company could be built by whomever pulled it off.

This scope wasn’t immediately clear to all potential investors, some of whom seemed to latch on to one aspect or another without understanding the whole. Their objections were revealing: "We know you can build this," began more than one venture capitalist (at which we bit our tongue; were we not properly explaining what we intended to build?!), "but we don’t think that there is a market."

Entrepreneurs must become accustomed to rejection, but this flavor was particularly frustrating because it was exactly backwards: we felt that there was in fact substantial technical risk in the enormity of the task we put before ourselves — but we also knew that if we could build it (a huge if!) there was a huge market, desperate for cloud computing on-premises.

Fortunately, in Eclipse Ventures we found investors who saw what we saw: that the most important products come when we co-design hardware and software together, and that the on-premises market was sick of being told that they either don’t exist or that they don’t deserve modernity. These bold investors — like the customers we sought to serve — had been waiting for this company to come along; we raised seed capital, and started building.

And build it we did, making good on our initial technical vision:

While these technological components are each very important (and each is in service to specific customer problems when deploying infrastructure on-premises), the objective is the product, not its parts. The journey to a product was long, but we ticked off the milestones. We got the boards brought up. We got the switch transiting packets. We got the control plane working. We got the rack manufactured. We passed FCC compliance.

And finally, two years ago, we shipped our first system!

Shortly thereafter, more milestones of the variety you can only get after shipping: our first update of the software in the field; our first update-delivered performance improvements; our first customer-requested features added as part of an update.

Later that year, we hit general commercial availability, and things started accelerating. We had more customers — and our first multi-rack customer. We had customers go on the record about why they had selected Oxide — and customers describing the wins that they had seen deploying Oxide.

Customers started landing faster now: enterprise sales cycles are infamously long, but we were finding that we were going from first conversations to a delivered product surprisingly quickly. The quickening pace always seemed to be due in some way to our transparency: new customers were listeners to our podcast, or they had read our RFDs, or they had perused our documentation, or they had looked at the source code itself.

With growing customer enthusiasm, we were increasingly getting questions about what it would look like to buy a large number of Oxide racks. Could we manufacture them? Could we support them? Could we make them easy to operate together?

Into this excitement, a new potential investor, USIT, got to know us. They asked terrific questions, and we found a shared disposition towards building lasting value and doing it the right way. We learned more about them, too, and especially USIT’s founder, Thomas Tull. The more we each learned about the other, the more there was to like. And importantly, USIT had the vision for us that we had for ourselves: that there was a big, important market here — and that it was uniquely served by Oxide.

We are elated to announce this new, exciting phase of the company. It’s not necessarily in our nature to celebrate fundraising, but this is a big milestone, because it will allow us to address our customers' most pressing questions around scale (manufacturing scale, system scale, operations scale) and roadmap scope. We have always believed in our mission, but this raise gives us a new sense of confidence when we say it: we’re going to kick butt, have fun, not cheat (of course!), love our customers — and change computing forever.

Triton on SmartOS bhyve Nahum Shalman

Under Development

There is an open bug in SmartOS that needs to be fixed for this all to work.
These are development notes until this header is removed.

Motivation

I (still) don't run VMware but I do have a SmartOS machine (it's a little nicer than the one from a decade ago).
I now work on Triton for my day job and I want to run CoaL for some testing.

Networking

The first trick is going to be to get some appropriate network tags set up and configured in the way that the CoaL image expects. I'm going to set up both an admin network and an "external" network. The latter will perform the same NAT that gets configured by the scripts for use with VMware.

Admin network.

This is a private network that doesn't need to reach the internet. Since I'll be confining my experiments to a single SmartOS hypervisor I'll just use an etherstub:

nictagadm add -l sdc_admin0

External network.

This one is trickier. CoaL expects this to be a network that can reach the outside world via NAT. We'll create another etherstub for it, then we'll create a zone to do the NAT:

nictagadm add -l sdc_external0

Provision a zone to be the NAT router using the following json (you can use whatever image_uuid you want, it doesn't actually matter):
coal-nat.json

{
  "alias": "coal-nat",
  "hostname": "coal-nat",
  "brand": "joyent-minimal",
  "max_physical_memory": 128,
  "image_uuid": "2f1dc911-6401-4fa4-8e9d-67ea2e39c271",
  "nics": [
    {
      "nic_tag": "external",
      "ip": "dhcp",
      "allow_ip_spoofing": "1",
      "primary": "1"
    },
    {
      "nic_tag": "sdc_external0",
      "ip": "10.88.88.2",
      "netmask": "255.255.255.0",
      "allow_ip_spoofing": "1",
      "gateway": "10.88.88.2"
    }
  ],
  "customer_metadata" : {
    "manifests" : "network/forwarding.xml\nnetwork/routing/route.xml\nnetwork/routing/ripng.xml\nnetwork/routing/legacy-routing.xml\nnetwork/ipfilter.xml\nsystem/identity.xml\n",
    "smf-import" : "mdata-get manifests | while read name; do svccfg import /lib/svc/manifest/$name; done;",
    "user-script" : "mdata-get smf-import | bash -x; echo -e 'map net0 10.88.88.0/24 -> 0/32\nrdr net0 0/0 port 22 -> 10.88.88.200 port 22 tcp' > /etc/ipf/ipnat.conf; routeadm -u -e ipv4-forwarding; svcadm enable identity:domain; svcadm enable ipfilter"
  }
}

You can also set a static IP address on the first NIC if you prefer.

Create the zone:

vmadm create -f coal-nat.json

Building the headnode VM

Normally SmartOS provides a lot of protection on the vnics. We'll be turning them all off so that the guest can do whatever it wants. This is one of the reasons I like setting up the etherstubs. Even if this VM runs amok the only other zone it can reach is that very minimal NAT zone.

We need to specify the hardcoded MAC addresses that the answers.json file is expecting to see as well:
coal-headnode.json:

{
  "alias": "coal-headnode",
  "brand": "bhyve",
  "bootrom": "uefi",
  "ram": 16384,
  "vcpus": 4,
  "autoboot": false,
  "nics": [
    {
      "mac": "00:50:56:34:60:4c",
      "nic_tag": "sdc_admin0",
      "model": "virtio",
      "ip": "dhcp",
      "allow_dhcp_spoofing": true,
      "allow_ip_spoofing": true,
      "allow_mac_spoofing": true,
      "allow_restricted_traffic": true,
      "allow_unfiltered_promisc": true,
      "dhcp_server": true
    },
    {
      "mac": "00:50:56:3d:a7:95",
      "nic_tag": "sdc_external0",
      "model": "virtio",
      "ip": "dhcp",
      "allow_dhcp_spoofing": true,
      "allow_ip_spoofing": true,
      "allow_mac_spoofing": true,
      "allow_restricted_traffic": true,
      "allow_unfiltered_promisc": true,
      "dhcp_server": true
    }
  ],
  "disks": [
    {
      "boot": true,
      "size": 8192,
      "model": "virtio"
    },
    {
      "size": 65440,
      "model": "virtio"
    }
  ]
}

Create the VM:

vmadm create -f coal-headnode.json

Copying over the CoaL USB stick image

Triton releases live at https://us-central.manta.mnx.io/Joyent_Dev/public/SmartDataCenter/triton.html
But there's a link kept up to date pointing to the latest.

curl -fLO https://us-central.manta.mnx.io/Joyent_Dev/public/SmartDataCenter/coal-latest.tgz
tar xzvf coal-latest.tgz
UUID=$(vmadm list -H -o uuid alias=coal-headnode)
qemu-img convert -f raw -O host_device coal-release-*.vmwarevm/*.img /dev/zvol/dsk/zones/${UUID?}/disk0
zfs set refreservation=0 zones/${UUID?}/disk0
zfs snapshot zones/${UUID?}/disk0@sdc-pristine

Pre-configuring CoaL

We need to obtain the CoaL answers.json file and reconfigure Loader so that it will behave correctly in the VM.

lofiadm -l -a /dev/zvol/dsk/zones/${UUID?}/disk0
mount -F pcfs /devices/pseudo/lofi@2:c /mnt
curl -kL https://raw.githubusercontent.com/tritondatacenter/sdc-headnode/master/answers.json.tmpl.external | sed 's/vga/ttya/g' > /mnt/private/answers.json
# TODO: LOADER FIXES
umount /mnt
lofiadm -d /dev/lofi/2
zfs snapshot zones/${UUID?}/disk0@configured

Optional: Get a performance boost at the cost of potential VM data corruption if the host loses power:

zfs set sync=disabled zones/${UUID?}

Now you're ready to boot your VM.

vmadm start ${UUID?} ; vmadm console ${UUID?}

B2VT 2025 Josef "Jeff" Sipek

A week ago, I participated in a 242 km bike ride from Bedford to the Harpoon Brewery in Windsor. This was an organized event with about 700 people registered to ride it. I’ve done a number of group rides in the past, but never a major event like this, so I’m going to brain-dump about it. (As a brain-dump, it is not as organized as it could be. Shrug.)

This was not a race, so there is no official timekeeping or ranking.

TL;DR: I rode 242 km in 11 hours and 8 minutes and I lived to tell the tale.

The Course

The full course was a one-way 242 km (150 mile) route with four official rest stops with things to eat and drink. The less insane riders signed up for truncated rides that followed the same route and also ended in Windsor, but skipped the beginning. There was a 182 km option that started at the first rest stop and a 108 km option that started at the second rest stop. Since I did the full ride, I’m going to ignore the shorter options.

The above link to RideWithGPS has the whole course and you can zoom around to your heart’s content, but the gist of it is:

Rest Stops, Food, Drinks

The four official rest stops were at 58 km, 132 km, 169 km, and 220 km. The route passed through a number of towns so it was possible to stop at a convenience store and buy whatever one may have needed (at least in theory).

Each rest stop was well-stocked, so I didn’t need to buy anything from any shops along the way.

There was water, Gatorade, and already-prepared Maurten’s drink mix, as well as a variety of sports nutrition “foods”. There were many Maurten gels and bars, GU gels, stroopwafels, bananas, and pickle slices with pickle juice.

Maurten was one of the sponsors, so there was a ton of their products. I tried their various items during training rides, and so I knew what I liked (their Solid 160 bars) and what I found weird (the drink mix and gels, which I describe as runny and chunky slime, respectively).

My plan was to sustain myself off the Maurten bars and some GU gels I brought along because I didn’t know they were also going to be available. I ended up eating the bars (as planned). I tried a few B2VT-provided GU gel flavors I haven’t tried before (they were fine) and a coconut-flavored stroopwafel (a heresy, IMO). I also devoured a number of bananas and enjoyed the pickles with juice. Drink-wise, I had a bottle of Gatorade and a bottle of water with electrolytes. At each stop, I topped off the Gatorade bottle with more Gatorade, and refilled the other bottle with water and added an electrolyte tablet.

The one item I wish they had at the first 3 stops: hot coffee.

With the exception of the second rest stop, I never had to wait more than 30 seconds to get whatever I needed. At the second stop, I think I just got unlucky, and I arrived at a busy time. I spent about 5 minutes in the line, but I didn’t really care. I still had plenty of time and there was John (one of the other riders that I met a few months ago during a training ride) to chat with while waiting.

In addition to the official rest stops, I stopped twice on the way to stretch and eat some of the stuff I had on me. The first extra stop was by the Winchester, NH post office or at about 111 km. The second extra stop was at the last intersection before the climb around Ascutney which conveniently was at 200 km.

Since I’m on the topic of food, the finish had real food—grilled chicken, burgers, hot dogs, etc. I didn’t have much time before my bus back to Bedford left, so I didn’t get to try the chicken. The burgers and hot dogs were a nice change of flavor from the day of consuming variously-packaged sugars and not much else.

Mechanics

Conte’s Bike Shop (also a sponsor) had a few mechanics provide support to anyone who had issues with their bikes. They’d stay at a rest stop, do their magic, and eventually drive to the next stop helping anyone along the way. They easily put in 12 hours of work that day.

Thankfully, I didn’t have any mechanical issues and didn’t need their services.

Weather

Given the time and distance involved, it is no surprise that the weather at the start and finish was quite different. The good news was that the weather steadily improved throughout the ride. The bad news was that it started rather poor—moderate rain. As a result, everyone got thoroughly soaked in the first 20 km. Rain showers and wet roads (at times it wasn’t clear if there is rain or if it’s just road spray) were pretty standard fare until the second rest stop. Between the second and third stops, the roads got progressively drier. By the 4th stop, the weather was positively nice.

None of this was a surprise. Even though the weather forecasts were uncertain about the details, my general expectation was right. As a side note, I find MeteoBlue’s multi-model and ensemble forecasts quite useful when the distilled-to-a-handful-of-numbers forecasts are uncertain. For example, I don’t care if it is going to be 13°C or 15°C when on the bike. I’ll expect it to be chilly. This is, however, a very large range for the single-number temperature forecast and so it’ll be labeled as uncertain. Similarly, I don’t care if I encounter 10 mm or 15 mm of rain in an hour. I’ll be wet either way.

I kept checking the forecasts as soon as they covered the day of the event. After a few days, I got tired of trying to load up multiple pages and correlating them. I wrote a hacky script that uses MeteoBlue’s API to fetch the hourly forecast for the day, and generate a big table with as much (relevant) information as possible.

You can see the generated table with the (now historical) forecast yourself. I generated this one at 03:32—so, about 2 hours before I started.

Each location-hour pair shows what MeteoBlue calls RainSpot, an icon with cloud cover and rain, the wind direction and speed (along with the headwind component), the temperature, and the humidity.

I was planning to better visualize the temperature and humidity and to calculate the headwind at more points along the path, but I got distracted with other preparations.

Temperature-wise, it was a similar story. Bad (chilly) in the beginning and nice (warm but not too warm) at the end.

Clothing

The weather made it extra difficult to plan what to wear. I think I ended up slightly under-dressed in the beginning, but just about right at the end (or possibly a smidge over-dressed). I wore: bib shorts, shoe covers, a short-sleeved polyester shirt, and the official B2VT short-sleeved jersey.

The shoe covers worked well, until they slid down just enough to reveal the top of the socks. At that point it was game over—the socks wicked all the water in the world right into my shoes. So, of the 242 km I had wet feet for about 220 km. Sigh. I should have packed spare socks into the extra bag that the organizers delivered to rest stop 2 (and then to the finish). They wouldn’t have dried out my shoes, but it would have provided a little more comfort at least temporarily.

For parts of the ride, I employed 2 extra items: a plastic trash bag and aluminum foil.

Between the first rest stop and the 200 km break, I wore a plastic trash bag between the jersey and the shirt. While this wasn’t perfect, it definitely helped me not freeze on the long-ish descents and stay reasonably warm at other times. I probably should have put it on before starting, but I had (unreasonably) hoped that it wouldn’t actively rain.

At the second rest stop, I lined my (well-ventilated) helmet with aluminum foil to keep my head warm. When I took it off, my head was a little bit sweaty. In other words, it worked quite well. As a side note, just before I took the foil out at the third rest stop, multiple people at the stop asked me what it was for and whether it worked.

Pacing & Time Geekery

Needless to say, it was a very long day.

My goal was to get to the finish line before it closed at 18:30. So, I came up with a pessimistic timeline that got me to the finish with 23 minutes to spare. I assumed that my average speed would decrease over time as I got progressively more tired—starting off at 26 km/h and crossing the finish line at 18 km/h. I also assumed that I’d go up the 3 major climbs at a snail’s pace of 10 km/h and that I’d spend progressively more time at the stops.

Well, I was guessing at the speeds based on previous experience. The actual plan was to stay in my power zone 2 (144–195W) no matter what the terrain was like. I was willing to go a little bit harder on occasion to stay in someone’s draft, but any sort of solo effort would be in zone 2.

I signed up for the 15 miles/hour pace group (about 24 km/h), which meant that I would start between 5:00 and 5:30 in the morning. I hoped to start at 5:00 but calculated based on 5:30 start time.

Here’s my plan (note that the fourth stop moved from 218 to 220 km a few days before the event, and I didn’t bother re-adjusting the plan):


                     Time of Day     Time
               Dist  In    Out    In    Out
Start             0  N/A   05:30  N/A   00:00
Ashby climb      51  07:27 08:09  01:57 02:39
#1               58  08:09 08:24  02:39 02:54
Hinsdale climb  121  10:55 11:37  05:25 06:07
#2              132  11:37 11:57  06:07 06:27
#3              168  13:35 13:55  08:05 08:25
Ascutney climb  198  15:21 16:15  09:51 10:45
#4              218  16:25 16:50  10:55 11:20
Finish          241  18:07 N/A    12:37 N/A

To have a reference handy, I taped the rest stop distances and expected “out” times to my top-tube:

(After I started writing it, I realized that the start line was totally useless and I should have skipped it. That extra space could have been used for the expected finish time.)

So, how did I do in reality?

Well, I didn’t want to rush in the morning so I ended up starting at 5:30 instead of the hoped-for 5:00. Oh well.

Until the 4th stop, it felt like I was about 30 minutes ahead of (worst case) schedule, but when I got to the 4th stop I realized that I had a ton of extra time. Regardless, I didn’t delay and headed out toward the finish. I was really surprised that I managed to finish it in just over 11 hours.

Here’s a table comparing the planned (worst case) with the actual times along with deltas between the two.


                       Planned      Actual        Delta
               Dist  In    Out    In    Out    In    Out
Start             0  N/A   00:00  N/A   00:00  N/A   +0:00
Ashby climb      51  01:57 02:39  01:53 02:17  -0:04 -0:22
#1               58  02:39 02:54  02:17 02:33  -0:22 -0:21
Hinsdale climb  121  05:25 06:07  04:59 05:41  -0:26 -0:26
#2              132  06:07 06:27  05:41 06:10  -0:26 -0:17
#3              168  08:05 08:25  07:34 07:55  -0:31 -0:30
Ascutney climb  198  09:51 10:45  09:13 09:37  -0:38 -1:08
#4              218  10:55 11:20  10:08 10:20  -0:47 -1:00
Finish          241  12:37 N/A    11:08 N/A    -1:29 N/A

It is interesting to see that I spent 1h18m at the rest stops (16, 29, 21, and 12 minutes), while I planned for 1h20m (15, 20, 20, and 25 minutes). If I factor in the two pauses I did on my own (3 minutes at 111 km and 9 minutes at 200 km), I spent 1h30m stopped. I knew I was ahead of schedule, and so I didn’t rush at the stops as rushing tends to lead to errors that take more time to rectify than not-rushing would have taken.

I’m also happy to see that my 10 km/h semi-arbitrary estimate for the climbs worked well enough on the first climb and was spot on for the second. The third climb wasn’t as bad, but I stuck with the same estimated speed because I assumed I’d be much more fatigued than I was.

To have a better idea about my average speed after the ride, I plotted my raw speed as well as cumulative average speed that’s reset every time I stop. (In other words, it is the average speed I’d see on the Garmin at any given point in time if I pressed the lap button every time I stopped.) The x-axis is time in minutes, and the y-axis is generally km/h (the exception being the green line which is just the orange line converted to miles per hour).

The average line is 21.7 km/h which is the distance over total elapsed time (11:08). If I ignore all the stopped time and look at only the moving time (9:43), the average speed ends up being 24.9 km/h. Nice!

Power-wise, I did reasonably well. I spent almost 2/3 of the time in zones 1 and 2. I spent a bit more time in zone 3 than I expected, but a large fraction of that is right around 200W. 200 is a number that’s a whole lot easier to remember while riding and so I treat it as the top of my zone 2.

Fatigue & Other Riders

I knew what to expect (more or less) over the first 2/3 of the ride as my longest ride before was 163 km. In many ways, it felt as I expected and in some ways it was a very different ride.

At the third rest stop (168 km), I felt a bit less drained than I expected. I’m guessing that’s because I actively tried to go very easy—to make sure I had something left in me for the last 70 km.

Sitting on the saddle felt as I expected: slowly getting less and less enjoyable but still ok. It is rather annoying that at times one has to choose between drafting and getting out of the saddle for comfort.

What was very different was the “mental progress bar”. Somehow, 160 km feels worse if you are planning to do 163 km than if you are planning to do 242 km. It’s like the mind calibrates the sensations based on the expected distance. Leaving the third rest stop felt like venturing into the unknown. Passing 200 km felt exciting—first time I’ve ever seen a three digit distance starting with anything other than a 1 and only 42 km left to the finish! Leaving the fourth rest stop felt surprisingly good because there were only 22 km left and tons of time to do it in.

In general, I was completely shameless about drafting. If you passed me anywhere except a bigger uphill, I’d hop onto your wheel and stay for as long as possible.

Between about 185–200 km, I was following one such group of riders. This is when I really noticed how tired and sore some people got by this point. One of them got out of the saddle every 30–60 seconds. I don’t blame him, but following him was extra hard since every time he’d get up, he’d ever-so-slightly slow down. That group as a whole was a little incohesive at that point. I tried to help bring a little bit of order to the chaos by taking a pull, but it didn’t help enough for my taste. So, as we got to the intersection right before the climb around Mount Ascutney, I let them go and took a break to celebrate reaching 200 km with some well-earned crackers.

After the long and steady climb from that intersection, the terrain is mostly flat. This is when I noticed another rider’s fatigue. As I passed him solo, he jumped onto my wheel. After a minute or two, he asked me if I knew how much further it is. I found this a bit peculiar—knowing how far one has gone or how much is left is something I spent hours thinking about. I gave him how far I’ve gone (216 km), how long the course is (240 km), did quick & dirty math to give him an idea what’s left, and I threw in that the rest stop is in about 3 km. Then about a minute later, I realized that he dropped while I continued at 200W.

After the mostly flat part, there was a steep but relatively short uphill to the fourth rest stop. This is when I stopped caring about being quite so religious about sticking to 200W max. Instead of spinning up it, I got out of the saddle and went at a more natural-for-me climbing pace (which isn’t sustainable long term). To my surprise, my legs felt fine! Well, it was not quite a surprise since I know that my aerobic ability is (relatively speaking) worse than my anaerobic ability, but it was nice to see that I could still do a bigger effort even after about 5000 kJ of work.

One additional observation I have about long non-solo events like this is that unless you show up with a group of people that will ride together, it is only a matter of time before everyone spreads out based on their preferred pace and you end up solo. People (perhaps correctly) place greater value on sticking to their own pace instead of pushing closer to their limit to keep up with faster people and therefore finishing sooner. I noticed this during the last B2VT training ride and saw it happen again during the real ride. This is much different from the Sunday group rides I’ve attended where people use as much effort as needed to stay with the group.

Conclusion

Overall I’m happy I tried to do this and that I finished. My previous longest-ride was 163 km, so this was 48% longer and therefore it was nice to see that I could do this if I wanted to. Which brings up the obvious question—will I do this again? At least at the moment, my answer is no. Getting ready for a long ride like that takes long rides, and long rides (even something like 5–6 hours) are harder to fit into my schedule, which includes work and plenty of other hobbies. So, at least for the foreseeable future, I’ll stick to 2–2.5 hour rides max with an occasional 100 km.

Garmin Edge 500 & 840 Josef "Jeff" Sipek

First, a little bit of history…

Many years ago, I tried various phone apps for recording my bike rides. Eventually, I settled on Strava. This worked great for the recording itself, but because my phone was stowed away in my saddle bag, I didn’t get to see my current speed, etc. So, in July 2012, I splurged and got a Garmin Edge 500 cycling computer. I used the 500 until a couple of months ago when I borrowed a 520 with a dying battery from someone who just upgraded and wasn’t using it. (I kept using the 500 as a backup for most of my rides—tucked away in a pocket.)

Last week I concluded that it was time to upgrade. I was going to get the 540 but it just so happened that Garmin had a sale and I could get the 840 for the price of 540. (I suppose I could have just gotten the 540 and saved $100, but I went with DC Rainmaker’s suggestion to get the 840 instead of the 540.)

Backups

For many years now, I’ve been backing up my 500 by mounting it and rsync’ing the contents into a Mercurial repository. The nice thing about this approach is that I could remove files from the Garmin/Activities directory on the device to keep the power-on times more reasonable but still have a copy with everything.

I did this on OpenIndiana, then on Unleashed, and now on FreeBSD. For anyone interested, this is the sequence of steps:

$ cd edge-500-backup
# mount -t msdosfs /dev/da0 /mnt
$ rsync -Pax /mnt/ ./
$ hg add Garmin
$ hg commit -m "Sync device"
# umount /mnt

This approach worked with the 500 and the 520, and it should work with everything except the latest devices—540, 840, and 1050. On those, Garmin switched from USB mass storage to MTP for file transfers.

After playing around a little bit, I came up with the following. It uses a jmtpfs FUSE file system to mount the MTP device, after which I rsync the contents to a Mercurial repo. So, generally the same workflow as before!

$ cd edge-840-backup
# jmtpfs -o allow_other /mnt
$ rsync -Pax \
        --exclude='*.img' \
        --exclude='*.db' \
        --exclude='*.db-journal' \
        /mnt/Internal\ Storage/ ./
$ hg add Garmin
$ hg commit -m "Sync device"
# umount /mnt

I hit a timeout issue when rsync tried to read the big files (*.img with map data, and *.db{,-journal} with various databases), so I just told rsync to ignore them. I haven’t looked at how MTP works or how jmtpfs is implemented, but it has the feel of something trying to read too much data (the whole file?), that taking too long, and the FUSE safety timeouts kicking in. Maybe I’ll look into it some day.

Aside from the timeout when reading large files, this seems to work well on my FreeBSD 14.2 desktop.

Is the Information Security industry succeeding? The Trouble with Tribbles...

Yesterday I had a trip up to London and had a wander round Infosecurity Europe. It was an interesting day, lots of things to see, many interesting conversations.

The show itself is huge. We've clearly come out of the doldrums of the last few years where shows had become tiny. And this was a dedicated infosec event, not just one part of a larger IT event.

Going by the size of the event, the number of exhibitors, the number of attendees, the size and extravagance of the displays, I think it's fair to say that Information Security as a business sector is doing very well. There's clearly a huge amount of vendor cash to splash around, and a confidence that customers have plenty of cash to buy the products on offer.

But is making money the correct definition of success here?

Most of the industry has a focus on detection and remediation. The pitch is that your systems are horrendously insecure and you need to give vendor X lots of money so they can detect a failure and help get your business back on its feet.

There was very little, in fact almost nothing, aimed at actually building more secure systems. (Even training and awareness is really nothing more than glossing over the cracks.) Maybe the closest are things aimed at the supply chain, but even that's basically detection of someone else's vulnerabilities.

So, in terms of actually building better systems, the Infosecurity industry is failing. It's not even addressing the problem.

(I would say that one definition of success for an information security company would be for it to do such a good job it's no longer needed. Clearly that's not going to be in many business plans.)

Furthermore, a string of high-profile hacks and breaches clearly indicates that the industry is failing to keep businesses secure.

Random thoughts on a Next Generation Tribblix The Trouble with Tribbles...

I have a little private project called xTribblix.

What's the x stand for? eXtreme? eXtraordinary? eXperimental? neXt generation?

Honestly, I don't know. It doesn't matter; it's just a little bucket I can drop things into. But essentially, it's a set of experiments around changing Tribblix that allow me to do interesting things. The aim would be that, if successful, they get folded back into regular Tribblix; if unsuccessful, then it's a learning experience.

It's just the logical continuation of the drive I've always had to make Tribblix faster, leaner, cleaner, fitter, easier, and more secure, while retaining compatibility and functionality.

There are a few bits of illumos that really ought to be removed. Printing is a prime example - CUPS is a better, more modern implementation, maintained, familiar to everyone, what most Solaris people wanted anyway, and to be honest printing *isn't* an illumos core competency, so it's an ideal target to be outsourced. That's a clear example with a superior replacement already available; for most other subsystems, someone might crawl out of the woodwork who's inconvenienced by their removal.

So far, I've simply looked at things and decided to implement many of the simple ones for the next release(s) without the need for a separate experimental release. This isn't new, it's been going on for many releases already, and so far I've managed not to break anything that matters.

Some of the things done already (some will be in the next release):

  • grub deprecated
  • update DEFAULT_MAXPID to allow pid > 30000 (eg 99999 like smartos)
  • delete ftpusers, as there's no illumos ftpd
  • long usernames now silent rather than warning
  • removed uucp, and removed the nuucp user
  • zones based on core-tribblix need to worry less about what to remove
  • overlays based on core-tribblix with the actual images having a driver layer on top, so cloud/virtual images can slim down
  • replace /usr/xpg4/bin/more with a link to less
  • replace pax with the heirloom version
  • create /var/adm/loginlog by default
  • increase PASSLENGTH in /etc/default/passwd to 8
  • remove /etc/log and /var/adm/log, latter only used by volcopy
  • transformed away and eliminated most uses of isaexec
  • remove /usr/games
  • remove all legacy printing
  • remove libadt_jni
  • remove ilb
  • remove the old as on x86, everything should use gas
  • remove oawk and man page (and ref in awk.1)
  • remove newform, listusers, asa
  • no longer install doctools by default
  • drop the closed iconv bits, as they're useless
  • remove libfru* on x86
  • replace sendmail with the upstream
  • deprecate mailwrapper

A lot of this is simple package manipulation as I convert the IPS repo produced by an illumos build into SVR4 packages, mostly avoiding the need to patch the source or the build.
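As a purely hypothetical illustration of the sort of transform involved (the file names here are made up, and this isn't the actual Tribblix build tooling), dropping something like /usr/games is essentially just filtering the file list before the SVR4 package gets built:

# Hypothetical sketch: filter unwanted paths out of an SVR4 prototype
# file, then build the package, with no changes to the illumos source.
grep -v 'usr/games' prototype.in > prototype
pkgmk -o -d /tmp/pkgs -f prototype TRIBsys-example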

There's a lot more that could be done, some examples of what I'm thinking of include:

  • xpgN by default (replace regular binaries in /usr/bin)
  • sort out cpp (last remaining closed bin)
  • everything 64-bit
  • remove /etc links more aggressively
  • no ucb at all [except mebbe install...]
  • see if there are any expensive and unused kstats we could remove
  • firewall on by default
  • passwd blocklists by default
  • extendedFILE(7) enabled by default (although not necessary if everything is 64-bit!)
  • refactor packages so they are along sensible boundaries (with reducing the number of distinct packages being the goal)

Now all I need is some time to implement all this...

OmniOS Community Edition r151054 OmniOS Community Edition

OmniOSce v11 r151054 is out!

On the 5th of May 2025, the OmniOSce Association released a new stable version of OmniOS - The Open Source Enterprise Server OS. The release comes with many tool updates, brand-new features and additional hardware support. For details, see the release notes.

Note that r151050 is now end-of-life. You should upgrade to r151052 or r151054 to stay on a supported track. r151054 is a long-term-supported (LTS) release with support until May 2028.

For anyone who tracks LTS releases, the previous LTS - r151046 - now enters its last year. You should plan to upgrade to r151054 during the next twelve months for continued support.

OmniOS is fully Open Source and free. Nevertheless, it takes a lot of time and money to keep maintaining a full-blown operating system distribution. Our statistics show that there are almost 2’000 active installations of OmniOS while fewer than 20 people send regular contributions. If your organisation uses OmniOS based servers, please consider becoming a regular patron or taking out a support contract.


Any problems or questions, please get in touch.

Oxide’s Compensation Model: How is it Going? Oxide Computer Company Blog

How it started

Four years ago, we were struggling to hire. Our team was small (~23 employees), and we knew that we needed many more people to execute on our audacious vision. While we had had success hiring in our personal networks, those networks now felt tapped; we needed to get further afield. As is our wont, we got together as a team and brainstormed: how could we get a bigger and broader applicant pool? One of our engineers, Sean, shared some personal experience: that Oxide’s principles and values were very personally important to him — but that when he explained them to people unfamiliar with the company, they were (understandably?) dismissed as corporate claptrap. Sean had found, however, that there was one surefire way to cut through the skepticism: to explain our approach to compensation. Maybe, Sean wondered, we should talk about it publicly?

"I could certainly write a blog entry explaining it," I offered. At this suggestion, the team practically lunged with enthusiasm: the reaction was so uniformly positive that I have to assume that everyone was sick of explaining this most idiosyncratic aspect of Oxide to friends and family. So what was the big deal about our compensation? Well, as a I wrote in the resulting piece, Compensation as a Reflection of Values, our compensation is not merely transparent, but uniform. The piece — unsurprisingly, given the evergreen hot topic that is compensation — got a ton of attention. While some of that attention was negative (despite the piece trying to frontrun every HN hater!), much of it was positive — and everyone seemed to be at least intrigued.

And in terms of its initial purpose, the piece succeeded beyond our wildest imagination: it brought a surge of new folks interested in the company. Best of all, the people new to Oxide were interested for all of the right reasons: not the compensation per se, but for the values that the compensation represents. The deeper they dug, the more they found to like — and many who learned about Oxide for the first time through that blog entry we now count as long-time, cherished colleagues.

That blog entry was a long time ago now, and today we have ~75 employees (and a shipping product!); how is our compensation model working out for us?

How it’s going

Before we get into our deeper findings, two updates that are so important that we have updated the blog entry itself. First, the dollar figure itself continues to increase over time (now at $235,000); things definitely haven’t gotten (and aren’t getting!) any cheaper. And second, we did introduce variable compensation for some sales roles. Yes, those roles can make more than the rest of us — but they can also make less, too. And, importantly: if/when those folks are making more than the rest of us, it’s because they’re selling a lot — a result that can be celebrated by everyone!

Those critical updates out of the way, how is it working? There have been a lot of surprises along the way, mostly (all?) of the positive variety. A couple of things that we have learned:

People take their own performance really seriously. When some outsiders hear about our compensation model, they insist that it can’t possibly work because "everyone will slack off." I have come to find this concern to be more revealing of the person making the objection than of our model, as our experience has been in fact the opposite: in my one-on-one conversations with team members, a frequent subject of conversation is people who are concerned that they aren’t doing enough (or that they aren’t doing the right thing, or that their work is progressing slower than they would like). I find my job is often to help quiet this inner critic while at the same time stoking what I feel is a healthy urge: when one holds one’s colleagues in high regard, there is an especially strong desire to help contribute — to prove oneself worthy of a superlative team. Our model allows people to focus on their own contribution (whatever it might be).

People take hiring really seriously. When evaluating a peer (rather than a subordinate), one naturally has high expectations — and because (in the sense of our wages, anyway) everyone at Oxide is a peer, it shouldn’t be surprising that folks have very high expectations for potential future colleagues. And because the Oxide hiring process is writing intensive, it allows for candidates to be thoroughly reviewed by Oxide employees — who are tough graders! It is, bluntly, really hard to get a job at Oxide.

It allows us to internalize the importance of different roles. One of the more incredible (and disturbingly frequent) objections I have heard is: "But is that what you’ll pay support folks?" I continue to find this question offensive, but I no longer find it surprising: the specific dismissal of support roles reveals a widespread and corrosive devaluation of those closest to customers. My rejoinder is simple: think of the best support engineers you’ve worked with; what were they worth? Anyone who has shipped complex systems knows these extraordinary people — calm under fire, deeply technical, brilliantly resourceful, profoundly empathetic — are invaluable to the business. So what if you built a team entirely of folks like that? The response has usually been: well, sure, if you’re going to only hire those folks. Yeah, we are — and we have!

It allows for fearless versatility. A bit of a corollary to the above, but subtly different: even though we (certainly!) hire and select for certain roles, our uniform compensation means we can in fact think primarily in terms of people unconfined by those roles. That is, we can be very fluid about what we’re working on, without fear of how it will affect a perceived career trajectory. As a concrete example: we had a large customer that wanted to put in place a program for some of the additional work they wanted to see in the product. The complexity of their needs required dedicated program management resources that we couldn’t spare, and in another more static company we would have perhaps looked to hire. But in our case, two folks came together — CJ from operations, and Izzy from support — and did something together that was in some regards new to both of them (and was neither of their putative full-time jobs!) The result was indisputably successful: the customer loved the results, and two terrific people got a chance to work closely together without worrying about who was dotted-lined to whom.

It has allowed us to organizationally scale. Many organizations describe themselves as flat, and a reasonable rebuttal to this is the "shadow hierarchies" created by the tyranny of structurelessness. And indeed, if one were to read (say) Valve's (in)famous handbook, the autonomy seems great — but the stack ranking decidedly less so, especially because the handbook is conspicuously silent on the subject of compensation. (Unsurprisingly, compensation was weaponized at Valve, which descended into toxic cliquishness.) While we believe that autonomy is important to do one's best work, we also have a clear structure at Oxide in that Steve Tuck (Oxide co-founder and CEO) is in charge. He has to be: he is held accountable to our investors — and he must have the latitude to make decisions. Under Steve, it is true that we don't have layers of middle management. Might we need some in the future? Perhaps, but what fraction of middle management in a company is dedicated to — at some level — determining who gets what in terms of compensation? What happens when you eliminate that burden completely?

It frees us to both lead and follow. We expect that every Oxide employee has the capacity to lead others — and we tap this capacity frequently. Of course, a company in which everyone is trying to direct all traffic all the time would be a madhouse, so we also very much rely on following one another too! Just as our compensation model allows us to internalize the values of different roles, it allows us to appreciate the value of both leading and following, and empowers us each with the judgement to know when to do which. This isn’t always easy or free of ambiguity, but this particular dimension of our versatility has been essential — and our compensation model serves to encourage it.

It causes us to hire carefully and deliberately. Of course, one should always hire carefully and deliberately, but this often isn’t the case — and many a startup has been ruined by reckless expansion of headcount. One of the roots of this can be found in a dirty open secret of Silicon Valley middle management: its ranks are taught to grade their career by the number of reports in their organization. Just as if you were to compensate software engineers based on the number of lines of code they wrote, this results in perverse incentives and predictable disasters — and any Silicon Valley vet will have plenty of horror stories of middle management jockeying for reqs or reorgs when they should have been focusing on product and customers. When you can eliminate middle management, you eliminate this incentive. We grow the team not because of someone’s animal urges to have the largest possible organization, but rather because we are at a point where adding people will allow us to better serve our market and customers.

It liberates feedback from compensation. Feedback is, of course, very important: we all want to know when and where we’re doing the right thing! And of course, we want to know too where there is opportunity for improvement. However, Silicon Valley has historically tied feedback so tightly to compensation that it has ceased to even pretend to be constructive: if it needs to be said, performance review processes aren’t, in fact, about improving the performance of the team, but rather quantifying and stack-ranking that performance for purposes of compensation. When compensation is moved aside, there is a kind of liberation for feedback itself: because feedback is now entirely earnest, it can be expressed and received thoughtfully.

It allows people to focus on doing the right thing. In a world of traditional, compensation-tied performance review, the organizational priority is around those things that affect compensation — even at the expense of activity that clearly benefits the company. This leads to all sorts of wild phenomena, and most technology workers will be able to tell stories of doing things that were clearly right for the company, but having to hide it from management that thought only narrowly in terms of their own stated KPIs and MBOs. By contrast, over and over (and over!) again, we have found that people do the right thing at Oxide — even if (especially if?) no one is looking. The beneficiary of that right thing? More often than not, it’s our customers, who have uniformly praised the team for going above and beyond.

It allows us to focus on the work that matters. Relatedly, when compensation is non-uniform, the process to figure out (and maintain) that non-uniformity is laborious. All of that work — of line workers assembling packets explaining themselves, of managers arming themselves with those packets to fight in the arena of organizational combat, and then of those same packets ultimately being regurgitated back onto something called a review — is work. Assuming such a process is executed perfectly (something which I suppose is possible in the abstract, even though I personally have never seen it), this is work that does not in fact advance the mission of the company. Not having variable compensation gives us all of that time and energy back to do the actual work — the stuff that matters.

It has stoked an extraordinary sense of teamwork. For me personally — and as I relayed on an episode of Software Misadventures — the highlights of my career have been being a part of an extraordinary team. The currency of a team is mutual trust, and while uniform compensation certainly isn't the only way to achieve that trust, boy does it ever help! As Steve and I have told one another more times than we can count: we are so lucky to work on this team, with its extraordinary depth and breadth.

While our findings have been very positive, I would still reiterate what we said four years ago: we don’t know what the future holds, and it’s easier to make an unwavering commitment to the transparency rather than the uniformity. That said, the uniformity has had so many positive ramifications that the model feels more important than ever. We are beyond the point of this being a curiosity; it’s been essential for building a mission-focused team taking on a problem larger than ourselves. So it’s not a fit for everyone — but if you are seeking an extraordinary team solving hard problems in service to customers, consider Oxide!

On efficiency and resilience in IT The Trouble with Tribbles...

Years ago, I was in a meeting when a C-level executive proclaimed:

IT systems run at less than 10% utilization on average, so we're moving to the cloud to save money.

The logic behind this was that you could run systems in the cloud that were the size you needed, rather than the size you had on the floor.

Of course, this particular claim was specious. Did he know the average utilization of our systems, I asked. He did not. (It was at least 30%.)

Furthermore, measuring CPU utilization is just one aspect of a complex multidimensional space. Systems may have spare CPU cycles, but are hitting capacity limits on memory, memory bandwidth, network bandwidth, storage and storage bandwidth. It's rare to have a system so well balanced that it saturates all parameters equally.

Not only that, but the load on all systems fluctuates, even on very short timescales. There will always be troughs between the peaks. And, as we all know, busy systems tend to generate queues and congestion - or, in technical terms, higher utilization leads to increased latency.

Attempting to build systems that maximise efficiency implies minimising waste. But if you always treat spare capacity as wasted capacity, then you will always get congested systems and slow response. (Just think about queueing at the tills in a supermarket where they've staffed them for average footfall.)
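To put a rough number on that (this is the standard M/M/1 queueing-theory illustration, not something from the original discussion), with service time 1/μ and utilisation ρ the mean response time is

    W = (1/μ) / (1 − ρ),    where ρ = λ/μ

so at 50% utilisation a response takes twice the service time, at 90% it takes ten times, and it grows without bound as utilisation approaches 100%.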

So guaranteeing performance and response time implies a certain level of overprovisioning.

Beyond that, resilient systems need to have sufficient capacity to not only handle normal fluctuations in usage, but abnormal usage due to failures and external events. And resilient design needs to have unused capacity to take up the slack when necessary.

In this case, a blinkered focus on efficiency not only leads to poor response, it also makes systems brittle and incapable of responding if a problem occurs.

A simple way to build resiliency is to have redundant systems - provision spare capacity that springs into action when needed. In such an active-passive configuration, the standby system might be idle. It doesn't have to be - you might use redundant systems for development/test/batch workloads (this presupposes you have a mechanism like Solaris zones to provide strong workload isolation).

Going to the cloud might solve the problem for a customer, but the cloud provider has exactly the same problem to solve, on a larger scale. They need to provision excess capacity to handle the variability in customer workloads. Which leads to the creation of interesting pricing models - such as reserved instances and the spot markets on AWS.

Understanding emission scopes, or failing to The Trouble with Tribbles...

I've been trying to get my head around all this Scope 1, Scope 2, Scope 3 emissions malarkey. Although it appears that lots of people smarter than me are struggling with it.

Having spent a while looking at how the Scopes are defined, I can understand how this can be difficult.

OK, Scope 1 is an organisation's direct emissions. Presumably an organisation knows what it's doing and how it's doing it, so getting the Scope 1 emissions from that ought to be fairly straightforward.

And Scope 2 is electricity, steam, heating and cooling purchased from someone else. I'm immediately suspicious here because this is a weirdly specific categorisation. But it should at least be easy to calculate - there's a conversion factor involved, but you know the usage because it's on a bill you have to pay.

Then Scope 3 is - everything else. The fact that there are 15 official categories included ought to be a big red flag. That it's problematic is shown by the fact so many organisations have problems with it. (And by the growth of an industry to solve the problem for you.)

Personally, I wouldn't have defined it this way. If the idea is to evaluate emissions across the supply chain, then dumping almost all the emissions into the vaguest bucket is always going to be problematic.

So, why wasn't Scope 2 simply defined as the combined Scope 1 emissions of everyone providing services to the organisation? (That includes upstream and downstream, suppliers and employees, by the way.) That has three advantages I can see:

  • It's easy to calculate, because Scope 1 is pretty easy to calculate for all the providers of services (and they may well be doing it anyway), and an organisation ought to know who's providing services to it
  • It makes Scope 2 bigger (obviously) because there's more included, and therefore makes Scope 3 smaller, so uncertainties in Scope 3 matter less
  • Because you can better identify the contributors to your Scope 2 emissions, it's easier to know where to start making improvement efforts
I presume there's some reason it wasn't done this way, but I can't immediately see it.
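For what it's worth, in symbols the suggestion amounts to (my own formalisation, nothing official):

    Scope 2(organisation) = Σ Scope 1(s)   summed over every supplier s of services to the organisation

so each supplier's contribution shows up directly in the total.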

What is this AI anyway? The Trouble with Tribbles...

AI is all the rage right now. It's everywhere, you can't avoid it.

But what is AI?

I'm not going to try and answer that here. What I will do, though, is state the question somewhat differently:

What is meant by "AI" in a given context?

And this matters, because the words we use are important.

The reality is that when you see AI mentioned it really could be almost anything. Some things AI might mean are:

  • Copilot
  • ChatGPT
  • Gemini
  • Some other specific off the shelf public LLM
  • Anything involving any off the shelf LLM
  • A custom domain-specific LLM
  • Machine learning
  • Pattern matching
  • Image recognition
  • Any old computer program
  • One of the AI companies

And there's always the possibility that someone has simply slapped AI on a product as a marketing term with no AI involved.

This persistent abuse of terminology is really unhelpful. Yesterday I went to a very interesting event for conversations about Hopes and Fears around AI.

Am I hopeful or fearful about AI? It depends which of the above definitions you mean.

There are certain uses of what might now be lumped in with AI that have proven to be very successful, but in many cases they're really machine learning, and have actually been around for a long time. I'm very positive about those (for example, helping in medical diagnoses).

On the other hand, if the AI is a stochastic parrot trained via large scale abuse of copyright while wreaking massive environmental damage, then I'm very negative about that.

So I think it's important to get away from sticking the AI label onto everything that might have some remote association with a computer program, and be far more careful in our terminology.

Tribblix on SPARC: sparse devices in an LDOM The Trouble with Tribbles...

I recently added a ddu-like capability to Tribblix.

In that article I showed the devices in a bhyve instance. As might be expected there really aren't a lot of devices you need to handle.

What about SPARC, you might ask? Even if you don't, I'll ask for you.

Running Tribblix in an LDOM, this is what you see:

root@sparc-m32:/root# zap ddu
Device SUNW,kt-rng handled by n2rng in TRIBsys-kernel-platform [installed]
Device SUNW,ramdisk handled by ramdisk in TRIBsys-kernel [installed]
Device SUNW,sun4v-channel-devices handled by cnex in TRIBsys-ldoms [installed]
Device SUNW,sun4v-console handled by qcn in TRIBsys-kernel-platform [installed]
Device SUNW,sun4v-disk handled by vdc in TRIBsys-ldoms [installed]
Device SUNW,sun4v-domain-service handled by vlds in TRIBsys-ldoms [installed]
Device SUNW,sun4v-network handled by vnet in TRIBsys-ldoms [installed]
Device SUNW,sun4v-virtual-devices handled by vnex in TRIBsys-kernel-platform [installed]
Device SUNW,virtual-devices handled by vnex in TRIBsys-kernel-platform [installed]
 

It's hardly surprising, but that's a fairly minimal list.

It does make me wonder whether to produce a special SPARC Tribblix image precisely to run in an LDOM. After all, I already have slightly different variants on x86 designed for cloud in general, and one for EC2 specifically, that don't need the whole variety of device drivers that the generic image has to include.

Expecting an AI boom? The Trouble with Tribbles...

I recently went down to the smoke, to Tech Show London.

There were 5 constituent shows, and I found what each sub-show was offering - and the size of each component - quite interesting.

There wasn't much going on in Devops Live, to be honest. Relatively few players had shown up, nothing terribly interesting.

There wasn't that much in Big Data & AI World either. I was expecting much more here, and what there was seemed to be on the periphery. More support services than actual product.

The Cloud & Cyber Security Expo was middling, not great, and there was an AI slant in evidence. Not proper AI, but a sprinkling of AI dust on things just to keep up with the Joneses.

Cloud and AI Infrastructure had a few bright spots. I saw actual hardware on the floor - I had seen disk shelves over in the Big Data section, but here I spotted a Tape Library (I used to use those a lot, haven't seen much in that area for a while) and a VDI blade. Talked to a few people, including the Zabbix and Tailscale stands.

But when it came to Data Centre World, that was buzzing. It was about half the overall floor area, so it was far and away the dominant section. Tremendous diversity too - concrete, generators, power cables, electrical switching, fiber cables, cable management, thermal management, lots of power and cooling. Lots and lots of serious physical infrastructure.

There was an obvious expectation on display that there's a massive market around high-density compute. I saw multiple vendors with custom rack designs - rear-door and liquid cooling in evidence. Some companies addressing the massive demand for water.

If these people are at a trade show, then the target market isn't the 3 or 4 hyperscalers. What's being anticipated in this frenzy is very much companies building out their own datacentre facilities, and that's very much an interesting trend.

There's a saying "During a gold rush, sell shovels". What I saw here was a whole army of shovel-sellers getting ready for the diggers to show up.

Tribblix, UEFI, and UFS The Trouble with Tribbles...

Somewhat uniquely among illumos distributions, Tribblix doesn't require installation to ZFS - it allows the possibility of installing to a UFS root file system.

I'm not sure how widely used this is, but it will get removed as an option at some point, as the illumos UFS won't work past Y2038.

I recently went through the process of testing an install of the very latest Tribblix to UFS, in a bhyve guest running UEFI. The UEFI part was a bit more work, and doing it clarified how some of the internals fit together.

(One reason for doing these unusual experiments is to better understand how things work, especially those that are handled automatically by more mainstream components.)

OK, on to installation.

While installing to zfs will automatically lay out zfs pools and file systems, the ufs variant needs manual partitioning. There are two separate concerns - the Tribblix install, and UEFI boot.

The Tribblix installer for UFS assumes 2 things about the layout of the disk it will install to:

  1. The slice s0 will be used to install the operating system to, and mounted at /.
  2. The slice s1 will be used for swap. (On zfs, you create a zfs volume for swap; on ufs you use a separate raw partition.)

It's slightly unfortunate that these slices are hard-coded into the installer.

For UEFI boot we need 2 other slices:

  1. A system partition (this is what's called EFI System partition, aka ESP)
  2. A separate partition to put the stage2 bootloader in. (On zfs there's a little bit of free space you can use; there isn't enough on ufs so it needs to be handled separately.)

The question then arises as to how big these need to be. Now, if you create a root pool with ZFS (using zpool create -B) it will create a 256MB partition for ESP. This turns out to be the minimum size for FAT32 on 4k disks, so that's a size that should always work. On disks with a 512 block size, it needs to be 32MB or larger (there's a comment in the code about 33MB). The amount of data you're going to store there is very much less.

The stage2 partition doesn't have to be terribly big.

So as a result of this I'm going to create a GPT label with 4 slices - 0 and 1 for Tribblix, 3 and 4 for EFI system and boot.

There are 2 things to note here: First, the partitions you create don't have to be laid out on disk in numerical order - you can put the slices in any order you want. This was true for SMI disks too, where it was common practice in Solaris to put swap on slice 1 at the start of the disk with slice 0 after it. Second, EFI/GPT doesn't assign any special significance to slice 2, unlike the old SMI label where slice 2 was conventionally the whole disk. I'm avoiding slice 2 here not because it's necessary, but so as not to confuse anyone used to the old SMI scheme.
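To make the plan concrete, this is roughly the layout I'm aiming for, using the example sizes from the steps below (sector numbers assume 512-byte sectors and will differ on your disk):

slice 3   system (ESP)      start 34        size 64MB
slice 4   boot (stage2)     start 131106    size 16MB
slice 1   swap              start 163874    size 512MB
slice 0   usr (/)           start 1212450   to the end of the usable space
slice 8   reserved          leave untouched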

The first thing to do with a fresh disk is to go into format, invoked as format -e (expert mode in order to access the EFI options). Select the disk, run fdisk from inside format, and then install an EFI label.

format -e
#
# choose the disk
#
fdisk
y - to accept defaults
l - to label
1 - choose efi

Then we can lay out the partitions. Still in format, type p to enter the partition menu and p to display the partitions.

p - enter partition menu
p - show current partition table

At this point on a new disk it should have 8 as "reserved" and 0 as "usr", with everything else "unassigned". We're going to leave slice 8 untouched.

First note where slice 0 currently starts. I'll resize it at the end, but we're going to put slices 3, 4, and 1 at the start of the disk and then resize 0 to fill in what's left.

To configure the settings for a given slice, just type its number.

Start with slice 3, type 3 and configure the system partition. This has to use the "system" tag.

tag: system
flags: wm (just hit return to accept)
start: 34
size: 64mb

Type p again to view the partition table and note the last sector of slice 3 we just created, and add 1 to it to give the start sector of the next slice. Type 4 to configure the boot partition, and it must have the tag "boot".

tag: boot
flags: wm (just hit return to accept)
start: 131106
size: 16mb

Type p again to view the partition table, take note of the last sector for the new slice 4, and add 1 to get the start sector for the next one - which is slice 1, the swap partition.

tag: swap
flags: wm (just hit return to accept)
start: 163874
size: 512mb

We're almost done. The final step is to resize partition 0. Again you get the start sector by adding 1 to the last sector of the swap partition you just created. And rather than giving a size you can give the end sector using an 'e' suffix, which should be one less than the start of the reserved partition 8, and also the last sector of the original partition 0. Type 0 and enter something like:

tag: usr
flags: wm (just hit return to accept)
start: 1212450
size: 16760798e

Type 'p' one last time to view the partition table, check that the Tag entries are correct, and that the First and Last Sectors don't overlap.

Then type 'l' to write the label to the disk. It will ask you for the label type - make sure it's EFI again - and for confirmation.

Then we can do the install:

./ufs_install.sh c1t0d0s0

It will ask for confirmation that you want to create the file system.

At the end it ought to say "Creating pcfs on ESP /dev/rdsk/c1t0d0s3"

If it says "Requested size is too small for FAT32." then that's a hint that you need the system partition to be bigger. (An alternative trick is to mkfs the pcfs file system yourself; if you create it using FAT16 it will still work, but you can get away with it being a lot smaller.)
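For that trick, something along these lines ought to do it - treat this as an untested sketch and check the mkfs_pcfs man page for the exact options:

# Sketch: create a FAT16 pcfs on the system partition by hand before
# running the installer, allowing a smaller partition than FAT32 needs.
mkfs -F pcfs -o fat=16 /dev/rdsk/c1t0d0s3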

It should also tell you that it's writing the pmbr to slice 4 and to p0.

With that, rebooting into the newly installed system ought to work.

Now, the above is a fairly complicated set of instructions. I could automate this, but do we really want to make it that easy to install to UFS?

Introducing a ddu-alike for Tribblix The Trouble with Tribbles...

Introducing a new feature in Tribblix m36. There's a new ddu subcommand for zap.

In OpenSolaris, the Device Driver Utility would map the devices it found and work out what software was needed to drive them. This isn't that utility, but is inspired by that functionality, rewritten for Tribblix as a tiny little shell script.

As an example, this is the output of zap ddu for Tribblix in a bhyve instance:

jack@tribblix:~$ zap ddu
Device acpivirtnex handled by acpinex in TRIBsys-kernel-platform [installed]
Device pci1af4,1000,p handled by vioif in TRIBdrv-net-vioif [installed]
Device pci1af4,1001 handled by vioblk in TRIBdrv-storage-vioblk [installed]
Device pci1af4,1 handled by vioif in TRIBdrv-net-vioif [installed]
Device pciclass,030000 handled by vgatext in TRIBsys-kernel [installed]
Device pciclass,060100 handled by isa in TRIBsys-kernel-platform [installed]
Device pciex_root_complex handled by npe in TRIBsys-kernel-platform [installed]
Device pnpPNP,303 handled by kb8042 in TRIBsys-kernel [installed]
Device pnpPNP,f03 handled by mouse8042 in TRIBsys-kernel [installed]

Simply put, it will list the devices it finds, which driver is responsible for them, and which package that driver is contained in (and whether that package is installed).
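I haven't looked at how the script itself is put together, but the core of the idea can be illustrated with something like this (a guess at the mechanism, not the actual zap code): device identifiers map to drivers via /etc/driver_aliases, and drivers map to the packages that ship them.

# Illustrative only: find which driver claims a given device alias.
# /etc/driver_aliases lines look like:  driver "alias"
dev="pci1af4,1000,p"
awk -v d="\"$dev\"" '$2 == d { print $1 }' /etc/driver_aliases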

This, while a tiny little feature, is one of those small things that is actually stunningly useful.

If there's a device that we have a driver for that isn't installed, this helps identify it so you know what to install.

What this doesn't do (yet, and unlike the original ddu) is show devices we don't have a driver for at all.

Is all this thing called AI worthwhile? The Trouble with Tribbles...

Before I even start, let's be clear: there are an awful lot of things currently being bundled under the "AI" banner, most of which are neither artificial nor intelligent.

So when I'm talking about AI here, I'm talking about what's being marketed to the masses as AI. This generally doesn't include the more traditional subjects of machine learning or image recognition, which I've often seen relabelled as AI.

But back to the title: is the modern thing called AI worthwhile?

Whatever it is, AI can do some truly remarkable things. That isn't something you can argue against. It can do some truly stupid and hopelessly wrong things as well.

But where does this good stuff fit in? Are businesses really going to benefit by embracing AI?

Well, yes, up to a point. There's a lot of menial work that can be handed off to an AI. It might be able to do it cheaper than a human.

The first snag is Jevons paradox: by making menial tasks cheaper, a business simply opens the door to larger quantities of menial tasks, so it saves no money and its costs might even go up.

To be honest, though, I would have to ask: if you can hand a task off to an AI, was it worth doing in the first place?

That's the rub: yes, you might be able to optimise a process by using AI, but you can optimise it much more by eliminating it entirely.

(And you then don't have to pay extra for someone to come along and clean up after the AI has made a mess of it.)

It's not just the first level of process you need to look at. Take the example of summarising meetings. It's not so much that you don't need the summary; it's that, to start with, you need to run meetings better so they don't need to be summarised - and better still, the meeting probably wasn't needed at all.

Put it another way: the AI will get you to a local minimum of cost, but not to a global minimum. Worse, as AI gets cheaper and more widely used, that local optimisation makes it even harder to optimise the system globally.

So, in short, I'm not convinced that much of the AI currently being rammed down our throats has any utility. It will actively block businesses in the pursuit of improvements, and the infatuation with current trendy AI will harm the development of useful AI.

Experience: dialog with prosecutor alp's notes

Wow! Today we had a guest from the prosecutor's office.
They checked whether we (South Federal University) ban sites from the prosecutor's list. Luckily, our upstream provider does it for us.
But the check itself was ridiculous.
Yes, we receive daily lists of sites to ban. I thought they would check some sites from this list and go in peace. But the prosecutor just searched for the prohibited works with Google and tried to see whether she could download the materials. Of course, Google found working links :)
What the hell? Why should we imitate some work if everyone knows how to avoid these regulation rules? Why should anyone spend resources on content filtering? I think our government is a herd of archaic dinosaurs who just don't know how to lick the chief's arse better.

Tailscale for SunOS in 2025 Nahum Shalman

Happy New Year! The wireguard-go port is still sitting around in my fork. I don't know when I will have the energy for the next attempt to get it upstream. In the meantime, I've made some fun progress on the Tailscale side.

Taildrive

The Tailscale folks have shipped Taildrive (currently in alpha) and it's pretty neat. Naturally those of us using Tailscale on illumos wanted to try it out. There was nothing needed directly to get it working, but we had an indirect problem. The tailscale binary communicates with the tailscaled daemon over a unix socket, and the Tailscale folks had added some basic unix based authentication / authorization abstracted in their peercred library. That library needed support added for getpeerucred which meant I had to wire things up all the way down in x/sys/unix before then getting it into peercred. But with that work done, Taildrive now works! I tagged a release with that enabled if you're in a rush to play with it.

Using userspace-networking

Tailscale has a way to run without creating a TUN device. It means that client software on the machine can't connect directly to IPs on the Tailnet (though there is a SOCKS proxy you can use), but tailscaled can still do lots of other server-y things (including Taildrive!). That's how Tailscale has been supporting AIX. Which led me to a strange realization: Tailscale had better in-tree support for AIX than it did for illumos and Solaris. No more! We are now on par with AIX in the official tree!
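For anyone wanting to try it, userspace mode looks roughly like this (the flag names come from Tailscale's own documentation rather than anything illumos-specific, and the host name in the last line is made up, so treat it as a sketch):

# Sketch: run tailscaled without a TUN device, exposing a local SOCKS5
# proxy so programs on the machine can still reach the tailnet.
tailscaled --tun=userspace-networking --socks5-server=localhost:1055 &
tailscale up
# e.g. route a client through the proxy:
curl --proxy socks5://localhost:1055 http://some-tailnet-host/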

What's next

I don't know if the Tailscale folks intend to ship binaries for us from their tree, but after their next release it should be possible to build illumos binaries from their tree that you could use to serve up a ZFS filesystem with Taildrive to your tailnet using the userspace-networking driver.

I will of course also rebase my TUN driver patches and tag a release as well.

Are you running Tailscale on illumos or Solaris? Let me know on Bluesky or Mastodon.

Tragedy Older Than Me Nahum Shalman

In July of 2021, in anticipation of the upcoming High Holy Days I purchased a copy of This Is Real and You Are Completely Unprepared: The Days of Awe as a Journey of Transformation by Rabbi Alan Lew, published in 2003. I was in fact completely unprepared to even read it. It sat in or on my nightstand for more than three years. I finally started reading it during the high holidays in October of 2024 (a few days before the one year anniversary of the events of October 7, 2023).

When I read the section excerpted here, I had to immediately flip back to check the publication date. 2003. In so many ways the ongoing tragedy today is in the same place it was over 20 years ago. Rabbi Lew died in 2009. We cannot ask him what he thinks of the world today, but in many ways there is no need. Little has changed. So, to emphasize one more time, let's go back to 2003:

I think that the great philosopher George Santayana got it exactly wrong. I think it is precisely those who insist on remembering history who are doomed to repeat it. For a subject with so little substance, for something that is really little more than a set of intellectual interpretations, history can become a formidable trap— a sticky snare from which we may find it impossible to extricate ourselves. I find it impossible to read the texts of Tisha B’Av, with their great themes of exile and return, and their endless sense of longing for the land of Israel, without thinking of the current political tragedy in the Middle East. I write this at a very dark moment in the long and bleak history of that conflict. Who knows what will be happening there when you read this? But I think it’s a safe bet that whenever you do, one thing is unlikely to have changed. There will likely be a tremendous compulsion for historical vindication on both sides. Very often, I think it is precisely the impossible yearning for historical justification that makes resolution of this conflict seem so impossible. The Jews want vindication for the Holocaust, and for the two thousand years of European persecution and ostracism that preceded it; the Jews want the same Europeans who now give them moral lectures to acknowledge that this entire situation would never have come about if not for two thousand years of European bigotry, barbarism, and xenophobia. They want the world to acknowledge that Israel was attacked first, in 1948, in 1967, in 1973, and in each of the recent Intifadas. They want acknowledgment that they only took the lands from which they were attacked during these conflicts, and offered to return them on one and only one condition— the acknowledgment of their right to exist. When Anwar Sadat met that condition, the Sinai Peninsula, with its rich oil fields and burgeoning settlement towns, was returned to him. And they want acknowledgment that there are many in the Palestinian camp who truly wish to destroy them, who have used the language of peace as a ploy to buy time until they have the capacity to liquidate Israel and the Jews once and for all. They want acknowledgment that they have suffered immensely from terrorism, that a people who lost six million innocents scarcely seventy years ago should not have had to endure the murder of its innocent men, women, and children so soon again. And they want acknowledgment that in spite of all this, they stood at Camp David prepared to offer the Palestinians everything they claimed to have wanted— full statehood, a capital in East Jerusalem— and the response of the Palestinians was the second Intifada, a murderous campaign of terror and suicide bombings.

And the Palestinians? They would like the world to acknowledge that they lived in the land now called Israel for centuries, that they planted olive trees, shepherded flocks, and raised families there for hundreds of years; they would like the world to acknowledge that when they look up from their blue-roofed villages, their trees and their flowers, their fields and their flocks, they see the horrific, uninvited monolith of western culture— immense apartment complexes, shopping centers, and industrial plants on the once-bare and rocky hills where the voice of God could be heard and where Muhammad ascended to heaven. And they would like the world to acknowledge that it was essentially a European problem that was plopped into their laps at the end of the last great war, not one of their own making. They would like the world to acknowledge that there has always been a kind of arrogance attached to this problem; that it was as if the United States and England said to them, Here are the Jews, get used to them. And they would like the world to acknowledge that it is a great indignity, not to mention a significant hardship, to have been an occupied people for so long, to have had to submit to strip searches on the way to work, and intimidation on the way to the grocery store, and the constant humiliation of being subject— a humiliation rendered nearly bottomless when Israel, with the benefit of the considerable intellectual and economic resources of world Jewry, made the desert bloom, in a way they had never been able to do. And they would like the world to acknowledge that there are those in Israel who are determined never to grant them independence, who have used the language of peace as a ploy to fill the West Bank with settlement after settlement until the facts on the ground are such that an independent Palestinian state on the West Bank is an impossibility. They would like the world to acknowledge that there is no such thing as a gentle occupation— that occupation corrodes the humanity of the occupier and makes the occupied vulnerable to brutality.

And I think the need to have these things acknowledged— the need for historical affirmation— is so great on both sides that both the Israelis and the Palestinians would rather perish as peoples than give this need up. In fact, I think they both feel that they would perish as peoples precisely if they did. They would rather die than admit their own complicity in the present situation, because to make such an admission would be to acknowledge the suffering of the other and the legitimacy of the other’s complaint, and that might mean that they themselves were wrong, that they were evil, that they were bad. That might give the other an opening to annihilate or enslave them. That might make such behavior seem justifiable.

I wonder how many of us are stuck in a similar snare. I wonder how many of us are holding on very hard to some piece of personal history that is preventing us from moving on with our lives, and keeping us from those we love. I wonder how many of us cling so tenaciously to a version of a story of our lives in which we appear to be utterly blameless and innocent, that we become oblivious to the pain we have inflicted on others, no matter how unconsciously or inevitably or innocently we have inflicted it. I wonder how many of us are terrified of acknowledging the truth of our lives because we think it will expose us. How many of us stand paralyzed between the moon and the sun; frozen — unable to act in the moment — because of our terror of the past and because of the intractability of the present circumstances that past has wrought? Forgiveness, it has been said, means giving up our hopes for a better past. This may sound like a joke, but how many of us refuse to give up our version of the past, and so find it impossible to forgive ourselves or others, impossible to act in the present?

I don't have answers. In my childhood I was promised peace in the Middle East. I am still waiting. I wish I knew what was needed to get us there.

Pirkei Avot Chapter 5, Verse 21 implies that I am perhaps old enough to have some Wisdom, but am not yet old enough to give Counsel. The only wisdom I have obtained so far is that in most disagreements, people can disagree about the "facts", be approaching the situation with fundamentally different values, or both. I believe that to have any meaningful discussion on a topic as fraught as this one, common values must first be established. Only then can we approach reality side-by-side, examine our beliefs, find mutually trustworthy sources of information, and find agreement about the state of reality. When values are aligned and facts are agreed upon, we might have some hope of letting go of just enough bits of history to find a path through this mess.

Decades ago I was a child promised peace. Today I have children of my own. Today, on both sides, there are children suffering from the choices of their parents and grandparents. All the children deserve better.

There are people on both sides with genocide in their hearts. I don't yet know what to do about that, but we cannot let them win.

I wish that those who should be wise enough to provide counsel, particularly those with power, would get their acts together.

I wish for the peace and safety of all the innocents.

I wish for peace. In my lifetime. This year. Tomorrow. Or even today.

I hope that these words of Rabbi Alan Lew will reach just a few more people thanks to this post being on the internet. I hope they have touched you. Thank you for reading.

OpenTelemetry Tracing for Dropshot Nahum Shalman

I spoke at Oxide's dtrace.conf(24) about a project I've been hacking on for the past couple weeks:

Slides:

OpenTelemetry Tracing for Dropshot

Code:

[DRAFT - DO NOT MERGE] Basic OpenTelemetry integration by nshalman · Pull Request #1201 · oxidecomputer/dropshot
OpenTelemetry for Dropshot: This is still very much a rough draft, but I want it to be clearly available for anyone interested. Checklist of things that are needed (note that much of it is currently…)

Thoughts on Static Code Analysis The Trouble with Tribbles...

I use a number of tools for static code analysis on my projects - primarily Java based. Mostly:

  1. codespell
  2. checkstyle
  3. shellcheck
  4. PMD
  5. SpotBugs

Wait, I hear you say. Spell checking? Absolutely, it's a key part of code and documentation quality. There's absolutely no excuse for shoddy spelling. And I sometimes find that if the spelling's off, it's a sign that concentration levels weren't what they should have been, and other errors might also have crept in.

checkstyle is far more than style, although it has very fixed ideas about that. I have a list of checks that must always pass (now that I've cleaned them up, at any rate), so it's now at the state where it's just looking for regressions - the remaining things it complains about I'm happy to ignore (or the cost of fixing them massively outweighs any benefit).

One thing that checkstyle is keen on is thorough javadoc. Initially I might have been annoyed by some of its complaints, but then realised 2 things. First, it makes you consider whether a given API really should be public. And more generally as part of that, having to write javadoc can make you reevaluate the API you've designed, which pushes you towards improving it.

When it comes to shellcheck, I can summarise its approach as "quote all the things". Which is fine, until it isn't and you actually want to expand a variable into its constituent words.
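For example (my own illustration, not from any particular project), when the word splitting really is intentional you can leave the variable unquoted and tell shellcheck so:

# SC2086 is the "quote to prevent word splitting" warning; here the
# splitting of $RSYNC_FLAGS into separate arguments is deliberate.
RSYNC_FLAGS="-P -a -x"
# shellcheck disable=SC2086
rsync $RSYNC_FLAGS src/ dest/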

But even there, a big benefit again is that shellcheck makes you look at the code and think about what it's doing. Which leads to an important point - automatic fixing of reported problems will (apart from making mistakes) miss the benefit of code inspection.

Actual coding errors (or just imperfections) tend to be the domain of PMD and SpotBugs. I have a long list of exceptions for PMD, depending on each project. I'm writing applications for unix-like systems, and I really do want to write directly to stdout and stderr. If I want to shut the application down, then calling System.exit() really is the way to do it.

I've been using PMD for years, and it took a while to get the recent version 7 configured to my liking. But having run PMD against my code for so long means that a lot of the low hanging fruit had already been fixed (and early on my code was much much worse than it is now). I occasionally turn the exclusions off and see if I can improve my code, and occasionally win at this game, but it's a relatively hard slog.

So far, SpotBugs hasn't really added much. I find its output somewhat unhelpful (I do read the reports), but initial impressions are that it's finding things the other tools don't, so I need to work harder to make sense of it.

dtrace.conf(24) Oxide Computer Company Blog

shirt

Sometime in late 2007, we had the idea of a DTrace conference. Or really, more of a meetup; from the primordial e-mail I sent:

The goal here, by the way, is not a DTrace user group, but more of a face-to-face meeting with people actively involved in DTrace — either by porting it to another system, by integrating probes into higher level environments, by building higher-level tools on top of DTrace or by using it heavily and/or in a critical role. That said, we also don’t want to be exclusionary, so our thinking is that the only true requirement for attending is that everyone must be prepared to speak informally for 15 mins or so on what they are doing with DTrace, any limitations that they have encountered, and some ideas for the future. We’re thinking that this is going to be on the order of 15-30 people (though more would be a good problem to have — we’ll track it if necessary), that it will be one full day (breakfast in the morning through drinks into the evening), and that we’re going to host it here at our offices in San Francisco sometime in March 2008.

This same note also included some suggested names for the gathering, including what in hindsight seems a clear winner: DTrace Bi-Mon-Sci-Fi-Con. As if knowing that I should leave an explanatory note to my future self as to why this name was not selected, my past self fortunately clarified: "before everyone clamors for the obvious Bi-Mon-Sci-Fi-Con, you should know that most Millennials don’t (sadly) get the reference." (While I disagree with the judgement of my past self, it at least indicates that at some point I cared if anyone got the reference.)

We settled on a much more obscure reference, and had the first dtrace.conf in March 2008. Befitting the style of the time, it was an unconference (a term that may well have hit its apogee in 2008) that you signed up to attend by editing a wiki. More surprising given the year (and thanks entirely to attendee Ben Rockwood), it was recorded — though this is so long ago that I referred to it as video taping (and with none of the participants mic’d, I’m afraid the quality isn’t very good). The conference, however, was terrific, viz. the reports of Adam, Keith and Stephen (all somehow still online nearly two decades later). If anything, it was a little too good: we realized that we couldn’t recreate the magic, and we demurred on making it an annual event.

Years passed, and memories faded. By 2012, it felt like we wanted to get folks together again, now under a post-lawnmower corporate aegis in Joyent. The resulting dtrace.conf(12) was a success, and the Olympiad cadence felt like the right one; we did it again four years later at dtrace.conf(16).

In 2020, we came back together for a new adventure — and the DTrace Olympiad was not lost on Adam. Alas, dtrace.conf(20) — like the Olympics themselves — was cancelled, if implicitly. Unlike the Olympics, however, it was not to be rescheduled.

More years passed and DTrace continued to prove its utility at Oxide; last year when Adam and I did our "DTrace at 20" episode of Oxide and Friends, we vowed to hold dtrace.conf(24) — and a few months ago, we set our date to be December 11th.

At first we assumed we would do something similar to our earlier conferences: a one-day participant-run conference, at the Oxide office in Emeryville. But times have changed: thanks to the rise of remote work, technologists are much more dispersed — and many more people would need to travel for dtrace.conf(24) than in previous DTrace Olympiads. Travel hasn’t become any cheaper since 2008, and the cost (and inconvenience) was clearly going to limit attendance.

The dilemma for our small meetup highlights the changing dynamics in tech conferences in general: with talks all recorded and made publicly available after the conference, how does one justify attending a conference in person? There can be reasonable answers to that question, of course: it may be the hallway track, or the expo hall, or the after-hours socializing, or perhaps some other special conference experience. But it's also not surprising that some conferences — especially ones really focused on technical content — have decided that they are better off doing as conference giant O'Reilly Media did, and going exclusively online. And without the need to feed and shelter participants, the logistics for running a conference become much more tenable — and the price point can be lowered to the point that even highly produced conferences like P99 CONF can be made freely available. This, in turn, leads to much greater attendance — and a network effect that can get back some of what one might lose going online. In particular, using chat as the hallway track can be much more effective (and is certainly more scalable!) than the actual physical hallways at a conference.

For conferences in general, there is a conversation to be had here (and as a teaser, Adam and I are going to talk about it with Stephen O'Grady and Theo Schlossnagle on Oxide and Friends next week), but for our quirky, one-day, Olympiad-cadence dtrace.conf, the decision was pretty easy: there was much more to be gained than lost by going exclusively on-line.

So dtrace.conf(24) is coming up next week, and it’s available to everyone. In terms of platform, we’re going to try to keep that pretty simple: we’re going to use Google Meet for the actual presenters, which we will stream in real-time to YouTube — and we’ll use the Oxide Discord for all chat. We’re hoping you’ll join us on December 11th — and if you want to talk about DTrace or a DTrace-adjacent topic, we’d love for you to present! Keeping to the unconference style, if you would like to present, please indicate your topic in the #session-topics Discord channel so we can get the agenda fleshed out.

While we’re excited to be online, there are some historical accoutrements of conferences that we didn’t want to give up. First, we have a tradition of t-shirts with dtrace.conf. Thanks to our designer Ben Leonard, we have a banger of a t-shirt, capturing the spirit of our original dtrace.conf(08) shirt but with an Oxide twist. It’s (obviously) harder to make those free but we have tried to price them reasonably. You can get your t-shirt by adding it to your (free) dtrace.conf ticket. (And for those who present at dtrace.conf, your shirt is on us — we’ll send you a coupon code!)

Second, for those who can make their way to the East Bay and want some hangout time, we are going to have an après conference social event at the Oxide office starting at 5p. We’re charging something nominal for that too (and like the t-shirt, you pay for that via your dtrace.conf ticket); we’ll have some food and drinks and an Oxide hardware tour for the curious — and (of course?) there will be Fishpong.

Much has changed since I sent that e-mail 17 years ago — but the shared values and disposition that brought together our small community continue to endure; we look forward to seeing everyone (virtually) at dtrace.conf(24)!