One of the considerations in designing our Oxide rack is which parts we expect to be accessible and by what means. The Oxide rack is designed to live in a data center, with access exclusively over the network. The only reason an engineer should ever need to physically visit a rack is to replace a failing part, such as a disk. Our Service Processor (SP) is accessible via the management network.
During some of our first attempts at putting our next-generation Cosmo sled into an Oxide rack, we would see the Service Processor drop off the network. This is a tricky situation to debug, as without network access we have limited insight into the state of the SP itself. Debugging started based on the state of the rest of the system (the original Hubris bug may contain spoilers for the blog post!):
- The AMD host CPU was still alive, meaning the full system itself still had power
- The SP itself was not broadcasting over the management network that it was alive
- There were no increases in network data counters coming from the SP
- The fans were spinning at a constant elevated rate. The service processor is responsible for fan control, so this was an indication the fan controller may have fallen back to emergency full power mode.
- This was not reproducible on a sled outside a rack
The Service Processor runs our custom operating system, Hubris. Each portion of the system (networking, thermal control, update, etc.) is written as a separate task. Hubris is not a true Real Time Operating System with deadline guarantees, but it does have the notion of task priorities. One of our working theories was that we had a software bug that was causing task starvation. If the networking task was unable to run because some other task was eating up all the CPU time, it would not be able to respond over the network. A likely culprit for task starvation would be a task that had gotten into an infinite crash loop, with all CPU time being spent restarting the task. We adjusted the task restart time to have a longer delay to catch this case. We also wanted to be able to observe whether the SP was still making progress even if we lacked network access, so we switched our chassis LED from "always on" to blinking.
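As a rough illustration of the idea (the helpers here are hypothetical stand-ins, not Hubris APIs), a blink loop doubles as a liveness indicator: if the task stops being scheduled, the LED freezes in whatever state it was last left in.

```rust
// Illustrative sketch only: `toggle_chassis_led` and `sleep_ms` are
// hypothetical stand-ins, not the actual Hubris task code.
fn toggle_chassis_led() { /* drive the chassis LED GPIO here */ }
fn sleep_ms(_ms: u32) { /* yield the CPU for the given time */ }

fn blink_task_main() -> ! {
    loop {
        // If this task stops being scheduled (for example, a higher-priority
        // task is stuck in a tight loop, or the whole system wedges), the LED
        // freezes in whichever state it was last left in: stuck on or stuck off.
        toggle_chassis_led();
        sleep_ms(500);
    }
}
```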
We were fortunate to be able to reproduce the issue with these debug changes, but the results were still confusing: in some cases we would see the LED stuck on, and in other cases the LED was stuck off. The task responsible for LED blinking was near the top of the priority order, which limited the number of places a task could be stuck.
One of the many advantages of writing Hubris in Rust is eliminating bug classes such as buffer overflows. A category of issues Hubris is still particularly prone to is stack overflows. This is because Hubris requires manual sizing of stacks for tasks, and calculating maximum stack size has proven tricky. Our ability to detect undersized stacks has improved with the addition of the emit-stack-sizes feature, but we can still hit some edge cases. When a stack overflow occurs in a task, the task safely restarts. A stack overflow in the kernel could potentially produce similar behavior: a system that looks like it isn't making progress. Unfortunately for us, the stack margins on the kernel were relatively large (512 bytes!), so this was an unlikely case.
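As a minimal illustration of why a static bound is hard to compute (this is a generic example, not Hubris code), any recursion or call through a function pointer makes the worst-case stack depth depend on runtime data, which analysis built on emit-stack-sizes data cannot bound on its own.

```rust
// Generic illustration: stack use here depends on how deep the runtime data
// structure is, so no static per-function stack-size table can bound it.
struct Node<'a> {
    children: &'a [Node<'a>],
}

fn count_nodes(node: &Node<'_>) -> usize {
    let mut total = 1;
    for child in node.children {
        // Each level of recursion adds another stack frame; the maximum
        // depth is only known at runtime.
        total += count_nodes(child);
    }
    total
}
```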
At this point, we really needed to get more debugging information out of the system. For manufacturing purposes, we have SWD debug headers. These are not expected to be used on a production system and especially not a system in a running rack. We had to do some creative cable pulling to get them attached with the assistance of coworkers in the Oxide office.
Fortunately, our cable attachment paid dividends: we reproduced the issue with the probe attached! This was not immediately fruitful: the debug probe was unable to actually halt the CPU via debug halt, which limited our ability to extract diagnostic information. Our Service Processor uses a Cortex-M7 STM32H7, and the number of ways to put the system in such a state is limited.
This put our focus on identifying what parts of the system could cause such behavior. A major change from our first-generation Gimlet system was the addition of an FPGA to control more parts of our system, such as host flash. This FPGA is connected using a simple, old-school parallel bus, like the sort you might use for RAM, and accessed via the STM32H7 Flexible Memory Controller. As stated in the manual (Section 22.1 of RM0433):
Its main purposes are:
* to translate AXI transactions into the appropriate external device protocol
* to meet the access time requirements of the external memory devices
One way a CPU can potentially get stuck is if it never receives a bus acknowledgement from an external device. A bug in the FPGA timing, for example, could result in the CPU hanging forever when attempting to read a register. To test this theory, we created an FPGA test image with a register that, when read, would intentionally hang the FMC bus. This produced very similar behavior to what we observed and was a strong indicator that we were looking at the right part of the system to find the issue.
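A minimal sketch of what such a read looks like from the SP's point of view (the base address is the STM32H7's default FMC bank 1 window; the register name and offset are made up for illustration):

```rust
// Hypothetical test-register read over the FMC. If the FPGA never completes
// the bus transaction, this volatile read never returns and the CPU stalls
// with no fault to handle.
const FMC_BANK1_BASE: usize = 0x6000_0000; // STM32H7 FMC NOR/PSRAM bank 1
const HANG_TEST_REG: usize = 0x40;         // made-up offset for the test register

fn read_hang_register() -> u32 {
    let addr = (FMC_BANK1_BASE + HANG_TEST_REG) as *const u32;
    // SAFETY: this address is a device register mapped through the FMC; the
    // read has no effect on Rust-visible memory.
    unsafe { core::ptr::read_volatile(addr) }
}
```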
We typically rely on full system dumps to debug Hubris problems. That is not possible unless we can halt the CPU. ARM CPUs do support vector catch, though: it's possible to configure the CPU so that on reset, it halts before executing the first instruction. Our hope was that a vector catch reset would unstick the CPU sufficiently without trampling over our existing state. This did work. We lost the running register state, including the program counter, but the rest of the Hubris state in RAM was preserved across the reset and looked reasonably consistent. We could see which Hubris task was running, but nothing there looked like it was accessing the FMC.
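For reference, "vector catch on reset" boils down to one bit in the ARMv7-M Debug Exception and Monitor Control Register. The debug probe sets it over SWD rather than the firmware doing so, but at the register level it looks roughly like this:

```rust
// Sketch based on the ARMv7-M architecture manual. DEMCR lives at
// 0xE000_EDFC; setting VC_CORERESET halts the core before the first
// instruction executes after a reset. In practice the debug probe writes
// this over SWD, and it only takes effect with halting debug enabled.
const DEMCR: *mut u32 = 0xE000_EDFC as *mut u32;
const VC_CORERESET: u32 = 1 << 0;

unsafe fn enable_reset_vector_catch() {
    let current = core::ptr::read_volatile(DEMCR);
    core::ptr::write_volatile(DEMCR, current | VC_CORERESET);
}
```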
Our hardware engineers did a review of the FPGA timings and found that we might not have been meeting the timing constraints required by the memory interface. We merged the fix and figured that the vector catch dumps were just inconsistent, most likely due to the cache. When we ran experiments with the cache turned off, the dumps were consistent, but we never reproduced the actual issue.
We continued Hubris development as usual over the next several weeks. One of the changes we worked on during this period was related to our measured boot work. Our Root of Trust (RoT) is responsible for taking a hash of the SP flash at bootup, which eventually gets used by higher-level software. To achieve the security properties we need, the SP may reset itself multiple times in a row at first bootup. While testing this change, we saw the same symptoms come back: the Cosmo SP would disappear from the network and appear dead. This change turned out to be incredibly good at reproducing the issue, cutting the reproduction time from potentially 24+ hours to approximately 10-20 minutes. The initial dumps still didn't show a smoking gun, but we remained highly suspicious of the FMC bus, since there were still only a limited number of cases that could produce such symptoms.
The high reproduction rate gave us a chance to try many experiments, none of which were fruitful:
- Adjusting the rate at which we reset and the number of resets before normally booting
- Clearing the FPGA bit stream an extra time
- Restricting tasks from accessing the FMC bus
- Removing whole tasks that seemed to be unrelated
Finally, staring at the STM32H7 manual provided an insight: maybe the processor itself was performing accesses on the FMC bus that we weren't expecting! Modern processors hold a large amount of internal state that isn't directly visible to the programmer. Outside of certain synchronization points or cache maintenance instructions, a programmer cannot know when the CPU will pull data into or out of the cache. A CPU writing data from the cache back to memory is a memory access, so the CPU can be making memory accesses to addresses unrelated to the current program counter.
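As a rough sketch (using the cortex-m crate; this is not code from Hubris), explicit cache maintenance is the only point at which software gets to choose when dirty lines are written back:

```rust
use cortex_m::peripheral::SCB;

// Rough sketch, not Hubris code: force dirty cache lines covering `buf` out
// to memory. Outside of explicit operations like this, the hardware evicts
// dirty lines whenever it chooses, at addresses unrelated to whatever the
// program counter happens to be doing.
fn flush_buffer(scb: &mut SCB, buf: &[u8]) {
    scb.clean_dcache_by_address(buf.as_ptr() as usize, buf.len());
    cortex_m::asm::dsb(); // make sure the write-backs have completed
}
```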
Hubris utilizes the Memory Protection Unit (MPU) to provide isolation between tasks and enforce privilege levels. Our configuration uses the MPU for the unprivileged tasks but uses the default memory map for the (privileged) kernel. In the tasks, the FMC is mapped as Uncached Device Memory. Based on our reading of the STM32H7 manual, it turned out our chosen base address for the FMC bus had a default memory type of Normal Cached. This means the FMC has different attributes depending on whether it’s being accessed from a task or the kernel.
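To make the mismatch concrete, here is a hedged sketch in ARMv7-M MPU terms (standard MPU_RASR attribute encodings from the architecture manual; this is not Hubris's actual MPU configuration):

```rust
// Standard ARMv7-M MPU_RASR attribute bits (architecture manual), shown only
// to illustrate the two views of the same addresses.
const RASR_B: u32 = 1 << 16;
const RASR_C: u32 = 1 << 17;
const RASR_TEX_SHIFT: u32 = 19;

// What tasks used for the FMC window: shareable Device memory
// (TEX=0b000, C=0, B=1), never cached.
const ATTR_DEVICE: u32 = RASR_B;

// What the architectural default map assigns to the external-RAM region
// around 0x6000_0000, where the FMC banks live by default: Normal memory,
// write-back write-allocate (TEX=0b001, C=1, B=1), i.e. cacheable.
const ATTR_NORMAL_WBWA: u32 = (0b001 << RASR_TEX_SHIFT) | RASR_C | RASR_B;
```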
Section A3.5.7 of the ARMv7-M reference manual is devoted to mismatched memory attributes and what properties are lost in this situation. Based on discussion with our hardware engineers, the line "Preservation of the size of accesses" was the most suspicious. Our FPGA interface was designed for 32-bit accesses, and 16-bit or 8-bit accesses could potentially cause problems.
It’s important to note that the kernel was never intentionally accessing the FMC through the Normal Cached mapping. The most likely scenario was:
- While running an unprivileged task that accesses the FMC, the CPU issues a store that makes it into the processor's store buffer
- An interrupt occurs, switching us into privileged mode, which uses the default memory map
- The store hits the cache, because the default memory map says that address is cacheable
- The cache attempts to write to memory in ways outside the expected Device Memory attributes
One of the last lines of section A3.5.7 is "Arm strongly recommends that software does not use mismatched attributes for aliases of the same location." The default ARM memory map (which the kernel relies on) assigns different attributes to different sections of the address space, and one of the sections is set up the way we want: device memory, no caching. It turns out the STM32H7 FMC supports changing its base address to appear in this section of address space, likely to avoid the specific problem we were facing. The final fix was changing the base address to the section with matching attributes. We’ve seen no instances of this issue since that fix was merged.
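For reference, the relevant architectural default-map regions look roughly like this (ARMv7-M default memory map; the specific address the FMC window was moved to is not shown here):

```rust
// ARMv7-M default memory map, abbreviated to the two regions that matter here:
// 0x6000_0000..=0x9FFF_FFFF  "External RAM"    -> Normal, cacheable by default
// 0xA000_0000..=0xDFFF_FFFF  "External device" -> Device memory, never cached
fn default_map_is_device(addr: u32) -> bool {
    (0xA000_0000..=0xDFFF_FFFF).contains(&addr)
}
```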
Transparency continues to be an Oxide value. Debugging modern CPUs often involves diving into areas with little transparency. "Under what circumstances will you be unable to access your memory bus?" is a tricky question to answer. Our debugging efforts this time were aided by documentation from ARM and STM that eventually explained our problem. Given the difficulty in debugging this issue, highlighting this potential problem in vendor documentation would benefit all customers. Oxide hopes all hardware vendors continue to document as much of their parts as possible for the benefit of their customers.