Tribblix on SPARC: sparse devices in an LDOM The Trouble with Tribbles...

I recently added a ddu-like capability to Tribblix.

In that article I showed the devices in a bhyve instance. As might be expected, there really aren't a lot of devices you need to handle.

What about SPARC, you might ask? Even if you don't, I'll ask for you.

Running Tribblix in an LDOM, this is what you see:

root@sparc-m32:/root# zap ddu
Device SUNW,kt-rng handled by n2rng in TRIBsys-kernel-platform [installed]
Device SUNW,ramdisk handled by ramdisk in TRIBsys-kernel [installed]
Device SUNW,sun4v-channel-devices handled by cnex in TRIBsys-ldoms [installed]
Device SUNW,sun4v-console handled by qcn in TRIBsys-kernel-platform [installed]
Device SUNW,sun4v-disk handled by vdc in TRIBsys-ldoms [installed]
Device SUNW,sun4v-domain-service handled by vlds in TRIBsys-ldoms [installed]
Device SUNW,sun4v-network handled by vnet in TRIBsys-ldoms [installed]
Device SUNW,sun4v-virtual-devices handled by vnex in TRIBsys-kernel-platform [installed]
Device SUNW,virtual-devices handled by vnex in TRIBsys-kernel-platform [installed]
 

It's hardly surprising, but that's a fairly minimal list.

It does make me wonder whether to produce a special SPARC Tribblix image precisely to run in an LDOM. After all, I already have slightly different variants on x86 designed for cloud in general, and one for EC2 specifically, that don't need the whole variety of device drivers that the generic image has to include.

Expecting an AI boom? The Trouble with Tribbles...

I recently went down to the smoke, to Tech Show London.

There were 5 constituent shows, and I found what each sub-show was offering - and the size of each component - quite interesting.

There wasn't much going on in Devops Live, to be honest. Relatively few players had shown up, nothing terribly interesting.

There wasn't that much in Big Data & AI World either. I was expecting much more here, and what there was seemed to be on the periphery. More support services than actual product.

The Cloud & Cyber Security Expo was middling, not great, and there was an AI slant in evidence. Not proper AI, but a sprinkling of AI dust on things just to keep up with the Joneses.

Cloud and AI Infrastructure had a few bright spots. I saw actual hardware on the floor - I had seen disk shelves over in the Big Data section, but here I spotted a Tape Library (I used to use those a lot, haven't seen much in that area for a while) and a VDI blade. Talked to a few people, including the Zabbix and Tailscale stands.

But when it came to Data Centre World, that was buzzing. It was about half the overall floor area, so it was far and away the dominant section. Tremendous diversity too - concrete, generators, power cables, electrical switching, fiber cables, cable management, thermal management, lots of power and cooling. Lots and lots of serious physical infrastructure.

There was an obvious expectation on display that there's a massive market around high-density compute. I saw multiple vendors with custom rack designs - rear-door and liquid cooling in evidence. Some companies addressing the massive demand for water.

If these people are at a trade show, then the target market isn't the 3 or 4 hyperscalers. What's being anticipated in this frenzy is very much companies building out their own datacentre facilities, which is an interesting trend.

There's a saying "During a gold rush, sell shovels". What I saw here was a whole army of shovel-sellers getting ready for the diggers to show up.

Tribblix, UEFI, and UFS The Trouble with Tribbles...

Somewhat uniquely among illumos distributions, Tribblix doesn't require installation to ZFS - it allows the possibility of installing to a UFS root file system.

I'm not sure how widely used this is, but it will get removed as an option at some point, as the illumos UFS won't work past Y2038.
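
(That Y2038 limit is just signed 32-bit time_t arithmetic:

2^31 seconds = 2147483648 seconds, roughly 68 years
1970-01-01 00:00:00 UTC + 2147483648 seconds = 2038-01-19 03:14:08 UTC

and UFS timestamps can't represent anything later.)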

I recently went through the process of testing an install of the very latest Tribblix to UFS, in a bhyve guest running UEFI. The UEFI part was a bit more work, and doing it clarified how some of the internals fit together.

(One reason for doing these unusual experiments is to better understand how things work, especially those that are handled automatically by more mainstream components.)

OK, on to installation.

While install to zfs will automatically lay out zfs pools and file systems, the ufs variant needs manual partitioning. There are two separate concerns - the Tribblix install, and UEFI boot.

The Tribblix installer for UFS assumes 2 things about the layout of the disk it will install to:

  1. Slice s0 will hold the operating system, and will be mounted at /.
  2. The slice s1 will be used for swap. (On zfs, you create a zfs volume for swap; on ufs you use a separate raw partition.)

It's slightly unfortunate that these slices are hard-coded into the installer.
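
As an aside, the installed system refers to that swap slice in /etc/vfstab in the standard Solaris way; I'd expect an entry along these lines (the device name is purely illustrative):

/dev/dsk/c1t0d0s1   -       -       swap    -       no      -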

For UEFI boot we need 2 other slices:

  1. A system partition (what's called the EFI System Partition, aka ESP)
  2. A separate partition to put the stage2 bootloader in. (On zfs there's a little bit of free space you can use; there isn't enough on ufs so it needs to be handled separately.)

The question then arises as to how big these need to be. Now, if you create a root pool with ZFS (using zpool create -B), it will create a 256MB partition for the ESP. That turns out to be the minimum size for FAT32 on 4k-sector disks, so it's a size that should always work. On disks with a 512-byte sector size, the ESP needs to be 32MB or larger (there's a comment in the code about 33MB). The amount of data you're actually going to store there is very much less.

The stage2 partition doesn't have to be terribly big.
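
For the slice arithmetic that follows, it's handy to remember that with 512-byte sectors each MB is 2048 sectors:

 16MB =  16 x 2048 =   32768 sectors
 64MB =  64 x 2048 =  131072 sectors
512MB = 512 x 2048 = 1048576 sectors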

So as a result of this I'm going to create a GPT label with 4 slices - 0 and 1 for Tribblix, 3 and 4 for EFI system and boot.

There are 2 things to note here. First, the partitions you create don't have to be laid out on disk in numerical order; you can put the slices in any order you want. This was true for SMI disks too, where it was common practice in Solaris to put swap on slice 1 at the start of the disk, with slice 0 after it. Second, EFI/GPT doesn't assign any special significance to slice 2, unlike the old SMI label where slice 2 was conventionally the whole disk. I'm avoiding slice 2 here not because it's necessary, but so as not to confuse anyone used to the old SMI scheme.

The first thing to do with a fresh disk is to go into format, invoked as format -e (expert mode in order to access the EFI options). Select the disk, run fdisk from inside format, and then install an EFI label.

format -e
#
# choose the disk
#
fdisk
y - to accept defaults
l - to label
1 - choose efi

Then we can lay out the partitions. Still in format, type p to enter the partition menu and p to display the partitions.

p - enter partition menu
p - show current partition table

At this point on a new disk it should have 8 as "reserved" and 0 as "usr", with everything else "unassigned". We're going to leave slice 8 untouched.

First, note where slice 0 currently starts and ends. I'll resize it at the end, but we're going to put slices 3, 4, and 1 at the start of the disk and then resize 0 to fill in what's left.

To configure the settings for a given slice, just type its number.

Start with slice 3, type 3 and configure the system partition.  This has to use the "system" tag.

tag: system
flags: wm (just hit return to accept)
start: 34
size: 64mb

Type p again to view the partition table and note the last sector of slice 3 we just created, and add 1 to it to give the start sector of the next slice. Type 4 to configure the boot partition, and it must have the tag "boot".

tag: boot
flags: wm (just hit return to accept)
start: 131106
size: 16mb

Type p again to view the partition table, take note of the last sector for the new slice 4, and add 1 to get the start sector for the next one - which is slice 1, the swap partition. (With the 16MB slice 4 starting at 131106, its last sector is 163873, so swap starts at 163874.) Type 1 and configure:

tag: swap
flags: wm (just hit return to accept)
start: 163874
size: 512mb

We're almost done. The final step is to resize partition 0. Again you get the start sector by adding 1 to the last sector of the swap partition you just created. And rather than giving a size you can give the end sector using an 'e' suffix, which should be one less than the start of the reserved partition 8, and also the last sector of the original partition 0. Type 0 and enter something like:

tag: usr
flags: wm (just hit return to accept)
start: 1212450
size: 16760798e

Type 'p' one last time to view the partition table, check that the Tag entries are correct, and that the First and Last Sectors don't overlap.

Then type 'l' to write the label to the disk. It will ask you for the label type - make sure it's EFI again - and for confirmation.
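
You can also sanity-check the result from outside format; prtvtoc (run against the whole-disk device used in this example) should show the slices just created:

prtvtoc /dev/rdsk/c1t0d0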

Then we can do the install:

./ufs_install.sh c1t0d0s0

It will ask for confirmation that you want to create the file system.

At the end it ought to say "Creating pcfs on ESP /dev/rdsk/c1t0d0s3".

If it says "Requested size is too small for FAT32." then that's a hint that you need the system partition to be bigger. (An alternative trick is to mkfs the pcfs file system yourself; if you create it using FAT16 it will still work, but you can get away with it being a lot smaller.)
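
That trick would be something like the following, run against the ESP slice before the installer gets there. (Treat this as a sketch to check against the mkfs_pcfs man page; nofdisk is there because a GPT slice has no fdisk table inside it.)

mkfs -F pcfs -o fat=16,nofdisk /dev/rdsk/c1t0d0s3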

It should also tell you that it's writing the pmbr to slice 4 and to p0.

With that, rebooting into the newly installed system ought to work.

Now, the above is a fairly complicated set of instructions. I could automate this, but do we really want to make it that easy to install to UFS?

Introducing a ddu-alike for Tribblix The Trouble with Tribbles...

Introducing a new feature in Tribblix m36. There's a new ddu subcommand for zap.

In OpenSolaris, the Device Driver Utility would map the devices it found and work out what software was needed to drive them. This isn't that utility, but is inspired by that functionality, rewritten for Tribblix as a tiny little shell script.

As an example, this is the output of zap ddu for Tribblix in a bhyve instance:

jack@tribblix:~$ zap ddu
Device acpivirtnex handled by acpinex in TRIBsys-kernel-platform [installed]
Device pci1af4,1000,p handled by vioif in TRIBdrv-net-vioif [installed]
Device pci1af4,1001 handled by vioblk in TRIBdrv-storage-vioblk [installed]
Device pci1af4,1 handled by vioif in TRIBdrv-net-vioif [installed]
Device pciclass,030000 handled by vgatext in TRIBsys-kernel [installed]
Device pciclass,060100 handled by isa in TRIBsys-kernel-platform [installed]
Device pciex_root_complex handled by npe in TRIBsys-kernel-platform [installed]
Device pnpPNP,303 handled by kb8042 in TRIBsys-kernel [installed]
Device pnpPNP,f03 handled by mouse8042 in TRIBsys-kernel [installed]

Simply put, it will list the devices it finds, which driver is responsible for them, and which package that driver is contained in (and whether that package is installed).
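
The underlying idea is simple enough. A minimal sketch, not the actual zap code, might look like this (it only considers the first compatible identity per device, and skips the driver-to-package mapping):

#!/bin/sh
#
# take the first compatible identity of each device in the tree,
# and find the driver claiming it in /etc/driver_aliases
#
prtconf -pv | sed -n "s/.*compatible: *'\([^']*\)'.*/\1/p" | sort -u |
while read dev
do
    # driver_aliases entries look like: vioif "pci1af4,1000"
    drv=$(awk -v d="\"$dev\"" '$2 == d { print $1 }' /etc/driver_aliases)
    [ -n "$drv" ] && echo "Device $dev handled by $drv"
done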

While it's a tiny feature, it's one of those small things that turns out to be stunningly useful.

If there's a device that we have a driver for that isn't installed, this helps identify it so you know what to install.
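
Something like this (the package name is taken from the listing above) is then all it takes:

zap install TRIBdrv-net-vioif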

What this doesn't do (yet, and unlike the original ddu) is show devices we don't have a driver for at all.

Is all this thing called AI worthwhile? The Trouble with Tribbles...

Before I even start, let's be clear: there are an awful lot of things currently being bundled under the "AI" banner, most of which are neither artificial nor intelligent.

So when I'm talking about AI here, I'm talking about what's being marketed to the masses as AI. This generally doesn't include the more traditional subjects of machine learning or image recognition, which I've often seen relabelled as AI.

But back to the title: is the modern thing called AI worthwhile?

Whatever it is, AI can do some truly remarkable things. That isn't something you can argue against. It can do some truly stupid and hopelessly wrong things as well.

But where does this good stuff fit in? Are businesses really going to benefit by embracing AI?

Well, yes, up to a point. There's a lot of menial work that can be handed off to an AI. It might be able to do it cheaper than a human.

The first snag is the Jevons paradox: by making menial tasks cheaper, a business simply opens the door to larger quantities of menial tasks, so it saves no money and its costs might even go up.

To be honest, though, I have to ask: if you can hand a task off to an AI, was it worth doing in the first place?

That's the rub: yes, you might be able to optimise a process by using AI, but you can optimise it far more by eliminating it entirely.

(And you then don't have to pay extra for someone to come along and clean up after the AI has made a mess of it.)

It's not just the first level of process you need to look at. Take the example of summarising meetings. It's not so much that you don't need the summary; to start with, you need to run meetings better so they don't need to be summarised - and, better still, the meeting probably wasn't needed at all.

Put it another way: the AI will get you to a local minimum of cost, but not to a global minimum. Worse, as AI gets cheaper and more widely used, that local optimisation makes it even harder to optimise the system globally.

So yes, I'm not convinced that much of the AI currently being rammed down our throats has any utility. It will actively block businesses in the pursuit of improvements, and the infatuation with current trendy AI will harm the development of useful AI.

Experience: dialog with prosecutor alp's notes

Wow! Today we had a guest from the prosecutor's office.
They checked whether we (Southern Federal University) block sites from the prosecutor's list. Luckily, our upstream provider does it for us.
But the check itself was ridiculous.
Yes, we receive daily lists of sites to block. I thought they would check some sites from this list and go in peace. But the prosecutor just searched for prohibited works on Google and tried to see whether she could download the materials. Of course, Google found working links :)
What the hell? Why should we imitate this work when everyone knows how to get around these regulations? Why should anyone spend resources on content filtering? I think our government is a herd of archaic dinosaurs who just don't know how to lick the chief's arse better.

Some reasons why I hate you, yes, personally you alp's notes

This material is not going to be... I don't know how I should say it, but I hate you, personally you, who are reading this. Why? You know or don't know what's happening, but do nothing. You live a happy life in some happy state, while our army protects us (and you, though you don't understand it) from neo-nazis. I'm mad when I see another "poor man" with a blue-yellow flag on his avatar, because I see what's happening. And just now some whore has removed the article about the Russkoye Porechnoe massacre from the fucking objective Wikipedia. Why does it matter to me? Because I'm Russian. Why does it matter to you? Because some of you I hoped were my friends, and I want you to see what's happening. Sounds crazy? Perhaps; at least I hope so.

What follows is a partial translation of the Fabrisenko interrogation. This individual was captured in the Kursk region.

"- I raped a woman, made her kneel, and shot her... We heard a noise in a barn. There, in a haystack, three grandfathers and three grandmothers were hiding. We led them out, tied their hands, and took them to the basement. One of the men tried to resist. He tried to break out. One of my comrades, "Guide" (Provodnik), hit this elderly man in the head. In the temple. Then this Guide lifted him up. Bloodworm (Motyl) took a grenade from his pouch and threw it into the basement.

Interrogator: Is this the place where there were three women and three elderly men?
- Yes.
Interrogator: After this, did you go down and look at what had happened to them?
- After this our senior, called Bloodworm (Motyl), went down to the basement and looked. There was nobody left alive."

Fabrisenko was arrested in the neighborhood of the crime scene.

So, if you happen to put a yellow-blue flag on your avatar, or, better still, send money to these nazis, I hate you. I hate you if you pay taxes in Europe or the US, which helps our enemies kill our soldiers. I hate you if you choose to ignore this issue. I hate... Perhaps it doesn't bother you. But I especially hate you if you consider yourself Russian and support these animals. Just remember: I hate you, and I have good reasons for it.

Tailscale for SunOS in 2025 Nahum Shalman

Happy New Year! The wireguard-go port is still sitting around in my fork. I don't know when I will have the energy for the next attempt to get it upstream. In the meantime, I've made some fun progress on the Tailscale side.

Taildrive

The Tailscale folks have shipped Taildrive (currently in alpha) and it's pretty neat. Naturally those of us using Tailscale on illumos wanted to try it out. There was nothing needed directly to get it working, but we had an indirect problem. The tailscale binary communicates with the tailscaled daemon over a unix socket, and the Tailscale folks had added some basic unix-based authentication / authorization abstracted in their peercred library. That library needed support added for getpeerucred, which meant I had to wire things up all the way down in x/sys/unix before then getting it into peercred. But with that work done, Taildrive now works! I tagged a release with that enabled if you're in a rush to play with it.

Using userspace-networking

Tailscale has a way to run without creating a TUN device. It means that client software on the machine can't connect directly to IPs on the Tailnet (though there is a SOCKS proxy you can use), but tailscaled can still do lots of other server-y things (including Taildrive!). That's how Tailscale has been supporting AIX. Which led me to a strange realization: Tailscale had better in-tree support for AIX than it did for illumos and Solaris. No more! We are now on par with AIX in the official tree!
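
Getting that mode is just a matter of flags (these are the standard tailscaled options, nothing port-specific; the SOCKS5 address is illustrative):

tailscaled --tun=userspace-networking --socks5-server=localhost:1055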

What's next

I don't know if the Tailscale folks intend to ship binaries for us from their tree, but after their next release it should be possible to build illumos binaries from their tree that you could use to serve up a ZFS filesystem with Taildrive to your tailnet using the userspace-networking driver.
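
In concrete terms, that would be something like this (the share name and path are invented; tailscale drive is the alpha Taildrive CLI mentioned above):

# share a ZFS filesystem mounted at /tank/media with the tailnet
tailscale drive share media /tank/media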

I will of course also rebase my TUN driver patches and tag a release as well.

Are you running Tailscale on illumos or Solaris? Let me know on Bluesky or Mastodon.

Tragedy Older Than Me Nahum Shalman

In July of 2021, in anticipation of the upcoming High Holy Days I purchased a copy of This Is Real and You Are Completely Unprepared: The Days of Awe as a Journey of Transformation by Rabbi Alan Lew, published in 2003. I was in fact completely unprepared to even read it. It sat in or on my nightstand for more than three years. I finally started reading it during the high holidays in October of 2024 (a few days before the one year anniversary of the events of October 7, 2023).

When I read the section excerpted here, I had to immediately flip back to check the publication date. 2003. In so many ways the ongoing tragedy today is in the same place it was over 20 years ago. Rabbi Lew died in 2009. We cannot ask him what he thinks of the world today, but in many ways there is no need. Little has changed. So, to emphasize one more time, let's go back to 2003:

I think that the great philosopher George Santayana got it exactly wrong. I think it is precisely those who insist on remembering history who are doomed to repeat it. For a subject with so little substance, for something that is really little more than a set of intellectual interpretations, history can become a formidable trap— a sticky snare from which we may find it impossible to extricate ourselves. I find it impossible to read the texts of Tisha B’Av, with their great themes of exile and return, and their endless sense of longing for the land of Israel, without thinking of the current political tragedy in the Middle East. I write this at a very dark moment in the long and bleak history of that conflict. Who knows what will be happening there when you read this? But I think it’s a safe bet that whenever you do, one thing is unlikely to have changed. There will likely be a tremendous compulsion for historical vindication on both sides. Very often, I think it is precisely the impossible yearning for historical justification that makes resolution of this conflict seem so impossible. The Jews want vindication for the Holocaust, and for the two thousand years of European persecution and ostracism that preceded it; the Jews want the same Europeans who now give them moral lectures to acknowledge that this entire situation would never have come about if not for two thousand years of European bigotry, barbarism, and xenophobia. They want the world to acknowledge that Israel was attacked first, in 1948, in 1967, in 1973, and in each of the recent Intifadas. They want acknowledgment that they only took the lands from which they were attacked during these conflicts, and offered to return them on one and only one condition— the acknowledgment of their right to exist. When Anwar Sadat met that condition, the Sinai Peninsula, with its rich oil fields and burgeoning settlement towns, was returned to him. And they want acknowledgment that there are many in the Palestinian camp who truly wish to destroy them, who have used the language of peace as a ploy to buy time until they have the capacity to liquidate Israel and the Jews once and for all. They want acknowledgment that they have suffered immensely from terrorism, that a people who lost six million innocents scarcely seventy years ago should not have had to endure the murder of its innocent men, women, and children so soon again. And they want acknowledgment that in spite of all this, they stood at Camp David prepared to offer the Palestinians everything they claimed to have wanted— full statehood, a capital in East Jerusalem— and the response of the Palestinians was the second Intifada, a murderous campaign of terror and suicide bombings.

And the Palestinians? They would like the world to acknowledge that they lived in the land now called Israel for centuries, that they planted olive trees, shepherded flocks, and raised families there for hundreds of years; they would like the world to acknowledge that when they look up from their blue-roofed villages, their trees and their flowers, their fields and their flocks, they see the horrific, uninvited monolith of western culture— immense apartment complexes, shopping centers, and industrial plants on the once-bare and rocky hills where the voice of God could be heard and where Muhammad ascended to heaven. And they would like the world to acknowledge that it was essentially a European problem that was plopped into their laps at the end of the last great war, not one of their own making. They would like the world to acknowledge that there has always been a kind of arrogance attached to this problem; that it was as if the United States and England said to them, Here are the Jews, get used to them. And they would like the world to acknowledge that it is a great indignity, not to mention a significant hardship, to have been an occupied people for so long, to have had to submit to strip searches on the way to work, and intimidation on the way to the grocery store, and the constant humiliation of being subject— a humiliation rendered nearly bottomless when Israel, with the benefit of the considerable intellectual and economic resources of world Jewry, made the desert bloom, in a way they had never been able to do. And they would like the world to acknowledge that there are those in Israel who are determined never to grant them independence, who have used the language of peace as a ploy to fill the West Bank with settlement after settlement until the facts on the ground are such that an independent Palestinian state on the West Bank is an impossibility. They would like the world to acknowledge that there is no such thing as a gentle occupation— that occupation corrodes the humanity of the occupier and makes the occupied vulnerable to brutality.

And I think the need to have these things acknowledged— the need for historical affirmation— is so great on both sides that both the Israelis and the Palestinians would rather perish as peoples than give this need up. In fact, I think they both feel that they would perish as peoples precisely if they did. They would rather die than admit their own complicity in the present situation, because to make such an admission would be to acknowledge the suffering of the other and the legitimacy of the other’s complaint, and that might mean that they themselves were wrong, that they were evil, that they were bad. That might give the other an opening to annihilate or enslave them. That might make such behavior seem justifiable.

I wonder how many of us are stuck in a similar snare. I wonder how many of us are holding on very hard to some piece of personal history that is preventing us from moving on with our lives, and keeping us from those we love. I wonder how many of us cling so tenaciously to a version of a story of our lives in which we appear to be utterly blameless and innocent, that we become oblivious to the pain we have inflicted on others, no matter how unconsciously or inevitably or innocently we have inflicted it. I wonder how many of us are terrified of acknowledging the truth of our lives because we think it will expose us. How many of us stand paralyzed between the moon and the sun; frozen — unable to act in the moment — because of our terror of the past and because of the intractability of the present circumstances that past has wrought? Forgiveness, it has been said, means giving up our hopes for a better past. This may sound like a joke, but how many of us refuse to give up our version of the past, and so find it impossible to forgive ourselves or others, impossible to act in the present?

I don't have answers. In my childhood I was promised peace in the Middle East. I am still waiting. I wish I knew what was needed to get us there.

Pirkei Avot Chapter 5, Verse 21 implies that I am perhaps old enough to have some Wisdom, but am not yet old enough to give Counsel. The only wisdom I have obtained so far is that in most disagreements, people can disagree about the "facts", approach the situation with fundamentally different values, or both. I believe that to have any meaningful discussion on a topic as fraught as this one, common values must first be established. Only then can we approach reality side-by-side, examine our beliefs, find mutually trustworthy sources of information, and find agreement about the state of reality. When values are aligned and facts are agreed upon, we might have some hope of letting go of just enough bits of history to find a path through this mess.

Decades ago I was a child promised peace. Today I have children of my own. Today on both sides there are children suffering from the choices of their parents and grandparents. All the children deserve better.

There are people on both sides with genocide in their hearts. I don't yet know what to do about that, but we cannot let them win.

I wish that those who should be wise enough to provide counsel, particularly those with power, would get their acts together.

I wish for the peace and safety of all the innocents.

I wish for peace. In my lifetime. This year. Tomorrow. Or even today.

I hope that these words of Rabbi Alan Lew will reach just a few more people thanks to this post being on the internet. I hope they have touched you. Thank you for reading.

OpenTelemetry Tracing for Dropshot Nahum Shalman

I spoke at Oxide's dtrace.conf(24) about a project I've been hacking on for the past couple weeks:

Slides:

OpenTelemetry Tracing for Dropshot

Code:

[DRAFT - DO NOT MERGE] Basic OpenTelemetry integration by nshalman · Pull Request #1201 · oxidecomputer/dropshot

Thoughts on Static Code Analysis The Trouble with Tribbles...

I use a number of tools for static code analysis on my projects, which are primarily Java based. Mostly:

  1. codespell
  2. checkstyle
  3. shellcheck
  4. PMD
  5. SpotBugs

Wait, I hear you say. Spell checking? Absolutely, it's a key part of code and documentation quality. There's absolutely no excuse for shoddy spelling. And I sometimes find that if the spelling's off, it's a sign that concentration levels weren't what they should have been, and other errors might also have crept in.
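
It's also cheap to run; a typical invocation (the flags are standard codespell options, the paths illustrative) is just:

codespell -S '*.png,*.jar' src/ docs/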

checkstyle is far more than style, although it has very fixed ideas about that. I have a list of checks that must always pass (now I've cleaned them up, at any rate), so it's now at the point where it's just looking for regressions - the remaining things it complains about I'm happy to ignore (or the cost of fixing them massively outweighs any benefit).

One thing that checkstyle is keen on is thorough javadoc. Initially I might have been annoyed by some of its complaints, but then realised 2 things. First, it makes you consider whether a given API really should be public. And more generally as part of that, having to write javadoc can make you reevaluate the API you've designed, which pushes you towards improving it.

When it comes to shellcheck, I can summarise its approach as "quote all the things". Which is fine, until it isn't and you actually want to expand a variable into its constituent words.
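
The classic case is deliberate word-splitting, where the honest fix is to suppress the warning (SC2086) rather than add quotes:

flags="-l -a"
# shellcheck disable=SC2086
ls $flags       # intentionally unquoted: ls sees two separate options
# ls "$flags"   # what shellcheck suggests, but ls would get the single argument "-l -a"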

But even there, a big benefit again is that shellcheck makes you look at the code and think about what it's doing. Which leads to an important point - automatic fixing of reported problems will (apart from making mistakes) miss the benefit of code inspection.

Actual coding errors (or just imperfections) tend to be the domain of PMD and SpotBugs. I have a long list of exceptions for PMD, depending on each project. I'm writing applications for unix-like systems, and I really do want to write directly to stdout and stderr. If I want to shut the application down, then calling System.exit() really is the way to do it.

I've been using PMD for years, and it took a while to get the recent version 7 configured to my liking. But having run PMD against my code for so long means that a lot of the low hanging fruit had already been fixed (and early on my code was much much worse than it is now). I occasionally turn the exclusions off and see if I can improve my code, and occasionally win at this game, but it's a relatively hard slog.

So far, SpotBugs hasn't really added much. I find its output somewhat unhelpful (I do read the reports), but initial impressions are that it's finding things the other tools don't, so I need to work harder to make sense of it.

dtrace.conf(24) Oxide Computer Company Blog


Sometime in late 2007, we had the idea of a DTrace conference. Or really, more of a meetup; from the primordial e-mail I sent:

The goal here, by the way, is not a DTrace user group, but more of a face-to-face meeting with people actively involved in DTrace — either by porting it to another system, by integrating probes into higher level environments, by building higher-level tools on top of DTrace or by using it heavily and/or in a critical role. That said, we also don’t want to be exclusionary, so our thinking is that the only true requirement for attending is that everyone must be prepared to speak informally for 15 mins or so on what they are doing with DTrace, any limitations that they have encountered, and some ideas for the future. We’re thinking that this is going to be on the order of 15-30 people (though more would be a good problem to have — we’ll track it if necessary), that it will be one full day (breakfast in the morning through drinks into the evening), and that we’re going to host it here at our offices in San Francisco sometime in March 2008.

This same note also included some suggested names for the gathering, including what in hindsight seems a clear winner: DTrace Bi-Mon-Sci-Fi-Con. As if knowing that I should leave an explanatory note to my future self as to why this name was not selected, my past self fortunately clarified: "before everyone clamors for the obvious Bi-Mon-Sci-Fi-Con, you should know that most Millennials don’t (sadly) get the reference." (While I disagree with the judgement of my past self, it at least indicates that at some point I cared if anyone got the reference.)

We settled on a much more obscure reference, and had the first dtrace.conf in March 2008. Befitting the style of the time, it was an unconference (a term that may well have hit its apogee in 2008) that you signed up to attend by editing a wiki. More surprising given the year (and thanks entirely to attendee Ben Rockwood), it was recorded — though this is so long ago that I referred to it as video taping (and with none of the participants mic’d, I’m afraid the quality isn’t very good). The conference, however, was terrific, viz. the reports of Adam, Keith and Stephen (all somehow still online nearly two decades later). If anything, it was a little too good: we realized that we couldn’t recreate the magic, and we demurred on making it an annual event.

Years passed, and memories faded. By 2012, it felt like we wanted to get folks together again, now under a post-lawnmower corporate aegis in Joyent. The resulting dtrace.conf(12) was a success, and the Olympiad cadence felt like the right one; we did it again four years later at dtrace.conf(16).

In 2020, we came back together for a new adventure — and the DTrace Olympiad was not lost on Adam. Alas, dtrace.conf(20) — like the Olympics themselves — was cancelled, if implicitly. Unlike the Olympics, however, it was not to be rescheduled.

More years passed and DTrace continued to prove its utility at Oxide; last year when Adam and I did our "DTrace at 20" episode of Oxide and Friends, we vowed to hold dtrace.conf(24) — and a few months ago, we set our date to be December 11th.

At first we assumed we would do something similar to our earlier conferences: a one-day participant-run conference, at the Oxide office in Emeryville. But times have changed: thanks to the rise of remote work, technologists are much more dispersed — and many more people would need to travel for dtrace.conf(24) than in previous DTrace Olympiads. Travel hasn’t become any cheaper since 2008, and the cost (and inconvenience) was clearly going to limit attendance.

The dilemma for our small meetup highlights the changing dynamics in tech conferences in general: with talks all recorded and made publicly available after the conference, how does one justify attending a conference in person? There can be reasonable answers to that question, of course: it may be the hallway track, or the expo hall, or the after-hours socializing, or perhaps some other special conference experience. But it’s also not surprising that some conferences — especially ones really focused on technical content — have decided that they are better off doing as conference giant O’Reilly Media did, and going exclusively online. And without the need to feed and shelter participants, the logistics for running a conference become much more tenable — and the price point can be lowered to the point that even highly produced conferences like P99 CONF can be made freely available. This, in turn, leads to much greater attendance — and a network effect that can get back some of what one might lose going online. In particular, using chat as the hallway track can be much more effective (and is certainly more scalable!) than the actual physical hallways at a conference.

For conferences in general, there is a conversation to be had here (and as a teaser, Adam and I are going to talk about it with Stephen O’Grady and Theo Schlossnagle on Oxide and Friends next week), but for our quirky, one-day, Olympiad-cadence dtrace.conf, the decision was pretty easy: there was much more to be gained than lost by going exclusively on-line.

So dtrace.conf(24) is coming up next week, and it’s available to everyone. In terms of platform, we’re going to try to keep that pretty simple: we’re going to use Google Meet for the actual presenters, which we will stream in real-time to YouTube — and we’ll use the Oxide Discord for all chat. We’re hoping you’ll join us on December 11th — and if you want to talk about DTrace or a DTrace-adjacent topic, we’d love for you to present! Keeping to the unconference style, if you would like to present, please indicate your topic in the #session-topics Discord channel so we can get the agenda fleshed out.

While we’re excited to be online, there are some historical accoutrements of conferences that we didn’t want to give up. First, we have a tradition of t-shirts with dtrace.conf. Thanks to our designer Ben Leonard, we have a banger of a t-shirt, capturing the spirit of our original dtrace.conf(08) shirt but with an Oxide twist. It’s (obviously) harder to make those free but we have tried to price them reasonably. You can get your t-shirt by adding it to your (free) dtrace.conf ticket. (And for those who present at dtrace.conf, your shirt is on us — we’ll send you a coupon code!)

Second, for those who can make their way to the East Bay and want some hangout time, we are going to have an après conference social event at the Oxide office starting at 5p. We’re charging something nominal for that too (and like the t-shirt, you pay for that via your dtrace.conf ticket); we’ll have some food and drinks and an Oxide hardware tour for the curious — and (of course?) there will be Fishpong.

Much has changed since I sent that e-mail 17 years ago — but the shared values and disposition that brought together our small community continue to endure; we look forward to seeing everyone (virtually) at dtrace.conf(24)!

Advancing Cloud and HPC Convergence with Lawrence Livermore National Laboratory Oxide Computer Company Blog

Oxide Computer Company and Lawrence Livermore National Laboratory Work Together to Advance Cloud and HPC Convergence

Oxide Computer Company and Lawrence Livermore National Laboratory (LLNL) today announced a plan to bring on-premises cloud computing capabilities to the Livermore Computing (LC) high-performance computing (HPC) center. The rack-scale Oxide Cloud Computer allows LLNL to improve the efficiency of operational workloads and will provide users in the National Nuclear Security Administration (NNSA) with new capabilities for provisioning secure, virtualized services alongside HPC workloads.

HPC centers have traditionally run batch workloads for large-scale scientific simulations and other compute-heavy applications. HPC workloads do not exist in isolation—there are a multitude of persistent, operational services that keep the HPC center running. Meanwhile, HPC users also want to deploy cloud-like persistent services—databases, Jupyter notebooks, orchestration tools, Kubernetes clusters. Clouds have developed extensive APIs, security layers, and automation to enable these capabilities, but few options exist to deploy fully virtualized, automated cloud environments on-premises. The Oxide Cloud Computer allows organizations to deliver secure cloud computing capabilities within an on-premises environment.

On-premises environments are the next frontier for cloud computing. LLNL is tackling some of the hardest and most important problems in science and technology, requiring advanced hardware, software, and cloud capabilities. We are thrilled to be working with their exceptional team to help advance those efforts, delivering an integrated system that meets their rigorous requirements for performance, efficiency, and security.
— Steve Tuck, CEO at Oxide Computer Company

Leveraging the new Oxide Cloud Computer, LLNL will enable staff to provision virtual machines (VMs) and services via self-service APIs, improving operations and modernizing aspects of system management. In addition, LLNL will use the Oxide rack as a proving ground for secure multi-tenancy and for smooth integration with the LLNL-developed Flux resource manager. LLNL plans to bring its users cloud-like Infrastructure-as-a-Service (IaaS) capabilities that work seamlessly with their HPC jobs, while maintaining security and isolation from other users. Beyond LLNL personnel, researchers at the Los Alamos National Laboratory and Sandia National Laboratories will also partner in many of the activities on the Oxide Cloud Computer.

We look forward to working with Oxide to integrate this machine within our HPC center. Oxide’s Cloud Computer will allow us to securely support new types of workloads for users, and it will be a proving ground for introducing cloud-like features to operational processes and user workflows. We expect Oxide’s open-source software stack and their transparent and open approach to development to help us work closely together.

— Todd Gamblin, Distinguished Member of Technical Staff at LLNL

Sandia is excited to explore the Oxide platform as we work to integrate on-premises cloud technologies into our HPC environment. This advancement has the potential to enable new classes of interactive and on-demand modeling and simulation capabilities.

— Kevin Pedretti, Distinguished Member of Technical Staff at Sandia National Laboratories

LLNL plans to work with Oxide on additional capabilities, including the deployment of additional Cloud Computers in its environment. Of particular interest are scale-out capabilities and disaster recovery. The latest installation underscores Oxide Computer’s momentum in the federal technology ecosystem, providing reliable, state-of-the-art Cloud Computers to support critical IT infrastructure.

To learn more about Oxide Computer, visit https://oxide.computer.

About Oxide Computer

Oxide Computer Company is the creator of the world’s first commercial Cloud Computer, a true rack-scale system with fully unified hardware and software, purpose-built to deliver hyperscale cloud computing to on-premises data centers. With Oxide, organizations can fully realize the economic and operational benefits of cloud ownership, with access to the same self-service development experience of public cloud, without the public cloud cost. Oxide empowers developers to build, run, and operate any application with enhanced security, latency, and control, and frees organizations to elevate IT operations to accelerate strategic initiatives. To learn more about Oxide’s Cloud Computer, visit oxide.computer.

About LLNL

Founded in 1952, Lawrence Livermore National Laboratory provides solutions to our nation’s most important national security challenges through innovative science, engineering, and technology. Lawrence Livermore National Laboratory is managed by Lawrence Livermore National Security, LLC for the U.S. Department of Energy’s National Nuclear Security Administration.

Media Contact

LaunchSquad for Oxide Computer oxide@launchsquad.com

Remembering Charles Beeler Oxide Computer Company Blog

We are heartbroken to relay that Charles Beeler, a friend and early investor in Oxide, passed away in September after a battle with cancer. We lost Charles far too soon; he had a tremendous influence on the careers of us both.

Our relationship with Charles dates back nearly two decades, to his involvement with the ACM Queue board where he met Bryan. It was unprecedented to have a venture capitalist serve in this capacity with ACM, and Charles brought an entirely different perspective on the practitioner content. A computer science pioneer who also served on the board took Bryan aside at one point: "Charles is one of the good ones, you know."

When Bryan joined Joyent a few years later, Charles also got to know Steve well. Seeing the promise in both node.js and cloud computing, Charles became an investor in the company. When companies hit challenging times, some investors will hide — but Charles was the kind of investor to figure out how to fix what was broken. When Joyent needed a change in executive leadership, it was Charles who not only had the tough conversations, but led the search for the leader the company needed, ultimately positioning the company for success.

Aside from his investment in Joyent, Charles was an outspoken proponent of node.js, becoming an organizer of the Node Summit conference. In 2017, he asked Bryan to deliver the conference’s keynote, but by then, the relationship between Joyent and node.js had become… complicated, and Bryan felt that it probably wouldn’t be a good idea. Any rational person would have dropped it, but Charles persisted, with characteristic zeal: if the Joyent relationship with node.js had become strained, so much more the reason to speak candidly about it! Charles prevailed, and the resulting talk, Platform as Reflection of Values, became one of Bryan’s most personally meaningful talks.

Charles’s persistence was emblematic: he worked behind the scenes to encourage people to do their best work, always with an enthusiasm for the innovators and the creators. As we were contemplating Oxide, we told Charles what we wanted to do long before we had a company. Charles laughed with delight: "I hoped that you two would do something big, and I am just so happy for you that you’re doing something so ambitious!"

As we raised seed capital, we knew that we were likely a poor fit for Charles and his fund. But we also knew that we deeply appreciated his wisdom and enthusiasm; we couldn’t resist pitching him on Oxide. Charles approached the investment in Oxide as he did with so many other aspects: with curiosity, diligence, empathy, and candor. He was direct with us that despite his enthusiasm for us personally, Oxide would be a challenging investment for his firm. But he also worked with us to address specific objections, and ultimately he won over his partnership. We were thrilled when he not only invested, but pulled together a syndicate of like-minded technologists and entrepreneurs to join him.

Ever since, he has been a huge Oxide fan. Befitting his enthusiasm, one of his final posts expressed his enthusiasm and pride in what the Oxide team has built.

Charles, thank you. You told us you were proud of us — and it meant the world. We are gutted to no longer have you with us; your influence lives on not just in Oxide, but also in the many people that you have inspired. You were the best of venture capital. Closer to the heart, you were a terrific friend to us both; thank you.

Debugging an OpenJDK crash on SPARC The Trouble with Tribbles...

I had to spend a little time recently fixing a crash in OpenJDK on Solaris SPARC.

What we're seeing is, from the hs_err file:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0xffffffff57c745a8, pid=18442, tid=37
...
# Problematic frame:
# V  [libjvm.so+0x7745a8]  G1CollectedHeap::allocate_new_tlab(unsigned long, unsigned long, unsigned long*)+0xb8

Well, that's odd. I only see this on SPARC: I'd seen it sporadically on Tribblix during the process of continually building OpenJDK, but hadn't seen it on Solaris - until a customer hit it in production, which is rather a painful place to find a reproducer.

In terms of source, this is located in the file src/hotspot/share/gc/g1/g1CollectedHeap.cpp (all future source references will be relative to that directory), and looks like:

HeapWord* G1CollectedHeap::allocate_new_tlab(size_t min_size,
                                             size_t requested_size,
                                             size_t* actual_size) {
  assert_heap_not_locked_and_not_at_safepoint();
  assert(!is_humongous(requested_size), "we do not allow humongous TLABs");

  return attempt_allocation(min_size, requested_size, actual_size);
}

That's incredibly simple. There's not much that can go wrong there, is there?

The complexity here is that a whole load of functions get inlined. So what does it call? You find yourself in a twisty maze of passages, all alike. But anyway, the next one down is

inline HeapWord* G1CollectedHeap::attempt_allocation(size_t min_word_size,
                                                     size_t desired_word_size,
                                                     size_t* actual_word_size) {
  assert_heap_not_locked_and_not_at_safepoint();
  assert(!is_humongous(desired_word_size), "attempt_allocation() should not "
         "be called for humongous allocation requests");

  HeapWord* result = _allocator->attempt_allocation(min_word_size, desired_word_size, actual_word_size);

  if (result == NULL) {
    *actual_word_size = desired_word_size;
    result = attempt_allocation_slow(desired_word_size);
  }

  assert_heap_not_locked();
  if (result != NULL) {
    assert(*actual_word_size != 0, "Actual size must have been set here");
    dirty_young_block(result, *actual_word_size);
  } else {
    *actual_word_size = 0;
  }

  return result;
}

That then calls an inlined G1Allocator::attempt_allocation() in g1Allocator.hpp. That calls current_node_index(), which looks safe and then there are a couple of calls to mutator_alloc_region()->attempt_retained_allocation() and mutator_alloc_region()->attempt_allocation(), which come from g1AllocRegion.inline.hpp and both ultimately call a local par_allocate(), which then calls par_allocate_impl() or par_allocate() in heapRegion.inline.hpp.
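
Condensed, the chain we just walked (with locations as given above) is:

G1CollectedHeap::allocate_new_tlab()                 g1CollectedHeap.cpp
  -> G1CollectedHeap::attempt_allocation()           (inlined)
    -> G1Allocator::attempt_allocation()             g1Allocator.hpp
      -> G1AllocRegion::attempt_retained_allocation()
         / attempt_allocation()                      g1AllocRegion.inline.hpp
        -> par_allocate()
          -> par_allocate_impl() / par_allocate()    heapRegion.inline.hpp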

Now, mostly all these are doing is calling something else. The one really complex piece of code is in par_allocate_impl() which contains

...
  do {
    HeapWord* obj = top();
    size_t available = pointer_delta(end(), obj);
    size_t want_to_allocate = MIN2(available, desired_word_size);
    if (want_to_allocate >= min_word_size) {
      HeapWord* new_top = obj + want_to_allocate;
      HeapWord* result = Atomic::cmpxchg(&_top, obj, new_top);
      // result can be one of two:
      //  the old top value: the exchange succeeded
      //  otherwise: the new value of the top is returned.
      if (result == obj) {
        assert(is_object_aligned(obj) && is_object_aligned(new_top), "checking alignment");
        *actual_size = want_to_allocate;
        return obj;
      }
    } else {
      return NULL;
    }
  } while (true);
}

Right, let's go back to the crash. We can open up the core file in
mdb, and look at the stack with $C

ffffffff7f39d751 libjvm.so`_ZN7VMError14report_and_dieEP6ThreadjPhPvS3_+0x3c(
    101cbb1d0?, b?, fffffffcb45dea7c?, ffffffff7f39ecb0?, ffffffff7f39e9a0?, 0?)
ffffffff7f39d811 libjvm.so`JVM_handle_solaris_signal+0x1d4(b?,
    ffffffff7f39ecb0?, ffffffff7f39e9a0?, 0?, ffffffff7f39e178?, 101cbb1d0?)
ffffffff7f39dde1 libjvm.so`_ZL17javaSignalHandleriP7siginfoPv+0x20(b?,
    ffffffff7f39ecb0?, ffffffff7f39e9a0?, 0?, 0?, ffffffff7e7dd370?)
ffffffff7f39de91 libc.so.1`__sighndlr+0xc(b?, ffffffff7f39ecb0?,
    ffffffff7f39e9a0?, fffffffcb4b38afc?, 0?, ffffffff7f20c7e8?)
ffffffff7f39df41 libc.so.1`call_user_handler+0x400((int) -1?,
    (siginfo_t *) 0xffffffff7f39ecb0?, (ucontext_t *) 0xc?)
ffffffff7f39e031 libc.so.1`sigacthandler+0xa0((int) 11?,
    (siginfo_t *) 0xffffffff7f39ecb0?, (void *) 0xffffffff7f39e9a0?)
ffffffff7f39e5b1 libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm+0xb8(
    10013d030?, 100?, 520?, ffffffff7f39f000?, 0?, 0?)

What you see here is allocate_new_tlab() at the bottom; it throws a signal, the signal handler catches it and passes it ultimately to JVM_handle_solaris_signal(), which bails, and the JVM exits.

We can look at the signal. It's at address 0xffffffff7f39ecb0 and is of type siginfo_t, so we can just print it

java:core> ffffffff7f39ecb0::print -t siginfo_t

and we first see

siginfo_t {
    int si_signo = 0t11 (0xb)
    int si_code = 1
    int si_errno = 0
...

OK, the signal was indeed 11 = SIGSEGV. The interesting thing is the si_code of 1, which is defined as

#define SEGV_MAPERR     1       /* address not mapped to object */

Ah. Now, in the jvm you actually see a lot of SIGSEGV, but a lot of them are handled by that mysterious JVM_handle_solaris_signal(). In particular, it'll handle anything with SEGV_ACCERR which is basically something running off the end of an array.

Further down, you can see the fault address

struct  __fault = {
            void *__addr = 0x10
            int __trapno = 0
            caddr_t __pc = 0
            int __adivers = 0
        }

So, we're faulting on address 0x10. Yes, you try messing around down there and you will fault.


That confirms the crash is a SEGV. What are we actually trying to do? We can disassemble the allocate_new_tlab() function and see what's happening - remember the crash was at offset 0xb8

java:core> libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm::dis
...
 libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm+0xb8:        ldx       [%i4 + 0x10], %i5

That's interesting, 0x10 was the fault address. What's %i4 then?

java:core> ::regs
%i4 = 0x0000000000000000

Yep. Given that, the ldx is reading from address 0x0 + 0x10 = 0x10, giving the SEGV we see.

There's a little more context around that call site. A slightly
expanded view is

 libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm+0xa0:        nop
 libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm+0xa4:        add       %i5, %g1, %g1
 libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm+0xa8:        casx      [%g3], %i5, %g1
 libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm+0xac:        cmp       %i5, %g1
 libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm+0xb0:        be,pn     %xcc, +0x160  <libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm+0x210>
 libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm+0xb4:        nop
 libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm+0xb8:        ldx       [%i4 + 0x10], %i5

Now, the interesting thing here is the casx (compare and swap) instruction. That lines up with the Atomic::cmpxchg() in par_allocate_impl() that we were suspecting above. So the crash is somewhere around there.

It turns out there's another way to approach this. If we compile without optimization then effectively we turn off the inlining. The way to do this is to add an entry to the jvm Makefile via make/hotspot/lib/JvmOverrideFiles.gmk

...
else ifeq ($(call isTargetOs, solaris), true)
    ifeq ($(call isTargetCpuArch, sparc), true)
      # ptribble port tweaks
      BUILD_LIBJVM_g1CollectedHeap.cpp_CXXFLAGS += -O0
    endif
endif

If we rebuild (having touched all the files in the directory to force
make to rebuild everything correctly) and run again, we get the full
call stack.

Now the crash is

# V  [libjvm.so+0x80cc48]  HeapRegion::top() const+0xc

which we can expand to the following stack leading up to where it goes
into the signal handler:

ffffffff7f39dff1 libjvm.so`_ZNK10HeapRegion3topEv+0xc(0?, ffffffff7f39ef40?,
    101583e38?, ffffffff7f39f020?, fffffffa46de8038?, 10000?)
ffffffff7f39e0a1 libjvm.so`_ZN10HeapRegion17par_allocate_implEmmPm+0x18(0?,
    100?, 10000?, ffffffff7f39ef60?, ffffffff7f39ef40?, 8f00?)
ffffffff7f39e181 libjvm.so`_ZN10HeapRegion27par_allocate_no_bot_updatesEmmPm+0x24(0?, 100?,
    10000?, ffffffff7f39ef60?, 566c?, 200031?)
ffffffff7f39e231 libjvm.so`_ZN13G1AllocRegion12par_allocateEP10HeapRegionmmPm+0x44(100145440?,
    0?, 100?, 10000?, ffffffff7f39ef60?, 0?)
ffffffff7f39e2e1 libjvm.so`_ZN13G1AllocRegion18attempt_allocationEmmPm+0x48(
    100145440?, 100?, 10000?, ffffffff7f39ef60?, 3?, fffffffa46ceff48?)
ffffffff7f39e3a1 libjvm.so`_ZN11G1Allocator18attempt_allocationEmmPm+0xa4(
    1001453b0?, 100?, 10000?, ffffffff7f39ef60?, 7c0007410?, ffffffff7f39ea41?)
ffffffff7f39e461 libjvm.so`_ZN15G1CollectedHeap18attempt_allocationEmmPm+0x2c(
    10013d030?, 100?, 10000?, ffffffff7f39ef60?, 7c01b15e8?, 0?)
ffffffff7f39e521 libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm+0x24(
    10013d030?, 100?, 10000?, ffffffff7f39ef60?, 0?, 0?)

So yes, this confirms that we are indeed in par_allocate_impl() and
it's crashing on the very first line of the code segment I showed
above, where it calls top(). All top() does is return the _top member
of a HeapRegion.

So the only thing that can happen here is that the HeapRegion itself
is NULL. Then the _top member is presumably at offset 0x10, and trying
to access it gives the SIGSEGV.

Now, in G1AllocRegion::attempt_allocation() there's an assert:

  HeapRegion* alloc_region = _alloc_region;
  assert_alloc_region(alloc_region != NULL, "not initialized properly");

However, asserts aren't compiled into production builds.

But the fix here is to fail gracefully if we've got NULL and let the
caller retry. There are a lot of callers in this path, and the general
convention is to return NULL if anything goes wrong, so I do the same
for this extra failure case, adding the following just after the
assert:

  if (alloc_region == NULL) {
    return NULL;
  }
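
In context, the guarded fast path then looks something like this - a
sketch based on my reading of the OpenJDK source of that vintage, so
the surrounding details may differ slightly (the real method also does
some tracing I've omitted):

  inline HeapWord* G1AllocRegion::attempt_allocation(size_t min_word_size,
                                                     size_t desired_word_size,
                                                     size_t* actual_word_size) {
    HeapRegion* alloc_region = _alloc_region;
    assert_alloc_region(alloc_region != NULL, "not initialized properly");

    // The assert above vanishes in product builds, so guard explicitly;
    // returning NULL sends the caller off to retry via the slow path.
    if (alloc_region == NULL) {
      return NULL;
    }

    // Fast path: try a parallel bump allocation in the active region.
    return par_allocate(alloc_region, min_word_size,
                        desired_word_size, actual_word_size);
  }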

With that, no more of those pesky crashes. (There might be others
lurking elsewhere, of course.)

Of course, what this doesn't explain is why the HeapRegion wasn't
correctly initialized in the first place. But that's another problem
entirely.

How Oxide Cuts Data Center Power Consumption in Half Oxide Computer Company Blog

Here’s a sobering thought: today, data centers already consume 1-2% of the world’s power, and that percentage will likely rise to 3-4% by the end of the decade. According to Goldman Sachs research, that rise will include a doubling in data center carbon dioxide emissions. As the data and AI boom progresses, this thirst for power shows no signs of slowing down anytime soon. Two key challenges quickly become evident for the 85% of IT that currently lives on-premises.

  1. How can organizations reduce power consumption and corresponding carbon emissions?

  2. How can organizations keep pace with AI innovation as existing data centers run out of available power?

[Graph: AI and data center growth boosting electricity demand. Sources: Masanet et al. (2020), Cisco, IEA, Goldman Sachs Research]

Rack-scale design is critical to improved data center efficiency

Traditional data center IT consumes so much power because the fundamental unit of compute is the individual server: it’s like a house whose rooms were built one at a time, each room with its own AC unit, gas furnace, and electrical panel. Individual rackmount servers are stacked together, each with its own AC power supplies, cooling fans, and power management. They are then paired with storage appliances and network switches that communicate at arm’s length, never designed as a cohesive whole. This approach fundamentally limits organizations’ ability to maintain sustainable, high-efficiency computing systems.

Of course, hyperscale public cloud providers did not design their data center systems this way. Instead, they operate like a carefully planned smart home where everything is designed to work together cohesively and is operated by software that understands the home’s systems end-to-end. High-efficiency, rack-scale computers are deployed at scale and operate as a single unit with integrated storage and networking to support elastic cloud computing services. This modern architecture is made available to the market as public cloud, but that rental-only model is ill-fit for many business needs.

[Illustration: Oxide racks at twice the density of conventional ones]

Compared with a popular rackmount server vendor’s equivalent, Oxide fills its specialized racks with 32 AMD Milan sleds and highly-available network switches while drawing less than 15kW per rack, doubling the compute density found in a typical data center. The alternative rack, with just 16 1U servers and equivalent network switches, requires over 16kW of power yet delivers only 1,024 CPU cores versus Oxide’s 2,048.
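
Taken at face value, those numbers work out to roughly 2,048 / 15 ≈ 137 cores per kW for the Oxide rack versus 1,024 / 16 = 64 cores per kW for the legacy build-out: a little more than twice the useful compute per watt.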

Extracting more useful compute from each kW of power and square foot of data center space is key to the future effectiveness of on-premises computing.

At Oxide, we’ve taken this lesson in advancing rack-scale design, improved upon it in several ways, and made it available for every organization to purchase and operate anywhere in the world without a tether back to the public cloud. Our Cloud Computer treats the entire rack as a single, unified computer rather than a collection of independent parts, achieving unprecedented power efficiency.

By designing the hardware and software together, we’ve eliminated unnecessary components and optimized every aspect of system operation through a control plane with visibility to end-to-end operations.

When we started Oxide, the DC bus bar stood as one of the most glaring differences between the rack-scale machines at the hyperscalers and the rack-and-stack servers that the rest of the market was stuck with. That a relatively simple piece of copper was unavailable to commercial buyers — despite being unequivocally the right way to build it! — represented everything wrong with the legacy approach.

The bus bar in the Oxide Cloud Computer is not merely more efficient, it is a concrete embodiment of the tremendous gains from designing at rack-scale, and by integrating hardware with software.

— Bryan Cantrill

The improvements we’re seeing are rooted in technical innovation

  • Replacing low-efficiency AC power supplies with a high-efficiency DC Bus Bar
    Power conversion happens once: AC power is fed from the data center into the Oxide universal power shelf, managed by a customized power shelf controller (PSC), and the shelf distributes DC power up and down the rack via a bus bar. This eliminates the 70 AC power supplies found in an equivalent legacy server rack (32 servers, two top-of-rack switches, and one out-of-band switch, each with two AC power supplies). The power shelf also ensures the load is balanced across phases, something that’s impossible with the traditional power distribution units found in legacy server racks.

  • Bigger fans = bigger efficiency gains
    Oxide server sleds use a custom form factor that accommodates much larger fans than legacy servers typically carry. These fans move more air, more efficiently, cooling the systems with 12x less energy than legacy servers, each of which contains as many as seven smaller fans that must work far harder to push air across system components.

  • Purpose-built for power efficiency
    Oxide server sleds have less restrictive airflow than legacy servers by eliminating extraneous components like PCIe risers, storage backplanes, and more. Legacy servers need many optional components like these because they could be used for any number of tasks, such as point-of-sale systems, data center servers, or network-attached-storage (NAS) systems. Still, they were never designed optimally for any one of those tasks. The Oxide Cloud Computer was designed from the ground up to be a rack-scale cloud computing powerhouse, and so it’s optimized for exactly that task.

  • Hardware + Software designed together
    The Oxide Cloud Computer includes a robust cloud control plane with deep observability to the full system. By designing the hardware and software together, we can make hardware choices like more intelligent DC-DC power converters that can provide rich telemetry to our control plane, enabling future feature enhancements such as dynamic power capping and efficiency-based workload placement that are impossible with legacy servers and software systems.

Learn more about Oxide’s intelligent Power Shelf Controller

The Bottom Line: Customers and the Environment Both Benefit

Reducing data center power demands and achieving more useful computing per kilowatt requires fundamentally rethinking traditional data center utilization and compute design. At Oxide, we’ve proven that dramatic efficiency gains are possible when you rethink the computer at rack-scale with hardware and software designed thoughtfully and rigorously together.

Ready to learn how your organization can achieve these results? Schedule time with our team here.

Together, we can reclaim on-premises computing efficiency to achieve both business and sustainability goals.

OmniOS Community Edition r151052 OmniOS Community Edition

OmniOSce v11 r151052 is out!

On the 4th of November 2024, the OmniOSce Association released a new stable version of OmniOS - The Open Source Enterprise Server OS. The release comes with many tool updates, brand-new features and additional hardware support. For details see the release notes.

Note that r151048 is now end-of-life. You should upgrade to r151050 or r151052 to stay on a supported track.

OmniOS is fully Open Source and free. Nevertheless, it takes a lot of time and money to keep maintaining a full-blown operating system distribution. Our statistics show that there are almost 2’000 active installations of OmniOS while fewer than 20 people send regular contributions. If your organisation uses OmniOS based servers, please consider becoming a regular patron or taking out a support contract.


Any problems or questions, please get in touch.

Reflections on Founder Mode The Observation Deck

Paul Graham’s Founder Mode is an important piece, and you should read it if for no other reason than that “founder mode” will surely enter the lexicon (and as Graham grimly predicts: “as soon as the concept of founder mode becomes established, people will start misusing it”). When building a company, founders are engaged in several different acts at once: raising capital; building a product; connecting that product to a market; building an organization to do all of these. Founders make lots of mistakes in all of these activities, and Graham’s essay highlights a particular kind of mistake in which founders are overly deferential to expertise or convention. Pejoratively referring to this as “Management Mode”, Graham frames this in the Silicon Valley dramaturgical dyad of Steve Jobs and John Sculley. While that’s a little too reductive (anyone seeking to understand Jobs needs to read Randall Stross’s superlative Steve Jobs and the NeXT Big Thing, highlighting Jobs’s many post-Sculley failures at NeXT), Graham has identified a real issue here, albeit without much specificity.

For a treatment of the same themes but with much more supporting detail, one should read the (decade-old) piece from Tim O’Reilly, How I failed. (Speaking personally, O’Reilly’s piece had a profound influence on me, as it encouraged me to stand my ground on an issue on which I had my own beliefs but was being told to defer to convention.) But as terrific as it is, O’Reilly’s piece also doesn’t answer the question that Graham poses: how do founders prevent their companies from losing their way?

Graham says that founder mode is a complete mystery (“There are as far as I know no books specifically about founder mode”), and while there is a danger in being too pat or prescriptive, there does seem to be a clear component for keeping companies true to themselves: the written word. That is, a writing- (and reading-!) intensive company culture does, in fact, allow for scaling the kind of responsibility that Graham thinks of as founder mode. At Oxide, our writing-intensive culture has been absolutely essential: our RFD process is the backbone of Oxide, and has given us the structure to formalize, share, and refine our thinking. First among this formalized thinking – and captured in our first real RFD – is RFD 2 Mission, Principles, and Values. Immediately behind that (and frankly, the most important process for any company) is RFD 3 Oxide Hiring Process. These first three RFDs – on the process itself, on what we value, and on how we hire – were written in the earliest days of the company, and they have proven essential to scale the company: they are the foundation upon which we attract people who share our values.

While the shared values have proven necessary, they haven’t been sufficient to eliminate the kind of quandaries that Graham and O’Reilly describe. For example, there have been some who have told us that we can’t possibly hire non-engineering roles using our hiring process – or told us that our approach to compensation can’t possibly work. To the degree that we have had a need for Graham’s founder mode, it has been in those moments: to stay true to the course we have set for the company. But because we have written down so much, there is less occasion for this than one might think. And when it does occur – when there is a need for further elucidation or clarification – the artifact is not infrequently a new RFD that formalizes our newly extended thinking. (RFD 68 is an early public and concrete example of this; RFD 508 is a much more recent one that garnered some attention.)

Most importantly, because we have used our values as a clear lens for hiring, we are able to assure that everyone at Oxide is able to have the same disposition with respect to responsibility – and this (coupled with the transparency that the written word allows) permits us to trust one another. As I elucidated in Things I Learned The Hard Way, the most important quality in a leader is to bind a team with mutual trust: with it, all things are possible – and without it, even easy things can be debilitatingly difficult. Graham mentions trust, but he doesn’t give it its due. Too often, founders focus on the immediacy of a current challenge without realizing that they are, in fact, undermining trust with their approach. Bluntly, founders are at grave risk of misinterpreting Graham’s “Founders Mode” to be a license to micromanage their teams, descending into the kind of manic seagull management that inhibits a team rather than empowering it.

Founders seeking to internalize Graham’s advice should recast it by asking themselves how they can foster mutual trust – and how they can build the systems that allow trust to be strengthened even as the team expands. For us at Oxide, writing is the foundation upon which we build that trust. Others may land on different mechanisms, but the goal of founders should be the same: build the trust that allows a team to kick a Jobsian dent in the universe!


KORH Minimum Sector Altitude Gotcha Josef "Jeff" Sipek

I’ve had this draft around for over 5 years—since January 2019. Since I still think the underlying observation is interesting, I’m publishing it now.

In late December (2018), I was preparing for my next instrument rating lesson which was going to involve a couple of ILS approaches at Worcester, MA (KORH). While looking over the ILS approach to runway 29, I noticed something about the minimum sector altitude that surprised me.

Normally, I expect an MSA to be centered on or near the airport serving the approach. For conventional (i.e., non-RNAV) approaches, the center tends to be the main navaid used during the approach. At Worcester, however, the 25 nautical mile MSA is centered on the Gardner VOR, which is 19 nm away.

I plotted the MSA boundary on the approach chart to visualize it better:

[Chart: the ILS runway 29 approach plate with the 25 nm MSA circle around the Gardner VOR overlaid]

It is easy to glance at the chart, see 3300 most of the way around, and not realize that when flying in the vicinity of the airport we are near the edge of the MSA. GRIPE, the missed approach holding fix, is half a mile outside of the MSA. (Following the missed approach procedure will leave plenty of margin, of course, so this isn’t really that relevant.)

What's a decent password length? The Trouble with Tribbles...

What's a decent length for a password?

I think it's pretty much agreed by now that longer passwords are, in general, better. And fortunately stupid complexity requirements are on the way out.

Reading the NIST password rules gives the following:

  • User chosen passwords must be at least 8 characters
  • Machine chosen passwords must be at least 6 characters
  • You must allow passwords to be at least 64 characters

Say what? A 6 character password is secure?

Initially, that seems way off, but it depends on your threat model. If you have a mechanism to block the really bad, commonly used passwords, then 6 random characters gives you a couple of billion choices. Not many, but you should also be implementing technical measures such as rate limiting.

With that, if the only attack vector is brute force over the network, trying billions of passwords is simply impractical. Even with just passive rate limiting (limited by CPU power and network latency) an attacker will struggle; with active limiting they'll be trying for decades.

That's with just 6 random characters. Go to 8 and you're out of sight. And for this attack vector, no quantum computing developments will make any difference whatsoever.
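
To put rough numbers on that: six random characters from a 36-character lower-case alphanumeric set give 36^6, around 2.2 billion, possibilities, so at one guess per second (a rate active limiting can easily enforce) exhausting the space takes about 70 years. Eight random characters from a 62-character mixed-case set give 62^8, around 2 x 10^14, which no plausible network-facing guess rate will cover.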

But what if the user database itself is compromised?

Of course, if the passwords are in cleartext then no amount of fancy rules or length requirements is going to help you at all.

But if an attacker gets the hashed passwords then they can brute force them offline, many orders of magnitude faster, or use rainbow tables. And that's a whole different threat model.

Realistically, protecting against brute force or rainbow table attacks probably needs a 16 character password (or passphrase), and that requirement could get longer over time.
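
As a rough illustration of the scale involved: a random 16 character password drawn from a 62-character set gives 62^16, around 5 x 10^28, possibilities, so even offline hardware testing a trillion candidates per second against a fast, unsalted hash would need on the order of a billion years to sweep the space.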

A corollary to this is that there isn't actually much to be gained by requiring password lengths between 8 and 16 characters.

In illumos, the default minimum password length is 6 characters. I recently increased the default in Tribblix to 8, which aligns with the minimum that NIST gives for user-chosen passwords.

OmniOS Community Edition r151050 OmniOS Community Edition

OmniOSce v11 r151050 is out!

On the 6th of May 2024, the OmniOSce Association released a new stable version of OmniOS - The Open Source Enterprise Server OS. The release comes with many tool updates, brand-new features and additional hardware support. For details see the release notes.

Note that r151038 is now end-of-life. You should upgrade to r151046 or r151050 to stay on a supported track. r151046 is an LTS release with support until May 2026, and r151050 is a stable release with support until May 2025.

For anyone who tracks LTS releases, the previous LTS - r151038 - is now end-of-life. You should upgrade to r151046 for continued LTS support.

OmniOS is fully Open Source and free. Nevertheless, it takes a lot of time and money to keep maintaining a full-blown operating system distribution. Our statistics show that there are almost 2’000 active installations of OmniOS while fewer than 20 people send regular contributions. If your organisation uses OmniOS based servers, please consider becoming a regular patron or taking out a support contract.


Any problems or questions, please get in touch.