Tailscale for SunOS in 2025 Nahum Shalman

Happy New Year! The wireguard-go port is still sitting around in my fork. I don't know when I will have the energy for the next attempt to get it upstream. In the meantime, I've made some fun progress on the Tailscale side.

Taildrive

The Tailscale folks have shipped Taildrive (currently in alpha) and it's pretty neat. Naturally those of us using Tailscale on illumos wanted to try it out. There was nothing needed directly to get it working, but we had an indirect problem. The tailscale binary communicates with the tailscaled daemon over a unix socket, and the Tailscale folks had added some basic unix based authentication / authorization abstracted in their peercred library. That library needed support added for getpeerucred which meant I had to wire things up all the way down in x/sys/unix before then getting it into peercred. But with that work done, Taildrive now works! I tagged a release with that enabled if you're in a rush to play with it.

Using userspace-networking

Tailscale has a way to run without creating a TUN device. It means that client software on the machine can't connect directly to IPs on the Tailnet (though there is a SOCKS proxy you can use) but tailscaled can still do lots of other server-y things (including Taildrive!). That's how Tailscale has been supporting AIX. Which led me to a strange realization: Tailscale had better in-tree support for AIX than it did for illumos and Solaris. No more! We are now on par with AIX in the official tree!

What's next

I don't know if the Tailscale folks intend to ship binaries for us from their tree, but after their next release it should be possible to build illumos binaries from their tree that you could use to serve up a ZFS filesystem with Taildrive to your tailnet using the userspace-networking driver.

I will of course also rebase my TUN driver patches and tag a release as well.

Are you running Tailscale on illumos or Solaris? Let me know on Bluesky or Mastodon.

Tragedy Older Than Me Nahum Shalman

In July of 2021, in anticipation of the upcoming High Holy Days I purchased a copy of This Is Real and You Are Completely Unprepared: The Days of Awe as a Journey of Transformation by Rabbi Alan Lew, published in 2003. I was in fact completely unprepared to even read it. It sat in or on my nightstand for more than three years. I finally started reading it during the high holidays in October of 2024 (a few days before the one year anniversary of the events of October 7, 2023).

When I read the section excerpted here, I had to immediately flip back to check the publication date. 2003. In so many ways the ongoing tragedy today is in the same place it was over 20 years ago. Rabbi Lew died in 2009. We cannot ask him what he thinks of the world today, but in many ways there is no need. Little has changed. So, to emphasize one more time, let's go back to 2003:

I think that the great philosopher George Santayana got it exactly wrong. I think it is precisely those who insist on remembering history who are doomed to repeat it. For a subject with so little substance, for something that is really little more than a set of intellectual interpretations, history can become a formidable trap— a sticky snare from which we may find it impossible to extricate ourselves. I find it impossible to read the texts of Tisha B’Av, with their great themes of exile and return, and their endless sense of longing for the land of Israel, without thinking of the current political tragedy in the Middle East. I write this at a very dark moment in the long and bleak history of that conflict. Who knows what will be happening there when you read this? But I think it’s a safe bet that whenever you do, one thing is unlikely to have changed. There will likely be a tremendous compulsion for historical vindication on both sides. Very often, I think it is precisely the impossible yearning for historical justification that makes resolution of this conflict seem so impossible. The Jews want vindication for the Holocaust, and for the two thousand years of European persecution and ostracism that preceded it; the Jews want the same Europeans who now give them moral lectures to acknowledge that this entire situation would never have come about if not for two thousand years of European bigotry, barbarism, and xenophobia. They want the world to acknowledge that Israel was attacked first, in 1948, in 1967, in 1973, and in each of the recent Intifadas. They want acknowledgment that they only took the lands from which they were attacked during these conflicts, and offered to return them on one and only one condition— the acknowledgment of their right to exist. When Anwar Sadat met that condition, the Sinai Peninsula, with its rich oil fields and burgeoning settlement towns, was returned to him. 
And they want acknowledgment that there are many in the Palestinian camp who truly wish to destroy them, who have used the language of peace as a ploy to buy time until they have the capacity to liquidate Israel and the Jews once and for all. They want acknowledgment that they have suffered immensely from terrorism, that a people who lost six million innocents scarcely seventy years ago should not have had to endure the murder of its innocent men, women, and children so soon again. And they want acknowledgment that in spite of all this, they stood at Camp David prepared to offer the Palestinians everything they claimed to have wanted— full statehood, a capital in East Jerusalem— and the response of the Palestinians was the second Intifada, a murderous campaign of terror and suicide bombings.

And the Palestinians? They would like the world to acknowledge that they lived in the land now called Israel for centuries, that they planted olive trees, shepherded flocks, and raised families there for hundreds of years; they would like the world to acknowledge that when they look up from their blue-roofed villages, their trees and their flowers, their fields and their flocks, they see the horrific, uninvited monolith of western culture— immense apartment complexes, shopping centers, and industrial plants on the once-bare and rocky hills where the voice of God could be heard and where Muhammad ascended to heaven. And they would like the world to acknowledge that it was essentially a European problem that was plopped into their laps at the end of the last great war, not one of their own making. They would like the world to acknowledge that there has always been a kind of arrogance attached to this problem; that it was as if the United States and England said to them, Here are the Jews, get used to them. And they would like the world to acknowledge that it is a great indignity, not to mention a significant hardship, to have been an occupied people for so long, to have had to submit to strip searches on the way to work, and intimidation on the way to the grocery store, and the constant humiliation of being subject— a humiliation rendered nearly bottomless when Israel, with the benefit of the considerable intellectual and economic resources of world Jewry, made the desert bloom, in a way they had never been able to do. And they would like the world to acknowledge that there are those in Israel who are determined never to grant them independence, who have used the language of peace as a ploy to fill the West Bank with settlement after settlement until the facts on the ground are such that an independent Palestinian state on the West Bank is an impossibility. 
They would like the world to acknowledge that there is no such thing as a gentle occupation— that occupation corrodes the humanity of the occupier and makes the occupied vulnerable to brutality.

And I think the need to have these things acknowledged— the need for historical affirmation— is so great on both sides that both the Israelis and the Palestinians would rather perish as peoples than give this need up. In fact, I think they both feel that they would perish as peoples precisely if they did. They would rather die than admit their own complicity in the present situation, because to make such an admission would be to acknowledge the suffering of the other and the legitimacy of the other’s complaint, and that might mean that they themselves were wrong, that they were evil, that they were bad. That might give the other an opening to annihilate or enslave them. That might make such behavior seem justifiable.

I wonder how many of us are stuck in a similar snare. I wonder how many of us are holding on very hard to some piece of personal history that is preventing us from moving on with our lives, and keeping us from those we love. I wonder how many of us cling so tenaciously to a version of a story of our lives in which we appear to be utterly blameless and innocent, that we become oblivious to the pain we have inflicted on others, no matter how unconsciously or inevitably or innocently we have inflicted it. I wonder how many of us are terrified of acknowledging the truth of our lives because we think it will expose us. How many of us stand paralyzed between the moon and the sun; frozen — unable to act in the moment — because of our terror of the past and because of the intractability of the present circumstances that past has wrought? Forgiveness, it has been said, means giving up our hopes for a better past. This may sound like a joke, but how many of us refuse to give up our version of the past, and so find it impossible to forgive ourselves or others, impossible to act in the present?

I don't have answers. In my childhood I was promised peace in the Middle East. I am still waiting. I wish I knew what was needed to get us there.

Pirkei Avot Chapter 5, Verse 21 implies that I am perhaps old enough to have some Wisdom, but am not yet old enough to give Counsel. The only wisdom I have obtained so far is that in most disagreements, people can disagree about the "facts", be approaching the situation with fundamentally different values, or both. I believe that to have any meaningful discussion on a topic as fraught as this one, first common values must be established. Only then can we approach reality side-by-side, examine our beliefs, find mutually trustworthy sources of information, and find agreement about the state of reality. When values are aligned and facts are agreed upon, we might have some hope of letting go of just enough bits of history to find a path through this mess.

Decades ago I was a child promised peace. Today I have children of my own. Today on both sides there are children suffering from the choices of their parents and grandparents. All the children deserve better.

There are people on both sides with genocide in their hearts. I don't yet know what to do about that, but we cannot let them win.

I wish that those who should be wise enough to provide counsel, particularly those with power, would get their acts together.

I wish for the peace and safety of all the innocents.

I wish for peace. In my lifetime. This year. Tomorrow. Or even today.

I hope that these words of Rabbi Alan Lew will reach just a few more people thanks to this post being on the internet. I hope they have touched you. Thank you for reading.

OpenTelemetry Tracing for Dropshot Nahum Shalman

I spoke at Oxide's dtrace.conf(24) about a project I've been hacking on for the past couple weeks:

Slides:

OpenTelemetry Tracing for Dropshot

Code:

[DRAFT - DO NOT MERGE] Basic OpenTelemetry integration by nshalman · Pull Request #1201 · oxidecomputer/dropshot
OpenTelemetry for Dropshot: This is still very much a rough draft, but I want it to be clearly available for anyone interested. Checklist of things that are needed (note that much of it is currently…

Thoughts on Static Code Analysis The Trouble with Tribbles...

I use a number of tools in static code analysis for my projects - primarily Java based. Mostly

  1. codespell
  2. checkstyle
  3. shellcheck
  4. PMD
  5. SpotBugs

Wait, I hear you say. Spell checking? Absolutely, it's a key part of code and documentation quality. There's absolutely no excuse for shoddy spelling. And I sometimes find that if the spelling's off, it's a sign that concentration levels weren't what they should have been, and other errors might also have crept in.

checkstyle is far more than style, although it has very fixed ideas about that. I have a list of checks that must always pass (now I've cleaned them up at any rate), so that's now at the state where it's just looking for regressions - the remaining things it's complaining about I'm happy to ignore (or the cost of fixing them massively outweighs any benefit to fixing them).

One thing that checkstyle is keen on is thorough javadoc. Initially I might have been annoyed by some of its complaints, but then realised 2 things. First, it makes you consider whether a given API really should be public. And more generally as part of that, having to write javadoc can make you reevaluate the API you've designed, which pushes you towards improving it.

When it comes to shellcheck, I can summarise its approach as "quote all the things". Which is fine, until it isn't and you actually want to expand a variable into its constituent words.

But even there, a big benefit again is that shellcheck makes you look at the code and think about what it's doing. Which leads to an important point - automatic fixing of reported problems will (apart from making mistakes) miss the benefit of code inspection.
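To illustrate the quoting trade-off described above, here is a minimal sketch in POSIX sh. The `OPTS` variable and the `count_args` helper are hypothetical names invented for this example; the `# shellcheck disable=SC2086` comment is the directive shellcheck accepts for recording that an unquoted expansion is deliberate.

```shell
#!/bin/sh
# Hypothetical example: OPTS holds several words we want passed separately.
OPTS="-l -a -h"

# Hypothetical helper that reports how many arguments it received.
count_args() {
  echo "$#"
}

# Quoted, as shellcheck prefers: the whole string arrives as one argument.
quoted=$(count_args "$OPTS")

# Unquoted on purpose: word splitting yields one argument per word.
# shellcheck disable=SC2086
split=$(count_args $OPTS)

echo "quoted=$quoted split=$split"
```

Writing the directive by hand, rather than letting a tool rewrite the line, forces the decision about which behaviour is wanted. In bash, an array expanded as "${opts[@]}" is the usual way to keep separate words without disabling the check.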

Actual coding errors (or just imperfections) tend to be the domain of PMD and SpotBugs. I have a long list of exceptions for PMD, depending on each project. I'm writing applications for unix-like systems, and I really do want to write directly to stdout and stderr. If I want to shut the application down, then calling System.exit() really is the way to do it.

I've been using PMD for years, and it took a while to get the recent version 7 configured to my liking. But having run PMD against my code for so long means that a lot of the low hanging fruit had already been fixed (and early on my code was much much worse than it is now). I occasionally turn the exclusions off and see if I can improve my code, and occasionally win at this game, but it's a relatively hard slog.

So far, SpotBugs hasn't really added much. I find its output somewhat unhelpful (I do read the reports), but initial impressions are that it's finding things the other tools don't, so I need to work harder to make sense of it.

dtrace.conf(24) Oxide Computer Company Blog


Sometime in late 2007, we had the idea of a DTrace conference. Or really, more of a meetup; from the primordial e-mail I sent:

The goal here, by the way, is not a DTrace user group, but more of a face-to-face meeting with people actively involved in DTrace — either by porting it to another system, by integrating probes into higher level environments, by building higher-level tools on top of DTrace or by using it heavily and/or in a critical role. That said, we also don’t want to be exclusionary, so our thinking is that the only true requirement for attending is that everyone must be prepared to speak informally for 15 mins or so on what they are doing with DTrace, any limitations that they have encountered, and some ideas for the future. We’re thinking that this is going to be on the order of 15-30 people (though more would be a good problem to have — we’ll track it if necessary), that it will be one full day (breakfast in the morning through drinks into the evening), and that we’re going to host it here at our offices in San Francisco sometime in March 2008.

This same note also included some suggested names for the gathering, including what in hindsight seems a clear winner: DTrace Bi-Mon-Sci-Fi-Con. As if knowing that I should leave an explanatory note to my future self as to why this name was not selected, my past self fortunately clarified: "before everyone clamors for the obvious Bi-Mon-Sci-Fi-Con, you should know that most Millennials don’t (sadly) get the reference." (While I disagree with the judgement of my past self, it at least indicates that at some point I cared if anyone got the reference.)

We settled on a much more obscure reference, and had the first dtrace.conf in March 2008. Befitting the style of the time, it was an unconference (a term that may well have hit its apogee in 2008) that you signed up to attend by editing a wiki. More surprising given the year (and thanks entirely to attendee Ben Rockwood), it was recorded — though this is so long ago that I referred to it as video taping (and with none of the participants mic’d, I’m afraid the quality isn’t very good). The conference, however, was terrific, viz. the reports of Adam, Keith and Stephen (all somehow still online nearly two decades later). If anything, it was a little too good: we realized that we couldn’t recreate the magic, and we demurred on making it an annual event.

Years passed, and memories faded. By 2012, it felt like we wanted to get folks together again, now under a post-lawnmower corporate aegis in Joyent. The resulting dtrace.conf(12) was a success, and the Olympiad cadence felt like the right one; we did it again four years later at dtrace.conf(16).

In 2020, we came back together for a new adventure — and the DTrace Olympiad was not lost on Adam. Alas, dtrace.conf(20) — like the Olympics themselves — was cancelled, if implicitly. Unlike the Olympics, however, it was not to be rescheduled.

More years passed and DTrace continued to prove its utility at Oxide; last year when Adam and I did our "DTrace at 20" episode of Oxide and Friends, we vowed to hold dtrace.conf(24) — and a few months ago, we set our date to be December 11th.

At first we assumed we would do something similar to our earlier conferences: a one-day participant-run conference, at the Oxide office in Emeryville. But times have changed: thanks to the rise of remote work, technologists are much more dispersed — and many more people would need to travel for dtrace.conf(24) than in previous DTrace Olympiads. Travel hasn’t become any cheaper since 2008, and the cost (and inconvenience) was clearly going to limit attendance.

The dilemma for our small meetup highlights the changing dynamics in tech conferences in general: with talks all recorded and made publicly available after the conference, how does one justify attending a conference in person? There can be reasonable answers to that question, of course: it may be the hallway track, or the expo hall, or the after-hours socializing, or perhaps some other special conference experience. But it’s also not surprising that some conferences — especially ones really focused on technical content — have decided that they are better off doing as conference giant O’Reilly Media did, and going exclusively online. And without the need to feed and shelter participants, the logistics for running a conference become much more tenable — and the price point can be lowered to the point that even highly produced conferences like P99 CONF can be made freely available. This, in turn, leads to much greater attendance — and a network effect that can get back some of what one might lose going online. In particular, using chat as the hallway track can be much more effective (and is certainly more scalable!) than the actual physical hallways at a conference.

For conferences in general, there is a conversation to be had here (and as a teaser, Adam and I are going to talk about it with Stephen O’Grady and Theo Schlossnagle on Oxide and Friends next week), but for our quirky, one-day, Olympiad-cadence dtrace.conf, the decision was pretty easy: there was much more to be gained than lost by going exclusively on-line.

So dtrace.conf(24) is coming up next week, and it’s available to everyone. In terms of platform, we’re going to try to keep that pretty simple: we’re going to use Google Meet for the actual presenters, which we will stream in real-time to YouTube — and we’ll use the Oxide Discord for all chat. We’re hoping you’ll join us on December 11th — and if you want to talk about DTrace or a DTrace-adjacent topic, we’d love for you to present! Keeping to the unconference style, if you would like to present, please indicate your topic in the #session-topics Discord channel so we can get the agenda fleshed out.

While we’re excited to be online, there are some historical accoutrements of conferences that we didn’t want to give up. First, we have a tradition of t-shirts with dtrace.conf. Thanks to our designer Ben Leonard, we have a banger of a t-shirt, capturing the spirit of our original dtrace.conf(08) shirt but with an Oxide twist. It’s (obviously) harder to make those free but we have tried to price them reasonably. You can get your t-shirt by adding it to your (free) dtrace.conf ticket. (And for those who present at dtrace.conf, your shirt is on us — we’ll send you a coupon code!)

Second, for those who can make their way to the East Bay and want some hangout time, we are going to have an après conference social event at the Oxide office starting at 5p. We’re charging something nominal for that too (and like the t-shirt, you pay for that via your dtrace.conf ticket); we’ll have some food and drinks and an Oxide hardware tour for the curious — and (of course?) there will be Fishpong.

Much has changed since I sent that e-mail 17 years ago — but the shared values and disposition that brought together our small community continue to endure; we look forward to seeing everyone (virtually) at dtrace.conf(24)!

Advancing Cloud and HPC Convergence with Lawrence Livermore National Laboratory Oxide Computer Company Blog

Oxide Computer Company and Lawrence Livermore National Laboratory Work Together to Advance Cloud and HPC Convergence

Oxide Computer Company and Lawrence Livermore National Laboratory (LLNL) today announced a plan to bring on-premises cloud computing capabilities to the Livermore Computing (LC) high-performance computing (HPC) center. The rack-scale Oxide Cloud Computer allows LLNL to improve the efficiency of operational workloads and will provide users in the National Nuclear Security Administration (NNSA) with new capabilities for provisioning secure, virtualized services alongside HPC workloads.

HPC centers have traditionally run batch workloads for large-scale scientific simulations and other compute-heavy applications. HPC workloads do not exist in isolation—there are a multitude of persistent, operational services that keep the HPC center running. Meanwhile, HPC users also want to deploy cloud-like persistent services—databases, Jupyter notebooks, orchestration tools, Kubernetes clusters. Clouds have developed extensive APIs, security layers, and automation to enable these capabilities, but few options exist to deploy fully virtualized, automated cloud environments on-premises. The Oxide Cloud Computer allows organizations to deliver secure cloud computing capabilities within an on-premises environment.

On-premises environments are the next frontier for cloud computing. LLNL is tackling some of the hardest and most important problems in science and technology, requiring advanced hardware, software, and cloud capabilities. We are thrilled to be working with their exceptional team to help advance those efforts, delivering an integrated system that meets their rigorous requirements for performance, efficiency, and security.
— Steve Tuck, CEO at Oxide Computer Company

Leveraging the new Oxide Cloud Computer, LLNL will enable staff to provision virtual machines (VMs) and services via self-service APIs, improving operations and modernizing aspects of system management. In addition, LLNL will use the Oxide rack as a proving ground for secure multi-tenancy and for smooth integration with the LLNL-developed Flux resource manager. LLNL plans to bring its users cloud-like Infrastructure-as-a-Service (IaaS) capabilities that work seamlessly with their HPC jobs, while maintaining security and isolation from other users. Beyond LLNL personnel, researchers at the Los Alamos National Laboratory and Sandia National Laboratories will also partner in many of the activities on the Oxide Cloud Computer.

We look forward to working with Oxide to integrate this machine within our HPC center. Oxide’s Cloud Computer will allow us to securely support new types of workloads for users, and it will be a proving ground for introducing cloud-like features to operational processes and user workflows. We expect Oxide’s open-source software stack and their transparent and open approach to development to help us work closely together.

— Todd Gamblin, Distinguished Member of Technical Staff at LLNL

Sandia is excited to explore the Oxide platform as we work to integrate on-premises cloud technologies into our HPC environment. This advancement has the potential to enable new classes of interactive and on-demand modeling and simulation capabilities.

— Kevin Pedretti, Distinguished Member of Technical Staff at Sandia National Laboratories

LLNL plans to work with Oxide on additional capabilities, including the deployment of additional Cloud Computers in its environment. Of particular interest are scale-out capabilities and disaster recovery. The latest installation underscores Oxide Computer’s momentum in the federal technology ecosystem, providing reliable, state-of-the-art Cloud Computers to support critical IT infrastructure.

To learn more about Oxide Computer, visit https://oxide.computer.

About Oxide Computer

Oxide Computer Company is the creator of the world’s first commercial Cloud Computer, a true rack-scale system with fully unified hardware and software, purpose-built to deliver hyperscale cloud computing to on-premises data centers. With Oxide, organizations can fully realize the economic and operational benefits of cloud ownership, with access to the same self-service development experience of public cloud, without the public cloud cost. Oxide empowers developers to build, run, and operate any application with enhanced security, latency, and control, and frees organizations to elevate IT operations to accelerate strategic initiatives. To learn more about Oxide’s Cloud Computer, visit oxide.computer.

About LLNL

Founded in 1952, Lawrence Livermore National Laboratory provides solutions to our nation’s most important national security challenges through innovative science, engineering, and technology. Lawrence Livermore National Laboratory is managed by Lawrence Livermore National Security, LLC for the U.S. Department of Energy’s National Nuclear Security Administration.

Media Contact

LaunchSquad for Oxide Computer oxide@launchsquad.com

Remembering Charles Beeler Oxide Computer Company Blog

We are heartbroken to relay that Charles Beeler, a friend and early investor in Oxide, passed away in September after a battle with cancer. We lost Charles far too soon; he had a tremendous influence on the careers of us both.

Our relationship with Charles dates back nearly two decades, to his involvement with the ACM Queue board where he met Bryan. It was unprecedented to have a venture capitalist serve in this capacity with ACM, and Charles brought an entirely different perspective on the practitioner content. A computer science pioneer who also served on the board took Bryan aside at one point: "Charles is one of the good ones, you know."

When Bryan joined Joyent a few years later, Charles also got to know Steve well. Seeing the promise in both node.js and cloud computing, Charles became an investor in the company. When companies hit challenging times, some investors will hide — but Charles was the kind of investor to figure out how to fix what was broken. When Joyent needed a change in executive leadership, it was Charles who not only had the tough conversations, but led the search for the leader the company needed, ultimately positioning the company for success.

Aside from his investment in Joyent, Charles was an outspoken proponent of node.js, becoming an organizer of the Node Summit conference. In 2017, he asked Bryan to deliver the conference’s keynote, but by then, the relationship between Joyent and node.js had become… complicated, and Bryan felt that it probably wouldn’t be a good idea. Any rational person would have dropped it, but Charles persisted, with characteristic zeal: if the Joyent relationship with node.js had become strained, so much more the reason to speak candidly about it! Charles prevailed, and the resulting talk, Platform as Reflection of Values, became one of Bryan’s most personally meaningful talks.

Charles’s persistence was emblematic: he worked behind the scenes to encourage people to do their best work, always with an enthusiasm for the innovators and the creators. As we were contemplating Oxide, we told Charles what we wanted to do long before we had a company. Charles laughed with delight: "I hoped that you two would do something big, and I am just so happy for you that you’re doing something so ambitious!"

As we raised seed capital, we knew that we were likely a poor fit for Charles and his fund. But we also knew that we deeply appreciated his wisdom and enthusiasm; we couldn’t resist pitching him on Oxide. Charles approached the investment in Oxide as he did with so many other aspects: with curiosity, diligence, empathy, and candor. He was direct with us that despite his enthusiasm for us personally, Oxide would be a challenging investment for his firm. But he also worked with us to address specific objections, and ultimately he won over his partnership. We were thrilled when he not only invested, but pulled together a syndicate of like-minded technologists and entrepreneurs to join him.

Ever since, he has been a huge Oxide fan. Fittingly, one of his final posts expressed his enthusiasm and pride in what the Oxide team has built.

Charles, thank you. You told us you were proud of us — and it meant the world. We are gutted to no longer have you with us; your influence lives on not just in Oxide, but also in the many people that you have inspired. You were the best of venture capital. Closer to the heart, you were a terrific friend to us both; thank you.

Debugging an OpenJDK crash on SPARC The Trouble with Tribbles...

I had to spend a little time recently fixing a crash in OpenJDK on Solaris SPARC.

What we're seeing is, from the hs_err file:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0xffffffff57c745a8, pid=18442, tid=37
...
# Problematic frame:
# V  [libjvm.so+0x7745a8]  G1CollectedHeap::allocate_new_tlab(unsigned long, unsigned long, unsigned long*)+0xb8

Well that's odd. I only see this on SPARC, and I've seen it sporadically on Tribblix during the process of continually building OpenJDK on SPARC, but haven't seen it on Solaris. Until a customer hit it in production, which is rather a painful place to find a reproducer.

In terms of source, this is located in the file src/hotspot/share/gc/g1/g1CollectedHeap.cpp (all future source references will be relative to that directory), and looks like:

HeapWord* G1CollectedHeap::allocate_new_tlab(size_t min_size,
                                             size_t requested_size,
                                             size_t* actual_size) {
  assert_heap_not_locked_and_not_at_safepoint();
  assert(!is_humongous(requested_size), "we do not allow humongous TLABs");

  return attempt_allocation(min_size, requested_size, actual_size);
}

That's incredibly simple. There's not much that can go wrong there, is there?

The complexity here is that a whole load of functions get inlined. So what does it call? You find yourself in a twisty maze of passages, all alike. But anyway, the next one down is

inline HeapWord* G1CollectedHeap::attempt_allocation(size_t min_word_size,
                                                     size_t desired_word_size,
                                                     size_t* actual_word_size) {
  assert_heap_not_locked_and_not_at_safepoint();
  assert(!is_humongous(desired_word_size), "attempt_allocation() should not "
         "be called for humongous allocation requests");

  HeapWord* result = _allocator->attempt_allocation(min_word_size, desired_word_size, actual_word_size);

  if (result == NULL) {
    *actual_word_size = desired_word_size;
    result = attempt_allocation_slow(desired_word_size);
  }

  assert_heap_not_locked();
  if (result != NULL) {
    assert(*actual_word_size != 0, "Actual size must have been set here");
    dirty_young_block(result, *actual_word_size);
  } else {
    *actual_word_size = 0;
  }

  return result;
}

That then calls an inlined G1Allocator::attempt_allocation() in g1Allocator.hpp. That calls current_node_index(), which looks safe, and then makes a couple of calls to mutator_alloc_region()->attempt_retained_allocation() and mutator_alloc_region()->attempt_allocation(). Those come from g1AllocRegion.inline.hpp, and both ultimately call a local par_allocate(), which then calls par_allocate_impl() or par_allocate() in heapRegion.inline.hpp.

Now, mostly all these are doing is calling something else. The one really complex piece of code is in par_allocate_impl() which contains

...
  do {
    HeapWord* obj = top();
    size_t available = pointer_delta(end(), obj);
    size_t want_to_allocate = MIN2(available, desired_word_size);
    if (want_to_allocate >= min_word_size) {
      HeapWord* new_top = obj + want_to_allocate;
      HeapWord* result = Atomic::cmpxchg(&_top, obj, new_top);
      // result can be one of two:
      //  the old top value: the exchange succeeded
      //  otherwise: the new value of the top is returned.
      if (result == obj) {
        assert(is_object_aligned(obj) && is_object_aligned(new_top), "checking alignment");
        *actual_size = want_to_allocate;
        return obj;
      }
    } else {
      return NULL;
    }
  } while (true);
}
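The retry loop above is the usual lock-free bump-pointer pattern: read top, work out the new top, and only publish it if nobody else moved top in the meantime. As a rough illustration only, here is a small Python model of that loop. Python has no hardware compare-and-swap, so cmpxchg() is emulated with a lock; the Region class and its method names are made up for this sketch and are not HotSpot code.

```python
import threading


class Region:
    """Toy stand-in for a heap region: a bump pointer between bottom and end."""

    def __init__(self, bottom, end):
        self.top = bottom
        self.end = end
        self._lock = threading.Lock()  # only used to emulate an atomic CAS

    def cmpxchg(self, expected, new):
        """Emulated compare-and-exchange; returns the value seen before the call."""
        with self._lock:
            old = self.top
            if old == expected:
                self.top = new
            return old

    def par_allocate(self, min_words, desired_words):
        """Return (start, size) on success, or None if the region is too full."""
        while True:
            obj = self.top
            available = self.end - obj
            want = min(available, desired_words)
            if want < min_words:
                return None
            if self.cmpxchg(obj, obj + want) == obj:
                return obj, want
            # Another thread moved top first: loop and retry with the new value.


if __name__ == "__main__":
    region = Region(0, 1000)
    grabbed = []

    def worker():
        for _ in range(25):
            grabbed.append(region.par_allocate(10, 10))

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("final top:", region.top, "allocations:", len(grabbed))
```

Note that every step in the model dereferences the region object itself, which is the part that matters for the crash discussed next.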

Right, let's go back to the crash. We can open up the core file in
mdb, and look at the stack with $C

ffffffff7f39d751 libjvm.so`_ZN7VMError14report_and_dieEP6ThreadjPhPvS3_+0x3c(
    101cbb1d0?, b?, fffffffcb45dea7c?, ffffffff7f39ecb0?, ffffffff7f39e9a0?, 0?)
ffffffff7f39d811 libjvm.so`JVM_handle_solaris_signal+0x1d4(b?,
    ffffffff7f39ecb0?, ffffffff7f39e9a0?, 0?, ffffffff7f39e178?, 101cbb1d0?)
ffffffff7f39dde1 libjvm.so`_ZL17javaSignalHandleriP7siginfoPv+0x20(b?,
    ffffffff7f39ecb0?, ffffffff7f39e9a0?, 0?, 0?, ffffffff7e7dd370?)
ffffffff7f39de91 libc.so.1`__sighndlr+0xc(b?, ffffffff7f39ecb0?,
    ffffffff7f39e9a0?, fffffffcb4b38afc?, 0?, ffffffff7f20c7e8?)
ffffffff7f39df41 libc.so.1`call_user_handler+0x400((int) -1?,
    (siginfo_t *) 0xffffffff7f39ecb0?, (ucontext_t *) 0xc?)
ffffffff7f39e031 libc.so.1`sigacthandler+0xa0((int) 11?,
    (siginfo_t *) 0xffffffff7f39ecb0?, (void *) 0xffffffff7f39e9a0?)
ffffffff7f39e5b1 libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm+0xb8(
    10013d030?, 100?, 520?, ffffffff7f39f000?, 0?, 0?)

What you see here is the allocate_new_tlab() at the bottom, it throws a signal, the signal handler catches it, passes it ultimately to JVM_handle_solaris_signal() which bails, and the JVM exits.

We can look at the signal. It's at address 0xffffffff7f39ecb0 and is of type siginfo_t, so we can just print it

java:core> ffffffff7f39ecb0::print -t siginfo_t

and we first see

siginfo_t {
    int si_signo = 0t11 (0xb)
    int si_code = 1
    int si_errno = 0
...

OK, the signal was indeed 11 = SIGSEGV. The interesting thing is the si_code of 1, which is defined as

#define SEGV_MAPERR     1       /* address not mapped to object */

Ah. Now, in the jvm you actually see a lot of SIGSEGV, but a lot of them are handled by that mysterious JVM_handle_solaris_signal(). In particular, it'll handle anything with SEGV_ACCERR which is basically something running off the end of an array.

Further down, you can see the fault address

struct  __fault = {
            void *__addr = 0x10
            int __trapno = 0
            caddr_t __pc = 0
            int __adivers = 0
        }

So, we're faulting on address 0x10. Yes, you try messing around down there and you will fault.


That confirms the crash is a SEGV. What are we actually trying to do? We can disassemble the allocate_new_tlab() function and see what's happening - remember the crash was at offset 0xb8

java:core> libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm::dis
...
 libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm+0xb8:

       ldx       [%i4 + 0x10], %i5

That's interesting, 0x10 was the fault address. What's %i4 then?

java:core> ::regs
%i4 = 0x0000000000000000

Yep. Given that, we'll try and read 0x10, giving the SEGV we see.
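The numbers seen so far can be cross-checked with a little arithmetic. The sketch below uses only values quoted above; the variable names are mine, and the library base address is derived (pc minus the offset into libjvm.so) rather than read from the core file.

```python
PC = 0xffffffff57c745a8   # pc from the hs_err file
LIB_OFFSET = 0x7745a8     # libjvm.so+0x7745a8 from the problematic frame
FUNC_OFFSET = 0xb8        # allocate_new_tlab(...)+0xb8

# Where libjvm.so must have been loaded for pc and the offset to agree.
libjvm_base = PC - LIB_OFFSET

# Offset of the start of allocate_new_tlab() within libjvm.so.
func_start = LIB_OFFSET - FUNC_OFFSET

# ldx [%i4 + 0x10], %i5 with %i4 = 0, as shown by ::regs
I4 = 0x0
fault_addr = I4 + 0x10

print(hex(libjvm_base))
print(hex(func_start))
print(hex(fault_addr))
```

The last value is 0x10, which matches the __addr field in the siginfo_t.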

There's a little more context around that call site. A slightly
expanded view is

 libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm+0xa0:        nop
 libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm+0xa4:        add       %i5, %g1, %g1
 libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm+0xa8:        casx      [%g3], %i5, %g1
 libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm+0xac:        cmp       %i5, %g1
 libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm+0xb0:        be,pn     %xcc, +0x160  <libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm+0x210>
 libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm+0xb4:        nop
 libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm+0xb8:        ldx       [%i4 + 0x10], %i5

Now, the interesting thing here is the casx (compare and swap) instruction. That lines up with the Atomic::cmpxchg() in par_allocate_impl() that we were suspecting above. So the crash is somewhere around there.

It turns out there's another way to approach this. If we compile without optimization then effectively we turn off the inlining. The way to do this is to add an entry to the jvm Makefile via make/hotspot/lib/JvmOverrideFiles.gmk

...
else ifeq ($(call isTargetOs, solaris), true)
    ifeq ($(call isTargetCpuArch, sparc), true)
      # ptribble port tweaks
      BUILD_LIBJVM_g1CollectedHeap.cpp_CXXFLAGS += -O0
    endif
endif

If we rebuild (having touched all the files in the directory to force
make to rebuild everything correctly) and run again, we get the full
call stack.

Now the crash is

# V  [libjvm.so+0x80cc48]  HeapRegion::top() const+0xc

which we can expand to the following stack leading up to where it goes
into the signal handler:

ffffffff7f39dff1 libjvm.so`_ZNK10HeapRegion3topEv+0xc(0?, ffffffff7f39ef40?,
    101583e38?, ffffffff7f39f020?, fffffffa46de8038?, 10000?)
ffffffff7f39e0a1 libjvm.so`_ZN10HeapRegion17par_allocate_implEmmPm+0x18(0?,
    100?, 10000?, ffffffff7f39ef60?, ffffffff7f39ef40?, 8f00?)
ffffffff7f39e181 libjvm.so`_ZN10HeapRegion27par_allocate_no_bot_updatesEmmPm+0x24(0?, 100?,
    10000?, ffffffff7f39ef60?, 566c?, 200031?)
ffffffff7f39e231 libjvm.so`_ZN13G1AllocRegion12par_allocateEP10HeapRegionmmPm+0x44(100145440?,
    0?, 100?, 10000?, ffffffff7f39ef60?, 0?)
ffffffff7f39e2e1 libjvm.so`_ZN13G1AllocRegion18attempt_allocationEmmPm+0x48(
    100145440?, 100?, 10000?, ffffffff7f39ef60?, 3?, fffffffa46ceff48?)
ffffffff7f39e3a1 libjvm.so`_ZN11G1Allocator18attempt_allocationEmmPm+0xa4(
    1001453b0?, 100?, 10000?, ffffffff7f39ef60?, 7c0007410?, ffffffff7f39ea41?)
ffffffff7f39e461 libjvm.so`_ZN15G1CollectedHeap18attempt_allocationEmmPm+0x2c(
    10013d030?, 100?, 10000?, ffffffff7f39ef60?, 7c01b15e8?, 0?)
ffffffff7f39e521 libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm+0x24(
    10013d030?, 100?, 10000?, ffffffff7f39ef60?, 0?, 0?)

So yes, this confirms that we are indeed in par_allocate_impl() and
it's crashing on the very first line of the code segment I showed
above, where it calls top(). All top() does is return the _top member
of a HeapRegion.

So the only thing that can happen here is that the HeapRegion itself
is NULL. Then the _top member is presumably at offset 0x10, and trying
to access it gives the SIGSEGV.
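To make the offset argument concrete, here is a small sketch using Python's ctypes. The structure layout is purely hypothetical (two 8-byte placeholder fields in front of the member standing in for _top); it is not the real HeapRegion layout, which I haven't checked. It only shows how reading a member through a NULL object pointer turns into an access at the member's offset.

```python
import ctypes


class FakeHeapRegion(ctypes.Structure):
    # Hypothetical layout, NOT the real HeapRegion: two 8-byte placeholders
    # ahead of "top" so that it lands at offset 0x10.
    _fields_ = [
        ("pad0", ctypes.c_uint64),
        ("pad1", ctypes.c_uint64),
        ("top", ctypes.c_uint64),   # stands in for the _top member
    ]


def member_address(base, field):
    """Address the CPU would load from when reading `field` of an object at `base`."""
    return base + getattr(FakeHeapRegion, field).offset


print(hex(member_address(0x0, "top")))   # object pointer is NULL
```

With a NULL base the load address is just the offset, which is consistent with the fault address of 0x10 seen earlier.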

Now, in G1AllocRegion::attempt_allocation() there's an assert:

  HeapRegion* alloc_region = _alloc_region;
  assert_alloc_region(alloc_region != NULL, "not initialized properly");

However, asserts aren't compiled into production builds.

But the fix here is to fail if we've got NULL and let the caller
retry. There are a lot of calls here, and the general approach is to
return NULL if anything goes wrong, so I do the same for this extra
failure case, adding the following:

  if (alloc_region == NULL) {
    return NULL;
  }

With that, no more of those pesky crashes. (There might be others
lurking elsewhere, of course.)

Of course, what this doesn't explain is why the HeapRegion wasn't
correctly initialized in the first place. But that's another problem
entirely.

How Oxide Cuts Data Center Power Consumption in Half Oxide Computer Company Blog

Here’s a sobering thought: today, data centers already consume 1-2% of the world’s power, and that percentage will likely rise to 3-4% by the end of the decade. According to Goldman Sachs research, that rise will include a doubling in data center carbon dioxide emissions. As the data and AI boom progresses, this thirst for power shows no signs of slowing down anytime soon. Two key challenges quickly become evident for the 85% of IT that currently lives on-premises.

  1. How can organizations reduce power consumption and corresponding carbon emissions?

  2. How can organizations keep pace with AI innovation as existing data centers run out of available power?

Graph of AI & Data Center Growth Boosting Electricity Demand
Masanet et al. (2020), Cisco, IEA, Goldman Sachs Research

Rack-scale design is critical to improved data center efficiency

Traditional data center IT consumes so much power because the fundamental unit of compute is an individual server; like a house where rooms were built one at a time, with each room having its own central AC unit, gas furnace, and electrical panel. Individual rackmount servers are stacked together, each with their own AC power supplies, cooling fans, and power management. They are then paired with storage appliances and network switches that communicate at arm’s length, not designed as a cohesive whole. This approach fundamentally limits organizations' ability to maintain sustainable, high-efficiency computing systems.

Of course, hyperscale public cloud providers did not design their data center systems this way. Instead, they operate like a carefully planned smart home where everything is designed to work together cohesively and is operated by software that understands the home’s systems end-to-end. High-efficiency, rack-scale computers are deployed at scale and operate as a single unit with integrated storage and networking to support elastic cloud computing services. This modern architecture is made available to the market as public cloud, but that rental-only model is ill-fit for many business needs.

Illustration of Oxide racks at a higher density (2x) than conventional ones

Compared to a popular rackmount server vendor, Oxide is able to fill our specialized racks with 32 AMD Milan sleds and highly-available network switches using less than 15kW per rack, doubling the compute density in a typical data center. With just 16 of the alternative 1U servers and equivalent network switches, over 16kW of power is required per rack, leading to only 1,024 CPU cores vs Oxide’s 2,048.

Extracting more useful compute from each kW of power and square foot of data center space is key to the future effectiveness of on-premises computing.
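As a quick sanity check on those figures, the short Python sketch below turns them into cores per kW. It uses only the numbers quoted above; because the rack power figures are stated as bounds ("less than" and "over"), the results are bounds too, and the function name is just for illustration.

```python
def cores_per_kw(cores, rack_kw):
    """CPU cores delivered per kW of rack power."""
    return cores / rack_kw


# Figures quoted above; the per-rack power numbers are bounds, not measurements.
oxide = cores_per_kw(2048, 15.0)    # "less than 15kW", so at least this many
legacy = cores_per_kw(1024, 16.0)   # "over 16kW", so at most this many

print(f"Oxide:  {oxide:.1f} cores/kW")
print(f"Legacy: {legacy:.1f} cores/kW")
print(f"Ratio:  {oxide / legacy:.2f}x")
```

On these numbers the ratio comes out a little over 2x, in line with the claim of doubled compute density.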

At Oxide, we’ve taken this lesson in advancing rack-scale design, improved upon it in several ways, and made it available for every organization to purchase and operate anywhere in the world without a tether back to the public cloud. Our Cloud Computer treats the entire rack as a single, unified computer rather than a collection of independent parts, achieving unprecedented power efficiency.

By designing the hardware and software together, we’ve eliminated unnecessary components and optimized every aspect of system operation through a control plane with visibility to end-to-end operations.

When we started Oxide, the DC bus bar stood as one of the most glaring differences between the rack-scale machines at the hyperscalers and the rack-and-stack servers that the rest of the market was stuck with. That a relatively simple piece of copper was unavailable to commercial buyers — despite being unequivocally the right way to build it! — represented everything wrong with the legacy approach.

The bus bar in the Oxide Cloud Computer is not merely more efficient, it is a concrete embodiment of the tremendous gains from designing at rack-scale, and by integrating hardware with software.

— Bryan Cantrill

The improvements we’re seeing are rooted in technical innovation

  • Replacing low-efficiency AC power supplies with a high-efficiency DC Bus Bar
    Power conversion is performed once AC power is fed from the data center to the Oxide universal power shelf with a customized power shelf controller (PSC). The power shelf distributes DC power up and down the rack via a bus bar. This eliminates the 70 total AC power supplies found in an equivalent legacy server rack within 32 servers, two top-of-rack switches, and one out-of-band switch, each with two AC power supplies. This power shelf also ensures the load is balanced across phases, something that’s impossible with traditional power distribution units found in legacy server racks.

  • Bigger fans = bigger efficiency gains
    Oxide server sleds are designed to a custom form factor to accommodate larger fans than legacy servers typically use. These fans can move more air more efficiently, cooling the systems using 12x less energy than legacy servers, which each contain as many as 7 fans, which must work much harder to move air over system components.

  • Purpose-built for power efficiency
    Oxide server sleds have less restrictive airflow than legacy servers by eliminating extraneous components like PCIe risers, storage backplanes, and more. Legacy servers need many optional components like these because they could be used for any number of tasks, such as point-of-sale systems, data center servers, or network-attached-storage (NAS) systems. Still, they were never designed optimally for any one of those tasks. The Oxide Cloud Computer was designed from the ground up to be a rack-scale cloud computing powerhouse, and so it’s optimized for exactly that task.

  • Hardware + Software designed together
    The Oxide Cloud Computer includes a robust cloud control plane with deep observability to the full system. By designing the hardware and software together, we can make hardware choices like more intelligent DC-DC power converters that can provide rich telemetry to our control plane, enabling future feature enhancements such as dynamic power capping and efficiency-based workload placement that are impossible with legacy servers and software systems.
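The count of 70 AC power supplies in the first bullet follows directly from the device counts given there. A minimal sketch of that arithmetic, with dictionary keys chosen only for readability:

```python
# Devices in the equivalent legacy rack described above.
LEGACY_RACK = {
    "servers": 32,
    "top-of-rack switches": 2,
    "out-of-band switch": 1,
}
PSUS_PER_DEVICE = 2  # each device has two AC power supplies


def total_psus(devices, per_device=PSUS_PER_DEVICE):
    """Total AC power supplies across all devices in the rack."""
    return sum(devices.values()) * per_device


print(total_psus(LEGACY_RACK))
```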

Learn more about Oxide’s intelligent Power Shelf Controller

The Bottom Line: Customers and the Environment Both Benefit

Reducing data center power demands and achieving more useful computing per kilowatt requires fundamentally rethinking traditional data center utilization and compute design. At Oxide, we’ve proven that dramatic efficiency gains are possible when you rethink the computer at rack-scale with hardware and software designed thoughtfully and rigorously together.

Ready to learn how your organization can achieve these results? Schedule time with our team here.

Together, we can reclaim on-premises computing efficiency to achieve both business and sustainability goals.

OmniOS Community Edition r151052 OmniOS Community Edition

OmniOSce v11 r151052 is out!

On the 4th of November 2024, the OmniOSce Association released a new stable version of OmniOS - The Open Source Enterprise Server OS. The release comes with many tool updates, brand-new features and additional hardware support. For details see the release notes.

Note that r151048 is now end-of-life. You should upgrade to r151050 or r151052 to stay on a supported track.

OmniOS is fully Open Source and free. Nevertheless, it takes a lot of time and money to keep maintaining a full-blown operating system distribution. Our statistics show that there are almost 2’000 active installations of OmniOS while fewer than 20 people send regular contributions. If your organisation uses OmniOS based servers, please consider becoming a regular patron or taking out a support contract.


Any problems or questions, please get in touch.

Reflections on Founder Mode The Observation Deck

Paul Graham’s Founder Mode is an important piece, and you should read it if for no other reason than that “founder mode” will surely enter the lexicon (and as Graham grimly predicts: “as soon as the concept of founder mode becomes established, people will start misusing it”). When building a company, founders are engaged in several different acts at once: raising capital; building a product; connecting that product to a market; building an organization to do all of these. Founders make lots of mistakes in all of these activities, and Graham’s essay highlights a particular kind of mistake in which founders are overly deferential to expertise or convention. Pejoratively referring to this as “Management Mode”, Graham frames this in the Silicon Valley dramaturgical dyad of Steve Jobs and John Sculley. While that’s a little too reductive (anyone seeking to understand Jobs needs to read Randall Stross’s superlative Steve Jobs and the NeXT Big Thing, highlighting Jobs’s many post-Sculley failures at NeXT), Graham has identified a real issue here, albeit without much specificity.

For a treatment of the same themes but with much more supporting detail, one should read the (decade-old) piece from Tim O’Reilly, How I failed. (Speaking personally, O’Reilly’s piece had a profound influence on me, as it encouraged me to stand my ground on an issue on which I had my own beliefs but was being told to defer to convention.) But as terrific as it is, O’Reilly’s piece also doesn’t answer the question that Graham poses: how do founders prevent their companies from losing their way?

Graham says that founder mode is a complete mystery (“There are as far as I know no books specifically about founder mode”), and while there is a danger in being too pat or prescriptive, there does seem to be a clear component for keeping companies true to themselves: the written word. That is, a writing- (and reading-!) intensive company culture does, in fact, allow for scaling the kind of responsibility that Graham thinks of as founder mode. At Oxide, our writing-intensive culture has been absolutely essential: our RFD process is the backbone of Oxide, and has given us the structure to formalize, share, and refine our thinking. First among this formalized thinking – and captured in our first real RFD – is RFD 2 Mission, Principles, and Values. Immediately behind that (and frankly, the most important process for any company) is RFD 3 Oxide Hiring Process. These first three RFDs – on the process itself, on what we value, and on how we hire – were written in the earliest days of the company, and they have proven essential to scale the company: they are the foundation upon which we attract people who share our values.

While the shared values have proven necessary, they haven’t been sufficient to eliminate the kind of quandaries that Graham and O’Reilly describe. For example, there have been some who have told us that we can’t possibly hire non-engineering roles using our hiring process – or told us that our approach to compensation can’t possibly work. To the degree that we have had a need for Graham’s founder mode, it has been in those moments: to stay true to the course we have set for the company. But because we have written down so much, there is less occasion for this than one might think. And when it does occur – when there is a need for further elucidation or clarification – the artifact is not infrequently a new RFD that formalizes our newly extended thinking. (RFD 68 is an early public and concrete example of this; RFD 508 is a much more recent one that garnered some attention.)

Most importantly, because we have used our values as a clear lens for hiring, we are able to ensure that everyone at Oxide is able to have the same disposition with respect to responsibility – and this (coupled with the transparency that the written word allows) permits us to trust one another. As I elucidated in Things I Learned The Hard Way, the most important quality in a leader is to bind a team with mutual trust: with it, all things are possible – and without it, even easy things can be debilitatingly difficult. Graham mentions trust, but he doesn’t give it its due. Too often, founders focus on the immediacy of a current challenge without realizing that they are, in fact, undermining trust with their approach. Bluntly, founders are at grave risk of misinterpreting Graham’s “Founder Mode” to be a license to micromanage their teams, descending into the kind of manic seagull management that inhibits a team rather than empowering it.

Founders seeking to internalize Graham’s advice should recast it by asking themselves how they can foster mutual trust – and how they can build the systems that allow trust to be strengthened even as the team expands. For us at Oxide, writing is the foundation upon which we build that trust. Others may land on different mechanisms, but the goal of founders should be the same: build the trust that allows a team to kick a Jobsian dent in the universe!

KORH Minimum Sector Altitude Gotcha Josef "Jeff" Sipek

I had this draft around for over 5 years—since January 2019. Since I still think it is about an interesting observation, I’m publishing it now.

In late December (2018), I was preparing for my next instrument rating lesson which was going to involve a couple of ILS approaches at Worcester, MA (KORH). While looking over the ILS approach to runway 29, I noticed something about the minimum sector altitude that surprised me.

Normally, I consider MSAs to be centered near the airport for the approach. For conventional (i.e., non-RNAV) approaches, this tends to be the main navaid used during the approach. At Worcester, the 25 nautical mile MSA is centered on the Gardner VOR which is 19 nm away.

I plotted the MSA boundary on the approach chart to visualize it better:

It is easy to glance at the chart, see 3300 most of the way around, but not realize that when flying in the vicinity of the airport we are near the edge of the MSA. GRIPE, the missed approach hold fix, is half a mile outside of the MSA. (Following the missed approach procedure will result in plenty of safety, of course, so this isn’t really that relevant.)
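The geometry is simple enough to put into numbers. The sketch below uses the distances quoted above and treats "half a mile" as 0.5 nm, which is my assumption; the function name is also mine.

```python
MSA_RADIUS_NM = 25.0         # radius of the MSA circle
AIRPORT_FROM_VOR_NM = 19.0   # KORH's distance from the Gardner VOR


def margin_inside_msa(distance_from_center_nm, radius_nm=MSA_RADIUS_NM):
    """Distance to the nearest point of the MSA boundary.

    Positive means inside the circle, negative means outside.
    """
    return radius_nm - distance_from_center_nm


# Flying from the airport directly away from the VOR, the MSA ends after:
print(margin_inside_msa(AIRPORT_FROM_VOR_NM), "nm")

# GRIPE is described as half a mile outside, i.e. roughly 25.5 nm from the VOR
# (assuming nautical miles).
print(margin_inside_msa(25.5), "nm")
```

So on the side of the airport away from the VOR, the 3300-foot sector altitude only covers about 6 nm.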

What's a decent password length? The Trouble with Tribbles...

What's a decent length for a password?

I think it's pretty much agreed by now that longer passwords are, in general, better. And fortunately stupid complexity requirements are on the way out.

Reading the NIST password rules gives the following:

  • User chosen passwords must be at least 8 characters
  • Machine chosen passwords must be at least 6 characters
  • You must allow passwords of at least 64 characters

Say what? A 6 character password is secure?

Initially, that seems way off, but it depends on your threat model. If you have a mechanism to block the really bad commonly used passwords, then 6 characters gives you a billion choices. Not many, but you should also be implementing technical measures such as rate limiting.

With that, if the only attack vector is brute force over the network, trying a billion passwords is simply impractical. Even with just passive rate limiting (limited by cpu power and network latency) an attacker will struggle; with active limiting they'll be trying for decades.

That's with just 6 random characters. Go to 8 and you're out of sight. And for this attack vector, no quantum computing developments will make any difference whatsoever.
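To put rough numbers on that, here is a small Python sketch. The 36-character alphabet (lowercase letters plus digits) and the guess rates are assumptions picked for illustration; they are not taken from the NIST rules. With that alphabet, 6 characters gives about 2.2 billion combinations, the same order of magnitude as the "billion" above.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def keyspace(alphabet_size, length):
    """Number of distinct random passwords of the given length."""
    return alphabet_size ** length


def years_to_exhaust(alphabet_size, length, guesses_per_second):
    """Worst-case time to try every password at a fixed guess rate."""
    return keyspace(alphabet_size, length) / guesses_per_second / SECONDS_PER_YEAR


ALPHABET = 36  # assumption: lowercase letters + digits

for length in (6, 8):
    print(f"{length} chars: {keyspace(ALPHABET, length):,} combinations")
    for rate in (10.0, 1.0):  # hypothetical guesses per second over the network
        years = years_to_exhaust(ALPHABET, length, rate)
        print(f"  at {rate:g} guesses/s: about {years:,.0f} years to try them all")
```

An attacker would on average need half the worst-case time, which still leaves a rate-limited online attack against 6 random characters in the range of years to decades.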

But what if the user database itself is compromised?

Of course, if the passwords are in cleartext then no amount of fancy rules or length requirements is going to help you at all.

But if an attacker gets encrypted passwords then they can simply brute force them many orders of magnitude faster. Or use rainbow tables. And that's a whole different threat model.

Realistically, protecting against brute force or rainbow table attacks probably needs a 16 character password (or passphrase), and that requirement could get longer over time.

A corollary to this is that there isn't actually much to be gained from requiring password lengths between 8 and 16 characters.

In illumos, the default minimum password length is 6 characters. I recently increased the default in Tribblix to 8, which aligns with the user chosen limit that NIST give.

OmniOS Community Edition r151050 OmniOS Community Edition

OmniOSce v11 r151050 is out!

On the 6th of May 2024, the OmniOSce Association released a new stable version of OmniOS - The Open Source Enterprise Server OS. The release comes with many tool updates, brand-new features and additional hardware support. For details see the release notes.

Note that r151038 is now end-of-life. You should upgrade to r151046 or r151050 to stay on a supported track. r151046 is an LTS release with support until May 2026, and r151050 is a stable release with support until May 2025.

For anyone who tracks LTS releases, the previous LTS - r151038 - is now end-of-life. You should upgrade to r151046 for continued LTS support.

OmniOS is fully Open Source and free. Nevertheless, it takes a lot of time and money to keep maintaining a full-blown operating system distribution. Our statistics show that there are almost 2’000 active installations of OmniOS while fewer than 20 people send regular contributions. If your organisation uses OmniOS based servers, please consider becoming a regular patron or taking out a support contract.


Any problems or questions, please get in touch.

Unsynchronized PPS Experiment Josef "Jeff" Sipek

Late last summer I decided to do a simple experiment—feed my server a PPS signal that wasn’t synchronized to any timescale. The idea was to give chrony a reference that is more stable than the crystal oscillator on the motherboard.

Hardware

For this PPS experiment I decided to avoid all control loop/feedback complexity and just manually set the frequency to something close enough and let it drift—hence the unsynchronized. As a result, the circuit was quite simple:

The OCXO was a $5 used part from eBay. It outputs a 10 MHz square wave and has a control voltage pin that lets you tweak the frequency a little bit. By playing with it, I determined that a 10 mV control voltage change yielded about 0.1 Hz frequency change. The trimmer sets this reference voltage. To “calibrate” it, I connected it to a frequency counter and tweaked the trimmer until the counter read exactly 10 MHz.

10 MHz is obviously way too fast for a PPS signal. The simplest way to turn it into a PPS signal is to use an 8-bit microcontroller. The ATmega48P’s design seems to have very deterministic timing (in other words it adds a negligible amount of jitter), so I used it at 10 MHz (fed directly from the OCXO) with a very simple assembly program to toggle an output pin on and off. The program kept an output pin high for exactly 2 million cycles, and low for 8 million cycles thereby creating a 20% duty cycle square wave at 1 Hz…perfect to use as a PPS. Since the jitter added by the microcontroller is measured in picoseconds it didn’t affect the overall performance in any meaningful way.
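The cycle counts above can be checked with a few lines of arithmetic. This is only a Python sketch of the timing math, not the assembly program that ran on the microcontroller; the constant and function names are made up for illustration.

```python
# Timing arithmetic for the divider described above (illustrative names).
CLOCK_HZ = 10_000_000      # OCXO output clocking the ATmega48P
HIGH_CYCLES = 2_000_000    # cycles the output pin is held high
LOW_CYCLES = 8_000_000     # cycles the output pin is held low

period_s = (HIGH_CYCLES + LOW_CYCLES) / CLOCK_HZ           # one full pulse period
pulse_width_s = HIGH_CYCLES / CLOCK_HZ                     # time spent high
duty_cycle = HIGH_CYCLES / (HIGH_CYCLES + LOW_CYCLES)      # fraction of period high


def pulse_period_s(freq_error_ppb):
    """Pulse period when the oscillator is off by freq_error_ppb parts per billion."""
    actual_hz = CLOCK_HZ * (1 + freq_error_ppb * 1e-9)
    return (HIGH_CYCLES + LOW_CYCLES) / actual_hz


print(period_s, pulse_width_s, duty_cycle)  # 1.0 0.2 0.2
```

Because the divider only counts cycles, any frequency error in the oscillator carries straight through to the pulse period: a slightly slow oscillator gives pulses slightly more than one second apart.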

The ATmega48P likes to run at 5V and therefore its PPS output is +5V/0V, which isn’t compatible with a PC serial port. I happened to have an ADM3202 on hand so I used it to convert the 5V signal to an RS-232 compatible signal. I didn’t do as thorough of a check of its jitter characteristics, but I didn’t notice anything bad while testing the circuit before “deploying” it.

Finally, I connected the RS-232 compatible signal to the DCD pin (but CTS would have worked too).

The whole circuit was constructed on a breadboard with the OCXO floating in the air on its wires. Power was supplied with an iPhone 5V USB power supply. Overall, it was a very quick and dirty construction to see how well it would work.

Software

My server runs FreeBSD with chrony as the NTP daemon. The configuration is really simple.

First, setting dev.uart.0.pps_mode to 2 informs the kernel that the PPS signal is on DCD (see uart(4)).

Second, we need to tell chrony that there is a local PPS on the port:


refclock PPS /dev/cuau0 local 

The local token is important. It tells chrony that the PPS is not synchronized to UTC. In other words, that the PPS can be used as a 1 Hz frequency source but not as a phase source.

Performance

I ran my server with this PPS refclock for about 50 days with chrony configured to log the time offset of each pulse and to apply filtering to every 16 pulses. (This removes some of the errors related to serial port interrupt handling not being instantaneous.) The following evaluation uses only these filtered samples as well as the logged data about the calculated system time error.

In addition to the PPS, chrony used several NTP servers from the internet (including the surprisingly good time.cloudflare.com) for the date and time-of-day information. This is a somewhat unfortunate situation when it comes to trying to figure out how good of an oscillator the OCXO is, as to make good conclusions about one oscillator one needs a better quality oscillator for the comparison. However, there are still a few things one can look at even when the (likely) best oscillator is the one being tested.

NTP Time Offset

The ultimate goal of a PPS source is to stabilize the system’s clock. Did the PPS source help? I think it is easy to answer that question by looking at the remaining time offset (column 11 in chrony’s tracking.log) over time.

This is a plot of 125 days that include the 50 days when I had the PPS circuit running. You can probably guess which 50 days. (The x-axis is time expressed as Modified Julian Date, or MJD for short.)

I don’t really have anything to say aside from—wow, what a difference!

For completeness, here’s a plot of the estimated local offset at the epoch (column 7 in tracking.log). My understanding of the difference between the two columns is fuzzy but regardless of which I go by, the improvement was significant.

Fitting a Polynomial Model

In addition to looking at the whole-system performance, I wanted to look at the PPS performance itself.

As before, the x-axis is MJD. The y-axis is the PPS offset as measured and logged by chrony—the 16-second filtered values.

The offset started at -486.5168ms. This is an arbitrary offset that simply shows that I started the PPS circuit about half a second off of UTC. Over the approximately 50 days, the offset grew to -584.7671ms.

This means that the OCXO frequency wasn’t exactly 10 MHz (and therefore the 1 PPS wasn’t actually at 1 Hz). Since there is a visible curve to the line, it isn’t a simple fixed frequency error but rather the frequency drifted during the experiment.

How much? I used R’s lm function to fit simple polynomials to the collected data. I tried a few different polynomial degrees, but all of them were fitted the same way:


m <- lm(pps_offset ~ poly(time, poly_degree, raw=TRUE))
a <- as.numeric(m$coefficients[1])
b <- as.numeric(m$coefficients[2])
c <- as.numeric(m$coefficients[3])
d <- as.numeric(m$coefficients[4])

In all cases, these coefficients correspond to the 4 terms in a + bt + ct² + dt³. For lower-degree polynomials, the missing coefficients are 0.

Note: Even though the plots show the x-axis in MJD, the calculations were done in seconds with the first data point at t=0 seconds.
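The conversion between MJD on the plots and elapsed seconds in the fits can be sketched in Python. The original analysis was done in R, so the function names and the example MJD value here are hypothetical. The sketch uses the fact that MJD 0 is midnight on 1858-11-17 and ignores leap seconds.

```python
from datetime import datetime, timedelta, timezone

SECONDS_PER_DAY = 86_400
MJD_EPOCH = datetime(1858, 11, 17, tzinfo=timezone.utc)  # MJD 0.0


def mjd_to_datetime(mjd):
    """Convert a Modified Julian Date to a calendar date and time (UTC)."""
    return MJD_EPOCH + timedelta(days=mjd)


def mjd_to_elapsed_seconds(mjd, first_mjd):
    """Seconds since the first data point, so that the first sample is t=0."""
    return (mjd - first_mjd) * SECONDS_PER_DAY


first = 60000.0  # hypothetical MJD of the first sample, not from the data set
print(mjd_to_datetime(first))
print(mjd_to_elapsed_seconds(first + 50, first))  # 4320000.0
```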

Linear

The simplest model is a linear one. In other words, fitting a straight line through the data set. lm provided the following coefficients:

a=-0.480090626569894
b=-2.25787872135774e-08

That is an offset of -480.09ms and slope of -22.58ns/s (which is also -22.58 ppb frequency error).
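The unit conversion is easy to verify, and the slope can be cross-checked against the measured start and end offsets quoted earlier. The sketch below is Python rather than R, and it assumes the run lasted exactly 50 days (the text only says "approximately"), so the cross-check is rough.

```python
SECONDS_PER_DAY = 86_400

a = -0.480090626569894      # intercept of the linear fit, in seconds
b = -2.25787872135774e-08   # slope of the linear fit, in seconds per second

offset_ms = a * 1e3                            # intercept in milliseconds
slope_ppb = b * 1e9                            # 1 ns/s is one part in 10^9
drift_per_day_ms = b * SECONDS_PER_DAY * 1e3   # drift accumulated per day

# Cross-check: average frequency error implied by the measured endpoints,
# ASSUMING exactly 50 days elapsed between them.
start_ms, end_ms = -486.5168, -584.7671
avg_ppb = (end_ms - start_ms) * 1e-3 / (50 * SECONDS_PER_DAY) * 1e9

print(round(offset_ms, 2), round(slope_ppb, 2), round(drift_per_day_ms, 2))
print(round(avg_ppb, 2))
```

Under that assumption the endpoint estimate comes out within a fraction of a ppb of the fitted slope, which is what one would expect if the linear term dominates.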

Graphically, this is what the line looks like when overlaid on the measured data:

Not bad but also not great. Here is the difference between the two:

Put another way, this is the PPS offset from UTC if we correct for time offset (a) and a frequency error (b). The linear model clearly doesn’t handle the structure in the data completely. The residual is in the low single-digit milliseconds. We can do better, so let’s try to add another term.

Quadratic

lm produced these coefficients for a degree 2 polynomial:

a=-0.484064700277606
b=-1.75349684277379e-08
c=-1.10412099841665e-15

Visually, this fits the data much better. It’s a little wrong on the ends, but overall quite nice. Even the residual (below) is smaller—almost completely confined to less than 1 millisecond.

a is still time offset, b is still frequency error, and c is a time “acceleration” of sorts.

There is still very visible structure to the residual, so let’s add yet another term.

Cubic

As before, lm yielded the coefficients. This time they were:

a=-0.485357232306569
b=-1.44068934233748e-08
c=-2.78676248986831e-15
d=2.45563844387287e-22

That’s really close looking!

The residual still has a little bit of a wave to it, but almost all the data points are within 500 microseconds. I think that’s sufficiently close given just how much non-deterministic “stuff” (both hardware and software) there is between a serial port and an OS kernel’s interrupt handler on a modern server. (In theory, we could add additional terms forever until we completely eliminated the residual.)

So, we have a model of what happened to the PPS offset over time. Specifically, a + bt + ct² + dt³ and the 4 constants. The offset (a of approximately -485ms) is easily explained: I started the PPS at the “wrong” time. The frequency error (b of approximately -14.4 ppb) can be explained by my not tuning the oscillator to exactly 10 MHz. (More accurately, I tuned it, unplugged it, moved it to my server, and plugged it back in. The slightly different environment could produce a few ppb error.)
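As a sketch of how the model can be used, the cubic can be evaluated at any time within the measured span, and its derivative gives the instantaneous frequency error. The code is Python rather than R, the function names are illustrative, and the fit should not be trusted outside the roughly 50 days of data.

```python
# Coefficients of the cubic fit quoted above (t in seconds since the first sample).
A = -0.485357232306569
B = -1.44068934233748e-08
C = -2.78676248986831e-15
D = 2.45563844387287e-22


def pps_offset_s(t):
    """Modeled PPS offset in seconds: a + b*t + c*t^2 + d*t^3 (Horner form)."""
    return A + t * (B + t * (C + t * D))


def freq_error_ppb(t):
    """Instantaneous frequency error: derivative b + 2*c*t + 3*d*t^2, in ppb."""
    return (B + 2 * C * t + 3 * D * t * t) * 1e9


one_day = 86_400
for day in (0, 10, 25):
    t = day * one_day
    print(day, pps_offset_s(t), freq_error_ppb(t))
```

At t=0 the two functions return a and b (scaled to ppb) directly, which matches the interpretation of those constants as the initial time offset and initial frequency error.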

What about the c and d terms? They account for a combination of a lot of things. Temperature is a big one. First of all, it is a home server and so it is subject to air-conditioner cycling on and off at a fairly long interval. This produces sizable swings in temperature, which in turn mess with the frequency. A server in a data center sees much less temperature variation, since the chillers keep the temperature essentially constant (at least compared to homes). Second, the oscillator was behind the server and I expect the temperature to slightly vary based on load.

One could no doubt do more analysis (and maybe at some point I will), but this post is already getting way too long.

Conclusion

One can go nuts trying to play with time and time synchronization. This is my first attempt at timekeeping-related circuitry, so I’m sure there are ways to improve the circuit or the analysis.

I think this experiment was a success. The system clock behavior improved beyond what’s needed for a general purpose server. Getting under 20 ppb error from a simple circuit on a breadboard with absolutely no control loop is great. I am, of course, already tinkering with various ideas that should improve the performance.

Tribblix image structural changes The Trouble with Tribbles...

The Tribblix live ISO and related images are put together ever so slightly differently in the latest m34 release.

All along, there's been an overlay (think a group package) called base-iso that lists the packages that are present in the live image. On installation, this is augmented with a few extra packages that you would expect to be present in a running system but which don't make much sense in a live image, to construct the base system.

You can add additional software, but the base is assumed to be present.

The snag with this is that base-iso is very much a single-purpose generic concept. By its very nature it has to be minimal enough to not be overly bloated, yet contain as many drivers as necessary to handle the majority of systems.

As such, the regular ISO image has fallen between 2 stools - it doesn't have every single driver, so some systems won't work, while it has a lot of unnecessary drivers for a lot of common use cases.

So what I've done is split base-iso into 2 layers. There's a new core-tribblix overlay, which is the common packages, and then base-iso adds all the extra drivers. By and large, the regular live image for m34 isn't really any different to what was present before.

But the concepts of "what packages do I need for applications to work" and "what packages do I want to load on a given downloadable ISO" have now been split.

What this allows is to easily create other images with different rules. As of m34, for example, the "minimal" image is actually created from a new base-server overlay, which again sits atop core-tribblix and differs from base-iso in that it has all the FC drivers. If you're installing on a fibre-channel connected system then using the minimal image will work better (and if you're SAN-booted, it will work where the regular ISO won't).

The next use case is that images for cloud or virtual systems simply don't need most of the drivers. This cuts out a lot of packages (although it doesn't actually save that much space).

The standard Tribblix base system now depends on core-tribblix, not base-iso or any of the specific image layers. This is as it should be - userland and applications really shouldn't care what drivers are present.

One side-effect of this change is that it makes minimising zones easier, because what gets installed in a zone can be based on that stripped-down core-tribblix overlay.

Engineering a culture Oxide Computer Company Blog

We ran into an interesting issue recently. On the one hand, it was routine: we had a bug — a regression — and the team quickly jumped on it, getting it root caused and fixed. But on the other, this particular issue was something of an Oxide object lesson, representative not just of the technologies but also of the culture we have built here. I wasn’t the only person who thought so, and two of my colleagues wrote terrific blog entries with their perspectives:

The initial work as described by Matt represents a creative solution to a thorny problem; if it’s clear in hindsight, it certainly wasn’t at the time! (In Matt’s evocative words: "One morning, I had a revelation.") I first learned of Matt’s work when he demonstrated it during our weekly Demo Friday, an hour-long unstructured session to demo our work for one another. Demo Friday is such an essential part of Oxide’s culture that it feels like we have always done it, but in fact it took us nearly two years into the company’s life to get there: over the spring and summer of 2021, our colleague Sean Klein had instituted regular demos for the area that he works on (the Oxide control plane), and others around the company — seeing the energy that came from it — asked if they, too, could start regular demos for their domain. But instead of doing it group by group, we instituted it company-wide starting in the fall of 2021: an unstructured hour once a week in which anyone can demo anything.

In the years since, we have had demos of all scopes and sizes. Importantly, no demo is too small — and we have often found that a demo that feels small to someone in the thick of work will feel extraordinary to someone outside of it. ("I have a small demo building on the work of a lot of other people" has been heard so frequently that it has become something of an inside joke.) Demo Friday is important because it gets to one of our most important drivers as technologists: the esteem of our peers. The thrill that you get from showing work to your colleagues is unparalleled — and their wonderment in return is uniquely inspiring. (Speaking personally, Matt’s demo addressed a problem that I had personally had many times over in working on Hubris — and I was one of the many w00ts in the chat, excited to see his creative solution!)

Having the demos be company-wide has also been a huge win for not just our shared empathy and teamwork but also our curiosity and versatility: it’s really inspiring to have (say) one colleague show how they used PCB backdrilling for signal integrity, and the next show an integration they built using Dropshot between our CRM and spinning up a demonstration environment for a customer. And this is more than just idle intellectual curiosity: our stack is deep — spanning both hardware and software — and the demos make for a fun and engaging way to learn about aspects of the system that we don’t normally work on.

Returning to Matt and Cliff, if Matt’s work implicitly hits on aspects of our culture, Cliff’s story of debugging addresses that culture explicitly, noting that the experience demonstrated:

Tight nonhierarchical integration of the team. This isn’t a Hubris feature, but it’s hard to separate Hubris from the team that built it. Oxide’s engineering team has essentially no internal silos. Our culture rewards openness, curiosity, and communication, and discourages defensiveness, empire-building, and gatekeeping. We’ve worked hard to create and defend this culture, and I think it shows in the way we organized horizontally, across the borders of what other organizations would call teams, to solve this mystery.

In the discussion on Hacker News of Cliff’s piece, this cultural observation stood out, with a commenter asking:

I’d love to hear more about the motivations for crafting such a culture as well as some particular implementation details. I’m curious if there are drawbacks to fostering "openness, curiosity, and communication" within an organization?

The culture at Oxide is in fact very deliberate: when starting a company, one is building many things at once (the team, the product, the organization, the brand) — and the culture will both inform and be reinforced by all of these. Setting that first cultural cornerstone was very important to us — starting with our mission, principles, and values. Critically, by using our mission, principles, and values as the foundation for our hiring process, we have deliberately created a culture that reinforces itself.

Some of the implementation details:

  • We have uniform compensation (even if it might not scale indefinitely)

  • We are writing-intensive (but we still believe in spoken collaboration)

  • We have no formalized performance review process (but we believe in feedback)

  • We record every meeting (but not every conversation)

  • We have a remote work force (but we also have an office)

  • We are non-hierarchical (but we all ultimately report to our CEO)

  • We don’t use engineering metrics (but we all measure ourselves by our customers and their success)

If it needs to be said, there is plenty of ambiguity: if you are using absolutes to think of Oxide (outside of our principles of honesty, integrity and decency!) you are probably missing some nuance of our culture.

Finally, to the (seemingly loaded?) question of the "drawbacks" of fostering "openness, curiosity, and communication" within an organization, the only drawback is that it’s hard work: culture has to be deliberate without being overly prescriptive, and that can be a tricky balance. In this regard, building a culture is very different than building (say) software: it is not engineered in a traditional sense, but is rather a gooey, squishy, organism that will evolve over time. But the reward of the effort is something that its participants care intensely about: it will continue to be (in Cliff’s words) a culture that we work hard to not just create but defend!

OmniOS is not affected by CVE-2024-3094 OmniOS Community Edition

Yesterday we learned of a supply chain back door in the xz-utils software via an announcement at https://www.openwall.com/lists/oss-security/2024/03/29/4. The vulnerability was distributed with versions 5.6.0 and 5.6.1 of xz and has been assigned CVE-2024-3094.

OmniOS is NOT affected by CVE-2024-3094

The malicious code is only present in binary artefacts if the build system is Linux (and there are some additional constraints too) and if the system linker is GNU ld – neither of which are true for our packages. The payload is also a Linux ELF binary which would not successfully link into code built for OmniOS, and requires features which are only present in the GNU libc.

We have also only ever shipped xz-utils 5.6.x as part of the unstable bloody testing release; stable releases contain older versions:

  • r151038 ships version 5.2.6
  • r151046 ships version 5.4.2
  • r151048 ships version 5.4.4
  • bloody ships version 5.6.1
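The version comparison in the list above can be written as a short check. This is a Python sketch with illustrative names, not an OmniOS tool. Note that it looks at the version number only: a matching version does not mean a build is affected, since the back door also depends on the build system and linker conditions described above.

```python
# Versions named in the announcement as carrying the malicious code.
BACKDOORED_VERSIONS = {(5, 6, 0), (5, 6, 1)}

# xz versions per release, as listed above.
SHIPPED = {
    "r151038": "5.2.6",
    "r151046": "5.4.2",
    "r151048": "5.4.4",
    "bloody": "5.6.1",
}


def parse_version(text):
    """Turn a dotted version string such as '5.6.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def has_backdoored_version_number(text):
    """True if the version NUMBER is 5.6.0 or 5.6.1; says nothing about the build."""
    return parse_version(text) in BACKDOORED_VERSIONS


flagged = [rel for rel, ver in SHIPPED.items() if has_backdoored_version_number(ver)]
print(flagged)  # ['bloody']
```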

Despite being unaffected, we have now switched builds of xz in bloody to using the raw source archive, which does not contain the malicious injection code, and generating the autoconf files ourselves. We have not downgraded to an earlier version as it is not clear which earlier version can be considered completely safe given that the perpetrator has been responsible for maintaining and signing releases back to version 5.4.3. Once a cleaned 5.6.2 release is available, we will upgrade to that.


Any problems or questions, please get in touch.

Disabling Monospaced Font Ligatures Josef "Jeff" Sipek

A recent upgrade of FreeBSD on my desktop resulted in just about every program (Firefox, KiCAD, but thankfully not urxvt) rendering various ligatures even for monospaced fonts. Needless to say, this is really annoying when looking at code, etc. Not having any better ideas, I asked on Mastodon if anyone knew how to turn this mis-feature off.

About an hour later, @monwarez@bsd.cafe suggested dropping the following XML in /usr/local/etc/fonts/conf.avail/29-local-noto-mono-fixup.conf and adding a symlink in ../conf.d to enable it:

<?xml version="1.0"?>
<!DOCTYPE fontconfig SYSTEM "urn:fontconfig:fonts.dtd">
<fontconfig>
        <description>Disable ligatures for monospaced fonts to avoid ff, fi, ffi, etc. becoming only one character wide</description>
        <match target="font">
                <test name="family" compare="eq">
                        <string>Noto Sans Mono</string>
                </test>
                <edit name="fontfeatures" mode="append">
                        <string>liga off</string>
                        <string>dlig off</string>
                </edit>
        </match>
</fontconfig>

This solved my problem. Hopefully this will help others. If not, it’s a note-to-self for when I need to reapply this fixup :)

Moore's Scofflaws Oxide Computer Company Blog

Years ago, Jeff Bezos famously quipped that "your margin is my opportunity." This was of course aimed not at Amazon’s customers, but rather its competitors, and it was deadly serious: customers of AWS in those bygone years will fondly remember that every re:Invent brought with it another round of price cuts. This era did not merely reflect Bezos’s relentless execution, but also a disposition towards who should reap the reward of advances in underlying technology: Amazon believed (if implicitly) that improvements at the foundations of computing (e.g., in transistor density, core count, DRAM density, storage density, etc.) should reflect themselves in lower prices for consumers rather than higher margins for suppliers.

Price cuts are no longer a re:Invent staple, having been replaced by a regular Amazon tradition of a different flavor: cutting depreciation (and therefore increasing earnings) by extending the effective life of their servers. (These announcements are understandably much more subdued, as "my depreciation is my margin opportunity" doesn’t have quite the same ring to it.)

As compute needs have grown and price cuts have become an increasingly distant memory, some have questioned their sky-high cloud bills, wondering if they should in fact be owning their compute instead of renting it. When we started Oxide, we knew from operating our own public cloud what those economics looked like — and we knew that over time others of a particular scale would come to the same realization that they would be better off not giving their margin away by renting compute. (Though it’s safe to say that we did not predict that it would be DHH leading the charge!)

Owning one’s own cloud sounds great, but there is a bit that’s unsaid: what about the software? Software is essential for elastic, automated infrastructure: hardware alone does not a cloud make! Unfortunately, the traditional server vendors do not help here: because of a PC-era divide in how systems are delivered, customers are told to look elsewhere for any and all system software. This divide is problematic on several levels. First, it impedes the hardware/software co-design that we (and, famously, others!) believe is essential to deliver the best possible product. Second, it leads to infamous finger pointing when the whole thing doesn’t work. But there is also a thorny economic problem: when your hardware and your software don’t come from the same provider, to whom should go the spoils of better hardware?

To someone who has just decided to buy their hardware out of their frustration with renting it, the answer feels obvious: whoever owns the hardware should naturally benefit from its advances! Unfortunately, the enterprise software vendor delivering your infrastructure often has other ideas — and because their software is neither rented nor bought, but rather comes from the hinterlands of software licensing, they have broad latitude as to how it is priced and used. In particular, this allows them to charge based on the hardware that you run it on — to have per-core software licensing.

This galling practice isn’t new (and is in fact as old as symmetric multiprocessing systems), but it has taken on new dimensions in the era of chiplets and packaging innovation: the advances that your next CPU has over your current one are very likely to be expressed in core count. Per-core licensing allows a third party — who neither made the significant investment in developing the next generation of microprocessor nor paid for the part themselves — to exact a tax on improved infrastructure. (And this tax can be shockingly brazen!) Couple this with the elimination of perpetual licensing, and software costs can potentially absorb the entire gain from a next-generation CPU, leaving a disincentive to run newer, more efficient infrastructure. As an industry, we have come to accept this practice, but we shouldn’t: in the go-go era of Dennard scaling (when clock rates rose at a blistering rate), software vendors never would have been allowed to get away with charging by the gigahertz; we should not allow them to feel so emboldened to charge by core count now!

If it needs to be said, we have taken a different approach at Oxide: when you buy the Oxide cloud computer, all of the software to run it is included. This includes all of the software necessary to run the rack as elastic infrastructure: virtual compute, virtual storage, virtual networking. (And yes, it’s all open source — which unfortunately demands the immediate clarification that it’s actually open source rather than pretend open source.) When we add a new feature to our software, there is no licensing enablement or other such nuisance — the feature just comes with the next update. And what happens when AMD releases a new CPU with twice the core count? The new sled running the new CPU runs along your existing rack — you’re not paying more than the cost of the new sled itself. This gives the dividends of Moore’s Law (or Wright’s Law!) to whom they rightfully belong: the users of compute.

The SunOS JDK builder The Trouble with Tribbles...

I've been building OpenJDK on Solaris and illumos for a while.

This has been moderately successful; illumos distributions now have access to up-to-date LTS releases, most of which work well. (At least 11 and 17 are fine; 21 isn't quite right.)

There are even some third-party collections of my patches, primarily for Solaris (as opposed to illumos) builds.

I've added another tool: the SunOS jdk builder.

The aim here is to be able to build every single jdk tag, rather than going to one of the existing repos which only have the current builds. And, yes, you could grope through the git history to get to older builds, but one problem with that is that you can't actually fix problems with past builds.

Most of the content is in the jdk-sunos-patches repository. Here there are patches for both illumos and Solaris (they're ever so slightly different) for every tag I've built.

(That's almost every jdk tag since the Solaris/SPARC/Studio removal, and a few before that. Every so often I find I missed one. And there's been the odd bad patch along the way.)

The idea here is to make it easy to build every tag, and to do so on a current system. I've had to add new patches to get some of the older builds to work. The world has changed, we have newer compilers and other tools, and the OS we're building on has evolved. So if someone wanted to start building the jdk from scratch (and remember that you have to build all the versions in sequence) then this would be useful.

I'm using it for a couple of other things.

One is to put back SPARC support on illumos and Solaris. The initial port I did was on x86 only, so I'm walking through older builds and getting them to work on SPARC. We'll almost certainly not get to jdk21, but 17 seems a reasonable target.

The other thing is to enable the test suites, and then run them, and hopefully get them clean. At the moment they aren't, but a lot of that is because many tests are OS-specific and they don't know what Solaris is so get confused. With all the tags, I can bisect on failures and (hopefully) fix them.