Installing MariaDB in a sparse zone OmniOS Community Edition

I have recently done some work on improving the MariaDB 10.4 package that is part of the OmniOS extra package repository, to add more features and to make it easier to deploy. Part of that work involved adding support for socket authentication, which makes the default installation more secure.

Here’s a walk-through of creating a sparse zone on OmniOS r151032 and then installing MariaDB within it. Commands issued within the global zone are shown with a prompt of gz#, and those within the sparse zone itself with a prompt of root@database:~#, where database is the name of the zone.

Prerequisites

Before you can create a sparse zone, the zone brand must be installed and you’ll need a ZFS dataset to act as a zone container. If you’ve used zones before, you might already have these in place.

        gz# pkg install zones brand/sparse
        gz# zfs create -o mountpoint=/zones rpool/zones

Zone creation

In this example I am attaching a VNIC for the zone to an Etherstub called switch10. If you just want to attach it to a global zone NIC, then you can specify global-nic=auto and it will usually do the right thing.
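
This walk-through assumes the etherstub already exists; if it does not, it can be created in the global zone first (using the switch10 name from above):

        gz# dladm create-etherstub switch10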

        gz# zonecfg -z database
database: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:database> create -t sparse
zonecfg:database> set zonepath=/zones/database
zonecfg:database> add net
zonecfg:database:net> set physical=database0
zonecfg:database:net> set global-nic=switch10
zonecfg:database:net> set allowed-address=172.27.10.7/24
zonecfg:database:net> set defrouter=172.27.10.254
zonecfg:database:net> end
zonecfg:database> add attr
zonecfg:database:attr> set name=resolvers
zonecfg:database:attr> set type=string
zonecfg:database:attr> set value=1.1.1.1
zonecfg:database:attr> end
zonecfg:database> add attr
zonecfg:database:attr> set name=domain-name
zonecfg:database:attr> set type=string
zonecfg:database:attr> set value=omnios.org
zonecfg:database:attr> end
zonecfg:database> verify
zonecfg:database> exit
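
At this point the zone is configured but not yet installed, which can be confirmed from the global zone:

        gz# zoneadm list -cv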

Zone installation

        gz# zoneadm -z database install
A ZFS file system has been created for this zone.

       Image: Preparing at /zones/database/root.
Sanity Check: Looking for 'entire' incorporation.
   Publisher: Using omnios (https://pkg.omnios.org/r151032/core).
   Publisher: Using extra.omnios (https://pkg.omnios.org/r151032/extra/).
       Cache: Using /var/pkg/publisher.
  Installing: Packages (output follows)
Packages to install: 200
Mediators to change:   4
 Services to change:   6

DOWNLOAD                                PKGS         FILES    XFER (MB)   SPEED
Completed                            200/200     1476/1476      4.9/4.9  1.1k/s

PHASE                                          ITEMS
Installing new actions                     5869/5869
Updating package state database                 Done
Updating package cache                           0/0
Updating image state                            Done
Creating fast lookup database                   Done
 Postinstall: Copying SMF seed repository ... done.
        Done: Installation completed in 56.395 seconds.

Zone boot

        gz# zoneadm -z database boot
gz# zlogin database

Wait for the initial boot to complete by checking the output of the svcs -x command. Once this command returns no output, the zone is fully up.
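
If you’d rather wait in a loop than re-run the command by hand, something like this works (a small convenience sketch, not part of the package documentation):

        root@database:~# while svcs -x | grep -q .; do sleep 5; done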

Check IP connectivity:

        root@database:~# ipadm
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
database0/_a      from-gz  ok           172.27.10.7/24
lo0/v6            static   ok           ::1/128

root@database:~# ping google.com
google.com is alive

MariaDB installation

        root@database:~# pkg list -a '*mariadb*'
NAME (PUBLISHER)                                  VERSION                    IFO
ooce/database/mariadb-103 (extra.omnios)          10.3.21-151032.0           ---
ooce/database/mariadb-104 (extra.omnios)          10.4.11-151032.0           ---
        root@database:~# pkg install mariadb-104
           Packages to install:  2
           Mediators to change:  1
            Services to change:  3
       Create boot environment: No
Create backup boot environment: No

Release Notes:

  --------------------------
  MariaDB Installation Notes
  --------------------------

  When the mariadb service is started for the first time, an initial
  database will be set up and two all-privilege accounts will be created.

  One is root@localhost, it has no password, but you need to
  be system 'root' user to connect. Use, for example, 'sudo mysql'

  The second is mysql@localhost, it has no password either, but
  you need to be the system 'mysql' user to connect.

  You may wish to review the default configuration file at
  /etc/opt/ooce/mariadb-<version>/my.cnf before starting the service
  for the first time.

  --------------------------


DOWNLOAD                                PKGS         FILES    XFER (MB)   SPEED
Completed                                2/2       694/694    52.0/52.0  5.8M/s

PHASE                                          ITEMS
Installing new actions                       991/991
Updating package state database                 Done
Updating package cache                           0/0
Updating image state                            Done
Creating fast lookup database                   Done
Updating package cache                           3/3

Start the database and connect

        root@database:~# svcadm enable mariadb104
root@database:~# mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 8
Server version: 10.4.11-MariaDB OmniOS MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> select current_user() from dual;
+----------------+
| current_user() |
+----------------+
| root@localhost |
+----------------+
1 row in set (0.000 sec)

Socket authentication is in use by default, which can be checked by verifying that root has an invalid (non-matchable) password hash.

        MariaDB [(none)]> select user, password from mysql.user where user != '';
+-------+----------+
| User  | Password |
+-------+----------+
| root  | invalid  |
| mysql | invalid  |
+-------+----------+
2 rows in set (0.001 sec)
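
As the release notes above explain, the unix_socket plugin authenticates a client by the operating-system user on the other end of the socket, so a matching system account can connect without a password. As a quick illustration, the second all-privilege account can be exercised via sudo (assuming sudo is installed, as the release notes suggest):

        root@database:~# sudo -u mysql mysql -e 'select current_user()'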

Multithreaded Rust on Threadripper Lice!

I recently ran some benchmarks on a Threadripper 3960X system and the results surprised me quite a bit. Simplified: the throughput the benchmark recorded dropped from 341 MB/s on a MacBook Pro to 136 MB/s on the Threadripper desktop. I had previously read Daniel Lemire’s notes on the suboptimal performance of simdjson on Zen 2 (simdjson is heavily used in the benchmark), but the drop suggested there was a few percent, not half.

Long story short, this made me curious about what caused it. First stop: perf.

[Figure: perf output, with crossbeam_channel recv at the top]

Notice the first item? It is crossbeam_channel::flavors::array::Channel<T>::recv. Oh my, I had never seen that one hogging so much CPU time; in fact, we spend more time receiving from the channel than we spend parsing or serializing JSON!

Let’s add a bit of Threadripper trivia: the design AMD went with splits the CPU from a single piece of silicon into multiple small dies, which they call CCDs, each of which in turn consists of two CCXs containing the cores and the level 1-3 caches. So let’s look at another thing, htop (a trusty little tool to show our load):

[Figure: htop showing the load on each core]

In this screenshot we can spot one thread that seems to be running on the 5th core, one on the 16th, and one on the 19th and 20th. Thinking back to the Threadripper’s design this is a bit of a hint: those cores are on different CCXs, and even further, on different CCDs. So what happens if they all run on the same CCX?

Boom, 400+ MB/s! taskset -c 0,1,2 does the trick; that’s a really nice improvement, and looking at the perf output we can see recv move from nearly 11% of CPU time to 7.28%. Not only is this nearly 3x faster than the first benchmark, it is also 20% faster than the laptop. So far so good.

[Figure: perf output with the threads pinned to cores 0-2]

But this still leaves the question of why it happens, and whether we can do something about it. Enter a little benchmark that measures the cost of sending between pairs of cores; let’s look at what it puts out for the first core (it’s a lot of output otherwise).
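
For reference, here is a minimal sketch of what such a benchmark can look like. This is not the exact code behind the numbers below; it assumes the core_affinity and crossbeam-channel crates, pins the sender to one core and the receiver to another, and reports the mean time per send:

    use std::time::Instant;

    // Mean time per send between two pinned cores (hypothetical helper).
    fn bench_pair(a: usize, b: usize) -> u128 {
        let cores = core_affinity::get_core_ids().expect("no core ids");
        let (core_a, core_b) = (cores[a], cores[b]);
        let (tx, rx) = crossbeam_channel::bounded::<u64>(64);

        let receiver = std::thread::spawn(move || {
            core_affinity::set_for_current(core_b);
            while rx.recv().is_ok() {} // drain until the sender hangs up
        });

        core_affinity::set_for_current(core_a);
        const N: u64 = 100_000;
        let start = Instant::now();
        for i in 0..N {
            tx.send(i).unwrap();
        }
        drop(tx); // close the channel so the receiver exits
        receiver.join().unwrap();
        start.elapsed().as_micros() / N as u128
    }

    fn main() {
        let n = core_affinity::get_core_ids().unwrap().len();
        for other in 1..n {
            println!("B 0 - {:2}: {}us/send", other, bench_pair(0, other));
        }
    }

Its output for core 0 against every other core looks like this: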

B 0 -  0: -
B 0 -  1: 818us/send
B 0 -  2: 673us/send
B 0 -  3: 2839us/send
B 0 -  4: 2421us/send
B 0 -  5: 2816us/send
B 0 -  6: 3466us/send
B 0 -  7: 3634us/send
B 0 -  8: 3267us/send
B 0 -  9: 3042us/send
B 0 - 10: 3633us/send
B 0 - 11: 3535us/send
B 0 - 12: 3334us/send
B 0 - 13: 3443us/send
B 0 - 14: 3348us/send
B 0 - 15: 3398us/send
B 0 - 16: 3459us/send
B 0 - 17: 3108us/send
B 0 - 18: 3287us/send
B 0 - 19: 3393us/send
B 0 - 20: 3369us/send
B 0 - 21: 3248us/send
B 0 - 22: 3290us/send
B 0 - 23: 3323us/send

B 0 - 24: 487us/send
B 0 - 25: 812us/send
B 0 - 26: 676us/send
B 0 - 27: 2859us/send
B 0 - 28: 2853us/send
B 0 - 29: 2864us/send
B 0 - 30: 3475us/send
B 0 - 31: 3620us/send
B 0 - 32: 3582us/send
B 0 - 33: 3497us/send
B 0 - 34: 3524us/send
B 0 - 35: 3488us/send
B 0 - 36: 3331us/send
B 0 - 37: 3303us/send
B 0 - 38: 3365us/send
B 0 - 39: 3333us/send
B 0 - 40: 3324us/send
B 0 - 41: 3363us/send
B 0 - 42: 3554us/send
B 0 - 43: 3351us/send
B 0 - 44: 3207us/send
B 0 - 45: 3240us/send
B 0 - 46: 3377us/send
B 0 - 47: 3275us/send

First things first: the numbers here are 0-indexed, unlike in htop where they’re 1-indexed, so core 0 here means core 1 in htop. The test runs for only a second per core combination (it goes through all the cores and would otherwise take a really long time), so some variation is to be expected. Sends get really slow really fast as we move away from core 0. We can see that cores 24-47 are the SMT siblings of the physical cores 0-23, so 24 is the second hardware thread on core 0. The second observation is that cores 0-2 are in the same CCX, where performance is reasonably fast. Cores 3-5 seem to be on the same CCD, and so on.

Let’s look at the code for the crossbeam channel. The interesting part is that both head and tail are wrapped in CachePadded. Fortunately I have a friend who keeps going on about false sharing whenever performance becomes a topic, so that was a really good hint here. Looking through the struct, aligning head and tail to cache lines makes a lot of sense: they’re frequently accessed from both sides of the queue. But there is another part that’s frequently used on both sides: the buffer, which is just an array of T, so it might not align well to the cache. In other words, if we access buffer[x] we might invalidate buffer[x-1] or buffer[x+1] (or more). So what happens if we wrap the elements in a CachePadded as well? (A sketch of the idea follows after the numbers below.) The result looks quite nice: latency is cut down by 50% when going over CCX boundaries:

B 0 -  0: -
B 0 -  1: 630us/send
B 0 -  2: 678us/send
B 0 -  3: 1319us/send
B 0 -  4: 1256us/send
B 0 -  5: 1291us/send
B 0 -  6: 1438us/send
B 0 -  7: 1504us/send
B 0 -  8: 1525us/send
B 0 -  9: 1660us/send
B 0 - 10: 1772us/send
B 0 - 11: 1807us/send
B 0 - 12: 1382us/send
B 0 - 13: 1380us/send
B 0 - 14: 1387us/send
B 0 - 15: 1375us/send
B 0 - 16: 1382us/send
B 0 - 17: 1383us/send
B 0 - 18: 1471us/send
B 0 - 19: 1471us/send
B 0 - 20: 1463us/send
B 0 - 21: 1462us/send
B 0 - 22: 1468us/send
B 0 - 23: 1457us/send

B 0 - 24: 466us/send
B 0 - 25: 619us/send
B 0 - 26: 671us/send
B 0 - 27: 1438us/send
B 0 - 28: 1422us/send
B 0 - 29: 1514us/send
B 0 - 30: 1789us/send
B 0 - 31: 1688us/send
B 0 - 32: 1812us/send
B 0 - 33: 1820us/send
B 0 - 34: 1719us/send
B 0 - 35: 1797us/send
B 0 - 36: 1383us/send
B 0 - 37: 1364us/send
B 0 - 38: 1373us/send
B 0 - 39: 1383us/send
B 0 - 40: 1370us/send
B 0 - 41: 1390us/send
B 0 - 42: 1468us/send
B 0 - 43: 1467us/send
B 0 - 44: 1464us/send
B 0 - 45: 1463us/send
B 0 - 46: 1475us/send
B 0 - 47: 1467us/send
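
To make the shape of that change concrete, here is a rough sketch of the data layout; this is a simplified illustration under my own naming, not the actual crossbeam source, and it assumes the crossbeam-utils crate for CachePadded:

    use crossbeam_utils::CachePadded;
    use std::sync::atomic::AtomicUsize;

    // One ring-buffer slot: a stamp that coordinates senders and
    // receivers, plus the element itself (simplified).
    struct Slot<T> {
        stamp: AtomicUsize,
        value: Option<T>,
    }

    struct Channel<T> {
        head: CachePadded<AtomicUsize>, // read position, on its own cache line
        tail: CachePadded<AtomicUsize>, // write position, on its own cache line
        // Before: Box<[Slot<T>]> -- adjacent slots can share a cache line,
        // so writing slot x can invalidate the line holding x-1 or x+1.
        // After: pad every slot so each occupies at least one full line.
        buffer: Box<[CachePadded<Slot<T>>]>,
    }

    fn main() {
        // CachePadded pads and aligns to the target's cache-line size,
        // typically 64 or 128 bytes.
        println!("{} bytes per padded slot", std::mem::size_of::<CachePadded<u64>>());
    }

The trade-off is memory: padding every slot multiplies the buffer’s footprint, which is presumably why this isn’t the default layout.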

With all of this, the code went from 136 MB/s to over 150 MB/s when not pinned to cores. While this isn’t close to where I’d like it to be, it is a 10% improvement in throughput. And looking at perf again, recv is completely gone from the list, which is nice!

[Figure: perf output after the CachePadded change]

This is the conclusion for now. If I have more interesting finds I’ll add a continuation, so I’ll keep digging.

hwi: illumos hardware info utility, inspired by inxi Minimal Solaris

There are a number of utilities to get hardware information in illumos, but none of them has convenient or complete output. I liked the inxi output on Linux and tried to reproduce something similar for illumos.


OmniOS Community Edition r151022ef, r151032h, r151030ah OmniOS Community Edition

OmniOS Community Edition weekly releases for w/c 23rd of December 2019 are now available.

For all supported OmniOS releases, OpenSSL 1.0 has been updated to 1.0.2u, which includes a security fix. This is expected to be the last update for the 1.0 series which reaches end of support on the 31st of December 2019.

OmniOS r151030 and above already include OpenSSL 1.1 as the default version, with 1.0 libraries delivered alongside for backwards compatibility. The currently selected default version can be checked using pkg mediator - check that 1.1 appears in the VERSION column.

# pkg mediator openssl
MEDIATOR VER. SRC. VERSION IMPL. SRC. IMPLEMENTATION
openssl  vendor    1.1     vendor

To change the default version to 1.1, if necessary, use:

# pkg unset-mediator openssl
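
Alternatively, a specific version can be selected explicitly; this records a local mediation rather than reverting to the vendor default:

# pkg set-mediator -V 1.1 openssl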

Additionally, for r151032 only:

  • The beta UEFI 2.7 firmware for bhyve has been updated and should now work with more systems. This firmware can be selected by setting the bootrom zone attribute to BHYVE_RELEASE-beta.
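
For example, for an existing bhyve zone (hypothetically named bhyve0 here), the attribute can be added with zonecfg in the usual way:

# zonecfg -z bhyve0
zonecfg:bhyve0> add attr
zonecfg:bhyve0:attr> set name=bootrom
zonecfg:bhyve0:attr> set type=string
zonecfg:bhyve0:attr> set value=BHYVE_RELEASE-beta
zonecfg:bhyve0:attr> end
zonecfg:bhyve0> verify
zonecfg:bhyve0> exit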

For further details, please see https://omniosce.org/releasenotes


Any problems or questions, please get in touch.

OmniOS Community Edition r151032e, r151030ae OmniOS Community Edition

OmniOS Community Edition weekly releases for w/c 2nd of December 2019 are now available.

The following updates are available for r151032 and r151030:

  • Update Intel CPU Microcode to 20191115.

  • Fixes to support for large (> 2TB) USB hard disks.

  • mpt_sas driver could hang after config header request timeout.

  • OpenJDK updated to 1.8.0_232-09.

Additionally, for r151032 only:

  • KVM zones could lose network connectivity to other zones on the same machine.

  • Improvements to support for recent Linux distributions in lx zones.

  • Fixes for zfs diff between encrypted datasets.

  • 8-bit colour modes did not work properly after boot.

  • Several updates and bug fixes for SMB.

  • make -C could cache wrong directory contents.

  • Fix (rare) crash if zone root cannot be mounted during boot.

For further details, please see https://omniosce.org/releasenotes

Any problems or questions, please get in touch.

The soul of a new computer company The Observation Deck

Over the summer, I described preparing for my next expedition. I’m thrilled to announce that the expedition is now plotted, the funds are raised, and the bags are packed: together with Steve Tuck and Jess Frazelle, we have started Oxide Computer Company.

Starting a computer company may sound crazy (and you would certainly be forgiven a double-take!), but it stems from a belief that I hold in my marrow: that hardware and software should each be built with the other in mind. For me, this belief dates back a quarter century: when I first came to Sun Microsystems in the mid-1990s, it was explicitly to work on operating system kernel development at a computer company — at a time when that very idea was iconoclastic. And when we started Fishworks a decade later, the belief in fully integrated software and hardware was so deeply rooted into our endeavor as to be eponymous: it was the “FISH” in “Fishworks.” In working at a cloud computing company over the past decade, economic realities forced me to suppress this belief to a degree — but it now burns hotter than ever after having endured the consequences of a world divided: in running a cloud, our most vexing problems emanated from the deepest bowels of the stack, when hardware and (especially) firmware operated at cross purposes with our systems software.

As I began to think about what was next, I was haunted by the pain and futility of trying to build a cloud with PC-era systems. At the same time, seeing the kinds of solutions that the hyperscalers had developed for themselves had always left me with equal parts admiration and frustration: their rack-level designs are a clear win — why are these designs cloistered among so few? And even in as much as the hardware could be found through admirable efforts like the Open Compute Project, the software necessary to realize its full potential has remained cruelly unavailable.

Alongside my inescapable technical beliefs has been a commercial one: even as the world is moving (or has moved) to elastic, API-driven computing, there remain good reasons to run on one’s own equipment! Further, as cloud-borne SaaS companies mature from being strictly growth focused to being more margin focused, it seems likely that more will consider buying machines instead of always renting them.

It was in the confluence of these sentiments that an idea began to take shape: the world needed a company to develop and deliver integrated, hyperscaler-class infrastructure to the broader market — that we needed to start a computer company. The “we” here is paramount: in Steve and Jess, I feel blessed to not only share a vision of our future, but to have diverse perspectives on how infrastructure is designed, built, sold, operated and run. And most important of all (with the emphasis itself being a reflection of hard-won wisdom), we three share deeply-held values: we have the same principled approach, with shared aspirations for building the kind of company that customers will love to buy from — and employees will be proud to work for.

Together, as we looked harder at the problem, we saw the opportunity more and more clearly: the rise of open firmware and the broadening of the Open Compute Project made this more technically feasible than ever; the sharpening desire among customers for a true cloud-like on-prem experience (and the neglect those customers felt in the market) made it more in demand than ever. With accelerating conviction that we would build a company to do this, we needed a name — and once we hit on Oxide, we knew it was us: oxides form much of the earth’s crust, giving a connotation of foundation; silicon, the element that is the foundation of all of computing, is found in nature in its oxide; and (yes!) iron oxide is also known as Rust, a programming language we see playing a substantial role for us. Were there any doubt, that Oxide can also be pseudo-written in hexadecimal — as 0x1de — pretty much sealed the deal!

There was just one question left, and it was an existential one: could we find an investor who saw what we saw in Oxide? Fortunately, the answer to this question was emphatic and unequivocal: in the incredible team at Eclipse Ventures, we found investors that not only understood the space and the market, but also the challenges of solving hard technical problems. And we are deeply honored to have Eclipse’s singular Pierre Lamond joining us on our board; we can imagine no better a start for a new computer company!

So while there is a long and rocky path ahead, we are at last underway on our improbable journey! If you haven’t yet, read Jess’s blog on Oxide being born in a garage. If you find yourself battling the problems we’re aiming to fix, please join our mailing list. If you are a technologist who feels this problem in your bones as we do, consider joining us. And if nothing else, if you would like to hear some terrific stories of life at the hardware/software interface, check out our incredible podcast On the Metal!