CPU and PCH Temperature Sensors in illumos Tales from a Core File

A while back, I did a bit of work that I’ve been meaning to come back to and write about. The first of these projects is all about making it easier to see the temperature of different parts of the system. In particular, I wanted to make sure that I could understand the temperature of the following different things:

  • Intel CPU Cores
  • Intel CPU Sockets
  • AMD CPUs
  • Intel Chipsets

While on some servers this data is available via IPMI, that doesn’t help you if you’re running a desktop or a laptop. Also, if the OS can know about this as a first class item, why bother going through IPMI to get at it? This is especially true as IPMI sometimes summarizes all of the different readings into a single one.

Seeing is Believing

Now, with these present, you can use fmtopo to see the sensor values. While fmtopo itself isn’t a great user interface, it’s a great building block for centralizing all of the different sensor information in the system. From here, we can build tooling on top of the fault management architecture (FMA) to better see and understand the different sensors. FMA will abstract all of the different sources. Some of them may be delivered by the kernel while others may be delivered by user land. With that in mind, let’s look at what this looks like on a small Kaby Lake NUC:

    [root@estel ~]# /usr/lib/fm/fmd/fmtopo -V *sensor=temp
TIME                 UUID
Aug 12 20:44:08 88c3752d-53c2-ea3a-c787-cbeff0578cd0

hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chip=0/core=0?sensor=temp
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chip=0/core=0?sensor=temp
  group: authority                      version: 1   stability: Private/Private
    product-id        string    NUC7i3BNH
    chassis-id        string    G6BN735007J5
    server-id         string    estel
  group: facility                       version: 1   stability: Private/Private
    sensor-class      string    threshold
    type              uint32    0x1 (TEMP)
    units             uint32    0x1 (DEGREES_C)
    reading           double    43.000000

hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chip=0/core=1?sensor=temp
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chip=0/core=1?sensor=temp
  group: authority                      version: 1   stability: Private/Private
    product-id        string    NUC7i3BNH
    chassis-id        string    G6BN735007J5
    server-id         string    estel
  group: facility                       version: 1   stability: Private/Private
    sensor-class      string    threshold
    type              uint32    0x1 (TEMP)
    units             uint32    0x1 (DEGREES_C)
    reading           double    49.000000

hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chip=0?sensor=temp
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chip=0?sensor=temp
  group: authority                      version: 1   stability: Private/Private
    product-id        string    NUC7i3BNH
    chassis-id        string    G6BN735007J5
    server-id         string    estel
  group: facility                       version: 1   stability: Private/Private
    sensor-class      string    threshold
    type              uint32    0x1 (TEMP)
    units             uint32    0x1 (DEGREES_C)
    reading           double    49.000000

hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chipset=0?sensor=temp
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chipset=0?sensor=temp
  group: authority                      version: 1   stability: Private/Private
    product-id        string    NUC7i3BNH
    chassis-id        string    G6BN735007J5
    server-id         string    estel
  group: facility                       version: 1   stability: Private/Private
    sensor-class      string    threshold
    type              uint32    0x1 (TEMP)
    units             uint32    0x1 (DEGREES_C)
    reading           double    46.000000

While it’s a little bit opaque, you might be able to see that we have four different temperature sensors here:

  • Core 0’s temperature sensor
  • Core 1’s temperature sensor
  • Package 0’s temperature sensor
  • The chipset’s temperature sensor (Intel calls this the Platform Controller Hub)

On an AMD system, the output is similar, but the sensor exists on a slightly different granularity than a per-core basis. We’ll come back to that a bit later.

Pieces of the Puzzle

To make all of this work, there are a few different pieces that we put together:

  1. Kernel device drivers to cover the Intel and AMD CPU sensors, and the Intel PCH
  2. A new, evolving, standardized way to export simple temperature sensors
  3. Support in FMA’s topology code to look for such sensors and attach them

We’ll work through each of these different pieces in turn. The first part of this was to write three new device drivers: one to cover each of the different cases that we cared about.

coretemp driver

The first of the drivers is the coretemp driver. This uses the temperature interface that was introduced on Intel Core family processors. It allows an operating system to read an MSR (Model Specific Register) to determine what the temperature is. The support for this is indicated by a bit in one of the CPUID registers and exists on almost every Intel CPU that has come out since the Intel Core Duo.

Around the time of Intel’s Haswell processor, Intel added another CPUID bit and MSR that indicate what the temperature is on each socket.

The coretemp driver has two interesting dynamics and problems:

  1. Because of the use of the rdmsr instruction, which reads a model-specific register, you can only read the temperature for the CPU that you’re currently executing on. This isn’t too hard to arrange in the kernel, but it means that when we read the temperature we’ll need to organize what’s called a ‘cross-call’. You can think of a cross-call as a remote procedure call, except that the target is a specific CPU in the system and not a remote host.

  2. Intel doesn’t actually directly encode the temperature in the MSRs. Technically, the value we read represents an offset from the processor’s maximum junction temperature, often abbreviated Tj Max. Modern Intel processors provide a means for us to read this directly via an MSR. However, older ones, unfortunately, do not. On such older processors, the Tj Max actually varies based on not just the processor family, but also the brand, so different processors running at different frequencies have different values. Some of this information can be found in various datasheets, but for the moment, we’ve only enabled this driver for CPUs for which we can guarantee the Tj Max value. If you have an older CPU and you’d like to see if we could manually enable it, please reach out.
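
To make that Tj Max handling concrete, here is a minimal sketch of the arithmetic on CPUs that support reading Tj Max directly. The rdmsr_on_cpu() helper is hypothetical and stands in for the cross-call described above; the MSR numbers and bit positions are the ones Intel documents for IA32_THERM_STATUS and MSR_TEMPERATURE_TARGET:

#include <stdint.h>

#define IA32_THERM_STATUS       0x19c
#define MSR_TEMPERATURE_TARGET  0x1a2

/* Hypothetical helper: cross-call to the CPU and read the MSR there. */
extern uint64_t rdmsr_on_cpu(int cpu, uint32_t msr);

int
core_temp_celsius(int cpu)
{
        /* Tj Max lives in bits 23:16 of MSR_TEMPERATURE_TARGET. */
        int tjmax = (rdmsr_on_cpu(cpu, MSR_TEMPERATURE_TARGET) >> 16) & 0xff;

        /* The digital readout (bits 22:16) is an offset below Tj Max. */
        int offset = (rdmsr_on_cpu(cpu, IA32_THERM_STATUS) >> 16) & 0x7f;

        return (tjmax - offset);
}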

pchtemp driver

The pchtemp driver is a temperature sensor driver for the Intel platform controller hub (PCH). The driver supports most Intel platforms since the Haswell generation, as the format of the sensor readings changed starting with the Haswell-era chipsets.

This driver is much simpler than the other ones. The PCH exposes a dedicated pseudo-PCI device for this purpose. The pchtemp driver simply attaches to that and reads the temperature when required. Unlike the coretemp driver, the offset in temperature is the same across all of the currently supported platforms so we don’t need to encode anything special there like we do for the coretemp driver.

amdf17nbdf driver

The amdf17nbdf driver is a bit of a mouthful. It stands for the AMD Family 17h North Bridge and Data Fabric driver. Let’s take that apart for a moment. On x86, CPUs are broken into different families to represent different generations. Currently, all of AMD’s Ryzen/EPYC processors that are based on the Zen microarchitecture are grouped under Family 17h. The North Bridge is a part of the CPU that is used to interface with various I/O components on the system. The Data Fabric is a part of AMD CPUs which connects CPUs, I/O devices, and DRAM.

On AMD Zen family processors, the temperature sensor exists on a per-die basis. Each die is a group of cores. The physical chip has a number of such units, each of which, in the original AMD Naples/Zen 1 processor, has up to 8 cores. See the illumos cpuid.c big theory statement for the details on how the CPU is constructed and for this terminology. Effectively, you can think of it as a number of different temperature sensors, one for each discrete group of cores.

To talk to the temperature sensor, we need to send a message on what AMD calls the ‘system management network’ (SMN). The SMN is used to connect different management devices together. The SMN can be used for a number of different purposes beyond just temperature sensors. The operating system has a dedicated means of communicating and making requests over the SMN by using the corresponding north bridge, which is exposed as a PCI device.

Just as the coretemp driver needs to issue the rdmsr instruction on the core whose temperature you want, the same is true here: each die has its own north bridge, and therefore we need to use the right instance to talk to the right group of CPUs.

The wrinkle with all of this is that the north bridge on its own doesn’t give us enough information to map it back to a group of cores that an operator sees. This is critical, since if you can’t tell which cores you’re getting the temperature reading for, it immediately becomes much less useful.

This is where the data fabric devices come into play. The data fabric devices exist at a rather specific PCI bus, device, and function. They are always defined to be on PCI bus 0. The data fabric device for the first die is always defined to be at device 18h. The next one is at 19h, and so on. This means that we have a deterministic way to map between a data fabric device and a group of cores. Now, that’s not enough on its own. While we know the data fabric, we don’t know how to map that to the north bridge.

Each north bridge in the system is always defined to be on its own PCI bus. The device we care about is always going to be device 0, function 0. The data fabric happens to have a register which defines for us the starting PCI bus for its corresponding north bridge. This means that we now have a path to get to the temperature sensor. For a given die, we can find the corresponding data fabric. From the data fabric, we can find the corresponding north bridge. And finally, from the north bridge, we can find the corresponding SMN (system management network) that we can communicate with.
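
Putting that walk into pseudo-C, the chain looks roughly like the sketch below. The pci_read() helper and the AMD_DF_NB_BUS_REG register name are hypothetical stand-ins; the device numbering is the one described above:

#include <stdint.h>

/* Hypothetical PCI config space read helper. */
extern uint32_t pci_read(uint8_t bus, uint8_t dev, uint8_t func, uint16_t reg);

#define AMD_DF_FIRST_DEV   0x18  /* die 0's data fabric device on bus 0 */
#define AMD_DF_NB_BUS_REG  0x00  /* hypothetical: register holding the NB bus */

void
find_nb_for_die(uint8_t die, uint8_t *nb_bus, uint8_t *nb_dev, uint8_t *nb_func)
{
        /* The data fabric for die N is always bus 0, device 18h + N. */
        uint8_t df_dev = AMD_DF_FIRST_DEV + die;

        /* The data fabric tells us the starting PCI bus of its north bridge. */
        *nb_bus = pci_read(0, df_dev, 0, AMD_DF_NB_BUS_REG) & 0xff;

        /* The north bridge itself is always device 0, function 0 on that bus. */
        *nb_dev = 0;
        *nb_func = 0;
}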

With all that, there’s one more wrinkle. On some processors, most notably the Ryzen and ThreadRipper families, the temperature that we read has an offset encoded with it. Unfortunately, these offsets have only been documented in blog posts by AMD and not in the formal documentation. Still, it’s not too hard to take these offsets into account in the driver, and we can update them if official documentation ever becomes available.

While our original focus was on adding support for AMD’s most recent processors, if you have an older AMD processor and are interested in wiring up the temperature sensors on it, please get in touch and we can work together to implement something.

sys/sensors.h

Now that we have drivers that know how to read this information, the next problem we need to tackle is how to expose this information to user land. In illumos, the most common way is often some kind of structured data that can be retrieved by an ioctl on a character device, or some other mechanism like a kernel statistic.

After talking with some folks, we put together a starting point for a way for a kernel to expose sets of statistics and created a new header file in illumos called sys/sensors.h. This header file isn’t currently shipped and is only used by software in illumos itself. This makes it easy for us to experiment with the API and change it without worrying about breaking user software. Right now, each of the above drivers exposes a specific type of character device that implements the same, generic interface.

The current interface supports two different types of commands. The first, SENSOR_IOCTL_TYPE, answers the question of what kind of sensor this is. Right now, the only valid answer is SENSOR_KIND_TEMPERATURE. The idea is that if we have other sensors, say voltage, current, or something else, we could return a different kind. Each kind, in turn, promises to implement the same interface and information.

For temperature devices, we need to fill in a singular structure which is used to retrieve the temperature. This structure currently looks something like:

    typedef struct sensor_ioctl_temperature {
            uint32_t        sit_unit;
            int32_t         sit_gran;
            int64_t         sit_temp;
    } sensor_ioctl_temperature_t;

This is kind of interesting and incorporates some ideas from Joshua Clulow and Alex Wilson. The sit_unit member is used to describe what unit the temperature is in. For example, it may be in Celsius, Kelvin, or Fahrenheit.

The next two members are a little more interesting. The sit_temp member contains a temperature reading; the sit_gran member is what’s important in how we interpret that reading. While many sensors end up communicating to digital consumers using a power-of-2 based reading, that’s not always the case. Some sensors may report a reading in units such as 1/10th of a degree. Others may actually report something in a granularity of 2 degrees!

To try and deal with this menagerie, the sit_gran member indicates the number of increments per degree in the sit_temp member. If this was set to 10, then that would mean that sit_temp was in 10ths of a degree and to get the actual value in degrees, one would need to divide by 10. On the other hand, a negative value instead means that we would need to multiply. So, a value of -2 would mean that sit_temp was in units of 2 degrees. To get the actual temperature, you would need to multiply sit_temp by 2.

Now, you may ask why we don’t just have the kernel compute a whole-number part and a decimal part. The main goal is to avoid floating point math in the kernel. For various reasons, floating point has historically been avoided there and we’d rather keep it that way. While this may seem a little weird, it allows the driver to do something simpler and lets user land figure out how to transform this into a value that makes semantic sense for it. This gets the kernel out of playing the “how many digits after the decimal point would you like” game.
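
Putting the interface together, a user-land consumer might look something like the sketch below. Note that SENSOR_IOCTL_TEMPERATURE is my stand-in name for the temperature-retrieval command, and the kind returned by SENSOR_IOCTL_TYPE is simplified to a plain int; the real private header may differ:

/*
 * A sketch of a user-land consumer, under the assumptions noted above.
 * sys/sensors.h is private and unshipped, so the details here are
 * illustrative only.
 */
#include <sys/sensors.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
        sensor_ioctl_temperature_t sit;
        int fd, kind;
        double degrees;

        fd = open("/dev/sensors/temperature/cpu/chip0.core0", O_RDONLY);
        if (fd < 0) {
                perror("open");
                return (1);
        }

        /* Make sure this is actually a temperature sensor. */
        if (ioctl(fd, SENSOR_IOCTL_TYPE, &kind) != 0 ||
            kind != SENSOR_KIND_TEMPERATURE) {
                (void) close(fd);
                return (1);
        }

        if (ioctl(fd, SENSOR_IOCTL_TEMPERATURE, &sit) != 0) {
                perror("ioctl");
                (void) close(fd);
                return (1);
        }

        /* Positive sit_gran divides; negative multiplies (see above). */
        if (sit.sit_gran > 0)
                degrees = (double)sit.sit_temp / sit.sit_gran;
        else
                degrees = (double)sit.sit_temp * -sit.sit_gran;

        (void) printf("%.2f degrees\n", degrees);
        (void) close(fd);
        return (0);
}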

Exposing Devices

The second part of this bit of kernel support is to provide a uniform and easy way to see these different things under /dev in the file system. In illumos, when you create a minor node in a device driver, you have to specify a type of device and a name for the minor node. While most of the devices in the system use a standard type, we added a new class of types for sensors that translates roughly into where you’ll find them.

So, for example, the CPU drivers use the class that has the string "ddi_sensor:temperature:cpu" (usually as the macro DDI_NT_SENSOR_TEMP_CPU). This is used to indicate that it is a temperature sensor for CPUs. The coretemp driver then creates different nodes for each core and socket. For example, on a system with an Intel E3-1270v3 (Haswell), we see the following devices:

    [root@haswell ~]# find /dev/sensors/
/dev/sensors/
/dev/sensors/temperature
/dev/sensors/temperature/cpu
/dev/sensors/temperature/cpu/chip0
/dev/sensors/temperature/cpu/chip0.core0
/dev/sensors/temperature/cpu/chip0.core1
/dev/sensors/temperature/cpu/chip0.core2
/dev/sensors/temperature/cpu/chip0.core3
/dev/sensors/temperature/pch
/dev/sensors/temperature/pch/ts.0

On the other hand, on an AMD EPYC system with two AMD EPYC 7601 processors, we see:

    [root@odyssey ~]# find /dev/sensors/
/dev/sensors/
/dev/sensors/temperature
/dev/sensors/temperature/cpu
/dev/sensors/temperature/cpu/procnode.0
/dev/sensors/temperature/cpu/procnode.1
/dev/sensors/temperature/cpu/procnode.2
/dev/sensors/temperature/cpu/procnode.3
/dev/sensors/temperature/cpu/procnode.4
/dev/sensors/temperature/cpu/procnode.5
/dev/sensors/temperature/cpu/procnode.6
/dev/sensors/temperature/cpu/procnode.7

The nice thing about the current scheme is that anything of type ddi_sensor will have a directory hierarchy created for it based on the different interspersed : characters. This makes it very easy for us to experiment with different kinds of sensors without having to go through too much effort. That said, this is all still evolving, so there’s no promise that this won’t change. Please don’t write code that relies on this. If you do, it’ll likely break!
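
On the driver side, creating one of these sensor nodes is just ordinary minor node creation with one of the new class strings. Here is a sketch of what that might look like in a driver’s attach(9E) entry point, with a made-up node name and minor number:

#include <sys/stat.h>
#include <sys/ddi.h>
#include <sys/sunddi.h>

static int
example_attach(dev_info_t *dip, ddi_attach_cmd_t cmd)
{
        if (cmd != DDI_ATTACH)
                return (DDI_FAILURE);

        /* Appears as /dev/sensors/temperature/cpu/chip0.core0. */
        if (ddi_create_minor_node(dip, "chip0.core0", S_IFCHR, 0,
            DDI_NT_SENSOR_TEMP_CPU, 0) != DDI_SUCCESS)
                return (DDI_FAILURE);

        return (DDI_SUCCESS);
}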

FMA Topology

The last piece of this was to wire it up in FMA’s topology. This came in a few different pieces. The first was to make it easy to add a node to the topology that represents a sensor backed by this kernel interface. There’s one generic implementation of that, which is parametrized by the sensor’s path.

With that, I first modified the CPU enumerator. The logic will use a core sensor if available, but can also fall back to a processor-node sensor if it exists. Next, I added a new piece of enumeration, which was to represent the chipset. If we have a temperature sensor, then we’ll enumerate the chipset under the motherboard. While this is just the first item there, I suspect we’ll add more over time as we try to properly capture more information about what it’s doing, the firmware revisions that are a part of it, and more.

This piece is, in some ways, the simplest of them all. It just builds on everything else that was already built up. FMA already had a notion of a sensor (which is used for everything from disk temperature to the fan RPM), so this was just a simple matter of wiring things up.

Now, we have all of the different pieces that made the original example of the CPU and PCH temperature sensors work.

Further Reading

If you’re interested in learning more about this, the big theory statements that describe the purpose of the different drivers, along with the drivers’ manual pages, are a good place to start.

Finally, if you want to see the actual original commits that integrated these changes, then you can find the following from illumos-gate:

    commit dc90e12310982077796c5117ebfe92ee04b370a3
Author: Robert Mustacchi <rm@joyent.com>
Date:   Wed Apr 24 03:05:13 2019 +0000

    11273 Want Intel PCH temperature sensor
    Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Reviewed by: Mike Zeller <mike.zeller@joyent.com>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Reviewed by: Gergő Doma <domag02@gmail.com>
    Reviewed by: Paul Winder <Paul.Winder@wdc.com>
    Approved by: Richard Lowe <richlowe@richlowe.net>

commit f2dbfd322ec9cd157a6e2cd8a53569e718a4b0af
Author: Robert Mustacchi <rm@joyent.com>
Date:   Sun Jun 2 14:55:56 2019 +0000

    11184 Want CPU Temperature Sensors
    11185 i86pc chip module should be smatch clean
    Reviewed by: Hans Rosenfeld <hans.rosenfeld@joyent.com>
    Reviewed by: Jordan Hendricks <jordan.hendricks@joyent.com>
    Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Approved by: Garrett D'Amore <garrett@damore.org>

Looking Ahead

While this is focused on a few useful temperature sensors, there are more that we’d like to add. If you have a favorite sensor that you’d like to see, please reach out to me or the broader illumos community and we’d be happy to take a look at it.

Another avenue that we’re looking to explore is having a standardized sensor driver such that a device driver doesn’t necessarily have to have its own character device or implementation of the ioctl handling.

Finally, I’d like to thank all those who helped me as we discussed different aspects of the API, reviewed the work, and tested it. None of this happens in a vacuum.

IFR Alternate Minimums Josef "Jeff" Sipek

As some of you already know, I’ve been working on my instrument rating over the past 5–6 months. As part of it, I had to figure out and understand the regulations governing when an alternate airport is needed and the required weather at the destination and alternate airports.

The first part is answered by 91.169(a) and 91.169(b). To give you a taste of the regulations, here is (b):

(b) Paragraph (a)(2) of this section does not apply if:

(1) Part 97 of this chapter prescribes a standard instrument approach procedure to, or a special instrument approach procedure has been issued by the Administrator to the operator for, the first airport of intended landing; and

(2) Appropriate weather reports or weather forecasts, or a combination of them, indicate the following:

(i) For aircraft other than helicopters. For at least 1 hour before and for 1 hour after the estimated time of arrival, the ceiling will be at least 2,000 feet above the airport elevation and the visibility will be at least 3 statute miles.

(ii) For helicopters. At the estimated time of arrival and for 1 hour after the estimated time of arrival, the ceiling will be at least 1,000 feet above the airport elevation, or at least 400 feet above the lowest applicable approach minima, whichever is higher, and the visibility will be at least 2 statute miles.

Clear as mud, isn’t it?

The second question (the required weather at the destination and alternate airports) is answered by 91.169(c). Don’t worry, I won’t quote it here.

Since the text of the regulation is not easy to read, I decided that the best way to understand it is to make a flowchart. As I fly airplanes, I’ve ignored any part of the regulations that is about aircraft other than airplanes.

The result:

Clearer? I certainly think so!

The one big thing to keep in mind about this flowchart is that not every approach can be used during planning. This is a semi-large topic of its own.

In short, any approach that you aren’t authorized for, that the plane isn’t equipped for, or that has a NOTAM saying that it isn’t available, effectively doesn’t exist. As far as GPS approaches are concerned, if you have a TSO-C129 or TSO-C196 GPS, then you have another restriction—you cannot plan on using GPS approaches at both your destination and your alternate.

I found it useful to write this down and in the process truly understand the rules. Hopefully, you’ve found this useful as well. Needless to say, you should not rely on this flowchart without verifying that it is correct. Regulations sometimes change, and people sometimes make mistakes when making flowcharts to visualize said regulations. (If you find a problem, let me know!)

One final thought: just because the regulations don’t require an alternate airport doesn’t mean that you shouldn’t have one anyway. Weather often seems to have a mind of its own and a propensity to prove forecasters wrong.

Time-based One-time Passwords Josef "Jeff" Sipek

Recently I ended up playing with Wikipedia article: Time-based One-time Passwords as a second factor when authenticating with various services. When I saw an RFC referenced in the references section, I looked at it to get a better idea of how complicated the algorithm really is. It turns out that TOTP is very simple. So simple that I couldn’t help but put together a quick and dirty implementation in Python.

TOTP itself is documented in RFC 6238. It is a rather short RFC, but that’s because all it really says is “use HOTP and feed it these values”.

HOTP is documented in RFC 4226. This RFC is a bit longer since it has to describe how the counter value gets hashed and the resulting digest gets mangled. Reading it, one will learn that the HMAC-SHA1 is the basic building block of HOTP.

HMAC is documented in RFC 2104.

With these three documents (and a working implementation of SHA1), it is possible to implement your own TOTP.

The Key

If you follow those RFCs, you’ll have a working TOTP. However, that’s not enough to make use of the code. The whole algorithm is predicated on having a pre-shared secret—a key. Typically, the service you are enabling TOTP for will issue you a key and you have to feed it into the algorithm to start generating passwords. Since showing the user the key in binary is not feasible, some sort of encoding is needed.

I couldn’t find any RFC that documents best practices for sharing the key with the user. After a while, I found a Google Authenticator wiki page describing the format of the key URIs used by Google Authenticator.

It turns out that this is a very common format. It uses a base32 encoding with the padding stripped. (Base32 is documented in RFC 4648.)

The “tricky” part is recreating this padding to make the decoder happy. Since base32 works on 40-bit groups (it converts between 5 raw bytes and 8 base-32 chars), we must pad to the nearest 40-bit group.

The Code

I tried to avoid implementing HMAC-SHA1, but I couldn’t find it in any of the modules Python ships with. Since it is a simple enough algorithm, I implemented it as well. Sadly, it nearly doubles the size of the code.

Warning: This is proof-of-concept quality code. Do not use it in production.

import struct
import hashlib
import base64
import time

# The pre-shared secret (base32 encoded):
key = "VGMT4NSHA2AWVOR6"

def HMAC(k, data, B=64):
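    # HMAC (RFC 2104): H((K ^ opad) || H((K ^ ipad) || data)), where B
    # is the underlying hash's block size in bytes (64 for SHA1).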
    def H(m):
        return hashlib.sha1(m).digest()

    # keys too long get hashed
    if len(k) > B:
        k = H(k)

    # keys too short get padded
    if len(k) < B:
        k = k + ("\x00" * (B - len(k)))

    ikey = "".join([chr(ord(x) ^ 0x36) for x in k])
    okey = "".join([chr(ord(x) ^ 0x5c) for x in k])

    return H(okey + H(ikey + data))

def hotp(K, C, DIGITS=6):
    def Truncate(inp):
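        # Dynamic truncation (RFC 4226 section 5.3): the low nibble of
        # the digest's last byte picks a 4-byte window, and the window's
        # top bit is masked off to avoid signedness issues.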
        off = ord(inp[19]) & 0xf

        x = [ord(x) for x in inp[off:(off+4)]]

        return ((x[0] & 0x7f) << 24) | (x[1] << 16) | (x[2] << 8) | x[3]

    return Truncate(HMAC(K, struct.pack(">Q", C))) % (10 ** DIGITS)

def totp(K, T=time.time(), X=30, T0=0, DIGITS=6):
    return hotp(K, long(T - T0) / long(X), DIGITS=DIGITS)

# pad to the nearest 40-bit group
if len(key) % 8 != 0:
    key=key + ("=" * (8 - (len(key) % 8)))

key=base64.b32decode(key.upper())

print time.ctime(), time.time()
print "TOTP: %06d" % totp(key)

This code is far from optimal, but I think it nicely demonstrates the simplicity of TOTP.

My Logbooks Josef "Jeff" Sipek

Recently a friend asked me about what I use for my pilot logbook. This made me realize that my logging is complicated and that I should probably make a blahg entry about it.

All in all, I have three logbooks to keep track of my flying.

Good ol’ paper logbook

This is the “official” one. In other words, if the FAA wants to see my logbook, that’s what I’ll show them. There’s not much more to say about it.

ForeFlight

This is my casual logbook. A while back I entered everything in, including more accurate counts (full stop vs. touch+go) and better divided up time counts (PIC vs. solo). I use this logbook to answer questions like “How much time do I have?” and “Am I current?”. It is also useful when chatting with people and I want to dig up an entry.

I also use it to keep track of Wikipedia article: Hobbs vs. Wikipedia article: tach time since I pay based on tach time.

A Repository

This is my custom analysis and archive logbook. In this Mercurial repository, I keep a giant JSON file with every entry with even more detail than what’s in ForeFlight.

Alongside it, I also keep a list of latitude/longitude/altitude information for each airport I’ve been to.

From these two files, I can generate various plots. For example, here is one:

Airports as of 2018-07-30

This is a plot of all the airports I’ve ever landed at—color coded based on the year of my first landing there.

This repository also serves as a backup of my paper logbook (in case my physical logbook burns up, gets water damaged, etc.) and an archive of other flying related data. I accomplish this by keeping scans of the paper logbook, copies of any GPS tracklogs I’ve recorded, and so on in a couple of subdirectories.

Post-flight

At the end of each flight, I add an entry to my ForeFlight logbook. Usually, I have ForeFlight recording a tracklog, so a large part of the entry is auto-generated. The bits of info that I add manually are:

  • tach time (useful for billing)
  • time out/off/on/in (I’m trying to figure out how much time I “waste” on the ground to improve my planning accuracy)
  • landing counts
  • any remarks I wouldn’t remember later

Then, when I’m home and have time (this can be later that day, or 3 days later), I pull up the ForeFlight entry, improve/edit the remarks, double check that all the counts make sense (if needed I pull up the tracklog to recount the number of landings, etc.), and then write an entry into my paper logbook.

If I filled up a page of the paper logbook, I scan the two pages and drop them into the repository.

Depending on how I feel, I may update the repository logbook JSON file then and there or at some point later (in the past I’ve even waited for a month due to laziness). Usually, visiting a new airport is motivating enough.

Prague & IETF Josef "Jeff" Sipek

This post is part of a series named “Europe 2017” where I share photos from my adventures in Europe during the summer 2017.

In mid-July 2017, I got to travel to Prague to participate in IETF’s 99th meeting.

The IETF meeting itself was great—lots of useful discussion about the next-generation email accessing protocol (called JMAP).

I stayed a couple of days extra to enjoy Prague, and Holly flew out from Helsinki to revisit Prague where she’s been once before—for our honeymoon.

I dragged my D750 and the two lenses with me and made sure to take photos (almost) all the time. The gallery contains only a handful of the approximately 1100 raw files. Of those, I selected 11 for this blahg post.

Wikipedia article: Malá Strana Bridge Tower:

Wikipedia article: St. Nicholas Church with the Wikipedia article: Žižkov Television Tower in the background:

Wikipedia article: Matthias Gate with Wikipedia article: St. Vitus Cathedral peeking in the background:

The Wikipedia article: National Theatre:

Wikipedia article: Charles Bridge and a view of Wikipedia article: Old Town:

Wikipedia article: St. Vitus Cathedral from Wikipedia article: Petřín near sunset:

St. Nicholas Church again during sunset (and without the ugly Žižkov TV tower):

A gargoyle keeping the St. Nicholas Church’s masonry safe:

A view from the top of Wikipedia article: Old Town Bridge Tower with roofs and towers of (left to right):

Church of Saint Wikipedia article: Francis of Assisi, the left tower of Wikipedia article: Church of Our Lady before Týn, the clock tower and the Astronomical tower of Wikipedia article: Clementinum:

St. Nicholas Church yet again, this time from the Malá Strana Bridge Tower:

Wikipedia article: Charles Bridge, Wikipedia article: Old Town Bridge Tower, Church of Saint Wikipedia article: Francis of Assisi, and Wikipedia article: Žižkov Television Tower (from the Malá Strana Bridge Tower):

Prague offers a lot to see. The few photos I selected for this blahg post don’t show anywhere near enough of it. There are more photos in the gallery, but even those are merely highlights of what one can see in Prague. If you haven’t been to Prague yet, I highly recommend a trip.

2018-06-05 Josef "Jeff" Sipek

Smart Clock: A New Time — Using three inexpensive wrist watches to achieve 1 second accuracy over an extended period of time.

Repairing the card reader for a 1960s mainframe: cams, relays and a clutch

The 555 Timer IC — an Interview with Hans Camenzind, the Designer of the Most Successful Integrated Circuit Ever Developed

High-level Problems with Git and How to Fix Them — A Mercurial developer’s view of Git’s shortcomings.

Mailing lists vs Github

GDL 90 Data Interface Specification — Definition of the serial protocol used by Wikipedia article: UAT receivers to feed the received data to Wikipedia article: MFDs.

GDL 90 Extended Specification — ForeFlight’s extension to GDL 90.

Tallinn Josef "Jeff" Sipek

This post is part of a series named “Europe 2017” where I share photos from my adventures in Europe during the summer 2017.

In late June 2017, Holly and I did a day trip to Wikipedia article: Tallinn. This wasn’t the first time I was in Tallinn, so I knew what the interesting parts of the old town were. As always, there is a gallery with more photos.

Tallinn’s old town is a medieval pocket in an otherwise modern city. In some of the photos you can see the modern civilization right behind a medieval tower.

A view of the Wikipedia article: Alexander Nevsky Cathedral from the tower of Wikipedia article: St. Mary’s Cathedral:

The tower of St. Mary’s Cathedral:

A section of the fortification wall that remains:

I’ve been to Tallinn twice and all my time there was spent in the old town. This makes me far from an expert about what there is to do. With that said, I enjoyed my time there and I recommend a day trip to anyone visiting nearby.

OH-LCD Josef "Jeff" Sipek

This post is part of a series named “Europe 2017” where I share photos from my adventures in Europe during the summer 2017.

When I attended the Kaivopuisto Air Show in early June last year, I learned about the existence of the Finnish Aviation Museum. It took me a month and a half, but eventually I found a free day to go check it out.

The museum itself is packed with all sorts of aircraft on static display. While they were interesting (and I certainly took plenty of photos of them), they aren’t what this post is about. This post is about Lokki—a retired Wikipedia article: DC-3 (registration OH-LCD) on display outside of the museum.

As luck would have it, the folks from the DC Association were there that day trying to see if they could start up Lokki’s engines—after 12 years of inactivity. After a lot of preparation, they managed to start them!

Without further ado, here are a few photos of Lokki (more photos can be found in the gallery).

Wikipedia article: Aero OY was the original name of Finnair:

One of the mechanics working on the left engine:

One of the people from the DC Association, seeing that I was obviously excited about the plane, asked me if I’d like to climb inside. I said yes, of course.

The inside was pretty bare-bones (which is to be expected of a static display that’s normally closed to public). I took a couple of photos inside, but most weren’t that interesting.

Throttle quadrant (note: most of the instrument panel was removed long ago):

It runs!

The livery is pretty simple—polished aluminum with dark blue lettering and a stripe:

I’m not really sure why they wanted to see if they could start the engines, but I’m happy that it worked out. Radial engines just have a unique roar to them.

Anyway, that’s it about Lokki. Hopefully I’ll get around to post processing the photos from the museum itself soon.

Modern Mercurial - Phases Josef "Jeff" Sipek

This post is part of a series named “Modern Mercurial” where I share my realizations about how much Mercurial has advanced since 2005 without me noticing.

Last year, I had a realization that I haven’t been using Mercurial to its full potential. In this post, I’d like to share my thoughts about and usage of Mercurial Phases.

Phases are not a new feature. They made their first appearance back in 2012 as part of Mercurial 2.1, which makes them a little over 6 years old.

What are phases?

While there is a description of phases on the Mercurial wiki, I’ll take a stab at a short intro.

Each commit belongs to one of three phases (public, draft, or secret) which implies a set of allowed operations on the commit. Furthermore, the phase dictates which other phase or phases the commit can transition to.

You can think of the phases as totally ordered (secret → draft → public) and a commit’s phase can only move in that direction. That is, a secret commit can become either a draft or a public commit, a draft commit can become a public commit, and a public commit is “stuck” being public. (Of course if you really want to, Mercurial allows you to force a commit to any phase via hg phase -f.)

The allowed operations on a commit of a particular phase are pretty self-explanatory:

Public commits are deemed immutable and sharable—meaning that if you try to perform an operation on a commit that would modify it (e.g., hg commit --amend), Mercurial will error out. All read-only operations as well as pushing and pulling are allowed.

Secret commits are mutable and not sharable—meaning that all modifications are allowed, but the commits are not pullable or pushable. In other words, a hg pull will not see secret commits in the remote repository, and a hg push will not push secret commits to the remote repository.

Draft commits are mutable and sharable—a phase between public and secret. Like secret commits, changes to commits are allowed, and like public commits, pushing and pulling is allowed.

Or in tabular form:

Phase    Commits    Sharing
public   immutable  allowed
draft    mutable    allowed
secret   mutable    prevented

By default, all new commits are automatically marked as draft, and when a draft commit is pushed it becomes public on both ends.

Note that these descriptions ignore the amazing changeset evolution features making their way into current Mercurial since they can blur the “not yet shared” nature of draft commits. (Perhaps I should have titled this post Modern Mercurial (2012 edition) — Phases.)

A note about hg log

Unfortunately, the default hg log output does not display phases at all. I think this is rather unfortunate (but understandable from a backwards compatibility point of view).

Last year, I dedicated a whole post to how I template hg log information including my reasoning for why I display phases the way I do.

How do I use phases?

Now that we have the basic introduction to phases out of the way, let me describe how I mapped them to my workflow.

First of all, I make all new commits start in the secret phase (instead of the default draft) with a quick addition to .hgrc:

[phases]
new-commit = secret

This immediately prevents an accidental hg push from pushing commits that I’m still working on. (Recall that secret commits cannot be pushed.) In at least one repository, this allowed me to regularly have more than 6 heads with various work-in-progress feature ideas without the fear of accidentally messing up a public repository. Before I started using phases, I used separate clones to get similar (but not as thorough) protections.

Now, I work on a commit for a while (keeping it in the secret phase), and when I feel like I’m done, I transition it to the draft phase (via hg phase -d). At that point, I’m basically telling Mercurial (and myself when I later look at hg log) that I’m happy enough with the commit to push it.

Depending on what I’m working on, I may or may not push it immediately after (which would transition the commit to the public phase). Usually, I hold off pushing the commit if it is part of a series, but I haven’t done the last-chance sanity checks of the other commits.

Note: I like to run hg push without specifying a revision to push. I find this natural (and less to type). If I always specified a revision, then phases wouldn’t help me as much.
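
For example, the day-to-day flow ends up looking something like this transcript (the revision number is made up):

$ hg commit -m 'implement the thing'    # new commits start out secret
$ hg push                               # pushes nothing; the commit is secret
$ hg phase -r .
42: secret
$ hg phase -d -r .                      # happy with it, so mark it draft
$ hg push                               # now it goes out and becomes public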

“Ugly” repos

I have a couple of repositories that I use for managing assorted data like my car’s gasoline utilization. In these repositories, the commits are simple data point additions to a CSV file and the commit messages are repetitive one-liners. (These one-liners create a rather “ugly” commit history.)

In essence, the workflow these repositories see can be summarized as:

$ echo "2018-04-05,12345,17.231," >> data.csv
$ hg commit -m "more gas"
$ hg push

In these repositories, I’ve found that defaulting to the secret phase was rather annoying because every commit was immediately followed by a phase change to allow the push to work. So, for these repos I changed new-commit back to draft.

Edit: I reworded the sentence about Mercurial giving you a way to force a commit to any phase based on feedback on lobste.rs.

Kaivopuisto Air Show 2017 Josef "Jeff" Sipek

This post is part of a series named “Europe 2017” where I share photos from my adventures in Europe during the summer 2017.

In early June 2017, we attended an air show in Wikipedia article: Kaivopuisto. Unfortunately, we found out about it last minute, and so we missed the beginning which included a Finnair Airbus A350 flyby. Pity.

The show included a number of trainers and combat aircraft performing various maneuvers. Here are the highlights (for more photos visit the gallery).

Wikipedia article: Red Arrows:

A seagull joining in:

Wikipedia article: Finnish Coast Guard’s Wikipedia article: Turva nearby with Wikipedia article: Suomenlinna visible behind it:

Wikipedia article: Eurofighter Typhoon:

Wikipedia article: Saab 35 Draken:

Wikipedia article: Saab Gripen:

During one of the passes, I took a burst of images and then assembled them into a Southwest 737 “Airportrait”-style image.

Finnish Air Force Wikipedia article: F-18 Hornet:

A Finnish aerobatics team Wikipedia article: Midnight Hawks flying Wikipedia article: BAE Systems Hawk:

Even though this post has more photos than I typically share, there are many more in the gallery. So, if you are into airplanes, I suggest you peruse it.

Juhannus 2017 Josef "Jeff" Sipek

This post is part of a series named “Europe 2017” where I share photos from my adventures in Europe during the summer 2017.

You may have noticed that I was a bit quiet during the last summer. I have a really good reason for it: I spent five months in Helsinki for work. On weekends, Holly and I got to explore, which led me to accumulate approximately 12000 photos. Sadly, I am quite behind on post processing them all, but I will get through them eventually.

This post is about how I spent Wikipedia article: Juhannus last year.

Juhannus is the name of the Finnish summer solstice holiday. It is a time to relax, spend time with friends and family, and enjoy oneself. Every year, a nearby island, Wikipedia article: Seurasaari, has an afternoon and evening with an assortment of traditional events and bonfires.

There is of course a gallery of my photos.

Every year, one couple is selected to have their wedding on Seurasaari during Juhannus. Here is 2017’s lucky couple:

Before about half a dozen bonfires are set ablaze, a number of “can fires” are lit:

The largest bonfire gets lit by the newlyweds—from a boat:

I’m not sure how exactly the big bonfire pile was constructed, but it didn’t take long for it to grow:

So, that was Juhannus on Seurasaari in 2017. It was a nice and relaxing afternoon and evening, and if I happen to be in Helsinki around Juhannus in the future, I’ll likely spend the day on Seurasaari.

I’m going to end this post with a bit of Finnish (from finland.fi) because languages can be fun:

– Kokoo koko kokko kokoon!
– Koko kokkoko?
– Koko kokko.

Meaning:

– Assemble the Midsummer bonfire!
– The whole Midsummer bonfire?
– Yes, the whole Midsummer bonfire.

(I’m told that kokoo is a dialect form of kokoa.)

Rust Pointers for C Programmers Josef "Jeff" Sipek

I’ve been eyeing Rust for about a year now. Here and there, I tried to use it to make a silly little program, or to implement some simple function in it to see for myself how ergonomic it really was, and what sort of machine code rustc spit out. But last weekend I found a need for a tool to clean up some preprocessor mess, and so instead of hacking together some combination of shell and Python, I decided to write it in Rust.

From my earlier attempts, I knew that there are a lot of different “pointers” but I found all the descriptions of them lacking or confusing. Specifically, Rust calls itself a systems programming language, yet I found no clear description of how the different pointers map to C—the systems programming language. Eventually, I stumbled across The Periodic Table of Rust Types, which made things a bit clearer, but I still didn’t feel like I truly understood.

During my weekend expedition to Rust land, I think I’ve grokked things enough to write this explanation of how Rust does things. As always, feedback is welcomed.

I’ll describe what happens in terms of C. To keep things simple, I will:

  • assume that you are well-versed in C
  • assume that you can read Rust (any intro will teach you enough)
  • not bother with const for the C snippets
  • not talk about mutability

In the following text, I assume that we have some struct T. The actual contents don’t matter. In other words:


struct T {
        /* some members */
};

With that out of the way, let’s dive in!

*const T and *mut T

These are raw pointers. In general, you shouldn’t use them since only unsafe code can dereference them, and the whole point of Rust is to write as much safe code as possible.

Raw pointers are just like what you have in C. If you make a pointer, you end up using sizeof(struct T *) bytes for the pointer. In other words:


struct T *ptr;

&T and &mut T

These are borrowed references. They use the same amount of space as raw pointers and behave the same exact way in the generated machine code. Consider this trivial example:


#[no_mangle]
pub fn raw(p: *mut usize) {
    unsafe {
        *p = 5;
    }
}

#[no_mangle]
pub fn safe(p: &mut usize) {
    *p = 5;
}

A rustc invocation later, we have:


raw()
    raw:     55                 pushq  %rbp
    raw+0x1: 48 89 e5           movq   %rsp,%rbp
    raw+0x4: 48 c7 07 05 00 00  movq   $0x5,(%rdi)
             00 
    raw+0xb: 5d                 popq   %rbp
    raw+0xc: c3                 ret    

safe()
    safe:     55                 pushq  %rbp
    safe+0x1: 48 89 e5           movq   %rsp,%rbp
    safe+0x4: 48 c7 07 05 00 00  movq   $0x5,(%rdi)
              00 
    safe+0xb: 5d                 popq   %rbp
    safe+0xc: c3                 ret    

Note that the two functions are bit-for-bit identical.

The only differences between borrowed references and raw pointers are:

  1. references will never point at bogus addresses (i.e., they are never NULL or uninitialized),
  2. the compiler doesn’t let you do arbitrary pointer arithmetic on references,
  3. the borrow checker will make you question your life choices for a while.

(#3 gets better over time.)

Box<T>

These are owned “pointers”. If you are a C++ programmer, you are already familiar with them. Never having truly worked with C++, I had to think about this a bit until it clicked, but it is really easy.

No matter what all the documentation and tutorials out there say, Box<T> is not a pointer but rather a structure containing a pointer to heap allocated memory just big enough to hold T. The heap allocation and freeing is handled automatically. (Allocation is done in the Box::new function, while freeing is done via the Drop trait, but that’s not relevant as far as the memory layout is concerned.) In other words, Box<T> is something like:


struct box_of_T {
        struct T *heap_ptr;
};

Then, when you make a new box you end up putting only what amounts to sizeof(struct T *) on the stack and it magically starts pointing to somewhere on the heap. In other words, the Rust code like this:


let x = Box::new(T { ... });

is roughly equivalent to:


struct box_of_T x;

x.heap_ptr = malloc(sizeof(struct T));
if (!x.heap_ptr)
        oom();

*x.heap_ptr = ...;

&[T] and &mut [T]

These are borrowed slices. This is where things get interesting. Even though it looks like they are just references (which, as stated earlier, translates into a simple C-style pointer), they are much more. These types of references use fat pointers—that is, a combination of a pointer and a length.


struct fat_pointer_to_T {
        struct T *ptr;
        size_t nelem;
};

This is incredibly powerful, since it allows bounds checking at runtime and getting a subset of a slice is essentially free!

&[T; n] and &mut [T; n]

These are borrowed references to arrays. They are different from borrowed slices. Since the length of an array is a compile-time constant (the compiler will yell at you if n is not a constant), all the bounds checking can be performed statically. And therefore there is no need to pass around the length in a fat pointer. So, they are passed around as plain ol’ pointers.


struct T *ptr;

T, [T; n], and [T]

While these aren’t pointers, I thought I’d include them here for completeness’s sake.

T

Just like in C, a struct uses as much space as its type requires (i.e., sum of the sizes of its members plus padding).

[T; n]

Just like in C, an array of structs uses n times the size of the struct.

[T]

The simple answer here is that you cannot make a [T]. That actually makes perfect sense when you consider what that type means. It is saying that we have some variable sized slice of memory that we want to access as elements of type T. Since this is variable sized, the compiler cannot possibly reserve space for it at compile time and so we get a compiler error.

The more complicated answer involves the Sized trait, which I’ve skillfully managed to avoid thus far and so you are on your own.

Summary

That was a lot of text, so I decided to compact it and make the following table. In the table, I assume that our T struct is 100 bytes in size. In other words:


/* Rust */
struct T {
    stuff: [u8; 100],
}

/* C */
struct T {
        uint8_t stuff[100];
};

Now, the table in its full glory:

Kind                     Rust               C                        Size ILP32/LP64 (bytes)
Value                    let x: T;          struct T x;              100/100
Raw pointer              let x: *const T;   struct T *x;             4/8
                         let x: *mut T;
Reference                let x: &T;         struct T *x;             4/8
                         let x: &mut T;
Box                      let x: Box<T>;     struct box_of_T x;       4/8
Array of 2               let x: [T; 2];     struct T x[2];           200/200
Reference to an          let x: &[T; 2];    struct T *x;             4/8
array of 2
A slice                  let x: [T];        struct T x[];            unknown at compile time
A reference to a slice   let x: &[T];       struct fat_ptr_to_T x;   8/16

(The box_of_T and fat_ptr_to_T structs are the ones defined earlier in this post.)

A word of caution: I assume that the sizes of the various pointers are actually implementation details and shouldn’t be relied on to be that way. (Well, with the exception of raw pointers—without those being fixed, FFI would be unnecessarily complicated.)

I didn’t cover str, &str, String, and Vec<T> since I don’t consider them fundamental types, but rather convenience types built on top of slices, structs, references, and boxes.

Anyway, I hope you found this useful. If you have any feedback (good or bad), let me know.

CBOR vs. JSON vs. libnvpair Josef "Jeff" Sipek

My blahg uses nvlists for logging extra information about its operation. Historically, it used Sun libnvpair. That is, it used its data structures as well as the XDR encoding to serialize the data to disk.

A few months ago, I decided to replace libnvpair with my own nvlist implementation—one that was more flexible and better integrated with my code. (It is still a bit of a work-in-progress, but it is looking good.) The code conversion went smoothly, and since then all the new information was logged in JSON.

Last night, I decided to convert a bunch of the previously accumulated libnvpair data files into the new JSON-based format. After whipping up a quick conversion program, I ran it on the data. The result surprised me—the JSON version was about 55% of the size of the libnvpair encoded input!

This piqued my interest. I re-ran the conversion but with CBOR (RFC 7049) as the output format. The result was even better with the output being 45% of libnvpair’s encoding.

This made me realize just how inefficient libnvpair is when serialized. At least part of it is because XDR (the way libnvpair serializes data) uses a lot of padding, while both JSON and CBOR use a more compact encoding for many data types (e.g., an unsigned number in CBOR uses 1 byte for the type and 0, 1, 2, 4, or 8 additional bytes based on its magnitude, while libnvpair always encodes a uint64_t as 8 bytes plus 4 bytes for the type).
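
To see where those savings come from, here is a minimal sketch of CBOR’s unsigned integer encoding (major type 0) as RFC 7049 defines it:

#include <stddef.h>
#include <stdint.h>

/* Encode an unsigned integer into buf; returns the number of bytes used. */
size_t
cbor_encode_uint(uint64_t v, uint8_t *buf)
{
        int i, extra;

        if (v <= 23) {
                buf[0] = (uint8_t)v;    /* the value fits in the type byte */
                return (1);
        }

        if (v <= UINT8_MAX) {
                buf[0] = 0x18;          /* 1 additional byte */
                extra = 1;
        } else if (v <= UINT16_MAX) {
                buf[0] = 0x19;          /* 2 additional bytes */
                extra = 2;
        } else if (v <= UINT32_MAX) {
                buf[0] = 0x1a;          /* 4 additional bytes */
                extra = 4;
        } else {
                buf[0] = 0x1b;          /* 8 additional bytes */
                extra = 8;
        }

        /* The additional bytes are the value in big-endian order. */
        for (i = 0; i < extra; i++)
                buf[1 + i] = (uint8_t)(v >> (8 * (extra - i - 1)));

        return (1 + extra);
}

A uint64_t holding, say, 500 encodes as just three bytes (19 01 f4) instead of libnvpair’s fixed twelve.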

Since CBOR is 79% of JSON’s size (and significantly less underspecified compared to the minefield that is JSON), I am hoping to convert everything that makes sense to CBOR. (CBOR being a binary format makes it harder for people to hand-edit it. If hand-editing is desirable, then it makes sense to stick with JSON or other text-based formats.)

The Data & Playing with Compression

The blahg-generated dataset that I converted consisted of 230866 files, each containing an nvlist. The following byte counts are a simple concatenation of the files. (A more complicated format like tar would add a significant enough overhead to make the encoding efficiency comparison flawed.)

Format   Size     % of nvpair
nvpair   471 MB   100%
JSON     257 MB   54.6%
CBOR     203 MB   45.1%

I also took each of the concatenated files and compressed it with gzip, bzip2, and xz. In each case, I used the most aggressive compression by using -9. The percentages in parentheses are comparing the compressed size to the same format’s uncompressed size. The results:

Format   Uncomp.   gzip              bzip2             xz
nvpair   471 MB    37.4 MB (7.9%)    21.0 MB (4.5%)    15.8 MB (3.3%)
JSON     257 MB    28.7 MB (11.1%)   17.9 MB (7.0%)    14.5 MB (5.6%)
CBOR     203 MB    26.8 MB (13.2%)   16.9 MB (8.3%)    13.7 MB (6.7%)

(The compression ratios are likely artificially better than normal since each of the 230k files has the same nvlist keys.)

Since tables like this are hard to digest, I turned the same data into a graph:

CBOR does very well uncompressed. Even after compressing it with a general purpose compression algorithm, it outperforms JSON with the same algorithm by about 5%.

I look forward to using CBOR everywhere I can.

2017-11-14 Josef "Jeff" Sipek

Doug Engelbart Institute — Online exhibits, historic videos, texts, archive photos, and stories about Doug Engelbart, the inventor of the mouse, hypertext, and GUIs…all in the 1960s

Flight recorders data inspection by Airbus

Parsing JSON is a Minefield

Completely Painless Programmer’s Guide to XYZ, RGB, ICC, xyY, and TRCs — Brain-hurting amount of information about color profiles, etc.

darktable — A Lightroom-like open source software

World plugs — Info about every electric plug form factor in the world

Google Traffic, iOS edition Josef "Jeff" Sipek

Several years ago, I wrote about how Google gets traffic information and how to turn off this location reporting on Android phones. Since then I’ve switched to iPhones. While I normally use the built-in Maps app, I keep Google Maps installed as a fallback—just in case.

I upgraded my phone recently and so I spent some time going through all the apps and making sure they worked and didn’t have more access than necessary. This is when I discovered that the Google Maps app for iOS defaults to collecting location information and sharing it with Google. Given my previous post, this isn’t really surprising.

Turning it off

Anyway, as with the Android post, here are the steps to limit Google’s collection of location information.

First of all, in Settings → Privacy → “Location Services”, I changed Google Maps’s permission to “While Using the App”. If I’m not using the app, then it doesn’t need to know where I am.

Second, in the app itself, go to: Settings → “About, terms & privacy” → “Location data collection”. That’s right, this setting is buried in what appears to be a page with the boring legal notices.

And then turn off the toggle:

That should do it…at least assuming that Google honors the settings in its own app.

Creative xor Use Josef "Jeff" Sipek

Last month at work I got to try to optimize a function that takes a number and rounds it up to the next power of 2. The previous implementation used a simple loop. I didn’t dive into obscure bit twiddling, but rather used a helper function that is already in the codebase. Yes, I let the compiler do the heavy lifting of turning easy to understand code into good machine code. The x86 binary that gcc 6.3 produced has an interesting idiom, and that’s why I’m writing this entry.

The new code:

static inline unsigned int bits_required32(uint32_t num)
{
        return num == 0 ? 0 : 32 - __builtin_clz(num);
}

/* Returns x, such that x is the smallest power of 2 >= num. */
uint32_t nearest_power(uint32_t num)
{
        if (num == 0)
                return 1;

        return 1U << bits_required32(num - 1);
}

This is a slightly simplified version of the code, but it demonstrates the optimization quite well.
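
Before diving into the disassembly, here’s a tiny test harness (my addition, with the two functions from above pasted in) that one can use to convince oneself the C behaves as intended:

#include <stdio.h>
#include <stdint.h>

static inline unsigned int bits_required32(uint32_t num)
{
        return num == 0 ? 0 : 32 - __builtin_clz(num);
}

/* Returns x, such that x is the smallest power of 2 >= num. */
uint32_t nearest_power(uint32_t num)
{
        if (num == 0)
                return 1;

        return 1U << bits_required32(num - 1);
}

int main(void)
{
        /* expect: 1 1 2 4 4 8 1024 1024 2048 */
        uint32_t in[] = { 0, 1, 2, 3, 4, 5, 1000, 1024, 1025 };
        size_t i;

        for (i = 0; i < sizeof(in) / sizeof(in[0]); i++)
                printf("nearest_power(%u) = %u\n", in[i], nearest_power(in[i]));
        return 0;
}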

The nearest_power function disassembles as:

nearest_power()
    nearest_power:      8b 54 24 04        movl   0x4(%esp),%edx
    nearest_power+0x4:  b8 01 00 00 00     movl   $0x1,%eax
    nearest_power+0x9:  85 d2              testl  %edx,%edx
    nearest_power+0xb:  74 14              je     +0x14 <nearest_power+0x21>
    nearest_power+0xd:  83 ea 01           subl   $0x1,%edx
    nearest_power+0x10: 74 0f              je     +0xf  <nearest_power+0x21>
    nearest_power+0x12: 0f bd d2           bsrl   %edx,%edx
    nearest_power+0x15: b9 20 00 00 00     movl   $0x20,%ecx
    nearest_power+0x1a: 83 f2 1f           xorl   $0x1f,%edx
    nearest_power+0x1d: 29 d1              subl   %edx,%ecx
    nearest_power+0x1f: d3 e0              shll   %cl,%eax
    nearest_power+0x21: c3                 ret    

The first 6 instructions contain the prologue and deal with num being zero or one—both cases produce the result 1. The remaining 6 instructions make up the epilogue and are where the calculation happens. I’m going to ignore the first half of the function, since the second half is where the interesting things happen.

First, we get the number of leading zeros in num - 1 and stash the value 32 in a register:

    nearest_power+0x12: 0f bd d2           bsrl   %edx,%edx
    nearest_power+0x15: b9 20 00 00 00     movl   $0x20,%ecx

The number of leading zeros (%edx) is in the range 0–31.

Here is the really interesting bit:

    nearest_power+0x1a: 83 f2 1f           xorl   $0x1f,%edx

This xors the number of leading zeros (i.e., 0–31) with 31. To decipher what this does, I find it easier to consider the top 27 bits and the bottom 5 bits separately.

operand   binary
0x1f      00000000 00000000 00000000 000 11111
edx       00000000 00000000 00000000 000 xxxxx

The xor of the top bits produces 0 since both the constant 31 and the register containing any of the numbers 0–31 have zeros there.

The xor of the bottom bits negates them since the constant has ones there.

When combined, the xor has the same effect as this C expression:

out = (~in) & 0x1f;

This seems very weird and useless, but it is far from it. It turns out that for inputs 0–31 the above expression is the same as:

out = 31 - in;
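
Don’t take my word for it: a quick brute-force check (my addition, not from the original post) verifies the identity over the whole input range:

#include <assert.h>
#include <stdio.h>

int main(void)
{
        unsigned int in;

        /* (~in) & 0x1f behaves exactly like 31 - in for in = 0..31 */
        for (in = 0; in < 32; in++)
                assert(((~in) & 0x1f) == 31 - in);

        printf("equivalent for all inputs 0-31\n");
        return 0;
}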

I think it is really cool that gcc produced this xor instead of a less optimal multi-instruction version.

The remainder of the disassembly just subtracts and shifts to produce the return value.

Why xor?

I think the reason gcc (and clang for that matter) produce this sort of xor instruction instead of a subtraction is very simple: on x86 the sub instruction’s left hand side and the destination must be the same register. That is, on x86 the sub instruction works as:

x -= y;

Since the left hand side is also the destination, it isn’t possible to express out = 31 - in using just one sub; the constant 31 would first have to be moved into the destination register.

Anyway, that’s it for today. I hope you enjoyed this as much as I did.

Ex-Joyeur The Observation Deck

When I was first interviewing with Joyent in July 2010, I recall telling then-CTO Mark Mayo that I was trying to make a decision for the next seven years of my career. Mark nodded sagely at this, assuring me that Joyent was the right move. Shortly after coming to Joyent, I became amazed that Mark had managed to keep a straight face during our conversation: as a venture-funded startup, Joyent lived on a wildly accelerated time scale; seven years into the future for a startup is like seventy for a more established company. (Or, to put it more bluntly, startups with burn and runway are generally default dead and very unlikely to live to see their seventh birthday.)

But Mark also wasn’t wrong: again and again, Joyent beat the odds, in part because of the incredible team that our idiosyncratic company attracted. We saw trends like node.js, containers and data-centric computing long before our peers, attracting customers that were themselves forward-looking technologists. This made for an interesting trip, but not a smooth one: being ahead of the market is as much of a curse as a blessing, and Joyent lived many different lives over the past nine years. Indeed, the company went through so much that somewhere along the way one of our colleagues observed that the story of Joyent could only be told as a musical — an observation so profoundly true that any Joyeur will chuckle to themselves as they are reminded of it.

During one period of particularly intense change, we who endured developed a tradition befitting a company whose story needs musical theater to tell it: at company-wide gatherings, we took to playing a game that we called “ex-Joyeur.” We would stand in a circle, iterating around it; upon one’s turn, one must name an ex-Joyeur who has not already been named — or sit down. Lest this sound like bad attitude, it in fact wasn’t: it was an acknowledgement of the long strange trip that any venture-funded startup takes — a celebration of the musical’s exhaustive and curious dramatis personæ, and a way to remember those who had been with us at darker hours. (And in fact, some of the best players at ex-Joyeur were newer employees who managed to sleuth out long-forgotten colleagues!) Not surprisingly, games of ex-Joyeur were routinely interrupted with stories of the just-named individual — often fond, but always funny.

I bring all of this up because today is my last day at Joyent. After nine years, I will go from a Joyeur to an ex-Joyeur — from a player to a piece on the board. I have had too many good times to mention, and enough stories to last several lifetimes. I am deeply grateful for the colleagues and communities with whom I shared acts in the musical. So many have gone on to do terrific things; I know our paths will cross again, and I already look forward to the reunion. And to those customers who took a chance on us — and especially to Samsung, who acquired Joyent in 2016 — thank you, from the bottom of my heart for believing in us; I hope we have done right by you.

As for me personally, I’m going to take slightly more of a break than the three days I took in 2010; as a technologist, I am finding myself as excited as ever by snow-capped mountains on a distant horizon, and I look forward to taking my time to plot my next seven-year expedition!

OmniOS Community Edition r151030l, r151028al, r151022dj OmniOS Community Edition

OmniOS Community Edition weekly releases for w/c 22nd of July 2019 are now available.

The following updates are available for all supported releases:

Security Fixes

Additionally, for r151030 only:

  • The gcc-ar utility has been fixed for both gcc7 and gcc8.

  • Updates to retpoline external thunk generation for both gcc7 and gcc8. See illumos gcc issue 25 for further details.

For further details, please see https://omniosce.org/releasenotes

Any problems or questions, please get in touch.

Modifying USDT providers with translated arguments Dave Pacheco's Blog

I originally wrote this post in 2012 but didn’t get around to publishing it until now. It’s a pretty technical post about an arcane bug in an uncommonly modified component. If you’re not already pretty familiar with translators for USDT providers, this post is probably not useful for you.

Several years ago, a colleague here at Joyent was trying to use the Node HTTP server probes and saw this:

    dtrace: invalid probe specifier node57257:::http-client-request { trace(args[0]->url); }: in action list: failed to resolve native type for args[0]

It was the start of a nightmare: this bug was introduced by code I had committed back in June that year, on a project that I knew at the time didn’t involve much code but was extremely tricky to get right. I obviously hadn’t gotten it quite right. Over the next two days, we iterated on several possible solutions. As we fleshed out each one, we discovered something new that exposed an important case the solution didn’t handle. I’m documenting what we learned here in case anyone runs into similar issues in the future.

Updating USDT providers

An important thing to keep in mind when designing a USDT provider is future extensibility: you may want to add new telemetry later, so you’ll want to be able to extend the native and translated structures in a way that allows the translators to work on both old and new binaries. Be sure to leave zeroed padding in the native (application) structs so that you can easily extend them later. If you do that, you can just add new fields to the native type and translated type and have the translator check for a NULL value. End of story.
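
As a concrete illustration, a native struct designed this way might look like the following sketch; the names are hypothetical, not from any real provider:

#include <stdint.h>

/*
 * Native (application-side) struct for a hypothetical USDT provider.
 * All fields are fixed-width so the layout is identical in 32-bit and
 * 64-bit processes.
 */
typedef struct myapp_conninfo {
        uint64_t mci_version;   /* version/sentinel field, set to 1 */
        uint64_t mci_url;       /* char * stored as a fixed-width integer */
        uint64_t mci_pad[4];    /* zeroed; future fields go here */
} myapp_conninfo_t;

When a new field is later carved out of mci_pad, the translator can check whether the slot is still zero (an old binary) and supply a default value, exactly as described above.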

Adding fields to existing providers

If you find yourself wanting to add a field to an existing provider that doesn’t have padding (and want to preserve compatibility with older binaries), one approach is to:

  • Modify the structure to lead with a fixed-width sentinel or version field.
  • Then have the translator check the value of this field to determine whether it’s looking at an old- or new-format structure and translate appropriately.
  • You’ll save yourself a lot of pain by ensuring that the sentinel field is the same on both 32-bit and 64-bit processes. If the old structure started with a 64-bit pointer on 64-bit systems, the sentinel should probably be 64-bit as well so you can be more sure that it doesn’t accidentally look like part of a 64-bit pointer. (This is a problem for the node provider, which treats anything less than 4096 in the first 32-bit field as the sentinel. Unfortunately, these values could be contained inside a 64-bit pointer, and the translator would guess wrong.)

Making deeper changes to existing providers

Sometimes you need to make deeper changes, like renaming structures used as probe arguments (perhaps because you need to break a single common structure into two, as was done with the node provider). In that case, remember that:

  • The structures defined in the provider must also be defined in the library file. If you fail to do this, DTrace will barf trying to instrument a particular process because it fails to resolve the native type. However, it may seem to work because if you instrument all processes, it may find one of the old binaries that doesn’t reference the unresolved type. Most confusingly, it will also successfully instrument new binaries and use one of the translators, even though the source type doesn’t match.
  • Be sure to test by using the translator automatically rather than using the “xlate” operator. If you use “xlate”, you won’t see the above problem because you end up casting the argument to the correct type, so DTrace never has to resolve the (wrong) native type. That hides the problem.
  • Beware of this DTrace compiler bug.

You can see the solution we came up with in all its glory.

Summary

If you make changes to an existing USDT provider with translated arguments, be sure to consider and test:

  • new binaries (i.e. using the new provider file) on systems with an updated library file
  • old binaries (i.e. using the old provider file) on systems with an updated library file
  • existing DTrace scripts (important for stable tools like Cloud Analytics). Remember, the point of USDT is stability!
  • with “xlate” as well as with automatic translators
  • instrumenting individual old and new processes, not just “app*:::”

You may be able to ignore new binaries on an older system because users can probably use “dtrace -L” to get the new file.

I had thought I knew a lot about how USDT providers with translated arguments worked, but debugging this issue reminded me how complex this problem is intrinsically. Despite its warts, it works very well.

OmniOS Community Edition r151030j, r151028aj, r151022dh OmniOS Community Edition

OmniOS Community Edition weekly releases for w/c 8th of July 2019 are now available.

The following security fix is available for all supported releases:

For further details, please see https://omniosce.org/releasenotes

Any problems or questions, please get in touch.

Performance Puzzler: The Slow Server Dave Pacheco's Blog


In the spirit of the TCP Puzzlers, I want to present a small distributed systems performance puzzler: In a distributed system with a fixed number of servers and a fixed concurrency, what happens when one of the backend servers gets slow? First, let’s pick apart what this means.

Suppose you have:

  • a service with a fixed number of backend servers — say, 100 servers — each exactly equivalent. You could imagine this as a load balancing service having 100 backends or a key-value store having 100 well-distributed shards.
  • a fixed number of clients — say, 10,000. Each client makes one request to the system at a time. When each request completes, the client makes another one. The 10,000 clients could be 1,000 clients each having a 10-way thread pool or 10 event-oriented clients each capable of 1,000 concurrent requests — either way, the total concurrency is 10,000. It doesn’t matter whether the clients and servers communicate via RPC, HTTP, or something else.

with the following assumptions (which we’ll revisit later):

  • All requests take about the same amount of work to process, both on the client-side and server-side.
  • The per-request cost outside the server is negligible. (We’re going to ignore client-to-server transit time as well as client processing time.)
  • The backend servers are infinitely scalable. That is, we’ll assume that if they take 1ms to process 1 request, they’ll also take 1ms to process 10 requests issued at the same time.
  • Each request is dispatched to one of the servers uniformly at random.

To start with, suppose each server takes 1ms to complete each request. What’s the throughput of the entire system? What will be the per-server throughput?

There are a few ways to analyze this:

  • From the client perspective, we know there will be 10,000 requests outstanding to the system at any given instant. (Remember, there are 10,000 clients, each making a single request at a time, and we’re assuming zero per-request costs on the client side and in transit.) After 1ms, those requests have all completed and another 10,000 requests are running. At this rate, the system will complete (10,000 requests per millisecond) times (1,000 milliseconds per second) = 10 million requests per second. Intuitively, with all servers equal, we’d expect each server to be handling 1% of these, or 100,000 requests per second.
  • From the server perspective: with 10,000 outstanding requests at all times and 100 servers, we’d expect about 100 requests outstanding per server at all times. Each takes 1ms, so there would be 100,000 completed per server per second (which matches what we said above).

Now, what happens when one of the servers becomes slow? Specifically, suppose all of a sudden one of the servers starts taking 1s to process every request. What happens to overall throughput? What about per-server throughput? (Make a guess before reading on!)

The analysis is a bit more complex than before. From the client perspective, there will still always be 10,000 requests outstanding. 99% of requests will land on a fast server and take 1ms, while the remaining 1% will land on the slow server and take 1s. The expected latency for any request is thus 0.99 * 1ms + 0.01 * 1000ms = 10.99ms. If each of the 10,000 clients completes (1 request / 10.99ms) times (1,000 milliseconds per second), we get an expected throughput of about 91 requests per second per client, or 909,918 requests per second in total. (That’s a degradation of 91% from the original 10 million requests per second!)

Above, we assumed that each server had the same throughput, but that’s not so obvious now that one of them is so slow. Let’s think about this another way on the client side: what’s the probability at any given time that a client has a request outstanding to the slow server? I find it easiest to imagine the assignment being round-robin instead of random. Suppose clients issued requests to each of the 99 fast servers, then the slow server, and then started again with the first fast server. In that case, the first 99 requests would take 99ms, and the last request would take 1s. The client would have a request outstanding to the slow server for 1,000 ms / 1,099 ms or about 91% of the time. There’s a more rigorous explanation below, but I think it’s intuitive that the random distribution of work would behave the same way on average.

In that case, we could also expect that 91% of clients (or 9100 clients) have a request outstanding to the slow server at any given time. Each of these takes 1 second. The result is a throughput of 9,100 requests per second on the slow server.

What about the fast server? We’d expect that each fast server has 1/99 of the remaining requests (9% of requests) at any given time, which works out to 9.1 requests per server. At 1ms per request, these servers are doing (drum roll) 9,100 requests per second. The same as the slow server! Is that what you guessed? I didn’t.

This isn’t an accident of the specific numbers I picked. If you run the same analysis algebraically instead of with concrete numbers, you’ll find that the general result holds: in this system, when one server becomes slow, the throughput of every server remains the same.

A (more) formal proof

Let’s define some notation:

  • C: the total client concurrency (10,000 in our example)
  • N: the number of servers (100 in our example)
  • Si: server i (where i ranges from 1 to N)
  • Li: the latency (in seconds) of requests to server Si. In our example, L1 would be 1 second, while Lj = 0.001 for j > 1.
  • Ci: the number of requests outstanding on server Si at any given time.
  • Ti: the throughput of server Si (in requests per second).
  • Pr(ci): the probability that a particular client has a request outstanding to server Si at any given time.
  • Pr(Ri): the probability that a newly-issued request will be issued to server Si.

We’re looking for a formula for Ti. For a server that can handle one request at a time, the throughput is simply the inverse of the latency (1/Li). We said that our servers were infinitely scalable, which means each can execute an arbitrary number of requests in parallel. Specifically, server Si would be executing Ci requests in parallel, so the throughput Ti is:

    Ti = Ci / Li

Now, how do we calculate Ci, the concurrency of requests at each server? Well, since each client has exactly one request outstanding at a time, Ci is exactly the number of clients that have a request outstanding to server Si. Since the clients operate independently, that’s just:

    Ci = C * Pr(ci)

To calculate ci, we need the percentage of time that each client has a request outstanding to Si. We define a probability space of events (t, i) indicating that at millisecond timestamp t, the client has a request outstanding to server Si. By simply counting the events, we can say that the probability that a client has a request outstanding to server Si is:

    Pr(ci) = (Pr(Ri) * Li) / (Pr(R1) * L1 + … + Pr(RN) * LN)

Now, since it’s equally likely that we’ll assign any given request to any server:

    Pr(Ri) = 1 / N

that means:

    Pr(ci) = Li / (L1 + L2 + … + LN)

All we’re saying here is that the fraction of time each client spends with a request outstanding to a particular server is exactly that server’s latency divided by the latency of all servers put together. This makes intuitive sense: if you have 20 servers that all take the same amount of time, each client would spend 5% (1/20) of its time on each server. If one of those servers takes twice as long as the others, it will spend 9.5% (2/21) of its time on that one (almost twice as much as before) and 4.8% (1/21) on the others.

With this, we can go back to Ci:

    Ci = C * Pr(ci) = C * Li / (L1 + L2 + … + LN)

and finally back to Ti:

    Ti = Ci / Li = C / (L1 + L2 + … + LN)

and we have our result: the throughput at each server does not depend at all on the latency of that server (or any other server, for that matter)!

We can plug in some numbers to sanity check. In our example, we started with:

    C = 10,000
    N = 100
    L1 = 1 second
    Li = 0.001 seconds (for i > 1)

and we get the results:

    Ti = 10,000 / (1 + 99 * 0.001) = 10,000 / 1.099 ≈ 9,099 requests per second (per server)

which matches what we said above.

Simulating the behavior

One might find this result hard to believe. I wrote a simple simulator to demonstrate it. The simulator maintains a virtual clock with a 1ms resolution. To start with, each client issues a request to a random server. Each virtual millisecond, the simulator determines which requests have completed and has the corresponding clients make another request. This runs for a fixed virtual time, after which we determine the total throughput and the per-server throughput.

The simulator as-is hardcodes the scenario that I described above (1 server with 1s per request, 99 servers with 1ms per request) and a 10-minute simulation. For the specific numbers I gave earlier, the results are:

    $ node sim.js
simulated time:           600000 ms
total client concurrency: 10000
servers:
     1 server that completes requests with latency 1000 ms (starting with "server_0")
    99 servers that complete requests with latency 1 ms (starting with "server_1")

server_0     5455039 requests (  9092 rps)
server_1     5462256 requests (  9104 rps)
server_2     5459316 requests (  9099 rps)
server_3     5463211 requests (  9105 rps)
server_4     5463885 requests (  9106 rps)
server_5     5456999 requests (  9095 rps)
...
server_95    5457743 requests (  9096 rps)
server_96    5459207 requests (  9099 rps)
server_97    5458421 requests (  9097 rps)
server_98    5458234 requests (  9097 rps)
server_99    5456471 requests (  9094 rps)
overall:   545829375 requests (909715 rps, expect 9097 rps per server)

In my simulations, server_0 is generally a bit slower than the others, but within 0.2% of the overall average. The longer I run the simulation, the closer it gets to the mean. Given that the slow server is three orders of magnitude slower per request (1s vs. 1ms), I’d say it’s fair to conclude that the per-server throughput does not vary among servers when one server is slow.
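
If you’d rather not run node.js, the loop is simple enough to sketch in C. This is my rough equivalent of the simulation just described (not the actual sim.js), with the same scenario hardcoded:

#include <stdio.h>
#include <stdlib.h>

#define NSERVERS        100
#define NCLIENTS        10000
#define DURATION        600000          /* simulated milliseconds */

/* latency: server 0 takes 1000 ms per request, the rest take 1 ms */
static long
latency(int server)
{
        return (server == 0 ? 1000 : 1);
}

int
main(void)
{
        static long donetime[NCLIENTS];         /* completion time per client */
        static int target[NCLIENTS];            /* server handling the request */
        static long completed[NSERVERS];        /* completions per server */
        long t, total = 0;
        int i;

        srand(12345);

        /* at t = 0, every client issues a request to a random server */
        for (i = 0; i < NCLIENTS; i++) {
                target[i] = rand() % NSERVERS;
                donetime[i] = latency(target[i]);
        }

        for (t = 1; t <= DURATION; t++) {
                for (i = 0; i < NCLIENTS; i++) {
                        if (donetime[i] != t)
                                continue;

                        /* request completed; immediately issue another */
                        completed[target[i]]++;
                        total++;
                        target[i] = rand() % NSERVERS;
                        donetime[i] = t + latency(target[i]);
                }
        }

        for (i = 0; i < NSERVERS; i++)
                printf("server_%-3d %9ld requests (%6ld rps)\n",
                    i, completed[i], completed[i] / (DURATION / 1000));
        printf("overall:   %9ld requests (%6ld rps)\n",
            total, total / (DURATION / 1000));
        return (0);
}

It should produce per-server numbers in the same ballpark as the sim.js output above; the inner loop is brute force, so give it a little while to run.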

Conclusions

The main result here is that in this system, the per-server throughput is the same for all servers, even when servers vary significantly in how fast they process requests. While the implementations of servers may be independent (i.e., they share no common hardware or software components), their behavior is not independent: the performance of one server depends on that of other servers! The servers have essentially been coupled to each other via the clients.

This result implies another useful one: in a system like this one, when overall throughput is degraded due to one or more poorly-performing servers, you cannot tell from throughput alone which server is slow — or even how many slow servers there are! You could determine which server was slow by looking at per-server latency.

Note too that even when this system is degraded by the slow server, the vast majority of requests complete very quickly. At the same time, clients spend the vast majority of their time waiting for the slow server. (We’ve come to refer to this at Joyent as “getting Amdahl’d”, after Amdahl’s Law.)

If you found this result surprising (as I did), then another takeaway is that the emergent behavior of even fairly simple systems can be surprising. This is important to keep in mind when making changes to the system (either as part of development or in production, as during incident response). Our intuition often leads us astray, and there’s no substitute for analysis.

What about those assumptions?

The assumptions we made earlier are critical to our result. Let’s take a closer look.

  • “There are a fixed number of clients.” This is true of some systems, but certainly not all of them. I expect the essential result holds as long as client count is not a function of server performance, which probably is true for many systems. However, if client software responds to poor performance by either reducing concurrency (figuring they might be overloading the server) or increasing it (to try to maintain a given throughput level), the behavior may well be different.
  • “All requests take about the same amount of work to process, both on the client-side and server-side.” This is true of some systems, but not others. When the costs differ between requests, I’d expect the actual per-client and per-request throughput to vary and make the result shown here harder to see, but it doesn’t change the underlying result.
  • “The per-request cost outside the server is negligible.” This isn’t true of real systems, but makes the math and simulation much simpler. I expect the results would hold as long as the per-request client and transit costs were fixed and small relative to the server cost (which is quite often true).
  • “The backend servers are infinitely scalable. That is, we’ll assume that if they take 1ms to process 1 request, they’ll take 1ms to process 10 requests issued at the same time.” This is the most dubious of the assumptions, and it’s obviously not true in the limit. However, for compute-bound services, this is not an unreasonable approximation up to the concurrency limit of the service (e.g., the number of available threads or cores). For I/O bound services, this is also a reasonable approximation up to a certain point, since disks often work this way. (For disks, it may not take much longer to do several read or write I/Os — even to spinning disks — than to do one such I/O. The OS can issue them all to the disk, and the disk can schedule them so they happen in one rotation of the platter. It’s not always so perfect as that, but it remains true that you can increase the concurrent I/O for disks up to a point without significantly increasing latency.)
  • “Each request is dispatched to one of the servers uniformly at random.” Many client libraries (e.g., for RPC) use either a random-assignment policy like this one or a round-robin policy, which would have a similar result. For systems that don’t, this result likely won’t hold. In particular, a more sophisticated policy that keeps track of per-server latency might prefer the faster servers. That would send less work to the slower server, resulting in better overall throughput and an imbalance of work across servers.

As you can see, the assumptions we made were intended to highlight this particular effect — namely, the way a single slow server becomes a bottleneck in clients that affects throughput on other servers. If you have a real-world system that looks at all similar to this one and doesn’t have a more sophisticated request assignment policy, then most likely this effect is present to some degree, but it may be harder to isolate relative to other factors.

DNS server startup script security issue openindiana

We’ve identified a security issue in the BIND DNS server startup scripts. Users who run the BIND DNS server should upgrade to the very latest version, service/network/dns/bind@9.14.3-2018.0.0.0 (BIND 9.14.3), not only to obtain the latest version of BIND, but also to pick up an additional OpenIndiana-specific privilege reduction for BIND processes. Before this version update, BIND ran with privileges beyond what it was supposed to have, which might have allowed easier access to operating system files after a BIND exploit.