Silly tricks with Docker in the JPC Nahum Shalman

Message from 2018: I was going through my blog post drafts and found this post.
I've made a few small tweaks to it that seem to be what I was hoping to add before publishing.
If you've ever wanted to replace your SSH access to a native branded zone with docker exec access, this is the blog post for you. We now return you to the bleeding edge of at least 1 or 2 years ago:

As a quick pre-emptive caveat, this post describes using the Docker CLI tool basically to manage what Joyent calls "infrastructure containers": classic illumos zones with many processes in them, not the kind of containers you typically create and run with Docker. It's a how-to on using the Docker CLI instead of the CloudAPI tools you might usually use in the Joyent Public Cloud (JPC).

I stumbled across this gem in the Joyent docs the other day. It turns out that you can use the Docker CLI tool to create and manage "joyent-minimal" branded zones. Getting Docker set up with your JPC account is beyond the scope of this blog post but it's covered here.

As that first link describes, if instead of a regular image name you provide the uuid of a "smartos" type image it will be used to provision a joyent-minimal branded zone.

Let's fire up a recent image with 128 MB of RAM and a public IP:

docker run -P -d -m 128 --name=tiny 390639d4-f146-11e7-9280-37ae5c6d53d4 /sbin/init  

It will sidestep all the normal zone setup, so we have to manually set the default route:

docker exec -it tiny /bin/bash -c 'route -p add default $(mdata-get sdc:nics | json 1.gateway)'  
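To confirm that the route is in place, you can dump the routing table (assuming netstat is available in the image):

docker exec -it tiny /bin/bash -c 'netstat -rn'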

Thanks to changes I worked on a while back you can import pretty much any service manifest you want quite easily to bring various services online.
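For example, something along these lines should bring sshd online; the manifest path here is only illustrative, so substitute whatever service you actually want:

docker exec -it tiny /bin/bash -c 'svccfg import /lib/svc/manifest/network/ssh.xml && svcadm enable ssh'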

Instead of SSH access, use the docker cli to log in to your container:

docker exec -it tiny /bin/bash  

Experiment with installing packages and enabling any necessary SMF services; have fun!

And of course, don't forget to delete this zone when you're done with it:

docker rm -f tiny  

Run Docker images on SmartOS ~drscream

This feature is available in the SkyLime SmartOS version because we merged the changes from an existing issue into our branch to support Docker Registry version 2. We did this because most existing Docker images are only available via version 2 of the registry API, which leaves you with far fewer usable images if you only support version 1. With this change Docker Registry version 1 is no longer supported, which is the biggest drawback if you already have version 1 images.


Configure imgadm to add docker hub sources:

$ imgadm sources --add-docker-hub

imgadm avail doesn't work against the Hub, so you'll have to search the Hub manually. But you can import images simply via the imgadm import command:

$ imgadm import busybox

Show installed docker images:

$ imgadm list --docker

UUID                                  REPOSITORY                             TAG  IMAGE_ID      CREATED
6357e9ab-0e79-5a0d-697b-b528d925026a  konradkleine/docker-registry-frontend  -    sha256:9976b  2017-10-11T23:50:25Z
5de66518-05f1-1ca2-34ee-6c8750a7a4bb  busybox                                -    sha256:0ffad  2017-11-03T22:39:17Z
1a99421d-7df8-23ec-1758-0b46b730aa1f  registry                               -    sha256:f792f  2017-12-01T22:15:41Z

Configure personal docker registry

Import the official image for the docker registry:

$ imgadm import registry

Install and activate the docker registry with the vmadm command. To do so, store the following file on your SmartOS machine, for example as /opt/docker-registry.json:

  "alias": "docker-registry",
  "hostname": "",
  "image_uuid": "1a99421d-7df8-23ec-1758-0b46b730aa1f",
  "nics": [
      "nic_tag": "admin",
      "primary": true,
      "ips": [ "" ],
      "gateways": [ "" ]
  "brand": "lx",
  "docker": "true",
  "kernel_version": "3.13.0",
  "max_physical_memory": 1024,
  "maintain_resolvers": true,
  "resolvers": [
  "quota": 10,
  "internal_metadata": {
    "docker:cmd": "[\"/bin/sh\", \"/\", \"/etc/docker/registry/config.yml\"]"

Please modify the ips, gateways and resolvers fields in the JSON manifest.

The docker:cmd is based on the Dockerfile from the repository. The image_uuid needs to be set to the latest version you've downloaded via imgadm. If you need to verify it, run:

$ imgadm list --docker

Create and run the container:

$ vmadm create -f /opt/docker-registry.json
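A quick way to confirm the zone was created (the fields here are standard vmadm list columns):

$ vmadm list -o uuid,alias,state,brand

Note the UUID; you'll need it later for zlogin and for finding the log files.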

The configuration file shipped in the docker container shows that the registry will listen on port 5000.
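To verify the registry is actually answering, you can query the standard Docker Registry v2 API from any host that can reach it; 192.0.2.10 below is just a placeholder for the IP you configured in the manifest:

$ curl http://192.0.2.10:5000/v2/_catalog

A freshly created registry should respond with an empty repository list, i.e. {"repositories":[]}.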

Provide web interface for personal docker registry

This could be easily done with an image provided on Docker Hub.

$ imgadm import konradkleine/docker-registry-frontend:v2

Save the following manifest, which describes the setup of the zone, for example as /opt/registry-web.json:

  "alias": "docker-registry-web",
  "hostname": "",
  "image_uuid": "6357e9ab-0e79-5a0d-697b-b528d925026a",
  "nics": [
      "nic_tag": "admin",
      "primary": true,
      "ips": [ "" ],
      "gateways": [ "" ]
  "brand": "lx",
  "docker": "true",
  "kernel_version": "3.13.0",
  "max_physical_memory": 1024,
  "maintain_resolvers": true,
  "resolvers": [
  "quota": 10,
  "internal_metadata": {
    "docker:cmd": "[\"/root/\"]",
    "docker:tty": true,
    "docker:attach_stdin": true,
    "docker:attach_stdout": true,
    "docker:attach_stderr": true,
    "docker:open_stdin": true,
    "docker:env": "[ \"ENV_DOCKER_REGISTRY_HOST=\",\"ENV_DOCKER_REGISTRY_PORT=5000\"]",
    "docker:noipmgmtd": true

You should be able to access the web service via the IP you configured.

Docker Registry Frontend on SmartOS
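To give the registry (and therefore the frontend) something to display, you can push an image to it from any machine with a regular docker client. Because this registry speaks plain HTTP, that client will likely need the address listed under insecure-registries in its daemon configuration; 192.0.2.10 again stands in for the IP you configured:

$ docker pull busybox
$ docker tag busybox 192.0.2.10:5000/busybox
$ docker push 192.0.2.10:5000/busybox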


Logfiles for docker images are stored in the zone so you need to look there:

$ cat /zones/${UUID}/logs/stdio.log

You may want to log in with a shell for some debugging:

$ zlogin -i ${UUID} /native/usr/vm/sbin/dockerexec /bin/sh

A brief story of how you shouldn't promote your open source project alp's notes

I'll just leave it here. And I will block any attempt to integrate Pale Moon into our repository, just to protect our developers from such attitude and trolling.

KPTI/KAISER Meltdown Initial Performance Regressions Brendan Gregg's Blog

The recently revealed [Meltdown and Spectre] bugs are not just extraordinary issues of security, but also performance. The patches that workaround Meltdown introduce the largest kernel performance regressions I've ever seen. Many thanks to the engineers working hard to develop workarounds to these processor bugs. In this post I'll look at the Linux kernel page table isolation (KPTI) patches that workaround Meltdown: what overheads to expect, and ways to tune them. Much of my testing was on Linux 4.14.11 and 4.14.12 a month ago, before we deployed in production. Some older kernels have the KAISER patches for Meltdown, and so far the performance overheads look similar. These results aren't final, since more changes are still being developed, such as for Spectre.

Note that there are potentially four layers of overhead for Meltdown/Spectre; this is just one. They are:

1. Guest kernel KPTI patches (this post)
2. Intel microcode updates
3. Cloud provider hypervisor changes (for cloud guests)
4. Retpoline compiler changes

## KPTI Factors

To understand the KPTI overhead, there are at least five factors at play. In summary:

- **Syscall rate**: there are overheads relative to the syscall rate, although high rates are needed for this to be noticeable. At 50k syscalls/sec per CPU the overhead may be 2%, and climbs as the syscall rate increases. At my employer (Netflix), high rates are unusual in cloud, with some exceptions (databases).
- **Context switches**: these add overheads similar to the syscall rate, and I think the context switch rate can simply be added to the syscall rate for the following estimations.
- **Page fault rate**: adds a little more overhead as well, for high rates.
- **Working set size (hot data)**: more than 10 Mbytes will cost additional overhead due to TLB flushing. This can turn a 1% overhead (syscall cycles alone) into a 7% overhead. This overhead can be reduced by A) pcid, available in Linux 4.14, and B) huge pages.
- **Cache access pattern**: the overheads are exacerbated by certain access patterns that switch from caching well to caching a little less well. Worst case, this can add an additional 10% overhead, taking (say) the 7% overhead to 17%.

To explore these I wrote a simple microbenchmark where I could vary the syscall rate and the working set size ([source]). I then analyzed performance during the benchmark (active benchmarking), and used other benchmarks to confirm findings. In more detail:

## 1. Syscall rate

This is the cost of extra CPU cycles in the syscall path. Plotting the percent performance loss vs syscall rate per CPU, for my microbenchmark:

Applications that have high syscall rates include proxies, databases, and others that do lots of tiny I/O. Also microbenchmarks, which often stress-test the system, will suffer the largest losses. Many services at Netflix are below 10k syscalls/sec per CPU, so this type of overhead is expected to be negligible (<0.5%). If you don't know your syscall rate, you can measure it, eg, using [perf]:
sudo perf stat -e raw_syscalls:sys_enter -a -I 1000
This shows the system-wide syscall rate. Divide it by the CPU count (as reported by mpstat, etc) for the per-CPU rate. Then by 1000 for the graph above. Note that this perf stat command causes some overhead itself, which may be noticeable for high syscall rates (>100k/sec/CPU). You can switch to ftrace/mcount to measure it with lower overhead if that is desired. For example, using my [perf-tools]:
sudo ./perf-tools/bin/funccount -i 1 -d 10 '[sS]y[sS]_*'
Then sum the syscall column. I could have taken one measurement and extrapolated most of the above graph based on a model, but it's good to double check that there are no hidden surprises. The graph is mostly as expected, except the lower right, which shows variance and missing data points: these missing points are due to slightly negative values that are elided by the logarithmic scale. It would be easy to dismiss this all as a 0.2% error margin, magnified by the logarithmic scale, but I think much of it is part of the profile which changes at this syscall rate (between 5k and 10k syscalls/sec/CPU). This will be more obvious in the next sections, where I'll explore it further.

My microbenchmark calls a fast syscall in a loop, along with a user-level loop to simulate an application working set size. By gradually increasing the time in the user-level loop from zero, the syscall rate decreases as more cycles are consumed in a CPU-bound user-mode thread. This gives us a spectrum from a high syscall rate of >1M/sec, down to a low syscall rate and mostly user time (>99%). I then measured this syscall range with and without the KPTI patches (by setting nopti, and also running older and newer kernels to double check). I collected a CPU profile as a [CPU flame graph] for both systems, but they were boring for a change: the extra cycles were just in the syscall code, as one would expect from reading the [KPTI changes]. To understand the overhead further I'll need to use instruction-level profiling, such as perf annotate, and PMC (Performance Monitoring Counter) analysis of the CPU cycles (more on this later).

## 2. Context Switch & Page Fault Rate

These are tracked by the kernel and easily read via /proc or sar(1):
# sar -wB 1
Linux 4.14.12-virtual (bgregg-c5.9xl-i-xxx)       02/09/2018      _x86_64_     (36 CPU)

05:24:51 PM    proc/s   cswch/s
05:24:52 PM      0.00 146405.00

05:24:51 PM  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s %vmeff
05:24:52 PM      0.00      0.00      2.00      0.00     50.00      0.00      0.00      0.00   0.00
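If sar isn't installed, the same counters can also be read straight from /proc; a minimal sketch (these are running totals since boot, so sample them twice and divide the deltas by the interval):

grep '^ctxt' /proc/stat                        # context switches (cswch)
grep -E '^pgfault|^pgmajfault' /proc/vmstat    # page faults (fault, majflt)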
As with syscalls, the higher these rates are, the higher the overheads. I'd add the context switch rate (cswch/s) to the syscall rate for the estimation graphs on this page (normalized per-CPU).

## 3. Working Set Size (hot data)

Now my microbenchmark simulates a working set size (WSS) – an area of frequently accessed memory – by reading 100 Mbytes of data in a loop and striding by the cacheline size. Performance gets much worse for lower syscall rates:
Note the overhead "jump" between 10k and 5k syscalls/sec/CPU? The characteristics of this jump depend on the instance type, processor, and workload, and it can appear at different points with different magnitudes: this is showing a c5.9xl with a 1.8% jump, but on an m4.2xl the jump is around 50k syscalls/sec/CPU and is much larger: 10% (shown here). I'll analyze this in the next section. Here, we just want to look at the overall trend: much worse performance given a working set.

Will your overhead look like this graph, or the earlier one, or worse? That depends on your working set size: this graph is for 100 Mbytes, and the earlier one was zero. See my previous post on [working set size estimation] and the full [website]. My guess is that 100 Mbytes is a large-ish case of the memory needed syscall-to-syscall.

Linux 4.14 introduced [pcid support], which improves performance provided the processor also has pcid (which looks to be common in EC2). Gil Tene wrote a post explaining why [PCID is now a critical performance/security feature on x86]. It's also possible to use huge pages to improve performance further (either transparent huge pages, THP, which are easy to set up but have had problems in older versions with compaction; or explicit huge pages, which should perform better). The following graph summarizes the options:
For this microbenchmark, huge pages improved performance so much – despite KPTI – that it turned the performance loss into a performance gain. The logarithmic axis has elided the negative points, but here they are on a linear axis, zoomed:
Let's say your server was doing 5k syscalls/sec/CPU, and you suspect you have a large working set, similar to this 100 Mbyte test. On recent LTS Linux (4.4 or 4.9) with KPTI (or KAISER) patches, the performance overhead would be about 2.1%. Linux 4.14 has pcid support, so that overhead becomes about 0.5%. With huge pages, the overhead becomes a gain of 3.0%.

A quick look at TLB PMCs explains much of this overhead: here I'm using tlbstat, a quick tool I hacked up in my [pmc-cloud-tools] repository. The following results look at a single point in the above graph (on a different system with full PMCs) where the worst case overhead from no KPTI to KPTI without pcid was 8.7%.
nopti:
# tlbstat -C0 1
2835385    2525393     0.89 291125     4645       6458       187         0.23  0.01
2870816    2549020     0.89 269219     4922       5221       194         0.18  0.01
2835761    2524070     0.89 255815     4586       4993       157         0.18  0.01

pti, nopcid:
# tlbstat -C0 1
2873801    2328845     0.81 6546554    4474231    83593      63481       2.91  2.21
2863330    2326694     0.81 6506978    4482513    83209      63480       2.91  2.22
2864374    2329642     0.81 6500716    4496114    83094      63577       2.90  2.22

pti, pcid:
# tlbstat -C0 1
2862069    2488661     0.87 359117     432040     6241       9185        0.22  0.32
2855214    2468445     0.86 313171     428546     5820       9092        0.20  0.32
2869416    2488607     0.87 334598     434110     6011       9208        0.21  0.32

pti, pcid + thp:
# tlbstat -C0 1
2863464    2594463     0.91 2601       298946     57         6215        0.00  0.22
2845726    2568730     0.90 3330       295951     42         6156        0.00  0.22
2872419    2597000     0.90 2746       298328     64         6211        0.00  0.22
The last two columns show cycles where a data TLB or instruction TLB walk was active in at least one PMH (page miss handler). The first two outputs show TLB details for the 8.7% performance loss, and the extra TLB walk cycles added up to 4.88% of all cycles. The remaining outputs show the introduction of PCID, and the addition of THP. PCID reduces both types of TLB walks, bringing the data walks similar to the pre-KPTI levels. Instruction walks are still elevated.

The final output shows the difference huge pages make: data TLB walks are now zero. Instruction walks are still elevated, and I can see from /proc/PID/smaps that the instruction text is not using huge pages: I'll try to fix that with more tuning, which should improve performance even further.

Just to show how bad it gets: this is the first point on the graph where the overhead was over 800%, showing the non-KPTI vs KPTI without pcid systems:
nopti:
# tlbstat -C0 1
2854768    2455917     0.86 565        2777       50         40          0.00  0.00
2884618    2478929     0.86 950        2756       6          38          0.00  0.00
2847354    2455187     0.86 396        297403     46         40          0.00  0.00

pti, nopcid:
# tlbstat -C0 1
2875793    276051      0.10 89709496   65862302   787913     650834     27.40 22.63
2860557    273767      0.10 88829158   65213248   780301     644292     27.28 22.52
2885138    276533      0.10 89683045   65813992   787391     650494     27.29 22.55
2532843    243104      0.10 79055465   58023221   693910     573168     27.40 22.63
Half the CPU cycles have page walks active. I've never seen TLB pain this bad, ever. Just the IPC (instructions per cycle) alone tells you something bad is happening: dropping from 0.86 to 0.10, which is relative to the performance loss. I still recommend including IPC along with (not instead of) any %CPU measurement, so that you really know what the cycles are, as I discussed in [CPU Utilization is Wrong].

## 4. Cache Access Pattern

Depending on the memory access pattern and working set size, an additional 1% to 10% overhead can occur at a certain syscall rate. This was seen as the jump in the earlier graph. Based on PMC analysis and the description of the changes, one suspected factor is additional page table memory demand on the KPTI system, causing the workload to fall out of a CPU cache sooner. Here are the relevant metrics:
This shows that the performance loss jump corresponds to a similar drop in last-level cache (LLC) references (first two graphs): that is not actually interesting, as the lower reference rate is expected from a lower workload throughput. What is interesting is an abrupt drop in LLC hit ratio, from about 55% to 50%, which does not happen without the KPTI patches (the last nopti graph, which shows a small LLC hit ratio improvement). This sounds like the extra KPTI page table references have pushed the working set out of a CPU cache, causing an abrupt drop in performance.

I don't think a 55% to 50% drop in LLC hit ratio can fully explain a 10% performance loss alone. Another factor is at play that would require additional PMCs to analyze; however, this target is an m4.16xl where PMCs are restricted to the architectural set. I described this in [The PMCs of EC2]: the architectural set is better than nothing, and a year ago we had nothing. Late last year, EC2 gained two more options for PMC analysis: the [Nitro hypervisor], which provides all PMCs (more or less), and the bare metal instance type (currently in public preview). My earlier TLB analysis was on a c5.9xl, a Nitro hypervisor system. Unfortunately, the "jump" on that system is only about 1%, making it harder to spot outside the normal variance.

In short, there are some additional memory overheads with KPTI that can cause workloads to drop out of CPU cache a little sooner.

## Sanity Tests

Putting these graphs to the test: a MySQL OLTP benchmark ran with 75k syscalls/sec/CPU (600k syscalls/sec across 8 CPUs), and is expected to have a large working set (so it more closely resembles the 100 Mbyte working set test than the 0 Mbyte one). The graph estimates the performance loss of KPTI (without huge pages or pcid) to be about 4%. The measured performance loss was 5%. On a 64-CPU system, with the same workload and same per-CPU syscall rate, the measured performance loss was 3%. Other tests were similarly close, as has been the production roll out.

The test that least matched the previous graphs was an application with a stress test driving 210k syscalls/sec/CPU + 27k context-switches/sec/CPU, and a small working set (~25 Mbytes). The estimated performance loss should have been less than 12%, but it was 25%. To check that there wasn't an additional overhead source, I analyzed this and saw that the 25% was due to TLB page walk cycles, the same overheads studied earlier. I'd guess the large discrepancy was because that application workload was more sensitive to TLB flushing than my simple microbenchmark.

## Fixing Performance

### 1. Linux 4.14 & pcid

I mentioned this earlier, and showed the difference in the graph. Get on 4.14, at least, with CPUs that support pcid (check /proc/cpuinfo).

### 2. Huge Pages

I mentioned this earlier too. I won't summarize how to configure huge pages here with all their caveats, since that's a huge topic.

### 3. Syscall Reductions

If you were at the more painful end of the performance loss spectrum due to a high syscall rate, then an obvious move is to analyze what those syscalls are and look for ways to eliminate some. This used to be routine for systems performance analysis many years ago, but more recently the focus has been on user-mode wins.

There are many ways to analyze syscalls. Here are several, ordered from most to least overhead:

1. strace
2. perf record
3. perf trace
4. sysdig
5. perf stat
6. bcc/eBPF
7. ftrace/mcount

The fastest is ftrace/mcount, and I already had an example earlier from my [perf-tools] repo for counting syscalls. Summarizing 10 seconds:
# ./perf-tools/bin/funccount -d 10 '[sS]y[sS]_*'
Tracing "[sS]y[sS]_*" for 10 seconds...

FUNC                              COUNT
SyS_epoll_wait                        1
SyS_exit_group                        1
SyS_fcntl                             1
SyS_ftruncate                         1
SyS_newstat                          56
SyS_mremap                           62
SyS_rt_sigaction                     73
SyS_select                         1895
SyS_read                           1909
SyS_clock_gettime                  3791
SyS_rt_sigprocmask                 3856

Ending tracing...
Here's the same using my bcc/eBPF version of funccount and the syscall tracepoints, which was only possible since Linux 4.14 thanks to Yonghong Song's [support] for this:
# /usr/share/bcc/tools/funccount -d 10 't:syscalls:sys_enter*'
Tracing 310 functions for "t:syscalls:sys_enter*"... Hit Ctrl-C to end.

FUNC                                    COUNT
syscalls:sys_enter_nanosleep                1
syscalls:sys_enter_newfstat                 3
syscalls:sys_enter_mmap                     3
syscalls:sys_enter_inotify_add_watch        9
syscalls:sys_enter_poll                    11
syscalls:sys_enter_write                   61
syscalls:sys_enter_perf_event_open        111
syscalls:sys_enter_close                  152
syscalls:sys_enter_open                   157
syscalls:sys_enter_bpf                    310
syscalls:sys_enter_ioctl                  395
syscalls:sys_enter_select                2287
syscalls:sys_enter_read                  2445
syscalls:sys_enter_clock_gettime         4572
syscalls:sys_enter_rt_sigprocmask        4572
Now that you know the most frequent syscalls, look for ways to reduce them. You can use other tools to inspect their arguments and stack traces (eg: using perf record, or kprobe in perf-tools, trace in bcc, etc), and look for optimization opportunities. This is performance engineering 101.

## Conclusion and Further Reading

The KPTI patches to mitigate Meltdown can incur massive overhead, anything from 1% to over 800%. Where you are on that spectrum depends on your syscall and page fault rates, due to the extra CPU cycle overheads, and your memory working set size, due to TLB flushing on syscalls and context switches. I described and analyzed these in this post. Practically, I'm expecting the systems on the cloud at my employer to experience between 0.1% and 6% overhead with KPTI due to our syscall rates, and I'm expecting we'll take that down to less than 2% with tuning: using 4.14 with pcid support, huge pages (which can also provide some gains), syscall reductions, and anything else we find. This is only one out of four potential sources of overhead from Meltdown/Spectre: there's also cloud hypervisor changes, Intel microcode, and compilation changes. These KPTI numbers are also not final, as Linux is still being developed and improved.

Some related reading and references:

- Dave Hansen's kernel notes on the overhead
- Linux initial KPTI patch code
- Meltdown and Spectre
- Reading privileged memory with a side-channel
- Red Hat's guide to performance impacts
- PostgreSQL performance testing
- Meltdown Status by Greg Kroah-Hartman
- pcid support
- PCID is now a critical performance/security feature on x86, by Gil Tene
- perf-tools
- perf
- The PMCs of EC2
- pmc-cloud-tools
- CPU Utilization is Wrong
- WSS estimation

[Dave Hansen's kernel notes on the overhead]:
[Linux initial KPTI patch code]:
[KPTI changes]:
[Meltdown and Spectre]:
[Reading privileged memory with a side-channel]:
[Red Hat's guide to performance impacts]:
[PostgreSQL performance testing]:
[Meltdown Status in Linux by Greg Kroah-Hartman]:
[pcid support]:
[PCID is now a critical performance/security feature on x86]:
[perf-tools]:
[perf]: /perf.html
[The PMCs of EC2]:
[Nitro hypervisor]:
[CPU flame graph]: /FlameGraphs/cpuflamegraphs.html
[pmc-cloud-tools]:
[CPU Utilization is Wrong]:
[working set size estimation]:
[website]: /wss.html
[support]:
[source]:

Talks I have given The Observation Deck

Increasingly, people have expressed the strange urge to binge-watch my presentations. This potentially self-destructive behavior seems likely to have unwanted side-effects like spontaneous righteous indignation, superfluous historical metaphor, and near-lethal exposure to tangential anecdote — and yet I find myself compelled to enable it by collecting my erstwhile scattered talks. While this blog entry won’t link to every talk I’ve ever given, there should be enough here to make anyone blotto!

To accommodate the more recreational watcher as well as the hardened addict, I have also broken my talks up into a series of trilogies, with each following a particular subject area or theme. In the future, as I give talks that become available, I will update this blog entry. And if you find that a link here is dead, please let me know!

Before we get to the list: if you only watch one talk of mine, please watch Principles of Technology Leadership (slides) presented at Monktoberfest 2017. This is the only talk that I have asked family and friends to watch, as it represents my truest self — or what I aspire that self to be, anyway.

The talks

Talks I have given, in reverse chronological order:

Trilogies of talks

As with anyone, there are themes that run through my career. While I don’t necessarily give talks in explicit groups of three, looking back on my talks I can see some natural groupings that make for related sequences of talks.

The Software Values Trilogy

In late 2016 and through 2017, it felt like fundamental values like decency and integrity were under attack; it seems appropriate that these three talks were born during this turbulent time:

The Debugging Trilogy

While certainly not the only three talks I’ve given on debugging, these three talks present a sequence on aspects of debugging that we don’t talk about as much:

The Beloved Trilogy

A common theme across my Papers We Love and Systems We Love talks is (obviously?) an underlying love for the technology. These three talks represent a trilogy of beloved aspects of the system that I have spent two decades in:

The Open Source Trilogy

While my career started developing proprietary software, I am blessed that most of it has been spent in open source. This trilogy reflects on my experiences in open source, from the dual perspective of both a commercial entity and as an individual contributor:

The Container Trilogy

I have given many (too many!) talks on containers and containerization, but these three form a reasonable series (with hopefully not too much overlap!):

The DTrace Trilogy

Another area where I have given many more than three talks, but these three form a reasonable narrative:

The Surge Lightning Trilogy

For its six year run, Surge was a singular conference — and the lightning talks were always a highlight. My lightning talks were not deliberately about archaic Unixisms, it just always seemed to work out that way — an accidental narrative arc across several years.

Operating system materials alp's notes

Rust Pointers for C Programmers Josef "Jeff" Sipek

I’ve been eyeing Rust for about a year now. Here and there, I tried to use it to make a silly little program, or to implement some simple function in it to see for myself how ergonomic it really was, and what sort of machine code rustc spit out. But last weekend I found a need for a tool to clean up some preprocessor mess, and so instead of hacking together some combination of shell and Python, I decided to write it in Rust.

From my earlier attempts, I knew that there are a lot of different “pointers” but I found all the descriptions of them lacking or confusing. Specifically, Rust calls itself a systems programming language, yet I found no clear description of how the different pointers map to C—the systems programming language. Eventually, I stumbled across The Periodic Table of Rust Types, which made things a bit clearer, but I still didn’t feel like I truly understood.

During my weekend expedition to Rust land, I think I’ve grokked things enough to write this explanation of how Rust does things. As always, feedback is welcomed.

I’ll describe what happens in terms of C. To keep things simple, I will:

  • assume that you are well-versed in C
  • assume that you can read Rust (any intro will teach you enough)
  • not bother with const for the C snippets
  • not talk about mutability

In the following text, I assume that we have some struct T. The actual contents don’t matter. In other words:

struct T {
	/* some members */
};

With that out of the way, let’s dive in!

*const T and *mut T

These are raw pointers. In general, you shouldn’t use them since only unsafe code can dereference them, and the whole point of Rust is to write as much safe code as possible.

Raw pointers are just like what you have in C. If you make a pointer, you end up using sizeof(struct T *) bytes for the pointer. In other words:

struct T *ptr;

&T and &mut T

These are borrowed references. They use the same amount of space as raw pointers and behave the exact same way in the generated machine code. Consider this trivial example:

pub fn raw(p: *mut usize) {
    unsafe {
        *p = 5;
    }
}

pub fn safe(p: &mut usize) {
    *p = 5;
}

A rustc invocation later, we have:

    raw:     55                 pushq  %rbp
    raw+0x1: 48 89 e5           movq   %rsp,%rbp
    raw+0x4: 48 c7 07 05 00 00  movq   $0x5,(%rdi)
    raw+0xb: 5d                 popq   %rbp
    raw+0xc: c3                 ret    

    safe:     55                 pushq  %rbp
    safe+0x1: 48 89 e5           movq   %rsp,%rbp
    safe+0x4: 48 c7 07 05 00 00  movq   $0x5,(%rdi)
    safe+0xb: 5d                 popq   %rbp
    safe+0xc: c3                 ret    

Note that the two functions are bit-for-bit identical.
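If you want to reproduce this comparison yourself, one way (not necessarily the invocation the author used; ptrs.rs is a made-up file name holding the two functions above) is to emit an object file and disassemble it:

rustc -O --crate-type=lib --emit=obj ptrs.rs -o ptrs.o
objdump -d ptrs.o

The symbol names will be mangled and the exact instructions will vary with compiler version and flags, but the bodies of raw() and safe() should still be identical to each other.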

The only differences between borrowed references and raw pointers are:

  1. references will never point at bogus addresses (i.e., they are never NULL or uninitialized),
  2. the compiler doesn’t let you do arbitrary pointer arithmetic on references,
  3. the borrow checker will make you question your life choices for a while.

(#3 gets better over time.)


Box&lt;T&gt;

These are owned “pointers”. If you are a C++ programmer, you are already familiar with them. Never having truly worked with C++, I had to think about this a bit until it clicked, but it is really easy.

No matter what all the documentation and tutorials out there say, Box<T> is not a pointer but rather a structure containing a pointer to heap allocated memory just big enough to hold T. The heap allocation and freeing is handled automatically. (Allocation is done in the Box::new function, while freeing is done via the Drop trait, but that’s not relevant as far as the memory layout is concerned.) In other words, Box<T> is something like:

struct box_of_T {
	struct T *heap_ptr;
};

Then, when you make a new box you end up putting only what amounts to sizeof(struct T *) on the stack and it magically starts pointing to somewhere on the heap. In other words, the Rust code like this:

let x = Box::new(T { ... });

is roughly equivalent to:

struct box_of_T x;

x.heap_ptr = malloc(sizeof(struct T));
if (!x.heap_ptr)
	/* handle allocation failure */;

*x.heap_ptr = ...;

&[T] and &mut [T]

These are borrowed slices. This is where things get interesting. Even though it looks like they are just references (which, as stated earlier, translates into a simple C-style pointer), they are much more. These types of references use fat pointers—that is, a combination of a pointer and a length.

struct fat_pointer_to_T {
	struct T *ptr;
	size_t nelem;
};

This is incredibly powerful, since it allows bounds checking at runtime and getting a subset of a slice is essentially free!

&[T; n] and &mut [T; n]

These are borrowed references to arrays. They are different from borrowed slices. Since the length of an array is a compile-time constant (the compiler will yell at you if n is not a constant), all the bounds checking can be performed statically. And therefore there is no need to pass around the length in a fat pointer. So, they are passed around as plain ol’ pointers.

struct T *ptr;

T, [T; n], and [T]

While these aren’t pointers, I thought I’d include them here for completeness’s sake.


T

Just like in C, a struct uses as much space as its type requires (i.e., the sum of the sizes of its members plus padding).

[T; n]

Just like in C, an array of structs uses n times the size of the struct.


[T]

The simple answer here is that you cannot make a [T]. That actually makes perfect sense when you consider what that type means. It is saying that we have some variable sized slice of memory that we want to access as elements of type T. Since this is variable sized, the compiler cannot possibly reserve space for it at compile time and so we get a compiler error.

The more complicated answer involves the Sized trait, which I’ve skillfully managed to avoid thus far and so you are on your own.


That was a lot of text, so I decided to compact it and make the following table. In the table, I assume that our T struct is 100 bytes in size. In other words:

/* Rust */
struct T {
    stuff: [u8; 100],
}

/* C */
struct T {
	uint8_t stuff[100];
};

Now, the table in its full glory:

Rust                          C                            Size on the stack
----                          -                            -----------------

Struct
let x: T;                     struct T x;                  100 bytes

Raw pointer
let x: *const T;              struct T *x;                 sizeof(struct T *)
let x: *mut T;

Reference
let x: &T;                    struct T *x;                 sizeof(struct T *)
let x: &mut T;

Owned pointer (Box)
let x: Box<T>;                struct box_of_T x;           sizeof(struct T *)

Array of 2
let x: [T; 2];                struct T x[2];               200 bytes

Reference to an array of 2
let x: &[T; 2];               struct T *x;                 sizeof(struct T *)

A slice
let x: [T];                   struct T x[];                unknown at compile time

A reference to a slice
let x: &[T];                  struct fat_pointer_to_T x;   sizeof(struct fat_pointer_to_T)

(struct box_of_T and struct fat_pointer_to_T are the helper structs defined earlier: a single pointer, and a pointer plus a length, respectively.)

A word of caution: I assume that the sizes of the various pointers are actually implementation details and shouldn’t be relied on to be that way. (Well, with the exception of raw pointers - without those being fixed FFI would be unnecessarily complicated.)

I didn’t cover str, &str, String, and Vec<T> since I don’t consider them fundamental types, but rather convenience types built on top of slices, structs, references, and boxes.

Anyway, I hope you found this useful. If you have any feedback (good or bad), let me know.

Why I'm Boycotting Crypto Currencies /dev/dump

Unless you've been living under a rock somewhere, you probably have heard about the crypto currency called "Bitcoin". Lately it has skyrocketed in "value", and a number of other currencies based on similar mathematics have also arisen. Collectively, these are termed cryptocurrencies.

The idea behind them is fairly ingenious: by solving "hard" problems (in terms of mathematics), the currency can limit how many "coins" are introduced into the economy. Both the math and the social experiment behind them look really interesting on paper.

The problem is that the explosion of value has created a number of problems, and as a result I won't be accepting any of these forms of currencies for the foreseeable future.

First, the market for each of these currencies is controlled by a relatively small number of individuals who own a majority of the outstanding "coins". By colluding, these individuals can generate "fake" transactions, which appear to drive up demand for the coins and thus lead to a higher "value" (in terms of what people might be willing to pay). The problem is that this is a "bubble", and the bottom will fall right out if enough people try to sell their coins for hard currency. As a result, I believe that the value of the coins is completely artificial, and while a few people might convert some of these coins into hard cash for a nice profit, the majority of coin holders are going to be left out in the cold.

Second, the "cost" of performing transactions for some of these currencies is becoming prohibitively expensive. With most transactions of real currency, it's just a matter of giving someone paper currency, or running an electronic transaction that normally completes in milliseconds. Because of the math associated with cryptocurrencies, the work to sign block chains becomes prohibitive, such that for some currencies transactions can take a long time -- and processors are now nearly obliged to charge what would be extortionary rates just to cover their own costs (in terms of electricity and processing power used).

The environmental impact, and monumental waste, caused by cryptocurrencies cannot be overstated. We now have huge farms of machines running, consuming vast amounts of power, performing no useful work except to "mine" coins. As time goes on, the amount of work needed to mine each coin grows significantly (an intentional aspect of the coin), which means we are burning large amounts of power (much of which is fossil-fuel generated!) to perform work that has no useful practical purpose. Some might say something similar about mining precious metals or gems, but there are many real practical applications for metals like gold, silver, and platinum, and for gems like diamonds and rubies as well.

Finally, as anyone who wants to build a new PC has probably realized, the cost of computing hardware, and specifically of "GPUs" (graphics processing units, which can also be used to solve many numerical problems in parallel), has increased dramatically -- consumer grade GPUs are generally only available today for about 2x-3x their MSRPs. This is because the "miners" of cryptocurrencies have snapped up every available GPU. The upshot is that the supply of this hardware has become prohibitive for hobbyists and professionals alike. Indeed, much of this hardware would be far better used in HPC arenas, where it could be applied to real-world problems like genomic research towards finding a cure for cancer, or protein folding, or any number of other interesting and useful problems whose solutions would benefit mankind as a whole. It would not surprise me if a number of new HPC projects have been canceled or put on hold simply because the supply of suitable GPU hardware has been exhausted, putting some of those projects out of budget reach.

Eventually, when the bottom does fall out of those cryptocurrencies, all that GPU hardware will probably wind up filling landfills, as many people won't want to buy used GPUs, which may (or may not) have had their lifespans shortened. (One hopes that at least the eWaste this causes will be recycled, but we know that much eWaste winds up in landfills in third world countries.)

Crypto-currency mining is probably one of the most self-serving and irresponsible (to humanity and our environment) activities one can undertake today while still staying within the confines of the law (except in a few jurisdictions which have sensibly outlawed cryptocurrencies).

It's my firm belief that the world would be far better off if crypto-currencies had never been invented.