Docker, Go and USDT Staring at the C

We have what should be a simple task: we’re on CentOS 7, and we want to deploy a Go binary that will have USDT tracepoints. USDT is an attractive option for a few debugging purposes. It allows applications to define tracepoints with higher levels of stability and semantic meaning than more ad-hoc methods like dynamic uprobes.

Usage of USDT tracepoints tends to have a different focus from other monitoring techniques like logging, Prometheus, OpenTracing etc. These might identify a general issue such as a poor latency metric: you’d then use USDT probes to dig further into the problems in a production system, to identify precisely what’s happening at a particular endpoint or whatever.

USDT in Go

The normal model for USDT involves placing the trace points at specific places in the binary: they are statically defined and built, but dynamically enabled. This is typically done via the DTRACE_PROBE() family of macros.

The only (?) USDT facility for Go is salp. This uses libstapsdt under the hood. This library dynamically creates probes at runtime, even though Go is a compiled language. Yes, this is dynamic static dynamic tracing.

We’re going to use salpdemo in our experiment. This has two USDT probes, p1 and p2 that we’d like to be able to dynamically trace, using bcc-tools’ handy trace wrapper. CentOS 7 doesn’t appear to have support for the later USDT support in perf probe.

Setting up a Docker container for dynamic tracing

For a few different reasons, we’d like to be able to trace from inside the container itself. This has security implications, given what’s implemented today, but bear in mind we’re on CentOS 7, so even if there’s a finer-grained current solution, there’s a good chance it wouldn’t work here. In reality, we would probably use an ad-hoc debugging sidecar container, but we’re going to just use the one container here.

First, we’re going to deploy the container with ansible for convenience:

    $ cat hosts
localhost ansible_connection=local
$ cat playbook.yml
---

- hosts: localhost
  become: yes
  tasks:
    - docker_container:
        name: usdt_test
        image: centos:7
        state: started
        command: sleep infinity
        network_mode: bridge
        ulimits:
          - memlock:8192000:8192000
        capabilities:
          - sys_admin
        volumes:
          - /sys/kernel/debug:/sys/kernel/debug
$ ansible-playbook -i hosts ./playbook.yml

Note that we’re using sleep infinity here to keep our container running so we can play around.

We need the sys_admin capability to be able to program the probes, and the BPF compiler needs the locked memory limit bumping. We also need to mount /sys/kernel/debug read-write (!) in order to be able to write to /sys/kernel/debug/tracing/uprobe_events.

Now let’s install everything we need to be able to trace these probes:

    $ docker exec -it usdt_test yum -y install \
    kernel-devel-$(uname -r) kernel-$(uname -r) bcc-tools

Yes, it’s a lot, but unavoidable. You can, in theory, use mounted volumes for the kernel sources, as described here; however, the read-only mounts break packaging inside the container, so we’re not doing that here.

Tracing the probes in the container

The above was a big hammer, but we should be good to go right? Let’s start up the demo binary:

    $ docker cp ~/salpdemo usdt_test:/root/
$ docker exec -it usdt_test bash
[root@8ccf34663dd2 /]# ~/salpdemo &
[1] 18166
 List the go probes in this demo with
        sudo tplist -vp "$(pgrep salpdemo)" "salp-demo*"
Trace this process with
        sudo trace -p "$(pgrep salpdemo | head -n1)" 'u::p1 "i=%d err=`%s` date=`%s`", arg1, arg2, arg3' 'u::p2 "j=%d flag=%d", arg1, arg2'
        or
        sudo trace -p "$(pgrep salpdemo | head -n1)" 'u::p1 (arg1 % 2 == 0) "i=%d err='%s'", arg1, arg2'

We can indeed list the probes:

    [root@8ccf34663dd2 /]# /usr/share/bcc/tools/tplist -vp $(pgrep salpdemo) | head
salp-demo:p1 [sema 0x0]
  1 location(s)
  3 argument(s)
salp-demo:p2 [sema 0x0]
  1 location(s)
  2 argument(s)
libc:setjmp [sema 0x0]
...

So let’s try the suggested trace invocation:

    # /usr/share/bcc/tools/trace -p "$(pgrep salpdemo | head -n1)" 'u::p1 (arg1 % 2 == 0) "i=%d err='%s'", arg1, arg2'

perf_event_open(/sys/kernel/debug/tracing/events/uprobes/p__tmp_salp_demo_I8qitQ_so_0x270_18166_bcc_18175/id): Invalid argument
Failed to attach BPF to uprobe

Huh. This doesn’t seem to be a permissions issue, since we got EINVAL. In addition, running from the host has the same problem.

I haven’t proved it, but I think our basic issue here is that Centos 7 is missing this kernel fix:

tracing/uprobe: Add support for overlayfs

I spent way too long trying to work around this by placing the binary somewhere other than overlayfs, before I finally dug a little bit more into how libstapsdt actually works, and figured out the problem.

Working around overlayfs and libstapsdt

To build probes dynamically at runtime, libstapsdt does something slightly crazy: it generates a temporay ELF shared library at runtime that contains the USDT probes and uses dlopen() to bring it into the running binary. Let’s have a look:

    [root@8ccf34663dd2 /]# grep salp-demo /proc/$(pgrep salpdemo)/maps
7fa9373b5000-7fa9373b6000 r-xp 00000000 fd:10 1506373                    /tmp/salp-demo-I8qitQ.so
7fa9373b6000-7fa9375b5000 ---p 00001000 fd:10 1506373                    /tmp/salp-demo-I8qitQ.so
7fa9375b5000-7fa9375b6000 rwxp 00000000 fd:10 1506373                    /tmp/salp-demo-I8qitQ.so

The process has mapped in this temporary file, named after the provider. It’s on /tmp, hence overlay2 filesystem, explaining why moving the salpdemo binary itself around made no difference.

So maybe we can be more specific?

    [root@8ccf34663dd2 /]# /usr/share/bcc/tools/trace -p "$(pgrep salpdemo | head -n1)" 'u:/tmp/salp-demo-I8qitQ.so:p1 (arg1 % 2 == 0) "i=%d err='%s'", arg1, arg2'
perf_event_open(/sys/kernel/debug/tracing/events/uprobes/p__tmp_salp_demo_I8qitQ_so_0x270_18166_bcc_18188/id): Invalid argument
Failed to attach BPF to uprobe

Still not there yet. The above bug means that it still can’t find the uprobe given the binary image path. What we really need is the host path of this file. We can get this from Docker:

    $ docker inspect usdt_test | json -a GraphDriver.Data.MergedDir
/data/docker/overlay2/77c1397db72a7f3c7ba3f8af6c5b3824dc9c2ace9432be0b0431a2032ea93bce/merged

This is not good, as obviously we can’t reach this path from inside the container. Hey, at least we can run it on the host though.

    $ sudo /usr/share/bcc/tools/trace 'u:/data/docker/overlay2/77c1397db72a7f3c7ba3f8af6c5b3824dc9c2ace9432be0b0431a2032ea93bce/merged/tmp/salp-demo-I8qitQ.so:p1 (arg1 % 2 == 0) "i=%d err='%s'", arg1, arg2'
Event name (p__data_docker_overlay2_77c1397db72a7f3c7ba3f8af6c5b3824dc9c2ace9432be0b0431a2032ea93bce_merged_tmp_salp_demo_I8qitQ_so_0x270) is too long for buffer
Failed to attach BPF to uprobe

SIGH. Luckily, though:

    $ sudo /usr/share/bcc/tools/trace 'u:/data/docker/overlay2/77c1397db72a7f3c7ba3f8af6c5b3824dc9c2ace9432be0b0431a2032ea93bce/diff/tmp/salp-demo-I8qitQ.so:p1 (arg1 % 2 == 0) "i=%d err='%s'", arg1, arg2'
PID     TID     COMM            FUNC             -
19862   19864   salpdemo        p1               i=64 err=An error: 64
19862   19864   salpdemo        p1               i=66 err=An error: 66

It worked! But it’s not so great: we wanted to be able to trace inside a container. If we mounted /data/docker itself inside the container, we could do that, but it’s still incredibly awkward.

Using tmpfs?

Instead, can we get the generated file onto a different filesystem type? libstapsdt hard-codes /tmp which limits our options.

Let’s start again with /tmp inside the container on tmpfs:

    $ tail -1 playbook.yml
        tmpfs: /tmp:exec

We need to force on exec mount flag here: otherwise, we can’t dlopen() the generated file. Yes, not great for security again.

    $ docker exec -it usdt_test bash
# ~/salpdemo &
...
[root@1f56af6e7bee /]# /usr/share/bcc/tools/trace -p "$(pgrep salpdemo | head -n1)" 'u::p1 "i=%d err=`%s` date=`%s`", arg1, arg2, arg3' 'u::p2 "j=%d flag=%d", arg1, arg2'
PID     TID     COMM            FUNC             -

Well, we’re sort of there. It started up, but we never get any output. Worse, we get the same if we try this in the host now! I don’t know what the issue here is.

Using a volume?

Let’s try a volume mount instead:

    $ tail -3 playbook.yml
        volumes:
          - /sys/kernel/debug:/sys/kernel/debug
          - /tmp/tmp.usdt_test:/tmp

If we run trace in the host now, we can just use u::p1:

    $ sudo /usr/share/bcc/tools/trace -p "$(pgrep salpdemo | head -n1)" 'u::p1 "i=%d err=`%s` date=`%s`", arg1, arg2, arg3' 'u::p2 "j=%d flag=%d", arg1, arg2'
PID     TID     COMM            FUNC             -
6864    6866    salpdemo        p2               j=120 flag=1
...

But we still need a bit of a tweak inside our container:

    # /usr/share/bcc/tools/trace -p "$(pgrep salpdemo | head -n1)" 'u::p1 "i=%d err=`%s` date=`%s`", arg1, arg2, arg3'
PID     TID     COMM            FUNC             -
<no output>
    [root@d72b822cab0f /]# cat /proc/$(pgrep salpdemo | head -n1)/maps | grep /tmp/salp-demo*.so | awk '{print $6}' | head -n1
/tmp/salp-demo-6kcugm.so
[root@d72b822cab0f /]# /usr/share/bcc/tools/trace -p  "$(pgrep salpdemo | head -n1)" 'u:/tmp/salp-demo-6kcugm.so:p1 "i=%d err=`%s` date=`%s`", arg1, arg2, arg3'
PID     TID     COMM            FUNC             -
11593   11595   salpdemo        p1               i=-17 err=`An error: -17` date=`Thu, 06 Aug 2020 13:12:57 +0000`
...

I don’t have any clear idea why the name is required inside the container context, but at least, finally, we managed to trace those USDT probes!

Running a Zabbix server in an OmniOS zone OmniOS Community Edition

This guide shows how to get Zabbix up and running within a zone on an OmniOS system. Zabbix is an open-source monitoring system and is available from the OmniOS Extra repository.

Zone setup

I’m going to use the lightweight sparse zone brand for this so start by making sure that it is installed:

        % pfexec pkg install brand/sparse
No updates necessary for this image.

If the brand is not already installed, then there will be more output from the above command.

Create a new sparse zone called zabbix. My preference is to configure the IP stack within the zone configuration which results in it being automatically applied to the zone when it’s booted, and enables additional protection against settings such as the IP address from being changed from within the zone. Note that, like all zones, the zone path must be a direct descendant of a ZFS dataset. On my system, /zones is the mount point for such a dataset.

        % pfexec zonecfg -z zabbix
zabbix: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:zabbix> create -t sparse
zonecfg:zabbix> set zonepath=/zones/zabbix
zonecfg:zabbix> add net
zonecfg:zabbix:net> set physical=zabbix0
zonecfg:zabbix:net> set global-nic=igb0
zonecfg:zabbix:net> set allowed-address=172.27.10.35/24
zonecfg:zabbix:net> set defrouter=172.27.10.254
zonecfg:zabbix:net> end
zonecfg:zabbix> add attr
zonecfg:zabbix:attr> set name=resolvers; set type=string; set value=1.1.1.1
zonecfg:zabbix:attr> end
zonecfg:zabbix> add attr
zonecfg:zabbix:attr> set name=dns-domain; set type=string; set value=omnios.org
zonecfg:zabbix:attr> end
zonecfg:zabbix:attr> verify; commit; exit

By default, the zone’s boot environment will encompass all of the files and directories within. For an application such as Zabbix, it’s important to create a dedicated area to hold files such as the underlying database which should be consistent across different boot environments.

My system has a ZFS pool called data so I’m going to create a new dataset under that and delegate it to the zone. I’m also going to change the mount point for the dataset to /data so that it appears there within the zone.

        % pfexec zfs create data/zabbix
% pfexec zonecfg -z zabbix 'add dataset; set name=data/zabbix; end'

% pfexec zfs umount data/zabbix
% pfexec zfs set mountpoint=/data data/zabbix
% pfexec zfs set zoned=on data/zabbix

Now it’s time to install the zone. Being a sparse zone, this will be pretty quick - only around 5MiB of files are actually installed.

        % pfexec zoneadm -z zabbix install
A ZFS file system has been created for this zone.

       Image: Preparing at /zones/zabbix/root.
Sanity Check: Looking for 'entire' incorporation.
   Publisher: Using omnios (https://pkg.omnios.org/r151034/core/).
   Publisher: Using extra.omnios (https://pkg.omnios.org/r151034/extra/).
       Cache: Using /var/pkg/publisher.
  Installing: Packages (output follows)
Packages to install: 203
Mediators to change:   5
 Services to change:   6

DOWNLOAD                                PKGS         FILES    XFER (MB)   SPEED
Completed                            203/203     1485/1485      4.9/4.9      --

PHASE                                          ITEMS
Installing new actions                     5927/5927
Updating package state database                 Done
Updating package cache                           0/0
Updating image state                            Done
Creating fast lookup database                   Done
 Postinstall: Copying SMF seed repository ... done.
        Done: Installation completed in 16.942 seconds.

Let’s boot the zone and log in:

        % pfexec zoneadm -z zabbix boot
% pfexec zlogin zabbix
[Connected to zone 'zabbix' pts/17]
OmniOS 5.11 omnios-r151034-831ff8e83b   July 2020
root@zabbix#

Since this is the first boot, it will take a minute for all of the service manifests to be imported. Watch the output of svcs -x until nothing is returned:

        root@zabbix# svcs -x
root@zabbix#

Check Internet connectivity and DNS:

        root@zabbix# ping 1.1.1.1
1.1.1.1 is alive
root@zabbix# ping google.com
google.com is alive

and check the delegated dataset:

        root@zabbix# df -h /data
Filesystem             Size   Used  Available Capacity  Mounted on
data/zabbix           3.51T    42K      1.57T     1%    /data

Database

The OmniOS Zabbix package needs a Postgres database for storage. This is one of the things that should be stored on the dedicated dataset that was delegated to the zone.

Create a new ZFS filesystem for the database. For a Postgres database, it’s recommended to set the filesystem recordsize to 8K, and to set the log bias mode to throughput, as shown here. Also for security, executable, setuid and device files are explicitly disabled on the filesystem.

        root@zabbix# zfs create data/zabbix/db
root@zabbix# zfs set recordsize=8k data/zabbix/db
root@zabbix# zfs set logbias=throughput data/zabbix/db
root@zabbix# zfs set exec=off data/zabbix/db
root@zabbix# zfs set devices=off data/zabbix/db
root@zabbix# zfs set setuid=off data/zabbix/db

This new dataset inherits the mountpoint from the filesystem:

        root@zabbix# df -h | grep data/zab
data/zabbix           3.51T    42K      1.57T     1%    /data
data/zabbix/db        3.51T    42K      1.57T     1%    /data/db

Install the zabbix server package, which will automatically install the correct version of Postgres:

        root@zabbix# pkg install zabbix-server
           Packages to install:  6
           Mediators to change:  1
            Services to change:  4
       Create boot environment: No
Create backup boot environment: No

DOWNLOAD                                PKGS         FILES    XFER (MB)   SPEED
Completed                                6/6     3082/3082    22.7/22.7  1.2M/s

PHASE                                          ITEMS
Installing new actions                     4013/4013
Updating package state database                 Done
Updating package cache                           0/0
Updating image state                            Done
Creating fast lookup database                   Done
Updating package cache                           2/2

Set up the initial database in the dedicated ZFS dataset:

        root@zabbix# chown postgres /data/db
root@zabbix# chmod 0700 /data/db
root@zabbix# svccfg -s postgresql12:default \
        setprop application/datadir = /data/db
root@zabbix# svcadm refresh postgresql12:default
root@zabbix# cd /data/db
root@zabbix:/data/db# sudo -u postgres /opt/ooce/pgsql-12/bin/initdb -D .
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "C".
The default database encoding has accordingly been set to "SQL_ASCII".
The default text search configuration will be set to "english".

Data page checksums are disabled.
...
Success. You can now start the database server using:

Start the database service using svcadm:

        root@zabbix# cd
root@zabbix# svcadm enable postgresql12
root@zabbix# svcs postgresql12
STATE          STIME    FMRI
online         12:32:12 svc:/ooce/database/postgresql12:default

Create the zabbix database user - enter a password to secure the account when prompted.

        root@zabbix# sudo -u postgres createuser --pwprompt zabbix
Enter password for new role:
Enter it again:

Create the database and import initial data:

        root@zabbix# sudo -u postgres createdb -O zabbix \
        -E Unicode -T template0 zabbix
root@zabbix# cd /opt/ooce/zabbix/sql
root@zabbix:/opt/ooce/zabbix/sql# cat schema.sql images.sql data.sql \
        | sudo -u zabbix psql zabbix
... lots of output, not shown here ...
COMMIT

Web interface

Zabbix comes with a web interface written in PHP. I’m going to use the nginx web server to serve this over HTTP.

Install packages:

        root@zabbix# pkg install nginx php-74
...
DOWNLOAD                                PKGS         FILES    XFER (MB)   SPEED
Completed                              10/10       517/517    22.9/22.9  2.2M/s
...

Edit the /etc/opt/ooce/nginx/nginx.conf file and replace the example server block in there with the following:

            server {
        listen       80;
        server_name  localhost;
        root /opt/ooce/zabbix/ui;
        index index.php;
        location ~ \.php$ {
                try_files $uri =404;
                fastcgi_pass unix:/var/opt/ooce/php/run/www-7.4.sock;
                fastcgi_index index.php;
                include fastcgi.conf;
        }
    }

A few PHP settings need to be tweaked for proper Zabbix operation. This example sets the time zone to UTC but you can set it to local time if you prefer.

        root@zabbix# cd /etc/opt/ooce/php-7.4/
root@zabbix# sed -i '/post_max_size/s/=.*/= 16M/' php.ini
root@zabbix# sed -i '/execution_time/s/=.*/= 300/' php.ini
root@zabbix# sed -i '/input_time/s/=.*/= 300/' php.ini
root@zabbix# sed -i '/date.timezone/s/.*/date.timezone = UTC/' php.ini

Grant PHP permissions to manage the zabbix UI configuration file:

        root@zabbix# chown php /opt/ooce/zabbix/ui/conf

Enable PHP and the web server:

        root@zabbix# svcadm enable nginx php74
root@zabbix# svcs nginx php74
STATE          STIME    FMRI
online         12:39:30 svc:/network/http:nginx
online         12:39:30 svc:/application/php74:default

Start the Zabbix services:

        root@zabbix# svcadm enable zabbix:server zabbix:agent
root@zabbix# svcs zabbix
STATE          STIME    FMRI
online         12:40:00 svc:/network/zabbix:server
online         12:40:08 svc:/network/zabbix:agent

You should now be able to point a web browser at the server and go through the initial Zabbix setup process:

Zabbix installer

On the database screen, set the type to Postgres via localhost. Enter the password that you set earlier during database creation:

Zabbix database

Once you get back to the login screen, enter Admin with a password of zabbix to get started:

Zabbix login


Any problems or questions, please get in touch.

How to determine PXE mac address when booting illumos via PXELinux/iPXE Minimal Solaris

In illumos, if you need to determine the interface which was used for booting via PXE then it's possible to use "boot-mac" property:

# /sbin/devprop -s boot-mac 
c:c4:7a:04:ef:2c
But this property is set by illumos pxeboot. On some setup we use PXELinux to boot multiple illumos clients over PXE. For any illumos distribution "append" line in pxelinux.cfg looks like:
label omni PXE
kernel mboot.c32
append omni7/platform/i86pc/kernel/amd64/unix -B install_media=http://192.168.1.1/kayak.zfs.xz,install_config=http://192.168.1.1 ---omni7/miniroot
If you have small amount of clients, then it's possible to just add each client's mac address to the kernel line with -B boot-mac=<hardware-address>, but it doesn't work in case you have a hundreds of clients. 

Pxelinux menu has "ipappend 2" option, which appends "BOOTIF=<hardware-address-of-boot-interface>" to the kernel command line, but pxelinux puts BOOTIF exactly at the end of "append" line, after boot_archive, and kernel does not recognise this variable after boot. There are no any way to set something like -B BOOTIF dynamically here. 

Fortunately, we can boot iPXE from pxelinux menu. DHCP configuration was updated to allow iPXE boot when ipxe.lkrn boot:
...
if exists user-class and option user-class = "iPXE" {
filename "menu.ipxe";
} else {
filename "pxelinux.0";
}

pxelinux.cfg/default:
...
label omni7
kernel ipxe.lkrn
menu.ipxe:
#!ipxe

kernel omni7/platform/i86pc/kernel/amd64/unix -B boot-mac=${netX/mac},install_media=http://192.168.1.1/omni7/kayak.zfs.xz,install_config=http://192.168.1.1/omni7
initrd omni7/miniroot
boot
iPXE allows to get mac address with ${netX/mac} variable, so "boot-mac" will contain mac-address which was used for booting via PXE.

OmniOS Community Edition r151030bl, r151032al, r151034l OmniOS Community Edition

OmniOS weekly releases for w/c 20th of July 2020 are now available.

This update requires a reboot for r151034.

For r151034 only:

  • Some 64-bit PCI devices were not programmed correctly
  • The mlxcx driver has been updated to fix several problems
  • Panic in imc driver on some systems with broken firmware
  • LX: /proc/<pid>/exe symlink was not always present
  • LX: Would occasionally see defunct processes with busy parents
  • LX: Improve support for networking setup in Void linux zones
  • loader could not read a ZFS pool which had a removed slog device
  • vioblk devices could hang under memory pressure
  • It was not possible to run bhyve in the global zone under a DEBUG kernel (although this is possible for testing, bhyve should always be run within a zone for proper protection)
  • Added a depend.ooceextra facet to illumos packages to allow installation without reference to the omnios-extra repository
  • Updated curl to 7.71.1

Additionally, the following packages have been updated for all supported releases:

  • openjdk to to 1.8.0_262
  • rsync to 3.2.2

For further details, please see https://omniosce.org/releasenotes


Any problems or questions, please get in touch.

Customizing EC2 instance storage and networking with the AWS CLI The Trouble with Tribbles...

I use AWS to run illumos quite a bit, either with Tribblix or OmniOS.

Creating EC2 instances with the console is fine for one-offs, but gets a bit tedious. So using the AWS CLI offers a better route, with the ec2 run-instances command.

Yes. there are things like templates and terraform and all sorts of other options. For whatever reason, they don't work in all cases.

In particular, the reasons you might want to customize an instance if you're running illumos might be slightly different than a more traditional usage model.

For storage, there are a couple of customizations we might want. The first is that the AMI has a fairly small root disk, which we might want to make larger. We may be adding zones, with their root filesystems installed on the system pool. We may be adding swap (while anonymous reservation means applications like java don't need to write to swap, you still need space backing the swap space to be available). For the second, there's the fact that we might actually want to use EBS to provide local storage (so we can use ZFS, for example, which has data integrity and manageability benefits).

To automate the enlargement of the root pool, I create a mapping file that looks like this:

[
  {
    "DeviceName": "/dev/xvda",
    "Ebs": {
      "VolumeSize": 12,
      "Encrypted": true
    }
  }
]

The size is in Gigabytes. The /dev/xvda is the normal device name (from EC2, clearly in illumos we have a different naming). If that's in a file called storage.json, then the argument to the ec2 run-instances command is:

--block-device-mappings file://storage.json

Once the instance is running, that will normally (on my instances) show up on c2t0d0, and the rpool can be expanded to use all the available space with the following command:

zpool online -e rpool c2t0d0

To add an additional device, to keep application storage separate, in addition to that enlargement, would involve a json file like:

[
  {
    "DeviceName": "/dev/xvda",
    "Ebs": {
      "VolumeSize": 12,
      "Encrypted": true
    }
  },
  {
    "DeviceName": "/dev/sdf",
    "Ebs": {
      "VolumeSize": 256,
      "DeleteOnTermination": false,
      "Encrypted": true
    }
  }
]

On my instances, I always use /dev/sdf, which comes out as c2t5d0.

For networking, I often end up with multiple IP addresses. This is because we have zones - rather than create multiple EC2 instances, it's far more efficient to run applications in zones on a single system, but then you want to assign each zone its own IP address.

You would think - supported by the documentation - that the --secondary-private-ip-addresses flag to ec2 run-instances would do the job. You would be wrong. That flag, actually, is supposed to just be a convenient shortcut for what I'm about to describe, but it doesn't actually work. (And terraform doesn't support this customization either - it can handle additional IP addresses, but not on the same interface as the primary.)

To configure multiple IP addresses we again turn to a json file. This looks like:

[
  {
    "DeviceIndex": 0,
    "DeleteOnTermination": true,
    "SubnetId": "subnet-0abcdef1234567890",
    "Groups": ["sg-01234567890abcdef"],
    "PrivateIpAddresses": [
      {
        "Primary": true,
        "PrivateIpAddress": "10.15.32.12"
      },
      {
        "Primary": false,
        "PrivateIpAddress": "10.15.32.101"
      }
    ]
  }
]

You have to define (SubnetId) the subnet you're going to use, and (Groups) the security group that will be applied - these belong to the network interface, not to the instance (in the trivial case there's no difference). So you don't specify the security group(s) or the subnet as regular arguments. Then I define two IP addresses (you can have as many as you like), one is set as the primary ("Primary": true), all the others will be secondary ("Primary": false). Again, if this is in a file network.json you feed that to the command like

--network-interfaces file://network.json

One other thing I found is that you can add tags to the instance (and to EBS volumes) at creation, saving you the effort of having to go through and tag things later. It's slightly annoying that it doesn't seem to allow you to apply different tags to different volumes, you can just say "apply these tags to the instance" and "apply these tags to the volumes". The trick is that the example in the documentation is wrong (it has single quotes, which you don't need and don't work).

So the tag specification looks like:

--tag-specifications \
ResourceType=instance,Tags=[{Key=Name,Value=aws123a}] \ ResourceType=volume,Tags=[{Key=Name,Value=aws123a}]

In the square brackets, you can have multiple comma-separated key-value pairs. We have tags marking projects and roles so you have a vague idea of what's what.

Putting this all together you end up with a command like:

aws ec2 run-instances \
--region eu-west-2 \
--image-id ami-01a1a1a1a1a1a1a1a \
--instance-type t2.micro \
--key-name peter-key \
--network-interfaces file://network.json \
--count 1 \
--block-device-mappings file://storage.json \
--disable-api-termination \
--tag-specifications \
ResourceType=instance,Tags=[{Key=Name,Value=aws123a}] \
ResourceType=volume,Tags=[{Key=Name,Value=aws123a}]

Of course, I don't write either the json files or the command invocation by hand. I have a script that knows what all my AMIs and availability zones and subnets and security groups are and does the right thing for each instance I want to build.

OmniOS Community Edition r151030bi, r151032ai, r151034i OmniOS Community Edition

OmniOS weekly releases for w/c 29th of June 2020 are now available.

  • For all supported OmniOS releases, curl has been updated to fix two vulnerabilities (CVE-2020-8169 and CVE-2020-9177)

For further details, please see https://omniosce.org/releasenotes


Any problems or questions, please get in touch.

OmniOS Community Edition r151030bh, r151032ah, r151034h OmniOS Community Edition

This week’s update for all stable OmniOS versions includes an update to the Intel CPU microcode files to mitigate the new Crosstalk class of data leakage CPU vulnerability.

For r151034, there is also a fix for handling of correctly memory errors and an enhancement to lx-branded zones to better support newer Linux distributions.

For further details, please see https://omniosce.org/releasenotes


Any problems or questions, please get in touch.

Java: trying out String deduplication and the G1 garbage collector The Trouble with Tribbles...

As of 8u20, java supports automatic String deduplication.

-XX:+UseG1GC -XX:+UseStringDeduplication

You need to use the G1 garbage collector, and it will do the dedup as you scan the heap. Essentially, it checks each String and if the backing char[] array is the same as one it's already got, it merges the references.

Obviously, this could save memory if you have a lot of repeated strings.

Consider my illuminate utility. One of the thing it does is parse the old SVR4 packaging contents file. That's a big file, and there's a huge amount of duplication - while the file names are obviously unique, things like the type of file, permissions, owner, group, and names of packages are repeated many times. So, does turning this thing on make a difference?

Here's the head of the class histogram (produced by jcmd pid GC.class_histogram).

First without:

 num     #instances         #bytes  class name
----------------------------------------------
   1:       2950682      133505088  [C
   2:       2950130       70803120  java.lang.String
   3:        862390       27596480  java.util.HashMap$Node
   4:        388539       21758184  org.tribblix.illuminate.pkgview.ContentsFileDetail

and now with deduplication:

 num     #instances         #bytes  class name
----------------------------------------------
   1:       2950165       70803960  java.lang.String
   2:        557004       60568944  [C
   3:        862431       27597792  java.util.HashMap$Node
   4:        388539       21758184  org.tribblix.illuminate.pkgview.ContentsFileDetail

Note that there's the same number of entries in the contents file (there's one ContentsFileDetail for each line), and essentially the same number of String objects. But the [C, which is the char[] backing those Strings, has fallen dramatically. You're saving about a third of the memory used to store all that String data.

This also clearly demonstrates that the deduplication isn't on the String objects, those are unchanged, but on the char[] arrays backing those Strings.

Even more interesting is the performance. This is timing of a parser before:

real        1.730556446
user        7.977604040
sys         0.251854581

and afterwards:

real        1.469453551
user        6.054787878
sys         0.407259095

That's actually a bit of a surprise: G1GC is going to have to do work to do the comparisons to see if the strings are the same, and do some housekeeping if they are. However, with just the G1GC on its own, without deduplication, we get a big performance win:

real        1.217800287
user        3.944160155
sys         0.362586413

Therefore, for this case, G1GC is a huge performance benefit, and the deduplication takes some of that performance gain and trades it for memory efficiency.

For the illuminate GUI, without G1GC:

user       10.363291056
sys         0.393676741

and with G1GC:

user        8.151806315
sys         0.401426176

(elapsed time isn't meaningful here as you're waiting for interaction to shut it down)

The other thing you'll sometime see in this context is interning Strings. I tried that, it didn't help at all.

Next, with a little more understanding of what was going on, I tried some modifications to the code to reduce the cost of storing all those Strings.


I did tweak my contents file reader slightly, to break lines up using a simple String.split() rather than using a StringTokenizer. (The java docs recommend you don't use StringTokenizer any more, so this is also a bit of modernization.) I don't think the change of itself makes any difference, but it's slightly less work to simply ignore fields in an array from String.split() than call nextToken() to skip over the ones you don't want.

Saving the size and mtime as long - primitive types - saves a fair amount of memory too. Each String object is 24 bytes plus the content, so the saving is significant. And given that any uses will be of the numerical value, we may as well convert up front.

The ftype is only a single character. So storing that as a char avoids an object, saving space, and they're automatically interned for us.

That manual work gave me about another 10% speedup. What about memory usage?

Using primitive types rather than String gives us the following class histogram:

 num     #instances         #bytes  class name
----------------------------------------------
   1:       1917289      102919512  [C
   2:       1916938       46006512  java.lang.String
   3:        862981       27615392  java.util.HashMap$Node
   4:        388532       24866048  org.tribblix.illuminate.pkgview.ContentsFileDetail
So, changing the code gives almost the same memory saving as turning on String deduplication, without any performance hit.

There are 3 lessons here:

  1. Don't use Strings to store what could be primitive types if you can help it
  2. Under some (not all) circumstances, the G1 garbage collector can be a huge win
  3. When you're doing optimization occasionally the win you get isn't the one you were looking for

ctags, vim and C Staring at the C

Going to the first matching tag in vim with Control-] can be rather annoying. The exuberant-ctags secondary sort key is the filename, not the tag kind. If you have a struct type that’s also a common member name, you’re forced into using :tselect to find the struct instead of all the members. Most of the time, the struct definition is what you want.

To avoid this issue, I sort the tags file such that any kind == "s" entries come first for that tag. It’s a little annoying due to the format of the file, but it does work:

#!/bin/bash

# ctags, but sub-sorted such that "struct request" comes first, rather than
# members with the same name.

# we can't use "-f -", as that elides the TAG_FILE_SORTED preamble
ctags -R -f tags.$$

awk '
BEGIN {
   FS="\t"
   entry=""
   struct=""
   buf=""
}

$1 != entry {
   if (entry != "") {
           printf("%s%s", struct, buf);
   }
   entry=$1;
   struct="";
   buf="";
}

/^.*"\ts/ {
   struct=struct $0 "\n"
   next
}

$1 == entry {
   buf=buf $0 "\n"
}

END {
   printf("%s%s", struct, buf);
}' <tags.$$ >tags

rm tags.$$

Tracing Kernel Functions: How the illumos AMD64 FBT Provider Intercepts Function Calls Z In ASCII - Writing

A line-by-line breakdown of how the illumos AMD64 FBT provider intercepts function calls.