Linux Load Averages: Solving the Mystery Brendan Gregg's Blog

Load averages are an industry-critical metric – my company spends millions auto-scaling cloud instances based on them and other metrics – but on Linux there's some mystery around them. Linux load averages track not just runnable tasks, but also tasks in the uninterruptible sleep state. Why? I've never seen an explanation. In this post I'll solve this mystery, and summarize load averages as a reference for everyone trying to interpret them.

Linux load averages are "system load averages" that show the running thread (task) demand on the system as an average number of running plus waiting threads. This measures demand, which can be greater than what the system is currently processing. Most tools show three averages, for 1, 5, and 15 minutes:

$ uptime
 16:48:24 up  4:11,  1 user,  load average: 25.25, 23.40, 23.46

top - 16:48:42 up  4:12,  1 user,  load average: 25.25, 23.14, 23.37

$ cat /proc/loadavg 
25.72 23.19 23.35 42/3411 43603
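The first three fields are the 1, 5, and 15 minute averages; the fourth is the number of currently runnable scheduling entities over the total number that exist, and the fifth is the most recently created PID. As a quick sanity check of demand versus CPU capacity, you can divide the 1 minute average by the CPU count (a small sketch, assuming nproc is available):

$ awk -v ncpu="$(nproc)" '{ printf "1 min load per CPU: %.2f\n", $1 / ncpu }' /proc/loadavg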
Some interpretations:

- If the averages are 0.0, then your system is idle.
- If the 1 minute average is higher than the 5 or 15 minute averages, then load is increasing.
- If the 1 minute average is lower than the 5 or 15 minute averages, then load is decreasing.
- If they are higher than your CPU count, then you might have a performance problem (it depends).

As a set of three, you can tell if load is increasing or decreasing, which is useful. They can also be useful when a single value of demand is desired, such as for a cloud auto scaling rule. But to understand them in more detail is difficult without the aid of other metrics. A single value of 23 - 25, by itself, doesn't mean anything, but might mean something if the CPU count is known, and if it's known to be a CPU-bound workload. Instead of trying to debug load averages, I usually switch to other metrics. I'll discuss these in the "Better Metrics" section near the end.

## History

The original load averages show only CPU demand: the number of processes running plus those waiting to run. There's a nice description of this in [RFC 546], titled "TENEX Load Averages", August 1973:
[1] The TENEX load average is a measure of CPU demand. The load average is an average of the number of runnable processes over a given time period. For example, an hourly load average of 10 would mean that (for a single CPU system) at any time during that hour one could expect to see 1 process running and 9 others ready to run (i.e., not blocked for I/O) waiting for the CPU.
The version of this on [ietf.org] links to a PDF scan of a hand drawn load average graph from July 1973, showing that this has been monitored for decades:

source: https://tools.ietf.org/html/rfc546
Nowadays, the source code to old operating systems can also be found online. Here's an excerpt of DEC macro assembly from [TENEX] (early 1970s) SCHED.MAC:
NRJAVS==3               ;NUMBER OF LOAD AVERAGES WE MAINTAIN
GS RJAV,NRJAVS          ;EXPONENTIAL AVERAGES OF NUMBER OF ACTIVE PROCESSES
[...]
;UPDATE RUNNABLE JOB AVERAGES

DORJAV: MOVEI 2,^D5000
        MOVEM 2,RJATIM          ;SET TIME OF NEXT UPDATE
        MOVE 4,RJTSUM           ;CURRENT INTEGRAL OF NBPROC+NGPROC
        SUBM 4,RJAVS1           ;DIFFERENCE FROM LAST UPDATE
        EXCH 4,RJAVS1
        FSC 4,233               ;FLOAT IT
        FDVR 4,[5000.0]         ;AVERAGE OVER LAST 5000 MS
[...]
;TABLE OF EXP(-T/C) FOR T = 5 SEC.

EXPFF:  EXP 0.920043902 ;C = 1 MIN
        EXP 0.983471344 ;C = 5 MIN
        EXP 0.994459811 ;C = 15 MIN
And here's an excerpt from [Linux] today (include/linux/sched/loadavg.h):
#define EXP_1           1884            /* 1/exp(5sec/1min) as fixed-point */
#define EXP_5           2014            /* 1/exp(5sec/5min) */
#define EXP_15          2037            /* 1/exp(5sec/15min) */
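Roughly speaking, and ignoring the kernel's fixed-point arithmetic and sampling details, every five seconds each average is updated as load = load * exp + n * (1 - exp), where n is the sampled number of active tasks and exp is the constant above for that interval. Here's a quick floating-point sketch of the one-minute case, assuming one thread stays busy the whole time:

$ awk 'BEGIN {
    e = exp(-5/60)                       # the "1 minute" constant, 5 second ticks
    load = 0
    for (t = 5; t <= 60; t += 5)
        load = load * e + 1 * (1 - e)    # n = 1: one busy thread
    printf "1 min load after 60 seconds: %.2f\n", load    # ~0.63, i.e. 1 - 1/e
}'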
Linux is also hard coding the 1, 5, and 15 minute constants. There have been similar load average metrics in older systems, including [Multics], which had an exponential scheduling queue average.

## The Three Numbers

These three numbers are the 1, 5, and 15 minute load averages. Except they aren't really averages, and they aren't 1, 5, and 15 minutes. As can be seen in the source above, 1, 5, and 15 minutes are constants used in an equation, which calculates exponentially-damped moving sums of a five second average. The resulting 1, 5, and 15 minute load averages reflect load well beyond 1, 5, and 15 minutes. If you take an idle system, then begin a single-threaded CPU-bound workload (one thread in a loop), what would the one minute load average be after 60 seconds? If it was a plain average, it would be 1.0. Here is that experiment, graphed:

Load average experiment to visualize exponential damping
The so-called "one minute average" only reaches about 0.62 by the one minute mark. For more on the equation and similar experiments, Dr. Neil Gunther has written an article on load averages: [How It Works], plus there are many Linux source block comments in [loadavg.c].

## Linux Uninterruptible Tasks

When load averages first appeared in Linux, they reflected CPU demand, as with other operating systems. But later on Linux changed them to include not only runnable tasks, but also tasks in the uninterruptible state (TASK_UNINTERRUPTIBLE or nr_uninterruptible). This state is used by code paths that want to avoid interruptions by signals, which includes tasks blocked on disk I/O and some locks. You may have seen this state before: it shows up as the "D" state in the output of ps and top. The ps(1) man page calls it "uninterruptible sleep (usually IO)".

Adding the uninterruptible state means that Linux load averages can increase due to a disk (or NFS) I/O workload, not just CPU demand. For everyone familiar with other operating systems and their CPU load averages, including this state is at first deeply confusing.

**Why?** Why, exactly, did Linux do this?

There are countless articles on load averages, many of which point out the Linux nr_uninterruptible gotcha. But I've seen none that explain or even hazard a guess as to why it's included. My own guess would have been that it's meant to reflect demand in a more general sense, rather than just CPU demand.

## Searching for an ancient Linux patch

Understanding why something changed in Linux is easy: you read the git commit history on the file in question and read the change description. I checked the history on loadavg.c, but the change that added the uninterruptible state predates that file, which was created with code from an earlier file. I checked the other file, but that trail ran cold as well: the code itself has hopped around different files. Hoping to take a shortcut, I dumped "git log -p" for the entire Linux github repository, which was 4 Gbytes of text, and began reading it backwards to see when the code first appeared. This, too, was a dead end. The oldest change in the entire Linux repo dates back to 2005, when Linus imported Linux 2.6.12-rc2, and this change predates that.

There are historical Linux repos (here and here), but this change description is missing from those as well. Trying to discover, at least, when this change occurred, I searched tarballs on kernel.org and found that it had changed by 0.99.15, and not by 0.99.13 – however, the tarball for 0.99.14 was missing. I found it elsewhere, and confirmed that the change was in Linux 0.99 patchlevel 14, Nov 1993. I was hoping that the release description for 0.99.14 by Linus would explain the change, but that, too, was a dead end:
"Changes to the last official release (p13) are too numerous to mention (or even to remember)..." – Linus
He mentions major changes, but not the load average change. Based on the date, I looked up the kernel mailing list archives to find the actual patch, but the oldest email available is from June 1995, when the sysadmin writes:
"While working on a system to make these mailing archives scale more effecitvely I accidently destroyed the current set of archives (ah whoops)."
My search was starting to feel cursed. Thankfully, I found some older linux-devel mailing list archives, rescued from server backups, often stored as tarballs of digests. I searched over 6,000 digests containing over 98,000 emails, 30,000 of which were from 1993. But it was somehow missing from all of them. It really looked as if the original patch description might be lost forever, and the "why" would remain a mystery.

## The origin of uninterruptible

Fortunately, I did finally find the change, in a compressed mailbox file from 1993 on [oldlinux.org]. Here it is:
From: Matthias Urlichs <urlichs@smurf.sub.org>
Subject: Load average broken ?
Date: Fri, 29 Oct 1993 11:37:23 +0200


The kernel only counts "runnable" processes when computing the load average.
I don't like that; the problem is that processes which are swapping or
waiting on "fast", i.e. noninterruptible, I/O, also consume resources.

It seems somewhat nonintuitive that the load average goes down when you
replace your fast swap disk with a slow swap disk...

Anyway, the following patch seems to make the load average much more
consistent WRT the subjective speed of the system. And, most important, the
load is still zero when nobody is doing anything. ;-)

--- kernel/sched.c.orig Fri Oct 29 10:31:11 1993
+++ kernel/sched.c  Fri Oct 29 10:32:51 1993
@@ -414,7 +414,9 @@
    unsigned long nr = 0;

    for(p = &LAST_TASK; p > &FIRST_TASK; --p)
-       if (*p && (*p)->state == TASK_RUNNING)
+       if (*p && ((*p)->state == TASK_RUNNING) ||
+                  (*p)->state == TASK_UNINTERRUPTIBLE) ||
+                  (*p)->state == TASK_SWAPPING))
            nr += FIXED_1;
    return nr;
 }
--
Matthias Urlichs        \ XLink-POP N|rnberg   | EMail: urlichs@smurf.sub.org
Schleiermacherstra_e 12  \  Unix+Linux+Mac     | Phone: ...please use email.
90491 N|rnberg (Germany)  \   Consulting+Networking+Programming+etc'ing      42
It's amazing to read the thoughts behind this change from almost 24 years ago. This confirms that the load averages were deliberately changed to reflect demand for other system resources, not just CPUs. Linux changed from "CPU load averages" to what one might call "system load averages". His example of using a slower swap disk makes sense: by degrading the system's performance, the demand on the system (measured as running + queued) should increase. However, load averages decreased because they only tracked the CPU running states and not the swapping states. Matthias thought this was nonintuitive, which it is, so he fixed it.

## Uninterruptible today

But don't Linux load averages sometimes go too high, more than can be explained by disk I/O? Yes, although my guess is that this is due to a new code path using TASK_UNINTERRUPTIBLE that didn't exist in 1993. In Linux 0.99.14, there were 13 codepaths that directly set TASK_UNINTERRUPTIBLE or TASK_SWAPPING (the swapping state was later removed from Linux). Nowadays, in Linux 4.12, there are nearly 400 codepaths that set TASK_UNINTERRUPTIBLE, including some lock primitives. It's possible that one of these codepaths should not be included in the load averages. Next time I have load averages that seem too high, I'll see if that is the case, and if it can be fixed.

I emailed Matthias (for the first time) to ask what he thought about his load average change almost 24 years later. He responded in one hour (as I mentioned on Twitter), and wrote:
"The point of "load average" is to arrive at a number relating how busy the system is from a human point of view. TASK_UNINTERRUPTIBLE means (meant?) that the process is waiting for something like a disk read which contributes to system load. A heavily disk-bound system might be extremely sluggish but only have a TASK_RUNNING average of 0.1, which doesn't help anybody."
(Getting a response so quickly, or even a response at all, really made my day. Thanks!) So Matthias still thinks it makes sense, at least given what TASK_UNINTERRUPTIBLE used to mean. But TASK_UNINTERRUPTIBLE matches more things today. Should we change load averages to be just CPU and disk demand? Scheduler maintainer Peter Zijlstra has already sent me one clever option to explore for doing this: include task_struct->in_iowait in load averages instead of TASK_UNINTERRUPTIBLE, so that it more closely matches disk I/O. It begs another question, however, which is what do we really want? Do we want to measure demand on the system in terms of threads, or just demand for physical resources? If it's the former, then waiting on uninterruptible locks should be included, as those threads are demand on the system. They aren't idle. So perhaps Linux load averages already work the way we want them to.

To better understand uninterruptible code paths, I'd like a way to measure them in action. Then we can examine different examples, quantify time spent in them, and see if it all makes sense.

## Measuring uninterruptible tasks

The following is an [Off-CPU flame graph] from a production server, spanning 60 seconds and showing kernel stacks only, where I'm filtering to only include those in the TASK_UNINTERRUPTIBLE state (SVG). It provides many examples of uninterruptible code paths:

If you are new to off-CPU flame graphs: you can click on frames to zoom in, examining the full stacks which appear as a tower of frames. The x-axis size is proportional to the time spent blocked off-CPU, and the sort order (left to right) has no real meaning. The color is blue for off-CPU stacks (I use warm colors for on-CPU stacks), and the saturation has random variance to differentiate frames. I generated this using my offcputime tool from [bcc] (this tool needs eBPF features from Linux 4.8+), and my [flame graph] software:
# ./bcc/tools/offcputime.py -K --state 2 -f 60 > out.stacks
# awk '{ print $1, $2 / 1000 }' out.stacks | ./FlameGraph/flamegraph.pl --color=io --countname=ms > out.offcpu.svg
I'm using awk to change the output from microseconds to milliseconds. The offcputime "--state 2" matches on TASK_UNINTERRUPTIBLE (see sched.h), and is an option I just added for this post. Facebook's Josef Bacik first did this with his [kernelscope] tool, which also uses bcc and flame graphs. In my examples, I'm just showing the kernel stacks, but offcputime.py supports showing the user stacks as well. As for the flame graph above: it shows that only 926 ms out of 60 seconds were spent in uninterruptible sleep. That's only adding 0.015 to our load averages. It's time in some cgroup paths, but this server is not doing much disk I/O. Here is a more interesting one, this time only spanning 10 seconds (SVG):

The wide tower on the right is showing systemd-journal in proc_pid_cmdline_read() (reading /proc/PID/cmdline), getting blocked, and contributing 0.07 to the load average. And there is a wider page fault tower on the left, that also ends up in rwsem_down_read_failed() (adding 0.23 to the load average). I've highlighted those functions in magenta using the flame graph search feature. Here's an excerpt from rwsem_down_read_failed():
    /* wait to be given the lock */
    while (true) {
        set_task_state(tsk, TASK_UNINTERRUPTIBLE);
        if (!waiter.task)
            break;
        schedule();
    }
This is lock acquisition code that's using TASK_UNINTERRUPTIBLE. Linux has uninterruptible and interruptible versions of mutex acquire functions (eg, mutex_lock() vs mutex_lock_interruptible(), and down() and down_interruptible() for semaphores). The interruptible versions allow the task to be interrupted by a signal, and then wake up to process it before the lock is acquired. Time in uninterruptible lock sleeps usually doesn't add much to the load averages, but in this case it is adding 0.30. If this was much higher, it would be worth analyzing to see if lock contention could be reduced (eg, I'd start digging on systemd-journal and proc_pid_cmdline_read()!), which should improve performance and lower the load average.

Does it make sense for these code paths to be included in the load average? Yes, I'd say so. Those threads are in the middle of doing work, and happen to block on a lock. They aren't idle. They are demand on the system, albeit for software resources rather than hardware resources.

## Decomposing Linux load averages

Can the Linux load average value be fully decomposed into components? Here's an example: on an idle 8 CPU system, I launched tar to archive some uncached files. It spends several minutes mostly blocked on disk reads. Here are the stats, collected from three different terminal windows:
terma$ pidstat -p `pgrep -x tar` 60
Linux 4.9.0-rc5-virtual (bgregg-xenial-bpf-i-0b7296777a2585be1) 	08/01/2017 	_x86_64_	(8 CPU)

10:15:51 PM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
10:16:51 PM     0     18468    2.85   29.77    0.00   32.62     3  tar

termb$ iostat -x 60
[...]
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.54    0.00    4.03    8.24    0.09   87.10

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvdap1            0.00     0.05   30.83    0.18   638.33     0.93    41.22     0.06    1.84    1.83    3.64   0.39   1.21
xvdb            958.18  1333.83 2045.30  499.38 60965.27 63721.67    98.00     3.97    1.56    0.31    6.67   0.24  60.47
xvdc            957.63  1333.78 2054.55  499.38 61018.87 63722.13    97.69     4.21    1.65    0.33    7.08   0.24  61.65
md0               0.00     0.00 4383.73 1991.63 121984.13 127443.80    78.25     0.00    0.00    0.00    0.00   0.00   0.00

termc$ uptime
 22:15:50 up 154 days, 23:20,  5 users,  load average: 1.25, 1.19, 1.05
[...]
termc$ uptime
 22:15:14 up 154 days, 23:21,  5 users,  load average: 1.19, 1.17, 1.06
I also collected an Off-CPU flame graph just for the uninterruptible state (SVG):

The final one minute load average is 1.19. Let me decompose that:

- 0.33 is from tar's CPU time (pidstat)
- 0.67 is from tar's uninterruptible disk reads, inferred (the offcpu flame graph has this at 0.69, I suspect because it began collecting a little later and spans a slightly different time range)
- 0.04 is from other CPU consumers (mpstat user + system, minus tar's CPU from pidstat)
- 0.11 is from kernel workers' uninterruptible disk I/O time, flushing disk writes (offcpu flame graph, the two towers on the left)

That adds up to 1.15. I'm still missing 0.04, some of which may be rounding and measurement interval offset errors, but a lot may be due to the load average being an exponentially-damped moving sum, whereas the other averages I'm using (pidstat, iostat) are normal averages. Prior to 1.19, the one minute average was 1.25, so some of that will still be dragging us high. How much? From my earlier graphs, at the one minute mark, 62% of the metric was from that minute, and the rest was older. So 0.62 x 1.15 + 0.38 x 1.25 = 1.18. That's pretty close to the 1.19 reported.

This is a system where one thread (tar) plus a little more (some time in kernel worker threads) are doing work, and Linux reports the load average as 1.19, which makes sense. If it was measuring "CPU load averages", the system would have reported 0.37 (inferred from mpstat's summary), which is correct for CPU resources only, but hides the fact that there is demand for over one thread's worth of work. I hope this example shows that the numbers really do mean something deliberate (CPU + uninterruptible), and you can decompose them and figure it out.

## Making sense of Linux load averages

I grew up with OSes where load averages meant CPU load averages, so the Linux version has always bothered me. Perhaps the real problem all along is that the words "load averages" are about as ambiguous as "I/O". Which type of I/O? Disk I/O? File system I/O? Network I/O? ... Likewise, which load averages? CPU load averages? System load averages? Clarifying it this way lets me make sense of it like this:

- On Linux, load averages are (or try to be) "**system load averages**", for the system as a whole, measuring the number of threads that are working and waiting to work (CPU, disk, uninterruptible locks). Put differently, it measures the number of threads that aren't completely idle. Advantage: includes demand for different resources.
- On other OSes, load averages are "**CPU load averages**", measuring the number of CPU running + CPU runnable threads. Advantage: can be easier to understand and reason about (for CPUs only).

Note that there's another possible type: "**physical resource load averages**", which would include load for physical resources only (CPU + disk). Perhaps one day we'll add additional load averages to Linux, and let the user choose what they want to use: a separate "CPU load averages", "disk load averages", "network load averages", etc. Or just use different metrics altogether.

## What is a "good" or "bad" load average?

Load averages measured in a modern tool
Some people have found values that seem to work for their systems and workloads: they know that when load goes over X, application latency is high and customers start complaining. But there aren't really rules for this.

With CPU load averages, one could divide the value by the CPU count, then say that if that ratio is over 1.0 you are running at saturation, which may cause performance problems. It's somewhat ambiguous, as it's a long-term average (at least one minute) which can hide variation. One system with a ratio of 1.5 might be running fine, whereas another at 1.5 that was bursty within the minute might be performing badly. I once administered a two-CPU email server that during the day ran with a CPU load average of between 11 and 16 (a ratio of between 5.5 and 8). Latency was acceptable and no one complained. That's an extreme example: most systems will be suffering with a load/CPU ratio of just 2.

As for Linux's system load averages: these are even more ambiguous as they cover different resource types, so you can't just divide by the CPU count. It's more useful for _relative_ comparisons: if you know the system runs fine at a load of 20, and it's now at 40, then it's time to dig in with other metrics to see what's going on.

## Better Metrics

When Linux load averages increase, you know you have higher demand for resources (CPUs, disks, and some locks), but you aren't sure which. You can use other metrics for clarification. For example, for CPUs:

- **per-CPU utilization**: eg, using mpstat -P ALL 1
- **per-process CPU utilization**: eg, top, pidstat 1, etc
- **per-thread run queue (scheduler) latency**: eg, in /proc/PID/schedstats, delaystats, perf sched
- **CPU run queue latency**: eg, in /proc/schedstat, perf sched, my [runqlat] bcc tool
- **CPU run queue length**: eg, using vmstat 1 and the 'r' column, or my runqlen bcc tool

The first two are utilization metrics, the last three are saturation metrics. Utilization metrics are useful for workload characterization, and saturation metrics are useful for identifying a performance problem. The best CPU saturation metrics are measures of run queue (or scheduler) latency: the time a task/thread was in a runnable state, but had to wait its turn. These allow you to calculate the magnitude of a performance problem, eg, the percent of time a thread spent in scheduler latency. Measuring the run queue length instead can suggest that there is a problem, but it's more difficult to estimate the magnitude.

The schedstats facility was made a kernel tunable in Linux 4.6 (sysctl kernel.sched_schedstats) and changed to be off by default. Delay accounting exposes the same scheduler latency metric, which is in [cpustat], and I just suggested adding it to [htop] too, as that would make it easier for people to use. Easier than, say, scraping the wait-time (scheduler latency) metric from the (undocumented) /proc/sched_debug output:
$ awk 'NF > 7 { if ($1 == "task") { if (h == 0) { print; h=1 } } else { print } }' /proc/sched_debug
            task   PID         tree-key  switches  prio     wait-time             sum-exec        sum-sleep
         systemd     1      5028.684564    306666   120        43.133899     48840.448980   2106893.162610 0 0 /init.scope
     ksoftirqd/0     3 99071232057.573051   1109494   120         5.682347     21846.967164   2096704.183312 0 0 /
    kworker/0:0H     5 99062732253.878471         9   100         0.014976         0.037737         0.000000 0 0 /
     migration/0     9         0.000000   1995690     0         0.000000     25020.580993         0.000000 0 0 /
   lru-add-drain    10        28.548203         2   100         0.000000         0.002620         0.000000 0 0 /
      watchdog/0    11         0.000000   3368570     0         0.000000     23989.957382         0.000000 0 0 /
         cpuhp/0    12      1216.569504         6   120         0.000000         0.010958         0.000000 0 0 /
          xenbus    58  72026342.961752       343   120         0.000000         1.471102         0.000000 0 0 /
      khungtaskd    59 99071124375.968195    111514   120         0.048912      5708.875023   2054143.190593 0 0 /
[...]
         dockerd 16014    247832.821522   2020884   120        95.016057    131987.990617   2298828.078531 0 0 /system.slice/docker.service
         dockerd 16015    106611.777737   2961407   120         0.000000    160704.014444         0.000000 0 0 /system.slice/docker.service
         dockerd 16024       101.600644        16   120         0.000000         0.915798         0.000000 0 0 /system.slice/
[...]
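A simpler per-thread view of the same wait-time metric is /proc/PID/schedstat, whose three fields are time spent on-CPU, time spent waiting on a run queue (both in nanoseconds), and the number of timeslices run. A small sketch, assuming schedstats or delay accounting is enabled as noted above (here awk reports on itself, so substitute a PID of interest):

$ awk '{ printf "on-CPU: %.1f ms, run-queue wait: %.1f ms, timeslices: %d\n", $1/1e6, $2/1e6, $3 }' /proc/self/schedstat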
Apart from CPU metrics, you can also look for utilization and saturation metrics for disk devices. I focus on such metrics in the [USE method], and have a [Linux checklist] of these.

While there are more explicit metrics, that doesn't mean that load averages are useless. They are used successfully in scale-up policies for cloud computing microservices, along with other metrics. This helps microservices respond to different types of load increases, CPU or disk I/O. With these policies it's safer to err on scaling up (costing money) than not to scale up (costing customers), so including more signals is desirable. If we scale up too much, we'll debug why the next day.

The one thing I keep using load averages for is their historical information. If I'm asked to check out a poor-performing instance on the cloud, then login and find that the one minute average is much lower than the fifteen minute average, it's a big clue that I might be too late to see the performance issue live. But I only spend a few seconds contemplating load averages, before turning to other metrics.

## Conclusion

In 1993, a Linux engineer found a nonintuitive case with load averages, and with a three-line patch changed them forever from "CPU load averages" to what one might call "system load averages." His change included tasks in the uninterruptible state, so that load averages reflected demand for disk resources and not just CPUs. These system load averages count the number of threads working and waiting to work (waiting on CPUs, disks, and uninterruptible locks), and are summarized as a triplet of exponentially-damped moving sum averages that use 1, 5, and 15 minutes as constants in an equation. This triplet of numbers lets you see if load is increasing or decreasing, and their greatest value may be for relative comparisons with themselves.

The use of the uninterruptible state has since grown in the Linux kernel, and nowadays includes uninterruptible lock primitives. If the load average is a measure of demand in terms of running and waiting threads (and not strictly threads wanting hardware resources), then they are already working the way we want them to.

In this post, I dug up the Linux load average patch from 1993 – which was surprisingly difficult to find – containing the original explanation by the author. I also measured stack traces and time in the uninterruptible state using bcc/eBPF on a modern Linux system, and visualized this time as an off-CPU flame graph. This visualization provides many examples of uninterruptible sleeps, and can be generated whenever needed to explain unusually high load averages. I also proposed other metrics you can use to understand system load in more detail, instead of load averages.

I'll finish by quoting from a comment at the top of kernel/sched/loadavg.c in the Linux source, by scheduler maintainer Peter Zijlstra:
* This file contains the magic bits required to compute the global loadavg
* figure. Its a silly number but people think its important. We go through
* great pains to make it work on big machines and tickless kernels.
## References

- Saltzer, J., and J. Gintell. “The Instrumentation of Multics,” CACM, August 1970 (explains exponentials).
- Multics [system_performance_graph] command reference (mentions the 1 minute average).
- [TENEX] source code (load average code is in SCHED.MAC).
- [RFC 546] "TENEX Load Averages for July 1973" (explains measuring CPU demand).
- Bobrow, D., et al. “TENEX: A Paged Time Sharing System for the PDP-10,” Communications of the ACM, March 1972 (explains the load average triplet).
- Gunther, N. "UNIX Load Average Part 1: How It Works" PDF (explains the exponential calculations).
- Linus's email about Linux 0.99 patchlevel 14.
- The load average change email is on [oldlinux.org] (in alan-old-funet-lists/kernel.1993.gz, and not in the linux directories, which I searched first).
- The Linux kernel/sched.c source before and after the load average change: 0.99.13, 0.99.14.
- Tarballs for Linux 0.99 releases are on kernel.org.
- The current Linux load average code: [loadavg.c], [loadavg.h]
- The [bcc] analysis tools include my [offcputime], used for tracing TASK_UNINTERRUPTIBLE.
- [Flame Graphs] were used for visualizing uninterruptible paths.

Thanks to Deirdre Straughan for edits.

[Multics]: http://web.mit.edu/Saltzer/www/publications/instrumentation.html
[TENEX]: https://github.com/PDP-10/tenex
[system_performance_graph]: http://web.mit.edu/multics-history/source/Multics/doc/privileged/system_performance_graph.info
[runqlat]: http://www.brendangregg.com/blog/2016-10-08/linux-bcc-runqlat.html
[bcc]: https://github.com/iovisor/bcc
[htop]: https://github.com/hishamhm/htop/issues/665
[cpustat]: https://github.com/uber-common/cpustat
[kernelscope]: https://github.com/josefbacik/kernelscope
[flame graph]: https://github.com/brendangregg/FlameGraph
[RFC 546]: https://tools.ietf.org/html/rfc546
[QCon 2015]: https://www.slideshare.net/brendangregg/qcon-2015-broken-performance-tools/6
[ietf.org]: https://tools.ietf.org/html/rfc546
[Linux]: https://github.com/torvalds/linux/blob/master/include/linux/sched/loadavg.h
[How It Works]: http://www.teamquest.com/import/pdfs/whitepaper/ldavg1.pdf
[loadavg.h]: https://github.com/torvalds/linux/blob/master/include/linux/sched/loadavg.h
[loadavg.c]: https://github.com/torvalds/linux/blob/master/kernel/sched/loadavg.c
[linuxpatch]: https://kernel.googlesource.com/pub/scm/linux/kernel/git/nico/archive/+/75bb5836a8a8c0ee44ffd60a51f357b9568f1381%5E%21/#F265
[Off-CPU flame graph]: http://www.brendangregg.com/blog/2016-01-20/ebpf-offcpu-flame-graph.html
[USE method]: http://www.brendangregg.com/usemethod.html
[Linux checklist]: http://www.brendangregg.com/USEmethod/use-linux.html
[oldlinux.org]: http://oldlinux.org/Linux.old/mail-archive/
[offcputime]: http://www.brendangregg.com/blog/2016-01-20/ebpf-offcpu-flame-graph.html
[Flame Graphs]: http://www.brendangregg.com/flamegraphs.html
[z0]: https://unix.stackexchange.com/questions/163639/how-did-the-linux-kernel-project-track-bugs-in-the-early-days
[z1]: http://www.linuxjournal.com/article/9001?page=0,1
[z2]: http://www.ibiblio.org/pub/historic-linux/ftp-archives/
[z3]: https://www.win.tue.nl/~aeb/ftpdocs/linux-local/kernel.archive/0.99/

Chromebook Adventures Nahum Shalman

A chromebook you say?

I've been in need of a new personal laptop for a while and inspired by this post from @kennwhite I decided that I would get a Chromebook. I put a couple of different models on a wishlist including the one in that blog post, but my siblings opted to get me a nicer one with more cores, more RAM, and more storage for my birthday (thanks!!) (In fact, it's the same one that Kenn appears to be trying next.)

I have a few notes to add to that blog post as it pertains to my use cases including an open problem which, if I solve it, will trigger an update to this post with the solution.

Additional Useful Apps

In addition to many of the apps recommended by Kenn, here are some additional ones that have proven useful:

  • VNC Viewer works very nicely for connecting to the VNC console of some VMs on my SmartOS box.
  • aSPICE/aSPICE Pro is an Android app that I can run on my Chromebook and works better than the HTML5 Chrome app that requires websockets and hasn't been updated in a while. (That said, it's possible that throwing up a zone with a webserver with the latest HTML5 client and the websockets proxying might be a good thing for me to explore at some point.) I bought the paid version a while back to use on my phone/tablet because I was using it so much.
  • Chromebook Recovery Utility but not for the reason you might think. It turns out you can use it to write any image you download to a USB stick. It took me a bit too much searching to find the details so I'm collecting them here.

Writing USB sticks under Chrome OS

Creating a bootable USB stick from Chrome OS

In keeping with Kenn's philosophy of not undermining the security mechanisms of Chrome OS, I had to figure out how to flash images to a USB stick without using dd. This somewhat obscure forum posting held the key, but I will spell it out here for the sake of those who search for this in the future:

  1. Install Chromebook Recovery Utility
  2. Download whatever ISO/USB image you want to write to your USB stick.
  3. This is the key: Rename the file to end with .bin
  4. Insert your USB stick
  5. Launch the Recovery Utility
  6. Click the gear icon in the upper right hand corner
  7. Select "Use local image"
  8. Use the file picker to select your file
  9. Select your USB stick from the drop-down list
  10. Click "Continue"
  11. Read the warning and be certain you want to overwrite that USB stick (you've been warned!!)
  12. Click "Create now"
  13. Wait for it to complete

This should work with any standard ISOhybrid image (they seem to be all the rage for Linux distros and the like), and it should also work for e.g. a SmartOS USB stick image.

Missing Features

IPMItool

(This section assumes you read about the termux bits in Kenn's post)

I built my SmartOS box with a server-grade motherboard. It has IPMI and I love being able to connect to the Serial Over LAN to do administration since it sits in a corner with no keyboard or monitor. Unfortunately I couldn't find an existing binary package for IPMItool. There wasn't one for termux, so I installed a toolchain in termux, downloaded the IPMItool sources, and tried to build it.

Turns out that some lovely code from 2003 assumes the presence in libc of getpass(3c) (or getpassphrase on platforms that have it). Also turns out that the Chrome OS / Android libc (not totally sure what's at play) doesn't have that function.

I think the solution will either be to manually patch in an implementation cribbed from somewhere else on the internet, or figure out how to use gnulib to provide that functionality. I think there's an example of someone doing something similar here but I didn't dig deeply enough into their tooling to figure it out yet.

If any of you have done this sort of thing before, I'll happily write up the solution here in exchange for guidance on how to do it (and perhaps help build a termux package for it to share the love.)

Add speech to your Fedora system blog'o'less

This article first appeared on Fedora Magazine

By default, Fedora Workstation ships a small package called espeak. It adds a speech synthesizer — that is, text-to-speech software.
In today’s world, talking devices are nothing impressive as they’re very common. You can find speech synthesizers even in your smartphone, a product like Amazon Alexa, or in the announcements at the train station. In addition, synthesized voices are nowadays more or less similar to human speech. We live in a 1980s science fiction movie!
The voice produced by espeak may sound a bit primitive compared to the aforementioned tools. But at the end of the day espeak produces good quality speech. And whether you find it useful or not, at least it can provide some amusement.

Running espeak

In espeak you can set various parameters using command line options. Examples include:
  • amplitude (-a)
  • pitch adjustment (-p)
  • speed of sentences (-s)
  • gap between words (-g)
Each of these options produces various effects and may help you achieve a cleaner voice.
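For example, combining several of these options (the values here are just illustrative, so tweak them to taste):

espeak -a 150 -p 30 -s 140 -g 5 "This is a test of the speech synthesizer"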
You can also select different voice variants with command line options. For example, try -ven+m3 for a different English male voice, and -ven+f1 for a female one. You can also use different languages. For a list, run this command:
espeak --voices
Note that many languages other than English are experimental attempts.
To create a WAV file instead of actually speaking something, use the -w option:
espeak -w out.wav "Audio file test"
The espeak utility also reads the content of a file for you.
espeak -f plaintextfile
Or you can pass the text to speak from standard input. In this way, as a simple example, you can build a talking box that alerts you to an event using a voice. Your backup is completed? Add this command to the end of your script:
echo "Backup completed" | espeak -s 160 -a 100 -g 4
Suppose an error shows up in a log file:
tail -1F /your/log/file | grep --line-buffered 'ERROR' | espeak
Or perhaps you want a speaking clock telling you every minute what time it is:
while true; do date +%S | grep '00' && date +%H:%M | espeak; sleep 1; done
As you can guess, use cases are limited only by your imagination. Enjoy your new talking Fedora system!

Creating a decent Tribblix AMI The Trouble with Tribbles...

Previously, I've described how I created my first Tribblix AMI, then how to do it properly in hvm mode so you can run on modern instances in all regions.

That creates something that will work, but is it actually in a state that's useful?

The first thing is to add an EC2 credential service. That's the thing that will query for metadata and install the keys on the system so you can log in after the instance is created. I tried the ec2-credential service from OmniOS, but for some reason it didn't work right on Tribblix. I've tweaked mine a little, forcing it to run after the network comes up, adding retries in case there's a problem, and also disabling it in non-global zones.

Of course, there's more instance metadata that I could query and use, but I haven't yet had a need for anything other than the initial key.

The other thing I've been wondering about is the configuration of Tribblix itself - specifically what the storage should look like and what the default software installation should look like.

My image is built on an 8G "disk" or EBS volume. That might seem a little small, but remember that Tribblix is pretty lean and mean. For a typical server configuration you'll probably be looking at about 1G or so, and that's without any special work. The most annoying thing here is that by default you lose 2G to each of dump and swap, so that's effectively half the disk gone. There's opportunity to modify those, especially as I'm typically using t2.micro instances on the free tier that only have 1G of memory. You might not even want dump at all. As for swap, you do want some (so that anonymous reservations don't eat into actual RAM) but you could cut that down a bit.
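For what it's worth, here's a rough sketch of reclaiming that space, assuming the usual illumos layout of rpool/dump and rpool/swap zvols (exact steps may differ on Tribblix, and dumpadm(1M) may need re-running after resizing the dump device):

# shrink the dump zvol (or remove the dump device entirely via dumpadm)
zfs set volsize=1g rpool/dump

# shrink swap: remove it, resize the zvol, then add it back
swap -d /dev/zvol/dsk/rpool/swap
zfs set volsize=1g rpool/swap
swap -a /dev/zvol/dsk/rpool/swap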

As I'm writing this I do wonder whether I could pull some of the instance metadata and shrink the dump and swap volumes appropriately.

The assumption I'm making here, though, is that if you're storing any reasonable amount of data that you're going to attach a separate EBS volume, and you can then size that appropriately to the need at hand. (And you can then move that data around independent of your running instance.) So I think that keeping the root volume fairly small is reasonable. It also keeps my AWS bill down, an important consideration as any charges here come out of my own pocket.

Then, what should the baseline software install look like? Tribblix uses overlays, and there's an assumption that you always start from the base overlay. I'm currently using a dedicated overlay that pulls in cli-tools - essentially you get basic shells, compression tools, basic utilities, but not much else. Many of the normal server utilities don't apply to running in the cloud, as they're aimed at monitoring or managing hardware.

The base set of packages is that installed on the ISO. That includes most storage and network drivers, which are irrelevant - on EC2 you know exactly which drivers you need, so almost all the drivers that are installed are unnecessary. What I need here is a better way of handling installation variants, so it knows the drivers aren't supposed to be there - at the moment I could remove them, but updates and upgrades would simply put them back. In the same vein, I could only ship a 64-bit kernel, as we know there are no 32-bit instance types available.

At the moment I have an LX variant, which is a bit of a hack in terms of the way I've packaged it together, but as the number of interesting variants grows I'm going to have to come up with a better way of handling it, especially as you might want multiple variants together - for instance a 64-bit LX-enabled cloud-optimised image.

Building a Tribblix AMI - hvm mode The Trouble with Tribbles...

After having created a Tribblix AMI to prove that Tribblix basically works on EC2, I then moved on to the next issue - how to create an AMI that will run in hvm mode?

As a reminder, pv mode AMIs are deprecated, aren't supported by all instance types, and don't work in all regions. So you really need something that runs in hvm mode.

The first thought might be to convert the existing pv image to a hvm image. I've tried that and, while you can do the conversion, the image doesn't actually work. The problem here is that ZFS has the physical paths of the devices it's installed on embedded in the pool metadata. Changing from pv to hvm mode changes the emulated hardware, in particular the disk paths, so the ZFS pool isn't where it thought it was and the system panics. If you have a mismatch between the disk layout where the pool was created and where you're running you'll get a panic something like this:



If you had console access and could boot from media you could fix this, but AWS doesn't provide that. (And if you could boot from media you could just do a regular install without all the shenanigans involved in producing an AMI.)

So, you have to create the image on a system that looks like EC2. Which means using xen.

Fortunately, this road has been travelled before. These instructions are exactly what you need. They're for OpenIndiana, but will apply to any illumos distribution. And they're the process used by the OpenZFS project to do their testing. (I'll also mention that the OpenZFS folks have put a number of fixes back into illumos that improve the EC2 experience for us.)

I'm not going to repeat those instructions, that would be boring, so I'll talk about what I had to do or change to make those instructions work for me.

I got one of my spare desktop PCs out and installed Ubuntu 16.04 on it. (I must be spoilt by Tribblix, the Ubuntu install was horrendously slow and very high maintenance.) And then installed xen, rebooted as dom0, and set up the bridge networking.

That was my first pothole. There's this thing called systemd that's come along, and it changes the way network configuration is done. Much cussing and googling, but I got it right first time.

Then I discover that there's a new toolstack here. It's all xl not xm, but otherwise seems the same.

I then tried to start a VM, only to be given a completely meaningless and unhelpful error message. Why tell the user what's wrong when you can just vomit a stack trace?

After a bit of head-scratching I worked out that the system didn't actually support hvm mode. If you run xl info and look for virt_caps, it should mention hvm. That's a bit odd, the sticker on the front of the box looks right.
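A quick way to check, assuming the xl toolstack:

# on a host with VT-x present and enabled, virt_caps should include "hvm"
xl info | grep virt_caps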

Manufacturers ship hardware with VT-x disabled in the BIOS, it appears. Into the BIOS we go, to find that the relevant settings are greyed out and you need a BIOS password to get into them. Open the box and start looking for jumpers. Fortunately I found a helpful article - the key here was the bit about the jumper being blue, little details like that make all the difference.

OK, so having wiped the BIOS password, gone into the BIOS and enabled VT-x, I go back to xen. Looking at virt_caps now shows hvm, as it should, and my domain starts.

The idea here is that you connect to the console with VNC. Easy enough, but by the time I had got my ssh tunnel set up and started up my VNC client, my VM had gone. I started it again, it starts booting just fine but then issues a few warnings and then a kernel panic. It's all over pretty quick.

In order to catch what it said, I then used vnc2flv. Someone asked me about screen recorders a while back, and I suggested they did what they wanted to do in a vnc session and use vnc2flv to record it. But it's the same here. Once I had the session recorded I can watch the movie and pause it to see what errors it's spitting out.



This, I think, is related to illumos bug 7186. It looks like we can't handle the network presented by newer versions of xen.

To get round this I simply disabled the network interface in the VM definition. Then the VM boots just fine and can be installed. You're a little bit limited in that you can't do updates but, as long as nwam is enabled then it will get itself on the network when you do run it on something that does have a compatible network.

For OmniOS, this means you have to manually enable nwam, as they have networking switched off by default. And remember that you must have networking enabled if you're running on EC2 as there's no other way to access your system.

What you'll also need to ensure at this point is that you have a functional user account you can get in to via ssh. With Tribblix and OpenIndiana you have jack, other distros might need to create a user. You wouldn't want that on a production AMI, of course, but you need to be able to log in to the system the first time in order to complete any configuration and add the various bits of AWS integration that you'll need.

Having got my image installed I followed the instructions through and got an AMI that works just fine.

The configuration file I used is:

builder='hvm'
name='ami-template'
vcpus=1
memory=1024
disk=[  'file:/var/tmp/tribblix-0m20.1.iso,hdb:cdrom,r',
        'file:/root/ami-template.img,xvda,w' ]
boot='d'
vnc=1
vnclisten='0.0.0.0'
vncconsole=1
on_crash='preserve'
xen_platform_pci=1
serial='pty'
on_reboot='destroy'


The one crucial thing here, apart from not having a vif line to create a network, is that you must use xvda for the disk. That's what EC2 will present to you, if you use something else you'll get the same panic on boot that I saw when attempting to convert a pv image.

We're almost done. Next time I'll talk about how to go from something that minimally boots up to something that's done well.

Running illumos on AWS - the first Tribblix AMI The Trouble with Tribbles...

I've run Tribblix on all sorts of hardware - desktops, servers, even the occasional laptop. I've had success running it on some of the smaller cloud providers that allow you to install from a custom ISO, or iPXE, such as my adventures with Vultr.

However, running on AWS has eluded me. You might wonder why you would want to, but the reality is that AWS is a huge player, with many people turning to it as their default (and often only) option. So giving everyone who uses AWS access to Tribblix would be a good thing, and would also offer an easy route for people who might want to play with Tribblix to do so.

The first thing to realize is that AWS is not so much a single cloud as a set of independent clouds. Each region is independent, and has a different set of capabilities. For example, EFS is only available in a few regions. These differences can affect us.

On AWS, there are 2 different types of guest. We have pv (the older, paravirtualized) and hvm (the newer, hardware assisted). Any given AMI (Amazon Machine Image) will only run as either a pv or a hvm guest. And some EC2 instance types are pv, others hvm. Newer regions (such as London) are exclusively hvm, so pv isn't an option.

Building an AMI from scratch looked a little daunting, so I looked to see what other illumos distributions might have made AMIs available. If you go to the community AMI page when launching an instance, the only one you'll find is OmniOS. They even have a page explaining how it was done. The snag is that all their images are pv. For my first set of experiments then, I was operating in the Dublin region.

The OmniOS AMI boots up just fine and works pretty much as you would expect. No problems there. How to get Tribblix running though?

The answer lies in the beauty of ZFS and Boot Environments. The basic approach here is to take a running OmniOS image, create a new Boot Environment, install Tribblix into that Boot Environment, and make the Tribblix Boot Environment the one to boot from next time. Once I've successfully booted the Tribblix image, I can clean up and delete the original OmniOS files.
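In outline, the boot environment approach looks something like the sketch below, using standard beadm commands (the actual img_install script described below handles the details, including leaving pv-grub alone):

beadm create tribblix        # new BE, cloned from the running OmniOS one
beadm mount tribblix /a      # mount it somewhere convenient
# ... unpack the Tribblix image over /a and adjust its configuration ...
beadm activate tribblix      # make it the default for the next boot
init 6                       # reboot into Tribblix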

One of the advantages of Tribblix is that I have my own installer. It's quite a bit simpler than some of the other distros, and thus much easier to mangle to do things in new environments. I decided to use the iPXE image as used in my Vultr experiment, because it was easy and I had it to hand. I then wrote a modified installer script (source here) called img_install that was based on my over_install script used to drop Tribblix into an existing ZFS pool. The difference is that the old over_install was run in the context of a Tribblix Live CD; the new img_install is run in the context of an alternative distro. The other thing in that script is that I don't do any boot loader fiddling - the pv instances have a special pv-grub, which I'm careful not to touch.

(By the way, the same trick will work for other illumos distributions. You just need a source archive of some sort and a script to unpack it. For example, I have a script to unpack some of the ISO images in the tribblix-zones repo, which I use to create alien-root zones. It's the same idea of installing an image in an alternate path.)

So all that was involved was to:

  • Start up an OmniOS instance (a micro instance on the free tier works fine)
  • Run the img_install script to create the alternate BE
  • Reboot, so you boot into Tribblix
  • Delete the old OmniOS BE
  • Finish off the install and apply updates
Then you can do the normal create an image trick on the AWS console, and you have a nice shiny Tribblix AMI.

That all worked out just beautifully. Tribblix runs on EC2 just fine.

In the next article, I'll describe how to create a hvm AMI.

Coloring Flame Graphs: Code Hues Brendan Gregg's Blog

I recently improved flame graph code coloring. If you're automating or implementing flame graphs, this is a small detail that may interest you. (For an intro to flame graphs, see my [website] and [github].) First, a confession. Code-type coloring was a regex hack that took five minutes. In late 2014 I was modifying the JDK to preserve the frame pointer so that traditional stack walkers and profilers would work (an example of the problem is here, where Java methods lack ancestry). After I fixed the frame pointer, profiling Java looked like this (SVG):

It worked! Java methods now had ancestry (stack depth), and appear as towers. I was delighted and showed my colleagues straight away. Amer Ather, another performance engineer at Netflix, suggested I color the Java and kernel frames differently. He was only back at his desk for five minutes when I called him back (SVG):

Done. (I also stripped the extra L from Java symbols.) My hack was the following eight lines of code:
        if (defined $type and $type eq "java") {
                if ($name =~ /::/) {            # C++
                        $type = "yellow";
                } elsif ($name =~ m:/:) {       # Java (match "/" in path)
                        $type = "green"
                } else {                        # system
                        $type = "red";
                }
The "java" $type is from the command line option: --color=java. The $name is the function name. Here are some sample function names:

- **Java**
    io/netty/channel/nio/NioEventLoop;.run
    org/mozilla/classfile/ClassFileWriter;.addLoadConstant
- **C++**
    JavaCalls::call\_helper
    JavaThread::thread_main_inner
- **C**
    tcp\_v4\_do\_rcv
    start\_thread
    write
If you cast your regular expression eye over these, you'll quickly see patterns. If it contains "::" it's C++, "/" it's Java, else it's C. And that's what I coded. It mostly worked. But I've noticed the odd case where it gets things wrong. Sometimes the profiled Java symbols use "." instead of "/" as a delimiter. Or, somehow, I have Java methods that lack any package delimiter, so were colored red. I had similar issues with JIT'd code for Node.js. Revisiting how flame graphs for Linux perf are generated (full instructions in [Java Flame Graphs]):
perf record -F 49 -a -g -- sleep 30; ./jmaps
perf script | ./stackcollapse-perf.pl | grep -v cpu_idle | ./flamegraph.pl --color=java > out.svg
It's beginning with the output of perf script (later perf versions added a way to emit a folded summary directly). Here is some truncated perf script output:
java  4811 cpu-clock: 
    ffffffff8100122a hypercall_page ([kernel.kallsyms])
    ffffffff8100aca2 check_events ([kernel.kallsyms])
    ffffffff8104dffe __wake_up_sync_key ([kernel.kallsyms])
    ffffffff8152f86e sock_def_readable ([kernel.kallsyms])
[...]
    ffffffff81662142 system_call_fastpath ([kernel.kallsyms])
        7f62aadf2f7d write (/lib/x86_64-linux-gnu/libc-2.15.so)
        7f62961a5e8b Lsun/nio/ch/FileDispatcherImpl;.write0(Ljava/io/FileDescriptor;JI)I (/tmp/perf-4637.map)
        7f629619dd64 Lsun/nio/ch/SocketDispatcher;.write(Ljava/io/FileDescriptor;JI)I (/tmp/perf-4637.map)
        7f62961b3330 Lsun/nio/ch/IOUtil;.writeFromNativeBuffer(Ljava/io/FileDescriptor;Ljava/nio/ByteBuffer;JLsun/nio/ch/NativeDispatcher;)I (/tmp/perf-4637.map)
[...]
        7f62aa3b1618 JavaThread::thread_main_inner() (/mnt/openjdk8/build/linux-x86_64-normal-server-release/jdk/lib/amd64/server/libjvm.so)
        7f62aa3b186c JavaThread::run() (/mnt/openjdk8/build/linux-x86_64-normal-server-release/jdk/lib/amd64/server/libjvm.so)
        7f62aa272bf2 java_start(Thread*) (/mnt/openjdk8/build/linux-x86_64-normal-server-release/jdk/lib/amd64/server/libjvm.so)
        7f62aa8f2e9a start_thread (/lib/x86_64-linux-gnu/libpthread-2.15.so)
The stackcollapse-perf.pl tool plucks out the symbol name (second column) and discards everything else. But the last column – the segment printed in ( ) – provides more details for identifying code types. Eg:

- **[kernel.kallsyms]**: kernel code (I could also match the addr vs the kernel base address for this)
- **/tmp/perf-PID.map**: JIT'd code (Java, Node.js, ...)

This is what I made use of recently, by adding an --all option to stackcollapse-perf.pl to turn on all annotations. Annotations are inspired by the "[k]" annotations seen in perf report --stdio output. I append them after the function name, so tcp\_sendmsg becomes tcp\_sendmsg\_[k], and that annotation is used and then stripped by flamegraph.pl. Annotation suffixes:

- **_[k]**: kernel
- **_[j]**: JIT
- **_[i]**: inlined function
- **_[w]**: waker stack (for [offwake] or [chain] graphs)

Making use of both annotations and pattern matching, the "java" palette is now:

- green: JIT (Java, Node.js, ...)
- aqua: inlined
- yellow: C++
- orange: kernel
- red: native (user-level)

If you're automating flame graphs using my original tools, you might want to consider adding --all to the normal workflow for annotations. These are currently used by the "java" and "js" palettes. Eg:
perf record -F 49 -a -g -- sleep 30; ./jmaps
perf script | ./stackcollapse-perf.pl --all | grep -v cpu_idle | ./flamegraph.pl --color=java > out.svg
If you are using a different profiler (not Linux perf), you might want to consider enhancing its stackcollapse program to have an option to turn on annotations (or I can do it next time I use them). If you are implementing your own flame graph software, you might want to add similar color hues for code types. Finally, it should be clear that changing the hue of code based on a regex is a trivial change to flamegraph.pl. You could add custom rules to your version to highlight your team's code, for example.

[Java in Flames]: https://medium.com/netflix-techblog/java-in-flames-e763b3d32166
[Java Flame Graphs]: http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#Java
[website]: http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#Java
[flamegraph.pl]: https://github.com/brendangregg/FlameGraph
[github]: https://github.com/brendangregg/FlameGraph
[perf]: /perf.html
[offwake]: http://www.brendangregg.com/blog/2016-02-01/linux-wakeup-offwake-profiling.html
[chain]: http://www.brendangregg.com/blog/2016-02-05/ebpf-chaingraph-prototype.html
[Jun 21]: https://github.com/brendangregg/FlameGraph/commit/e83d1bcf33c313f4b6c754799646fe038b9fad0f

Modern Mercurial - hg log Josef "Jeff" Sipek

As I pointed out recently, I ended up customizing my .hgrc to better suit my needs. In this post, I’m going to talk about my changes to tailor the hg log output to my liking.

There are three issues I have with the default hg log format:

  1. By default, only the first line of the commit message is shown. To see it fully, you need to use verbose mode.
  2. In verbose mode, the touched files are listed as well without a way to hide them.
  3. In verbose mode, the listed files are not listed one per line, but rather as a single line.

If, like me, you prefer the Linux-kernel style commit messages, you likely want to see the whole message when you look at the log (problem #1). Here is, for example, a screenshot of a commit using the default style (normal and verbose mode):

hg log

You can work around not seeing the whole commit message by always using the verbose mode, but that means that you’ll also be assaulted by the list of changed files (problem #2) without a way to hide it. To make the second problem even worse, the file names are listed on a single line, so all but the most trivial of changes create an impossible to read blob of file names (problem #3). For example, even with only a handful of files touched by a commit:

hg log -v

At least, those are my problems with the default format. I’m sure some people like the default just the way it is. Thankfully, Mercurial sports a powerful templating engine, so I can override the style whichever way I want.
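For a quick feel of the templating engine before building a full style file, a one-off template can be passed straight to hg log (this particular template is just an illustration, not the style developed below); it prints the short hash, author, and full commit message for the current revision:

hg log -r . --template '{node|short} {author|person}\n{desc}\n'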

Demo

Ok, before I dive into the rather simple config file changes, let’s take a look at a screenshot of the result on a test repository:

hg log -G

As you can see, the format of each log entry is similar to that of git log (note that the whole multi-line commit message is displayed, see revision 1), but with extra information. What exactly does it all mean? I think the best way to explain all the various bits of information is to show you an annotated version of the same screenshot:

hg log -G

I’m now going to describe the reasons why the various bits of information are presented just the way they are. If you aren’t interested in this description, skip ahead to the next section where I present the actual configuration changes I made.

Each commit hash (in yellow) is followed by a number of “items” that tell you more about the commit.

First is the phase. The phase name is abbreviated to a single letter (or no letter for the public phase) and color coded. It is the first item because every commit has a phase, the phase is an important bit of information, and the “encoded” phase info is very compact.

The reasoning behind the phase letters and colors is as follows:

public phase (no letter)
Public commits are not interesting since everyone has them, so don’t draw attention to them by omitting a letter.
secret phase (‘S’)
The only interesting thing about secret commits is that they will not be pushed. That means that they cannot be accidentally pushed either. Since this behavior is “boring”, use dark blue to indicate that they are different from public commits, but do not draw too much attention to them.
draft phase (‘D’)
These are the “dangerous” commits. Pushing them will change the remote repository’s state, so draw significantly more attention to these by using red.

I use letters instead of just using a different color for the commit hash for a very simple reason—if colors aren’t rendering properly, I still want to be able to tell the phases apart.

Second comes the named branch. When looking at several commits (e.g., hg log), most of the time any two adjacent commits will be on the same named branch. On top of that, each commit belongs to exactly one named branch. Therefore, even though the named branch name is not a fixed field, it behaves as one. In my experience, it is a good idea to display fixed fields before any variable length fields to make it easier for the eyes to spot any differences. (Yes, technically the way I display the phase information is not fixed width and therefore the named branch will not always start in the same column, but in practice adjacent commits tend to have the same phase as well, so the named branch will always be in a semi-fixed position.) Note that in Mercurial the “default” branch is usually rendered as the empty string, and I follow that convention with my template.

Third comes the list of tags. Each commit can have many tags. This is the first item on the line that can become unreasonably long. At least in the repositories that I deal with, there aren’t very many tags per commit, so I haven’t seen any bad effects.

Fourth and final comes the list of bookmarks. Much like tags, there can be many, but in practice there are very few. Since I deal with tags more often than bookmarks, I put the bookmark information after the tags. The active bookmark is rendered as bold.

The choice of colors for named branches (cyan), tags (green), and bookmarks (magenta) was guided by a simple principle: they should go well with the yellow color of the changeset line, and not draw too much attention but still be visually distinct. Sadly, on a terminal without color support, they will all render the same way. I think this is still workable, since repositories have conventions for branches/tags/bookmarks naming and therefore the user can still guess what type of name it is. (Worst case, the user can consult other hg commands to figure out what exactly is being displayed.)

The checked out commit and the active bookmark being rendered as bold without any additional indication that they are different is also unfortunate. I haven’t found a pleasant way to render this information that would convey the same information on dumb terminals. (Note that there is a class of terminals that support bold fonts but not different colors. Even those will render this info correctly.)

Config

So, how did I achieve this glorious output? It’s not too complicated, but it took me a while to tune things just to my liking.

First, I make a custom style file with two templates—changeset and changeset_verbose:

changeset_common = '{label(ifcontains(rev, revset('parents()'),
      "log.activechangeset",
      "log.changeset"),
      "commit {rev}:{node}")}\
      {label("log.phase_{phase}",
	ifeq(phase, "public",
	  "",
	  " {ifeq(phase,"draft","D","S")}"))}\
      {label("log.branch", ifeq(branch, "default", "", " {branch}"))}\
      {label("log.tag", if(tags, " {tags}"))}\
      {bookmarks % "{ifeq(bookmark, currentbookmark,
	label('log.activebookmark', " {bookmark}"),
	label('log.bookmark', " {bookmark}"))}"}
    {ifeq(parents,"","","{ifeq(p2rev,-1,"Parent: ","Merge: ")}{parents}\n")}\
    Author: {author}
    Date:   {date(date,"%c %z")}\n
    {indent(desc,"    ")}\n'
changeset_files = '{ifeq(files, "", "", "\n {join(files,\"\n \")}\n")}'

changeset_verbose = '{changeset_common}{changeset_files}\n'
changeset = '{changeset_common}\n'

Normally, changeset is used by hg log and other revision set printing commands, while changeset_verbose is used when you provide them with the -v switch. In my template, the only difference between the two is that the verbose version prints the list of files touched by the commit.

Second, in my .hgrc, I define the colors I want to use for the various bits of info:

[color]
log.activebookmark = magenta bold
log.activechangeset = yellow bold
log.bookmark = magenta
log.branch = cyan
log.changeset = yellow
log.phase_draft = red bold
log.phase_secret = blue bold
log.tag = green

Finally, in my .hgrc, I set the default style to point to my style file:

[ui]
style = $HOME/environ/hg/style

That’s all there is to it! Feel free to take the above snippets and tailor them to your liking.
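If you want to try the style file on a repository before making it the default, it can also be passed explicitly for a single invocation:

hg log -G --style $HOME/environ/hg/style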

hg log -v vs. hg log --stat sidenote

My first version of the template did not support the verbose mode. I didn’t think this was a big deal, and I simply used hg log --stat instead. This provides the list of files touched by the commit and a visual indication of how much they changed. For example, here’s a close up of two commits in the same test repo:

hg log -G --stat

Then one day, I tried to do that on a larger repo with a cold cache. It was very slow. It made sense why—not only did Mercurial need to list all the commits, it also needed to produce the diff of each commit only to do some basic counting for the diffstat.

My solution to the problem was to make verbose mode list all the files touched by the commit by using {files}. This is rather cheap since it requires consulting the manifest instead of calculating the diff for each commit. For example, here are the same two commits as above but in verbose mode:

hg log -G -v

It certainly has less detail, but it is good enough when you want to search the log output for a specific file name.

Mucking around with IPv6 and illumos zones The Trouble with Tribbles...

The world is running out of IPv4 addresses, and it's time to move to IPv6.

I remember that story being told over and over at conferences in the mid 1990s. Yet, here we are in 2017 and while there has been progress, we're definitely not there yet.

With zones, illumos (and Solaris) give you virtualized application environments (containers is the trendy term - we tend not to use that in the Solarish context because it got polluted by Sun marketing). Those environments (usually) need to be networked, so why not with IPv6.

So here goes with a few notes on the subject.

Shared-IP zones


With zones, the original networking model was a shared-ip stack, where the zone is given a fully configured network that is just a virtual IP configured on an existing interface. All the setup is done in the global zone, which makes it very easy.

(By the way, this was the cause of the limit of 8192 zones per system, because you can only have 8192 virtual addresses on a single physical interface.)

And configuring an IPv4 address is just a case of adding a net section to the zone configuration:

add net
set physical=aggr1
set address=172.18.1.172/24
end

It's exactly the same for IPv6, the only interesting issue is what the IPv6 address would be. Let's start with the link-local address - the one that starts with fe80:: - as you will generally need that even if you don't have a routed IPv6 network. For a physical interface, the IPv6 address is usually derived from the MAC address. We can't use that one, because we're sharing the interface and the global zone has already grabbed it. So the convention here is to construct something from the IPv4 address. It's then guaranteed to be unique in a broadcast network, which is all that matters for a link-local address. So all we have to do is convert the IPv4 address to hex, for example with printf:

printf "%x%x:%x%x\n" 172 18 1 172

which gives ac12:1ac, so the link-local address would be configured as:

add net
set physical=aggr1
set address=fe80::ac12:1ac/10
end

You're pretty much done here; if you do that, your global and non-global zones will be able to communicate using IPv6 on the local subnet.

If you had a routable prefix, then the same scheme can be applied. Just put the fragment onto the end of your prefix.

add net
set physical=aggr1
set address=XXXX:XXXX:XXXX:XXXX::ac12:1ac/64
end


Of course, if you're assigned specific IPv6 addresses then you can use those directly. The above scheme is pretty trivial to script, though (and it actually makes it fairly easy to keep your DNS zone files up to date too).
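Since the fragment is derived mechanically from the IPv4 address, a small wrapper can emit the whole net block. This is just a sketch (the script name is made up, and aggr1 is simply the interface used in the examples above):

#!/bin/sh
# Emit shared-ip zonecfg lines with a link-local IPv6 address
# derived from an IPv4 address, e.g.: ./zone-v6.sh 172.18.1.172
IP=$1
# convert the four octets to the hex fragment (172.18.1.172 -> ac12:1ac)
FRAG=$(echo "$IP" | awk -F. '{printf "%x%x:%x%x", $1, $2, $3, $4}')
cat <<EOF
add net
set physical=aggr1
set address=fe80::${FRAG}/10
end
EOF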

Exclusive-IP zones


For an exclusive-ip zone, you just hand over a network interface to a zone and let it go figure. So it can assign whatever addresses it likes.

In particular, the zone can use the normal MAC address scheme to generate its IPv6 link-local address.

Originally in older Solaris, you needed to use a genuine physical interface. Which limited you a little bit as there are only so many network cards you can jam into a server. OpenSolaris introduced full network virtualization in the form of Crossbow, so any illumos distribution or Solaris 11 can create fully virtualized network stacks and present those to zones in the same way.

In Tribblix, I use zap to manage zones, and it takes care of creating the appropriate vnics and, if appropriate, etherstubs, and wiring things together. I also poke functional /etc/hostname.* and /etc/defaultrouter files into the zone so the networking at least gets configured when the zone boots. Adding IPv6 to the zone is simply a case of creating matching empty /etc/hostname6.* files (one for each vnic) and the IPv6 addresses will get autoconfigured.
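If you wanted to do that step by hand, it amounts to creating an empty file under the zone's root, something like this (the zone root path is illustrative and depends on where your zones actually live):

touch /export/zones/myzone/root/etc/hostname6.vnic1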

There's one wrinkle with exclusive-ip that deserves a whole section, that of restricting the zone to only using addresses that you've set.

Restricting with allowed-address


Remember that an exclusive-ip zone can manage the network interface. So it could allocate the wrong address and generally cause havoc on the network. To prevent this, set the allowed-address property on the interface. For example:

add net
set physical=vnic1
set allowed-address=172.18.1.172/24
end

The zone manages the interface, but an attempt to configure an invalid address will be thwarted.

You can see what's happening under the hood by running dladm show-linkprop. You'll see that the protection and allowed-ips properties are set.
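For example, to show just those two properties for the vnic used in the examples below:

dladm show-linkprop -p protection,allowed-ips vnic1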

(As an aside, it would be fantastic if illumos got the configure-allowed-address feature that Solaris 11 has, which would bypass my trickery in having to poke the network setup files in the zone.)

The same thing works for IPv6. The first problem I discovered is that (unlike Solaris 11) illumos won't accept multiple addresses in a list. Initially I thought this was something about IPv6, but it turns out you need to specify each address you want to add in a separate block.

The next question is going to be - what is the IPv6 address going to be? It's derived from the MAC address, so isn't fixed in advance.

The first step is to get the properties of the vnic. Running dladm show-vnic will give you the properties you need, including the MAC address. If you just want the one field, that's fairly easy too.

# dladm show-vnic -p -o MACADDRESS vnic1
2:8:20:c6:71:d

The IPv6 address is made up from that as an EUI-64 address, which is basically the fe80:: prefix, the first 3 octets, then ff:fe, then the last 3 octets. Oh, and the 7th bit of the first octet gets flipped. And conventionally the leading zero gets suppressed. An ugly way of scripting this in ksh looks like:

# read the vnic's MAC address and split it into its 6 octets
/usr/sbin/dladm show-vnic -p -o MACADDRESS $VNIC | \
    /usr/bin/sed 's=:= =g' | read o1 o2 o3 o4 o5 o6
# flip the 7th (universal/local) bit of the first octet
integer -i2 vi=16#$o1
integer -i2 nvi
nvi=$(($vi ^ 2#00000010))
integer -i16 xv=$nvi
no1=${xv/16#/}
# suppress a lone leading zero, as convention dictates
if [ "$no1" = "0" ]; then
    no1=""
fi
# assemble the EUI-64 link-local address
printf "fe80::%s%s:%sff:fe%s:%s%s/10" "$no1" "$o2" "$o3" "$o4" "$o5" "$o6"

So, what you want to do is add something like:

add net
set physical=vnic1
set allowed-address=fe80::8:20ff:fec6:71d/10
end

to your zone configuration. And if you have a routable IPv6 address, you'll need to duplicate the block again for that.
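Putting it together, the idea (as a sketch, reusing the placeholder prefix from earlier) is one net block per allowed address: the link-local one, plus one for a routable address if you have a prefix:

add net
set physical=vnic1
set allowed-address=fe80::8:20ff:fec6:71d/10
end
add net
set physical=vnic1
set allowed-address=XXXX:XXXX:XXXX:XXXX::8:20ff:fec6:71d/64
end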

In practice, this didn't quite work for me. If you don't set the allowed-address properties then the addresses get configured correctly, but with the protection set the address doesn't come up properly. If you try it then you get:

# ifconfig vnic1 inet6 up
ifconfig: setifflags: SIOCSLIFFLAGS: vnic1: Invalid argument

However, if you explicitly set the address, running ifconfig by hand:

# ifconfig vnic1 inet6 fe80::8:20ff:fec6:71d/10 up

then it works perfectly.

Firefox: using the CNS on Fedora blog'o'less

This guide worked for me on Fedora 26. It is probably also valid for other GNU/Linux distributions (using the appropriate package manager).
"Guide" is a bit of an overstatement, since the steps to follow are really simple.
All in all, to install the reader and the Carta Nazionale dei Servizi (or Tessera Sanitaria, the Italian health card), the procedure described on the Regione Toscana website turns out to be convoluted and unnecessary.
A B4ID smartcard reader is used here.
The card is the one identified on the Region's website as model 2 (AT2012).


Install the packages

Install PC/SC Lite (Middleware to access a smart card using SCard API):

sudo dnf install pcsc-tools

Connect the reader and run the command:

pcsc_scan -n

Insert the card into the reader. Something like this should appear:

Fri Jul 14 19:51:49 2017
Reader 0: ACS ACR 38U-CCID 00 00
  Card state: Card inserted,
  ATR: XX XX XX XX XX XX XX XX XX XX XX XX XX XX XX XX XX XX X


So, OK: the reader works.
If it doesn't work: FIDDLE AROUND.

Enable the service and start it:

sudo systemctl start pcscd
sudo systemctl enable pcscd.service


Install OpenSC (Open source smart card tools and middleware):

sudo dnf install opensc
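OpenSC also ships pkcs11-tool, which can be used as a quick sanity check that the card is visible before configuring Firefox (same library path as used below):

pkcs11-tool --module /usr/lib64/opensc-pkcs11.so --list-slots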

Firefox

Go to Preferences -> Advanced -> Certificates
Then Security Devices
Click the Load button
Use whatever module name you like
Then specify the library path: /usr/lib64/opensc-pkcs11.so
Press OK.

At this point you should see a new device.


When you insert the card, the holder's name and fiscal code (codice fiscale) should appear.


Clicking Login should prompt you for the card's PIN.

Now you can test that it actually works by connecting to a site that requires CNS authentication, such as https://accessosicuro.rete.toscana.it (where you can find, among other things, the Fascicolo Sanitario Elettronico, the electronic health record), https://iris.rete.toscana.it/ (for online payments to the Regione Toscana), or the INPS website.

With the health card it is also possible to request SPID credentials, the Sistema Pubblico di Identità Digitale (Public Digital Identity System): https://www.spid.gov.it/richiedi-spid#altre-modalita

What gets into Tribblix? The Trouble with Tribbles...

The software available for Tribblix is a bit of an eclectic mix. How do I choose what software to package?

There are actually a number of different reasons why you get a particular package.

The basics


Some packages are just basic, and you expect to find them. Much of the GNU stack comes in this way. And often things like Perl and Python are a foundational requirement for a lot of other tools.

What I want personally


There are a number of areas where I have specific interests - I'm a bit of a magpie when it comes to X11 window managers, for example. And I need to open office documents, so I had to get LibreOffice working. There are a few games or emulators that I like. This also explains why some things might not be present - I have no real interest in video or multimedia, for example, so that's an area with relatively little coverage.

Oh, that looks cool


I'm often interested in new stuff. (Even if it's actually old stuff that's just new to me.) So if I come across a piece of software and think "that looks cool" I'll often try and build and package it. If it works, fine, it ends up in the repository. Even if I might not end up doing anything with it, I've gone to the effort of making a package so it may as well stay there and somebody else might make use of it.

I have a $DAYJOB


Yes, I have a day job (a very good one, thank you very much), and it involves running applications on illumos. If I'm going to evaluate software I'll do it on Tribblix first. Building stuff on Tribblix is much easier than on, say, OmniOS - I have total control of the environment, and many more tools and prerequisite packages to give me a head start. So I can easily screen out any software that simply isn't going to work, and identify any patches or modifications necessary, before heading into the rather more constrained work environment.

Can you make X or Y or Z available


I get requests from users. I'll pretty much always at least try to add the software asked for - the fact that someone's bothered to ask indicates it might be useful, and I might find it interesting as well. This doesn't always work, of course, and I'll have to punt.

I got bored one day


Sometimes I get a bit of free time (no, this doesn't happen very often), and start looking for packages that might be worth adding. Sometimes this involves looking at other systems to see what they ship.

It's a prerequisite for something else


Dependency hell is a fact of life, so a lot of the time is actually spent building prerequisites. This is one reason for speculatively trying things out - it identifies prerequisites, and they're often going to be needed by other packages too. Although what you'll find is a number of packages with no obvious consumers, because the software I wanted didn't actually work. As I mentioned before, though, I'll keep those packages I built, and they might come in useful later.

How to install Fedora 26 on the Raspberry Pi (Headless and Wireless) blog'o'less

The Raspberry Pi is the most famous SBC (Single-Board Computer). It was recently reported that over five million units have been sold since it came out.

There are a lot of Linux-based and non-Linux-based operating systems that run on the Raspberry Pi: Fedora is one of the latest to land on this platform. Because of this, many things still don’t work: for instance, according to what you can read on the Fedora wiki, expansion HATs, composite TV out, the analog sound port and the add-on camera are not yet supported; support for displays other than the official one is not currently planned, and GPIO isn't well supported. In addition, Fedora supports only the Pi version 2 and later.
Obviously the Raspberry Pi Foundation recommends the use of Raspbian, a Debian-based Linux operating system. And as stated before, many distributions have been around on this platform for a long time, so they probably work better on it. Compared to the rest of the world, Fedora on ARM can look like it is at an early stage of development.
Given these facts, the question could be: why use Fedora on the Raspberry Pi when there are more feature-rich and widely used distributions? The answer is: isn’t Fedora our favorite distro? So let’s give the ARM version a try.
Side note: ARM is an architecture officially supported by Fedora.

Let’s get started.


Preparing the SD card

The installation of the Fedora ARM image on an SD card is as simple as these two steps:

- download the raw image of your choice (in this case Minimal) from https://arm.fedoraproject.org/

- run this command (supposing you are running Fedora on your PC and that you have installed the fedora-arm-installer package via DNF)

sudo arm-image-installer --image=Fedora-Minimal-armhfp-26-1.5-sda.raw.xz --target=rpi3 --norootpass --resizefs --media=/dev/mmcblk0 -y

Please refer to the Fedora wiki.

This command blanks the root password (remember to set one during the first setup) and automatically resizes the root partition in order to fill the SD card.

Once done, remove the SD card from your PC.

The serial console

If you do not have an HDMI screen to connect to the Raspberry Pi, or if you want to run a completely headless box, you have to enable the serial console.

The task is different between the Raspberry Pi 2 and 3. These steps are for version 3:

- insert the SD card into your PC; it should automatically mount three partitions

- edit the file extlinux.conf

sudo vim /run/media/youruser/__boot/extlinux/extlinux.conf

add "console=tty0 console=ttyAMA0,115200" to the end of the append line

append ro root=UUID=<uuid-uuid> console=tty0 console=ttyS0,115200

- edit the config.txt file in the firmware (FAT) partition

sudo vim /run/media/youruser/somethinglike-4AC9-BABE/config.txt

- uncomment the enable_uart line:

enable_uart=1

The onboard wifi adapter

If you want to enable the onboard WiFi adapter, you have to download a file that Fedora cannot currently redistribute in the raw image:

sudo curl https://raw.githubusercontent.com/RPi-Distro/firmware-nonfree/master/brcm80211/brcm/brcmfmac43430-sdio.txt -o /run/media/youruser/__/lib/firmware/brcm/brcmfmac43430-sdio.txt
 

Booting the Raspberry

Unmount the three partitions, using something like

umount /run/media/youruser/__boot ; umount /run/media/youruser/__ ; umount /run/media/youruser/somethinglike-4AC9-BABE

Then insert the SD card into the Raspberry Pi and connect a USB-to-serial/TTL adapter (pin 8 on the Raspberry is TX and pin 10 is RX). Then power up the device.

You can look at the boot process using a terminal emulator program like minicom or GTKterm, or even the simple screen command:

screen  /dev/ttyUSB0 115200

Once the boot process reaches the end, you should see the Fedora initial setup, where you can create an account, set the root password, configure the timezone, etc.

As soon as you log in, if the Raspberry is connected to the network through the Ethernet interface and there is a DHCP server on the LAN, you are almost done. Check that you are really online and perform a system update:

sudo dnf update

Setup the wifi

If for some reason you can’t use the Ethernet interface and you need to connect to the WiFi network, you have to configure the wireless interface using the NetworkManager command-line interface (nmcli).

For instance, if your network uses WPA or WPA2 and there is a DHCP server, this task is pretty simple:

nmcli radio

nmcli device wifi connect YOURSSID password secretpassword

If you need to set up a static IP, there are instead a few more steps to perform:

Add a connection profile and set the IP/netmask and the default gateway

nmcli con add con-name ConnectionName ifname wlan0 type wifi ssid YOURSSID ip4 192.168.100.200/24 gw4 192.168.100.1

Set up the DNS

nmcli con modify ConnectionName ipv4.dns "8.8.8.8 8.8.4.4"

Configure the wireless security (in this case wpa-psk) and the password

nmcli con modify ConnectionName wifi-sec.key-mgmt wpa-psk

nmcli con modify ConnectionName wifi-sec.psk verysecurepassword

Now you can bring up the connection:

nmcli con up ConnectionName
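To confirm that the wireless interface actually came up, a quick check such as the following should do:

nmcli device status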

Conclusion

Here we have a headless and wireless Raspberry Pi 3 running our favorite GNU/Linux distribution!

Please refer to the Fedora Project wiki for more up-to-date information:
https://fedoraproject.org/wiki/Raspberry_Pi

Running LX zones with Tribblix The Trouble with Tribbles...

I mentioned a few months ago a little project I had been working on - nicknamed omnitribblix, it's regular Tribblix with the illumos components coming from illumos-omnios (now via OmniOS Community Edition) rather than vanilla illumos-gate.

One of the changes I made in the recent Milestone 20 update was to split out the release packages to give more flexibility.

This allowed me to release a micro update to Milestone 20 (imaginatively called m20.1 or update 1), which updates the illumos bits but shares the same main package repository as the main Milestone 20 release.

And the other thing I can now do is build variant releases. So Tribblix has an LX variant!

You can download the omnitribblix ISO image from the Tribblix download page. It installs, operates, and is packaged just like regular Tribblix. If you don't use LX zones, you probably wouldn't notice the difference.

(It's versioned as m20lx.1 - the update 1 there means that it's a parallel release to the regular Tribblix Milestone 20 update 1.)

You can also update to the LX variant from either the regular Milestone 20 or Milestone 20 update 1 releases, in the normal way. It's a micro update, or sidegrade perhaps, but uses the same upgrade process as regular upgrades.

And, because of the magic of boot environments, if there's a problem you can roll back.

Anyway, once you have omnitribblix installed, how do you create an LX zone? Very easily, in the same way you create and destroy other zones on Tribblix, using the zap utility.

Before you can do that, though, you need a Linux image of some sort to install.

I've been using the same images I use under Docker. So, for example, if I want Alpine then I would go:

docker run alpine uname -a

and then get the name of the container

docker ps -a

and then export that with

docker export romantic_galileo > alpine.tar

Then copy the alpine.tar file to your omnitribblix system. If you want something a bit richer, then Ubuntu will work. But generally exporting a Docker container like this will work, and the image characteristics will be a good fit for a zone.
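If you go the Ubuntu route, one way to get a tarball (assuming the stock ubuntu image from Docker Hub; the container name here is arbitrary) is:

docker create --name ubuntu-export ubuntu
docker export ubuntu-export > ubuntu.tar
docker rm ubuntu-export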

And then all you do to create the zone is use zap, specifying that it's an lx brand and telling it where the tarball is:

zap create-zone -z alpine -t lx \
-x 10.0.2.99 -I /tmp/alpine.tar

and just zlogin to it as normal.

There are constraints around networking - you have to be exclusive-ip (the -x flag) and zap will create (and destroy) the vnic for you automatically. But the networking in the zone won't actually be configured. (While you specify the IP address in the command, that just tells zap how to configure the network plumbing and the vnic.) You'll have to log in to the zone and use the native tools to identify and configure the network, like so:

/native/sbin/ifconfig -a
/native/sbin/ifconfig znic0 inet 10.0.2.99 up
/native/usr/sbin/route add net default 10.0.2.2

And off you go. Sitting on an illumos box with all its goodness, with access to the wide variety of the Linux ecosystem at your fingertips.

Java Package Flame Graph Brendan Gregg's Blog

CPU flame graphs visualize running code based on its flow or stack trace ancestry, showing which functions called which other functions and so on. But with Java, there's another way to visualize the same CPU workload which provides some additional insight: **a Java package flame graph**. Instead of visualizing the stack trace hierarchy, this visualizes the Java package name hierarchy. I'll explain with a quick example. Here is a normal stack trace-based [CPU flame graph] for Java, running a microbenchmark \(SVG\):

The y-axis is stack depth. From bottom to top are parent to child functions, and the top edge shows the functions running on CPU. These flame graphs answer many questions easily, such as where the bulk of the CPU time is spent, with ancestry and child functions. But there's one line of questioning that's still tricky: How much CPU time is spent in java/util/\*, for example? The Search button (top right) lets you answer this by searching on "java/util", and the bottom right will show 4.3%. But this includes child functions (on purpose). How much CPU time was in java/util methods directly, excluding child function calls? That takes a bit of effort to figure out, involving zooming on each call and excluding child calls manually. A package flame graph can help here. Now for a Java package flame graph for the same workload, also showing CPU samples \(SVG\):

The y-axis now spans the package name. Click to navigate. This visualizes the on-CPU functions only, so function ancestry is excluded. The time in java/util is grouped together, which can be identified visually: it's 3.91% (it should be less than the earlier flame graph, as it excludes child calls; however, this is also a separate profile and the workload may have varied). There seems to be a grass of many thin rectangles: these are not Java methods, and so don't have a package name to split. Is this package flame graph better than the normal stack trace flame graph? Definitely not. I use it in addition, as a different perspective for understanding the same CPU workload. Here's how you can make a package flame graph, using the software from my [FlameGraph] repository:
# perf record -F 99 -a -- sleep 30; ./jmaps
# perf script | ./pkgsplit-perf.pl | grep java | ./flamegraph.pl > out.svg
Notice something? I'm not using -g with perf record, like I normally do, so this is not collecting stack traces. That means that this type of profiling has lower overhead, which is a bonus. It also means that Java doesn't need to be running with -XX:+PreserveFramePointer, although you probably still want to so that you can collect the normal (stack trace) flame graphs. Also, some workloads can bust perf's 127 stack frame limit (tunable in Linux 4.8 onwards), which can badly mess up a normal flame graph to the point where it's unreadable. The package name flame graph will work fine in this situation. I introduced Java package flame graphs in my [JavaOne talk] last year, and was just using them again to find some extra clues. I hope they are useful for you too. [CPU flame graph]: http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html [FlameGraph]: https://github.com/brendangregg/FlameGraph [JavaOne talk]: https://www.slideshare.net/brendangregg/java-performance-analysis-on-linux-with-flame-graphs/69

DBG-SOS-001: Finding Lock Inversions with DTrace Z IN ASCII - Writing

Tracing lock acquisition order with DTrace's lockstat and fbt providers.

Tweaking binaries with elfedit The Trouble with Tribbles...

On Solaris and illumos, you can inspect shared objects (binaries and libraries) with elfdump. In the most common case, you're simply looking for what shared libraries you're linked against, in which case it's elfdump -d (or, for those of us who were doing this years before elfdump came into existence, dump -Lv). For example:

% elfdump -d /bin/true

Dynamic Section:  .dynamic
     index  tag                value
       [0]  NEEDED            0x1d6               libc.so.1
       [1]  INIT              0x8050d20          

and it goes on a bit. But basically you're looking at the NEEDED lines to see which shared libraries you need. (The other field that's generally of interest for a shared library is the SONAME field.)

However, you can go beyond this, and use elfedit to manipulate what's present here. You can essentially replicate the above with:

elfedit -r -e dyn:dump /bin/true

Here the -r flag says read-only (we're just looking), and -e says execute the command that follows, which is dyn:dump - or just show the dynamic section.

If you look around, you'll see that the classic example is to set the runpath (which you might see as RPATH or RUNPATH in the dump output). This was used to fix up binaries that had been built incorrectly, or where you've moved the libraries somewhere other than where the binary normally looks for them. Which might look like:

elfedit -e 'dyn:runpath /my/local/lib' prog

This is the first example in the man page, and the standard example wherever you look. (Note the quotes - that's a single command input to elfedit.)

However, another common case I come across is where libtool has completely mangled the link so the full pathname of the library (at build time, no less) has been embedded in the binary (either in absolute or relative form). In other words, rather than the NEEDED section being

libfoo.so.1

it ends up being

/home/ptribble/build/bar/.libs/libfoo.so.1

With this sort of error, no amount of tinkering with RPATH is going to help the binary find the library. Fortunately, elfedit can help us here too.

First you need to work out which element you want to modify. Back to elfedit again to dump out the structure

% elfedit -r -e dyn:dump /bin/baz
     index  tag                value
       [0]  POSFLAG_1         0x1                 [ LAZY ]
       [1]  NEEDED            0x8e2               /home/.../libfoo.so.1

It might be further down, of course. But the entry we want to edit is index number 1. We can narrow down the output just to this element by using the -dynndx flag to the dyn:dump command, for example

elfedit -r -e 'dyn:dump -dynndx 1' /bin/baz

or, equivalently, using dyn:value

elfedit -r -e 'dyn:value -dynndx 1' /bin/baz

And we can actually set the value as well. This requires the -s flag to set a string value, and you end up with:

elfedit -e 'dyn:value -dynndx -s 1 libfoo.so.1' /bin/baz

and then if you use elfdump or elfedit or ldd to look at the binary, it should pick up the library correctly.
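To double-check that the edit took, you can dump the dynamic section again and look at the NEEDED entry, for example:

% elfdump -d /bin/baz | grep NEEDED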

This is really very simple (the hardest part is having to work out what the index of the right entry is). I didn't find anything when searching that actually describes how simple it is, so I thought it worth documenting for the next time I need it.