USENIX/LISA 2013 Blazing Performance with Flame Graphs Brendan Gregg's Blog

In 2013 I gave a plenary at USENIX/LISA on flame graphs: my visualization for profiled stack traces, which is now used by many companies (including Netflix, Facebook, and Linkedin) to identify which code paths consume CPU. The talk is more relevant today, now that flame graphs are widely adopted. Slides are on [slideshare] ([PDF]):

Video is on [youtube]:

The talk explains the origin of flame graphs, how to interpret them, and then tours different profile and trace event types that can be visualized. It predates some flame graph features that were added later: zoom, search, mixed-mode color highlights (--palette=java), and differential flame graphs. I used DTrace to create different types of flame graphs in the talk, but since then I've developed ways to do them on Linux, using [perf] for CPU flame graphs, and [bcc/BPF] for advanced flame graphs: off-CPU and more. My [BPF off-CPU flame graphs] post used my stack track hack, but since then we've added stack trace support to BPF in Linux (4.6), and these can now be implemented without hacks. The tool offcputime in bcc has already been updated to do this (thanks Vicent Marti and others for getting it working well, and Alexei Starovoitov for adding stack trace support to BPF). This talk was 170 slides in 90 minutes, which may have been too much in 2013 when flame graphs were new. There's a reason for this: I'd planned to do a 45 minute talk on CPU flame graphs, ending on slide 98, followed by a different talk. For reasons beyond my control, I was told the night before that I couldn't give that second talk. My plan B, as I'd already discussed with the conference organizers, was to extend the flame graphs talk and add an advanced section. I was up to 5am doing this, and was then woken at 8am by the conference organizers: the plenary speaker had shellfish poisoning, and could I come down and give my flame graphs talk at 9am, instead of later that day? That's how this ended up as a 90 minute plenary! At that LISA I also worked more with USENIX staff, and co-delivered a metrics workshop, as well as another talk. I was proud to be involved with USENIX/LISA and contribute in these ways. And, you can too, the call for proposals for LISA 2017 ends tomorrow (April 24). Since 2013, I've also written about flame graphs in [ACMQ] and [CACM]. For the latest on flame graphs, see the [updates] section of my flame graphs page. [ACMQ]: [CACM]: [PDF]: /Slides/LISA13_Flame_Graphs.pdf [slideshare]: [youtube]: [updates]: /flamegraphs.html#Updates [perf]: /perf.html#FlameGraphs [bcc/BPF]: [BPF off-CPU flame graphs]: /blog/2016-01-20/ebpf-offcpu-flame-graph.html

2017-04-22 Josef "Jeff" Sipek

Noisy Tribblix The Trouble with Tribbles...

I've had a couple of Tribblix users ask me why audio doesn't work.

This was something I had noticed myself, and the reason was not that audio was in some way broken, but that the permissions on the audio devices were wrong - owned and only writeable by root.

Now I only wanted to actually get any audio out on fairly rare occasions, so a quick chown wasn't that much of an imposition. But it obviously needed fixing properly.

My assumption here is that most desktop users will be logging in through the SLiM login manager. So all I need to do is fix the permissions just before it calls setuid() to the logged in user. And then reset them back once the user is done.

Now, I could have made up a bunch of chowns myself, or written a helper. There's actually code in SLiM to call ConsoleKit - but I don't have ConsoleKit, and don't really see the need to maintain a port of it just for this.

But illumos already has the capability to do this, and the normal login mechanisms use it. There's code in libdevinfo that sets the permissions according to the rules laid out in the /etc/logindevperm file. So the code is really just a call to di_devperm_login() and di_devperm_logout(), and all is well.

This also fixed another irritating bug - I can now eject memory sticks as myself, without needing to be root.

The next thing that happens, of course, is that it doesn't take very long to realise that Twitter has a lot of videos that play automatically. So I'm sitting there and I can hear either the internal loudspeaker or my headphones warbling away.

So the next thing I need is a way to shut the thing up. Historically, I used the old CDE sdtaudiocontrol, which was pretty good. (In general, I detested CDE as a desktop, the mailer and calendar were decent enough for their time, and the audio control was the only other thing I used much.) I use Xfce as my desktop, it used to have xfce4-mixer but that's now unmaintained and deprecated (and I removed that as part of the migration from gstreamer-0.10 to gstreamer1). Which pretty much leaves the command line audio utilities in illumos, specifically audioctl. I've added the package so users who update will automatically get that as well.

The command

audioctl set-control volume 0

silences things, while

audioctl set-control volume 75

puts the volume back to normal. I've created aliases mute and unmute for those. A more sophisticated approach would be to save the volume and restore it afterwards, but this is enough for now.

Thank you, Oracle engineers alp's notes

After 2010, when Oracle acquired Sun, most of us, who followed OpenSolaris, were depressed. In one year one of the most advantageous operating systems was closed under steel curtain. Luckily, due to enormous efforts of community, of companies, dependent on OpenSolaris, the system survived. Currently we have several more or less successful illumos distributions, targeting different users. But nowadays there's a (of course, deserved) common negative feeling towards Oracle in illumos community. But let's speak from another point of view. Let's look at things, which illumos community (and in particular, OpenIndiana) got directly or indirectly from Oracle in recent years.

  • Our userland build system, which constantly evolves, however, in different directions, under Oracle control and in our distribution. But still a lot of components can be easily migrated between build systems.
  • A lot of software build receipts and patches, as result, were borrowed with small modifications, from Oracle userland-gate. The process is still going on.
  • We still borrow patches from Solaris pkg-gate. Also differences in underlying kernels are currently rather significant, a lot of changesets from pkg-gate can be ported to OpenIndiana pkg5 repository.
  • Of course, I can not avoid thanking Alan for his constant help in supporting Xorg subsystem and GUI parts of our distribution. He was always helpful to me and Aurélien.
  • Evidently, recent KMS work, integrated into OpenIndiana, wouldn't be possible without Oracle's open drm port, which was ported from Solaris to illumos by Martin Bochnig, and later independently ported and enhanced by Gordon Ross.
- And of course, I cannot count patches, which were suggested to upstream projects by Oracle engineers. Just today when I tried to solve two issues related with IPS and apache 2.4 interaction, I've found two patches by Petr Sumbera, fixing Apache issues on Solaris. So, I want to use the chance and thank all Oracle Solaris engineers for their work on open source projects. I doubt that without them illumos could survive in large scale. Perhaps, we could be an excellent playground for ZFS development, but not an universal operating system...

2017-03-23 Josef "Jeff" Sipek

The million dollar engineering problem — Scaling infrastructure in the cloud is easy, so it’s easy to fall into the trap of scaling infrastructure instead of improving efficiency.

Some Notes on the “Who wrote Linux” Kerfuffle

The Ghosts of Internet Time

How a personal project became an exhibition of the most beautifully photographed and detailed bugs you ever saw — Amazing photos of various bugs.

Calculator for Field of View of a Camera and Lens

The Megaprocessor — A microprocessor built from discrete transistors.

Why Pascal is Not My Favorite Programming Language

EAA Video — An assortment of EAA produced videos related to just about anything aircraft related (from homebuilding to aerobatics to history).

The Unreasonable Effectiveness of Recurrent Neural Networks

The Mobile Cat Josef "Jeff" Sipek

About three weeks ago I got the idea to put a phone in front of the cat and use it to light her. It took a while for everything to align (she needed to be resting just the right way, it had to be dark, and I had to notice and have my camera handy), but I think the result was worth it:

D750: A Year In Statistics Josef "Jeff" Sipek

It has been a year since I got the D750 and I thought it would be fun to gather some statistics about the photos.

While I have used a total of 5 different lenses with the D750, only three of them got to see any serious use. The lenses are:

Nikon AF Nikkor 50mm f/1.8D
This is the lens I used for the first month. Old, cheap, but very good.
Nikon AF Nikkor 70-300mm f/4-5.6D ED
I got this lens many years ago for my D70. During the first month of D750 ownership, I couldn’t resist seeing what it would behave like on the D750. It was a disaster. This lens just doesn’t create a good enough image for the D750’s 24 megapixel sensor.
Nikon AF-S Nikkor 24-120mm f/4 ED VR
I used this lens when I test-drove the D750, so technically I didn’t take these with my camera. With that said, I’m including it because it makes some of the graphs look more interesting.
Nikon AF-S Nikkor 24-70mm f/2.8G ED
After a month of using the 50mm, I got this lens which became my walk around lens.
Nikon AF-S Nikkor 70-200mm f/2.8G ED VR II
Back in June, Nikon had a sale and that ended up being just good enough to convince me to spend more money on photography gear.

Now that we’ve covered what lenses I have used, let’s take a look at some graphs. First of all, the number of images taken with each lens:

Not very surprising. Since June, I have been taking with me either the 24-70mm, or 70-200mm, or both if the extra weight is not a bother. So it is no surprise that the vast majority of my photos have been taken with those two lenses. The 50mm is all about that first month when I had a new toy (the D750!) and so I dragged it everywhere. (And to be fair, the 50mm lens is so compact that it is really easy to drag it everywhere.) The 230 photos taken with the 70-300mm are all (failed) attempts at plane spotting photography.

First, let’s look at the breakdown by ISO (in 1/3 stop increments):

This is not a surprising graph at all. The D750’s base ISO is 100 and the maximum native ISO is 12800. It is therefore no surprise that most of the photos were taken at ISO 100.

I am a bit amused by the spikes at 200, 400, and 800. I know exactly why these happen—when I have to adjust the exposure by a large amount, I tend to scroll the wheels a multiple of three notches.

Outside of the range, there are a couple of photos (52) taken at ISO 50 (which Nikon calls “Lo 1”) to work around the lack of an ND filter. There is actually one other photo outside of the native ISO range that I did not plot at all—the one photo I took at ISO 51200 (“Hi 2”) as a test.

Now, let’s break the numbers down differently—by the aperture used (again in 1/3 stop increments):

I am actually surprised that so many of them are at f/2.8. I’m well aware that most lenses need to be stepped down a little for best image quality, but apparently I don’t do that a third of the time. It is for this kind of insight that I decided to make this blahg post.

Moving on to focal length. This is by far the least interesting graph.

You can clearly see 4 large spikes—at 24 mm, 50 mm, 70 mm, and 200 mm. All of those are because of focal length limits of the lenses. Removing any data points over 500 yields a slightly more readable graph:

It is interesting that the focal length that is embedded in the image doesn’t seem to be just any integer, but rather there appear to be “steps” in which it changes. The step also isn’t constant. For example, the 70-200mm lens seems to encode 5 mm steps above approximately 130 mm but 2-3 mm below it.

I realize this is a useless number given that we are dealing with nothing like a unimodal distribution, but I was curious what the mean focal length was. (I already know that the most common ones are 24 mm and 70 mm for the 24-70mm, and 70 mm and 200 mm for the 70-200mm lens.)

Lens Mean Focal Length Count
24-120    73.24138 87
24-70    46.72043 6485
50    50.00000 1020
70-200    151.69536 4438
70-300    227.82609 230

Keep in mind these numbers include the removed spikes.

Just eyeballing the shutter speed data, I think that it isn’t even worth plotting.

So, that’s it for this year. I found the (basic) statistics interesting enough, and I learned that I stay at f/2.8 a bit too much.

perf sched for Linux CPU scheduler analysis Brendan Gregg's Blog

Linux perf gained a new CPU scheduler analysis view in Linux 4.10: perf sched timehist. As I haven't talked about perf sched before, I'll summarize its capabilities here. If you're in a hurry, it may be helpful to just browse the following screenshots so that you are aware of what is available. (I've also added this content to my [perf examples] page.) perf sched uses a dump-and-post-process approach for analyzing scheduler events, which can be a problem as these events can be very frequent – millions per second – costing CPU, memory, and disk overhead to record. I've recently been writing scheduler analysis tools using [eBPF/bcc] \(including [runqlat]\), which lets me greatly reduce overhead by using in-kernel summaries. But there are cases where you might want to capture every event using perf sched instead, despite the higher overhead. Imagine having five minutes to analyze a bad cloud instance before it is auto-terminated, and you want to capture everything for later analysis. I'll start by recording one second of events:

# perf sched record -- sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.886 MB (13502 samples) ]
That's 1.9 Mbytes for _one second_, including 13,502 samples. The size and rate will be relative to your workload and number of CPUs (this example is an 8 CPU server running a software build). How this is written to the file system has been optimized: it only woke up one time to read the event buffers and write them to disk, which reduces overhead. That said, there are still significant overheads with instrumenting all scheduler events and writing event data to the file system. These events:
# perf script --header
# ========
# captured on: Sun Feb 26 19:40:00 2017
# hostname : bgregg-xenial
# os release : 4.10-virtual
# perf version : 4.10
# arch : x86_64
# nrcpus online : 8
# nrcpus avail : 8
# cpudesc : Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
# cpuid : GenuineIntel,6,62,4
# total memory : 15401700 kB
# cmdline : /usr/bin/perf sched record -- sleep 1 
# event : name = sched:sched_switch, , id = { 2752, 2753, 2754, 2755, 2756, 2757, 2758, 2759...
# event : name = sched:sched_stat_wait, , id = { 2760, 2761, 2762, 2763, 2764, 2765, 2766, 2...
# event : name = sched:sched_stat_sleep, , id = { 2768, 2769, 2770, 2771, 2772, 2773, 2774, ...
# event : name = sched:sched_stat_iowait, , id = { 2776, 2777, 2778, 2779, 2780, 2781, 2782,...
# event : name = sched:sched_stat_runtime, , id = { 2784, 2785, 2786, 2787, 2788, 2789, 2790...
# event : name = sched:sched_process_fork, , id = { 2792, 2793, 2794, 2795, 2796, 2797, 2798...
# event : name = sched:sched_wakeup, , id = { 2800, 2801, 2802, 2803, 2804, 2805, 2806, 2807...
# event : name = sched:sched_wakeup_new, , id = { 2808, 2809, 2810, 2811, 2812, 2813, 2814, ...
# event : name = sched:sched_migrate_task, , id = { 2816, 2817, 2818, 2819, 2820, 2821, 2822...
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: breakpoint = 5, power = 7, software = 1, tracepoint = 2, msr = 6
# HEADER_CACHE info available, use -I to display
# ========
    perf 16984 [005] 991962.879966:   sched:sched_wakeup: comm=perf pid=16999 prio=120 target_cpu=005

The captured trace file can be reported in a number of ways, summarized by the help message:

# perf sched -h

 Usage: perf sched [] {record|latency|map|replay|script|timehist}

    -D, --dump-raw-trace  dump raw trace in ASCII
    -f, --force           don't complain, do it
    -i, --input     input file name
    -v, --verbose         be more verbose (show symbol address, etc)

perf sched latency will summarize scheduler latencies by task, including average and maximum delay:

# perf sched latency

  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
  cat:(6)               |     12.002 ms |        6 | avg:   17.541 ms | max:   29.702 ms | max at: 991962.948070 s
  ar:17043              |      3.191 ms |        1 | avg:   13.638 ms | max:   13.638 ms | max at: 991963.048070 s
  rm:(10)               |     20.955 ms |       10 | avg:   11.212 ms | max:   19.598 ms | max at: 991963.404069 s
  objdump:(6)           |     35.870 ms |        8 | avg:   10.969 ms | max:   16.509 ms | max at: 991963.424443 s
  :17008:17008          |    462.213 ms |       50 | avg:   10.464 ms | max:   35.999 ms | max at: 991963.120069 s
  grep:(7)              |     21.655 ms |       11 | avg:    9.465 ms | max:   24.502 ms | max at: 991963.464082 s
  fixdep:(6)            |     81.066 ms |        8 | avg:    9.023 ms | max:   19.521 ms | max at: 991963.120068 s
  mv:(10)               |     30.249 ms |       14 | avg:    8.380 ms | max:   21.688 ms | max at: 991963.200073 s
  ld:(3)                |     14.353 ms |        6 | avg:    7.376 ms | max:   15.498 ms | max at: 991963.452070 s
  recordmcount:(7)      |     14.629 ms |        9 | avg:    7.155 ms | max:   18.964 ms | max at: 991963.292100 s
  svstat:17067          |      1.862 ms |        1 | avg:    6.142 ms | max:    6.142 ms | max at: 991963.280069 s
  cc1:(21)              |   6013.457 ms |     1138 | avg:    5.305 ms | max:   44.001 ms | max at: 991963.436070 s
  gcc:(18)              |     43.596 ms |       40 | avg:    3.905 ms | max:   26.994 ms | max at: 991963.380069 s
  ps:17073              |     27.158 ms |        4 | avg:    3.751 ms | max:    8.000 ms | max at: 991963.332070 s

To shed some light as to how this is instrumented and calculated, I'll show the events that led to the top event's "Maximum delay at" of 29.702 ms. Here are the raw events from perf sched script:

      sh 17028 [001] 991962.918368:   sched:sched_wakeup_new: comm=sh pid=17030 prio=120 target_cpu=002
     cc1 16819 [002] 991962.948070:       sched:sched_switch: prev_comm=cc1 prev_pid=16819 prev_prio=120
                                                            prev_state=R ==> next_comm=sh next_pid=17030 next_prio=120

The time from the wakeup (991962.918368, which is in seconds) to the context switch (991962.948070) is 29.702 ms. This process is listed as "sh" (shell) in the raw events, but execs "cat" soon after, so is shown as "cat" in the perf sched latency output.

perf sched map shows all CPUs and context-switch events, with columns representing what each CPU was doing and when. It's the kind of data you see visualized in scheduler analysis GUIs (including perf timechart, with the layout rotated 90 degrees). Example output:

# perf sched map
                      *A0           991962.879971 secs A0 => perf:16999
                       A0     *B0   991962.880070 secs B0 => cc1:16863
          *C0          A0      B0   991962.880070 secs C0 => :17023:17023
  *D0      C0          A0      B0   991962.880078 secs D0 => ksoftirqd/0:6
   D0      C0 *E0      A0      B0   991962.880081 secs E0 => ksoftirqd/3:28
   D0      C0 *F0      A0      B0   991962.880093 secs F0 => :17022:17022
  *G0      C0  F0      A0      B0   991962.880108 secs G0 => :17016:17016
   G0      C0  F0     *H0      B0   991962.880256 secs H0 => migration/5:39
   G0      C0  F0     *I0      B0   991962.880276 secs I0 => perf:16984
   G0      C0  F0     *J0      B0   991962.880687 secs J0 => cc1:16996
   G0      C0 *K0      J0      B0   991962.881839 secs K0 => cc1:16945
   G0      C0  K0      J0 *L0  B0   991962.881841 secs L0 => :17020:17020
   G0      C0  K0      J0 *M0  B0   991962.882289 secs M0 => make:16637
   G0      C0  K0      J0 *N0  B0   991962.883102 secs N0 => make:16545
   G0     *O0  K0      J0  N0  B0   991962.883880 secs O0 => cc1:16819
   G0 *A0  O0  K0      J0  N0  B0   991962.884069 secs 
   G0  A0  O0  K0 *P0  J0  N0  B0   991962.884076 secs P0 => rcu_sched:7
   G0  A0  O0  K0 *Q0  J0  N0  B0   991962.884084 secs Q0 => cc1:16831
   G0  A0  O0  K0  Q0  J0 *R0  B0   991962.884843 secs R0 => cc1:16825
   G0 *S0  O0  K0  Q0  J0  R0  B0   991962.885636 secs S0 => cc1:16900
   G0  S0  O0 *T0  Q0  J0  R0  B0   991962.886893 secs T0 => :17014:17014
   G0  S0  O0 *K0  Q0  J0  R0  B0   991962.886917 secs 

This is an 8 CPU system, and you can see the 8 columns for each CPU starting from the left. Some CPU columns begin blank, as we've yet to trace an event on that CPU at the start of the profile. They quickly become populated.

The two character codes you see ("A0", "C0") are identifiers for tasks, which are mapped on the right ("=>"). This is more compact than using process (task) IDs. The "*" shows which CPU had the context switch event, and the new event that was running. For example, the very last line shows that at 991962.886917 (seconds) CPU 4 context-switched to K0 (a "cc1" process, PID 16945).

That example was from a busy system. Here's an idle system:

# perf sched map
                      *A0           993552.887633 secs A0 => perf:26596
  *.                   A0           993552.887781 secs .  => swapper:0
   .                  *B0           993552.887843 secs B0 => migration/5:39
   .                  *.            993552.887858 secs 
   .                   .  *A0       993552.887861 secs 
   .                  *C0  A0       993552.887903 secs C0 => bash:26622
   .                  *.   A0       993552.888020 secs 
   .          *D0      .   A0       993552.888074 secs D0 => rcu_sched:7
   .          *.       .   A0       993552.888082 secs 
   .           .      *C0  A0       993552.888143 secs 
   .      *.   .       C0  A0       993552.888173 secs 
   .       .   .      *B0  A0       993552.888439 secs 
   .       .   .      *.   A0       993552.888454 secs 
   .      *C0  .       .   A0       993552.888457 secs 
   .       C0  .       .  *.        993552.889257 secs 
   .      *.   .       .   .        993552.889764 secs 
   .       .  *E0      .   .        993552.889767 secs E0 => bash:7902

Idle CPUs are shown as ".".

Remember to examine the timestamp column to make sense of this visualization (GUIs use that as a dimension, which is easier to comprehend, but here the numbers are just listed). It's also only showing context switch events, and not scheduler latency. The newer timehist command has a visualization (-V) that can include wakeup events.

perf sched timehist was added in Linux 4.10, and shows the scheduler latency by event, including the time the task was waiting to be woken up (wait time) and the scheduler latency after wakeup to running (sch delay). It's the scheduler latency that we're more interested in tuning. Example output:

# perf sched timehist
Samples do not have callchains.
           time    cpu  task name                       wait time  sch delay   run time
                        [tid/pid]                          (msec)     (msec)     (msec)
--------------- ------  ------------------------------  ---------  ---------  ---------
  991962.879971 [0005]  perf[16984]                         0.000      0.000      0.000 
  991962.880070 [0007]  :17008[17008]                       0.000      0.000      0.000 
  991962.880070 [0002]  cc1[16880]                          0.000      0.000      0.000 
  991962.880078 [0000]  cc1[16881]                          0.000      0.000      0.000 
  991962.880081 [0003]  cc1[16945]                          0.000      0.000      0.000 
  991962.880093 [0003]  ksoftirqd/3[28]                     0.000      0.007      0.012 
  991962.880108 [0000]  ksoftirqd/0[6]                      0.000      0.007      0.030 
  991962.880256 [0005]  perf[16999]                         0.000      0.005      0.285 
  991962.880276 [0005]  migration/5[39]                     0.000      0.007      0.019 
  991962.880687 [0005]  perf[16984]                         0.304      0.000      0.411 
  991962.881839 [0003]  cat[17022]                          0.000      0.000      1.746 
  991962.881841 [0006]  cc1[16825]                          0.000      0.000      0.000 
  991963.885740 [0001]  :17008[17008]                      25.613      0.000      0.057 
  991963.886009 [0001]  sleep[16999]                     1000.104      0.006      0.269 
  991963.886018 [0005]  cc1[17083]                         19.998      0.000      9.948 

This output includes the sleep command run to set the duration of perf itself to one second. Note that sleep's wait time is 1000.104 milliseconds because I had run "sleep 1": that's the time it was asleep waiting its timer wakeup event. Its scheduler latency was only 0.006 milliseconds, and its time on-CPU was 0.269 milliseconds.

There are a number of options to timehist, including -V to add a CPU visualization column, -M to add migration events, and -w for wakeup events. For example:

# perf sched timehist -MVw
Samples do not have callchains.
           time    cpu  012345678  task name           wait time  sch delay   run time
                                   [tid/pid]              (msec)     (msec)     (msec)
--------------- ------  ---------  ------------------  ---------  ---------  ---------
  991962.879966 [0005]             perf[16984]                                          awakened: perf[16999]
  991962.879971 [0005]       s     perf[16984]             0.000      0.000      0.000                                 
  991962.880070 [0007]         s   :17008[17008]           0.000      0.000      0.000                                 
  991962.880070 [0002]    s        cc1[16880]              0.000      0.000      0.000                                 
  991962.880071 [0000]             cc1[16881]                                           awakened: ksoftirqd/0[6]
  991962.880073 [0003]             cc1[16945]                                           awakened: ksoftirqd/3[28]
  991962.880078 [0000]  s          cc1[16881]              0.000      0.000      0.000                                 
  991962.880081 [0003]     s       cc1[16945]              0.000      0.000      0.000                                 
  991962.880093 [0003]     s       ksoftirqd/3[28]         0.000      0.007      0.012                                 
  991962.880108 [0000]  s          ksoftirqd/0[6]          0.000      0.007      0.030                                 
  991962.880249 [0005]             perf[16999]                                          awakened: migration/5[39]
  991962.880256 [0005]       s     perf[16999]             0.000      0.005      0.285                                 
  991962.880264 [0005]        m      migration/5[39]                                      migrated: perf[16999] cpu 5 => 1
  991962.880276 [0005]       s     migration/5[39]         0.000      0.007      0.019                                 
  991962.880682 [0005]        m      perf[16984]                                          migrated: cc1[16996] cpu 0 => 5
  991962.880687 [0005]       s     perf[16984]             0.304      0.000      0.411                                 
  991962.881834 [0003]             cat[17022]                                           awakened: :17020
  991963.885734 [0001]             :17008[17008]                                        awakened: sleep[16999]
  991963.885740 [0001]   s         :17008[17008]          25.613      0.000      0.057                           
  991963.886005 [0001]             sleep[16999]                                         awakened: perf[16984]
  991963.886009 [0001]   s         sleep[16999]         1000.104      0.006      0.269
  991963.886018 [0005]       s     cc1[17083]             19.998      0.000      9.948 

The CPU visualization column ("012345678") has "s" for context-switch events, and "m" for migration events, showing the CPU of the event.

The last events in that output include those related to the "sleep 1" command used to time perf. The wakeup happened at 991963.885734, and at 991963.885740 (6 microseconds later) CPU 1 begins to context-switch to the sleep process. The column for that event still shows ":17008[17008]" for what was on-CPU, but the target of the context switch (sleep) is not shown. It is in the raw events:

  :17008 17008 [001] 991963.885740:       sched:sched_switch: prev_comm=cc1 prev_pid=17008 prev_prio=120
                                                             prev_state=R ==> next_comm=sleep next_pid=16999 next_prio=120

The 991963.886005 event shows that the perf command received a wakeup while sleep was running (almost certainly sleep waking up its parent process because it terminated), and then we have the context switch on 991963.886009 where sleep stops running, and a summary is printed out: 1000.104 ms waiting (the "sleep 1"), with 0.006 ms scheduler latency, and 0.269 ms of CPU runtime.

Here I've decorated the timehist output with the details of the context switch destination in red:

  991963.885734 [0001]             :17008[17008]                                        awakened: sleep[16999]
  991963.885740 [0001]   s         :17008[17008]          25.613      0.000      0.057  next: sleep[16999]
  991963.886005 [0001]             sleep[16999]                                         awakened: perf[16984]
  991963.886009 [0001]   s         sleep[16999]         1000.104      0.006      0.269  next: cc1[17008]
  991963.886018 [0005]       s     cc1[17083]             19.998      0.000      9.948  next: perf[16984]

When sleep finished, a waiting "cc1" process then executed. perf ran on the following context switch, and is the last event in the profile (perf terminated). I've submitted a patch to add this info when a -n option is used.

perf sched script dumps all events (similar to perf script):

# perf sched script
    perf 16984 [005] 991962.879960: sched:sched_stat_runtime: comm=perf pid=16984 runtime=3901506 [ns] vruntime=165...
    perf 16984 [005] 991962.879966:       sched:sched_wakeup: comm=perf pid=16999 prio=120 target_cpu=005
    perf 16984 [005] 991962.879971:       sched:sched_switch: prev_comm=perf prev_pid=16984 prev_prio=120 prev_stat...
    perf 16999 [005] 991962.880058: sched:sched_stat_runtime: comm=perf pid=16999 runtime=98309 [ns] vruntime=16405...
     cc1 16881 [000] 991962.880058: sched:sched_stat_runtime: comm=cc1 pid=16881 runtime=3999231 [ns] vruntime=7897...
  :17024 17024 [004] 991962.880058: sched:sched_stat_runtime: comm=cc1 pid=17024 runtime=3866637 [ns] vruntime=7810...
     cc1 16900 [001] 991962.880058: sched:sched_stat_runtime: comm=cc1 pid=16900 runtime=3006028 [ns] vruntime=7772...
     cc1 16825 [006] 991962.880058: sched:sched_stat_runtime: comm=cc1 pid=16825 runtime=3999423 [ns] vruntime=7876...
Each of these events ("sched:sched_stat_runtime" etc) are tracepoints you can instrument directly using perf record. As I've shown earlier, this raw output can be useful for digging further than the summary commands. That's it for now. Happy hunting. [eBPF/bcc]: [runqlat]: /blog/2016-10-08/linux-bcc-runqlat.html [perf examples]: /perf.html#SchedulerAnalysis

Flying around Mount Monadnock Josef "Jeff" Sipek

Last week I planned on doing a nice cross country flight from Wikipedia article: Fitchburg. Inspired by Garrett Fisher’s photos, I took my camera and the 70-200mm lens with me hoping to get a couple of nice photos of the landscapes in New Hampshire.

Sadly, after taking off from KFIT I found out that not only was there the stiff wind that was forecasted (that’s fine) but the air was sufficiently bumpy that it wouldn’t have been a fun flight. On top of that, the ADS-B unit was having problems acquiring a GPS signal. (Supposedly, the firmware sometimes gets into a funny state like this. The good news is that there is a firmware update available that should address this.) I contacted KASH tower to check if they could see my transponder—they did, so I didn’t have to worry about being totally invisible.

Since I was already off the ground, I decided to do some nearby sightseeing, landing practice, and playing with the Garmin GNS 430 GPS.

First, I headed northwest toward Wikipedia article: Mount Monadnock. While I have seen it in the distance several times before, I never got to see it up close, so this seemed like a worthwhile destination.

As I approached it, I ended up taking out my camera and getting a couple of photos of the hills and mountains in New Hampshire. It was interesting how the the view to the north (deeper into New Hampshire) is hilly, but the view more east (and certainly south) is flatter. (Both taken near Mount Monadnock.)

While the visibility was more than good enough for flying, it didn’t work out that well for photography. In all of the photos, the landscape far away ended up being heavily blue-tinted. No amount of playing around with white balance adjustment in Lightroom was able to correct it. (Either the background was too blue, or the foreground was too yellow/brown.) That’s why all of these photos are black and white.

I made a full turn around Monadnock, taking a number of shots but this one is my favorite:

Once done with Monadnock, I headed south to the Wikipedia article: Quabbin Reservoir in Massachusetts. This is a view toward the south from near its north end:

At this point I started heading to KORH to do some landing practice. Since I was plenty busy, there are no photos.

I’ve never been to this airport before and landing at new airports is always fun. The first interesting thing about it is that it is situated on a hill. While most airports around here are at 200-400 feet MSL, this one is at 1000 feet. The westerly wind favored runway 29 which meant I got to see a second interesting aspect of this airport. The beginning of runway 29 is on the edge of the hill. That by itself doesn’t sound very interesting, but consider that the runway is at 1000 feet while the bottom of the hill (a mere 0.9 km away) is at 500 feet MSL. That is approximately a 17% grade. So, as you approach the runway, at first it looks like you are way too high but the ground comes up significantly faster than normal.

I am still hoping to do my originally planned cross country flight at some point. Rest assured that I will blahg about it.

How old are illumos man pages? The Trouble with Tribbles...

I've recently been looking at improving the state of the illumos man pages.

One thing you'll notice is that the date on some of the manpages is old - really old, some of them are dated 1990. That presumably means that they haven't been modified in any meaningful sense for a quarter of a century.

(By date here I mean the date displayed by the man command. Which isn't necessarily the date somebody last touched it, but should correspond to the last meaningful change.)

The distribution of dates looks like this:

As you can see, the dates go all the way back to 1990. There's not just the odd one or two either, there are a decent number of man pages that comes from the 1990s.

There are some obvious features in the chart above.

There's a noticeable spike in 1996, which of course significantly predates illumos or OpenSolaris. It's not entirely obvious why there should be a spike, but 100 of those man pages are related to libcurses.

I suspect the dip in 2005 is a result of the launch of Solaris 10, when everyone had a bit of a breather before development kicked into full gear again.

Then there's a drop in 2010. In fact, there's just a single page for 2010. That's when the OpenSolaris project closed, so there was little work done at that point. Also, our man pages were only integrated into illumos-gate in early 2011 (prior to which they were kept separate), and it's taken a while for man page updates to pick up again.

Of course, one reason the man pages are so old is that the software they're documenting is old. That's not necessarily a bad thing, if it ain't broke don't fix it as they say, but there is a certain amount of total rubbish that we ought to clear out.

Fedora dnf history blog'o'less

Sometimes it can happen that you remove a package and inadvertently you remove a bunch of dependencies. Like removing the entire desktop manager: what a mess, waht a pain in the ass!
Sometimes dependecies are strange.

Rarely, but it can happen too, after an update something doesn't work anymore. Maybe the updated package is buggy? Better waiting for a fix.

By the way. If you install or uninstall packages by mistake, there is a way to rollback the changes: history.

sudo dnf history list

And you can see the dnf activities.

sudo dnf history info 132

You can see the details of the dnf command labeled as 132

Let's say 132 is the number of the last action

ID     | Command line             | Date and time    | Action(s)      | Altered
   132 | update                   | 2017-03-09 16:06 | Update         |    8 

Issue the following command in order to uninstall these updates:

sudo dnf history undo 132

And you will get something like:

Undoing transaction 132, from Thu Mar  9 21:06:10 2017

No package libgdata-0:0.17.6-3.fc25.x86_64 available.
Error: An operation cannot be undone

Well. WTF!

It seems that, if you would like to have the option to rollback updates, you must enable packages cache beforehand with keepcache=1 in the /etc/dnf/dnf.conf configuration file. This could lead to some disk space used to store previous versions of rpms? I must investigate a bit deeper.


Get the most out of GNOME notifications blog'o'less

Learn how to get more out of your GNOME notifications on Fedora Workstation.
My first article on Fedora Magazine.

Plane-spotting in Manchester, NH Josef "Jeff" Sipek

Last weekend I got to drive to Wikipedia article: Manchester, so I used the opportunity to kill some time near the airport by watching planes and taking photos (gallery).

The winds were coming from the south, so runway 17 was in use. I think those are the best plane spotting conditions at KMHT.

It is relatively easy to watch aircraft depart and fly directly overhead:

Unlike all my previous plane spotting, this time I tried something new—inspired by Mike Kelly’s Airportraits, I decided to try to make some composite images. Here is a Southwest Boeing 737 sporting one of the Wikipedia article: special liveries:

It was certainly an interesting experience.

At first I thought that I would be able to use the 7 frames/second that the D750 can do for the whole departure, but it turns out that the planes move far too slowly, so the camera buffer filled up way too soon and the frame rate became somewhat erratic. What mostly ended up working was switching to 3 frames/second and taking bursts. Next time, aiming for about 2 frames/second should give me enough images to work with.

Even though I used a tripod, I expected that I would have to align the images to remove the minor misalignment between images due to the vibration from the rather strong wind and my hand depressing the shutter. It turns out that the misalignment (of approximately 10 pixels) was minor enough that it did not change the final image.

Here’s an American Airlines commuter taking off from runway 17. (I repositioned to get a less head-on photo as well.)

For those curious, I post processed each of the images in Lightroom, exported them as TIFFs, and then used GIMP to do the layering and masking. Finally, I exported the final image and imported it back into Lightroom for safekeeping.

As a final treat, as I was packing up a US Army Gulfstream took off:

As far as I can tell, they use this one to transport VIPs. I wonder who was on board…