2017-03-23 Josef "Jeff" Sipek

The million dollar engineering problem — Scaling infrastructure in the cloud is easy, so it’s easy to fall into the trap of scaling infrastructure instead of improving efficiency.

Some Notes on the “Who wrote Linux” Kerfuffle

The Ghosts of Internet Time

How a personal project became an exhibition of the most beautifully photographed and detailed bugs you ever saw — Amazing photos of various bugs.

Calculator for Field of View of a Camera and Lens

The Megaprocessor — A microprocessor built from discrete transistors.

Why Pascal is Not My Favorite Programming Language

EAA Video — An assortment of EAA produced videos related to just about anything aircraft related (from homebuilding to aerobatics to history).

The Unreasonable Effectiveness of Recurrent Neural Networks

The Mobile Cat Josef "Jeff" Sipek

About three weeks ago I got the idea to put a phone in front of the cat and use it to light her. It took a while for everything to align (she needed to be resting just the right way, it had to be dark, and I had to notice and have my camera handy), but I think the result was worth it:

D750: A Year In Statistics Josef "Jeff" Sipek

It has been a year since I got the D750 and I thought it would be fun to gather some statistics about the photos.

While I have used a total of 5 different lenses with the D750, only three of them got to see any serious use. The lenses are:

Nikon AF Nikkor 50mm f/1.8D
This is the lens I used for the first month. Old, cheap, but very good.
Nikon AF Nikkor 70-300mm f/4-5.6D ED
I got this lens many years ago for my D70. During the first month of D750 ownership, I couldn’t resist seeing how it would behave on the D750. It was a disaster. This lens just doesn’t create a good enough image for the D750’s 24 megapixel sensor.
Nikon AF-S Nikkor 24-120mm f/4 ED VR
I used this lens when I test-drove the D750, so technically I didn’t take these with my camera. With that said, I’m including it because it makes some of the graphs look more interesting.
Nikon AF-S Nikkor 24-70mm f/2.8G ED
After a month of using the 50mm, I got this lens which became my walk around lens.
Nikon AF-S Nikkor 70-200mm f/2.8G ED VR II
Back in June, Nikon had a sale and that ended up being just good enough to convince me to spend more money on photography gear.

Now that we’ve covered what lenses I have used, let’s take a look at some graphs. First of all, the number of images taken with each lens:

Not very surprising. Since June, I have been taking with me either the 24-70mm, or 70-200mm, or both if the extra weight is not a bother. So it is no surprise that the vast majority of my photos have been taken with those two lenses. The 50mm is all about that first month when I had a new toy (the D750!) and so I dragged it everywhere. (And to be fair, the 50mm lens is so compact that it is really easy to drag it everywhere.) The 230 photos taken with the 70-300mm are all (failed) attempts at plane spotting photography.

Next, let’s look at the breakdown by ISO (in 1/3 stop increments):

This is not a surprising graph at all. The D750’s base ISO is 100 and the maximum native ISO is 12800. It is therefore no surprise that most of the photos were taken at ISO 100.

I am a bit amused by the spikes at 200, 400, and 800. I know exactly why these happen—when I have to adjust the exposure by a large amount, I tend to scroll the wheels a multiple of three notches.

Outside of the range, there are a couple of photos (52) taken at ISO 50 (which Nikon calls “Lo 1”) to work around the lack of an ND filter. There is actually one other photo outside of the native ISO range that I did not plot at all—the one photo I took at ISO 51200 (“Hi 2”) as a test.

Now, let’s break the numbers down differently—by the aperture used (again in 1/3 stop increments):

I am actually surprised that so many of them are at f/2.8. I’m well aware that most lenses need to be stopped down a little for best image quality, but apparently I don’t do that a third of the time. It is for this kind of insight that I decided to make this blahg post.

Moving on to focal length. This is by far the least interesting graph.

You can clearly see 4 large spikes—at 24 mm, 50 mm, 70 mm, and 200 mm. All of those are because of the focal length limits of the lenses. Removing any data points with counts over 500 yields a slightly more readable graph:

It is interesting that the focal length that is embedded in the image doesn’t seem to be just any integer, but rather there appear to be “steps” in which it changes. The step also isn’t constant. For example, the 70-200mm lens seems to encode 5 mm steps above approximately 130 mm but 2-3 mm below it.

I realize this is a useless number given that we are dealing with nothing like a unimodal distribution, but I was curious what the mean focal length was. (I already know that the most common ones are 24 mm and 70 mm for the 24-70mm, and 70 mm and 200 mm for the 70-200mm lens.)

Lens      Mean Focal Length (mm)    Count
24-120                  73.24138       87
24-70                   46.72043     6485
50                      50.00000     1020
70-200                 151.69536     4438
70-300                 227.82609      230

Keep in mind these numbers include the removed spikes.
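
As an aside, statistics like these can be computed straight from the EXIF data. Here is a rough sketch using exiftool and awk (the photo directory is hypothetical, and this is not necessarily how the numbers above were produced):

exiftool -T -LensID -FocalLength -r ~/photos/d750 |
    awk -F'\t' '{ sum[$1] += $2; n[$1]++ }
        END { for (l in sum) printf "%s\t%.5f mm\t%d\n", l, sum[l] / n[l], n[l] }'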

Just eyeballing the shutter speed data, I think that it isn’t even worth plotting.

So, that’s it for this year. I found the (basic) statistics interesting enough, and I learned that I stay at f/2.8 a bit too much.

perf sched for Linux scheduler analysis Brendan Gregg's Blog

Linux perf gained a new CPU scheduler analysis view in Linux 4.10: perf sched timehist. As I haven't talked about perf sched before, I'll summarize its capabilities here. If you're in a hurry, it may be helpful to just browse the following screenshots so that you are aware of what is available. (I've also added this content to my [perf examples] page.)

perf sched uses a dump-and-post-process approach for analyzing scheduler events, which can be a problem as these events can be very frequent – millions per second – costing CPU, memory, and disk overhead to record. I've recently been writing scheduler analysis tools using [eBPF/bcc] (including [runqlat]), which lets me greatly reduce overhead by using in-kernel summaries. But there are cases where you might want to capture every event using perf sched instead, despite the higher overhead. Imagine having five minutes to analyze a bad cloud instance before it is auto-terminated, and you want to capture everything for later analysis.

I'll start by recording one second of events:

# perf sched record -- sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.886 MB perf.data (13502 samples) ]
That's 1.9 Mbytes for _one second_, including 13,502 samples. The size and rate will be relative to your workload and number of CPUs (this example is an 8 CPU server running a software build). How this is written to the file system has been optimized: it only woke up one time to read the event buffers and write them to disk, which reduces overhead. That said, there are still significant overheads with instrumenting all scheduler events and writing event data to the file system. These events:
# perf script --header
# ========
# captured on: Sun Feb 26 19:40:00 2017
# hostname : bgregg-xenial
# os release : 4.10-virtual
# perf version : 4.10
# arch : x86_64
# nrcpus online : 8
# nrcpus avail : 8
# cpudesc : Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
# cpuid : GenuineIntel,6,62,4
# total memory : 15401700 kB
# cmdline : /usr/bin/perf sched record -- sleep 1 
# event : name = sched:sched_switch, , id = { 2752, 2753, 2754, 2755, 2756, 2757, 2758, 2759...
# event : name = sched:sched_stat_wait, , id = { 2760, 2761, 2762, 2763, 2764, 2765, 2766, 2...
# event : name = sched:sched_stat_sleep, , id = { 2768, 2769, 2770, 2771, 2772, 2773, 2774, ...
# event : name = sched:sched_stat_iowait, , id = { 2776, 2777, 2778, 2779, 2780, 2781, 2782,...
# event : name = sched:sched_stat_runtime, , id = { 2784, 2785, 2786, 2787, 2788, 2789, 2790...
# event : name = sched:sched_process_fork, , id = { 2792, 2793, 2794, 2795, 2796, 2797, 2798...
# event : name = sched:sched_wakeup, , id = { 2800, 2801, 2802, 2803, 2804, 2805, 2806, 2807...
# event : name = sched:sched_wakeup_new, , id = { 2808, 2809, 2810, 2811, 2812, 2813, 2814, ...
# event : name = sched:sched_migrate_task, , id = { 2816, 2817, 2818, 2819, 2820, 2821, 2822...
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: breakpoint = 5, power = 7, software = 1, tracepoint = 2, msr = 6
# HEADER_CACHE info available, use -I to display
# missing features: HEADER_BRANCH_STACK HEADER_GROUP_DESC HEADER_AUXTRACE HEADER_STAT 
# ========
#
    perf 16984 [005] 991962.879966:   sched:sched_wakeup: comm=perf pid=16999 prio=120 target_cpu=005
[...]

The captured trace file can be reported in a number of ways, summarized by the help message:

# perf sched -h

 Usage: perf sched [<options>] {record|latency|map|replay|script|timehist}

    -D, --dump-raw-trace  dump raw trace in ASCII
    -f, --force           don't complain, do it
    -i, --input <file>    input file name
    -v, --verbose         be more verbose (show symbol address, etc)

perf sched latency will summarize scheduler latencies by task, including average and maximum delay:

# perf sched latency

 -----------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
 -----------------------------------------------------------------------------------------------------------------
  cat:(6)               |     12.002 ms |        6 | avg:   17.541 ms | max:   29.702 ms | max at: 991962.948070 s
  ar:17043              |      3.191 ms |        1 | avg:   13.638 ms | max:   13.638 ms | max at: 991963.048070 s
  rm:(10)               |     20.955 ms |       10 | avg:   11.212 ms | max:   19.598 ms | max at: 991963.404069 s
  objdump:(6)           |     35.870 ms |        8 | avg:   10.969 ms | max:   16.509 ms | max at: 991963.424443 s
  :17008:17008          |    462.213 ms |       50 | avg:   10.464 ms | max:   35.999 ms | max at: 991963.120069 s
  grep:(7)              |     21.655 ms |       11 | avg:    9.465 ms | max:   24.502 ms | max at: 991963.464082 s
  fixdep:(6)            |     81.066 ms |        8 | avg:    9.023 ms | max:   19.521 ms | max at: 991963.120068 s
  mv:(10)               |     30.249 ms |       14 | avg:    8.380 ms | max:   21.688 ms | max at: 991963.200073 s
  ld:(3)                |     14.353 ms |        6 | avg:    7.376 ms | max:   15.498 ms | max at: 991963.452070 s
  recordmcount:(7)      |     14.629 ms |        9 | avg:    7.155 ms | max:   18.964 ms | max at: 991963.292100 s
  svstat:17067          |      1.862 ms |        1 | avg:    6.142 ms | max:    6.142 ms | max at: 991963.280069 s
  cc1:(21)              |   6013.457 ms |     1138 | avg:    5.305 ms | max:   44.001 ms | max at: 991963.436070 s
  gcc:(18)              |     43.596 ms |       40 | avg:    3.905 ms | max:   26.994 ms | max at: 991963.380069 s
  ps:17073              |     27.158 ms |        4 | avg:    3.751 ms | max:    8.000 ms | max at: 991963.332070 s
[...]

To shed some light on how this is instrumented and calculated, I'll show the events that led to the top event's "Maximum delay at" of 29.702 ms. Here are the raw events from perf sched script:

      sh 17028 [001] 991962.918368:   sched:sched_wakeup_new: comm=sh pid=17030 prio=120 target_cpu=002
[...]
     cc1 16819 [002] 991962.948070:       sched:sched_switch: prev_comm=cc1 prev_pid=16819 prev_prio=120
                                                            prev_state=R ==> next_comm=sh next_pid=17030 next_prio=120
[...]

The time from the wakeup (991962.918368, which is in seconds) to the context switch (991962.948070) is 29.702 ms. This process is listed as "sh" (shell) in the raw events, but execs "cat" soon after, so is shown as "cat" in the perf sched latency output.

perf sched map shows all CPUs and context-switch events, with columns representing what each CPU was doing and when. It's the kind of data you see visualized in scheduler analysis GUIs (including perf timechart, with the layout rotated 90 degrees). Example output:

# perf sched map
                      *A0           991962.879971 secs A0 => perf:16999
                       A0     *B0   991962.880070 secs B0 => cc1:16863
          *C0          A0      B0   991962.880070 secs C0 => :17023:17023
  *D0      C0          A0      B0   991962.880078 secs D0 => ksoftirqd/0:6
   D0      C0 *E0      A0      B0   991962.880081 secs E0 => ksoftirqd/3:28
   D0      C0 *F0      A0      B0   991962.880093 secs F0 => :17022:17022
  *G0      C0  F0      A0      B0   991962.880108 secs G0 => :17016:17016
   G0      C0  F0     *H0      B0   991962.880256 secs H0 => migration/5:39
   G0      C0  F0     *I0      B0   991962.880276 secs I0 => perf:16984
   G0      C0  F0     *J0      B0   991962.880687 secs J0 => cc1:16996
   G0      C0 *K0      J0      B0   991962.881839 secs K0 => cc1:16945
   G0      C0  K0      J0 *L0  B0   991962.881841 secs L0 => :17020:17020
   G0      C0  K0      J0 *M0  B0   991962.882289 secs M0 => make:16637
   G0      C0  K0      J0 *N0  B0   991962.883102 secs N0 => make:16545
   G0     *O0  K0      J0  N0  B0   991962.883880 secs O0 => cc1:16819
   G0 *A0  O0  K0      J0  N0  B0   991962.884069 secs 
   G0  A0  O0  K0 *P0  J0  N0  B0   991962.884076 secs P0 => rcu_sched:7
   G0  A0  O0  K0 *Q0  J0  N0  B0   991962.884084 secs Q0 => cc1:16831
   G0  A0  O0  K0  Q0  J0 *R0  B0   991962.884843 secs R0 => cc1:16825
   G0 *S0  O0  K0  Q0  J0  R0  B0   991962.885636 secs S0 => cc1:16900
   G0  S0  O0 *T0  Q0  J0  R0  B0   991962.886893 secs T0 => :17014:17014
   G0  S0  O0 *K0  Q0  J0  R0  B0   991962.886917 secs 
[...]

This is an 8 CPU system, and you can see the 8 columns for each CPU starting from the left. Some CPU columns begin blank, as we've yet to trace an event on that CPU at the start of the profile. They quickly become populated.

The two-character codes you see ("A0", "C0") are identifiers for tasks, which are mapped on the right ("=>"). This is more compact than using process (task) IDs. The "*" shows which CPU had the context switch event, and the new task that began running. For example, the very last line shows that at 991962.886917 (seconds) CPU 4 context-switched to K0 (a "cc1" process, PID 16945).

That example was from a busy system. Here's an idle system:

# perf sched map
                      *A0           993552.887633 secs A0 => perf:26596
  *.                   A0           993552.887781 secs .  => swapper:0
   .                  *B0           993552.887843 secs B0 => migration/5:39
   .                  *.            993552.887858 secs 
   .                   .  *A0       993552.887861 secs 
   .                  *C0  A0       993552.887903 secs C0 => bash:26622
   .                  *.   A0       993552.888020 secs 
   .          *D0      .   A0       993552.888074 secs D0 => rcu_sched:7
   .          *.       .   A0       993552.888082 secs 
   .           .      *C0  A0       993552.888143 secs 
   .      *.   .       C0  A0       993552.888173 secs 
   .       .   .      *B0  A0       993552.888439 secs 
   .       .   .      *.   A0       993552.888454 secs 
   .      *C0  .       .   A0       993552.888457 secs 
   .       C0  .       .  *.        993552.889257 secs 
   .      *.   .       .   .        993552.889764 secs 
   .       .  *E0      .   .        993552.889767 secs E0 => bash:7902
[...]

Idle CPUs are shown as ".".

Remember to examine the timestamp column to make sense of this visualization (GUIs use that as a dimension, which is easier to comprehend, but here the numbers are just listed). It's also only showing context switch events, and not scheduler latency. The newer timehist command has a visualization (-V) that can include wakeup events.

perf sched timehist was added in Linux 4.10, and shows the scheduler latency by event, including the time the task was waiting to be woken up (wait time) and the scheduler latency after wakeup to running (sch delay). It's the scheduler latency that we're more interested in tuning. Example output:

# perf sched timehist
Samples do not have callchains.
           time    cpu  task name                       wait time  sch delay   run time
                        [tid/pid]                          (msec)     (msec)     (msec)
--------------- ------  ------------------------------  ---------  ---------  ---------
  991962.879971 [0005]  perf[16984]                         0.000      0.000      0.000 
  991962.880070 [0007]  :17008[17008]                       0.000      0.000      0.000 
  991962.880070 [0002]  cc1[16880]                          0.000      0.000      0.000 
  991962.880078 [0000]  cc1[16881]                          0.000      0.000      0.000 
  991962.880081 [0003]  cc1[16945]                          0.000      0.000      0.000 
  991962.880093 [0003]  ksoftirqd/3[28]                     0.000      0.007      0.012 
  991962.880108 [0000]  ksoftirqd/0[6]                      0.000      0.007      0.030 
  991962.880256 [0005]  perf[16999]                         0.000      0.005      0.285 
  991962.880276 [0005]  migration/5[39]                     0.000      0.007      0.019 
  991962.880687 [0005]  perf[16984]                         0.304      0.000      0.411 
  991962.881839 [0003]  cat[17022]                          0.000      0.000      1.746 
  991962.881841 [0006]  cc1[16825]                          0.000      0.000      0.000 
[...]
  991963.885740 [0001]  :17008[17008]                      25.613      0.000      0.057 
  991963.886009 [0001]  sleep[16999]                     1000.104      0.006      0.269 
  991963.886018 [0005]  cc1[17083]                         19.998      0.000      9.948 

This output includes the sleep command run to set the duration of perf itself to one second. Note that sleep's wait time is 1000.104 milliseconds because I had run "sleep 1": that's the time it was asleep waiting for its timer wakeup event. Its scheduler latency was only 0.006 milliseconds, and its time on-CPU was 0.269 milliseconds.

There are a number of options to timehist, including -V to add a CPU visualization column, -M to add migration events, and -w for wakeup events. For example:

# perf sched timehist -MVw
Samples do not have callchains.
           time    cpu  012345678  task name           wait time  sch delay   run time
                                   [tid/pid]              (msec)     (msec)     (msec)
--------------- ------  ---------  ------------------  ---------  ---------  ---------
  991962.879966 [0005]             perf[16984]                                          awakened: perf[16999]
  991962.879971 [0005]       s     perf[16984]             0.000      0.000      0.000                                 
  991962.880070 [0007]         s   :17008[17008]           0.000      0.000      0.000                                 
  991962.880070 [0002]    s        cc1[16880]              0.000      0.000      0.000                                 
  991962.880071 [0000]             cc1[16881]                                           awakened: ksoftirqd/0[6]
  991962.880073 [0003]             cc1[16945]                                           awakened: ksoftirqd/3[28]
  991962.880078 [0000]  s          cc1[16881]              0.000      0.000      0.000                                 
  991962.880081 [0003]     s       cc1[16945]              0.000      0.000      0.000                                 
  991962.880093 [0003]     s       ksoftirqd/3[28]         0.000      0.007      0.012                                 
  991962.880108 [0000]  s          ksoftirqd/0[6]          0.000      0.007      0.030                                 
  991962.880249 [0005]             perf[16999]                                          awakened: migration/5[39]
  991962.880256 [0005]       s     perf[16999]             0.000      0.005      0.285                                 
  991962.880264 [0005]        m      migration/5[39]                                      migrated: perf[16999] cpu 5 => 1
  991962.880276 [0005]       s     migration/5[39]         0.000      0.007      0.019                                 
  991962.880682 [0005]        m      perf[16984]                                          migrated: cc1[16996] cpu 0 => 5
  991962.880687 [0005]       s     perf[16984]             0.304      0.000      0.411                                 
  991962.881834 [0003]             cat[17022]                                           awakened: :17020
[...]
  991963.885734 [0001]             :17008[17008]                                        awakened: sleep[16999]
  991963.885740 [0001]   s         :17008[17008]          25.613      0.000      0.057                           
  991963.886005 [0001]             sleep[16999]                                         awakened: perf[16984]
  991963.886009 [0001]   s         sleep[16999]         1000.104      0.006      0.269
  991963.886018 [0005]       s     cc1[17083]             19.998      0.000      9.948 

The CPU visualization column ("012345678") has "s" for context-switch events, and "m" for migration events, showing the CPU of the event.

The last events in that output include those related to the "sleep 1" command used to time perf. The wakeup happened at 991963.885734, and at 991963.885740 (6 microseconds later) CPU 1 begins to context-switch to the sleep process. The column for that event still shows ":17008[17008]" for what was on-CPU, but the target of the context switch (sleep) is not shown. It is in the raw events:

  :17008 17008 [001] 991963.885740:       sched:sched_switch: prev_comm=cc1 prev_pid=17008 prev_prio=120
                                                             prev_state=R ==> next_comm=sleep next_pid=16999 next_prio=120

The 991963.886005 event shows that the perf command received a wakeup while sleep was running (almost certainly sleep waking up its parent process because it terminated), and then we have the context switch on 991963.886009 where sleep stops running, and a summary is printed out: 1000.104 ms waiting (the "sleep 1"), with 0.006 ms scheduler latency, and 0.269 ms of CPU runtime.

Here I've decorated the timehist output with the details of the context switch destination in red:

  991963.885734 [0001]             :17008[17008]                                        awakened: sleep[16999]
  991963.885740 [0001]   s         :17008[17008]          25.613      0.000      0.057  next: sleep[16999]
  991963.886005 [0001]             sleep[16999]                                         awakened: perf[16984]
  991963.886009 [0001]   s         sleep[16999]         1000.104      0.006      0.269  next: cc1[17008]
  991963.886018 [0005]       s     cc1[17083]             19.998      0.000      9.948  next: perf[16984]

When sleep finished, a waiting "cc1" process then executed. perf ran on the following context switch, and is the last event in the profile (perf terminated). I've submitted a patch to add this info when a -n option is used.

perf sched script dumps all events (similar to perf script):

# perf sched script
    perf 16984 [005] 991962.879960: sched:sched_stat_runtime: comm=perf pid=16984 runtime=3901506 [ns] vruntime=165...
    perf 16984 [005] 991962.879966:       sched:sched_wakeup: comm=perf pid=16999 prio=120 target_cpu=005
    perf 16984 [005] 991962.879971:       sched:sched_switch: prev_comm=perf prev_pid=16984 prev_prio=120 prev_stat...
    perf 16999 [005] 991962.880058: sched:sched_stat_runtime: comm=perf pid=16999 runtime=98309 [ns] vruntime=16405...
     cc1 16881 [000] 991962.880058: sched:sched_stat_runtime: comm=cc1 pid=16881 runtime=3999231 [ns] vruntime=7897...
  :17024 17024 [004] 991962.880058: sched:sched_stat_runtime: comm=cc1 pid=17024 runtime=3866637 [ns] vruntime=7810...
     cc1 16900 [001] 991962.880058: sched:sched_stat_runtime: comm=cc1 pid=16900 runtime=3006028 [ns] vruntime=7772...
     cc1 16825 [006] 991962.880058: sched:sched_stat_runtime: comm=cc1 pid=16825 runtime=3999423 [ns] vruntime=7876...
Each of these events ("sched:sched_stat_runtime" etc) is a tracepoint you can instrument directly using perf record. As I've shown earlier, this raw output can be useful for digging further than the summary commands.

That's it for now. Happy hunting.

[eBPF/bcc]: https://github.com/iovisor/bcc
[runqlat]: /blog/2016-10-08/linux-bcc-runqlat.html
[perf examples]: /perf.html#SchedulerAnalysis
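
As an aside (not part of the original post), any one of these tracepoints can also be recorded on its own with plain perf, which keeps the data volume down considerably. For example, capturing only context-switch events system-wide for one second and then dumping them:

# perf record -e sched:sched_switch -a -- sleep 1
# perf script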

Flying around Mount Monadnock Josef "Jeff" Sipek

Last week I planned on doing a nice cross country flight from Wikipedia article: Fitchburg. Inspired by Garrett Fisher’s photos, I took my camera and the 70-200mm lens with me hoping to get a couple of nice photos of the landscapes in New Hampshire.

Sadly, after taking off from KFIT I found out that not only was there the stiff wind that was forecasted (that’s fine) but the air was sufficiently bumpy that it wouldn’t have been a fun flight. On top of that, the ADS-B unit was having problems acquiring a GPS signal. (Supposedly, the firmware sometimes gets into a funny state like this. The good news is that there is a firmware update available that should address this.) I contacted KASH tower to check if they could see my transponder—they did, so I didn’t have to worry about being totally invisible.

Since I was already off the ground, I decided to do some nearby sightseeing, landing practice, and playing with the Garmin GNS 430 GPS.

First, I headed northwest toward Wikipedia article: Mount Monadnock. While I have seen it in the distance several times before, I never got to see it up close, so this seemed like a worthwhile destination.

As I approached it, I ended up taking out my camera and getting a couple of photos of the hills and mountains in New Hampshire. It was interesting how the view to the north (deeper into New Hampshire) is hilly, but the view more to the east (and certainly south) is flatter. (Both taken near Mount Monadnock.)

While the visibility was more than good enough for flying, it didn’t work out that well for photography. In all of the photos, the landscape far away ended up being heavily blue-tinted. No amount of playing around with white balance adjustment in Lightroom was able to correct it. (Either the background was too blue, or the foreground was too yellow/brown.) That’s why all of these photos are black and white.

I made a full turn around Monadnock, taking a number of shots but this one is my favorite:

Once done with Monadnock, I headed south to the Wikipedia article: Quabbin Reservoir in Massachusetts. This is a view toward the south from near its north end:

At this point I started heading to KORH to do some landing practice. Since I was plenty busy, there are no photos.

I’ve never been to this airport before and landing at new airports is always fun. The first interesting thing about it is that it is situated on a hill. While most airports around here are at 200-400 feet MSL, this one is at 1000 feet. The westerly wind favored runway 29 which meant I got to see a second interesting aspect of this airport. The beginning of runway 29 is on the edge of the hill. That by itself doesn’t sound very interesting, but consider that the runway is at 1000 feet while the bottom of the hill (a mere 0.9 km away) is at 500 feet MSL. That is approximately a 17% grade. So, as you approach the runway, at first it looks like you are way too high but the ground comes up significantly faster than normal.

I am still hoping to do my originally planned cross country flight at some point. Rest assured that I will blahg about it.

How old are illumos man pages? The Trouble with Tribbles...

I've recently been looking at improving the state of the illumos man pages.

One thing you'll notice is that the date on some of the manpages is old - really old, some of them are dated 1990. That presumably means that they haven't been modified in any meaningful sense for a quarter of a century.

(By date here I mean the date displayed by the man command. Which isn't necessarily the date somebody last touched it, but should correspond to the last meaningful change.)
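
One rough way to reproduce that distribution (a sketch of my own, assuming a checkout of illumos-gate with the man page sources under usr/src/man and GNU grep on the path) is to pull the year out of the .Dd/.TH date macros and count pages per year:

cd illumos-gate/usr/src/man
grep -rhE '^\.(Dd|TH) ' . | grep -oE '(19|20)[0-9]{2}' | sort | uniq -c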

The distribution of dates looks like this:



As you can see, the dates go all the way back to 1990. There's not just the odd one or two either; there are a decent number of man pages that come from the 1990s.

There are some obvious features in the chart above.

There's a noticeable spike in 1996, which of course significantly predates illumos or OpenSolaris. It's not entirely obvious why there should be a spike, but 100 of those man pages are related to libcurses.

I suspect the dip in 2005 is a result of the launch of Solaris 10, when everyone had a bit of a breather before development kicked into full gear again.

Then there's a drop in 2010. In fact, there's just a single page for 2010. That's when the OpenSolaris project closed, so there was little work done at that point. Also, our man pages were only integrated into illumos-gate in early 2011 (prior to which they were kept separate), and it's taken a while for man page updates to pick up again.

Of course, one reason the man pages are so old is that the software they're documenting is old. That's not necessarily a bad thing, if it ain't broke don't fix it as they say, but there is a certain amount of total rubbish that we ought to clear out.

Fedora dnf history blog'o'less

Sometimes it can happen that you remove a package and inadvertently you remove a bunch of dependencies. Like removing the entire desktop manager: what a mess, what a pain in the ass!
Sometimes dependencies are strange.

Rarely, but it can happen too: after an update something doesn't work anymore. Maybe the updated package is buggy? Better to wait for a fix.

By the way, if you install or uninstall packages by mistake, there is a way to roll back the changes: history.

sudo dnf history list

And you can see the dnf activities.

sudo dnf history info 132

You can see the details of the dnf command labeled as 132.

Let's say 132 is the number of the last action:

ID     | Command line             | Date and time    | Action(s)      | Altered
-------------------------------------------------------------------------------
   132 | update                   | 2017-03-09 16:06 | Update         |    8 

Issue the following command in order to uninstall these updates:

sudo dnf history undo 132

And you will get something like:

Undoing transaction 132, from Thu Mar  9 21:06:10 2017

No package libgdata-0:0.17.6-3.fc25.x86_64 available.
Error: An operation cannot be undone

Well. WTF!

It seems that, if you would like to have the option to roll back updates, you must enable the package cache beforehand with keepcache=1 in the /etc/dnf/dnf.conf configuration file. This will presumably use some disk space to store previous versions of the RPMs; I must investigate a bit deeper.
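
For reference, the change amounts to something like this in /etc/dnf/dnf.conf (the surrounding [main] options are just typical Fedora defaults and may differ on your system; keepcache=1 is the relevant addition):

[main]
gpgcheck=1
installonly_limit=3
clean_requirements_on_remove=True
keepcache=1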

References: 
https://bugzilla.redhat.com/show_bug.cgi?id=1218403
https://docs.fedoraproject.org/en-US/Fedora/25/html/System_Administrators_Guide/sec-DNF-Transaction_History.html

Get the most out of GNOME notifications blog'o'less



Learn how to get more out of your GNOME notifications on Fedora Workstation.
My first article on Fedora Magazine.

https://fedoramagazine.org/get-gnome-notifications/

Plane-spotting in Manchester, NH Josef "Jeff" Sipek

Last weekend I got to drive to Wikipedia article: Manchester, so I used the opportunity to kill some time near the airport by watching planes and taking photos (gallery).

The winds were coming from the south, so runway 17 was in use. I think those are the best plane spotting conditions at KMHT.

It is relatively easy to watch aircraft depart and fly directly overhead:

Unlike all my previous plane spotting, this time I tried something new—inspired by Mike Kelly’s Airportraits, I decided to try to make some composite images. Here is a Southwest Boeing 737 sporting one of the Wikipedia article: special liveries:

It was certainly an interesting experience.

At first I thought that I would be able to use the 7 frames/second that the D750 can do for the whole departure, but it turns out that the planes move far too slowly, so the camera buffer filled up way too soon and the frame rate became somewhat erratic. What mostly ended up working was switching to 3 frames/second and taking bursts. Next time, aiming for about 2 frames/second should give me enough images to work with.

Even though I used a tripod, I expected that I would have to align the images to remove the minor misalignment between images due to the vibration from the rather strong wind and my hand depressing the shutter. It turns out that the misalignment (of approximately 10 pixels) was minor enough that it did not change the final image.

Here’s an American Airlines commuter taking off from runway 17. (I repositioned to get a less head-on photo as well.)

For those curious, I post processed each of the images in Lightroom, exported them as TIFFs, and then used GIMP to do the layering and masking. Finally, I exported the final image and imported it back into Lightroom for safekeeping.

As a final treat, as I was packing up a US Army Gulfstream took off:

As far as I can tell, they use this one to transport VIPs. I wonder who was on board…

Oracle Database Appliance: random notes blog'o'less

How to create a shared repository

[root@oda11 ~]# oakcli create repo shared1 -dg data -size 50G

How to import a VM template

You must/can download a template from edelivery.oracle.com
You need to put the template or the assembly inside the DOM0, then import it from ODA_BASE.
[root@oda11 ~]# oakcli import vmtemplate prova -assembly /tmp/OVM_OL7U2_x86_64_PVHVM.ova -repo shared1 -node 0

How to create a VM

[root@oda11 ~]# oakcli clone vm pigna -vmtemplate prova -repo shared1 -node 0

How to create a vdisk

A vdisk that will be attached to the VM.
[root@oda11 ~]# oakcli create vdisk pignadiskl -repo shared1 -size 5G -type local

How to create a cpupool

The same pool name must be configured on each node.
[root@oda12 ~]# oakcli create cpupool provapool -numcpu 2 -node 1
[root@oda12 ~]# oakcli create cpupool provapool -numcpu 2 -node 0

How to configure the VM

Add two network interfaces

[root@oda11 ~]# oakcli configure vm pigna -network "['type=netfront,bridge=net1','type=netfront,bridge=net2']"

Enable failover

[root@oda11 ~]# oakcli configure vm pigna -failover true

Attach an additional disk

[root@oda11 ~]# oakcli modify vm pigna -attachvdisk pignadiskl

Configuring cpupool

[root@oda11 ~]# oakcli configure vm pigna -cpupool provapool

Starting the VM

[root@oda11 ~]# oakcli start vm pigna

Attaching to the console

You must ssh to the ODA_BASE using the -Y flag.
[root@oda11 ~]# oakcli show vmconsole pigna
A VNC session will start

Expanding the disk space from inside the VM

This is the vdisk created before.
[root@pigna ~]# btrfs device add /dev/xvdb /
You can attach additional vdisks, and the VM OS will see them without the need to reboot it.

Adding space to the repo

[root@oda11 ~]# oakcli configure repo shared1 -incsize 25G

Enterprise Information systems alp's notes

This year I'm going to read the "Enterprise Information systems" course for students of the Southern Federal University Institute for Advanced Technologies and Piezotechnics. The materials of this course will be appearing in this Microsoft Class Notebook Google docs link.

Flame Graphs vs Tree Maps vs Sunburst Brendan Gregg's Blog

Yesterday I posted about [flame graphs for file systems], showing how they can visualize where disk space is consumed. Many people have [responded], citing other tools and visualizations they prefer: du, ncdu, treemaps, and the sunburst layout. Since there's so much interest in this subject, I've visualized the same files here (the source for Linux 4.9-rc5) in different ways for comparison.

## Flame Graphs

Using [FlameGraph] (SVG):

While you can mouse-over and click to zoom, at first glance the long labeled rectangles tell the big picture, by comparing their lengths and looking at the longest first:

The drivers directory looks like it's over 50%, with drivers/net about 15% of the total. Many small rectangles are too thin to label, and they also matter less overall. You can imagine printing the flame graph on paper, or including a screen shot in a slide deck, and it will still convey many high level details in not much space. Here's an example someone just posted to twitter.

## Tree Map

Using [GrandPerspective] (on OSX):

What can you tell on first glance? Not those big picture details (drivers 50%, etc). You can mouse over tree map boxes to get more details, which this screenshot doesn't convey. It is, however, easier to see that there are a handful of large files with those boxes in the top left, which are under drivers/gpu/drm/amd. Using [Baobab] on Linux:

You can see that the drivers directory is large from the tree list on the left, which includes mini bar graphs for a visual line length comparison (good). You can't see into subdirectories without clicking to expand.

Here I've highlighted the drivers/net box. What percentage of the total is that at first glance? It's a little bit more difficult to compare sizes than lengths (compare to earlier). This is also missing labels, when compared to the flame graph, although other tree maps like [SpaceMonger] do have them. An advantage to all tree maps is that we can more easily use vertical space.

## Sunburst

Using [Baobab] on Linux:

This is a flame graph (which is an adjacency diagram with an inverted icicle layout), using polar coordinates. It is very pretty, and as someone said "it always wows". Sunbursts are the new pie chart.

Deeper slices exaggerate their size, and look visually larger. The problem is that this visualization requires the comparison of angles, instead of lengths, which has been evaluated as a more difficult perceptual task. The larger-looking slice I highlighted is 25.6 Mbytes, and the smaller-looking one is 27.8 Mbytes.

## ncdu
--- /home/bgregg/linux-4.9-rc5 -----------------------------------------------
                         /..          
  405.1 MiB [##########] /drivers
  139.1 MiB [###       ] /arch
   37.5 MiB [          ] /fs
   36.0 MiB [          ] /include
   35.8 MiB [          ] /Documentation
   32.6 MiB [          ] /sound
   27.8 MiB [          ] /net
   14.7 MiB [          ] /tools
    7.5 MiB [          ] /kernel
    6.0 MiB [          ] /firmware
    3.7 MiB [          ] /lib
    3.4 MiB [          ] /scripts
    3.3 MiB [          ] /mm
    3.2 MiB [          ] /crypto
    2.4 MiB [          ] /security
    1.1 MiB [          ] /block
  968.0 KiB [          ] /samples
[...]
This does have ASCII bar charts for line length comparisons, but it's only showing one directory level at a time.

## du
$ du -hs * | sort -hr
406M	drivers
140M	arch
38M	fs
36M	include
36M	Documentation
33M	sound
28M	net
15M	tools
7.5M	kernel
6.1M	firmware
3.7M	lib
3.5M	scripts
3.4M	mm
3.2M	crypto
2.4M	security
1.2M	block
968K	samples
[...]
Requires reading. Although this is so quick it's my usual starting point.

## Which to use

If you're designing a file system usage tool, which should you use? Ideally, I'd make flame graphs, tree maps, and sunbursts all available as different ways to understand the same data set. For the default view, I'd probably use the flame graph, but I'd want to check with many sample file systems to ensure it really works best with the data it's visualizing. For more about flame graphs see my ACMQ article [The Flame Graph] (CACM), and for more about different visualizations see the ACMQ article [A Tour through the Visualization Zoo] (CACM), by Jeffrey Heer, Michael Bostock, and Vadim Ogievetsky, especially the section on Hierarchies.

[flame graphs]: http://www.brendangregg.com/flamegraphs.html
[The Flame Graph]: http://queue.acm.org/detail.cfm?id=2927301
[A Tour through the Visualization Zoo]: http://queue.acm.org/detail.cfm?id=1805128
[files.pl]: https://raw.githubusercontent.com/brendangregg/FlameGraph/master/files.pl
[flamegraph.pl]: https://raw.githubusercontent.com/brendangregg/FlameGraph/master/flamegraph.pl
[FlameGraph]: https://github.com/brendangregg/FlameGraph
[SVG]: /blog/images/2017/files_linux49.svg
[GrandPerspective]: https://sourceforge.net/projects/grandperspectiv/?source=typ_redirect
[Baobab]: http://www.marzocca.net/linux/baobab/index.html
[previous post]: /blog/2017-02-05/file-system-flame-graph.html
[flame graphs for file systems]: /blog/2017-02-05/file-system-flame-graph.html
[responded]: https://news.ycombinator.com/item?id=13574825
[SpaceMonger]: http://www.aplusfreeware.com/categories/LFWV/SpaceMonger.html

Visiting Helsinki Josef "Jeff" Sipek

Back in July and August I got to visit Helsinki. Needless to say, I dragged my camera and lenses along and did some sightseeing. Helsinki is a relatively new but welcoming city.

July

My first trip there was in early July (7th-10th). This meant that I was there about two weeks after the summer solstice. At 60°10’ north, this has been the northernmost place I’ve ever been. (I’m not really counting the layover in Reykjavik at 63°59’ north, although I do have an interesting story about that for another time.) If you combine these two relatively boring facts (very far north and near solstice timing), you end up with nearly 19 hours of daylight! This gave me ample time to explore. Below are a couple of photos I took while there. There are more in the gallery.

Approaching Wikipedia article: Senate Square and the Wikipedia article: Helsinki Cathedral:

The cathedral:

Not far from this (Lutheran) cathedral is an Eastern Orthodox cathedral—Wikipedia article: Uspenski Cathedral.

And here is its interior:

Like a number of other cities in Europe, Helsinki is filled with bikes. Most sidewalks seem to be divided into two parts—one for walking and one for biking. The public transit seems to include bike rentals. These rental bikes are very…yellow.

Suomenlinna

On Saturday, July 9th, I took a ferry to the nearby sea fortress—Wikipedia article: Suomenlinna—where I spent the day.

Of course there is a (small) church there. (You can also see it in the above photo in the haze.) This one has a sea-fortress-inspired chain running around it.

The whole fortress is made up of six islands. This allows you to see some of the fortifications up close as well as at some distance.

There are plenty of small buildings of various types scattered around the islands. Some of them are still used as residences, while others got turned into a museum or some other public space.

August

The August trip was longer and consisted of more roaming around the city.

The Helsinki Cathedral in the distance.

There are a fair number of churches—here is the Wikipedia article: Kamppi Chapel.

Heading west of the city center (toward Wikipedia article: Länsisatama) one cannot miss the fact that Helsinki is a coastal city.

Finally, on the last day of my August trip I got to see some sea creatures right in front of the cathedral. They were made of various pieces of plastic. As far as I could tell, this art installation was about environmental awareness.

I took so long to finish writing this post that I’ve gotten to visit Helsinki again last month…but more about that in a separate post. Safe travels!

Where has my disk space gone? Flame graphs for file systems Brendan Gregg's Blog

My laptop was recently running low on available disk space, and it was a mystery as to why. I have different tools to explore the file system, including running the "find / -ls" command from a terminal, but they can be time consuming to use. I wanted a big picture view of space by directories, subdirectories, and so on. I've created a simple open source tool to do this, using flame graphs as the final visualization. To demonstrate, here's the space consumed by the Linux 4.9-rc5 source code. Click to zoom, and Ctrl-F to search ([SVG]):

If you are new to flame graphs, see my [flame graphs] page. In this case, width corresponds to total size. I created them for visualizing stack traces, but since they are a generic hierarchical visualization (technically an [adjacency diagram with an inverted icicle layout]), they are suited for the hierarchy of directories as well. I've also used this to diagnose a similar problem with a friend's laptop, which turned out to be due to a backup application consuming space in a directory completely unknown to them. The following sections show how to create one yourself. Start by opening a terminal session so you can use the command line.

### Using git

If you have the "git" command, you can fetch the [FlameGraph] repository and run the commands from it:
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./files.pl /Users | ./flamegraph.pl --hash --countname=bytes > out.svg
Then open out.svg in a browser. Change "/Users" to be the directory you want to visualize. This could be "/" for everything (provided you don't have removable storage or network file systems mounted; if you do, they would be included in the report by accident).

### Without git

If you don't have git, you can download the two Perl programs straight from github: [files.pl] and [flamegraph.pl], either using wget or by downloading them via your browser (save as). The steps can then be:
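# First download the two programs (an illustrative step; these are the raw
# URLs listed at the end of this post):
wget https://raw.githubusercontent.com/brendangregg/FlameGraph/master/files.pl
wget https://raw.githubusercontent.com/brendangregg/FlameGraph/master/flamegraph.pl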
chmod 755 files.pl flamegraph.pl
./files.pl /Users | ./flamegraph.pl --hash --countname=bytes > out.svg
Again, change "/Users" to be the directory you want to visualize, then open out.svg in a browser.

### Linux source example

For reference, the Linux source example I included above was created using:
files.pl linux-4.9-rc5 | flamegraph.pl --hash --countname=bytes \
    --title="Linux source tree by file size" --width=800 > files_linux49.svg
You can customize the flame graph using options:
$ ./flamegraph.pl -h
Option h is ambiguous (hash, height, help)
USAGE: ./flamegraph.pl [options] infile > outfile.svg

	--title       # change title text
	--width       # width of image (default 1200)
	--height      # height of each frame (default 16)
	--minwidth    # omit smaller functions (default 0.1 pixels)
	--fonttype    # font type (default "Verdana")
	--fontsize    # font size (default 12)
	--countname   # count type label (default "samples")
	--nametype    # name type label (default "Function:")
	--colors      # set color palette. choices are: hot (default), mem, io,
	              # wakeup, chain, java, js, perl, red, green, blue, aqua,
	              # yellow, purple, orange
	--hash        # colors are keyed by function name hash
	--cp          # use consistent palette (palette.map)
	--reverse     # generate stack-reversed flame graph
	--inverted    # icicle graph
	--negate      # switch differential hues (blue<->red)
	--help        # this message

	eg,
	./flamegraph.pl --title="Flame Graph: malloc()" trace.txt > graph.svg
You might need to know about the --minwidth option: rectangles thinner than this (1/10th of a pixel when zoomed out) will be elided, to conserve space in the SVG. But that can mean things are missing when you zoom in. If it's a problem, you can set minwidth to 0.

Update: see my follow-on post [Flame Graphs vs Tree Maps vs Sunburst].

[flame graphs]: http://www.brendangregg.com/flamegraphs.html
[adjacency diagram with an inverted icicle layout]: http://queue.acm.org/detail.cfm?id=2927301
[files.pl]: https://raw.githubusercontent.com/brendangregg/FlameGraph/master/files.pl
[flamegraph.pl]: https://raw.githubusercontent.com/brendangregg/FlameGraph/master/flamegraph.pl
[FlameGraph]: https://github.com/brendangregg/FlameGraph
[SVG]: /blog/images/2017/files_linux49.svg
[Flame Graphs vs Tree Maps vs Sunburst]: http://www.brendangregg.com/blog/2017-02-06/flamegraphs-vs-treemaps-vs-sunburst.html

Creating a Tribblix package repository The Trouble with Tribbles...

I've previously described how Tribblix packages are built.

The output of that process should be a zap file, which is an SVR4 package in filesystem format, zipped up. The file naming convention is

PKG_NAME.VERSION.zap

Where PKG_NAME is the SVR4 name of the package (you can define aliases to be more user-friendly), VERSION is obvious - but has to match the installed version as shown, for example, by 'pkginfo -x'.

You can install those packages directly. There's a little helper /usr/lib/zap/instzap that will automate the process of unpacking the zap file and running pkgadd on it, or you can do it by hand.
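
Doing it by hand amounts to roughly the following (a sketch; the package name and version are made up):

unzip TRIBfoo.1.0.zap
pkgadd -d . TRIBfoo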

You don't have to use my tooling. If you've got a scheme for building SVR4 packages already, then you could simply convert those.

However, what you would really want to do is stick the packages in a repository somewhere, so they can be accessed using the normal zap commands.

In Tribblix, a repository is just a web-accessible location that contains the zap files and minimal metadata. The metadata is the catalog and a list of aliases.

The aliases file contains lines with two fields separated by a vertical pipe "|". The first field is the friendly alias, the second the real name of the package. If you want multiple aliases, just add multiple lines to the file.
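
For example, a hypothetical entry mapping a friendly name onto the package used in the catalog example below would be:

abook|TRIBabook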

The catalog file contains lines with 5 fields separated by a vertical pipe "|". The first field is the package name, the second the current version, the third a space-separated list of packages this package depends on, the fourth the size of the file in bytes, and the fifth an MD5 checksum of the file. There's a trailing "|" in this case to terminate the line. The size and checksum are used to verify that the download was successful and didn't corrupt the file. If you want to be sure that it's actually the package it claims to be, then packages can also be signed.

Here's an example line in the catalog:
TRIBabook|0.5.6.0|TRIBreadline|66923|d98f77cfb3e92ee6495e3902cc46486f|
So the TRIBabook package has a current version of 0.5.6.0, depends on the TRIBreadline package, and has the given size and checksum. It's in the file called TRIBabook.0.5.6.0.zap.

The build repo has scripts that create the catalog and aliases files, which I use for convenience. They do make some assumptions about my package build pipeline, so it might be easier to manage the files by hand.

So, having got a nice place on the web that has your packages and a catalog and some aliases, how does Tribblix know to use it?

Assuming your repository is called myrepo, you need to add a file /etc/zap/repositories/myrepo.repo, containing something like the following (which is the configuration for the main tribblix repo in the milestone 18 release)

NAME=tribblix
DESC=Tribblix packages
URL=http://pkgs.tribblix.org/tribblix-m18/
SIGNED=tribblix

If you aren't signing packages (and how to add the keys to the client is an exercise for the reader) then omit the SIGNED line. The NAME in this case would be myrepo, the DESC is whatever you make it, and the URL points at the directory containing the files.

That's almost it. The other thing you need to do is add the repository to the list of repos that zap will search, which is held in the /etc/zap/repo.list file. By default, this contains

100 tribblix
200 illumos
300 oi

That's a simple list, and the number is a priority, lower numbers have higher priority - they're searched first.

(What the priority scheme means is that if you have a package with the same name in multiple repos, the one in the highest priority repo is the one that will be used. For example, the ssh packages used to come from illumos. As we've moved to openssh, all I had to do was put the new openssh packages into the tribblix repo with the same package names they had before and they were installed instead. Of course, you have to be careful that the replacement package delivers the correct functionality, especially if it's delivering shared libraries.)

If your package names are unique to you (for instance, if you name them MYPKGfoo rather than TRIBfoo) then the priority doesn't matter. If you're deliberately trying to override some of the built-in packages, then your repo has to be the highest priority one so it gets searched first.
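
For example, to put the hypothetical myrepo repository ahead of everything else, /etc/zap/repo.list would become:

50 myrepo
100 tribblix
200 illumos
300 oi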

If you then utter 'zap refresh' it should go and retrieve the catalog and aliases files and then be set up to use them.

And yes, it would be nice if zap had the facility to manage repos for you - that's planned, I just need to implement it.

Golang bcc/BPF Function Tracing Brendan Gregg's Blog

In this post I'll quickly investigate a new way to trace a Go program: dynamic tracing with Linux 4.x enhanced BPF (aka eBPF). If you search for Go and BPF, you'll find Go interfaces for using BPF (eg, [gobpf]). That's not what I'm exploring here: I'm using BPF to instrument a Go program for performance analysis and debugging. If you're new to BPF, I just summarized it at linux.conf.au a couple of weeks ago ([youtube], [slideshare]).

There are already a number of ways to debug and trace Go, including (and not limited to):

- [Debugging with gdb] and Go runtime support.
- The [go execution tracer] for high level execution and blocking events.
- GODEBUG with gctrace and schedtrace.

BPF tracing can do a lot more, but has its own pros and cons. I'll demonstrate, starting with a simple Go program, hello.go:

package main

import "fmt"

func main() {
        fmt.Println("Hello, BPF!")
}
I'll begin with a gccgo compilation, then do Go gc. (If you don't know the difference, try this [summary] by VonC: in short, gccgo can produce more optimized binaries, but for older versions of go.)

## gccgo Function Counting

Compiling:
$ gccgo -o hello hello.go
$ ./hello
Hello, BPF!
Now I'll use my [bcc] tool funccount to dynamically trace and count all Go library functions that begin with "fmt.", while I reran the hello program in another terminal session:
# funccount 'go:fmt.*'
Tracing 160 functions for "go:fmt.*"... Hit Ctrl-C to end.
^C
FUNC                                    COUNT
fmt..import                                 1
fmt.padString.pN7_fmt.fmt                   1
fmt.fmt_s.pN7_fmt.fmt                       1
fmt.WriteString.pN10_fmt.buffer             1
fmt.free.pN6_fmt.pp                         1
fmt.fmtString.pN6_fmt.pp                    1
fmt.doPrint.pN6_fmt.pp                      1
fmt.init.pN7_fmt.fmt                        1
fmt.printArg.pN6_fmt.pp                     1
fmt.WriteByte.pN10_fmt.buffer               1
fmt.Println                                 1
fmt.truncate.pN7_fmt.fmt                    1
fmt.Fprintln                                1
fmt.$nested1                                1
fmt.newPrinter                              1
fmt.clearflags.pN7_fmt.fmt                  2
Detaching...
Neat! The output contains the fmt.Println() called by the program, along with other calls. I didn't need to run Go under any special mode to do this, and I can walk up to an already running Go process and begin doing this instrumentation, without restarting it. So how does it even work?

- It uses [Linux uprobes: User-Level Dynamic Tracing], added in Linux 3.5. It overwrites instructions with a soft interrupt to kernel instrumentation, and reverses the process when tracing has ended.
- The gccgo compiled output has a standard symbol table for function lookup.
- In this case, I'm instrumenting libgo (there's an assumed "lib" before this "go:"), as gccgo emits a dynamically linked binary. libgo has the fmt package.
- uprobes can attach to already running processes, or as I did here, instrument a binary and catch all processes that use it.
- For efficiency, I'm frequency counting the function calls in kernel context, and only emitting the counts to user space.

To the system, the binary looks like this:
$ file hello
hello: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=4dc45f1eb023f44ddb32c15bbe0fb4f933e61815, not stripped
$ ls -lh hello
-rwxr-xr-x 1 bgregg root 29K Jan 12 21:18 hello
$ ldd hello
	linux-vdso.so.1 =>  (0x00007ffc4cb1a000)
	libgo.so.9 => /usr/lib/x86_64-linux-gnu/libgo.so.9 (0x00007f25f2407000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f25f21f1000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f25f1e27000)
	/lib64/ld-linux-x86-64.so.2 (0x0000560b44960000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f25f1c0a000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f25f1901000)
$ objdump -tT /usr/lib/x86_64-linux-gnu/libgo.so.9 | grep fmt.Println
0000000001221070 g     O .data.rel.ro	0000000000000008              fmt.Println$descriptor
0000000000978090 g     F .text	0000000000000075              fmt.Println
0000000001221070 g    DO .data.rel.ro	0000000000000008  Base        fmt.Println$descriptor
0000000000978090 g    DF .text	0000000000000075  Base        fmt.Println
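Before moving on, here's roughly what funccount is doing under the hood, as a minimal bcc/Python sketch (hypothetical names, not the real tool): attach a uprobe to fmt.Println in libgo, bump an in-kernel BPF map counter on every call, and only read the total from user space at the end. The libgo path is the one from the ldd output above.

#!/usr/bin/python
# count_println.py: hypothetical sketch of funccount-style in-kernel counting
from bcc import BPF
from time import sleep

bpf_text = """
#include <uapi/linux/ptrace.h>

BPF_HASH(counts, u64, u64);

int count_call(struct pt_regs *ctx) {
    u64 key = 0;              // one key here; funccount keys by probed function
    counts.increment(key);    // bump the counter in kernel context
    return 0;
}
"""

b = BPF(text=bpf_text)
# library path from the ldd output above; funccount's "go:" prefix resolves to the same library
b.attach_uprobe(name="/usr/lib/x86_64-linux-gnu/libgo.so.9", sym="fmt.Println",
                fn_name="count_call")

print("Tracing fmt.Println... Hit Ctrl-C to end.")
try:
    sleep(99999999)
except KeyboardInterrupt:
    pass
for _, v in b["counts"].items():
    print("fmt.Println calls: %d" % v.value)

Only the final total crosses the kernel/user boundary, which is why this approach has much lower overhead than emitting every call to user space.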
The symbol table looks a lot like a compiled C binary's, which you can instrument using many existing debuggers and tracers, including bcc/BPF. It's a lot easier to instrument than runtimes that compile on the fly, like Java and Node.js. The only hitch so far is that function names can contain non-standard characters, like "." in this example. funccount also has options like -p to match a PID, and -i to emit output every interval. It currently can only handle up to 1000 probes at a time, so "fmt.\*" was ok, but matching everything in libgo:
# funccount 'go:*'
maximum of 1000 probes allowed, attempted 21178
... doesn't work yet. Like many things in bcc/BPF, when this limitation becomes too much of a nuisance we'll find a way to fix it.

## Go gc Function Counting

Compiling using Go's gc compiler:
$ go build hello.go
$ ./hello
Hello, BPF!
Now counting the fmt functions:
# funccount '/home/bgregg/hello:fmt.*'
Tracing 78 functions for "/home/bgregg/hello:fmt.*"... Hit Ctrl-C to end.
^C
FUNC                                    COUNT
fmt.init.1                                  1
fmt.(*fmt).padString                        1
fmt.(*fmt).truncate                         1
fmt.(*fmt).fmt_s                            1
fmt.newPrinter                              1
fmt.(*pp).free                              1
fmt.Fprintln                                1
fmt.Println                                 1
fmt.(*pp).fmtString                         1
fmt.(*pp).printArg                          1
fmt.(*pp).doPrint                           1
fmt.glob.func1                              1
fmt.init                                    1
Detaching...
You can still trace fmt.Println(), but funccount is now finding it in the binary itself rather than in libgo. Here's why:
$ file hello
hello: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, not stripped
$ ls -lh hello
-rwxr-xr-x 1 bgregg root 2.2M Jan 12 05:16 hello
$ ldd hello
	not a dynamic executable
$ objdump -t hello | grep fmt.Println
000000000045a680 g     F .text	00000000000000e0 fmt.Println
It's a 2 Mbyte static binary that contains the function. Another difference is that the function names contain more unusual symbols: "\*", "(", etc., which I suspect will trip up other debuggers until they are fixed to handle them (like bcc's trace was).

## gccgo Function Tracing

Now I'll try Sasha Goldshtein's trace tool, also from [bcc], to see per-event invocations of a function. Back to gccgo, and I'll start with this simple program from the [go tour], functions.go:
package main

import "fmt"

func add(x int, y int) int {
	return x + y
}

func main() {
	fmt.Println(add(42, 13))
}
Now tracing the add() function:
# trace '/home/bgregg/functions:main.add'
PID    TID    COMM         FUNC             
14424  14424  functions    main.add  
... and with both its arguments:
# trace '/home/bgregg/functions:main.add "%d %d" arg1, arg2'
PID    TID    COMM         FUNC             -
14390  14390  functions    main.add         42 13
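For the curious: trace's arg1 and arg2 here come straight from the standard AMD64 argument registers. A minimal bcc/Python sketch of the same idea (hypothetical, not the real trace tool), for this gccgo-built binary:

#!/usr/bin/python
# trace_add_regs.py: hypothetical sketch; read main.add()'s two integer
# arguments from the AMD64 argument registers on function entry
from bcc import BPF

bpf_text = """
#include <uapi/linux/ptrace.h>

int trace_add(struct pt_regs *ctx) {
    long x = PT_REGS_PARM1(ctx);   // rdi on x86_64
    long y = PT_REGS_PARM2(ctx);   // rsi on x86_64
    bpf_trace_printk("main.add(%ld, %ld)\\n", x, y);
    return 0;
}
"""

b = BPF(text=bpf_text)
b.attach_uprobe(name="/home/bgregg/functions", sym="main.add", fn_name="trace_add")
b.trace_print()   # stream lines from the kernel trace pipe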
Awesome, the trace one-liner worked: both arguments are printed on the right. trace has other options (try -h), such as including timestamps and stack traces with the output.

## Go gc Function Tracing

Now the wheels start to come off the tracks... Same program, compiled with go build:
$ go build functions.go

# trace '/home/bgregg/functions:main.add "%d %d" arg1, arg2'
could not determine address of symbol main.add

$ objdump -t functions | grep main.add
$
No main.add()? Was it inlined? Disabling inlining:
$ go build -gcflags '-l' functions.go
$ objdump -t functions | grep main.add
0000000000401000 g     F .text	0000000000000020 main.add
Now it's back. Well that was easy. Tracing it and its arguments:
# trace '/home/bgregg/functions:main.add "%d %d" arg1, arg2'
PID    TID    COMM         FUNC             -
16061  16061  functions    main.add         536912504 16
That's wrong. The arguments should be 42 and 13, not 536912504 and 16. Taking a peek with gdb:
$ gdb ./functions
[...]
warning: File "/usr/share/go-1.6/src/runtime/runtime-gdb.py" auto-loading has been declined
 by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
[...]
(gdb) b main.add
Breakpoint 1 at 0x401000: file /home/bgregg/functions.go, line 6.
(gdb) r
Starting program: /home/bgregg/functions 
[New LWP 16082]
[New LWP 16083]
[New LWP 16084]

Thread 1 "functions" hit Breakpoint 1, main.add (x=42, y=13, ~r2=4300314240) at
 /home/bgregg/functions.go:6
6	        return x + y
(gdb) i r
rax            0xc820000180	859530330496
rbx            0x584ea0	5787296
rcx            0xc820000180	859530330496
rdx            0xc82005a048	859530698824
rsi            0x10	16
rdi            0xc82000a2a0	859530371744
rbp            0x0	0x0
rsp            0xc82003fed0	0xc82003fed0
r8             0x41	65
r9             0x41	65
r10            0x4d8ba0	5082016
r11            0x0	0
r12            0x10	16
r13            0x52a3c4	5415876
r14            0xa	10
r15            0x8	8
rip            0x401000	0x401000 
eflags         0x206	[ PF IF ]
cs             0xe033	57395
ss             0xe02b	57387
ds             0x0	0
es             0x0	0
fs             0x0	0
gs             0x0	0
I included the startup warning about runtime-gdb.py, since it's helpful: if I want to dig deeper into Go context, I'll want to fix or source that. Even without it, gdb has shown the arguments as the variables "x=42, y=13". I also dumped the registers to compare them to the x86_64 ABI, which is where bcc's trace reads arguments from. From the syscall(2) man page:
       arch/ABI      arg1  arg2  arg3  arg4  arg5  arg6  arg7  Notes
       ──────────────────────────────────────────────────────────────────
[...]
       x86_64        rdi   rsi   rdx   r10   r8    r9    -
42 and 13 don't appear in rdi or rsi, or in any of the other registers. The reason is that Go's gc compiler does not follow the standard [AMD64 ABI] function calling convention, which causes problems with this and other debuggers. (I've also heard this complained about before, coincidentally, by my former colleagues.) I guess Go needed to use a different ABI for return values, since it can return multiple values, so even if the entry arguments were standard we'd still run into differences. I've browsed the [Quick Guide to Go's Assembler] and the [Plan 9 assembly manual], and it looks like function arguments are passed on the stack. Here's our 42 and 13:
(gdb) x/3dg $rsp
0xc82003fed0:	4198477	42
0xc82003fee0:	13
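So at function entry the stack pointer holds the return address, with the first argument at SP+8 and the second at SP+16. Here's a minimal bcc/Python proof-of-concept sketch under that assumption (the offsets are just what we observed in gdb above, not a documented ABI), reading both arguments from the stack instead of from registers:

#!/usr/bin/python
# trace_add_stack.py: hypothetical sketch for the Go gc build; read main.add()'s
# arguments from the stack (SP+8 and SP+16) on function entry
from bcc import BPF

bpf_text = """
#include <uapi/linux/ptrace.h>

int trace_add(struct pt_regs *ctx) {
    long x = 0, y = 0;
    u64 sp = ctx->sp;                                   // x86_64 stack pointer
    bpf_probe_read(&x, sizeof(x), (void *)(sp + 8));    // first argument
    bpf_probe_read(&y, sizeof(y), (void *)(sp + 16));   // second argument
    bpf_trace_printk("main.add(%ld, %ld)\\n", x, y);
    return 0;
}
"""

b = BPF(text=bpf_text)
b.attach_uprobe(name="/home/bgregg/functions", sym="main.add", fn_name="trace_add")
b.trace_print()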
bcc's trace can dig these out the same way. As a proof of concept, I just hacked in a couple of new aliases, "go1" and "go2", for those entry arguments:
# trace '/home/bgregg/functions:main.add "%d %d" go1, go2'
PID    TID    COMM         FUNC             -
17555  17555  functions    main.add         42 13
Works. Hopefully by the time you read this post, I (or someone else) will have finished this work and added it to bcc's trace tool, preferably as "goarg1", "goarg2", etc.

## Interface Arguments

I was going to trace the string argument to fmt.Println() as another example, but its argument is actually an "interface". From Go's src/fmt/print.go:
func Println(a ...interface{}) (n int, err error) {
    return Fprintln(os.Stdout, a...)
}
With gdb you can dig out the string, eg, going back to the gccgo-compiled hello:
$ gdb ./hello
[...]
(gdb) b fmt.Println
Breakpoint 1 at 0x401c50
(gdb) r
Starting program: /home/bgregg/hello 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff449c700 (LWP 16836)]
[New Thread 0x7ffff3098700 (LWP 16837)]
[Switching to Thread 0x7ffff3098700 (LWP 16837)]

Thread 3 "hello" hit Breakpoint 1, fmt.Println (a=...) at ../../../src/libgo/go/fmt/print.go:263
263	../../../src/libgo/go/fmt/print.go: No such file or directory.
(gdb) p a
$1 = {__values = 0xc208000240, __count = 1, __capacity = 1}
(gdb) p a.__values
$18 = (struct {...} *) 0xc208000240
(gdb) p a.__values[0]
$20 = {__type_descriptor = 0x4037c0 <__go_tdn_string>, __object = 0xc208000210}
(gdb) x/s *0xc208000210
0x403483:	"Hello, BPF!"
So the string can be read (and I'm sure there's an easier way to do it with gdb, too). You could write a custom bcc/BPF program to dig this out, and we can add more aliases to bcc's trace program to deal with interface arguments.

## Function Latency (Update)

Here's a quick demo of function latency tracing:
# funclatency 'go:fmt.Println'
Tracing 1 functions for "go:fmt.Println"... Hit Ctrl-C to end.
^C

Function = fmt.Println [3041]
     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 0        |                                        |
      8192 -> 16383      : 27       |****************************************|
     16384 -> 32767      : 3        |****                                    |
Detaching...
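As an aside, here's roughly how funclatency works under the hood, as a minimal bcc/Python sketch (hypothetical, not the real tool): a uprobe records a start timestamp keyed by thread ID, and a uretprobe computes the delta and stores it in an in-kernel log2 histogram. The libgo path is the one from the ldd output earlier.

#!/usr/bin/python
# println_latency.py: hypothetical sketch of funclatency-style latency histograms
from bcc import BPF
from time import sleep

bpf_text = """
#include <uapi/linux/ptrace.h>

BPF_HASH(start, u32, u64);
BPF_HISTOGRAM(dist);

int trace_entry(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();   // lower 32 bits: thread ID
    u64 ts = bpf_ktime_get_ns();
    start.update(&tid, &ts);
    return 0;
}

int trace_return(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&tid);
    if (tsp == 0)
        return 0;                            // missed the entry probe
    dist.increment(bpf_log2l(bpf_ktime_get_ns() - *tsp));
    start.delete(&tid);
    return 0;
}
"""

b = BPF(text=bpf_text)
lib = "/usr/lib/x86_64-linux-gnu/libgo.so.9"
b.attach_uprobe(name=lib, sym="fmt.Println", fn_name="trace_entry")
b.attach_uretprobe(name=lib, sym="fmt.Println", fn_name="trace_return")

print("Tracing fmt.Println latency... Hit Ctrl-C to end.")
try:
    sleep(99999999)
except KeyboardInterrupt:
    pass
b["dist"].print_log2_hist("nsecs")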
The funclatency output above shows a histogram of latency (in nanoseconds) for fmt.Println(), which I was calling in a loop.

## Next Steps

I took a quick look at Golang with dynamic tracing and Linux enhanced BPF, via [bcc]'s funccount and trace tools, with some successes and some challenges. Counting function calls already works. Tracing function arguments also works when compiled with gccgo, whereas Go's gc compiler doesn't follow the standard ABI calling convention, so the tools need to be updated to support it. As a proof of concept I modified the bcc trace tool to show it could be done, but that feature needs to be coded properly and integrated. Processing interface objects and multiple return values will also be challenges; again, these are areas where we can improve the tools to make this easier, as well as add C macros for writing other custom Go observability tools. Hopefully there will be a follow-up post (not necessarily by me; feel free to take up the baton if this interests you) that shows improvements to bcc/BPF Go gc argument tracing, interfaces, and return values.

Another important tracing topic, which again could be a follow-up post, is stack traces. Thankfully, Go made frame pointer-based stack traces the default in 1.7.

Lastly, another important topic that could be a post by itself is tracing Go functions along with kernel context. BPF and bcc can instrument kernel functions as well as user space, and I can imagine new custom tools that combine information from both.

[Debugging with gdb]: https://golang.org/doc/gdb
[go execution tracer]: https://golang.org/pkg/runtime/trace/
[gcvis]: https://github.com/davecheney/gcvis
[gotrace]: https://github.com/divan/gotrace
[summary]: http://stackoverflow.com/a/25811505
[Linux uprobes: User-Level Dynamic Tracing]: /blog/2015-06-28/linux-ftrace-uprobe.html
[bcc]: https://github.com/iovisor/bcc
[go tour]: https://tour.golang.org/basics/4
[AMD64 ABI]: https://en.wikipedia.org/wiki/X86_calling_conventions#System_V_AMD64_ABI
[Quick Guide to Go's Assembler]: https://golang.org/doc/asm
[Plan 9 assembly manual]: https://9p.io/sys/doc/asm.html
[gobpf]: https://github.com/iovisor/gobpf
[youtube]: https://www.youtube.com/watch?v=JRFNIKUROPE
[slideshare]: http://www.slideshare.net/brendangregg/bpf-tracing-and-more