How to print ZFS filesystems ordered by space used blog'o'less

zfs get -o value,name -Hp used|sort -n
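The -p flag prints exact byte counts, which is what makes the numeric sort meaningful; the list comes out ascending, so the biggest consumers are at the bottom. A small variation (a sketch, assuming your zfs supports the -t type filter): restrict the output to filesystems and reverse the sort so the largest appear first.

# Largest filesystems first; exact byte values (-p) keep the numeric sort correct
zfs get -o value,name -Hp -t filesystem used | sort -rn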

rpmbuild random notes blog'o'less

sudo dnf install rpmdevtools

rpmdev-setuptree

~/rpmbuild/SRPMS/
~/rpmbuild/SPECS/
~/rpmbuild/SOURCES/
~/rpmbuild/RPMS/
~/rpmbuild/BUILD/
~/.rpmmacros

sudo dnf download --source package
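A hedged sketch of how these pieces fit together after downloading the source RPM above ("package" is a placeholder name; adjust for the real package):

rpm -ivh package-*.src.rpm                          # unpack into ~/rpmbuild/SOURCES and SPECS
sudo dnf builddep ~/rpmbuild/SPECS/package.spec     # install build dependencies (dnf-plugins-core)
rpmbuild -ba ~/rpmbuild/SPECS/package.spec          # build binary and source RPMs into RPMS/ and SRPMS/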

Working at Netflix 2017 Brendan Gregg's Blog

I've now worked at Netflix for over three years. Time flies! I previously wrote about Netflix in [2015] and [2016], and if you are interested in what it's like to work here, I already covered much in those posts. As before, no one at Netflix has asked me to write this, and this is my personal blog and not a company post. I'll start with some exciting news, describe what my job is really like, the culture and mission, and some work updates.

## 100 Million Subscribers!

When I joined Netflix in April 2014, we had over 40 million subscribers in 41 countries. We are now in 190 countries and just crossed [100 million subscribers]! It's been thrilling to be part of this and help Netflix scale. You might imagine that at some point we had a major scaling crisis, where it looked like we'd fail due to an architectural bottleneck, and engineers worked long nights and weekends to save Netflix from certain disaster. That'd make a great story, but it didn't happen. We're on the EC2 cloud, which has great scalability, and our own cloud architecture of microservices is also designed for scalability. During this time we did do plenty of hard work, rolling out new technologies and major microservice versions, and fixed many problems big and small. But there was no single crisis point. Instead, it has been a process of continual improvements by many engineers across the company.

## A Day in the Life (Performance Engineering)

What do I actually do all day? Most of my day is a 50/50 mixture of proactive projects and reactive performance analysis. The proactive projects usually take weeks or months, and are where I'm developing a new technology or helping other teams with performance analysis or evaluations. Most of these projects aren't public yet, and some of them involve working with other companies on unreleased products. My work with Linux is different in that it is mostly public, and includes my perf-tools and bcc/eBPF tracing tools. Another long-term project is Vector, our instance analysis tool, where I'm adding new performance analysis features. Getting frame pointer support in Java was another project I did a while ago. The reactive work can be for any performance problem that shows up, involving runtimes (Java, Node.js), Linux (and sometimes FreeBSD), or hypervisors (Xen, containers). Recently that's included:

- Debugging why perf profiling stopped working in recent Docker containers.
- Java core dump analysis for a crashing JVM.
- MSR analysis on an instance to show it was running at a lower clock rate.
- A latency outlier issue that happened every 15 minutes.
- Analyzing slab memory growth on an instance with containers.
- Getting flame graphs to work in a new environment.

Staff ask for help over chat, either to the perfeng chatroom or me directly, or they come visit my desk in F2. I'm also monitoring various chatrooms and metrics, and will jump in when needed. It's a good balance. Too much reactive work and you don't have time to build better tools and do general fireproofing. Too much proactive work and you can become disconnected from the current company pain points, and start building solutions to the problems of yesteryear. About one hour on average each day is meetings. Some of these are regular meetings: we have a team meeting once every two weeks where everyone discusses what they are working on, and I have a one-on-one with my manager once every two weeks. At a lower frequency, I have scheduled meetings with my manager's manager, and their manager.
All these manager meetings keep me informed of the current company needs, and help connect me to the right people and projects at Netflix. Once every two weeks, I summarize what I've been working on in a shared doc: the team's bi-weekly status. Then there are some random events that happen during the year. We have offsites, where we plan what to work on each quarter, and team building events. There are also unofficial recreational groups at Netflix, including movie clubs (for good movies, and for bad ones), a karaoke group (which includes some Hamilton fans), and various sports teams. I'm on the Netflix cricket team (if you're at Netflix and didn't know we exist, join the cricket chatroom). I also usually speak at some conferences each year.

## Culture

The biggest difference I've found working here is still the culture. We are empowered to do the right thing, and believe in "freedom and responsibility". This is documented in the Netflix [culture deck], and after three years I still find it true. The first seven slides point out that companies can have aspirational values, but the actual values differ:

The actual company values, as opposed to the nice-sounding values, are shown by who gets rewarded, promoted, or let go
Before joining Netflix, you're told to read it and see if this company is right for you. Then while working here, staff cite the culture deck in meetings for decision-making advice. It's not nice-sounding values that are printed in the lobby and then forgotten about. It's an ongoing influence in the day-to-day running of Netflix. Having it online also beats learning the culture through word of mouth or trial and error. I know people in tech who are burned out but stay in lousy jobs, assuming every workplace is just as terrible: jobs where there is little to no freedom, no responsibility or accountability, and where dumb office politics is the norm. I wish everyone could have a chance to work at a company like Netflix. Little to no bureaucracy. You can focus on engineering and getting stuff done, with awesome staff who will help you.

## Mission

I spoke about this in my 2015 post, but it's worth repeating: our mission is to improve how entertainment is consumed worldwide, by building a great product that people choose to buy. I've noticed a widespread cynicism about successful companies, especially US corporations, where it's assumed that they must be doing something shady to be really competitive. Like selling customer data, or making it difficult to terminate membership. It's been amazing and inspiring to see how Netflix operates, contrary to this belief. We don't do anything shady, and we're proud of that. We're an honest company.

## Work Updates

**SRE**: Last year I talked about my site reliability engineering (SRE) work. Since then, our CORE SRE team has grown and I'm no longer needed on the on-call rotation, so I'm back to focusing on performance work. My 18 months of SRE on-call provided many memories and valuable experiences, as well as a deeper understanding of SRE. I talked about what I learned in my [SREcon 2016] keynote, and how the aims and tools differ between performance engineering and SRE performance analysis. I miss the thrill of being paged and knowing I'm going to work with other awesome engineers and fix something _important_ in the next five minutes... or at least try to! If I miss this thrill too much, I can always jump into the CORE chatroom and help with production issues when they happen.

My new desk in building F
**Linux**: I've been contributing to profilers and tracers, and it's been satisfying to help fix these areas that I really care about. In the last three years I developed the ftrace-based [perf-tools] and used them to solve many problems, which I wrote about in [lwn.net] and spoke about at [LISA 2014]. I also worked with Alexei Starovoitov (now at Facebook) on enhanced BPF for tracing, and developed many [bcc tools] that use BPF. I spoke about these at Facebook's [Performance@Scale] event and other conferences. We're rolling out newer kernels now, and it's pretty exciting to use my bcc tools in production. For Linux, I've also done tuning, kernel analysis, [gdb], testing of [hist triggers], testing of some perf patches, and contributed a few trivial patches of my own.

**PMCs**: When I considered joining Netflix three years ago, I had two technical concerns: 1. No advanced Linux tracer, and 2. No PMC access in EC2. How am I going to do advanced analysis without these? The more I thought about it, the more I became interested in the challenge, which would be the biggest of my career. Three years later, I've helped solve both of these (as well as devise some workarounds along the way). Now we have [Linux 4.9 eBPF] and [The PMCs of EC2]. Thanks to everyone who helped.

**Team Changes**: Our team has grown a little, and we have a new manager, [Ed Hunter], who I worked for before at Sun Microsystems. It's great to be working with Ed again. Our prior manager, Coburn, was promoted.

## Summary

When I use an awesome technology, I feel compelled to post about it and share. In this case, it's an awesome company and culture. After three years, I still find Netflix an awesome place to work, and every day I look forward to what I'll work on next.

[culture deck]: http://www.slideshare.net/reed2001/culture-1798664
[2015]: http://www.brendangregg.com/blog/2015-01-20/working-at-netflix.html
[2016]: http://www.brendangregg.com/blog/2016-03-30/working-at-netflix-2016.html
[lwn.net]: http://lwn.net/Articles/608497/
[perf-tools]: https://github.com/brendangregg/perf-tools
[LISA 2014]: /blog/2015-03-17/usenix-lisa-2014-linux-ftrace-perf-tools.html
[bcc tools]: https://github.com/iovisor/bcc#tools
[Performance@Scale]: http://www.brendangregg.com/blog/2016-03-05/linux-bpf-superpowers.html
[linux.conf.au]: https://www.youtube.com/watch?v=JRFNIKUROPE
[SCALE15x]: https://www.youtube.com/watch?v=w8nFRoFJ6EQ
[hist triggers]: /blog/2016-06-08/linux-hist-triggers.html
[gdb]: /blog/2016-08-09/gdb-example-ncurses.html
[LISA 2016]: https://www.slideshare.net/brendangregg/linux-4x-tracing-tools-using-bpf-superpowers
[SREcon 2016]: /blog/2016-05-04/srecon2016-perf-checklists-for-sres.html
[Ed Hunter]: https://www.linkedin.com/in/edwhunter/
[100 million subscribers]: https://twitter.com/netflix/status/855545423276032000
[Linux 4.9 eBPF]: /blog/2016-10-27/dtrace-for-linux-2016.html
[The PMCs of EC2]: /blog/2017-05-04/the-pmcs-of-ec2.html

Container Performance Analysis at DockerCon 2017 Brendan Gregg's Blog

At DockerCon 2017 I gave a talk on Linux container performance analysis, where I showed how to identify three types of performance bottlenecks in a container environment:

  1. In the host vs container, using system metrics.
  2. In application code in containers, using CPU flame graphs.
  3. Deeper in the kernel, using tracing tools.

The talk video is on [youtube] (42 mins):

And the slides are on [slideshare]:
This talk was a tour of container performance analysis on Linux. I included a quick summary of the necessary background, cgroups and namespaces, as well as analysis methodologies, before digging into the actual tools and metrics. An overall takeaway is to know what is possible, not necessarily learning each tool in detail, as you can look them up later when necessary. I included many performance analysis tools: basics such as top, htop, mpstat, pidstat, free, iostat, sar, perf, and flame graphs; container-aware tools and metrics including systemd-cgtop, docker stats, /proc, /sys/fs/cgroup, nsenter, Netflix Vector, and Intel snap; and advanced tracing-based tools including iosnoop, zfsslower, btrfsdist, funccount, runqlat, and stackcount.

## Reverse Diagnosis

I'm a fan of performance analysis methodologies, and I discussed how my [USE method] can be applied to container resource controls. But some controls, like CPU shares and disk I/O weights, get tricky to analyze. How do you know if a container is currently throttled by its share value, vs the system? To make sense of this, I came up with a reverse diagnosis approach: starting with a list of all possible outcomes, and then working backwards to see what metrics are required to identify one of the outcomes. I summarized it for CPU analysis with this flow chart:
The first step refers to /sys/fs/cgroup/.../cpu.stat -> throttled_time, which indicates when a cgroup (container) is throttled by its hard cap (eg, capped at 2 CPUs). Since that's a straightforward metric, we check it first to take that outcome off the operating table, and continue. See the talk for more details, where I also included a few scenarios beforehand to see if the audience could identify the bottleneck. Try it yourself: it's hard (then try it with the above flow chart!). This may become easier over time as more metrics are added to diagnose states, and time in states, so also check for updates to cgroup metrics in the kernel.

## Netflix Titus

The environment I've been analyzing is Netflix Titus, which I summarized at the start of the talk. It was covered in a post published just before my talk: [The Evolution of Container Usage at Netflix]. DockerCon was fun, and a big event: 6,000 attendees. My talk won a "top speaker" [award], which also meant I delivered it a second time for those who didn't catch the first one. Thanks to the Docker staff for putting on a great conference, and to everyone for attending my talk.

[youtube]: https://www.youtube.com/watch?v=bK9A5ODIgac
[slideshare]: https://www.slideshare.net/brendangregg/container-performance-analysis
[The Evolution of Container Usage at Netflix]: https://medium.com/netflix-techblog/the-evolution-of-container-usage-at-netflix-3abfc096781b
[USE method]: /usemethod.html
[award]: https://twitter.com/brendangregg/status/854827187270242304

CPU Utilization is Wrong Brendan Gregg's Blog

The metric we all use for CPU utilization is deeply misleading, and getting worse every year. What is CPU utilization? How busy your processors are? No, that's not what it measures. Yes, I'm talking about the "%CPU" metric used *everywhere*, by *everyone*. In every performance monitoring product. In top(1). What you may think 90% CPU utilization means:

What it might really mean:
Stalled means the processor was not making forward progress with instructions, and usually happens because it is waiting on memory I/O. The ratio I drew above (between busy and stalled) is what I typically see in production. Chances are, you're mostly stalled, but don't know it. What does this mean for you? Understanding how much your CPUs are stalled can direct performance tuning efforts between reducing code or reducing memory I/O. Anyone looking at CPU performance, especially on clouds that auto scale based on CPU, would benefit from knowing the stalled component of their %CPU.

## What really is CPU Utilization?

The metric we call CPU utilization is really "non-idle time": the time the CPU was not running the idle thread. Your operating system kernel (whatever it is) usually tracks this during context switch. If a non-idle thread begins running, then stops 100 milliseconds later, the kernel considers that CPU utilized that entire time. This metric is as old as time sharing systems. The Apollo Lunar Module guidance computer (a pioneering time sharing system) called its idle thread the "DUMMY JOB", and engineers tracked cycles running it vs real tasks as an important computer utilization metric. (I wrote about this [before].) So what's wrong with this? Nowadays, CPUs have become much faster than main memory, and waiting on memory dominates what is still called "CPU utilization". When you see high %CPU in top(1), you might think of the processor as being the bottleneck – the CPU package under the heat sink and fan – when it's really those banks of DRAM. This has been getting worse. For a long time processor manufacturers were scaling their clock speed quicker than DRAM was scaling its access latency (the "CPU DRAM gap"). That levelled out around 2005 with 3 GHz processors, and since then processors have scaled using more cores and hyperthreads, plus multi-socket configurations, all putting more demand on the memory subsystem. Processor manufacturers have tried to reduce this memory bottleneck with larger and smarter CPU caches, and faster memory busses and interconnects. But we're still usually stalled.

## How to tell what the CPUs are really doing

By using Performance Monitoring Counters (PMCs): hardware counters that can be read using [Linux perf], and other tools. For example, measuring the entire system for 10 seconds:
# perf stat -a -- sleep 10

 Performance counter stats for 'system wide':

     641398.723351      task-clock (msec)         #   64.116 CPUs utilized            (100.00%)
           379,651      context-switches          #    0.592 K/sec                    (100.00%)
            51,546      cpu-migrations            #    0.080 K/sec                    (100.00%)
        13,423,039      page-faults               #    0.021 M/sec                  
 1,433,972,173,374      cycles                    #    2.236 GHz                      (75.02%)
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
 1,118,336,816,068      instructions              #    0.78  insns per cycle          (75.01%)
   249,644,142,804      branches                  #  389.218 M/sec                    (75.01%)
     7,791,449,769      branch-misses             #    3.12% of all branches          (75.01%)

      10.003794539 seconds time elapsed
The key metric here is **instructions per cycle** (insns per cycle: IPC), which shows on average how many instructions were completed for each CPU clock cycle. The higher, the better (a simplification). The above example of 0.78 sounds not bad (78% busy?) until you realize that this processor's top speed is an IPC of 4.0. This is also known as *4-wide*, referring to the instruction fetch/decode path. Which means, the CPU can retire (complete) four instructions with every clock cycle. So an IPC of 0.78 on a 4-wide system means the CPUs are running at 19.5% of their top speed. The new Intel Skylake processors are 5-wide. There are hundreds more PMCs you can use to dig further: measuring stalled cycles directly by different types.

### In the cloud

If you are in a virtual environment, you might not have access to PMCs, depending on whether the hypervisor supports them for guests. I recently posted about [The PMCs of EC2: Measuring IPC], showing how PMCs are now available for dedicated host types on the AWS EC2 Xen-based cloud.

## Interpretation and actionable items

If your **IPC is < 1.0**, you are likely memory stalled, and software tuning strategies include reducing memory I/O, and improving CPU caching and memory locality, especially on NUMA systems. Hardware tuning includes using processors with larger CPU caches, and faster memory, busses, and interconnects. If your **IPC is > 1.0**, you are likely instruction bound. Look for ways to reduce code execution: eliminate unnecessary work, cache operations, etc. [CPU flame graphs] are a great tool for this investigation. For hardware tuning, try a faster clock rate, and more cores/hyperthreads. For my above rules, I split on an IPC of 1.0. Where did I get that from? I made it up, based on my prior work with PMCs. Here's how you can get a value that's custom for your system and runtime: write two dummy workloads, one that is CPU bound, and one memory bound. Measure their IPC, then calculate their mid point.

## What performance monitoring products should tell you

Every performance tool should show IPC along with %CPU. Or break down %CPU into instruction-retired cycles vs stalled cycles, eg, %INS and %STL. As for top(1), there is tiptop(1) for Linux, which shows IPC by process:
tiptop -                  [root]
Tasks:  96 total,   3 displayed                               screen  0: default

  PID [ %CPU] %SYS    P   Mcycle   Minstr   IPC  %MISS  %BMIS  %BUS COMMAND
 3897   35.3  28.5    4   274.06   178.23  0.65   0.06   0.00   0.0 java
 1319+   5.5   2.6    6    87.32   125.55  1.44   0.34   0.26   0.0 nm-applet
  900    0.9   0.0    6    25.91    55.55  2.14   0.12   0.21   0.0 dbus-daemo
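To break %CPU into instruction cycles vs stall cycles yourself, perf(1) can count the stall-cycle events directly where the hardware and hypervisor expose them (a sketch; on many virtualized instances these show up as <not supported>, as in the perf output above):

# Count stall cycles alongside instructions and cycles, system-wide for 10 seconds
perf stat -e cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend -a -- sleep 10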
## Other reasons CPU utilization is misleading

It's not just memory stall cycles that make CPU utilization misleading. Other factors include:

- Temperature trips stalling the processor.
- Turboboost varying the clockrate.
- The kernel varying the clock rate with speed step.
- The problem with averages: 80% utilized over 1 minute, hiding bursts of 100%.
- Spin locks: the CPU is utilized, and has high IPC, but the app is not making logical forward progress.

## Update: is CPU utilization actually wrong?

There have been hundreds of comments on this post, here (below) and elsewhere ([1], [2]). Thanks to everyone for taking the time and the interest in this topic. To summarize my responses: I'm not talking about iowait at all (that's disk I/O), and there are actionable items if you know you are memory bound (see above). But is CPU utilization actually wrong, or just deeply misleading? I think many people interpret high %CPU to mean that the processing unit is the bottleneck, which is wrong (as I said earlier). At that point you don't yet know, and it is often an external component. Is the metric technically correct? If the CPU stall cycles can't be used by anything else, aren't they therefore "utilized waiting" (which sounds like an oxymoron)? In some cases, yes, you could say that %CPU is technically correct, but deeply misleading. With hyperthreads, however, those stalled cycles can now be used by another thread, so %CPU may count cycles as "utilized" that are in fact available. That's wrong. In this post I wanted to focus on the interpretation problem and suggested solutions, but yes, there are technical problems with this metric as well. You might just say that utilization as a metric was already broken, as Adrian Cockcroft discussed [previously].

## Conclusion

CPU utilization has become a deeply misleading metric: it includes cycles waiting on main memory, which can dominate modern workloads. You can figure out what %CPU really means by using additional metrics, including instructions per cycle (IPC). An IPC < 1.0 likely means memory bound, and an IPC > 1.0 likely means instruction bound. I covered IPC in my [previous post], including an introduction to the Performance Monitoring Counters (PMCs) needed to measure it. Performance monitoring products that show %CPU – which is all of them – should also show PMC metrics to explain what that means, and not mislead the end user. For example, they can show %CPU with IPC, and/or instruction-retired cycles vs stalled cycles. Armed with these metrics, developers and operators can choose how to better tune their applications and systems.

[UnixBench]: https://code.google.com/p/byte-unixbench/
[before]: http://www.brendangregg.com/usemethod.html#Apollo
[Linux perf]: /perf.html
[The PMCs of EC2: Measuring IPC]: /blog/2017-05-04/the-pmcs-of-ec2.html
[previous post]: /blog/2017-05-04/the-pmcs-of-ec2.html
[CPU flame graphs]: http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
[1]: https://news.ycombinator.com/item?id=14301739
[2]: https://www.reddit.com/r/programming/comments/6a6v8g/cpu_utilization_is_wrong/
[previously]: http://www.hpts.ws/papers/2007/Cockcroft_HPTS-Useless.pdf

The PMCs of EC2: Measuring IPC Brendan Gregg's Blog


IPC and LLC loads with a scaling workload
Performance Monitoring Counters (PMCs) are now publicly available from dedicated host types in the AWS EC2 cloud. PMC nerds worldwide rejoice! (All six of us.) There should be more of us in the future, as with the increasing scale of processors and speed of storage devices, the common bottleneck is moving from disks to the memory subsystem: CPU caches, the MMU, memory busses, and CPU interconnects. These can only be analyzed with PMCs.
Memory is the new disk.
If PMCs are new to you, then in a nutshell they are special hardware counters that can be accessed via processor registers, and enabled and read via certain instructions. PMCs provide low-level CPU performance statistics that aren't available anywhere else. In this post I'll summarize the PMCs available in EC2, which are for dedicated hosts only (eg, m4.16xl, i3.16xl), and I'll demonstrate measuring IPC. Note that PMCs are also known as HPCs (hardware performance counters), and other names as well.

### EC2 Dedicated Host PMCs

The PMCs available are the architectural PMCs listed in the [Intel 64 and IA-32 Architectures Developer's Manual: vol. 3B], in section 18.2.1.2 "Pre-defined Architectural Performance Events", Table 18-1 "UMask and Event Select Encodings for Pre-Defined Architectural Performance Events". I've drawn my own table of them below with example event mnemonics.

**Architectural PMCs**
| Event Name | UMask | Event Select | Example Event Mask Mnemonic |
|---|---|---|---|
| UnHalted Core Cycles | 00H | 3CH | CPU_CLK_UNHALTED.THREAD_P |
| Instruction Retired | 00H | C0H | INST_RETIRED.ANY_P |
| UnHalted Reference Cycles | 01H | 3CH | CPU_CLK_THREAD_UNHALTED.REF_XCLK |
| LLC Reference | 4FH | 2EH | LONGEST_LAT_CACHE.REFERENCE |
| LLC Misses | 41H | 2EH | LONGEST_LAT_CACHE.MISS |
| Branch Instruction Retired | 00H | C4H | BR_INST_RETIRED.ALL_BRANCHES |
| Branch Misses Retired | 00H | C5H | BR_MISP_RETIRED.ALL_BRANCHES |
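Where a counter has no named alias on your kernel, perf(1) can count it via a raw event code of the form rUUEE, combining the UMask and Event Select bytes from the table above (a sketch, assuming an Intel processor and a perf build that accepts raw events):

# LLC references (umask 4FH, event 2EH) and LLC misses (umask 41H, event 2EH) as raw events
perf stat -e r4f2e,r412e -a -- sleep 10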
What's so special about these seven architectural PMCs? They give you a good overview of key CPU behavior, sure. But Intel have also chosen them as a golden set, to be highlighted first in the PMC manual and their presence exposed via the CPUID instruction. Note that the Intel mnemonic for LLC here is "longest latency cache", but this is also known as "last level cache" or "level 3 cache" (assuming it's L3).

### PMC Usage

Before I demonstrate PMCs, it's important to know that there are two very different ways they can be used:

- **Counting**: where they provide a count over an interval.
- **Sampling**: where, based on a number of events, an interrupt can be triggered to sample the program counter or stack trace.

Counting is cheap. Sampling costs more overhead based on the rate of the interrupts (which can be tuned by changing the event trigger threshold), and whether you're reading the PC or the whole stack trace. I'll demonstrate PMCs by using counting to measure IPC.

## Measuring IPC

Instructions-per-cycle (IPC) is a good starting point for PMC analysis, and is measured by counting the instruction count and cycle count PMCs. (On some systems it is shown as its inverse, cycles-per-instruction, CPI.) IPC is like miles-per-gallon for CPUs: how much bang for your buck. The resource here isn't gallons of gasoline but CPU cycles, and the result isn't miles traveled but instructions retired (ie, completed). The more instructions you can complete with your fixed cycles resource, the better. In the interest of keeping this short, I'll gloss over IPC caveats. There are situations where it can be misleading, like an increase of IPC because your program suffers more spin lock contention, and those spin instructions happen to be very fast. Just like MPG can be misleading, as it can be influenced by the route driven, not just the car's own characteristics. I'll use the Linux [perf] command to measure IPC of a program, noploop, which loops over a series of NOP instructions (no op):
# perf stat ./noploop
^C./noploop: Interrupt

 Performance counter stats for './noploop':

       2418.149339      task-clock (msec)         #    1.000 CPUs utilized          
                 3      context-switches          #    0.001 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                39      page-faults               #    0.016 K/sec                  
     6,245,387,593      cycles                    #    2.583 GHz                      (75.03%)
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
    24,766,697,057      instructions              #    3.97  insns per cycle          (75.02%)
        14,738,991      branches                  #    6.095 M/sec                    (75.02%)
            24,744      branch-misses             #    0.17% of all branches          (75.04%)

       2.418826663 seconds time elapsed
I've highlighted IPC ("insns per cycle") in the output. I like noploop as a sanity test. Because this processor is 4-wide (instruction prefetch/decode width), it can process a maximum of 4 instructions with every CPU cycle. Since NOPs are the fastest possible instruction (they do nothing), they can be retired at an IPC rate of 4.0. This goes down to 3.97 with a little loop logic (the program is looping over a block of NOPs). The "<not supported>" metrics are cases where the PMC is not currently available (they are outside of the architectural set, in this case). You can also measure the entire system, using perf with -a. This time I'm measuring a software build:
# perf stat -a -- sleep 10

 Performance counter stats for 'system wide':

     641398.723351      task-clock (msec)         #   64.116 CPUs utilized            (100.00%)
           379,651      context-switches          #    0.592 K/sec                    (100.00%)
            51,546      cpu-migrations            #    0.080 K/sec                    (100.00%)
        13,423,039      page-faults               #    0.021 M/sec                  
 1,433,972,173,374      cycles                    #    2.236 GHz                      (75.02%)
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
 1,118,336,816,068      instructions              #    0.78  insns per cycle          (75.01%)
   249,644,142,804      branches                  #  389.218 M/sec                    (75.01%)
     7,791,449,769      branch-misses             #    3.12% of all branches          (75.01%)

      10.003794539 seconds time elapsed
That's reporting an IPC of 0.78. perf can also print statistics over time (-I), but the output becomes verbose. I've written a quick wrapper to clean this up and summarize the architectural PMCs on a single line. It's [pmcarch] (first version):
# pmcarch 1
CYCLES        INSTRUCTIONS    IPC BR_RETIRED   BR_MISPRED  BMR% LLCREF      LLCMISS     LLC%
90755342002   64236243785    0.71 11760496978  174052359   1.48 1542464817  360223840  76.65
75815614312   59253317973    0.78 10665897008  158100874   1.48 1361315177  286800304  78.93
65164313496   53307631673    0.82 9538082731   137444723   1.44 1272163733  268851404  78.87
90820303023   70649824946    0.78 12672090735  181324730   1.43 1685112288  343977678  79.59
76341787799   50830491037    0.67 10542795714  143936677   1.37 1204703117  279162683  76.83
[...]
This is from a production instance on EC2, and each line of output is a one second summary.

### Interpreting IPC

For real-world applications, here's how I'd interpret the IPC:

- **IPC < 1**: likely stall cycle bound, also likely memory bound (more PMCs can confirm). Stall cycles are when the CPU isn't making forward progress, likely because it's waiting on memory I/O. In this case, look to tune memory usage: allocate fewer or smaller objects, do zero copy, look at NUMA and memory placement tuning. A [CPU flame graph] will show which code is on-CPU during these stall cycles, and should give clues for where to look for memory usage.
- **IPC > 1**: likely instruction bound. Look to tune instructions: a [CPU flame graph] will show which code is on-CPU doing instructions: find ways to reduce executed code.

You can combine IPC and flame graphs to show everything at once: [CPI flame graphs] (CPI is IPC inverted). This requires using the sampling mode of PMCs to capture stack traces on overflow events. There are, however, caveats with doing this which I'll get to in another post. Note that I'm using these on a modern Linux kernel, 4.4+. There was a problem on older kernels (3.x) where PMCs would be measured incorrectly, leading to a bogus IPC measurement.

## RxNetty Study

In 2015 I found PMCs crucial in fully understanding the performance differences between RxNetty and Tomcat as they scaled with client load. Tomcat serves requests using threads for each connection, whereas RxNetty uses event loop threads. Between low and high client counts, Tomcat's CPU cycles per request was largely unchanged, whereas RxNetty became *more* efficient and consumed *less* CPU per request as clients increased. Can you guess why?


IPC and CPU/req as load scales

Click for a slide deck where I explain why on slides 25-27 (these slides are from the [WSPerfLab] repository, summarizing a study by myself, Nitesh Kant, and Ben Christensen). We knew that we had a 46% higher request rate, and so we began a study to identify and quantify the reasons why. There was 5% caused by X, and 3% caused by Y, and so on. But after weeks of study, we fell short: over 10% of that 46% remained unexplained. I checked and rechecked our numbers, but fell short every time. It was driving me nuts, and casting doubt on everything we'd found so far. With PMCs I was able to identify this last performance difference, and the numbers finally added up! We could break down the 46% difference and explain every percentage point. It was very satisfying. It also emphasized the importance of PMCs: understanding CPU differences is a common task in our industry, and without PMCs you're always missing an important part of the puzzle. This study was done on a physical machine, not EC2, where I'd measured and studied dozens of PMCs. But the crucial PMCs I included in that slide deck summary were the measurements of IPC and the LLC, which are possible with the architectural PMCs now available in EC2.

## How is this even possible in the cloud?

You might be wondering how cloud guests can read PMCs at all. It works like this: PMCs are managed via the privileged instructions RDMSR and WRMSR for configuration (which I wrote about in [The MSRs of EC2]), and RDPMC for reading. A privileged instruction causes a guest exit, which is handled by the hypervisor. The hypervisor can then run its own code, and configure PMCs if the actual hardware allows, and save and restore their state whenever it context switches between guests. Mainstream Xen supported this years ago, with its virtual Performance Monitoring Unit (vPMU). It is configured using vpmu=on in the Xen boot line. However, it is rarely turned on. Why? There are hundreds of PMCs, and they are all exposed with vpmu=on. Could some pose a security risk? A number of papers have been published showing PMC side-channel attacks, whereby measuring certain PMCs while sending input to a known target program can eventually leak bits of the target's state. While these are unlikely in practice, and such attacks aren't limited to PMCs (eg, there are also timing attacks), you can understand a paranoid security policy not wanting to enable all PMCs by default. In the cloud it's even harder to do these attacks, as when Xen context switches between guests it switches out the PMCs as well. But still, why enable all the PMCs if they aren't all needed? Imagine if we could create a whitelist of allowed PMCs for secure environments. This is a question I pondered in late 2015, and I ended up contributing the [x86/VPMU: implement ipc and arch filter flags] patch to Xen to provide two whitelist sets as options, chosen with the vpmu boot flag:

- **ipc**: Enough PMCs to measure IPC only. Minimum set.
- **arch**: The seven architectural PMCs (see table above). Includes IPC.

More sets can be added. For example, I can imagine an extended set to allow some Intel vTune analysis. AWS just enabled architectural PMCs. My patch set might be a useful example of how such a whitelist can be implemented, although how EC2 implemented it might be different to this.

## Conclusion

PMCs are crucial for analyzing a (if not *the*) modern system bottleneck: memory I/O. A set of PMCs is now available on dedicated hosts in the EC2 cloud, enough for high-level analysis of memory I/O issues.
I used them in this post to measure IPC, which can identify if your applications are likely memory bound or instruction bound, directing further tuning efforts. I've worked with PMCs before, and the sort of wins they help you find can range from small single digit percentages to as much as 2x. The net result for companies like Netflix is that our workloads will run faster on EC2 because we can use PMCs to find these performance wins. Consider that, next time someone is comparing clouds by microbenchmarking alone. It's not just out-of-the-box performance that matters, it's also your ability to observe and tune your applications.
A cloud you can't analyze is a slower cloud.
Thanks to those at Netflix for supporting my work on this, the Xen community for their vpmu work, and everyone at Amazon and Intel who made this happen! (Thanks Joe, Matt, Subathra, Rosana, Uwe, Laurie, Coburn, Ed, Mauricio, Steve, Artyom, Valery, Jan, Boris, and more.) (Yes, I'm happy.)

[The MSRs of EC2]: http://www.brendangregg.com/blog/2014-09-15/the-msrs-of-ec2.html
[Intel 64 and IA-32 Architectures Developer's Manual: vol. 3B]: http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html
[perf]: /perf.html
[CPU flame graph]: /FlameGraphs/cpuflamegraph.html
[CPI flame graphs]: /blog/2014-10-31/cpi-flame-graphs.html
[x86/VPMU: implement ipc and arch filter flags]: http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=e3cce1799df2f957dfa00f84a5315cbf896490fe
[WSPerfLab]: https://github.com/Netflix-Skunkworks/WSPerfLab/tree/master/test-results
[pmcarch]: https://github.com/brendangregg/pmc-cloud-tools/blob/master/pmcarch

Exclusive Or Character Josef "Jeff" Sipek

A couple of years ago I blogged about the CCS instruction in the Apollo Guidance Computer. Today I want to tell you about the XC instruction from the System/360 ISA.

Many ISAs have some sort of xor instruction. The 360 is no different. It offers several different xor instructions which differ in the type of operands that they operate on. In all cases, the operation they perform could be summarized as (using C syntax):

A ^= B;

That is, one of the operands is used as both a source and a destination.

There are the boring X (reg ^= memory), XR (reg ^= reg), and XI (memory ^= immediate). Then there is XC, which is what inspired this post. XC, or Exclusive Or Character, takes two memory locations and a length and performs what appears to be a byte-by-byte xor of the two buffers. (The hardware is smart enough to operate on bigger chunks of memory, but the effect is as if it were done a byte at a time.) In assembly XC looks like:

XC d1(l,b1),d2(b2)

The d are 12-bit unsigned displacements while the b specify the registers with the base address. For each of the operands the actual address is dX plus the value of the bX register. The l is a length field which encodes a length between 1 and 256.

To use more C pseudocode, XC does:

void XC(unsigned char *op1, size_t len, unsigned char *op2)
{
	while (len--) {
		*op1 ^= *op2;
		op1++;
		op2++;
	}
}

(This pseudo code ignores the condition code calculation and exception generation which are not relevant to the discussion.)

This by itself is neat but not very exciting…until you remember that xor can be used to zero out a register. You can use XC to zero out up to 256 bytes of memory. It turns out this idiom is used pretty often in handwritten assembly, and compilers such as gcc even produce such instructions without any special effort on the programmer’s behalf.

For example, in HVF I have this line:

memset(&psw, 0, sizeof(struct psw));

Which GCC helpfully turns into (struct psw is 16 bytes in size):

xc      160(16,%r15),160(%r15)

When I first saw that line in the disassembly of HVF years ago, it blew my mind. It is elegant, fast thanks to the microarchitecture optimizations, and once you are used to the idiom it is clear about what it does. I hope your mind was as blown as mine. Till next time!

USENIX/LISA 2016 Linux bcc/BPF Tools Brendan Gregg's Blog

For USENIX LISA 2016 I gave a talk that was years in the making, on Linux bcc/BPF analysis tools.

"Time to rethink the kernel" - Thomas Graf
Thomas has been using BPF to create new network and application security technologies (project [Cilium]), and to build something that's starting to look like microservices in the kernel ([video]). I'm using it for advanced performance analysis tools that do tracing and profiling. Enhanced BPF might still be new, but it's already delivering new technologies, and making us rethink what we can do with the kernel. My LISA 2016 talk begins with a 15 minute demo, showing the progression from ftrace, then perf_events, to BPF (due to the audio/video settings, this demo is a little hard to follow in the full video, but there's a separate recording of just the demo here: [Linux tracing 15 min demo]). Below is the full talk video (youtube):
The slides are on [slideshare] ([PDF]):
The rest of the talk can be seen in the standard talk video (youtube):
## Installing bcc/BPF

To try out BPF for performance analysis you'll need to be on a newer kernel: at least 4.4, preferably 4.9. The main front end is currently [bcc] (BPF compiler collection), and there are [install instructions] on github, which keep getting improved. For Ubuntu, installation is:
echo "deb [trusted=yes] https://repo.iovisor.org/apt/xenial xenial-nightly main" | sudo tee /etc/apt/sources.list.d/iovisor.list
sudo apt-get update
sudo apt-get install bcc-tools
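A quick sanity check after installing (a sketch; the tools land under /usr/share/bcc/tools, as described below, and need root):

sudo /usr/share/bcc/tools/execsnoop     # trace new processes; Ctrl-C to end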
There's currently a pull request to add snap instructions, as there are nightly builds for snappy as well.

## Listing bcc/BPF Tools

This install will add various performance analysis and debugging tools to /usr/share/bcc/tools. Since some require a very recent kernel (4.6, 4.7, or 4.9), there's a subdirectory, /usr/share/bcc/tools/old, which has some older versions of the same tools that work on Linux 4.4 (albeit with some caveats).
# ls /usr/share/bcc/tools
argdist       cpudist            filetop         offcputime   solisten    tcptop    vfsstat
bashreadline  cpuunclaimed       funccount       offwaketime  sslsniff    tplist    wakeuptime
biolatency    dcsnoop            funclatency     old          stackcount  trace     xfsdist
biosnoop      dcstat             gethostlatency  oomkill      stacksnoop  ttysnoop  xfsslower
biotop        deadlock_detector  hardirqs        opensnoop    statsnoop   ucalls    zfsdist
bitesize      doc                killsnoop       pidpersec    syncsnoop   uflow     zfsslower
btrfsdist     execsnoop          llcstat         profile      tcpaccept   ugc
btrfsslower   ext4dist           mdflush         runqlat      tcpconnect  uobjnew
cachestat     ext4slower         memleak         runqlen      tcpconnlat  ustat
cachetop      filelife           mountsnoop      slabratetop  tcplife     uthreads
capable       fileslower         mysqld_qslower  softirqs     tcpretrans  vfscount
Just by listing the tools, you might spot something you want to start with (ext4*, tcp*, etc). Or you can browse the following diagram:
## Using bcc/BPF

If you don't have a good starting point, in the [bcc Tutorial] I included a generic checklist of the first ten tools to try. I also included this in my LISA talk:
  1. execsnoop
  2. opensnoop
  3. ext4slower (or btrfs*, xfs*, zfs*)
  4. biolatency
  5. biosnoop
  6. cachestat
  7. tcpconnect
  8. tcpaccept
  9. tcpretrans
  10. runqlat
  11. profile
Most of these have usage messages, and are easy to use. They'll need to be run as root. For example, execsnoop to trace new processes:
# /usr/share/bcc/tools/execsnoop
PCOMM            PID    PPID   RET ARGS
grep             69460  69458    0 /bin/grep -q g2.
grep             69462  69458    0 /bin/grep -q p2.
ps               69464  58610    0 /bin/ps -p 308
ps               69465  100871   0 /bin/ps -p 301
sleep            69466  58610    0 /bin/sleep 1
sleep            69467  100871   0 /bin/sleep 1
run              69468  5160     0 ./run
[...]
And biolatency to record an in-kernel histogram of disk I/O latency:
# /usr/share/bcc/tools/biolatency 
Tracing block device I/O... Hit Ctrl-C to end.
^C
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 64       |**********                              |
       512 -> 1023       : 248      |****************************************|
      1024 -> 2047       : 29       |****                                    |
      2048 -> 4095       : 18       |**                                      |
      4096 -> 8191       : 42       |******                                  |
      8192 -> 16383      : 20       |***                                     |
     16384 -> 32767      : 3        |                                        |
Here's its USAGE message:
# /usr/share/bcc/tools/biolatency -h
usage: biolatency [-h] [-T] [-Q] [-m] [-D] [interval] [count]

Summarize block device I/O latency as a histogram

positional arguments:
  interval            output interval, in seconds
  count               number of outputs

optional arguments:
  -h, --help          show this help message and exit
  -T, --timestamp     include timestamp on output
  -Q, --queued        include OS queued time in I/O time
  -m, --milliseconds  millisecond histogram
  -D, --disks         print a histogram per disk device

examples:
    ./biolatency            # summarize block I/O latency as a histogram
    ./biolatency 1 10       # print 1 second summaries, 10 times
    ./biolatency -mT 1      # 1s summaries, milliseconds, and timestamps
    ./biolatency -Q         # include OS queued time in I/O time
    ./biolatency -D         # show each disk device separately
In /usr/share/bcc/tools/docs or the [tools subdirectory] on github, you'll find _example.txt files for every tool, which have screenshots and discussion. Check them out! There are also man pages under man/man8. For more information, please watch my LISA talk at the top of this post when you get a chance, where I explain Linux tracing, BPF, bcc, and tour various tools.

## What's Next?

My prior talk at LISA 2014 was [New Tools and Old Secrets (perf-tools)], where I showed similar performance analysis tools using ftrace, an older tracing framework in Linux. I'm still using ftrace, not just for older kernels, but for times where it's more efficient (eg, kernel function counting using the funccount tool). BPF is programmatic, and can do things that ftrace can't. After ftrace at LISA 2014 and BPF at LISA 2016, you might wonder what I'll propose for LISA 2018. We'll see. I could be covering a higher-level BPF front-end (eg, [ply], if it gets finished), or a BPF GUI (eg, via Netflix Vector), or I could be focused on something else entirely. Tracing was my priority when Linux lacked various capabilities, but now that's done, there are other important technologies to work on...

[youtube]: https://www.youtube.com/watch?v=GsMs3n8CB6g
[Linux tracing 15 min demo]: https://www.youtube.com/watch?v=GsMs3n8CB6g
[Linux tracing in 15 minutes]: /blog/2016-12-27/linux-tracing-in-15-minutes.html
[PDF]: /Slides/LISA2016_BPF_tools_16_9.pdf
[slideshare]: http://www.slideshare.net/brendangregg/linux-4x-tracing-tools-using-bpf-superpowers
[previous post]: /blog/2017-04-23/usenix-lisa-2013-flame-graphs.html
[video]: https://www.youtube.com/watch?v=ilKlmTDdFgk
[Cilium]: https://github.com/cilium/cilium
[New Tools and Old Secrets (perf-tools)]: /blog/2015-03-17/usenix-lisa-2014-linux-ftrace-perf-tools.html
[ply]: https://github.com/iovisor/ply
[install instructions]: https://github.com/iovisor/bcc/blob/master/INSTALL.md
[tools subdirectory]: https://github.com/iovisor/bcc/tree/master/tools
[bcc Tutorial]: https://github.com/iovisor/bcc/blob/master/docs/tutorial.md
[bcc]: https://github.com/iovisor/bcc

OmniTribblix The Trouble with Tribbles...

In Tribblix, it's a basic principle that I ship upstream software unmodified. I don't impose my own views on installation layout, nor do I customize it. Generally, I apply patches only to make stuff compile.

This means that what you see in Tribblix is exactly what the upstream author intended, and not some distro-specific bastardization of it.

It also makes my life easier, I don't have to maintain patches, and updating software is much easier if it's unmodified.

In particular, I use an absolutely vanilla illumos-gate. (For a long time it differed only in that I had the fix for 5188 applied, relevant because Tribblix actually uses SVR4 packaging, but now that's integrated I don't even need to do that.)

Again, this makes my life easier. (When you're maintaining a distro on your own in your spare time, making decisions that simplify your job is essential.)

But it also has another benefit: because I have no "special" features that I've added, I'm not tied to one particular version or variant or commit of illumos. Any version of illumos-gate will do just fine. When it comes time to make a release, I just clone the gate, build, and go.

What I could do, then, is build an instance of Tribblix atop some other fork of the gate. For example, illumos-omnios.

I did just that: built the gate (it needed a couple of changes to Makefiles because of the way that perl and snmp are slightly different in OmniOS than they are in Tribblix), created packages, built an ISO, and booted and installed it in VirtualBox.

As expected, it just works.

But just demonstrating that it works isn't really the reason I wanted to do this. What I'm really after is the LX brand, which has been integrated into current OmniOS.

Installing an LX zone requires a Linux image. The original (Joyent) work was for their own deployment mechanism, using ZFS images. As soon as it was available in OmniOS the first thing I did was use tarballs, which OmniOS now supports. The easiest way to create a Linux image is to create a Docker container the way you like it, and then export it to a tarball. I did that for Alpine and installed a zone based on that.
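A hedged sketch of that Docker-export step (container and file names here are made up; the resulting tarball is what the lx zone install consumes):

docker create --name alpine-src alpine:latest            # container created from the Alpine image, not started
docker export alpine-src | gzip > alpine-rootfs.tar.gz   # its root filesystem, exported as a tarball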

Then you can do very simple things like:

# zlogin lx1 /bin/uname -a 
Linux lx1 4.4 BrandZ virtual linux x86_64 Linux

It's an attractive idea to simply use this as the base for the next Tribblix release. However, that requires illumos-omnios to be supported in the long term, which is currently at risk.

Modern Mercurial Josef "Jeff" Sipek

I’ve been using both Git and Mercurial since they were first released in 2005. I’ve messed with the internals of both, but I always had a preference for Mercurial (its user interface is cleaner, its design is well thought-out, and so on). So, it should be no surprise that I felt a bit sad every time I heard that some project chose Git over Mercurial (or worse yet, migrated from Mercurial to Git). At the same time, I could see Git improving release after release—but Mercurial did not seem to. Seem is the operative word here.

A couple of weeks ago, I realized that more and more of my own repositories have been Git based. Not for any particular reason other than that I happened to type git init instead of hg init. After some reflection, I decided that I should convert a number of these repositories from Git to Mercurial. The conversion itself was painless thanks to the most excellent hggit extension that lets you clone, pull, and push Git repositories with Mercurial. (I just cloned the Git repository with a hg clone and then cleaned up some of the mess manually—for example, I don’t need the bookmark corresponding to the one and only branch in the original Git repository.) Then the real fun began.

I resumed the work on my various projects, but now with the brand-new Mercurial repositories. Soon after, I started hitting various quirks with the Mercurial UI. I realized that the workflow I was using wasn’t really aligned with the UI. Undeterred, I looked for solutions. I enabled the pager and color extensions, overrode some of the default colors to be less offensive (and easier to read), and enabled the shelve, rebase, and histedit extensions to (along with mq) let me do some minor history rewriting while I iteratively work on changes. (I learned about and switched to the evolve extension soon after.) With each tweak, the user experience got better and better.
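A minimal sketch of the kind of ~/.hgrc additions involved (extension names as mentioned above; the color overrides and mq/evolve settings are omitted):

cat >> ~/.hgrc <<'EOF'
[extensions]
pager =
color =
shelve =
rebase =
histedit =
EOF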

Then it suddenly hit me—before these tweaks, I had been using Mercurial like it’s still 2005!

I think this is a very important observation. Mercurial didn’t seem to be improving because none of the user-visible changes were forced onto the users. Git, on the other hand, started with a dreadful UI so it made sense to enable new features by default to lessen the pain.

One could say that Mercurial took the Unix approach—simple and not exactly friendly by default, but incredibly powerful if you dig in a little. (This extensibility is why Facebook chose Mercurial over Git as a Subversion replacement.)

Now I wonder if some of the projects chose Git over Mercurial at least partially because by default Mercurial has been a bit…spartan.

With my .hgrc changes, I get exactly the information I want in a format that’s even better than what Git provided me. (Mercurial makes so much possible via its templating engine and the revsets language.)

So, what does all this mean for Mercurial? It’s hard to say, but I’m happy to report that there are a number of good improvements that should land in the upcoming 4.2 release scheduled for early May. For example, the pager and color functionality is moving into the core and will be on by default.

Finally, I like my current Mercurial environment quite a lot. The hggit extension is making me seriously consider using Mercurial when dealing with Git repositories that I can’t convert.

USENIX/LISA 2013 Blazing Performance with Flame Graphs Brendan Gregg's Blog

In 2013 I gave a plenary at USENIX/LISA on flame graphs: my visualization for profiled stack traces, which is now used by many companies (including Netflix, Facebook, and Linkedin) to identify which code paths consume CPU. The talk is more relevant today, now that flame graphs are widely adopted. Slides are on [slideshare] ([PDF]):

Video is on [youtube]:

The talk explains the origin of flame graphs, how to interpret them, and then tours different profile and trace event types that can be visualized. It predates some flame graph features that were added later: zoom, search, mixed-mode color highlights (--colors=java), and differential flame graphs. I used DTrace to create different types of flame graphs in the talk, but since then I've developed ways to do them on Linux, using [perf] for CPU flame graphs, and [bcc/BPF] for advanced flame graphs: off-CPU and more. My [BPF off-CPU flame graphs] post used my stack trace hack, but since then we've added stack trace support to BPF in Linux (4.6), and these can now be implemented without hacks. The tool offcputime in bcc has already been updated to do this (thanks Vicent Marti and others for getting it working well, and Alexei Starovoitov for adding stack trace support to BPF). This talk was 170 slides in 90 minutes, which may have been too much in 2013 when flame graphs were new. There's a reason for this: I'd planned to do a 45 minute talk on CPU flame graphs, ending on slide 98, followed by a different talk. For reasons beyond my control, I was told the night before that I couldn't give that second talk. My plan B, as I'd already discussed with the conference organizers, was to extend the flame graphs talk and add an advanced section. I was up to 5am doing this, and was then woken at 8am by the conference organizers: the plenary speaker had shellfish poisoning, and could I come down and give my flame graphs talk at 9am, instead of later that day? That's how this ended up as a 90 minute plenary! At that LISA I also worked more with USENIX staff, and co-delivered a metrics workshop, as well as another talk. I was proud to be involved with USENIX/LISA and contribute in these ways. And you can too: the call for proposals for LISA 2017 ends tomorrow (April 24). Since 2013, I've also written about flame graphs in [ACMQ] and [CACM]. For the latest on flame graphs, see the [updates] section of my flame graphs page.

[ACMQ]: http://queue.acm.org/detail.cfm?id=2927301
[CACM]: http://cacm.acm.org/magazines/2016/6/202665-the-flame-graph/abstract
[PDF]: /Slides/LISA13_Flame_Graphs.pdf
[slideshare]: http://www.slideshare.net/brendangregg/blazing-performance-with-flame-graphs
[youtube]: http://www.youtube.com/watch?v=nZfNehCzGdw
[updates]: /flamegraphs.html#Updates
[perf]: /perf.html#FlameGraphs
[bcc/BPF]: https://github.com/iovisor/bcc
[BPF off-CPU flame graphs]: /blog/2016-01-20/ebpf-offcpu-flame-graph.html

Noisy Tribblix The Trouble with Tribbles...

I've had a couple of Tribblix users ask me why audio doesn't work.

This was something I had noticed myself, and the reason was not that audio was in some way broken, but that the permissions on the audio devices were wrong - owned and only writeable by root.

Now I only wanted to actually get any audio out on fairly rare occasions, so a quick chown wasn't that much of an imposition. But it obviously needed fixing properly.

My assumption here is that most desktop users will be logging in through the SLiM login manager. So all I need to do is fix the permissions just before it calls setuid() to the logged in user. And then reset them back once the user is done.

Now, I could have made up a bunch of chowns myself, or written a helper. There's actually code in SLiM to call ConsoleKit - but I don't have ConsoleKit, and don't really see the need to maintain a port of it just for this.

But illumos already has the capability to do this, and the normal login mechanisms use it. There's code in libdevinfo that sets the permissions according to the rules laid out in the /etc/logindevperm file. So the code is really just a call to di_devperm_login() and di_devperm_logout(), and all is well.
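For reference, the audio rule in /etc/logindevperm looks something like the following on illumos (check your own file; the exact entries vary):

# grep sound /etc/logindevperm
/dev/console    0600    /dev/sound/*            # audio devices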

This also fixed another irritating bug - I can now eject memory sticks as myself, without needing to be root.

The next thing that happens, of course, is that it doesn't take very long to realise that Twitter has a lot of videos that play automatically. So I'm sitting there and I can hear either the internal loudspeaker or my headphones warbling away.

So the next thing I need is a way to shut the thing up. Historically, I used the old CDE sdtaudiocontrol, which was pretty good. (In general, I detested CDE as a desktop; the mailer and calendar were decent enough for their time, and the audio control was the only other thing I used much.) I use Xfce as my desktop; it used to have xfce4-mixer, but that's now unmaintained and deprecated (and I removed it as part of the migration from gstreamer-0.10 to gstreamer1). Which pretty much leaves the command line audio utilities in illumos, specifically audioctl. I've added the package so users who update will automatically get that as well.

The command

audioctl set-control volume 0

silences things, while

audioctl set-control volume 75

puts the volume back to normal. I've created aliases mute and unmute for those. A more sophisticated approach would be to save the volume and restore it afterwards, but this is enough for now.
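A minimal sketch of those aliases (shell syntax; 75 is just the normal level used above):

alias mute='audioctl set-control volume 0'
alias unmute='audioctl set-control volume 75'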

Thank you, Oracle engineers alp's notes

After Oracle acquired Sun in 2010, most of us who followed OpenSolaris were depressed. Within a year, one of the most advanced operating systems was closed up behind a steel curtain. Luckily, thanks to the enormous efforts of the community and of the companies that depended on OpenSolaris, the system survived. Currently we have several more or less successful illumos distributions, targeting different users. Nowadays there is a (deserved, of course) common negative feeling towards Oracle in the illumos community. But let's look at it from another point of view: let's look at the things which the illumos community (and OpenIndiana in particular) got, directly or indirectly, from Oracle in recent years.

  • Our userland build system, which constantly evolves (though in different directions under Oracle's control and in our distribution). Still, a lot of components can be easily migrated between the build systems.
  • A lot of software build recipes and patches were, as a result, borrowed with small modifications from the Oracle userland-gate. The process is still going on.
  • We still borrow patches from the Solaris pkg-gate. Even though the differences in the underlying kernels are currently rather significant, a lot of changesets from pkg-gate can be ported to the OpenIndiana pkg5 repository.
  • Of course, I cannot avoid thanking Alan for his constant help in supporting the Xorg subsystem and the GUI parts of our distribution. He has always been helpful to me and Aurélien.
  • Evidently, the recent KMS work integrated into OpenIndiana wouldn't have been possible without Oracle's open drm port, which was ported from Solaris to illumos by Martin Bochnig, and later independently ported and enhanced by Gordon Ross.
  • And of course, I cannot count the patches which were suggested to upstream projects by Oracle engineers. Just today, when I tried to solve two issues related to the interaction of IPS and Apache 2.4, I found two patches by Petr Sumbera fixing Apache issues on Solaris.

So, I want to use the chance to thank all Oracle Solaris engineers for their work on open source projects. I doubt that without them illumos could have survived at any large scale. Perhaps we could be an excellent playground for ZFS development, but not a universal operating system...