Finding Lock Inversions with DTrace

Tracing lock acquisition order with DTrace's lockstat and fbt providers.
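
As a minimal starting point (not a full inversion detector), the lockstat provider can show which code paths take adaptive mutexes, which is the raw data you need when comparing acquisition order:

dtrace -n 'lockstat:::adaptive-acquire { @[stack()] = count(); }'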

Tweaking binaries with elfedit The Trouble with Tribbles...

On Solaris and illumos, you can inspect shared objects (binaries and libraries) with elfdump. Most commonly, you're simply looking to see which shared libraries you're linked against, in which case it's elfdump -d (or, for those of us who were doing this years before elfdump came into existence, dump -Lv). For example:

% elfdump -d /bin/true

Dynamic Section:  .dynamic
     index  tag                value
       [0]  NEEDED            0x1d6               libc.so.1
       [1]  INIT              0x8050d20          

and it goes on a bit. But basically you're looking at the NEEDED lines to see which shared libraries you need. (The other field that's generally of interest for a shared library is the SONAME field.)
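
For example, to check a library's SONAME (the path here is just the usual libc, purely for illustration):

% elfdump -d /usr/lib/libc.so.1 | grep SONAME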

However, you can go beyond this, and use elfedit to manipulate what's present here. You can essentially replicate the above with:

elfedit -r -e dyn:dump /bin/true

Here the -r flag says read-only (we're just looking), and -e says execute the command that follows, which is dyn:dump - or just show the dynamic section.

If you look around, you'll see that the classic example is to set the runpath (which you might see as RPATH or RUNPATH in the dump output). This was used to fix up binaries that had been built incorrectly, or where you've moved the libraries somewhere other than where the binary normally looks for them. Which might look like:

elfedit -e 'dyn:runpath /my/local/lib' prog

This is the first example in the man page, and the standard example wherever you look. (Note the quotes - that's a single command input to elfedit.)
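
If you just want to see what runpath is already set (or check what you've just changed), filtering the dynamic section is enough, since the dump output tags the entry as RPATH or RUNPATH:

elfdump -d prog | egrep 'RPATH|RUNPATH'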

However, another common case I come across is where libtool has completely mangled the link so the full pathname of the library (at build time, no less) has been embedded in the binary (either in absolute or relative form). In other words, rather than the NEEDED section being

libfoo.so.1

it ends up being

/home/ptribble/build/bar/.libs/libfoo.so.1

With this sort of error, no amount of tinkering with RPATH is going to help the binary find the library. Fortunately, elfedit can help us here too.

First you need to work out which element you want to modify. Back to elfedit again to dump out the structure:

% elfedit -r -e dyn:dump /bin/baz
     index  tag                value
       [0]  POSFLAG_1         0x1                 [ LAZY ]
       [1]  NEEDED            0x8e2               /home/.../libfoo.so.1

It might be further down, of course. But the entry we want to edit is index number 1. We can narrow down the output just to this element by using the -dynndx flag to the dyn:dump command, for example

elfedit -r -e 'dyn:dump -dynndx 1' /bin/baz

or, equivalently, using dyn:value

elfedit -r -e 'dyn:value -dynndx 1' /bin/baz

And we can actually set the value as well. This requires the -s flag to set a string, but you end up with:

elfedit -e 'dyn:value -dynndx -s 1 libfoo.so.1' /bin/baz

and then if you use elfdump or elfedit or ldd to look at the binary, it should pick up the library correctly.
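
For example, to confirm the fix has taken (assuming libfoo.so.1 is now somewhere the runtime linker can find it):

ldd /bin/baz | grep libfoo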

This is really very simple (the hardest part is having to work out what the index of the right entry is). I didn't find anything when searching that actually describes how simple it is, so I thought it worth documenting for the next time I need it.


On Tribblix Milestone 20 The Trouble with Tribbles...

Having released a new update for Tribblix, I thought I would add a little commentary on the progress that's being made and the direction things are going in.

This goes beyond the rather dry release notes and list of what's changed.

The big structural change is that the ISO has been built as a single root archive, rather than the old way with a split-off /usr that's lofi-mounted from a compressed image.

The original reason for doing this (and I experimented with it a while ago) was to allow installation on systems without drivers for the device that you're booting from. This might be a system with only USB3 ports, or I've had problems with laptops where illumos doesn't recognize the CD drive. The boot loader (and BIOS) load the initial boot archive, so if you don't need to ever talk to the media device again you're in much better shape.

While we now have USB3 support, this simplified boot is a good thing in any case, and it allows some neat tricks like iPXE boot.

Another logical change is in the release mechanism itself. I've discussed the Tribblix package repositories before. The snag with the traditional repository layout was that the packages that defined a release were in the main Tribblix repository. So, every time I make a new release I end up having to create a whole new Tribblix repository. Every time I update the illumos packages, I need a new Tribblix repository. Creating a new one isn't too bad; ongoing support for multiple repositories is a lot of unnecessary work.

The way to fix this is to split out the packages (there are 3 of them) that define the properties of a release into their own separate repo. This allows at least 2 new possibilities:

  1. I can release updated illumos packages without spinning a whole new Tribblix release. It would still use the same upgrade mechanism, but the main Tribblix repo is shared and it's a much lighter release process.
  2. I could create variants or spins. For example, I could create a variant that has LX (see omnitribblix). This would just have a different set of illumos packages but shares everything else. Or I could build a 32-bit or 64-bit only distro.
I haven't yet done either of those things, but it's going to happen.

Behind the scenes I've been gradually working to get more packages - especially those that deliver libraries - built as both 32-bit and 64-bit.

Tribblix is fairly clear that it will continue to support 32-bit and 64-bit hardware, at least for a while. (Whereas both OmniOS and OpenIndiana have effectively dropped 32-bit compatibility, mostly by neglect rather than design.) Of course, there is a reasonable amount of software now that's only 64-bit (anything built with go, for example, or OpenJDK 8), but there's a reasonable chance the people using 32-bit hardware aren't necessarily going to want the latest and greatest applications. (This isn't 100% true, by the way - sometimes you have to interoperate with other facilities in the environment.) But eventually we're going to have to make a full 64-bit transition, and it would be good to be ready.

That gives a rough idea of the work that's currently underway. Looking ahead, there's a whole long list of packages that need adding or updating (such is a maintainer's life). The one significant place I have been falling behind is that I haven't updated gcc, so that needs work. And, of course, I'm trying to get SPARC into some sort of reasonable shape. But, overall, Tribblix is now pretty solid, and a bit more polish and attention to detail would benefit it greatly.

Installing Tribblix on Vultr using iPXE The Trouble with Tribbles...

One of the new features in Tribblix 0m20 is that booting and installing using iPXE now works.

Here's an example of using this functionality to install a server running Tribblix in the Vultr cloud. A similar mechanism ought to work for any other provider that allows iPXE boot.

I'm assuming you have signed up and logged in; then go to deploy a server.

First choose where you want to deploy the server. I'm in the UK, so London is a good choice.


Then the critical bit: selecting the Server Type. The option you want here is in a slightly confusing location, under the "Upload ISO" tab. Select the "iPXE" radio button and put in the value http://pkgs.tribblix.org/m20/ipxe.txt


The other key option is Server Size. As with many providers, there's a simple scale. For testing, an instance with 1G of memory is more than adequate.


Then deploy it. After a few seconds of installing, you can click the link to manage the server, and then view the console, which uses VNC.

If you're reasonably quick you get to see the initial iPXE screen, and can see it downloading the images:


What you can see here is that it's downloaded the original ipxe script we specified. This looks like:

#!ipxe
dhcp
kernel /m20/platform/i86pc/kernel/amd64/unix
initrd /m20/platform/i86pc/boot_archive
boot
 
Which just says to set up the network using dhcp (this might have already been done, but if you're booting off an ipxe iso it may not have been, so we do it anyway), then download the kernel and the boot archive, then boot from what you've just downloaded.

The kernel and the boot archive come from the ISO; I've just unpacked them onto the server (so the URL given above for the ipxe script will be reasonably permanent for anybody to use). The only slight tweak I've had to make is that the original boot archive is actually gzip compressed and iPXE can't handle that, so it's been uncompressed. The boot archive also now contains the /usr file system as well, rather than it being split off as before. While I'm sure you could mangle the system to download it and sort things out, it's so much easier to put it inside the boot archive.
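
For the record, there's nothing clever about preparing those files. A rough sketch (the ISO filename and web root here are purely illustrative, and the lofi device is whatever lofiadm prints):

lofiadm -a /path/to/tribblix.iso
mount -F hsfs -o ro /dev/lofi/1 /mnt
mkdir -p /data/web/m20/platform/i86pc/kernel/amd64
cp /mnt/platform/i86pc/kernel/amd64/unix /data/web/m20/platform/i86pc/kernel/amd64/
gzip -dc < /mnt/platform/i86pc/boot_archive > /data/web/m20/platform/i86pc/boot_archive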

Then you get into the normal installer, so log in as jack, su to root, and see what disk(s) are available using the new diskinfo tool. Then you can install Tribblix to that disk:



Don't bother adding additional overlays at this point. It won't work - and you'll get an error about not being able to install overlays (you'll get the error anyway because the installer always tries to add some packages that aren't needed in the live environment). This will be fixed in a future update, but it's relatively harmless.

The other thing you should do before the installation is to change the passwords for root and jack. If you change them before running the installer then the change will propagate to the installed system (because all it's doing is a copy). You really don't want the system to boot up wide open to the internet with the default (and well known) passwords.
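
That's just the usual (as root, in the live environment, before you run the installer):

passwd root
passwd jack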

Once the (pretty quick) install finishes, it'll look like this:


That's just like a normal install, other than the missing overlays. Then just reboot and you'll soon see the new loader, followed by the system booting.

Due to the missing overlays, you'll get an error about the intrd service failing. You'll have to log in (ssh will work at this point) and then add at least the base overlay:

zap install-overlay base

Plus whatever other overlays you might want. Then you can clear the intrd service and you're good to go.
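
For example (kitchen-sink here is just a stand-in for whatever overlays you actually want):

zap install-overlay kitchen-sink
svcadm clear intrd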

Tribblix memory requirements The Trouble with Tribbles...

Compared to the other illumos distributions, Tribblix has lower memory requirements.

I'm not talking about crazy stunts like running in 48M; here I'm talking about running a fully fledged system.

I've been doing a bit of testing of the upcoming release, which includes running the install under a range of configurations. The test here is to boot the ISO image in VirtualBox with a range of memory sizes and then install the kitchen sink.

  • The live image won't boot at all on a 256M system
  • The live image will boot on a 512M system, but installing to zfs will fail
  • However, installing to ufs works on a 512M system
  • With 768M, installation to zfs is rather slow
  • With 1G or more, you're fine
The upcoming release is going to be built slightly differently, in that it's no longer a split-off /usr configuration. (I discussed how that worked and those strange zlib files some time ago.) The latest OmniOS is a single image; SmartOS likewise. It's just so much easier to construct, and far more reliable.

That change explains the 256M failure - the ramdisk is about 300M, so it simply won't fit. It's likely to have an impact on the 512M case too - in the old scenario you only paged in the bits of the /usr filesystem as and if you needed them; now it's locked into memory.

On a limited memory system there's a way to make things a bit easier. Simply install the base (no additional overlays) from the installer, then add the rest of the overlays and packages later. The point here is that running from disk doesn't lock up anywhere near as much memory as the full OS being resident in RAM does. And some of the packages in the kitchen sink are rather large, which causes problems.

Once you've got Tribblix installed, how well does it cope? Surprisingly well, to be honest. The Xfce desktop runs quite well in either 512M or 768M of memory. I can run firefox on the 768M system without too many problems (given the way it consumes memory, probably not for a long intensive browsing session), while firefox on a 512M system does run, but it's clearly starting to grind. Java applications work, some smaller ones at least. You need to be realistic in your expectations, but the point is that smaller systems do work.

The most limited systems would tend to be older, possibly 32-bit hardware. I could build a 32-bit only image which would be quite a bit smaller - maybe only two-thirds the size. (And if you really wanted to you could get it even smaller - but then you're in the realms of building custom images using mvi or the like.)

However, the aim of keeping Tribblix viable on smallish systems isn't just to allow the use of old hardware, beneficial though that is. If you're running a service on a cloud or hosting provider then being able to use a 1G server instead of a 2G server will halve your costs, and that's a very good thing to be able to do.

Tribblix SPARC progress The Trouble with Tribbles...

Tribblix is one of the relatively few illumos distributions that runs on both SPARC and x86 hardware.

There are valid reasons for the lack of SPARC support in other distributions. For those backed by commercial entities, it makes no sense to support SPARC as they don't have paying customers to foot the bill. Which leaves SPARC support firmly in the hobbyist realm.

Even in Tribblix, SPARC support has lagged the x86 version somewhat. Again, for entirely predictable reasons. While I do have SPARC hardware, it's relatively slow, noisy, power hungry, and heat-producing compared to my regular x86 boxes. And my day to day use is my x86 workstation, so that drives a lot of the desktop work.

But SPARC development of Tribblix hasn't stopped. Far from it, it's just naturally slower.

The current download ISO image at this time is still Milestone 16. Just to clarify the versioning here - that means it was built from exactly the same illumos commit as the corresponding x86 release. Because it took a little longer to get ready, the userland packages (such as they were) tended to be a bit newer.

There have been 3 more Tribblix releases on x86 since then. Over the winter (when it was cold and the heat output from the T5140 I use as a build server was a good thing) I tried building updated illumos versions. The T5140 I'm using to do the builds is running a cobbled-together frankendistro of bits of Tribblix, bits of OpenSXCE, some random bits from other people working on SPARC, and a whole lot of elbow grease. I managed to build illumos at the m17 and m18 release points, but m19 was a step too far (some of the native stuff assumes that the host OS isn't terribly antiquated). What this means is that I need to replace that with a current system, and get a properly self-hosting illumos build.

That modernizes the underlying illumos components a bit. What about the rest of the system? The primary effort there was to replace the old core components that had been borrowed from OpenSXCE while bootstrapping the distribution in the first place with native packages (which are then up to date and match the x86 build). Some of the components here are pretty crucial - zlib and libxml2, for instance. At one point I messed up libxml2 slightly - not enough to kill SMF (which would be a big worry) but enough to stop zones working (which, apart from indicating that I had broken it, also left me without an important test mechanism). Rebuild everything enough times and the problem eventually cleared.

I also had a go at getting my SunBlade 1500 workstation working. It's not terribly quick, but it's quiet enough and sufficiently low power that I can have it running without negatively impacting the home office. That was a bit of a struggle: the bge network driver currently in illumos doesn't work - I assume I'm seeing bug 7746 here, but the solution - to use an older version of the driver - works well enough. With that box available I not only have more testing capacity but also a lightweight machine that I can use to keep the package backlog under control.

Graphics on SPARC is an interesting problem. OK, so I don't expect this to be a priority, but it would be nice to have something that worked. The first problem I found (a while ago) was that some of the binary graphics drivers wouldn't work at all. For example, the m64 driver (which is what might drive the graphics in my SunBlade 2000) uses hat_getkpfnum, which was removed from illumos courtesy of bug 536. Even graphics drivers that do load often simply don't work, and getting an X server to start is a bit of a nightmare. After far too much manual fiddling I did manage to get a twm desktop running on the aforementioned SunBlade 1500, but don't expect native graphics support to improve any time soon.

Applications are another matter: there's no reason you couldn't run at least some applications on a SPARC system and display them back on your desktop machine. After all, X11 is a network display protocol (despite all the effort to eradicate that and turn it into a local-only display protocol). Or run a VNC server and access that remotely. So I've started (but not finished) building up the components for useful applications.

I haven't yet got an ISO image. That's likely to be a while, but if you have an existing SPARC system running Tribblix m16 then the upgrade to m18 ought to work. Although I would recommend a couple of changes to the procedure if you're going to try this:

  • Refresh and update everything: 'zap refresh ; zap update-overlay -a'
  • Download the current upgrade script from github and run that script in place of 'zap upgrade'
  • After booting into the newly updated BE, refresh and update everything again, just to make sure you're up to date

How to print ZFS filesystems ordered by space used blog'o'less

zfs get -o value,name -Hp used|sort -n
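
Or, letting zfs do the sorting itself and keeping human-readable sizes:

zfs list -o name,used -s used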

rpmbuild random notes blog'o'less

sudo dnf install rpmdevtools

rpmdev-setuptree

~/rpmbuild/SRPMS/
~/rpmbuild/SPECS/

~/rpmbuild/SOURCES/
~/rpmbuild/RPMS/
~/rpmbuild/BUILD/
~/.rpmmacros

sudo dnf download --source package
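
A typical cycle from there (package names are placeholders):

rpm -ivh package-1.0-1.src.rpm

rpmbuild -ba ~/rpmbuild/SPECS/package.spec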

Working at Netflix 2017 Brendan Gregg's Blog

I've now worked at Netflix for over three years. Time flies! I previously wrote about Netflix in [2015] and [2016], and if you are interested in what it's like to work here, I already covered much in those posts. As before, no one at Netflix has asked me to write this, and this is my personal blog and not a company post. I'll start with some exciting news, describe what my job is really like, the culture and mission, and some work updates.

## 100 Million Subscribers!

When I joined Netflix in April 2014, we had over 40 million subscribers in 41 countries. We are now in 190 countries and just crossed [100 million subscribers]! It's been thrilling to be part of this and help Netflix scale. You might imagine that at some point we had a major scaling crisis, where it looked like we'd fail due to an architectural bottleneck, and engineers worked long nights and weekends to save Netflix from certain disaster. That'd make a great story, but it didn't happen. We're on the EC2 cloud, which has great scalability, and our own cloud architecture of microservices is also designed for scalability. During this time we did do plenty of hard work, rolling out new technologies and major microservice versions, and fixed many problems big and small. But there was no single crisis point. Instead, it has been a process of continual improvements, by many engineers across the company.

## A Day in the Life (Performance Engineering)

What do I actually do all day? Most of my day is a 50/50 mixture of proactive projects, and reactive performance analysis. The proactive projects usually take weeks or months, and are where I'm developing a new technology or helping other teams with performance analysis or evaluations. Most of these projects aren't public yet, and some of them involve working with other companies on unreleased products. My work with Linux is different in that it is mostly public, and includes my perf-tools and bcc/eBPF tracing tools. Another long term project is Vector, our instance analysis tool, where I'm adding new performance analysis features. Getting frame pointer support in Java was another project I did a while ago. The reactive work can be for any performance problem that shows up, involving runtimes (Java, Node.js), Linux (and sometimes FreeBSD), or hypervisors (Xen, containers). Recently that's included:

- Debugging why perf profiling stopped working in recent Docker containers.
- Java core dump analysis for a crashing JVM.
- MSR analysis on an instance to show it was running at a lower clock rate.
- A latency outlier issue that happened every 15 minutes.
- Analyzing slab memory growth on an instance with containers.
- Getting flame graphs to work in a new environment.

Staff ask for help over chat, either to the perfeng chatroom or me directly, or they come visit my desk in F2. I'm also monitoring various chatrooms and metrics, and will jump in when needed. It's a good balance. Too much reactive work and you don't have time to build better tools and general fireproofing. Too much proactive work and you can become disconnected from the current company pain points, and start building solutions to the problems of yesteryear. About one hour on average each day is meetings. Some of these are regular meetings: we have a team meeting once every two weeks where everyone discusses what they are working on, and I have a one-on-one with my manager once every two weeks. At a lower frequency, I have scheduled meetings with my manager's manager, and their manager.
All these manager meetings keep me informed of the current company needs, and help connect me to the right people and projects at Netflix. Once every two weeks, I summarize what I've been working on in a shared doc: the team's bi-weekly status. Then there's some random events that happen during the year. We have offsites, where we plan what to work on each quarter, and team building events. There's also unofficial recreational groups at Netflix, including movie clubs (for good movies, and for bad ones), a karaoke group (which includes some Hamilton fans), and various sports teams. I'm on the Netflix cricket team (if you're at Netflix and didn't know we exist, join the cricket chatroom). I also usually speak at some conferences each year. ## Culture The biggest difference I've found working here is still the culture. We are empowered to do the right thing, and believe in "freedom and responsibility". This is documented in the Netflix [culture deck], and after three years I still find it true. The first seven slides point out that companies can have aspirational values, but the actual values differ:

The actual company values, as opposed to the nice-sounding values, are shown by who gets rewarded, promoted, or let go
Before joining Netflix, you're told to read it and see if this company is right for you. Then while working here, staff cite the culture deck in meetings for decision making advice. It's not nice-sounding values that are printed in the lobby and people forget about. It's an ongoing influence in the day to day running of Netflix. Having it online also beats learning the culture through word of mouth or trial and error. I know people in tech who are burned out but stay in lousy jobs, assuming every workplace is just as terrible. Jobs where there is little to no freedom, no responsibility or accountability, and where dumb office politics is the norm. I wish everyone could have a chance to work at a company like Netflix. Little to no bureaucracy. You can focus on engineering and getting stuff done, with awesome staff who will help you. ## Mission I spoke about this in my 2015 post, but it's worth repeating: our mission is to improve how entertainment is consumed worldwide, by building a great product that people choose to buy. I've noticed a widespread cynicism about successful companies, especially US corporates, where it's assumed that they must be doing something shady to be really competitive. Like selling customer data, or making it difficult to terminate membership. It's been amazing and inspiring to see how Netflix operates, contrary to this belief. We don't do anything shady, and we're proud of that. We're an honest company. ## Work Updates **SRE**: Last year I talked about my site reliability engineering (SRE) work. Since then, our CORE SRE team has grown and I'm no longer needed on the on-call rotation, so I'm back to focusing on performance work. My 18 months of SRE on-call provided many memories and valuable experiences, as well as a deeper understanding of SRE. I talked about what I learned in my [SREcon 2016] keynote, and how the aims and tools differ between performance engineering and SRE performance analysis. I miss the thrill of being paged and knowing I'm going to work with other awesome engineers and fix something _important_ in the next five minutes... or at least try to! If I miss this thrill too much, I can always jump into the CORE chatroom and help with production issues when they happen.

My new desk in building F
**Linux**: I've been contributing to profilers and tracers, and it's been satisfying to help fix these areas that I really care about. In the last three years I developed the ftrace-based [perf-tools] and used them to solve many problems, which I wrote about in [lwn.net] and spoke about at [LISA 2014]. I also worked with Alexei Starovoitov (now at Facebook) on enhanced BPF for tracing, and developed many [bcc tools] that use BPF. I spoke about these at Facebook's [Performance@Scale] event and other conferences. We're rolling out newer kernels now, and it's pretty exciting to use my bcc tools in production. For Linux, I've also done tuning, kernel analysis, [gdb], testing of [hist triggers], testing of some perf patches, and contributed a few trivial patches of my own. **PMCs**: When I considered joining Netflix three years ago, I had two technical concerns: 1. No advanced Linux tracer, and 2. No PMC access in EC2. How am I going to do advanced analysis without these? The more I thought about it, the more I became interested in the challenge, which would be the biggest of my career. Three years later, I've helped solve both of these (as well as devise some workarounds along the way). Now we have [Linux 4.9 eBPF] and [The PMCs of EC2]. Thanks to everyone who helped. **Team Changes**: Our team has grown a little, and we have a new manager, [Ed Hunter], who I worked for before at Sun Microsystems. It's great to be working with Ed again. Our prior manager, Coburn, was promoted. ## Summary When I use an awesome technology, I feel compelled to post about it and share. In this case, it's an awesome company and culture. After three years, I still find Netflix an awesome place to work, and every day I look forward to what I'll work on next. [culture deck]: http://www.slideshare.net/reed2001/culture-1798664 [2015]: http://www.brendangregg.com/blog/2015-01-20/working-at-netflix.html [2016]: http://www.brendangregg.com/blog/2016-03-30/working-at-netflix-2016.html [lwn.net]: http://lwn.net/Articles/608497/ [perf-tools]: https://github.com/brendangregg/perf-tools [LISA 2014]: /blog/2015-03-17/usenix-lisa-2014-linux-ftrace-perf-tools.html [bcc tools]: https://github.com/iovisor/bcc#tools [Performance@Scale]: http://www.brendangregg.com/blog/2016-03-05/linux-bpf-superpowers.html [linux.conf.au]: https://www.youtube.com/watch?v=JRFNIKUROPE [SCALE15x]: https://www.youtube.com/watch?v=w8nFRoFJ6EQ [hist triggers]: /blog/2016-06-08/linux-hist-triggers.html [gdb]: /blog/2016-08-09/gdb-example-ncurses.html [LISA 2016]: https://www.slideshare.net/brendangregg/linux-4x-tracing-tools-using-bpf-superpowers [SREcon 2016]: /blog/2016-05-04/srecon2016-perf-checklists-for-sres.html [Ed Hunter]: https://www.linkedin.com/in/edwhunter/ [100 million subscribers]: https://twitter.com/netflix/status/855545423276032000 [Linux 4.9 eBPF]: /blog/2016-10-27/dtrace-for-linux-2016.html [The PMCs of EC2]: /blog/2017-05-04/the-pmcs-of-ec2.html

Container Performance Analysis at DockerCon 2017 Brendan Gregg's Blog

At DockerCon 2017 I gave a talk on Linux container performance analysis, where I showed how to identify three types of performance bottlenecks in a container environment:

1. In the host vs container, using system metrics.
2. In application code in containers, using CPU flame graphs.
3. Deeper in the kernel, using tracing tools.

The talk video is on [youtube] \(42 mins\):

And the slides are on [slideshare]:
This talk was a tour of container performance analysis on Linux. I included a quick summary of the necessary background, cgroups and namespaces, as well as analysis methodologies, before digging into the actual tools and metrics. An overall takeaway is to know what is possible, not necessarily learning each tool in detail, as you can look them up later when necessary. I included many performance analysis tools, including basics including top, htop, mpstat, pidstat, free, iostat, sar, perf, and flame graphs; container-aware tools and metrics including systemd-cgtop, docker stats, /proc, /sys/fs/cgroup, nsenter, Netflix Vector, and Intel snap; and advanced tracing-based tools including iosnoop, zfsslower, btrfsdist, funccount, runqlat, and stackcount. ## Reverse Diagnosis I'm a fan of performance analysis methodologies, and I discussed how my [USE method] can be applied to container resource controls. But some controls, like CPU shares and disk I/O weights, get tricky to analyze. How do you know if a container is currently throttled by its share value, vs the system? To make sense of this, I came up with a reverse diagnosis approach: starting with a list of all possible outcomes, and then working backwards to see what metrics are required to identify one of the outcomes. I summarized it for CPU analysis with this flow chart:
The first step refers to /sys/fs/cgroup/.../cpu.stat -> throttled\_time, which indicates when a cgroup (container) is throttled by its hard cap (eg, capped at 2 CPUs). Since that's a straightforward metric, we check it first to take that outcome off the operating table, and continue. See the talk for more details, where I also included a few scenarios beforehand to see if the audience could identify the bottleneck. Try it yourself: it's hard (then try it with the above flow chart!). This may become easier over time as more metrics are added to diagnose states, and time in states, so also check for updates to cgroup metrics in the kernel. ## Netflix Titus The environment I've been analyzing is Netflix Titus, which I summarized at the start of the talk. It was covered in a post published just before my talk: [The Evolution of Container Usage at Netflix]. DockerCon was fun, and a big event: 6,000 attendees. My talk won a "top speaker" [award], which also meant I delivered it a second time for those who didn't catch the first one. Thanks to the Docker staff for putting on a great conference, and for everyone for attending my talk. [youtube]: https://www.youtube.com/watch?v=bK9A5ODIgac [slideshare]: https://www.slideshare.net/brendangregg/container-performance-analysis [The Evolution of Container Usage at Netflix]: https://medium.com/netflix-techblog/the-evolution-of-container-usage-at-netflix-3abfc096781b [USE method]: /usemethod.html [award]: https://twitter.com/brendangregg/status/854827187270242304
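
As a concrete example of the throttled_time check mentioned above, on a cgroup v1 host it can be read straight out of the cpu cgroup (the docker path varies with the cgroup driver, and the container ID and values here are made up for illustration):

# cat /sys/fs/cgroup/cpu,cpuacct/docker/<container-id>/cpu.stat
nr_periods 1000
nr_throttled 12
throttled_time 184000000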

CPU Utilization is Wrong Brendan Gregg's Blog

The metric we all use for CPU utilization is deeply misleading, and getting worse every year. What is CPU utilization? How busy your processors are? No, that's not what it measures. Yes, I'm talking about the "%CPU" metric used *everywhere*, by *everyone*. In every performance monitoring product. In top(1). What you may think 90% CPU utilization means:

What it might really mean:
Stalled means the processor was not making forward progress with instructions, and usually happens because it is waiting on memory I/O. The ratio I drew above (between busy and stalled) is what I typically see in production. Chances are, you're mostly stalled, but don't know it. What does this mean for you? Understanding how much your CPUs are stalled can direct performance tuning efforts between reducing code or reducing memory I/O. Anyone looking at CPU performance, especially on clouds that auto scale based on CPU, would benefit from knowing the stalled component of their %CPU.

## What really is CPU Utilization?

The metric we call CPU utilization is really "non-idle time": the time the CPU was not running the idle thread. Your operating system kernel (whatever it is) usually tracks this during context switch. If a non-idle thread begins running, then stops 100 milliseconds later, the kernel considers that CPU utilized that entire time. This metric is as old as time sharing systems. The Apollo Lunar Module guidance computer (a pioneering time sharing system) called its idle thread the "DUMMY JOB", and engineers tracked cycles running it vs real tasks as an important computer utilization metric. (I wrote about this [before].) So what's wrong with this? Nowadays, CPUs have become much faster than main memory, and waiting on memory dominates what is still called "CPU utilization". When you see high %CPU in top(1), you might think of the processor as being the bottleneck – the CPU package under the heat sink and fan – when it's really those banks of DRAM. This has been getting worse. For a long time processor manufacturers were scaling their clockspeed quicker than DRAM was scaling its access latency (the "CPU DRAM gap"). That levelled out around 2005 with 3 GHz processors, and since then processors have scaled using more cores and hyperthreads, plus multi-socket configurations, all putting more demand on the memory subsystem. Processor manufacturers have tried to reduce this memory bottleneck with larger and smarter CPU caches, and faster memory busses and interconnects. But we're still usually stalled.

## How to tell what the CPUs are really doing

By using Performance Monitoring Counters (PMCs): hardware counters that can be read using [Linux perf], and other tools. For example, measuring the entire system for 10 seconds:
# perf stat -a -- sleep 10

 Performance counter stats for 'system wide':

     641398.723351      task-clock (msec)         #   64.116 CPUs utilized            (100.00%)
           379,651      context-switches          #    0.592 K/sec                    (100.00%)
            51,546      cpu-migrations            #    0.080 K/sec                    (100.00%)
        13,423,039      page-faults               #    0.021 M/sec                  
 1,433,972,173,374      cycles                    #    2.236 GHz                      (75.02%)
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
 1,118,336,816,068      instructions              #    0.78  insns per cycle          (75.01%)
   249,644,142,804      branches                  #  389.218 M/sec                    (75.01%)
     7,791,449,769      branch-misses             #    3.12% of all branches          (75.01%)

      10.003794539 seconds time elapsed
The key metric here is **instructions per cycle** (insns per cycle: IPC), which shows on average how many instructions were completed for each CPU clock cycle. The higher, the better (a simplification). The above example of 0.78 sounds not bad (78% busy?) until you realize that this processor's top speed is an IPC of 4.0. This is also known as *4-wide*, referring to the instruction fetch/decode path. Which means, the CPU can retire (complete) four instructions with every clock cycle. So an IPC of 0.78 on a 4-wide system means the CPUs are running at 19.5% of their top speed. The new Intel Skylake processors are 5-wide. There are hundreds more PMCs you can use to dig further: measuring stalled cycles directly by different types.

### In the cloud

If you are in a virtual environment, you might not have access to PMCs, depending on whether the hypervisor supports them for guests. I recently posted about [The PMCs of EC2: Measuring IPC], showing how PMCs are now available for dedicated host types on the AWS EC2 Xen-based cloud.

## Interpretation and actionable items

If your **IPC is < 1.0**, you are likely memory stalled, and software tuning strategies include reducing memory I/O, and improving CPU caching and memory locality, especially on NUMA systems. Hardware tuning includes using processors with larger CPU caches, and faster memory, busses, and interconnects. If your **IPC is > 1.0**, you are likely instruction bound. Look for ways to reduce code execution: eliminate unnecessary work, cache operations, etc. [CPU flame graphs] are a great tool for this investigation. For hardware tuning, try a faster clock rate, and more cores/hyperthreads. For my above rules, I split on an IPC of 1.0. Where did I get that from? I made it up, based on my prior work with PMCs. Here's how you can get a value that's custom for your system and runtime: write two dummy workloads, one that is CPU bound, and one memory bound. Measure their IPC, then calculate their mid point.

## What performance monitoring products should tell you

Every performance tool should show IPC along with %CPU. Or break down %CPU into instruction-retired cycles vs stalled cycles, eg, %INS and %STL. As for top(1), there is tiptop(1) for Linux, which shows IPC by process:
tiptop -                  [root]
Tasks:  96 total,   3 displayed                               screen  0: default

  PID [ %CPU] %SYS    P   Mcycle   Minstr   IPC  %MISS  %BMIS  %BUS COMMAND
 3897   35.3  28.5    4   274.06   178.23  0.65   0.06   0.00   0.0 java
 1319+   5.5   2.6    6    87.32   125.55  1.44   0.34   0.26   0.0 nm-applet
  900    0.9   0.0    6    25.91    55.55  2.14   0.12   0.21   0.0 dbus-daemo

## Other reasons CPU utilization is misleading

It's not just memory stall cycles that make CPU utilization misleading. Other factors include:

- Temperature trips stalling the processor.
- Turboboost varying the clockrate.
- The kernel varying the clock rate with speed step.
- The problem with averages: 80% utilized over 1 minute, hiding bursts of 100%.
- Spin locks: the CPU is utilized, and has high IPC, but the app is not making logical forward progress.

## Update: is CPU utilization actually wrong?

There have been hundreds of comments on this post, here (below) and elsewhere ([1], [2]). Thanks to everyone for taking the time and the interest in this topic. To summarize my responses: I'm not talking about iowait at all (that's disk I/O), and there are actionable items if you know you are memory bound (see above). But is CPU utilization actually wrong, or just deeply misleading? I think many people interpret high %CPU to mean that the processing unit is the bottleneck, which is wrong (as I said earlier). At that point you don't yet know, and it is often something external. Is the metric technically correct? If the CPU stall cycles can't be used by anything else, aren't they therefore "utilized waiting" (which sounds like an oxymoron)? In some cases, yes, you could say that %CPU as an OS-level metric is technically correct, but deeply misleading. With hyperthreads, however, those stalled cycles can now be used by another thread, so %CPU may count cycles as utilized that are in fact available. That's wrong. In this post I wanted to focus on the interpretation problem and suggested solutions, but yes, there are technical problems with this metric as well. You might just say that utilization as a metric was already broken, as Adrian Cockcroft discussed [previously].

## Conclusion

CPU utilization has become a deeply misleading metric: it includes cycles waiting on main memory, which can dominate modern workloads. You can figure out what %CPU really means by using additional metrics, including instructions per cycle (IPC). An IPC < 1.0 likely means memory bound, and an IPC > 1.0 likely means instruction bound. I covered IPC in my [previous post], including an introduction to the Performance Monitoring Counters (PMCs) needed to measure it. Performance monitoring products that show %CPU – which is all of them – should also show PMC metrics to explain what that means, and not mislead the end user. For example, they can show %CPU with IPC, and/or instruction-retired cycles vs stalled cycles. Armed with these metrics, developers and operators can choose how to better tune their applications and systems. [UnixBench]: https://code.google.com/p/byte-unixbench/ [before]: http://www.brendangregg.com/usemethod.html#Apollo [Linux perf]: /perf.html [The PMCs of EC2: Measuring IPC]: /blog/2017-05-04/the-pmcs-of-ec2.html [previous post]: /blog/2017-05-04/the-pmcs-of-ec2.html [CPU flame graphs]: http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html [1]: https://news.ycombinator.com/item?id=14301739 [2]: https://www.reddit.com/r/programming/comments/6a6v8g/cpu_utilization_is_wrong/ [previously]: http://www.hpts.ws/papers/2007/Cockcroft_HPTS-Useless.pdf

The PMCs of EC2: Measuring IPC Brendan Gregg's Blog


IPC and LLC loads with a scaling workload
Performance Monitoring Counters (PMCs) are now publicly available from dedicated host types in the AWS EC2 cloud. PMC nerds worldwide rejoice! (All six of us.) There should be more of us in the future, as with the increasing scale of processors and speed of storage devices, the common bottleneck is moving from disks to the memory subsystem. CPU caches, the MMU, memory busses, and CPU interconnects. These can only be analyzed with PMCs.
Memory is the new disk.
If PMCs are new to you, then in a nutshell they are special hardware counters that can be accessed via processor registers, and enabled and read via certain instructions. PMCs provide low-level CPU performance statistics that aren't available anywhere else. In this post I'll summarize the PMCs available in EC2, which are for dedicated hosts only (eg, m4.16xl, i3.16xl), and I'll demonstrate measuring IPC. Note that PMCs are also known as HPCs (hardware performance counters), and other names as well. ### EC2 Dedicated Host PMCs The PMCs available are the architectural PMCs listed in the [Intel 64 and IA-32 Architectures Developer's Manual: vol. 3B], in section 18.2.1.2 "Pre-defined Architectural Performance Events", Table 18-1 "UMask and Event Select Encodings for Pre-Defined Architectural Performance Events". I've drawn my own table of them below with example event mnemonics. **Architectural PMCs**
| Event Name | UMask | Event Select | Example Event Mask Mnemonic |
|---|---|---|---|
| UnHalted Core Cycles | 00H | 3CH | CPU_CLK_UNHALTED.THREAD_P |
| Instruction Retired | 00H | C0H | INST_RETIRED.ANY_P |
| UnHalted Reference Cycles | 01H | 3CH | CPU_CLK_THREAD_UNHALTED.REF_XCLK |
| LLC Reference | 4FH | 2EH | LONGEST_LAT_CACHE.REFERENCE |
| LLC Misses | 41H | 2EH | LONGEST_LAT_CACHE.MISS |
| Branch Instruction Retired | 00H | C4H | BR_INST_RETIRED.ALL_BRANCHES |
| Branch Misses Retired | 00H | C5H | BR_MISP_RETIRED.ALL_BRANCHES |
What's so special about these seven architectural PMCs? They give you a good overview of key CPU behavior, sure. But Intel have also chosen them as a golden set, to be highlighted first in the PMC manual and their presence exposed via the CPUID instruction. Note that the Intel mnemonic for LLC here is "longest latency cache", but this is also known as "last level cache" or "level 3 cache" (assuming it's L3). ### PMC Usage Before I demonstrate PMCs, it's important to know that there's two very different ways they can be used: - **Counting**: where they provide a count over an interval. - **Sampling**: where based on a number of events, an interrupt can be triggered to sample the program counter or stack trace. Counting is cheap. Sampling costs more overhead based on the rate of the interrupts (which can be tuned by changing the event trigger threshold), and whether you're reading the PC or the whole stack trace. I'll demonstrate PMCs by using counting to measure IPC. ## Measuring IPC Instructions-per-cycle (IPC) is a good starting point for PMC analysis, and is measured by counting the instruction count and cycle count PMCs. (On some systems it is shown as its invert, cycles-per-instruction, CPI.) IPC is like miles-per-gallon for CPUs: how much bang for your buck. The resource here isn't gallons of gasoline but CPU cycles, and the result isn't miles traveled but instructions retired (ie, completed). The more instructions you can complete with your fixed cycles resource, the better. In the interest of keeping this short, I'll gloss over IPC caveats. There are situations where it can be misleading, like an increase of IPC because your program suffers more spin lock contention, and those spin instructions happen to be very fast. Just like MPG can be misleading, as it can be influenced by the route driven, not just the car's own characteristics. I'll use the Linux [perf] command to measure IPC of a program, noploop, which loops over a series of NOP instructions (no op):
# perf stat ./noploop
^C./noploop: Interrupt

 Performance counter stats for './noploop':

       2418.149339      task-clock (msec)         #    1.000 CPUs utilized          
                 3      context-switches          #    0.001 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                39      page-faults               #    0.016 K/sec                  
     6,245,387,593      cycles                    #    2.583 GHz                      (75.03%)
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
    24,766,697,057      instructions              #    3.97  insns per cycle          (75.02%)
        14,738,991      branches                  #    6.095 M/sec                    (75.02%)
            24,744      branch-misses             #    0.17% of all branches          (75.04%)

       2.418826663 seconds time elapsed
I've highlighted IPC ("insns per cycle") in the output. I like noploop as a sanity test. Because this processor is 4-wide (instruction prefetch/decode width), it can process a maximum of 4 instructions with every CPU cycle. Since NOPs are the fastest possible instruction (they do nothing), they can be retired at an IPC rate of 4.0. This goes down to 3.97 with a little loop logic (the program is looping over a block of NOPs). The "<not supported>" metrics are cases where the PMC is not currently available (they are outside of the architectural set, in this case). You can also measure the entire system, using perf with -a. This time I'm measuring a software build:
# perf stat -a -- sleep 10

 Performance counter stats for 'system wide':

     641398.723351      task-clock (msec)         #   64.116 CPUs utilized            (100.00%)
           379,651      context-switches          #    0.592 K/sec                    (100.00%)
            51,546      cpu-migrations            #    0.080 K/sec                    (100.00%)
        13,423,039      page-faults               #    0.021 M/sec                  
 1,433,972,173,374      cycles                    #    2.236 GHz                      (75.02%)
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
 1,118,336,816,068      instructions              #    0.78  insns per cycle          (75.01%)
   249,644,142,804      branches                  #  389.218 M/sec                    (75.01%)
     7,791,449,769      branch-misses             #    3.12% of all branches          (75.01%)

      10.003794539 seconds time elapsed
That's reporting an IPC of 0.78. perf can also print statistics over time (-I), but the output becomes verbose. I've written a quick wrapper to clean this up and summarize the architectural PMCs on a single line. It's [pmcarch] \(first version\):
# pmcarch 1
CYCLES        INSTRUCTIONS    IPC BR_RETIRED   BR_MISPRED  BMR% LLCREF      LLCMISS     LLC%
90755342002   64236243785    0.71 11760496978  174052359   1.48 1542464817  360223840  76.65
75815614312   59253317973    0.78 10665897008  158100874   1.48 1361315177  286800304  78.93
65164313496   53307631673    0.82 9538082731   137444723   1.44 1272163733  268851404  78.87
90820303023   70649824946    0.78 12672090735  181324730   1.43 1685112288  343977678  79.59
76341787799   50830491037    0.67 10542795714  143936677   1.37 1204703117  279162683  76.83
[...]
This is from a production instance on EC2, and each line of output is a one second summary.

### Interpreting IPC

For real-world applications, here's how I'd interpret the IPC:

- **IPC < 1**: likely stall cycle bound, also likely memory bound (more PMCs can confirm). Stall cycles are when the CPU isn't making forward progress, likely because it's waiting on memory I/O. In this case, look to tune memory usage: allocate fewer or smaller objects, do zero copy, look at NUMA and memory placement tuning. A [CPU flame graph] will show which code is on-CPU during these stall cycles, and should give clues for where to look for memory usage.
- **IPC > 1**: likely instruction bound. Look to tune instructions: a [CPU flame graph] will show which code is on-CPU doing instructions: find ways to reduce executed code.

You can combine IPC and flame graphs to show everything at once: [CPI flame graphs] \(CPI is IPC inverted\). This requires using the sampling mode of PMCs to capture stack traces on overflow events. There are, however, caveats with doing this which I'll get to in another post. Note that I'm using these on a modern Linux kernel, 4.4+. There was a problem on older kernels (3.x) where PMCs would be measured incorrectly, leading to a bogus IPC measurement.

## RxNetty Study

In 2015 I found PMCs crucial in fully understanding the performance differences between RxNetty and Tomcat as they scaled with client load. Tomcat serves requests using threads for each connection, whereas RxNetty uses event loop threads. Between low and high client counts, Tomcat's CPU cycles per request was largely unchanged, whereas RxNetty became *more* efficient and consumed *less* CPU per request as clients increased. Can you guess why?


IPC and CPU/req as load scales

Click for a slide deck where I explain why on slides 25-27 (these slides are from the [WSPerfLab] repository, summarizing a study by myself, Nitesh Kant, and Ben Christensen.) We knew that we had a 46% higher request rate, and so we began a study to identify and quantify the reasons why. There was 5% caused by X, and 3% caused by Y, and so on. But after weeks of study, we fell short: over 10% of that 46% remained unexplained. I checked and rechecked our numbers, but fell short every time. It was driving me nuts, and casting doubt on everything we'd found so far. With PMCs I was able to identify this last performance difference, and the numbers finally added up! We could break down the 46% difference and explain every percentage point. It was very satisfying. It also emphasized the importance of PMCs: understanding CPU differences is a common task in our industry, and without PMCs you're always missing an important part of the puzzle. This study was done on a physical machine, not EC2, where I'd measured and studied dozens of PMCs. But the crucial PMCs I included in that slide deck summary were the measurements of IPC and the LLC, which are possible with the architectural PMCs now available in EC2.

## How is this even possible in the cloud?

You might be wondering how cloud guests can read PMCs at all. It works like this: PMCs are managed via the privileged instructions RDMSR and WRMSR for configuration (which I wrote about in [The MSRs of EC2]), and RDPMC for reading. A privileged instruction causes a guest exit, which is handled by the hypervisor. The hypervisor can then run its own code, and configure PMCs if the actual hardware allows, and save and restore their state whenever it context switches between guests. Mainstream Xen supported this years ago, with its virtual Performance Monitoring Unit (vPMU). It is configured using vpmu=on in the Xen boot line. However, it is rarely turned on. Why? There are hundreds of PMCs, and they are all exposed with vpmu=on. Could some pose a security risk? A number of papers have been published showing PMC side-channel attacks, whereby measuring certain PMCs while sending input to a known target program can eventually leak bits of the target's state. While these are unlikely in practice, and such attacks aren't limited to PMCs (eg, there's also timing attacks), you can understand a paranoid security policy not wanting to enable all PMCs by default. In the cloud it's even harder to do these attacks, as when Xen context switches between guests it switches out the PMCs as well. But still, why enable all the PMCs if they aren't all needed? Imagine if we could create a whitelist of allowed PMCs for secure environments. This is a question I pondered in late 2015, and I ended up contributing the [x86/VPMU: implement ipc and arch filter flags] patch to Xen to provide two whitelist sets as options, chosen with the vpmu boot flag:

- **ipc**: Enough PMCs to measure IPC only. Minimum set.
- **arch**: The seven architectural PMCs (see table above). Includes IPC.

More sets can be added. For example, I can imagine an extended set to allow some Intel vTune analysis. AWS just enabled architectural PMCs. My patch set might be a useful example of how such a whitelist can be implemented, although how EC2 implemented it might differ from this.

## Conclusion

PMCs are crucial for analyzing a (if not *the*) modern system bottleneck: memory I/O. A set of PMCs are now available on dedicated hosts in the EC2 cloud, enough for high-level analysis of memory I/O issues.
I used them in this post to measure IPC, which can identify if your applications are likely memory bound or instruction bound, directing further tuning efforts. I've worked with PMCs before, and the sort of wins they help you find can range from small single digit percentages to as much as 2x. The net result for companies like Netflix is that our workloads will run faster on EC2 because we can use PMCs to find these performance wins. Consider that, next time someone is comparing clouds by microbenchmarking alone. It's not just out-of-the-box performance that matters, it's also your ability to observe and tune your applications.
A cloud you can't analyze is a slower cloud.
Thanks to those at Netflix for supporting my work on this, the Xen community for their vpmu work, and everyone at Amazon and Intel who made this happen! (Thanks Joe, Matt, Subathra, Rosana, Uwe, Laurie, Coburn, Ed, Mauricio, Steve, Artyom, Valery, Jan, Boris, and more.) (Yes, I'm happy.) [The MSRs of EC2]: http://www.brendangregg.com/blog/2014-09-15/the-msrs-of-ec2.html [Intel 64 and IA-32 Architectures Developer's Manual: vol. 3B]: http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html [perf]: /perf.html [CPU flame graph]: /FlameGraphs/cpuflamegraph.html [CPI flame graphs]: /blog/2014-10-31/cpi-flame-graphs.html [x86/VPMU: implement ipc and arch filter flags]: http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=e3cce1799df2f957dfa00f84a5315cbf896490fe [WSPerfLab]: https://github.com/Netflix-Skunkworks/WSPerfLab/tree/master/test-results [pmcarch]: https://github.com/brendangregg/pmc-cloud-tools/blob/master/pmcarch

Exclusive Or Character Josef "Jeff" Sipek

A couple of years ago I blogged about the CCS instruction in the Apollo Guidance Computer. Today I want to tell you about the XC instruction from the System/360 ISA.

Many ISAs have some sort of xor instruction. The 360 is no different. It offers several different xor instructions which differ in the type of operands that they operate on. In all cases, the operation they perform could be summarized as (using C syntax):

A ^= B;

That is, one of the operands is used as both a source and a destination.

There are the boring X (reg ^= memory), XR (reg ^= reg), and XI (memory byte ^= immediate). Then there is XC, which is what inspired this post. XC, or Exclusive Or Character, takes two memory locations and a length and performs what is effectively a byte-by-byte xor of the two buffers. (The hardware is smart enough to operate on bigger chunks of memory, but the effect is as if it were done a byte at a time.) In assembly XC looks like:

XC d1(l,b1),d2(b2)

The d's are 12-bit unsigned displacements, while the b's specify the registers holding the base addresses. For each operand the actual address is dX plus the value of the bX register. The l is a length field which encodes a length between 1 and 256 (the 8-bit field stores the length minus one).

To use more C pseudocode, XC does:

void XC(unsigned char *op1, size_t len, unsigned char *op2)
{
	while (len--) {
		*op1 ^= *op2;
		op1++;
		op2++;
	}
}

(This pseudocode ignores the condition code calculation and exception generation, which are not relevant to the discussion.)

This by itself is neat but not very exciting…until you remember that xor can be used to zero out a register. You can use XC to zero out up to 256 bytes of memory. It turns out this idiom is used pretty often in handwritten assembly, and compilers such as gcc even produce such instructions without any special effort on the programmer's part.

For example, in HVF I have this line:

memset(&psw, 0, sizeof(struct psw));

Which GCC helpfully turns into (struct psw is 16 bytes in size):

xc      160(16,%r15),160(%r15)

When I first saw that line in the disassembly of HVF years ago, it blew my mind. It is elegant, fast thanks to the microarchitectural optimizations, and once you are used to the idiom it is clear what it does. I hope your mind was as blown as mine. Till next time!

USENIX/LISA 2016 Linux bcc/BPF Tools Brendan Gregg's Blog

For USENIX LISA 2016 I gave a talk that was years in the making, on Linux bcc/BPF analysis tools.

"Time to rethink the kernel" - Thomas Graf
Thomas has been using BPF to create new network and application security technologies (project [Cilium]), and to build something that's starting to look like microservices in the kernel ([video]). I'm using it for advanced performance analysis tools that do tracing and profiling. Enhanced BPF might still be new, but it's already delivering new technologies and making us rethink what we can do with the kernel.

My LISA 2016 talk begins with a 15-minute demo, showing the progression from ftrace, to perf\_events, to BPF (due to the audio/video settings, this demo is a little hard to follow in the full video, but there's a separate recording of just the demo here: [Linux tracing 15 min demo]). Below is the full talk video (youtube):
The slides are on [slideshare] \([PDF]\):
## Installing bcc/BPF

To try out BPF for performance analysis you'll need to be on a newer kernel: at least 4.4, preferably 4.9. The main front end is currently [bcc] (BPF compiler collection), and there are [install instructions] on github, which keep getting improved. For Ubuntu, installation is:
echo "deb [trusted=yes] https://repo.iovisor.org/apt/xenial xenial-nightly main" | sudo tee /etc/apt/sources.list.d/iovisor.list
sudo apt-get update
sudo apt-get install bcc-tools
There's currently a pull request to add snap instructions, as there are nightly builds for snappy as well.

## Listing bcc/BPF Tools

This install will add various performance analysis and debugging tools to /usr/share/bcc/tools. Since some require a very recent kernel (4.6, 4.7, or 4.9), there's a subdirectory, /usr/share/bcc/tools/old, which has older versions of the same tools that work on Linux 4.4 (albeit with some caveats).
# ls /usr/share/bcc/tools
argdist       cpudist            filetop         offcputime   solisten    tcptop    vfsstat
bashreadline  cpuunclaimed       funccount       offwaketime  sslsniff    tplist    wakeuptime
biolatency    dcsnoop            funclatency     old          stackcount  trace     xfsdist
biosnoop      dcstat             gethostlatency  oomkill      stacksnoop  ttysnoop  xfsslower
biotop        deadlock_detector  hardirqs        opensnoop    statsnoop   ucalls    zfsdist
bitesize      doc                killsnoop       pidpersec    syncsnoop   uflow     zfsslower
btrfsdist     execsnoop          llcstat         profile      tcpaccept   ugc
btrfsslower   ext4dist           mdflush         runqlat      tcpconnect  uobjnew
cachestat     ext4slower         memleak         runqlen      tcpconnlat  ustat
cachetop      filelife           mountsnoop      slabratetop  tcplife     uthreads
capable       fileslower         mysqld_qslower  softirqs     tcpretrans  vfscount
Just by listing the tools, you might spot something you want to start with (ext4*, tcp*, etc), or you can browse the bcc tracing tools diagram in the talk slides.
## Using bcc/BPF

If you don't have a good starting point, in the [bcc Tutorial] I included a generic checklist of the first eleven tools to try. I also included this in my LISA talk:
  1. execsnoop
  2. opensnoop
  3. ext4slower (or btrfs*, xfs*, zfs*)
  4. biolatency
  5. biosnoop
  6. cachestat
  7. tcpconnect
  8. tcpaccept
  9. tcpretrans
  10. runqlat
  11. profile
Most of these have usage messages, and are easy to use. They'll need to be run as root. For example, execsnoop to trace new processes:
# /usr/share/bcc/tools/execsnoop
PCOMM            PID    PPID   RET ARGS
grep             69460  69458    0 /bin/grep -q g2.
grep             69462  69458    0 /bin/grep -q p2.
ps               69464  58610    0 /bin/ps -p 308
ps               69465  100871   0 /bin/ps -p 301
sleep            69466  58610    0 /bin/sleep 1
sleep            69467  100871   0 /bin/sleep 1
run              69468  5160     0 ./run
[...]
And biolatency to record an in-kernel histogram of disk I/O latency:
# /usr/share/bcc/tools/biolatency 
Tracing block device I/O... Hit Ctrl-C to end.
^C
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 64       |**********                              |
       512 -> 1023       : 248      |****************************************|
      1024 -> 2047       : 29       |****                                    |
      2048 -> 4095       : 18       |**                                      |
      4096 -> 8191       : 42       |******                                  |
      8192 -> 16383      : 20       |***                                     |
     16384 -> 32767      : 3        |                                        |
Here's its USAGE message:
# /usr/share/bcc/tools/biolatency -h
usage: biolatency [-h] [-T] [-Q] [-m] [-D] [interval] [count]

Summarize block device I/O latency as a histogram

positional arguments:
  interval            output interval, in seconds
  count               number of outputs

optional arguments:
  -h, --help          show this help message and exit
  -T, --timestamp     include timestamp on output
  -Q, --queued        include OS queued time in I/O time
  -m, --milliseconds  millisecond histogram
  -D, --disks         print a histogram per disk device

examples:
    ./biolatency            # summarize block I/O latency as a histogram
    ./biolatency 1 10       # print 1 second summaries, 10 times
    ./biolatency -mT 1      # 1s summaries, milliseconds, and timestamps
    ./biolatency -Q         # include OS queued time in I/O time
    ./biolatency -D         # show each disk device separately
In /usr/share/bcc/tools/doc, or the [tools subdirectory] on github, you'll find \_example.txt files for every tool, which have example output and discussion. Check them out! There are also man pages under man/man8. For more information, please watch my LISA talk at the top of this post when you get a chance, where I explain Linux tracing, BPF, and bcc, and tour various tools.

## What's Next?

My prior talk at LISA 2014 was [New Tools and Old Secrets (perf-tools)], where I showed similar performance analysis tools using ftrace, an older tracing framework in Linux. I'm still using ftrace, not just on older kernels, but also when it's more efficient (eg, kernel function counting using the funccount tool). BPF is programmatic, and can do things that ftrace can't. Having done ftrace at LISA 2014 and BPF at LISA 2016, you might wonder what I'll propose for LISA 2018. We'll see. I could be covering a higher-level BPF front end (eg, [ply], if it gets finished), or a BPF GUI (eg, via Netflix Vector), or I could be focused on something else entirely. Tracing was my priority when Linux lacked various capabilities, but now that's done, there are other important technologies to work on...

[youtube]: https://www.youtube.com/watch?v=GsMs3n8CB6g
[Linux tracing 15 min demo]: https://www.youtube.com/watch?v=GsMs3n8CB6g
[Linux tracing in 15 minutes]: /blog/2016-12-27/linux-tracing-in-15-minutes.html
[PDF]: /Slides/LISA2016_BPF_tools_16_9.pdf
[slideshare]: http://www.slideshare.net/brendangregg/linux-4x-tracing-tools-using-bpf-superpowers
[previous post]: /blog/2017-04-23/usenix-lisa-2013-flame-graphs.html
[video]: https://www.youtube.com/watch?v=ilKlmTDdFgk
[Cilium]: https://github.com/cilium/cilium
[New Tools and Old Secrets (perf-tools)]: /blog/2015-03-17/usenix-lisa-2014-linux-ftrace-perf-tools.html
[ply]: https://github.com/iovisor/ply
[install instructions]: https://github.com/iovisor/bcc/blob/master/INSTALL.md
[tools subdirectory]: https://github.com/iovisor/bcc/tree/master/tools
[bcc Tutorial]: https://github.com/iovisor/bcc/blob/master/docs/tutorial.md
[bcc]: https://github.com/iovisor/bcc

OmniTribblix The Trouble with Tribbles...

In Tribblix, it's a basic principle that I ship upstream software unmodified. I don't impose my own views on installation layout, nor do I customize it. Generally, I apply patches only to make stuff compile.

This means that what you see in Tribblix is exactly what the upstream author intended, and not some distro-specific bastardization of it.

It also makes my life easier: I don't have to maintain patches, and updating software is much easier if it's unmodified.

In particular, I use an absolutely vanilla illumos-gate. (For a long time it differed only in that I had the fix for 5188 applied, relevant because Tribblix actually uses SVR4 packaging, but now that the fix has been integrated I don't even need to do that.)

Again, this makes my life easier. (When you're maintaining a distro on your own in your spare time, making decisions that simplify your job is essential.)

But it also has another benefit: because I have no "special" features that I've added, I'm not tied to one particular version or variant or commit of illumos. Any version of illumos-gate will do just fine. When it comes time to make a release, I just clone the gate, build, and go.
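For the curious, that's essentially the standard illumos-gate nightly build (a rough sketch: the env-file settings, required build tools, and the Tribblix-specific packaging steps are omitted; see the illumos build documentation for the real prerequisites):

git clone https://github.com/illumos/illumos-gate.git
cd illumos-gate
cp usr/src/tools/env/illumos.sh .
# edit illumos.sh to suit (compilers, NIGHTLY_OPTIONS, paths, ...)
# then kick off the build (requires the onbld build tools to be installed)
/opt/onbld/bin/nightly illumos.sh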

What I could do, then, is build an instance of Tribblix atop some other fork of the gate. For example, illumos-omnios.

I did just that: built the gate (it needed a couple of changes to Makefiles because of the way that perl and snmp are slightly different in OmniOS than they are in Tribblix), created packages, built an ISO, and booted and installed it in VirtualBox.

As expected, it just works.

But just demonstrating that it works isn't really the reason I wanted to do this. What I'm really after is the LX brand, which has been integrated into current OmniOS.

Installing an LX zone requires a Linux image. The original (Joyent) work was built around their own deployment mechanism, using ZFS images, but OmniOS also supports installing from a tarball, so as soon as LX was available there that's what I used. The easiest way to create a Linux image is to create a Docker container set up the way you like it, and then export it to a tarball. I did that for Alpine and installed a zone based on that.
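Roughly, that image-creation step looks like this (a sketch: the container and file names are placeholders, the zone must first be configured with the lx brand via zonecfg, and the install options should be checked against the OmniOS LX zone documentation):

# build a root filesystem tarball from an Alpine container
docker create --name alpine-lx alpine:latest
docker export alpine-lx > alpine.tar
# install the (already configured) lx-branded zone from the tarball
# (illustrative; check zoneadm(1M) and the OmniOS docs for the exact options)
zoneadm -z lx1 install -s /path/to/alpine.tar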

Then you can do very simple things like:

# zlogin lx1 /bin/uname -a 
Linux lx1 4.4 BrandZ virtual linux x86_64 Linux

It's an attractive idea to simply use this as the base for the next Tribblix release. However, that requires illumos-omnios to be supported in the long term, which is currently at risk.

Modern Mercurial Josef "Jeff" Sipek

I’ve been using both Git and Mercurial since they were first released in 2005. I’ve messed with the internals of both, but I always had a preference for Mercurial (its user interface is cleaner, its design is well thought-out, and so on). So, it should be no surprise that I felt a bit sad every time I heard that some project chose Git over Mercurial (or worse yet, migrated from Mercurial to Git). At the same time, I could see Git improving release after release—but Mercurial did not seem to. Seem is the operative word here.

A couple of weeks ago, I realized that more and more of my own repositories have been Git based. Not for any particular reason other than that I happened to type git init instead of hg init. After some reflection, I decided that I should convert a number of these repositories from Git to Mercurial. The conversion itself was painless thanks to the most excellent hggit extension that lets you clone, pull, and push Git repositories with Mercurial. (I just cloned the Git repository with a hg clone and then cleaned up some of the mess manually—for example, I don’t need the bookmark corresponding to the one and only branch in the original Git repository.) Then the real fun began.

I resumed work on my various projects, but now with the brand-new Mercurial repositories. Soon after, I started hitting various quirks with the Mercurial UI, and realized that the workflow I was using wasn't really aligned with it. Undeterred, I looked for solutions. I enabled the pager extension, the color extension, overrode some of the default colors to be less offensive (and easier to read), and enabled the shelve, rebase, and histedit extensions to (along with mq) let me do some minor history rewriting while I iteratively work on changes. (I learned about and switched to the evolve extension soon after.) With each tweak, the user experience got better and better.
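For the record, the relevant part of an ~/.hgrc for that kind of setup might look something like this (a sketch; these are the standard bundled extensions, and third-party ones such as hggit or evolve are enabled the same way once installed):

[extensions]
pager =
color =
shelve =
rebase =
histedit =
mq =

[pager]
pager = less -FRX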

Then it suddenly hit me—before these tweaks, I had been using Mercurial like it’s still 2005!

I think this is a very important observation. Mercurial didn’t seem to be improving because none of the user-visible changes were forced onto the users. Git, on the other hand, started with a dreadful UI so it made sense to enable new features by default to lessen the pain.

One could say that Mercurial took the Unix approach—simple and not exactly friendly by default, but incredibly powerful if you dig in a little. (This extensibility is why Facebook chose Mercurial over Git as a Subversion replacement.)

Now I wonder if some of the projects chose Git over Mercurial at least partially because by default Mercurial has been a bit…spartan.

With my .hgrc changes, I get exactly the information I want in a format that’s even better than what Git provided me. (Mercurial makes so much possible via its templating engine and the revsets language.)
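For example, a query along these lines (the user name is a placeholder) combines a revset with a custom template:

hg log -r 'branch(default) and user("jeff")' \
    --template '{node|short} {date|shortdate} {desc|firstline}\n'

The revset selects which changesets to show, and the template controls exactly how each one is printed.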

So, what does all this mean for Mercurial? It's hard to say, but I'm happy to report that there are a number of good improvements that should land in the upcoming 4.2 release scheduled for early May. For example, the pager and color functionality is moving into the core and will be on by default.

Finally, I like my current Mercurial environment quite a lot. The hggit extension is making me seriously consider using Mercurial when dealing with Git repositories that I can’t convert.