Chapter 10: Profiling and Benchmarking Under Xen
- Disraeli was pretty close: actually, there are Lies, Damn lies,
- Statistics, Benchmarks, and Delivery dates.
- —Anonymous, attributed to Usenet
We’ve made a great fuss over how Xen, as a virtualization technology, offers better performance than competing technologies. However, when it comes to proofs and signs, we have been waving our hands and citing authorities. We apologize! In this chapter we will discuss how to measure Xen’s performance for yourself, using a variety of tools.
We’ll look closely at three general classes of performance monitoring, each of which you might use for a different reason. First, we have benchmarking Xen domU performance. If you are running a hosting service (or buying service from a hosting service), you need to see how the Xen image you are providing (or renting) stacks up to the competition. In this category, we have general-purpose synthetic benchmarks.
Second, we want to be able to benchmark Xen versus other virtualization solutions (or bare hardware) for your workload because Xen has both strengths and weaknesses compared to other virtualization packages. These application benchmarks will help to determine whether Xen is the best match for your application.
Third, sometimes you have a performance problem in your Xen-related or kernel-related program, and you want to pinpoint the bits of code that are moving slowly. This category includes profiling tools, such as OProfile. (Xen developers may also ask you for OProfile output when you ask about performance issues on the xen-devel list.)
Although some of these techniques might come in handy while troubleshooting, we haven’t really aimed our discussion here at solving problems; rather, we try to present an overview of the tools for various forms of speed measurement. See Chapter 15 for more specific troubleshooting suggestions.
A Benchmarking Overview
We’ve seen that the performance of a paravirtualized Xen domain running most workloads approximates that of the native machine. However, there are cases where this isn’t true or where this fuzzy simulacrum of the truth isn’t precise enough. In these cases, we move from prescientific assertion to direct experimentation—that is, using benchmarking tools and simulators to find actual, rather than theoretical, performance numbers.
As we’re sure you know, generalized benchmarking is, if not a “hard problem,” at least quite difficult. If your load is I/O bound, testing the CPU will tell you nothing you need to know. If your load is IPC-bound or blocking on certain threads, testing the disk and the CPU will tell you little. Ultimately, the best results come from benchmarks that use as close to real-world load as possible.
The very best way to test, for example, the performance of a server that serves an HTTP web application would be to sniff live traffic hitting your current HTTP server, and then replay that data against the new server, speeding up or slowing down the replay to see if you have more or less capacity than before.
This, of course, is rather difficult both to do and to generalize. Most people go at least one step into “easier” and “more general.” In the previous example, you might pick a particularly heavy page (or a random sampling of pages) and test the server with a generalized HTTP tester, such as Siege. This usually still gives you pretty good results, is a lot easier, and has fewer privacy concerns than running the aforementioned live data.
There are times, however, when a general benchmark, for all its inadequacies, is the best tool. For example, if you are trying to compare two virtual private server providers, a standard, generalized test might be more readily available than a real-world, specific test. Let’s start by examining a few of the synthetic benchmarks that we’ve used.
One classic benchmarking tool is the public domain UnixBench released by BYTE magazine in 1990, available from http://www.tux.org/pub/tux/niemi/unixbench/. The tool was last updated in 1999, so it is rather old. However, it seems to be quite popular for benchmarking VPS providers: by comparing one provider’s UnixBench number to another’s, you can get a rough idea of the capacity of the VM they’re providing.
UnixBench is easy to install—download the source, untar it, build it, and run it.
# tar zxvf unixbench-4.1.0.tgz
# cd unixbench-4.1.0
# make
# ./Run
(That last command is a literal “Run”—it’s a script that cycles through the various tests, in order, and outputs results.)
You may get some warnings, or even errors, about the -fforce-mem option that UnixBench uses, depending on your compiler version. If you edit the Makefile to remove all instances of -fforce-mem, UnixBench should build successfully.
We recommend benchmarking the Xen instance in single-user mode if possible. Here’s some example output:
INDEX VALUES
TEST                                      BASELINE     RESULT      INDEX

Dhrystone 2 using register variables      116700.0  1988287.6      170.4
Double-Precision Whetstone                    55.0      641.4      116.6
Execl Throughput                              43.0     1619.6      376.7
File Copy 1024 bufsize 2000 maxblocks       3960.0   169784.0      428.7
File Copy 256 bufsize 500 maxblocks         1655.0    53117.0      320.9
File Copy 4096 bufsize 8000 maxblocks       5800.0   397207.0      684.8
Pipe Throughput                            12440.0   233517.3      187.7
Pipe-based Context Switching                4000.0    75988.8      190.0
Process Creation                             126.0     6241.4      495.3
Shell Scripts (8 concurrent)                   6.0      173.6      289.3
System Call Overhead                       15000.0   184753.6      123.2
                                                               =========
FINAL SCORE...............................                         264.5
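To make the table easier to read, here is a minimal sketch of how the INDEX column and the final score appear to be derived. The formulas are inferred from the numbers in this sample run rather than taken from the UnixBench documentation: each index looks like the result divided by the baseline and multiplied by 10, and the final score matches the geometric mean of the indexes. The function names are our own.

```python
import math

# (test name, baseline, result) copied from the sample run above.
RESULTS = [
    ("Dhrystone 2 using register variables", 116700.0, 1988287.6),
    ("Double-Precision Whetstone", 55.0, 641.4),
    ("Execl Throughput", 43.0, 1619.6),
    ("File Copy 1024 bufsize 2000 maxblocks", 3960.0, 169784.0),
    ("File Copy 256 bufsize 500 maxblocks", 1655.0, 53117.0),
    ("File Copy 4096 bufsize 8000 maxblocks", 5800.0, 397207.0),
    ("Pipe Throughput", 12440.0, 233517.3),
    ("Pipe-based Context Switching", 4000.0, 75988.8),
    ("Process Creation", 126.0, 6241.4),
    ("Shell Scripts (8 concurrent)", 6.0, 173.6),
    ("System Call Overhead", 15000.0, 184753.6),
]


def index_value(baseline, result):
    """Scale a raw result so that the baseline machine would score 10."""
    return result / baseline * 10.0


def final_score(rows):
    """Geometric mean of the per-test index values."""
    logs = [math.log(index_value(b, r)) for _, b, r in rows]
    return math.exp(sum(logs) / len(logs))


for name, baseline, result in RESULTS:
    print(f"{name:40s} {index_value(baseline, result):7.1f}")
print(f"FINAL SCORE {final_score(RESULTS):.1f}")
```

A geometric mean keeps one unusually fast test (here, the large-block file copy) from dominating the score, which is worth remembering when comparing providers on the final number alone.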
Armed with a UnixBench number, you at least have some basis for comparison between different VPS providers. It’s not going to tell you much about the specific performance you’re going to get, but it has the advantage that it is a widely published, readily available benchmark.
Other tools, such as netperf and Bonnie++, can give you more detailed performance information.
Analyzing Network Performance
One popular tool for measuring low-level network performance is netperf. This tool supports a variety of performance measurements, with a focus on measuring the efficiency of the network implementation. It’s also been used in Xen-related papers. For one example, see “The Price of Safety: Evaluating IOMMU Performance” by Muli Ben-Yehuda et al.
First, download netperf from http://netperf.org/netperf/DownloadNetperf.html. We picked up version 2.4.4.
# wget ftp://ftp.netperf.org/netperf/netperf-2.4.4.tar.bz2
Untar it and enter the netperf directory.
# tar xjvf netperf-2.4.4.tar.bz2
# cd netperf-2.4.
Configure, build, and install netperf. (Note that these directions are a bit at variance with the documentation; the documentation claims that /opt/netperf is the hard-coded install prefix, whereas it seems to install in /usr/local for me. Also, the manual seems to predate netperf’s use of Autoconf.)
# ./configure
# make
# su
# make install
netperf works by running the client, netperf, on the machine being benchmarked. netperf connects to a netserver daemon and tests the rate at which it can send and receive data. So, to use netperf, we first need to set up netserver.
In the standard service configuration, netserver would run under inetd; however, inetd is obsolete. Many distros don’t even include it by default. Besides, you probably don’t want to leave the benchmark server running all the time. Instead of configuring inetd, therefore, run netserver in standalone mode:
# /usr/local/bin/netserver
Starting netserver at port 12865
Starting netserver at hostname 0.0.0.0 port 12865 and family AF_UNSPEC
Now we can run the netperf client with no arguments to perform a 10-second test with the local daemon.
# netperf
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost (127.0.0.1) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.01    10516.33
Okay, looks good. Now we’ll test from the dom0 to this domU. To do that, we install the netperf binaries as described previously and run netperf with the -H option to specify a target host (in this case, .74 is the domU we’re testing against):
# netperf -H 192.0.2.74,ipv4
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.0.2.74 (192.0.2.74) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00      638.59
Cool. Not as fast, obviously, but we expected that. Now from another physical machine to our test domU:
# netperf -H 192.0.2.66
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.0.2.66 (192.0.2.66) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.25       87.72
Ouch. Well, so how much of that is Xen, and how much is the network we’re going through? To find out, we’ll run the netserver daemon on the dom0 hosting the test domU and connect to that:
# netperf -H 192.0.2.74
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.0.2.74 (192.0.2.74) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.12       93.66
It could be worse, I guess. The moral of the story? xennet introduces a noticeable but reasonable overhead. Also, netperf can be a useful tool for discovering the actual bandwidth you’ve got available. In this case the machines are connected via a 100Mbit connection, and netperf lists an actual throughput of 93.66Mbits/second.
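If you run these comparisons often, it can help to pull the throughput figure out of the report automatically. The following is a small sketch, not part of netperf itself: it assumes the result row is the last non-blank line of the report and that throughput is its final column, as in the runs above. The function name is hypothetical.

```python
def throughput_mbits(netperf_output):
    """Return the throughput figure from a netperf TCP stream report.

    Assumes the result row is the last non-blank line and that throughput
    is its final column, as in the sample runs in this chapter.
    """
    lines = [ln for ln in netperf_output.splitlines() if ln.strip()]
    return float(lines[-1].split()[-1])


# The report from the run against the dom0, as shown above.
SAMPLE = """\
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.12       93.66
"""

to_dom0 = throughput_mbits(SAMPLE)  # remote machine to dom0
to_domu = 87.72                     # remote machine to domU, from the earlier run

overhead = 1.0 - to_domu / to_dom0
print(f"domU reaches {to_domu / to_dom0:.1%} of dom0 throughput "
      f"({overhead:.1%} lower)")
print(f"dom0 uses {to_dom0 / 100.0:.1%} of the 100Mbit link")
```

Comparing against the dom0 figure rather than the nominal link speed separates the cost of the virtual network path from the limits of the physical network.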
Measuring Disk Performance with Bonnie++
One of the major factors in a machine’s overall performance is its disk subsystem. By exercising its hard drives, we can get a useful metric to compare Xen providers or Xen instances with, say, VMware guests.
We, like virtually everyone else on the planet, use Bonnie++ to measure disk performance. Bonnie++ attempts to measure both random and sequential disk performance and does a good job simulating real-world loads. This is especially important in the Xen context because of the degree to which domains are partitioned—although domains share resources, there’s no way for them to coordinate resource use.
One illustration of this point is that if multiple domains are trying to access a platter simultaneously, what looks like sequential access from the viewpoint of one VM becomes random accesses to the disk. This makes things like seek time and the robustness of your tagged queuing system much more important. To test the effect of these optimizations on domU performance, you’ll probably want a tool like Bonnie++.
The Bonnie++ author maintains a home page at http://www.coker.com.au/bonnie++/. Download the source package, build it, and install it:
# wget http://www.coker.com.au/bonnie++/bonnie++-1.03c.tgz
# tar xzvf bonnie++-1.03c.tgz
# cd bonnie++-1.03c
# make
# make install
At this point you can simply invoke Bonnie++ with a command such as:
# /usr/local/sbin/bonnie++
This command will run some tests, printing status information as it goes along, and eventually generate output like this:
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
alastor       2512M 20736  76 55093  14 21112   5 26385  87 55658   6 194.9   0
...........
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                256 35990  89 227885 85 16877  28 34146  84 334227 99  5716  10
Note that some tests may simply output a row of pluses. This indicates that the machine finished them in less than 500 ms. If that happens, make the workload more difficult. For example, you might specify something like:
# /usr/local/sbin/bonnie++ -d . -s 2512 -n 256
This specifies writing 2512MB files for I/O performance tests. (This is the default file size, which is twice the RAM size on this particular machine. This is important to ensure that we’re not just exercising RAM rather than disk.) It also tells Bonnie++ to create 256*1024 files in its file creation tests.
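The sizing rule can be written down as a short helper. This is only a sketch: the function name is our own, and the 1256MB RAM figure is inferred from the statement that 2512MB is twice the RAM on the test machine.

```python
def bonnie_args(ram_mb, directory=".", files_k=256):
    """Build a bonnie++ argument list with a test file twice the RAM size.

    ram_mb is the memory visible to the domain, in megabytes; files_k is
    the -n value, in units of 1024 files.
    """
    size_mb = 2 * ram_mb
    return ["bonnie++", "-d", directory, "-s", str(size_mb), "-n", str(files_k)]


# 1256MB of RAM (inferred) gives the 2512MB file size used in the example.
args = bonnie_args(1256)
print(" ".join(args))
print("files created:", 256 * 1024)
```

If you resize a domU's memory between runs, recompute the file size as well; otherwise the page cache may absorb part of the workload and inflate the results.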
We also recommend reading Bonnie++’s online manual, which includes a fair amount of pithy benchmarking wisdom, detailing why the author chose to include the tests that he did, and what meanings the different numbers have.
Of course, the purpose of a server is to run applications—we’re not really interested in how many times per second the VM can do absolutely nothing. For testing application performance, we use the applications that we’re planning to put on the machine, and then throw load at them.
Since this is necessarily application-specific, we can’t give you too many pointers on specifics. There are good test suites available for many popular libraries. For example, we’ve had customers benchmark their Xen instances with the popular web framework Django.
httperf: A Load Generator for HTTP Servers
Having tested the effectiveness of your domain’s network interface, you may want to discover how well the domain performs when serving applications through that interface. Because of Xen’s server-oriented heritage, one popular means of testing its performance in HTTP-based real-world applications is httperf. The tool generates HTTP requests and summarizes performance statistics. It supports HTTP/1.1 and SSL protocols and offers a variety of workload generators. You may find httperf useful if, for example, you’re trying to figure out how many users your web server can handle before it goes casters-up.
First, install httperf on a machine other than the one you’re testing—it can be another domU, but we usually prefer to install it on something completely separate. This “load” machine should also be as close to the target machine as possible—preferably connected to the same Ethernet switch.
You can get httperf through your distro’s package-management mechanism or from http://www.hpl.hp.com/research/linux/httperf/.
If you’ve downloaded the source code, build it using the standard method. httperf’s documentation recommends using a separate build directory rather than building directly in the source tree. Thus, from the httperf source directory:
# mkdir build
# cd build
# ../configure
# make
# make install
Next, run appropriate tests. What we usually do is run httperf with a command similar to this:
# httperf --server 192.168.1.80 --uri /index.html --num-conns 6000 --rate 1500
In this case we’re just demanding a static HTML page, so the request rate is obscenely high; usually we would use a much smaller number in tests of real-world database-backed websites.
httperf will then give you some statistics. The important numbers, in our experience, are the connection rate, the request rate, and the reply rate. All of these should be close to the rate specified on the command line. If they start to decline from that number, that indicates that the server has reached its capacity.
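One way to apply this rule across a series of runs is sketched below. It assumes you have already noted the requested rate and the measured reply rate for each httperf run. The 5 percent tolerance is an arbitrary choice for illustration, the function names are our own, and the measurements are hypothetical rather than results from a real server.

```python
def saturated(requested_rate, measured_rate, tolerance=0.05):
    """True when the measured rate falls short of the requested rate.

    tolerance is the fraction of shortfall to ignore as noise; 5 percent
    is an arbitrary choice for this sketch.
    """
    return measured_rate < requested_rate * (1.0 - tolerance)


def find_capacity(runs, tolerance=0.05):
    """Return the highest requested rate the server still kept up with.

    runs is a list of (requested_rate, measured_reply_rate) pairs, one per
    httperf run. Returns None if even the lowest rate was not sustained.
    """
    capacity = None
    for requested, measured in sorted(runs):
        if saturated(requested, measured, tolerance):
            break
        capacity = requested
    return capacity


# Hypothetical measurements, not results from a real server.
runs = [(500, 499.8), (1000, 998.1), (1500, 1496.0), (2000, 1620.4)]
print("sustained rate:", find_capacity(runs))
```

The search stops at the first rate the server fails to sustain, since results above the saturation point tend to be erratic.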
However, httperf isn’t limited to repeated requests for a single file. We prefer to use httperf in session mode by specifying the --wsesslog workload generator. This gives a closer approximation to the actual load on the web server. You can create a session file from your web server logs with a bit of Perl, winding up with a simple formatted list of URLs:
/newsv3/
    /style/crimson.css
    /style/ash.css
    /style/azure.css
    /images/news.feeds.anime/sites/ann-xs.gif
    /images/news.feeds.anime/sites/annpr-xs.gif
    /images/news.feeds.anime/sites/aod-xs.gif
    /images/news.feeds.anime/sites/an-xs.gif
    /images/news.feeds.anime/header-lite.gif
/index.shtml
    /style/sable.css
    /images/banners/igloo.gif
    /images/temp_banner.gif
    /images/faye_header2.jpg
    /images/faye-birthday.jpg
    /images/giant_arrow.gif
    /images/faye_header.jpg
/news/
/events/
    /events/events.css
    /events/summergathering2007/coverimage.jpg
(and so forth.)
This session file lists files for httperf to request, with indentations to define bursts; a group of lines that begin with whitespace is a burst. When run, httperf will request the first burst, wait a certain amount of time, then move to the next burst. Equipped with this session file, we can use httperf to simulate a user:
# httperf --hog --server 192.168.1.80 --wsesslog=40,10,urls.txt --rate=1
This will start 40 sessions at the rate of one per second. The new parameter, --wsesslog, takes the input of urls.txt and runs through it in bursts, pausing 10 seconds between bursts to simulate the user thinking.
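We built our session file with Perl, but the grouping logic is simple in any language. Here is a rough Python sketch under a few assumptions: the request paths for one visitor have already been pulled out of the web server log in order, a page request starts a new burst, and anything ending in one of the listed suffixes counts as an asset belonging to the current burst. The suffix list and function name are ours.

```python
# File suffixes treated as page assets; adjust for your own site.
ASSET_SUFFIXES = (".css", ".gif", ".jpg", ".png", ".js")


def session_lines(paths, indent="    "):
    """Turn an ordered list of requested paths into session-file lines.

    A path that looks like a page starts a new burst and is written flush
    left; asset requests that follow it are indented so that they join the
    same burst.
    """
    lines = []
    in_burst = False
    for path in paths:
        is_asset = path.lower().endswith(ASSET_SUFFIXES)
        if is_asset and in_burst:
            lines.append(indent + path)
        else:
            lines.append(path)
            in_burst = True
    return lines


paths = [
    "/newsv3/",
    "/style/crimson.css",
    "/style/ash.css",
    "/index.shtml",
    "/style/sable.css",
    "/images/banners/igloo.gif",
    "/news/",
]

print("\n".join(session_lines(paths)))
```

Redirect the output to urls.txt to get a file in the shape shown earlier. Classifying by suffix is crude; if your logs record the Referer header, grouping assets under the page that referred them would probably be more accurate.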
Again, throw this at your server, increasing the rate until the server can’t meet demand. When the server fails, congratulations! You’ve got a benchmark.
Another Application Benchmark: POV-Ray
Of course, depending on your application, httperf may not be a suitable workload. Let’s say that you’ve decided to use Xen to render scenes with the popular open source raytracer POV-Ray. (If nothing else, it’s a good way to soak up spare CPU cycles.)
The POV-Ray benchmark is easy to run. Just give the -benchmark option on the command line:
# povray -benchmark
This renders a standard scene and gives a large number of statistics, ending with an overall summary and rendering time. A domU with a 2.8 GHz Pentium 4 and 256MB of memory gave us the following output:
Smallest Alloc:                   9 bytes
Largest  Alloc:             1440008 bytes
Peak memory used:           5516100 bytes
Total Scene Processing Times
  Parse Time:    0 hours  0 minutes  2 seconds (2 seconds)
  Photon Time:   0 hours  0 minutes 53 seconds (53 seconds)
  Render Time:   0 hours 43 minutes 26 seconds (2606 seconds)
  Total Time:    0 hours 44 minutes 21 seconds (2661 seconds)
Now you’ve got a single number that you can easily compare between various setups running POV-Ray, be they Xen instances, VMware boxes, or physical servers.
Tuning Xen for Optimum Benchmarking
Most system administration work involves comparing results at the machine level—analyzing the performance of a Xen VM relative to another machine, virtual or not. However, with virtualization, there are some performance knobs that aren’t obvious but can make a huge difference in the final benchmark results.
First, Xen allocates CPU dynamically and attempts to keep the CPU busy as much as possible. That is, if dom2 isn’t using all of its allocated CPU, dom3 can pick up the extra. Although this is usually a good thing, it can make CPU benchmark data misleading. While testing, you can avoid this problem by specifying the cap parameter to the scheduler. For example, to ensure that domain ID 1 can get no more than 50 percent of one CPU:
# xm sched-credit -d 1 -c 50
Second, guests in HVM mode absolutely must use paravirtualized drivers for acceptable performance. This point is driven home in a XenSource analysis of benchmark results published by VMware, in which XenSource points out that, in VMware’s benchmarks, “XenSource’s Xen Tools for Windows, which optimize the I/O path, were not installed. The VMware benchmarks should thus be disregarded in their entirety.”
Also, shared resources (like disk I/O) are difficult to account for, can interact with dom0 CPU demand, and can be affected by other domUs. For example, although paravirtualized Xen can deliver excellent network performance, it requires more CPU cycles to do so than a nonvirtualized machine. This may affect the capacity of your machine.
This is a difficult issue to address, and we can’t really offer a magic bullet. One point to note is that the dom0 will likely use more CPU than an intuitive estimate would suggest; it’s very important to weight the dom0’s CPU allocation heavily, or perhaps even devote a core exclusively to the dom0 on boxes with four or more cores.
For benchmarking, we also recommend minimizing error by benchmarking with a reasonably loaded machine. If you’re expecting to run a dozen domUs, then they should all be performing some reasonable synthetic task while benchmarking to get an appreciation for the real-world performance of the VM.
Profiling with Xen
Of course, there is one way of seeing shared resource use more precisely. We can profile the VM as it runs our application workload to get a clear idea of what it’s doing and—with a Xen-aware profiler—how other domains are interfering with us.
Profiling refers to the practice of examining a specific application to see what it spends time doing. In particular, it can tell you whether an app is CPU or I/O limited, whether particular functions are inefficient, or whether performance problems are occurring outside of the app entirely, perhaps in the kernel.
Here, we’ll discuss a sample setup with Xen and OProfile, using the kernel compile as a standard workload (and one that most Xen admins are likely to be familiar with).
OProfile is probably the most popular profiling package for Linux. The kernel includes OProfile support, and the user-space tools come with virtually every distro we know. If you have a performance problem with a particular program and want to see precisely what’s causing it, OProfile is the tool for the job.
OProfile works by incrementing a counter whenever the program being profiled performs a particular action. For example, it can keep count of the number of cache misses or the number of instructions executed. When the counter reaches a certain value, it instructs the OProfile daemon to sample the counter, using a non-maskable interrupt to ensure prompt handling of the sampling request.
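The sampling idea is easier to see in a toy model. The sketch below is not OProfile code; it only simulates the principle, with a made-up event stream in which each entry names the function responsible for one hardware event. Every time the counter reaches the chosen interval, whatever function is running at that moment receives one sample.

```python
from collections import Counter


def sample_events(event_stream, interval):
    """Simulate counter-overflow sampling.

    event_stream yields the name of the function responsible for each
    hardware event. Every time `interval` events have been counted, the
    function running at that moment is recorded as one sample.
    """
    samples = Counter()
    counter = 0
    for function in event_stream:
        counter += 1
        if counter == interval:
            samples[function] += 1
            counter = 0
    return samples


# A made-up workload: 9000 events in hot_loop, 1000 in helper, interleaved.
stream = (["hot_loop"] * 9 + ["helper"]) * 1000
profile = sample_events(stream, interval=7)
total = sum(profile.values())
for function, count in profile.most_common():
    print(f"{function:10s} {count:5d} {count / total:6.1%}")
```

The sample shares approximate the true 90/10 split while recording only one event in seven. In this toy model an interval of 10 would line up with the workload's period and attribute every sample to helper, which illustrates why a sampling interval that resonates with a periodic workload can skew a profile.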
Xenoprofile, or Xenoprof, is a version of OProfile that has been extended to work as a system-wide profiling tool under Xen, using hypercalls to enable domains to access hardware performance counters. It supports analysis of complete Xen instances and accounts for time spent in the hypervisor or within another domU.
As of recent versions, Xen includes support for OProfile versions up to 0.9.2 (0.9.3 will require you to apply a patch to the Xen kernel). For now, it would probably be best to use the packaged version to minimize the tedious effort of recompilation.
If you’re using a recent version of Debian, Ubuntu, CentOS, or Red Hat, you’re in luck; the version of OProfile that they ship is already set up to work with Xen. Other distro kernels, if they ship with Xen, will likely also incorporate OProfile’s Xen support.
If you’re not so lucky as to have Xen profiling support already, you’ll have to download and build OProfile, for which we’ll give very brief directions just for completeness.
The first thing to do is to download the OProfile source from http://oprofile.sourceforge.net/. We used version 0.9.4.
Fetch the tarball and untar it, like so:
# wget http://prdownloads.sourceforge.net/oprofile/oprofile-0.9.4.tar.gz
# tar xzvf oprofile-0.9.4.tar.gz
# cd oprofile-0.9.4
Then configure and build OProfile:
# ./configure --with-kernel-support
# make
# make install
Finally, do a bit of Linux kernel configuration if your kernel isn’t correctly configured already. (You can check by issuing gzip -dc /proc/config.gz | grep PROFILE.) The profiling options, CONFIG_PROFILING and CONFIG_OPROFILE, should appear as enabled.
NOTE: /proc/config.gz is an optional feature that may not exist. If it doesn’t, you’ll have to find your configuration some other way. On Fedora 8, for example, you can check for profiling support by looking at the kernel config file shipped with the distro:
# cat /boot/config-2.6.23.1-42.fc8 | grep PROFILE
If your kernel isn’t set up for profiling, rebuild it with profiling support. Then install and boot from the new kernel (a step that we won’t detail at length here).
To make sure OProfile works, you can profile a standard workload in domain 0. (We chose the kernel compile because it’s a familiar task to most sysadmins, although we’re compiling it out of the Xen source tree.)
Begin by telling OProfile to clear its sample buffers:
# opcontrol --reset
Now configure OProfile.
# opcontrol --setup --vmlinux=/usr/lib/debug/lib/modules/vmlinux --separate=library --event=CPU_CLK_UNHALTED:750000:0x1:1:1
The first three arguments are the command (setup for profiling), kernel image, and an option to create separate output files for libraries used. The final switch, event, describes the event that we’re instructing OProfile to monitor.
The precise event that you’ll want to sample varies depending on your processor type (and on what you’re trying to measure). For this run, to get an overall approximation of CPU usage, we used CPU_CLK_UNHALTED on an Intel Core 2 machine. On a Pentium 4, the equivalent measure would be GLOBAL_POWER_EVENTS. The remaining fields give the sample count (the number of events between samples, here 750,000), the unit mask (in this case, 0x1), and flags indicating that we want to profile both kernel and userspace code.
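To keep the colon-separated fields straight, it may help to see the event string taken apart. This is an illustrative sketch only: the class and function names are our own, and it handles just the full five-field form used in this example.

```python
from typing import NamedTuple


class EventSpec(NamedTuple):
    name: str
    count: int       # events between samples
    unit_mask: int
    kernel: bool     # profile kernel code
    user: bool       # profile userspace code


def parse_event(spec):
    """Split an event string of the form name:count:unitmask:kernel:user."""
    name, count, unit_mask, kernel, user = spec.split(":")
    return EventSpec(
        name=name,
        count=int(count),
        unit_mask=int(unit_mask, 0),  # base 0 accepts 0x1 as well as 1
        kernel=kernel == "1",
        user=user == "1",
    )


event = parse_event("CPU_CLK_UNHALTED:750000:0x1:1:1")
print(event)

# Rough sampling rate on a CPU that is never halted, using the 2400.08 MHz
# clock speed that opreport estimates for our test machine.
samples_per_second = 2400.08e6 / event.count
print(f"about {samples_per_second:.0f} samples per second per busy CPU")
```

A smaller count gives finer-grained data at the cost of more profiling overhead, so the count is the first thing to adjust if the profiler itself starts to affect the benchmark.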
INSTALLING AN UNCOMPRESSED KERNEL ON RED HAT–DERIVED DISTROS
One issue that you may run into with OProfile and kdump, as with any tool that digs into the kernel’s innards, is that these tools expect to find an uncompressed kernel with debugging symbols for maximum benefit. This is simple to provide if you’ve built the kernel yourself, but with a distro kernel it can be more difficult.
Under Red Hat and others, these kernels (and other software built for debugging) are in special -debuginfo RPM packages. These packages aren’t in the standard yum repositories, but you can get them from Red Hat’s FTP site. For Red Hat Enterprise Linux 5, for example, that’d be ftp://ftp.redhat.com/pub/redhat/linux/enterprise/5Server/en/os/i386/Debuginfo.
For the default kernel, you’ll want the packages:
kernel-debuginfo-common-`uname -r`.`uname -m`.rpm
kernel-PAE-debuginfo-`uname -r`.`uname -m`.rpm
Download these and install them using RPM.
# rpm -ivh *.rpm
To start collecting samples, run:
# opcontrol --start
Then run the experiment that you want to profile, in this case a kernel compile.
# /usr/bin/time -v make bzImage
Then stop the profiler.
# opcontrol --shutdown
Now that we have samples, we can extract meaningful and useful information from the mass of raw data via the standard postprofiling tools. The main analysis command is opreport. To get a basic overview of the processes that consumed CPU, we could run:
# opreport -t 2
CPU: Core 2, speed 2400.08 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x01 (Unhalted bus cycles) count 750000
CPU_CLK_UNHALT...|
  samples|      %|
------------------
   370812 90.0945 cc1
        CPU_CLK_UNHALT...|
          samples|      %|
        ------------------
           332713 89.7255 cc1
            37858 10.2095 libc-2.5.so
              241  0.0650 ld-2.5.so
    11364  2.7611 genksyms
        CPU_CLK_UNHALT...|
          samples|      %|
        ------------------
             8159 71.7969 genksyms
             3178 27.9655 libc-2.5.so
               27  0.2376 ld-2.5.so
This tells us which processes accounted for CPU usage during the compile, with a threshold of 2 percent (set by the -t 2 option). This isn’t terribly interesting, however. We can get more granularity using the --symbols option with opreport, which gives a best guess as to which functions accounted for the CPU usage. Try it.
You might be interested in other events, such as cache misses. To get a list of possible counters customized for your hardware, issue:
# ophelp
Profiling Multiple Domains in Concert
So far, all this has covered standard use of OProfile, without touching on the Xen-specific features. But one of the most useful features of OProfile, in the Xen context, is the ability to profile entire domains against each other, analyzing how different scheduling parameters, disk allocations, drivers, and code paths interact to affect performance.
When profiling multiple domains, dom0 still coordinates the session. It’s not currently possible to simply profile in a domU without dom0’s involvement—domUs don’t have direct access to the CPU performance counters.
Active vs. Passive Profiling
Xenoprofile supports both active and passive modes for domain profiling.
When profiling in passive mode, the results indicate which domain is running at sample time but don’t delve more deeply into what’s being executed. It’s useful to get a quick look at which domains are using the system.
In active mode, each domU runs its own instance of OProfile, which samples events within its virtual machine. Active mode allows better granularity than passive mode, but is more inconvenient. Only paravirtualized domains can run in active mode.
Active profiling is substantially more interesting. For this example, we’ll use three domains: dom0, to control the profiler, and domUs 1 and 3 as active domains.
0 # opcontrol --reset
1 # opcontrol --reset
3 # opcontrol --reset
First, set up the daemon in dom0 with some initial parameters:
0 # opcontrol --start-daemon --event=GLOBAL_POWER_EVENTS:1000000:1:1 --xen=/boot/xen-syms-3.0-unstable --vmlinux=/boot/vmlinux-syms-2.6.18-xen0 --active-domains=1,3
This introduces the --xen option, which gives the path to the uncompressed Xen kernel image, and the --active-domains option, which lists the domains to profile in active mode. The :1s at the end of the event option tell OProfile to count events in both userspace and kernel space.
NOTE: Specify domains by numeric ID. OProfile won’t interpret names.
Next, start OProfile in the active domUs. The daemon must already be running in dom0; otherwise, the domU won’t have permission to access the performance counters.
1 # opcontrol --reset
1 # opcontrol --start
Run the same commands in domain 3. Finally, begin sampling in domain 0:
0 # opcontrol --start
Now we can run commands in the domains of interest. Let’s continue to use the kernel compile as our test workload, but this time complicate matters by running a disk-intensive benchmark in another domain.
1 # time make bzImage
3 # time bonnie++
When the kernel compile and Bonnie++ have finished, we stop OProfile:
0 # opcontrol --stop
0 # opcontrol --shutdown
1 # opcontrol --shutdown
3 # opcontrol --shutdown
Now each domU will have its own set of samples, which we can view with opreport. Taken together, these reports form a complete picture of the various domains’ activity. We might suggest playing with the CPU allocations and seeing how that influences OProfile’s results.
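Because the ordering matters (the dom0 daemon must be up before any domU starts its profiler), it can be handy to generate the sequence as a checklist. The sketch below only prints the steps and executes nothing. The function name is hypothetical, the dom0 daemon options (--event, --xen, --vmlinux) are left out for brevity, and the second reset in each domU is folded into the first.

```python
def active_profiling_plan(active_domains):
    """Return (domain, command) steps in the order described above.

    Only the ordering is illustrated; the dom0 daemon options (--event,
    --xen, --vmlinux) are omitted. A domain of None marks a manual step.
    """
    domus = sorted(active_domains)
    ids = ",".join(str(d) for d in domus)
    plan = [(d, "opcontrol --reset") for d in [0] + domus]
    plan.append((0, f"opcontrol --start-daemon --active-domains={ids}"))
    plan += [(d, "opcontrol --start") for d in domus]
    plan.append((0, "opcontrol --start"))
    plan.append((None, "run the workloads"))
    plan.append((0, "opcontrol --stop"))
    plan += [(d, "opcontrol --shutdown") for d in [0] + domus]
    return plan


plan = active_profiling_plan([1, 3])
for domain, command in plan:
    prompt = "  " if domain is None else f"{domain} #"
    print(prompt, command)
```

The output uses the same convention as the transcripts above, with the domain ID in front of the prompt.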
An OProfile Example
Now let’s try applying OProfile to an actual problem. Here’s the scenario: We’ve moved to a setup that uses LVM mirroring on a pair of 1 TB SATA disks. The hardware is a quad-core Intel QX6600, with 8GB memory and an ICH7 SATA controller, using the AHCI driver. We’ve devoted 512MB of memory to the dom0.
We noted that the performance of mirrored logical volumes accessed through xenblk was about one-tenth that of nonmirrored LVs, or of LVs mirrored with the --corelog option. Mirrored LVs with and without --corelog performed fine when accessed normally within the dom0, but performance dropped when accessed via xm block-attach. This was, to our minds, ridiculous.
First, we created two logical volumes in the volume group test: one with mirroring and a mirror log, and one with the --corelog option.
# lvcreate -m 1 -L 2G -n test_mirror test
# lvcreate -m 1 --corelog -L 2G -n test_core test
Then we made filesystems and mounted them:
# mke2fs -j /dev/test/test_mirror
# mke2fs -j /dev/test/test_core
# mkdir -p /mnt/test/mirror
# mkdir -p /mnt/test/core
# mount /dev/test/test_mirror /mnt/test/mirror
# mount /dev/test/test_core /mnt/test/core
Next we started OProfile, using the --xen option to give the path to our uncompressed Xen kernel image. After a few test runs profiling various events, it became clear that our problem related to excessive amounts of time spent waiting for I/O. Thus, we instructed the profiler to count BUS_IO_WAIT events, which indicate when the processor is stuck waiting for input:
# opcontrol --start --event=BUS_IO_WAIT:500:0xc0 --xen=/usr/lib/debug/boot/xen-syms-2.6.18-53.1.14.el5.debug --vmlinux=/usr/lib/debug/lib/modules/2.6.18-53.1.14.el5xen/vmlinux --separate=all
Then we ran Bonnie++ on each device in sequence, stopping OProfile and saving the output each time.
# bonnie++ -d /mnt/test/mirror
# opcontrol --stop
# opcontrol --save=mirrorlog
# opcontrol --reset
The LV with the corelog displayed negligible iowait, as expected. However, the other experienced quite a bit, as you can see in this output from our test of the LV in question:
# opreport -t 1 --symbols session:iowait_mirror
warning: /ahci could not be found.
CPU: Core 2, speed 2400.08 MHz (estimated)
Counted BUS_IO_WAIT events (IO requests waiting in the bus queue) with a unit mask of 0xc0 (All cores) count 500
Processes with a thread ID of 0
Processes with a thread ID of 463
Processes with a thread ID of 14185
samples  %        samples  %        samples  %        app name                           symbol name
32       91.4286  15       93.7500  0        0        xen-syms-2.6.18-53.1.14.el5.debug  pit_read_counter
1         2.8571  0        0        0        0        ahci                               (no symbols)
1         2.8571  0        0        0        0        vmlinux                            bio_put
1         2.8571  0        0        0        0        vmlinux                            hypercall_page
Here we see that the Xen kernel is experiencing a large number of BUS_IO_WAIT events in the pit_read_counter function, suggesting that this function is probably our culprit. A bit of searching for that function name reveals that it’s been taken out of recent versions of Xen, so we decide to take the easy way out and upgrade. Problem solved—but now we have some idea why.
Used properly, profiling can be an excellent way to track down performance bottlenecks. However, it’s not any sort of magic bullet. The sheer amount of data that profiling generates can be seductive, and sorting through the profiler’s output may take far more time than it’s worth.
So that’s a sysadmin’s primer on performance measurement with Xen. In this chapter, we’ve described tools to measure performance, ranging from the general to the specific, from the hardware focused to the application oriented. We’ve also briefly discussed the Xen-oriented features of OProfile, which aim to extend the profiler to multiple domUs and the hypervisor itself.