Scheduling

Tuning CPU usage

Some virtual servers, of course, require more resources than others. Memory and disk size are easy to tune – memory you can just specify in the config file, while disk size is determined by the size of the backing device – but fine-grained CPU allocation requires you to adjust the scheduler.
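
For example, memory allocation is a single directive in the domU config file; a minimal, illustrative snippet (the volume path here is hypothetical) might read:

memory = 512
disk = ['phy:/dev/vg0/domU-root,xvda,w']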

Scheduler basics

The scheduler acts as a referee between the running domains. In some ways it's a lot like the Linux scheduler: It can preempt processes as needed, it tries its best to ensure fair allocation, and it ensures that the CPU wastes as few cycles as possible. As the name suggests, Xen's scheduler schedules domains to run on the physical CPU. These domains in turn schedule and run processes from their internal run queues.

Because the dom0 is just another domain as far as Xen's concerned, it's subject to the same scheduling algorithm as the domUs. This can lead to trouble if it's not assigned a high enough weight, since the dom0 has to be able to respond to I/O requests.

Xen can use a variety of scheduling algorithms, ranging from the simple to the baroque. Although Xen has shipped with a number of schedulers in the past, we're going to concentrate on the credit scheduler; it's the current default and recommended choice, and the only one that the Xen team has indicated any interest in keeping.

The xm dmesg command will tell you, among other things, what scheduler Xen is using.

# xm dmesg | grep scheduler
(XEN) Using scheduler: SMP Credit Scheduler (credit)

If you want to change the scheduler, you can set it as a boot parameter – to change to the SEDF scheduler, for example, append sched=sedf to the kernel line in GRUB. (That's the Xen kernel, not the dom0 Linux kernel loaded by the first "module" line.)
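
For example, a GRUB (legacy) menu.lst entry might look something like this (kernel versions and paths here are illustrative):

title Xen
        kernel /boot/xen.gz sched=sedf
        module /boot/vmlinuz-2.6.18-xen ro root=/dev/sda1
        module /boot/initrd-2.6.18-xen.img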

VCPUs and physical CPUs

For convenience, we consider each Xen domain to have one or more virtual CPUs (VCPUs), which periodically run on the physical CPUs. These are the entities that consume credits when run. To examine VCPUs, use xm vcpu-list [domain]:

# xm vcpu-list horatio
Name                              ID VCPUs   CPU State   Time(s) CPU Affinity
horatio                           16     0     0   ---  140005.6 any cpu
horatio                           16     1     2   r--  139968.3 any cpu

In this case, the domain has two VCPUs, 0 and 1. VCPU 1 is in the "running" state on (physical) CPU 2. Note that Xen will try to spread VCPUs across CPUs as much as possible. Unless you've pinned them manually, VCPUs can occasionally switch CPUs, depending on which physical CPUs are available.

To set the number of VCPUs for a domain, use the vcpus= directive in the config file. You can also change the number of VCPUs while a domain is running using xm vcpu-set. Note, however, that while you can decrease the number of VCPUs this way, you can't increase it beyond the count the domain started with.
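
For example, to drop a running domain to a single VCPU (reusing the domain from the earlier example):

# xm vcpu-set horatio 1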

To set the CPU affinity, use xm vcpu-pin <domain> <vcpu> <pcpu>. For example, to switch the CPU assignment for the domain horatio so that VCPU 0 runs on CPU 2 and VCPU 1 on CPU 0:

# xm vcpu-pin horatio 0 2
# xm vcpu-pin horatio 1 0
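
You can confirm the new pinning with xm vcpu-list horatio; the CPU Affinity column should now read 2 for VCPU 0 and 0 for VCPU 1, rather than "any cpu".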

Equivalently, you can pin VCPUs in the xm config file like this:

vcpus=2
cpus=[0,2]   

This gives the domain 2 VCPUs, pins the first VCPU to the first physical CPU, and pins the second VCPU to the third physical CPU.

Credit Scheduler

The Xen team designed the credit scheduler to minimize wasted CPU time. This makes it a "work-conserving" scheduler, in that it tries to ensure that the CPU will always be working, whenever there is work for it to do.

As a consequence, if there is more real CPU available than the domUs are demanding, all domUs get all the CPU they want. When there is contention – that is, when the domUs in aggregate want more CPU than actually exists – then the scheduler arbitrates fairly between the domains that want CPU.

Xen treats this more as a target than as an inflexible dictum. The scheduling isn't perfect by any stretch of the imagination. In particular, cycles that the dom0 spends servicing I/O are not charged to the domain responsible, leading to situations where I/O-intensive clients get a disproportionate share of CPU usage. Nonetheless, you can get pretty good allocation in non-pathological cases. (Also, in our experience, the CPU sits idle most of the time anyway.)

The credit scheduler assigns each domain a weight and, optionally, a cap. The weight indicates the relative CPU allocation of a domain – if the CPU is scarce, a domain with a weight of 512 will receive twice as much CPU time as a domain with a weight of 256 (the default). The cap sets an absolute limit on the amount of CPU time a domain can receive, expressed in hundredths of a CPU (note that this number can exceed 100 on multiprocessor hosts).
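
To make that concrete: on a fully contended single-CPU box running only those two domains, the weight-512 domain would get roughly two-thirds of the CPU time and the weight-256 domain roughly one-third, while a cap of 50 would hold a domain to half of one CPU even if the rest of the machine were idle.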

The scheduler transforms the weight into a credit allocation for each VCPU, using a separate accounting thread. As a VCPU runs, it consumes credits. Once the VCPU runs out of credits, it only runs when other, more thrifty VCPUs have finished executing. Periodically, the accounting thread goes through and gives everybody more credits.

In this case, the details are probably less important than the practical application. Using the xm sched-credit commands, we can adjust CPU allocation on a per-domain basis. For example, here we'll increase a domain's CPU allocation. First, to list the weight and cap for a domain:

# xm sched-credit -d domain
{'cap': 0, 'weight': 256}

Then, to modify the scheduler's parameters:

# xm sched-credit -d domain -w 512
# xm sched-credit -d domain
{'cap': 0, 'weight': 512}

Of course, the value "512" has meaning only relative to the weights of the other domains running on the machine. Make sure to set all the domains' weights appropriately.

To set the cap for a domain:

# xm sched-credit -d domain -c cap
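
For instance, since the cap is expressed in hundredths of a CPU, limiting a domain to half of one physical CPU looks like this:

# xm sched-credit -d domain -c 50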

Scheduling for providers

We decided to divide the CPU along the same lines as the available RAM -- it stands to reason that a user paying for half the RAM in a box will want more CPU than someone with a 64 MB domain. Thus, in our setup, a customer with 25% of the RAM also has a minimum share of 25% of the CPU cycles.

The simple way to do this is to assign each domain a weight equal to the number of megabytes of memory it has, and leave the cap empty. The scheduler will then handle converting that into fair proportions – so that our aforementioned user with half the RAM will get about as much CPU time as the rest of the users put together.

Of course, that's the worst case; that is what the user will get in an environment of constant struggle for the CPU. If all domains but one are idle, that one can have the entire CPU to itself.

Note: It's essential to make sure that the dom0 has sufficient CPU to service I/O requests. You can handle this by dedicating a CPU to the dom0, or by giving the dom0 a very high weight – higher than any of the domUs. At prgmr.com, we handle the problem by weighting each domU with its RAM amount, and weighting the dom0 at the total amount of physical memory in the box.
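
As a sketch of that scheme, on a hypothetical box with 4,096 MB of RAM hosting two 1,024 MB domUs (the domU names here are made up):

# xm sched-credit -d Domain-0 -w 4096
# xm sched-credit -d hamlet -w 1024
# xm sched-credit -d ophelia -w 1024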

This simple memory=weight formula becomes a bit more complex when dealing with multiprocessor systems, since independent systems of CPU allocation then come into play. A good rule would be to allocate VCPUs in proportion to memory. For example, a domain with half the RAM on a box with four cores (and hyperthreading turned off) should have at least two VCPUs. Another solution would be to give all domains as many VCPUs as physical processors in the box – this would allow all domains to burst to the full CPU capacity of the physical machine, but might lead to increased overhead from context switches.
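
As a sketch, the config for the half-the-RAM domain in that four-core example might read (the memory figure assumes a 4GB box and is purely illustrative):

memory = 2048
vcpus = 2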