Troubleshoot ESXi Host and Virtual Machine CPU Performance Issues Using Appropriate Metrics

I have previously written about analyzing host and virtual machine performance using esxtop. This post will continue to look at how this information can help with troubleshooting performance, and what values indicate a problem may exist.

Troubleshooting Host CPU Performance Using esxtop

When using esxtop to help troubleshoot CPU performance the main metrics to be aware of in relation to the host itself are:

PCPU USED(%) – The percentage CPU usage per PCPU and the PCPU usage average across all PCPUs.
PCPU UTIL(%) – The percentage of unhalted CPU cycles per PCPU and the average across all PCPUs.

If a host is suffering from CPU saturation then you should look more closely at the virtual machines running on the host. CPU usage can be lessened by tuning the settings of the virtual machine, for example, increasing memory/decreasing vCPUs. Ultimately though, it may be that the demands of the virtual machines exceed the resources available on the host. DRS will help with this. If DRS is unavailable then you could look to migrate some of the virtual machine workloads off of the affected host. Ensure that, if you are migrating VMs, that it will not cause CPU saturation on the destination host.

Along with ESXTOP, you can also monitor host CPU performance using the charts available via the vSphere client.

Troubleshooting Virtual Machine CPU Performance Using esxtop and vCenter Charts

CPU metrics to monitor when using esxtop to help troubleshoot virtual machine CPU performance issues include:

%USED – This is the percentage of CPU time accounted to the world. This value can be over 100 as, when viewing the world group for the VM, the value maximum value is the number of worlds in the group (NWLD) multiplied by 100. If the %USED value is high it means the VM is using lots of CPU resource. You can expand the VM’s world group to see what is using the resource. Using the example above, the VM’s world group has 5 worlds, which can be seen expanded in the following example.
%SYS – This is the percentage of time that the system services are spending on the VM. If this value is high it tends to mean that the VM is experiencing high I/O. Ideally this should be lower than 20%.
%OVRLP – This is the percentage of time spent by system services on other worlds. When this value is high it is normally an indication that the host is experiencing high I/O.
%RUN – This is the percentage of total time scheduled for the world to run. %USED = %RUN + %SYS – %OVRLP. When the %RUN value of a virtual machine is high, it means the VM is using a lot of CPU resource.
%RDY – This is the percentage of time a world is waiting to run. If this value is higher than 10% it means that the virtual machine is possibly under resource contention. Remember that this value is per vCPU world, so for virtual machine with multiple vCPUs you can expect higher values. This is an important metric to pay attention do. Generally you will see higher ready values when you are using multiple vCPUs on a virtual machine, because the VMKernel will try and schedule all the vCPUs close together, which may result in the VM needing to wait before it’s thread can be executed. Ideally you want to see the ready time to be under 10% but best if closer to 5%.
%MLMTD – This is the percentage of time the world was ready to run but was deliberately not scheduled as it would have violated CPU limits. This value is contained in %RDY. If this value is high then you could increase its limit, adding more vCPUs. This should always be 0 so long as CPU limits are not in place on the VM.
%CSTP – This is the amount of time the world has spent in the ready, co-deschedule state. This is only applicable for SMP VMs. The scheduler tries to execute on all vCPUs. The %CTSP value is the time the vCPU is stopped from executing whilst waiting for other vCPUs in the same virtual machine to execute/catch up. This value shouldn’t be higher than 3%. If it is generally higher than that then it may be that the VM has too many vCPU.
%WAIT – The percentage of time a world has spent in the wait state. The %WAIT is the total wait time which includes %IDLE and I/O wait time.
%IDLE – The percentage of time a world is in idle loop.
%SWPWT – The percentage of time the world is waiting for the VMkernel swapping memory.

As a memory aid, when looking to memorise these metrics, take a look at my post on esxtop flashcards. Alternatively, you can also investigate virtual machine CPU performance using vCenter charts:

More on %RDY

Ready Time (%RDY) is the amount of time a VM is waiting to be scheduled on a physical CPU. I mention above that generally %RDY should be under 10%. This is true, however the values you see in esxtop may be confusing. This is because how you interpret the values is influenced by how many vCPU worlds the virtual machine has. %RDY is a sum of all vCPU %RDY values for the VM. For example, if a virtual machine has 1 vCPU then it’s maximum %RDY value is 100%, however, if a virtual machine has 2 vCPU then it’s maximum possible %RDY value is 200%. Therefore, whilst 10% %RDY is potentially bad for a 1 vCPU VM, it may not be an issue for a 2 vCPU VM.

For virtual machines with multiple vCPUs, you can use the expand option in esxtop to expand the world group to get a break down of %RDY values for each individual vCPU:

 ID      GID NAME             NWLD   %USED    %RUN    %SYS   %WAIT %VMWAIT    %RDY   %IDLE  %OVRLP   %CSTP  %MLMTD  %SWPWT
    4214     1942 vmx                 1    0.14    0.14    0.00   99.81       -    0.05    0.00    0.00    0.00    0.00    0.00
    4216     1942 vmast.4215          1    0.00    0.00    0.00  100.00       -    0.01    0.00    0.00    0.00    0.00    0.00
    4218     1942 vmx-vthread-5:X     1    0.00    0.00    0.00  100.00       -    0.00    0.00    0.00    0.00    0.00    0.00
    4219     1942 vmx-mks:XP4         1    0.01    0.01    0.00  100.00       -    0.00    0.00    0.00    0.00    0.00    0.00
    4220     1942 vmx-vcpu-0:XP4      1    0.41    0.32    0.08   99.53    0.00    0.14  100.00    0.01    0.01    0.00    0.00
    4221     1942 vmx-vcpu-1:XP4      1    0.05    0.05    0.00   99.88    0.76    0.06   99.12    0.00    0.02    0.00    0.00

High %RDY values can be an indication that the host is under physical CPU contention due to the demands of the virtual machines or that a virtual machine(s) has been overprovisioned/allocated too many vCPUs.

If %USED is relatively low, but ready times are high it is a good indication that the virtual machine is over-provisioned. Where possible a single vCPU should be used.

Guest CPU Saturation

If %USED is high then the virtual machine is using a lot of CPU resource. If does not necessarily indicate that there is a problem, but if the Guest’s CPU usage is continually very high then steps may need to be taken to alleviate the problem. There are a couple of approaches to take here. It may be that the virtual machine genuinely is under provisioned in terms of vCPU resources, in which case, adding more vCPUs to the virtual machine will help. Alternatively, there could be a problem with the guest – perhaps a process is taking up all the CPU time due to a fault, or it could be that the virtual machine hasn’t be allocated enough memory, in turn causing pressure on the VM’s CPU resources.

Troubleshoot ESXi Host and Virtual Machine CPU Performance Issues Using Appropriate Metrics

Troubleshooting Host CPU Performance Using esxtop

Troubleshooting Virtual Machine CPU Performance Using esxtop and vCenter Charts

More on %RDY

Guest CPU Saturation

How to Create an Ubuntu Virtual Machine – On Ubuntu!

Troubleshoot ESXi Host and Virtual Machine Memory Performance Issues using Appropriate Metrics