Analyzing esxtop data

I’ve recently written a post about how to collect data with esxtop and resxtop, but how do you interpret that data? esxtop is a great tool for troubleshooting and for determining if there are any capacity issues in your environment. There are many metrics available, too many to cover in one post, so I will concentrate on the ones used most often when investigating issues related to storage, network, CPU and memory capacity and performance.

Analyzing Disk Performance with esxtop

There are three screens in esxtop relating to disk performance. The first is the disk device screen, accessed by pressing ‘u’:

 8:51:42am up 13:29, 313 worlds, 4 VMs, 4 vCPUs; CPU load average: 0.02, 0.15, 0.05

DEVICE                                PATH/WORLD/PARTITION DQLEN WQLEN ACTV QUED %USD  LOAD   CMDS/s  READS/s W
mpx.vmhba1:C0:T0:L0                            -              32     -    0    0    0  0.00    11.51     9.92
mpx.vmhba1:C0:T1:L0                            -              32     -    0    0    0  0.00     0.00     0.00
mpx.vmhba1:C0:T2:L0                            -              32     -    0    0    0  0.00     0.00     0.00
mpx.vmhba32:C0:T0:L0                           -               1     -    0    0    0  0.00     0.00     0.00
t10.F405E46494C4540013C625565687D2A6           -             128     -    0    0    0  0.00     0.00     0.00

Next is the disk adapter screen, accessed by pressing ‘d’:

 8:52:18am up 13:29, 313 worlds, 4 VMs, 4 vCPUs; CPU load average: 0.02, 0.15, 0.05

 ADAPTR PATH                 NPTH   CMDS/s  READS/s WRITES/s MBREAD/s MBWRTN/s DAVG/cmd KAVG/cmd GAVG/cmd QAVG/
 vmhba0 -                       0     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0
 vmhba1 -                       3     5.94     5.54     0.40     0.01     0.00     0.19     0.01     0.20     0
vmhba32 -                       1     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0
vmhba33 -                       2     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0

The last one is the VM Disk screen, accessed by pressing ‘v’:

 4:43:56pm up 1 day 16:52, 307 worlds, 1 VMs, 1 vCPUs; CPU load average: 0.02, 0.02, 0.01

     GID VMNAME           VDEVNAME NVDISK   CMDS/s  READS/s WRITES/s MBREAD/s MBWRTN/s LAT/rd LAT/wr
   83880 XP                      -      1     0.00     0.00     0.00     0.00     0.00   0.00   0.00

The main disk metrics to be aware of here, as described in the VMware KB article linked at the end of this post, are:

CMDS/s – This is the total number of commands per second, which includes IOPS and other SCSI commands (e.g. reservations and locks). Generally speaking, CMDS/s = IOPS unless there are a lot of other SCSI or metadata operations such as reservations.
DAVG/cmd – This is the average response time, in milliseconds, per command sent to the storage device.
KAVG/cmd – This is the average amount of time, in milliseconds, that a command spends in the VMkernel.
GAVG/cmd – This is the average response time as experienced by the guest OS. It is calculated by adding together the DAVG and KAVG values.

As a general rule DAVG/cmd, KAVG/cmd and GAVG/cmd should not exceed 10 milliseconds (ms) for sustained lengths of time.
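
As a rough illustration of the rule above, the following Python sketch takes the DAVG/cmd and KAVG/cmd values for an adapter, derives GAVG/cmd as their sum, and reports which counters are over a threshold. The function name, the dictionary layout and the single 10 ms default are assumptions made for this example and are not part of esxtop. It checks one sample only, whereas the rule is about values that stay high over time.

```python
# Hypothetical helper: flag latency counters above a threshold (in ms).
LATENCY_THRESHOLD_MS = 10.0  # the general rule quoted in the text


def check_latency(davg_ms, kavg_ms, threshold_ms=LATENCY_THRESHOLD_MS):
    """Return GAVG/cmd and the names of the counters above the threshold."""
    gavg_ms = davg_ms + kavg_ms  # GAVG/cmd = DAVG/cmd + KAVG/cmd
    counters = {"DAVG/cmd": davg_ms, "KAVG/cmd": kavg_ms, "GAVG/cmd": gavg_ms}
    exceeded = [name for name, value in counters.items() if value > threshold_ms]
    return gavg_ms, exceeded


# vmhba1 from the adapter screen: DAVG/cmd 0.19 ms, KAVG/cmd 0.01 ms
gavg, exceeded = check_latency(0.19, 0.01)
print(f"GAVG/cmd = {gavg:.2f} ms, over threshold: {exceeded or 'none'}")
```

For the vmhba1 sample the derived GAVG/cmd matches the 0.20 shown on the adapter screen, and nothing is flagged.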

There are also the following throughput metrics to be aware of:

CMDS/s – As discussed above
READS/s – Number of read commands issued per second
WRITES/s – Number of write commands issued per second
MBREAD/s – Megabytes read per second
MBWRTN/s – Megabytes written per second
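
The throughput counters can also be combined to estimate the average I/O size, by dividing the megabytes per second by the commands per second. The sketch below is only an illustration: the function name is made up for this post, it assumes 1 MB = 1024 KB, and because esxtop displays rounded values the result is approximate at low I/O rates.

```python
# Hypothetical helper: estimate the average I/O size from two esxtop counters.
def avg_io_size_kb(mb_per_s, ops_per_s):
    """Average size of one I/O in KB, or None when no I/O is being issued."""
    if ops_per_s <= 0:
        return None
    return mb_per_s * 1024.0 / ops_per_s  # assumes 1 MB = 1024 KB


# vmhba1 from the adapter screen: 0.01 MBREAD/s at 5.54 READS/s
size = avg_io_size_kb(0.01, 5.54)
print(f"average read size: {size:.1f} KB")
```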

Analyzing CPU Performance with esxtop

Before looking at the metrics, I want to say a little bit about worlds. A world, as viewed in esxtop, is an entity that the VMkernel schedules resources for, similar to a process in Windows. A powered-on virtual machine consists of multiple worlds; each allocated vCPU, for example, has its own world. When you look at a VM in the CPU view of esxtop you are looking at the world group for the VM, which contains all the worlds that make up the running virtual machine.

On the CPU screen, accessed by pressing ‘c’, you can choose to filter the list to see only the virtual machines by pressing ‘V’:

3:51:30am up 2 days  3:59, 304 worlds, 1 VMs, 1 vCPUs; CPU load average: 0.01, 0.01, 0.01
PCPU USED(%): 1.9 1.8 1.9 1.9 AVG: 1.9
PCPU UTIL(%): 4.1 3.8 2.8 3.7 AVG: 3.6

      ID      GID NAME             NWLD   %USED    %RUN    %SYS   %WAIT %VMWAIT    %RDY   %IDLE  %OVRLP   %CSTP  %MLMTD  %SWPWT
   83880    83880 XP                  5    1.31    1.17    0.13  497.45    0.07    1.75   98.19    0.02    0.00    0.00    0.00

To expand a world group for a VM, press ‘e’ then type in the GID:

 3:52:44am up 2 days  4:00, 306 worlds, 1 VMs, 1 vCPUs; CPU load average: 0.01, 0.01, 0.01
PCPU USED(%): 1.3 0.9 1.2 0.6 AVG: 1.0
PCPU UTIL(%): 2.0 1.0 1.4 0.8 AVG: 1.3

      ID      GID NAME             NWLD   %USED    %RUN    %SYS   %WAIT %VMWAIT    %RDY   %IDLE  %OVRLP   %CSTP  %MLMTD  %SWPWT
  103065    83880 vmx                 1    0.16    0.16    0.00   99.70       -    0.04    0.00    0.00    0.00    0.00    0.00
  103068    83880 vmast.103067        1    0.00    0.00    0.00   99.89       -    0.01    0.00    0.00    0.00    0.00    0.00
  103069    83880 vmx-vthread-4:X     1    0.00    0.00    0.00   99.90       -    0.00    0.00    0.00    0.00    0.00    0.00
  103070    83880 vmx-mks:XP          1    0.01    0.01    0.00   99.89       -    0.00    0.00    0.00    0.00    0.00    0.00
  103071    83880 vmx-vcpu-0:XP       1    0.96    0.79    0.16   98.59    0.06    0.52   98.53    0.01    0.00    0.00    0.00

So, what are the main CPU counters to be aware of? First of all, there are the ones relating to the physical CPUs in the host. These are:

PCPU USED(%) – The percentage CPU usage per PCPU and the PCPU usage average across all PCPUs.
PCPU UTIL(%) – The percentage of unhalted CPU cycles per PCPU and the average across all PCPUs.

If these values are high it means that you are using a lot of CPU resource on the host. If all of the PCPUs are running at or close to 100% it is likely that you are overcommitting your CPU resources.

Some of the metrics relating to the worlds to pay attention to are:

%USED – This is the percentage of CPU time accounted to the world. This value can be over 100 because, when viewing the world group for the VM, the maximum value is the number of worlds in the group (NWLD) multiplied by 100. If the %USED value is high it means the VM is using a lot of CPU resource. You can expand the VM’s world group to see which world is using the resource. In the example above, the VM’s world group has 5 worlds, which are listed individually in the expanded view.
%SYS – This is the percentage of time that the system services are spending on the VM. If this value is high it tends to mean that the VM is experiencing high I/O.
%OVRLP – This is the percentage of time spent by system services on other worlds. When this value is high it is normally an indication that the host is experiencing high I/O.
%RUN – This is the percentage of total time scheduled for the world to run. %USED = %RUN + %SYS – %OVRLP. When the %RUN value of a virtual machine is high, it means the VM is using a lot of CPU resource.
%RDY – This is the percentage of time a world is ready to run but is waiting to be scheduled. If this value is higher than 20% it means that the virtual machine is possibly under resource contention. Remember that this value is per vCPU world, so for a virtual machine with multiple vCPUs you can expect higher values.
%MLMTD – This is the percentage of time the world was ready to run but was deliberately not scheduled as it would have violated CPU limits. This value is contained in %RDY. If this value is high, consider raising or removing the CPU limit on the virtual machine.
%CSTP – This is the percentage of time the world has spent in the ready, co-deschedule state. This is only applicable to SMP VMs. The scheduler tries to run all of the VM’s vCPUs together. The %CSTP value is the time a vCPU is stopped from executing whilst waiting for other vCPUs in the same virtual machine to execute/catch up.
%WAIT – The percentage of time a world has spent in the wait state. The %WAIT is the total wait time which includes %IDLE and I/O wait time.
%IDLE – The percentage of time a world is in idle loop.
%SWPWT – The percentage of time the world is waiting for the VMkernel to swap memory.
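
Because the ready time of a world group adds up across its worlds, a simple way to compare a multi-vCPU VM against the 20% figure is to average the group value over its vCPUs. The Python sketch below does this. The function names are hypothetical, and dividing by the vCPU count assumes that almost all of the group’s ready time comes from the vCPU worlds, which is a simplification.

```python
# Hypothetical helpers: normalise a world group's ready time per vCPU.
RDY_THRESHOLD_PCT = 20.0  # per-vCPU figure quoted in the text


def rdy_per_vcpu(group_rdy_pct, num_vcpus):
    """Average ready percentage per vCPU for a VM world group."""
    if num_vcpus < 1:
        raise ValueError("num_vcpus must be at least 1")
    return group_rdy_pct / num_vcpus


def is_contended(group_rdy_pct, num_vcpus, threshold_pct=RDY_THRESHOLD_PCT):
    """True when the per-vCPU ready time is above the threshold."""
    return rdy_per_vcpu(group_rdy_pct, num_vcpus) > threshold_pct


# The XP VM from the CPU screen: group ready time of 1.75 with a single vCPU
print(is_contended(1.75, 1))
# A made-up 4-vCPU VM with a group ready time of 60 averages 15% per vCPU
print(rdy_per_vcpu(60.0, 4), is_contended(60.0, 4))
```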

Some things to note:

%USED = %RUN + %SYS – %OVRLP
100% = %RUN + %RDY + %CSTP + %WAIT
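
The two relationships above can be checked against the sample output with a few lines of Python. This is only a sketch with made-up function names. The numbers esxtop prints are rounded samples, so expect the results to agree approximately and not exactly.

```python
# Hypothetical helpers that evaluate the two identities listed above.
def used_from_components(run_pct, sys_pct, ovrlp_pct):
    """Used time = run time + system time - overlap time."""
    return run_pct + sys_pct - ovrlp_pct


def state_time_total(run_pct, rdy_pct, cstp_pct, wait_pct):
    """Run + ready + co-stop + wait; about 100 for a single world."""
    return run_pct + rdy_pct + cstp_pct + wait_pct


# vmx-vcpu-0:XP from the expanded view (reported %USED is 0.96)
print(f"{used_from_components(0.79, 0.16, 0.01):.2f}")
print(f"{state_time_total(0.79, 0.52, 0.00, 98.59):.2f}")

# The XP world group has NWLD = 5, so its total is close to 5 x 100
print(f"{state_time_total(1.17, 1.75, 0.00, 497.45):.2f}")
```

For a world group the second total is close to NWLD multiplied by 100, which is why %WAIT for the XP group is 497.45 and not a value below 100.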

Analyzing Memory Performance with esxtop

You can view the memory performance data in esxtop by pressing ‘m’:

11:10:16pm up  5:11, 315 worlds, 2 VMs, 4 vCPUs; MEM overcommit avg: 0.00, 0.00, 0.00
PMEM  /MB:  4095   total:   860     vmk,   741 other,   2492 free
VMKMEM/MB:  4077 managed:   244 minfree,  2456 rsvd,   1621 ursvd,  high state
PSHARE/MB:    69  shared,    39  common:    30 saving
SWAP  /MB:     0    curr,     0 rclmtgt:                 0.00 r/s,   0.00 w/s
ZIP   /MB:     0  zipped,     0   saved
MEMCTL/MB:     0    curr,     0  target,   254 max

     GID NAME               MEMSZ    GRANT    SZTGT     TCHD   TCHD_W    SWCUR    SWTGT   SWR/s   SWW/s  LLSWR/s  LLSWW/s   OVHDUW
   24950 XP1               256.00   255.77   306.77    81.92    69.12     0.00     0.00    0.00    0.00     0.00     0.00     5.98
   24962 XP2               256.00   255.77   306.55    69.12    51.20     0.00     0.00    0.00    0.00     0.00     0.00     5.98

The physical memory is shown by the PMEM metric. In the example above we can see that this ESXi host has 4 GB of RAM, with 860 MB in use by the VMkernel and 741 MB in use by other processes. There is 2492 MB free.
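
If you want to work with these figures in a script, the PMEM line can be picked apart with a regular expression. The Python sketch below is an illustration only: the function name is made up, and the pattern assumes the line is laid out exactly as in the example above, which may differ between ESXi versions.

```python
import re


# Hypothetical helper: parse the PMEM header line into MB values.
def parse_pmem(line):
    """Return total, vmk, other and free memory (MB) from a PMEM line."""
    match = re.search(
        r"PMEM\s*/MB:\s*(\d+)\s+total:\s*(\d+)\s+vmk,"
        r"\s*(\d+)\s+other,\s*(\d+)\s+free",
        line,
    )
    if match is None:
        raise ValueError("not a PMEM line in the expected layout")
    total, vmk, other, free = (int(value) for value in match.groups())
    return {"total": total, "vmk": vmk, "other": other, "free": free}


line = "PMEM  /MB:  4095   total:   860     vmk,   741 other,   2492 free"
pmem = parse_pmem(line)
free_pct = 100.0 * pmem["free"] / pmem["total"]
print(pmem, f"{free_pct:.1f}% free")
```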

The main metrics relating to the virtual machine world groups are:

MEMSZ – This is the value, in MB, of the configured guest memory.
GRANT – This is the amount of memory that has been granted to the world group.
%ACTV – This is the percentage of active guest memory.
MCTLSZ – This is the amount of guest memory, in MB, reclaimed by the balloon driver. If this is high, it can be a sign of memory contention on the host.
SWCUR – Current swap usage. If this is high it is a sign of memory contention on the host.
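
As a minimal sketch of how the balloon and swap counters might be checked together, the Python function below reports which signs of memory contention a VM shows. The function and parameter names are made up for this example, and flagging anything above 0 MB is my assumption; pick a threshold that suits your environment.

```python
# Hypothetical helper: list the signs of memory contention for one VM.
def memory_contention_signs(balloon_mb, swap_current_mb, threshold_mb=0.0):
    """Return 'ballooning' and/or 'swapping' when a counter is above the threshold."""
    signs = []
    if balloon_mb > threshold_mb:
        signs.append("ballooning")
    if swap_current_mb > threshold_mb:
        signs.append("swapping")
    return signs


# XP1 from the memory screen: SWCUR is 0.00, and MEMCTL shows 0 MB ballooned
print(memory_contention_signs(balloon_mb=0.0, swap_current_mb=0.0))
# A made-up VM with 128 MB ballooned and 32 MB swapped
print(memory_contention_signs(balloon_mb=128.0, swap_current_mb=32.0))
```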

Analyzing Network Performance with esxtop

Network performance data in esxtop is accessed by pressing ‘n’:

11:40:40pm up  5:41, 314 worlds, 2 VMs, 4 vCPUs; CPU load average: 0.04, 0.04, 0.17

   PORT-ID              USED-BY  TEAM-PNIC DNAME              PKTTX/s  MbTX/s    PKTRX/s  MbRX/s %DRPTX %DRPRX
  33554433           Management        n/a vSwitch0              0.00    0.00       0.00    0.00   0.00   0.00
  33554434               vmnic0          - vSwitch0              7.80    0.02      17.56    0.03   0.00   0.00
  33554435     Shadow of vmnic0        n/a vSwitch0              0.00    0.00       0.00    0.00   0.00   0.00
  33554436               vmnic2          - vSwitch0              0.00    0.00      25.37    0.04   0.00   0.00
  33554437     Shadow of vmnic2        n/a vSwitch0              0.00    0.00       0.00    0.00   0.00   0.00
  33554438                 vmk0     vmnic0 vSwitch0             10.73    0.02       4.88    0.01   0.00   0.00
  33554439                 vmk2     vmnic2 vSwitch0              0.00    0.00       0.00    0.00   0.00   0.00

Metrics to look out for here are MbTX/s (megabits transmitted per second) and MbRX/s (megabits received per second). Keep an eye on %DRPTX and %DRPRX, the percentages of transmit and receive packets dropped, as they can be an indicator of a busy or saturated network.
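
The Python sketch below shows one way these counters could be checked once the rows have been copied into a script. The function names, the dictionary layout and the 1000 Mb/s link speed are all assumptions for the example; esxtop does not report the link speed on this screen.

```python
# Hypothetical helpers for the network screen counters.
def ports_with_drops(ports, threshold_pct=0.0):
    """Return the USED-BY names of ports dropping more than threshold_pct."""
    return [
        port["USED-BY"]
        for port in ports
        if port["%DRPTX"] > threshold_pct or port["%DRPRX"] > threshold_pct
    ]


def link_utilisation_pct(mbit_per_s, link_speed_mbit):
    """Throughput as a percentage of the assumed link speed."""
    return 100.0 * mbit_per_s / link_speed_mbit


# Rows taken from the network screen (only some columns kept)
ports = [
    {"USED-BY": "vmnic0", "MbTX/s": 0.02, "MbRX/s": 0.03, "%DRPTX": 0.0, "%DRPRX": 0.0},
    {"USED-BY": "vmnic2", "MbTX/s": 0.00, "MbRX/s": 0.04, "%DRPTX": 0.0, "%DRPRX": 0.0},
    {"USED-BY": "vmk0", "MbTX/s": 0.02, "MbRX/s": 0.01, "%DRPTX": 0.0, "%DRPRX": 0.0},
]

print(ports_with_drops(ports))
# Assumes a 1000 Mb/s uplink for vmnic0
print(f"{link_utilisation_pct(ports[0]['MbRX/s'], 1000.0):.3f}%")
```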

Useful Links and Resources

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1008205

https://communities.vmware.com/docs/DOC-11812
