I’ve recently written this post which discusses RAID levels and how to calculate how many disks you need in your RAID sets to provide the required performance levels for your virtual machines. This post will look at what tools and metrics are available to analyse your virtual machine workloads to determine how they are performing and what their storage needs are.
Analysing Disk Latency and IOPS
Disk latency is the amount of time it takes to pass an I/O request from the VMkernel to the storage array. Latency rises when a host or storage subsystem receives a large number of I/O requests: the storage system is being asked to perform more operations than it can service, which leads to performance degradation. There are a number of tools and metrics we can use to monitor I/O workloads for latency.
vscsiStats is a tool that can be executed on an ESXi host to gather disk and latency statistics for virtual machines. I’ve written this post that details how it can be used to help monitor storage performance. The statistics that can be collected include I/O size, outstanding I/Os, seek distance and latency, and they offer more detail than those that can be gathered using esxtop or the vSphere client.
esxtop and resxtop
Both esxtop and resxtop can be used to gather I/O latency and IOPS statistics. I’ve written about this in more detail in other posts, but some of the metrics to look at include:
- CMDS/s – This is the total number of commands per second, which includes IOPS and other SCSI commands (e.g. reservations and locks). Generally speaking, CMDS/s = IOPS unless there are a lot of other SCSI or metadata operations such as reservations.
- DAVG/cmd – This is the average response time in milliseconds per command being sent to the storage device.
- KAVG/cmd – This is the average amount of time, in milliseconds, that the command spends in the VMkernel.
- GAVG/cmd – This is the response time as experienced by the Guest OS. This is calculated by adding together the DAVG and the KAVG values.
As a general rule DAVG/cmd, KAVG/cmd and GAVG/cmd should not exceed 10 milliseconds (ms) for sustained lengths of time.
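The relationship between these latency metrics can be sketched in a few lines of Python. This is only an illustration: the function names and the idea of treating "sustained" as three consecutive samples over the limit are my own assumptions, not something esxtop defines.

```python
# Illustrative sketch only: function names and the "3 samples in a row"
# definition of "sustained" are assumptions, not esxtop behaviour.
SUSTAINED_LIMIT_MS = 10.0  # rule-of-thumb limit quoted in the text


def gavg(davg_ms, kavg_ms):
    """Guest-observed latency: device latency plus VMkernel latency."""
    return davg_ms + kavg_ms


def sustained_breach(samples_ms, limit_ms=SUSTAINED_LIMIT_MS, min_consecutive=3):
    """Return True if min_consecutive samples in a row exceed limit_ms."""
    run = 0
    for value in samples_ms:
        # Extend the current run on a breach, otherwise reset it.
        run = run + 1 if value > limit_ms else 0
        if run >= min_consecutive:
            return True
    return False
```

For example, feeding a series of GAVG/cmd samples into sustained_breach separates a single spike from a run of slow responses, which is usually the more interesting signal.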
There are also the following throughput metrics to be aware of:
- CMDS/s – As discussed above
- READS/s – Number of read commands issued per second
- WRITES/s – Number of write commands issued per second
- MBREAD/s – Megabytes read per second
- MBWRTN/s – Megabytes written per second
The sum of reads and writes equals IOPS, which is the most common benchmark when monitoring and troubleshooting storage performance. These metrics can be monitored at the HBA or virtual machine level.
You can also monitor storage performance using the vSphere client. Counters to look at include disk read rate, disk write rate and disk usage. Disk read rate and disk write rate can be monitored at the LUN level, and all three counters can be monitored per host.
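As a rough sketch of how the throughput metrics fit together, the snippet below adds READS/s and WRITES/s to get IOPS and divides the combined MB/s by IOPS to estimate an average I/O size. The function names are hypothetical, and the conversion assumes 1 MB = 1024 KB.

```python
# Illustrative sketch: function names are hypothetical and the
# MB -> KB conversion assumes 1 MB = 1024 KB.


def iops(reads_per_s, writes_per_s):
    """IOPS as the sum of read and write commands per second."""
    return reads_per_s + writes_per_s


def avg_io_size_kb(mb_read_per_s, mb_written_per_s, reads_per_s, writes_per_s):
    """Estimate the average I/O size in KB from throughput and IOPS."""
    total = iops(reads_per_s, writes_per_s)
    if total == 0:
        return 0.0  # no I/O in the sample, so no meaningful size
    return (mb_read_per_s + mb_written_per_s) * 1024 / total
```

With 300 reads/s, 100 writes/s and 25 MB/s of combined throughput, this gives 400 IOPS at an average of 64 KB per I/O.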
Additionally, there are a handful of latency counters that should be checked, as stated here.
- deviceLatency – The average amount of time for the physical device to complete a SCSI command. A number greater than 15ms can indicate there are problems.
- kernelLatency – This is how long the VMkernel is taking to process each SCSI command. This value shouldn’t exceed 4ms.
- queueLatency – Measures the average time taken per SCSI command in the VMkernel queue. This value should always be zero; otherwise, the workload is too high for the array to process the data.
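The thresholds above could be checked against collected samples with something like the following. It is a minimal sketch: the sample is assumed to be a plain dictionary of counter name to milliseconds, and how you collect those values from vSphere is left out.

```python
# Illustrative sketch: assumes a sample is a dict of counter name -> ms.
# Collecting the values from vSphere is outside the scope of this snippet.
THRESHOLDS_MS = {
    "deviceLatency": 15.0,  # above 15 ms can indicate a problem
    "kernelLatency": 4.0,   # shouldn't exceed 4 ms
    "queueLatency": 0.0,    # should always be zero
}


def latency_warnings(sample):
    """Return a warning string for each counter above its threshold."""
    warnings = []
    for counter, limit in THRESHOLDS_MS.items():
        value = sample.get(counter)
        if value is not None and value > limit:
            warnings.append(f"{counter}: {value} ms exceeds {limit} ms")
    return warnings
```

Counters missing from the sample are skipped rather than treated as zero, so a partial sample doesn't hide a gap in the data.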
Useful Links and Resources