Calculate Host Failure Requirements

by admin

When you are working out host failure requirements for an ESXi cluster, it is important to account for the combined resource utilisation of the virtual machines that will run in the cluster, so that enough unused resource remains to support them in the event of a host failure. For example, in a cluster of 5 ESXi hosts with identical hardware specifications, the resource utilisation of each host (assuming the load is spread evenly) should not exceed 75 to 80% if the cluster is to survive a single host failure. Admission Control is the cluster feature that helps you ensure there is enough spare resource in the cluster should one or more hosts fail.
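The upper end of that range follows from simple arithmetic: with N identical hosts and F tolerated failures, the surviving hosts must be able to absorb the whole load. A minimal sketch, assuming identical hosts and an evenly spread load (the function name is purely illustrative):

```python
def max_safe_utilisation(host_count: int, failures_to_tolerate: int = 1) -> float:
    """Highest per-host utilisation (as a fraction) that still lets the
    surviving hosts carry the full load after the given number of failures.
    Assumes identical hosts and an evenly spread load."""
    if host_count <= failures_to_tolerate:
        raise ValueError("the cluster needs more hosts than failures to tolerate")
    return (host_count - failures_to_tolerate) / host_count


# 5 hosts, 1 tolerated failure
print(f"{max_safe_utilisation(5):.0%}")  # 80%
```

Staying a little below this ceiling, as the 75% figure suggests, leaves some headroom for load that is not perfectly balanced.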

Configuring Admission Control

Admission Control is configured in the cluster’s settings:

admission-control-settings

As stated in the settings window, the admission control policy determines the amount of cluster capacity that is reserved for virtual machine failovers. When admission control is enabled, the cluster will not allow any virtual machine to be powered on if doing so would violate availability constraints. This guarantees that there will be enough spare resources to restart virtual machines in the event of a host failure. When admission control is disabled, the cluster will allow VMs to be powered on regardless of whether they violate availability constraints. The availability constraints themselves are defined by the admission control policy.

Admission Control Policy

admission-control-policy

There are three types of admission control policy:

  • Host failures the cluster tolerates. This is simply how many host failures the cluster can tolerate. When this is set as the admission control policy, the cluster makes sure that if the configured number of hosts were to fail, there would still be sufficient resources available to power on the virtual machines. The cluster uses the slot size to work out how much capacity it needs to reserve (more on that later).
  • Percentage of cluster resources reserved as failover spare capacity. When this policy is set, the cluster ensures that a percentage of total cluster resources is reserved to power on virtual machines in the event of host failure.
  • Specify failover hosts. When this option is set, the cluster will attempt to power on virtual machines on one or more named failover hosts in the event of a host failure. Named failover hosts do not run virtual machine workloads unless a failover event has occurred.

It’s important to know which admission control policy is best for your environment. With this in mind, I’ll look at each in a little more detail, focusing on how the cluster calculates how much resource to leave free to support failover.

As mentioned above, when ‘Host failures the cluster tolerates’ is selected, the cluster uses the concept of ‘slots’ to determine how many virtual machines the cluster can support. The cluster first determines the slot size, from which it can then determine how many ‘slots’ the cluster can support. The cluster can then work out how many ‘slots’ need to be left free to support powering on VMs in the event of a host failure. So, how is the slot size calculated?

Slot size is made up of both CPU and memory. The CPU part of the slot size is the value of the highest CPU reservation applied to a virtual machine in the cluster. For example, if there is a VM in the cluster with a CPU reservation of 500 MHz, and this is the VM with the highest CPU reservation, then the CPU part of the slot size will be set to 500 MHz. If none of the virtual machines in the cluster have a CPU reservation, the CPU part of the slot size is set to a default value of 32 MHz. Similarly, the memory part of the slot size is set by taking the value of the largest memory reservation (plus memory overhead) of any VM in the cluster. Once the slot size has been calculated, the cluster can work out how many slots it can support by aggregating the total cluster resources, both CPU and memory, then dividing each by the corresponding slot size value. The outcome is two slot counts, one for CPU and one for memory. The smaller of these is taken as the number of available slots for the cluster.
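The calculation described above can be sketched in a few lines of Python. This is only an illustration of the description in this post, not HA's actual implementation; the `VM` class, the function names and all the figures are made up for the example.

```python
from dataclasses import dataclass

DEFAULT_CPU_SLOT_MHZ = 32  # default CPU slot size when no VM has a CPU reservation


@dataclass
class VM:
    cpu_reservation_mhz: int
    mem_reservation_mb: int
    mem_overhead_mb: int


def slot_size(vms):
    """Return (cpu_mhz, mem_mb): the largest CPU reservation and the largest
    memory reservation plus overhead found among the VMs."""
    if not vms:
        raise ValueError("at least one VM is needed to derive a slot size")
    cpu = max(vm.cpu_reservation_mhz for vm in vms) or DEFAULT_CPU_SLOT_MHZ
    mem = max(vm.mem_reservation_mb + vm.mem_overhead_mb for vm in vms)
    return cpu, mem


def cluster_slots(total_cpu_mhz, total_mem_mb, vms):
    """Divide aggregate CPU and memory by the slot size and keep the smaller count."""
    cpu_slot, mem_slot = slot_size(vms)
    return min(total_cpu_mhz // cpu_slot, total_mem_mb // mem_slot)


# Hypothetical VMs and cluster totals, for illustration only
vms = [VM(500, 1024, 100), VM(0, 2048, 150), VM(250, 0, 80)]
print(slot_size(vms))                      # (500, 2198)
print(cluster_slots(60000, 196608, vms))   # 89
```

In this example memory is the limiting factor: the CPU side alone would allow 120 slots, but only 89 memory slots fit, so the cluster has 89 slots.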

The total available resources for a cluster can be seen on the cluster’s resource allocation tab:

cluster-resource-allocation

You can view detailed information on the slot size for a cluster, and how many slots are used and how many remain available, by clicking the ‘Advanced Runtime Info’ link on the cluster’s summary tab:

advanced-runtime-info

In the example above we can see that:

  • There are three nodes in the cluster, and all are currently available
  • Each host has 137 slots, with the total number of slots being 411
  • There are 42 used slots (which in this case equals the number of powered on VMs, though it is possible that a VM can use more than one slot)
  • There are 238 available slots (This is the total slots minus the failover slots and used slots)
  • There are 137 failover slots (the number of slots reserved for failover), as we have configured the cluster to allow for a single host failure.

There are a number of things to bear in mind when using ‘Host failures the cluster tolerates’ as the admission control policy. First, it is important that all your hosts have the same hardware specification. This is because the cluster has to account for the loss of the biggest host. For example, if you have 3 hosts in your cluster, 2 with 64 GB RAM and the other with 96 GB RAM, then the cluster has to reserve enough capacity to allow for the failure of the 96 GB host. In this example it would make more sense for all three hosts to have 64 GB RAM. It is also important to try to keep reservations similar across the virtual machines in the cluster. If you have a couple of VMs with large reservations, the slot size will be increased to match those reservations, which will likely lead to unused resources. There is a way around this, however: use custom slot sizes, or change the admission control policy to ‘Percentage of cluster resources reserved as failover spare capacity’.
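To see why the mixed-size cluster gains nothing from the bigger host, here is a minimal sketch. It assumes, purely for illustration, a memory slot of 1 GB so that each host's slot count equals its RAM in GB, and it ignores CPU; the function name is hypothetical.

```python
def usable_slots(host_slots, failures_to_tolerate=1):
    """Slots left for running VMs once the slots of the largest hosts
    have been set aside for failover."""
    reserved = sorted(host_slots, reverse=True)[:failures_to_tolerate]
    return sum(host_slots) - sum(reserved)


mixed = [64, 64, 96]    # two 64 GB hosts and one 96 GB host
uniform = [64, 64, 64]  # three 64 GB hosts

print(usable_slots(mixed), usable_slots(uniform))  # 128 128
```

Both clusters end up with the same usable capacity, because the extra 32 GB in the larger host is exactly what has to be held back in case that host fails.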

The percentage admission control policy tends to be used when the virtual machines in the cluster have a wide range of reservations, or when the hosts in the cluster have different hardware specifications. vSphere 5 allows you to specify a percentage of failover resource for both memory and CPU:

admission-control-percentage

By default, both values will be set to 25%. This means that 25% of your cluster’s resources will be reserved. For example, if you had a 4 node cluster, the equivalent of 1 host would be reserved (assuming all hosts were of an equal specification). This value may be too high in the case of a 16 node cluster, for example, where 4 hosts’ worth of resources would be reserved.
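The relationship between the percentage and the number of hosts it represents is easy to check. A small sketch, assuming hosts of equal specification (both function names are illustrative):

```python
def hosts_reserved(host_count, reserved_pct):
    """Number of hosts' worth of resource held back by a given percentage."""
    return host_count * reserved_pct / 100


def pct_for_failures(host_count, failures_to_tolerate=1):
    """Percentage that corresponds to the given number of host failures."""
    return 100 * failures_to_tolerate / host_count


print(hosts_reserved(4, 25))    # 1.0
print(hosts_reserved(16, 25))   # 4.0
print(pct_for_failures(16))     # 6.25
```

So for a 16 node cluster that only needs to survive one host failure, a value nearer 6 or 7% would be closer to the mark than the default 25%. Remember that the percentage has to be revisited whenever hosts are added to or removed from the cluster.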

As slots are not used with this admission control policy, the view on the cluster’s summary tab changes. Rather than displaying the number of used and free slots, you get an overview of the percentage of resources used and available:

percentage-summary-tab

Current CPU Failover Capacity is calculated by subtracting the total CPU resource requirements of the powered-on virtual machines from the total amount of CPU resource available in the cluster, then dividing the result by that total to give a percentage. The same is done for memory. This concept is explained in detail, with examples, in the vSphere Availability Guide. The percentage policy allows a bit more flexibility, in that it can work with hosts of different sizes and VMs with widely differing reservations. Be aware, however, that this policy can lead to resource fragmentation (where a number of hosts in the cluster have spare resource, but no individual host has enough to power on a given VM), though DRS will help with this.
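As a rough sketch of the above, the following Python expresses the failover capacity as a fraction of total resource (which is how the vSphere Availability Guide presents it), a simplified power-on check against the configured percentage, and a fragmentation check. The function names and all numbers are hypothetical, and the real HA logic considers more than this.

```python
def current_failover_capacity(total, required):
    """Fraction of cluster resource still free: (total - required) / total."""
    return (total - required) / total


def admits_power_on(total, required, vm_requirement, reserved_pct):
    """Simplified check: would the free fraction stay at or above the
    configured percentage after powering on one more VM?"""
    remaining = current_failover_capacity(total, required + vm_requirement)
    return remaining >= reserved_pct / 100


def fits_on_single_host(free_per_host, vm_requirement):
    """Fragmentation check: a VM must fit on one host, not across several."""
    return any(free >= vm_requirement for free in free_per_host)


# 24,000 MHz of CPU in the cluster, 6,000 MHz required by running VMs
print(f"{current_failover_capacity(24000, 6000):.0%}")   # 75%
print(admits_power_on(24000, 6000, 12500, 25))           # False

# 9 GB free in total, but no single host can take a VM needing 4 GB
print(fits_on_single_host([3, 3, 3], 4))                 # False
```

The last line is the fragmentation problem in miniature: the cluster as a whole has more than enough spare memory, yet the VM still cannot be restarted until something is moved to free up space on one host.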

The final admission control policy is straightforward. When you specify a failover host, HA will use that host to power on virtual machines in the event of a host failure. For larger clusters you can specify multiple failover hosts.

Useful Links and Resources

http://www.yellow-bricks.com/2013/01/09/percentage-based-admission-control-policy-rules-out-large-vms-being-restarted/

http://frankdenneman.nl/2013/02/15/ha-percentage-based-admission-control-from-a-resource-management-perspective-part-1/

http://frankdenneman.nl/2011/01/20/setting-correct-percentage-of-cluster-resources-reserved/
