VCAP6-DCV Design Journey – Objective 2.3 – Build Availability Requirements into a vSphere 6.x Logical Design

This post is intended to cover the VCAP design objective around building availability requirements into a logical design. At the time of writing, the required ‘Skills and Abilities’ listed by VMware for this topic are:

Evaluate which logical availability services can be used with a given vSphere solution
Differentiate infrastructure qualities related to availability
Describe the concept of redundancy and the risks associated with single points of failure
Explain class of nines methodology
Determine availability component of service level agreements (SLAs) and service level management processes
Determine potential availability solutions for a logical design based on customer requirements
Create an availability plan, including maintenance processes
Balance availability requirements with other infrastructure qualities
Analyze a vSphere design and determine possible single points of failure

To prepare for this objective you should be familiar with the following documents:

Much of this objective is about designing for high availability, so it’s necessary to have a thorough understanding of the features in vSphere that help you make your solution highly available.

Evaluate which logical availability services can be used with a given vSphere solution

Here we need to think about providing high availability at both the application layer and the infrastructure layer. Things to consider here would include:

Infrastructure redundancy: vSphere HA, vSphere Fault Tolerance, vSphere clusters, network interface teaming, physical network connectivity, storage multipathing and so on
Application redundancy: load balancing, clustering, replication
Virtual machine placement – DRS and SDRS rules – affinity/anti-affinity, manual datastore placement

Differentiate infrastructure qualities related to availability

This is about being able to tell which design considerations apply to which infrastructure quality. The infrastructure qualities are Availability, Manageability, Performance, Recoverability and Security. One that is closely linked with availability is recoverability. There is a great post here discussing how recoverability impacts availability.

Describe the concept of redundancy and the risks associated with single points of failure, and Analyze a vSphere design and determine possible single points of failure

This one is relatively straightforward. It’s important to identify any single points of failure. Examples of a single point of failure could be:

A single VM running an application
A single dual-port HBA (both ports sit on one physical card)
A single network uplink/path
vCenter Server prior to v6.5 (native vCenter High Availability was introduced in 6.5)

Any SPOFs should be recorded as a risk. If a SPOF cannot be removed by making the component highly available, then the risk should be mitigated where possible, e.g. by taking regular backups and having a robust recovery plan.
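To illustrate why redundancy matters, here is a rough sketch of the usual availability arithmetic. It assumes component failures are independent, which is rarely strictly true in practice, and the 99% figures and function names are made up for the example rather than taken from any VMware documentation.

```python
def parallel_availability(component_availability, copies):
    """Availability of N redundant copies, assuming independent failures."""
    return 1 - (1 - component_availability) ** copies


def series_availability(availabilities):
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


# Hypothetical figures, for illustration only.
single_uplink = 0.99
single_hba_card = 0.99

# Two teamed uplinks: the path is down only if both uplinks are down.
teamed_uplinks = parallel_availability(single_uplink, 2)
print(f"One uplink: {single_uplink:.4%}")
print(f"Two teamed uplinks: {teamed_uplinks:.4%}")

# A redundant pair in series with a SPOF is still capped by the SPOF.
combined = series_availability([teamed_uplinks, single_hba_card])
print(f"Teamed uplinks + single HBA card: {combined:.4%}")
```

The last line shows the point of SPOF analysis: adding redundancy in one place does little if another single component remains in the path.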

Explain class of nines methodology

Class of nines is a term used in relation to measuring availability and as such is closely linked to SLAs. There is an article covering this here. Some classes to be familiar with:

Two Nines – 99% Availability – 3.65 days downtime per year
Three Nines – 99.9% Availability – 8.76 hours downtime per year
Four Nines – 99.99% Availability – 52.6 minutes downtime per year
Five Nines – 99.999% Availability – 5.26 minutes downtime per year
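The downtime figures above can be reproduced with some simple arithmetic. The following is a minimal sketch, assuming a 365-day year (8,760 hours); the function name is my own.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent):
    """Return the maximum downtime in hours per year for an availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


classes = [
    ("Two Nines", 99.0),
    ("Three Nines", 99.9),
    ("Four Nines", 99.99),
    ("Five Nines", 99.999),
]

for label, percent in classes:
    hours = downtime_hours_per_year(percent)
    print(f"{label}: {percent}% -> {hours:.4f} hours ({hours * 60:.2f} minutes) per year")
```

Dividing the hours by 24 gives days, so 99% works out to 87.6 hours, or 3.65 days, per year.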

The higher the availability the better, but higher availability usually costs more and is more difficult to achieve.

Determine availability component of service level agreements (SLAs) and service level management processes

Firstly, it’s important to understand what an SLA is. In short, an SLA is a contract/commitment between a supplier and its customers. Usually there are targets that should be met, and for availability these are usually expressed using the class of nines methodology covered in the previous section. Often there are penalties if targets are not met. It will be necessary to make design choices that aim to meet any agreed SLAs.
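As a rough sketch of how an availability target might be checked during service level management, the snippet below compares measured downtime in a reporting period against a target. The 30-day period, the outage figures and the function names are assumptions for illustration; a real SLA will define its own measurement period and whether planned maintenance counts as downtime.

```python
def achieved_availability(period_minutes, downtime_minutes):
    """Return the achieved availability as a percentage of the period."""
    return 100 * (period_minutes - downtime_minutes) / period_minutes


def meets_sla(target_percent, period_minutes, downtime_minutes):
    """Return True if the achieved availability meets or exceeds the target."""
    return achieved_availability(period_minutes, downtime_minutes) >= target_percent


# Hypothetical 30-day reporting period measured in minutes.
month = 30 * 24 * 60

# A 99.9% target allows roughly 43 minutes of downtime in a 30-day period.
for outage in (30, 60):
    achieved = achieved_availability(month, outage)
    result = "met" if meets_sla(99.9, month, outage) else "missed"
    print(f"{outage} minutes down: {achieved:.3f}% achieved, 99.9% target {result}")
```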

Create an availability plan, including maintenance processes

Ideally a Business Impact Analysis will have been performed, which will help determine the RTO and RPO values that are in scope. These then feed into the design so that they can be met, for example through storage replication and backups. Maintenance windows should also be included, and relevant manual or automated processes established. Important concepts here include:

RPO – Recovery Point Objective – The maximum acceptable age of the data that is restored, i.e. how much data loss (measured in time) can be tolerated. This may apply to an entire VM or to data hosted on a VM etc.
RTO – Recovery Time Objective – The maximum acceptable amount of time taken to restore service
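A simple way to sanity-check a protection method against these values is sketched below. It assumes worst-case data loss is roughly the interval between protection points, which is a simplification, and all of the intervals, recovery times and names are hypothetical.

```python
from datetime import timedelta


def meets_rpo(protection_interval, rpo):
    """Worst-case data loss is approximated by the interval between copies."""
    return protection_interval <= rpo


def meets_rto(estimated_recovery_time, rto):
    """The estimated time to restore service must not exceed the RTO."""
    return estimated_recovery_time <= rto


# Hypothetical requirements from a Business Impact Analysis.
rpo = timedelta(hours=4)
rto = timedelta(hours=1)

# Hypothetical protection methods: (interval between copies, estimated recovery time).
methods = {
    "nightly backup": (timedelta(hours=24), timedelta(hours=6)),
    "replication every 15 minutes": (timedelta(minutes=15), timedelta(minutes=30)),
}

for name, (interval, recovery) in methods.items():
    print(f"{name}: RPO met={meets_rpo(interval, rpo)}, RTO met={meets_rto(recovery, rto)}")
```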

vSphere technologies to consider here would include vSphere High Availability and Fault Tolerance, vSphere Replication and Site Recovery Manager.

Balance availability requirements with other infrastructure qualities

I see this as being about understanding how to get the balance right between designing to meet a high availability target/SLA and the impact this may have on the cost and complexity of the solution. There will likely need to be a compromise between availability, manageability and cost.