What’s the Difference Between High Availability and Fault Tolerance in VMware vSphere?

High availability’s goal within VMware vSphere is to minimize downtime, not prevent it. This feature is available in all editions of vSphere except Essentials. It is designed to handle the failure of any or all of the following:

Loss of a physical ESXi server.
Loss of a virtual machine.
Loss of an application within a virtual machine.

In the first case, when a server fails, all virtual machines on that server fail immediately. Within a few seconds, the other hosts in the cluster detect the failure. The master node, which coordinates all high availability activity within the cluster in conjunction with vCenter, then assigns the failed virtual machines to the surviving nodes, where they are restarted.

In this case, no vMotion is performed: the virtual machine fails and is then restarted on a different node. This works even if vCenter is unavailable.
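The restart-based failover described above can be modeled as a small simulation. This Python sketch is a toy illustration, not VMware code: the `Host` class, the `handle_host_failure` function, and the "least-loaded survivor" placement policy are all hypothetical stand-ins for whatever placement logic vSphere HA actually applies. What it does show is the key point: the virtual machines are not moved live, they are restarted on whichever hosts survive.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical model of an ESXi host and the VMs running on it."""
    name: str
    alive: bool = True
    vms: list[str] = field(default_factory=list)


def handle_host_failure(failed: Host, cluster: list[Host]) -> dict[str, str]:
    """Toy model of an HA master reacting to a host failure.

    The VMs on the failed host are not migrated (no vMotion); they went
    down with the host and are restarted from scratch on survivors.
    Returns a mapping of VM name -> host it was restarted on.
    """
    failed.alive = False
    survivors = [h for h in cluster if h.alive]
    if not survivors:
        raise RuntimeError("no surviving hosts to restart VMs on")
    placements: dict[str, str] = {}
    for vm in failed.vms:
        # Assumed placement policy: restart on the survivor with the fewest VMs.
        target = min(survivors, key=lambda h: len(h.vms))
        target.vms.append(vm)  # cold restart: the VM boots again on the new host
        placements[vm] = target.name
    failed.vms.clear()
    return placements


cluster = [Host("esx1", vms=["web", "mail"]), Host("esx2", vms=["db"]), Host("esx3")]
print(handle_host_failure(cluster[0], cluster))  # {'web': 'esx3', 'mail': 'esx2'}
```

Because the restart is a fresh boot, nothing about the surviving host has to match the failed one, which is why the sketch places VMs purely by load.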

In fact, high availability can even restart a failed vCenter Server virtual machine when the host it was running on fails. Recovering from a host failure is the original and primary use case of high availability, and it is the only one of the three use cases that moves a virtual machine to a different physical server.

From a configuration perspective, you can mix and match any kind of server in the cluster, because the virtual machine is restarted from scratch rather than migrated while running. You could therefore (but really shouldn't) mix Intel and AMD servers in the same cluster.

In the remaining two use cases, the virtual machine is restarted on the same physical host, so no migration is involved, and we will not describe them further here.

What if downtime is unacceptable? Then high availability is not sufficient, and this is where fault tolerance comes in. Fault tolerance is a feature of high availability: you first configure high availability and then layer the fault tolerance capability on top of it.

When the virtual machine is powered on, or when the feature is enabled on a running virtual machine, fault tolerance essentially performs a vMotion to a second server in the cluster. Instead of completing the move as a standard vMotion does, however, it keeps the two copies of the virtual machine running in lockstep (using vLockstep technology). If the primary copy fails, the secondary becomes the new primary within a couple of seconds, keeps processing, and in turn spins up a new secondary copy. If the secondary fails instead, the primary creates a new secondary and keeps running. In either case, the goal of fault tolerance is no downtime and no data loss for applications.
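The primary/secondary failover rules above can be expressed as a small state transition. This Python sketch is illustrative only: `FTPair` and `on_host_failure` are hypothetical names, the new secondary is simply placed on the first spare host, and the sketch ignores how the two copies are actually kept in sync. It also assumes, as a modeling choice, that when no spare host exists the surviving copy keeps running without a secondary rather than stopping.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FTPair:
    """Hypothetical model of a fault-tolerant VM: a primary and a secondary
    copy running in lockstep on two different hosts."""
    primary_host: str
    secondary_host: Optional[str]


def on_host_failure(pair: FTPair, failed_host: str, spare_hosts: list[str]) -> FTPair:
    """Return the new primary/secondary placement after `failed_host` dies.

    The running copy is never restarted; only a new secondary is created.
    """
    if failed_host == pair.primary_host:
        survivor = pair.secondary_host  # secondary takes over and keeps processing
    elif failed_host == pair.secondary_host:
        survivor = pair.primary_host  # primary keeps running unaffected
    else:
        return pair  # an unrelated host failed; this VM is untouched
    # A new secondary must land on a host not already used by this pair.
    candidates = [h for h in spare_hosts if h not in (pair.primary_host, pair.secondary_host)]
    new_secondary = candidates[0] if candidates else None  # None = running unprotected
    return FTPair(primary_host=survivor, secondary_host=new_secondary)


pair = FTPair("esx1", "esx2")
pair = on_host_failure(pair, "esx1", ["esx1", "esx2", "esx3"])
print(pair)  # FTPair(primary_host='esx2', secondary_host='esx3')
```

Unlike the high availability case, no branch here reboots anything: one copy is always already running, which is what makes "no downtime" possible.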

From a configuration perspective, since the virtual machine is running in parallel on two different servers, the servers must be vMotion compatible (among other configuration requirements).
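The difference between this requirement and the mix-and-match freedom that high availability allows can be captured in a single placement check. In this hypothetical Python sketch, `HostSpec` and `can_host` are invented names, and vMotion compatibility is deliberately reduced to "same CPU vendor", which is only one of the real requirements; treat it as an illustration of the contrast, not a compatibility test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostSpec:
    """Hypothetical host description; only the CPU vendor is modeled."""
    name: str
    cpu_vendor: str  # e.g. "intel" or "amd"


def can_host(feature: str, source: HostSpec, target: HostSpec) -> bool:
    """Can `target` take over a VM from `source` under the given feature?

    - "ha": the VM is cold-restarted, so any host in the cluster qualifies.
    - "ft": the copies run in parallel, so the hosts must be vMotion
      compatible; simplified here (an assumption) to "same CPU vendor".
    """
    if feature == "ha":
        return True
    if feature == "ft":
        return source.cpu_vendor == target.cpu_vendor
    raise ValueError(f"unknown feature: {feature!r}")


intel, amd = HostSpec("esx1", "intel"), HostSpec("esx2", "amd")
print(can_host("ha", intel, amd), can_host("ft", intel, amd))  # True False
```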

High availability is included with vSphere Essentials Plus and higher editions.

This is an excerpt from the Global Knowledge white paper, Virtual Machine Migration Methods in vSphere.
