Depending on your environment, you may need to protect vCenter or some of the services that make up the vCenter system. The big questions to ask yourself: how much downtime can you afford according to your service levels, and which options do you have, or need to have, in place?
What will go down if you lose a vCenter component?
As said, this depends on your environment and on the components that connect to and from vCenter services. A “plain” server virtualization workload at one company is different from a virtual desktop (VD) workload in a highly demanding organization. The latter probably needs to provision a little more urgently than the former. Want to deploy a vCOPS vApp or a VD desktop? Well, wait until your vCenter is back. Solutions like VMware Data Protection require an operational vCenter with a functioning vCenter Single Sign-On server to restore a virtual machine. Losing that part of your environment could seriously impact your recovery options. Want to manage or edit a VM with hardware version 10? How will you do that without the vSphere Web Client? You can’t. Have an HA or DRS cluster? Well, HA will still partially function and will react with restarts when needed, but adding to the cluster requires vCenter. DRS needs vCenter to function in manual or automatic mode. And these are just a few examples.
Important to keep in mind: running VMs will keep on running and HA will keep on HA’ing, so no need to panic there.
Let’s see which components make up vCenter, with a little vCenter architecture to start with.
A “standard” vCenter is made up of vCenter SSO (Single Sign-On), the Lookup Service, the Inventory Service, the vSphere Web Client and the vCenter Server itself (with all of its services). Optional services are the Dump Collector, Syslog Collector and Auto Deploy (and optionally TFTP and PXE DHCP services, but these can run on a separate system, so they are not included in the model). vCenter is also extended by Update Manager, vCOPS and all sorts of other products.
What are your standard protection options?
- Do nothing
Not advisable, but if you are sure, have a small environment (just a few hosts and VMs) and have good insight into it (or use some scripting to dump your configuration), you could do nothing. You lose part of your services and, in the worst case, will have to manually rebuild vCenter and your configuration. You will lose any trending information. Recovery time is typically measured in days and requires manual intervention.
- Backup and restore, or replication
Backup and restore should be an essential part of any availability solution! This provides a recovery method utilizing tape, disk, replication or snapshot technology. It also gives you a way to recover from data corruption, depending on the solution, that is. Keep in mind that if data is corrupt on the primary VM, that corruption can be replicated to the recovery VM from that moment on. vCenter VM replication from the primary to the recovery site should be well monitored (and tested, with SRM plans for example). Preferably protect on several layers: the application and the application data (for example databases, certificates, logs, dump locations, etc.). Be sure to know your backup and recovery steps (look in the VMware KBs for backing up the vCenter Server Appliance services and the embedded vPostgres database), and document, practice and test them. Recovery time is typically measured in hours or days and typically requires manual intervention.
- MS SQL Log shipping – database only
A simple and cost-effective solution. You can use log shipping to send transaction logs from one database (the primary database) to another (the secondary database) on a constant basis. Continually backing up the transaction logs from a primary database server, and then copying and restoring them to a secondary database server, keeps the secondary database nearly synchronized with the primary (how nearly depends on your plan). The destination server acts as a cold standby or backup server. Your destination server can also act as the primary for other databases, which gives you some sort of active-active setup instead of a pure cold standby. Beware of licensing in this case: a server that is only a log shipping target and a server that also serves databases are different licensing stories! Log shipping has to be set up for every database, including your vCenter, Inventory, SSO and such. Recovery time depends on your plan, but can be minutes or hours. Failing over from primary to secondary requires manual intervention.
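To get a feel for what “nearly synchronized, depending on your plan” means, here is a rough sketch that adds up the three job intervals of a log shipping plan. The class name, field names and the 15-minute values are made up for illustration, and job run times and network delays are ignored, so treat the result as a ballpark upper bound rather than a guarantee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogShippingPlan:
    """Hypothetical schedule, all intervals in minutes."""
    backup_every: int   # transaction log backup job on the primary
    copy_every: int     # copy job that moves the backups to the secondary
    restore_every: int  # restore job on the secondary


def worst_case_lag_minutes(plan: LogShippingPlan) -> int:
    """Rough upper bound on how far the secondary trails the primary.

    A change can just miss a backup, that backup can just miss a copy,
    and that copy can just miss a restore, so the intervals add up.
    Job run times and network delays are ignored (an assumption).
    """
    return plan.backup_every + plan.copy_every + plan.restore_every


if __name__ == "__main__":
    plan = LogShippingPlan(backup_every=15, copy_every=15, restore_every=15)
    print(worst_case_lag_minutes(plan))  # 45
```

Shortening any of the three intervals tightens the bound, at the price of more frequent jobs on both servers.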
- SQL mirror / clustering – database only
Depending on your MSSQL license, these are more robust solutions than the previously mentioned SQL log shipping. They have a data replication mechanism in place and can automatically detect failures and fail over without manual intervention. They are mostly used with a witness outside the cluster or mirror pair that acts as a tie breaker, to prevent split-brain scenarios in case of partial failures. Mirroring has to be set up for every database, including your vCenter, Inventory, SSO and such. Clustering can also be done per instance, covering the databases it contains. Oracle has its own clustering, with Oracle RAC for example. Recovery time is typically measured in minutes. No intervention is needed to fail over.
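The tie-breaker role of the witness can be pictured with a small majority-vote sketch. This is a conceptual model only, not how SQL Server implements quorum; the function and the node names are hypothetical.

```python
def may_serve(node: str, reachable: set[str], voters: frozenset[str]) -> bool:
    """Return True if `node` sees a strict majority of voters, itself included.

    `reachable` is the set of other nodes that `node` can currently contact.
    """
    seen = (reachable | {node}) & voters
    return len(seen) * 2 > len(voters)


# Hypothetical three-voter setup: two database partners plus a witness.
VOTERS = frozenset({"principal", "mirror", "witness"})

if __name__ == "__main__":
    # The link between the partners is cut; only the mirror still sees the witness.
    print(may_serve("principal", set(), VOTERS))      # False
    print(may_serve("mirror", {"witness"}, VOTERS))   # True
```

With only two voters and no witness, neither side of a cut link can reach a majority, so under this rule nobody takes over automatically. That is the gap the witness fills.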
- Hypervisor HA.
Hypervisor HA will restart your VM after a host failure or a VMware Tools heartbeat timeout. The time it takes to recover depends on the number of free slots, the restart priority of vCenter versus the other workloads and the number of VMs that need to be restarted. Depending on your environment this can take some time. Hypervisor HA will not protect against service failures, as it does not monitor any application components, and it will not protect against data corruption either. Hypervisor HA is to be used in conjunction with one or more other protection options, for example a vCenter system on HA with its SQL databases on an MSSQL cluster. Recovery time is typically measured in minutes or hours, depending on your consolidation ratio and restart settings.
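Why the restart priority of vCenter matters can be shown with a toy model that groups VMs into restart waves. This is not how vSphere HA actually schedules restarts; the `per_wave` capacity, the priority labels and the VM names are all assumptions for illustration.

```python
PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def restart_waves(vms: list[tuple[str, str]], per_wave: int) -> list[list[str]]:
    """Group VMs into restart waves, highest priority first.

    `vms` holds (name, priority) pairs. `per_wave` is a hypothetical number
    of VMs the surviving hosts can power on at the same time.
    """
    if per_wave < 1:
        raise ValueError("per_wave must be at least 1")
    # sorted() is stable, so VMs with equal priority keep their input order.
    ordered = sorted(vms, key=lambda vm: PRIORITY_ORDER[vm[1]])
    names = [name for name, _ in ordered]
    return [names[i:i + per_wave] for i in range(0, len(names), per_wave)]


if __name__ == "__main__":
    failed_host = [
        ("web01", "medium"),
        ("vcenter", "high"),
        ("test01", "low"),
        ("sql01", "high"),
        ("web02", "medium"),
    ]
    for number, wave in enumerate(restart_waves(failed_host, per_wave=2), start=1):
        print(number, wave)
```

In this model a vCenter VM left at a low priority ends up in a later wave, behind workloads that may matter less during an outage.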
- App Aware HA.
If you have the correct edition and the application-aware components in place, App HA monitors the application and can restart it when it goes down. There is no app-aware HA specifically for vCenter yet, but you can protect parts of the application stack with App HA, for example the MSSQL services. Recovery time is typically measured in minutes or hours.
- Fault Tolerance (FT)
Protecting vCenter with FT is currently a no-no. Why did I put it up here? Because it comes up as a question once in a while. FT creates virtual machine “pairs” that run in lockstep, essentially mirroring the execution state of a virtual machine. This only protects against host or VM failures. Services that go down or corruption in the application data will be mirrored to the secondary VM.
FT in vSphere 5.5 is still limited to 1 vCPU, while vCenter needs a minimum of 2 vCPUs even with a small inventory. The same goes for a database server, for example; these also tend to have more vCPUs. Yes, this has been an issue all along for FT, and we know from the demos in those VMworld sessions that multi-vCPU FT is a work in progress, but unfortunately it has not been released yet. But a similar technique is next up.
- vCenter Server Heartbeat
vCenter Server Heartbeat is a separately licensed vCenter Server plug-in that protects your vCenter system (physical or virtual). Besides protecting against host failures, Heartbeat adds application-level monitoring and intelligence for all vCenter Server components. Heartbeat replicates changes to a cloned virtual machine. The cloned virtual machine can take over when a failure event is triggered.
vCenter recovery can be accomplished by restarting the vCenter service, by restarting the entire application, or by failing over the entire vCenter system. Use it in conjunction with a data protection option like SQL mirroring to protect against corruption. Recovery time is measured in minutes and requires no manual intervention.
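The three recovery levels read like an escalation ladder: try the cheapest fix first and only fail over when nothing else works. A minimal sketch of that idea follows. It is not Heartbeat's actual logic, and the step callables are hypothetical stand-ins for whatever performs each action and reports success.

```python
from typing import Callable, Optional

# A step is a label plus an action that returns True when recovery succeeded.
Step = tuple[str, Callable[[], bool]]


def recover(steps: list[Step]) -> Optional[str]:
    """Run recovery steps in order and return the label of the first success.

    Returns None when every step fails and a human has to step in.
    """
    for label, action in steps:
        if action():
            return label
    return None


if __name__ == "__main__":
    # Stub actions for illustration: the service restart fails,
    # the application restart succeeds, so no failover is attempted.
    ladder: list[Step] = [
        ("restart service", lambda: False),
        ("restart application", lambda: True),
        ("fail over to clone", lambda: True),
    ]
    print(recover(ladder))  # restart application
```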
- Scale out / HA service pair
Move some of your vCenter services to other systems, or use multiple servers with the same role to provide highly available and load-balanced services. Not all vCenter services can be separated this way, but SSO, for example, can be. Those highly available services are placed behind a third-party network load balancer (for example Apache HTTPD, the vCloud Networking and Security vShield Edge load balancer or a load balancer appliance like NetScaler).
Move logs to a Log Insight server and move statistics to vCOPS. Keep vCenter lean and mean.
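What the load balancer does for a pair of same-role servers can be sketched as round-robin selection that skips members marked unhealthy. This is a conceptual model, not the behaviour of any of the products named above; the class, its methods and the backend names are hypothetical.

```python
from itertools import cycle


class RoundRobinPool:
    """Hand out backends in turn, skipping the ones marked as down."""

    def __init__(self, backends: list[str]) -> None:
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._healthy = set(backends)
        self._ring = cycle(self._backends)

    def mark_down(self, backend: str) -> None:
        self._healthy.discard(backend)

    def mark_up(self, backend: str) -> None:
        if backend in self._backends:
            self._healthy.add(backend)

    def pick(self) -> str:
        # Look at each backend at most once per call.
        for _ in range(len(self._backends)):
            candidate = next(self._ring)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy backend available")


if __name__ == "__main__":
    pool = RoundRobinPool(["sso-a", "sso-b"])
    print(pool.pick(), pool.pick())  # sso-a sso-b
    pool.mark_down("sso-a")          # e.g. after a failed health check
    print(pool.pick(), pool.pick())  # sso-b sso-b
```

A real load balancer would drive `mark_down` and `mark_up` from periodic health checks. Here they are called by hand to keep the sketch short.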
vCenter Server Heartbeat is a complete package for protecting your vCenter Server system, but it comes at an additional cost. More often you will already have some back-end services in place, like Oracle/MSSQL clustering and backup/restore or replication solutions, or other products with a similar need. A combination of protection options is the preferred way: use the solutions that are already in place, or soon will be, and match them against the need for protection and the allowed recovery time and downtime. But this is the main thing: know your environment, know how the components interact, know what is needed at which time and know what will be (temporarily) unavailable when services are down. Protect against unavailability and corruption, and please test at random moments to be sure all components work as expected (even the manual procedures).
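Matching options against the allowed downtime can be sketched as a simple filter. The table below transcribes the worst-case recovery time of each option from the descriptions above. The “automatic” flags for Hypervisor HA and App aware HA are my own reading rather than something stated outright, and all names in the code are hypothetical.

```python
# option -> (worst-case recovery time class, fails over without manual steps)
OPTIONS = {
    "Do nothing": ("days", False),
    "Backup/restore or replication": ("days", False),
    "MS SQL log shipping": ("hours", False),
    "SQL mirror / clustering": ("minutes", True),
    "Hypervisor HA": ("hours", True),        # automatic flag is an assumption
    "App aware HA": ("hours", True),         # automatic flag is an assumption
    "vCenter Server Heartbeat": ("minutes", True),
}

RANK = {"minutes": 0, "hours": 1, "days": 2}


def candidates(max_downtime: str, automatic_only: bool = False) -> list[str]:
    """List the options whose worst case fits within the allowed downtime."""
    limit = RANK[max_downtime]
    return [
        name
        for name, (worst, automatic) in OPTIONS.items()
        if RANK[worst] <= limit and (automatic or not automatic_only)
    ]


if __name__ == "__main__":
    print(candidates("minutes"))
    print(candidates("hours", automatic_only=True))
```

The result is only a starting point: several options protect just the database or just the VM, so you still need to combine them to cover the whole vCenter system.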
And yes, sure, there will be some other great options out there, like a script collection or a cold standby solution, et al. But hey, isn’t that what the comments section is for? Tell me yours. Share.
– Happy managing your environment!