A large part of our PCI and SAS 70 compliance is maintaining, and testing, a comprehensive and viable Disaster Recovery / Business Continuity plan. As part of this, we will be conducting our annual test of the Technology Availability Plan of our DR plan this coming Friday. A co-worker and I will be flying to Scottsdale, Arizona, where our contracted Disaster Recovery vendor has its data center that is stipulated for us.
For this test we will be testing VMware and our ability to recover our vSphere environment. We will have three servers in the test. The first will be a Windows machine that we will use to install our backup environment and restore data from tape. The other two machines will be ESX servers that we will set up and configure as our VM hosts. We will then restore vCenter Server from tape, as well as several other critical servers that we call “Tier 0”.
Tier 0 for our DR Plan consists of critical servers that are required to bring the rest of our environment back online in a disaster. These include Active Directory, Backup, and a few other infrastructure services that are needed before anything else can be restored.
We hope to have a successful test, and we also hope to uncover roadblocks before they become issues in a real-world scenario.
We’re experiencing a weird pop-up message on our Backup Exec 2010 server. Every morning (after backups have run) we get a string of errors from Windows indicating that there was an error with a registry hive. The error reads like this:
“Registry hive (file:) C:\WINDOWS\vmware-SYSTEM\vixmntapiXX was corrupted and it has been recovered. Some data might have been lost.”
This error message is there every morning, and there will be anywhere from 10 to 20 of them that we have to click through. The only thing that changes is the XX, which is a number that increments. As near as I can tell, there is nothing wrong with the system, and there are no symptoms of trouble other than the messages.
Going off the error message itself, and the fact that BE was running without this error until I turned on VMware backups, I suspect that this is an error with the Agent for VMware Virtual Infrastructure (AVVI), otherwise known as the VMware Agent. I’ve tried some Googling and haven’t come up with much relating to this error specifically. At this point it’s really just an annoyance, as we have not seen anything that would indicate an issue. I’m just crossing my fingers that restores of data from AVVI will actually work!
We’re doing our Disaster Recovery Test this Friday, so we’ll know pretty quickly whether these VMware backups will work or not!
So it has been several weeks since my Part 1 post on this topic. We are still struggling to get all of our servers backed up using AVVI.
I enlisted the help of a co-worker, and he wrote an excellent VBScript that queries the domain for all the servers and then restarts the VMTools service on every box. We run this script from our backup server, and it works great. This centralizes the management of that task and keeps us from having to mess with batch files on every server, and from potentially forgetting to add the task on a new server. You can download the script if you like. Change the .txt extension to .vbs, and edit the service name at the top. Edit the mail server settings at the bottom if you wish to get an emailed report of the results.
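For anyone who would rather not use VBScript, the same idea can be roughed out in Python. This is only a sketch under a few assumptions, not my co-worker's script: it does not query the domain (you pass in the list of server names yourself), it skips the emailed report, the function names are made up for illustration, and it assumes it runs on a Windows box with rights to control services remotely through the built-in `sc` command.

```python
import subprocess
import time
from typing import Callable, Dict, Iterable, List

# Assumed service name; as with the original script, edit this to match your servers.
SERVICE_NAME = "VMTools"


def build_sc_command(host: str, action: str, service: str = SERVICE_NAME) -> List[str]:
    """Build an `sc` command line that targets a service on a remote host."""
    if action not in ("stop", "start"):
        raise ValueError(f"unsupported action: {action}")
    # sc addresses a remote machine as \\hostname
    return ["sc", "\\\\" + host, action, service]


def restart_service(host: str, run: Callable = subprocess.run, delay: float = 5.0) -> bool:
    """Stop and then start the service on one host.

    Returns True if the final start command reports success. The fixed
    delay is a crude stand-in for polling until the service has stopped.
    """
    run(build_sc_command(host, "stop"), capture_output=True, text=True)
    time.sleep(delay)
    result = run(build_sc_command(host, "start"), capture_output=True, text=True)
    return result.returncode == 0


def restart_all(hosts: Iterable[str], run: Callable = subprocess.run,
                delay: float = 5.0) -> Dict[str, bool]:
    """Restart the service on every host and collect a per-host result."""
    return {host: restart_service(host, run=run, delay=delay) for host in hosts}
```

Calling `restart_all(["SERVER01", "SERVER02"])` (placeholder names) from the backup server would return a dictionary of host names and True/False results, which could then be turned into a report.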
I believe that this new process has helped, but we are still having issues getting backups from all the machines. We have found that occasionally the mgmt-vmware service needs to be restarted on the ESX hosts because vCenter has trouble getting the snapshot. I have not yet taken the time to figure out how to automate this, so it is a manual process at the moment.
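One way the host-side restart might eventually be automated is to push the restart command over SSH from the backup server. The sketch below is untested in our environment and rests on several assumptions: SSH access to the ESX service console is enabled for the account used, key-based login is already set up, an `ssh` client is on the path, and the function names are hypothetical.

```python
import subprocess
from typing import Callable, List

# Command normally run by hand on the ESX service console.
RESTART_COMMAND = "service mgmt-vmware restart"


def build_restart_command(host: str, user: str = "root") -> List[str]:
    """Build the ssh invocation that restarts mgmt-vmware on one ESX host."""
    return ["ssh", f"{user}@{host}", RESTART_COMMAND]


def restart_mgmt_vmware(host: str, user: str = "root",
                        run: Callable = subprocess.run) -> bool:
    """Run the restart over SSH; return True if the remote command exits with 0."""
    result = run(build_restart_command(host, user), capture_output=True, text=True)
    return result.returncode == 0
```

Looping `restart_mgmt_vmware` over the host names would cover every ESX server, though restarting the management service on all hosts before every backup is probably heavier-handed than doing it only when a snapshot fails.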
I’ve been working with Symantec’s new VMware agent for Backup Exec 2010 for the past couple of months. We were excited to have differential backup of VMDK files through vSphere when they announced the new version of the software. In practice, however, this is a bit more involved than we bargained for.