One thing I learned overnight is that you really should keep an eye on snapshots in ESX/vSphere. We run AVVI backups from Backup Exec 2010, which uses the vSphere storage APIs to do its business. BE has vSphere take a snapshot by calling the API, then grabs that snapshot and sends it to whatever your backup medium is.
In the past I have seen a snapshot get left behind and never deleted. Last night I started getting paged by our monitoring system that one of our AD servers was offline. After jumping through some hoops to get in via VPN (because that AD server was the one that authenticated VPN users and handed out their DHCP), I was able to get onto the ESX server. There I saw that snapshots had consumed all available disk space on the ESX box, and my AD server had stalled as a result. It turned out the snapshots for my Exchange server had piled up, and I had to delete them. Once there was free space again, my AD server came back online and everything was OK.
Now I need to figure out a way to monitor my ESX server's datastore free space so that this does not happen again.
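As a stopgap until I find a proper monitoring tool, something like the sketch below could do the job: parse `df`-style output from the ESX service console (e.g. from `vdf -h`) and flag any VMFS datastore above a usage threshold. This is a hypothetical sketch, not something we run today; the sample output, datastore names, and the 85% threshold are all made up for illustration.

```python
# Hypothetical sketch: parse df/vdf-style output from the ESX service
# console and flag VMFS datastores above a usage threshold. The sample
# output and threshold below are illustrative assumptions.

def check_datastores(df_output, threshold_pct=85):
    """Return (mount_point, used_pct) pairs at or above threshold_pct."""
    alerts = []
    for line in df_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Only look at VMFS datastore mounts, not the service console disks
        if len(fields) < 6 or not fields[5].startswith("/vmfs/volumes"):
            continue
        used_pct = int(fields[4].rstrip("%"))
        if used_pct >= threshold_pct:
            alerts.append((fields[5], used_pct))
    return alerts

# Made-up sample output for illustration
SAMPLE = """\
Filesystem   Size  Used Avail Use% Mounted on
/dev/sda2    5.0G  2.1G  2.9G  42% /
vmfs3        500G  480G   20G  96% /vmfs/volumes/datastore1
vmfs3        500G  300G  200G  60% /vmfs/volumes/datastore2
"""

for ds, pct in check_datastores(SAMPLE):
    print("WARNING: %s is %d%% full" % (ds, pct))
```

Wired into the pager via cron, a check like this would have caught the Exchange snapshots long before they starved the AD server.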
For years I’ve been using and administering Backup Exec. For nearly all that time I have not used the Exchange agent to its full capacity. At my last job we always just backed up the entire database and never set it up for mailbox-level restores. Why? I think that, long ago, those types of backups were really slow, and I wasn’t the one in charge at the time.
Jump forward almost a decade and I’m now at a new company. I have just set up our new Backup Exec 2010 R2 server, and I decided to see what the performance was like with Granular Restore Technology (GRT) turned on. We have a modest Exchange database, about 27 GB with about 40 mailboxes. Not really knowing what to expect, I was happy to see the whole job run in 24 minutes to our LTO4 drive. Not too shabby, and now I can restore individual mailboxes, or even specific folders and messages, if I need to! Obviously the backup time will grow along with the database, but I think I can handle it for the foreseeable future.
Recently we performed our annual Disaster Recovery test. We have learned something very valuable every year and adjusted our recovery plans accordingly, and this year was no different. Even with all the new technology, DR still seems to be a tricky undertaking.
The first year we tested our plan, we found that we really, really, REALLY don’t want to restore Active Directory onto dissimilar hardware. The following year we got a node onto our MPLS cloud, which allowed us to have a replicated AD server at the DR site and greatly reduced the problem of restoring AD. The year after that we tested the phone system portion of the DR plan and discovered that working with the telco in a DR situation will be challenging at the very least. In the two years since that second test, some major changes removed our ability to keep a replicated AD, so we were back to square one on that front.
This year we thought we would do a restore of our two-year-old VMware environment. We decided to keep the scope to restoring only “Tier 0” services: VMware ESX, Symantec Backup Exec, vSphere, and AD. Time permitting, we planned to restore as many servers as possible beyond that Tier 0 bare minimum.
In the last year we had chosen to purchase Symantec’s Backup Exec Agent for VMware Virtual Infrastructure (AVVI). This is a Backup Exec agent that lets you back up VMware guest OS files directly through the ESX server and/or SAN. The idea is that our virtualized servers would be backed up to tape at the VMware file level, allowing us to restore directly back to ESX.
A large part of our PCI and SAS 70 compliance is maintaining, and testing, a comprehensive and viable Disaster Recovery / Business Continuity plan. As part of this, we will be conducting the annual test of the Technology Availability Plan portion of our DR plan this coming Friday. A co-worker and I will be flying to Scottsdale, Arizona, where our contracted Disaster Recovery vendor has the data center designated for us.
This test will focus on VMware and our ability to recover our vSphere environment. We will have three servers in the test. The first will be a Windows machine that we will use to install our backup environment and restore data from tape. The other two will be ESX servers that we will set up and configure as our VM hosts. We will then restore vCenter Server from tape, as well as several other critical servers that we call “Tier 0”.
Tier 0 in our DR plan consists of the critical servers required to bring the rest of our environment back online in a disaster: Active Directory, backup, and a few other infrastructure services that are needed before anything else can be restored.
We hope to have a successful test, and also hope to uncover roadblocks before they become issues in a real world scenario.
So it has been several weeks since my Part 1 post on this topic. We are still struggling to get all of our servers backed up using AVVI.
I enlisted the help of a co-worker, and he wrote an excellent VBScript that queries the domain for all the servers and then restarts the VMware Tools service on every box. We run this script from our backup server, and it works great. This centralizes the management of that task and keeps us from having to mess with batch files on every server (and potentially forgetting to add the task on a new one). You can download the script if you like: change the .txt extension to .vbs and edit the service name at the top. Edit the mail server settings at the bottom if you want an emailed report of the results.
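For anyone who would rather not read VBScript, the same idea can be sketched in Python. This is a hypothetical outline, not the co-worker's script: the server names are placeholders, the real script queries the domain for its list, and the use of the Windows `sc` command for remote service control is my assumption about one way to do the restart.

```python
# Rough Python sketch of the VBScript's logic: walk a list of servers and
# restart the VMware Tools service on each one remotely via Windows `sc`.
# Server names below are placeholders; the real script queries the domain.

import subprocess

SERVICE = "VMTools"  # service name, editable just like in the VBScript

def restart_commands(server, service=SERVICE):
    """Build the remote stop/start command lines for one server."""
    return [
        ["sc", r"\\%s" % server, "stop", service],
        ["sc", r"\\%s" % server, "start", service],
    ]

def restart_service(server, service=SERVICE):
    """Run stop then start, returning (action, returncode) pairs."""
    results = []
    for cmd in restart_commands(server, service):
        # No check=True: a failed stop (service already stopped) should
        # not prevent the subsequent start from being attempted.
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results.append((cmd[2], proc.returncode))
    return results

if __name__ == "__main__":
    # Demo only: print the commands instead of executing them.
    for server in ["SERVER01", "SERVER02"]:  # placeholder list
        for cmd in restart_commands(server):
            print(" ".join(cmd))
```

The results list could be formatted and mailed out, mirroring the emailed report the VBScript produces.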
I believe this new process has helped, but we are still having issues getting backups from every machine. We have found that occasionally the mgmt-vmware service needs to be restarted on the ESX hosts because vCenter has trouble taking the snapshot. I have not yet taken the time to figure out how to automate this, so it is a manual process at the moment.
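One way it might eventually be automated is a small script that SSHes to each host and restarts the management service. This is only a sketch of the idea: the host names are placeholders, and it assumes key-based SSH access to the ESX service console is already set up.

```python
# Hypothetical sketch: SSH to each ESX host and restart the mgmt-vmware
# management service. Host names are placeholders, and key-based SSH
# access to the service console is assumed.

import subprocess

ESX_HOSTS = ["esx1.example.com", "esx2.example.com"]  # placeholders

def restart_mgmt_cmd(host):
    """The ssh command line that restarts the management agent on one host."""
    return ["ssh", "root@%s" % host, "service", "mgmt-vmware", "restart"]

def restart_all(hosts=ESX_HOSTS):
    for host in hosts:
        # check=False: one unreachable host shouldn't abort the whole run
        subprocess.run(restart_mgmt_cmd(host), check=False)

if __name__ == "__main__":
    # Demo only: show the commands rather than executing them.
    for host in ESX_HOSTS:
        print(" ".join(restart_mgmt_cmd(host)))
```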