Recently we performed our annual Disaster Recovery test. We have learned something very valuable every year and tried to adjust our recovery plans accordingly, with this year being no different. Even with all the new technology, DR still seems to be a tricky undertaking.
The first year we tested our plan, we found that we really, really, REALLY don’t want to restore Active Directory onto unlike hardware. The following year we had gotten a node onto our MPLS cloud, which allowed us to have a replicated AD server at the DR site. This greatly reduced the problem of restoring AD. The year after that we tested the phone system portion of our DR plan and discovered that working with the telco in a DR situation will be challenging at the very least. In the two years since that second test there have been some major changes that removed our ability to have replicated AD, so we were back to square one on that front.
This year we thought we would do a restore of our two-year-old VMware environment. We had decided to keep the scope to restoring only “Tier 0” services. This included VMware ESX, Symantec BackupExec, vSphere, and AD. Time permitting, we planned to restore as many servers as possible beyond that bare-minimum Tier 0 set.
In the last year we had made the choice to purchase Symantec’s BackupExec Agent for VMware Virtual Infrastructure (AVVI). This is a BackupExec agent that allows you to back up VMware guest OS files directly through the ESX server and/or SAN. The idea was that we would have our virtualized servers backed up to tape at the VMware file level, and that this would allow us to restore directly back to ESX.
So here is what we thought would happen in our test. We would install ESX onto whatever hardware our DR vendor provided (it shouldn’t matter, right?!) while installing BackupExec directly onto one of the other servers we were provided. Then we would inventory and catalog the AVVI backups from home, restore AD and vCenter, boot them up in ESX, and start rebuilding our environment from there. Easy peasy, right? Not so much.
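For anyone writing down a similar plan, the sequence above can be sketched as a small dependency graph. This is only an illustration, not something we ran during the test: the step names are made-up labels for the steps just described, and the dependencies are my reading of the plan. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish before it can start.
# Step names are hypothetical labels for the plan described in the post.
restore_plan = {
    "install_esx": set(),
    "install_backupexec": set(),
    "catalog_avvi_tapes": {"install_backupexec"},
    "restore_ad_vm": {"install_esx", "catalog_avvi_tapes"},
    "restore_vcenter_vm": {"install_esx", "catalog_avvi_tapes"},
    "boot_ad_and_vcenter": {"restore_ad_vm", "restore_vcenter_vm"},
    "restore_remaining_servers": {"boot_ad_and_vcenter"},
}


def restore_order(plan):
    """Return one valid order in which the steps can be carried out."""
    # static_order() raises graphlib.CycleError if two steps depend on
    # each other, directly or indirectly.
    return list(TopologicalSorter(plan).static_order())


if __name__ == "__main__":
    for number, step in enumerate(restore_order(restore_plan), start=1):
        print(f"{number}. {step}")
```

Writing the plan this way makes hidden assumptions visible: if a step turns out to need something that is only restored later, adding that dependency produces a cycle and the sort fails instead of quietly giving a bad order.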
As I have come to expect, nothing goes right in a DR or DR test scenario. We were able to get ESX and BackupExec installed without any major obstacles. Our first problem came when cataloging our tapes. This process took far longer than we had expected. Two hours in, with no indication of when it would finish, we canceled the job, since the resources we really needed were already cataloged and available to restore. Not too major a problem, so we moved on.
It helps to give a little context on the state we were in at this point. We had a Windows 2003 server in a workgroup acting as our BE2010 server. This server also had the VMware Infrastructure Client (VIC) installed and was connected to ESX. The ESX server was on the same network, but was not yet managed by vCenter.
BE2010 has some features that allow you to discover the ESX server, pick the datastore to restore to, and so on. This seemed great, and we set up our restore job to match our current environment. We started by trying to restore our Active Directory VM to ESX. This is where things started to go awry. We got a credentials error and the job would not restore. We tried a few different variants, all of which ended in failed jobs. We suspect that this was due to BE and ESX not being in a domain, and to ESX specifically not being controlled by vCenter. (Chicken or egg, right?!)
I know this is an old post, but it’s uncanny how similar the network you were describing back in 2010 is to mine, although yours was probably much larger. I’m a one-guy shop with 50 workstations and 15 servers. A little over half of the servers run CentOS 6.7 and 7.0; the other half are Windows 2016 with AD. I’m backing up to an LTO5 library with BE2014, and my virtual infrastructure is ESXi 5.5 with a physical vSphere server on Windows 2008 R2.
I’m very interested in your DR test. I’ve never run a DR test on my network and would like to, as well as write a DR plan for my company. Since your network (back in 2010) was so similar to mine, I was hoping you might be able to offer me some advice.
How did you solve the catalog problem? I remember that back in BE 2010 or 2012, a catalog was an overnight process for me. I had recently moved from LTO3 to LTO5 and had to catalog one of my LTO3 tapes in the library. Not fun.
As you might have noticed, much of my network and infrastructure is still circa 2011-2012. IT budgets at small companies don’t always allow for a realistic upgrade schedule. What can a guy do? I have to work with the cards I’m dealt. With that said, you might guess that I do not have extra hardware lying around to run these tests. Obviously I can’t run the test on my production infrastructure: more virtual servers would push the current hardware past a comfortable level, and production servers would be affected. Any suggestions?
Any suggestions for writing a good DR plan? Would it be best to write it while I’m running my first test? Any help you can offer would be much appreciated.