DR Test Results

Recently we performed our annual Disaster Recovery test.  We have learned something very valuable every year and tried to adjust our recovery plans accordingly, with this year being no different.  Even with all the new technology, DR still seems to be a tricky undertaking.

The first year we tested our plan, we found that we really, really, REALLY don’t want to restore Active Directory onto unlike hardware.  The following year we got a node onto our MPLS cloud, which allowed us to have a replicated AD server at the DR site.  This greatly reduced the problem of restoring AD.  The year after that we tested the phone system portion of our DR plan and discovered that working with the telco in a DR situation will be challenging at the very least.  In the two years since that second test, some major changes removed our ability to have replicated AD, so we were back to square one on that front.

This year we decided to do a restore of our two-year-old VMware environment.  We kept the scope to restoring only “Tier 0” services: VMware ESX, Symantec Backup Exec, vSphere, and AD.  Time permitting, we planned to restore as many servers as possible beyond that bare-minimum Tier 0.

In the last year we had chosen to purchase Symantec’s Backup Exec Agent for VMware Virtual Infrastructure (AVVI).  This is a Backup Exec agent that allows you to back up VMware guest OS files directly through the ESX server and/or SAN.  The idea was that we would have our virtualized servers backed up to tape at the VMware file level, and that this would allow us to restore directly back to ESX.

Error 500 – Internal Server Error

So for a large part of today my blog has been down.  I have been trying to troubleshoot on my own for a while and have found several suggestions on the web.  Among them were ensuring that php5 was being called in my .htaccess file and including a php.ini file to set the memory limit.  I found a couple of posts on some WordPress forums as well, all relating to the same things.  I also found some posts suggesting I disable all my plug-ins, which I did by removing their folders from my plug-ins directory.  Still no luck.

I ended up getting frustrated and called my hosting company, 1and1.com.  After a few minutes on hold I got through to a rep, and she started to run me through everything I had already tried.  She then went looking in my .htaccess files to verify that I had indeed done what I said.  She came back and asked to put me on hold.  After a few minutes she came back on and told me that it was all working again as expected.

I asked her what had changed, and she told me that when I connect via SFTP (SSH) I need to ensure that I explicitly close my connection.  She said that if a connection hangs and is not properly closed, this error can happen.  I found that a bit strange, but my sites are indeed working again, so I will have to watch to see if it happens again.

Disaster Recovery / Business Continuity

A large part of our PCI and SAS 70 compliance is to maintain, and test, a comprehensive and viable Disaster Recovery / Business Continuity plan.  As part of this we will be conducting our annual test of the Technology Availability Plan portion of our DR plan this coming Friday.  A co-worker and I will be flying to Scottsdale, Arizona, where our contracted Disaster Recovery vendor has the data center that is designated for us.

For this test we will be testing VMware and our ability to recover our vSphere environment.  We will have 3 servers in the test.  The first will be a Windows machine on which we will install our backup environment and restore data from tape.  The other two machines will be ESX servers that we will set up and configure as our VM hosts.  We will then restore vCenter Server from tape, as well as several other critical servers that we call “Tier 0”.

Tier 0 for our DR Plan consists of critical servers that are required to bring the rest of our environment back online in a disaster.  These include Active Directory, Backup, and a few other infrastructure services that are needed before anything else can be restored.
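
To make the tier idea a bit more concrete, here is a minimal Python sketch of how a restore list could be ordered by tier.  This is only an illustration and not our actual tooling: the `Service` class, the inventory, the "intranet-web" and "file-server" entries, and the Tier 1 and Tier 2 numbers are all made up for the example.  Only Active Directory and Backup come from the Tier 0 list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    tier: int  # 0 = must be restored before anything else


# Hypothetical inventory. Only the two Tier 0 entries come from the post;
# the other names and tier numbers are assumptions for illustration.
INVENTORY = [
    Service("file-server", 2),
    Service("Active Directory", 0),
    Service("intranet-web", 1),
    Service("Backup", 0),
]


def restore_order(services):
    """Return services sorted so that lower tiers are restored first.

    sorted() is stable, so services within the same tier keep the
    order they had in the inventory.
    """
    return sorted(services, key=lambda s: s.tier)


def names_in_tier(services, tier):
    """Return the names of all services assigned to the given tier."""
    return [s.name for s in services if s.tier == tier]


if __name__ == "__main__":
    for service in restore_order(INVENTORY):
        print(f"Tier {service.tier}: {service.name}")
```

The point of keeping the tier as plain data is that the restore order falls out of a sort, so time-permitting work beyond Tier 0 simply continues down the same list.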

We hope to have a successful test, and we also hope to uncover roadblocks before they become issues in a real-world scenario.