Disaster Recovery / Business Continuity

A large part of our PCI and SAS70 compliance is to maintain, and test, a comprehensive and viable Disaster Recovery / Business Continuity plan. As part of this we will be conducting our annual test of the Technology Availability Plan of our DR plan this coming Friday. A co-worker and I will be flying to Scottsdale, Arizona, where our contracted Disaster Recovery vendor has the data center that is designated for us.

For this test we will be testing VMware and our ability to recover our vSphere environment. We will have three servers in the test. The first will be a Windows machine that we will use to install our backup environment and restore data from tape. The other two machines will be ESX servers that we will set up and configure as our VM hosts. We will then restore vCenter Server from tape, as well as several other critical servers that we call “Tier 0”.

Tier 0 for our DR Plan consists of critical servers that are required to bring the rest of our environment back online in a disaster. These include Active Directory, Backup, and a few other infrastructure services that are needed before anything else can be restored.

We hope to have a successful test, and we also hope to uncover roadblocks now, before they become issues in a real-world scenario.

Registry Hive Recovered?!

We’re experiencing a weird pop-up message on our Backup Exec 2010 server. Every morning (after backups have run) we get a string of errors from Windows indicating that there was an error with the Registry Hive. The error reads like this:

“Registry hive (file:) C:\WINDOWS\vmware-SYSTEM\vixmntapiXX was corrupted and it has been recovered.  Some data might have been lost.”

This error message is there every morning, and there will be anywhere from 10 to 20 of them that we have to click through. The only thing that changes is the XX, which is a number that increments. As near as I can tell there is nothing wrong with the system, and there are no symptoms of trouble other than the messages.

Going off the error message itself, and the fact that BE was running without this error until I turned on VMware backups, I suspect that this is an error with the Agent for VMware Virtual Infrastructure (AVVI), otherwise known as the VMware Agent. I’ve tried some Googling and haven’t come up with much relating to this error specifically. At this point it’s really just an annoyance, as we have not seen anything that would indicate an issue. I’m just crossing my fingers that restores of data from AVVI will actually work!

We’re doing our Disaster Recovery Test this Friday, so we’ll know pretty quickly whether these VMware backups will work or not!

NetApp LUN Expansion Limit

Yesterday we were working on expanding a drive for one of our SQL servers in VMware. Ordinarily we have all our guest drives as .vmdk files, but this SQL server is clustered, so the drive is a Raw Device Mapping (RDM) to a LUN on the NetApp. We have expanded LUNs like this in the past and not had any issues. This time we did have issues, and it took a bit of searching to figure out why.

This server is not yet in production, so some of the volumes were not yet sized the way they will be when it goes live. In this case we were growing a volume to the size it needed to be to go into prod. This meant going from 40 GB up to 500 GB to support all the data that would be imported. At this point we got an error on the NetApp: “New size exceeds this LUN’s initial Geometry.” After a bit of Googling we found a forum post explaining that NetApp limits how far you can expand a LUN, restricting it to no more than 10x the original size. For our 40 GB LUN that meant a ceiling of 400 GB, short of the 500 GB we needed. I imagine that for most folks this would not be a problem, but if you are like us, going from test/validation to prod on the same LUN and needing to expand, you could paint yourself into a corner as we did.

We had to blow away the LUN and recreate it, which ended up being a major pain due to the clustered servers. The lesson is: size your volumes appropriately from the start, or be prepared to do some file copying later on if you reach the limit.
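The arithmetic behind that limit is easy to check before you create a LUN. Here is a minimal Python sketch, assuming the 10x ceiling from the forum post holds for your filer (verify it against NetApp's documentation for your Data ONTAP version). The function names are made up for illustration and are not part of any NetApp tool.

```python
GB = 1024 ** 3

# Assumption: the 10x growth ceiling reported in the forum post.
GROWTH_FACTOR = 10


def max_lun_size(initial_bytes, factor=GROWTH_FACTOR):
    """Largest size a LUN created at initial_bytes could later be grown to."""
    return initial_bytes * factor


def can_expand(initial_bytes, target_bytes, factor=GROWTH_FACTOR):
    """True if target_bytes is within the growth ceiling of the original LUN."""
    return target_bytes <= max_lun_size(initial_bytes, factor)


def min_initial_size(target_bytes, factor=GROWTH_FACTOR):
    """Smallest initial size that still allows growing to target_bytes."""
    return -(-target_bytes // factor)  # ceiling division


if __name__ == "__main__":
    print(max_lun_size(40 * GB) // GB)       # 400
    print(can_expand(40 * GB, 500 * GB))     # False
    print(min_initial_size(500 * GB) // GB)  # 50
```

In our case the LUN would have needed to start at 50 GB or more to reach 500 GB later without being recreated.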

Log files fill up!

We have a variety of servers that run many different applications which log to a file. This includes IIS, SMTP, FTP, etc. The list goes on and on. It is easy to lose track of them all, and even easier to let the log files fill up your drive while you’re not watching!

We have had to go through and zip up or delete logs to clear up disk space many times. The full drives have on occasion caused service outages because the applications could no longer write to their log files. We have struggled to find an easy, inexpensive way to manage this. It would be easy to write a batch script to delete all the files, but we are under PCI requirements to keep a certain number of log files on disk for compliance reasons. Thus the need to zip up the files and delete the originals.

One of our excellent coders at my office came up with this great VB script that will do just this for us. It will recurse through a directory *and all its subdirectories* looking for files with the extension ‘.log’ (or whatever extension you want) and mash them all into a single date-stamped zip file.

One thing I need to mention is that this script looks specifically for log files that have a date-stamped file name (this is how it creates a date-stamped zip), so it will only really work on IIS logs and the like.
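For anyone who wants to see the general idea in code, here is a rough Python sketch of the same approach. To be clear, this is not our coder's VB script. It is an illustration that assumes IIS-style file names ending in a six-digit YYMMDD stamp (for example ex110819.log), and the names archive_logs, log_date, and keep_days are made up for this example.

```python
import re
import zipfile
from datetime import date, datetime, timedelta
from pathlib import Path

# Assumption: the file name (minus extension) ends in YYMMDD, e.g. ex110819.log.
STAMP = re.compile(r"(\d{6})$")


def log_date(path):
    """Return the date embedded in a log file name, or None if there is none."""
    match = STAMP.search(path.stem)
    if not match:
        return None
    try:
        return datetime.strptime(match.group(1), "%y%m%d").date()
    except ValueError:
        return None  # six digits that are not a valid date


def archive_logs(root, today, keep_days=1, ext=".log"):
    """Zip date-stamped logs under root, leaving the newest keep_days days alone."""
    root = Path(root)
    cutoff = today - timedelta(days=keep_days)
    old_logs = []
    for path in sorted(root.rglob("*" + ext)):  # rglob walks all subdirectories
        stamp = log_date(path)
        if path.is_file() and stamp is not None and stamp <= cutoff:
            old_logs.append(path)
    if not old_logs:
        return None

    zip_path = root / "logs_{}.zip".format(today.strftime("%Y%m%d"))
    # Append mode: a second run on the same day adds to the zip instead of replacing it.
    with zipfile.ZipFile(zip_path, "a", zipfile.ZIP_DEFLATED) as archive:
        for path in old_logs:
            archive.write(path, path.relative_to(root).as_posix())

    # Delete the originals only after the archive passes an integrity check.
    with zipfile.ZipFile(zip_path) as archive:
        if archive.testzip() is not None:
            raise RuntimeError("archive failed verification: {}".format(zip_path))
    for path in old_logs:
        path.unlink()
    return zip_path
```

You would call it with something like archive_logs(r"C:\inetpub\logs\LogFiles", date.today(), keep_days=30), where the path and the 30-day window are only examples. Pick a keep_days value that satisfies your own retention requirements, and note that files without a date stamp in the name are left untouched.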

Take it for a spin and let me know what you think!

 

[UPDATE: 8/19/2011] I have a new version of this file described in the following post.

SAS70 Control Activity Pains

The company I work for is in our second attestation period. The first time was very painful and a lot of extra work, but it was only a six-month audit period. This time we’re under a nine-month audit! It is very taxing to stay on your toes that much, for that long.

During our first period we learned a lot about how we had written our Control Activities (CAs) to support our Control Objectives. We were at times too specific in our CA wording and backed ourselves into a corner more than once. Luckily we did not have any major exceptions and gained a favorable audit opinion, which was quite a milestone for our small company.

The single biggest issue that we found was how we worded the Control Activities. We made each description so specific to the activity that it locked us into that description for the duration of our audit. This became problematic when we ran into an unforeseen issue with how we were then required to do things. If there was a problem, we could not change the CA during the audit period.

This is where my biggest (and only) recommendation comes in. Word your CAs so that they refer to a policy/procedure document for how you do things. That way you can update the document as needed without changing the CA itself.