"No campaign plan survives first contact with the enemy" - Carl von Clausewitz

I attended the Lansing Information Systems Security Association meeting yesterday and took a few notes that I thought others might be interested in. The presentation was by Amerisure Insurance (http://www.amerisure.com/) on how they handled an unexpected system crash on February 22, 2012, during the 9 a.m. hour, affecting their ~400 servers (I think they said 80% or so were virtual) and network equipment.

The back story is that the Amerisure IT department had prepared for unexpected events in the past on a quarterly basis (I believe I recorded that correctly), even traveling to an off-site backup location in NY for a weekend to bring up the systems and test their plans and procedures. They had built a fancy "Command Center" (CC) in the Michigan data center with large tables, Internet connections, computers, world clocks, etc.

There had been some problems with the Michigan data center's Uninterruptible Power Supply (UPS) previously. The UPS vendor sent their top technician to diagnose the problem and, in the process, he accidentally hit the primary circuit breaker, which took down power to the data center.

During this event they found multiple gaps in their processes to handle unexpected events, such as:
With no power in the data center, the badge card swipe sensor didn't work, so access to the CC was difficult.
The well-prepared, tested, and thoroughly documented DR plan and processes were entirely online with no hard copy, and with no power they had no Internet access, wired or wireless.
The CC had thick block walls, which made 3G and 4G cellular data connections practically impossible; they had to move out of the room to have consistent access.
The phones were VoIP, so those would only function for 1 hour on some backup system before they'd go off-line (I didn't understand this and they didn't elaborate).
The IT Manager had left the company a few weeks before this event, the next person in charge was on vacation, and the person after him was at an off-site meeting. No one really knew who to look to for leadership, although a brand new VP (not of IT, but of a different department, I believe) took up the responsibility to provide some measure of order, but lacked depth of knowledge.
There was a backup email system for emergencies, but access to it was difficult for the IT people, and the rest of the nationwide company and the insurance agents had forgotten they had access to it; most didn't know their passwords for the accounts.
The initial communication to the rest of the company came almost 1.5 hours after the power outage occurred, and the second communication came almost 3 hours later. All of the phone lists were online, and personal cell phone numbers for others in the company weren't recorded anywhere that wasn't electronic.
The CC quickly became hot and stale with 40 people crammed into it to determine what actions to take to restore the servers and service to the company. Not all of those people were needed to make the core decisions; many could have waited somewhere else and been notified of what to do.
Something they didn't expect: with adrenaline pumping, no one took time to eat or drink, or realized they hadn't for almost 8 hours, and it started to affect them. Someone finally brought in water and ordered pizza.
They had become lax in their policy of not performing maintenance on critical infrastructure equipment during core business hours. (This made me think of the mid-day power outage IT Services had earlier this year because of something with power and maintenance.)

Some of the resolutions they've implemented to prevent these problems are:
Use an external web hosting company where they can quickly put up a message to let others know there is a problem.
Determine who, in an emergency, needs to be in the meeting to make the action decisions and who can wait somewhere else to be informed of what action to take.
Put a printed hard copy of the DR plan in the CC.
Put a mini-fridge in the CC stocked with a supply of water, at the very least.
Print a list of phone numbers on a small card and put it behind the employee badge in their lanyard.
Place a plastic cover with a lock over the primary circuit breaker.
Color-code the employee badges to help people identify the chain of command during unexpected events so they know who to look to.
Send employees quarterly emails telling them to check and respond to a message sent to the backup email system, so users are reminded that they have the system and refreshed on the credentials to access it.
Add a stand-alone printer to the CC.
Improve ventilation in the CC.
Install whiteboards / flip charts in the CC.

In conclusion, the IT team felt that the situation was a disaster since it took almost 48 hours to restore all of the services, test for data loss, and perform recovery. In contrast, the rest of the company thought it was pretty extraordinary that they recovered things so quickly, so they (IT) learned how their own perception of the situation made it seem worse than it was. Humorously, as details leaked out to the rest of the company about what had occurred, the story became that a UPS driver had been delivering a package to IT and tripped over a power cord as he cut through the data center.



Troy Murray
Michigan State University
College of Medicine
Life Science
1355 Bogue St, B-136D
East Lansing, MI 48824
E: [log in to unmask]
P: 517-432-2760
F: 517-355-7254
RedHat 5 Certified Technician
RedHat 5 Certified Systems Administrator
HL7 V2.6/2.5 Certified Control Specialist