It’s 5:48 AM on a Saturday, partway through a 45-day, 24x7 colocation datacenter migration.
A failed disk drive combined with failover testing causes a storage array “head” to throw an error, and the head has to be taken off-line for analysis.
DBAs in India migrating databases suddenly notice performance is half what it was the day before and copy jobs are failing.
The environment is in transition so the customer’s IT organization hasn’t taken over support from the vendor. A bunch of engineers are debating, over email, why this has happened.
The Datacenter Migration Project Manager gets in his car and drives to the colocation site, following the engineers’ discussion along the way.
Having never replaced a disk drive in a storage array, the Project Manager calls the customer’s Technical Manager for instructions on how to replace the drive. Support personnel in India see the drive come back on-line and then the redundant storage array “head” come back on-line.
The engineers do the “naked hula dance” figuring the problem has been solved. Everything is back on-line, no reported errors, hardware has all “green lights.” Migration continues.
I’ll give you a second to guess all the things wrong with this scenario…
Successful datacenter buildouts and migrations should be precision activities involving architects, engineers, project managers, vendors, staff, and management. Processes need to be documented and followed. Let’s see what went wrong here.
- Diagnostics, engineering, and analysis were occurring while semi-production work was proceeding. Testing windows had been delineated, yet other activities were running at the same time.
- The drop in DB performance was noticed only through the copy jobs. Where was the monitoring?
- The vendor who did the installation is now two weeks beyond the delivery date and no handoff has occurred to the internal support organization. Who is supporting the environment?
- Procedures had not been set up with the colocation provider to allow vendors to show up and replace the disk drive. Why was a Project Manager replacing the disk drive?
- Once everything came back on-line, work proceeded. Why wasn’t root cause analysis performed?
- DBAs, support personnel, engineers, vendors, project managers, and technical managers were all doing their own thing. Why wasn’t a problem or incident management process followed?
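On the monitoring question, even a very simple check could have flagged the slowdown before the DBAs stumbled on it. The sketch below is only an illustration, not what this project used: it compares each copy job’s current throughput against the previous day’s baseline and reports any job that dropped past a threshold. The function name, the job names, the MB/s figures, and the 25% threshold are all assumptions made up for the example.

```python
from typing import Dict, List, Tuple


def throughput_alerts(
    baseline: Dict[str, float],
    current: Dict[str, float],
    max_drop: float = 0.25,  # assumed threshold: alert on a drop of more than 25%
) -> List[Tuple[str, float]]:
    """Return (job, fractional_drop) for jobs slower than the baseline allows.

    A job missing from `current` is treated as a 100% drop, since a copy job
    that reports no throughput has most likely failed.
    """
    alerts = []
    for job, base in baseline.items():
        if base <= 0:
            continue  # no usable baseline for this job, nothing to compare
        now = current.get(job, 0.0)
        drop = 1.0 - (now / base)
        if drop > max_drop:
            alerts.append((job, drop))
    return alerts


# Hypothetical figures (MB/s): yesterday's baseline vs. this morning.
yesterday = {"copy_db1": 200.0, "copy_db2": 150.0, "copy_db3": 180.0}
today = {"copy_db1": 100.0, "copy_db2": 145.0}  # copy_db3 reported nothing

for job, drop in throughput_alerts(yesterday, today):
    print(f"ALERT {job}: throughput down {drop:.0%} from baseline")
```

In practice the numbers would come from whatever monitoring tool the environment already has, and the alert would open an incident rather than print to a console.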
In this multi-part series we examine some of the important activities and processes you should have in place before beginning a datacenter migration.