For many businesses, the question is not whether a disaster or power outage will happen, but when it will occur and how severe it will be. For Enable, a company whose primary products are internet-based applications, it is vital that our ability to host our clients’ applications remains unaffected in every eventuality that could reasonably be foreseen. Disaster recovery (DR) is not something that any business can afford to get wrong. This is why Enable takes it very seriously and, after recent events, can now speak from first-hand experience.
Sometime around 03:00 on Monday, December 4th, a water heater in the roof space of our unit 11 building failed and began to leak. The heater was located above the stairwell, and the water leaked through the first floor, destroyed ceiling tiles, flooded one of the staff toilets, and made its way through to the ground floor. Unfortunately, a significant amount of water entered our server room, running through all three server cabinets and the servers within. All local servers, networks, and internet access shut down when power was lost. This meant no access to telephone, email, or any local IT infrastructure, including domains, web servers, file servers, source control, and deployment services.
One of the main advantages of hosting in the cloud rather than locally is that no live systems or data were affected, which meant minimal impact on our clients. Thanks to sensible management of time and resources, projects that were underway were still delivered to clients on time.
Ewan Gibb is the IT Manager at Enable and has been with the company for over 15 years, helping its IT department evolve into the well-organized and responsive team that it is today. Ewan was heavily involved in the DR process for this incident from the moment it was reported, ensuring that our systems were back online and operating normally as soon as possible. Here are Ewan’s recollections of how the day’s events unfolded.
Who was first on the scene and how did they describe the situation?
The issue was discovered around 07:15 by Mark (IT Technician), who was first in the building. At that point, there were two or three inches of water on the floor of one of the units, between the server room and one of our meeting spaces.
What immediate action did the first responder take?
Mark notified David Hunt (Operations Director), and then me, of the situation, advising that off-site backups would be required. When David arrived, the water and power to all units were turned off, to stop the flow of water and for safety reasons respectively.
What interim measures were put in place to ensure business continuity?
Select members of staff were advised to work from home. Those who remained to work on disaster recovery tasks were placed in our second building, unit 8. Internet access was provided to staff via 4G routers and tethered mobile phones. Emergency access to incoming email was enabled via our email filtering provider, FuseMail, who provided a web-accessible location for our email. Telephone numbers were re-routed so calls could be taken, and backups were provided to relevant staff members so they could work on restoring services.
What was prioritized in terms of the DR process?
For our IT team, the initial priorities were ensuring internet and email access for members of staff, and that telephone calls could be taken. Once the most basic IT services had been provided, the IT team focused on dealing with the water-damaged servers. The servers were removed from the server room, first dried manually, then left to dry, to give them a chance of coming back online. PAT tests were then performed to confirm it was safe to attempt to turn the servers back on. Miraculously, by Wednesday, about three-quarters of the servers had come back online.
What measures have been put in place to reduce the risk of this happening again in the future?
Firstly, the water heater will be moved to the ground floor. Many key IT services have been moved into the cloud, so they are no longer reliant on local infrastructure. The exercise has highlighted the importance of off-site backups: the frequency of our off-site backups has now been increased and their contents made even more comprehensive. An updated and more resilient local IT infrastructure will also be put in place.
How is our DR process documented? Who takes ownership of this from different areas of the business?
DR is documented as part of our business management system, which is required for ISO 27001. The disaster management team consists of the Managing Director, Operations Director, Commercial Director, IT Manager, and Office Manager.
Enable has now had its first experience dealing with a major DR emergency. Thanks to the efforts of our IT team, our internal systems were back up and running in just two days. Twice a year, Enable thoroughly tests its disaster recovery plan, and the outcome of these tests is reviewed and used as a basis for incrementally improving its internal protocols. This ensures Enable staff are well prepared to act efficiently in any future disaster recovery scenario.