In August 2016 a small fire at the Delta Air Lines data centre in Atlanta caused a power cut. The resulting chaos led to the cancellation of 2,300 flights and the disruption of many more. In September Delta reported that the outage had cost it $100 million in lost revenue.
In July 2016 the Canadian government’s main data centre in Ottawa shut down due to smoke from a small fire. 50,000 government workers lost email services for the day and the processing of 80,000 payroll transactions was delayed.
I could go on at length: a fire breaks out in a data centre somewhere in the world roughly every month.
From a Capitoline survey of 219 catastrophic failures, we have concluded the following:
Non-IT causes of failure
- Power – Any loss of power anywhere from the utility all the way to the IT equipment
- HVAC – Loss of cooling not directly related to a power cut
- Fire – Fire within the data centre facility
- Fire malfunction – A malfunction of the fire system or improper maintenance that has caused the fire system to be activated inappropriately and the data centre facility shut down
- Attack – Either a physical attack on the facility or cabling or an external DOS attack
- Environment – Floods, storms, forest fires etc.
- Unknown – A data centre outage was reported but the cause not given
Power failures are clearly the most significant cause at 43%. If we presume the same ratio holds among the unidentified failures, then 43% of 9% is about 4%, so we can estimate that in reality 47%, or nearly half, of all facilities failures are power related.
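The estimate above is simple arithmetic, and a short sketch makes the assumption behind it explicit. The function names are ours, for illustration only, and the sketch assumes the 43% and 9% figures are shares of the same total. It shows the article's method (adding power's own proportion of the unknown share) alongside an alternative that leaves the unknown failures out and rescales.

```python
def add_share_of_unknown(known: float, unknown: float) -> float:
    """Article's method: add the cause's own proportion of the unknown share."""
    return known + known * unknown


def renormalise(known: float, unknown: float) -> float:
    """Alternative: exclude unknown failures and rescale the known share."""
    return known / (1.0 - unknown)


POWER = 0.43    # share of failures attributed to power in the survey
UNKNOWN = 0.09  # share of failures with no reported cause

article_estimate = add_share_of_unknown(POWER, UNKNOWN)
rescaled_estimate = renormalise(POWER, UNKNOWN)

print(f"article method: {article_estimate:.1%}")
print(f"renormalised:   {rescaled_estimate:.1%}")
```

Both approaches land at roughly 47%, so the conclusion does not depend on which one is used.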
We can also see the importance of fire events, which are the second most common cause of outage at 21%. We understand that in the sample period six data centres were written off due to fire or flood. Fire malfunction accounts for one in twenty outages. These events come about when a malfunction in the fire system or an incorrect maintenance procedure either sets off the fire extinguishing system or commands an emergency shutdown of the power.
Air conditioning failures account for a fairly small share at 7%. Most air conditioning failures reported are actually due to a power outage further up the line. Some are due to the system tripping out during exceptionally hot weather because it hasn’t been designed to the ‘hottest possible day’ rule.
Another way of looking at the figures is by time impact of the failure event. The 219 failures accounted for 2074 hours of reported downtime. Figure 2 looks at the distribution of outage time by event type.
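As a rough check on the scale of these numbers, the mean downtime per failure follows directly from the two totals quoted above. The sketch below also compares it with the 25-hour average fire outage quoted later in this article. The variable names are illustrative, and the calculation assumes the reported hours cover all 219 failures.

```python
TOTAL_FAILURES = 219          # failures in the survey
TOTAL_DOWNTIME_HOURS = 2074   # reported downtime across those failures
FIRE_MEAN_HOURS = 25          # average fire outage quoted later in the article

# Mean downtime per failure across all event types
mean_hours = TOTAL_DOWNTIME_HOURS / TOTAL_FAILURES

# How much longer the average fire outage is than the overall mean
fire_ratio = FIRE_MEAN_HOURS / mean_hours

print(f"mean downtime per failure: {mean_hours:.1f} hours")
print(f"fire outage vs overall mean: {fire_ratio:.1f}x")
```

The overall mean is about 9.5 hours, so an average fire outage lasts roughly 2.6 times as long as the average failure.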
Once again, in Figure 3, we have taken out IT related outages to just show the time impact of facilities-related events.
What is interesting about Figure 3 is that it shows just how disruptive fire incidents can be. Measured by downtime, fire has overtaken power as the most disruptive type of event.
Having a fire safety management plan
The problem we have today with fire safety management is that most managers see it as a range of disparate and unrelated technical issues that also fall across differing management responsibilities. There is also an attitude that somebody else, usually a manufacturer or contractor, must have sorted all this out at the build stage. This is rarely the case. The data centre manager is responsible for the overall fire safety management of the building: nobody will have done it for you.
Fire safety engineering is a big subject. In fact, we have written several e-books and white papers on it, but we can summarise it here:
Design the building to use low flammability materials, e.g. specify low flammability data cabling.
Construct the building to ensure fire risk is minimised. Specify fire rated zones and fire barriers at strategic locations.
Operate the building to minimise fire risk, e.g. have policies and procedures to remove all packaging material and do not allow rubbish to build up.
Manage the building and staff to operate with an awareness of fire risk. This can include practising fire drills and sensible risk awareness.
Specify the correct fire detection, alarm and response, e.g.
- Quick reaction aspirating smoke detectors in computer and plant rooms
- Automatic fire extinguishing systems
- Overpressure relief mechanisms for gas-based suppression systems
Conform to national regulations and requirements, e.g.
- UK BS 6266/5839, Building regulations Part B
- US NFPA 72/75/76
- US NFPA 2001 approved fire suppression gases
- EU F-Gas Regulations
- EU 92/58/EEC – Minimum requirements for the provision of safety signs at work
Test all the components of the fire safety system such as smoke detection, alarm, emergency lighting etc.
Maintain all the components and systems of the fire safety system and make sure records are kept of all maintenance and corrective actions.
Inspect all the components and systems of the fire safety system. For example, check every year that all cable penetrations through walls still have their fire seals intact or have been resealed. Gas-protected rooms must also be pressure tested and certificated every year to prove they can still retain the fire suppression gas.
Integrate all the elements of the fire safety system, e.g.
- Integrate detection, alarm and suppression system
- Integrate data centre alarm to building alarm
- Connect fire alarms to BMS/DCIM system
- Arrange Electronic Emergency Power Off procedures
- Decide on air conditioning status under gas release event
- Set up ventilation to turn off during gas release event
Document all the processes and procedures and create a fire cause and effect diagram or algorithm that explains the system integration.
Recover after the fire event. Most organisations do not have a recovery plan to get back in business after a fire event. We can see that the average time of outage for a fire is 25 hours. It should be possible to get up and running again within a few hours after a minor fire or smoke event, but this is rarely the case. Have a plan that covers everything from purging the room of smoke and gas to which equipment needs to be reset, how it is to be reset, and who is to do it.
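The cause and effect diagram mentioned under the Document step can start life as a simple lookup table from triggering events to system responses. The sketch below is one hypothetical way to write such a table down. The cause names, the effect names and the mapping itself are all assumptions made for illustration, including the choice to release gas only once a second detector confirms. It does not describe a real site or any standard's required logic. Emergency power off and air conditioning status are left out because the list above treats them as decisions each site has to make.

```python
from enum import Enum


class Cause(Enum):
    # Hypothetical triggering events
    FIRST_DETECTOR = "first smoke detector in alarm"
    SECOND_DETECTOR = "second smoke detector confirms"
    MANUAL_RELEASE = "manual release point operated"


class Effect(Enum):
    # Hypothetical system responses, taken from the integration list above
    LOCAL_ALARM = "sound data centre alarm"
    BUILDING_ALARM = "signal building alarm"
    NOTIFY_BMS_DCIM = "raise alarm in BMS/DCIM"
    STOP_VENTILATION = "turn off ventilation"
    RELEASE_GAS = "release suppression gas"


FULL_RESPONSE = {
    Effect.LOCAL_ALARM,
    Effect.BUILDING_ALARM,
    Effect.NOTIFY_BMS_DCIM,
    Effect.STOP_VENTILATION,
    Effect.RELEASE_GAS,
}

# Illustrative cause and effect matrix: each cause maps to the effects it triggers
CAUSE_EFFECT = {
    Cause.FIRST_DETECTOR: {Effect.LOCAL_ALARM, Effect.NOTIFY_BMS_DCIM},
    Cause.SECOND_DETECTOR: FULL_RESPONSE,
    Cause.MANUAL_RELEASE: FULL_RESPONSE,
}


def effects_for(active_causes):
    """Return the combined effects for the active causes, sorted by name."""
    triggered = set()
    for cause in active_causes:
        triggered |= CAUSE_EFFECT[cause]
    return sorted(triggered, key=lambda effect: effect.name)


for effect in effects_for([Cause.FIRST_DETECTOR]):
    print(effect.value)
```

Keeping the matrix as plain data means the same table can be printed for the operations manual, reviewed with the fire system contractor, and checked against the installed system during testing.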
Fire safety engineering is a complex issue and rarely managed in a comprehensive and holistic manner. Small fire events become major outages and there is a long list of engineering requirements to minimise risks in both design and operation of a data centre. Seek help if you feel your fire system isn’t totally understood and under control.