Supposedly safe fire extinguishing systems are not as safe as we thought: several storage systems were destroyed. A storage disaster on a Sunday morning.
A quiet Sunday morning, around half past seven. A typical ICT guy should be sound asleep for a few more hours, but this morning the phone wakes us. That week my wife was on the out-of-office-hours watch shift. The security service informed her that they had received a fire alarm. A few hours earlier(!). The local fire department had already responded, and because they could not find anything actually burning, they assumed it was not that big a deal. So they left it to the next shift to follow up on the events and inform us, mostly because the lock on our datacentre door had been forced to gain access. Maybe we wanted to inspect the datacentre and get a locksmith?
Irony: this datacentre is located on the premises of the fire department…
For us it was a 10-minute drive, so we decided to visit the location.
When we arrived on the scene, at first sight everything looked like business as usual in the datacentre.
A quick glance at the racks revealed no havoc. The storage systems had all green lights, and the servers and switches appeared to be working normally.
So what was the alarm about then? Let’s have a look at the fire extinguishing system. Hey, that’s not normal: the bottle pressure gauges indicated zero pressure in the system. On closer inspection, my conclusion was that the system had performed a full fire extinguishing cycle.
There was, however, no clue as to why it had done so. The control panel did not indicate anything abnormal either.
OK, so let’s deal with that later. Now that we are here, let us do a quick check whether everything is running normally. After we logged into a few systems, some odd behaviour made us take a closer look at all of them. With every minute we became more puzzled. It became clear that things were not normal, and we quickly suspected our storage system to be the culprit.
The NetApp storage looked normal at first glance, which was consistent with all the green lights on the drives. A more detailed search revealed the storage system to be severely broken: 56 disks in total had died! So why did it not show a disaster in the drive bays?
It appeared the system had decided the situation was too severe. It did not have a clue what was going on and had simply stalled a few aggregates.
As luck would have it, the storage system had almost no free space. It was to be replaced by a new system soon, and the company was filling it beyond the recommended 70%.
So after escalating the incident to our managers, we spent the next hours stopping all test systems and finding free space to move all data from this datacentre to the other datacentre. With 56 disks already failed, we were worried that more disks would fail in the next few hours, disabling even more aggregates.
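To make the space hunt concrete, here is a minimal Python sketch of the arithmetic we were doing in our heads that morning. All capacities are made-up example numbers, not our real figures, and the function names are purely illustrative. Only the 70% guideline comes from the story above.

```python
# Sketch: does the evacuated data fit in the other datacentre?
# All numbers used with these functions are hypothetical examples.

RECOMMENDED_MAX_FILL = 0.70  # the recommended fill level mentioned in the post


def fill_after_evacuation(target_capacity_tb, target_used_tb, incoming_tb):
    """Fill level (0..1+) of the target storage after receiving the data."""
    return (target_used_tb + incoming_tb) / target_capacity_tb


def data_to_shed_tb(target_capacity_tb, target_used_tb, incoming_tb,
                    max_fill=RECOMMENDED_MAX_FILL):
    """How many TB must be freed (e.g. by stopping test systems)
    to stay at or below max_fill. Returns 0.0 if everything fits."""
    overflow = target_used_tb + incoming_tb - max_fill * target_capacity_tb
    return max(0.0, overflow)


if __name__ == "__main__":
    # Hypothetical: 100 TB target, 55 TB already used, 40 TB to evacuate.
    print(fill_after_evacuation(100, 55, 40))
    print(data_to_shed_tb(100, 55, 40))
```

A result above 1.0 from `fill_after_evacuation` means the data does not fit at all. Anything between 0.70 and 1.0 fits, but only by going beyond the recommended fill level, which is exactly the trap we were already in.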
The problem is, who are you going to call for 56 spare disks on a Sunday morning? We tried, but even NetApp did not have that many spare disks available within a few hours. This is not a disaster they often have to deal with.
It became a busy Sunday for us, relocating storage and virtual services to get the business up and running for Monday morning. After a few calls with other ICT specialists in our network we quickly concluded that disks dying after an inert gas fire extinguishing is not as rare as you would think. In some cases disks died hours or days after the extinguishing happened. It became clear that the only safe way to get through the coming days was to move all data to the other datacentre and write off the storage in the affected datacentre.
We faced a few very busy weeks: keeping as many systems running as possible on the remaining storage, finding a solution for the defective systems, and investigating the root cause of the incident.
Curious what happened that Sunday?
You won’t believe how a system supposed to safeguard ICT systems turned out to be able to actually destroy them…
The datacentre we are talking about is built in a box of the kind often used as a refrigeration room. These are ready-made plastic boxes.
Along one side of the room, two air conditioning systems keep the box cool. Perpendicular to them, a cluster of large bottles of inert gas makes up the fire extinguishing system.
The rest of the room is filled with server racks containing all equipment.
Extinguishing a fire with inert gas requires displacing so much of the air in the room that the remaining oxygen can no longer sustain a fire. So a lot of gas has to be injected into the room in a very short time.
To give you an idea: the room has two automatically opening 2-square-foot hatches. Without them the box would simply explode.
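For a feel of the volumes involved, here is a rough Python sketch. It assumes an idealised, perfectly mixed room where the surplus vents through the hatches at constant pressure, so the oxygen fraction falls exponentially with the injected gas volume. The room size and the target oxygen level are hypothetical example values, not measurements from our datacentre.

```python
import math

AIR_O2_FRACTION = 0.209  # oxygen fraction of normal air


def o2_fraction_after_discharge(room_volume_m3, gas_volume_m3):
    """Oxygen fraction left after injecting inert gas into a vented,
    perfectly mixed room (idealised model, an assumption of this sketch)."""
    return AIR_O2_FRACTION * math.exp(-gas_volume_m3 / room_volume_m3)


def gas_volume_for_target(room_volume_m3, target_o2_fraction):
    """Inert gas volume (at room pressure) needed to dilute the oxygen
    down to target_o2_fraction under the same idealised model."""
    return room_volume_m3 * math.log(AIR_O2_FRACTION / target_o2_fraction)


if __name__ == "__main__":
    room = 30.0     # hypothetical room volume in cubic metres
    target = 0.125  # hypothetical target oxygen fraction
    needed = gas_volume_for_target(room, target)
    print(f"Gas needed: {needed:.1f} m3 for a {room:.0f} m3 room")
    print(f"O2 left:    {o2_fraction_after_discharge(room, needed):.3f}")
```

Under these assumptions the gas volume is roughly half the volume of the room itself, all of it arriving within a short time. That is why the hatches matter.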
To release this amount of gas, a very quick and large valve is needed, at a pressure of over 200 bar (2,900 psi). This is often solved with a pilot bottle: a small bottle of gas at a somewhat lower pressure which operates the main valve of the system. The pilot bottle valve is smaller and operated by an electromechanical solenoid. Its pressurised gas then actuates the pneumatic main valve.
What happened is this:
The air conditioning systems are switched by a large electric relay. Working in the box, you had to get used to a loud noise and some vibration in the floor every time the air conditioning switched on. These constant vibration pulses also travelled to the fire extinguishing system: a small box, with the air conditioning and the fire extinguishing system perpendicular to each other, a few feet apart.
Years of beating caused the solenoid of the pilot bottle to fail… The failing pilot solenoid released the gas of the pilot bottle, activating the main valve.
So how are disks affected by this inert gas?
This has nothing to do with the gas itself. The gas is inert and poses no threat to electronics or hard disks.
You won’t believe it: it is noise!
To get the room filled quickly and evenly, the gas is injected into the room via nozzles in the floor and in the ceiling. The stream of gas through a cheap nozzle causes a very loud noise. The sound is loud enough to cause heavy vibration in the disks, causing disk heads to hit the rotating platters.
Want to know more about noise and disks? Read my blog Noise slows your disks!