Today around 11:30 Webfusion, Donhost and other providers under the same umbrella experienced a major service outage that took down all servers in the Leeds Data Centre. Thousands of websites and web services were unavailable for nearly 3 hours. That includes those of our websites that are hosted on Leeds Data Centre servers.
[Update 11/01/2011 23:10] Some websites, as reported by other Webfusion / Donhost clients, took much longer to get back online. A clarification is needed here: it took over 3 hours for the first servers to come back online, and many of them were still not fully functional.
Please refer to the comments below for the latest reports from affected users.
Can you guess the reason? I wouldn’t have. I would have expected some natural disaster or something… but not this…
According to Webfusion, contractors testing the fire systems in the super-secure data centre actually triggered the fire system by mistake. This set off all the emergency procedures, shutting down the servers one by one in emergency mode to prevent data loss.
I hope that they did not fire the FM200 gas discharge and sprinklers. Well… It seems that super-secure data centres built like a bunker, resistant to any external threats, are very vulnerable to internal threats and human beings.
“We have confidence in the system,” Galvin says. “We’ve had no outages of any significant size — and we could not say that before, not with our individual servers or with our earlier cluster. So we’re seeing a two-fold improvement — in reliability and in focus. We don’t have to worry about what the platform can do.” – source Case Study: Webfusion places trust in reliable storage platform to deliver bullet-proof innovations in web hosting.
Confidence in the system didn’t help at all. Someone left the human factor out of their calculations. The system might be “bullet-proof” according to their own case study, but stupid-proof it certainly is not. I suppose today was the last day at work for someone. It happens.
Original system status update from Webfusion:
Created: 11 January 2011, 13:46
Last Updated: –
We would like to apologise for todays’ system outage, and to explain why this has occurred.
An external third party was carrying out routine maintenance in our data centre, and testing our systems for fire prevention. Unfortunately, due to human error, our fire prevention systems were in fact triggered.
As a result of this, and acting as the system should in the event of a real fire, all of our servers were sent in to a safe mode whereby they went offline.
Safety is our biggest concern, hence the system is configured to react in this way to avoid a major incident and permanent data loss.
We deeply regret any problems this may have caused you, and assure you we are doing our utmost to return to normal service levels as quickly as we possibly can.
If you have any updates relevant to this article, please share them here and I will update this post accordingly.