The Stack Overflow Podcast

SE Podcast #36 – We Got Hit by a Hurricane

So as you may have heard in the news, the east coast got hit pretty hard by Hurricane Sandy - in particular, our datacenter in Lower Manhattan was almost knocked entirely offline.  If not for the incredible efforts of Fog Creek Software, Squarespace, and Peer1 (the datacenter) there would have certainly been days of outages for everyone involved.

We've got a ton of people from Stack, Fog Creek and Squarespace on to tell the CRAZY story of exactly what happened last week! Guests include: David Fullerton, VP Engineering at Stack Exchange; Geoff Dalgas & Nick Craver, both core developers at Stack Exchange; Alex Miller; Michael Pryor; Mendy Berkowitz, lead sysadmin for Fog Creek; Babak Ghaheremanpour, longtime Creeker; Anthony Casalena, CEO and founder of Squarespace.

We're planning on telling the whole story of Hurricane Sandy - it's roughly in chronological order here.

  • We are from New York, and all of our offices and equipment are located there. Hurricane Sandy recently hit us, as you may have heard.

  • We go back all the way to Monday night, 10/29. Nick got the first communications from Peer1, our datacenter, warning everyone that the power was going out for everything south of 34th Street.

  • Monday night, we thought all was safe and sound. Stack Exchange had some failover plans in place, however, as you heard about on a previous podcast.

  • On the Fog Creek side, things were still relatively calm. They were basically blindsided, because the datacenter was confident that they had generator fuel for "like, days".

  • Then the storm hit. There was wind and a little bit of rain. Everything in Zone A got flooded basically immediately, as predicted, but if you didn't live in Zone A you didn't really notice.

  • Michael Pryor's foreshadowing. He saw a Hacker News post saying that Internap, another datacenter, was down - and started making plans to protect Fog Creek if Peer1 went down.

  • Suddenly, we get word that the generator only has thirty minutes of fuel left.

  • Mike Mazzei was the only Peer1 staffer there at the time, and he was stretched pretty thin. He is basically a superhero and ended up saving the day.

  • Anthony managed to get exactly one email on Tuesday morning, and it happened to be about running out of fuel in the middle of the day (when he had previously thought they had a few days of fuel to spare).

  • "Let me tell you what it looked like when I showed up." Michael describes the scene on Broad St. for us.

  • Based on flawed information from the NOC, Fog Creek makes plans to shut everything down at 10:45AM.

  • Bradford was the only sysadmin who was awake and connected. He said we had to start doing a controlled shutdown.

  • Mike has the idea that if we can get the fuel up to the generator, we can keep everything online.

  • Someone from Squarespace found empty 55-gallon drums on Craigslist and brought them down to the datacenter. The first attempt is pushing these barrels of diesel up the stairs.

  • The building's major task was getting the water pumped out of the basement, so at first Fog Creek and Squarespace and Peer1 were able to work on the fuel issue relatively unfettered.

  • Fog Creek decides to bring their servers back up, since they had people on the ground in the datacenter now to monitor the situation.

  • The bucket brigade begins!

  • Michael goes home and sleeps for three hours. He then heads back to Peer1 and checks the generator tank, which is only a quarter full...

  • Joel tells us about trying to raise the alarm with incommunicado sysadmins Mendy and Sven and get them back online.

  • Sven and some others start working on moving Trello onto AWS.

  • Michael tells us about how lucky he got with the Fog Creek fishtank during last year's power outage. Another example of how we were very lucky to be accidentally prepared for this event.

  • Everyone laughs at us for having datacenters in Manhattan, but the clear benefit is that we had the physical ability to make things happen because the employees of the company are close, and downtown Manhattan is a priority to get back up and running, resources-wise.

  • Wednesday morning was the day where we had the day laborers. Michael noticed that there were people carrying fuel that he didn't recognize, and then they started carrying our fuel to our tank. Turns out they were day laborers, and they needed payin'.

  • The system was in place, and it worked - we put a ton of fuel on the roof.  At that point, we thought there would be a happy ending.

  • Enter Thursday. Anthony wakes up to find that the workers are not allowed in the building.

  • The building management and ownership just didn't understand what a datacenter does. We were "the telco guys".

  • Things go south with the building management and ownership over a conflict with the day laborers: the original company that hired them didn't pay them.

  • Everyone stays quiet and tries to just stay out of the way. Mike Mazzei gets the building manager to let the bucket brigade resume using only the eleven people that were already in the building - no outside help was allowed.

  • We were allowed to do this until suddenly we weren't anymore. Mike gives a "we did all we could" speech and everyone prepares to inform customers that the outage was inevitable…

  • More army stories from Joel: the biggest challenge in a crisis situation is the "Fog of War" - 5% good communication and 95% rumor flying around.

  • The building finally gets the pump going and fills the header, and then we're basically okay.

  • When Mike Mazzei got frazzled, Joel went ballistic on Peer1 corporate. We discuss how they should have handled the situation and put in more support.

  • Another army story! When it hits the fan, you find yourself doing things that have a 1% probability of success, but it's all you've got so you do it anyway.

  • STATUS QUO: Thursday night, the pump gets going. Friday and throughout the weekend, things are calm. Work continues on all the contingency plans, but the situation is more or less stabilized.

  • The overarching key is communication, not only internally within your company, but with your customers.

Episode source