Our Takeaway after Basecamp’s System Outage: Prevention Is Always Better than Cure

As a communications agency, staying ahead of the game is our daily bread. That's why our entire staff gathers around a huge table every Thursday morning for a weekly meeting to discuss the status of all the clients we work for: reviewing goals, adjusting strategies and resolving emerging issues immediately is crucial to keeping our clients' projects on track and making great strides. But what if the project management tool used to coordinate all of this goes down exactly when it is needed most? Well, that's exactly what we – among many others – faced two weeks ago.


What happened?

On November 9, Basecamp 3 was stuck in read-only mode for nearly five hours straight. Although existing messages, to-do lists and files remained fully accessible, albeit not editable, no new data could be added to the team communication software whatsoever.


What exactly caused the outage in the first place?

Every type of activity, from posting a message and updating a to-do list to merely liking one of your co-workers' comments, is meticulously tracked and archived in a large table of events. As it transpired, Basecamp's "database hit the ceiling of 2,147,483,647 on [this] very busy events table," according to a later statement by David Heinemeier Hansson, co-founder of the software. That number is the largest value a signed 32-bit integer can hold, so the capacity to record any new events was exhausted. To put it in layman's terms: Basecamp had run out of ID numbers to assign to new events.
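To illustrate where that oddly specific ceiling comes from: assuming the events table used a standard signed 32-bit integer for its auto-incrementing ID column (a common default in older database schemas; the exact schema is not described in Basecamp's statement), the largest ID it can store is 2^31 − 1 = 2,147,483,647. A minimal Python sketch of the failure mode, with a hypothetical `next_event_id` helper:

```python
# The ceiling Basecamp hit: the largest value a signed 32-bit integer can hold
# (1 bit for the sign, 31 bits for the magnitude).
INT32_MAX = 2**31 - 1  # 2,147,483,647


def next_event_id(current_id: int) -> int:
    """Hypothetical helper: assign the next event ID in a table whose
    ID column is a signed 32-bit integer."""
    if current_id >= INT32_MAX:
        # Once the column is full, every further insert fails.
        raise OverflowError("events table ID column is full")
    return current_id + 1


# Widening the column to a signed 64-bit integer raises the ceiling to
# 2**63 - 1 (over nine quintillion), which is the standard remedy.
INT64_MAX = 2**63 - 1
```

This is why the fix is small but the consequence is large: the table itself was fine, only the ID counter had nowhere left to go.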


How did service users cope with this downtime?

It didn’t take long until customers took to Twitter asking for updates on the issue. Some affected users also didn’t shy away from expressing their frustration about the service failure:


On the flip side, a number of users took it in stride, appreciating a somewhat unforeseen coffee break:


What to learn from it?

In light of most users’ positive encouragement online, Basecamp certainly did a great job of handling the issue as quickly and, more importantly, as transparently as possible. As we have repeatedly highlighted in previous blog posts about thought leadership, there’s a golden rule in crisis communications: validate concern and show action – and that’s precisely what happened. After informing users about the failure on Twitter early on, Basecamp made a great effort to provide consistent updates and reassure customers by specifying an exact time by which the system would be back online, even reminding users to account for time differences depending on their location. Needless to say, this proactive engagement helped the software company, which prides itself on offering a reliable service with an uptime record of 99.998%, to maneuver through this incident adeptly.

This week, Basecamp finally closed the book on the outage by presenting a concluding failure report to its customers, alongside another apology from the leadership.

The crux of the whole matter is, as Basecamp itself acknowledges repeatedly, that this failure was, above all, very much avoidable: an easy fix could have averted a fairly disproportionate outcome, had it only been applied a little earlier. With all that said, the takeaway of the story is this: carving out some time for ancillary meetings to re-evaluate what’s on your plate is of the utmost importance. As we have seen, prevention is always better than cure!