Facebook’s outage was this week’s biggest news globally, and the kind of crisis every tech company most wants to avoid. Many people asked how one of the world’s biggest tech firms could be down for so many hours. What went wrong and, most importantly, how should a company with tech at its core deal with such a situation? We discussed the events with our tech team, and they shared some important aspects of dealing with an outage.
First things first: communication is crucial, the team says. Internally, you bring in everyone you need to handle the situation. At the same time, you make sure to communicate with external stakeholders as soon as possible. You don’t need to have all the details ready, but by informing external parties about the issue and assuring them you’re on it, you create space and time for your team to get to the bottom of things.
Big tech vs start-up
Of course, in a start- or scale-up, communication lines are shorter because fewer people are involved. In a big company like Facebook, departments are larger and information needs to reach more people. Compared to a start-up, this can pose challenges when handling a crisis.
The impact of downtime differs greatly between a big tech company like Facebook and a young company. However, both will face their own specific consequences. In Facebook’s case, the impact of the downtime was immense. Globally, users of Facebook, Instagram and WhatsApp were interrupted in their day-to-day activities. Yet once things were back up, users wasted no time reopening their apps. The loyalty, or dependence, of Facebook’s users proves the brand’s position. For a company that already has a huge brand name in the market, the impact of downtime may be large, but its credibility may be less affected.
For a tech start-up, there is more pressure to maintain credibility, as the brand image is still being built. According to the team, an outage therefore poses more risk for start-ups than for larger, established firms.
Balancing innovation and risk
According to our Tech Lead Steve, disastrous errors in your systems are something you always want to avoid, but you cannot completely exclude the risk of technical error from your operations, and you shouldn’t aim to either. He adds:
“You cannot release your products with shaking hands every time. Without risk, you cannot move forward. Of course, you decrease the chance of a bug in your systems, but you cannot exclude this one hundred percent. The risk goes hand in hand with innovation, as you have to test things out.”
To manage your technological risk well, one thing you can do as a team is analyse the different scenarios that could affect your business and create action plans accordingly. It’s important to have general plans in place that you can follow step by step when something like an outage disrupts your operations. But companies should be realistic: you cannot predict everything, the team believes.
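As a rough illustration of what such plans could look like in practice, here is a minimal Python sketch. The scenario name, the wording of the steps and the function `plan_for` are all hypothetical and simply restate the advice in this article; they are not our team’s actual runbook. The point it shows is the fallback: because you cannot predict everything, any unforeseen scenario still gets a general plan.

```python
# Hypothetical general plan, used when no specific scenario matches.
GENERAL_PLAN = [
    "Bring in everyone needed internally",
    "Inform external stakeholders that you are on it",
    "Step back and analyse before changing anything",
    "Consider returning to the last working version",
]

# Hypothetical scenario-specific plans; a real team would define its own.
ACTION_PLANS = {
    "outage": [
        "Bring in everyone needed internally",
        "Inform external stakeholders that you are on it",
        "Verify the cause instead of acting on assumptions",
        "Return to the last working version if needed",
    ],
}


def plan_for(scenario):
    """Return the steps for a known scenario, or the general plan otherwise."""
    return ACTION_PLANS.get(scenario, GENERAL_PLAN)


for number, step in enumerate(plan_for("outage"), start=1):
    print(f"{number}. {step}")
```

Keeping the plans as plain ordered lists means they can be followed step by step under pressure, without anyone having to interpret them first.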
The urge to quick fix and hack
“One of the things you often see happen in a situation like an outage is the urge to ‘quickly fix’ the problem. But when something like this takes place, your tech team should not try to solve things by using hacks or loose-end solutions,” Steve comments.
To avoid rushing into a fix, it’s essential to take a step back and analyse the situation in a clear and rational way. The team can decide to move back to the last fully operating product version and take things from there. One of the great things about software development is that there’s always the option of returning to an earlier, working version of the product.
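The idea of moving back to the last fully operating version can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Release` record, its `healthy` flag, the version numbers and the function `last_working_release` are hypothetical, and a real team would rely on its own deployment tooling and release history.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    version: str
    healthy: bool  # assumed flag: True if this version ran fully operational


def last_working_release(history):
    """Return the most recent healthy release, or None if there is none.

    `history` is assumed to be ordered from oldest to newest.
    """
    for release in reversed(history):
        if release.healthy:
            return release
    return None


# Hypothetical release history, oldest first.
history = [
    Release("1.4.0", healthy=True),
    Release("1.5.0", healthy=True),
    Release("1.6.0", healthy=False),  # the version that caused the trouble
]

target = last_working_release(history)
print(target.version if target else "no working version on record")
```

The sketch only works if someone has recorded which versions were known to be fully operational, which is one more reason to have that kind of bookkeeping in place before an outage happens.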
Lead Architect Edwin also emphasises the importance of not basing interventions on assumptions alone. He explains that panic and stress can sometimes cause a team to notice odd things in the system and zoom in on those straight away, while the real error lies somewhere else.
Blame culture lets stress and panic take over
Staying away from a blame culture is the best way to keep panic and stress from slowing down progress, Edwin believes:
“People sometimes start panicking and are worried about keeping their jobs or being disliked by the team. If your corporate culture operates like that, then you will see panic quickly arising in those situations. People cannot think strategically anymore, while rational thinking is key to moving forward.”
That’s why company culture in general matters a lot, every day. If you establish a culture where people can make mistakes and the team comes together to learn and move on from them, you will reap the benefits when a situation like an outage happens.
Alongside acceptance of mistakes comes trust. Having a tech team where developers can depend on each other is crucial when you’re facing a complex situation like Facebook did this week. For those familiar with Tuckman’s group development stages: it’s only when the team has overcome issues in the past, established trust and come to share a common goal that it enters the performing stage, which allows for effective collaboration.
Tuckman describes four phases of group development: forming, storming, norming and performing. In the first stage, team members meet and get to know each other and the company’s projects. In the storming stage, they experience different working styles and express any concerns or issues that arise when collaborating. When the team is able to overcome this, it moves to the norming stage, where members co-operate more effectively and share goals. After these stages, the team moves into the performing stage, as roles are established and trust is achieved.
When an outage hits your company, your most important asset is the team that has to resolve it. If your team has not yet reached the performing stage, it may not be able to address key challenges successfully. The culture lived out in your company will decide how your teams move through the different development stages to reach the ultimate goal of performing well together. A precondition for this is an environment that encourages team members to communicate openly, not to avoid conflict, to make mistakes and to learn collectively.
In short, risk can never be fully excluded from a tech company’s operations, and it shouldn’t be if you want to progress and innovate. However, it is important to manage risks as effectively as possible and to establish general go-to strategies to follow when errors do occur. Fundamental to handling any crisis in your organisation are the trust and team spirit that allow for strategic thinking without putting the blame on individual members.