Microsoft 2024 IT Outage: Lessons in Business Continuity with Philippe Tassé-Gagné, Vice President of Consulting Services

On July 19, 2024, a faulty CrowdStrike update triggered a global outage of Microsoft services, affecting millions of users and disrupting numerous business sectors.

In the face of this unprecedented situation, business continuity management has become a central topic for companies of all sizes.

To better understand the challenges and solutions, we interviewed Philippe Tassé-Gagné, Vice President of Consulting Services at Premier Continuum, and had a very interesting conversation.

Mr. Tassé-Gagné is a recognized expert in business continuity and organizational resilience, with over 25 years of experience. He was also named Continuity and Resilience Consultant 2024 at the BCI Americas Awards 2024.

Enjoy reading!

1. Philippe, could you briefly explain what happened during the Microsoft outage in July 2024?

Of course. On July 19, 2024, an update to CrowdStrike's cybersecurity software caused Blue Screen of Death (BSOD) errors on an estimated 8.5 million Windows devices.

Figure: Example of a "Blue Screen" error

This update caused significant interruptions to Microsoft 365 services, affecting critical applications like Outlook, Teams, and OneDrive. The outage had a global impact, disrupting key sectors such as transportation, social and health services, financial services, and many others.

On a personal level, one of the things that surprised me the most was that even the radio wasn't working that morning. When everything went wrong, even the source we consider the most reliable wasn't available!

2. Have you ever had to manage or support your clients in handling a similar incident?

Apart from the COVID-19 pandemic, I have not had to manage an event of such magnitude in my 28-year career.

It reminds me of the saying: "We are always prepared for the last incident we experienced."

For example, when COVID-19 hit, organizations turned to their existing continuity and crisis management plans, which were themselves based on the last major event of a similar nature: in this case, the H1N1 flu pandemic of 2009. As a result, the protocols were more or less suited to COVID-19, but they might not have been suitable for other types of personnel-related disruptions.

The key takeaway here is essentially a question: following a crisis, how many organizations really take the time to debrief, gather data and information, and update their plans and procedures for similar scenarios? Not many, yet it is of paramount importance.

3. In your opinion, why were the impacts of the CrowdStrike incident so widespread?

Many organizations have become increasingly dependent on Microsoft. And even though the risk of service and tool interruptions is low, we must keep in mind that the impacts can be catastrophic.

We are becoming increasingly dependent on multinational companies, especially those that dominate the market, like Microsoft. In my opinion, this kind of outage will shake many organizations and push those most affected to ask whether it is worth implementing alternative solutions, and at what cost. The alternatives are few...

Take, for example, Delta Air Lines, which is threatening to sue CrowdStrike for the losses incurred: [...] of their flights had to be delayed, and many others were canceled.*

We understand their loss, but it raises a counterintuitive question: which other provider can they turn to?

Delta will likely continue to do business with Microsoft, a key player in their operations. Microsoft, in turn, will probably continue to collaborate with CrowdStrike, given the robustness and reputation of their cybersecurity solutions. So, even if Delta sues CrowdStrike, they will remain indirectly tied to them through Microsoft. This highlights the complexity and interconnectedness of today’s technological ecosystems, where choices are limited, and mutual dependencies are inevitable.

*Source: CNBC, "Delta hires David Boies to seek damages from CrowdStrike, Microsoft after outage," published on July 29, 2024, https://www.cnbc.com/2024/07/29/delta-hires-david-boies-to-seek-damages-from-crowdstrike-microsoft-.html

4. What should organizations do following the CrowdStrike incident or the Microsoft Azure service outage?

An important takeaway is to become aware of the blind trust we place in these multinationals. On the client side, the CrowdStrike update was automatically downloaded and wasn’t necessarily tested by organizations before being installed.

This event should definitely encourage companies to be more efficient and vigilant about the update process for their critical systems. Some companies tested the CrowdStrike update before installing it, which allowed them to avoid impacts on their operations. In this sense, large organizations should evaluate or at least test updates, whenever possible, before installing them.

Obviously, small and medium-sized organizations cannot review every update, but it is always good to revisit their systems and dependencies to reassess the risks. And while the risks were relatively low, we must remember that our systems are becoming increasingly interdependent.
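
As an illustration of testing an update before it reaches every machine, here is a minimal Python sketch of a staged ("ring") rollout gate. It is an editorial example under stated assumptions, not a description of CrowdStrike's or Microsoft's tooling: the ring names, the `max_failure_rate` threshold and the `simulated_health_check` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Ring:
    """A group of devices that receives an update at the same stage."""
    name: str
    devices: List[str]


def staged_rollout(
    rings: List[Ring],
    health_check: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> Dict[str, str]:
    """Apply an update ring by ring and halt as soon as a ring is unhealthy.

    health_check(device) is a hypothetical callback: it installs the update
    on one device and returns True if the device is still healthy afterwards.
    """
    status: Dict[str, str] = {}
    halted = False
    for ring in rings:
        # Once a ring has failed, later rings never receive the update.
        if halted or not ring.devices:
            status[ring.name] = "skipped"
            continue
        failures = sum(1 for device in ring.devices if not health_check(device))
        failure_rate = failures / len(ring.devices)
        if failure_rate > max_failure_rate:
            status[ring.name] = "failed"
            halted = True
        else:
            status[ring.name] = "ok"
    return status


# Hypothetical rings, from least to most critical.
rings = [
    Ring("test-lab", ["lab-1", "lab-2"]),
    Ring("pilot", ["pilot-1", "pilot-2"]),
    Ring("production", ["prod-1", "prod-2", "prod-3"]),
]


def simulated_health_check(device: str) -> bool:
    """Stand-in for a real check: pretend the update breaks 'pilot-2'."""
    return device != "pilot-2"


print(staged_rollout(rings, simulated_health_check))
# {'test-lab': 'ok', 'pilot': 'failed', 'production': 'skipped'}
```

In this simulated run the problem surfaces in the pilot ring, so the production devices are never touched. A smaller organization could apply the same idea with only two rings, for example a handful of non-critical machines followed by everything else.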

You can refer to this article to learn more about the Microsoft Azure service outage.

5. How can developing organizational resilience equip companies to handle this kind of disruption?

Organizational resilience is the ability of an organization to absorb shocks and adapt to a changing environment. Developing this capacity can particularly help organizations strengthen their cyber resilience and better prepare for technological outages.

  • Notably, the CrowdStrike incident highlighted the increased vulnerability of organizations to cyberattacks. During the crisis, disciplined vigilance was essential, as a security breach provided a golden opportunity for cybercriminals. Working on increasing organizational resilience allows organizations to remain vigilant and ready to respond quickly to disruptions.
  • Moreover, when working on organizational resilience, backups, recovery strategies or workarounds for priority activities or critical business services are put in place. Organizations can then take the time to recharacterize their critical infrastructures.
In other words, you must ask yourself: "Do I ultimately have single points of failure that I hadn't perceived or didn't consider relevant?"

There may not always be solutions, but by recognizing that this type of IT outage is more likely than expected, continuity and resilience teams can develop secure workarounds to manage similar situations in the future.
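
One simple way to start looking for hidden single points of failure is to write down which providers each critical service relies on, then flag every capability that has only one. The Python sketch below is an editorial illustration: the service names, capabilities, providers and data layout are all hypothetical assumptions, not a recommended inventory format.

```python
from typing import Dict, List, Tuple

# Hypothetical dependency map: for each critical service, every required
# capability lists the providers or systems able to deliver it.
DEPENDENCIES: Dict[str, Dict[str, List[str]]] = {
    "customer-support": {
        "email": ["Provider A"],
        "telephony": ["Provider B", "Provider C"],
    },
    "payroll": {
        "endpoint-os": ["Provider A"],
        "banking-link": ["Bank X", "Bank Y"],
    },
}


def single_points_of_failure(
    dependencies: Dict[str, Dict[str, List[str]]],
) -> List[Tuple[str, str, str]]:
    """Return (service, capability, provider) where only one provider exists."""
    findings = []
    for service, capabilities in dependencies.items():
        for capability, providers in capabilities.items():
            unique = set(providers)
            if len(unique) == 1:
                findings.append((service, capability, next(iter(unique))))
    return findings


def shared_providers(
    dependencies: Dict[str, Dict[str, List[str]]],
) -> Dict[str, List[str]]:
    """Map each sole provider to the services that depend on it alone."""
    exposure: Dict[str, List[str]] = {}
    for service, _capability, provider in single_points_of_failure(dependencies):
        services = exposure.setdefault(provider, [])
        if service not in services:
            services.append(service)
    return exposure


print(single_points_of_failure(DEPENDENCIES))
# [('customer-support', 'email', 'Provider A'), ('payroll', 'endpoint-os', 'Provider A')]
print(shared_providers(DEPENDENCIES))
# {'Provider A': ['customer-support', 'payroll']}
```

In this made-up example, a failure at "Provider A" would affect two critical services at once, which is the kind of concentration a continuity team would then examine for possible workarounds.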

6. How could we improve our preparedness for another IT outage of this magnitude?

I believe that preparation involves raising awareness, training, and conducting continuity and resilience exercises. We must ensure that all members of the organization understand their role in a crisis, and it is crucial to focus on developing the skills of the crisis management team through continuous training and regular exercises.

An example I like to use is the following: whether it's in IT recovery plans or during a cyber exercise, it’s important to specify the types of cyber incidents in question. Is it a cyberattack, or is it a loss of system access?

Both scenarios could affect an organization’s primary means of communication, but you certainly cannot manage the crisis in the same way. It’s absolutely necessary to design parallel measures and strategies for an appropriate response and to practice implementing them effectively with the crisis management team.

Nowadays, preparing for a cyberattack is a good practice. It is by far the most likely risk or hazard.

7. Do you think this Microsoft IT outage will become a case study for the future?

The Microsoft outage caused by CrowdStrike is an IT incident, and the IT sector tends to adapt better than most other sectors. That said, it wouldn't be surprising if the lessons learned from this event quickly become best practices.

I do hope that following this incident, organizations will take more time to perform checks before installing updates. However, I reiterate, this is no small task. It requires good systems in place and the right tools, which not all companies can afford. Moreover, many organizations have outdated systems and are therefore more vulnerable to this type of event. In my opinion, these organizations should be more cautious, as should those providing essential services, such as healthcare and transportation services.

In the short term, I advise all organizations to document the impacts of this incident and how it was managed. Analyzing this information will make it easier to identify opportunities for improvement in future crises. In other words, it is essential to learn from mistakes to strengthen resilience and improve business continuity plans, so that organizations are better prepared to face future disruptions.

To go further…

This concludes our interview with Philippe Tassé-Gagné, Vice President of Consulting Services and Talent Development at Premier Continuum.

We warmly thank Mr. Tassé-Gagné for sharing his insights and expertise with us on this topic.

To learn more about this IT outage, we invite you to read our article: Incident of July 19, 2024: When an Update Has Global Impacts.

For more information on business continuity management and organizational resilience, consult our team of experts now.