The idea that a single IT misstep could cripple companies across entire industries might have seemed like a huge overstatement. However, the recent Microsoft outage is a stark reminder of how interconnected our world is. On July 19th, 2024, a faulty security update from CrowdStrike wreaked havoc on Microsoft Windows systems worldwide. How could such an IT catastrophe unfold? Let’s dive in and explore the causes.

What happened?

Many high-profile companies, such as JP Morgan Chase, Walmart, and Shell, use Falcon, CrowdStrike’s cybersecurity software, to protect their IT infrastructure from data breaches. In fact, it’s used by 82 percent of US state governments and 48 percent of the largest US cities.

Unlike traditional security systems that require bulky hardware and constant updates, CrowdStrike Falcon operates entirely in the cloud. It works through an agent installed on user devices, be it Windows, Mac, or Linux. Once installed, this program connects seamlessly to CrowdStrike’s cloud platform.

CrowdStrike’s latest software update for Windows turned out to be faulty, causing affected machines to crash with a Blue Screen of Death (BSOD) at boot. Rolling back the update requires a system that can boot, which leaves a non-technical user at a dead end.

Adding to the confusion, an outage hit Microsoft Azure services and the Microsoft 365 suite of apps in the central US earlier on Thursday. While a company spokesperson described these as separate issues (one impacting Azure, the other caused by CrowdStrike), cybersecurity consultant Thomas Parenty, a former National Security Agency analyst, offered a different perspective. He suggested a possible link: “The systems required for the connection to Azure could have been initially impacted by the CrowdStrike issue, rendering the service unavailable.”

Talk about a chaotic end to the week!

What are the consequences?

The scale of this incident is staggering, considering the CrowdStrike agent is installed on millions of devices, from servers and personal computers to Internet of Things (IoT) devices. The update, intended to enhance system security, ironically caused crashes across a wide range of industries.

Why did this happen?

First, most organizations deploy software updates automatically, so the faulty update spread like wildfire. Second, the culprit was poorly written code, an error that CrowdStrike has since taken full responsibility for. While the exact details of this blunder remain unclear, one thing is certain: rigorous software testing could have prevented this IT disaster entirely or at least significantly reduced its impact.
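To make the "spread like wildfire" point concrete, here is a minimal sketch of a staged rollout, one common way to limit the blast radius of a bad update. It is an illustration only, not a description of how CrowdStrike or any particular organization deploys updates. The function name `staged_rollout`, the ring sizes, and the `apply_update` health check are all hypothetical.

```python
def staged_rollout(hosts, apply_update, ring_fractions=(0.01, 0.10, 1.0),
                   max_failure_rate=0.0):
    """Apply an update ring by ring and halt if a ring fails too often.

    hosts: list of host identifiers (hypothetical).
    apply_update: callable(host) -> True if the host is healthy afterwards.
    ring_fractions: cumulative share of the fleet covered after each ring.
    Returns (updated_hosts, failed_hosts, halted).
    """
    updated, failed = [], []
    start = 0
    for fraction in ring_fractions:
        # Each ring covers at least one new host and never runs past the fleet.
        end = min(len(hosts), max(start + 1, round(len(hosts) * fraction)))
        ring = hosts[start:end]
        ring_failures = [host for host in ring if not apply_update(host)]
        updated.extend(ring)
        failed.extend(ring_failures)
        if ring and len(ring_failures) / len(ring) > max_failure_rate:
            return updated, failed, True  # stop before touching the next ring
        start = end
    return updated, failed, False


if __name__ == "__main__":
    fleet = [f"host-{i}" for i in range(1000)]
    # Simulate an update that crashes every host it reaches.
    updated, failed, halted = staged_rollout(fleet, lambda host: False)
    print(f"halted={halted}, hosts affected={len(failed)} of {len(fleet)}")
```

In this simulation, the faulty update reaches only the first ring (1 percent of the fleet) before the rollout halts, whereas an all-at-once automatic deployment would have reached every host.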

Why the update might have caused issues:

What can Windows users do?

The good news is that CrowdStrike engineers shared a workaround. Here it is:

  1. Boot Windows into Safe Mode or the Windows Recovery Environment

  2. Navigate to the C:\Windows\System32\drivers\CrowdStrike directory

  3. Locate the file matching “C-00000291*.sys”, and delete it.

  4. Boot the host normally.

The bad news is that it doesn’t work for everyone. First, these steps assume a level of technical knowledge that many users don’t have. Second, the manual fix can’t be applied remotely or to cloud-based systems, so it requires physical access to each affected device. Unfortunately, this translates to a lengthy recovery process for system administrators.
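For readers who want to see what steps 2 and 3 of the workaround amount to, here is a minimal Python sketch that finds and optionally deletes the matching files. It is purely illustrative: Safe Mode and the Windows Recovery Environment do not normally include Python, so in practice the deletion is done by hand or with built-in tools. The function name `remove_faulty_update_files` and the `dry_run` flag are assumptions introduced for this example. Only the directory and the file pattern come from the workaround above.

```python
from pathlib import Path

# Directory and file pattern as given in the workaround steps.
DEFAULT_DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
FAULTY_PATTERN = "C-00000291*.sys"


def remove_faulty_update_files(driver_dir=DEFAULT_DRIVER_DIR,
                               pattern=FAULTY_PATTERN,
                               dry_run=True):
    """Return the files in driver_dir that match pattern.

    With dry_run=True (the default) nothing is deleted; the matches are
    only reported. With dry_run=False the matching files are removed.
    """
    matches = sorted(p for p in Path(driver_dir).glob(pattern) if p.is_file())
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches


if __name__ == "__main__":
    # Report what would be deleted without touching anything.
    for path in remove_faulty_update_files(dry_run=True):
        print(f"Would delete: {path}")
```

Defaulting to a dry run is a deliberate choice: deleting files from a driver directory is not something to do by accident, so the sketch lists matches first and deletes only when asked explicitly.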

Prevention is better than cure

So, what lessons can we learn from one of the most widespread tech meltdowns? Prevention is always better than cure. While having a detailed incident response plan is good, what’s even better is having an ongoing and well-established quality assurance process.

Prevent Faulty Updates With These Testing Types:

Strengthen Your Infrastructure and Processes:

Improve Communication and Collaboration:

Microsoft estimated that the faulty CrowdStrike update knocked out 8.5 million computers worldwide. This chaos highlights the critical need for thorough software testing. Companies must prioritize comprehensive testing and strong IT processes to prevent future disasters. Remember, in tech, testing isn't optional; it's essential.