One Week After the CrowdStrike Incident

Updated:
July 26, 2024
Written by
Abhay Bhargav

A week ago, the world ground to a halt. Well, the digital world. And it wasn’t because of a cyber attack, but because of a faulty update from a cybersecurity firm. (I’ll be honest with you, I thought someone hacked Microsoft!)

On July 19, 2024, a CrowdStrike update caused widespread Blue Screen of Death (BSOD) errors on Windows machines. Yup, that blue screen our nightmares are made of. It crippled systems across industries, from healthcare facilities scrambling to maintain patient care to airlines grounding flights and media outlets going dark. The initial shockwaves have subsided, but plenty of industries and businesses are still feeling the repercussions.

While the immediate crisis is over, there are still questions that need answers. How did a company at the forefront of cybersecurity let such an error slip through? What were the systemic failures that caused this much disruption? And most of all, what steps are being taken to prevent another global tech outage?

And as for Microsoft: how could a third-party update bring such a huge entity to its knees?

It’s time to move past the initial reactions, dig into the technical details, understand the root causes of this failure, and draw out the lessons learned.

Table of Contents

  1. The First 24 Hours
  2. What Went Wrong
  3. The Global Impact of the CrowdStrike Incident
  4. A Wake-Up Call for the Cybersecurity Industry

The First 24 Hours

For countless users around the world, Friday morning started with a rude awakening. Their computers crashed unexpectedly, greeting them with the dreaded Blue Screen of Death. Panic followed, even as some were handed an unexpected day off, courtesy of a system-wide crash. Everybody was thinking the same thing: a massive cyber attack was underway. We were all just waiting for someone to claim responsibility.

It got bad. Airports were filled with stranded passengers, and they were not happy. Major carriers like Delta, American Airlines, and United struggled with check-in systems. Financial institutions like JPMorgan Chase and Bank of America had to close branches as ATMs and online banking services went offline. Retailers, from Walmart to Amazon, experienced point-of-sale failures, causing long queues and frustrated customers. Can you just imagine the operational nightmares?

How IT teams worldwide reacted

To make matters worse, it all happened on a Friday, right as the weekend began. IT teams worldwide were caught off guard and thrown straight into emergency mode. As the gravity of the situation became clear, teams mobilized to contain the damage and restore operations. Unfortunately, with limited resources and personnel, progress was slow.

In the first 24 hours, many IT teams had to adopt a triage approach, prioritizing critical systems and services. This usually meant manually rebooting machines, a labor-intensive process made even more complex for organizations using BitLocker encryption. Some organizations had disaster recovery plans in place, but the speed and scale of the outage tested the limits of their strategies.

How CrowdStrike reacted

CrowdStrike quickly acknowledged the problem and issued a public apology. The company attributed the crashes to a faulty content update and released a “how-to” to help users remove it. But as the scale of the disaster became more obvious, it was clear that CrowdStrike faced a huge challenge in containing the damage and restoring trust.

What Went Wrong

In the days that followed, the company conducted a “preliminary post incident review” to understand the root cause. The primary issue was traced back to errors in a content configuration file and a logic flaw in how that content was processed, which ultimately led to the widespread Blue Screen of Death (BSOD) errors that Windows users experienced.

So what exactly went wrong? The faulty update was a content configuration file (a “channel file”) containing “problematic content data.” When the Falcon sensor, which runs as a privileged driver inside the Windows kernel, loaded and parsed that data, it hit a condition it could not handle. Once the update was deployed, the BSOD was triggered and rendered Windows computers inoperable all over the world.

But it doesn’t end there. CrowdStrike’s preliminary review points to a logic error in how the sensor processed the new content: the problematic data triggered an out-of-bounds memory read, and the resulting exception could not be handled gracefully. An unrecoverable fault in a kernel-mode component leaves Windows with only one safe option, which is to halt, and that is exactly what millions of machines did.

While conducting the initial investigation, cybersecurity professionals found that the update didn’t go through enough testing across diverse environments. This lack of thorough testing is what let the configuration errors and logic flaws go unnoticed until the update was deployed.
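
To make that failure mode concrete, here is a toy sketch in Python. This is not CrowdStrike’s code: the Falcon sensor is a native kernel-mode driver, and the file format, field names, and sizes below are invented purely for illustration. The sketch shows a content interpreter that blindly trusts a count field in the data it loads. In user-mode Python the bad read surfaces as a catchable exception; an equivalent out-of-bounds read inside a kernel driver takes the whole operating system down with it.

```python
import struct

def load_rules(blob: bytes) -> list[bytes]:
    """Toy 'content interpreter' (NOT CrowdStrike's format or code).

    Header: a 4-byte little-endian rule count that the parser trusts blindly.
    Body:   `count` fixed-size 8-byte rules.
    """
    (count,) = struct.unpack_from("<I", blob, 0)
    rules = []
    for i in range(count):
        # If `count` overstates what the blob actually contains, this read
        # walks past the end of the buffer and raises struct.error -- the
        # user-mode analogue of an out-of-bounds memory read.
        rules.append(struct.unpack_from("8s", blob, 4 + i * 8)[0])
    return rules

# A well-formed blob: count=1, followed by one 8-byte rule.
good = struct.pack("<I", 1) + b"RULE0001"
print(load_rules(good))            # [b'RULE0001']

# A malformed blob: claims 1000 rules but carries none.
bad = struct.pack("<I", 1000)
try:
    load_rules(bad)
except struct.error as exc:
    print("interpreter fault:", exc)
```

The takeaway is the same one CrowdStrike’s review points to: content that ships separately from the code that interprets it needs to be validated and tested just as rigorously as the code itself.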

The Global Impact of the CrowdStrike Incident

This incident, which involved CrowdStrike and Microsoft, affected an estimated 2 million businesses and countless users across different regions and sectors. According to Microsoft, the outage affected 8.5 million Windows devices. The scale of the disruption shows both how dangerous the interconnectedness of today’s IT infrastructure can be and how important shared responsibility between software providers and end users is. Both parties need to do their part to ensure security and respond quickly to security incidents. In short, mutual accountability.

Different regions and sectors experienced varying levels of disruption. Here are some of them:

Airlines

  1. Delta Air Lines (USA) - Personnel weren’t able to access important scheduling and communication systems. Passengers were stranded at airports, which led to mounting frustration and logistical issues.
  2. British Airways (UK) - Flights were either canceled or delayed, leaving thousands of passengers with no information about what was happening and no rebooking options. The airline’s check-in systems were also disrupted, resulting in long lines and more confusion at airports.
  3. Qantas (Australia) - Suffered similar disruptions, with many flights grounded and passengers unable to access their boarding passes or flight information. The airline had to resort to checking in passengers manually.
  4. Lufthansa (Germany) - Over 100 flights were canceled, with significant delays, because system outages affected booking and check-in processes.

Healthcare

  1. Mayo Clinic (USA) - Experienced system failures that affected patient records and delayed treatments.
  2. NHS (UK) - Reported that its electronic health records (EHR) system went offline. The staff had to revert to manual record-keeping.
  3. Apollo Hospitals (India) - Encountered disruptions in their appointment scheduling system, forcing patients to cancel or reschedule appointments. Patient flow was disrupted and waiting areas became overcrowded.
  4. Singapore General Hospital (Singapore) - Delayed surgeries and emergency care because of a critical system failure. Staff couldn’t access patient records or manage hospital operations.

Media

  1. The New York Times (USA) - Journalists were unable to access critical systems needed for publishing, and news delivery was delayed.
  2. BBC (UK) - The network had to switch to pre-recorded content temporarily because of broadcast interruptions and internal communication breakdowns.
  3. Al Jazeera (Qatar) - Journalists weren’t able to access editorial systems. This slowed down the news production process and impacted the timely reporting of breaking news.
  4. NHK (Japan) - Broadcasting services were disrupted, affecting live coverage and delaying news reporting.

Banking

  1. Wells Fargo (USA) - Customers couldn’t access their accounts or perform transactions. ATMs and online banking services were intermittently unavailable.
  2. HSBC (UK) - Customers were unable to access online banking or use ATMs, leading to increased calls to customer service and long wait times for assistance.
  3. Commonwealth Bank (Australia) - The outage affected online banking, ATM services, and payment processing.

Retail

  1. Walmart (USA) - Checkout counters stopped working and the outage affected supply chain management. Stores had to process sales manually.
  2. Carrefour (France) - Point-of-sale systems went offline and inventory management systems were affected. This resulted in stock shortages and customer dissatisfaction.
  3. Woolworths (Australia) - Experienced issues with its online shopping platform. Customers couldn’t place orders and deliveries were delayed.
  4. Tesco (UK) - Suffered checkout system failures, causing long queues and delays in processing payments that impacted both in-store and online shopping.

Comparison with Other Major IT Outages

WannaCry Ransomware Attack

The 2017 WannaCry ransomware attack was a huge one, particularly in the healthcare sector. The attack exploited a vulnerability in the Windows OS that Microsoft had patched two months prior, but many organizations had not applied the patch, which left them exposed. The NHS in the UK was severely affected: numerous hospitals and GP surgeries couldn’t access patient data, leading to canceled appointments and delayed treatments. The estimated damage was around $4 billion. So this is what happens when patch management isn’t timely, huh?

AWS Outage in 2020

In November 2020, Amazon Web Services (AWS) experienced a major outage that affected many high-profile clients like Roku, Adobe, and The Washington Post. The problem was a failure in the Kinesis data processing service, which cascaded into other services that depend on it. This time the provider itself was the issue, and the outage made many realize just how dependent companies are on cloud infrastructure. It caused major financial losses for those affected.

Facebook Outage in 2021

Facebook, Instagram, and WhatsApp went offline for almost 6 hours in October 2021. The outage was caused by a faulty configuration change to the backbone routers that coordinate network traffic between Facebook’s data centers, which cut off communication between them. This one was entirely self-inflicted, and billions of users worldwide were affected.

What CrowdStrike Did to Address the Incident One Week Later

In the aftermath of the “largest IT outage in history”, CrowdStrike took steps to address the problem and mitigate its impact on affected organizations. Here’s a detailed overview of what CrowdStrike did in the week after the incident:

July 19, 2024

  • CrowdStrike quickly issued detailed remediation guidance to help IT teams address the BSOD issues. It included step-by-step instructions for identifying and resolving the update-related problems (a rough, scripted take on the timestamp check is sketched below):
  1. Locate the channel file in the %WINDIR%\System32\drivers\CrowdStrike directory.
  2. Ensure the file with the timestamp 2024-07-19 0409 UTC is replaced with the good version timestamped 2024-07-19 0527 UTC or later.
  3. Replace the problematic file with the stable version to resolve the BSOD issues.

  • The company immediately began rolling back the problematic update, reverting systems to a previous stable state to prevent further occurrences of the BSOD.
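
For teams scripting the timestamp check from the steps above, here is a rough sketch in Python, not CrowdStrike tooling. It walks the %WINDIR%\System32\drivers\CrowdStrike directory and flags channel files whose modification time falls before the 2024-07-19 0527 UTC “good” cutoff from the guidance. The *.sys glob and the use of filesystem modification time as a stand-in for the channel file’s build timestamp are simplifying assumptions, and affected machines generally had to be booted into Safe Mode or the Windows Recovery Environment before any cleanup could run.

```python
import os
from datetime import datetime, timezone
from pathlib import Path

# "Good" channel file cutoff from CrowdStrike's guidance: 2024-07-19 05:27 UTC.
GOOD_CUTOFF = datetime(2024, 7, 19, 5, 27, tzinfo=timezone.utc)

# %WINDIR%\System32\drivers\CrowdStrike -- where the channel files live.
CS_DIR = Path(os.environ.get("WINDIR", r"C:\Windows")) / "System32" / "drivers" / "CrowdStrike"

def suspect_channel_files(directory: Path = CS_DIR):
    """Yield (path, mtime) for channel files last modified before the good cutoff."""
    if not directory.is_dir():
        return
    for path in directory.glob("*.sys"):
        mtime = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
        if mtime < GOOD_CUTOFF:
            yield path, mtime

if __name__ == "__main__":
    for path, mtime in suspect_channel_files():
        print(f"SUSPECT: {path} (modified {mtime:%Y-%m-%d %H:%M} UTC)")
```

Actual remediation should follow CrowdStrike’s official instructions (replace or remove the flagged file) rather than trusting a heuristic like this on its own.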

July 20-21, 2024

  • CrowdStrike maintained open and transparent communication with its customers throughout the incident, issuing regular updates on the status of the remediation efforts and any new developments.
  • Dedicated CrowdStrike support teams assisted affected customers, working around the clock to provide technical support, answer queries, and offer help on how to mitigate the impact of the incident.
  • CrowdStrike’s engineering teams worked swiftly to develop a new, stable patch. This patch underwent rigorous testing to make sure that it resolved the issues without introducing new problems.

July 22-23, 2024

  • In response to the incident, CrowdStrike reviewed and improved its testing protocols. They also implemented more comprehensive testing in diverse environments to prevent similar issues in the future.
  • CrowdStrike identified the specific channel file responsible for the system crashes. The problematic file, timestamped 2024-07-19 0409 UTC, was replaced with a stable version.
  • The company reverted the changes by deploying a known-good version of the file, ensuring that systems returned to normal operation.
  • CrowdStrike provided updated dashboard tools to help IT teams detect and manage impacted hosts, streamlining the recovery process.
  • The problematic file was added to Falcon’s known-bad list to prevent it from causing further issues.

July 24, 2024

  • CrowdStrike collaborated closely with Microsoft to address the compatibility issues that contributed to the incident. This partnership’s goal is to make sure that both companies’ products work seamlessly together moving forward.
  • CrowdStrike sought the expertise of industry specialists to review the incident and provide recommendations for improving their update processes and overall security measures.

Customer Assurance

CrowdStrike offered a $10 UberEats gift card to affected customers as an apology for the inconvenience caused by the incident. It was a gesture meant to acknowledge the extra work and disruption caused to their partners.

A Wake-Up Call for the Cybersecurity Industry

The CrowdStrike incident is a reminder for the entire cybersecurity industry to reevaluate how updates and patch management are handled. Organizations, and even individuals, are now more aware of the risks that software updates can carry. For sure, this incident will prompt many to invest in more comprehensive risk assessment frameworks, like regular audits, simulations, and stress tests, to find the weaknesses before they get exploited.

Here at AppSecEngineer, we encourage everybody, every company, to also invest in your people. We all know the statistics around human error. Think about it: if your organization nurtures a security-first environment, the odds of incidents like the CrowdStrike outage drop significantly. Proactive security, that’s it.

Let me ask you a very important question: Do you think July 19 will become the International Blue Screen of Death Day?

Abhay Bhargav

Abhay is a speaker and trainer at major industry events including DEF CON, Black Hat, and OWASP AppSecUSA. He loves golf (don't get him started).
