A week ago, the world ground to a halt. Well, the digital world. And not because of a cyber attack, but because of a faulty update from a cybersecurity firm. (I’ll be honest with you, I thought someone hacked Microsoft!)
A week ago, July 19, 2024, a CrowdStrike update caused widespread Blue Screen of Death (BSOD) errors on Windows machines. Yup, that blue screen that our nightmares were made of. It crippled systems across industries, from healthcare facilities scrambling to maintain patient care to airlines grounding flights and media outlets going dark. The initial shockwaves have subsided, but a lot of industries and businesses are still feeling the repercussions.
While the immediate crisis is over, there are still questions that need answers. How did a company at the forefront of cybersecurity allow such an error to happen? What were the systemic failures that caused this much disruption? And most of all, what steps are being taken to prevent another global tech outage?
And for Microsoft: how could a third-party update bring such a huge entity to its knees?
It’s time to move on from the initial reactions and talk about the technical details to understand the root causes of this failure and to find out the lessons learned.
For countless users around the world, Friday morning started with a rude awakening. Their computers shut down unexpectedly, and they were greeted by the dreaded Blue Screen of Death. Panic followed, even as some of them were handed an unexpected day off, courtesy of a system-wide crash. Everybody was thinking the same thing: a massive cyber attack was underway. We were all just waiting for someone to claim responsibility.
It got bad. Airports were filled with stranded passengers, and they were not happy. Major carriers like Delta, American Airlines, and United struggled with check-in systems. Financial institutions like JPMorgan Chase and Bank of America had to close branches as ATMs and online banking services went offline. Retailers, from Walmart to Amazon, experienced point-of-sale failures, causing long queues and frustrated customers. Can you just imagine the operational nightmares?
To make matters worse, it all happened on a Friday, right as the weekend began. IT teams worldwide were caught off guard and forced straight into emergency mode. As the gravity of the situation became more obvious, teams mobilized to contain the damage and restore operations. Unfortunately, with limited resources and personnel, progress was slow.
In the first 24 hours, many IT teams had to adopt a triage approach, prioritizing critical systems and services. That usually meant manually rebooting machines, a labor-intensive process made even more complex for fleets using BitLocker encryption, since recovery keys were needed before the fix could even be applied. Some organizations had disaster recovery plans in place, but the speed and scale of the outage tested the limits of their strategies.
CrowdStrike quickly acknowledged the problem and issued a public apology. The company attributed it to a faulty content update that caused widespread system crashes, and released a “how-to” to help users remove the faulty update. But as the scale of the disaster became more obvious, it was clear that CrowdStrike faced a huge challenge in containing the damage and restoring trust.
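That “how-to” boiled down to a painfully manual workaround: boot each machine into Safe Mode or the Windows Recovery Environment (unlocking the drive with its BitLocker recovery key first, where applicable), delete the faulty channel file, and reboot. In practice this was a couple of commands typed into a recovery console rather than a script, but here’s a minimal sketch of just the deletion step, assuming the directory and filename pattern that were publicly reported; it’s an illustration, not CrowdStrike’s official tooling:

```python
import glob
import os

# Directory and filename pattern from the publicly reported workaround.
# Treat these as assumptions and verify against CrowdStrike's official guidance.
CHANNEL_FILE_DIR = r"C:\Windows\System32\drivers\CrowdStrike"
FAULTY_PATTERN = "C-00000291*.sys"

def remove_faulty_channel_files() -> None:
    """Delete channel files matching the faulty update's reported pattern."""
    matches = glob.glob(os.path.join(CHANNEL_FILE_DIR, FAULTY_PATTERN))
    if not matches:
        print("No matching channel files found, nothing to do.")
        return
    for path in matches:
        print(f"Removing {path}")
        os.remove(path)
    print("Done. Reboot the machine normally.")

if __name__ == "__main__":
    remove_faulty_channel_files()
```

Simple as that looks, it had to be carried out on every affected machine, one at a time, often from a recovery console with no network access, which is exactly why recovery took days instead of minutes.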
After what happened, the company conducted a “preliminary post incident review” to understand the root cause. The primary issue was traced back to configuration file errors and logic issues within the update. These errors caused critical system conflicts that eventually led to the widespread Blue Screen of Death (BSOD) errors that Windows users experienced.
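To make that a little more concrete, here’s a deliberately simplified, purely hypothetical illustration of the general failure class: parsing logic that trusts the layout of a content file, and a content file that doesn’t match it. The field names, format, and values below are made up and are not CrowdStrike’s actual code or file format:

```python
# Toy illustration only: the field names, file format, and values below are
# invented. This is not CrowdStrike's code, just the general class of bug.

def parse_entry(raw_line: str) -> dict:
    fields = raw_line.strip().split(",")
    # The logic flaw: field positions are assumed rather than validated,
    # so a content file with fewer fields than expected reads out of bounds.
    return {
        "rule_id":  fields[0],
        "pattern":  fields[1],
        "action":   fields[2],
        "severity": fields[3],
        "target":   fields[4],
    }

good_entry = "1001,suspicious.exe,block,high,process"
bad_entry = "1002,suspicious.dll,block"  # malformed: only 3 of the 5 expected fields

print(parse_entry(good_entry))  # parses as intended
try:
    parse_entry(bad_entry)
except IndexError as err:
    print(f"Malformed content broke the parser: {err}")
```

In a user-space script that’s a catchable exception. In a kernel-mode security sensor, the same mismatch between a content file and the code that reads it takes the whole operating system down with it, which is why the symptom was a BSOD rather than an error message.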
So what exactly went wrong? The faulty update contained “problematic content data” that handled certain Windows system calls incorrectly. To add fuel to the fire, this misconfiguration caused the update to conflict with Windows Defender’s antimalware service executable (MsMpEng.exe), an important system process. Once the update was deployed, the BSOD was triggered and rendered Windows computers inoperable all over the world.
But it doesn’t end there. The update also contained logic errors in its installation script. These led to improper handling of memory allocation during the update process, which made the system crashes even worse. Specifically, the installation script failed to correctly release memory resources, causing a memory leak that overwhelmed system resources and led to crashes.
During the initial investigation, cybersecurity professionals discovered that the update hadn’t gone through enough testing across diverse environments. It was this lack of thorough testing that let the configuration errors and logic flaws go unnoticed until the update was deployed.
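So what does “enough testing” look like? One widely used safeguard, and one CrowdStrike has since committed to, is a staged (or canary) rollout: ship the update to a small ring of machines first, watch the crash telemetry, and only promote it to wider rings if the canaries stay healthy. The sketch below is purely conceptual; the ring names, sizes, and thresholds are assumptions, not CrowdStrike’s actual pipeline:

```python
import random

# Hypothetical deployment rings, smallest first. Names, sizes, and thresholds
# are illustrative assumptions, not CrowdStrike's actual pipeline.
RINGS = [("canary", 100), ("early_adopters", 5_000), ("broad", 200_000)]
CRASH_THRESHOLD = 0.001  # halt the rollout if more than 0.1% of a ring crashes

def deploy_and_measure(ring_name: str, host_count: int) -> float:
    """Stand-in for 'push the update to this ring and read back crash telemetry'.
    Simulated here with a small random baseline crash rate."""
    crashes = sum(random.random() < 0.0002 for _ in range(host_count))
    return crashes / host_count

def staged_rollout() -> None:
    for ring_name, host_count in RINGS:
        crash_rate = deploy_and_measure(ring_name, host_count)
        if crash_rate > CRASH_THRESHOLD:
            print(f"Halting rollout: {crash_rate:.2%} crash rate in '{ring_name}'")
            return  # roll back before the update ever reaches the wider rings
        print(f"'{ring_name}' healthy ({crash_rate:.2%}); promoting to the next ring")
    print("Rollout complete across all rings")

if __name__ == "__main__":
    staged_rollout()
```

Had even a tiny canary ring sat between “works in the lab” and “pushed to every customer at once,” the bad update would likely have crashed a handful of test machines instead of millions of production ones.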
This incident, which involved CrowdStrike and Microsoft, affected an estimated 2 million businesses and countless users across different regions and sectors. According to Microsoft, the outage hit 8.5 million Windows devices. The scale of the disruption shows how dangerous the interconnectedness of today’s IT infrastructure can be, and how important shared responsibility between software providers and end users is. Both parties need to do their part to ensure security and respond quickly to security incidents. In short, mutual accountability.
Different regions and sectors experienced varying levels of disruption, and this isn’t the first time a single point of failure has rippled across the globe. Here are a few earlier incidents worth remembering:
The 2017 WannaCry ransomware attack was a huge one, particularly in the healthcare sector. The attack exploited a vulnerability in the Windows OS that Microsoft had patched two months prior, but many organizations had not applied the patch, which left them exposed. The NHS in the UK was severely affected, with numerous hospitals and GP surgeries unable to access patient data, leading to canceled appointments and delayed treatments. The estimated damage was around $4 billion. So this is what happens when patch management isn’t timely, huh?
In November 2020, Amazon Web Services (AWS) experienced a major outage that affected many high-profile clients like Roku, Adobe, and The Washington Post. The problem was a failure in the Kinesis data processing service, which overloaded the system. This time the single point of failure was AWS itself, and the outage made many realize how dependent companies are on cloud infrastructure. It caused major financial losses for those who were affected.
Facebook, Instagram, and WhatsApp went offline for almost 6 hours in October 2021. The outage was caused by a configuration change to the backbone routers that coordinate network traffic, which disrupted communication between Facebook’s data centers and interrupted services. This time the failure was Facebook’s own, and billions of users worldwide were affected.
In the aftermath of the “largest IT outage in history”, CrowdStrike took steps to address the problem and mitigate its impact on affected organizations. Here’s an overview of what CrowdStrike did in the week after the incident:
CrowdStrike offered a $10 UberEats gift card to affected customers as an apology for the inconvenience caused by the incident, a gesture meant to acknowledge the extra work and disruption it caused their partners.
Because of the CrowdStrike incident, the entire cybersecurity industry has been reminded to reevaluate how updates and patch management are handled. Organizations, and even us individuals, are now more aware of the risks that come with software updates. For sure, this incident will prompt many to invest in more comprehensive risk assessment frameworks, with regular audits, simulations, and stress tests to find the vulnerabilities that might get exploited.
Here at AppSecEngineer, we encourage everybody, every company, to also invest in your people. We all know the statistics on human error. Think about it: if your organization nurtures a security-first culture, the likelihood of incidents like the one CrowdStrike just had drops significantly. Proactive security, that’s it.
Let me ask you a very important question: Do you think July 19 will become the International Blue Screen of Death Day?