CrowdStrike’s Lesson: 3 Ways to Minimize IT Risk (In Case the World Goes Offline Again)

Related Solutions: Cybersecurity Solutions, Technology Services, Risk Advisory Services

August 9, 2024

Contributors: Jacob Harrand, Senior Technology Solutions and Risk Management advisor

As the world recovers from the recent global IT outage caused by the faulty CrowdStrike update, we at Rehmann believe it’s worthwhile to break down what happened and why and, most importantly, to explain how your organization can prepare for a similar event.

What Happened & Why

As you’re likely aware, on Friday, July 19, 2024, CrowdStrike pushed an update to its Falcon Sensor for Windows. The Falcon sensor has two primary ways of maintaining detections for known threats: sensor content and rapid-response content.

Sensor content is released on a regular cadence and is bundled with the Falcon Sensor update itself. Rapid-response content is used for behavioral pattern matching, a way of identifying potential security threats by analyzing the behavior of programs and users rather than relying solely on known threat signatures.

The rapid response content pushed on July 19 was a channel file, a type of update file that security software uses to enhance its ability to detect and respond to new and emerging threats.

The Falcon Sensor’s kernel driver, which is a core part of the software, loads these channel files when the computer starts up, ensuring the sensor has the latest threat detections running at a deeper level of the Windows operating system and making it harder for attackers to hide.

However, a problem with the channel file caused the kernel driver to fail in a way it could not recover from, triggering the infamous Blue Screen of Death (BSOD). As a result, computers around the world entered a “boot loop,” repeatedly trying and failing to start up and eventually landing on the Windows Recovery screen.

What is less widely known: This is not the first time this year that CrowdStrike has caused computers to crash. In June 2024, CrowdStrike’s Falcon sensor for the open-source operating system Linux (specifically the Red Hat, Rocky Linux, and Debian “flavors” of Linux) caused a similar issue, triggering the Linux equivalent of a BSOD for its users.

Why CrowdStrike’s Misstep Matters — Even for Non-CrowdStrike Users

At this point you might be thinking, “My company doesn’t use CrowdStrike. How does this affect me?”

The incident at CrowdStrike highlights a vulnerability in the way modern Endpoint Detection and Response (EDR) tools work. Modern EDRs rely on being connected to their vendor’s cloud to instantly receive new detections as new threats arise. They also start components of themselves before the rest of the operating system. Because of this combination, any modern EDR could cause a widespread IT outage.

The Windows and Linux outages caused by CrowdStrike are not the first widespread global outages caused by a security software company. On April 21, 2010, McAfee released a virus definition update for its antivirus (AV) product that caused millions of computers around the world to blue-screen, just like this recent event. (An interesting tidbit: The CTO at the time of the McAfee event is the current CrowdStrike CEO. Due to the loss of confidence in McAfee following its global outage, the company was forced to sell and was bought out by Intel.)

All told, these outages reveal a worrisome truth: Unless we as an industry massively change the way we secure the Windows operating system, any modern EDR could expose us to massive outages, because all of them boot components of themselves before the rest of the operating system.

What You Can Do

No matter what operating system your organization relies on, there are proactive moves your organization can make to not only prepare for a situation like the one that befell CrowdStrike users but also reduce the risk of being part of one. Here, our top three recommendations:

1. Conduct a Risk Assessment

The ideal starting place when doing any risk management is a risk assessment — either part of an IT risk assessment or your organization’s overarching risk assessment. If risks are not identified and rated based on an organization’s unique environment, technologies, people, and processes, then controls to mitigate or remediate the risks cannot be designed and implemented.
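One common way to rate identified risks is a simple likelihood-times-impact score. The sketch below is illustrative only: the 1-to-5 scales, the `Risk` class, and the sample scores are assumptions made for this example, not a Rehmann methodology, and any real rating should reflect your organization’s own environment.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (minor) to 5 (severe)

    @property
    def rating(self) -> int:
        # Simple likelihood x impact score; higher means more urgent.
        return self.likelihood * self.impact


def rank_risks(risks):
    """Return risks ordered from highest to lowest rating."""
    return sorted(risks, key=lambda r: r.rating, reverse=True)


# Hypothetical entries with made-up scores, for illustration only.
risks = [
    Risk("EDR update causes boot loop", likelihood=2, impact=5),
    Risk("BitLocker keys unreachable while DCs are down", likelihood=2, impact=4),
    Risk("Backup job fails silently", likelihood=3, impact=4),
]

for risk in rank_risks(risks):
    print(f"{risk.rating:>2}  {risk.name}")
```

The ranked list gives a starting order for designing controls; it does not replace judgment about which risks are acceptable.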

There are a few unique risks that Rehmann believes too many risk assessments don’t — but should — consider. While the following situations may not apply to every organization, we recommend you consider and explore these as part of any risk assessment your organization employs. They may lead to identification of risks unique to your organization.

Suppose a similar situation arises where there is a need to boot into safe mode to modify a system driver. If an on-premises Local Administrator Password Solution (LAPS) is in place, how would the organization gain access should the domain controllers (DCs) go down as well? With on-premises LAPS, the domain controllers store the password.

If BitLocker is enabled on machines that are not hybrid- or cloud-joined, the BitLocker keys are stored on the DCs. If the DCs are down due to the same issue, how do you gain access to the BitLocker keys?

If you have remote employees, how do you recover their machines if a similar fix requires modifying protected files before the organization’s remote management toolset can be used?

Have you talked with your EDR vendors to understand how they test their deployment? An example from Fortinet can be seen here.

In addition to the above risks, there should be a re-evaluation of the below risks with the additional context of the CrowdStrike incident.

What percentage of line-of-business applications are self-hosted versus hosted by a third party?

What resources are backed up and at what frequency? Are systems that have unique backup methods, such as SQL servers, backed up properly?

How often are backups tested? How are backups tested?

Are backup jobs monitored for failure?

How many critical business processes can be done via a manual work process? For any processes that cannot be done manually, how long can the business operate without their completion?
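The backup questions above, particularly whether backup jobs are monitored for failure, can be partly automated. The following is a minimal sketch, assuming job results are available as simple records; the field names (`name`, `status`, `finished`), the sample jobs, and the 26-hour threshold are hypothetical and would need to be mapped to whatever your backup product actually reports.

```python
from datetime import datetime, timedelta


def find_backup_problems(jobs, now, max_age):
    """Return (job name, reason) pairs for failed or stale backup jobs."""
    problems = []
    for job in jobs:
        if job["status"] != "success":
            problems.append((job["name"], "last run failed"))
        elif now - job["finished"] > max_age:
            # A job that has not succeeded recently may have stopped running.
            problems.append((job["name"], "last success is too old"))
    return problems


# Hypothetical job records for illustration only.
now = datetime(2024, 7, 19, 8, 0)
jobs = [
    {"name": "file-server", "status": "success", "finished": datetime(2024, 7, 19, 7, 0)},
    {"name": "sql-server", "status": "failed", "finished": datetime(2024, 7, 19, 2, 0)},
    {"name": "erp-app", "status": "success", "finished": datetime(2024, 7, 16, 2, 0)},
]

problems = find_backup_problems(jobs, now, max_age=timedelta(hours=26))
for name, reason in problems:
    print(f"{name}: {reason}")
```

Checking for stale successes as well as outright failures matters because a job that silently stops running never reports a failure.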

2. Create a Business Continuity Plan

Once your organization’s risk assessment has been updated, your business continuity plan (BCP) can be updated to include a strategy for continuing operations should those risks occur.

Manual Work Processes

Most business operations today utilize some sort of computing power. As seen with Delta Air Lines, when IT systems go down, everything can come to a screeching halt. One option an organization has to mitigate the impact of an IT system outage is to develop robust manual work processes for its business-critical operations. An example of this is having a binder full of preprinted forms at the ready for manually recording data that can later be entered in an organization’s ERP. Another example: Having documentation of how to run credit cards with stand-alone credit card readers that can be tied out to a transaction after the fact.

Backups

Discussions with peers in the wider IT industry surfaced stories where, in some instances, it was quicker to restore certain servers from backup than to wait for the root cause and fix to be known. Some organizations were able to roll back to a known “good” backup state from the night before and resume operations on those servers.

Scenarios like this are a prime example of why a robust backup strategy with regular and realistic testing of backups is needed. Rehmann recommends that backup schedules be based on the window of time for which data is deemed “good.” For example, an application server that is used to generate delivery routes the night before might only need to be backed up right before and after the route generation is done, whereas the primary file server for an organization might need to have incremental backups taken every hour if there is data critical to the continued operation of the business.
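The idea of matching backup schedules to the window for which data is “good” can be sanity-checked with simple arithmetic: in the worst case, a failure just before the next backup loses everything since the last one, so the backup interval should not exceed the data loss the business can tolerate. The sketch below assumes each system has a known backup interval and a tolerable loss window; the system names and values are hypothetical.

```python
from datetime import timedelta


def schedule_gaps(schedules):
    """Return systems whose backup interval exceeds their tolerable data loss.

    schedules maps a system name to (backup_interval, tolerable_loss).
    The returned dict maps each offending system to the size of the gap.
    """
    gaps = {}
    for name, (interval, tolerable_loss) in schedules.items():
        # Worst case, all data written since the last backup is lost.
        if interval > tolerable_loss:
            gaps[name] = interval - tolerable_loss
    return gaps


# Hypothetical systems and windows for illustration only.
schedules = {
    "file-server": (timedelta(hours=1), timedelta(hours=1)),
    "route-planner": (timedelta(hours=24), timedelta(hours=24)),
    "erp-database": (timedelta(hours=24), timedelta(hours=4)),
}

gaps = schedule_gaps(schedules)
for name, gap in gaps.items():
    print(f"{name}: backup interval exceeds tolerable loss by {gap}")
```

Any system that appears in the result is a candidate for more frequent backups or for a documented decision to accept the larger loss window.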

3. Implement (and Test) a Disaster Recovery Plan

In addition to updating the BCP, an organization should update its disaster recovery (DR) plan with processes that, in the case of a similar IT outage, will quickly enable return to normal operations.

IT planning/capabilities

While it is not feasible to put every single recovery procedure in the DR plan, it is recommended to have departments build out a listing of critical IT systems and applications and prioritize the order in which they need to be brought back online. IT should then design procedures for returning the systems to operational status.

As discussed earlier, existing controls in the environment, such as BitLocker or LAPS, can act as speed bumps to recovery. We recommend you proactively seek to identify these speed bumps and design, test, and implement procedures to either bypass or fix them in the recovery process.
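A prioritized listing of critical systems becomes more useful when it also records dependencies, since a high-priority application cannot come back before the systems it relies on (for example, domain controllers). The following is a rough sketch using Python’s standard `graphlib` module (Python 3.9 or later); the system names, priority numbers, and dependencies are hypothetical, and the approach is one possible way to derive an order, not a prescribed method.

```python
from graphlib import TopologicalSorter


def recovery_order(systems):
    """Order systems for recovery, respecting dependencies.

    systems maps a name to (priority, [dependencies]); a lower priority
    number means more critical. Among systems whose dependencies are
    already restored, the most critical one is brought back first.
    """
    sorter = TopologicalSorter({name: deps for name, (_, deps) in systems.items()})
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies

    def sort_key(name):
        # Dependencies not listed in systems get the lowest priority.
        return (systems.get(name, (float("inf"), []))[0], name)

    order, ready = [], []
    while sorter.is_active():
        ready.extend(sorter.get_ready())
        ready.sort(key=sort_key)
        name = ready.pop(0)
        order.append(name)
        sorter.done(name)  # may make dependent systems ready
    return order


# Hypothetical inventory for illustration only.
systems = {
    "domain-controller": (1, []),
    "erp": (1, ["sql-server", "domain-controller"]),
    "sql-server": (2, ["domain-controller"]),
    "file-server": (3, ["domain-controller"]),
}

print(recovery_order(systems))
```

In this example the ERP is restored ahead of the lower-priority file server, but only after the SQL server and domain controller it depends on are back.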

Access to additional resources

The second way an organization can prepare for a similar IT outage is to have some sort of preplanned and agreed upon/contracted surge IT capabilities. Just like organizations might have agreements for third-party DR sites or incident-response firms on retainer, organizations should consider having a source of additional IT staffing. There are several ways this can be done, most commonly through a contract with or retainer for a managed service provider. In a DR scenario, the organization can then call upon the managed service provider to provide additional staff, capabilities, or geographic reach, accelerating the effort and reducing the amount of time it takes to return to normal business operations.

For millions around the world, affected or not, the CrowdStrike incident was no doubt scary and stressful. But it also serves as a critical learning opportunity: No matter the size of an organization’s IT infrastructure, system failures are a risk to all. While none of us can yet control or prevent a worldwide outage, it is within our power to create a plan, protocols, and processes to reduce the impact of one.

Here at Rehmann, we’re committed to not only dissecting the root causes of such events but also equipping you with the knowledge to fortify your systems against future incidents or threats. Remember, you’re not navigating this alone; our team is ready to answer your questions, assess your organization’s risk, or tailor whatever level of IT support your organization or your current IT team needs. Click here to reach out, get more free resources, or learn what Rehmann Technology Solutions can do for you.