CrowdStrike Aftermath: Lessons Learned for Future Recovery
Plenty of finger-pointing is underway in the wake of the outage that brought down 8.5 million Windows machines worldwide, but the recovery process is far more complex than a reboot.
What was learned in the wake of the Blue Screen Apocalypse -- the largest IT outage, so far, in history?
Last week, 8.5 million Windows devices worldwide, including computers critical to major airlines, went down due to a bad update. While many organizations have recovered, IT leadership has its fair share to think about going forward.
Details continue to emerge about the mass outage attributed largely to an update of CrowdStrike’s Falcon Sensor software that went awry -- with some fingers pointing at tech policies from the European Commission as well as a few supernatural conspiracy theories.
Bigfoot and the Loch Ness Monster are not to blame here, and the path forward will not be found by reading tea leaves.
What comes next for organizations affected by the outage may be more complex than a patch or reboot. Delta Air Lines remained hobbled by the outage some five days after the initial incident, with the US Department of Transportation launching an investigation into the airline’s continued flight disruptions and Delta’s handling of customer service.
CrowdStrike has provided remediation steps and an apology from CEO George Kurtz, with updates on the incident posted to an information hub. Finding a path to recovery may go hand-in-hand with preparing for possible future outages, as rare as the CrowdStrike issue was.
How We Got Here
Eric Grenier, director analyst with Gartner, says an unusual confluence of factors led to the outage. “This is worlds colliding basically,” he says. “The reason why the impact is so large is because Windows is the most popular operating system in the world. CrowdStrike is one of the most widely used endpoint security tools. So, when CrowdStrike has a problem with a bad update, the impact is large.”
CrowdStrike’s Falcon Sensor software is meant to detect and block threats on users’ systems, which requires access to the kernel to fulfill its functions. As widespread as last week’s outage was, Grenier says other software vendors should assess their quality assurance procedures and workflows as well. “CrowdStrike is not the first one to ever send out a bad update, and I could almost say with some certainty they’re probably not going to be the last ones.”
What further compounded the problem, Grenier says, was that the fix required manual remediation. “We needed hands on the keyboard. There’s currently no remote remediation option, and then you needed elevated privileges depending on how you were doing this.”
Some organizations needed additional assistance getting back on their feet. Grenier says, “Just to make the problem even worse, if you had full disk encryption, which I’d argue most enterprises are using, you needed recovery keys for that encryption, whether it was through BitLocker, or another full disk encryption vendor and organizations may not have been prepared for all of that.”
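The published fix was mechanical but hands-on: boot each affected machine into Safe Mode or the Windows Recovery Environment (unlocking the disk first with its BitLocker recovery key, retrievable for example via `manage-bde -protectors -get C:` or from Active Directory escrow), delete the faulty channel file matching C-00000291*.sys from the CrowdStrike driver directory, and reboot. As an illustration only, a hypothetical helper that performs the file-matching step might look like this (the function name, directory argument, and dry-run flag are assumptions for the sketch, not part of CrowdStrike's guidance):

```python
import glob
import os

# Pattern of the faulty Falcon Sensor channel file named in CrowdStrike's
# published remediation steps; on a real machine the directory would be
# %WINDIR%\System32\drivers\CrowdStrike.
BAD_CHANNEL_PATTERN = "C-00000291*.sys"

def remove_bad_channel_files(driver_dir, dry_run=True):
    """Return matching channel files; delete them unless dry_run is False."""
    matches = glob.glob(os.path.join(driver_dir, BAD_CHANNEL_PATTERN))
    if not dry_run:
        for path in matches:
            os.remove(path)
    return matches
```

On an encrypted drive, nothing like this can run until the volume is unlocked, which is why recovery-key readiness mattered so much to how quickly organizations got back on their feet.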
Grenier says that a lesson for organizations may be that their business continuity planning should be reviewed to ensure it is up to date, valid, and stress-tested. And while enterprises explore what they might have done differently, there are still more details to discover with this incident. “I think we need to wait to see from CrowdStrike a full root cause analysis as far as what the problem actually was,” he says. “We could speculate … but I think it would be a disservice to start talking about it without them really giving that root cause analysis.”
Not Just a CrowdStrike Concern
CrowdStrike may be center stage for the outage, but a bad update from other providers with the same level of kernel access might have a comparable impact. “It could happen to any security provider simply because of the architecture of Windows itself,” says John Raven, managing director of Microsoft cloud transformation at TEKsystems. “The only way to operate CrowdStrike and those types of security toolchains is to have that type of privileged access. A kernel driver has intimate access to the system’s innermost workings, so when it goes sideways, it has a problem.”
Raven says Microsoft tried to do the right thing many years ago by abstracting the kernel away from third parties but was blocked by regulatory agencies. “They were about to API everything and enforce everyone to go through a security API tier instead, but it was deemed as anti-competitive to smaller security firms,” he says.
Microsoft cast blame for the outage on the European Commission, citing the 2009 agreement that required the company to grant kernel access to third-party security providers. That agreement was meant to open up competition to other companies though Microsoft offers its own Windows Defender security alternative.
Raven notes that Apple users did not suffer the outage, though there is a Falcon Sensor framework for that operating system, because Apple deprecated kernel extensions. “They’ve put an abstraction layer in front of that, called system extensions,” he says. “They did that specifically for a new security framework, which is precisely what Microsoft tried to do years ago, but for some reason, Apple got away with it.”
Still Moving Fast, Still Breaking Things
The lean mindset in IT -- avoiding any waste of resources or time on the assumption that everything will run smoothly -- could have contributed to the CrowdStrike outage, says Subodha Kumar, the Paul R. Anderson distinguished chair professor of statistics, operations, and data science at Temple University. “We cannot live in a world of just cutting down on the cost and just relying that everything will work fine,” he says. “We have to include these kinds of things in our processes, and we need to have a plan around it.”
Kumar is also the founding director of the Center for Business Analytics and Disruptive Technologies at Temple University’s Fox School of Business. He says that despite the existence of automated mechanisms that could roll back bad updates, many companies sidestep such resources because they can require a lot of space. Kumar also says organizations should update or invest more in monitoring tools, potentially bringing AI-based tools into the mix, to detect such issues. “Most importantly, we need to have redundant systems so that we can switch to it very fast,” he says, which he admits can be very costly.
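The rollback idea Kumar describes is often paired with staged "canary" rollouts: push an update to a small ring of machines first, check their health, and automatically roll back rather than updating the whole fleet at once. The following is an illustrative sketch of that pattern under assumed names and ring sizes -- not CrowdStrike's or any vendor's actual deployment pipeline:

```python
# Illustrative canary rollout: update hosts in expanding rings, health-check
# after each ring, and roll every updated host back if anything looks wrong.
def staged_rollout(hosts, apply_update, health_check, rollback,
                   rings=(0.01, 0.10, 1.0)):
    """Update hosts ring by ring; abort and roll back on any health failure."""
    updated = []
    done = 0
    for fraction in rings:
        target = int(len(hosts) * fraction)
        for host in hosts[done:target]:
            apply_update(host)
            updated.append(host)
        if not all(health_check(h) for h in updated):
            for host in updated:
                rollback(host)  # undo everything pushed so far
            return "rolled_back", updated
        done = target
    return "deployed", updated
```

The design trade-off is exactly the one Glazier describes below: each ring adds delay before the full fleet is protected, which is acceptable for routine updates but painful when the update is blocking an active threat.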
The urgent rush to find answers, blame, and ways to recover from the outage has led to a bit of ambulance chasing, says Will Glazier, director of threat research with API security and bot management company Cequence. He says that though CrowdStrike focuses on endpoint security and his company covers the protection of APIs on the network, he looked for parallels. “Where all the reliable lessons learned are happening is around deployment of rules, updates, signatures, the kind of stuff that we have to do in security to keep pace with the bad guys. But obviously, something didn’t quite go right there in that whole process,” Glazier says.
There has been speculation that the demand to be nimble on deployment contributed to the outage, pushing the update out before the eventual errors were caught. The pressure of immediacy in tech is nothing new -- it echoes back to the “Move Fast and Break Things” era, which allegedly ended.
“I feel like it’s the eternal struggle and you’re damned if you do, damned if you don’t,” Glazier says. “If CrowdStrike bogs themself down with process and they don’t provide reliable, timely protection to a threat and their customers get exploited -- then they’re having a different conversation with their clients, where people are still probably chasing them for damages. The same kind of pain is happening, but it’s from a false negative perspective of, ‘You know, our system didn’t catch what it was meant to catch.’”