It smells like Failure and CrowdStrike

On July 19th, 2024, there was an unexpected impact to all sectors of the economy.
I know this firsthand: I was on call when it happened, which put me on the front line working to clean up the mess it caused in our Citrix environment. It made for a crazy morning. I was bleary-eyed from a late-night downtime that had pushed me to a 22-hour day, and I got only a 2-hour nap before our help desk called me about the issue at 4 AM.
By then it had been determined that CrowdStrike was causing the issue. What was the issue? CrowdStrike released a standard update to Falcon Sensor, its endpoint security agent, that wreaked havoc across three continents, hitting the US, Australia and the UK, by causing Microsoft Windows 10 and 11 PCs to blue-screen. This had an immediate and cascading effect: airlines, shipping ports, government agencies, banks, emergency call centers and hospitals slowly ground to a halt, and businesses were unable to process transactions or take online sales.
We were busy, with multiple tracks of work running at the same time. We were analyzing the problem, determining how many business systems (i.e., backend applications) were affected, and trying to determine how many of our end users’ desktops were impacted.
We had immediate triage going on, as well as health checks of systems to validate they had not been impacted.
All this happened while members of various teams struggled to access the environment, since their PCs and laptops were also among the casualties of the incident.
It took several hours to understand the scope of the problem, but once the decision was made to use our well-thought-out DR process, we were able to implement it rather quickly. We returned our users to production by the end of the day, an achievement that shows how all teams pulled together to address this unprecedented problem. It also shows how critical the investments made in our DR environment and procedures can be in a crisis.
I’ve had a couple days to recover, and I have a fresh cup of coffee and I’m thinking about the incident and what the fallout might be.
The Who and What
Let’s start with who this company is, for those who might not know. CrowdStrike Holdings, Inc. is an American cybersecurity technology company based in Austin, Texas. It provides cloud workload protection and endpoint security, threat intelligence, and cyberattack response services.
It helped the US Department of Justice charge five Chinese military hackers with economic cyber espionage against U.S. corporations, and uncovered the activities of a group connected to Russia’s Federal Security Service that conducted intelligence operations against global targets, primarily in the energy sector.
CrowdStrike went public on the Nasdaq in June 2019 and joined the S&P 500 index in June 2024.
CrowdStrike’s stock price followed an inverse curve, going lower as more and more recognizable brands and companies reported outages. Microsoft reported that 8.5 million devices were affected. Microsoft has released a USB recovery tool that helps recover impacted machines quickly via a bootable USB drive. (Machines protected by BitLocker need additional handling of the BitLocker key.)
The obvious logistical problem in this work-from-home world is: how do we get a qualified person in front of the PC to perform the corrective action?
Companies far and wide are scrambling to do just that.
QA Sucks …
This shows how large a market share both Microsoft and CrowdStrike command, and that is the problem.
CrowdStrike has become a single point of failure for the economy, capable of quickly degrading all services that it touches.
The balance between being agile to identify and close potential cyber attack vectors and quality control is tough for any organization. Yet that is what is required, and perhaps what customers should demand.
“This incident appears to be a severe failure of quality control, not a malicious act,” cybersecurity strategist and former FBI counterintelligence official Eric O’Neill said of Friday’s paralysis.
Quality Assurance isn’t sexy; you can’t sell QA. Its very purpose is to slow things down and create repeatable testing procedures that should ensure bad code doesn’t make it out the door. It is a cost center, and it is one of the first areas examined when cuts need to be made.
In 2023, an employee reported a round of layoffs of 200 people at CrowdStrike, with an RTO (Return to Office) policy cited as the reason. The layoffs included engineers, devs and QA testers.
A Glassdoor post from September 2023 sums up the work culture as:
The focus on “world domination” has created a noticeable difference in morale and business operations. People and process first culture has taken a backseat to being the biggest, fastest and strongest security provider. More burnout, more patchy solutions to quickly solve problems that could have been superbly implemented with proper communication and expectations set on timing.
A follow-up post from January 2024 listed the following reasons for the missteps the company was suffering from:
* Absolutely toxic senior leadership; no emotional IQ, no communication, immature “leadership”
* The culture can best be described as: old boys club, bro, and/or a fraternity
* Re-org after re-org with no clear direction beyond the bottom line
* CAO (Counter Adversary Operations) makes absolutely no sense and is confusing internally and externally
* Rolling layoffs with zero transparency and blatant lies (lies about why people were laid off, firings disguised as layoffs, false promises that “this is the last one”) these layoffs deemed critical functions “redundant,” which led to critical work being piled on top of burned out, understaffed, and underpaid folks left behind
These and other posts paint a picture of a company that shifted the responsibility for QA from a dedicated team to the engineers writing the code, with inadequate testing of the full package before deploying it to the world. That leads to the issue we all experienced on Friday.
Hindsight is 20/20, and this is just my opinion, but all the signs show that the leadership lost sight of its purpose. The company grew too fast, and the C-suite lacked the experience required to see the cliff it was heading for. This is a management problem that goes to the core of the business and its culture.
They slashed the QA safety net, and they shuffled the deck chairs on the ship enough that the employees felt lost and without clear direction.
We can’t and shouldn’t blindly trust any third-party company. This is gut check time. A 3 billion dollar (2023 revenue) company took shortcuts to save 50 cents, and the impact is still being assessed. It’s time that some safeguards be put in place. Everyone who uses this software should be looking for more control over when and how these updates are deployed.
The use of automated processes to push these updates to critical systems needs to be re-evaluated, and those impacted should be looking at performing their own Quality Assurance before deploying these updates to their production environments.
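To make the idea of doing your own QA a little more concrete, here is a minimal sketch of a ring-based (staged) rollout gate in Python. It assumes your tooling actually lets you control when an update is applied, which is not a given for every vendor or update type. The `Ring` class, the `staged_rollout` function, and the `apply_update` and `is_healthy` callables are all hypothetical names made up for this illustration; none of them come from CrowdStrike, Microsoft, or any other vendor API.

```python
from dataclasses import dataclass


@dataclass
class Ring:
    """A hypothetical deployment ring: a named group of hosts."""
    name: str
    hosts: list[str]
    max_failure_rate: float = 0.0  # tolerated fraction of unhealthy hosts


def staged_rollout(rings, apply_update, is_healthy):
    """Apply an update ring by ring, halting at the first unhealthy ring.

    apply_update(host) and is_healthy(host) are caller-supplied callables
    (assumptions for this sketch, not part of any vendor API).
    Returns (names of rings that passed, name of the ring that halted
    the rollout, or None if every ring passed).
    """
    completed = []
    for ring in rings:
        # Push the update to every host in the current ring only.
        for host in ring.hosts:
            apply_update(host)
        # Check health before touching the next, larger ring.
        failures = sum(1 for host in ring.hosts if not is_healthy(host))
        rate = failures / len(ring.hosts) if ring.hosts else 0.0
        if rate > ring.max_failure_rate:
            return completed, ring.name
        completed.append(ring.name)
    return completed, None
```

You would order the rings from least to most critical, for example a lab ring, then a small pilot group, then production. If any host in an early ring fails its health check (say, it stops responding after the update), the function returns before the update is ever applied to the later rings, so the blast radius stays small.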
Mistakes were made …
Yet as employees stress about the layoffs that will surely come, let’s not forget that CEO George Kurtz, with his 46 million dollar salary package (a 237:1 ratio when compared to the median employee pay), created and promoted this frat bro culture. The stockholders should take a look at the leadership that was responsible for the decisions that lowered the stock price, and work to restructure the company into one that can better manage the responsibility it now finds itself holding.
As its stock price declines and the lawsuits pile up, CrowdStrike has proven it is too big to fail, and that means it is very likely that the taxpayers will bear the burden for this failure.
Coffee cup runneth dry… I’m out of coffee, and it’s time for some more.
Next week will be interesting, and maybe we will hear more about CrowdStrike in the coming days.
Happy Sunday!