The Microsoft and CrowdStrike issues of the last few days will be reviewed, discussed and dissected for years, or at least until the next global IT meltdown. Expect Computer Science educators to teach the causes and impacts of these events, and doctoral candidates to build their research around them. C-00000291*.sys has locked its place in history.


Amid all the talk about the details, one overarching realisation has emerged: 

If this is what can happen by accident, imagine what could happen if it were intentional!  

The CrowdStrike outage and updates so far  

Whilst the global outage of Microsoft Windows systems has not directly affected Lumify Group (there are no blue screens on the services we look after internally), it has clearly disrupted third-party services and applications.  

It has also affected our ICT training schedule. We have had several course attendance cancellations as IT professionals have been recalled to the office to focus on restoring their services to normal.   

We are talking about blue-chip businesses, Government bodies and small to medium-sized businesses alike. This is a technology challenge that doesn’t discriminate.  

In the Philippines, the CrowdStrike issues have impacted organisations ranging from airlines like Cebu Pacific and AirAsia to companies listed on the Philippine Stock Exchange, SMEs and even personal users. 

Globally, we’ve read about the domino effect on the consumers of the impacted services. People have been stranded in airports or have been unable to pay for their shopping.

But now, let’s also consider the impact on those tasked with restoring regular service: the network and server administrators who had no idea how quickly their Friday afternoon would escalate and be re-prioritised. 

And of course, somewhere, someone is waking up this morning knowing they developed the very 41KB update that caused this, or that they pressed “Publish”. Putting ourselves in their shoes, it could have been any of us. 

Unfortunately for them, it wasn’t a quick reboot. Microsoft and CrowdStrike published workarounds quickly, suggesting deleting the offending C-00000291*.sys file or editing the registry. It sounds simple, until you realise you have no idea where your BitLocker recovery key is, and your network admin is on leave.
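For teams scripting the check across machines they can still reach, a minimal sketch of that file-matching step might look like the snippet below. It is purely illustrative: the driver directory is an assumed default, the script only lists matches rather than deleting anything, and the official vendor guidance (typically carried out from Safe Mode or the recovery environment, BitLocker recovery key in hand) should always take precedence.

```python
# Minimal, illustrative sketch only: list (do not delete) channel files matching
# the pattern named in the published workaround. The driver directory below is
# an assumed default; confirm paths and steps against official vendor guidance.
from pathlib import Path

DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")  # assumed default location


def find_suspect_channel_files(driver_dir: Path = DRIVER_DIR) -> list[Path]:
    """Return any files matching C-00000291*.sys in the given directory."""
    if not driver_dir.exists():
        return []
    return sorted(driver_dir.glob("C-00000291*.sys"))


if __name__ == "__main__":
    matches = find_suspect_channel_files()
    if not matches:
        print("No matching channel files found.")
    for path in matches:
        print(f"Matched channel file: {path}")
```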

Clearly, skills and competency play a part here. This is true for both the vendor and our own organisations.  

The challenge is this: 

Imagine a petrol company contracting a third party to develop a fuel additive that ultimately took every Toyota off the road. It would be impossible for drivers to mitigate or prevent it, short of not filling up at all. 

There must be trust in the process in place at the fuel company. The same is true for CrowdStrike and other vendors; we must trust their processes and testing. We are operating in good faith.  

Last week, as we've seen, a process failed. We have unintentionally given developers the power to hold the balance between order and chaos. A 41KB file caused planes to divert, ships to go unloaded and groceries to be left in the trolley at the checkout.  

A grounding in DevOps and security frameworks, principles and practices is now, more than ever before, a MUST-HAVE. It should be non-negotiable that software developers can deliver solid, robust and reliable solutions in the future of technology. 

What is the next priority as the fog clears around the CrowdStrike outage and services return to normal? 

Before anything else, make sure the team understands what happened. Get clarity on questions like "What is CrowdStrike?" and "Was it a hacking incident?" For the record, CrowdStrike Holdings, Inc. is a cyber security technology company that provides endpoint security, threat intelligence and cyber attack response services. And no, it was not a hacking incident, but an update gone awry.  


Once that has been settled, the team at Lumify Group (especially our IT staff and technical instructors) recommends that you review your approach to incident management and DevOps. 

It is also time to consider your approach to project management. Are you using the right methodology? AgilePM or PRINCE2 might be a better fit. Alternatively, you can explore implementing Six Sigma, which targets an extremely low defect rate of roughly 3.4 defects per million opportunities.  
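To make that target concrete, here is a small, purely illustrative calculation of defects per million opportunities (DPMO), the metric behind the Six Sigma goal; the release and defect figures below are invented for the example.

```python
# Illustrative only: defects per million opportunities (DPMO), the metric
# behind the Six Sigma target of roughly 3.4 DPMO. Example figures are invented.

def dpmo(defects: int, units: int, opportunities_per_unit: int) -> float:
    """DPMO = defects / (units * opportunities per unit) * 1,000,000."""
    return defects / (units * opportunities_per_unit) * 1_000_000

# e.g. 7 defects found across 500 releases, each with 40 checked opportunities
print(dpmo(defects=7, units=500, opportunities_per_unit=40))  # 350.0 DPMO
```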

This CrowdStrike outage may be a wake-up call to examine competencies and skills. Organisations in ANZ can use the SFIA Framework for free. But that isn’t a five-minute fix; it’s a big-picture strategy. It’s also risk mitigation, preventing your organisation from “doing a CrowdStrike”.  

You may have already noticed that scams have stepped up a gear; we’ve seen an increase in phishing emails and very professional-looking notifications that our streaming service needs to be updated, all of them fake. 

With skills and competencies in mind, and once services are restored, we suggest you consider the following steps:  

  1. Today, determine whether your cyber security end-user training is up to date, meaning that at least 85% of staff have completed it. If it is, consider products such as Cofense PhishMe as part of the approach: a SaaS platform that immerses users in realistic, real-world phishing simulations. 

  2. Examine whether DevOps is done by accident or by design in your business. There are several best practice bodies you can look to for guidance, including PeopleCert, the organisation behind the Information Technology Infrastructure Library (ITIL). PeopleCert also owns the DevOps Institute. 

  3. Did your service delivery and incident management approach stand up to Friday's challenges? You did have an incident response plan, didn’t you? We assume and hope that you did. Many of our clients in APAC take ITIL training at the Foundation level. With the CrowdStrike outage, is it time to expand that profile to advanced ITIL training? 

  4. Do you have an up-to-date skills and competency matrix? Now is a good time to perform an audit within the team. 

  5. Review your information security systems and ISO 27001 certification. Lumify became an ISO 27001-certified organisation last year. It was a long and painful process but immensely valuable. We now have a trained ISO/IEC 27001 Lead Auditor on the team, someone who is arguably tougher than the official auditor.   

  6. Remember that human error is inevitable, but we can reduce its frequency through upskilling. Most breaches (68%) involve a non-malicious human element, a person making an error or falling prey to a social engineering attack, according to Verizon Business's 17th annual Data Breach Investigations Report (DBIR). 

  7. Perform a post-mortem review for your business. Gather the department together to share feedback on questions such as: How prepared were you to deal with this? Had you run through your incident response plan, with everyone knowing their responsibilities? Or perhaps this event is a call to develop and practise your plans more regularly. 

The Lumify IT team will be reviewing these points internally too. But for now, we are off to find our BitLocker recovery keys.  

We hope your reboot is quick so that normal service in your business is restored.  

Good luck. 


