Quality Assurance in AI Development: Lessons from the CrowdStrike IT Outage
By Assaf Melochna, President and Co-Founder of Aquant
The recent global software outage caused by a faulty update from CrowdStrike disrupted several critical sectors, highlighting concerns about the robustness of digital infrastructure. Although the incident was originally thought to be a data security issue, one cybersecurity expert recently clarified that it resulted from a breach in the software supply chain rather than a traditional cybersecurity breach. Potential issues in the new software version were not caught before release, and the update caused significant outages across various IT infrastructures, affecting financial services, airports, and supply chains. The expert emphasized that the root cause was the failure of assurance processes to identify the risk associated with the update, underscoring the need for systematic risk mitigation and thorough testing.
The Need for Enterprise-Grade AI Systems
Many companies are attempting to build their own AI applications internally, driven by the promise of customized solutions tailored to their specific needs. However, the CrowdStrike incident underscores how easily a bug in the system can cause massive disruption. Building AI applications in-house can be fraught with risks if not backed by rigorous quality assurance processes and enterprise-level standards. This incident happened to a company considered enterprise-grade, so imagine the level of disruption if everyone started building AI applications in-house without the stringent oversight typically found in enterprises.
An enterprise-grade AI system is designed with the robustness, security, and scalability necessary to support mission-critical operations. These systems undergo extensive testing and validation, adhere to stringent compliance standards, and incorporate best practices for risk mitigation. Without these enterprise-grade assurances, companies may find themselves vulnerable to issues similar to CrowdStrike’s, potentially facing widespread outages. For example, Microsoft opened the floodgates for anyone to create their own “copilots.” However, according to security researcher Michael Bargury, a former senior security architect in Microsoft’s Azure Security CTO office, bots created or modified with the service are not secure by default and could potentially lead to serious security vulnerabilities.
As more organizations venture into developing their own AI solutions using these open platforms, the risk of encountering issues like those experienced by CrowdStrike increases significantly. Ensuring that AI systems are built and maintained with enterprise-grade quality assurance processes is essential to avoid such pitfalls. Companies must recognize the importance of investing in robust QA frameworks and partnering with experts who can provide the necessary oversight and support.
Technology is rapidly evolving, and given the relentless pace of change, incidents like this are not entirely surprising. The outage has served as a lesson for other developers by highlighting the critical importance of rigorous quality assurance (QA) processes before deploying any update. Here, I’ll share best practices that AI developers and their customers should review before deploying updates.
Comprehensive Testing and Validation Procedures
In this particular case, the main problem was in the software supply chain, where the update failed to meet stringent security and quality standards. A flaw in the update, not caught during the quality assurance process, had a massive ripple effect and led to significant consequences globally. Not every flaw will have consequences on this scale. For example, an undetected bug in a widely used application could cause minor data corruption, inconveniencing users and eroding trust. While something like this may not be as consequential, not catching these flaws during the quality assurance phase can still impact customers and the reputation of the vendor.
Before any rollout or update is deployed, it must undergo comprehensive testing and validation by the stakeholders involved. This includes:
- Unit Testing: Ensuring each component functions correctly in isolation.
- Integration Testing: Verifying that different components work together as intended.
- System Testing: Checking the entire system’s functionality and performance under real-world conditions.
- User Acceptance Testing (UAT): Getting end-users to test the system to ensure it meets their needs and expectations.
Rigorous testing helps ensure that every aspect of AI solutions performs reliably. This approach helps identify and rectify issues at the earliest stages, minimizing the risk of widespread disruptions.
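To make the first two levels concrete, here is a minimal sketch in Python using the standard unittest module. The functions validate_update and apply_update are hypothetical stand-ins for components of an update pipeline; they are assumptions for illustration and do not describe any vendor's actual code. The unit tests check one component in isolation, while the integration test checks that the two components cooperate to reject a malformed update.

```python
import unittest


def validate_update(update: dict) -> bool:
    """Hypothetical component: accept only updates with a version and a non-empty payload."""
    required = {"version", "payload"}
    return required.issubset(update) and bool(update["payload"])


def apply_update(config: dict, update: dict) -> dict:
    """Hypothetical component: apply an update only if it passes validation."""
    if not validate_update(update):
        raise ValueError("update failed validation")
    # Return a new dict so the running configuration is never modified in place.
    return {**config, "version": update["version"], "payload": update["payload"]}


class UnitTests(unittest.TestCase):
    """Each test exercises validate_update on its own."""

    def test_rejects_empty_payload(self):
        self.assertFalse(validate_update({"version": "2.0", "payload": ""}))

    def test_accepts_well_formed_update(self):
        self.assertTrue(validate_update({"version": "2.0", "payload": "rules"}))


class IntegrationTests(unittest.TestCase):
    """Checks that validation and application work together."""

    def test_malformed_update_leaves_config_untouched(self):
        config = {"version": "1.0", "payload": "old"}
        with self.assertRaises(ValueError):
            apply_update(config, {"version": "2.0"})  # payload is missing
        self.assertEqual(config, {"version": "1.0", "payload": "old"})
```

Saved as a file, these tests can be run with `python -m unittest`. System testing and UAT do not reduce to a snippet this small, since they depend on a realistic environment and on real users.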
Phased Rollout
A phased rollout strategy can significantly mitigate the impact of potential issues. Instead of deploying an update across the entire system at once, a phased approach involves:
- Staged Deployment: Rolling out the update in stages, starting with a small subset of users.
- Monitoring and Feedback: Continuously monitoring the performance and gathering feedback at each stage.
- Gradual Scaling: Gradually expanding the deployment as confidence in the update’s stability grows.
This method ensures that any problems can be detected and addressed on a smaller scale before they affect the entire user base. For instance, if CrowdStrike had implemented a phased rollout, the impact of the faulty update might have been contained and addressed more swiftly.
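The staged approach described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of how any particular vendor deploys updates: the stage fractions, the error-rate limit, and the names in_rollout and next_stage are all hypothetical. Each device is hashed into a stable bucket between 0 and 1, so a device included at the 1% stage stays included as the rollout widens.

```python
import hashlib

# Assumed values for illustration: share of devices at each stage, and the
# highest error rate tolerated before the rollout is halted.
STAGES = [0.01, 0.10, 0.50, 1.00]
MAX_ERROR_RATE = 0.02


def in_rollout(device_id: str, fraction: float) -> bool:
    """Return True if this device falls inside the current rollout fraction."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a stable value in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction


def next_stage(current: int, error_rate: float) -> int:
    """Return the index of the next stage, or -1 to signal a rollback."""
    if error_rate > MAX_ERROR_RATE:
        return -1
    # Stay at the final stage once the rollout is complete.
    return min(current + 1, len(STAGES) - 1)
```

A deployment loop would call in_rollout(device_id, STAGES[stage]) to decide who receives the update, observe the error rate for a while, and then call next_stage to decide whether to widen the rollout or roll it back.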
Regular Feedback Loops and Robust Monitoring Systems
Regular feedback loops and robust monitoring systems are essential for catching issues early. These practices involve:
- Continuous Monitoring: Implementing real-time monitoring tools to track system performance and identify anomalies.
- User Feedback Mechanisms: Encouraging users to report issues and providing easy channels for them to do so.
- Iterative Improvements: Using the data and feedback to make iterative improvements and quickly address any emerging issues.
By establishing these feedback loops and monitoring systems, AI developers can quickly detect and resolve problems, minimizing downtime and maintaining user trust. Robust monitoring could have alerted the team to the issue before it escalated into a global outage.
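As a rough illustration of continuous monitoring, the Python sketch below tracks the error rate over a sliding window of recent events and signals when it crosses a threshold. The class name ErrorRateMonitor and its default values are assumptions chosen for this example; a production system would typically rely on a dedicated monitoring platform rather than hand-rolled code.

```python
from collections import deque


class ErrorRateMonitor:
    """Hypothetical monitor: alert when recent failures exceed a threshold."""

    def __init__(self, window: int = 100, threshold: float = 0.05, min_samples: int = 20):
        # deque with maxlen drops the oldest event automatically.
        self.events = deque(maxlen=window)
        self.threshold = threshold
        self.min_samples = min_samples

    def error_rate(self) -> float:
        """Fraction of failed events in the current window."""
        if not self.events:
            return 0.0
        failures = sum(1 for ok in self.events if not ok)
        return failures / len(self.events)

    def record(self, ok: bool) -> bool:
        """Record one outcome and return True if an alert should fire."""
        self.events.append(ok)
        # Require a minimum number of samples so one early failure
        # does not trigger an alert on its own.
        enough_data = len(self.events) >= self.min_samples
        return enough_data and self.error_rate() > self.threshold
```

Paired with a phased rollout, an alert from a monitor like this is the signal to pause or roll back before the update reaches the wider user base.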
Moving forward, developers can learn from this mistake. Organizations should adopt a proactive approach, ensuring that updates are tested in controlled environments before full deployment. Rigorous quality assurance processes are non-negotiable for AI developers and their customers. Comprehensive testing and validation, phased rollouts, and continuous feedback and monitoring are critical strategies to ensure smooth and reliable updates.
By adhering to these best practices, the highest level of service and reliability can be delivered to customers. As previously mentioned, issues like the recent software outage are somewhat inevitable given the rapid evolution and complexity of digital infrastructure. However, with robust QA processes in place, the impact of such incidents can be significantly minimized, thereby maintaining customer trust and safeguarding the vendor’s reputation. Embracing these practices is crucial for navigating the ever-changing digital environment and delivering seamless, reliable experiences to users worldwide.
About Assaf:
Assaf Melochna combines strong leadership skills with a solid technical foundation. He is an expert in service, with business and technical expertise in enterprise software. Assaf started Aquant with his co-founder, Shahar, with the vision of helping service companies transform the way they deliver service, and he serves as the company’s president.