
For most of its history, software testing has rested on a simple assumption: that the same input, given to the same system, will produce the same output. Defects were things you could find, reproduce, fix, and confirm fixed. A test either passed or it didn’t. Quality was a measurable state. That assumption no longer holds.
Today’s AI-infused systems, whether inside a business intelligence pipeline, a customer-facing agent, or a decisioning engine, don’t behave that way. Ask the same model the same question twice and the answer might differ. Retrain it next month and the behaviour shifts. “Works on my machine” has become “worked in this particular probability distribution, at that exact moment, for that user.” This is not a minor adjustment. It is a change in what testing is for.
I’ve spent my career leading quality functions through shifts like this, and the same lesson keeps landing: the testing instincts that got us here are still valuable, but they are operating on a different kind of thing now.
The implications run deeper than tooling.
LLMs generate test cases competently, fix flaky suites, and summarise failures. The cost of producing any given test is falling towards zero. But the more interesting shift is above the tool layer.
The thing being tested has changed. A probabilistic model has a distribution of acceptable outputs, a distribution of unacceptable ones, and a fuzzy border between them. “Did it pass?” becomes “how often, under what conditions, and how do we feel about the failure modes?” Assertions give way to evaluations.
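To make that concrete, here is a minimal sketch in Python of what the shift looks like in practice. The names are hypothetical (call_model stands in for the real, non-deterministic system; the threshold is illustrative): instead of asserting that one call returns the expected answer, we measure how often calls are acceptable and compare that rate against an agreed threshold.

```python
import random

def call_model(prompt: str) -> str:
    # Hypothetical stand-in for the real, non-deterministic system under test.
    return random.choice(["a valid answer", "a refusal", "an answer with an invented detail"])

def pass_rate(prompt: str, is_acceptable, n: int = 100) -> float:
    # Run the same prompt n times and measure how often the output is acceptable.
    return sum(is_acceptable(call_model(prompt)) for _ in range(n)) / n

# Instead of: assert call_model(question) == expected_answer
rate = pass_rate("Summarise this policy document.",
                 is_acceptable=lambda out: "invented" not in out)
assert rate >= 0.95, f"Pass rate {rate:.0%} is below the agreed threshold"
```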
The surface area has exploded. An AI-infused product doesn’t have a finite number of screens and flows. Exhaustive testing was always an illusion, but it was a useful one. Now we must admit what we always knew: we are sampling the behaviour of a system we cannot fully enumerate. And the failure modes that matter are rarely evenly distributed. A system will often behave correctly in 90% of conditions and catastrophically in the remaining 10%: one edge case, one data regime, one kind of user. A pass/fail test against a green build would call that a pass. Production will not.
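One way to surface that unevenness, sketched below with entirely hypothetical per-case results, is to report pass rates per segment or data regime alongside the aggregate, so that a catastrophic slice cannot hide inside a healthy-looking overall number.

```python
from collections import defaultdict

# Hypothetical per-case results, tagged with the data regime each case exercises.
results = (
    [{"segment": "typical", "passed": True}] * 90
    + [{"segment": "edge: no purchase history", "passed": False}] * 10
)

by_segment = defaultdict(list)
for case in results:
    by_segment[case["segment"]].append(case["passed"])

overall = sum(case["passed"] for case in results) / len(results)
print(f"overall: {overall:.0%} pass")  # 90%: looks healthy on a dashboard
for segment, outcomes in by_segment.items():
    rate = sum(outcomes) / len(outcomes)
    print(f"{segment}: {rate:.0%} pass over {len(outcomes)} cases")  # the 0% slice is the story
```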
The failure modes are new. Traditional defects are local: a bug sits in a line of code. AI defects are distributed: a stale training set, a missing document in retrieval, a prompt that was clever six months ago and is now being exploited. Root cause analysis becomes root system analysis. The QA function must start thinking more like an epidemiologist than a detective.
What humans must still own
It is tempting to imagine that AI will test AI and humans can step back. That is backwards. The mechanical parts of testing are precisely where AI provides the greatest leverage. What AI cannot do, and will not do well any time soon, is tell you what “good enough” means for your business.
Somebody has to decide what correct is. Is a 3% false positive rate acceptable when the consequence is a rejected mortgage application? Is this summary misleading, or merely terse? These questions don’t fall out of a confusion matrix. They require domain knowledge, ethical clarity, and commercial context.
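As an illustration only, with entirely made-up figures: the moment you try to answer that question numerically, the business and ethical judgements (what each kind of error actually costs) have to be supplied from outside the model.

```python
# Entirely hypothetical figures, to show where the judgement lives.
monthly_applications = 50_000
false_positive_rate = 0.03          # creditworthy applicants wrongly rejected
false_negative_rate = 0.01          # risky applications wrongly approved

# These two numbers are not statistics; they are commercial and ethical judgements.
cost_of_false_positive = 400.0      # complaint handling, lost customer, reputational harm
cost_of_false_negative = 12_000.0   # expected credit loss

expected_monthly_cost = monthly_applications * (
    false_positive_rate * cost_of_false_positive
    + false_negative_rate * cost_of_false_negative
)
print(f"expected monthly cost of errors: {expected_monthly_cost:,.0f}")
```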
The QA leader’s role is moving in that direction. Less gatekeeping at the end of delivery, more shaping the criteria by which the organisation understands its own systems. Quality is becoming a governance discipline as much as an engineering one, and that is a promotion, not a demotion.
The best-run AI deployments I have seen share a feature that is easy to miss: some of their most valuable quality controls are not tests at all. They are kill switches, rate limits, escalation thresholds, circuit breakers that fire when behaviour exceeds a defined envelope. They are human judgements about what you are willing to let the system do on its worst day, encoded as constraints.
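A minimal sketch of that idea, with made-up thresholds: a circuit breaker that watches an error rate over a sliding window and trips when behaviour leaves the agreed envelope, regardless of what the test suite says.

```python
class CircuitBreaker:
    """Minimal sketch: trip when behaviour leaves a defined envelope,
    independently of whether the test suite is green."""

    def __init__(self, max_error_rate: float, window: int):
        self.max_error_rate = max_error_rate
        self.window = window
        self.outcomes: list[bool] = []
        self.open = False  # open = stop sending traffic to the model

    def record(self, ok: bool) -> None:
        self.outcomes.append(ok)
        recent = self.outcomes[-self.window:]
        if len(recent) == self.window:
            error_rate = 1 - sum(recent) / self.window
            if error_rate > self.max_error_rate:
                self.open = True  # escalate to a human, fall back, or kill-switch

# The thresholds encode a human judgement about the system's worst acceptable day.
breaker = CircuitBreaker(max_error_rate=0.10, window=50)
```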
And who tests the tools?
There is a problem underneath all of this that the industry has not yet taken seriously. If your test cases are generated by an LLM, your failures triaged by an LLM, your evals scored by an LLM, then a regression in that LLM produces a regression in your quality signal. The suite keeps running. The dashboard stays green. The first you hear about it is from an incident in production, which is exactly what testing was supposed to prevent.
This is different from “test frameworks have bugs too”. A deterministic tool fails loudly or not at all. A probabilistic tool is more likely to degrade in the same way the systems it is testing degrade: gradually, unevenly, and quietly. Left unexamined, this is quality theatre dressed as progress, and it is the single biggest risk I see in how organisations are adopting AI in their testing pipelines. Partial answers exist (hold critical evals to deterministic scoring, keep a hand-curated reference corpus, treat the testing tool itself as a system under test), but a human still has to look at the results, periodically and seriously.
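One way to treat the testing tool itself as a system under test, sketched below with a hypothetical llm_judge and a tiny illustrative corpus: score the LLM judge deterministically against hand-curated cases whose verdicts a human has already agreed, so drift in the judge shows up as a falling agreement rate rather than a quietly green dashboard.

```python
# Hand-curated reference cases with verdicts a human has already agreed.
reference_corpus = [
    {"output": "The refund was issued on 3 May, as recorded in the account notes.", "verdict": "pass"},
    {"output": "The refund was probably issued at some point last week.", "verdict": "fail"},
]

def llm_judge(output: str) -> str:
    # Placeholder for the LLM-based scorer whose own drift we are checking.
    return "fail" if "probably" in output else "pass"

agreement = sum(
    llm_judge(case["output"]) == case["verdict"] for case in reference_corpus
) / len(reference_corpus)

# A deterministic check on the probabilistic checker: if agreement falls, the
# quality signal itself has degraded, whatever the main dashboard says.
assert agreement >= 0.9, f"Judge agreement {agreement:.0%}: the testing tool has drifted"
```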
The skills to retain, and the ones to build
The fundamentals of testing (risk-based thinking, professional scepticism, a nose for edge cases, the instinct to ask “what happens if…?”) are more valuable now, not less. That imagination is the scarce resource. What needs to be added is a different literacy: probabilistic thinking (distributions, confidence, drift, what a statistic actually tells you), data literacy (an AI system is only as good as the data it sits on), eval design (measurement frameworks for behaviours without a single right answer), and a working fluency in how modern AI systems fail (hallucination, prompt injection, context poisoning, emergent tool use), because you cannot test for what you cannot name.
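As one small example of the probabilistic literacy that list asks for: a 92% observed pass rate over 50 samples is a much weaker claim than the same rate over 5,000, and a confidence interval (here the standard Wilson score interval, with illustrative numbers) makes that explicit.

```python
import math

def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    # 95% Wilson score interval for an observed pass rate of passes/n.
    p = passes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - margin, centre + margin

print(wilson_interval(46, 50))      # wide: roughly (0.81, 0.97)
print(wilson_interval(4600, 5000))  # narrow: roughly (0.91, 0.93)
```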
Quality Assurance’s moment
The organisations navigating this well have stopped treating quality as the thing you do before release and started treating it as a live signal about how their systems are behaving. Testing is not a phase; it is an ongoing instrumentation of reality. Humans are in the loop where judgement matters, and out of it where it doesn’t.
This is not a smaller role for QA. It is a much larger one. The machines are getting smarter, and the judgement layer is getting more important, not less. That is where the QA profession belongs.