LLMs may only occasionally match the best performers on your team, but they can be relied upon to make mistakes at least as often as the most junior members, or indeed your customers. Having access to cheap, scalable Stupidity as a Service is surprisingly useful, allowing you to write new forms of test for your documentation, APIs and schemas. I call this Cognitive Fuzzing and it helps identify places where requirements, signatures or behaviours are unclear.
It’s fair to argue that LLMs don’t think like humans (or at all), and so the kinds of mistakes they make might not be the same as those committed by humans. That said, in my experience there is a definite overlap: what’s good for the robot goose is good for the gander. And either way, you’re directly minimising the cognitive overhead for Model Context Protocol clients, which I suspect will become table-stakes for most APIs in the future.
A cognitive fuzz test is a functional test in the usual Given/When/Then style. The givens and thens match your usual test suite, but the whens are implemented by calling an LLM, which is given access to your documentation and schemas in some form. This can be a very loose workflow where you provide tools to the LLM via MCP and just give it the top-level use case to fulfil. Or it can be more targeted, like asking the LLM to formulate a database or GraphQL query, and checking the results.
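To make that concrete, here is a minimal sketch of a targeted cognitive fuzz test, written in a pytest style. The schema, documentation, model name and helper function are all placeholders rather than anything from a real system, and the model is reached through an OpenAI-style client; any LLM client would do.

```python
import sqlite3
from openai import OpenAI

# Illustrative stand-ins: this schema, documentation and model are assumptions
# for the sake of the example, not a real system.
SCHEMA = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                     placed_at TEXT, total_pence INTEGER);
"""
DOCS = ("total_pence is the order total in pence, including VAT. "
        "placed_at is an ISO-8601 timestamp in UTC.")

client = OpenAI()

def ask_llm_for_sql(question: str) -> str:
    """The 'when': give the model only what a newcomer would get, schema plus docs."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Write a single SQLite query that answers the user's question.\n"
                f"Schema:\n{SCHEMA}\nDocumentation:\n{DOCS}\n"
                "Reply with SQL only, no commentary.")},
            {"role": "user", "content": question},
        ],
    )
    sql = response.choices[0].message.content.strip()
    # Crude cleanup in case the model wraps its answer in a code fence.
    return sql.removeprefix("```sql").removeprefix("```").removesuffix("```").strip()

def test_total_order_value_for_a_customer():
    # Given: a database seeded with known data
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.execute("INSERT INTO customers VALUES (1, 'Acme')")
    db.execute("INSERT INTO orders VALUES (1, 1, '2024-01-05T10:00:00Z', 1200)")
    db.execute("INSERT INTO orders VALUES (2, 1, '2024-02-01T09:30:00Z', 800)")

    # When: the LLM formulates the query from the schema and documentation alone
    sql = ask_llm_for_sql("What is the total value of Acme's orders, in pence?")

    # Then: executing that query gives the expected answer
    assert db.execute(sql).fetchone()[0] == 2000
```

If the model can't reliably get from the documentation to the right query, a new joiner probably can't either.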
These tests are non-deterministic, and so you’re likely to see regular failures. If some use cases consistently fail, you have some evidence that the cognitive overhead of interacting with that part of the system is too high.
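One way to turn that noise into signal is to run each fuzz case several times and track its pass rate, flagging the use cases that fall below a threshold. The harness and threshold below are hypothetical, just to show the shape of the idea:

```python
def pass_rate(case, runs: int = 5) -> float:
    """Run a zero-argument test callable several times; it should raise on failure."""
    passes = 0
    for _ in range(runs):
        try:
            case()
            passes += 1
        except AssertionError:
            pass
    return passes / runs

def flag_high_overhead(cases: dict, runs: int = 5, threshold: float = 0.6) -> dict:
    """Return the use cases whose pass rate falls below the threshold."""
    rates = {name: pass_rate(case, runs) for name, case in cases.items()}
    return {name: rate for name, rate in rates.items() if rate < threshold}

# e.g. flag_high_overhead({"order value by customer": test_total_order_value_for_a_customer})
```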
I first realised this was a useful way of working during a failed attempt to build an AI assistant over some quite complex data. This was a fairly straightforward Natural Language to SQL (NL2SQL) system: the LLM was given the schema information and a customer’s question, asked to translate it into SQL, and the results were checked before a response was returned to the client. The workflow was implemented in DSPy, which offers various forms of optimisation for the prompts in an agentic workflow, but ultimately the system became more and more complex, with increasing amounts of pleading and scolding to nudge the LLM in the right direction.
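For a flavour of the shape of that workflow, here is a heavily stripped-down DSPy sketch; the signature, field names, schema text and model are illustrative placeholders, not the real thing.

```python
import dspy

# Placeholder model; the real system had its own configuration and optimisation.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class NL2SQL(dspy.Signature):
    """Translate a customer question into a single SQL query for the given schema."""
    schema: str = dspy.InputField(desc="CREATE TABLE statements plus column documentation")
    question: str = dspy.InputField(desc="the customer's natural-language question")
    sql: str = dspy.OutputField(desc="one valid SQL statement, nothing else")

generate_sql = dspy.ChainOfThought(NL2SQL)

prediction = generate_sql(
    schema="CREATE TABLE orders (id INTEGER, customer_id INTEGER, total_pence INTEGER); ...",
    question="Which customer spent the most last month?",
)
print(prediction.sql)  # the generated query, which is then executed and checked
```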
The correct answer here is the same as when you find yourself lamenting the apparent stupidity of one of your users: your system is too complex and your documentation isn’t good enough. I came to realise that everywhere the LLM was struggling, deep down I was already uncomfortable with some of the ways the database schema was structured and documented. Real customers had struggled with similar issues in the past because of poorly chosen names.
And so, the issues here became functional tests. Where possible, we have been writing simplified views with new documentation to make some workflows easier, and checking that the LLM picks them up and delivers better results. Not only will this benefit future AI projects, it will almost certainly make customers’ lives easier as we simplify language, make things more explicit, and fill in missing documentation.
Jakob Nielsen once wrote about company taglines:
“look at how you present the company in the main copy on the home page. Rewrite the text to say exactly the opposite. Would any company ever say that? If not, you're not saying much with your copy, either.”
I’ve started to see codebases with truly excellent, in-depth information about their maintenance, architecture and style in an AGENTS.md. And yet if I look in the actual human README.md there’s often only perfunctory information about getting the code up and running. I’ve come to believe that a version of Nielsen’s test applies here: if you’re writing something in your AGENTS.md and you wouldn’t tell a human contributor the opposite, then the guidance isn’t specific to the robots; it’s very likely useful documentation for any contributor to the codebase, and should just be part of your central documentation. If you find yourself having to press the LLM to get certain things right, this is valuable information about a part of your codebase where humans might also benefit from more documentation or a refactor.
With the news that 95% of AI initiatives in enterprises are arguably failing, I believe the approach of exploiting LLMs for Stupidity as a Service is an excellent way of salvaging value from projects that have stumbled. You may not be finding that your teams are 10-100x more productive with LLMs in the loop, but you can avoid some of the days when teams or customers achieve 0x because they misunderstand some requirements and have to rework everything.