The Myth of Deterministic Software
Something very uncomfortable about the trend of LLMs replacing deterministic software is the idea that we are trading reliability for power. LLMs as typically deployed are non-deterministic in a strict sense—due to sampling with nonzero temperature, and optimizations for floating-point operations—but more importantly cannot be statically analyzed or tested on all realistic inputs, and therefore their behavior cannot be fully anticipated. This is fine when LLMs are being used to generate software, provided that software is analyzed and tested well, but much more uncomfortable when software starts to integrate LLMs at runtime. I've seen this in a number of instances:
- Prompted classifiers replacing grep-based filters
- Prompts replacing traditional glue code scripts
- Agents with web-fetch tools replacing web scrapers
- UIs deprioritizing explicit controls in favor of natural language interfaces
- Desktop or browser-use agents providing an interface for software instead of building an API
- Dealing with unstructured data instead of agreeing on a data structure, e.g. reconciling receipts with payment records
At first I considered the power-for-reliability trade-off worthwhile, but I've realized that, beyond a certain point of reliability, there is no trade-off at all. Deterministic software is 100% reliable only if you are willing to make certain simplifying assumptions that are never true in 100% of cases:
- The spec for the software matches the user's understanding
- The use cases supported by the software fully cover the user's needs
- Where a human does become involved—and at some point a human must—they make no mistakes
- The interfaces the software interacts with never make breaking changes
- The meaning of the data the software interacts with never changes, or if it does, its settings are updated to fully account for this in real time
On top of this, most deterministic software that tries to grapple with these problems becomes a mess of complexity and edge-case handling, inevitably introducing bugs.
When you properly account for these, I don't think most complex pieces of deterministic software can really be said to be more than 95%[0] reliable in general. Even the best avionics are limited in their ability to compensate for hardware failures, extreme weather conditions, or pilot mistakes. Perhaps something like 99.9% reliability can be achieved in the most critical deployments over moderate timescales, but the last 0.1% is important and unachievable. Because LLM-based software can better address these limitations, it is at least possible in principle to get more reliability from LLM-based than from traditional software.
The same logic applies to security, not just correctness, and I am beginning to see something similar in my own work on prompt injection. If you adopt a strict definition of prompt injection, as I have previously tried to do, it is hard to see how an LLM could achieve 100% reliability, even with reasonable constraints on the attackers[1]. But this is not the relevant framing for users, who want to minimize their overall risk exposure. Real attackers are not limited to prompt injection as I narrowly defined it: social engineering, email compromise, typosquatting, malicious guides, fraudulent representations—an attacker can use all of these to accomplish the same goals. I believe we will soon have models that are sufficiently resistant to the narrow kind of prompt injection that it will no longer be worth attempting against applications that properly isolate untrusted input, and attackers will resort to other methods—if they have not already.
To me, this changes how to think about software engineering. We have always been only probabilistically solving problems, and we have gained the ability to trade reliability along some dimensions for more reliability than previously imagined along others. There is much opportunity for this trade to be a positive one.
- ^
Made up number but you get the gist.
- ^
For example, in an earlier post I discuss defenses that rely on attackers being limited in how many optimization steps they can run, and only against a black-box.