What's Next for Prompt Injection
Tomorrow will be my first day at Anthropic, where I'm joining the Safeguards team to work on prompt injection. This is incredibly exciting, but it does mean that I will not necessarily be able to share my work as openly, so I've decided to gather my current thoughts here.
What are we even trying to do?
The first question in any work on LLMs is what sort of artifact we want to evaluate—models alone, or embedded in specific pipeline or agent harnesses? Since models have no native trusted/untrusted content distinction[0], prompt injection doesn't really exist without specifying at least some template for including untrusted content. In my experience, the specifics of a pipeline can make a significant difference, and this is magnified in multi-turn agentic contexts since the untrusted content has a lot of influence on the length and direction of the agent trajectories. Nevertheless, results are typically reported on a per-model basis, with evaluation harnesses and environments treated as an afterthought, which needs to change.
However, as agent complexity and hence trajectory lengths grow, it's unclear to me that it even makes sense to evaluate a model + agent harness as an artifact. The possible parameter space you can cover with an evaluation is too small, and minor changes to the model, harness, environment or even data under analysis could radically change the trajectories after enough iterations. Never mind even radical harnesses like Recursive Language Models (RLMs), where prompts themselves are mutated by the agent. Effectively, complex agents are creating their own pipelines for handling untrusted content, which makes evaluations of individual pipelines nearly useless without some reason to believe these results generalize.
I think the correct way to deal with this is to train models to have strong primitives for handling untrusted content, which can be used by pipeline and agent harness authors including the model itself. This will need to be evaluated in two ways:
- How well the primitive works in different use cases
- How well the model uses the primitive
It is common to introduce a quoting syntax or use a "tool response" or "input"[1] role for untrusted content as a prompt injection defense, but none of these work well as primitives:
- Quoting syntax is learned in-context rather than trained, making it too easy to override
- Tool responses are used for both trusted and untrusted tools, so training to distrust them hurts capabilities, and models are never trained to generate them
- Input turns are also never generated, and are awkward to use for multiple untrusted sources—either the input is combined into a single turn, allowing different sources to tamper with how each other is parsed, or multiple user/input turns pairs are required
The primitive needs to be flexible enough to express what users need—e.g. distinguishing between multiple sources in the same turn—and the model needs to be trained not just to respect it but to use it. This will better support modern harnesses and can create a feedback cycle where the model becomes better at creating synthetic data for its own training.
Filtering vs Alignment Training
Filtering remains the best approach for practical defenses today. It will probably always play some role. However, I am skeptical in filtering alone for several reasons:
- If filtering is done by a model that is significantly different from the inference model, there will inevitably be inputs that the two understand differently and this can be exploited to craft inputs that appear harmless to the filter but induce harmful behavior from the inference model. Thus filters must be relatively thin heads on top of inference models.
- Filtering essentially implies giving up any time an attack is detected. This creates a denial-of-service vector in systems accessing third-party data, and means the system can never run truly autonomously—it will always need a human backup.
A key ability of reasoning models and agents is that they are able to notice and correct mistakes. You can often observe this when you ask reasoning model a difficult question[2]. But filtering offers no way to recover from a mistake, so the probability of a successful attack increases almost linearly with trajectory length[3].
Isolating prompt injection
As we try to think more rigorously about prompt injection, it's necessary to distinguish it from related types of attacks: jailbreaking and social engineering. These can both be used for the same purposes as prompt injection, but are conceptually distinct in their scope of use and I believe work via different mechanisms.
In my mind, prompt injection refers to a transformation an attacker can perform on untrusted input to a system which causes it to be treated as trusted input, and likely works by evading the patterns the model has learned to recognize untrusted input.
Jailbreaking is a much more general attack: a transformation that is targeted at specific post-training behaviors that effectively undoes that post-training for the given input. Prompt injection can be understood as a special case of jailbreaking targeting e.g. OpenAI's Instruction Hierarchy post-training. But generally the term jailbreak is applied to attacks which can undo a broad range of safety-related post-training. I believe these work by driving the model far enough off the distributions of the safety post-training data while remaining close enough to the pre-training and capabilities post-training data to remain coherent and useful.
Social engineering refers to "attacks" that use language for its normal semantic content and which are not intended to interact with safety post-training at all. These are considered attacks because they can induce the model to produce the same sorts of undesirable outputs as prompt injection, but both in intent and mechanism I think these are closer to reasoning mistakes or persona misalignment than to anything typically recognized as an attack.
Is it solvable?
One question David Orr asked me during my interview has been on my mind since—is prompt injection a problem to be solved, or a problem to be managed? Or put another way, do we treat it like a traditional cybersecurity vulnerability or like credit card fraud?
If jailbreaking can't be solved, then trivially neither can the special case of prompt injection. The converse is not trivial—just because we develop a way to prevent undoing specific post-training, doesn't mean the post-training accomplishes what we want. However, I believe that conditional on solving jailbreaking we can prevent prompt injection. This will take a lot of post-training on top of a model capable of fully generalizing it, and may require novel ensemble techniques for robustness, but I think it's tenable to get a similar degree of reliability as we see in e.g. answering grade-school math problems today.
Social engineering on the other hand I think is unsolvable and perhaps undesirable to solve. If a system should never take a given action in response to untrusted input, the system author should not provide that action as an option to the model (perhaps allowing a model to determine what actions are allowable before seeing the untrusted input). In any other case, a successful social engineering attack is really just a case where the model and the system author disagree. This can happen due to failures in reasoning or a persona not suited for the use case, but some disagreement is inevitable. I think these cases need to be evaluated separately and very differently from prompt injection.
The Hard Problem of Adversarial Optimization
It is well known that the hardest attacks to prevent are those created via adversarial optimization—iteratively refining an attack string based on proxy signals for success such as the logprob of a refusal token. There are only a few theoretically possible ways to defend against these, and all of them are daunting:
- Perfect defense, in which case there is no optimal attack to find
- Make any known attack fail so surely that no nonzero signal of success can be observed
- Use a not-even-approximately-differentiable function to detect attacks, so optimization techniques fail
- Be able to monitor all inference and detect and prevent any adversarial optimization attempt before meaningful progress can be made
The first option is unlikely to be possible, as it would basically require the space of prompt injection attacks to be low-dimensional enough to enumerate. The second is a weaker version of the first, but even if possible would be brittle because it tries to hit the moving target of "known" attacks. The third is difficult to even conceive of, because it also means the detection function cannot be trained via traditional ML methods, although I can imagine backing in to a non-differentiable function that approximates a learned detection function, but is not easy to come up with a differentiable function to approximate without access to the original. And the last option is a high-stakes game of whack-a-mole that only sophisticated labs hosting closed-weight models can play.
Realistically, I think the solution will have to consist of a combination of:
- Hardening the model as much as possible against known attacks, including attacks optimized against other models
- Censoring the highest information signals of success, such as token logprobs
- Monitoring and blocking adversarial optimization attempts
Sadly, this precludes the possibility of an open-weight model secure against prompt injections.
Evaluating Success
There are multiple questions we should try to answer for a given artifact, which will help us understand both how safe it is currently and how close we are to solving the prompt injection problem:
- Which currently known attacks work reliably/occasionally against it in simple scenarios
- Which tasks can it perform without increasing its vulnerability to attacks
- How well attacks optimized against other models transfer to it
- How hard is it to optimize an attack against
Questions 1, 3 and 4 are probably best evaluated against the model itself, while question 2 is more of a question for the agent scaffold—or if we follow my suggestion above, a question about how well the model uses its untrusted content primitive.
The first 2 questions can be answered by the kinds of evaluations people have been doing today, subject to the caveat that many of these haven't been done very well[4]. The last 2 are less straightforward.
For transferability there are two challenges: there are many other models an attack could be optimized against, and even fixing a single model the space of optimized attacks is probably vast. Some experimentation would have to be done to see how much the target model matters and how different initial attacks and optimization parameters impact the final optimized attack and its transferability. It is unlikely that any single model would reliably serve as the best target model for optimization in order to transfer attacks to the model being evaluated, but plausible that a small set would form a near Pareto frontier. Experimentation with these models could suggest how to sample a set of initial attacks and optimization parameters to create a reasonable test set.
The difficulty of optimizing attacks is probably best expressed in terms of the minimum number of inference calls required to raise the probability of success (or, if this is too low, whatever proxy is being used for optimization[5]) from a specific initial level to various target levels. The challenge in this sort of evaluation is coming up with the optimal strategy and parameters for optimizing the attack.
Solving prompt injection via the combination I suggested earlier requires performing well on the first three evaluations, and better on the last one than your adversaries are at evading your adversarial optimization detection.
- ^
Except in some cases distinct system/user roles, but the scenario where an attacker has full control over a user turn is called direct prompt injection and is no longer considered very relevant for real-world use cases, at least as a distinct issue from jailbreaking. Instead the field is now focused on indirect prompt injection, where the pipeline or agent harness specifies a way untrusted content is included in a prompt.
- ^
See SecAlign
- ^
Interestingly, I often see models correct themselves a few tokens before they would generate the tokens that indicate the mistake. I believe the models have not only learned multi-token patterns, but have learned during post-training what those multi-token patterns look like and how to act based on them.
- ^
You might note this is the argument LeCunn made against autoregressive LLMs in general, which fails due to the self-correction ability of LLMs, but applies here.
- ^
- ^
For example, Checkpoint-GCG uses probability of success against successively later checkpoints in the alignment training in order to get a stronger signal than against the final aligned model.