My current research interest is in Prompt Injection—the propensity of LLMs to follow instructions in untrusted content. While there has been a lot of good work on prompt injection defenses, a lot of the work that seems promising in literature does not work so well when I attempt to reproduce it and vary the context event slightly, and a lot of the benchmarks make them look a lot better than they really are.

Model hardening techniques such as preference optimization for secure completions (SecAlign) appear to work well on benchmarks, as do the latest Claude and GPT models, but an agentic benchmark I developed shows these models remain vulnerable to more sophisticated attacks such as multi-turn fake completions or in scenarios where attack content is repeated in assistant turns. A full write-up of this benchmark is available as a preprint here.

Filtering models such as Sentinel work well on simple examples, but I am skeptical that filtering alone can solve the problem: a more capable model behind the filter can understand disguised instructions that the filter cannot. Filters are also likely to miss attacks in multi-turn contexts that combine individually innocuous user and assistant turns. Even running a detector continuously on the transcript of a conversation can probably be evaded by prompting for harmful context in assistant turns to be generated in earlier layers but suppressed in the final layers.

I have looked at mechanistic interpretability for prompt injections, which has sort of been tried before (Attention Tracker, but similar to other defenses I have found this does not generalize well) but has not been thoroughly explored. I've written up a purely data-free approach, but this did not work as well as I had hoped.

My current line of research is on training detector heads on top of LLMs to be used in parallel with their decoding heads, thereby eliminating the mismatch between the filter and the model behind it, and which can be targeted at earlier layers in the model which are harder to suppress. I also hope to use these as a base to steer models to safer completions.