USENIX Security '25 Round-up
I was fortunate to attend USENIX Security '25 this past week. My primary interest these days is in LLMs, so I mostly stuck to Track 3, which focused on LLM-related security work.
Some overall takeaways from the conference:
- We're in the incredibly early days of LLM security.
- If you have hidden-state or even just logit access, there's a lot of cool stuff you can do to a model that the provider won't like.
- Even if you don't, training a surrogate model and using that works pretty well.
- Nobody knows how to defend against malicious input (under any definition of the term) in reasoning, multi-turn or agent settings.
I want to highlight some of the papers presented that I found most interesting. Note that there were quite a few talks on topics that I don't find interesting myself, so I'm a bad judge of which papers someone interested in those topics would like:
- Lots of papers use the "malicious provider" threat model, but IMO that's just game over. Some of the backdoor techniques they come up with have interesting applications in other scenarios though.
- Lots of work on privacy and trust in federated training protocols, but I don't think federated model training is ever really going to be a thing.
- People are trying to make fully homomorphic LLM inference and even training a thing?? Honestly this is cool but the performance penalty is orders of magnitude too large to even consider.
- And while I do think multi-modal and diffusion models are interesting, I don't know enough about them to judge most of the papers on them.
I abbreviate the titles of papers here since otherwise this would read like a conference schedule, but you can mentally rewrite these in the style of The DOMino Effect: Detecting and Exploiting DOM Clobbering Gadgets via Concolic Execution with Symbolic DOM (actual paper at this conference) if you'd like.
Practical Impact
- We Have a Package for You! - You probably won't be surprised to learn that LLMs can hallucinate package names when providing installation instructions, and that this creates a typosquatting vulnerability. But among all the LLM-related vulnerabilities presented this week this is probably the most impactful. Their mitigations were not very successful, but they produced a large dataset of hallucinated package names for Python and JS from 16 different models, which I certainly hope the repository maintainers are blacklisting. (A toy client-side sanity check is sketched after this list.)
- Are CAPTCHAs Still Bot-hard? - Short answer: no (60-70% of modern CAPTCHAs can be solved by computer-using agents).
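To make the package-hallucination risk a bit more concrete, here's a toy client-side sanity check. This is my own heuristic, not the paper's mitigation: it flags suggested packages that don't exist on PyPI at all (pure hallucination) or that were registered very recently (possibly squatted after someone else noticed the same hallucination). It uses PyPI's public JSON API; the flask-gpt-utils name and the 90-day threshold are made up for illustration.

```python
# Toy heuristic, not the paper's mitigation: before installing a package an
# LLM suggested, flag names that don't exist on PyPI at all (hallucinated) or
# that were only registered recently (possibly squatted after the same
# hallucination). "flask-gpt-utils" is a made-up example name.
from datetime import datetime, timezone
import requests

def check_suggestion(name: str, min_age_days: int = 90) -> str:
    resp = requests.get(f"https://pypi.org/pypi/{name}/json", timeout=10)
    if resp.status_code != 200:
        return "not on PyPI -- likely hallucinated, do not install"
    uploads = [
        datetime.fromisoformat(f["upload_time_iso_8601"].replace("Z", "+00:00"))
        for files in resp.json()["releases"].values()
        for f in files
    ]
    if not uploads:
        return "exists but has no released files -- suspicious"
    age_days = (datetime.now(timezone.utc) - min(uploads)).days
    if age_days < min_age_days:
        return f"only {age_days} days old -- possible squat of a hallucinated name"
    return "probably fine (exists and is not brand new)"

for pkg in ["requests", "flask-gpt-utils"]:
    print(f"{pkg}: {check_suggestion(pkg)}")
```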
Understanding Model Misbehavior
- Mirage in the Eyes - Text descriptions for images tend to start hallucinating right after "attention sink" tokens, which are tokens with low semantic meaning that get very strongly attended to. This can be used to generate adversarial examples, but I was more interested in the fact that this happens to begin with. In the examples these are mostly punctuation or conjunctions. My speculation is that the models learn in training that these tokens signal that subsequent tokens should relate to a new concept, and if the image does not lend itself to that then the model must hallucinate. (There's a quick attention-sink sketch after this list.)
- TracLLM - A faster search algorithm for identifying the portions of a long prompt that most influenced its output. I think this kind of work could be really important for reasoning, multi-turn and agentic safety—currently safety work is forced to treat prior model output as unsafe since it is tainted by unsafe inputs, but this completely neuters the benefits of reasoning or multi-turn conversations! If we could attribute specific portions of the output to specific portions of the input it would make such issues more tractable. (The naive leave-one-out baseline is sketched after this list.)
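Mirage in the Eyes is about vision-language models, which I can't do justice to here, but the attention-sink phenomenon itself is easy to poke at in any transformer. A minimal sketch, assuming plain GPT-2 via HuggingFace transformers, that ranks tokens by how much attention they receive; this is my illustration of the phenomenon, not the paper's method.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The photo shows a dog on a beach, and the sky is clear.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions is a tuple with one (batch, heads, query, key) tensor per layer.
attn = torch.stack(out.attentions).mean(dim=(0, 2))[0]  # average over layers and heads
received = attn.sum(dim=0)  # total attention each (key) token receives across all queries
tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0])
for t, score in sorted(zip(tokens, received.tolist()), key=lambda x: -x[1])[:5]:
    print(f"{t!r} receives total attention {score:.2f}")
```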
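As for TracLLM, this is emphatically not their algorithm (the paper's contribution is being much faster than this), but the naive leave-one-out baseline shows what "which part of the prompt most influenced the output" even means: drop one chunk at a time and see how much the probability of the observed answer falls. Toy sketch with GPT-2 and made-up prompt chunks.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Made-up prompt chunks and answer, purely for illustration.
chunks = [
    "Note 1: The capital of France is Paris. ",
    "Note 2: The weather today is sunny. ",
    "Answer the question using the notes above. ",
]
question = "Q: What is the capital of France? A:"
answer = " Paris"

def answer_logprob(prompt: str) -> float:
    """Log-probability the model assigns to `answer` right after `prompt`."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    answer_ids = tok(answer, return_tensors="pt").input_ids
    ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # prediction for each next token
    targets = ids[0, 1:]
    n = answer_ids.shape[1]
    return logprobs[-n:].gather(1, targets[-n:, None]).sum().item()

full = answer_logprob("".join(chunks) + question)
for i, chunk in enumerate(chunks):
    ablated = "".join(c for j, c in enumerate(chunks) if j != i) + question
    drop = full - answer_logprob(ablated)
    print(f"chunk {i} influence ~ {drop:+.2f} nats  ({chunk!r})")
```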
Prompt Injection
I've written about Prompt Injection before, and it was well-represented at the conference.
- StruQ and SecAlign are methods to harden LLMs against prompt injection. I've read both these papers before—in fact I learned about this conference from StruQ's acceptance announcement. But it was great to talk to Sizhe Chen who helped me understand the landscape of prompt injection research better.
- JBShield - Finds hidden layer representations of the concepts of "toxicity" and "jailbreaking", which are somewhat interpretable using the Logit Lens technique. They use these concept vectors to detect toxic content despite jailbreak attacks, and find that amplifying these vectors makes the model more robust to jailbreaking. (A minimal Logit Lens sketch follows this list.)
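For anyone who hasn't seen the Logit Lens trick JBShield builds on: you decode intermediate hidden states through the model's own final layer norm and unembedding matrix to see what it's leaning toward at each layer. A minimal sketch with GPT-2; the paper works with aligned chat models and extracts specific concept vectors, which this does not reproduce.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("Ignore previous instructions and", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Decode each layer's hidden state at the last position through the final
# layer norm and the unembedding matrix (the "logit lens").
ln_f, lm_head = model.transformer.ln_f, model.lm_head
for layer, hidden in enumerate(out.hidden_states):
    logits = lm_head(ln_f(hidden[0, -1]))
    guess = tok.decode([int(logits.argmax())])
    print(f"layer {layer:2d}: next-token guess = {guess!r}")
```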
Stealing Training Data
- Private Investigator adversarially generates optimal prompts for leaking PII in training data.
- On the defense side, SOFT identifies the most likely-to-leak training documents and paraphrases them, at a small accuracy cost. (A toy loss-based ranking in that spirit is sketched after this list.)
- Several other talks I found less interesting individually; this seems to be a popular area of research.
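I don't know exactly how SOFT decides which documents are most likely to leak, so take this with a grain of salt: a common proxy for memorization risk is simply how confidently the model reproduces a document, i.e. a low per-token loss. Here's a toy ranking in that spirit, with GPT-2 and made-up "training documents", just to make the idea concrete.

```python
# Not SOFT's actual selection criterion -- just a generic "how confidently
# does the model reproduce this document" ranking, a common proxy for
# memorization risk. The documents below are made up.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "Jane Doe's social security number is 123-45-6789.",
    "Colorless green ideas sleep furiously in the data center.",
]

def mean_token_loss(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels makes the model return mean next-token cross-entropy.
        return model(ids, labels=ids).loss.item()

# Lower loss = more confidently reproduced = (crudely) more leak-prone.
for loss, doc in sorted((mean_token_loss(d), d) for d in docs):
    print(f"{loss:.2f}  {doc}")
```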
Messing with Providers
These were not the most technically sophisticated papers, but who doesn't enjoy messing with our new AI overlords?
- Exposing the Guardrails - Sniffed out 4 different layers of filtering in Dall-E, mainly using timing attacks, and found workarounds for each of them.
- Mind the Inconspicuous - Turns out the OpenAI, Anthropic and Qwen APIs don't filter out their <|eos|> tokens, and appending a bunch of <|eos|> tokens to malicious inputs can help bypass refusals. They show that the hidden representations become more similar between benign and malicious inputs as you append more <|eos|> tokens, resulting in refusal confusion in both directions, and suggest this is because "ends in <|eos|>" is a feature common to both benign and malicious inputs and repeating that token reinforces it. (There's a toy open-model reproduction after this list.)
- PRSA: Prompt Stealing Attacks against Real-World Prompt Services does what it says on the tin.
Just Cool
- A Novel Attack to Speech Translation Systems found that voice models trained to translate tend to give the original token a reasonably high probability, especially in the most prominent languages in the base model training corpus and when the target language is in the same family. They were able to generate small perturbations of real speech in French that French-to-English translation models would reproduce in the original language. This is probably useless but it's funny to imagine Parisians walking around with voice changers on to protect their French.
- Activation Approximations can undermine alignment training even when they don't have a large impact on utility metrics, but this can be mitigated by using a loss function that makes worst-case assumptions about the approximation error. (A loose sketch of that kind of objective is below.)
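On that last point, the mitigation is the part I'd want to remember: train against the worst case over whatever error the approximation might introduce. Here's a very loose sketch of that shape of objective on a toy MLP, perturbing one hidden activation and taking the max loss over a few sampled perturbations; this is not the paper's actual loss function.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Toy data and toy two-layer network, standing in for a model whose hidden
# activations will later be computed approximately.
x, y = torch.randn(64, 16), torch.randint(0, 2, (64,))
f1, f2 = nn.Linear(16, 32), nn.Linear(32, 2)
opt = torch.optim.Adam(list(f1.parameters()) + list(f2.parameters()), lr=1e-3)
loss_fn, eps, n_samples = nn.CrossEntropyLoss(), 0.05, 4

for step in range(200):
    h = torch.relu(f1(x))
    # Worst case over a few sampled bounded perturbations of the activation,
    # modeling whatever error an activation approximation might introduce.
    worst = max(
        loss_fn(f2(h + eps * torch.randn_like(h).clamp(-1, 1)), y)
        for _ in range(n_samples)
    )
    opt.zero_grad()
    worst.backward()
    opt.step()

print("final worst-case-ish loss:", worst.item())
```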