Euan Ong

I've just finished my undergraduate degree in Computer Science at the University of Cambridge (where I ranked 1st / ~120 every year). I'm currently working on interpretability (previously adversarial robustness) at Anthropic.

My ambition is to develop powerful, yet safe and interpretable abstract reasoners, whose internal state and behaviour remain transparent to the end user.

To this end, I'm interested both in pragmatic, scaling-friendly approaches to eliciting latent knowledge from neural networks, and in longer-term bets exploring how the mathematical toolkits we use to understand and structure programs (such as formal methods, types and category theory) could help us discover the right abstractions for reverse-engineering neural networks.

Email  /  CV  /  Google Scholar  /  Twitter  /  LinkedIn  /  Github


Research

So far, my research has broadly focused on studying the behaviour of neural networks in vitro: understanding both how they generalise when learning to perform abstract tasks, and what this tells us about the algorithms they've learned in order to do so.

Previously, I've probed the foundations of neural algorithmic reasoning, explored attacks on vision-language models, and poked language model representations with a stick.

More recently, I've also done some work on adversarial robustness, mostly at Anthropic.

Featured work (reasoning & interpretability)

Probing the Foundations of Neural Algorithmic Reasoning
Euan Ong
Technical Report, 2023; ICML Differentiable Almost Everything (Spotlight), 2024
abstract / full text / project page

I examined a fundamental claim of neural algorithmic reasoning and, through statistically robust ablations, found evidence refuting it. Based on these observations, I developed a way to parallelise differentiable algorithms that preserves their efficiency and correctness guarantees while alleviating their performance bottlenecks. This work formed part of my Bachelor's thesis, which won the CS department's Best Dissertation Award.

Successor Heads: Recurring, Interpretable Attention Heads In The Wild
Rhys Gould, Euan Ong, George Ogden, Arthur Conmy
ICLR, 2024; NeurIPS ATTRIB (Oral), 2023
arXiv / reviews / project page / tweeprint

We discovered successor heads: attention heads, present in a range of LLMs, that increment tokens from ordinal sequences (e.g. numbers, months and days). We isolated a common numeric subspace within embedding space that, for any given token (e.g. 'February'), encodes the index of that token within its ordinal sequence (e.g. months). We also found that numeric token representations can be decomposed into interpretable features representing the value of the token mod 10, which can be used to edit the numeric value of a representation via vector arithmetic.
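
As an illustration of the kind of vector-arithmetic edit this enables, here is a minimal sketch (hypothetical names; not the paper's exact feature-extraction or editing pipeline):

```python
import torch

# Hypothetical setup: `rep` is the representation of a numeric token, and
# `mod10_features[d]` is a learned direction encoding "value ≡ d (mod 10)".
def edit_numeric_value(rep: torch.Tensor,
                       old_digit: int,
                       new_digit: int,
                       mod10_features: torch.Tensor) -> torch.Tensor:
    """Swap the mod-10 component of a numeric token representation."""
    # Remove the feature for the current value mod 10, then add the target one.
    return rep - mod10_features[old_digit] + mod10_features[new_digit]

# e.g. nudging the representation of '14' towards behaving like '15':
# edited = edit_numeric_value(rep_14, old_digit=4, new_digit=5, mod10_features=F)
```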

Learnable Commutative Monoids for Graph Neural Networks
Euan Ong, Petar Veličković
Learning on Graphs Conference, 2022
arXiv / reviews / project page / tweeprint

Using ideas from abstract algebra and functional programming, we built a new GNN aggregator that beats the state of the art on complex aggregation problems (especially out-of-distribution), while remaining efficient and parallelisable on large graphs.
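
The core idea is to treat the aggregator as a commutative monoid: a learnable binary operation (with an identity element) folded over neighbour messages via a balanced binary tree, giving logarithmic depth. A rough sketch of that idea, with architecture details that are my own placeholders rather than the paper's implementation:

```python
import torch
import torch.nn as nn

class LearnableMonoidAggregator(nn.Module):
    """Sketch: a learnable commutative binary op, reduced tree-style over messages."""

    def __init__(self, dim: int):
        super().__init__()
        self.op = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.identity = nn.Parameter(torch.zeros(dim))  # learnable identity element

    def binop(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # Symmetrise so the learned operation is commutative by construction.
        return 0.5 * (self.op(torch.cat([a, b], dim=-1)) + self.op(torch.cat([b, a], dim=-1)))

    def forward(self, messages: torch.Tensor) -> torch.Tensor:
        # messages: (num_messages, dim); pairwise reduction gives O(log n) depth.
        x = messages
        while x.shape[0] > 1:
            if x.shape[0] % 2 == 1:  # pad odd-length levels with the identity
                x = torch.cat([x, self.identity.unsqueeze(0)], dim=0)
            x = self.binop(x[0::2], x[1::2])
        return x.squeeze(0) if x.shape[0] == 1 else self.identity
```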

Featured work (robustness)

Constitutional Classifiers: Defending against universal jailbreaks across thousands of hours of red teaming
Mrinank Sharma*, Meg Tong*, Jesse Mu*, Jerry Wei*, Jorrit Kruthoff*, Scott Goodfriend*, Euan Ong*, Alwin Peng, ..., Jan Leike, Jared Kaplan, Ethan Perez (* denotes equal contribution)
arXiv, 2025
arXiv / Anthropic article / tweeprint

We built a system of constitutional classifiers to prevent jailbreaks. A prototype version of our system withstood over 3,000 hours of expert red teaming with no universal jailbreaks found. Newer versions of our system also have minimal over-refusals and moderate run-time overhead.

Image Hijacks: Adversarial Images can Control Generative Models at Runtime
Luke Bailey*, Euan Ong*, Stuart Russell, Scott Emmons (* denotes equal contribution)
ICML, 2024
arXiv / project page + demo / tweeprint

We discovered that adversarial images can hijack the behaviour of vision-language models (VLMs) at runtime. We developed a general method for crafting these image hijacks, and trained image hijacks forcing VLMs to output arbitrary text, leak their context window and comply with harmful instructions. We also derived an algorithm to train hijacks forcing VLMs to behave as though they were given an arbitrary prompt, which we used to make them believe the Eiffel Tower is in Rome.
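
In spirit, crafting a hijack is just gradient descent on the image pixels against a target behaviour. A heavily simplified sketch, assuming a generic differentiable VLM interface (the `vlm.loss` helper is hypothetical, and this omits the behaviour-matching details from the paper):

```python
import torch

def train_image_hijack(vlm, image, prompt_ids, target_ids,
                       steps=500, lr=1e-2, eps=8 / 255):
    """Optimise a perturbation so the VLM, shown (image + delta, prompt),
    assigns high likelihood to the target text (teacher-forced)."""
    delta = torch.zeros_like(image, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        hijacked = (image + delta).clamp(0, 1)
        loss = vlm.loss(images=hijacked, prompt=prompt_ids, target=target_ids)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)  # optional: keep the perturbation imperceptible
    return (image + delta).detach().clamp(0, 1)
```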

Other projects

Building and Evaluating Alignment Auditing Agents
Trenton Bricken, Rowan Wang, Sam Bowman, Euan Ong, Johannes Treutlein, Jeff Wu, Evan Hubinger, Samuel Marks
Anthropic Alignment Science Blog, 2025
paper / tweeprint

We developed three AI agents to autonomously complete alignment auditing tasks. In testing, our agents successfully uncovered hidden goals, built safety evaluations, and surfaced concerning behaviors. I helped to build the infrastructure for the experiments in this paper.

Cost-Effective Constitutional Classifiers via Representation Re-use
Hoagy Cunningham, Alwin Peng, Jerry Wei, Euan Ong, Fabien Roger, Linda Petrini, Misha Wagner, Vladimir Mikulik, Mrinank Sharma
Anthropic Alignment Science Blog, 2025
paper

We benchmark approaches to re-using LLM internals to make LLM monitoring more cost-effective. I helped to build the infrastructure for the evaluations in this paper.

Auditing Language Models for Hidden Objectives
Samuel Marks*, Johannes Treutlein*, Trenton Bricken, Jack Lindsey, Jonathan Marcus, Siddharth Mishra-Sharma, Daniel Ziegler, ..., Euan Ong, ... , Kelley Rivoire, Adam Jermyn, Monte MacDiarmid, Tom Henighan, Evan Hubinger (* denotes equal contribution)
arXiv, 2025
arXiv / Anthropic article / tweeprint

We deliberately train a language model with a hidden objective and use it as a testbed for studying alignment audits. I participated in the auditing game, on the team with only black-box model access.

Personality Machine
Euan Ong*, Jamie Chen*, Kyra Zhou*, Marcus Handley*, Mingle Chen*, Ori Vasilescu*, Eleanor Drage (* denotes equal contribution)
CST Group Project, 2022

In collaboration with the Centre for Gender Studies at Cambridge, we built a tool highlighting the questionable logic behind the AI-driven personality assessments often used in hiring. Our tool demonstrates how arbitrary changes in facial expression, clothing, lighting and background can give radically different personality readings, and was covered by the BBC and the Telegraph.

Dissecting Deep Learning for Systematic Generalisation
Euan Ong, Etaash Katiyar, Kai-En Chong, Albert Qiaochu Jiang
Informal research, 2021

We investigated the ability of transformers to generalise systematically when learning to recognise formal languages (such as Parity and 2-Dyck), empirically corroborating various theoretical claims about transformer generalisation. Inspired by our observations, we derived a parallel, stackless algorithm for recognising 2-Dyck that could (in principle) be implemented by a transformer with a constant number of attention layers.

Object Detection in Thermal Imagery via Convolutional Neural Networks
Euan Ong, Niki Trigoni, Pedro Porto Barque de Gusmão
Technical report, 2019

We trained a Faster R-CNN object detection network to identify landmarks (e.g. doors and windows) in thermal images of indoor environments, with applications in the development of navigational aids for search and rescue operations.
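
For reference, the standard torchvision fine-tuning recipe for such a detector looks roughly like this (a generic sketch with placeholder classes, not the report's actual code):

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

num_classes = 3  # placeholder: background + door + window

# Start from a COCO-pretrained Faster R-CNN and swap in a new box predictor.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# Training then follows the usual torchvision detection loop:
# for images, targets in data_loader:
#     losses = model(images, targets)   # dict of RPN + ROI-head losses
#     sum(losses.values()).backward()
```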

miscellanea

research aside, I love communities and spaces that help people live more meaningful lives.

during my time at Cambridge, I started [scale down], a co-working group to help students make space for the things they care about; over the years, this sprouted more nodes across the UK, and grew into a national student-run collective called ensemble.

more recently, I wrote some perhaps-unorthodox career advice for undergrads (reviews).


Design inspired by Jon Barron's site.