Euan Ong
I've just finished my undergraduate degree in Computer Science at the University of Cambridge
(where I ranked 1st / ~120 every year). I'm currently working on interpretability (previously
adversarial robustness) at Anthropic.
My ambition is to develop powerful, yet safe and interpretable abstract reasoners, whose internal state and behaviour remain transparent to the end user.
To this end, I'm interested in both pragmatic, scaling-friendly approaches to eliciting latent knowledge from neural networks, and longer-term bets exploring how the mathematical toolkits we use to understand and structure programs (such as formal methods, types and category theory) could help us discover the right abstractions to reverse-engineer them.
Email / CV / Google Scholar / Twitter / LinkedIn / Github
Research
So far, my research has broadly focused on studying the behaviour of neural networks in vitro: understanding both how they generalise when learning to perform abstract tasks, and what this tells us about the algorithms they've learned in order to do so.
Previously, I've probed the foundations of neural algorithmic reasoning, explored attacks on vision-language models, and poked language model representations with a stick.
More recently, I've also done some work on adversarial robustness, mostly at Anthropic.
Featured work (reasoning & interpretability)
Probing the Foundations of Neural Algorithmic Reasoning
Euan Ong
Technical Report, 2023; ICML Differentiable Almost Everything (Spotlight), 2024
abstract / full text / project page
I explored a fundamental claim of neural algorithmic
reasoning, and found evidence to refute it through statistically robust ablations. Based on my
observations, I developed a way to parallelise differentiable algorithms that preserves their
efficiency and correctness guarantees while alleviating their performance bottlenecks. This work
formed part of my Bachelor's thesis, which won the CS department's Best Dissertation Award.
Successor Heads: Recurring, Interpretable Attention Heads In The Wild
Rhys Gould, Euan Ong, George Ogden, Arthur Conmy
ICLR, 2024; NeurIPS ATTRIB (Oral), 2023
arXiv / reviews / project page / tweeprint
We discovered successor heads: attention heads present in a range of LLMs that increment tokens from ordinal sequences (e.g. numbers, months and days). We isolated a common numeric subspace within embedding space which, for any given token (e.g. 'February'), encodes the index of that token within its ordinal sequence (e.g. months). We also found that numeric token representations can be decomposed into interpretable features representing the value of the token mod 10, which can be used to edit the numeric value of the representation via vector arithmetic.
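As a rough illustration of that editing step (a minimal sketch, not the paper's exact procedure): suppose we have already extracted unit-norm "mod-10" feature directions from a model. Changing a token representation's numeric value then amounts to a subtract-and-add in activation space. Everything below (the mod10_features array, h_four, the unit scale) is a hypothetical placeholder for quantities read off a real model.

```python
import numpy as np

def edit_numeric_value(h: np.ndarray,
                       mod10_features: np.ndarray,  # (10, d_model), hypothetical directions
                       current_value: int,
                       target_value: int,
                       scale: float = 1.0) -> np.ndarray:
    """Swap the 'value mod 10' feature of a representation via vector arithmetic."""
    f_cur = mod10_features[current_value % 10]
    f_new = mod10_features[target_value % 10]
    return h - scale * f_cur + scale * f_new

# Toy usage with random placeholders standing in for real model activations.
d_model = 768
rng = np.random.default_rng(0)
mod10_features = rng.standard_normal((10, d_model))
mod10_features /= np.linalg.norm(mod10_features, axis=-1, keepdims=True)
h_four = rng.standard_normal(d_model)                     # pretend: representation of "4"
h_seven = edit_numeric_value(h_four, mod10_features, 4, 7)  # pretend: now reads as "7"
```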
Learnable Commutative Monoids for Graph Neural Networks
Euan Ong, Petar Veličković
Learning on Graphs Conference, 2022
arXiv / reviews / project page / tweeprint
Using ideas from abstract algebra and functional programming, we built a new GNN aggregator that
beats the state of the art on complex aggregation problems (especially out-of-distribution), while
remaining efficient and parallelisable on large graphs.
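The core recipe can be sketched roughly as follows: learn a commutative binary operator together with an identity element, then reduce each neighbourhood with a balanced binary tree of depth O(log n). The MLP sizes, the symmetrisation trick and the padding scheme below are illustrative choices of mine, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CommutativeMonoidAggregator(nn.Module):
    """Sketch of a learnable commutative-monoid aggregator for GNN neighbourhoods."""

    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.identity = nn.Parameter(torch.zeros(dim))  # learned identity element

    def op(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # Symmetrise so the learned binary operator is commutative by construction.
        return self.g(torch.cat([x, y], dim=-1)) + self.g(torch.cat([y, x], dim=-1))

    def forward(self, neighbours: torch.Tensor) -> torch.Tensor:
        # neighbours: (n, dim) bag of neighbour embeddings for a single node.
        xs = neighbours
        while xs.shape[0] > 1:
            if xs.shape[0] % 2 == 1:                      # pad odd levels with the identity
                xs = torch.cat([xs, self.identity.unsqueeze(0)], dim=0)
            xs = self.op(xs[0::2], xs[1::2])              # one level of the balanced tree
        return xs[0] if xs.shape[0] == 1 else self.identity

if __name__ == "__main__":
    agg = CommutativeMonoidAggregator(dim=16)
    out = agg(torch.randn(5, 16))   # aggregate a 5-neighbour bag in O(log n) depth
    print(out.shape)                # torch.Size([16])
```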
Featured work (robustness)
Constitutional classifiers: Defending against universal jailbreaks across thousands of hours of red teaming
Mrinank Sharma*, Meg Tong*, Jesse Mu*, Jerry Wei*, Jorrit Kruthoff*, Scott Goodfriend*, Euan Ong*, Alwin Peng, ..., Jan Leike, Jared Kaplan, Ethan Perez (* denotes equal contribution)
arXiv, 2025
arXiv / Anthropic article / tweeprint
We built a system of constitutional classifiers to prevent jailbreaks. A prototype version of our
system withstood over 3,000 hours of expert red teaming with no universal jailbreaks found. Newer
versions of our system also have minimal over-refusals and moderate run-time overhead.
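At a very high level, the deployment shape looks something like the sketch below. This is illustrative only: the real system's classifiers, thresholds, streaming output scoring and refusal handling are considerably more involved, and every function here is a hypothetical stand-in (the classifiers score text for policy violations in [0, 1], and `generate` is the underlying model).

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GuardedModel:
    generate: Callable[[str], str]
    input_classifier: Callable[[str], float]
    output_classifier: Callable[[str], float]
    threshold: float = 0.5
    refusal: str = "I can't help with that."

    def respond(self, prompt: str) -> str:
        if self.input_classifier(prompt) > self.threshold:
            return self.refusal                 # block harmful prompts up front
        completion = self.generate(prompt)
        if self.output_classifier(completion) > self.threshold:
            return self.refusal                 # block harmful completions post hoc
        return completion
```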
Image Hijacks: Adversarial Images can Control Generative Models at Runtime
Luke Bailey*, Euan Ong*, Stuart Russell, Scott Emmons (* denotes equal contribution)
ICML, 2024
arXiv / project page + demo / tweeprint
We discovered that adversarial images can hijack the behaviour of vision-language models (VLMs) at
runtime. We developed a general method for crafting these image hijacks,
and trained image hijacks forcing VLMs to output arbitrary text, leak their context window and
comply with harmful instructions. We also derived an algorithm to train hijacks forcing
VLMs to behave as though they were given an arbitrary prompt, which we used to make them believe the
Eiffel Tower is in Rome.
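The flavour of the underlying optimisation is roughly the following (a hedged sketch, not our exact recipe): treat the image as the only trainable parameter and descend on the frozen VLM's loss for the target behaviour. Here `target_loss` is a hypothetical closure wrapping that forward pass.

```python
import torch

def craft_image_hijack(image: torch.Tensor,
                       target_loss,             # Callable[[torch.Tensor], torch.Tensor], hypothetical
                       steps: int = 500,
                       step_size: float = 1e-2) -> torch.Tensor:
    """Optimise image pixels so a frozen VLM produces the attacker's target behaviour."""
    hijack = image.clone().detach().requires_grad_(True)
    for _ in range(steps):
        loss = target_loss(hijack)               # e.g. cross-entropy of the target text
        loss.backward()
        with torch.no_grad():
            hijack -= step_size * hijack.grad.sign()   # signed gradient step on pixels
            hijack.clamp_(0.0, 1.0)                    # stay in the valid pixel range
        hijack.grad = None
    return hijack.detach()
```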
Other projects
Building and Evaluating Alignment Auditing Agents
Trenton Bricken, Rowan Wang, Sam Bowman, Euan Ong, Johannes Treutlein, Jeff Wu, Evan Hubinger, Samuel Marks
Anthropic Alignment Science Blog, 2025
paper / tweeprint
We developed three AI agents to autonomously complete alignment auditing tasks. In testing, our
agents successfully uncovered hidden goals, built safety evaluations, and surfaced concerning
behaviors. I helped to build the infrastructure for the experiments in this paper.
Cost-Effective Constitutional Classifiers via Representation Re-use
Hoagy Cunningham, Alwin Peng, Jerry Wei, Euan Ong, Fabien Roger, Linda Petrini, Misha Wagner, Vladimir Mikulik, Mrinank Sharma
Anthropic Alignment Science Blog, 2025
paper
We benchmark approaches to re-using LLM internals to make LLM monitoring more cost-effective. I
helped to build the infrastructure for the evaluations in this paper.
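One illustrative instance of the "re-use internals" idea (a sketch under my own assumptions, not necessarily one of the approaches benchmarked in the post) is a small probe trained directly on activations the main model already computes while generating, rather than running a separate monitor model.

```python
import torch
import torch.nn as nn

class ActivationProbe(nn.Module):
    """Linear probe on cached residual-stream activations from the generation pass."""

    def __init__(self, d_model: int):
        super().__init__()
        self.score = nn.Linear(d_model, 1)

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        # activations: (batch, seq, d_model), already computed during generation.
        pooled = activations.mean(dim=1)          # mean-pool over the sequence
        return torch.sigmoid(self.score(pooled))  # probability the text should be flagged
```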
Auditing Language Models for Hidden Objectives
Samuel Marks*, Johannes Treutlein*, Trenton Bricken, Jack Lindsey, Jonathan Marcus, Siddharth Mishra-Sharma, Daniel Ziegler, ..., Euan Ong, ..., Kelley Rivoire, Adam Jermyn, Monte MacDiarmid, Tom Henighan, Evan Hubinger
(* denotes equal contribution)
arXiv, 2025
arXiv / Anthropic article / tweeprint
We deliberately train a language model with a hidden objective and use it as a testbed for studying
alignment audits. I participated in the auditing game, on the team with only black-box model
access.
Personality Machine
Euan Ong*, Jamie Chen*, Kyra Zhou*, Marcus Handley*, Mingle Chen*, Ori Vasilescu*, Eleanor Drage
(* denotes equal contribution)
CST Group Project, 2022
In collaboration with the Centre for Gender Studies at Cambridge, we built a tool highlighting the questionable logic behind the AI-driven personality assessments often used in hiring. Our tool demonstrates how arbitrary changes in facial expression, clothing, lighting and background can give radically different personality readings, and was featured by the BBC and the Telegraph.
Dissecting Deep Learning for Systematic Generalisation
Euan Ong, Etaash Katiyar, Kai-En Chong, Albert Qiaochu Jiang
Informal research, 2021
We investigated the capabilities of transformers to systematically generalise when learning to
recognise formal languages (such as
Parity and 2-Dyck), empirically corroborating various theoretical claims about transformer generalisation.
Inspired by our observations, we
derived a parallel, stackless algorithm for recognising 2-Dyck that could (in principle) be
implemented by a transformer with a constant number of attention layers.
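To give a flavour of what "parallel and stackless" can mean here, the sketch below is an illustrative prefix-sum formulation of bracket matching, not necessarily the algorithm we derived: every step is a scan, a bucketing or an elementwise check, so nothing requires an explicit stack.

```python
import numpy as np

def is_2dyck(s: str) -> bool:
    """Recognise 2-Dyck (well-nested '()' and '[]') without an explicit stack."""
    if not s:
        return True
    match = {"(": ")", "[": "]"}
    steps = np.array([1 if c in match else -1 for c in s])
    depth = np.cumsum(steps)                      # depth after each symbol (a parallel scan)
    if depth[-1] != 0 or (depth < 0).any():
        return False                              # globally unbalanced
    # Nesting level: for an open bracket its depth, for a close bracket its depth + 1.
    level = depth + (steps == -1).astype(int)
    for lvl in np.unique(level):
        idx = np.flatnonzero(level == lvl)        # bracket positions at this level, in order
        if len(idx) % 2 != 0:
            return False
        for a, b in zip(idx[0::2], idx[1::2]):    # consecutive pairs must be matching open/close
            if s[a] not in match or match[s[a]] != s[b]:
                return False
    return True

assert is_2dyck("([()[]])") and not is_2dyck("([)]") and not is_2dyck("(()")
```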
Object Detection in Thermal Imagery via Convolutional Neural Networks
Euan Ong, Niki Trigoni, Pedro Porto Buarque de Gusmão
Technical report, 2019
We trained a Faster R-CNN object detection network to identify landmarks (e.g. doors and windows) in
thermal images of indoor environments, with applications in the development of navigational aids for
search and rescue operations.
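A sketch of the kind of setup involved, using the modern torchvision API rather than the original 2019 code, with an illustrative label set (background, door, window) standing in for the real landmark classes:

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Start from a COCO-pretrained Faster R-CNN and swap in a box-predictor head
# sized for the thermal-landmark classes.
num_classes = 3  # background + door + window (illustrative)
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# One training step: thermal frames as 3-channel tensors in [0, 1], with
# per-image target dicts of 'boxes' (N, 4) and 'labels' (N,).
images = [torch.rand(3, 480, 640)]
targets = [{"boxes": torch.tensor([[50., 60., 120., 200.]]),
            "labels": torch.tensor([1])}]
model.train()
loss_dict = model(images, targets)   # classification + box-regression losses
loss = sum(loss_dict.values())
```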
miscellanea
research aside, I love communities and spaces that help people live more meaningful lives.
during my time at Cambridge, I started [scale down], a co-working group to help students make space for the things they care about; over the years, this sprouted more nodes across the UK, and grew into a national student-run collective called ensemble.
more recently, I wrote some perhaps-unorthodox career advice for undergrads (reviews).