How I Got Here

2023: I started with the intuition that self-supervised approaches were undersold. The field was drunk on supervised fine-tuning; I wrote "The Future Remains Unsupervised" for Deep Learning Indaba as a counterweight.

2024: Then I got interested in failure modes. Why do models produce confident nonsense? The term "hallucination" felt too forgiving, so I started calling it confabulation, borrowing the term from neuroscience, and began building metrics to measure it.

2025: Now I'm going deeper, into mechanistic interpretability. Can we detect when a model "knows" it's being evaluated? Can we find the exact pathways where reasoning goes wrong? The work is harder but more satisfying. I'm at Bluedot this year, focused on this full-time.

Active Projects

Do LLMs internally represent being monitored? A probe reaches 92.3% accuracy at layer 16, with 70.3% transfer to subtle cues.

Open: Does the awareness direction causally affect policy decisions?
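The probe setup can be sketched in miniature. Everything below is fabricated for illustration: the activations are synthetic, the "monitored" direction is planted by hand, and the signal scale is arbitrary. The real project reads hidden states out of an LLM; this only shows the shape of a mean-difference linear probe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for residual-stream activations at one layer.
# The planted "monitored" direction and 3.0 signal scale are invented.
d_model, n_train, n_test = 64, 300, 100
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)

def make_batch(n):
    y = rng.integers(0, 2, size=n)  # 1 = "monitored" framing, 0 = neutral
    X = rng.normal(size=(n, d_model)) + 3.0 * y[:, None] * direction
    return X, y

X_tr, y_tr = make_batch(n_train)
X_te, y_te = make_batch(n_test)

# Mean-difference linear probe: project activations onto the difference
# of class means and threshold at the midpoint between them.
w = X_tr[y_tr == 1].mean(axis=0) - X_tr[y_tr == 0].mean(axis=0)
b = (X_tr[y_tr == 1].mean(axis=0) + X_tr[y_tr == 0].mean(axis=0)) @ w / 2
pred = (X_te @ w > b).astype(int)
acc = (pred == y_te).mean()
print(f"probe accuracy: {acc:.2f}")
```

The same readout, run on activations from different layers, is how a "layer 16 peaks" claim gets located; transfer is just this probe evaluated on a differently-framed test set.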

Mechanistic analysis of reasoning in GPT-2: attribution graphs, faithfulness detection, targeted interventions.

Open: Are chain-of-thought (CoT) traces post-hoc rationalizations or actual reasoning?
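The targeted-intervention part can be illustrated with activation patching on a toy network. The weights and shapes below are invented stand-ins, not GPT-2 internals; the sketch only shows the mechanic of splicing a clean activation into a corrupted forward pass and scoring the effect per unit.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy two-layer network standing in for a transformer block.
# Random weights for illustration only: y = W2 @ relu(W1 @ x).
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(1, 8))

def forward(x, patch=None):
    h = np.maximum(W1 @ x, 0.0)
    if patch is not None:
        idx, value = patch  # targeted intervention: overwrite one hidden unit
        h = h.copy()
        h[idx] = value
    return (W2 @ h)[0]

x_clean = rng.normal(size=4)
x_corrupt = rng.normal(size=4)
h_clean = np.maximum(W1 @ x_clean, 0.0)

# Patch each hidden unit's clean activation into the corrupted run and
# measure how much each patch moves the output toward the clean answer.
base = forward(x_corrupt)
effects = [forward(x_corrupt, patch=(i, h_clean[i])) - base for i in range(8)]
print("per-unit patching effects:", np.round(effects, 3))
```

In the real setting the patched sites are attention heads or MLP neurons at specific token positions, and the per-site effects are the edge weights an attribution graph is built from.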

Testing the "LoRA Without Regret" methodology on numerical → natural-language forecasts.

Open: Can RLHF optimize for both accuracy and style?