How I Got Here
2023: I started with the intuition that self-supervised approaches were undersold. The field was drunk on supervised fine-tuning; I wrote "The Future Remains Unsupervised" for Deep Learning Indaba as a counterweight.
2024: Then I got interested in failure modes. Why do models produce confident nonsense? The term "hallucination" felt too forgiving, so I started calling it confabulation, borrowing from neuroscience, and began building metrics to measure it.
2025: Now I'm going deeper: mechanistic interpretability. Can we detect when a model "knows" it's being evaluated? Can we find the exact pathways where reasoning goes wrong? The work is harder but more satisfying. I'm at Bluedot this year, focused on this full-time.
Active Projects
Mechanistic analysis of reasoning faithfulness in GPT-2 Small. A linear probe achieves 88% accuracy (ROC-AUC 0.949) detecting unfaithful chain-of-thought (CoT) from residual-stream activations. Identified 23 causal circuit components; steering vectors achieve ~72% success shifting unfaithful → faithful outputs.
Open: Do these circuits generalize across model scales and reasoning tasks?
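To make the probe setup concrete, here is a minimal sketch of a linear (logistic-regression) probe trained on residual-stream activations. Everything in it is a synthetic stand-in: the activations are random vectors separated along an arbitrary direction, not real GPT-2 Small activations, and the dimensions are just chosen to match GPT-2 Small's d_model of 768.

```python
import numpy as np

# Hypothetical sketch: a linear probe for (un)faithful CoT, trained on
# residual-stream activations. Data here is synthetic; in the real setup
# the activations would be cached from GPT-2 Small (d_model = 768).
rng = np.random.default_rng(0)
d_model, n = 768, 400

# Synthetic "faithful" vs "unfaithful" activations, separated along one
# unit direction so the probe has something to find.
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n)  # 1 = unfaithful
acts = rng.normal(size=(n, d_model)) + np.outer(labels * 2.0 - 1.0, direction) * 2.0

# Logistic-regression probe trained with plain gradient descent.
w = np.zeros(d_model)
b = 0.0
for _ in range(500):
    z = np.clip(acts @ w + b, -30, 30)   # clip to avoid exp overflow
    p = 1.0 / (1.0 + np.exp(-z))         # sigmoid
    w -= 0.5 * (acts.T @ (p - labels) / n)
    b -= 0.5 * np.mean(p - labels)

preds = (acts @ w + b) > 0
accuracy = np.mean(preds == labels)
print(f"probe accuracy: {accuracy:.2f}")
```

The learned weight vector `w` doubles as a candidate direction for the steering experiments: adding or subtracting it from the residual stream is the natural follow-up intervention.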
Do LLMs internally represent being monitored? A linear probe reaches 92.3% accuracy at layer 16, with 70.3% transfer to subtle cues.
Open: Does awareness direction causally affect policy decisions?
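The open causal question comes down to an intervention like the following sketch: project out the probed "awareness" direction from the residual stream, set it to a chosen value, and check whether downstream decisions change. The direction and activations below are synthetic stand-ins; in the real experiment they would come from layer 16 of the model.

```python
import numpy as np

# Hedged sketch of a causal intervention on a probed direction.
# aware_dir stands in for the awareness direction found by the probe.
rng = np.random.default_rng(1)
d_model = 768
aware_dir = rng.normal(size=d_model)
aware_dir /= np.linalg.norm(aware_dir)

def intervene(resid, coeff):
    # Project out the awareness component, then pin it to `coeff`.
    resid = resid - (resid @ aware_dir) * aware_dir
    return resid + coeff * aware_dir

resid = rng.normal(size=d_model)
steered = intervene(resid, -3.0)
print(f"awareness projection: {resid @ aware_dir:+.2f} -> {steered @ aware_dir:+.2f}")
```

If pinning this component changes the model's policy decisions while leaving other behaviour intact, that would be evidence the direction is causally used, not just linearly decodable.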
Testing the "LoRA Without Regret" methodology on numerical → natural-language forecasts.
Open: Can RLHF optimize for both accuracy and style?
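For readers unfamiliar with LoRA, the core trick can be sketched in a few lines: instead of updating a frozen weight matrix W, train a low-rank delta B·A added to it. The shapes and scaling below are illustrative assumptions, not the actual code from the project.

```python
import numpy as np

# Minimal LoRA sketch: a frozen weight plus a trainable low-rank adapter.
rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 4             # rank r << d

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable, small random init
B = np.zeros((d_out, r))               # trainable, zero init
alpha = 8.0                            # scaling hyperparameter

def forward(x):
    # Adapted forward pass: W x + (alpha / r) * B A x
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B initialised to zero, the adapter is a no-op before training.
print("delta parameters:", A.size + B.size, "vs full:", W.size)
```

Because only A and B are trained, the update touches 512 parameters here versus 4096 for the full matrix, which is what makes per-task adapters cheap enough for RLHF-style fine-tuning experiments.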