> hello world_
I'm Ashioya Jotham Victor
Exploring the mechanics of reasoning in language models
through mechanistic interpretability
$ cat about.md
I'm Ashioya Jotham Victor, a researcher passionate about making AI systems more interpretable and safe. My work sits at the intersection of mechanistic interpretability, chain-of-thought reasoning, and AI alignment.
Currently investigating how faithful chain-of-thought reasoning emerges in transformer models, probing not just what models say but whether their internal computations actually reflect the reasoning they output.
I believe that understanding the mechanistic basis of reasoning is crucial for building AI systems we can truly trust.
Mombasa, Kenya
victorashioya960@gmail.com
$ ls research/
How I think about AI safety and interpretability
We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence.
I don't think scaling gets us to robust reasoning.
Language models exhibit what Ethan Mollick calls "jagged intelligence": superhuman at some tasks, brittle at others humans find trivial. This jagged profile is why interpretability matters: we need to understand where and why systems break.
My work traces how specific neurons, attention heads, and pathways give rise to model capabilities and failures. I'm particularly drawn to hallucinations: when models produce confident-but-wrong outputs.
WHERE I STAND
- World models > scaling language models
- Alignment is stewardship, not prophecy
- Pessimism ≠ intelligence
- To see a failure mode is to inherit responsibility for it
HOW I GOT HERE
I started with the intuition that self-supervised approaches were undersold. The field was drunk on supervised fine-tuning; I wrote "The Future Remains Unsupervised" for Deep Learning Indaba as a counterweight.
Then I got interested in failure modes. Why do models produce confident nonsense? The term "hallucination" felt too forgiving, so I started calling it confabulation, borrowing from neuroscience. I began building metrics to measure it.
Now I'm going deeper: mechanistic interpretability. Can we detect when a model "knows" it's being evaluated? Can we find the exact pathways where reasoning goes wrong? The work is harder but more satisfying.
CORE AREAS OF INVESTIGATION
AI Safety & Alignment
Ensuring AI systems behave as intended and remain safe as they scale. Focused on technical approaches to alignment and robustness.
Mechanistic Interpretability
Reverse-engineering the internal computations of neural networks to understand how they process information and form representations.
Reasoning Faithfulness
Investigating whether chain-of-thought reasoning traces in LLMs genuinely reflect the model's actual computation process.
AI Control
Developing techniques to ensure powerful AI systems remain controllable even when they exceed human-level capabilities. Practical guardrails over theoretical guarantees.
Scalable Oversight
Developing methods to reliably evaluate and supervise AI systems that may exceed human-level capabilities in narrow domains.
RESEARCH DIARY
Working notes, paper annotations, technical history from n-grams to Transformers.
Browse the repo →
$ ls projects/
Things I've built and contributed to
> CoT-Faithfulness-Mech-Interp [FEATURED]
Mechanistic interpretability analysis of chain-of-thought faithfulness. Includes activation patching, circuit analysis, and benchmarks for evaluating whether reasoning traces genuinely reflect model computations.
> Reasoning Circuit Atlas [IN PROGRESS]
Interactive atlas mapping reasoning circuits across model scales. Visualizes how faithfulness circuits develop during training.
> Superposition & Reasoning [EXPLORATION]
How superposition in MLP layers affects the model's ability to maintain faithful intermediate reasoning states.
> PatchLib [IN DEVELOPMENT]
Lightweight library for structured activation patching experiments with clean abstractions for causal tracing.

> FaithBench
A benchmark for evaluating the faithfulness of chain-of-thought reasoning in large language models. Measures whether reasoning traces genuinely reflect model computations.

> CircuitLens
Visualization tool for mechanistic interpretability circuits. Maps neural network components to human-understandable computation graphs.

> SafePrompt
A safety-aware prompt engineering framework that systematically tests prompt injection vulnerabilities and robustness in LLM systems.

> TensorLens
Neural network activation analysis and visualization toolkit. Inspect intermediate layer activations to understand model behavior at a granular level.

> ThinkAloud
Tool for analyzing model reasoning chains step by step. Identifies where reasoning diverges from intended logic and quantifies faithfulness.
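
Several of the projects above rest on activation patching. As a rough illustration of the idea only, and not the API of PatchLib or of any project listed here, the toy Python sketch below runs a tiny hand-weighted network on a "clean" and a "corrupted" input, then copies one hidden activation at a time from the clean run into the corrupted run and reports how much of the output gap each patch restores. The network, its weights, and the names `forward` and `patching_effects` are all hypothetical and chosen for readability.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy network: 2 inputs -> 3 ReLU hidden units -> 1 output.
# The weights are hand-picked for illustration, not taken from any real model.
W1: List[List[float]] = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W2: List[float] = [2.0, -1.0, 0.5]


def forward(
    x: List[float], patch: Optional[Dict[int, float]] = None
) -> Tuple[float, List[float]]:
    """Run the toy network, optionally overriding hidden activations.

    `patch` maps a hidden-unit index to the activation value to force.
    Returns the scalar output and the (possibly patched) hidden activations.
    """
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    if patch:
        for index, value in patch.items():
            hidden[index] = value
    output = sum(w * h for w, h in zip(W2, hidden))
    return output, hidden


def patching_effects(clean: List[float], corrupted: List[float]) -> List[float]:
    """Fraction of the clean-vs-corrupted output gap restored per hidden unit."""
    clean_out, clean_hidden = forward(clean)
    corrupted_out, _ = forward(corrupted)
    gap = clean_out - corrupted_out
    effects = []
    for index, clean_activation in enumerate(clean_hidden):
        # Re-run the corrupted input with one activation taken from the clean run.
        patched_out, _ = forward(corrupted, patch={index: clean_activation})
        effects.append((patched_out - corrupted_out) / gap if gap else 0.0)
    return effects


if __name__ == "__main__":
    for unit, effect in enumerate(patching_effects([1.0, 0.0], [0.0, 1.0])):
        print(f"hidden unit {unit}: restores {effect:.2f} of the output gap")
```

In a real transformer the same loop would run over attention heads or residual-stream positions through forward hooks rather than a dictionary, but the logic is the same: a component whose patch restores most of the gap is a candidate part of the circuit.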
$ ls writing/
Research papers and essays
Publications
[01] AkiliCode-14B: Repair-Oriented Post-Training for an Open 14B Code Model
Technical Report · 2025
[03] Sauti STT V1: Swahili Automatic Speech Recognition Performance and Error Analysis
Technical Report · 2024
Essays & Blog Posts
> AI Priesthood vs AI Pluralism
On builders, clerics, and the future of intelligence. Every major technological revolution produces two kinds of institutions, and AI is no different.
> Why I Pivoted from Alignment to Control
On moving from "how do we make AI want what we want?" to "how do we keep AI from doing what we don't want?"
> My Research Ethos for Alignment & Interpretability
Builder, anti-doomer, anti-gnostic. I approach AI alignment and interpretability as fields of repair, not apocalyptic forecasting.
> Detecting Unfaithful Chain of Thought
Technical deep-dive into detecting when language models produce reasoning traces that don't reflect their actual computation.
> The Anticipatory Disruption Trap
On the failure mode of preemptively optimizing for disruptions that never materialize.
> RLHF'd to Death
On the homogenizing effects of RLHF and what we lose when we optimize too hard for "helpfulness."
> Inverse Scaling
When bigger models get worse, and what that tells us about the limits of scale.
$ cat now.md
What I'm working on now
WORKING ON
Researcher in Amnesty International Kenya's RightUp 2.0 program, a youth-led research initiative investigating tech-facilitated repression and how surveillance tools and digital tactics are used to silence activists and civil society.
Junior Research Fellow at the ILINA Program, focusing on technical AI safety and mechanistic interpretability. Building "runtime safety governors" that use activation steering to detect and correct deceptive model behavior in real time, and testing whether these internal safety controls generalize to Swahili.
Leading partnerships at GDG Pwani. Currently prepping for Google I/O Extended Pwani 2026 and reaching out to partners and sponsors.
THINKING ABOUT
The politics of AI governance. Who gets to decide what "safe" means, and whether the current landscape is trending toward pluralism or priesthood.
The gap between alignment research and deployment reality. The field has a theory-practice problem that nobody wants to name directly.
READING
The Infinity Machine by Sebastian Mallaby, the biography of Demis Hassabis and the story of DeepMind's quest toward superintelligence.
Re-reading Seeing Like a State (Scott), which keeps being relevant. Working through the Anthropic interpretability papers to trace how the field's assumptions evolved.
BUILDING
A small tool for visualizing attention patterns in a way that's actually useful for debugging. Early stages. May never ship.
Last updated: May 2026
$ cat contact.md
Get in touch
Feel free to reach out for collaborations, research discussions, or just to say hello.
I'm always interested in connecting with fellow researchers, developers, and anyone passionate about AI safety.