🐘

> hello world_
I'm Ashioya Jotham Victor

Exploring the mechanics of reasoning in language models through mechanistic interpretability


$ cat about.md

I'm Ashioya Jotham Victor, a researcher passionate about making AI systems more interpretable and safe. My work sits at the intersection of mechanistic interpretability, chain-of-thought reasoning, and AI alignment.

Currently investigating how faithful chain-of-thought reasoning emerges in transformer models: probing not just what models say, but whether their internal computations actually reflect the reasoning they output.

I believe that understanding the mechanistic basis of reasoning is crucial for building AI systems we can truly trust.

📍 Mombasa, Kenya

📧 victorashioya960@gmail.com


$ ls research/

How I think about AI safety and interpretability

"We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence."
- Noam Shazeer, GLU Variants Improve Transformer (2020)

I don't think scaling gets us to robust reasoning.

Language models exhibit what Ethan Mollick calls "jagged intelligence": superhuman at some tasks, brittle at others humans find trivial. This jagged profile is why interpretability matters: we need to understand where and why systems break.

My work traces how specific neurons, attention heads, and pathways give rise to model capabilities and failures. I'm particularly drawn to hallucinations: when models produce confident-but-wrong outputs.

WHERE I STAND

  • World models > scaling language models
  • Alignment is stewardship, not prophecy
  • Pessimism ≠ intelligence
  • To see a failure mode is to inherit responsibility for it
Full ethos →

HOW I GOT HERE

2023

I started with the intuition that self-supervised approaches were undersold. The field was drunk on supervised fine-tuning; I wrote "The Future Remains Unsupervised" for Deep Learning Indaba as a counterweight.

2024

Then I got interested in failure modes. Why do models produce confident nonsense? The term "hallucination" felt too forgiving; I started calling it confabulation, borrowing from neuroscience. Began building metrics for measuring it.

2025

Now I'm going deeper: mechanistic interpretability. Can we detect when a model "knows" it's being evaluated? Can we find the exact pathways where reasoning goes wrong? The work is harder but more satisfying.


CORE AREAS OF INVESTIGATION

01

AI Safety & Alignment

Ensuring AI systems behave as intended and remain safe as they scale. Focused on technical approaches to alignment and robustness.

02

Mechanistic Interpretability

Reverse-engineering the internal computations of neural networks to understand how they process information and form representations.

03

Reasoning Faithfulness

Investigating whether chain-of-thought reasoning traces in LLMs genuinely reflect the model's actual computation process.

04

AI Control

Developing techniques to ensure powerful AI systems remain controllable even when they exceed human-level capabilities. Practical guardrails over theoretical guarantees.

05

Scalable Oversight

Developing methods to reliably evaluate and supervise AI systems that may exceed human-level capabilities in narrow domains.

RESEARCH DIARY

Working notes, paper annotations, technical history from n-grams to Transformers.

Browse the repo →

$ ls projects/

Things I've built and contributed to

★ FEATURED

> CoT-Faithfulness-Mech-Interp

Mechanistic interpretability analysis of chain-of-thought faithfulness. Includes activation patching, circuit analysis, and benchmarks for evaluating whether reasoning traces genuinely reflect model computations.

Python · PyTorch · TransformerLens
IN PROGRESS

> Reasoning Circuit Atlas

Interactive atlas mapping reasoning circuits across model scales. Visualizes how faithfulness circuits develop during training.

Python · Plotly · Streamlit
EXPLORATION

> Superposition & Reasoning

How superposition in MLP layers affects the model's ability to maintain faithful intermediate reasoning states.

SAE · Feature Extraction
IN DEVELOPMENT

> PatchLib

Lightweight library for structured activation patching experiments with clean abstractions for causal tracing.

Python · PyTorch · Open Source

> FaithBench

A benchmark for evaluating the faithfulness of chain-of-thought reasoning in large language models. Measures whether reasoning traces genuinely reflect model computations.

AI Safety · Benchmark · LLM

> CircuitLens

Visualization tool for mechanistic interpretability circuits. Maps neural network components to human-understandable computation graphs.

Interpretability · Visualization · Open Source

> SafePrompt

A safety-aware prompt engineering framework that systematically tests prompt injection vulnerabilities and robustness in LLM systems.

AI Safety · Prompt Engineering · Security

> TensorLens

Neural network activation analysis and visualization toolkit. Inspect intermediate layer activations to understand model behavior at a granular level.

ML Tools · Visualization · PyTorch

> ThinkAloud

Tool for analyzing model reasoning chains step by step. Identifies where reasoning diverges from intended logic and quantifies faithfulness.

Interpretability · Reasoning · Research

$ ls writing/

Research papers and essays

★ Publications

[01] AkiliCode-14B: Repair-Oriented Post-Training for an Open 14B Code Model

Technical Report · 2025

[02] Sauti TTS: A Swahili Text-to-Speech Technical Report

Technical Report · 2025

[03] Sauti STT V1: Swahili Automatic Speech Recognition Performance and Error Analysis

Technical Report · 2024

[04] Sauti ASR Technical Report

Technical Report · 2024

[05] The Future Remains Unsupervised

Deep Learning Indaba · 2023

✎ Essays & Blog Posts

> AI Priesthood vs AI Pluralism

On builders, clerics, and the future of intelligence. Every major technological revolution produces two kinds of institutions, and AI is no different.

> Why I Pivoted from Alignment to Control

On moving from "how do we make AI want what we want?" to "how do we keep AI from doing what we don't want?"

> My Research Ethos for Alignment & Interpretability

Builder, anti-doomer, anti-gnostic. I approach AI alignment and interpretability as fields of repair, not apocalyptic forecasting.

> Detecting Unfaithful Chain of Thought

Technical deep-dive into detecting when language models produce reasoning traces that don't reflect their actual computation.

> The Anticipatory Disruption Trap

On the failure mode of preemptively optimizing for disruptions that never materialize.

> State of AI 2025

A personal take on where the field stands and where it's drifting.

> RLHF'd to Death

On the homogenizing effects of RLHF and what we lose when we optimize too hard for "helpfulness."

> Inverse Scaling

When bigger models get worse, and what that tells us about the limits of scale.


$ cat now.md

What I'm working on now

⚑

WORKING ON

Researcher in Amnesty International Kenya's RightUp 2.0 program, a youth-led research initiative investigating tech-facilitated repression and how surveillance tools and digital tactics are used to silence activists and civil society.

Junior Research Fellow at the ILINA Program, focusing on technical AI safety and mechanistic interpretability. Building "runtime safety governors" that use activation steering to detect and correct deceptive model behavior in real time, and testing whether these internal safety controls generalize to Swahili.

Leading partnerships at GDG Pwani: currently prepping for Google I/O Extended Pwani 2026 and reaching out to partners and sponsors.

💭

THINKING ABOUT

The politics of AI governance. Who gets to decide what "safe" means, and whether the current landscape is trending toward pluralism or priesthood.

The gap between alignment research and deployment reality. The field has a theory-practice problem that nobody wants to name directly.

📖

READING

The Infinity Machine by Sebastian Mallaby: the biography of Demis Hassabis and the story of DeepMind's quest toward superintelligence.

Re-reading Seeing Like a State (Scott), which keeps being relevant. Working through the Anthropic interpretability papers to trace how the field's assumptions evolved.

🔧

BUILDING

A small tool for visualizing attention patterns in a way that's actually useful for debugging. Early stages. May never ship.

Last updated: May 2026


$ cat contact.md

Get in touch

Feel free to reach out for collaborations, research discussions, or just to say hello.

I'm always interested in connecting with fellow researchers, developers, and anyone passionate about AI safety.