🐘

> hello world_
I'm Ashioya Jotham Victor

Exploring the mechanics of reasoning in language models
through mechanistic interpretability

GDE — AI · GDG — Partnerships Lead · ILINA — Jr Research Fellow

$ cat about.md

I'm Ashioya Jotham Victor, a researcher passionate about making AI systems more interpretable and safe. My work sits at the intersection of mechanistic interpretability, chain-of-thought reasoning, and AI alignment.

I'm currently investigating how faithful chain-of-thought reasoning emerges in transformer models: probing not just what models say, but whether their internal computations actually reflect the reasoning they output.

I believe that understanding the mechanistic basis of reasoning is crucial for building AI systems we can truly trust.

★

GDE

Google Developer Expert

📍 Mombasa, Kenya

📧 victorashioya960@gmail.com

$ ls research/

How I think about AI safety and interpretability

“We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence.”

— Noam Shazeer, GLU Variants Improve Transformer (2020)

I don't think scaling gets us to robust reasoning.

Language models exhibit what Andrej Karpathy calls "jagged intelligence": superhuman at some tasks, brittle at others humans find trivial. This jagged profile is why interpretability matters: we need to understand where and why systems break.

My work traces how specific neurons, attention heads, and pathways give rise to model capabilities and failures. I'm particularly drawn to hallucinations: when models produce confident-but-wrong outputs.

WHERE I STAND

World models > scaling language models
Alignment is stewardship, not prophecy
Pessimism ≠ intelligence
To see a failure mode is to inherit responsibility for it

Full ethos →

HOW I GOT HERE

2023

I started with the intuition that self-supervised approaches were undersold. The field was drunk on supervised fine-tuning; I wrote "The Future Remains Unsupervised" for Deep Learning Indaba as a counterweight.

2024

Then I got interested in failure modes. Why do models produce confident nonsense? The term "hallucination" felt too forgiving—I started calling it confabulation, borrowing from neuroscience. Began building metrics for measuring it.

2025

Now I'm going deeper: mechanistic interpretability. Can we detect when a model "knows" it's being evaluated? Can we find the exact pathways where reasoning goes wrong? The work is harder but more satisfying.

CORE AREAS OF INVESTIGATION

AI Safety & Alignment

Ensuring AI systems behave as intended and remain safe as they scale. Focused on technical approaches to alignment and robustness.

Mechanistic Interpretability

Reverse-engineering the internal computations of neural networks to understand how they process information and form representations.

Reasoning Faithfulness

Investigating whether chain-of-thought reasoning traces in LLMs genuinely reflect the model's actual computation process.

AI Control

Developing techniques to ensure powerful AI systems remain controllable even when they exceed human-level capabilities. Practical guardrails over theoretical guarantees.

Scalable Oversight

Developing methods to reliably evaluate and supervise AI systems that may exceed human-level capabilities in narrow domains.

📝 RESEARCH DIARY

Working notes, paper annotations, technical history from n-grams to Transformers.

Browse the repo →

$ ls projects/

Things I've built and contributed to

★ FEATURED

> CoT-Faithfulness-Mech-Interp

Mechanistic interpretability analysis of chain-of-thought faithfulness. Includes activation patching, circuit analysis, and benchmarks for evaluating whether reasoning traces genuinely reflect model computations.

Python · PyTorch · TransformerLens

GitHub

IN PROGRESS

> Reasoning Circuit Atlas

Interactive atlas mapping reasoning circuits across model scales. Visualizes how faithfulness circuits develop during training.

Python · Plotly · Streamlit

EXPLORATION

> Superposition & Reasoning

How superposition in MLP layers affects the model's ability to maintain faithful intermediate reasoning states.

SAE · Feature Extraction

IN DEVELOPMENT

> PatchLib

Lightweight library for structured activation patching experiments with clean abstractions for causal tracing.

Python · PyTorch · Open Source

> FaithBench

A benchmark for evaluating the faithfulness of chain-of-thought reasoning in large language models. Measures whether reasoning traces genuinely reflect model computations.

AI Safety · Benchmark · LLM

GitHub

★ FEATURED

> Eval Awareness Research

Mechanistic interpretability experiments investigating whether reasoning models internally represent when they're being evaluated. Inspired by Anthropic's alignment faking research.

Mechanistic Interpretability · AI Safety · Probing

GitHub

> ThinkAloud

Tool for analyzing model reasoning chains step by step. Identifies where reasoning diverges from intended logic and quantifies faithfulness.

Interpretability · Reasoning · Research

GitHub

$ cat writing/

Research papers and essays

★ Publications

[01] AkiliCode-14B: Repair-Oriented Post-Training for an Open 14B Code Model

Technical Report · 2025

Paper →

[02] Sauti TTS: A Swahili Text-to-Speech Technical Report

Technical Report · 2025

Paper →

[03] Sauti STT V1: Swahili Automatic Speech Recognition Performance and Error Analysis

Technical Report · 2024

Paper →

[04] Sauti ASR Technical Report

Technical Report · 2024

Paper →

[05] The Future Remains Unsupervised

Deep Learning Indaba · 2023

Paper →

✎ Essays & Blog Posts

> AI Priesthood vs AI Pluralism

On builders, clerics, and the future of intelligence. Every major technological revolution produces two kinds of institutions — and AI is no different.

2026-02 · Read →

> Why I Pivoted from Alignment to Control

On moving from "how do we make AI want what we want?" to "how do we keep AI from doing what we don’t want?"

2026-01 · Read →

> My Research Ethos for Alignment & Interpretability

Builder, anti-doomer, anti-gnostic. I approach AI alignment and interpretability as fields of repair, not apocalyptic forecasting.

2026-01 · Read →

> Detecting Unfaithful Chain of Thought

Technical deep-dive into detecting when language models produce reasoning traces that don’t reflect their actual computation.

2025 · Read →

> The Anticipatory Disruption Trap

On the failure mode of preemptively optimizing for disruptions that never materialize.

2025 · Read →

> State of AI 2025

A personal take on where the field stands and where it’s drifting.

2025 · Read →

> RLHF’d to Death

On the homogenizing effects of RLHF and what we lose when we optimize too hard for "helpfulness."

2024 · Read →

> Inverse Scaling

When bigger models get worse — and what that tells us about the limits of scale.

2024 · Read →

$ cat now.md

What I'm working on now

⚡

WORKING ON

Researcher in Amnesty International Kenya's RightUp 2.0 program — a youth-led research initiative investigating tech-facilitated repression and how surveillance tools and digital tactics are used to silence activists and civil society.

Junior Research Fellow at the ILINA Program, focusing on technical AI safety and mechanistic interpretability. Building "runtime safety governors" that use activation steering to detect and correct deceptive model behavior in real time, and testing whether these internal safety controls generalize to Swahili.

Leading partnerships at GDG Pwani — currently prepping for Google I/O Extended Pwani 2026, reaching out to partners and sponsors.

💭

THINKING ABOUT

The politics of AI governance. Who gets to decide what "safe" means, and whether the current landscape is trending toward pluralism or priesthood.

The gap between alignment research and deployment reality. The field has a theory-practice problem that nobody wants to name directly.

📖

READING

The Infinity Machine by Sebastian Mallaby — the biography of Demis Hassabis and the story of DeepMind's quest toward superintelligence.

Re-reading Seeing Like a State (Scott) — keeps being relevant. Working through the Anthropic interpretability papers to trace how the field's assumptions evolved.

🔧

BUILDING

A small tool for visualizing attention patterns in a way that's actually useful for debugging. Early stages. May never ship.

Last updated: May 2026

$ cat contact.md

Get in touch

Feel free to reach out for collaborations, research discussions, or just to say hello.

I'm always interested in connecting with fellow researchers, developers, and anyone passionate about AI safety.

📧 victorashioya960@gmail.com · 🔗 GitHub · 🐦 @ashioyajotham_ · 🎓 Google Scholar · 💼 LinkedIn
