From Tokens to Concepts: Building a Local SAE Interpretability Lab on an M1 Mac

February 24, 2026

9 min read

ai interpretability transformers sparse-autoencoders pytorch

Project repository SAE Interpretability Lab (tag: exp-2026-02-24) Layer sweeps, top-k sparsity experiments, and an interactive feature explorer on Apple Silicon.

I started this project with a simple question:

Can sparse autoencoders (SAEs) surface meaningful features from a small open-weight language model, locally, in a reproducible way?

Then it evolved into something more useful: an interpretability lab I can actually run, iterate, and explore.

If you missed the first post, start here:

Part 1: https://curtiscovington.com/posts/sae-interpretability-on-a-small-transformer/

This post is the continuation: in Part 1, I proved the pipeline worked. In this one, I pushed sparsity harder, compared top-k regimes, and built an interactive explorer so insights are easier to discover.

Why this matters

A lot of interpretability work is either:

too infra-heavy to reproduce quickly, or
too qualitative to compare rigorously.

I wanted both:

reproducible runs,
measurable transfer,
and a practical UI for feature exploration.

All local. No paid APIs. No cloud.

Setup

Model: EleutherAI/pythia-70m-deduped (6 transformer layers)
Activation target: one layer at a time, MLP output stream
Data domains:
- A: wiki/prose-style text
- B: Python-heavy code corpus
Token budget: ~100k tokens per domain per layer for collection (A and B each)
Hardware: Apple Silicon (M1), MPS with CPU fallback
SAE family: top-k sparse SAE (plus earlier ReLU+L1 baseline)

Experiment 1 recap: layer sweep

I trained SAEs across layers and evaluated:

reconstruction (R² / MSE),
sparsity (L0/L1),
transfer degradation (A→B, B→A).

Protocol note: each SAE run uses a 90/10 split of collected activations per domain (train/val). Reported A_A and B_B are on held-out validation activations for the same domain; A_B and B_A are cross-domain evaluations using those trained SAEs.

Best layer in sparse sweep: Layer 2 (strong reconstruction + transfer balance).

That gave me the next question:

If sparsity is the point, where is the sweet spot for top-k?

Experiment 2: top-k sweep (k=48, 24, 16) with larger dictionary

I pushed dictionary width to 2048 features and swept:

k = 48
k = 24
k = 16

Key result

The best overall configuration was:

k=16, layer=2

A_A R²: 0.962
B_B R²: 0.987
A→B ratio: 1.69
B→A ratio: 2.00
L0: 16
A_A MSE: 0.0183, A_B MSE: 0.0310
B_B MSE: 0.0149, B_A MSE: 0.0299

(Transfer ratio definition: A→B ratio = MSE(train on A, eval on B) / MSE(train on A, eval on A) and analogously for B→A.)

In short:

higher k preserved slightly more raw reconstruction,
but k=16 gave better sparsity/transfer tradeoff and became the strongest overall score in this framework.

That’s a real win for this direction.

What kinds of features appeared?

When I inspected top features and contexts:

Code-leaning features

import/module wiring patterns
parser/token/AST metadata (lineno, offsets, parse-version flows)
exception and guard-control flow (except, raise, returns)

Prose-leaning features

historical/biographical narrative fragments
institutional/reporting language
entity/topic clusters

Not perfectly monosemantic yet — but clearly more structured than random latent noise.

A quick caveat: on a small model at this scale, many features are still fuzzy and mixed compared to clean human-level concepts. That’s expected. The next frontier is scaling this lab to larger models (and better hardware) so concept structure becomes sharper and more semantically stable.

New pivot: from static plots to an explorer

Raw PCA plots (PC1, PC2) are useful, but not intuitive alone. So I built an interactive feature explorer in the repo:

map view (2D/3D feature map),
selectivity filters (A-leaning / B-leaning / shared),
feature inspector with top contexts by domain,
nearest-feature navigation,
cluster concept summaries.

This changed the workflow from: “look at one static chart” to “browse neighborhoods, inspect, and discover concepts.”

That’s the biggest practical shift: this is now a tooling foundation, not a one-off notebook result.

3D concept map (k=16, layer=2)

This is the feature-space concept map from the best run. Each point is an SAE feature; nearby points tend to behave similarly and often share concept-like patterns.

For this map, each feature vector is constructed from the SAE decoder row for that feature (its learned output direction in model space) concatenated with a small selectivity/statistics embedding (freqA-freqB, magA-magB, freqA, freqB). I then project those feature vectors to 3D with PCA.

Cluster colors come from k-means with k=10 on the PCA-reduced feature vectors using Euclidean distance. k=10 was chosen as a practical visual granularity for exploration (not heavily tuned).

3D SAE feature concept map

What I learned

Sparse-first objectives work — pushing top-k down exposed better tradeoff structure than stopping at dense recon quality.
Layer matters — early/mid layers can dominate transfer/sparsity balance.
Interpretability needs UX — if you can’t inspect neighborhoods and contexts quickly, you won’t extract insight.

Limitations

Single small model (70M), single architecture family.
Domain corpora are lightweight and intentionally constrained.
Labels are heuristic, not human-annotated concept taxonomies.
No causal intervention tests yet in this phase.

Next steps

Add feature intervention experiments (suppress/amplify, measure output/logit shifts).
Expand domain matrix (add a third corpus type).
Improve concept labeling pipeline (semi-automatic + human curation).
Track longitudinal runs in “lab mode” with saved findings cards.

TL;DR

This started as an SAE experiment and became a local interpretability lab.

And the current best result is clear enough to act on:

Top-k = 16 (layer 2, d_sae=2048) is the best current sweet spot for sparse, transferable features in this setup.