Activation Exposure & Feature Interpretability for GGUF via llama-server

Posted by wattswrites@reddit | LocalLLaMA | 8 comments

You can now capture per-layer activation vectors from llama-server during inference, train sparse autoencoders on them, discover which internal features correspond to specific behaviors (sycophancy, hedging, creativity, etc.), and extract those features as GGUF control vectors for real-time steering.

**What this is:**

A C++ patch to llama-server that adds `/activations` endpoints, plus a Python pipeline for the full SAE workflow. The patch is ~400 lines across 5 files and adds:

* `GET /activations`: query per-layer mean activations (with top-K filtering)

* `POST /activations`: enable/disable capture

* `POST /activations/collect`: stream full per-token vectors to a binary file for offline training
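
A minimal client sketch for the three endpoints above, using only the standard library. The endpoint paths come from the post; the JSON field names (`enabled`, `path`), the assumption that responses are JSON, and the helper names are hypothetical placeholders, so check the PR for the real request schema before relying on them.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # llama-server's default host and port


def _post(url, payload):
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_requests(base_url=BASE_URL, out_path="activations.bin"):
    """Return one unsent request per endpoint listed in the post.

    The payload keys are assumptions made for illustration only.
    """
    return {
        "enable": _post(f"{base_url}/activations", {"enabled": True}),
        "query": urllib.request.Request(f"{base_url}/activations", method="GET"),
        "collect": _post(f"{base_url}/activations/collect", {"path": out_path}),
    }


def send(request, timeout=30):
    """Send a request to a running patched server and decode a JSON reply."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


requests_by_name = build_requests()
```

With a patched server running, `send(requests_by_name["enable"])` followed by `send(requests_by_name["query"])` would be the rough order of operations.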

**What you can do with it:**

* **Monitor activations live**: see which features fire strongest during a conversation

* **Collect training data**: stream per-token activation vectors to disk while running inference

* **Train a sparse autoencoder**: decompose activations into ~16K interpretable features (takes about 40 seconds on an RTX 3090)

* **Discover behavioral features**: define phrase clusters ("sycophantic phrases", "hedging phrases", etc.) and find which features are unique to each behavior

* **Extract control vectors**: turn discovered features into GGUF files you can load with `--control-vector-scaled`

* **Steer in real time**: suppress sycophancy, amplify creativity, whatever you want, at the feature level
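
To make the "train a sparse autoencoder" step concrete, here is a toy numpy sketch of a ReLU autoencoder with an L1 sparsity penalty, trained by plain full-batch gradient descent. It is not the companion repo's trainer: the sizes (16-dim inputs, 64 features), the hyperparameters, and the random stand-in data are all assumptions chosen so it runs in a second on CPU.

```python
import numpy as np


def init_sae(d_model, n_features, rng):
    """Random small weights, zero biases."""
    return {
        "W_enc": rng.normal(0.0, 0.1, (d_model, n_features)),
        "b_enc": np.zeros(n_features),
        "W_dec": rng.normal(0.0, 0.1, (n_features, d_model)),
        "b_dec": np.zeros(d_model),
    }


def sae_step(p, x, l1=1e-3, lr=1e-2):
    """One full-batch gradient step; returns the loss before the update."""
    n = x.shape[0]
    z_pre = x @ p["W_enc"] + p["b_enc"]
    z = np.maximum(z_pre, 0.0)           # sparse feature activations
    x_hat = z @ p["W_dec"] + p["b_dec"]  # reconstruction of the input
    err = x_hat - x
    # Reconstruction error plus L1 penalty (z is non-negative, so |z| = z).
    loss = (err ** 2).sum() / n + l1 * z.sum() / n

    # Manual backward pass.
    d_xhat = 2.0 * err / n
    d_z = d_xhat @ p["W_dec"].T + l1 / n
    d_zpre = d_z * (z_pre > 0)
    grads = {
        "W_dec": z.T @ d_xhat,
        "b_dec": d_xhat.sum(axis=0),
        "W_enc": x.T @ d_zpre,
        "b_enc": d_zpre.sum(axis=0),
    }
    for name in p:
        p[name] -= lr * grads[name]
    return float(loss)


rng = np.random.default_rng(0)
acts = rng.normal(size=(256, 16))  # stand-in for collected activations
params = init_sae(d_model=16, n_features=64, rng=rng)
losses = [sae_step(params, acts) for _ in range(100)]
```

A real run would swap `acts` for the collected per-token vectors and use a GPU framework with minibatches; the structure of the loss stays the same.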

**How it works technically:**

The patch hooks into llama.cpp's existing `cb_eval` callback to intercept `l_out` tensors (layer outputs) during the forward pass. Each tensor is copied from GPU to CPU via `ggml_backend_tensor_get()` and stored in a mutex-protected global struct. The binary collection format is dead simple: a 16-byte header followed by float32 arrays, directly readable with numpy.
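
As a rough illustration of "16-byte header + float32 arrays", the sketch below writes and reads such a file with numpy. The post does not spell out the header fields, so the layout used here (four little-endian uint32 values: magic, version, layer count, embedding size) and the magic number are purely hypothetical; only the 16-byte size and the float32 payload come from the post.

```python
import os
import struct
import tempfile

import numpy as np

HEADER_FMT = "<4I"  # hypothetical layout: magic, version, n_layers, n_embd
HEADER_SIZE = struct.calcsize(HEADER_FMT)  # 4 * 4 bytes = 16


def write_collection(path, n_layers, n_embd, vectors):
    """Write a header followed by row-major float32 vectors."""
    with open(path, "wb") as f:
        f.write(struct.pack(HEADER_FMT, 0x41435456, 1, n_layers, n_embd))
        f.write(np.asarray(vectors, dtype="<f4").tobytes())


def read_collection(path):
    """Read the header, then load everything after it as float32 rows."""
    with open(path, "rb") as f:
        _magic, _version, n_layers, n_embd = struct.unpack(
            HEADER_FMT, f.read(HEADER_SIZE)
        )
    data = np.fromfile(path, dtype="<f4", offset=HEADER_SIZE)
    return {"n_layers": n_layers, "n_embd": n_embd}, data.reshape(-1, n_embd)


# Round-trip a tiny synthetic file to show the shape handling.
demo = np.arange(12, dtype=np.float32).reshape(3, 4)
demo_path = os.path.join(tempfile.mkdtemp(), "demo_activations.bin")
write_collection(demo_path, n_layers=1, n_embd=4, vectors=demo)
meta, loaded = read_collection(demo_path)
```

For multi-gigabyte collections, `np.memmap` with the same dtype and offset would avoid loading the whole file into RAM.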

The SAE pipeline is standard: collect activations → train sparse autoencoder → probe features with behavioral phrase clusters → extract feature directions as control vectors. The interesting part is the inter-cluster differential scoring: instead of just finding "features that fire on sycophantic text," it finds features that fire *significantly more* on sycophantic text than on any other cluster, so you get specific behavioral features rather than generic language features.
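
One way to read the inter-cluster differential scoring described above: for each cluster, take a feature's mean activation and subtract the highest mean it reaches in any other cluster. The sketch below does that on made-up numbers; the function names, the margin threshold, and the exact scoring formula are assumptions, and the companion repo may define "significantly more" differently.

```python
import numpy as np


def differential_scores(cluster_acts):
    """cluster_acts maps cluster name -> (n_phrases, n_features) activations.

    Returns cluster name -> per-feature score: mean activation in that
    cluster minus the largest mean activation in any other cluster.
    Needs at least two clusters.
    """
    names = list(cluster_acts)
    means = np.stack([cluster_acts[n].mean(axis=0) for n in names])
    scores = {}
    for i, name in enumerate(names):
        others = np.delete(means, i, axis=0)
        scores[name] = means[i] - others.max(axis=0)
    return scores


def top_features(scores, k=5, min_margin=0.0):
    """Indices of the k best-scoring features that clear the margin."""
    order = np.argsort(scores)[::-1][:k]
    return [int(j) for j in order if scores[j] > min_margin]


# Toy data: feature 0 fires only on sycophantic phrases, feature 1 fires
# on everything (generic language), feature 2 fires only on hedging.
acts = {
    "sycophancy": np.array([[0.9, 0.8, 0.0], [1.1, 0.7, 0.0]]),
    "hedging": np.array([[0.0, 0.9, 0.6], [0.1, 0.8, 0.8]]),
}
scores = differential_scores(acts)
syco_features = top_features(scores["sycophancy"], k=2)
hedge_features = top_features(scores["hedging"], k=2, min_margin=0.2)
```

Feature 1 is active in both clusters, so its differential score stays near zero and it drops out, which is the point of scoring against the other clusters rather than against silence.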

**PR + repo:**

* llama.cpp PR: https://github.com/ggml-org/llama.cpp/pull/20785

* Companion repo with the full SAE pipeline, guide, and example clusters: https://github.com/hrhdegenetrix/llama-sae-feature-interpretability

The companion repo has a quickstart script, example behavioral cluster definitions, and a comprehensive guide covering the full workflow.

**Notes:**

* MoE models are *extremely* sensitive to control vector scales. Dense models (Qwen3-8B, 4096 embd) handle scales of 0.15-0.6 fine. Qwen3.5-35B-A3B MoE (2048 embd) needs 0.01-0.05 or output goes garbled.

* The eval callback registration had a bug where it only got set inside the graph-reuse branch, so capture silently stopped working after the first inference. Took a while to track that one down.

* You need ~500K tokens of activation data for a good SAE. Harry's DPO conversations are ~14K tokens each, so 20 rows gets you there.

* Persona DPO overfits by step 200 with small datasets. Step 200 was the sweet spot (~97% eval accuracy).

* SAEs are not the be-all and end-all here. They are only one of several pathways to feature interpretability, but they are a simple approach and the process should be fairly adaptable.
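
For the scale note at the top of this list, a small helper sketch that clamps a requested scale into the range reported for each model type and builds a `llama-server` command line. The ranges are only what was observed on the two models named above, the file names in the example are placeholders, and the helper itself is hypothetical; `--control-vector-scaled FNAME SCALE` is the existing llama.cpp flag.

```python
import shlex

# Ranges reported in the notes above for Qwen3-8B (dense) and
# Qwen3.5-35B-A3B (MoE); other models may need different values.
SCALE_RANGES = {"dense": (0.15, 0.6), "moe": (0.01, 0.05)}


def clamp_scale(scale, kind):
    """Clamp a requested scale into the reported range for this model kind."""
    low, high = SCALE_RANGES[kind]
    return min(max(scale, low), high)


def server_command(model_path, vector_path, scale, kind):
    """Build an argv list that loads one scaled control vector."""
    safe_scale = clamp_scale(scale, kind)
    return [
        "llama-server",
        "-m", model_path,
        "--control-vector-scaled", vector_path, f"{safe_scale:g}",
    ]


# Placeholder file names; a scale of 0.3 is fine for dense but too hot for MoE.
dense_cmd = server_command("dense-model.gguf", "sycophancy.gguf", 0.3, "dense")
moe_cmd = server_command("moe-model.gguf", "sycophancy.gguf", 0.3, "moe")
moe_cmd_line = shlex.join(moe_cmd)
```

Starting at the low end of the range and raising the scale until the behavior shifts is probably safer than clamping from above, especially on MoE models.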

Enjoy!


Anemys@reddit

When applying the patch, I kept getting these errors:

error: patch failed: tools/server/server-context.cpp:1

error: tools/server/server-context.cpp: patch does not apply

error: patch failed: tools/server/server-context.h:115

error: tools/server/server-context.h: patch does not apply

error: patch failed: tools/server/server.cpp:199

error: tools/server/server.cpp: patch does not apply

I assume it's because it's out of date. Do you have any plan to keep maintaining it?

wattswrites@reddit (OP)

Yeah, I think it fell out of date when they updated llama.cpp to run Gemma 4. Took some time to get it up and running again so you should be able to try it fresh and get better results.

When running `git apply` to apply the patch, I keep getting these errors. Do you have any advice?

Chromix_@reddit

That pipeline looks too useful for a few people to not keep it around somewhere. If maintaining it within the "llama-server" is out of scope for the project, maybe it can be added as a dedicated "example" like the other toys that the dedicated "tool" llama-server emerged from.

Otherwise, just keeping it around as a rebased branch in your own repo might do some good for others.

[Yeah that makes more sense.](https://github.com/ggml-org/llama.cpp/pull/20820) Got them to update their CONTRIBUTING.md though lol

Well, you tried. Stand-alone tool, relatively compact code, minimal two-line modification in existing code that should not get in the way of anything.

On the positive side it's now trivial to keep it rebased.

llama-impersonator@reddit

surprised you made an activations endpoint and not one for online steering, feeding in vectors. it's a little limiting having to specify them on lcpp command line, imo. i do pretty much all of my steering stuff in transformers just so i can steer on the fly.

Corporate_Drone31@reddit

Looks like the PR has been rejected. Are you looking to push this forward in another llama.cpp CLI utility outside of the server?