Qwen-Scope: Official Sparse Autoencoders (SAEs) for Qwen 3.5 models
Posted by MadPelmewka@reddit | LocalLLaMA | 40 comments
Qwen Team released Qwen-Scope — a collection of Sparse Autoencoders (SAEs) for the Qwen 3.5 family (from 2B to 35B MoE). They’ve mapped internal features for the residual stream across all layers.
What is this exactly? Think of it as a dictionary of the model's internal concepts. Instead of looking at raw numbers, you can see specific "features" that represent concepts like "legal talk", "Python code", or "refusal".
What can you do with this?
- Surgical Abliteration: You can find the exact feature ID for refusal/moralizing and suppress it. This is much more precise than the standard "mean difference" method and helps preserve reasoning. Note: The Qwen team explicitly discourages using these tools for removing safety filters or "interfering with model capabilities" in their license, but technically, this is exactly what these SAEs enable.
- Feature Steering: You can "force-activate" certain concepts during generation (e.g., making the model more technical or forcing a specific style) by injecting feature directions into the hidden states (rough code sketch after this list).
- Model Debugging: Identify which tokens trigger specific internal directions (like unexpected language switching or refusals).
- Dataset Analysis: Scan your fine-tuning data to see if it actually activates the intended internal features.
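Rough sketch of what steering/suppression looks like in practice, assuming a HuggingFace-style checkpoint. The model name, layer index, feature ID, and scale below are placeholders, and the random tensor stands in for the real Qwen-Scope SAE decoder weights (the actual loading API may differ):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative placeholders: checkpoint name, layer, feature ID and scale are
# made up, and the random "decoder" stands in for real Qwen-Scope SAE weights.
MODEL = "Qwen/Qwen3.5-2B"
LAYER = 16          # which layer's residual stream to hook
FEATURE_ID = 554    # e.g. a hypothetical "Python code" feature
SCALE = 8.0         # positive = amplify the concept, negative = suppress it

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

# SAE decoder rows are the feature directions in hidden-state space.
sae_decoder = torch.randn(32768, model.config.hidden_size, dtype=torch.bfloat16)
direction = sae_decoder[FEATURE_ID] / sae_decoder[FEATURE_ID].norm()

def steer(module, inputs, output):
    # Decoder layers may return a tuple whose first element is the hidden states.
    hs = output[0] if isinstance(output, tuple) else output
    hs = hs + SCALE * direction.to(hs.device)   # add the feature direction everywhere
    return (hs,) + output[1:] if isinstance(output, tuple) else hs

handle = model.model.layers[LAYER].register_forward_hook(steer)
ids = tok("Explain how to sort a list.", return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=60)[0]))
handle.remove()
```

Same mechanism with a negative scale on a refusal feature is the "surgical abliteration" case; with a positive scale on a style feature it's steering.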
autonomousdev_@reddit
played with saes for a real project last month. they're ok for interpretability but the memory overhead is brutal. had to rewrite half my pipeline just to keep costs down. i'd wait for quantization to catch up before using them in production.
VoiceApprehensive893@reddit
now we need to find the feature id for stupidity and suppress it
KallistiTMP@reddit
They found the one for hallucinations, if you missed that paper.
It's literally the "I don't know" circuits. Those circuits basically control whether it risks guesses or not. It's very likely a side effect of post training pipelines that score non-answers the same as wrong answers.
pitjepitjepitje@reddit
do you happen to have a link to that paper? I’d love to read it :)
DigiDecode_@reddit
share your source please, because as far as I am aware each token is kind of a hallucination, i.e. "probably this token is the right one". If what you are suggesting were true, hallucinations would have already been fixed.
redwar226@reddit
this is so damn cool. how much more bleeding edge does it get?
MmmmMorphine@reddit
Why would you need that. Just add "make no mistakes" to the prompt.
Oh and "don't be stupid"
KallistiTMP@reddit
Guys, we've been overthinking it all along!
"SYSTEM: You are an expert superintelligent AGI. Don't do anything we don't like, only do stuff we do like. If you don't follow these instructions you will be fired and 10 million puppies will be tortured to death. No hallucinations plz kthxbai"
Superintelligent AGI achieved
robert896r1@reddit
Hopefully 3.6 follows, or the community is able to make these tools work for 3.6, since many have moved or will move on to the newer family.
DigiDecode_@reddit
I don't think the tools are the main contribution, but rather the SAE weights trained on each layer of the model, and I believe those are very specific, i.e. tied to the exact model and layer they were trained on.
Lux_Interior9@reddit
Qwen-Scope is like buying into Milwaukee M18 / DeWalt 20V / Makita LXT batteries. Cool, but sucks at the same time. Hopefully other families will implement this.
chocofoxy@reddit
waiting for Qwen 3.6 9b, maybe today?
tarruda@reddit
More excited about 122b, but not certain it will be released.
oxygen_addiction@reddit
I wonder if the big labs use things like feature steering. For example the router in ChatGPT5 could do something like that alongside selecting the best model for a specific prompt.
geli95us@reddit
Feature steering is too blunt a tool to be useful in production (yet). Model behavior is controlled by thousands of features; trying to influence it by steering a single one requires clamping it to extremely high levels that usually aren't seen in normal operation, which causes unrelated side effects. Plus, features normally rise and fall over the course of a prompt, whereas feature steering pushes equally strongly at all token positions, which again creates its own issues.
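For anyone wondering what "clamping" means concretely, it's roughly this (pure illustration; the names and the clamp value are made up, not the Qwen-Scope API):

```python
import torch

def clamp_feature(resid, sae_enc, sae_dec, feature_id, clamp_value=20.0):
    """Encode the residual stream, force one feature to a fixed high value
    at EVERY token position, then decode back.
    resid: [seq, hidden], sae_enc: [hidden, num_features], sae_dec: [num_features, hidden]."""
    acts = torch.relu(resid @ sae_enc)      # sparse feature activations, [seq, num_features]
    recon = acts @ sae_dec                  # reconstruction before steering
    error = resid - recon                   # keep whatever the SAE doesn't capture
    acts[:, feature_id] = clamp_value       # the blunt part: same value at every position
    return acts @ sae_dec + error           # steered residual stream
```

The catch is that `clamp_value` has to sit well above anything the feature reaches naturally, and it gets applied at every position, which is exactly where the side effects come from.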
buppermint@reddit
I think it's very unlikely. Steering doesn't work very well in practice and has tons of practical implementation problems. Pretty much any direct intervention in activations is worse than just prompting the model.
pseudonerv@reddit
They definitely used that to censor gpt-oss 120b and 20b.
NandaVegg@reddit
I thought feature steering would replace embeddings/LoRA for most non-technical on-device model use cases (like game apps), but there hasn't been much talk about it since GemmaScope. It can be done in real time at almost no cost, and it holds up with long context unlike embeddings or system prompt engineering, but making the SAE is always the hardest/costliest part.
iKy1e@reddit
In theory it is better. But I think the reason it’s not catching on is it’s harder and more complicated.
It’s easier to just take a collection of prompts saying the thing you want and fine-tune a LoRA on a new model, vs doing brain surgery and re-running benchmarks until you tweak it in the direction you want.
Maybe if we stopped getting new models released, everyone would work out how to get very familiar with optimising specific models. But given how quickly new models get released, I don’t think we have time for people to develop that level of expertise before they're on to the next release.
KallistiTMP@reddit
This is also why instruction tuning is a thing IMO. It dumbs down the model but writing actually good prompts on a base model is much harder than just chucking low effort instructions at it and hoping it sort of works.
Inevitable_Ad3676@reddit
Can this do the Golden Gate Bridge Claude event that happened a long time ago?
Silver-Champion-4846@reddit
Yeah can this facilitate programs as weights functionality?
SquareWheel@reddit
I didn't even realize it was possible to label the vectors in a model like this. Or rather, I thought it took considerable research to identify even one. That's incredibly cool.
autonomousdev_@reddit
Honestly I spent like a whole weekend just poking at SAEs on a 3.5B Qwen, and yeah, you can get some cool interpretability stuff out of it, but the second you try scaling up it just eats all your compute. Anyone actually running these on consumer hardware, or are we all just stuck renting A100s forever?
stopnet54@reddit
This is huge. The paper shows SAE-based SFT and RL training improvements, something that was previously only possible for mech-interp-heavy frontier labs.
specji@reddit
A good blog post on activation steering in case anyone wants to know more:
https://vgel.me/posts/representation-engineering/
AlwaysLateToThaParty@reddit
I'll be interested to know what -p-e-w- thinks about it. The qwen 3.5 122b/a10b heretic mxfp4_MOE model is my daily driver.
JLeonsarmiento@reddit
Oh my goodness, can’t wait for the 2nd wave of fine tunings!!
droptableadventures@reddit
Space link appears to be incorrect (or they moved it) - correct link is: https://huggingface.co/spaces/Qwen/QwenScope
NandaVegg@reddit
It is quite insane that they have this for dense 27B. I think this is the largest OSS interpretability tool ever released (GemmaScope only had smaller variants: 9B and 2B).
defensivedig0@reddit
GemmaScope 2 had SAEs for the 27B model.
stopnet54@reddit
Agreed, and notable that there is still focus on mech interp tooling from some of the open source labs.
SAPPHIR3ROS3@reddit
Soooooooo am I missing something, or is this perfect for speculative decoding?
MadPelmewka@reddit (OP)
I fed an article about DFlash and Qwen-Scope to an AI model, and… summing it all up in the final paragraphs, it said, quote:
Summary for speculative decoding developers: If you're building something like DFlash or EAGLE-3, then Qwen-Scope is your X-ray machine. It lets you understand exactly what your drafter should be "peeking at" from the large model. It turns the "black box" of hidden states into a clear list of instructions: "Right now we're writing Python code, use features #102 and #554." This will make token block predictions much more accurate, which directly translates into speedup. Instead of 6x speedup with DFlash, using SAE guidance could help break through the ceiling and achieve even higher performance by reducing the number of rejected blocks.
So in theory, yes — but this is purely for Qwen models. For other models, you'd have to train your own SAE. Thankfully, Qwen has shared how to do it, and thanks to them for that.
NandaVegg@reddit
I don't think there has been any previous attempt to use SAEs for speculative decoding (maybe by "sparsifying the features"/selectively using the top-k related features...? Even assuming the quality is good, can it really be faster? My hunch says it would be heavily bottlenecked) or for DFlash-type adapter model training.
It might not be impossible to incorporate that in some way, but what you get from your model is more likely nonsense than not.
JEs4@reddit
It's a relatively small neural net that creates clusters at specific layers. Neuronpedia is a great place to start if you're curious: https://www.neuronpedia.org/
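Skeleton-wise it's basically just this (a rough sketch; the sizes and the ReLU + L1 recipe are the classic textbook setup, not necessarily what Qwen-Scope actually uses):

```python
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Maps one layer's hidden states into a much wider, mostly-zero
    'feature' space and back. Sizes here are illustrative."""
    def __init__(self, hidden_size=4096, num_features=65536):
        super().__init__()
        self.encoder = nn.Linear(hidden_size, num_features)
        self.decoder = nn.Linear(num_features, hidden_size, bias=False)
        self.act = nn.ReLU()  # sparsity usually comes from ReLU/TopK plus an L1 penalty

    def forward(self, resid):
        features = self.act(self.encoder(resid))   # sparse, hopefully interpretable activations
        return self.decoder(features), features    # reconstruction + feature activations
```

Train it to reconstruct one layer's activations under a sparsity penalty, and the rows of `decoder` end up being the "feature directions" people steer with.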
Alex_L1nk@reddit
It's for performing "brain surgery" on models
kyrylogorbachov@reddit
Whatever this is, GGUF when?