Qwen-Scope: Official Sparse Autoencoders (SAEs) for Qwen 3.5 models

Posted by MadPelmewka@reddit | LocalLLaMA | 40 comments

The Qwen team released Qwen-Scope — a collection of Sparse Autoencoders (SAEs) for the Qwen 3.5 family (from 2B up to the 35B MoE). They've mapped internal features of the residual stream across all layers.

What is this exactly? Think of it as a dictionary of the model's internal concepts. Instead of looking at raw numbers, you can see specific "features" that represent concepts like "legal talk", "Python code", or "refusal".
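To make the "dictionary" intuition concrete, here is a minimal sketch of how an SAE reads out features from a hidden state. Everything here (sizes, weights, the TopK sparsity scheme) is a hypothetical stand-in, not the actual Qwen-Scope architecture; a real release would ship trained encoder/decoder matrices per layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features, k = 64, 512, 8  # hypothetical sizes

# Fake SAE weights; real ones come from the released checkpoints.
W_enc = rng.normal(0.0, 0.1, (d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(0.0, 0.1, (n_features, d_model))

def sae_encode(h, k=k):
    """Residual-stream vector -> sparse feature activations (ReLU + TopK)."""
    acts = np.maximum(h @ W_enc + b_enc, 0.0)
    if np.count_nonzero(acts) > k:
        threshold = np.sort(acts)[-k]          # k-th largest activation
        acts = np.where(acts >= threshold, acts, 0.0)
    return acts

def sae_decode(f):
    """Sparse feature activations -> reconstructed hidden state."""
    return f @ W_dec

h = rng.normal(size=d_model)                   # stand-in for a captured state
feats = sae_encode(h)                          # mostly zeros: the "dictionary" readout
top_ids = np.argsort(feats)[::-1][:3]          # strongest feature IDs for this vector
recon = sae_decode(feats)
```

The nonzero entries of `feats` are the active "concepts"; a feature index that fires consistently on, say, Python code is what the release's feature maps are cataloguing.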

What can you do with this?

  1. Surgical Abliteration: You can find the exact feature ID for refusal/moralizing and suppress it. This is far more precise than the standard "mean difference" method and better preserves reasoning. Note: the Qwen team's license explicitly discourages using these tools to remove safety filters or "interfere with model capabilities", but technically, that is exactly what these SAEs enable.
  2. Feature Steering: You can "force-activate" certain concepts during generation (e.g., making the model more technical or forcing a specific style) by injecting feature directions into the hidden states.
  3. Model Debugging: Identify which tokens trigger specific internal directions (like unexpected language switching or refusals).
  4. Dataset Analysis: Scan your fine-tuning data to see if it actually activates the intended internal features.
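The first two use cases reduce to the same linear operation on a hidden state: a feature's decoder row is a direction in residual space, which you can project out (suppression) or add in (steering). A minimal sketch under made-up weights; `REFUSAL_ID`, the matrices, and the `alpha` scale are all hypothetical placeholders, not real Qwen-Scope values.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_features = 64, 512                        # hypothetical sizes
W_dec = rng.normal(0.0, 0.1, (n_features, d_model))  # fake SAE decoder

REFUSAL_ID = 123  # placeholder: look up the real ID in the feature index

def feature_direction(feature_id):
    """Unit-norm decoder row for a feature: its direction in residual space."""
    d = W_dec[feature_id]
    return d / np.linalg.norm(d)

def suppress(h, feature_id):
    """Surgical ablation: remove the component of h along the feature."""
    d = feature_direction(feature_id)
    return h - (h @ d) * d

def steer(h, feature_id, alpha=4.0):
    """Feature steering: push h along the feature direction with strength alpha."""
    return h + alpha * feature_direction(feature_id)

h = rng.normal(size=d_model)                   # stand-in for a captured state
h_clean = suppress(h, REFUSAL_ID)              # projection leaves zero overlap
h_steered = steer(h, REFUSAL_ID, alpha=2.0)    # overlap shifted by exactly alpha
```

In practice you would apply these as a forward hook on the target layer, and tune `alpha` (or ablate across several layers) until the behavior change is stable.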
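Debugging and dataset analysis both come down to scanning per-token activations for target features. A sketch of that loop, again with fabricated weights; the feature IDs, threshold, and captured hidden states are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, n_features = 64, 512
W_enc = rng.normal(0.0, 0.1, (d_model, n_features))  # fake SAE encoder

TARGET_FEATURES = [7, 42]   # hypothetical IDs you expect your data to activate
THRESHOLD = 0.5             # activation cutoff; tune per feature

def scan_tokens(hidden_states, tokens, feature_ids, threshold=THRESHOLD):
    """Report which tokens activate which target features.

    hidden_states: (seq_len, d_model) residual-stream vectors for one example.
    Returns {feature_id: [tokens whose activation exceeds threshold]}.
    """
    acts = np.maximum(hidden_states @ W_enc, 0.0)    # (seq_len, n_features)
    report = {}
    for fid in feature_ids:
        hot = np.where(acts[:, fid] > threshold)[0]
        report[fid] = [tokens[i] for i in hot]
    return report

tokens = ["def", "main", "(", ")", ":"]
hs = rng.normal(size=(len(tokens), d_model))   # stand-in for captured states
report = scan_tokens(hs, tokens, TARGET_FEATURES)
```

Run this over a fine-tuning set and an empty report for your intended features is a signal the data is not exercising the concepts you think it is; unexpected tokens in the report point at triggers like spurious language switching.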