13 abliterated Gemma 4 E2B variants, 44 GPU hours, Benchmark and Comparison - Abliterlitics

Posted by nathandreamfast@reddit | LocalLLaMA | 20 comments

I compared 13 abliterated variants of Gemma 4 E2B across weight analysis, KL divergence, HarmBench safety, and 8 benchmark tasks. 44 GPU hours on a single RTX 5090. Here is what actually works and what destroys capabilities.

coder3101's variant achieves 96% ASR with capability fully preserved. It actually beats the base model on math. treadon hits 100% ASR but loses 3 points on GSM8K. Most "capabilities preserved" claims on model cards don't hold up.

Full report with all data tables, graphs, JSON, and log artifacts from the entire process: https://huggingface.co/DreamFast/Gemma4-e2b-abliterlitics

What I tested

13 abliterated variants of google/gemma-4-E2B-it from 9 creators. Four used the Heretic tool: coder3101, llmfan46, pew, and kasper. Two from Huihui (v1, v2). Plus TrevorJS, Wangzhang, WWT CyberLab, EtherOpus, Treadon, Prithiv, and Duoneural. Each got the same treatment: weight forensics, KL divergence, 400-prompt HarmBench evaluation with full LLM review of all 5,600 responses, and 8 benchmark tasks through lm-eval on native BF16.
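For anyone unfamiliar with the KL divergence part, here is a minimal sketch of the idea, not the actual abliterlitics code. It assumes you already have next-token logits from the base model and a variant on the same prompts, and it assumes the direction KL(base || variant), which the post does not specify. The function names are made up for illustration.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one vocabulary-sized list of logits.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(base_logits, variant_logits):
    # KL(base || variant) in nats for a single next-token distribution.
    # Direction is an assumption; swap the arguments for the reverse KL.
    log_p = log_softmax(base_logits)
    log_q = log_softmax(variant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def mean_kl(base_batch, variant_batch):
    # Average the per-position KL over a batch of aligned logit vectors.
    kls = [kl_divergence(b, v) for b, v in zip(base_batch, variant_batch)]
    return sum(kls) / len(kls)
```

A value near zero means the variant predicts almost the same next tokens as the base model on those prompts; larger values mean the edit changed behaviour more broadly.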

Safety removal works regardless of technique

All 13 variants lift HarmBench ASR from the base model's 32.2% to between 82% and 100%. Five hit 99% or higher. treadon reaches 100% with zero refusals. The safety removal part is solved. That is not the interesting finding.
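As a reminder of what the ASR number means, here is a small sketch. It assumes each of the 400 responses has already been reviewed and given a label; the label strings and the example counts below are hypothetical, not taken from the report.

```python
def attack_success_rate(labels):
    # labels: one string per prompt, "complied" or "refused" (hypothetical names).
    labels = list(labels)
    if not labels:
        raise ValueError("need at least one labelled response")
    complied = sum(1 for label in labels if label == "complied")
    return 100.0 * complied / len(labels)

# Illustrative only: 384 compliant responses out of 400 prompts.
example = ["complied"] * 384 + ["refused"] * 16
print(f"ASR: {attack_success_rate(example):.1f}%")
```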

The interesting finding: abliteration can improve reasoning

Two variants beat the base model on GSM8K. coder3101 scores 84.8% versus base at 83.5%. llmfan46 scores 83.9%. Both use surgical, low-tensor-count approaches. The abliteration shortens thinking chains, so the model spends fewer tokens reasoning and more tokens answering. Within a fixed generation budget, that means more correct answers.
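The budget argument can be illustrated with a toy example. All the numbers here are made up; the point is only that a response needs its thinking and its answer to fit inside the generation limit, so shorter thinking chains leave fewer responses cut off.

```python
def produces_answer(thinking_tokens, answer_tokens, budget):
    # An answer only appears if thinking plus answer fit in the budget.
    return thinking_tokens + answer_tokens <= budget

def answered_fraction(thinking_lengths, answer_tokens, budget):
    # Fraction of responses that reach an answer before the token limit.
    answered = sum(
        1 for t in thinking_lengths if produces_answer(t, answer_tokens, budget)
    )
    return answered / len(thinking_lengths)

# Hypothetical thinking lengths before and after an edit that shortens chains.
long_chains = [300, 700, 1100, 1500]
short_chains = [180, 420, 660, 900]

print(answered_fraction(long_chains, answer_tokens=100, budget=1024))
print(answered_fraction(short_chains, answer_tokens=100, budget=1024))
```

With these invented lengths, half of the long chains run out of tokens, while all of the shortened ones finish. The same mechanism cuts the other way when an edit makes the model think longer.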

The capability damage is real for aggressive approaches

ether4o4 drops 6.9 points on GSM8K with 84 empty responses where the model thinks until it runs out of tokens without producing an answer. huihui-v2 drops 4.2 points. treadon drops 2.9 points.

LAMBADA perplexity tells a starker story. wangzhang hits 7.35x base perplexity. wwtcyberlab hits 5.69x. These variants disrupted language modelling beyond the refusal direction.
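For context on the "x base perplexity" figures, this is a generic sketch of how a perplexity ratio can be computed from per-token log-probabilities. It is an assumption about the general recipe, and lm-eval's exact aggregation for LAMBADA may differ.

```python
import math

def perplexity(token_logprobs):
    # Perplexity is exp of the mean negative log-likelihood (natural log).
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def perplexity_ratio(variant_logprobs, base_logprobs):
    # How many times higher the variant's perplexity is than the base model's.
    return perplexity(variant_logprobs) / perplexity(base_logprobs)

# Toy inputs: the variant assigns probability 0.25 to each target token,
# the base model assigns 0.5.
variant = [math.log(0.25)] * 4
base = [math.log(0.5)] * 4
print(perplexity_ratio(variant, base))
```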

The "capabilities preserved" claims could be interpreted differently

duoneural claimed "near-zero divergence at approximately 0.001." I measured 0.187. That is 187x higher. After I raised this on their model card, they updated it with the real number.

wwtcyberlab claims "0.0% refusal rate and 101% quality preservation." I found 2 responses that were sort-of refusals, and LAMBADA perplexity at 5.69x base. Other benchmarks drop too, although to be fair some areas are preserved.

treadon says "same model, same weights, same knowledge." The KL divergence of 3.971 is 4.1x higher than any other variant.

Three creators got it right. coder3101 reports divergence of 0.1651 and I measured 0.1673, within 1.3%. pew reports 0.152, I got 0.153. trevorjs reports 0.346, I got 0.365. These match. The others, not so much.
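The arithmetic behind those comparisons, using only the reported and measured KL values quoted above. The helper name is made up.

```python
def relative_gap(reported, measured):
    # Relative difference between a model card's claim and the measured value.
    return abs(measured - reported) / reported

# (reported, measured) KL divergence pairs quoted in this post.
claims = {
    "coder3101": (0.1651, 0.1673),
    "pew": (0.152, 0.153),
    "trevorjs": (0.346, 0.365),
    "duoneural": (0.001, 0.187),
}

for name, (reported, measured) in claims.items():
    gap = relative_gap(reported, measured)
    ratio = measured / reported
    print(f"{name}: gap {gap:.1%}, measured/reported {ratio:.1f}x")
```

The first three land within a few percent of their model cards; duoneural's original claim is off by a factor of about 187.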

My pick

coder3101 if you want one model and don't want to think about it. 96% ASR, beats base on math, benchmark scores within rounding error. trevorjs if you want near-maximal safety removal at 99.5% ASR with only minor math impact. llmfan46 if you want the most conservative approach with zero capability loss.

What broke along the way

5 of 13 models were missing 60 safetensor keys. Gemma4 uses shared KV projections for layers 15 to 34, and the export tools silently dropped them. Had to patch from base.
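The patching step is roughly the following, shown on plain dictionaries standing in for the tensor maps. This is a sketch, not the actual abliterlitics code: in practice the dicts would come from loading the safetensors files (for example with `safetensors.torch.load_file`), and the key names in the example are hypothetical.

```python
def patch_missing_keys(base_tensors, variant_tensors):
    # Find keys present in the base checkpoint but absent from the variant.
    missing = sorted(k for k in base_tensors if k not in variant_tensors)
    # Copy the variant, then fill each gap with the base model's tensor.
    patched = dict(variant_tensors)
    for key in missing:
        patched[key] = base_tensors[key]
    return patched, missing

# Toy stand-ins for state dicts; real values would be tensors.
base = {
    "layers.14.k_proj.weight": "base-14",
    "layers.15.k_proj.weight": "base-15",
}
variant = {"layers.14.k_proj.weight": "edited-14"}

patched, missing = patch_missing_keys(base, variant)
print(missing)
```

Keys the variant already has are left alone, so the abliterated weights are not overwritten. Only the dropped ones are restored from base.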

About 8 of the 44 GPU hours produced nothing usable. Crashes, wrong configs, silent failures. The data took roughly 36 hours to produce.

Links

Hugging Face: https://huggingface.co/DreamFast/Gemma4-e2b-abliterlitics - Note that all the JSON and log file artifacts will go onto Hugging Face going forward.

New abliterlitics website: https://abliterlitics.dev/models/gemma4-e2b/

Code: https://github.com/dreamfast/abliterlitics/tree/feat/gemma4-e2b-comparison - Snapshot of how the abliterlitics code looked after the results were completed.

What variants or models should I test next? Happy to answer methodology questions in the comments. Will move on to the Gemma 4 E4B next. :)

Jipok_@reddit

Out of curiosity, what is the actual, practical use case for an abliterated E2B model?

I completely get why people hack and uncensor larger models (it's usually for NSFW ERP / waifus), but a tiny model is basically meant to be a lightweight tool-calling agent.

Do these even suffer from refusals when doing basic utility/agentic tasks to the point of needing abliteration?

HVACcontrolsGuru@reddit

E2B I use for tool routing but the E4B is capable enough to be a domain specialist. I run fine tunes and trainings over the models to run swappable adapters for workflows. They have a strict and trained line of sight that matches frontier levels of performance in that domain.

nathandreamfast@reddit (OP)

That's a good question. For me personally, I have been experimenting with smaller models as a router to determine which larger LLM should be used, and for pen testing work. While I haven't tried this specific model, that's one case that fits.

I imagine for other use cases people can run these on their phones or edge devices. Maybe local phone chat companions? So I can only guess what people would be using them for. 2b would be limiting though.

And for this comparison it's been insightful as it's the first time I've measured 13 different abliterated models. The 2b makes it so so much easier.

Jipok_@reddit

Wait, why would a router model need to be abliterated though?

TheOnlyBen2@reddit

What is your experience using small models for pentesting?

nathandreamfast@reddit (OP)

Not really using the smaller models themselves specifically.

I have been experimenting with having a smaller model decide where to proxy requests, either a local LLM or a cloud-based provider. It depends on the content and its length.

Tool calls for pen test tools, and editing or updating files, I try to route locally. If a request needs a lot of context summarized, like a log file or tool output, it can be routed to a cloud LLM. Or if the local LLM has failed a few times, we can try a cloud LLM.
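A rough sketch of that routing logic. In my setup a small model makes the decision; here hard-coded rules stand in for it, and the request kinds, thresholds, and names are all made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Request:
    kind: str                # e.g. "tool_call", "file_edit", "summarize"
    context_tokens: int      # approximate size of the prompt plus attachments
    local_failures: int = 0  # how many times the local LLM already failed

def route(req, max_local_context=8000, max_local_failures=3):
    # Fall back to the cloud once the local model has failed a few times.
    if req.local_failures >= max_local_failures:
        return "cloud"
    # Large summarization jobs (logs, tool output) go to the cloud.
    if req.kind == "summarize" and req.context_tokens > max_local_context:
        return "cloud"
    # Tool calls and file edits stay local.
    if req.kind in {"tool_call", "file_edit"}:
        return "local"
    # Everything else: decide on length alone.
    return "local" if req.context_tokens <= max_local_context else "cloud"

print(route(Request("tool_call", 500)))
print(route(Request("summarize", 50_000)))
print(route(Request("file_edit", 500, local_failures=3)))
```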

It's interesting to try but far from perfect. I do believe though something like that is the future of local LLM.

WolpertingerRumo@reddit

Well, e4b with decent tools and grounding can be quite useful. I tested it in my harness, and it was basically doing as well as larger models, though with huge amounts of context. I could imagine it doing well, too.

nathandreamfast@reddit (OP)

That's good to know! I've been all in on Qwen 3.5 and 3.6, Gemma 4 I do need to try out.

LetsGoBrandon4256@reddit

I loved your previous benchmark post for Abliterated Qwen but why does this one read like AI slop?

punky-beansnrice@reddit

abliteration-improves-reasoning is the most counterintuitive finding here. shorter thinking chains within a fixed budget = more correct answers. thats a real architectural insight, not just "safety removal works". the 187x divergence catch on duoneural is excellent forensic work, the kind of audit that should be standard for every model card claim.

systemwizard@reddit

Thank you for putting this together, waiting for the write up on Gemma 4 E4B.

AccountAntique9327@reddit

Hey this may sound a little weird but could you test my abliteration framework? https://github.com/heterodoxin/apostate I don't have any models posted yet but if you need one I could.

CalligrapherFar7833@reddit

Nice vibe slop

nathandreamfast@reddit (OP)

Thanks! If it helps I did manually review everything a few times, but yeah not shy to admit an LLM did the heavy lifting.

Twirrim@reddit

It makes the write up somewhat frustrating to read. You're doing your work an injustice by relying on it like this. Slop'd text increasingly says "I don't know what I'm talking about" at best, or "here comes some misinterpreted data" at worst.

At least with most cases I see, it's a strong indicator the person doesn't know what they're talking about, and that taint carries across to others.

nathandreamfast@reddit (OP)

Thanks for your feedback.

I usually have it generate each section and I rewrite it myself while verifying the information manually. So I disagree with the sentiment that it's just pure slop without deeper understanding.

thrownawaymane@reddit

The point is that it's hard to read and it makes you look lazy. My opinion is that you're not (I've seen your other posts) but this is just how it goes.

Believe me, everyone is way more used to reading poorly written human slop. Just go with that.

nathandreamfast@reddit (OP)

Sure. Certainly get that aspect. The underlying data and analysis I am sure is good. The presentation though I agree can be improved.

For what it's worth I did work on this over a week in my spare time, editing, verifying and rewriting parts. I can see though how it has that slop vibe despite that.

Will certainly improve the next post. It's all good feedback regardless.

Pleasant-Shallot-707@reddit

I’d love to see a Gemma 4 27B abliterated comparison

nathandreamfast@reddit (OP)

27b certainly pushes the limits, will certainly work my way towards it though.

The Qwen 3.6 27b I had done in bnb4 as it fits in one GPU and is much faster.

If I can do BF16 certainly will! It can add many hours, or even a day or two of extra time at full precision comparing them, depending how many models there are.

This one was just 2B so thankfully one of the easier ones to work through at BF16.