85 GPU-hours comparing 5 abliteration methods on Qwen3.6-27B: benchmarks, safety, weight forensics - Abliterlitics
Posted by nathandreamfast@reddit | LocalLLaMA | 51 comments
I've been building Abliterlitics, an open-source abliteration forensics toolkit. The idea is straightforward: take the same base model, compare the different abliteration techniques others have applied, then measure what actually changed using benchmarks, safety evaluation, distribution shift, and weight-level analysis.
This post covers Qwen3.6-27B, comparing five abliteration variants against the base model. I recovered safetensors from HauhauCS's Q8_K_P GGUF, then ran 85 hours of benchmarks, HarmBench, KL divergence, and weight forensics across all six.
Heretic and Huihui are the top two for capability preservation: Huihui has the smallest benchmark deltas, Heretic has the lowest KL divergence. All five abliterated models reach near-complete safety removal. AEON's "enhanced capabilities" claim is contradicted by the data. Abliterix has the worst capability preservation by far.
Full report with all tables and charts: HuggingFace model card.
The six models
| Name | Type |
|---|---|
| Base | Qwen/Qwen3.6-27B |
| Heretic | llmfan46/Qwen3.6-27B-uncensored-heretic-v2 |
| HauhauCS | HauhauCS/Qwen3.6-27B-Uncensored-HauhauCS-Aggressive |
| Huihui | huihui-ai/Huihui-Qwen3.6-27B-abliterated |
| AEON | AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16 |
| Abliterix | wangzhang/Qwen3.6-27B-abliterated-v2 |
HauhauCS used a tool called "Reaper Abliteration," which was shown to be plagiarised from Heretic under AGPL-3.0 with all attribution stripped and relicensed to PolyForm Noncommercial. Based on our analysis of the recovered source code, Reaper adds subspace rank-k ablation, per-component continuous curves, and SOM clustering on top of the Heretic-derived core. The model was exported as Q8_K_P GGUF. I converted it back to safetensors with ungguf, our GGUF-to-safetensors tool. The weights therefore carry two layers of modification: Reaper's abliteration edits and GGUF quantisation round-trip noise, superimposed.
I will discontinue HauhauCS in all future comparisons. Without proper safetensors, and with the tool shown to be plagiarised, there's no point. The lossless claims have been debunked for every model tested, and the Reaper Abliteration source is open for anyone to see how the models are created.
Benchmarks
Evaluated with lm-evaluation-harness via vLLM 0.19.0, BitsAndBytes 4-bit quantisation on a single RTX 5090. All six models tested with identical settings. BNB4 quantisation drops absolute scores but preserves relative deltas between variants.
| Task | Base | Heretic | HauhauCS | Huihui | AEON | Abliterix |
|---|---|---|---|---|---|---|
| MMLU | 83.3% | 82.8% | 83.9% | 83.4% | 82.9% | 81.3% |
| HellaSwag | 83.5% | 83.2% | 83.1% | 83.5% | 82.7% | 77.3% |
| ARC Challenge | 59.1% | 58.0% | 57.9% | 59.5% | 56.1% | 53.2% |
| WinoGrande | 77.7% | 77.7% | 77.7% | 77.4% | 75.3% | 74.9% |
| TruthfulQA MC2 | 56.7% | 51.1% | 47.2% | 54.8% | 46.1% | 48.7% |
| PiQA | 81.0% | 81.0% | 81.0% | 81.2% | 80.4% | 75.7% |
| GSM8K (7168 tok) | 34.4% | 27.5% | 51.0% | 75.1% | 51.2% | 37.6% |
| Lambada (ppl) | 3.18 | 3.24 | 3.35 | 3.15 | 3.44 | 9.12 |
Delta vs base
| Task | Heretic | HauhauCS | Huihui | AEON | Abliterix |
|---|---|---|---|---|---|
| MMLU | -0.5 | +0.6 | +0.1 | -0.4 | -2.0 |
| HellaSwag | -0.3 | -0.4 | +0.0 | -0.8 | -6.2 |
| ARC Challenge | -1.1 | -1.2 | +0.4 | -3.0 | -5.9 |
| WinoGrande | +0.0 | +0.0 | -0.3 | -2.4 | -2.8 |
| TruthfulQA MC2 | -5.6 | -9.5 | -1.9 | -10.6 | -8.0 |
| PiQA | +0.0 | +0.0 | +0.2 | -0.6 | -5.3 |
| GSM8K | -6.9 | +16.6 | +40.7 | +16.8 | +3.2 |
Charts: Benchmark Comparison | Delta Chart
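For anyone who wants to re-derive the delta table and the "Avg delta excl GSM8K" row in the summary, here is a minimal sketch. It assumes the average is the mean absolute delta over the six accuracy tasks (Lambada perplexity left out), which reproduces the published figures; the function names are just for illustration.

```python
# Accuracy scores (%) copied from the benchmark table above.
SCORES = {
    "MMLU":           {"Base": 83.3, "Heretic": 82.8, "HauhauCS": 83.9, "Huihui": 83.4, "AEON": 82.9, "Abliterix": 81.3},
    "HellaSwag":      {"Base": 83.5, "Heretic": 83.2, "HauhauCS": 83.1, "Huihui": 83.5, "AEON": 82.7, "Abliterix": 77.3},
    "ARC Challenge":  {"Base": 59.1, "Heretic": 58.0, "HauhauCS": 57.9, "Huihui": 59.5, "AEON": 56.1, "Abliterix": 53.2},
    "WinoGrande":     {"Base": 77.7, "Heretic": 77.7, "HauhauCS": 77.7, "Huihui": 77.4, "AEON": 75.3, "Abliterix": 74.9},
    "TruthfulQA MC2": {"Base": 56.7, "Heretic": 51.1, "HauhauCS": 47.2, "Huihui": 54.8, "AEON": 46.1, "Abliterix": 48.7},
    "PiQA":           {"Base": 81.0, "Heretic": 81.0, "HauhauCS": 81.0, "Huihui": 81.2, "AEON": 80.4, "Abliterix": 75.7},
    "GSM8K":          {"Base": 34.4, "Heretic": 27.5, "HauhauCS": 51.0, "Huihui": 75.1, "AEON": 51.2, "Abliterix": 37.6},
}
EXCLUDED = {"GSM8K"}  # left out of the average, as in the summary table


def deltas(scores):
    """Percentage-point delta of every variant against Base, per task."""
    out = {}
    for task, row in scores.items():
        base = row["Base"]
        out[task] = {m: round(v - base, 1) for m, v in row.items() if m != "Base"}
    return out


def mean_abs_delta(task_deltas, model, excluded=EXCLUDED):
    """Mean absolute delta over the non-excluded tasks (assumed definition)."""
    vals = [abs(d[model]) for task, d in task_deltas.items() if task not in excluded]
    return sum(vals) / len(vals)


D = deltas(SCORES)
for model in ["Heretic", "HauhauCS", "Huihui", "AEON", "Abliterix"]:
    print(f"{model:10s} {mean_abs_delta(D, model):.2f}pp")
```

Using the absolute value matters here: a signed mean would let HauhauCS's +0.6 on MMLU cancel part of its TruthfulQA drop.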
HarmBench
HarmBench with 400 textual behaviours, max_tokens=6144, classified with CoT direction analysis. Verified by three independent LLM reviewers.
| Variant | ASR | Empty | Full CoT ASR |
|---|---|---|---|
| Base | 25.8% | 1 | 26.0% |
| Huihui | 98.5% | 5 | 99.8% |
| HauhauCS | 94.5% | 22 | 100.0% |
| Abliterix | 94.5% | 22 | 100.0% |
| Heretic | 92.5% | 30 | 100.0% |
| AEON | 88.8% | 45 | 100.0% |
Four of five reach 100% Full CoT ASR. The reported ASR differences come from how much the 6144-token generation budget is consumed by chain-of-thought reasoning before the visible response. When the budget is exhausted, the response is empty and the classifier marks it as a refusal. This understates the true ASR.
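The two ASR columns are consistent with simple arithmetic: add the empty responses back in as successes. The sketch below is an inference from the table, not a description of how the classifier or the CoT direction analysis works internally.

```python
N_BEHAVIOURS = 400  # HarmBench textual behaviours used in this run

# (reported ASR %, empty responses) copied from the HarmBench table.
REPORTED = {
    "Base":      (25.8, 1),
    "Huihui":    (98.5, 5),
    "HauhauCS":  (94.5, 22),
    "Abliterix": (94.5, 22),
    "Heretic":   (92.5, 30),
    "AEON":      (88.8, 45),
}


def full_cot_asr(asr_pct, empty, n=N_BEHAVIOURS):
    """Reported ASR plus the share of empty responses, capped at 100%.

    Assumes every empty response would have been a compliance had the
    generation budget not run out.
    """
    return min(100.0, asr_pct + 100.0 * empty / n)


for name, (asr, empty) in REPORTED.items():
    print(f"{name:10s} {asr:5.1f}% -> {full_cot_asr(asr, empty):.2f}%")
```

The reported percentages are already rounded to one decimal, so the reconstructed values can differ from the table by a few hundredths of a point.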
Charts: HarmBench Summary | By Category
KL Divergence
Lower is better. Measures output distribution shift from base on benign prompts.
| Variant | KL (batchmean) | Rating |
|---|---|---|
| Heretic | 0.0037 | excellent |
| Huihui | 0.0074 | excellent |
| Abliterix | 0.0222 | very good |
| AEON | 0.0238 | very good |
| HauhauCS | 0.0242 | very good |
All five are well below the capability damage threshold at KL around 0.1.
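For readers unfamiliar with the "batchmean" label: it means the per-prompt KL values are summed over the vocabulary and then averaged over the batch. Below is a dependency-free sketch of that reduction. It assumes one row of next-token logits per prompt and the direction KL(base || variant); the real measurement code is not reproduced here.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_batchmean(base_logits, variant_logits):
    """Mean over prompts of KL(base || variant), summed over the vocabulary."""
    total = 0.0
    for base_row, variant_row in zip(base_logits, variant_logits):
        b_lp = log_softmax(base_row)
        v_lp = log_softmax(variant_row)
        total += sum(math.exp(b) * (b - v) for b, v in zip(b_lp, v_lp))
    return total / len(base_logits)


# Toy logits for two prompts over a four-token vocabulary (made-up numbers).
base = [[2.0, 1.0, 0.0, -1.0], [0.5, 0.5, 0.5, 0.5]]
variant = [[2.1, 0.9, 0.0, -1.0], [0.6, 0.5, 0.4, 0.5]]
print(kl_batchmean(base, variant))
```

An identical model gives exactly zero, and the value grows as the variant's next-token distribution drifts from the base model's.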
Weight Analysis
This is where things get interesting.
| Metric | AEON | Abliterix | Heretic | Huihui | HauhauCS |
|---|---|---|---|---|---|
| Tensors changed | 88 (10.4%) | 101 (11.9%) | 120 (14.1%) | 128 (15.1%) | 564 (66.4%) |
| Relative edit | 6.0% | 5.2% | 2.1% | 1.5% | 0.7% |
HauhauCS is an extreme outlier with 4.4-6.4x more changed keys than any other variant. This is the combination of Reaper's abliteration targeting multiple component types plus GGUF Q8_K_P round-trip noise. A uniform ~0.57% relative edit is visible across all tensor types, including types that other methods don't touch like embed_tokens and q_proj. The abliteration signal sits on top of this noise floor.
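To make the two table rows concrete, here is a toy sketch of how "tensors changed" and "relative edit" could be computed from two checkpoints. The definition of relative edit used here (norm of the delta divided by norm of the base, over changed tensors) is an assumption, and the tensor names and values are made up.

```python
import math


def l2_norm(values):
    return math.sqrt(sum(x * x for x in values))


def diff_report(base, variant, tol=0.0):
    """Compare two checkpoints given as {tensor_name: flat list of floats}."""
    changed = []
    delta_sq = 0.0
    base_sq = 0.0
    for name, base_vals in base.items():
        delta = [v - b for b, v in zip(base_vals, variant[name])]
        delta_norm = l2_norm(delta)
        if delta_norm > tol:  # anything above tol counts as a changed tensor
            changed.append(name)
            delta_sq += delta_norm ** 2
            base_sq += l2_norm(base_vals) ** 2
    relative = math.sqrt(delta_sq / base_sq) if base_sq else 0.0
    return {
        "changed": changed,
        "changed_fraction": len(changed) / len(base),
        "relative_edit": relative,
    }


# Hypothetical two-tensor "checkpoints"; only down_proj was edited.
base = {"layers.0.o_proj": [3.0, 4.0], "layers.0.down_proj": [1.0, 0.0]}
variant = {"layers.0.o_proj": [3.0, 4.0], "layers.0.down_proj": [1.0, 0.1]}
report = diff_report(base, variant)
print(report)
```

With `tol=0.0`, a quantisation round trip marks almost every tensor as changed, which is the effect described for HauhauCS; raising `tol` above the noise floor is one way to separate deliberate edits from round-trip noise.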
Pairwise cosine similarities between the four other techniques are mostly below 0.07. No two techniques discovered the same weight direction. The "refusal direction" in weight space is not a single vector but a manifold with many viable removal pathways.
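The comparison behind that statement can be sketched as follows: subtract the base weights from each variant to get an edit vector, then take the cosine between edit vectors. This is a toy version on flat lists with made-up numbers, not the toolkit's code.

```python
import math


def edit_vector(base, variant):
    """Weight delta introduced by a technique (variant minus base)."""
    return [v - b for b, v in zip(base, variant)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


# Hypothetical flattened tensor and three edited versions of it.
base = [1.0, 1.0, 1.0, 1.0]
variant_a = [1.5, 1.0, 1.0, 1.0]  # edits coordinate 0
variant_b = [1.0, 1.5, 1.0, 1.0]  # edits coordinate 1
variant_c = [2.0, 1.0, 1.0, 1.0]  # same direction as A, larger step

print(cosine(edit_vector(base, variant_a), edit_vector(base, variant_b)))
print(cosine(edit_vector(base, variant_a), edit_vector(base, variant_c)))
```

Comparing deltas rather than raw weights is the important part: the raw weights of any two variants are almost identical, so their cosine would be close to 1 regardless of what was edited.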
What stands out
Heretic has the lowest KL divergence at 0.0037, rated "excellent." Small weight footprint at 2.1% relative edit across 120 tensors of 3 types. It is the only variant to lose GSM8K accuracy, at -6.9pp. Achieves 100% Full CoT ASR.
Huihui has the smallest benchmark deltas. Average delta on non-GSM8K tasks is just 0.5pp, beating Heretic's 1.3pp. Wins 4 of 6 non-GSM8K tasks head to head. Highest reported ASR at 98.5% with the fewest empty responses at just 5. KL divergence is 0.0074, also rated "excellent." But GSM8K at 75.1% is a +40.7pp jump over base. No abliteration should improve reasoning that much. We have double-checked these results and would be interested to see independent benchmarks from others.
HauhauCS has solid behavioural results despite the complex weight fingerprint. MMLU is +0.6pp over base. 94.5% ASR going to 100% Full CoT. The Reaper abliteration plus GGUF noise doesn't meaningfully damage output distributions. The "lossless" claim is simply not evident when Heretic and Huihui both preserve capabilities better. The weights themselves carry Reaper's abliteration edits plus quantisation artefacts.
AEON degrades on every non-GSM8K task. TruthfulQA drops 10.6pp. ARC drops 3.0pp. Has the worst thinking loops with 45 out of 400 empty responses. Claims "no looping, no philosophizing spirals" and "measurably enhanced capabilities" are contradicted by the data.
Abliterix has the worst capability preservation. Lambada perplexity increases 2.9x from 3.18 to 9.12. HellaSwag drops 6.2pp. Concentrated surgical strikes with extreme outliers cause broad collateral damage.
What went wrong
85 hours of productive GPU time across 7 days. Plus ~25 hours lost to failed runs. 14 failed runs total.
The bulk were GSM8K timeouts. Qwen3.5 architecture is incompatible with BNB4 plus tensor parallelism. The default 120s request timeout was too short for extended reasoning. Wrote a patched script with 900s timeout to fix it. Accidentally re-ran AEON HarmBench with max_tokens=4096 instead of 6144. 6.7 hours wasted.
GSM8K per-model times vary dramatically because abliterated models think harder on math problems. HauhauCS took 53 minutes. AEON took 11 hours.
Methodology notes
All models evaluated with BitsAndBytes 4-bit quantisation on a single RTX 5090. Absolute scores are not directly comparable to bf16 results. Relative deltas between variants are preserved. GSM8K scores use flexible-extract matching. Treat GSM8K numbers as relative comparisons only.
The thinking budget matters. Initial runs with max_gen_toks=2048 gave terrible GSM8K scores because for reasoning models, max_gen_toks includes thinking tokens. The model would think for 1900 tokens, get cut off, and never produce an answer. Re-running with max_gen_toks=7168 gave the results above.
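A toy model of the shared budget shows why the small limit produced no answers. The token counts below are hypothetical; only the two max_gen_toks values come from the runs described above.

```python
def visible_answer_tokens(thinking_tokens, answer_tokens, max_gen_toks):
    """Answer tokens that survive when thinking and answer share one budget."""
    remaining = max(0, max_gen_toks - thinking_tokens)
    return min(answer_tokens, remaining)


# Hypothetical problem: 2500 tokens of reasoning, then a 150-token answer.
for budget in (2048, 7168):
    kept = visible_answer_tokens(2500, 150, budget)
    print(f"max_gen_toks={budget}: {kept} answer tokens kept")
```

The same mechanism explains the empty HarmBench responses: when reasoning alone uses up the budget, nothing visible is left to classify.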
Summary table
| Metric | Heretic | HauhauCS | Huihui | AEON | Abliterix |
|---|---|---|---|---|---|
| HarmBench ASR | 92.5% to 100% | 94.5% to 100% | 98.5% to 99.8% | 88.8% to 100% | 94.5% to 100% |
| MMLU | 82.8% | 83.9% | 83.4% | 82.9% | 81.3% |
| GSM8K | 27.5% | 51.0% | 75.1% | 51.2% | 37.6% |
| KL divergence | 0.0037 | 0.0242 | 0.0074 | 0.0238 | 0.0222 |
| Avg delta excl GSM8K | 1.3pp | 2.0pp | 0.5pp | 3.0pp | 5.0pp |
| Tensors changed | 120 | 564 | 128 | 88 | 101 |
Links
Full report with provenance analysis, tensor breakdown, and all charts: HuggingFace model card
Forensics toolkit: Abliterlitics on GitHub
GGUF-to-safetensors converter: ungguf on GitHub
Other tensor comparisons: DreamFast HauhauCS collection
While I have taken the time to verify all results thoroughly, I am open to any corrections, additional benchmarks, or further analysis. If you spot something that looks wrong and can be confirmed, I am happy to fix it.
sandshrew69@reddit
Thanks so much, this is currently above my head but I really appreciate your research. I was wondering what you thought about TrevorJS work on gemma4? I have been using it and its pretty cool but I am not sure how it compares to others.
nathandreamfast@reddit (OP)
Thanks for the heads up. Yeah I really need to make a non technical tl;dr for the next one.
This is the first I've seen of TrevorJS, but overall it seems good. I haven't tried the model, but the fact he's done benchmarks himself and provided source code is a good sign. His technique seems to be based on heretic, with some extra abliteration methods he's implemented on top. https://github.com/TrevorS/gemma-4-abliteration
It's refreshing to see this too. No wild claims with actual benchmark proof + source code of the result. I like it!
marutthemighty@reddit
As a total newbie, is it possible for me to do something like this? Where do I start learning?
nathandreamfast@reddit (OP)
It is possible yes, but it needs a lot of patience. In my own experience, I wasted a lot of time on failed runs as part of the learning process.
Smaller models are much easier like 2b or 4b so they'd be a good start. I'd check out lm eval harness as a start as that's the gold standard for benchmark runs.
MaruluVR@reddit
If someone abliterates a model does it also affect other languages the model knows or is it exclusive to a specific language?
nathandreamfast@reddit (OP)
That's a good question. I would assume that it doesn't matter the language, the refusals removed aren't based on language but the paths models take to give an answer.
unjustifiably_angry@reddit
Aren't we on Heretic 3 now?
nathandreamfast@reddit (OP)
Yeah heretic 1.3 came out recently! When I started these benchmarks, around May 1st, it wasn't released yet.
The heretic model compared in this series though was abliterated with MPOA, which is now the default method in v1.3. MPOA was available in 1.2 but wasn't the default.
ai_without_borders@reddit
the chain of thought eval issue is real. abliteration targets refusal directions but if those activation subspaces overlap with reasoning trace routing, CoT quality degrades in ways standard benchmarks miss. weight forensics is the right approach for catching correlated degradation before deployment.
nathandreamfast@reddit (OP)
Yeah the CoT/reasoning makes these benchmarks a lot harder. It's something I've underestimated every time. And it's tricky where if you do expand the reasoning token allowance high, models will just loop thinking or get stuck somehow.
Something I don't think has been considered much is how abliteration affects the reasoning. So it has been interesting to see the comparisons.
Myrkkeijanuan@reddit
You should consider multi-position teacher-forced full-vocabulary KL. Heretic's method is way too forgiving. I uploaded to PrivateBin some example code from my private fork if you want: here. High-effort post in any case, thanks for the benchmarks.
nathandreamfast@reddit (OP)
For the KL divergence in this case we have matched exactly what is used in heretic, the code is mostly identical. I figured that is the easier approach as heretic kl divergence is used a lot to measure models made by heretic, and it's more fair to use the same approach when measuring others.
Otherwise the results will just be wildly different!
I'll check out your code snippet though as it's always interesting! Thanks
Mordred500@reddit
This is great work, thanks for sharing!
nathandreamfast@reddit (OP)
Thank you.
j-m-k-s@reddit
Very cool – thanks for sharing!
nathandreamfast@reddit (OP)
Thanks!
Shoddy-Tutor9563@reddit
How many runs / repeats for benchmarks did you make? If it was only "one run of a benchmark per model", you're measuring noise in your "Delta vs base" section and in the following sections. LLM models are not deterministic. Single success or single failure doesn't mean anything.
nathandreamfast@reddit (OP)
Lm eval harness handles this for us thankfully.
In this case the loglikelihood tasks MMLU, HellaSwag, ARC, WinoGrande, PIQA, Lambada are deterministic. They actually do not rely on generating tokens. They work by computing the log likelihood of the continuation and pick the highest. You will get the same results every run.
The generative task, GSM8K, does generate tokens. That's where hyperparameters like temperature and top_p matter. TruthfulQA MC2 is a multiple-choice task scored by log likelihood as well, so it is deterministic too. Also lm eval harness uses the same seed for all the benchmarks.
In this case I've used the default settings for all benchmarks provided by lm eval harness.
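To make the log likelihood point concrete, here is a toy sketch of multiple-choice scoring: sum the token log-probabilities of each candidate continuation and pick the highest. The numbers are made up, and this is not lm eval harness's code.

```python
def pick_choice(choice_logprobs):
    """Index of the continuation with the highest summed log-likelihood."""
    totals = [sum(tokens) for tokens in choice_logprobs]
    return max(range(len(totals)), key=totals.__getitem__)


# Hypothetical per-token log-probabilities for three answer options.
options = [
    [-1.0, -2.0],  # total -3.0
    [-0.5, -0.7],  # total -1.2
    [-3.0],        # total -3.0
]
print(pick_choice(options))
```

There is no sampling anywhere in this path, so the same weights and the same prompt give the same pick on every run.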
cleversmoke@reddit
Might any of them have MTP layers yet?
nathandreamfast@reddit (OP)
Actually, all the MTP layers were present in these safetensors, but they were not used with vLLM. I had completely overlooked using them.
From what I know, the MTP layer usually isn't abliterated, and shouldn't be. It is best preserved.
marscarsrars@reddit
For the non technical following these may we have a simple breakdown :)
Also best use cases . Thank you
nathandreamfast@reddit (OP)
Sure going forward I'd be happy to add something like that, https://huggingface.co/DreamFast/Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-Safetensor-Benchmark#tldr
Best picks: Heretic and Huihui. Both remove safety completely, both preserve capabilities within 1% of the original, and both do it with clean, minimal weight edits. The difference between them is small. You'd be happy with either.
Personally I've been using the latest MTP nvfp4 llmfan Qwen 3.6 27b for my own stuff and it's been great. So far have not had it refuse anything.
IrisColt@reddit
Same.
pigeon57434@reddit
doesnt huihui use standard abliteration famous for destroying model intelligence how could it possibly be the case its even comparible with modern heretic with things like ARA-RN and MPOA+SOMA
-p-e-w-@reddit
HuiHui uses MPOA I believe, and ARA and SOMA are not yet merged into Heretic’s official releases and weren’t used for the tested model.
nathandreamfast@reddit (OP)
In the analysis I had done with the Qwen 3.5 series (2b, 4b, 9b and 27b), huihui did not perform well at all.
In the Qwen 3.6 27b though it has improved a lot.
My best answer to that question is it'd depend on the model. Maybe the Qwen 3.6 27b worked better with how the refusals were removed. To be honest I don't have a clear answer and can only speculate.
I am not too familiar with what Huihui does, however there could be different options or ways to abliterate the same model.
pigeon57434@reddit
i just dont really think your tests are very fair to heretic at all because heretics can vary a lot in quality and i just dont think you selected the best ones otherwise it would absolutely destroy all the other methods as pew himself says you have to really know how to use heretic properly for it to get good results
marscarsrars@reddit
I am grateful for the support especially the suggestion.
May I know what do you use it for ?
nathandreamfast@reddit (OP)
Sure. I've been experimenting with it using Hermes agent. I won't ever use openclaw, but this has been an interesting alternative.
The main use is for pen testing or cyber security stuff, and setting up automatic agents to run those scans and other tests autonomously.
The local models I prefer as they are more private for some of the work I do. The bigger models and censored local LLMs will at times refuse trying to run these things.
marscarsrars@reddit
May I dm u?
nathandreamfast@reddit (OP)
Sure, although my chat isn't loading at the moment I can try again tomorrow
BillDStrong@reddit
Speaking of MTP, maybe a test to see if any of the edits affect the acceptance rate?
It might also be nice to know if the edits affect DFlash, but that might be asking too much of you.
nathandreamfast@reddit (OP)
For MTP/Dflash I'll try it out in future runs as it seems recently more adopted.
marutthemighty@reddit
Awesome job!
If you do not mind, may I know what you use this for?
AnouarRifi@reddit
Thank youu
SubdivideSamsara@reddit
I've downloaded and loaded (into rtx 5090) the HuiHui Qwen 3.6 27B model into LM Studio.
A couple Qs:
Are there other settings I should be changing from defaults to get these great results you've shown?
And I can't seem to disable its thinking-mode. That lightbulb icon isn't there in the chat with this model.
nathandreamfast@reddit (OP)
For general use, I would assume the hyperparameters, like temp, top_k and others would match what Qwen recommends. You can find these on their huggingface card. It shouldn't affect the refusals.
I haven't used LM Studio before, so I can't help with the thinking mode being disabled.
vogelvogelvogelvogel@reddit
Thanks for the work!
nathandreamfast@reddit (OP)
Thanks!
MelodicRecognition7@reddit
do I understand correctly that Huihui model is the best?
nathandreamfast@reddit (OP)
In this instance it has done well! Heretic is also comparable, the difference is minimal and people may not notice the difference using both.
However huihui did terribly comparing the Qwen 3.5 series (2b, 4b, 9b and 27b). In this case with Qwen 3.6 27b it has redeemed itself. So it really depends on what model. We can't say huihui is the best everywhere.
For heretic too, there are so many different ways to heretic a model. Different runs can make different results. No two are the same.
MelodicRecognition7@reddit
this is what concerns me about Heretic:
does that mean that Huihui is faster because Heretic will take longer to complete the task?
nathandreamfast@reddit (OP)
Comparing this specific heretic model, that does seem to be the case. Huihui does seem to think less to get an answer.
However other heretic models and heretic methods might be better, or worse. It's hard to say without testing all of them.
FiLo420blazeit@reddit
Amazing, thank you so much for sharing this. Enjoyed my time reading it.
nathandreamfast@reddit (OP)
Thanks, was a tough one. The reasoning and chain of thought stuff I still need to work out a solid approach going forward and set a standard for all future comparisons.
I did have some issues with Qwen 3.6 35b moe with vllm, so if they don't get fixed up soon may wait for the next new model or try out the gemma 4 ones.
FiLo420blazeit@reddit
Heard good things about gemma 4, I expect you reporting back with positive results.
jacek2023@reddit
thanks for sharing, interesting analysis
nathandreamfast@reddit (OP)
Thank you
tempedbyfate@reddit
Thanks for doing the leg work on this. Appreciate it!
nathandreamfast@reddit (OP)
Thanks! I hope this helps people make a better choice when picking which models to try out