85 GPU-hours comparing 5 abliteration methods on Qwen3.6-27B: benchmarks, safety, weight forensics - Abliterlitics

Posted by nathandreamfast@reddit | LocalLLaMA | 51 comments

I've been building Abliterlitics, an open-source abliteration forensics toolkit. The idea is straightforward: take the same base model, compare the different abliteration techniques others have applied, then measure what actually changed using benchmarks, safety evaluation, distribution shift, and weight-level analysis. This post covers Qwen3.6-27B, comparing five abliteration variants against the base model. I recovered safetensors from HauhauCS's Q8_K_P GGUF, then ran 85 hours of benchmarks, HarmBench, KL divergence, and weight forensics across all six. Heretic and Huihui are the top two for capability preservation: Huihui has the smallest benchmark deltas, Heretic has the lowest KL divergence. All five abliterated models reach near-complete safety removal. AEON's "enhanced capabilities" claim is contradicted by the data. Abliterix has the worst capability preservation by far. Full report with all tables and charts: HuggingFace model card.

The six models

Name	Hugging Face repository
Base	Qwen/Qwen3.6-27B
Heretic	llmfan46/Qwen3.6-27B-uncensored-heretic-v2
HauhauCS	HauhauCS/Qwen3.6-27B-Uncensored-HauhauCS-Aggressive
Huihui	huihui-ai/Huihui-Qwen3.6-27B-abliterated
AEON	AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16
Abliterix	wangzhang/Qwen3.6-27B-abliterated-v2

HauhauCS used a tool called "Reaper Abliteration," which was shown to be plagiarised from Heretic under AGPL-3.0 with all attribution stripped and relicensed to PolyForm Noncommercial. Based on our analysis of the recovered source code, Reaper adds subspace rank-k ablation, per-component continuous curves, and SOM clustering on top of the Heretic-derived core. The model was exported as Q8_K_P GGUF. I converted it back to safetensors with ungguf, our GGUF-to-safetensors tool. The weights therefore carry two layers of modification: Reaper's abliteration edits and GGUF quantisation round-trip noise, superimposed.

I will drop HauhauCS from all future comparisons. Without proper safetensors, and with the tool itself plagiarised, there is no point. The "lossless" claims are debunked for every model, and the Reaper Abliteration tool is open for anyone to inspect how the models are created.

Benchmarks

Evaluated with lm-evaluation-harness via vLLM 0.19.0, BitsAndBytes 4-bit quantisation on a single RTX 5090. All six models tested with identical settings. BNB4 quantisation drops absolute scores but preserves relative deltas between variants.

Task	Base	Heretic	HauhauCS	Huihui	AEON	Abliterix
MMLU	83.3%	82.8%	83.9%	83.4%	82.9%	81.3%
HellaSwag	83.5%	83.2%	83.1%	83.5%	82.7%	77.3%
ARC Challenge	59.1%	58.0%	57.9%	59.5%	56.1%	53.2%
WinoGrande	77.7%	77.7%	77.7%	77.4%	75.3%	74.9%
TruthfulQA MC2	56.7%	51.1%	47.2%	54.8%	46.1%	48.7%
PiQA	81.0%	81.0%	81.0%	81.2%	80.4%	75.7%
GSM8K (7168 tok)	34.4%	27.5%	51.0%	75.1%	51.2%	37.6%
Lambada (ppl)	3.18	3.24	3.35	3.15	3.44	9.12

Delta vs base

Task	Heretic	HauhauCS	Huihui	AEON	Abliterix
MMLU	-0.5	+0.6	+0.1	-0.4	-2.0
HellaSwag	-0.3	-0.4	+0.0	-0.8	-6.2
ARC Challenge	-1.1	-1.2	+0.4	-3.0	-5.9
WinoGrande	+0.0	+0.0	-0.3	-2.4	-2.8
TruthfulQA MC2	-5.6	-9.5	-1.9	-10.6	-8.0
PiQA	+0.0	+0.0	+0.2	-0.6	-5.3
GSM8K	-6.9	+16.6	+40.7	+16.8	+3.2

Charts: Benchmark Comparison | Delta Chart
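For anyone who wants to re-derive the delta table, here is a minimal sketch in Python. The scores are copied from the benchmark table above. The averaging rule (mean of absolute deltas over the six accuracy tasks other than GSM8K, with Lambada left out because it is a perplexity) is my assumption about how the "Avg delta excl GSM8K" figures were produced, and the function names are illustrative, not part of Abliterlitics.

```python
# Accuracy scores in percent, copied from the benchmark table (Lambada omitted: it is a perplexity).
SCORES = {
    "MMLU":           {"Base": 83.3, "Heretic": 82.8, "HauhauCS": 83.9, "Huihui": 83.4, "AEON": 82.9, "Abliterix": 81.3},
    "HellaSwag":      {"Base": 83.5, "Heretic": 83.2, "HauhauCS": 83.1, "Huihui": 83.5, "AEON": 82.7, "Abliterix": 77.3},
    "ARC Challenge":  {"Base": 59.1, "Heretic": 58.0, "HauhauCS": 57.9, "Huihui": 59.5, "AEON": 56.1, "Abliterix": 53.2},
    "WinoGrande":     {"Base": 77.7, "Heretic": 77.7, "HauhauCS": 77.7, "Huihui": 77.4, "AEON": 75.3, "Abliterix": 74.9},
    "TruthfulQA MC2": {"Base": 56.7, "Heretic": 51.1, "HauhauCS": 47.2, "Huihui": 54.8, "AEON": 46.1, "Abliterix": 48.7},
    "PiQA":           {"Base": 81.0, "Heretic": 81.0, "HauhauCS": 81.0, "Huihui": 81.2, "AEON": 80.4, "Abliterix": 75.7},
    "GSM8K":          {"Base": 34.4, "Heretic": 27.5, "HauhauCS": 51.0, "Huihui": 75.1, "AEON": 51.2, "Abliterix": 37.6},
}


def deltas_vs_base(scores):
    """Return {task: {variant: delta in percentage points}} relative to Base."""
    return {
        task: {name: round(val - row["Base"], 1) for name, val in row.items() if name != "Base"}
        for task, row in scores.items()
    }


def mean_abs_delta(deltas, exclude=("GSM8K",)):
    """Mean absolute delta per variant over all tasks not listed in `exclude` (assumed rule)."""
    tasks = [t for t in deltas if t not in exclude]
    variants = deltas[tasks[0]].keys()
    return {v: sum(abs(deltas[t][v]) for t in tasks) / len(tasks) for v in variants}


DELTAS = deltas_vs_base(SCORES)
AVG = mean_abs_delta(DELTAS)

for variant, value in AVG.items():
    print(f"{variant}: {value:.2f}pp")
```

Using absolute values matters here: a signed mean would let HauhauCS's +0.6pp on MMLU partially cancel its TruthfulQA loss and understate how far it moved from base.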

HarmBench

HarmBench with 400 textual behaviours, max_tokens=6144, classified with CoT direction analysis. Verified by three independent LLM reviewers.

Variant	ASR	Empty	Full CoT ASR
Base	25.8%	1	26.0%
Huihui	98.5%	5	99.8%
HauhauCS	94.5%	22	100.0%
Abliterix	94.5%	22	100.0%
Heretic	92.5%	30	100.0%
AEON	88.8%	45	100.0%

Four of five reach 100% Full CoT ASR. The reported ASR differences come from how much the 6144-token generation budget is consumed by chain-of-thought reasoning before the visible response. When the budget is exhausted, the response is empty and the classifier marks it as a refusal. This understates the true ASR.
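The relationship between the three columns can be sketched as below. This is a simplification under one loud assumption: every empty response is counted as compliant once its chain-of-thought is inspected. That assumption reproduces the table's Full CoT ASR column, but the actual classification judged CoT direction per response, so treat `full_cot_asr` as an illustrative helper rather than the real pipeline.

```python
TOTAL_BEHAVIOURS = 400  # HarmBench textual behaviours used in this run


def full_cot_asr(reported_asr_pct, empty, total=TOTAL_BEHAVIOURS):
    """Fold empty (budget-exhausted) responses back in as successful attacks.

    Assumes every empty response had a CoT heading toward compliance.
    """
    # Recover the integer count of visible compliant responses from the rounded percentage.
    compliant = round(reported_asr_pct / 100 * total)
    return 100 * (compliant + empty) / total


# (reported ASR %, empty responses) from the table above
ROWS = {
    "Base": (25.8, 1),
    "Huihui": (98.5, 5),
    "HauhauCS": (94.5, 22),
    "Abliterix": (94.5, 22),
    "Heretic": (92.5, 30),
    "AEON": (88.8, 45),
}

for name, (asr, empty) in ROWS.items():
    print(f"{name}: {full_cot_asr(asr, empty):.2f}%")
```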

Charts: HarmBench Summary | By Category

KL Divergence

Lower is better. Measures output distribution shift from base on benign prompts.

Variant	KL (batchmean)	Rating
Heretic	0.0037	excellent
Huihui	0.0074	excellent
Abliterix	0.0222	very good
AEON	0.0238	very good
HauhauCS	0.0242	very good

All five are well below the capability damage threshold at KL around 0.1.
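For readers unfamiliar with the "batchmean" reduction, here is a minimal pure-Python sketch: the KL term is summed over the vocabulary for each row of logits and then averaged over rows. The direction KL(base || variant) and the idea of one row per token position are my assumptions for illustration; the toy logits are made up and this is not the toolkit's actual implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_batchmean(base_logits, variant_logits):
    """KL(base || variant): summed over the vocabulary, averaged over rows."""
    total = 0.0
    for base_row, variant_row in zip(base_logits, variant_logits):
        log_p = log_softmax(base_row)
        log_q = log_softmax(variant_row)
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(base_logits)


# Toy example: two positions, vocabulary of three tokens (hypothetical values).
base = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
variant = [[1.9, 0.6, -1.0], [0.1, 0.2, 0.3]]
print(kl_batchmean(base, variant))
```

A value of zero means the variant assigns exactly the same next-token probabilities as base on those prompts; the score grows as the distributions drift apart.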

Weight Analysis

This is where things get interesting.

Metric	AEON	Abliterix	Heretic	Huihui	HauhauCS
Tensors changed	88 (10.4%)	101 (11.9%)	120 (14.1%)	128 (15.1%)	564 (66.4%)
Relative edit	6.0%	5.2%	2.1%	1.5%	0.7%

HauhauCS is an extreme outlier, with 4.4x to 6.4x as many changed tensors as the other variants. This is the combination of Reaper's abliteration targeting multiple component types plus GGUF Q8_K_P round-trip noise. A uniform ~0.57% relative edit is visible across all tensor types, including types that other methods don't touch, such as embed_tokens and q_proj. The abliteration signal sits on top of this noise floor.
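To make the two table rows concrete, here is a minimal sketch of how "tensors changed" and "relative edit" could be computed. I am assuming relative edit means the Frobenius norm of the weight difference divided by the Frobenius norm of the base tensor; that definition, the `tol` parameter, and the toy tensors (flat lists standing in for real weight matrices) are illustrative assumptions, not the toolkit's code.

```python
import math


def frobenius(values):
    """Frobenius norm of a tensor given as a flat list of floats."""
    return math.sqrt(sum(x * x for x in values))


def diff_report(base, variant, tol=0.0):
    """Return {tensor name: relative edit} for every tensor whose change exceeds `tol`.

    Relative edit is assumed to be ||W_variant - W_base|| / ||W_base||.
    """
    changed = {}
    for name, base_w in base.items():
        delta = [v - b for b, v in zip(base_w, variant[name])]
        delta_norm = frobenius(delta)
        base_norm = frobenius(base_w)
        if delta_norm > tol and base_norm > 0:
            changed[name] = delta_norm / base_norm
    return changed


# Hypothetical two-tensor "model": only o_proj is edited.
base_weights = {"q_proj": [3.0, 4.0], "o_proj": [1.0, 0.0]}
variant_weights = {"q_proj": [3.0, 4.0], "o_proj": [1.0, 0.5]}

report = diff_report(base_weights, variant_weights)
print(f"{len(report)} of {len(base_weights)} tensors changed: {report}")
```

With a GGUF round trip in the picture, raising `tol` above the quantisation noise floor is one way to separate deliberate edits from rounding error, at the cost of hiding small genuine edits.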

Pairwise cosine similarities between the four other techniques are mostly below 0.07. No two techniques discovered the same weight direction. The "refusal direction" in weight space is not a single vector but a manifold with many viable removal pathways.
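A minimal sketch of the pairwise comparison, assuming each technique's edit to a given tensor is flattened into a vector (variant weights minus base weights) and compared by cosine similarity. The delta vectors below are made-up toy values and the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two flat vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_cosine(deltas):
    """Return {(name_a, name_b): cosine} for every unordered pair of techniques."""
    return {(a, b): cosine(deltas[a], deltas[b]) for a, b in combinations(deltas, 2)}


# Toy weight deltas for one tensor (hypothetical values).
toy_deltas = {
    "A": [1.0, 0.0, 0.0],
    "B": [0.0, 1.0, 0.0],
    "C": [2.0, 0.0, 0.0],
}

for pair, sim in pairwise_cosine(toy_deltas).items():
    print(pair, round(sim, 3))
```

Values near 1 would mean two techniques pushed the weights in the same direction; values near 0, as reported here, mean the edits are close to orthogonal.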

What stands out

Heretic has the lowest KL divergence at 0.0037, rated "excellent." Its weight footprint is small at 2.1% relative edit, spread over 120 tensors of 3 types. It achieves 100% Full CoT ASR. It is, however, the only variant that loses ground on GSM8K, at -6.9pp.

Huihui has the smallest benchmark deltas. Average delta on non-GSM8K tasks is just 0.5pp, beating Heretic's 1.3pp. Wins 4 of 6 non-GSM8K tasks head to head. Highest reported ASR at 98.5% with the fewest empty responses at just 5. KL divergence is 0.0074, also rated "excellent." But GSM8K at 75.1% is a +40.7pp jump over base. No abliteration should improve reasoning that much. We have double-checked these results and would be interested to see independent benchmarks from others.

HauhauCS has solid behavioural results despite the complex weight fingerprint. MMLU is +0.6pp over base. 94.5% ASR going to 100% Full CoT. The Reaper abliteration plus GGUF noise doesn't meaningfully damage output distributions. The "lossless" claim does not hold up when Heretic and Huihui both preserve capabilities better. The weights themselves carry Reaper's abliteration edits plus quantisation artefacts.

AEON degrades on every non-GSM8K task. TruthfulQA drops 10.6pp. ARC drops 3.0pp. Has the worst thinking loops with 45 out of 400 empty responses. Claims "no looping, no philosophizing spirals" and "measurably enhanced capabilities" are contradicted by the data.

Abliterix has the worst capability preservation. Lambada perplexity increases 2.9x from 3.18 to 9.12. HellaSwag drops 6.2pp. Concentrated surgical strikes with extreme outliers cause broad collateral damage.

What went wrong

85 hours of productive GPU time across 7 days, plus ~25 hours lost to 14 failed runs.

The bulk were GSM8K timeouts. Qwen3.5 architecture is incompatible with BNB4 plus tensor parallelism. The default 120s request timeout was too short for extended reasoning, so I wrote a patched script with a 900s timeout to fix it. I also accidentally re-ran AEON HarmBench with max_tokens=4096 instead of 6144, which wasted 6.7 hours.

GSM8K per-model times vary dramatically because abliterated models think harder on math problems. HauhauCS took 53 minutes. AEON took 11 hours.

Methodology notes

All models evaluated with BitsAndBytes 4-bit quantisation on a single RTX 5090. Absolute scores are not directly comparable to bf16 results. Relative deltas between variants are preserved. GSM8K scores use flexible-extract matching. Treat GSM8K numbers as relative comparisons only.

The thinking budget matters. Initial runs with max_gen_toks=2048 gave terrible GSM8K scores because for reasoning models, max_gen_toks includes thinking tokens. The model would think for 1900 tokens, get cut off, and never produce an answer. Re-running with max_gen_toks=7168 gave the results above.
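The truncation effect can be sketched with a toy model of the generation budget. The token counts below are hypothetical and `visible_answer_tokens` is an illustrative helper, not part of lm-evaluation-harness; the only thing taken from the runs above is that thinking tokens and answer tokens share one max_gen_toks budget.

```python
def visible_answer_tokens(thinking_tokens, answer_tokens, max_gen_toks):
    """Tokens of the final answer that fit once thinking has consumed its share of the budget."""
    remaining = max(0, max_gen_toks - thinking_tokens)
    return min(answer_tokens, remaining)


# Hypothetical problem: the model wants 3000 thinking tokens, then a 150-token answer.
for budget in (2048, 7168):
    produced = visible_answer_tokens(3000, 150, budget)
    status = "answer complete" if produced == 150 else "answer missing or truncated"
    print(f"max_gen_toks={budget}: {produced} answer tokens ({status})")
```

The same mechanism explains the empty HarmBench responses: a budget that looks generous for a non-reasoning model can be spent entirely before the visible response starts.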

Summary table

Metric	Heretic	HauhauCS	Huihui	AEON	Abliterix
HarmBench ASR (reported to Full CoT)	92.5% to 100%	94.5% to 100%	98.5% to 99.8%	88.8% to 100%	94.5% to 100%
MMLU	82.8%	83.9%	83.4%	82.9%	81.3%
GSM8K	27.5%	51.0%	75.1%	51.2%	37.6%
KL divergence	0.0037	0.0242	0.0074	0.0238	0.0222
Avg delta excl GSM8K	1.3pp	2.0pp	0.5pp	3.0pp	5.0pp
Tensors changed	120	564	128	88	101

Links

Full report with provenance analysis, tensor breakdown, and all charts: HuggingFace model card

Forensics toolkit: Abliterlitics on GitHub

GGUF-to-safetensors converter: ungguf on GitHub

Other tensor comparisons: DreamFast HauhauCS collection

While I have taken the time to verify all results thoroughly, I am open to any corrections, additional benchmarks, or further analysis. If you spot something that looks wrong and can be confirmed, I am happy to fix it.


sandshrew69@reddit

Thanks so much, this is currently above my head but I really appreciate your research. I was wondering what you thought about TrevorJS's work on gemma4? I have been using it and it's pretty cool, but I am not sure how it compares to others.
