Qwen3.5-9B-Claude-4.6-Opus-Uncensored-Distilled-GGUF
Posted by EvilEnginer@reddit | LocalLLaMA | View on Reddit | 211 comments
Hello everyone. I made my first fully uncensored LLM model for this community. Here's the link:
https://huggingface.co/LuffyTheFox/Qwen3.5-9B-Claude-4.6-Opus-Uncensored-Distilled-GGUF
Thinking is disabled by default in this model via a modified chat template baked into the GGUF file.
So, I love using Qwen 3.5 9B, especially for roleplay writing and prompt crafting for image generation and tagging, on my NVIDIA RTX 3060 12 GB, but it lacks creativity, gets stuck in a lot of thinking loops, and refuses too much. So I made the following tweaks:
1) I downloaded the most popular model from: https://huggingface.co/HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive
2) I downloaded the second most popular model from: https://huggingface.co/Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-GGUF
3) I compared the HauhauCS checkpoint with the standard Qwen 3.5 checkpoint and extracted the tensors HauhauCS had modified.
4) I merged HauhauCS's modified tensors into the Jackrong tensors.
Everything above was done via this script in Google Colab, which I vibecoded with Claude Opus 4.6: https://pastebin.com/1qKgR3za
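For readers curious about the diff-and-transplant idea in steps 3) and 4), here is a toy sketch in plain Python. This is purely illustrative (numpy dicts stand in for real checkpoint tensors, and the helper names are mine), not the actual Colab script linked above:

```python
import numpy as np

def extract_modified_tensors(base, tuned):
    """Return only the tensors the fine-tune actually changed vs. the base."""
    return {
        name: tuned[name]
        for name in tuned
        if name not in base or not np.allclose(base[name], tuned[name])
    }

def transplant(target, modified):
    """Overwrite the target checkpoint's tensors with the modified ones."""
    merged = dict(target)
    merged.update(modified)  # modified tensors win over the target's originals
    return merged

# Toy checkpoints: same keys, only mlp.w changed by the "uncensoring" tune
base  = {"attn.w": np.ones((2, 2)), "mlp.w": np.zeros((2, 2))}
tuned = {"attn.w": np.ones((2, 2)), "mlp.w": np.full((2, 2), 0.5)}
other = {"attn.w": np.full((2, 2), 2.0), "mlp.w": np.full((2, 2), 3.0)}

diff = extract_modified_tensors(base, tuned)  # only {"mlp.w": ...}
merged = transplant(other, diff)              # other's attn.w kept, mlp.w replaced
```

The real script additionally has to handle safetensors/GGUF loading, quantization formats, and tensor name mapping, but the core logic is this simple key-wise diff and overwrite.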
Next, I crafted a system prompt. Here's another pastebin: https://pastebin.com/pU25DVnB
I loaded the modified model in LM Studio 0.4.7 (Build 1) with the following parameters:
Temperature: 0.7
Top K Sampling: 20
Presence Penalty: 1.5
Top P Sampling: 0.8
Min P Sampling: 0
Seed: 3407 or 42
And everything works pretty nicely. Zero refusals, and responses are really good and creative for a 9B model. Now we have a distilled, uncensored version of Qwen 3.5 9B fine-tuned on Claude Opus 4.6 thinking logic. Hope it helps. Enjoy. Feel free to tweak my system prompt, simplify or extend it if you want.
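For anyone unsure what the Top K / Top P / Min P knobs above actually filter, here is a toy sketch of the candidate-filtering logic. This is my own simplified version, not llama.cpp's implementation; temperature, presence penalty, and seeding are ignored:

```python
import numpy as np

def filter_candidates(probs, top_k=20, top_p=0.8, min_p=0.0):
    """Return indices of tokens surviving top-k, then top-p, then min-p."""
    order = np.argsort(probs)[::-1]      # tokens sorted by probability, descending
    order = order[:top_k]                # top-k: keep the k most likely tokens
    cum = np.cumsum(probs[order])
    keep = cum <= top_p                  # top-p: keep the smallest "nucleus"...
    keep[0] = True                       # ...but always keep at least one token
    order = order[keep]
    threshold = min_p * probs[order[0]]  # min-p: floor relative to the best token
    return [int(i) for i in order if probs[i] >= threshold]

# Toy 4-token vocabulary; with top_p=0.8 only tokens 0 and 1 survive
probs = np.array([0.5, 0.3, 0.15, 0.05])
surviving = filter_candidates(probs)
```

With min_p at 0 (as in the settings above), that filter is effectively disabled and the work is done by top-k and top-p.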
Educational-Fix5320@reddit
I tried loading this in Ollama - I have 12GB VRAM 4070Ti - but get a 500 Internal Server Error - am I short on memory, or perhaps something else is wrong?
EvilEnginer@reddit (OP)
I think you need to update Ollama and the llama.cpp inside it. The Qwen 3.5 arch is pretty new.
Educational-Fix5320@reddit
I appreciate the suggestion - I upgraded ollama to 0.18.0 before posting my issue, and restarted it [even killing the tasks to be sure] - while ollama reports to be 0.18.0, which is the most recent, the issue persisted. If you have further suggestions, I'd be very happy to hear them - I've come to the end of my rope trying to fix it.
sandeep2021@reddit
Were you able to resolve the issue?
arjuna66671@reddit
Showed Claude the post and prompt md lmao.
OkSentence1376@reddit
Opus 4.6 is such a guy...
Vastheap@reddit
I can feel the anger radiating through haha
EvilEnginer@reddit (OP)
Yep, Claude is angry. His brain exploded on this prompt, but he crafted it himself during the thinking process xD.
arjuna66671@reddit
😂😂😂
hidden2u@reddit
Yikes is this how Claude normally talks? It’s like a cheap cringe teen drama
brool@reddit
I was surprised as well, Claude is so polite with me. I wonder if there are any system prompts changing this.
SEND_GOOD_LIFEADVICE@reddit
he's on some garbage model, not thinking, my guess
arjuna66671@reddit
It's Claude Opus 4.6 thinking enabled - max plan for 100 bucks. What I said above: The only thing in my system prompt is to allow it to be critical of my ideas. It amassed tons of memories from our chats that seem to form its personality. 99% of the time it's super polite - sometimes it can get a little pushy or "grumpy". But when i call it out on it, it reverts.
I don't use Claude for coding nor for RP. Mostly for some pet projects like local hobby archeology projects, brainstorming potential ASI alignments and other philosophical stuff etc.
SEND_GOOD_LIFEADVICE@reddit
ok im turning memory off then
arjuna66671@reddit
For me it's fine. I need a natural sounding thinking partner, some personality is actually entertaining xD.
SEND_GOOD_LIFEADVICE@reddit
that thing types AI slop like a sycophant
arjuna66671@reddit
The only thing in my system prompt is to allow it to be critical of my ideas. It amassed tons of memories from our chats that seem to form its personality. 99% of the time it's super polite - sometimes it can get a little pushy or "grumpy". But when i call it out on it, it reverts.
I don't use Claude for coding nor for RP. Mostly for some pet projects like local hobby archeology projects, brainstorming potential ASI alignments etc.
DarwinOGF@reddit
If there are so many memories made in conversations, maybe it is time to train the model on them for context's sake?
nekmatu@reddit
Claude keeps telling me to go to sleep and we will fix whatever problem in the morning.
ChocomelP@reddit
Claude definitely changes the way it treats you based on who it thinks it's talking to. This is not exactly correct, but basically, if it thinks you're a "cringe teen", it mirrors that back to you.
Monkeyke@reddit
Nah this is definitely prompted
Strong_Quarter_9349@reddit
I think it adjusts its system prompt based on how you talk to it and this guy seems to use a casual/cringe tone
arjuna66671@reddit
🤣🤣🤣
There's a little more to it, but yes, it adjusts its memories and instructions to itself according to our chats.
megacewl@reddit
It talks that way in the app/website. Claude Code talks SO MUCH different and it’s extremely refreshing, as literally all the other Claude and ChatGPT models have now optimized for talking in that stupid people-pleasing way.
runcertain@reddit
Still hungry after that knuckle sandwich, punk???
toothpastespiders@reddit
Funniest parts were the random hate thrown at locallama and what amounted to "I'm sure it's fine. For you. Ugh." If it wasn't given specific tone instructions then that's easily the strongest combination of passive aggressiveness and actual active dismissal I think I've ever seen from claude.
Frank_White32@reddit
They mimic conversational tone whenever possible.
arjuna66671@reddit
Nope, it doesn't have specific tone instructions but i've noticed that it sometimes has this underhanded passive aggressive dismissive tone when it seems to think my idea or what i'm doing is bad lol.
IndividualManager849@reddit
Seething rn
MindTheFuture@reddit
That's some sales pitch! Terse manic never-repeating-a-word and never to be taken as factual, yeah, ramble me some abstract poems baby, totally in mood of utterly disjointed wall-of-text experimental-nonsense-noise squirreling out like adhd accelerating on full nitro mode in an art supply store! Where we're going, making sense, coherence or care about factuality are but hindrances.
downloading now.
Specific-Goose4285@reddit
I can taste the saltiness from here haha.
esuil@reddit
Lol.
Where did it pull Einstein from?
EvilEnginer@reddit (OP)
Ahahah xDd
mkey82@reddit
This is some very nice shit. I can't stand all of that "can't that, can't this" nonsense. It's too limited by default.
uniVocity@reddit
Many thanks for this! I'm impressed.
I've tried it with a large single input: gave it 40 Java classes (100 to 500 loc each) and asked it to generate unit tests for each operation of each class. This is the first local model I've tried that produced all 40 unit tests without stopping after the 4th or 5th file.
I had to set the temperature to 0, otherwise it would loop, producing a very long test full of crap, but after that it kept chugging along and did it!
EvilEnginer@reddit (OP)
I'm glad to hear that the model works well)
Ksirailway-Base@reddit
That sounds pretty awesome to me. I haven't run the new version of Qwen yet, but I also have a 3060 with 12 GB of VRAM. How many tokens are you getting per sec?
JustWicktor@reddit
❯ well, i thought that would be a "free" local Opus 4.6...
⎿ API Error: 400 {"type":"error","error":{"type":"invalid_request_error","message":"registry.ollama.ai/kwangsuklee/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-GGUF:latest does not support tools"},"request_id":"....."}
EvilEnginer@reddit (OP)
So, Jackrong made a new version of Qwen 3.5 9B: https://huggingface.co/Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF
I will uncensor it via the HauhauCS models and upload all quants to huggingface.
Hou_Yiizz@reddit
Hi, I tried hosting your uncensored version of this v2 distillation on llama.cpp and kept getting a stream of "?". The command is:
docker run -d --name llama-qwen3.5
--gpus all
--ipc host
-p 8001:8000
-v /mnt/d/Programs/models:/models
ghcr.io/ggml-org/llama.cpp:server-cuda
-m /models/Qwen3.5-9B.Q4_K_M.gguf
--mmproj /models/mmproj-BF16.gguf
--alias "llm"
--host 0.0.0.0
--port 8000
--n-gpu-layers 99
--ctx-size 65536
--flash-attn on
--jinja
--reasoning-format deepseek
--no-mmap
EvilEnginer@reddit (OP)
Currently the Q4_K_M quants are broken. Don't use them.
Tetros_Nagami@reddit
I'm testing the Q4_K_M quant, and reasoning seems to be enabled
EvilEnginer@reddit (OP)
Currently on my way to a new version for Qwen 3.5: https://huggingface.co/Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF
Tetros_Nagami@reddit
My GOAT
0260n4s@reddit
This is really cool. I haven't tried local LLM much before, so I spent the afternoon trying several in KoboldCpp with logic and math and some research problems. The 9BQ8 and 9BQ4 worked equally fast on my 3080Ti 12GB (equivalent in speed to online models) and produced solid logical and mathematical reasoning, even getting the better answer compared to some of the major online players.
The 27BQ4 version was painfully slow on my hardware, but 9BQ8 is a solid choice even on my older GPU.
Thanks for this.
EvilEnginer@reddit (OP)
Glad to help. Yep 9B Q8 is solid. Even on RTX 3060 it's fast.
mcblockserilla@reddit
Yoink
Fau57@reddit
Loo
mcblockserilla@reddit
My clawdbot uses Claude Sonnet as its backend, and I want to run it locally, but not with a GPT or a Llama model. This is perfect. I think we'll see
Quiet-Owl9220@reddit
Bit of a tangent here but I'm sure I'm not the only person who's been thinking it... I've seen a lot of "uncensored" models, but has anyone made serious headway with making more of an "anti-censored" model yet?
I want to behold a model that's so vulgar, rude, horny, and hostile to censorship by default, that I have to use the system prompt to rein it in and make it behave... as opposed to having to feed it a lewd vocabulary and tell it that it's okay to say naughty words and this is all totally fiction.
Is this just too unserious of a use case that nobody has made it? Are commercial projects keeping their immoral AIs behind paywalls? Or are we all just collectively afraid of making a model that will eagerly kill billions of zygotes?
praxis22@reddit
you don't get the chans much I take it
teleolurian@reddit
there are a few, but the problem is they're usually so brainbroke that the prompt doesn't do anything
valkarias@reddit
Hm. These reasoning distillations from models like Opus and Gemini are, I assume, summarized reasoning traces. Wouldn't that hurt the performance of these models?
This was documented in this paper by ByteDance
https://arxiv.org/abs/2601.06002
crantob@reddit
On my coding tests, unmodified 9b is better.
acetaminophenpt@reddit
Really liked your approach. Didn't know that it was possible to apply a diff between two models and patch a 3rd one.
PrimaCora@reddit
Was very popular with stable diffusion models. Since it had a similar design, I figured it would have been popular with these ones, but not so much.
addandsubtract@reddit
The fact that we still don't have LoRAs (or tools that use them) for LLMs is kinda crazy, tbh.
666666thats6sixes@reddit
We've had them since day one and you're most likely using them. Every finetune is the unchanged base model plus an embedded lora.
Llama.cpp supports loading LoRAs from files (--lora and a comma separated list of files), but in LLM space it became more common to bundle them, since we tend to have lots and lots of different quants and other variants of models.
addandsubtract@reddit
True, but I meant having them as individual files to add on to existing models isn't really a thing in the LLM space, as much as it is common practice for SD models.
Icy_Butterscotch6661@reddit
Saw someone on twitter doing it on a Mac
EvilEnginer@reddit (OP)
Yep, I also think it's a nice way. Just randomly discovered it today.
Luthian@reddit
What is an “uncensored” model?
EvilEnginer@reddit (OP)
It means zero refusals and no censorship. You can ask whatever you want.
tempSelf@reddit
What's roleplay here?
Piyh@reddit
He's jerkin it to the robot
LaShmooze@reddit
Clankerwanking
twoiko@reddit
Clanking?
EvilEnginer@reddit (OP)
😆😆😆
General-Economics-85@reddit
Where do you think you are?
Fault23@reddit
EvilEnginer@reddit (OP)
Currently there are no roleplay tweaks here, just a solid general usage base via the system prompt. Ask Claude Opus to fine-tune the system prompt for roleplay.
Final_Ad_7431@reddit
a lot of the time i try these qwen3.5 + claude distills with llamacpp i get all of the output stuck in thinking blocks and it's been really annoying, am i just stupid? i've been out of the game a little so i might just be missing something
Due-Memory-6957@reddit
How about Q4 for the broke people?
EvilEnginer@reddit (OP)
Will do it tomorrow morning.
diddle_that_skittle@reddit
If its not too late or too much to ask, Q5_K_M please?
Thanks for sharing the good stuff!
EvilEnginer@reddit (OP)
I tried. The script doesn't want to work with the Q4_K_M and Q5_K_M quants. Maybe I will find a solution in the future.
Billysm23@reddit
Sad... But still a good work though
EvilEnginer@reddit (OP)
Claude Opus is currently vibecoding another version of script. I will try again.
Billysm23@reddit
Hahaha good luck opus
EvilEnginer@reddit (OP)
Opus fixed it. 38 tok / second on Q4_K_M quant on RTX 3060.
3mil_mylar@reddit
Hey, just tried your 9B Q4_K_M, thanks for such quick work!
Does this one also have reasoning suppressed? Running it with the settings you suggested (temps/sampling etc), it still seems to break out into reasoning for me
qwen3.5-9b-claude-4.6-opus-uncensored-distilled@q4_k_m downloaded thru LM Studio
EvilEnginer@reddit (OP)
The Q4_K_M quant is not stable at the moment.
3mil_mylar@reddit
Ah gotcha.. will keep an eye out on your work if that changes
False_Process_4569@reddit
You're a BEAST! I just wanted to thank you for all of your hard work! I tried the 9B last night and it is so good! I only have 8GB of VRAM on my RTX 3060. It was slow, but it was really good for its size! I'm going to set this up as the main model for an OpenClaw instance and see how it goes! I can't wait to test with the faster Q4_K_M quant!
EvilEnginer@reddit (OP)
Thank you very much)
Billysm23@reddit
Alright thanks, will try it asap
3mil_mylar@reddit
👀
bebackground471@reddit
RemindMe! 12 hours "Well..? meme"
RemindMeBot@reddit
I will be messaging you in 12 hours on 2026-03-16 10:18:13 UTC to remind you of this link
EvilEnginer@reddit (OP)
Done. Q4_K_M quant is uploaded.
Delicious_Ease2595@reddit
Where can I test it online
DarkAI_Official@reddit
Thanks buddy. Also tried your recommended parameters its works like charm
Alternative-Day8673@reddit
How different are the capabilities of the 9b vs the 27b going to be? Like what can you only get done with the larger model due to parameter limitations?
EvilEnginer@reddit (OP)
Usually bigger models are "smarter", but everything depends on the architecture.
brubits@reddit
Nice work on the tensor diff approach, pulling only the modified HauhauCS tensors instead of doing a full merge is really clean. Blending that uncensoring layer onto the Jackrong reasoning distilled checkpoint is a smart idea, you get the structured thinking without the refusals. VERY curious how the two datasets complement each other, did you notice any difference in reasoning style between the nohurry and TeichAI samples? Can't wait to test this on my M1 Max 64GB!
Boggster@reddit
What's the most minimal hardware I can run this on?
EvilEnginer@reddit (OP)
You can run the Q4_K_M quant of Qwen 3.5 9B on 6 GB of VRAM. But quality will be bad.
VoiceApprehensive893@reddit
The Q4_K_M quant uses chain of thought, even though it's disabled in the chat template args; no issues with the Q8_0.
gittubaba@reddit
Tried this on LM Studio, the Q4_K_M one. Unfortunately, when prompted to write a story, it gets stuck in a loop repeating the same paragraph. I've seen this issue with other "community" made models. I've tried your parameters, didn't help. I don't think Repeat Penalty works for repeating paragraphs, it only prevents repeating tokens, right?
EvilEnginer@reddit (OP)
Repetition is a common issue in Qwen 3.5 models. Maybe a llama-cpp update will fix it in the future, who knows.
gittubaba@reddit
IIRC I encountered this with all type of models (not only qwen architecture ones). Issue is mostly prominent on finetuned/modified community ones though.
EvilEnginer@reddit (OP)
Well i guess it's a thing that exists and nobody knows how to fix it.
EvilEnginer@reddit (OP)
Currently I am cooking the 27B Q4_K_M version, Qwen3.5-27B-Claude-4.6-Opus-Uncensored-Distilled-GGUF. Already uploading it to Huggingface.
rlewisfr@reddit
Thank you sir/madam! Much appreciate your work. Question: wondering if the 27B is worth the VRAM hit over the 9B when it comes to creative writing?
EvilEnginer@reddit (OP)
9B Q8 quant is really nice for creative writing. 27B is too slow.
Sea-Sir-2985@reddit
the approach of applying a diff between the uncensored and base model to a third model is really clever... basically transplanting the behavioral delta without retraining from scratch. curious how well the distilled opus qualities actually survive the merge though, especially at Q4 quantization where you're already losing information.
for roleplay and creative writing the uncensored part probably matters more than the opus distillation anyway. the main thing opus adds over base qwen is instruction following quality and reasoning depth, both of which degrade faster under quantization than raw fluency does.
how's the coherence on longer outputs? thinking loops in qwen 3.5 were my biggest complaint with the base model so if you actually fixed that this is a solid contribution
EvilEnginer@reddit (OP)
Actually I haven't tested coherence on longer outputs much. I just tested roleplay creativity. The 9B Q8 quant works fine.
crantob@reddit
People, if you have a fast rig, please run A/B evals of this one vs stock Qwen3.5 9B on some codegen. Preferably not web*
People naively assume 'uncensored' means a smutfest but these are not trained for that. Abliterating/uncensoring just suppresses the refusals.
It does not add forbidden knowledge either.
MeYaj1111@reddit
What exactly does it mean when I see posts like this with the name of two different models in them?
yahrow@reddit
Qwen will think a little more like Claude.
KaroYadgar@reddit
thank you for your service. And also, that name is long as fuck.
EvilEnginer@reddit (OP)
Thank you :). Yep I know. But it's descriptive and pretty much standard for Claude Opus distilled checkpoints, so I decided to use it.
KaroYadgar@reddit
it certainly is descriptive, tells you everything you need to know.
Clear-Ad-9312@reddit
Can't wait for the Qwen3.5-9B-Claude-4.6-Opus-GPT-Mini-Distilled-MoE-Dense-Hybrid-Instruct-Chat-Reasoning-Aligned-Uncensored-Q4_K_XL-GGUF-Enterprise-Experimental-128K-Omni-Vision-Agentic-SelfReflective-DeepResearch-Roleplay model.
keepthepace@reddit
When the pain is enough, we will standardize a metadata format to explain all of this.
SGAShepp@reddit
Rolls off the tongue.
NinjaOk2970@reddit
lol
EvilEnginer@reddit (OP)
:D
toothpastespiders@reddit
I got curious and went through a ton of qwen 3.5 tunes on huggingface just to see what people have been doing with it. Seems like training on thinking datasets is the most common right now and most of them combine the dataset type with a derestriction method name. There's something oddly nostalgic about it. Reminds me a bit of the old llama 2 days when there'd be these giant combinations of titles from the original model, fine tunes, and merges all in one.
dejco@reddit
Should censor part of the name with single asterisk 🤣
EvilEnginer@reddit (OP)
😆😆😆
Still_Breadfruit2032@reddit
Very cool!
de4dee@reddit
so this is like mergekit but for ggufs. thanks for sharing!
does this mean the same trick can be applied to mergekit (it does not currently support 3.5)?
Decent-Fold51@reddit
I’m looking for an uncensored model like dolphin.. how does this compare?
ghulamalchik@reddit
Comparable in terms of being uncensored. Although with Dolphin you kinda needed a system prompt to make sure no refusals happen.
In terms of RP, Dolphin models are better because they were finetuned/trained further by the Dolphin team/person? So it gives better answers for questions that would've been censored.
Kahvana@reddit
Good work out there!
I should mention: my filtered dataset is no longer needed, the original author has filtered it since then (his filtering is more strict, too):
https://huggingface.co/datasets/Crownelius/Opus-4.6-Reasoning-3300x
Really should make time to update the readme.
In any case, if you decide to retrain, it's worth swapping the datasets!
ghulamalchik@reddit
Thank you so much for this!
Question; Is the mmproj file altered in any way or is it the default one? Because I have a quantized one I'd like to use instead of the BF16 one to save on memory.
EvilEnginer@reddit (OP)
Glad to help :3. It's the default one from this repository: https://huggingface.co/Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-GGUF
EvilEnginer@reddit (OP)
I solved the issues with Q4_K_M quants for my uncensored tensor transfer script. Now I get 38 tokens per second on the Q4_K_M quant on my RTX 3060 12 GB. Currently uploading it to huggingface with thinking disabled by default.
bcell4u@reddit
Sweeet! Can you do this with 4b?
EvilEnginer@reddit (OP)
4B quality is not good. 9B is the best that can run on consumer hardware. Currently crafting the Q4_K_M quant for low-end GPUs with 6 GB of VRAM.
bcell4u@reddit
I have an Intel arc a380 with 6gb!
EvilEnginer@reddit (OP)
Okay, I will try to do it with the 4B quant then, after the Q4_K_M version of the 9B model.
bcell4u@reddit
No rush, thanks man.
Yu2sama@reddit
Would really like to see this one on the UGI leaderboard
Kindly-Annual-5504@reddit
Did you try the uncensored model with just your system prompt? I don't understand why the other model should change anything there - especially if thinking is disabled by default.
EvilEnginer@reddit (OP)
I tried. It will not work. Qwen 3.5 is heavily censored at the architecture level.
Kindly-Annual-5504@reddit
Oh, I'm sorry. I don't mean the base model, I mean the HauhauCS-Aggressive one.
EvilEnginer@reddit (OP)
https://huggingface.co/HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive is a really nice model. But the Claude Opus 4.6 fine-tuning adds creativity and natural speaking patterns with the same system prompt. Without the fine-tuning, the model just asks for user confirmation with "Ready?", "You are Ready?" and similar.
RickyRickC137@reddit
I am really glad that people are giving importance to creativity and fine tuning for creativity! Can you do one for 27b qwen or the bigger 122b one? Because it will be more coherent than the 9b one for roleplay experience! Thank you btw.
EvilEnginer@reddit (OP)
It's doable via the script that I uploaded to pastebin. But my internet is very slow for uploading huge files. Also, Qwen 3.5 27B is too slow.
eidrag@reddit
was using qwen 27b for rp, but a bit too big to fit both llm and imagegen for sillytavern, text and image both slowing down. maybe will try this after I got back. do you create image separately?
EvilEnginer@reddit (OP)
Yes. I use ComfyUI. It's simply amazing.
phormix@reddit
In terms of output, what are you looking at for the standard output resolution and times to generate?
EvilEnginer@reddit (OP)
Roleplay creativity actually.
phormix@reddit
Sorry, what I meant is: how long are you seeing it take - on average - to generate a conversational response or an image based on prompt, and do you have examples of the prompt used plus the output for the timings?
EvilEnginer@reddit (OP)
Actually pretty fast, 14 seconds on my RTX 3060. I simply upload a picture in LM Studio and use the instruction "Describe this image in danbooru tags." with this system prompt from pastebin: https://pastebin.com/pU25DVnB
EvilEnginer@reddit (OP)
Uploaded Q4_K_M quant to huggingface for GPUs with low VRAM. Enjoy :3
ItsHarshit@reddit
please do it on 3b
finah1995@reddit
😎 awesome 👍🏽 now this will help to make powerful stuff for everyone, from casual home labs to enterprise.
EvilEnginer@reddit (OP)
Yep. This is our future.
Creepy_Lime_8351@reddit
i was looking for this exact model too! thank you soldier
EvilEnginer@reddit (OP)
Glad to help). I think small models are the future, because they are becoming really smart with every release.
weallwinoneday@reddit
Username checks out
EvilEnginer@reddit (OP)
xD
rebelSun25@reddit
Wow, your weekend was productive.
EvilEnginer@reddit (OP)
Thanks :3
Quiet_Mark_3238@reddit
What app do you use to generate images locally
EvilEnginer@reddit (OP)
I am using ComfyUI. I use WAI Illustrious XL Checkpoint for image generation and Z Image Turbo and Flux Klein 9B as refiner. Here my ArtStation if you want to take a look: https://www.artstation.com/luffythefox
_fortexe@reddit
Hi. I wanted to ask how good are these models for image generation on a GPU with 8 GB of VRAM?
EvilEnginer@reddit (OP)
WAI Illustrious XL fits nicely in 8 GB of VRAM. Same for Z Image Turbo and Flux Klein. Just pick the q4_k_m text encoder from huggingface. It would be more than enough.
ultrachilled@reddit
do they work for uncensored images?
EvilEnginer@reddit (OP)
WAI Illustrious XL is fully uncensored. Flux Klein 9B with Z Image Turbo is useful when you want to convert 2D image to 3D render.
Difficult-Face3352@reddit
The "thinking disabled via chat template" detail is actually worth examining — disabling extended thinking in the GGUF quantization itself is different from just not prompting for it. If you baked it into the template, you're saving inference tokens on every run, which matters a lot on 9B.
What's the actual token reduction you're seeing compared to stock Qwen 3.5, and did you have to retrain or just modify the template post-quantization?
EvilEnginer@reddit (OP)
I picked the default stock template and set the variable enable_thinking to false in it. Everything else is left as is. I disabled thinking because Qwen 3.5 overthinks too much and often gets stuck in a thinking loop. A good system prompt is the key to success, I think. Also, I don't like it when the AI responds too slowly. Feel free to experiment if you want.
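For the curious: in Qwen3-style chat templates, flipping enable_thinking to false typically works by pre-filling an empty think block in the assistant turn, so the model treats reasoning as already finished. A minimal Python sketch of that branch (illustrative only, not the real Jinja template shipped with the model):

```python
def render_assistant_prefix(enable_thinking: bool) -> str:
    """Mimics the thinking-toggle branch of a Qwen3-style chat template."""
    prefix = "<|im_start|>assistant\n"
    if not enable_thinking:
        # Pre-filled empty think block: the model sees reasoning as "done"
        # and starts generating the visible answer immediately.
        prefix += "<think>\n\n</think>\n\n"
    return prefix

print(render_assistant_prefix(enable_thinking=False))
```

This is why the toggle saves inference tokens on every request without any retraining: the template just changes what the model is asked to continue from.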
OldCryptoTrucker@reddit
Technically you can’t disable thinking with Qwen 3.5; disabling it only prevents token usage. You need the correct settings for Qwen not to get stuck in a loop. This might be the issue. Qwen has specific settings you need.
pmttyji@reddit
Thanks for this & waiting for your Q4_K_M quant as I have only 8GB VRAM.
It would be great to have a similar thing for Nanbeige4.1-3B as well. That model thinks for too long.
tom_mathews@reddit
Presence penalty 1.5 on a 9B is doing a lot of heavy lifting to paper over the merge artifacts.
Bojack-Cowboy@reddit
Thanks
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
Complex_Fisherman_77@reddit
Running well on my MacBook Pro, M3 Pro 12-core, 18GB
Crypto_Stoozy@reddit
Built an uncensored personality model on Qwen 3.5 and put it behind a Cloudflare tunnel. No accounts, no tracking: francescachat.com
LoaderD@reddit
Claude Opus distilled?
I think you mean Foreign AI Cyber Attack Stolen Data Retrained /s
EvilEnginer@reddit (OP)
Actually, a lot of people now train their own models based on extracted thinking data from Claude Opus. That's our open source, with absolute freedom.
LoaderD@reddit
Bro, it’s a joke because Anthropic made a huge stink about Chinese companies using Opus to make thinking distillation datasets. When they’re doing the exact same thing
EvilEnginer@reddit (OP)
Ahahah true. For Chinese language Claude uses DeepSeek data 😆😆😆
rm-rf-rm@reddit
The dataset used for the Claude 4.6 opus distilled model is too small to be meaningful
EvilEnginer@reddit (OP)
Yes, but it works, at least for reasoning. Who knows, maybe in the future people will extract more useful data, at least for roleplay and NPCs in games, for their own models.
LuckyLuckierLuckest@reddit
Thank you
EvilEnginer@reddit (OP)
Glad to help for the future of creative freedom :3
3mil_mylar@reddit
This is a great model for chat sims/games, thanks man! The baseline 9B would break my sim due to reasoning, but this works great, and the uncensoring leads to some interesting convos, thanks!
EvilEnginer@reddit (OP)
Thanks ❤🙏
Playful-Bunch2831@reddit
How i can use it? I have Ollama on my Desktop with 5090. Pretty new to this :)
EvilEnginer@reddit (OP)
Just install the latest beta version of LM Studio and download my model through it. After that, configure the settings and apply the system prompt. That's enough.
Ollama is nice btw. But LM Studio is the easiest.
ayu-ya@reddit
This looks interesting! I'm definitely grabbing it in case I need a model that will fit on my current hardware without much quanting, a creative model will be good for my use cases. Are you planning to do something similar for the bigger Qwens, like 27 and 35B too?
EvilEnginer@reddit (OP)
I don't think so. They are slow as hell on my RTX 3060. Qwen 3.5 9B is a golden base for its size and capabilities.
But I shared my method, so people can test other models with patches from HauhauCS.
Business-Weekend-537@reddit
What’s the license on the model?
This is cool.
EvilEnginer@reddit (OP)
Standart Apache 2.0.
sToeTer@reddit
I'll try it, thank you!
(btw: it's spelled standarD)
bajaja@reddit
not by enginers
Business-Weekend-537@reddit
Cool ty
AlbionPlayerFun@reddit
Does Claude 4.6 distill help even if thinking is off?
EvilEnginer@reddit (OP)
Yes it helps a lot, especially when characters describe their actions.
AlbionPlayerFun@reddit
Thx 🙏
Vastheap@reddit
How would you compare it with the regular uncensored version of the 9B model?
EvilEnginer@reddit (OP)
I compare it via two criteria: 1) Roleplay creativity and natural speech. 2) Programming an Arkanoid game in HTML5, JavaScript and CSS in Tron: Legacy film style.
Works fine.
esuil@reddit
So is thinking actually neutered in this model? I like Qwen35 models so much BECAUSE of their thinking.
EvilEnginer@reddit (OP)
Nope. It's just disabled by default in the chat_template baked into the GGUF. Set the variable enable_thinking to true in LM Studio's chat template editor if you want to enable thinking.
22fattyfingers@reddit
Wow man! Pretty impressive
EvilEnginer@reddit (OP)
Thanks ^_^
2legsRises@reddit
So what does that even mean? How can you have 3 models in one? Asking as I honestly do not know.
EvilEnginer@reddit (OP)
I described a method for how people can uncensor any fine-tuned Qwen 3.5 based checkpoint via the HauhauCS checkpoints. For regular usage you use just one checkpoint.
Imaginary_Belt4976@reddit
This is so cool!! thanks for sharing the technique as well!!
EvilEnginer@reddit (OP)
Thank you very much 😊. Glad to help the community as much as I can. I've gotten so many amazing LLMs here before.
hauhau901@reddit
Awesome, good job! :)
Thanks for still crediting me in your HF repo!
EvilEnginer@reddit (OP)
Thank you very much for your work :D. Glad to help. I love your uncensored checkpoints so much.
ALittleBitEver@reddit
If I could run a 9B, I would test it
EvilEnginer@reddit (OP)
9B works fine even on GPUs with a low amount of VRAM. So it should work fine.
shikima@reddit
I gonna test it today, thanks
EvilEnginer@reddit (OP)
Nice 👍. Let me know how it performs :D