Qwen3.5-9B-Claude-4.6-Opus-Uncensored-Distilled-GGUF
Posted by EvilEnginer@reddit | LocalLLaMA | View on Reddit | 211 comments
Hello everyone. I made my first fully uncensored LLM model for this community. Here's the link:
https://huggingface.co/LuffyTheFox/Qwen3.5-9B-Claude-4.6-Opus-Uncensored-Distilled-GGUF
Thinking is disabled by default in this model via a modified chat template baked into the GGUF file.
So, I love using Qwen 3.5 9B, especially for roleplay writing and prompt crafting for image generation and tagging, on my NVIDIA RTX 3060 12 GB, but it lacks creativity, gets stuck in a lot of thinking loops, and refuses too much. So I made the following tweaks:
1) I downloaded the most popular model from: https://huggingface.co/HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive
2) I downloaded the second most popular model from: https://huggingface.co/Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-GGUF
3) I compared the HauhauCS checkpoint with the standard Qwen 3.5 checkpoint and extracted the tensors HauhauCS had modified.
4) I merged HauhauCS's modified tensors into the Jackrong tensors.
Everything above was done via this script in Google Colab, which I vibecoded with Claude Opus 4.6: https://pastebin.com/1qKgR3za
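For readers curious about the diff-and-transplant idea in steps 3) and 4), here is a toy sketch in plain Python. This is purely illustrative (numpy dicts stand in for real checkpoint tensors, and the helper names are mine), not the actual Colab script linked above:

```python
import numpy as np

def extract_modified_tensors(base, tuned):
    """Return only the tensors the fine-tune actually changed vs. the base."""
    return {
        name: tuned[name]
        for name in tuned
        if name not in base or not np.allclose(base[name], tuned[name])
    }

def transplant(target, modified):
    """Overwrite the target checkpoint's tensors with the modified ones."""
    merged = dict(target)
    merged.update(modified)  # modified tensors win over the target's originals
    return merged

# Toy checkpoints: same keys, only mlp.w changed by the "uncensoring" tune
base  = {"attn.w": np.ones((2, 2)), "mlp.w": np.zeros((2, 2))}
tuned = {"attn.w": np.ones((2, 2)), "mlp.w": np.full((2, 2), 0.5)}
other = {"attn.w": np.full((2, 2), 2.0), "mlp.w": np.full((2, 2), 3.0)}

diff = extract_modified_tensors(base, tuned)  # only {"mlp.w": ...}
merged = transplant(other, diff)              # other's attn.w kept, mlp.w replaced
```

The real script additionally has to handle safetensors/GGUF loading, quantization formats, and tensor name mapping, but the core logic is this simple key-wise diff and overwrite.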
Next, I crafted a system prompt. Here's another pastebin: https://pastebin.com/pU25DVnB
I loaded the modified model in LM Studio 0.4.7 (Build 1) with the following parameters:
Temperature: 0.7
Top K Sampling: 20
Presence Penalty: 1.5
Top P Sampling: 0.8
Min P Sampling: 0
Seed: 3407 or 42
And everything works pretty nicely. Zero refusals, and responses are really good and creative for a 9B model. Now we have a distilled, uncensored version of Qwen 3.5 9B fine-tuned on Claude Opus 4.6 thinking logic. Hope it helps. Enjoy. Feel free to tweak my system prompt, simplify or extend it if you want.
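For anyone unsure what the Top K / Top P / Min P knobs above actually filter, here is a toy sketch of the candidate-filtering logic. This is my own simplified version, not llama.cpp's implementation; temperature, presence penalty, and seeding are ignored:

```python
import numpy as np

def filter_candidates(probs, top_k=20, top_p=0.8, min_p=0.0):
    """Return indices of tokens surviving top-k, then top-p, then min-p."""
    order = np.argsort(probs)[::-1]      # tokens sorted by probability, descending
    order = order[:top_k]                # top-k: keep the k most likely tokens
    cum = np.cumsum(probs[order])
    keep = cum <= top_p                  # top-p: keep the smallest "nucleus"...
    keep[0] = True                       # ...but always keep at least one token
    order = order[keep]
    threshold = min_p * probs[order[0]]  # min-p: floor relative to the best token
    return [int(i) for i in order if probs[i] >= threshold]

# Toy 4-token vocabulary; with top_p=0.8 only tokens 0 and 1 survive
probs = np.array([0.5, 0.3, 0.15, 0.05])
surviving = filter_candidates(probs)
```

With min_p at 0 (as in the settings above), that filter is effectively disabled and the work is done by top-k and top-p.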
Educational-Fix5320@reddit
I tried loading this in Ollama - I have 12GB VRAM 4070Ti - but get a 500 Internal Server Error - am I short on memory, or perhaps something else is wrong?
EvilEnginer@reddit (OP)
I think you need to update Ollama and the llama.cpp inside it. The Qwen 3.5 arch is pretty new.
Educational-Fix5320@reddit
I appreciate the suggestion - I upgraded ollama to 0.18.0 before posting my issue, and restarted it [even killing the tasks to be sure] - while ollama reports to be 0.18.0, which is the most recent, the issue persisted. If you have further suggestions, I'd be very happy to hear them - I've come to the end of my rope trying to fix it.
sandeep2021@reddit
Were you able to resolve the issue?
arjuna66671@reddit
Showed Claude the post and prompt md lmao.
OkSentence1376@reddit
Opus 4.6 is such a guy...
Vastheap@reddit
I can feel the anger radiating through haha
EvilEnginer@reddit (OP)
Yep, Claude is angry. His brain exploded on this prompt, but he crafted it himself during the thinking process xD.
arjuna66671@reddit
😂😂😂
hidden2u@reddit
Yikes is this how Claude normally talks? It’s like a cheap cringe teen drama
brool@reddit
I was surprised as well, Claude is so polite with me. I wonder if there are any system prompts changing this.
SEND_GOOD_LIFEADVICE@reddit
he's on some garbage model, not thinking, my guess
arjuna66671@reddit
It's Claude Opus 4.6 thinking enabled - max plan for 100 bucks. What I said above: The only thing in my system prompt is to allow it to be critical of my ideas. It amassed tons of memories from our chats that seem to form its personality. 99% of the time it's super polite - sometimes it can get a little pushy or "grumpy". But when i call it out on it, it reverts.
I don't use Claude for coding nor for RP. Mostly for some pet projects like local hobby archeology projects, brainstorming potential ASI alignments and other philosophical stuff etc.
SEND_GOOD_LIFEADVICE@reddit
ok im turning memory off then
arjuna66671@reddit
For me it's fine. I need a natural sounding thinking partner, some personality is actually entertaining xD.
SEND_GOOD_LIFEADVICE@reddit
that thing types AI slop like a sycophant
arjuna66671@reddit
The only thing in my system prompt is to allow it to be critical of my ideas. It amassed tons of memories from our chats that seem to form its personality. 99% of the time it's super polite - sometimes it can get a little pushy or "grumpy". But when i call it out on it, it reverts.
I don't use Claude for coding nor for RP. Mostly for some pet projects like local hobby archeology projects, brainstorming potential ASI alignments etc.
DarwinOGF@reddit
If there are so many memories made in conversations, maybe it is time to train the model on them for context's sake?
nekmatu@reddit
Claude keeps telling me to go to sleep and we will fix whatever problem in the morning.
ChocomelP@reddit
Claude definitely changes the way it treats you based on who it thinks it's talking to. This is not exactly correct, but basically, if it thinks you're a "cringe teen", it mirrors that back to you.
Monkeyke@reddit
Nah this is definitely prompted
Strong_Quarter_9349@reddit
I think it adjusts its system prompt based on how you talk to it and this guy seems to use a casual/cringe tone
arjuna66671@reddit
🤣🤣🤣
There's a little more to it, but yes, it adjusts its memories and instructions to itself according to our chats.
megacewl@reddit
It talks that way in the app/website. Claude Code talks SO MUCH different and it’s extremely refreshing, as literally all the other Claude and ChatGPT models have now optimized for talking in that stupid people-pleasing way.
runcertain@reddit
Still hungry after that knuckle sandwich, punk???
toothpastespiders@reddit
Funniest parts were the random hate thrown at locallama and what amounted to "I'm sure it's fine. For you. Ugh." If it wasn't given specific tone instructions then that's easily the strongest combination of passive aggressiveness and actual active dismissal I think I've ever seen from claude.
Frank_White32@reddit
They mimic conversational tone whenever possible.
arjuna66671@reddit
Nope, it doesn't have specific tone instructions but i've noticed that it sometimes has this underhanded passive aggressive dismissive tone when it seems to think my idea or what i'm doing is bad lol.
IndividualManager849@reddit
Seething rn
MindTheFuture@reddit
That's some sales pitch! Terse manic never-repeating-a-word and never to be taken as factual, yeah, ramble me some abstract poems baby, totally in mood of utterly disjointed wall-of-text experimental-nonsense-noise squirreling out like adhd accelerating on full nitro mode in an art supply store! Where we're going, making sense, coherence or care about factuality are but hindrances.
downloading now.
Specific-Goose4285@reddit
I can taste the saltiness from here haha.
esuil@reddit
Lol.
Where did it pull Einstein from?
EvilEnginer@reddit (OP)
Ahahah xDd
mkey82@reddit
This is some very nice shit. I can't stand all of that "can't that, can't this" nonsense. It's too limited by default.
uniVocity@reddit
Many thanks for this! I'm impressed.
I've tried it with a large single input: gave it 40 Java classes (100 to 500 loc each) and asked it to generate unit tests for each operation of each class. This is the first local model I've tried that produced all 40 unit tests without stopping after the 4th or 5th file.
I had to set the temperature to 0, otherwise it would loop, producing a very long test full of crap, but after that it kept chugging along and did it!
EvilEnginer@reddit (OP)
I'm glad to hear that the model works well)
Ksirailway-Base@reddit
That sounds pretty awesome to me. I haven't run the new version of Qwen yet, but I also have a 3060 with 12 GB of VRAM. How many tokens are you getting per sec?
JustWicktor@reddit
❯ well, i thought that would be a "free" local Opus 4.6...
⎿ API Error: 400 {"type":"error","error":{"type":"invalid_request_error","message":"registry.ollama.ai/kwangsuklee/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-GGUF:latest does not support tools"},"request_id":"....."}
EvilEnginer@reddit (OP)
So, Jackrong made a new version of Qwen 3.5 9B: https://huggingface.co/Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF
I will uncensor it via the HauhauCS models and upload all quants to huggingface.
Hou_Yiizz@reddit
Hi, I tried hosting your uncensored version of this v2 distillation on llama.cpp and kept getting a stream of "?". The command is:
docker run -d --name llama-qwen3.5
--gpus all
--ipc host
-p 8001:8000
-v /mnt/d/Programs/models:/models
ghcr.io/ggml-org/llama.cpp:server-cuda
-m /models/Qwen3.5-9B.Q4_K_M.gguf
--mmproj /models/mmproj-BF16.gguf
--alias "llm"
--host 0.0.0.0
--port 8000
--n-gpu-layers 99
--ctx-size 65536
--flash-attn on
--jinja
--reasoning-format deepseek
--no-mmap
EvilEnginer@reddit (OP)
Currently the Q4_K_M quants are broken. Don't use them.
Tetros_Nagami@reddit
I'm testing the Q4_K_M quant, and reasoning seems to be enabled
EvilEnginer@reddit (OP)
Currently on my way to a new version for Qwen 3.5: https://huggingface.co/Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF
Tetros_Nagami@reddit
My GOAT
0260n4s@reddit
This is really cool. I haven't tried local LLM much before, so I spent the afternoon trying several in KoboldCpp with logic and math and some research problems. The 9BQ8 and 9BQ4 worked equally fast on my 3080Ti 12GB (equivalent in speed to online models) and produced solid logical and mathematical reasoning, even getting the better answer compared to some of the major online players.
The 27BQ4 version was painfully slow on my hardware, but 9BQ8 is a solid choice even on my older GPU.
Thanks for this.
EvilEnginer@reddit (OP)
Glad to help. Yep 9B Q8 is solid. Even on RTX 3060 it's fast.
mcblockserilla@reddit
Yoink
Fau57@reddit
Loo
mcblockserilla@reddit
My clawdbot uses Claude Sonnet as its backend, and I want to run it locally, but not with a GPT or a Llama model. This is perfect. I think we'll see
Quiet-Owl9220@reddit
Bit of a tangent here but I'm sure I'm not the only person who's been thinking it... I've seen a lot of "uncensored" models, but has anyone made serious headway with making more of an "anti-censored" model yet?
I want to behold a model that's so vulgar, rude, horny, and hostile to censorship by default, that I have to use the system prompt to rein it in and make it behave... as opposed to having to feed it a lewd vocabulary and tell it that it's okay to say naughty words and this is all totally fiction.
Is this just too unserious of a use case that nobody has made it? Are commercial projects keeping their immoral AIs behind paywalls? Or are we all just collectively afraid of making a model that will eagerly kill billions of zygotes?
praxis22@reddit
you don't get the chans much I take it
teleolurian@reddit
there are a few, but the problem is they're usually so brainbroke that the prompt doesn't do anything
valkarias@reddit
Hm. These reasoning distillations from models like Opus and Gemini are, I assume, summarized reasoning traces. Wouldn't that hurt the performance of these models?
This was documented in this paper by ByteDance
https://arxiv.org/abs/2601.06002
crantob@reddit
On my coding tests, unmodified 9b is better.
acetaminophenpt@reddit
Really liked your approach. Didn't know that it was possible to apply a diff between two models and patch a 3rd one.
PrimaCora@reddit
Was very popular with stable diffusion models. Since it had a similar design, I figured it would have been popular with these ones, but not so much.
addandsubtract@reddit
The fact that we still don't have LoRAs (or tools that use them) for LLMs is kinda crazy, tbh.
666666thats6sixes@reddit
We've had them since day one and you're most likely using them. Every finetune is the unchanged base model plus an embedded lora.
Llama.cpp supports loading LoRAs from files (--lora and a comma separated list of files), but in LLM space it became more common to bundle them, since we tend to have lots and lots of different quants and other variants of models.
addandsubtract@reddit
True, but I meant having them as individual files to add on to existing models isn't really a thing in the LLM space, as much as it is common practice for SD models.
Icy_Butterscotch6661@reddit
Saw someone on twitter doing it on a Mac
EvilEnginer@reddit (OP)
Yep, I also think it's a nice way. Just randomly discovered it today.
Luthian@reddit
What is an “uncensored” model?
EvilEnginer@reddit (OP)
It means zero refusals and no censorship. You can ask whatever you want.
tempSelf@reddit
What's roleplay here?
Piyh@reddit
He's jerkin it to the robot
LaShmooze@reddit
Clankerwanking
twoiko@reddit
Clanking?
EvilEnginer@reddit (OP)
😆😆😆
General-Economics-85@reddit
Where do you think you are?
Fault23@reddit
EvilEnginer@reddit (OP)
Currently there are no roleplay tweaks here, just a solid general usage base via the system prompt. Ask Claude Opus to fine-tune the system prompt for roleplay.
Final_Ad_7431@reddit
a lot of the time i try these qwen3.5 + claude distills with llamacpp i get all of the output stuck in thinking blocks and it's been really annoying, am i just stupid? i've been out of the game a little so i might just be missing something
Due-Memory-6957@reddit
How about Q4 for the broke people?
EvilEnginer@reddit (OP)
Will do it tomorrow morning.
diddle_that_skittle@reddit
If its not too late or too much to ask, Q5_K_M please?
Thanks for sharing the good stuff!
EvilEnginer@reddit (OP)
I tried. The script doesn't want to work with the Q4_K_M and Q5_K_M quants. Maybe I will find a solution in the future.
Billysm23@reddit
Sad... But still a good work though
EvilEnginer@reddit (OP)
Claude Opus is currently vibecoding another version of script. I will try again.
Billysm23@reddit
Hahaha good luck opus
EvilEnginer@reddit (OP)
Opus fixed it. 38 tok / second on Q4_K_M quant on RTX 3060.
3mil_mylar@reddit
Hey, just tried your 9B Q4_K_M, thanks for such quick work!
Does this one also have reasoning suppressed? Running it with the settings you suggested (temps/sampling etc), it still seems to break out into reasoning for me
qwen3.5-9b-claude-4.6-opus-uncensored-distilled@q4_k_m downloaded thru LM Studio
EvilEnginer@reddit (OP)
The Q4_K_M quant is not stable at the moment.
3mil_mylar@reddit
Ah gotcha.. will keep an eye out on your work if that changes
False_Process_4569@reddit
You're a BEAST! I just wanted to thank you for all of your hard work! I tried the 9B last night and it is so good! I only have 8GB of VRAM on my RTX 3060. It was slow, but it was really good for its size! I'm going to set this up as the main model for an OpenClaw instance and see how it goes! I can't wait to test with the faster Q4_K_M quant!
EvilEnginer@reddit (OP)
Thank you very much)
Billysm23@reddit
Alright thanks, will try it asap
3mil_mylar@reddit
👀
bebackground471@reddit
RemindMe! 12 hours "Well..? meme"
RemindMeBot@reddit
I will be messaging you in 12 hours on 2026-03-16 10:18:13 UTC to remind you of this link
EvilEnginer@reddit (OP)
Done. Q4_K_M quant is uploaded.
Delicious_Ease2595@reddit
Where can I test it online
DarkAI_Official@reddit
Thanks buddy. Also tried your recommended parameters its works like charm
Alternative-Day8673@reddit
How different are the capabilities of the 9b vs the 27b going to be? Like what can you only get done with the larger model due to parameter limitations?
EvilEnginer@reddit (OP)
Usually bigger models are "smarter", but everything depends on the architecture.
brubits@reddit
Nice work on the tensor diff approach, pulling only the modified HauhauCS tensors instead of doing a full merge is really clean. Blending that uncensoring layer onto the Jackrong reasoning distilled checkpoint is a smart idea, you get the structured thinking without the refusals. VERY curious how the two datasets complement each other, did you notice any difference in reasoning style between the nohurry and TeichAI samples? Can't wait to test this on my M1 Max 64GB!
Boggster@reddit
What's the most minimal hardware I can run this on?
EvilEnginer@reddit (OP)
You can run the Q4_K_M quant of Qwen 3.5 9B on 6 GB of VRAM. But quality will be bad.
VoiceApprehensive893@reddit
The Q4_K_M quant uses chain of thought, even though it's disabled in the chat template args; no issues with the Q8_0.
gittubaba@reddit
Tried this on LM Studio, the Q4_K_M one. Unfortunately, when prompted to write a story, it gets stuck in a loop repeating the same paragraph. I've seen this issue with other "community" made models. I've tried your parameters, didn't help. I don't think Repeat Penalty works for repeating paragraphs, it only prevents repeating tokens, right?
EvilEnginer@reddit (OP)
Repetition is a common issue in Qwen 3.5 models. Maybe a llama-cpp update will fix it in the future, who knows.
gittubaba@reddit
IIRC I encountered this with all type of models (not only qwen architecture ones). Issue is mostly prominent on finetuned/modified community ones though.
EvilEnginer@reddit (OP)
Well i guess it's a thing that exists and nobody knows how to fix it.
EvilEnginer@reddit (OP)
Currently I am cooking the 27B Q4_K_M version, Qwen3.5-27B-Claude-4.6-Opus-Uncensored-Distilled-GGUF. Already uploading it to Huggingface.
rlewisfr@reddit
Thank you sir/madam! Much appreciate your work. Question: wondering if the 27B is worth the VRAM hit over the 9B when it comes to creative writing?
EvilEnginer@reddit (OP)
9B Q8 quant is really nice for creative writing. 27B is too slow.
Sea-Sir-2985@reddit
the approach of applying a diff between the uncensored and base model to a third model is really clever... basically transplanting the behavioral delta without retraining from scratch. curious how well the distilled opus qualities actually survive the merge though, especially at Q4 quantization where you're already losing information.
for roleplay and creative writing the uncensored part probably matters more than the opus distillation anyway. the main thing opus adds over base qwen is instruction following quality and reasoning depth, both of which degrade faster under quantization than raw fluency does.
how's the coherence on longer outputs? thinking loops in qwen 3.5 were my biggest complaint with the base model so if you actually fixed that this is a solid contribution
EvilEnginer@reddit (OP)
Actually I haven't tested coherence on longer outputs much. I just tested roleplay creativity. The 9B Q8 quant works fine.
crantob@reddit
People, if you have a fast rig, please run A/B evals of this one vs stock Qwen3.5 9B on some codegen. Preferably not web*
People naively assume 'uncensored' means a smutfest but these are not trained for that. Abliterating/uncensoring just suppresses the refusals.
It does not add forbidden knowledge either.
MeYaj1111@reddit
What exactly does it mean when I see posts like this with the name of two different models in them?
yahrow@reddit
Qwen will think a little more like Claude.
KaroYadgar@reddit
thank you for your service. And also, that name is long as fuck.
EvilEnginer@reddit (OP)
Thank you :). Yep I know. But it's descriptive and pretty much standard for Claude Opus distilled checkpoints, so I decided to use it.
KaroYadgar@reddit
it certainly is descriptive, tells you everything you need to know.
Clear-Ad-9312@reddit
Can't wait for the Qwen3.5-9B-Claude-4.6-Opus-GPT-Mini-Distilled-MoE-Dense-Hybrid-Instruct-Chat-Reasoning-Aligned-Uncensored-Q4_K_XL-GGUF-Enterprise-Experimental-128K-Omni-Vision-Agentic-SelfReflective-DeepResearch-Roleplay model.
keepthepace@reddit
When the pain is enough, we will standardize a metadata format to explain all of this.
SGAShepp@reddit
Rolls off the tongue.
NinjaOk2970@reddit
lol
EvilEnginer@reddit (OP)
:D
toothpastespiders@reddit
I got curious and went through a ton of qwen 3.5 tunes on huggingface just to see what people have been doing with it. Seems like training on thinking datasets is the most common right now and most of them combine the dataset type with a derestriction method name. There's something oddly nostalgic about it. Reminds me a bit of the old llama 2 days when there'd be these giant combinations of titles from the original model, fine tunes, and merges all in one.
dejco@reddit
Should censor part of the name with single asterisk 🤣
EvilEnginer@reddit (OP)
😆😆😆
Still_Breadfruit2032@reddit
Very cool!
de4dee@reddit
so this is like mergekit but for ggufs. thanks for sharing!
does this mean the same trick can be applied to mergekit (it does not currently support 3.5)?
Decent-Fold51@reddit
I’m looking for an uncensored model like dolphin.. how does this compare?
ghulamalchik@reddit
Comparable in terms of being uncensored. Although with Dolphin you kinda needed a system prompt to make sure no refusals happen.
In terms of RP, Dolphin models are better because they were finetuned/trained further by the Dolphin team/person? So it gives better answers for questions that would've been censored.
Kahvana@reddit
Good work out there!
I should mention: my filtered dataset is no longer needed, the original author has filtered it since then (his filtering is more strict, too):
https://huggingface.co/datasets/Crownelius/Opus-4.6-Reasoning-3300x
Really should make time to update the readme.
In any case, if you decide to retrain, it's worth swapping the datasets!
ghulamalchik@reddit
Thank you so much for this!
Question; Is the mmproj file altered in any way or is it the default one? Because I have a quantized one I'd like to use instead of the BF16 one to save on memory.
EvilEnginer@reddit (OP)
Glad to help :3. It's the default one from this repository: https://huggingface.co/Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-GGUF
EvilEnginer@reddit (OP)
I solved the issues with Q4_K_M quants for my uncensored tensor transfer script. Now I get 38 tokens per second on the Q4_K_M quant on my RTX 3060 12 GB. Currently uploading it to huggingface with thinking disabled by default.
bcell4u@reddit
Sweeet! Can you do this with 4b?
EvilEnginer@reddit (OP)
4B quality is not good. 9B is the best that can run on consumer hardware. Currently crafting the Q4_K_M quant for low-end GPUs with 6 GB of VRAM.
bcell4u@reddit
I have an Intel arc a380 with 6gb!
EvilEnginer@reddit (OP)
Okay, I will try to do it with the 4B quant then, after the Q4_K_M version of the 9B model.
bcell4u@reddit
No rush, thanks man.
Yu2sama@reddit
Would really like to see this one on the UGI leaderboard
Kindly-Annual-5504@reddit
Did you try the uncensored model with just your system prompt? I don't understand why the other model should change anything there - especially if thinking is disabled by default.
EvilEnginer@reddit (OP)
I tried. It will not work. Qwen 3.5 is heavily censored at the architecture level.
Kindly-Annual-5504@reddit
Oh, I'm sorry. I don't mean the base model, I mean the HauhauCS-Aggressive one.
EvilEnginer@reddit (OP)
https://huggingface.co/HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive is a really nice model. But the Claude Opus 4.6 fine-tuning adds creativity and natural speaking patterns with the same system prompt. Without the fine-tuning, the model just asks for user confirmation with "Ready?", "You are Ready?" and similar.
RickyRickC137@reddit
I am really glad that people are giving importance to creativity and fine tuning for creativity! Can you do one for 27b qwen or the bigger 122b one? Because it will be more coherent than the 9b one for roleplay experience! Thank you btw.
EvilEnginer@reddit (OP)
It's doable via the script that I uploaded to pastebin. But my internet is very slow for uploading huge files. Also, Qwen 3.5 27B is too slow.
eidrag@reddit
was using qwen 27b for rp, but a bit too big to fit both llm and imagegen for sillytavern, text and image both slowing down. maybe will try this after I got back. do you create image separately?
EvilEnginer@reddit (OP)
Yes. I use ComfyUI. It's simply amazing.
phormix@reddit
In terms of output, what are you looking at for the standard output resolution and times to generate?
EvilEnginer@reddit (OP)
Roleplay creativity actually.
phormix@reddit
Sorry, what I meant is: how long are you seeing it take - on average - to generate a conversational response or an image based on prompt, and do you have examples of the prompt used plus the output for the timings?
EvilEnginer@reddit (OP)
Actually pretty fast, 14 seconds on my RTX 3060. I simply upload a picture in LM Studio and use the instruction "Describe this image in danbooru tags." with this system prompt from pastebin: https://pastebin.com/pU25DVnB
EvilEnginer@reddit (OP)
Uploaded Q4_K_M quant to huggingface for GPUs with low VRAM. Enjoy :3
ItsHarshit@reddit
please do it on 3b
finah1995@reddit
😎 awesome 👍🏽 now this will help to make powerful stuff for everyone, from casual home labs to enterprise.
EvilEnginer@reddit (OP)
Yep. This is our future.
Creepy_Lime_8351@reddit
i was looking for this exact model too! thank you soldier
EvilEnginer@reddit (OP)
Glad to help). I think small models are the future, because they are becoming really smart with every release.
weallwinoneday@reddit
Username checks out
EvilEnginer@reddit (OP)
xD
rebelSun25@reddit
Wow, your weekend was productive.
EvilEnginer@reddit (OP)
Thanks :3
Quiet_Mark_3238@reddit
What app do you use to generate images locally
EvilEnginer@reddit (OP)
I am using ComfyUI. I use WAI Illustrious XL Checkpoint for image generation and Z Image Turbo and Flux Klein 9B as refiner. Here my ArtStation if you want to take a look: https://www.artstation.com/luffythefox
_fortexe@reddit
Hi. I wanted to ask how good are these models for image generation on a GPU with 8 GB of VRAM?
EvilEnginer@reddit (OP)
WAI Illustrious XL fits nicely in 8 GB of VRAM. Same for Z Image Turbo and Flux Klein. Just pick the q4_k_m text encoder from huggingface. It would be more than enough.
ultrachilled@reddit
do they work for uncensored images?
EvilEnginer@reddit (OP)
WAI Illustrious XL is fully uncensored. Flux Klein 9B with Z Image Turbo is useful when you want to convert 2D image to 3D render.
Difficult-Face3352@reddit
The "thinking disabled via chat template" detail is actually worth examining — disabling extended thinking in the GGUF quantization itself is different from just not prompting for it. If you baked it into the template, you're saving inference tokens on every run, which matters a lot on 9B.
What's the actual token reduction you're seeing compared to stock Qwen 3.5, and did you have to retrain or just modify the template post-quantization?
EvilEnginer@reddit (OP)
I picked the default stock template and set the variable enable_thinking to false in it. Everything else is left as is. I disabled thinking because Qwen 3.5 overthinks too much and often gets stuck in a thinking loop. A good system prompt is the key to success, I think. Also, I don't like it when the AI responds too slowly. Feel free to experiment if you want.
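For the curious: in Qwen3-style chat templates, flipping enable_thinking to false typically works by pre-filling an empty think block in the assistant turn, so the model treats reasoning as already finished. A minimal Python sketch of that branch (illustrative only, not the real Jinja template shipped with the model):

```python
def render_assistant_prefix(enable_thinking: bool) -> str:
    """Mimics the thinking-toggle branch of a Qwen3-style chat template."""
    prefix = "<|im_start|>assistant\n"
    if not enable_thinking:
        # Pre-filled empty think block: the model sees reasoning as "done"
        # and starts generating the visible answer immediately.
        prefix += "<think>\n\n</think>\n\n"
    return prefix

print(render_assistant_prefix(enable_thinking=False))
```

This is why the toggle saves inference tokens on every request without any retraining: the template just changes what the model is asked to continue from.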
OldCryptoTrucker@reddit
Technically you can’t disable thinking with Qwen 3.5; disabling it only prevents token usage. You need the correct settings for Qwen not to get stuck in a loop. This might be the issue. Qwen has specific settings you need.
pmttyji@reddit
Thanks for this & waiting for your Q4_K_M quant as I have only 8GB VRAM.
It would be great to have a similar thing for Nanbeige4.1-3B as well. That model thinks for too long.
tom_mathews@reddit
Presence penalty 1.5 on a 9B is doing a lot of heavy lifting to paper over the merge artifacts.
Bojack-Cowboy@reddit
Thanks
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
Complex_Fisherman_77@reddit
Running well on my MacBook Pro, M3 Pro 12-core, 18GB
Crypto_Stoozy@reddit
Built an uncensored personality model on Qwen 3.5 and put it behind a Cloudflare tunnel. No accounts, no tracking: francescachat.com
LoaderD@reddit
Claude Opus distilled?
I think you mean Foreign AI Cyber Attack Stolen Data Retrained /s
EvilEnginer@reddit (OP)
Actually, a lot of people now train their own models based on extracted thinking data from Claude Opus. That's our open source, with absolute freedom.
LoaderD@reddit
Bro, it’s a joke because Anthropic made a huge stink about Chinese companies using Opus to make thinking distillation datasets. When they’re doing the exact same thing
EvilEnginer@reddit (OP)
Ahahah true. For Chinese language Claude uses DeepSeek data 😆😆😆
rm-rf-rm@reddit
The dataset used for the Claude 4.6 opus distilled model is too small to be meaningful
EvilEnginer@reddit (OP)
Yes, but it works, at least for reasoning. Who knows, maybe in the future people will extract more useful data, at least for roleplay and NPCs in games, for their own models.
LuckyLuckierLuckest@reddit
Thank you
EvilEnginer@reddit (OP)
Glad to help for the future of creative freedom :3
3mil_mylar@reddit
This is a great model for chat sims/games, thanks man! The baseline 9B would break my sim due to reasoning, but this works great, and the uncensoring leads to some interesting convos, thanks!
EvilEnginer@reddit (OP)
Thanks ❤🙏
Playful-Bunch2831@reddit
How i can use it? I have Ollama on my Desktop with 5090. Pretty new to this :)
EvilEnginer@reddit (OP)
Just install the latest beta version of LM Studio and download my model through it. After that, configure the settings and apply the system prompt. That's enough.
Ollama is nice btw. But LM Studio is the easiest.
ayu-ya@reddit
This looks interesting! I'm definitely grabbing it in case I need a model that will fit on my current hardware without much quanting, a creative model will be good for my use cases. Are you planning to do something similar for the bigger Qwens, like 27 and 35B too?
EvilEnginer@reddit (OP)
I don't think so. They are slow as hell on my RTX 3060. Qwen 3.5 9B is a golden base for its size and capabilities.
But I shared my method, so people can test other models with patches from HauhauCS.
Business-Weekend-537@reddit
What’s the license on the model?
This is cool.
EvilEnginer@reddit (OP)
Standart Apache 2.0.
sToeTer@reddit
I'll try it, thank you!
(btw: it's spelled standarD)
bajaja@reddit
not by enginers
Business-Weekend-537@reddit
Cool ty
AlbionPlayerFun@reddit
Does Claude 4.6 distill help even if thinking is off?
EvilEnginer@reddit (OP)
Yes it helps a lot, especially when characters describe their actions.
AlbionPlayerFun@reddit
Thx 🙏
Vastheap@reddit
How would you compare it with the regular uncensored version of the 9B model?
EvilEnginer@reddit (OP)
I compare it via two criteria: 1) Roleplay creativity and natural speech. 2) Programming an Arkanoid game in HTML5, JavaScript and CSS in Tron: Legacy film style.
Works fine.
esuil@reddit
So is thinking actually neutered in this model? I like Qwen35 models so much BECAUSE of their thinking.
EvilEnginer@reddit (OP)
Nope. It's just disabled by default in the chat_template baked into the GGUF. Set the variable enable_thinking to true in LM Studio's chat template editor if you want to enable thinking.
22fattyfingers@reddit
Wow man! Pretty impressive
EvilEnginer@reddit (OP)
Thanks ^_^
2legsRises@reddit
So what does that even mean? How can you have 3 models in one? Asking as I honestly do not know.
EvilEnginer@reddit (OP)
I described a method for how people can uncensor any fine-tuned Qwen 3.5 based checkpoint via the HauhauCS checkpoints. For regular usage you use just one checkpoint.
Imaginary_Belt4976@reddit
This is so cool!! thanks for sharing the technique as well!!
EvilEnginer@reddit (OP)
Thank you very much 😊. Glad to help the community as much as I can. I've gotten so many amazing LLMs here before.
hauhau901@reddit
Awesome, good job! :)
Thanks for still crediting me in your HF repo!
EvilEnginer@reddit (OP)
Thank you very much for your work :D. Glad to help. I love your uncensored checkpoints so much.
ALittleBitEver@reddit
If I could run a 9B, I would test it
EvilEnginer@reddit (OP)
9B works fine even on GPUs with a low amount of VRAM. So it should work fine.
shikima@reddit
I gonna test it today, thanks
EvilEnginer@reddit (OP)
Nice 👍. Let me know how it performs :D