Qwen3.5-35B-A3B-Uncensored-FernflowerAI-GGUF
Posted by EvilEnginer@reddit | LocalLLaMA | View on Reddit | 180 comments
Hello everyone. I found and fixed a training bug in the Qwen3.5 35B A3B model.
Here is my fixed version: https://huggingface.co/LuffyTheFox/Qwen3.5-35B-A3B-Uncensored-FernflowerAI-GGUF
Upgraded system prompt that unlocks deep thinking (works great with this model):
https://pastebin.com/pU25DVnB
Chat template: https://pastebin.com/uk9ZkxCR (supports tool calling)
Recommended Settings (LM Studio):
| Setting | Value |
|---|---|
| Temperature | 0.7 |
| Top K Sampling | 20 |
| Presence Penalty | 1.5 |
| Top P Sampling | 0.8 |
| Min P Sampling | 0 |
| Seed | 42 |
History:
I've been using Qwen 3.5 35B A3B (the uncensored version by HauhauCS) for a while. It's an incredible model - uncensored, MoE with 256 experts, hybrid DeltaNet + Attention, 40 layers. But something was off. On short prompts it works fine. On long conversations it started "philosophizing" - losing context, repeating itself, writing broken code with strange comments.
I spent two weeks digging through the weights.
What I found:
Two tensors. In blocks 36 and 37. ssm_conv1d.weight.
Their scale was ~60% higher than normal (σ=0.102 vs median 0.063). Because of how AdamW works, rare experts in the last layers get a huge effective learning rate - their weights drift.
In a recurrent architecture like DeltaNet, this kills the hidden state. The model forgets context after a few tokens.
Surprisingly, I didn't find any issues in Gemma 4 26B A4B - all scales in that model were correct.
What I did:
I scaled broken tensors back to normal. Nothing else. 489 other tensors were left untouched - their scale is architectural (gate_inp, etc.).
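(The rescaling step is simple in principle: keep the tensor's shape and mean, and pull its spread back to the target. Here is a naive sketch of that idea - not my actual script, and the toy numbers are made up:)

```python
import statistics

def rescale_to_std(weights, target_std):
    """Rescale a flat list of weights so its std matches target_std.

    Keeps the mean fixed and only shrinks/stretches the deviations,
    which is one naive reading of "scaled back to normal" - the real
    repair logic is more involved.
    """
    current_std = statistics.pstdev(weights)
    if current_std == 0:
        return list(weights)
    mean = statistics.fmean(weights)
    factor = target_std / current_std
    # Only the spread changes; the distribution's shape is preserved.
    return [mean + (w - mean) * factor for w in weights]

# Toy drifted tensor, pulled back toward the healthy median sigma of 0.063.
drifted = [0.0, 0.102, -0.102, 0.204, -0.204]
repaired = rescale_to_std(drifted, 0.063)
```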
Results:
- Error reduction: 88.6%.
- Long conversations now stay coherent.
- Code generation works.
- No more "philosophizing".
What I learned:
One bug. Two tensors. 64GB of model. And the entire potential of the most complex open-weight architecture was locked behind it.
If you're using MoE + recurrent hybrids (DeltaNet, Mamba, etc.), check your last blocks. AdamW might have silently broken them.
Enjoy ^_^
Thireus@reddit
u/VoidAlchemy
VoidAlchemy@reddit
Thanks, I'll look into it. fwiw all GGUFs created will have identical f32 dtypes for the mentioned vector data e.g.
Thireus@reddit
Thanks! Also see https://www.reddit.com/r/LocalLLaMA/comments/1sfwauj/comment/off7gus/ and answers.
VoidAlchemy@reddit
> The script is proprietary - I'm not sharing it.
I'm not convinced yet that there is any problem. I haven't seen any PPL/KLD numbers with/without patched bf16 safetensors (or bf16 gguf if that is where they are starting).
There is no problem at all with GGUF, and their claim has nothing to do with GGUF. Their claim is that the original safetensors bf16 weights for those two tensors have a wider standard deviation, so they scale it down, presumably?
It could just be some marketing hype for their finetuned model? Honestly not sure if it is worth investigating or not unless they release the actual script or someone can A/B test a patched/unpatched version in terms of PPL/KLD/lm harness evals or anything besides vibes.
Decivox@reddit
I built a tool and ran it against the original BF16 GGUF in the OP, I actually found one extra tensor (36), applied repairs, and got no differing results in the NIAH tests. Available below if you want to give it a test:
https://github.com/decibuild/qwen-ssm-repair
VoidAlchemy@reddit
Thanks for sharing your tokens! My GLM-5.1 running locally on CPU-only couldn't figure out what the "problem" even was after a couple hours so I gave up... haha...
Thireus@reddit
Thank you for looking into it!
Thireus@reddit
I agree and if there is indeed something wrong with the original tensors this should probably be raised with Qwen so they can update their model in a minor release.
EvilEnginer@reddit (OP)
So, I found that my per-neuron fixes were too rough and I got random text as output. Trying another way - skipping critical tensors and patching only what I can.
EbbNorth7735@reddit
Alternatively, capture the conditions that recreate the issue and set up a pipeline to run various increments of the adjusted weights.
EvilEnginer@reddit (OP)
Really good idea btw, thanks. I think I will use neighbouring neurons in the tensors, simply copy-pasting weight values into the holes.
EvilEnginer@reddit (OP)
So. Now I am downloading original Qwen3.5 9B BF16 model from Unsloth: https://huggingface.co/unsloth/Qwen3.5-9B-GGUF/blob/main/Qwen3.5-9B-BF16.gguf
Let's see what was broken in the model...
Thireus@reddit
Could you share the script you are using and how to patch the BF16 tensors please?
EvilEnginer@reddit (OP)
Thanks for the interest. The script is proprietary - I'm not sharing it. If you want a specific model checked, contact me directly.
VoidAlchemy@reddit
Would you be willing to make the script not proprietary? Also, do you have any PPL/KLD/lm harness evals to A/B test an unpatched vs a patched model? Otherwise I'm not personally convinced based on vibes alone.
More ramblings: https://www.reddit.com/r/LocalLLaMA/comments/1sfwauj/comment/oflmcoa/
Cheers!
EvilEnginer@reddit (OP)
The script stays proprietary. I don't have PPL/KLD/lm harness evals because I'm on Colab Free Tier and can't run them on 35B models. But the results speak for themselves: users report to me that 100k+ context started working, tool calls are stable, and there are no more loops or rambling. If you're not convinced, that's fine - the fixed model is free. Download it, test it yourself, and compare to the original. The difference is obvious.
VoidAlchemy@reddit
Thanks, I'll pass. Cheers!
True_Requirement_891@reddit
We need to do more investigative shit like this
EvilEnginer@reddit (OP)
Yep, very true.
VoidAlchemy@reddit
fwiw, this has nothing to do with GGUF, they are patching the original weights as released.
But yeah GGUF can be confusing too.
Johnwascn@reddit
Could you please check if the qwen3.5 122b is also damaged?
EvilEnginer@reddit (OP)
I will check Q4_K_P quant. It fits on Google Colab Free Tier disk space.
Johnwascn@reddit
Thanks, brother.
EvilEnginer@reddit (OP)
Btw, thank you very much guys for award. I am very pleased to hear that my work is in demand.
Thireus@reddit
Thanks for sharing this!
EvilEnginer@reddit (OP)
Thank you too :). I'm happy to help in any way I can.
Thireus@reddit
Has unsloth reacted to these findings?
EvilEnginer@reddit (OP)
I think not. They are focused on Gemma 4 now, and on quantization rather than the model-architecture healing that I am doing.
CATLLM@reddit
Amazing work! Thank you! If you can post a testing procedure, those of us that have a 4090 etc. can help test for you. Maybe also consider setting up a GitHub Sponsors page so people that are able to contribute to your work can.
EvilEnginer@reddit (OP)
Thank you! I really appreciate the offer. A simple test: take a long task (50k-100k tokens) - coding with tool calls, a complex conversation, or a large document analysis on this model that I just made:
https://huggingface.co/LuffyTheFox/Qwen3.5-27B-Claude-4.6-Opus-FernflowerAI-GGUF
Run it on the original model and my fixed version. The original will likely become indecisive, repeat itself, or break. The fixed version should stay stable. As for GitHub Sponsors - that's a great idea. I'll look into it.
But honestly, the best support right now is just spreading the word and testing the model on real tasks. Every report helps the community. Thanks again for being willing to help. This is what open source is about.
nosrslygtfo@reddit
what happened to this version: https://huggingface.co/LuffyTheFox/Qwen3.5-27B-Claude-4.6-Opus-FernflowerAI-GGUF
this was the v2 and had a Q8 quant that worked really well... :/
EvilEnginer@reddit (OP)
Qwopus v3 is better than this one. Jackrong updated it.
CATLLM@reddit
Ok, I just tested your 27B and the original. I had Claude create a deep reasoning task, but both were pretty much the same. Both sometimes stop around 8k-13k tokens without finishing the task. The original model was able to finish the task on 1/5 tries, and yours finished it 1/5 times only after I asked it "why did you stop?"
EvilEnginer@reddit (OP)
So, it looks like my fix doesn't work with an already-trained model. The gradients already drifted too much and broke the ssm_conv1d.weight tensor.
ID_crypto@reddit
Great stuff! Would you take a look at Jackrong/Qwopus3.5-27B-v3?
Better than v2 from my testing.
EvilEnginer@reddit (OP)
Yep I will take a look tomorrow.
Direct_Technician812@reddit
Oh my god, I'm looking forward to it too. Thank you so much!
cverity@reddit
Interesting. I had just tried yesterday to use the amazing 35B HauhauCS model for agentic coding instead of the standard 122B model to save some vram, and gave up after about an hour. Conversations would end prematurely, tool calls randomly failed, I couldn't leave it alone for a minute and quickly gave up. I was bummed because in every other respect, the 35B HauhauCS is a kick ass model.
So I will definitely be giving yours a try. Thanks for the contribution!
EvilEnginer@reddit (OP)
Nice. Thank you very much. Also, I will upload a new V2 update for this one today, with fixed "dead" neurons in tensors and a Q4_K_L quant with Unsloth tensor profiles: https://huggingface.co/LuffyTheFox/Qwen3.5-35B-A3B-Uncensored-FernflowerAI-GGUF
JayPSec@reddit
How do you determine which and how a tensor is broken?
EvilEnginer@reddit (OP)
I compare each tensor's scale (standard deviation) to the median of its peer group (tensors with identical shape). If the deviation exceeds a threshold and the tensor shows signs of saturation, it gets repaired. The exact criteria are proprietary.
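As a rough outline of that comparison (the exact criteria are proprietary, so the grouping key, the 1.5x threshold, and the tensor names below are all illustrative guesses, and the saturation check is omitted):

```python
import statistics

def find_outlier_tensors(tensors, threshold=1.5):
    """Flag tensors whose std is far above their peer-group median.

    `tensors` maps name -> (shape, flat list of weights); peers share
    a shape. The 1.5x threshold is a guess, and the real script also
    checks for saturation - this is only the outline.
    """
    groups = {}
    for name, (shape, weights) in tensors.items():
        groups.setdefault(shape, []).append((name, statistics.pstdev(weights)))

    outliers = []
    for members in groups.values():
        median_std = statistics.median(std for _, std in members)
        for name, std in members:
            if median_std > 0 and std / median_std > threshold:
                outliers.append(name)
    return outliers

# Toy model: 37 healthy conv tensors at sigma-scale 0.063, one drifted at 0.102.
base = [-1.0, 1.0, -1.0, 1.0, -2.0, 2.0, -2.0, 2.0]
tensors = {f"blk.{i}.ssm_conv1d.weight": ((4, 2), [0.063 * x for x in base])
           for i in range(37)}
tensors["blk.37.ssm_conv1d.weight"] = ((4, 2), [0.102 * x for x in base])
```

Grouping by identical shape keeps, for example, attention and conv tensors from being compared against each other, since their healthy scales differ architecturally.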
First_Neck_3384@reddit
when you say tensor scale, is it quantization scale (used to scale input to a specific range) or the magnitude of tensor itself?
EvilEnginer@reddit (OP)
Tensor scale here means the standard deviation of the weight values themselves - the magnitude of the tensor. Not quantization scale. I work directly with dequantized BF16 weights, so no quantization scale is involved. The comparison is between raw weight distributions, not compressed representations.
Embarrassed_Soup_279@reddit
does this mean the 27B dense model have similar training bug or is it only MOE?
EvilEnginer@reddit (OP)
27B is broken. I checked this one: https://huggingface.co/unsloth/Qwen3.5-27B-GGUF/blob/main/Qwen3.5-27B-Q8_0.gguf . It contains 8 broken ssm_conv1d.weight tensors.
oxygen_addiction@reddit
Can you also check Stepfun 3.5? It has a similar problem with overthinking.
EvilEnginer@reddit (OP)
Can't check this one. Don't have system resources and disk space to load it. It's too big for Google Colab Free Tier that I am using.
oxygen_addiction@reddit
I can look at it when I come back from vacation. Can you share any more resources? A python script, a colab, anything so I can understand the process?
Decivox@reddit
If you have the know-how/hardware, you can check out the work I've done so far on this issue. I believe I have the detection logic down. Unfortunately, I don't have the hardware to actually run a repair function on a GGUF and then perform the tests.
https://github.com/decibuild/qwen-ssm-repair/tree/main
EvilEnginer@reddit (OP)
Thanks for the interest. The method is proprietary and I'm not sharing the script at this time. It took months of work to develop the method and make it architecture independent with .GGUF and .safetensors compatibility. If you need a large model checked, here's how it works: you provide access to a machine (Google Colab Pro) and pay for my time. I run the diagnostic remotely, my toolset never leaves my control. You get a report - fixed model and a list of issues in it. Contact me if this works for you, and I will do the job.
Secure_Archer_1529@reddit
I don't get the downvoting. You're sharing fixes you did on a model and even offered to help someone with the model of his choice. Everybody here has a job, so kudos to you for having specialized skills people want to pay for.
EvilEnginer@reddit (OP)
Thank you very much :). I'm very glad to hear that.
Equal_Grape2337@reddit
So that means the 4B and 9B should have the same issues? I'm actively using them at https://github.com/spokvulcan/tesseract
EvilEnginer@reddit (OP)
Yes, small ones are broken.
Prudent-Ad4509@reddit
Now I'm starting to wonder about 397b
EvilEnginer@reddit (OP)
Can't even load it for analysis. Too big for Google Colab Free :D
Prudent-Ad4509@reddit
Can it be checked using lower quants ?
EvilEnginer@reddit (OP)
Lower quants are still too big for Google Colab Free Tier disk space.
Embarrassed_Soup_279@reddit
Thank you for confirming. This is really interesting.
EvilEnginer@reddit (OP)
I think yes. Because HauhauCS BF16 GGUF fits nicely to Google Colab Free Tier.
Embarrassed_Soup_279@reddit
tysm
EvilEnginer@reddit (OP)
I checked only MoE. I will take a look in 27B tomorrow and will let you know.
TheLastSpark@reddit
please reply if you see something like this in 27B as well!
EvilEnginer@reddit (OP)
27B is broken - confirmed.
IrisColt@reddit
The bug is again in the original Qwen 3.5 weights released by Alibaba, right?
EvilEnginer@reddit (OP)
Yes. Correct.
IrisColt@reddit
Mother of God... part 2.
TheLastSpark@reddit
Well I am eagerly awaiting a follow up post for 27B if you do fix it (and fixing improves it)
EvilEnginer@reddit (OP)
I think I can try to fix legendary model by https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled
But I don't know if the fix would help here and can't test it, since it's painfully slow on my RTX 3060 12GB.
Your thoughts?
TheLastSpark@reddit
If you have a benchmark or some kind of code I can run, I can maybe do it? I've got a 4090 and don't mind running stuff on it to test.
EvilEnginer@reddit (OP)
I don't have a formal benchmark. I focus on finding and fixing broken tensors - that's the hard part. Running tests is just validation.
If you want to help, here's a simple way: take the original 27B and my fixed 27B. Give both the same long, complex task (e.g., 50k-70k tokens of code or conversation). The original will likely become indecisive, repeat itself, or break. The fixed one should stay stable.
I don't have a 4090, so I can't run these tests myself. If you do and share the results, that would help the community. Either way, I'll release the fixed model - since people are asking for it.
Embarrassed_Soup_279@reddit
Wow, thank you!
EvilEnginer@reddit (OP)
Currently cooking Q4_K_XL quant of this model: https://huggingface.co/LuffyTheFox/Qwopus3.5-27B-v3-Uncensored-FernflowerAI-GGUF for powerful GPUs.
This would be the last test for Qwen3.5 27B model series.
Embarrassed_Soup_279@reddit
sorry if it seems like am asking too much, but would you please check this variant of RYS qwen model as well? https://huggingface.co/jackasda211233/Qwen3.5-27B-Uncensored-RYS-Reasoner-GGUF
It uses a splice method to combine HauhauCS layers with RYS duplicated layers, and it has a custom imatrix for reasoning/agentic tasks. iq4_nl seems to be the best quant to use with that, even over bf16.
EvilEnginer@reddit (OP)
Done. BF16 quant uploaded: https://huggingface.co/LuffyTheFox/Qwen3.5-27B-Uncensored-RYS-Reasoner-FernflowerAI-GGUF
Now I will try to process iq4_nl as it is.
Embarrassed_Soup_279@reddit
thank you!
EvilEnginer@reddit (OP)
Okay, I will check the BF16 quant. Since it's uncompressed, I can check the tensors precisely and fix them. Now I can fix even broken neurons in the neural network. I think it's time to test my new approach on this model.
nemomode7@reddit
Good job! Best parameters for inference? temp, topk, etc?
EvilEnginer@reddit (OP)
Simply use parameters from my post on top of this page. With default system prompt: "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."
nemomode7@reddit
So, the same. Thank you!
jwpbe@reddit
I've been using this model extensively over the last few weeks, it duplicates model layers that are activated strongly on math and EQ Bench. Would you mind checking it for broken tensors? It's already very strong as it is.
https://huggingface.co/dnhkng/RYS-Qwen3.5-27B-FP8-XL
The explanation of the model: https://dnhkng.github.io/posts/rys-ii/
EvilEnginer@reddit (OP)
I can check this model if it's in GGUF Q8_0 format. I can't load the 27B FP8 model in safetensors, because checking it that way is too resource-hungry.
jwpbe@reddit
there is a gguf set here: https://huggingface.co/Biomanticus/RYS-Qwen3.5-27B-gguf
EvilEnginer@reddit (OP)
Checked. It was broken too - the same scaling problem, with 8 ssm_conv1d.weight tensors affected. I fixed them in BF16 GGUF format. Here is the fixed version: https://huggingface.co/LuffyTheFox/RYS-Qwen3.5-27B-FernflowerAI-GGUF
EvilEnginer@reddit (OP)
Okay I will check F16 quant later.
BedderChavez@reddit
Can you check the Gemma 4 E4B model to see if it has any bugs? I had downloaded a couple of models to run modestly on an RTX 4060.
EvilEnginer@reddit (OP)
Checking finished. Here results for original Qwen3.5 9B model: https://pastebin.com/XD2VuwZp
EvilEnginer@reddit (OP)
Yep I will take a look later.
StardockEngineer@reddit
How was the original model uncensored? I don't want to download a model damaged in some other way.
EvilEnginer@reddit (OP)
Nobody knows. HauhauCS did it.
StardockEngineer@reddit
How can we uncensor the original model? I'd be happy to do it if you have a script.
EvilEnginer@reddit (OP)
The uncensoring method was done by HauhauCS, not me. I don't have that script. My work is fixing broken tensors in the already uncensored model, not removing refusals. If you want the original model uncensored, you'd need to ask HauhauCS or figure it out yourself. My script is proprietary and not shared.
IrisColt@reddit
Just curious... who's actually responsible for the bug in this model? The GGUF creator? HauhauCS? The Qwen team? Seems like an important distinction. Asking in good faith.
EvilEnginer@reddit (OP)
The bug is in the original Qwen 3.5 weights released by Alibaba. Not GGUF. Not HauhauCS. Alibaba shipped it broken. I just fixed it. The cause is training-related - AdamW + MoE + DeltaNet causes rare experts in the last layers to drift. This is a known challenge with recurrent MoE architectures, but Alibaba didn't calibrate it before release.
ComplexType568@reddit
Oh wow, does this mean that the Unsloth models are also broken among the models hosted on the Alibaba API?
EvilEnginer@reddit (OP)
Yes. All of them are broken. I checked this 27B one from Unsloth: https://huggingface.co/unsloth/Qwen3.5-27B-GGUF/blob/main/Qwen3.5-27B-Q8_0.gguf
It's broken too. It contains 8 broken ssm_conv1d.weight tensors.
FeiX7@reddit
So how does it affect the model?
EvilEnginer@reddit (OP)
It loses context during conversations on agentic tasks.
FeiX7@reddit
That's really bad. Did you contact the Qwen team on X?
EvilEnginer@reddit (OP)
No, I haven't written to them yet.
Koalateka@reddit
Just to be sure I understood this correctly: the error was in the full precision weights originally released by Alibaba. Is that correct?
EvilEnginer@reddit (OP)
Yes. Correct.
Koalateka@reddit
Your findings are very interesting, thanks for sharing.
IrisColt@reddit
Mother of God... Thanks for the info!!!
Major-System6752@reddit
Bartowski and unsloth quants affected?
EvilEnginer@reddit (OP)
Yes, affected.
nosrslygtfo@reddit
Awesome work, and what a writeup! Super helpful, thank you!!! Are you planning to upload a Q8 version of your fixed 35B model? https://huggingface.co/LuffyTheFox/Qwen3.5-35B-A3B-Uncensored-FernflowerAI-GGUF
Also any plans of releasing MLX versions?
leonbollerup@reddit
I tried it, but it seems unstable, at least in LM Studio, and I have to use a much lower context window. With the original I can run stably at around 190k; with this I can't start it above 90k, and it crashes more easily. I need to fiddle around with it.
Equivalent-Dream9615@reddit
why not use q4_K_S? it would fit nicely in 24gb.
EvilEnginer@reddit (OP)
I prefer Q4_K_XL with UD tensor profiles from Unsloth, because it has better quality than Q4_K_M.
Warm-Put3482@reddit
I downloaded it before your post in LM, hehe.
EvilEnginer@reddit (OP)
Ahah. Looks like hype is real :P
SeriousTeacher8058@reddit
Why isn't there a standard tool for comparing different versions of an LLM? If I had two versions of the same LLM, and I liked a specific feature from one version that another lacks, why can't I look at the layers and scale them or swap them with the same layers from another version?
EbbNorth7735@reddit
I'm sure you can. The file has a structure and it just requires splitting it up and parsing the correct parts to look at the numbers. It's what OP did.
Imaginary-Unit-3267@reddit
Because the mapping from parts of the neural network to traits is extremely ill-understood even by the people doing the training. To a great extent neural networks are black boxes. Figuring out how they operate takes work. The large scale architecture is well understood, but the details of how they encode the information they learn take a lot of digging, and there just isn't a standardized way to do that yet and likely won't be for a few more years yet.
EbbNorth7735@reddit
Did you check the 122B model? If not can you describe the process on how you checked them? I wouldn't mind checking myself just for my own knowledge.
EvilEnginer@reddit (OP)
I checked 122B. It's healthy. Only the small ones are broken.
Lucis_unbra@reddit
Would you mind releasing fixed versions of the otherwise untouched source model? If this isn't exclusive to fine-tuned or ablated/uncensored variants, I'd certainly like options that are "pure".
EvilEnginer@reddit (OP)
This is exclusive to finetuned/uncensored variants of the model. I am 100% sure people want both uncensored roleplay and coding capabilities in one model when they talk with a local AI.
NewUser10101@reddit
Would love a review of the rest of the family, including 122B and the full 27B with safetensors (looks like you checked one of the fine-tuned variants).
EvilEnginer@reddit (OP)
I don't have enough system resources to process the tensors of the 122B and 27B models in safetensors format - the RAM spikes are insane. So only 35B A3B and 9B can currently be processed in safetensors format on Google Colab Free Tier.
Subject_Secretary245@reddit
Does this also apply to smaller models like the 4b, or just the larger ones?
EvilEnginer@reddit (OP)
Yes, this can be applied to smaller models too.
raharjoharis@reddit
Is the 27b still experimental? And also please do one for the 9b as well sir
EvilEnginer@reddit (OP)
27B will always be experimental; the most correct approach is finetuning only fixed models with correct conv1d weights. I will pick up the 9B one from Jackrong later.
unjustifiably_angry@reddit
So Qwen3.5 is as good as it is.... broken?
Prudent-Ad4509@reddit
This might partially explain why 122b is so much better than 35B even at IQ3 quant.
raysar@reddit
Is there a boost in benchmarks?
EmperorOfNe@reddit
That is serious debugging work, very well done. Alibaba should reach out to you if you achieved an 88.6% reduction in errors - that is amazing.
EvilEnginer@reddit (OP)
Thank you very much :). Yep, I also hope for this, and would be happy to collaborate with them.
Altruistic-Site-9000@reddit
Hey, if someone wanted to start with fine-tuning and all the basics, how should they start?
EvilEnginer@reddit (OP)
Unsloth Studio is really nice, I think.
FantasticBottle2463@reddit
Qwen_Qwen3.5-35B-A3B-Q8_0.gguf + tools seems smarter to me, compared to your BF16 version.
gpalmorejr@reddit
Interesting. I have never had this happen. Maybe I'm not using it long enough? How many tokens were the contexts when this error showed itself?
EvilEnginer@reddit (OP)
Usually starts showing after 50k-70k tokens. The model becomes indecisive, repeats itself, breaks tool calling, loses context. By 100k it often breaks completely. My fix handles 100k+ without issues.
gpalmorejr@reddit
Maybe that is why my coding agent got weird the other day.
EvilEnginer@reddit (OP)
Sounds like it. Try my fixed model - should stay more stable, at least for basic stuff. For complex coding this model requires training via Unsloth Studio on Claude Opus 4.6 datasets from Hugging Face.
gpalmorejr@reddit
I'll look into it, but my hardware is limited and I have to use their Q4_K_M. So, we'll see.
jerryohjerry@reddit
Damn, that's some serious detective work. Two tensors causing 88.6% error reduction is wild - the fact that it was hiding in plain sight in the weight scales is exactly the kind of thing that makes you question how many other models have similar silent failures nobody's caught yet. The AdamW + rare experts angle makes sense too. Those last layers don't get updated often so when they do, the optimizer overshoots hard. Curious if this explains some of the weird behavior people report with other MoE models that just gets blamed on "model quality" when it's actually a training artifact.
EvilEnginer@reddit (OP)
You nailed it. This isn't Qwen-specific - any MoE + recurrent architecture trained with AdamW can suffer from the same issue. Rare experts in last layers drift hard. Most teams blame 'model quality' and move on. Nobody bothered to look inside. Until now.
Kahvana@reddit
Thank you! Can you upload the safetensor version?
EvilEnginer@reddit (OP)
Of course. Already done: https://huggingface.co/LuffyTheFox/Qwen3.5-35B-A3B-Uncensored-FernflowerAI-safetensors
Kahvana@reddit
Awesome! Thank you so much!
EvilEnginer@reddit (OP)
You're welcome! Glad to contribute :)
hockey-throwawayy@reddit
Thanks for sharing this!
Would you be willing to do some major hand-holding and explain how to quantize this model into something that will fit 12 GB VRAM? I see the script on the HF page, but I am just totally unfamiliar with the nuts and bolts of the process.
My local LLM setup understanding begins and ends with "if HF shows my GPU with a green icon, I can try that model."
There are so many details to get these models running locally properly and I have yet to figure it all out. I'm looking for a good "daily driver".
EvilEnginer@reddit (OP)
Just use Q4_K_L quant. It's best in terms of size and quality. I am using it on my RTX 3060 12 GB. I have 10 tokens per second in LM Studio.
Vast_Strawberry3093@reddit
Where can i find this Q4_K_L quant?
tiffanytrashcan@reddit
There's a huggingface link in the OP...
hockey-throwawayy@reddit
Ah I see it now, thank you!
United_Razzmatazz769@reddit
Thanks for the model. Some Qwen3.5 35B A3B models I have tried always melt down past 50k tokens. Your model definitely feels better. I successfully got past some 100k of API-endpoint learning/planning with it.
EvilEnginer@reddit (OP)
Great to hear that! 100k tokens is exactly where the original model breaks. Your test confirms the fix works. Thanks for reporting back.
Several_Newspaper808@reddit
Hey, so you offload to RAM? The small gguf on hf is 24gb. Otherwise how would it fit in a 12gb card?
EvilEnginer@reddit (OP)
Yes I offload to RAM. It works on 12 GB card but with slow speed.
WhoRoger@reddit
Lol nice. Any interest in checking the small versions too? 4B, 2B, 0.8B are notoriously prone to getting stuck.
EvilEnginer@reddit (OP)
Thanks :). Maybe in the future. Currently 35B and 27B are the priority.
Quiet-Owl9220@reddit
Hey nice job. It doesn't give up mid-sentence after extended reasoning and tool calls any more.
EvilEnginer@reddit (OP)
Nice :)
wh33t@reddit
Interesting.
Maybe this explains why I have such poor experiences with Qwen3.5. It just becomes so fucking indecisive all of a sudden, looping itself, and no amount of parameter tuning seems to fix it. This must be the issue.
EvilEnginer@reddit (OP)
That's exactly it. What you're describing - indecisive, looping, no amount of parameters fixes it - is the signature of this bug. The recurrent state in blocks 36-37 is corrupted. Model loses context, starts repeating, can't decide. Parameter tuning can't fix it because the problem is in the weights themselves. Only scaling those two tensors back to normal works. Try my fixed model with my settings and System Prompt. You'll see the difference immediately. No more loops. No more indecision. Just clean crystal clear human readable responses.
jikilan_@reddit
Any way to notify qwen team about this?
EvilEnginer@reddit (OP)
I think they are monitoring Twitter.
RemarkableAntelope80@reddit
So, to clarify. This affects training / that finetune? Or it actually affects inference on GGUFs of the original Qwen3.5 model? Either way, congrats figuring it out
EvilEnginer@reddit (OP)
It affects inference on any GGUF of original Qwen 3.5 35B A3B. Fine-tuning doesn't fix it. It masks it at best. So if someone fine-tunes a broken Qwen, they're building on unstable ground. Better to fix first, then fine-tune.
hesperaux@reddit
I want to understand stuff as much as you some day
Super interesting post. Thanks. I am slightly skeptical of it because of who I am as a person but... You sound like you know what you're talking about.
I am definitely gonna try this. I switched to 122B A10B because 35B A3B was.. Strange. Like you said, it got weird after 70k tokens. And it was not good at maintaining a direction. I wonder if it's related.
Another person asked if this is only that version (abliterated) or if it's this way on the official model. Can you answer that?
Thanks again. Cool stuff.
EvilEnginer@reddit (OP)
Official model contains this bug too. I checked Unsloth BF16 quant.
Dazzling_Equipment_9@reddit
I noticed you're using the `--seed 3407` parameter. Out of curiosity, what's the magic behind it? I've seen it mentioned in some parameter recommendation articles; it seems like a magic number. Given your professional background, I'd like to know what effect it has in actual use? Is it limited to LM Studio or does it apply to other clients as well?
kellyjames436@reddit
Can this model run on a 4060 with 8 GB VRAM?
EvilEnginer@reddit (OP)
Yes, it can run, since it's MoE and has only 3B active parameters.
Vast_Strawberry3093@reddit
But MoE rechooses the expert for each token, right?
MmmmMorphine@reddit
Yeah, you still have to hold the entire model in [V]RAM - MoE sort of trades computation for size, in a certain sense.
There are ways of trying to predict which experts will be used next or most often (among related techniques), though I'm not too clear on whether they're of all that much use.
Anyway, yeah, you'll need to offload a good chunk of the model if you have less than 24 GB of VRAM, and that's with significant quantization and KV-cache management/quantization.
kellyjames436@reddit
Thank you.
thingswhatnot@reddit
You should have added all this useful information to the model card - it reads like a black box and won't be discoverable.
Responsible-Ship1140@reddit
Is this a bug that could appear in all Qwen3.5 models? The description definitely matches things I've observed with qwen3.5:9b (Q4).
k_am-1@reddit
!remindme 3 days
LegacyRemaster@reddit
the name is too short! Please add something epic!
EvilEnginer@reddit (OP)
Ahahah, very true. But I like it. It's unique :D
apollo_mg@reddit
Bravo good sir. Excellent digging, and thanks!
EvilEnginer@reddit (OP)
Thank you very much ^_^
Fun_Smoke4792@reddit
Remindme! In 14 hours
RemindMeBot@reddit
I will be messaging you in 14 hours on 2026-04-09 05:46:41 UTC to remind you of this link