I pray there is a Qwen 3.6 122b version (4x3090 owner)
Posted by Mr_Moonsilver@reddit | LocalLLaMA | View on Reddit | 37 comments
The 3.5 122b model is already fantastic at 4-bit. Really the best model I've ever run on my 4x3090, and from what I read about how the 3.6 35B is doing, a 3.6 122b model would be an absolute value banger. Are we going to get it?
qwen_next_gguf_when@reddit
Surprised to see that you are not running the 397b. I have only 24GB VRAM and am running the IQ2.
jacek2023@reddit
and what's your usecase for it?
qwen_next_gguf_when@reddit
When small models can't figure something out, I swap to the larger model to troubleshoot.
jacek2023@reddit
could you give an example?
qwen_next_gguf_when@reddit
I use the 27b q4 as my model for opencode. From time to time it gets stuck in a loop where it can't figure out how to fix a bug or implement a new feature, or it drifts into limbo. I use the 397b to troubleshoot the code, or to get a new idea of how to realize a concept.
Blues520@reddit
How do you run 397b with only 24gb?
grumd@reddit
experts in RAM
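(For anyone wondering what "experts in RAM" means in practice: with a MoE model, llama.cpp can keep the expert weights in system RAM while the rest of the model sits in VRAM. A minimal sketch is below; the GGUF path is a placeholder and the exact flag names differ between llama.cpp builds, with older builds using --override-tensor instead of --cpu-moe.)

    # Minimal sketch: run a large MoE GGUF with the expert weights kept in system RAM.
    # The GGUF path is a placeholder, not a real release.
    # -ngl 99    : offload all non-expert layers to the GPU
    # --cpu-moe  : keep the MoE expert tensors in CPU/system RAM
    # --no-mmap  : load the model fully into RAM instead of streaming it from disk
    ./llama-server -m ./qwen-397b-iq2.gguf -ngl 99 --cpu-moe --no-mmap -c 16384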
575_Inverse@reddit
But how much RAM? With 32GB RAM and a 5060 Ti 16GB the 122b Q6 runs, but... less than spectacularly in terms of tk/s. But yeah, it runs more than spectacularly when it comes to reasoning. Q4 on difficult visual tasks starts breaking, like blurting occasional Chinese tokens lol.
grumd@reddit
Q6? The gguf file alone is like 100GB+. How do you expect to put a 100GB model into 16+32? You're running your model off of a disk.
You need at least 64GB RAM to run the 122b IQ3_XXS. And you need to use --no-mmap to even load it into RAM instead of serving it from disk.
asfbrz96@reddit
How good is that?
qwen_next_gguf_when@reddit
Good for troubleshooting code or generating new ideas for tools to create.
575_Inverse@reddit
In other words, the small models lack reasoning, which is why the 122b is simply in a different league.
pmttyji@reddit
what t/s are you getting?
Thepandashirt@reddit
I think it's super questionable whether we get a 122B, and unlikely that we see a 397B. A lot of money is invested in developing these models, and investors are starting to actually expect profits from all these AI companies. There's very little business incentive to release models like a 122B or the full 397B, which would cannibalize API token sales.
I think we'll continue to see lots of competition around models that fit in 24-32GB of VRAM, where most consumer builds top out. As someone with enough VRAM to run the 397B in 4-bit, I hope I'm wrong, but the trend says otherwise. Gemma 4 was 31B max and Qwen3.6 is only 35B so far, so consumer-build-friendly releases. We'll see.
575_Inverse@reddit
Not everyone has the funds for a multi-GPU rig that can run the 122b acceptably fast. Those who do won't fork over money for API keys, especially considering CENSORED models are all basically castrated.
ttkciar@reddit
I suspect we will, but it may take some time.
If I were the Qwen team, I'd be using the Qwen3.5 traces logged from API users to synthesize training datasets for (1) remedying Qwen3.5's overthinking problems, and (2) coming up with better answers to real-world user prompts, using a big-ass "teacher" model and an iterative improvement pipeline.
Then I'd use those datasets to tune Qwen3.5-35B-A3B (cheap to train) to produce Qwen3.6-35B-A3B, and set that loose for users to beta-test for a while, so I could analyze the API users' logged traces to see if the training datasets needed further adjustment.
After that adjustment, or after having verified that the datasets needed no further adjustment, I'd give the bigger (more expensive to train) models the same treatment to make 3.6 versions of them.
Perhaps they're doing something like that? But I have no particular insights.
Voxandr@reddit
How are they gonna trace API usage with offline models?
ttkciar@reddit
They won't. They'll use the traces logged from people using their API.
We here in this sub use local inference, but remember that most inference users are using APIs.
king_of_jupyter@reddit
Quantum entanglement, ghost murmur style
Voxandr@reddit
Hahaha
laterbreh@reddit
Waiting for 3.6 397b :*(
fkyoj@reddit
What setup do you have to run that?
laterbreh@reddit
3x RTX 6000 pros.
Mr_Moonsilver@reddit (OP)
That too!
__JockY__@reddit
Hoo boy, yes. The small models are great, but the 397B is really something else entirely!
FinalCap2680@reddit
You are not alone
Steus_au@reddit
glm5.1-air would be a killer too
RedParaglider@reddit
I still use 4.5 air all the time, it's an amazing model.
jacek2023@reddit
still no news https://huggingface.co/zai-org/GLM-5.1/discussions/2
Voxandr@reddit
Strix Halo owner here. We need 122B!
zeferrum@reddit
What specific model quantization are you using for 4-bit in your quad 3090 rig?
Mr_Moonsilver@reddit (OP)
Using cyankiwi's awq-4bit
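(For anyone wanting to reproduce a setup like this: AWQ quants are typically served with vLLM, so a minimal sketch for a quad-3090 rig might look like the following. The model ID is a placeholder rather than the actual cyankiwi repo, and the numbers are just starting points.)

    # Minimal sketch: serve a 4-bit AWQ quant tensor-parallel across 4x3090 with vLLM.
    # The model ID below is a placeholder; substitute the AWQ repo you actually use.
    # --tensor-parallel-size 4      : split the weights across the four GPUs
    # --quantization awq            : tell vLLM the checkpoint is AWQ-quantized
    # --max-model-len 32768         : cap context so the KV cache fits on 24GB cards
    # --gpu-memory-utilization 0.92 : leave a little VRAM headroom per GPU
    vllm serve someuser/some-122b-awq-4bit \
      --tensor-parallel-size 4 \
      --quantization awq \
      --max-model-len 32768 \
      --gpu-memory-utilization 0.92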
AppealSame4367@reddit
I pray to the gods of speculative decoding innovations in llama cpp
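(For context, llama.cpp already supports speculative decoding with a small draft model; a rough launch sketch is below. Both GGUF paths are placeholders, and the draft-related flag names vary between llama-server versions.)

    # Rough sketch: speculative decoding in llama-server with a small draft model.
    # Both GGUF paths are placeholders; draft flag names vary by llama.cpp version.
    # -md         : small draft model that proposes tokens for the big model to verify
    # --draft-max : maximum number of draft tokens to propose per step
    ./llama-server -m ./big-model-q4.gguf -md ./small-draft-q8.gguf \
      -ngl 99 --draft-max 16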
Long_comment_san@reddit
Minimax?
Porespellar@reddit
robertpro01@reddit
We don't really know, just wait and see.
El_90@reddit
OMG yes please
Something that quants to Q5 @ 92GB ish would make me smile for a very long time