Qwen 3.6 35B A3B, RTX 5090 32GB, 187t/s, Q5 K S, 120K Context Size, Thinking Mode Off, Temp 0.1
Posted by sammyranks@reddit | LocalLLaMA | View on Reddit | 81 comments
deanpreese@reddit
Obviously great numbers for tok/sec. The real question is, "how well does it work?"
Agreeable-Yak6164@reddit
Very nice for misk coding tasks. Better than GLM Flash 4.7.
4x faster than gemma4-31b with subjectively close accuracy and instruction following. Gemma is more accurate (dense). Gemma-MoE was unusable (may be fixed, but I did not check updates).
I just started qwen3.6 to test... and did not stop for hours, just continued my work with it.
Over the last 7 hours it generated about 1.4-1.5 million tokens.
7900XTX + MI50 32GB; -ts 1,1; 1-2 threads with 128k ctx for each.
pp (prompt processing) - 1200-900 t\s with low context, 700-300 t\s with 90-110k filled context. avg: 701 t\s
tg (token generation) - 40-23 t\s; avg: 35 t\s
Very nice release.
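For reference, a setup like the one described above can be sketched as a llama.cpp launch. This is only a hedged example: the model filename and port are placeholders, and the only flags taken from the comment itself are `-ts 1,1` (even tensor split across the two cards) and the 128k context.

```shell
# Hypothetical launch sketch for a 2-GPU rig (7900XTX + MI50); model path
# and port are placeholders, not from the original comment.
#   -ngl 99  -> offload all layers to the GPUs
#   -ts 1,1  -> split tensors evenly between the two cards
#   -c 131072 -> 128k context, as in the comment above
llama-server -m ./Qwen3.6-35B-A3B-Q5_K_S.gguf -ngl 99 -ts 1,1 -c 131072 --port 8080
```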
CalligrapherFar7833@reddit
Can't believe anything you wrote after your 4th word, "misk"
Agreeable-Yak6164@reddit
Not native language. "No speaky English" (c) =)
CalligrapherFar7833@reddit
misk , t\s , etc
Agreeable-Yak6164@reddit
So on.. Are feel better now?))
CalligrapherFar7833@reddit
Do you have some brainrot formatting issues to end your sentence with ?))
Agreeable-Yak6164@reddit
Sure, a difficult past - BBC, IRC and dial-up :-p
JustSayin_thatuknow@reddit
😆
Agreeable-Yak6164@reddit
Man, you broke my evening... now I should find and rename tons of folders on my PC named "misk" =(
CalligrapherFar7833@reddit
haha
sammyranks@reddit (OP)
That's great.
Jungle_Llama@reddit
Latest llama.cpp Vulkan, unsloth Q4 XL with a single MI50 32GB, getting 75 tok/sec (prompt speed varies by task; I've seen 600 tok/sec). Not bad for 280 euros for the card. Noticeable improvement in speed and accuracy over 3.5. Starting to like this model an awful lot.
michaelsoft__binbows@reddit
That's super relevant perf out of that card. Wish I picked a few up back in the day; they're too expensive to be worth grabbing at current prices.
Jungle_Llama@reddit
They have risen to 420 euros here now, but this card is sitting in a cheap X99 with a single E5 2680 V4. The whole rig only cost me around 440 to build; I had NVMe and some DDR4 RAM lying around. That's insane value these days considering what Qwen 3.6 35B can do.
michaelsoft__binbows@reddit
Nice. Yeah I have some old X99 boards and I got a E5 2690 v4 for $22 about 3 years ago. a few GPUs like this would be perfect to pair that with...
ubrtnk@reddit
I get about 75t/s on 2x 5060ti with 132k context but also with cheap power draw
Eversivam@reddit
How practical are 2x 5060 Ti? I've been looking for people testing them but haven't found anything.
noreallyhughjackman@reddit
I think in the scheme of things they're a little underpowered; I feel like the 5070 Ti is in a better place (or the 3080 20GB from China). But plenty good for playing around with larger context.
I was going to build a 4x 5060 Ti 16GB system, but once you get to models that size, they're far too slow to run at usable speed (aside from batch/non-realtime scenarios), so it might be better to spend more on a fast GPU than on sheer memory size.
ubrtnk@reddit
I normally would agree. Would have bought another 4080 or a 5070 but only had one 8-pin power to go with the last gpu that I was putting on the last 4 pcie lanes I had without bifurcation. Another 5060 made the most sense. I thought about going with an older volta or Turing card but didn't want to deal with the lack of some features
ubrtnk@reddit
This comment aged poorly in the last 2 days. I ended up returning the 5060 and found another 4080 for damn near the same price on marketplace. So as soon as my converter cable comes in I'll have qwen on 2x 4080s
ubrtnk@reddit
I originally got the first one to run the small always-on models I needed, like embedding, rerank, and a task model for OWUI. Plenty fast for those models. It can also run gpt-oss:20 at like 100t/s. It was a good card to supplement the 3090s at the time. After I got my 5090 for my gaming rig, I moved the 4080 in and ran that with the 5060 in a mismatched pair. Now I'm at 2x 4090s, 2x 3090s, the 4080 and the 5060, so I got the second 5060 for a matched pair for always-on qwen and for playing around with tensor parallelism as a matched set. I got the PNY version, so it's also small and only takes one 8-pin power connector, which was all I had left on my second PSU.
Eversivam@reddit
If a model fits on just one 5060 Ti, will the other 5060 Ti help with extending the context?
ubrtnk@reddit
Kinda. The weights split based on how you tell it to and the rest fills with context, but both the model and the context live on both GPUs.
ubrtnk@reddit
As you can see here, the context doesn't split evenly. I could set my -ts split in llama.cpp to something like 4,5 to shift more of the weights to the 2nd 5060 if I really wanted them to be balanced.
Here's a llama-bench example, evenly split, with a 2048-token prompt:
Sorry for the formatting, can only upload one image
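For anyone following along, the uneven split mentioned above maps directly onto llama.cpp's `--tensor-split`. A hedged sketch (the model path is a placeholder; only the `4,5` ratio comes from the comment):

```shell
# -ts takes per-GPU proportions, not gigabytes: "4,5" gives GPU0 4/9 and
# GPU1 5/9 of the offloaded layers, leaving more room for context on GPU0.
# Model path is hypothetical.
llama-server -m ./Qwen3.6-35B-A3B-Q4_K_XL.gguf -ngl 99 -ts 4,5 -c 131072
```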
LiquidNeat@reddit
95 t/s on my Macbook Pro M5 with MLX, 60W power draw.
s1rlight@reddit
Stupid question, what software is the one in the screenshot?
Available-Craft-5795@reddit
Increase that temp a lil
sammyranks@reddit (OP)
0.1 was the default. Will try 0.5
OniCr0w@reddit
From the official Qwen authors:
Thinking mode (default):
- temperature=1.0, top_p=0.95, top_k=20, min_p=0, presence_penalty=1.5
- temperature=0.6, top_p=0.95, top_k=20, min_p=0, presence_penalty=0
Non-thinking mode:
- temperature=0.7, top_p=0.8, top_k=20, min_p=0, presence_penalty=1.5
- temperature=1.0, top_p=1.0, top_k=40, min_p=0, presence_penalty=2.0
Important:
- use the `--jinja` flag with llama.cpp for proper chat template handling
- keep the `mmproj` file alongside the main GGUF
Usage: works with llama.cpp, LM Studio, Jan, koboldcpp, and other GGUF-compatible runtimes.
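Assuming llama.cpp, the non-thinking sampler settings above can be passed on the command line. A sketch only: the model path is a placeholder, and this uses the first of the two non-thinking presets listed.

```shell
# Non-thinking preset: temperature=0.7, top_p=0.8, top_k=20, min_p=0,
# presence_penalty=1.5; --jinja enables the model's bundled chat template.
# Model path is hypothetical.
llama-server -m ./Qwen3.6-35B-A3B-Q5_K_S.gguf \
  --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0 \
  --presence-penalty 1.5 --jinja
```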
narrowbuys@reddit
Source ?
OniCr0w@reddit
It's in the HauHauCS Uncensored versions' descriptions. That's where I got it, at least, but he claims it's from the Qwen team and I would trust him.
Octopotree@reddit
Does its thinking take up context space? I guess I assumed the thinking would be discarded after answering.
Fun_Librarian_7699@reddit
You can choose if you want to include thinking in the context.
OniCr0w@reddit
The uncensored HauHauCS versions have recommended settings for thinking and non-thinking modes in the LLM descriptions on Hugging Face.
Glittering-Call8746@reddit
Which quant ? Unsloth ?
sammyranks@reddit (OP)
Yes Unsloth, Q5 K S.
Odd_Butterfly_455@reddit
I just got my hands on 2 Radeon R9700 Pro AI cards. I plugged one in today; my 1200 watt power supply comes next week, and then I will post some benchmarks.
Evildude42@reddit
So last night I tried the unsloth version of this at Q4_K_L, Q5_K_L, and Q6_K, split between a B50 and a B580. The smaller ones fit in the combined memory, the larger one had to spill over into RAM, and I noticed that the Q4 version was about twice as fast as the Q6 version, something like 35 tokens versus 15 tokens per second. Temperature was 0.6, but every single one of them crashed. The first few sample questions went through fine; by the time I got to my third round of questions, LM Studio just gave up the ghost and the model crashed. Today I got the newer LM Studio version; there was no Q5, so I got the Q4, Q6, and Q8. They all ran slower, but none of them crashed. By the way, I'm running in Windows because I couldn't get Ubuntu to work with the beta.
ComfyUser48@reddit
I am getting 166 tok/sec with my 5090 (limited to 80% power), with Q5_M, 210k context, running on llama.cpp.
Dany0@reddit
Did you OC memory?
sammyranks@reddit (OP)
nice
bb943bfc39dae@reddit
Have you tried NVFP4 quant? Seems a waste not to leverage the Blackwell architecture
Subject_Mix_8339@reddit
where is this quant?
79215185-1feb-44c6@reddit
I still get around 130t/s with my 2x7900XTX. Nothing has really changed for me.
soyalemujica@reddit
What are you doing to get such nice tokens per second? I have a single 7900XTX and I am stuck at 65t/s, which drops considerably to 40t/s at 8k context.
79215185-1feb-44c6@reddit
You sir are out of VRAM.
soyalemujica@reddit
Sorry, I fixed it. Now I get 170t/s with 131k context; I think I had a wrong configuration.
MikeLPU@reddit
Fixed how?
soyalemujica@reddit
Enabled the iGPU and moved Discord, MS Edge, and other programs I don't need off my main GPU, so my main GPU's VRAM is empty.
MikeLPU@reddit
On Windows? On Linux with a Threadripper and a fully available 7900XTX I'm getting 90t/s max.
soyalemujica@reddit
Yeah, on Windows, I get 160t/s, as the context fills up it lowers ofc
ju9io@reddit
170t/s on a single 7900 xtx?
soyalemujica@reddit
Yeah, averaging 155t/s
MikeLPU@reddit
I have one too and never saw these numbers. What's the trick?
sammyranks@reddit (OP)
Great.
darkgamer_nw@reddit
What chat software is the one in the picture?
Abysseer@reddit
Looks like LMStudio
ego100trique@reddit
Win + Shift + S
retireb435@reddit
Do you think Q5 S is better than Q4 XL?
GregoryfromtheHood@reddit
With llama.cpp on Ubuntu I was getting 10k pp and 200-250 t/s from some quick tests on my 5090 without optimising anything yet. You using linux or windows?
Adventurous_Farm3073@reddit
I get around 120t/s on my dual 5070 Ti + 5060 Ti system. My dual 5090 system gets ~180. Q8 is close to 80.
Rare_Potential_1323@reddit
Is the 5060 Ti for prompt processing? I was going to set up my old 1070 Ti for that soon.
swiss_aspie@reddit
How does that work ? You can use a smaller gpu to speed things up?
KvAk_AKPlaysYT@reddit
I need a 5090.
Lmk if anyone has an extra one
loveisnomorethandust@reddit
i don't even have a dedicated gpu. i need a 5090 more than this guy. give me one before him.
KvAk_AKPlaysYT@reddit
Let's split it.
We both get 16GB.
FrostTactics@reddit
I get the mental image of the two of you standing before u/sammyranks as King David. By suggesting to split the GPU, you reveal that you are not its true owner.
LostDog_88@reddit
I want one too!
although, i want to retain my GBs, lets split the 5090 to 2545s!
Ok_Mammoth589@reddit
We all get one cuda core. It's the only fair method.
Healthy-Nebula-3603@reddit
Why do you change temp? Leave it to the application, which takes it from the gguf.
FoundationFirm6934@reddit
Great job
chris_0611@reddit
you should ask it how to make a screenshot
ZealousidealBunch220@reddit
I think thinking is quite important for this model
DistinctObjective626@reddit
2x RTX 3090, unsloth/Qwen3.6-35B-A3B-UD-Q6_K_XL - 125 tok/sec (prompt 3800 tok/sec)
misha1350@reddit
Pretty great for UD-Q6_K_XL quants
Manaberryio@reddit
Genuine question: would a Mac Mini with 24GB of RAM run this model smoothly? I have a computer with an RX6800, but GPUs are too expensive.
ismaelgokufox@reddit
Try a lower quant. There are a few options.
GMerton@reddit
I don’t think so. The model weights are 20GB I think. You also need to budget 4GB for your system. There won’t be any room left for context and compute.
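The budget math above is easy to sketch in shell. Note the 20 GB weights and 4 GB system reserve are the commenter's estimates, not measured numbers:

```shell
# Rough unified-memory budget for a 24 GB Mac Mini:
# total minus model weights minus OS reserve leaves the room
# available for KV cache (context) and compute buffers.
total_gb=24
weights_gb=20
system_gb=4
echo "left for context + compute: $((total_gb - weights_gb - system_gb)) GB"
```

With these figures it prints 0 GB left, matching the conclusion that there is no room for context.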
FinBenton@reddit
I got 250 tok/sec on my 5090 but I tested with smaller context for now.
ArugulaAnnual1765@reddit
I can get 256k context using 3.5 27b iq4xs at the same tps - doesn't seem worth the same performance for half the context. Imma keep using it until 3.6 27b.