Running Qwen 3.6 35B A3B on 2x 5060 Ti
Posted by chocofoxy@reddit | LocalLLaMA | View on Reddit | 25 comments
I ran Qwen 3.6 35B A3B on two 5060 Ti 16GB cards. I used Q4 on LM Studio to get the full context and I get 90 t/s. Any tricks to optimize this further so I can upgrade to Q6 or Q8?
thanks
LoafyLemon@reddit
Try TurboQuant + MTP; it will not only speed everything up but also let you fit more context.
https://github.com/ggml-org/llama.cpp/pull/22983
voyager256@reddit
But how is the quality with TurboQuant at Q4? I read that TheTom's implementation is currently one of the best, but still nowhere near Q8.
chocofoxy@reddit (OP)
Thanks, I will try this.
o0genesis0o@reddit
How did you add two GPU into a mATX mobo?
Building my rig with mATX mobo is currently one of my biggest tech regrets. It was more expensive than the full sized mobo, and mine has only 1 PCIe. And even if there is another one hidden somewhere on the board, there is just no more physical space.
IMHO, Q4 + full context + 90t/s decode is more than good enough. I might switch to Q6 K_XL from unsloth and squeeze the context down a bit, maybe to 128k or even 96k.
How is your prompt processing speed with those two 5060ti?
chocofoxy@reddit (OP)
I have the ASUS TUF B550. It has two x16 slots (one Gen 4 and one Gen 3) plus two x1, and it's cheap; I got it for $90. You don't get any gap between the GPUs when you stack them, but it's OK: the top one runs only 5-8 degrees hotter than the bottom one.
soteko@reddit
I have this mb and I plan to put two 5060ti in it
https://www.asus.com/motherboards-components/motherboards/tuf-gaming/tuf-gaming-b550m-plus/
Is it the same one?
And what happens if you use Qwen 3.6 at Q6 with the same context, what is the token generation speed?
Also, how is token generation with Qwen 3.6 27B, since it is smaller?
And what is prompt processing speed?
Sorry too many questions lol
EducationalGood495@reddit
Would you recommend running Qwen 3.6 35B on a 2080 Ti 11GB? I am seeing a good deal for $180 and I'm just building my first PC.
fasti-au@reddit
Tom's TurboQuant: quant turbo4 on K, turbo3 on V, and use dflash.
FatheredPuma81@reddit
Switch to llama.cpp and don't load the 2GB Vision component. UD-Q6_K should just barely fit. Q8_0 KV Quantization should get you like 64k context or a bit more if you use Q5_1. UD-Q5_K_XL is the only way you're going to get full context length without offloading.
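Roughly what that looks like as a llama-server launch (the filename and numbers here are just placeholders, adjust for your setup):

    # sketch of a llama-server launch for two GPUs: all layers offloaded,
    # Q8_0 KV cache (flash attention is needed for the quantized V cache),
    # and an even tensor split across the cards
    llama-server \
      -m Qwen3.6-35B-A3B-UD-Q6_K_XL.gguf \
      -ngl 99 \
      -c 65536 \
      -fa \
      --cache-type-k q8_0 --cache-type-v q8_0 \
      --tensor-split 1,1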
chocofoxy@reddit (OP)
I tried llama.cpp and now I get 10 tokens per second lower (surprisingly), but more VRAM is free and TP is better. I will try to figure out the best config; I will also try vLLM with NVFP4 and MTP.
FatheredPuma81@reddit
Strange, llama.cpp with CUDA was already a bit faster for me, and when I built it from source (really easy with OpenCode, you just need to download some programs) with the flags Grok, Claude, and Gemini said to set, it runs a lot faster. I threw ngram-mod on top of that and it became even faster.
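For reference, the from-source build is just a couple of commands (assuming the CUDA toolkit is already installed; the -j value is only an example):

    # build llama.cpp from source with CUDA support
    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp
    cmake -B build -DGGML_CUDA=ON
    cmake --build build --config Release -j 8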
PotatoTime@reddit
I'm getting 40 t/s at q8 on a single 4070 12gb so you probably can optimize it further. I'm on llama.cpp though so I'm not familiar with lm studio
soteko@reddit
What are other specs CPU, memory etc?
PotatoTime@reddit
14700k CPU and 64GB DDR5 6400. I have a feeling my RAM speed is doing some heavy lifting but I'm only using about 12GB of it for qwen3.6 35b. I'm at 64k context and running on Linux if that makes a difference. I think I saw a lot higher RAM usage on windows
ImportantSignal2098@reddit
Yeah, generation with experts offloaded to CPU bottlenecks pretty hard on my DDR4 3200. A bit surprised you're only using 12GB of RAM with a Q8 quant though; I think I'm up to 11GB with q4_k_m. What is your KV quant? I saw MoE is often not very good with lower KV quants, so I didn't risk going below q8_0 there.
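For anyone trying the same thing, the setup I mean looks roughly like this (the model filename is a placeholder, and the -ot regex is just the usual pattern for MoE expert tensors, not anything specific to this model):

    # sketch: keep the MoE expert tensors in system RAM, everything else on GPU,
    # and don't go below q8_0 for the KV cache
    llama-server \
      -m qwen3.6-35b-a3b-q4_k_m.gguf \
      -ngl 99 \
      -fa \
      -ot ".ffn_.*_exps.=CPU" \
      --cache-type-k q8_0 --cache-type-v q8_0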
chocofoxy@reddit (OP)
Yes, this is why I don't like offloading. People keep saying it works because they have fast DDR5 memory, but it doesn't work on DDR4 3200; it's not usable. I tried it before I bought the second GPU, and it's even worse on dense models.
PotatoTime@reddit
I haven't tried using kv quant yet. I think the low memory usage is mostly llama's memory mapping. RAM usage doubles when I set --no-mmap. Performance is the same either way though
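In case it's useful, the comparison was just the same command with and without the flag:

    # default: weights are memory-mapped from disk, so reported RAM use stays low
    llama-server -m model.gguf -ngl 99
    # with --no-mmap the whole file is loaded into RAM up front (usage roughly doubles for me)
    llama-server -m model.gguf -ngl 99 --no-mmap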
sid351@reddit
I'm running 2 x 5060 TI as well, but hitting a "terminal thinking loop" situation where the model just devolves to producing only "/" characters until the max token limit, regularly throughout the day (using Llama.cpp).
I'd love to get that sorted properly, so if anyone has any ideas I'm all ears.
chocofoxy@reddit (OP)
I noticed that happens when you lower the quantization. I tested Q2 and it happens a lot, after about 3 messages; when I used Q4 it still happens, but not that often.
Constant-Simple-1234@reddit
Not sorted, but I noticed something like that too. Funnily enough, the 3.5 version is more robust for me than 3.6. I used ByteShape quants for 3.5 and unsloth for 3.6, but I tried AesSedai quants for 3.6 too and saw a similar thing.
see_spot_ruminate@reddit
get 2 more 5060ti, lol
chocofoxy@reddit (OP)
Lmao, I will keep stacking 5060 Tis until I get 96GB of VRAM.
see_spot_ruminate@reddit
I would actually watch out and maybe cap it at either 64GB (4 cards) or 128GB (8 cards, though I can't figure out how that would be practical). This is because once you get up to 4 cards (as people have bullied me into), vLLM becomes the better option over llama.cpp (praise be). With vLLM, you need a power-of-2 number of cards to get the most out of tensor parallelism, e.g. 2^2 = 4 or 2^3 = 8.
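At that point the launch is something like this (the model name is a placeholder, not a quant recommendation):

    # vLLM with tensor parallelism across 4 identical cards
    # (replace MODEL with your actual model name or local path)
    MODEL=your-model-or-local-path
    vllm serve "$MODEL" \
      --tensor-parallel-size 4 \
      --max-model-len 65536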
Xp_12@reddit
I'm trying to bully myself into my second set as well. 😅
see_spot_ruminate@reddit
Nope, just some random Micro Center combo: a 7600X3D and an ASUS motherboard. Two of the cards are on NVMe-to-OCuLink eGPUs.