Running Qwen 3.6 35B A3B on 2x 5060 Ti
Posted by chocofoxy@reddit | LocalLLaMA | View on Reddit | 25 comments
I ran Qwen 3.6 35B A3B on two 5060 Ti 16GB cards. I used Q4 on LM Studio to get the full context and I get 90 t/s. Any tricks to optimize this further so I can upgrade to Q6 or Q8?
thanks
LoafyLemon@reddit
Try TurboQuant + MTP; it will not only speed everything up but also let you fit more context.
https://github.com/ggml-org/llama.cpp/pull/22983
voyager256@reddit
But how is the quality with TurboQuant at Q4? I read that TheTom's implementation is currently one of the best, but still nowhere near Q8.
chocofoxy@reddit (OP)
Thanks, I will try this.
o0genesis0o@reddit
How did you add two GPU into a mATX mobo?
Building my rig with mATX mobo is currently one of my biggest tech regrets. It was more expensive than the full sized mobo, and mine has only 1 PCIe. And even if there is another one hidden somewhere on the board, there is just no more physical space.
IMHO, Q4 + full context + 90t/s decode is more than good enough. I might switch to Q6 K_XL from unsloth and squeeze the context down a bit, maybe to 128k or even 96k.
How is your prompt processing speed with those two 5060ti?
chocofoxy@reddit (OP)
I have the ASUS TUF B550. It has two x16 slots (one Gen 4 and one Gen 3) plus two x1, and it's cheap; I got it for $90. You don't get any gap between the GPUs when you stack them, but it's OK: the top one runs only 5-8 degrees hotter than the bottom one.
soteko@reddit
I have this mb and I plan to put two 5060ti in it
https://www.asus.com/motherboards-components/motherboards/tuf-gaming/tuf-gaming-b550m-plus/
Is it the same one?
And what happens if you use Qwen 3.6 at Q6 with the same context, what is the token generation speed?
Also, how is token generation with Qwen 3.6 27B, since it is smaller?
And what is prompt processing speed?
Sorry too many questions lol
EducationalGood495@reddit
Would you recommend running Qwen 3.6 35B on a 2080 Ti 11GB? I am seeing a good deal for $180 and I'm just building my first PC.
fasti-au@reddit
Tom's TurboQuant: quant turbo4 on K, turbo3 on V, and use dflash.
FatheredPuma81@reddit
Switch to llama.cpp and don't load the 2GB Vision component. UD-Q6_K should just barely fit. Q8_0 KV Quantization should get you like 64k context or a bit more if you use Q5_1. UD-Q5_K_XL is the only way you're going to get full context length without offloading.
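Roughly what that looks like as a llama-server launch (the filename and numbers here are just placeholders, adjust for your setup):

    # sketch of a llama-server launch for two GPUs: all layers offloaded,
    # Q8_0 KV cache (flash attention is needed for the quantized V cache),
    # and an even tensor split across the cards
    llama-server \
      -m Qwen3.6-35B-A3B-UD-Q6_K_XL.gguf \
      -ngl 99 \
      -c 65536 \
      -fa \
      --cache-type-k q8_0 --cache-type-v q8_0 \
      --tensor-split 1,1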
chocofoxy@reddit (OP)
I tried llama.cpp and now I get 10 tokens per second lower (surprisingly), but more VRAM is free and TP is better. I will try to figure out the best config; I will also try vLLM with NVFP4 and MTP.
FatheredPuma81@reddit
Strange, llama.cpp with CUDA was already a bit faster for me, and when I built it from source (really easy with OpenCode, you just need to download some programs) with the flags Grok, Claude, and Gemini said to set, it runs a lot faster. I threw ngram-mod on top of that and it became even faster.
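For reference, the from-source build is just a couple of commands (assuming the CUDA toolkit is already installed; the -j value is only an example):

    # build llama.cpp from source with CUDA support
    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp
    cmake -B build -DGGML_CUDA=ON
    cmake --build build --config Release -j 8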
PotatoTime@reddit
I'm getting 40 t/s at q8 on a single 4070 12gb so you probably can optimize it further. I'm on llama.cpp though so I'm not familiar with lm studio
soteko@reddit
What are other specs CPU, memory etc?
PotatoTime@reddit
14700k CPU and 64GB DDR5 6400. I have a feeling my RAM speed is doing some heavy lifting but I'm only using about 12GB of it for qwen3.6 35b. I'm at 64k context and running on Linux if that makes a difference. I think I saw a lot higher RAM usage on windows
ImportantSignal2098@reddit
Yeah, generation with experts offloaded to CPU bottlenecks pretty hard on my DDR4 3200. A bit surprised you're only using 12GB of RAM with a Q8 quant though; I think I'm up to 11GB with q4_k_m. What is your KV quant? I saw MoE is often not very good with lower KV quants, so I didn't risk going below q8_0 there.
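For anyone trying the same thing, the setup I mean looks roughly like this (the model filename is a placeholder, and the -ot regex is just the usual pattern for MoE expert tensors, not anything specific to this model):

    # sketch: keep the MoE expert tensors in system RAM, everything else on GPU,
    # and don't go below q8_0 for the KV cache
    llama-server \
      -m qwen3.6-35b-a3b-q4_k_m.gguf \
      -ngl 99 \
      -fa \
      -ot ".ffn_.*_exps.=CPU" \
      --cache-type-k q8_0 --cache-type-v q8_0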
chocofoxy@reddit (OP)
Yes, this is why I don't like offloading. People keep saying it works because they have fast DDR5 memory, but it doesn't work on DDR4 3200; it's not usable. I tried it before I bought the second GPU, and it's even worse on dense models.
PotatoTime@reddit
I haven't tried using kv quant yet. I think the low memory usage is mostly llama's memory mapping. RAM usage doubles when I set --no-mmap. Performance is the same either way though
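In case it's useful, the comparison was just the same command with and without the flag:

    # default: weights are memory-mapped from disk, so reported RAM use stays low
    llama-server -m model.gguf -ngl 99
    # with --no-mmap the whole file is loaded into RAM up front (usage roughly doubles for me)
    llama-server -m model.gguf -ngl 99 --no-mmap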
sid351@reddit
I'm running 2 x 5060 TI as well, but hitting a "terminal thinking loop" situation where the model just devolves to producing only "/" characters until the max token limit, regularly throughout the day (using Llama.cpp).
I'd love to get that sorted properly, so if anyone has any ideas I'm all ears.
chocofoxy@reddit (OP)
I noticed that happens when you lower the quantization. I tested Q2 and it happens a lot, after about 3 messages; when I used Q4 it still happens, but not that often.
Constant-Simple-1234@reddit
Not sorted, but I noticed something like that too. Funnily enough, the 3.5 version is more robust for me than 3.6. I used ByteShape quants for 3.5 and unsloth for 3.6, but I tried AesSedai quants for 3.6 too and saw a similar thing.
see_spot_ruminate@reddit
get 2 more 5060ti, lol
chocofoxy@reddit (OP)
Lmao, I will keep stacking 5060 Tis until I get 96GB of VRAM.
see_spot_ruminate@reddit
I would actually watch out and maybe cap it at either 64GB (4 cards) or 128GB (8 cards, though I can't figure out how that would be practical). This is because once you get up to 4 cards (as people have bullied me into), vLLM becomes the better option over llama.cpp (praise be). With vLLM, you need a power-of-2 number of cards to get the most out of tensor parallelism, e.g. 2^2 = 4 or 2^3 = 8.
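At that point the launch is something like this (the model name is a placeholder, not a quant recommendation):

    # vLLM with tensor parallelism across 4 identical cards
    # (replace MODEL with your actual model name or local path)
    MODEL=your-model-or-local-path
    vllm serve "$MODEL" \
      --tensor-parallel-size 4 \
      --max-model-len 65536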
Xp_12@reddit
I'm trying to bully myself into my second set as well. 😅
see_spot_ruminate@reddit
Nope, just some random Micro Center combo: a 7600X3D and an ASUS motherboard. Two of the cards are on NVMe-to-OCuLink eGPUs.