$400 Qwen 3.6-27B Setup - Dual RTX 3060 - 30-50 t/s

Posted by akira3weet@reddit | LocalLLaMA | 51 comments

I picked up a 7900 XTX earlier, which runs qwen3.6-27b fine, but not to my liking. Its compute performance is quite unstable for me. With MTP the decode speed can reach 40-60 t/s, but prefill is just too slow. Regardless of whether I used ROCm or Vulkan, the prefill speed varied between 300 t/s and 500 t/s, even with very long prompts.

I've been itching to try out an ultra-budget 24GB setup using dual 3060s. I managed to snag a second 3060 at a reasonable price in the last few days. So I took out the 7900 XTX, installed the 3060s, and began testing.

Test Configuration

Test Platform: i7 4770k + Gigabyte GA-Z87MX-D3H
Quite an ancient platform, used for over a decade. But interestingly, it supports SLI by splitting PCIe 3.0 x16 into two PCIe 3.0 x8 links when both slots are used. Newer motherboards don't seem to offer such a split, but many offer one full-speed PCIe 5.0 x16 slot plus one PCIe 4.0 x4 slot. As we know, PCIe 4.0 x4 is equivalent to PCIe 3.0 x8 in bandwidth. Therefore this old platform is on par with newer ones in terms of the PCIe bottleneck.
Monitor is plugged into the motherboard using iGPU.
OS: Kubuntu 24.04
CUDA: 13.2
Models:
unsloth/Qwen3.6-27B-MTP-GGUF
unsloth/Qwen3.6-27B-GGUF
Quantization: Qwen3.6-27B-Q4_K_S.gguf
Software: llama.cpp 5/25/2026 master, self-compiled with CUDA support (official pre-compiled Linux CUDA binaries are not available for download).
Pre-requisite installation: sudo apt install nvidia-cuda-toolkit
Settings (detailed config at the end of the post):
Tensor parallel: -sm tensor -ts 1,1
-sm tensor cannot be enabled at the same time as -ctk and -ctv. This means KV cache quantization cannot be used, limiting the context window to around 64k. I usually need a 160k context, so this is a bit frustrating.
--spec-type draft-mtp --spec-draft-n-max 2
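
The PCIe equivalence mentioned in the platform notes can be checked with a quick back-of-the-envelope calculation. This is a rough sketch, assuming the published per-lane transfer rates (8 GT/s for PCIe 3.0, 16 GT/s for PCIe 4.0) and 128b/130b encoding. The function name is made up for illustration, and real-world throughput will be lower because of protocol overhead.

```python
# Theoretical one-direction PCIe bandwidth, ignoring protocol overhead.
# Transfer rates are the published per-lane figures in GT/s (assumed here,
# not measured on the test platform).
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}
ENCODING = 128 / 130  # 128b/130b line encoding used by PCIe 3.0 and later


def pcie_gb_per_s(gen: int, lanes: int) -> float:
    """Return the theoretical bandwidth in GB/s for one direction."""
    return GT_PER_S[gen] * ENCODING / 8 * lanes


gen3_x8 = pcie_gb_per_s(3, 8)
gen4_x4 = pcie_gb_per_s(4, 4)
print(f"PCIe 3.0 x8: {gen3_x8:.2f} GB/s")
print(f"PCIe 4.0 x4: {gen4_x4:.2f} GB/s")
```

Both links come out the same on paper, which is the basis for treating the old x8/x8 board as comparable to a newer board's secondary x4 slot.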

Test Result

2.16.262.271 I slot print_timing: id  0 | task 701 | prompt eval time =    3056.70 ms /  1394 tokens (    2.19 ms per token,   456.05 tokens per second)
2.16.262.276 I slot print_timing: id  0 | task 701 |        eval time =   22538.95 ms /   975 tokens (   23.12 ms per token,    43.26 tokens per second)
2.16.262.277 I slot print_timing: id  0 | task 701 |       total time =   25595.65 ms /  2369 tokens
2.16.262.291 I slot print_timing: id  0 | task 701 |     graphs reused =       1016
2.16.262.292 I slot print_timing: id  0 | task 701 | draft acceptance = 0.77618 (  593 accepted /   764 generated)
2.16.262.310 I statistics        draft-mtp: #calls(b,g,a) =   10   1038   1038, #gen drafts =   1038, #acc drafts =   959, #gen tokens =   2076, #acc tokens =  1792, dur(b,g,a) = 0.018, 8380.839, 3.772 ms
2.16.263.267 I slot    release: id  0 | task 701 | stop processing: n_tokens = 12343, truncated = 0

The initial peak speeds reached pp 600+ t/s and tg 50 t/s. At an actual context length of 12k, prompt processing (pp) hits 456.05 t/s, and text generation (tg) is at 43.26 t/s. This vastly exceeded my expectations. While it doesn't match the maximum peak speed of the 7900 XTX, the speed is incredibly stable, and the GPU utilization stays pegged at 100% for long durations. I have to say, CUDA is simply much more mature.
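
The headline numbers can be recomputed straight from the log lines. The following is a small sketch that parses three of the lines quoted above and derives the rates. The regular expressions are written for this particular log layout only and are an assumption about its format, not a general llama.cpp log parser.

```python
import re

# Two timing lines and the draft line, copied from the log above.
LOG = """\
2.16.262.271 I slot print_timing: id  0 | task 701 | prompt eval time =    3056.70 ms /  1394 tokens (    2.19 ms per token,   456.05 tokens per second)
2.16.262.276 I slot print_timing: id  0 | task 701 |        eval time =   22538.95 ms /   975 tokens (   23.12 ms per token,    43.26 tokens per second)
2.16.262.292 I slot print_timing: id  0 | task 701 | draft acceptance = 0.77618 (  593 accepted /   764 generated)
"""

# Captures: which phase, elapsed milliseconds, token count.
TIMING = re.compile(r"\|\s+(prompt eval|eval) time =\s*([\d.]+) ms /\s*(\d+) tokens")
# Captures: accepted and generated draft tokens.
DRAFT = re.compile(r"\(\s*(\d+) accepted /\s*(\d+) generated\)")

rates = {}
for kind, ms, tokens in TIMING.findall(LOG):
    rates[kind] = int(tokens) / (float(ms) / 1000.0)  # tokens per second

accepted, generated = map(int, DRAFT.search(LOG).groups())
acceptance = accepted / generated

print(f"prefill:    {rates['prompt eval']:.2f} t/s")
print(f"generation: {rates['eval']:.2f} t/s")
print(f"draft acceptance: {acceptance:.5f}")
```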

However, there are still some issues. It runs fine for a couple of rounds, but tends to crash with an OOM error after some use. Disabling MTP stabilizes it, and the context can then be extended to 96k. Without MTP, the pp speed remains at 600+ t/s and the tg speed drops to 31 t/s, which is still quite decent.

Scenario	Context Window	Prefill (pp)	Generation (tg)
MTP Initial Peak	64k	620 t/s	50 t/s
MTP @ 12k	64k	456 t/s	43.26 t/s
No MTP Initial Peak	96k	620 t/s	31 t/s
No MTP @ 20k	96k	605 t/s	29.10 t/s
No MTP @ 50k	96k	438 t/s	26.59 t/s
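
To put the table in relative terms, here is a minimal sketch that computes the generation speedup from MTP at peak and how much the no-MTP speeds fall off by 50k context. The numbers are copied from the table; the variable names are illustrative only.

```python
# (scenario, prefill t/s, generation t/s), copied from the table above.
ROWS = [
    ("MTP Initial Peak", 620, 50.0),
    ("MTP @ 12k", 456, 43.26),
    ("No MTP Initial Peak", 620, 31.0),
    ("No MTP @ 20k", 605, 29.10),
    ("No MTP @ 50k", 438, 26.59),
]
pp = {name: prefill for name, prefill, _ in ROWS}
tg = {name: gen for name, _, gen in ROWS}

# Generation speedup from MTP, comparing the two initial-peak rows.
mtp_speedup = tg["MTP Initial Peak"] / tg["No MTP Initial Peak"]

# Fractional slowdown at 50k context relative to the no-MTP peak.
pp_drop_50k = 1 - pp["No MTP @ 50k"] / pp["No MTP Initial Peak"]
tg_drop_50k = 1 - tg["No MTP @ 50k"] / tg["No MTP Initial Peak"]

print(f"MTP generation speedup at peak: {mtp_speedup:.2f}x")
print(f"Prefill drop at 50k:    {pp_drop_50k:.0%}")
print(f"Generation drop at 50k: {tg_drop_50k:.0%}")
```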

Conclusion

Cons

SPLIT_MODE_TENSOR currently cannot be used alongside KV cache quantization, making 24GB feel a bit tight. However, this is definitely not a niche demand; simple Q8 quantization could double the context to 128k / 192k. The future looks promising.
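
The "double the context" estimate can be sanity-checked with a rough calculation. This sketch assumes the KV cache is the only thing that grows with context, that it grows linearly, and that the VRAM budget for it stays fixed; those are simplifications, not measurements. It uses ggml's q8_0 layout, where each block of 32 values is stored as 32 int8 values plus one fp16 scale.

```python
# Bytes per KV-cache element.
F16_BYTES = 2.0
# ggml's q8_0 stores blocks of 32 int8 values plus one fp16 scale: 34 bytes.
Q8_0_BYTES = 34 / 32


def context_at_q8(context_f16: int) -> int:
    """Context that fits in the same KV budget after switching f16 -> q8_0.

    Assumes KV memory scales linearly with context and nothing else changes.
    """
    return int(context_f16 * F16_BYTES / Q8_0_BYTES)


for ctx in (64_000, 96_000):
    print(f"{ctx} tokens at f16 -> about {context_at_q8(ctx)} tokens at q8_0")
```

Under these assumptions the gain is a little under 2x because of the per-block scale, so the result lands slightly below the 128k / 192k figures.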

Pros

Incredible value for money. Depending on where you are, two 3060s could cost as little as $400.
The CUDA ecosystem is mature. GPU utilization stays stable at 100% for long stretches, and once compiled, it works flawlessly without needing constant troubleshooting. Peace of mind.
The 3060 has a slim form factor, with short single- or dual-fan variants available, making it compatible with most ATX and mATX motherboards and cases without any hassle.

Inferences

Using dual 16GB cards that are slightly faster (e.g., 4060 Ti, 5060 Ti) will probably yield even better results, though the price-to-performance ratio will drop. Again, CUDA just offers better utilization. Having 32GB this way should be much faster than, e.g., the crippled AI Pro R9700, and still cost less.

Other Notes

I also gave vLLM a brief try, but it seems poorly optimized for VRAM-constrained scenarios and kept hitting OOM no matter what. Plus, vLLM takes too long to start up, making debugging a pain, so I stopped messing with it.

Appendix

Detailed Configuration (for the no-MTP runs, set --ctx-size 96000 and remove the --spec-type line):

    --no-mmproj-offload \
    -dev CUDA0,CUDA1 -sm tensor -ts 1,1 \
    --fit off \
    --host 0.0.0.0 --port "$PORT" \
    -t 0 -ngl 99 -np 1 \
    --kv-unified --flash-attn on --ctx-size 64000 \
    --spec-type draft-mtp --spec-draft-n-max 2 \
    -rea on \
    --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.0 --presence-penalty 0.0
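
Since the MTP and no-MTP runs differ only in the context size and the speculative flags, the two variants can be generated from one place. The sketch below is one way to do that; the flags are copied from the configuration above, while the binary name `llama-server`, the model path, the default port, and the `build_command` helper are placeholders for illustration and not taken from the post.

```python
import shlex

# Placeholders (assumptions): adjust to your own build and model location.
BINARY = "llama-server"
MODEL = "/path/to/Qwen3.6-27B-Q4_K_S.gguf"


def build_command(use_mtp: bool, port: int = 8080) -> list[str]:
    """Assemble the argument list for the MTP or no-MTP variant."""
    ctx_size = 64000 if use_mtp else 96000  # context sizes reported in the post
    args = [
        BINARY, "-m", MODEL,
        "--no-mmproj-offload",
        "-dev", "CUDA0,CUDA1", "-sm", "tensor", "-ts", "1,1",
        "--fit", "off",
        "--host", "0.0.0.0", "--port", str(port),
        "-t", "0", "-ngl", "99", "-np", "1",
        "--kv-unified", "--flash-attn", "on", "--ctx-size", str(ctx_size),
    ]
    if use_mtp:
        args += ["--spec-type", "draft-mtp", "--spec-draft-n-max", "2"]
    args += [
        "-rea", "on",
        "--temp", "0.6", "--top-k", "20", "--top-p", "0.95", "--min-p", "0.0",
        "--repeat-penalty", "1.0", "--presence-penalty", "0.0",
    ]
    return args


# Print both variants as copy-pasteable shell commands.
print(shlex.join(build_command(use_mtp=True)))
print(shlex.join(build_command(use_mtp=False)))
```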

munkiemagik@reddit

Would this be one of those situations where a third GPU makes a lot of sense?

Seeing as you are not using vLLM for tensor parallel, and you've found a reasonable price for the 3060s, would it be worth splitting your x16 slot into x8/x8 to drop a third 3060 in there to give you the headroom for extra context you say you are lacking? I know it's not an elegant solution in terms of cases and hardware; I presume you have this nicely in an ATX tower at the moment. But I imagine with 36GB you would be much happier.

There have been times in the past, before the Qwen3.6 era, where I was just shy of VRAM for context myself and often contemplated a 3060 Ti with the GDDR6X (608GB/s) VRAM, even though it's only 8GB, to give me just that little bit more headroom. In terms of memory bandwidth I couldn't find a more cost-effective card that wouldn't drag the 3090s down too much.

akira3weet@reddit (OP)

1) Tensor split across 3 (non-power-of-2) GPUs is tricky; I'm not sure how practical that is.
2) The two cards are already running x8+x8; maybe a different motherboard would work better.
3) I ran 5070Ti+3060 with layer split before, where the two cards run serially. The point of dual 3060s is to run a symmetric tensor split, where the two cards run in parallel. Running 5070Ti+3060 with tensor split would probably also work, but the 3060 would bottleneck the 5070Ti.

munkiemagik@reddit

1) I thought that with tensor split it doesn't actually matter how many GPUs you throw in, and that it is specifically tensor parallel in vLLM that needs powers of 2? I used to run 3x GPU before the Qwen3.6 era with llama.cpp and --tensor-split (now I find myself quite content with 2x GPU for my use cases) and never had any noticeable impact. I was just capped to the slowest card.

2) I mistakenly assumed you had one GPU per PCIe slot, one on the CPU x16 and the second on the chipset PCIe, whatever that is (x4 or x8 etc.). My bad, I was thinking you could then drop the third alongside the first in the x16 using split risers.

3) Agreed, the 3060 would definitely slow down the 5070 Ti. I used to run 2x 3090 + 1x 5090 and of course I was capped to 3090 speeds, but the 80GB VRAM more than made up for it (gpt oss 120b fully in VRAM, or glm 4.5 air with only certain specific parts offloaded to CPU, was great back then). Using 3x GPU I still wasn't suffering below 3090 performance using -ts. I did do funky little balancing acts to keep as much on the 5090 as possible and then spill the rest across the 2x 3090, and I fully acknowledge that had I thrown a 3060 Ti GDDR6X into the mix, the 608GB/s memory bandwidth would have been the bottleneck.

akira3weet@reddit (OP)

`-sm tensor` is very recent, the default is `-sm layer`, `--tensor-split` specifies the ratio. For `-sm layer` the ratio is quite flexible.

munkiemagik@reddit

Sorry, I completely missed that. I haven't seen this before; looks like I have more reading to do!

CypSteel@reddit

What do I need to look for in a card to do something like this? Would this work? MSI Gaming GeForce RTX 3060 12GB 15 Gbps GDDR6 192-Bit HDMI/DP PCIe 4 Torx Twin Fan Ampere OC Graphics Card

I just bought two random 3060 12GB cards that popped up on local FB Marketplace.

Eyelbee@reddit

TL;dr how exactly did you get those numbers? Doesn't make much sense, even a single 3090 is slower usually. Never tried MTP, is it due to that? Even if tensor parallelism is highly effective the memory bandwidth of the card shouldn't allow those speeds.

Single 3090 should be faster according to https://github.com/noonghunna/club-3090

Pixer---@reddit

That’s a great budget build. My 4x mi50 get 500pp/s and 60tk/s. But it probably draws multiple times more power

soshulmedia@reddit

Interesting. What is your setup like? What is your PCIe bandwidth, what quantization, llama.cpp or vllm?

I’ve got a romed8-2t and an EPYC 7402 24 core. I’ve got p2p enabled with 26 GB/s unidirectional and 52 GB/s bidirectional between all cards. I’m using llama.cpp MTP with max-n 3 for these numbers. It's running at Q8_0 with space for 1M tokens of context at fp16.

Vllm is faster in prompt processing but it slows down to 15tk/s at 70k context. Llamacpp has 35 tk/s at 200k

Ah, I see, thanks a lot for the figures! That's a lot better infrastructure around your cards than I have for mine. Seeing these figures, I guess it would make sense then for me to update my rig at some point.

You only use PCIe and do not have any of these AMD Instinct Infinity bridges in operation, or do you?

If I would figure out how and where to get one of these, I guess I could bring my rig up to about your figures but keeping the rest as budget as possible.

Unfortunately I don’t have the Infinity bridge. It’s running over pcie 4.0 16x. You could try the p2p enabled cuda driver and check if your cpu supports p2p natively. This will make it scale better using multi gpu

Thanks for the suggestion but I think that's probably not worthwhile for me - I run my GPUs in a repurposed mining rig with PCIE4 x1 for each. This is why I am really curious about these infinity bridge connectors.

Client_Hello@reddit

Yet another way to trade VRAM for speed.

Qwen 27b, Unsloth Q4_K_M with MTP, Dual 5060 ti 16gb, your config.

500 token prompt and 2600 token gen

--split-mode	Prefill (pp)	Generation (tg)
layer	550	40
tensor	700	65

johnzadok@reddit

Nice tensor speedup. Is your MB PCI-E gen 5 or 4?

Nice numbers! Great to see an actual test!

Do you have any idea why is 7900 XTX prefill slow? Based on the hardware it should be way faster than 3060. Asking as someone who's thinking of getting an 7900 XTX ($800 used) for running the Qwen 27B.

I wish I could figure it out. I constantly get varying results from the 7900 XTX. For example, I get 28 t/s, then after a reboot it becomes 22 t/s. I struggle with it without result, then another day it's back to 28 t/s. And "way faster on paper" has always been the theme for AMD cards.

lloyd08@reddit

I almost guarantee it's thermal throttling. I have mine undervolted by -100, GPU underclocked, with mem overclocked and similar numbers to GoodTip. I peak at 225W now, perf went up, and it still occasionally thermally throttles if I hit a large thinking context. Before taking the time to tune it, 27B thermally throttled within 1 second of token gen.

That might be true, though the idea of such a bulky GPU thermal throttling so quickly in an open case feels a bit shocking to me.

HelpfulHand3@reddit

Take a look at the temperature of the VRAM too. You might get hot there while the core stays low. Could be worth manually cranking the fans along with an undervolt and see if that solves the problem.

GoodTip7897@reddit

I have a 7900xtx and I get 1200 t/sec prefill for 4096 tokens and average like 600 for 48k tokens. Qwen 3.6 27b.

35b a3b hits 3000+ and stays above 1500 no matter what.

I use ubatch batch 1024

Sufficient_Sir_5414@reddit

This is a gold standard budget build post. The insight about your 2013 i7-4770k platform splitting PCIe lanes into x8/x8 being functionally equivalent to a modern budget board's x16/x4 split is a brilliant realization. It proves you don't need to drop $1,000 on a new platform just to avoid a pipeline bottleneck.

Your results perfectly highlight the classic AMD vs. NVIDIA dilemma for local LLMs right now. The 7900 XTX has massive raw bandwidth, but ROCm/Vulkan drivers still suffer from erratic prefill stability. Switching to dual 3060s means losing some peak speed, but gaining 100% stable GPU utilization and a mature CUDA execution loop. 43 t/s on a 27B Q4 model with MTP is an absolute win for a $400 GPU budget.

Overall-Branch-1496@reddit

Can you please estimate my performance loss using x16/x1 9070xt/9060xt cards? I'm struggling at 300-400 pp and 25-35 tg using MTP over Vulkan.

autisticit@reddit

Ok but what is the best apple pie recipe?

mikewagnercmp@reddit

Actually I was using a single 3060 to run the Q8 qwen3.6, offloading to memory, with 250k context, at between 20-30 t/s and about 300-400 pp speed. If you want I can post the llama.cpp commands; I thought they worked pretty well. I also had a Q4 with large context that was much faster.

Fine-Bite9484@reddit

please post the commands.

ducksoup_18@reddit

Definitely post it.

--server --webui-mcp-proxy -m /models/Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf --fit on --ctx-size 245000 --port 8000 --host 0.0.0.0 --temp 0.7 --presence-penalty 0.3 --top-p 0.95 --min-p 0.00 --sleep-idle-seconds 600 --jinja --flash-attn on --repeat-penalty 1.1 -ctk q4_0 -ctv q4_0 -np 1 --timeout 3600 --models-max 1 --mlock --fit-target 256 --batch-size 2048 --ubatch-size 512 --no-prefill-assistant --cache-ram 40960 --cache-reuse 256 --checkpoint-every-n-tokens 4096 --no-mmap --ctx-checkpoints 16

--server --webui-mcp-proxy -m /models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --fit on --ctx-size 245000 --port 8000 --host 0.0.0.0 --jinja --flash-attn on -ctk q4_0 -ctv q4_0 -np 1 --batch-size 2048 --ubatch-size 512 --no-mmap --cache-ram 32768 -ot "blk.\d+.ffn_.*_exps=CPU"

They are both MoE models, so they suffer less from going to system RAM than others. Note I was running this on my Unraid box with 64GB of RAM, hence the large cache size; for Q8 you might need a lot, but not for Q4.

I based the above on info I found here https://www.reddit.com/r/LocalLLM/comments/1sq5n6i/best_local_llm_for_coding_on_rtx_3060_12gb/ in the comments.

I had these in notes, might not be the final settings I landed on

Come on... 27b and 35b-a3b have very different performance attributes.

Durr, yeah, I misread the title on my phone, sorry!

gestapov@reddit

Vram?

Napster3301@reddit

compute isnt the ceiling here, memory bandwidth is. dual 3060 = ~720 GB/s combined. 7900 XTX alone = 960 GB/s. on paper the amd card should beat dual 3060 by ~30% on decode. youre seeing the opposite because cudas scheduler is deterministic and rocm/vulkan kernels stall.

the win isnt "cuda is more mature" in vague terms. its that cuda has fused kernels for tensor parallel attention that rocm doesnt have yet. you cant compensate for missing kernels with more bandwidth. amds hardware roofline is higher but their software cant reach it.

so dual 3060 doesnt beat 7900 xtx. it beats 7900 xtx-on-rocm. once youre clear on which fight is happening, the $400 win still stands but for different reasons.
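
The bandwidth figures in this comment follow from bus width times per-pin data rate. A quick sketch of that arithmetic: the 3060 figures (192-bit, 15 Gbps) match the card listing quoted earlier in the thread, while the 7900 XTX figures (384-bit, 20 Gbps) are taken from its public spec sheet and are an assumption here. Summing the two 3060s is a paper figure only; how much of it tensor parallelism actually delivers is a separate question.

```python
def mem_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s: bus width (bits) x per-pin rate / 8."""
    return bus_width_bits * data_rate_gbps / 8


rtx_3060 = mem_bandwidth_gb_s(192, 15)     # per card
rx_7900_xtx = mem_bandwidth_gb_s(384, 20)  # assumed public spec
dual_3060 = 2 * rtx_3060                   # combined paper figure

ratio = rx_7900_xtx / dual_3060
print(f"RTX 3060:    {rtx_3060:.0f} GB/s")
print(f"Dual 3060:   {dual_3060:.0f} GB/s")
print(f"7900 XTX:    {rx_7900_xtx:.0f} GB/s")
print(f"XTX vs dual: {ratio:.2f}x")
```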

DistanceSolar1449@reddit

Use vLLM instead of llama.cpp if you’re gonna use tensor parallel

zipperlein@reddit

Or ik_llama with graph mode if u want to stick to ggufs.

thirteen-bit@reddit

I'd suggest to change CUDA version. Either 12.*, 13.1 or 13.3 should be OK.

13.2 specifically has some bugs that manifest in llama.cpp quantized model run (not llama.cpp compilation):

https://github.com/ggml-org/llama.cpp/issues/21255

Top-Cardiologist1011@reddit

this is the kind of detailed post this sub needs more of. actual benchmarks, actual config, actual cost breakdown. not just "it works trust me bro"

UnethicalExperiments@reddit

I'm using 4x RTX 3060s with about 70t/s with 3.6 35b a3b q8 , and Q4 kv with 250k context using llama.cpp on some preliminary testing

zanar97862@reddit

Does Q4 cache with a Q8 model function well? All the quality test posts seem to find Q4 cache very bad for Qwen models. Are you running turboquant?

Jolly_Criticism9190@reddit

I have the same dual 3060 12GB on PCIe x16 slots. Unsloth qwen3.6 27B Q4KL with F16 mmproj loaded. Windows 11. -ngl 99, -c 65000, --fit off. April self-compiled version of llama.cpp, CUDA 12.6. Only getting 16.5 t/s.

What did I do wrong?

MTP and `-sm tensor` are very new I believe. And for Windows you can download prebuilt from Github.

Yeah I tried it in my current build. It recognized -sm tensor but still having the same token speed. Maybe the MTP is the game changer here

maybe pay attention to the pcie lanes and cuda version. And the last resort... Linux.

I was waiting for you to say Linux.

I am surprised nobody else said Linux haha

lol

FINALLY! Been waiting for someone to drop dual 3060 configs so i dont have to do all the hard work of figuring out how to get it working well.

laul_pogan@reddit

MTP OOM on multi-turn is usually draft KV cache accumulation, not the main model. Try --spec-draft-n-max 1 instead of 2. In testing on tight VRAM setups, halving the speculative depth cuts the draft cache overhead enough to survive longer sessions; acceptance rate drops maybe 5-8 points but the stability gain is worth it. Your 0.77 acceptance at n=2 means you're leaving almost nothing on the table by dropping to 1.
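
One rough way to reason about the depth trade-off is the usual simplified model of speculative decoding, in which each drafted token is accepted independently with probability p and a verification step emits the accepted prefix plus one token from the main model. The sketch below applies that model. Treating the logged aggregate acceptance rate as a per-token p is a simplification, and the model ignores draft cost and VRAM effects entirely, so this is an illustration and not a prediction for this setup.

```python
def expected_tokens_per_step(p: float, n: int) -> float:
    """Expected tokens emitted per verification step with draft depth n.

    Simplified model: each draft token is accepted independently with
    probability p, and drafting stops at the first rejection, giving
    1 + p + p**2 + ... + p**n.
    """
    return sum(p ** k for k in range(n + 1))


# Aggregate acceptance from the log in the post (593 accepted / 764 generated),
# used here as a stand-in for the per-token probability (an assumption).
p = 593 / 764

for depth in (1, 2, 3):
    print(f"n={depth}: {expected_tokens_per_step(p, depth):.2f} tokens/step")
```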

Thank you! I'll try that out.

Big-Business-2505@reddit

I’ve actually got two similar rigs. Found a new use for my old 3060s. Both are in Proxmox servers with the GPUs shared to VMs. They rock for my multi-agent dev work. Cost me next to nothing since the cards were paid for years ago. Very power efficient as well. You can cap them as low as 110 W with very little loss.