$400 Qwen 3.6-27B Setup - Dual RTX 3060 - 30-50 t/s

Posted by akira3weet@reddit | LocalLLaMA

I picked up a 7900 XTX earlier. It runs qwen3.6-27b fine, but not to my liking: its compute performance is quite unstable for me. With MTP, decode speed can reach 40-60 t/s, but prefill is just too slow. Whether I used ROCm or Vulkan, prefill speed varied between 300 t/s and 500 t/s, even with very long prompts.

I've been itching to try out an ultra-budget 24GB setup using dual 3060s, and I managed to snag a second 3060 at a reasonable price in the last few days. So I took out the 7900 XTX, installed the 3060s, and began testing.

Test Configuration

Test Result

2.16.262.271 I slot print_timing: id  0 | task 701 | prompt eval time =    3056.70 ms /  1394 tokens (    2.19 ms per token,   456.05 tokens per second)
2.16.262.276 I slot print_timing: id  0 | task 701 |        eval time =   22538.95 ms /   975 tokens (   23.12 ms per token,    43.26 tokens per second)
2.16.262.277 I slot print_timing: id  0 | task 701 |       total time =   25595.65 ms /  2369 tokens
2.16.262.291 I slot print_timing: id  0 | task 701 |     graphs reused =       1016
2.16.262.292 I slot print_timing: id  0 | task 701 | draft acceptance = 0.77618 (  593 accepted /   764 generated)
2.16.262.310 I statistics        draft-mtp: #calls(b,g,a) =   10   1038   1038, #gen drafts =   1038, #acc drafts =   959, #gen tokens =   2076, #acc tokens =  1792, dur(b,g,a) = 0.018, 8380.839, 3.772 ms
2.16.263.267 I slot    release: id  0 | task 701 | stop processing: n_tokens = 12343, truncated = 0

The initial peak speeds reached 600+ t/s for prompt processing (pp) and 50 t/s for text generation (tg). At an actual context length of 12k, pp hits 456.05 t/s and tg is at 43.26 t/s. This vastly exceeded my expectations. While it doesn't match the maximum peak speed of the 7900 XTX, the speed is incredibly stable, and GPU utilization stays pegged at 100% for long durations. I have to say, CUDA is simply much more mature.
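
As a sanity check on the numbers quoted above, here is a minimal Python sketch that recomputes throughput and draft acceptance from the raw counts in the log. The regular expressions and the function names `throughput` and `draft_acceptance` are my own illustration, not part of any llama.cpp tooling, and they assume the `print_timing` line format shown in the test result.

```python
import re

# Two timing lines and the draft acceptance line, copied from the test result.
LOG = """\
2.16.262.271 I slot print_timing: id  0 | task 701 | prompt eval time =    3056.70 ms /  1394 tokens (    2.19 ms per token,   456.05 tokens per second)
2.16.262.276 I slot print_timing: id  0 | task 701 |        eval time =   22538.95 ms /   975 tokens (   23.12 ms per token,    43.26 tokens per second)
2.16.262.292 I slot print_timing: id  0 | task 701 | draft acceptance = 0.77618 (  593 accepted /   764 generated)
"""

# Hypothetical patterns, written against the line format above.
TIMING = re.compile(r"(prompt eval|eval) time\s*=\s*([\d.]+) ms\s*/\s*(\d+) tokens")
DRAFT = re.compile(r"\(\s*(\d+) accepted\s*/\s*(\d+) generated\)")


def throughput(log_text):
    """Return tokens per second for each phase, recomputed from ms and token counts."""
    rates = {}
    for phase, millis, tokens in TIMING.findall(log_text):
        rates[phase] = int(tokens) / (float(millis) / 1000.0)
    return rates


def draft_acceptance(log_text):
    """Return accepted / generated draft tokens from the slot's acceptance line."""
    accepted, generated = DRAFT.search(log_text).groups()
    return int(accepted) / int(generated)


rates = throughput(LOG)
acceptance = draft_acceptance(LOG)
print(f"pp: {rates['prompt eval']:.2f} t/s")
print(f"tg: {rates['eval']:.2f} t/s")
print(f"draft acceptance: {acceptance:.5f}")
```

The recomputed values should agree with the figures the server prints on the same lines, which confirms that pp and tg here are simply tokens divided by wall-clock time for each phase.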

However, there are still some issues. It runs fine for a couple of rounds, but tends to crash with an OOM error after some use. Disabling MTP stabilizes it and lets the context be extended to 96k. Without MTP, pp speed remains at 600+ t/s and tg speed drops to 31 t/s, which is still quite decent.

| Configuration | Context Window | Prefill (pp) | Generation (tg) |
|---|---|---|---|
| MTP Initial Peak | 64k | 620 t/s | 50 t/s |
| MTP @ 12k | 64k | 456 t/s | 43.26 t/s |
| No MTP Initial Peak | 96k | 620 t/s | 31 t/s |
| No MTP @ 20k | 96k | 605 t/s | 29.10 t/s |
| No MTP @ 50k | 96k | 438 t/s | 26.59 t/s |
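
To put the table in relative terms, the following Python sketch computes how much of each configuration's initial peak speed is retained as the context fills, plus the peak generation speedup from MTP. The values are copied from the table. The `ROWS` layout and the `retained` helper are my own illustration. Note that the MTP and no-MTP runs were measured at different context depths, so only the peak-to-peak comparison is like-for-like.

```python
# (label, prefill t/s, generation t/s), copied from the table above.
ROWS = [
    ("MTP Initial Peak", 620.0, 50.0),
    ("MTP @ 12k", 456.0, 43.26),
    ("No MTP Initial Peak", 620.0, 31.0),
    ("No MTP @ 20k", 605.0, 29.10),
    ("No MTP @ 50k", 438.0, 26.59),
]

BY_LABEL = {label: (pp, tg) for label, pp, tg in ROWS}


def retained(label, peak_label):
    """Fraction of the peak pp and tg speed still available at the given row."""
    pp, tg = BY_LABEL[label]
    peak_pp, peak_tg = BY_LABEL[peak_label]
    return pp / peak_pp, tg / peak_tg


# Peak generation speedup from enabling MTP (both rows are initial peaks).
mtp_speedup = BY_LABEL["MTP Initial Peak"][1] / BY_LABEL["No MTP Initial Peak"][1]
print(f"MTP tg speedup at peak: {mtp_speedup:.2f}x")

for label, peak in [
    ("MTP @ 12k", "MTP Initial Peak"),
    ("No MTP @ 20k", "No MTP Initial Peak"),
    ("No MTP @ 50k", "No MTP Initial Peak"),
]:
    pp_frac, tg_frac = retained(label, peak)
    print(f"{label}: pp {pp_frac:.0%} of peak, tg {tg_frac:.0%} of peak")
```

By this measure MTP gives roughly a 1.6x generation speedup at peak, and without MTP the generation speed degrades more gently than prefill as the context grows toward 50k.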

Conclusion

Cons

Pros

Inferences

Other Notes

Appendix

Detailed Configuration:

    --no-mmproj-offload \
    -dev CUDA0,CUDA1 -sm tensor -ts 1,1 \
    --fit off \
    --host 0.0.0.0 --port "$PORT" \
    -t 0 -ngl 99 -np 1 \
    --kv-unified --flash-attn on --ctx-size 64000 \
    --spec-type draft-mtp --spec-draft-n-max 2 \
    -rea on \
    --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.0 --presence-penalty 0.0
    # For the no-MTP run: use --ctx-size 96000 and remove the --spec-type line.
    # (Comments cannot follow a trailing backslash, so the notes live down here.)
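
Since the two tested variants differ only in context size and the speculative decoding line, switching between them can be scripted. Below is a minimal Python sketch that assembles the flag list for either variant. It uses only the flags listed above. The `build_flags` function is hypothetical, and the server binary and model arguments are left out because they are not shown in the configuration.

```python
import shlex


def build_flags(use_mtp: bool, port: int) -> list[str]:
    """Assemble the flags from the configuration above for one of the two variants.

    With MTP: 64000 context plus the draft-mtp lines.
    Without MTP: 96000 context and no speculative decoding flags.
    """
    ctx_size = 64000 if use_mtp else 96000
    flags = [
        "--no-mmproj-offload",
        "-dev", "CUDA0,CUDA1", "-sm", "tensor", "-ts", "1,1",
        "--fit", "off",
        "--host", "0.0.0.0", "--port", str(port),
        "-t", "0", "-ngl", "99", "-np", "1",
        "--kv-unified", "--flash-attn", "on", "--ctx-size", str(ctx_size),
    ]
    if use_mtp:
        flags += ["--spec-type", "draft-mtp", "--spec-draft-n-max", "2"]
    flags += [
        "-rea", "on",
        "--temp", "0.6", "--top-k", "20", "--top-p", "0.95", "--min-p", "0.0",
        "--repeat-penalty", "1.0", "--presence-penalty", "0.0",
    ]
    return flags


# Print both variants as shell-quoted strings (port 8080 is an arbitrary example).
print(shlex.join(build_flags(use_mtp=True, port=8080)))
print(shlex.join(build_flags(use_mtp=False, port=8080)))
```

Prepend your own server binary and model arguments to the returned list before passing it to something like `subprocess.run`.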