Best second GPU for RTX 4070 Super?
Posted by Haunting-Fig-6383@reddit | LocalLLaMA | 9 comments
So I currently have an RTX 4070 Super, and it can easily run models like Gemma 3 12B and even gpt-oss 20B (although it can take up to a minute to generate a response). I want to get a second GPU so I can run larger models in the 20B-30B parameter range. What GPU do you guys recommend?
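For rough sizing before buying: quantized weights take roughly params × bits-per-weight / 8 bytes of VRAM, plus room for KV cache and runtime overhead. A quick back-of-the-envelope sketch with illustrative numbers (assumptions, not measurements):
# Back-of-the-envelope VRAM estimate for a quantized model (illustrative only):
# weights_GB ≈ params_B * bits_per_weight / 8, plus a few GB for KV cache and overhead.
PARAMS_B=27      # billions of parameters (example value)
BPW_TENTHS=45    # effective bits per weight * 10 (~4.5 bpw for a Q4_K-style quant)
echo "weights ≈ $(( PARAMS_B * BPW_TENTHS / 80 )) GB, plus KV cache and CUDA overhead"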
volleyneo@reddit
I have that + a 5060 Ti 16GB (the second-hand market here is garbage). I can run Qwen3.6 27B Q5 UD XL with 98k context and Q8 KV cache, or Qwen3.6 35B MoE with 132k context. But the split is very important: 10,16. You also need to set the batches to fixed sizes; I run -b 2048 -ub 512. You need llama.cpp and manual tuning for dual GPUs, especially ones with a VRAM difference.
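The gist of that tuning, as a minimal sketch (model path and context length here are placeholders; the real commands are further down the thread):
# Minimal dual-GPU llama-server launch. --tensor-split biases the weight split
# toward the larger-VRAM card; -b/-ub fix the batch sizes instead of auto-tuning.
./build/bin/llama-server -m ~/ai-models/model.gguf \
  -ngl 99 -c 98304 -b 2048 -ub 512 \
  --tensor-split 10,16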
TheFunSlayingKing@reddit
I'm thinking of getting a similar GPU, because I have a similar issue with the second-hand market, so I'm very iffy about getting a 3090.
I'm planning on running Qwen 3.6 27B with 132k+ context at Q4 quants, or at most a 32B dense model (rough KV-cache math for that context length is sketched below). A 120B MoE is my other "max", but I don't have problems with large MoEs because they're just better optimized.
How is the performance for you, t/s-wise?
Also on a 4070.
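For context on that 132k figure: KV-cache memory scales linearly with context length. A rough sketch, where every model number below is an assumed example rather than an actual Qwen 3.6 spec:
# Rough KV-cache sizing: 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes/elem.
# All model numbers here are illustrative assumptions, not real Qwen 3.6 specs.
LAYERS=48; KV_HEADS=8; HEAD_DIM=128; CTX=131072; BYTES=1   # q8_0 KV ≈ 1 byte/elem
echo "KV cache ≈ $(( 2 * LAYERS * KV_HEADS * HEAD_DIM * CTX * BYTES / 1024**3 )) GiB"
At fp16 the figure would be double, which is why the q8_0 KV cache in the commands below helps at long contexts.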
volleyneo@reddit
For 3.6 27B, 18-18.6 t/s. The 35B MoE ends up at 68 tokens/s.
TheFunSlayingKing@reddit
That's decently fast, not bad at all. Not lightspeed like the 35B-A3B, but very good. What are your settings for this?
volleyneo@reddit
Unfortunately the 5060 Ti sits on a PCIe 2.0 x4 chipset slot (about 2 GB/s), and I can't find a better board right now, so it should get better. I run these:
alias ai-qwen-godmode='cd ~/llama-mainline && CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=1,0 ./build/bin/llama-server \
-m ~/ai-models/Qwen3.6-27B-UD-Q5_K_XL.gguf \
--alias Qwen-3.6-27B-Coder-Pro \
-ngl 99 -c 114688 -np 1 -b 4096 -ub 512 \
--flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 \
--tensor-split 10,16 --context-shift --keep -1 \
--jinja --reasoning on \
--spec-type ngram-mod --spec-ngram-size-n 48 --draft-min 16 --draft-max 64 \
--temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 \
--repeat-penalty 1.0 --presence-penalty 0.0 \
--port 8081 --api-key sk-500'
alias ai-qwen-pro='cd ~/llama-mainline && CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=1,0 ./build/bin/llama-server \
-m ~/ai-models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
--alias Qwen-3.6-MoE-Coder-Pro \
-ngl 99 -c 131072 -np 1 -b 4096 -ub 512 \
--flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 \
--tensor-split 10,16 --context-shift --keep -1 \
--jinja --reasoning on \
--spec-type ngram-mod --spec-ngram-size-n 48 --draft-min 16 --draft-max 64 \
--temp 0.8 --top-k 20 --top-p 0.95 --min-p 0.1 \
--repeat-penalty 1.05 --presence-penalty 0.0 --repeat-last-n 256 \
--port 8081 --api-key sk-500'
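Both aliases expose llama-server's OpenAI-compatible API, so once one is running you can query it like this (port and key taken from the commands above):
# Query the running server; llama-server accepts the API key as a Bearer token.
curl -s http://localhost:8081/v1/chat/completions \
  -H "Authorization: Bearer sk-500" \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen-3.6-27B-Coder-Pro", "messages": [{"role": "user", "content": "hello"}]}'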
Ashmadia@reddit
I run a 3080 and a 5080. Sure, there's better, but it's been pretty solid.
jacek2023@reddit
Probably a 5070.
__novalis@reddit
I also have the 4070 Super and I chose to add a 5090. So far that seems to have been the right choice.