Best second GPU for RTX 4070 Super?
Posted by Haunting-Fig-6383@reddit | LocalLLaMA | 9 comments
So I currently have an RTX 4070 Super, and it can easily run models like Gemma 3 12B and even gpt-oss 20B (although it can take up to a minute to generate a response). I want to get a second GPU so I can run larger models in the 20B-30B parameter range. What GPU do you guys recommend?
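For rough sizing before buying: quantized weights take roughly params × bits-per-weight / 8 bytes of VRAM, plus room for KV cache and runtime overhead. A quick back-of-the-envelope sketch with illustrative numbers (assumptions, not measurements):
# Back-of-the-envelope VRAM estimate for a quantized model (illustrative only):
# weights_GB ≈ params_B * bits_per_weight / 8, plus a few GB for KV cache and overhead.
PARAMS_B=27      # billions of parameters (example value)
BPW_TENTHS=45    # effective bits per weight * 10 (~4.5 bpw for a Q4_K-style quant)
echo "weights ≈ $(( PARAMS_B * BPW_TENTHS / 80 )) GB, plus KV cache and CUDA overhead"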
volleyneo@reddit
I have that + a 5060 Ti 16GB (the second-hand market here is garbage). I can run Qwen3.6 27B Q5 UD XL with 98k context and Q8 KV cache, or Qwen3.6 35B MoE with 132k context. But the split is very important: 10,16. You also need to set the batches to fixed sizes; I run -b 2048 -ub 512. You need llama.cpp and manual tuning for dual GPUs, especially ones with a VRAM difference.
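The gist of that tuning, as a minimal sketch (model path and context length here are placeholders; the real commands are further down the thread):
# Minimal dual-GPU llama-server launch. --tensor-split biases the weight split
# toward the larger-VRAM card; -b/-ub fix the batch sizes instead of auto-tuning.
./build/bin/llama-server -m ~/ai-models/model.gguf \
  -ngl 99 -c 98304 -b 2048 -ub 512 \
  --tensor-split 10,16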
TheFunSlayingKing@reddit
I'm thinking of getting a similar GPU, because I have a similar issue with the second-hand market, so I'm very iffy about getting a 3090.
I'm planning on running Qwen 3.6 27B with 132k+ context at Q4 quants, or at most a 32B dense model (rough KV-cache math for that context length is sketched below). A 120B MoE is my other "max", but I don't have problems with large MoEs because they're just better optimized.
How is the performance for you, t/s-wise?
Also on a 4070.
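For context on that 132k figure: KV-cache memory scales linearly with context length. A rough sketch, where every model number below is an assumed example rather than an actual Qwen 3.6 spec:
# Rough KV-cache sizing: 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes/elem.
# All model numbers here are illustrative assumptions, not real Qwen 3.6 specs.
LAYERS=48; KV_HEADS=8; HEAD_DIM=128; CTX=131072; BYTES=1   # q8_0 KV ≈ 1 byte/elem
echo "KV cache ≈ $(( 2 * LAYERS * KV_HEADS * HEAD_DIM * CTX * BYTES / 1024**3 )) GiB"
At fp16 the figure would be double, which is why the q8_0 KV cache in the commands below helps at long contexts.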
volleyneo@reddit
For 3.6 27B, 18-18.6 t/s. The 35B MoE ends up at 68 tokens/s.
TheFunSlayingKing@reddit
That's decently fast, not bad at all. Not lightspeed like the 35B-A3B, but very good. What are your settings for this?
volleyneo@reddit
Unfortunately the 5060 Ti sits on a PCIe 2.0 x4 chipset slot (about 2 GB/s), and I can't find a better board right now, so it should get better. I run these:
alias ai-qwen-godmode='cd ~/llama-mainline && CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=1,0 ./build/bin/llama-server \
-m ~/ai-models/Qwen3.6-27B-UD-Q5_K_XL.gguf \
--alias Qwen-3.6-27B-Coder-Pro \
-ngl 99 -c 114688 -np 1 -b 4096 -ub 512 \
--flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 \
--tensor-split 10,16 --context-shift --keep -1 \
--jinja --reasoning on \
--spec-type ngram-mod --spec-ngram-size-n 48 --draft-min 16 --draft-max 64 \
--temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 \
--repeat-penalty 1.0 --presence-penalty 0.0 \
--port 8081 --api-key sk-500'
alias ai-qwen-pro='cd ~/llama-mainline && CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=1,0 ./build/bin/llama-server \
-m ~/ai-models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
--alias Qwen-3.6-MoE-Coder-Pro \
-ngl 99 -c 131072 -np 1 -b 4096 -ub 512 \
--flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 \
--tensor-split 10,16 --context-shift --keep -1 \
--jinja --reasoning on \
--spec-type ngram-mod --spec-ngram-size-n 48 --draft-min 16 --draft-max 64 \
--temp 0.8 --top-k 20 --top-p 0.95 --min-p 0.1 \
--repeat-penalty 1.05 --presence-penalty 0.0 --repeat-last-n 256 \
--port 8081 --api-key sk-500'
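Both aliases expose llama-server's OpenAI-compatible API, so once one is running you can query it like this (port and key taken from the commands above):
# Query the running server; llama-server accepts the API key as a Bearer token.
curl -s http://localhost:8081/v1/chat/completions \
  -H "Authorization: Bearer sk-500" \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen-3.6-27B-Coder-Pro", "messages": [{"role": "user", "content": "hello"}]}'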
Ashmadia@reddit
I run a 3080 and a 5080. Sure, there's better, but it's been pretty solid.
jacek2023@reddit
Probably a 5070.
__novalis@reddit
I also have the 4070 Super and I chose to add a 5090. So far that seems to have been the right choice.