Added an old 2070 Super to my rig and I can't go back...worse, now I need more

Posted by PferdOne@reddit | LocalLLaMA | View on Reddit | 23 comments

Context: I built a new system last November, before everything went to shit. I spent about 5k on a 5090, a 9800X3D and 96GB of RAM. For the last 2-3 months I've been working heavily on my local setup. Ditched Windows, went Ubuntu > Manjaro > CachyOS (now), and I'm basically building llama.cpp every day, running tests to find optimal model quantizations, context sizes, the best agent CLI + harness, etc. Most of you know the drill.

Now: I finally got around to taking my old PC apart. I saw the 2070, dusted it off and put it in my new PC (just out of curiosity). LET ME TELL YOU: I was not ready for what 8GB of additional VRAM does to a mf. I can suddenly run Qwen3.6-27B at Q8_0 with a context of 160k (KV cache at q8_0 as well) and with MTP, and I still generate 40-70 tk/s. It's addicting! Now I'm looking at online offers for 5070 Tis and 3090s (because they're in the same ballpark price-wise). I mean, it's going to be the 3090 eventually, because I can't just pass on an extra 8GB of VRAM, but again, I wasn't ready for this. Even a 2070 Super brings so much value if you have one lying around.

This experience was eye-opening: acceptable performance + bigger VRAM > amazing performance + smaller VRAM
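
A rough back-of-envelope for why the extra 8GB matters, as a sketch rather than a measurement. Q8_0 in GGUF stores blocks of 32 int8 weights plus one fp16 scale, so 34 bytes per 32 weights (8.5 bits per weight). The 27e9 parameter count is an assumption read off the model name, the 32GB and 8GB card sizes come from the post, and the function names are made up for illustration. GB is treated as 1e9 bytes, real GGUF files keep some tensors in other types, and KV cache, MTP layers and compute buffers are not modelled.

```python
# Q8_0 block layout: 32 int8 weights + one fp16 scale = 34 bytes per 32 weights.
Q8_0_BYTES_PER_BLOCK = 34
Q8_0_WEIGHTS_PER_BLOCK = 32


def q8_0_weight_bytes(n_params: float) -> float:
    """Approximate size of the weights alone if every tensor were Q8_0."""
    return n_params * Q8_0_BYTES_PER_BLOCK / Q8_0_WEIGHTS_PER_BLOCK


def headroom_gb(total_vram_gb: float, n_params: float) -> float:
    """VRAM left (in GB, 1e9 bytes) for KV cache and buffers after the weights."""
    return total_vram_gb - q8_0_weight_bytes(n_params) / 1e9


if __name__ == "__main__":
    n_params = 27e9  # assumption: "27B" taken literally from the model name
    for label, vram in (("5090 only", 32.0), ("5090 + 2070 Super", 40.0)):
        print(f"{label}: {headroom_gb(vram, n_params):.1f} GB left after weights")
```

Under these assumptions the single card has only a few GB left once the weights are loaded, while the pair has roughly 11 GB, which is where a long quantized KV cache can go.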

[-]

kanduking@reddit

I keep an extra A4000 in there to run the OS, all UI threads (this is actually critical under load), games, and a couple of 5K 165Hz monitors.

Having a low/mid GB secondary GPU is absolutely worth it

[-]

PferdOne@reddit (OP)

Definitely. I started routing everything through my iGPU to keep my dGPUs free, because I'm not quite ready to go headless yet ^^

[-]

Maleficent-Ad5999@reddit

Owww.. I’m in the same boat.. 5090, 9950X, 64GB RAM.. ordered a Chinese modded GPU and can’t wait for it to arrive.. my wallet is cursing me but I’m excited, and if things work out I’ll probably add another one

[-]

PferdOne@reddit (OP)

You're gonna love it! I gotta be honest, even if I wanted to add another 5090, I just don't have the room for it in my case.

[-]

Maleficent-Ad5999@reddit

Yeah ikr.. saw the other comment and your project is pretty cool. But just curious, why didn’t you go with vLLM?

The tensor parallelism in vLLM is kinda mature and better than llama.cpp's.. at least that’s what I think.. maybe I’m wrong.

[-]

PferdOne@reddit (OP)

I got so comfortable with llama.cpp that I tried it there first. But I will probably spin up vLLM when I put in the 3090 next.

[-]

CreamPitiful4295@reddit

I recently upgraded to a 5090. Now you have me interested in putting the 3090 in.

[-]

PferdOne@reddit (OP)

This is what I'm going to do when I come back from vacation. Glorious 56GB of VRAM °o°

[-]

punky-beansnrice@reddit

vram beats raw speed is the lesson everyone learns the hard way. 8gb extra unlocks entirely different model classes. 3090 still the value-per-vram king at the consumer tier, even at used-market 2026 prices. once you taste 24gb you can never go back.

[-]

migsperez@reddit

Show us some of the commands you use, please. I'm trying to squeeze the most out of my new 32GB card and struggling. What you're saying sounds like magic.

[-]

PferdOne@reddit (OP)

./llama.cpp/build/bin/llama-server \
-hf unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0 \
--fit on \
--flash-attn on \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--batch-size 2048 \
--ubatch-size 512 \
--n-gpu-layers all \
--parallel 1 \
--threads 7 \
--reasoning on \
--temp 0.6 \
--top-k 20 \
--top-p 0.95 \
--min-p 0.0 \
--presence-penalty 0.0 \
--repeat-penalty 1.0 \
--chat-template-kwargs '{"preserve_thinking":true}' \
--main-gpu 0 \
--split-mode layer \
-c 147456 \
--tensor-split 32,4 \
--spec-type draft-mtp \
--spec-draft-n-max 3 \
--spec-draft-type-k q8_0 \
--spec-draft-type-v q8_0 \
-n 32768 \
--tools all
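
To make `--tensor-split 32,4` concrete: the values are relative proportions, so this asks for roughly 89% of the model on device 0 and 11% on device 1. The sketch below only does that arithmetic. `split_fractions` and `approx_layers` are illustrative helpers, the 64-layer count is a placeholder rather than the model's real depth, and llama.cpp's actual layer assignment may round differently.

```python
def split_fractions(weights: list[float]) -> list[float]:
    """Normalize --tensor-split style proportions so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("tensor-split proportions must sum to a positive value")
    return [w / total for w in weights]


def approx_layers(weights: list[float], n_layers: int) -> list[int]:
    """Rough per-device layer counts; the last device absorbs rounding leftovers."""
    fractions = split_fractions(weights)
    counts = [int(f * n_layers) for f in fractions[:-1]]
    counts.append(n_layers - sum(counts))
    return counts


if __name__ == "__main__":
    split = [32, 4]   # from the command above
    n_layers = 64     # placeholder, not the real layer count of the model
    print([round(f, 3) for f in split_fractions(split)])
    print(approx_layers(split, n_layers))
    print(147456 // 1024, "x 1024 tokens of context")  # -c 147456
```

Note that the split gives the 8GB card less than its raw share of total VRAM (4 out of 36 rather than 8 out of 40), which leaves room on it for whatever else lives there.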

[-]

migsperez@reddit

A few parameters I've not used before, thanks.

[-]

BlackBeardAI@reddit

I have been posting benchmarks that I run on my own nodes in this repo, check it out if you want to see what you'll likely get with what model/hardware combination.

https://github.com/blackbeardlabs/blackbeard-homelab/tree/main/benchmarks

[-]

migsperez@reddit

I'll be taking inspiration from it for sure. Thanks.

[-]

jacek2023@reddit

My 2070 was my first GPU for AI. I won a Kaggle gold medal thanks to it. Later, I bought a 3090, and then I tried using both the 3090 and 2070 together with llama.cpp. It worked, so later I bought some 3060s and more 3090s. 😄

[-]

MrShrek69@reddit

Is it a pain to get different cards to play nice with each other?

[-]

jacek2023@reddit

I connected them to the motherboard, recompiled llama.cpp (needed because the 2070 is very old), and then llama.cpp detected them.

[-]

Fortunato_NC@reddit

Only if you think exporting an environment variable is a pain.

[-]

k-u-got-me@reddit

Yeah, this works if you already have a card lying around, but for those thinking about getting another card, please think hard about whether you truly need the extra performance or are just chasing numbers. For those who actually use these models, the time spent optimizing often far outweighs the time saved. Kudos to you though. Just a simple question: what do you actually use the model for?

[-]

PferdOne@reddit (OP)

I'm a SWE and I have a personal project right now where I use my agent to work on a language tutoring app. The user flow goes like this: scan a worksheet from class, create a lesson and have a model act as a tutor. Since I wanna get better at speaking, there's a little STT -> model (takes in what I say, formulates a response) -> TTS pipeline going on. The main goal is something like private classes, locally.
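
The STT -> model -> TTS flow described above could be wired roughly like this. This is only a shape sketch under loud assumptions: `tutor_turn` and the three callables are hypothetical names, and the stand-in stages below just shuffle text around so the example runs without any speech or language models installed. It says nothing about how the actual app is built.

```python
from typing import Callable


def tutor_turn(
    audio: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> bytes:
    """One spoken exchange: transcribe, generate a reply, synthesize speech."""
    heard = stt(audio)   # speech-to-text
    reply = llm(heard)   # tutor model formulates a response
    return tts(reply)    # text-to-speech


if __name__ == "__main__":
    # Stand-in stages (assumptions, not real models).
    fake_stt = lambda audio: audio.decode("utf-8")
    fake_llm = lambda text: f"Tutor reply to: {text}"
    fake_tts = lambda text: text.encode("utf-8")
    print(tutor_turn(b"hola", fake_stt, fake_llm, fake_tts))
```

Keeping each stage behind a plain callable makes it easy to swap any one of them without touching the other two.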

[-]

k-u-got-me@reddit

Aah, sounds interesting ngl. I built something similar for myself, but found it harder to learn with, as the AI would keep repeating the same thing in different words when I was looking for depth or even a simple change of perspective. I feel like most models lack the adaptability required to be tutors. They're very rigid in their ways; self-learning feels a lot smoother. Would love to try what you're building and see if it scratches my itch though. Thanks for sharing!

[-]

MrShrek69@reddit

For the last few years I’ve been focusing on quantity of VRAM over fast inference, so long as the speed feels acceptable for your taste (everyone has a different idea of what’s acceptable). That's why I picked up a Strix Halo machine. The sheer quantity of VRAM lets me experiment with all sorts of workflows and multi-model setups.

[-]

PferdOne@reddit (OP)

To me 50 tk/s is acceptable. Prompt processing (pp) is still around 2000-3000 tk/s, depending on context size.
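
For a feel of what those rates mean per request, here is a small estimate that just adds prompt-processing time and generation time. The rates are the ones quoted above; the prompt and output sizes are illustrative and not from the thread, `turn_seconds` is a made-up helper, and prompt caching (which would cut the first term on follow-up turns) is ignored.

```python
def turn_seconds(prompt_tokens: int, output_tokens: int,
                 pp_tps: float, tg_tps: float) -> float:
    """Estimated wall time: prompt processing plus token generation."""
    return prompt_tokens / pp_tps + output_tokens / tg_tps


if __name__ == "__main__":
    prompt, output = 100_000, 1_000  # illustrative sizes, not from the thread
    for pp in (2000.0, 3000.0):
        secs = turn_seconds(prompt, output, pp_tps=pp, tg_tps=50.0)
        print(f"pp={pp:.0f} tk/s: {secs:.0f} s")
```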