Dual RTX 5060 Ti (32GB pooled VRAM) vs Single RTX 5070 Ti (16GB): Real-world LLM benchmarks on Blackwell
Posted by SMTPA@reddit | LocalLLaMA | 38 comments
I am the obsessive sort, and lately my obsession is ML/AI, particularly local LLMs and generative AI, for privacy reasons. (I'm a lawyer. I want to use AI for my work, but I will not upload unfiled patent disclosures to the cloud.) Long, aggravating story short, I built two Blackwell-based AI inference systems and ran some basic benchmarks when I first got both of them working. Here's what I learned about VRAM pooling with dual consumer GPUs.
TL;DR
Dual RTX 5060 Ti setups offer better cost-per-GB ($82/GB vs $126/GB) and can run models that physically won’t fit on 16GB cards. The 1B model weirdness aside, performance is competitive, and the VRAM headroom is great for the price.
The Builds
5060ai (Dual GPU) - ~$2,600 total
∙ 2x RTX 5060 Ti 16GB = 32GB pooled VRAM
∙ Gigabyte X870E AORUS ELITE (dual PCIe slots on separate buses)
∙ Ryzen 7 7700X, 64GB DDR5-6000
∙ Ubuntu Server 24.04 headless
5070ai (Single GPU) - ~$2,000 total
∙ 1x RTX 5070 Ti 16GB
∙ MSI B850M MAG MORTAR (standard mATX)
∙ Ryzen 5 7600, 32GB DDR5-6000
∙ Pop!_OS 24.04
Both running llama.cpp with NVIDIA driver 570.211 (open-source variant required for Blackwell).
Here’s what I got for my first few runs:
|Model |VRAM Used |5060ai (Dual) Prompt/Gen (t/s)|5070ai (Single) Prompt/Gen (t/s)|Winner |
|-------------------|-----------|------------------------------|--------------------------------|-------------|
|Llama 3.2 1B |~7GB |610-1051 / 330-481 |2.1 / 2.5 |Dual (500x!) |
|Llama 3.2 3B |~18GB |1051.9 / 165.0 |1055.6 / 283.6 |Tie |
|Llama 3 8B |~6GB |452.0 / 81.9 |456.1 / 149.6 |Single |
|**Qwen 2.5 14B Q5**|**~16.2GB**|**6.0 / 38.6** |**OUT OF MEMORY** |**Dual only**|
For Qwen 2.5 14B Q5 Dual GPU Test:
GPU 0: 8,267 MiB (4,628 model + 3,200 context + 439 compute)
GPU 1: 8,296 MiB (4,876 model + 2,944 context + 475 compute)
Total: 16,563 MiB used, 15,261 MiB free
My Takeaways:
- 1B model did something weird: The 500x performance difference on Llama 3.2 1B is bizarre but consistent. Possibly a driver/scheduler issue with small models on a single GPU?
- VRAM pooling works! llama.cpp's --tensor-split 1,1 distributed the Qwen 14B model very well (launch command sketched after this list):
∙ GPU0: 8.3GB (4.6GB model + 3.2GB context)
∙ GPU1: 8.3GB (4.9GB model + 2.9GB context)
∙ Total: 16.6GB used, 15.4GB free
- The Headroom Is Nice
After loading Llama 3 8B:
∙ Single 5070 Ti: 5.7GB used = only 10.3GB free (ComfyUI + Ollama couldn’t load 8B afterward)
∙ Dual 5060 Ti: 6.0GB used = 26GB free (room for multiple workflows)
- Cost per GB
∙ Dual 5060 Ti: $858 in GPUs / 32GB ~ $27/GB
∙ Single 5070 Ti: $749 GPU / 16GB ~ $47/GB
∙ System cost per GB: ~$82 vs ~$126
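For reference, the dual-GPU runs were launched along these lines. The model path and context size below are placeholders, but the flags match the benchmark settings listed further down:

```bash
# Minimal dual-GPU launch sketch (model path and context size are placeholders)
./llama-server \
  -m ./models/your-model-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --tensor-split 1,1 \
  -c 8192
```

--tensor-split 1,1 just tells llama.cpp to spread the weights evenly across both cards; an uneven ratio (e.g. 2,1) is the knob to turn if one card or slot is slower.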
Motherboards
I did not want to spend another $500 on the next tech step up for a mobo, so there was a lot of cursing, experimenting, and workaround-finding. The X870E AORUS ELITE I got open box at MicroCenter has slots on separate buses (slots 1 and 3). This is important - I tried three other boards first and they just could not cut it, and this was the major difference. Many less expensive boards have the M.2 slots sharing lanes with the PCIe slots, and the manuals are not always clear about exactly which configurations disable what.
Does Dual Make Sense?
I think it does for me in these cases:
∙ Running models >12GB
∙ Multi-tasking (LLM + image gen + TTS)
∙ Future-proofing for 20-30GB models
∙ Cost-conscious (better $/GB)
I'd use the single 5070 Ti if:
∙ Mainly running 7B-8B models
∙ Single-task workflows
∙ Smaller budget ($618 less upfront)
∙ Want slightly better single-model performance
Blackwell Gotchas
∙ Requires NVIDIA driver 570+ (open-source kernel module variant only). You will almost certainly have driver headaches; it is very touchy, but it seems stable once operational.
∙ PyTorch stable doesn't support sm_120 - I learned that after banging my head on it for a while. PyTorch stable 2.5.1 throws an "sm_120 not compatible" error, so use the nightly builds (install sketch after this list). I may, if my supply of misery runs low and I need to restock, try building the latest one from source with the right drivers.
∙ llama.cpp needs the sm_89 compile target (Blackwell runs it via PTX forward compatibility)
∙ CUDA 12.4 from conda will not work. I had to use 12.8.
∙ The proprietary nvidia-driver-570 package did not work for me - use the open-source variant
∙ The RTL8125 Ethernet port on this board needs a manual driver install on Ubuntu - it wanted to use r8169, and no.
∙ Fast Boot and Secure Boot will almost certainly need to be disabled in BIOS. Some boards just will not complete setup with both GPUs active; depower one and then you can get into BIOS and try changing things.
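For anyone hitting the same walls, here is a rough sketch of the commands that correspond to the points above. Package names and the wheel index URL are from memory, so treat them as a starting point and adjust for your distro and CUDA version:

```bash
# Open-kernel-module driver variant (the proprietary build did not work for me;
# exact package name varies by distro/PPA)
sudo apt install nvidia-driver-570-open

# PyTorch nightly built against CUDA 12.8 - stable 2.5.1 rejects sm_120
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128

# Build llama.cpp for sm_89; Blackwell picks it up via PTX forward compatibility
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build --config Release -j
```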
Benchmark Details
All tests used llama.cpp with identical prompts and parameters:
∙ --n-gpu-layers 99 (full GPU offload)
∙ --tensor-split 1,1 (dual GPU only)
∙ Models: Q4_K_M quantization except where noted
Dual-GPU VRAM distribution verified via nvidia-smi and nvtop.
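To reproduce the per-GPU breakdown, a plain nvidia-smi query while the model is loaded is enough (nvtop shows the same thing live):

```bash
# Snapshot of per-GPU memory while the model is loaded
nvidia-smi --query-gpu=index,name,memory.used,memory.free --format=csv

# Or watch it update live
watch -n 1 nvidia-smi
```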
ResponsibleTruck4717@reddit
May I ask why open source drivers?
I'm using 5060ti 16gb and 4060 8gb together.
The only issue I had was the graphics card dropping to P1 or P2 or whatever power state, which led to very unstable environments (both Windows and Ubuntu) - huge headache.
I don't know what caused this, but in the end connecting a monitor to each GPU solved it.
Besides that I had no driver issues; I'm using official drivers.
And I just launch llama-server -m model_name -c context_size and call it a day.
Does any of this sound familiar to you?
siege72a@reddit
Have you noticed performance changes using both cards, versus just the 5060ti?
ResponsibleTruck4717@reddit
The 5060ti is faster and has double the RAM, which lets me load bigger models than the 4060, and using both lets me load even bigger models.
I can run a 27b q4_k_m with 64k context (at q8) without much problem.
siege72a@reddit
Thank you!
SMTPA@reddit (OP)
As I understand it, Windows does not actually allow for true VRAM pooling. So on that front, a *nix of some kind is the only thing that will do it, AFAIK.
Are you seeing true pooling with those disparate cards and your stock drivers? I used the open-source drivers because they support the broadest range of Blackwell features with the best reported stability. I have not found many people with a dual-Blackwell setup yet, and the reports are scattered. I posted this after fooling with it for days because I can at least be a data point for “it IS possible and here’s a configuration that works.”
Xp_12@reddit
Is it actually true VRAM pooling though? Can you run a singular stable diffusion model split across two GPUs this way?
SMTPA@reddit (OP)
Yep. That’s tonight’s adventure!
Xp_12@reddit
Yes, as in you've already done it? because I'm pretty damn sure you can't. lmao. send pics if it works out.
Grouncher@reddit
5060 Ti doesn't support NVLink to allow "merging" two into one.
LLMs use transformers, which run layer by layer in sequence, so only a tiny intermediate result has to be passed across cards; that ends in a single token, and then the process is repeated.
SD by and large uses all layers at once, re-iterating on the whole image multiple times, which makes splitting it across two cards unreasonable.
Some modern SD models do use transformers, which means you can split them.
You can also split the text model from the image model to use both cards, if the whole setup won't quite fit on a single card.
Xp_12@reddit
I'm familiar. I haven't seen a single person split a diffusion model without NVLink, which is why I brought it up. Thanks for posting additional information for others.
Realistic-Science-87@reddit
Hi, did you run Qwen 32B on your dual 5060ti setup?
xenw10@reddit
I'm successfully running qwen 3.5 35b a3b on my dual 5060ti 16gb with sglang and getting 50 tok/sec.
Realistic-Science-87@reddit
Interesting. Thinking of purchasing dual 5060ti
thisoldhack@reddit
I had one 5060ti with 16 GB and the models you can load on it are just too small. I bought a second 5060ti, and with the combined 32 GB I'm able to run qwen3.6:35b-a3b-q4_K_M with a 64k context window and it works really well. Seems like good price/performance, given that any single card with 32 GB of VRAM is 3x the price of two of these, especially if you already have one.
Realistic-Science-87@reddit
Interesting. Have you tried nemotron-cascade-2? It gives very good performance if you need high tps. It also doesn't use much memory for context.
thisoldhack@reddit
I'll give it a try - thanks for the suggestion.
souna06@reddit
The cost-per-GB framing is useful. Did you also look at tokens/sec per dollar, or was VRAM headroom the main thing?
For the Qwen 14B that OOM'd on the single card — did you know ahead of time it wouldn't fit, or was that a surprise when you actually tried loading it?
Able_Zombie_7859@reddit
It's kinda impressive how hard you made these benchmarks to understand.
Xp_12@reddit
and i just compiled llama.cpp with 120 earlier today. although, I do agree that it was kind of a headache.
SMTPA@reddit (OP)
Which cards are you using, and which CUDA/Nvidia drivers? I could not get it to work with 120, but I am quite willing to admit I could have missed something.
Xp_12@reddit
I used cuda 13.1 latest on everything.
xenw10@reddit
YOU DON'T NEED PCIE 5 X8/X8. THERE IS A MISCONCEPTION ABOUT PCIE BANDWIDTH FOR INFERENCE. IT WON'T BE NEEDED UNLESS YOU USE TENSOR PARALLEL IN VLLM.
Xp_12@reddit
yes... and I'd like to use my dual GPUs for exactly that, as well as have a higher prefill speed since I'm currently limited to x1 on the second. there really isn't any need to yell.
SMTPA@reddit (OP)
Oooh, you’re right. Unfortunately that one was out of stock at the time, or I would have definitely considered it. What’s your OS?
Xp_12@reddit
This time it was for Windows. Compiled llama-omni for minicpm 4.5 with their docker image. You should be fine on Linux as long as you have the proper toolkit and driver versions with the correct cmake parameters. llama.cpp on both Windows and Linux seems to hate sm_120 though 😂
sourceholder@reddit
LLMs are really good at creating tables.
Copy/paste output of first table:
OGNukem@reddit
Could you, with a riser cable, install all three cards?
SMTPA@reddit (OP)
On some mobos, yes. On this one, the remaining PCIe slot's limitations would probably make it a net loss over just getting another 5070.
Background-Ad-5398@reddit
You should run an IQ3_XS of a 70b model and try some quants of 24b, 32b, 49b. I already know a 14b model will run; that's not much of a benchmark.
SMTPA@reddit (OP)
I picked the smallest one with a VRAM requirement of just over 16GB so I could show the dual 5060 Ti was truly pooling. You're right, and this is the plan. :)
Euphoric_Emotion5397@reddit
I'm using LM Studio and my setup is an RTX 5080 16GB and an RTX 5060 Ti 16GB.
With Qwen 3 VL 30B Q4 using CUDA 12, I am getting 72.8 tokens/sec and time to first token is <0.5 seconds.
Blindax@reddit
You can even run much bigger models on the 32gb machine. Using 5070ti and 5060ti I can run OSS 120b with 4k ctx at good speeds.
Charming_Support726@reddit
I think you've got a very good idea, but your post is hard to read and the results are hard to understand.
Maybe a reformat and a small refinement of the text could improve it considerably.
SMTPA@reddit (OP)
Sorry. How’s that?
Mountain_Patience231@reddit
I am using the same mobo as yours, with 2x 9070xt. Since your two slots run at different speeds (x16/x4), did you try tuning the tensor split to a different ratio instead of 1,1?
SMTPA@reddit (OP)
I have not tried that. What improvements have you noticed with it?
Mountain_Patience231@reddit
I get much faster speeds by loading more onto the slower card when running MoE models.
SMTPA@reddit (OP)
Which cards are you using, and which CUDA packages? I could not get it to work with 120, but I am not an expert in that sort of thing. Quite possible I just didn’t think of something.