Dual RTX 5060 Ti (32GB pooled VRAM) vs Single RTX 5070 Ti (16GB): Real-world LLM benchmarks on Blackwell
Posted by SMTPA@reddit | LocalLLaMA | 38 comments
I am the obsessive sort, and lately my obsession is ML/AI, particularly local LLMs and generative AI, for privacy reasons. (I'm a lawyer. I want to use AI for my work, but I will not upload unfiled patent disclosures to the cloud.) Long, aggravating story short, I built two Blackwell-based AI inference systems and ran some basic benchmarks when I first got both of them working. Here's what I learned about VRAM pooling with dual consumer GPUs.
TL;DR
Dual RTX 5060 Ti setups offer better cost-per-GB ($82/GB vs $126/GB) and can run models that physically won’t fit on 16GB cards. The 1B model weirdness aside, performance is competitive, and the VRAM headroom is great for the price.
The Builds
5060ai (Dual GPU) - ~$2,600 total
∙ 2x RTX 5060 Ti 16GB = 32GB pooled VRAM
∙ Gigabyte X870E AORUS ELITE (dual PCIe slots on separate buses)
∙ Ryzen 7 7700X, 64GB DDR5-6000
∙ Ubuntu Server 24.04 headless
5070ai (Single GPU) - ~$2,000 total
∙ 1x RTX 5070 Ti 16GB
∙ MSI B850M MAG MORTAR (standard mATX)
∙ Ryzen 5 7600, 32GB DDR5-6000
∙ Pop!_OS 24.04
Both running llama.cpp with NVIDIA driver 570.211 (open-source variant required for Blackwell).
Here’s what I got for my first few runs:
|Model |VRAM Used |5060ai (Dual) Prompt/Gen (t/s)|5070ai (Single) Prompt/Gen (t/s)|Winner |
|-------------------|-----------|------------------------------|--------------------------------|-------------|
|Llama 3.2 1B |~7GB |610-1051 / 330-481 |2.1 / 2.5 |Dual (500x!) |
|Llama 3.2 3B |~18GB |1051.9 / 165.0 |1055.6 / 283.6 |Tie |
|Llama 3 8B |~6GB |452.0 / 81.9 |456.1 / 149.6 |Single |
|**Qwen 2.5 14B Q5**|**~16.2GB**|**6.0 / 38.6** |**OUT OF MEMORY** |**Dual only**|
For Qwen 2.5 14B Q5 Dual GPU Test:
GPU 0: 8,267 MiB (4,628 model + 3,200 context + 439 compute)
GPU 1: 8,296 MiB (4,876 model + 2,944 context + 475 compute)
Total: 16,563 MiB used, 15,261 MiB free
My Takeaways:
- 1B model did something weird: The 500x performance difference on Llama 3.2 1B is bizarre but consistent. Possibly a driver/scheduler issue with small models on a single GPU?
- VRAM pooling works! llama.cpp's --tensor-split 1,1 distributed the Qwen 14B model very well (launch command sketched after this list):
∙ GPU0: 8.3GB (4.6GB model + 3.2GB context)
∙ GPU1: 8.3GB (4.9GB model + 2.9GB context)
∙ Total: 16.6GB used, 15.4GB free
- The Headroom Is Nice
After loading Llama 3 8B:
∙ Single 5070 Ti: 5.7GB used = only 10.3GB free (ComfyUI + Ollama couldn’t load 8B afterward)
∙ Dual 5060 Ti: 6.0GB used = 26GB free (room for multiple workflows)
- Cost per GB
∙ Dual 5060 Ti: $858 in GPUs / 32GB ~ $27/GB
∙ Single 5070 Ti: $749 GPU / 16GB ~ $47/GB
∙ System cost per GB: ~$82 vs ~$126
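For reference, the dual-GPU runs were launched along these lines. The model path and context size below are placeholders, but the flags match the benchmark settings listed further down:

```bash
# Minimal dual-GPU launch sketch (model path and context size are placeholders)
./llama-server \
  -m ./models/your-model-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --tensor-split 1,1 \
  -c 8192
```

--tensor-split 1,1 just tells llama.cpp to spread the weights evenly across both cards; an uneven ratio (e.g. 2,1) is the knob to turn if one card or slot is slower.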
Motherboards
I did not want to spend another $500 on the next tech step up for a mobo, so there was a lot of cursing, experimenting, and workaround-finding. The X870E AORUS ELITE I got open box at MicroCenter has slots on separate buses (slots 1 and 3). This is important - I tried three other boards first and they just could not cut it, and this was the major difference. Many less expensive boards have the M.2 slots sharing lanes with the PCIe slots, and the manuals are not always clear about exactly which configurations disable what.
Does Dual Make Sense?
I think it does for me in these cases:
∙ Running models >12GB
∙ Multi-tasking (LLM + image gen + TTS)
∙ Future-proofing for 20-30GB models
∙ Cost-conscious (better $/GB)
I'd use the single 5070 Ti if:
∙ Mainly running 7B-8B models
∙ Single-task workflows
∙ Smaller budget ($618 less upfront)
∙ Want slightly better single-model performance
Blackwell Gotchas
∙ Requires NVIDIA driver 570+ (open-source kernel module variant only). You will almost certainly have driver headaches; it is very touchy, but it seems stable once operational.
∙ PyTorch stable doesn't support sm_120 - I learned that after banging my head on it for a while. PyTorch stable 2.5.1 throws an "sm_120 not compatible" error, so use the nightly builds (install sketch after this list). I may, if my supply of misery runs low and I need to restock, try building the latest one from source with the right drivers.
∙ llama.cpp needs the sm_89 compile target (Blackwell runs it via PTX forward compatibility)
∙ CUDA 12.4 from conda will not work. I had to use 12.8.
∙ The proprietary nvidia-driver-570 package did not work for me - use the open-source variant
∙ The RTL8125 Ethernet port on this board needs a manual driver install on Ubuntu - it wanted to use r8169, and no.
∙ Fast Boot and Secure Boot will almost certainly need to be disabled in BIOS. Some boards just will not complete setup with both GPUs active; depower one and then you can get into BIOS and try changing things.
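For anyone hitting the same walls, here is a rough sketch of the commands that correspond to the points above. Package names and the wheel index URL are from memory, so treat them as a starting point and adjust for your distro and CUDA version:

```bash
# Open-kernel-module driver variant (the proprietary build did not work for me;
# exact package name varies by distro/PPA)
sudo apt install nvidia-driver-570-open

# PyTorch nightly built against CUDA 12.8 - stable 2.5.1 rejects sm_120
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128

# Build llama.cpp for sm_89; Blackwell picks it up via PTX forward compatibility
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build --config Release -j
```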
Benchmark Details
All tests used llama.cpp with identical prompts and parameters:
∙ --n-gpu-layers 99 (full GPU offload)
∙ --tensor-split 1,1 (dual GPU only)
∙ Models: Q4_K_M quantization except where noted
Dual-GPU VRAM distribution verified via nvidia-smi and nvtop.
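To reproduce the per-GPU breakdown, a plain nvidia-smi query while the model is loaded is enough (nvtop shows the same thing live):

```bash
# Snapshot of per-GPU memory while the model is loaded
nvidia-smi --query-gpu=index,name,memory.used,memory.free --format=csv

# Or watch it update live
watch -n 1 nvidia-smi
```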
ResponsibleTruck4717@reddit
May I ask why open source drivers?
I'm using 5060ti 16gb and 4060 8gb together.
The only issue I had was the graphics card dropping to P1 or P2 or whatever power state, which led to very unstable environments (both Windows and Ubuntu) - huge headache.
I don't know what caused this, but in the end connecting a monitor to each GPU solved it.
Besides that I had no driver issues; I'm using official drivers.
And I just launch llama-server -m model_name -c context_size and call it a day.
Does any of this sound familiar to you?
siege72a@reddit
Have you noticed performance changes using both cards, versus just the 5060ti?
ResponsibleTruck4717@reddit
The 5060ti is faster and has double the RAM, which lets me load bigger models than the 4060, and using both lets me load even bigger models.
I can run a 27b q4_k_m with 64k context (at q8) without much problem.
siege72a@reddit
Thank you!
SMTPA@reddit (OP)
As I understand it, Windows does not actually allow for true VRAM pooling. So on that front, a *nix of some kind is the only thing that will do it, AFAIK.
Are you seeing true pooling with those disparate cards and your stock drivers? I used the open-source drivers because they support the broadest range of Blackwell features with the best reported stability. I have not found many people with a dual-Blackwell setup yet, and the reports are scattered. I posted this after fooling with it for days because I can at least be a data point for “it IS possible and here’s a configuration that works.”
Xp_12@reddit
Is it actually true VRAM pooling though? Can you run a singular stable diffusion model split across two GPUs this way?
SMTPA@reddit (OP)
Yep. That’s tonight’s adventure!
Xp_12@reddit
Yes, as in you've already done it? because I'm pretty damn sure you can't. lmao. send pics if it works out.
Grouncher@reddit
5060 Ti doesn't support NVLink to allow "merging" two into one.
LLMs use transformers, which run layer by layer in sequence, so only a tiny intermediate result has to be passed across cards; that ends in a single token, and then the process is repeated.
SD by and large uses all layers at once, re-iterating on the whole image multiple times, which makes splitting it across two cards unreasonable.
Some modern SD models do use transformers, which means you can split them.
You can also split the text model from the image model to use both cards, if the whole setup won't quite fit on a single card.
Xp_12@reddit
I'm familiar. I haven't seen a single person split a diffusion model without NVLink, which is why I brought it up. Thanks for posting additional information for others.
Realistic-Science-87@reddit
Hi, did you run Qwen 32B on your dual 5060ti setup?
xenw10@reddit
I'm successfully running qwen 3.5 35b a3b on my dual 5060ti 16gb with sglang and getting 50 tok/sec.
Realistic-Science-87@reddit
Interesting. Thinking of purchasing dual 5060ti
thisoldhack@reddit
I had one 5060ti with 16 GB and the models you can load on it are just too small. I bought a second 5060ti, and with the combined 32 GB I'm able to run qwen3.6:35b-a3b-q4_K_M with a 64k context window and it works really well. Seems like good price/performance, given that any single card with 32 GB of VRAM is 3x the price of two of these, especially if you already have one.
Realistic-Science-87@reddit
Interesting. Have you tried nemotron-cascade-2? It gives very good performance if you need high tps. It also doesn't use much memory for context.
thisoldhack@reddit
I'll give it a try - thanks for the suggestion.
souna06@reddit
The cost-per-GB framing is useful. Did you also look at tokens/sec per dollar, or was VRAM headroom the main thing?
For the Qwen 14B that OOM'd on the single card — did you know ahead of time it wouldn't fit, or was that a surprise when you actually tried loading it?
Able_Zombie_7859@reddit
It's kinda impressive how hard you made these benchmarks to understand.
Xp_12@reddit
and i just compiled llama.cpp with 120 earlier today. although, I do agree that it was kind of a headache.
SMTPA@reddit (OP)
Which cards are you using, and which CUDA/Nvidia drivers? I could not get it to work with 120, but I am quite willing to admit I could have missed something.
Xp_12@reddit
I used cuda 13.1 latest on everything.
xenw10@reddit
YOU DON'T NEED PCIE 5 X8/X8. THERE IS A MISCONCEPTION ABOUT PCIE BANDWIDTH FOR INFERENCE. IT WON'T BE NEEDED UNLESS YOU USE TENSOR PARALLEL IN VLLM.
Xp_12@reddit
yes... and I'd like to use my dual GPUs for exactly that, as well as have a higher prefill speed since I'm currently limited to x1 on the second. there really isn't any need to yell.
SMTPA@reddit (OP)
Oooh, you’re right. Unfortunately that one was out of stock at the time, or I would have definitely considered it. What’s your OS?
Xp_12@reddit
This time it was for Windows. Compiled llama-omni for minicpm 4.5 with their docker image. You should be fine on Linux as long as you have the proper toolkit and driver versions with the correct cmake parameters. llama.cpp on both Windows and Linux seems to hate sm_120 though 😂
sourceholder@reddit
LLMs are really good at creating tables.
Copy/paste output of first table:
OGNukem@reddit
Could you, with a riser cable, install all three cards?
SMTPA@reddit (OP)
On some mobos, yes. On this one, the remaining PCIe slot's limitations would probably make it a net loss over just getting another 5070.
Background-Ad-5398@reddit
You should run an IQ3_XS of a 70b model and try some quants of 24b, 32b, 49b. I already know a 14b model will run; that's not much of a benchmark.
SMTPA@reddit (OP)
I picked the smallest one with a VRAM requirement of just over 16GB so I could show the dual 5060 Ti was truly pooling. You're right, and this is the plan. :)
Euphoric_Emotion5397@reddit
I'm using LM Studio and my setup is an RTX 5080 16GB and an RTX 5060 Ti 16GB.
With Qwen 3 VL 30B Q4 using CUDA 12, I am getting 72.8 tokens/sec and time to first token is <0.5 seconds.
Blindax@reddit
You can even run much bigger models on the 32gb machine. Using 5070ti and 5060ti I can run OSS 120b with 4k ctx at good speeds.
Charming_Support726@reddit
I think you've got a very good idea, but your post is hard to read and the results are hard to understand.
Maybe a reformat and a small refinement of the text could improve it considerably.
SMTPA@reddit (OP)
Sorry. How’s that?
Mountain_Patience231@reddit
I am using the same mobo as yours, with 2x 9070xt. Since your two slots run at different speeds (x16/x4), did you try tuning the tensor split to a different ratio instead of 1,1?
SMTPA@reddit (OP)
I have not tried that. What improvements have you noticed with it?
Mountain_Patience231@reddit
I get much faster speeds by loading more onto the slower card when running MoE models.
SMTPA@reddit (OP)
Which cards are you using, and which CUDA packages? I could not get it to work with 120, but I am not an expert in that sort of thing. Quite possible I just didn’t think of something.