Dual RTX 5060 Ti (32GB pooled VRAM) vs Single RTX 5070 Ti (16GB): Real-world LLM benchmarks on Blackwell

Posted by SMTPA@reddit | LocalLLaMA | View on Reddit | 38 comments

I am the obsessive sort, and lately my obsession is ML/AI - particularly local LLMs and generative AI, for privacy reasons. (I’m a lawyer. I want to use AI for my work but I will not upload unfiled patent disclosures to the cloud.) Long, aggravating story short, I built two Blackwell-based AI inference systems and ran some basic benchmarks when I first got both of them working. Here’s what I learned about VRAM pooling with dual consumer GPUs.

TL;DR

Dual RTX 5060 Ti setups offer better cost-per-GB ($82/GB vs $126/GB) and can run models that physically won’t fit on 16GB cards. The 1B model weirdness aside, performance is competitive, and the VRAM headroom is great for the price.

The Builds

5060ai (Dual GPU) - ~$2,600 total

∙   2x RTX 5060 Ti 16GB = 32GB pooled VRAM

∙   Gigabyte X870E AORUS ELITE (dual PCIe slots on separate buses)

∙   Ryzen 7 7700X, 64GB DDR5-6000

∙   Ubuntu Server 24.04 headless

5070ai (Single GPU) - ~$2,000 total

∙   1x RTX 5070 Ti 16GB

∙   MSI B850M MAG MORTAR (standard mATX)

∙   Ryzen 5 7600, 32GB DDR5-6000

∙   Pop!_OS 24.04

Both running llama.cpp with NVIDIA driver 570.211 (open-source variant required for Blackwell).

Here’s what I got for my first few runs:

| Model | VRAM Used | 5060ai (Dual) Prompt / Gen (t/s) | 5070ai (Single) Prompt / Gen (t/s) | Winner |
|---|---|---|---|---|
| Llama 3.2 1B | ~7GB | 610-1051 / 330-481 | 2.1 / 2.5 | Dual (500x!) |
| Llama 3.2 3B | ~18GB | 1051.9 / 165.0 | 1055.6 / 283.6 | Tie |
| Llama 3 8B | ~6GB | 452.0 / 81.9 | 456.1 / 149.6 | Single |
| **Qwen 2.5 14B Q5** | **~16.2GB** | **6.0 / 38.6** | **OUT OF MEMORY** | **Dual only** |

For the Qwen 2.5 14B Q5 dual-GPU test:

∙   GPU 0: 8,267 MiB (4,628 model + 3,200 context + 439 compute)

∙   GPU 1: 8,296 MiB (4,876 model + 2,944 context + 475 compute)

∙   Total: 16,563 MiB used, 15,261 MiB free
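
If you want to spot-check a split like this yourself, the per-GPU totals come straight out of nvidia-smi (the model/context/compute breakdown above is from llama.cpp’s own load log, not from nvidia-smi):

```
# Per-GPU memory used/free, one row per card
nvidia-smi --query-gpu=index,name,memory.used,memory.free --format=csv
```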

My Takeaways:

  1. The 1B model did something weird: The 500x performance difference on Llama 3.2 1B is bizarre but consistent. Possibly a driver/scheduler issue with small models on a single GPU?

  2. VRAM Pooling Works!

llama.cpp’s --tensor-split 1,1 spread the Qwen 14B model evenly across both cards (a sketch of the launch command follows this list):

∙   GPU0: 8.3GB (4.6GB model + 3.2GB context)

∙   GPU1: 8.3GB (4.9GB model + 2.9GB context)

∙   Total: 16.6GB used, 15.4GB free

  3. The Headroom Is Nice

After loading Llama 3 8B:

∙   Single 5070 Ti: 5.7GB used = only 10.3GB free (ComfyUI + Ollama couldn’t load 8B afterward)

∙   Dual 5060 Ti: 6.0GB used = 26GB free (room for multiple workflows)

  4. Cost per GB

    ∙ Dual 5060 Ti: $858 in GPUs / 32GB ~ $27/GB

    ∙ Single 5070 Ti: $749 GPU / 16GB ~ $47/GB

    ∙ Whole-system cost per GB: ~$82 vs ~$126
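
For reference, the dual-GPU launch is just llama.cpp with the two flags listed in the Benchmark Details section below. A sketch (model path, context size, and port are placeholders, not my exact command):

```
# Dual-GPU llama.cpp launch sketch - adjust model path and context size
./llama-server \
  -m models/qwen2.5-14b-instruct-q5_k_m.gguf \
  --n-gpu-layers 99 \
  --tensor-split 1,1 \
  --ctx-size 8192 \
  --host 0.0.0.0 --port 8080
```

With --tensor-split 1,1 the weights land roughly 50/50; a skewed ratio (e.g. 3,1) also works if one card has less free VRAM.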

Motherboards

I did not want to spend another $500 on the next tier of mobo, so there was a lot of cursing, experimenting, and workaround-finding. The X870E AORUS ELITE I got open-box at MicroCenter has PCIe slots on separate buses (slots 1 and 3). This matters - I tried three other boards first and they just would not or could not cut it, and the separate buses were the major difference. Many less expensive boards share lanes between the M.2 slots and the PCIe slots, and the manuals are not always clear about exactly which configurations disable or downgrade what. A quick topology check is shown below.
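
Two generic checks worth running on any candidate board once both cards are seated (plain nvidia-smi queries, nothing board-specific):

```
# How the two GPUs reach each other (PHB/NODE/SYS = routed through the CPU, no shared switch)
nvidia-smi topo -m

# Negotiated PCIe generation and lane width per card
nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.width.current --format=csv
```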

Does Dual Make Sense?

I think it does for me in these cases:

∙   Running models >12GB

∙   Multi-tasking (LLM + image gen + TTS)

∙   Future-proofing for 20-30GB models

∙   Cost-conscious (better $/GB)

I’ll use single 5070 Ti if:

∙   Mainly running 7B-8B models

∙   Single-task workflows

∙   Smaller budget ($618 less upfront)

∙   Want slightly better single-model performance

Blackwell Gotchas

∙   Requires NVIDIA driver 570+ (open-source kernel module variant only). You WILL have driver headaches, almost certainly. It is very touchy, but it seems stable once operational.

∙   I learned after banging my head on it for a while that PyTorch stable doesn’t support sm_120 - stable 2.5.1 throws an “sm_120 not compatible” error, so use nightly builds (install sketch after this list). I may, if my supply of misery runs low and I need to restock, try building the latest one from source with the right drivers.

∙   llama.cpp needs the sm_89 compile target; Blackwell then runs it via PTX forward compatibility (build commands after this list).

∙   CUDA 12.4 from conda will not work. I had to use 12.8.

∙   The proprietary nvidia-driver-570 package does not work - use the open-source variant.

∙   The RTL8125 Ethernet port needs a manual driver install on Ubuntu on this board - it wanted to use r8169, and no.

∙   Fast Boot and Secure Boot will almost certainly need to be disabled in the BIOS. Some boards just will not allow setup with both GPUs active. Depower one, and then you can get into the BIOS and change things.
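
Condensed into commands, the software fixes look roughly like this - a sketch, so double-check current package names and wheel tags before copying:

```
# Confirm the open kernel module is the one actually loaded
cat /proc/driver/nvidia/version

# PyTorch nightly with CUDA 12.8 wheels (stable 2.5.1 rejects sm_120)
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128

# llama.cpp CUDA build targeting sm_89 (run from the llama.cpp source tree);
# Blackwell executes it via PTX forward compatibility
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build --config Release -j
```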

Benchmark Details

All tests used llama.cpp with identical prompts and parameters:

∙   --n-gpu-layers 99 (full GPU offload)

∙   --tensor-split 1,1 (dual GPU only)

∙   Models: Q4_K_M quantization except where noted

Dual-GPU VRAM distribution verified via nvidia-smi and nvtop.
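
For anyone reproducing the table, a llama-bench run with similar settings looks roughly like the following (not my exact harness; the model path is a placeholder, and dual-GPU runs additionally need the tensor-split option - check llama-bench --help for its syntax):

```
# Rough single-model benchmark: 512-token prompt, 128-token generation, full GPU offload
./llama-bench -m models/llama-3-8b-instruct-q4_k_m.gguf -ngl 99 -p 512 -n 128
```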