5060ti 16gb or 5070 12gb for local LLM

[-]

cleversmoke@reddit

I'd personally go with the 5060 Ti 16GB. It's a great start, and if the motherboard allows, you can add another 5060 Ti 16GB later. While the memory bandwidth isn't the same as an RTX 5090's, 2x 5060 Ti is an affordable upgrade path to 32GB of VRAM.

[-]

soteko@reddit (OP)

I have another slot, but it is x4. The motherboard is an ASUS TUF GAMING B550M-PLUS.

[-]

ThankYou_666@reddit

I'm looking to get an Asus TUF X870 Plus. That should work well. Also, would you suggest starting with a 5060 Ti 16GB over a 5070 Ti 16GB for someone new to AI/ML? Would dual 5060 Ti later on be a better upgrade than a single GPU with slightly more VRAM? Thanks

[-]

soteko@reddit (OP)

Because of current budget constraints I ended up buying a 5060 8GB with 32GB of system RAM, so I can evaluate local agentic coding first and later really invest in something much better, probably with 24GB of VRAM or more.

So currently I am doing all planning with GLM 5.1 on Ollama Pro ($20), and then all coding by Pi is executed on Qwen 3.6 35B (Q5_K_XL) with around 240 t/s prompt processing and 30 t/s token generation in LM Studio on Windows. The only thing I will upgrade now is system RAM, by at least 16GB, so I can run Qwen 3.6 35B (Q6_K_XL).

I can probably do much better with Linux / llama.cpp, but for now things work and I will keep it like this.

For your question, I would rather spend the money on 2 x 5060 Ti 16GB than on one 5070 Ti 16GB, unless you can buy another 5070 Ti in the future.

[-]

ThankYou_666@reddit

Thanks. Not sure if spending an extra $500-600 on a 5070 Ti over a 5060 Ti is worth it for more or less double the bandwidth and CUDA cores but the same VRAM, especially since I'm just starting out.

[-]

soteko@reddit (OP)

I would go with the 5060 Ti 16GB. As you see, I even went further down with a 5060 8GB, because with anything less than 32GB of VRAM you will end up with a model that spills into system RAM, and then everything is slow. With anything less than 32GB of VRAM you will also need more heavily quantized models, and to be honest they make mistakes as the context grows.

Yes, the 5060 Ti 16GB is the sweet spot if you are starting out. Get 64GB of RAM as well so you can use Q6 or Q8 models.

[-]

ThankYou_666@reddit

Thanks. Got 64GB RAM thankfully just when prices started going crazy.

[-]

soteko@reddit (OP)

I don't have exact numbers, but if I get 30 t/s with the 5060 8GB on Qwen 3.6 35B, I'd guess around 40 t/s with a 5060 Ti 16GB and something like 60 t/s with a 5070 Ti.

But with two 5060 Ti:
- first and most important, you can run a dense model that is more intelligent, which is something you can't do with a 5070 Ti with only 16GB (except at a heavy quant)
- second, from what I see, fitting fully in VRAM will give you > 100 t/s with a MoE model like Qwen 3.6 35B on 2 x 5060 Ti

[-]

ThankYou_666@reddit

So, a 5060 Ti 16GB plus 64GB RAM would be a great starting point, and easier to expand with a 2nd GPU? The 2nd PCIe slot is a PCIe 4.0 x16 slot (supports x4 mode). Would that be sufficient for a dual GPU setup? The main slot is PCIe 5.0 x16.

[-]

cleversmoke@reddit

If your budget allows for RTX 5070 ti, absolutely, since it has double the memory bandwidth. However, for nearly $800 USD, I'd rather put that money into a RTX 3090 24G (about the same memory bandwidth with 50% more vram than RTX 5070 ti) or even a modded RTX 3080 20G (~$600 on eBay).

If you have other use cases for a RTX 5070 ti 16G such as gaming, then it would be a better investment since 5070 ti will likely appreciate better than 5060 ti due to the memory bandwidth alone.

If this is purely an AI rig, I'd start with one RTX 5060 Ti 16G, which gives you enough VRAM to figure out your use cases. Load up Qwen3.6-35B-A3B or Gemma-4-26B-A4B-it, and then decide if you want to upgrade to another RTX 5060 Ti 16G or an RTX 3090 24G.

[-]

ThankYou_666@reddit

The GPU is more for learning AI than for gaming.

[-]

cleversmoke@reddit

For me, I wouldn't regret a 5060 ti due to the sheer value per vram it gives. I would likely regret a 5070 ti because I know I would have just bought a 3090 with nearly the same cost.

For perspective, I have upgraded slowly: I started with an RTX 2060 12G for $120, then added a 3090 for $800, and a second 3090 for $1000. The 2060 still has its uses.

In any scenario, worst case, you can just resell the card to recoup costs.

[-]

ThankYou_666@reddit

Thanks. So, start small and go from there. As you said, 5060 ti 16GB would be a starting point I wouldn't regret and work out what I need over time and add from there? As you said, can resell to recoup some costs back.

[-]

DocMadCow@reddit

x4 is just fine for inference. I have a 5070 TI 16GB and 5060 Ti 16GB (in x4 slot).

[-]

soteko@reddit (OP)

How is that combo?

Is it faster than 2x 5060 Ti?

[-]

DocMadCow@reddit

I suspect the same as 2x5060 Ti as the slower card is slowing it down. But having 32GB definitely speeds it up over a single 5070 Ti for larger models.

[-]

Legitimate-Dog5690@reddit

The motherboard should be fine. The limiting factors might be your power supply (both peak watts and PCIe outputs) and the physical space/cooling in your case. Dual GPUs work pretty well now!

[-]

soteko@reddit (OP)

I will buy a better power supply.

[-]

andy_potato@reddit

That barely impacts performance

[-]

KURD_1_STAN@reddit

If you're not spilling into RAM, that shouldn't have an impact besides the first model load.
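A rough way to see why lane count mostly matters at load time: the sketch below compares theoretical PCIe link ceilings. The per-lane figures are the usual spec values after encoding overhead, and `shard_gb` is a hypothetical 10 GB share of a model sitting on the second card. Real loads are usually slower because of disk and driver overhead, so treat these as lower bounds only.

```python
# Theoretical PCIe payload bandwidth per lane in GB/s (after 128b/130b encoding).
# These are spec-sheet ceilings, not measurements.
PCIE_GBPS_PER_LANE = {3: 0.985, 4: 1.969, 5: 3.938}


def link_bandwidth_gbps(gen: int, lanes: int) -> float:
    """Peak one-direction bandwidth of a PCIe link in GB/s."""
    return PCIE_GBPS_PER_LANE[gen] * lanes


def min_load_seconds(model_gb: float, gen: int, lanes: int) -> float:
    """Lower bound on the time to copy model_gb gigabytes over the link."""
    return model_gb / link_bandwidth_gbps(gen, lanes)


if __name__ == "__main__":
    shard_gb = 10.0  # hypothetical share of the model on the card in the x4 slot
    for lanes in (4, 8, 16):
        bandwidth = link_bandwidth_gbps(4, lanes)
        seconds = min_load_seconds(shard_gb, gen=4, lanes=lanes)
        print(f"PCIe 4.0 x{lanes}: {bandwidth:.1f} GB/s, "
              f"at least {seconds:.1f} s to load {shard_gb:.0f} GB")
```

Once the weights are in VRAM, only small activations cross the link per token when layers are split across cards, which is why the slot width matters far less for inference than for loading.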

[-]

Bulky-Priority6824@reddit

With a mobo at x8/x4 and 2x 5060 Ti 16GB running llama.cpp with split mode layer, Qwen3.6-35B-A3B Q4_K_XL gets 94 t/s token generation, with about 3.5GB of overhead at 82k ctx.

Will find out soon what x8/x8 gives.

[-]

jacek2023@reddit

think how to get two 16GB

[-]

soteko@reddit (OP)

I have another x4 slot on the ASUS TUF GAMING B550M-PLUS.
Is it OK to put another 5060 Ti in that slot?

[-]

ThankYou_666@reddit

Looking to get an Asus TUF X870 Plus. Should I get a 5060 Ti 16GB to start with, or look at a 5070 Ti 16GB? If I need to upgrade, would adding another 5060 Ti 16GB be better than spending more on a single GPU with more VRAM?

[-]

andy_potato@reddit

Absolutely no problem. Running the exact same setup here

[-]

jacek2023@reddit

it's worth trying

[-]

Due_Duck_8472@reddit

For what? For writing smut? No difference, for coding? No way

[-]

Formal-Exam-8767@reddit

5060 ti 16GB or 5070 ti 16GB, no point in getting 5070 12GB.

[-]

ResponsibleTruck4717@reddit

This, or 2x 5060 Ti 16GB vs one 5070 Ti 16GB.

[-]

Bulky-Priority6824@reddit

With a mobo at x8/x4 and 2x 5060 Ti 16GB running llama.cpp with split mode layer, Qwen3.6-35B-A3B Q4_K_XL gets 94 t/s token generation on Debian and 82 t/s on Windows, with about 3.5GB of overhead at 82k ctx.

Best of all, they both idle at 7 W each when nothing is going on.

Will find out soon what x8/x8 gives.

[-]

soteko@reddit (OP)

What is the prompt processing speed? For agent usage on CPU I found that prompt processing is the biggest problem. Even with CPU inference, 10 t/s output is enough for coding, because I can leave it working for several hours to finish. But the input is massive, around 1 200 000 input tokens, and with CPU prompt processing that is totally unusable.

So my point is that if I can speed up prompt processing with a GPU, I will have a workable solution, no matter whether the model fits in VRAM.

That is why I am asking whether a 12GB 5070 would do a better job at prompt processing.

[-]

Bulky-Priority6824@reddit

root@msiam4:~# /opt/llama.cpp/build/bin/llama-bench \
-m /mnt/storage/models/qwen36b/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
-p 512,2048 \
-n 128 \
-fa 1 \
-r 3
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 31696 MiB):
Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
Device 1: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15846 MiB
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB |    34.66 B | CUDA       | 99 | 1 |           pp512 |       2596.35 ± 8.54 |
| qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB |    34.66 B | CUDA       | 99 | 1 |          pp2048 |       3540.40 ± 9.98 |
| qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB |    34.66 B | CUDA       | 99 | 1 |           tg128 |        102.54 ± 0.48 |

build: e77056f9b (9024)
root@msiam4:~#

Claude Summary 😄

Dual RTX 5060 Ti 16GB (32GB total VRAM) — Qwen3.6 35B A3B Q4_K_XL, llama.cpp mainline, Flash Attention on

| Test   | Tokens/sec | What it means                    |
| ------ | ---------: | -------------------------------- |
| PP512  |   2596 t/s | Short prompt / first message     |
| PP2048 |   3540 t/s | Long prompt / pasting code       |
| TG128  |    102 t/s | Actual generation speed you feel |

Key points for their decision:

- The 27B dense and A3B MoE are very different: A3B is sparse, so PP scales extremely well across GPUs, which inflates these PP numbers favorably vs a dense 27B
- A single 5070 12GB cannot fit Qwen3 27B at Q8 (needs ~29GB); they'd be forced into Q4, with VRAM to spare on the 5060 Ti 16GB
- The 5060 Ti 16GB wins on capacity: fits bigger quants, fits 27B dense comfortably
- The 5070 12GB wins on raw compute/bandwidth but loses the VRAM headroom for 27B
- TG at 102 t/s on a 35B MoE is faster than most people's reading speed, which is plenty for coding

For light coding use, the 5060 Ti 16GB is the stronger choice specifically because of the VRAM headroom for larger/better quants of 27B.
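A quick way to sanity-check fit claims like these is to estimate the weight size from parameter count and bits per weight. This is a minimal sketch under several assumptions: 8.5 bits per weight for Q8_0, an effective bits-per-weight for Q4_K_XL derived from the llama-bench row above (20.81 GiB for 34.66 B params) and reused for a 27B model, and the roughly 3.5 GB of overhead quoted earlier in the thread. The real overhead depends on the model and context length, so the results are ballpark only.

```python
GIB = 1024 ** 3

# VRAM reported by llama.cpp in the benchmark log above, converted to GiB.
ONE_5060TI_GIB = 15849 / 1024
TWO_5060TI_GIB = 31696 / 1024

Q8_0_BPW = 8.5  # 32 weights plus one fp16 scale per block
# Effective bits per weight of the UD-Q4_K_XL file, derived from the bench row.
# Reusing it for a different model is an assumption.
Q4_K_XL_BPW = 20.81 * GIB * 8 / 34.66e9


def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / GIB


def fits(params_billion: float, bits_per_weight: float,
         vram_gib: float, overhead_gib: float = 3.5) -> bool:
    """True if weights plus an assumed fixed overhead fit in vram_gib."""
    needed = weights_gib(params_billion, bits_per_weight) + overhead_gib
    return needed <= vram_gib


if __name__ == "__main__":
    for name, bpw in (("Q8_0", Q8_0_BPW), ("Q4_K_XL", Q4_K_XL_BPW)):
        size = weights_gib(27, bpw)
        print(f"27B {name}: ~{size:.1f} GiB of weights, "
              f"fits one 16GB card: {fits(27, bpw, ONE_5060TI_GIB)}, "
              f"fits two 16GB cards: {fits(27, bpw, TWO_5060TI_GIB)}")
```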

[-]

soteko@reddit (OP)

Thanks man :)

[-]

Bulky-Priority6824@reddit

np

[-]

soteko@reddit (OP)

If you have time I would like to know if Qwen3.6 27B dense model will work on 2x 5060Ti and how fast.
Thanks.

[-]

Bulky-Priority6824@reddit

keep in mind, this is with 8x and 4x lanes. (I'll be able to do another test with proper x8 x8 in a few days)

https://imgur.com/a/gJpDJTx

Qwen3.6-27B-UD-Q4_K_XL.gguf

Total: 18gb across both cards

PP512: 888.45 t/s

PP2048: 1284.58 t/s

TG128: 21.74 t/s
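The dense TG number is in the same ballpark as a simple memory-bandwidth estimate. This is a back-of-envelope sketch, assuming each generated token reads everything resident on the cards once, that layer split makes the two cards work one after the other, and that a 5060 Ti has about 448 GB/s of memory bandwidth (a spec-sheet figure that is not stated in this thread and is worth double-checking). The 18 GB total comes from the result above and includes more than just weights, so this is only a rough ceiling.

```python
def tg_ceiling_tps(shards):
    """Upper bound on tokens/s for a layer-split model.

    shards: list of (gigabytes_on_device, device_bandwidth_gb_per_s).
    Devices process their layers in sequence, so per-token times add up.
    """
    seconds_per_token = sum(gb / bandwidth for gb, bandwidth in shards)
    return 1.0 / seconds_per_token


if __name__ == "__main__":
    bandwidth_5060ti = 448.0  # GB/s, assumed from the published spec
    total_gb = 18.0           # total across both cards, from the result above
    shards = [(total_gb / 2, bandwidth_5060ti)] * 2  # assumes an even split

    ceiling = tg_ceiling_tps(shards)
    measured = 21.74          # TG128 from the result above
    print(f"ceiling ~{ceiling:.1f} t/s, measured {measured} t/s "
          f"({measured / ceiling:.0%} of ceiling)")
```

The same function accepts shards with different bandwidths, so it can also be used to play with mixed-card setups, with the caveat that the even split is itself an assumption.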

[-]

soteko@reddit (OP)

I tried some basic math. In my last coding session with Pi + Ollama Pro / GLM 5.1 I had:
11 million input tokens
50k output tokens

A simple calculation:

Qwen3.6 27B
PP2048: 11 000 000 / 1284.58 t/s = 8563 s, about 143 min
TG128: 50 000 / 21.74 t/s = 2300 s, about 38 min
Total: about 181 min, or 3 hours, for an agentic coding session.

Qwen3.6 35B
PP2048: 11 000 000 / 3540 t/s = 3107 s, about 52 min
TG128: 50 000 / 102 t/s = 490 s, about 8 min
Total: about 60 min, or 1 hour, for an agentic coding session.

I hope I got this right.
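The same estimate as a small script, in case anyone wants to plug in their own numbers. It is only a sketch: it assumes the benchmark speeds hold for the whole session and that every input token is processed from scratch, so it is a worst case. The function name is made up for illustration.

```python
def session_minutes(input_tokens, output_tokens, pp_tps, tg_tps):
    """Return (prompt_min, generation_min, total_min) for one session."""
    prompt_min = input_tokens / pp_tps / 60
    generation_min = output_tokens / tg_tps / 60
    return prompt_min, generation_min, prompt_min + generation_min


if __name__ == "__main__":
    input_tokens = 11_000_000
    output_tokens = 50_000
    # (PP2048 t/s, TG128 t/s) from the dual 5060 Ti benchmarks in this thread.
    models = {
        "Qwen3.6 27B": (1284.58, 21.74),
        "Qwen3.6 35B": (3540.0, 102.0),
    }
    for name, (pp_tps, tg_tps) in models.items():
        pp, tg, total = session_minutes(input_tokens, output_tokens, pp_tps, tg_tps)
        print(f"{name}: PP {pp:.0f} min + TG {tg:.0f} min = {total:.0f} min")
```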

[-]

jjsilvera1@reddit

Don't forget to take token caching into account. Not sure how that would affect the timing.

[-]

soteko@reddit (OP)

Well this is worst case scenario.

Yes it should be faster.
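To get a feel for how much prompt caching could help, here is a small sketch. `cached_fraction` is a hypothetical knob for the share of input tokens served from cache instead of being reprocessed. The real hit rate depends on the agent and the server, and nothing in this thread measures it.

```python
def prompt_minutes(input_tokens, pp_tps, cached_fraction=0.0):
    """Prompt-processing time in minutes when a share of tokens is cached."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    uncached_tokens = input_tokens * (1.0 - cached_fraction)
    return uncached_tokens / pp_tps / 60


if __name__ == "__main__":
    input_tokens = 11_000_000
    pp_tps = 1284.58  # PP2048 for Qwen3.6 27B from the benchmark above
    for fraction in (0.0, 0.5, 0.9):  # hypothetical cache hit rates
        minutes = prompt_minutes(input_tokens, pp_tps, fraction)
        print(f"{fraction:.0%} cached: {minutes:.0f} min of prompt processing")
```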

[-]

jjsilvera1@reddit

thanks for taking the time to do this!

[-]

soteko@reddit (OP)

Thanks again :)
Well this looks totally ok to work with.

[-]

WouterC@reddit

Thanks

[-]

OniCr0w@reddit

5070 Ti has 16GB vram. I assume you meant 5070.

[-]

Bulky-Priority6824@reddit

yea thats a claude error i'll edit it to save confusion

[-]

Bulky-Priority6824@reddit

I can test if you want. I'll conjure up a scenario and show you after a while.

[-]

Blizado@reddit

You want as much of the model as possible in VRAM. Why? The more layers sit in VRAM instead of system RAM, the faster it runs. And for anything that only works in VRAM, 16GB is of course better as well. You could also add a second card later; then you have 32GB, and many AI models can run across 2 GPUs at once, which speeds up the LLM a lot. Not by a factor of 2, but a lot. If the second card means you no longer need to keep part of the model in system RAM, the gain is of course more than a factor of 2. And at current hardware prices, it is worth going this way.

[-]

horeaper@reddit

Try 7900XT

[-]

soteko@reddit (OP)

Well I can't find new or even used here.

[-]