5060ti 16gb or 5070 12gb for local LLM
Posted by soteko@reddit | LocalLLaMA | 51 comments
As the title says, which is better, taking into consideration that it will probably offload to CPU anyway?
Models: Qwen 3.6 35B and maybe Qwen 3.6 27B, though I am not sure that one will be usable...
CPU: 5700X with 32GB DDR4
cleversmoke@reddit
I'd personally go with the 5060 Ti 16GB. It's a great start, and if the motherboard allows, you can add another 5060 Ti 16GB later. While the memory bandwidth isn't the same as an RTX 5090, 2x 5060 Ti will be an affordable upgrade to 32GB VRAM.
soteko@reddit (OP)
I have another slot, but it is x4. The motherboard is an ASUS TUF GAMING B550M-PLUS.
ThankYou_666@reddit
I'm looking to get an Asus TUF X870 Plus. That should work well. Also, would you suggest starting with a 5060 Ti 16GB over a 5070 Ti 16GB for someone new to AI/ML? And would adding a second 5060 Ti later on be a better upgrade than a single GPU with slightly more VRAM? Thanks
soteko@reddit (OP)
Because of current budget constraints I ended up buying a 5060 8GB with 32GB of system RAM, so I can evaluate local agentic coding now and later really invest in something much better, probably with 24GB of VRAM or more.
So currently I am doing all planning with GLM 5.1 on Ollama Pro ($20), and then all coding by Pi is executed on Qwen 3.6 35B (Q5_K_XL) with around 240 t/s prompt processing and 30 t/s token generation in LM Studio on Windows. The only thing I will upgrade now is system RAM, by at least 16GB, so I can run Qwen 3.6 35B (Q6_K_XL).
I can probably do much better with Linux / llama.cpp, but for now things work and I will keep it like this.
As for your question, I would spend the money on 2 x 5060 Ti 16GB rather than one 5070 Ti 16GB, unless you can buy another 5070 Ti in the future.
ThankYou_666@reddit
Thanks. Not sure if spending an extra $500-600 on a 5070 Ti over a 5060 Ti is worth it for more or less double the bandwidth and CUDA cores but the same VRAM, especially since I'm just starting out.
soteko@reddit (OP)
I would go with the 5060 Ti 16GB. As you see, I even went further down to a 5060 8GB, because with anything less than 32GB of VRAM you will end up with a model that spills over into system RAM, and then everything is slow. With less than 32GB of VRAM you will also need to use more heavily quantized models, and to be honest they make mistakes as the context grows.
Yes, the 5060 Ti 16GB is the sweet spot if you are starting out, and get 64GB of RAM so you can use Q6 or Q8 models.
ThankYou_666@reddit
Thanks. Got 64GB RAM thankfully just when prices started going crazy.
soteko@reddit (OP)
I don't have exact numbers, but if I get 30 t/s with a 5060 8GB on Qwen 3.6 35B, you will probably get around 40 t/s with a 5060 Ti 16GB and something like 60 t/s with a 5070 Ti.
But with two 5060 Ti:
- first and most important, you can run a dense model that is more intelligent, which is something you can't do with a single 5070 Ti with only 16GB (except with a heavily quantized model)
- second, from what I see, fitting fully in VRAM will give you > 100 t/s with an MoE model like Qwen 3.6 35B on 2 x 5060 Ti
ThankYou_666@reddit
So, a 5060 Ti 16GB plus 64GB RAM would be a great starting point, and easier to expand with a 2nd GPU? The 2nd PCIe slot is a PCIe 4.0 x16 slot (supports x4 mode). Would that be sufficient for a dual GPU setup? The main slot is PCIe 5.0 x16.
cleversmoke@reddit
If your budget allows for RTX 5070 ti, absolutely, since it has double the memory bandwidth. However, for nearly $800 USD, I'd rather put that money into a RTX 3090 24G (about the same memory bandwidth with 50% more vram than RTX 5070 ti) or even a modded RTX 3080 20G (~$600 on eBay).
If you have other use cases for a RTX 5070 ti 16G such as gaming, then it would be a better investment since 5070 ti will likely appreciate better than 5060 ti due to the memory bandwidth alone.
If this is purely an AI rig, I'd start with one RTX 5060 Ti 16G, which gives you enough VRAM to figure out your use cases. Load up Qwen3.6-35B-A3B or Gemma-4-26B-A4B-it, and then decide if you want to upgrade to another RTX 5060 Ti 16G or an RTX 3090 24G.
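The bandwidth and VRAM comparisons in this comment can be sanity-checked with a few lines of Python. This is only a rough sketch: the bandwidth figures are not from this thread, they are commonly published spec-sheet values assumed here for illustration, and the `CARDS` table and `ratio` helper are hypothetical names.

```python
# Spec-sheet numbers (assumed, not measured in this thread):
# memory bandwidth in GB/s and VRAM in GB.
CARDS = {
    "RTX 5060 Ti 16G": {"bandwidth_gbs": 448, "vram_gb": 16},
    "RTX 5070 Ti 16G": {"bandwidth_gbs": 896, "vram_gb": 16},
    "RTX 3090 24G": {"bandwidth_gbs": 936, "vram_gb": 24},
}


def ratio(a: str, b: str, key: str) -> float:
    """Return how many times larger card a's value is than card b's."""
    return CARDS[a][key] / CARDS[b][key]


bw_5070ti_vs_5060ti = ratio("RTX 5070 Ti 16G", "RTX 5060 Ti 16G", "bandwidth_gbs")
bw_3090_vs_5070ti = ratio("RTX 3090 24G", "RTX 5070 Ti 16G", "bandwidth_gbs")
vram_3090_vs_5070ti = ratio("RTX 3090 24G", "RTX 5070 Ti 16G", "vram_gb")

print(f"5070 Ti vs 5060 Ti bandwidth: {bw_5070ti_vs_5060ti:.2f}x")
print(f"3090 vs 5070 Ti bandwidth:    {bw_3090_vs_5070ti:.2f}x")
print(f"3090 vs 5070 Ti VRAM:         {vram_3090_vs_5070ti:.2f}x")
```

Under those assumed specs the ratios line up with "double the memory bandwidth", "about the same memory bandwidth" and "50% more vram". Prices are left out on purpose because they vary too much by region and over time.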
ThankYou_666@reddit
The GPU is more for learning AI than for gaming.
cleversmoke@reddit
For me, I wouldn't regret a 5060 Ti because of the sheer value per GB of VRAM it gives. I would likely regret a 5070 Ti because I know I could have bought a 3090 for nearly the same cost.
For perspective, I have upgraded slowly: I started with an RTX 2060 12G for $120, then added a 3090 for $800, and a second 3090 for $1000. The 2060 still has its uses.
In any scenario, worst case, you can just resell the card to recoup costs.
ThankYou_666@reddit
Thanks. So, start small and go from there. As you said, 5060 ti 16GB would be a starting point I wouldn't regret and work out what I need over time and add from there? As you said, can resell to recoup some costs back.
DocMadCow@reddit
x4 is just fine for inference. I have a 5070 TI 16GB and 5060 Ti 16GB (in x4 slot).
soteko@reddit (OP)
How is that combo?
Is it faster than 2x 5060 Ti?
DocMadCow@reddit
I suspect the same as 2x5060 Ti as the slower card is slowing it down. But having 32GB definitely speeds it up over a single 5070 Ti for larger models.
Legitimate-Dog5690@reddit
The motherboard should be fine. The limiting factors might be your power supply (both peak watts and PCIe power outputs) and the physical space/cooling in your case. Dual GPUs work pretty well now!
soteko@reddit (OP)
I will buy a better power supply.
andy_potato@reddit
That barely impacts performance
KURD_1_STAN@reddit
If you're not spilling into RAM, that shouldn't have an impact besides the first model load.
Bulky-Priority6824@reddit
With a mobo at x8/x4 and 2x 5060 Ti 16GB, running llama.cpp with split mode layer on Qwen 3.6 35B-A3B Q4_K_XL gives 94 t/s token generation, with about 3.5GB of overhead at 82k context.
Will find out soon what x8/x8 gives.
jacek2023@reddit
Think about how to get two 16GB cards.
soteko@reddit (OP)
I have another x4 slot on the ASUS TUF GAMING B550M-PLUS.
Is it OK to put another 5060 Ti in that slot?
ThankYou_666@reddit
Looking to get an Asus TUF X870 Plus. Should I get a 5060 Ti 16GB to start with, or look at a 5070 Ti 16GB? If I need to upgrade, would another 5060 Ti 16GB be better than spending more on a GPU with more VRAM?
andy_potato@reddit
Absolutely no problem. Running the exact same setup here
jacek2023@reddit
it's worth trying
Due_Duck_8472@reddit
For what? For writing smut? No difference, for coding? No way
Formal-Exam-8767@reddit
5060 ti 16GB or 5070 ti 16GB, no point in getting 5070 12GB.
ResponsibleTruck4717@reddit
This, or 2x 5060 Ti 16GB vs one 5070 Ti 16GB.
Bulky-Priority6824@reddit
With a mobo at x8/x4 and 2x 5060 Ti 16GB, running llama.cpp with split mode layer on Qwen 3.6 35B-A3B Q4_K_XL gives 94 t/s token generation on Debian and 82 t/s on Windows, with about 3.5GB of overhead at 82k context.
Best of all, they both idle at 7W each when nothing is going on.
Will find out soon what x8/x8 gives.
soteko@reddit (OP)
What is the prompt processing speed? I found that for agent usage on CPU, the prompt is the biggest problem. Even 10 t/s output from CPU inference will do the job for coding, since I can leave it to work for several hours. But input tokens are massive, like 1,200,000 input tokens, and with CPU prompt processing that is totally unusable.
So that is my point: if I can speed up prompt processing with a GPU, I will have a workable solution, no matter if the model does not fit in VRAM.
That is why I am asking whether the 12GB 5070 will do a better job at prompt processing.
Bulky-Priority6824@reddit
root@msiam4:~# /opt/llama.cpp/build/bin/llama-bench \
-m /mnt/storage/models/qwen36b/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
-p 512,2048 \
-n 128 \
-fa 1 \
-r 3
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 31696 MiB):
Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
Device 1: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15846 MiB
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB | 34.66 B | CUDA | 99 | 1 | pp512 | 2596.35 ± 8.54 |
| qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB | 34.66 B | CUDA | 99 | 1 | pp2048 | 3540.40 ± 9.98 |
| qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB | 34.66 B | CUDA | 99 | 1 | tg128 | 102.54 ± 0.48 |
build: e77056f9b (9024)
root@msiam4:~#
Claude Summary 😄
Dual RTX 5060 Ti 16GB (32GB total VRAM) — Qwen3.6 35B A3B Q4_K_XL, llama.cpp mainline, Flash Attention on
Key points for their decision:
For light coding use, the 5060 Ti 16GB is the stronger choice specifically because of the VRAM headroom for larger/better quants of 27B.
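The benchmark output above gives enough numbers for a rough VRAM fit check. The sketch below assumes that the "about 3.5gb of overhead at 82k ctx" mentioned elsewhere in the thread is a total of 3.5 GiB on top of the model weights, and it ignores VRAM used by the desktop or driver, so treat the result as an optimistic estimate. `spill_gib` is a hypothetical helper name.

```python
MIB_PER_GIB = 1024

# Figures taken from the thread.
MODEL_GIB = 20.81               # model size reported by llama-bench
CARD_VRAM_MIB = [15849, 15846]  # per-card VRAM reported by ggml_cuda_init
# Assumption: context/compute overhead is 3.5 GiB in total, not per card.
OVERHEAD_GIB = 3.5


def spill_gib(card_vram_mib: list[int], model_gib: float, overhead_gib: float) -> float:
    """Return the GiB that would not fit in VRAM (0.0 if everything fits)."""
    vram_gib = sum(card_vram_mib) / MIB_PER_GIB
    return max(0.0, model_gib + overhead_gib - vram_gib)


single = spill_gib(CARD_VRAM_MIB[:1], MODEL_GIB, OVERHEAD_GIB)
dual = spill_gib(CARD_VRAM_MIB, MODEL_GIB, OVERHEAD_GIB)

print(f"one 5060 Ti 16GB: {single:.1f} GiB would spill to system RAM")
print(f"two 5060 Ti 16GB: {dual:.1f} GiB would spill to system RAM")
```

With these inputs a single 16GB card leaves several GiB of this quant in system RAM, while the dual-card setup holds everything in VRAM, which matches the `ngl 99` run shown in the table.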
soteko@reddit (OP)
Thanks man :)
Bulky-Priority6824@reddit
np
soteko@reddit (OP)
If you have time I would like to know if Qwen3.6 27B dense model will work on 2x 5060Ti and how fast.
Thanks.
Bulky-Priority6824@reddit
keep in mind, this is with 8x and 4x lanes. (I'll be able to do another test with proper x8 x8 in a few days)
https://imgur.com/a/gJpDJTx
Qwen3.6-27B-UD-Q4_K_XL.gguf
Total: 18GB across both cards
PP512: 888.45 t/s
PP2048: 1284.58 t/s
TG128: 21.74 t/s
soteko@reddit (OP)
I've tried to do some basic math. In my last coding session with Pi + Ollama Pro / GLM 5.1 I had:
11 million input tokens
50k output tokens
A simple calculation:
Qwen3.6 27B
PP2048: 11 000 000 / 1284.58 = 8563 s = 143 min
TG128: 50 000 / 21.74 = 2300 s = 38 min
Total: about 181 min, or 3 hours per agentic coding session.
Qwen3.6 35B
PP2048: 11 000 000 / 3540 = 3107 s = 52 min
TG128: 50 000 / 102 = 490 s = 8 min
Total: about 60 min, or 1 hour per agentic coding session.
I hope I got this right.
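The same estimate can be written as a small Python sketch, which makes it easy to plug in other token counts or benchmark numbers. It assumes the worst case described here: every input token is processed once at the pp2048 rate, with no prompt caching, and generation runs at the tg128 rate. `session_minutes` is a hypothetical helper name.

```python
def session_minutes(input_tokens: int, output_tokens: int,
                    pp_tps: float, tg_tps: float) -> float:
    """Worst-case wall time in minutes, with no prompt caching."""
    seconds = input_tokens / pp_tps + output_tokens / tg_tps
    return seconds / 60


INPUT_TOKENS = 11_000_000
OUTPUT_TOKENS = 50_000

# (pp2048 t/s, tg128 t/s) from the dual 5060 Ti benchmarks in this thread.
BENCH = {
    "Qwen3.6 27B Q4_K_XL": (1284.58, 21.74),
    "Qwen3.6 35B-A3B Q4_K_XL": (3540.40, 102.54),
}

estimates = {
    name: session_minutes(INPUT_TOKENS, OUTPUT_TOKENS, pp, tg)
    for name, (pp, tg) in BENCH.items()
}

for name, minutes in estimates.items():
    print(f"{name}: about {minutes:.0f} min ({minutes / 60:.1f} h)")
```

The results land close to the hand calculation above: roughly three hours for the 27B dense model and roughly one hour for the 35B MoE model.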
jjsilvera1@reddit
Don't forget to take token caching into account. Not sure how that would affect the timing.
soteko@reddit (OP)
Well, this is the worst-case scenario.
Yes, it should be faster.
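To get a feel for how much caching could matter, here is a hedged sketch that adds a cache hit fraction to the worst-case estimate. The hit fractions are purely hypothetical, nobody in the thread measured them, and the model simply treats cached input tokens as free, which is optimistic. `session_minutes_cached` is a made-up helper name; the token counts and the 27B benchmark numbers are the ones quoted in this thread.

```python
def session_minutes_cached(input_tokens: int, output_tokens: int,
                           pp_tps: float, tg_tps: float,
                           cache_hit_fraction: float) -> float:
    """Estimated minutes when a fraction of input tokens is served from cache."""
    if not 0.0 <= cache_hit_fraction <= 1.0:
        raise ValueError("cache_hit_fraction must be between 0 and 1")
    uncached_tokens = input_tokens * (1.0 - cache_hit_fraction)
    seconds = uncached_tokens / pp_tps + output_tokens / tg_tps
    return seconds / 60


INPUT_TOKENS = 11_000_000
OUTPUT_TOKENS = 50_000
PP_TPS, TG_TPS = 1284.58, 21.74  # Qwen3.6 27B on 2x 5060 Ti, from the thread

# Hypothetical hit fractions, for illustration only.
for hit in (0.0, 0.5, 0.9):
    minutes = session_minutes_cached(INPUT_TOKENS, OUTPUT_TOKENS, PP_TPS, TG_TPS, hit)
    print(f"cache hit {hit:.0%}: about {minutes:.0f} min")
```

A hit fraction of 0.0 reproduces the worst case. As the hit fraction rises, token generation becomes the dominant cost, so the slow tg128 rate of the dense model sets the floor.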
jjsilvera1@reddit
thanks for taking the time to do this!
soteko@reddit (OP)
Thanks again :)
Well this looks totally ok to work with.
WouterC@reddit
Thanks
OniCr0w@reddit
5070 Ti has 16GB vram. I assume you meant 5070.
Bulky-Priority6824@reddit
Yeah, that's a Claude error. I'll edit it to avoid confusion.
Bulky-Priority6824@reddit
I can test if you want. I'll conjure up a scenario and show you after a while.
Blizado@reddit
You want as much as possible in VRAM. Why? The more layers of the model that sit in VRAM rather than RAM, the faster it runs. And for stuff that only works in VRAM, 16GB is of course better as well. You could also add a second card later; then you have 32GB, and many AI models can work with 2 GPUs at once, which speeds up the LLM a lot, not by a factor of 2, but a lot. If it means you no longer need to keep part of the model in normal RAM, it is more than a factor of 2, of course. And at current hardware prices, it is worth going this way.
horeaper@reddit
Try 7900XT
soteko@reddit (OP)
Well, I can't find one new or even used here.
horeaper@reddit
That's sad. My suggestion is to wait for the 9070 (non-XT) to reach a more reasonable price level. The 5060 Ti is just too slow, and the 5070's VRAM is definitely not enough these days. You'd be better off spending all that money on the DeepSeek API.
Mashic@reddit
16GB.
Sad-Duck2812@reddit
I have seen people get decent speeds even with 12GB, something like 60 t/s. I have also tested it on a 5070 12GB and managed 58-60 t/s with CPU offload.
In my opinion, get the 5060 Ti 16GB. It's a very good budget GPU for AI models, and you can even fit some models into it completely since it has 16GB. Even if you have to offload, it's better to fit as much of the model on the GPU as you can.