Is 5060Ti 16GB and 32GB DDR5 system ram enough to play with local AI for a total rookie?
Posted by danuser8@reddit | LocalLLaMA | 73 comments
For future proofing would it be better to get a secondary cheap GPU (like 3060) or another 32GB DDR5 RAM?
ThankYou_666@reddit
Did you go with the 5060 Ti 16GB? I'm in a similar scenario. Not sure whether to go with the 5060 Ti or the 5070 Ti. Both are 16GB, but the 5070 Ti has double the bandwidth and CUDA cores.
danuser8@reddit (OP)
I haven't tried local AI yet, but the 5060 Ti is good for text and light image generation work.
For any kind of video or heavy image generation, you've got to go to a higher tier.
ThankYou_666@reddit
Would the 5070 Ti be high-end enough for more demanding AI, or does that take higher-end GPUs with more VRAM?
danuser8@reddit (OP)
It’s more GPU VRAM and more bandwidth that matter. Higher-tier GPUs just happen to come with more bandwidth.
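A rough way to see why bandwidth matters: during generation, every active weight has to be read from VRAM once per token, so memory bandwidth puts a ceiling on tokens per second. The sketch below is only an estimate. The bandwidth figures are assumptions taken from published specs, and the 8 GB model size is a hypothetical example.

```python
# Rough upper bound on decode speed: each generated token has to read
# every active weight from VRAM once, so speed <= bandwidth / bytes read.
# Bandwidth figures are assumptions taken from published specs.

def max_tokens_per_second(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Bandwidth-bound ceiling on tokens/s; real speeds land below this."""
    return bandwidth_gb_s / active_weights_gb

CARDS = {"5060 Ti 16GB": 448.0, "5070 Ti 16GB": 896.0}  # GB/s, assumed

# Hypothetical example: a dense model whose quantized weights take 8 GB.
for name, bandwidth in CARDS.items():
    print(f"{name}: at most ~{max_tokens_per_second(bandwidth, 8.0):.0f} tokens/s")
```

MoE models only read their active experts for each token, so the same arithmetic gives them a much higher ceiling than a dense model of the same file size.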
sine120@reddit
Sure. With only 32GB system RAM I'd say just stick to smaller models that will fit in your GPU. You can fit up to 30B models with heavy quantization.
ThankYou_666@reddit
I'm in a similar scenario to the original poster. Got 64GB RAM but not sure whether to go for the 5060 Ti or the 5070 Ti. Both are 16GB, but the 5070 Ti has nearly double the CUDA cores and bandwidth. Not sure how much of a difference this makes for nearly double the price. Thinking it's better to learn with a 5060 Ti, then go all out on one with much more VRAM later. But not sure if I should go 5070 Ti now.
sine120@reddit
I've got a 9070XT and 64GB. With 16GB VRAM you're probably running MoEs, so a slightly better GPU won't get you much unless you care about the gaming performance.
v01dm4n@reddit
I have the exact same setup. Just bought it last month for AI.
If you have money for more ram, please get a gpu with 24G vram. Something like a used 3090/4090 or a workstation grade gpu such as a5000 if power is a concern.
24G is a perfect spot that allows you to run fp4 quantized 30B models for inference. That makes the entire spectrum accessible - qwen3-30b, llama-30b, gemma-27b, deepseek-32b, nemotron3-nano. While training, you may be able to finetune 7b or 8b models.
With 16gb 5060ti, i go back to gpt-oss for everything (starts at 100tps). For all other models, have to fall back to 14b models for inference. Have tried running qwen3-30b-a3b but it gives a constipated output at 7tps. Can't run qwen3-coder or even gemma27b effectively. :(
danuser8@reddit (OP)
What if I pair another GPU to make total VRAM 24GB or more?
v01dm4n@reddit
Ok, I have an update on this. Using llamacpp directly gives way better performance, and I am actually able to use qwen3.5 35b a3b at about 55tps. My previous recommendation was based on ollama, which sucks badly because it lacks the ability to offload a few layers to the system RAM. Currently, in my experience: llamacpp > lmstudio > ollama
danuser8@reddit (OP)
Nice, thanks for sharing
v01dm4n@reddit
Then you can do inference on slightly bigger models (using lmstudio etc) but fine-tuning on multi gpu needs distributed training (fsdp on torch), which is more work. That headache is not worth getting into unless you eventually plan to scale it on a data center cluster.
For inference, I'd recommend another 5060Ti like others said, but that'd mean spending 2x on your mobo.
Also 2 5060Tis is not the same as one 32G card. Consider some overheads for distributed setup.
RandomnameNLIL@reddit
Yeah I have your exact setup, it works perfectly for specialized 7-30B models. I am gonna upgrade to a V100 32g tho
munkiemagik@reddit
No matter what you start with, if you stay here long enough you will inevitably end up feeling like it's not enough and wishing you had more.
braydon125@reddit
I wish I never walked in the door. Where's the exit node brother...please my lab is drowning
m31317015@reddit
The exit node is the bottomless hole of cloud rental servers.
_realpaul@reddit
The comfyui sub is more accommodating these days. The image models seem to get better even at small sizes. Compressing all human knowledge requires a bit more space.
ShouldWeOrShouldntWe@reddit
AI/ML engineer here, Local LLMs are quantized to run on lesser hardware. You can run LLMs on even less hardware - even down to a stock 3060, if you choose an appropriate model. You can even run local LLMs on a mac studio efficiently without a dedicated NVIDIA card. A stock 3060 can run a good amount of 7B models by itself.
danuser8@reddit (OP)
Thanks. Are they reliable enough though?
ShouldWeOrShouldntWe@reddit
The model itself will run regardless, but output quality drops the more heavily it is quantized. Parameter count matters too: a 7B model will generally be lower quality than a 15B, and so on. And the less free RAM you have, the less context you can give the model before forgetting effects and hallucinations set in.
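To make the quantization point concrete, here is a minimal sketch that rounds some random numbers to 8-bit and 4-bit grids and measures how far they move. It uses plain symmetric round-to-nearest on made-up values, which is an assumption for illustration only. The quant formats used by real model files are more sophisticated, so treat the numbers as a trend, not a benchmark.

```python
import random

def quantize_roundtrip(weights, bits):
    """Symmetric round-to-nearest quantization, then dequantization."""
    levels = 2 ** (bits - 1) - 1          # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) * scale for w in weights]

def mean_abs_error(weights, bits):
    """Average distance between the original and the restored values."""
    restored = quantize_roundtrip(weights, bits)
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]  # stand-in for real weights

for bits in (8, 4):
    print(f"{bits}-bit mean abs error: {mean_abs_error(weights, bits):.4f}")
```

Fewer bits means a coarser grid and a larger rounding error per weight, which is where the quality loss comes from.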
danuser8@reddit (OP)
So is 24GB VRAM the sweet spot of local ai to almost forget about hallucination and forgetting effect? Thus pairing dual GPUs to achieve that much RAM?
Or alternatively 16GB VRAM and 8GB System RAM?
tmvr@reddit
Yes, it's perfectly fine, you can start with even less than you have. For this setup, you can run any dense model that's up to 7B/8B at Q8, 12B/14B at Q6 and larger 24B/27B at Q4. You can also run MoE models by splitting the model between VRAM and system RAM so that you get decent speed as well. For gpt-oss 20B you don't even need that, it fits with full 128K context into the 16GB VRAM. For Qwen3 30B A3B you will need to split, but it's just a checkbox and a slider in LM Studio, or if you use llamacpp directly it's only a commandline switch (which is on by default) in the latest releases.
Just go for it, you don't need to buy anything more to start, you are already better off than a lot of people.
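The size pairings above follow from simple arithmetic: parameters times bits per weight, divided by 8. Here is a rough sketch of that estimate. The bits-per-weight values are approximate averages I am assuming for common GGUF quant types, and real files differ a little because some tensors are stored at higher precision.

```python
# Approximate bits per weight for common GGUF quant types (assumed averages;
# real files vary because some tensors are kept at higher precision).
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q6_K": 6.6, "Q4_K_M": 4.8}

def weights_gb(params_billions: float, quant: str) -> float:
    """Estimated size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

for params, quant in [(8, "Q8_0"), (14, "Q6_K"), (24, "Q4_K_M"), (27, "Q4_K_M")]:
    print(f"{params}B at {quant}: ~{weights_gb(params, quant):.1f} GB of 16 GB VRAM")
```

Context (the KV cache) comes on top of the weights, so by this estimate the 27B end of the range is tight on 16GB and may need a smaller Q4 variant, a short context, or a few layers in system RAM.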
Expensive_Suit_6458@reddit
How would gpt oss run with 128k and fit into 16gb vram? I’m running it in ollama “q4_k_m” with 16k context, and it consumes 14gb. 20k fills vram.
Is there something to fine tune it?
tmvr@reddit
There is nothing to do, the model is like that. I'm not sure what ollama is doing, but it is probably nothing good :) Use the original MXFP4 release. This and also Nemotron 3 Nano use much less VRAM for context compared to the Qwen models for example.
Expensive_Suit_6458@reddit
I’ll try that then. Thanks
tmvr@reddit
You need FA to be on, not sure if it is on by default in ollama, I'm using llamacpp directly.
Expensive_Suit_6458@reddit
Thanks for the tip. Turned on flash attention and set the KV cache to q8, and now it runs with 128k context in 16GB 👏🏻
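For anyone wondering why the KV cache setting makes such a difference, here is a back-of-the-envelope sketch. For standard attention the cache grows linearly with context length and with bytes per stored value. The model shape below is hypothetical and chosen only to show the scaling, and the 8-bit figure ignores the small per-block overhead that real q8 formats carry.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value):
    """K and V each store n_kv_heads * head_dim values per layer per token."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 2**30

# Hypothetical model shape, chosen only to illustrate the scaling.
shape = dict(n_layers=24, n_kv_heads=8, head_dim=64)

for label, bytes_per_value in [("f16", 2.0), ("8-bit (approx.)", 1.0)]:
    for ctx in (16_384, 131_072):
        size = kv_cache_gib(context_tokens=ctx, bytes_per_value=bytes_per_value, **shape)
        print(f"{label:>16} cache, {ctx:>7} tokens: {size:.2f} GiB")
```

With this shape, going from f16 to 8-bit halves the cache at any context length. Models that use sliding-window or otherwise compact attention layers store less than this formula suggests, which would be consistent with some models needing far less VRAM for context than others.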
Wezzlefish@reddit
Some models I've messed with on my 5060Ti 16gb:
Qwen3 Image Edit 2509 (nunchakus fp4), Z-Image-Turbo (also Nunchakus fp4), Qwen2.5 coder 7b, Qwen3 14b (Nvidia nvfp4), Phi4 Mini, LTX2 fp4
A bunch of 4bit or 8bit quants basically
I've yet to try many others as I'm still trying to find "good" text generation models that fit 16gb vram. chatGPT, Copilot and Gemini have all hallucinated models that didn't exist when I ask for "top llms for 16gb vram" so I'm on the hunt manually.
NelsonMinar@reddit
Do you already have the card? If not you want to buy it right now if you can still get it at a reasonable price. I bought one for $480 last month but because NVidia is discontinuing these cards they are very hard to find now.
I've been enjoying it for casual tinkering, you can run some decent performing models on it. Instead of getting a secondary GPU I'd consider using any more money to lease time on a cloud inference engine. I know, not local...
Ancient-Car-1171@reddit
Skimp on everything and get 2x5060ti (right now or in the near future). Run them in tensor parallel. If a used GPU is acceptable to you, 1x3090 is a good starting point.
low_v2r@reddit
What MB do you use that supports dual PCIE5? My (older) MB is x4 but only does that for one device (well, - 2 if you include the m.2 drive)
Ancient-Car-1171@reddit
I use a Biostar Z690A; there are quite a few Z690 motherboards that have dual x8 PCIe 5. But just for LLM inference you don't need that much PCIe bandwidth; even with tensor parallel, PCIe 4.0 x4 is already enough.
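For a sense of scale, the theoretical link speeds can be worked out from the per-lane transfer rates. This is a small sketch of that arithmetic. It gives one-direction peak figures before protocol overhead, so real throughput will be somewhat lower.

```python
# Per-lane transfer rates in GT/s from the PCIe specs; gens 3-5 use 128b/130b encoding.
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}

def pcie_gb_per_s(gen: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    return GT_PER_S[gen] * (128 / 130) / 8 * lanes

for gen, lanes in [(4, 4), (5, 8), (5, 16)]:
    print(f"PCIe {gen}.0 x{lanes}: ~{pcie_gb_per_s(gen, lanes):.1f} GB/s")
```

The weights stay in each card's VRAM during inference, so the slot mostly carries the comparatively small activations exchanged between GPUs.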
low_v2r@reddit
Thanks
danuser8@reddit (OP)
I might have to get another 5060 Ti
luncheroo@reddit
Another 32gb would help you run larger MoE models, but you should be fine getting started in the quantized 20-30gb and smaller categories and some of those models, like nemotron and Qwen3 30b a3b are quite good (I guess about GPT-4 level for many tasks).
The 70b+ models are the real OSS beasts, but they require somewhat expensive hardware that quickly gets out of the budget of the average hobbyist. So your move there is to rent beefy GPU compute online or start putting together your own specialized agentic framework on your home computer to help smaller models punch above their weight. Depends on what you want to do. I think that's the current state of things.
Smooth-Cow9084@reddit
DDR5 is a bad starting point because it will perform really close to DDR4 but cost more than twice as much.
But really depends on your goals and such. Typically 3090 is king. I also stacked with a 5060ti and dense models got decent speed. But for casual use... Not sure, 3060 12gb might be fine
If you wanted to get serious, it'd be best to sell all and get a ddr4 setup with 1-2 3090s.
danuser8@reddit (OP)
Isn’t a 3090 good enough by itself, with enough VRAM? Why are you pairing it with another GPU?
Smooth-Cow9084@reddit
I just got a good deal and bought the 5060, but I've already sold it to buy a second 3090.
If you are doing single requests, 3090 plus ddr4 ram is most cost-efficient
Gringe8@reddit
Dont listen to the people saying to get more ram to offload to ram unless youre ok with really slow generation. You can play around with the 12b nemo models with 16gb and have an ok experience.
danuser8@reddit (OP)
Thanks, this is probably the most practical advice, but with tight budget, maybe another 5060 in parallel is the best I could do to get more VRAM. I don’t care about speed
Ecstatic-Victory354@reddit
Honestly the 5060Ti with 16GB VRAM should handle most 7B-13B models pretty well, but if you're planning to mess around with larger models down the line I'd probably go with the extra RAM first since you can always offload to system memory when VRAM runs out
danuser8@reddit (OP)
So for bigger models, is dual GPU better or more system RAM better?
Ancient-Car-1171@reddit
16GB of good DDR5 now costs almost half the price of a 16GB 5060 Ti, so 2x5060ti is your best bet.
Miserable-Dare5090@reddit
dual GPU always better, but system ram is the cheap version of better
danuser8@reddit (OP)
DDR5 32GB runs for $300 now… I could land a 3060 as dual GPU setup for similar price
Miserable-Dare5090@reddit
You could buy a 5060ti for 400 not long ago, so 32gb VRAM. Worth it. 3060 I would not bother.
dwkdnvr@reddit
They're already up to $500 with the apparent announcement that the 16GB version is being discontinued due to the RAM market problems. Seems likely they'll go higher, unfortunately
m31317015@reddit
Yeah get one before it's too late. Or go with the meta and join the 3090 gang. :D
Equivalent-Repair488@reddit
Represent!
cosimoiaia@reddit
Also most 20-24B (at q4-ish) and some 30B, especially MoE, if you tune llama.cpp and don't splurge on context.
m31317015@reddit
GPT-OSS:20B should be able to run with 50 series on 16GB VRAM w/ MXFP4. That is more than enough for getting started.
ProfitEnough825@reddit
This. It runs very well on my 5070 ti, a lot faster than expected. I'd assume the 5060 ti would be fine as well.
Relative_Rope4234@reddit
How many tokens per second and prefill rate are you getting on 5070ti?
thegompa@reddit
my 5070ti does the following (i don't think i did any tuning) :
./llama-server -fa on -m ./models/gpt-oss-20b-UD-Q4_K_XL.gguf --host 0.0.0.0 --no-warmup -ngl 99 -t 8 -c 122768 --port ${PORT} --jinja
prompt eval time = 141.93 ms / 198 tokens ( 0.72 ms per token, 1395.05 tokens per second)
eval time = 1856.38 ms / 455 tokens ( 4.08 ms per token, 245.10 tokens per second)
total time = 1998.31 ms / 653 tokens
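To answer the tokens-per-second question directly from that log: prefill rate is prompt tokens divided by prompt eval time, and generation rate is generated tokens divided by eval time. A small sketch that recomputes both from the numbers printed above:

```python
def tokens_per_second(elapsed_ms: float, tokens: int) -> float:
    """Convert a llama-server timing line into a tokens/s rate."""
    return tokens / (elapsed_ms / 1000.0)

prompt_ms, prompt_tokens = 141.93, 198   # "prompt eval time" line
eval_ms, eval_tokens = 1856.38, 455      # "eval time" line

print(f"prefill: {tokens_per_second(prompt_ms, prompt_tokens):.2f} tokens/s")
print(f"decode:  {tokens_per_second(eval_ms, eval_tokens):.2f} tokens/s")
print(f"total:   {prompt_ms + eval_ms:.2f} ms for {prompt_tokens + eval_tokens} tokens")
```

So on this run the 5070 Ti is at roughly 1395 tokens/s prefill and 245 tokens/s generation, though a 198-token prompt is too short to say much about long-context prefill.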
rerorerox42@reddit
It is more than enough to start playing with LLMs. Started with a 2070S myself
jacek2023@reddit
You can play with LLMs even without a GPU or with a 2070. There are 8B models, 4B models, and models smaller than 1B. 16GB is a good intro GPU but not the final setup.
FullOf_Bad_Ideas@reddit
yes I started with 24GB of RAM and GTX 1080. It can still allow you to train and finetune small models, albeit slower. 5060 Ti and 32GB of RAM will allow you to run models like Qwen 3 30B A3B Coder for example, which is pretty good for vibe coding. No idea about future proofing - best bet is to have stable income moreso than buying any specific hardware.
glusphere@reddit
U cant mix a 5060 with a 3060. Architectures are different. Check before u commit to buy a second GPU.
desexmachina@reddit
I would probably want to have 64GB of RAM, but you’ll be fine. Download NVIDIA's own local LLM app, it works pretty well.
SkyLordOmega@reddit
There is no such thing as future proofing.
Two things will certainly take place: Model size will increase and at the same time the quality of smaller models will improve.
mobileJay77@reddit
Given uncertain RAM and GPU prices, future proof is just speculation.
Learn with what you have. Toy with it for free. Buy tokens via API or cloud when you feel your system limits you.
Prudent-Ad4509@reddit
As others have said, you will want to run 80GB total pretty soon. But to start playing with it, 16GB is good enough. Just ignore any GPUs with less than 16/24GB unless you get them for free, they are no older than the NVIDIA 20x0 series, and you have space to install them.
Ambitious-Most4485@reddit
Yep, doable, but don't expect to go beyond 14B params (8-bit quantized) unless you use heavier quantization like q4. I have the same setup.
TallComputerDude@reddit
5060 Ti + 3060 could work, but only because 3060 has x16. You must bifurcate the lanes between cards in BIOS settings. Lmstudio.ai is probably your best bet and runs fine in Windows. You want a big PSU tho, probably 800-1000w.
Fabulous_Fact_606@reddit
Start small. LFM2.5-1.5B is working great for me on my 5080 16GB. Design the front end around it and you are golden.
o0genesis0o@reddit
That's like a tiny baby model. It runs at around 500t/s prompt processing and around 45t/s output with just CPU on my mini PC.
o0genesis0o@reddit
You can play with it, and the speed is not bad. MoE models like OSS 20B and Qwen3 30B are the limit of "usable" (both speed and context size). You can try dense 24B and 27B at Q4 and even CPU offloading, but at that point, it reminds me too much of the days when I played with LLM on a laptop with 2060 6GB. Just pain. Realistically, you would only be comfortable with LLM that are within the 7B class (comfortable means running at high quants and full context length). You wouldn't suddenly have deepseek at home with this PC.
You can also play with comfyui and most of the new models, if you are willing to wait a bit.
32GB RAM could be limiting though. For example, Comfyui has a mechanism to cache nodes. It's quite easy for Comfyui to be killed by the OS for using too much RAM with certain workflows, which have a lot of switches and conditional routing.
If I can go back in time, I would max out the ram to 96GB on my workstation for sure. The 16GB 4060TI is not great but not terrible. At least it's quiet and efficient, and when I need to, I can play whatever at 1440p and good framerate.
usernameplshere@reddit
It is! You can run very decent quants of MoE models with cpu-offloading (Google it!) in the 20-30b range. I would recommend to try GPT OSS 20b in native mxfp4, it should run very fast and has decent intelligence. More than enough to tinker around with.
Kahvana@reddit
With that setup I ran Mistral Small 3.2 24B (and rp finetunes) at IQ4_NL and 16K context. You want to enable FA, set KV quant to q8_0, and not use the mmproj (too large). Using llama.cpp or koboldcpp is advised!
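Those settings map onto llama-server flags roughly as in the sketch below. It only assembles and prints the command so you can inspect it before running. The model file name is hypothetical, and flag spellings can change between llama.cpp releases, so check `llama-server --help` on your build.

```python
import shlex

def llama_server_command(model_path: str, context: int = 16384,
                         gpu_layers: int = 99, kv_cache_type: str = "q8_0") -> list[str]:
    """Assemble a llama-server argv list; nothing is executed here."""
    return [
        "./llama-server",
        "-m", model_path,
        "-ngl", str(gpu_layers),        # layers to place on the GPU
        "-c", str(context),             # context length in tokens
        "-fa", "on",                    # flash attention
        "--cache-type-k", kv_cache_type,
        "--cache-type-v", kv_cache_type,
    ]

# Hypothetical file name; substitute the GGUF you actually downloaded.
cmd = llama_server_command("./models/mistral-small-24b-IQ4_NL.gguf")
print(shlex.join(cmd))
```

Leaving out any mmproj argument means the vision projector is simply never loaded, which is the "not use the mmproj" part.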
grabber4321@reddit
Absolutely. You can run some nice LLMs up to 24B.
Devstral-2-Small will be fantastic for this. I also found GLM-4.6v-flash does really well in agentic development.
It's not going to solve big problems, but you can piece together apps no problem.
Bigger models are definitely much better at one-shotting problems. But smaller LLMs are still fine for daily work.
EmPips@reddit
If you only use very modest context you can offload experts and probably get some solid speeds with qwen3-next-80B (iq4_xs). It's 42GB total.
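A quick sanity check on how a 42GB model would split across 16GB of VRAM and 32GB of system RAM. This is only a sketch: the reserve values for context, buffers and the OS are guesses, not measurements.

```python
def offload_plan(model_gb: float, vram_gb: float, ram_gb: float,
                 vram_reserve_gb: float = 2.0, ram_reserve_gb: float = 4.0):
    """Return (GB kept in VRAM, GB pushed to system RAM, fits?)."""
    usable_vram = vram_gb - vram_reserve_gb      # leave room for context/buffers
    usable_ram = ram_gb - ram_reserve_gb         # leave room for the OS
    in_vram = min(model_gb, usable_vram)
    in_ram = model_gb - in_vram
    return in_vram, in_ram, in_ram <= usable_ram

# 42 GB is the model size quoted above; 16 GB VRAM + 32 GB RAM is OP's machine.
# The reserve values are guesses, not measurements.
in_vram, in_ram, fits = offload_plan(model_gb=42, vram_gb=16, ram_gb=32)
print(f"VRAM: {in_vram:.0f} GB, system RAM: {in_ram:.0f} GB, fits: {fits}")
```

Under these guesses it fits with essentially no slack, which is why keeping the context modest matters here, and why more system RAM gives real headroom for models this size.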
legit_split_@reddit
You really want that extra RAM to have 80GB total system memory, which would allow you to run large models like gpt-oss-120b, glm 4.5 air, etc.
RiotNrrd2001@reddit
I have one of the crappiest GPUs that you can find, the GTX 1660 Ti. This card has 6 GB of VRAM, and does not support the "half duplex" mode that almost every other card in the world can handle.
Using LMStudio, I run LLMs on this machine just fine. Runs a little slow, but not stupidly so. Nemo actually gave me almost 150 tokens per second, although that's a tiny model whose quality is still not up to snuff for me.
You have a MUCH better GPU than I do. You will have virtually no problems running many local LLMs.