GPU advice for running local coding LLMs
Posted by mak3rdad@reddit | LocalLLaMA | 13 comments
I’ve got a Threadripper 3995WX (64c/128t), 256GB RAM, plenty of NVMe, but no GPU. I want to run big open-source coding models like CodeLlama, Qwen-Coder, StarCoder2 locally, something close to Claude Code. If possible ;)
Budget is around $6K. I’ve seen the RTX 6000 Ada (48GB) suggested as the easiest single-card choice, but I also hear dual 4090s or even older 3090s could be better value. I’m fine with quantized models if the code quality is still pretty good.
Anyone here running repo-wide coding assistants locally? What GPUs and software stacks are you using (Ollama, vLLM, TGI, Aider, Continue, etc.)? Is it realistic to get something close to Claude Code performance on large codebases with current open models?
Thanks for any pointers before I spend the money on the GPU!
Financial_Stage6999@reddit
Tried quad 3090 and 5090 with 9950X and 256GB RAM. Not usable for agentic coding flow. Offloading makes everything too slow. Quad 3090 is too hot and noisy. Ended up leasing a Mac Studio.
Steus_au@reddit
What do you get from it? And what model, please?
Financial_Stage6999@reddit
We tried various Nvidia setups in our lab over the year. Key takeaway is that genuinely useful models for coding start at around 100B MoE in size. We are enjoying GLM 4.5 Air at Q8. At this size, and once context fills up, they don't fit into the VRAM of any reasonable consumer-level Nvidia-based setup. Once you offload to RAM, performance drops to the point where a quick agentic iteration loop becomes impossible.
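Rough numbers behind that, as a back-of-envelope sketch (parameter counts and bandwidth figures below are approximate spec-sheet values I'm assuming, not measurements):

```python
# Back-of-envelope: does a ~100B MoE at Q8 fit in VRAM, and what happens when it spills?
# All figures are rough assumptions, not measurements.

TOTAL_PARAMS_B = 106       # GLM 4.5 Air: ~106B total parameters (MoE)
ACTIVE_PARAMS_B = 12       # ~12B parameters active per token
BYTES_PER_PARAM_Q8 = 1.07  # Q8_0 GGUF is a bit over 1 byte/param once scales are included

weights_gb = TOTAL_PARAMS_B * BYTES_PER_PARAM_Q8  # ~113 GB of weights alone
kv_cache_gb = 10                                  # generous guess for a long coding context

for name, vram in {"1x 3090/4090": 24, "2x 3090": 48, "RTX 6000 Ada": 48, "4x 3090": 96}.items():
    fits = weights_gb + kv_cache_gb <= vram
    print(f"{name:>13}: {vram} GB VRAM -> {'fits' if fits else 'spills into system RAM'}")

# Decode speed is roughly memory-bandwidth-bound: each generated token streams the
# *active* expert weights from wherever they happen to live.
active_gb = ACTIVE_PARAMS_B * BYTES_PER_PARAM_Q8
for name, bw_gbs in {"GDDR6X (3090)": 936, "8-ch DDR4-3200": 205, "2-ch DDR5-6000": 96}.items():
    print(f"{name:>15}: ~{bw_gbs / active_gb:.0f} tok/s upper bound")
```

Even those optimistic DDR-bound ceilings are several times lower than keeping the active weights in GDDR, before counting prompt processing, which is what makes a quick agentic loop painful.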
Monad_Maya@reddit
Local models cannot realistically compete with the big players. You need a lot of compute and VRAM to make it work.
Spend the 6k on tokens and call it a day. I know it's Localllama but be realistic.
mak3rdad@reddit (OP)
What about for local LLaMA, then? What is realistic? What is expected?
Monad_Maya@reddit
Matching cloud models' performance with something running locally: we are not there yet.
Alarmed_Till7091@reddit
If wattage is no object and you have the system for it, a whole bunch of 3090s is a pretty solid option.
When it comes to local models, the two most important things are VRAM size and VRAM speed. A 4090 is faster as a card, but it's limited by the fact that it has roughly the same memory bandwidth and the same 24GB capacity as a 3090, so it ends up being only ~30% faster than a 3090. Not really worth the cost overhead unless you are doing other compute with your machine as well.
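To put rough numbers on the size/speed point (spec-sheet bandwidths; the model size below is just a placeholder):

```python
# Why a 4090 barely moves the needle over a 3090 for token generation (rough spec numbers).
cards = {
    # name: (VRAM in GB, memory bandwidth in GB/s)
    "RTX 3090": (24, 936),
    "RTX 4090": (24, 1008),
}

# Decoding is mostly memory-bandwidth-bound: each token streams the quantized
# weights once, so tok/s is roughly bandwidth / model size.
model_gb = 20  # placeholder: e.g. a ~32B dense model at Q4
for name, (vram_gb, bw_gbs) in cards.items():
    print(f"{name}: {vram_gb} GB VRAM, ~{bw_gbs / model_gb:.0f} tok/s decode ceiling on a {model_gb} GB model")

print(f"Bandwidth ratio: {cards['RTX 4090'][1] / cards['RTX 3090'][1]:.2f}x")
# The 4090's extra compute mainly shows up in prompt processing (prefill),
# which is where most of that overall ~30% gap comes from.
```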
You could possibly run Q4/Q5 GLM 4.5, or GLM 4.5 Air unquantized, on your system. They are pretty solid models on most benchmarks, but idk how much performance is lost in the GLM Q4 quant.
mak3rdad@reddit (OP)
I think I can fit at best 2 on my motherboard.
the-supreme-mugwump@reddit
I have a similar setup with 2x 3090s. It runs models up to 70B amazingly well; I sometimes run gpt-oss-120b, but it can't do full GPU offload.
Alarmed_Till7091@reddit
2x 3090 would be 48GB for around $2000. Combined with your 256GB of system RAM, that gives you a theoretical ~296GB budget for a MoE model with up to around 32B active FP8 params. So Q6-Q8 Qwen 235B and GLM 4.5 Air, or Q4-Q5 GLM 4.5.
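Where that ~296GB figure roughly comes from (a sketch; the headroom and quant sizes are guesses):

```python
# Rough budget for a big MoE on 2x 3090 (48 GB VRAM) plus 256 GB system RAM.
# Idea: keep the ~32B active params (FP8, ~1 byte each) and KV cache in VRAM,
# and offload the inactive experts to system RAM.

vram_gb = 2 * 24
system_ram_gb = 256
os_headroom_gb = 8        # guess: leave some RAM for the OS and tooling

active_fp8_gb = 32 * 1.0  # ~32B active params at FP8 -> ~32 GB, fits in 48 GB
kv_cache_gb = vram_gb - active_fp8_gb

total_weight_budget_gb = vram_gb + (system_ram_gb - os_headroom_gb)
print(f"Active weights in VRAM : ~{active_fp8_gb:.0f} GB of {vram_gb} GB")
print(f"Left over for KV cache : ~{kv_cache_gb:.0f} GB")
print(f"Total weight budget    : ~{total_weight_budget_gb:.0f} GB")

# Approximate GGUF sizes for the models mentioned above (ballpark):
#   Qwen 235B at Q6-Q8      -> ~190-250 GB  -> fits
#   GLM 4.5 (355B) at Q4-Q5 -> ~200-250 GB  -> fits
#   GLM 4.5 Air at Q8       -> ~110 GB      -> fits easily
```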
NoVibeCoding@reddit
Paying for tokens will be cheaper than building a local setup and will offer you much more flexibility in the choice of models.
If you still want to build, I recommend renting various GPU configurations and testing your application on Runpod or VastAI.
You can also rent RTX 4090, 5090 and PRO 6000 on our website https://www.cloudrift.ai/
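If you want to sanity-check the tokens-vs-hardware tradeoff yourself, here's a simple break-even sketch; every price and usage number below is a placeholder to replace with current figures, not a quote:

```python
# Break-even sketch: $6K of hardware vs. paying per token.
# Every number below is a placeholder -- plug in real prices and your actual usage.

hardware_cost_usd = 6000
power_draw_kw = 0.6             # placeholder: GPUs + rest of the box under load
electricity_usd_per_kwh = 0.15  # placeholder
hours_per_day = 8

api_usd_per_mtok = 3.0          # placeholder blended price per 1M tokens
tokens_per_day_m = 5.0          # placeholder: millions of tokens/day of agentic coding

daily_api_cost = api_usd_per_mtok * tokens_per_day_m
daily_power_cost = power_draw_kw * hours_per_day * electricity_usd_per_kwh
daily_savings = daily_api_cost - daily_power_cost

if daily_savings > 0:
    print(f"Hardware pays for itself after ~{hardware_cost_usd / daily_savings:.0f} days")
else:
    print("At these numbers, paying for tokens is cheaper outright")
```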
grannyte@reddit
If budget is no object, go RTX 6000, 96 or 48 GB.
If budget is a bit of a concern and your models need compute, go the multiple-3090 route.
If budget is a big concern, your models only need VRAM, and you have time to mess around, get some used V620s.
jacek2023@reddit
llama.cpp and any number of 3090s you can fit
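For what that looks like in practice, a minimal sketch using the llama-cpp-python bindings (the GGUF path is a placeholder; llama.cpp's llama-server takes the equivalent -ngl / --tensor-split flags):

```python
# Minimal sketch: load one GGUF model across two 3090s with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="models/your-coding-model.Q4_K_M.gguf",  # placeholder path -- pick a quant that fits
    n_gpu_layers=-1,           # -1 = offload all layers to GPU
    tensor_split=[1.0, 1.0],   # even split across two identical cards
    n_ctx=32768,               # big context for repo-wide work; costs VRAM
    verbose=False,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python function that parses a .gitignore file."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```

From there, tools like Aider or Continue can talk to llama-server's OpenAI-compatible endpoint instead of a cloud API.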