GPU advice for running local coding LLMs
Posted by mak3rdad@reddit | LocalLLaMA | 13 comments
I’ve got a Threadripper 3995WX (64c/128t), 256GB RAM, plenty of NVMe, but no GPU. I want to run big open-source coding models like CodeLlama, Qwen-Coder, StarCoder2 locally, something close to Claude Code. If possible ;)
Budget is around $6K. I’ve seen the RTX 6000 Ada (48GB) suggested as the easiest single-card choice, but I also hear dual 4090s or even older 3090s could be better value. I’m fine with quantized models if the code quality is still pretty good.
Anyone here running repo-wide coding assistants locally? What GPUs and software stacks are you using (Ollama, vLLM, TGI, Aider, Continue, etc.)? Is it realistic to get something close to Claude Code performance on large codebases with current open models?
Thanks for any pointers before I spend the money on the GPU!
Financial_Stage6999@reddit
Tried quad 3090 and 5090 with 9950X and 256GB RAM. Not usable for agentic coding flow. Offloading makes everything too slow. Quad 3090 is too hot and noisy. Ended up leasing a Mac Studio.
Steus_au@reddit
What do you get from it? And what model, please?
Financial_Stage6999@reddit
We tried various Nvidia setups in our lab over the year. Key takeaway is that genuinely useful models for coding start at around 100B MoE in size. We are enjoying GLM 4.5 Air at Q8. At this size, and once context fills up, they don't fit into the VRAM of any reasonable consumer-level Nvidia-based setup. Once you offload to RAM, performance drops to the point where a quick agentic iteration loop becomes impossible.
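Rough numbers behind that, as a back-of-envelope sketch (parameter counts and bandwidth figures below are approximate spec-sheet values I'm assuming, not measurements):

```python
# Back-of-envelope: does a ~100B MoE at Q8 fit in VRAM, and what happens when it spills?
# All figures are rough assumptions, not measurements.

TOTAL_PARAMS_B = 106       # GLM 4.5 Air: ~106B total parameters (MoE)
ACTIVE_PARAMS_B = 12       # ~12B parameters active per token
BYTES_PER_PARAM_Q8 = 1.07  # Q8_0 GGUF is a bit over 1 byte/param once scales are included

weights_gb = TOTAL_PARAMS_B * BYTES_PER_PARAM_Q8  # ~113 GB of weights alone
kv_cache_gb = 10                                  # generous guess for a long coding context

for name, vram in {"1x 3090/4090": 24, "2x 3090": 48, "RTX 6000 Ada": 48, "4x 3090": 96}.items():
    fits = weights_gb + kv_cache_gb <= vram
    print(f"{name:>13}: {vram} GB VRAM -> {'fits' if fits else 'spills into system RAM'}")

# Decode speed is roughly memory-bandwidth-bound: each generated token streams the
# *active* expert weights from wherever they happen to live.
active_gb = ACTIVE_PARAMS_B * BYTES_PER_PARAM_Q8
for name, bw_gbs in {"GDDR6X (3090)": 936, "8-ch DDR4-3200": 205, "2-ch DDR5-6000": 96}.items():
    print(f"{name:>15}: ~{bw_gbs / active_gb:.0f} tok/s upper bound")
```

Even those optimistic DDR-bound ceilings are several times lower than keeping the active weights in GDDR, before counting prompt processing, which is what makes a quick agentic loop painful.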
Monad_Maya@reddit
Local models cannot realistically compete with the big players. You need a lot of compute and VRAM to make it work.
Spend the 6k on tokens and call it a day. I know it's Localllama but be realistic.
mak3rdad@reddit (OP)
What about for local LLaMA, then? What is realistic? What is expected?
Monad_Maya@reddit
Matching cloud models' performance with something running locally: we are not there yet.
Alarmed_Till7091@reddit
If wattage is no object and you have the system for it, a whole bunch of 3090s is a pretty solid option.
When it comes to local models, the two most important things are VRAM size and VRAM speed. A 4090 is faster as a card, but it's limited by the fact that it has roughly the same memory bandwidth and the same 24GB capacity as a 3090, so it ends up being only ~30% faster than a 3090. Not really worth the cost overhead unless you are doing other compute with your machine as well.
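To put rough numbers on the size/speed point (spec-sheet bandwidths; the model size below is just a placeholder):

```python
# Why a 4090 barely moves the needle over a 3090 for token generation (rough spec numbers).
cards = {
    # name: (VRAM in GB, memory bandwidth in GB/s)
    "RTX 3090": (24, 936),
    "RTX 4090": (24, 1008),
}

# Decoding is mostly memory-bandwidth-bound: each token streams the quantized
# weights once, so tok/s is roughly bandwidth / model size.
model_gb = 20  # placeholder: e.g. a ~32B dense model at Q4
for name, (vram_gb, bw_gbs) in cards.items():
    print(f"{name}: {vram_gb} GB VRAM, ~{bw_gbs / model_gb:.0f} tok/s decode ceiling on a {model_gb} GB model")

print(f"Bandwidth ratio: {cards['RTX 4090'][1] / cards['RTX 3090'][1]:.2f}x")
# The 4090's extra compute mainly shows up in prompt processing (prefill),
# which is where most of that overall ~30% gap comes from.
```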
You could possibly run Q4/Q5 GLM 4.5, or GLM 4.5 Air unquantized, on your system. They are pretty solid models on most benchmarks, but idk how much performance is lost in the GLM Q4 quant.
mak3rdad@reddit (OP)
I think I can fit at best 2 on my motherboard.
the-supreme-mugwump@reddit
I have a similar setup with 2x 3090s. It runs models up to 70B amazingly well; I sometimes run gpt-oss-120b, but it can't do full GPU offload.
Alarmed_Till7091@reddit
2x 3090 would be 48GB for around $2000. Combined with your 256GB of system RAM, that gives you a theoretical ~296GB budget for a MoE model with up to around 32B active FP8 params. So Q6-Q8 Qwen 235B and GLM 4.5 Air, or Q4-Q5 GLM 4.5.
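Where that ~296GB figure roughly comes from (a sketch; the headroom and quant sizes are guesses):

```python
# Rough budget for a big MoE on 2x 3090 (48 GB VRAM) plus 256 GB system RAM.
# Idea: keep the ~32B active params (FP8, ~1 byte each) and KV cache in VRAM,
# and offload the inactive experts to system RAM.

vram_gb = 2 * 24
system_ram_gb = 256
os_headroom_gb = 8        # guess: leave some RAM for the OS and tooling

active_fp8_gb = 32 * 1.0  # ~32B active params at FP8 -> ~32 GB, fits in 48 GB
kv_cache_gb = vram_gb - active_fp8_gb

total_weight_budget_gb = vram_gb + (system_ram_gb - os_headroom_gb)
print(f"Active weights in VRAM : ~{active_fp8_gb:.0f} GB of {vram_gb} GB")
print(f"Left over for KV cache : ~{kv_cache_gb:.0f} GB")
print(f"Total weight budget    : ~{total_weight_budget_gb:.0f} GB")

# Approximate GGUF sizes for the models mentioned above (ballpark):
#   Qwen 235B at Q6-Q8      -> ~190-250 GB  -> fits
#   GLM 4.5 (355B) at Q4-Q5 -> ~200-250 GB  -> fits
#   GLM 4.5 Air at Q8       -> ~110 GB      -> fits easily
```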
NoVibeCoding@reddit
Paying for tokens will be cheaper than building a local setup and will offer you much more flexibility in the choice of models.
If you still want to build, I recommend renting various GPU configurations and testing your application on Runpod or VastAI.
You can also rent RTX 4090, 5090 and PRO 6000 on our website https://www.cloudrift.ai/
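If you want to sanity-check the tokens-vs-hardware tradeoff yourself, here's a simple break-even sketch; every price and usage number below is a placeholder to replace with current figures, not a quote:

```python
# Break-even sketch: $6K of hardware vs. paying per token.
# Every number below is a placeholder -- plug in real prices and your actual usage.

hardware_cost_usd = 6000
power_draw_kw = 0.6             # placeholder: GPUs + rest of the box under load
electricity_usd_per_kwh = 0.15  # placeholder
hours_per_day = 8

api_usd_per_mtok = 3.0          # placeholder blended price per 1M tokens
tokens_per_day_m = 5.0          # placeholder: millions of tokens/day of agentic coding

daily_api_cost = api_usd_per_mtok * tokens_per_day_m
daily_power_cost = power_draw_kw * hours_per_day * electricity_usd_per_kwh
daily_savings = daily_api_cost - daily_power_cost

if daily_savings > 0:
    print(f"Hardware pays for itself after ~{hardware_cost_usd / daily_savings:.0f} days")
else:
    print("At these numbers, paying for tokens is cheaper outright")
```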
grannyte@reddit
If budget is no object, go RTX 6000, 96 or 48 GB.
If budget is a bit of a concern and your models need compute, go the multiple-3090 route.
If budget is a big concern, your models only need VRAM, and you have time to mess around, get some used V620s.
jacek2023@reddit
llama.cpp and any number of 3090s you can fit
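For what that looks like in practice, a minimal sketch using the llama-cpp-python bindings (the GGUF path is a placeholder; llama.cpp's llama-server takes the equivalent -ngl / --tensor-split flags):

```python
# Minimal sketch: load one GGUF model across two 3090s with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="models/your-coding-model.Q4_K_M.gguf",  # placeholder path -- pick a quant that fits
    n_gpu_layers=-1,           # -1 = offload all layers to GPU
    tensor_split=[1.0, 1.0],   # even split across two identical cards
    n_ctx=32768,               # big context for repo-wide work; costs VRAM
    verbose=False,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python function that parses a .gitignore file."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```

From there, tools like Aider or Continue can talk to llama-server's OpenAI-compatible endpoint instead of a cloud API.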