Local LLM setup for coding (pair programming style) - GPU vs MacBook Pro?
Posted by bajis12870@reddit | LocalLLaMA | 18 comments
Hey everyone,
I'm a programmer and I'd love to use local LLMs as a kind of "superpower" to move faster in my day-to-day work.
Typical use case: I'm working on a codebase (Rust, Python, Go, or TypeScript with React/Vue), and I want the model to understand the existing project and implement new features on top of it — ideally writing code directly in my IDE, like a pair programming partner.
Right now I've tried cloud models like Claude, Qwen, ChatGPT, and GLM. Results are honestly great (especially Claude), but cost and privacy are starting to bother me — hence the interest in going local.
My current setup:
- Ryzen 9 9950X
- 96 GB DDR5 RAM
- GPU still to choose
I'm considering a few options and I'm not sure what makes the most sense:
- Option A: Add a GPU: Nvidia 5090 (~€3500) or AMD R9700 32 GB (~€1300)
- Option B: Go all-in on a MacBook Pro M5 Max (128 GB RAM, ~€7000)
My main questions:
1. Are there local LLMs that actually get close to Claude-level performance for coding tasks?
2. Are there solid benchmarks specifically for coding + codebase-aware edits?
3. Which local models are currently best for this kind of workflow?
4. How much VRAM / unified memory do you realistically need for this use case?
5. Dense vs MoE models - what works better locally?
6. Does generation speed really matter that much? (e.g. 45 tok/s vs 100+ tok/s in real usage)
7. What tools are people using for this? (IDE plugins, local agents, etc.)
8. How can I test these setups before dropping thousands on hardware?
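For the VRAM question, a rough back-of-envelope helps: weight memory scales with parameter count times bits per weight, plus a KV cache that grows with context length. A minimal sketch, with all model figures (a ~32B dense model at a ~5-bit quant, a hypothetical GQA config of 64 layers and 8 KV heads of dim 128) being illustrative assumptions, not specs of any particular model:

```python
# Hedged back-of-envelope for local-LLM memory needs.
# All model numbers below are hypothetical; real usage adds activation
# buffers and runtime overhead on top.

def weight_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB for params_b billion parameters."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache size in GB (factor 2 for K and V, fp16 default)."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Hypothetical ~32B dense model at a ~5-bit quant:
weights = weight_gb(32, 5.0)          # ~20.0 GB
# Hypothetical GQA config: 64 layers, 8 KV heads of dim 128, 32k context:
kv = kv_cache_gb(64, 8, 128, 32_768)  # ~8.6 GB
print(f"weights ~{weights:.1f} GB + KV ~{kv:.1f} GB = ~{weights + kv:.1f} GB")
```

Under those assumptions a ~32B model at Q5 with a 32k context lands just under 32 GB, which is why cards in that VRAM class keep coming up in these threads; shrinking the context or the quant is the usual lever when it doesn't fit.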
Curious to hear from people who are actually running local setups for real dev work (not just demos). What's your experience like?
qubridInc@reddit
Skip the Mac; get a strong NVIDIA GPU (5090-class if budget allows), run Qwen 3.6 or its coder variants via vLLM + Aider/OpenCode, and you’ll get the closest practical “Claude-like” local pair-programming setup today.
FederalAnalysis420@reddit
honestly i'd just rent a gpu on runpod or vast for an afternoon and actually test models before spending your own money. that would answer most of your questions faster than any benchmark will.
if you still want to buy, the 5090 should run the smaller dense models fast enough that agent loops actually feel responsive, and the mac lets you run bigger moe models but the speed drop is real. for pure coding work i'd probably lean 5090.
privacy's a good reason to go local. on pure cost though, claude api tends to come out cheaper than people expect once you actually do the math.
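The "do the math" point above can be sketched as a break-even calculation: hardware cost divided by the monthly API spend you'd avoid, minus local running costs. All the euro figures here are hypothetical placeholders, not real quotes:

```python
# Hedged sketch: months until a GPU purchase pays for itself vs. an API bill.
# Every number below is a made-up placeholder for illustration.

def breakeven_months(hardware_eur: float, api_eur_per_month: float,
                     power_eur_per_month: float = 0.0) -> float:
    """Months to recover the hardware cost, net of local electricity."""
    monthly_saving = api_eur_per_month - power_eur_per_month
    if monthly_saving <= 0:
        return float("inf")  # local never pays off on cost alone
    return hardware_eur / monthly_saving

# Example: a €3500 5090 vs a hypothetical €100/month API bill,
# assuming ~€25/month extra electricity for the GPU box:
print(f"{breakeven_months(3500, 100, 25):.1f} months")  # ~46.7 months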
erizon@reddit
I assume many people switching to local are future-proofing. There are two polar-opposite scenarios, both plausible soonish (let's say in 0.5-2 years):
1. The AI bubble bursts, and the market gets flooded with cheap GPUs that were previously denied to it.
2. API costs increase by an order of magnitude, and demand for local compute explodes, making it very expensive unless you already purchased the hardware earlier.
FullOf_Bad_Ideas@reddit
I think neither of them will happen. Some models will get more expensive on API, some won't, and GPUs will still be obtainable, just at high prices. Essentially, the same thing that has been happening for the last 3+ years.
Erdnalexa@reddit
I bought a 5090 FE from Nvidia last October at about €2k (France). Is this not an option anymore? (That’s an actual question)
Important_Coach9717@reddit
We found the Joker
Erdnalexa@reddit
What?
yes_i_tried_google@reddit
You could sell it and make a 50% profit with today’s prices
Erdnalexa@reddit
How do I play then? 4k@120fps on AAA is hard to run, even with the 5090 (I hate how framegen looks and its latency).
thrownawaymane@reddit
You wait for Nvidia to bother making a 6090 and pay €3.5k for it :)
melspec_synth_42@reddit
if you can swing a 5090, the 32GB of vram is a game changer for running 35B models at decent quants. mac is better as a daily driver, but for raw inference throughput nothing beats nvidia right now
Pretend_Engineer5951@reddit
Own_Mix_3755@reddit
Just a question - what were you missing in Roo that you switched?
Pretend_Engineer5951@reddit
I was looking for a feature to override the system prompt. As I found out, it was available at some point in the past, but then the developers ripped it out. Kilo hasn't done that yet.
No-Anchovies@reddit
Coming from an "unlimited resources" place of work, it has been a very humbling and grounding learning experience to compartmentalise personal projects just small enough that I can actually throw some AI at them to patch or refactor. Personally I believe it's hard to beat the convenience of running Linux & Nvidia. Full plug-and-play on popOS has been a very relaxing experience.
HugeEntertainment820@reddit
I’ve been using qwen 3.6 for the last day and I’m impressed. But asking whether it can do professional work is a much bigger question than simply whether there's a model on the level of Claude Code. Is your app for 1,000 people, 30k, or more? What's your tech stack, etc.?
iamapizza@reddit
The answer to 1 is no, and also: it depends on what you're doing. For some people it is good enough and for some it's not, forcing them to adapt.
Your best bet is to add to what you currently have. If you can get a decent gpu, you can get started pretty quickly with a local setup and see which one works for you. With your cpu and a 5090 you'll get some really good speeds.
On the other hand if this is for your job maybe still consider a third party. If not Claude then maybe GitHub copilot.
alexwh68@reddit
I can’t answer all your questions, but here are some answers.
No, reset your expectations.
I use either qwen coder or the new 3.6 version. I am using a Q6 locally.
I have a 96gb ram MBP and it works well.
Llama-server with opencode
Key thing here is I am using Cursor for the more thinking-heavy tasks and local for the more boilerplate, repetitive tasks. Local is slower for sure, but my workflow has changed a bit; I am a freelancer working from home. I asked qwen to build all the code for 4 new tables based on the existing project. Had breakfast, came back, all done: repositories, interfaces, services, DTOs and basic Blazor pages. That is roughly 4 hours' work by hand, or roughly 2 hours of copy-and-pasting. So minimum saving today: 2 hours.
My goal is to cut down on api usage where sensible.