Local coding models limit
Posted by Blues520@reddit | LocalLLaMA | 17 comments
I have dual 3090s and have been running 32b coding models for a while now with Roo/Cline. While they are useful, I've only found them helpful for basic to medium-level tasks. They can start coding nonsense quite easily and have to be reined in with a watchful eye. This takes a lot of energy and focus, and your coding style changes to accommodate it. For well-defined, low-complexity tasks they are good, but beyond that I've found they can't keep up.
The next level up would be to add another 48GB of VRAM, but at that power consumption the extra intelligence is not necessarily worth it. I'd be interested to hear your experience if you're running coding models at around 96GB.
The hosted SOTA models can handle high-complexity tasks, and especially design, while still being prone to hallucination. I often use ChatGPT to discuss design and architecture, which is fine because I'm not sharing many implementation details or IP. Privacy is the main reason I'm running local: I don't feel comfortable just handing my code and IP to these companies. So I'm stuck either running 32b models that can help with basic tasks or adding more VRAM, and I'm not sure the returns are worth it unless it means running much larger models, at which point power consumption and cooling become a major factor. Would love to hear your thoughts and experiences on this.
AXYZE8@reddit
For me, GPT-OSS-120B is a major step up in coding. GLM 4.5 Air is also nice.
Try it with partial MoE expert offloading to CPU (everything else on GPU, just some of the MoE experts on CPU; with llama.cpp you can use --n-cpu-moe), and then you can add another GPU later if you want full GPU offloading for faster speeds.
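A minimal launch sketch of that idea, assuming llama.cpp's llama-server is on your PATH and you have a GGUF quant of GPT-OSS-120B; the filename and the --n-cpu-moe count below are placeholders you'd tune until everything fits in VRAM:

```python
# Hedged sketch: start llama-server with partial MoE expert offload to CPU.
# Model path and layer count are assumptions -- raise --n-cpu-moe until it fits in VRAM.
import subprocess

subprocess.run(
    [
        "llama-server",
        "-m", "gpt-oss-120b-Q4_K_M.gguf",  # hypothetical quant filename
        "--n-gpu-layers", "999",           # keep all non-expert weights on the GPUs
        "--n-cpu-moe", "24",               # expert weights of the first 24 MoE layers stay on CPU
        "--ctx-size", "32768",
        "--port", "8080",
    ],
    check=True,
)
```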
Blues520@reddit (OP)
I haven't tried either of those models, so I'll take your recommendation and give them both a shot. I'm using Roo, so hopefully the agentic support is good.
Imaginae_Candlee@reddit
Maybe this pruned version of GLM 4.5 Air in Q4-Q3 will do:
https://huggingface.co/bartowski/cerebras_GLM-4.5-Air-REAP-82B-A12B-GGUF
milkipedia@reddit
GLM 4.5 Air has given me just about everything I need for coding. I had repeated problems with tool calling in Roo with gpt-oss-120b.
aldegr@reddit
I use gpt-oss-120b with codex and I agree it’s pretty good. Need to pass along the CoT of tool calls for best results, though it’s not needed for Roo/Cline.
ParthProLegend@reddit
What is that, and how do I do it?
I run 20B-oss btw
aldegr@reddit
It's one of the quirks of gpt-oss outlined in OpenAI's docs
You can do it by sending back either the `reasoning` (Ollama/LM Studio/OpenRouter) or `reasoning_content` (llama.cpp) field for every tool call message from the assistant. I believe codex does this for `reasoning` but not `reasoning_content`, so I wrote an adapter for my own use with llama.cpp. Aside from codex, I don't know of any other client that sends back the CoT.
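For anyone curious, here's a rough sketch of that "send the CoT back" idea against a llama.cpp server's OpenAI-compatible endpoint. The URL and tool schema are made up, and Ollama/LM Studio expose the field as `reasoning` rather than `reasoning_content`:

```python
# Hedged sketch: echo gpt-oss reasoning back on tool-call turns (llama.cpp server assumed).
import requests

URL = "http://localhost:8080/v1/chat/completions"  # assumed local llama-server
tools = [{  # hypothetical tool, just to provoke a tool call
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {"type": "object", "properties": {"city": {"type": "string"}}},
    },
}]
messages = [{"role": "user", "content": "What's the weather in Paris?"}]

reply = requests.post(URL, json={"messages": messages, "tools": tools}).json()
assistant = reply["choices"][0]["message"]

# The important part: keep the model's reasoning attached to the tool-call message
# when appending it to the history, instead of dropping it like most clients do.
echoed = {
    "role": "assistant",
    "content": assistant.get("content"),
    "tool_calls": assistant.get("tool_calls"),
}
if assistant.get("reasoning_content"):  # llama.cpp's field; Ollama/LM Studio use "reasoning"
    echoed["reasoning_content"] = assistant["reasoning_content"]
messages.append(echoed)

# ...then append the {"role": "tool", ...} result and POST again for the final answer.
```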
M1ckae1@reddit
What model could I use for coding with an RTX Spark station?
ortegaalfredo@reddit
I run 12 GPUs with GLM 4.6 and it's great for anything. Power consumption is not that bad: idle is ~200W, which isn't that much heat, and with heavy usage my bill increased by about 100-150 USD monthly.
Blues520@reddit (OP)
That's not as high an electricity cost as I expected for a rig of that size. With those usage costs it might be worth it. Again, a high capital cost, but I've read that GLM 4.6 is close to SOTA, so being able to run a near-SOTA model locally with full privacy isn't bad.
alexp702@reddit
I can only speak for the 480b, and it's good. Slow, but much better. Unfortunately it's out of reach without using the cloud or pricey hardware.
Blues520@reddit (OP)
Thanks, that sounds promising. Could you share some details about your rig, and any power consumption figures if you know them?
alexp702@reddit
Oh, mine's simple: a £10K 512GB Mac Studio. It uses up to 480W, but more like 380W under load, and idle is a few watts. I run three instances of llama-server, each supporting 2 requests with 128K context, and it serves an office of 10 (though not many use it yet). It seems to manage about 14 tokens/s out and 112 tokens/s in on larger prompts. I've been building a Grafana dashboard for it: :1235 is the 480B, :1236 is a 30B. The screenshot shows a prompt from Cline, and the graphs are since yesterday.
It's a go-away-and-have-a-cup-of-tea system, but so far my results have been OK. Tips: always start a new prompt after making a change and a couple of tweaks. Word prompts with all the information in them to try to get the best results first time. Be prepared to step in after the third prompt, or you're wasting time. It can make huge changes quite fast, but small tweaks feel quicker done by hand. This makes all the difference when working with this type of system.
Interestingly, the cloud is often not actually much faster - I guess because of all the load it's under. It can be, but often isn't.
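For reference, a sketch of what launching one of those instances might look like, assuming llama-server and a local GGUF; on some llama.cpp builds --ctx-size is shared across parallel slots, so two 128K slots would need roughly double the total:

```python
# Hedged sketch: one llama-server instance with 2 parallel slots and a large context.
# Model path is a placeholder; context may be divided across slots depending on the build.
import subprocess

subprocess.run(
    [
        "llama-server",
        "-m", "480b-coder-Q4.gguf",     # hypothetical quant of the 480b model
        "--parallel", "2",              # two concurrent requests per instance
        "--ctx-size", str(2 * 131072),  # budget ~128K per slot if the context is split
        "--port", "1235",               # matches the :1235 instance mentioned above
    ],
    check=True,
)
```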
Blues520@reddit (OP)
That's an expensive setup, but you get to run huge models at relatively low power consumption. My dual 3090s use as much power as your whole rig, so the power savings could offset the cost; however, it's a high upfront investment. Good to know how capable these Macs are, both in terms of compute and power efficiency.
alexp702@reddit
Another interesting challenge comes up when working with large contexts in llama-server. The server was restarted, and you can see how it slows down loading the context back in when the context is big. It also seems to get gradually slower over time if you look at the gradients. I restarted the server with a 192K context after 128K tapped out on me. It seems optimizing context size really should be the focus of the AI tools. From reading around, even massive H200 rigs can slow down like this - albeit at a higher throughput.
RobotRobotWhatDoUSee@reddit
I agree with the other posters: gpt-oss 120B was a major step up in local LLM coding ability. The 20B model can be nearly as good, and is itself a major step up within the 20-30B total parameter range, even though it's an MoE like the 120B. Highly recommend trying out both on your setup, OP. The 120B will require --n-cpu-moe, as noted by others.
Blues520@reddit (OP)
Thanks, I'm going to try it out. The number of parameters is a good step up while being manageable to run locally.