Local coding agents. Am I missing something?
Posted by SpaceKuh@reddit | LocalLLaMA | View on Reddit | 25 comments
I'm an experienced software dev who has been using various LLMs and tools to write code for the past few years. My hardware isn't the greatest for AI, with a 4070ti and 64gb ddr5, but I can run a few smaller models. I tried out Gemma E4B, Gemma 26b and different Devstral models.
In the Ollama chat window they work great; especially the smaller models that fit into my vram are incredibly fast. Sure, the results can't compete with frontier models like Gemini, Opus and Codex, but they are alright. All of that completely falls apart when I use them as coding agents, though. I tried them with GitHub Copilot and Continue in VS Code, and more often than not they would just spin in circles, outright fail, or throw errors. Is this the state of local AI currently, where chat is slowly getting alright but agentic coding is still off the table if you don't have a personal datacenter at home? I know my hardware isn't optimal, but I hear of people running these things on laptops, and I have no idea how these agents can compete even with the cheapest commercial models right now.
Did I miss a fundamental step in my setup? (I just installed ollama, installed the models, tried them out, maybe adjusted GPU layers to preserve some vram and added them in continue/Copilot)
Or is this the state of local coding agents right now?
thank you!
abnormal_human@reddit
You're running models that are 10x or more smaller than frontier models. It's going to be a lot worse!
Quantizing small models (as Ollama typically does) doesn't help the situation. You may also have harness issues: either a harness that doesn't fit the model's expectations, or errors setting up tool calling or reasoning parsers.
If you run the 200-800B models and get the details right you'll find that they make half-decent coding agents. Not Opus or GPT-5.4, but far from useless.
SpaceKuh@reddit (OP)
Running these models locally would probably cost a small fortune so I guess I'll stick to subscriptions for agentic coding and just wait for the future :) thank you!
unjustifiably_angry@reddit
It depends on whether you're using them for work. I burn through millions of tokens a week so offloading 90% of that to a competent local model means I'm saving potentially hundreds of dollars a day. In that context, local AI hardware seems cheap. ROI within months, possibly even weeks.
There's a certain psychological effect too. With online models I cringe every time I make a request because it might cost anywhere from a few cents to tens of dollars and might not even get it right on the first try. It creates a strong reluctance to even engage with it. If you've bought the hardware upfront it has the opposite effect, it makes you eager to use it because you want to get your money's worth. You're more willing to experiment, take risks, double-check things, etc.
TheElectricBrit@reddit
What interface do you run your models in out of curiosity? I’ve used a mix of cursor and codex for over a year, recently switched to cursor because it seems a lot more competent, but likewise I cringe at the thought of giving it menial repetitive tasks because it’s burning money.
All while I sit on my 4090 doing nothing. I’d love a local LLM mainly for writing repetitive yaml and tasks that don’t require much “thinking”. Down the line I’d definitely love a “does everything” local AI with agents for different tasks so understand the need for more VRAM then. Appreciate the comments in here, it’s informative.
unjustifiably_angry@reddit
I'm still researching more advanced setups, currently just using VS Code & Roo extension. Cline extension is alright too but Roo has a few more presets built in and seemed to be a better choice for my weird local setup (too long to explain). It's working well enough for my purposes but I imagine there's a better way to go. The interface is friendly and intuitive at least.
iMrParker@reddit
I mean, your machine is capable of doing this given you offload model experts to the CPU and use a modern MoE model. You just need the right settings. It won't be Claude Opus level, but it'll get most jobs done if you prompt well and have patience
kongpjuter@reddit
My guess is the context window. Ollama's default is 4K, but this is usually filled up by the pre-prompt when using Claude Code etc. This means your prompt goes over the context window on the very first request, completely breaking everything. Try gemma4 but set the context window to 120,000 instead of the 4,000 default
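A minimal sketch of what that looks like if you hit the server programmatically, assuming Ollama's REST `/api/chat` endpoint: each request can carry an `options` object where `num_ctx` overrides the default context window. The model name and prompt below are just placeholders.

```python
import json

# Sketch: Ollama's /api/chat accepts a per-request "options" object;
# "num_ctx" overrides the context window, which otherwise defaults to
# something small like 4096. Model name is a placeholder.
def build_chat_request(model, prompt, num_ctx=131072):
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "options": {"num_ctx": num_ctx},  # raise the context limit
        "stream": False,
    }

body = build_chat_request("my-local-model", "Refactor this function...")
print(json.dumps(body, indent=2))
```

POST that body to `http://localhost:11434/api/chat` and the model runs with the larger window for that request, without editing a Modelfile.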
SpaceKuh@reddit (OP)
Wow, I will try that out, copilot and continue recognized the context size as much lager but apparently that was just the models max model context window....
Imaginary_Panda1474@reddit
How did it turn out?
ai_guy_nerd@reddit
Local agentic coding is definitely in a weird spot right now. The gap between 'chatting with a model' and 'running a loop' usually comes down to how the agent framework handles tool-call formatting. Even if the model is smart, if the wrapper expects a perfect JSON block and the model misses one bracket, the whole thing spins or crashes.
Small models often struggle with the strict syntax required for long-running agent loops compared to the loose nature of a chat window. Using something like OpenClaw or a more robust framework that handles the orchestration and error-correction can help, but the hardware bottleneck is real when you need a large context window to keep the agent from losing the plot.
It is less about the GPU and more about the model's ability to adhere to the system prompt over multiple turns. Most 'local' agents are just thin wrappers that break the moment the model deviates from a very specific output pattern.
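To make that concrete, here's a minimal sketch (function names are made up) of the difference between the strict parse that dies on one missing bracket and the kind of lenient extraction a more robust harness does:

```python
import json
import re

# Sketch of why agent loops break: the harness expects a clean JSON tool
# call, but small models often wrap it in prose or drop a bracket.
# A slightly lenient parser recovers some of these cases.
def extract_tool_call(model_output: str):
    # Fast path: the whole response is valid JSON.
    try:
        return json.loads(model_output)
    except json.JSONDecodeError:
        pass
    # Fallback: pull the first {...} span out of surrounding prose.
    match = re.search(r"\{.*\}", model_output, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            return None  # caller should re-prompt the model
    return None

ok = extract_tool_call('Sure! {"tool": "read_file", "path": "main.py"}')
bad = extract_tool_call('{"tool": "read_file", "path": "main.py"')  # missing bracket
```

A thin wrapper only implements the fast path, so the `ok` case above (JSON wrapped in chatter) already crashes it, and the `bad` case loops forever instead of triggering a retry.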
crantob@reddit
If you want the llm to spit out software you don't understand, go agentic.
If you want assistance writing software you understand, use the llm as (another) very knowledgeable but occasionally completely idiotic coworker.
I'm well trained for this interaction model by 35 years in IT.
unjustifiably_angry@reddit
I'm shit at reading code but I have the AI generate detailed comments for every function so I can easily recognize where a problem is happening. Little different from an average project manager.
iMrParker@reddit
Tweak settings. Ollama has famously bad default params; check the model card's recommended settings for tool calls and use those. You'll have better luck with llama.cpp, or even LM Studio tbh
I run Qwen3.5 27b and 122b with 128k context for agentic coding for my job and it's been a very good experience
SpaceKuh@reddit (OP)
What kind of hardware do I need for this to be a good experience? I'm afraid my 4070ti with 12gb vram won't be even remotely enough. So I assume that you use llama cpp for your work?
unjustifiably_angry@reddit
The cheapest option for local AI is a 32GB card running Qwen 3.5-27B or another similar-sized model. Any smaller than a 32GB card and you're going to be running something very compromised.
In my opinion the sweet spot currently is somewhere between 96GB and 384GB. You can go smaller but right now there's kind of a gulf between the ~30B models and the ~120B models, not a lot in-between.
iMrParker@reddit
Your hardware is capable given you use an MoE model and offload experts to the CPU. You may have to use a lower quant and adjust the settings. Gemma 26b is actually a good choice; otherwise a lower quant of Qwen3.5 35b
billy_booboo@reddit
MoE models at that size probably aren't gonna cut it. You'll need something like a 80b-120b MoE model or one of the nice dense models around 30b params to start getting reasonable results.
Hopefully in the next year we'll get models that work on your GPU and are good enough for local coding. It certainly seems possible with all these new updates.
GoodTip7897@reddit
Honestly Gemma 4 26b and qwen 3.5 35B are good agents for me for making some simple changes and testing stuff.
I use bf16 kv cache and I use q5_k_s for Gemma 26b and q4_k_xl for qwen 35b.
Sometimes the 35b makes stupid syntax errors but usually it's fine. They both can do 90% of what I need and qwen 3.5 27B handles the rest. All of those models are genuinely coherent at 64k+ context for long agentic loops
MrB0janglez@reddit
yeah this is pretty much the state of it right now. local models are solid for chat and autocomplete but they fall apart as full coding agents because it requires consistent multi-step reasoning and reliable tool-use. the smaller models that fit consumer vram just aren't there yet for that.
what's actually worked for me: use local for autocomplete/chat, but route actual agent tasks through claude or gpt-4o via API. keeps costs way lower than you'd think since agents use fewer tokens than expected for most tasks.
if you want to go all-local, qwen2.5-coder-32b is the best i've tried for pure local agentic work. but honestly at the 4070ti tier you're going to hit a wall. it's not a setup issue, it's a capability gap that's slowly closing.
SpaceKuh@reddit (OP)
Thank you! I guess I'll just have to wait and see; the subscriptions are performing well and the price is alright. I'm not really willing to buy hardware just to toy around at current prices, and there are other things I can do with these smaller LLMs besides agentic coding :) thank you for your response!
my_name_isnt_clever@reddit
Disregard that comment, it's a bot spreading misinformation. Qwen 2.5 is ancient and not worth using now, but many models used for spam don't know about anything newer.
iMrParker@reddit
I strongly recommend against using a Qwen2.5 model. Idk if that comment is a bot reply or not
Motor-Sky-3189@reddit
I've recently purchased a 5090 with 96gb ram setup, will I be able to run anything productive agentic-wise? I was never in the local-llm loop, so I genuinely don't know
billy_booboo@reddit
Yes absolutely. Try gemma4 31b! You can also run larger MoE models like qwen3-coder-next or qwen3.5 122b.
ag789@reddit
what I tried is actually with llama.cpp
https://github.com/ggml-org/llama.cpp
as I find it leaner than say ollama which probably runs it in a container.
I tried Qwen 3 coder 30 B
https://www.reddit.com/r/LocalLLaMA/comments/1sf8zp8/qwen_3_coder_30b_is_quite_impressive_for_coding/
and just currently QWen 3.5 35B A3B
https://huggingface.co/collections/unsloth/qwen35
https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF
https://www.reddit.com/r/LocalLLaMA/comments/1sjprna/qwen_35_28b_a3b_reap_for_coding_initial/
To temper your expectations a little, if you consider that the 'high end' models are as large as 122 Billion parameters
https://huggingface.co/Qwen/Qwen3.5-122B-A10B
and perhaps 397 Billion parameters
https://huggingface.co/Qwen/Qwen3.5-397B-A17B
and QWen coder next 80 Billion parameters
https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF
if you consider a Q8 quantization, that alone is roughly 122 GB, 397 GB and 80 GB of memory (dram + vram) requirements.
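As a back-of-envelope check, you can estimate the weights-only footprint from bits-per-weight; the figures below are rough averages for each GGUF quant type (assumptions, not exact), and ignore KV cache and runtime overhead:

```python
# Rough weights-only memory estimate for GGUF quants: bits per weight
# times parameter count. Bits-per-weight values are approximate averages
# per quant type (e.g. Q8_0 stores ~8.5 bits per weight including scales).
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q4_K_M": 4.8, "F16": 16.0}

def weights_gb(params_billion: float, quant: str) -> float:
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 1e9  # GB, weights only

for n in (30, 80, 122):
    print(f"{n}B @ Q8_0  ~{weights_gb(n, 'Q8_0'):.0f} GB")
```

So a 122B model at Q8 is ~130 GB before you even add a context cache, while a Q4-ish quant of a 30B model squeezes under ~20 GB.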
Hence, for 'small' models, e.g. those around 30 Billion parameters:
they can probably do 'something', but possibly not 'everything'.
What I do instead is simply use the web chat interface.
https://github.com/ggml-org/llama.cpp?tab=readme-ov-file#llama-server
(I think ollama similarly has a web interface, or you could perhaps run one that interfaces it)
upload code, type a prompt and review the response.
well, no agent? yup no agent.
this would let you explore the capabilities of your model firsthand, figuring out its limits.
I've yet to learn 'tool calling', but perhaps you can try using opencode
https://opencode.ai/
to connect to these. Tool calling, agents etc. add significant 'complexity' vs that simple 'chat' interface, and I'd think one would need to figure out how to 'debug' problems, e.g. is it a chat template bug, etc.?
accordingly, with tool calling, your 'frontend' needs to present your prompt, along with the 'tools', in some JSON format, as one of those interfacing 'protocols'
https://developers.openai.com/api/docs/guides/function-calling
the task would be to figure out whether your model is 'digesting' the prompts with all the extra 'tool calling' JSON wrappers, and whether it is responding in a format that your 'tool' / frontend understands and can act on.
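For reference, a minimal sketch of what that OpenAI-style 'tools' wrapper looks like on the wire (the tool itself is a made-up example, and the model name is a placeholder):

```python
import json

# Sketch of the OpenAI-style "tools" array an agent frontend sends along
# with the prompt; the model is expected to answer with a matching
# tool_calls structure naming one of these functions.
tools = [{
    "type": "function",
    "function": {
        "name": "read_file",  # made-up example tool
        "description": "Read a file from the workspace",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

request = {
    "model": "local-model",  # placeholder name
    "messages": [{"role": "user", "content": "Show me main.py"}],
    "tools": tools,
}
print(json.dumps(request, indent=2))
```

if the model's chat template doesn't render that `tools` array into something the model was trained on, or the model replies with a malformed `tool_calls` block, that's exactly where the 'spinning in circles' the OP describes comes from.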