Who is actually writing code with local models?

Posted by KarezzaReporter@reddit | LocalLLaMA | View on Reddit | 30 comments

I recently decided to see if I could write code with my local model. I selected a harness from someone who's here and it's pretty great.

I could examine the source code for security issues or rather have Cloud Code do that. I found some, fixed them, notified the dev, but in any case, I'm not going to say who it is.

So, I've been running it, and I'll just say, it's just much, much faster and better to run a smarter model than you can run on your local computer.

You know that story about the dog you train to walk on his hind legs? Yes, it can do that, but it's not going to walk very well, right?

It does work, it's really cool, it takes a really long time, and it's just not nearly as good as Claude Code or Codex, frankly. But, it's great that you can do that.

So, my question is, which of you are actually using local models in your day-to-day, to write code? As opposed to this being a fun hobby and something that we all look forward to being useful eventually.

[-]

Real_Ebb_7417@reddit

Man, how much memory does 1m tokens kv cache take? 😅

[-]

Medium_Chemist_4032@reddit

The first time I used it, with hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-1M-GGUF:Q6_K it was about 20 to 30 gb (I'm probably wrong, but ballpark might be ok) for the kv cache alone.

Nemotron lowered that down significantly, (again, from memory) to under 10gb total.

I was stumped, how well the long context worked. It's not for the faint of heart though: I literally waited over 30 mins. for the context to be prefilled. After that though, with prefix caching, I used forked/cloned conversation functionality in open-webui, to keep asking more and more questions. I've put a whole service (plus some stuff around it) into the context, literally with "repomix --format markdown" as-is. I expected it to fall on it's face, but ...

On the repo I known very well, with a specific, super hard maintenance fact that kept tripping up people, the model responded correctly. In order to respond correctly, it had to just understand the confusing project structure, the dataflow that diverges and merges, along with some async points, and be very careful *not* to trust the actual names.

Now I'm using the 122b most often, due to the best decode on my hardware (2k tps) and a big directory with markdown files, cloned full projects, meeting transcripts - in agent mode. If you know, how to generally steer it, it can be worth their weight in gold.

[-]

One-Replacement-37@reddit

Turboquant exists…

[-]

Real_Ebb_7417@reddit

And it's not implemented in many backends yet and in some implemented wrongly. (eg. I tried running Gemma4 with turboquant in swiftLM on MacBook and it wasn't faster or didn't take less memory than Gemma4 GGUF via llama.cpp xd)

We gotta wait for actually good implementations in many tools I guess.

[-]

Objective-Stranger99@reddit

Not much actually with recent developments. I can fit it in less than 8 GB (for the KV cache alone). I offload it to RAM and keep the actual model layers on vram so its way faster

[-]

No_Afternoon_4260@reddit

Are you sure it's not the contrary? You are putting experts layer on system ram?

[-]

Objective-Stranger99@reddit

Attention layer + a few expert layers on VRAM

The other expert layers + kv cache on system ram

[-]

Long_comment_san@reddit

That's very confusing. The consensus was that cache should remain in VRAM for better access speed. You're saying that you offload the cache to RAM instead

[-]

Real_Ebb_7417@reddit

For dense models it makes more sense to keep kv cache in RAM if it means smaller layers offload to RAM. For MoE in my experience kv cache in vRAM works better if at least the active params weights fit in vRAM as well.

nikhilprasanth@reddit

Plan with a frontier model, split the plan into proper phases with well defined tasks. Use pi or opencode and implement the plan. Once done, debug with a frontier model and pass the findings to local. Repeat

ContextLengthMatters@reddit

Here's a novel concept. Not all tasks require a SOTA model. I run both Claude code and my local models on OpenCode and use them for different things. No sense in wasting tokens on well defined problems.

Secret_Appeal6271@reddit

I agree so much. Especially as a student, it doesn't make sense to it doesn't make any sense to use an expensive model to make a basic project or understand how a particular tool works. Even for complex projects I've built, something like basic testing and documentation can be handled well by a local model.

For actual real hard stuff on the clock - still frontier.

As an agent, that does: here's a directory with all my frameworks' documentation - find, how to do X, I know the solution is mentioned there, just forgot the details -- local all the way to save tokens. Especially with Nemotron nano and it's 1m context. I have been using it with success in such cases, despite needle benchmarks basically saying it's not really capable.

Certain-Cod-1404@reddit

what's your experience with nemotron nano compared to the qwen 3.5/3.6 series if you dont mind answering

kevin_1994@reddit

why do you want an LLM to do all the "hard" stuff? that's the enjoyable part of programming

let your local llms do easy or annoying stuff like debugging a SQL syntax error or refactoring a react component to useMemo.

do all the hard, fun stuff, with your human brain. it will keep you sharper. you'll e having more fun. and your software will be much better this way

gurilagarden@reddit

Different tools for different jobs. Software development is not a single task.

baradas@reddit

running both here too - qwen 3.6 locally for "find where X is defined" / doc-trawling / grunt stuff, claude code for the multi-file refactors that would eat my whole evening locally.

the one thing that bit me early on was losing track of CC sessions. i'd kick off a long task in one tmux pane, go tinker with llama-server in another, come back 2 hours later and the session had been sitting on a tool-use confirmation the whole time. or even worse, have burned through because auto-accept was on and i had forgot.

wrote a small rust TUI to keep tabs on all active CC sessions - pid, context %, $/hr burn, status, budget kill-switch at 100%:
github.com/mercurialsolo/claudectl.

leaving it here in case anyone else juggles parallel sessions across both worlds.

https://i.redd.it/oplu60zfk6wg1.gif

catplusplusok@reddit

MiniMax M2.7 in 3 bit (smarter) or Qwen 3.5 122B in 4 bit (faster on my hardware) can handle a lot of coding tasks so I only pay for Claude API for occasional complex planning. Granted that these take a lot of memory, but I also heard of people getting useful things down with smaller Qwen 3.5 / Gemma 4 models.

ea_man@reddit

Like everyone that can't upload code on the could?

JohnMason6504@reddit

yes I run Qwen 3.6 27b at Q4_K_M on a 4090 and it handles most coding work fine. the honest version is it is not a drop-in Claude replacement but it is not trying to be. for code completion and thinking through a single file or two the local model wins on latency and privacy. round trip to claude is 400ms plus network jitter where it breaks down is large multi-file refactors and deep codebase navigation. a 1M token context sounds great but the KV cache math is brutal. Qwen 27b at fp16 KV needs roughly 2 bytes times 2 times num_heads times head_dim times layers per token. that is near 400KB per token for a 27b. one million tokens is 400 GB. nobody is holding that in GPU KIVI 2bit-K quant per token gets you to roughly 50GB for 1M tokens but you pay with retrieval accuracy. Perplexity stays flat while NIAH drops 15 to 20 points at deep needles. my workflow is local for fast inner loop and cloud for the 200k plus context reads. for sensitive code that gate is not negotiable

Stepfunction@reddit

I am lazy and use GitHub Copilot because Codex is amazing.

DrVonSinistro@reddit

The absolute best efficient way to code with local LLM is to do your own code, be the architect and THEN talk with the LLM about looking for improvements, security and other refactors.

If you 100% vibe code, you'll get black spaghetti code that only the LLM will be able to sort out. Or you with enough time.

Xyrus2000@reddit

I disagree. You architect first. Plan it all out and have the LLM implement the scaffolding classes and methods, then have it write the unit tests (which, of course, will all fail because nothing is implemented yet).

Then start writing your code. For anything that is well established, boilerplate, or more "grunt work" such as basic file I/O and the like, you can have the LLM write it. Everything else is you.

The typical code base is anywhere from 40%-70% "grunt work". AI can usually handle the vast majority of that without any issues. The custom/innovative/creative part of the code base is where it can become more hit or miss with AI.

Gesha24@reddit

I have a fun side project that I am "coding" with local LLMs only. My observations:

1) it's entirely possible. You need a reasonably fast system for it to work, if you are generating 20 tok/sec and processing prompt at 100 tokens/sec - you will struggle to do anything meaningful.

2) Agents matter A LOT. Claude code pointing to a local llama server produces better results than continue.dev despite using exactly the same model at the back end. I am sure you can get to the same level of prompts manually, but it's much easier with Claude.

3) you need to be specific. For example, I can tell sonnet to make sure UI elements are aligned, give a screenshot - and it's fixed. I need to give Qwen the screenshot and say UI elements need to be aligned horizontally, otherwise it will align them somehow but not how I need it.

Does it make sense right now? No, I don't think so. But if the local small models keep improving and paid online models keep getting optimized to the point of spitting nonsense - it may make more sense. As it stands now, my single Radeon Pro 9700 can run Qwen 3.6 at 4 bits with 250K context.

rmhubbert@reddit

I only use local models for both my professional and personal coding now. Some serious caveats, though -

The most I'll quantize a model is 8 bits, with no kv cache quantization. Qwen3-Coder-Next at full precision for coding, Qwen3.5-112b-a30b at 8 bit for planning. Because of this, and to ensure fast tps, I need to run them on a home server with 192GB of VRAM available.
I work within a custom harness with very strict workflows, to ensure my standards are met. This was a huge, and continuing time sink.

So, aye, it is entirely possible, but to ensure the best possible outputs, you will be spending a lot of money on the hardware, and a lot of time on configuring your harness / workflow.

ttkciar@reddit

GLM-4.5-Air works great for codegen. I'm guessing your disappointments come from trying to use much smaller models locally.

If I still have my Codex usage, I just use it, because I pay for the subscription nevertheless. When it runs out, I switch to local model. I also use local models for some more "privacy-aware" projects.

And actually, the local models work faster for me often than API models (depends on the time of day, if USA has working hours, API models often work sloooowly xd).

But honestly, I use Codex just out of convenience and since I already pay for the sub for chatGPT, it's "free" for me (I don't need to launch the other PC to llama-server my model over LAN and I can continue working within the same session with Codex easly when I'm out of home). Local models would be enough for me for almost any usecase out of the box and for more complex usecases, they would also do well, but i'd have to put some effort into designing the right agentic workflow for them.

What i often do though, if I work with a local model, is that I let it do it's work and at some point I launch GPT/GLM/Claude to just review it's work (especially security-wise).

Prudent-Ad4509@reddit

Once you learn how to task both worlds (frontier model with harness and moderately sized local one with harness, let's say, 122b size), you soon find out that you can do nearly all you need with local. Remote frontier models are more tolerant to sloppy prompting. They are obviously more knowledgeable, but you don't need a cook with phd in advanced physics to cook a chicken.

Smaller models in the 27B-35B range are slowly getting better too; I believe the scope of tasks they can do will remain the same, they will just learn to do them better. You can't replace the knowledge of larger models needed for the most complex tasks.

Ok-Measurement-1575@reddit

I use local models to reply to inane topics on here.

m18coppola@reddit

I use local models to write code, but I don't have a speed or intelligence problem with 120gb of vram to spare.