Which 9B local models are actually good enough for coding?
Posted by CalvinBuild@reddit | LocalLLaMA | View on Reddit | 39 comments
I think 9B GGUFs are where local coding starts to get really interesting, since that’s around the point where a lot of normal GPU owners can still run something genuinely usable.
So far I’ve had decent results with OmniCoder-9B Q8_0 and a distilled Qwen 3.5 9B Q8_0 model I’ve been testing. One thing that surprised me was that the Qwen-based model could generate a portfolio landing page from a single prompt, and I could still make targeted follow-up edits afterward without it completely falling apart.
I’m running these through OpenCode with LM Studio as the provider.
I’m trying to get a better sense of what’s actually working for other people in practice. I’m mostly interested in models that hold up for moderate coding once you add tool calling, validation, and some multi-step repo work.
What ~9B models are you all using, and what harness or runtime are you running them in?
Models:
https://huggingface.co/Tesslate/OmniCoder-9B-GGUF
https://huggingface.co/Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF
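For anyone wanting to replicate the harness side: LM Studio exposes an OpenAI-compatible local server (default port 1234), which is what OpenCode talks to. A minimal smoke test from the shell might look like this; the model id is a placeholder, use whatever identifier LM Studio's server tab shows for your loaded model:

```shell
# Query LM Studio's local OpenAI-compatible endpoint directly.
# "omnicoder-9b" is a placeholder model id; substitute the one LM Studio reports.
curl -s http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "omnicoder-9b",
        "messages": [{"role": "user", "content": "Write a Python one-liner that reverses a string."}],
        "temperature": 0.2
      }'
```

If this returns a sensible completion, any OpenAI-compatible harness (OpenCode included) can be pointed at the same base URL.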
Ok-Importance-3529@reddit
For me Copaw flash 9B is very usable. Until now I never had luck with 9B models; they always felt simple-minded and forgetful, but this one is different.
Recoil42@reddit
Serious coding? Multi-step? At 9B?
None. Don't do it. You're asking the equivalent of "which plastic spork should I use for gardening?"
The answer is you should not use a plastic spork for gardening. Reiterating what I have said here a dozen times: there are plenty of reasons to have small local setups, but multi-turn agentic coding isn't yet one of them. When each bad decision compounds heavily into every future step, it's important that you don't make mistakes, and a high-grade model will be the crucial difference between complete slop and something that isn't slop at all. Right now each advance is so impactful to productivity that professional coders are moving directly to the newest high-grade professional models immediately on release.
Spend the money on a Claude Code or Codex subscription. Doing otherwise at this moment in time is penny-wise, pound foolish, and anyone who tells you otherwise has barely dipped into the technology, is wasting your time, or trying to convince themselves of something that isn't true.
We will eventually have local models good for coding, but not now, and not at 9B for anything other than 'toy' setups.
CalvinBuild@reddit (OP)
Fair take. I also use Codex and Claude, so I'm not claiming 9B local models are the best option for serious coding.
I'm specifically asking about the local-on-consumer-hardware tier. For people who care about local-only workflows, privacy, cost, or edge-device experimentation, I want to know which ~9B models are currently the most usable in practice.
Ell2509@reddit
They just don't have the knowledge for sophisticated coding.
Qwen coder 30b a3b is ok ish. What are your device specs?
CalvinBuild@reddit (OP)
Agreed. Wishful thinking on my part.
3080ti 12gb
Ell2509@reddit
Yeah so the qwen3.5 35b, the qwen3 coder 30b, (both a3b moe) will be ok. 32gb ram, right?
Or just use claude code while things like turboquant and other new developments take hold.
CalvinBuild@reddit (OP)
Yeah 32gb ram. I will test out qwen3 coder 30b tonight.
Ell2509@reddit
General consensus is that the qwen3.5 35b a3b will be better at coding, and it is still only 3b active parameters, but it does overthink.
I use wrench 35b a3b when I need qwen. It is based on qwen 3.5 35b a3b but doesn't seem to think as much.
tmvr@reddit
Luckily it's already dying down, but when Qwen3.5 came out it was madness here with the astroturfing and the outlandish claims; it basically devolved into posts and comments claiming "4B can cure cancer".
The reality is that no 9B model is good enough for agentic coding. The 27B is decent, but if you are looking at 9B models you probably do not have the hardware to run the dense 27B. On the other hand you probably have the hardware to run the also decent MoE ones with loading the experts into system RAM:
Qwen3 Coder 30B A3B
Qwen3 Coder Next 80B A3B
Qwen3.5 35B A3B
The 80B one maybe not if you only have 32GB system RAM, but the rest you can with a 12GB or 16GB card and at least 32GB system RAM. You can try those with Claude Code if you are already using it and see what it gets you.
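As a concrete sketch of the "experts into system RAM" setup described above, a llama-server invocation could look like this. The filename and the --n-cpu-moe count are illustrative, and --n-cpu-moe needs a reasonably recent llama.cpp build (older builds achieve the same with a tensor-override regex):

```shell
# Run Qwen3 Coder 30B A3B Q4 with all layers offloaded to the GPU (-ngl 99)
# but the MoE expert tensors of most layers kept in system RAM - the trick
# that makes these MoE models fit on 12-16GB cards with 32GB+ system RAM.
# On older llama.cpp builds, -ot "ffn_.*_exps.*=CPU" does the same by regex.
llama-server \
  -m Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
  -ngl 99 \
  --n-cpu-moe 30 \
  -c 32768
```

Tune the --n-cpu-moe count down (more experts on GPU) until VRAM is full, and raise -c only as far as your RAM budget allows.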
Oshden@reddit
If I have an RTX 5070 with 8GB of RAM and 64GB of system RAM, in your opinion could I run any of these models you mentioned? I’m still learning about how all of the different settings in LM studio work
tmvr@reddit
Yes, the Qwen3 Coder 30B A3B for sure. LM Studio has an option to put the experts into system RAM. I saw the latest version also has a slider for how many, but I haven't used it; I use llama.cpp (llama-server) directly, which has a --fit parameter that will put things where they belong automatically depending on the context size you use with the -c parameter. It also has a --fit-ctx parameter which basically combines the two.
Oshden@reddit
Thanks a million for the detailed answer!
ea_man@reddit
use like: fit-target 126
and disable every hw accel you may have in browser or whatever ;)
CalvinBuild@reddit (OP)
Yeah, that is fair, but once you are leaning on system RAM heavily, the performance hit can get pretty brutal on ~12GB-class setups.
For coding, “technically runnable” and “pleasant enough to use” are very different things. That is a big part of why I was asking about the 9B tier in the first place.
I have a 3080ti 12gb, what would you try?
ea_man@reddit
30B A3B MoE at quant 3 should do you more than 100 t/s; at Q4_K_M it's an honest 50 t/s.
27B at IQ3 should do you some 30 t/s when the context doesn't spill.
tmvr@reddit
The 30B and the 35B should run fine, but stick to the Q4 quants from whoever you trust, bartowski or unsloth for example, so in that case Q4_K_L or Q4_K_XL. If that is too slow you can still go for the IQ4_XS from both of them and see if the output is acceptable.
You will need to adjust a handful of env variables for Claude Code to work with the local inference engine and models, there is a very handy summary in this recent post:
https://www.reddit.com/r/LocalLLaMA/comments/1s8l1ef/how_to_connect_claude_code_cli_to_a_local/
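The linked post has the full details; the short version comes down to a few environment variables, assuming your local server exposes an Anthropic-compatible endpoint (recent llama-server builds do; otherwise a proxy such as LiteLLM in front of it). The URL and model id below are placeholders:

```shell
# Point Claude Code at a local inference server instead of Anthropic's API.
export ANTHROPIC_BASE_URL="http://localhost:8080"    # your llama-server or proxy URL
export ANTHROPIC_AUTH_TOKEN="dummy"                  # local servers typically ignore this
export ANTHROPIC_MODEL="qwen3-coder-30b-a3b"         # placeholder local model id
claude
```

With these set, Claude Code routes all requests to the local endpoint.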
jacek2023@reddit
"Spend the money on a Claude Code or Codex subscription." LocalLLaMA as usual
Recoil42@reddit
If you come here for false flattery and circlejerking, you've got your priorities wrong.
Witty_Mycologist_995@reddit
r/localaicirclejerk
CalvinBuild@reddit (OP)
none of that
jacek2023@reddit
unfortunately people like you are here because the sub is popular
wazymandias@reddit
The 9B tier is decent for single-file edits and autocomplete but yeah, multi-step agentic stuff falls apart fast.
some_user_2021@reddit
It is also worth noting that the 9B tier could enter into infinite loops.
Nyghtbynger@reddit
It tried to edit the same file 7 times while not finding it. After a few attempts at modifying repeat penalties and temperatures (Omnicoder 9B), I think I'll switch models and use 27B in the meantime. But the tasks I do generally need 80K context and I can only store 69K..
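The context wall here is mostly KV-cache memory. A back-of-envelope estimate with hypothetical dimensions for a 9B-class model (40 layers, 8 KV heads, head dim 128, f16 cache; real models vary) shows why ~80K tokens is hard to fit next to the weights:

```shell
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes_per_value
ctx=81920        # ~80K tokens
layers=40        # hypothetical 9B-class depth
kv_heads=8       # grouped-query attention KV heads (hypothetical)
head_dim=128
bytes=2          # f16 cache
kv_bytes=$(( 2 * layers * kv_heads * head_dim * ctx * bytes ))
echo "KV cache: $(( kv_bytes / 1024 / 1024 )) MiB"   # 12800 MiB, i.e. 12.5 GiB
```

Quantizing the KV cache to q8_0 or q4_0 halves or quarters that figure, which is often exactly the difference between 69K fitting and 80K fitting.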
CalvinBuild@reddit (OP)
It really does fall apart fast once you push it into multi-step agentic work. I'm still holding onto hope though lol.
CalvinBuild@reddit (OP)
V3 of that Qwen 3.5 9B distill just released. The posted gains look more like ~+5 pp on HumanEval and ~+1.4 pp on the posted MMLU-Pro slice, not a blanket 6%+ everywhere.
V3 model:
https://huggingface.co/Jackrong/Qwopus3.5-9B-v3
Significant-Yam85@reddit
Waiting for Q8 GGUF and will test.
CalvinBuild@reddit (OP)
Here's the GGUF
https://huggingface.co/Jackrong/Qwopus3.5-9B-v3-GGUF
Significant-Yam85@reddit
Perfect!
qubridInc@reddit
Qwen-based 9B distills and OmniCoder are solid, but if you want more consistent multi-step repo work and tool use, try running them via Qubrid AI for better orchestration and reliability.
CalvinBuild@reddit (OP)
Yeah, I can believe orchestration helps a lot here. My impression so far is that the runtime around these ~9B models matters almost as much as the model itself once you start pushing multi-step repo work and tool use.
CalvinBuild@reddit (OP)
Yeah, fair. I’d rather use the model that actually knows more than chase parameter count on paper. If that 27B is materially smarter, that seems like the right call.
Wildnimal@reddit
The problem is not coding, it's the context. That's going to be a lot more difficult IMHO. And even if you have the ability to run a higher context window, the model might not be able to follow instructions.
You will have to split your projects per file, with instructions and links to the other files, for it to be usable.
No one-shots, but for small local things you can do it.
CalvinBuild@reddit (OP)
Yeah, I think that's the real bottleneck. Not raw coding ability, but context selection and instruction retention across steps. Splitting the project into tighter file-level tasks seems like the only practical way to make small local models usable right now.
refried_laser_beans@reddit
I loaded qwen3.5 9b q4 into open code and fired off a prompt for a react web app. It did it in one go. Took like an hour and a half though. It had dynamic content and multiple pages. Overall a simple web app but I was impressed.
CalvinBuild@reddit (OP)
That's actually pretty solid for a 9B.
A multi-page React app with dynamic content in one shot is not nothing. The hour and a half is the tax, but that is still way more usable than people give these models credit for.
Feels like the real bottleneck is less the model and more the runtime around it. Also interesting that there doesn't seem to be much difference between Q8_0 and Q4_K_M here.
ea_man@reddit
Hmm no.
I don't even use 30B A3B @ Q4 anymore, I prefer Qwen3.5-27B-UD-IQ3_XXS: it simply knows much more.
spky-dev@reddit
I use that Qwen3.5 Opus distill as an explore and compact agent in Opencode, but never for writing code. Typically use 27b and 122b for that.
CalvinBuild@reddit (OP)
Yeah, that matches what I'm seeing too. 9B still feels pretty stretched for real coding, but it's still worth testing because both the models and the harness/runtime side are improving fast.
At this point the more interesting question to me is how far small models can be pushed with better tool use, validation, and tighter runtime constraints before 27B+ becomes mandatory.