Local models for coding have reached a threshold where they're feasible for real work
Posted by Exciting-Camera3226@reddit | LocalLLaMA | View on Reddit | 41 comments
We ran open-weight 27B–32B models on Terminal-Bench 2.0 (89 tasks, terminal-bench-2.git @ 69671fb) through our agent harness. The best result was Qwen 3.6-27B at 38.2% (34/89) under the default per-task timeout — the same constraint the public leaderboard uses (Qwen's official post uses a more relaxed config). We deliberately kept the official leaderboard's default setup because we wanted an apples-to-apples number against the verified leaderboard.

One interesting finding is that MoE models still offer roughly an order-of-magnitude improvement in inference speed.

The interesting part isn't 38.2% in absolute terms — current verified SOTA is ~80% (GPT-5.5 / Opus 4.6 / Gemini 3.1 Pro). The interesting part is what 38.2% maps to in time.
Anchoring on model release dates of verified leaderboard entries:
- Terminus 2 + Claude Opus 4.1 (released Aug 2025): 38.0%
- Terminus 2 + GPT-5.1-Codex (Nov 2025): 36.9%
- Claude Code + Sonnet 4.5 (Sep 2025): 40.1%
- Codex CLI + GPT-5-Codex (Sep 2025): 44.3%
So today's best runnable-offline coding model lands roughly where the hosted frontier was in late 2025 — about a 6–8 month lag. That's the first time this has been close enough to matter for real deployments (regulated environments, air-gapped, on-prem CI, batch workloads).

more details on our blog: https://antigma.ai/blog/2026/04/24/offline-coding-models
false79@reddit
People who know what they are doing with local models have been doing real work for a while now. I'm talking about the devs who know which aspects of their role are manually repetitive and automate them with agentic tooling, freeing up time to tackle more important tasks (or not).
my_name_isnt_clever@reddit
I first used GPT-2 in AI Dungeon back in 2020 and even that trash model could do some useful things with text. I was getting use out of local LLMs in the Llama 3 days.
They've gotten a hell of a lot more capable since then, but if you know how they work you don't need the pinnacle of frontier intelligence for AI to be useful.
false79@reddit
What I'm not finding helpful is people who zero- or one-shot prompts expecting magic, have absolutely zero clue what is happening within the LLM, and then say it's useless.
It really comes down to breaking down tasks, providing access to required context, and letting the LLM's reasoning connect the dots faster than a human would.
Even 4B models can do amazing time saving things, again if you know how they work.
the3dwin@reddit
Very, very curious about your workflow for getting 4B models to give amazing results. The smaller models are the fastest, so again, very curious about your workflow. Please share.
false79@reddit
You can try it in a coding harness. Keep your tasks very lightweight and small; reference the files that need to be updated as part of the context, or provide the fields and methods for the new class you need to make.
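For concreteness, here's a minimal sketch of that pattern against a local OpenAI-compatible endpoint (llama.cpp's llama-server, LM Studio, and Ollama all expose one). The URL, model name, file paths, and the task itself are placeholder assumptions, not a specific recommended setup:

```python
# Sketch: a small, well-scoped coding task for a 4B-class local model.
# Assumes an OpenAI-compatible server is already running locally.
from pathlib import Path
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Give the model only the files the change touches, not the whole repo.
context_files = ["src/user.py", "tests/test_user.py"]  # placeholder paths
context = "\n\n".join(f"### {p}\n{Path(p).read_text()}" for p in context_files)

task = (
    "Add a deactivate() method to the User class that sets self.active = False, "
    "and update the tests accordingly. Output only the changed code."
)

resp = client.chat.completions.create(
    model="local-model",  # whatever name your local server exposes
    messages=[
        {"role": "system", "content": "You are a precise coding assistant. Keep changes minimal."},
        {"role": "user", "content": f"{context}\n\n{task}"},
    ],
    temperature=0.2,
    max_tokens=1024,
)
print(resp.choices[0].message.content)
```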
the3dwin@reddit
Thanks.
Just started hearing about "coding harness"; any resources you recommend would be highly appreciated.
Also looking for specifics about your setup, running on LM Studio or Ollama or llama.cpp, whichever you are using for 4B models.
Like CPUs, max tokens, etc.
alrojo@reddit
How aggressive are your timeouts? At 1.9 tokens per second that is very slow generation.
Exciting-Camera3226@reddit (OP)
Yeah, 1.9 tokens/second is really not usable. The token-speed table was a separate experiment to check consumer hardware (the actual Terminal-Bench run uses the same llama.cpp engine, but with more powerful GPUs to cut overall evaluation time by parallelizing tasks).
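For a sense of scale, here's a rough back-of-envelope on why generation speed collides with a per-task timeout. The token counts and timeout below are illustrative assumptions, not the benchmark's actual config:

```python
# Back-of-envelope: generation speed vs. a per-task timeout.
# All numbers below are illustrative assumptions, not Terminal-Bench's real config.
tokens_per_turn = 1500   # assumed tokens generated per agent turn
turns_per_task = 10      # assumed tool-call round trips per task
timeout_min = 10         # assumed per-task timeout, in minutes

for tps in (1.9, 30, 190):
    minutes = tokens_per_turn * turns_per_task / tps / 60
    verdict = "blows" if minutes > timeout_min else "fits"
    print(f"{tps:>6.1f} tok/s -> {minutes:7.1f} min of pure generation ({verdict} a {timeout_min}-min timeout)")
```

At 1.9 tok/s the generation alone takes over two hours under these assumptions, so the timeout dominates the score long before model quality does.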
iamgianluca@reddit
I can get 30 tokens/s in the decoding phase with an RTX 3090 using Qwen 3.6-27B. I would check those results, 1.9 tokens/s looks very low.
Is the difference in the prefilling phase really that significant between 3090 and 5090?
Paradigmind@reddit
It says 3060 though.
FullOf_Bad_Ideas@reddit
Does it actually feel this way? Does it feel like Opus 4.1 in real use in terms of how well the model plans, executes, stays on track and deals with 100k context?
Exciting-Camera3226@reddit (OP)
Not yet; there are still some stability issues (speed, which causes a lot of tool-call timeouts), but that could be caused by our setup.
gladfelter@reddit
I've had a couple of hour-long sessions where pi + a 27B at 4-bit, with an 8-bit KV cache and 80k context on a 3090, chugs away, writing dozens of source files and their tests, verifying all tests pass, and creating a chain of commits, all from the project plan it had previously written out.
I haven't seen it go off the rails since I optimized my llama.cpp flags.
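For anyone curious, roughly what that configuration looks like as a llama-server launch, sketched here via subprocess. The model path is a placeholder and flag names can shift between llama.cpp builds (some builds also need flash attention enabled for a quantized V cache), so check `llama-server --help` on yours:

```python
# Sketch: Q4 weights, 8-bit KV cache, ~80k context, all layers on a 24 GB GPU.
# Flag names are current for recent llama.cpp builds but may differ on yours.
import subprocess

subprocess.run([
    "llama-server",
    "-m", "models/qwen-27b-q4_k_m.gguf",  # placeholder path
    "-c", "81920",                        # ~80k context window
    "-ngl", "99",                         # offload all layers to the GPU
    "--cache-type-k", "q8_0",             # 8-bit K cache
    "--cache-type-v", "q8_0",             # 8-bit V cache
    "--port", "8080",
])
```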
SnooPaintings8639@reddit
For me it's fine, but... not very good at instruction following. I am using the official FP8 weights, and it just tends to do its own thing, with very little regard for what's in the sys prompt. I feel like I have to restate the "hard rules" every 10k tokens or so. It gets very annoying when it's in yolo mode.
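A low-tech workaround for that drift is to mechanically re-inject the hard rules once the conversation grows past a threshold. A minimal sketch of the idea, with a crude word-count stand-in for real token counting (the rules text and the 10k threshold are just taken from the comment above):

```python
# Sketch: re-append the "hard rules" every ~10k tokens so a drifting model
# keeps seeing them near the end of the context. The rules and threshold are
# placeholders; swap the word-count heuristic for a real tokenizer.
HARD_RULES = "HARD RULES: stay inside src/, run the tests after every change, ask before deleting."
REINJECT_EVERY = 10_000  # tokens

def approx_tokens(messages):
    # Very rough: ~1.3 tokens per whitespace-separated word.
    return int(sum(len(m["content"].split()) for m in messages) * 1.3)

def maybe_reinject_rules(messages, last_injected_at=0):
    total = approx_tokens(messages)
    if total - last_injected_at >= REINJECT_EVERY:
        messages.append({"role": "system", "content": HARD_RULES})
        last_injected_at = total
    return messages, last_injected_at
```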
vr_fanboy@reddit
To me it does feel equal to or better than at least the current Sonnet. I cobbled together another machine with spare parts I had lying around from an old mining rig just to have another 27B instance running, Q4 at 100k context. I now have two Qwen 27Bs in the lab.
I just tried the Q4 on the first task. I'm using lean-ctx to optimize context inside PI; I wanted a local build for security reasons, without calls to remote APIs, etc.
I had a plan that Sonnet built, and the first thing Qwen 3.6 27B did was correctly question multiple parts of the plan and rework it.
Then it started clean in another PI instance, installed Rust and all the shit it needed for the build. Keep in mind it was actively using lean-ctx while recompiling AND deploying it. At one point it replaced the binary for the first time and the new binary had a bug. It was trapped, because almost ALL calls go through lean-ctx, so I thought, OK, I'll have to manually revert the binary. But no: after 10 different tries it found a workaround, fixed it, redeployed, and everything is working, all by itself, at 40% token usage in the entire 10-minute session. Insane. I'm grinning like a maniac right now.
The lean-ctx middleware is intercepting every command. The broken JS wrapper is poisoned. Let me check how lean-ctx wraps bash commands:
$ echo hello
lean-ctx binary not found. Run: npm rebuild lean-ctx-bin
Command exited with code 127
Compressed 22 → 22 tokens (0%)
Took 0.4s
Every bash call goes through lean-ctx first, which invokes the broken JS wrapper. I need to restore the JS file. The write tool should work since it doesn't go through bash:
write C:\Program Files\nodejs\node_modules\lean-ctx-bin\bin\lean-ctx.js
#!/usr/bin/env node
"use strict";
const { spawnSync } = require("child_process");
const path = require("path");
const IS_WIN = process.platform === "win32";
const BINARY = path.join(__dirname, IS_WIN ? "lean-ctx.exe" : "lean-ctx");
const result = spawnSync(BINARY, process.argv.slice(2), { ... (14 more lines, 24 total, ctrl+o to expand)
JS restored. Let me verify bash works again:
$ echo hello
hello
Compressed 2 → 2 tokens (0%)
Took 1.5s
Bash restored. Now let me properly replace the binary. Since I can't write to Program Files via sh, I'll modify the JS wrapper to point to the new binary location:
Exciting-Camera3226@reddit (OP)
But the context window is indeed a particular issue with local models, because not only does output quality degrade (as with the hosted APIs), resource consumption also goes up significantly.
Compacting and clearing need to be a lot more aggressive with local models.
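What "a lot more aggressive" can look like in practice: keep the system prompt and the last few turns verbatim, and have the model itself fold everything older into a short summary. A minimal sketch, assuming an OpenAI-compatible client; the budget, keep_recent count, and summary prompt are placeholder assumptions:

```python
# Sketch: aggressive context compaction for a local agent loop.
# Keeps the system prompt and recent turns, summarizes everything in between.
def compact(client, model, messages, budget_tokens=20_000, keep_recent=6):
    rough = sum(len(m["content"].split()) for m in messages)  # crude token proxy
    if rough < budget_tokens or len(messages) <= keep_recent + 1:
        return messages

    system, older, recent = messages[0], messages[1:-keep_recent], messages[-keep_recent:]
    summary = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Summarize this agent session in under 300 words. "
                        "Keep file paths, key decisions, and open TODOs."},
            {"role": "user",
             "content": "\n".join(f"{m['role']}: {m['content']}" for m in older)},
        ],
        max_tokens=512,
    ).choices[0].message.content

    return [system, {"role": "system", "content": f"Summary of earlier work:\n{summary}"}, *recent]
```

The tradeoff mentioned below still applies: whatever the summary drops may be exactly what a later step needs, so the summary prompt has to be biased toward paths, decisions, and open items rather than narrative.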
Joozio@reddit
38.2% on TB2 with a runnable-offline model is wild when you frame it that way. I tried roughly the same path two weeks ago: Qwen 27B locally on a Mac Mini for night-shift jobs while Opus did the heavier reasoning. Wrote up the cost/token math after switching back from Codex: https://thoughts.jock.pl/p/opus-4-7-codex-comeback-2026
MoE inference speed is the part that flipped it for me too.
Substantial_Step_351@reddit
The context management point you mentioned in the replies, "compacting and clearing need to be a lot more aggressive with local model" is the bit I keep coming back to. That's not just a hardware constraint, it's an architectural one.
Aggressive compaction introduces its own set of problems because what you drop from context is often exactly what the next step needs. For multi step agent work that tension doesn't go away, it just moves.
The benchmaxxing point from u/AXYZE8 is worth taking seriously too. If Terminal-Bench 2 dropped in Nov 2025 and the newer models trained after that, the 6-to-8-month-lag framing gets shakier. SWE-rebench would be a cleaner read on whether the gap is real.
MelodicRecognition7@reddit
You miss one very important thing: the size of these models. You compare 27–32B offline models to 1000+B cloud models! There is no lag at all; Chinese models already outperform American ones, given that it takes a 27B model just a 6–8 month "lag" to achieve the same results as a 1000+B one.
FeiX7@reddit
What about the Qwen 3.6 MoE models? The 35B, I mean.
Top-Rub-4670@reddit
Hmm. Generation for the 35B is literally 15x that of the 27B in your table. I think that's already plenty.
The parsing being only 25% faster is harder to explain; it almost feels like you're offloading part of it?
rm-rf-rm@reddit
What? No, the table is showing the A3B MoE SLOWER than the 27B...
Exciting-Camera3226@reddit (OP)
Ah yeah, the wording is misleading. I meant that MoE is still significantly faster compared to a dense model of similar size.
R_Duncan@reddit
Your speeds are strange. RTX 6000 Blackwell here, context to the max.
27B generation is 50-59t/s
35B-A3B generation is 190-197 t/s.
Due_Duck_8472@reddit
Lol no
knigb@reddit
Interesting did all tests use RTX 5090?
Exciting-Camera3226@reddit (OP)
No, only the one noted as 5090. The others use a 3060.
tomByrer@reddit
IMHO 12GB isn't enough for real coding, unless you spend lots of time hunting for the best quant, etc.
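Rough numbers behind that, assuming Q4_K_M lands around 4.8 bits per weight (an approximation, and ignoring anything you offload to system RAM):

```python
# Rough VRAM math for a 27B model at Q4_K_M (bits-per-weight is approximate).
params = 27e9
bits_per_weight = 4.8
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"~{weights_gb:.0f} GB just for the weights, before KV cache and activations")
# -> ~16 GB, which already overflows a 12 GB card without offloading.
```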
cygn@reddit
so the gap between Qwen's official post (59.3) and what you measured (38.2) for 27b is purely because of the timeout?
I still wonder if they have benchmaxxed terminal bench 2.0. Would love to see some independent benchmark.
Exciting-Camera3226@reddit (OP)
But there are many factors (GPUs, harness, inference engine) that could be at play; we are going to do a more thorough eval later.
Exciting-Camera3226@reddit (OP)
Yes. The timeout can be a big factor because of how slow local inference can be. This eval didn't do any tuning of cache/prompting/llama.cpp config.
AXYZE8@reddit
The graph has a non-existent Qwen3.5-32B (the 32B was Qwen2.5) and a Gemma 4 31B.
The table has the correct Qwen3.5-35B, but then a Gemma 4 26B-A4B.
Looking inside... Hey Claude! But back to the topic: if one name is made up and another model is completely different between tests... how can we trust these results at all?
AXYZE8@reddit
About that last table in the Reddit post - sorry, but no. This is a completely wrong conclusion.
Terminal-Bench 2 was released in Nov 2025, so what you are seeing here is benchmaxxing/contamination in new models. The closed models from Nov 2025 just weren't optimized for that bench, and the newer Qwen models were. That's not a "6-8 month lag".
You need to test on fresh issues like SWE-rebench; only then can you draw such conclusions.
Exciting-Camera3226@reddit (OP)
thanks for the info, will try SWE-rebench !
AXYZE8@reddit
And then I looked into the post further and saw you're using some form of Q4_K_M with no mention of the quant maker and no mention of the engine.
You degraded the performance of these local models, especially the 9B. Now we are testing some quant, and nobody knows if that GGUF for Gemma already has the fixes inside.
And then I also see "GLM-4.6V-Flash" in your model list, but there is no such model in the graph/tables.
Sorry, but that is AI slop top to bottom. Hallucination on top of hallucination.
Exciting-Camera3226@reddit (OP)
we are posting the results soon so you can verify
AdOk3759@reddit
Didn’t a recent study develop little coder, which led to much better results than other harnesses?
Terminator857@reddit
Would be interesting to see Strix Halo results with Qwen 3.5 122B Q4. My results suggest it performs better at coding.
Cultural_Meeting_240@reddit
38 percent is honestly not bad for a local model. A year ago we had nothing close to this. The MoE speed gap is wild though; feels like that's where the real gains are gonna come from next.
computehungry@reddit
I'll fall for the slop ragebait. Did you seriously run the bench under the default timeout at 1.9 tk/s?
GrungeWerX@reddit
Thanks for the info. Yeah, Qwen 3.5 27B was the first I've used that felt SOTA. Good times!