Mapping True Coding Efficiency (Coding Index vs. Compute Proxy)
Posted by NewtMurky@reddit | LocalLLaMA | View on Reddit | 10 comments
TPS (Tokens Per Second) is a misleading metric for speed. A model can be "fast" but use 5x more reasoning tokens to solve a bug, making it slower to reach a final answer.
I mapped ArtificialAnalysis.ai data to find the "Efficiency Frontier"—models that deliver the highest coding intelligence for the least "Compute Proxy" (Active Params × Tokens).
The Data:
- Coding Index: Based on Terminal-Bench Hard and SciCode.
- Intelligence Index v4.0: Includes GPQA Diamond, Humanity’s Last Exam, IFBench, SciCode, etc.
Key Takeaways:
- Gemma 4 31B (The Local GOAT): Exceptionally efficient, hitting a 39 Coding Index while staying lean. Once llama.cpp is fully patched, this will be the gold standard for local dev.
- Qwen3.5 122B (The MoE Sweet Spot): MiniMax-M2.5 benchmarks are misleading for local setups due to poor quantization stability. Qwen3.5 122B is the more stable, high-intelligence choice for local quants.
- GLM-4.7 (The "Wordy" Thinker): Uses a massive 160M reasoning tokens to reach its scores. Even with high TPS, your Time-to-Solution will be much longer than peers.
- Qwen3.5 397B (The SOTA): The current ceiling for intelligence (Intel 45 / Coding 41). Despite its size, its 17B-active MoE design is surprisingly efficient.
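The "Compute Proxy" metric above can be sketched in a few lines. This is an illustrative toy, not the actual ArtificialAnalysis.ai methodology: the model names and numbers below are made-up placeholders, and the ranking function simply divides Coding Index by Active Params × Tokens.

```python
# Toy sketch of ranking models by "coding intelligence per unit of compute".
# Compute Proxy = Active Params (B) x Reasoning Tokens (M), per the post.
# All figures here are hypothetical placeholders for illustration only.

models = [
    # (name, active_params_B, reasoning_tokens_M, coding_index)
    ("model-a", 3.0, 40.0, 30),    # small, lean reasoner
    ("model-b", 17.0, 60.0, 41),   # big MoE, few active params
    ("model-c", 32.0, 160.0, 38),  # "wordy" thinker
]

def compute_proxy(active_params_b: float, tokens_m: float) -> float:
    """Relative compute cost: active params (B) times tokens used (M)."""
    return active_params_b * tokens_m

def efficiency(coding_index: float, proxy: float) -> float:
    """Coding Index delivered per unit of compute proxy."""
    return coding_index / proxy

# Sort most-efficient first: high score for little compute wins.
ranked = sorted(
    models,
    key=lambda m: efficiency(m[3], compute_proxy(m[1], m[2])),
    reverse=True,
)

for name, params, tokens, ci in ranked:
    proxy = compute_proxy(params, tokens)
    print(f"{name}: proxy={proxy:.0f}, efficiency={efficiency(ci, proxy):.4f}")
```

With these placeholder numbers, the small lean model tops the list even though the big MoE has the highest raw Coding Index, which is exactly the trade-off the chart is trying to surface.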
Emotional-Baker-490@reddit
AI written post.
soyalemujica@reddit
How can this graph say the 35B A3B is better than Qwen3-Coder-Next? There is just no way. I've run both models, and the 35B is like 20% behind.
audioen@reddit
Well, the literal answer is that Artificial Analysis, which collects this measurement data, says so. I know many people don't think this is the case, but presumably these performance metrics are objective, and objective data wins over people's subjective feels.
I have tried both the 80B coder and the 35B model and thought that both of them are pretty much trash. So far, the only local model I've found any good for anything is the 122B model, with a nod to gpt-oss-120b, which could sometimes perform decent work if supervised closely enough.
PermanentLiminality@reddit
I'd like to see the Gemma 4 26B A4B on the graph. It is so much faster that in many cases it might be the better choice.
sarcasmguy1@reddit
What sort of rig (in terms of $) is needed to run Gemma 4 31B?
NewtMurky@reddit (OP)
A used RTX 3090 (24GB) is the sweet spot. You can find these for $700–$850 on the used market.
The Mac option is a MacBook Pro or Mac Studio with at least 36GB of unified memory.
PermanentLiminality@reddit
Inflation has hit the old GPUs again. They're more like $950 now.
FusionCow@reddit
Anything with 24GB of VRAM, but I would test different models on OpenRouter to see if a model like that is good enough for your use case before buying a whole rig just to run it.
sarcasmguy1@reddit
Thank you! I've been using Codex heavily, but the new usage limits suck. I'm considering putting together something that can be used in place of Codex for certain tasks. I know I won't get quality at the level of Codex, but I wouldn't mind trying to get something close to it. My coding use cases aren't terribly demanding, given that I do pretty heavy spec-driven development.
StupidScaredSquirrel@reddit
Honestly, smart choice of axes. I can look at the graph and say it reflects exactly how most of those models felt to use.