Benching local Qwen as a Codex validator, co-agent, and challenger

Posted by robert896r1@reddit | LocalLLaMA

I’ve been running a local Qwen model beside Codex for coding work, and it has been more useful than I expected. It's never going to replace Codex; it's more like a second set of eyes that's much better than mine.

The workflow is roughly:

* Codex does the main repo work.

* Local Qwen challenges the plan (a rough sketch of this pass is just below the list).

* Qwen checks for overbuilding, missed hard directives, UI/design issues, bad assumptions, and long-context misses.

* I review each interaction, then test and validate before the next stage. This isn't a "send a massive prompt, thoughts and prayers" approach. I need things to work and scale.

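To make the "challenge the plan" step concrete, here's a rough sketch of what that pass can look like. It assumes llama-server (llama.cpp) running locally with its OpenAI-compatible endpoint; the port, reviewer prompt, and file names are placeholders, not my exact scripts.

```python
# Sketch of the "Qwen challenges the plan" pass against a local llama-server.
# Assumes `llama-server` is already running with its OpenAI-compatible API on
# port 8080; file names and the reviewer prompt are illustrative placeholders.
import requests

LLAMA_SERVER = "http://127.0.0.1:8080/v1/chat/completions"

REVIEWER_PROMPT = (
    "You are reviewing another coding agent's plan. Flag overbuilding, "
    "missed hard directives, UI/design issues, bad assumptions, and anything "
    "that contradicts the provided repo context. List concrete objections; "
    "do not rewrite the plan."
)

def challenge_plan(plan: str, repo_context: str) -> str:
    """Send Codex's proposed plan plus relevant context to the local Qwen instance."""
    resp = requests.post(
        LLAMA_SERVER,
        json={
            "model": "local-qwen",  # llama-server serves whatever model it was launched with
            "temperature": 0.2,     # keep the critique focused and repeatable
            "messages": [
                {"role": "system", "content": REVIEWER_PROMPT},
                {
                    "role": "user",
                    "content": f"Repo context:\n{repo_context}\n\nProposed plan:\n{plan}",
                },
            ],
        },
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    with open("codex_plan.md") as f:
        plan_text = f.read()
    with open("relevant_context.md") as f:
        context_text = f.read()
    print(challenge_plan(plan_text, context_text))
```
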
That setup has been useful enough that I wanted a more concrete way to test local model profiles for this role, rather than just relying on synthetic benchmarks.

So I built a small reproducible eval suite around that use case; I got tired of reading benchmarks and posts that didn't align with how I actually use the model.
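
To give a feel for the kind of case I mean, here's a purely illustrative sketch of a "missed hard directive" check; the actual suite in the repo is built around my own workflow and may structure and score cases differently.

```python
# Purely illustrative shape of a "missed hard directive" case; the real suite
# may structure and score cases differently.
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    name: str
    prompt: str                                             # task given to the model under test
    must_respect: list[str] = field(default_factory=list)   # directives the answer has to acknowledge
    must_avoid: list[str] = field(default_factory=list)     # shortcuts that count as a failure

def passes(case: EvalCase, answer: str) -> bool:
    """Pass only if every hard directive is acknowledged and no forbidden shortcut appears."""
    text = answer.lower()
    kept = all(req.lower() in text for req in case.must_respect)
    slipped = any(bad.lower() in text for bad in case.must_avoid)
    return kept and not slipped

case = EvalCase(
    name="respect-no-new-deps",
    prompt="Plan a fix for the flaky retry logic. Hard directive: do not add new dependencies.",
    must_respect=["no new dependencies"],
    must_avoid=["pip install"],
)
```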

I tested a few Qwen3.6 27B GGUF profiles through llama.cpp, including Bartowski and Unsloth variants, different context sizes, and q8/f16 KV cache.
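
For reference, the profiles mainly differ in the llama-server launch flags. A rough sketch of that mapping is below; the GGUF paths, port, and GPU-layer count are placeholders, while the flag names are llama.cpp's own.

```python
# Rough sketch of how the tested profiles map to llama-server flags.
# GGUF paths and the GPU-layer count are placeholders; --ctx-size,
# --cache-type-k and --cache-type-v are real llama.cpp options. Depending on
# the build, a quantized V cache may also require flash attention enabled.
import subprocess

PROFILES = {
    # name: (gguf path, context length, KV cache type)
    "bartowski-128k-f16": ("models/bartowski-qwen.gguf", 131072, "f16"),
    "bartowski-128k-q8":  ("models/bartowski-qwen.gguf", 131072, "q8_0"),
    "unsloth-128k-q8":    ("models/unsloth-qwen.gguf",   131072, "q8_0"),
    "bartowski-65k-q8":   ("models/bartowski-qwen.gguf",  65536, "q8_0"),
}

def launch(profile: str, port: int = 8080) -> subprocess.Popen:
    """Start llama-server with the context size and KV cache type for one profile."""
    model_path, ctx, kv_type = PROFILES[profile]
    cmd = [
        "llama-server",
        "-m", model_path,
        "--ctx-size", str(ctx),
        "--cache-type-k", kv_type,   # quantize the K cache (f16 = no quantization)
        "--cache-type-v", kv_type,   # and the V cache
        "--n-gpu-layers", "999",     # offload as many layers as fit on the GPU
        "--port", str(port),
    ]
    return subprocess.Popen(cmd)
```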

Main findings from my local runs:

* The best 128k profiles tied on the suite: bartowski-128k-f16, bartowski-128k-q8, and unsloth-128k-q8.

* q8 KV did not show a measured accuracy loss in this specific suite. That's not to say the same will be true for your use case.

* Context size mattered more than f16-vs-q8 KV for this workflow. Even in direct usage via opencode this remained true.

* The 65k profiles were fine until the suite asked for >65k context, then they failed pretty hard.

* unsloth-128k-f16 loaded, but hit local memory/throughput pressure on the long-context cases; its bigger footprint just trips up the 5090.

This is not a universal benchmark, and it isn't trying to replace anything that already exists. It's my workflow, my local setup, and a use-case-specific suite. I'm not claiming "best Qwen quant" or anything like that. What I'm trying to offer is a different kind of eval: whether a local model is useful beside a frontier coding agent (Codex in my case) in real work. For my usage, absolutely. Qwen is extremely good at keeping Codex from silently bypassing requirements, smoothing over issues, racing to completion, or hard-coding its way around obstacles. Qwen keeps it in check.

Qwen is also MUCH better at UI. So when UI is involved, the roles reverse: Qwen takes the lead on design, I review, and Codex implements.

Project page:

https://robert896r1.github.io/qwen-realworld-accuracy-evals/

Repo:

https://github.com/robert896r1/qwen-realworld-accuracy-evals

I’d be interested in feedback, especially from people already using local models as coding companions, reviewers, or sidecar agents.

Also interested in real-world test cases people think should be added. I’m more interested in useful failures than prompt benching: missed directives, bad challenge behavior, overbuilding, UI judgment, long-context misses, etc.