Benching local Qwen as a Codex validator, co-agent, and challenger
Posted by robert896r1@reddit | LocalLLaMA
I’ve been running a local Qwen model beside Codex for coding work, and it has been more useful than I expected. It's never going to replace Codex; it's more like a second set of eyes, and a much better one than mine.
The workflow is roughly:
* Codex does the main repo work.
* Local Qwen challenges the plan.
* Qwen checks for overbuilding, missed hard directives, UI/design issues, bad assumptions, and long-context misses.
* I review each interaction, then test and validate before the next stage. This isn't a "send a massive prompt, thoughts and prayers" approach. I need things to work and scale. (A minimal sketch of the challenge step follows this list.)
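For concreteness, here's a minimal sketch of what the Qwen challenge step can look like, assuming llama-server (llama.cpp) is running locally with its OpenAI-compatible API. The port, temperature, and prompt wording are illustrative, not my exact setup:

```python
# Minimal sketch of the "Qwen challenges the plan" step, assuming a local
# llama-server (llama.cpp) exposing its OpenAI-compatible API on :8080.
import requests

CHALLENGE_SYSTEM = (
    "You are a skeptical reviewer. Given a coding agent's plan, flag "
    "overbuilding, missed hard directives, UI/design issues, bad "
    "assumptions, and anything that contradicts the stated requirements."
)

def challenge_plan(plan: str, requirements: str) -> str:
    """Send a Codex plan to the local Qwen model and return its objections."""
    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "messages": [
                {"role": "system", "content": CHALLENGE_SYSTEM},
                {"role": "user",
                 "content": f"Requirements:\n{requirements}\n\nPlan:\n{plan}"},
            ],
            "temperature": 0.2,  # keep the critique focused and repeatable
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```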
That setup has been useful enough that I wanted a more concrete way to test local model profiles for this role and not just rely on synthetics.
So I built a small reproducible eval suite around that use case, because I got tired of just reading benches and posts that didn't align with my use case.
I tested a few Qwen3.6 27B GGUF profiles through llama.cpp, including Bartowski and Unsloth variants, different context sizes, and q8/f16 KV cache.
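Each profile is essentially a llama-server launch configuration. A minimal sketch of one way to stand a profile up; the flags are standard llama.cpp flags, but the model filename, port, and offload value here are illustrative:

```python
# Sketch of launching one test profile: llama-server with an explicit
# context size and KV-cache type. Filename and values are illustrative.
import subprocess

def launch_profile(model_path: str, ctx: int, kv_type: str, port: int = 8080):
    """Start llama-server with a given context window and KV-cache quant."""
    return subprocess.Popen([
        "llama-server",
        "-m", model_path,           # e.g. a Bartowski or Unsloth GGUF
        "-c", str(ctx),             # 65536 or 131072 for the profiles above
        "-ngl", "99",               # full GPU offload
        "--cache-type-k", kv_type,  # "f16" or "q8_0"
        "--cache-type-v", kv_type,  # quantized V cache may require flash
        "--port", str(port),        #   attention, depending on the build
    ])

# e.g. launch_profile("Qwen3.6-27B-Q6_K.gguf", ctx=131072, kv_type="q8_0")
```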
Main findings from my local runs:
* The best 128k profiles tied on the suite: bartowski-128k-f16, bartowski-128k-q8, and unsloth-128k-q8.
* q8 KV did not show a measured accuracy loss in this specific suite. That's not to say the same will be true for your use case.
* Context size mattered more than f16-vs-q8 KV for this workflow. Even in direct usage via opencode this remained true.
* The 65k profiles were fine until the suite asked for >65k context; then they failed pretty hard.
* unsloth-128k-f16 loaded, but hit local memory/throughput pressure on the long-context cases; its larger footprint just trips up the 5090.
This is not a universal benchmark, and it's not trying to replace anything existing. It's my workflow, my local setup, and a use-case-specific suite. I’m not claiming “best Qwen quant” or anything like that. What I’m trying to offer is a different kind of eval: whether a local model is useful beside a frontier coding agent (Codex in my case) in real work.

For my usage, absolutely. Qwen is extremely good at keeping Codex from silently bypassing requirements, smoothing over issues, racing to completion, and hard-coding its way around obstructions. Qwen keeps it in check. Also, Qwen is MUCH better at UI, so when UI is involved the roles reverse: Qwen takes the lead on design, I review, and Codex implements.
Project page:
https://robert896r1.github.io/qwen-realworld-accuracy-evals/
Repo:
https://github.com/robert896r1/qwen-realworld-accuracy-evals
I’d be interested in feedback, especially from people already using local models as coding companions, reviewers, or sidecar agents.
Also interested in real-world test cases people think should be added. I’m more interested in useful failures than prompt benching: missed directives, bad challenge behavior, overbuilding, UI judgment, long-context misses, etc.
guai888@reddit
For UI, my experience is as follows: ChatGPT is hit and miss; the results are unpredictable. Qwen 3.5 122B A10B is better. Google Stitch is the best. I end up using Stitch to generate the UI first for all my projects.
robert896r1@reddit (OP)
I need to spend some time with Stitch. Do you have it piped into other tooling, or are you using it standalone?
guai888@reddit
I use it standalone right now. I haven't settled on my coding harness yet; still experimenting with Codex, Claude, and Pi (vllm local models).
robert896r1@reddit (OP)
I'm doing it literally right now. This seems to be working ok so far:

* Codex: creates the Stitch request, with a qwen challenge
* I iterate in Stitch
* I mark/say the final screen ID
* Codex: pulls only that accepted screen/design
* Codex: saves a frozen local artifact
* Codex: implements from the frozen artifact, with a qwen challenge to ensure Codex aligned to the artifact and didn't invent (rough sketch of the freeze step below)
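A minimal sketch of what the freeze step could look like, assuming the accepted Stitch screen has been exported to a local file. The paths, manifest format, and screen ID are hypothetical; the point is that implementation reads only from the frozen copy, and the qwen challenge can verify against the recorded hash:

```python
# Hypothetical "frozen artifact" step: copy the accepted design into a
# frozen directory and record its hash so later steps can verify it.
import hashlib
import json
import shutil
from pathlib import Path

def freeze_artifact(exported: Path, frozen_dir: Path, screen_id: str) -> Path:
    """Copy the accepted design into frozen_dir and write a hash manifest."""
    frozen_dir.mkdir(parents=True, exist_ok=True)
    frozen = frozen_dir / exported.name
    shutil.copy2(exported, frozen)
    digest = hashlib.sha256(frozen.read_bytes()).hexdigest()
    manifest = frozen_dir / "manifest.json"
    manifest.write_text(json.dumps(
        {"screen_id": screen_id, "file": frozen.name, "sha256": digest},
        indent=2,
    ))
    return frozen

# e.g. freeze_artifact(Path("exports/final_screen.html"),
#                      Path("design/frozen"), "scr_42")
```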
Once I trust it more, I'll give the MCP approach a shot, but for now this is viable. Appreciate the pointer toward using Stitch!
Maharrem@reddit
For catching dumb mistakes in Codex output, Qwen 2.5 Coder 7B Q5_K_M is where I’d start. I get ~80 t/s on my 3090 with full GPU offload, no thinking. If you need deeper architectural critiques, DeepSeek Coder V2 16B Q4_K_M fits with 32k ctx and actually reasons, but you’ll drop to 20 t/s. The 122B A10B is an MoE that’ll choke your VRAM once you bump context past 16k; offloading layers to RAM kills speed for iterative validation. I tried Gemma 2 9B as a co-agent and it hallucinated fixes more than it caught, so stick with dedicated coder models.
robert896r1@reddit (OP)
I have a 5090, so I'm getting >50 t/s, which is very usable without sacrificing accuracy. I generally have multiple Codex sessions calling into qwen directly via llama.cpp with no issues. I'm very happy with the current state.
mister2d@reddit
So nice. I was actually writing pseudocode off and on all day for this workflow, right after watching indydevdan's video.
OneSlash137@reddit
Why on earth would you want a braindead model to check the work of an actual language model?
9gxa05s8fa8sh@reddit
the big AI companies use dumb AI for a lot of things. you might be surprised to hear that selecting "opus" doesn't actually force the use of opus for everything
mister2d@reddit
With preserve thinking enabled, Qwen 3.6 performs even better. And the checks aren't as complex as you might think: when designed correctly, you're verifying with deterministic validation in a feedback loop.
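A minimal sketch of that kind of deterministic gate, assuming pytest as the check and a challenge function like the one sketched in the post; both are illustrative, not a fixed harness:

```python
# Deterministic validation with a model feedback loop: run the real test
# suite, and only hand concrete failures to the local model for critique.
import subprocess

def deterministic_gate(challenge_fn) -> str | None:
    """Run the test suite; on failure, return the model's critique."""
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    if result.returncode == 0:
        return None  # deterministic pass: no model opinion needed
    # the model only ever sees concrete failure output, not vibes
    return challenge_fn("These checks failed:\n" + result.stdout + result.stderr)
```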
Perfect-Campaign9551@reddit
Qwen is actually insanely good at reading and analyzing/describing code.
robert896r1@reddit (OP)
Codex repeatedly races to completion and will silently bypass requirements. In each implementation phase, qwen produces notable objections and recommendations that course-correct and stop this behavior. Btw, Codex is great. I assume you didn't read the post, and yes, "dumb qwen" is simply better at UI.
Finanzamt_Endgegner@reddit
? qwen 3.6 27b is a pretty good model that is not at all braindead lmao?
9gxa05s8fa8sh@reddit
awesome work