Actual comparison between locally run Qwen-3.6-27B and proprietary models
Posted by netikas@reddit | LocalLLaMA | View on Reddit | 71 comments
Hey y'all!
I've recently written a post in Russian about my experience comparing Qwen-3.6-27B with lower-tier cloud models on hard tasks -- I wanted to share a translation, since I found the results interesting and surprising. It might break Rule 3, since it's an evaluation of LLM-written code, but whatever, my methodology is handcrafted and the results are still non-trivial. Sorry for the translation quality, my English is not that good.
__
I once had a server with a 3090 and a Xeon from AliExpress, and I used to run local models on it. This was back in those wonderful times when all interaction with LLMs happened through a web UI, agents were only just starting to appear, and if you wanted to write code properly, you had to copy it from the chat into a file and back again. Back then, I ran Mixtral 8x7B locally, partially offloaded into RAM, and I was extremely pleased with it. Generation speed was around 8 tokens per second, which was more than enough for casual chat with instant models, and Mixtral successfully wrote essays for my Entrepreneurship & Innovation courses at university. I tried using it for code generation too, or rather for Ansible configs, and predictably got chewed out by my teamlead for stupid mistakes. Fun times.
Now Qwen-3.6-27B and Qwen-3.6-35B-A3B are out: two small models specifically tuned for coding and agentic tasks and aimed at local inference. To run them in full precision — that is, in FP8, which they were natively trained in — you need around 36 and 40 GB of VRAM respectively. But we are not proud people and are happy to compromise, so we can take GGUFs in q4_k_m or even q3_k_s to make them fit on local hardware.
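As a rough sanity check on those numbers (back-of-the-envelope only; the bits-per-weight values are approximate GGUF averages, and KV cache plus runtime overhead are not counted):

```python
# Back-of-the-envelope weight memory for a 27B dense model at different precisions.
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

for label, bpw in [("FP8", 8.0), ("q4_k_m (~4.85 bpw)", 4.85), ("q3_k_s (~3.5 bpw)", 3.5)]:
    print(f"{label:>20}: {weight_gib(27, bpw):5.1f} GiB")
# FP8 is ~25 GiB of weights alone -- roughly consistent with the ~36 GB figure once
# KV cache and serving overhead are added; the GGUFs drop to ~15 and ~11 GiB.
```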
I became curious about how capable local models really are at vibe coding. Obviously, they will not replace Opus or Sonnet, so as a satisfactory target I picked a sub-frontier model from a frontier lab: GPT-Codex-Spark. It has a 262k context window, it is not as smart as full Codex or GPT-5.2/5.4/5.5, but it is perfectly capable of calling tools, writing code, and so on. As an approximation of a local model, it works well enough — with the difference that it is super fast and costs $100 per month, while a local model will be super slow and free, or rather, will cost whatever electricity my gaming PC consumes. I also took Claude Haiku 4.5 to see what Anthropic has to offer.
For local inference hardware, I used a system with a Ryzen 7 7800X3D, 64 GB of DDR5-6400, and an RTX 5080 with 16 GB of VRAM. To make the task realistically difficult, I took a fairly complex work project — implementing an autoresearch loop from a relatively detailed design document* — and prompted Qwen-3.6-27B-q4_k_m, Qwen-3.6-27B via OpenRouter, Gemma-4-31B via OpenRouter, Claude Haiku 4.5 in Pi Agent, and Codex-Spark in Codex to implement it using my AGENTS.md. The OpenRouter models were included to estimate, first, how expensive it would be to use these models via API, and second, the upper bound of their capabilities — not crippled quantized inference on my hardware, but full precision.
Importantly, I deliberately chose a task that was too hard for these models. I did not expect even one of them to solve it cleanly. In principle, this is a common problem with local-model evals: people prompt them with tasks that are too simple, and then you get headlines like “My locally hosted Qwen matched Claude Opus in performance!” — both models wrote Snake in HTML, wow. In my case, the goal was not “solve the task,” but “mess up as little as possible while attempting to solve it.” So we will evaluate the applicability of these models not by whether they solved the task — only one out of four did — but by the cleanliness of the failure and the number of remaining fixes needed to match the spec. I evaluated the implementations with Claude Code, using Claude Opus 4.7 at xhigh. It wrote the design document and was able to implement a clean solution itself (at least, according to GPT-5.5's review), so let us trust that it is a good judge.
Results:
- Gemma-4-31B failed completely. It wrote a skeleton solution, but mocked half of the modules and made several mistakes in the implementation. No tests, no __init__.py, no requirements.txt or pyproject.toml, and the docs basically say “just install NumPy and you’ll be fine.” Cost: $0.112, 803k context tokens consumed, 21k tokens generated.
- Codex-Spark high produced a very beautiful implementation, very quickly — pity it does not work. All the files are neatly arranged into folders, but the imports are wrong. The model hallucinated methods for its own code, did not write unit tests, and did everything in two commits: all code plus documentation. I do not know how much money was spent; as far as I understand, Spark has no API. It used 1% of the Spark limits from the $100 subscription.
- Claude Haiku wrote very detailed docs and a README, and created several Git branches (!), but wrote no tests, leaks test data into train, computes metrics incorrectly, and does not provide the necessary samples to the proposer. The code has many TODOs, no exception handling, and the entire loop will crash on a single error. It read 246k tokens, wrote 78k tokens, and cost $1.067 — the most expensive of the tested models.
- Qwen-3.6-27B-q4_k_m got it almost right, but there is a train-to-test leak in the code. It is a one-line fix, but still an error. In addition, there are no tests, no retries for LLM requests — though there is a TODO — and OPS.md does not describe common errors, how to fix them, the update guide, and so on. It read 39k tokens and wrote 45k tokens. It ran for almost an entire workday, around 8 hours — unsurprisingly, since I partially offloaded the model into RAM and got 10 TPS on an empty context and 1–2 TPS near the end of the solution (a rough sketch of that setup follows this list). This is exactly why I did not even try to run Gemma-4-31B locally, especially given its outdated architecture and a KV cache that is, compared to Qwen's, prohibitively heavy.
- Qwen-3.6-27B in full quality via OpenRouter unexpectedly solved the task almost completely. The most serious issue is that instead of hashing a mutable object, it uses a substring from it, meaning we will not be able to track changes. But the autoresearch loop is fully working. There are tests, docs, commits — no branches, true, but who cares, they are not necessary here — a README, and so on. The reason is probably simple: the model ran the tests it wrote, so it caught all the errors that appeared in the other implementations. It consumed 4.4M tokens (!) and wrote 58k tokens. The run cost $0.939, which was surprisingly expensive -- the model costs $2 (!!!) per million tokens.
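For reference, the partial-offload setup from the local Q4_K_M run looks roughly like the sketch below (minimal example using llama-cpp-python; the file name, layer split, and context size are illustrative, not my exact configuration):

```python
# Minimal sketch of partial offload: only some transformer layers fit on the
# 16 GB card, the rest run from system RAM, which is why throughput collapses
# from ~10 TPS on an empty context to 1-2 TPS as the context fills up.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen-3.6-27B-q4_k_m.gguf",  # illustrative file name
    n_gpu_layers=30,  # layers offloaded to the GPU; the remainder stay in RAM
    n_ctx=65536,      # agentic runs need long context, and the KV cache eats VRAM too
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Implement the loop described in the design document."}],
    max_tokens=2048,
)
print(out["choices"][0]["message"]["content"])
```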
If we evaluate the solutions through the lens of “given competent feedback, which weak agent would be easiest to finish the job with?”, both Qwens win decisively. The full-quality Qwen has tests and can be fixed with two one-liners. The quantized Qwen can be fixed with one one-liner (and by writing tests, lol). Everything else is much less trivial to repair. Codex was especially disappointing: despite a beautiful and clean architecture, the code does not even import and is not covered by tests. A weak model, even with good feedback, will try to fix it and then say “I did everything, trust me bro” without actual confirmation that the fix worked.
So, conclusions: can a local model replace a $20, $100, or $200 subscription? Of course not. More than that, my small test is not representative at all — in real work, you have to navigate a large existing repository, not one-shot projects from a design document.
But I would still start thinking about a second GPU so that Qwen fits entirely into VRAM and inference becomes faster. APIs are becoming more expensive, models generate more tokens, subscriptions are getting restricted — I am confident that in six months a $20 plan will no longer let anyone vibe code properly, while $100 and $200 plans will either be cut down by limits to the level of the $20-plan Codex from a month ago, or strangled through KYC. Qwen, meanwhile, runs on my gaming (!) PC, writes code — slowly and with mistakes, but it still writes it — and is perfectly capable of replacing lower-tier proprietary models. If I add something like a 3060, which costs about one and a half to two months of a $200 Claude subscription, to my setup, I will be able to run Qwen in Q6_K_M fully in VRAM. It will be fast, it will probably match the performance of the uncompressed Qwen from OpenRouter, and compared to a $200-per-month toll it has a reasonable ROI.
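The ROI claim, in numbers (all prices are assumptions for illustration: a used 3060 around $350 against a $200/month plan, with a rough guess at extra electricity):

```python
# Illustrative break-even for a second GPU vs. a top-tier subscription.
gpu_cost = 350.0          # assumed price of a used RTX 3060, USD
monthly_sub = 200.0       # top-tier subscription, USD/month
extra_electricity = 15.0  # rough guess at additional power cost, USD/month

months_to_break_even = gpu_cost / (monthly_sub - extra_electricity)
print(f"Break-even after ~{months_to_break_even:.1f} months")  # ~1.9 months
```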
I am confident that in six months the models will be updated, but the situation will remain roughly the same: Qwen-4 will handle vibe coding at the level of, or even better than, Claude Haiku 5 — that is, at the level of the current Sonnet 4.6 / Opus 4.5. This means that with occasional and relatively cheap reviews from a large, competent model through API, we will be able to fully get rid of the OpenAI/Anthropic/Google subscriptions. And that warms my soul.
Review document for the implementations by Claude:
https://github.com/chameleon-lizard/autoresearch_qwen_27b_q4_k_m/blob/main/autoresearch_review.md
Implementations repositories:
autoresearch_haiku:
https://github.com/chameleon-lizard/autoresearch_haiku
autoresearch_qwen_27b_q4_k_m:
https://github.com/chameleon-lizard/autoresearch_qwen_27b_q4_k_m
autoresearch_qwen_27b_openrouter:
https://github.com/chameleon-lizard/autoresearch_qwen_27b_openrouter
autoresearch_gemma_4_31b_openrouter:
https://github.com/chameleon-lizard/autoresearch_gemma_4_31b_openrouter
autoresearch_codex_spark:
https://github.com/chameleon-lizard/autoresearch_codex_spark
aegismuzuz@reddit
Solid benchmark, props for not asking for another python snake script. But the "buy a $300 GPU to ditch the sub" move is a classic local hosting trap. You're pricing the silicon but ignoring your own time. Waiting 8 hours for a quantized model to choke at 1-2 t/s near the end of the context - especially since Q4 loses coherence - is a total workflow killer. Your time is worth way more than a $200/mo subscription. Local is great for privacy or pet projects, but for actual work, paying Anthropic to get it done in 30 seconds is the obvious play
mksrd@reddit
$300 GPU no, $2k unified RAM workstation yes, because instead of 1-2 tps you'll see 20-40x better perf, so time taken will be measured in tens of minutes, not single-digit hours.
netikas@reddit (OP)
Well, I've tried Qwen-3.6-35B, it outputs 70-50 tps on my hardware. It's not that smart, but it's smart enough -- and future models might be okayish.
viperx7@reddit
I run qwen3.6 27B on 48GB VRAM and it is very capable, and every now and then it surprises me with things it can do
For a 48GB VRAM setup this is just way too good: you get a Q8 model with full context, no context quantisation, plus vision
relmny@reddit
is that an rtx pro 5000?
Zeta1Reticuli@reddit
I think the primary question is whether Alibaba with its organizational changes will still continue to release small dense models by this time next year. Right now the two players we have for local model use are really just Qwen (Alibaba) and Gemma (Google).
abeecrombie@reddit
Second this. Though I think you usually see it show up a year after big changes happen. For the next few months I'm hoping we are ok. After Qwen 4 or so I'm with you. Or if the CCP decides that open source is no longer the preferred route.
We all know it's gonna happen at some point.
Love open source but this is not really a few ppl sharing code or maintaining a project.
Training models is pretty expensive. Not to mention all the PhDs who are demanding crazy salaries (not sure if that is the same in China).
takethismfusername@reddit
The CCP doesn't decide things like this. They're not a cartoon villain.
alphapussycat@reddit
The CCP has complete control over all companies, but they won't interfere unless they deem it important enough. AI is related to national security and affects the whole world. They could absolutely decide that no company is allowed to do open-weight models, but it doesn't seem like they'd actually bother with that. MiniMax and Kimi are afaik still releasing open weights, and I don't think they're stopping, unlike Qwen, which has said they will.
takethismfusername@reddit
The US bans Nvidia from selling chips to China and bans other companies from selling equipment. The US also has complete control over all companies, and it looks like they are bothered or afraid enough to impose these actions.
alphapussycat@reddit
No, afaik Nvidia can sell gpus in China, they just can't ship from US to China.
takethismfusername@reddit
Come on, dude, this is embarrassing.
alphapussycat@reddit
You think there aren't Nvidia gpus in China?
takethismfusername@reddit
Why so disingenuous? Does it taste that good? They ban the most high-end chips.
Paradigmind@reddit
Pretty funny if you think that a totalitarian party will not decide anything of political or economical relevance.
takethismfusername@reddit
They didn't even know DeepSeek existed until they dropped R1, let alone knowing DeepSeek had already open-sourced V2, V3 before R1.
Paradigmind@reddit
Pretty bullshit as every company is required to have a party committee.
EsotericTechnique@reddit
I think the Qwen open-source strategy is in reality a soft-power move by the CCP. I'm not sure, but it seems plausible; other Chinese labs also release their weights consistently. It seems like a way to achieve two things: disrupt the Western hyperscalers' strategy, while winning headspace among developers and easing the burden of implementing their models. I could be wrong, and this is completely speculative :p
ProudCanadaCon36@reddit
No. It's just that they're superior to worthless ammurrican biomass.
hurdurdur7@reddit
Don't expect q4 quants to excel at coding. Q6 and up only.
bsawler@reddit
Is the gap between Q4_K_M/Q4_K_L and Q6_K really that large? I know that according to metrics the accuracy loss is "only" in the low single-digit percentages, but that doesn't really tell you how much of the ability to write usable code is lost.
aegismuzuz@reddit
The gap is massive for agents specifically. When an agent writes code, it’s constantly making tool calls and parsing json responses. At q4, the model starts hallucinating keys or mangling the tool call schema. q6 is basically the absolute floor if you want to keep any semblance of syntax strictness. Anything under 6 bits and you’ll spend half your day debugging dumb model typos instead of actually shipping
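To make that failure mode concrete, here's a minimal sketch of the kind of guard a harness ends up needing around quantized-model tool calls (the two-key schema and retry wording are illustrative, not taken from any particular agent framework):

```python
import json

REQUIRED_KEYS = {"name", "arguments"}  # minimal illustrative tool-call schema

def parse_tool_call(raw: str) -> dict | None:
    """Return a well-formed tool call, or None if the model mangled it."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict) or not REQUIRED_KEYS.issubset(call):
        return None  # missing or hallucinated keys -- the typical low-quant failure
    return call

def tool_call_with_retries(generate, prompt: str, max_retries: int = 3) -> dict:
    """`generate` is whatever callable asks the model for a raw tool call."""
    for _ in range(max_retries):
        call = parse_tool_call(generate(prompt))
        if call is not None:
            return call
        prompt += "\nYour last tool call was not valid JSON with 'name' and 'arguments'. Try again."
    raise RuntimeError("model kept producing malformed tool calls")
```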
road-runn3r@reddit
This is not my experience at all with an IQ3_XXS quant. Zero tool-call errors and no typos whatsoever. Is it the same as a q6? Not at all, but it's usable.
YourNightmar31@reddit
I suspect people are vastly overestimating the quality difference between Q4 and Q6.
EmuMammoth6627@reddit
Right, I mean in a lot of instances even if an error is made it's just one more step of troubleshooting to fix it.
hurdurdur7@reddit
A single-digit-percentage error rate on 50,000 tokens of generated code means a bunch of code that just doesn't work.
Opening-Broccoli9190@reddit
Hey there homie, could you clarify - what harnesses have you used and which prompts did you give?
netikas@reddit (OP)
Pi agent for everything except codex spark, it’s exclusive to codex afaik.
mrexodia@reddit
I use codex spark through pi all the time
netikas@reddit (OP)
Hmm. How do I set it up? I don't see it in the OpenRouter section.
mrexodia@reddit
Codex Spark is only available through your codex subscription. You run `/login` and OAuth with your ChatGPT account to use it.
AkiDenim@reddit
Reverse engineer Codex to see how to use Codex spark i believe. I use codex spark in Opencode just like that
segmond@reddit
Once you did that prompt you never had to manually prompt again?
netikas@reddit (OP)
Yep. All repos are single shotted. Again, this choice is deliberate, since I evaluate failure modes of each model, not their achievements.
netikas@reddit (OP)
I’m pretty sure I can get better performance if I tune my prompts a bit, but it’s a deliberate choice. The models are weak, this will highlight the failure modes and make the comparison apples-to-apples, without model specific tuning.
tomByrer@reddit
Sure, when you use hosted models, THEY do all the tuning for you. & they eat your data to better -clone your work- tune their models also.
Local hosting takes more time than hosted, just like cooking at home takes more time than drive-through restaurants. You can also get better-tuned quants, maybe IQ4
https://huggingface.co/mradermacher/Qwen3.6-27B-i1-GGUF
https://www.reddit.com/r/LocalLLaMA/comments/1svnmgo/quant_qwen3627b_on_16gb_vram_with_100k_context/
Also seems how you configure to run the models has a great impact:
https://www.reddit.com/r/LocalLLM/comments/1sv6cqk/follow_up_tested_tool_calling_fixes_for_qwen/?share_id=RSHkFdMq4wJM1BawzDUMk&utm_medium=ios_app&utm_name=ioscss&utm_source=share&utm_term=1
AutomaticDriver5882@reddit
I wanted the same
Cultural_Meeting_240@reddit
nice writeup. running qwen 27b on a 3090 is pretty solid honestly. i had a similar setup with mixtral back in the day and it was surprisingly usable. curious how it holds up on longer coding tasks tho, thats usually where local models start falling apart
YourNightmar31@reddit
What quant are you running, with how much context, and does it support tool calls and vision?
JuniorDeveloper73@reddit
Which is better for coding and side thinking: Qwen3.6-27B-UD-Q5_K_XL.gguf or Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf?
aegismuzuz@reddit
It’s not really a choice of "who is smarter" it’s just about your VRAM budget. That 35B MoE has over 2x the weights compared to its active params, and you still have to hold the entire 35B in vram even if only 3-4B fire per token. If memory is tight, a dense 27B at q6 is always going to be more stable and predictable than an aggressively compressed MoE fwiw
netikas@reddit (OP)
27b all the way.
Rule of thumb: for an MoE, sqrt(total * active) ~= dense equivalent (quick check below). This way the 35B is roughly equivalent to a 10B dense model.
Quantization beyond q5 usually means quality closer to q8, and Q5_K_XL would be much better than the Q4_K_M I was testing.
But it would be much slower, that’s for sure.
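Quick check of that rule of thumb, using the sizes from this thread:

```python
from math import sqrt

def dense_equivalent(total_b: float, active_b: float) -> float:
    """Geometric-mean rule of thumb for an MoE model's 'effective' dense size."""
    return sqrt(total_b * active_b)

print(dense_equivalent(35, 3))  # ~10.2 -> the 35B-A3B behaves like a ~10B dense model
```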
Karyo_Ten@reddit
That's the kind of content I come to r/localllama for
Danmoreng@reddit
If you take an IQ3_XXS quant at ~12 GB you can actually fit Qwen 27B into VRAM with decent context. I get around 30 t/s on my RTX 5080 mobile (at empty context). With fine-tuned settings you can fit 90k of context in the rest of the VRAM. See my launch script here: https://github.com/Danmoreng/local-qwen3-coder-env/blob/main/run_qwen3_6_27b_optimized.ps1
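(For sizing intuition, a sketch of how the KV cache scales with context; the layer/head numbers are placeholders rather than the real Qwen-3.6-27B config, and the footprint also depends on GQA and cache quantisation.)

```python
def kv_cache_gib(n_ctx: int, n_layers: int, n_kv_heads: int, head_dim: int,
                 bytes_per_elem: float = 2.0) -> float:
    """K and V caches: 2 tensors per layer, each of shape n_ctx x n_kv_heads x head_dim."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem / 1024**3

# Placeholder dimensions -- NOT the actual model config.
dims = dict(n_layers=48, n_kv_heads=4, head_dim=128)
print(kv_cache_gib(n_ctx=90_000, **dims))                      # ~8.2 GiB at FP16
print(kv_cache_gib(n_ctx=90_000, bytes_per_elem=1.0, **dims))  # ~4.1 GiB with a q8-ish cache
```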
ea_man@reddit
Dho I used to run 27b on a 6700xt: https://www.reddit.com/r/LocalLLaMA/comments/1styxdy/compared_qwen_36_35b_with_qwen_36_27b_for_coding/
Acrobatic_Entry_2841@reddit
Can you help a bit with this situation:
Machine = Apple M4 24GB. qwen-coder-30B-A3B was working at a good speed in the ollama setup. I too would like to move away from the subscription services a bit, if we can get 'similar' performance.
machinegunkisses@reddit
Great work, thanks for sharing!
bahwi@reddit
Using ralph wiggum loop in pi and Qwen is working excellently. Plan/goal designed by gemini 3.1 though, haha.
ToInfinityAndAbove@reddit
Tldr?
netikas@reddit (OP)
Qwen good enough
RageQuitNub@reddit
can you share the design document, curious to see what tasks are given
netikas@reddit (OP)
It's presented in the repos, check it out:
https://github.com/chameleon-lizard/autoresearch_qwen_27b_q4_k_m/blob/main/DesignDoc.md
RageQuitNub@reddit
I am blind, thank you
PinkySwearNotABot@reddit
after glancing through that doc, I might as well be blind too, because i have no idea what i just read. i was expecting some basic .md sheet, but it was written more like some phd research paper out of Google Mind
QBTLabs@reddit
8 t/s on a partially offloaded Mixtral 8x7B is about where that setup peaks, so the Ansible struggles track.
Qwen3-27B at Q4_K_M on a 3090 gets you into the 15-20 t/s range depending on context length, which is where agentic loops start feeling usable rather than painful.
netikas@reddit (OP)
Well, again, this post is not about feasibility of running Qwen3 on a 16GB GPU, it's more about the ROI of local hardware finally becoming good enough so I can carefully recommend buying a second GPU for local coding instead of saying that it's for hobbyists and cloud models -- even smaller ones -- will be better in 100% of cases.
segmond@reddit
How many times did you have to prompt it? Can you add your prompts into the git repo?
gladfelter@reddit
I get 38t/s on my 3090, but it can slow down to 25t/s as the context grows to 80k or so.
gh0stwriter1234@reddit
If you have a 3090 you should be using one of the dflash implementations for it... either vllm or lucebox; the latter gets around 100 t/s on a 3090 in benchmarks.
gladfelter@reddit
I've heard that the context window is much smaller on a 3090 with vllm. I'm able to go as high as 160k on llama-server. Also, vllm with the high performance configs sounds incredibly unstable right now, has that changed?
gh0stwriter1234@reddit
Might be true, had issues recently with B70s... and a friend had issues with R9700... vllm is definitely harder to get right.
val_in_tech@reddit
Someone told me a few days ago that 25 tps generation on Kimi k2.6 at 64k context is unusable 😆
illforgetsoonenough@reddit
Everyone has their own preferences and tolerances
Character_Split4906@reddit
For the test to be benchmarked right, I believe it's essential to use the same coding harness across the base model and the benchmarked models. Did you use the same coding harness when you implemented the solution with Claude as with the local models? I have seen coding harnesses make a big difference. Claude Code or opencode, set up right with local models, can improve your results by a considerable percentage.
netikas@reddit (OP)
Yep. The only difference is codex-spark, which uses Codex, as it's not available anywhere else. Otherwise the setup is matched.
Character_Split4906@reddit
Didn't you say above that you used Claude Code to get the initial solution from Opus 4.7?
netikas@reddit (OP)
This was in another repo, on another machine, for another task.
It all started when I wanted to create an agentic autoresearch loop without clear scope in mind. I've iterated on the design via CC/Opus 4.7 and the original repo was indeed created by CC. Then I asked it to write the design document -- which was used for this experiment as the prompt, in isolation.
Void-kun@reddit
We are still miles away from what we need.
I'm used to running multiple different agents in parallel when working.
I hope we see a day where consumer grade hardware can allow us that performance locally.
spencer_kw@reddit
finally someone testing on something harder than snake.py. the qwen-handles-gruntwork-opus-reviews-the-hard-parts setup is exactly what i've been running. cost per completed task drops like 60%
FlyingDogCatcher@reddit
too long, didn't read lol
Such_Advantage_6949@reddit
We used to think that GPT-3.5 locally was all we needed… when you have the current Sonnet at home, the frontier model of that time will make you drool…