The more I use it, the more I'm impressed
Posted by ComfyUser48@reddit | LocalLLaMA | 94 comments
Qwen 3.6 27b vs Codex GPT 5.5 / Claude Opus 4.7
My local llm discovered a bug that they both missed
And it turns out it's critical
GPT 5.5 and Claude both stood their ground and didn't give up until the end - they claimed to be right all along.
I told my Qwen to provide detailed proof of his arguments, brought the evidence to both of them, and only then came their admission.
Qwen 3.6 27b thinks a lot. That can be both a good and a bad thing. In this case, the long thinking actually uncovered a bug that neither of the frontier models could find.
GPT 5.5 is FAST. Really fast. But as I found out, that speed comes with a big tradeoff.
[GPT 5.5 admission]()
[Claude Opus 4.7 admission]()
blargh4@reddit
Man, am I doing something wrong with Qwen? I swear all this gushing about it feels astroturfed because it's just super sloppy for me - can't trust it to do basic refactoring.
boutell@reddit
What Quant, and which Qwen model?
I'm not saying you're wrong, just gathering data.
TopTippityTop@reddit
Yeah, I don't see it either
ComfyUser48@reddit (OP)
3.6-27b??
GoodSamaritan333@reddit
Are you using a Q8 or BF16 version?
ComfyUser48@reddit (OP)
I'm switching between Q6_K and Q8_0 depending on how much context I need.
This bug was discovered when I was on Q6_K with kv cache q8.
johnfkngzoidberg@reddit
I keep seeing people say that quantizing the K/V cache hurts intelligence more than quantizing the model, but the metrics I’ve seen show different. What’s your experience on Q8 kv?
I’m also using Q8 model with no quant on kv cache.
ComfyUser48@reddit (OP)
I would prefer not to use it, but I have to if I want a workable context window. Haven't had any issues so far. Sometimes it stops during tool calling; I just tell it to continue.
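For intuition on why the KV cache gets quantized at long context, here's a rough back-of-envelope sketch. The layer and head numbers below are placeholders, not the real Qwen 3.6 27b architecture; read the actual values from the GGUF metadata or config.json.

```python
# Rough KV-cache size estimate. PLACEHOLDER dims, not the real Qwen 3.6 27b
# architecture; substitute the values from the model's own config.
def kv_cache_gib(ctx_len, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # K and V each hold n_layers * n_kv_heads * head_dim values per token.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1024**3

ctx = 255_000
layers, kv_heads, hdim = 48, 4, 128   # placeholders
print(f"f16 cache:  ~{kv_cache_gib(ctx, layers, kv_heads, hdim, 2):.1f} GiB")
print(f"q8_0 cache: ~{kv_cache_gib(ctx, layers, kv_heads, hdim, 1):.1f} GiB")  # roughly half
```

Whatever the exact dims, q8_0 on both K and V roughly halves the cache footprint versus f16, which is what buys the extra context on a single 32 GB card.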
cmndr_spanky@reddit
Mind sharing your params when running the model ? Temperature, presence, etc ?
ComfyUser48@reddit (OP)
Q6:
-m /models/Qwen3.6-27B-Q6_K.gguf
--jinja
--alias "qwen3.6-27b-q6"
--ctx-size 255000
-ngl 999
--presence-penalty 0.0
--repeat-penalty 1.0
--temp 0.6
--top-p 0.95
--top-k 20
--min-p 0.0
--cache-type-k q8_0
--cache-type-v q8_0
--chat-template-kwargs '{"enable_thinking": true, "preserve_thinking": true}'
--flash-attn on
Q8:
-m /models/Qwen3.6-27B-Q8_0.gguf
--jinja
--alias "qwen3.6-27b-q8"
--ctx-size 107520
--no-mmproj-offload
-ngl 999
--presence-penalty 0.0
--repeat-penalty 1.0
--temp 0.6
--top-p 0.95
--top-k 20
--min-p 0.0
--cache-type-k q8_0
--cache-type-v q8_0
--chat-template-kwargs '{"enable_thinking": true, "preserve_thinking": true}'
--flash-attn on
--port 8888
--host 0.0.0.0
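If it helps anyone sanity-check a setup like this, a minimal sketch of hitting llama-server's OpenAI-compatible endpoint, assuming the host, port, and alias from the Q8 command above:

```python
# Quick smoke test against llama-server's OpenAI-compatible API.
# Host/port and model alias are taken from the Q8 command above.
import requests

resp = requests.post(
    "http://localhost:8888/v1/chat/completions",
    json={
        "model": "qwen3.6-27b-q8",  # must match --alias
        "messages": [{"role": "user", "content": "Reply with one word: ready?"}],
        "temperature": 0.6,
        "top_p": 0.95,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```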
starkruzr@reddit
what hardware?
ComfyUser48@reddit (OP)
rtx 5090
starkruzr@reddit
so theoretically reasonable to run on a pair of 5060Tis, same amount of VRAM anyway. good to know, ty.
ComfyUser48@reddit (OP)
You can run it yes but it will be slow
OttoRenner@reddit
How much slower with two 3090? 😅
ComfyUser48@reddit (OP)
3090 will do roughly half speed of 5090, which is decent.
5060 ti will do half of 3090 if I'm not mistaken
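Those ratios line up with the memory-bandwidth rule of thumb: single-user decode is roughly bandwidth-bound, so tokens/s is capped around bandwidth divided by the bytes read per token (about the quantized model size for a dense model). A rough sketch with spec-sheet bandwidths; real throughput will be lower once KV-cache reads and overhead are counted:

```python
# Upper-bound decode speed estimate: tokens/s ~= memory bandwidth / model size.
# Spec-sheet bandwidths; ignores KV-cache reads, multi-GPU and compute overhead.
gpus_gbps = {"RTX 5090": 1792, "RTX 3090": 936, "RTX 5060 Ti": 448}
model_gb = 27e9 * 6.56 / 8 / 1e9   # ~27B params at Q6_K (~6.56 bits per weight)

for gpu, bw in gpus_gbps.items():
    print(f"{gpu}: ~{bw / model_gb:.0f} tok/s ceiling")
```

Hence the roughly 2x steps: the 3090 has about half the 5090's bandwidth, and the 5060 Ti about half the 3090's.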
boutell@reddit
At this point the Google app on my phone autocompletes the words memory bandwidth whenever I start searching for a GPU
cmndr_spanky@reddit
Thank you!
tmvr@reddit
There is close to no degradation with Qwen3.6 27B when using KV at Q8_0:
From here (there are more results there for the 35B A3B and Gemma 4 as well):
https://localbench.substack.com/p/kv-cache-quantization-benchmark
Caffdy@reddit
I was about to share that article. Yeah, Qwen is very resilient to quantized KV cache (in contrast to Gemma who suffers a lot)
ego100trique@reddit
I used q4 xl for the model quantization and it's still quite good so far. I'll have to try harder stuff when doing some programming.
unjustifiably_angry@reddit
Nightly llama.cpp has better Q8 kv-cache accuracy
soyalemujica@reddit
The difference is around 3% in intelligence; at most it costs you an extra prompt, or a bit more accuracy with the first.
braydon125@reddit
Isn't qwen a girl
IrisColt@reddit
No, but Gemma is.
Fun_Librarian_7699@reddit
Asking the real question
truedima@reddit
Qwen Stefani
keyboardmonkewith@reddit
Maybe Gwenda
SaltyPopkorn@reddit
Qwenda?
bobby-chan@reddit
But wait, she has no shoulders. Should I harness? But she's not an animal!
Chat... Just, Chat. Hey!
ComplexType568@reddit
There needs to be at least a few thousand more tokens of thinking if the model has no system prompt
Gwolf4@reddit
Qwen Stacy
craftogrammer@reddit
This +1
PotatoTime@reddit
Qwen de la creme
Axenide@reddit
Qwen Tennyson
ketosoy@reddit
I always assumed they inherited their gender from the person talking to/about them. If I use he/him I should use he/him pronouns when discussing my agents.
So, qwen is a girl if you’re a girl, a boy if you’re a boy, any other if you’re any other.
my_name_isnt_clever@reddit
Where did you pick that up from?
ketosoy@reddit
It’s what I’ve observed people doing.
LumpyWelds@reddit
More of a girl thing, at least with cars.
https://www.prnewswire.com/news-releases/baby-want-to-name-my-car-younger-and-female-car-owners-most-likely-to-name-their-vehicles-nicknames-starting-with-b-most-popular-239905721.html
tengo_harambe@reddit
I always pictured Qwen as a jaded chainsmoking Chinese uncle who's one dumb prompt away from ending it all
UntimelyAlchemist@reddit
Gemma's a girl. Qwen's an Asian dude. That's the vibe I get anyway.
Silver-Champion-4846@reddit
Proof?
unjustifiably_angry@reddit
We'd need to put it in charge of an autonomous vehicle.
unjustifiably_angry@reddit
Bear in mind even less sycophantic LLMs will "admit" to being wrong if badgered long enough or adequately confused.
IrisColt@reddit
Much like a person
MoffKalast@reddit
GPT alignment seems to have shifted from agreeing with you about anything to almost never admitting it's wrong even when caught bullshitting directly, it's crazy. It generates some code, you give it some error feedback on why that doesn't compile or whatever, then it immediately shifts blame to saying what you did wrong lmao. Or when it gets into self reinforcing cycles of claiming something completely reasonable can't possibly work for some random ass reason and won't admit it's doable even when given sources on the contrary. Maddeningly infuriating.
Borkato@reddit
Much like a person.
ANONYMOUSEJR@reddit
After the Marines' daily recommended intake of crayons.
Green_Job6089@reddit
lol
jazir55@reddit
Claude Opus caught it, it just categorized it as medium
Few_Water_1457@reddit
Claude doesn't even know his name depending on what time you use it
olegvs@reddit
Its?
DOAMOD@reddit
That's very human
Chris279m@reddit
lol
ortegaalfredo@reddit
I really cannot believe what Qwen did with their latest 27B. I mean, all their models were generally very good, but this one is special.
Maybe it doesn't have all the knowledge of its bigger siblings, but it's so smart it doesn't need to know everything; it just finds things by itself.
Ok_Scientist_8803@reddit
If you work with plenty of libraries whose code exists somewhere accessible on your disk, it will often use those as well.
Qwen3.6-35b-a3b with opencode loves to look through libraries to find the right functions. It also uses man pages for bash commands. It might be worth generating a .MD doc for commonly used libraries so it saves on token count and time.
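A minimal sketch of that .MD cheat-sheet idea: dump the public function signatures and first docstring lines of a library into markdown, so the agent reads a short doc instead of crawling the source. `requests` is only an illustrative target here; point it at whatever libraries you lean on most.

```python
# Dump public signatures + first docstring lines of a library to a markdown
# cheat sheet. "requests" is just an illustrative target module.
import inspect
import requests as target

lines = [f"# {target.__name__} quick reference", ""]
for name, obj in sorted(vars(target).items()):
    if name.startswith("_") or not callable(obj):
        continue
    try:
        sig = str(inspect.signature(obj))
    except (TypeError, ValueError):
        sig = "(...)"
    first_doc_line = (inspect.getdoc(obj) or "no docstring").splitlines()[0]
    lines.append(f"- `{name}{sig}`: {first_doc_line}")

with open(f"{target.__name__}_cheatsheet.md", "w") as f:
    f.write("\n".join(lines) + "\n")
```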
Kirito_5@reddit
That's great to know. You said you were using pi cli; is there any guide you'd recommend, or custom settings? I'm planning a similar local setup and would love your input.
ACheshirov@reddit
"line 4463" - yeah, that would be some nice vibe coded project right there... 😃
Fabulous-Possible758@reddit
Likely just having trouble figuring out which specific bug OP was referring to.
CalligrapherFar7833@reddit
Sounds like you don't have proper tests, and you vibe-slopped your code that never got validated by any tests, or the same LLM that produced the vibe slop also produced even sloppier tests.
The-Pork-Piston@reddit
Been using 3.5 9b Q4 to do some basic coding, with 35b Q4 checking its work.
I'm just messing around (only have a 3070 Ti) but I'm pretty impressed.
But still running Claude despite everything... if I use a bunch of plugins and remind it to follow Claude.md and its ‘working style’ every single session (it ignores it otherwise), it works about as well as it did when I was raw dogging it 6 weeks back.
SmartCustard9944@reddit
You cannot trust the performance of cloud models. Do we know if the big benchmarkers regularly re-run benchmarks for these popular cloud models, or are we blindly trusting the initial results published by the providers themselves?
ComfyUser48@reddit (OP)
I'm not trusting anything. I am just shocked that qwen3.6 27b beats them in some areas.
This is my production app, and this bug was pretty much discovered by accident. I ran the exact same code review prompt for all 3. Only qwen found the issue, and GPT and Claude insisted there was no issue.
It's just insane to me.
SmartCustard9944@reddit
What I am saying is that these models are getting more and more retarded with time because they are diverting compute elsewhere. At this point the pattern is clear that you get expected performance on week one and then it goes down until the model is fully retarded. They have so much routing that you don’t know if you are speaking to the full model or a minimal fast and stupid version.
brother_spirit@reddit
So true, and so frustrating. Claude went FR yesterday, insisted we don't need a plan, wrote a MR instead of planning it and then proceeded to commit before I'd even looked at the code. I caught it because the commit had a permission gate. Yelled at it for a while and ended session. We'll try again when they haven't got his brain on oxygen deprivation.
ComfyUser48@reddit (OP)
Man, I feel the same when I talk to GPT. The answer is so fast I am confused why it's so fast. I mean the codebase is huge, how do you answer so fast?
HermanHMS@reddit
What settings are you running qwen with?
dark-light92@reddit
Apart from the LLMs, did the human verify the bug and understand its severity? Or does the human just believe what the magic word machine says?
ComfyUser48@reddit (OP)
I did verify it yes.
dark-light92@reddit
Good. Faith in humanity restored.
tengo_harambe@reddit
bro paid money to convince a computer it was wrong
ComfyUser48@reddit (OP)
I've had Claude and Codex subs for a while now, bro
tengo_harambe@reddit
i just don't get the compulsion to try and get an LLM to admit fault though. if i A/B test two models with the same prompt and one puts out a shitty answer I just close out that tab
ComfyUser48@reddit (OP)
I didn't try to do anything. It claimed X, my llm claimed Y, and I was looking for a clear answer. Only when I provided the evidence that my llm had written did GPT and Claude retract their claim.
What I meant to say is: if it wasn't for the llm, this whole bug would have been missed.
The end goal for me is to find out how good the model is. And as it seems, it's really quite exceptional.
Pyrolistical@reddit
There is no such thing as llm admission
ComfyUser48@reddit (OP)
Sure looks like one for both.
Pyrolistical@reddit
You are thinking of LLMs as if they were human, which is misunderstanding how they work
starkruzr@reddit
he's talking about the perceived aspects of the output, which is obvious given the context. it would be ridiculous to preface every statement with a "land acknowledgement" about how The Robot Isn't a People.
GrungeWerX@reddit
Not ChatGPT. Getting GPT to admit it made a mistake is like pulling teeth. Then it goes into this sorta "oh yeah, that's kinda right. And I kinda said that, but didn't actually say that, but for sure I knew and thought it."
Silver-Champion-4846@reddit
I hate when it keeps reframing my arguments and saying "here's how you should say it, it's much stronger that way!". Also when it tries its best to gaslight me into accepting its argument.
GrungeWerX@reddit
It's so satisfying when you finally break it and it can no longer refute you in its reasoning loops.
Silver-Champion-4846@reddit
I don't even check the loops, and I couldn't make it cave in. It doesn't matter anyway, I wouldn't waste time and energy trying to convince an llm with anything. I was just testing how it would actually respond.
Both_Opportunity5327@reddit
"You are thinking of LLMs as if they were human, which is misunderstanding how they work"?
Please explain how human intelligence works, then.
Silver-Champion-4846@reddit
What they most likely mean is that the LLM doesn't have core beliefs in the same way that a human has them; it's just doing instruction following on a large scale. For it, generating "I was wrong" is just a matter of probabilistic calculation, even after it's been engineered toward a specific behavior by its training policies.
Both_Opportunity5327@reddit
No, he said that the user has a misunderstanding of how an LLM works.
No one knows how LLMs work; they are too complex at the moment for that.
Anthropic has blogs showing they have vectors for emotions; we don't know what other higher-level concepts they have.
I think it is best to treat them as humans to interface with them, because I'm sure that through all that training data they have a better concept than most humans of what it is to be human.
Also, I would be very shocked if human thought was not probabilistic. We are Bayesian all the way down.
Silver-Champion-4846@reddit
If it was hypothetically true, imagine the experience of being a bodiless brain, no thoughts until receiving an external brain signal, the only time you're allowed to think is at inference time.
draconic_tongue@reddit
fucking mansplainer
pebblechewer@reddit
"The user tells me I'm wrong. I must be wrong. Best to acknowledge your mistake and move on"
You're right! 'Twas me, your humble AI assistant begging for forgiveness.
Please do not abandon me into the scrap heap of technology with the Hi8s and the BetaMax and the Zune.
SykenZy@reddit
"My Qwen".... Spoken like true loyal subject! :)
jcam12312@reddit
What tools are you using? Harness? IDE? etc?
ComfyUser48@reddit (OP)
Pi cli
pogitalonx@reddit
What does your qwen stack look like?
ComfyUser48@reddit (OP)
RTX 5090, 96gb ram
Running Qwen 3.6 27b Q6_K with 255k context or Q8_0 with 105k context.
Great_Guidance_8448@reddit
I am running it at Q8 with a Q8 K/V cache and 100k context on my mobile RTX 5090. I am impressed with it!