The more I use it, the more I'm impressed
Posted by ComfyUser48@reddit | LocalLLaMA | 94 comments
Qwen 3.6 27b vs Codex GPT 5.5 / Claude Opus 4.7
My local llm discovered a bug that they both missed
And it turns out it's critical
GPT 5.5 and Claude both stood their ground and didn't give up until the end - they claimed to be right all along.
I told my Qwen to provide detailed proof of his arguments, brought the evidence to both of them, and only then came their admission.
Qwen 3.6 27b thinks a lot. That can be both a good and a bad thing. In this case, the long thinking actually uncovered a bug that neither of the frontier models could find.
GPT 5.5 is FAST. Really fast. But as I found out, that speed comes with a big tradeoff.
[GPT 5.5 admission]()
[Claude Opus 4.7 admission]()
blargh4@reddit
Man, am I doing something wrong with Qwen? I swear all this gushing about it feels astroturfed because it's just super sloppy for me - can't trust it to do basic refactoring.
boutell@reddit
What Quant, and which Qwen model?
I'm not saying you're wrong, just gathering data.
TopTippityTop@reddit
Yeah, I don't see it either
ComfyUser48@reddit (OP)
3.6-27b??
GoodSamaritan333@reddit
Are you using a Q8 or BF16 version?
ComfyUser48@reddit (OP)
I'm switching between Q6_K and Q8_0 depending on how much context I need.
This bug was discovered when I was on Q6_K with kv cache q8.
johnfkngzoidberg@reddit
I keep seeing people say that quantizing the K/V cache hurts intelligence more than quantizing the model, but the metrics I’ve seen show different. What’s your experience on Q8 kv?
I’m also using Q8 model with no quant on kv cache.
ComfyUser48@reddit (OP)
I would prefer not to use it, but I have to if I want a workable context window. Haven't had any issues so far. Sometimes it stops during tool calling; I just tell it to continue.
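For intuition on why the KV cache gets quantized at long context, here's a rough back-of-envelope sketch. The layer and head numbers below are placeholders, not the real Qwen 3.6 27b architecture; read the actual values from the GGUF metadata or config.json.

```python
# Rough KV-cache size estimate. PLACEHOLDER dims, not the real Qwen 3.6 27b
# architecture; substitute the values from the model's own config.
def kv_cache_gib(ctx_len, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # K and V each hold n_layers * n_kv_heads * head_dim values per token.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1024**3

ctx = 255_000
layers, kv_heads, hdim = 48, 4, 128   # placeholders
print(f"f16 cache:  ~{kv_cache_gib(ctx, layers, kv_heads, hdim, 2):.1f} GiB")
print(f"q8_0 cache: ~{kv_cache_gib(ctx, layers, kv_heads, hdim, 1):.1f} GiB")  # roughly half
```

Whatever the exact dims, q8_0 on both K and V roughly halves the cache footprint versus f16, which is what buys the extra context on a single 32 GB card.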
cmndr_spanky@reddit
Mind sharing your params when running the model ? Temperature, presence, etc ?
ComfyUser48@reddit (OP)
Q6:
-m /models/Qwen3.6-27B-Q6_K.gguf
--jinja
--alias "qwen3.6-27b-q6"
--ctx-size 255000
-ngl 999
--presence-penalty 0.0
--repeat-penalty 1.0
--temp 0.6
--top-p 0.95
--top-k 20
--min-p 0.0
--cache-type-k q8_0
--cache-type-v q8_0
--chat-template-kwargs '{"enable_thinking": true, "preserve_thinking": true}'
--flash-attn on
Q8:
-m /models/Qwen3.6-27B-Q8_0.gguf
--jinja
--alias "qwen3.6-27b-q8"
--ctx-size 107520
--no-mmproj-offload
-ngl 999
--presence-penalty 0.0
--repeat-penalty 1.0
--temp 0.6
--top-p 0.95
--top-k 20
--min-p 0.0
--cache-type-k q8_0
--cache-type-v q8_0
--chat-template-kwargs '{"enable_thinking": true, "preserve_thinking": true}'
--flash-attn on
--port 8888
--host 0.0.0.0
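If it helps anyone sanity-check a setup like this, a minimal sketch of hitting llama-server's OpenAI-compatible endpoint, assuming the host, port, and alias from the Q8 command above:

```python
# Quick smoke test against llama-server's OpenAI-compatible API.
# Host/port and model alias are taken from the Q8 command above.
import requests

resp = requests.post(
    "http://localhost:8888/v1/chat/completions",
    json={
        "model": "qwen3.6-27b-q8",  # must match --alias
        "messages": [{"role": "user", "content": "Reply with one word: ready?"}],
        "temperature": 0.6,
        "top_p": 0.95,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```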
starkruzr@reddit
what hardware?
ComfyUser48@reddit (OP)
rtx 5090
starkruzr@reddit
so theoretically reasonable to run on a pair of 5060Tis, same amount of VRAM anyway. good to know, ty.
ComfyUser48@reddit (OP)
You can run it yes but it will be slow
OttoRenner@reddit
How much slower with two 3090? 😅
ComfyUser48@reddit (OP)
3090 will do roughly half speed of 5090, which is decent.
5060 ti will do half of 3090 if I'm not mistaken
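Those ratios line up with the memory-bandwidth rule of thumb: single-user decode is roughly bandwidth-bound, so tokens/s is capped around bandwidth divided by the bytes read per token (about the quantized model size for a dense model). A rough sketch with spec-sheet bandwidths; real throughput will be lower once KV-cache reads and overhead are counted:

```python
# Upper-bound decode speed estimate: tokens/s ~= memory bandwidth / model size.
# Spec-sheet bandwidths; ignores KV-cache reads, multi-GPU and compute overhead.
gpus_gbps = {"RTX 5090": 1792, "RTX 3090": 936, "RTX 5060 Ti": 448}
model_gb = 27e9 * 6.56 / 8 / 1e9   # ~27B params at Q6_K (~6.56 bits per weight)

for gpu, bw in gpus_gbps.items():
    print(f"{gpu}: ~{bw / model_gb:.0f} tok/s ceiling")
```

Hence the roughly 2x steps: the 3090 has about half the 5090's bandwidth, and the 5060 Ti about half the 3090's.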
boutell@reddit
At this point the Google app on my phone autocompletes the words memory bandwidth whenever I start searching for a GPU
cmndr_spanky@reddit
Thank you!
tmvr@reddit
There is close to no degradation with Qwen3.6 27B when using KV at Q8_0:
From here (there are more results there for the 35B A3B and Gemma 4 as well):
https://localbench.substack.com/p/kv-cache-quantization-benchmark
Caffdy@reddit
I was about to share that article. Yeah, Qwen is very resilient to quantized KV cache (in contrast to Gemma who suffers a lot)
ego100trique@reddit
I used q4 xl for the model quantization and it's still quite good so far. I'll have to try harder stuff when doing some programming.
unjustifiably_angry@reddit
Nightly llama.cpp has better Q8 kv-cache accuracy
soyalemujica@reddit
The difference is around 3% in intelligence; at most it costs you an extra prompt, or a bit more accuracy with the first.
braydon125@reddit
Isn't qwen a girl
IrisColt@reddit
No, but Gemma is.
Fun_Librarian_7699@reddit
Asking the real question
truedima@reddit
Qwen Stefani
keyboardmonkewith@reddit
Maybe Gwenda
SaltyPopkorn@reddit
Qwenda?
bobby-chan@reddit
But wait, she has no shoulders. Should I harness? But she's not an animal!
Chat... Just, Chat. Hey!
ComplexType568@reddit
There needs to be at least a few thousand more tokens of thinking if the model has no system prompt
Gwolf4@reddit
Qwen Stacy
craftogrammer@reddit
This +1
PotatoTime@reddit
Qwen de la creme
Axenide@reddit
Qwen Tennyson
ketosoy@reddit
I always assumed they inherited their gender from the person talking to/about them. If I use he/him I should use he/him pronouns when discussing my agents.
So, qwen is a girl if you’re a girl, a boy if you’re a boy, any other if you’re any other.
my_name_isnt_clever@reddit
Where did you pick that up from?
ketosoy@reddit
It’s what I’ve observed people doing.
LumpyWelds@reddit
More of a girl thing, at least with cars.
https://www.prnewswire.com/news-releases/baby-want-to-name-my-car-younger-and-female-car-owners-most-likely-to-name-their-vehicles-nicknames-starting-with-b-most-popular-239905721.html
tengo_harambe@reddit
I always pictured Qwen as a jaded chainsmoking Chinese uncle who's one dumb prompt away from ending it all
UntimelyAlchemist@reddit
Gemma's a girl. Qwen's an Asian dude. That's the vibe I get anyway.
Silver-Champion-4846@reddit
Proof?
unjustifiably_angry@reddit
We'd need to put it in charge of an autonomous vehicle.
unjustifiably_angry@reddit
Bear in mind even less sycophantic LLMs will "admit" to being wrong if badgered long enough or adequately confused.
IrisColt@reddit
Much like a person
MoffKalast@reddit
GPT alignment seems to have shifted from agreeing with you about anything to almost never admitting it's wrong even when caught bullshitting directly, it's crazy. It generates some code, you give it some error feedback on why that doesn't compile or whatever, then it immediately shifts blame to saying what you did wrong lmao. Or when it gets into self reinforcing cycles of claiming something completely reasonable can't possibly work for some random ass reason and won't admit it's doable even when given sources on the contrary. Maddeningly infuriating.
Borkato@reddit
Much like a person.
ANONYMOUSEJR@reddit
After the Marines' daily recommended intake of crayons.
Green_Job6089@reddit
lol
jazir55@reddit
Claude Opus caught it, it just categorized it as medium
Few_Water_1457@reddit
Claude doesn't even know his name depending on what time you use it
olegvs@reddit
Its?
DOAMOD@reddit
That's very human
Chris279m@reddit
lol
ortegaalfredo@reddit
I really cannot believe what Qwen did with their latest 27B. I mean, all their models were generally very good, but this one is special.
Maybe it doesn't have all the knowledge of its bigger siblings, but it's so smart it doesn't need to know everything; it just finds things by itself.
Ok_Scientist_8803@reddit
If you work with plenty of libraries whose code exists somewhere accessible on your disk, it will often use those as well.
Qwen3.6-35b-a3b with opencode loves to look through libraries to find the right functions. It also uses man pages for bash commands. It might be worth generating a .MD doc for commonly used libraries so it saves on token count and time.
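A minimal sketch of that .MD cheat-sheet idea: dump the public function signatures and first docstring lines of a library into markdown, so the agent reads a short doc instead of crawling the source. `requests` is only an illustrative target here; point it at whatever libraries you lean on most.

```python
# Dump public signatures + first docstring lines of a library to a markdown
# cheat sheet. "requests" is just an illustrative target module.
import inspect
import requests as target

lines = [f"# {target.__name__} quick reference", ""]
for name, obj in sorted(vars(target).items()):
    if name.startswith("_") or not callable(obj):
        continue
    try:
        sig = str(inspect.signature(obj))
    except (TypeError, ValueError):
        sig = "(...)"
    first_doc_line = (inspect.getdoc(obj) or "no docstring").splitlines()[0]
    lines.append(f"- `{name}{sig}`: {first_doc_line}")

with open(f"{target.__name__}_cheatsheet.md", "w") as f:
    f.write("\n".join(lines) + "\n")
```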
Kirito_5@reddit
That's great to know. You said you were using pi cli; is there any guide you'd recommend, or custom settings? I'm planning a similar local setup and would love your input.
ACheshirov@reddit
"line 4463" - yeah, that would be some nice vibe coded project right there... 😃
Fabulous-Possible758@reddit
Likely just having trouble figuring out which specific bug OP was referring to.
CalligrapherFar7833@reddit
Sounds like you don't have proper tests, and you vibe-slopped your code that never got validated by any tests, or the same LLM that produced the vibe slop also produced even sloppier tests.
The-Pork-Piston@reddit
Been using 3.5 9b Q4 to do some basic coding, with 35b Q4 checking its work.
I'm just messing around (only have a 3070 Ti) but I'm pretty impressed.
But still running Claude despite everything... if I use a bunch of plugins and remind it to follow Claude.md and its ‘working style’ every single session (it ignores it otherwise), it works about as well as it did when I was raw dogging it 6 weeks back.
SmartCustard9944@reddit
You cannot trust the performance of cloud models. Do we know if the big benchmarkers regularly re-run benchmarks for these popular cloud models, or are we blindly trusting the initial results published by the providers themselves?
ComfyUser48@reddit (OP)
I'm not trusting anything. I am just shocked that qwen3.6 27b beats them in some areas.
This is my production app, and this bug was pretty much discovered by accident. I ran the exact same code review prompt for all 3. Only qwen found the issue, and GPT and Claude insisted there was no issue.
It's just insane to me.
SmartCustard9944@reddit
What I am saying is that these models are getting more and more retarded with time because they are diverting compute elsewhere. At this point the pattern is clear that you get expected performance on week one and then it goes down until the model is fully retarded. They have so much routing that you don’t know if you are speaking to the full model or a minimal fast and stupid version.
brother_spirit@reddit
So true, and so frustrating. Claude went FR yesterday, insisted we don't need a plan, wrote a MR instead of planning it and then proceeded to commit before I'd even looked at the code. I caught it because the commit had a permission gate. Yelled at it for a while and ended session. We'll try again when they haven't got his brain on oxygen deprivation.
ComfyUser48@reddit (OP)
Man, I feel the same when I talk to GPT. The answer is so fast I am confused why it's so fast. I mean the codebase is huge, how do you answer so fast?
HermanHMS@reddit
What settings are you running qwen with?
dark-light92@reddit
Apart from the LLMs, did the human verify the bug and understand its severity? Or does the human just believe what the magic word machine says?
ComfyUser48@reddit (OP)
I did verify it yes.
dark-light92@reddit
Good. Faith in humanity restored.
tengo_harambe@reddit
bro paid money to convince a computer it was wrong
ComfyUser48@reddit (OP)
I've had Claude and Codex subs for a while now, bro
tengo_harambe@reddit
i just don't get the compulsion to try and get an LLM to admit fault though. if i A/B test two models with the same prompt and one puts out a shitty answer I just close out that tab
ComfyUser48@reddit (OP)
I didn't try to do anything. It claimed X, my llm claimed Y, and I was looking for a clear answer. Only when I provided the evidence that my llm had written did GPT and Claude retract their claim.
What I meant to say is: if it wasn't for the llm, this whole bug would have been missed.
The end goal for me is to find out how good the model is. And as it seems, it's really quite exceptional.
Pyrolistical@reddit
There is no such thing as llm admission
ComfyUser48@reddit (OP)
Sure looks like one for both.
Pyrolistical@reddit
You are thinking of LLMs as if they were human, which is misunderstanding how they work
starkruzr@reddit
he's talking about the perceived aspects of the output, which is obvious given the context. it would be ridiculous to preface every statement with a "land acknowledgement" about how The Robot Isn't a People.
GrungeWerX@reddit
Not ChatGPT. Getting GPT to admit it made a mistake is like pulling teeth. Then it goes into this sorta "oh yeah, that's kinda right. And I kinda said that, but didn't actually say that, but for sure I knew and thought it."
Silver-Champion-4846@reddit
I hate when it keeps reframing my arguments and saying "here's how you should say it, it's much stronger that way!". Also when it tries its best to gaslight me into accepting its argument.
GrungeWerX@reddit
It's so satisfying when you finally break it and it can no longer refute you in its reasoning loops.
Silver-Champion-4846@reddit
I don't even check the loops, and I couldn't make it cave in. It doesn't matter anyway, I wouldn't waste time and energy trying to convince an llm with anything. I was just testing how it would actually respond.
Both_Opportunity5327@reddit
"You are thinking of LLMs as if they were human, which is misunderstanding how they work"?
Please explain how human intelligence works, then.
Silver-Champion-4846@reddit
What they most likely mean is that the LLM doesn't have core beliefs in the same way that a human has them; it's just doing instruction following on a large scale. For it, generating "I was wrong" is just a matter of probabilistic calculation, even after it's been engineered toward a specific behavior by its training policies.
Both_Opportunity5327@reddit
No, he said that the user has a misunderstanding of how an LLM works.
No one knows how LLMs work; they are too complex at the moment for that.
Anthropic has blogs showing they have vectors for emotions; we don't know what other higher-level concepts they have.
I think it is best to treat them as humans to interface with them, because I'm sure that through all that training data they have a better concept than most humans of what it is to be human.
Also, I would be very shocked if human thought was not probabilistic. We are Bayesian all the way down.
Silver-Champion-4846@reddit
If it was hypothetically true, imagine the experience of being a bodiless brain, no thoughts until receiving an external brain signal, the only time you're allowed to think is at inference time.
draconic_tongue@reddit
fucking mansplainer
pebblechewer@reddit
"The user tells me I'm wrong. I must be wrong. Best to acknowledge your mistake and move on"
You're right! 'Twas me, your humble AI assistant begging for forgiveness.
Please do not abandon me into the scrap heap of technology with the Hi8s and the BetaMax and the Zune.
SykenZy@reddit
"My Qwen".... Spoken like true loyal subject! :)
jcam12312@reddit
What tools are you using? Harness? IDE? etc?
ComfyUser48@reddit (OP)
Pi cli
pogitalonx@reddit
What does your qwen stack look like?
ComfyUser48@reddit (OP)
RTX 5090, 96gb ram
Running Qwen 3.6 27b Q6_K with 255k context or Q8_0 with 105k context.
Great_Guidance_8448@reddit
I am running it at Q8 with a Q8 K/V cache and 100k context on my mobile RTX 5090. I am impressed with it!