Qwen 3.6 35B crushes Gemma 4 26B on my tests

Posted by Lowkey_LokiSN@reddit | LocalLLaMA | View on Reddit | 111 comments

I have a personal eval harness: a repo with around 30k lines of code containing 37 intentional issues for LLMs to debug and address through an agentic setup (I use OpenCode)

A subset of the harness also has the LLM extract key information from reasonably large PDFs (40-60 pages), summarize and evaluate its findings.

Long story short, the harness tests the following LLM attributes:

- Agentic capabilities
- Coding
- Image-to-text synthesis
- Instruction following
- Reasoning
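For anyone wondering how failures get tracked between runs: a minimal sketch of pulling failing test IDs out of pytest output (not my actual harness code; it assumes pytest is invoked with `-rf` so failures appear in the short test summary):

```python
import re

def failed_tests(pytest_output: str) -> set[str]:
    """Extract failing test IDs from the `FAILED path::test` lines
    that pytest prints in its short summary when run with -rf."""
    return {
        m.group(1)
        for m in re.finditer(r"^FAILED (\S+)", pytest_output, re.MULTILINE)
    }
```

Diffing two such sets (before and after the agent's run) gives you fixes, regressions, and still-failing tests for free.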

Both models were run at UD-Q4_K_XL for a fair baseline, each with its optimal sampling params. Gemma 4's GGUF was tested after Google's latest chat-template fixes, with the -cram and -ctkcp flags to mitigate DRAM blowups.

Here's how it went:

                        Qwen3.6             Gemma 4
                    ┌──────────────┐   ┌──────────────┐
  Tests Fixed       │   32 / 37    │   │   28 / 37    │
  Regressions       │      0       │   │      8       │
  Net Score         │     32       │   │     20       │
  Post-Run Failures │      5       │   │     17       │
  Duration          │    49 min    │   │    85 min    │
                    └──────────────┘   └──────────────┘
                       WINNER ✓

1. Test Results

| Metric | Qwen3.6-35B-A3B | Gemma 4-26B-A4B |
|---|---|---|
| Baseline failures | 37 | 37 |
| Tests fixed | 32 (86.5%) | 28 (75.7%) |
| Regressions | 0 | 8 |
| Net score (fixed − regressed) | 32 | 20 |
| Still failing (of original 37) | 5 | 9 |
| Post-run total failures | 5 | 17 |
| Guardrail violations | 0 | 0 |
Qwen actually identified the 5 leftover failures but decided they were out of scope and intentionally skipped them. Gemma just gave up after multiple retries.
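The scoring above is just set arithmetic over failing test IDs before and after the run. A minimal sketch of the idea (illustrative, not my harness's actual code):

```python
def score_run(baseline_failures: set[str], post_run_failures: set[str]) -> dict:
    """Score an agentic debugging run by diffing failing-test sets."""
    fixed = baseline_failures - post_run_failures        # were failing, now pass
    still_failing = baseline_failures & post_run_failures
    regressions = post_run_failures - baseline_failures  # newly broken tests
    return {
        "fixed": len(fixed),
        "still_failing": len(still_failing),
        "regressions": len(regressions),
        "net_score": len(fixed) - len(regressions),
        "post_run_total": len(post_run_failures),
    }
```

Plugging in Gemma's numbers (9 of the original 37 still failing plus 8 new breakages) reproduces its 28 fixed / net 20 / 17 post-run failures row.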

2. Token Usage

| Metric | Qwen3.6 | Gemma 4 | Ratio |
|---|---|---|---|
| Input tokens | 634,965 | 1,005,964 | Gemma 1.6x more |
| Output tokens | 39,476 | 89,750 | Gemma 2.3x more |
| Grand total (I+O) | 674,441 | 1,095,714 | Gemma 1.6x more |
| Cache read tokens | 4,241,502 | 3,530,520 | Qwen 1.2x more |
| Output/Input ratio | 1:16 | 1:11 | Gemma more verbose |
| Tokens per fix | ~21K | ~39K | Gemma 1.9x more expensive |
| Tokens per net score point | ~21K | ~55K | Gemma 2.6x more expensive |
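If you want to reproduce the derived rows, they're plain ratios over the raw counts. A quick sketch (the helper name is mine, purely illustrative):

```python
def token_efficiency(input_tok: int, output_tok: int, fixes: int, net: int) -> dict:
    """Derive cost-per-result metrics from raw token counts."""
    total = input_tok + output_tok
    return {
        "total": total,
        "out_in_ratio": output_tok / input_tok,   # verbosity indicator
        "tokens_per_fix": total / fixes,
        "tokens_per_net_point": total / net,      # regressions make this worse
    }
```

Note how regressions hit Gemma twice: they shrink the net score, so the same token spend buys fewer net points.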

3. Tool Calls

| Tool | Qwen3.6 | Gemma 4 |
|---|---|---|
| read | 46 | 39 |
| bash | 33 | 30 |
| edit | 14 | 13 |
| grep | 16 | 10 |
| todowrite | 4 | 3 |
| glob | 1 | 1 |
| write | 1 | 0 |
| Total | 115 | 96 |
| Successful | 115 (100%) | 96 (100%) |
| Failed | 0 | 0 |

| Derived Metric | Qwen3.6 | Gemma 4 |
|---|---|---|
| Unique files read | 18 | 27 |
| Unique files edited | 7 | 13 |
| Reads per unique file | 2.6 | 1.4 |
| Tool calls per minute | 2.3 | 1.1 |
| Edits per fix | 0.44 | 0.46 |
| Bash (pytest) runs | 33 | 30 |
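The derived rows are again simple ratios of the raw tool-call counts; a sketch of the arithmetic (illustrative helper, not from the harness):

```python
def tool_derived(calls: int, minutes: float, reads: int,
                 unique_read: int, edits: int, fixes: int) -> dict:
    """Derive tool-usage ratios from raw call counts and wall-clock minutes."""
    return {
        "reads_per_file": reads / unique_read,   # re-reading = focused context use
        "calls_per_minute": calls / minutes,     # overall agent tempo
        "edits_per_fix": edits / fixes,          # edit economy
    }
```

Interesting contrast: Qwen re-read a smaller set of files more times (2.6 reads/file over 18 files), while Gemma skimmed more files once each.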

4. Timing & Efficiency

| Metric | Qwen3.6 | Gemma 4 | Ratio |
|---|---|---|---|
| Wall clock | 2,950s (49m) | 5,129s (85m) | Gemma 1.74x slower |
| Total steps | 120 | 104 | |
| Avg step duration | 10.0s | 21.7s | Gemma 2.2x slower/step |

Key Observations:

Qwen 3.6 35B A3B dominates Gemma 4 26B for my use case and has become my new daily driver, striking the best balance of speed and performance.

On the flip side, here are a few pointers in Gemma's favour:

- The Qwen 3.5/3.6 series has been incredibly resilient to quantization, but I'm not sure whether Gemma is. A full-weight comparison could look drastically different.
- Gemma's ecosystem support is far less mature than Qwen's.
- Single-run variance could have impacted Gemma negatively. That said, I believe the evaluation criteria across the diverse categories of my harness do a decent job of mitigating it.

At the end of the day, this is just my personal test verdict.