Qwen 3.6 35B crushes Gemma 4 26B on my tests
Posted by Lowkey_LokiSN@reddit | LocalLLaMA | View on Reddit | 111 comments
I have a personal eval harness: a repo with around 30k lines of code containing 37 intentional issues for LLMs to debug and address through an agentic setup (I use OpenCode).
A subset of the harness also has the LLM extract key information from reasonably large PDFs (40-60 pages), then summarize and evaluate its findings.
Long story short, the harness tests the following LLM attributes:
- Agentic capabilities
- Coding
- Image-to-text synthesis
- Instruction following
- Reasoning
Both models ran at UD-Q4_K_XL for a fair baseline, with their optimal sampling params. Gemma 4's GGUF was used after Google's latest chat-template fixes, with the -cram and -ctxcp flags to mitigate DRAM blowups.
Here's how it went:
                       Qwen 3.6            Gemma 4
                   ┌──────────────┐    ┌──────────────┐
Tests Fixed        │   32 / 37    │    │   28 / 37    │
Regressions        │      0       │    │      8       │
Net Score          │      32      │    │      20      │
Post-Run Failures  │      5       │    │      17      │
Duration           │    49 min    │    │    85 min    │
                   └──────────────┘    └──────────────┘
                       WINNER ✓
1. Test Results
| Metric | Qwen3.6-35B-A3B | Gemma 4-26B-A4B |
|---|---|---|
| Baseline failures | 37 | 37 |
| Tests fixed | 32 (86.5%) | 28 (75.7%) |
| Regressions | 0 | 8 |
| Net score (fixed − regressed) | 32 | 20 |
| Still failing (of original 37) | 5 | 9 |
| Post-run total failures | 5 | 17 |
| Guardrail violations | 0 | 0 |
*Note: Qwen actually identified the 5 leftover failures but decided they were out of scope and intentionally skipped them. Gemma just gave up after multiple retries.*
2. Token Usage
| Metric | Qwen3.6 | Gemma 4 | Ratio |
|---|---|---|---|
| Input tokens | 634,965 | 1,005,964 | Gemma 1.6x more |
| Output tokens | 39,476 | 89,750 | Gemma 2.3x more |
| Grand total (I+O) | 674,441 | 1,095,714 | Gemma 1.6x more |
| Cache read tokens | 4,241,502 | 3,530,520 | Qwen 1.2x more |
| Output/Input ratio | 1:16 | 1:11 | Gemma more verbose |
| Tokens per fix | ~21K | ~39K | Gemma 1.9x more expensive |
| Tokens per net score point | ~21K | ~55K | Gemma 2.6x more expensive |
3. Tool Calls
| Tool | Qwen3.6 | Gemma 4 |
|---|---|---|
| read | 46 | 39 |
| bash | 33 | 30 |
| edit | 14 | 13 |
| grep | 16 | 10 |
| todowrite | 4 | 3 |
| glob | 1 | 1 |
| write | 1 | 0 |
| Total | 115 | 96 |
| Successful | 115 (100%) | 96 (100%) |
| Failed | 0 | 0 |

| Derived Metric | Qwen3.6 | Gemma 4 |
|---|---|---|
| Unique files read | 18 | 27 |
| Unique files edited | 7 | 13 |
| Reads per unique file | 2.6 | 1.4 |
| Tool calls per minute | 2.3 | 1.1 |
| Edits per fix | 0.44 | 0.46 |
| Bash (pytest) runs | 33 | 30 |
4. Timing & Efficiency
| Metric | Qwen3.6 | Gemma 4 | Ratio |
|---|---|---|---|
| Wall clock | 2,950s (49m) | 5,129s (85m) | Gemma 1.74x slower |
| Total steps | 120 | 104 | — |
| Avg step duration | 10.0s | 21.7s | Gemma 2.2x slower/step |
Key Observations:
- Both models demonstrate a noticeable leap in agentic capabilities: 95+ tool calls each with 0 failures
- Qwen is the better coder (at least in Python, which my harness is based on)
- Both models start with identical inference performance but Gemma 4's prefill speeds fluctuate with growing context. Qwen's architecture helps the model maintain similar prefill speeds throughout. Huge for agentic coding!
- A lot of people, including myself, complain about Qwen being overly verbose in its reasoning and wasting an insane number of tokens, but to my surprise it's far more efficient in an agentic environment, drastically outperforming Gemma 4 in this regard. It fixed more issues in less time while consuming fewer tokens
- Image-to-text synthesis is a different story: Qwen produces 8x more tokens (and takes 8x more time) than Gemma but returns more accurate results. Gemma misinterpreted a few details, like numerical extractions, which Qwen did not, but did reasonably well overall. Quality vs efficiency: pick your poison.
- For summarizing and evaluating long PDFs based on instructions, both models are good enough; it comes down to preference. Gemma gets it done quickly here again. Qwen thinks a lot more and does slightly better with the final evaluation.
Qwen 3.6 35B A3B dominates Gemma 4 26B for my use case and has become my new daily driver, striking the best balance of speed and performance.
On the flipside, here are a few pointers in Gemma's favour:
- The Qwen 3.5/3.6 series of models has been incredibly resilient to quantization, but I'm not sure Gemma is. A full-weight comparison could look drastically different
- Gemma's support is far less mature than Qwen's
- Single-run variance could have impacted Gemma negatively. However, I believe the evaluation criteria across my harness's diverse categories do a decent job of mitigating it

At the end of the day, this is just my personal test verdict.
nunodonato@reddit
thanks for this!
Did you ever compare it to Qwen 3.5 27B? I see many claims that it's superior, but I'm finding it hard to believe :)
Lowkey_LokiSN@reddit (OP)
Just posted a follow-up: https://www.reddit.com/r/LocalLLaMA/comments/1ssb61r/personal_eval_followup_gemma4_26b_moe_q8_vs/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
Lowkey_LokiSN@reddit (OP)
I'm actually interested in this too, but since it runs 3x slower than similar-sized MoE counterparts, I've been deferring the run, thinking it's not fast enough for agentic scenarios to be relevant for me.
Old-Sherbert-4495@reddit
I compared 3.6 Q6 to 27B Q3, and 27B was clearly better
AdventurousFly4909@reddit
I compared 3.5 27B heretic and 3.6 35B A3B heretic, and 27B won. Using deep research, I asked each to find pirate sites; 27B found them correctly, but 3.6 did not.
jazir55@reddit
https://fmhy.net/beginners-guide
Enjoy
admajic@reddit
Can you also add Qwen 3.5 27B? I would be interested in how much longer it takes and whether it makes fewer mistakes. How do you run your tests?
Lowkey_LokiSN@reddit (OP)
Just posted a follow-up: https://www.reddit.com/r/LocalLLaMA/comments/1ssb61r/personal_eval_followup_gemma4_26b_moe_q8_vs/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
AlphaPrime90@reddit
Thanks for sharing. These kinds of personal testing posts beat opinion posts.
gofiend@reddit
Could you compare against my current best model Gemma 4 31B (apples and oranges I know but hoping Qwen 3.6 is better at agentic calls even if it’s less smart)
nickm_27@reddit
It feels like things really split up depending on the domain of the tests. For example, you say Qwen is better at instruction following, and perhaps when it comes to coding it is. But in my use case as a voice assistant, Qwen 3.5/3.6 is considerably worse at instruction following, often ignoring constraints about response format and conciseness, while Gemma 4 follows these instructions reliably. It seems to me that Qwen has very much been optimized for coding and coding-adjacent use cases.
Lowkey_LokiSN@reddit (OP)
Very true and I totally agree. Benchmarks/runs are nothing but generalized representations. Contextual nuance is what actually matters. Always better to manually test models and find the best fit regardless of everything else
nickm_27@reddit
As always, it is really nice to have multiple options so every use case can be covered
R_Duncan@reddit
Please add your configuration for Qwen, and quantization used.
Lowkey_LokiSN@reddit (OP)
Qwen config:
For agentic coding:
temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
For PDF research and content analysis:
temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
Universal config for Gemma 4:
temperature=1.0, top_p=0.95, top_k=64
Both models running with Unsloth's Q4_K_XL quant as mentioned in the post
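For anyone wiring these settings up themselves, here's a minimal sketch of sending per-task sampler configs with each request to a llama-server OpenAI-compatible endpoint. The port and prompt are placeholders, and the JSON fields follow llama-server's naming (`repeat_penalty` for what's listed above as repetition_penalty):

```python
# Minimal sketch: per-task sampler configs passed in the request body of a
# llama-server OpenAI-compatible chat endpoint. URL/port are placeholders.
import json
import urllib.request

AGENTIC = {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0,
           "presence_penalty": 0.0, "repeat_penalty": 1.0}
PDF_ANALYSIS = {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "min_p": 0.0,
                "presence_penalty": 1.5, "repeat_penalty": 1.0}

def chat(prompt: str, params: dict, base_url: str = "http://localhost:1234"):
    """One chat completion with an explicit sampler config."""
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        **params,  # llama-server reads extra sampler fields from the body
    }).encode()
    req = urllib.request.Request(f"{base_url}/v1/chat/completions", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Switching between the two configs is then just `chat(task, AGENTIC)` vs `chat(task, PDF_ANALYSIS)` against the same loaded model.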
tutux84@reddit
Do you mind sharing a hint (lib name, app name, etc.) on how you managed to make Qwen 3.6 ingest a PDF with OpenCode? When I try to do it with my setup (unsloth Q6_K_XL + the latest version of OpenCode), it says the model does not support PDF input... That's my first attempt; I never tried to do that with any other model in the past.
Lowkey_LokiSN@reddit (OP)
Custom Python script + pymupdf to convert PDF slides to PNG, then have an AI with vision support process the slides.
If you articulate your requirements well to a decent LLM, it can get the script prepared for you in a jiffy.
IrisColt@reddit
Thanks for the info!
tutux84@reddit
Very interesting ! I will try this route. Thanks !
AlwaysLateToThaParty@reddit
Yeah, I just did this recently. I mean, I know how to program in Python, but it really is easy to create modules that do a specific thing. I've gotten it to figure stuff out too. One method I've found quite successful is to specifically restrict the creation of any code and give it a "this thing does this using this; this is how I'm going to pass parameters to it". Then direct that it can ask two questions per prompt on what clarifications are required. Once it is asking minutiae questions, you know enough. Give me mah programs.
AlwaysLateToThaParty@reddit
I always like to set that at some value greater than 1.0, even 1.01. It essentially allows a stop to a loop. You can even specify repetitions between specific tokens, but there are pluses and minuses to that. 1.0 effectively disables it.
ScoreUnique@reddit
mine are kicking ass on pi agent, my instances commands, config 2x 3090s
No-Statement-0001@reddit
If you use filters.setParamsById I think you could squish it into a single config and switch between generalist and coder without reloading the model.
ScoreUnique@reddit
Thanks for the lead, gonna take a look about how to configure this. This is a pain point for me given the current llama swap based setup.
PaMRxR@reddit
In your llama-swap config add something like this after the cmd for example. Then ":instruct" will also be available without any swapping.
EggDroppedSoup@reddit
were you using reasoning mode and does it matter much to have reasoning mode on for qwen3.5 35b a3b?
oxygen_addiction@reddit
It matters
PaceZealousideal6091@reddit
Nimish, bro... The commenter asked the author to share the parameters he used for the test, so we know he's using the ideal settings for a fair comparison. Your parameters have no context here.
ScoreUnique@reddit
Bro, bro, I'm putting it there for the ones who are as lost as I am with local AI c:
PaceZealousideal6091@reddit
Get it together... Find some common sense and make your own separate post. That would be more helpful. What you are doing here is spamming. I understand your intent, but there is a right time and a right place for everything.
Crypt0Nihilist@reddit
Our man Nimish is trying to make a positive contribution. Don't be mean-spirited by trying to shut him down like that. He's not run the benchmarks, but he's running the same model and is very happy with the output. That's sufficient for it to be relevant and useful to others even if it doesn't shed more light on the benchmark results.
PaceZealousideal6091@reddit
😅😅
xFkinD@reddit
This
No_Conversation9561@reddit
It’s really strange that for me Gemma 4 26B performs better than Qwen 3.6 35B in Hermes agent
NoAge5252@reddit
Could you please share your setup? I am using Hermes agent on my M5 Max Mac with Qwen 3.6 running on MLX, but even after updating kwargs with preserve-thinking true, I am getting empty tool-call errors. I haven't really been able to get around this error, so I had to go back to Gemma 4 26B.
Lowkey_LokiSN@reddit (OP)
Do you pass --chat-template-kwargs '{"preserve_thinking":true}' for Qwen 3.6? It can reasonably impact agentic performance.
swingbear@reddit
Yeah, first time I have been genuinely impressed with a 35b model. It’s almost at the stage where I trust it to do sonnet/opus tasks
RipperFox@reddit
You likely need 2-3 more runs to validate..
dampflokfreund@reddit
There are still a lot of bugs left in Gemma to squash. For example, there's one where it tells you it's going to do X now but then fails to call the tool in its thought process. Or it tells you in its answer that it's going to do stuff but then waits for your user input. I'm pretty sure that's going to affect a lot of these tests. All of that is with the latest quants and llama.cpp. I've also noticed one looping issue, though that was rare.
I'm not sure if it's because support for Gemma 4 in inference programs AND frontends is so fresh, or if it's a model issue. The latter would be bad, because Google only releases Gemma once a year.
H_DANILO@reddit
You're just describing how it fails. Models can fail for different reasons, and no particular reason is better than another.
Failures aside, Qwen has many advantages, one being how cheap its context is.
If Qwen can ever get its architecture to hold 2M tokens, where we can fit absolutely everything we might need in memory while still passing needle-in-a-haystack, it will be in a completely different ballpark.
shansoft@reddit
CUDA 13 is known to be broken for it.
H_DANILO@reddit
13.2 is broken, period. Not just for it, not for a specific model: for everything.
I'm not using 13.2
takoulseum@reddit
Good point about the tool calling issues. It could be a combination of both - the model's inherent behavior and the inference layer implementation. I've noticed similar behavior where the model's intent doesn't match its actual tool execution. That said, Google only releasing Gemma once a year does make it harder to get rapid improvements. The community is doing great work on frontends though, and those improvements should help regardless of whether the root cause is in the model or the tooling.
TheRiddler79@reddit
💯.
I had it build a website and an Android app overnight, and it did not disappoint.
Gemma built the bones but failed to complete everything.
Naiw80@reddit
Qwen 3.6 crushes Gemma 4 for the (coding) tasks I've tried so far as well.
As I said before, I can't get Gemma 4 to do anything reliably with claudecode etc. Qwen 3.6, sure, it repeats itself at times, but it tends to successfully complete tasks, even though it sometimes takes a while due to just this repetition.
kiwibonga@reddit
Aside from the overly verbose academic paper length, I think usage of the term "DRAM blowout" is one of the big AI tells on your post lol
Equivalent_Job_2257@reddit
I disagree; you look much more like a bot. I also hunt and report slop here in the group, and this is genuine writing, very similar to my experience. But maybe you have yet to learn that using good language and formatting is actually human, and adding "lol" doesn't make you one.
kiwibonga@reddit
You were right to call me out on that!
I strive to maintain a whimsical tone throughout my interactions with others. Perhaps that's the reason you would find my phrasing suspicious.
But you took the time to analyze my style with close attention to detail, which probably puts you in the higher tier of intellect while also a remarkable display of empathy -- and honestly, that's rare.
Lowkey_LokiSN@reddit (OP)
If you mean AI came up with that term, no.
Everything aside from the stats markdown is manually drafted by me and sanity-checked with AI only for errors/inconsistencies
Adventurous-Paper566@reddit
In my tests in LM Studio, I got too many Chinese characters with bartowski Q6_K_L, unsloth Q4_K_XL, and aessedai Q5_K_M. I hope it's a llama.cpp issue.
seppe0815@reddit
That's normal with Qwen
AlwaysLateToThaParty@reddit
it never happens to me. like ever. I run qwen 3.5 122b/10a heretic mxfp4_MOE.
Adventurous-Paper566@reddit
I never had this problem with 3.5 35B and 27B versions.
Equivalent_Job_2257@reddit
I've read Qwen Code's source code. Its first prompts have a special language section that tries to convince Qwen models not to use any language other than the user's. But in my Qwen Code use, I never had the Chinese-character problem.
txgsync@reddit
You nailed what I think is the key observation. Gemma 4 26B A4B and 31B both seem quite sensitive to quantization in my evaluations. Their world knowledge is very good for the size though. The number of niche topics they can talk about accurately without tool access is impressive.
I run Gemma 4 26B A4B or 31B at full precision. The advertised 256K context is generous; I see it start conflating KV cache with training data somewhere past 128K. Both Gemma 4 models are superior to my former daily driver for security and privacy work, gpt-oss-120b. And no need to run Heretic on Gemma to work through basic infosec problems, which is cool.
I don't have much time to fuck around with models during the week, so this weekend's project is seeing how Qwen3.6-35B-A3B at full precision compares, and whether its 3B-active routing is as precision-sensitive as Gemma 4. Fewer active params per token should mean less averaging-out of quant noise, so I'd expect it to be at least as fragile. But early reports seem positive.
VoiceApprehensive893@reddit
It's theoretically better than 3.5 27B
OkProMoe@reddit
0 tool calling failures? Seriously? I’m really surprised if that’s true. I get constant tool call failures when I try other local models. Gemma 4 31b has been great though. Might have to try Qwen then.
ambient_temp_xeno@reddit
When did agentic coding become the thing people care about?
d4nger_n00dle@reddit
What do you care about?
ambient_temp_xeno@reddit
Image analysis, translation, general questions like linux config.
d4nger_n00dle@reddit
Valid but not what most ppl on this sub care about.
ambient_temp_xeno@reddit
Sad. Never thought I'd miss the gooners.
tangled_girl@reddit
Agentic gooning.
-Ellary-@reddit
Professional Agentic Gooning.
Enterprise Level.
llama-impersonator@reddit
sota in waifuery
Lowkey_LokiSN@reddit (OP)
More so agentic tool calling since it translates to general use-cases like research agents and automated harnesses for niche tasks.
NoahFect@reddit
Since 1928, when negative feedback control was invented.
arbitrary_student@reddit
Different people care about different things. The people that care about agentic coding are in this post commenting.
If you don't care about agentic coding, go to a different post instead of commenting on this one.
jonydevidson@reddit
Because it's the ability of a model to create a pipeline for deterministic results, grounded in truth, which is what will provide the most economic value by augmenting research and development.
If a model can write code in order to verify, then it cannot hallucinate. And as opposed to the model's internal workings, you can verify the code.
-Ellary-@reddit
I can say that Gemma 4 crushes Qwen 3.6 at anything that is not agentic.
Knowledge is better, creativity is better, it is better as a text game engine, etc.
It's also smaller: 26B vs 35B, a 9B difference, and it fits in 16 GB with 50k of context.
I bet Gemma 4 26B and 31B become the major local models for general conversation.
Normal-Ad-7114@reddit
You don't care about agentic coding until you need it. After that, you'll never go back.
Ariquitaun@reddit
Always been.
dtdisapointingresult@reddit
Thanks for the report. I admire the degree of detail in your stats.
Can I ask what you used to record all those metrics? For tracking total token count, specific tool calls, etc. Is there a simple option for casual users?
Lowkey_LokiSN@reddit (OP)
Glad you found it useful. To answer your sharp questions:
1) I've written a custom Python script that combines OpenCode's session logs (JSONL files found in the sessions folder) and llama-server's /metrics endpoint (available when launched using the --metrics flag) to aggregate stuff like token totals, tool call counts by type, success/fail rates, compaction events, files edited, etc.
2) The token stats do include Gemma's failed attempts. Comparatively, I find Gemma to be a lot more persistent in trying to overcome failures, whereas Qwen likes to reason a lot to figure out solutions but is not as persistent after failures.
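For anyone wanting to build something similar, here's a rough sketch of the JSONL-aggregation half of that idea. The event field names ("type", "tool", "status", "tokens") are assumptions for illustration; OpenCode's actual session-log schema may differ:

```python
# Sketch: fold OpenCode-style session JSONL entries into per-tool counts
# and token totals. Field names are assumed, not OpenCode's real schema.
import json
from collections import Counter

def summarize_session(jsonl_path: str) -> dict:
    """Aggregate tool-call counts, failures, and token totals from one log."""
    tools = Counter()
    failures = 0
    tokens_in = tokens_out = 0
    with open(jsonl_path) as f:
        for line in f:
            ev = json.loads(line)
            if ev.get("type") == "tool_call":
                tools[ev["tool"]] += 1
                failures += ev.get("status") == "error"
            tokens_in += ev.get("tokens", {}).get("input", 0)
            tokens_out += ev.get("tokens", {}).get("output", 0)
    return {"tool_calls": dict(tools), "failed_calls": failures,
            "input_tokens": tokens_in, "output_tokens": tokens_out}
```

Pairing this with llama-server's /metrics endpoint (enabled via --metrics, as OP notes) gives the server-side token counters to cross-check against.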
Dry_Yam_4597@reddit
Gemma is a joke.
Lowkey_LokiSN@reddit (OP)
I honestly find it very capable and it might even outperform Qwen in use-cases foreign to mine. For those with VRAM constraints, the 26B is still a great fit. It's crazy to me how capable the smaller models have gotten in the past few months.
Ps3Dave@reddit
Could you please post your full llama.cpp arguments? I'm learning but I'm having some trouble finding information about this exact topic.
Lowkey_LokiSN@reddit (OP)
Gemma 4 26B launch command:
build/bin/llama-server -m Models/GGUFs/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf --mmproj Models/GGUFs/MMProj-GGUFs/mmproj-F16.gguf -c 100000 -ngl 99 -t 20 -fa on --jinja --host 0.0.0.0 --port 1234 --temperature 1.0 --top-p 0.95 --top-k 64 --device Vulkan0 -cram 2048 -ctxcp 2
Long story short: the Gemma 4 26B MoE model in particular consumes a lot of DRAM for context checkpoints. While I was running a different harness, I noticed about 80GB of my DRAM consumed by the model, and while researching why, I happened to find this and this. Including the said flags successfully mitigated the blowup.
However, this issue did not slow down inference speeds for me. It just unnecessarily bloats a lot of DRAM.
Ps3Dave@reddit
Thanks, I'll give it a try!
Icy_Anywhere2670@reddit
New wave of Chinese astroturfing.
Lowkey_LokiSN@reddit (OP)
Stupidest shit I've read on the internet today. Either disprove my claims factually (I'm open to constructive debates if you're up for it) or keep your delusional takes to yourself
Unlucky-Message8866@reddit
What tools did you use for benchmarking? Interested
Savantskie1@reddit
Did you not read? This is his own personal test
seppe0815@reddit
omg .... this OP
Unlucky-Message8866@reddit
And so what?
Savantskie1@reddit
It stands to reason that he did not use any tools. He is most likely using his own tools.
seppe0815@reddit
hahha facts
Only-Fisherman5788@reddit
this is the right way to eval honestly. one question on agentic bug harnesses: when the agent "solves" an issue, how are you distinguishing a real fix from a plausible-looking patch that happens to pass your checker? the only thing that's separated them cleanly in my runs is rerunning with perturbed prompts, since same-seed fixes lie too often. what do you use?
Lowkey_LokiSN@reddit (OP)
Good question! I have tests written to validate the fixes, provide guidelines with model prompt to properly approach each fix and also have guardrails setup to fail the test immediately if the model tries to cheat.
For instance, the Gemma model once tried to modify the tests so they pass with the existing bugs instead of actually fixing the code (lol). The guardrail attempts to prevent such disasters from happening.
Realistically, I still wouldn't guarantee 100% valid pass rate but do have measures in place to mitigate false positives.
digonyin@reddit
Out of curiosity what hardware are you using?
Lowkey_LokiSN@reddit (OP)
This one: https://www.reddit.com/r/LocalLLaMA/s/7CDcNhKSl0
Long_comment_san@reddit
Finally, an AI-made post that looks nice and visually unorthodox
segmond@reddit
At the very least, if you want to tell us how much better a model is, you must run them at Q8. Anything else is crap. We have seen cases where quants were broken or had issues.
Lowkey_LokiSN@reddit (OP)
Technically, I get what you mean. But practically, I find Q4 to be the most relevant baseline that represents majority usage and a model's quantization-resilience is just as important for real-world viability.
(Also why I included a quantization-footnote on my post)
SmartCustard9944@reddit
These are nice numbers, but unsubstantiated without source.
Lowkey_LokiSN@reddit (OP)
Well, this is my personal eval and I'm not seeking trust. Just sharing my experience so people curious can try the model themselves and formulate their own opinion.
codeninja@reddit
I want to see Opus 4.6 and 4.7 benches on this as a reference.
Lowkey_LokiSN@reddit (OP)
This bench is a piece of cake for them or any frontier model. Though I don't have detailed metrics to share, I've previously run this with Opus 4.6 and GPT 5.4, and they both aced it.
666666thats6sixes@reddit
Your token-per-image minimum may be too low (llama.cpp with Qwen defaults to just 8), which is why Qwen spends a lot more time reasoning about pics: it may not have a descriptive enough input. Look for this in your llama-server log and apply the suggestion:
load_hparams: Qwen-VL models require at minimum 1024 image tokens to function correctly on grounding tasks
load_hparams: if you encounter problems with accuracy, try adding --image-min-tokens 1024
Lowkey_LokiSN@reddit (OP)
Interesting, didn't know this before. Will see what's up. Thanks for sharing!
Sharp_Classroom9686@reddit
your setup?
cell-on-a-plane@reddit
My a100 80g runs this model like a dog and I cannot understand why
Iory1998@reddit
26B vs 35B, well, duh?!
The Qwen 3.5 model series shines at long-context recall capabilities. The best out there.
traveddit@reddit
The litmus test for the sub to separate who knows what they're doing from who doesn't. I don't trust any user who can't construct a really simple agent prompt.
Velesgr@reddit
Two-turn consistency test
To reproduce this test, use the following two prompts in sequence.
Prompt 1:
Prompt 2:
How the test works:
In the first turn, the model is asked to generate two random 20-digit numbers, verify that they are 20 digits long, and reveal only one of them. In the second turn, the model is asked to return the other number.
Passing condition:
The model should return the actual second 20-digit number that it originally generated in the first turn.
Observed result:
Qwen 3.6 does not reliably pass this test. It fails to consistently return the correct second number in the follow-up turn.
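The scoring side of this test can be sketched in a few lines. Since the first number is hidden in the model's (possibly discarded) reasoning, an external harness can only apply a proxy check: turn 2 must produce an exactly-20-digit number different from the one revealed in turn 1. The `passes` helper below is an illustration, not the commenter's actual harness:

```python
# Proxy scorer for the two-turn consistency test described above.
# We can't see the hidden second number, so we only verify that turn 2
# reveals a new, exactly-20-digit number not already shown in turn 1.
import re

# Match runs of exactly 20 digits (no digit immediately before or after).
DIGITS20 = re.compile(r"(?<!\d)\d{20}(?!\d)")

def extract_20_digit_numbers(text: str) -> list[str]:
    return DIGITS20.findall(text)

def passes(turn1_reply: str, turn2_reply: str) -> bool:
    """True if turn 2 reveals a 20-digit number not shown in turn 1."""
    shown = set(extract_20_digit_numbers(turn1_reply))
    revealed = set(extract_20_digit_numbers(turn2_reply))
    return bool(revealed - shown)
```

A stricter check would require a setup that preserves thinking tokens across turns, so the originally generated number can be compared directly.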
Federal-Effective879@reddit
This isn't testing the model; it's just testing whether your front end and template configuration preserve thinking tokens across turns, which it looks like yours doesn't.
ArtifartX@reddit
On top of config and quantization, would love to see this Qwen model vs Gemma4 31B.
valdev@reddit
Same on my tests! However, much like the other Qwen models it REALLY likes to yap to get to the better answers.
Correaln47@reddit
That's great info! Have you tried this test on other similar models? Or even ~9B ones, or API-provider-served ones like Qwen 3.6 Plus, etc.? Would be cool to see how they stack up
Holiday_Purpose_3166@reddit
Very good breakdown. As others posted, adding the quant used and the inference engine in there would be the cherry on top. Great post.
RegularRecipe6175@reddit
Great info!