My experience with testing all frontier open-weight models against GPT and Claude
Posted by Anbeeld@reddit | LocalLLaMA | View on Reddit | 37 comments
I spent about a week testing open-weight models for real work, comparing them against what I already know from ChatGPT, Gemini, and Claude. The gap between what benchmarks suggest and what happens when you give these models something to verify is bigger than I expected.
The clearest example: I ran an audit of a 66-skill codebase for description quality, routing conflicts, and overlap. Ten models, same files, same OpenCode setup with identical tools and MCPs. The answers were in the repo, so I could ground-truth every claim. Two models produced reviews I'd trust. Eight did not.
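To give a sense of what "ground-truth every claim" meant in practice: the overlap checks are mechanical enough to sketch as a throwaway script. This is only an illustration; the per-skill `SKILL.md` layout and the `description:` field below are assumptions for the example, not my actual repo structure.

```python
# Throwaway check: flag skill pairs whose descriptions share enough
# keywords that an agent router could plausibly match both for the same prompt.
import re
from itertools import combinations
from pathlib import Path

STOPWORDS = {"the", "a", "an", "and", "or", "for", "to", "of", "in", "with", "when", "use"}

def keywords(text: str) -> set[str]:
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def load_descriptions(root: Path) -> dict[str, set[str]]:
    skills = {}
    for path in root.glob("*/SKILL.md"):
        for line in path.read_text().splitlines():
            if line.startswith("description:"):
                skills[path.parent.name] = keywords(line.removeprefix("description:"))
                break
    return skills

def overlap_report(root: Path, threshold: float = 0.5) -> list[tuple[str, str, float]]:
    skills = load_descriptions(root)
    hits = []
    for (a, ka), (b, kb) in combinations(skills.items(), 2):
        if ka and kb:
            jaccard = len(ka & kb) / len(ka | kb)
            if jaccard >= threshold:
                hits.append((a, b, round(jaccard, 2)))
    return sorted(hits, key=lambda t: -t[2])

if __name__ == "__main__":
    for a, b, score in overlap_report(Path("skills")):
        print(f"possible routing overlap: {a} <-> {b} (keyword Jaccard {score})")
```

A keyword Jaccard is crude, but that's the point: these are claims I can verify in seconds, so a model that gets them wrong has no excuse.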
GPT 5.4 got the most right. It found missing boundary clauses and caught routing gaps where two skills could match the same prompt. It also flagged descriptions too vague for an agent to route correctly. It didn't hallucinate skills that don't exist or praise things that were broken. GPT is precise and grounded but doesn't always synthesize across the whole system. Claude Opus is better at pulling together information spread across many files and connecting parts that aren't adjacent, and GPT sometimes misses that.
GLM 5.1 was close behind and had the best fix plan. It caught a broken cross-reference pointing to a skill by the wrong name and a pair of skills both claiming the same scope with zero boundary between them. It's the only reliable open-weight model I tested. It's also noticeably slower than everything else here. The findings are consistently accurate though, which I can't say for the others.
Minimax M2.7 can handle context well, sometimes edging past GPT 5.4 and GLM 5.1, connecting information across files like Claude Opus does. But it's constantly factually wrong in ways those two catch immediately. On the audit it claimed a file was missing when it exists, said a duplicate directory exists when it doesn't, and called two overlapping skills conflict-free. The mistakes are specific and confident, which makes them expensive to verify. The structure of its reasoning is great, but the particulars are often wrong.
And then there's Kimi K2.5, which gave everything five stars and analyzed skills that aren't in the repo. Five stars, across the board, on a codebase where at least two routing conflicts are plain to see. It's allegedly strong at UI work, and it's fast and visual, which GLM and Minimax are not. But I wouldn't trust it with anything that requires checking claims against source material.
DeepSeek 3.2 claimed a wrong skill count and made a blanket statement about exclusion clauses that one counterexample kills.
Qwen 3.5 didn't complete the task on the first attempt. I had to hand-hold it past its own context window overflow. When it finally finished, it had counted 60 instead of 66, pulled in skills from outside the scope, and said a cluster had "no overlap" when its descriptions cross-reference each other. I haven't seen it impress on any task I've tried. Qwen 3 Coder at least used the right count, but its review was so thin and positive it reads like a product page.
Gemini 3 Flash Preview declared "No detected conflicts" and gave mostly praise. It's fast though, and at that speed it's better than any open-weight alternative. If I need a quick first pass I won't act on, I'd reach for it. Can't trust it for precision work, but it's useful when speed matters more than accuracy.
The rest are noise. Nemotron 3 Super said a skill lacks guidance that its description already contains. Mistral Large 3 called boundaries fuzzy that the descriptions resolve explicitly. Same kind of error in each case: confident claim, easily falsified, not worth the context window it loaded.
The pattern across the week: models willing to say something is wrong consistently produce more useful output than models that default to praise. The most dangerous output is the plausible claim that happens to be false: "no conflicts," "every skill has exclusions." Because of that, GPT 5.4 and GLM 5.1 are what I'm using now. Claude would be there too if it didn't run out of limits after one message. The rest I can't trust at all, except for using Gemini for simple, mechanical tasks.
tat_tvam_asshole@reddit
you should evaluate them in the same harness, and also, you aren't controlling for server-side orchestration (obviously) so it's not really as good a comparison of the models as you think
Anbeeld@reddit (OP)
You should read the post before commenting.
tat_tvam_asshole@reddit
Perhaps I needed to spell out the "and also" more clearly: as I said, you don't control server-side orchestration, hence it's not a clean comparison, nor is it comprehensive in what it compares.
I'm offering you valuable feedback by pointing out the methodological holes, which can help you evaluate models more fairly if you care to make the process more objective and thorough. If you just wanted to share your n=1 experience, that's fine too, though people shouldn't extrapolate too much from it.
Miserable-Dare5090@reddit
You are using them how? OpenRouter, I'm guessing. The task seems suspiciously hard if only GPT 5.4 can accomplish it. GLM 5 is 700+ billion params. By Qwen 3.5, do you mean 397b? 35b? How about 3.6?
Anbeeld@reddit (OP)
Ollama Cloud. GLM 5.1 went pretty well though? Qwen 3.5 397b, haven't tried 3.6 at all.
Miserable-Dare5090@reddit
Yep, but I think my biggest complaint is still that you are comparing GPT 5 high (rumored 2T model) and Opus (rumored, according to Elon Musk, 10T params, though I believe it's more around 5T) to models that are a fraction of the size.
I don't know about coding capacity, but I can run Qwen 397 at home, and it's really good. Just downloaded Minimax 2.7 so I'll give that a spin, but otherwise my qwen plus at home is currently the best solution for me.
It's not local if it's on the cloud… whether cloud Ollama or whatever other cloud.
Tman1677@reddit
Opus is not 5T lmao, most reputable sources (read: not Musk lol) have it around 800b.
Miserable-Dare5090@reddit
that’s interesting, he said Grok is 400B in a recent twitter post and “1/10 the size of Opus”
Tman1677@reddit
Yeah, obviously it's proprietary so you can't openly call him out on it, but it's just a lie. I trust the report from Semianalysis more personally
Such_Advantage_6949@reddit
That doesn't mean the comparison should be limited to the models you can run, right? It's meant to be a no-limits test. I'm sure if you had enough hardware to run GLM 5.1 you'd praise it non-stop, or whatever the biggest model you can run is.
Miserable-Dare5090@reddit
Ok, fair point, but it sort of defeats the purpose of local llama. We are all aware that Opus and GPT 5 are likely better, it's not a secret. But my experience is that the model is also only as good as the user. How the system prompt is crafted, the compression used for the weights, and the tools available at its disposal can make a big difference. I don't code, but I bet that if I set Qwen 397 as the creator and Minimax M2.7 as the critic, I will get results very close to GLM 5.1, which, like Kimi, can't be run that easily locally.
That's the gripe: the models that did best are the ones that few if any can load locally. But is that a reason to discard other models, or is it that, yes, it takes a little more work to get a system with more than one LLM going locally to compete reasonably with a behemoth model? Six months ago this was all the rage: multi-model agentic workflows. It was and still is a very viable local alternative.
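The loop itself is not much code either. A rough sketch, assuming both models are served behind OpenAI-compatible local endpoints (vLLM, llama.cpp server, etc.); the hostnames, model names, and the two review rounds are placeholders, not my actual setup:

```python
# Minimal creator/critic loop over two OpenAI-compatible local endpoints.
# Hostnames, ports, model names, and the round count are made up for illustration.
from openai import OpenAI

creator = OpenAI(base_url="http://creator-box:8000/v1", api_key="none")
critic = OpenAI(base_url="http://critic-box:8000/v1", api_key="none")

def ask(client: OpenAI, model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content or ""

def creator_critic(task: str, rounds: int = 2) -> str:
    draft = ask(creator, "qwen-397b", task)
    for _ in range(rounds):
        critique = ask(critic, "minimax-m2.7",
                       f"Review this answer for factual errors and gaps:\n\n{draft}")
        draft = ask(creator, "qwen-397b",
                    f"Task: {task}\n\nPrevious answer:\n{draft}\n\n"
                    f"Critique:\n{critique}\n\nRevise the answer accordingly.")
    return draft

if __name__ == "__main__":
    print(creator_critic("Audit these skill descriptions for routing overlap."))
```

Whether that actually closes the gap to GLM 5.1 is exactly the kind of thing OP's audit setup would be good at testing.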
I am lucky to have the hardware to at least get close to that goal, although I bought it before everything went crazy with RAM. But I bought it specifically because I saw giant companies making it untenable to have alternatives for AI, and forcing customers to use the cloud as the only solution.
That kind of limitation of freedom should not sit well with Americans… and ironically, the Chinese are running more local models than we are.
This is to say: it's nice that the gated, paid-for models are doing the best, and that at least GLM seems to be on their heels. But didn't we already know this?
It would be more useful to test GPT5 standard (which we know is a different model, as is GPTlow or nano or whatever…haven’t had an openAI account for almost 6 months at this point so I don’t know what they are serving) and Sonnet against the models reviewed.
Those are the everyday models for the 99% of folks who can see local as an alternative.
Such_Advantage_6949@reddit
I have 224GB of VRAM alone, and I run Qwen 397B at q4 with full context. And no, it is still far away from Codex. I use local LLMs for privacy use cases and for easier tasks. (My) time is money. Even on good hardware (my system is 3090, 4090, 5090, RTX 6000) it is still quite slow. I'll probably migrate to dual RTX 6000 next so at least Minimax runs fast via vLLM.
Miserable-Dare5090@reddit
If your pcie bus is 5.0 you’ll get the speed you want. I thought the two GB10 machines were a bad investment (5k total when I got them, 4TB drives each) but now I realize the memory bandwidth = cross computer bandwidth is actually a blessing for tensor parallel. It increases compute, increases decode speed, so probably not close to your rig but Qwen397 runs at a stable 27tps and 11-1500 pp no matter the context size.
dmigowski@reddit
What hardware are you using for that fat Qwen?
Miserable-Dare5090@reddit
Two Sparks chained with a 200G DAC. Also Gemma 31 and Nemotron Cascade on a 40GB VRAM / 64GB DDR5 Linux machine, Qwen 122 & Minimax 2.7 jangq quants via vMLX on a 192GB Mac Studio, and Qwen Next + Qwen 35 on a Strix Halo. All running as a unified endpoint soon (hopefully, got scared of liteLLM's problem recently… looking for an alternative endpoint unifier).
That is all wired by ethernet and a SFP backbone as well, accessed via tailscale exclusively from my little laptop.
UnusualAverage8687@reddit
And why not test Gemma 4?
Seems like an odd way to do a benchmark - a huge amount of setup to get a single data point.
Anbeeld@reddit (OP)
Actually, I tried it but was so unimpressed that I forgot about it. It's a small model and doesn't compete with the big boys, or even with Gemini 3 Flash. It's also heavily optimized for a smaller token footprint, so it's very hard to make it work through something in detail.
MrHaxx1@reddit
It doesn't really matter whether only one model can complete the task; how close the others get is still valuable data.
Like, if the 2nd best completes it 95%, but costs half, then it's the clear winner for most people.
shing3232@reddit
pls note that all Qwen 3.5 releases have buggy weights and require fixing.
ormandj@reddit
What do you mean?
shing3232@reddit
https://www.reddit.com/r/LocalLLaMA/s/hnNXq0oKEE
ormandj@reddit
I'm not sure anything has been presented as proof there. I'm not an expert in ML/LLMs, but I'm also not sure that seeing outlier shapes is an immediate cause for concern. The author of those 'fixes' won't benchmark because they (supposedly) cannot, so it's just conjecture. Even benchmarking is "hard" to do properly to really determine if there is an issue.
MoodDelicious3920@reddit
What do you think about Sonnet 4.6 and Muse Spark?
g33khub@reddit
Qwen 3.6 Plus is doing a better job than Minimax 2.7 for my personal projects. For work, I use exclusively Opus 4.6 high 1M. There is definitely a difference, but small-ish projects are handled quite okay by these two models and it's soooo much cheaper. Will give GLM 5.1 a shot, but I had a very bad experience with 4.7 ~6 months back.
crantob@reddit
Sir, this is LocalLLaMA
g33khub@reddit
qwen, minimax, glm -- all "can" be run locally. Opus is the benchmark ceiling to compare against.
DeltaSqueezer@reddit
Could you test the older model GLM-4.7? It is faster than 5.1.
Anbeeld@reddit (OP)
For me quality is more important than speed so it's only GLM 5+ that interested me, as it's a big leap compared to 4.7 from what I've heard. As I said for simpler tasks I just use Gemini.
po_stulate@reddit
But people have more experience with 4.7 and 5; without them as a baseline for comparison it's hard to gauge how much 5.1 has actually improved. (I personally did not have good experiences with 4.7 and 5, and suddenly being told 5.1 is the best is a bit hard to relate to.)
-dysangel-@reddit
I actually preferred 4.6 to 4.7, especially locally. I had a bunch of issues getting 4.7 to work cleanly. 5 was clearly better than both. The difference between 5 and 5.1 is hard to say, but it has definitely been working well for me at my day job. If you just don't like its style somehow, that's understandable, but I find GLM 5 pleasant to work with.
LCLforBrains@reddit
This is a great illustration of why benchmark scores and real-world routing quality diverge so much. The models that look good on standard evals aren't necessarily the ones that handle the edge cases in your actual skill set, and finding that out requires exactly the kind of manual, ground-truth audit you did here. The tricky part at scale is that this kind of review doesn't stay manageable as your codebase or agent grows. We built Greenflash to automate the pattern-finding across real agent interactions, so you're not relying on benchmarks or sampling.
Curious what you ended up doing with the eight models that failed, whether you iterated on the descriptions or just cut them.
qubridInc@reddit
Solid take, honestly matches what I’ve seen: most open models look smart until you actually verify them, then GPT/Claude still win where it matters.
ideadude@reddit
Same.
Arrival-Of-The-Birds@reddit
Pretty much my exact experience. I'll just add I don't trust sonnet at all. Opus is great though.
90hex@reddit
That's some good info, but it'd be much more usable as a table with a few columns (say, model name, use case, result, notes). As it is, it's quite difficult to parse and make sense of. Thanks for sharing though!
Anbeeld@reddit (OP)
Brought to you by GLM 5.1, also this comment as an alternative.
SaltResident9310@reddit