My experience with testing all frontier open-weight models against GPT and Claude

Posted by Anbeeld@reddit | LocalLLaMA | View on Reddit | 37 comments

I spent about a week testing open-weight models for real work, comparing them against what I already know from ChatGPT, Gemini, and Claude. The gap between what benchmarks suggest and what happens when you give these models something to verify is bigger than I expected.

The clearest example: I ran an audit of a 66-skill codebase for description quality, routing conflicts, and overlap. Ten models, same files, same OpenCode setup with identical tools and MCPs. The answers were in the repo, so I could ground-truth every claim. Two models produced reviews I'd trust. Eight did not.
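For flavor, here's roughly what "ground-truthing" one class of claim looks like. This is my own minimal sketch, not the author's actual setup: the skill names, descriptions, and the word-overlap heuristic are all hypothetical, and a real audit would check routing against the agent's actual matching logic rather than raw word overlap.

```python
from itertools import combinations

def routing_overlaps(skills: dict[str, str], threshold: float = 0.5) -> list[tuple[str, str]]:
    """Flag skill pairs whose descriptions share most of their words.

    Crude proxy: two skills whose descriptions overlap heavily are
    candidates for matching the same prompt, so a model claiming
    "no conflicts" can be checked against this list.
    """
    stop = {"the", "a", "an", "and", "or", "for", "to", "of", "in", "with"}
    words = {
        name: {w for w in desc.lower().split() if w not in stop}
        for name, desc in skills.items()
    }
    flagged = []
    for a, b in combinations(skills, 2):
        shared = words[a] & words[b]
        smaller = min(len(words[a]), len(words[b])) or 1
        if len(shared) / smaller >= threshold:
            flagged.append((a, b))
    return flagged

# Hypothetical skill descriptions, not from the repo under test.
skills = {
    "csv-import": "Import tabular data from CSV files into the project database",
    "data-import": "Import tabular data files into the project database",
    "changelog": "Generate a changelog entry from recent commit messages",
}
print(routing_overlaps(skills))  # → [('csv-import', 'data-import')]
```

The point isn't the heuristic; it's that the repo itself yields a checkable answer key, so every model claim of "conflict-free" or "missing file" can be falsified mechanically.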

GPT 5.4 got the most right. It found missing boundary clauses and caught routing gaps where two skills could match the same prompt. It also flagged descriptions too vague for an agent to route correctly. It didn't hallucinate skills that don't exist or praise things that were broken. GPT is precise and grounded but doesn't always synthesize across the whole system. Claude Opus is better at pulling together information spread across many files and connecting parts that aren't adjacent, and GPT sometimes misses that.

GLM 5.1 was close behind and had the best fix plan. It caught a broken cross-reference pointing to a skill by the wrong name and a pair of skills both claiming the same scope with zero boundary between them. It's the only reliable open-weight model I tested. It's also noticeably slower than everything else here. The findings are consistently accurate though, which I can't say for the others.

Minimax M2.7 handles context well, sometimes edging past GPT 5.4 and GLM 5.1 at connecting information across files, the way Claude Opus does. But it's constantly factually wrong in ways those two catch immediately. On the audit it claimed a file was missing when it exists, said a duplicate directory exists when it doesn't, and called two overlapping skills conflict-free. The mistakes are specific and confident, which makes them expensive to verify. The structure of its reasoning is great; the particulars are often wrong.

And then there's Kimi K2.5, which gave everything five stars and analyzed skills that aren't in the repo. Five stars, across the board, on a codebase where at least two routing conflicts are plain to see. It's allegedly strong at UI work, and it's fast and visual, which GLM and Minimax are not. But I wouldn't trust it with anything that requires checking claims against source material.

DeepSeek 3.2 claimed a wrong skill count and made a blanket statement about exclusion clauses that one counterexample kills.

Qwen 3.5 didn't complete the task on the first attempt. I had to hand-hold it past its own context window overflow. When it finally finished, it had counted 60 instead of 66, pulled in skills from outside the scope, and said a cluster had "no overlap" when its descriptions cross-reference each other. I haven't seen it impress on any task I've tried. Qwen 3 Coder at least used the right count, but its review was so thin and positive it reads like a product page.

Gemini 3 Flash Preview declared "No detected conflicts" and gave mostly praise. It's fast though, faster than any open-weight alternative, so if I need a quick first pass I won't act on, I'd reach for it. I just can't trust it for precision work.

The rest are noise. Nemotron 3 Super said a skill lacks guidance that its description already contains. Mistral Large 3 called boundaries fuzzy that the descriptions resolve explicitly. Same kind of error in each case: confident claim, easily falsified, not worth the context window it loaded.

The pattern across the week: models willing to say something is wrong consistently produce more useful output than models that default to praise. The most dangerous output is a plausible claim that happens to be false: "no conflicts," "every skill has exclusions." Because of that, GPT 5.4 and GLM 5.1 are what I'm using now. Claude would be there too if it didn't run out of limits after one message. The rest I can't trust at all, except for Gemini on simple, mechanical tasks.