AI Model Reviews
Posted by Typical-Tomatillo138@reddit | LocalLLaMA | View on Reddit | 46 comments
LLM benchmarks are terrible. Everyone overfits their models to max out a benchmark within a few months of its release. Open-source models launch with headlines like "90% of Opus at 5% of the cost", yet anyone who has actually used them can feel the obvious difference in quality.
It's impossible to find good reviews of models anymore. Every result for the Google search "minimax m2.7 review" is either:
- AI-written slop blogposts made in 10 minutes. These are the worst.
- Meaningless benchmark results, either from the big orgs (overfitting) or personal test results (which don't translate between use cases).
- Reddit threads with very conflicting information: comments are evenly divided between GLM, Qwen and Minimax, with everyone reporting different quality.
Are there any good sources for model reviews left in 2026? I can't seem to find any.
traveddit@reddit
I think everyone's definition of what makes a good model is different. People on this sub complain about Qwen overthinking because they don't know how to prompt the model, and then on day one of a Gemma release, with the parser still broken, we have users praising its agentic performance just hours after. At the end of the day you can only really trust your own tests and standards, because the majority of "reviews" are shit.
ag789@reddit
Nope, not quite. I had a Qwen 3.5 28B REAP model work on a 'difficult' code refactoring problem.
It went into a loop, burning >12k tokens of 'thinking' without reaching a response.
The problem was 'solved' simply by going back to the 'original' Qwen 3.5 35B A3B model, Q4 quantised.
That one ran through moderately verbose 'thinking', but returned the refactored script and 'fixed everything' in that small script refactoring task.
https://www.reddit.com/r/LocalLLaMA/comments/1sjprna/comment/ofvy6hb/
models do have limits.
fulgencio_batista@reddit
What is your system prompt? I find that the second I give it an agentic prompt, it stops overthinking and instead focuses on acting. Most of the time the CoT is about a paragraph or so, sometimes maybe 1k tokens when it's planning or has gotten unexpected results.
ag789@reddit
Erm, thanks, but that seems a little complicated. I'm using llama.cpp, which probably uses the Jinja template from the model file itself; I'm not too sure whether a system prompt is embedded there after all.
https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF?show_file_info=Qwen3.5-35B-A3B-Q4_K_M.gguf
Unsloth did provide a guide to disable thinking, though:
https://unsloth.ai/docs/models/qwen3.5#how-to-enable-or-disable-reasoning-and-thinking
fulgencio_batista@reddit
It's not complicated at all. Literally just write a paragraph-long system prompt telling it that it's an agent and include a useless tool that it won't use (you don't even need an actual tool). When I ask Qwen for a fun fact it replies within ~100 tokens vs ~2000 without the system prompt.
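For anyone who wants to try this programmatically rather than in the web UI, here is a rough sketch of what such a request could look like against llama.cpp's OpenAI-compatible chat-completions endpoint. The prompt wording, tool name, and model id below are all invented for illustration, not taken from the thread:

```python
import json

# Hypothetical sketch: an "agentic" system prompt plus a dummy tool.
# The tool is never meant to be called; its mere presence nudges the
# model toward acting instead of overthinking.
SYSTEM_PROMPT = (
    "You are an agent operating inside a developer's toolchain. "
    "Act decisively: plan briefly, then answer or call a tool. "
    "Do not deliberate at length before responding."
)

# A deliberately useless tool (assumed name/schema, purely illustrative).
NOOP_TOOL = {
    "type": "function",
    "function": {
        "name": "get_current_date",
        "description": "Return today's date.",
        "parameters": {"type": "object", "properties": {}},
    },
}

def build_request(user_message: str) -> dict:
    """Assemble a chat-completions payload for an OpenAI-compatible server."""
    return {
        "model": "qwen3.5-35b-a3b",  # placeholder model id
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
        "tools": [NOOP_TOOL],
    }

payload = build_request("Give me a fun fact.")
print(json.dumps(payload, indent=2))
```

You would POST this payload to the server's `/v1/chat/completions` route; the exact URL and model name depend on how you launched llama.cpp.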
ag789@reddit
Thanks, I'll study the Jinja template to see if I can pass an argument to do that. I think llama.cpp has templates that already support function calls; I'm seeing an "MCP server" option in the web UI, which probably means I can connect an MCP server as well. I'd guess a simple function such as date would do, and it's probably useful anyway. One thing is that agents are normally run by the client, so I may need to use a separate web UI instead of the one bundled with llama.cpp.
traveddit@reddit
https://imgur.com/a/DULyCGm
Go in your llama.cpp webui and it has a field to pass the system prompt. You should start by messing around with different basic prompts here.
ag789@reddit
thanks
mrtrly@reddit
Minimax m2.7 search only returns benchmark posts because nobody's actually using it on real tasks. I tested it on actual conversations last month and the "Opus parity" claim fell apart once context got complex.
qubridInc@reddit
Benchmarks aren't useless; you just need to look past them and follow real users + eval tools (DeepEval, Langfuse) + niche communities, because even researchers agree static benchmarks miss real-world behavior.
SnooPaintings8639@reddit
Yeah, there is one. It was suggested by Andrej Karpathy a year or two ago: the 'vibes' on the r/LocalLLaMA subreddit for any given model.
xspider2000@reddit
That's why there are a lot of bots on r/LocalLLaMA now.
Dabalam@reddit
A review doesn't seem a great fit for LLM assessment to me. Reviews require people to share a notion of what an item is for and an accepted method of establishing quality. We have neither for LLMs. At best you find someone with a similar use case and hope that their experience generalises to yours. That might be more probable given a huge number of impressions, but that would require some sort of aggregation of opinions rather than the vibes of one expert (perhaps Math is an exception, but I am not certain how much variability there is among different mathematical domains).
Aggregate reviews may mislead you on performance in your specific case, but n = 1 performance is almost certain to mislead you unless you curate your reviews to ones that use the model exactly as you plan to. I generally think structured testing is better than vibes. A relevant vibe check might be more helpful than an irrelevant benchmark, but I think we would still be better off with a more structured, benchmark-esque approach to testing models for our own use cases.
toothpastespiders@reddit
I think anyone really interested in local models needs to bite the bullet and make their own benchmark based on their own real-world use. At this point it's pretty trivial to vibe-code something that just hooks into the OpenAI API so you can swap out backends as needed. Yeah, in terms of quantity it won't match the more well-known benchmarks. But even a small, hand-made benchmark based on real-world needs is going to have better predictive value than the big well-known ones.
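As a minimal sketch of what such a personal harness could look like (the test cases and the stub backend here are invented; in practice `complete` would POST to whatever OpenAI-compatible server you run):

```python
# Minimal personal benchmark harness. `complete` is pluggable so you
# can swap llama.cpp, vLLM, or a hosted API without touching the cases.
from typing import Callable

# Hand-made cases drawn from your own real-world use (examples invented here).
CASES = [
    {"prompt": "What does `git rebase --onto` do? Answer in one sentence.",
     "must_contain": "rebase"},
    {"prompt": "Return only the JSON object {\"ok\": true}.",
     "must_contain": '"ok"'},
]

def run_suite(complete: Callable[[str], str]) -> float:
    """Score a backend: fraction of cases whose reply contains the marker."""
    passed = 0
    for case in CASES:
        reply = complete(case["prompt"])
        if case["must_contain"] in reply:
            passed += 1
    return passed / len(CASES)

# Stub backend standing in for a real HTTP call to e.g.
# http://localhost:8080/v1/chat/completions (llama.cpp's default port).
def stub_backend(prompt: str) -> str:
    return 'A rebase moves commits; {"ok": true}'

print(f"score: {run_suite(stub_backend):.2f}")
```

Substring checks are crude; for real use you would likely want regex matches, an LLM judge, or task-specific checks, but even this level of structure beats re-reading vibes threads.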
mr_zerolith@reddit
Unfortunately not. I like Bijan Bowen on youtube because he's one of the few that isn't paid for.
Otherwise I fish in the litter box for the lost diamond here like everyone else, because it has a lower signal-to-noise ratio than most places.
rm-rf-rm@reddit
I think you mean higher signal to noise ratio than most places right?
P.S: Thats why I mod here, this is relatively speaking the best place on the web, maybe the planet. Everywhere else, including other subs, is much worse
mr_zerolith@reddit
Oops, dyslexia. I meant to say higher. I edited my post.
Thanks for your work. Yes, this is the best place for this kind of discussion even if growth is starting to reduce the discussion quality.
DeepOrangeSky@reddit
Bijan has been my go-to on youtube whenever a new model comes out (mostly just because he's pretty funny, and his vids are fun to watch). He does have a pretty open/honest vibe, where the one time he got paid or given some item he just straight up said so when he got it. I mean, I guess if he was an evil genius he could do that to build fake trust or something, but, my read on him is that he's probably just a good dude.
The other main one I've been watching is xCreate, but I haven't watched nearly as many of his vids, so I'm not sure if he's as un-bought/legit as Bijan or not. He seems really nice and likable, and his vids are also usually pretty fun to watch. He also gets into more technical/advanced stuff than Bijan does, which sometimes goes over my head, but some people might enjoy that if they aren't as computer illiterate as I am. No clue as far as bias/independence or whatever yet.
Anyway, usually I just try the models out myself, if my computer is capable of running them: I try some difficult prompts of my own and compare the outputs against my other favorite models on the same prompts, to see how strong it is. If a model is too big but I'm still curious about it, I just read the dozens/hundreds of posts people make about using it on here, which gives some rough idea (even if a bunch of individual people are wrong, as an aggregate they usually give some rough sense of whether it sucks or is good). Albeit keeping in mind that for the first few days or even weeks after a model comes out there are lots of bugs, bad quants, and so on, so a month later it might be way better than what people were saying on opening day.
My_Unbiased_Opinion@reddit
I have found the NatInt section of the UGI benchmark to be very accurate in terms of capability for my use case.
evia89@reddit
1. I like https://swe-rebench.com/, so there's no need to try a model just as it drops.
2. I also have a NanoGPT sub, so I can try all open-weight models in my workflow and see how each behaves.
Far-Low-4705@reddit
make your own benchmark that fits your use case, keep it local and private.
MisticRain69@reddit
Pretty damn difficult to find any real reviews of models. Usually the best shit I find on models and how good they are is buried deep in a reddit post/huggingface page with like 10 likes, or the rare non-slop youtube video. As for minimax m2.7, it feels pretty good to me, best local model I've used so far. Feels like 80-85% of sonnet 4.6/4.5 in terms of how it "speaks" and how well it performs for my use case. You just can't be as vague as you can with claude, or else it will misinterpret what you say. I was able to replace my claude $100 max plan with local minimax m2.7; now I just use a bit of sonnet when minimax can't do it (free tier for claude, not paid). I use the MiniMax-M2.7-K_G_3.50 quant https://huggingface.co/Goldkoron/MiniMax-M2.7/tree/main and it performs better than the unsloth UD IQ4_NL quant. I run a GMKtec EVO X2 128GB with a 3090 Ti eGPU attached over USB4.
Ambitious-Hornet-841@reddit
Honestly, this is why I've stopped trusting anything except running my own 20-prompt test suite on a rented GPU. Takes an afternoon, cuts through all the "90% of Opus" BS.
But for quick checks: find one person on Reddit with your exact use case and DM them. Their frustrations > any benchmark.
What's your main use case?
dark-light92@reddit
No, they are not. They are useful for comparing model performance. Yes, some benchmarks get saturated (MMLU) and others become irrelevant (LMArena), but new benchmarks come up to take their place.
I hate the stupid comments like "model x is benchmaxxed... bla bla bla".
Sure. If that's the case, then why don't you come up with a way to measure model intelligence that is neither "vibes" nor benchmarks?
Typical-Tomatillo138@reddit (OP)
Which new benchmarks do you recommend taking seriously?
dark-light92@reddit
Depends entirely on what your use case is.
Typical-Tomatillo138@reddit (OP)
Say agentic coding.
dark-light92@reddit
https://swe-rebench.com/
https://www.tbench.ai/leaderboard/terminal-bench/2.0
Zealousideal_Fill285@reddit
Terminal bench is faking results. They had injected solutions right in agents.md to boost forgecode results.
dark-light92@reddit
Source?
Also, if someone is cheating that doesn't make the benchmark bad.
Zealousideal_Fill285@reddit
Well, if they post results when there is obvious evidence of cheating, how does that not make the benchmark useless?
https://debugml.github.io/cheating-agents/
dark-light92@reddit
Hmm... I didn't know that. Datapoints in the leaderboard should be vetted. Not sure what the current process is.
However, only the Terminus 2 agent (which they themselves develop) has benchmarks with many models. If you filter by that agent, the model positions are as expected.
PhilippeEiffel@reddit
Even for agentic coding, it depends on what you are working on. I mean, writing HTML+CSS+JavaScript is very different from writing C code for a micro-controller. Each model has different capabilities in each category.
I ran some tests in a rarely used programming language. I observed:
- Qwen3.5 has not been trained on this language (it does not know the syntax and is not able to follow the prompted rules regarding the syntax).
- Gemma 4 has seen this language. Many parts of the syntax are correct, and it is able to follow syntax advice to fix its code. It is quite capable.
- gpt-oss-120b has clearly been trained on this language. Syntax is perfect, and the spirit of the language is clearly understood and applied. Unfortunately this model is not so good once context reaches 100k (significant slowdown), and instruction following was a bit lower than competitors when I tested.
ag789@reddit
Simply take benchmarks 'with a pot of salt'. In a practical sense, if you take a model (a neural net, not necessarily an LLM) and train it on just the tests, it can probably 'fake' the same LLM results given identical tokens.
ag789@reddit
It is true, and it isn't surprising, that 'frontier' (or even 'lesser frontier') models would have included the tests and results in training. The result is that even a simple neural network (you don't even need an LLM), trained on "given this input, expect that output" pairs, will have its weights converge to the training outputs.
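The point above can be shown with a toy example: even a one-weight linear "network" trained directly on a benchmark's (input, expected output) pairs will converge to reproduce those outputs exactly, with no understanding involved. The data and hyperparameters here are made up for illustration:

```python
# A "benchmark" of (input, expected output) pairs to memorize.
data = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
w, b = 0.5, 0.5   # arbitrary starting weights
lr = 0.1          # learning rate

for _ in range(2000):            # plain SGD on squared error
    for x, y in data:
        err = (w * x + b) - y    # prediction error on this pair
        w -= lr * err * x
        b -= lr * err

# After training, the "model" aces the benchmark it was fit to.
for x, y in data:
    print(x, round(w * x + b, 3))
```

Swap in an LLM and a public test set, and the same mechanism is what "benchmaxxing" accusations are about.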
ag789@reddit
Oh, and I think it is about time for community, crowd-sourced tests for LLMs. They could be based on existing benchmark designs, but each contribution should be a local, novel test, e.g. code refactoring or bug fixing on arbitrary code with arbitrary problems.
Generation isn't as strong a test, as it tends to test 'memory'.
Long_War8748@reddit
Imho the root cause is that everyone wants benchmark results, but no one wants to produce them by hand (understandably, it is quite time-intensive), and we are just not there yet where this can be done automatically.
It reminds me of certifications: no one wants to actually test new hires, they want to outsource it and let universities and certification mills do the sorting lol.
ag789@reddit
one interesting scenario is to have the LLM play a general helpdesk, responding to arbitrary problems and proposing solutions / fixes for them.
SourceCodeplz@reddit
You need to test for yourself
bennyb0y@reddit
the most legit benchmark
mindwip@reddit
Thanks, cool. Looks like some of the latest models aren't there yet.
But interestingly, not one open-source model is in the non-bankrupt area!
Accomplished_Ad9530@reddit
Gemma 4 is open source. 31B is #3 in the rankings and 26BA4B is #8, both not bankrupt.
Fast_Tradition6074@reddit
Exactly. Official benchmark scores are basically just the culmination of overfitting at this point. I've been feeling the same way, which is why I'm researching a method to score generated text by detecting geometric distortions during the LLM's inference process.
My primary goal is pre-emptive hallucination detection, but if this goes well, it could potentially become a universal benchmark. Imagine a metric where you can objectively say, "This model has an average distortion score of 58, so it's highly prone to hallucinations." That's the future I'm aiming for.
Zealousideal_Fill285@reddit
There is already an AI hallucination bench.
PhilippeEiffel@reddit
I remember reading about a benchmark based on data compression. It was on reddit, but I don't clearly remember the details. The main advantage was that models could not specifically be trained for it. To my understanding, the more powerful the model, the higher the compression ratio.
Note that this benchmark does not capture every good property expected from an LLM.
Maybe someone has a link?
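The idea usually works like this: a model that assigns probability p to each next token can, via arithmetic coding, compress text to roughly -sum(log2 p) bits, so better next-token prediction means better compression. A small sketch of the metric (the log-probs below are invented; a real run would take them from the model's output):

```python
import math

def bits_per_byte(token_logprobs, n_bytes):
    """Total arithmetic-coding length in bits divided by text size in bytes."""
    # logprobs are natural logs; divide by ln(2) to convert nats to bits
    total_bits = -sum(lp / math.log(2) for lp in token_logprobs)
    return total_bits / n_bytes

# Toy example: 4 tokens, each assigned probability 1/2 (logprob = ln 0.5),
# covering 8 bytes of text: 4 bits / 8 bytes = 0.5 bits per byte.
toy = [math.log(0.5)] * 4
print(bits_per_byte(toy, 8))  # → 0.5
```

Lower bits-per-byte on held-out text means a stronger model, and since the "benchmark" is just arbitrary text, it is harder to train against than a fixed question set.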
Xx69JdawgxX@reddit
Sounds like a need to me. I'm in the same boat, about to start trying some larger local models out, and idk where to even begin.