Does the "6 months gap" still hold?
Posted by ihatebeinganonymous@reddit | LocalLLaMA | View on Reddit | 53 comments
Hi. There's a rough consensus that the "jump" in quality of agentic development happened sometime in December 2025, when it went from "nice to have" to actually performing.
It has also long been said that open-source models lag the state of the art by 6 to 12 months.
Now, does that mean that to get the equivalent of Dec 2025 frontier performance (Opus 4.5?) from open-source models, we should still wait a few months? What have your experiences been like?
danielv123@reddit
Apparently it's more like 8 months between the leading open and closed models now. However, the open models keep getting bigger and more expensive, so the gap between the models people can actually run and the frontier models is even bigger.
The good news is that whether a model is useful is mostly a yes/no question of whether it has crossed the threshold for a given application. The small open models keep getting better and becoming usable for more things.
Late-Assignment8482@reddit
Don’t ask that. Ask how many of the tasks you yourself do can be done well by the models you can run. The rest is naked philosophers arguing in a dark cave, IMHO.
danielv123@reddit
Well, the thing is, it's none: the tasks I could give to local models today, I gave to frontier ones months ago.
So I'd have to look at which tasks I currently give to frontier models that would run fine on a local model instead. Sadly, benchmarking isn't fun enough, and frontier models aren't expensive enough, for me to bother with that for now.
Late-Assignment8482@reddit
I just have a list of my own small benchmarks (think "create a CSV compliant with my expense app from these screenshots" or "write a bash script based on the specs/ folder") and every now and then I add one.
When I want to check, I fire the scripts and let it cook.
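For reference, a minimal sketch of what such a personal harness can look like; this is my illustration, not the commenter's actual setup, and the endpoint, paths, and model name are made-up assumptions (any OpenAI-compatible local server, e.g. llama.cpp's llama-server, would work):

```python
# Replay saved benchmark prompts against a local OpenAI-compatible endpoint
# and dump the outputs for manual review. All paths/names are hypothetical.
import pathlib
import requests

ENDPOINT = "http://localhost:8080/v1/chat/completions"  # assumed local server

pathlib.Path("results").mkdir(exist_ok=True)
for case in sorted(pathlib.Path("benchmarks").glob("*.txt")):
    resp = requests.post(ENDPOINT, json={
        "model": "local",  # llama-server accepts any model name here
        "messages": [{"role": "user", "content": case.read_text()}],
        "temperature": 0.2,
    }, timeout=600)
    answer = resp.json()["choices"][0]["message"]["content"]
    pathlib.Path("results", case.stem + ".md").write_text(answer)
    print(f"{case.stem}: {len(answer)} chars written")
```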
Asleep-Land-3914@reddit
Running GPT 5.5 with medium effort via Codex, and DeepSeek V4 Pro via OpenCode. GPT thinks faster with fewer tokens, but DS provided better architecture-design suggestions multiple times, while also failing to use the harness capabilities properly multiple times (e.g. fully rewriting files because it's "easier").
I think a proper harness is now about as important as the model's performance.
Crinkez@reddit
How are you using V4 pro? I assume some API provider? How do you find the cost vs GPT 5.5?
Asleep-Land-3914@reddit
Burned $2.01 over 2 days of extensive work using the official DS API. Feels like a fair price to me. This low price is due to the ongoing 75% off, though.
126,996,848 tokens total. For the type of work I was doing, I think that's way more tokens than needed, and I also think the harness and prompting could be optimized to cut token spend severalfold.
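As a sanity check on those figures (my arithmetic, not the commenter's; the 75%-off number comes from the comment above, and prompt-cache hits presumably explain the very low blended rate):

```python
# Back-of-envelope effective rate from the figures quoted above.
total_cost_usd = 2.01
total_tokens = 126_996_848

blended = total_cost_usd / (total_tokens / 1e6)
print(f"blended rate: ${blended:.4f} per 1M tokens")        # ~$0.0158/M

# Undoing the quoted 75% discount gives the implied full price.
print(f"implied full price: ${blended / 0.25:.4f} per 1M")  # ~$0.0633/M
```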
unjustifiably_angry@reddit
For strictly coding purposes, I would say so.
Qwen 3.6 is quite good at coding, but as a general-purpose bot it's nowhere near as good as basically any online model.
Minimax 2.7 seems to be very good at coding as well, but it's much larger, and there are persistent claims that it's benchmaxxed to an inordinate degree.
Gemma is rumored not to be benchmaxxed much, but those rumors come up in the context of it performing quite badly on even "easier" benchmarks. Its architecture seems to be poorly understood, and it's probably not being deployed in an ideal way, so its actual quality level is a bit up in the air. As it's a Google product, my expectations are extremely low. The current cope is "but... it can understand irrelevant European languages better than average!" (an argument being made in English, on account of English being the only European language that matters).
The "Flash" variant of the new Tencent model (name escapes me) seems reasonably good and it's modestly-sized but I have no firsthand experience and it's currently only in a "preview" release so it's too early to judge.
If Z.AI ever makes a smaller version of GLM 5.1, it can probably be expected to be fairly potent.
For chat/assistant/fiction/jerkoff purposes I'm not so familiar, but I persistently see people claim that even older Llama and Mistral models are better than anything released in the last year or so, probably because the general focus has moved on to coding. It's not really realistic to expect local AI to have a useful level of encyclopedic knowledge with today's techniques; that has largely been superseded by tool calling that enables web search, etc.
HumanDrone8721@reddit
Yes, but the hardware requirements for local models keep getting heavier, while prices remain high and are even increasing. I struggle to run MiniMax 2.7 at a quantization level that gives me results comparable to the SOTA cloud models for the tasks I have to solve.
On the other hand, for the majority of people, the tasks they're actually working on are nicely covered by reasonably priced equipment plus some prompting and planning discipline. The disappointment starts when they try to punch above their own cognitive ability and training; that's where the spell breaks.
florinandrei@reddit
So maybe a meaningful comparison should include not all open-weights models, but only those that can run on consumer hardware.
natermer@reddit
For most people, there is no 'consumer hardware' that can run it.
Until the "global memory shortage" is resolved it is going to continue to be painful. Unless you are pretty wealthy and spending 5 to 10K is in your "fun budget" then you need a pretty decent professional justification for spending money like that.
Kahvana@reddit
5-10K, in what currency and country? VAT included?
If I wanted quality and bought it today in the Netherlands (21% VAT included), a machine to run a ~30B model at Q4_K_L would cost me ~2300 EUR brand new.
https://www.reddit.com/r/SillyTavernAI/comments/1svuf1e/comment/oimgdmp
You can skimp on some components to bring it down to below 2000 EUR.
It's likely that most users already have a DDR4/5 system that they could upgrade, which would significantly reduce the cost.
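For sizing purposes, here is a rough rule of thumb for what such a model needs in memory (my sketch, not from the thread; the bits-per-weight and overhead figures are assumptions):

```python
# Back-of-envelope memory estimate for running a quantized model.
def est_memory_gb(params_b: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    """params_b is the parameter count in billions; overhead roughly covers
    KV cache, activations, and runtime buffers (an assumed factor)."""
    weights_gb = params_b * bits_per_weight / 8  # 1B params at 8 bpw ~= 1 GB
    return weights_gb * overhead

# ~30B model at ~4.8 bits/weight (roughly Q4_K-class, an assumption):
print(f"{est_memory_gb(30, 4.8):.1f} GB")  # ~21.6 GB, tight but feasible on 24 GB
```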
HumanDrone8721@reddit
"Consumer hardware" is a pretty vague term, it could mean both the latest Chromebook from Apple, their high-end mini-PC or some ASROCK mobo where someone slapped two RTX Pro 6000, and even the gamer 5080 notebook, they're all consumer HW that can be bought from places where you buy printer ink and mice.
And even so: I have 512GB of RAM, so in theory I can somehow run any released model, even if I can go make a tea between tokens. Which opens another can of worms: what tokens/second is acceptable before you call it "meaningful", and so on. It's like the discussions about cars.
PraxisOG@reddit
I agree with that. For time-to-response and quality of response, MiniMax M2.7 at IQ3_XXS is the best I've found. It's a stretch on 96GB of VRAM though.
Borkato@reddit
Is minimax 2.7 quantized to like Q1 better than Gemma 4/qwen 3.6? 😂
ihatebeinganonymous@reddit (OP)
Yes, that's an interesting separate question in its own right. But here I meant "any" open model, regardless of how big it is.
OmarBessa@reddit
IMHO the gap has been bullshit for a while now.
For 99% of regular users, there's not much difference between what Kimi or ChatGPT can do. At most, the gap is "vibes", which is more or less user preference.
Benchmark-wise, we live in a mirage created by toxic benchmarkers who use single scalars to oversimplify and push/promote certain LLMs over others (e.g. Artificial Analysis).
DeepOrangeSky@reddit
What I'm more curious about is how the gap between the ~30B models that normal people can actually run on their home setups and the SOTA models compares to, say, a year or so ago.
For example: take a model like Gemma3 27B from around 14 months ago, or Mistral Small 24B from when it came out, and compare their relative strength against the big SOTA frontier models of that time. Then take the current Gemma4 31B or Qwen3.6 27B against the current SOTA frontier models. Has the gap between these ~25-30-billion-parameter models and their full-sized frontier contemporaries narrowed or widened over the past year or 1.5 years?
I only got into local LLMs around 5 months ago, so I wasn't around back then to judge that relative gap myself; if anyone was, I'm curious. The old local models are still available, but by now the frontier cloud models of that time probably aren't, which makes this tough to test directly. People would just have to remember how strong they were back then to compare, right?
Evanisnotmyname@reddit
It’s not even a comparison. Run a model from 6 months ago. Hell, run Qwen3.5 vs 3.6 of similar size. It’s mind blowing to me.
DeepOrangeSky@reddit
Yeah, but the big SOTA models also got a lot stronger. What I'm curious about isn't how much the ~25-30B models improved relative to themselves over the past year; I know they got a lot stronger. I'm curious how the strength gap between them and the 671B-1T+ models changed: did it shrink, stay about the same, or widen?
I assume the strength gap shrank slightly, but I'm not sure. From what I've heard, Opus 4.5 and 4.6 were much stronger than the Claudes of a year ago, and GLM 5.1 was much stronger than DeepSeek was a year ago.
So, it's not like the 25-30B models got way stronger while the big models barely improved. The big ones improved a lot, too.
For creative writing, casual chatting, etc., the small models almost certainly gained more ground on the big ones; Gemma4 is totally absurd for its size there. For programming, I assume they also gained some ground, but maybe not as much? (Way outside my area of expertise, so I'm curious what people think.)
Borkato@reddit
Careful, you’ll get the people claiming a 460B model is local because they can run it on their 40k setup!
segmond@reddit
For a Top 1% Commenter, your comment is sad. I run 550B models on <$7000 of hardware. I remember getting shit on in this very subreddit when I shared how to build a 160GB-VRAM system for just about $1000. With creativity and a bit of thoughtful effort, one can figure out how to build such systems. I'm 100% confident I could build a 192GB-VRAM system today for under $1500. That would be more than enough to run DeepSeekV4Flash, MiniMax2.7, Step3.5Flash, Qwen3.5-122B, etc.
Evanisnotmyname@reddit
Are we talking ddr3 and a bunch of Nvidia P40s?
segmond@reddit
So what? It beats nothing, or what most people currently have.
theUmo@reddit
I'd be really curious to see a build!
Borkato@reddit
Actually wait, if it runs at like 0.001T/s, never mind lol. But if it doesn’t, then I’m interested!
Borkato@reddit
I'm about to spend $2k on LLMs and was going to buy a 3090 to get another 24GB, bringing me up to 48GB. Mind sharing how you'd get 192GB under $1.5k starting from scratch?
MrMrsPotts@reddit
Not for math. Nothing competes with ChatGPT for math at the moment.
the__storm@reddit
Imo the best open models (DeepSeek V4, Kimi 2.6, MiMo 2.5 Pro) are not quite on the level of Opus 4.5, at least for coding, and have not "jumped the gap". So I would say yes, the 6+ month lag still exists.
segmond@reddit
Open source has matched SOTA. I get more variety of responses from local models than I could ever get from a cloud model. The challenge is not keeping up with cloud models; it's being able to run them locally. That is still tough, expensive, and out of reach for most people.
61G /home/seg/models/gpt-oss-120b-F16.gguf
117G /home/seg/models/GLM4.6V
122G /home/seg/models/Qwen3.5-122B-Q8
137G /home/seg/models/Devstral2-123B
140G /home/seg/models/MistralMedium3.5-128B
146G /llmzoo/models/DeepSeek-V4-Flash-FP4-FP8-native.gguf
151G /home/seg/models/Step3.5-Flash
153G /llmzoo/models/DeepSeek-V4-Flash-Q4_X.gguf
227G /home/seg/models/MiniMax-M2.7-Q8
240G /home/seg/models/Ernie4.5-300B
282G /llmzoo/models/DeepSeek-V4-Flash-Q8.gguf
306G /mnt/1/MiMo-V2.5/
377G /home/seg/models/DeepSeekv3.2-nolight
380G /llmzoo/models/DeepSeek-V3.2-UD
400G /llmzoo/models/Qwen3.5-397B-Q8
443G /home/seg/models/DeepSeek-Math-v2
443G /home/seg/models/DeepSeek-V3-0324-Q5
522G /llmzoo/models/GLM5.1
545G /llmzoo/models/Kimi2.6
Georgefakelastname@reddit
Depends on your definition. If by open source you mean models you can run in your home, then the lag is probably more than 6 months for many use cases. If by open source you mean any open-weights model, regardless of whether anyone could actually run it themselves (north of 1T parameters), then the latter would be down to 3 months.
ihatebeinganonymous@reddit (OP)
I meant the latter. Thanks.
randomrealname@reddit
Agentic flow is just the wrapper plus a fine-tune to call tools instead of acting like a chatbot.
It's the function-calling datasets that open source needs. Closed-source labs literally pay people to make 100,000s of data points that perform tasks with tool calls.
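For context, a single such data point is typically a chat transcript with structured tool calls; here is a minimal sketch (the schema loosely follows the common OpenAI-style function-calling format, and the tool name, arguments, and contents are made up):

```python
# One hypothetical function-calling training example, as a Python dict.
# The structured tool_call is the part a chatbot-style fine-tune lacks
# and an agentic fine-tune has to learn.
datapoint = {
    "tools": [{
        "name": "read_file",
        "description": "Read a file from the workspace",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    }],
    "messages": [
        {"role": "user", "content": "What does setup.py import?"},
        {"role": "assistant", "tool_calls": [{
            "name": "read_file",
            "arguments": {"path": "setup.py"},
        }]},
        {"role": "tool", "name": "read_file",
         "content": "import os\nfrom setuptools import setup"},
        {"role": "assistant", "content": "It imports os and setuptools.setup."},
    ],
}
```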
kellencs@reddit
[posted a graph: a timeline comparing open and closed models]
NNN_Throwaway2@reddit
Oh god, no. People are going to start dropping this dogshit all over the place now.
Borkato@reddit
I'm not gonna lie, as much as I agree: I bet that if this graph said the opposite, even if it were untrue, it would spread like wildfire and people would believe it lol
amethyst_mine@reddit
this was made by a no-name 'research group' with sketchy methodology, and it certainly doesn't line up with most people's experiences. i would say kimi k2.6, for example, handily matches gpt 5.2 in most things
b3081a@reddit
That's also your own "most things", while others' may vary.
Clueless_Nooblet@reddit
One is a personal use case, the other is public benchmarks. So which is it?
Clueless_Nooblet@reddit
Those are fabricated numbers, to serve an agenda.
kellencs@reddit
it depends on which capabilities
Clueless_Nooblet@reddit
Those numbers are fabricated to serve an agenda.
wren6991@reddit
Nice try Sam, we know that's you
suprjami@reddit
Come on. How much did Sam Altman pay you to post this image?
Sufficient-Bid3874@reddit
You may not have made the timeline, but still: why did they not account for Gemini? D:
SailIntelligent2633@reddit
Almost all of those local models are already outdated.
DistanceSolar1449@reddit
It's a timeline; by definition, every point other than the latest is outdated.
pmttyji@reddit
I hope the month of April contributed more to this.
nomorebuttsplz@reddit
I'd guess that GLM 5.1 and K2.6 are already at Opus 4.5 levels, but I didn't use Opus enough to be sure.
Kahvana@reddit
As with everything, it depends.
Qwen3.6-35B-A3B is slightly better than Claude Haiku 4.5, released roughly half a year ago.
Gemma4-31B can be there with the frontier for translations depending on the language.
Personally I find the comparisons with frontier meaningless.
Easy does it; good enough gets the job done.
Barry_22@reddit
It's now 3 months
snowieslilpikachu69@reddit
mimo v2.5 pro, glm 5.1, deepseek 4 pro max, and kimi 2.6 are really good
i would say from personal use they are last-gen sota: since we're on opus 4.7/gpt 5.5 right now, they are around opus 4.5/4.6 or gpt 5.3/5.4 level, although they even exceed that in some cases
popiazaza@reddit
https://artificialanalysis.ai/articles/recent-open-weights-model-launches