Recent Open Models from the Last 6 Months - Nov 2025 - Apr 2026
Posted by pmttyji@reddit | LocalLLaMA | 45 comments
I created this chart with recent open models from the last 6 months. A few might be a bit older than that.
Included only the latest versions (e.g. only Kimi-K2.6, no Kimi-K2.5 & Kimi-K2; also only GLM-5.1 & GLM-4.7, no GLM-4.6 & GLM-4.5). I couldn't add some models like Ling-2.5-1T, Ring-2.5-1T, Omnicoder. I also didn't add small models (except Qwen3.5-9B/4B & Gemma-4-E4B) as the graph is too crowded already. Sorry if I missed any recent models.
Possibly best 6 months for Local LLMs?!? This month still has more than a week left, so we could get a few more models.
So what do you think about the overall graph? Any underrated & overlooked models?
Zestyclose_Yak_3174@reddit
I am wondering how 27B 3.6 stacks up against 3.5 in this benchmark
pmttyji@reddit (OP)
Couldn't replace the image in thread so posted as separate comment. Check it out.
Zestyclose_Yak_3174@reddit
Thanks
200206487@reddit
Qwen3.6 27B dense just dropped
pmttyji@reddit (OP)
Couldn't replace the image in thread so posted as separate comment. Check it out
pmttyji@reddit (OP)
Yep, noticed. I thought of adding that one to this graph, but that site doesn't have this model yet.
LoSboccacc@reddit
Afaik they benchmark internally so there's some lag anyway.
some_user_2021@reddit
😲
Glittering-Call8746@reddit
Where is Qwen 3.6 27B?
pmttyji@reddit (OP)
Couldn't replace the image in thread so posted as separate comment. Check it out.
pmttyji@reddit (OP)
Updated one with Qwen3.6-27B, DeepSeek V4 Flash, DeepSeek V4 Pro. Had to remove a few tiny models.
nextlevelhollerith@reddit
I remember when we were amazed by GPT-4. GPT-4o (Nov 24) has an Intelligence Index of 17. Now Qwen 3.6 35B has 43. With some decent hardware you can run that locally. It's remarkable what open models we have right now.
Qwen30bEnjoyer@reddit
Years ago, people would've ridiculed you for suggesting we could get performance even close to any of those frontier models on local hardware. I've only been in this community for a year, and the jump between Qwen 2.5 and Qwen 3.6 is genuinely astonishing. But I wonder: what is the driving force behind the Pareto frontier of intelligence vs. hardware cost being lifted so drastically? Given that these are fundamentally just probability distributions over the next token, I wouldn't expect that kind of gain. If it's a combination of RL via GRPO or similar with dark knowledge transfer via distillation, that could make sense, but I wonder what implications it would have for generalization on out-of-distribution tasks.
Would it be possible that, as these RL pipelines keep sharpening the LLM for more diverse tasks, we'd eventually see overfitting, i.e. a greater drop in relative performance compared to the in-distribution baseline?
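For anyone unfamiliar, the GRPO piece is easy to sketch: there's no learned value baseline; each completion's advantage is just its reward normalized against a group of samples for the same prompt. A minimal illustration (the rewards below are made-up numbers, not from any real run):

```python
# Minimal sketch of GRPO's group-relative advantage (illustrative only).
# For one prompt, sample a group of completions, score each with a
# verifiable reward, and normalize against the group statistics.
from statistics import mean, stdev

rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]  # hypothetical pass/fail rewards
mu, sigma = mean(rewards), stdev(rewards)

advantages = [(r - mu) / (sigma + 1e-6) for r in rewards]
# Above-average completions get positive advantage and are reinforced;
# below-average ones are pushed down. The actual policy update is a
# clipped PPO-style objective weighted by these advantages.
print(advantages)
```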
PANIC_EXCEPTION@reddit
Coding, mostly. It's a goldmine for training data (verifiable rewards). It seems to transfer over to other reasoning as well. Just look at Qwen3-Coder-Next. I daily drive that thing for general purpose non-thinking tasks, not just code.
NNN_Throwaway2@reddit
I suspect a lot of the wow factor of 4o came from its writing style or personality, rather than its raw intelligence. On that front, Qwen 3.6 isn't really interchangeable with it.
bwjxjelsbd@reddit
it has people in a chokehold lol
remember when people were protesting after OAI retired that model?
PANIC_EXCEPTION@reddit
Ah yes, the unsatisfied married women with AI boyfriends. Truly the best exemplars of human preference.
NNN_Throwaway2@reddit
lol yeah weren't people going on hunger strike or am I thinking of something else
changing_who_i_am@reddit
That plus the memory harness OpenAI built. And probably it being trained on LMarena.
iMakeSense@reddit
You know any details on that? I know there weren't Anthropic-level leaks for it, but anything similar?
Ok-Contest-5856@reddit
GPT-4o still feels like a solid model to this day. I think "true" improvements for LLMs have somewhat stagnated or require a ton of thinking tokens, and we're in a benchmaxxing or specialization era right now.
po_stulate@reddit
But don't forget that they're open-weight models, not truly open models. Once they stop releasing new model weights, we have nothing in our hands that would let us make a model as good as what they're giving us right now.
dsartori@reddit
Look way to the right on that chart. OLMo is what we would be building on, I guess.
pmttyji@reddit (OP)
K2-Think-V2 from LLM360 too - u/ttkciar
ttkciar@reddit
Yup, I've been liking both Olmo-3.1-32B and K2-V2 for future community-driven projects.
Both are somewhat under-trained (189 tokens/parameter for Olmo-3.1-32B, and 112 tokens/parameter for K2-V2) but also punch above their weight in terms of overall inference quality. This means they should be strong base models for further training, and they should be able to absorb a lot more training data without "overcooking".
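As a rough back-of-the-envelope on what those ratios mean (the token total below is just derived from the ratio, not an official figure, and the Llama-3 number is for comparison only):

```python
# Quick arithmetic on the tokens/parameter figures quoted above.
olmo_params_b = 32
olmo_tok_per_param = 189
olmo_tokens_t = olmo_params_b * olmo_tok_per_param / 1000   # ~6.0T tokens implied

# For comparison, Llama-3-8B was trained on ~15T tokens:
llama3_tok_per_param = 15_000 / 8                           # ~1875 tokens/parameter

print(f"Olmo-3.1-32B: ~{olmo_tokens_t:.1f}T training tokens implied")
print(f"Llama-3-8B:   ~{llama3_tok_per_param:.0f} tokens/parameter, roughly 10x more saturated")
```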
They are not without drawbacks, though. Neither is great for creative writing; Olmo-3.1-32B-Instruct really shines at STEM tasks, while K2-V2-Instruct is great for analysis, RAG, and logical problem-solving.
However, when using Olmo-3.1-32B-Instruct for real-world physics problems I found it had some odd gaps in its knowledge (most glaringly, it insisted that Lithium-6 fission was not physically possible).
K2-V2-Instruct's Achilles' heel is that it gets very, very slow as context grows large (I think due to its simple architecture; it uses ye olde "llama" design). Its inference at 226K tokens of context is only 6% as fast as at 4K tokens.
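A quick back-of-the-envelope on why that tracks with the plain "llama" design (illustrative, not a profile):

```python
# With vanilla full attention, per-token decode cost is roughly
#   t(n) = fixed_cost + per_cached_token_cost * n
# so attention reads grow linearly with context while the MLP part stays flat.
ctx_small, ctx_large = 4_096, 226_000
reported_slowdown = 1 / 0.06                 # ~17x slower at 226K than at 4K

kv_growth = ctx_large / ctx_small            # ~55x more KV cache to attend over per token
print(f"KV cache grows ~{kv_growth:.0f}x, observed slowdown ~{reported_slowdown:.0f}x")
# If attention dominated completely you'd expect something closer to ~55x;
# sliding-window or latent-attention designs avoid most of this growth.
```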
Olmo's shortcomings can very likely be overcome with more training, and K2-V2 should be fine at lower-context-length inference, so I wouldn't rule either out.
Also, the main reason both of these models exhibit such high quality inference is because of the very high quality data used for training them, all of which is available for download on Huggingface. Even if we do not use the models themselves, using their data for community training projects seems like a gimme.
Another approach I've been doodling with recently is using Gemma-4-31B-it as a base model for an MoE. Since Google switched from their old, bad Gemma license to Apache-2.0, Gemma 4 is viable in ways the older Gemma models were not.
I had thought AllenAI's FlexOlmo technology was the answer to our prayers, in that it would allow us to federate community training without getting hung up on intercommunication during training, but u/mz_gt pointed out shortcomings of FlexOlmo that make it a lot less viable for that than I'd hoped.
Despite that, I think the same general approach to federated MoE training might work, if we can figure out how to keep the experts intercompatible without synchronizing during training.
One way we might be able to do that is to limit the number of sparse layers, keep the rest of the model dense, and bucketize the experts post-training so that mutually compatible experts are grouped. The routing logic (which unfortunately would need to happen post-merge; FlexOlmo really fell down there) would then choose only experts within those groups.
I need to do some groundwork to ascertain if this approach is viable at all, though. It might prove to be a non-starter.
If something like that can work, though, it would mean we could federate the training of experts, using Gemma-4-31B-it as the "anchor" model. Each participant would add training only to the middle six-layer "block" (five sliding window attention layers + one full attention layer) which RYS theory predicts is structure-agnostic, collectively making a 2.9B parameter expert.
The end result would be a model with 28B of shared/dense parameters and some number of 2.9B-parameter expert blocks. With 32 experts, four chosen, that would give us an MoE with 121B total parameters and 40B active parameters.
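The parameter math is just (a quick sketch using the approximate sizes above):

```python
# Sketch of the MoE parameter count (all sizes approximate).
dense_b = 28.0      # shared/dense parameters (Gemma-4-31B-it minus the swappable block)
expert_b = 2.9      # one six-layer expert block
num_experts = 32
top_k = 4           # experts chosen per token

total_b = dense_b + num_experts * expert_b   # ~120.8B total
active_b = dense_b + top_k * expert_b        # ~39.6B active per token
print(f"total ≈ {total_b:.0f}B, active ≈ {active_b:.0f}B")
```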
It looks good "on paper" but like I said there's a lot of groundwork to do first to see if the approach is even viable. Nonetheless, I'm confident that the open LLM community will have options, moving forward, if we need to pick up the mantle of open LLM R&D ourselves.
RobotRobotWhatDoUSee@reddit
Extremely interesting. I would like to subscribe to your newsletter on community-driven Franken-FlexOlmos :)
Caffdy@reddit
I understand your intention with this one, but wouldn't it be misleading to put the Artificial Analysis branding on it when they didn't create that particular graph per se? If I'm understanding correctly, you Frankenstein'ed the indexes from several of their posts across time. Like I said, I understand what you were trying to do; I think we've all done that, trying to suss out how new models compare to old ones. But I don't know how well it works: I'm sure Artificial Analysis changes their methodology all the time, and models slide up and down in value from one benchmark run to another.
pmttyji@reddit (OP)
From next time onwards I'll create & share my own graph some other way.
Skystunt@reddit
MiniMax M2.7 is crazy good, close to GLM but only ~200B parameters. I can run that model locally, which is rare for models that smart.
DeepSeek 4 has some hefty competition!
nomorebuttsplz@reddit
I find M2.7 8-bit to not be close to GLM 5.1 4-bit for debugging.
It will loop and second guess itself. “Wait… but the jinja template is version 3.2… actually… a simpler approach would be to test the code first… wait! I should look up what the docs say about the template… actually, I already did…” and on and on, not making progress.
Whereas GLM will take more time to process the prompt but is faster overall because it one-shots the bug.
CriticallyCarmelized@reddit
If it weren’t for its aggressive censoring, MiniMax M2.7 would be the best local model. Best balance of size and performance.
bipplemonade@reddit
Ten years ago I would never have imagined that we would, and could, rely only on the Chinese for open-source technologies.
ttkciar@reddit
Violates Rule Three: Low-effort post
The moderator team is trying to raise the bar on benchmark posts, to avoid inundating the sub. It is no longer sufficient to provide a screenshot of benchmark results. Benchmarks should be accompanied by insightful analysis or on-topic points which bring new understanding to the community.
pmttyji@reddit (OP)
:'( It took me more than 45 mins to create this graph after checking this sub's "New Model" Flair & HuggingFace.
jacek2023@reddit
they did the same with my posts in the past, but "Kimi cloud is cheaper than Claude cloud" gets fully loved; this is LocalLLaMA 2026
ttkciar@reddit
Okay, I've re-approved it. Could you edit your post text, please, to add some analysis or make some on-point and on-topic observations to promote understanding?
pmttyji@reddit (OP)
Thanks for restoring this. You're a lovely mod. I'm not gonna use the AA graph for threads. I'll update the post.
vex_humanssucks@reddit
Appreciate the chart — it's surprisingly hard to keep track when releases are happening weekly. One thing I'd add: the parameter count alone is getting less useful as a signal. Kimi K2.6 and GLM-5.1 perform way above their size class once you tune the sampling. Would be interesting to see this charted against MMLU or some task-specific benchmark alongside the model names.
exaknight21@reddit
You know, if Anthropic and OpenAI actually contributed to open source, the world would be a better place. But these tech bros and their cyber politics are such an oxymoron that we can't have that.
Significant_Fig_7581@reddit
Qwen3.5 4B and Qwen Next are not so different here. Why?
JamaiKen@reddit
What a time to be
Specter_Origin@reddit
We need Minimax 2.9 pronto xD
Zulfiqaar@reddit
Qwen 3.6 27B just out, top 5 I'm guessing
Thrumpwart@reddit
We are living in the future. Wooooohoooooo!
jacek2023@reddit
"Possibly best 6 months for Local LLMs?!?" and they are constantly complaining that local LLMs are dead :)