Open Models - April 2026 - One of the best months of all time for Local LLMs?
Posted by pmttyji@reddit | LocalLLaMA | View on Reddit | 130 comments
Any underrated or overlooked models?
FYI: MiniMax-M2.7 switched its license (from MIT to Non-Commercial), so it's not in the graph.
^(PS: Took me 30 mins to gather these models & generate this graph)
SeyAssociation38@reddit
Qwen 3.6 397B will never be released, nor will anything over 122B. Management is trying to profit off of it, and this is why some Qwen team members left. Management sees it as giving away money.
Better-Struggle9958@reddit
Why is it called local?
Glittering_Focus1538@reddit
Because the weights are open, you can download and freely use the model if you have the hardware (for MiniMax, at least $10k worth).
Netsuko@reddit
I feel like a model that is, for all intents and purposes, 99.999% impossible to run locally should not be considered a "local" model at all.
Borkato@reddit
This opinion is very unpopular here and I have no idea why. It’s ridiculous to pretend like we should care a ton about an OS model that’s massive. Like yeah it’s neat but it’s not local.
ttkciar@reddit
I suppose if you personally only had a 4GB GPU, you wouldn't consider Qwen3.5-9B local either.
Glittering_Focus1538@reddit
I run Qwen 3.6 on my 16 GB card; I imagine others do too.
ttkciar@reddit
What does that have to do with anything?
Glittering_Focus1538@reddit
That it's stupid to base what everyone defines as local on what you can run. I'm sure there are plenty of people who have a Mac mini cluster ($6k) or an Nvidia DGX Spark ($4k) which could run DeepSeek V4.
ttkciar@reddit
Ah, okie-doke, it sounded like you were disagreeing, but I guess we are in agreement.
Borkato and Netsuko seem to be of the opinion that models they personally cannot host at home should not be considered local models.
The point of my 4GB GPU hypothetical was to illustrate the invalidity of exactly what you said -- basing "what everyone defines as local by what you can run".
Local models are, and always will be, any models which you could conceivably use if you had the necessary local hardware.
That requires, at a minimum, access to the weights and either inference software support or sufficient understanding of the model architecture to facilitate implementing inference software support.
Digger412@reddit
Just because it may not be runnable locally for you doesn't mean it isn't for others. I could run every model on that list for instance, and I've got a PR open to support both new MiMo V2.5 models in llama.cpp.
I don't say this to be mean, but just to push back a bit against the "Your model must be below X parameters to be considered local" sentiment. It feels like gatekeeping to say that just because a model is super large, it doesn't deserve to be discussed here.
Borkato@reddit
How can you run a 1T dense model?! What speeds do you get and how much vram do you have?
Digger412@reddit
None of those at the 1T+ size are dense models, they're all MoEs.
I've got eight 6000 Pros (so 768GB VRAM total), and speeds depend on the setup, basically. I also have 768GB of 12-channel DDR5 RAM, so I can do single-user inference with llama.cpp on CPU+GPU, but total throughput is lower than with vllm, for instance.
I've benched K2.6 at full quality in llama.cpp before and get about 40 tk/s TG at zero context.
Right now I'm doing some testing with the V2.5 Pro 1T gguf and it's much slower due to an FA incompatibility with the head size or something. It's about 10 tk/s, but I think that'd go up to 30 tk/s if I turned FA off (at the cost of needing much more KV cache memory).
DS V4 is still mostly unsupported AFAIK, and I can't fit it entirely on VRAM anyways so will be waiting for llama.cpp support.
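For reference, here's a minimal sketch of that CPU+GPU split using the llama-cpp-python bindings; the model path, layer count, and thread count below are placeholders rather than my exact config:

```python
# Partial GPU offload with llama.cpp via the llama-cpp-python bindings.
# Whatever doesn't fit in VRAM (remaining layers, possibly the KV cache)
# stays in system RAM and runs on CPU threads.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/kimi-k2.6-Q8_0.gguf",  # placeholder path to a local GGUF
    n_gpu_layers=40,    # offload as many layers as fit in VRAM; the rest stay on CPU
    n_ctx=8192,         # context window; larger values need more KV-cache memory
    flash_attn=False,   # disable FA when the model's head size isn't supported
    n_threads=32,       # CPU threads for the layers left on the CPU side
)

out = llm("Summarize the trade-offs of CPU+GPU offloading.", max_tokens=256)
print(out["choices"][0]["text"])
```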
Borkato@reddit
What’s your PP speed?
Digger412@reddit
It's in that chart for K2.6; for the V2.5 Q8_0, PP is ~600 tk/s I think. I haven't done a sweep bench on it yet.
Borkato@reddit
Oh shoot the chart didn’t load when I first looked
Embarrassed_Adagio28@reddit
Okay, well, what do you think the cutoff should be? The cutoff point will have to be arbitrary because people have a very wide range of local hardware... or you could just use your brain and understand that local doesn't mean the same thing for everybody.
Borkato@reddit
Honestly 4 3090s or so is a great cutoff. Anything more than that and you need server architecture tbh
ttkciar@reddit
Why would using a server matter? A lot of us here use servers. It's just different hardware, but if that hardware is right here at home, then it's local.
You get that servers are just computers, not fundamentally different from a desktop or laptop, right?
jacek2023@reddit
They simply lie to justify discussing these topics.
ttkciar@reddit
You are free to be wrong.
alphapussycat@reddit
Like 10 months ago it would've been like $3k. It's not unrealistic levels of hardware. It's just that it's too late to get hardware now.
b0tbuilder@reddit
Best deal for $3k is 2 x R9000 32GB. Nice cards, but it's sad that that's the most reasonable price/perf right now.
alphapussycat@reddit
No, it's an Epyc CPU and mobo with like 768GB of 12-channel RAM. That's the only reasonable way to run the 500+ GB models.
Basilthebatlord@reddit
For now at least!
DinoAmino@reddit
Open weight for someone who has the VRAM. Then it's local. I assume people who are at 16GB and under could say the same about GLM and MiniMax and Qwen 397B. They can't possibly run those. But some have the VRAM to run them locally. No need to split hairs over it, but I agree it would have been better and more accurate to just say open weight.
jacek2023@reddit
You asked a valid question but you got downvoted.
geldonyetich@reddit
Gemma 4:31b was the first time I felt dazzled by something approaching a frontier model in a locally running LLM. It's very sharp. Gemma 4:24b, on the other hand, did not impress; it even has a tendency to stroke out.
I finally gave Nemotron-3-Nano-Omni a try the other day and it was very, very fast. I'm still curious how smart it is; it could be quite good, but I can't really tell subjectively. I can definitely see the application for a wide range of tasks that require expedience without the inference cost of a dense model.
AD7GD@reddit
I just randomly installed gemma4 4b because lmstudio recommended it when I installed it on a test PC. It's shockingly good for a 4b model.
Glittering_Focus1538@reddit
It runs at 200 tok/s on my hardware. I desperately wanted it to be able to code in opencode, but it just wasn't to be; not even the bigger variant could.
hust921@reddit
People. "Local" doesn't mean: "runs on my gaming laptop". The democratization that local models are creating is still perfectly valid for companies, labs, local or even national governments. Who need or want to run their own infrastructure.
Local or opensource anything (AI included) has nothing to do with affordability. I would like to run it too. But just because I can't, doesn't make it any less "local".
vick2djax@reddit
This graph doesn’t make me feel good about my first 3090 coming in the mail in a few days
jimmytoan@reddit
The license switch from MiniMax is worth flagging - this is becoming a recurring pattern where models get released under permissive licenses (MIT, Apache) to build adoption and mindshare, then quietly shift to non-commercial when the project needs to monetize. For anyone building anything production-adjacent on these models, the license audit before deployment is now a necessary step. The graph is great btw, April was genuinely exceptional - Qwen3.6 35B alone would have made this month noteworthy.
iamn0@reddit
Qwen3.5-122B
Embarrassed_Adagio28@reddit
I found Qwen3.5 122B Q5 to be much worse than Qwen3.6 17B Q5 and even Qwen3.6 35B Q5. However, I am extremely excited to try out Qwen3.6 122B if they release it.
relmny@reddit
I find that it depends. Maybe usually yes, but I did find 2-3 cases where 122B was the model that "got it" while 27B never did (same prompt, many attempts). And what it "got" was comparable to the 397B and bigger models.
122b is a very strange model, to me...
Anyway, yeah, 27b is one of my daily drivers.
shansoft@reddit
In my coding experience, it's able to solve a lot more problems that 27B couldn't solve or simply got stuck on.
savage_shaq@reddit
What hardware are you using to run these massive models at home?
arcanemachined@reddit
God, yes. 122b please.
No_Algae1753@reddit
Who knows if they will release it tho
marscarsrars@reddit
Tell us about your experience with this model.
No_Algae1753@reddit
best 120b out there rn
SV_SV_SV@reddit
I am in a news deficit of 6 months or so... If you happen to have the experience, how does this model compare to GLM-4.5 Air?
nickless07@reddit
Similar but faster and a bit better.
ys2020@reddit
GLM 5.1 is my fav at the moment. Honestly, it's mind-blowing we get this type of quality with free weights.
Revolutionalredstone@reddit
That was indeed an incredible month. Those who can and do use AI are looking at something like an ever-brightening summer, forever ;)
TheCatDaddy69@reddit
Parameter size as a metric is so dumb...
ElementNumber6@reddit
It's a good general measurement of trained/instilled knowledge. A metric desperately under-valued by our leading testing benchmarks.
TheCatDaddy69@reddit
Yeah, it works for comparing the same model against its distills, but, for example, Gemma 4 31B gets near Kimi's 1T model while being a fraction of the size.
robogame_dev@reddit
Agree - param count is the fundamental resolution of the model, more params in a model is like more pixels in an image, it is able to draw finer distinctions out of the same amount of training data as compared to fewer params.
henk717@reddit
Certainly has been a hit month for me, and a rough month for the devs who had to bend Gemma4 into behaving, since it had the annoying traits of GPT-OSS, GLM, and the past Gemma combined (a BOS-like token in the template instead of an actual BOS, extremely sensitive to syntax, and heavy to run without SWA).
My personal hit was Qwen3.5-27B-Heretic, which is finally a model I can coax into writing really long stories. And many in our community have been enjoying Gemma4 as a roleplay model now that it behaves correctly.
rosie254@reddit
the landscape has moved really fast, but i still like my Qwen3-VL-8B. it just works well for some reason. nowadays i'm on gemma4 26b a4b and qwen3.5 9b, but those aren't exactly underrated!
also... this chart assumes very powerful hardware, so how is this focused on local? most people have 8GB or 16GB of VRAM at most
TheRealSol4ra@reddit
What a shitty graph. What does param count have to do with anything
jacek2023@reddit
1600B model is my favourite local model I run it all day on raspberry Pi
IrisColt@reddit
heh
dbenc@reddit
.00001 bit quant
jacek2023@reddit
I am not sure quants are available. I believe OP runs the unquantized version on his setup.
dbenc@reddit
ah my mistake. must be screaming at 1 token per week
jacek2023@reddit
I wonder what the use case for that is.
bucolucas@reddit
Compute at the power scale of Hawking radiation, perhaps.
StereoWings7@reddit
I guess he is an artist working on another project in homage to this musical performance.
Borkato@reddit
Honestly more like per year
dbenc@reddit
need someone smart to do the math
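A very rough sketch, if anyone wants to check it - all the bandwidth numbers below are guesses, and it ignores compute time entirely:

```python
# Back-of-envelope: if a 1.6T-parameter model can't fit in RAM, every token
# has to re-read the weights from storage, so time per token is roughly
# bytes-per-token / storage bandwidth.
params = 1.6e12        # parameters, from the chart
bytes_per_param = 2    # fp16, i.e. the "unquantized" version claimed above
storage_bw = 90e6      # bytes/s -- assumed Pi SD-card read speed

seconds_per_token = params * bytes_per_param / storage_bw
print(f"{seconds_per_token / 3600:.1f} hours per token")             # ~9.9 hours
print(f"{365 * 24 * 3600 / seconds_per_token:.0f} tokens per year")  # ~890
```

So more like a few tokens per day than one per week, under those (very generous) assumptions.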
typical-predditor@reddit
Wait until you find out the token speed of the Earth Supercomputer. It was asked to find the meaning of Life, the Universe, and Everything. I hear we're still waiting on the first token.
RelationshipLong9092@reddit
there was a guy who posted here yesterday with 16 Sparks he was clustering
he can run it
not sure who else
Monad_Maya@reddit
The smaller Flash variant is just about possible for a small minority of us.
MotokoAGI@reddit
I'm running Flash locally and it is a great, solid model. It's making it to the top of my list.
jacek2023@reddit
I am able to run quantized 235B models locally on my setup, but the people here discussing Kimi/DeepSeek/GLM models usually run them in the cloud (or don't run them at all, just hype the benchmarks) and call them "local" because these models are from China.
j_osb@reddit
No, local also means local deployments for companies and such.
For those, models like GLM 5.1 and Kimi K2.6 are very feasible.
alphapussycat@reddit
They can be fun on CPU if you happen to have built a 768GB DDR5 Epyc server before the RAM price hike. Expensive, sure, but also not that expensive.
It's only now that it's not really possible, but those with a server can run them locally.
ML-Future@reddit
Even so, it is important that such powerful models are open source.
ttkciar@reddit
I don't think any of those models qualify as open source, except maybe Trinity-Large-Thinking, but I don't think they publish their training datasets, do they?
If you meant it would be great if such powerful models were open source, then I wholeheartedly agree.
jacek2023@reddit
Wow finally I must agree with you on something ;)
epicrob@reddit
You mean the 1600-byte model running on your Raspberry Pi? ;)
debackerl@reddit
You only need a cluster of 128 RPis to run it 😂
GregariousJB@reddit
Is that all day for a single prompt?
I can't imagine a raspberry pi running AI that well. How did you do it?
SV_SV_SV@reddit
He obviously meant a hypercluster of water-cooled Raspberry Pis.
jacek2023@reddit
not my post but https://www.reddit.com/r/LocalLLaMA/comments/1sarlb8/gemma_4_running_on_raspberry_pi5/
lunerift@reddit
Feels like a great month on paper - but params don’t really tell the story.
In practice, a lot of these models still struggle with consistency and eval outside benchmarks.
Smaller well-tuned models often end up more usable in real pipelines.
Curious what people are actually running in production vs just testing?
RickyRickC137@reddit
Mistral would probably name the 1.6T model as "Medium Large"?
alphapussycat@reddit
Tbh, mistral has more realistic naming. A 32b model is small, the next step up is around 70-128b, then 400-1kb for the large.
9b and 4b are tiny models.
rditorx@reddit
How much is 1kb nowadays?
alphapussycat@reddit
1 trillion.
Pleasant-Shallot-707@reddit
Nah… medium small large
Netsuko@reddit
Calling DeepSeek V4 Pro Max a "local" model is an insane stretch. That thing is almost 900 gigabytes in size
dsanft@reddit
I can run it.
12 Mi50s, 2 3090s, dual core Xeon with 768GB DDR4.
Hoak-em@reddit
Dual-socket, so probably not, unless there's an inference engine that doesn't need duplication across the sockets.
I have the same dual-socket setup but with DDR5, two 8570s, and some different GPUs; I max out at AMX int4 GLM-5.1 -- anything beyond that would be impossible to run.
dsanft@reddit
I wrote my own engine to solve the NUMA/cross-socket problem. Don't have kernels for Deepseek MLA/DSA yet though. Will have to get those in soon.
Netsuko@reddit
Well.. "Technically" I am closer to being a Millionaire than Elon Musk is.
jcoigny@reddit
I see what you did there /s
Embarrassed_Adagio28@reddit
Dual core xeon? You running a 2008 cpu?
RelationshipLong9092@reddit
dual socket means there are two CPUs on his motherboard / barebone
BusinessYou7196@reddit
Socket ≠ core
jacek2023@reddit
"I can run it. (...) At least in theory" - this is exactly what is happening here in 2026. Hype destroyed this place.
_VirtualCosmos_@reddit
on MXFP4 if so...
thereisonlythedance@reddit
Still waiting on a GGUF here. Main devs of llama.cpp don’t seem to be DeepSeek fans.
VoiceApprehensive893@reddit
ssd torture time
Monad_Maya@reddit
Open weight might be more accurate but you know what OP meant.
IngenuityNo1411@reddit
human generated shit post
SimultaneousPing@reddit
human slop
_VirtualCosmos_@reddit
rude
mister2d@reddit
lol
Pleasant-Shallot-707@reddit
Most of the ones with a bar worth a damn are in no way local
Sanity_N0t_Included@reddit
Who the hell is running Deepseek-v4-Pro-Max locally?!?!?!?!
Netsuko@reddit
I guess your local datacenter/ai model provider xD
Sanity_N0t_Included@reddit
LOL!
GhostVPN@reddit
Just rent a GPU in a data center for a month, like €15/month.
inevitabledeath3@reddit
Which data center has GPUs that cheap?
Also I don't think a single GPU would work for DeepSeek in most cases.
GhostVPN@reddit
https://vast.ai/pricing
Netsuko@reddit
Dude you are VASTLY underestimating the cost to rent SEVERAL H100-class gpus lol
inevitabledeath3@reddit
I think you misread. Those prices are by the hour, and for DeepSeek V4 Pro you need 4 of them. Although I guess Flash only needs one.
madlad13265@reddit
According to prices here, you're looking at like 10-15 dollars *an hour* to run something that's like 1T parameters.
Signor_Garibaldi@reddit
To host Kimi you'd use 8x H100; you'd pay $27 per hour, not 17 per month.
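Back of the envelope, using an assumed per-GPU rate roughly in line with the vast.ai pricing page linked above:

```python
# Rough monthly cost of keeping a ~1T-parameter model up on rented GPUs.
gpus = 8                  # H100-class cards, as estimated above
usd_per_gpu_hour = 3.4    # assumed hourly rate per GPU, not a quote
hours_per_month = 24 * 30

hourly = gpus * usd_per_gpu_hour        # ~$27/hour
monthly = hourly * hours_per_month
print(f"~${hourly:.0f}/hour, ~${monthly:,.0f}/month")  # ~$27/hour, ~$19,584/month
```

Nowhere near €15 a month unless you only keep the instance up for about half an hour.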
ikkiyikki@reddit
Are you high? The electric bill alone would be more than that.
ElementNumber6@reddit
Many of us, perhaps, if not for mega corporations snatching and withholding the necessary hardware from us.
_VirtualCosmos_@reddit
You just need a couple of old smartphones clustered together, bro.
a_beautiful_rhind@reddit
Did I miss flash max? A deepseek we can run again?
Plastic-Stress-6468@reddit
I mean, I can technically run every model on the chart if I'm willing to wait a long-ass time or just rent a bunch of GPUs.
For what it's worth, I'd rather have a bunch of models I can't run publicly available than not. Maybe in a few years they won't be so out of reach.
Paradigmind@reddit
It must be cold in here. Qwen3.6 27B looks so small.
TopTippityTop@reddit
DeepSeek has 60% more parameters than Kimi, but manages to be worse.
some_user_2021@reddit
So many waifus
I-did-not-eat-that@reddit
Locally on my 50 grand "gaming rack".
Practical-Elk-1579@reddit
500gb vram models kek
-Akos-@reddit
LFM 2.5.
Technical-Earth-3254@reddit
I can't run it locally (yet!) but DS V4 Flash is SO good for its size.
mrinterweb@reddit
I really appreciate how good the smaller models are getting (Qwen, Gemma). More params doesn't necessarily mean better.
MrObsidian_@reddit
I just tried Granite-4.1-8b and it is straight up ass. But at least it's Apache-2, I guess.
Ne00n@reddit
Brother in VRAM, where do you get enough to run that?
Thrumpwart@reddit
…so far.
atape_1@reddit
Really unfortunate that MiniMax is no longer MIT.
I'm not sure it's because of this move, but the company's stock price is doing far worse than Z.Ai's.