Open Models - April 2026 - One of the best months of all time for Local LLMs?
Posted by pmttyji@reddit | LocalLLaMA | View on Reddit | 130 comments
Any underrated or overlooked models?
FYI: MiniMax-M2.7 switched its license (from MIT to Non-Commercial), so it's not in the graph.
^(PS: Took me 30 mins to gather these models & generate this graph)
SeyAssociation38@reddit
Qwen 3.6 397B will never be released, nor will anything over 122B. Management is trying to profit off of it, and this is why some Qwen team members left. Management sees it as giving away money.
Better-Struggle9958@reddit
Why is it called local?
Glittering_Focus1538@reddit
Because the weights are open, you can download and freely use the model if you have the hardware (for MiniMax, at least $10k worth).
Netsuko@reddit
I feel like a model that is, for all intents and purposes, 99.999% impossible to run locally should not be considered a "local" model at all.
Borkato@reddit
This opinion is very unpopular here and I have no idea why. It’s ridiculous to pretend like we should care a ton about an OS model that’s massive. Like yeah it’s neat but it’s not local.
ttkciar@reddit
I suppose if you personally only had a 4GB GPU, you wouldn't consider Qwen3.5-9B local either.
Glittering_Focus1538@reddit
I run Qwen 3.6 on my 16 GB card; I imagine others do too.
ttkciar@reddit
What does that have to do with anything?
Glittering_Focus1538@reddit
That it's stupid to base what everyone defines as local on what you can run. I'm sure there are plenty of people who have a Mac mini cluster ($6k) or an Nvidia DGX Spark ($4k) which could run DeepSeek V4.
ttkciar@reddit
Ah, okie-doke, it sounded like you were disagreeing, but I guess we are in agreement.
Borkato and Netsuko seem to be of the opinion that models they personally cannot host at home should not be considered local models.
The point of my 4GB GPU hypothetical was to illustrate the invalidity of exactly what you said -- basing "what everyone defines as local by what you can run".
Local models are, and always will be, any models which you could conceivably use if you had the necessary local hardware.
That requires, at a minimum, access to the weights and either inference software support or sufficient understanding of the model architecture to facilitate implementing inference software support.
Digger412@reddit
Just because it may not be runnable locally for you doesn't mean it isn't for others. I could run every model on that list for instance, and I've got a PR open to support both new MiMo V2.5 models in llama.cpp.
I don't say this to be mean, but just to push back a bit against the "Your model must be below X parameters to be considered local" sentiment. It feels like gatekeeping to say that just because a model is super large, it doesn't deserve to be discussed here.
Borkato@reddit
How can you run a 1T dense model?! What speeds do you get and how much vram do you have?
Digger412@reddit
None of those at the 1T+ size are dense models, they're all MoEs.
I've got eight 6000 Pros (so 768GB VRAM total), and speeds depend on the setup, basically. I also have 768GB of 12-channel DDR5 RAM, so I can do single-user inference with llama.cpp on CPU+GPU, but total throughput is lower than with vllm, for instance.
I've benched K2.6 at full quality in llama.cpp before and get about 40 tk/s TG at zero context.
Right now I'm doing some testing with the V2.5 Pro 1T gguf and it's much slower due to an FA incompatibility with the head size or something. It's about 10 tk/s, but I think that'd go up to 30 tk/s if I turned FA off (at the cost of needing much more KV cache memory).
DS V4 is still mostly unsupported AFAIK, and I can't fit it entirely on VRAM anyways so will be waiting for llama.cpp support.
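For reference, here's a minimal sketch of that CPU+GPU split using the llama-cpp-python bindings; the model path, layer count, and thread count below are placeholders rather than my exact config:

```python
# Partial GPU offload with llama.cpp via the llama-cpp-python bindings.
# Whatever doesn't fit in VRAM (remaining layers, possibly the KV cache)
# stays in system RAM and runs on CPU threads.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/kimi-k2.6-Q8_0.gguf",  # placeholder path to a local GGUF
    n_gpu_layers=40,    # offload as many layers as fit in VRAM; the rest stay on CPU
    n_ctx=8192,         # context window; larger values need more KV-cache memory
    flash_attn=False,   # disable FA when the model's head size isn't supported
    n_threads=32,       # CPU threads for the layers left on the CPU side
)

out = llm("Summarize the trade-offs of CPU+GPU offloading.", max_tokens=256)
print(out["choices"][0]["text"])
```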
Borkato@reddit
What’s your PP speed?
Digger412@reddit
It's in that chart for K2.6; for the V2.5 Q8_0, PP is ~600 tk/s I think. I haven't done a sweep bench on it yet.
Borkato@reddit
Oh shoot the chart didn’t load when I first looked
Embarrassed_Adagio28@reddit
Okay, well, what do you think the cutoff should be? The cutoff point will have to be arbitrary because people have a very wide range of local hardware... or you could just use your brain and understand that local doesn't mean the same thing for everybody.
Borkato@reddit
Honestly 4 3090s or so is a great cutoff. Anything more than that and you need server architecture tbh
ttkciar@reddit
Why would using a server matter? A lot of us here use servers. It's just different hardware, but if that hardware is right here at home, then it's local.
You get that servers are just computers, not fundamentally different from a desktop or laptop, right?
jacek2023@reddit
They simply lie to justify discussing these topics.
ttkciar@reddit
You are free to be wrong.
alphapussycat@reddit
Like 10 months ago it would've been like $3k. It's not unrealistic levels of hardware. It's just that it's too late to get hardware now.
b0tbuilder@reddit
Best deal for $3k is 2 x R9000 32GB. Nice cards, but it's sad that that's the most reasonable price/perf right now.
alphapussycat@reddit
No, it's an Epyc CPU and mobo with like 768GB of 12-channel RAM. That's the only reasonable way to run the 500+ GB models.
Basilthebatlord@reddit
For now at least!
DinoAmino@reddit
Open weight for someone who has the VRAM. Then it's local. I assume people who are at 16GB and under could say the same about GLM and MiniMax and Qwen 397B. They can't possibly run those. But some have the VRAM to run them locally. No need to split hairs over it, but I agree it would have been better and more accurate to just say open weight.
jacek2023@reddit
You asked a valid question but you got downvoted.
geldonyetich@reddit
Gemma 4:31b was the first time I felt dazzled by something approaching a frontier model in a locally running LLM. It's very sharp. Gemma 4:24b, on the other hand, did not impress; it even has a tendency to stroke out.
I finally gave Nemotron-3-Nano-Omni a try the other day and it was very, very fast. I'm still curious how smart it is; it could be quite good, but I can't really tell subjectively. I can definitely see the application for a wide range of tasks that require expedience without the inference cost of a dense model.
AD7GD@reddit
I just randomly installed gemma4 4b because lmstudio recommended it when I installed it on a test PC. It's shockingly good for a 4b model.
Glittering_Focus1538@reddit
It runs at 200 tok/s on my hardware. I desperately wanted it to be able to code in opencode, but it just wasn't to be; not even the bigger variant could.
hust921@reddit
People. "Local" doesn't mean: "runs on my gaming laptop". The democratization that local models are creating is still perfectly valid for companies, labs, local or even national governments. Who need or want to run their own infrastructure.
Local or opensource anything (AI included) has nothing to do with affordability. I would like to run it too. But just because I can't, doesn't make it any less "local".
vick2djax@reddit
This graph doesn’t make me feel good about my first 3090 coming in the mail in a few days
jimmytoan@reddit
The license switch from MiniMax is worth flagging - this is becoming a recurring pattern where models get released under permissive licenses (MIT, Apache) to build adoption and mindshare, then quietly shift to non-commercial when the project needs to monetize. For anyone building anything production-adjacent on these models, the license audit before deployment is now a necessary step. The graph is great btw, April was genuinely exceptional - Qwen3.6 35B alone would have made this month noteworthy.
iamn0@reddit
Qwen3.5-122B
Embarrassed_Adagio28@reddit
I found Qwen3.5 122B Q5 to be much worse than Qwen3.6 17B Q5 and even Qwen3.6 35B Q5. However, I am extremely excited to try out Qwen3.6 122B if they release it.
relmny@reddit
I find that it depends. Maybe usually yes, but I did find 2-3 cases where 122B was the model that "got it" while 27B never did (same prompt, many attempts). And what it "got" was comparable to the 397B and bigger models.
122b is a very strange model, to me...
Anyway, yeah, 27b is one of my daily drivers.
shansoft@reddit
In my coding experience, it's able to solve a lot more problems that 27B couldn't solve or simply got stuck on.
savage_shaq@reddit
What hardware are you using to run these massive models at home?
arcanemachined@reddit
God, yes. 122b please.
No_Algae1753@reddit
Who knows if they will release it tho
marscarsrars@reddit
Tell us about your experience with this model.
No_Algae1753@reddit
best 120b out there rn
SV_SV_SV@reddit
I am in a news deficit of 6 months or so... If you happen to have the experience, how does this model compare to GLM-4.5 Air?
nickless07@reddit
Similar but faster and a bit better.
ys2020@reddit
GLM 5.1 is my fav at the moment. Honestly, it's mind-blowing we get this type of quality with free weights.
Revolutionalredstone@reddit
That was indeed an incredible month. Those who can and do use AI are looking at something like an ever-brightening summer, forever ;)
TheCatDaddy69@reddit
Parameter size as a metric is so dumb...
ElementNumber6@reddit
It's a good general measurement of trained/instilled knowledge. A metric desperately under-valued by our leading testing benchmarks.
TheCatDaddy69@reddit
Yeah, it works for comparing the same model against its distills, but, for example, Gemma 4 31B gets near Kimi's 1T model while being a fraction of the size.
robogame_dev@reddit
Agree - param count is the fundamental resolution of the model, more params in a model is like more pixels in an image, it is able to draw finer distinctions out of the same amount of training data as compared to fewer params.
henk717@reddit
Certainly has been a hit month for me, and a rough month for the devs who had to bend Gemma4 into behaving, since it had the annoying traits of GPT-OSS, GLM, and the past Gemma combined (a BOS-like token in the template instead of an actual BOS, extremely sensitive to syntax, and heavy to run without SWA).
My personal hit was Qwen3.5-27B-Heretic, which is finally a model I can coax into writing really long stories. And many in our community have been enjoying Gemma4 as a roleplay model now that it behaves correctly.
rosie254@reddit
the landscape has moved really fast, but i still like my Qwen3-VL-8B. it just works well for some reason. nowadays i'm on gemma4 26b a4b and qwen3.5 9b, but those aren't exactly underrated!
also... this chart assumes very powerful hardware, so how is this focused on local? most people have 8GB or 16GB of VRAM at most
TheRealSol4ra@reddit
What a shitty graph. What does param count have to do with anything
jacek2023@reddit
1600B model is my favourite local model I run it all day on raspberry Pi
IrisColt@reddit
heh
dbenc@reddit
.00001 bit quant
jacek2023@reddit
I am not sure quants are available. I believe OP runs the unquantized version on his setup.
dbenc@reddit
ah my mistake. must be screaming at 1 token per week
jacek2023@reddit
I wonder what the use case for that is.
bucolucas@reddit
Compute at the power scale of Hawking radiation, perhaps.
StereoWings7@reddit
I guess he is an artist working on another project in homage to this musical performance.
Borkato@reddit
Honestly more like per year
dbenc@reddit
need someone smart to do the math
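A very rough sketch, if anyone wants to check it - all the bandwidth numbers below are guesses, and it ignores compute time entirely:

```python
# Back-of-envelope: if a 1.6T-parameter model can't fit in RAM, every token
# has to re-read the weights from storage, so time per token is roughly
# bytes-per-token / storage bandwidth.
params = 1.6e12        # parameters, from the chart
bytes_per_param = 2    # fp16, i.e. the "unquantized" version claimed above
storage_bw = 90e6      # bytes/s -- assumed Pi SD-card read speed

seconds_per_token = params * bytes_per_param / storage_bw
print(f"{seconds_per_token / 3600:.1f} hours per token")             # ~9.9 hours
print(f"{365 * 24 * 3600 / seconds_per_token:.0f} tokens per year")  # ~890
```

So more like a few tokens per day than one per week, under those (very generous) assumptions.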
typical-predditor@reddit
Wait until you find out the token speed of the Earth Supercomputer. It was asked to find the meaning of Life, the Universe, and Everything. I hear we're still waiting on the first token.
RelationshipLong9092@reddit
there was a guy who posted here yesterday with 16 Sparks he was clustering
he can run it
not sure who else
Monad_Maya@reddit
The smaller Flash variant is just about possible for a small minority of us.
MotokoAGI@reddit
I'm running Flash locally and it is a great, solid model. It's making it to the top of my list.
jacek2023@reddit
I am able to run quantized 235B models locally on my setup, but the people here discussing Kimi/DeepSeek/GLM models usually run them in the cloud (or don't run them at all, just hype the benchmarks) and call them "local" because these models are from China.
j_osb@reddit
No, local also means local deployments for companies and such.
For those, models like GLM 5.1 and Kimi K2.6 are very feasible.
alphapussycat@reddit
They can be fun on CPU if you happen to have built a 768GB DDR5 Epyc server before the RAM price hike. Expensive, sure, but also not that expensive.
It's only now that it's not really possible, but those with a server can run them locally.
ML-Future@reddit
Even so, it is important that such powerful models are open source.
ttkciar@reddit
I don't think any of those models qualify as open source, except maybe Trinity-Large-Thinking, but I don't think they publish their training datasets, do they?
If you meant it would be great if such powerful models were open source, then I wholeheartedly agree.
jacek2023@reddit
Wow finally I must agree with you on something ;)
epicrob@reddit
You mean the 1600-byte model running on your Raspberry Pi? ;)
debackerl@reddit
You only need a cluster of 128 RPis to run it 😂
GregariousJB@reddit
Is that all day for a single prompt?
I can't imagine a raspberry pi running AI that well. How did you do it?
SV_SV_SV@reddit
He obviously meant a hypercluster of water-cooled Raspberry Pis.
jacek2023@reddit
not my post but https://www.reddit.com/r/LocalLLaMA/comments/1sarlb8/gemma_4_running_on_raspberry_pi5/
lunerift@reddit
Feels like a great month on paper - but params don’t really tell the story.
In practice, a lot of these models still struggle with consistency and eval outside benchmarks.
Smaller well-tuned models often end up more usable in real pipelines.
Curious what people are actually running in production vs just testing?
RickyRickC137@reddit
Mistral would probably name the 1.6T model as "Medium Large"?
alphapussycat@reddit
Tbh, mistral has more realistic naming. A 32b model is small, the next step up is around 70-128b, then 400-1kb for the large.
9b and 4b are tiny models.
rditorx@reddit
How much is 1kb nowadays?
alphapussycat@reddit
1 trillion.
Pleasant-Shallot-707@reddit
Nah… medium small large
Netsuko@reddit
Calling DeepSeek V4 Pro Max a "local" model is an insane stretch. That thing is almost 900 gigabytes in size
dsanft@reddit
I can run it.
12 Mi50s, 2 3090s, dual core Xeon with 768GB DDR4.
Hoak-em@reddit
Dual-socket, so probably not, unless there's an inference engine that doesn't need duplication across the sockets.
I have the same dual-socket setup but with DDR5, two 8570s, and some different GPUs; I max out at AMX int4 GLM-5.1 -- anything beyond that would be impossible to run.
dsanft@reddit
I wrote my own engine to solve the NUMA/cross-socket problem. Don't have kernels for Deepseek MLA/DSA yet though. Will have to get those in soon.
Netsuko@reddit
Well.. "Technically" I am closer to being a Millionaire than Elon Musk is.
jcoigny@reddit
I see what you did there /s
Embarrassed_Adagio28@reddit
Dual core xeon? You running a 2008 cpu?
RelationshipLong9092@reddit
dual socket means there are two CPUs on his motherboard / barebone
BusinessYou7196@reddit
Socket ≠ core
jacek2023@reddit
"I can run it. (...) At least in theory" - this is exactly what is happening here in 2026. Hype destroyed this place.
_VirtualCosmos_@reddit
on MXFP4 if so...
thereisonlythedance@reddit
Still waiting on a GGUF here. Main devs of llama.cpp don’t seem to be DeepSeek fans.
VoiceApprehensive893@reddit
ssd torture time
Monad_Maya@reddit
Open weight might be more accurate but you know what OP meant.
IngenuityNo1411@reddit
human generated shit post
SimultaneousPing@reddit
human slop
_VirtualCosmos_@reddit
rude
mister2d@reddit
lol
Pleasant-Shallot-707@reddit
Most of the ones with a bar worth a damn are in no way local
Sanity_N0t_Included@reddit
Who the hell is running Deepseek-v4-Pro-Max locally?!?!?!?!
Netsuko@reddit
I guess your local datacenter/ai model provider xD
Sanity_N0t_Included@reddit
LOL!
GhostVPN@reddit
Just rent a GPU in a data center for a month, like €15/month.
inevitabledeath3@reddit
Which data center has GPUs that cheap?
Also I don't think a single GPU would work for DeepSeek in most cases.
GhostVPN@reddit
https://vast.ai/pricing
Netsuko@reddit
Dude you are VASTLY underestimating the cost to rent SEVERAL H100-class gpus lol
inevitabledeath3@reddit
I think you misread. Those prices are by the hour, and for DeepSeek V4 Pro you need 4 of them. Although I guess Flash only needs one.
madlad13265@reddit
According to prices here, you're looking at like 10-15 dollars *an hour* to run something that's like 1T parameters.
Signor_Garibaldi@reddit
To host Kimi you'd use 8x H100; you'd pay $27 per hour, not 17 per month.
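Back of the envelope, using an assumed per-GPU rate roughly in line with the vast.ai pricing page linked above:

```python
# Rough monthly cost of keeping a ~1T-parameter model up on rented GPUs.
gpus = 8                  # H100-class cards, as estimated above
usd_per_gpu_hour = 3.4    # assumed hourly rate per GPU, not a quote
hours_per_month = 24 * 30

hourly = gpus * usd_per_gpu_hour        # ~$27/hour
monthly = hourly * hours_per_month
print(f"~${hourly:.0f}/hour, ~${monthly:,.0f}/month")  # ~$27/hour, ~$19,584/month
```

Nowhere near €15 a month unless you only keep the instance up for about half an hour.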
ikkiyikki@reddit
Are you high? The electric bill alone would be more than that.
ElementNumber6@reddit
Many of us, perhaps, if not for mega corporations snatching and withholding the necessary hardware from us.
_VirtualCosmos_@reddit
You just need a couple of old smartphones clustered together, bro.
a_beautiful_rhind@reddit
Did I miss flash max? A deepseek we can run again?
Plastic-Stress-6468@reddit
I mean, I can technically run every model on the chart if I'm willing to wait a long-ass time or just rent a bunch of GPUs.
For what it's worth, I'd rather have a bunch of models I can't run publicly available than not. Maybe in a few years they won't be so out of reach.
Paradigmind@reddit
It must be cold in here. Qwen3.6 27B looks so small.
TopTippityTop@reddit
DeepSeek has 60% more parameters than Kimi, but manages to be worse.
some_user_2021@reddit
So many waifus
I-did-not-eat-that@reddit
Locally on my 50 grand "gaming rack".
Practical-Elk-1579@reddit
500gb vram models kek
-Akos-@reddit
LFM 2.5.
Technical-Earth-3254@reddit
I can't run it locally (yet!) but DS V4 Flash is SO good for its size.
mrinterweb@reddit
I really appreciate how good the smaller models are getting (Qwen, Gemma). More params doesn't necessarily mean better.
MrObsidian_@reddit
I just tried Granite-4.1-8b and it is straight up ass. But at least it's Apache-2, I guess.
Ne00n@reddit
Brother in VRAM, where do you get enough to run that?
Thrumpwart@reddit
…so far.
atape_1@reddit
Really unfortunate that MiniMax is no longer MIT.
I'm not sure it's because of this move, but the company's stock price is doing far worse than Z.Ai's.