What are your predictions for the future of local LLMs?

Posted by HiddenPingouin@reddit | LocalLLaMA | 16 comments

Are we going to get more capable smaller models? How long before we can run something like GLM5.1 on a MacBook? Speaking of big models, are we getting more hardware capable of running them, or the opposite? Machines with more unified memory for inference?
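
For context on the "can it fit on a MacBook" question, here is a rough back-of-the-envelope sketch in Python. The formulas (weights ≈ params × bits ÷ 8, KV cache ≈ 2 × layers × KV heads × head dim × context × bytes) are the standard estimates; the parameter count, layer count, and context length plugged in are purely hypothetical placeholders, not published specs for any GLM model.

```python
# Rough memory estimate for running a quantized model locally.
# All concrete numbers below are illustrative assumptions.

def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory needed just for the weights, in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_len: int, bytes_per_value: int = 2) -> float:
    """KV cache for one sequence (keys + values), in GB."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_value / 1e9

if __name__ == "__main__":
    # Hypothetical ~355B-parameter model at 4-bit quantization.
    weights = weight_memory_gb(355, 4)
    # Hypothetical architecture: 92 layers, 8 KV heads of dim 128, 32k context, fp16 cache.
    kv = kv_cache_gb(layers=92, kv_heads=8, head_dim=128, context_len=32_768)
    print(f"weights ~ {weights:.0f} GB, KV cache ~ {kv:.1f} GB")
    # Compare the total against a machine's unified memory (e.g. 128 GB or
    # 192 GB configs) to judge whether the model fits at that quantization.
```

Note that for MoE models only the active experts are touched per token, which helps speed, but the full weight set still has to sit in memory, so unified-memory capacity is the binding constraint either way.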