Local LLMs > Cloud Hype? Efficiency is winning
Posted by No-Draft-116@reddit | LocalLLaMA | 6 comments
Everyone’s obsessed with bigger models, but running them locally is where things get real.
Hot take - efficiency is the new benchmark. Not max params, not peak FLOPS - just how long your system runs without sounding like a jet engine.
I’ve been testing smaller quantised models on edge-focused chips (MediaTek mainly), and ngl, things are more usable than I imagined :) fast responses, low power draw, and none of that cloud-dependency anxiety.
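To make it concrete, here's a minimal sketch of the kind of setup I mean - llama.cpp's Python bindings running a small quantised GGUF model on CPU. The model file, quant, and thread count are placeholders, not my exact config; tune them for your chip:

```python
# Minimal sketch: a small quantised GGUF model on CPU via llama.cpp's
# Python bindings. Model path and thread count are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen2.5-3b-instruct-q4_k_m.gguf",  # any small quantised model
    n_ctx=4096,    # context window
    n_threads=8,   # match the big cores on your SoC
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarise this paragraph in one sentence: ..."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```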
I think we're basically entering the "good enough locally" > "perfect in the cloud" phase.
Also, I weirdly don't see many people talking about MediaTek for edge AI / vision workloads. Am I missing something, or is it just underrated right now?
Anyone want to share what setups you're all running for local LLMs right now?
BidWestern1056@reddit
yes sir and with npcsh and incognide we'll keep winning
https://github.com/npc-worldwide/npcsh
https://github.com/npc-worldwide/incognide
tsukuyomi911@reddit
Efficiency means nothing when the output doesn't translate to economic value. Why do you think big tech is paying insane amounts of money for Anthropic and OpenAI models? They directly translate into usable, productive output. That said, there are efficiency gains to be had with inference-optimised chips, and that's already happening.
StupidScaredSquirrel@reddit
I agree with everything you said; it's just that there's value in small models too. I sell custom software to small firms and I use qwen35b on my laptop for so much of it. I can safely say my speed has roughly 3x'd. The number of clients hasn't, though, so for me the value is strictly in time rather than money. However, I can now potentially say yes to more work rather than turning people down, which has already happened in the past.
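For anyone curious about the plumbing, a minimal sketch - assuming an OpenAI-compatible local server such as Ollama or llama-server (the endpoint and model tag below are placeholders, not necessarily this commenter's stack):

```python
# Minimal sketch: calling a local model through an OpenAI-compatible
# endpoint. Base URL and model tag are placeholders - use whatever your
# local server (Ollama, llama-server, etc.) actually exposes.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's default OpenAI-compatible endpoint
    api_key="not-needed-locally",          # required by the client, ignored by the server
)

resp = client.chat.completions.create(
    model="qwen3:8b",  # placeholder tag; use whichever quant you pulled
    messages=[{"role": "user", "content": "Draft a polite scope-change email to a client."}],
)
print(resp.choices[0].message.content)
```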
StupidScaredSquirrel@reddit
Ok clanker
Macestudios32@reddit
Expanded opinion:
If we take into account the new digital ID laws, the news that chats are not private, and the fact that most people have little VRAM and RAM, plus the rise of open-source agents, what we get is... smaller, more portable models.
If I want to race, I buy a sports car; if I want low running costs, I buy an economy car that uses little fuel.
If money is relevant (as it always is), I'll buy the economy car even if I'd like to have a Ferrari.
SexyAlienHotTubWater@reddit
Cloud hardware is a lot more efficient than local hardware - performance per watt matters more when you're running it maxed out 24/7, so that's what it's optimised for.
I don't think the premise is correct, though. Access to compute is more important than efficiency at this point - there aren't enough GPUs to go around. That's why they're so obnoxiously expensive.