API pricing is in freefall. What's the actual case for running local now beyond privacy?
Posted by Distinct-Expression2@reddit | LocalLLaMA | View on Reddit | 393 comments
K2.5 just dropped at roughly 10% of Opus pricing with competitive benchmarks. Deepseek is practically free. Gemini has a massive free tier. Every month the API cost floor drops another 50%.
Meanwhile, running a 70B locally still means either a multi-thousand-dollar GPU or dealing with quantization tradeoffs and 15 tok/s on consumer hardware.
I've been running local for about a year now and I'm genuinely starting to question the math. The three arguments I keep hearing:
- Privacy — legit, no argument. If you're processing sensitive data, local is the only option.
- No rate limits — fair, but most providers have pretty generous limits now unless you're doing something unusual.
- "It's free after hardware costs" — this one aged poorly. That 3090 isn't free, electricity isn't free, and your time configuring and optimizing isn't free. At current API rates you'd need to run millions of tokens before breaking even.
The argument I never hear but actually find compelling: latency control and customization. If you need a fine-tuned model for a specific domain with predictable latency, local still wins. But that's a pretty niche use case.
What's keeping you all running local at this point? Genuinely curious if I'm missing something or if the calculus has actually shifted.
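For anyone who wants to sanity-check that break-even claim, here's a rough sketch; the hardware price, power draw, and API rate are all illustrative assumptions, not measurements:

```python
# Rough break-even: used 3090 vs. a cheap API. All numbers are assumptions.
HARDWARE_COST = 700.0       # used RTX 3090, USD (assumption)
POWER_KW = 0.35             # draw under inference load (assumption)
ELECTRICITY = 0.15          # USD per kWh (assumption)
LOCAL_TOK_PER_SEC = 15      # quantized 70B on consumer hardware, per the post
API_PER_MTOK = 0.50         # blended USD per 1M tokens for a cheap API (assumption)

# Electricity cost to generate one million tokens locally:
hours_per_mtok = 1e6 / LOCAL_TOK_PER_SEC / 3600
elec_per_mtok = hours_per_mtok * POWER_KW * ELECTRICITY
print(f"local electricity: ${elec_per_mtok:.2f}/Mtok vs API ${API_PER_MTOK:.2f}/Mtok")

# The hardware only pays itself off if local undercuts the API per token:
margin = API_PER_MTOK - elec_per_mtok
if margin > 0:
    print(f"break-even after {HARDWARE_COST / margin:,.0f}M tokens")
else:
    print("electricity alone costs more than the API: no break-even point")
```

With these particular numbers the electricity alone (~$0.97/Mtok) already exceeds the API rate, which is the point above; cheaper power or a faster rig flips the result.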
3lm3rmaid@reddit
I agree with you, the math doesn't work anymore. I'd say it's dead unless you have a lot of users that would justify paying for GPUs, electricity and whatnot. A good middle ground imo is a flat-rate API for open weights like Featherless.ai: gives the privacy of local without the hardware costs, but you don't own anything, so that's the tradeoff.
Successful-Major-257@reddit
honestly the vendor lock-in risk is real. been using openclaw-api to hedge across models when one goes down
tech2biz@reddit
We were running fully on-prem but found that hybrid with dynamic cascading during runtime (for cost and latency optimization) is the sweet spot.
_Anime_Anuradha@reddit
It's better to use the websites that offer a few free APIs, where the paid ones are less expensive... but you have to scrape the internet to find those sites. There is a site I found that's working pretty well for me -- freeaiapikey -- maybe it works for those who need it.
recovery_baha@reddit
This happened to a friend last month — an agent loop ate ~$600 overnight. Hard caps + retry guards saved us later. Worth checking before scaling.
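A minimal sketch of what "hard caps + retry guards" can look like; `call_fn` and `cost_estimator` here are hypothetical placeholders, not any particular SDK:

```python
# Sketch of a spend cap + retry guard around an agent loop.
# call_fn and cost_estimator are hypothetical placeholders, not a real SDK.
import time

MAX_SPEND_USD = 25.0   # hard cap per run (pick your own pain threshold)
MAX_RETRIES = 3        # give up instead of hammering a failing endpoint

spend_usd = 0.0

def guarded_call(call_fn, cost_estimator, *args):
    """One model call that can abort the whole loop before it runs away."""
    global spend_usd
    if spend_usd >= MAX_SPEND_USD:
        raise RuntimeError(f"spend cap ${MAX_SPEND_USD} hit, aborting agent loop")
    for attempt in range(MAX_RETRIES):
        try:
            result = call_fn(*args)
            spend_usd += cost_estimator(result)  # price from usage metadata
            return result
        except Exception:
            time.sleep(2 ** attempt)             # backoff, not instant retry
    raise RuntimeError("retries exhausted, refusing to loop forever")
```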
sloptimizer@reddit
Consistency is a big one for me! You never know what model or quant you're getting when using cloud APIs. You don't know what kind of system prompt it has (and how much of your context window that system prompt has eaten).
When you run local, you know what you're getting. You can run with no system prompt on an empty context, which gives you the best possible output. And finally, you can sharpen your prompting skills by seeing what works and what doesn't, without all the extra variables thrown in. For example: is my prompt the problem, or am I just hitting a lobotomized Q4 quant on OpenRouter?
On a subjective side, there is certain joy in running these models on a box sitting under your desk. Makes it feel real, like something you can see or touch.
Minimum-Vanilla949@reddit
The offline aspect is huge for me - I travel a lot and having models that work without internet is clutch. Also call me paranoid but I don't trust these API companies to not randomly change their ToS or jack up prices once they corner the market
thatguy122@reddit
Exactly this. Don't be fooled by these 10 yr subsidized loss leader fees intended to corner the market.
Icy-Pay7479@reddit
Trudging through the mountains with a triple 3090 desktop “death stranding” style.
CallumCarmicheal@reddit
"Hi I'm Sam" echo's in the distance swiftly followed by the roarings of the 3090's spinning up.
T_UMP@reddit
Roarings that send shivers down my spine :)
CarrotcakeSuperSand@reddit
They’re not subsidized, inference has pretty solid gross margins. It’s the training and initial infrastructure buildout that causes negative cash flow.
Mkboii@reddit
So isn't it better to hold off on local upgrades till the market collapses? That would coincide with demand for data centre RAM going down, and then better local GPU options may become available.
miken0222@reddit
Do you mind sharing what model you use on your offline travels? I'm in a similar situation where offline is frequent but I need something during those times.
Kahvana@reddit
On my 8GB RAM laptop I use LFM2-VL 1.6B (could probably go for the 3B) as it's super fast, has low resource usage, has vision, and is decent enough for tool calling. Pair it with zim archives (Kiwix) for grounding / world knowledge and it's solid. Haven't tried web search with it.
You can also pair it with LFM2-CoBERT-350M for RAG. Probably not better than Qwen3-Embedding-0.6B, but it is much lighter to run.
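If anyone wants to try that pairing, a minimal retrieval sketch with sentence-transformers is below. Qwen3-Embedding-0.6B ships with sentence-transformers support; whether the LFM2 retriever loads the same way is an assumption I haven't verified:

```python
# Minimal local retrieval sketch (bi-encoder style) with sentence-transformers.
# Qwen3-Embedding-0.6B is the baseline named above; whether the LFM2 retriever
# loads the same way is unverified.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

docs = [
    "Kiwix serves zim archives for offline world knowledge.",
    "LFM2-VL is a small vision-language model with tool calling.",
]
doc_emb = model.encode(docs, normalize_embeddings=True)

query_emb = model.encode(["what gives offline world knowledge?"],
                         normalize_embeddings=True)
scores = query_emb @ doc_emb.T        # cosine similarity on unit vectors
print(docs[scores[0].argmax()])       # best-matching chunk goes into the prompt
```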
Aardvark_Says_What@reddit
Why do I suddenly feel like a 3-year old with stabiliser wheels on my bike, somehow caught up in the Tour peloton?
Kahvana@reddit
Sorry, what? 🤣
Aardvark_Says_What@reddit
Remember that Superman movie where he flew so fast around the Earth that time went backwards? That's your knowledge. Mine is lifting up the back of a pickup truck.
I'm still impressed with what I can do. :)
Kahvana@reddit
Thanks for the compliment!
spaceman_@reddit
100% this last bit. It's the same reason I use so many open source tools: I don't want to depend on a single vendor who can independently decide "actually, you need us now" and jack up prices massively or change the model quality I have access to as a low-end user.
Think about Adobe but on steroids.
Aardvark_Says_What@reddit
BMW heated seats subscription. Owners go fucking mental. BMW: "Oops. We went a bit too far (introduced it too soon, save it for later)."
Late-stage capitalism. Living the dream.
fenixnoctis@reddit
I don’t get your logic here though.
Once a vendor jacks up prices, you’re free to switch.
Until then why not take advantage of it?
dreadcain@reddit
You're free to switch, but switching isn't free. Depending on how much notice they give you and how twisted up you've gotten your business in their ecosystem, you might be stuck paying whatever they want to charge while you spend a few months (or years, in my experience...) migrating, all while probably paying for whatever service you're switching to, plus a whole slew of "transition" consultants who promise it'll be fast and seamless and yet somehow never deliver on either.
fenixnoctis@reddit
I didn’t get the impression we were talking about business in the OP or this post.
Business is a diff story than personal use.
dreadcain@reddit
I mean I guess the stakes are lower, but getting yourself reliant on something that is 100% guaranteed to explode in price just feels like a bad play
fenixnoctis@reddit
Build your systems to avoid lock in? You can abstract away anything.
spaceman_@reddit
We're talking about what happens when one player corners the market. Say, for example, OpenAI achieves a model that others can't come close to. Other players start exiting the market or going out of business. No other serious alternative is available. The winner can do as they please now.
There used to be competition to Adobe in graphic design and desktop publishing, but throughout the late 90s and early 00s, they either went out of business or were acquired by Adobe. Now graphic designers really only have one player to turn to if they're serious about working in that industry, and have to put up with their subscription bullshit.
AciD1BuRN@reddit
They keep changing the ToS so much now, I don't think it can even be called a ToS at this point.
Aardvark_Says_What@reddit
"Waterproof Shoes Incorporated hereby do not guarantee that our products will be waterproof or necessarily provide the services expected of products called 'shoes'."
MoffKalast@reddit
It's more like TNG now already, or even DS9.
Aardvark_Says_What@reddit
> I don't trust these API companies to not randomly change...
That's not nice. You really can trust them 100%... to jack up the prices and turn down the tokens the second they think they have somehow locked you in.
genshiryoku@reddit
The trend we're seeing is a commoditization of LLMs. It's basically impossible to have a monopoly on a commodity, so over the long run costs should keep getting lower as innovations reduce the cost of serving inference.
There's also an economy of scale going on, where large server farms can serve you inference for less than what it costs you just to pay for the electricity.
For me, as someone who owns multiple RTX 3090s, the electricity cost of serving prompts is already higher than the cost of using APIs for inference on the same model.
Local only makes sense if you have free (solar) power, need privacy, or need offline usage. Or if you have a custom fine-tuned model to run, since hosting that on rented GPUs is still more expensive than running it yourself, unlike generic big-model inference.
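To put rough numbers on that electricity point (wattage, power price, and throughput below are all assumptions, not this commenter's actual rig):

```python
# Electricity-only cost of local inference vs. a cheap API. All assumptions.
RIG_WATTS = 700      # e.g. two RTX 3090s under load (assumption)
PRICE_KWH = 0.30     # EU-ish electricity price in USD (assumption)
TOK_PER_SEC = 25     # big quantized model across two cards (assumption)

kwh_per_mtok = (1e6 / TOK_PER_SEC) / 3600 * RIG_WATTS / 1000
print(f"electricity alone: ${kwh_per_mtok * PRICE_KWH:.2f} per 1M tokens")
# Open-weight API pricing is often well under $1 per 1M output tokens,
# which is the commenter's point: scale beats home power prices.
```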
GnistAI@reddit
Interestingly, anyone using electricity to heat their home can justifiably consider their GPU power consumption as free. At least during winter.
MrPecunius@reddit
Everybody needs privacy.
UsernameAttempt@reddit
I'm not sure cornering the market is in the cards. There's too much competition, too many alternatives, and the technology is not one that can be monopolized. The best models of the biggest companies of 1 year ago are worse than the models of small companies in China today. With improvements to models slowing down, we're moving towards models as commodities - similar in performance and competing on price.
SeasonNo3107@reddit
Cornering the market will only happen in 5 plus years with buyouts and mergers imo
SlimPerceptions@reddit
Ironic reading a take like this on a Local LLM sub
thisdude415@reddit
The point is that there's too much to corner/buy out/merge.
The information on how to create a new model is out there, and it can be done for relatively small amounts of money (which will continue to get cheaper).
Current cost estimates suggest a new AI model costs maybe $15M to train. That's a lot of money, but it's really not a lot of money in the grand scheme of business.
Any Fortune 500 company can deploy $20M without breaking a sweat. A typical car dealership has $10M in new vehicles sitting on the lot.
That's not to say that the "high end" intelligence won't get more expensive (it will), but there IS meaningful competition at the low end, and this space will likely get MORE competitive, not less.
Ok-Internal9317@reddit
Almost impossible, given how many good open-source models are out there; one can just buy some H100s and run an inference business. It'll be a matter of competition on energy prices.
Conscious-Ball8373@reddit
There's also too much investor money around and once it dries up someone will have to pay the bills.
-dysangel-@reddit
You can already run fairly useful models on a mid range Mac though, so by then almost everyone is going to have access to "good enough" inference at home - if they want it
cuberhino@reddit
What should I run? I have 2 x base Mac mini M4, and a 3090 + 5700X3D with 64GB RAM. Wish there was a site or something where I could put in my rig and it would recommend the strongest models it can handle.
yomohiroyuzuuu@reddit
Someone please make this site. I don’t know what I’m doing, and every article looks like alphabet soup to me.
Trotskyist@reddit
The thing is, it's a pretty difficult question to answer with a ton of variables. Everything has tradeoffs and is heavily context dependent.
yomohiroyuzuuu@reddit
Maybe like “if you want to do X, you need Y, and if you want to do it fast/nice, you need to do buy Z.”
A nice general grid/chart or something. I dunno.
Conscious-Ball8373@reddit
Yes, I think there's a very real risk to investors that by the time they start trying to collect on their investment, a combination of hardware and software progress will mean that anyone can run the latest models at home on commodity hardware. We currently run so much stuff in the cloud because the hardware to run it is hella expensive and it makes sense to share it so it's all running at capacity; yes, you can run a kind-of-useful model on commodity hardware, but actually getting the best coding models etc. to run at home is still pretty expensive to do well. But if every PC comes with a chip that can run a 700B Q16 model at 100 tok/s, a lot of that cloud demand is going to dry up, especially when they start trying to charge fees that reflect their real sunk costs. Not a good time to be an investor in this stuff IMO.
-dysangel-@reddit
I was thinking the same thing for text models. I've been wondering if there will be decent money to be made on video or game generation models though. We already have decent-quality chat bots and coding bots "at home", but being able to generate movies on the fly etc. is probably still going to require a lot of horsepower and scaffolding for a while, and be something worth paying for.
Budget-Juggernaut-68@reddit
and by then we'll have cheap compute from data centres.
Icy_Foundation3534@reddit
mergers happen very fast in this market
spam_and_pythons@reddit
I trust they will absolutely do that. They have to, their costs are enormous. Granted they'd do it regardless.
mycall@reddit
They can also block countries if they were forced to.
qwerty____qwerty@reddit
What setup do you have? If you travel a lot, I'm assuming that's a laptop... and how powerful would that laptop need to be to compete with the Gemini API?
pandodev@reddit
yes privacy and offline for the right things is EVERYTHING.
Eye-m-Guilty@reddit
Would love to know what you're running offline and the setup!
p3r3lin@reddit
Same. I live in a part of the world where a stable and always-available internet connection is still not guaranteed (...Germany). So having "AI in the pocket" would be a great thing. There are projects like https://locallyai.app/ that enable small models (4b, etc.) to run on mobile phones, but it's definitely no replacement for a SOTA model of any kind.
bigh-aus@reddit
Don't forget about being forced to hand over all their past data - that's the big one.
https://openai.com/index/response-to-nyt-data-demands/
Data invariably gets leaked (even accidentally).
Ayumu-Aikawa@reddit
exactly, the ToS and pricing are things that can change any day, and we keep seeing stories about how these companies are not making any profit. It's clear to me they're going to have to change something sooner or later
lambdawaves@reddit
“Once they corner the market”
They won’t corner the market. There will always be at least a handful of competitors that offer a 98% similar product (via the same API) preventing arbitrarily high price hikes
Imaginary_Context_32@reddit
May I know your setup (hardware, models, use case)?
Kahvana@reddit
Hmmm, I don't fully agree with you on the second/third point
As for other points:
And most important of all, it's just fun!
SouthernFriedAthiest@reddit
Running the same pipeline. TTS is the one where local really pulls ahead of cloud on cost — ElevenLabs/Play.ht charge $5-30/million characters depending on tier, and the quality gap with Qwen3-TTS basically closed in the last year. Voice cloning from a 10-second sample, per-line emotion control, zero per-character cost.
The whole thing started from just wanting to give voice to my own tools, then I started exposing it as an API for friends who don't have GPUs. Built a web UI for it at tts.scrappylabs.ai if anyone wants to hear the quality before setting up their own instance.
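Back-of-envelope on that cost gap, using the cloud pricing quoted above; the rig cost and local running cost are assumptions:

```python
# Break-even characters for a local TTS box vs. per-character cloud pricing.
CLOUD_PER_MCHAR = 15.0   # mid-range of the $5-30 per 1M chars quoted above
RIG_COST = 1500.0        # dedicated GPU box for TTS, USD (assumption)
LOCAL_PER_MCHAR = 0.50   # electricity-only cost per 1M chars (assumption)

break_even = RIG_COST / (CLOUD_PER_MCHAR - LOCAL_PER_MCHAR)
print(f"rig pays for itself after ~{break_even:.0f}M characters")  # ~103M here
```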
boisheep@reddit
And you can easily get put on a list.
I think I was once put on a list after googling "man cp" and whatnot; Google was giving me warnings and I was like, what the?... I just wanted the manpages for the cp command.
Doesn't happen anymore, but that was a long time ago. Some old flagging AI, I guess.
You don't know what triggers these algorithms, you just don't.
And I don't want to be the dad that gets arrested after sending photos of his son to the doctor for checkup.
Distinct-Expression2@reddit (OP)
Thanks a lot for the point; why not go with a single GPU with more VRAM, like a 3090/4090? Some of these MoE models cannot be sharded nicely, to my understanding.
Kahvana@reddit
Honestly, no clue about the MoE models; I've only used dense models so far (very happy with Magistral 2509 Q8_0 and 32K context!)
As for why I made that choice:
Dry-Influence9@reddit
Privacy is becoming a bigger and bigger reason now that the big AI bros are looking for better ways to skin the cat. They are getting so intrusive that I'm very close to getting rid of Windows on all my systems.
05032-MendicantBias@reddit
It's to the point where Windows barely even works as an OS because of all the spyware. It's absurd that file search won't even find the file anymore...
T_UMP@reddit
For the file searching aspect, use "Everything"
https://www.voidtools.com/support/everything/
PangurBanTheCat@reddit
We know it's possible, too. I use Search Everything from voidtools and it... genuinely works incredibly well.
Windows is such a joke anymore. ...but I also hate the frequent forum diving I have to do on Linux when something doesn't work, lol.
Distinct-Expression2@reddit (OP)
Windows is a meme at this point
AlanCarrOnline@reddit
I'm bouncing between vibe-coding on Windows and playing around with Linux Mint at the moment. Windows has crashed so hard it needed reinstalling twice in 2 months.
I'm sick of it.
The final straw for me was seeing Copilot, 'Search with bing" and AI stuff all over.... Notepad.
Notepad?
Windows doesn't even copy and paste properly for me now. Often I find it's pasting something I copied earlier, not the thing I just copied. It's like 'CTRL A, CTRL C... CTRL V... WTF?'
Too busy sending my clipboard data to Microsoft to do its actual job. So now I have Mint installed on a partition of D and will move over entirely to Linux when I finish the vibe-coding project.
Constandinoskalifo@reddit
Do yourself a favor and try out Linux Mint. Welcome to the other side! 😜
HighQFilter@reddit
Yeah, my main desktop is still Win10, but everything else is Linux at this point. I won't be moving to Win11 at home. I have to put up with it at work, but when Win10 truly is done, it's going to be Linux from there on out.
05032-MendicantBias@reddit
API pricing won't be subsidized forever. At some point venture capital will want a return. Same as the short time when Uber was subsidized.
By all means, get a millionaire to subsidize your workflows, but know this is a short-term deal that won't last.
cleverusernametry@reddit
It stuns me that people are still so gullible after over a decade of SaaS and cloud. You already see people going back to on-prem from cloud because cloud pricing has become so predatory. Just wait for this arc to play out with AI APIs
mrkstu@reddit
Because VMWare is always going to be non-predatory, right?
cleverusernametry@reddit
Huh? What does VMware have to do with anything?
Distinct-Expression2@reddit (OP)
Nice point. When do you think that will happen?
protestor@reddit
OpenAI already announced the first enshittification package (ads in chat). The reason they haven't jacked up prices yet is that they need to first kill local AI: they need to maintain, in people's minds, the idea that local AI makes no sense financially, since cloud AI is so cheap. And they can't kill local AI by pushing the frontier (open-weights AI trails frontier models by only 6-8 months). They have no moat in the model itself, and all the money they pour into training only helps the Chinese clones that distill it.
So what they're currently doing is buying up enough wafers to jack up the price of machines used for local AI. I think that's what they're more worried about: consumer GPUs that cross the 32GB-ish VRAM barrier. Their play at TSMC made GPUs more expensive and pushed people toward buying more 8GB GPUs.
I think that by the time the enshittification of closed AI is complete, running local AI will be unfeasible, a distant thing of the past.
mrkstu@reddit
Depends. Apple is incentivized toward local AI and is equally important to TSMC, and leaving a market hole for AMD isn't wise.
05032-MendicantBias@reddit
I have no idea. The scheme collapses when venture capital loses patience. It could be as soon as this quarter, or it might take a few years.
I feel confident the bubble pop will be preceded by OpenAI attempting an IPO at 2 trillion dollars, when venture capital tries to offload its position onto retail.
Conscious-Ball8373@reddit
The other possibility is that a combination of hardware and software technological development and manufacturing capacity increases bring us to the point where the actual cost of AI is about what we're currently paying. That's going to shaft a lot of investors who have sunk a lot of cash into the current wave of build-outs. Let me get my really tiny violin.
Conscious-Ball8373@reddit
Nvidia doesn't make the models and doesn't manufacture the silicon. They may not get a say in it.
xXprayerwarrior69Xx@reddit
Nvidia won’t let that happen lol
SeasonNo3107@reddit
Venture capital won't collapse because they all use AI to ask what they should invest in /s
Dry-Judgment4242@reddit
AI will collapse when cars do. Anytime now we will be back to horses and carriages... Anyway.
05032-MendicantBias@reddit
I'm arguing for a dot com scenario.
The technology is incredible.
Not 2 trillion dollar incredible.
MengerianMango@reddit
Ford alone made ~9x as much revenue in 2025 as OpenAI. Their PROFIT was 2.5x OpenAI's whole ass revenue lmao.
We've over-invested in this hype cycle by like 2 or 3 orders of magnitude. Yeah, eventually AI will rule the world, but it ain't gonna be LLMs, and there will be an ungodly snapback from all the bets that it would be LLMs. The issue is that the valuations depend on "growth", but they mean "growth in earnings", not "growth in spending", which seems to be all current AI companies can grow. Sucks for them that DeepSeek et al. wrecked their bottom line and inference is as commoditized as water and electricity.
xmBQWugdxjaA@reddit
You say that as though Deepmind and OpenAI don't also have some of the best non-LLM models.
MengerianMango@reddit
Fully expect Google to win in the long run. There's nothing OAI can do that they can't copy and improve upon within a year or two (see gemini lol).
I think the smart trade is probably short NVDA and long Google. Nvidia is what all the startups are using; they're all going to get slaughtered when investor patience dries up. The issue is that, to make it past this hype cycle, over the hump to the thing that actually becomes AGI, it's going to cost 10 figures a year (maybe a quarter) in R&D. What great reason is there, as an investor, to dump your cash into the OAI incinerator when you can buy GOOG, let them fund their own development, limit your risk, and end up better than OAI eventually anyway? At current valuation, OAI makes no sense for an investor (i.e. discounted by risk and then compared relative to peers), but they need investors just to survive to end of year.
Ofc, worth saying, I'm speaking with a lot more certainty than I really have and leaving out disclaimers for sanity and brevity. Might be wrong. Who fckin knows yk. I'm just a guy on reddit.
buecker02@reddit
Someone doesn't understand how VC works even though it was clearly explained in this thread.
MikeLPU@reddit
Not far from the truth.
Mtolivepickle@reddit
This 100%
rbpri@reddit
Google just cut Gemini’s free tier in AI studio from ~100 RPD to ~20 RPD. It’s impossible to know when exactly enshittification is going to hit but it’s coming sooner rather than later.
neotorama@reddit
When they IPO
BonjaminClay@reddit
Exactly, whenever becoming profitable becomes existential for them instead of burning a pyre of money to grow.
SINdicate@reddit
18 to 24 months
Finn55@reddit
When it is least convenient
dydhaw@reddit
Wtf are you talking about? Inference is super competitive and still profitable. The inference market is about as far from a monopoly as it could be. Especially for small/open weight models that can be run locally.
Not that I'm opposed to running locally, mind you. But you can't pretend it's for future gains on inference costs. Just look at the cloud infrastructure market for reference. There are many competitive options even beyond the big 3 especially if you're not doing hyperscale enterprise b2b shit
05032-MendicantBias@reddit
I do not dispute that on a per-GPU basis you could feasibly sell H100 inference runtime cheaper than a 3090 rig at home, perhaps even profitably.
What I argue is that if you bought 200,000 GPUs and have (generously) 1,000 of them running profitably, with the rest idle, training, serving inference for free, or unplugged, then you have a recipe for burning money.
I claim the business model is nonsense, and is setting money on fire at an astonishing rate.
dydhaw@reddit
It sounds like you're mostly talking about major 'consumer' AI labs like OpenAI, Anthropic, etc., which I don't think are really relevant to this discussion, because they only sell inference for (overpriced) proprietary models that you can't run locally anyway.
The only reason inference could become more expensive is if supply growth fails to meet demand, which with chip shortages isn't that far-fetched, honestly. That's actually a solid reason to run local, prepping for the upcoming GPU apocalypse
theAndrewWiggins@reddit
Arguably this still means that you should wait it out if it's strictly an economic analysis.
AnomalyNexus@reddit
Even without subsidization, it's hard to beat the economics of centralized data centers that have scale.
Seems unlikely that it'll ever drop to a point where local wins a like-for-like shootout, regardless of what happens.
KontoOficjalneMR@reddit
I can absolutely see it dropping. Cloud is multiple times more expensive than setting up your own server; people pay for convenience and scaling.
But if you're short on cash and don't need scaling... local servers are multiple times cheaper than AWS.
AnomalyNexus@reddit
I don't think AWS is the right reference point here - they're artificially high because of corporate customers and synergies with their other products.
If I take my local 3090 and run it for an hour, that costs me 0.13 USD in electricity (UK pricing). Vast.ai charges 0.12 USD; others like Salad go as low as 0.10 USD.
So even ignoring 1) the cost of the hardware, 2) electricity for the rest of the computer, 3) the risk of the thing dying, 4) the cost of the rest of the computer, 5) the fact that I won't be using it 24/7... i.e. ignoring everything under the sun except electricity... local still loses.
It's just hard to beat bulk pricing on electricity and the high utilization you get from aggregating demand from many. Those advantages aren't going away even if they stop subsidizing.
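A quick sketch of why utilization dominates that comparison; the electricity figure is from the comment above, while the card price, lifetime, and utilization are assumptions:

```python
# Effective hourly cost of an owned 3090 once amortization and utilization
# enter the picture. Electricity figure is from the comment; the rest are
# assumptions.
ELEC_PER_HOUR = 0.13            # UK pricing, per above
CARD_COST = 700.0               # used 3090, USD (assumption)
LIFETIME_HOURS = 3 * 365 * 24   # 3-year useful life (assumption)
UTILIZATION = 0.05              # fraction of hours actually doing inference

amortized = CARD_COST / (LIFETIME_HOURS * UTILIZATION)
print(f"owned: ${ELEC_PER_HOUR + amortized:.2f}/useful hour vs ~$0.12 rented")
# A datacenter keeps the same card near 100% utilization, so its amortized
# cost per useful hour collapses; that's the aggregation advantage.
```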
johnkapolos@reddit
API pricing isn't subsidized. Subscriptions are subsidized.
bathamel@reddit
None of these companies are remotely profitable. Therefore they are all subsidized at the moment.
johnkapolos@reddit
You can be living in your parents' basement and still not sell lemonade from your stand at a loss.
bathamel@reddit
Yes, because your parents are subsidizing you.... and they are losing money.
johnkapolos@reddit
The lemonade, you sell it for market price. The price of the lemonade is not subsidized. You don't get a discount for the lemonade.
Thus, the lemonade is not subsidized. For the person buying the lemonade, it doesn't matter that your PS5 subscription is paid by your parents.
If you still can't grasp it even now, I have no more to add.
cyberdork@reddit
Both are subsidized.
johnkapolos@reddit
That would make zero sense.
cyberdork@reddit
How so? The entire business is currently subsidized by debt and the need for a continuous stream of VC money.
johnkapolos@reddit
The VC money burns on training, buying GPUs, and talent.
Once you have the GPUs, some model, and the talent, the inference cost is the cost of running the DC (i.e. paying for energy and expenses). That part is the inference cost.
The inference cost + margin is your enterprise product. OpenAI did something like $1 billion in revenue from that.
The consumer products are the funnel, and you subsidize them so that you can tell your investors you have 100 bazillion users and make it easier to raise more money.
MitsotakiShogun@reddit
Not everyone runs their APIs at a loss. Most inference providers don't, and neither does Anthropic, likely OpenAI too, and definitely not Amazon/Microsoft/Google's cloud deployments of the same models. As an example:
[screenshot of open-weight model pricing across providers on OpenRouter]
Even if everyone on the list "subsidized" their API (why would they?), Google doesn't have much reason to do that, right?
05032-MendicantBias@reddit
OpenAI burns money like there's no tomorrow; reportedly they burn anywhere between 15 and 50 billion a year, and OpenAI themselves say they won't see green until 2030. If there is a company I'm 100% certain will go bankrupt, it's OpenAI. It might be as soon as this year if venture capital loses patience.
The GPUs datacenters use become obsolete in two to three years, and because of hype and the demand spike, they cost 3X to 10X their base price. Not to mention many electricity grids simply don't have the capacity, and datacenters are running portable gas turbines at great expense.
Models are getting bigger and more expensive, and agents burn exponentially more tokens, increasing inference cost even further.
None of this is profitable, let alone sustainable.
There might be rare exceptions, like datacenters using hardware bought at a fair price, running on renewables.
Now, Microsoft and Google do have profitable businesses, so they can subsidize the money-losing AI inference. I love using their free credits! (which you can be certain come at a great loss)
It's the Silicon Valley play: subsidize until everyone else is out, then raise prices with a monopoly. They love spamming this strategy.
MitsotakiShogun@reddit
Yes, and? Did I say ANYTHING about their overall profitability? In fact, why are you so focused on 1 example? What about the other 4 companies I mentioned? And did you not see that I only spoke about API pricing not being subsidized?
Again, what are you talking about? What free credits? I gave you a screenshot from OpenRouter, no cloud credits or anything. And if you're talking about "free credits", you're too small to matter to their economics, their enterprise customers paying those API prices (even with discounts) consume a million times more.
But more importantly, why are you avoiding my main question: why would they subsidize deployment of models they didn't make, like Kimi?
05032-MendicantBias@reddit
Ultimately we'll have to wait for the post-mortem after the AI bubble pops, like the one done for the dot-com bubble. It's similar but worse economics for AI, because fiber lasts many long years, while GPUs have to pay for themselves well before their obsolescence deadline and need more ongoing resources to work.
I claim that Google and Microsoft do not have enough paying API inference customers to even break even on all the racks they bought.
And Google and Microsoft run Copilot and Gemini AI-powered search for free for their massive user bases. I claim the vast majority of AI inference users pay nothing.
seiggy@reddit
Microsoft has had record quarter after record quarter. https://www.microsoft.com/en-us/investor/earnings/FY-2026-Q1/press-release-webcast
Show me where they’re losing even a penny. Because it’s sure not in their earnings report.
ProfessionalJackals@reddit
Easy: they are not reporting their own LLM costs.
Notice how every business branch shows growth, but there's no mention of Copilot as a separate entity...
Microsoft is a special case in the sense that they earn money by selling datacenter compute to Anthropic, OpenAI, etc., who eat the losses. When that gravy train stops, it will hit the datacenter part of Microsoft too.
From the conference call:
OpenAI eats the debt; Microsoft profits as long as OpenAI keeps funding that circle.
But the actual Copilot information is not shown. Why? Because it's not profitable, and its losses are hidden in the numbers of the other services.
Like I said, the moment OpenAI and Anthropic start to falter, so do the Azure service growth and profits. Right now it's a game of keeping the balloon going as long as possible and hoping they get to profit (they will not without major price increases, which would trigger client reactions).
seiggy@reddit
It doesn't work that way at all. OpenAI's contract is for data center usage for training their models and for the OpenAI Platform (ChatGPT and their API) inference.
Copilot, Microsoft Foundry, and Azure OpenAI are not included as part of that $250B contract.
Copilot runs on MS data centers and is costed to the appropriate cost center - Microsoft 365 Commercial cloud or Microsoft 365 Consumer cloud.
Azure OpenAI inference is costed to the Azure and other cloud services cost center.
So, when OpenAI and Anthropic falter, yes, that $250B in purchased services goes away, but Copilot doesn't suddenly vanish, nor does the revenue generated by Copilot and M365, nor does the massive amount of usage of Microsoft Foundry for inference of both OpenAI and open-source models on Azure.
05032-MendicantBias@reddit
I'm paying exactly $0 to Microsoft, and I'm using Copilot a lot for code and image generation <- Microsoft is losing money on me.
seiggy@reddit
Your usage is being subsidized by the large enterprises that are paying for Copilot. That and them selling your usage data.
ProfessionalJackals@reddit
Yep... Most of the people who already wanted AI are using it already via those subsidized services. The actual market potential for growth hinges a lot on companies finding different uses for all that mass of hardware.
At some point we hit saturation and prices start to go up, which will trigger a reaction from the clientele. AI sounds great for companies if you can really replace employees with it, which can fund the bubble for a while. But if those prices keep going up, that fired employee suddenly becomes cost-efficient again vs. the LLMs.
dmter@reddit
Why would GPUs become obsolete? Even the 3090 is still useful, and it's from 2020 or so. Nvidia has an interest in keeping those datacenter cards relevant, so they can keep supporting them and even halt progress in new product development if they don't want prices of old hardware to plummet, which could bankrupt companies they invested in.
05032-MendicantBias@reddit
Datacenters can't run anything but the most efficient GPUs, because the dominant cost is electricity and cooling.
ross_st@reddit
I think for Google, it's all about ecosystem ownership rather than profitable inference. Did you see how long they were willing to run YouTube at a loss? Also, they release Gemma under Apache rather than the commercial royalty licensing model Llama uses, which is also a clear ecosystem ownership play.
They can't afford to keep the free tier as generous as it currently is, but I think they'd be willing to run Gemini at a slight loss indefinitely if that's what it takes to maintain ecosystem ownership.
Sydorovich@reddit
Not just raise prices, but significantly degrade quality AND sell your data for ad revenue AND sell your prompts to local governments for a fee so they can get you into prison for wrongthink.
pip25hu@reddit
We should not mix up two things. Inference providers running open-weight models can definitely make a profit, no doubt about it. But companies like OpenAI and Anthropic are also hoping to recoup their costs for training these models, and no, they're not even close to breaking even.
MitsotakiShogun@reddit
I didn't say anything about training, I simply commented on your first sentence.
Recap: OP was talking about falling API costs, you said API costs are subsidized, I said they aren't.
Trotskyist@reddit
Nobody will continue paying for their models if they don't keep up in the rat race. They can't just sit on their laurels with their current models and wait to recoup costs or their userbase will dry up entirely.
If VC were to dry up today for any of these providers, they would have to bake training costs into the inference pricing. It is effectively as much a necessary cost as is electricity.
johnkapolos@reddit
You are being downvoted for being right.
Steus_au@reddit
this is LocalLLaMA rules )
MitsotakiShogun@reddit
Downvotes without comments stopped mattering to me a while ago, because that's just Reddit being Reddit :)
No_Afternoon_4260@reddit
Vertex providing Kimi, that's a new one.
iotsov@reddit
Oh sweet summer child.
MitsotakiShogun@reddit
So are all these providers losing money serving a model they didn't make, including Google?
corruptboomerang@reddit
This. It's the same as when cloud & SaaS were becoming a thing: they were price-competitive, even underpriced, to lure customers in, knowing that once you move, once you give up your own capabilities, it's very difficult to rebuild them.
thisdude415@reddit
API unit pricing is profitable at all the major AI labs (Google, Anthropic, OpenAI, AWS).
Need proof? AWS serves Anthropic models over its Bedrock platform. No way in hell is AWS subsidizing inference at scale.
Neither AWS nor Anthropic is subsidizing those tokens, and the pricing matches Anthropic's direct pricing.
ProfessionalJackals@reddit
A smart person uses those subsidized subscription models like Copilot as a resource for as long as possible, and then uses the money saved to get a better offline solution.
I mean, you can do a lot of work with Copilot for like $100 per year; $390 for those with a ton of work.
For offline LLMs you're looking at a few grand just to get started, for models that are often closer to the free/$0.33 tiers of the subsidized solutions.
As long as you know that one day the gravy train will stop, and you plan for it, you can maximize your time.
Over time, better / lighter local models will get developed, hardware will get better at LLM "stuff", and prices will get cheaper.
Yes, the current insane AI memory price issue is a dampener on this, but it will not last forever. There is a point of saturation where the amount of clientele willing to pay for these subsidized services starts to thin out. Then a cycle of increasing prices will drive the mass datacenter market down, fewer orders, and eventually the consumer market will pick up again, but now with years of extra development / knowledge.
cniinc@reddit
"The goal of venture capital is to make everything else go away, so they get a monopoly and raise prices" - the best succinct description of venture capital. Understand that the second you become dependent on the external model they'll squeeze you for as much as they can on that investment. While the model is cheap, learn how to do it without the API, and have the API do the parts you can't. Then slowly increase you skills until you can do it all without API. Make them subsidize your learning, not your dependence.
customgenitalia@reddit
Nailed it. Running local is the long game, the skills you learn will start to pay off when the VC money runs out. There is so much compute potential sitting idle, I think you’ll soon start to see creative ways to leverage this as nodes in a distributed AI fabric of sorts, think SETI but for ASI.
BonjaminClay@reddit
Exactly this. Enshittification comes for everything at this point and if they are giving something away or it is unrealistically cheap now then you shouldn't rely on it. I have learned this lesson too many times. I stopped buying physical media or maintaining my own copies for a long time because streaming was just easier and now there are 20 streaming services all wanting 5x more per month each.
Building with local or on something I control the costs of means that when the AI bubble pops my stuff won't break or get unpredictably more expensive.
mumBa_@reddit
I understand what you're saying, but the amount of competition basically means that as long as no model wins (price-performance wise), they will have to keep prices low to compete, which will definitely be the case for the upcoming decade+. All these companies also get free training data, so it's a two-way street: they need users for their data, but if your model is too expensive, no one will use it and they'll go to the cheaper competitor. It's a race to the bottom, but once you're there, you need to stay there or someone else takes your market position. So yeah, unless a monopoly appears (Google, probably), realistically it will only get cheaper with time.
rditorx@reddit
Besides venture capital: if you dry out the competition, and that includes locally run AI, you gain control over the market and can ask almost any price, as long as people and companies can afford it.
Anyusername7294@reddit
It's generally agreed that inference isn't subsidized
ravage382@reddit
To start turning a profit, all the big AI companies are going to have to really increase monthly rates. I imagine that not too far in the future, prices are going to go way up.
I have come to rely on AI enough that I want to ensure my continued access to it, regardless of what the big companies do. I will be pretty happy with just gpt-oss-120b and a few of the smaller Qwen and GLM models going into the future. Knowledge cutoffs are annoying, but MCP tools with access to web search mostly mitigate that.
Glum-Traffic-7203@reddit
For me it’s privacy and customisation as the biggest to
When it comes to rate limits and costs - there are specialist high volume low cost providers like doubleword who are even cheaper than APIs like Kiki
Which-Jello9157@reddit
Same question here. The replies above are convincing, but realistically I'm sticking with cloud APIs for now since it's easy and cheap. Models keep getting bigger, and whatever hardware you buy today probably won't handle next-gen stuff anyway. Do you have any recommended third-party API providers without RPM limits?
PassagePlus3777@reddit
Openrouter.ai is the obvious choice for model variety since they aggregate multiple providers, but sometimes you get rate-limited even on the paid tier. Pro tip though: OR supports BYOK (bring your own key), so you can route to whichever upstream is cheapest for that specific model. I use atlascloud.ai for Kimi K2.5 since they're the cheapest and stable, and Phala for GLM 4.7 Flash because they're stable on it. Way cheaper if you're running volume.
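For anyone new to this: OpenRouter exposes an OpenAI-compatible endpoint, so switching upstreams is mostly a base_url and model-string change. A minimal sketch; the model ID below is an example, check the current catalog before relying on it:

```python
# OpenRouter speaks the OpenAI-compatible API, so provider switching is
# mostly a base_url + model-string change. Model ID below is an example;
# check the live catalog before relying on it.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter key
)

resp = client.chat.completions.create(
    model="moonshotai/kimi-k2",  # example ID, verify against the catalog
    messages=[{"role": "user", "content": "hello"}],
)
print(resp.choices[0].message.content)
```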
Euphoric_Emotion5397@reddit
There are some things you'd rather not do online.
Humans value their privacy.
So, a lot of things can be done online, but some things are still kept offline.
It's not a mutually exclusive situation for most people. It's a complementary solution (use the GPU for one use case, use the online API for another).
mr_zerolith@reddit
I'll add to your list.
- speed/intelligence tradeoffs are controlled by me.
- I have no idea what quantization etc. the model provider is using, and they could be intermittently "cutting the product", which leads to inconsistent results that can be time-consuming.
- when the service is down, I can get it back up with a reboot instead of waiting a random amount of time.
- I am not directly helping fund a company that is involved in enormous copyright infringement or bribing, sorry, I mean lobbying, the government against the interests of competitors
Financial-Source7453@reddit
Abliterated models. Tired of hearing "I am sorry, I can't do that due to policy restrictions" from ChatGPT all the time.
Vahn84@reddit
I simply do not want a future where I'm not in control of my things. Corporations are forcing us into a future where we probably won't have shit in our hands... and we will be forced to rent everything. I don't want a life on subscription services.
Lan_BobPage@reddit
Why have a well when you can just buy a bottle of water at the store? Why have solar panels when you can just pay for electricity? Why own a house when you can just rent a flat? Why own movies when you can just pay for a subscription service?
The answer is always the same: I want to own what I have and don't want to be a slave. Any of these commodities could be taken away by others, at any point, for any reason.
Firm-Fix-5946@reddit
do you have a well instead of depending on tap water? do you have solar panels instead of depending on electricity? surely if you do you can still understand why the vast majority of human beings don't find those tradeoffs worth it and don't do those things?
what a crazy analogy
Lan_BobPage@reddit
So spending 20k on hardware isn't good enough to be considered an LLM enthusiast. Got it. God forbid wanting to be independent in some aspects of one's life; I guess owning nothing really does make you feel happier, huh. Not sure why you seem to be seething this much, I guess I struck a nerve. I hope you find some peace.
Firm-Fix-5946@reddit
what? what are you talking about? did you really think something about my post suggests you should somehow spend more? dude. touch grass.
AlexMillsDev@reddit
Imagine calling everyone who has a different opinion a schizophrenic. What a crazy (and offensive) thing to say. Is this really the state of online discourse in 2026?
PeteInBrissie@reddit
OK, genuine question, because I stopped trying to run locally 3 months ago and I'm likely out of the loop. Can ANYTHING you can run locally on 32GB even remotely compare to Sonnet 4.5, let alone the latest Opus?
Lan_BobPage@reddit
32 GB of what? System RAM? No. GPU? No. Kimi K2.5 seems to be comparable though; it just came out. Some claim it even beats Opus. But if you wanna run it, I suggest you start saving up for a few 6000 Pros.
pixelpoet_nz@reddit
There's an ollama guy in another thread working on getting it running on 2x Strix Halo, which I happen to have; together they cost about half of a single RTX 6000 Pro.
Lan_BobPage@reddit
No thanks. I'd rather run my models at near full precision with decent context and zero hassle.
pixelpoet_nz@reddit
In the end it's about how well it works, no? I think there's a good chance that in the coming year there will be some really good local coding models around 256 GB, and 2x Strix Halo seems pretty power- and cost-efficient to me. I use one as my daily driver and it's amazingly fast.
Lan_BobPage@reddit
Good for you man. Hopefully next year Kimi-like models will be in the 200b - 400b ballpark so we can all run them just fine.
evia89@reddit
2 x 3090 and then you stretch it. It will be like GLM 4.7 Flash: definitely useful, but not for everyone.
Distinct-Expression2@reddit (OP)
Fair point and very good angle - thanks man!
Nepherpitu@reddit
For example, these nice cloud providers decided not to take money from some people just because of their nation. Pretty racist, right?
AlexMillsDev@reddit
I don’t hate North Korean people but I don’t want my cloud provider taking money from the North Korean government.
Nation is not race. Stop with the culture war brain rot please.
AriyaSavaka@reddit
You don't really own anything if you're not living in an anarchist society. The government can just decide to fuck you up with impunity, just like the ICE agents you see lately; don't think it won't apply to you. The house, the car, everything that you think you own can still be taken away easily, if the ruling class decides so.
Lan_BobPage@reddit
LMAO ok commie, slow down. I love me some ICE on my tea.
LocalLLaMA-ModTeam@reddit
r/LocalLLaMA does not allow hate
TubbyKing@reddit
Gross
NoahFect@reddit
Anarchism: The strong do what they want, the weak do what they must
Communism: The strong do what they want, the weak do what they must
Capitalism: The strong do what they want, the weak do what they must
Socialism: The strong do what they want, the weak do what they must
. . .
Antidisestablishmentarianism: The strong do what they want, the weak do what they must
Background-Ad-5398@reddit
"anarchist society" this was called the warring states period and its what happens to that kind of society, and its much worse then anything we have now
Creepy_Stable_9171@reddit
DUMB WAYS TO DIEE IN AMERICA DUMB WAYS TO DIE
SpicyWangz@reddit
Ah yes, because I don’t have control over one aspect of my life the best solution is to hand over control to every other aspect of my life.
That’s a horrible way to live.
HopefulMaximum0@reddit
The API prices are low because they decided to operate at a loss until they win. Then they will crank the price to the moon and do whatever suits their current whim to you, because you can't go elsewhere.
The plan is simple, and as old as anti-dumping laws.
Thrumpwart@reddit
TWO CHATS AT THE SAME TIME.
LM Studio new feature is awesome.
Maximum-Wishbone5616@reddit
In my experience the SLA is non-existent for most AI services.
We require at least 99.99% (that is the bare minimum); we monitor all our servers, instances and services every 30 seconds, and we aim for at least 99.9999% throughout.
Due to the sheer number of requests we process per minute, 99.9% would mean 3rd-party AI 503s causing havoc for our customers/monitoring systems.
They are not even close to 95% with 30-second monitoring...
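For context, the downtime budgets behind those nines are straight arithmetic:

```python
# Allowed downtime per year at each availability target.
MINUTES_PER_YEAR = 365 * 24 * 60

for target in (0.999, 0.9999, 0.999999):
    downtime = MINUTES_PER_YEAR * (1 - target)
    print(f"{target:.4%} uptime -> {downtime:,.1f} minutes/year allowed")
```

That's roughly 526 minutes a year at three nines, 53 at four nines, and about half a minute at six.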
SlimPerceptions@reddit
Even on point #1: encrypted files and rented GPUs? I haven't done it, but it seems like there are adequate privacy solutions out there, even for the cloud.
AlwaysLateToThaParty@reddit
Having a 96GB GPU with 1.7TB/s for gaming.
Photochromism@reddit
Lol. Dumb take. The minute one of these platforms becomes the favorite, becomes a necessity, and wipes out the others, the prices will quadruple.
TruckAmbitious3049@reddit
For me it's rate limits.
For data analysis, I need to do a lot of labeling. Paid Gemini and ChatGPT would hit rate limits.
For transcribing, Sonix is amazing and cheap. But if it's a large batch, then Whisper is still better.
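A minimal local batch-transcription loop with the open-source `openai-whisper` package; the folder path and model size are placeholders:

```python
# Batch-transcribe a folder locally with openai-whisper
# (pip install openai-whisper). Paths and model size are placeholders.
import pathlib
import whisper

model = whisper.load_model("medium")  # smaller = faster, larger = more accurate

for audio in sorted(pathlib.Path("recordings").glob("*.mp3")):
    result = model.transcribe(str(audio))
    audio.with_suffix(".txt").write_text(result["text"])
    print(f"done: {audio.name}")
```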
Big_River_@reddit
all these prognosticators claiming models are commodities are investor bros who have a very confident surface-level understanding of technology and macroeconomics - applying traditional business school analytics on technology adoption and artificial intelligence as a service like SaaS on steroids margins or corporations are going to just train their own model and cut vendor and labor cost - just like implementing a new management information system with autonomous agents as admins and analysts and developers...on their own local data center that the model manages itself with human in the loop hardware maintenance until robotics improves fine motor skills ...and all of this is a 9 month implementation that will become relatively plug and play fine tuned specialized model that can rewrite and optimize the code base, product firmware, optimize manufacturing processes, define the kpi that drive results, write up the marketing strategy and product design, 24/7 live dashboards on every deep dive imaginable for the c-suite goons to...nevermind all along frontier labs are racing with govt labs to produce the singular super intelligence that unlocks unimaginable world model exploits that change our fundamental understanding of what intelligence is capable of
open source + home lab represents the counterweight to corporate cloud "all your base (models) are belong to us ( and your data too )" and you can rent them for cheap now but once local compute becomes too expensive for joe six pack to build out in his garage - the cloud becomes the only game in town to get that coevo level up intelligence and truly generative creative extension of minds - there is a fundamental divide between the compute you own and the compute you rent - your data, your painfully fine tuned model that all your agents and business/creative / personal growth process depends upon...lease your work forever to market whims...cloud compute prices are so low to hook as many as they can while local compute components are skyrocketing - say goodbye to the capable personal computer - the next commercial electronics is cloud optimized priority connectivity for proprietary cloud ai managed computing resources - the user will have a great experience but it will always be renting the ability to functionally participate in the economy - when one org takes the super intelligence to monopolize compute as everyday iphone level (comprehensive information ingress paid promo filter and algorithm ranked content engagement device) service is vertically integrated.....well the individuals who sit on the board and the majority shareholders of this company that owns an overwhelming majority of compute ascend to a level of control and influence on the learning and reasoning of humanity that reduces their impact on the collective narrative to NPCs going through time in limited awareness incurious and relatively happy and cared for like a collection of objects or pets or puppets that are fun to play with until they break or shoot each other dead in the streets to amuse or shock....
local compute ai clusters and distributed data owned by and maintained for the benefit of small communities and collectives is the only counterbalance to the commercial interests of capital markets and investors demand for unsustainable reality bending returns that own and control the narrative for some fucking reason in this shard of existence
get local stacked functional compute or you are not playing the game -- you are getting played, getting hooked on the cloud
Vivarevo@reddit
Pricing is marketing, it will change on short notice
x0xxin@reddit
It's also a super fun and practical hobby. I've learned so much about self hosting and "cloud tech" via all of my labbing.
Chilidawg@reddit
There's the principle of ownership over renting. I know we're losing that battle a little more every day, but still...
No matter how cheap something is, there's an immeasurable gap between "free" and "paid". I understand that my electricity bill is likely more expensive than the API. However, that's grouped with utilities that are already expensive before running local models.
There's the novelty of verbally talking to your computer. There's no black box API promising you're talking to a model as opposed to someone in New Delhi. You can run the script and listen to your GPU fans speed up.
Finally, I also don't pay Sam or Elon on principle.
Spanky2k@reddit
Privacy is everything. We use a local model when working on business sensitive stuff although our model choices are currently somewhat limited (Mac Studio 64GB). I'm hoping the eventual M5 or M6 Ultra based Mac Studios will have improved prompt processing enough so that it's possible to host a good sized (say 200-300B) model with multi user access (5-10 total users but realistically no more than 1 or 2 ever submitting queries at the same time) with reasonable performance. Something like that for about £10k would be perfect for a truly local and data secure system.
But lately I've been playing around loads with Qwen Image Edit and Hunyuan Video in ComfyUI on my 5090. I've been having a blast feeding them family photos and reimagining them in different styles, changing outfits, animating old family photos. I wouldn't have ever felt comfortable uploading that stuff to a cloud based service.
Adventurous_Push6483@reddit
There are two differences I see: Some people use locally hosted models for personal use, and some people need to bootstrap ML to a product.
In terms of product:
First of all, all the models available over the API have extremely strict safety guardrails. This affects some of my experiments (which can be product tests); even though I'm not generating NSFW content, I've found the censored models performing far worse on some of the specific personalization tasks I test with (they strangely tend to be less creative? I don't have any formal benchmarks for this, so take that with a grain of salt).
Self-hosting is also much "safer" in the sense that there's a floor on what you can lose. If you build a public-facing demo application of a product and haven't bothered to secure it yet since it's an early-stage PoC, you won't run into strange issues with rate spamming (exploding API costs) and whatnot. Yes, this is terrible practice, but sometimes I just want to share my application with peers, and it's a lot easier to throw a Streamlit site over LAN and not worry about security (at worst, the app just crashes).
I do mostly use the Gemini API, but the rate limits are certainly an issue with the free tier. The better alternative is just to use the paid tier if you have the money; you can get a surprising amount of research/experimental work done with just $10 worth of credit.
Technically speaking, it is "free" for me. I just use my group's server (pretty beefy hardware) to host locally when I need to, so I'm not actually paying anything for the hardware. If the place you work at (or your PC) just happens to have compute for training/HPC work, you might as well use it since it's already there.
Millions of tokens is not as hard to hit as you think, especially if you work with image data going to a VLM, which can easily cost thousands of tokens per item. I work with massive amounts of data in many media formats; data processing with an LLM API is very expensive, so the break-even argument is not very compelling (this still runs through the API, however; expect big bills far greater than GPUs).
I think the nicer thing about the API is just how easy it is to set up. Buying hardware takes so much time when I can just rent it on the cloud OR just use some money and use the API (download API key => download package => use AI).
The more interesting argument has always been local GPU vs cloud GPU, I think. APIs are just so limited in their scope and what they can do, but they are so convenient that there isn't a good reason not to use them if you have funding/VC money (with the exception of needing something more specific than a generic language model, which is not many applications).
lgdsf@reddit
Run jailbroken models for sure
Far-Low-4705@reddit
It's fun
k_means_clusterfuck@reddit
"you will know nothing and be happy"
how about no
pixelpoet_nz@reddit
agree and lol great username :D
celsowm@reddit
Kos187@reddit
You can spend 50M input tokens over a weekend with Claude Code easily. It's easy to justify a local LLM on hardware you use for something else as well, like gaming... but after a certain amount of VRAM it makes no sense, especially now that SLI is dead.
mystery_biscotti@reddit
For funsies.
Also, I'm learning how to do tiny infra on a micro scale, so I'll be in a better spot to become employed keeping AI upright.
Plus if certain US states are banning specific roles AI fills, local will be the only way to go. (Tennessee, I'm looking your way.)
Vicar_of_Wibbly@reddit
The question is inverted. Why would I use the cloud? It’s slower, has no meaningful quality improvement over my local setup, imposes restrictions on use, and can change without notice or approval.
My local system is standardized, fully under my control, is backed up properly and - as you very importantly point out - is private.
The cloud has no compelling use case for me whatsoever.
deparko@reddit
Well, I've been dealing with the same issue and have concluded a hybrid approach works best. I use a three-tier model: a small offline LLM (Ollama) on my local 5070 Ti GPU for local tasks; Ollama Cloud as tier two for bulk processing, where I can use Kimi, DeepSeek, etc. for a flat rate (about $20 a month, $240 a year), which is much cheaper than upgrading my GPU; and frontier models for deep reasoning when needed.
I've designed my RAG and AI-native apps to operate within that three-tier framework.
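The routing itself is nothing fancy. A minimal sketch of that kind of tiering, assuming Ollama's OpenAI-compatible endpoint; the model names, tier labels, and frontier client are placeholders, not my exact setup:

```python
# Hypothetical three-tier router. Endpoints, model names, and the routing
# rule are illustrative; swap in whatever your tiers actually are.
from openai import OpenAI

local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # Ollama serves an OpenAI-style API
frontier = OpenAI()  # whichever paid frontier API you use

def route(tier: str, prompt: str) -> str:
    if tier == "local":      # tier 1: small offline model for private tasks
        client, model = local, "llama3.1:8b"
    elif tier == "bulk":     # tier 2: flat-rate cloud open-weights
        client, model = local, "kimi-k2-cloud"   # placeholder name for a cloud-backed model
    else:                    # tier 3: frontier model for deep reasoning
        client, model = frontier, "frontier-model"  # placeholder
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```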
flywind008@reddit
You pay for your privacy, that's fair enough.
SnooBananas5215@reddit
Maybe for the power user. For a dogshit coder like me, offline models don't really work; instead I focus on creating simple UIs with lots of automation and flexibility (not built for scale, though). Most of the time I'm not working on anything heavy anyway, and I think just automating the stuff small and medium businesses rely on for version control, QMS, and authorized form completion is all they need.
Outrageous-Tonight75@reddit
I think it's similar to having a NAS instead of paying for Netflix. A mix of privacy, control, and the "DIY feeling" that makes using it special.
__Captain_Autismo__@reddit
Reliability for local is unsurpassed. No black box
Sufficient-Pause9765@reddit
Why not both?
I'm using local for PDF processing and data extraction, and Opus for analysis. Very solid.
Plopdopdoop@reddit
Is Gemini back to having a massive free tier... aside from the 2.5 Flash-Lite variant?
Last I checked, they removed essentially all free access except that Flash-Lite model.
LocalLLMHobbyist@reddit
One thing I'd add from the budget side: the math works differently when you're not buying new.
I grabbed a used 3090 for $750. That's 24GB VRAM — enough to run 70B models quantized. At current API rates, sure, that's a lot of tokens. But I'm not optimizing for tokens-per-dollar.
I'm optimizing for:
- Zero friction experimentation (no rate limits, no "please try again later")
- Models that don't refuse half my prompts
- Learning how this stuff actually works
- Something that works when my internet doesn't
The "millions of tokens to break even" math assumes you're just chatting. When you're building agents, running batch jobs, or just tinkering for hours — local stops feeling expensive real fast.
Plus honestly? It's fun. Not everything needs to be pure ROI.
Night_Spectre@reddit
Because of this: https://people.com/some-chatgpt-questions-are-getting-people-arrested-police-say-11830106
Of course, people will say it's good—for safety and security, etc. Today it may seem fine, but things can change very quickly. Next time you search for "problematic" information—say, about Trump and his connections to Epstein, or if you live in China and ask about the massacre at Tiananmen Square, or in Russia you look up details about the "three-day special operation"—and the police knock on your door, you'll have your answer.
"Those who would give up essential liberty to purchase a little temporary safety deserve neither liberty nor safety." That quote fits our situation perfectly. Maybe these are big words, but I don't want to give my government any hooks on me for the future. You have to assume that governments—even if they don't say it publicly—have access to this data.
I live in Poland. When I was born, communism still existed here. I know what that system does to people. Yes, the system changed and now we're "free," but freedom is not something you have forever. If you give up personal freedom to the government, it can disappear very fast. There are already examples—like in the UK, where people have had police visits over tweets.
There's also another risk: someone could hack an API and leak sensitive data. Maybe I'm paranoid—but it's better to be paranoid and safe.
ZakOzbourne@reddit
Because no company wants to generate the stuff I want it to generate, I need uncensored unhinged models
IactaAleaEst2021@reddit
For my work, repeatability of results. When you download a model, you audit it and start trusting it; you're sure the vendor doesn't change its behavior behind the scenes. I'm not saying they do it for malicious purposes, but in many cases they improve their product in some direction while making it less useful in others.
notRandomUsr@reddit
This happened to us. We were working with disease data, gathering lots of information on different organisms using GPT-3.5 Turbo. After a couple of months we tried the new model (GPT-4 and its variants) and got terrible results. The difference was so dramatic that we went back to the previous model, pinned to the exact same version, and the results were completely different: basically nonexistent. It was curious to see how, with the same prompts, the same system instructions, the same format, and the same questions, we got scarce data or empty responses compared to what we achieved in the first run.
SpicyWangz@reddit
I haven’t encountered this with any API product. Consumer facing chat interfaces are going to continually evolve, but if you’re using a tagged model on an API it shouldn’t be changing
Double_Cause4609@reddit
You'd think, but providers tend to quantize their models over time, and sometimes without labeling. It's not really a "finetune" or change, exactly, but it can be noticeable, particularly for function calls.
Eugr@reddit
They can still serve a more heavily quantized version to handle more customers during peak hours, change their guardrails, etc.
ross_st@reddit
Google recently changed the way the Gemini 3 API works behind the scenes by adding the cutoff date and an encouragement to use its chain of thought to the system instruction. That gets added even if you leave the system instruction parameter blank in the API call. Before, if you left it blank, there would just not be a system instruction block in the context window.
SpicyWangz@reddit
Weren’t all the Gemini 3 models -preview models? I wouldn’t trust something with a preview suffix on it to remain stable.
Flamenverfer@reddit
Two come to mind immediately. I can't remember exact model names from Amazon Bedrock, but the version of Sonnet they serve there was extremely token-conservative when we needed it to finish a response in a consistent format. The model would always take the lazy way out and say something like "All other data shall be labelled N/A" when it needed to list every datapoint as N/A in the JSON file. Also, let's not forget GPT taking away 4 when they released 5.
Significant-Heat826@reddit
That's weird because I often get emails from vendors saying they've changed something in their API endpoint yet again.
TheRealMasonMac@reddit
Gemini, GPT, and Claude often have undisclosed model updates.
Imaginary_Context_32@reddit
I faced this with GPT-4 Turbo (not with a specific pinned model).
jikilan_@reddit
You will be forced to upgrade when they decommission the version of the model you are using.
IactaAleaEst2021@reddit
You're right, but still, if you develop a product based on consistent results, the "should not change" has to become "must not change".
57hz@reddit
Also, you can rent servers (by month or by second) to run local models. So there’s the benefit of not being tied to hardware while having privacy and consistency.
ErokOverflow@reddit
Listen to this: good taste in programming and image creation doesn't always come from wealthy people who can buy a good hardware configuration. High prices: that IS the real justification.
Michaeli_Starky@reddit
Local ones were always vastly inferior from the cost perspective, speed perspective and overall performance.
Torodaddy@reddit
You don't need the insane rigs you see here to run a local model; the AMD-based Minisforum devices work fine for inference, and in that case I can have agents working 24/7. Even the cheapest API starts getting expensive when used like that.
a_library_socialist@reddit
They're running at losses last time I checked, so I wouldn't expect those prices to continue for too long.
mambo_cosmo_@reddit
the hardware is free because I bought a gaming PC to play games and now I can run local LLMs to do useful stuff too
IulianHI@reddit
Another angle nobody mentions: experiment control. With local you can mess around with system prompts, try different quantization levels, and actually understand how the model behaves. APIs give you this nice packaged experience but you're at the mercy of whatever defaults they set. Sometimes that "15 tok/s" local run with a quantized model gives you better results for your specific use case than the shiny hosted version with perfect throughput.
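To make it concrete: against a local OpenAI-compatible server (llama.cpp's llama-server and Ollama both expose one), every variable is pinned by you. A minimal sketch; the port, model name, and prompts are placeholders:

```python
# Sketch: pinning the system prompt, sampling, and exact quant yourself
# against a local OpenAI-compatible server. All names here are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")  # local server, no real key needed

resp = client.chat.completions.create(
    model="qwen2.5-32b-q4_k_m",  # you know exactly which model and quant this is
    messages=[
        {"role": "system", "content": "You are a terse code reviewer."},  # your prompt, not a vendor's
        {"role": "user", "content": "Review this diff: ..."},
    ],
    temperature=0.2,  # sampling fully under your control
    seed=42,          # reproducible runs, useful for prompt A/B tests
)
print(resp.choices[0].message.content)
```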
w8cycle@reddit
Learning, control, privacy, experiments, and cost is still a huge factor for me because I am on a very tight budget.
ortegaalfredo@reddit
Privacy is a huge one. Everything you write is shared with private companies and the government, and they build a profile of you. That's why Palantir is able to know so much about you, your fears, your psychological profile; they can massively influence a population with a database like this. If you don't care about things like this, it's OK.
What I think is important is artificial limitations on the model. I'm not talking about silly stuff like porn and WMD, but things like cyber and education: you don't know if the model is lowering performance in some areas, or injecting subtle manipulations on you or your children.
Also, the limitations are not only problematic in themselves, but in the fact that they change. They add and remove stuff from LLMs all the time, and if your business depends on it and suddenly your agent stops working, bad luck. You cannot go back to the older model.
This doesn't happen if you host locally.
caetydid@reddit
- learning about stuff (it is fun)
- persistence (as in future reproducibility and maintenance)
- autonomy (no dependence of reliability of external services)
Same-Platform-9793@reddit
It's for resilience and times-of-war scenarios.
SkiBikeDad@reddit
For me I'm using frontier models for code and design (via cli agents and chat), but 24/7 use cases like NVR and low-latency use cases like tab-completion are better local. Another niche use case: generating icons and other product images, where I want to output 1000s of iterations to whittle down the right input prompt strategy and test lots of seeds before hand-refining.
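For the icon sweeps, a plain seed loop over a local diffusion pipeline is enough. A rough sketch with diffusers; the model ID and prompt are placeholders, not my actual pipeline:

```python
# Seed-sweep sketch for iterating on icon prompts locally.
# Model ID and prompt are illustrative placeholders.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "flat vector icon of a paper plane, centered, white background"
for seed in range(1000):  # whittle down the right prompt/seed combos afterwards
    image = pipe(prompt, generator=torch.Generator("cuda").manual_seed(seed)).images[0]
    image.save(f"icon_{seed:04d}.png")
```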
whyyoudidit@reddit
The only reason I wanted to go local was to run uncensored models for grey-hat stuff, but even that is possible via any cloud solution for like $0.20 an hour, so honestly there's no point running the models locally. I only download them and run them in the cloud. I saved like $5K in hardware costs at least.
Geminatorr@reddit
Owning the means of production (production of tokens) is how you win capitalism.
Rentoids get exploited in the long run
Savantskie1@reddit
Because it is free for me. The usefulness of having a chat companion when you're disabled and alone is vastly worth it. I know I've more than passed 1 million tokens since I got everything set up; probably in the millions by now. I'm not building my AI assistant to be a yes-man, so to speak. We have arguments, we have disagreements. They're respectful on both sides. But I know it's not a person, I know it's not a being with emotion. That doesn't mean I can be a disrespectful bastard, though.
GrayDonkey@reddit
Been getting "Gemini is at capacity" errors all morning....
If not that, it's "Gemini 3 is busy, answering with 2.5," followed by "Gemini 2.5 is busy, answering with 2.0," followed by some not-great output.
Impressive_Banana543@reddit
Privacy before all.
The next argument is speculative: I think current pricing is not sustainable and its purpose is to grow the user base. Once penetration into user workflows is high enough, prices will rise and the availability of free models will be sharply reduced.
sephiroth_pradah@reddit
I have Qwen3-VL constantly watching and analyzing the streams of 10 cameras. That would cost a kidney per month on any API.
Distinct-Expression2@reddit (OP)
That is actually a good use case, basically almost edge compute.
OlivencaENossa@reddit
all those ais in the cloud are going to get censored one day.
Ke0@reddit
I mean privacy is a pretty big deal no?
Albedo101@reddit
TIME. Local doesn't mean just local in place, but also in time. Your setup is yours today, it will be yours tomorrow, and the day after tomorrow, and five years after... and so on. It's PREDICTABLE, and predictable is good.
Cloud? Who the fuck knows.
Murder_Teddy_Bear@reddit
Porn. I like making porn, ok? Make porn in SD Forge, bring it into ltc-2, bam, animated porn.
Porn.
GeneralWoundwort@reddit
The only honest person in this thread haha.
Jack2102@reddit
The urge for these companies to die
IulianHI@reddit
One thing not mentioned: running local forces you to actually understand what's happening under the hood. When you tweak quantization settings or swap backends, you learn way more about these models than just hitting an API ever will. That knowledge pays dividends when you actually need to debug or optimize something serious.
dkeiz@reddit
DeepSeek is not free, it's cheap. But when you want not just chat but an actual job done, it takes a lot.
Argument 4: consistency. API models exist now but may disappear tomorrow. They could do the job yesterday and fail today. You can't control it.
If you build any proper tool around an LLM or inference, you want to test it against at least one stable model.
drwebb@reddit
DeepSeek is pretty close to free though, I went through over 1B tokens a month ago and it was like $60. It seems close to the electricity costs to run a rig capable of DeepSeek v3.2 with some bad napkin math.
dkeiz@reddit
Well, I ran up almost $10 in a day of tasks, so. Maybe your caching was better than mine, but still.
Glad-Audience9131@reddit
I checked today... it's $19 monthly for me... what's going on???
Cthulhus-Tailor@reddit
Privacy is important.
Ownership is important, otherwise you’re counting on market conditions and someone else’s business model to determine what you can do. That’s not freedom.
I personally use my PC for many other things than just AI, and have no interest in renting one from Jeff Bezos.
You put too much faith in things outside your control.
Maddog0057@reddit
Same price, you're just not paying in money, you're paying in data.
detroitmatt@reddit
Uncensored models
LeRobber@reddit
They can't take it away.
Different type of expense on the balance sheet/P&L.
You want extra heating in your office/data center.
You can do more powerful things on a trusted machine.
If AI goes wild out in the world, it won't on your machine, your model will be too dumb to do that stupid thing.
Rich_Artist_8327@reddit
You can also ask: what's the use case for providing almost-free APIs when every request consumes a huge amount of energy?
prakersh@reddit
The "approaching Claude" claims are valid this time imo, but the caveat is token efficiency. It is right that these models can be more verbose, so the 30x price difference shrinks when you factor in actual token usage. That said, for agentic/tool calling specifically, MiMo V2 Flash and K2.5 are genuinely competitive. I've been routing easy tasks to these APIs and keeping Claude for the complex multi-step stuff where it really shines. The cost savings on bulk workloads add up fast. The real shift isn't "open source = Claude killer" - it's that you now have legit options for hybrid setups. Use cheap APIs for 80% of tasks, premium for the 20% that actually needs it. Wrote up a detailed comparison here if anyone wants the full breakdown on pricing/benchmarks - https://onllm.dev/blog/2-mimo-v2-flash-kimi-k25-democratizing
Irisi11111@reddit
Indexing is the most consequential use case I can think of for local AI. I also hope browser use and vision-centric document retrieval will be the next focus.
Bbmin7b5@reddit
beyond privacy, not much. but Privacy is THE most important part.
DataGOGO@reddit
The bots are going to rage on me for this one; but not putting all my data in the hands of the Chinese government.
I will run local or use a US based provider subject to US / EU data protection laws.
Every Chinese provider is very heavily subsidized by the Chinese government (and that is just what they openly admit to, the true extent is unknown).
The entire business model is to undercut US / EU companies to the point of making the AI business unsustainable, thus giving China AI dominance. They know that in order for OpenAI, xAI, Meta, Google, Microsoft, etc. to stay in the AI business, they eventually have to turn a profit.
By releasing models into open source, and providing extremely cheap API access the goal is to make turning a profit impossible.
That is very bad for everyone, as it just turns AI into Chinese propaganda machines and data collection tools.
To be clear, I am not throwing shade on the Chinese developers, engineers, and data scientists at all, just the government framework they are forced to operate in.
Old_fart5070@reddit
All AI companies are burning cash at 90s .com rates. They can keep the prices low only until the suckers funding them keep giving access to their wallets. When the bubble bursts things will get ugly
TokenRingAI@reddit
GLM Flash is quite good and can run on a $2500 Mac at decent speed, or really any kind of iGPU system. So it's essentially free to run if you are buying that level of hardware anyway.
This one model brought the cost of competent local AI down from ~$7000 to basically free, since it can run on the hardware you likely already have sitting on your desk.
slindshady@reddit
Everything is sensitive data. That’s the point. Look what happens with one „bad“ election.
HugoCortell@reddit
API is also great until suddenly something goes oopsie, and you get a 50K bill because the model got stuck because the model got stuck because the model got stuck because the model got stuck because the model got stuck because the model got stuck because the model got stuck because the model got stuck because the model got stuck [etc]
You don't run that risk on local hardware, and services that offer unlimited calls (subscription non API services) remain as costly as ever.
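A hard spend cap in the client is cheap insurance against exactly that loop. A sketch only; the per-token prices, cap, and model name are made-up placeholders, and a real deployment should also cap iterations per job:

```python
# Minimal runaway-loop guard: abort once estimated spend crosses a hard cap.
# Prices and model name are illustrative; check your provider's real rates.
from openai import OpenAI

PRICE_PER_1M_INPUT = 1.00   # USD, assumed
PRICE_PER_1M_OUTPUT = 3.00  # USD, assumed
HARD_CAP_USD = 20.00

client = OpenAI()
spent = 0.0

def guarded_call(messages, model="some-model"):
    global spent
    if spent >= HARD_CAP_USD:
        raise RuntimeError(f"Spend cap hit: ${spent:.2f}")
    resp = client.chat.completions.create(model=model, messages=messages)
    u = resp.usage  # token counts reported by the API
    spent += (u.prompt_tokens * PRICE_PER_1M_INPUT
              + u.completion_tokens * PRICE_PER_1M_OUTPUT) / 1_000_000
    return resp.choices[0].message.content
```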
CheatCodesOfLife@reddit
So is there a seahorse emoji or not??
twack3r@reddit
The reason we do it as a mid-size European company, and have now fully migrated to locally run, fine-tuned models, is strategic autonomy. We fell into the infrastructure and closed-ecosystem traps of cloud-based storage and subscription-based software early enough to be able to walk back that decision; it made us very aware of the strategic costs of such a dependency in our core processes.
As a consequence, it was always obvious that introducing AI into our workforce would only be viable if all relevant stakeholders were a) also European and therefore bound by common legislation, and b) companies smaller than or of our size, no corporations.
The unexpected advent of both open-weight and truly open-source LLMs happened after we had already made that decision, and as a result it reduced the initially budgeted cost substantially and accelerated roll-out massively.
I sleep a helluva lot better knowing that the AIs currently assisting, and in some parts fully replacing, my colleagues (net FTE change is 0, we reskill into other departments) are not at the whim of either US politics or some oligarch bro who decides that existing contracts are only fulfilled as long as it serves their side of our cooperation.
IulianHI@reddit
The repeatability point is huge - API models change behavior all the time without warning. I've had workflows break because an update suddenly made the model more "helpful" but less precise. At least with local you pin the version and know exactly what you're getting.
Icy_Foundation3534@reddit
compliance, and also the private equity-fication scenario where your opus sub is $3000 a month a few years from now because it's like hiring a real 100x super dev.
xrvz@reddit
Stability: the provider can't pull the rug out from under you with lower quants down the line or by EOL-ing a model in favor of a newer version.
Free: some people already have (beefy) Macs, so they may as well run a local model. Electricity cost here is also much lower than with a 3090.
Zyj@reddit
Do we really need a case "beyond privacy"? Privacy is getting ever more important and LLMs are getting access to more and more very private data. Case in point: Clawdbot.
FitAstronomer5016@reddit
Some points have been made already that I feel encapsulate it, but it falls under a few things.
API pricing and chat subscriptions are subsidized, not only for us customers but also for the companies (Claude has quite a bit of "free" compute from AWS, and I'm sure the same holds for the other large providers). While they are somewhat profitable on paper, once that allocation runs out you will see an increase in price.
Business processes will feel the brunt of that, and just as some companies migrate to local DBs and their own server stacks, AI falls into the same category.
Now granted, running local is not as lucrative, and at this point it is becoming much more of a luxury with the increase in hardware costs and, more importantly, power costs (especially planned price hikes). The control is still appealing and will continue to grow, though probably without the same hardware requirements we have now. If we get more efficient models and more cost-effective hardware, like dedicated TPUs/APUs for AI inference, it could become almost akin to an SSD.
mxforest@reddit
My wife has many published papers and abstracts in medical journals. She could never use an online tool or years worth of sensitive data is at risk. Also has to work with patient data that has to be de identified before use. With a local setup, there is no such worry. You can work without any fear. Also ask questions in medical context that online models just refuse. I was working on a project which dealt with Vaccine data to make the production process faster. Claude code saw a variable called "vaccine_name" and completely shut itself down. Even renaming in one location worked only for a short while because it found lingo with medical terms and completely refused to do anything.
RenewAi@reddit
Which model is she using?
mxforest@reddit
GLM 4.5 air running on MBP M4 Max 128 GB
RenewAi@reddit
Oh nice. Are you planning on trying 4.7 flash or are you happy enough with 4.5 air?
mxforest@reddit
Didn't like 4.7 Flash TBH. I have been in love with Nemotron 3 Nano. It passed ALL the tests I use to evaluate LLMs; I had to come up with new tests. lol
Nemotron 3 Nano at Q4 already runs on my PC with a 5090, giving 230-250 tps with up to 600k context. It is just not knowledgeable enough about medical lingo because of its small size, so we don't use it. Also, speed is not important for the paper use case.
RenewAi@reddit
Whoa, that's amazing. I've seen it talked about a bunch but never tried it; I assumed it was just a finetune of Qwen3 30B A3B. I'm downloading it now to try out.
Have you tried medgemma for your use case? I've been wondering if it's legit or not.
mxforest@reddit
I tried it when it launched but back then i didn't have these use cases. I will probably revisit. Although i really wish an updated version with tool calling support also shows up. Now that it is such an integral part of all LLMs.
Photoperiod@reddit
I dunno that the hardware is there yet, but edge computing. As robotics expand, and especially enter safety-critical spaces, you'll need low-latency, redundant systems that can work offline. Like all the self-driving stuff. You need something running locally, even if it's supplemented by datacenter calls. You can't afford network hops when milliseconds can be life or death.
That said, for most consumers, you're absolutely right. Running local is very much a hobby outside of privacy focused use cases.
ASYMT0TIC@reddit
Privacy is everything. So much of my job boils down to "inventor"... how am I supposed to use this to develop novel technologies and products when google or OpenAI have institutional knowledge of my idea and my progress towards actualizing it? How can I let it be involved in my personal life without worrying that my queries might reveal details of my life that insurance underwriters might be interested in? What if I want to run for political office some day?
CH3CH2OH_toxic@reddit
Gemini has a massive free tier? When? Where?
loadsamuny@reddit
Even in “freefall” K2.5 is $1 for 1M tokens. Some job runs I process around 50M tokens an hour for 8-24 hrs depending on the job. Local is still multiples cheaper
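Rough arithmetic on those numbers (assuming a flat $1/M with no caching discounts):

```python
# Back-of-envelope cost for the job sizes above, at an assumed flat $1 per 1M tokens.
usd_per_m = 1.00
tokens_per_hour = 50_000_000
for hours in (8, 24):
    tokens_m = tokens_per_hour * hours / 1_000_000
    print(f"{hours}h run: {tokens_m:,.0f}M tokens -> ${tokens_m * usd_per_m:,.0f}")
# 8h run: 400M tokens -> $400
# 24h run: 1,200M tokens -> $1,200
```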
iMakeSense@reddit
Damn what are you getting up to?
sine120@reddit
You know what the word "enshittification" means because we've seen this pattern a million times now. Investors subsidize to get users. Companies lose money while racing to the bottom / killing competition. Whoever is left raises prices later. See DoorDash, Uber, etc.
Blizado@reddit
It is always the same model, and YOU decide when you switch models.
The models you've used are never gone forever.
kzoltan@reddit
Don’t try to make it financially viable. It has dimensions that are hard to quantify.
Don’t try to win $ on it, that’s hard in this environment imo.
gaspipe242@reddit
I think the biggest gain is the investment in yourself. You're learning, fighting, and understanding these tools even more profoundly in a way that can only happen with friction.
This era reminds me a lot of the 90's, with early internet, the Linux kernel, and fragmented access. You fought to just get Linux on a computer. (I used to subscribe to Slackware CD/DVD media)
People used to say the same thing to me: "Why bother?" Now I have an understanding of and control over the stacks I'm using in a way that can only be understood by someone who lived and tinkered through that era. Many people on this forum are unknowingly creating a new future for themselves with this applied curiosity; it will create a LOT of value if applied properly.
This is why I keep coming back to this forum. It's the same energy here that I used to enjoy on Usenet from a VMS Vax terminal.
/nostalgia-off :)
pieonmyjesutildomine@reddit
Idk, what's the point of owning a car when bus prices are low? Like what's the actual use case?
r0ckl0bsta@reddit
Privacy is probably the motivator for a lot of folks, but let's be real. Most of us are here for the hobby and to see if we can. We're tinkerers and love the tweaking and customization and the "let's see if I can make it do this...".
I see y'all on r/selfhosted and r/Linux lol
AgreeableCaptain1372@reddit
Control over results. Using third-party APIs I get a lot of variance in my evals vs self-hosted.
Also, prices are low for standard models but not for fine-tuned models. So if you need fine-tuned LLMs, especially at scale, self-hosting or local can be worth it financially.
Your_Friendly_Nerd@reddit
I for one don't believe the prices we pay reflect the actual costs, especially with the subscription models (like Claude Pro). Right now, learning how people use the models is worth a lot more to them than charging more per person, as they'll use our usage data to fine-tune their next models.
Lifeisshort555@reddit
Hardware will be in free fall as well once these guys put each other out of business and people realize they do not really need to ask a 1 trillion param model what the capital of France is.
phenotype001@reddit
Network issues are no problem with local models.
CV514@reddit
I'm running local because it's more fun to have full understanding and control over what's happening.
Also, I don't need 800B+ models when 12-24B models do the same stuff just fine.
OutsideProperty382@reddit
Google's free tier got nerfed, last I remember. Badly.
nat2r@reddit
Price isn't the reason. It's far more expensive to obtain the local hardware.
People do it for security and novelty really. Can't trust these big companies.
devinprocess@reddit
Gonna be real with you.
Actual case: you are in the 1% of rich or lucky folks here and have no issues running a power-guzzling setup.
For the majority of us normal bees, API or renting is still the way, because the local models we can run are just for shits and giggles, and who cares about all those arguments at that stage.
Unless local LLMs become affordable, it's just a circlejerk.
Bit_Poet@reddit
Local gives you consistent quality. This week I've repeatedly had dumbed-down output and overload refusals from some of the biggest providers with their most expensive models. Unless you're an upper-tier customer, you're just an unimportant little bug in their big wheels.
Latency sure is a huge issue. I run complex workflows that have dozens or more consecutive api calls. Doing that over the web sucks, and most of the workflows can't be parallelized.
StardockEngineer@reddit
My only use case ever was that it’s fun and I love knowing all about it. Professionally, that turns out to be supremely useful, too.
d41_fpflabs@reddit
Privacy is 100% the only real reason. But it will arguably become the most important thing, because AI is only going to become more intertwined with most people's personal lives, and at a certain point it will probably become a barrier to entry even for people who aren't the most privacy-conscious. People's reaction to Microsoft's recent antics is an example of this already happening.
I personally feel we are going to start seeing more smart devices built with this in mind. The "MacMini Claudbot Boom" kind of highlights the potential of portable, private, AI-compatible smart devices.
DifferenceMuch1122@reddit
Perfect
DifferenceMuch1122@reddit
Yes
hydropix@reddit
I crunched the numbers on Owning vs. Renting (RunPod), and unless you're hitting 6+ hours of daily heavy usage, renting wins every time. Plus, the flexibility to spin up high-end clusters for training is a huge advantage.
I also doubt we'll see a repeat of the 'local GPU' era. Since inference isn't that latency-sensitive for most users, the cloud offers better resource efficiency. We’re likely looking at a cloud-dominated future rather than mass adoption of high-end local hardware (except, of course, for enthusiasts like us here who want to dive deep and push this tech to its limits).
ethertype@reddit
From the top of my head:
- Autonomy - nobody decides what and when and how and how much
- Privacy - yes
- Personal interest / tinkering - hobbies may have a cost
- Customization - as much as you have time and stamina for
- Ablated / de-neutered models - if you want to research $forbidden_topic
The energy cost argument is largely bullshit for inferencing. My 4 3090s do not pull 350W continuously. If the average idle load per card is 15W and an average energy cost of 10 US cents/kWh, we're talking $50 a year for idling.
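(Checking that napkin math with the same assumed numbers:)

```python
# Idle-cost figure from the comment above, using its stated assumptions.
cards, idle_watts, usd_per_kwh = 4, 15, 0.10
kwh_per_year = cards * idle_watts * 24 * 365 / 1000  # 525.6 kWh
print(f"~${kwh_per_year * usd_per_kwh:.0f} per year")  # ~$53, i.e. the quoted ~$50
```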
Imagine sitting around in 1913 and someone asks you why on earth you want to have your own car, when you can rent a perfectly good Ford model T. Chevrolet and Dodge didn't settle for renting a Ford T...
Current models are pretty good. But I am pretty sure we're still in the bottom knee of the innovation curve. For models. Private individuals can still innovate, even if they cannot train the big behemoths. Maybe that is where the new innovation will occur? Who knows.
But even if no new models arrive in the next 24 months, the tooling around them is still going through a lot of churn. A lot of stuff simply hasn't "settled" yet, and there is ample room for invention. This is definitely an area where private individuals may come up with something new. And maybe a new, bright idea requires something the commercial providers cannot offer yet.
Innomen@reddit
Privacy is the only argument. The rest is cope and motivated reasoning.
IulianHI@reddit
Another angle: tooling integration. With local models, you can hook them directly into your systems without API overhead or limitations. For long-running agents, batch processing, or workflows that need tight coupling with local resources (databases, files, etc.), the flexibility is unbeatable. Sometimes it's not just about cost - it's about architectural freedom.
funboiadventures@reddit
Im working on a side project for my medical center which uses a local quantized qwen3 model that has a custom RAG with my (anonymized) patient casenotes. Even though the casenotes have patient identifiers redacted, I don’t want any inference being done on an OpenAI server and would rather have it in-house.
MaruluVR@reddit
Learning and the fun of setting it up are big parts for me.
I train custom image gen models for game dev so finetuning/lora is the big part for me.
sampdoria_supporter@reddit
Part of my calculation is fully owning my process end to end and not feeling bad about burning tons of tokens on testing. Also, I'd argue on #3 that configuration with vLLM has never been easier, and it's crazy how hard you can push 3090s. Not disagreeing with your overall point though; I likely would never have bought hardware if APIs had been as cheap and performant as they are now.
usernameplshere@reddit
API costs are heavily subsidized; the actual costs are way, way higher than what we pay now. For me it's partly future-proofing, but also being able to keep working when another Cloudflare incident or whatever happens, not even mentioning the privacy concerns. For models like GPT-OSS 120B in full precision you "only" need a consumer graphics card and 96GB+ of fast RAM. Just 6 months ago that was a fairly reasonable cost for a decent model. But with hardware prices now? It's way less accessible.
fabkosta@reddit
You forget the fun of all of it. I don't really use local models, as I don't have sufficiently powerful hardware to profit from the depth of the models really. Yet, I just want to be able to run them. Just for the fun of it.
danttf@reddit
Yep! It's a real pain to watch how slow local models are and how little context they have. BUT it's very cool to set up a small model and a script to summarize all the documents I have in some folder.
Old-Magician9787@reddit
If your rig is powerful enough you can scale context to 1M+ tokens.
Dany0@reddit
My company pays a huge amount for aws bedrock. The best models, Opus 4.5, Sonnet 4.5
Everyone uses it to some extent. But guess what? It's down or not working all the time. Responses timeout, or as you mentioned we get rate limited
Locally I only do toy stuff with it, but the day-to-day UX is SO much better despite the huge upfront time & effort cost.
ross_st@reddit
I think the freefall in API pricing isn't sustainable.
It's a desperate race to the bottom to onboard customers, a loss leader.
Gemini is subsidised by the rest of Google's business. GPT and Anthropic are subsidised by generous VC runways. DeepSeek is subsidised by hedge fund profits.
But if they do see the adoption levels they are hoping to see, then they won't be able to afford to do that anymore.
Thump604@reddit
Trust: I don't trust them.
Control: cloud providers are throttling quality, reliability, and context, using customers to absorb costs and test their features on our dime.
Learning: it's fun, and you learn a lot that you can apply to a career or just a hobby.
Future: change will happen.
zipperlein@reddit
VC will dry out eventually.
rdsf138@reddit
"It's free after hardware costs" — this one aged poorly. That 3090 isn't free, electricity isn't free, and your time configuring and optimizing isn't free. At current API rates you'd need to run millions of tokens before breaking even." It still is a completely different pricing structure to have your own hardware as the prinary cost and, then, using it freely rather than paying for something everytime you use it and have to be proccupied with amount of usage.
IulianHI@reddit
Another point: model sovereignty. With local, you're not locked into any provider's roadmap or decisions. You can run whatever model you want, switch between them instantly, and keep using a model even if the company behind it shuts down or changes direction. When APIs are your only option, you're always at someone else's mercy.
scousi@reddit
Running local is still more expensive and always will be. But it's a hobby for many, and they justify the costs on other grounds. It's like Uber vs. your own car: Uber was cheap, and no longer is, but it's probably still cheaper than buying a car, with trade-offs.
Hard to say how long China will keep giving their stuff away. They open-sourced at the beginning (a brilliant move) because no one would have dared even use their models unless they were free or hosted outside China. Now people are willing to pay for their services; they've achieved recognition. Open source is a bit aligned with socialism, so maybe they are doing it for that reason. Who knows if there's a coordinated state-level strategy. What the West doesn't grasp is that the competition among the Chinese labs themselves is also pretty crazy; they are trying to outdo each other too.
There does seem to be a disconnect by how much China can achieve with what seems a lot less dollars vs the US.
-dysangel-@reddit
Being able to run offline is nice. I always have an AI I can chat to even if the internet or service provider are down. I'm not a huge prepper, but there is a part of me that wants to be ready for an emergency. The lockdowns during covid showed that sometimes weird things are just going to happen and you can't stop it. Being able to run frontier level AI at home or on the go is pretty awesome.
I like the idea of being able to have things like a super powered Alexa-like home assistant. I've not got around to this yet, as I guess it isn't really a true pain point. I keep thinking Google or Amazon will up their game too and release something awesome. But they still haven't, so I might get around to it eventually.
xmBQWugdxjaA@reddit
One issue is you can't guarantee latency nor quantisation with a lot of inference providers. It's actually crazy how ropey this still is. For a lot of usage that won't matter too much, but for production it really limits you to a small number of big providers (Groq, Cerebras, etc.)
Likewise if for your specific use case you are able to distill to something that can run locally, then it's still definitely worth it (in some cases you could even run on mobile etc. - e.g. if you train a BERT style classifier based on distilled data).
But for general usage like we see with Clawdbot and OpenCode, you are absolutely right!
Ne00n@reddit
Wrong, there is a rate limit: it's the memory bandwidth.
Sicarius_The_First@reddit
Valid points.
Tbh, use both, local and non-local.
For me, the reason to use local models is creative stuff. I want a Fallout / Morrowind adventure with the vibe done right, and with specific format-following capabilities.
No LLM can do this, so I made one that can.
In other words, local models can outperform in a specific scope / niche.
truthputer@reddit
Ffs, I already have the hardware, and any self-respecting developer who is also a computer enthusiast and sometime gamer wouldn't be caught dead with a system without a decent amount of memory and a reasonable graphics card.
AMD, Intel discrete cards and integrated GPUs will also run local inference, llama.cpp has lots of backends now - you’re not limited to whatever Nvidia is trying to price gouge you for.
FullOf_Bad_Ideas@reddit
Why don't you rent your cloud gaming console, rent a coffin pod, eat cheap rice with beans prepared by someone else, use a rented Chromebook as a primary computing device and outsource your own job to Asia and just collect the money gained from the arbitrage? Non-genuinely curious if I am missing something, the economics make this a clear winner.
I want to own my own life with reasonably minimal set of external dependencies, but you do you.
ParaboloidalCrest@reddit
Damn! That's exactly how the UBI-based world will look like.
FullOf_Bad_Ideas@reddit
Yes, a socialist, somewhat utilitarian society is where you can't prove that you need to be provided with x, so you get the bare minimum of socially accepted thing, often in a way that is not satisfactory. It's a trap.
ThaDon@reddit
ShinyAnkleBalls@reddit
What do you mean beyond privacy. It's like saying. Beyond having a billion dollars, what is the benefit of winning the lottery?
FastDecode1@reddit
Computer hardware: $$$
Not going to prison for asking forbidden questions: priceless.
Swimming_Corgi_9347@reddit
Model drift. How you use the model will evolve with your use case over time. Unless these API companies start allowing automatic fine-tuning (maybe memory, aka a database) as you use the model, you will never fully maximize the potential of the model. So you give up privacy and customization for convenience: the classic big-tech trade-off. Until the enshittification, which is already starting with ads.
JLeonsarmiento@reddit
Wicked models are mostly my justification now.
lionelum@reddit
Well, learning is a very good reason to run it locally, no need to say more =). Another is fine-tuning: running locally, you can train a model on a specific subject that's not so common.
If you already have the hardware (i.e. you already have a hardcore gaming PC, or some not-too-old crypto-farming equipment), points 2 and 3 are debatable. More so in countries where electricity is cheap but exchanging currency is a mess.
Full-Bag-3253@reddit
Enshittification is the standard business model now. Netflix was great, but now every year they make it worse unless you pay more.
Additional-Low324@reddit
You are just falling into the same trap as Netflix users did years ago. Netflix was so cheap it made owning your own movies look stupid. Then everyone started using it, then they raised prices and started adding political and ethical censorship.
It will be the same for AI providers.
Septerium@reddit
My two only reasons are:
- Privacy
- Hardware is fun
Foreign-Collar8845@reddit
It is market entrapment. You drop prices until you kill the competition (local in this case) then you charge.
Ruin-Capable@reddit
Gemini's free tier is not massive. I blew through a month's quota in 5 minutes with opencode.
NandaVegg@reddit
FYI, I have two 4090s running almost 24/7 for 2 years straight in our office for embarrassingly parallel training experiments, and based on average cloud pricing I've saved about 32% over those 2 years, including the cost of building the PC itself and utility bills (not accounting for office rent). The problem is that I won't be able to scale this up to something like 8 nodes of 8x H200s like clouds do.
kubrador@reddit
you're not missing anything, the calculus actually did shift. local made sense when claude was $15/mtok and you couldn't get gpt-4 at all. now you can get better models cheaper than your electricity bill.
the real answer nobody wants to admit: hobbyism. people like tinkering with llms the same way people build custom pcs when laptops exist. nothing wrong with that, but let's call it what it is instead of pretending the economics still work.
IulianHI@reddit
You also forgot about control - when something breaks or changes unexpectedly with an API, you're stuck. With local, you can always roll back to a previous version or fork the model. Plus the ecosystem around local (oobabooga, text-gen-webui, etc) gives you way more flexibility than any single API provider offers.
evia89@reddit
Where? The AI Studio API is dead overloaded and small. Web AI Studio is 20 RPD.
kaelvinlau@reddit
Well, it's cheap for you guys in the US/EU etc., but not in the APAC region. Running a small model locally is still highly viable.
AnomalyNexus@reddit
There hasn’t really been a mainstream one for a while. Rarely use local these days for anything real
Local is still fun though in the hobby sense. Not everything needs to make sense
ImportancePitiful795@reddit
Nobody should give money to the cloud if possible, no matter the price.
They are fully responsible for the hardware costs, they are heading toward going bust, and they are fully responsible for the prices we have to pay as consumers.
If we are to have FREEDOM in terms of AI hosting etc., we shouldn't see only the carrot but consider the stick too.
Example: OpenAI is at this point 8-12 months from running out of money and going bust. There is no money or goodwill left to hand the company, which has burned hundreds of billions without any profits. If it goes bust, it will crush the tech-sector bubble, cancelling all the contracts it has with NVIDIA, AMD, TSMC, Samsung, SK Hynix, and Micron, and all the hardware sitting in warehouses will need to be sold off, with prices coming back down to normal levels.
We shouldn't cave in; boycott them instead. Squeeze them now, before they finish squeezing us.
ParaboloidalCrest@reddit
Totally agree, but unfortunately the local llama cult is a drop in a huge ocean of normies who will pay for ChatGPT without blinking...
Dr_Allcome@reddit
I think latency has only recently become a deciding factor, since the quality of local models has improved quite a bit. A few weeks ago I still used an online model for general questions in addition to my local coding model. But recently I noticed my cheap/free-tier Perplexity account actually responds much slower than my local GLM 4.7 (both time until it starts responding and t/s), and the local quality isn't so much worse that it would offset the speed.
tcoder7@reddit
When you use an AI API, you give away IP. The model scans your repo and sends everything to a remote server. Open the Output tab in VS Code and watch.
hejj@reddit
Offline use if you are using a laptop
Steus_au@reddit
K2.5 is not even close to Opus. It can be ranked alongside Sonnet (an overthinking edition of Sonnet), but not Opus.
Liringlass@reddit
I agree. It’s not millions of tokens but billions or trillions to break even, if ever, depending on electricity and depreciation costs.
ChocolatesaurusRex@reddit
I think privacy and autonomy are two rock-solid reasons that don't need elaboration.
I'd piggyback on the "fun" comment. I loved building PCs but hadn't done so in a long time (office jobs don't usually need the horsepower). Building an AI server, hosting my own services, and developing my own workflows has revived the joy I had when I was younger.
All that aside, I think the most important understated reason is making AI beneficial to you as an individual.
The current suite of tools learns from your data for the benefit of training the provider's model toward the company's goal (more users/subs/attention/data/etc).
There's not really a tool that learns you and automatically trains the model to make your workflow/process better based on how YOU work. I feel like this is where the AI gold rush will fail people the most. I use my local AI to fill that gap for myself, and I'm sure others do as well.
sautdepage@reddit
Sending my prompts to a remote AI server excites me about as much as cable TV.
SpicyWangz@reddit
Honestly cable tv sounds more exciting than that
Xamanthas@reddit
They want you to switch so that in the coming year or two you are locked in and can't purchase hardware lol.
BumblebeeParty6389@reddit
If your goal is getting the most tokens for your money, you are right. APIs like DeepSeek, with the cache feature etc., beat local AI by a wide margin. It takes years for a 3090 or a Mac to pay for itself when you calculate the ROI based on how many tokens you'd generate with your local hardware.
You said privacy, and you are right: when you use an API, you should assume that someone is going to read that conversation and/or put it into a training dataset to train on or sell. But you are missing something else: control.
When you use an API, you don't know what is happening in the background. Your inputs will probably get injected with the API provider's safety policies and rules before they reach the AI. So even if the model itself isn't censored, API providers will take their own measures to comply with regulations and concerns around AI. Not every API provider does this right now, but you can bet your ass every one of them will be forced to in the very near future.
Since 2023 we have lived through the wild-west period of AI, and now corpos and governments are taking things under control. I'd say enjoy the dirt-cheap APIs and loose censorship while they last, but don't assume this is how things will be in the future.
Like others pointed out, right now there is a "gold rush" in the AI field that is slowly dying out. As the investments dry up, shareholders and investors will stop being patient and demand to see real profits. AI startups and datacenters that made huge investments will have to boost their prices like crazy to be able to pay their debts. AI is an exciting technology and I think it'll be at the center of our lives from now on, but the bar to entry is high and it requires a lot of investment to get rolling. Training a model takes hundreds of millions of dollars, solid data engineers, and datasets. Running things at large scale is also very expensive; current LLMs are extremely inefficient. It'll take a long time to smooth things out. Companies that don't rely on API revenue and investments for their entire income, such as Google, Microsoft, Amazon, Alibaba, and Meta, will survive, while most AI startups will disappear.
SpicyWangz@reddit
Ah yes, the freefall. LLM companies don’t even know what’s happening. Every time they look at their own pricing the numbers are lower. Soon enough I’m sure they’ll be paying you to use their API
k_means_clusterfuck@reddit
Not a deciding factor for me, but I do think it's nice to know my carbon footprint.
prusswan@reddit
I don't want to get into the habit of depending on cloud subscriptions, since pricing is arbitrary; if prices appear low, that's clearly a sign they will not last. Both local and cloud options have a place, but a local setup already lets me do quite a bit, so I don't need to pay for expensive models. The constraints of local also motivate people to be economical and stretch their available resources. And being able to work offline is better for security on tasks that don't require being online. Like it or not, the internet has become much more dangerous with the availability of tools and the general lack of awareness on users' part.
xadiant@reddit
The best, yet-to-be-fully-explored aspect of local LLMs is personal fine-tuning.
You could potentially use a cutting-edge coding LLM now and later fine-tune your own model on that usage. It won't be the same, but it should specialize well for your use case.
Likewise, you can specialize an available model in almost anything to approach cutting-edge model performance.
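As a rough sketch of what that looks like today (the data file, base model, and hyperparameters are placeholders; trl's SFTTrainer with a peft LoRA config is one common route):

```python
# Minimal LoRA fine-tuning sketch using trl + peft. The data file, base model,
# and hyperparameters are placeholders; the dataset is assumed to have a
# "messages" (or "text") column built from your own transcripts.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="my_transcripts.jsonl", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",  # any base model your hardware can handle
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
    args=SFTConfig(output_dir="my-lora-adapter", num_train_epochs=1),
)
trainer.train()  # writes a small LoRA adapter you can merge or load at inference
```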
noctrex@reddit
I have split my use cases between local and remote. Anything personal, I only use local models.
For example, for classifying my family photo album, I use only a local Mistral Small.
Other smaller things too, like using KaraKeep as a bookmark manager with a nice small LFM model to generate tags and summaries.
But for my homelab automation scripts and the miscellaneous personal programs I create at times, I got this cheap z.ai coding plan that was on sale the other day for like 26 bucks for a year. Now it's rewriting all my scripts, and it does a terrific job, all for the price of a pizza. Is it for privacy? I believe I'm actually lowering its intelligence with my scripts. :)
Deep_Traffic_7873@reddit
You forgot... control. With an online service you can lose access at any time, for any reason.
DeltaSqueezer@reddit
Privacy, availability, latency, customizability, control, predictability.
whatever462672@reddit
What do you mean, generous limits? I easily burn through 1 million tokens an hour just doing text operations, document classification, and re-ranking.
SirDaveWolf@reddit
If it's free then you are the product.
__Maximum__@reddit
Have you tried running deepseek on coding tasks? It is one of the cheapest models out there, but on coding tasks, it gets pretty expensive pretty fast.
Distinct-Expression2@reddit (OP)
Yes, but Opus is just better IMHO; GLM 4.7 is close.
__Maximum__@reddit
You said DeepSeek is practically free, but when you run it on a codebase, 1M tokens is not much; you can burn through 10 bucks in a day.
theabominablewonder@reddit
I don’t think pricing is in freefall, I burned through £150 of Claude API fees last week, I wish prices were in freefall! It actually makes me consider investing in an upgraded home rig. At the moment I only have a 3080 so constrained to smaller models (which don’t work accurately enough). Those Mac Pros with unified memory start to look appealing if these are the current API costs.
Fheredin@reddit
Opsec. Running an AI agent without an air gap, when there are literal zero-click prompt injection exploits in the wild, is insane.
fugogugo@reddit
Running uncensored model ?
Marak830@reddit
I run a separate memory layer between my local and my chat.
Without a ton of hassle, I cannot do that with a public model (without paying API pricing).
My responses may be slower, but I know the historical context is going to be there. As well as the model overrides.
In addition I can bolt on modules as I feel like it(voice, avatar, silly tavern to list a few).
I get to control my model by selecting specific ones for tasks, I can upgrade as they are released.
These are the reasons I use local.
I do use Claude as a coding junior so I can assign tasks and review them, purely because I don't have something that can replicate that locally on my setup.
That's more than likely a temporary issue (years, not weeks, given the expense of things and the state of open models specializing in coding).
rosstafarien@reddit
Poor network coverage. Running fine tuned domain specific models. Privacy. Stability. I'm worried these hosting companies won't last.
bgiesing@reddit
Because it's still more expensive. My PC is already on 24/7, so it would be using about the same electricity regardless, and I can use that GPU for many things (games, video editing, etc.); it was paid off years ago. That's still cheaper than dropping $10-20 a month on API calls or a subscription service.
Also, many people explicitly chose local because they want to make content that the cloud models from the big companies refuse, API cost doesn't matter if every single reply you get is "I can't fulfill that request", you literally don't have an option
Middle_Bullfrog_6173@reddit
Kimi K2.5 is more expensive than K2. GPT-5.2 is more expensive than 5.1. Gemini 3 is more expensive than 2.5. That's not freefall.
Capabilities are advancing, so if you have a task that can now be handled by nano/flash models, then sure, you can get it done cheaper. But frontier pricing seems pretty stable.
nullmove@reddit
Difficult to say. The fine print is that the cache-hit input price is 33% lower; for agentic coding sessions this can easily matter more than the 20% increase in output price.
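To see why, here's the back-of-envelope version with made-up prices (placeholders, not Kimi's actual rates; only the ratios matter). A typical agentic turn re-sends a large, mostly cached context, so the cache discount dominates.

```python
# Illustration of cache pricing vs output pricing in agentic sessions.
# All per-million-token prices below are invented placeholders.
def session_cost(cached_in, fresh_in, out, p_cache, p_fresh, p_out):
    """Dollar cost for given token counts at per-Mtok prices."""
    return (cached_in * p_cache + fresh_in * p_fresh + out * p_out) / 1e6

# Agentic coding turn: big, mostly cached context, modest output.
tokens = dict(cached_in=900_000, fresh_in=100_000, out=50_000)

old = session_cost(**tokens, p_cache=0.15, p_fresh=0.60, p_out=2.50)
new = session_cost(**tokens, p_cache=0.10, p_fresh=0.60, p_out=3.00)
print(f"old ${old:.3f} vs new ${new:.3f}")  # old $0.320 vs new $0.300
# The session gets cheaper despite 20% pricier output, because cached
# input dwarfs everything else.
```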
Lethargic-Rain@reddit
Bulk processing: cases like agentic chunking for RAG, or visual processing (e.g. image -> text / tagging).
Small semi-structured tasks: e.g. I use Gemma and a YouTubeDL MCP server to download tracks, encode/trim them, and add metadata, cover art, etc., for use in a music library.
You can also use models like qwen2.5-coder:1.5b/3b as locally running autocomplete with Continue (roughly the trick sketched below).
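A minimal sketch of what that autocomplete path does under the hood: fill-in-the-middle prompting straight at the local model. It assumes an Ollama server with the model pulled; the FIM special tokens are the ones Qwen2.5-Coder was trained with.

```python
# Fill-in-the-middle completion against a local qwen2.5-coder via Ollama.
# Assumptions: Ollama running on localhost with the model pulled.
import requests

def fim_complete(prefix: str, suffix: str) -> str:
    # Qwen2.5-Coder's FIM format: prefix, suffix, then generate the middle.
    prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"
    resp = requests.post("http://localhost:11434/api/generate", json={
        "model": "qwen2.5-coder:1.5b",
        "prompt": prompt,
        "raw": True,      # bypass the chat template; send tokens verbatim
        "stream": False,
        "options": {"num_predict": 64, "temperature": 0},
    })
    return resp.json()["response"]

print(fim_complete("def fib(n):\n    ", "\nprint(fib(10))"))
```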
michaelsoft__binbows@reddit
You need solar to drive your amortized cost per kWh down to $0.10.
And think of that as the base cost you pay for privacy.
Work that doesn't require privacy should just leverage subscriptions first and then on-demand API, as the latter is way more expensive. (Rough numbers sketched below.)
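The arithmetic behind that, with assumed numbers; wattage, throughput, and the electricity rate are all placeholders to swap for your own rig:

```python
# Rough local electricity cost per million tokens. All inputs are
# assumptions; adjust for your hardware and power price.
watts = 350          # GPU draw under inference load (assumed)
tok_per_sec = 30     # sustained local throughput (assumed)
usd_per_kwh = 0.10   # the solar-amortized rate mentioned above

hours_per_mtok = 1_000_000 / tok_per_sec / 3600
cost_per_mtok = hours_per_mtok * watts / 1000 * usd_per_kwh
print(f"{hours_per_mtok:.1f} h and ${cost_per_mtok:.2f} per 1M tokens")
# ~9.3 h and ~$0.32 per 1M tokens: electricity alone already rivals the
# cheapest APIs, before counting hardware amortization.
```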
No_You3985@reddit
Running small visual agents (e.g. Qwen3-VL) to automate tasks on your PC. I will not use a cloud API for that: too many risks and privacy concerns. (Sketch below.)
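A minimal sketch of the idea, assuming an Ollama server with a vision model pulled (the model tag is a placeholder); a real agent would parse the reply into concrete actions rather than just printing it.

```python
# Screenshot -> local vision model sketch. Assumptions: Ollama on
# localhost with a vision model pulled; model tag is a placeholder.
import base64
import io
import requests
from PIL import ImageGrab  # pip install pillow

shot = ImageGrab.grab()            # capture the current screen
buf = io.BytesIO()
shot.save(buf, format="PNG")

resp = requests.post("http://localhost:11434/api/generate", json={
    "model": "qwen3-vl",           # placeholder tag for a local VLM
    "prompt": "Describe the focused window and the next UI action to close it.",
    "images": [base64.b64encode(buf.getvalue()).decode()],
    "stream": False,
})
print(resp.json()["response"])
```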
LosEagle@reddit
My worry is that even with local LLMs you still have to trust something once you want GUIs, remote access, and such. You probably won't want to just chat with them over the terminal all the time. For example, I still have to trust OpenWebUI and Tailscale not to do anything nefarious.
siggystabs@reddit
I make apps that use LLMs as part of their processing. A single job could call an LLM 20 or more times during processing (tool use, agentic loops, summarization, etc.), and you can run hundreds of jobs per hour in parallel. I'm at a pretty early stage, so I appreciate not having to burn hundreds on API credits just to mess around with some new concepts. Hence, I bought some 3090s. I don't want the variability of relying on external API pricing at this stage.
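That fan-out pattern is cheap to saturate against a local OpenAI-compatible server (vLLM, llama.cpp, etc.). A minimal sketch with illustrative names; the semaphore caps in-flight requests at whatever your GPU can batch.

```python
# Concurrent multi-call jobs against a local OpenAI-compatible server.
# Assumptions: server on localhost:8000; model name is a placeholder.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="local")
sem = asyncio.Semaphore(16)  # cap concurrency to what the GPU can batch

async def call(prompt: str) -> str:
    async with sem:
        r = await client.chat.completions.create(
            model="local-model",
            messages=[{"role": "user", "content": prompt}],
        )
        return r.choices[0].message.content

async def job(doc: str) -> str:
    # Each job chains several model calls, like the pipelines described.
    summary = await call(f"Summarize: {doc}")
    return await call(f"Give 3 tags for: {summary}")

async def main(docs: list[str]) -> list[str]:
    return await asyncio.gather(*(job(d) for d in docs))

print(asyncio.run(main(["doc one...", "doc two..."])))
```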
Economy_Cabinet_7719@reddit
Idk about "competitive benchmarks". When I gave it a task (a small refactor in Nix, 5-6 files with maybe ~50 LOC total), it barely managed to follow my thinking and was adding unnecessary comments everywhere. I expected a lot better. With the standard plan at $19/month and rate limits as strict as Claude's, it is not at all competitive with a ChatGPT Plus subscription (same price, much better model, much better rate limits).
Snoo_64233@reddit
There is a strong case for running image/video models locally: customizations like art styles, camera angles, custom characters the model doesn't know about, NSFW. Basically, so many LoRA finetunes.
Barely any reason for customization with LLMs, however. One is entertainment while the other is not. To that end, I see fewer and fewer reasons to run LLMs locally. This is one of the primary reasons I've become less interested in LLMs over time, since I don't do local for the sake of local.
Distinct-Expression2@reddit (OP)
For image/video I've been rocking ComfyUI since SD 1.5; that one for sure :)
Zeeplankton@reddit
These models are cheap, but they don't even remotely touch, like, Opus.
lakeland_nz@reddit
Lag.
I want a home assistant. I don’t want to wait for my speech to be sent to America for processing. My ping isn’t good enough.
pip25hu@reddit
Contractual obligations. If you're a software company using AI, you might have to send the model trade secrets while using it. Your client can easily say that they do not want those pieces of information to leave the company network, period.
SemaMod@reddit
This goes into the realm of privacy, but personally, having my chats trained on and viewable by these companies makes me uncomfortable. That being said, I do think local LLMs will become power-user tools.
sunshinecheung@reddit
API costs vs. GPU + electricity costs.