Is running local LLMs actually cheaper in the long run?
Posted by HealthySkirt6910@reddit | LocalLLaMA | 28 comments
Been experimenting with running models locally recently.
But honestly, it feels like costs (GPU, time, setup) add up faster than I expected.
For those who run things longer term — does it actually get cheaper over time, or not really?
MexInAbu@reddit
No. The so-called "hyperscalers" are burning through VC cash to subsidize the cost. Local LLMs are about control, privacy, and learning.
Randommaggy@reddit
Also not getting hooked on something that will 20x in price.
bidibidibop@reddit
But then, if we agree that this is unreasonable for the future, doesn't it make sense to TAKE ADVANTAGE AS MUCH AS YOU CAN of the low low prices, and take the pain of setting up your own cluster when they go up, instead of taking it now?
TemporaryUser10@reddit
It doesn't need to be an either/or
eras@reddit
In the long term this might change, though.
But you can buy a lot of API tokens for the few thousand bucks you'd need to spend on anywhere-near-usable hardware, and a lot more for hardware that's actually comparable.
Maximum-Wishbone5616@reddit
Much cheaper.
Do not listen to kids who have never spent £100-200k on infra. They have no idea what they are talking about.
A £100k cluster delivers 768GB of VRAM; for a team of 12 developers that is enough to mix 3-10 different LLMs.
Our power/energy cost per cluster => less than £500 per month.
We were spending an immense amount of money, and in recent months Claude Opus 4.6 was so dumb and lazy that we always had to use Qwen 3.5 to verify its code/findings.
Currently we are not only SAVING on subscriptions + API but also heavily more productive. We are not bound by the changing performance of a model, we have much stronger control over system prompts, and all models use super-specific guidelines for each mode.
What most people miss is that self-hosted models put you in full control. This boosts productivity THROUGH the roof.
A few days ago I got our monthly performance reviews (tickets/code written/features/etc.). Since we cancelled all our MAX plans + other subs, not only has the amount of code increased, but the number of failed tests (written by an external team) has decreased, time to solution has decreased, and tickets per day are up almost 35%.
So it is not about the cost of subs/infra. It is about the business cost of not shipping HIGH QUALITY ENTERPRISE software fast enough.
It is not about $200, it is about $200-500k in lost revenue per MONTH, plus the compounding effect over 3 years.
I would spend even £1m now for this cluster.
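The payback math behind this comment can be sketched as a back-of-envelope calculation. The £100k cluster cost and £500/month power figure come from the comment above; the per-developer subscription and API spend are illustrative assumptions, not numbers from the thread.

```python
# Back-of-envelope payback period for a self-hosted cluster,
# using the comment's figures plus two assumed per-seat costs.

CLUSTER_COST = 100_000   # £, one-time (from the comment)
POWER_PER_MONTH = 500    # £/month (from the comment)
DEVS = 12                # team size (from the comment)
SUB_PER_DEV = 160        # £/month top-tier subscription (assumed)
API_PER_DEV = 300        # £/month API overage (assumed)

# Savings = avoided per-seat spend minus the cluster's power bill.
monthly_savings = DEVS * (SUB_PER_DEV + API_PER_DEV) - POWER_PER_MONTH
payback_months = CLUSTER_COST / monthly_savings

print(f"Monthly savings: £{monthly_savings}")
print(f"Payback period: {payback_months:.1f} months")
```

Under these assumptions the cluster pays for itself in under two years; with smaller teams or cheaper subscriptions the payback stretches accordingly.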
DeltaSqueezer@reddit
Yes, the economics of it shift once you have a large enough team to amortize the server cost.
MrAlienOverLord@reddit
I'd say you are the one who has no idea.. 100k isn't a cluster; all you get for that is a 7x 6000 Pro node from Scan. An HGX DGX costs you 400k, and you need the 10-15 kW power draw (in the UK, good luck).. no chance in hell that pays back within 5 years. Disclaimer: I work with a hoster (and no one wants the hassle of dealing with infra themselves).
MrAlienOverLord@reddit
In addition to that, you need n^2+1 units if your biz depends on it ^^ now let's do the math on how viable that is.
eras@reddit
Lease from cloud as failover?
Long_comment_san@reddit
I bet something like 48GB of VRAM and 512GB of RAM would absolutely suffice for 99% of use cases, and it's probably under $10k. You could cut that in half and probably get under $3,500. And it's going to be yours forever. That probably pays for itself in like 6 months versus cloud.
DeltaSqueezer@reddit
Agree. With Qwen3.5 I've replaced big models for a lot of stuff. Coding seems like one of the last bastions of big model advantage, but I'm managing to replace a lot now with only Qwen3.5-9B. I actually prefer the local one as it is faster, even if it is much less intelligent.
bidibidibop@reddit
You should include the electricity costs as well.
Long_comment_san@reddit
That's negligible relative to token costs.
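A rough sanity check of this claim: electricity cost per million generated tokens, under assumed (not from the thread) figures of a single GPU drawing 400 W under load, generating 50 tokens/s, at $0.30/kWh.

```python
# Electricity cost per million generated tokens (illustrative figures).

POWER_W = 400          # GPU draw under load (assumed)
TOK_PER_S = 50         # generation speed (assumed)
PRICE_PER_KWH = 0.30   # $ per kWh (assumed)

seconds_per_mtok = 1_000_000 / TOK_PER_S
kwh_per_mtok = (POWER_W / 1000) * (seconds_per_mtok / 3600)
cost_per_mtok = kwh_per_mtok * PRICE_PER_KWH

print(f"{kwh_per_mtok:.2f} kWh -> ${cost_per_mtok:.2f} per million tokens")
```

Under these assumptions it comes out well under a dollar per million tokens, which is below typical paid-API output pricing for mid-size models, supporting the "negligible" claim.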
denoflore_ai_guy@reddit
It’s funny - right now it makes more sense for me to get a new 3060 12GB or 5060 Ti 16GB and have the extra power the GPU brings for testing and CUDA processing for other tasks, rather than just getting more RAM. WTF world are we in right now.
Long_comment_san@reddit
No joke. I have zero idea what the point is of building datacenters with dead-on-arrival hardware (while also waiting a year or two for that hardware to arrive), because hardware is evolving right now. Datacenters made sense when hardware was available; now Nvidia says supply is reserved for the next several years. Who are these idiots?!
Admirable-Earth-2017@reddit
It will be, once companies finish gathering training data from users and stop subsidizing.
Then everyone will get properly fkd; you won't even have time to migrate to local models, so do the migration before that happens.
TractionLayer_ai@reddit
Yes, it can get cheaper over time at scale, but the bigger benefit is keeping your data out of a third-party black box. For a lot of teams, privacy, control, and compliance matter more than the raw compute cost.
Euphoric_Emotion5397@reddit
Yes for me. But I still pay $20 for vibe coding the app. I use the local LLM to scrape tons of data and do analysis. Local models like Qwen 3.6 MoE are already super good.
Cultural_Meeting_240@reddit
It depends on your usage volume. If you are running inference heavily every day, local pays for itself pretty fast. The upfront GPU cost is real but after that it is basically just electricity. For lighter usage, API calls might actually be cheaper. The real win for me is privacy and no rate limits, the cost savings came later once I started using it more consistently.
rwa2@reddit
No one's arguing for the "it depends" hybrid approach.
The future is in routing different types of requests to the best available model suited for it wherever it lives. Get the best of both worlds. Have access to all the latest models for testing with Openrouter or Azure AI Foundry. Paid for in microtransactions rather than monthly subscriptions. Scale the bulk of your work on the local models in a way to amortize your hardware and local storage efficiently. Be resilient to downtime for either.
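The routing idea above can be sketched as a tiny dispatcher that sends each request to the cheapest backend that can handle it. The tier names and the length/keyword heuristics here are made up for illustration; a real router would use a classifier or per-task config rather than substring matching.

```python
# Minimal sketch of request routing across local and hosted models.
# Backend labels are hypothetical, not real client objects.

def route(prompt: str) -> str:
    """Pick a backend label for a prompt based on rough difficulty cues."""
    hard_markers = ("prove", "refactor", "architecture", "debug")
    lowered = prompt.lower()
    # Long or clearly hard requests go to a paid frontier model
    # (e.g. via OpenRouter microtransactions).
    if len(prompt) > 4000 or any(m in lowered for m in hard_markers):
        return "frontier-api"
    # Mid-weight work goes to the big local model on your own GPU.
    if len(prompt) > 500:
        return "local-large"
    # Bulk/simple work stays on the cheap local model.
    return "local-small"

print(route("Summarize this paragraph."))             # -> local-small
print(route("Refactor this module for testability"))  # -> frontier-api
```

This is also where the resilience argument lands: if one backend is down or rate-limited, the router just returns the next tier instead.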
MrAlienOverLord@reddit
Generally API is cheaper. If your biz depends on it, you need at least 2-3 SREs plus the hardware plus spares.. not worth it unless you are a 250-person or bigger org.
denoflore_ai_guy@reddit
*looks at 15k ramshackle inference bench*
Nope! But it is a lot more fun!
Shoddy_Cook_864@reddit
Try this project out: it's a free, open-source project that lets you use large models like Kimi K2 with Claude Code completely free by utilizing NVIDIA Cloud.
Github link: https://github.com/Ujwal397/Arbiter/
Technical_Split_6315@reddit
No. Models from companies are funded by investors, and they offer the models at a loss. That's why you see plans getting nerfed every day (Copilot removing Opus and adding rate limits, Claude Code removed for Pro users, etc.).
Hosting a good LLM is extremely expensive; you would need to pay around 30-40k to run something worse than Sonnet 4.6. Just pay for API calls at this point and hope that next year it will be cheaper to host your own model.
redditorialy_retard@reddit
My electricity is free. So yes.
666666thats6sixes@reddit
We save money by not having to tune workflows every time a provider quants a model or upgrades to a new version. Local models will behave identically forever if needed.
verdooft@reddit
Setup was a one-time thing in the past; now I just run git pull and cmake to get a recent llama.cpp. I don't have a decent GPU, so power consumption is low and costs are low. It's fun to test new models and applications at no cost; it's worth investing a little time.
Inputs and outputs remain on the computer. I'm happy with local LLMs.