Why run local? Count the money
Posted by Badger-Purple@reddit | LocalLLaMA | 128 comments
I’m not a coder, but I run local models. I gave in to the agent hype (I was building my own, but there is so much to do) and installed Hermes, running Qwen-397B on a 2-Spark cluster.
So…I asked Hermes today to tally the token count, and the result…200 million tokens. In 5 days.
At this rate, using an agent for tasks like installing software and debugging things I want to try out, what cost am I saving? Artificial Analysis puts the average provider price at about $1.25 per million tokens. At that pricing, my usage works out to roughly $1,500 per month, and my Sparks will pay for themselves within 6 months.
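Spelled out, the back-of-the-envelope goes like this (a sketch assuming a 30-day month, a flat $1.25/M blended rate, and the $5,000 cluster price mentioned later in the thread):

```python
# Back-of-the-envelope: API-equivalent cost of the agent's local token usage.
TOKENS = 200e6        # tokens burned in the 5-day window
DAYS = 5
PRICE_PER_M = 1.25    # USD per million tokens, blended (Artificial Analysis avg)
HARDWARE = 5000       # USD for the 2-Spark cluster (per the OP, later in thread)

tokens_per_day = TOKENS / DAYS                         # 40M tokens/day
monthly_cost = tokens_per_day * 30 / 1e6 * PRICE_PER_M
payback_months = HARDWARE / monthly_cost

print(f"~${monthly_cost:,.0f}/month; payback in ~{payback_months:.1f} months")
```

At these assumptions the raw payback comes out even faster than 6 months; electricity, idle time, and input/output price splits push the real figure back toward the post's estimate.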
Caveats, of course: I bought them at cheaper prices than today’s. But even a simple estimate shows there are some valid reasons to go local.
Like I said, I am not programming, and I know there are programmers who easily triple my token count in the same time. That implies that if you use 100 million tokens per day, the return on investment is still there today, even with crazy computer prices.
To me, local AI is about the desire to utilize a cool technology without the strings attached that threaten individual privacy and intellectual property. But knowing that my investment is not just purely hobbyism gives me more conviction that local AI is the future.
I know I am preaching to the choir…So the question is, has anyone else felt their rig is becoming more sustainable now than 6 months ago, price wise? Would love to hear!!
Juan_Valadez@reddit
In my case, I run LLMs locally for all these reasons:
- Privacy.
- Availability.
- Consistency.
- Customization.
- No usage limits.
- Price.
- And simply because I like it.
iVtechboyinpa@reddit
> Availability, usage limits & pricing
I pay for the $200/month Claude Code plan. I wouldn’t be surprised if they go up to $250, $300, $350, etc.
At that point, why am I paying a car payment for AI? $200 is already tough to swallow - that’s $2,400 a year, like WTAF lmao. But it’s fine, I’ll take it on the chin because it works, it’s simple, ion gotta configure nothing.
But if they ever raise the price, or are on their (usual) extended outage, or change usage limits and I can’t use the model, I’ve got my trusty ol’ server @ home.
CabinetMain3163@reddit
I pay $100 a month for unlimited usage on Featherless; still cheaper than getting a GPU that could handle 256k tokens of context, plus everything around that GPU...
gambit700@reddit
This is seriously why I'm considering getting a strix halo or spark. $2400 a year for claude is nuts.
CabinetMain3163@reddit
I get most of that using Featherless though, and the upfront cost of a GPU to do the same would be a lot.
nochkin@reddit
The "I like it" part should be the first one in the list.
eribob@reddit
I agree!
No-Equivalent-2440@reddit
amen to that
kevin_1994@reddit
for me it goes something like:
OuterKey@reddit
The fact it's even possible on consumer hardware (in some cases older hardware!) and now becoming more and more useful. Price-wise it's iffy in the short term (GPUs can be expensive), but in the long run it would be cheaper than a subscription.
No need for internet is a big win too
misanthrophiccunt@reddit
I think privacy is not written there enough times; I would have written it after every other line. People without NDAs don't realise how essential that part is.
Badger-Purple@reddit (OP)
super important
Also if you are bound by laws like doctor-patient confidentiality. Anyone want to come to my clinic, see me for a rash, and have our convo go to OpenAI? Anyone?
Ornery_Hall@reddit
Patients' data are kept behind independent servers/databases per HIPAA and ISC2 regulations. Doing research with AI is still a grey area, but a localized LLM is definitely preferred. I don't know any scientists around me who know how to build a local LLM at the moment; it's all up to the IT geeks, unfortunately.
Badger-Purple@reddit (OP)
I know a couple who have, including myself!
misanthrophiccunt@reddit
In Europe we're talking not just NDAs but also GDPR, which fines you, very literally, up to €20 million or 4% of annual global turnover, whichever is higher.
A single fine can destroy most startups.
luvs_spaniels@reddit
Consistency. Can't emphasize enough.
Dany0@reddit
Initially I liked the customization and price. And privacy I did not care much about but I guess it was a bonus. Mostly I just liked tinkering with it and seeing what it can't do lmao
Years later I really appreciate the consistency the most. I set it up the way I like it, and sure it's probably not optimal. But it won't change for no reason. And then the outages and usage caps showed me that I have to cradle my 5090 like it was my biological child
Badger-Purple@reddit (OP)
yeah. BTW https://github.com/NousResearch/hermes-agent/issues/336 they’re working on a skill from another post you asked about. Also Hindsight as a memory system injects the retrieved content from previous work or an obsidian vault. Lastly, I do have a system prompt for prompt engineering linked to a RAG of a large collection of prompt engineering text, which I asked hermes to make a sub agent for. So my prompt is routed to the sub agent, then spit back to main agent in refined form.
Badger-Purple@reddit (OP)
well said!
T0biasCZE@reddit
It's not as simple as that; you also need to account for power consumption.
€0.30 per kWh...
Badger-Purple@reddit (OP)
solar!
kmouratidis@reddit
If you can get it. Hard/impossible to do in small/medium apartments and/or rentals.
bgravato@reddit
I was going to say that too... People often forget the cost of electricity...
That said, the price per kWh can vary wildly depending on where you live...
I have an indexed tariff (to the OMIE spot market), so the price per kWh can vary a lot from month to month... Last month I paid about €0.11/kWh + taxes (tax is 6% for the first 200kWh/month; above that it's 23%).
I know people (in other countries) who pay a lot more, as well as others who pay less...
Britbong1492@reddit
But it cost you 10 million tokens to do "npm install ..."
Badger-Purple@reddit (OP)
Yes. I am a physician scientist with 2 doctorates and an engineering degree, but did not know what npm was 8 months ago. so 🤷🏻♂️
somerussianbear@reddit
Huge proponent of local AI here. Spent countless hours installing, setting up, testing, and getting frustrated over local models on our laughable hardware (compared to a real cloud setup: a 35+ grand chip, 12 of them in a cluster). Achieved almost nothing with the models I actually managed to run locally, but yeah, setup ate a ton of my hours. At my full hourly rate, it was damn expensive.
Best thing I’ve done in the last 6 months happened last weekend. After more than a year using Claude and GPT as daily drivers for my SWE job (20y experience, Staff Eng), I decided to put $10 on DeepSeek and try to use it for my job and personal stuff. It did the same thing as the others; the difference was barely noticeable. So far, I have spent a grand total of $1.21 of those $10.
One dollar, twenty-one cents. 3 days of normal use. DSv4 is 75% off currently, but even at normal price it would be $3.60.
I challenge anybody to put $10 there and try to burn as fast as they can.
I’m just happy Apple hasn’t released that MacStudio M5 Ultra 1TB RAM cause I would have wasted some 10-15 grand there before this realization.
gladfelter@reddit
That's one of the really cool things about agents for me: guided learning. Docs can be painful for me because I'm reading instead of acting, and when I want to do something I don't want to stop all forward progress and read the entire universe of documentation and wait a week for answers to questions at stackoverflow (only to be scolded for not reading the one obscure leaf doc that indirectly answered my question if I had fully processed an encyclopedia's worth of other docs.) Seeing the tools in motion works so much better for me and my personality.
redpandafire@reddit
I got fed up when the documentation for libraries I was using was missing and was only found in the comments of the code itself...
somerussianbear@reddit
Whatever helps you sleep/justify that 10-grand expense to your partner, but 200M tokens is more like $10 on the DeepSeek API. Their cache is insane; it saves you tons.
In 6 months you’d have spent some $100-200 on the API. Your ROI becomes a hard sell when it jumps to 5+ years in the best case scenario.
And don’t forget that you’re creating work for it. It’s not like you couldn’t live without it before; you could. It’s just that now you have that inference, you want to use it for everything, so you get the idea that you’re super productive with it.
Work never ends. We always find something else to do when we find resources, and these things are not strictly necessary.
johnkapolos@reddit
That's 460 tokens per second, every second, for 5 days.
Obviously not.
If this is your setup's idea of a proper answer to a simple task, imagine what else it gets wrong.
eli_pizza@reddit
Pretty sure using the same open weight model hosted is cheaper than my electricity cost let alone amortization of the hardware. But all the other points stand.
Ill_Barber8709@reddit
200 million tokens in 5 days? That's 40 million per day, or 463 tokens per second.
Are you sure about your math here? That seems like a lot. Even for the smallest local model you could find.
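That implied rate, checked quickly; as several replies below point out, it only makes sense as prefill plus decode combined, not pure generation:

```python
# Implied sustained rate if all 200M tokens flowed through in 5 days straight.
tokens = 200e6
seconds = 5 * 24 * 3600      # 5 days in seconds
rate = tokens / seconds
print(f"{rate:.0f} tokens/s")   # ≈ 463 tokens/s, round the clock
```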
Badger-Purple@reddit (OP)
Second bot that posted this. You realize that API costs include input tokens too, right?
Ill_Barber8709@reddit
What makes me think I'm a bot? Dude, is it forbidden to be wrong now? Go touch some grass...
Besides, output tokens and input tokens don't cost the same, so your math still sucks.
datbackup@reddit
Hey just a friendly fyi, i’m reading your various comments in this thread and it’s actually you that comes across as needing to touch grass
Badger-Purple@reddit (OP)
blended token api price
Ill_Barber8709@reddit
And? I use Claude with a Max subscription, where input and output tokens are counted separately.
So, your math still sucks.
Badger-Purple@reddit (OP)
Ill_Barber8709@reddit
LOL. That doesn't prove anything, does it?
power97992@reddit
It is the blended token usage. His prefills are fast but his decodes are slow (probably under 20 tk/s).
Late-Assignment8482@reddit
In the last week:
* My company's subscription went API-only, meaning I can barely use it without maxing out, and it's $250/mo
* Claude got way worse at using the file-write tool on my own sub
* Nothing good happened
And we're still in the honeymoon of below-cost tokens!
GLM-4.7 and Qwen3.5-122B got better at what I need them for, because they're fixed points I can improve prompts/harnesses on without sudden backsliding.
XccesSv2@reddit
If you were counting your money, you'd spend $10/month on a cheap coding plan from z.ai or MiniMax and get the same thing with a better model.
Badger-Purple@reddit (OP)
no thanks, I don’t want data sent to China!!
XccesSv2@reddit
Yes, that's another reason, but your post is about money. From that perspective alone, local AI is never worth it.
SLxTnT@reddit
You own the hardware. That also means you can sell the hardware. Unless the price of your hardware plummets, the overall cost would be far lower.
entsnack@reddit
hardware prices always go up! always! I took money out of SPY and bought 1 TB DDR5
SLxTnT@reddit
Why the stupid joke? OP said 6 months to pay off hardware, but you still own the hardware. That has a value you don't get with a subscription.
entsnack@reddit
I still own the Titan X Pascal that I started my AI career with. I can assure you its value is a rounding error.
My 4090s, DGX Sparks, A100s, and H100s have retained their value so far, but I'm not deluded to think this is going to continue once we reach our new GPU equilibrium.
Nepherpitu@reddit
Well, for some sanctioned jurisdictions running locally is cheaper and more stable.
Badger-Purple@reddit (OP)
I mean, you do you. I feel it’s not the main motivation, but when it makes some sense financially, it’s better justification for my wife :D
ea_man@reddit
Well, I guess you wouldn't buy all 200M tokens from the most expensive SOTA; you'd do most of those with a cheaper option, just as I don't use Qwen 27B at max specs with reasoning for every task.
Badger-Purple@reddit (OP)
I could not get a model below 120 billion params to do sysadmin work well. 27B is too slow, but to your point, I do use it for certain tasks that can be one- or few-shot. But only because I asked the fatter Qwen to optimize a vLLM fork that lets me run skinny Qwen within a 24GB GPU (Lorbus release, int4 AWQ) with a draft model included!!
entsnack@reddit
you're missing out not having a second spark, that CX7 interconnect is what you're really paying for
Badger-Purple@reddit (OP)
I said Spark cluster!
entsnack@reddit
oh nice! amazing, I love mine.
SangerGRBY@reddit
Is this comparison accurate?
What if your agent's usage were on a 5x or 20x plan?
howardhus@reddit
There is no black and white.
Your crappy local models will never be as good as commercial models.
For coding, nothing will beat Opus and whatnot.
Someone coding will spend 200 million tokens locally for subpar results when a commercial model will use some 50 million for better quality (my purely anecdotal experience here!).
BUT, as others have said: privacy.
You can trust local models to handle private data like passwords (which even commercial providers advise not to trust commercial models with), private letters, or your spicy pics that you don't want resurfacing on the internet if there is ever a leak at anthreminiAI (and I am 100% sure at some point there will be one).
Also, if you really have a personal assistant doing little chores, local is enough.
a_beautiful_rhind@reddit
Deepseek, kimi and mistral count as local models and people pay for those. You don't need opus for everything and what you get as "opus" isn't a fixed factor.
Doesn't anthropic keep pulling back the subs?
howardhus@reddit
You're talking prices; I'm talking quality. Fact is: commercial models are currently better than any local model.
On the $200 sub I haven't noticed anything bad yet.
a_beautiful_rhind@reddit
I also get bored of using the same model all the time. Quality is comparable on different tasks.
Badger-Purple@reddit (OP)
I don’t know if you are trying to agree or disagree here, calling them crappy. Who calls Fat Qwen and Minimax crappy?
howardhus@reddit
I literally put it in the first line... it's not about "agree" or "disagree". Also, no point in being a local-model fanboi.
I've done local coding since Qwen 7B was the best coding model available. But I also have the $200 sub and use it extensively. And I have (like most of us in this sub) hoarded and tried out several terabytes of models filling up my hard drives... the difference is night and day. All the people posting "yeah, Qwen 3.6 is as good as Opus" were delusional or using it for simple tests like "give me a snake game". The context window alone is ages beyond what we do locally; even our 256k isn't really achievable on local hardware.
So I DO call Qwen crappy in comparison to what Opus and other commercial models can do. BUT it's free and promises privacy, which is priceless. Basically I'm repeating all the points from my earlier comment for some reason...
Re-read my comment without trying to fit me into a "for me or against me" box.
llama-impersonator@reddit
opus addicts are obnoxious
davidy22@reddit
Going on a sub called localllama to make this kind of post, feeling brave today are we? Same energy as someone going to r/dogs to make a post about why they like dogs. Wake me up when you drop this same post in r/claudeai or something.
species__8472__@reddit
Local models won't have sudden "updates" that make them worse. They don't send all of your chats to tech companies for analysis. There's way more variety with local models to suit your needs. You don't need an internet connection. And of course, they are free.
Badger-Purple@reddit (OP)
Linus Torvalds said once he prefers open source software because it’s like sex…some things are better when they’re free.
Whoz_Yerdaddi@reddit
And you're probably less likely to catch a virus.
Equivalent-Repair488@reddit
With the supply chain attacks as of recent, remember "don't be a fool, wrap your tool" (in a docker container)
Mac_NCheez_TW@reddit
Maybe not a virus but a vulnerability 🤣 or head ache.
philmarcracken@reddit
I fucking despise ads. I know they're coming, if they haven't already. I block every single form of them I possibly can. That's my only reason; I have more than enough money to pay for subs (probably because I'm not being a mindless consumer drone).
power97992@reddit
DS v4 Flash right now is 0.28¢ per million for cache reads. Even blended, it won't be more than 2 bucks per 100 million tokens (90M cache reads, 9M fresh input, 1M output), so <$60/month. Even if you use a more expensive provider, it won't be more than $122/month.
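A sketch of that blended estimate, using the comment's 90/9/1 split; the fresh-input and output rates below are illustrative assumptions, not published prices:

```python
# Blended cost per 100M tokens: 90M cache reads, 9M fresh input, 1M output.
CACHE_READ = 0.0028   # USD per million tokens (0.28 cents, per the comment)
FRESH_IN = 0.10       # USD per million (assumed rate)
OUTPUT = 0.40         # USD per million (assumed rate)

cost_per_100m = 90 * CACHE_READ + 9 * FRESH_IN + 1 * OUTPUT
monthly = cost_per_100m * 12   # ~1.2B tokens/month at the OP's pace
print(f"${cost_per_100m:.2f} per 100M tokens, ~${monthly:.0f}/month")
```

Even with the assumed rates padded generously, the blended figure stays well under the "2 bucks per 100M" ceiling, because cache reads dominate the mix.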
ProfessionalJackals@reddit
https://platform.xiaomimimo.com/docs/en-US/tokenplan/subscription
Beats Flash in performance and price... Pro is $0.2/million tokens (on the cheap $6 subscription; cheaper on the higher tier). Non-Pro is $0.1/million tokens.
Where do you get 1.2B × $0.28 = $24/m??? I think you mean $336 for 1.2 billion tokens.
https://openrouter.ai/deepseek/deepseek-v4-flash
The_2nd_Coming@reddit
Yes, I used 800M+ tokens in the last 4 days and spent less than $10. :D
power97992@reddit
Mostly cache then
The_2nd_Coming@reddit
Yes, exactly. It was costing me quite a bit until input caching worked.
DeProgrammer99@reddit
Running Qwen3.6-27B costs me $1.50 per million tokens (non-parallel, no speculation, when my solar panels aren't generating). I live in an area where the electricity cost is barely above-average relative to the whole US. But there are much more efficient GPUs and models and parallelism/speculation...
rpkarma@reddit
For me, it’s $0.22 per million tokens! The GB10 is crazy efficient I guess haha
Badger-Purple@reddit (OP)
It’s a fantastic model. I have it running on a 140W GPU, with optimizations based on recent repos. Skinny Qwen is the reviewer to Fat Qwen’s work
Budget-Juggernaut-68@reddit
>I know I am preaching to the choir…So the question is, has anyone else felt their rig is becoming more sustainable now than 6 months ago, price wise? Would love to hear!!
When API/subscription prices go high enough to make this make sense, I'll switch. Otherwise, ain't nobody got money for that.
a_beautiful_rhind@reddit
By then the hardware will be priced up or unavailable. Everyone will be thinking the same way.
Badger-Purple@reddit (OP)
This is what the problem was 6 months ago, and the plunge was hard. But now it would be harder, with CPUs being next on the chopping block and Nvidia releasing the 3060 “refresh” this summer, AMD with their new iteration of the SoC being only marginally better, Intel cancelling the consumer GPU line beyond battlemage…
jacek2023@reddit
From a purely pragmatic point of view, the main reason to use local AI TODAY is to prepare for the moment when cloud based AI becomes too expensive to use, and people addicted to the cloud wake up with their pants down.
a_beautiful_rhind@reddit
A year ago, free inference was plentiful. Now there is less and less. Deep shit is going to happen sooner than later.
I imagine that commercial customers will at some point get priority while everyone else huddles around error 429 or gets shunted to "flash" models.
darktotheknight@reddit
Legislation in Germany finally allows installing 800W solar panels on our balconies without any further permits. You can install them on your own, plug them into a standard wall socket, and they cost like €250 with everything included.
If you're not running it 24/7, but only during sunlight, guess what: your local LLM is basically running for free. If you really consume 800W, these puppies pay for themselves in a few months.
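A rough payback sketch for the balcony panel; the effective sun-hours figure is an assumption, and the tariff is the €0.30/kWh quoted earlier in the thread:

```python
# Balcony solar payback, rough numbers.
PANEL_COST = 250      # EUR, everything included (per the comment)
PANEL_WATTS = 800
SUN_HOURS = 4         # assumed effective full-output hours per day
TARIFF = 0.30         # EUR per kWh (rate quoted earlier in the thread)

kwh_per_day = PANEL_WATTS / 1000 * SUN_HOURS   # energy offset per day
savings_per_day = kwh_per_day * TARIFF
payback_days = PANEL_COST / savings_per_day
print(f"Pays for itself in ~{payback_days:.0f} days")
```

With fewer winter sun-hours or a cheaper tariff, that stretches toward the "worst case, 1 or 2 years" mentioned in the reply below.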
bgravato@reddit
and do you get the solar panels for free too?
darktotheknight@reddit
I literally pointed out what they cost right now and that they pay for themselves in a few months - or worst case - within 1 or 2 years.
Badger-Purple@reddit (OP)
This is true. But it’s not a fool’s errand anymore
Tema_Art_7777@reddit
Privacy is the #1 reason
mr_zerolith@reddit
Privacy and knowing your client's code isn't being leaked and trained on is priceless.
Spent $13k on hardware to serve a dev team of 8 and i don't regret it
Schlick7@reddit
If you're doing agentic work, then a large portion of those tokens would have been cached and therefore much cheaper.
UnarmedRespite@reddit
Don't forget that you can also use the wattage for heating.
(yes it's less efficient than a heat pump or something, but you're getting dual-use)
bgravato@reddit
and undesirable in the summer... (or spring and autumn as well where I live... we only need heating in the 3 winter months)
Ok-Measurement-1575@reddit
You're not getting 100 million tokens/day out of a spark.
Maybe 2, if you're lucky :D
Iory1998@reddit
Your post just confirms what many already realized: the future is neither local nor proprietary. The future is a new architecture that doesn't need millions of tokens to install an app on a computer.
Kahvana@reddit
I was blown away back in March 2025 when running Mistral Nemo 12B Q4_K_M for the first time. Then I got blown away by running Mistral Small 3.2 24B Q4_K_M in June 2025... and now again, March/April 2026.
The jump in intelligence and capabilities from 2025 to 2026 is staggering. I can actually use Qwen3.6-35B-A3B as a solid Claude Haiku 4.5 replacement, Gemma4-31B for decent-quality translations into my native language (Dutch), have good local OCR, and more.
The money I've spent on 2x RTX 5060 Ti 16GBs was well worth it for that year alone, in terms of "free upgrades" in intelligence, the lack of worry about recurring payments, the low latency, control, and privacy... but most importantly the journey of learning it all.
Hopefully I can get solid RAM upgrades for my current system in 3 years or so, I'm expecting having to wait 5 years. But until then, I am quite content with my setup.
FlyingDogCatcher@reddit
The latest batch of gemmas, qwens, and nemos has me feeling pretty good about my mini ryzen strix pc with a 7600xt attached. There's tradeoffs, but it works.
Nieles1337@reddit
There is running huge models on very expensive hardware locally, and there is running medium models on hardware you need anyway. I bought the 64GB Framework for 1800,-, a budget I had for my aging system that needed an update anyway. It runs the MoE models fine, and the output is decent enough for me. So for me it costs nothing extra.
Badger-Purple@reddit (OP)
Good for you! The Strix Halo is a fine PC. Currently, my investment in the DGX cluster (5000 all in) is less than getting a second Strix Halo. But when the Bosgame was 1600 it was a deal.
power97992@reddit
The DGX Spark is so slow at decoding. Get an RTX 6000 Pro or H100?
Badger-Purple@reddit (OP)
I have 2 Sparks; decode is about 30 tps on Qwen 397. It works fine for agents. I got the Sparks for 5k together; the 6000 Pro is 9.6k and slows down to the same, because you can't load a 397B model in 96GB. So, apples to apples, I'm happy with my setup.
power97992@reddit
You can load 96 GB of params, including all the active params, onto a 6000 Pro and the rest onto the DGX Sparks; it will be much faster.
ttkciar@reddit
Yep, this. I did end up buying a few GPUs, but for the first few years I simply ran llama.cpp on the Xeons already in my homelab.
There's a reason there is so much interest in models which can infer on a smartphone: that is the hardware most people already have.
Badger-Purple@reddit (OP)
I'd rather be on the edge of tech, and then land down in the valley where the iPhone's 2B model is also installing an obscure fork of a repository and adding features I want that aren't available, all while I do my actual work, it schedules my calendar, and it summarizes my emails. Those will be the days.
MotokoAGI@reddit
You are lying, and folks in this forum keep falling for this crap.
200 million tokens in 5 days is 40 million tokens a day.
40 million tokens a day is roughly 462 tokens a second: non-stop, every second, for 24 hours, and that's without prompt processing. You were not generating 462 tokens a second running Qwen3.5-397B.
You're a bald-faced liar.
power97992@reddit
I think he is talking about input tokens / prefills and decode tokens together …
Badger-Purple@reddit (OP)
wtf is this hallucination. can you tell me how to make the perfect pancakes?
Badger-Purple@reddit (OP)
The other 33% was minimax m2.7
entsnack@reddit
lmfao
darktotheknight@reddit
And they don't disappear or degrade after 6 months. You still own them, they're still worth some money.
DataGOGO@reddit
It is a hobby, not a viable alternative
Badger-Purple@reddit (OP)
Alternative to…?
DataGOGO@reddit
Subscriptions / API's
Run an agent to do... what?
Badger-Purple@reddit (OP)
Whatever I want, why would that matter?
ravage382@reddit
I think when people hear "agent", they assume you mean some OpenClaw pretending to be a person, doing quirky things on the internet.
I think you mean an agent for agentic system tasks / sysadmin work. That's exactly what mine is doing, and it's already burned nearly 300M tokens since I started tracking it.
DataGOGO@reddit
no, I was trying to ask him what he means by "agent".
Badger-Purple@reddit (OP)
An LLM with tools and a task on a loop. Simplest definition possible.
DataGOGO@reddit
so a prompt.
Badger-Purple@reddit (OP)
no, a system prompt is not the same. This is Andrej Karpathy’s definition…do you have a better one?
Badger-Purple@reddit (OP)
This. Main use of Hermes right now. I also have an agent functioning as an ambient scribe for a medical practice, using Parakeet to pyannote to Qwen, with subagents for coding/billing and fact-checking. My current hobby is making a final agent that reviews the output and creates finetuning data from it, hopefully to apply a LoRA on a smaller model to do the note-writing task.
DataGOGO@reddit
sorry, let me clarify, what do you mean by "agent"
Badger-Purple@reddit (OP)
An LLM with tools and a task on a loop
DataGOGO@reddit
ok, so what did you mean by this?
Not trying to be difficult, I'm just trying to understand your requirements.
Badger-Purple@reddit (OP)
Meaning I’d want to have an API-linked agentic harness, so I’d pay the cost in tokens for whatever subtasks and requests the agent generates.
UnethicalExperiments@reddit
Been a hobbyist since the late 80s as a kid. The fun is getting the hardware to work the way you want. Work out the kinks, and do stuff you didn't know was possible before .
This isn't quite the multivac I first read about, but it's sure fun as hell to play with.
Badger-Purple@reddit (OP)
Agreed!!
Its_Powerful_Bonus@reddit
RemindMe! 2 days
RemindMeBot@reddit
I will be messaging you in 2 days on 2026-05-07 21:01:40 UTC to remind you of this link
braydon125@reddit
Not even close. I feel like an early prospector, and I haven't found any gold. Sure, a little nugget here and there, but I'm going to go bankrupt. Hopefully I can keep my GPU.