Apple silicon costs more than OpenRouter: an analysis
Posted by boutell@reddit | LocalLLaMA | 47 comments
I am not the author.
My two cents: I'm not suggesting we don't all know local AI is expensive, at least for now. The math gets interesting if OpenRouter providers are burning investor cash and it runs out, or we take into account hardware we use for other purposes, or privacy is a primary motivation.
And... inference providers resold by OpenRouter ARE burning investor cash. I would have thought they would have little motivation to do so on OpenRouter, but if they are model creators then they want to promote their model. If they aren't, it's still a place to dump excess capacity at a reduced loss. And none of the above will last forever.
In the meantime, it's a helluva hobby.
Fit-Produce420@reddit
Also providers will sub in quantized models if they think they can.
TheTerrasque@reddit
Or not handle, for example, tool calls. I've had several providers just stop working and return errors when OpenAI API native tool calls were tried.
Fit-Produce420@reddit
$$$ again. Tool calls burn tokens.
FullOf_Bad_Ideas@reddit
Your local model is likely quantized too. On average, I think cloud providers offer less quantized options than what people running locally end up with.
boutell@reddit (OP)
Not sure why this is downvoted, although an important difference is knowing versus wondering how bad it is!
TheGuy839@reddit
You answered your question yourself
FullOf_Bad_Ideas@reddit
A lot of the time providers share the info about the quant they're running, or you can find out by contacting support. If you're using a single model from a single provider, a lot of the time I don't think it's a big issue, as you can check for quality. The worst they'd run is an NVFP4 quant, which can be pretty bad, but probably still better than a Q2 GGUF quant.
Fit-Produce420@reddit
"A lot of times" is carrying your entire statement. Many times, they don't.
Sometimes I am pretty sure they are underserving me, because the output on even a new context will be garbage
Fit-Produce420@reddit
You don't know my models, bud.
Also, they're the same quantization when they start and finish.
alsencon@reddit
A lot of AI feels “cheap” right now because investor money is keeping the prices artificially low. The real reality check might come when the free money dries up
ketosoy@reddit
it usually costs more for electricity to do inference than to get the same tokens from the deepseek api.
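The electricity side of that comparison can be sketched in a few lines. This is only an illustration: the power draw, generation speed, and electricity price below are hypothetical placeholders, not measurements from the thread, and the result should be compared against whatever API price is current.

```python
def electricity_cost_per_million_tokens(
    watts: float, tokens_per_second: float, usd_per_kwh: float
) -> float:
    """USD of electricity needed to generate one million tokens at a steady rate."""
    seconds = 1_000_000 / tokens_per_second
    kwh = watts * seconds / 3_600_000  # watt-seconds -> kilowatt-hours
    return kwh * usd_per_kwh


# Hypothetical inputs: a 300 W machine generating 20 tokens/s at $0.30 per kWh.
cost = electricity_cost_per_million_tokens(
    watts=300, tokens_per_second=20, usd_per_kwh=0.30
)
print(f"${cost:.2f} of electricity per million tokens")
```

Faster generation or batching lowers the figure proportionally, which is why single-stream local inference is the worst case for this metric.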
graypasser@reddit
Deepseek is nothing but an exception, to be fair.
ea_man@reddit
I mean you could use solar panels at home :)
BahnMe@reddit
Those aren’t free and neither is installation.
Thistlemanizzle@reddit
Jesus. Deepseek is cheaper than electricity? That ends the capital investment conversation out of the gate.
opezdol@reddit
You can still sell your mac after 3-5y.
boutell@reddit (OP)
I was about to agree, but the original author explains that he's offering these various time frames as essentially how long it might take to burn the machine out doing this kind of work all day all the time. You can pick the one you find most plausible.
graypasser@reddit
Don't burn the machine out, really.
Anything that is working at 100% load is like 300% less efficient than 50% load or something.
Front_Eagle739@reddit
Then you power limit slightly to reduce temps and wear and see barely any decrease in llm performance while wear decreases exponentially
po_stulate@reddit
Is there a way to power limit a macbook slightly or is it only for mac studios? Power saving mode limits it to about 30w and for LLM inference that's half speed. Would be nice if there's a way to limit it to around 90w.
Deep90@reddit
IMO a 3-5 year commitment is rough when you consider the performance of a mac on just models released today.
Kahvana@reddit
Yeah... I really hope we get one, or if we're lucky two, more years of fantastic improvements before the investor money dries up for these companies. Having a really good gemma 5/6, qwen4/5 and deepseek v5/6 flash would be really rad.
It's clear that this can't go on forever, so let's enjoy the show while it lasts.
ea_man@reddit
Maybe they will sell LLM in hardware as accelerators that you can plug in your pc.
Kahvana@reddit
Not opposed to slottable NPU accelerators, but burning models into hardware is not my thing. I like being able to swap models instead of having to purchase whole new cards.
graypasser@reddit
I doubt burning models into hardware is anyone's thing for today, as they are moving too fast to make such things.
Puzzleheaded_Base302@reddit
OpenRouter providers run at high concurrency, while local AI mostly runs at a concurrency of 1. That is why OpenRouter is cheaper than local AI.
sheppyrun@reddit
The cost comparison only works if you're running models 24/7. For sporadic use, OpenRouter wins because you're not amortizing hardware depreciation and electricity across idle time.
But the real variable everyone misses is context window. A 128K context on a Mac Studio uses the same power as a 4K context. On API billing, you're paying for every token in and out. If your workflow involves large codebases or long documents, the local cost curve flattens fast.
Also, OpenRouter pricing isn't stable. Provider rates change, models get swapped, and free tiers get throttled. Local hardware is a fixed cost with predictable performance.
It depends on whether you value cost predictability or cost minimization.
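A rough sketch of the amortization argument above: the hardware price is spread over only the hours the machine actually spends generating, so low utilization inflates the local cost per token. All inputs are hypothetical, and the model ignores idle power draw and resale value.

```python
def local_cost_per_million_tokens(
    hardware_usd: float,
    lifetime_years: float,
    utilization: float,  # fraction of the lifetime spent generating, 0..1
    tokens_per_second: float,
    watts: float,
    usd_per_kwh: float,
) -> float:
    """Amortized hardware plus electricity cost per million generated tokens."""
    busy_hours = lifetime_years * 365 * 24 * utilization
    millions_of_tokens = busy_hours * 3600 * tokens_per_second / 1e6
    electricity_usd = watts / 1000 * busy_hours * usd_per_kwh
    return (hardware_usd + electricity_usd) / millions_of_tokens


# Hypothetical machine: $4,000, 4-year life, 20 tokens/s, 100 W, $0.20 per kWh.
always_on = local_cost_per_million_tokens(4000, 4, 1.00, 20, 100, 0.20)
sporadic = local_cost_per_million_tokens(4000, 4, 0.05, 20, 100, 0.20)
print(f"24/7: ${always_on:.2f}/M tokens, 5% duty cycle: ${sporadic:.2f}/M tokens")
```

The electricity term per token stays constant, while the hardware term scales with 1/utilization, so sporadic use pushes the comparison toward the API.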
mohelgamal@reddit
There is another thing to take into account.
This math works if the entire purpose of buying the computer was to run AI and nothing else, but people need computers for other purposes. so you need to take into account money spent on just having a computer.
That’s really the big benefit. I already have my MacBook Pro, so giving someone else money to run queries while my processors sit idle doesn’t make sense.
Aware-Ad9831@reddit
Cloud inference is backed by VC for now -- and local hardware is overpriced by people who are trying to oversmart the market.
The key to local inference being cheap is owning hardware before it becomes popular.
ea_man@reddit
In a few years we may get LLMs hardware accelerated the way video codecs are now in CPUs / GPUs, with ultra efficient and fast speeds. It's kinda weird to run inference on these power hungry GPUs actually, it's the early days.
Aware-Ad9831@reddit
Remember when touchscreen phones were a novelty and mostly useless?
They are cheap and reliable, but it took us a while.
The difference with LLMs is that here people believe they don't just buy an experimental toy, but that they unlock some magical productivity.
BobbyL2k@reddit
This isn’t that surprising. Token-wise, OpenRouter should definitely be cheaper. The inference providers are optimizing for cost to maximize their profits. If running Macs were somehow cheaper than an NVIDIA cluster, the inference providers would switch to Macs, and NVIDIA would not be the massive company it is today.
People are speculating that inference providers are burning investment money. I don’t see why that would be useful. Provider switching is extremely easy. Maybe some are losing money temporarily as they’re in the process of tuning and optimizing their system. Maybe some are losing money off peak time but make up the loss during peak hours. The profit margin might be slim. But they are not losing money as a whole.
Now that’s not to say that labs training the models are making the cost of training back by selling tokens. Those are definitely still losing money.
bhabani_coder@reddit
36k tokens per hour? That's more like a per-minute requirement. Then your Mac can do so many more things in parallel, like running the agent or browsing.
FullOf_Bad_Ideas@reddit
Single stream inference is bad with tokenomics unless kv cache hit on OpenRouter providers is expensive for your model and you do a lot of small tool calls.
I translated about 10,000,000,000 tokens locally this weekend in about 30 hours at the cost of $30. With DeepL it would cost me 1,250,000 USD, and with Google Translate's inferior quality it would be about 1,000,000 USD. With the cheapest OpenRouter Llama 3.1 8B model I could find quickly, it would be 0.02 USD per M input and 0.05 USD per M output. So, 700 USD. Batching could get it down a bit, and renting GPUs would bring it down lower. Still, I think that's a decent saving.
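The OpenRouter figure can be reproduced with a short calculation. It assumes roughly 10 billion tokens in each direction (translation being close to 1:1), which is an interpretation of the comment rather than something it states outright; under that assumption the quoted per-million prices give the 700 USD total.

```python
def api_cost_usd(
    input_tokens: int,
    output_tokens: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
) -> float:
    """Cost of a job billed per million input and output tokens."""
    return (
        input_tokens / 1e6 * usd_per_m_input
        + output_tokens / 1e6 * usd_per_m_output
    )


TOKENS = 10_000_000_000  # assumed per direction, input and output alike
LOCAL_COST_USD = 30.0    # figure quoted in the comment

openrouter_cost = api_cost_usd(TOKENS, TOKENS, 0.02, 0.05)
print(f"OpenRouter: ${openrouter_cost:,.0f}")
print(f"Local is about {openrouter_cost / LOCAL_COST_USD:.0f}x cheaper")
```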
Miserable-Dare5090@reddit
And you can run Llama 8B on an iPhone.
FullOf_Bad_Ideas@reddit
yes, but it would take about 16 years to run those tokens instead of 30 hours.
Miserable-Dare5090@reddit
Actually runs pretty fast. Have you tried small models on the phone? If it’s 5Gb or so, it will run and will be decently fast. Not the same but just sayin’
FullOf_Bad_Ideas@reddit
Yes I am running LLMs on my phone. 2B to 34B, dense and MoEs. I have 16GB of RAM on my phone.
I assumed that an iPhone might generate about 20 t/s. It would take about 16 years to generate 10 billion tokens. I actually calculated that before responding to you.
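The 16-year estimate is straightforward to check, along with the aggregate throughput implied by finishing 10 billion tokens in 30 hours. This sketch treats the 10 billion as generated tokens and uses the 20 t/s phone speed assumed in the comment.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def years_to_generate(total_tokens: float, tokens_per_second: float) -> float:
    """Wall-clock years needed to generate total_tokens at a constant rate."""
    return total_tokens / tokens_per_second / SECONDS_PER_YEAR


def aggregate_tokens_per_second(total_tokens: float, hours: float) -> float:
    """Average throughput needed to finish total_tokens within the given hours."""
    return total_tokens / (hours * 3600)


TOTAL_TOKENS = 10_000_000_000

phone_years = years_to_generate(TOTAL_TOKENS, tokens_per_second=20)
rig_rate = aggregate_tokens_per_second(TOTAL_TOKENS, hours=30)
print(f"Phone at 20 t/s: {phone_years:.1f} years")
print(f"30-hour run: {rig_rate:,.0f} tokens/s across all streams")
```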
Miserable-Dare5090@reddit
I see, you need very fast decode throughput not prefill.
FullOf_Bad_Ideas@reddit
yeah, I need both. I am running 1024 concurrent streams spread over 8 3090 Tis. Translation is more or less 1:1 symmetrical when it comes to input and output.
BumbleSlob@reddit
My math suggests that a $12k Mac Studio running inference 24/7 of Qwen 397B becomes more economical than sending the same requests to Anthropic API Sonnet in about 5 months.
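A break-even claim like this reduces to hardware price divided by net monthly saving. In the sketch below the $12k price is from the comment, while the monthly API spend and local running cost are hypothetical values chosen only to show what a 5-month break-even would imply (a net saving of $12,000 / 5 = $2,400 per month).

```python
def break_even_months(
    hardware_usd: float,
    api_usd_per_month: float,
    local_running_usd_per_month: float = 0.0,
) -> float:
    """Months until avoided API spend pays for the hardware (inf if it never does)."""
    net_saving = api_usd_per_month - local_running_usd_per_month
    if net_saving <= 0:
        return float("inf")
    return hardware_usd / net_saving


# Hypothetical: $2,500/month of avoided API spend, $100/month of electricity.
months = break_even_months(12_000, api_usd_per_month=2_500, local_running_usd_per_month=100)
print(f"Break-even after {months:.1f} months")
```

The result is very sensitive to how many tokens the machine really produces per month, so the API spend input deserves the most scrutiny.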
I also like to point out to people that a lot of us are terrified of finding out we spent hundreds or thousands of bucks on a rogue LLM agent, so we are extra mindful about how inference is deployed, versus just having your own machine you can YOLO to your heart's content.
a_beautiful_rhind@reddit
3d printing is more expensive than getting something made. growing vegetables will never work out.
Economies of scale and all.
A year ago we did have all that free inference from everyone and their mother. If you rode that and skipped getting hardware, look at what happened with prices.
d70@reddit
I’m skeptical. Feel like there is no way OR can be cheaper if you run almost 24/7 for a year. You also use that MacBook Pro or whatever for other things too
Ok_Technology_5962@reddit
I see that the analysis is a bit wrong. It doesn't take into account agentic tasks. When you run an agent, the bottleneck is not output speed but how many back and forth tool calls you do, thus reusing the KV cache. Example: "do research on what price is X right now and give me the links". This will result in tens if not hundreds of tool calls. Every time, the LLM will write maybe 1 line of code, read the data at 400 tps (or more for a small model, but a q8 MiniMax or 300B model is at that speed for pp), and then write another one. This results in millions of tokens sent back and forth, not 36k.
I already almost went broke when I forgot to switch from OpenRouter to local for a request. Just used 10 bucks and stopped before I noticed.
For every step change in use you will have an exponential requirement for tokens, from just chat to agents, and next will be OS level use of always-on multiagent frameworks with multi-token prediction, speculative decoding and speculative prefill. By my own analysis, using all the advances, in one month I'll be able to use half a billion tokens. Yes, billion... Good luck all. Btw (my current token tracker shows 700 million over 2 months)
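The token growth described above can be illustrated with a small model of an agent loop in which every tool-call turn resends the whole conversation so far. The system prompt size and tokens added per turn are hypothetical, and whether repeated context is billed in full or discounted through caching depends on the provider.

```python
def agent_prompt_tokens(
    system_tokens: int, tokens_added_per_turn: int, turns: int
) -> int:
    """Total input tokens processed when each turn resends the full history."""
    total = 0
    context = system_tokens
    for _ in range(turns):
        total += context  # the whole context is sent as the prompt this turn
        context += tokens_added_per_turn  # tool output and reply are appended
    return total


# Hypothetical agent run: 2,000-token system prompt, 1,500 tokens added per turn.
total = agent_prompt_tokens(system_tokens=2_000, tokens_added_per_turn=1_500, turns=100)
print(f"{total:,} prompt tokens over 100 tool calls")
```

Because the context grows each turn, the total grows roughly with the square of the number of turns, which is how a single research request can reach millions of tokens.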
Miserable-Dare5090@reddit
This, why are people not understanding this point? Agents churn millions of tokens. We moved on from the silly chatbots, this is where the real gains are when you go local.
Betadoggo_@reddit
It's not surprising that providers running these models on large systems designed for high throughput and low power draw (relatively) are able to provide tokens for cheaper than local hardware. The real benefit of local is privacy, control, and having a system that's uninterruptible.
_FlyingWhales@reddit
Personally, i think the value of local execution lies in privacy, consistent quality and education. Providers on openrouter often have terrible reliability and quantize models.