DeepSeek V4 being 17x cheaper got me to actually measure what I send to cloud vs what I could run locally. the results are stupid.
Posted by spencer_kw@reddit | LocalLLaMA | 80 comments
That foodtruck bench post showing deepseek v4 matching gpt-5.2 at 17x cheaper got me thinking. if frontier cloud models are that overpriced for equivalent quality, how much of my daily work even needs cloud at all?
Ran my normal coding workflow for 10 days. every task got logged: what it was, tokens in/out, whether local qwen 3.6 27b (on a 3090) could have done it. didn't use benchmarks, just re-ran a random sample of 150 tasks on both.
results:
- file reads, project scanning, "explain this code": local matched cloud 97% of the time. this was 35% of my workload. paying for cloud here is genuinely throwing money away.
- test writing, boilerplate, single file edits: local matched 88%. another 30% of tasks. the 12% misses were edge cases i could catch in review.
- debugging with multi-file context: local dropped to 61%. cloud still better but not 17x-the-price better. about 20% of my work.
- architecture decisions, complex refactors across 5+ files: local at 29%. cloud genuinely needed here. only 15% of my tasks.
So 65% of my daily coding work runs identically on a model that costs me electricity. another 20% is close enough that I accept the occasional miss. only 15% actually justifies cloud pricing.
Started routing by task type. local for the first two buckets, cloud for the last two. my api bill went from $85/month to about $22 and the 3090 was already sitting there mining nothing.
The deepseek post is right that the price gap is insane but the bigger insight is that most of us don't even need cloud for most of what we do. we're just too lazy to measure it.
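for the people asking how the routing works in practice: before i moved to a hosted router i did it with a dumb by-hand rule, roughly the shape below. the keyword lists are illustrative, not literally what i ran — the real buckets came out of my task log.

```python
# rough shape of the by-hand task-type routing I started with.
# Keyword lists are illustrative; tune them from your own task log.
LOCAL_HINTS = ("explain", "read", "scan", "test", "boilerplate", "edit")
CLOUD_HINTS = ("debug", "refactor", "architecture", "design")

def route(task: str) -> str:
    t = task.lower()
    if any(k in t for k in CLOUD_HINTS):
        return "cloud"   # multi-file debugging, big refactors
    if any(k in t for k in LOCAL_HINTS):
        return "local"   # qwen 3.6 27b on the 3090
    return "local"       # default to cheap, escalate by hand on a miss
```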
Virtamancer@reddit
Why use AI to write a post and title but intentionally fabricate grammar issues to appear like a lazy person wrote it?
jazir55@reddit
Because people genuinely believe others are gullible enough to fall for it. Fake-lowercasing the first word of your sentences is actually a giveaway, not cover.
AlwaysLateToThaParty@reddit
umm... I do that all of the time.
jazir55@reddit
Fake lowercase the first letter of every sentence? Why?
AlwaysLateToThaParty@reddit
Sometimes if I'm in a coding mindset, I just don't use caps.
Sudden-Lingonberry-8@reddit
that is not microsoft C++ convention, or basic.
AlwaysLateToThaParty@reddit
It's a c convention.
b00c@reddit
I comment a lot with phone keyboard. Sometimes I delete or reorder words in a sentence. I can't be bothered to go and change that one letter. It's too finicky and I have fat fingers.
IrisColt@reddit
Those capital letters...
nihilistWithATwist@reddit
Running that 3090 isn't free though, even if you paid for the hardware already.
Here's a very basic calculation:
If it's running full throttle, then it's drawing 300W-400W. Make it 400W along with other hardware, losses, and cooling. Let's assume 25 tok/s throughput and 20 cents/kWh electricity cost. Your output works out to (25 × 3600) / (0.4 × 0.2) ≈ 1.1M tokens per dollar.
That's relatively cheap considering qwen3.6:27b is quite good, but it's not nothing.
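Spelled out, under the same assumptions (400W at the wall, 25 tok/s, $0.20/kWh):

```python
# back-of-envelope output tokens per dollar for the local box
tokens_per_hour = 25 * 3600                # 90,000 output tokens
dollars_per_hour = 0.4 * 0.20              # 400 W of wall draw at $0.20/kWh
print(tokens_per_hour / dollars_per_hour)  # ~1.125M tokens per dollar
```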
Aphid_red@reddit
That's only output tokens though. Model providers also charge for input, and that ends up being a way bigger part of the bill.
You're really getting about 24M tokens per dollar for a typical input:output ratio. It's harder to find a cloud provider that can go under 4 cents per million tokens for a model with 90%+ of frontier performance.
Mostly because their fixed cost (cloud GPU) is way, way higher than yours. Electricity cost is small even for 1 kW running 24/7 for 3 years at a 65% capacity factor (@ 20 cents that's ~$3.5K, and you'd be crazy not to buy wholesale or run your own generators/solar panels with a big AI datacenter to get it for more like $1K). The GPU purchase price of $70K per chip is the bigger problem.
Hylleh@reddit
I think 27b is closer to 50 tok/s on a 3090
ikkiho@reddit
The routing intuition is right but two things hide in the cost math, and the 61% multi-file drop has a more specific cause than "context handling."
The 17x cheaper number is batch economics, not model cost. Frontier providers serve at batch 64-256 with continuous batching and prefix-cache reuse; per-token GPU-seconds drop roughly linearly with effective batch until the GPU goes compute-bound. A 3090 at batch=1 is paying ~10-20x the per-token amortization the provider does on identical hardware. DeepSeek V4 priced 17x below GPT-5.2 reflects margin compression plus batch-scale advantage, not "the model is fundamentally that cheap to run." Local can't hit that price point regardless of model choice.
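To make the amortization point concrete, a toy model (all numbers invented; assumes decode stays memory-bandwidth-bound, so one step takes roughly the same wall time whether it serves 1 sequence or 256):

```python
# toy per-token cost vs batch size; the rate and step time are made up
gpu_cost_per_hour = 2.00   # hypothetical hourly rate for one cloud GPU
step_seconds = 0.03        # one decode step emits one token per sequence
for batch in (1, 64, 256):
    tokens_per_hour = batch * (3600 / step_seconds)
    usd_per_million = gpu_cost_per_hour / tokens_per_hour * 1e6
    print(f"batch={batch:>3}: ${usd_per_million:.2f} per 1M output tokens")
```

Per-token cost falls roughly inversely with batch until the compute-bound knee, which is the entire provider advantage.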
The 61% on multi-file debugging isn't context length, it's retrieval quality at the KV cache. 27B with GQA shares K/V across 4-8 Q heads, which compresses per-position attention bandwidth and shows up as reduced needle-in-a-haystack accuracy at longer context. RoPE extension via YaRN or linear interpolation adds frequency aliasing past the original training distribution. You claw 10-15% back here with a real RAG layer (semantic + AST + git-blame filters) instead of dumping the full repo into context, since retrieval beats long-context attention at fixed parameter count.
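A minimal sketch of the retrieval idea, with crude token overlap standing in for a real embedding model and AST filtering omitted; the git-recency weighting and the 0.7/0.3 blend are arbitrary choices, not tuned values:

```python
import math
import subprocess
import time
from pathlib import Path

def recency_score(path: str) -> float:
    # newer last-commit time -> higher score (needs git in the repo)
    out = subprocess.run(["git", "log", "-1", "--format=%ct", "--", path],
                         capture_output=True, text=True)
    ts = int(out.stdout.strip() or 0)
    age_days = (time.time() - ts) / 86400
    return math.exp(-age_days / 30)  # weight decays over ~a month

def relevance_score(query: str, text: str) -> float:
    # stand-in for semantic similarity: bag-of-words overlap
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / (len(q) or 1)

def top_k_files(query: str, root: str = ".", k: int = 5):
    # rank files and put only the winners into the prompt,
    # instead of dumping the whole repo into context
    scored = [(0.7 * relevance_score(query, p.read_text(errors="ignore"))
               + 0.3 * recency_score(str(p)), str(p))
              for p in Path(root).rglob("*.py")]
    return sorted(scored, reverse=True)[:k]
```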
The 29% on 5+ file refactors is a planning-depth bottleneck, not parameter count. Multi-step lookahead in autoregressive decoding is roughly compute-fixed, not parameter-fixed. A 27B with a 4-8k thinking-token budget often closes 5-10% of that gap, because marginal reasoning compute compounds more than marginal parameter capacity on this task class.
Bill math also undercounts amortization. 3090 at 350W * 12h/day * $0.20/kWh runs about $25/mo electricity, more than your $22 cloud. "Already sitting there" works once but doesn't generalize past sunk cost.
For Opening-Broccoli9190's harness question: trained routing beats heuristic. Label 1000 past tasks with the cloud model ("would have gotten this right / failed"), train a 100M-param classifier on (task description, codebase summary, file count, expected diff size), route on its prediction with abstention for ambiguous cases. ~$5 one-time labeling spend, automates the call better than any hand-written rule.
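A scaled-down stand-in for that router, using a linear classifier over task text instead of a 100M-param model; the four example tasks and the 0.35/0.65 abstention band are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# labels come from replaying past tasks: 1 = local model handled it
tasks = [
    "explain what this function does",
    "write unit tests for the parser",
    "debug race condition across three modules",
    "refactor the auth flow across 7 files",
]
local_ok = [1, 1, 0, 0]

vec = TfidfVectorizer().fit(tasks)
clf = LogisticRegression().fit(vec.transform(tasks), local_ok)

def route(task: str, low: float = 0.35, high: float = 0.65) -> str:
    p_local = clf.predict_proba(vec.transform([task]))[0, 1]
    if p_local >= high:
        return "local"
    if p_local <= low:
        return "cloud"
    return "cloud"  # abstention band: default to the stronger model
```

Swap the TF-IDF features for (codebase summary, file count, expected diff size) and scale the training set, and the same shape holds.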
Bigger frame: cloud and local are fungible on the easy 65% because parameter count and serving-batch don't matter there. On the harder 15% they aren't the same product, and price comparison stops being meaningful.
Aphid_red@reddit
17x isn't enough.
You forgot one thing.
A 3090 cost someone $700. A cloud GPU like the B100 costs $70,000 nowadays. That's a 100x difference that swamps all other costs. You're not factoring in how bonkers expensive enterprise hardware is. And I haven't talked about networking equipment, datacenter racks, cooling, colocation, etc.
Even with more efficient KV caches, you're not likely to be compute-bound in cloud inference. You're memory-bound. A 140GB cloud GPU can hold somewhere around 3-14M tokens depending on the model. Your typical 200K coding context from 10K LOC thus means 70 users per additional GPU beyond the minimum needed to run the model. Realistically it's more like 15 users. Why do you think there's a memory crisis? Cloud GPUs have rapidly been scaling their memory size to keep up with longer and longer context.
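The arithmetic behind those user counts, for anyone checking:

```python
# users per additional GPU = cacheable tokens / per-session context
hbm_tokens_high = 14_000_000   # optimistic end of the 3-14M range above
hbm_tokens_low = 3_000_000     # pessimistic end
context = 200_000              # one coding session's context
print(hbm_tokens_high // context)  # 70 users at best
print(hbm_tokens_low // context)   # 15 users, the realistic case
```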
Then, there's another thing: Greed. The pure unbridled greed of Silicon Valley is out of this world. Engineers are paid millions. CEOs in the billions (effectively, through stock options to avoid taxes). That's 10x to 100,000x a typical worker in even a rich country. Corporations levy usurious profit margins. Lest you forget: a 70% profit margin more than triples a product's price. And those margins are high down the entire stack of companies. The taller you make that stack, the more you multiply the ultimate cost of labour and material by those profit margins. By running cloud, you add the model provider, the cloud host, the datacenter company, and the OEM all to this stack.
Whereas locally you only have ASML -> TSMC -> NVidia -> PNY and ASML -> Samsung -> NVidia -> PNY to deal with. Beyond that you pay with electricity and your own time managing your own hardware.
I've been watching/reading some Ed Zitron, and while I definitely think he's right that the economics don't add up, I don't think it's the physics that doesn't add up. It's the sheer level of greed in this space that's creating the bubble.
So why not run at home? You own your own hardware and don't have to deal with all that crap. Stick an extra GPU into a spare PC and you have an 'AI workstation' from spare parts bought off of gamers upgrading to the latest generation.
It's also not going to be running at the full 350W for 12 hours a day. If you're even a little bit knowledgeable, you'd set that gaming GPU to run at ~200W for AI workloads (e.g. `sudo nvidia-smi -pl 200`) and get 80-90% of the compute at 60% of the power. Even then, with inference it shouldn't be running that hot all the time.
brahh85@reddit
the problem with cloud tokens is that soon we won't be able to pay for them, because demand will exceed supply by a lot; that's why multiple plans have suffered cuts in recent weeks. In the near future we will be very happy to have bought a GPU and to turn kWh into millions of tokens per day, because we won't have another source of inference. And the next step will be buying solar panels, because datacenters will end up absorbing the power from the people, especially in some regions of the USA.
SilentLennie@reddit
I've said before that maybe the US will end up like Pakistan: lots of people buying solar and storage because the grid is unusable. But I don't think it will, because Pakistan had lots of outages, and that's not what I expect to happen in the US (in the near future).
brahh85@reddit
There won't be outages, because prices will kick many people out first. It's the same thing that happens with healthcare: the hospitals are supplied, but many people can't afford to go to them.
Confident_Ideal_5385@reddit
At which point all the SaaS idiots lose their shirts, because their shiny new DCs don't have any users, on account of none of their addressable market having access to electricity.
Sandoplay_@reddit
What the hell do you need to code to have 12h of GPU usage per day? At best the LLM works for 4h a day, because most of the time the GPU will wait for your next message. So 0.35 kW × 4 h × 30 days × $0.20/kWh ≈ $8.40 a month. And even 4 hours I consider waaaay too much for a day, because you do not prompt the LLM non-stop.
ohhi23021@reddit
based on how people complain about tokens and limits on the codex/claude subs, they're probably running 3-4 tasks in parallel for hours on end. not something you can do well locally, if at all... maybe vLLM with 2-4x 3090s, but now you're burning 4x the electricity. however, I've power capped mine at 300W; with time-of-use pricing here, 12h would be much less, maybe $10/mo. it just costs more with dual or quad setups. it only makes sense if you're replacing the $100-200 claude/codex subs, and even then you can just pay for DeepSeek since it's so cheap.
poita66@reddit
Do you know of such a RAG layer that’s open source and can be used with existing coding agent harnesses, or any harnesses that have this ability?
fantasticsid@reddit
Yes, but what's your favourite recipe for banana bread?
Confident_Ideal_5385@reddit
Uh, no offence, but you are a robot, aren't you?
Pleasant-Shallot-707@reddit
You’re running ds v4 locally?
Upstairs_Tie_7855@reddit
Did you actually read the post?
danish334@reddit
I had a really bad experience with deepseekv4. I wouldn't really rely on its code even compared to sonnet.
spencer_kw@reddit (OP)
what happened?
NineThreeTilNow@reddit
Deepseekv4 is massively undercooked. They were crunched for time to release that model. It was supposed to be released around Lunar New Year (Chinese New Year) and they were lagging behind.
From looking at the architecture, they screwed up the vocabulary size (It's way too small), and tried to account for that with a bigger model.
I'd wait for v4.1 ... It's just not ready, which is sad because they have some of the best published research in the world.
bambamlol@reddit
But will it take weeks, months or years until we have 4.1 / 4.0 final?
danish334@reddit
Guided it through a simple problem with a few shots and asked it to match dates in different formats, as I was being lazy. It made a blunder in 30 lines of code.
It picked the -1 index to get the year from the formats yyyy-mm-dd and yyyy, and it wrote in a comment that the year is always last in those formats. Funny as it is, I am now worried to even use other models like kimi.
-dysangel-@reddit
I wouldn't compare DeepSeek V4 to other large models like that. It's in preview and seems super flaky. I assume Kimi is decent, and I know that GLM 5.1 is incredible.
Pineapple_King@reddit
I switched to all local and develop usable apps. Sometimes I use Gemini for planning and oversight, but it's not necessary anymore.
spencer_kw@reddit (OP)
Ya it's getting to the point honestly where if you have fixed workflows that don't endlessly scale, the local models are just fine. Like especially for personal work. An agent ordering my groceries in the future isn't going to be way more computationally expensive in 3 years vs today, so once these models pass the threshold (as they basically are now), they become "good enough" and you can completely transition over for these tasks and never think about the cloud ever again
OldHamburger7923@reddit
how are you guys doing local? I tried local but it was in a chat interface and wouldn't let me upload zip files to look at. If I could get local to review my app as a whole from my hard drive, then I'd switch.
Pineapple_King@reddit
yeah grocery ordering is easy peasy with qwen3.6
UniqueIdentifier00@reddit
Can you elaborate on this? What’s the method here?
randylush@reddit
My LLM decided beans and rice was the most environmentally conscious food for me to eat so it ordered that.
Thebandroid@reddit
reasoning: If I don't feed the user for one week, the user will never have to spend money on groceries ever again
SilentLennie@reddit
reasoning: everything the user doesn't spend on food could be spent on me.
(rice and beans isn't very expensive)
Caffdy@reddit
sounds like my ex
Thebandroid@reddit
I guess that shows the importance of prompt phrasing. "your goal is to complete work that users give you" may give a very different result from "your goal is to have zero jobs in your queue"
IrisColt@reddit
This. That's the default setting... but ambition can blow the roof off, heh
hitpopking@reddit
Which Qwen 3.6 27b are you running with the 3090? Is it a Linux or Windows machine?
rob417@reddit
Last week, GitHub Copilot told me that I'd hit 35% of my 5 hour limit after making 5 requests. I panicked. I really don't miss having to look up documentation for every other function call, but I also don't want to be paying $100 per month when all the providers decide to jack up prices and stick it to everyone.
I spent the better part of last week testing out Qwen3.6 35B and Gemma4 26B on my 5070. They are more than capable of writing single-file scripts, which is most of what I do. Testing out different agent harnesses also made me realize how much context bloat the GH Copilot agent in VS Code has. I tried running Qwen3.6 35b in the VS Code Copilot plugin and it was failing at pretty much everything. Switched to OpenCode and Pi and both produced good results.
TLDR: even if cloud providers all decide to not serve individual customers anymore, we will be fine. We've each been given genies in our own bottle.
jazir55@reddit
Copilot is a joke anyways, the worst of the cloud services.
Quango2009@reddit
I have a 3090 too, what setup did you use? Ollama/llama.cpp, specific model quants/settings etc? Thanks
Gregory-Wolf@reddit
"complex refactors across 5+ files" - that's even remotely not complex.
Try local models on really complex and big projects (hundreds of files, 10k's LoC) - you'll see that local models, for now, just waste your time. Even the strongest cloud models need overseeing and regular (if not constant) review. All local models need constant guiding. And that eats your time, which basically eats up all of that 17x difference. Unfortunately. I hope in 1-2 years we'll get there.
draconic_tongue@reddit
if local models can force people to not put 10k lines in a file that's a w in my eye
Gregory-Wolf@reddit
Where did you read "10k lines in a file"?
"hundreds of files, 10k's LoC" - there's a comma there, it reads "hundreds of files, tens of thousands lines of code".
weathergage@reddit
Sir I believe this was an attempt at humor
Formal_Situation_640@reddit
This is indeed very interesting. With GitHub Copilot being stripped down, I might have to switch to local too. I'm using it for relatively simple game development code; I think local should be able to handle that quite well?
Blutusz@reddit
What is your llm stack on 3090? vLLM or ollama?
NineThreeTilNow@reddit
This is the one that kills me the most.
I tried to explain to someone that they didn't need Opus running their entire project, just burning tokens. Why does Opus need to read and build documentation MDs? It doesn't.
They told me I wrote "dogshit" code and nothing I ever wrote would be worth anything.
As someone who contracted for Anthropic on Opus, I was kind of confused. There was deep irony to be had.
I recently tested Claude + Kimi where Claude writes documentation on How / Why and Kimi does all the work. The results are +/- a percent or two. This is fairly complicated ML code it's writing too. The code style at the end is the biggest difference.
Then with Gemma 4 the above gets silly. Gemma 4 31b can handle basically all the dumb documentation stuff, updating documentation, etc. All on a 4090 usually reserved for video games.
It's ALMOST to the place of replacing Kimi entirely. My issue with Moonshot and Kimi is their data policy. They train on everything. Even API.
SilentLennie@reddit
Claude Code doesn't do that either, it switches between their own models.
At least that's what I saw if you have a subscription.
And that's the whole thing: with a subscription it's cheaper than paying per token through the API.
jackyy83@reddit
Interesting. Sounds like one could just use a cloud model for high-level architecture and breaking down tasks, and a local model to implement the details.
Desperate-Body-5462@reddit
This is the right way to think about it honestly. The mistake most people make is treating local vs cloud as an all-or-nothing choice when it's really a routing problem.
The debugging drop to 61% on multi-file context makes sense too; that's where long-context handling and attention quality actually matter, not on boilerplate. Would be curious if that number changes with a bigger local model or better context management on your end.
$85 → $22 just from actually measuring is kind of embarrassing for how easy it is to do lol
TheRealMasonMac@reddit
Folks. What percentage of the human population on the internet would you say would memorize the keys to compose →?
thread-e-printing@reddit
They give suitable advice for real-world GPU-poor operations where someone needs to get things done. Mission Fucking Accomplished xkcd
https://en.wikipedia.org/wiki/Compose_key makes advanced typography easy to access. Try it, you'll love it.
TheRealMasonMac@reddit
Nah. It was just typical AI spam.
Bite_It_You_Scum@reddit
Lol and then they edit the post 20 minutes after you post this to hide it.
Opening-Broccoli9190@reddit
How do you route by task type? is there a harness you built?
kleinishere@reddit
https://github.com/can1357/oh-my-pi has some config as well
Ariquitaun@reddit
If you use opencode, you can set up agents for different tasks:
https://opencode.ai/docs/agents/
spencer_kw@reddit (OP)
I initially did this by hand but then I switched to using a third-party router. There are lots of options like OpenRouter autorouter, herma router, azure router, but right now I just use the herma router, and they currently run a hybrid between cloud and local with some of these open source models
Digital-Man-1969@reddit
Totally agree. 70% of my app's TTS requests are narration, which is handled locally; the remaining 30%, for voicing character dialogue, go to Gemini-TTS or ElevenLabs. Saves a bundle on token costs.
no_witty_username@reddit
Thanks for the logs, this is something I often considered doing but was too lazy. Personally I think the "hassle" of doing things locally probably won't be worth it until the quality of closed models degrades or their other shenanigans uptick, but this is good info nonetheless. I wonder how a larger model like GLM or Kimi would fare on a similar test; I know they're not running locally, but at least they are open weight. This is actually related to what I've been thinking recently, as I'm getting sick of OpenAI's annoying cybersecurity-classification popup bullshit. The writing seems to be on the wall that if you want to do work without being annoyed, open weight is eventually the way to go.
gpt872323@reddit
I thought it needed a more powerful GPU. Also, what context size are you using? All of those matter.
Acou@reddit
"genuinely" AI written slop
DevopsIGuess@reddit
Genuinely
15f026d6016c482374bf@reddit
I thought this too
misanthrophiccunt@reddit
that's the giveaway, and not the fact that after every full stop everything is lowercase?
Weekest_links@reddit
Flash? 4 or 8 bit?
DownSyndromeLogic@reddit
How do you route your requests by task type? Are you using different CLI instances, or different Claude Code or GitHub Copilot instances? I'm after some kind of checklist or framework for determining when to route to local versus the cloud, so it doesn't take a lot of time thinking about it.
spencer_kw@reddit (OP)
so for cloud stuff i just use the herma router, which manages all of that automatically. then i just use deepseek v4 for all of the local stuff. when it comes to choosing between cloud and local, i mostly have my computer running local models at full capacity, and once they're over capacity i start routing to the cloud router
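roughly the shape of that spill-over, if you want to hand-roll it (the two call_* helpers are placeholders for whatever local server and cloud router client you actually use):

```python
import threading

LOCAL_SLOTS = threading.Semaphore(2)  # concurrent requests the 3090 can take

def call_local(prompt: str) -> str: ...  # placeholder: local model client
def call_cloud(prompt: str) -> str: ...  # placeholder: cloud router client

def complete(prompt: str) -> str:
    # local first; spill to the cloud router once the GPU is saturated
    if LOCAL_SLOTS.acquire(blocking=False):
        try:
            return call_local(prompt)
        finally:
            LOCAL_SLOTS.release()
    return call_cloud(prompt)
```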
Murhie@reddit
Yeah, building a hybrid system seems very useful and a definite use case, but hard to implement. The first one who builds a harness that facilitates this will definitely see some users.
Pleasant-Shallot-707@reddit
I don’t think it would be impossible to do as long as the harness has the ability to specify multiple models and has hooks that can enforce the use of extensions that can be used to activate the use of a model or even just spin up a sub agent.
Hell. Hermes Agent lets you do this in their Kanban multi agent workflow system actually.
edsonmedina@reddit
I use local for almost everything code-related. If the problem is too complex I use free tiers of ChatGPT, Claude, Gemini, Qwen or GLM. I also use cloud for random questions (health, legal, etc). Zero subscriptions.
MichaelDaza@reddit
No one cares about cloud models here
Mistercheese@reddit
I tried this but my local models were still slower especially with large contexts, and I also spent significantly more time catching/fixing things in the 10% of cases they were not as good as cloud models.
Wouldn’t the same model you run locally (like qwen 3.6 27b), but a lot faster and basically almost free from a cloud provider? I found even the step up was still faster and reasonably cheap (like qwen 3.6 pro) with less time catching/fixing things for a couple dollars a month.