Ever wonder how much cost you can save when coding with local LLM?

[-]

twanz18@reddit

The savings are real, especially if you are already running a decent GPU. Cloud API costs add up fast when you are doing agentic coding with lots of back and forth. One thing that made local setups even more practical for me was being able to trigger agents remotely. I use OpenACP to bridge my local agents to Telegram so I can start tasks from my phone. No extra cloud costs since everything still runs on my machine. Full disclosure: I work on the project.

[-]

Snake2k@reddit

I think "subscribe to code" is not really a feasible model. I've been coding for like 15 or something years.

I think with models like qwen3.5:9b it's showing that you can definitely download a model locally and have a "coding server" running that you can use to code. Just like runtimes and other necessary software engineering services/setups.

Once all the dust settles and this AI hysteria is over. I think this is the baseline we'll all come down to. There will still be cloud managed ones for enterprise, and you're free to get them if you have a big enough need for them, but for most coding w/ local models will be the way to go.

[-]

-Crash_Override-@reddit

Once all the dust settles and this AI hysteria is over. I think this is the baseline we'll all come down to. There will still be cloud managed ones for enterprise, and you're free to get them if you have a big enough need for them, but for most coding w/ local models will be the way to go.

I am honestly shocked that on a sub dedicated to localLLMs of all places, that there is a take so disjointed from reality.

This take may have sounded reasonable in 2024, but today, in 2026? With the complete paradigm shift we've seen over the past 3-6 months. We're already past the point where this take can come to past.

[-]

crantob@reddit

A buzzword tossed out in lieu of an argument doesn't give me much hope, but I am mildly curious...

Could you elaborate on how you disagree with the preceding statement?

Thank you

[-]

-Crash_Override-@reddit

Above user argues:

1) coding model are no feasible.

They already are feasible. People are shelling out money hand over fist. Enterprises dont care about $200/mo for a developer when they are already paying them 100s/k a year. Its literally an insignificant rounding error. Hell, I pay that a month, and for the value I get out of it, its easy to justify.

2) qwen is showing that local LLMs are feasible for coding.

Qwen is impressive. But compared to what frontier labs are doing...its not even in the same realm. Just from a pure coding perspective, you can look at the benchmarks and such, but its just not as good. Period. But more importantly. Qwen doesn't have the full ecosystem around it that claude, gpt, gemini have. It doesn't have the level of tool use, it doesnt have the agentic colding capabilities. Claude is a full suite, from claude chat, to code, to now cowork that bridges the gap. That paradigm shift buzzword that you hate is the only way to quantify what has happened over the past 6 months. The way we interact with computers, create artifacts, interact with code bases, etc... completely different.

I can say all this stuff but unless you actually bite the bullet, and use these tools yourself you'll just shake it off as homerism or copium.

3) That local AI servers are going to be the future.

I think this holds a tablespoon of water, but its just a glaring misunderstanding of how corporate america works and a complete overestimation of peoples capabilities and a disconnect from the hardware market. Despite this sub bing about local llms, I dont think many people actually run them in any meaningful capacity. I have a few pretty serious AI servers, my main one running 4 3090s. In total, on my servers and homelab setup I have spent well over $10k, probably closer to $20k. I'm paying $50-100 a month on energy alone (my server rack is more than just AI, but AI is a big chunk of it).

And guess what...I still pay $200/mo for claude. A significant amount for gpt API usage. Subscribe to gemini and grok. Why...because of simplicity and value.

To run any usable model at usable speeds, despite what people will tell you (oh, I run xyz on my RTX potato60 GPU) requires signficant capital. That is only going to go up when you consider every piece of hardware is getting massively more expensive. I was about to pull the trigger on a RTX pro 6000...until I noticed they went from around $7k to almost $10k in the past 4 months. The 256gb of DDR4 ECC ram i bough last may for $125...is now like $1k. People are priced out of the local hardware market.

But even puting aside power costs to run your local machine. The space they take up. The heat they generate. You still then need to run linux set up all the supporting services, load the model, configure them into your workflow, etc...all for a very subpar experience compared to what I can get if I swipe my credit card. Now imagine that in a corporation...IT serving this to hundreds of users. Thousands of users. You require a shit load of capex to get started and then a shit load of opex to keep it running. Its literally why the cloud became so popular.

I think there will be a place for selfhosted/open weight models...hell in my org we use a number of them, mostly for easy batch processing jobs. But for productivity work, and especially coding, the answer will always be...pay the premium for the best. Its a competitive advantage.

Note: spelling is probably bad, im writing this on the go and grammarly isnt working for some reason.

[-]

crantob@reddit

There will still be cloud managed ones for enterprise

Your entire post seems to be confirmation of this statement, not disagreement with it.

[-]

Xcellent101@reddit

your argument is very similar to people hosting plex servers vs people subscribing to Netflix and such. There is a market for both but honestly for coding (your time is money), you will pay the premium.

I do like how the community is pushing the local idea beyond what is possible (running models on phones, apple minis, phones, ...) We will need this to keep the subscriptions in check.

[-]

Snake2k@reddit

I could be wrong, definitely. But I'm for sure not in a minority when I say that I'm not about to pay a subscription fee for something I can literally do myself. And if I can have that thing run locally, why would I wanna use that as my daily driver?

When you're writing SQL or DB tests do you spin up a whole GCP instance for it or do you test it out with a local DB like MySQL or something first or something and then when you're ready start setting that thing up?

I fully acknowledged a middle mixed future where I acknowledged hosted models too.

[-]

-Crash_Override-@reddit

When you're writing SQL or DB tests do you spin up a whole GCP instance for it or do you test it out with a local DB like MySQL or something first or something and then when you're ready start setting that thing up?

Im not sure if you understand how tools like Claude Code or Codex work.

You dont spin up a cloud instance. They dont run locally but they work locally.

I literally used claude code to set up an old brocade network switch the other day. It opened a serial connection to the switch, flashed it, and then configured the trunk and VLANs.

[-]

cockerspanielhere@reddit

Bleeding edge frontier models cost LOTS of money and energy, but we tend to forget (or neglect?) that because of hype and bubble

[-]

-Crash_Override-@reddit

It can cost all the money and energy it wants if the return is net positive.

Im really not sure what your point is.

[-]

cockerspanielhere@reddit

The return is absolutely not "net positive". You clearly have no clue about AI companies' financial situation
Energy is finite, there's not enough energy to satisfy our childish whims

That's my point

[-]

IpppyCaccy@reddit

Energy is finite, there's not enough energy to satisfy our childish whims

I have a 20 kilowatt array. It's satisfying my childish whims pretty readily at the moment.

[-]

cockerspanielhere@reddit

I was talking about Musk, Altman and the rest of technocrats, but good to know you consider yourself childish

[-]

-Crash_Override-@reddit

What a mess of an argument you've got here.

1) This sub and clearly this thread is talking about end user tools, not the financial architecture of multibillion dollar companies. These tools are expensive. But $200/mo is peanuts for a developer. I oversee AI for a F500 financial firm (belive me or dont, dont really care). My budget for AI tools is in the millions - copilot, github copilot, Gemini, etc... Millions is a drop in the bucket when I'm paying one developer $200-400k/yr. So the return for a end user is net positive. But on the macro I'm not sure if you understand the financial situation of these companies. Doubt you've heard of capex or opex in your life. Neither here nor there for this discussion.

2) Literally the second law of thermodynamics says otherwise. You are talking about energy capture/generation in a consumer sense. Again, that argument is neither here nor there in this context. If the demand for AI tools is there, eventually the generation capacity will catch up. Not that it will be easy or wont lag, but it will get there, purely by nature of capitalism. But again, the framing was AI at the consumer level, specifically on this sub, framed against local LLMs. People here acting like localLLMs do not have the exact same problem as frontier model builders do. I have spent well over $10k on my AI rig....capex....and there is a not insignficant cost to run local model as it sucks down energy....opex....but because I'm just one person, I can get zero economies of scale. I'm probably doing $50/mo electricity alone in compute cost on my AI server, yet my $200/mo claude subscription gets me massively more value....far more than 4x the value.....than what I can run locally.

[-]

Ok-Ad-8976@reddit

Yeah, it's pretty amazing, isn't it?

[-]

Torodaddy@reddit

Playing devils advocate for a moment, I think we'll begin to see more of a bifurcation between frontier models and local models where for speed; small models will be run locally or loaded on chip for quick simple stuff like grammer correction or translation and larger models will be more expensive but also impossibly dense requiring power and memory in vast excess of a home gamer.

[-]

PANIC_EXCEPTION@reddit

I think the missing portion is the lower parameter count Pareto frontier. If improvements fail to keep scaling in the high parameter regime, the next logical step is figuring out how to take the biggest models and shrinking them down as much as possible.

[-]

bobaburger@reddit (OP)

I actually have a slightly different vision about where we are after this.

Just like how unused internet infrastructure drove down the bandwidth cost after the dot com bubble, leading to the rise of video streaming and cloud computing. We might have more access to cheaper AI servers in the future, and able to do things that sound ridiculous expensive if we're doing it today.

[-]

TripleSecretSquirrel@reddit

Nah, I think it’s going to be all Jevon’s paradox. As ai computing becomes more and more efficient and cost-effective on a per-token basis, we’ll simply use more and more of it.

[-]

IrisColt@reddit

This.

[-]

Snake2k@reddit

I can definitely see that as a future. Pay to upgrade is another one that is already very much popular in the music industry.

You can buy Ableton 12, but if you want 13 you'll have to pay upgrade costs.

I really do think what you're saying will happen once this fades. It'll definitely get more realistic in a varied way.

[-]

PANIC_EXCEPTION@reddit

I can see brands like Framework packaging new commodity LLM ASICs containing original weights that leave GPU/NPUs in the dust at a fraction of the power, with users downloading LoRAs for extra finetuning. You simply slot in one of those modules and you have a power-efficient agent. That they are modular means you can eventually replace them with better modules.

[-]

TripleSecretSquirrel@reddit

That would be awesome! I fear that the cost to develop model-specific ASICs a la Talaas would be staggeringly high and unjustifiable for consumer-oriented hardware though. Maybe/hopefully I’m wrong though!

[-]

Snake2k@reddit

that would be insanely awesome to have consumer grade stuff like that

[-]

I-am_Sleepy@reddit

It would be interesting even if the smaller model can’t match the frontier one, it will still canibalize a lot of larger models utility. With vastly lower VRAM usage, it should makes the overall LLM price cheaper overtime - as it will become a commodity

For lora, GCP vertex AI already offer something similar but to their gemini family, and using them so far it production is very straightforward. If the model is commoditized (and follow compliance), with predictable performance, and cost. And the infrastructure + training+ deployment is simple to integrate

This will absolutely destroy the frontier model lab profit margin. With smaller model released, I can see these SaaS popping up very soon to cover, and streamline this entire pipeline

[-]

Torodaddy@reddit

I don't know, lots of internet "stuff" is easy to diy but people don't because of laziness

[-]

Snake2k@reddit

It'll get easier. Downloading source code and building it yourself is a common thing in a subsection of tech communities (programmers, sys admins, etc), but it was too much for some people to keep doing and maintained, so then we have package managers.

Things like ollama are basically package managers for local llms. I don't see why that can't further be simplified.

[-]

Green-Ad-3964@reddit

Sorry I don't understand. Is cloude code usable in a local machine with a local model for free?

[-]

bobaburger@reddit (OP)

yes

[-]

Green-Ad-3964@reddit

I wasn't aware of this; how does it compare to the likes of OpenCode, Cline or Aider?

[-]

bobaburger@reddit (OP)

there are two parts of this, the first part is how to run it, which has been covered very well, for example, the Unsloth's doc https://unsloth.ai/docs/basics/claude-code

the second part is trickier though :D the short answer is no, local model will not be as good as the hosted models from these services.

long answe is, it really depends on what model you're using. for all of the above services, the cloud model are usually large ones or commercial ones with full quants, so their speed and quality is way above local models. but you either have to pay for the tokens, or pay with your data privacy (if you're using free models).

[-]

Green-Ad-3964@reddit

I was referring to the same local models also for these services; I, for example, currently use qwen 3.5 27b (q4_K_M) on cline. It sorta works.

[-]

bobaburger@reddit (OP)

oh in that case, i only tried claude code and opencode, generally opencode is way faster because it has simpler prompt, but the agentic workflow will not be as deep as claude code.

[-]

counterfeit25@reddit

Regarding discussions on tokens per second:

OP mentioned 2M tokens over 2 minutes -> 2*10\^6 tokens / 120 seconds = 16,667 tokens / second

That includes both input and output tokens, so it's not like OP is claiming 16k output tokens per second (that would be Taalas, super cool btw https://taalas.com/the-path-to-ubiquitous-ai/). Processing the input tokens in the LLM prefill phase is generally faster than generating output tokens in the decode phase, on a per token basis. For a rough overview of LLM serving prefill/decode phase feel free to Google it, or see https://huggingface.co/blog/tngtech/llm-performance-prefill-decode-concurrent-requests

Claude Code also has really big system prompts (like 10+ plus tokens each) for different tasks. Adding to that any tool definitions, injected MCP stuff, expanded skills, etc., the system prompt can get huge.

So if we assume 16k combined input/output tokens per second, does that make sense?

Let's say on average each LLM request consumes X tokens (input/output tokens combined, but ratio of input/output tokens for agentic workflows is very high, i.e. much more input tokens than output tokens):

X tokens/request, 2 minutes, 2*10\^6 tokens

2*10\^6 tokens * (1/X) requests/token * (1/2) "per minute" = (1/X) * 10\^6 requests per minute

How many requests per minute on average is reasonable for OP's Claude Code setup? Honestly I'm not sure, and I'm curious to see some benchmarks here. Just to plug something in, let's say on average 5 seconds per LLM call?

(5/60) minutes per request -> 12 requests per minute

(1/X) * 10\^6 requests per minute = 12 requests per minute -> X = 83,333 tokens per request

Honestly consuming on average 83,333 tokens (input/output combined) per LLM request for agentic workflows seems within the ballpark.

[-]

bobaburger@reddit (OP)

Posted in the other comment, but here's the llama log and analysis again https://gist.github.com/huytd/3a1dd7a6a76fac3b19503f57b76dbe65#5-request-by-request-breakdown

Basically thanks to KV cache, the actual number of tokens being process by the GPU is much smaller, but the total tokens being sent/received within Claude Code (which users will be billed for) was a lot.

[-]

counterfeit25@reddit

Hmm, according to your logs, you averaged 30-35 output tokens / sec, with a total of 13,410 output tokens generated. At 35 output tokens / sec, that would have taken 383 seconds -> 6 minutes. That's just for output token generation, not including pre-fill. Unless I'm missing something here, like really spiky generation speed at times?

[-]

jdchmiel@reddit

was is a single request at a time, or parallel requests? i tend to reach around 10x throughput with parallel requests

[-]

counterfeit25@reddit

Possible, if OPs GPU could support multiple requests in parallel, eg batch size 2+

[-]

counterfeit25@reddit

nice, thanks for the info! updated my comment from earlier

[-]

T0mSIlver@reddit

Did you disable thinking on purpose (e.g. --chat-template-kwargs {"enable_thinking": false} or similar)?
In your screenshots I don’t see any thinking blocks.
Asking because there’s a llama.cpp issue (#20090) where the Anthropic /v1/messages API drops thinking content blocks, so without that being fixed (or without thinking being disabled), it sounds like Claude Code wouldn’t behave correctly with the model you mention.

[-]

bobaburger@reddit (OP)

yes https://www.reddit.com/r/LocalLLaMA/comments/1rkai3l/comment/o8je6dk/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

i disabled thinking because it would cause the model stop responding while making tool calls more often. maybe related to the early EOS token issue that you linked in your issue.

[-]

T0mSIlver@reddit

Thanks, mind if I link your comment in the llama.cpp issue as an extra data point?

That symptom ("stops responding while making a tool call") sounds very similar to what I’m tracking in my llama.cpp issue (#20090) — possibly the same underlying cause.

What I think is happening under the hood:

With thinking enabled, the assistant’s single logical turn uses interleaved thinking: thinking → tool_use → thinking → tool_use → … (all inside one ongoing tool-use chain).
llama.cpp’s Anthropic-compat conversion currently drops thinking blocks. So the model is shown a transcript where previous tool calls in the current turn appear to have happened with no thinking lead-in (because it was deleted).
Then, when it’s time for the next tool call, the model begins a fresh thinking block (it’s trying to re-establish its plan), but the prompt now “teaches” it that tool calls happen without the missing reasoning scaffolding. That mismatch shifts the distribution: instead of transitioning from "thinking" into a tool_use, it often terminates the message (EOS) right there.
From the user perspective this reads as: “Claude Code stalls / model stops responding after tool calls,” but the mechanical cause is: dropping interleaved thinking blocks breaks the learned cadence that leads into the next tool_use, so the model ends the turn prematurely.
Disabling thinking “fixes” it because there are no thinking blocks to be dropped, so the tool-call sequence stays internally consistent and the model keeps emitting tool_use blocks instead of bailing out.

[-]

bobaburger@reddit (OP)

yes, feel free to link it in, but it's just my guess/observation and may not 100% correct.

[-]

Accomplished_Egg7987@reddit

Whoa, 2M token / 2 minutes = 10000 token/sec
I guess you are one of the lucky ones to have an ASIC with Qwen3.5 35B A3B on it :)

[-]

bobaburger@reddit (OP)

I have the explanation in a reply to another comment, but also updated the post for it.

[-]

Accomplished_Egg7987@reddit

Hmm, ok then it makes perfect sense.

Btw I too tried to use same model with LM Studio but couldn't able to make it work with prompt caching(not KV I guess) and bored with waiting for prompt processings.
Good luck with your off grid journey :)

[-]

bobaburger@reddit (OP)

i had that issue before too, seems like llama.cpp version in LMStudio did not have the latest updates to support these new models yet.

[-]

Odd-Piccolo5260@reddit

Dumb question how do you get it to run inside of claude or say antigravity?

[-]

bobaburger@reddit (OP)

For claude code: https://unsloth.ai/docs/basics/claude-code

I have no idea how to run it in antigravity though.

[-]

RedditVTT@reddit

For Antigravity you can use the Continue extension. Had to change the marketplace to the Visual Studio one. Installed the latest version from there and it didn't work...click the button to downgrade to an older version and that ended up working for me.

[-]

Sky_Linx@reddit

Cost savings for coding tasks IMO don't hold as a argument for using local LLM, at least for me. I use the Ollama Cloud subscription and for $100/month the limits are insane. I do a LOT of work, often working on up to 4-5 tasks at the same time, in parallel, with GLM 5, and I never get close to the limits nor do I have issues with concurrency.

[-]

08148694@reddit

How has nobody pointed out the obvious here

OP is comparing the cost of Qwen3.5 35B to the cost of Sonnet 4.6

2M tokens of Qwen3.5 35B costs about $0.50 from cloud providers. More than the cost of electricity, but at a far higher tps and without the need to buy a PC

[-]

Scared-Department342@reddit

this is exactly the kind of workflow that makes dedicated edge hardware worth it. i run a jetson orin nano (67 TOPS) for local inference and the cost savings are insane compared to cloud APIs. paid €549 once for a ClawBox and haven't touched my openai billing since. 20W power draw so electricity is basically nothing. if you're doing this kind of heavy local inference daily its a no brainer https://openclawhardware.dev (OWNYOURAI for 10% off)

[-]

Anarchaotic@reddit

How did you set up Claude Code to work for you? I'm new Coding in general - I've been using Claude to help me via the terminal (but that's less code and more on operational deployments and stuff) - is there a tutorial or something I can follow?

[-]

MinimumCourage6807@reddit

There are huge benefits of being able to use models locally with the cost of electricity. For example, I,ve been doing overnight tasks which produce very valuable output but the token amount for operation is somwhere between 100-40 milion tokens / run. Wit actually paying for the compute as api tokens it would not make much sense or at least a lot less margin for me 😁. These are tasks where you dont need the best model, a good "shovel" is more than good enough. I think there are about endless amount of this kind of useful, but not that hard tasks.

[-]

Djagatahel@reddit

Do you have an example of such a task?

[-]

MinimumCourage6807@reddit

Well basically all tasks that require the llm to surf web pages consumes a lot of tokens. I work in digital marketing and have around 30 web sites partly on my watch so reading them through site by site and looking for typos errors etc is one example. Other one is gathering different info for lists from web which requires a bit more intelligence what basic web scraping offers.

[-]

Djagatahel@reddit

Makes sense! I can see why these tasks would require tons of token

[-]

MinimumCourage6807@reddit

Off course all tokens are not equal and in many cases you get a ton of input tokens ( from web searching related you can easily get 50-100k tokens / one view of web page). But it is lot of work for the llm to gather information and there still is no way around it that those kind of tasks will consume lots of tokens no matter how good the model is.

[-]

IpppyCaccy@reddit

I'm guessing a few examples are reading through logs, looking for problems and coming up with prioritized list of potential fixes, reading through email and putting the distilled messages in a vector database. At least that's what I'm planning on doing.

[-]

StatisticianOdd6974@reddit

Want to know as well

[-]

ClayToTheMax@reddit

Idk, I was testing out the Q4 of 35b and I was getting about 50 t/s on my v100s. Prompt processing took longer than I expected generally. Tried with LMstudio and ran on Qwen cli, and it did okay, but honestly was still kinda trash. I just downloaded it from ollama and am going to test in codex tonight. I’ll keep you posted to see if that makes a difference.

[-]

Blue_Discipline@reddit

Do you know how the qwen3.5 models fare on a VPS. So that one uses that instead of cloud models. I haven’t been able to get any model to run properly even qwen3.5:4b feels very slow

[-]

bobaburger@reddit (OP)

on a normal VPS, you only have an option to run it with CPU, which will be extremely slow (not to mention you need a lot more RAM to load), you can rent cloud GPU nodes, which run better, but still about $0.3-$0.4/hr at least (for a L40S or a 3090)

[-]

alternateit@reddit

Which CLI did you use to run Qwen locally ?

[-]

bobaburger@reddit (OP)

i’m using claude code

[-]

alternateit@reddit

Can u use Claude code to run local llms ? I didn’t know

[-]

bobaburger@reddit (OP)

yes, all coding agent can use local models https://unsloth.ai/docs/models/qwen3.5#claude-codex

[-]

mr_zerolith@reddit

I can save money on operation, get higher reliability than commercial services, and not hand my client's source code from being potentially compromised.

Priceless!

[-]

LocoMod@reddit

What is the difference between availability and reliability?

[-]

bobaburger@reddit (OP)

not sure what’s your question about, can you elaborate?

[-]

iMakeSense@reddit

What are your specs?

[-]

bobaburger@reddit (OP)

Ryzen 7 7700X, 32 GB DDR5-6000, RTX 5060 Ti 16 GB

[-]

Bando63@reddit

Hi do you know if I can use mac mini m4 pro with 64 gb ram to run the same configuration?

[-]

socialjusticeinme@reddit

Yeah, except the open generation speed will be half of his 5060 ti, so what took him two minutes may take you closer to 4 minutes

[-]

bobaburger@reddit (OP)

i also run the same model on my work laptop, which is m2 max 64gb, got about the same token generation speed but prompt processing was 300 t/s.

[-]

fugogugo@reddit

huh I have same exact GPU
how much token/s you got?

[-]

bobaburger@reddit (OP)

about 1k4 t/s pp, 35 t/s tg

[-]

power97992@reddit

1004 *120= 120.5k input tokens processed in two minutes , not 2mil ?

[-]

fugogugo@reddit

uh sorry ? what is pp and tg?

[-]

Key_Section8879@reddit

Prompt processing and token generation

[-]

Significant_Fig_7581@reddit

Is it still usable at Q2?

[-]

bobaburger@reddit (OP)

Yes. pretty much usable. With some subtle issues, like, it cannot use the AskUserQuestion tool in claude code, while q3 and q4 can, and a couple of instructions will get ignored more often than higher quants.

[-]

Significant_Fig_7581@reddit

Thank you

[-]

KaosNutz@reddit

That's a good question, previous wisdom from this sub would be to switch to 9b q4 at this point

[-]

Significant_Fig_7581@reddit

Well tried Qwen coder at Q2 and Q3 and it was actually pretty good at Q2, Everyone was surprised really...

[-]

Snirlavi5@reddit

Is a proxy still required for Claude code to work with a local model? (for compatability with Anthropics Api)

[-]

bobaburger@reddit (OP)

no, llama.cpp already supported Anthropic style API so you can run it directly.

[-]

Snirlavi5@reddit

Cool, thanks

[-]

arthor@reddit

yea but open source models arent safe.

[-]

crantob@reddit

now that wasn't nice ;P

[-]

Durian881@reddit

How so?

[-]

arthor@reddit

im just quoting the anthropic CEO

[-]

counterfeit25@reddit

hope you're being sarcastic then /s

[-]

bobaburger@reddit (OP)

What make you think so?

[-]

arthor@reddit

this video..

https://x.com/jikkujose/status/1952588432280051930

[-]

nicholas_the_furious@reddit

He didn't mention safety once.

[-]

lookwatchlistenplay@reddit

Electricity isn't safe either.

[-]

3spky5u-oss@reddit

lol.

[-]

Snake2k@reddit

Elaborate

[-]

Notyit@reddit

All of that, I paid nothing except for two minutes of 400W electricity for the PC.

How much did your PC cost though

[-]

Mashic@reddit

Don't you have to buy a pc even if you're using AI online?

[-]

redoubt515@reddit

Everyone already owns and needs a device that is capable of using cloud hosted models. An old smartphone, a shitty chromebook, a raspberry pi, or a 15 year old thinkpad could all do okay. Almost nobody would need to buy a new or expensive PC to use cloud hosted AI.

That is not at all true for most of us wanting to run models locally. Hardware requirements are much more significant, and people in this sub are spending tremendous amounts on hardware. Just a few years ago before the AI boom the RTX 3090 was considered absolutely unnecessary and overkill for pretty much anyone outside of certain professions, and considered laughably expensive. AI has changed that overton window soo much that now a lot of people in this sub consider it to be the "budget" option and the bare minimum to run anything 'decent'

[-]

crantob@reddit

Well yeah... we have AI now. That's a completely different value proposition than bumping a game from 1080 to 1440p.

Prices are now creeping up, but for a time here they were around 600-750€ here. I was able to drop unneeded expenditures for a couple months to afford one. For the utility gained, it was a no-brainer.

The transformer has transformed the value of computers.

[-]

aadoop6@reddit

That cost is included in your API pricing no?

[-]

coder543@reddit

Codex has never sent me a laptop to use Codex with, no.

[-]

nicholas_the_furious@reddit

My PC has increased in value since I built it. Does that count as a negative cost?

[-]

Creepy-Bell-4527@reddit

Ironically, your pc has increased in value... because of AI.

[-]

bobaburger@reddit (OP)

Exactly :)))))

[-]

Ok_Caregiver_1355@reddit

"0" if you have solar energy tho

[-]

LostVector@reddit

Exactly how does 2M tokens in 2 minutes happen?

[-]

tmvr@reddit

I don't think it does, I think that calc in Claude Code is incorrect. I've tried it a few days ago hooked to a local model and after creating some simple stuff it claimed that 2-3M input tokens were used for the 20K or so output tokens. That is nonsense even with the 18K system prompt.

[-]

xienze@reddit

I believe it. I recently had to add Javadoc to something like 100 classes, with varying numbers of methods and such. My $20 plan got locked out within like 30 minutes and a couple dozen files, including review time. Was a little shocked but didn’t want to stop so I loaded up $20 in API credits to get a sense of what was going on. Finished eventually and the billing page showed millions of tokens. Makes sense in hindsight, entire files are getting shuttled around, repeatedly. Claude also really liked presenting changes one at a time instead of file-by-file, which had a noticeable lag between display, so I suspect that each one of those was a round trip (i.e., prompt+response). I really had to give it a lot of “yeah don’t be so stupid, batch these changes up” direction to get things to be more reasonable. Part of me thinks this behavior is somewhat inefficient by design in order to either get you to pay for a ton or tokens or reduce usage of your subscription plan. I definitely prefer unmetered local usage where possible. I can only imagine how expensive this is gonna get when LLM subscription usage is truly pervasive and the price gets jacked up.

[-]

tmvr@reddit

I don't know about the paid plan(s) because I got those through work with unlimited usage so I never look at the stats, but did for the local model out of curiosity. My test was to see if that local model is OK for some basic stuff and how is calling tools work on the machine. I started with an empty project, got it to create two HTML files (single file games), build some docker containers and serve them from there, create a landing page to select which one you want and serve that from a docker container as well. Pretty simple, not a lot of code or ingest of external stuff and it went well, but with this alone it showed me 2-3M input token usage.

[-]

georgesung@reddit

Looking at the LLM requests/responses from Claude Code it makes sense. A while ago I tried some simple test cases, and saw a gigantic input context (system prompt plus tool definitions) with a very short output, like a tool call.

Input/request:

https://gist.github.com/georgesung/36798614e6f23670cdb310bf53e665aa#file-gistfile1-txt-L1708-L2494

Output/response (in this case it was a simple tool call w/ associated thinking tokens):

https://gist.github.com/georgesung/36798614e6f23670cdb310bf53e665aa#file-gistfile1-txt-L2496-L2521

More details if curious: https://medium.com/p/7796941806f5

[-]

tmvr@reddit

What was/is interesting also is that those high input token numbers are after I've started using this:

export CLAUDE_CODE_ATTRIBUTION_HEADER=0

in order not to process the 18K system prompt before and after every prompt.

[-]

counterfeit25@reddit

Lots of input tokens. The system prompt itself for Claude Code is 10k+ tokens.

[-]

redoubt515@reddit

do end-users have to pay for the system prompt tokens? I never considered that

[-]

counterfeit25@reddit

Yes, though the per token cost of input tokens is generally much cheaper than output tokens. E.g. https://claude.com/pricing#api

[-]

waiting_for_zban@reddit

This is mainly because of the trasnformer architecture and hardware optimizations. Prompt processing (pp) in general is always faster as input tokens are encoded once, which makes it relatively cheaper than token generation (tg), as you have to take into account the entire growing context to produce the next output, making tg slower, thus costing more. The only caveat with input tokens is that they scale worse, if you don't contract the context.

[-]

counterfeit25@reddit

Yup, from my understanding off the top of my head, when processing input tokens during prefill, all the hidden state tensors can be computed in parallel, e.g. hidden states for input token 1 can be computed in parallel with those of input token 10. But during decode there is a sequential dependency, e.g. you need to compute the hidden states and final value of output token N before computing those of output token N+1, not in parallel.

[-]

lemondrops9@reddit

By my math it would be 16,666 tks which doesnt add up.

[-]

ResidentPositive4122@reddit

It could be that ccusage doesn't count cached tokens as cached? You can have lots of "steps", where the previous ones are in kv cache, but ccusage counts the total number of tokens sent? Also most of that 2m is likely input tokens (agent grepping lots of files). You can def hit high pp with everything loaded in vram and enough room for many concurrent sessions with vLLM/sglang.

[-]

lemondrops9@reddit

OP is using a single 5060 ti 16gb

[-]

bobaburger@reddit (OP)

I got the numbers from ccusage, interesting if they're reporting a wrong number.

[-]

wanderer_4004@reddit

If you have 100k context then every question you ask is 100k token PP. Even if it is simple things like 'make the button more blue', 'ok, a bit wider border' etc.

Keeping the KV-cache in VRAM gives instant answers but also limits the number of user requests a GPU can handle.

[-]

bobaburger@reddit (OP)

Subagents. Apparently it's the `superpower` skill that was built-in in Claude's marketplace. It works so well, but if you're paying for API, beware of it.

[-]

nicholas_the_furious@reddit

Maybe subagents?

[-]

wisepal_app@reddit

if i am not wrong, your context window size ise 128k. How does Claude code create 2 mil tokens? You said even 2q variant tool calling is good. Which flags do you use in llama-server?

[-]

bobaburger@reddit (OP)

Yes, I'm running 128k. The 2M was the total of input + output tokens, if you look at the llama log on the left side, the total input token that goes to the context window was 52750. The rest was the amount of token generated to be written to the files, Claude won't send those back into the conversation so it will not flood the context.

[-]

bobaburger@reddit (OP)

oh btw, here's the command I'm running:

```
llama-server -m Qwen3.5-35B-A3B-UD-Q2_K_XL.gguf -fit on -fa 1 -c 128000 -np 1 --no-mmap --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --presence-penalty 1.5 --repeat-penalty 1.0 --chat-template-kwargs "{\"enable_thinking\": false}" -b 4096 -ub 2048 -ctk q8_0 -ctv q8_0
```

[-]

soumen08@reddit

Which GPU are you on?

[-]

bobaburger@reddit (OP)

RTX 5060 Ti

[-]

Shoddy_Recognition_2@reddit

I have exactly this... nice :)

[-]

lemondrops9@reddit

he's full of it. There is no way OP is doing +16k tokens a sec.

[-]

lemondrops9@reddit

So your doing 16,666 tks .... in order to get 2 million in 2 mins.

I doubt that...

[-]

bobaburger@reddit (OP)

You actually got me doubt my numbers, so I ran the llama.cpp log twice, with gemini 3.1 pro and claude sonnet 4.6

The reason you see the number did not add up is because of the KV cache. 2M tokens was a wrong number. The actual input tokens was 3M, and output tokens was 13k. But with KV cache, the total processed prompt tokens was 138k tokens.

You can see the full details here https://gist.github.com/huytd/3a1dd7a6a76fac3b19503f57b76dbe65#5-request-by-request-breakdown

[-]

tmvr@reddit

OK, this makes more sense, I was also doubtful of the similar numbers seen with my usage, thanks for the info!

[-]

counterfeit25@reddit

So even more impressive? 3M tokens in 2 min instead of "only" 2M tokens in 2 min :D
But I think those numbers are possible.

[-]

counterfeit25@reddit

it's not 2 million output tokens in 2 min, it's 2M tokens combined. that includes input tokens. Claude Code system prompt itself can be 10k+ input tokens.

[-]

lemondrops9@reddit

still doesn't add up.

[-]

PsychologicalOne752@reddit

The entire business model is turned on it's head. 🤣

[-]

counterfeit25@reddit

"I paid nothing except for two minutes of 400W electricity for the PC"

I was curious about the electricity cost of 2 minutes at 400W:

X USD/kWh * (2/60) h * 0.4 kW = (2/60) * 0.4 * X USD

If we plug in, say $0.25 per kWh from the utility company, we'll get:

(2/60) * 0.4 * 0.25 = 0.0033 USD

So about 1/3 of a cent for the electricity costs to run 2 minutes of computation at 400W, cool! Especially compared to $10.85 from Claude Sonnet 4.6.

You'd also need to account for the depreciation on your PC, but if you use your PC for other personal reasons then maybe that's not an issue.

[-]

lemondrops9@reddit

Im more wondering how OP thinks they are getting 16,666 tks.

[-]

counterfeit25@reddit

When looking at tokens per second people are generally referring to output tokens per second (decode phase), not input tokens per second (prefill phase) (https://huggingface.co/blog/tngtech/llm-performance-prefill-decode-concurrent-requests)

So the 2M token count is counting both input and output tokens.

[-]

bobaburger@reddit (OP)

Yeah other cost like PC depreciation was minimal. As for the big-small model switch in Claude, at work, I use Opus at the main model and Sonnet as the small model, some subagents were set to Haiku, so I think it still fair if we assume Sonnet cost as an average.