It costs you around 2% of your session usage to say hello to Claude!

Jdjd3333@reddit

This is why per-seat subscriptions are broken for AI. [collectiviq.ai](http://collectiviq.ai) is pure pay-per-use, about 20 cents per prompt, and you get 10+ models fused into one answer. No monthly trap.

lemon07r@reddit

Yeah usage is horrible for Pro plans. I've been on it like 3 times and cancelled each time I tried it again because the usage was so bad. ChatGPT plans are sooo much better right now with the 2x usage. But before we had 2x usage I found the GitHub Copilot plans to be the best value.

ShadyShroomz@reddit

Can you connect copilot to cli tools like opencode/codex? I just got the codex plan today and already used 30% of my weekly limit in like 4 hours.

lemon07r@reddit

yeah, opencode supports it officially, and there are tools like cliproxy that let you use it anywhere. Codex plan is strange, feels like they've reverted back to 1x usage already

JohnSane@reddit

Hot take: The system prompt should not count towards the usage.

BannedGoNext@reddit

So you think your overall usage should be reduced instead to compensate for people that agentically spawn lots of sessions burning that initial preprompt over and over?

JohnSane@reddit

Since it is so common, those are cached anyway.

BannedGoNext@reddit

For the love of god, learn how caching works before stating how it works.

JohnSane@reddit

Prompt caching stores and reuses parts of prompts that are frequently repeated. What do I get wrong? As you seem to be much more knowledgeable than me... Share your wisdom.

BannedGoNext@reddit

Prompt caching is ephemeral, it exists only in the session in which it is created. What you said is true, but it is session dependent, ergo you MUST pay the tax of tool definitions and other preprompts for every single session. LLMs are stateless, we create a transient state for the session using prompt caching! Once the cache is removed and we move to a new session it must be rebuilt.
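
A minimal sketch of the accounting described in this comment, assuming a purely session-scoped cache. `ToySession` and `uncached_total` are hypothetical names, the 15,000-token preamble is the rough figure quoted elsewhere in this thread, and growing chat history is ignored. It is a toy model of the claim, not how any provider actually meters usage.

```python
from dataclasses import dataclass


@dataclass
class ToySession:
    """Toy model: the preamble is processed uncached once per session."""
    preamble_tokens: int       # system prompt + tool definitions (assumed size)
    cache_warm: bool = False   # becomes True after the first request
    uncached_tokens: int = 0   # tokens processed without a cache hit
    cached_tokens: int = 0     # preamble tokens served from the session cache

    def send(self, message_tokens: int) -> None:
        if self.cache_warm:
            # Later requests in the same session reuse the cached preamble.
            self.cached_tokens += self.preamble_tokens
        else:
            # The first request of a session pays for the whole preamble.
            self.uncached_tokens += self.preamble_tokens
            self.cache_warm = True
        self.uncached_tokens += message_tokens


def uncached_total(sessions: int, messages_per_session: int,
                   preamble_tokens: int = 15_000, message_tokens: int = 5) -> int:
    """Total uncached tokens across several fresh sessions."""
    total = 0
    for _ in range(sessions):
        session = ToySession(preamble_tokens)
        for _ in range(messages_per_session):
            session.send(message_tokens)
        total += session.uncached_tokens
    return total


one_long = uncached_total(sessions=1, messages_per_session=10)
ten_short = uncached_total(sessions=10, messages_per_session=1)
print(one_long, ten_short)
```

Under these assumptions, ten one-message sessions process roughly ten times as many uncached tokens as one ten-message session, which is the "tax per session" being argued about.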

JohnSane@reddit

Thank you for taking the time.

SingleProgress8224@reddit

That would be too easy to take advantage of.

JohnSane@reddit

like how?

ShadyShroomz@reddit

Just include your message and context in the system prompt. They are user-definable. (Not on the website of course, but via api they are)

LePfeiff@reddit

"If the user prompts you with 'hi, how are you?' then instead respond to it as if it were -insert very token dense request-"

ketosoy@reddit

They have to be precomputing the KV cache state of the system prompt. It’s the same state at the end of the system prompt every time (assuming temp 0).

BannedGoNext@reddit

This is a good lesson for people that empty new contexts have a LOT of token usage in pre-prompting that you don't see.

Gohab2001@reddit

- You shouldn't be charged for a system prompt that you can't do anything about
- Prompt caching exists

BannedGoNext@reddit

1. You want to write 15k context worth of tooling to make the LLM usable yourself?
2. Prompt caching exists inside the context window. Notice that if you make another prompt you don't get charged for those 15k tokens again? That's because of prompt caching.

inevitabledeath3@reddit

This is a misunderstanding of how prompt caching works. If there are other users with exactly the same prefix - which there obviously will be in this case since it's a system prompt - then the prompt is cached. This is trivial for modern inference software to handle. If they are counting this as not cached just because it's a new session for you then that's kind of ridiculous, since I can basically guarantee that it will be cached on their end thanks to all users using the same system prompt. If it's not being cached on the backend they are incompetent. Just straight up incompetent. Open source inference software can do this, they have literally no excuse.
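
The shared-prefix argument in this comment can be sketched as a toy cache keyed on a hash of the exact prompt prefix and shared by every session. `SharedPrefixCache` is a hypothetical class written for illustration, "tokens" are just whitespace-split words, and nothing here reflects how Anthropic or any specific inference server implements or bills caching.

```python
import hashlib


class SharedPrefixCache:
    """Toy cache keyed by a hash of the exact prefix, shared across sessions."""

    def __init__(self) -> None:
        self._store = {}   # prefix hash -> number of cached prefix tokens
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prefix_tokens: list[str]) -> str:
        # Any change to the prefix (e.g. a different tool list) changes the key.
        joined = "\x1f".join(prefix_tokens)
        return hashlib.sha256(joined.encode("utf-8")).hexdigest()

    def process(self, prefix_tokens: list[str], user_tokens: list[str]) -> int:
        """Return how many tokens had to be computed from scratch."""
        key = self._key(prefix_tokens)
        if key in self._store:
            self.hits += 1
            return len(user_tokens)
        self.misses += 1
        self._store[key] = len(prefix_tokens)
        return len(prefix_tokens) + len(user_tokens)


cache = SharedPrefixCache()
system = "You are a helpful assistant with tools".split()
first_user = cache.process(system, ["hello"])       # cold: prefix + message
second_user = cache.process(system, ["hi", "there"])  # warm: message only
print(first_user, second_user, cache.hits, cache.misses)
```

The sketch also shows the counterpoint raised in the replies: a session that loads a different set of tools has a different prefix, so it misses the cache even though the cache is shared.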

BannedGoNext@reddit

Oh.. it's a misunderstanding of how prompt caching works. A session just magically knows what the fuck you are going to load as far as tools into it, so it magically reaches into Jeff Bezos's magic asshole and pulls out perfect context. How can it do this? Because the context cache knows where it is at all times, because it knows where it isn't. Grow the fuck up and stop correcting people on something you don't know anything about. https://preview.redd.it/x73yj48sumrg1.png?width=919&format=png&auto=webp&s=52a84109e68c021cccf83612d4f6220d38df97c3

Gohab2001@reddit

Prompt caching impact for normalized cost and time to first token (TTFT). Results use the system prompt only caching strategy. The no-cache baseline is normalized to 100% and lower values indicate better performance. https://arxiv.org/html/2601.06007v1

JuliusCeaserBoneHead@reddit

Thank you, it’s unacceptable to be punished for the system to literally work.

michaelsoft__binbows@reddit

yeah claude consumes about 15k tokens at the start of each new session just for its system prompt and tool call schemas (at least under claude code, dunno about the chatbot) so it does behoove you to try to not make too many fresh sessions.
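
To put that number in perspective, here is a small back-of-the-envelope sketch. `preamble_share` is a hypothetical helper, the 15,000 figure is the estimate from this comment, and the 2-token "hello" is an assumption for illustration.

```python
def preamble_share(preamble_tokens: int, message_tokens: int) -> float:
    """Fraction of a first request's input that is hidden preamble."""
    total = preamble_tokens + message_tokens
    if total == 0:
        return 0.0
    return preamble_tokens / total


# Assumed sizes: ~15k tokens of system prompt and tool schemas, 2 tokens of "hello".
share = preamble_share(15_000, 2)
print(f"{share:.2%} of the first request is preamble")
```

With these assumed sizes, nearly all of the input for a fresh "hello" is preamble rather than the message itself.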

Dry_Yam_4597@reddit

Not sure why you are getting downvoted. There's something dodgy happening there. In my case it jumped from 75% weekly use to 90% in a couple of seconds.

KaMaFour@reddit

> Not sure why you are getting downvoted

Probably because this is one of the worst subs to choose for this question

No_Lingonberry1201@reddit

Why? It's a common question anyone here asks themselves. I mean I found this after I recoiled at the high prices Western companies asked for.

StewedAngelSkins@reddit

Everyone here also asks themselves what they should cook for dinner. Doesn't mean you should post it.

FriskyFennecFox@reddit

What should I cook for dinner, r/LocalLLaMA ?

StewedAngelSkins@reddit

Reported!!! >!(not really lol)!<

FriskyFennecFox@reddit

NO PLEASE! I-I clearly meant my finetuning run for the evening, y-yes! I only cook homemade LoRA's, fresh from the local 350W oven! *Fidgets nervously*

BannedGoNext@reddit

If you think we use local inference because of PRICE.. boy.. you haven't been here long. We sure are dodging spend by spending many thousands of dollars on local hardware to run inference lol. I've seen people in this sub with hundreds of thousands of dollars in hardware. This is not the budget club you think it is.

arcanemachined@reddit

It's barely tangentially relevant, but it's local-adjacent in the sense that it highlights the classic rugpull scenario that befalls all SaaS software users. Anthropic recently changed their (opaque) session limits: https://www.reddit.com/r/ClaudeAI/comments/1s4idaq/update_on_session_limits/

BannedGoNext@reddit

1. It's off topic.
2. It's an uneducated question; why it happens could be understood with any simple Google search or LLM ask.

Here.. let me ask my LOCAL MODEL Qwen3 Coder Next the question. No web search, no MCP, just local LLMs living in the moment:

\----------------

Why does the first prompt to an LLM burn so many more tokens than the prompt contains?

\----------------

Great question! When you send a prompt to a large language model (LLM), the number of tokens processed (and thus billed) is often much larger than the number of tokens in your raw input prompt. This happens due to several factors:

1. Context Window Includes History & System Prompts

LLMs operate within a context window (e.g., 32k or 128k tokens). Your input prompt is only part of what gets fed into the model:

- Chat history (previous user/assistant turns)
- System instructions (e.g., “You are a helpful assistant…”)
- Tool/function definitions (e.g., if you’re using function calling)
- Retrieved context (e.g., from RAG: documents, search results)

➡️ Example: You send 100 tokens, but the full context passed to the model includes 5,000 tokens of history and retrieved docs → the LLM processes all 5,100.

2. Tokenization Overhead

Some tokens represent special control tokens, such as:

- <s> or <BOS> (beginning-of-sequence)
- </s> or <EOS> (end-of-sequence)
- <|begin\_of\_text|>, <|end\_of\_text|>, user, assistant, system (in Llama-style tokenizers)

Whitespace and punctuation can be split into multiple tokens (e.g., "don’t" → \["do", "n’t"\]). Unicode, emojis, or rare words may be broken into subword units. While usually small (a few extra tokens), this adds up in long inputs.

3. Model Internals & Activation Overhead (Less Direct, But Relevant)

During inference, the model computes key-value (KV) caches for every token in the context; memory and compute grow quadratically with context length in some implementations. Though not directly counted in input token counts, the latency and resource usage scale with the full context, not just your prompt.

4. API Billing Includes Output Tokens

If you’re seeing high token usage per request, remember: Total tokens = input tokens + output tokens. A short prompt can generate a long response (e.g., 50 input tokens → 1,000 output tokens). Many APIs (e.g., OpenAI, Anthropic) charge per total tokens used (input + output).

➡️ Example: Prompt: “Write a poem.” (\~5 tokens) Response: \~500-token poem → 505 tokens total, and you’re billed accordingly.

5. Tool Calls & Structured Outputs

If using tools (e.g., function calling, JSON mode, or structured outputs):

- Tools’ schemas are embedded in the prompt as text.
- The model may output structured data that’s tokenized differently (e.g., extra quotes, commas, braces).
- Some frameworks (e.g., LangChain, LlamaIndex) add extra metadata.

How to Diagnose

- Use tokenizers directly (e.g., tiktoken for OpenAI, tokenizers from Hugging Face) to count your actual prompt tokens.
- Check API logs: most cloud APIs return prompt\_tokens, completion\_tokens, and total\_tokens.
- Log your full context, including history and system prompt, to see where tokens are going.

TL;DR

The “prompt” you see is just the tip of the iceberg. The model processes the entire context (history + instructions + retrieved data), plus overhead tokens, and you’re often charged for the output too. That’s why usage can far exceed your original text. Let me know if you want help counting tokens for a specific use case! 🧮
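
The "check API logs" advice in the quoted answer could look roughly like the sketch below. It assumes each logged response is a plain dict with a `usage` object carrying the `prompt_tokens` and `completion_tokens` fields named above; `summarize_usage` and the sample numbers are made up for illustration, and real APIs may name or nest these fields differently.

```python
def summarize_usage(responses: list[dict]) -> dict:
    """Sum the usage fields across a list of logged API responses."""
    totals = {"prompt_tokens": 0, "completion_tokens": 0}
    for response in responses:
        usage = response.get("usage", {})
        for field in totals:
            # Missing fields count as zero rather than raising.
            totals[field] += int(usage.get(field, 0))
    totals["total_tokens"] = totals["prompt_tokens"] + totals["completion_tokens"]
    return totals


# Hypothetical log: a "hello" that still carries a large hidden preamble.
logged = [
    {"usage": {"prompt_tokens": 15_002, "completion_tokens": 12}},
    {"usage": {"prompt_tokens": 15_040, "completion_tokens": 230}},
]
print(summarize_usage(logged))
```

Comparing `prompt_tokens` against the length of what you actually typed is the quickest way to see how much of a request is preamble and history.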

Dry_Yam_4597@reddit

> It's off topic.

I think this type of thing is on topic, because many of us want local, among other reasons, due to the abuse these corporations commit once they reach a certain size. People should take note.

BannedGoNext@reddit

Local inference is not an escape from spending, most of us spend WAY the fuck more on local inference than what subscriptions cost. Sorry to be the bearer of bad news. Local inference is not a wallet protest.

Dry_Yam_4597@reddit

No you are missing the point :) What I meant by abuse is basically getting scammed into paying more.

BannedGoNext@reddit

There is no scam, only people that misunderstand how LLMs work. They think they are some sort of magic box; they are not. They don't want to pay for preprompt, but if you stopped preprompting they would scream because the magic box wouldn't act like a magic box.

ziroux@reddit

Maybe they moved calculating the usage to ai

Turtlesaur@reddit

I said resume work, and it burned through 100% of my 5 hour quota before finishing its tasks.

DinoAmino@reddit

Dodgy? This is LocalLLaMA. Only a small percentage here actually talk cloud. Downvotes signal that most people don't care about this post.

Dry_Yam_4597@reddit

I mean dodgy on Claude's end. But yeah I think the kind of issue op is raising is yet more proof that people should use local where possible.

BannedGoNext@reddit

Yea.. we almost all use the shit out of cloud providers, but that's not the topic here. I think so many people spam here because it's where a bunch of smart people congregated, but the constant spam of non local inference stuff here is annoying. I don't mind when there is research, or news about LLMs, or even stuff that is tangentially related, but so much isn't even that.

SodaBurns@reddit

Their servers are overloaded due to the influx of users who came after Anthropic's stand-off with DoW. Not to mention they always need to maintain capacity for training future models and corporate clients.

Dry_Yam_4597@reddit

So they choose to scam users in order to serve more users? But yeah it does seem to go offline quite often.

arshad_ali1999@reddit

In my case it jumped to 100% in seconds, I just typed "continue"

DR4G0NH3ART@reddit

I cancelled my Pro plan because it was no longer worth it. Especially in the last week, even Sonnet 4.6 requests were hitting the 5hr limit soo fast for some reason. I will get a Max plan from the org instead. Want to set up Qwen locally anyway for personal small stuff.

chimph@reddit

sorry but how’s this relevant to this sub?

Alan_Silva_TI@reddit

I believe this person is trying to highlight some of the pitfalls of closed source online models... So, very much relevant I may say.

chimph@reddit

By saying how much better codex is.. ok

llkj11@reddit

What plan?

Haeppchen2010@reddit

To keep it LocalLLaMA: Try that with Qwen3.5 27B... It will think with 3000 Tokens on how to respond to "Hello" :) At least it just heats the living room, and I always have an unlimited plan 😇

jrdnmdhl@reddit

If someone I really don't want to talk to says "Hi" to me it usually consumes about 20% of my session tokens.

MikeReynolds@reddit

Try saying Hi in Haiku with Extended off

jacek2023@reddit

Over 100 upvotes for Claude / Codex. In another post people say to pay for Alibaba cloud because hardware is too expensive. This state of LocalLLaMA is all your fault. You did this. You upvote this bullshit.

Zafar_Kamal@reddit

You literally copy pasted my entire post from r/ClaudeCode .

Zafar_Kamal@reddit

is OP a bot?

DisjointedHuntsville@reddit

Wonder if that's fraud 🤔

samandiriel@reddit

That would imply that there is regulation. Trump isn't taking those campaign donations from the tech bros for no reason (buying more metric tons of Cheeto dust is the reason, FYI)

Alan_Silva_TI@reddit

Yet every single “AI influencer” keeps pushing Claude like it’s the only way to use AI. Everything they cover, build, or talk about somehow revolves around Claude. It honestly feels like back when people started using “Google” as a synonym for the entire internet. At this point, a lot of online AI discussion sounds less like real comparison and more like brand repetition. Every workflow, every tutorial, every recommendation... "Just use Claude", "Claude can setup it for you", "Works best with Claude".

bad_detectiv3@reddit

It's quite frustrating that these AI companies want to revolutionize the world with ads everywhere yet they are so stingy with token usage and all. It feels like they are in a rush to make that quick buck!

samandiriel@reddit

Actions do speak louder than words, eh?

gibriyagi@reddit

I usually say hello to check whether the service is down or not. Maybe there should be a ping command for that.

Negative-River-2865@reddit

Then don't say hello. Ask your questions all at once... and ask it to summarize in case that would be too much

Infninfn@reddit

You should go there [https://www.reddit.com/r/ClaudeAI/comments/1s4xriy/so\_the\_rate\_limit\_bug\_was\_actually\_anthropic/](https://www.reddit.com/r/ClaudeAI/comments/1s4xriy/so_the_rate_limit_bug_was_actually_anthropic/) not here.

XeNoGeaR52@reddit

That's Opus 4.6 Extended. Use Haiku and it will not go up. I used Opus 4.6 without the extended context all afternoon and still have 56% left for the day

Wildnimal@reddit

I agree with you. This week the token usage is going off the charts. I just uploaded a 45-line JSON and a basic prompt and it shows 20% usage for the 5 hour limit. I am not a heavy user either. Most of the stuff I do requires me to do most of the config and code manually once the AI has done its code, which is maybe a 2-3 hour session a week at most.

Pwc9Z@reddit

Don't forget to set up that Cowork workflow so you can have tokens being periodically wasted on absolutely trivial shit!

Familiar-Rutabaga608@reddit

I’ve been on Claude on a free plan as a super light user: some light coding, Linux, and documentation browsing. In the last 2 months I’ve been “timed out” for using too many tokens. Maybe 3x as fast as before, under similar loads.

SandwichSisters@reddit

I saw the same thing in codex. It burns through my usage like a motherfucker.

black_ap3x@reddit

Sadly, they announced yesterday (on reddit mind you) that they have limited the tokens for people that have subscriptions, because of "workload balancing". I have been experiencing this token throttling for a couple of days now and suspected something dodgy. Well, it's fair to say I cancelled my subscription the moment I saw their post admitting to changing the token amount without first telling us about it.

Olangotang@reddit

"workload balancing" == "this shit is fucking expensive and we won't survive after VC funding dries up"

black_ap3x@reddit

Yeah, this is gonna happen to all AI companies the moment they stop getting government/investment funding. Their models will never generate enough money (through average user subscriptions) to make them profitable. The only route I see for them to maybe survive is company contracts, where they provide only companies with limited tokens for a contract. That is it. Tbh I wish they would kind of die off/slow down so that I can upgrade my PC and get my own home AI assistant.

Academic-Local-7530@reddit

Opus 4.6 Extended

aninjaturtle@reddit

It clearly understood your intent. Your intent was to show its API usage. Claude understood the assignment.

BosseMann@reddit

Agi achieved

EffectiveCeilingFan@reddit

I knew something was going on, usage has been like non-existent lately

breadinabox@reddit

I was using Codex and everyone was saying Claude was better, so I got the plan and set it up. As a trial I asked it to do something benign I had a runbook for (making a new color palette for a theme for my home-assistant lighting and throwing it in the scripts.yaml), and it used 80% of the 5hr window doing just that. It wasn't even getting stuck as far as I can tell. That's something Codex can do in like... nothing. I just straight up cancelled there. I have no idea how to make something that basic less intensive; it's just useless.

ZiXXiV@reddit

Anthropic: Yeah you're in peak hours!

JohnSane@reddit

Opus extended.

PopularDifference186@reddit

bc you use light mode

81 Comments