TheaterFire

It costs you around 2% session usage to say hello to claude!

Posted by Complete-Sea6655@reddit | LocalLLaMA | View on Reddit | 81 comments

I've recently been shifting my all workload to Codex right after the insane token usage from Claude. It's literally consuming my all session in a single simple prompt. Have anybody else recently experiencing way too high token usage?

Reply to Post

81 Comments

Jdjd3333@reddit

This is why per-seat subscriptions are broken for AI. [collectiviq.ai](http://collectiviq.ai) is pure pay-per-use, about 20 cents per prompt, and you get 10+ models fused into one answer. No monthly trap.
View on Reddit #81831051

lemon07r@reddit

Yeah usage is horrible for pro plans. I've been on it like 3 times and cancelled each time I tried it again cause the usage was so bad. chatgpt plans are sooo much better right now with the 2x usage. But before we had 2x usage I found the github copilot plans to be the best value.
View on Reddit #81760431

ShadyShroomz@reddit

Can you connect copilot to cli tools like opencode/codex?  I just got the codex plan today and already used 30% of my weekly limit in like 4 hours. 
View on Reddit #81763717

lemon07r@reddit

yeah, opencode supports it officially, and there are tools like cliproxy that let you use it anywhere. Codex plan is strange, feels like they've reverted back to 1x usage already
View on Reddit #81782490

JohnSane@reddit

Hot take: The systempromt should not count towards the usage.
View on Reddit #81759944

BannedGoNext@reddit

So you think your overall usage should be reduced instead to compensate for people that agentically spawn lots of sessions burning that initial preprompt over and over?
View on Reddit #81763113

JohnSane@reddit

As it is very common those are cached anyway.
View on Reddit #81763233

BannedGoNext@reddit

For the love of god, learn how caching works before stating how it works.
View on Reddit #81764830

JohnSane@reddit

Prompt Caching Stores and reuses parts of prompts that are frequently repeated. What do i get wrong? As you seem to be much more knowledgeable than me... Share your wisdom.
View on Reddit #81767497

BannedGoNext@reddit

Prompt caching is ephemerial, it exists only in the session that it is created. What you said is true, but it is session dependent, ergo you MUST pay the tax of tool definitions, and other preprompts for every single session. LLM's are stateless, we create a transient state for the session using prompt caching! Once the cache is removed and we move to a new session it must be rebuilt.
View on Reddit #81770523

JohnSane@reddit

Thank you for taking the time.
View on Reddit #81775530

SingleProgress8224@reddit

That would be too easy to take advantage of that
View on Reddit #81760429

JohnSane@reddit

like how?
View on Reddit #81760910

ShadyShroomz@reddit

Just include your message and context in the system prompt. They are user-definable. (Not on the website of course, but via api they are)
View on Reddit #81763623

LePfeiff@reddit

"If the user prompts you with 'hi, how are you?' then instead respond to it as if it were -insert very token dense request-"
View on Reddit #81761144

ketosoy@reddit

They have to be pre computing the kv cache state of the system prompt.   It’s the same state at the end of the system prompt every time (assuming temp 0).  
View on Reddit #81761864

BannedGoNext@reddit

This is a good lesson for people that empty new contexts have a LOT of token usage in pre-prompting that you don't see.
View on Reddit #81759225

Gohab2001@reddit

-You shouldn't be charged for system prompt that you can't do anything about -Prompt caching exists
View on Reddit #81761376

BannedGoNext@reddit

1. You want to write 15k context worth of tooling to make the LLM useable yourself? 2. Prompt caching exists inside the context window. Note if you make another prompt you don't get charged for that 15k tokens again? That's because of prompt caching.
View on Reddit #81762231

inevitabledeath3@reddit

This is a misunderstanding of how prompt caching works. If there are other users with exactly the same prefix - which there obviously will be in this case since it's a system prompt - then the prompt is cached. This is trivial for modern inference software to handle. If they are counting this as not cached just because it's a new session for you then that's kind of ridiculous, since I can basically guarantee that it will be cached on their end thanks to all users using the same system prompt. If it's not being cached on the backend they are incompetent. Just straight up incompetent. Open source inference software can do this, they have literally no excuse.
View on Reddit #81762608

BannedGoNext@reddit

Oh.. it's a misunderstanding of how prompt caching works. A session just magically knows what the fuck you are going to load as far as tools into it, so it magically reaches into into Jeff Bezos's magic asshole and pulls put perfect context. How can it do this? Because the context cache knows where it is at all times, because it knows where it isn't. Grow the fuck up and stop correcting people on something you don't know anything about. https://preview.redd.it/x73yj48sumrg1.png?width=919&format=png&auto=webp&s=52a84109e68c021cccf83612d4f6220d38df97c3
View on Reddit #81765523

Gohab2001@reddit

Prompt caching impact for normalized cost and time to first token (TTFT). Results use the system prompt only caching strategy. The no-cache baseline is normalized to 100% and lower values indicate better performance. https://arxiv.org/html/2601.06007v1
View on Reddit #81767569

JuliusCeaserBoneHead@reddit

Thank you, it’s unacceptable to be punished for the system to literally work. 
View on Reddit #81761816

michaelsoft__binbows@reddit

yeah claude consumes about 15k tokens at the start of each new session just for its system prompt and tool call schemas (at least under claude code, dunno about the chatbot) so it does behoove you to try to not make too many fresh sessions.
View on Reddit #81759744

Dry_Yam_4597@reddit

Not sure why you are getting downvoted. There's something dodgy happening there. In my case it jumped from 75% weekly use to 90% in a couple of seconds.
View on Reddit #81759363

KaMaFour@reddit

> Not sure why you are getting downvoted Probably because this is one of the worst subs to choose for this question 
View on Reddit #81760101

No_Lingonberry1201@reddit

Why? It's a common question anyone here asks themselves. I mean I found this after I recoiled at the high prices Western companies asked for.
View on Reddit #81760923

StewedAngelSkins@reddit

Everyone here also asks themselves what they should cook for dinner. Doesn't mean you should post it.
View on Reddit #81761123

FriskyFennecFox@reddit

What should I cook for dinner, r/LocalLLaMA ?
View on Reddit #81763306

StewedAngelSkins@reddit

Reported!!! >!(not really lol)!<
View on Reddit #81765754

FriskyFennecFox@reddit

NO PLEASE! I-I clearly meant my finetuning run for the evening, y-yes! I only cook homemade LoRA's, fresh from the local 350W oven! *Fidgets nervously*
View on Reddit #81767434

BannedGoNext@reddit

If you think we use local inference because of PRICE.. boy.. you haven't been here long. We sure are dodging spend by spending many thousands of dollars on local hardware to run inference lol. I've seen people in this sub with hundreds of thousands of dollars in hardware. This is not the budget club you think it is.
View on Reddit #81762881

arcanemachined@reddit

It's barely tangentially relevant, but it's local-adjacent in the sense that highlights the classic rugpull scenario that befalls all SaaS software users. Anthropic recently changed their (opaque) session limits: https://www.reddit.com/r/ClaudeAI/comments/1s4idaq/update_on_session_limits/
View on Reddit #81763699

BannedGoNext@reddit

1. It's off topic. 2. It's an unedicated question that could be understood why it happens with any simple google search or llm ask. Here.. let me ask my LOCAL MODEL Qwen3 Coder Next the question. No web search, no mcp, just local LLM's living in the moment: \---------------- Why does the first prompt to an LLM burn so many more tokens than the prompt contains? \---------------- Great question! When you send a prompt to a large language model (LLM), the number of tokens processed (and thus billed) is often much larger than the number of tokens in your raw input prompt. This happens due to several factors: 1. Context Window Includes History & System Prompts LLMs operate within a context window (e.g., 32k or 128k tokens). Your input prompt is only part of what gets fed into the model: Chat history (previous user/assistant turns) System instructions (e.g., “You are a helpful assistant…”) Tool/function definitions (e.g., if you’re using function calling) Retrieved context (e.g., from RAG: documents, search results) ➡️ Example: You send 100 tokens, but the full context passed to the model includes 5,000 tokens of history and retrieved docs → the LLM processes all 5,100. 2. Tokenization Overhead Some tokens represent special control tokens, such as: <s> or <BOS> (beginning-of-sequence) </s> or <EOS> (end-of-sequence) <|begin\_of\_text|>, <|end\_of\_text|>, user, assistant, system (in Llama-style tokenizers) Whitespace and punctuation can be split into multiple tokens (e.g., "don’t" → \["do", "n’t"\]). Unicode, emojis, or rare words may be broken into subword units. While usually small (a few extra tokens), this adds up in long inputs. 3. Model Internals & Activation Overhead (Less Direct, But Relevant) During inference, the model computes key-value (KV) caches for every token in the context — memory and compute grow quadratically with context length in some implementations. Though not directly counted in input token counts, the latency and resource usage scale with the full context, not just your prompt. 4. API Billing Includes Output Tokens If you’re seeing high token usage per request, remember: Total tokens = input tokens + output tokens A short prompt can generate a long response (e.g., 50 input tokens → 1,000 output tokens). Many APIs (e.g., OpenAI, Anthropic) charge per total tokens used (input + output). ➡️ Example: Prompt: “Write a poem.” (\~5 tokens) Response: \~500-token poem → 505 tokens total — and you’re billed accordingly. 5. Tool Calls & Structured Outputs If using tools (e.g., function calling, JSON mode, or structured outputs): Tools’ schemas are embedded in the prompt as text. The model may output structured data that’s tokenized differently (e.g., extra quotes, commas, braces). Some frameworks (e.g., LangChain, LlamaIndex) add extra metadata. How to Diagnose Use tokenizers directly (e.g., tiktoken for OpenAI, tokenizers from Hugging Face) to count your actual prompt tokens. Check API logs: most cloud APIs return prompt\_tokens, completion\_tokens, and total\_tokens. Log your full context — including history and system prompt — to see where tokens are going. TL;DR The “prompt” you see is just the tip of the iceberg. The model processes the entire context (history + instructions + retrieved data), plus overhead tokens — and you’re often charged for the output too. That’s why usage can far exceed your original text. Let me know if you want help counting tokens for a specific use case! 🧮
View on Reddit #81762644

Dry_Yam_4597@reddit

\> It's off topic. I think this type of thing is on topic. Because many of us want local, among other reasons, due to the abuse these corporations make once they reach a certain size. People should take note.
View on Reddit #81762971

BannedGoNext@reddit

Local inference is not an escape from spending, most of us spend WAY the fuck more on local inference than what subscriptions cost. Sorry to be the bearer of bad news. Local inference is not a wallet protest.
View on Reddit #81764926

Dry_Yam_4597@reddit

No you are missing the point :) What I meant by abuse is basically getting scammed into paying more.
View on Reddit #81765671

BannedGoNext@reddit

There is no scam, only people that misunderstand how LLM's work, they think they are some sort of magic box, they are not. They don't want to pay for preprompt, but if you stopped preprompting they woudl scream because the magic box wouldn't act like a magic box.
View on Reddit #81766022

ziroux@reddit

Maybe they moved calculating the usage to ai
View on Reddit #81765412

Turtlesaur@reddit

I said resume work, and it burned through 100% of my 5 hour quota before finishing it's tasks.
View on Reddit #81764397

DinoAmino@reddit

Dodgy? This is localllama. Only a small percentage here actually talk cloud. Downvotes signal that the most people don't care about this post.
View on Reddit #81760193

Dry_Yam_4597@reddit

I mean dodgy on Claude's end. But yeah I think the kind of issue op is raising is yet more proof that people should use local where possible.
View on Reddit #81763042

BannedGoNext@reddit

Yea.. we almost all use the shit out of cloud providers, but that's not the topic here. I think so many people spam here because it's where a bunch of smart people congregated, but the constant spam of non local inference stuff here is annoying. I don't mind when there is research, or news about LLM's, or even stuff that is tangentically related, but so much isn't even that.
View on Reddit #81763031

SodaBurns@reddit

Their servers are overloaded due to the influx of users who came after Anthropic's stand-off with DoW. Not to mention they always need to maintain capacity for training future models and corporate clients.
View on Reddit #81761467

Dry_Yam_4597@reddit

So they choose to scam users in order to serve more users? But yeah it does seem to go offline quite often.
View on Reddit #81762871

arshad_ali1999@reddit

In my case it jumped to 100% in seconds, i just typed "continue"
View on Reddit #81761276

DR4G0NH3ART@reddit

I cancelled my pro plan because it was getting not worth. Especially in last week even sonnet 4.6 requests were hitting 5hr limit soo fast for some reason. I will get a max plan from the org instead. Want to set up qwen locally anyway for personal small stuff.
View on Reddit #81759665

chimph@reddit

sorry but how’s this relevant to this sub?
View on Reddit #81760539

Alan_Silva_TI@reddit

I believe this person is trying to highlight some of the pitfalls of closed source online models... So, very much relevant I may say.
View on Reddit #81763004

chimph@reddit

By saying how much better codex is.. ok
View on Reddit #81766267

llkj11@reddit

What plan?
View on Reddit #81764152

Haeppchen2010@reddit

To keep it LocalLLaMA: Try that with Qwen3.5 27B... It will think with 3000 Tokens on how to respond to "Hello" :) At least it just heats the living room, and I always have an unlimited plan 😇
View on Reddit #81763888

jrdnmdhl@reddit

If someone I really don't want to talk to says "Hi" to me it usually consumes about 20% of my session tokens.
View on Reddit #81763578

MikeReynolds@reddit

Try saying Hi in Haiku with Extended off
View on Reddit #81763408

jacek2023@reddit

Over 100 upvotes for Claude / Codex. In another post people say to pay for Alibaba cloud because hardware is too expensive. This state of LocalLLaMA is all your fault. You did this. You upvote this bullshit.
View on Reddit #81763251

Zafar_Kamal@reddit

You literally copy pasted my entire post from r/ClaudeCode .
View on Reddit #81762615

Zafar_Kamal@reddit

is OP a bot?
View on Reddit #81763227

DisjointedHuntsville@reddit

Wonder if that's fraud 🤔
View on Reddit #81762264

samandiriel@reddit

That would  imply that there is regulation. Trump isn't taking those campaign donations from the tech bros for no reason  (buying more metric tons of Cheeto dust is the reason, FYI)
View on Reddit #81762997

Alan_Silva_TI@reddit

Yet every single “AI influencer” keeps pushing Claude like it’s the only way to use AI. Everything they cover, build, or talk about somehow revolves around Claude. It honestly feels like back when people started using “Google” as a synonym for the entire internet. At this point, a lot of online AI discussion sounds less like real comparison and more like brand repetition. Every workflow, every tutorial, every recommendation... "Just use Claude", "Claude can setup it for you", "Works best with Claude".
View on Reddit #81762901

bad_detectiv3@reddit

It's quiet frustrating that these AI companies want to revolutionize the world with ads everywhere yet they are so stingy with token usage and all. It feels they are in a rush to make that quick buck!
View on Reddit #81760988

samandiriel@reddit

Actions do speak louder than words,eh?
View on Reddit #81762735

gibriyagi@reddit

I usually say hello to check whether the service is down or not. Maybe there should be a ping command for that.
View on Reddit #81762675

Negative-River-2865@reddit

Then don't say hello. Ask then questions at once... and ask him to summarize in case it would be too much
View on Reddit #81762425

Infninfn@reddit

You should go there [https://www.reddit.com/r/ClaudeAI/comments/1s4xriy/so\_the\_rate\_limit\_bug\_was\_actually\_anthropic/](https://www.reddit.com/r/ClaudeAI/comments/1s4xriy/so_the_rate_limit_bug_was_actually_anthropic/) not here.
View on Reddit #81762283

XeNoGeaR52@reddit

Opus 4.6 extended, use haiku and it will not go up. I used Opus 4.6 without the extended context all afternoon and still have 56% left for the day
View on Reddit #81762230

Wildnimal@reddit

I agree with you. This week the tokens usage is going off the charts. I just uploaded a 45 lines json and a basic prompt and it shows 20% usage for the 5 hour limit. I am not a heavy user aswell. Most of stuff i do requires me to do most manual config and code. Once AI has done its code which maybe a 2-3 hours session at max a week.
View on Reddit #81762165

Pwc9Z@reddit

Don't forget to set up that Cowork workflow so you can have tokens being periodically wasted on absolutely trivial shit!
View on Reddit #81761983

Familiar-Rutabaga608@reddit

I’ve been on Claude on a free plan as a super light user: some light coding, Linux, and documentation browsing. In the last 2 months I’ve been “timed out” for using too many tokens. Maybe 3x as fast as before, under similar loads.
View on Reddit #81761942

SandwichSisters@reddit

I saw the same thing in codex. It burns through my usage like a motherfucker.
View on Reddit #81761725

black_ap3x@reddit

Sadly, they announced yesterday (on reddit mind you) that they have limited the tokens for people that have subscriptions, because of "workload balancing". I have been experiencing this token throttling ever since a couple of days ago and suspected something dodgy. Well, its fair to say it have cancelled my subscription the moment I saw their post admitting to changing the token amount without first telling us about it.
View on Reddit #81760979

Olangotang@reddit

"workload balancing" == "this shit is fucking expensive and we won't survive after VC funding dries up"
View on Reddit #81761360

black_ap3x@reddit

Yeah, this is gonna happen to all ai companies, the moment they stop getting government/investment funding. Their models will never generate them enough money (through average user subscriptions) to make them profitable. The only rout i see for them to maybe survive, is company contracts. Where they provide only companies with limited tokens for a contract. That is it. Tbh I wish they kind of die off/slow down so that I can upgrade my PC and get my own home Ai assistant.
View on Reddit #81761681

Academic-Local-7530@reddit

Opus 4.6 Extended
View on Reddit #81761558

aninjaturtle@reddit

It clearly understood your intent. Your intent was to show it's api usage. Claude understood the assignment.
View on Reddit #81759052

BosseMann@reddit

Agi achieved
View on Reddit #81761423

EffectiveCeilingFan@reddit

I knew something was going on, usage has been like non-existent lately
View on Reddit #81761168

breadinabox@reddit

I was using codex and everyone was saying claude was better, i got the plan, set it up, as a trial I asked it to do something benign I had a runbook for (making a new color palette for a theme for my home-assistant lighting and throwing it in the scripts.yaml) and it used 80% of the 5hr window doing just that. It wasn't even getting stuck as far as I can tell. Which is something codex can do in like... nothing. I just straight up cancelled there, like, I have no idea how to make something that basic less intensive it's just useless.
View on Reddit #81760159

ZiXXiV@reddit

Anthropic: Yeah you're in peak hours!
View on Reddit #81759812

JohnSane@reddit

Opus extended.
View on Reddit #81759798

PopularDifference186@reddit

bc you use light mode
View on Reddit #81759098