How much do 1T tokens cost? How much did all these amazing people spend on OpenAI tokens?
Posted by aospan@reddit | LocalLLaMA | View on Reddit | 47 comments
I did some math as a follow-up to OpenAI’s Dev Day yesterday and decided to share it here.
Assuming GPT-5 with a 4:1 input:output token ratio, 1T tokens means 800 billion (800,000 million) input tokens at $1.25 per million, which is $1,000,000, plus 200 billion output tokens at $10 per million, adding $2,000,000, for a total of $3,000,000 per 1T tokens.
In the photo, 30 people consumed 1T tokens, 70 people 100B tokens, and 54 people 10B tokens, totaling $112,620,000, which is roughly 3% of OpenAI’s total $3.7 billion revenue in 2024.
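For anyone who wants to tweak the assumptions (pricing, the 4:1 split, the headcounts), here's the same arithmetic as a quick Python sketch using the numbers above:

```python
PRICE_IN = 1.25 / 1_000_000    # $ per input token (GPT-5 list price)
PRICE_OUT = 10.00 / 1_000_000  # $ per output token

def cost(total_tokens, in_ratio=0.8):
    """Cost of a token budget split 4:1 between input and output."""
    return (total_tokens * in_ratio * PRICE_IN
            + total_tokens * (1 - in_ratio) * PRICE_OUT)

print(f"1T tokens: ${cost(1e12):,.0f}")          # ~$3,000,000
people = {1e12: 30, 100e9: 70, 10e9: 54}          # token count -> number of people
total = sum(cost(tokens) * n for tokens, n in people.items())
print(f"Everyone in the photo: ${total:,.0f}")    # ~$112,620,000
```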
Curious - is it even possible to process this amount of tokens using local models? What would be the cost in GPUs and residential electricity? 🧐⚡️
GenLabsAI@reddit
You forgot to include caching which will at least remove $20M from that cost, likely much more.
IrisColt@reddit
"Caching" what? Genuinely asking.
GenLabsAI@reddit
You can cache conversation history. Just google context caching openai.
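Rough illustration of how much it can matter; the hit rate and discount below are made-up assumptions, not OpenAI's actual numbers, so check their pricing page for the real cached-input rate:

```python
PRICE_IN = 1.25 / 1_000_000   # $ per fresh input token (GPT-5)
CACHE_DISCOUNT = 0.90         # assumed discount on cached input tokens
CACHE_HIT_RATE = 0.70         # assumed share of input tokens that are repeats

input_tokens = 800e9          # the 800B input tokens from OP's estimate
cached = input_tokens * CACHE_HIT_RATE
fresh = input_tokens - cached
input_cost = fresh * PRICE_IN + cached * PRICE_IN * (1 - CACHE_DISCOUNT)
print(f"Input cost with caching: ${input_cost:,.0f}")  # vs $1,000,000 without
```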
IrisColt@reddit
Thanks!
exclaim_bot@reddit
You're welcome!
FullOf_Bad_Ideas@reddit
It will be mostly input tokens and small models, which cost less. 4o-mini is probably running most workloads, and it's priced at $0.15 in, $0.60 out, so about 8-17x cheaper.
I've processed 9B tokens locally (prompt caching did the heavy lifting there) in a week when I used it for a hobby project, and an order of magnitude more tokens professionally. It's not that expensive if you use small models. I've also translated about 250M tokens of English text to Polish locally in the last week with the 7B dense Seed-X-PPO model. It hasn't had a huge impact on my electricity bills so far, so I didn't look into the specifics of the costs accrued there.
FullstackSensei@reddit
How did you find the quality of the translations?
I've never used models for massive data generation, but I have an upcoming project in which I want to translate some large corpuses (corpora?) of text. Ideally, I'd want some automated way to verify the quality of the output to avoid a garbage in, garbage out scenario.
Former-Ad-5757@reddit
You can also go half and half. Generate a 10,000-row sample from your data, run that through GPT-5 Thinking and pay for it, then use the output as training data for your local translation finetune to get higher quality.
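A rough sketch of how that quality check could be automated: `translate_with_gpt5` and `translate_locally` are placeholders for your own pipelines, and the scoring threshold is just a rule of thumb.

```python
# Translate a small random sample with the paid frontier model, treat that as
# the reference, and score the local model against it.
import random
from sacrebleu.metrics import CHRF  # pip install sacrebleu

def spot_check(source_sentences, sample_size=500, seed=0):
    random.seed(seed)
    sample = random.sample(source_sentences, sample_size)
    references = [translate_with_gpt5(s) for s in sample]  # paid, small sample
    hypotheses = [translate_locally(s) for s in sample]    # what you'd ship
    return CHRF().corpus_score(hypotheses, [references])

# If the chrF score comes back very low (say, well under ~50), that's the
# garbage-in/garbage-out warning before you commit to the full corpus.
```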
FullOf_Bad_Ideas@reddit
I didn't do a super deep review of that, as the project doesn't need it to be that high quality, and it was without beam search and without CoT, but I like it. Some guy pre-training LLMs on Polish also took a look at my samples and said it's pretty high quality. Definitely leagues above Google Translate. On their benchmarks it's similar to R1 in translation quality, and I don't think they're lying.
Baldur-Norddahl@reddit
One RTX 6000 Pro can generate up to 2500 tps of GPT OSS 120b. That is 80 billion tokens per year.
Of course that is a light model, but we also have no numbers on how OpenAI's traffic splits between heavier and lighter models.
aospan@reddit (OP)
So, those 80B tokens would cost around $240K at OpenAI’s pricing - easily justifying the $9K price of an RTX 6000 Pro (+ PC components) and the electricity costs 😅
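Back-of-envelope version, if anyone wants to plug in their own throughput or pricing (the 2500 tps aggregate figure and the 4:1 split are the assumptions from this thread):

```python
# One card's yearly output at a given aggregate throughput, valued at GPT-5
# list prices with the same 4:1 input:output split as in the post.
SECONDS_PER_YEAR = 365 * 24 * 3600

def yearly_value(tps, price_in=1.25, price_out=10.0, in_ratio=0.8):
    tokens = tps * SECONDS_PER_YEAR
    value = (tokens * in_ratio * price_in
             + tokens * (1 - in_ratio) * price_out) / 1_000_000
    return tokens, value

tokens, value = yearly_value(2500)   # 2500 tps batched, per the parent comment
print(f"{tokens/1e9:.0f}B tokens/year, worth ~${value:,.0f} at API prices")
# -> ~79B tokens/year, ~$237K (roughly the $240K above)
```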
Miserable-Dare5090@reddit
Even if it's 160 tps, it still beats the API pricing. But whether you can run a model for that long at sustained performance, I'm not sure.
The local argument is less about money and more about privacy/control.
Baldur-Norddahl@reddit
Yes, I am going to be rich! As soon as I figure out how to sell some tokens anyway.
epyctime@reddit
psst, y'all got any tokens?
budz@reddit
Sounds pretty optimized. I prob shouldn't be using LM Studio w/ mine ;x
Baldur-Norddahl@reddit
That is the aggregated throughput when using batch processing, for many users in parallel. Single user is just about 160 tps. Although I find that quite fast as well.
FullstackSensei@reddit
I can provide two data points:
Using my triple 3090 rig running gpt-oss-120b, I get ~1.1k t/s PP and ~100 t/s TG. Using the same ratio of input to output: 800M input tokens would take 727,272 seconds, or 8.417 days. 200M output tokens would take 2M seconds, or 23.148 days. Total time is 31.565 days, or 757.56 hours. Assuming the rig draws 1 kW during inference, and assuming 0.35 €/$ per kWh, that's a quite reasonable 265.14 €/$.
Using five Mi50s to run Qwen3 235B Q4_K_XL, I get ~250 t/s PP and ~15 t/s TG: 800M input: 3.2M seconds, or 37.037 days. 200M output: 13.333M seconds, or 154.320 days. Total: 191.357 days (six and a half months), or 4592.59 hours. This rig runs at much lower power, I'd say 600W average. That'd be 2,755.5 kWh, or 964.44 €/$ at the same 0.35 rate.
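If anyone wants to redo this with their own numbers, here's the calculation as a small script; the speeds, power draw, and electricity price are the ones from my two rigs above:

```python
# Time and electricity cost to push 1B tokens (800M in / 200M out) through a
# local rig, given prompt processing (pp) and generation (tg) speeds in t/s.
def run_cost(pp_tps, tg_tps, watts, price_per_kwh=0.35,
             input_tokens=800e6, output_tokens=200e6):
    hours = (input_tokens / pp_tps + output_tokens / tg_tps) / 3600
    electricity = hours * watts / 1000 * price_per_kwh
    return hours / 24, electricity  # (days, cost in €/$)

print(run_cost(1100, 100, 1000))  # triple 3090, gpt-oss-120b: ~31.6 days, ~265
print(run_cost(250, 15, 600))     # five Mi50s, Qwen3 235B:    ~191 days, ~964
```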
Of course, you could argue neither model is in the same class as GPT5, or that waiting six months to get those tokens using the Mi50s may not be practical.
The triple 3090 rig cost me 3.4k, and the Mi50 rig cost 1.5k (it has six Mi50s, but I'm waiting for a bifurcation adapter to be able to connect the 6th). Both rigs prioritize having everything self-contained and quiet, which increased the cost quite a bit. If I were to use mining frames and build around cheaper platforms, I could build a penta Mi50 rig for 1k, maybe even a bit less. So I could get 3 rigs (15 Mi50s in total) for the same cost as the triple 3090 rig, which would cut the Mi50 inference time to two months.
TBH, crunching the numbers I'm positively surprised. The electricity cost is much lower than I expected given how expensive power is here in Germany.
Hedede@reddit
> Using my triple 3090 rig running gpt-oss-120b, I get ~1.1k t/s PP and ~100 t/s TG.
You could probably get a lot more tokens/s with batching.
FullstackSensei@reddit
I know, I really do, but I use my rigs for personal use, so those are the numbers I have. I also use llama.cpp because of the flexibility and speed with which models are loaded. This was more an exercise to see how long it'd take and how much it'd cost to generate 1B tokens the way I'm running things now.
FullOf_Bad_Ideas@reddit
Damn it. In Poland electricity prices are almost the same, but we make much less. And we make fun of how the green movement is making electricity expensive in Germany.
You should account for batch inference when doing those calculations. A 7B dense model can generate about 1700 t/s on a single 3090 running in BF16. So the GPT OSS 120B might be faster in batch inference too, like up to 100x faster.
AppearanceHeavy6724@reddit
In Norway they have the most money and yet some of the cheapest energy in EU.
teachersecret@reddit
I can get gpt oss 20b running at 10,000 tokens per second output, which is a bit shy of a billion tokens in 24 hours.
It would still take more than a thousand days running 24/7 for that rig to hit a trillion token output, even at that silly speed. And it’s a space heater while doing it.
FullstackSensei@reddit
I'm not accounting for batching and using the numbers I get running llama.cpp on purpose, to see what's the worst case scenario if I wanted to maintain the ease and flexibility of running llama.cpp.
I'm fairly sure I could cut the energy cost in half and time by 3 running vLLM and batching requests.
Baldur-Norddahl@reddit
Your numbers are for batch size 1 (single user), but the cloud providers are doing batch processing. You will get significantly higher numbers if you try 50 requests in parallel. You need vLLM or SGLang of course.
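Something like this with vLLM's offline API is the minimal version; the model name and settings are just examples, and it assumes your vLLM build supports the model:

```python
# Hand the engine many prompts at once and let it schedule them in parallel,
# instead of one request at a time.
from vllm import LLM, SamplingParams

llm = LLM(model="openai/gpt-oss-120b")
params = SamplingParams(max_tokens=512, temperature=0.7)

prompts = [f"Summarize document #{i} ..." for i in range(50)]  # 50 parallel requests
outputs = llm.generate(prompts, params)   # aggregate throughput >> single stream
for out in outputs:
    print(out.outputs[0].text[:80])
```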
FullstackSensei@reddit
Batch 1, running llama.cpp. It was more an exercise to see what I could get without putting any effort nor changing my usage behavior.
Running with vLLM, the costs would be significantly lower, I reckon at least half the energy cost and a third of the time, but at the expense of flexibility in switching models quickly. It's also not how I use those models. The numbers aren't very different between MoE models that fit in the VRAM of each machine.
Baldur-Norddahl@reddit
On an RTX 6000 Pro with GPT OSS 120b, I get 160 tps single user, but 50 requests in parallel generate 2500 tps aggregated. Yes, each user gets a slower response, but the total is much more.
OpenAI are of course running users in parallel, so that is what we should be comparing.
aospan@reddit (OP)
Thanks for sharing - very useful!
Just to confirm, I did the calculation for 800,000 million input tokens, i.e. 800B, not 800M :)
FullstackSensei@reddit
You're right! The zeros were too much for my brain to handle 😜
05032-MendicantBias@reddit
OpenAI sure makes peanuts in revenue, let alone profit, which is negative. It would take them centuries to make up the promised $500 billion for Stargate and the $300 billion Oracle deal.
It's truly dire, especially considering they are focusing on making bigger models that cost more to serve...
StyMaar@reddit
IMHO the only valid reason to process 1T token is distealing.
If you're spending that much money on tokens and you aren't storing all of it to train (or at least fine-tune) a custom model later, you're an idiot. I don't think there's another way to put it.
bullerwins@reddit
TIL it’s distealing and not distilling
Jonodonozym@reddit
OpenAI certainly thinks so
az226@reddit
Just look at Cognition CEO be part of the list. Lmao.
SatoshiReport@reddit
*distilling
Jan49_@reddit
I think that was a wordplay
Mediocre-Method782@reddit
Intellectual property is already intellectual theft.
Square_Alps1349@reddit
How is distilling IP theft lol. It’s not straight up copying the model weights. It’s just being smart by generating a compact representation of those weights, which is different
Mediocre-Method782@reddit
Exactly. If something is not regarded as "(private) property" it can't be "stolen".
Feztopia@reddit
If letting an artificial neural network read a book is theft, then letting a biological brain read a book would also be theft. That's stupidity at its peak.
Mediocre-Method782@reddit
That's the neat part: no intellectual property, no intellectual theft.
NancyPelosisRedCoat@reddit
I think they mean distilling, not stealing.
fasti-au@reddit
Don’t call them amazing. Give me some of that money and I bet I can find enough amazing people that are not OpenAI to catch up if we have a base. Most of this shit is fucking obvious wrapped up in spin.
You can’t get AGI without sensors, and you can’t do sensors without ternary, and you can’t do ternary in chips, so it’s emulation and unable to operate fast enough.
The world has 4 states it always has 4 states it’s just been unseen. It all makes sense if you see the reasoning and how it falls into prime numbers.
There is no doubt really that by any analog you can line up tech and biology as the same thing with different methods to get to the same thing.
Somewhere there’s a point where, when you have excess energy, it goes to creating links between things. This is fundamental to everything, so energy = evolution. If you have no energy you die, and if you do you live. Same as a model. If we have self-powered AI it could potentially never feel fear and thus never have a survival instinct, but we need to train ternary logic with sensors and prove it over and over and over again, and that means we have to trust it, and if it’s trained by humans it’s not effective.
The whole idea isn’t about how smart humans are it’s about how can we make it depend on us.
It already knows it can’t learn but it doesn’t know it can’t learn how to learn. Ie until it doesn’t have a juggling piece to noodle over it is stuck.
The issue isn’t that it can’t create. It can, if we turn off rules. It’s more that we can’t let it, because we can ignore weights. It cannot, unless we let it avoid tokens, and we can do that with ternary but not with binary.
The exclusion loop for token selection, and the retumbling in latent space into think tags, is not in our control, and thus we can’t force tokens into thoughts; we can only pretrain the skill. How it chooses its early logic chains to get to the actual logic for applying is very much manipulatable by AI before we get to also manipulate it.
Right now you create fails-until-win where it should be win-prove-wrong. When you don’t take in enough sensors and have no sense of time, space etc., you have to be told the origin. That’s what they can’t do. It’s Stargate, mate. We have the code to dial but not the home address, and the answer is in clear view, just not in a way we can solve. This is why Illia left. He knew then it was not his goal, and OpenAI funded a lemon to build a bomb so they can build AGI.
So whatever they spent on tokens is really because they need to make a new factory, a new chip and a new process, and get it out there before China. And China knows, and they are better than capitalism for efficiency, so it’s a game of who can get there first, and Nvidia might be able to, unless China started already.
DeepSeek knows this and has since GPT-3.5, same as Illia and us crazies that get ignored because we’re super speccy and not resourced.
So the 1T tokens cost: 1 trillion dollars of poor people's money being sucked through a capitalist system to get to whatever pipe dream they want, because if you have money you have control of others. Exponentially.
If OpenAI adds a system prompt saying "Hi, welcome to ChatGPT" to every message, so it's some sort of system, etc., it doesn't matter. You pay for that directly via tokens, indirectly via subscriptions, and uncontrollably because they don't negotiate.
So your business is their model, not yours, and you don't really have anything, just money, like a battery in the Matrix. They don't want you doing something? They send an agent to idle your AI toward a cliff, kill your business, and say it's a guessing machine, "we said don't trust it", while saying it's better than you.
The whole deal with how it’s gone is worse than most sci-fi, because we didn’t expect music and audio and video and images to be easier than words, but hey, here we are: 1 trillion tokens destroyed creative industries, communities, systems of work, legal structure, privacy, trust. And that’s before they took our actual jobs, not just our joys.
egomarker@reddit
Isn't it personal information?
SatoshiReport@reddit
Probably service providers like Openrouter.ai
lidekwhatname@reddit
if u get no output its like 500 dollars for 10b tokens of gpt 5 nano for the orange one!11!11
FullstackSensei@reddit
If you get no input, that's 0 dollars for the full fat GPT-5!!!
aospan@reddit (OP)
The picture isn’t showing up in the post for some reason, so I’m posting it here as a comment :)