Have you wondered about the cost of using an API from a model provider like Anthropic?

Posted by power97992@reddit | LocalLLaMA | 9 comments

Let's suppose Claude Sonnet 4.0 has 700B parameters with 32B active. How much does it cost, approximately, to train for one training run if you rent the GPUs in bulk or own them? And what is the inference cost?

Suppose it was trained on 15 trillion tokens (including distilled data) with 32B active parameters, and you have ~1.5x compute overhead from routing, inefficiencies and so on. Using the standard ~6·N·D FLOPs estimate, you will need approximately 6 × 32e9 × 15e12 × 1.5 ≈ 4.32×10^24 FLOPs.
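The arithmetic above can be sketched as a quick Python check (the 6·N·D rule and the 1.5x overhead factor are the post's assumptions, not measured values):

```python
# Back-of-envelope training compute: the common ~6 * N_active * D FLOPs rule,
# times the post's assumed 1.5x overhead for routing and inefficiencies.
def training_flops(active_params: float, tokens: float, overhead: float = 1.5) -> float:
    return 6 * active_params * tokens * overhead

sonnet_flops = training_flops(32e9, 15e12)
print(f"{sonnet_flops:.3g} FLOPs")  # ~4.32e+24 FLOPs
```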

A reserved B200 in bulk costs around $3/hr to rent, or about $1.14/hr to own amortized over 5 years ($1.165/hr if you include electricity), and it delivers roughly 9 PFLOP/s of FP8 compute (with sparsity). At 60% utilization, a single run on 15 trillion tokens costs only ~$668k rented or ~$259k owned... Plus a few de-risking small runs and experimental/failed runs, costing approximately $2.4 million in total.
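Those dollar figures follow from GPU-hours at the stated throughput and utilization; a minimal sketch, assuming 9 PFLOP/s FP8 per B200 and 60% utilization as in the post:

```python
# GPU-hours needed for a given FLOP budget at a given per-GPU rate and utilization.
def gpu_hours(flops: float, flops_per_gpu_s: float = 9e15, utilization: float = 0.6) -> float:
    return flops / (flops_per_gpu_s * utilization) / 3600

hours = gpu_hours(4.32e24)   # ~222,000 B200-hours
rented = hours * 3.0         # ~$667k at $3/hr rented
owned = hours * 1.165        # ~$259k at $1.165/hr owned (incl. electricity)
```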

However, the synthetic data generation from Claude Opus costs way more... If Claude Opus 4.0 has 5 trillion parameters with 160B active and was trained on 150 trillion tokens, a single training run costs about $33.4 million on 9,259 GPUs.

And to generate 1 trillion reasoning tokens from Opus for distilling into Sonnet, you will need about 11.1 million B200 GPU-hours, i.e. ~$33.3 million on rented GPUs... The total cost for Claude Sonnet 4.0 then comes to around $36.3 million using rented GPUs. Note: if you own the GPUs, the total training cost is significantly lower, around $14 million (assuming 4¢/kWh electricity), not including maintenance.
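The distillation figure implies a per-GPU decode throughput, which is worth making explicit; a sketch of the post's numbers, not a measured rate:

```python
# Invert the post's 11.1M B200-hours for 1T tokens to see the implied throughput.
tokens = 1e12          # reasoning tokens to distill from Opus
b200_hours = 11.1e6    # GPU-hours assumed in the post
implied_tok_per_gpu_s = tokens / (b200_hours * 3600)  # ~25 tokens/s per GPU
rented_cost = b200_hours * 3.0                        # ~$33.3M at $3/hr
```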

Note you are probably giving them free tokens for training and distilling... I really question their claim that they don't train on your API tokens even when you opt out, given that they keep all your data logs and training on them would save so much money (they probably anonymize your data)... Their customers will have generated over 89-114 trillion tokens by the end of this year. Even if they train on just 10% of their customers' data (opted in or not), that's trillions of tokens.

Note this doesn't include labor costs: they have almost 1,100 (1,097) employees, which works out to roughly $660 million/year for labor (not including CEO bonuses).

Note Claude 4.5 is cheaper to train than 4.0 if it is just fine-tuned or trained on fewer tokens; if it uses the same number of tokens and compute, the cost is the same.

Suppose Claude 4.0/4.5 runs on B200s with the same parameter count. The Q4 version only takes 2-3 B200s to run, i.e. $2.31-3.45/hr if you own the GPUs or ~$6/hr if you rent. The output-token revenue per hour (with the active parameters split across GPUs) for Claude 4.5 is roughly $40-48.6: (48.6-2.31)/48.6 = 95.2% margin if they own the GPUs, before factoring in training costs.

(48.6-6)/48.6 = **87.7% margin on output tokens if the GPUs are rented** (most of Anthropic's GPUs are rented).
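Both margin figures are the same formula; a tiny sketch using the post's revenue and GPU-cost numbers:

```python
# Gross margin on serving, counting only GPU cost (no training, infra, or labor).
def gross_margin(revenue_per_hr: float, gpu_cost_per_hr: float) -> float:
    return (revenue_per_hr - gpu_cost_per_hr) / revenue_per_hr

owned = gross_margin(48.6, 2.31)   # ~0.952, i.e. 95.2% if you own the GPUs
rented = gross_margin(48.6, 6.0)   # ~0.877, i.e. 87.7% if you rent
```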

The input-token revenue is outrageous... They make $6,074 per hour on Q4 prefills ($3,037 at Q8) for Claude 4.5 Sonnet if they charge $3/million tokens!! And one hour of compute on 2 B200s costs only $2.33 if they own the GPUs (this includes electricity, but not infrastructure) or $6 if they rent. The margin is 99.96% if they own the GPUs (note this only accounts for GPU costs; it would be ~1.2-1.25x the cost if you include infrastructure, excluding depreciation) and 99.9% if they rent.
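The revenue figure can be inverted to see the implied prefill throughput (a sketch assuming the post's $6,074/hr and $3 per million input tokens):

```python
# Implied prefill throughput and margins from the post's Q4 revenue figure.
revenue_per_hr = 6074.0
usd_per_million_tokens = 3.0
prefill_tokens_per_hr = revenue_per_hr / usd_per_million_tokens * 1e6  # ~2.02e9 tokens/hr
prefill_tokens_per_s = prefill_tokens_per_hr / 3600   # ~562k tokens/s across 2 B200s
margin_owned = (revenue_per_hr - 2.33) / revenue_per_hr   # ~99.96%
margin_rented = (revenue_per_hr - 6.0) / revenue_per_hr   # ~99.9%
```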

A 100k-B200 data center costs around $420-480 million to build.

Btw, Anthropic will make ~$5 billion this year. Even including labor costs, Anthropic is actually profitable if you amortize the GPU cost over 5 years, the data center over 25 years, and the dataset over many years, and include only the training runs for products already released... This also applies to other model providers...
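A quick sanity check on the 5-year GPU amortization used throughout (a sketch; the $1.14/hr ownership rate is the post's figure):

```python
# The $1.14/hr ownership rate, spread over 5 years of continuous operation,
# implies a per-GPU purchase price — a rough consistency check on the post's math.
hours_in_5_years = 24 * 365 * 5              # 43,800 hours
implied_gpu_price = 1.14 * hours_in_5_years  # ~$49.9k per B200
```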

OpenAI is a little cheaper, but they are making a profit too if you amortize everything.