Compared actual usage costs for Chinese AI models. Token efficiency changes everything.
Posted by YormeSachi@reddit | LocalLLaMA | 41 comments
Everyone talks about per-token pricing but nobody mentions token efficiency. How many tokens does it take to complete the same task?
Tested this with coding tasks because that's where I actually use these models.
- glm-4.6: $0.15 input / $0.60 output
- Kimi K2: $1.50-2.00
- MiniMax: $0.80-1.20
- deepseek: $0.28
deepseek looks cheapest on paper. But that's not the whole story.
Token efficiency (same task):
Gave each model identical coding task: "refactor this component to use hooks, add error handling, write tests"
- glm: 8,200 tokens average
- deepseek: 14,800 tokens average
- MiniMax: 10,500 tokens average
- Kimi: 11,000 tokens average
glm uses 26% fewer tokens than Kimi, 45% fewer than deepseek.
Real cost for that task:
- glm: ~$0.04 (4 cents)
- deepseek: ~$0.03 (3 cents), looks cheaper
- MiniMax: ~$0.05 (5 cents)
- Kimi: ~$0.09 (9 cents)
But wait. If you do 100 similar tasks:
- glm: ~820K total tokens, $0.40-0.50
- deepseek: ~1.48M total tokens, $0.41, basically the same as glm despite the lower per-token price
- MiniMax: ~1.05M total tokens, $0.50-0.60
- Kimi: ~1.1M total tokens, $0.90-1.00
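The arithmetic behind the glm-vs-deepseek comparison is easy to reproduce. A minimal sketch in Python, using the post's token counts and treating each model's output price as a rough blended per-Mtok rate (an assumption for simplicity, since real bills split input and output tokens):

```python
# Rough cost model: cost = total_tokens / 1M * blended price per Mtok.
# Token counts are the per-task averages from the post; prices are the
# post's output rates, used here as a blended approximation.
PRICE_PER_MTOK = {"glm": 0.60, "deepseek": 0.28}
TOKENS_PER_TASK = {"glm": 8_200, "deepseek": 14_800}

def cost_for_tasks(model: str, n_tasks: int) -> float:
    """Estimated dollar cost of running n_tasks with the given model."""
    tokens = TOKENS_PER_TASK[model] * n_tasks
    return tokens / 1_000_000 * PRICE_PER_MTOK[model]

for model in PRICE_PER_MTOK:
    print(f"{model}: 100 tasks ~ ${cost_for_tasks(model, 100):.2f}")
```

With these numbers it prints roughly $0.49 for glm and $0.41 for deepseek: deepseek's 2x-cheaper rate is almost exactly cancelled by its ~1.8x token usage, which is the post's point.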
Token efficiency beats per-token price. glm generates less verbose code, fewer explanatory comments, tighter solutions. deepseek tends to over-explain and generate longer outputs.
For businesses doing thousands of API calls daily, glm's efficiency compounds into real savings even though it's not the absolute cheapest per token.
Switched to glm for production workloads. Monthly costs dropped 60% vs previous setup. Performance is adequate for 90% of tasks.
deepseek's pricing looks great until you realize you're using nearly 80% more tokens per task. The savings disappear.
Anyone else measuring token efficiency? Feel like this is the underrated metric everyone ignores.
No-Fig-8614@reddit
It is true that Chinese characters (and similar scripts) can express far more information per character than Latin characters. This provides an advantage for both processing and token usage.
It's the same reason that (I'll have to find the paper) when two AI models were set up to chat and learn, they came up with their own more efficient proprietary language (reminds me of EBCDIC vs ASCII and how certain encodings have performance or unique value gains). There are going to be a bunch of these interesting tricks that squeeze out additional performance.
I also need to look up the source that compares something like 1,000 common sentences in English and Mandarin; the reduction in tokens was very noticeable.
Cold-Bathroom-8329@reddit
No, because for cost tracking each Chinese character counts as a token, whereas for English and the like, ~4 characters count as a token (and in many cases entire words).
Costs are:
- 1 Chinese character = 1 token
- 1 Latin character != 1 token
TomLucidor@reddit
1 Chinese syllable matches about 1-2 syllables in English, I think... all languages regress toward ~39 bits per second, and French/English manage to pack more per second (good for thinking, bad for token efficiency since it packs too much noise). Maybe Chinese just packs ideas into fewer syllables per second compared to Latin and Altaic languages, and its concept dimensions are well-managed enough not to include irrelevance or context. Good for communication, bad for contextual logic.
Cold-Bathroom-8329@reddit
Chinese characters can encapsulate a lot of meaning, but often have very high ambiguity and lack precision. Expressing something vague is much more efficient in Chinese, but expressing something precise is much less efficient, as you must add a lot of extra explanation to narrow down the meaning.
Averaging means little when language is highly spiky and there is no true 1:1 mapping especially between western languages and Chinese. There are many differences in efficiency, but in different scenarios so it is a mixed bag and collapsing it into one entropy score is not particularly helpful.
TomLucidor@reddit
Assuming the general case (e.g. basic logical thinking), averaging seems workable. Chinese is one of those "unambiguous if not academic" languages, almost Grug-like in nature. For example, Cantonese has lower bits-per-second, closer to the norm than Mandarin, BUT also lower syllables-per-second, both using the same script assuming "Traditional" Chinese (in Taiwan; China uses "Simplified" Chinese, which muddles meaning), and yet the former has more "Plain English" characteristics whilst being closer to ancient forms of Chinese.
If "dialects" can skew efficiency, then "tone of language" within the same language (for thinking models) COULD also have the same effect!
Cold-Bathroom-8329@reddit
Averaging is pointless because different contexts will yield entirely different results with high variance. It is like averaging wealth by looking at one homeless guy and one billionaire in each country and then concluding that X or Y is wealthier because a $3 difference is in their favor.
Chinese is not at all “unambiguous if not academic.” In fact, the opposite is more true. Academic language tends to use more precise characters that people would not use in daily life, whereas daily life often uses imprecise words.
Spoken Cantonese and Mandarin have, for all intents and purposes, near-identical entropy. The difference in possible syllable combinations is negligible in any given sentence.
As for traditional and simplified, traditional to simplified map almost 1:1 in practice. The difference is not enough to have much of an impact in practice.
Dialects of Chinese map 1:1 to Mandarin. The only difference is that dialects, being local and more rural, tend to use less abstract language and as a result are highly imprecise, because the language is limited by context.
No one is discussing astrophysics in a village dialect. As for mainstream dialects like Mandarin and Cantonese, again, they map 1:1 in practice and trying to find a difference is a stretch that is of no use for how negligible the difference actually is.
Salt_Discussion8043@reddit
Yeah, I use domain-specific languages; it squeezes out more efficiency.
TheRealGentlefox@reddit
You can see this most strongly with smaller Qwen models.
80B A3B ended up costing the same, or more, than its 235B brother in some of my tests. It burns soooooo many tokens thinking. QwQ was the same way, often hitting ~20k reasoning tokens for certain queries.
TomLucidor@reddit
80B vs 235B success rate tho, are they the same?
TheRealGentlefox@reddit
At this logic puzzle the 80B actually did quite well, a bit under 235B.
TomLucidor@reddit
Which kinds? Cus Zebra Puzzles might be benchmaxxed
tech_genie1988@reddit
If the model gives you exactly what you need, extra tokens become unnecessary anyway. And yeah, using fewer tokens is definitely its own advantage too.
dubesor86@reddit
I have been hammering on token efficiency ever since reasoning models appeared a bit over a year ago. I track token usage and assign verbosity values to each model. It annoyed me to no end to constantly hear that model X or Y is cheaper based solely on the per-Mtok price, completely ignoring that you have to account for token usage. Anyone who is cost-conscious or depends on response latency cares deeply about this stuff, but it requires actual effort to track and communicate, whereas looking at a simple dollar value per Mtok requires zero effort, even if it's a nearly useless figure on its own.
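Tracking this doesn't require much: OpenAI-compatible APIs return token counts in the response's `usage` field, so you can log completion tokens per task and compute a verbosity ratio against a baseline model. A minimal sketch of that bookkeeping (my own illustration, not dubesor's actual setup):

```python
from collections import defaultdict

class TokenTracker:
    """Logs completion tokens per model, then reports average usage
    and verbosity relative to a chosen baseline model."""

    def __init__(self):
        self.samples = defaultdict(list)

    def log(self, model: str, completion_tokens: int) -> None:
        # completion_tokens would come from response.usage.completion_tokens
        self.samples[model].append(completion_tokens)

    def average(self, model: str) -> float:
        return sum(self.samples[model]) / len(self.samples[model])

    def verbosity(self, model: str, baseline: str) -> float:
        """>1.0 means the model burns more tokens than the baseline."""
        return self.average(model) / self.average(baseline)

tracker = TokenTracker()
for t in (8_000, 8_400):       # two sample runs, averaging 8,200
    tracker.log("glm", t)
for t in (14_500, 15_100):     # two sample runs, averaging 14,800
    tracker.log("deepseek", t)
print(tracker.verbosity("deepseek", baseline="glm"))  # ~1.8
```

Run this over a few dozen real tasks per model and the verbosity ratio becomes the number you multiply per-Mtok prices by before comparing.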
Smart-Cap-2216@reddit
Using the glm coding plan offers a cost-performance ratio far superior to all other models.
dash_bro@reddit
We've benchmarked stuff like this heavily for thinking models, it's a standard practice. It's also very well tracked for known leaderboards, where you'll see number of steps, number of tokens and total cost of the model as well.
- Coding, overall (open models): GLM and Qwen dominate
- Coding, overall (closed models): Gemini is far ahead purely on price, but Claude has superior performance, especially for enterprise
Mescallan@reddit
Claude Opus 4.5's price per token is very high, but the tokens it outputs are generally high-value, and in environments it's been trained for, it uses significantly fewer tokens than basically any other model for the same task.
TheRealMasonMac@reddit
https://nousresearch.com/measuring-thinking-efficiency-in-reasoning-models-the-missing-benchmark/
TomLucidor@reddit
There is a model size to thinking time tradeoff I think (assuming performance need to be kept the same), and also smaller models are more likely to be weaker so they could be more "confidently wrong".
dwiedenau2@reddit
You did all this without even mentioning caching once, which makes most Chinese LLMs more expensive than the top US models.
UnifiedFlow@reddit
Can you explain why caching is different in Chinese models? In my understanding caching is not related to the model in any way.
dwiedenau2@reddit
It is related to the price, which you are comparing here. Most providers for Chinese models do not offer caching, or have a much smaller discount on cached tokens than US providers.
FullOf_Bad_Ideas@reddit
The biggest issue in caching is actually US providers hosting Chinese models.
GLM, DeepSeek and Kimi have caching on their own API (I think Minimax doesn't), but other companies hosting those models, usually US based (capital intensive business so it makes sense), do not.
But even when Chinese companies do offer caching, it's not as cheap as what OpenAI, Claude, or xAI offer.
VampiroMedicado@reddit
Is there a reason for that?
FullOf_Bad_Ideas@reddit
They use off-the-shelf inference engines like vLLM and SGLang, and per-user caching and accounting features are only recently getting integrated there.
It seems like market just hasn't forced them to upgrade their setup yet.
UnifiedFlow@reddit
vLLM has robust kv caching support.
FullOf_Bad_Ideas@reddit
Through LMCache, right?
Keeping separate cache for each user for an hour or so and also billing it correctly is not trivial.
sdkgierjgioperjki0@reddit
Except for Deepseek of course, which has amazing discounts on cached tokens and you get very high hit rate. I'm hitting 90% cached tokens when using Claude Code with deepseek-chat.
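The effect of cache hit rate on effective price is easy to quantify. A sketch with illustrative numbers (the 10x cached-token discount here is an assumption for the example, not any provider's exact rate):

```python
def effective_input_price(price: float, cached_price: float, hit_rate: float) -> float:
    """Blended per-Mtok input price given a cache hit rate:
    hit_rate of tokens bill at cached_price, the rest at full price."""
    return hit_rate * cached_price + (1 - hit_rate) * price

# e.g. $0.28/Mtok list price, 10x discount on cached tokens, 90% hit rate
blended = effective_input_price(price=0.28, cached_price=0.028, hit_rate=0.90)
print(f"${blended:.4f} per Mtok")  # ~$0.0532, roughly 5x below list price
```

This is why an agentic tool like Claude Code, which resends a long shared prefix every turn, benefits so much from a high hit rate, and why a provider with no caching at all can end up pricier than one with a higher list price.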
ttkciar@reddit
This is badly off-topic for the sub.
Thick-Protection-458@reddit
Why? Token efficiency directly translates into response time and compute required (and therefore electricity). Both of those, especially the first, might well be important.
xxPoLyGLoTxx@reddit
I found it interesting nonetheless and I only use local models.
createthiscom@reddit
"But wait" lol
texasdude11@reddit
Wait my butt
Clear-Ad-9312@reddit
Qwen's reasoning has made this phrase cringe af for me
HarambeTenSei@reddit
But wait, I get a lot of that from claude and gemini 3 as well
Clear-Ad-9312@reddit
but wait, I like to use local models and not really rely on the proprietary ones all that much
(I wish I had money to buy GPUs before the AI data centers inflated the prices of everything)
Salt_Discussion8043@reddit
Baby qwens spam this so much even when reasoning about star wars
FullOf_Bad_Ideas@reddit
SWE-Rebench measures cost per problem - https://swe-rebench.com/
GLM 4.5 is both less performant as well as costlier than GPT 5 Codex, GPT 5 Medium and even a bit costlier than Claude Sonnet 4.5. At least on those benchmarked issues.
MrPecunius@reddit
Per-token pricing is yet another reason I run local models almost exclusively.
Kinda off-topic to go into the weeds about non-local API pricing in this group, imo.
StardockEngineer@reddit
I'm not saying you're wrong, but this test doesn't sound rigorous. Did they all put in the same amount of error handling? Tests? How much non-commented code did they produce? Do we have those details? You could also just ask the other models not to write comments. Might seem like an extra step, but one people might be willing to take if the overall output is the cheapest.
Scared-Biscotti2287@reddit
I am not that familiar with the Chinese models, but it kinda looks like I might give glm a chance. Thanks for the info.