CAISI releases evaluation report: DeepSeek V4 becomes the most powerful model in China, but still lags about 8 months behind the US frontier
Posted by External_Mood4719@reddit | LocalLLaMA | View on Reddit | 36 comments

https://www.nist.gov/news-events/news/2026/05/caisi-evaluation-deepseek-v4-pro
truthputer@reddit
What these charts aren't capturing is that a lot of the recent Chinese innovations are about making smaller models behave more efficiently and with the quality of larger ones. It's the MoE models that run on consumer hardware - and each new release behaves as if it were a much bigger model.
Sonnet-level intelligence is already very useful for like 90% of tasks - if today's top Chinese models had launched 6 years ago they would have been as world-changing as OpenAI was with ChatGPT.
I can easily see a future where open models capture the entire bottom 80% of the LLM market, and it's only the most complex 10-20% of tasks that need the expensive paid cloud models.
It's a real shame OpenAI and Anthropic are closed-source, paid product companies - if they all worked together they'd be able to accomplish so much more.
Nyghtbynger@reddit
Electricity will become the pain point next winter in the US. How do I gamble on that?
TheIncarnated@reddit
Self sustainability. You bet on local renewable energy
TheRealMasonMac@reddit
Wait until Trump bans that, lol.
TheIncarnated@reddit
My grandaddy didn't run shine through these hollers for me to be a bootlicker.
Most of us here work in tech. I can guarantee half of them "sail the seas". I'm not worried
TheRealMasonMac@reddit
Ah, I meant ban solar panels and the like for “being a threat to national security.” Can’t do much about that.
TheIncarnated@reddit
I mean... The black market always exists lol
But yeah, at that point, I'm doing native based wind turbines and hydro
Klutzy-Snow8016@reddit
US government says 8 months, DeepSeek themselves imply 2 months, maybe the truth is somewhere in between.
TomLucidor@reddit
My bet is 3-6 months but I am not sure when they will converge, best bet is 2027/2028
kevinlch@reddit
hard to converge when both are evolving in parallel. the hardware/distillation sanctions are a huge wall
TomLucidor@reddit
Hopping over hardware using software acceleration would be the holy grail for this (I hope)
ResidentPositive4122@reddit
If anything the chances of converging are lower now, IMO. It used to be 6 months away, in the times of gpt4o / o1, but then scale came into play. Whatever the top labs are doing right now is at a scale that other labs simply can't match. And it seems that RL really benefits from large scale. And as their internal unreleased models improve, so do their training pipelines, diversity of RL environments, filtering, and so on.
While you still need the right architecture (see Llama 4 with its bad MoE sparsity), and scale alone won't help you (see the latest Meta models, which aren't anything special despite Meta having scale galore), xAI nevertheless got to SotA last year on pure scale. They've since dropped off a bit, but still got within touching distance on scale alone.
The gap becomes obvious once you take them out of distribution w.r.t. the usual benchmarks. Have them work on coding tasks that aren't usually benched for. Have them set up things that aren't "vanilla" setups. They'll do the work, and if you look at the thinking traces they might "mime" the proper reasoning, but the results are sometimes a miss. Looks like a duck, quacks like a duck, but the end result is a chicken :/
SeyAssociation38@reddit
the scale is compute. rl and synthetic data generation are that scale.
TomLucidor@reddit
Bet on algo gains not compute dependence
TomLucidor@reddit
We need to see if there are RL methods that China cracked that can beat the US habit of brute-forcing. Llama 4 feels like a sneak peek on how everyone else is behaving
Serprotease@reddit
In the time of 4o/o1 we had Qwen2.5, Yi and DeepSeek 2.5. Those were miles behind and not in any way, shape or form a serious replacement for GPT models.
You could say they were 6 months behind, at GPT-3~3.5 level, or the same number of months as the current DeepSeek v4 Pro/GPT-5.0 gap, but Chinese models are now serious alternatives, not toys for the few interested in AI.
Using Elo and months is a poor choice, as it hides the actual situation and the shrinking rate of progress between model iterations.
And GLM 5.1/Xiaomi 2.5 are also a lot closer to the US SotA models, enough to push Anthropic to talk about mythos.
Mashic@reddit
There is the data accumulation factor too. The more-used models have more data to work with and improve on.
SeyAssociation38@reddit
they look at that chart, see a widening gap between china and the us, and pat themselves on the back for banning nvidia from china.
Nyghtbynger@reddit
USA fights with marketing, market gambling and hollywood, China with logistics and production. The fall will be harder
Confusion_Senior@reddit
Kimi is probz a bit ahead yet
9gxa05s8fa8sh@reddit
they left off the chinese models that would have gotten in the way of making the line's slope look low. if you make this graph with popular benchmark data it will look different.
this is PROBABLY rigged because the US government has been corrupted through and through. they're releasing something to make US AI companies look good to help them reach their IPOs before the bubble pops. everybody who helps gets kickbacks.
look at the slope between gpt 5.4 and 5.5 compared to the slope between k2 and k2.5. it shows an astronomical improvement from 5.4 to 5.5, but nobody feels that. and if you normalize by work time or work cost, there is very little difference between model releases. this gen of models costs more resources to run; they're brute forcing more and increasing intelligence less over time.
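The "normalize by work time or work cost" point above can be sketched with a toy calculation. All numbers here are invented for illustration; nothing below comes from the CAISI report or any real benchmark:

```python
# Toy illustration of cost-normalized progress between two hypothetical
# model generations. Scores and costs are made-up numbers.
releases = {
    "gen n":   {"score": 70.0, "cost_per_task_usd": 0.10},
    "gen n+1": {"score": 74.0, "cost_per_task_usd": 0.45},
}

def score_per_dollar(release: dict) -> float:
    """Benchmark points obtained per dollar of inference spend."""
    return release["score"] / release["cost_per_task_usd"]

# Raw benchmark delta looks like progress: +4.0 points.
raw_gain = releases["gen n+1"]["score"] - releases["gen n"]["score"]

# Cost-normalized, the picture inverts: 700.0 -> ~164.4 points per dollar.
efficiency = {name: score_per_dollar(r) for name, r in releases.items()}
```

With these invented numbers, the headline score rises while intelligence-per-dollar falls roughly 4x, which is exactly the "brute forcing more, improving less" pattern the commenter describes.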
Macestudios32@reddit
Only 8 months? Great! I wish something I can have at home was only 8 months behind in its technology.
EveningIncrease7579@reddit
eposnix@reddit
This is funny fan fiction
NNN_Throwaway2@reddit
This selection of benchmarks AND models looks highly cherry-picked. And then I assume they're deriving an "estimated" Elo based on that (why even bother with an Elo at that point)?
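For context on what "estimating an Elo" from benchmark results even means: under the standard Elo logistic model, an observed head-to-head win rate maps directly to a rating gap. A minimal sketch (the 400-point/log10 scaling is standard Elo convention, not anything taken from the report):

```python
import math

def elo_gap_from_winrate(p: float) -> float:
    """Rating gap implied by a head-to-head win rate p under the
    standard Elo logistic model: E = 1 / (1 + 10**(-gap/400)).
    Inverting for the gap gives 400 * log10(p / (1 - p))."""
    return 400.0 * math.log10(p / (1.0 - p))

# A 64% win rate over a rival maps to roughly a 100-point Elo gap;
# a 50% win rate maps to a gap of 0.
gap = elo_gap_from_winrate(0.64)
```

The fragility the commenter is pointing at is visible here: the estimate depends entirely on which head-to-head comparisons (i.e., which benchmarks and models) you feed in.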
Hefty_Wolverine_553@reddit
No Kimi 2.6? No GLM 5.1? No MiMo V2.5 Pro? Deepseek V4 was released after these models...
SeyAssociation38@reddit
they are probably thinking they are the same based on parameter count
ambient_temp_xeno@reddit
More like it's the government so they probably got told to evaluate 'deepseek' ages ago and finally got around to it.
NandaVegg@reddit
I casually thought DeepSeek 4 Pro was slightly above GPT-5.2, and I'm not even sure if that is praise.
GPT-5.2 is literally one of the worst frontier models of this generation per Arena-type Elo rating (ranked #77 overall; even with style control it's #52, below even GLM 4.7 or Gemini 2.5 Pro). It is so heavily benchmaxxed/RL'd towards frontier math/logic reasoning tasks that it puts "Wait, this might not be X but it is still Y" type mini-CoT every few sentences. I'd never ever put GPT-5.2 above Opus 4.6 in any case.
LagOps91@reddit
it's a preview. it's undercooked. it's not done training. stop making comparison charts!
Dr_Me_123@reddit
An organization that knows less than forum users produced a report.
IngenuityNo1411@reddit
typical U.S. money burning ways
woct0rdho@reddit
xkcd 2048: "Curve Fitting"
idkwhattochoo@reddit
"elo" are we really going to use that as metrics here?
TomLucidor@reddit
Use METR then
darktotheknight@reddit
Given how much the US has sanctioned China over top-tier GPUs, this is very impressive. Now imagine if they had unlimited access to the latest GPUs...