CAISI releases evaluation report: DeepSeek V4 becomes the most powerful model in China, but still lags about 8 months behind the US frontier
Posted by External_Mood4719@reddit | LocalLLaMA | View on Reddit | 36 comments

https://www.nist.gov/news-events/news/2026/05/caisi-evaluation-deepseek-v4-pro
truthputer@reddit
What these charts aren't capturing is that a lot of the recent Chinese innovations are about making smaller models behave more efficiently and with the quality of larger ones. It's the MoE models that run on consumer hardware - and each new release behaves as if it were a much bigger model.
Sonnet-level intelligence is already very useful for like 90% of tasks - if today's top Chinese models had launched 6 years ago they would have been as world-changing as OpenAI was with ChatGPT.
I can easily see a future where open models capture the entire bottom 80% of the LLM market, and it's only the most complex 10-20% of tasks that need the expensive paid cloud models.
It's a real shame OpenAI and Anthropic are closed-source, paid product companies - if they all worked together they'd be able to accomplish so much more.
Nyghtbynger@reddit
Electricity will become the pain point next winter in the US. How do I gamble on that?
TheIncarnated@reddit
Self sustainability. You bet on local renewable energy
TheRealMasonMac@reddit
Wait until Trump bans that, lol.
TheIncarnated@reddit
My grandaddy didn't run shine through these hollers for me to be a bootlicker.
Most of us here work in tech. I can guarantee half of them "sail the seas". I'm not worried
TheRealMasonMac@reddit
Ah, I meant ban solar panels and the like for “being a threat to national security.” Can’t do much about that.
TheIncarnated@reddit
I mean... The black market always exists lol
But yeah, at that point, I'm doing native based wind turbines and hydro
Klutzy-Snow8016@reddit
US government says 8 months, DeepSeek themselves imply 2 months, maybe the truth is somewhere in between.
TomLucidor@reddit
My bet is 3-6 months but I am not sure when they will converge, best bet is 2027/2028
kevinlch@reddit
hard to converge when both are evolving in parallel. the hardware/distillation sanctions are a huge wall
TomLucidor@reddit
Hopping over hardware using software acceleration would be the holy grail for this (I hope)
ResidentPositive4122@reddit
If anything the chances of converging are lower now, IMO. It used to be 6 months away, in the times of gpt4o / o1, but then scale came into play. Whatever the top labs are doing right now is at a scale that other labs simply can't match. And it seems that RL really benefits from large scale. And as their internal unreleased models improve, so do their training pipelines, diversity of RL environments, filtering, and so on.
While you still need the right architecture (see Llama 4 with its bad MoE sparsity), and scale alone won't help you (see the latest Meta models, which aren't anything special despite Meta having scale galore), xAI nevertheless got to SotA last year on pure scale. They've since dropped off a bit, but still got within touching distance on scale alone.
The gap becomes obvious once you take them out of distribution w.r.t. the usual benchmarks. Have them work on coding tasks that aren't usually benched for. Have them set up things that aren't "vanilla" setups. They'll do the work, and if you look at the thinking traces they might "mime" the proper reasoning, but the results are sometimes a miss. Looks like a duck, quacks like a duck, but the end result is a chicken :/
SeyAssociation38@reddit
the scale is compute. rl and synthetic data generation are that scale.
TomLucidor@reddit
Bet on algo gains not compute dependence
TomLucidor@reddit
We need to see if there are RL methods that China cracked that can beat the US habit of brute-forcing. Llama 4 feels like a sneak peek on how everyone else is behaving
Serprotease@reddit
In the time of 4o/o1 we had Qwen2.5, Yi and DeepSeek 2.5. Those were miles behind and not in any way, shape or form a serious replacement for GPT models.
You could say they were 6 months behind, at GPT-3~3.5 level, or the same number of months as the current DeepSeek v4 Pro/GPT-5.0 gap, but Chinese models are now serious alternatives, not toys for the few interested in AI.
Using Elo and months is a poor choice, as it hides the actual situation and the shrinking rate of progress between model iterations.
And GLM 5.1/Xiaomi 2.5 are also a lot closer to the US SotA models, enough to push Anthropic to talk about mythos.
Mashic@reddit
There is the data accumulation factor too. The more-used models have more data to work with and improve on.
SeyAssociation38@reddit
they look at that chart, see a widening gap between china and the us, and pat themselves on the back for banning nvidia from china.
Nyghtbynger@reddit
USA fights with marketing, market gambling and hollywood, China with logistics and production. The fall will be harder
Confusion_Senior@reddit
Kimi is probz a bit ahead yet
9gxa05s8fa8sh@reddit
they left off the chinese models that would have gotten in the way of making the line's slope look low. if you make this graph with popular benchmark data it will look different.
this is PROBABLY rigged because the US government has been corrupted through and through. they're releasing something to make US AI companies look good to help them reach their IPOs before the bubble pops. everybody who helps gets kickbacks.
look at the slope between gpt 5.4 and 5.5 compared to the slope between k2 and k2.5. it shows an astronomical improvement from 5.4 to 5.5, but nobody feels that. and if you normalize by work time or work cost, there is very little difference between model releases. this gen of models costs more resources to run; they're brute forcing more and increasing intelligence less over time.
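The "normalize by work time or work cost" point above can be sketched with a toy calculation. All numbers here are invented for illustration; nothing below comes from the CAISI report or any real benchmark:

```python
# Toy illustration of cost-normalized progress between two hypothetical
# model generations. Scores and costs are made-up numbers.
releases = {
    "gen n":   {"score": 70.0, "cost_per_task_usd": 0.10},
    "gen n+1": {"score": 74.0, "cost_per_task_usd": 0.45},
}

def score_per_dollar(release: dict) -> float:
    """Benchmark points obtained per dollar of inference spend."""
    return release["score"] / release["cost_per_task_usd"]

# Raw benchmark delta looks like progress: +4.0 points.
raw_gain = releases["gen n+1"]["score"] - releases["gen n"]["score"]

# Cost-normalized, the picture inverts: 700.0 -> ~164.4 points per dollar.
efficiency = {name: score_per_dollar(r) for name, r in releases.items()}
```

With these invented numbers, the headline score rises while intelligence-per-dollar falls roughly 4x, which is exactly the "brute forcing more, improving less" pattern the commenter describes.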
Macestudios32@reddit
Only 8 months? Great! I wish something I can have at home was only 8 months behind in its technology.
EveningIncrease7579@reddit
eposnix@reddit
This is funny fan fiction
NNN_Throwaway2@reddit
This selection of benchmarks AND models looks highly cherry-picked. And then I assume they're deriving an "estimated" Elo based on that (why even bother with an Elo at that point)?
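For context on what "estimating an Elo" from benchmark results even means: under the standard Elo logistic model, an observed head-to-head win rate maps directly to a rating gap. A minimal sketch (the 400-point/log10 scaling is standard Elo convention, not anything taken from the report):

```python
import math

def elo_gap_from_winrate(p: float) -> float:
    """Rating gap implied by a head-to-head win rate p under the
    standard Elo logistic model: E = 1 / (1 + 10**(-gap/400)).
    Inverting for the gap gives 400 * log10(p / (1 - p))."""
    return 400.0 * math.log10(p / (1.0 - p))

# A 64% win rate over a rival maps to roughly a 100-point Elo gap;
# a 50% win rate maps to a gap of 0.
gap = elo_gap_from_winrate(0.64)
```

The fragility the commenter is pointing at is visible here: the estimate depends entirely on which head-to-head comparisons (i.e., which benchmarks and models) you feed in.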
Hefty_Wolverine_553@reddit
No Kimi 2.6? No GLM 5.1? No MiMo V2.5 Pro? Deepseek V4 was released after these models...
SeyAssociation38@reddit
they are probably thinking they are the same based on parameter count
ambient_temp_xeno@reddit
More like it's the government so they probably got told to evaluate 'deepseek' ages ago and finally got around to it.
NandaVegg@reddit
I casually thought DeepSeek 4 Pro was slightly above GPT-5.2, and I'm not even sure if that is praise.
GPT-5.2 is literally one of the worst frontier models of this generation per Arena-type Elo rating (ranked #77 overall; even with style control it's #52, below even GLM 4.7 or Gemini 2.5 Pro). It is so heavily benchmaxxed/RL'd towards frontier math/logic reasoning tasks that it puts "Wait, this might not be X but it is still Y" type mini-CoT every few sentences. I'd never ever put GPT-5.2 above Opus 4.6 in any case.
LagOps91@reddit
it's a preview. it's undercooked. it's not done training. stop making comparison charts!
Dr_Me_123@reddit
An organization that knows less than forum users produced a report.
IngenuityNo1411@reddit
typical U.S. money burning ways
woct0rdho@reddit
xkcd 2048: "Curve Fitting"
idkwhattochoo@reddit
"elo" are we really going to use that as metrics here?
TomLucidor@reddit
Use METR then
darktotheknight@reddit
Given how much the US has sanctioned China over top-tier GPUs, this is very impressive. Now imagine if they had unlimited access to the latest GPUs...