Decreased Intelligence Density in DeepSeek V4 Pro
Posted by Mindless_Pain1860@reddit | LocalLLaMA | View on Reddit | 90 comments
In the V3.2 paper, they mentioned:
Second, token efficiency remains a challenge; DeepSeek-V3.2 typically requires longer generation trajectories (i.e., more tokens) to match the output quality of models like Gemini 3.0-Pro. Future work will focus on optimizing the intelligence density of the model’s reasoning chains to improve efficiency.
However, in V4 Pro, the situation seems to have worsened. Even the non-thinking mode uses significantly more tokens than V3.2, and V4 Pro (1.6T) is roughly 2.5x larger than V3.2 (0.67T). This suggests that the intelligence density of the model has decreased rather than improved!
If we compare it with GPT-5.4 and GPT-5.5, the gap is even larger. DeepSeek appears to require around 10x more tokens to achieve similar performance. Assuming the same TPS, this implies it takes DeepSeek V4 Pro roughly 10x longer to complete the same task.
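A quick back-of-envelope of what that means in wall-clock time (the token counts and TPS below are purely illustrative, not measured numbers):

```python
# At equal decode speed, wall-clock time scales linearly with tokens emitted.
# All numbers here are illustrative only.
def wall_clock_seconds(tokens_emitted: int, tokens_per_second: float) -> float:
    return tokens_emitted / tokens_per_second

tps = 50.0                # assume both models decode at 50 tok/s
gpt_tokens = 2_000        # hypothetical tokens to finish a task
deepseek_tokens = 20_000  # hypothetical ~10x more tokens for the same task

print(wall_clock_seconds(gpt_tokens, tps))       # 40.0 s
print(wall_clock_seconds(deepseek_tokens, tps))  # 400.0 s, i.e. ~10x longer
```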
Hyp3rSoniX@reddit
I think the main goal of the v4 release was to get the models to run on the Huawei Ascend AI processors.
They will probably optimise and improve the model afterwards. They're trying to become as independent from Nvidia and the like as possible - so the Huawei chip support probably had the highest priority.
marcobaldo@reddit
The training still happened on Nvidia, but I get the point.
SeyAssociation38@reddit
The number of training tokens and the training itself are compute-bottlenecked, because the training tokens are mostly synthetic now. Until they figure out EUV, this is something I expect to continue.
EffectiveCeilingFan@reddit
Yeah, that was my understanding as well, considering that they get a lot of funding from the government and there is significant interest in not being reliant on American companies.
Puzzleheaded-Drama-8@reddit
To me the v4 pro seems to be hugely undertrained. I expect we're going to see huge gains in that model when we get new checkpoints in the coming months.
dark-light92@reddit
They did say this release was a preview.
holchansg@reddit
With a completely new architecture. DeepSeek 4 is all about Engram and mHC; it's about the model dealing with huge contexts.
ChocomelP@reddit
Finanzamt_Endgegner@reddit
Yep not a lot of training tokens in their report
Yes_but_I_think@reddit
32T is not a lot?
UpAndDownArrows@reddit
Datasets that comfortably fit on the desktop PC I use just to browse reddit and store my photos? Yeah, not a lot.
Finanzamt_Endgegner@reddit
It's around Chinchilla Optimal, which is not seen as optimal anymore. So basically yes, that's a lot of tokens, but this model can be trained on a lot more and still gain performance.
RedditLovingSun@reddit
Yes, Chinchilla optimal doesn't show the upper limit of training; it shows where further tokens have diminishing effects. But diminishing effects are still progress and still make the model better.
beijinghouse@reddit
Chinchilla's assumptions are that the ONE computer training the model will also do ALL the inference for that same model and that the model will never be shared or used by anyone else ever. It falls apart if there's another user of the system or another computer will run inference on the model (ever) or if you believe there's an economy in the world where trade could take place, etc. So Chinchilla is uniquely (and only) a correct heuristic when you're in an apocalypse-survival movie and you find the final computer left on earth to make a private model to run for yourself while you sit alone waiting to die.
Orolol@reddit
Not for a model of this size.
FullOf_Bad_Ideas@reddit
How do you tell if a model is undertrained or just bad? Is this just a synonym that sounds less harsh?
my_name_isnt_clever@reddit
Undertrained and bad means it has room for improvement. If it's fully trained and bad, well that's how you get Llama 4.
This_Maintenance_834@reddit
It is really hilarious when you put Llama 4 that way.
nullmove@reddit
Is this model even "bad" at all? Seems to me that it's right up there with the (Chinese) frontier in broad reasoning and coding. But specific domains like agentic coding in a harness require substantial post-training with RL; quite possibly it's lacking there for now (and that's reflected in terminal bench).
Compared to its size (1.6T params), 36T tokens is not much; freaking Qwen 3 0.6B was trained on that much. Qwen/Kimi etc. are vision models; they have also been trained on tons of vision tokens, and done right this increases a model's general intelligence. By not supporting vision, this model has also left a lot of tokens on the table, so it's undertrained in that sense as well.
toughcentaur9018@reddit
I'm no expert, but it seems to me like the model could be doing so much more at that size.
FullOf_Bad_Ideas@reddit
Doing so much more meaning better benchmark scores? ERNIE 5.0 is 2.4T and it's closed weights; in benchmarks I think it's just a tiny bit better or about the same, I didn't compare them closely.
Muon is more token-efficient for training, so it changes how many tokens you need to get a certain level of performance.
Borkato@reddit
I feel like undertrained means if they continue training it in the same direction it will continue to go in the same, good direction, while bad implies that if they continue training it in the same direction it will just get worse.
brown2green@reddit
32T tokens for 1.6T parameters is exactly Chinchilla-optimal (20 tokens/parameter).
pigeon57434@reddit
friendly reminder that Qwen3 ZERO POINT 6 B was trained on 36T tokens ya bro this model is undertrained as fuck
KaroYadgar@reddit
basically every model in the qwen family was trained on the same 36T tokens guh. Kimi K2.6 was trained on almost 30T tokens (including both images & text) and I don't see complaints about that.
ElementNumber6@reddit
Kimi doesn't affect stock prices the way Deepseek does.
This release needs to be dampened like crazy, lest the unthinkable happen, and some people lose a little money.
pigeon57434@reddit
Kimi is also 600B parameters smaller than DeepSeek, and even the largest Qwen 3 model, Qwen3-Max, we were told was 1T parameters, which is still 600B smaller than DeepSeek. They have it uniquely bad because they have the biggest verified Chinese model ever trained, but the same or fewer tokens than people making much smaller models. I do actually think models like Kimi are undertrained a lot too, it's just less bad than DeepSeek.
Alt_Restorer@reddit
Chinchilla optimal means you produce the smartest model per unit of training compute, without considering inference compute.
When you train past Chinchilla optimal (more tokens per parameter), you create a model that's better and which uses less inference compute per unit of intelligence. The only thing that's not optimal about it is that you could have created an even smarter model for the same amount of training compute by adding more parameters, but that model would require more inference compute, because it's larger.
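A toy version of that trade-off, using the usual rough approximations (training compute ~ 6 * params * tokens, inference ~ 2 * params per token); every number here is illustrative, not a real DeepSeek figure:

```python
# Toy Chinchilla trade-off. Standard rough approximations:
# training FLOPs ~= 6 * params * tokens, inference FLOPs ~= 2 * params per token.
def train_flops(params, tokens):
    return 6 * params * tokens

def infer_flops_per_token(params):
    return 2 * params

# Fix a training budget at the "Chinchilla point" (20 tokens/param) for a 1.6T model.
budget = train_flops(1.6e12, 32e12)

# Spend the same budget on a smaller model trained on more tokens ("overtrained"):
small_params = 0.8e12
small_tokens = budget / (6 * small_params)   # = 64e12 tokens, i.e. 80 tok/param

print(f"tokens/param for the smaller model: {small_tokens / small_params:.0f}")
print(f"inference cost ratio (small/large): "
      f"{infer_flops_per_token(small_params) / infer_flops_per_token(1.6e12):.2f}")
# Same training compute, half the inference cost per token; the catch is that the
# biggest model you *could* have trained on this budget would score higher.
```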
True_Requirement_891@reddit
> When you train past Chinchilla optimal (more tokens per parameter), you create a model that's better and which uses less inference compute per unit of intelligence.
This should explain the qwen-3.5 models.
beijinghouse@reddit
Chinchilla-optimal unironically cited in 2026? When will it end??
Chinchilla assumptions = literal solipsism.
Chinchilla training point = "correct" ONLY if the computer making the model is the ONLY computer in the world AND you are the SOLE user of the model once its made AND you will never get another computer OR interact with another human ever. If you believe there are other people in the world or other computers in the world or an economy in the world or that you will directly share the model with anyone besides yourself... EACH of these "new assumptions" violate Chinchilla. And ALL in the same direction! That's why it's insane to train to Chinchilla. It's always correct to train FAR beyond naive Chinchilla.
To be fair to the researchers, they never used the term "Chinchilla Optimal". That term was 100% fabricated from nothing. The intellectual underpinning of "Chinchilla Optimal" = uninformed NFT-tech-bros spouting off on twitter 4 years ago. People misunderstanding this as "optimal" 4 years ago wasn't a problem and was probably actually still progress at the time (since some labs were under-training below even that tragically pathetic level back then). But the popular idea of "Chinchilla Optimal" (which never existed in any capacity in the research literature) somehow lives on today in people's hearts, even though it's a totally bankrupt concept that never existed and is wildly incorrect.
At best, it's a "Chinchilla Lowerbound". All its assumptions are catastrophically wrong by several orders of magnitude all in the same incorrect direction (low).
Seeing a model trained to the Chinchilla Lowerbound EXACTLY in 2026 is a damning indictment of incompetence on the part of that lab. They didn't just under-train - they provably under-trained - ironically based off the accelerationist ravings of the ghost of BeffJezos from 2022.
NandaVegg@reddit
When the Chinchilla paper came out, there wasn't much talk about synthetic datasets (except maybe FITM-type augmentation), people had barely started to think about RLing LLMs, and more importantly there were a lot fewer abilities models were benchmarked against. We barely had enough data to even reach Chinchilla optimal without (naively) going through multiple epochs.
Chinchilla might not be valid anymore b/c each year there are new "standard" paradigms (and benchmarks) to optimize against, like agentic behavior w/ a terminal. They keep pushing the total compute needed higher, especially in the mid-to-post-training phase.
But I also think it is still a good rule of thumb for a pretraining run (where you pour every bit of available data in).
Silver-Champion-4846@reddit
Maybe the epoch count is low?
LMTLS5@reddit
In the Chinchilla paper itself they mention that higher-param models (they only experimented up to 70B) may require even more tokens. On top of that, Chinchilla was for dense models; this is MoE. So basically 20 tokens per param is almost irrelevant here.
FullOf_Bad_Ideas@reddit
Yeah, though that's for dense. MoEs have different scaling, depending on sparsity - https://arxiv.org/abs/2507.17702
Yes_but_I_think@reddit
Higher or lower?
FullOf_Bad_Ideas@reddit
At the same compute, you need more tokens and a lower activated parameter count, but the total parameter count will be higher than dense. The exact number of training tokens per parameter that's compute-optimal will depend on sparsity, but it's calculated off the activated parameter count, not the total parameter count. DeepSeek V4 Pro is definitely not "compute optimal"; it's overtrained, like all recent models that have good performance for their size.
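Rough numbers to show how much the denominator matters (the activated-parameter count below is a placeholder guess, not a figure from the paper):

```python
# Tokens-per-parameter looks very different depending on which denominator you use.
# Total params and training tokens are the ones discussed in this thread;
# the activated count is a made-up placeholder purely for illustration.
total_params = 1.6e12      # reported total parameter count
activated_params = 40e9    # hypothetical activated params per token (NOT a published figure)
train_tokens = 32e12       # training tokens discussed above

print(train_tokens / total_params)      # 20.0  -> looks "exactly Chinchilla" vs total
print(train_tokens / activated_params)  # 800.0 -> heavily overtrained vs activated
```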
Ok_Warning2146@reddit
I think we will see better model when Kimi makes a derivative out of it. ;)
NoahFect@reddit
Presumably that's why it was released as a "preview" model.
Specter_Origin@reddit
If they were able to wait this long, I wonder why they even released it undercooked...
Puzzleheaded-Drama-8@reddit
I say all DeepSeek releases are just meant to generate profit for them by manipulating the market. Maybe this time they took long positions on other players, so they intentionally made the model not that good?
IrisColt@reddit
it's ogre
jfufufj@reddit
I do notice that the V4 model outputs very long thoughts in order to get a task done.
ninjasaid13@reddit
Deepseek V4.1 probably
mivog49274@reddit
This one will shatter mountains and make rivers of sweat and tears flow, as its older sibling 3.1 did.
I really trust DeepSeek's incremental thrust power, in addition to the fact that this model seems to be a "preview".
With the expected price drop, this will certainly be something.
ikkiho@reddit
The "intelligence density" framing collapses two orthogonal things: parameter density (params used per task) and reasoning density (tokens emitted per task).
For a 1.6T MoE the active params per token govern compute, not the total. So "V4 Pro is 2.5x larger" is misleading once routing is factored in, which is why the thread keeps splitting between "should be smarter" and "undertrained" without converging.
Reasoning density is shaped in post-training: length penalties, DPO/RLHF on conciseness, process reward models that penalize wandering, distillation from a shorter teacher. GPT-5.5 visibly invests in this (short chains, very little internal narration). DeepSeek's published recipe has historically front-loaded into pretraining and SFT, with comparatively less compute spent on conciseness-shaped RL. The V3.2 paper basically said this out loud when it flagged token efficiency as future work.
So "density decreased" is the wrong diagnostic. The model is not dumber; the post-training stage that controls tokens-per-unit-of-reasoning is weak or absent. A single major-version bump (especially one prioritizing Ascend deployment per Hyp3rSoniX) would not close that gap. Expect a V4.1 or a separate "turbo"/"flash" branch tuned specifically for reasoning length.
NandaVegg@reddit
GPT-5.5 is very good at this. It's been known that a sufficiently large model (in this case I think "large" means more dims rather than more layers) can retain longer CoT internally even after it is "overwritten" by post-training that promotes shorter CoT, but the model also needs more expensive RL passes.
Ok_Warning2146@reddit
The main improvement of DSV4 is KV cache savings, followed by speed gains. Raw intelligence is not their forte.
NandaVegg@reddit
I tried to post this but it was immediately automodded: DS V4 is also quite an idiosyncratic model compared to GLM 5.1 and Kimi 2.6, which are more similar to each other. Both Pro and Flash are the highest AA-Omniscience hallucination-rate models ever.
This means the model almost never refuses to answer or questions itself; instead it will try to come up with a guessed continuation anyway. It may also mean the model never stops or can't be steered when its confidence is too high (which jibes with other commentary that it refuses to fix itself even when "told" through the user prompt; you'd need to manually edit the model output, like with a base model).
Methodologies to reduce this are quite thoroughly studied (Grok is heavily trained against this, as its main use case is news/real-time SNS post retrieval), so it is mostly up to each lab whether to reduce it. Maybe DS V4 was heavily geared towards frontier research that requires a lot of guesswork rather than known-facts-based tasks. But that comes with a worse user experience for "normal" use cases.
It is also probably good for creative writing since creativity will not get subtly questioned by mini-CoT type prose like "it is not X but maybe Y".
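For reference, the general shape of the hallucination-penalizing scoring mentioned above (and of the training tricks that target it) is roughly the sketch below: abstaining costs nothing, a confident wrong guess costs a lot. This is not AA-Omniscience's exact formula, just an illustration:

```python
# Illustrative scoring in the spirit of hallucination-penalizing evals:
# correct answers score +1, abstentions score 0, wrong answers are penalized.
def score(answers: list[tuple[bool, bool]], wrong_penalty: float = 1.0) -> float:
    # each item is (attempted, correct)
    total = 0.0
    for attempted, correct in answers:
        if not attempted:
            continue                     # abstained: no reward, no penalty
        total += 1.0 if correct else -wrong_penalty
    return total

guessy_model = [(True, True), (True, False), (True, False)]      # always answers
cautious_model = [(True, True), (False, False), (False, False)]  # abstains when unsure
print(score(guessy_model))    # -1.0
print(score(cautious_model))  #  1.0
```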
Finanzamt_Endgegner@reddit
Well, it's a preview, so it's not that surprising they haven't fixed token efficiency yet; that's what they're going to do in further versions, is my guess. It's probably not even fully trained yet - the tokens it was trained on are rather few - but the potential is there. My guess is they wanted to try out a post-training run on a half-ready pretrain to test how well their architecture changes work out, and since they seem to think this went well, they released it.
-dysangel-@reddit
I'm just surprised that they released it in this state, if that's the case. Was it pressure from the community? I have been hoping V4 would come out, but it's surprising that these are the default models on deepseek.com chat when the performance is not refined yet.
daniel-sousa-me@reddit
It seems to be a significant improvement over 3.2, so it's a worthy release even if they have something better down the pipeline
Also, having more people try it will allow a better final release
Finanzamt_Endgegner@reddit
Well, they were quite a bit behind and I'd guess there was some pressure from the CCP. Anyways, it's a good proof of concept and will bring things forward for sure.
TheKingOfTCGames@reddit
GPT 5.5 was specifically trained for token efficiency; it's like 3-5x more efficient than Opus and like 10x more than Sonnet, which is (probably) a similarly sized model to V4.
SpicyWangz@reddit
Hope they release another version of gpt-oss soon
Technical-Earth-3254@reddit
I wish they would release old models like 4o/mini, GPT 4.1, GPT 5/mini... after they get deprecated as well. But since they're having quite a lot of funding problems rn, I doubt they will release any proper LLMs as OSS in the near future.
Square_Empress_777@reddit
Do you know the closest I can get to something like 4o? I have a 5090 and 32 GB VRAM. I liked its friendliness.
Due-Memory-6957@reddit
http://canirun.ai/
grumd@reddit
With a 5090 you should run Qwen 3.6 27B at a quant like UD-Q5_K_XL (or also try Q4-Q6 depending on what fits)
Square_Empress_777@reddit
Thanks, do you know where i can get an uncensored version of this model?
grumd@reddit
There's a bunch of them on huggingface
evia89@reddit
You can't. Closest is the nanogpt $8 sub and trying to emulate it with one of the big CN models.
boutell@reddit
There are local models that have been trained to emulate 4o. You can run them with ollama, which includes a basic UI, and various other tools. Have a google for them.
Square_Empress_777@reddit
Ok. Do you have any suggestions?
boutell@reddit
No, you'll have to do research.
boutell@reddit
The usual disclaimers apply (don't trust your life decisions solely to any AI, much less one that fits on your computer)
Technical-Earth-3254@reddit
Idk sry, I'm not using AI for conversation, just for programming and researching.
Theio666@reddit
Tbh I don't think that releasing 4o in OSS is a good idea...
Technical-Earth-3254@reddit
GPT 5.5 is insanely efficient and fast. It doesn't really feel worse than 5.4, but it doesn't really have to reason for the same tasks where 5.4 took minutes to work on something. Truly impressive what they did there. The price increase is still there though, and time will show whether it's better for the user or OAI in the long run.
Far-Low-4705@reddit
They probably increased efficiency, kept per-token costs the same, and increased prices to grow the profit margin without anyone noticing.
That would be my guess, since they are burning cash insanely fast. Anthropic has insane prices, but they aren't burning cash as fast as OpenAI.
Due-Memory-6957@reddit
Tbh we can't actually know the density of proprietary models, they can just lie.
Middle_Bullfrog_6173@reddit
Yeah, they "dominate" the AA token use charts as well, so definitely token hungry.
I'm not surprised density takes a hit at the frontier. We don't really know how the closed models compare. Flash is not that bad, just a bit disappointing after the small Qwens have pushed density so far.
Technical-Earth-3254@reddit
I wonder if the parameter numbers Elon Musk dropped on X a few weeks ago for Grok and the Anthropic models were accurate.
ImpressiveSuperfluit@reddit
I, too, wish that moron would ever say something accurate. But alas.
coder543@reddit
this thread is talking about intelligence density in terms of tokens, not weights, and the Qwen3.5+ models use a lot of tokens. They have bad token density. But, yes, the DeepSeek V4 models require even more tokens.
Zc5Gwu@reddit
I don't think this chart shows everything. It's not just output tokens you should pay attention to, but speed as well: if you can output tokens very fast and cheaply (i.e. gpt-oss), then it doesn't matter how many tokens you output, since you're making up for it in speed.
Middle_Bullfrog_6173@reddit
That's a log scale so the difference between Deepseek and Qwen is actually quite significant. But the OP also talked about model size.
IMHO, total compute is more important than token use alone. Something like active params x tokens used as a ballpark. And total size matters too, of course.
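A quick version of that ballpark, with made-up numbers for a small token-dense model vs a big token-hungry one:

```python
# Ballpark inference compute ~ active params * tokens generated (times a constant).
# Both parameter counts and token counts below are made up for illustration.
def inference_ballpark(active_params: float, tokens_generated: float) -> float:
    return active_params * tokens_generated

token_dense = inference_ballpark(3e9, 4_000)     # small dense model, few tokens per task
token_hungry = inference_ballpark(40e9, 20_000)  # big sparse model, long trajectories

print(f"compute ratio: {token_hungry / token_dense:.0f}x")  # ~67x more compute per task
```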
Finanzamt_Endgegner@reddit
Yeah, it's a trade-off between token density and parameter density, but since I can run 27B and can't run 300B+ models, I'll take that.
coder543@reddit
That's a false dichotomy since you can also run Gemma 4, which is very token dense for its intelligence, as that chart shows.
Zyj@reddit
> However, in V4 Pro, the situation seems to have worsened. Even the non-thinking mode uses significantly more tokens than V3.2, and V4 Pro (1.6T) is roughly 2.5x larger than V3.2 (0.67T). This suggests that the intelligence density of the model has decreased rather than improved!
How can you claim that? It very much depends on the output quality.
claudiollm@reddit
fwiw the FullOf_Bad_Ideas comment is the one i keep reaching for. if compute optimal is calculated off activated not total params, then "undertrained" ppl are using the wrong denominator. v4 pro is probably overtrained relative to its activated count, which would also explain why intelligence density drops as total scales without scaling activated.
is that the consensus here or am i picking the wrong frame
fuck_cis_shit@reddit
they still haven't done nearly all the post-training that they plan
you know the difference between deepseek-v3 and -v3.3, right? or qwen 3 and qwen 3.6? v4 is just starting still
dogesator@reddit
The chart you posted yourself shows that DeepSeek V4 Pro achieves way better accuracy than DeepSeek V3.2; that's not a worsening. If you extrapolate the curve of DeepSeek V3.2 tokens used vs accuracy achieved, it's a similar Pareto curve to DeepSeek V4, not better or worse.
igorsusmelj@reddit
I’m not sure we can compare these. Total tokens and output tokens don’t have to match. Also, I’m not sure if successful trajectories would use fewer tokens as the model stops vs unsuccessful ones where it continues to struggle and try.
Kahvana@reddit
Yeah, don't blame them tho. Lots of new things being tried out in this release, you can't have it all. Wonder if they will address it or if they will focus first on engrams.
Yes_but_I_think@reddit
Artificial Analysis shows 75M tokens (5.5 xhigh) vs 190M tokens (4 Pro max) for completing their benchmarks; that's like 2.5x more, not 10x more.
ambient_temp_xeno@reddit
It sure does chew through tokens.
Comfortable-Rock-498@reddit
I have also observed this in my tests. Hopefully, they will address it in upcoming versions
fatihmtlm@reddit
Is this using the official api?
Mindless_Pain1860@reddit (OP)
Figures are taken from the GPT model card and the DeepSeek V4 paper
IngenuityNo1411@reddit
Hot take: at least you have the possibility of deploying DeepSeek V4 Pro on your own hardware, which is impossible with GPT-5.4 or 5.5.