DeepSeek-R1’s paper was updated 2 days ago, expanding from 22 pages to 86 pages and adding a substantial amount of detail.

[-]

redditonc3again@reddit

This should significantly benefit Open-R1 right? Is that project still active?

Reply

[-]

New arch about to drop? dsv4 + r2? Packing all the goodies learned from last year. Hopefully they try smaller sizes as well. Would be interesting to see how the arch improvements work at several sizes.

Reply

[-]

swaglord1k@reddit

that would make sense. we'll finally see whether deepseek was a one trick pony or if they'll manage to drop another market-disrupting model

Reply

[-]

ForsookComparison@reddit

R1-0528 and V3.2 were both equally SOTA-threatening, especially when cost comes into play. The big shocker was *"you can do this without being a USA-based hyperscaler!?"*

Reply

[-]

Zestyclose839@reddit

I can 100% see a lab dropping something market-disrupting with a new architecture. World models would be a big one, letting an LLM imagine how something would play out in the real world instead of just guessing based on what's described in its training text. JEPA (joint embedding prediction architecture) is a great example, as it gets the model thinking in abstract concepts. Research labs are better able to "move fast and break things" in this way, as there aren't many commercial consequences for a failed model. Consider the success of DeepMind's releases vs. Bard, for instance. If DeepSeek were to drop an LLM with a world model and real-world use case, they could absolutely sweep the competition.

Reply

[-]

espadrine@reddit

> especially when cost comes into play

To emphasize this point, look at this graph from https://metabench.organisons.com/. There is one dot in the middle that stands out.

https://preview.redd.it/n7mwph9w8ybg1.png?width=1598&format=png&auto=webp&s=f7ce556084b0fa4cb2282aac6e2b3f78a8c79558

Reply

[-]

sdkgierjgioperjki0@reddit

That website has some wrong information. It doesn't correctly account for caching for some models while using caching price for others. Also it has exact parameter counts for Gemini 3 which isn't publicly known, and the size they provide is way too small to be plausible.

Reply

[-]

espadrine@reddit

Unknown values are guessed from known values, so some information is uncertain (with uncertainty estimated in the table tab), but overall it gives a good picture IMO.

Reply

[-]

sdkgierjgioperjki0@reddit

The pricing is still very wrong. Kimi, for example, has its price calculated based on cache hits, while Deepseek 3.2 is using their cache miss price.
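
To make the cache-hit vs. cache-miss point concrete, here is a minimal sketch of how the comparison can flip. The prices and the 50% hit rate are made-up assumptions for illustration, not real quotes for Kimi or DeepSeek, and `blended_input_price` is a hypothetical helper, not something the site uses.

```python
def blended_input_price(hit_price: float, miss_price: float, hit_rate: float) -> float:
    """Expected price per million input tokens for a given cache hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_price + (1.0 - hit_rate) * miss_price


# Hypothetical prices in USD per million input tokens (assumptions, not real quotes).
model_a = {"hit": 0.10, "miss": 0.50}
model_b = {"hit": 0.05, "miss": 0.30}

# Mixed basis: model A at its cache-hit price, model B at its cache-miss price.
unfair = (model_a["hit"], model_b["miss"])

# Same basis: both models priced at the same assumed 50% cache hit rate.
fair = tuple(
    blended_input_price(m["hit"], m["miss"], hit_rate=0.5)
    for m in (model_a, model_b)
)

print("mixed basis (A, B):", unfair)
print("same basis  (A, B):", fair)
```

On the mixed basis model A looks cheaper, while on the same basis model B is cheaper, which is why the basis has to match before plotting cost against score.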

Reply

[-]

espadrine@reddit

You're right, wrong data input. Fixed!

Reply

[-]

No_Afternoon_4260@reddit

Idk but if they have the balls to release a 100-200B dense model that would be a revolution to me at least x)

Reply

[-]

Bakoro@reddit

To a smaller extent, we already had a moment like that with HRM/TRM models; it just didn't crash the stock market because they aren't LLMs. The small recursive models proved that small, task-specific models were viable in a new way, and that's revived a whole branch of inquiry.

Ultimately scale is still king, and everyone knows that DeepSeek has a ton of compute, even if it's not all the top of the line stuff. Research has been making some big moves in more disciplined architecture and training, so I think we'll have at least one more "oh shit" moment where someone finds a way to make models more sample efficient.

My money is on truly multimodal models providing highly structured latent spaces, and moving past raw transformers as a linear sequence of layers, and instead having multiple modules with cross attention. The problem with text-only models is that they have to reconstruct everything about the world purely from the relationships between words. There is no ground truth beyond the adjacency and frequency of text. We've seen that Reinforcement Learning with Verifiable Rewards helps define the relationships in a more concrete way, and RL can refine the knowledge a pretrained model already has.

There's a bunch of research into the geometric representations models make, and the manifolds of data, which tend to be low dimensional. My intuition is that, by anchoring latent dimensions with multimodal biases, we'll get models that can learn with a few examples, rather than thousands, after the initial structure is defined. That'll be the big "oh shit" moment where we'll be able to take a base model and do serious training on single GPUs, and be able to do extensive fine-tuning with minimal risk of catastrophic forgetting.

Reply

[-]

swaglord1k@reddit

considering what they managed to do by squeezing v3, i think we might.

Reply

[-]

Few_Painter_5588@reddit

Uhm, no. Deepseek were always market disrupting. Deepseek v2, v3, V3 0324, R1, 3.2. They've only had three flops: V1, R1 0528 and V3.2 speciale.

Reply

[-]

SlowFail2433@reddit

R1 0528 was very widely used

Reply

[-]

my_name_isnt_clever@reddit

Speciale beat Gemini 3 Pro at pure reasoning, and it was never intended to be a standalone product. Not sure how that could be considered a flop.

Reply

[-]

-dysangel-@reddit

I really liked R1 0528 - it's the only one of the Deepseek models I've tried that still worked well even at Q2

Reply

[-]

swaglord1k@reddit

disagree, only r1 was disruptive, and even that was barely on par with the closed-source sota. i wouldn't be surprised if they've been cooking a new architecture that will mog existing llms

Reply

[-]

Few_Painter_5588@reddit

Well, deepseek doesn't really compete with the frontier. But from a cost-effectiveness standpoint, they disrupted the markets quite significantly. ChatGPT mini, Claude Haiku and Gemini Flash basically got destroyed by Deepseek, and that's a market where the real money is made. I think deepseek v4 with their new architecture will disrupt the market, especially if they follow up their theory and raise the parameter count, and especially if they integrate their work on Janus, Deepseek OCR and Deepseek VL into v4.

Reply

[-]

panic_in_the_galaxy@reddit

You don't update a paper with new results. You would just write a new paper. These are just explanations and clarifications.

Reply

[-]

ab2377@reddit

but why so late?

Reply

[-]

autoencoder@reddit

red herring for competitors?

Reply

[-]

layer4down@reddit

I’m not sure how it would benefit DeepSeek.

Reply

[-]

linkillion@reddit

A year is forever in the AI space right now but not all that long in scientific research. Most papers take anywhere from a couple of months to a year from first draft to publication in a peer-reviewed journal (and making a draft requires months to years of experimentation). I'm not saying this is what's happening here (it's just an arXiv update), but it's not at all surprising if you're doing good research.

Reply

[-]

QuackerEnte@reddit

"slow and steady wins the race"

Reply

[-]

panic_in_the_galaxy@reddit

Because they probably got a lot of questions from other researchers and tried to answer them. Writing papers takes time.

Reply

[-]

ResidentPositive4122@reddit

No, I mean they've updated this with all they had, and are preparing for a new arch. Like closing a chapter.

Reply

[-]

Elven77AI@reddit

This seems like it. Dumping dozens of pages means it's no longer relevant to their current research and they moved on to something far more effective (i.e. no competitor advantage), likely a new reasoning architecture built from https://huggingface.co/papers/2512.24880

Reply

[-]

qtvivies@reddit

https://preview.redd.it/t6ic0x3nywbg1.png?width=1965&format=png&auto=webp&s=ae5ef60128b8a0cf89351e8673ea41eddafb037c Something interesting towards the end. Looks like someone forgot about this

Reply

[-]

Intelligent-Time7704@reddit

real openai

Reply

[-]

Mikasa0xdev@reddit

86 pages? That's a weekend read.

Reply

[-]

Crisis_Averted@reddit

layman here: no idea what that implies.

Reply

[-]

qtvivies@reddit

They intended to release the (mostly synthetic? by r1-zero) data used for the post training of R1, except I assume that `xxx` was left in accidentally as a placeholder in place of a link. Or they changed their mind and just forgot to remove it.

Reply

[-]

Dany0@reddit

Supervised fine-tuning and Reinforcement learning data. Used to post-train

Reply

[-]

Eyelbee@reddit

It's crazy it's only been one year, feels like ages

Reply

[-]

badgerbadgerbadgerWI@reddit

The level of detail they're releasing is remarkable. This kind of transparency is what pushes the whole field forward. Really interested in their distillation approach - getting smaller models to match larger ones' reasoning is key for edge deployment.

Reply

[-]

CryptoUsher@reddit

honestly the fact they went back and added 60+ pages is kind of wild. most papers just release and call it a day, maybe a small erratum if something's broken. wonder if this was all stuff they had internally but couldn't publish initially, or if they're responding to community feedback and trying to explain their approach better. either way it's good for reproducibility. the original paper was already dense but felt like it was missing implementation details. if they're actually filling in those gaps this could be huge for people trying to replicate or build on their work.

Reply

[-]

Imaginary-Bit-3656@reddit

I think it was part of getting the paper accepted into Nature, where that same additional information was included with the work as supplementary material (I think the standards for Nature are considered quite high). Many of the papers we see on arXiv are preprints: works that have not been peer reviewed and may never be published in a journal.

Reply

[-]

timfduffy@reddit

I think all this info was previously released as a [supplement to their R1 paper in Nature](https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-025-09422-z/MediaObjects/41586_2025_9422_MOESM1_ESM.pdf).

Reply

[-]

yoshiK@reddit

I did quickly throw the two papers into Gemini. It's really fun to live in the future.

# Gemini summary:

The paper **"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning"** (arXiv:2501.12948) marks a significant milestone in open-source AI by demonstrating that advanced reasoning (similar to OpenAI’s o1) can be achieved through large-scale Reinforcement Learning (RL) with minimal human-annotated data.

The two versions you provided represent the initial release (**v1**, Jan 22, 2025) and the latest updated version (which has been significantly expanded to **86+ pages** as of January 2026).

### 1. Overall Paper Summary

The paper introduces two primary models:

* **DeepSeek-R1-Zero:** A model trained via "pure RL" (using the GRPO algorithm) starting directly from a base model without any Supervised Fine-Tuning (SFT). It demonstrates that reasoning behaviors like self-correction and reflection can emerge purely from reward signals.
* **DeepSeek-R1:** A more "user-friendly" version that uses a multi-stage pipeline (Cold-start SFT → Reasoning RL → Rejection Sampling/SFT → General RL) to fix the "readability" and "language mixing" issues of R1-Zero while maintaining state-of-the-art reasoning performance.
* **Distillation:** The authors show that the reasoning patterns discovered by the 671B model can be distilled into smaller models (1.5B to 70B), allowing a 14B model to outperform much larger ones on math and coding benchmarks.

---

### 2. Comparison: Extensions in the New Version

The newer version is a massive technical expansion (growing from roughly 22 pages to over 85 pages). The key additions and extensions include:

#### A. The "Aha Moment" Expansion (Section 2.2.1)

The new version provides a much deeper analysis of the **"Aha Moment"**, the point during RL training where the model unexpectedly learns to "re-think" its approach. The extension includes more qualitative examples and internal data showing the model's transition from linear solving to iterative self-correction without being prompted to do so.

#### B. Detailed 4-Stage Training Pipeline

While v1 outlined the stages, the new version details the specific composition of the **800k total training samples**:

* **Stage 1 (Cold Start):** Expanded details on the ~5,000-10,000 long CoT (Chain of Thought) samples used to "prime" the model.
* **Stage 3 (Rejection Sampling):** A deeper dive into how 600k reasoning-related and 200k non-reasoning samples were filtered and used to improve the model's general chat capabilities and prevent "forgetting" during the reasoning-heavy RL stages.

#### C. Comprehensive Ablation Studies

The new version adds extensive "What if?" scenarios that were absent or brief in v1:

* **Distillation vs. RL:** New evidence explaining *why* distilling a large model's reasoning traces into a small model is more effective than training that small model directly with its own RL.
* **Base Model Impact:** Analysis of how different base models (DeepSeek-V3 vs. Qwen vs. Llama) respond to the R1 training recipe.

#### D. Expanded "Unsuccessful Attempts" (Section 4.2)

One of the most valuable additions for researchers is the expanded section on what **did not work**. The new version elaborates on their failures with:

* **Process Reward Models (PRM):** Detailed reasons why step-level rewards were difficult to scale or prone to "reward hacking" compared to the outcome-based rewards used in R1.
* **Monte Carlo Tree Search (MCTS):** Technical explanation of why MCTS didn't provide the expected gains over simple RL in the context of LLM reasoning.

#### E. New Benchmarks & Technical Specs

* **Updated Results:** Includes more recent evaluations on benchmarks like **AIME 2025**, **LiveCodeBench**, and specialized medical/legal reasoning tests.
* **Hyperparameters:** The new version includes exhaustive tables of training hyperparameters (learning rates, GRPO group sizes, KL divergence coefficients) which were previously withheld or summarized.

### Summary Table

| Feature | v1 (Original) | Latest Version (Extension) |
| :--- | :--- | :--- |
| **Page Count** | ~22 Pages | **86+ Pages** |
| **Methodology** | High-level 4-stage overview | Granular detail on each stage (SFT, RL, Rejection Sampling) |
| **Behaviors** | Mentions "self-correction" | Deep dive into "Aha Moment" with case studies |
| **Failed Paths** | Brief mention of PRM/MCTS | Exhaustive analysis of why PRM and MCTS underperformed |
| **Distillation** | Introduced 1.5B to 70B models | Added deep ablation on distillation efficiency and data filtering |
| **Hyperparameters** | Partial/Summary | **Complete Technical Specs** for reproducibility |
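
For anyone wondering what the "GRPO group size" mentioned in the summary refers to: GRPO samples a group of answers per prompt and scores each one relative to the rest of its group instead of using a learned value model. Below is a minimal sketch of just that group-relative advantage step. The function name, the `eps` term, and the choice of population standard deviation are assumptions made for illustration, not DeepSeek's code, and the full objective (clipped policy ratio, KL penalty) is left out.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled answer's reward against its own group.

    rewards: one scalar reward per answer sampled for the same prompt.
    eps: small constant (an assumption) to avoid dividing by zero when
         every answer in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ here
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group of four sampled answers, rewarded 1.0 when a verifier accepts
# the final answer and 0.0 otherwise (hypothetical values).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct answers in a mixed group end up with positive advantages and incorrect ones with negative advantages, while a group where every answer scores the same contributes no learning signal.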

Reply

[-]

pigeon57434@reddit

i don't understand people who just post complete slop AI summaries into comment sections. if we wanted a summary we would ask a model ourselves, since it requires no effort

Reply

[-]

Chickenbuttlord@reddit

Nobody asked for your opinion buddy, yes we do very much need a summary posted into a comment in case we're too lazy to do it ourselves. Thank you op!

Reply

[-]

pigeon57434@reddit

i guess maybe if it was a good summary it would be ok, but this is literally such a bad one and in some cases just objectively factually wrong

Reply

[-]

menictagrib@reddit

The new version of someone with no background knowledge answering a technical question for dopamine by repeating whatever they see on the first page of Google

Reply

[-]

arguingwithabot@reddit

I hear you, but at the end of the day they saved you a few clicks, keystrokes and tokens.

Reply

[-]

DigThatData@reddit

there is no section 2.2.1.

Reply

[-]

yoshiK@reddit

The discussion of the a-ha moment is actually section 2.3. Though I didn't check too closely since the summary is anyhow that I probably need to sit down and read the whole thing at some point.

Reply

[-]

warnerbell@reddit

The original paper was light on implementation specifics. If they've added more on how they got the reasoning behavior to emerge, that's valuable.

Reply

[-]

TelloLeEngineer@reddit

does arxiv have a diff UI?

Reply

[-]

Freonr2@reddit

Don't think so, but you can download the raw TeX format versions of the paper and see the appendix.tex is new and the largest .tex file, or look for diffs that way from the raw .tex files.
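
A rough sketch of that approach, assuming both source versions have already been downloaded and extracted into two local folders. The folder names and the `diff_tex_trees` helper are hypothetical; the only library used is Python's standard `difflib`.

```python
import difflib
from pathlib import Path


def _read_lines(path):
    """Return the file's lines, or an empty list if the file is missing."""
    if path is None:
        return []
    return path.read_text(encoding="utf-8", errors="replace").splitlines()


def diff_tex_trees(old_dir, new_dir):
    """Map each changed, added, or removed .tex file to its unified diff lines."""
    old_dir, new_dir = Path(old_dir), Path(new_dir)
    old_files = {p.relative_to(old_dir): p for p in old_dir.rglob("*.tex")}
    new_files = {p.relative_to(new_dir): p for p in new_dir.rglob("*.tex")}

    report = {}
    for rel in sorted(set(old_files) | set(new_files)):
        name = rel.as_posix()
        diff = list(
            difflib.unified_diff(
                _read_lines(old_files.get(rel)),
                _read_lines(new_files.get(rel)),
                fromfile=f"v1/{name}",
                tofile=f"v2/{name}",
                lineterm="",
            )
        )
        if diff:  # identical files produce no diff lines and are skipped
            report[name] = diff
    return report
```

Calling `diff_tex_trees("r1_v1_src", "r1_v2_src")` (hypothetical folder names) and sorting the result by diff length would surface a file that exists only in the newer version, such as the appendix, near the top.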

Reply

[-]

Ok_Technology_5962@reddit

Current research is linear attention. Save.2 with the cache and now they had a massive paper come out that puts the linear into the whole model, making it possible to train more than 60 layers. So yes this one is done. The compute and thinking will now happen internally in 1000 layers

Reply

[-]

jeffwadsworth@reddit

I was using the online chat version last night to improve a large Java class (40K tokens) with multiple methods. It did so beautifully with zero issues in one shot. The same task in my sub of Gemini 3 Pro chat interface failed in a few shots due to hallucinations. They have really improved that model a lot from a year ago.

Reply

[-]

Aggressive-Bother470@reddit

New grpo details perhaps? From reading the hf page it implied it was maybe light in that regard?

Reply
