Even DeepSeek switched from OpenAI to Google
Posted by Utoko@reddit | LocalLLaMA | View on Reddit | 174 comments

Text-style similarity analysis from https://eqbench.com/ shows that the new R1 is now much closer to Google's models,
so they probably used more synthetic Gemini outputs for training.
InterstellarReddit@reddit
This is such a weird way to display this data.
wfamily@reddit
why? i got it immediately?
learn-deeply@reddit
It's a cladogram, used everywhere in biology.
HiddenoO@reddit
Cladograms generally don't align in a circle with text rotating along. It might be the most efficient way to fill the space, but it makes it unnecessarily difficult to absorb the data, which is kind of the point of any diagram.
_sqrkl@reddit
I do generate dendrograms as well, OP just didn't include it. This is the source:
https://eqbench.com/creative_writing.html
(click the (i) icon in the slop column)
llmentry@reddit
This is incredibly neat!
Have you considered inferring a weighted network? That might be a clearer representation, given that something like DeepSeek might draw on multiple closed sources, rather than just one model.
I'd also suggest a UMAP plot might be fun to show just how similar/different these groups are (and also because, who doesn't love UMAP??)
Is the underlying processed data (e.g. a matrix of models vs. token frequency) available, by any chance?
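Something like this is what I had in mind for the UMAP side, as a rough sketch (random stand-in fingerprints and made-up model labels, just to show the shape of the idea):

```python
# Rough sketch: embed each model's binary slop-feature fingerprint in 2D with UMAP.
# All names and data below are stand-ins, not the real eqbench data.
import numpy as np
import matplotlib.pyplot as plt
import umap  # pip install umap-learn

rng = np.random.default_rng(0)
model_names = [f"model_{i}" for i in range(20)]        # placeholder labels
fingerprints = rng.integers(0, 2, size=(20, 1000))     # placeholder 0/1 slop features

# Jaccard is a natural metric for presence/absence fingerprints.
emb = umap.UMAP(n_neighbors=5, metric="jaccard", random_state=42).fit_transform(fingerprints)

plt.scatter(emb[:, 0], emb[:, 1])
for name, (x, y) in zip(model_names, emb):
    plt.annotate(name, (x, y), fontsize=7)
plt.title("Slop-profile UMAP (illustrative stand-in data)")
plt.show()
```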
_sqrkl@reddit
here's a data dump:
https://eqbench.com/results/processed_model_data.json
looks like I've only saved frequency for ngrams, not for words. the words instead get a score, which corresponds to how over-represented each word is in the creative writing outputs vs a human baseline.
let me know if you do anything interesting with it!
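If anyone wants a quick look at what's in the dump without assuming anything about its structure, something like this works:

```python
# Peek at the processed_model_data.json dump without assuming its schema:
# just load it and print the top-level shape and field names.
import json
import urllib.request

url = "https://eqbench.com/results/processed_model_data.json"
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

print(type(data))
if isinstance(data, dict):
    keys = list(data.keys())
    print(f"{len(keys)} top-level entries, e.g. {keys[:5]}")
    first = data[keys[0]]
    if isinstance(first, dict):
        print("fields in first entry:", list(first.keys()))
```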
_sqrkl@reddit
Yeah a weighted network *would* make more sense since a model can have multiple direct ancestors, and the dendrograms here collapse it to just one. The main issue is a network is hard to display & interpret.
UMAP plot looks cool, I'll dig into that as an alternate way of representing the data.
> Is the underlying processed data (e.g. a matrix of models vs. token frequency) available, by any chance?
I can dump that easily enough. Give me a few secs.
Also you can generate your own with: sam-paech/slop-forensics
HiddenoO@reddit
Sorry for the off-topic comment, but I've just checked some of the examples on your site and have been wondering if you've ever compared LLM judging between multiple scores in the same prompt and one prompt per score. If so, have you found a noticeable difference?
_sqrkl@reddit
It does make a difference, yes. The prior scores will bias the following ones in various ways. The ideal is to judge each dimension in isolation, but that gets expensive fast.
HiddenoO@reddit
I've been doing isolated scores with smaller (and thus cheaper) models so far. It'd be interesting to see for which scenarios that approach works better than using a larger model with multiple scores at once - I'd assume there's some 2-dimensional threshold between the complexity of the judging task and the number of scores.
InterstellarReddit@reddit
In biology yes, not in data science.
learn-deeply@reddit
Someone could argue that this is the equivalent of doing digital biology.
InterstellarReddit@reddit
You can argue all you want but look at what the big players are doing to present that data. They didn’t choose that method for no reason.
learn-deeply@reddit
I don't know what you mean by "big players".
InterstellarReddit@reddit
The big four in AI
learn-deeply@reddit
I have no idea what you're talking about. What method are the big four players in AI choosing?
Evening_Ad6637@reddit
I think they mean such super accurate diagrams like those from nvidia: +133% speed
Or those from Apple: Fastest M5 processor in the world, it’s 4x faster
/s
justGuy007@reddit
This chart sings "You spin me right round, baby, right round"
silenceimpaired@reddit
Yup. I gave up on it.
Megneous@reddit
It's easy to read... Look.
V3 and R1 from 03-24 were close to GPT-4o in the chart. This implies they used synthetic data from OpenAI models to train their models.
R1 from 05-28 is close to Gemini 2.5 Pro. This implies they used synthetic data from Gemini 2.5 Pro to train their newest model, meaning they switched their preference on where they get their synthetic data from.
Nicoolodion@reddit
What are my eyes seeing here?
_sqrkl@reddit
It's an inferred tree based on the similarity of each model's "slop profile". Old r1 clusters with openai models, new r1 clusters with gemini.
The way it works is that I first determine which words & ngrams are over-represented in the model's outputs relative to a human baseline. Then I put all the models' top 1000 or so slop words/n-grams together, and for each model notate the presence/absence of a given one as if it were a "mutation". So each model ends up with a string like "1000111010010" which is like its slop fingerprint. Each of these then gets analysed by a bioinformatics tool to infer the tree.
The code for generating these is here: https://github.com/sam-paech/slop-forensics
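In rough Python, the fingerprinting step looks something like this (a simplified sketch of the idea, not the actual code from the repo):

```python
# Simplified sketch of the slop-fingerprint idea described above -- see the
# slop-forensics repo for the real implementation.
from collections import Counter

def overrepresented(model_counts: Counter, human_counts: Counter, top_k: int = 1000):
    """Words/n-grams whose relative frequency in model output most exceeds the human baseline."""
    model_total = sum(model_counts.values()) or 1
    human_total = sum(human_counts.values()) or 1
    ratios = {
        w: (model_counts[w] / model_total) / ((human_counts.get(w, 0) + 1) / human_total)
        for w in model_counts
    }
    return [w for w, _ in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)[:top_k]]

def fingerprints(per_model_slop: dict) -> dict:
    """Pool every model's slop features, then encode presence/absence per model as a bitstring."""
    features = sorted(set().union(*per_model_slop.values()))
    out = {}
    for model, slop in per_model_slop.items():
        present = set(slop)
        out[model] = "".join("1" if f in present else "0" for f in features)
    return out

# The resulting bitstrings (one per model) are what get handed to the
# bioinformatics tooling (phylip pars) to infer the tree.
```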
Here's the chart with the old & new deepseek r1 marked:
mtomas7@reddit
Offtopic, but on the occasion, I would like to request Creative Writing v3 evaluation for the rest of Qwen3 models, as now Gemma3 has all lineup. Thank you!
Yes_but_I_think@reddit
What is the name of the construct? Which app makes these diagrams?
_sqrkl@reddit
sam-paech/slop-forensics
NighthawkT42@reddit
Easier to read now that I have an image where the zoom works.
Interesting approach, but I think what that shows might be more that the unslop efforts are directed against known OpenAI slop. The core model is still basically a distill of GPT.
Artistic_Okra7288@reddit
This is like digital palm reading.
givingupeveryd4y@reddit
how would you graph it?
lqstuart@reddit
as a tree, not a weird circle
Zafara1@reddit
You'd think a tree like this would lay out nicely, but this data would just make a super wide tree.
You can't get it compact without the circle or making it so small it's illegible.
llmentry@reddit
It is already a graph.
Artistic_Okra7288@reddit
I'm not knocking it, just making an observation.
givingupeveryd4y@reddit
ik, was just wondering if there is a better way :D
Artistic_Okra7288@reddit
Maybe pictures representing what each different slop looks like from a Stable Diffusion perspective? :)
CheatCodesOfLife@reddit
This is the coolest project I've seen for a while!
Evening_Ad6637@reddit
Also clever to use n-grams
BidWestern1056@reddit
this is super dope. would love to chat too, i'm working on a project similarly focused on the long term slop outputs but more so on the side of analyzing their autocorrelative properties to find local minima and see what ways we can engineer to prevent these loops.
_sqrkl@reddit
That sounds cool! i'll dm you
Utoko@reddit (OP)
Here is the list view.
It just shows how close models are to other models in the topics they choose and the words they use,
when you ask them, for example, to write a 1000-word fantasy story with a young hero, or any other prompt.
Claude, for example, has its own branch, not very close to any other models. OpenAI's branch includes Grok and the old DeepSeek models.
It is a decent sign that they used output from those LLMs to train on.
YouDontSeemRight@reddit
Doesn't this also depend on what's judging the similarities between the outputs?
_sqrkl@reddit
The trees are computed by comparing the similarity of each model's "slop profile" (over represented words & ngrams relative to human baseline). It's all computational, nothing is subjectively judging similarity here.
Raz4r@reddit
There are a lot of subjective decisions over how to compare these models. The similarity metric you choose and the clustering algorithm all have a set of underlying assumptions.
Karyo_Ten@reddit
Your point being?
The metric is explained clearly. And actually reasonable.
If you have criticisms, please detail:
- the subjective decisions
- the assumption(s) behind the similarity metric
- the assumption(s) behind the clustering algorithm
and in which scenario(s) those would fall short.
Bonus if you have an alternative proposal.
Raz4r@reddit
There is a misunderstanding within the ML community that machine learning models and their evaluation are entirely objective, and often the underlying assumptions are not discussed. For example, when we use n-grams in language models, we implicitly assume that local word co-occurrence patterns sufficiently capture meaning, ignoring other, more general semantic structures. In the same way, when applying cosine similarity, we assume that the angle between vector representations is an adequate proxy for similarity, disregarding the absolute magnitudes or contextual nuances that might matter in specific applications. Another case is the removal of stop words: here, we assume these words carry little meaningful information, but different research might apply alternative stop word lists, potentially altering final results.
There is nothing inherently wrong with making such assumptions, but it is important to recognize that many subjective decisions are embedded in model design and evaluation. For instance, if you examine tools like PHYLIP, you will find explicit assumptions about the underlying data-generating process that may shape the outcomes.
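To make the cosine similarity point concrete, a toy example (nothing to do with the benchmark data, just the property itself):

```python
# Cosine similarity ignores magnitude: two vectors pointing the same way are
# "identical" even if one is 100x larger.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = 100 * a  # same direction, very different magnitude

print(cosine(a, b))           # 1.0 -- magnitudes are invisible to the metric
print(np.linalg.norm(a - b))  # ~370 -- Euclidean distance sees a huge gap
```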
Karyo_Ten@reddit
We're not talking about semantics or meaning here though.
One way to train an LLM is teacher forcing. And the way to detect who the teacher was is checking output similarity. And the output is words. And checking against a human baseline (i.e. a control group) is how you ensure that a similarity is statistically significant.
Raz4r@reddit
You’re assuming that the distribution between the teacher and student models is similar, which is a reasonable starting point. But alternative approaches could, for instance, apply divergence measures (like KL divergence or Wasserstein distance) to compare the distributions between models. These would rest on a different set of assumptions.
Again, you’re presuming that there’s a meaningful difference between the control group (humans) and the models, but how are you accounting for confounding factors? Did you control covariates through randomization or matching? What experimental design are you using (between-subjects, within-subjects, mixed) ?
What I want to highlight is that no analysis is fully objective in the sense you’re implying.
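A toy sketch of the divergence-based comparison mentioned above (made-up distributions, purely to show what the alternative would measure):

```python
# Compare two models' word-usage distributions with KL divergence and
# Wasserstein distance instead of presence/absence overlap.
import numpy as np
from scipy.stats import entropy, wasserstein_distance

# Toy "probability of using each slop word" distributions for two models.
p = np.array([0.50, 0.30, 0.15, 0.05])
q = np.array([0.40, 0.35, 0.15, 0.10])

kl = entropy(p, q)  # KL(p || q); asymmetric, blows up where q ~ 0 but p > 0
w = wasserstein_distance(np.arange(len(p)), np.arange(len(q)), u_weights=p, v_weights=q)

print(f"KL divergence:        {kl:.4f}")
print(f"Wasserstein distance: {w:.4f}")
```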
Karyo_Ten@reddit
So what assumptions does comparing overrepresented words have that are problematic?
I am not, the whole point of a control group is knowing whether one result is statistically significant.
If all humans and LLMs reply "Good and you?" to "How are you", you cannot take this into account.
Raz4r@reddit
At the end of the day, you are conducting a simple hypothesis test. There is no way to propose such a test without adopting a set of assumptions about how the data-generating process behaves. Whether we use KL divergence, hierarchical clustering, or any other method scientific inquiry requires assumptions.
Karyo_Ten@reddit
I've asked you 3 times what problems you have with the method chosen and you've been full of hot air 3 times.
_sqrkl@reddit
I mean if I was the other guy, I'd have articulated a criticism something like:
> Using parsimony to infer lineage seems a bit arbitrary since the constraints phylip pars uses in its clustering algorithm are intended for dna/rna/assays from organisms that have undergone evolution. And the over-represented words that rise to the top in a model's output aren't present/absent because of these same evolutionary dynamics. Also a model can have multiple "parents" whose outputs it was trained on, which would need a more complex representation of lineage than a dendrogram or phylo tree can show.
To which I'd reply something like:
The usage of the parsimony algorithm to infer the tree is defensible *if* there is signal indicating lineage in the raw data that isn't otherwise extracted by normal hierarchical clustering. For instance, phylip pars weights rare shared features more highly. If our data encodes signal of lineage in ways that somewhat align with the biological assumptions the parsimony algo is based on, it can get us somewhere closer to the true lineage, compared to hierarchical clustering. On the other hand, it might get us *further* from the true lineage if the parsimony constraints fixate on spurious signal, given that we're feeding it cross domain data.
The upshot of being wrong about this hunch that there might be signal that parsimony can pull out about lineage is simply that it behaves more like a naive clustering algo, perhaps producing slightly different trees. In practice, the trees generated with either method are very similar, though with a few interesting differences!
Since there's no way for us to validate whether one clustering method produces a tree closer to ground truth, other than the sniff test, I simply make no claims about *lineage* and present the charts as indicative of *similarity of slop profiles*. The strongest thing I will say as an interpretation is to speculate that their relatedness on the dendrogram may be indicative of which lab made the model or which models seeded its training data. Which I think is defensible regardless of which clustering algorithm is chosen, as long as I've been clear that interpretations like this are speculative.
One clear downside to my approach is that we lose a representation of similarity/distance which is normally shown via branch length when doing hierarchical clustering on similarity. I'm looking into fixing that.
The other clear limitation of this representation is that models can have multiple direct ancestors contributing to its training data, and our dendrograms collapse it to just one. But this critique applies to any clustering method that produces trees like this. To do it properly we could use network clustering or somesuch, though this is much less readable/interpretable.
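For reference, the plain hierarchical-clustering alternative I keep comparing against looks roughly like this (stand-in fingerprints, not the real data):

```python
# Hierarchical clustering on Jaccard distances between binary slop fingerprints,
# i.e. the "naive clustering" baseline, as opposed to phylip's parsimony.
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

model_names = ["r1-old", "r1-0528", "gpt-4o", "gemini-2.5-pro", "claude-opus"]  # placeholders
rng = np.random.default_rng(0)
fingerprints = rng.integers(0, 2, size=(len(model_names), 1000)).astype(bool)   # stand-in data

dists = pdist(fingerprints, metric="jaccard")   # distance on presence/absence features
tree = linkage(dists, method="average")         # average-linkage clustering

dendrogram(tree, labels=model_names)
plt.title("Hierarchical clustering of slop fingerprints (illustrative data)")
plt.tight_layout()
plt.show()
```

Branch heights here carry the similarity information that the parsimony tree currently drops.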
So that's my hypothetical rebuttal to myself. Just to show that some thought actually goes into the methodological choices.
(I'm responding to you because I think the other person was just complaining to complain)
Raz4r@reddit
I’ve emphasized several times that there’s nothing inherently wrong. However, I believe that, based on the proposed methodology, the evidence you present is very weak.
ExplanationEqual2539@reddit
Seems like Google is playing their own game, without being reactive. And it seems Grok is following OpenAI.
It is also interesting to notice that Opus is not that different from the previous Claude models, meaning they haven't significantly changed their strategy...
Utoko@reddit (OP)
Oh yes, thanks for clarifying.
The LLM judge is used for the Elo and rubric scoring, not for the slop forensics.
Utoko@reddit (OP)
Minimal, yes? He tests with 90 samples per model.
In this case it was done with Claude Sonnet 4.0.
It can always be better.
uhuge@reddit
can't you edit the post to show this better layout now?
Utoko@reddit (OP)
No, you can't edit the post, only comments.
uhuge@reddit
super-weird on the Unsloth/gemma-12b-it
Monkey_1505@reddit
Or it's a sign they used similar training methods or data. Personally I don't find the verbiage of the new r1 iteration particularly different.
Utoko@reddit (OP)
Yes, for sure, it only shows similarity in certain aspects. I am not claiming they just used synthetic data.
Just found the shift interesting to see.
Some synthetic data alone also doesn't make a good model. I would even say it is fine to do it.
I love DeepSeek, they do an amazing job for open source.
Monkey_1505@reddit
Deepseek r1 (the first version), used seeding, where they would seed a RL process with synthetic data (really the only way you can train reasoning sections for some topics). I'd guess every reasoning model has done this to some degree.
For something like math you can get it to CoT, and just reject the reasoning that gives the wrong answer. Doesn't work for more subjective topics (ie most of em) - there's no baseline. So you need a judge model or seed process, and nobody is hand writing that shizz.
Current-Ticket4214@reddit
It’s very interesting, but difficult to understand and consume. More like abstract art than relevant information.
JollyJoker3@reddit
It doesn't have to be useful, it just has to sell. Welcome to 2025.
pier4r@reddit
may I interest you with my new invention, the AI quantum blockchain? That's great even for small modular nuclear reactors!
Affectionate-Hat-536@reddit
It will help the metaverse too 🙏
thrownawaymane@reddit
How do I use this with a Turbo Encabulator? Mine has been in flux for a while and I need that fixed.
pier4r@reddit
It doesn't work with the old but gold competition.
Due-Memory-6957@reddit
Generating money means being useful.
Feztopia@reddit
All you need to do is look at which model names are close to each other, even a child can do this, welcome to 2025, I hope you manage to reach 2026 somehow.
Current-Ticket4214@reddit
That’s a brutal take. The letters are tiny (my crusty dusty mid-30’s eyes are failing me) and the shape is odd. There are certainly better ways to present this data. Your stack overflow handle is probably Steve Jobs.
Feztopia@reddit
It's an image, images can be zoomed in. Also I hate apple.
Current-Ticket4214@reddit
Well you should probably see a dentist 😊
Feztopia@reddit
Well unlike some others here, I have the required eyesight to see one.
Mice_With_Rice@reddit
That doesn't explain what the chart represents. It's common practice for a chart to at least state what relation is being described, which this doesn't.
It also doesn't structure the information in a way that is easily viewable on mobile devices, which represents the majority of web page views.
Feztopia@reddit
I'm on the mobile browser, I click on the image, it opens in full resolution in a new tab (because Reddit prefers it to show low resolution images in the post, complain about that if you want). I zoom in which all mobile devices in 2025 support and I see crisp text.
ortegaalfredo@reddit
LLms understand it perfectly:
In essence, this image is a visual analogy, borrowing the familiar structure of a phylogenetic tree to help understand the complex and rapidly evolving ecosystem of large language models. It attempts to chart their "lineage" and "relatedness" based on factors relevant to AI development and performance.
Due-Memory-6957@reddit
And as expected, the LLM gave the wrong answer, thus showing you shouldn't actually ask a LLM to explain to you things you don't understand.
ortegaalfredo@reddit
Its the right answer
Current-Ticket4214@reddit
I just thought it was from Star Wars
One_Tie900@reddit
ask google XD
shaolinmaru@reddit
The Chaldea Security Organization symbol
https://typemoon.fandom.com/wiki/Chaldea_Security_Organization
anshulsingh8326@reddit
Looks like futuristic eye drawing to me
tvetus@reddit
Why is this in a useless radial format instead of a bullet list?
theMonkeyTrap@reddit
TLDR?
Utoko@reddit (OP)
It is possible that DeepSeek switched from training on mostly synthetic data from OpenAI's 4o to Google's Gemini 2.5 Pro.
This is of course no proof, just similarity which shows up in the data,
but it does show clearly that the output writing style changed quite a bit for the new R1.
Zulfiqaar@reddit
Well gemini-2.5-pro used to have the full thinking traces. Not anymore.
Maybe the next DeepSeek model will be trained on claude4..
KazuyaProta@reddit
Yeah.
This is more or less why Gemini now hides the thinking process.
This isn't...actually good for developers
BorjnTride@reddit
That Egyptian eye hieroglyphic is similar to
Utoko@reddit (OP)
makes you think 🤔
AppearanceHeavy6724@reddit
It made it very, very dull. The original DS R1 is fun. V3 0324, which was trained to mimic pre-0528 R1, is even more fun. 0528 sounds duller, like Gemini or GLM-4.
Key-Fee-5003@reddit
Honestly, disagree. 0528 r1 makes me laugh with its quirks as often as original r1 did, maybe even more.
AppearanceHeavy6724@reddit
I found 0528 better for plot planning but worse at actual prose than V3 0324.
InsideYork@reddit
What do you mean fun?
crimeraaae@reddit
probably something like creative writing or the model's conversational personality
AppearanceHeavy6724@reddit
Precisely
Sudden-Lingonberry-8@reddit
idc how fun it is, if it puts bugs on the code for the lulz.
Professional-Week99@reddit
Is this the reason why Gemini's reasoning outputs seem more sloppified? As in, they haven't been making any sense of late.
metaprotium@reddit
love this but why circle
uhuge@reddit
does the slop fingerprint/profile also include the thinking part?
uhuge@reddit
could you add the second image with the r1-05-28?
debauchedsloth@reddit
That's more than likely model collapse in progress.
The Internet is being flooded by Gemini generated slop. Naturally, anything trained on the Internet is going to sound more and more like Gemini.
Utoko@reddit (OP)
OpenAI slop is flooding the internet just as much,
and Google, OpenAI, Claude and Meta all have distinct paths.
So I don't see it. You also don't just scrape the internet and run with it. You make decisions about what data you include.
debauchedsloth@reddit
You might find it interesting to watch Karpathy discussing why deepseek used to ID itself as OAI.
Thick-Protection-458@reddit
Because the internet is filled with OpenAI generations?
I mean, seriously. Without giving details in the system prompt, I managed to get at least a few models to do so.
Does it prove every one of them siphoned OpenAI generations in enormous amounts?
Or does it just mean their datasets were contaminated enough to make the model learn this is one of the possible responses?
Monkey_1505@reddit
Models are also based on RNG, so such a completion can be reasonably unlikely and still show up.
Given OpenAI/Google etc. use RLHF, their models could be doing the same stuff prior to the final pass of training, and we'd never know.
Utoko@reddit (OP)
Thanks for the tip, but I would be thankful for a link. There is no video with this title on YouTube.
ControlProblemo@reddit
https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6105s
debauchedsloth@reddit
I'll summarize: it reports itself to be OAI since the slop on the internet when it was trained was, largely, openai.
It's in one of his videos right around the R1 release.
Utoko@reddit (OP)
Sure one factor.
Synthetic data is used more and more even by OpenAI, Google and co.
It can also be both.
Google, OpenAI and co don't keep their chain of thought hidden for fun. They don't want others to have it.
I would create my synthetic data from the best models when I could. Why would you go with quantity slop and not use some quality, condensed "slop"?
debauchedsloth@reddit
They don't know it is synthetic. That is model collapse.
Utoko@reddit (OP)
So why does it not affect the other big companies? They also use data from the internet.
zeth0s@reddit
Deepseek uses a lot of synthetic data to avoid the alignment. It is possible that they used Gemini instead of OpenAI, also given the api costs
Monkey_1505@reddit
They "seeded" a RL process with synthetic with the original R1. It wasn't a lot of synthetic data AFAIK. The RL did the heavy lifting.
zeth0s@reddit
There was so much synthetic data that deepseek claimed to be chatgpt from openai ... It was a lot for sure
RuthlessCriticismAll@reddit
That makes no sense. 100 chat prompts, actually even fewer, would cause it to claim to be ChatGPT.
zeth0s@reddit
That only happens if the data doesn't contain competing information that lowers the probability that "chatgpt" tokens follow "I am" tokens. And, given how common "I am" is on the internet, it can happen either if someone wants it to happen, or if the data is very clean, with a peaked distribution on "chatgpt" after "I am". Unless DeepSeek fine-tuned its model to identify as ChatGPT.
Monkey_1505@reddit
Educated huh? Tell us about DeepSeeks training flow.
zeth0s@reddit
"Educated guess" is a saying that means that someone doesn't know it but it is guessing based on clues.
I cannot know about deepseek training data, as they are not public. Both you and me can only guess
Monkey_1505@reddit
Oxford dictionary says it's "a guess based on knowledge and experience and therefore likely to be correct."
DeepSeek in their paper stated they used synthetic data as a seed for their RL. But ofc, this is required for a reasoning model - CoT doesn't exist unless you generate it, especially for a wide range of topics. It's not optional. You must include synthetic data to make a reasoning model, and if you want the best reasoning, you're probably going to use the currently best model to generate it.
It's likely they used ChatGPT at the time for seeding this GRPO RL. It's hard to draw much from that, because if OpenAI or Google use synthetic data from others' models, they could well just cover that over better with RLHF. Smaller outfits both care less and waste less on training processes. Google's model in the past at least once identified as Anthropic's Claude.
It would not surprise me if everyone is using the others' data to some degree, for reasoning of course; for other areas it's better to have real organic data (like prose). If somehow they were not all using each other's data, they'd have to be training a larger unreleased smarter model to produce synthetic data for every smaller released model, a fairly costly approach that Meta has shown can fail.
zeth0s@reddit
You see, your educated guess is the same as mine...
Synthetic data from ChatGPT was used by DeepSeek. The only difference is that I assume they also used cleaned data generated from ChatGPT among the data used for pretraining, to cut the cost of alignment (using raw data from the internet for training is extremely dangerous, and generating "some" amount of clean/safe data is less expensive than cleaning raw internet data). The larger, "more knowledgeable" (not smarter; it doesn't need to be smarter during pretraining) model at the time was exactly ChatGPT.
In the past it made sense that they used ChatGPT. Given the cost of the OpenAI API, it makes sense that now they generate it from Google's Gemini.
Monkey_1505@reddit
DeepSeek is also considerably less aligned than ChatGPT or any of its western rivals. It's MUCH easier to get outputs and responses western models would just refuse. If they aligned it, it was probably just with DPO.
It's also a bad idea to use primarily synthetic data in your training data, as eventually that just amplifies hallucinations/errors. Especially bad if you use an RL training approach, as it will compound over time (which DeepSeek does).
I don't see any evidence for your hypothesis. If anything, the opposite is evidenced: there's barely any alignment at all, and the prose of DeepSeek's first release was vastly superior to (or at least vastly different from) ChatGPT's, suggesting use of copyrighted pirated books rather than model outputs.
zeth0s@reddit
DeepSeek is less aligned (clearly) but still aligned enough to raise questions. But it is clear that we don't agree on this point, and that's fine.
We'll never know the truth, as the data is not released. As said, it's speculation territory.
Monkey_1505@reddit
Name a major AI outfit, open or closed source, that has released a less aligned model. The only one I can think of is Qwen, but honestly they are about the same: they will both do anything you ask, anything at all, if you ask right.
It being aligned at all raises no questions. There are automated ways to do this that don't require humans, like the aforementioned DPO.
Monkey_1505@reddit
Their paper says they used a seed process (a small synthetic dataset fed into RL). The vast majority of their data was organic, like most models'. Synthetic data is primarily for the reasoning processes. The weight of any given phrasing has no direct connection to the amount of data in a dataset, as you also have to factor in the weight of the given training etc. If you train something with a small dataset, you can get overfitting easily.
zeth0s@reddit
We'll never know because nobody releases training data. So we can only speculate.
No one is honest on the training data due to copyright claims.
I do think they used more synthetic data than claimed, because they don't have OpenAI's resources for safety alignment. Starting from clean synthetic data reduces the need for extensive RLHF for alignment. For sure they did not start from random data scraped from the internet.
But we'll never know...
Monkey_1505@reddit
Well, no, we know.
You can't generate reasoning CoT sections for topics without a ground truth (ie not math or coding) without synthetic data of some form to judge it on, train a training model, use RL on, etc. Nobody is hand writing that stuff. It doesn't exist outside of that.
So anyone with a reasoning model is using synthetic data.
zeth0s@reddit
I meant: the extent to which DeepSeek used synthetic data from OpenAI (or Google afterwards) in their various trainings, including the training of the base model.
Monkey_1505@reddit
Well, they said they used synthetic data to seed the RL, just not from where. We can't guess where Google or OpenAI got their synthetic data either.
218-69@reddit
Bro woke up and decided to be angry for no reason
placebomancer@reddit
I don't find this to be a difficult chart to read at all. I'm confused that other people are having so much difficulty with it.
FormalAd7367@reddit
i don’t know what i’m reading - i’d need an AI to interpret this 😂
CheatCodesOfLife@reddit
Its CoT process looks a lot like Gemini 2.5's did (before they started hiding it from us).
Glad DeepSeek managed to get this before Google decided to hide it.
DisgustingBlackChimp@reddit
This is art
Pro-editor-1105@reddit
wtf am i looking at
LocoMod@reddit
OpenAI made o3 very expensive via API which is why R1 does not match it. So they likely distilled Google’s best as a result.
pigeon57434@reddit
People claim they also used o1 data, but o3 is cheaper than o1, so if it's true that they used o1 data, why would they not be OK with o3, which is cheaper?
LocoMod@reddit
o1 or o1 Pro? There's a massive difference. And I'm speculating, but o1 Pro takes significant time to respond, so it's probably not ideal when you're running tens of thousands of completions trying to release the next model before your perceived competitors do.
OP provided some compelling evidence for them distilling Gemini. It would be interesting to see the same graph for the previous version.
pigeon57434@reddit
You do realize it's on their website? You can just look at the graph for the original R1, which shows that it's very similar to OpenAI models.
LocoMod@reddit
That means nothing. Every other day we see an open weights model with silly benchmarks claiming to have parity with OpenAI models. Even yesterday we see DeepSeek drop that 8B Qwen distill claiming to match the 232B model. Like really? Anyone really believe those purposefully misleading claims whose entire purpose is for the Chinese industry to draw attention away from the reality?
I tested it by the way. That 8B model is absolute hot trash.
Honestly, I really wish all those claims were true. How awesome would that be? But they are not. And fat DeepSeek is nowhere close to o3 in the type of use case that requires that level of intelligence. I'm sure for 99% of use cases it blows people's minds. And for toy demos. But for actual use worthy of a SOTA model it's just OK. You're better off using Gemini 2.5 or o3 if your wallet can afford it.
pigeon57434@reddit
That's kinda disappointing, and it's probably why the new R1, despite being smarter, is a lot worse at creative writing. OpenAI's models are definitely still better than Google's for creative writing.
outtokill7@reddit
Closer in what way?
Muted-Celebration-47@reddit
Similarity between models.
lgastako@reddit
What metric of similarity?
Guilherme370@reddit
A histogram of word n-grams that are over-represented (higher occurrence) compared to a human baseline of word n-grams.
Then it calculates a sort of "signature", bioinformatics-style, denoting the presence or absence of a given over-represented word; the similarity measure is a bioinformatics method that places all of these genetic-looking bitstrings in relation to each other.
The maker of the tool basically used language modelling with a natural human language dataset as a baseline, then connected that idea with bioinformatics.
Utoko@reddit (OP)
Repetitive words, bigrams, trigrams, vocabulary complexity.
ortegaalfredo@reddit
This graphic is great: it not only captures the similarity of the new DeepSeek, but also shows that GLM-4 was trained on Gemini, something that was previously discussed as very likely.
Jefferyvin@reddit
This is not an evolution tree or something; there is no need to organize the models into subcategories of subcategories of subcategories. Please stop.
Megneous@reddit
This is how a computer organizes things by degrees of similarity... OP didn't choose to organize it this way.
Jefferyvin@reddit
Honestly I'm just too lazy to argue, just read it for a laugh for however you wanna see it.
The title of the post is "Deepseek switched from OpenAI to Google". The post has used a **circularly** drawn dendrogram for no reason, on a benchmark based on a not-well-received paper that has [15 citations](https://www.semanticscholar.org/paper/EQ-Bench%3A-An-Emotional-Intelligence-Benchmark-for-Paech/6933570be05269a2ccf437fbcca860856ed93659#citing-papers). This seems intentionally misleading.
And!
In the grand scheme of things, it just doesn't matter; they are all transformer based. There will be a bit of architectural difference, but the improvements are quite small. They are trained on different datasets (for pretraining and SFT), and the people doing the RLHF are different. Of course the results are going to come out different.
Also
Do not use visualization to accomplish a task better done without it! This graph has lowered the information density and doesn't make it easier to understand or read for the reader (which is why I said please stop).
Jefferyvin@reddit
ok i dont think markdown format works on reddit, I dont post on reddit that often...
Maleficent_Age1577@reddit
Could you use that DeepSeek or Gemini to make a graph that has some kind of purpose, e.g. readability?
millertime3227790@reddit
https://youtube.com/watch?v=d0Db1bEP-r8
XInTheDark@reddit
fixed the diagram
on the left is the old R1, on the right is the new R1.
on the top (in red text) is v3.
_HandsomeJack_@reddit
Which one is the Omicron variant?
Junior_Ad315@reddit
This is one of those instances where a red box is necessary. This had me twisting my neck to parse the original.
thenwetakeberlin@reddit
Please, let me introduce you to the bulleted list. It can be indented as necessary.
topazsparrow@reddit
You trying to put all the chiropractors out of business with this forbidden knowledge?!
topazsparrow@reddit
Gemini used a ton of GPT for training though, I suspect: https://external-preview.redd.it/xLXBPlO170C0JFdOjHvaXP4EYzyKZB2NWZbfNJjJa7s.jpg?width=344&auto=webp&s=d72957e79b296c5af2796406b38730093cf3b739
sammoga123@reddit
How true is this? Sounds to me like the case of AI text detectors, at that level, so false.
Utoko@reddit (OP)
The similarity in certain word use is true, based on a sample size of 90 stories (~1000 words each) per model. What conclusions you draw is another story. It certainly doesn't prove anything.
sammoga123@reddit
So if I were to put in my own stories that I've written, that would in theory give me an approximation to one of the LLM models, just like real writings made by other humans; it just doesn't make sense.
Utoko@reddit (OP)
Yes, if you would use 90 of your own stories with 1000 words each.
That's about ~200,000 tokens of your writing, and if you somehow use certain phrases and words again and again in the same direction across the stories, you would find out that you write similarly to a certain model.
If you give the better AI text detectors 90 long stories and you don't try to trick them on purpose, they would have a very high certainty score over the whole set. And this test doesn't default to yes or no; each model gets matched against every other model in a matrix.
And LLMs don't try to trick humans with their output on purpose. They just put out what you ask for.
Length: 1000 words.
It would be very impressive for a human to achieve a close score to any model, knowing 40 different writing styles and writing about unrelated topics.
Front-Ad-2981@reddit
This is great and all, but could you make it readable? This graph is literally all over the place.
I'm not going to rotate my monitor or keep tilting my head to the side just to read this lol.
ExplanationEqual2539@reddit
Lol
Snoo_64233@reddit
Yess!!! More than likely. Number of tokens big G processed shot up.
lemon07r@reddit
That explains why the new R1 distill is SO much better at writing than the old distills or even the official qwen finetuned instruct model.
superman1113n@reddit
This is pretty cool, thanks for sharing
Jefferyvin@reddit
"Do not use visualization to accomplish a task better without it."
General_Cornelius@reddit
Oh god please tell me it doesn't shove code comments down our throats as well
isuckatpiano@reddit
```
// run main function
main()
```
Thank you for your assistance…
Fun_Cockroach9020@reddit
May be they used gemini to generate the dataset for training 😂
Kathane37@reddit
Found it on the bottom right. Could you try to highlight the model families more on your graph? Love your work anyway, super interesting.
Utoko@reddit (OP)
It is not my work. I just shared it from https://eqbench.com/ because I found it interesting too.
I posted another dendrogram with highlighting, which might be easier to read.