Is this the largest "No synthetic data" open weight LLM? (142B)
Posted by AaronFeng47@reddit | LocalLLaMA | 40 comments

From the GitHub page of https://huggingface.co/rednote-hilab/dots.llm1.base
GortKlaatu_@reddit
But where did they get their tokens and how did they verify there was no synthetic data?
It's one thing to not generate your own synthetic data, but another to claim there's no synthetic data in your dataset.
It's also been shown that synthetic data can improve training, so I'm curious how they perform on other benchmarks.
Longjumping-Solid563@reddit
Ah yes, no synthetic data to prevent contamination in pre-training, just to use a teacher model in post-training. Makes sense lol.
But fr I would say synthetic data improves training simply because quality data is limited, especially at scale:
High-quality non-synthetic data >>> high-quality synthetic data >>> 99.9% of the non-synthetic data out there.
Bakoro@reddit
It's not enough to say "synthetic" vs "not synthetic".
Some subjects are going to be much easier to generate high-quality synthetic data for; for others it will be nearly impossible.
For formal logic, math, and engineering, it is now fairly easy to ad-lib thousands of variations on thousands of problems, and to compose increasingly difficult problem sets. You can have effectively infinite training data, because you want the model to fit tightly to certain functions and processes, and testing whether generalization has been achieved is feasible (see the sketch below this comment).
Compare that to psychology, where generating synthetic data is almost certainly only going to reinforce biases and crystallize concepts in a very fluid field that sees semi-regular updates and where the language and best practices are frequently changing.
Synthetic data is king where there is an objectively correct, provable, quantifiable answer. That's where you can get self-play, take humans completely out of the loop, and reach superhuman abilities like AlphaGo achieved.
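A minimal sketch of the templated-variation idea above; the problem family, ranges, and field names here are illustrative, not taken from any real pipeline:

```python
import random

def make_linear_problem(rng: random.Random) -> dict:
    """One templated 'solve ax + b = c' problem; the answer is known by construction."""
    a = rng.randint(2, 20)
    x = rng.randint(-50, 50)          # the intended solution
    b = rng.randint(-100, 100)
    c = a * x + b
    sign = "+" if b >= 0 else "-"
    return {
        "prompt": f"Solve for x: {a}x {sign} {abs(b)} = {c}",
        "answer": x,                  # lets any model output be checked automatically
    }

rng = random.Random(0)
dataset = [make_linear_problem(rng) for _ in range(10_000)]
print(dataset[0])
```

Because the answer is constructed rather than labeled, generated solutions can be verified programmatically, which is what makes self-play-style loops possible for this kind of domain.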
IrisColt@reddit
Exactly!
AnOnlineHandle@reddit
I often wonder if people working in the field are trying any conditioning hacks for this kind of problem.
e.g. in image diffusion models, I train with phrases for image quality, art or photo style, etc., and can then get a concept from style A generated in style B. If a 'quality' conditioning signal were used to mark things known to be accurate and true, could the model learn to use that signal to bring forth higher-quality responses, perhaps pulling on complex signals in the data which we can't see, since that's what ML is all about?
And could you perhaps train an 'inventor' mode on new discoveries made after the model's training cutoff (an embedding or something, with a problem/solution prompt format), things which perhaps follow logically from what is already known, and then use that mode with other scientific questions to try to bring forth plausible answers that are already latent in the existing data but that we just don't recognize yet? Even if it finds a few promising, plausible answers to outstanding problems (countless diseases, etc.), it might make it all worth it.
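A toy sketch of that quality-conditioning idea, assuming some per-document verified/unverified label already exists; the tag strings below are invented for illustration:

```python
# Prepend a control token to each training document based on whether its content
# was verified, then lead with the "verified" tag at inference time to steer the model.
QUALITY_TAGS = {True: "<|verified|>", False: "<|unverified|>"}

def tag_document(text: str, is_verified: bool) -> str:
    """Prefix a document with its quality tag before tokenization."""
    return f"{QUALITY_TAGS[is_verified]} {text}"

# Training-time: every example carries its label as part of the text.
train_example = tag_document("Water boils at 100 °C at sea level.", is_verified=True)

# Inference-time: start the prompt with the high-quality tag to pull on that signal.
prompt = f"{QUALITY_TAGS[True]} Explain why the sky appears blue."
print(train_example)
print(prompt)
```

This is the same trick as quality/style tags in diffusion training: the tag only helps if the underlying label correlates with something the model can actually learn to separate.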
Bakoro@reddit
Some stuff is just better done via a domain specific model rather than a language model. There are math models which have discovered new math, chemistry models which have discovered new chemicals, material science models which have developed new materials, chip design models which have designed more efficient chips...
Yes people are trying to push LLMs as far as they can go, but for some stuff, hyper specialization is just plain better.
finah1995@reddit
Lol yeah, there are even chemistry models that won their makers the Nobel Prize in Chemistry.
RegisteredJustToSay@reddit
That's true, but there are definitely classes of synthetic data generation that benefit even psychology. For example, using machine translation can boost performance for both less and more popular languages (see the sketch below). There are quality books and papers out there that have never been translated, and translations of, e.g., East Asian sources on the topic, even if imperfect, would still help an English speaker get a better answer if their query was along the lines of 'differences in views on clinical insanity between Western and Eastern cultures'.
Your overall point still stands; I'm just highlighting that synthetic data doesn't have to be totally made up, it can also be augmented or transformed truthful data.
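A rough sketch of the translation-as-augmentation idea, assuming the Hugging Face `transformers` pipeline with one public OPUS-MT checkpoint as an example; the source corpus and output fields are hypothetical:

```python
from transformers import pipeline

# Any translation model you trust could stand in here; this is just a public zh->en example.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-zh-en")

source_passages = ["这是一段尚未被翻译成英文的示例文本。"]  # untranslated source material

# Keep provenance so machine-translated text can be down-weighted or filtered later.
augmented = [
    {"text": out["translation_text"], "origin": "machine_translated"}
    for out in translator(source_passages)
]
print(augmented[0])
```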
DepthHour1669@reddit
Yeah give me synthetic data over reddit comments
Durian881@reddit
They probably got a lot of "organic" data from the Rednote users.
Expensive-Apricot-25@reddit
This is not true.
fullouterjoin@reddit
Phi Models Disagree https://arxiv.org/abs/2404.14219
Please don't make claims w/o a citation.
toothpastespiders@reddit
Do you agree with the paper's premise that Phi-3, at 3.8B, is better than Mixtral?
Due-Memory-6957@reddit
It's a weird cope people have that AI-generated content can't be used to train AI or the model gets worse. What's surprising is seeing it here, when people have been doing exactly that for years (sometimes the same people who fine-tune and share their models here), with positive results.
numsu@reddit
Data dated earlier than Nov 2022? 😄
NorthSideScrambler@reddit
The LLM was trained on a Brazzers library dump. The finest in human culture and soul.
TheRealMasonMac@reddit
Is it necessarily true that synthetic data improves performance? I would think any inferior performance from human-only data comes down to poor data quality.
Echo9Zulu-@reddit
The Phi technical reports discuss rigorous experimentation with synthetic data.
-dysangel-@reddit
Also, why is "no synthetic data" seen as a good thing? There have been papers where synthetic data was just as effective as, or more effective than, human data. Just try any Qwen3 variant against similarly sized models and you'll see how effective refining a model on synthetic data can be.
ortegaalfredo@reddit
Good, I only use free-range, non-synthetic-data LLMs.
PlayfulCookie2693@reddit
All this synthetic text nowadays I heard is not only bad for the poor LLMs but also you I heard. Here is a source I found about how reading synthetic fed LLMs is bad for you. Because reading their outputs will actually like rewire your brain or something like that.
iamMess@reddit
I think Llama 3 was trained on 15T tokens and Qwen on 30T for pre-training.
thereisonlythedance@reddit
Wasn’t a lot of that synthetic?
stuffitystuff@reddit
Much of it was stolen books, at least
Due-Memory-6957@reddit
Based, I wish I could steal as many, maybe one day
stuffitystuff@reddit
Clearly a lot of Facebook employees with nothing better to do than downvote me. Well, I hated their stupid recreation of the banana stand from Arrested Development in their offices in 2009 and still hate it today!
iamMess@reddit
Not for pre-training. Only fine-tuning and RL.
Soft-Ad4690@reddit
Source?
DoggoChann@reddit
It's literally impossible to back up that claim unless all the data used is from before the invention of LLMs.
BumblebeeOk3281@reddit
Please, we need Unsloth dynamic quant GGUFs, please :-)
yoracale@reddit
We'll see what we can do! 🥰
SithLordRising@reddit
Good results so far. Fun to use
FullOf_Bad_Ideas@reddit
I don't think so. There's a reasonable chance that DeepSeek V2 and MiniMax Text 01 were trained without synthetic data, about as big as the chance that this model wasn't inadvertently trained on synthetic data.
The internet is full of AI-generated data nowadays, and they might not see it as synthetic because they didn't synthesize it themselves, but it will show up in the model in much the same way.
fizzy1242@reddit
Interesting, hope we can get some quants for this soon.
DepthHour1669@reddit
Probably not, there need to be PRs for llama.cpp and vLLM first.
fizzy1242@reddit
True, but over time it seems llama.cpp at least has gained support for different architectures.
Hanthunius@reddit
VERY promising model. Waiting anxiously for GGUF or MLX quants!!
noage@reddit
Mistral Small 24B from Jan 2025 says the same thing, and this one is bigger.
https://mistral.ai/news/mistral-small-3
ParaboloidalCrest@reddit
Interesting. Is there a ranking of models by training token count out there?