Is this the largest "No synthetic data" open weight LLM? (142B)
Posted by AaronFeng47@reddit | LocalLLaMA | 40 comments

From the GitHub page of https://huggingface.co/rednote-hilab/dots.llm1.base
GortKlaatu_@reddit
But where did they get their tokens and how did they verify there was no synthetic data?
It's one thing to not generate your own synthetic data, but another to claim there's no synthetic data in your dataset.
It's also been shown that synthetic data can improve training, so I'm curious how they perform on other benchmarks.
Longjumping-Solid563@reddit
Ah yes, no synthetic data to prevent contamination in pre-training, just to use a teacher model in post-training. Makes sense lol.
But fr I would say synthetic data improves training simply because quality data is limited, especially at scale:
High-quality non-synthetic data >>> high-quality synthetic data >>> 99.9% of the non-synthetic data out there.
Bakoro@reddit
It's not enough to say "synthetic" vs "not synthetic".
Some subjects are going to be much easier to generate high-quality synthetic data for; for others it will be nearly impossible.
For formal logic, math, and engineering, it is now fairly easy to ad-lib thousands of variations on thousands of problems, and to compose increasingly difficult problem sets. You can have effectively infinite training data, because you want the model to fit tightly to certain functions and processes, and testing whether generalization has been achieved is feasible (see the sketch below this comment).
Compare that to psychology, where generating synthetic data is almost certainly only going to reinforce biases and crystallize concepts in a very fluid field that sees semi-regular updates and where the language and best practices are frequently changing.
Synthetic data is king where there is an objectively correct, provable, quantifiable answer. That's where you can get self-play, take humans completely out of the loop, and reach superhuman abilities like AlphaGo achieved.
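A minimal sketch of the templated-variation idea above; the problem family, ranges, and field names here are illustrative, not taken from any real pipeline:

```python
import random

def make_linear_problem(rng: random.Random) -> dict:
    """One templated 'solve ax + b = c' problem; the answer is known by construction."""
    a = rng.randint(2, 20)
    x = rng.randint(-50, 50)          # the intended solution
    b = rng.randint(-100, 100)
    c = a * x + b
    sign = "+" if b >= 0 else "-"
    return {
        "prompt": f"Solve for x: {a}x {sign} {abs(b)} = {c}",
        "answer": x,                  # lets any model output be checked automatically
    }

rng = random.Random(0)
dataset = [make_linear_problem(rng) for _ in range(10_000)]
print(dataset[0])
```

Because the answer is constructed rather than labeled, generated solutions can be verified programmatically, which is what makes self-play-style loops possible for this kind of domain.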
IrisColt@reddit
Exactly!
AnOnlineHandle@reddit
I often wonder if people working in the field are trying any conditioning hacks for this kind of problem.
e.g. in image diffusion models, I train with phrases for image quality, art or photo style, etc., and can then get a concept from style A generated in style B. If a 'quality' conditioning signal were used to mark things known to be accurate and true, could the model learn to use that signal to bring forth higher-quality responses, perhaps pulling on complex signals in the data which we can't see, since that's what ML is all about?
And could you perhaps train an 'inventor' mode on new discoveries made after the model's training cutoff (an embedding or something, with a problem/solution prompt format), things which perhaps follow logically from what is already known, and then use that mode with other scientific questions to try to bring forth plausible answers that are already latent in the existing data but that we just don't recognize yet? Even if it finds a few promising, plausible answers to outstanding problems (countless diseases, etc.), it might make it all worth it.
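A toy sketch of that quality-conditioning idea, assuming some per-document verified/unverified label already exists; the tag strings below are invented for illustration:

```python
# Prepend a control token to each training document based on whether its content
# was verified, then lead with the "verified" tag at inference time to steer the model.
QUALITY_TAGS = {True: "<|verified|>", False: "<|unverified|>"}

def tag_document(text: str, is_verified: bool) -> str:
    """Prefix a document with its quality tag before tokenization."""
    return f"{QUALITY_TAGS[is_verified]} {text}"

# Training-time: every example carries its label as part of the text.
train_example = tag_document("Water boils at 100 °C at sea level.", is_verified=True)

# Inference-time: start the prompt with the high-quality tag to pull on that signal.
prompt = f"{QUALITY_TAGS[True]} Explain why the sky appears blue."
print(train_example)
print(prompt)
```

This is the same trick as quality/style tags in diffusion training: the tag only helps if the underlying label correlates with something the model can actually learn to separate.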
Bakoro@reddit
Some stuff is just better done via a domain specific model rather than a language model. There are math models which have discovered new math, chemistry models which have discovered new chemicals, material science models which have developed new materials, chip design models which have designed more efficient chips...
Yes people are trying to push LLMs as far as they can go, but for some stuff, hyper specialization is just plain better.
finah1995@reddit
Lol yeah, there are even chemistry models that won their makers the Nobel Prize in Chemistry.
RegisteredJustToSay@reddit
That's true, but there are definitely classes of synthetic data generation that benefit even psychology. For example, using machine translation can boost performance for both less and more popular languages (see the sketch below). There are quality books and papers out there that have never been translated, and translations of, e.g., East Asian sources on the topic, even if imperfect, would still help an English speaker get a better answer if their query was along the lines of 'differences in views on clinical insanity between Western and Eastern cultures'.
Your overall point still stands; I'm just highlighting that synthetic data doesn't have to be totally made up, it can also be augmented or transformed truthful data.
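A rough sketch of the translation-as-augmentation idea, assuming the Hugging Face `transformers` pipeline with one public OPUS-MT checkpoint as an example; the source corpus and output fields are hypothetical:

```python
from transformers import pipeline

# Any translation model you trust could stand in here; this is just a public zh->en example.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-zh-en")

source_passages = ["这是一段尚未被翻译成英文的示例文本。"]  # untranslated source material

# Keep provenance so machine-translated text can be down-weighted or filtered later.
augmented = [
    {"text": out["translation_text"], "origin": "machine_translated"}
    for out in translator(source_passages)
]
print(augmented[0])
```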
DepthHour1669@reddit
Yeah give me synthetic data over reddit comments
Durian881@reddit
They probably got a lot of "organic" data from the Rednote users.
Expensive-Apricot-25@reddit
This is not true.
fullouterjoin@reddit
Phi Models Disagree https://arxiv.org/abs/2404.14219
Please don't make claims w/o a citation.
toothpastespiders@reddit
Do you agree with the paper's premise that Phi-3, at 3.8B, is better than Mixtral?
Due-Memory-6957@reddit
It's a weird cope people have that AI-generated content can't be used to train AI or the model gets worse. What's surprising is seeing it here, when people have been doing exactly that for years (sometimes the same people who fine-tune and share their models here), with positive results.
numsu@reddit
Data dated earlier than Nov 2022? 😄
NorthSideScrambler@reddit
The LLM was trained on a Brazzers library dump. The finest in human culture and soul.
TheRealMasonMac@reddit
Is it necessarily true that synthetic data improves performance? I would think any inferior performance from human-only data comes down to poor data quality.
Echo9Zulu-@reddit
The Phi technical reports discuss rigorous experimentation with synthetic data.
-dysangel-@reddit
Also, why is "no synthetic data" seen as a good thing? There have been papers where synthetic data was just as effective as, or more effective than, human data. Just try any Qwen3 variant against similarly sized models and you'll see how effective refining a model on synthetic data can be.
ortegaalfredo@reddit
Good, I only use free-range, non-synthetic-data LLMs.
PlayfulCookie2693@reddit
All this synthetic text nowadays I heard is not only bad for the poor LLMs but also you I heard. Here is a source I found about how reading synthetic fed LLMs is bad for you. Because reading their outputs will actually like rewire your brain or something like that.
iamMess@reddit
I think Llama 3 was trained on 15T tokens and Qwen on 30T for pre-training.
thereisonlythedance@reddit
Wasn’t a lot of that synthetic?
stuffitystuff@reddit
Much of it was stolen books, at least
Due-Memory-6957@reddit
Based, I wish I could steal as many, maybe one day
stuffitystuff@reddit
Clearly a lot of Facebook employees with nothing better to do than downvote me. Well, I hated their stupid recreation of the banana stand from Arrested Development in their offices in 2009 and still hate it today!
iamMess@reddit
Not for pre-training. Only fine-tuning and RL.
Soft-Ad4690@reddit
Source?
DoggoChann@reddit
It's literally impossible to back up that claim unless all the data used is from before the invention of LLMs.
BumblebeeOk3281@reddit
Please, we need Unsloth dynamic quant GGUFs, please :-)
yoracale@reddit
We'll see what we can do! 🥰
SithLordRising@reddit
Good results so far. Fun to use
FullOf_Bad_Ideas@reddit
I don't think so. There's a reasonable chance that DeepSeek V2 and MiniMax Text 01 were trained without synthetic data, about as big as the chance that this model wasn't inadvertently trained on synthetic data.
The internet is full of AI-generated data nowadays, and they might not see it as synthetic because they didn't synthesize it themselves, but it will show up in the model in much the same way.
fizzy1242@reddit
Interesting, hope we can get some quants for this soon.
DepthHour1669@reddit
Probably not, there need to be PRs for llama.cpp and vLLM first.
fizzy1242@reddit
True, but over time it seems llama.cpp at least has gained support for different architectures.
Hanthunius@reddit
VERY promising model. Waiting anxiously for GGUF or MLX quants!!
noage@reddit
Mistral Small 24B from Jan 2025 says the same thing, and this one is bigger.
https://mistral.ai/news/mistral-small-3
ParaboloidalCrest@reddit
Interesting. Is there a ranking of models by training token count out there?