An idea: an LLM trapped in the past
Posted by Vehnum@reddit | LocalLLaMA | View on Reddit | 49 comments
Has anyone ever thought to make an LLM trained on data from before a certain year/time?
For example, an LLM trained on data only from 2010 or prior?
I thought it was an interesting concept but I don’t know if it has been thought of or done before.
s101c@reddit
I think an LLM up to 1950s is possible. We have millions of books, archived letters, newspapers, transcripts and so on.
Bonus part: the 1930s-1950s material will enter the public domain in a few decades, so the training data could be released with a very permissive licence.
You could train a 1920s LLM with public domain data right now and call it something like "Public Llama".
Expensive-Apricot-25@reddit
it won't even know it's an AI... lol
It'll be just like talking to someone from that time period!
RealSataan@reddit
Not just technological optimism. The model will be massively racist. With a lot of eugenics, race theory, sexism thrown in
s101c@reddit
The model will be a snapshot of the entire world (and its history) up to a certain point. Whatever the world represented up to that time, we will see in the model, for better or worse.
Also remember that it's expected to know the story from many sides. It will know both American and Irish views, and from a religious standpoint, the world from pagan and later eras, which really contradict each other. From what I understand, eugenics was only prominent for a relatively short period starting in the late 19th century, so it wouldn't poison most of the training data. Expect lots of religious overtones though.
The model will not have 4chan training data, so that might be a relief at least.
RealSataan@reddit
I'm assuming the training data will come from books produced with the printing press. Since the printing press was invented in Europe, most of the books will be Europe-centric. There will not be enough counterpoints to learn from.
One thing is certain: the model will surely learn it's OK to look down on others. Throughout history and everywhere across the world, that's one constant.
gnaarw@reddit
The printing press was invented in China so there's plenty there plus Korean and Japanese. Google is a little racist in this regard (though it's not at all unlikely Gutenberg invented this separately from the Chinese of course) and even tells me the first printed version of the art of war is from the 19th century instead of like 200 BC :D
Of course since you don't know what to look for and Google isn't really focused on Chinese materials, your stuff will be majority European centric...
It would be of course good to get some Arabic and Indian in there. They should have gotten the printing press about 100 years after Gutenberg so plenty of texts from there.
s101c@reddit
No one has trained such a model yet. For me personally it's an open-ended question whether or not it would be less empathetic than a modern model. Most likely this would end in the way you described. Another outcome (that could also come as a surprise) is that the model would simply not care about race more than a modern one does.
allegedrc4@reddit
That is an excessively pessimistic and awful way to view the past.
Xandrmoro@reddit
Isnt that the point? As in, represent the worldview of the time.
apetalous42@reddit
I'm down for this, I just need a TTS with that Mid-Atlantic accent.
Competitive_Ad_5515@reddit
Elevenlabs has a number, including Judy Garland and Sir Laurence Olivier
LegitimateCopy7@reddit
sounds like eternity in the AI sector.
s101c@reddit
Which is why I support training a model with 1929 as the cutoff date.
Everything before January 1, 1929 is in public domain now. The Wall Street Crash of 1929 happened in October, so the cutoff will be on a relatively high note, and the model will still feel modern.
Another option is training a model with a cutoff date of 1913, which is a long-standing wish of mine: to talk to the actual Old World which was destroyed in WW1 (and out of which our world emerged).
a_chatbot@reddit
Imagine talking 'current events' with a model trained in the 1920s and 30s. "What do you think of the KKK, Stalin, Hitler?" etc, lol.
ninjasaid13@reddit
I think the model will be quite bad, not just in knowledge; we also can't really make it conversational. Most of today's conversation is all over the internet, while in the past conversations weren't really written down. There's also a lot of roleplaying on the internet that you can't find in public domain data.
peppaz@reddit
If you asked it to explain how it was trained or what an LLM is it would explode lmao
EffectiveReady6483@reddit
I would love to have a medieval LLM with no knowledge of America...
s101c@reddit
Very hard to do in my opinion, because of the low amount of training data.
See here:
https://www.statista.com/graphic/1/1396121/europe-book-production-half-century-region-historical.jpg
The invention of the printing press happened just a bit before the discovery of the American continent(s).
The data that existed before that is minuscule compared to what came after.
Even the difference between the 18th and 16th centuries is around 4.6x. Obviously, many early printed books were re-issues of existing medieval and ancient texts, so the difference might not be so staggering, but a 'true' medieval model would still be very limited.
Jumper775-2@reddit
Well what you could do would be use modern synthetic data techniques to generate enough data to make a foundation model with 0 world knowledge, then train it on the low amount of data to give it world knowledge, and then instruct tune it and use GRPO to force it to make inferences and extrapolate as much as it can from that data. Still wouldn’t be as good as modern models by a mile, but I think it would produce something.
raiango@reddit
It probably breaks the ability to easily examine the results, but you could generate synthetic data from only the target eras.
kali_tragus@reddit
It could be interesting to see how well such a model predicts the "future". Or write science fiction, if you like.
zjuwyz@reddit
I'm really curious if the 1920s LLM could rediscover relativity theory
Rego117@reddit
Really like this idea; it would be fascinating to see how a model trained only on a certain decade's literature, newspapers, transcriptions etc. would differ in its output.
If done properly it would almost act as a time capsule. Would love to see someone try this if it hasn't been done already
a_beautiful_rhind@reddit
I lived that with gemini. It accused me of lying when showing it images from 2025.
Did you ever try a character from a different time period? Decent LLMs can keep from being anachronistic. Not necessarily the full experience but easier than convincing someone to train a 30b+ on old data.
NihilisticAssHat@reddit
I handed it the FBI press release for when someone shot Trump's ear. It said that article was clearly fake, and meant to manipulate.
satansprinter@reddit
You can download the entire Wikipedia database. Not sure if it's easy to pin it to a specific time/date, but that might make it relatively "easy" to train an LLM on a specific set of data.
Now that I think about it, maybe that is also a way to train an LLM on a specific year: if you can see what changed in a specific year, it gives a pretty good insight into what happened in that year. Don't know why you'd want that specifically in an LLM, but it's possible.
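To sketch the idea: Wikipedia's XML exports record a timestamp on every revision, so in principle you can walk a dump and keep only pages whose revision predates a cutoff. A minimal sketch, assuming the simplified MediaWiki-style element names shown in the inline sample (a real dump is many gigabytes, uses an XML namespace, and would call for streaming with `iterparse` instead of loading it whole):

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

CUTOFF = datetime(2010, 1, 1, tzinfo=timezone.utc)

def pages_before(xml_text, cutoff=CUTOFF):
    """Yield (title, text) for pages whose revision predates the cutoff."""
    root = ET.fromstring(xml_text)
    for page in root.iter("page"):
        rev = page.find("revision")
        # Timestamps look like 2009-06-01T00:00:00Z; swap Z for an offset
        # so datetime.fromisoformat can parse it.
        ts = datetime.fromisoformat(
            rev.findtext("timestamp").replace("Z", "+00:00"))
        if ts < cutoff:
            yield page.findtext("title"), rev.findtext("text")

# Tiny inline sample standing in for a real dump file.
SAMPLE = """<mediawiki>
  <page><title>Old</title>
    <revision><timestamp>2009-06-01T00:00:00Z</timestamp>
      <text>Pre-cutoff article text.</text></revision></page>
  <page><title>New</title>
    <revision><timestamp>2015-06-01T00:00:00Z</timestamp>
      <text>Post-cutoff article text.</text></revision></page>
</mediawiki>"""

kept = dict(pages_before(SAMPLE))
print(sorted(kept))  # only the pre-2010 page survives
```

Note this only filters by when a revision was written, not by what it discusses; a 2009 article can still describe 2009 from a modern vantage point, which is different from data authored inside the era.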
Comfortable-Rock-498@reddit
This is a brilliant idea OP. I would use it! Not just for amusement but I think it would be helpful in many ways. One off the top of my head is it'll challenge the tendency to look at the past with rose-colored-glasses when people can actually talk to 1950s or 1990s or whatever
wonderfulnonsense@reddit
It would be kinda funny to somehow train one on genealogical data. You tell the llm your name and town and it goes "oh, you're uncle ned's kid"
wntersnw@reddit
There was a post on here ages ago about something like that. Think it was some guy on twitter who was collecting old magazines and newspapers and stuff to train on. Not sure if it ever went anywhere.
Found it: https://reddit.com/r/LocalLLaMA/comments/197zjk5/someone_has_trained_their_own_ai_on_old_magazines/
indicava@reddit
You would have to somehow have it “forget” all knowledge following your proposed cutoff, doesn’t seem too feasible
crispyfrybits@reddit
The idea is that the data it's trained on would be from the past, so the LLM doesn't have to pretend to "forget" anything. As far as the LLM is aware, it only has data up to, say, 1970.
vibjelo@reddit
Well, I'm guessing the right approach here wouldn't be to try to remove anything from existing models, but train one from scratch on datasets that were created before the cutoff date, so it's not in there in the first place.
Vehnum@reddit (OP)
Yeah, I would think the only way possible would be training it from scratch and I certainly don’t have the time, resources or know-how to do it. This was just a random thought I had.
crispyfrybits@reddit
I feel like it could be hard to get enough data to do the training, but I love this idea. I think it will happen; it's just a matter of who ends up trying it first.
Monarc73@reddit
It's not only possible, but will soon be MANDATORY, given how much AI generated content is about to be out there.
phree_radical@reddit
https://github.com/prateekcaire/GPT2-VictorianStories?tab=readme-ov-file
custodiam99@reddit
LLM from ancient Roman and Greek texts in English? : r/LocalLLaMA
Echo9Zulu-@reddit
I was introducing my Dad to ChatGPT and to make it easier for him to understand prompting we used voice mode.
Eventually we started talking with it about time travel, and to illustrate how alignment worked I prompted something about whether, if GPT-4o were transported to say the year 2000, its alignment training would encourage the model to lie to/deceive humans in the past to preserve its knowledge of the future. GPT-4o said it would. Not unexpected, but pretty wild to hear the whole response framed as if it were to protect humans, as if we had stumbled onto a use case for censorship AND that GPT-4o understands the hidden nuance of time travel? Lol, my dad was blown away
dp3471@reddit
would love to see one on historic lingo only
maurosr777@reddit
I can imagine sociologists and psychologists would be delighted to talk to "someone" from the past. Of course, there's a long way to go before it could be used seriously in academic research.
valdecircarvalho@reddit
PROMPT
maikuthe1@reddit
With only a prompt it would be contaminated with biases and with how people from back then are portrayed in modern media. OP is talking about training a model only with data from the time period.
ratbastid2000@reddit
I expect that in the not-too-distant future, old models will be sought after for the information they contain that hasn't been subjected to cleansing (whether for safety, plain "memory loss", or the explicit form of the internet being subjected to political influence to re-frame, spin, and whitewash critical dissent). With this in mind, preserving "raw" information grounded in time will function as the next generation of the Wayback Machine / archive.org for context. Even the hallucinations that people mention will be valuable in a certain context. Forbidden knowledge embedded in an LLM that may have been purged from the future internet...
ideally we also create an immutable database using decentralized networks , ledgers, data stores to also combat this knowledge distortion and purging.
prototypist@reddit
TimeLMs were a series of models trained on data from specific quarters of 2020 and 2021, so there were some interesting results comparing their perplexity scores on headlines, social media, etc. even after a few months. Beyond news events and facts, language changes surprisingly quickly. The example I used to give was GPT-2 and BERT having no concept of "social distancing"; maybe now I should use the association of "brat" with summer and the color green since 2024.
There might be enough 2010 content still online to train a GenAlpha LLM, but the amount of digital information grows exponentially; there is significantly more recent digital information than if you digitized all of the English-language books and newspapers we still have from 1960.
jacek2023@reddit
that would be a great idea for retro-future (think: Fallout)
ortegaalfredo@reddit
It's easy to do with a pre-prompt.
I did a Jesus simulation (don't laugh, it worked perfectly): I instructed him to act like a man in the year 30, and he did it quite well.
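The pre-prompt approach is just a system message that pins the persona to an era. A minimal sketch, using the generic chat-messages format common to most LLM APIs; the `era_persona` helper and its wording are invented for illustration:

```python
# Illustrative: lock a chat persona to a historical era via a system prompt.
def era_persona(year, role_desc):
    """Build a chat message list pinning the model to a historical era."""
    system = (
        f"You are {role_desc} living in the year {year}. "
        f"You have no knowledge of any person, event, technology, or word "
        f"that appeared after the year {year}. If asked about later things, "
        f"react with the confusion a person of that time would show."
    )
    return [{"role": "system", "content": system}]

messages = era_persona(30, "a man in Roman Judea")
messages.append({"role": "user", "content": "What do you think of electricity?"})
```

As the replies below note, this only roleplays the era: the model's weights still contain everything after the cutoff, and it plays "the year 30" as the modern internet imagines it, which is exactly the contamination a from-scratch pre-cutoff model would avoid.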
Vehnum@reddit (OP)
I’ve done something similar but that’s not the point.
Who’s to say that the definition of “a man from the year 30” hasn’t changed in the past 15 years within the collective consciousness of the Internet?
robertpiosik@reddit
Would they come up with new knowledge? They certainly could, but these would just be hallucinations.