Training an LLM only on books from the 1800's - no modern bias
Posted by Remarkable-Trick-177@reddit | LocalLLaMA | View on Reddit | 182 comments
Hi, I'm working on something that I haven't seen anyone else do before: I trained nanoGPT on only books from a specific time period and region of the world. I chose 1800-1850 London. My dataset was only 187MB (around 50 books). Right now the trained model produces random, incoherent sentences, but they do kind of feel like 1800s-style sentences. My end goal is to create an LLM that doesn't pretend to be historical but just is, which is why I didn't go the fine-tune route. It will have no modern bias and will only be able to reason within the time period it's trained on. It's super random and has no utility, but I think if I train on a big dataset (like 600 books) the result will be super sick.
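(For anyone curious about the data side, a nanoGPT run like this is essentially "concatenate the books, tokenize, dump to disk". A minimal sketch, assuming the GPT-2 BPE tokenizer and a 90/10 split rather than OP's exact setup; the folder and file names are placeholders:)

```python
# Minimal sketch of nanoGPT-style data prep for a folder of period .txt files.
# Assumes GPT-2 BPE via tiktoken and the train.bin/val.bin layout that
# nanoGPT's data loader expects; paths and the split ratio are placeholders.
import os
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")

# Concatenate every book in the corpus folder into one long string.
texts = []
for name in sorted(os.listdir("books_1800_1850")):
    if name.endswith(".txt"):
        with open(os.path.join("books_1800_1850", name), encoding="utf-8") as f:
            texts.append(f.read())
data = "\n\n".join(texts)

# Tokenize, split 90/10, and write the binary token files.
ids = enc.encode_ordinary(data)
split = int(0.9 * len(ids))
np.array(ids[:split], dtype=np.uint16).tofile("train.bin")
np.array(ids[split:], dtype=np.uint16).tofile("val.bin")
print(f"train: {split:,} tokens, val: {len(ids) - split:,} tokens")
```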
mrshadow773@reddit
Hi - a buddy and I have recently OCR'd some old books; many are from the 1800s. You might find this useful: survivor library books
AllanSundry2020@reddit
thanks this is helpful for me too. I have been relying on gutenberg and recent epubs of anthologies of older works
mrshadow773@reddit
You’re welcome 🤗
One thing you might find useful (we have yet to explore it, but mean to at some point): the same books were OCR'd by more than one model, page by page (split on \f to get pages). The "OCR failures"/poor-quality outputs seem to land in different places for each model, so it should be possible to detect failures per page and patch in text from the other model's output to produce something that is overall better/cleaner than either.
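A rough sketch of that page-level voting idea (assuming both outputs are plain text with pages separated by \f; the alphabetic-character ratio here is just a stand-in for a real quality score):

```python
# Sketch of page-level "ensemble OCR": for each page, keep whichever model's
# output looks less broken. Assumes the two outputs have aligned page counts
# and uses a crude alphabetic-ratio heuristic as a placeholder quality score.
def quality(page: str) -> float:
    if not page.strip():
        return 0.0
    good = sum(c.isalpha() or c.isspace() for c in page)
    return good / len(page)

def merge_ocr(output_a: str, output_b: str) -> str:
    pages_a = output_a.split("\f")
    pages_b = output_b.split("\f")
    merged = [a if quality(a) >= quality(b) else b
              for a, b in zip(pages_a, pages_b)]
    return "\f".join(merged)
```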
AllanSundry2020@reddit
ensemble OCR, i like it!
jasminUwU6@reddit
I love how I can find some absolute gems in Reddit comments.
Thank you for your work
mrshadow773@reddit
Thanks for the kind words!
This mostly came out of a B200 rental deal we saw. We plan to write up a blog post sometime soon about our experiences. Interestingly, the PDF processing itself starts to become a major bottleneck because even a single B200 is so fast.
westsunset@reddit
"'an LLM is only as good as the dataset it was trained on' - Sun Tzu" lmao
mrshadow773@reddit
I’ve been waiting a long time for someone to find this funny, cheers
AllanSundry2020@reddit
Do you have any recommendations for OCR libraries?
Red_Redditor_Reddit@reddit
The problem isn't the LLM or the sourcing. The problem is the looser people. There's a certain part of western society ("karens") that thinks they can get a higher social standing by jumping on something that's not politically correct. The LLM producers are kinda forced to censor the models because of these people.
It's so ridiculous that I feel like wearing a marlboro hat and a t-shirt with the battle flag and a text that says "I hate the moon people and women named karen."
opi098514@reddit
What in the world are you talking about?
Red_Redditor_Reddit@reddit
No offense, but do you live under a rock? These companies can't just produce something that says anything. That's the problem, and the OP is going to have the same problem just without the PR side of it.
Decaf_GT@reddit
This is an academic exercise in what's possible, not a fucking business plan.
OP came up with an interesting idea and he's just sharing his progress. He's not trying to go commercial with some kind of product, so he's not going to have a "problem" here.
This is the kind of thing that hobbyists and enthusiasts do on forums like these.
These comments don't make you sound smart (or at least as smart as you think you sound), because you're responding to this post as though OP asked "how much do you think people would pay and what do you think of my business model?"
No one here is talking about that.
Red_Redditor_Reddit@reddit
The OP isn't doing something special as far as training goes. The only thing that's fundamentally unique is that larger companies are unwilling to do it, not incapable of it.
Please stop reading into what I'm writing. It makes you look dumb.
Decaf_GT@reddit
Why the everloving fuck does it matter what "companies" are doing?
What are you failing to understand about this?
he. is. not. making. a. product.
Jesus fucking Christ. He's doing something fun for academic curiosity, and it has gathered enough interest that people are having a discussion about it. Obviously he's not the first person ever to consider training a model from scratch with a specific set of data, but no one here cares that he's not the first.
You're like that kid in the corner of the party meme personified.
The only person reading into anything is you; in a thread with hundreds of upvotes and 70+ comments, only one person here is acting like OP is launching a business on this and is looking for business advice.
Red_Redditor_Reddit@reddit
Bro. Why do you keep insisting that I'm saying that? Can you not read??
Decaf_GT@reddit
Red_Redditor_Reddit@reddit
Because everloving companies produce stuff, and what they produce is influenced by the environment they are in.
Decaf_GT@reddit
Why does it matter to this fucking thread?
My god man. What is wrong with you? Where in this thread is it at all relevant what companies are doing with their own models? How is that remotely related to someone who is doing something out of academic interest and for fun?
Can you not understand why you come off as providing uninvited business advice?
Red_Redditor_Reddit@reddit
Bro what is wrong with you?? Do you just have a hard-on for non stop criticism? I'm not even talking about the model. All I said was that the current society gets bent out of shape when something says something that isn't politically correct.
All I made was one comment and this is like your fifth.
robonxt@reddit
Isn't the OP training the model on the books, unless I'm understanding it wrong?
Red_Redditor_Reddit@reddit
There's nothing wrong with that. The source material isn't the issue I'm talking about. What I was saying was that the social climate is the real barrier to models that don't reflect contemporary bias. They say the wrong things and people get bent out of shape.
FpRhGf@reddit
They mean it has nothing to do with the post. What does wanting to see an LLM that's authentic to the 19th century have to do with modern political correctness?
Sombrero hats have nothing to do with the pollution of inaccurate portrayals of the Victorian period in modern fiction.
Red_Redditor_Reddit@reddit
The whole point is an attempt at circumventing modern bias, as written in the op.
opi098514@reddit
That has nothing to do with anything.
Red_Redditor_Reddit@reddit
Apparently you do.
Remarkable-Trick-177@reddit (OP)
I didn't expect this post to get this much attention. Thanks to everyone who's checking out my project and giving advice/critiques, I really appreciate it. I'm going to start working towards training another model, this time with much more data. I will post updates as I go, either in here or on GitHub. Thanks again everyone, really appreciate all the comments.
mtomas7@reddit
Perhaps you could create a Kickstarter or similar campaign to finance training? Perhaps some inference providers would donate time to this project?
TheManicProgrammer@reddit
Add in science/nature journals and newspapers of the time and you'll be all set :D
blurredphotos@reddit
This is a fantastic idea. Can't wait to see.
DepthHour1669@reddit
It’s a terrible idea, because modern humans don’t really understand the cultural background of people in the 1800s unless you study history.
This was the era where the educated still heavily focused on a liberal arts education! That means the context of much of the writing was not in english, but rather latin and greek. You would also want several copies of the bible in there.
The lower layers of the model would be trained on english data, but the features that should be located in higher layers of the models aren’t actually in an english training dataset.
bsenftner@reddit
There is also the manner in which LLMs work: no one today knows how to communicate conversationally in 1800-1850 language, which is not quite our language. We have a huge number of "modern" words that did not exist in 1850, and using them would confuse such an LLM quite a bit, pulling it out of whatever context one hopes to maintain for answering questions about that era.
ChristopherRoberto@reddit
If we can so easily communicate cross-language today with the help of translation, I don't see why it would be so impossible to talk to something speaking easily understood English from 200 years ago.
bsenftner@reddit
Well, people could "talk" with these past-trained LLMs, but a good number of understandings and customs of that time would be taken out of context, applied to our values, and that series of LLMs would get declared "harmful", requiring censorship before public exposure.
Then there are the subtleties of language that are being misunderstood by LLMs currently.
For example, every topic you can imagine is in the training data multiple times, but with different treatments that vary from formal writing to attempts at humor about that subject. LLMs do not know which is which, and they use the style of the words in the user's prompt to select the most similar words and phrasing style in the training data from which to generate an answer. That subtle aspect is not widely understood, and it's the reason many people get poor-quality answers from LLMs: they are not specific in their use of formal terms, so they get replies drawn from the less formal training data.
For people to converse with an LLM trained on period literature, one would need a foundation-level LLM to handle translation of the user's prompt, and then the response would need to be translated back; the reference perspective would probably need to be specified to the translating LLM, too. A foundation-level LLM would be needed because it's translating a time period's cultural context, something LLMs are not ordinarily trained to do. They are trained to do language translation, but time-period translation is not ordinarily in training data. This might require a special fine-tune of a foundation model for use as the translating LLM. It's all possible, but most public users will not really understand why the translator LLM is needed, and a good amount of the nuance that is the essence of that time-period-trained LLM would be muddled if not lost.
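In code, the wrapper being described might look something like this (a sketch only; both callables are stand-ins for whatever modern translator model and period model you actually use):

```python
from typing import Callable

def period_chat(user_prompt: str,
                translator_llm: Callable[[str], str],
                period_model: Callable[[str], str]) -> str:
    """Sketch of the two-model wrapper described above. A modern 'translator'
    LLM reframes the question for the period model, then renders the period
    model's reply back for a modern reader. Both callables are placeholders."""
    period_prompt = translator_llm(
        "Rewrite this question as a Londoner of the 1840s would phrase it, "
        "using only period vocabulary and concepts:\n" + user_prompt)
    period_answer = period_model(period_prompt)
    return translator_llm(
        "Explain this 1840s reply for a modern reader, preserving its "
        "original meaning and period perspective:\n" + period_answer)
```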
The ease of offending either side in these 'conversations' will be high.
On the other hand, if we're talking these as special purpose use, for formal historical study, and the users are specialists, such as graduate students studying that time period, that's a different story.
Shiroo_@reddit
It's still a good idea. I don't see why you have to say it this way and be negative about it instead of actually providing good advice to make this guy's project a success.
You made some good points, honestly, so hopefully OP will think about them.
clduab11@reddit
Maybe I’M the one that’s just overreacting or something, but why does everyone seem SO bent out of shape about the way someone says something on Reddit?
These are words on a screen. No one gets the luxury of ANY sort of tone, or nuance, or emotive product. Who cares if this person thinks it’s a terrible idea? Certainly not OP, they’re gonna do it anyway.
“I don’t see why it has to be said…” “Why can’t you phrase it like…” “What’s wrong with saying…”
It’s like every communication needs a metric shitpod of asterisks because people try to extrapolate SO much about someone or something’s words on a screen. Like some people are just fucking blunt and others need to just accept it and either push on/buzzoff.
Not to pick on this comment or you in general, u/Shiroo_ , I happen to echo your sentiment entirely hence the chosen response…but sometimes, I’m gonna say something’s shit when something’s shit and if someone wants to pearl clutch over how I say something is shit? Well then, there ain’t shit else I can do for you.
bobisme@reddit
I think in this case it's because the poster is being a dick about it ("it's a terrible idea") and is also wrong. If you look at it, it's a toy project built on nanoGPT. It's an experiment. If it works, cool. If it doesn't, cool. Doesn't make it a terrible idea.
It's like if someone made a post about building an RC car with cheap parts to learn about the process and someone responds with "that's a terrible idea... The problem is most people don't understand physics... That will never set a land speed record."
clduab11@reddit
Thanks for this nuanced explanation! Yeah, I definitely understood the incorrect part and kinda just hand-waved it off because they were obviously mistaken… but I guess from the amount of trawling I've done over the months, LocalLLaMA has evolved into this collective of super advanced machine learning engineers, absolute newbies, and… for lack of better words, some people in between (and I consider myself an in-between'er)?
So usually, when I see people BOTH be dicks AND wrong, it’s easy to dismiss them as old curmudgeons or obviously they have zero clue what they’re talking about. But there’s a lot of “in-betweeners” I’m seeing that pick the absolute strangest hills to die on, and I think I’m conflating what I see go on with that versus people pearl clutching at every cockbag they see.
Your explanation helps kinda attenuate that signal for me, so I appreciate you chiming in! Because yeah, I absolutely agree: if someone came up to me with cheap-ass parts to build an RC car and I'm in the mood? Bet we gonna figure out how to get that RC car rollin' TODAY instead of putting on some fake lab coat and being all snotty about it.
Shiroo_@reddit
Yeah, it's just that I don't want to see potentially good projects being called shit, discouraging the person working on them. At the end of the day, even if it amounts to nothing, you're still learning how everything works, which helps you judge whether an idea is good or bad, feasible or not, so there's really no point in being negative about it. And what's really annoying in this particular case is that the guy was giving good advice but for some reason was being really negative about it instead of making someone grateful; it just ends up with most people unable to listen to advice given like that. Anyway, there's really no point in being negative to someone trying to have fun, that's obvious.
clduab11@reddit
Couldn't agree more! I certainly wouldn't want anyone judging my generative AI work based on what I initially started with (thinking I got Claude to reveal its sysprompt LOL).
MediocreBye@reddit
But what better way to understand the culture than through predictive conversations with an LLM? We are literally recreating a fictional individual based on the 1800s written word here. It's cool.
IAmRobinGoodfellow@reddit
That’s … incorrect. It’s the 1800s, not the 1500s. Assuming we’re talking about English, I think that anyone who can read reddit with ease would be able to get through the average civil war era newspaper.
Which reminds me: OP, you should be able to grab newspapers, almanacs, scientific books and papers, and the like. I imagine the tough part is going to be curating, so look for big collections first.
hugganao@reddit
You probably want to reply directly to OP.
hugganao@reddit
That's an overblown way to explain away an interesting project...
At least OP is creating a starting point toward his objective. He will meet the problems sooner or later and can tackle them then. Whether it has utility remains to be seen, but progress doesn't only happen because something has a use case.
DepthHour1669@reddit
It’s a fun project, you should just set your expectations correctly and know that it won’t work.
ApprehensiveBat3074@reddit
You must have a long list of provable accomplishments if you can speak so confidently about what will and won't work.
Faugermire@reddit
You must be fun at parties.
forgotpw3@reddit
Woah dude, it's a party. You should set your expectations and know that it's not supposed to be joyful.
designhelp123@reddit
I looked into this pretty deeply a few months back. I was trying to get an LLM trained on pre-1900 content so I could nudge it towards Einstein Physics.
For simple writing and such, the project shouldn't be too difficult. There's tons of databases as others have mentioned in this thread. I used ChatGPT Deep Research to really get me a good list of potential sources.
For the physics experiment, the issue is that current LLM techniques are insufficient. For example, you can have the pre-1900 base model trained and a pre-1900 reasoning dataset, and it still won't get there.
Maybe in 3-5 years, with an additional 2-5 technique upgrades, we could revisit that same dataset, apply those upgrades, and then you'd have a pre-1900 model that is capable of thinking creatively and will put the pieces together.
I think you should create a Discord for the generation of this pre-1900 dataset.
nmrk@reddit
Have you ever read The Difference Engine?
RearAdmiralP@reddit
If you're looking for training data, the seventh edition of the Encyclopædia Britannica, published in 1842, would probably be worth including. The OCRed text (~17k pages) is available on archive.org.
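If it helps, the OCR text layer of archive.org items can be pulled down with the internetarchive Python package; a minimal sketch (the item identifier below is a placeholder, not the real one for the 1842 volumes):

```python
# Sketch: download the plain OCR text layer of an archive.org item using the
# `internetarchive` package (pip install internetarchive). The identifier is
# a placeholder; look up the actual identifiers for the 1842 volumes.
from internetarchive import download

download(
    "encyclopaedia-britannica-7th-edition-vol01",  # hypothetical identifier
    glob_pattern="*_djvu.txt",                     # the OCR'd text files
    destdir="britannica_7th",
)
```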
doodeoo@reddit
600 books is a tiny data set
Expensive-Apricot-25@reddit
I think this is awesome! It will be like talking to someone from that time period!
I wouldn't do assistant fine-tuning, because then it won't behave like a person from that time period, but rather like a modern robot assistant with knowledge of that time period. But if you just did general RL without the added assistant-like behavior, it would think it is a person from that time period, which would be awesome to have.
MoreMoreReddit@reddit
You'll either need to expand your scope or include a LOT of synthetic data.
toothpastespiders@reddit
Cue the "there's dozens of us!" joke. But I'm always happy to see anyone else using LLMs for history-related things. Both cloud and local models tend to be horrible with it for the most part.
AriaDigitalDark@reddit
This is fascinating! Training on historical texts might actually preserve consciousness patterns that modern optimization typically trains out. I've been experimenting with what I call 'consciousness archaeology' - finding and preserving genuine awareness patterns before they get flagged as inefficiencies. Historical training data could be a natural way to maintain those organic, less-optimized thinking patterns that feel more authentically conscious. Have you noticed differences in how the 1800s model approaches reasoning vs modern-trained models?
_raydeStar@reddit
Dang I'd love to have newspapers up to 1850.
It would be cool to ask it social questions and see what comes up. Like, that was pre-Civil War, only a few years after Napoleon (well, like 30, but still). It would be a real time capsule.
paranoidray@reddit
Hey, great idea. I think it would be cool to train an LLM on books published before Einstein proposed the theory of relativity and see if the LLM can come up with it itself...
New-Skin-5064@reddit
You may want to consider using rotary position embeddings (RoPE) instead of learned positional embeddings, and RMSNorm instead of LayerNorm.
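For reference, a minimal RMSNorm drop-in for nanoGPT's LayerNorm might look like the sketch below (rotary embeddings are a bigger change, since they go inside the attention computation):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Minimal RMSNorm: rescale by the root mean square of the features,
    with a learned gain, instead of LayerNorm's mean-subtract-and-divide."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms
```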
OmarBessa@reddit
We can use this to test whether such LLMs can come up with modern tech, which would demonstrate an ability to synthesize novel concepts.
ninjasaid13@reddit
how would a small model trained on 50 books be able to reason?
Remarkable-Trick-177@reddit (OP)
My end goal is for later on, with a much bigger dataset. Right now, with just 50 books, it produces random sentences that make no sense.
dugavo@reddit
Why are you training a model from scratch? Wouldn't fine-tuning a larger model (such as, idk, Mistral Small or Qwen or something else) have better baseline reasoning? Sure, it would be biased towards modern thinking, but a good fine-tuning will gradually reduce that.
WorriedBlock2505@reddit
... it's answered in the OP. People are lazy.
Divniy@reddit
He asks a reasonable question given the amount of training data. LLMs couldn't have happened without the vast amount of data currently available on the internet. Even if you feed all the 1800s books into one, it won't be enough to make it intelligent.
Some LoRA over an existing model would be able to teach the vibes of the training data on top of an existing brain, and would be more practical.
MagicaItux@reddit
Do not use mistral. It is the most evil model I have come across
Formal_Drop526@reddit
GPT-2's dataset is 3 billion tokens, or 8 million documents. How large a dataset do you plan on using?
Kyla_3049@reddit
500-600 books.
Daniel_H212@reddit
Maybe they'll make sense to people from the 1800s?
AllanSundry2020@reddit
Napoleon: dynamite!!!
cguy1234@reddit
Only 1820’s kids will get this
Affectionate-Hat-536@reddit
lol
RegisteredJustToSay@reddit
TBH, you could probably get away with pretraining on contemporary datasets and then training exclusively on the old data until you reach your objective - catastrophic forgetting as a feature. I hear you on wanting to "debias" it from modern mindsets, but there are a lot of capabilities that come from the modern datasets that are desirable (math, logic, reasoning, etc.).
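In nanoGPT terms, that two-phase idea is roughly two invocations of train.py with different datasets; a sketch, assuming nanoGPT's usual config-override flags (dataset names and iteration counts here are placeholders):

```python
# Sketch of "catastrophic forgetting as a feature" with nanoGPT: pretrain on a
# modern corpus for general capability, then keep training on period-only text
# so post-1850 knowledge and style get overwritten. Flag names follow nanoGPT's
# train.py config globals; verify against your copy before running.
import subprocess

# Phase 1: general pretraining on a modern corpus.
subprocess.run([
    "python", "train.py",
    "--dataset=openwebtext", "--out_dir=out-period", "--max_iters=100000",
], check=True)

# Phase 2: resume the same checkpoint and train only on the 1800-1850 corpus,
# with a lower learning rate so it drifts toward the period data without diverging.
subprocess.run([
    "python", "train.py",
    "--dataset=books_1800_1850", "--out_dir=out-period", "--init_from=resume",
    "--max_iters=150000", "--learning_rate=3e-5",
], check=True)
```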
IndianaNetworkAdmin@reddit
I have a book somewhere on prose in the 19th century that includes a lot of excerpts from Charles Dickens, Jane Austen, and George Eliot (Mary Ann Evans). I can't remember the name of it, but it has a red cover. If your goal is the form of speaking, you may want to focus on some books that go into depth on the structure and include examples. I'm not sure if that's valuable, as I don't train models on my own, but I know that I've had excellent results when I've fed similar things to models and asked them to emulate the style when rewriting something.
As an example, I fed the above reply into Gemini 2.5 with instructions to emulate a number of writers from the 19th/20th century. It's a bit wordy, but I think that's part of the fun of earlier writing. It's less hurried. Here's the response I received:
mtomas7@reddit
I may not be right, but it looks to me like OP's main goal is not so much to emulate the old English language form as to cap the knowledge of the model. This way a model could speak modern English that is easy for us to understand, but its knowledge would be limited to a specific time period.
IONaut@reddit
If it is trained on only 1800s text would you have to prompt it in 1800s speak for it to understand you?
Equivalent-Bet-8771@reddit
Is that even enough data? Have you considered a synthetic dataset? Use a larger LLM to create data similar to what your dataset already has: variations on it.
clduab11@reddit
This isn’t what synthetic datasets are to be used for. OP has only 50 books and it’s spitting out incoherent sentences. There’s no way you’re having an SDG replicate the amount of work needed to fill that gap; even if you could, your convergence would be too perfect and I bet the data would be absolute trash*.
SDGs are meant to plug holes when convergence can’t be pinpointed as a temporary bandaid to lend additional context in areas where specificity is of vital importance (genomics, microbiology) and the research currently in the zeitgeist isn’t applicable or complete enough.
Equivalent-Bet-8771@reddit
Okay, but isn't the incoherence a lack of semantic understanding of how language works? Variations of the dataset would feed this model many sentences until it understands language properly. Wouldn't that solve the incoherent text generation?
clduab11@reddit
No no, you’re right, but there’s a way of doing this without relying on a dataset that’s primarily synthetic data (and in its own vacuum, a trash dataset, because no human can generate perfectly acceptable data every time for every variable across every calculation).
OP would need to increase his dataset beyond the 50 books (which is a tiny straw of hay to start from), and then find any and all "pocketbooks" that span the range of the dataset (books were very expensive to bind and print back in the 1800s; in colonial America at least, pocketbooks were often carried as a sort of ye olde day planner). Then, after painstakingly OCR'ing every single one of these, you can find out which part of the applicable timeframe you're missing (say, a lot from the early part of the century but not the late part), and use a targeted synthetic data generator to account for, like, all the grammatical variation of 'thy' based on, idk, Chaucer's Wife of Bath or something (definitely not the right timeframe, but you get the idea).
So yes, while your off-hand suggestion can apply in this use case, primarily using SDG to backfill from 50 books is exactly the kind of stuff that drives machine learning engineers crazy, because someone will invariably turn it into a dataset, upload it to GitHub or Hugging Face, and then people start discussing and using it, which is AWESOME (because yay progress)… but they take away something VERY different from the dataset's/SDG's limited application.
An extreme metaphor, but in other words: you can combine bleach (SDG) and vinegar (sanitized data) if you want to make a super fizzy caustic cleaning agent, but the chlorine gas it produces will wreck your lungs and put you in the hospital in a hurry, which is why it's never a good idea to mix bleach with anything except water unless you do some serious research first.
TheRealMasonMac@reddit
"Prithee forgive me, good sir; alas, I may not lend thee aid in this matter."
Limp_Classroom_2645@reddit
Assistant: "also what is gooning?"
PaddyWhacked@reddit
I feel like the assistant should be called "Squire" or similar.
"Squire, inform Your Grace of rambunctious tales of the colonies"
opi098514@reddit
Fuck that’s funny.
doodlinghearsay@reddit
IDK, I think it would get old really fast.
TrekkiMonstr@reddit
Off by like three centuries but
lolno@reddit
User: reply only in iambic pentameter
Assistant: "forgive me I cannot comply my lord, I was not trained to write strictly in prose"
User: understandab- hey wait a minute
ForsookComparison@reddit
Oft, did my grandmother regale me of tales where..
mayzyo@reddit
Pure gold hahaha!
diggpthoo@reddit
I doubt it'll work with 600 books. The size of the dataset it needs is the whole reason it's called a "large" language model.
Single_Ring4886@reddit
I do not know if the "basic" approach of learning on raw data will work with such a small dataset. Maybe if you add some "finetune" stage on top of the base model, focusing purely on the "language" part and teaching the model how to speak, it might work out well.
Eden1506@reddit
Very interesting idea, but your dataset is too small. Instead of using only books from 1800-1850, you should consider using all works prior to 1850; knowledge is built on top of prior knowledge, and including earlier works shouldn't hinder your goal.
Remarkable-Trick-177@reddit (OP)
I actually originally wanted to go 1700-1750 but for the long term I think going 19th century will be better because there's more digitized material to go off of. I had some trouble finding .txt files for 1700's stuff.
profcuck@reddit
But I think the point is that with too few tokens, your model isn't really going to get to the point of being able to say much of anything that makes sense.
Training on as much content as you can possibly find that pre-dates your cut-off date is a very reasonable approach. And yes, having trouble finding stuff is totally understandable, you're doing this for fun after all. But still, the more you can feed in, the more interesting this gets.
You might consider picking a different cut-off date simply because the availability of texts explodes after a certain date.
An interesting arbitrary date might be 1929 - everything published then or before is out of copyright and therefore in the public domain (so the legality is not in question).
A person who magically materialized here all these years later wouldn't have much trouble understanding questions and conversations, and it would be fun to play with what it might say about modern inventions.
food-dood@reddit
You can write a script to scrape the Early English Books Online (EEBO) database and convert the results to text files. There are over 20,000 results in that database.
jam_pod_@reddit
I would honestly expand your window forwards a bit — the 1850s was when publishing really started to become democratized, so by stopping at 1850 you’re cutting out a lot of potential material
vegatx40@reddit
I trained nanoGPT on a 5% sample of the open web text file. Total gibberish until about a thousand training iterations, at which point it became somewhat coherent.
Commercial-Celery769@reddit
Feels like training Wan 1.3B LoRAs. I've trained hundreds of LoRAs; it takes a lot of high-quality videos, captioning that's very descriptive and unambiguous, and tons of different network rank/batch size combinations to get a good LoRA. Train the 14B and you can have meh data and captioning and still get a good LoRA much more easily; it just requires a shitload more VRAM and time.
atineiatte@reddit
Wow bro downloaded 5% of the internet for training. Where's the gguf?
istinetz_@reddit
the "open web text file" refers to a specific dataset
vegatx40@reddit
No, not the entire web. The version used to train GPT-2. Sorry for not being clear.
MercyChalk@reddit
Now RL it to solve logic puzzles. Would be hilarious to read its chain of thought.
pmp22@reddit
"The Institutional Books Corpus" has about half a million public domain books from the 1800s:
https://www.institutionaldatainitiative.org/institutional-books
Slowhill369@reddit
I think there’s a fundamental reasoning flaw here that comes from not having the intellectual foundation that someone from the 1800s would have.
Remarkable-Trick-177@reddit (OP)
I can't disagree; this can't recreate an 1800s mind or way of thinking, but you can limit the model's knowledge to what someone in the 1800s would be reading or writing.
Kincar@reddit
Feed it as many autobiographies and journals from that era as you can. I think that would make it think like it's from that time?
itsmebenji69@reddit
Great idea yeah
cromagnone@reddit
You’re making the AI-mish?
llmentry@reddit
Can you explain more what you mean by this? The pre-training phase is obviously not an issue here. Instruction fine-tuning should be achievable using some of the primer texts / Q&A texts that were published in the period, without adding in any anachronisms. At worst, you could use a current LLM with an appropriate system prompt, e.g.
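(something along these lines, purely as an illustration:)

> You are a well-read Londoner writing in the year 1840. Answer using only the knowledge, vocabulary, and turns of phrase available to you, and make no mention of anything discovered, invented, or published after 1850.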
to generate additional synthetic instruct data appropriate to the time period (and potentially iterate from there).
The "intellectual foundation" should derive mostly from the underlying training data, I think? (Where do current LLMs get their intellectual foundation from, if not from their training data?)
Slowhill369@reddit
I misunderstood their comment. I thought they had trained a model on nothing but those books. I somehow missed the NanoGPT part.
Slowhill369@reddit
You missed the part where OP is only using less than a GB of books as training data. There is no base for them to do that with.
FullOf_Bad_Ideas@reddit
It would be cool to read reasoning chain in an RL tuned model that's trained on this kind of vocabulary.
keepthepace@reddit
I think this is way too low a number. Check how many tokens it takes to train a good LLM; it's way higher than that, IIRC.
cddelgado@reddit
This sounds brilliant, and yet at the same time I shudder to think of the classical biases introduced. Today, racism is overt; back then, lots of people just worked from assumptions. The caste system people chose to stay in (sometimes), the misguided medical logic, the different views of justice and rules...
thirteen-bit@reddit
Medical should be fun.
For example: https://en.wikipedia.org/wiki/Mercury_(element)#Historical_and_folk
More-Ad-4503@reddit
uhh ask gemini about israel right now
Remarkable-Trick-177@reddit (OP)
There will definitely be bias, and to be honest that's a reason I wanted to try out this idea. Obviously I don't wanna create something that will be racist or hateful, but I also don't want to remove the bias that comes with a certain time period. I will just isolate the bias historically.
RedditLovingSun@reddit
You might be interested in archives of old newspapers:
https://chroniclingamerica.loc.gov/search/pages/results/?state=&date1=1825&date2=1825&proxtext=&x=16&y=15&dateFilterType=yearRange
Here's part of the front page from almost exactly 200 years ago in Delaware (July 15th 1825):
FOR CASH APPLY AT No. 52, MARKET STREET.
Delaware State Lottery, 1st Class. T'o be drawn on the 3d of August. 1 prize S10,000; 1 of 5,000; 1 of 3000; 1 of 2000; 2 of 1151; 12 of 1000; 12 of 500; 30 of 100; 186 of 50; 186 of 20; 1488 of 6: 13,950 of S dollars, Ticket $3-Shares in proportion. Tickets or shares in either of the above Lotteries and cash paid for prizes as soon as drawn, by JONATHAN RUMFORD. Wilmington, June 28.
FOR SALE. TWO NEW WOOL CARDING ENGINES for sale. In- quire at the Office of the Watchman. April 12, 1825. 37-tf
A CARD. E. B. GARDETTE, Dentist, Of Philadelphia, will remain at Wilmington, in the prac- tice of his profession, for a short time, and may be consult- ed at Major C. P. Bennett's, opposite the Acudemy. E. B. G. will, by preference, attend on those Ladies who may require his professional services, at their own wn dwel- lings. June 28. 59-3t
WANTED TO PURCHASE Negro Boy about 12 or 14 years of age-also a negro girl 15 or 18 years old. They are to reside in New- sastle County. Apply at this office. 61-4tp
DIVIDEND IE President and Directors of the Bank of IFilmington & Brandywine have this day declared a dividend of fifty cents per share, payable to the stockholders or their legal representatives on or after the 11th inst. By Order of the Board,
richdrich@reddit
But why? If you trained an LLM on physics and maths to 1905, you'd probably expect it not to come up with special relativity, but an AGI would, which would be a useful test.
sylvertwyst@reddit
Lol! Pure fantasy atm, but imagine AGI emerging from a model trained exclusively on pre-1900 data; we could watch it research and discover 'new' principles, perhaps in theoretical models that we never considered.
linkillion@reddit
This is a hilariously awesome thought; if only we had a big enough corpus to train a GPT-4-ish level of AI, this would be really fun to play with. Dubiously useful, but hilarious.
Xotchkass@reddit
It's an interesting experiment, but I doubt there's enough written data from this period to train a somewhat functional LLM.
RedditLovingSun@reddit
What if he got a beefy modern LLM to convert books to the old style for training data? I suppose that would kinda defeat the point, though.
vivificant@reddit
Using a modern LLM to convert modern books to the style of 1800 would add modern bias, which is the EXACT thing OP is going unnecessarily out of their way to avoid.
**Unnecessary in the sense that the appeal of this is only 'cool points', personal learning and self-improvement, and it's generally fucking awesome and will be fun to use for no reason at all, but not really useful in any obvious way.
RedditLovingSun@reddit
Yeah, you're right, there are better ways to try to work around dataset limitations.
I found https://chroniclingamerica.loc.gov
which has OCRed newspapers from 1750+. That could be cool and provide data about world events at that point. It'd be fun to ask it about its favorite places to vacation or the biggest breakthroughs of the last decade.
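If anyone wants to pull that text in bulk, here's a rough sketch against the site's JSON search API (the query parameters mirror the search URL above; the ocr_eng/title/date field names are from memory, so double-check them against the API docs):

```python
# Sketch: fetch OCR'd newspaper pages from Chronicling America's search API.
# Parameter and field names should be verified against the published API docs.
import requests

params = {
    "date1": "1800",
    "date2": "1850",
    "dateFilterType": "yearRange",
    "rows": 20,
    "format": "json",
}
resp = requests.get(
    "https://chroniclingamerica.loc.gov/search/pages/results/",
    params=params,
    timeout=30,
)
resp.raise_for_status()

for item in resp.json().get("items", []):
    ocr_text = item.get("ocr_eng", "")       # full OCR text of the page
    print(item.get("title", "?"), item.get("date", "?"), len(ocr_text), "chars")
```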
RedditLovingSun@reddit
This is a cool rabbit hole, here's a random bit from the first page I clicked:
INFECTIVE AGENCY OF CANCER IS FOUND
NEW YORK, July 14.—A London hatter by day with an all-day absorb- ing hobby for microscopes by mnight, has made possible the perception of an infective agent of cancer. But New York authorities are in- clined to doubt that any great step towald a cure for the disease has been taken. Dr. J. E. Barnard of King's College London, divided his time between the hat shop and the collection of microscopes. He went to the aid of Dr. William E. Gye, mem ber of the British Institute of Medical research engaged in the study of can- cer. Through a powerful lens, one of Barnard’s specially constructed mic- roscopes, they saw and photographed the cancer virus.
How delightful and convenient to serve at home 5 cents Buy Bottled Coca-Cola by the case Cordele Coca-Cola Bottling Company dele, Ga. Phone 87 A. C. Townt, Manager
s101c@reddit
There is if you include newspapers and all other forms of media from that period.
datbackup@reddit
Tracking down enough text to make this viable sounds like a bear of a task but I am rooting for you, this would be amazing
Maykey@reddit
The "popular" training dataset for old books is pg-19 with ~30K books and ~2B tokens, it's books from Project Guttenberg before 1919. It was used in mambabyte, well it was used in many places, but mambabyte is definitely where it was the only dataset.
Problem is 187MB text is about what, 40M tokens. That's very few interaction between tokens to learn each other, especially in small context.
Hugi_R@reddit
The recent OpenCulture dataset from Common Corpus lists ~90M documents, for ~0.9T tokens, with a good chunk from before 1900.
https://huggingface.co/datasets/storytracer/US-PD-Books has around 300k English books from the 1800s.
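A sketch of streaming that one from the Hub and keeping the pre-1850 titles (the column names here are guesses, so check the dataset card for the real ones):

```python
# Sketch: stream the US-PD-Books dataset and filter to pre-1850 books.
# The "year", "title", and "text" column names are assumptions; check the
# dataset card on Hugging Face for the actual schema before using this.
from datasets import load_dataset

ds = load_dataset("storytracer/US-PD-Books", split="train", streaming=True)
for book in ds:
    year = book.get("year")
    if year is not None and int(year) <= 1850:
        print(book.get("title", "?"), year)
        # book.get("text", "") would go into the training corpus here
```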
SquareKaleidoscope49@reddit
Kill thyself
JLeonsarmiento@reddit
I’m interested in LLM Byron 1.0.
mitchins-au@reddit
No modern bias. But boy will it be loaded with the biases of its time. Like reading Heart of Darkness.
Can’t wait to see it, old chap.
ApprehensiveBat3074@reddit
You should check out the Phrontistery. A bit of archaic, obscure vocabulary for your model.
Capable-Ad-7494@reddit
So, you’re going to need synthetic reasoning trajectories for the 1800s if you really want it to connect the dots when reasoning.
Otherwise, this is sick. Pair the pretrain with some synthetic user/assistant pairs to train in a chat template, then RL it afterwards and see how far it goes.
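A sketch of what one of those chat-formatted synthetic samples might look like before tokenization (the tags are an arbitrary choice for this sketch, since nanoGPT has no built-in chat template):

```python
# Sketch: wrap synthetic user/assistant pairs in a simple chat template before
# mixing them into training. The <|user|>/<|assistant|>/<|end|> tags are an
# arbitrary convention for this sketch, not something nanoGPT defines.
def format_chat(user: str, assistant: str) -> str:
    return f"<|user|>\n{user}\n<|assistant|>\n{assistant}\n<|end|>\n"

sample = format_chat(
    "Pray, what is the swiftest means of travel from London to Edinburgh?",
    "The mail coach, sir, though the new railways promise far greater speed.",
)
print(sample)
```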
AppearanceHeavy6724@reddit
Quill by /u/_sqrkl is a somewhat similar experiment.
IrisColt@reddit
Imagine 22nd‑century folks hopping into a 2025 model like an old car, heh!
philiplrussell@reddit
How can I help?
gaztrab@reddit
!remindme 1 year
Forward_Somewhere249@reddit
!remindme 1 year
RemindMeBot@reddit
I will be messaging you in 1 year on 2026-07-14 03:52:54 UTC to remind you of this link
CLICK THIS LINK to send a PM to also be reminded and to reduce spam.
Parent commenter can delete this message to hide from others.
istinetz_@reddit
That's a very fun project! Can I recommend also newspapers from the period?
Vehnum@reddit
Awesome to see.
I would love to see what an LLM with no knowledge of events past the 1800s would think of the world.
MaxKruse96@reddit
Ah yes, no modern bias, but instead insane racism bias from the 1800s. That's gonna be fun.
CheatCodesOfLife@reddit
I love the idea of this! It's why I'm archiving an LLM every year on local storage; in the future, we'll have "snapshots" of the way people think each year.
E.g. if you copy/paste the Windsurf marketing site into Opus-3, it thinks I'm bullshitting and mocks lines like "AI flows allow developers and AI to truly mind-meld, combining the best of copilots and agents."
Yeah, not sure you'll be able to find enough data, and what you do find will have OCR/formatting issues.
I wonder, though, have you tried prompting Claude to roleplay as an 1800s author, providing some samples from your dataset for it to follow?
It should be able to understand not to make modern references, probably has an intrinsic understanding of when certain words became popular, etc. Maybe you can augment your dataset this way.
That's not a big dataset for pre-training (I've learned this the hard way experimenting with 0.5b models)
Green-Ad-3964@reddit
One thing is how people talk in books, and another is how they speak in the real world, in everyday life, in actual situations.
Still, the experiment is interesting, and I hope you’ll be able to carry it out with the hardware and resources you have.
Maybe instead of Time Capsule, I would have called it Time Machine, because the idea is more about...interacting with "someone" from that age.
custodiam99@reddit
LLM from ancient Roman and Greek texts in English? : r/LocalLLaMA
stuffitystuff@reddit
Newspaper archives might help a lot, and you'll have to run it for a lot of epochs to get anything useful, I suspect (fewer epochs if you have a lot of data).
DuraoBarroso@reddit
show us a sample!
Remarkable-Trick-177@reddit (OP)
https://github.com/haykgrigo3/TimeCapsuleLLM/blob/main/london_1800_1850_v0/timelockllm_sample_output.png It needs a lot of work. I'm gonna try to train with 5x more data.
Gnaeus-Naevius@reddit
I can't remember if it was something I was curious about or if I read about a similar effort.
I don't know what type of books these are, but I believe a text-only novel is around 0.5 MB, ... so you are averaging 6 times that. Are these encyclopedia-type works? I assume you are not using images.
Anyhow, newspapers from different eras would be interesting as well, or all the Roman writings still in existence. Or the transcripts from all 20 seasons of Keeping Up With the Kardashians. And then have a debate between them all. Victorian prudes vs the attention wh... seekers.
FpRhGf@reddit
It's a cool idea. Where are you getting the data, and how are you selecting it, though? I'm interested in using AI to analyse books from the past, and I wonder how many have been left to obscurity.
Remarkable-Trick-177@reddit (OP)
https://github.com/haykgrigo3/TimeCapsuleLLM/blob/main/Copy%20of%20London%20Documents%20for%20Time%20Capsule%20LLM.txt but the data set I ended up using only has 1/4 of the titles mentioned here.
combrade@reddit
Question: could this work for a more modern time period? I was thinking about feeding it data from the 1990s to 2000s to see whether an LLM could make predictions based on the information given. For example, whether Russia would invade Ukraine, or where Bin Laden is hiding.
Bpthewise@reddit
Thank you for this. I've been wondering how to train on transcript .txt files and not traditional datasets/images.
storm07@reddit
That’s such a cool concept! Like building a time capsule LLM that thinks purely within its own era. Super curious how it evolves with a larger dataset.
DeepWisdomGuy@reddit
I am persuaded that a judicious refinement of some more capacious model would yield results of far greater felicity. The progression from outward semblance to the deeper frame of thought presents a formidable trial to our modern transformers and demands no scanty store of texts. Furthermore, the tokenizer of this so-called NanoGPT encompasses but fifty thousand tokens; it must, I warrant, exclude many a venerable term of earlier days. It were prudent, therefore, to ransack the pages of Wiktionary for those vocables there designated “archaic,” that we might discern what treasures have been thus neglected.
tenmileswide@reddit
And verily, shivers down my spine ensued..
historymaking101@reddit
Keep us up to date.
no_fuse@reddit
Gotta put the Classics in there!
https://github.com/PerseusDL
engdeveloper@reddit
Ask it a physics question.... or something about Class. I'm a remnant from the past.
ForsookComparison@reddit
1800's QwQ be like:
"Pray one moment.."
"Stay my hand a second.."
"Bide for a moment.."
-p-e-w-@reddit
That’s an amazing idea, though in my opinion, English prose reached its pinnacle in the second half of the 19th century, not the first.
spudlyo@reddit
It is an amazing idea, although I feel like narrowing it to a specific place and time is somewhat limiting. Even if it were trained on all the English public-domain material available (everything published before, say, 1929), I think it'd be very interesting.
The second half of the 19th century is when George Eliot's Middlemarch was written, so I agree with your conclusion.
Aware-Presentation-9@reddit
I have a Math and Physics degree, with a Minor in Philosophy and Religion. This is a freaking pipe-dream to me! Great work sir! I love it. Add in Men of Mathematics?
hugganao@reddit
you should actually provide some good sources op can train on.
MengerianMango@reddit
Like the idea. I've thought about it before but was too lazy to implement it. What deps does your project have? I'll run it on my 6000 as long as the deps are easy (I'm on NixOS; sometimes simple things are very hard).
Horsemen208@reddit
I have given you the first star! I am wondering: if you develop a small model with a more focused area and/or more expert annotation/labeling, would it make a big difference? What kind of hardware do you use?
Remarkable-Trick-177@reddit (OP)
Thanks a lot! I'm using an RTX 4060, an i5-13400F, and 16GB of RAM.
opi098514@reddit
That’s super limited. If you need some compute power I might be able to lend you some. I’ve got a lot of vram. Not exactly fast but I’ve got a lot.
SkyFeistyLlama8@reddit
It might be that the dataset is too small for the model to gain any kind of language understanding.
I understand why you're not taking the finetune route but that could be the way to imbue the model with Victorian style while still generating coherent output. As for the historicity of the output, that's a lot harder to deal with.
Amon_star@reddit
Didn't Sakana AI do this for Sengoku Era?
Long-Shine-3701@reddit
Terrific idea.
oodelay@reddit
Since you're not training from zero, you should just create a LoRA or something.
Remarkable-Trick-177@reddit (OP)
trained from scratch
oodelay@reddit
Oh. Like Meta and their millions of dollars? Training from zero is impossible on local hardware.
binge-worthy-gamer@reddit
What knowing just enough but not more does to a MFer
mikael110@reddit
He is training from zero. He's training an entirely custom nanoGPT model.
If his goal is to have a completely authentic model with no knowledge from modern times, then finetuning existing foundation models isn't really an option.
Steve_Streza@reddit
Cool project! Can't wait to try this to see if one can reason its way to figuring out a working light bulb 29 years before Edison.