How to make a small LLM from scratch?
Posted by Charming_Barber_3317@reddit | LocalLLaMA | View on Reddit | 21 comments
I want to build an LLM of 0.1B to 0.6B params for a less popular language. How much data will I need in that language, and what exact steps should I follow? Is this a good project for my final year? I have access to an RTX 3090, on which I can run 20B to 40B models easily at Q4_K_M.
johnkapolos@reddit
https://github.com/karpathy/nanoGPT
badgerbadgerbadgerWI@reddit
Thank you!
abnormal_human@reddit
OP, please listen to this. I trained a small GPT in early 2023 off of this codebase and had a lot of success with it. It was fast, easy to work with, and I was able to understand the whole thing with no magic.
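To give a sense of scale: a nanoGPT run is driven by a tiny Python config file that overrides the defaults in `train.py`. Something in the spirit of the sketch below would get you a GPT-2-small-sized (~124M param) model. The dataset name and all hyperparameters here are illustrative, so check the real examples in the repo's `config/` directory rather than copying this.

```python
# Hypothetical nanoGPT-style config for a ~124M-param (GPT-2-small-sized) model.
# Variable names follow nanoGPT's existing configs; all values here are illustrative.
out_dir = 'out-punjabi-124m'
dataset = 'punjabi'  # assumes data/punjabi/train.bin and val.bin built by a prepare.py script

# model: 12 layers x 12 heads x 768 dims ~= GPT-2 small (~124M params)
n_layer = 12
n_head = 12
n_embd = 768
block_size = 1024
dropout = 0.0

# optimizer / schedule
batch_size = 12
gradient_accumulation_steps = 40  # effective batch of roughly 12 * 40 * 1024 tokens per step
learning_rate = 6e-4
max_iters = 100000
lr_decay_iters = 100000
min_lr = 6e-5
warmup_iters = 2000

# logging / eval
eval_interval = 1000
eval_iters = 200
log_interval = 10
```

Then it's roughly `python data/punjabi/prepare.py` (a tokenization script you write, modeled on the ones in the repo) followed by `python train.py config/train_punjabi.py` on a single GPU; the README covers the multi-GPU `torchrun` variant.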
Cultural_Ad896@reddit
What unpopular languages are you thinking of? Like Dart?
Charming_Barber_3317@reddit (OP)
No, I'm talking about Punjabi and Sindhi 🙂
Cultural_Ad896@reddit
Ah, thank you. I was mistaken.
Charming_Barber_3317@reddit (OP)
No, I'm talking about Urdu 🙂
FullOf_Bad_Ideas@reddit
Does it need to be usable for anything, or can it just be a toy?
I am pre-training an LLM for Polish right now, from scratch, using the Bailing v2 MoE arch (Ling Mini 2.0 uses it), between 1B and 4B params, pre-trained on 110B tokens. I plan to use around 1000 GPU-hours of H100 compute. It should be done by the end of the month if the wind blows right. It will most likely not be usable in any real way, just a toy, like GPT-2 is nowadays.
Once you have the dataset, it's mostly a matter of finding the right config based on known scaling laws and applying it. I am going with MoE to push harder than I could have with a dense model on my limited compute - I don't know if it'll work, we'll see soon.
Look into TorchTitan, Ling-V2, Megatron-LM and renting H100x8 nodes.
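To make the scaling-law part concrete, a back-of-the-envelope budget looks something like the sketch below. The ~20 tokens/param rule and the 6·N·D FLOPs estimate are standard approximations; the H100 throughput and MFU numbers are rough assumptions on my part, not measurements.

```python
# Back-of-the-envelope pretraining budget, assuming:
#  - Chinchilla-style ~20 training tokens per parameter
#  - compute ~= 6 * params * tokens FLOPs (standard approximation)
#  - ~1e15 BF16 FLOP/s peak per H100 and ~35% MFU (rough, hardware/stack dependent)

def pretrain_budget(params: float, tokens_per_param: float = 20.0,
                    peak_flops: float = 1e15, mfu: float = 0.35) -> dict:
    tokens = params * tokens_per_param
    flops = 6 * params * tokens
    gpu_seconds = flops / (peak_flops * mfu)
    return {
        "tokens_B": tokens / 1e9,
        "total_FLOPs": flops,
        "H100_hours": gpu_seconds / 3600,
    }

for p in (0.1e9, 0.6e9, 1e9):
    b = pretrain_budget(p)
    print(f"{p/1e9:.1f}B params -> {b['tokens_B']:.0f}B tokens, "
          f"~{b['H100_hours']:.0f} H100-hours at 35% MFU")
```

At these assumptions a 0.6B dense model on its Chinchilla-optimal 12B tokens is on the order of a few dozen H100-hours; training well past Chinchilla-optimal (as most modern runs do) scales the cost up accordingly.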
ikkiyikki@reddit
No idea. Hopefully someone here can chime in. The methods are well published, so I wouldn't be surprised if there were already apps that can do it (as opposed to a mere finetune).
The dataset is the easy part. Wikipedia is, I think, something like 5 billion tokens, and Common Crawl is many times that, so more than enough for your project.
Charming_Barber_3317@reddit (OP)
Wikipedia and Common Crawl are in English. I want to train the model on a separate Middle Eastern language.
Coldaine@reddit
If there's not a huge corpus of data for whatever you want to train the model on, don't underestimate how reasonably priced it is to synthesize training data. Depending on the language, you can spend a couple of dollars having Google Gemini 2.5 Flash generate synthetic training data; review it for quality, but it's generally pretty good.
I used it to train an image model, and it was great.
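If you go that route, the generation loop itself is tiny. The sketch below uses the google-genai Python SDK; the model name, prompt, and output handling are placeholders, and you should double-check the current SDK docs plus human-review (ideally with a native speaker) whatever it produces before training on it.

```python
# Minimal synthetic-data sketch using the google-genai SDK (pip install google-genai).
# Model name, prompt, and output format are assumptions -- adapt and human-review everything.
import json
import os

from google import genai

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

PROMPT = (
    "Write 5 short, natural paragraphs in Punjabi (Gurmukhi script) about everyday topics "
    "such as cooking, weather, and travel. Return plain text, one paragraph per line."
)

with open("synthetic_punjabi.jsonl", "a", encoding="utf-8") as f:
    for _ in range(100):  # roughly 500 paragraphs per run
        resp = client.models.generate_content(model="gemini-2.5-flash", contents=PROMPT)
        for line in (resp.text or "").splitlines():
            line = line.strip()
            if line:
                f.write(json.dumps({"text": line}, ensure_ascii=False) + "\n")
```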
lasizoillo@reddit
You can get dumps of Wikipedia in many languages.
Common Crawl crawls pages in many languages (mostly English, but not exclusively); if you filter by your local TLD, you'll probably find a lot of pages in your language. With CC you don't need to download a full snapshot: you can get the index and then download only the interesting parts.
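A rough sketch of that index-then-byte-range approach is below. The crawl ID is just an example (current ones are listed at https://index.commoncrawl.org/), the `languages` field only exists in newer crawls, and I'm not certain every index accepts a bare-TLD wildcard, so treat this as a starting point rather than a recipe.

```python
# Query the Common Crawl CDX index for pages under a country TLD, then fetch
# individual WARC records by byte range instead of downloading whole snapshots.
# The crawl ID is an example; TLD-wide result sets are huge, hence the page parameter.
import gzip
import io
import json
import requests

CRAWL = "CC-MAIN-2024-33"  # example crawl ID, replace with a current one
INDEX = f"https://index.commoncrawl.org/{CRAWL}-index"

# Captures under the .pk TLD (paginated; this grabs only page 0 as a demo)
resp = requests.get(INDEX, params={"url": "*.pk", "output": "json", "page": 0}, timeout=60)
records = [json.loads(line) for line in resp.text.splitlines()]

for rec in records[:5]:
    # Fetch just this record's bytes from the public WARC file
    start = int(rec["offset"])
    end = start + int(rec["length"]) - 1
    warc = requests.get(
        f"https://data.commoncrawl.org/{rec['filename']}",
        headers={"Range": f"bytes={start}-{end}"},
        timeout=60,
    )
    raw = gzip.GzipFile(fileobj=io.BytesIO(warc.content)).read()
    print(rec["url"], rec.get("languages", "?"), len(raw), "bytes")
```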
It's probably easier to use public-domain books and open data in your language for the base training, then distill knowledge from a bigger LLM to generate instruction datasets.
You can also filter datasets by language on Hugging Face.
thebadslime@reddit
Hey there!
I am in the process of training a 960M model from scratch, using the transformers library on Amazon SageMaker. Chinchilla-optimal for 0.6B would be 12 billion tokens.
You are going to need something like a 20 GB card for training, and it will take weeks.
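For anyone curious what "from scratch with transformers" looks like in code, a rough sketch is below: you build a randomly initialized Llama-style config instead of loading pretrained weights. The tokenizer, sizes, file names, and TrainingArguments are placeholders, not a tuned recipe.

```python
# Sketch of from-scratch pretraining with Hugging Face transformers.
# Sizes and dataset are placeholders; this shows the shape of the code, not a tuned recipe.
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling, LlamaConfig,
                          LlamaForCausalLM, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder; train your own tokenizer for your language
tokenizer.pad_token = tokenizer.eos_token

config = LlamaConfig(
    vocab_size=len(tokenizer),
    hidden_size=1024, intermediate_size=2816,
    num_hidden_layers=16, num_attention_heads=16, num_key_value_heads=4,
    max_position_embeddings=2048,
)
model = LlamaForCausalLM(config)  # randomly initialized, roughly 0.3B params at these sizes
print(f"{model.num_parameters() / 1e6:.0f}M parameters")

ds = load_dataset("text", data_files={"train": "corpus.txt"})["train"]
ds = ds.map(lambda b: tokenizer(b["text"], truncation=True, max_length=2048),
            batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=8,
                           gradient_accumulation_steps=16, learning_rate=3e-4,
                           bf16=True, num_train_epochs=1, logging_steps=50),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```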
ArthurParkerhouse@reddit
About $180,000 cash shot directly into the bloodstream.
Hefty_Wolverine_553@reddit
Andrej Karpathy has a really great tutorial on training LLMs from scratch. However, note that anything you can come up with on a 3090 will be basically the same quality as GPT-2. I'd consider renting some good GPUs on runpod/vast for a few bucks for anything slightly more intensive.
Figai@reddit
The project can 100% be good for a final year, but it might be a little overdone, imo. Then again, most people would just use LoRA to fine-tune an LLM and be done; you're going a lot further than that. You're gonna need to be comfy with PyTorch, JAX or whatever, use as much prewritten code as you can, and don't get bogged down writing CUDA kernels or smth. Oh, and there's the Karpathy course and so many other tutorials. Though there are things you should look at beyond tutorials.
I would look at niche-language LLMs on Hugging Face; you'd probably want to see if they reported hyperparameters or logged anything on Weights & Biases. Also, just cold-contact their creators, the community is super nice.
Chinchilla scaling laws say about 20 tokens per param, though that may be outdated now, and standard practice these days is to train on as many tokens as possible, like hundreds to thousands of tokens per param.
I'm not exactly sure how much code you want to write yourself, but you could try some small tweaks to the standard transformer model. You'll be using the same prebuilt optimisers anyway, though maybe try Moonshot's one (Muon)!
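As a concrete example of the kind of tweak I mean, a bare-bones pre-norm decoder block in plain PyTorch (sizes arbitrary) is small enough that swapping the norm, activation, or attention variant is a contained experiment:

```python
# A bare-bones pre-norm decoder block in PyTorch: the kind of place where
# "small tweaks" (norm type, activation, attention variant) slot in.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model: int = 512, n_head: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)  # tweak idea: swap for RMSNorm
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(           # tweak idea: SwiGLU instead of a GELU MLP
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        # Boolean causal mask: True marks positions that may not be attended to
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.norm1(x)
        a, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + a
        x = x + self.mlp(self.norm2(x))
        return x

x = torch.randn(2, 16, 512)
print(Block()(x).shape)  # torch.Size([2, 16, 512])
```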
This will also depend on your course, which could be anything from computational linguistics to straight maths, idk. For the former, more traditional n-gram models would be cooler; for the latter, you'd probably want something more experimental.
Oh, and also don't underestimate how much a 24/7 GPU fan will drive you insane; there's a reason my fine-tuning rig is stuffed in a garage and I just SSH into it.
Healthy-Nebula-3603@reddit
Literally, you can just ask Gemini 2.5 Pro or GPT-5 Thinking for this...
Charming_Barber_3317@reddit (OP)
Human responses are more helpful sometimes. Also, if we start asking LLMs for everything, then what is the point of Reddit?
Figai@reddit
Adding to your point, Reddit is the most-cited website for most LLMs lol. So whether a human answers you or not, it all leads back to Reddit, either as training data or as a direct source.
Monkeylashes@reddit
Check this out first; you can also play with the TinyStories models to get a feel for what is achievable.
https://arxiv.org/abs/2305.07759
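The TinyStories data is also on Hugging Face, so you can poke at it in a couple of lines to calibrate what a ~10-30M param model can learn. The dataset ID below is the commonly used one (roneneldan/TinyStories); verify it before relying on it.

```python
# Peek at TinyStories to see the kind of text tiny models can model well.
from datasets import load_dataset

ds = load_dataset("roneneldan/TinyStories", split="train")
print(len(ds), "stories")
print(ds[0]["text"][:300])
```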
Languages_Learner@reddit
This project lets you build a tiny LLM: tekaratzas/RustGPT, a transformer-based LLM written completely in Rust. You can scale it to a bigger size by using a different question-answer dataset for your preferred language. I successfully ported it to C# with the help of Gemini 2.5 Pro, so I think it can be ported to C, C++, Python, Go, etc.