Why is there no community project for training your own LLM from scratch on consumer hardware?

Posted by tevlon@reddit | LocalLLaMA | View on Reddit | 69 comments

ok so this has been bugging me for a while. We've got nanoGPT/nanoChat from Karpathy which is honestly great and I'd point anyone to it. But here's the thing: to actually follow along and get real results you still end up renting cloud GPUs. And not everyone wants to drop $80+ on cloud compute just to mess around and learn. That barrier alone keeps a ton of curious people out imo.

So why isn't there a project (or even just a solid tutorial) built around one hard rule: **it has to train on 8GB of VRAM. no cloud, no rented A100s.** if it doesn't fit on a normal gaming GPU it doesn't count.

The dream is a small but actually-real model trained on something like a Wikipedia dump, with a full writeup walking through the whole pipeline. And here's the part I really want: it should use the modern tricks people keep hyping but rarely bundle into one beginner-friendly thing. stuff like:

* BitNet / low-bit training to crush the memory footprint
* the Muon optimizer instead of plain old AdamW (apparently like 2x more compute efficient + decent memory savings, sounds perfect for a tight VRAM budget)
* aggressive quantization to stay inside 8GB
* whatever else helps squeeze a trainable model onto consumer hardware

basically nanoGPT's vibe but with a hard "must run on your gaming PC" constraint and a modern technique stack, so anyone can train a model end to end for free.

so my questions:

1. does this already exist and I just haven't found it? if so please link
2. if not... anyone wanna build it together?
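A rough way to sanity-check the 8GB rule is to count bytes per parameter. The sketch below is a back-of-the-envelope estimate, not a measurement: it assumes weights, gradients, and optimizer buffers are all held in fp32, that AdamW keeps two buffers per parameter while a momentum-only optimizer (as Muon is assumed to be here) keeps one, and that half of VRAM is reserved for activations. The function names and the 0.5 activation fraction are illustrative choices, not figures from the post.

```python
# Back-of-the-envelope VRAM budget for training; every byte count is an assumption.
BYTES_FP32 = 4

def bytes_per_param(optimizer_buffers: int) -> int:
    """Weights + gradients + optimizer buffers, all kept in fp32."""
    return BYTES_FP32 * (1 + 1 + optimizer_buffers)

def max_params(vram_gib: float, optimizer_buffers: int,
               activation_fraction: float = 0.5) -> int:
    """Largest parameter count that fits after reserving VRAM for activations."""
    usable = vram_gib * 1024**3 * (1.0 - activation_fraction)
    return int(usable // bytes_per_param(optimizer_buffers))

ADAMW_BUFFERS = 2      # first and second moment estimates
MOMENTUM_BUFFERS = 1   # a single momentum buffer (assumed for a Muon-style optimizer)

for name, buffers in [("AdamW", ADAMW_BUFFERS), ("single-momentum", MOMENTUM_BUFFERS)]:
    n = max_params(8, buffers)
    print(f"{name}: {bytes_per_param(buffers)} bytes/param, ~{n / 1e6:.0f}M params in 8 GiB")
```

Under these assumptions the optimizer choice moves the ceiling by roughly a third, while the activation reserve (which depends on batch size and sequence length) dominates the rest of the budget.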

69 Comments

Gunnarz699@reddit

>Why is there no community project for training your own LLM from scratch on consumer hardware?

There is. PyTorch is almost 10 years old now...

tevlon@reddit (OP)

it's for educational purposes. nanochat is a good beginner repo. I am looking for nanochat-advanced. I want to see how far we can get when we squeeze all the optimization techniques into one project. INCLUDING BitNet

Gunnarz699@reddit

Why? Why would anyone devote an entire dev team's effort, time, and focus to this only to chop off both legs when running training inference? You're talking about thousands of dev hours put into a fundamentally physics-limited project.

>nanochat is a good beginner repo

I think you're confused about what you're asking for. Are you looking for an LLM or a model creation architecture? You can go make another nanochat exactly the same way that project did. They used PyTorch and so would you...

You're asking why no one builds a mass market car in their driveway. Yes, it's possible. No, it's not worth it. Yes, people spend lots of effort and time on bespoke solutions; no, they're not going to make an open source Honda Civic when it already exists.

There is a (MASSIVE) reason why everyone just trains finetunes of existing models.

Silver-Champion-4846@reddit

Because... we want a model we can actually use?

Gunnarz699@reddit

Have you seen qwen3.6-27b? Qwen3.5-4b? GEMMA4? Nemotron? LFM? Ministral? Llama? There are literally hundreds of FANTASTIC models available to run locally right now. Why would you try to develop another to compete while hamstringing yourself on training inference so severely?

Silver-Champion-4846@reddit

None of those can, other than qwen3.5 4b, run on 8gb ram 8th gen intel core I5 U type cpu, and even that one only gives 1-2t/s without being worth it for my needs.

Gunnarz699@reddit

* Qwen3.5-4b:IQ4_xs
* GEMMA4:e2b_Q4_k_m
* Nemotron 3b:Q4_k_m
* LFM-2.5
* Ministral 3b or 8b quantized

Literally all of these families have edge device capable models and ALL of them are more capable than anything cooked up on a consumer card. Your "needs" are fundamentally incompatible with reality.

Silver-Champion-4846@reddit

And you don't believe Bitnet.cpp could work better if made more accessible?

hopbel@reddit

>educational purpose

>I am looking for nanochat-advanced

You don't learn anything by asking to be spoonfed code.

Formal-Exam-8767@reddit

So, how many years are you willing to wait for an LLM to finish training on a single 8GB VRAM card?

UnethicalExperiments@reddit

Why is this sub so bloody hostile? I was recommended to check this out by another user to get a better idea of how to make the most of my hardware. Half of this sub is whoring cloud services (when mentioning it's localllama I got told I was apparently dick riding local hosting). The other half is just hostility towards anything not discussing what's in vogue.

unknowntoman-1@reddit

Some folks don’t see the "Local" in the name of LocalLLaMA. Probably skilled professionals but lost in context…

darkpigvirus@reddit

Search YouTube for Andrej Karpathy building GPT-2

Hopeful_Ad6629@reddit

Oddly enough, Claude Code and I built an LLM system written in dotnet. We were able to create an LLM from scratch and train it on 24 GB (2 x 12 GB 2060s). Took it a while to train but I was able to make it work :)

Voxandr@reddit

show us its output for lols please.

Aromatic-Low-4578@reddit

Probably because most people interested in local llms have more than 8gb of vram. Both unsloth and HF have put tons of effort into excellent learning materials about all of this. I'm really not sure what more you'd want.

CubicleHermit@reddit

There are decent local LLMs that run on 8GB of VRAM, especially for basic text-processing tasks (vs. coding)

NicolaZanarini533@reddit

The main reason is that, outside of research projects or learning, there is really no point in doing it. While not impossible, any model you could train on a single 8GB card will be a toy model by today's standards, and useful only either as an experiment to test a specific hypothesis or as a baseline for ablation on that hypothesis. I would also think that there is no "dedicated repo for the project" because once you know enough to do it, it becomes almost trivial and not worth creating a community repo for (again, what would be the point of a tiny LLM today outside of research?)

carloselieser@reddit

This is literally what Unsloth is for.

stiflers-m0m@reddit

Ton of how to youtubes. I followed one 3 or 4 years ago on using Shakespeare books. You will need a large training sample..... I also did a jetson nano dog or not dog pictures... That was fun

tevlon@reddit (OP)

I am specifically asking, because we have since then mHC, TurboQuant, BitNet, MTP. There is no project that is like nanochat + all those techniques combined. I will do it. Apparently, no one seems to understand the relevance.

Gunnarz699@reddit

>Apparently, no one seems to understand the relevance.

If no one else thinks it's worth doing there's a 0.01% chance you're about to be a brilliant visionary and a 99.99% chance you're just missing something critical... Good luck!

FreQRiDeR@reddit

The thing is, you have to build a big model first before you can quantize it.

dash_bro@reddit

? I thought nanochat IS this! Also, would recommend a light reading of anything unsloth for training and RL. Swap out models for the size you can run locally (0.6B qwen if needed). You're going through the motions to learn and correct your understanding only. Beyond that, the truth is you'll either need to curate or create after that. I'm not sure it's worth your time since anyone who needs it already can likely spend the $$ to do the H100 runs or is ingrained into llama.cpp enough that they can take up PRs with patches, optimize or patchfix it for their usecases before an official release is out.

stonetriangles@reddit

- BitNet models train slower than bf16
- Muon is already in nanochat
- Quantized models train slower than bf16
- If it works it's already in nanochat

tevlon@reddit (OP)

TurboQuant? MTP? DFlash?

Polite_Jello_377@reddit

Buzzword bingo 🙄

dinerburgeryum@reddit

You’re throwing buzzwords at the wall now. These things are inference-time optimizations and are not of concern for training. You’d have to train MTP tensors alongside your model for example. DFlash is a bolt-on to my knowledge but it is itself a diffusion model that would have to be trained. TurboQuant is a compression algorithm only marginally suited for use in the KV cache during inference.

Snoo_81913@reddit

I was gonna say TurboQuant has nothing to do with training; it's a compression scheme for your KV cache, and there's a fork of llama.cpp that supports it. MTP for a model under 1GB doesn't really make sense, and I'm not sure a small model could even use it. But you could do bit token (not sure that's right tbh) training, which does 4-token training (might be different, but it's multi-token training at least). I know that's not the same as MTP, but I'm not sure a small model could think hard enough for MTP.

tevlon@reddit (OP)

I know. I know. Training is only part of the project. Like nanochat, i would like to have an equally optimized inference (including TurboQuant).

Snoo_81913@reddit

There is LLM From Scratch.

tevlon@reddit (OP)

Link?

Snoo_81913@reddit

https://github.com/angelos-p/llm-from-scratch/tree/main

tevlon@reddit (OP)

Cool! That's what i was looking for. BUT, it's missing all the optimizations, which is not a deal breaker. I might fork it and then build on top 😎 Thanks!

Snoo_81913@reddit

Or just download unsloth and use it

tevlon@reddit (OP)

but that's just fine-tuning. I don't know why everyone is mentioning unsloth. Can i train from scratch with BitNet and mHC and infer with TurboQuant?

llama-impersonator@reddit

finetuning is the same as pretraining. there is zero technical difference other than chat template stuff you don't need to mess with for pretraining.

Snoo_81913@reddit

You can do pre-training with Unsloth. You just need a blank model so just build out a GPT-2 model like in the link I gave you and use Unsloth to train it

Snoo_81913@reddit

The main page will say "using a Macbook" but you can literally build it with anything. I did it on just a regular old laptop PC.

giveen@reddit

Look up Axolotl

tevlon@reddit (OP)

i am talking about training from scratch. not fine-tuning.

llama-impersonator@reddit

you can pretrain with axolotl, have you even looked at it?

giveen@reddit

Oh I see, yeah, I honestly dont know where to start

llama-impersonator@reddit

it's actually cheaper to drop $80 on cloud gpus as LLM training is compute bound and that node of B200 is many, many times faster than your local shit gpus.

BidWestern1056@reddit

npcpy has all the tools [https://github.com/npc-worldwide/npcpy](https://github.com/npc-worldwide/npcpy) training some fine tunes right now for improving npcsh [https://github.com/npc-worldwide/npcsh](https://github.com/npc-worldwide/npcsh)

iMrParker@reddit

Both nano chat and nano gpt are possible on consumer hardware with smaller batches. And you're also not required to use the datasets provided in the readme. I honestly think tinkering and changing those repos are the best way to learn rather than find one that is plug and play. Otherwise you're basically just putting in commands and learning nothing

tevlon@reddit (OP)

Not a BitNet model, though. We can squeeze more on 8GB. Nanochat is a solid base, but no TurboQuant for example. no MTP.. There are tons of optimizations that it's missing

Such_Advantage_6949@reddit

Train it yourself and let us know if your model is better than any current gen popular model

tevlon@reddit (OP)

that's not the point. The only comparisons that make sense are Qwen 0.6B and nanochat.

Such_Advantage_6949@reddit

Then are those models of much practical use?

iMrParker@reddit

You could replace GPT-2's linear layers with BitNet, and possibly add other optimizations. But this probably doesn't exist because of the work required to make an arch like this for < 8GB weights. Could be a fun project, even if it's a bit niche and tedious.
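To make the "swap the linear layers" idea concrete, here is a minimal pure-Python sketch of a ternary linear layer's forward pass. It assumes the absmean weight scheme described for BitNet b1.58 (divide by the mean absolute weight, round, clip to {-1, 0, +1}) and leaves out activation quantization and everything needed for training. The names `absmean_ternary` and `bitlinear_forward` are made up for this example and are not from any library.

```python
def absmean_ternary(weights):
    """Quantize a weight matrix to {-1, 0, +1} with one shared scale (absmean scheme)."""
    flat = [abs(w) for row in weights for w in row]
    scale = sum(flat) / len(flat) + 1e-8
    ternary = [[max(-1, min(1, round(w / scale))) for w in row] for row in weights]
    return ternary, scale

def bitlinear_forward(x, weights):
    """Compute y = (W_ternary @ x) * scale; the inner loop needs only adds and subtracts."""
    ternary, scale = absmean_ternary(weights)
    out = []
    for row in ternary:
        acc = 0.0
        for xi, t in zip(x, row):
            if t == 1:
                acc += xi
            elif t == -1:
                acc -= xi
        out.append(acc * scale)
    return out

W = [[0.9, -0.05, -1.1], [0.02, 0.6, -0.7]]   # toy full-precision weights
x = [1.0, 2.0, 3.0]
print(absmean_ternary(W)[0])    # ternary weight matrix
print(bitlinear_forward(x, W))  # layer output
```

The memory win shows up at inference time, where only the ternary matrix and one scale need to be stored per layer.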

fulgencio_batista@reddit

Trained a 131M parameter model on an 8GB RTX 3070 on about 2B tokens in ~40 hours (optimized code). It's really not hard. I don't think it's talked about because training an architecture like GPT-2 is fairly trivial for anybody who has some coding and linear algebra under their belt. And it's not really exciting: 40 hours of training for a model that predicts text. OK.
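Those figures can be turned into a rough throughput estimate. The snippet below only does arithmetic on the numbers quoted above; the "6 FLOPs per parameter per token" training cost is a common rule of thumb and is an assumption here, not something the commenter reported.

```python
params = 131e6   # model size quoted above
tokens = 2e9     # training tokens quoted above
hours = 40       # wall-clock time quoted above

seconds = hours * 3600
tokens_per_second = tokens / seconds

# Rule-of-thumb assumption: ~6 FLOPs per parameter per token for forward + backward.
total_flops = 6 * params * tokens
sustained_tflops = total_flops / seconds / 1e12

print(f"{tokens_per_second:,.0f} tokens/s")
print(f"{total_flops:.2e} total FLOPs")
print(f"{sustained_tflops:.1f} TFLOP/s sustained")
```

Scaling `params` and `tokens` in this snippet gives a quick feel for how training time grows on the same card if throughput in FLOP/s stays roughly constant.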

Double_Cause4609@reddit

There is. Nanochat and the NanoGPT speedrun both support different values of gradient accumulation to run on lower end hardware and they explicitly document it. If you design something new, what's going to happen is you're going to optimize it for your own hardware, and somebody will have to...Tune the hyperparameters to run it on their hardware. Just like you could have tuned the gradient accumulation on nanochat, etc.
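For anyone unfamiliar with gradient accumulation, the idea is that averaging gradients over several small micro-batches gives the same update as one large batch, so peak memory is set by the micro-batch size. The toy below shows this for a one-parameter model in plain Python; it is not taken from nanochat or the NanoGPT speedrun, the function names are invented for the example, and it assumes the batch divides evenly into micro-batches.

```python
def mse_grad(w, batch):
    """Gradient of mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, batch, micro_batch_size):
    """Average the gradient over equal-sized micro-batches, one at a time."""
    chunks = [batch[i:i + micro_batch_size]
              for i in range(0, len(batch), micro_batch_size)]
    total = 0.0
    for chunk in chunks:
        # Only one micro-batch needs to be "in memory" at this point.
        total += mse_grad(w, chunk) / len(chunks)
    return total

data = [(float(x), 3.0 * x) for x in range(1, 9)]  # 8 samples from y = 3x
w = 0.5
full = mse_grad(w, data)
accum = accumulated_grad(w, data, micro_batch_size=2)
print(full, accum)
```

The trade-off is wall-clock time: four micro-batches of 2 take four forward/backward passes to produce the update that one batch of 8 would give in a single pass.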

tevlon@reddit (OP)

Not a BitNet model, though

stonetriangles@reddit

BitNet will train at the same speed as standard bf16. It's a meme.

tevlon@reddit (OP)

Speed is not the main objective. It's fine. It can take longer. The constraint is the memory of 8GB

Double_Cause4609@reddit

Could you clarify why you're focused on Bitnet? Bitnet uses QAT. That means that basically you're using a bit more memory than normal for training, so that you can pretend that the numbers perform like they will when they're quantized. Why do you want to use more memory and train slower when you're training on constrained hardware. Can you explain?

dinerburgeryum@reddit

Can you train a BitNet model without using FP32 for backprop? I didn’t think you could train ternary weights directly without having severe vanishing gradient problems.
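A toy illustration of why quantization-aware training usually keeps a higher-precision "latent" copy of each weight: if small updates are applied straight to a ternary value, they are rounded away every step. This sketch uses a single weight and a hypothetical constant gradient; it does not settle whether FP32 specifically is required, only why some higher-precision accumulator helps.

```python
def ternarize(w):
    """Round a single weight to the nearest value in {-1, 0, +1}."""
    return max(-1, min(1, round(w)))

# Hypothetical setup: a constant gradient that keeps pushing the weight upward.
lr, grad, steps = 0.01, -1.0, 80

# (a) Update the ternary weight directly: each 0.01 step is rounded back to 0.
direct = 0
for _ in range(steps):
    direct = ternarize(direct - lr * grad)

# (b) Keep a high-precision latent weight and quantize only for the forward pass.
latent = 0.0
for _ in range(steps):
    latent -= lr * grad   # straight-through style: the update goes to the latent weight
forward_weight = ternarize(latent)

print(direct, round(latent, 2), forward_weight)
```

In case (a) the weight never leaves 0, while in case (b) the latent value accumulates the small steps until the quantized weight flips. That extra latent copy is the memory overhead mentioned in the comment above.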

BoogerheadCult@reddit

The people who can do it probably already work at, or have been poached by, big AI firms making 500k to 7 figures. Why would they want to do all that for free? Not to mention the hardware required: you need a huge hardware farm that only FAANGs can afford. Not even small to mid-size companies can do that from scratch. So if you just follow the money, you will get the answers.

Valuable_Touch5670@reddit

This is actually a very interesting idea. Even though I have a hobby project at hand, I felt the itch to stop it and give this one a try haha.

I tried building a LLM from scratch before, **for fun**. It was a GPT-2 style character-level LLM that writes Chinese novels, inspired by the incredible Andrej Karpathy. But it was long ago and, like you said, there were so many incredible advances in the field since then.

I recently read about the HRM architecture and its efficiency. I have **limited knowledge**. But what if we can:

1. Somehow make BitNet work with HRM.
2. Use some or all the techniques you mentioned.
3. Train on a relatively small, but high quality dataset.

Then this may work lol

tevlon@reddit (OP)

Thank you! Finally, someone who gets it!

CreamPitiful4295@reddit

Unsloth

andy_potato@reddit

It wouldn't be more than an interesting academic exercise. With all the open licensed LLMs available, there is little to no real-world demand for a model like that. So why would anyone bother?

tevlon@reddit (OP)

same argument could be applied to nanochat. It's for educational purposes, obviously.

galibert@reddit

Because the data needed to train a performant model is enormous in volume and I suspect in large part illegal (from a copyright violation POV)

tevlon@reddit (OP)

did you read my post? i was talking about an educational project .. not a performant model

ketosoy@reddit

Kaggle is a partial answer

Lissanro@reddit

It would be an interesting project. I think to make it actually useful it would have to focus on training various small specialized LLMs, possibly with some common training dataset for general knowledge. But the main issue is that with one GPU it is not practical to train even a 0.6B model if it is a general purpose one. A project to train your own LLM from scratch may also benefit from having benchmarks against fine-tuning existing general models, not necessarily to demonstrate beating them, but to provide a comparison of what kind of results to expect and how much room there is for improvement. So even if it ends up far from a modern 0.6B model (like Qwen 3.5), I think it would still be a very educational comparison, to know what is possible and what is yet to be achieved using just one home GPU. Anyway, just sharing my ideas and suggestions. Myself, I only got as far as fine-tuning existing models, mostly when I needed to do some tasks in bulk and prompt engineering alone wasn't enough and large models were not fast enough for the task I had.

AustinSpartan@reddit

Why bother?
