Stanford's CS336 2025 (Language Modeling from Scratch) is now available on YouTube
Posted by realmvp77@reddit | LocalLLaMA | 26 comments
Here's the CS336 website with assignments, slides, etc.
I've been working through it for a week and it's one of the best courses on LLMs I've seen online. The assignments are huge and very in-depth, and they require you to type a lot of code from scratch. For example, the first assignment PDF is 50 pages long and has you implement a BPE tokenizer, a simple transformer LM, cross-entropy loss, and AdamW, then train models on OpenWebText.
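For a taste of what that "from scratch" means, here's a minimal sketch of two of those pieces, a numerically stable cross-entropy and a single AdamW update. This is my own rough illustration of the kind of code the assignment asks for, not the course's reference solution:

```python
import torch

def cross_entropy(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # logits: (batch, vocab_size); targets: (batch,) of class indices.
    # Subtract the row max before exponentiating for numerical stability.
    logits = logits - logits.max(dim=-1, keepdim=True).values
    log_probs = logits - logits.exp().sum(dim=-1, keepdim=True).log()
    return -log_probs[torch.arange(targets.shape[0]), targets].mean()

@torch.no_grad()
def adamw_step(p, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    # One Adam step with decoupled weight decay for a single tensor p,
    # with persistent moment buffers m, v and step count t (starting at 1).
    p.mul_(1 - lr * wd)                                # decoupled weight decay
    m.mul_(b1).add_(p.grad, alpha=1 - b1)              # first-moment EMA
    v.mul_(b2).addcmul_(p.grad, p.grad, value=1 - b2)  # second-moment EMA
    m_hat = m / (1 - b1 ** t)                          # bias correction
    v_hat = v / (1 - b2 ** t)
    p.add_(m_hat / (v_hat.sqrt() + eps), alpha=-lr)
```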
Lazy-Pattern-5171@reddit
Finally. Anyone want to race to the finish on this one? We can track goals and metrics on Discord. First one to a SOTA 1B model wins $1000. You can't have prior LLM experience or have already watched and implemented Karpathy's videos, obviously, but using AI should be allowed, so my guess is that eventually systems will align.
Expensive-Apricot-25@reddit
You’re not going to be able to make a state of the art 1B model.
Lazy-Pattern-5171@reddit
What’s the largest I can hope to make realistically?
Expensive-Apricot-25@reddit
If you have a dedicated mid-to-high-range consumer GPU, probably around 100-200 million parameters. I'd say around 20-50 million is more realistic, though, since you can train that in a matter of hours rather than days (back-of-envelope numbers below).
That's not the problem, though. The problem is thinking you're going to make a "state of the art model"; that is not going to happen.
There are teams of people with decades of experience and access to thousands of industrial GPUs who get paid massive amounts of money to do this. There is no way you are going to be able to compete with them.
You need huge amounts of resources to make these models; that's why only huge companies are the ones able to release open-source models.
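Those size estimates check out roughly, under some assumptions of mine: the standard ~6 × params × tokens FLOPs estimate for training cost, a Chinchilla-style ~20 tokens per parameter, and ~20 TFLOP/s of *achieved* throughput on a single 3090 (well under peak):

```python
def train_hours(params, tokens_per_param=20, achieved_flops=20e12):
    # Training cost ~ 6 * N * D FLOPs (forward + backward),
    # where D = tokens_per_param * N; throughput is an assumed figure.
    flops = 6 * params * (tokens_per_param * params)
    return flops / achieved_flops / 3600

print(f"50M model:  ~{train_hours(50e6):.0f} hours")   # a few hours
print(f"200M model: ~{train_hours(200e6):.0f} hours")  # a few days
```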
man-o-action@reddit
It's not about making state of the art. It's about learning from first-hand experience, learning by doing.
Expensive-Apricot-25@reddit
They specifically listed making a state-of-the-art model as their goal.
man-o-action@reddit
Sorry, didn't see that. That's stupid.
Lazy-Pattern-5171@reddit
I’ve the classic 2x3090
Expensive-Apricot-25@reddit
Oh wow, that's really good, but you're still going to be bottlenecked by compute, not memory; training uses way more compute than inference does (quick numbers below).
But again, you are not going to make a SOTA model. That's the main issue.
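The "way more compute" claim follows from a standard rule of thumb (my framing, not the commenter's): a forward pass costs roughly 2N FLOPs per token for an N-parameter model, and the backward pass roughly another 4N, so every training token costs about 3x an inference token, before even counting the billions of tokens training needs:

```python
N = 124e6       # hypothetical GPT-2-small-sized model
fwd = 2 * N     # inference FLOPs per token (rule of thumb)
train = 6 * N   # forward + backward FLOPs per token
print(f"inference ~{fwd:.1e} FLOPs/token, training ~{train:.1e} ({train/fwd:.0f}x)")
```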
Lazy-Pattern-5171@reddit
Can I make a SOTA 100M? I want to give myself a constraint motivating enough to bet $1000 on myself and actually finish. That's why dreaming of the leaderboard seems like the only goal worth talking about right now.
sleepy_roger@reddit
Honestly, I wouldn’t take Expensive-Apricot’s comments too seriously. If you dig into their history, it’s clear they speak with a lot of certainty on topics they don’t necessarily have deep experience in. The kind of black-and-white thinking they’re showing ("you can’t do X," "you won’t make Y") is exactly what kills innovation before it starts.
You’ve already shown you're open to feedback and willing to iterate, which is half the battle in this space. 2x3090s is plenty to do some serious work. You might not build a model that dethrones GPT-4, but setting an ambitious goal, learning along the way, and seeing how far you can push a 100M or even 500M model is absolutely worthwhile.
Don’t let people with rigid mindsets set your ceiling. Just make sure you're getting feedback from folks who actually build things and always look at their history before treating what they say as gospel.
Keep going. You’re asking the right questions.
Expensive-Apricot-25@reddit
No, you’re not.
Again, there are companies that hire full teams of people with decades of experience and effectively infinite compute, working on this 24/7.
You don’t even have any experience. You simply can’t compete.
Remember, SOTA means better than everything else, not “using SOTA techniques”.
Lazy-Pattern-5171@reddit
Fair. What would be a good challenge then that’s, you know, actually a challenge?
Expensive-Apricot-25@reddit
Make your own model completely from scratch that can actually produce legible output and handle basic Q&A (a sketch of how you'd even check that is below).
Trust me, this is harder than you think.
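The eyeball test for "legible output" is just sampling from the trained model. A minimal sketch, assuming a causal LM `model(ids) -> logits` and the `encode`/`decode` tokenizer pair you'd have written for the course (all names hypothetical):

```python
import torch

@torch.no_grad()
def sample(model, encode, decode, prompt, max_new=100, temperature=0.8):
    # Autoregressive sampling: feed the sequence, take the last position's
    # logits, draw the next token, append, repeat.
    ids = torch.tensor([encode(prompt)])
    for _ in range(max_new):
        logits = model(ids)[:, -1, :] / temperature
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)
    return decode(ids[0].tolist())
```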
Lazy-Pattern-5171@reddit
Well. I hope I don’t find out that this whole LLM thing has been a conspiracy all along and we have paid actors typing out responses.
Expensive-Apricot-25@reddit
I know you're making a joke here, but I think you're vastly underestimating just how technical and resource-intensive this stuff is.
Let me know how it goes.
Lazy-Pattern-5171@reddit
Gladly. Whether I can digest this material or it turns out to be a colonoscopy, I’ll let you know either way.
Dudmaster@reddit
But it would be well worth the $1000 bounty
realmvp77@reddit (OP)
Just as a warning: even though the course is called "Language Modeling from Scratch", it ramps up pretty fast, so it's not meant for total beginners. I wouldn't go into it without some basic LLM knowledge. I read Sebastian Raschka's "Build a Large Language Model (From Scratch)" and thought it was great prep for this course. Karpathy's playlist is great too; I watched that before reading the book.
Lazy-Pattern-5171@reddit
All the more reason to race to the finish line then. I'd find out faster whether it's for me or not.
fandogh5@reddit
Is it finished?
realmvp77@reddit (OP)
yes, all the lectures and assignments are there
Kathane37@reddit
https://www.amazon.com/Build-Large-Language-Model-Scratch/dp/1633437167
I've started digging into this book. Do you think I also need to watch the lectures, or will I be fine with just the book?
realmvp77@reddit (OP)
I just finished that book and it's great. You should read the links in the appendix too and do the bonus sections on GitHub. CS336 goes deeper than the book and requires you to write lots of code on your own, so if you want to study further, read the book first and then do CS336.
Sea-Rope-31@reddit
Thanks for sharing!
Accomplished_Mode170@reddit
Will check later; I love 3Blue1Brown's visuals in particular, so I'm interested in similar treatments for NSA, since sparsity itself seems fundamental to reasoning (read: spline-fitting the circuit).