TheaterFire

Stanford just dropped 5.5hrs worth of lectures on foundational LLM knowledge

Posted by igorwarzocha@reddit | LocalLLaMA | View on Reddit | 74 comments

Enjoy? [https://www.youtube.com/@stanfordonline/videos](https://www.youtube.com/@stanfordonline/videos)



laststand1881@reddit

Gr8 thnx for sharing
View on Reddit #79063833

Bloggter@reddit

Yeah i saw those videos, at first I couldn't understand it but I kept going and it mindleaped to me.
View on Reddit #75014520

Puzzleheaded_Toe5074@reddit

Open source knowledge! Thanks for sharing this!
View on Reddit #71443414

Significant_End_5190@reddit

Awesome
View on Reddit #71242636

ElonAltmann@reddit

thx for sharing this
View on Reddit #71221036

shervinea@reddit

Thank you u/igorwarzocha for sharing! Afshine and I are very excited to teach this class and hope the resulting material will be helpful to a maximum amount of folks. Here's the course website in case you want a single landing page with all the pointers: https://cme295.stanford.edu We are continuously updating it with recordings, slides (even exams!) as they come out. Cheers!
View on Reddit #69403535

Raikoya@reddit

Hey there, just wanted to say thank you for putting this online. As someone who finished school some time ago, these videos (and the exam as well!) are helpful to get up to speed on concepts which were a bit fuzzy for me, like RoPE. Looking forward to watching the rest of the class! From a fellow French school alumnus
View on Reddit #70871994

shervinea@reddit

Thank you for your kind words, Afshine and I truly appreciate it. Glad to hear it's helpful. We hope you'll enjoy the rest of the lectures as well! Best wishes :-)
View on Reddit #70873468

bunny_go@reddit

Because saying "released" is so 2024, we have to say "dropped", which can mean either abandoned/deleted or released. Well done.
View on Reddit #70407631

absolutxtr@reddit

Ty!
View on Reddit #69783854

blnkslt@reddit

The sound quality is awful. Stanford still has not figured out how to use a mobile phone, apparently.
View on Reddit #69716574

TimeTravellingToad@reddit

I own the text book associated with the course. It's way too superficial to use standalone and feels like they rushed it out to meet their course deadline.
View on Reddit #69594654

DistanceSolar1449@reddit

I just scrubbed through the videos. It's not digging all the way down into the math, so you don't really need much linear algebra knowledge to understand it. Mostly talking about architecture stuff.

It's a medium level overview of:
- tokenization
- self attention
- encoder-decoder transformer architecture
- RoPE
- layernorm
- decoder only transformer architecture
- MoE routing
- N+1 token prediction
- ICL/CoT
- KV Cache, GQA, paged attention, MLA (which only deepseek really does), spec decode, MTP

It's not quite a high level overview, since it goes a bit deeper in some parts. But it's basically 0 math and is not a deep dive, so it's not teaching you much there. If you've heard of these concepts before, you can generally skip these videos.
View on Reddit #69228504

UnfairSuccotash9658@reddit

Then where can I learn these deeply?
View on Reddit #69233234

appenz@reddit

Ex Stanford student here. The in-depth computer science version with math would be Chris Manning's CS224N. It's an excellent class and taken by a good fraction (30% or so) of all undergrads of all majors. [Online lectures here.](https://m.youtube.com/watch?v=DzpHeXVSC5I)
View on Reddit #69239803

HustlinInTheHall@reddit

That man is living in dongle hell.
View on Reddit #69541938

Limp_Classroom_2645@reddit

> Thank you for your interest. This course is not open for enrollment at this time. Click the button below to receive an email when it becomes available.

excuse me wtf?
View on Reddit #69254051

appenz@reddit

You can’t enroll (i.e. get course credit and make it count towards a Stanford degree). You probably don’t want to pay the tuition, so I am guessing that’s fine. You can view lectures on YouTube.
View on Reddit #69263955

IrisColt@reddit

Thanks for the superb insight!
View on Reddit #69254529

UnfairSuccotash9658@reddit

Thanks man! Really appreciate it!! I'll look into it!!
View on Reddit #69242249

jointheredditarmy@reddit

The same videos, but after you do a quick refresher on your linear algebra.
View on Reddit #69237352

ParthProLegend@reddit

Where can I learn and refresh that thoroughly?
View on Reddit #69239811

full_stack_dev@reddit

> quick refresher on your linear algebra

Here: https://linear.axler.net/LinearAbridged.html
View on Reddit #69240841

jdjsjndjejdbdh@reddit

"Linear Algebra Abridged" and it's 145 pages, oof.
View on Reddit #69478398

ParthProLegend@reddit

Thanks man♥️
View on Reddit #69287014

layer4down@reddit

IMHO the best online explainers on this are by 3Blue1Brown on YouTube: https://youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&si=m8TYsIDJ-Pn2LwMn
View on Reddit #69260568

ParthProLegend@reddit

Damn I have subscribed to him already, thanks for the playlist though.
View on Reddit #69286801

UnfairSuccotash9658@reddit

Thank you!
View on Reddit #69242352

_raydeStar@reddit

*PTSD flashbacks from college*
View on Reddit #69237712

KingoPants@reddit

The papers for all these are freely available on arXiv. There is plenty of code you can look at too on GitHub and Huggingface.

The only complicated one is MLA, since you need to understand why a latent space would be a good way to compress the KV cache; the rest aren't very complex tbh.

Of course you need some background in programming and linear algebra. But honestly if these statements:

* "A dense layer is an affine map from R^N to R^M"
* "An orthonormal matrix is a rotation matrix (+ possibly a reflection)"

are meaningful to you, then that's good enough to understand most things. You don't see complex linear algebra appear too often. Only the Muon optimizer is a bit complex, with its use of odd polynomial forms of matrices.
View on Reddit #69239949
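
Editor's note: the two statements quoted in the comment above can be checked numerically. What follows is only a minimal sketch in plain Python; the helper names (`dense`, `rotation`, `matvec`, `norm`) and all the numbers are made up for illustration and do not come from the lectures or from any library.

```python
import math

def dense(W, b, x):
    """Affine map x -> W x + b, where W has M rows of length N."""
    return [sum(w * v for w, v in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def rotation(theta):
    """2x2 rotation matrix, one example of an orthonormal matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def matvec(Q, x):
    """Plain matrix-vector product."""
    return [sum(q * v for q, v in zip(row, x)) for row in Q]

def norm(x):
    """Euclidean length of a vector."""
    return math.sqrt(sum(v * v for v in x))

# A dense layer from R^3 to R^2 (N = 3, M = 2); weights are arbitrary.
W = [[1.0, 0.0, 2.0],
     [0.0, -1.0, 0.5]]
b = [0.5, -0.5]
x = [1.0, 2.0, 3.0]
y = dense(W, b, x)  # [7.5, -1.0]

# An orthonormal matrix changes direction but not length.
Q = rotation(math.pi / 3)
v = [3.0, 4.0]
rotated = matvec(Q, v)
print(y, norm(v), norm(rotated))
```

The second half is the property that matters in practice: multiplying by an orthonormal matrix leaves vector lengths (and dot products) unchanged, up to floating-point error.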

Thrumpwart@reddit

You're just making words up now.
View on Reddit #69263001

UnfairSuccotash9658@reddit

Thanks a lot!! I really appreciate the information, and yes, I do understand these. I'll look into the papers; I guess reading papers is the only thing stopping me from learning deeply. Thanks again!
View on Reddit #69243415

HugoCortell@reddit

I guess you start off with the easy stuff, then learn by doing and making models.
View on Reddit #69239305

UnfairSuccotash9658@reddit

Thank you! Will look into these
View on Reddit #69242325

SnooMarzipans2470@reddit

asking the right questions.
View on Reddit #69237244

Down_The_Rabbithole@reddit

Disagree with MLA being a thing only Deepseek does. Slightly modified techniques which are essentially MLA are being done by almost all compute constrained labs, which essentially means all Chinese labs as well as some smaller players like Mistral. Google has a proprietary in-house approach to kv-cache which is so secret most engineers don't even know about it, as it's what gives Google their monopoly on consistency on very long context sizes. My hypothesis is that this is a superior version of essentially MLA.
View on Reddit #69255704
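
Editor's note: the reason compute constrained labs care about KV cache layout is easier to see with some arithmetic. The sketch below compares per-token cache size for full multi-head attention, GQA, and a latent-compression scheme in the spirit of MLA. Every dimension here is a hypothetical round number, not the configuration of any real model, and the latent case is simplified: it counts a single cached latent vector per layer and ignores any extra per-token state a real implementation may keep.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes cached per token when a key and a value are stored per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

def latent_bytes_per_token(n_layers, latent_dim, bytes_per_value=2):
    """Bytes cached per token when one latent vector per layer replaces K and V."""
    return n_layers * latent_dim * bytes_per_value

# Hypothetical model: 32 layers, 32 query heads of size 128, 16-bit values.
N_LAYERS, N_HEADS, HEAD_DIM = 32, 32, 128

mha = kv_bytes_per_token(N_LAYERS, N_HEADS, HEAD_DIM)  # one KV head per query head
gqa = kv_bytes_per_token(N_LAYERS, 8, HEAD_DIM)        # 8 KV heads shared by groups of 4
latent = latent_bytes_per_token(N_LAYERS, 512)         # assumed latent width of 512

for name, size in [("MHA", mha), ("GQA", gqa), ("latent", latent)]:
    print(f"{name}: {size} bytes per token")
```

Multiplying any of these by the context length gives the cache size for one sequence, which is why the difference becomes significant at long contexts.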

DistanceSolar1449@reddit

Qwen doesn't use MLA, GLM doesn't use MLA. These are the 2 top labs in China other than Deepseek, and these are very competent labs which are not just copying Deepseek's homework. I'm sure they're playing with MLA internally, but they don't use it for any big training runs.

Kimi K2 is literally just a Deepseek clone. Literally the exact same architecture, even the same number of layers. It's not impressive at all from a technical perspective. I cringe when I see people ranking Kimi as a top tier Chinese lab. They literally just copied Deepseek's homework.

Longcat is slightly different from Deepseek architecture (but still clearly Deepseek derived). I'll give them points though.

Other than Deepseek and Longcat though, basically no serious lab uses MLA in their big model releases. Even Ling/Ring doesn't use MLA and they basically copied Deepseek architecture as well.
View on Reddit #69304509

visarga@reddit

I thought they use Ring Attention and a very large number of chips to make 1M token sequences work.
View on Reddit #69284405

inevitabledeath3@reddit

I didn't know Mistral were using MLA. I did know about Kimi and LongCat using it.
View on Reddit #69265899

inevitabledeath3@reddit

Kimi K2 and LongCat also use MLA. Kimi K2 was actually a good coding model but is overshadowed nowadays by GLM 4.6.
View on Reddit #69265794

DistanceSolar1449@reddit

Kimi K2 is literally just a Deepseek clone. Literally the exact same architecture, even the same number of layers. It's not impressive at all from a technical perspective.
View on Reddit #69280441

inevitabledeath3@reddit

This is showing a real lack of deep knowledge. Yes they both employ MLA, but Kimi K2 uses a new and different training algorithm. Specifically it uses the faster and more efficient MuonClip training algorithm. It also has fewer dense layers and attention heads. It's larger but with fewer active parameters.

LongCat has Shortcut-connected Mixture of Experts with a variable number of active parameters, so that it dedicates the most compute power to the hardest to generate tokens.

They also clearly train them on different tasks and data, as Kimi is a significantly better coding model than DeepSeek. Training pipeline is just as important as architecture to making a good model.

GLM 4.6 took the world by storm despite being on the smaller side and architecturally quite boring. Only interesting thing it did architecture wise is employ multi-token prediction, but that's something DeepSeek can also do. Otherwise it uses a fairly archaic GQA based mechanism. The reason it's so good is the training.

My point anyway was not how novel they are but the fact that DeepSeek is not the only MLA model.
View on Reddit #69280808

DistanceSolar1449@reddit

Muon vs AdamW isn't that big of a difference though. And the rest of the changes are not big architectural changes, just hyperparameters any kid can change.

You're also wrong about GLM adding MTP being a new thing. It's not, Deepseek R1 has MTP as well: https://huggingface.co/deepseek-ai/DeepSeek-R1/blob/main/config.json

Deepseek puts the regular lm_head after the last non-MTP layer: https://huggingface.co/deepseek-ai/DeepSeek-R1/blob/main/model-00160-of-000163.safetensors

But then layer 62 is the MTP layer and goes after.
View on Reddit #69281508

inevitabledeath3@reddit

I said and I quote "Only interesting thing it did architecture wise is employ multi-token prediction, but that's something DeepSeek can also do." So no I am not wrong. Yes I did fucking know that DeepSeek already does that. Do you know how to read?
View on Reddit #69281902

DistanceSolar1449@reddit

You don't speak how-to-read-English
View on Reddit #69282143

inevitabledeath3@reddit

**What's that got to do with me**
View on Reddit #69282598

DistanceSolar1449@reddit

I'm not sure what that means
View on Reddit #69282681

kaggleqrdl@reddit

math, lol. i wonder how much of llms was "it tried it worked, now let's write some nonsense to make it look like our idea and we understand why it works"
View on Reddit #69262623

DistanceSolar1449@reddit

Almost none of it. Literally none of the concepts above are hamfisted ways to understand emergent concepts in LLMs. That's just the bad parts feature representation research, stuff like that. Every single concept above is very rigorously mathematically grounded and they knew WHY they were adding it to an LLM before they went and did it. The videos are very clear on that as well.
View on Reddit #69280607

lionellee77@reddit

Thank you and lecture 4 video is here https://youtu.be/VlA_jt_3Qc4
View on Reddit #69418546

igorwarzocha@reddit (OP)

oh crap, looks like I got a part time job ;d op updated
View on Reddit #69428069

natika1@reddit

Love it ❤️ Now I know what I will be doing this night ;)
View on Reddit #69317141

necroturd@reddit

And here's the actual URL that will work a year from now: https://www.youtube.com/playlist?list=PLoROMvodv4rNRRGdS0rBbXOUGA0wjdh1X
View on Reddit #69248571

igorwarzocha@reddit (OP)

done! cheers, I didn't see it at the time of posting hmmm
View on Reddit #69250290

Mart-McUH@reddit

Maybe, but it would still be better to have the correct link instead of some shortlink. Shortlinks expire after a time. They are also a security risk because you are not sure where you will actually end up. Which is why I almost never click on those.
View on Reddit #69293313

nawap@reddit

You shouldn't change it. It's not the same course.
View on Reddit #69251929

igorwarzocha@reddit (OP)

"you're absolutely right", changed it back. trust no one x)
View on Reddit #69253426

cnydox@reddit

Troll
View on Reddit #69262836

hoshamn@reddit

Not sure if they were trolling, but that playlist link is actually super helpful. Thanks for sharing!
View on Reddit #69291166

One-Employment3759@reddit

that's a different playlist, why did you make them change it?
View on Reddit #69251430

Ok-Cucumber-7217@reddit

wow, I just reached course number 21424 in my wishlist of courses
View on Reddit #69290955

zschultz@reddit

How does it compare to the 3blue1brown introduction to LLMs?
View on Reddit #69280031

JLeonsarmiento@reddit

Open sourcing knowledge.
View on Reddit #69235406

BillDStrong@reddit

Open sourcing teaching material. Let's give them the credit they deserve; teaching material is much more work than just knowledge.
View on Reddit #69240111

pscoutou@reddit

Link to their LLM playlist (more than 5 hours on here): https://www.youtube.com/watch?v=yT84Y5zCnaA&list=PLoROMvodv4rObv1FMizXqumgVVdzX4_05
View on Reddit #69236941

Firm-Fix-5946@reddit

videos still seem to work fine for me?
View on Reddit #69233634

EfficientInsecto@reddit

5 hours!? I would have to stop doom scrolling for 5 hours!?
View on Reddit #69225310

midnitewarrior@reddit

When you're done with the videos, you can have the robots doom scroll for you and summarize.
View on Reddit #69230804

I_Hate_Reddit@reddit

It's actually 55 hours of lectures :D
View on Reddit #69228276

igorwarzocha@reddit (OP)

I know, I haven't even started watching them. This is very much a do not disturb mode watch :D
View on Reddit #69226016

WithoutReason1729@reddit

Your post is getting popular and we just featured it on our Discord! [Come check it out!](https://discord.gg/PgFhZ8cnWW) You've also been given a special flair for your contribution. We appreciate your post! *I am a bot and this action was performed automatically.*
View on Reddit #69228355

swaglord1k@reddit

i will ask grok to summarize them all in 1000 words or less, thanks
View on Reddit #69227103

AdLumpy2758@reddit

Thanks!
View on Reddit #69225208

Shark_Tooth1@reddit

Thanks for this, I will use this to continue my self study
View on Reddit #69223301