Video of how my LLM's decoder blocks changed while training
Posted by 1ncehost@reddit | LocalLLaMA | View on Reddit | 53 comments
This is in response to my popular post: https://www.reddit.com/r/LocalLLaMA/comments/1sivm24/heres_how_my_llms_decoder_block_changed_while/
It was requested that I make a video of this data, so here it is. Enjoy!
More-Curious816@reddit
Looks like bacteria in a Petri dish.
Impossible-Hunt9117@reddit
It is undoubtedly something alive in some way
Enthu-Cutlet-1337@reddit
Compression is hiding the block-scale dynamics; lossless GIF/WebM or a log-scaled colormap would show the real phase shifts.
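For anyone who wants to try the log-scaled colormap suggestion, here is a minimal matplotlib sketch; the `weights` array is just a stand-in for one decoder block's values, not OP's actual data:

```python
# Minimal sketch: render a weight snapshot with a log-scaled colormap,
# so small-magnitude structure isn't crushed by a few large values.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

# placeholder 2D array standing in for one decoder block's weights
weights = np.abs(np.random.randn(128, 128)) + 1e-6  # epsilon avoids log(0)

plt.imshow(weights, norm=LogNorm(vmin=weights.min(), vmax=weights.max()),
           cmap="viridis")
plt.colorbar(label="|weight| (log scale)")
plt.title("Decoder block weights, log-scaled")
plt.savefig("block_logscale.png", dpi=150)
```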
1ncehost@reddit (OP)
Here is the lossless version and video gen source: https://huggingface.co/buckets/curvedinf/exodus-18m-training
IllegibleCheeto@reddit
Rarely is the question asked: is our models learning?
RogerRamjet999@reddit
It appears to pulse; is that some known phase change in your training? Also, at the point of the pulse there's a fairly large change in the general motion of the main clouds.
1ncehost@reddit (OP)
The pulse is from the interpolation between key frames. The key frames are generated from checkpoints I saved every 100 training steps, and a new key frame appears every 1 second in the video. The video is 179 key frames long (2 min 49 sec, plus 5 seconds of outro).
overand@reddit
So 158 actual frames? So without video interpolation, it's a ~10.5 second video at 15 fps?
1ncehost@reddit (OP)
Yeah, but the interpolation moves the samples 1-to-1 to where they appear in the next frame, so without it, it would appear a lot less coherent.
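A minimal sketch of the scheme OP describes, assuming each checkpoint key frame is an array of 2D sample positions with by-index correspondence across checkpoints (all names, counts, and the frame rate here are illustrative):

```python
import numpy as np

def interpolate_keyframes(keyframes, frames_per_key=15):
    """Linearly interpolate sample positions between consecutive
    checkpoint keyframes. `keyframes` has shape
    (num_checkpoints, num_samples, 2); correspondence between
    frames is assumed to be 1-to-1 by index."""
    frames = []
    for a, b in zip(keyframes[:-1], keyframes[1:]):
        for t in np.linspace(0.0, 1.0, frames_per_key, endpoint=False):
            # each sample glides toward its position in the next keyframe
            frames.append((1.0 - t) * a + t * b)
    frames.append(keyframes[-1])
    return np.stack(frames)

# e.g. 179 keyframes at 1 keyframe/second, smoothed to 15 fps
video_frames = interpolate_keyframes(np.random.randn(179, 1000, 2))
```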
overand@reddit
0.018B parameter model - very cool stuff!
Dangerous_Tune_538@reddit
So early and late layers stabilize, while middle layers keep moving about. I wonder if increasing batch size could fix this issue.
1ncehost@reddit (OP)
Possible, but the batch size is half a million tokens, so not particularly small. I've theorized that when that happens, it is moving representations between layers. I mentioned in the tiny text at the bottom that it has 8x AttnRes blocks ( https://arxiv.org/abs/2603.15031 ), which learn residual channels between blocks of 16x layers and each layer afterwards. That's important because it's possible that the layers can connect to one another through those AttnRes channels, and thus move representations through those channels in addition to the previous and next layers. Not saying I know what's happening though!
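The linked paper has the authoritative AttnRes details; purely to illustrate the general idea of learned cross-layer residual channels that OP mentions, here is a hypothetical sketch (every name, shape, and gating choice here is an assumption, not the paper's design):

```python
import torch
import torch.nn as nn

class CrossLayerResidual(nn.Module):
    """Hypothetical sketch (NOT the AttnRes mechanism from the paper):
    each layer receives a learned, gated sum of all earlier layers'
    outputs in addition to the usual previous-layer residual, giving
    layers a direct channel to pass representations forward."""
    def __init__(self, num_layers):
        super().__init__()
        # one learnable gate per (source layer -> target layer) channel
        self.gates = nn.Parameter(torch.zeros(num_layers, num_layers))

    def forward(self, earlier_outputs, layer_idx):
        # earlier_outputs: list of [batch, seq, dim] tensors from layers 0..layer_idx-1
        mix = torch.zeros_like(earlier_outputs[0])
        for src, h in enumerate(earlier_outputs):
            mix = mix + torch.sigmoid(self.gates[src, layer_idx]) * h
        return mix  # added to layer `layer_idx`'s input stream
```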
DataPhreak@reddit
Oh shit, we have working AttnRes models already?
1ncehost@reddit (OP)
π "model"
DataPhreak@reddit
Yes, I know AttnRes is an attention mechanism, not a model. But the model has to be trained with the attention for it to work so yes... an attnres model.
1ncehost@reddit (OP)
No no, I was making fun of my model, not your terminology.
DataPhreak@reddit
Actually, was this AttnRes implementation something you built from scratch, or did you use someone else's replication?
1ncehost@reddit (OP)
From scratch
DataPhreak@reddit
Based
DataPhreak@reddit
Oh.... hehehe
NandaVegg@reddit
In the more recent multi-lingual training regime (at least since Llama 2), late layers usually "convert" the intermediate representation of language (which is not necessarily attached to any single language, or is maybe a mix of the most dominant languages in the pre-training dataset, most likely English and Simplified Chinese) to the final output language, while adding special tokens or sets of symbols related to instruction tuning. Meanwhile, early layers tend to hold parameters related to *very* common knowledge (like New York = city, cow = animal) and sort things out by rough features, so that mid-to-late layers can work on them in detail.
Because of this observation, I think it is perfectly normal and fine that the middle layers keep moving (since the early and late layers are effectively "pegged" and only the mid-layers are moving, the gradient norm would not explode).
addandsubtract@reddit
You should post this on /r/dataisbeautiful
1ncehost@reddit (OP)
They don't allow videos :(
addandsubtract@reddit
Hit them with the first-frame (before), last-frame (after)
DraconPern@reddit
I'm just about done with the book "LLM from Scratch", which teaches LLMs using GPT-2. My background is a BS in CS from a while back, so I might be simplifying a lot; forgive me. But if I understand correctly, your experiment replaces the transformer block, which is many giant matrices, with many high-dimensional splines that model the points the matrices define when trained?
1ncehost@reddit (OP)
Cool project. The MLPs in a model usually try to predict an output vector from an input vector, i.e. they are a regression model. The splanifolds are used the same way, as a regression model, except they are lower dimensional (the model is 128D, the splanifolds are 8D); the input vector is projected to a learned subspace before regression, and the output vector is projected back up with another learned subspace.
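Shape-wise, the projection pattern OP describes might look like the following; the spline regression itself isn't spelled out in the thread, so a tiny MLP stands in as a placeholder, and all module names are illustrative:

```python
import torch
import torch.nn as nn

class LowDimRegression(nn.Module):
    """Sketch of the pattern OP describes: project the 128D residual
    stream into a learned 8D subspace, regress there, then project
    back up through another learned subspace. The actual splanifold
    regression is not shown in the thread; an MLP stands in for it."""
    def __init__(self, model_dim=128, sub_dim=8):
        super().__init__()
        self.down = nn.Linear(model_dim, sub_dim, bias=False)  # learned input subspace
        self.regress = nn.Sequential(                          # placeholder for the splanifold
            nn.Linear(sub_dim, sub_dim), nn.GELU(), nn.Linear(sub_dim, sub_dim)
        )
        self.up = nn.Linear(sub_dim, model_dim, bias=False)    # learned output subspace

    def forward(self, x):  # x: [batch, seq, 128]
        return self.up(self.regress(self.down(x)))
```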
Fun-Newspaper-83@reddit
Do you see certain layers stabilizing earlier than others, or do they all evolve at a similar pace?
breadislifeee@reddit
it looks pretty cool.
Chromix_@reddit
At 0.93B the cat becomes possible; shortly after, at 2.18B, it even becomes relevant. Yet at 2.73B, and many times after, it becomes possible again. What seemingly doesn't become possible is completing that into a few somewhat correct sentences, though.
Dany0@reddit
The cat is indeed, possible
FrontStreet3@reddit
I mean, it's a 0.018B parameter model lmao
snapo84@reddit
But with 128 layers, the deeper you make it, the slimmer you can make it...
moonrust-app@reddit
one of these days they will find a way to escape that lab and take over humanity. you keep laughing
h-mo@reddit
this is genuinely one of the most interesting things I've seen posted here in a while. watching blocks specialize in real time makes all the theoretical stuff about layer depth and representation learning click in a way that reading papers never quite does.
Revolutionary_Ask154@reddit
any chance you share this?
ShelZuuz@reddit
What causes the smooth movements? Is that gradient descent in action?
tmvr@reddit
L109, L110 and L111:
rainhunter007@reddit
this made me laugh unreasonably hard
arm2armreddit@reddit
We need to see this in Hollywood movies as well. Very cool way to visualize "AI." The usual node-movement visuals are outdated, and so is the Matrix-style flow of letters.
LegacyRemaster@reddit
L109 ... it's a rebel!
1ncehost@reddit (OP)
You could say I'm somewhat of an L109 myself
LegacyRemaster@reddit
Ahahahahah! I do LLM training from scratch every day (between 120M and 200M parameters) and I would like to "see" what happens the way you do.
IrisColt@reddit
The beginning of that background music gave me very strong LucasArts's "Afterlife" vibes.
moahmo88@reddit
Amazing!
Sliouges@reddit
The cat is possible.
Clean_Hyena7172@reddit
I don't know what I'm looking at but it looks pretty cool.
One_Curious_Cats@reddit
AI Rorschach images? :)
Borkato@reddit
They're dancing!
BannedGoNext@reddit
It's AI slop in a Petri dish, I think :D. Actually, same, but it's super cool.
aiyakisoba@reddit
POV: Looking at microorganisms under a microscope.
IntelligentFire999@reddit
Coolest video I have seen in a long time.
SmartCustard9944@reddit
Can definitely notice a higher coherence at lower loss.
Medium_Chemist_4032@reddit
That's one suing cat