Video of how my LLM's decoder blocks changed while training
Posted by 1ncehost@reddit | LocalLLaMA | View on Reddit | 53 comments
This is in response to my popular post: https://www.reddit.com/r/LocalLLaMA/comments/1sivm24/heres_how_my_llms_decoder_block_changed_while/
It was requested that I make a video of this data, so here it is. Enjoy!
More-Curious816@reddit
Looks like bacteria in a Petri dish.
Impossible-Hunt9117@reddit
It is undoubtedly something alive in some way
Enthu-Cutlet-1337@reddit
Compression is hiding the block-scale dynamics; lossless GIF/WebM or a log-scaled colormap would show the real phase shifts.
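For anyone who wants to try the log-scaled colormap suggestion, here is a minimal matplotlib sketch; the `weights` array is just a stand-in for one decoder block's values, not OP's actual data:

```python
# Minimal sketch: render a weight snapshot with a log-scaled colormap,
# so small-magnitude structure isn't crushed by a few large values.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

# placeholder 2D array standing in for one decoder block's weights
weights = np.abs(np.random.randn(128, 128)) + 1e-6  # epsilon avoids log(0)

plt.imshow(weights, norm=LogNorm(vmin=weights.min(), vmax=weights.max()),
           cmap="viridis")
plt.colorbar(label="|weight| (log scale)")
plt.title("Decoder block weights, log-scaled")
plt.savefig("block_logscale.png", dpi=150)
```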
1ncehost@reddit (OP)
Here is the lossless version and video gen source: https://huggingface.co/buckets/curvedinf/exodus-18m-training
IllegibleCheeto@reddit
Rarely is the question asked: is our models learning?
RogerRamjet999@reddit
It appears to pulse; is that some known phase change in your training? Also, at the point of the pulse there's a fairly large change in the general motion of the main clouds.
1ncehost@reddit (OP)
The pulse is from the interpolation between key frames. The key frames are generated from checkpoints I saved every 100 training steps, and a new key frame appears every 1 second in the video. The video is 179 key frames long (2 min 49 sec, plus 5 seconds of outro).
overand@reddit
So 158 actual frames? So without video interpolation, it's a ~10.5 second video at 15 fps?
1ncehost@reddit (OP)
Yeah, but the interpolation moves the samples 1-to-1 to where they appear in the next frame, so without it, it would appear a lot less coherent.
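A minimal sketch of the scheme OP describes, assuming each checkpoint key frame is an array of 2D sample positions with by-index correspondence across checkpoints (all names, counts, and the frame rate here are illustrative):

```python
import numpy as np

def interpolate_keyframes(keyframes, frames_per_key=15):
    """Linearly interpolate sample positions between consecutive
    checkpoint keyframes. `keyframes` has shape
    (num_checkpoints, num_samples, 2); correspondence between
    frames is assumed to be 1-to-1 by index."""
    frames = []
    for a, b in zip(keyframes[:-1], keyframes[1:]):
        for t in np.linspace(0.0, 1.0, frames_per_key, endpoint=False):
            # each sample glides toward its position in the next keyframe
            frames.append((1.0 - t) * a + t * b)
    frames.append(keyframes[-1])
    return np.stack(frames)

# e.g. 179 keyframes at 1 keyframe/second, smoothed to 15 fps
video_frames = interpolate_keyframes(np.random.randn(179, 1000, 2))
```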
overand@reddit
0.018B parameter model - very cool stuff!
Dangerous_Tune_538@reddit
So early and late layers stabilize, while middle layers keep moving about. I wonder if increasing batch size could fix this issue.
1ncehost@reddit (OP)
Possible, but the batch size is half a million tokens, so not particularly small. I've theorized that when that happens, it is moving representations between layers. I mentioned in the tiny text at the bottom that it has 8x AttnRes blocks ( https://arxiv.org/abs/2603.15031 ), which learn residual channels between blocks of 16x layers and each layer afterwards. That's important because it's possible that the layers can connect to one another through those AttnRes channels, and thus move representations through those channels in addition to the previous and next layers. Not saying I know what's happening though!
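The linked paper has the authoritative AttnRes details; purely to illustrate the general idea of learned cross-layer residual channels that OP mentions, here is a hypothetical sketch (every name, shape, and gating choice here is an assumption, not the paper's design):

```python
import torch
import torch.nn as nn

class CrossLayerResidual(nn.Module):
    """Hypothetical sketch (NOT the AttnRes mechanism from the paper):
    each layer receives a learned, gated sum of all earlier layers'
    outputs in addition to the usual previous-layer residual, giving
    layers a direct channel to pass representations forward."""
    def __init__(self, num_layers):
        super().__init__()
        # one learnable gate per (source layer -> target layer) channel
        self.gates = nn.Parameter(torch.zeros(num_layers, num_layers))

    def forward(self, earlier_outputs, layer_idx):
        # earlier_outputs: list of [batch, seq, dim] tensors from layers 0..layer_idx-1
        mix = torch.zeros_like(earlier_outputs[0])
        for src, h in enumerate(earlier_outputs):
            mix = mix + torch.sigmoid(self.gates[src, layer_idx]) * h
        return mix  # added to layer `layer_idx`'s input stream
```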
DataPhreak@reddit
Oh shit, we have working AttnRes models already?
1ncehost@reddit (OP)
π "model"
DataPhreak@reddit
Yes, I know AttnRes is an attention mechanism, not a model. But the model has to be trained with the attention for it to work so yes... an attnres model.
1ncehost@reddit (OP)
No no, I was making fun of my model, not your terminology.
DataPhreak@reddit
Actually, was this AttnRes implementation something you built from scratch, or did you use someone else's replication?
1ncehost@reddit (OP)
From scratch
DataPhreak@reddit
Based
DataPhreak@reddit
Oh.... hehehe
NandaVegg@reddit
In the more recent multi-lingual training regime (at least since Llama 2), late layers usually "convert" the intermediate representation of language (which is not necessarily attached to any single language, or is maybe a mix of the most dominant languages in the pre-training dataset, most likely English and Simplified Chinese) to the final output language, while adding special tokens or sets of symbols related to instruction tuning. Meanwhile, early layers tend to hold parameters related to *very* common knowledge (like New York = city, cow = animal) and sort things out by rough features, so that mid-to-late layers can work on them in detail.
Because of this observation, I think it is perfectly normal and fine that the middle layers keep moving (since the early and late layers are effectively "pegged" and only the mid-layers are moving, the gradient norm would not explode).
addandsubtract@reddit
You should post this on /r/dataisbeautiful
1ncehost@reddit (OP)
They don't allow videos :(
addandsubtract@reddit
Hit them with the first-frame (before), last-frame (after)
DraconPern@reddit
I'm just about done with the book "LLM from Scratch", which teaches LLMs using GPT-2. My background is a BS in CS from a while back, so I might be simplifying a lot; forgive me. But if I understand correctly, your experiment replaces the transformer block, which is many giant matrices, with many high-dimensional splines that model the points the matrices define when trained?
1ncehost@reddit (OP)
Cool project. The MLPs in a model usually try to predict an output vector from an input vector, i.e. they are a regression model. The splanifolds are used the same way, as a regression model, except they are lower dimensional (the model is 128D, the splanifolds are 8D); the input vector is projected to a learned subspace before regression, and the output vector is projected back up with another learned subspace.
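Shape-wise, the projection pattern OP describes might look like the following; the spline regression itself isn't spelled out in the thread, so a tiny MLP stands in as a placeholder, and all module names are illustrative:

```python
import torch
import torch.nn as nn

class LowDimRegression(nn.Module):
    """Sketch of the pattern OP describes: project the 128D residual
    stream into a learned 8D subspace, regress there, then project
    back up through another learned subspace. The actual splanifold
    regression is not shown in the thread; an MLP stands in for it."""
    def __init__(self, model_dim=128, sub_dim=8):
        super().__init__()
        self.down = nn.Linear(model_dim, sub_dim, bias=False)  # learned input subspace
        self.regress = nn.Sequential(                          # placeholder for the splanifold
            nn.Linear(sub_dim, sub_dim), nn.GELU(), nn.Linear(sub_dim, sub_dim)
        )
        self.up = nn.Linear(sub_dim, model_dim, bias=False)    # learned output subspace

    def forward(self, x):  # x: [batch, seq, 128]
        return self.up(self.regress(self.down(x)))
```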
Fun-Newspaper-83@reddit
Do you see certain layers stabilizing earlier than others, or do they all evolve at a similar pace?
breadislifeee@reddit
it looks pretty cool.
Chromix_@reddit
At 0.93B the cat becomes possible; shortly after, at 2.18B, it even becomes relevant. Yet at 2.73B, and many times after, it becomes possible again. What seemingly doesn't become possible is completing that into a few somewhat correct sentences, though.
Dany0@reddit
The cat is indeed, possible
FrontStreet3@reddit
I mean, it's a 0.018B parameter model lmao
snapo84@reddit
But with 128 layers, the deeper you make it, the slimmer you can make it...
moonrust-app@reddit
one of these days they will find a way to escape that lab and take over humanity. you keep laughing
h-mo@reddit
this is genuinely one of the most interesting things I've seen posted here in a while. watching blocks specialize in real time makes all the theoretical stuff about layer depth and representation learning click in a way that reading papers never quite does.
Revolutionary_Ask154@reddit
any chance you share this?
ShelZuuz@reddit
What causes the smooth movements? Is that gradient descent in action?
tmvr@reddit
L109, L110 and L111:
rainhunter007@reddit
this made me laugh unreasonably hard
arm2armreddit@reddit
We need to see this in Hollywood movies as well. Very cool way to visualize "AI." The usual node-movement visuals are outdated, and so is the Matrix-style flow of letters.
LegacyRemaster@reddit
L109 ... it's a rebel!
1ncehost@reddit (OP)
You could say I'm somewhat of an L109 myself
LegacyRemaster@reddit
Ahahahahah! I do LLM training from scratch every day (between 120M and 200M parameters) and I would like to "see" what happens the way you do.
IrisColt@reddit
The beginning of that background music gave me very strong LucasArts's "Afterlife" vibes.
moahmo88@reddit
Amazing!
Sliouges@reddit
The cat is possible.
Clean_Hyena7172@reddit
I don't know what I'm looking at but it looks pretty cool.
One_Curious_Cats@reddit
AI Rorschach images? :)
Borkato@reddit
They're dancing!
BannedGoNext@reddit
It's AI slop in a Petri dish, I think :D. Actually, same, but it's super cool.
aiyakisoba@reddit
POV: Looking at microorganisms under a microscope.
IntelligentFire999@reddit
Coolest video I have seen in a long time.
SmartCustard9944@reddit
Can definitely notice a higher coherence at lower loss.
Medium_Chemist_4032@reddit
That's one suing cat