Any idea how Meta did this?
Posted by QuackerEnte@reddit | LocalLLaMA | 8 comments
Hey, any idea what is meant by compression here and how they did it?
Is it intelligent summarizing? Or actual Test-Time-Training on the reasoning traces using special layers? Something else? And what do they mean by "[...]the length penalty CAUSES thought compression [...]"? I can't imagine how this could be caused by an RL training penalty rather than by a fundamentally different architecture from normal LLMs.
I could not find any meaningful research papers on the topic that could reveal the exact inner workings of this. This seems like a genuinely useful feature for local use. Any ideas?
I'd love it if someone could link a paper or two, or something like that, to help figure this out.
Thank you.
[source](https://ai.meta.com/blog/introducing-muse-spark-msl/)
kataryna91@reddit
I think the wording is pretty clear here. The graph is just an observation: the compression isn't anything Meta did specifically. Rather, the model found an optimization path at that point in training that allowed it to remove tokens without affecting performance. You could force this manually by increasing the length penalty during training, but the wording implies they did NOT do that and it happened as a natural effect of training.
ambient_temp_xeno@reddit
I suppose the model has the concept of using fewer tokens to compress thinking somewhere in its pretraining.
ambient_temp_xeno@reddit
Can someone make a length penalty sampler?
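A sampler-side length penalty could be sketched as a logits processor that adds a growing bonus to the end-of-thinking token's logit once generation runs long, nudging the model to wrap up. Everything here (the function name, the `start_len`/`slope` values) is illustrative, not from Meta's post or any existing library:

```python
import numpy as np

def length_penalty_processor(logits, cur_len, eos_token_id,
                             start_len=256, slope=0.02):
    """Toy inference-time length penalty.

    Once the generated sequence passes `start_len` tokens, add a
    linearly growing bonus to the end-of-thinking token's logit,
    making the model increasingly likely to stop. All constants
    are made-up placeholders for illustration.
    """
    logits = logits.copy()
    overshoot = max(0, cur_len - start_len)  # tokens past the budget
    logits[eos_token_id] += slope * overshoot
    return logits

# Example: at 300 tokens the EOS logit gets a +0.88 bonus,
# while all other logits are untouched.
boosted = length_penalty_processor(np.zeros(8), cur_len=300, eos_token_id=2)
```

This only shortens outputs, though; it can't make the surviving tokens denser the way RL training can, which is presumably why the blog's effect needs training rather than sampling.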
StupidScaredSquirrel@reddit
No, they just mean that by putting a penalty on thinking-token length, they teach the model to do the same thing with fewer tokens.
QuackerEnte@reddit (OP)
So this is just training, not test time? Well, this is less exciting than I thought it was 😅 I wonder whether CoT compression techniques exist and whether they hurt performance.
StupidScaredSquirrel@reddit
I suggest you re-read the article more slowly, because it's very explicit that they're talking about training. Or maybe translate it if it's not your mother tongue; the new Gemma 4 is great at that.
Thick-Protection-458@reddit
> Hey, any idea what is meant by compression here and how they did it?
Probably it just means that at that training stage they penalize long outputs, even correct ones.

That means the model is forced to "prefer" short chains of thought while achieving similar performance -> "thought compression".
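The reward shape being described could be sketched like this; the exact weights and functional form are assumptions, not Meta's actual recipe:

```python
def rl_reward(is_correct: bool, num_thinking_tokens: int,
              length_weight: float = 0.001) -> float:
    """Toy RL reward with a length penalty.

    A correct answer earns a base reward of 1.0, and every thinking
    token subtracts a small amount, so among two correct trajectories
    the shorter chain of thought scores higher. The 1.0 base and the
    0.001 weight are illustrative placeholders.
    """
    base = 1.0 if is_correct else 0.0
    return base - length_weight * num_thinking_tokens

# A correct answer reached in 100 thinking tokens beats the same
# answer reached in 500 tokens, which is the pressure that produces
# "thought compression".
short = rl_reward(True, 100)   # 0.9
long = rl_reward(True, 500)    # 0.5
```

With this shape, correctness still dominates at moderate lengths, so the optimizer trims tokens mainly where they weren't contributing to the answer.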
Theio666@reddit
You add a length penalty (or increase its weight), which causes the RL to prefer shorter trajectories over longer ones for the same outputs. Then you cancel/lower the length penalty. Now you have a checkpoint with a much higher density of usefulness in its reasoning -> you allow it to think for longer -> it gets better results, since more of the reasoning is relevant.
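The two-phase idea in this comment could be sketched as a schedule for the penalty weight over training; the phase boundaries and peak value are made-up assumptions, just to show the shape:

```python
def length_weight_schedule(step: int, total_steps: int,
                           peak: float = 0.001) -> float:
    """Toy schedule for the length-penalty weight during RL training.

    Phase 1: no length pressure while the model learns to solve tasks.
    Phase 2: apply the penalty, squeezing the reasoning traces denser.
    Phase 3: drop the penalty, so the now-denser reasoner can be
    allowed to think longer again. All boundaries (0.2, 0.8) and the
    peak weight are illustrative placeholders.
    """
    frac = step / total_steps
    if frac < 0.2:        # warm-up: no length pressure yet
        return 0.0
    if frac < 0.8:        # compression phase: full penalty applied
        return peak
    return 0.0            # relax: let the compressed model think longer
```

Plugging this weight into a reward like `correctness - weight * num_tokens` at each step would reproduce the "compress, then relax" trajectory described above.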