Efficient pretraining with token superposition by Nous Research
Posted by de4dee@reddit | LocalLLaMA | View on Reddit | 15 comments
brown2green@reddit
Like this?
Beyond Next Token Prediction: Patch-Level Training for Large Language Models
learn_and_learn@reddit
Did you just recall this paper? Curious how you drew the link between the two.
brown2green@reddit
I looked into it a while back and ran my own tests on patch-level pretraining with tiny models.
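For anyone curious, this is roughly what I mean by patch-level pretraining (my own rough reading of that paper; the names and shapes below are just illustrative, not its code): average every K consecutive token embeddings into one patch, run the decoder over the much shorter patch sequence, and train each patch position against the tokens of the following patch.

```
# Hedged sketch of patch-level pretraining; assumes PyTorch, an `embed` nn.Embedding,
# an `lm_head` nn.Linear, and a decoder `model` whose forward accepts (B, P, D) embeddings.
import torch
import torch.nn.functional as F

K = 4  # patch size: how many consecutive tokens get averaged into one position

def patchify(token_embeds):
    """Average every K consecutive token embeddings into a single patch embedding."""
    B, T, D = token_embeds.shape
    T = (T // K) * K                                                  # drop the ragged tail
    return token_embeds[:, :T].reshape(B, T // K, K, D).mean(dim=2)  # (B, T//K, D)

def patch_level_loss(model, embed, lm_head, token_ids):
    """Each patch position predicts all K tokens of the *next* patch."""
    patches = patchify(embed(token_ids))       # (B, P, D) with P = T // K
    hidden = model(patches[:, :-1])            # causal transformer over the patch sequence
    logits = lm_head(hidden)                   # (B, P-1, vocab)
    B, P, _ = patches.shape
    targets = token_ids[:, K:P * K].reshape(B, P - 1, K)
    losses = [F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                              targets[..., k].reshape(-1)) for k in range(K)]
    return torch.stack(losses).mean()
```

As far as I remember, the paper then switches back to ordinary token-level training for the last chunk of the data, so you still end up with a normal next-token predictor.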
More-Curious816@reddit
Is your day job related to research, or is it just a hobby?
brown2green@reddit
Just a hobby. If you limit model size to around 50~100M parameters at most, you can do a lot of interesting LLM architecture experimentation even on one GPU.
Silver-Champion-4846@reddit
Funny you should say that; I know a Discord whose members are obsessed with squeezing the most out of 0.5M, 1M and 5M models.
Ok-Reflection-9505@reddit
willing to share a link?
Silver-Champion-4846@reddit
https://discord.gg/58J2truSF
Youknowwhyimherexxx@reddit
I would like to know too
Double_Cause4609@reddit
When you read a lot of papers, connections between them fall out pretty naturally. When you really get down to it, most ideas are derivatives, special cases, or a step removed from some other idea.
For example, attention is actually kind of related to CNNs in its dynamics, at least in some scales and regimes, in the sense that it's weight-tied with respect to the map.
Or, N-GPT probably gets some of its benefit from the normalization preventing grokking (in the sense of the delay, not of the eventual understanding), which you can infer if you read the paper that introduced orthograd and noted that grokking delay was the result of weight scaling driven by cross entropy.
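If it helps, the orthograd trick (as I understand it; this is just a sketch, not the paper's exact optimizer) is simply: before each update, strip out the part of the gradient that points along the weights, so cross entropy can't keep improving just by scaling the weights up.

```
# Hedged sketch of the orthograd idea: project the gradient orthogonal to the
# weights before stepping, so the loss can't be reduced by pure weight scaling.
import torch

@torch.no_grad()
def orthograd_step(params, lr=1e-3):
    for p in params:
        if p.grad is None:
            continue
        w, g = p.reshape(-1), p.grad.reshape(-1)
        # g_perp = g - (g.w / w.w) * w : remove the component of g parallel to w
        g_perp = g - (torch.dot(g, w) / (torch.dot(w, w) + 1e-30)) * w
        p.add_(g_perp.reshape(p.shape), alpha=-lr)
```

If I recall correctly, the actual paper wraps this around an existing optimizer and rescales the projected gradient back to the original gradient norm.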
NandaVegg@reddit
It looks like a fairly generalizable idea (cutting compute by averaging) with a lot of room to expand. Probably more useful in the earlier training phases than in mid-to-post training.
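Back-of-envelope on the compute cut, using a generic transformer cost model (my own rough numbers, not from the paper): with the sequence K times shorter during that phase, the quadratic attention term drops by roughly K² and the linear FFN/projection term by roughly K.

```
# Rough, hedged estimate of the per-step cost reduction from averaging K tokens
# into one position (generic O(T^2) attention + O(T) FFN cost model).
T, K = 4096, 4
attn_ratio = (T / K) ** 2 / T ** 2   # ~1/K^2 of the original attention cost
ffn_ratio = (T / K) / T              # ~1/K of the original FFN/projection cost
print(attn_ratio, ffn_ratio)         # 0.0625 0.25 for K = 4
```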
Dany0@reddit
Anthropic breathed a sigh of relief. We can survive on one less data centre
Dario petted Roko's basilisk and pleaded with it: "see how much us humans are trying? please don't kill me 🥺"
ResidentPositive4122@reddit
Fascinating paper.
Had a real chuckle at this:
I will bet my left nut that this was found / proposed by an LLM, and then verified by the team while scratching their head :)
IShitMyselfNow@reddit
It's actually been referenced by quite a few LLM-related papers; see the references here: https://www.researchgate.net/publication/1780370_Entropy_and_Long-Range_Correlations_in_Literary_English
nuclearbananana@reddit
This and a number of other papers I've seen all seem to be doing the same thing, training the model to predict meaning/ideas without overfocusing on specific tokens.