Efficient pretraining with token superposition by Nous Research
Posted by de4dee@reddit | LocalLLaMA | View on Reddit | 15 comments
brown2green@reddit
Like this?
Beyond Next Token Prediction: Patch-Level Training for Large Language Models
learn_and_learn@reddit
Did you just recall this paper? Curious how you drew the link between the two.
brown2green@reddit
I looked into it a while back and ran my own tests on patch-level pretraining with tiny models.
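For anyone curious, this is roughly what I mean by patch-level pretraining (my own rough reading of that paper; the names and shapes below are just illustrative, not its code): average every K consecutive token embeddings into one patch, run the decoder over the much shorter patch sequence, and train each patch position against the tokens of the following patch.

```
# Hedged sketch of patch-level pretraining; assumes PyTorch, an `embed` nn.Embedding,
# an `lm_head` nn.Linear, and a decoder `model` whose forward accepts (B, P, D) embeddings.
import torch
import torch.nn.functional as F

K = 4  # patch size: how many consecutive tokens get averaged into one position

def patchify(token_embeds):
    """Average every K consecutive token embeddings into a single patch embedding."""
    B, T, D = token_embeds.shape
    T = (T // K) * K                                                  # drop the ragged tail
    return token_embeds[:, :T].reshape(B, T // K, K, D).mean(dim=2)  # (B, T//K, D)

def patch_level_loss(model, embed, lm_head, token_ids):
    """Each patch position predicts all K tokens of the *next* patch."""
    patches = patchify(embed(token_ids))       # (B, P, D) with P = T // K
    hidden = model(patches[:, :-1])            # causal transformer over the patch sequence
    logits = lm_head(hidden)                   # (B, P-1, vocab)
    B, P, _ = patches.shape
    targets = token_ids[:, K:P * K].reshape(B, P - 1, K)
    losses = [F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                              targets[..., k].reshape(-1)) for k in range(K)]
    return torch.stack(losses).mean()
```

As far as I remember, the paper then switches back to ordinary token-level training for the last chunk of the data, so you still end up with a normal next-token predictor.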
More-Curious816@reddit
Is your day job related to research, or is it just a hobby?
brown2green@reddit
Just a hobby. If you limit model size to around 50~100M parameters at most, you can do a lot of interesting LLM architecture experimentation even on one GPU.
Silver-Champion-4846@reddit
Funny you should say that; I know a Discord whose members are obsessed with squeezing the most out of 0.5M, 1M and 5M models.
Ok-Reflection-9505@reddit
willing to share a link?
Silver-Champion-4846@reddit
https://discord.gg/58J2truSF
Youknowwhyimherexxx@reddit
I would like to know too
Double_Cause4609@reddit
When you read a lot of papers, connections between them fall out pretty naturally. When you really get down to it, most ideas are derivatives, special cases, or a step removed from some other idea.
For example, attention is actually kind of related to CNNs in its dynamics, at least in some scales and regimes, in the sense that it's weight-tied with respect to the map.
Or, N-GPT probably gets some of its benefit from the normalization preventing grokking (in the sense of the delay, not of the eventual understanding), which you can infer if you read the paper that introduced orthograd and noted that grokking delay was the result of weight scaling driven by cross entropy.
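If it helps, the orthograd trick (as I understand it; this is just a sketch, not the paper's exact optimizer) is simply: before each update, strip out the part of the gradient that points along the weights, so cross entropy can't keep improving just by scaling the weights up.

```
# Hedged sketch of the orthograd idea: project the gradient orthogonal to the
# weights before stepping, so the loss can't be reduced by pure weight scaling.
import torch

@torch.no_grad()
def orthograd_step(params, lr=1e-3):
    for p in params:
        if p.grad is None:
            continue
        w, g = p.reshape(-1), p.grad.reshape(-1)
        # g_perp = g - (g.w / w.w) * w : remove the component of g parallel to w
        g_perp = g - (torch.dot(g, w) / (torch.dot(w, w) + 1e-30)) * w
        p.add_(g_perp.reshape(p.shape), alpha=-lr)
```

If I recall correctly, the actual paper wraps this around an existing optimizer and rescales the projected gradient back to the original gradient norm.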
NandaVegg@reddit
It looks like a fairly generalizable idea (cutting compute by averaging) with a lot of room to expand. Probably more useful in the earlier training phases than in mid-to-post training.
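Back-of-envelope on the compute cut, using a generic transformer cost model (my own rough numbers, not from the paper): with the sequence K times shorter during that phase, the quadratic attention term drops by roughly K² and the linear FFN/projection term by roughly K.

```
# Rough, hedged estimate of the per-step cost reduction from averaging K tokens
# into one position (generic O(T^2) attention + O(T) FFN cost model).
T, K = 4096, 4
attn_ratio = (T / K) ** 2 / T ** 2   # ~1/K^2 of the original attention cost
ffn_ratio = (T / K) / T              # ~1/K of the original FFN/projection cost
print(attn_ratio, ffn_ratio)         # 0.0625 0.25 for K = 4
```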
Dany0@reddit
Anthropic breathed a sigh of relief. We can survive on one less data centre
Dario petted Roko's basilisk and pleaded with it: "see how much us humans are trying? please don't kill me 🥺"
ResidentPositive4122@reddit
Fascinating paper.
Had a real chuckle at this:
I will bet my left nut that this was found / proposed by an LLM, and then verified by the team while scratching their head :)
IShitMyselfNow@reddit
It's actually been referenced by quite a few LLM-related papers; see the references here: https://www.researchgate.net/publication/1780370_Entropy_and_Long-Range_Correlations_in_Literary_English
nuclearbananana@reddit
This and a number of other papers I've seen all seem to be doing the same thing, training the model to predict meaning/ideas without overfocusing on specific tokens.