Another sampling strategy drops: 75% accuracy at T=3.0
Posted by tomorrowdawn@reddit | LocalLLaMA | View on Reddit | 13 comments
TL;DR:
threshold = logits.max(dim=-1, keepdim=True).values - n * logits.std(dim=-1, keepdim=True)
logits[logits < threshold] = float('-inf')
It's called top-nsigma: it filters tokens directly using information from the raw logits.
Imo the most interesting finding is that the logits naturally split into two regions: a Gaussian noise region and an informative region. When the model isn't confident enough or the temperature is high, the gap between "meaningful" tokens and "noise" tokens shrinks, and noise tokens start sneaking into your sampling pool, degrading quality.
And this performance drop is serious - imagine spending millions on training a massive model just to have poor sampling mess up its outputs.
Check out the original GitHub repo; top-nsigma has also been merged into aphrodite-engine. (Honestly, it's so simple you could probably whip it up in a few minutes.) Feel free to try it out and let us know what you think!
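A slightly fuller sketch of how this could slot into a sampling step, assuming PyTorch and batch-shaped logits (the function and argument names here are mine, not from the repo):

import torch

def top_nsigma_sample(logits: torch.Tensor, n: float = 1.0, temperature: float = 1.0) -> torch.Tensor:
    """Sample token ids from raw logits of shape [batch, vocab] using top-nsigma filtering."""
    logits = logits / temperature
    # Keep only tokens whose logit is within n standard deviations of the max logit.
    threshold = logits.max(dim=-1, keepdim=True).values - n * logits.std(dim=-1, keepdim=True)
    logits = logits.masked_fill(logits < threshold, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)

Because both the max and the standard deviation scale with 1/temperature, the same set of tokens survives whether the filter is applied before or after temperature scaling.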
silenceimpaired@reddit
I wonder whether this sampler will be less than ideal for creative efforts.
kulchacop@reddit
The paper is titled "Top-nσ: Not All Logits Are You Need"
_Erilaz@reddit
What does it even mean, strictly speaking? Is it some sort of Yoda talk? In a preprint? My non-native brain can't process this sentence, the word order is all over the place!
tomorrowdawn@reddit (OP)
Due to an inherent flaw of softmax, not all logits should be allowed to produce positive probabilities (which degrades quality).
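A toy illustration of that flaw, with made-up numbers: softmax never outputs an exact zero, so low "noise" logits always receive some probability mass, and at high temperature that mass stops being negligible.

import torch

# Two "informative" logits followed by three "noise" logits (made-up values).
logits = torch.tensor([10.0, 9.5, 2.0, 1.5, 1.0])
for T in (1.0, 3.0):
    probs = torch.softmax(logits / T, dim=-1)
    print(f"T={T}: noise tokens hold {probs[2:].sum().item():.1%} of the probability mass")

In this toy case the noise share grows from well under 1% at T=1.0 to around 9% at T=3.0, which is exactly the leakage top-nsigma is meant to cut off.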
_Erilaz@reddit
I get the rough idea of the technique itself: it essentially tries to filter out random tokens that creep into the sampler's considerations by evaluating the low-probability token distribution to determine where the informative tokens begin. I understand that, and it indeed seems reasonable when the model itself is noisy or simply isn't confident enough.
I am only asking about the naming. "Not All Logits Are You Need", I can't understand that.
Evening_Ad6637@reddit
The title is simply a direct allusion to what this study is about.
DeProgrammer99@reddit
They meant, "You don't need all the logits," but they tried to copy the " is all you need" title pattern that numerous other papers have used, starting with "Attention is All You Need."
anchortense@reddit
Very similar to the logit threshold sampler I developed a couple of months ago. I actually tested filtering by standard deviations from the max logit at the time but found a fixed logit threshold to be more stable at higher temperatures, in terms of filtering out incoherent tokens.
https://old.reddit.com/r/LocalLLaMA/comments/1fvm1gv/two_new_experimental_samplers_for_coherent/
https://github.com/turboderp/exllamav2/pull/657
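For comparison, a rough sketch of the two filtering rules being contrasted; the fixed-threshold variant below is my reading (an absolute cutoff on raw logit values), so see the linked post and PR for the exact rule actually used:

import torch

def filter_by_std(logits: torch.Tensor, n: float = 1.0) -> torch.Tensor:
    # top-nsigma style: the cutoff tracks the spread of the current distribution.
    threshold = logits.max(dim=-1, keepdim=True).values - n * logits.std(dim=-1, keepdim=True)
    return logits.masked_fill(logits < threshold, float("-inf"))

def filter_by_fixed_threshold(logits: torch.Tensor, cutoff: float) -> torch.Tensor:
    # fixed-threshold style (assumed): the cutoff is a constant, independent of the distribution's shape.
    return logits.masked_fill(logits < cutoff, float("-inf"))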
fogandafterimages@reddit
This reminds me a lot of [2410.05258] Differential Transformer, which has a very similar idea but applies it to the Q•K attention logits. Same premise of "the logits mix some actual signal with some random noise, and finding a way to cut out the noise is probably good."
Rather than thresholding, Differential Transformer divides each attention head into two halves, a signal-detecting part and a noise-detecting part. They compute attention logits as
σ(signal) - λ • σ(noise)
where λ is a learnable parameter, initialized to something like 0.8 and kept separate for each head.
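A minimal sketch of that idea (my own simplification, ignoring the multi-head plumbing and the paper's exact λ parameterization):

import torch
import torch.nn.functional as F

def diff_attention(q1, k1, q2, k2, v, lam=0.8):
    # q1/k1 play the "signal" role, q2/k2 the "noise" role; lam stands in for the learnable λ.
    d = q1.shape[-1]
    signal = F.softmax(q1 @ k1.transpose(-2, -1) / d**0.5, dim=-1)
    noise = F.softmax(q2 @ k2.transpose(-2, -1) / d**0.5, dim=-1)
    return (signal - lam * noise) @ v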
PickleFart56@reddit
Agreed, there are many papers showing that model performance degrades when it attends to all tokens; the model should instead attend to only a few. Here is another great paper: https://arxiv.org/html/2410.02703
m18coppola@reddit
Here's my implementation of it in llama.cpp. Looking for feedback. I will probably make a pull request after playing around with it.
It's disabled by default. The paper uses N = 1.0. You might want to disable some other samplers from the chain if you want to test it on its own.
placebomancer@reddit
The coolest part to me is the temperature invariance. Regardless of what temperature is applied to the logits prior to sampling, it will sample the same number of tokens. This implies that it is adaptive to different token distributions (which is reasonable considering it uses the standard deviation of the logits to set the threshold). Am I understanding correctly that it only works on the raw logit distribution, and that softmaxing and then converting back to logits distorts information about the logit distribution? If so, that's interesting too. I'll have to test it in actual use, but first glance suggests it could be an improvement over min-p, TFS, and my personal sampling technique.
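A quick sanity check of that invariance, on made-up logits (the helper name is mine): dividing by T scales both the max and the standard deviation by 1/T, so the threshold max - n*std moves with them and the same tokens survive.

import torch

torch.manual_seed(0)
logits = torch.randn(32000) * 4.0      # made-up raw logits for a 32k vocab
n = 1.0

def survivors(l: torch.Tensor) -> torch.Tensor:
    return l >= l.max() - n * l.std()

for T in (0.5, 1.0, 3.0):
    assert torch.equal(survivors(logits), survivors(logits / T))   # identical token set at every T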
mrjackspade@reddit
Sounds like TFS?
https://www.trentonbricken.com/Tail-Free-Sampling/