Is Min P sampling really the preferred modern alternative to Top K/Top P?
Posted by bgravato@reddit | LocalLLaMA | 16 comments
According to what I've been reading (and also according to all models I've asked about this), the consensus seems to be that Min P is the better/more modern approach to sampling and that it should be preferred over Top P/Top K, which should be used only if Min P isn't available or for legacy reasons...
Yet, looking at recently published LLMs on Hugging Face and elsewhere, the recommended parameters for sampling are still largely Top K and/or Top P. Is this only for legacy reasons? Or some other reason?
laser50@reddit
Tbh, for example, Qwen's suggested settings are top_p 0.95 and top_k 2; I run top_k at 0 and min_p at 0.05 and its vocabulary seemed much smarter lol.
DrVonSinistro@reddit
I was the biggest Min-P user until the last few models, which gave poor results with it.
bgravato@reddit (OP)
what are you using now?
NNN_Throwaway2@reddit
Definitely not. It's simply a different method that can be used in conjunction with other samplers, and like with everything else, there are trade-offs.
The main advantage of min-P is that it sort of works as a complement to top-P. When the model has high certainty about the next tokens, min-P tends to reinforce that by filtering out less probable tokens. When the model has lower certainty, min-P can help improve diversity by allowing a longer tail of possible completions.
This is also the main disadvantage of min-P, however. When the model has high certainty, min-P can reinforce stale writing or even repetition. Conversely, when the model has low certainty, it can allow in a long tail of incoherent completions. Temperature doesn't help here because min-P is applied first (at least, it is in llama.cpp).
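For concreteness, a minimal sketch of the min-P cut itself: the threshold scales with the top token's probability, so confident distributions get pruned hard while flat ones keep a longer tail. The numbers are purely illustrative.

```python
import numpy as np

def min_p_filter(probs: np.ndarray, min_p: float = 0.05) -> np.ndarray:
    # Keep tokens at least min_p times as likely as the single best token,
    # then renormalize what survives.
    keep = probs >= min_p * probs.max()
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()

# High certainty: the threshold is high, so only the top token survives.
confident = np.array([0.90, 0.05, 0.03, 0.02])
print(min_p_filter(confident, 0.1))   # threshold 0.09 -> only the 0.90 token remains

# Low certainty: the threshold drops, so a longer tail survives.
flat = np.array([0.30, 0.28, 0.22, 0.20])
print(min_p_filter(flat, 0.1))        # threshold 0.03 -> all four tokens remain
```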
Ultimately, min-P is just one tool among many. If you find adding or switching to min-P improves your outputs, use it. Generally, I would recommend sticking with the recommended sampling parameters for a given model and only change them if you are doing creative tasks.
cantgetthistowork@reddit
What's the consensus for coding?
Mart-McUH@reddit
I don't think there can be any consensus, because models are so different that they need a different approach to prompting and sometimes even sampling.
Going just by intuition, I would say that when you do actual code generation (or tool calls etc.), you want a more deterministic sampler, so lowering temperature will have a bigger effect than whether you choose Top P or Min P anyway. When you do actual reasoning / explore solutions, you generally need a more creative, less deterministic sampler, otherwise the model can get stuck and not consider new approaches. But again, Min P and Top P do not affect this as much as temperature / quadratic smoothing etc.
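To make the temperature point concrete, a minimal sketch of how temperature reshapes the distribution before any Top P / Min P cut (assuming plain softmax; logits are made up):

```python
import numpy as np

def softmax_with_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    # Lower temperature sharpens the distribution (more deterministic output);
    # higher temperature flattens it (more exploration).
    scaled = (logits - logits.max()) / temperature
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = np.array([2.0, 1.5, 0.5, -1.0])
print(softmax_with_temperature(logits, 0.2))  # mass concentrates on the top token
print(softmax_with_temperature(logits, 1.5))  # mass spreads across alternatives
```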
cantgetthistowork@reddit
Just knowing which direction to move each variable is already good enough. If it's for coding but the CoT is beneficial, do we want more or less deterministic?
fligglymcgee@reddit
This would depend on the model and, largely, the recommended parameters, as u/NNN_Throwaway2 mentioned.
LambdaLogician@reddit
I've heard a better sampling method is to take the standard deviation of the logits, and include all tokens within the top logit minus 5 or so standard deviations.
See https://arxiv.org/pdf/2411.07641.
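A minimal sketch of that idea: keep tokens whose logit is within n standard deviations of the max logit, then softmax over the survivors. The value of n and the logits below are illustrative, not the paper's exact settings.

```python
import numpy as np

def top_n_sigma_filter(logits: np.ndarray, n: float = 1.0) -> np.ndarray:
    # Keep tokens whose logit is within n standard deviations of the max logit,
    # then softmax over the survivors.
    threshold = logits.max() - n * logits.std()
    masked = np.where(logits >= threshold, logits, -np.inf)
    exp = np.exp(masked - logits.max())
    return exp / exp.sum()

logits = np.array([8.0, 7.5, 6.9, 2.0, 1.5, -3.0])
print(top_n_sigma_filter(logits, n=1.0))  # only the tight cluster near the max survives
```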
StorageHungry8380@reddit
I'm no expert, but reading the paper, their experiments are run at temperatures of 1.0, 1.5, 2.0 and 3.0. I was under the impression one typically did not go much above 1.0 in temperature, at least for coding and such. Unlike the other methods, however, their method seems to behave well even at a temp of 3.0, though to me that suggests it sort of bypasses the effect of temperature...
Ueberlord@reddit
That seems to be the top n-sigma sampling, no?
--top-nsigma in llama.cpp.
nuclearbananana@reddit
That seems quite heavy cause you have to figure out the standard deviation every time?
LambdaLogician@reddit
Compared to the entire neural network, it's a drop in the ocean.
DistanceSolar1449@reddit
The median is always basically 0
Mart-McUH@reddit
Top K alone is not enough, because it can't guarantee cutting the tail of low probability tokens.
Top P and Min P do a very similar thing in cutting that tail of very low probability tokens, and the choice is mostly a matter of taste. The thing is, Min P does it better, since it ensures something with very low probability (relative to the best token) will never make the cut. Top P, with an unlucky distribution on some token, can include even extremely low probability tokens, because it has to keep adding tokens until it reaches a certain budget (e.g. 95% with a value of 0.95). So if the reasonable tokens only add up to, say, 93%, the last 2% of the budget gets filled with very low probability tokens that, if chosen, will likely break the generation. Min P prevents that.
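A minimal sketch of that failure mode; the distribution is made up to match the example, with reasonable tokens summing to 93% plus a long junk tail:

```python
import numpy as np

def top_p_keep(probs: np.ndarray, top_p: float) -> int:
    # Count how many tokens Top P keeps: the smallest prefix of the sorted
    # distribution whose cumulative probability reaches top_p.
    sorted_probs = np.sort(probs)[::-1]
    return int(np.searchsorted(np.cumsum(sorted_probs), top_p) + 1)

def min_p_keep(probs: np.ndarray, min_p: float) -> int:
    # Count how many tokens Min P keeps: those at least min_p times as likely
    # as the best token.
    return int((probs >= min_p * probs.max()).sum())

# Three reasonable tokens sum to 0.93; the remaining 0.07 is spread over
# 100 junk tokens of 0.0007 each.
probs = np.array([0.60, 0.20, 0.13] + [0.0007] * 100)

print(top_p_keep(probs, 0.95))   # ~32 tokens: Top P drags in junk to reach the 95% budget
print(min_p_keep(probs, 0.05))   # 3 tokens: the 0.03 threshold excludes the junk tail
```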
Long_comment_san@reddit
It seems that models are basically hardwired to their default sampler settings.
I had very, very little success using other samplers over the recommended ones. Larger models are actually more flexible in my experience than smaller ones, which are completely rigid.
I kinda wish we started using things like dynamic temp as a default, but things don't seem to be heading that way.
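For what it's worth, a minimal sketch of the entropy-based dynamic temperature idea: the mapping below is an illustrative assumption, not necessarily the exact formula behind llama.cpp's dynatemp options.

```python
import numpy as np

def dynamic_temperature(probs: np.ndarray, min_temp: float = 0.5,
                        max_temp: float = 1.5, exponent: float = 1.0) -> float:
    # Scale temperature with the normalized entropy of the distribution:
    # confident (low-entropy) steps sample near min_temp, uncertain
    # (high-entropy) steps near max_temp.
    p = probs[probs > 0]
    entropy = -(p * np.log(p)).sum()
    max_entropy = np.log(len(probs))
    return min_temp + (max_temp - min_temp) * (entropy / max_entropy) ** exponent

print(dynamic_temperature(np.array([0.997, 0.001, 0.001, 0.001])))  # close to min_temp
print(dynamic_temperature(np.array([0.25, 0.25, 0.25, 0.25])))      # exactly max_temp
```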