What am I missing about samplers?
Posted by TacticalRock@reddit | LocalLLaMA
Hi all,
With the recent release of models that require temp = 1, top_k = N, and top_p = 0.95, I'm wondering why labs actually prefer those truncation samplers over just min_p?
As far as I understand, min_p isn't supported everywhere, and labs are just following industry standards with top_k and top_p. But if one replaces those two truncation samplers with just min_p, is there a real reason not to? Say, for Qwen 3.6, instead of top_k = 20 and top_p = 0.95, I just use min_p = 0.05-0.10: is there a mechanical/structural or analytical reason against that?
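For concreteness, here's a minimal sketch of how the three truncation rules differ mechanically (plain NumPy over a toy distribution; not any particular library's implementation):

```python
import numpy as np

probs = np.array([0.50, 0.25, 0.12, 0.08, 0.03, 0.02])  # toy distribution, sorted descending

def top_k_mask(p, k=3):
    # keep the k highest-probability tokens, regardless of their mass
    return np.arange(len(p)) < k          # assumes p is sorted descending

def top_p_mask(p, top_p=0.95):
    # keep the smallest prefix whose cumulative mass reaches top_p
    return np.cumsum(p) - p < top_p       # assumes p is sorted descending

def min_p_mask(p, min_p=0.05):
    # keep tokens with probability >= min_p * max(p);
    # the cutoff scales with the model's confidence in its top token
    return p >= min_p * p.max()

for name, mask in [("top_k", top_k_mask(probs)),
                   ("top_p", top_p_mask(probs)),
                   ("min_p", min_p_mask(probs))]:
    kept = probs[mask]
    print(name, kept, "renormalized:", kept / kept.sum())
```

The dynamic nature you mention is visible in `min_p_mask`: when the model is confident (a dominant top token), the threshold rises and the tail gets cut aggressively; when the distribution is flat, the threshold drops and more candidates survive.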
I know I can just stick to the given samplers and call it a day, but I'm just curious, and I like the dynamic nature of min_p :)
Thanks!
Herr_Drosselmeyer@reddit
Don't know about Qwen specifically, but Gemma 4 also recommends Top K=64 and Top P=0.95, and I've run it with just Temp=1.2 and Min P=0.02; works fine.
Take that with a grain of salt, though: I tried it for creative writing. Min P alone, especially at low values, will let some odd tokens through. For coding or other tasks where you don't want the model to be too 'creative', I'd suggest sticking to the recommended settings.
TacticalRock@reddit (OP)
Checks out. Temp of 1, as I understand it in llama.cpp, is just the unscaled distribution, so a value greater than 1 artificially flattens the token distribution, making lower-probability tail tokens more likely to be sampled. From the purist's perspective this may appear as more creative output, but it is more accurately "tolerable incoherence." You're no longer in the model's native distribution, and you're promoting tokens the model itself ranked as less appropriate. That can be good or bad for creativity.
Mart-McUH@reddit
Even temp 1 is a transformed distribution. The raw output of an LLM is just real numbers (called logits) that can range from -infinity to +infinity (or whatever the maximum representable number is in the given format).
You need to transform those logits into probabilities, which is done with the softmax function. Temperature divides the logits inside the exponent: p_i = exp(z_i / T) / Σ_j exp(z_j / T), so T = 1 leaves the softmax output as-is, while T > 1 flattens it.
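A minimal sketch of that formula in Python (the logit values are made up for illustration):

```python
import numpy as np

def softmax_with_temperature(logits: np.ndarray, temp: float = 1.0) -> np.ndarray:
    """Convert raw logits to probabilities; temperature divides the logits."""
    scaled = logits / temp
    scaled -= scaled.max()            # subtract max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = np.array([4.0, 2.5, 1.0, -2.0])      # hypothetical logits for 4 tokens
print(softmax_with_temperature(logits, 1.0))  # the model's "native" distribution
print(softmax_with_temperature(logits, 1.5))  # flatter: tail tokens gain mass
print(softmax_with_temperature(logits, 0.5))  # sharper: top token dominates
```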
TacticalRock@reddit (OP)
Sorry, meant T=1 is without additional scaling.
Herr_Drosselmeyer@reddit
It's usually good unless you go crazy with the values. You want to be in a place where something like "And then he decided to go to the..." can return a variety of tokens. "Beach" might be the top-probability one, but you're OK with "bowling alley" instead; you just need to avoid completely wild ones like "moon". ;)
Obviously a bit simplified, but you get the idea. Samplers that punish repetition also help in that regard. Just don't forget to change back your settings when you want more coherent things like code. I forgot once, and Gemma 4 couldn't generate working code anymore because of it. It actually gave up at some point and told me there was something wrong with my chat interface. It wasn't wrong, per se.
TacticalRock@reddit (OP)
Interesting. Yeah, that's what I want to figure out with rote correct/incorrect benchmarks: what's the Goldilocks min_p value, holding temp at 1, and does it outperform the recommended sampler settings? Creativity-related work is harder to measure, as you demonstrated: how does one even judge the choice of words for creative purposes without setting up a complicated rubric-based LLM-as-judge framework? The easiest way is just reading the output, and that judgment will differ from me to you. I generally agree with your statement: increasing temp modestly doesn't harm coherence, though it does need to be lowered as context grows to mitigate the rot that naturally sets in with bloat. A sketch of the kind of sweep I have in mind is below.
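A minimal sketch, assuming a local llama.cpp llama-server on the default port; the /completion endpoint and its sampler fields are llama.cpp's, but the benchmark items and scoring here are placeholders:

```python
import requests

QUESTIONS = [("What is 17 * 23?", "391")]  # placeholder benchmark items

def ask(prompt: str, min_p: float) -> str:
    # llama.cpp's llama-server exposes POST /completion with sampler fields
    r = requests.post("http://localhost:8080/completion", json={
        "prompt": prompt,
        "temperature": 1.0,   # hold temp at 1, per the experiment
        "min_p": min_p,
        "top_k": 0,           # 0 disables top_k in llama.cpp
        "top_p": 1.0,         # 1.0 disables top_p
        "n_predict": 64,
    })
    return r.json()["content"]

for min_p in (0.02, 0.05, 0.10):
    correct = sum(expected in ask(q, min_p) for q, expected in QUESTIONS)
    print(f"min_p={min_p}: {correct}/{len(QUESTIONS)} correct")
```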
Mart-McUH@reddit
I think it is just historical reasons: top_p came earlier than min_p and is perhaps more commonly known and supported by chat interfaces. Both top_p and min_p try to achieve the same thing, cutting the tail of low-probability tokens, so using one of them is quite important. top_k of course also cuts tokens, but it is more naive, so it is not enough on its own (e.g. very low-probability tokens that would likely just break the answer can still make it into the top k, but will be cut by top_p/min_p).
Running top_k first is useful, though, to speed up sampler processing so that the later samplers do not need to run over the whole token vocabulary, which is very large in today's models (see the sketch below).
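A rough sketch of that ordering (sizes are hypothetical; np.argpartition stands in for the partial selection a real sampler would use, so nothing downstream touches the full vocabulary):

```python
import numpy as np

vocab_size, k = 150_000, 20                 # modern vocabs are ~100k+ tokens
logits = np.random.randn(vocab_size)        # stand-in for a model's raw output

# top_k as a cheap pre-filter: partial selection, no full sort of the vocab
top_idx = np.argpartition(logits, -k)[-k:]
top_logits = logits[top_idx]

# the "real" truncation (min_p here) now runs on only k candidates
p = np.exp(top_logits - top_logits.max())
p /= p.sum()
survivors = top_idx[p >= 0.05 * p.max()]    # min_p = 0.05 on the reduced set
```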
After that, each model behaves differently and each task has different requirements too (more deterministic for factual answers; more variety for creative writing, and also for formal tasks where you want to explore more options; anti-repetition may sometimes be needed but can often break models; etc.).
ResidentPositive4122@reddit
Samplers are knobs that local folks swear by, but no one really wants or needs them, because at the end of the day, and contrary to what popular local users believe, they break the model. For high-accuracy stuff (i.e. math), even min_p is not that good unless you get to large temps (1.5+). Also, the min_p paper only ran the numbers on that era's models (Llama, Mistral 7B, etc.). I don't think anyone has checked since then.
TacticalRock@reddit (OP)
As I suspected. Maybe it's time for me to do some math and IF (instruction-following) benchmarks for min_p vs the default samplers with the smaller Qwens and Gemmas.
DinoAmino@reddit
Thinking models require those higher values in order to generate diverse tokens for their reasoning traces. Lower values are great for more deterministic responses drawn from higher-probability tokens; dense models benefit from that, but reasoning models suffer if they cannot complete the thinking they are trained for.