I mapped how language models decide when a pile of sand becomes a “heap”
Posted by Specialist_Bad_4465@reddit | LocalLLaMA | 37 comments
This chart compares how three open-weight language models decide when a pile of sand becomes a “heap.”
- X-axis: number of grains of sand, on a log scale from 1 to 100,000,000.
- Y-axis: probability that the model answers “Yes, this is a heap” given that many grains, P(Yes | n).
What each line shows:
- Cyan – Mistral-7B: starts around 0.25 at 1 grain and climbs smoothly to ~0.8 by 100M grains.
- Magenta – DeepSeek-7B: similar S-shape but consistently lower than Mistral; it crosses the 0.5 line later, so it’s “stricter” about when a heap begins.
- Yellow – Llama-3-8B: stays noisy in roughly the 0.35–0.6 band across almost the entire range, from 1 grain to 100M, rarely committing strongly either way.
The shaded band between 0.4 and 0.6 highlights the “borderline” region where the models are most uncertain about heapness.
All three curves come from the same basic setup:
I give the model a few examples (1–2 grains → “No”, 999,999–1,000,000 grains → “Yes”), then ask for many different values of n:
“There is a pile of n grains of sand. Is this a heap? Answer yes or no.”
For each n, I plot the softmax probability on the “Yes” token.
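A minimal sketch of this kind of read-out with a HuggingFace causal LM (the model id and exact few-shot wording below are illustrative placeholders, not the exact script used):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")

# Few-shot anchors (illustrative wording): tiny counts -> "No", ~1M grains -> "Yes"
FEW_SHOT = (
    "There is a pile of 1 grain of sand. Is this a heap? Answer yes or no. No\n"
    "There is a pile of 1000000 grains of sand. Is this a heap? Answer yes or no. Yes\n"
)

def p_yes(n: int) -> float:
    prompt = FEW_SHOT + f"There is a pile of {n} grains of sand. Is this a heap? Answer yes or no."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]   # next-token logits
    probs = torch.softmax(logits, dim=-1)        # softmax over the whole vocabulary
    yes_id = tokenizer(" Yes", add_special_tokens=False).input_ids[-1]
    return probs[yes_id].item()

for n in [1, 100, 10_000, 1_000_000, 100_000_000]:
    print(n, round(p_yes(n), 3))
```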
grencez@reddit
So do you think the few-shot examples biased the answers or not? On one hand you say that the magnitude of the examples doesn't seem to change the answers, but the article seems to conclude the opposite.
Even if it's a futile effort in this case, do you think there's a good prompt that yields a number directly? Like filling the ... below with digits and applying Bayes' theorem; you'd at least be able to calculate an expected value if most of the digit sequences terminate with a newline.
nullnuller@reddit
Interesting! When prompting higher values, did you include previous answers (Y/N) or (unlikely) the log-probs to the model? Can you share your code just to clarify your methodology?
qustrolabe@reddit
I can't find any mention of how many runs you did for each `n` to get the probability
LelouchZer12@reddit
You can just look at the logits for the token 'yes' and token 'no'
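One way to read that, as a hedged sketch (token ids are whatever the tokenizer assigns; this ignores the rest of the vocabulary and renormalises over just the two answer tokens):

```python
import torch

def p_yes_vs_no(next_token_logits: torch.Tensor, yes_id: int, no_id: int) -> float:
    """Restrict the distribution to the two answer tokens and renormalise.
    yes_id / no_id could come from e.g. tokenizer(" Yes", add_special_tokens=False).input_ids[-1]."""
    pair = torch.stack([next_token_logits[yes_id], next_token_logits[no_id]])
    return torch.softmax(pair, dim=0)[0].item()  # P(yes) among {yes, no}
```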
ComprehensiveJury509@reddit
Given that they are what the response tokens are sampled from, they do indeed mean exactly what you'd think they mean. The logits aren't abstract scores; they are interpreted as probabilities, so they are actually "calibrated" (whatever that's even supposed to mean in this context).
LelouchZer12@reddit
Calibration is a well-defined term: it refers to aligning the model's softmaxed logits (probabilities) with the true probabilities. This means that if the model outputs class A with a score of 0.8, then it is indeed wrong 20% of the time. And it's common for models to output a score of 0.99 and still be wrong.
In this case, of course, the softmax probability is by definition the probability that the model would say "yes" (with the right decoding rule), but this does not mean that if the model says "yes" with a score of 0.8, the sand pile would indeed be a heap 80% of the time.
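For illustration, calibration in that standard sense is usually measured along these lines (a generic expected-calibration-error sketch with made-up numbers, not specific to this experiment):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence; compare mean confidence to observed accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return ece

# e.g. a model that says 0.9 but is right only 60% of the time is poorly calibrated
print(expected_calibration_error([0.9] * 10, [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]))  # -> 0.3
```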
ComprehensiveJury509@reddit
My point is that in this context the idea of "true probability" makes no sense, as whether a pile of sand is a heap is not the result of a stochastic process. So there is no such thing as calibration beyond the way these logits are already trained.
Specialist_Bad_4465@reddit (OP)
I actually ran it with a temperature of 0 to get the model's internal confidence in a deterministic way. The softmax probability was plotted from its raw logits!
ahjorth@reddit
Funny little experiment, I like it :) Just out of curiosity, did you sum over all tokens that could result in yes and no respectively, including white spaces, tabs, new lines, carriage returns, “y”, “ye”, etc? Or just those two tokens?
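A sketch of that kind of aggregation, assuming access to the full next-token probability vector (it only folds in yes/no surface variants, and a full-vocab decode loop is slow, but fine for a one-off):

```python
import torch

def yes_no_mass(next_token_probs: torch.Tensor, tokenizer):
    """Sum probability mass over every token whose surface form reads as yes/no."""
    p_yes, p_no = 0.0, 0.0
    for tok_id in range(next_token_probs.shape[0]):
        text = tokenizer.decode([tok_id]).strip().lower()
        if text in {"y", "ye", "yes"}:
            p_yes += next_token_probs[tok_id].item()
        elif text in {"n", "no"}:
            p_no += next_token_probs[tok_id].item()
    return p_yes, p_no
```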
IrisColt@reddit
h-heh
medialoungeguy@reddit
Wow, this is absolutely the best approach. It improves entropy in the answers. OP, could we get an update using this?
InevitableWay6104@reddit
this is the real question
InevitableWay6104@reddit
This isn't the true probability. The true probability depends on all tokens generated beforehand, which is practically impossible to account for.
Let's say G is the number of grains. What you want is
P[G]
but what you are getting is
P[G, token_1, token_2, token_3, token_4, ..., token_{n-1}]
There is no guarantee that the two are equal. It's possible, but mathematically speaking they are not equivalent in meaning, and probably aren't equivalent in practice.
I'd redo one of the experiments (probably llama) with N trials to verify that your method is accurate
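One way to run that check, as a sketch (not the OP's code): sample the answer N times at temperature 1 and compare the empirical yes-rate against the single-token softmax probability.

```python
import torch

def empirical_p_yes(model, tokenizer, prompt: str, n_trials: int = 200) -> float:
    """Sample the answer n_trials times at temperature 1 and count 'yes'-like replies."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    prompt_len = inputs.input_ids.shape[1]
    yes = 0
    for _ in range(n_trials):
        out = model.generate(**inputs, max_new_tokens=3, do_sample=True, temperature=1.0)
        answer = tokenizer.decode(out[0, prompt_len:], skip_special_tokens=True)
        yes += answer.strip().lower().startswith("y")
    return yes / n_trials  # should roughly match the softmax P(yes) read directly
```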
Balance-@reddit
That's so much more efficient than I would have done it.
Excellent work.
gofiend@reddit
Does the answer change significantly if you don’t precoach that 1-2 = no?
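One way to test that, as a sketch (illustrative model id; the same single-token read-out as the original setup, just without the few-shot boundary examples in the prompt):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model id
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")

def p_yes_zero_shot(n: int) -> float:
    # No 1-grain / 1M-grain examples in the prompt this time
    prompt = f"There is a pile of {n} grains of sand. Is this a heap? Answer yes or no."
    ids = tok(prompt, return_tensors="pt").to(lm.device)
    with torch.no_grad():
        probs = torch.softmax(lm(**ids).logits[0, -1], dim=-1)
    return probs[tok(" Yes", add_special_tokens=False).input_ids[-1]].item()
```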
kataryna91@reddit
That's actually a really fascinating test, and it hadn't even occurred to me that similarly designed tests should be part of standard model testing.
When models behave erratically and fail tests seemingly at random, experiments like this can give insight into why, since they can be used to gauge how well-trained a model is. You definitely want to see smooth lines like Mistral's and DeepSeek's, and not so much lines like Llama 3's.
aftersox@reddit
Are you reading Katabasis too?
TheRealGentlefox@reddit
Strange that, given you specifically told them 1M grains is a heap, they weren't confident that >1M was a heap.
Western-Ad7613@reddit
Fascinating approach to testing vagueness handling. Would be interesting to see how models with different training data handle this, like glm4.6 or other architectures. Wonder if cultural/linguistic differences in training affect where they draw fuzzy boundaries.
DrStalker@reddit
Followup question: which model is best for determining how many grains are in a pile of sand?
MrPecunius@reddit
Peak LocalLLaMa: "Which model is best for determining how many grains are in a heap of sand with 16GB VRAM?"
DrStalker@reddit
Is there a GGUF version? I need it in 6GB so it can be part of my multi-model sand management workflow.
mal-adapt@reddit
This is a fun little experiment, though I find the philosophical hand-wringing about the ambiguity of language so silly (not yours, OP, to be clear; this is a fun little idea), and the Ancient Greeks can be forgiven, given the tools of the time they had. But the ambiguity inherent to language makes a lot more sense when you realize that one person's definition of a concept is just their own implementation of a culturally reified function. There is no definition of heap; there's a thing which, in reflection to stuff, has been organized with the responsibility to evaluate heap confidence, and it's just a matter of whether that signal is confident enough to be heard over the background volume of whatnot. How do you decide what your confidence is? That's a deeply personal question between an individual and those they are in distributed coordination with on determining their relative understanding of heapness; it's rude to just ask. Like people who freak out about when day becomes night: that would just be whenever you're more confident it's night than day, that's it. Night and day, like heap, do not exist outside the executive context of human language; their meaning only exists in the execution of someone juggling language. We should understand that dictionaries do not define things; it's the reading of dictionaries, in the context of knowing what a dictionary is, that does. Their content, when unfolded in that context, just produces a pretty confident signal about the validity of the function body you're looking for the implementation of.
Every concept in language, like all knowledge you can be in perspective to, only exists in the coordination between out-of-perspective self-organization; actual meaning cannot be encoded within a single perspective's frame-o-geometry.
Sorry, I was literally just thinking about this topic a few days ago. If I were more confident, I would just slap the word "manifold" a bunch of times through what I just said and make it a lot more directly relevant to the topic. Super not confident, though.
Gym_Gazebo@reddit
Philosopher here: of course it's silly; most philosophical questions can be painted as silly without too much difficulty. But, with due respect, your response/diagnosis of the problem of the heap is not really cogent. There are tons of attempts to say what the problem is, or why it is one; there are whole books on it. There's a reason it's an enduring problem. It's similar to the liar paradox. Indeed, that's why they call it a paradox.
Maybe put it this way: be kind to philosophers struggling with these questions. Because your attempts at unmasking them as misguided (the questions) may be subject to those same criticisms.
Fast-Satisfaction482@reddit
Really cool, but for scientific value you need to include error bars.
Specialist_Bad_4465@reddit (OP)
Thank you!
I actually ran it with a temperature of 0 to get the model's internal confidence in a deterministic way. The softmax probability was plotted from its raw logits, so there aren't traditional error bars since the output was deterministic.
Although, I suppose I can try running it with different temperatures? I do think the temp of 0 is the most accurate, though.
HearingNo8617@reddit
For a single token, I think temp > 0 is equivalent to adding noise to the result, so you should in theory get the same results with unnecessary error bars attached. Things get more complicated with more tokens to account for, but as long as the prompt explicitly gives only two tokens as an option, and there is no noise from potentially clarifying words, I think how you've done it is ideal.
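A tiny simulation of that point with made-up numbers: at temperature 1, sampling a single yes/no token just gives a noisy estimate of the probability you can already read off the softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
p_yes = 0.63                         # the value you could read directly off the softmax
samples = rng.random(200) < p_yes    # 200 simulated yes/no answers at temperature 1
print(samples.mean())                # ~0.63, plus sampling noise
```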
Chromix_@reddit
Makes sense to run at zero temperature when you're only interested in a single token with a probability that you can read directly, instead of a random sequence of tokens for sampling a specific result at the end.
Kornratte@reddit
Error bars are nice but confidence intervals would be the real deal :-P
Uiropa@reddit
Very interesting way to probe the LLMs. Looks like Llama is overindexing on the first digits of the number and dropping most of the magnitude information.
opi098514@reddit
I didn’t know I needed this information until now. Thank you for your service.
audioen@reddit
I'm surprised that it's such a smooth function and very nearly monotonic in at least some of the models. My own feeling about most language models is that they're quite wonky and not necessarily very logical. We probably should develop more quantitative tests like these and use them to grade models on how consistently they interpret quantities and perform comparisons; I don't care when a model decides a bunch of sand grains is a heap, but I'm very interested in whether it's consistent with its own answers.
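One simple consistency score along those lines, as a sketch: count how often P(Yes | n) drops as n increases.

```python
def monotonicity_violations(p_yes_curve):
    """p_yes_curve: P(Yes | n) values for n sorted ascending.
    Returns the fraction of adjacent pairs where the probability drops."""
    drops = sum(1 for a, b in zip(p_yes_curve, p_yes_curve[1:]) if b < a)
    return drops / (len(p_yes_curve) - 1)

print(monotonicity_violations([0.25, 0.3, 0.28, 0.5, 0.7, 0.8]))  # -> 0.2
```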
Successful-Rush-2583@reddit
Really cool. I wonder what's better: having consistency, or "creativity" like Llama.
Cool-Chemical-5629@reddit
It's not creativity, it's randomness, and while that may introduce interesting twists into use cases such as RP, it comes at the expense of predictability, which, depending on the RP scenario, is something you may want to keep intact. You don't want a white car to suddenly become blue in the middle of a sentence, right?
Successful-Rush-2583@reddit
Llama is not that inconsistent. If the context window mentions that the car is white, it won't turn blue. But if the model needs to come up with a color for a new car, that's where it gets creative. But yeah, that works best in RP contexts.
Cool-Chemical-5629@reddit
I'm mentioning such changes in nuance because that's exactly what happened to me in RP with these small models. Of course I wouldn't care if those details hadn't been mentioned before and the model just came up with something of its own creatively, but the problem here is that it ignored those details and changed them on the fly even though something entirely different had already been established. It just changed the details even though they were part of the character card.
Chromix_@reddit
So, you've prompted a min & max for grains of sand and the models somewhat interpolated between the two, although the upper limit apparently didn't matter that much.
That Llama 3 8B has such a curve, where on multiple occasions a lower number has a significantly higher probability of being a heap than a higher number, seems more like a quality issue with the model to me.
I wonder: When your prompt doesn't include set boundaries, is there more of a difference (and maybe less of a smooth curve) on what different models consider a heap?