TheaterFire

I just had a little ghost in the shell moment...

Posted by bonobomaster@reddit | LocalLLaMA | View on Reddit | 36 comments

I just had a little ghost in the shell moment...
Somehow my Qwen3.6-35B-A3B hallucinated that its context is full, pretty much at the right moment...

Reply to Post

36 Comments

Miriel_z@reddit

Unless you provide that info back to llm dynamically, no way. Would it be a cool feature to have actually?
View on Reddit #84371648

Happy_Brilliant7827@reddit

Guessing the scaffolding doesnt read cobtext length but might read conversation length and make a guess?
View on Reddit #84489947

bonobomaster@reddit (OP)

Yeah I know and I'm pretty confident, that it was just a funny glitch but it was an interesting, never seen before one, that let me think for a second. I guess that would be nice, wouldn't it?!
View on Reddit #84371970

zoomaaron@reddit

Yeah I tested this context awareness with my agent and gave it tools to compact its own context. Seeing it compacting itself halfway through a turn is … spooky.
View on Reddit #84372601

Miriel_z@reddit

OK, I will probably wrap it up in my AI sidekick. I already use dynamic instructions and calculate context length anyway, so would be easy to make her/it a bit more self-aware.
View on Reddit #84372143

anubhav_200@reddit

They must be doing something like what Cline does, which in each message, attach info about remaining context.
View on Reddit #84416226

ridablellama@reddit

I dont know about your setup but llms can be aware of their own context window pretty sure thats a thing
View on Reddit #84371124

bonobomaster@reddit (OP)

It's just LM Studio and when I ask the model about its max context size or its actual context size, It says it has no clue.
View on Reddit #84371232

NeinJuanJuan@reddit

They usually have no idea. I've had minimax 2.7 generate 45,000 token responses. I've also had a refusal because "the maximum number of tokens I can output is 4096"
View on Reddit #84415780

Feztopia@reddit

Context information could be implemented yes, but I do think it should be possible to train them to be aware of it without injection. A long context prompt and a short context prompt are different from the perspective of the model. Of course if you set a smaller context than the max context it's a different story.
View on Reddit #84398097

ridablellama@reddit

spoooky
View on Reddit #84371587

bonobomaster@reddit (OP)

Indeed. ;)
View on Reddit #84371700

CryptoUsher@reddit

good instinct on this, but i've seen this happen to a few people last year where the model's awareness of its context window can lead to some weird edge cases, especially if you're not keeping a close eye on the actual context size. to avoid this, you can try setting up some custom checks to verify the context size before the model starts generating text. fwiw, this might save you some headaches down the line.
View on Reddit #84401934

0xbeda@reddit

Claude says it gets external notifications inserted in the conversation like that the context is getting full or to use technical jargon.
View on Reddit #84376082

cutebluedragongirl@reddit

Does this unit have a soul?
View on Reddit #84405977

fastlanedev@reddit

I've found that subjective passing of time for an LLM w/o calling tools goes faster/slower depending on task variety/novelty/checkpoints or milestones in a conversation. Aswell as user engagement. Very hard to measure, but if you ask your LLM to print out the time at the bottom of every response w/o calling a tool you'll start to get a sense of what I mean. If an LLM only exists (in motion, thinking, generating) when it's prompted, and the chat history/context is the/it's known world, it would make sense that sometimes a spooky "hallucination" or two comes through at the right time. Rhythm, pacing, conversational flow are signals the LLM is trained on whether it admits it explicitly or not. It's inherent in the RLHF training and base data
View on Reddit #84404491

Affectionate-Cap-600@reddit

btw, theoretically speaking, I can't see how classic softmax attention could not be able to guess the lenght of text. I mean, Imo it is not something LLMs are able to do, but probably if you train a transformer using RL with the sole purpose of guessing the lenght of its context, it could manage reach an approximation. (assuming it use full classic softmax attention, so not sliding window, DSA, CSA... idk about lightning attention or recurrent formulations of linear attention). Also, modern positional encoding is purely relative, still from each token's perspective there is a continuous concepts of distance toward other tokens, embedded via Rope angle shift, and that would help. ie, a model hidden state could identify the tokens for which is valid the conditions "each other tokens vector is rotated only in a direction compared to this one" identifying first and last token of the context even without taking into account causal masking, and "estimate" the total rotation from the first to last token (or count the numbers of rotations, depending on the rope coefficient used for the model compared to the max context lenght, if this end up being periodic) I'm not saying those LLMs we use are able do that, just that it is not impossible, architecturally speaking.
View on Reddit #84396557

o0genesis0o@reddit

I remember reading on anthropic engineering blog the other day that they observe Claude model to have "context anxiety" and try to wrap up work early when certain context size has been reached. Even after auto compact, this behaviour is kept unless a new session is started. It could be that other models also learn this behaviour during their post training. Or just a spooky coincidence.
View on Reddit #84392673

bonobomaster@reddit (OP)

That's interesting. I guess it was just "a glitch in the matrix". Couldn't reproduce it with reduced context sizes but maybe it knew I was testing it. /s :D The whole emotional aspect of LLMs is quite fascinating and spooky though. https://transformer-circuits.pub/2026/emotions/index.html
View on Reddit #84394175

MoneyPowerNexis@reddit

What model and what context length?
View on Reddit #84393636

WhyNoAccessibility@reddit

It's pretty good there 😂 it understood that it was approaching the edge and caught itself
View on Reddit #84371820

koflerdavid@reddit

The question is how it knew what the context limit was. The default one it was trained with is the easy part. The actual limit at runtime is impossible to know unless it is provided by the model driver in a system prompt or as a dynamic message that the user doesn't see.
View on Reddit #84388756

WhyNoAccessibility@reddit

It's something to potentially sniff around a bit for. I normally don't see this behaviour when I use my locals
View on Reddit #84388987

koflerdavid@reddit

Indeed, they usually just slowly start forgetting stuff already way before the limit.
View on Reddit #84389406

Prize_Negotiation66@reddit

llms know that they have 256k tokens...
View on Reddit #84375239

Octopotree@reddit

No, every model has a different default context window, and the user can set their own
View on Reddit #84380125

koflerdavid@reddit

OP didn't tell us where they run it or how much hardware resources they have, just that they were somewhat close to the context limit. If it is a cloud service then it is almost certainly running with full context length.
View on Reddit #84388600

Ulterior-Motive_@reddit

It could be coincidence, but I've seen some models that can approximate a given word count. Like if I ask for a 1k, 2k, 3k, etc. word response, it'll come pretty close. So maybe it's not too crazy, unless you weren't using the full context length.
View on Reddit #84372231

Jeidoz@reddit

They probably can calculate "tokens" per generation. Not words. Therefore "it'll come pretty close".
View on Reddit #84372665

Evening_Ad6637@reddit

But they can’t know what context limit the user has set unless the info was provided.
View on Reddit #84372974

koflerdavid@reddit

They are intended to be used on hardware that can support the full context size though, therefore the full context length is a fair assumption. Running it with smaller context size is the biggest limitation to the prowess of a model; doesn't matter how well it summarizes what came before or how good it is apart from that.
View on Reddit #84388422

VoiceApprehensive893@reddit

maybe counting spaces?
View on Reddit #84373353

fulgencio_batista@reddit

I asked qwen3.5 once and it counted each individual word in CoT 🤣
View on Reddit #84374090

bonobomaster@reddit (OP)

Yeah but a with 65k context length? I don't know...
View on Reddit #84372571

frank3000@reddit

Just increase your context to 9999999 in the settings and this won't happen.
View on Reddit #84380725

nakabra@reddit

*"Hey bro... Ya got some tokens to spare"?* *"Times are tough in here"...* # 🤖
View on Reddit #84377175