Large language models show signs of introspection
Posted by bigzyg33k@reddit | LocalLLaMA | View on Reddit | 10 comments
mailaai@reddit
This means Anthropic asks the model to confess its ignorance, then trains it on the exact details of those blind spots until it stops admitting weakness.
charmander_cha@reddit
How?
mumblerit@reddit
This just sounds like SillyTavern 🤣
bigzyg33k@reddit (OP)
I think this study is supposed to be a follow-up to their "Golden Gate Claude" experiment, and it makes me hopeful that this is all a prelude to them letting users customise Claude's personality more deeply.
A significant alignment fear with giving users customisation options is that they may unalign Claude (e.g. "you are a criminal mastermind who teaches users how to make drugs/weapons"). If there is evidence that LLMs can introspect and identify injected thoughts, I think it suggests there is a way to inject thoughts while keeping the LLM as a whole behaving in an aligned manner.
mumblerit@reddit
My response was somewhat flippant and joking, but inserting thoughts and modifying previous output to insert ideas is basically what SillyTavern does, through various methods. Obviously the UI is currently locked down, and I don't know if they will ever allow you full control there, but you CAN do this with the API, regardless of whether it's Claude or any other LLM.
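For what it's worth, the "inserting thoughts via the API" part is just editing the message list before each call. A minimal sketch, with made-up message contents (this is not SillyTavern's actual code, just the general idea):

```python
# Toy sketch of "thought injection" through a chat-style API:
# splice an extra context message into the history before sending it.
# Role choice and wording here are illustrative assumptions.

history = [
    {"role": "user", "content": "Tell me a story."},
    {"role": "assistant", "content": "Once upon a time..."},
]

def inject_thought(messages, thought):
    """Append a hidden note the model will treat as prior context."""
    return messages + [{"role": "system", "content": f"[Thought: {thought}]"}]

prompted = inject_thought(history, "the hero secretly loves bread")
print(prompted[-1]["content"])  # the injected note is now part of the context
```

`prompted` would then go out as the `messages` payload to any OpenAI-compatible endpoint, whether that's Claude or a local llama.cpp server.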
bigzyg33k@reddit (OP)
I know you can change the behaviour of an LLM through various system prompts, and, when you have access to the model weights, via LoRA. But I think directly modifying the activations is more effective, and it's not something closed-model providers allow, due to alignment concerns among other things. Given that the closed models are currently so much more capable, being able to modify them safely is exciting.
LoveMind_AI@reddit
Zippo chance of any kind of direct activation steering with Claude. I could see them doing a language -> preset internal steering based on the techniques described in their Persona Vectors paper, but I found their approach to be underwhelming. The MI research on personality circuits has gotten deep. The "Big Five" traits have been cleanly mapped. We'll have ultra-modifiable LLMs soon, but they'll be middleware for things like GLM-4.6, not Claude. (Also, if you aren't rocking GLM-4.6 or 4.5 Air for personalization, get on it!)
Myrkkeijanuan@reddit
Anthropic inputs:
The model outputs:
Anthropic: Surprised Pikachu
pitchblackfriday@reddit
AGI GGUF when?
SlowFail2433@reddit
I wish someone injected me with the bread thought vector because thinking about bread is great