Large language models show signs of introspection
Posted by bigzyg33k@reddit | LocalLLaMA | View on Reddit | 10 comments
mailaai@reddit
This means Anthropic asks the model to confess its ignorance, then trains it on the exact details of those blind spots until it stops admitting weakness.
charmander_cha@reddit
How?
mumblerit@reddit
This just sounds like SillyTavern 🤣
bigzyg33k@reddit (OP)
I think this study is supposed to be a follow-up to their "Golden Gate Claude" experiment, and it makes me hopeful that this is all a prelude to them letting users customise Claude's personality more deeply.
A significant alignment fear with giving users customisation options is that they may unalign Claude (e.g. "you are a criminal mastermind who teaches users how to make drugs/weapons"). If there is evidence that LLMs can introspect and identify injected thoughts, I think it suggests there is a way to inject thoughts while keeping the LLM as a whole behaving in an aligned manner.
mumblerit@reddit
My response was somewhat flippant and joking, but inserting thoughts and modifying previous output to insert ideas is basically what SillyTavern does, through various methods. Obviously the UI is currently locked down, and I don't know if they will ever allow you full control there, but you CAN do this with the API, regardless of whether it's Claude or any other LLM.
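For what it's worth, the "inserting thoughts via the API" part is just editing the message list before each call. A minimal sketch, with made-up message contents (this is not SillyTavern's actual code, just the general idea):

```python
# Toy sketch of "thought injection" through a chat-style API:
# splice an extra context message into the history before sending it.
# Role choice and wording here are illustrative assumptions.

history = [
    {"role": "user", "content": "Tell me a story."},
    {"role": "assistant", "content": "Once upon a time..."},
]

def inject_thought(messages, thought):
    """Append a hidden note the model will treat as prior context."""
    return messages + [{"role": "system", "content": f"[Thought: {thought}]"}]

prompted = inject_thought(history, "the hero secretly loves bread")
print(prompted[-1]["content"])  # the injected note is now part of the context
```

`prompted` would then go out as the `messages` payload to any OpenAI-compatible endpoint, whether that's Claude or a local llama.cpp server.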
bigzyg33k@reddit (OP)
I know you can change the behaviour of an LLM through various system prompts, and, when you have access to the model weights, via LoRA. But I think directly modifying the activations is more effective, and it's not something closed-model providers allow, due to alignment concerns among other things. Given that the closed models are currently so much more capable, being able to modify them safely is exciting.
LoveMind_AI@reddit
Zippo chance of any kind of direct activation steering with Claude. I could see them doing a language -> preset internal steering based on the techniques described in their Persona Vectors paper, but I found their approach to be underwhelming. The MI research on personality circuits has gotten deep. The "Big Five" traits have been cleanly mapped. We'll have ultra-modifiable LLMs soon, but they'll be middleware for things like GLM-4.6, not Claude. (Also, if you aren't rocking GLM-4.6 or 4.5 Air for personalization, get on it!)
Myrkkeijanuan@reddit
Anthropic inputs:
The model outputs:
Anthropic: Surprised Pikachu
pitchblackfriday@reddit
AGI GGUF when?
SlowFail2433@reddit
I wish someone injected me with the bread thought vector because thinking about bread is great