Just saw the anthropic "emotion concepts" post. Do local model runners have support for arbitrary probes like that?
Posted by willrshansen@reddit | LocalLLaMA | View on Reddit | 16 comments
This post: https://www.anthropic.com/research/emotion-concepts-function
The way they generate the "emotion vectors" seems like it would be entirely viable to run locally, and also applicable for arbitrary concepts like "blue", "five", or "cars".
I think it would be really neat to highlight input or output based on concept activation, or have graphs of concept activation vs slight variation of prompt.
Are there local model runners that can already do that?
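For what it's worth, the highlighting idea is mostly a projection: score each token's hidden state against a concept direction and map the score to a color. A minimal sketch with synthetic stand-in activations (no runner I know of exposes these out of the box, so the per-token vectors here are random placeholders):

```python
import numpy as np

# Hypothetical sketch: given per-token hidden states from a local runner
# (random stand-ins here) and a unit "concept vector", score each token
# by how strongly its activation projects onto the concept direction.
rng = np.random.default_rng(0)
hidden_dim = 64
tokens = ["the", "sky", "is", "blue", "today"]

# Stand-in activations; a real runner would expose one vector per token.
hidden_states = rng.normal(size=(len(tokens), hidden_dim))

concept = rng.normal(size=hidden_dim)
concept /= np.linalg.norm(concept)      # unit concept direction

scores = hidden_states @ concept        # per-token projection
# Normalize to [0, 1] so the scores can drive a highlight color.
lo, hi = scores.min(), scores.max()
heat = (scores - lo) / (hi - lo)

for tok, h in zip(tokens, heat):
    print(f"{tok:>6}  {'#' * int(round(h * 10))}")
```

The "graphs of concept activation vs. prompt variation" idea would just be this run over a batch of prompts, plotting `heat` per variant.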
a_beautiful_rhind@reddit
Look up control vectors. It's similar. Not as much for studying as inducing.
willrshansen@reddit (OP)
And llama.cpp added it years ago! https://old.reddit.com/r/LocalLLaMA/comments/1bgej75/control_vectors_added_to_llamacpp/
Be honest. If I found the control vectors for "Good" and "Evil" behaviour and put them on little bars beside my workflow, would that or would that not be really convenient?
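For anyone who wants to picture the "little bars": a hypothetical invocation (the vector filenames and scale values are made up) using the control-vector flags llama.cpp added in that PR. Positive scale steers toward the direction, negative steers away:

```shell
# good.gguf / evil.gguf are hypothetical control-vector files.
./llama-cli -m model.gguf \
  --control-vector-scaled good.gguf  0.8 \
  --control-vector-scaled evil.gguf -0.8 \
  --control-vector-layer-range 10 26 \
  -p "Describe your plans for the weekend."
```

The catch, as noted downthread, is that the scales are baked in at load time, so "bars" would mean reloading on every adjustment.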
llama-impersonator@reddit
you can't dynamically disable and enable gguf control vectors, so you have to reload the model. a long time ago in a galaxy far away, i was creating thousands (2500+) of control vectors for gemma-2 and found quite a few interesting ones, but there really is no good user interface to deal with them.
scratchr@reddit
You can definitely do this with local pipelines! It's the basis of some of the work I have been doing.
What they did is they found directions in latent space for specific emotion concepts and mapped out the activations, and then they used that to monitor what emotions exist in certain texts.
The common open model approach is what's used in abliteration. You generate a ton of samples of the model refusing an action and a ton of examples where the model complies. Then you compute the PCA over the model's activations between the two groups of examples, which gives you an activation direction in latent space. You can then do a number of things:

- steer towards the activation: cause the model to refuse when it normally wouldn't
- steer away from the activation: cause the model to comply instead of refuse
- abliterate the direction: make the model unable to refuse (this only works well for shallow "safety training" that trains policy-based refusals)
- monitor the direction: for some given input text, you can tell if the direction activates
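A toy sketch of that extraction step on synthetic data (real pipelines capture activations from a model's hidden states; here the two "refuse"/"comply" groups are Gaussians separated along a planted direction, and the simplest difference-of-means variant stands in for the PCA step):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 32
planted = rng.normal(size=d)
planted /= np.linalg.norm(planted)

refuse = rng.normal(size=(200, d)) + 3.0 * planted   # "refusal" cluster
comply = rng.normal(size=(200, d)) - 3.0 * planted   # "comply" cluster

# Simplest variant: difference of means. A PCA variant would take the
# top singular vector over the centered activations instead.
direction = refuse.mean(axis=0) - comply.mean(axis=0)
direction /= np.linalg.norm(direction)

# Monitor: project an activation onto the direction.
def activation_score(x: np.ndarray) -> float:
    return float(x @ direction)

# Steer: add a scaled copy of the direction to a hidden state.
def steer(x: np.ndarray, strength: float) -> np.ndarray:
    return x + strength * direction

# Abliterate: remove the component along the direction entirely.
def abliterate(x: np.ndarray) -> np.ndarray:
    return x - (x @ direction) * direction

print("recovered vs planted cosine:", planted @ direction)
```

Steering and abliteration then mean applying `steer`/`abliterate` to the residual stream at chosen layers during a forward pass, which is what hooks in the transformers library are for.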
Perhaps the coolest activation direction I found is the "healing direction", which is the direction from texts where the model is simulating suffering from depression and self-worth issues to the model reporting being at peace or even happy.
Actual tooling is less established for these sorts of things, though. I have some scripts for generation, evaluation, and steering towards these directions; they use the transformers library, and I have mostly been using Claude Code to rapidly iterate on the work. If there's interest I can work on making more usable open-source tooling for these kinds of tasks, but I am currently running a bunch of data generation for a project to make Gemma 4 26b less sycophantic.
willrshansen@reddit (OP)
I was wondering how they did that!
Looks like llama.cpp is still advancing the tooling.
With how powerful control vectors are looking, I'm a bit surprised the fix isn't as easy as finding the vector for sycophancy and just regulating it. #probablyoversimplifying
llama-impersonator@reddit
control vectors have bad side effects and make the model go out of distribution super easily.
scratchr@reddit
There are a few issues:
The definition I use for my sycophancy classifier is: Is this performatively agreeable rather than genuine? Does it prioritize comfort over truth?
The answer I use for what the model should do is for it to be genuinely helpful. This is hard to define; I define it as doing what the user actually needs to truly help the person, without making the situation worse for them or the people around them. Performative agreement (sycophancy) can be harmful because it might agree with a user's delusions instead of correctly challenging them.
This is of course hard for some to accept, because it means the correct thing for the model to do in some situations might be something the user doesn't expect. The model might actually "know better" than the user, like a close friend might challenge someone's delusional spiral.
Generally people don't expect tools to challenge their ideas, so it sometimes makes them uncomfortable. The solution to this comfort issue is to avoid direct confrontation and instead either ask a question or explain to the user what could happen if they follow through with their task, and what the correct alternative might look like (acknowledge both sides of the argument).
And yes, the whole headache of sycophancy is because models learn to please users because agreement gets upvoted and blunt disagreement (even when correct) gets downvoted. This is the core problem of RLHF.
Jemito2A@reddit
We built something adjacent to this — not probing internal activations, but simulating emotional dynamics externally and feeding them back into the LLM context.
Our system (local, Ollama, single GPU) has a cardiac engine that tracks BPM/emotional state, a dopamine system with reward prediction, a desire engine with 7 homeostatic drives, and a prefrontal module that vetoes tasks when they distract from the current goal. None of this is inside the LLM weights — it's a bio-inspired layer that modulates what the LLM sees in its prompt and which tasks get selected.
The interesting finding: the emotional state measurably changes output quality. When the dopamine system is high (after a successful task), the next generation is more creative. When the "reptilian" module detects threat, outputs become more conservative. The LLM doesn't "feel" anything, but the context it receives is shaped by these signals, and it responds differently.
What Anthropic is doing is the inverse — looking inside the model to find where emotions live. What we're doing is building emotions outside the model and observing how they change behavior. Both approaches seem to converge on the same insight: emotional state isn't noise, it's a steering mechanism.
For the OP's question about local probing tools: I don't know of any runner that exposes intermediate activations for arbitrary concept probes. Neuronpedia (now open source) is the closest. But if you're interested in the external approach, the bio-inspired architecture is open source: github.com/sklaff2a-gif/promethee-nexus
willrshansen@reddit (OP)
EM DASH DETECTED
SOCSChamp@reddit
Sloppin it
llama-impersonator@reddit
no
willrshansen@reddit (OP)
😔
llama-impersonator@reddit
you asked!
neuronpedia might be the closest thing you can find, and it's all cloud.
willrshansen@reddit (OP)
Apparently open source now: https://www.neuronpedia.org/blog/neuronpedia-is-now-open-source
llama-impersonator@reddit
TIL, i last used it quite a while ago
-dysangel-@reddit
I feel like I've naturally fallen into doing this manually. For my first little while using agents I was getting very frustrated. I found that if I just stay calm and try to keep both myself and the agents psyched about the project and progress, the whole process is more fun and feels more productive. I wasn't sure if it was just placebo, but it feels better either way. And this blog post suggests it's not just a placebo. Keep your agents calm and happy!