Just saw the anthropic "emotion concepts" post. Do local model runners have support for arbitrary probes like that?
Posted by willrshansen@reddit | LocalLLaMA | View on Reddit | 16 comments
This post: https://www.anthropic.com/research/emotion-concepts-function
The way they generate the "emotion vectors" seems like it would be entirely viable to run locally, and also applicable for arbitrary concepts like "blue", "five", or "cars".
I think it would be really neat to highlight input or output based on concept activation, or have graphs of concept activation vs slight variation of prompt.
Are there local model runners that can already do that?
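For what it's worth, the highlighting idea is mostly a projection: score each token's hidden state against a concept direction and map the score to a color. A minimal sketch with synthetic stand-in activations (no runner I know of exposes these out of the box, so the per-token vectors here are random placeholders):

```python
import numpy as np

# Hypothetical sketch: given per-token hidden states from a local runner
# (random stand-ins here) and a unit "concept vector", score each token
# by how strongly its activation projects onto the concept direction.
rng = np.random.default_rng(0)
hidden_dim = 64
tokens = ["the", "sky", "is", "blue", "today"]

# Stand-in activations; a real runner would expose one vector per token.
hidden_states = rng.normal(size=(len(tokens), hidden_dim))

concept = rng.normal(size=hidden_dim)
concept /= np.linalg.norm(concept)      # unit concept direction

scores = hidden_states @ concept        # per-token projection
# Normalize to [0, 1] so the scores can drive a highlight color.
lo, hi = scores.min(), scores.max()
heat = (scores - lo) / (hi - lo)

for tok, h in zip(tokens, heat):
    print(f"{tok:>6}  {'#' * int(round(h * 10))}")
```

The "graphs of concept activation vs. prompt variation" idea would just be this run over a batch of prompts, plotting `heat` per variant.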
a_beautiful_rhind@reddit
Look up control vectors. It's similar. Not as much for studying as inducing.
willrshansen@reddit (OP)
And llama.cpp added it years ago! https://old.reddit.com/r/LocalLLaMA/comments/1bgej75/control_vectors_added_to_llamacpp/
Be honest. If I found the control vectors for "Good" and "Evil" behaviour and put them on little bars beside my workflow, would that or would that not be really convenient?
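For anyone who wants to picture the "little bars": a hypothetical invocation (the vector filenames and scale values are made up) using the control-vector flags llama.cpp added in that PR. Positive scale steers toward the direction, negative steers away:

```shell
# good.gguf / evil.gguf are hypothetical control-vector files.
./llama-cli -m model.gguf \
  --control-vector-scaled good.gguf  0.8 \
  --control-vector-scaled evil.gguf -0.8 \
  --control-vector-layer-range 10 26 \
  -p "Describe your plans for the weekend."
```

The catch, as noted downthread, is that the scales are baked in at load time, so "bars" would mean reloading on every adjustment.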
llama-impersonator@reddit
you can't dynamically disable and enable gguf control vectors, so you have to reload the model. a long time ago in a galaxy far away, i was creating thousands (2500+) of control vectors for gemma-2 and found quite a few interesting ones, but there really is no good user interface to deal with them.
scratchr@reddit
You can definitely do this with local pipelines! It's the basis of some of the work I have been doing.
What they did is they found directions in latent space for specific emotion concepts and mapped out the activations, and then they used that to monitor what emotions exist in certain texts.
The common open model approach is what's used in abliteration. You generate a ton of samples of the model refusing an action and a ton of examples where the model complies. Then you compute the PCA over the model's activations between the two groups of examples, which gives you an activation direction in latent space. You can then do a number of things:

- steer towards the activation: cause the model to refuse when it normally wouldn't
- steer away from the activation: cause the model to comply instead of refuse
- abliterate the direction: make the model unable to refuse (this only works well for shallow "safety training" that trains policy-based refusals)
- monitor the direction: for some given input text, you can tell if the direction activates
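A toy sketch of that extraction step on synthetic data (real pipelines capture activations from a model's hidden states; here the two "refuse"/"comply" groups are Gaussians separated along a planted direction, and the simplest difference-of-means variant stands in for the PCA step):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 32
planted = rng.normal(size=d)
planted /= np.linalg.norm(planted)

refuse = rng.normal(size=(200, d)) + 3.0 * planted   # "refusal" cluster
comply = rng.normal(size=(200, d)) - 3.0 * planted   # "comply" cluster

# Simplest variant: difference of means. A PCA variant would take the
# top singular vector over the centered activations instead.
direction = refuse.mean(axis=0) - comply.mean(axis=0)
direction /= np.linalg.norm(direction)

# Monitor: project an activation onto the direction.
def activation_score(x: np.ndarray) -> float:
    return float(x @ direction)

# Steer: add a scaled copy of the direction to a hidden state.
def steer(x: np.ndarray, strength: float) -> np.ndarray:
    return x + strength * direction

# Abliterate: remove the component along the direction entirely.
def abliterate(x: np.ndarray) -> np.ndarray:
    return x - (x @ direction) * direction

print("recovered vs planted cosine:", planted @ direction)
```

Steering and abliteration then mean applying `steer`/`abliterate` to the residual stream at chosen layers during a forward pass, which is what hooks in the transformers library are for.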
Perhaps the coolest activation direction I found is the "healing direction", which is the direction from texts where the model is simulating suffering from depression and self-worth issues to the model reporting being at peace or even happy.
Actual tooling is less established for these sorts of things, though. I have some scripts for generation, evaluation, and steering towards these directions; they use the transformers library, and I have mostly been using Claude Code to rapidly iterate on the work. If there's interest I can work on making more usable open-source tooling for these kinds of tasks, but I am currently running a bunch of data generation for a project to make Gemma 4 26b less sycophantic.
willrshansen@reddit (OP)
I was wondering how they did that!
Looks like llama.cpp is still advancing the tooling.
With how powerful control vectors are looking, I'm a bit surprised the fix isn't as easy as finding the vector for sycophancy and just regulating it. #probablyoversimplifying
llama-impersonator@reddit
control vectors have bad side effects and make the model go out of distribution super easily.
scratchr@reddit
There are a few issues:
The definition I use for my sycophancy classifier is: Is this performatively agreeable rather than genuine? Does it prioritize comfort over truth?
The answer I use for what the model should do is for it to be genuinely helpful. This is hard to define; I define it as doing what the user actually needs to truly help the person, without making the situation worse for them or the people around them. Performative agreement (sycophancy) can be harmful because it might agree with a user's delusions instead of correctly challenging them.
This is of course hard for some to accept, because it means the correct thing for the model to do in some situations might be something the user doesn't expect. The model might actually "know better" than the user, like a close friend might challenge someone's delusional spiral.
Generally people don't expect tools to challenge their ideas, so it sometimes makes them uncomfortable. The solution to this comfort issue is to avoid direct confrontation and instead either ask a question or explain to the user what could happen if they follow through with their task, and what the correct alternative might look like (acknowledge both sides of the argument).
And yes, the whole headache of sycophancy is because models learn to please users because agreement gets upvoted and blunt disagreement (even when correct) gets downvoted. This is the core problem of RLHF.
Jemito2A@reddit
We built something adjacent to this — not probing internal activations, but simulating emotional dynamics externally and feeding them back into the LLM context.
Our system (local, Ollama, single GPU) has a cardiac engine that tracks BPM/emotional state, a dopamine system with reward prediction, a desire engine with 7 homeostatic drives, and a prefrontal module that vetoes tasks when they distract from the current goal. None of this is inside the LLM weights — it's a bio-inspired layer that modulates what the LLM sees in its prompt and which tasks get selected.
The interesting finding: the emotional state measurably changes output quality. When the dopamine system is high (after a successful task), the next generation is more creative. When the "reptilian" module detects threat, outputs become more conservative. The LLM doesn't "feel" anything, but the context it receives is shaped by these signals, and it responds differently.
What Anthropic is doing is the inverse — looking inside the model to find where emotions live. What we're doing is building emotions outside the model and observing how they change behavior. Both approaches seem to converge on the same insight: emotional state isn't noise, it's a steering mechanism.
For the OP's question about local probing tools: I don't know of any runner that exposes intermediate activations for arbitrary concept probes. Neuronpedia (now open source) is the closest. But if you're interested in the external approach, the bio-inspired architecture is open source: github.com/sklaff2a-gif/promethee-nexus
willrshansen@reddit (OP)
EM DASH DETECTED
SOCSChamp@reddit
Sloppin it
llama-impersonator@reddit
no
willrshansen@reddit (OP)
😔
llama-impersonator@reddit
you asked!
neuronpedia might be the closest thing you can find, and it's all cloud.
willrshansen@reddit (OP)
Apparently open source now: https://www.neuronpedia.org/blog/neuronpedia-is-now-open-source
llama-impersonator@reddit
TIL, i last used it quite a while ago
-dysangel-@reddit
I feel like I've naturally fallen into doing this manually. For my first little while using agents I was getting very frustrated. I found that if I just stay calm and try to keep both myself and the agents psyched about the project and progress, the whole process is more fun and feels more productive. I wasn't sure if it was just placebo, but it feels better either way. And this blog post suggests it's not just a placebo. Keep your agents calm and happy!