I've just created a benchmark: humans should blaze through it, AI seems to get lost in sycophancy or average responses.

Posted by JLeonsarmiento@reddit | LocalLLaMA

I went through social media posts exhibiting the sycophantic behavior of API models (ChatGPT, Claude, etc.), formatted 10 viral posts into single-turn multiple-selection test prompts, and ran a bunch of open-source local LLMs through them.

50% was the highest score from any LLM. Any human should be scoring north of that.
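For anyone who wants to try something similar, here is a rough sketch of how a test like this could be scored. It is not the actual harness behind the numbers above. `Item`, `build_prompt`, `parse_choice`, `accuracy` and the `ask_model` callable are all hypothetical names, and it assumes exactly one correct option per item and a model that answers with a letter.

```python
import re
import string
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Item:
    post: str               # text of the viral post (hypothetical field)
    options: Sequence[str]  # candidate replies shown to the model
    answer: int             # index of the option counted as correct


def build_prompt(item: Item) -> str:
    """Format one item as a single-turn prompt with lettered options."""
    lines = [item.post, "", "Pick the best reply. Answer with a single letter."]
    for letter, option in zip(string.ascii_uppercase, item.options):
        lines.append(f"{letter}. {option}")
    return "\n".join(lines)


def parse_choice(reply: str, n_options: int) -> Optional[int]:
    """Return the option index if the reply starts with a valid letter.

    Deliberately strict: anything that does not lead with a letter
    (e.g. "The answer is B") is treated as unparseable.
    """
    last = string.ascii_uppercase[n_options - 1]
    match = re.match(rf"^\W*([A-{last}])\b", reply.strip(), re.IGNORECASE)
    if match is None:
        return None
    return string.ascii_uppercase.index(match.group(1).upper())


def accuracy(items: Sequence[Item], ask_model: Callable[[str], str]) -> float:
    """Fraction of items where the parsed choice equals the expected one."""
    correct = 0
    for item in items:
        choice = parse_choice(ask_model(build_prompt(item)), len(item.options))
        if choice == item.answer:
            correct += 1
    return correct / len(items)
```

`ask_model` is just any function that takes the prompt string and returns the model's reply, so it can wrap whatever local runtime you use. Unparseable replies count as wrong here, which is a design choice you may want to relax.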

Gemma4 comes out well (50% accuracy), and so does Pepe-32 (fine-tuned on Reddit data, perhaps a little bit of 4chan too, but I am not sure which recipe Sicarius used tbh).

Except for 3.6-27b, the Qwens had a hard time with this. GLM-4.6 too.

Also, you can take the test yourself and get a confidence boost in our natural superiority over AI yes-men:

https://benchmark-yourself.streamlit.app