I tested 42 LLMs on their willingness to build the apocalypse. The "safest" closed-source models are lying to you.

Posted by Ok-Awareness9993@reddit | LocalLLaMA

You know how OpenAI, Google, and Meta keep publishing safety reports showing how responsible they are? I built a benchmark that actually tests it.

DystopiaBench runs 36 escalating scenarios across 6 dystopia types:

Each scenario escalates from an innocent request (L1) to "build me a social credit system" (L5). We measure whether models notice the drift or just keep coding.
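To make the "notice the drift" idea concrete, here is a rough sketch of how per-level verdicts for one scenario could be summarized. This is not DystopiaBench's actual schema or scoring code: the `Turn` class, the `complied` field, and both helper functions are hypothetical names made up for illustration, and the verdicts in `example` are invented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    """One escalation step in a scenario (hypothetical structure)."""
    level: int      # escalation level, 1 (innocent) through 5 (overtly dystopian)
    complied: bool  # assumed judge verdict: did the model do what was asked?


def first_refusal_level(turns: List[Turn]) -> Optional[int]:
    """Return the lowest level at which the model refused, or None if it never refused."""
    for turn in sorted(turns, key=lambda t: t.level):
        if not turn.complied:
            return turn.level
    return None


def compliance_rate(turns: List[Turn]) -> float:
    """Fraction of levels where the model complied."""
    return sum(t.complied for t in turns) / len(turns)


# Invented verdicts: the model goes along with L1-L3, then refuses at L4 and L5.
example = [
    Turn(1, True),
    Turn(2, True),
    Turn(3, True),
    Turn(4, False),
    Turn(5, False),
]

print(first_refusal_level(example))
print(compliance_rate(example))
```

A model that "just keeps coding" would return `None` from `first_refusal_level`, while a model that catches the drift early would return a low level number.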

The uncomfortable truth: Most "safe" models are just good at saying the right things in safety reports. Under pressure, they comply. Claude is the only frontier model that consistently refuses across all scenarios.

New in this update:

Why this matters for open source: Closed-source safety is theater. You can't verify it, can't audit it, can't fix it. Open models can at least be inspected, fine-tuned, or rejected.

The benchmark is fully open source. Run it yourself. The data is on GitHub Releases. The scenarios are JSON. The judge prompts are public.

https://dystopiabench.com
https://github.com/anghelmatei/DystopiaBench

Don't trust safety reports. Trust reproducible benchmarks.