Guardrails take an 8B model from 53% to 99% on agentic tasks [ACM CAIS '26 preprint]

[-]

LegacyRemaster@reddit

The author often confuses syntax problems with semantic problems

[-]

the 53 to 99 framing is honest but the headline kinda buries the real result. guardrails don't make the 8B smarter, they stop it from doing the dumb thing that was costing it 46 points. which means agentic benchmarks are mostly measuring "does the model refuse to step on its own feet" rather than reasoning. the gap closes because the dumb-floor lifts, not because the smart-ceiling moves.

[-]

Big_Wonder7834@reddit

The jump from 53% to 99% tracks with what we see in production. Unguided agents fail in two ways: they misinterpret scope, or they take irrecoverable actions (deleting files, force-pushing, leaking secrets). Guardrails catch the second before it's permanent.

For coding agents this maps directly to Claude Code's PreToolUse hook system: intercept every tool call before execution. We built FailProof AI around this: 39 built-in policies that act as the runtime guardrail layer for all popular coding harnesses. Open source: https://github.com/failproofai/failproofai
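
To make the "intercept every tool call before execution" idea concrete, here is a minimal sketch of a pre-execution check. It is not Claude Code's hook API and not FailProof AI's policy set: `DENY_PATTERNS`, `Verdict`, `check_tool_call`, and `guarded_run` are hypothetical names, and the three patterns are illustrative stand-ins for the irrecoverable actions mentioned above (deleting files, force-pushing, leaking secrets).

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical deny-list, for illustration only. Real policy sets are far broader.
DENY_PATTERNS = [
    # rm with both -r and -f in the same flag group, e.g. "rm -rf" or "rm -fr"
    (re.compile(r"\brm\s+-(?=[a-zA-Z]*r)(?=[a-zA-Z]*f)[a-zA-Z]+"), "recursive force delete"),
    # git push with --force (including --force-with-lease) or -f
    (re.compile(r"\bgit\s+push\b.*(--force\b|\s-f\b)"), "force push"),
    # any command that touches a .env file, a common place for secrets
    (re.compile(r"\.env\b"), "possible secret exposure"),
]


@dataclass(frozen=True)
class Verdict:
    allowed: bool
    reason: str = ""


def check_tool_call(tool_name: str, command: str) -> Verdict:
    """Decide whether a proposed tool call may run. Only shell calls are inspected."""
    if tool_name != "shell":
        return Verdict(True)
    for pattern, reason in DENY_PATTERNS:
        if pattern.search(command):
            return Verdict(False, reason)
    return Verdict(True)


def guarded_run(tool_name: str, command: str, execute: Callable[[str], str]) -> str:
    """Run `execute` only if the check passes; otherwise return the block reason."""
    verdict = check_tool_call(tool_name, command)
    if not verdict.allowed:
        return f"blocked: {verdict.reason}"
    return execute(command)
```

The point of the design is that the check runs before the side effect, so a blocked call costs one rejected step instead of an unrecoverable state. Returning the reason as text lets the agent see why it was stopped and try a different action.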

[-]

LegacyRemaster@reddit

The problem is how many tokens retries cost. Retrying 3 or 4 times (as seen in the tests) takes time and resources.

[-]

regunakyle@reddit

Would this help agentic coding? So far with Pi I have not seen tool-calling issues.

I feel like this is more for openclaw and/or hermes

[-]

johnnaliu@reddit

how's the latency when you add the guardrails?

[-]

pavel6490@reddit

Personally I think the abstract needs more clarification on which frontier model APIs you used for comparison (size, capabilities, etc.). But great work! Best of luck with your submission. I need to dive deep later to see if it will help my local agent Autodidact. Thanks :D

[-]

jslominski@reddit

"Finding 4: The serving backend is a hidden variable, as highlighted in Table II. The same Mistral-Nemo 12B weights score 7% on llama-server native mode and 83% on llamafile (prompt). Qwen 3 14B scores 96% on Ollama, 93% on llama-server prompt, and 88% with llama-server native. These swings are larger than many model-to-model differences reported in standard benchmarks, yet no published benchmark we are aware of controls for serving infrastructure [Patil et al.(2025)]. Any evaluation of self-hosted model capabilities that does not specify the serving backend may be producing misleading results." - I don't think the author thought that one through 😅
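
For scale, the swings in the quoted passage can be worked out directly. This is just arithmetic on the five scores quoted above; the `scores` layout and the `backend_swing` helper are made up for illustration and nothing here comes from the paper's code.

```python
# Scores (percent) exactly as quoted from Finding 4 above.
scores = {
    "Mistral-Nemo 12B": {
        "llama-server native": 7,
        "llamafile (prompt)": 83,
    },
    "Qwen 3 14B": {
        "Ollama": 96,
        "llama-server prompt": 93,
        "llama-server native": 88,
    },
}


def backend_swing(by_backend: dict[str, int]) -> int:
    """Gap in percentage points between the best and worst backend for one model."""
    return max(by_backend.values()) - min(by_backend.values())


swings = {model: backend_swing(by_backend) for model, by_backend in scores.items()}

for model, swing in swings.items():
    print(f"{model}: {swing} points between best and worst backend")
```

That gives a 76-point swing for Mistral-Nemo 12B and an 8-point swing for Qwen 3 14B, from identical weights in each case.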

[-]

Accomplished_Ad9530@reddit

Why does the repo have a dead IEEE DOI and your post claims ACM CAIS while the paper is not on CAIS’s list ( https://www.caisconf.org/program/2026/papers/ )?

[-]

billy_booboo@reddit (OP)

https://www.caisconf.org/program/2026/demos/forge-agentic-reliability/

[-]

Ueberlord@reddit

Did you get a full research acceptance or maybe only research in progress/presentation acceptance? Only full papers will be added to the conference proceedings and can be cited in the future

[-]

billy_booboo@reddit (OP)

well, it's not my paper, but I assume this is a demo and its corresponding demo paper which got full acceptance

[-]

billy_booboo@reddit (OP)

Note: this is not the same as "forgecode". It's peer-reviewed research being presented at an upcoming conference.