Guardrails take an 8B model from 53% to 99% on agentic tasks [ACM CAIS '26 preprint]
Posted by billy_booboo@reddit | LocalLLaMA | 14 comments
LegacyRemaster@reddit
The author often confuses syntax problems with semantic problems
billy_booboo@reddit (OP)
Where are you seeing that?
OjinAI@reddit
the 53 to 99 framing is honest but the headline kinda buries the real result. guardrails don't make the 8B smarter, they stop it from doing the dumb thing that was costing it 46 points. which means agentic benchmarks are mostly measuring "does the model avoid stepping on its own feet" rather than reasoning. the gap closes because the dumb-floor lifts, not because the smart-ceiling moves.
Big_Wonder7834@reddit
The jump from 53% to 99% tracks with what we see in production. Unguided agents fail in two ways: they misinterpret scope, or they take irrecoverable actions (deleting files, force-pushing, leaking secrets). Guardrails catch the second before it's permanent.
For coding agents this maps directly to Claude Code's PreToolUse hook system - intercept every tool call before execution. We built FailProof AI around this: 39 built-in policies that act as the runtime guardrail layer for all popular coding harnesses. open source: https://github.com/failproofai/failproofai
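To make the "intercept every tool call before execution" idea concrete, here is a minimal sketch of a pre-execution guard in plain Python. It is not Claude Code's hook API and not FailProof AI's code; the names `DENY_PATTERNS`, `check_tool_call`, and `run_tool`, the `"shell"` tool name, and the three regexes are all hypothetical and only illustrate the three irrecoverable actions listed above.

```python
import re
from dataclasses import dataclass

# Hypothetical deny rules, one per irrecoverable action mentioned above.
# Real policies would need to be far more thorough than these regexes.
DENY_PATTERNS = [
    (re.compile(r"\brm\s+-(rf|fr)\b"), "recursive forced delete"),
    (re.compile(r"\bgit\s+push\b.*\s(--force|-f)\b"), "force push"),
    (re.compile(r"\b(cat|curl)\b.*(\.env|id_rsa)\b"), "possible secret leak"),
]


@dataclass
class Decision:
    allowed: bool
    reason: str = ""


def check_tool_call(tool_name: str, command: str) -> Decision:
    """Decide whether a tool call may run. Only the assumed 'shell' tool is checked."""
    if tool_name != "shell":
        return Decision(True)
    for pattern, reason in DENY_PATTERNS:
        if pattern.search(command):
            return Decision(False, reason)
    return Decision(True)


def run_tool(tool_name: str, command: str, execute):
    """Run `execute(command)` only if the guard allows it; otherwise report the block."""
    decision = check_tool_call(tool_name, command)
    if not decision.allowed:
        # The blocked message goes back to the agent instead of the tool output.
        return f"BLOCKED: {decision.reason}"
    return execute(command)
```

The point of the design is that the check runs before the side effect, so a bad call costs one rejected step rather than a deleted directory.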
LegacyRemaster@reddit
The problem is how many tokens retries cost. Retrying 3 or 4 times (as seen in the tests) takes time and resources.
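A rough way to put numbers on this, as a sketch only: assume each attempt succeeds independently with probability `p` and the agent stops after `max_attempts` tries. Neither assumption comes from the paper, and `tokens_per_attempt` is a made-up figure.

```python
def expected_attempts(p: float, max_attempts: int) -> float:
    """Expected number of attempts when each one succeeds with probability p,
    attempts are independent (an assumption), and retries are capped."""
    # Attempt k+1 happens only if the first k attempts all failed.
    return sum((1.0 - p) ** k for k in range(max_attempts))


def expected_tokens(p: float, max_attempts: int, tokens_per_attempt: int) -> float:
    """Expected token spend, assuming every attempt costs the same (hypothetical)."""
    return expected_attempts(p, max_attempts) * tokens_per_attempt


# Example with invented numbers: 50% per-attempt success, up to 4 tries,
# 2000 tokens per try.
print(expected_tokens(0.5, 4, 2000))
```

Under these assumptions the lower the per-attempt success rate, the closer the cost gets to `max_attempts` times the single-shot cost.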
regunakyle@reddit
Would this help agentic coding? So far with Pi I have not seen tool calling issues
I feel like this is more for openclaw and/or hermes
johnnaliu@reddit
how's the latency when you add the guardrails?
pavel6490@reddit
Personally I think the abstract needs more clarification on which frontier model APIs you used for comparison (size, capabilities, etc.). But great work! Best of luck with your submission. I need to dive deep later to see if it will help my local agent Autodidact. Thanks :D
jslominski@reddit
"Finding 4: The serving backend is a hidden variable, as highlighted in Table II. The same Mistral-Nemo 12B weights score 7% on llama-server native mode and 83% on llamafile (prompt). Qwen 3 14B scores 96% on Ollama, 93% on llama-server prompt, and 88% with llama-server native. These swings are larger than many model-to-model differences reported in standard benchmarks, yet no published benchmark we are aware of controls for serving infrastructure [Patil et al.(2025)]. Any evaluation of self-hosted model capabilities that does not specify the serving backend may be producing misleading results." - I don't think the autor thought that one through 😅
Accomplished_Ad9530@reddit
Why does the repo have a dead IEEE DOI, and why does your post claim ACM CAIS when the paper is not on CAIS’s list ( https://www.caisconf.org/program/2026/papers/ )?
billy_booboo@reddit (OP)
https://www.caisconf.org/program/2026/demos/forge-agentic-reliability/
Ueberlord@reddit
Did you get a full research acceptance or maybe only research in progress/presentation acceptance? Only full papers will be added to the conference proceedings and can be cited in the future
billy_booboo@reddit (OP)
well, it's not my paper, but I assume this is a demo with a corresponding demo paper that got full acceptance
billy_booboo@reddit (OP)
Note, this is not the same as "forgecode"; it's peer-reviewed research being presented at an upcoming conference.