Q4_K_M is fine for chat and a trap for agents. Here is math mathing.

Posted by Napster3301@reddit | LocalLLaMA | 53 comments

saw the Q4_K_M vs Q6 thread earlier and the comments are talking past each other. "few errors per hour" vs "errors every couple days" sounds like a 24x difference. for chat that's fine. for agentic loops that's the whole game.

run the math. if your agent does a 30-step tool-calling loop and each step independently has a 2% chance of producing a malformed arg or picking the wrong tool, end-to-end success is 0.98^30 ≈ 0.55. coin flip.

at Q4_K_M with "few errors per hour" the per-call malformation rate is probably ~3%. 30 steps = 0.97^30 ≈ 40% completion.

at Q6 with "errors every couple days" call it 0.3%. 30 steps = 0.997^30 ≈ 91%.
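if you want to plug in your own numbers, here's a minimal sketch of the compounding math above. it assumes every step fails independently with the same probability, which real loops won't strictly obey. the per-call rates are the rough guesses from this post, not measurements, and `end_to_end_success` is just a name made up for the example.

```python
def end_to_end_success(per_step_error: float, steps: int) -> float:
    """Probability that all steps succeed, assuming independent, identical steps."""
    if not 0.0 <= per_step_error <= 1.0:
        raise ValueError("per_step_error must be between 0 and 1")
    return (1.0 - per_step_error) ** steps


STEPS = 30

# Per-call error rates are the post's rough guesses, not measured values.
scenarios = {
    "2% example": 0.02,
    "Q4_K_M (~3%)": 0.03,
    "Q6 (~0.3%)": 0.003,
}

for label, per_step_error in scenarios.items():
    success = end_to_end_success(per_step_error, STEPS)
    print(f"{label:>14}: {success:.1%} end-to-end over {STEPS} steps")
```

this prints roughly 54.5%, 40.1% and 91.4%. swap `STEPS` for your actual loop length to see how fast it falls off.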

a 10x lower per-call error rate (3% down to 0.3%) comes out to about 2.3x end-to-end agent success (40% up to 91%). and the failure mode is silent: confident format, wrong content, the orchestrator accepts it, and the artifact breaks two hops downstream when some other consumer tries to parse it. you don't catch it inline. you catch it when the final output is broken and you have to bisect the whole trace.
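one way to pull some of that detection back inline is to check every tool call before the orchestrator accepts it. a minimal sketch, stdlib only. the `TOOLS` registry, the `{"name": ..., "arguments": ...}` call shape and `validate_tool_call` are all made up for illustration, so adapt them to whatever your framework actually emits. note this only catches malformed args and unknown tools. a well-formed call with the wrong content still sails through.

```python
import json

# Hypothetical tool registry: tool name -> {argument name: expected JSON type}.
TOOLS = {
    "read_file": {"path": str},
    "search": {"query": str, "limit": int},
}


def validate_tool_call(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the call is well-formed."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]

    if not isinstance(call, dict):
        return ["top-level value is not an object"]

    name = call.get("name")
    if not isinstance(name, str) or name not in TOOLS:
        return [f"unknown tool: {name!r}"]

    args = call.get("arguments")
    if not isinstance(args, dict):
        return ["arguments is not an object"]

    schema = TOOLS[name]
    problems = []
    for key, expected in schema.items():
        if key not in args:
            problems.append(f"missing argument: {key}")
        elif type(args[key]) is not expected:
            # Exact type match, so a JSON true/false is not accepted as an int.
            problems.append(f"wrong type for {key}: expected {expected.__name__}")
    for key in sorted(args.keys() - schema.keys()):
        problems.append(f"unexpected argument: {key}")
    return problems
```

reject or retry on a non-empty list instead of passing the call along, and the break shows up at the step that caused it rather than two hops later.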

a lot of people running Q4_K_M for agents are measuring chat quality and extrapolating. it's a different workload. token-level errors compound differently when one bad token kills the whole loop instead of mangling a sentence.

abliterated/heretic models compound this btw, because stripping refusal circuits also chips away at the "wait, that doesn't parse" reflex that catches malformed JSON before emit. you're trading safety for raw output and picking up downstream brittleness in the bargain.

is anyone actually logging per-call output validity in live agentic loops? not eval benchmarks. prod logs, on a real workload, over a week.
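for anyone who wants to start, a minimal sketch of what that logging could look like: one JSONL line per tool call, plus a summary that turns the observed invalid rate into an implied end-to-end success rate. `log_call`, `summarize`, the field names and the file layout are all hypothetical, and the implied number leans on the same independence assumption as the math above.

```python
import json
import time
from collections import Counter


def log_call(path: str, run_id: str, step: int, tool: str,
             valid: bool, reason: str = "") -> None:
    """Append one record per tool call to a JSONL file."""
    entry = {
        "ts": time.time(),
        "run_id": run_id,
        "step": step,
        "tool": tool,
        "valid": valid,
        "reason": reason,  # e.g. "bad_json", "unknown_tool"; empty when valid
    }
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")


def summarize(path: str, steps: int = 30) -> dict:
    """Compute the observed per-call invalid rate and the implied loop success."""
    total = 0
    invalid = 0
    reasons = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            entry = json.loads(line)
            total += 1
            if not entry["valid"]:
                invalid += 1
                reasons[entry["reason"]] += 1

    rate = invalid / total if total else 0.0
    return {
        "calls": total,
        "invalid": invalid,
        "invalid_rate": rate,
        # Assumes independent steps with the same failure probability.
        "implied_success": (1.0 - rate) ** steps,
        "top_reasons": reasons.most_common(3),
    }
```

call `log_call` right where the orchestrator accepts or rejects a tool call, let it run for a week, then run `summarize` on the file and compare `implied_success` against how many runs actually finished.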