Q4_K_M is fine for chat and a trap for agents. Here is math mathing.
Posted by Napster3301@reddit | LocalLLaMA | 53 comments
saw the Q4_K_M vs Q6 thread earlier and the comments are talking past each other. "few errors per hour" vs "errors every couple days" sounds like a 24x difference. for chat thats fine. for agentic loops thats the whole game.
run the math. if your agent does a 30-step tool calling loop and each step has a 2% chance of producing a malformed arg or picking the wrong tool, end-to-end success is 0.98^30 = 0.55. coin flip.
at Q4_K_M with "few errors per hour" the per-call malformation rate is probably ~3%. 30 steps = 40% completion.
at Q6 with "errors every couple days" call it 0.3%. 30 steps = 91%.
a 10x lower per-call error rate comes out to 2.3x end-to-end agent success. and the failure mode is silent: confident format, wrong content, orchestrator accepts it, the artifact breaks two hops downstream when some other consumer tries to parse it. you dont catch it inline. you catch it when the final output is broken and have to bisect the whole trace.
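The compounding arithmetic above can be reproduced in a few lines. This is a minimal sketch that assumes every step fails independently with the same probability and that any single failure kills the run. The 3% and 0.3% rates are the post's own guesses, not measurements, and `end_to_end_success` is a name made up for illustration.

```python
def end_to_end_success(per_step_error: float, steps: int = 30) -> float:
    """Chance that all `steps` calls succeed, assuming independent, fatal errors."""
    return (1.0 - per_step_error) ** steps


# Per-step error rates are the post's guesses, not measured values.
scenarios = [
    ("2% per step", 0.02),
    ("Q4_K_M guess, 3%", 0.03),
    ("Q6 guess, 0.3%", 0.003),
]

for label, error_rate in scenarios:
    print(f"{label}: {end_to_end_success(error_rate):.1%} of 30-step runs finish clean")

# Ratio of end-to-end success between the two guessed quant error rates.
ratio = end_to_end_success(0.003) / end_to_end_success(0.03)
print(f"Q6 guess vs Q4_K_M guess: {ratio:.2f}x")
```

The three rates come out at roughly 55%, 40% and 91%, and the ratio at about 2.3x. Both assumptions (independence and no recovery) are strong ones, and the numbers move a lot if either is relaxed.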
alot of people running Q4_K_M for agents are measuring chat quality and extrapolating. its a different workload. token-level entropy stacks differently when one bad token kills the whole loop instead of mangling a sentence.
abliterated/heretic models compound this btw, because stripping refusal circuits also chips away at the "wait that doesnt parse" reflex that catches malformed JSON before emit. youre trading safety for raw output and picking up downstream brittleness in the bargain.
is anyone actually logging per-call output validity in live agentic loops? not eval benchmarks. with prod logs, on a real workload, over a week.
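For anyone who wants to try this, here is one rough way per-call validity could be tallied from logged tool calls. It is a sketch only: the `{"name": ..., "arguments": {...}}` shape, the `tools` registry and both function names are assumptions for illustration, not the format of any particular agent framework. It also only catches structural failures, so the "confident format, wrong content" case still needs a semantic check on top.

```python
import json
from collections import Counter


def check_tool_call(raw: str, tools: dict) -> str:
    """Classify one raw tool-call string.

    `tools` maps a tool name to the set of argument names it requires
    (a hypothetical registry, not any framework's real format).
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return "invalid_json"
    if not isinstance(call, dict):
        return "invalid_shape"
    name = call.get("name")
    if not isinstance(name, str) or name not in tools:
        return "unknown_tool"
    args = call.get("arguments")
    if not isinstance(args, dict) or not tools[name].issubset(args):
        return "bad_args"
    return "ok"


def validity_report(raw_calls: list, tools: dict) -> dict:
    """Fraction of calls falling into each category."""
    counts = Counter(check_tool_call(raw, tools) for raw in raw_calls)
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()} if total else {}


tools = {"read_file": {"path"}}
logged = [
    '{"name": "read_file", "arguments": {"path": "a.txt"}}',
    '{"name": "read_file", "arguments": {}}',
    '{"name": "read_file", "arguments": {"path": ',
]
print(validity_report(logged, tools))
```

Run over a week of real logs, split by quant, this would give the per-call rates the post is guessing at.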
MackThax@reddit
I'm not saying you're wrong, but even fp models make mistakes. We already need to design systems around that.
Linkpharm2@reddit
FP64 🤤 🤤 🤤
Infamous_Mud482@reddit
you're just guessing that people are extrapolating off of that when I'm also sure "alot" of them are running benchmarks agentically. do you understand the value in "running the math" on made up numbers? this is end to end based on hypotheticals, you've produced a thought experiment at best
lordekeen@reddit
The harness and prompt may have a high impact on the error rate. I've noticed that when my prompt is very straightforward the result almost never misses.
samorollo@reddit
If I could fit bigger quants I wouldn't think twice, but if I could fit bigger quant I would choose bigger model anyway
Top-Rub-4670@reddit
Sounds like you would think twice, then.
cibernox@reddit
Well, I even use the Q4_K_S, because for us peasants with 24gb of vram, it's the only way of having a decent ~200k context, with KV-cache in k-q8/v-q5.
Q6 would leave me no context at all.
Q5 would give me some context, it could be usable.
But given the sentiment I've seen here, maybe I should try the UD-Q4_K_XL model, so even if I stay in Q4, it is supposed to be the best Q4.
Similar-Ad5933@reddit
Do you really need 200k context? Are you making it write long books or what?
Borkato@reddit
Honestly no but it’s annoying when you want to do something that does and you can’t, or you have a long session and know compaction will cause both you and the model to have to reorient yourselves mentally
DaMoot@reddit
I use 170k context because I ingest log data that can be 60-70k+ tokens per batch. And that's pre-truncated and filtered by the python script! Or if I'm having the agent make or fix a script.
Agentic frameworks have large bootstrapping injections of 15-20k.
Plus some of it is a personal operation flex not having to see compressions very often.
mmhorda@reddit
Of course. Sessions search, web extract, memory extract, the context. Even 200k sometimes is not enough for me 😅
cibernox@reddit
Agents and coding. For coding, below 150K you get compaction issues very often. For agentic uses, 2x100k agents do work a lot quicker, and given that the baseline context an agent uses is around 20k, having, say, 65k context would leave you with only 45k usable context, which essentially means you would be compacting context every 5 minutes.
rebel_cdn@reddit
If you're calling it from a coding agent, the extra context is useful. If you're trying to solve a problem in a large code base, one of the first things the agent will do is start ingesting lots of files to try to get the full context.
I've had this fill up a 128k context pretty quickly. It's not the end of the world due to auto compaction, but condensing the context also removes detail that would help complete the task.
YMMV depending on what you're working on, but I've found that using 200k over something smaller makes a meaningful difference.
Borkato@reddit
tbh I think it’s easy to get into the trap that “what matters most is size” or “benchmark scores mean my model is objectively worse”. But that’s like saying “A big car is always better than a small car”. For moving, yes. For everyday life? Absolutely not! So much gas, power, etc all used up just to go from here to there? Hell no! And what if you can’t afford the bigger car? etc.
The ONLY, and I do mean only, important question is, “does this model work for my workflow without me wanting to rip my hair out?” And if it does, use it. The end.
I feel like if someone released a 1M parameter model that was literally ASI or something, people would be like “:/ anyone else kinda think it sucks? It’s only 1M, imagine what we could do with 100M, it would be so much better…” lol
Napster3301@reddit (OP)
yeah for 24gb that math is real. fwiw 200k context at Q4 isnt necessarily better than 32k at Q6 for agentic loops. most agent working windows are 8-32k anyway, the rest is dead weight in KV that you pay for on every step. you'd be surprised how rarely the 64k+ region is actually informing the next token.
UD-Q4_K_XL is genuinely worth trying. unsloth's dynamic allocation keeps the high-error-per-param tensors at higher bits, so the brittle layers (attention output projections, MLP gates) dont get hammered like in uniform Q4. closes maybe 40-50% of the gap to Q6 for the same VRAM. for agents specifically thats huge because those are the exact layers driving tool selection.
but also, drop one tier on params. 14B at Q6 beats 27B at Q4 on tool calling benchmarks even when it loses on knowledge tasks. depends on what your agent actually does.
cibernox@reddit
Nope, that's just not true. For instance, if you have hermes with a few memories and skills, your baseline context is already around 20k, and that's before you have done anything at all. 32k is useless. 65k is the bare minimum for being somewhat useful in an agentic setup. 100k is where it starts to be useful.
I want 200k context so I can have 2 agents with 100k each. Although if you are doing coding, needing 150k+ is not rare at all.
mmhorda@reddit
With hermes 64k is a pain. I run it with 128k but I'm looking for ways to raise it to 200k+ somehow.
admnb@reddit
Is turboquant not usable? I'm running Q5 with 256k context and it seems fine, but I haven't tested long agentic loops yet.
cibernox@reddit
From what I've read, turboquant q4 has essentially no benefit over regular q4 KV cache and it's a bit slower.
But Q4 KV cache is risky if you want to preserve context, especially for the K. There have been measurements of perplexity using different KV quantizations at different context lengths, and Q8/Q5 seems to be right around the sweet spot. Savings over Q8/Q8 are meaningful while recall/hallucination remains nearly identical.
IIRC, the next step in memory saving/quality was Q8/Q4, then Q5/Q5, and then Q4/Q4.
That said, Q5 is not as optimized as Q8 or Q4, so prompt processing is a bit slower.
cleversmoke@reddit
Don't sweat it! If you've figured out how to use Q4_K_S or Q4_K_M effectively, you're in a good spot for agentic coding, in my humble opinion. It would mean you have your harness set up well, with good docs, good planning and likely a good sense of software architecture. I'd rather hire an engineer who knows how to work with Q4_K_S than one who tokenmaxxes and tries to one-shot everything.
andreasntr@reddit
That's what I'm experiencing too. I never use my agent in freestyle mode, I always chat in plan mode first (I'd say I spend 90% of the time in plan mode) and when the plan looks good and sound to me, I'll set the build agent free.
In this specific scenario, q4 errors do not stack up because they are intercepted by the human in the loop, and the build agent has enough tools to correct itself (linter, cmd to read errors etc)
ParaboloidalCrest@reddit
Bring on the downvotes, but I never understood the "we can't use anything else" mindset. There is a model out there for every amount of vram imaginable.
Xp_12@reddit
Tell that to Qwen.
ParaboloidalCrest@reddit
That's your problem. You don't have to just stick to the latest collection of Qwen models. There are tens of decent 14-20B models out there.
social_tech_10@reddit
How does it affect your math if 99% of "errors" are caught and corrected on the next turn or two?
One of the things I love most about "computer science" is that we have the option to make it directly experimental, like real science, if we care enough about the question. If you set this up as an experiment and performed the measurement yourself, I bet you would learn more than you expect. And because of the amazing moment we live in, an LLM could even help you design the experiment, write the code, and give you as much tutoring as you might need to fully understand and control the whole experiment. Have fun with it, and let us know what you find out.
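The question about caught errors can be folded into the same compounding formula. A minimal sketch, assuming a fixed fraction of errors is detected and fully repaired (so only the uncaught remainder is fatal) and that steps stay independent. `success_with_recovery` and its parameters are illustrative names, and the 3% rate is the original post's guess.

```python
def success_with_recovery(per_step_error: float,
                          caught_fraction: float,
                          steps: int = 30) -> float:
    """End-to-end success when only uncaught errors are fatal."""
    uncaught_error = per_step_error * (1.0 - caught_fraction)
    return (1.0 - uncaught_error) ** steps


# 3% per-step error rate (the post's guess), varying how much gets caught.
for caught in (0.0, 0.5, 0.9, 0.99):
    rate = success_with_recovery(0.03, caught)
    print(f"caught {caught:.0%}: {rate:.1%} of 30-step runs succeed")
```

With 99% of errors caught, the guessed Q4_K_M rate gives better than 99% end-to-end success under these assumptions, so the outcome hinges on the catch rate at least as much as on the quant. That catch rate is exactly the number nobody in the thread has measured.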
Zestyclose_Leek_3056@reddit
Curious how NVFP4 compares.
audioen@reddit
If agent fails on tool call, it tries again. Therefore it will complete eventually, even if it makes errors at first. Have you used agentic software?
CooperDK@reddit
Q4_K_M is not that bad. It is close to Q8 in quality and Q5 and Q6 quants are essentially unnecessary, as are Q3 and below (that is where the quality really drops). Q4 is usually something like 96-98% accurate, which sounds bad until you know that a 16-bit model is also only 98-99% accurate.
In short, you can safely use Q4. If you have a Blackwell GPU, use nvfp4 which has a precision much closer to fp16.
svagis@reddit
Going from 96% to 98% accuracy halves the error rate, so it's a 100% increase in performance, as is 98% to 99%
Napster3301@reddit (OP)
96-98% on what though? perplexity? MMLU? thats single-token correctness on well-formed prompts. agentic loops dont look like that.
even at your numbers, compounding bites. 0.99^30 = 0.74 success. 0.98^30 = 0.55. a 1-point drop in per-call accuracy takes you from about a quarter of your agent traces broken end to end to nearly half.
benchmarks dont measure the thing thats actually killing agents: JSON validity, tool selection, schema adherence, arg format. those degrade faster than perplexity when you quantize because theyre tail behaviors, not averages. Q4 keeps the token distribution mean intact while shaving the edges, and the edges are where structured output lives.
nvfp4 being closer to fp16 is fair. most people in this thread arent on blackwell though, theyre on 3090s running k-quants in llama.cpp and the gap there is bigger than the perplexity numbers suggest.
slalomz@reddit
This is way overstating the issue. You're treating every different token as a wrong token. You're not taking into account that high-confidence tokens (the ones where getting it wrong would induce an error) are not going to be affected at all by quantization. You're treating the higher quant as if it generates 100% "correct" tokens all the time. The truth is that a few second-place tokens (according to the higher quant) will make 0 difference to the end result.
yoyoyoba@reddit
There really needs to be more data on this. Most evals show quality going down from q8 to q6 to q5 to q4... so they should all make sense. Take the largest you can fit, and also consider the speed you need.
I have not seen nvfp4 performing much above q4 (if it is not natively trained in nvfp4). It is definitely not bf/fp16.
DataPhreak@reddit
The nice thing about running quants is that you're likely doing it on local hardware, so building in robust verification steps to your agentic workflow isn't about cost any more. It's about speed. If you're using something with a lot of VRAM like a DGX or a Strix Halo, you can even run multiple quants simultaneously and put more complex steps on the stronger model.
leonbollerup@reddit
All sounds good.. but the HW cost of running a Q6 / Q8 is quite a bit higher unless you are ready to give away performance or context size .. we would probably all be running a higher quant if it was an option
OAKI-io@reddit
yeah this is the part people underestimate. chat quality can feel basically fine while the tail risk kills tool loops. for agents the eval should be “can it complete 50 boring multi-step tasks without one malformed call or bad assumption,” not “does the answer read well.”
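That kind of eval is easy to sketch as a harness. The code below assumes you can wrap one agent step in a function that returns True when the call was valid. `clean_completion_rate` and `make_simulated_step` are hypothetical names, and the simulated step is a random stand-in for a real model, included only so the sketch runs on its own.

```python
import random
from typing import Callable


def clean_completion_rate(run_step: Callable[[int, int], bool],
                          n_tasks: int = 50,
                          n_steps: int = 30) -> float:
    """Fraction of tasks that finish with zero bad steps.

    `run_step(task_index, step_index)` should return True for a valid call.
    A task stops counting as clean at its first bad step.
    """
    clean = 0
    for task in range(n_tasks):
        if all(run_step(task, step) for step in range(n_steps)):
            clean += 1
    return clean / n_tasks


def make_simulated_step(per_step_error: float, seed: int = 0) -> Callable[[int, int], bool]:
    """Stand-in for a real agent step: fails at random with the given rate."""
    rng = random.Random(seed)

    def step(task: int, step_index: int) -> bool:
        return rng.random() >= per_step_error

    return step


# Simulated run at a guessed 3% per-step error rate.
print(clean_completion_rate(make_simulated_step(0.03)))
```

With only 50 tasks the result is noisy, so comparing two quants this way would need more tasks or repeated runs before reading much into the difference.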
florinandrei@reddit
The assumption is that errors are fatal. That's not always the case.
NotARedditUser3@reddit
this is only true if your agentic loop takes hours.
I use agents and most of their tasks are done within a few minutes. An error every few hours is irrelevant, especially if it can catch when it's made a mistake.
This post is functionally just more noise.
LeatherRub7248@reddit
when you say 'errors', does your post assume error correction / recovery logic ?
malformed args can be corrected quite easily, as for wrong tool, possibly debatable if it isnt clear what exactly is a wrong vs right tool, but if that can be determined then it shd be correctable too.
SnooPaintings8639@reddit
What's so critical about errors? I mean, if a model makes a mistake in every session over 30 tool calls, then it will just notice and try again. This itself is not a deal breaker. The only real issue is if it gets genuinely dumb at long context and forgets how to make any call or adhere to instructions.
No-Refrigerator-1672@reddit
People recognized those shortcomings a long time ago, and that's why frontier quantization methods (AWQ and its derivatives, EXL2/EXL3, etc) all use training datasets to produce additional calibration coefficients that will nudge the model back to original performance on the tasks present in the dataset. In gguf world, something similar is imatrix; you should use those when you can.
FotografoVirtual@reddit
This is THE answer, using imatrix really is a night and day difference.
nbncl@reddit
I can tell you for a fact that sonnet 4.6 fails more often with tool calling than Qwen 3.6 UD-Q4_K_XL locally.
Prize_Negotiation66@reddit
"if you poor just get more money"
Icy-Degree6161@reddit
I'll buy your book
Similar-Ad5933@reddit
Yes, this is something people should understand. Also how bits work, because 4-bit can hold only 16 different values. 16-bit can hold 65536 values. So when you convert those values to 4-bit, a lot of them collapse to the same value.
I see too many people who are proud that they run Q3 just fine. It's like saying that I can type if I just mash all keys.
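The value-collapse point can be shown with a toy quantizer. This is a deliberately naive sketch: one uniform grid over the whole range, with `quantize_uniform` as a made-up helper. Real GGUF k-quants use per-block scales and mixed bit widths, so this illustrates the counting argument only, not how llama.cpp actually quantizes.

```python
def quantize_uniform(values: list, bits: int) -> list:
    """Map each value to the nearest of 2**bits evenly spaced levels (toy example)."""
    levels = 2 ** bits
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    step = (hi - lo) / (levels - 1)
    return [round((v - lo) / step) for v in values]


# 1000 distinct "weights" spread evenly over [0, 1].
weights = [i / 999 for i in range(1000)]

for bits in (4, 16):
    codes = quantize_uniform(weights, bits)
    print(f"{bits}-bit: {len(set(codes))} distinct codes for {len(weights)} distinct inputs")
```

At 4 bits the 1000 inputs land on 16 codes, while at 16 bits each keeps its own code. Block-wise scaling is what lets practical 4-bit quants do much better than this naive picture suggests.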
Lissanro@reddit
Based on my experience, it is small to medium size models that are impacted the most, like Qwen 3.6 27B, which has noticeable error rate increase even at Q6, especially at tasks that involve vision, where Q8 would still be better. Q6 and Q8 for 27B come close in terms of quality though.
Medium size models are impacted less. For Qwen 397B, I found Q5 to be the sweet spot, still fast and maintains good quality, including in tasks that involve vision. At least, this is my experience.
Larger models, especially if natively INT4, may work perfectly at Q4. For example, Kimi K2.6 is the model I run the most on my rig and find its Q4_X quant quite reliable in agentic tasks. I also sometimes use GLM 5.1, which isn't INT4 natively but still its Q4_K_M quant I find to be good enough in agentic use cases.
nickm_27@reddit
That still misses a lot of nuance though. Agent is such an overloaded term. An agent can be anything from a simpler voice assistant which handles a wide array of voice commands, to a 24/7 running OpenClaw / Hermes which runs at much longer context and has a lot of different tasks that it handles as well as much higher runtime leading to more opportunities for failure.
As a voice agent, which still handles many tool calls including chained tool calls, Q4_K_M can work great without any failures. I have never seen an actual failure to call a tool correctly with Q4_K_XL. Every single failure is directly noticed as that would manifest as failing to answer a question, control a device, play music, etc.
complexminded@reddit
Your use case often determines the "best" quant for you. Saying that Q4 is close to Q8 in quality might be true in specific scenarios, depending on the use case.
There's too much generalizing on the "best" quant when people have totally different use cases and different quants perform differently depending on the use case and context limit.
0-0x0@reddit
I didn't encounter such issues even with an uncensored qwen3.6 35B running at Q4. I gave it the root link to a project's documentation and had it run a research and prepare a demo usage of it with no problems beyond missing some things in the docs and requiring a secondary nudge to the page containing the info. Maybe with Gemma 4 it would've been worse since it's more sensitive to quantization.
OkFly3388@reddit
And what's the problem with errors in agent mode?
It's basically mandatory to have some error recovery strategy, because absolutely every model can make mistakes. And if you have that strategy, error rate is just a speed penalty, and having a 3% error rate (aka 3% slowdown) is really good, compared to having a 2% error rate but an overall 15% slowdown for the bigger quant.
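The "error rate is just a speed penalty" argument can be put into numbers. A sketch under two assumptions: every error is detected, and a failed step is simply retried until it succeeds. The function names are illustrative, and the 3%, 2% and 15% figures are the comment's own.

```python
def expected_attempts(error_rate: float) -> float:
    """Mean number of tries until a step succeeds, retrying independently."""
    return 1.0 / (1.0 - error_rate)


def relative_step_time(error_rate: float, slowdown: float = 0.0) -> float:
    """Expected time per completed step, relative to an error-free baseline."""
    return (1.0 + slowdown) * expected_attempts(error_rate)


small_quant = relative_step_time(0.03)        # 3% errors, baseline speed
big_quant = relative_step_time(0.02, 0.15)    # 2% errors, 15% slower per step

print(f"smaller quant: {small_quant:.3f}x baseline time per step")
print(f"bigger quant:  {big_quant:.3f}x baseline time per step")
```

Under these assumptions the smaller quant comes out ahead, at roughly 1.03x against 1.17x. The comparison flips only if a meaningful share of errors goes undetected, which is the silent-failure case the original post is worried about.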
DrBearJ3w@reddit
Depends on the harness.
Brilliant-Resort-530@reddit
this is the right framing but the real fix is constrained generation not a higher quant. llama.cpp --json-schema, outlines, or xgrammar: token-level JSON constraint collapses that malformation rate regardless of quant level
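As a rough illustration of the constrained-generation route, the sketch below builds a request for a locally running llama.cpp server. It assumes the server listens on `http://localhost:8080` and that its `/completion` endpoint accepts a `json_schema` field, so check the server README for your build. The schema, the tool names and `build_request` are made up for the example.

```python
import json
import urllib.request

# Hypothetical tool-call schema: one of two tools, with a required "path" argument.
TOOL_CALL_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "enum": ["read_file", "list_dir"]},
        "arguments": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
    "required": ["name", "arguments"],
}


def build_request(prompt: str, schema: dict,
                  base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build (but do not send) a schema-constrained completion request."""
    body = json.dumps({
        "prompt": prompt,
        "n_predict": 256,
        "json_schema": schema,  # assumed field name, see lead-in
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/completion",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request("List the files in /tmp as a tool call.", TOOL_CALL_SCHEMA)
print(request.full_url)
# To actually send it, with a server running:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

A schema constraint like this guarantees the shape of the output, not that the model picked the right tool or the right path, so it addresses malformed args but not wrong-tool errors.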