Do smaller quants silently break tool calls / JSON output?

Posted by Fun_Employment6042@reddit | LocalLLaMA | 8 comments

I posted recently about EvalShift, an OSS CLI for regression-testing LLM model changes.

A few people pointed out that for LocalLLaMA, the more interesting use case may be quantization regression:

Q8 -> Q4_K_M

Same base model, same prompts, lower VRAM, but behavior may subtly change.

I want to test failures like:

invalid JSON / structured output
changed tool/function selection
mutated tool arguments
skipped retrieval
weaker instruction following
plausible-looking output that breaks downstream code
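
A minimal sketch of how the first three failure types could be told apart when comparing a baseline quant's output against a candidate quant's output for the same prompt. The `{"name": ..., "arguments": ...}` tool-call shape and the function names `parse_tool_call` and `classify` are assumptions made for illustration; they are not EvalShift's actual API.

```python
import json

def parse_tool_call(raw):
    """Return (name, arguments) for a JSON tool call, or None if it is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or "name" not in data:
        return None
    args = data.get("arguments", {})
    if not isinstance(args, dict):
        args = {}
    return (data["name"], args)

def classify(baseline_raw, candidate_raw):
    """Label how the candidate quant's tool call differs from the baseline's."""
    base = parse_tool_call(baseline_raw)
    cand = parse_tool_call(candidate_raw)
    if base is None:
        return "baseline_invalid"   # nothing trustworthy to compare against
    if cand is None:
        return "invalid_json"       # candidate output no longer parses
    if base[0] != cand[0]:
        return "tool_changed"       # a different tool was selected
    if base[1] != cand[1]:
        return "arguments_mutated"  # same tool, different arguments
    return "ok"
```

Exact argument equality is deliberately strict here; a real suite would probably want per-field tolerances (for example, ignoring whitespace in free-text arguments).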

I’m thinking of adding a LocalLLaMA demo: same golden suite, same base model, two quants, then generate an HTML report showing what regressed.
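
As a rough illustration of the report idea, the sketch below turns per-case results for two quants into a small HTML table. The `render_report` function, the row format, and the `pass`/`fail` status strings are hypothetical and only show the shape of such a report, not what EvalShift generates.

```python
import html

def render_report(rows, baseline_label, candidate_label):
    """Build an HTML table from (case_id, baseline_status, candidate_status) rows.

    A case counts as regressed when the baseline passed and the candidate did not.
    """
    lines = [
        "<table>",
        "<tr><th>Case</th><th>{}</th><th>{}</th><th>Verdict</th></tr>".format(
            html.escape(baseline_label), html.escape(candidate_label)
        ),
    ]
    for case_id, base, cand in rows:
        verdict = "regressed" if base == "pass" and cand != "pass" else "ok"
        lines.append(
            "<tr><td>{}</td><td>{}</td><td>{}</td><td>{}</td></tr>".format(
                html.escape(case_id), html.escape(base), html.escape(cand), verdict
            )
        )
    lines.append("</table>")
    return "\n".join(lines)
```

Calling `render_report([("pick_tool", "pass", "fail")], "Q8", "Q4_K_M")` would produce a one-row table with the verdict `regressed`.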

Questions:

Which model + quant pair should I test?
Is Q8 -> Q4_K_M the right comparison?
Should I test Ollama, llama.cpp, or vLLM first?
Best demo task: JSON extraction, tool calls, RAG, coding, or instruction following?

Repo:
https://github.com/babaliauskas/evalshift-cli

MIT licensed. Local-first, no backend, no accounts, no telemetry.

BeautyxArt@reddit

In my simple usage tests, small models at Q8 always gave broken code.

NNN_Throwaway2@reddit

How do you "silently" break a tool call?

Ill-Fishing-1451@reddit

If you are going to run the test, you could compare your results with Unsloth's KLD vs. disk size test. I believe their results are correct and that model quality degrades exponentially as quantization gets more aggressive (fewer bits per weight).
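
For reference, KLD here presumably means the Kullback-Leibler divergence between the next-token distributions of the reference model and the quantized model, KL(P || Q) = sum over tokens of p * log(p / q). A minimal sketch of that formula for a single token position follows; the `kl_divergence` helper and the `eps` floor are illustrative assumptions, not how Unsloth or llama.cpp compute it.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats for two probability distributions over the same tokens.

    p: probabilities from the reference model (e.g. full precision).
    q: probabilities from the quantized model.
    eps: floor applied to q to avoid division by zero (an assumption of this sketch).
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0  # terms with p == 0 contribute nothing
    )
```

Identical distributions give 0, and the value grows as the quantized model's distribution drifts from the reference. A full comparison would average this over many token positions.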

NickCanCode@reddit

I use Qwen3.6-27B-Q2_K_MIXED.gguf and it doesn't have tool-call problems. However, the `mixed` seems to imply that it is not a simple Q2, and the file size is actually close to that of a Q4.

Ok-Measurement-1575@reddit

Test llama-server before and after MCP support with 3.6 35B for tool calls.

Fun_Employment6042@reddit (OP)

Good suggestion. That fits EvalShift well: same model, same prompts, but different serving/runtime behavior before vs after MCP/tool-call support.

I’ll look into a small tool-calling benchmark for llama-server with 3.6 35B: tool selection, argument correctness, call ordering, and whether the final answer depends on actually using the tool.

That would be a better LocalLLaMA demo than a generic model migration example.
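
The four checks mentioned above (tool selection, argument correctness, call ordering, and whether the answer depends on the tool) could be scored roughly like this. It is a sketch under assumptions: each call is a dict with `name` and `arguments` keys, and `score_trace` and `answer_uses_tool_result` are hypothetical helper names. The substring check in particular is a crude proxy for "the answer actually used the tool".

```python
def score_trace(expected, actual):
    """Compare an expected list of tool calls with the calls a model actually made."""
    exp_names = [call["name"] for call in expected]
    act_names = [call["name"] for call in actual]
    return {
        # Same set of tools, regardless of order.
        "selection_ok": sorted(exp_names) == sorted(act_names),
        # Same tools in the same order.
        "order_ok": exp_names == act_names,
        # Same tools, same order, and identical arguments for every call.
        "arguments_ok": len(expected) == len(actual) and all(
            e["name"] == a["name"] and e["arguments"] == a.get("arguments")
            for e, a in zip(expected, actual)
        ),
        # At least one tool call was made at all.
        "tool_used": len(actual) > 0,
    }

def answer_uses_tool_result(final_answer, tool_result):
    """Crude check: does the final answer contain the value the tool returned?"""
    return tool_result in final_answer
```

Keeping the four flags separate makes it possible to see, per test case, whether a runtime change broke the choice of tool, its arguments, or only the ordering.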

[-]

Here’s my first-hand experience: I tried Qwen3.6-27B-Q2 on Pi + llama-server before, and it had occasional tool-call failures. Then I switched to Q4, and even though TG (token generation) is much slower, it did NOT show tool-call failures on the same prompts.

I think extremely low quants can hurt intelligence so much that it affects tool-calling consistency.

Fun_Employment6042@reddit (OP)

That’s exactly the kind of case I want to capture.

Q2 vs. Q4 is interesting because the regression at the lower quant may not show up as “bad answer quality” immediately, but as lower tool-call consistency: skipped calls, malformed arguments, wrong tool choice, or unstable behavior across the same prompt set.
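
One way to put a number on "unstable behavior across the same prompt set" is to run each prompt several times per quant and measure how often the runs agree. The sketch below assumes each run has already been reduced to a short outcome string (for example the chosen tool name, or `"no_call"`); the `consistency` and `compare_quants` names are hypothetical.

```python
from collections import Counter

def consistency(outcomes):
    """Fraction of runs that match the most common outcome for one prompt."""
    if not outcomes:
        raise ValueError("need at least one run")
    _, count = Counter(outcomes).most_common(1)[0]
    return count / len(outcomes)

def compare_quants(runs_a, runs_b):
    """Per-prompt consistency drop from quant A to quant B.

    runs_a, runs_b: dicts mapping prompt id -> list of outcome strings.
    A positive value means quant B is less consistent than quant A on that prompt.
    """
    return {
        prompt: consistency(runs_a[prompt]) - consistency(runs_b[prompt])
        for prompt in runs_a
        if prompt in runs_b
    }
```

Note that this measures agreement between runs, not correctness: a quant that is consistently wrong would still score 1.0, so it is best paired with a correctness check against expected tool calls.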