Do smaller quants silently break tool calls / JSON output?
Posted by Fun_Employment6042@reddit | LocalLLaMA | 8 comments
I posted recently about EvalShift, an OSS CLI for regression-testing LLM model changes.
A few people pointed out that for LocalLLaMA, the more interesting use case may be quantization regression:
Q8 -> Q4_K_M
Same base model, same prompts, lower VRAM, but behavior may subtly change.
I want to test failures like:
- invalid JSON / structured output
- changed tool/function selection
- mutated tool arguments
- skipped retrieval
- weaker instruction following
- plausible-looking output that breaks downstream code
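A rough sketch of what the first few checks could look like, assuming each model output is a JSON string shaped like `{"tool": ..., "arguments": {...}}`. That shape and the names `parse_call` / `diff_tool_call` are made up for illustration; this is not EvalShift's actual API.

```python
import json

def parse_call(raw):
    """Return the parsed tool call as a dict, or None if raw is not a JSON object."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    return data if isinstance(data, dict) else None

def get_args(call):
    """Return the call's arguments if they are a dict, else an empty dict."""
    args = call.get("arguments")
    return args if isinstance(args, dict) else {}

def diff_tool_call(baseline_raw, candidate_raw):
    """List the ways the candidate output differs from the baseline output."""
    baseline = parse_call(baseline_raw)
    candidate = parse_call(candidate_raw)
    if baseline is None:
        return ["baseline_invalid"]
    if candidate is None:
        return ["invalid_json"]
    issues = []
    if candidate.get("tool") != baseline.get("tool"):
        issues.append("tool_changed")
    base_args, cand_args = get_args(baseline), get_args(candidate)
    # Report every argument key whose value differs or is missing on one side.
    for key in sorted(set(base_args) | set(cand_args)):
        if base_args.get(key) != cand_args.get(key):
            issues.append(f"argument_mutated:{key}")
    return issues
```

An empty list means the lower quant produced the same call as the baseline for that prompt; anything else is something to surface in the report.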
I’m thinking of adding a LocalLLaMA demo: same golden suite, same base model, two quants, then generate an HTML report showing what regressed.
Questions:
- Which model + quant pair should I test?
- Is Q8 -> Q4_K_M the right comparison?
- Should I test Ollama, llama.cpp, or vLLM first?
- Best demo task: JSON extraction, tool calls, RAG, coding, or instruction following?
Repo:
https://github.com/babaliauskas/evalshift-cli
MIT licensed. Local-first, no backend, no accounts, no telemetry.
BeautyxArt@reddit
In my simple tests, small models at Q8 always gave broken code.
NNN_Throwaway2@reddit
How do you "silently" break a tool call?
Ill-Fishing-1451@reddit
If you are going to do the test, you could compare your results with unsloth's KLD vs. disk size test. I believe their results are correct: model quality degrades exponentially as quantization gets more aggressive.
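For anyone unfamiliar with KLD: it is the KL divergence between the reference model's next-token distribution and the quantized model's. A minimal sketch of the textbook definition for a single token position, computed from raw logits. This is not unsloth's benchmark code, and the function names are just for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(ref_logits, quant_logits):
    """KL(ref || quant) in nats for one token position."""
    ref_logp = log_softmax(ref_logits)
    quant_logp = log_softmax(quant_logits)
    # Sum over the vocabulary of p_ref * (log p_ref - log p_quant).
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(ref_logp, quant_logp))
```

A value of 0 means the two distributions match exactly; in practice it would be averaged over many token positions.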
NickCanCode@reddit
I use Qwen3.6-27B-Q2_K_MIXED.gguf and don't have tool call problems. However, the `mixed` seems to imply that it is not a plain Q2, and the file size is actually close to Q4.
Ok-Measurement-1575@reddit
Test llama-server pre/post mcp support with 3.6 35b for tool calls.
Fun_Employment6042@reddit (OP)
Good suggestion. That fits EvalShift well: same model, same prompts, but different serving/runtime behavior before vs after MCP/tool-call support.
I’ll look into a small tool-calling benchmark for llama-server with 3.6 35B: tool selection, argument correctness, call ordering, and whether the final answer depends on actually using the tool.
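Roughly what I have in mind for scoring a single case, as a sketch only. It assumes each run is reduced to an ordered list of `(tool_name, arguments)` pairs; the `score_trace` name and that trace format are hypothetical. The "final answer depends on the tool" check is left out because it needs a task-specific assertion.

```python
def score_trace(expected, actual):
    """Compare two call traces, each a list of (tool_name, arguments_dict) in call order."""
    expected_names = [name for name, _ in expected]
    actual_names = [name for name, _ in actual]

    # Selection: the same tools were called the same number of times, in any order.
    selection_ok = sorted(expected_names) == sorted(actual_names)

    # Arguments: every expected call has an exact match in the actual trace.
    remaining = list(actual)
    arguments_ok = True
    for call in expected:
        if call in remaining:
            remaining.remove(call)
        else:
            arguments_ok = False
    arguments_ok = arguments_ok and not remaining

    # Ordering: the tools were called in the expected sequence.
    ordering_ok = expected_names == actual_names

    return {"selection": selection_ok, "arguments": arguments_ok, "ordering": ordering_ok}
```

Keeping the three checks separate means the report can say which part broke instead of just marking the case as failed.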
That would be a better LocalLLaMA demo than a generic model migration example.
Valuable_Touch5670@reddit
Here’s my first-hand experience: I tried Qwen3.6-27B-Q2 on Pi + llama-server before, and it had occasional tool call failures. Then I switched to Q4, and even though TG (token generation) is much slower, it did NOT show tool call failures on the same prompts.
I think extremely low quants can hurt intelligence so much that it affects tool calling consistency.
Fun_Employment6042@reddit (OP)
That’s exactly the kind of case I want to capture.
Q2 vs. Q4 is an interesting comparison because the regression at Q2 may not show up immediately as “bad answer quality”, but as lower tool-call consistency: skipped calls, malformed arguments, wrong tool choice, or unstable behavior across the same prompt set.
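Since occasional failures only show up over repeated runs, one way to measure this is a per-prompt pass rate for each quant. A minimal sketch, assuming every prompt is run several times and each run is reduced to pass/fail; the function names and the `tolerance` parameter are hypothetical.

```python
def pass_rates(runs):
    """Map each prompt id to the fraction of its runs that passed."""
    return {pid: sum(outcomes) / len(outcomes) for pid, outcomes in runs.items()}

def regressed_prompts(baseline_runs, candidate_runs, tolerance=0.0):
    """Return prompt ids whose pass rate dropped by more than tolerance."""
    base = pass_rates(baseline_runs)
    cand = pass_rates(candidate_runs)
    return sorted(
        pid for pid in base
        if pid in cand and cand[pid] < base[pid] - tolerance
    )

# Made-up outcomes for illustration only, not measured results.
q4_runs = {"weather": [True] * 5, "calendar": [True] * 5}
q2_runs = {"weather": [True, True, False, True, True], "calendar": [True] * 5}
flagged = regressed_prompts(q4_runs, q2_runs)
```

A nonzero `tolerance` would keep normal sampling noise from being reported as a regression.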