Do smaller quants silently break tool calls / JSON output?

Posted by Fun_Employment6042@reddit | LocalLLaMA

I posted recently about EvalShift, an OSS CLI for regression-testing LLM model changes.

A few people pointed out that for LocalLLaMA, the more interesting use case may be quantization regression:

Q8 -> Q4_K_M

Same base model, same prompts, lower VRAM, but behavior may subtly change.

I want to test failures like:

I’m thinking of adding a LocalLLaMA demo: same golden suite, same base model, two quants, then generate an HTML report showing what regressed.
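To make the idea concrete, here is a rough sketch of the core comparison for a JSON extraction task: a case counts as regressed if the higher-precision quant produced valid JSON with the expected keys and the smaller quant did not. This is not EvalShift's actual API. The function names, the case ids, and the sample outputs are all made up for illustration, and it assumes the raw model outputs have already been collected as strings from whichever runtime is used.

```python
import json


def is_valid_json_with_keys(text, required_keys):
    """Return True if text parses as a JSON object containing all required keys."""
    try:
        obj = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        # TypeError covers a missing output (None) for a case.
        return False
    return isinstance(obj, dict) and all(key in obj for key in required_keys)


def find_regressions(baseline, candidate, required_keys):
    """Return ids of cases that pass on the baseline quant but fail on the candidate."""
    regressed = []
    for case_id, baseline_output in baseline.items():
        candidate_output = candidate.get(case_id)
        baseline_ok = is_valid_json_with_keys(baseline_output, required_keys)
        candidate_ok = is_valid_json_with_keys(candidate_output, required_keys)
        if baseline_ok and not candidate_ok:
            regressed.append(case_id)
    return sorted(regressed)


# Hypothetical outputs for the same prompts from two quants of one base model.
q8_outputs = {
    "invoice-1": '{"vendor": "Acme", "total": 120.5}',
    "invoice-2": '{"vendor": "Globex", "total": 80}',
    "invoice-3": "not json",
}
q4_outputs = {
    "invoice-1": '{"vendor": "Acme", "total": 120.5}',
    "invoice-2": '{"vendor": "Globex", "total": 80',  # truncated, no closing brace
    "invoice-3": '{"vendor": "Initech"}',
}

regressions = find_regressions(q8_outputs, q4_outputs, ["vendor", "total"])
print(regressions)
```

Cases that already fail on the baseline (like `invoice-3` here) are deliberately not counted, since the point is to isolate what the smaller quant broke rather than what the model could never do.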

Questions:

  1. Which model + quant pair should I test?
  2. Is Q8 -> Q4_K_M the right comparison?
  3. Should I test Ollama, llama.cpp, or vLLM first?
  4. Best demo task: JSON extraction, tool calls, RAG, coding, or instruction following?

Repo:
https://github.com/babaliauskas/evalshift-cli

MIT licensed. Local-first, no backend, no accounts, no telemetry.