Llama.cpp quantization is broken
Posted by Ok-Importance-3529@reddit | LocalLLaMA | View on Reddit | 53 comments
The main reason is that quantization quality directly affects a model's performance and stability, and that translates into real-world usefulness. Even though GRM-2.6-Plus scores better in benchmarks than the Qwen3.6 27B model it derives from, it gives worse results.
This is just one example; most of the quants I tested suffer from the same problems, and only a few of them, mostly ones using a different quantization mechanism, are useful below Q5.
I want to advocate for AutoRound quantization as the standard for lower quants (Q1-Q4). Apex also performed quite well, though its sizes are larger. Maybe you know of other alternative methods that give consistent results, because standard quants like Q4_K_M don't provide adequate results and often produce buggy behavior overall (looping, hallucinations, inconsistency).
Prompt: Create svg image of a pelican riding a bicycle
Multiple examples of different quant results
https://www.reddit.com/r/LocalLLaMA/comments/1szp96f/comment/oj3r4b1/
Autoround Q2_K_Mixed https://huggingface.co/sphaela/Qwen3.6-27B-AutoRound-GGUF

Regular llama.cpp Q4_K_M https://huggingface.co/morikomorizz/GRM-2.6-Plus-GGUF

This is just one example; the output quality is consistently worse when I ask it tricky questions, in how much it hallucinates, loops, etc.
The community should understand that typical quantization below Q5-Q6 is inadequate for Qwen models unless you tinker with it through a more intelligent mechanism like Intel's AutoRound.
In my experience, looping is a direct symptom of broken quantization, and occasional syntactic errors in agentic coding are another.
Pablo_the_brave@reddit
https://huggingface.co/mradermacher/Qwen3.6-27B-i1-GGUF/discussions/2#69f8ec00b1e79b10307be10a
Formal-Exam-8767@reddit
Fair point, in that case you should implement AutoRound support in llama.cpp.
datbackup@reddit
Bro autoround quants already run in current llama.cpp
They are better too, I think OP is right and it should become the norm
pmttyji@reddit
Do you mean AutoRound GGUFs on llama.cpp? I thought that isn't supported yet (check my other comment).
nickm_27@reddit
Conversion isn't supported, but Intel has a script to convert a model to GGUF, which does work in mainline llama.cpp.
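For reference, a minimal sketch of how that conversion might look with Intel's auto-round Python API (a sketch only: the model repo id and output path are placeholders, and the `gguf:q4_k_m` export format string is an assumption that may vary between auto-round versions):

```python
# Sketch: quantize a HF model with AutoRound and export it to GGUF,
# which should then load in mainline llama.cpp like any other quant.
# Repo id and output dir are placeholders; the gguf export format string
# is an assumption and may differ depending on the auto-round version.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_id = "Qwen/Qwen3.6-27B"  # placeholder repo id from the thread
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

ar = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
ar.quantize()
ar.save_quantized("./autoround-gguf", format="gguf:q4_k_m")
```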
Formal-Exam-8767@reddit
So nothing is really broken in llama.cpp?
Ok-Importance-3529@reddit (OP)
AutoRound is capable of producing GGUFs, to my understanding; people just don't make them and are used to llama.cpp...
pmttyji@reddit
I found only this ticket, closed as "not planned":
https://github.com/ggml-org/llama.cpp/issues/20216
Juan_Valadez@reddit
Do yourself a favor and delete your post now.
666666thats6sixes@reddit
So some quant of some model oneshots an uglier bird picture than a different quant of a different model? Is that the claim? How is that related to llama.cpp?
Ok-Importance-3529@reddit (OP)
Most of the GGUF quants are created using llama.cpp. The claim isn't that one model one-shots SVG creation better; the claim is that there seems to be severe degradation in quality at smaller quants, that other quantization methods handle it better, and that it's worth asking why those aren't more common.
SVG is just an example. The point of all this is how to get better-working quants for consumer hardware, because the errors and quality of today's quants vary a lot, and to ask whether there aren't better methods that suit models like Qwen3.6.
LetsGoBrandon4256@reddit
That's a fact that everyone knows. How is that related to "Llama.cpp quantization is broken"?
You really need a better metric for the "handle it better" than generating a duck riding a bike in SVG. We're not at /r/StableDiffusion arguing about the best sampler with a handful of gens.
Again, how is this related to "Llama.cpp quantization is broken"?
HumanDrone8721@reddit
Well, this is quite an extraordinary claim, so it will either be a downvote storm or be accepted as valid; watching with interest.
pkmxtw@reddit
No idea why OP is even comparing AutoRound to some Q4 quant of a random finetune, so we have no idea whether the damage is done by the finetune, by improper quantization, or by a real underlying issue in llama.cpp.
As a baseline, this should be compared to Q4/Q8 from unsloth or bartowski.
Gesha24@reddit
Funny enough, just yesterday I decided to try running Unsloth BF16 and Q8 quants of Qwen3.6-35B and... it couldn't even handle proper formatting in the web chat - prompts were leaking, etc. 4- and 5-bit quants fully loaded in VRAM work just fine. Two options: 1) the whole world doesn't use those quants and nobody reported they are broken, or 2) something is busted in my setup for running MoE models across both GPU and CPU. I should probably go make a post suggesting #1, because it can't be #2, right?
Hot-Employ-3399@reddit
If it was the built-in web chat, I would blame that first. It constantly breaks something.
Ok-Importance-3529@reddit (OP)
Those are compared with the same prompt in the link provided here: https://www.reddit.com/r/LocalLLaMA/comments/1szp96f/comment/oj3r4b1/
But this is just an example; it doesn't cover all the cases of local testing where one uses these models for specific tasks. Agentic coding, for example, is for me the most thorough evaluation of model quality.
shockwaverc13@reddit
>Q2_K_MIXED
>looks inside
>pure Q4_K quant
what the hell intel
LetsGoBrandon4256@reddit
Fucking kek
soyalemujica@reddit
I have been running the AutoRound models of Qwen 3.6 27B because I also found them to be smarter than Unsloth's, for example, and also faster (Q4_K_M in this case). What about that GRM 2.6 Plus? Is it better?
Ok-Importance-3529@reddit (OP)
This is exactly my experience. So far GRM 2.6 Plus at the provided quant didn't do anything differently than, let's say, the other Q4_K_M quants of the base model I tested. I did run just a few prompts though, and the result wasn't better than AutoRound, so I decided I'll wait for other quants to test it out.
soyalemujica@reddit
For the sake of testing, I gave that GRM 2.6 Plus a try, and it got stuck in a loop compiling my app (which it did compile). I checked the original repository and there are some different settings, such as temp 1.0 rather than 0.6 (I didn't test it with those; trying that now).
CalligrapherFar7833@reddit
Compare the same quants
Velocita84@reddit
You lost me at the clickbait title, your claims are automatically garbage
Ok-Importance-3529@reddit (OP)
That's the reason for the clickbait: to make more people read it and react. There's also an old approach to this: make a strong claim and let the discussion begin...
Velocita84@reddit
It's not a strong claim, it's straight up just a stupid claim. Llama.cpp isn't "broken" just because you tested a random quantized finetune on a vibe test that has no real world uses and it performed worse than a QAT
Ok-Importance-3529@reddit (OP)
Well, if you read through this post and the comments here you would know this is just one example I provided. I tested multiple normal quants from all the known providers and also the less-known ones. I've been doing this since Qwen3.5, and with both I created some apps I use just for fun to test how local LLMs perform and how they stand against SOTA models.
NNN_Throwaway2@reddit
I don't see BF16 or FP8 results here.
Ok-Importance-3529@reddit (OP)
Fair point, I don't have the hardware to run them; that's why I have so much experience with all the quants from different creators. I've tested all the major ones on Hugging Face, basically every alternative quant that exists for the Qwen3.6 and 3.5 27B models, and the 35B model as well, and the results are always in favor of the alternative quants.
I use models mostly for agentic coding, and I want consistent results without any syntactic mistakes or looping. So far AutoRound proves to be the king in <Q5 territory for me.
I would also like to mention Apex and steampunque's mixed quants. Unsloth's Q4_K_XL is also quite good.
I didn't even mention inference speeds, and that's a big factor too.
NNN_Throwaway2@reddit
They are available via API or can run in the cloud, no?
Ok-Importance-3529@reddit (OP)
Bottom line is, I'm quite sure I'm not the only one experiencing this, and more people will come up with evidence. I'm sure we are facing inadequate benchmarks for quant evaluation, and because of that we probably use worse quants than we could if there were reliable benchmarking methods for how well a model behaves versus the original.
"Behaves" is the key word here. I don't care about PPL or KLD; I care about tests that simulate real use, agentic, coding or otherwise. Until we have reliable metrics, it's hard to make progress.
My real metric is working with these models as daily drivers and doing some real work with them, and I can tell you there are huge differences in this space.
NNN_Throwaway2@reddit
Making an SVG of a duck is real use?
Ok-Importance-3529@reddit (OP)
No, it was just an example of how a model can produce results of very different quality. I think the SVG is a great example of one of the model's capabilities: if one model produces consistent results over multiple rounds of SVGs of roughly the same quality, and you run the same test with a presumably identical model at roughly the same size and get very different output, that's a strange thing that points to broken quantization.
My testing for my "daily driver" for local coding is different. I first test the model with multiple prompts for inference speed (llama.cpp bench), looping, and also the car wash question, and evaluate correct-answer consistency (rough sketch below). Then, if the model behaves well and the speed is adequate, I code with it in an agentic environment with opencode. There I see whether the model makes coding mistakes and which ones, and also whether tool-call errors show up.
The biggest plus is, I've done this with multiple GGUF versions of the same model since Qwen3.5 27B, and I always struggled with picking a reliably working quant at ~Q4 size.
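Roughly, the consistency part of that check looks something like this (just a sketch: the model file, prompt, and sampling values are placeholders, the looping detection is a crude heuristic, and the llama-cli flags are the standard -m/-p/-n/--temp/--seed ones):

```python
# Sketch: run the same prompt several times against one quant via llama-cli
# and flag outputs whose tail repeats, as a crude looping heuristic.
# Model file name, prompt, and sampling values are placeholders.
import subprocess

MODEL = "qwen-autoround-q4.gguf"  # hypothetical local file
PROMPT = "Create svg image of a pelican riding a bicycle"

outputs = []
for seed in range(5):
    result = subprocess.run(
        ["llama-cli", "-m", MODEL, "-p", PROMPT,
         "-n", "1024", "--temp", "0.6", "--seed", str(seed)],
        capture_output=True, text=True, timeout=600,
    )
    outputs.append(result.stdout)

for seed, text in enumerate(outputs):
    tail = text[-200:].strip()
    looped = bool(tail) and text.count(tail) > 1  # tail appears more than once
    print(f"seed {seed}: {len(text)} chars, suspected loop: {looped}")
```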
NNN_Throwaway2@reddit
I've never personally noticed any specific issues at Q4 or below that really jumped out at me compared to higher or even full precision. But I do coding and writing, not generating SVGs.
Ok-Importance-3529@reddit (OP)
Sorry if it looks like that. I tested higher-precision quants on local hardware, at Q5 and above, and they generally perform better and don't have the same problems. For example, steampunque's Qwen3.5 27B Q6 mixed quant was my go-to driver before Qwen3.6 came out.
The reason I won't test against full precision is, as I stated, that I can't run it on my local hardware, and comparing against an online provider means I don't know what backend, version, or precision they use.
Also, I don't want to pay for online services just for the fun of it. I'm not a quantizer running proper benchmarks and metrics and providing measurable data; I'm on the other end, just like you: I do coding and writing.
I do code with local LLMs, and there is a real difference in how a model performs at different quants.
D2OQZG8l5BI1S06@reddit
big if true
Adventurous-Paper566@reddit
When did we start considering Q2 viable?
a_beautiful_rhind@reddit
There are also exllama3 quants and ik_llama quants. Pretty bold to diss Q4_K_M though.
Drawing things is a good test but you also have to account for sampling.
DefNattyBoii@reddit
bro just open an issue on their repo
On the other hand, I'm also yearning for a good sub-4-bit quant method. It doesn't seem to be a focus anywhere, which I really don't understand.
Finanzamt_Endgegner@reddit
Bro, first, as others have mentioned, some random Q4 quant might not be a good comparison. And then PLEASE compare the same model across quantization techniques; there is no reason to compare different finetunes and then make claims about quantization quality lol
Ok-Importance-3529@reddit (OP)
I did compare the same models, I just didn't include them here because, as I stated, a lot of them are already in this post: https://www.reddit.com/r/LocalLLaMA/comments/1szp96f/comment/oj3r4b1/
StorageHungry8380@reddit
Not sure what's going on with your `llama.cpp`, but both the Q8 and Q4_K_M versions of https://huggingface.co/morikomorizz/GRM-2.6-Plus-GGUF worked just fine with your prompt on my `llama.cpp` release b8733. Here's the Q4 version for reference; Q8 was comparable.
Ok-Importance-3529@reddit (OP)
That's interesting, I use llama-b8946.
ttkciar@reddit
I tried Q6 on the suggestion of others on this sub, and saw no discernable difference from Q4_K_M, so will stick with Q4_K_M.
Ok-Importance-3529@reddit (OP)
I would like to add one important thing: the reason behind this post is to stir some discussion and raise awareness. Models are evolving, and with them, mechanisms that were fine half a year ago are obsolete now. By saying "broken" I want to challenge what standards of quantization are adequate for today's models and how to retain most of the original behavior at quants smaller than Q5, where the problem is most visible, yet which people use a lot because they fit most consumer hardware.
Imaginary_Bench_7294@reddit
It is possible that the fine-tuning itself is the culprit, not the quant strategy.
Comparing different fine-tunes is an apples-to-oranges comparison - they're both fruits, but very different on the inside. That's why LoRAs for LLMs haven't taken off the same way they have for image gen models.
Ok-Importance-3529@reddit (OP)
You are right that a finetune may behave very differently. I mostly tested the standard versions; GRM-2.6-Plus was the exception.
JacketHistorical2321@reddit
File an issue on GitHub then. I don't know why you're wasting time here.
jacek2023@reddit
try this https://github.com/ggml-org/llama.cpp/blob/master/tools/perplexity/README.md
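For reference, the workflow in that README for getting PPL plus KLD against a higher-precision reference boils down to roughly this (file names are placeholders; `--kl-divergence-base` and `--kl-divergence` are the flags documented there):

```python
# Sketch: compute perplexity and KL divergence of a quant against a
# high-precision reference using llama.cpp's perplexity tool.
# All file names below are placeholders.
import subprocess

BASE = "model-bf16.gguf"     # high-precision reference
QUANT = "model-q4_k_m.gguf"  # quant under test
TEXT = "wiki.test.raw"       # evaluation text

# 1) Save the reference logits from the high-precision model.
subprocess.run(["llama-perplexity", "-m", BASE, "-f", TEXT,
                "--kl-divergence-base", "logits.bin"], check=True)

# 2) Score the quant against those logits (reports PPL and KLD stats).
subprocess.run(["llama-perplexity", "-m", QUANT,
                "--kl-divergence-base", "logits.bin",
                "--kl-divergence"], check=True)
```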
Ok-Importance-3529@reddit (OP)
If PPL and KLD were metrics that showed the problems I described, we wouldn't be here. The problem is that current metrics aren't enough to evaluate a model's behavior.
sammcj@reddit
What about Unsloth's UD quants?
Ok-Importance-3529@reddit (OP)
Some are better than others; I don't know why, but historically I got a lot of looping. Also, when I use the K_XL ones that behave better, I get slower inference speed as a drawback. I can't tell much about the quality of <Q4; lately I've been testing Q4-Q5 quants that fit my hardware, and I settled on the AutoRound version, which has great speed and even better results. Q5_0 AutoRound also hits a good size/quality/speed ratio.
Also, for the 35B model I found Apex to perform reasonably well.
Any-Chipmunk5480@reddit
I also experienced looping in the CoT with both Qwen 3.6 35B A3B UD-Q4_K_M and Gemma 4 26B A4B UD-Q4_K_M.