Building an LLM Quants Testing Site/Resource - Sharing a few insights from the first month, so you can share your thoughts and wishes for the future.
Posted by norms_are_practical@reddit | LocalLLaMA | View on Reddit | 3 comments
Wanted to share some insights into a project I am building. The focus is to make it easier to understand how quantization affects open-weights models on practical work tasks. For every new model release, it seems like 200+ quantizations appear within the first couple of days. This is actually great, but I feel like we have a transparency gap around what is "good enough" when choosing an LLM quantization.
On the back of the realization that "mainstream" AI might actually increase in cost, open-weights LLMs could become relevant for the average person much sooner than we might think. If AI costs explode, understanding open-weights AI becomes much more important to support. So that is sort of the outset.
I have been working on a benchmarking test suite focused on quantization quality and the capability drop-off on practical test cases. Testing has been running for about a month at roughly 10 quantization test runs per day - starting out slow to see if anything was breaking, while still building and optimizing a few things here and there. So far I have reached 268 quants tested in this first month. The intent is to keep adding quantization tests as capacity allows; I expect to add about 50-100 new quantization test runs per week. Model efficiency plays a huge role in how fast I can cover additional quantizations, as does my own GPU availability.
E.g. quant test results for Vision Reasoning across 79 quantizations of:
Qwen 3.5 35B A3B vs. Gemma 4 26B A4B IT vs. Qwen 3.6 35B A3B

Further - average efficiency (token usage) results for the 3 models:

Qwen 3.6 35B A3B generally uses way more tokens than the other two - without delivering better results.
Takeaway: An AI model that "works" with fewer tokens could essentially be leveraged to run multiple loops over the same task to deliver even better results. AI model efficiency is a huge deal to dive into.
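To make that concrete, here is a minimal sketch of the idea (not part of the benchmark itself): with a fixed output-token budget, a more token-efficient model can afford several passes over the same task plus a simple majority vote. `run_task` is a hypothetical callable returning an answer and the tokens it consumed.

```python
# Minimal sketch (hypothetical helper, not from the project): spend a fixed
# token budget on repeated runs of the same task and majority-vote the answers.
from collections import Counter

def vote_within_budget(run_task, token_budget: int):
    answers, used = [], 0
    while used < token_budget:
        answer, tokens_used = run_task()   # one pass over the same task
        answers.append(answer)
        used += tokens_used
    # A token-efficient model fits more "votes" into the same budget.
    return Counter(answers).most_common(1)[0][0] if answers else None
```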
----
So far the following models have been tested:
qwen3.5-35b-a3b (22 quantizations tested)
gemma4-26b-a4b-it (24 quantizations tested)
qwen3.6-27b (14 quantizations tested)
qwen3.6-35b-a3b (33 quantizations tested)
qwen3.5-2b (26 quantizations tested)
qwen3.5-4b (26 quantizations tested)
qwen3.5-27b (24 quantizations tested)
gemma-4-e2b-it (24 quantizations tested)
gemma4-e4b-it (24 quantizations tested)
qwen3.5-0.8b (29 quantizations tested)
qwen3.5-9b (22 quantizations tested)
The hardware testing setup:
VPS server -> Tailscale Tunnel -> Windows PC w. RTX 5090 -> LM studio (server)
Looking into adding a Blackwell RTX 6000 to cover more types of quantized models.
Even though I am considering adding a Blackwell RTX 6000, the main idea is to focus on testing quantized models that can run on consumer GPU cards - so models up to around 32 GB of VRAM consumption are the main target. The point of adding this specific card is the close speed alignment between the RTX 5090 and the RTX 6000. That would keep the ongoing capture of tokens/second somewhat comparable, whereas adding other types of setups could skew the real-world tokens/second numbers and make them less valuable as a data point. LM Studio is not the fastest, but it is a baseline that everyone diving into AI can start with, without knowing much themselves.
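For reference, a harness like this can talk to the LM Studio server over its OpenAI-compatible HTTP API. A minimal sketch of such a call - the host, model id and field choices are my assumptions, not the project's actual code:

```python
# Minimal sketch of querying an LM Studio server via its OpenAI-compatible
# chat completions endpoint. Host and model id are placeholders.
import requests

LMSTUDIO_URL = "http://<tailscale-host>:1234/v1/chat/completions"  # LM Studio's default local server port is 1234

def ask(question: str, model: str, max_tokens: int = 4096) -> dict:
    payload = {
        "model": model,                                        # the quant currently loaded in LM Studio
        "messages": [{"role": "user", "content": question}],
        "max_tokens": max_tokens,                              # per-answer output budget (see below)
        "temperature": 0.0,
    }
    r = requests.post(LMSTUDIO_URL, json=payload, timeout=600)
    r.raise_for_status()
    data = r.json()
    return {
        "text": data["choices"][0]["message"]["content"],
        "tokens_in": data["usage"]["prompt_tokens"],
        "tokens_out": data["usage"]["completion_tokens"],
    }
```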
The benchmark is built around 6 test suites:
- 64 tests with "Tool-Calls"
- 64 tests with "Instruction Following"
- 64 tests with "Structured Output"
- 64 tests with "Code Correctness"
- 64 tests with "Logic & Reasoning"
- 64 tests with "Vision Reasoning"
So all in all, each and every quantization is tested against 384 test cases (6 × 64).
The tests are practical and are meant to show where/how quantized models break - specifically in practical work, where you mix work disciplines.
Tests are built to only accept the specific correct answer, in the specific answer format.
E.g. raw test outputs from a single reasoning test:
// "
// "
// "Based on the visual evidence, no, the blister package has not been opened. The packaging shows multiple identical units of Paracetamol (Poro) tablets arranged vertically in a single row. There is no indication that the package was opened or that any tablet inside has been removed." :: Verbal explanation == wrong
// "No" :: Correct answer in wrong format == wrong
When the models are prompted with a question, they are nudged with the constraint that they only have 4096 output tokens available for their response, per test answer. So far the actual outputs show that the average correct answer consumes less than 10% of this "constraint".
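To illustrate the strict grading described above, a minimal sketch - the expected-answer string here is a placeholder, since the post does not show the actual answer schema:

```python
# Minimal sketch of strict grading: only the exact expected answer, in the
# exact expected format, counts as a pass. The expected string is a placeholder.
def grade(raw_output: str, expected: str) -> bool:
    answer = raw_output.strip()
    # "Based on the visual evidence, no, ..."  -> wrong (verbal explanation, wrong format)
    # "No"                                     -> wrong (correct content, wrong format)
    return answer == expected
```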
To be able to deliver high-quality data for ongoing analysis, I have implemented capture of all the data points I found meaningful to include (a rough sketch of such a record follows the list below) - e.g.:
- Raw response output
- Tokens Input
- Tokens Output
- Latency in ms
- Token output speed
- Pass (Score - 4 test suites allow partially correct answers)
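A minimal sketch of what such a per-test record could look like - the field names are my own guesses based on the list above, not the project's actual schema:

```python
# Minimal sketch of a per-test result record; names are illustrative only.
from dataclasses import dataclass

@dataclass
class TestResult:
    model: str               # e.g. "gemma4-26b-a4b-it"
    quant: str               # e.g. "Q4_K_M"
    suite: str               # e.g. "Vision Reasoning"
    test_id: int             # 0..63 within the suite
    raw_response: str        # raw response output
    tokens_in: int           # input tokens
    tokens_out: int          # output tokens
    latency_ms: float        # latency in ms
    tokens_per_second: float # token output speed
    score: float             # 1.0 = pass; partial credit allowed in 4 of the 6 suites
```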
A website is available - It works fairly well on desktop (semi-well on mobile).
The website has a 64-pixel grid-view "heatmap" for individual test case output inspection.

The website has a history overview to see the latest test runs - updated live as tests run:

I am working on a report builder - for anyone to make custom reports on the data:

Hope you find the project and its intent useful. The idea is to help out everyone who has an interest in taking a more data-driven path when selecting an LLM quantization for their AI endeavours 😎
PS: There is a ton of information to share about the project and test results. If you have a specific interest, please note it and I will try to go deeper into those specific areas in the next posts. There are no sponsors or monetization. It's driven by an interest in AI.
Total_Activity_7550@reddit
This is a very interesting project. I think so because I am also building something similar 😄
Chromix_@reddit
Your testing needs more tests and more repetitions. A Q3_K_XL quant of Gemma 26B A4B made it to first place, while a Q5_K_M only made it to the lower middle, and other XL quants didn't make it to the top. It'd also be nice to have a Q8, Q8_XL and BF16 in there as a baseline to compare to.
norms_are_practical@reddit (OP)
"Your testing needs more tests and more repetitions."
=>
The current level, with 64 tests for each subject area, provides quite substantial coverage - much better than most are doing at this level.
For Gemma 4 in this specific chart, there are currently five XL quants. The first four are placed 1st, 7th, 8th and 9th out of the 24 quants tested. The difference between 1st and 9th is 2 wrong answers out of the 64 tests. So "didn't make it to the top" is a somewhat simplistic, yet arbitrary, takeaway based on a single data point.
------
"You should add that (wrong format) as separate metric instead of putting it into the "wrong" bucket - it can be highly relevant."
=>
If this result were seen as "an end state", then a "wrong" format might be okay.
The anticipation is that models are going to be used in systems, where the system relies on the answer having the right format. In that case, the only actually *correct* answer is the one delivered with the proper formatting. Delivering the right answer is about communicating understanding, rather than just ideas. The ability to deliver the answer inside a structured format means reliability beyond the actual answer. Adding new ways of evaluating the data is easy, but making sure it matches something with actual value is harder.
The 0.8B model is impressively good (for its size) at finding answers in the vision test, but it fails immensely at delivering the answer in the right format. You could use the 0.8B for chatting, but in a system where the output has to follow a structure to be useful, it would wreck the whole system's progress.
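A minimal sketch of that point - the {"answer": ...} JSON schema here is a made-up placeholder, not the benchmark's actual answer format: downstream code that expects structure simply cannot use a correct-but-verbal answer.

```python
# Minimal sketch: a downstream consumer that relies on structured output.
# The {"answer": ...} schema is a placeholder, not the benchmark's real format.
import json

def route_answer(model_output: str) -> str:
    try:
        return json.loads(model_output)["answer"]   # downstream logic needs this field
    except (json.JSONDecodeError, KeyError, TypeError):
        raise ValueError("unusable output: possibly correct content, but not extractable reliably")

# route_answer('{"answer": "no"}')                       -> "no"
# route_answer('Based on the visual evidence, no, ...')  -> ValueError
```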
-------
"It'd also be nice to have a Q8, Q8_XL and BF16 in there as a baseline to compare to."
=>
This would only be possible with offloading once we reach the 30B-ish sized models. As noted, I am looking to add an RTX 6000 Blackwell to make sure additional quantizations can be included without interfering with the speed metrics. It makes no sense to start offloading to varying degrees for each quantization. For the smaller models, the coverage already includes various Q6 and Q8 quants.