Qwen 3.6-35B-A3B KV cache part 2: PPL, KL divergence, asymmetric K/V, 64K row on M5 Max
Posted by Defilan@reddit | LocalLLaMA | 15 comments
Followup to yesterday's post: https://www.reddit.com/r/LocalLLaMA/comments/1sy7srk/. Comments asked for perplexity, KL divergence, asymmetric K/V combos, and a 64K data point. Ran them overnight. Same M5 Max, same Qwen 3.6-35B-A3B Q8, same TheTom TurboQuant fork (feature/turboquant-kv-cache).
Quality (perplexity + KL divergence on wikitext-2)
For u/milpster and u/Karyo_Ten. Context size 4096, since the canonical 512 doesn't fill enough KV cache to surface cache-quantization effects. f16 saves the baseline logits via --kl-divergence-base, then each quant run computes KL against that.
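For anyone reproducing this, the measurement is a two-step run shaped roughly like the following. Filenames are placeholders, the exact -fa spelling varies by build, and turbo3/turbo4 only exist on the TurboQuant fork:
```bash
# 1) f16 KV cache run: save the baseline logits for the KL comparison
./llama-perplexity -m qwen3.6-35b-a3b-q8_0.gguf -f wikitext-2-raw/wiki.test.raw \
  -c 4096 -ngl 99 -fa \
  --kl-divergence-base logits-f16.kld

# 2) quantized-cache run: same model and text, compute PPL + KL against the baseline
#    (swap q8_0 for turbo3 / turbo4 to get the other rows)
./llama-perplexity -m qwen3.6-35b-a3b-q8_0.gguf -f wikitext-2-raw/wiki.test.raw \
  -c 4096 -ngl 99 -fa \
  -ctk q8_0 -ctv q8_0 \
  --kl-divergence-base logits-f16.kld --kl-divergence
```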
| Cache | PPL | KL vs f16 | Top-1 token agreement |
|---|---|---|---|
| f16 | 5.7438 | baseline | n/a |
| q8_0 | 5.7433 | 0.0016 | 98.64% |
| turbo3 (~4.9x) | 5.8092 | 0.0199 | 93.93% |
| turbo4 (~3.8x) | 5.7810 | 0.0131 | 95.28% |
q8_0 KV is essentially free at this depth. The PPL delta is -0.0005, well inside the ±0.036 stderr. KL is 0.0016. The quantized cache picks the same top-1 token as f16 98.6% of the time. The worry from yesterday's comments was "what does this cost in quality." At 4k context, it's noise.
turbo3 costs about 1% PPL increase and 5 percentage points of top-token disagreement, with KL roughly 12x q8_0. turbo4 sits between, in line with its lower compression ratio. Quality cost scales with compression, no surprises.
Asymmetric K/V (depth sweep)
For u/Sabin_Stargem, and to address my own untested caveat from yesterday. Decode tok/s, same llama-bench flags as the symmetric sweep:
| Depth | q8_0 K / turbo4 V | q8_0 K / turbo3 V | f16 K / turbo4 V |
|---|---|---|---|
| 0 | 82.9 | 81.8 | 72.8 |
| 8K | 75.4 | 75.6 | 16.9 |
| 32K | 66.0 | 63.2 | 8.6 |
| 128K | 41.0 | 38.2 | 2.8 |
| 256K | 27.1 | 25.0 | skipped |
| 512K | 16.5 | 14.8 | skipped |
-ctk q8_0 -ctv turbo4 is the standout. At 256K it matches yesterday's symmetric q8_0 throughput (pp 128 vs 124, tg 27.1 vs 26.6), and it fits 512K where symmetric q8_0 OOM'd. So you get q8_0-grade prefill behavior with a turbo4-grade context ceiling. Sabin's hypothesis that V compresses cheap and K compresses expensive looks right on the throughput side. On the quality side, I'd want a PPL run on the asym combos to fully close the loop.
-ctk q8_0 -ctv turbo3 does the same trick but with worse decode. Tighter V quant taxes the generation side more.
-ctk f16 -ctv turbo4 is broken on this fork on Metal. The FlashAttention kernel doesn't fast-path that K/V type combination, so it falls back to a generic dequant-then-attention path. At 8K it's 34x slower than symmetric f16. At 128K it's 78x slower (4.1 t/s pp). Cells past 128K weren't worth completing. Don't use this combo.
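For anyone wanting to reproduce the asymmetric rows, the invocations are shaped roughly like this. Hedges: -d needs a llama-bench new enough to have the depth flag, -fa spelling varies by build, and the turbo types are fork-only:
```bash
# one asymmetric combo swept across all depths; the other combos just swap -ctk / -ctv
./llama-bench -m qwen3.6-35b-a3b-q8_0.gguf -ngl 99 -fa 1 \
  -p 512 -n 128 \
  -ctk q8_0 -ctv turbo4 \
  -d 0,8192,32768,131072,262144,524288
```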
64K row
For u/ocarina24. Filling the 32K to 128K gap on the prefill curve. All seven configs at depth 65536:
| Cache | pp512 (t/s) | tg128 (t/s) |
|---|---|---|
| f16 (symmetric) | 602.0 | 59.8 |
| q8_0 (symmetric) | 479.2 | 57.9 |
| turbo3 (symmetric) | 469.8 | 49.9 |
| turbo4 (symmetric) | 418.0 | 55.2 |
| q8_0 K / turbo4 V | 468.2 | 55.9 |
| q8_0 K / turbo3 V | 465.6 | 52.6 |
| f16 K / turbo4 V | 8.3 | 4.9 |
Two things stood out. First, the prefill curves are nearly converged at 64K. turbo3 (470) is within 2% of q8_0 (479). Yesterday's data showed turbo3 actually pulling ahead by 128K (253 vs 245), so the bandwidth-bound regime kicks in somewhere between 64K and 128K on this hardware. Earlier than I'd estimated. Second, the asymmetric q8_0/turbo* rows track symmetric q8_0 prefill closely at this depth too. Same story as the deeper rows.
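Side note for anyone filling in a single depth themselves: llama-bench takes comma-separated lists and cross-products them, so one command covers every K/V pairing at 64K (the seven rows above are a subset of that grid). Same hedges as above on flag spellings and the fork-only turbo types:
```bash
# every K/V pairing of the four cache types at depth 65536 (4 x 4 = 16 rows)
./llama-bench -m qwen3.6-35b-a3b-q8_0.gguf -ngl 99 -fa 1 \
  -p 512 -n 128 -d 65536 \
  -ctk f16,q8_0,turbo3,turbo4 \
  -ctv f16,q8_0,turbo3,turbo4
```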
What I take from all of this
Updated cache-type recommendation from yesterday:
- Coding agents (deep context, lots of generated tokens): -ctk q8_0 -ctv turbo4 is the new pick. q8_0 quality on K, turbo4 savings on V, fits 512K. Launch sketch after this list.
- RAG or batch QA (heavy prefill, short answers): same combo, or symmetric turbo3 at the deepest depths.
- Pure 1M context maxing: still symmetric turbo3, only thing that fits.
- Short interactive (under 32K): f16 if memory allows, else q8_0. Quality cost is genuinely zero.
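A minimal launch sketch for the coding-agent pick, assuming the TurboQuant fork's llama-server, enough RAM for the 512K cache, and that your build's flash-attention flag spelling matches:
```bash
# hypothetical llama-server launch: q8_0 K / turbo4 V with a 512K context window
./llama-server -m qwen3.6-35b-a3b-q8_0.gguf -ngl 99 -fa \
  -c 524288 -ctk q8_0 -ctv turbo4 \
  --port 8080
```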
Caveats
- PPL was at 4096 context. Quality at deeper contexts, where the cache is more saturated, might tell a different story.
- Asymmetric quality numbers are still pending. Throughput data argues V-side compression is cheap, but I haven't measured KL or PPL on the asym combos yet.
- f16 K + turbo* V falls back to a slow generic kernel on this fork on Metal. Verify before assuming this combo works on other backends.
- Single hardware data point (M5 Max, 128 GB). Crossover depths and the prefill/decode split likely shift with memory bandwidth and GPU core count.
Still in flight
- u/GCoderDCoder. Aider Polyglot pass for f16, turbo3, and turbo4 (q8_0 was 62.2% earlier this week, n=225). Each Polyglot run is about 6 to 12 hours, so it's a few nights serial. Running later this week.
- u/noctrex. Wider quant types (q4_0, q4_1, iq4_nl, q5_0, q5_1) extending the depth sweep. After Aider.
- u/Able_Librarian1569. Same sweep on a non-MoE non-DeltaNet model for transferability. After the wider quant types.
Same offer as yesterday. If you have non-M5-Max Apple Silicon and want to run a slice of this matrix, drop your numbers below or DM me. Happy to send the raw llama-bench and llama-perplexity output for anyone who wants to dig into the per-cell stats.
Full writeup with the methodology and the per-cell stderr numbers: https://llmkube.com/blog/turboquant-m5-max-quality-and-asymmetric
PaceZealousideal6091@reddit
I appreciate this update, but what's the point of this if you don't include the Q4_0 (and maybe Q5_0) KV cache KLD to compare with? You are advocating turbo3 and 4 when most of the other results indicate they're not much better, or even worse, than the standard KV quants like Q4_0. The whole reason lcpp hasn't integrated turbo quants is because, in its current implementation, it's not that good.
Defilan@reddit (OP)
Fair point, the q4_0 / q5_0 gap is the real comparison and I should have included it in this round. q4_0 at ~4.5 bpv vs turbo4's ~4.25 is the head-to-head that actually settles whether turbo4 is winning anything. Without that data, calling it "the new pick" was overreach on my part. The only claim I can actually defend from this round is the memory-ceiling story (q8_0 K + turbo4 V fits 512K where symmetric q8_0 OOMs at 256K), not a quality argument. Wider quant types are in the "still in flight" list at the bottom of the post; I'll fold KL + PPL into that sweep so we get a direct head-to-head, and if q4_0 lands closer to f16 than turbo4 does, the recommendation gets updated. Good callout, I appreciate it! All part of the journey.
Gringe8@reddit
Why does this read like opus
Orolol@reddit
Because it is.
PaceZealousideal6091@reddit
Actually, it's not even a memory-ceiling story. Without comparing the quality part of the argument, Q3_0 and lower quants will beat turbo quants any day.
Defilan@reddit (OP)
Unless I'm missing something, Q3_0 isn't a KV cache type in mainline llama.cpp. I verified the `kv_cache_types` list in `ggml-org/llama.cpp` `common/arg.cpp`: F32, F16, BF16, Q8_0, Q4_0, Q4_1, IQ4_NL, Q5_0, Q5_1. The Q3_K family is weight-only quantization, not KV. So the actual lowest-bpv standard KV option is IQ4_NL or Q4_0 at ~4.5 bpv, vs turbo3 at ~3.25 bpv. On raw symmetric memory ceiling, turbo3 wins that comparison. You're right though that I can't claim it wins on quality-per-bit until I have q4_0 head-to-head numbers, which is exactly what's in the next sweep. If q4_0 PPL/KL lands where turbo3's does at less compute and zero kernel risk, that absolutely changes the recommendation. Happy to be wrong on the data, that's why I'm running these benchmarks.
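If anyone wants to check that list in their own checkout, something like this works (how much context you need to print depends on the revision):
```bash
# show the KV cache types accepted by -ctk / -ctv in mainline llama.cpp
grep -n -A 12 "kv_cache_types" common/arg.cpp
```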
No-Refrigerator-1672@reddit
If you just enable all the default tools of OpenWebUI, that's already 10k tokens just for the system prompt. The days of 4k-long requests are long gone; you now need to consider KLD and other metrics at 16k-33k to stay up to date. Otherwise, thank you for the data, I appreciate the thorough approach to testing, which is atypical for an average review.
Defilan@reddit (OP)
For sure, you're not doing much of anything with 4k. I'm new to benchmarking all of this, but I've been diving in a lot over the past few months, trying to find the balance between tests that are informative and tests that reflect what's practical for daily use. Still figuring that out, but I appreciate your thoughts, totally agree.
StorageHungry8380@reddit
Interesting data, will be looking forward to your additional data points.
As an aside, I tried computing mixed-precision KLD using `llama-perplexity`; it said it required Flash Attention enabled, so I did that, but speed dropped like a stone. It turned out it was doing most of the work on the CPU, with no errors or warnings suggesting this. Is `llama.cpp` missing some mixed-quantization KV FA kernels?
Beamsters@reddit
Hey, came here to say thank you for sharing your test. It was a great read. I'm pretty curious about the same thing, but for Qwen 3.6 27B: a dense model and the trade-off between quality/memory/pp/tg. I've read somewhere that dense models are a step more resilient to quantization compared with sparse ones. Have you ever tested that somehow?
Defilan@reddit (OP)
Thanks for the comment! Quick architecture check, though: I just pulled the Qwen 3.6 27B model card, and it actually uses the same hybrid DeltaNet + Gated Attention pattern as the 35B-A3B. 16 attention layers out of 64 (vs 10 of 40 in the 35B), still GQA. So the KV cache per token is similar, and I'd expect the crossover regime to land in the same 64K to 128K range, not earlier.
For the hybrid-vs-classic contrast you're probably hoping for, a dense model with classic GQA at every layer would be the better candidate. There are a few I'm planning to test in the overnight runs as well. Regarding MoE vs dense quant resilience: that's well documented for weight quantization. KV cache quants are a different story, since the K/V projections are dense (non-MoE) even in MoE models; routing happens at the FFN, not attention. So weight-quant sensitivity doesn't transfer.
I appreciate the callout, I've got this queued up on my list after wider quant types and Aider Polyglot. Will share numbers in a follow-up post.
milpster@reddit
You rock! Thank you. I'd be really interested in calculations with long ctx though.
Defilan@reddit (OP)
Haha, thanks! This has actually been really fun to do, pushing the system a bit and seeing what the model can do. I'm hoping to get to the longer contexts too; I'll be running more tests over the next few nights and will share the results.
bennmann@reddit
Include Vulkan tests? might be useful for AMD+Intel fam
Defilan@reddit (OP)
Honestly, Vulkan isn't really in scope on this hardware. Whole series has been M5 Max + Metal because that's the machine I'm testing on now, and running Vulkan on M5 Max via MoltenVK would just test the translation layer, not what AMD or Intel users would actually see on their cards. If anyone reading this has an AMD or Intel GPU and wants to run a slice of the matrix on Vulkan, drop the numbers and I'll fold them into a comparison post, same model + same llama-bench flags would make the rows directly comparable. Otherwise it'll have to wait until I can get hands on the hardware. Great idea, though. Would love to see more numbers for AMD+Intel folks around this.