What is the current status with Turbo Quant?
Posted by kickerua@reddit | LocalLLaMA | View on Reddit | 58 comments
It was hyped about 2 weeks ago and I remember seeing some pull requests into llama.cpp, but what is the current status now that the hype has faded?
Astrale321@reddit
The Google paper is a year old; it has probably been in use since around then.
Pleasant-Shallot-707@reddit
Right…that’s why llama.cpp and others have been racing to integrate it
Astrale321@reddit
https://arxiv.org/abs/2504.19874 Read it yourself
Pleasant-Shallot-707@reddit
I’m fucking aware of the paper, fool. They would have integrated it already and not made a big deal out of the implementation and HF wouldn’t be experiencing a flurry of TQ variants if it had been available for people to use already.
Independent-Let-8988@reddit
You clearly weren't; the technology has been out since the paper launched, just without attention
cnmoro@reddit
TheTom's repo works very well for me. Using q8 for k and turbo4 for v. It's blazing fast and uses small amounts of VRAM. I'm running qwen35ba3b with 128k context on a 5060ti 16gb very well.
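A back-of-the-envelope check of why this setup fits: KV cache size scales linearly with context length and with the per-element byte widths chosen for K and V. The layer/head dimensions below are illustrative assumptions for a 30B-A3B-class GQA model, not taken from any model card, and `turbo4` is assumed to cost roughly 4 bits per element (block scales add a little on top in practice):

```python
def kv_cache_gib(ctx, n_layers, n_kv_heads, head_dim, k_bytes, v_bytes):
    """Total KV cache in GiB: one K vector and one V vector per layer per token."""
    per_token_bytes = n_layers * n_kv_heads * head_dim * (k_bytes + v_bytes)
    return ctx * per_token_bytes / 2**30

# Illustrative dimensions (assumed): 48 layers, 4 GQA KV heads, head_dim 128
L, H, D, CTX = 48, 4, 128, 128_000

full  = kv_cache_gib(CTX, L, H, D, 2.0, 2.0)   # f16 K and f16 V
mixed = kv_cache_gib(CTX, L, H, D, 1.0, 0.5)   # q8 K + 4-bit V (turbo4 assumed ~4 bpw)
print(f"f16 K/V: {full:.1f} GiB, q8 K + 4-bit V: {mixed:.1f} GiB")  # ~11.7 vs ~4.4 GiB
```

Under these assumed dimensions the mixed cache saves roughly 7 GiB at 128k context, which is what leaves room for the model weights on a 16gb card.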
dbzunicorn@reddit
3B active should already run fast on a 5060 without TurboQuant. Can we test some larger models?
hwpoison@reddit
Also works for me with a CPU-only configuration.
Wisam_Abbadi@reddit
How does it perform? And what specs are you running on?
apollo_mg@reddit
+1 for TheTom's repo, using RDNA4.
Overall-Somewhere760@reddit
Why q8 for k? Better speed or better PPL?
Dr4x_@reddit
Which quant of the model do you use for it to fit in 16gb?
cnmoro@reddit
Q3ks from byteshape
enrique-byteshape@reddit
<3
EffectiveCeilingFan@reddit
Very limited. The majority of pull requests and implementations for TurboQuant in llama.cpp are entirely vibe-coded and absolute dogshit. It’s all just hype, anyway. Google massively promoted an incremental improvement that draws almost entirely from existing techniques.
Pleasant-Shallot-707@reddit
You’re misunderstanding based on the video that that guy released.
They compared compression rates to fp32, yes. If you think the awesome thing was compression and are saying it’s hype because they should have compared to bf16, then you missed the point. It’s the fact that they had zero quality loss.
So, yeah, you’re getting 4x compression rather than 8x…you’re still not losing quality.
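The multiplier argument here is just bit-width arithmetic: a ~4-bit cache is 8x smaller than an fp32 baseline but only 4x smaller than bf16, and the claim being defended is quality at that width, not the headline ratio. A trivial sanity check (bit widths only; real formats carry a small per-block scale overhead):

```python
# Bits per stored element for each format (TQ4 assumed ~4 bpw, ignoring block scales)
FP32_BITS, BF16_BITS, TQ4_BITS = 32, 16, 4

print(FP32_BITS / TQ4_BITS)  # 8.0  -> the compression quoted against fp32
print(BF16_BITS / TQ4_BITS)  # 4.0  -> the compression against a bf16 baseline
```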
EffectiveCeilingFan@reddit
What video? I’m basing this off my own analysis of the paper and existing implementations, as well as experiments on those implementations I’ve conducted.
Several vibe-slop implementations that I’ve seen linked here aren’t even TQ, literally just a Hadamard on a Q4 KV. Results with an implementation that I do believe to be correct were unimpressive. TQ4 was in the ballpark of Q8_0 in terms of real-world performance, though results varied wildly. Hybrid models performed the worst, with very minor improvement over baseline Q4_0. My assumption is that the sparsity of the attention means that KV values encode information with a higher intrinsic dimensionality than in a traditional full-attention transformer. I lack the expertise to perform a more rigorous analysis of the exact failure mode. It could be that quantized KV interacts weirdly with SSM layers. Either way, my conclusion is that TQ is an improvement, but a very mediocre one.
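For reference, the "just a Hadamard on a Q4 KV" recipe mentioned above amounts to: rotate each K/V vector with a randomized Hadamard transform so outlier channels get smeared across all dimensions, quantize in the rotated basis, and rotate back when the values are used. This is a minimal pure-Python sketch of that idea, not code from any of the repos discussed; the dimension and the single synthetic outlier channel are invented for illustration:

```python
import math
import random

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    s = math.sqrt(n)
    return [v / s for v in x]

def q4_roundtrip(x):
    """Symmetric 4-bit quantize + dequantize of one vector (one scale per vector)."""
    scale = max(abs(v) for v in x) / 7.0 or 1.0
    return [max(-8, min(7, round(v / scale))) * scale for v in x]

random.seed(0)
DIM = 64  # assumed head_dim; must be a power of two for the plain Hadamard
signs = [random.choice((-1.0, 1.0)) for _ in range(DIM)]  # random +/-1 diagonal D

def hadamard_q4(x):
    """Rotate with H*D, quantize, rotate back (both H and D are self-inverse)."""
    rotated = fwht([v * s for v, s in zip(x, signs)])
    return [v * s for v, s in zip(fwht(q4_roundtrip(rotated)), signs)]

# One synthetic K/V vector with a single large outlier channel,
# the case where naive per-vector Q4 loses the most precision.
vec = [random.gauss(0.0, 1.0) for _ in range(DIM - 1)] + [20.0]

def mae(a, b):
    return sum(abs(u - v) for u, v in zip(a, b)) / len(a)

print("plain Q4 error:  ", mae(q4_roundtrip(vec), vec))
print("rotated Q4 error:", mae(hadamard_q4(vec), vec))  # noticeably lower
```

The rotation helps because the outlier no longer dominates the quantization scale, but it is still plain 4-bit rounding underneath, which is why an implementation that does only this is not TQ.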
Pleasant-Shallot-707@reddit
Weird how you repeat, word for word, a YT video that just came out
EffectiveCeilingFan@reddit
Why are you being so vague with your accusation? I don’t even know what video you’re talking about dude.
rmhubbert@reddit
There's a turboquant implementation on vLLM nightly now. Added today. Haven't had a chance to try it yet, though - https://github.com/vllm-project/vllm/pull/38479
the__storm@reddit
My god the discussion on the PR is hard to read. It's like 40% humans talking to each other interspersed with 60% LLM text walls lol
rmhubbert@reddit
Ha! Aye, I got lost pretty much instantly.
Conscious_Nobody9571@reddit
It was hype and the sheeple fell for it
fuckingredditman@reddit
nope it's not, i can run gemma4 31b Q4_K_M with 92k context using it (TheTom llama.cpp fork) instead of only 4k without it.
EffectiveCeilingFan@reddit
What kinda hogwash is this? If you could only fit 4k before, you’d be able to fit like 16k now with TurboQuant. These numbers are just wrong. Also, you could do that before anyway with q4 KV cache quantization. The TQ matching F16 performance is largely unproven, in particular for hybrid architectures.
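The "like 16k" figure above follows directly from bit widths: at a fixed memory budget, context scales inversely with the bits spent per KV element. A quick check, treating the quantized formats as their nominal bit widths (block scales add a small overhead on top):

```python
def ctx_at_budget(ctx_f16, k_bits, v_bits):
    """Context fitting in the same memory as ctx_f16 tokens of f16 K/V (16+16 bits)."""
    return ctx_f16 * (16 + 16) / (k_bits + v_bits)

print(ctx_at_budget(4_000, 4, 4))  # 16000.0 -> the 4x "like 16k" figure above
print(ctx_at_budget(4_000, 8, 4))  # ~10667  -> q8 K + 4-bit V for comparison
```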
fuckingredditman@reddit
tbh the original number might not be 100% accurate, because I never tuned it properly on the upstream version (it clearly wasn't usable for coding), but with 3-bit TQ I get 92k. That's really the main point I'm trying to make.
EffectiveCeilingFan@reddit
Has it outperformed traditional Q4 KV quantization? I’m surprised that fitting Gemma 4 context is so tight. Gemma 4 uses half the KV that other leading models do because the K and V are unified. I don’t know if it’s implemented in llama.cpp yet tho.
fuckingredditman@reddit
i never tried it with regular 4 bit KV quant because in my experience it performs quite poorly on any model i've tried it on so far in coding use cases
Neat_Zucchini_9938@reddit
What is your setup? I have 24gb VRAM.
fuckingredditman@reddit
i run this fork https://github.com/TheTom/llama-cpp-turboquant on that branch and occasionally rebase it on upstream if there are any interesting fixes
then just
Cferra@reddit
Not really, considering it’s working really well for me
-_Apollo-_@reddit
lol we still don’t have mtp for qwen3.5 in llama.cpp. Some things move slow.
StrikeOner@reddit
https://github.com/ggml-org/llama.cpp/discussions/20969
thank me later.
kickerua@reddit (OP)
Can I thank you now?
StrikeOner@reddit
thank-yous only get accepted after you read through each and every one of the 500+ messages there and can answer any Q&A about that thread without looking it up!
Twirrim@reddit
This is localllama, aren't we supposed to just throw the text into our locally hosted LLM to summarise?
Dany0@reddit
This is localllama, only posers throw all the text into an LLM. Real Gs know lingebra
IllegibleCheeto@reddit
My random Hadamard matrices are a little rusty. I took a numerical linear algebra class a long time ago. In addition to the regular linear algebra.
Can I still be a Real G?
maglat@reddit
You are so right. I so often forget to use my actual AI power for that kind of task ^^
BriguePalhaco@reddit
This is localllama, we don't have enough context window. :-/
Twirrim@reddit
I can only run small models, so I got mine to quickly vibe-code me a script to download the comments content, and then used aistudio / Gemini 3 to summarise.
The following is a summary of the key technical findings, implementation milestones, and performance results discussed:
### 1. The Fork Ecosystem
Multiple independent implementations emerged simultaneously, with frequent cross-collaboration:
* **TheTom (Metal/General):** Developed the primary `turboquant_plus` fork, focusing on Apple Silicon and general logic.
* **AmesianX (Independent CUDA):** Focused on high-end NVIDIA hardware (Blackwell/DGX), implementing fused kernels and robust `head_dim` detection.
* **spiritbuun & Madreag (Optimized CUDA):** Developed highly optimized CUDA kernels for Ampere and Ada architectures.
AppealSame4367@reddit
Has anyone seen a dflash pr?
MachineZer0@reddit
Saves me about 7gb on Minimax M2.7. Was able to move up from Q3 to Q4 on 128gb VRAM
https://github.com/richginsberg/llama-cpp-turboquant/tree/feature/turboquant-kv-cache
I took https://github.com/TheTom/llama-cpp-turboquant branch this weekend and merged master from https://github.com/ggml-org/llama.cpp into it.
maglat@reddit
What's your GPU setup for that?
jdiegmueller@reddit
I've been running TheTom's fork -- specifically the feature/turboquant-kv-cache branch -- for a week or so now on CUDA hardware. I've landed on using q8 for K and turbo2 for V with both qwen3.5-27B:Q6 and qwen3.5-9B:Q6.
Overall-Somewhere760@reddit
Why q8 on k ? Better speed or better PPL?
jdiegmueller@reddit
Check out TheTom's TurboQuant repo (not the llama.cpp fork; the TurboQuant one). I'm on mobile or I'd direct-link it for you.
A lot of research notes, test results, etc. Truthfully a lot of it is over my head, but all the data is there.
Ultimately my takeaway was that you can be as aggressive as you want with V without any real world penalty. On Q8 and Q6 quants I got the impression using turbo4 or even turbo3 would probably be just fine for K on qwen3.5, but turbo2 V gave me enough headroom to crank my context + parallel sessions to where I wanted so I decided to take the win for now while this stuff gets sorted out.
I do also plan to try out the vLLM PR tonight.
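The headroom tradeoff described above is easy to quantify: whatever VRAM is left after the weights, the per-token KV cost sets how much context each parallel slot gets. A sketch with invented numbers (the layer/head dimensions are assumptions, not the real qwen3.5-27B config, and turbo2 is assumed to cost roughly 2 bits per element):

```python
def ctx_per_slot(free_gib, n_slots, n_layers, n_kv_heads, head_dim, k_bytes, v_bytes):
    """KV-cache tokens available to each of n_slots parallel sessions."""
    per_token = n_layers * n_kv_heads * head_dim * (k_bytes + v_bytes)
    return int(free_gib * 2**30 / n_slots / per_token)

# Assumed dims for a 27B-dense-style model: 46 layers, 8 KV heads, head_dim 128,
# with 8 GiB free after weights, split across 4 parallel slots.
f16   = ctx_per_slot(8.0, 4, 46, 8, 128, 2.0, 2.0)   # f16 K and V
mixed = ctx_per_slot(8.0, 4, 46, 8, 128, 1.0, 0.25)  # q8 K + ~2-bit V
print(f16, mixed)  # the mixed cache gives each slot 3.2x the context
```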
Naiw80@reddit
Google is busy trying to patch gemma4 to work as expected…
b1231227@reddit
Not bad. I managed to fit in context that I previously couldn’t use, and there’s hardly any loss (K Q8_0 V turbo3).
Overall-Somewhere760@reddit
Why q8 for k ?
Porespellar@reddit
Just give Milla Jovovich some more time to get it coded.
superdariom@reddit
I'm running the ROCm version, which also includes triattention, and it is working really well. I'm using qwen 3.5 Q5 XL with over 100,000 context on a 24gb card. I think with some more upstream bug fixes even larger context may be possible. It's my daily agentic driver and I have not seen any problems at all. Wish it were merged upstream.
Blizado@reddit
I think even 2 weeks is too little time to see a proper implementation. It will take some more weeks until we get out of the experimental phase. I would guess there is still a lot of optimization left to do.
FrogsJumpFromPussy@reddit
TokForge, a new Android app built for both GGUF and MNN formats, has an experimental implementation of TurboQuant that fixes the Gemma E4B.mnn long context window. It seems to work with a 32,000 context window, but I haven't tested all of it yet.
jacek2023@reddit
Another day, another discussion about TurboQuant in llama.cpp
AdamDhahabi@reddit
For two weeks now we've had TurboQuant-like KV cache improvements: https://www.reddit.com/r/LocalLLaMA/comments/1s9nri7/attnrot_turboquantlike_kv_cache_trick_lands_in/
But now I realize there is more to come, as someone just posted here in the comments: https://github.com/ggml-org/llama.cpp/discussions/20969
norofbfg@reddit
most of what landed in llama.cpp looks experimental, so it has not really translated into stable usage yet
qwen_next_gguf_when@reddit
A bunch of people are validating their own implementations, and nothing is confirmed in the mainstream yet.