Is Turboquant really a game changer?
Posted by Interesting-Print366@reddit | LocalLLaMA | View on Reddit | 67 comments
I am currently using the qwen3.5 and Gemma 4 models.
I realized Gemma 4 requires 2x the RAM for the same context length.
As far as I understand, what TurboQuant gives you is quantizing the KV cache to about 4 bit while minimizing the losses.
But Q8 still doesn't lose that much context, so isn't the KV cache RAM for qwen 3.5 at Q8 and Gemma 4 with TurboQuant the same?
Is TurboQuant also applicable to qwen's cache architecture? Because as far as I know, they didn't test it on the qwen3.5-style KV cache in their paper.
Just curious, I started to learn local LLM recently
Finguili@reddit
Actually, Gemma is more memory-efficient compared to Qwen (31B vs 27B). Gemma has a 2x larger head dimension for global attention layers, the same number of heads, but fewer global attention layers (10 vs 16), and V is the same as K, so there is no need to store it. However, I suspect llama.cpp doesn’t support this right now and does store V, hence the 2x higher usage. A full context for Gemma in an optimised implementation should take around 10 GiB + ~800 MiB for local SWA, while for Qwen it’s ~16 GiB for global + some constant memory for gated DeltaNet layers (I think it was smaller than what Gemma uses for SWA).
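A rough back-of-envelope, treating the layer counts, head counts, and head dims here as assumptions for illustration (not verified specs for either model):

```python
# Rough KV-cache size estimate for the global attention layers only.
# All architecture numbers below are illustrative assumptions,
# not verified specs for Gemma 4 or Qwen 3.5.

def kv_cache_bytes(layers, heads, head_dim, ctx, bytes_per_elem=2, store_v=True):
    """Bytes needed for K (and optionally V) across all global layers."""
    per_tensor = layers * heads * head_dim * ctx * bytes_per_elem
    return per_tensor * (2 if store_v else 1)

ctx = 128 * 1024  # 128k context, f16 cache (2 bytes/elem)

# Hypothetical Gemma-like: fewer global layers, 2x head dim, K == V so skip V
gemma_like = kv_cache_bytes(layers=10, heads=16, head_dim=256, ctx=ctx, store_v=False)
# Hypothetical Qwen-like: more global layers, smaller head dim, separate K and V
qwen_like = kv_cache_bytes(layers=16, heads=16, head_dim=128, ctx=ctx, store_v=True)

print(f"gemma-like: {gemma_like / 2**30:.1f} GiB")  # → 10.0 GiB, K only
print(f"qwen-like:  {qwen_like / 2**30:.1f} GiB")   # → 16.0 GiB, K + V
```

If an implementation stores V anyway despite K == V, the gemma-like figure doubles to 20 GiB, which is the 2x usage being discussed.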
Also, it may be worth using -np 1 to avoid allocating SWA for additional slots (unless you need them).
Witty_Mycologist_995@reddit
what's the current pull fixing this?
GoodTip7897@reddit
I couldn't find any pr... If that comment is right then someone should create an issue at least.
Apprehensive_Ad784@reddit
happy cake day
Witty_Mycologist_995@reddit
happy cake day
Samurai2107@reddit
But Gemma is too big. E4B at q8 can't compete with qwen 3.5 27b at q4, and Gemma 4 31b dense, I mean, needs q3 at most to fit in 16GB VRAM, which means a lot of precision loss (if someone handled it better, please say so).
MainFunctions@reddit
Ah yes. I know some of these words.
dampflokfreund@reddit
Turbo Quants are hype. So far the benchmarks suggest lower quality than even q4_0, which makes sense considering it's 3 bit. It's not the lossless quantization Google made it out to be, like tq3_0 being on par with q8_0, far from it. There are a ton of vibe-coded forks of llama.cpp right now, some more involved than others, but not a single one has convinced the legends like ggerganov or ikawrakow that turbo quants are better than what we have right now for KV quantization.
kidflashonnikes@reddit
This is absolutely false. The paper uses 2.5 and 3.5 bit for compression. They use a two-part algorithm to do the quantization of the KV cache and use 32 channels to average out the distortion rate, effectively reducing all loss of accuracy. This guy has no idea at all. It’s not hype at all - I work at one of the largest AI labs in the world and we are actually using this godsend of research from Google.
jtjstock@reddit
If it’s not hype, then we’re all in for a long wait for a correct implementation.
kidflashonnikes@reddit
This guy has no idea what he’s talking about. Let me be clear - before the Google paper, anything less than 8-bit quantization for the KV cache was a fever dream. Google absolutely cooked. 4-bit quantization is now possible for the KV cache - something not even conceivable until this paper came out. Before the paper, anything else that was close, such as PolarQuant, still had accuracy loss. Google 100% just pushed the limits and it’s not theoretical at all. It will take time to implement, but it’s real and it works.
relmny@reddit
Honest question (I don't have much idea about this): how do you know "it's real and works"? Is your implementation successful in reducing KV cache memory requirements while being lossless?
kidflashonnikes@reddit
Yes. In the Google paper, they actually quantized the KV cache to 2.5 and 3.5 bits, because they used 32 channels and averaged out the channels. They did this by using a two-part algorithm. We implemented the research for our own internal inference engine, and we tested it and it worked as TurboQuant claimed. All you have to do is take the two algorithms, put them together the exact way Google implemented them, and tailor it to an inference engine, and you have a TurboQuant feature for the KV cache.
I want to be clear - at the AI company that I work for, millions of people use our products every day. We have people in the math area who did it within 24 hours of the research results being published. I can tell you this - it is the best KV cache quant out there. We will absolutely be using it for our pro subscription users soon; we just need time to test the scale at which it can be used. Anyone who tells you otherwise is 100% wrong, and all labs are already switching over to it, to some degree.
hwertz10@reddit
I read a description of how it worked, and Google showed 6:1 compression (and 1/6th the time taken to run) with a version that straight up has no error compared to the original; the quantization caused intermittent 1-bit errors, and then they had a correction table to correct those values out and retain full fidelity of the original. As you say, this will be huge.
As for the current implementations? I have no idea; if it's not working well, it's not implemented correctly yet.
llama-impersonator@reddit
my dad is the head of nintendo and nuh uh
FullOf_Bad_Ideas@reddit
exllamav2 and exllamav3 don't exist.
Those projects have had reasonably good 4-bit KV cache quantization for years now, and people have been using it on a regular basis.
If your claim about your employer is true and that's also what they think, they should come and hang out at localllama more often.
TurboQuant has significant accuracy loss unless you look at metrics valuable for vector storage.
We would already see those great implementations; it's been some time. The TurboQuant paper came out 342 days ago and the blog post came out 12 days ago.
jtjstock@reddit
Waiting for an implementation that isn’t worse than q4_0.
MoffKalast@reddit
Make wild claims without releasing any code.
Claim all implementations are incorrect when they underperform your wild claims.
Pretend to be the only genius who can do it right.
Profit, somehow, probably.
a_beautiful_rhind@reddit
The profit is in making people reinvent the wheel and question their inference engines. How much effort was put into this vs implementing hadamard in llama.cpp and calling it a day?
jtjstock@reddit
Well, I trust ggerganov more than Claude :)
a_beautiful_rhind@reddit
Damage kinda done. Now Q8 is "bad" over .0001 KLD difference. Meanwhile gemma4 seems completely cooked while people hardly notice.
Natrimo@reddit
What's this about Gemma 4? I find the smaller models do a good job.
jtjstock@reddit
People were hyping it as amazing on llama even while there were known issues running it on llama that precluded it from being amazing.
Need to wait for things to finish settling. It’s easy to get swept up in the initial hype, the sober view comes later after sustained use and inference issues being resolved…
FastDecode1@reddit
I think a lot of people here are just posers and are fucking lying about running anything locally.
What they actually do is go over to the model developer's hosting platform, spend five minutes screwing around with the models at 10,000 tps, and then come here to declare how amazing the models are to run locally.
a_beautiful_rhind@reddit
So far seems broken in all the local engines I tried.
EbbNorth7735@reddit
Gemma4 just came out. I'd expect it to be broken for a few weeks.
I'm still not convinced qwen3.5 works in Llama server and the swapping feature is definitely borked.
jtjstock@reddit
The hype train never stops pulling into new stations and the YT needs new content every 10 seconds
No_Algae1753@reddit
Which techniques do we currently have implemented? What settings would you recommend, then? And also, is it possible that the current implementations are just not good enough?
jtjstock@reddit
Current techniques: use a llama that does Hadamard on the q8_0 K cache; ik llama has had this for a while, and mainline llama is adding it - I think it's been merged? Not sure, very recent PR for it. The TurboQuant forks also have this, fyi. For the V cache, you can use q4_0, as the V cache isn't as sensitive to quantization; mixing the two has a performance penalty though. Best performance is matching K and V cache, but you should not do q4_0 for the K cache, as the quality degradation is going to hurt more than a smaller context.
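The Hadamard idea in a toy sketch (not llama.cpp's actual code, all numbers hypothetical): rotating the K vector with an orthonormal Hadamard matrix spreads outlier channels across the whole vector, so a single per-block scale clips less, and because the rotation is orthonormal you can rotate back exactly after dequantizing.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal, so the rotation is exactly invertible

def quant_dequant_q4(x):
    """Toy symmetric 4-bit quantization with one scale per vector."""
    scale = np.abs(x).max() / 7.0
    q = np.clip(np.round(x / scale), -8, 7)
    return q * scale

rng = np.random.default_rng(0)
k = rng.normal(size=64)
k[3] = 25.0  # a single outlier channel blows up the quantization scale

H = hadamard(64)
plain_err = np.abs(quant_dequant_q4(k) - k).mean()
restored = H.T @ quant_dequant_q4(H @ k)  # rotate, quantize, rotate back
rot_err = np.abs(restored - k).mean()

print(f"plain q4 error:   {plain_err:.3f}")
print(f"rotated q4 error: {rot_err:.3f}")  # noticeably smaller here
```

Real implementations do this blockwise with fast Hadamard transforms rather than a dense matmul, but the accuracy effect is the same.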
jtjstock@reddit
Qwen 3.5 and Gemma 4 are both model families, there are different variants of each, some use more or less memory than others. An MOE model will use a lot less than a dense one of similar size.
Interesting-Print366@reddit (OP)
I identified that gemma4 31b requires about 10GB more RAM than qwen3.5 27b when running with the same context length. Could you possibly let me know how to resolve this? I am using llama.cpp.
llama-impersonator@reddit
try -ctxcp 4, 8, etc
Mr_Moonsilver@reddit
Can't resolve it. Qwen has a hybrid architecture with mamba layers, which makes it much more efficient compared to traditional architectures such as Gemma 4's.
Velocita84@reddit
No. Use at most Q8_0 if you don't want your llm's context understanding to drop off a cliff
EffectiveCeilingFan@reddit
I feel like I always see you under posts about TurboQuant, the profile picture is so distinctive lol. Honestly, most of the hype would die overnight if people actually read the paper IMO. I am shocked by how much I hear about TQ online relative to what I perceive as a pretty incremental paper.
Velocita84@reddit
You could say i'm getting successfully ragebaited every time
And-Bee@reddit
I thought the savings came from storing the difference between key values rather than full-precision values, hence no quality loss.
Velocita84@reddit
All PPL measurements I've seen between llama.cpp forks and the ik_llama.cpp discussion point to TQ being strictly worse than the existing Q4_0.
jtjstock@reddit
They have all pivoted to doing mixed, q8_0 k with tq for v.
FullOf_Bad_Ideas@reddit
and for V some implementations now try to just skip dequanting it, making tq somewhat irrelevant there.
spky-dev@reddit
No, use K @ Q8, V @ Q4, you only need the keys at higher quality, the values can be more truncated.
Velocita84@reddit
Going from Q8/Q8 to Q8/Q4 incurs a significant KLD increase. These numbers are from before KV rotation was merged into llama.cpp, so in reality all of these should be lower; I should probably measure them again.
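For anyone wondering what the KLD numbers mean: you compare the full-precision model's per-token distribution against the quantized-cache run's, position by position, and average. A minimal sketch of the metric itself (the logit arrays here are synthetic stand-ins, not real model output):

```python
import numpy as np

def kl_divergence(logits_ref, logits_test):
    """Mean KL(ref || test) over positions, computed from raw logits."""
    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)  # subtract max for stability
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)
    p, q = softmax(logits_ref), softmax(logits_test)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

rng = np.random.default_rng(1)
ref = rng.normal(size=(100, 32000))                    # stand-in fp16 baseline logits
noisy = ref + rng.normal(scale=0.05, size=ref.shape)   # stand-in quantized-cache run

print(f"KLD vs identical run: {kl_divergence(ref, ref):.6f}")  # exactly 0
print(f"KLD vs noisy run:     {kl_divergence(ref, noisy):.6f}")
```

A KLD near zero means the quantized cache barely changes what the model predicts; tools like llama.cpp's perplexity utility report this over a real text corpus.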
DefNattyBoii@reddit
Please do, there aren't enough resources and discussions about cache quants; it's mostly just "it will work".
Velocita84@reddit
I will probably do so either in about a week or when the last open TurboQuant PR (21089) gets merged/rejected; in the case that it's merged, I'll test it along with the normal quants.
-Ellary-@reddit
Just START with f16 and in the worst case switch to Q8/Q8, that's all you need to know. Don't go to Q8/Q4 because you will lose processing speed, about 2x. ALL models react differently to KV cache quants: some may work ok, some may break. Don't use it blindly. For most cases I just offload layers to CPU rather than go to Q8. When speed drops really low, then yeah, the choice is Q8 or nothing, and at that point Q8 is the better option.
DifficultSand3885@reddit
Turbo quant working great for me running llama 3.1 8b and qwen 3.5 9b with 32k context 👑 with q4_k_m quant
GroundbreakingMall54@reddit
gemma 4 eating 2x ram for same context is rough. turboquant helps but honestly the real game changer would be if google just released a more efficient architecture from the start instead of us having to band-aid it with quants
EffectiveCeilingFan@reddit
The Gemma 4 architecture, first off, uses 1/2 the cache memory of Qwen3.5 because the K and V are equal, literally just half as much data to store. Even before that, though, Gemma 4 also has fewer global attention layers than Qwen3.5 for the equivalent models. The implementations are all still incomplete or completely broken as far as I’m aware, possibly explaining why OP came to such an outlandish conclusion.
dampflokfreund@reddit
I think Gemma 4 is pretty efficient. Not as much as a RNN, but the sliding window attention works well. The neat thing about this architecture is that you can decide between context shifting and high context, whereas with Qwen you are stuck to no context shift. Disabling SWA increases memory consumption by a lot but context shifting is possible, you don't have that option with Qwen. Ideally though, they would implement an architecture that is both crazy efficient and allows for context shifting.
b1231227@reddit
It does save context space, but not as much as reported in the news: K (Q8_0) cannot be compressed further, and only V's quality is acceptable at Turbo4.
FullOf_Bad_Ideas@reddit
Not for Gemma 4 and Qwen 3.5 architectures since they have low exposure to TurboQuant due to aggressive linear / sliding window attention in their architectures.
For other architectures it's barely moving the needle
Ignore this, it'll probably die as a road to nowhere.
TurnUpThe4D3D3D3@reddit
@grok what do you think
aoleg77@reddit
Use SWA at BF16. That's how it's supposed to be used.
Ell2509@reddit
You are saying that you benchmarked TurboQuant and found it to halve performance?
CryptographerGood989@reddit
Before yesterday I was using qwen3.5-27b on 2 GPUs and it was eating 26.5GB of VRAM. I switched to gemma4-26b yesterday and it actually uses less, around 23.3GB. So in my case Gemma 4 eats less, not more. Ollama splits it automatically between an RTX 5070 Ti and an RTX 3060 12GB.
Running it non-stop on my home pc, even at night the thing keeps working
Fluffywings@reddit
Gemma 4 26B is MoE vs Qwen3.5 27B is dense so they typically should not be directly compared.
def_not_jose@reddit
You are comparing a full-fat 27b dense model to a harebrained a4b. Gemma 4 31b dense is a whole other beast.
CryptographerGood989@reddit
yeah fair point, no argument here =) but gemma 4 release was perfect timing for me, freed up just enough vram for kv cache. with 28gb total thats a big deal
murkomarko@reddit
nah, its all hype
gigaflops_@reddit
In a local LLM on one GPU serving one user, it's not as big of a deal because the kv cache uses up a relatively small amount of memory as compared to the model weights. For any particular model on any given machine, rarely will it be unusable at 32K context and speed up enough to suddenly become usable at 4K context.
The math works differently when you have a GPU cluster serving hundreds of requests concurrently. The entire cluster only needs to store one copy of the model weights that can be used to serve everyone's request. KV cache on the other hand, every user has their own KV cache. The model weights may occupy 2 TB in memory, and each user's KV cache may only occupy 100 GB, but with 100 concurrent users, everybody's KV cache combined uses up 10 TB.
KV cache optimization matters more in data centers because KV cache is more of a burden in data centers. Most AI is still cloud-based, and that's why TurboQuant is a big deal - not because it's incredibly helpful for consumer/home LLMs.
This_Maintenance_834@reddit
The majority of local models concentrate in the ~30b parameter space. At 4-bit KV quant, TurboQuant can let 24GB graphics cards deal with meaningfully long context. So it is significant in the current hardware environment.
Pixer---@reddit
If they claim it’s lossless, they can serve that to free or low-paid tiers for more efficient inference.
adel_b@reddit
I have implemented TQ for vector search. The 8-bit is pretty good at keeping accuracy vs f32 while taking less space; the issue now is that dequant takes a lot of time. The speed is worse than f32, though yes, the quality is the same.
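For context on why 8-bit holds up for vector search: even plain per-vector int8 (a scale-based sketch, not TurboQuant's actual scheme) keeps cosine similarity very close to f32. The trade-off shows up exactly where mentioned: the extra dequant pass per query.

```python
import numpy as np

def quantize_i8(v):
    """Per-vector symmetric int8: store int8 codes plus one f32 scale."""
    scale = np.abs(v).max() / 127.0
    return np.clip(np.round(v / scale), -127, 127).astype(np.int8), scale

def dequantize_i8(q, scale):
    return q.astype(np.float32) * scale  # this extra pass is the speed cost

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(2)
a = rng.normal(size=384).astype(np.float32)  # hypothetical embedding dim
b = rng.normal(size=384).astype(np.float32)

qa, sa = quantize_i8(a)
qb, sb = quantize_i8(b)
print(f"f32 cosine:  {cosine(a, b):.4f}")
print(f"int8 cosine: {cosine(dequantize_i8(qa, sa), dequantize_i8(qb, sb)):.4f}")
```

The storage win is 4x (1 byte vs 4 per element, plus one scale per vector); avoiding the dequant by doing integer dot products directly is the usual way to get the speed back.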
sjoerdmaessen@reddit
Huge in my case, went from 82k context with 1 process to 2 parallel 128k context processes because of it.
Daemontatox@reddit
Nope , just hype
spky-dev@reddit
Not huge, but still useful. Newer models use hybrid attention, so their KV caches are already relatively small compared to older architectures.
https://huggingface.co/blog/jlopez-dl/hybrid-attention-game-changer