Is Turboquant really a game changer?
Posted by Interesting-Print366@reddit | LocalLLaMA | View on Reddit | 67 comments
I am currently using the qwen3.5 and Gemma 4 models.
I realized Gemma 4 requires 2x the RAM for the same context length.
As far as I understand, what TurboQuant gives you is quantizing the KV cache to about 4 bit while minimizing the losses.
But Q8 still doesn't lose that much context, so isn't the KV cache RAM for qwen 3.5 at Q8 and Gemma 4 with TurboQuant the same?
Is TurboQuant also applicable to qwen's cache architecture? Because as far as I know, they didn't test it on the qwen3.5-style KV cache in their paper.
Just curious, I started to learn local LLM recently
Finguili@reddit
Actually, Gemma is more memory-efficient compared to Qwen (31B vs 27B). Gemma has a 2x larger head dimension for global attention layers, the same number of heads, but fewer global attention layers (10 vs 16), and V is the same as K, so there is no need to store it. However, I suspect llama.cpp doesn’t support this right now and does store V, hence the 2x higher usage. A full context for Gemma in an optimised implementation should take around 10 GiB + ~800 MiB for local SWA, while for Qwen it’s ~16 GiB for global + some constant memory for gated DeltaNet layers (I think it was smaller than what Gemma uses for SWA).
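A rough back-of-envelope, treating the layer counts, head counts, and head dims here as assumptions for illustration (not verified specs for either model):

```python
# Rough KV-cache size estimate for the global attention layers only.
# All architecture numbers below are illustrative assumptions,
# not verified specs for Gemma 4 or Qwen 3.5.

def kv_cache_bytes(layers, heads, head_dim, ctx, bytes_per_elem=2, store_v=True):
    """Bytes needed for K (and optionally V) across all global layers."""
    per_tensor = layers * heads * head_dim * ctx * bytes_per_elem
    return per_tensor * (2 if store_v else 1)

ctx = 128 * 1024  # 128k context, f16 cache (2 bytes/elem)

# Hypothetical Gemma-like: fewer global layers, 2x head dim, K == V so skip V
gemma_like = kv_cache_bytes(layers=10, heads=16, head_dim=256, ctx=ctx, store_v=False)
# Hypothetical Qwen-like: more global layers, smaller head dim, separate K and V
qwen_like = kv_cache_bytes(layers=16, heads=16, head_dim=128, ctx=ctx, store_v=True)

print(f"gemma-like: {gemma_like / 2**30:.1f} GiB")  # → 10.0 GiB, K only
print(f"qwen-like:  {qwen_like / 2**30:.1f} GiB")   # → 16.0 GiB, K + V
```

If an implementation stores V anyway despite K == V, the gemma-like figure doubles to 20 GiB, which is the 2x usage being discussed.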
Also, it may be worth using -np 1 to avoid allocating SWA for additional slots (unless you need them).
Witty_Mycologist_995@reddit
what's the current pull fixing this?
GoodTip7897@reddit
I couldn't find any pr... If that comment is right then someone should create an issue at least.
Apprehensive_Ad784@reddit
happy cake day
Witty_Mycologist_995@reddit
happy cake day
Samurai2107@reddit
But Gemma is too big. E4B at q8 can't compete with qwen 3.5 27b at q4, and Gemma 4 31b dense, I mean, needs q3 at most to fit in 16GB VRAM, which means a lot of precision loss (if someone handled it better, please say so).
MainFunctions@reddit
Ah yes. I know some of these words.
dampflokfreund@reddit
Turbo Quants are hype. So far the benchmarks suggest lower quality than even q4_0, which makes sense considering it's 3 bit. It's not the lossless quantization Google made it out to be, like tq3_0 being on par with q8_0, far from it. There are a ton of vibe-coded forks of llama.cpp right now, some more involved than others, but not a single one has convinced the legends like ggerganov or ikawrakow that turbo quants are better than what we have right now for KV quantization.
kidflashonnikes@reddit
This is absolutely false. The paper uses 2.5 and 3.5 bit for compression. They use a two-part algorithm to do the quantization of the KV cache and use 32 channels to average out the distortion rate, effectively reducing all loss of accuracy. This guy has no idea at all. It’s not hype at all - I work at one of the largest AI labs in the world and we are actually using this godsend of research from Google.
jtjstock@reddit
If it’s not hype, then we’re all in for a long wait for a correct implementation.
kidflashonnikes@reddit
This guy has no idea what he’s talking about. Let me be clear - before the Google paper, anything less than 8-bit quantization for the KV cache was a fever dream. Google absolutely cooked. 4-bit quantization is now possible for the KV cache - something not even conceivable until this paper came out. Before the paper, anything else that was close, such as PolarQuant, still had accuracy loss. Google 100% just pushed the limits and it’s not theoretical at all. It will take time to implement, but it’s real and it works.
relmny@reddit
Honest question (I don't have much idea about this): how do you know "it's real and works"? Is your implementation successful in reducing KV cache memory requirements while being lossless?
kidflashonnikes@reddit
Yes. In the Google paper, they actually quantized the KV cache to 2.5 and 3.5 bits, because they used 32 channels and averaged out the channels. They did this by using a two-part algorithm. We implemented the research for our own internal inference engine, and we tested it and it worked as TurboQuant claimed. All you have to do is take the two algorithms, put them together the exact way Google implemented them, and tailor it to an inference engine, and you have a TurboQuant feature for the KV cache.
I want to be clear - at the AI company that I work for, millions of people use our products every day. We have people in the math area who did it within 24 hours of the research results being published. I can tell you this - it is the best KV cache quant out there. We will absolutely be using it for our pro subscription users soon; we just need time to test the scale at which it can be used. Anyone who tells you otherwise is 100% wrong, and all labs are already switching over to it, to some degree.
hwertz10@reddit
I read a description of how it worked, and Google showed 6:1 compression (and 1/6th the time taken to run) with a version that straight up has no error compared to the original; the quantization caused intermittent 1-bit errors, and then they had a correction table to correct those values out and retain full fidelity of the original. As you say, this will be huge.
As for the current implementations? I have no idea; if it's not working well, it's not implemented correctly yet.
llama-impersonator@reddit
my dad is the head of nintendo and nuh uh
FullOf_Bad_Ideas@reddit
exllamav2 and exllamav3 don't exist.
Those projects have had reasonably good 4-bit KV cache quantization for years now, and people have been using it on a regular basis.
If your claim about your employer is true and that's also what they think, they should come and hang out at localllama more often.
TurboQuant has significant accuracy loss unless you look at metrics valuable for vector storage.
We would already see those great implementations; it's been some time. The TurboQuant paper came out 342 days ago and the blog post came out 12 days ago.
jtjstock@reddit
Waiting for an implementation that isn’t worse than q4_0.
MoffKalast@reddit
Make wild claims without releasing any code.
Claim all implementations are incorrect when they underperform your wild claims.
Pretend to be the only genius who can do it right.
Profit, somehow, probably.
a_beautiful_rhind@reddit
The profit is in making people reinvent the wheel and question their inference engines. How much effort was put into this vs implementing hadamard in llama.cpp and calling it a day?
jtjstock@reddit
Well, I trust ggerganov more than Claude :)
a_beautiful_rhind@reddit
Damage kinda done. Now Q8 is "bad" over .0001 KLD difference. Meanwhile gemma4 seems completely cooked while people hardly notice.
Natrimo@reddit
What's this about Gemma 4? I find the smaller models do a good job.
jtjstock@reddit
People were hyping it as amazing on llama even while there were known issues running it on llama that precluded it from being amazing.
Need to wait for things to finish settling. It’s easy to get swept up in the initial hype, the sober view comes later after sustained use and inference issues being resolved…
FastDecode1@reddit
I think a lot of people here are just posers and are fucking lying about running anything locally.
What they actually do is go over to the model developer's hosting platform, spend five minutes screwing around with the models at 10,000 tps, and then come here to declare how amazing the models are to run locally.
a_beautiful_rhind@reddit
So far seems broken in all the local engines I tried.
EbbNorth7735@reddit
Gemma4 just came out. I'd expect it to be broken for a few weeks.
I'm still not convinced qwen3.5 works in Llama server and the swapping feature is definitely borked.
jtjstock@reddit
The hype train never stops pulling into new stations and the YT needs new content every 10 seconds
No_Algae1753@reddit
Which techniques do we currently have implemented? What settings would you recommend, then? And also, is it possible that the current implementations are just not good enough?
jtjstock@reddit
Current techniques: use a llama that does Hadamard on the q8_0 K cache; ik llama has had this for a while, and mainline llama is adding it - I think it's been merged? Not sure, very recent PR for it. The TurboQuant forks also have this, fyi. For the V cache, you can use q4_0, as the V cache isn't as sensitive to quantization; mixing the two has a performance penalty though. Best performance is matching K and V cache, but you should not do q4_0 for the K cache, as the quality degradation is going to hurt more than a smaller context.
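The Hadamard idea in a toy sketch (not llama.cpp's actual code, all numbers hypothetical): rotating the K vector with an orthonormal Hadamard matrix spreads outlier channels across the whole vector, so a single per-block scale clips less, and because the rotation is orthonormal you can rotate back exactly after dequantizing.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal, so the rotation is exactly invertible

def quant_dequant_q4(x):
    """Toy symmetric 4-bit quantization with one scale per vector."""
    scale = np.abs(x).max() / 7.0
    q = np.clip(np.round(x / scale), -8, 7)
    return q * scale

rng = np.random.default_rng(0)
k = rng.normal(size=64)
k[3] = 25.0  # a single outlier channel blows up the quantization scale

H = hadamard(64)
plain_err = np.abs(quant_dequant_q4(k) - k).mean()
restored = H.T @ quant_dequant_q4(H @ k)  # rotate, quantize, rotate back
rot_err = np.abs(restored - k).mean()

print(f"plain q4 error:   {plain_err:.3f}")
print(f"rotated q4 error: {rot_err:.3f}")  # noticeably smaller here
```

Real implementations do this blockwise with fast Hadamard transforms rather than a dense matmul, but the accuracy effect is the same.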
jtjstock@reddit
Qwen 3.5 and Gemma 4 are both model families, there are different variants of each, some use more or less memory than others. An MOE model will use a lot less than a dense one of similar size.
Interesting-Print366@reddit (OP)
I identified that gemma4 31b requires about 10GB more RAM than qwen3.5 27b when running with the same context length. Could you possibly let me know how to resolve this? I am using llama.cpp.
llama-impersonator@reddit
try -ctxcp 4, 8, etc
Mr_Moonsilver@reddit
Can't resolve it. Qwen has a hybrid architecture with mamba layers, which makes it much more efficient compared to traditional architectures such as Gemma 4's.
Velocita84@reddit
No. Use at most Q8_0 if you don't want your llm's context understanding to drop off a cliff
EffectiveCeilingFan@reddit
I feel like I always see you under posts about TurboQuant, the profile picture is so distinctive lol. Honestly, most of the hype would die overnight if people actually read the paper IMO. I am shocked by how much I hear about TQ online relative to what I perceive as a pretty incremental paper.
Velocita84@reddit
You could say i'm getting successfully ragebaited every time
And-Bee@reddit
I thought the savings came from storing the difference between key values rather than full-precision values, hence no quality loss.
Velocita84@reddit
All PPL measurements I've seen between llama.cpp forks and the ik_llama.cpp discussion point to TQ being strictly worse than the existing Q4_0.
jtjstock@reddit
They have all pivoted to doing mixed, q8_0 k with tq for v.
FullOf_Bad_Ideas@reddit
and for V some implementations now try to just skip dequanting it, making tq somewhat irrelevant there.
spky-dev@reddit
No, use K @ Q8, V @ Q4, you only need the keys at higher quality, the values can be more truncated.
Velocita84@reddit
Going from Q8/Q8 to Q8/Q4 incurs a significant KLD increase. These numbers are from before KV rotation was merged into llama.cpp, so in reality all of these should be lower; I should probably measure them again.
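For anyone wondering what the KLD numbers mean: you compare the full-precision model's per-token distribution against the quantized-cache run's, position by position, and average. A minimal sketch of the metric itself (the logit arrays here are synthetic stand-ins, not real model output):

```python
import numpy as np

def kl_divergence(logits_ref, logits_test):
    """Mean KL(ref || test) over positions, computed from raw logits."""
    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)  # subtract max for stability
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)
    p, q = softmax(logits_ref), softmax(logits_test)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

rng = np.random.default_rng(1)
ref = rng.normal(size=(100, 32000))                    # stand-in fp16 baseline logits
noisy = ref + rng.normal(scale=0.05, size=ref.shape)   # stand-in quantized-cache run

print(f"KLD vs identical run: {kl_divergence(ref, ref):.6f}")  # exactly 0
print(f"KLD vs noisy run:     {kl_divergence(ref, noisy):.6f}")
```

A KLD near zero means the quantized cache barely changes what the model predicts; tools like llama.cpp's perplexity utility report this over a real text corpus.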
DefNattyBoii@reddit
Please do, there aren't enough resources and discussions about cache quants; it's mostly just "it will work".
Velocita84@reddit
I will probably do so either in about a week or when the last open TurboQuant PR (21089) gets merged/rejected; in the case that it's merged, I'll test it along with the normal quants.
-Ellary-@reddit
Just START with f16 and in the worst case switch to Q8/Q8, that's all you need to know. Don't go to Q8/Q4 because you will lose processing speed, about 2x. ALL models react differently to KV cache quants: some may work ok, some may break. Don't use it blindly. For most cases I just offload layers to CPU rather than go to Q8. When speed drops really low, then yeah, the choice is Q8 or nothing, and at that point Q8 is the better option.
DifficultSand3885@reddit
Turbo quant working great for me running llama 3.1 8b and qwen 3.5 9b with 32k context 👑 with q4_k_m quant
GroundbreakingMall54@reddit
gemma 4 eating 2x ram for same context is rough. turboquant helps but honestly the real game changer would be if google just released a more efficient architecture from the start instead of us having to band-aid it with quants
EffectiveCeilingFan@reddit
The Gemma 4 architecture, first off, uses 1/2 the cache memory of Qwen3.5 because the K and V are equal, literally just half as much data to store. Even before that, though, Gemma 4 also has fewer global attention layers than Qwen3.5 for the equivalent models. The implementations are all still incomplete or completely broken as far as I’m aware, possibly explaining why OP came to such an outlandish conclusion.
dampflokfreund@reddit
I think Gemma 4 is pretty efficient. Not as much as a RNN, but the sliding window attention works well. The neat thing about this architecture is that you can decide between context shifting and high context, whereas with Qwen you are stuck to no context shift. Disabling SWA increases memory consumption by a lot but context shifting is possible, you don't have that option with Qwen. Ideally though, they would implement an architecture that is both crazy efficient and allows for context shifting.
b1231227@reddit
It does save context space, but not as much as reported in the news: K (Q8_0) cannot be compressed further, and only V's quality is acceptable at Turbo4.
FullOf_Bad_Ideas@reddit
Not for Gemma 4 and Qwen 3.5 architectures since they have low exposure to TurboQuant due to aggressive linear / sliding window attention in their architectures.
For other architectures it's barely moving the needle
Ignore this, it'll probably die as a road to nowhere.
TurnUpThe4D3D3D3@reddit
@grok what do you think
aoleg77@reddit
Use SWA at BF16. That's how it's supposed to be used.
Ell2509@reddit
You are saying that you benchmarked TurboQuant and found it to halve performance?
CryptographerGood989@reddit
Before yesterday I was using qwen3.5-27b on 2 GPUs and it was eating 26.5GB of VRAM. I switched to gemma4-26b yesterday and it actually uses less, around 23.3GB. So in my case Gemma 4 eats less, not more. Ollama splits it automatically between an RTX 5070 Ti and an RTX 3060 12GB.
Running it non-stop on my home pc, even at night the thing keeps working
Fluffywings@reddit
Gemma 4 26B is MoE vs Qwen3.5 27B is dense so they typically should not be directly compared.
def_not_jose@reddit
You are comparing a full-fat 27b dense model to a harebrained a4b. Gemma 4 31b dense is a whole other beast.
CryptographerGood989@reddit
yeah fair point, no argument here =) but gemma 4 release was perfect timing for me, freed up just enough vram for kv cache. with 28gb total thats a big deal
murkomarko@reddit
nah, its all hype
gigaflops_@reddit
In a local LLM on one GPU serving one user, it's not as big of a deal because the kv cache uses up a relatively small amount of memory as compared to the model weights. For any particular model on any given machine, rarely will it be unusable at 32K context and speed up enough to suddenly become usable at 4K context.
The math works differently when you have a GPU cluster serving hundreds of requests concurrently. The entire cluster only needs to store one copy of the model weights that can be used to serve everyone's request. KV cache on the other hand, every user has their own KV cache. The model weights may occupy 2 TB in memory, and each user's KV cache may only occupy 100 GB, but with 100 concurrent users, everybody's KV cache combined uses up 10 TB.
KV cache optimization matters more in data centers because KV cache is more of a burden in data centers. Most AI is still cloud-based, and that's why TurboQuant is a big deal - not because it's incredibly helpful for consumer/home LLMs.
This_Maintenance_834@reddit
The majority of local models concentrate in the ~30b parameter space. At 4-bit KV quant, TurboQuant can let 24GB graphics cards deal with meaningfully long context. So it is significant in the current hardware environment.
Pixer---@reddit
If they claim it’s lossless, they can serve that to free or low-paid tiers for more efficient inference.
adel_b@reddit
I have implemented TQ for vector search. The 8-bit is pretty good at keeping accuracy vs f32 while taking less space; the issue now is that dequant takes a lot of time. The speed is worse than f32, though yes, the quality is the same.
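For context on why 8-bit holds up for vector search: even plain per-vector int8 (a scale-based sketch, not TurboQuant's actual scheme) keeps cosine similarity very close to f32. The trade-off shows up exactly where mentioned: the extra dequant pass per query.

```python
import numpy as np

def quantize_i8(v):
    """Per-vector symmetric int8: store int8 codes plus one f32 scale."""
    scale = np.abs(v).max() / 127.0
    return np.clip(np.round(v / scale), -127, 127).astype(np.int8), scale

def dequantize_i8(q, scale):
    return q.astype(np.float32) * scale  # this extra pass is the speed cost

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(2)
a = rng.normal(size=384).astype(np.float32)  # hypothetical embedding dim
b = rng.normal(size=384).astype(np.float32)

qa, sa = quantize_i8(a)
qb, sb = quantize_i8(b)
print(f"f32 cosine:  {cosine(a, b):.4f}")
print(f"int8 cosine: {cosine(dequantize_i8(qa, sa), dequantize_i8(qb, sb)):.4f}")
```

The storage win is 4x (1 byte vs 4 per element, plus one scale per vector); avoiding the dequant by doing integer dot products directly is the usual way to get the speed back.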
sjoerdmaessen@reddit
Huge in my case, went from 82k context with 1 process to 2 parallel 128k context processes because of it.
Daemontatox@reddit
Nope , just hype
spky-dev@reddit
Not huge, but still useful. Newer models use hybrid attention, so their KV caches are already relatively small compared to older architectures.
https://huggingface.co/blog/jlopez-dl/hybrid-attention-game-changer