Will Google TurboQuant help people with low end hardware?

Old-Hovercraft6504@reddit

I’d still expect latency to be the real bottleneck for many people, even if memory usage improves.

aibasedtoolscreator@reddit

I have implemented the TurboQuant research paper: https://github.com/kumar045/turboquant_implementation It lets you run LLMs with massive context lengths without a high-end GPU machine.

MahabharataHindi@reddit

I recommend reading this blog post; it's very descriptive and explains everything in plain language. Link: https://quantg.in/google-turboquant-explained/

cutebluedragongirl@reddit

There is no escape... Hardware is too expensive...

oatmealcraving@reddit

Software is cheap though: [https://archive.org/details/sw-net-16-b](https://archive.org/details/sw-net-16-b)

oatmealcraving@reddit

Presumably they are using the fast Walsh-Hadamard transform? Or did they say at all in the paper? [https://archive.org/details/whtebook-archive](https://archive.org/details/whtebook-archive) The fast WHT is self-inverse, so you can swap back and forth between the two weight spaces very easily. If the weights are highly structured you may also need a fixed pattern of random sign flips, but that seems unlikely.
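For anyone who hasn't met the transform, here is a minimal Python sketch of a fast Walsh-Hadamard transform with an optional fixed pattern of random sign flips. It only illustrates the idea in this comment and is not TurboQuant's actual code; the names `fwht`, `make_signs`, `rotate`, and `unrotate` and the seed are made up for the example. The 1/sqrt(n) scaling is what makes the transform its own inverse.

```python
import math
import random


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform of a power-of-two-length list."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Butterfly step: combine pairs that sit h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # orthonormal scaling, so fwht(fwht(x)) == x
    return [v * scale for v in out]


def make_signs(n, seed=0):
    """Fixed (seeded) pattern of +1/-1 flips; the seed is an arbitrary choice."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def rotate(values, signs):
    """Flip signs, then transform."""
    return fwht([s * v for s, v in zip(signs, values)])


def unrotate(values, signs):
    """Transform back, then undo the sign flips."""
    return [s * v for s, v in zip(signs, fwht(values))]


x = [3.0, -1.0, 0.5, 2.0, 0.0, 4.0, -2.5, 1.0]
signs = make_signs(len(x))
restored = unrotate(rotate(x, signs), signs)
print(max(abs(a - b) for a, b in zip(x, restored)))  # tiny floating-point error
```

Because both the sign flips and the scaled transform undo themselves, going back needs no separate inverse matrix, just the same two steps in reverse order.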

mr_zerolith@reddit

No, it will only help you out with RAM. Do know that more context = more GPU grunt needed, and larger model = more GPU grunt needed. If your hardware is very fast but doesn't have enough memory (most NVIDIA consumer hardware), you'll have a good time.

jestr1000@reddit

Can this reduce the price of long context prompting? aka 256k+? Any idea by how much?

ML-Future@reddit

TurboQuant can only compress context memory (the KV cache); the model weights stay the same size, but this will help you fit a larger context.

Ryan_Blue_Steele@reddit (OP)

So it would be possible for someone to make it run on an RTX 3050 with 6GB of VRAM and 32GB of RAM?

TripleSecretSquirrel@reddit

So you can definitely run some models on your hardware now without TurboQuant; you’re just limited to pretty small models and/or aggressive quantization (which trades precision for smaller size and more speed; for most of us, that’s necessary). As the person above wrote (but didn’t really explain), TurboQuant only compresses the KV cache, not the model weights themselves. So the models themselves are still very, very big at this point. The KV cache is just kind of the short-term memory of the model. So it’s really cool and helpful that we can compress the KV cache that much, and it makes our models a little smarter because they can remember more context, but in most cases it probably doesn’t mean you can run a bigger model than you could before, unless you were right on the edge of it fitting.

SirApprehensive7573@reddit

No. Read the comment again.

suprjami@reddit

Do the math. TQ gives about a 4.9x to 5x reduction in VRAM usage for KV cache. Qwen 3.5 27B with 128k context will reduce KV cache from 16 GiB to ~3.25 GiB. That will make the Unsloth Dynamic Q5 quant (20.5 GiB) fit on a single 24G card with good reasoning support. You currently need a whole second card for context to achieve this. That's effectively free ~12.75 GiB VRAM. That seems like a huge difference to me. The smaller Qwen 3.5 models and the 35B MoE use half the RAM for context, so the same saving applies there, but halved.

EffectiveCeilingFan@reddit

You're off by a factor of 2. 128k context on Qwen3.5 27B is only 8GiB.

```sh
$ llama-server --host 0.0.0.0 --port 8080 \
    -fit on -fa on -np 1 \
    --no-mmap -dev Vulkan1,Vulkan2 \
    -c 131072 \
    -m bartowski__Qwen_Qwen3.5-27B-GGUF/Qwen_Qwen3.5-27B-Q4_K_M.gguf
[snip]
llama_kv_cache: size = 8192.00 MiB (131072 cells, 16 layers, 1/1 seqs), K (f16): 4096.00 MiB, V (f16): 4096.00 MiB
```

For the 3.5-bit TQ, that shrinks the KV cache down to 1.75GiB; 6.25GiB saved. I also wouldn't call a 24G card a "low-end device" lol; that's at least a 3090, right?
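The arithmetic behind those numbers can be checked with a rough Python sketch. The `kv_width` of 1024 values per token per layer is not stated anywhere in the thread; it is back-derived from the log line (4096 MiB of f16 K spread over 131072 cells and 16 layers), and the function name `kv_cache_mib` is made up for the example. The 3.5-bit figure assumes a flat 3.5 bits per value with no extra per-block overhead.

```python
def kv_cache_mib(n_cells, n_layers, kv_width, bits_per_value):
    """Approximate size in MiB of the K and V caches combined."""
    # Factor of 2 because K and V are stored separately with the same shape.
    total_bits = 2 * n_cells * n_layers * kv_width * bits_per_value
    return total_bits / 8 / 2**20


# Figures taken from the llama_kv_cache log line above.
N_CELLS = 131072
N_LAYERS = 16
KV_WIDTH = 1024  # assumption: back-derived from "K (f16): 4096.00 MiB"

f16_mib = kv_cache_mib(N_CELLS, N_LAYERS, KV_WIDTH, 16)
tq_mib = kv_cache_mib(N_CELLS, N_LAYERS, KV_WIDTH, 3.5)

print(f"f16 cache:     {f16_mib / 1024:.2f} GiB")
print(f"3.5-bit cache: {tq_mib / 1024:.2f} GiB")
print(f"saved:         {(f16_mib - tq_mib) / 1024:.2f} GiB")
```

Doubling `N_CELLS` to 262144 gives the 16 GiB figure quoted earlier in the thread for 256k context.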

suprjami@reddit

You're right, I was thinking of 256k. My mistake.

Ryan_Blue_Steele@reddit (OP)

Would it even be possible to lower it to under 6 GiB VRAM?

ForsookComparison@reddit

If you can run a model today you will be able to pick a less-quantized version of it or opt to use it with more context. It will **not** unlock any new models for anyone's hardware.
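A back-of-the-envelope fit check makes this concrete. This is only a sketch using numbers quoted elsewhere in the thread (20.5 GiB for the Q5 weights, 8 GiB for the f16 KV cache at 128k, 1.75 GiB at 3.5 bits); the function name `fits_in_vram` is made up, it treats "24G" and "6GB" as GiB for simplicity, and it ignores compute buffers and other runtime overhead unless you pass `overhead_gib`.

```python
def fits_in_vram(vram_gib, weights_gib, kv_cache_gib, overhead_gib=0.0):
    """True if weights plus KV cache (plus optional overhead) fit in VRAM."""
    return weights_gib + kv_cache_gib + overhead_gib <= vram_gib


WEIGHTS_GIB = 20.5   # Q5 quant size quoted in the thread
KV_F16_GIB = 8.0     # f16 KV cache at 128k context, from the llama.cpp log
KV_TQ_GIB = 1.75     # same cache at 3.5 bits per value

# 24 GiB card: compressing the cache is the difference between fitting or not.
print(fits_in_vram(24, WEIGHTS_GIB, KV_F16_GIB))  # False
print(fits_in_vram(24, WEIGHTS_GIB, KV_TQ_GIB))   # True

# 6 GiB card: the weights alone do not fit, even with zero context.
print(fits_in_vram(6, WEIGHTS_GIB, 0.0))          # False
```

The last line is the point being made here: shrinking the cache never helps if the weights by themselves are already larger than the card.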

ttkciar@reddit

Yes, but perhaps not as much as you expect. TurboQuant only reduces the KV cache's memory consumption. I say "only" but that can mean a difference of gigabytes, and give you much longer in-VRAM context. It does nothing to reduce the size of the model weights, but whatever VRAM you have left after loading the weights will accommodate much more context. The main differences between TurboQuant and quantizing your K and V caches to q4 are that TurboQuant will squeeze a little more space out of it than q4, and unlike traditional quantization TurboQuant is **lossless.** Your inference quality should not diminish at all using TurboQuant.

EffectiveCeilingFan@reddit

No. It can be used to get more accurate quantized KV cache performance. However, on low end devices, running long context is undesirable. Not only do low-end models lack performance at longer context (like, >16k), but long-context prompt-processing on a weak device is just going to be awful.

dkeiz@reddit

Nope. Small models already fit on current hardware and get bloated with large context. Large models still require lots of memory. Qwen3.5 is somewhere in between, and it's already good with context as it is. We need more capable models; it's just that the basic requirement for them is a Ryzen with 128GB of shared RAM.

H_DANILO@reddit

No, most likely Emgran will

sunshinecheung@reddit

Maybe this one could https://preview.redd.it/4mict0nsrgsg1.jpeg?width=1156&format=pjpg&auto=webp&s=80e839a8fd90e6397fe007305ed836cebb106023

Tyme4Trouble@reddit

It might help you run models with larger context windows, but it doesn’t make the model's weights smaller. It just compresses the KV cache from 16 bits to 3-4 bits per value, with low overhead and little quality loss.
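To give a feel for what "16 bits down to 3-4 bits" means, here is a toy Python sketch of plain uniform 4-bit quantization of a short vector. To be clear, this is not the TurboQuant algorithm, just the simplest possible low-bit quantizer; the function names and the sample values are made up for illustration.

```python
def quantize(values, bits):
    """Map floats to integer codes in [0, 2**bits - 1] using a uniform grid."""
    levels = 2**bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Recover approximate floats from the integer codes."""
    return [lo + c * scale for c in codes]


vector = [0.12, -0.80, 0.33, 0.95, -0.41, 0.07, 0.58, -0.66]
codes, lo, scale = quantize(vector, bits=4)
approx = dequantize(codes, lo, scale)

worst = max(abs(a - b) for a, b in zip(vector, approx))
print(codes)
print(f"worst-case error: {worst:.4f} (at most half a step, {scale / 2:.4f})")
```

Each value now needs 4 bits instead of 16, plus a small shared overhead for `lo` and `scale`; the rounding error is the quality cost that fancier schemes try to shrink.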
