About TurboQuant
Posted by Exact_Law_6489@reddit | LocalLLaMA | View on Reddit | 151 comments
I know it's been a while, but I'm trying to understand: is TurboQuant really revolutionary, or is it just another mediocre technology that has been overhyped by Google and Twitter?
ekryski@reddit
Turboquant as the paper is written is revolutionary but has flaws. The QJL bit kills speed. A bunch of us have implemented alternatives using some of the core concepts (PolarQuant is revolutionary) plus some additional speed ups.
Look at TheTom’s TurboQuant+ repo on github. Lots of good stuff in his papers.
I’ve worked on mlx swift implementations in collaboration with Tom heavily over the last couple weeks. We linked up on Twitter because we were both working on it, him in llama.cpp and me in swift mlx, and have been jamming since.
TurboQuant core concepts + Tom’s realization that asymmetrical and targeted KV compression + performance speed ups we’ve both done IS revolutionary and we’re going to post numbers within days that prove it. We’re just verifying benchmarks across multiple models right now so that we don’t speak too soon.
Local AI renaissance incoming!
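For readers unfamiliar with the asymmetric KV idea mentioned above, here's a minimal numpy sketch (my own illustration of the general trick, as used by methods like KIVI, not Tom's actual implementation): quantize keys per-channel and values per-token, since keys tend to have outlier channels while values are smoother row by row.

```python
import numpy as np

def asym_quant(x, axis, bits=4):
    # Asymmetric (min/max zero-point) quantization along one axis,
    # returned dequantized so we can measure the round-trip error.
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / (2**bits - 1), 1.0)
    q = np.round((x - lo) / scale)
    return q * scale + lo

rng = np.random.default_rng(0)
K = rng.standard_normal((1024, 128))   # [tokens, head_dim]
K[:, 7] *= 10.0                        # keys often have an outlier channel
V = rng.standard_normal((1024, 128))

# Per-channel for K isolates the outlier channel in its own scale;
# per-token for V matches its smoother per-row statistics.
k_err = np.abs(asym_quant(K, axis=0) - K).mean()       # per-channel
k_err_wrong = np.abs(asym_quant(K, axis=1) - K).mean() # per-token (worse for K)
v_err = np.abs(asym_quant(V, axis=1) - V).mean()       # per-token
```

Quantizing K along the wrong axis lets the outlier channel inflate every token's scale, which is the kind of error that compounds over long contexts.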
sonicnerd14@reddit
I hope your results really are correct. We need anything that will help give the memory cartel the shaft for once.
crantob@reddit
Cartels don't stop competition, only governments (and deep states) do.
ekryski@reddit
They are: https://x.com/no_stp_on_snek/status/2043519086697517538?s=20
stoppableDissolution@reddit
I might be wrong about your testing strategy, but if it's basically QA on low context/rag/codebase then it's imo kinda useless for validating the effect of kv quantization. In my experience, where models really suffer even with Q8 kv is long chats. Something about having 50-100+ turns rather than one big chunk of data makes it difficult for them.
I do want to believe that I'm wrong tho, lol.
ekryski@reddit
Speeds definitely degrade over longer context. This happens with all models regardless of KV cache quantization. However our primary benchmark has been summarization of actual books over contexts spanning from 128 to 256k tokens. The compression has now evolved so it actually improves speeds over longer contexts compared to baseline. Initially TurboQuant (especially with QJL bit correction) was waaaay worse, especially at contexts greater than 16k.
stoppableDissolution@reddit
I'm not talking about speed, but about how lossy it is. Slowdown because of cache size is perfectly expected.
Kv quantization seems to accumulate errors quite badly over time, and it seems to be more pronounced specifically when there is a lot of back and forth.
ekryski@reddit
Ah. Yup again, a little bit but not much. Was a big problem initially but this is where Tom's asymmetric KV quantization was such a genius discovery. After we implemented that, much less PPL and KLD loss at longer contexts.
But take that with a grain of salt. We mostly tested on summarization and NIAH and now have started to branch out beyond that for benches. So far looking good but still more benches to do before we feel comfortable saying "yup, it looks good!".
In reality, the real test comes from actual usage, so over the weekend we started getting all our quant and speed improvements into our harnesses so we can really start using them alongside benches to see how it feels.
CYTR_@reddit
We suspect there's a reduction in size 😅. But what would be interesting is the degree of comparative divergence at various quants. I seem to recall seeing recently that, TurboQuant or not, there was still a loss of accuracy in the responses, even with rotation. I might be mistaken!
kingharrison@reddit
Thank you for your service on this!
Is any of this applicable to RAG / KBs? Can any of it be used?
ekryski@reddit
In theory the compression can be applied to any (Q)KV vector, but we haven't tried RAG, LoRAs, vector search or anything so far. We started with the KV cache due to the paper and have now started dipping into model weights. You can compress anything; the question is really, "is it still usable after?". TBD.
RandomTrollface@reddit
How much better is turboquant really than the hadamard rotation currently in llama.cpp? If you look at these results: https://github.com/ggml-org/llama.cpp/pull/21089 it doesn't seem that much better.
ekryski@reddit
Read the whole PR thread. Lots in there. I comment there too. https://github.com/ggml-org/llama.cpp/pull/21089#issuecomment-4189580314
kyr0x0@reddit
Thx for the link. Do you have a quick link to your repo?
ethertype@reddit
I have been following the main discussion on github. Very interesting.
I hope you also publish benchmarks with math/code/recall etc., and not purely "proxy" values (PPL/KLD).
Kerem-6030@reddit
isn't it the same as the KV cache quant in LM Studio?
unjustifiably_angry@reddit
It's the equivalent of a scam phone call in more ways than one.
qwen_next_gguf_when@reddit
We will see after it is fully merged to llamacpp mainstream.
Exact_Law_6489@reddit (OP)
any PRs yet?
ekryski@reddit
Yes. They’re gonna take a while due to egos.
superSmitty9999@reddit
Wdym?
jacek2023@reddit
for some strange reason llama.cpp is not happy with AI slop
sonicnerd14@reddit
Of course, there's always some bureaucrat holding things back. I've been wondering what's taking so long to have TQ integrated into models or the engines at least.
crantob@reddit
Just as often there's a bureaucrat pushing tech that's not ready for prime-time onto a populace. Llama.cpp is big now and added features impose costs on the project as a whole, growing geometrically as more are added.
It's understandable that this is seldom understood. Many things are.
Why spend time on turboquant (+20-30% speed) instead of RotorQuant: https://news.qq.com/rain/a/20260327A0220A00
I can't answer the question.
sonicnerd14@reddit
In this case it's not a time-scale problem. It's a knowledge-gap issue. I understand the part about the overwhelming amount of requests for such a big project. That's just noise though. If that's what's holding them back from actually beta testing things, then that's a poor reason.
Why focus on just one or the other when both can be used together for an advantage? Instead of nitpicking metrics, sometimes hybridization is a better option than inaction.
ekryski@reddit
Not all PRs done with AI are slop. It’s an AI project for Christ’s sake! You’d think they’d be more open to AI generated code. 🫠
I left a comment on one of the first TurboQuant PRs (which is a good PR, I didn’t write it and don’t know the guy). This was over a week ago now. Georgi (maintainer) called it shit without any constructive feedback. As someone who’s done a lot of open source for over a decade that’s not how you get good contributors to keep coming back. I empathize with maintainers (I’ve been one for years), there’s a lot of garbage out there. But imho you invest in tooling, testing, security and good release versioning and change logs. Then ship often and quickly merging PRs that have potential into a canary branch. This is not happening and llama.cpp (and its users) are now falling behind.
jacek2023@reddit
could you give a link to the discussion?
ekryski@reddit
Two good ones:
The PR I'm referring to. Maintainer kept it minimal on purpose and followed guidelines very well. https://github.com/ggml-org/llama.cpp/pull/21089
Longer discussion with lots of Tom's views. https://github.com/ggml-org/llama.cpp/discussions/20969#discussioncomment-16483708
jacek2023@reddit
I saw this PR, it's not closed, but I remember this comment:
"So, half the speed, but negligible quality difference? Is it even against latest master?"
and then this:
"In any case, if you just look at the numbers from the OP the `tbq4_0` type isn't even really better than the existing naive `q4_0` type. That alone is I think enough of an argument as to why we should not invest maintainer time here."
So what's the actual speedup here?
ekryski@reddit
Did you see my comment in that thread though? Of course it's not gonna be faster. The OP did it as the bare minimum on CPU to try and get the ball rolling. It's not gonna be super fast on CPU!
He also posts benches against the rotation algo in main. The point is, you ask for a minimally invasive PR, guy does it and acknowledges it's a start and it's not gonna be fast, gets shit on, comes back with improvements and benches to show it's still an improvement, and then crickets...
jacek2023@reddit
Is there a branch somewhere that shows a significant speedup?
Velocita84@reddit
Turboquant is just not worth it. It's not significantly better than Q4_0. Even IK disregarded it.
somerussianbear@reddit
Exactly. An AI project that doesn’t accept the current state of AI (which is that AI CAN write good quality code if properly guided).
draconic_tongue@reddit
I don't think this is correct. If you check their core members, half of their PRs, if not more, have "YES" in the AI usage disclaimer. They are only discriminating against fully AI-written contributions, because there are too many dogshit ones where the potential time investment isn't worth it. All they're literally asking is that you clean up your PR and explain it in a way that carries your intent.
Exact_Law_6489@reddit (OP)
I mean, I read their Agents file. It's basically filled with stuff that tells AI "You cannot write code, only answer questions"
the_dadmin@reddit
I wonder if this is a copyright play because AI generated code doesn’t have protection.
a_beautiful_rhind@reddit
the "play" is getting slammed with 5000 untested PRs and not wanting to deal with that
jacek2023@reddit
also from my link: "In anticipation of the incoming flood of vibe generated PRs implementing TurboQuant,"
xeeff@reddit
turboquant hip triattention fork has been working great for me. using turboquant, i've been satisfied and my expectations were low but i keep it as default now
redblood252@reddit
Is it a drop-in replacement? Does it work as well for cuda gpus?
xeeff@reddit
yes. I use llama-swap so I only replaced it for Gemma 4 and qwen3.5 but it's a good fork so i'm thinking of hosting my own so I can have latest upstream as well
I remember reading tests on an RTX 3090 as well, so yes
redblood252@reddit
I’m currently using the upstream container image in kubernetes for inference, running qwen3.5 9b on 16gb vram with 128k context. I will see whether with the saved VRAM I can improve the speed or increase the model size.
xeeff@reddit
speed won't improve (I mean, maybe for full attention models? never noticed that personally and findings inside the repo say the same) but VRAM definitely
redblood252@reddit
Great so maybe larger size? Currently 9b q5. Will see for larger. Or maybe a gemma4 dense
xeeff@reddit
just had another look at the repo and saw this (supposedly -19% accuracy on CUDA whereas ROCm is 0% loss)
redblood252@reddit
I see. Would you happen to know of a fork for cuda?
xeeff@reddit
I misunderstood. that fork should be okay to use
xeeff@reddit
I wouldn't have much hope but let me know what you find
superdariom@reddit
Same here. Running bigger models with larger context
xeeff@reddit
yeah I noticed the VRAM savings are massive so I can run way bigger context and the default - np 4 - kvu is really nice cuz you can have parallel convos and you can run better quant. rotorquant apparently beats turboquant but i've yet to see anything for ROCm/HIP (gfx1030)
Betadoggo_@reddit
There have been at least 14 vibecoded prs closed so far. It's a graveyard.
turboquant_plus seems to be the repo with the most activity and mentions, but it contains so much llm generated text it feels like it's bordering on psychosis, though some claim it does work.
Here's the remaining turboquant pr: https://github.com/ggml-org/llama.cpp/pull/21089
It seems mostly dead since their results aren't any better than the regular q4 quant, indicating either a bad implementation or that the method isn't as strong as claimed.
Part of the method has already been merged and is the default for quanted kv: https://github.com/ggml-org/llama.cpp/pull/21038
draconic_tongue@reddit
I've seen posts from people using that turboquant+ impl saying good things. That combined with speculative decoding on the 26b model sounds pretty big. I will try it soon but I'm too lazy
AppealSame4367@reddit
Since Dflash was published and could 2x-4x inference speed, but needs more vram, turboquant will be necessary in combination with it. Add byteshape ggufs and dflash-like tech for cpus and we might run 20b models on average gaming laptops with 6-8 gb vram as a daily agentic driver in a few weeks or months.
ekryski@reddit
Agreed. Working on it in swift as we speak.
putrasherni@reddit
why swift ?
ekryski@reddit
Fastest. Compiler is optimized for Swift and C. No other dependencies. Plus can run on iOS devices.
crantob@reddit
Since I joined dependency anonymous I've been mostly dependency dry for 10 years now. I'd like to thank everyone who supports this for their support.
putrasherni@reddit
Makes sense. Do check out rotorquant, seems like an upgrade over turboquant
Look forward to what you build !
kyr0x0@reddit
macOS? MLX? Would make sense to merge it upstream to MLX-LM then :) I can implement it in oMLX afterwards.
ekryski@reddit
Yes, macOS and iOS. MLX + Swift. Saw someone post about having it working with mlx-lm already. Always open for collabs. Trying to push local AI forward as fast as possible.
loftybillows@reddit
music to my ears brother
shingkai@reddit
TurboQuant only compresses KV cache, the model weights themselves are unaffected right?
ekryski@reddit
Not anymore. We've done a flavour now where we're quantizing model weights too. Just tried it with minimax 2.7 and compressed it to 87GB with no real loss. https://x.com/no_stp_on_snek/status/2043519086697517538?s=20
Although it's disingenuous to call it TurboQuant anymore. TurboQuant by the paper is no good. We've swiss cheesed the shit out of everything. It's now a mash up of PolarQuant + Higgs-esque/RabbitQ + Asymmetric KV + boundary layer protection (sandwiching).
We're probably gonna call it something else.
FullOf_Bad_Ideas@reddit
why no ppl and kld?
Then you could compare to other quants and see if there's actually something special there.
fuckingredditman@reddit
to be fair there's a perplexity measure in the linked screenshot (4.604), but we don't know the dataset used nor the perplexity of the baseline model (couldn't find a chart for the GGUFs quickly either)
FullOf_Bad_Ideas@reddit
True, I didn't notice it.
But it's not useful if it's not compared against baseline on the same dataset at the same context length and measured with the same logic.
More likely than not, those quant attempts usually flop when you look at them critically and compare them to other methods.
ekryski@reddit
Completely agree! We'll share everything. Admittedly that was a bit of a teaser. The model is up but we're wanting to get more comprehensive benches and these answers (and some code cleaned up) before sharing more. That way other people can verify and hopefully benefit and try with other models.
DeepOrangeSky@reddit
Very cool. Although, I will admit to wondering how much of the lack of significant loss is to do with a big model like Minimax naturally being able to handle quite a bit of quantization before quality starts to significantly drop off. (a known tendency of really big models, from what I understand. Albeit maybe not to this degree)
For example this post in this thread was interesting. Their 3-bit 89GB quant of Minimax 2.7 got the same score on their benchmarks (actual benchmark exam tests, not ppl/kld, that is) as their 4-bit and 6-bit quants, apparently.
Do you think you might try testing the quality drop/lack thereof on like a 9b model, 12b model, 24b model, etc, to see if it holds up well on small models as well?
ekryski@reddit
The guy behind JANG quants is doing good stuff but his benches aren't as honest as I would like. He runs them once with thinking disabled and, if the model fails, tries again with thinking enabled. Essentially taking two stabs at the same question. Not sure if he's re-using anything from the first prompt in the second; regardless, it doesn't feel like a good test to me as it could skew evaluation.
I've tried previous JANG models and they haven't been as coherent as I've needed. Tom took a different quantization approach and we're seeing better coherence. If we had found models that were coherent enough and small enough we would have used them instead of doing our own quantization. 🤷♂️
ekryski@reddit
Smaller models will likely be hurt more by quantization. Not sure yet how well they'll hold up with the same method Tom tried on Minimax 2.7. We'll see this week! I would like to try Qwen3.5-27B and 9B personally.
DeepOrangeSky@reddit
Nice. Well, I am definitely excited to see how it goes. Sounds like a pretty interesting mixture of things you guys are trying out, although I am still way too big of a noob to understand most of the stuff yet (although I'm learning a lot from reading this sub). Well, good luck to you guys. Pretty fun to see the wide variety of different new things people keep trying every few days on here. Seems like it's just a law of averages that out of all the wild stuff people keep trying, some of them are going to end up being major breakthroughs, one way or another.
ekryski@reddit
100%. We're not dumb but a lot of the credit goes to the original authors (and our agents). We've just been combining things in the right way and repeatedly benching new iterations to find what works and what doesn't based on actual benchmarks and usage.
JayPSec@reddit
When you say no real loss, how much loss are we talking about? I've been doing some testing and this model seems very sensitive to quantization
ekryski@reddit
Tom will be a better person to answer because I don't know what the baseline PPL is across the test he ran with MiniMax. But post-quant it was 4.64 (in the image in the linked tweet). We'll post everything. We're running more benches for exactly this reason, but it's looking promising.
Not all benches are created equal.
BassNet@reddit
Does it require retraining? Got a GitHub? I’d like to apply it to other models
ekryski@reddit
No re-training or fine tuning needed. We're doing some clean up and more benchmarks. Will post model and source code when ready.
AppealSame4367@reddit
Excellent!
Memz_R_Dreamz@reddit
Oh man, I hope this makes local llm more accessible! Thanks! keep up the good work
tronathan@reddit
Thank you for the details; I hadn’t heard much since the splashy announcement, this is great. I used to feel like hanging out in locallama, I’d get more low level details, but that was back in the days of exllama2 and ggml.
Do these discussions evolve slowly over days and weeks in GitHub issues, or are they lost forever in discord, or some other such?
AppealSame4367@reddit
The most active people in llama cpp and other projects post here. They do evolve into GitHub issues.
chuan_l@reddit
" Turbo quant " adds random vector rotation ..
Meanwhile model weights are matrices of higher order tensors. Its not the same but people get it confused. The " kv cache " is the key - value pair of previously computed tokens. You have a lot more data in weights and quantisation reduces the bits ..
Colecoman1982@reddit
I get that, but wouldn't that still make a huge difference for situations where a server is serving many clients at one time? As far as I understand it, in that situation, you'd still only need one copy of the weights in memory but you'd need numerous different KV caches in memory at the same time. It might not be able to help simple one-man home users like me, but that still seems like a lot of VRAM potentially made redundant across industry...
fauxpasiii@reddit
Correct, and mostly just the V bit. Compressing the K vectors is possible, but quality drops off a cliff. Not clear to me whether that's a brute fact about the entire approach, or just a detail of the current implementation.
https://github.com/TheTom/turboquant_plus
vulgrin@reddit
Right. This video helped explain it to me: https://youtu.be/XLlQDfhyBjc
UnclaEnzo@reddit
I'm running 20b on the reg with 64gb dram and no gpu. No turbo or other quant massaging.
It does the same work, a little more slowly.
The turboq 20b model I have is 3x faster.
AppealSame4367@reddit
That sounds interesting, since i desperately want to run gemma 4 26b on my laptop. 6gb vram, 32gb ram, but so far it only gets to 4 tps, while qwen3.5 35b achieves 10-25 tps
What did you do apart from using turboquants?
UnclaEnzo@reddit
Nothing much really, but my expectations are not high; I'm generally happy with tps rates others would call failure because the solution doesn't flash right onto the screen.
tat_tvam_asshole@reddit
I was already running llama3 70b on a mobile 4070 with 8GB vram and 128 system ram last year
sonicnerd14@reddit
I have 3 systems that I've been configuring to have a pretty robust network that allows me to do pretty much everything locally. With some more optimization, then just one machine with 16GB would probably be all you'd need for a pretty standard agent stack.
Song-Historical@reddit
Can't fucking wait. Between that and graphify I want an optimized coder on my computer to just leave working most days.
putrasherni@reddit
dflash is much more important and revolutionary; turboquant is not revolutionary
turboquant is already superseded by rotorquant
crantob@reddit
The one comment that is of value to me ty. https://news.qq.com/rain/a/20260327A0220A00
Feztopia@reddit
Usually tech isn't mediocre; you stack tech up and get an impressive tower made of smaller parts. Sometimes it inspires other, better improvements.
VoidAlchemy@reddit
I don't bother with it, i use ik_llama.cpp with `-khad -ctk q8_0 -vhad -ctv q6_0` and if I still need more context, i usually just have to go down to one size smaller quant.
Folks have already dropped links about both ik and mainline having hadamard transform "rotations" already implemented for kv-cache since late last year.
Some of ik's recent discussions on the same question here: https://github.com/ikawrakow/ik_llama.cpp/pull/1625#issuecomment-4237851162
guiopen@reddit
Rotation, which is part of TurboQuant, is already implemented in llama.cpp and gave pretty good gains to KV cache quantization; now q8 is almost equal to f16
FullOf_Bad_Ideas@reddit
Hadamard rotation was not introduced to KV cache quantization by TurboQuant.
It was used by TurboQuant, yes
It's a great technique, but I think it's potentially misleading to say it's part of TurboQuant without this context.
It was already used in ik_llama.cpp and exllamav3, llama.cpp was simply lagging behind and it should have been implemented regardless of TurboQuant being a thing or not.
VoidAlchemy@reddit
Yeah, ik implemented it late last year: https://github.com/ikawrakow/ik_llama.cpp/pull/1033
guiopen@reddit
Thanks for the information
Awwtifishal@reddit
q8*
guiopen@reddit
Thanks, fixed the typo
the__storm@reddit
Here's the PR for anyone curious: https://github.com/ggml-org/llama.cpp/pull/21038
Spiritual_Scheme8158@reddit
uhhhh.... i get that there's always overhyping of tech... but didn't google develop some non-mediocre technologies...? like search engines, cloud technologies, as well as a lot of AI related research that went into Gemini and Genie (which can generate a real time game with proper physics with a single photo...)? the AI that beat pro go and starcraft players?
Any single one of these technologies is groundbreaking and has the power to change the direction of humankind...
FullOf_Bad_Ideas@reddit
Google invented transformers, backbone of all LLMs. They produce a lot of good technologies.
But TurboQuant is misinterpreted and is not as valuable of a contribution. Both can be true.
Spiritual_Scheme8158@reddit
Yeah? I never said TurboQuant is revolutionary. It's just a further development of model quantization.
But the OP wrote smthg like “is it another mediocre tech developed by google”
like what the fuck?
FullOf_Bad_Ideas@reddit
the way I read it, he's saying that there are a lot of mediocre technologies, not that it's specifically yet another mediocre thing coming from google
Spiritual_Scheme8158@reddit
That makes more sense. I guess I don't try to keep the overhyped mediocre garbage in my memory and just think of the important ones.
LeucisticBear@reddit
Almost every major AI breakthrough has come out of Google, deepmind, or a former member.
The other labs seem to be doing much better at implementation but the core research is fairly centralized.
Spiritual_Scheme8158@reddit
Yeah, so what the FUCK is the OP thinking, calling things from google mediocre? Sounds arrogant as fuck.
Exact_Law_6489@reddit (OP)
I see lots of videos either hyping up TurboQuant or trashing it, saying it's overhyped, etc. I'm genuinely confused.
ExpensivePilot1431@reddit
It is overhype with academic dishonesty (if not fraud). https://www.reddit.com/r/MachineLearning/comments/1s8yni2/comment/odq9c9d/
the-final-frontiers@reddit
I heard triattention is pretty awesome too.
FullOf_Bad_Ideas@reddit
it's a waste of time
ekryski@reddit
It’s not. Very lossy because it’s evicting too aggressively. Poor performance in real world usage. Again, Tom (TurboQuant+) did an implementation of it and benched it. I confirmed his findings. Check his Twitter.
kyr0x0@reddit
Yep, and TQ is also not the holy grail. It helps most for quantization below 4 bits. And beware QJL when you run softmax() at the end. That is a formula for boosting small errors..
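To make the softmax point concrete, a tiny illustration (my own numbers, nothing to do with QJL specifically): a small absolute error on one logit becomes a large relative error on that token's probability, because softmax exponentiates.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

logits = np.array([8.0, 2.0, 1.0, 0.5])
noisy = logits + np.array([0.0, 0.4, 0.0, 0.0])  # small quantization error

p, q = softmax(logits), softmax(noisy)
# Relative change of the perturbed token's probability: roughly exp(0.4) - 1.
rel = abs(q[1] - p[1]) / p[1]
```

A 0.4 perturbation on a logit shifts that token's probability by roughly 50% relative, which is why small cache errors can visibly change sampling.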
ekryski@reddit
QJL is useless. We scrapped it within a day.
ZealousidealShoe7998@reddit
The best thing you can do is read the paper yourself and, for everything you don't understand, ask an AI to explain it. If it feels too complicated, open a new chat and ask it to explain it like you're 5. Things start clicking; go ahead and ask more questions. You will form a much better opinion than a collection of people who haven't used the tool at all and are just answering based on the same videos you watched.
FullOf_Bad_Ideas@reddit
you can feed BS paper to LLMs and it can tell you how innovative it is, if you prompt it a specific way.
It's a terrible way to gauge value of research, and this is what produced those AI slop PRs and hype.
dryadofelysium@reddit
Google published a really cool research paper, along with a blog post to talk about it. How is that overhyping?
Exact_Law_6489@reddit (OP)
Im genuinely confused at this point. I see videos on YouTube either hyping up TurboQuant or criticising it harshly.
FullOf_Bad_Ideas@reddit
And Youtubers get to collect money earned on ads. Their job is to get you to watch whatever they put out.
dryadofelysium@reddit
Sure, but that has nothing to do with Google.
jacek2023@reddit
Imagine the worst person in the world, YouTuber will always be worse
cakemates@reddit
Google's article built a few strawmen and then knocked them down with a stick. They compared turboquant to a full-precision model, while no one really runs models without quantization; had they compared q4 vs q4, for example, the differences would be less dramatic but actually accurate. That's why the reception is mixed; they didn't have to do that to prove it's great stuff.
dryadofelysium@reddit
I hear you, but Google is not responsible for people being unable to read or understand the words of an article and misinterpreting them. Also I am not positive that cloud model providers, who are very much also a target group for this research, necessarily run (heavily) quantized models the way local people on reddit do.
FullOf_Bad_Ideas@reddit
They literally knowingly misrepresented previous attempts to make TurboQuant look better compared to RaBitQ; it's academic dishonesty bordering on fraud.
relmny@reddit
AFAIK they just used the RaBitQ research without even properly acknowledging it.
They made a lot of claims, but nothing has been proved yet, in particular the "lossless" part.
And look at the amount of projects on github...
So yes, overhyped. Until someone makes it work as they claim.
PsychologicalOne752@reddit
Research papers and blog posts are great. They do get some people promoted within Google. But where the rubber meets the road is a real-world implementation where it delivers X% lower VRAM using model Y.
ReturningTarzan@reddit
TurboQuant itself is a quantization method like so many others before it, and if you're willing to sacrifice speed and simplicity for memory savings it lets you do that in a slightly new way. But we've had "lossless 2-bit KV cache" in various forms for years, and it never gains traction because the tradeoffs just aren't worth it. Still, it's an interesting bit of research with a few novel ideas worth integrating.
The real issue is with the blog post making claims like "lossless", "zero overhead" and "8x faster." There's no source for any of those claims. The paper doesn't mention anything about TQ being faster (except compared to CPU-based RaBitQ in a semantic-search context), and the "zero overhead" seems to refer to distortion rates, not computational overhead.
There are also no real implementation details in the paper, just a snippet of pseudocode and some synthetic results. But the proposed method inherently adds a lot of computational overhead. It may still give you a net speedup in memory-bound situations, but that speedup isn't implied by the algorithm, isn't universal even if it can be achieved situationally, and is always going to be less than a simpler quantization scheme under the same circumstances.
So then it would come down to accuracy, right? But then why not compare it to other methods that make similar claims:
FP8 is commonly used in production, is trivial to implement and comes with immediate performance benefits. NVFP4 is the really interesting one because of its extremely high throughput on Blackwell GPUs, yet it still has a reported <1% accuracy loss on real benchmarks.
So even if TQ did outperform everything else, you should still curb your expectations somewhat: maybe you might reduce the effective size of your cache from 4 bits to 3.5 bits. For modern models that already employ a lot of memory-saving techniques at the architectural level (linear attention, MLA, SWA) it's simply not that big a deal.
So no, it's not revolutionary, and yes, Twitter is out of control. In Google's own (mind you, very limited) testing it doesn't even unambiguously outperform KIVI from 2024.
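A back-of-the-envelope for the computational-overhead point above (hypothetical sizes; real cost depends on block structure and kernel fusion): undoing a d×d rotation when reading the cache is a matmul, versus one multiply-add per element for a plain scale/zero-point dequant.

```python
# FLOPs to read back n cached tokens with head dimension d.
def rotation_flops(n, d):
    return 2 * n * d * d   # one [n, d] x [d, d] matmul to undo the rotation

def dequant_flops(n, d):
    return 2 * n * d       # one multiply + one add per element

# The ratio is simply d: for head_dim 128, the rotated read does ~128x
# the arithmetic of a plain dequant (before any fusion or tricks).
ratio = rotation_flops(4096, 128) / dequant_flops(4096, 128)
```

That extra arithmetic can still be a net win when you're memory-bound, which is the situational speedup described above.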
noctrex@reddit
There is an interesting video over here about this, from bycloud: TurboQuant: The Incredible Marketing Stunt By Google
Alex_L1nk@reddit
👆this
Healthy-Nebula-3603@reddit
Yes, that's really good shit. Because of it you can use a 2x bigger context.
Finally a Q8 cache is usable and has output quality as good as an FP16 cache.
Before, even a Q8 cache had noticeable degradation.
UnclaEnzo@reddit
Maybe fire up a turboq model and compare it side by side with the model from before its logits were rotated, randomized and normalized.
VoiceApprehensive893@reddit
overhyped but the hype kicked off development in the right direction
people made it work but it's currently slow; it might get fast in a few months
PhotographerUSA@reddit
You can use a 122B model easily on a 6GB video card if it works.
defmans7@reddit
You forgot the /s 😂
For anybody reading in the future, the quantisation only affects the KV cache, not the model weights, meaning it allows you to fit more context and get similar accuracy as full precision mode.
Thing is, there are already some good quantisation methods, so the turboquant option is really kinda a small increase in quality and not much of a change in context cache size.
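For scale, here's a rough estimate of what KV-cache quantization saves (the formula is standard; the model shape below is hypothetical):

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * context * bytes/elem
def kv_cache_gb(layers, kv_heads, head_dim, ctx, bytes_per_elem):
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem / 1e9

# A hypothetical GQA model: 40 layers, 8 KV heads, head_dim 128, 128k context.
fp16_gb = kv_cache_gb(40, 8, 128, 131072, 2)    # fp16 cache
q4_gb   = kv_cache_gb(40, 8, 128, 131072, 0.5)  # 4-bit cache
```

The 4x reduction from fp16 to 4-bit is what lets the same VRAM hold several times the context, even though the weights are untouched.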
kyr0x0@reddit
People already started trying TQ on weights..
https://github.com/cksac/turboquant-model
Just saying..
stddealer@reddit
But why? You can't just rotate all the weights and expect the model to still work... As soon as there is a non-linear operation, that would break everything. So I'm guessing they're using TurboQuant without the rotation part? What's the point of using TQ then?
defmans7@reddit
Cool. the field is fast moving and interesting!
Exact_Law_6489@reddit (OP)
To be honest, it sounds wrong. It sounds great, but I can't say anything without trying it.
Radiant_Condition861@reddit
As I understand it, it reduces the KV cache by changing it from a coordinate system to a polar system without reduction in precision. It can still pick up small tokens within a large context. The advantage is reduction in memory requirements for the cache.
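A minimal sketch of the polar idea (my own toy version, loosely in the spirit of PolarQuant, not the actual method): pair up dimensions, store each pair as a radius and an angle, and spend most of the bit budget where it matters.

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal(128)

# Pair up dimensions and convert each (x, y) pair to polar form.
xy = v.reshape(-1, 2)
r = np.hypot(xy[:, 0], xy[:, 1])
theta = np.arctan2(xy[:, 1], xy[:, 0])

def uquant(x, lo, hi, bits):
    # Uniform quantization of x over [lo, hi], returned dequantized.
    scale = (hi - lo) / (2**bits - 1)
    return np.round((x - lo) / scale) * scale + lo

theta_q = uquant(theta, -np.pi, np.pi, 6)  # 6 bits for the angle
r_q = uquant(r, 0.0, r.max(), 8)           # 8 bits for the radius

# Reconstruct the vector and measure the round-trip error.
rec = np.stack([r_q * np.cos(theta_q), r_q * np.sin(theta_q)], axis=1).reshape(-1)
err = np.abs(rec - v).mean()
```

The bit split between radius and angle is where such schemes tune the precision/size trade-off; the version here is only illustrative.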
a_beautiful_rhind@reddit
Turdoquant lets you use Q3 instead of Q4 cache. Never leads with perplexity or KLD testing in any of the implementations I have seen.
On the upside, it got llama.cpp to implement hadamard rotations for KV cache.
Simusid@reddit
I think it's very important and the underlying math (Johnson–Lindenstrauss encoding) is sound. I was excited to try http://github.com/thetom/llama-cpp-turboquant tonight. I tried the three different KV encodings and all caused a 15% slowdown using the same cmake build, same model, and same launch parameters.
vertigo235@reddit
As with all things in the space it is overhyped but does appear to be a good step forward.
wazymandias@reddit
The polar decomposition approach is clever, but the paper's benchmarks are all clean academic datasets. Production inference workloads, where quantisation error actually matters, are the real test...
MachineZer0@reddit
Give it a shot. https://github.com/TheTom/llama-cpp-turboquant/tree/feature/turboquant-kv-cache
I have a fork that I merged master into, if anyone wants.
I'm running it now on a Quad V100 SXM2 32gb. I was running MiniMax-M2.5-UD-Q3_K_XL before at 101gb. Now MiniMax-M2.7-UD-IQ4_XS at 108gb. Same context size. Same exact VRAM footprint.
Osprey6767@reddit
It's revolutionary. I tested it with various models and got INSANE performance gains while not impacting quality a bit. To anyone who would contradict this: I've been doing local AI for years and years and have tested tons of models. Nothing has been better.
I built a cli tool for people to use to test out turboquant. You can check it out here: https://github.com/md-exitcode0/turbo-cli
PhotographerUSA@reddit
Yes, it will be implemented in all AI models next quarter.
jacek2023@reddit
https://www.reddit.com/r/LocalLLaMA/comments/1s9lge6/llama_rotate_activations_for_better_quantization/
https://www.reddit.com/r/LocalLLaMA/comments/1sf61n2/kvcache_support_attention_rotation_for/
Exact_Law_6489@reddit (OP)
Thank you ^^
RudeboyRudolfo@reddit
It's a very good technology, that has been overhyped.