You guys seen this? beats turboquant by 18%
Posted by OmarBessa@reddit | LocalLLaMA | View on Reddit | 25 comments
https://github.com/Dynamis-Labs/spectralquant
basically, they discard 97% of the kv cache key vectors after figuring out which ones have the most signal
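To make the idea concrete, here is a toy sketch (not the SpectralQuant implementation; the repo's actual scoring method may differ): score each cached key vector by a simple importance proxy, here just the L2 norm, and keep only the top 3%, discarding the other 97%.

```python
# Toy sketch of KV-cache key pruning. The L2-norm scoring is a
# stand-in assumption, NOT the actual SpectralQuant criterion.
import math
import random

random.seed(0)

def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))

def prune_keys(keys, keep_frac=0.03):
    """Return (kept_indices, kept_keys) for the top keep_frac by norm."""
    ranked = sorted(range(len(keys)), key=lambda i: l2_norm(keys[i]), reverse=True)
    n_keep = max(1, int(len(keys) * keep_frac))
    kept = sorted(ranked[:n_keep])  # preserve original sequence order
    return kept, [keys[i] for i in kept]

# 1000 fake key vectors of dimension 8
keys = [[random.gauss(0, 1) for _ in range(8)] for _ in range(1000)]
kept_idx, kept_keys = prune_keys(keys)
print(len(kept_keys))  # 30 of 1000 survive, i.e. 97% discarded
```

The interesting part in the real method is presumably how "most signal" is measured; swapping in a better score than the plain norm is the whole game.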
Zestyclose_Yak_3174@reddit
It sounds very good in theory. Like many of these refined and enhanced methods, though, they rarely end up in inference frameworks. Hopefully this one will be different.
Thrumpwart@reddit
It's a better turboquant and we saw how fast turboquant is being adopted. I just hope it gets the same amount of attention.
charmander_cha@reddit
Eagerly awaiting the PR for llama.cpp on Vulkan [translated from Portuguese]
last_llm_standing@reddit
can someone explain the lore behind the above person getting downvoted?
Due-Memory-6957@reddit
Not speaking in English is already a reason.
Thrumpwart@reddit
Why?
Due-Memory-6957@reddit
People prefer it when they can understand each other (and so speak a common language) rather than recreate the Tower of Babel.
Thrumpwart@reddit
No, you prefer that. I like living in a multi-lingual world.
Ignorance is a bad thing, not a good thing.
Due-Memory-6957@reddit
I like it too, and I speak multiple languages. Still, shared spaces are better when everyone can understand everyone, instead of splitting into little islands.
Beginning-Window-115@reddit
ok yeah but this server is English only for a reason
Thrumpwart@reddit
Is it? Where does it say that anywhere?
Beginning-Window-115@reddit
weird they removed it, ig not anymore.
Thrumpwart@reddit
Sure buddy, sure.
xandep@reddit
Not entirely his fault: Reddit and Google default to translating everything. Right now he's reading "not speaking English", but in Portuguese 😂. Imagine his confusion.
The_frozen_one@reddit
This sub has tons of bots that downvote comments mentioning software that isn’t the one they want to promote. That’s my theory at least
OmarBessa@reddit (OP)
makes a lot of sense
Frosty-Cup-8916@reddit
I've also seen non-English comments be downvoted into the void on this sub.
last_llm_standing@reddit
Ah, I've seen a lot of that with just the Hermes Agent. Didn't know this was an epidemic.
Prize_Negotiation66@reddit
I am a lawyer [translated from Spanish]
EffectiveCeilingFan@reddit
I see they chose to only test ancient models, just like TurboQuant: “3–4% across Qwen (1.5B, 7B, 14B), Llama 3.1-8B, Mistral 7B, and Gemma 2-9B”
I’m guessing that, just like TurboQuant, the results suck on anything recent?
Caffeine_Monster@reddit
Part of the problem is that these quantisation techniques assume features are somewhat sparse and there is a lot of redundancy. This is less true with newer models.
1ncehost@reddit
I've analyzed attention signal activation, and my personal findings are that it changes a lot by layer and model. In the experiment I recently performed, the last 1/4 of layers had very few attention activations, and something like this could be applied there with little consequence.
I highly doubt it is universally effective.
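The kind of per-layer analysis described above can be sketched roughly like this (all names and numbers here are invented for the demo; a real analysis would use attention maps dumped from an actual model): for each layer, count what fraction of attention weights exceed a multiple of the uniform baseline 1/n.

```python
# Hypothetical per-layer attention-activation measurement.
# Later "layers" get flatter score distributions, mimicking the
# observation that the last layers have few strong activations.
import math
import random

random.seed(1)

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def active_fraction(attn_rows, factor=2.0):
    """Fraction of attention weights above factor * (uniform 1/n)."""
    total = active = 0
    for row in attn_rows:
        probs = softmax(row)
        thresh = factor / len(row)
        active += sum(1 for p in probs if p > thresh)
        total += len(row)
    return active / total

# Simulate 4 layer groups: decreasing spread = flatter attention.
layers = []
for spread in [3.0, 2.0, 1.0, 0.2]:
    rows = [[random.gauss(0, spread) for _ in range(64)] for _ in range(16)]
    layers.append(rows)

for depth, rows in enumerate(layers):
    print(f"layer group {depth}: active fraction = {active_fraction(rows):.3f}")
```

With flat (near-uniform) attention, almost nothing clears the threshold, which is why pruning the late layers could be cheap while pruning early, peaked layers is not.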
ShepardRTC@reddit
I assume all models train their features differently; perhaps there's a link to their initial values. But that makes finding the signal very model-specific.
Chromix_@reddit
Well, it makes sense from a theoretical perspective, if a vector only has very few large values that contribute, then removing the remaining "noise" shouldn't hurt the results that much.
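That argument is easy to verify numerically. A small illustration (numbers made up for the demo): when a key vector's mass sits in a few large components, zeroing the remaining small "noise" components barely moves the attention dot product.

```python
# If 3 of 100 components carry almost all the mass, dropping the
# other 97 changes the dot product by only ~3%.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def keep_top_k(vec, k):
    """Zero all but the k largest-magnitude components."""
    top = set(sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k])
    return [v if i in top else 0.0 for i, v in enumerate(vec)]

# A 100-dim key: 3 dominant components, 97 tiny ones.
key = [10.0] * 3 + [0.01] * 97
query = [1.0] * 100

full = dot(query, key)                   # 30.97
pruned = dot(query, keep_top_k(key, 3))  # 30.0
rel_err = abs(full - pruned) / abs(full)
print(f"relative error from dropping 97% of components: {rel_err:.4f}")
```

The flip side, of course, is that this only works if the keys really are that sparse, which ties back to the point above about newer models being less redundant.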
The presented approach requires a calibration dataset. So it sort of amplifies the imatrix "problem" that we already have: What's a good dataset to calibrate on? (The answer to that is difficult and noisy).
The long context tests performed here were only up to 8k tokens. That's not a lot, and the old needle-in-a-haystack test from 2023 is rather outdated by now. Still, the results at least give confidence that this approach doesn't totally break things.
Thus now would be the time to validate this with contemporary benchmarks, including modern long-context checks.