You guys seen this? beats turboquant by 18%
Posted by OmarBessa@reddit | LocalLLaMA | View on Reddit | 25 comments
https://github.com/Dynamis-Labs/spectralquant
basically, they discard 97% of the kv cache key vectors after figuring out which ones have the most signal
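To make the idea concrete, here is a toy sketch (not the SpectralQuant implementation; the repo's actual scoring method may differ): score each cached key vector by a simple importance proxy, here just the L2 norm, and keep only the top 3%, discarding the other 97%.

```python
# Toy sketch of KV-cache key pruning. The L2-norm scoring is a
# stand-in assumption, NOT the actual SpectralQuant criterion.
import math
import random

random.seed(0)

def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))

def prune_keys(keys, keep_frac=0.03):
    """Return (kept_indices, kept_keys) for the top keep_frac by norm."""
    ranked = sorted(range(len(keys)), key=lambda i: l2_norm(keys[i]), reverse=True)
    n_keep = max(1, int(len(keys) * keep_frac))
    kept = sorted(ranked[:n_keep])  # preserve original sequence order
    return kept, [keys[i] for i in kept]

# 1000 fake key vectors of dimension 8
keys = [[random.gauss(0, 1) for _ in range(8)] for _ in range(1000)]
kept_idx, kept_keys = prune_keys(keys)
print(len(kept_keys))  # 30 of 1000 survive, i.e. 97% discarded
```

The interesting part in the real method is presumably how "most signal" is measured; swapping in a better score than the plain norm is the whole game.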
Zestyclose_Yak_3174@reddit
It sounds very good in theory. Like many of these refined and enhanced methods, though, they rarely end up in inference frameworks. Hopefully this one will be different.
Thrumpwart@reddit
It's a better turboquant and we saw how fast turboquant is being adopted. I just hope it gets the same amount of attention.
charmander_cha@reddit
Eagerly awaiting the PR for llama.cpp on Vulkan [translated from Portuguese]
last_llm_standing@reddit
can someone explain the lore behind the above person getting downvoted?
Due-Memory-6957@reddit
Not speaking in English is already a reason.
Thrumpwart@reddit
Why?
Due-Memory-6957@reddit
People prefer it when they can understand each other (and so speak a common language) rather than recreate the Tower of Babel.
Thrumpwart@reddit
No, you prefer that. I like living in a multi-lingual world.
Ignorance is a bad thing, not a good thing.
Due-Memory-6957@reddit
I like it too, and I speak multiple languages. Still, shared spaces are better when everyone can understand everyone, instead of splitting into little islands.
Beginning-Window-115@reddit
ok yeah but this server is English only for a reason
Thrumpwart@reddit
Is it? Where does it say that anywhere?
Beginning-Window-115@reddit
weird they removed it, ig not anymore.
Thrumpwart@reddit
Sure buddy, sure.
xandep@reddit
Not entirely his fault: Reddit and Google default to translating everything. Right now he's reading "not speaking English", but in Portuguese 😂. Imagine his confusion.
The_frozen_one@reddit
This sub has tons of bots that downvote comments mentioning software that isn’t the one they want to promote. That’s my theory at least
OmarBessa@reddit (OP)
makes a lot of sense
Frosty-Cup-8916@reddit
I've also seen non-English comments be downvoted into the void on this sub.
last_llm_standing@reddit
Ah, I've seen a lot of that with just the Hermes Agent. Didn't know this was an epidemic.
Prize_Negotiation66@reddit
I am a lawyer [translated from Spanish]
EffectiveCeilingFan@reddit
I see they chose to only test ancient models, just like TurboQuant: “3–4% across Qwen (1.5B, 7B, 14B), Llama 3.1-8B, Mistral 7B, and Gemma 2-9B”
I’m guessing that, just like TurboQuant, the results suck on anything recent?
Caffeine_Monster@reddit
Part of the problem is that these quantisation techniques assume features are somewhat sparse and there is a lot of redundancy. This is less true with newer models.
1ncehost@reddit
I've analyzed attention signal activation, and my personal findings are that it changes a lot by layer and model. In the experiment I recently performed, the last 1/4 of layers had very few attention activations, and something like this could be applied there with little consequence.
I highly doubt it is universally effective.
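The kind of per-layer analysis described above can be sketched roughly like this (all names and numbers here are invented for the demo; a real analysis would use attention maps dumped from an actual model): for each layer, count what fraction of attention weights exceed a multiple of the uniform baseline 1/n.

```python
# Hypothetical per-layer attention-activation measurement.
# Later "layers" get flatter score distributions, mimicking the
# observation that the last layers have few strong activations.
import math
import random

random.seed(1)

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def active_fraction(attn_rows, factor=2.0):
    """Fraction of attention weights above factor * (uniform 1/n)."""
    total = active = 0
    for row in attn_rows:
        probs = softmax(row)
        thresh = factor / len(row)
        active += sum(1 for p in probs if p > thresh)
        total += len(row)
    return active / total

# Simulate 4 layer groups: decreasing spread = flatter attention.
layers = []
for spread in [3.0, 2.0, 1.0, 0.2]:
    rows = [[random.gauss(0, spread) for _ in range(64)] for _ in range(16)]
    layers.append(rows)

for depth, rows in enumerate(layers):
    print(f"layer group {depth}: active fraction = {active_fraction(rows):.3f}")
```

With flat (near-uniform) attention, almost nothing clears the threshold, which is why pruning the late layers could be cheap while pruning early, peaked layers is not.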
ShepardRTC@reddit
I assume all models train their features differently; perhaps there's a link to their initial values. But that makes finding the signal very model-specific.
Chromix_@reddit
Well, it makes sense from a theoretical perspective, if a vector only has very few large values that contribute, then removing the remaining "noise" shouldn't hurt the results that much.
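That argument is easy to verify numerically. A small illustration (numbers made up for the demo): when a key vector's mass sits in a few large components, zeroing the remaining small "noise" components barely moves the attention dot product.

```python
# If 3 of 100 components carry almost all the mass, dropping the
# other 97 changes the dot product by only ~3%.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def keep_top_k(vec, k):
    """Zero all but the k largest-magnitude components."""
    top = set(sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k])
    return [v if i in top else 0.0 for i, v in enumerate(vec)]

# A 100-dim key: 3 dominant components, 97 tiny ones.
key = [10.0] * 3 + [0.01] * 97
query = [1.0] * 100

full = dot(query, key)                   # 30.97
pruned = dot(query, keep_top_k(key, 3))  # 30.0
rel_err = abs(full - pruned) / abs(full)
print(f"relative error from dropping 97% of components: {rel_err:.4f}")
```

The flip side, of course, is that this only works if the keys really are that sparse, which ties back to the point above about newer models being less redundant.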
The presented approach requires a calibration dataset. So it sort of amplifies the imatrix "problem" that we already have: What's a good dataset to calibrate on? (The answer to that is difficult and noisy).
The long context tests performed here were only up to 8k tokens. That's not a lot, and the old needle-in-a-haystack test from 2023 is rather outdated by now. Still, the results at least give confidence that this approach doesn't totally break things.
Thus now would be the time to validate this with contemporary benchmarks, including modern long-context checks.