TurboQuant-enabled Runtime: Valkyr
Posted by inigid@reddit | LocalLLaMA | 16 comments
Based on the recent TRiP source code by Carlo Valenti.
Ported to Zig and headless Vulkan Compute shaders.
TurboQuant was added as an optional inference path.
Achieves 120 tok/s on an RTX 3090 for Gemma.
Notes regarding TurboQuant:
Right now it's Algorithm 1 only: RHT pre-conditioner + Lloyd-Max scalar quantization against a global 4-bit codebook + a small norm-correction γ.
We deliberately drop QJL (Algorithm 2).
Five independent practitioner reproductions converged on this decision.
The sign-bit residual eliminates bias but explodes attention-score variance, which softmax tolerates much worse than bias.
Randomized Hadamard Transform, not random orthogonal.
At 4 bits, a plain random rotation gives PPL 604 vs RHT's 10.12 on Qwen3-1.7B, per arclabs001's benchmarks.
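For readers unfamiliar with the pre-conditioner, here is a minimal pure-Python sketch of an RHT: a fast Walsh-Hadamard transform composed with a seeded random sign flip. This is an illustration only (the real kernels are Vulkan compute shaders); `fwht`, `rht`, and the sign vector are illustrative names, not the project's API.

```python
import math
import random

def fwht(x):
    """In-place fast Walsh-Hadamard transform; len(x) must be a power of two."""
    h, n = 1, len(x)
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2

def rht(x, signs):
    """Randomized Hadamard Transform: y = (1/sqrt(n)) * H * diag(signs) * x."""
    y = [s * v for s, v in zip(signs, x)]
    fwht(y)
    inv = 1.0 / math.sqrt(len(x))
    return [v * inv for v in y]

def rht_inverse(y, signs):
    """(1/sqrt(n)) * H is its own inverse, so undo H first, then the signs."""
    x = list(y)
    fwht(x)
    inv = 1.0 / math.sqrt(len(x))
    return [s * v * inv for s, v in zip(signs, x)]

rng = random.Random(0)                       # seeded so signs are reproducible
signs = [rng.choice((-1.0, 1.0)) for _ in range(8)]
x = [3.0, -1.0, 0.5, 7.0, 0.0, 2.0, -4.0, 1.0]
y = rht(x, signs)
```

The transform is orthonormal, so it preserves the L2 norm while smearing outlier coordinates across the whole block, which is exactly what a scalar quantizer against a global codebook wants to see.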
Norm-correction γ (TheTom / spiritbuun)
We store γ = original L2 / ‖reconstruction‖ instead of the raw L2 norm.
This is a free PPL win, and it guarantees the dequantized block has the original L2 norm exactly.
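The γ trick can be sketched against a toy symmetric 4-bit codebook (the codebook values and helper names below are illustrative, not the trained Lloyd-Max table the project ships):

```python
import math

# Toy symmetric 4-bit codebook (illustrative, not the trained Lloyd-Max one).
CODEBOOK = [(i - 7.5) / 7.5 for i in range(16)]

def l2(v):
    return math.sqrt(sum(x * x for x in v))

def quantize_block(block):
    """Nearest-codebook quantization; gamma = ||x|| / ||reconstruction||."""
    scale = max(abs(x) for x in block) or 1.0
    idx = [min(range(16), key=lambda k: abs(CODEBOOK[k] - x / scale)) for x in block]
    recon = [CODEBOOK[k] * scale for k in idx]
    gamma = l2(block) / (l2(recon) or 1.0)
    return idx, scale, gamma

def dequantize_block(idx, scale, gamma):
    # Multiplying by gamma restores the original L2 norm exactly.
    return [CODEBOOK[k] * scale * gamma for k in idx]

block = [0.3, -1.2, 0.05, 2.0, -0.7, 0.0, 1.1, -0.4]
idx, scale, gamma = quantize_block(block)
out = dequantize_block(idx, scale, gamma)
```

Storing the ratio instead of the raw norm costs nothing extra per block, and dequantization needs no division, just one multiply.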
Asymmetric K=fp / V=TQ4 by default (the dense-model recommendation from llama.cpp practitioner data).
The TQ4 pack kernel produces 256/256 indices bit-exact versus both the CPU oracle and Python reference on a deterministic input ramp (regeneration script in scripts/cross_validate_turboquant.py).
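To make "256/256 indices bit-exact on a deterministic ramp" concrete, here is a hedged sketch of what such a Python reference might look like: nearest-codebook index selection, two 4-bit indices packed per byte. The nibble order and ramp definition are assumptions of mine; the authoritative version is `scripts/cross_validate_turboquant.py`.

```python
def pack_tq4(values, codebook):
    """Map each value to its nearest codebook index; pack two 4-bit indices
    per byte (even element in the low nibble, odd in the high nibble)."""
    assert len(values) % 2 == 0
    idx = [min(range(16), key=lambda k: abs(codebook[k] - v)) for v in values]
    packed = bytes(idx[i] | (idx[i + 1] << 4) for i in range(0, len(idx), 2))
    return packed, idx

def unpack_tq4(packed):
    out = []
    for b in packed:
        out.append(b & 0xF)   # low nibble: even element
        out.append(b >> 4)    # high nibble: odd element
    return out

# Deterministic input ramp: 256 evenly spaced values in [-1, 1].
codebook = [(i - 7.5) / 7.5 for i in range(16)]
ramp = [(i / 255.0) * 2.0 - 1.0 for i in range(256)]
packed, idx = pack_tq4(ramp, codebook)
```

A GPU kernel that reproduces `packed` byte-for-byte against this kind of oracle is what the 256/256 claim is about.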
Memory savings on Gemma 2B at max_pos = 2048
V cache shrinks from 36 MiB to 4.6 MiB across 18 layers (~5.5×), plus a 2 MiB shared dequant scratchpad.
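The arithmetic checks out if you assume Gemma 2B's single KV head with head_dim 256 and an fp32 V-cache baseline (my assumptions; the post doesn't state the baseline dtype):

```python
# Assumed Gemma 2B geometry: 18 layers, 1 KV head, head_dim 256, max_pos 2048.
layers, max_pos, kv_heads, head_dim = 18, 2048, 1, 256

MiB = 1024 * 1024
fp32_bytes = layers * max_pos * kv_heads * head_dim * 4   # fp32 baseline
tq4_bytes = layers * max_pos * kv_heads * head_dim // 2   # 4 bits per value

print(fp32_bytes / MiB)  # 36.0
print(tq4_bytes / MiB)   # 4.5 -> ~4.6 MiB once scale/gamma metadata is added
```

The quoted ~5.5× presumably counts the 2 MiB scratchpad against the compressed side (36 / 6.6 ≈ 5.5); without it the raw ratio is closer to 7.8×.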
Hardware Requirements
Any Vulkan 1.3 GPU (AMD / Intel / NVIDIA / Apple via MoltenVK / Android).
One SPIR-V binary per shader, across any vendor.
https://github.com/Foundation42/valkyr
RelevantShape3963@reddit
I am astonished you could achieve so much, starting from TRiP! As its original author, I'm impressed: congrats!
inigid@reddit (OP)
Well TRiP is so clean and pedagogically accessible. You did a bang up job with it.
The way you broke everything down and covered so much ground in one project is really something. You should be very happy.
And while other projects exist, they are scattered around with no singular "voice" going through them.
I hope a lot of other people are inspired by it as well.
Honestly, I think working on it was just my way of saying "Cheers, good job mate".
Cheers, good job mate!
xeeff@reddit
gemma 1 2b? this is a joke
inigid@reddit (OP)
It is gemma-2b-it.
Not sure what your point is.
Okay fine, I'll do Gemma 3, LLaMa, Qwen and Mistral.
Anything else you would like me to do while I'm at it?
xeeff@reddit
if your project supports ROCm/Vulkan I'll check it out, assuming you get qwen3.6 27b working on it. I mean, since it's your project, MTP implementation?
inigid@reddit (OP)
I just posted an update. Qwen 3 is done.
I'll do Qwen 3.5/3.6 soon promise.
I'm totally done right now need sleep.
It should support ROCm. I haven't tried it. I can try tomorrow maybe.
Ughh. Cheers.
xeeff@reddit
i'm surprised you even replied to me ngl, all the luck to you. i've not got much time or energy to check it out now but if i can use it first hand then I'll see it for myself :p
inigid@reddit (OP)
Heh yeah. I mean I want ROCm for myself as well. And with TurboQuant especially.
We shall see. Goodnight.
xeeff@reddit
turboquant would be an excellent addition as well, yeah
inigid@reddit (OP)
TurboQuant is already in for Qwen 3.
If you or anyone can test on ROCm or cards other than my dev 3090, seriously appreciated.
Laters 🫡
xeeff@reddit
i'm down to test it, but once qwen3.6 support is out :p
inigid@reddit (OP)
Good news, Qwen 3.5 just dropped. It's checked in.
That was a heck of a lot of work. Phew. I was supposed to be asleep!
The delta to 3.6 is quite small though, no architectural changes, so nearly there. Will let you know.
xeeff@reddit
ready to be let known
inigid@reddit (OP)
Okay, full Qwen 3 support dropped with TurboQuant.
I am a bit tired now, but I'll get to Llama, Mistral, more modern versions of Gemma, etc. in the coming weeks.
Qwen 3.5/3.6 will take me a bit extra as they are Mamba hybrids, but don't worry, they are on the roadmap.
Thank you for your support!
Goodnight.
inigid@reddit (OP)
Qwen 3 version just dropped on the repo
Give me a moment, I'll drop the TurboQuant version momentarily.
Jokes continue...
inigid@reddit (OP)
I'll do Qwen 3-4B-Instruct right now. Is that okay?