TurboQuant-enabled Runtime Valkyr

Posted by inigid@reddit | LocalLLaMA

Based on the recent TRiP source code by Carlo Valenti.

Ported to Zig and headless Vulkan Compute shaders.

TurboQuant has been added as an optional inference path.

Achieves 120 tok/s on an RTX 3090 for Gemma.

Notes regarding TurboQuant:

Right now it implements Algorithm 1 only: an RHT pre-conditioner + Lloyd-Max scalar quantization against a global 4-bit codebook + a small norm-correction γ.
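
For concreteness, a minimal NumPy sketch of that path. The per-block scale, the uniform stand-in codebook, and the function names are my assumptions for illustration, not the Valkyr kernels:

```python
# Minimal sketch of the Algorithm 1 path: RHT pre-conditioner, nearest-entry
# lookup in a global 4-bit codebook, norm-correction gamma. The per-block
# scale and the uniform stand-in codebook are assumptions.
import numpy as np

def rht(x, signs):
    # Randomized Hadamard Transform: random sign flips followed by a fast
    # Walsh-Hadamard transform, scaled to stay orthonormal.
    # len(x) must be a power of two.
    y = (x * signs).reshape(1, -1)
    d = y.shape[-1]
    h = 1
    while h < d:
        y = y.reshape(-1, d // (2 * h), 2, h)
        a, b = y[:, :, 0, :], y[:, :, 1, :]
        y = np.concatenate([a + b, a - b], axis=-1).reshape(-1, d)
        h *= 2
    return (y / np.sqrt(d)).reshape(x.shape)

def quantize_block(x, codebook, signs):
    z = rht(x, signs)
    scale = np.abs(z).max() + 1e-12                  # per-block scale (assumed)
    idx = np.abs(z / scale - codebook[:, None]).argmin(axis=0).astype(np.uint8)
    recon = codebook[idx] * scale
    gamma = np.linalg.norm(z) / (np.linalg.norm(recon) + 1e-12)
    return idx, scale, gamma                         # 4-bit indices + metadata

d = 128
rng = np.random.default_rng(0)
signs = rng.choice([-1.0, 1.0], size=d)
codebook = np.linspace(-1.0, 1.0, 16)                # stand-in for Lloyd-Max
idx, scale, gamma = quantize_block(rng.standard_normal(d), codebook, signs)
recon = gamma * codebook[idx] * scale                # dequant, exact L2 norm
```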

We deliberately drop QJL (Algorithm 2).

Five independent practitioner reproductions converged on this decision.

The sign-bit residual eliminates bias but explodes attention-score variance, which softmax tolerates much worse than bias.
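
One way to see the asymmetry: softmax is shift-invariant, so a bias that is roughly uniform across a row of scores mostly cancels, while zero-mean noise of the same magnitude reshuffles the attention weights. A toy demo of that intuition (my illustration, not from the paper):

```python
# Toy illustration: softmax forgives a (near-)uniform bias on the scores,
# but not zero-mean noise of the same size.
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(0)
scores = rng.standard_normal(8)
bias_err = np.abs(softmax(scores + 0.5) - softmax(scores)).max()
noise_err = np.abs(softmax(scores + rng.normal(0, 0.5, 8)) - softmax(scores)).max()
print(bias_err, noise_err)   # ~0 (shift invariance) vs a large weight shift
```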

Randomized Hadamard Transform, not a random orthogonal rotation.

At 4 bits, a plain random rotation gives PPL 604 vs RHT's 10.12 on Qwen3-1.7B, per arclabs001's benchmarks.
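
To make the pre-conditioner's job concrete, a small demo of the standard intuition: the transform spreads outlier energy across the whole block, shrinking the dynamic range a 4-bit codebook has to cover. This uses scipy's explicit Hadamard matrix for clarity; a real kernel would use the O(d log d) in-place transform:

```python
# Demo: the RHT flattens a single large outlier into small, even-magnitude
# entries, which is exactly what a 4-bit scalar codebook wants to see.
import numpy as np
from scipy.linalg import hadamard

d = 128
rng = np.random.default_rng(0)
H = hadamard(d) / np.sqrt(d)               # orthonormal Hadamard matrix
signs = rng.choice([-1.0, 1.0], size=d)    # the "randomized" part

x = np.zeros(d)
x[7] = 10.0                                # a single large outlier
y = H @ (signs * x)                        # RHT
print(np.abs(x).max(), np.abs(y).max())    # 10.0 vs ~0.88
```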

Norm-correction γ (TheTom / spiritbuun)

We store γ = ‖original‖₂ / ‖reconstruction‖₂ instead of the raw L2 norm.

This is a free PPL improvement, and it guarantees the dequantized block has exactly the original L2 norm.
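
A tiny sketch of that guarantee (the rounding stand-in for the quantizer is mine):

```python
# Norm-correction gamma: scaling the lossy reconstruction by
# ||x|| / ||x_hat|| restores the original L2 norm exactly.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
x_hat = np.round(x * 4) / 4              # stand-in lossy reconstruction
gamma = np.linalg.norm(x) / np.linalg.norm(x_hat)
assert np.isclose(np.linalg.norm(gamma * x_hat), np.linalg.norm(x))
```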

Asymmetric K=fp / V=TQ4 by default: keys stay in floating point, values use TurboQuant 4-bit (the dense-model recommendation from llama.cpp practitioner data).

The TQ4 pack kernel produces 256/256 indices bit-exact versus both the CPU oracle and the Python reference on a deterministic input ramp (regeneration script in scripts/cross_validate_turboquant.py).
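
The shape of that check, sketched. The ramp, the uniform codebook, and the readback placeholder are assumptions; the real script is scripts/cross_validate_turboquant.py in the repo:

```python
# Sketch of the cross-validation idea: quantize a deterministic ramp with a
# CPU oracle and compare index-for-index against the pack kernel's output.
import numpy as np

ramp = np.linspace(-1.0, 1.0, 256, dtype=np.float32)      # deterministic input
codebook = np.linspace(-1.0, 1.0, 16, dtype=np.float32)   # stand-in codebook

# CPU oracle: nearest codebook entry per element.
oracle_idx = np.abs(ramp[:, None] - codebook[None, :]).argmin(axis=1).astype(np.uint8)

gpu_idx = oracle_idx.copy()   # placeholder for the TQ4 kernel's readback buffer
matches = int((oracle_idx == gpu_idx).sum())
assert matches == 256, f"pack kernel diverged: {matches}/256 bit-exact"
```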

Memory savings on Gemma 2B at max_pos = 2048:

The V cache shrinks from 36 MiB to 4.6 MiB across 18 layers, plus a 2 MiB shared dequant scratchpad; net savings 36 / (4.6 + 2) ≈ 5.5×.

Hardware Requirements

Any Vulkan 1.3 GPU (AMD / Intel / NVIDIA / Apple via MoltenVK / Android).

One SPIR-V binary per shader works across every vendor.

https://github.com/Foundation42/valkyr