TurboQuant-enabled Runtime Valkyr

Posted by inigid@reddit | LocalLLaMA

Based on the recent TRiP source code by Carlo Valenti.

Ported to Zig and headless Vulkan Compute shaders.

TurboQuant has been added as an optional inference path.

Achieves 120 tok/s on an RTX 3090 for Gemma.

Notes regarding TurboQuant:

Right now it implements Algorithm 1 only: an RHT pre-conditioner + Lloyd-Max scalar quantization against a global 4-bit codebook + a small norm-correction γ.
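
For concreteness, a minimal NumPy sketch of that path. The per-block scale, the uniform stand-in codebook, and the function names are my assumptions for illustration, not the Valkyr kernels:

```python
# Minimal sketch of the Algorithm 1 path: RHT pre-conditioner, nearest-entry
# lookup in a global 4-bit codebook, norm-correction gamma. The per-block
# scale and the uniform stand-in codebook are assumptions.
import numpy as np

def rht(x, signs):
    # Randomized Hadamard Transform: random sign flips followed by a fast
    # Walsh-Hadamard transform, scaled to stay orthonormal.
    # len(x) must be a power of two.
    y = (x * signs).reshape(1, -1)
    d = y.shape[-1]
    h = 1
    while h < d:
        y = y.reshape(-1, d // (2 * h), 2, h)
        a, b = y[:, :, 0, :], y[:, :, 1, :]
        y = np.concatenate([a + b, a - b], axis=-1).reshape(-1, d)
        h *= 2
    return (y / np.sqrt(d)).reshape(x.shape)

def quantize_block(x, codebook, signs):
    z = rht(x, signs)
    scale = np.abs(z).max() + 1e-12                  # per-block scale (assumed)
    idx = np.abs(z / scale - codebook[:, None]).argmin(axis=0).astype(np.uint8)
    recon = codebook[idx] * scale
    gamma = np.linalg.norm(z) / (np.linalg.norm(recon) + 1e-12)
    return idx, scale, gamma                         # 4-bit indices + metadata

d = 128
rng = np.random.default_rng(0)
signs = rng.choice([-1.0, 1.0], size=d)
codebook = np.linspace(-1.0, 1.0, 16)                # stand-in for Lloyd-Max
idx, scale, gamma = quantize_block(rng.standard_normal(d), codebook, signs)
recon = gamma * codebook[idx] * scale                # dequant, exact L2 norm
```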

We deliberately drop QJL (Algorithm 2).

Five independent practitioner reproductions converged on this decision.

The sign-bit residual eliminates bias but explodes attention-score variance, which softmax tolerates much worse than bias.
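
One way to see the asymmetry: softmax is shift-invariant, so a bias that is roughly uniform across a row of scores mostly cancels, while zero-mean noise of the same magnitude reshuffles the attention weights. A toy demo of that intuition (my illustration, not from the paper):

```python
# Toy illustration: softmax forgives a (near-)uniform bias on the scores,
# but not zero-mean noise of the same size.
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(0)
scores = rng.standard_normal(8)
bias_err = np.abs(softmax(scores + 0.5) - softmax(scores)).max()
noise_err = np.abs(softmax(scores + rng.normal(0, 0.5, 8)) - softmax(scores)).max()
print(bias_err, noise_err)   # ~0 (shift invariance) vs a large weight shift
```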

Randomized Hadamard Transform, not a random orthogonal rotation.

At 4 bits, a plain random rotation gives PPL 604 vs RHT's 10.12 on Qwen3-1.7B, per arclabs001's benchmarks.
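
To make the pre-conditioner's job concrete, a small demo of the standard intuition: the transform spreads outlier energy across the whole block, shrinking the dynamic range a 4-bit codebook has to cover. This uses scipy's explicit Hadamard matrix for clarity; a real kernel would use the O(d log d) in-place transform:

```python
# Demo: the RHT flattens a single large outlier into small, even-magnitude
# entries, which is exactly what a 4-bit scalar codebook wants to see.
import numpy as np
from scipy.linalg import hadamard

d = 128
rng = np.random.default_rng(0)
H = hadamard(d) / np.sqrt(d)               # orthonormal Hadamard matrix
signs = rng.choice([-1.0, 1.0], size=d)    # the "randomized" part

x = np.zeros(d)
x[7] = 10.0                                # a single large outlier
y = H @ (signs * x)                        # RHT
print(np.abs(x).max(), np.abs(y).max())    # 10.0 vs ~0.88
```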

Norm-correction γ (TheTom / spiritbuun)

We store γ = ‖original‖₂ / ‖reconstruction‖₂ instead of the raw L2 norm.

This is a free PPL improvement, and it guarantees the dequantized block has exactly the original L2 norm.
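
A tiny sketch of that guarantee (the rounding stand-in for the quantizer is mine):

```python
# Norm-correction gamma: scaling the lossy reconstruction by
# ||x|| / ||x_hat|| restores the original L2 norm exactly.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
x_hat = np.round(x * 4) / 4              # stand-in lossy reconstruction
gamma = np.linalg.norm(x) / np.linalg.norm(x_hat)
assert np.isclose(np.linalg.norm(gamma * x_hat), np.linalg.norm(x))
```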

Asymmetric K=fp / V=TQ4 by default: keys stay in floating point, values use TurboQuant 4-bit (the dense-model recommendation from llama.cpp practitioner data).

The TQ4 pack kernel produces 256/256 indices bit-exact versus both the CPU oracle and the Python reference on a deterministic input ramp (regeneration script in scripts/cross_validate_turboquant.py).
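
The shape of that check, sketched. The ramp, the uniform codebook, and the readback placeholder are assumptions; the real script is scripts/cross_validate_turboquant.py in the repo:

```python
# Sketch of the cross-validation idea: quantize a deterministic ramp with a
# CPU oracle and compare index-for-index against the pack kernel's output.
import numpy as np

ramp = np.linspace(-1.0, 1.0, 256, dtype=np.float32)      # deterministic input
codebook = np.linspace(-1.0, 1.0, 16, dtype=np.float32)   # stand-in codebook

# CPU oracle: nearest codebook entry per element.
oracle_idx = np.abs(ramp[:, None] - codebook[None, :]).argmin(axis=1).astype(np.uint8)

gpu_idx = oracle_idx.copy()   # placeholder for the TQ4 kernel's readback buffer
matches = int((oracle_idx == gpu_idx).sum())
assert matches == 256, f"pack kernel diverged: {matches}/256 bit-exact"
```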

Memory savings on Gemma 2B at max_pos = 2048:

The V cache shrinks from 36 MiB to 4.6 MiB across 18 layers, plus a 2 MiB shared dequant scratchpad; net savings 36 / (4.6 + 2) ≈ 5.5×.

Hardware Requirements

Any Vulkan 1.3 GPU (AMD / Intel / NVIDIA / Apple via MoltenVK / Android).

One SPIR-V binary per shader works across every vendor.

https://github.com/Foundation42/valkyr