turboquant: on-device search and recommendation

Posted by init0@reddit | LocalLLaMA

https://h3manth.com/ai/cinematch/

TurboQuant is a new quantization algorithm from Google Research that applies a random rotation to high-dimensional vectors to smooth out outliers, enabling extreme low-bit compression with near-zero accuracy loss.
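For intuition, here's a toy TypeScript sketch of the rotation trick. This is my own simplification, not TurboQuant's actual construction (a real implementation would use a fast structured rotation rather than the dense Gram-Schmidt matrix below, and all names here are illustrative), but it shows the key property: an orthogonal rotation preserves a vector's norm while spreading an outlier's energy across every coordinate, which is what makes aggressive scalar quantization accurate afterwards.

```typescript
// Small deterministic PRNG so the sketch is reproducible.
function mulberry32(seed: number): () => number {
  return () => {
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Random orthogonal matrix: Gram-Schmidt on Gaussian rows.
// (Toy-sized only -- O(dim^3); fast structured rotations avoid this cost.)
function randomRotation(dim: number, rng: () => number): number[][] {
  const rows: number[][] = [];
  for (let r = 0; r < dim; r++) {
    // Start from a Gaussian vector (Box-Muller transform).
    let v = Array.from({ length: dim }, () => {
      const u1 = Math.max(rng(), 1e-12);
      return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * rng());
    });
    // Orthogonalize against previous rows, then normalize.
    for (const p of rows) {
      const d = v.reduce((s, x, i) => s + x * p[i], 0);
      v = v.map((x, i) => x - d * p[i]);
    }
    const n = Math.hypot(...v);
    rows.push(v.map((x) => x / n));
  }
  return rows; // orthonormal rows => y = R v preserves the L2 norm
}

// Apply the rotation: y[r] = sum_i R[r][i] * v[i].
function rotate(R: number[][], v: number[]): number[] {
  return R.map((row) => row.reduce((s, x, i) => s + x * v[i], 0));
}
```

Rotating a vector whose energy sits in one huge coordinate yields a vector with the same norm but a much flatter per-coordinate range, so a handful of quantization levels per coordinate suffices.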

While it is currently making waves for shrinking LLM KV caches, I wanted to see how it handles semantic search on device!

I’ve integrated it into a client-side recommendation demo (CineMatch) that runs entirely on-device.

Here is how the engine drives the architecture:

- 6x Compression: TurboQuant applies its randomized rotation and 3-bit scalar quantization to crush 384-dim Float32 embeddings from 1,536 bytes down to just 249 bytes.

- Micro-Payloads: Because of that density, the entire vectorized movie index ships instantly to the client as a lightweight ~12 KB JSON file.

- WASM SIMD Execution: We don't even decompress at runtime. The browser computes dot products directly against the compressed vectors using WebAssembly SIMD.

- Zero-Jank Matching: Top-K cosine similarity runs in ~13 ms, staying well under the 16 ms budget for a flawless 60 fps experience, without a single server roundtrip.
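To make the bullets above concrete, here's a minimal TypeScript sketch of a per-vector min/max 3-bit scalar quantizer plus scoring that never expands the codes back to floats. This is my own simplification, not CineMatch's or TurboQuant's actual code, and all names are illustrative. Note the byte math: the packed codes for 384 dims come to 384 × 3 / 8 = 144 bytes, with the remainder of the post's 249-byte figure presumably going to scales and other per-vector metadata.

```typescript
type Quantized = { codes: Uint8Array; min: number; scale: number };

// Pack each coordinate into 3 bits (8 levels): 384 dims -> 144 code bytes.
function quantize3bit(v: number[]): Quantized {
  const min = Math.min(...v);
  const max = Math.max(...v);
  const scale = (max - min) / 7 || 1; // codes 0..7; guard against flat vectors
  const codes = new Uint8Array(Math.ceil((v.length * 3) / 8));
  v.forEach((x, i) => {
    const q = Math.round((x - min) / scale) & 7;
    const bit = i * 3;
    codes[bit >> 3] |= (q << (bit & 7)) & 0xff;
    // A 3-bit code can straddle a byte boundary; spill the high bits over.
    if ((bit & 7) > 5) codes[(bit >> 3) + 1] |= q >> (8 - (bit & 7));
  });
  return { codes, min, scale };
}

// Score without decompressing: dot(v, query) with v_i = min + q_i * scale
// expands to min * sum(query) + scale * sum(q_i * query_i), so the loop only
// ever reads the packed 3-bit codes.
function dotQuantized({ codes, min, scale }: Quantized, query: number[]): number {
  let acc = 0;
  let qsum = 0;
  for (let i = 0; i < query.length; i++) {
    const bit = i * 3;
    let q = codes[bit >> 3] >> (bit & 7);
    if ((bit & 7) > 5) q |= codes[(bit >> 3) + 1] << (8 - (bit & 7));
    acc += (q & 7) * query[i];
    qsum += query[i];
  }
  return min * qsum + scale * acc;
}

// Brute-force top-K over the compressed index (the demo's WASM SIMD kernel
// would replace this scalar inner loop).
function topK(index: Quantized[], query: number[], k: number): number[] {
  return index
    .map((item, id) => ({ id, score: dotQuantized(item, query) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((r) => r.id);
}
```

For unit-norm embeddings the dot product is the cosine similarity, so `topK` here corresponds directly to the Top-K cosine step in the last bullet.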

Pushing advanced quantization algorithms natively into the browser unlocks massive potential for privacy-first, zero-compute-cost AI.