RotorQuant Local Build – 112K context on a single RTX 4090 with Qwen3.5-27B (open source, one command)

Posted by Junior_Swimming_8416@reddit | LocalLLaMA | 11 comments

Been messing around with getting large context windows to actually work on consumer GPUs and figured I'd share what I ended up with.

The heavy lifting comes from https://github.com/scrya-com/rotorquant — compresses the KV cache 3.8x with basically no quality loss (97% of fp16 decode speed). It swaps out the dense rotation matrix from TurboQuant (ICLR 2026) for Clifford algebra rotors, so 44x fewer parameters and way faster CUDA kernels. Full credit to the RotorQuant team for the actual compression — what I did was wrap it into a one-command Docker server with tuned configs per GPU tier. I also stacked it with Unsloth's imatrix quants (UD-Q3_K_XL, UD-IQ3_XXS, etc.) for the 16 GB profiles — Unsloth squeezes the weights, RotorQuant handles the KV cache, and together they make stuff fit that really shouldn't.

Tested on a 4090 and a 5090 and everything works, but I haven't battle-tested every edge case. Output quality seems on par with stock `qwen3.5:27B`.

What you get:

- RTX 4090 (24 GB): ~112K context with Qwen3.5-27B Q4_K_M
- RTX 5090 (32 GB): ~252K context
- RTX 4060 Ti (16 GB): ~28K with Unsloth Q3 quants, ~56K with IQ3_XXS
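If you want to sanity-check the 112K number yourself, the arithmetic roughly works out from the figures above. A minimal sketch — the weight size is my own rough estimate for a 27B model at Q4_K_M, not a measured number, and the KV figures are derived from the throughput-mode numbers below (fp16 KV ≈ 4.0 GB at 16K context, 3.8x compression):

```python
# Back-of-envelope check that 112K context fits in 24 GB.
# WEIGHTS_GB is an assumption; KV figures come from the post.
FP16_KV_AT_16K_GB = 4.0   # fp16 KV cache at 16K context
COMPRESSION = 3.8         # RotorQuant KV compression ratio
WEIGHTS_GB = 16.5         # Qwen3.5-27B Q4_K_M, approximate

kv_per_1k_tokens = FP16_KV_AT_16K_GB / 16        # 0.25 GB/1K tokens in fp16
kv_112k = 112 * kv_per_1k_tokens / COMPRESSION   # compressed KV for 112K ctx
total = WEIGHTS_GB + kv_112k
print(f"compressed KV for 112K: {kv_112k:.1f} GB, total: {total:.1f} GB")
```

Compressed KV lands around 7.4 GB, so weights plus cache squeaks in just under 24 GB — which is why the headroom on a 4090 tops out near 112K.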

Throughput mode:

There's a throughput-optimized profile too (make run-throughput) that trades context length for parallel slots. Basically, single-user decode is memory-bandwidth bound (mat-vec), but with N slots the weight multiplications become mat-mat and tensor cores kick in — aggregate throughput scales almost linearly. RotorQuant really shines here: iso4 KV at 16K = 1.06 GB/slot vs 4.0 GB/slot with fp16, so 3.8x more concurrent users on the same card. On 24 GB that's 6 slots (~320 aggregate tok/s), on 32 GB it's 14 slots (~660 tok/s). Per-user latency stays around ~67 tok/s at low load, ~50-55 when fully loaded.
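The slot counts fall straight out of the per-slot KV numbers. A quick sketch — the weight and overhead figures here are my guesses chosen to reproduce the slot counts above, not values from the repo:

```python
# Slot-count arithmetic from the per-slot KV figures in the post.
# WEIGHTS_GB and OVERHEAD_GB are assumptions, not measured.
KV_PER_SLOT_GB = 1.06       # RotorQuant iso4 KV at 16K context
KV_PER_SLOT_FP16_GB = 4.0   # fp16 KV at 16K context
WEIGHTS_GB = 16.5           # Qwen3.5-27B Q4_K_M, approximate
OVERHEAD_GB = 0.6           # CUDA context + scratch buffers, rough guess

def slots(vram_gb, kv_per_slot_gb):
    """How many 16K-context slots fit after weights and overhead."""
    budget = vram_gb - WEIGHTS_GB - OVERHEAD_GB
    return int(budget // kv_per_slot_gb)

for gpu, vram in (("RTX 4090", 24), ("RTX 5090", 32)):
    print(f"{gpu}: {slots(vram, KV_PER_SLOT_GB)} slots with RotorQuant, "
          f"{slots(vram, KV_PER_SLOT_FP16_GB)} with fp16 KV")
```

Same card, same model: fp16 KV would leave room for a single 16K slot on 24 GB, which is where the near-4x concurrency win comes from.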

It's all Docker + llama.cpp under the hood:

`make build && make run-qwen`

Gives you an OpenAI-compatible API. Also has profiles for Gemma 4 26B and a reasoning-tuned Qwen distillation.
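Since it's OpenAI-compatible, any standard client works. A minimal sketch of a chat-completion request — the port (llama.cpp's server defaults to 8080) and the model name are assumptions, so check the repo's Makefile for the actual values:

```python
import json

# Hypothetical endpoint — llama.cpp's server defaults to port 8080;
# check the repo's config for the actual port.
BASE_URL = "http://localhost:8080/v1"

payload = {
    "model": "qwen3.5-27b",  # placeholder; the server may ignore this field
    "messages": [{"role": "user", "content": "Summarize this README."}],
    "max_tokens": 128,
}

# To actually send it (requires the server to be running):
# import urllib.request
# req = urllib.request.Request(
#     f"{BASE_URL}/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
print(json.dumps(payload, indent=2))
```

Anything that speaks the OpenAI chat-completions format (openai-python, LangChain, etc.) should point at the same base URL without changes.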

GitHub:
https://github.com/rapatel0/rq-models

Would love to hear from anyone trying this on 16 GB cards especially — the Q3 configs could use more real-world testing.