RotorQuant Local Build – 112K context on a single RTX 4090 with Qwen3.5-27B (open source, one command)
Posted by Junior_Swimming_8416@reddit | LocalLLaMA | 11 comments
Been messing around with getting large context windows to actually work on consumer GPUs and figured I'd share what I ended up with.
The heavy lifting comes from https://github.com/scrya-com/rotorquant — compresses the KV cache 3.8x with basically no quality loss (97% of fp16 decode speed). It swaps out the dense rotation matrix from TurboQuant (ICLR 2026) for Clifford algebra rotors, so 44x fewer parameters and way faster CUDA kernels. Full credit to the RotorQuant team for the actual compression — what I did was wrap it into a one-command Docker server with tuned configs per GPU tier. I also stacked it with Unsloth's imatrix quants (UD-Q3_K_XL, UD-IQ3_XXS, etc.) for the 16 GB profiles — Unsloth squeezes the weights, RotorQuant handles the KV cache, and together they make stuff fit that really shouldn't.
Tested on a 4090 and a 5090 and everything works, but I haven't battle-tested every edge case. Output quality seems on par with stock Qwen3.5-27B.
What you get:
- RTX 4090 (24 GB): ~112K context with Qwen3.5-27B Q4_K_M
- RTX 5090 (32 GB): ~252K context
- RTX 4060 Ti (16 GB): ~28K with Unsloth Q3 quants, ~56K with IQ3_XXS
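Rough sanity check on those context numbers. This is my own back-of-envelope math, not from the repo: I'm assuming Qwen3.5-27B has 64 layers, 8 KV heads, and a head dim of 128 (layer and head counts are from my comments below; head dim is a guess), ~16 GB for Q4 weights, and ~1 GB of overhead.

```python
# Back-of-envelope check: how much context fits after weights + overhead,
# given RotorQuant's ~3.8x KV cache compression.
# Assumed model shape (NOT from the repo): 64 layers, 8 KV heads, head_dim 128.

GIB = 1024**3

def fp16_kv_bytes_per_token(layers=64, kv_heads=8, head_dim=128):
    # K and V tensors, 2 bytes per element in fp16
    return 2 * layers * kv_heads * head_dim * 2

def max_context(vram_gib, weights_gib=16.0, overhead_gib=1.0, compression=3.8):
    kv_budget = (vram_gib - weights_gib - overhead_gib) * GIB
    per_token = fp16_kv_bytes_per_token() / compression
    return int(kv_budget / per_token)

print(max_context(24))  # ~109K on a 24 GB card, same ballpark as the quoted 112K
print(max_context(32))  # ~233K on 32 GB, roughly in line with the quoted 252K
```

The exact numbers depend on how much llama.cpp reserves for the compute buffer, so treat this as a ballpark estimate, not a guarantee.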
Throughput mode:
There's a throughput-optimized profile too (`make run-throughput`) that trades context length for parallel slots. Basically, single-user decode is memory-bandwidth bound (mat-vec), but with N slots the weight multiplications become mat-mat and tensor cores kick in — aggregate throughput scales almost linearly. RotorQuant really shines here: iso4 KV at 16K = 1.06 GB/slot vs 4.0 GB/slot with fp16, so 3.8x more concurrent users on the same card. On 24 GB that's 6 slots (~320 aggregate tok/s), on 32 GB it's 14 slots (~660 tok/s). Per-user latency stays around ~67 tok/s at low load, ~50-55 when fully loaded.
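The slot counts above fall out of simple division. Sketch below uses my own assumptions of ~16 GB for weights and ~1 GB overhead, plus the 1.06 GB/slot compressed-KV figure from the paragraph:

```python
# How many 16K-token slots fit in the VRAM left over after weights?
# weights_gb and overhead_gb are my assumptions; kv_per_slot_gb is the
# 1.06 GB compressed-KV-per-slot figure quoted above.

def slots(vram_gb, weights_gb=16.0, overhead_gb=1.0, kv_per_slot_gb=1.06):
    return int((vram_gb - weights_gb - overhead_gb) / kv_per_slot_gb)

print(slots(24))  # 6 slots on a 24 GB card
print(slots(32))  # 14 slots on a 32 GB card
```

With fp16 KV at 4.0 GB/slot the same budgets give only 1 and 3 slots, which is where the ~3.8x concurrency multiplier comes from.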
It's all Docker + llama.cpp under the hood:
`make build && make run-qwen`
Gives you an OpenAI-compatible API. Also has profiles for Gemma 4 26B and a reasoning-tuned Qwen distillation.
GitHub:
https://github.com/rapatel0/rq-models
Would love to hear from anyone trying this on 16 GB cards especially — the Q3 configs could use more real-world testing.
Objective-Stranger99@reddit
Wait a minute. How can I achieve a 262K context length on Qwen3.5 35B with q4_K_S on the model and Q8_0 on the KV cache? Am I doing something wrong? I only have 8GB of VRAM.
Junior_Swimming_8416@reddit (OP)
A few misunderstandings. I only got 262K on a single RTX 5090 (32 GB VRAM) using Qwen3.5-27B.
This works by quantizing the model weights, while RotorQuant quantizes the KV-cache activations. A 256K context historically used over 32 GB of RAM, but with RotorQuant it uses about ~5 GB.
I don't think you'll be able to quantize down to that level. The weights for 27B are already ~16 GB after Q4_K_XL, so just the weights will fill up your card.
EffectiveCeilingFan@reddit
Ah. I just wasted my time commenting. You’re just making all your numbers up. 256k context simply does not use 32GB of RAM on Qwen3.5, you’re lying. It uses around 8GB I believe.
Objective-Stranger99@reddit
I am getting around 8GB KV cache usage with Qwen3.5 35B.
Junior_Swimming_8416@reddit (OP)
See the calculation in the other comment
Junior_Swimming_8416@reddit (OP)
> 262K on a single 5090 RTX (32GB VRAM)
This is just the card spec, i.e., an RTX 5090 has 32 GB of VRAM.
I was saying that I was able to fit the model with that context window on that card, not that the KV cache was 32 GB.
Objective-Stranger99@reddit
Right, I forgot to mention that I'm using quite some system RAM as well (about 12-15 GB).
Junior_Swimming_8416@reddit (OP)
You just gave me a new idea. :D
Gonna get started on something for this.
EffectiveCeilingFan@reddit
Are these numbers made up? Qwen3.5 doesn't use anywhere near that amount of KV cache. 16k context doesn't use anywhere near 4GB of RAM; 128k uses 4GB of RAM.
FYI, you need to compare against BF16 KV cache. Qwen3.5 sees measurable performance degradation even from F16 vs BF16.
Furthermore, TurboQuant is not lossless. Far from it. They only tested super old models in the paper for a reason: hybrid models only see a slight boost, nowhere near “lossless”. I see you didn’t do any benchmarking. I think you’ll find that this performs significantly worse than BF16 KV cache.
Junior_Swimming_8416@reddit (OP)
You might be thinking of a smaller Qwen model — Qwen3.5-9B or similar has ~32 layers with 2 KV heads, which would be ~4 GB at 128K.
Qwen3.5-27B is 64 layers with 8 KV heads — 8x more KV state per token. That's the whole reason KV cache compression matters for this model.
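Worked out, assuming a head dim of 128 (that part is my assumption; the layer and KV-head counts are the ones quoted above):

```python
# fp16 KV cache size for two hypothetical model shapes: ~32 layers / 2 KV heads
# vs 64 layers / 8 KV heads. head_dim=128 is an assumption.

def fp16_kv_gib(layers, kv_heads, ctx_tokens, head_dim=128):
    # K + V, 2 bytes per element in fp16
    return 2 * layers * kv_heads * head_dim * 2 * ctx_tokens / 1024**3

print(fp16_kv_gib(32, 2, 128 * 1024))  # 4.0  -> the ~4 GB figure at 128K
print(fp16_kv_gib(64, 8, 128 * 1024))  # 32.0 -> 8x more KV state per token
```

Same 128K context, 8x the KV footprint — which is why a 3.8x KV compressor buys so much more context on the bigger model.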
EffectiveCeilingFan@reddit
Holy shit dude you had an AI write a summary of your own results for you? Couldn’t even be bothered to read your own README?