KIV: 1M token context window on an RTX 4070 (12GB VRAM), no retraining, drop-in HuggingFace cache replacement - works with any model that uses DynamicCache
Posted by ThyGreatOof@reddit | LocalLLaMA | 6 comments
Been working on this for a bit and figured it was ready to share. KIV (K-Indexed V Materialization) is a middleware layer that replaces the standard KV cache in HuggingFace transformers with a tiered retrieval system. The short version: it keeps recent tokens exact in VRAM, moves old K/V to system RAM, and uses K vectors as a search index to pull back only the ~256 most relevant V entries per decode step.
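To make the lookup path concrete, here's a minimal NumPy sketch of one decode step under those assumptions (function name, shapes, and the top-k size are illustrative, not KIV's actual code; the real system operates on per-layer GPU/CPU tensors):

```python
import numpy as np

def kiv_lookup(query, recent_k, recent_v, old_k, old_v, top_k=256):
    """One decode step: exact attention over the recent in-VRAM window,
    plus the top_k most relevant offloaded entries found via K-similarity."""
    # Score the offloaded K index (lives in system RAM) against the query.
    scores = old_k @ query                       # (n_old,)
    idx = np.argpartition(scores, -top_k)[-top_k:]
    # Materialize only the selected K/V rows (this is the CPU->GPU copy).
    k = np.concatenate([old_k[idx], recent_k])   # (top_k + n_recent, d)
    v = np.concatenate([old_v[idx], recent_v])
    # Standard softmax attention over the reduced working set.
    logits = k @ query
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ v                                 # (d,)

d, n_old, n_recent = 64, 10_000, 512
rng = np.random.default_rng(0)
out = kiv_lookup(rng.standard_normal(d),
                 rng.standard_normal((n_recent, d)),
                 rng.standard_normal((n_recent, d)),
                 rng.standard_normal((n_old, d)),
                 rng.standard_normal((n_old, d)))
print(out.shape)  # (64,)
```

The point of the sketch: attention cost per step stays proportional to `top_k + n_recent`, not total context length, which is why decode stays near-constant at 1M tokens.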
Results on a 4070 12GB with Gemma 4 E2B (4-bit):
- 1M tokens, 12MB KIV VRAM overhead, ~6.5GB total GPU usage
- 4.1 tok/s at 1M context, 12.9 tok/s at 4K
- 70/70 needle-in-haystack tests passed across 4K-32K
- Perfect phonebook lookup (unique names) at 58K tokens
- Prefill at 1M takes about 4.3 minutes (one-time cost)
- Decode is near-constant regardless of context length
The core finding that makes this work: K vectors are smooth and structured, which makes them great search indices. V vectors are high-entropy and chaotic, so don't try to compress them, just retrieve them on demand. Use K to decide which V entries deserve to exist in VRAM at any given step.
No model weights are modified. No retraining or distillation. It hooks into the HuggingFace cache interface and registers a custom attention function. The model has no idea it's talking to a tiered memory system. Works with any model that uses DynamicCache. Tested on Gemma 4, Qwen2.5, TinyLlama, and Phi-3.5 across MQA/GQA/MHA.
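For intuition on what "hooks into the cache interface" means, here's a standalone sketch that mirrors the `update(key_states, value_states, layer_idx)` contract of HuggingFace's `DynamicCache` with NumPy stand-ins. The class and tier names are hypothetical; the real KIV presumably subclasses the HF cache and pairs it with a registered attention function:

```python
import numpy as np

class TieredCacheSketch:
    """Drop-in-style cache: appends K/V per layer like DynamicCache.update(),
    but spills everything older than `window` out of the hot tier."""
    def __init__(self, window=512):
        self.window = window
        self.hot = {}    # layer_idx -> (K, V) kept "in VRAM"
        self.cold = {}   # layer_idx -> (K, V) offloaded to "system RAM"

    def update(self, key_states, value_states, layer_idx):
        k, v = self.hot.get(layer_idx, (key_states[:0], value_states[:0]))
        k = np.concatenate([k, key_states])
        v = np.concatenate([v, value_states])
        if len(k) > self.window:
            spill = len(k) - self.window
            ck, cv = self.cold.get(layer_idx, (k[:0], v[:0]))
            self.cold[layer_idx] = (np.concatenate([ck, k[:spill]]),
                                    np.concatenate([cv, v[:spill]]))
            k, v = k[spill:], v[spill:]
        self.hot[layer_idx] = (k, v)
        # A custom attention function would combine `hot` with a
        # K-indexed top-k retrieval from `cold` at this point.
        return k, v

cache = TieredCacheSketch(window=4)
for step in range(6):
    k, v = cache.update(np.ones((1, 8)), np.ones((1, 8)), layer_idx=0)
print(len(k), len(cache.cold[0][0]))  # 4 2
```

Because the model only ever sees the K/V tensors returned from `update()`, it stays oblivious to the tiering, which is what makes this a drop-in replacement.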
There are real limitations and I'm upfront about them in the repo. Bounded prefill loses some info for dense, similar-looking data. Collision disambiguation doesn't work, but that's the 4-bit 2B model struggling, not the cache. Two-hop reasoning fails for the same reason. CPU RAM scales linearly (5.8GB at 1M tokens).
Still actively optimizing decode speed, especially at longer contexts. The current bottleneck is CPU-to-GPU transfer for retrieved tokens, not the model itself. Plenty of room to improve here.
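One standard way to attack that transfer bottleneck is to coalesce the scattered retrieved V rows into a single preallocated staging buffer, so each step does one large host-to-device copy instead of `top_k` small ones. A NumPy sketch of the gather side (function and variable names are illustrative; with PyTorch the buffer would be `pin_memory()`'d and shipped with a single non-blocking `.to('cuda')`):

```python
import numpy as np

def stage_retrieved(old_v, idx, staging):
    """Gather selected V rows into one contiguous, reusable staging buffer
    so the upload is a single large copy rather than many scattered ones."""
    np.take(old_v, idx, axis=0, out=staging[:len(idx)])
    return staging[:len(idx)]  # contiguous view, ready for one upload

old_v = np.random.default_rng(1).standard_normal((100_000, 64)).astype(np.float32)
staging = np.empty((256, 64), dtype=np.float32)   # allocated once, reused
idx = np.arange(0, 256 * 7, 7)                    # 256 "retrieved" row indices
chunk = stage_retrieved(old_v, idx, staging)
print(chunk.flags['C_CONTIGUOUS'], chunk.shape)  # True (256, 64)
```

Reusing the buffer also avoids per-step allocation, which matters when the copy happens on every decoded token.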
GitHub: https://github.com/Babyhamsta/KIV (can be installed as a local pip package, no official pip package yet)
Happy to answer questions about the architecture or results. Would love to see what happens on bigger models with more VRAM if anyone wants to try it.
Voxandr@reddit
What models did you test on? Can this be developed for vLLM/llama.cpp?
ThyGreatOof@reddit (OP)
Tested fully on Gemma 4 E2B, Qwen2.5 3B, TinyLlama 1.1B, and Phi-3.5 mini across MQA/GQA/MHA. Topology auto-detection is verified on Llama 3/3.2, Mistral, Gemma 2/3, and Cohere Command R but those haven't had full eval suites run yet. Should work on any HuggingFace model that uses DynamicCache.
For vLLM/llama.cpp, not currently but that's on the radar. Right now it hooks into HuggingFace's cache interface and attention function registry which makes it portable across HF models but doesn't translate directly to other backends. A vLLM integration would need to plug into their block-level cache manager, and llama.cpp would need a C++ port of the page-based retrieval. Both are doable but significant effort. If there's enough interest I'd prioritize vLLM first since it's Python and closer to the current implementation.
Voxandr@reddit
Yeah, better to support vLLM first.
Voxandr@reddit
Post removed by mod. Make a post without generating it using AI first, I guess.
ThyGreatOof@reddit (OP)
Don't see why it matters that Claude helped with documentation etc. The research is still valid and true, fully tested on my system.
Voxandr@reddit
We keep getting AI-slop project spam here, and from the post yours sounds like one.