(Yet Another) KV cache calculator - kvanta.vcerny.cz

Posted by Fun-Purple-7737@reddit | LocalLLaMA | 6 comments

Hello everyone, I thought all the public web-based KV cache calculators kinda suck, so I decided to create one I would like to use myself: KVANTA

https://kvanta.vcerny.cz

It should support any LLM/VLM from Hugging Face; if it doesn't, let me know!

(also, it's Apache 2.0)

[-]

tecneeq@reddit

Type qwen 3.6, see unsloth 27b-MTP, select it.

Error!

Slop'ed once again.

[-]

Moreover, the calculations seem incorrect, or at least it's completely unclear which inference engine this is supposed to model.
For llama.cpp the numbers are completely different. For example, Qwen uses much less, 60 GB in this scenario, and Gemma uses much more, 96 GB; this "calculator" shows the opposite.

[-]

tecneeq@reddit

It's really interesting that nobody is able to solve this seemingly simple problem once and for all.

[-]

Fun-Purple-7737@reddit (OP)

a) Of course it's vibe-coded; would you expect anything else in 2026? I do not think I am the crazy one here...

b) It's an early version, yes, but that is why I posted here: to get feedback... silly me!

c) I did not really account for llama.cpp/GGUF, which is ironic here. My bad! I run inference on private H200s, so I mostly had vLLM in mind, which is why there is an overhead field in %.
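
A minimal sketch of how a percentage overhead could be layered on top of a raw estimate. How KVANTA actually applies its overhead field is not stated in this thread, so the multiplicative formula and the name `with_overhead` are assumptions for illustration only.

```python
def with_overhead(raw_bytes, overhead_pct):
    """Inflate a raw memory estimate by a percentage (assumed multiplicative)."""
    return raw_bytes * (1 + overhead_pct / 100)


# Hypothetical numbers: a 4 GiB raw estimate with a 10% overhead.
raw = 4 * 2**30
print(f"{with_overhead(raw, 10) / 2**30:.2f} GiB")  # prints "4.40 GiB"
```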

Oh well, thanks for the warm introduction to Reddit reality! :)

[-]

ttkciar@reddit

Violates Rule Three: Low-effort (repo was entirely vibe-coded a few hours before post)

[-]

JGeek00@reddit

You have to add an option to select the model quant.