I made another LLM VRAM calculator
Posted by PreferenceAsleep8093@reddit | LocalLLaMA | 9 comments
Most calculators just guess based on parameters, so I made one that actually pulls the config.json from Hugging Face to calculate the K/V cache and runtime overhead.
What it does:
- Handles K/V quantization (Q8/Q4) and context scaling.
- Includes bandwidth-based speed estimates.
- No ads, no tracking, just a static site.
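As a reference point, here is a minimal sketch of the K/V-cache arithmetic such a calculator could do from config.json fields. The field names (`num_hidden_layers`, `num_key_value_heads`, `head_dim`) follow common Hugging Face conventions but vary by architecture, and the example numbers are illustrative, not pulled from a real config:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """K/V cache size: 2 tensors (K and V) per layer, one vector per
    KV head per token, at the chosen cache precision."""
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem

# Illustrative numbers resembling a Llama-3-8B-style config
# (32 layers, 8 KV heads, head_dim 128) at 8192 context:
fp16 = kv_cache_bytes(32, 8, 128, 8192, bytes_per_elem=2)  # FP16 cache
q8   = kv_cache_bytes(32, 8, 128, 8192, bytes_per_elem=1)  # Q8 K/V quantization
print(fp16 / 2**30, "GiB FP16 vs", q8 / 2**30, "GiB Q8")  # → 1.0 GiB vs 0.5 GiB
```

Context scaling and Q4 fall out of the same formula by changing `context_len` and `bytes_per_elem`.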
Aphid_red@reddit
Neat, but rather incomplete on the system front.
Prompt processing speed, which matters for most uses, is based on FLOPs (which you can also fetch for known GPU models). I would suggest doing the following things to make it much more useful:
But you're still not really done if you want this thing to be more than 'I have this hardware, which model should I use?' and also answer 'I have this much money... what hardware and model get the best result?' For that, you need to factor in price. It's possible to fetch/estimate that as well: you'd also need the lifespan of the GPU/CPU (5 years?), a usage factor (say 4 hours/day means ~16%), the cost of one kWh, and the power usage of the GPU, to arrive at a $/token estimate for a given hardware config.
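The $/token idea above can be sketched roughly like this. Everything here (function name, $800 price, 350 W, 30 tokens/sec) is a hypothetical example input, and it only counts hardware amortization plus GPU electricity, ignoring the rest of the system:

```python
def dollars_per_token(hw_price, lifespan_years, hours_per_day, price_per_kwh,
                      watts, tokens_per_sec):
    """Amortize hardware price plus electricity over the tokens
    generated during the machine's useful life."""
    active_seconds = lifespan_years * 365 * hours_per_day * 3600
    total_tokens = tokens_per_sec * active_seconds
    energy_kwh = watts / 1000 * active_seconds / 3600
    return (hw_price + energy_kwh * price_per_kwh) / total_tokens

# Hypothetical example: a used 3090 at $800, 5-year lifespan, 4 h/day,
# $0.30/kWh, 350 W under load, 30 tokens/sec:
cost = dollars_per_token(800, 5, 4, 0.30, 350, 30)
print(f"${cost:.2e} per token")
```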
For example: 'I have a 3090, a 4070Ti, and a 5060Ti on a Threadripper system; how long do I wait for a 500-token reply to my 10K input from my 70B model?' <== Type of question to answer. Optimizing can get a little involved.
PreferenceAsleep8093@reddit (OP)
Wow! Thank you for such detailed feedback! You've given me some homework.
For the multi-GPU selection, I did decide to skip that for now just because I personally don't have that kind of rig and wouldn't really be able to verify any of the numbers with my actual hardware.
Aphid_red@reddit
I think you can extrapolate to multi-GPU without too much trouble if you consult some resources on it. Basically, there are two ways to do it, so you calculate both and take the bigger number. I can show you how to do the math.
Tensor parallel:
All of the GPUs run at the speed of the slowest GPU, and use the memory of the least capable one (so this works best if they're all the same type). You can add the FLOPs of the GPUs for prompt processing, and you can add memory bandwidth for generation speed.
The second requirement is that the number of GPUs has to divide the model's 'KV heads'. For example, Behemoth has 8 KV heads, so it can be split across 4 or 8 GPUs.
However, you also need to check how much network you need. I just found https://www.naddod.com/blog/tensor-parallelism?srsltid=AfmBOor6SPWoJgKrBrNoffIoJv9KrjCfl2Vk_bpCHz8iauh3HYdFheMR which is very informative. You take the TFlops of the GPU, and divide it by a factor.
This factor is: 6 x embedding_size / num_gpus / bytes_param.
So with an extreme example, say Behemoth-123B running on 4x 5090s (let's say we're running in 16-bit): that's 6 x 12288 / 4 / 2 = 9216. Say the motherboard is running PCIe 4.0, with 16 lanes to each GPU.
The 5090 has 419 Tensor TFlops (ignore sparsity). That's 419,000 GFLOPs.
So, you need 419000 / 9216 = 45.46 GB/s of network bandwidth to run at full speed. Per https://en.wikipedia.org/wiki/PCI_Express, you only get 31.508 GB/s. So now you do 31.508 / 45.46 = 0.693.
Due to the network bottleneck, the four 5090s can only run at 69.3% of their rated speed, so each one effectively provides only about 290 TFlops. This has zero impact on generation speed, but prompt processing will be a little slower.
Once you've corrected for the network, just multiply the GPU specs: treat all four 5090s as a single super-5090 with four times the bandwidth and four times the compute (or, in this case, because of the PCIe bottleneck, 4 x ~0.69 times the compute). Then run the calculation normally to get your result.
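The tensor-parallel correction above can be sketched as follows. The function name and structure are my own; the numbers mirror the 4x 5090 example (419 TFlops each, 12288-dim model in 16-bit, PCIe 4.0 x16 at 31.508 GB/s per GPU):

```python
def tp_effective_tflops(per_gpu_tflops, num_gpus, embed_dim, bytes_per_param,
                        link_gb_s):
    """Tensor-parallel compute after the interconnect correction:
    required bandwidth (GB/s) = GPU GFLOPs / (6 * d_model / n_gpus / bytes)."""
    factor = 6 * embed_dim / num_gpus / bytes_per_param
    needed_gb_s = per_gpu_tflops * 1000 / factor     # GB/s needed for full speed
    utilization = min(1.0, link_gb_s / needed_gb_s)  # capped at 100%
    return num_gpus * per_gpu_tflops * utilization

# 4x 5090 on PCIe 4.0 x16 — about 0.69 utilization, ~1160 TFlops combined:
print(tp_effective_tflops(419, 4, 12288, 2, 31.508))
```

With a fast enough link (e.g. NVLink-class bandwidth), `utilization` saturates at 1.0 and you simply get `num_gpus * per_gpu_tflops`.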
Sequential:
This is what llama.cpp and Ollama do by default. It's much easier to set up, but it tends to be slower than tensor parallel when you can meet tensor parallel's requirements.
Really simple to calculate! Here, the network requirement is negligible. You only need to move a single vector from one GPU to the other with each token. It's kilobytes, it might as well be zero. So, we can ignore network complexity and just run through each one in series. The trick is to work with reciprocals (calculate in seconds per token rather than tokens per second).
The layers are divided between the GPUs. Let's take a Gemma-3 27B Q8 divided between a 3090 (140 TFlops, 1008 GB/s bandwidth) and a 4070Ti (160 TFlops, 504 GB/s). Say we put 20 layers on the 4070Ti, and 42 layers on the 3090.
Speed-wise, you're looking at (42/62 * 27) / 1008 = 18.1 ms per token for the 42 layers on the 3090.
Prompt processing adds another (42/62 * 27) * 20 / 140000 = 2.6 ms, for a total of 20.7 ms.
Now you look at the 4070Ti. (20/62 * 27) / 504 = 17.3 ms per token.
And the prompt processing: (20/62 * 27) * 20 / 160000 = 1.1 ms per token, for a total of 18.4 ms.
Add both up; 20.7 + 18.4 = 39.1 ms/token. So you should see roughly 25.6 tokens/sec.
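The sequential walkthrough above can be sketched as a small function. It reproduces the same arithmetic, including the compute factor of 20 used in the worked example (I'm taking that factor as given, not deriving it):

```python
def seq_tokens_per_sec(shards, model_gb, total_layers, flops_factor=20):
    """Sequential (pipeline) estimate: sum seconds-per-token across GPUs,
    then take the reciprocal. shards: list of (layers, tflops, bw_gb_s).
    flops_factor=20 mirrors the compute factor in the worked example."""
    total_ms = 0.0
    for layers, tflops, bw_gb_s in shards:
        shard_gb = layers / total_layers * model_gb
        total_ms += shard_gb / bw_gb_s * 1000        # memory-bound generation
        total_ms += shard_gb * flops_factor / tflops  # per-token compute term
    return 1000 / total_ms

# Gemma-3 27B Q8: 42 layers on a 3090 (140 TFlops, 1008 GB/s),
# 20 layers on a 4070Ti (160 TFlops, 504 GB/s) — roughly 25.6 tokens/sec:
print(seq_tokens_per_sec([(42, 140, 1008), (20, 160, 504)], 27, 62))
```

Working in ms/token and summing, as the comment suggests, is what makes mixed-speed GPU stacks easy to handle.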
Both:
When you have many GPUs, you can combine the two. This should now be easy: First, make 'tensor parallel' groups to handle a number of layers each, and calculate how each of these groups does. Then apply the steps for 'sequential' to them. For example, in the 4 5090s case, you could divide the behemoth model in half, and run two sets of 'tensor parallel'. (The result would likely be much slower, but this is an illustrative example.)
Aphid_red@reddit
Formula:
Multiply the (normalized) benchmark score by f(x) = 1 / (1 + exp(6x / (A - B) + 3(A + B) / (B - A))), where x is the measured speed, A is your lower target, and B is your upper target. This produces a smooth S-curve with ~0.05 at x = A and ~0.95 at x = B.
This is rather useful because it heavily down-ranks models that can't beat the minimum speed without overvaluing models that are much faster than the target speed (you can't read that fast anyway, so there's not much point).
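A sketch of that scoring curve (note the x term in the exponent; with a target band of 5 to 30 tokens/sec, the endpoints come out at roughly 0.05 and 0.95):

```python
import math

def speed_score(x, a, b):
    """Smooth S-curve: ~0.05 at x=a (lower target), ~0.95 at x=b (upper target).
    Multiply a normalized benchmark score by this to down-rank too-slow models."""
    return 1.0 / (1.0 + math.exp(6 * x / (a - b) + 3 * (a + b) / (b - a)))

# Example band: 5..30 tokens/sec. Very fast models plateau near 1.0
# instead of dominating the ranking:
for tps in (5, 17.5, 30, 100):
    print(tps, round(speed_score(tps, 5, 30), 3))
```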
Potential-Gold5298@reddit
Not everyone has a modern GPU (for example, I have a GTX 560 with 1 GB, which makes no sense to use) – add a "no GPU" option.
I tried to enter my configuration: 32 GB RAM, with the weakest video card option selected (8 GB VRAM). The advice can be seen in the screenshot. In reality, I use Gemma 4 26B-A4B as my main everyday model, and it is significantly better and faster than Qwen3-8B.
The idea is good, but it requires some improvement.
YourNightmar31@reddit
What exactly are the calculations you do to make the suggestions? Because I think it is too strict.
For my 3060 12GB + 128GB RAM system it says Qwen2.5-Coder 7B Instruct. This is an old model. Qwen3.6 35BA3B runs fine with 128K context; the speed is not the best, but usable IMO. This model is years ahead of Qwen2.5-Coder in terms of quality and results.
For my 3090TI 24GB + 64GB RAM system it says Qwen2.5-Coder 14B Instruct. But considering Qwen3.6 35BA3B already worked fine on the 3060, it is still a better model than Qwen2.5-Coder 14B, and it will run better than on the 3060, where it was already usable.
libregrape@reddit
I love the UX. It does exactly what it claims to do, with no extra BS. I am not sure the results are good, though. For my system (RTX 5060 Ti 16GB, 32GB RAM) it recommends Qwen 3 8B. But in practice, I can run Qwen 3.6 35B at IQ3 and get 52 tps at 200K context, which is a superior choice over the 8B. The tool needs to account for the efficiency of MoEs, I think.
PreferenceAsleep8093@reddit (OP)
That's a good point! Thank you for the feedback. I actually have not factored MoE into the tool yet. In practice it does make a huge difference, but it also requires modifications to the calculation formula.
libregrape@reddit
I think it would be interesting to see a tool that can give you model recommendations based on:
And, provided those params, it would try to select the best configuration while penalizing quantization.