I made another LLM VRAM calculator
Posted by PreferenceAsleep8093@reddit | LocalLLaMA | 9 comments
Most calculators just guess based on parameters, so I made one that actually pulls the config.json from Hugging Face to calculate the K/V cache and runtime overhead.
What it does:
- Handles K/V quantization (Q8/Q4) and context scaling.
- Includes bandwidth-based speed estimates.
- No ads, no tracking, just a static site.
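As a reference point, here is a minimal sketch of the K/V-cache arithmetic such a calculator could do from config.json fields. The field names (`num_hidden_layers`, `num_key_value_heads`, `head_dim`) follow common Hugging Face conventions but vary by architecture, and the example numbers are illustrative, not pulled from a real config:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """K/V cache size: 2 tensors (K and V) per layer, one vector per
    KV head per token, at the chosen cache precision."""
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem

# Illustrative numbers resembling a Llama-3-8B-style config
# (32 layers, 8 KV heads, head_dim 128) at 8192 context:
fp16 = kv_cache_bytes(32, 8, 128, 8192, bytes_per_elem=2)  # FP16 cache
q8   = kv_cache_bytes(32, 8, 128, 8192, bytes_per_elem=1)  # Q8 K/V quantization
print(fp16 / 2**30, "GiB FP16 vs", q8 / 2**30, "GiB Q8")  # → 1.0 GiB vs 0.5 GiB
```

Context scaling and Q4 fall out of the same formula by changing `context_len` and `bytes_per_elem`.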
Aphid_red@reddit
Neat, but rather incomplete on the system front.
Prompt processing speed, which matters for most uses, is based on FLOPs (which you can also fetch for known GPU models). I would suggest doing the following things to make it much more useful:
But you're still not really done if you want this thing to be more than 'I have this hardware, which model should I use?' and also answer 'I have this much money... what hardware and model get the best result?' For that, you need to factor in price. It's possible to fetch/estimate that as well: you'd also need the lifespan of the GPU/CPU (5 years?), a usage factor (say 4 hours/day means ~16%), the cost of one kWh, and the power usage of the GPU, to arrive at a $/token estimate for a given hardware config.
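The $/token idea above can be sketched roughly like this. Everything here (function name, $800 price, 350 W, 30 tokens/sec) is a hypothetical example input, and it only counts hardware amortization plus GPU electricity, ignoring the rest of the system:

```python
def dollars_per_token(hw_price, lifespan_years, hours_per_day, price_per_kwh,
                      watts, tokens_per_sec):
    """Amortize hardware price plus electricity over the tokens
    generated during the machine's useful life."""
    active_seconds = lifespan_years * 365 * hours_per_day * 3600
    total_tokens = tokens_per_sec * active_seconds
    energy_kwh = watts / 1000 * active_seconds / 3600
    return (hw_price + energy_kwh * price_per_kwh) / total_tokens

# Hypothetical example: a used 3090 at $800, 5-year lifespan, 4 h/day,
# $0.30/kWh, 350 W under load, 30 tokens/sec:
cost = dollars_per_token(800, 5, 4, 0.30, 350, 30)
print(f"${cost:.2e} per token")
```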
For example: 'I have a 3090, a 4070Ti, and a 5060Ti on a Threadripper system; how long do I wait for a 500-token reply to my 10K input from my 70B model?' <== Type of question to answer. Optimizing can get a little involved.
PreferenceAsleep8093@reddit (OP)
Wow! Thank you for such detailed feedback! You've given me some homework.
For the multi-GPU selection, I did decide to skip that for now just because I personally don't have that kind of rig and wouldn't really be able to verify any of the numbers with my actual hardware.
Aphid_red@reddit
I think you can extrapolate to multi-GPU without too much trouble if you consult some resources on it. Basically, there are two ways to do it, so you calculate both and take the bigger number. I can show you how to do the math.
Tensor parallel:
All of the GPUs run at the speed of the slowest GPU, and use the memory of the least capable one (so this works best if they're all the same type). You can add the FLOPs of the GPUs for prompt processing, and you can add memory bandwidth for generation speed.
The second requirement is that the number of GPUs has to divide the model's 'KV heads'. For example, Behemoth has 8 KV heads, so it can be split across 4 or 8 GPUs.
However, you also need to check how much network you need. I just found https://www.naddod.com/blog/tensor-parallelism?srsltid=AfmBOor6SPWoJgKrBrNoffIoJv9KrjCfl2Vk_bpCHz8iauh3HYdFheMR which is very informative. You take the TFlops of the GPU, and divide it by a factor.
This factor is: 6 x embedding_size / num_gpus / bytes_param.
So with an extreme example, say Behemoth-123B running on 4x 5090s (let's say we're running in 16-bit): that's 6 x 12288 / 4 / 2 = 9216. Say the motherboard is running PCIe 4.0, with 16 lanes to each GPU.
The 5090 has 419 Tensor TFlops (ignore sparsity). That's 419,000 GFLOPs.
So, you need 419000 / 9216 = 45.46 GB/s of network bandwidth to run at full speed. Per https://en.wikipedia.org/wiki/PCI_Express, you only get 31.508 GB/s. So now you do 31.508 / 45.46 = 0.693.
Due to the network bottleneck, the four 5090s can only run at 69.3% of their rated speed, so each one effectively provides only about 290 TFlops. This has zero impact on generation speed, but prompt processing will be a little slower.
Once you've corrected for the network, just multiply the GPU specs: treat all four 5090s as a single super-5090 with four times the bandwidth and four times the compute (or, in this case, because of the PCIe bottleneck, 4 x ~0.69 times the compute). Then run the calculation normally to get your result.
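The tensor-parallel correction above can be sketched as follows. The function name and structure are my own; the numbers mirror the 4x 5090 example (419 TFlops each, 12288-dim model in 16-bit, PCIe 4.0 x16 at 31.508 GB/s per GPU):

```python
def tp_effective_tflops(per_gpu_tflops, num_gpus, embed_dim, bytes_per_param,
                        link_gb_s):
    """Tensor-parallel compute after the interconnect correction:
    required bandwidth (GB/s) = GPU GFLOPs / (6 * d_model / n_gpus / bytes)."""
    factor = 6 * embed_dim / num_gpus / bytes_per_param
    needed_gb_s = per_gpu_tflops * 1000 / factor     # GB/s needed for full speed
    utilization = min(1.0, link_gb_s / needed_gb_s)  # capped at 100%
    return num_gpus * per_gpu_tflops * utilization

# 4x 5090 on PCIe 4.0 x16 — about 0.69 utilization, ~1160 TFlops combined:
print(tp_effective_tflops(419, 4, 12288, 2, 31.508))
```

With a fast enough link (e.g. NVLink-class bandwidth), `utilization` saturates at 1.0 and you simply get `num_gpus * per_gpu_tflops`.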
Sequential:
This is what llama.cpp and Ollama do by default. It's much easier to set up, but it tends to be slower than tensor parallel when you can meet tensor parallel's requirements.
Really simple to calculate! Here, the network requirement is negligible. You only need to move a single vector from one GPU to the other with each token. It's kilobytes, it might as well be zero. So, we can ignore network complexity and just run through each one in series. The trick is to work with reciprocals (calculate in seconds per token rather than tokens per second).
The layers are divided between the GPUs. Let's take a Gemma-3 27B Q8 divided between a 3090 (140 TFlops, 1008 GB/s bandwidth) and a 4070Ti (160 TFlops, 504 GB/s). Say we put 20 layers on the 4070Ti, and 42 layers on the 3090.
Speed-wise, you're looking at (42/62 * 27) / 1008 = 18.1 ms per token for the 42 layers on the 3090.
Prompt processing adds another (42/62 * 27) * 20 / 140000 = 2.6 ms, for a total of 20.7 ms.
Now you look at the 4070Ti. (20/62 * 27) / 504 = 17.3 ms per token.
And the prompt processing: (20/62 * 27) * 20 / 160000 = 1.1 ms per token, for a total of 18.4 ms.
Add both up; 20.7 + 18.4 = 39.1 ms/token. So you should see roughly 25.6 tokens/sec.
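The sequential walkthrough above can be sketched as a small function. It reproduces the same arithmetic, including the compute factor of 20 used in the worked example (I'm taking that factor as given, not deriving it):

```python
def seq_tokens_per_sec(shards, model_gb, total_layers, flops_factor=20):
    """Sequential (pipeline) estimate: sum seconds-per-token across GPUs,
    then take the reciprocal. shards: list of (layers, tflops, bw_gb_s).
    flops_factor=20 mirrors the compute factor in the worked example."""
    total_ms = 0.0
    for layers, tflops, bw_gb_s in shards:
        shard_gb = layers / total_layers * model_gb
        total_ms += shard_gb / bw_gb_s * 1000        # memory-bound generation
        total_ms += shard_gb * flops_factor / tflops  # per-token compute term
    return 1000 / total_ms

# Gemma-3 27B Q8: 42 layers on a 3090 (140 TFlops, 1008 GB/s),
# 20 layers on a 4070Ti (160 TFlops, 504 GB/s) — roughly 25.6 tokens/sec:
print(seq_tokens_per_sec([(42, 140, 1008), (20, 160, 504)], 27, 62))
```

Working in ms/token and summing, as the comment suggests, is what makes mixed-speed GPU stacks easy to handle.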
Both:
When you have many GPUs, you can combine the two. This should now be easy: First, make 'tensor parallel' groups to handle a number of layers each, and calculate how each of these groups does. Then apply the steps for 'sequential' to them. For example, in the 4 5090s case, you could divide the behemoth model in half, and run two sets of 'tensor parallel'. (The result would likely be much slower, but this is an illustrative example.)
Aphid_red@reddit
Formula:
Multiply the (normalized) benchmark score by f(x) = 1 / (1 + exp(6x / (A - B) + 3(A + B) / (B - A))), where x is the measured speed, A is your lower target, and B is your upper target. This produces a smooth S-curve with ~0.05 at x = A and ~0.95 at x = B.
This is rather useful because it heavily down-ranks models that can't beat the minimum speed without overvaluing models that are much faster than the target speed (you can't read that fast anyway, so there's not much point).
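A sketch of that scoring curve (note the x term in the exponent; with a target band of 5 to 30 tokens/sec, the endpoints come out at roughly 0.05 and 0.95):

```python
import math

def speed_score(x, a, b):
    """Smooth S-curve: ~0.05 at x=a (lower target), ~0.95 at x=b (upper target).
    Multiply a normalized benchmark score by this to down-rank too-slow models."""
    return 1.0 / (1.0 + math.exp(6 * x / (a - b) + 3 * (a + b) / (b - a)))

# Example band: 5..30 tokens/sec. Very fast models plateau near 1.0
# instead of dominating the ranking:
for tps in (5, 17.5, 30, 100):
    print(tps, round(speed_score(tps, 5, 30), 3))
```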
Potential-Gold5298@reddit
Not everyone has a modern GPU (for example, I have a GTX 560 with 1 GB, which makes no sense to use) – add a "no GPU" option.
I tried to enter my configuration: 32 GB RAM, with the weakest video card option selected (8 GB VRAM). The advice can be seen in the screenshot. In reality, I use Gemma 4 26B-A4B as my main everyday model, and it is significantly better and faster than Qwen3-8B.
The idea is good, but it requires some improvement.
YourNightmar31@reddit
What exactly are the calculations you do to make the suggestions? Because I think it is too strict.
For my 3060 12GB + 128GB RAM system it says Qwen2.5-Coder 7B Instruct. This is an old model. Qwen3.6 35BA3B runs fine with 128K context; the speed is not the best, but usable IMO. This model is years ahead of Qwen2.5-Coder in terms of quality and results.
For my 3090TI 24GB + 64GB RAM system it says Qwen2.5-Coder 14B Instruct. But considering Qwen3.6 35BA3B already worked fine on the 3060, it is still a better model than Qwen2.5-Coder 14B, and it will run better than on the 3060, where it was already usable.
libregrape@reddit
I love the UX. It does exactly what it claims to do, with no extra BS. I am not sure the results are good, though. For my system (RTX 5060 Ti 16GB, 32GB RAM) it recommends Qwen 3 8B. But in practice, I can run Qwen 3.6 35B at IQ3 and get 52 tps at 200K context, which is a superior choice over the 8B. The tool needs to account for the efficiency of MoEs, I think.
PreferenceAsleep8093@reddit (OP)
That's a good point! Thank you for the feedback. I actually have not factored MoE into the tool yet. In practice it does make a huge difference, but it also requires modifications to the calculation formula.
libregrape@reddit
I think it would be interesting to see a tool that can give you model recommendations based on:
And, provided those params, it would try to select the best configuration while penalizing quantization.