Nvidia H100 (94GB VRAM) - should I run llama.cpp or vLLM for 30-user inference?
Posted by Rabooooo@reddit | LocalLLaMA | 22 comments
I was given the great opportunity to borrow an H100 with 94GB VRAM at work until it is needed by a customer. (No idea how much system RAM I will get, but I guess they are a bit flexible on this.)
- I want to build an inference endpoint that can handle up to 30 users.
- I want a reasonably large context, say 131,072-262,144 tokens.
- I think in most situations, realistically speaking, not more than 10-15 users will use it concurrently.
- Main use for this will be tools like Pi and OpenCode.
I was thinking of using Qwen3.6-27B unless anyone can recommend a better one for agentic coding given the constraints.
- Should I use vLLM or llama.cpp? Will llama.cpp be able to handle the concurrency?
- If running on llama.cpp I would probably use the UD-Q6_K_XL or UD-Q8_K_XL quant from Unsloth.
- If running on vLLM I have no idea what quant to use. Some advice here would be great.
- Is there any good tool to benchmark "concurrent users"?
Bohdanowicz@reddit
vLLM or SGLang.
cognitium@reddit
I compared llama.cpp and vLLM on an RTX 6000 Pro and vLLM was about 3x faster with the same model. Qwen 27B is slow though. 35B is much faster.
swagonflyyyy@reddit
vLLM, it's specifically built for that.
TechNerd10191@reddit
Use vLLM (or TensorRT/SGLang). However, with vLLM, GGUF models are not the best. Instead, pick Qwen3.6-27B-FP8 (officially released by Qwen) or Nvidia's NVFP4 model (select the Marlin backend in that case) and you will have ~60 GB for KV cache.
DeedleDumbDee@reddit
vLLM tells you directly in their docs not to use GGUF because support is unstable and they're still experimenting with the implementation.
Juulk9087@reddit
I get about 30 TPS more on SGLang than on vLLM.
dionysio211@reddit
Will everyone be using it at once, all the time? vLLM/SGLang would be better on this setup, but you could also do it in llama.cpp if you test your configs well. The gap between the two platforms has narrowed substantially over the past 6 months. We have a rig we are testing at 64 concurrency in a modified llama.cpp setup and it's doing very well. We went back and forth testing vLLM but found that in this particular setup llama.cpp had a slight edge, though that's not usually the case.
In both platforms, you can cycle slots in and out of RAM (-cram in llama.cpp). Pulling a slot from RAM incurs about a 0.3 second penalty in llama.cpp. You can also persist slots to NVMe (--cache-idle-slots in llama.cpp), which is why NVMe prices are so high right now. That's a longer delay but still better than reprocessing 200K tokens. If you are using full context, it is about 10GB in f16 for the 27B model when the cache is full. It's 5GB at 8-bit. So after model loading, you would have the capacity for around 11-12 full slots of working memory (double if you are looking to use around 125K context) and then you would cycle them out. In reality, the slots are never going to be full all the time, so your actual concurrency could be closer to 20. Slots are the way it's conceived in llama.cpp even though it's now a unified KV cache by default, much like vLLM.
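The slot arithmetic above can be sketched roughly as follows. The 94GB card and the 10GB / 5GB per-slot figures come from this thread; the 35GB allowance for weights and runtime overhead is purely an assumption picked for illustration, and halving the slot size for ~125K context assumes the cache grows linearly with context length.

```python
def full_slots(vram_gb, weights_gb, slot_gb):
    """Count how many completely full KV-cache slots fit after loading the model."""
    free_gb = vram_gb - weights_gb
    return int(free_gb // slot_gb)


VRAM_GB = 94       # H100 from the original post
WEIGHTS_GB = 35    # ASSUMPTION: weights plus runtime overhead, not stated in the thread
SLOT_F16_GB = 10   # full-context slot with f16 cache, figure from the comment above
SLOT_Q8_GB = 5     # full-context slot with 8-bit cache, figure from the comment above

print("8-bit cache, full context:", full_slots(VRAM_GB, WEIGHTS_GB, SLOT_Q8_GB))
print("f16 cache, full context:  ", full_slots(VRAM_GB, WEIGHTS_GB, SLOT_F16_GB))
# ASSUMPTION: ~125K context costs about half of a full-context slot.
print("8-bit cache, ~125K context:", full_slots(VRAM_GB, WEIGHTS_GB, SLOT_Q8_GB / 2))
```

With those inputs the 8-bit case lands in the 11-12 range quoted above, so that estimate presumably assumes a quantized cache. Swap in the real weight size once you know which quant you are loading.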
If you are going to use MTP or Eagle3, and you should, it stretches the acceptable concurrency very far. Even though it is a data center card and a good one, most inference relies on tensor parallelism for base level speed increases. Speculation is really the only way to accelerate it somewhat on a single card. Both systems have great speculative decoding options. If you are going to use llama.cpp, using ngram-mod and MTP would be a good combination.
I agree with others on the NVFP4 format on vLLM/SGLang. SGLang, as far as I know, is still the most efficient in terms of a cache pool. It's particularly good with prefix caching, especially if your system prompts tend to be shared (common IDE) and don't have unique information each time, such as date and time.
StardockEngineer@reddit
vllm. You don't have enough VRAM for coders. You'll need a lot more memory for the KV Cache required.
sourceholder@reddit
Can the KV cache be spilled over into (much slower) system memory during high-demand periods, and otherwise default to VRAM when user activity is lower?
gh0stwriter1234@reddit
You can, but not by default. You have to pass --cpu-offload-gb 10, where 10 is the number of GB to offload. In his case it would probably be more like 64GB, and maybe more would be needed to keep sessions around.
DinoAmino@reddit
Look into using LMCache with vLLM.
https://docs.lmcache.ai/
https://docs.vllm.ai/en/stable/examples/others/lmcache/
celsowm@reddit
llama.cpp is not good for multiple users
gh0stwriter1234@reddit
Not exactly. I'd say it's fine for up to about 4 users. After that, though, it starts being worth it to go to vLLM.
heeeeeeeeeeeee1@reddit
We'll most likely use Gemma4 but not for coding (decision making and some agentic work and sorting). But I believe we need FP16 and processing one task at a time (30-40sec/task)
__JockY__@reddit
Just take a day or two and test it with your use cases. Devise your test cases, pick llama.cpp and vLLM, and run your tests.
When you’re done you can curse yourself for ever bothering with llama.cpp for serving multiple parallel users. You’ll settle on vLLM and be glad it doesn’t buckle under the load.
Llama.cpp is good for running in single-user VRAM-constrained cases that need CPU offloading; it was never intended for your use case.
vLLM has been optimized for serving high concurrency, high throughput loads purely on GPU. This is your scenario. Use the tool for the job.
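A rough sketch of what such a test harness could look like, using only the Python standard library. `run_load` and `fake_send` are hypothetical names made up for this example; `fake_send` just sleeps and would need to be replaced with a real HTTP call to the OpenAI-compatible endpoint of whichever server you are testing.

```python
import asyncio
import statistics
import time


async def run_load(send, n_users, requests_per_user):
    """Simulate n_users in parallel, each sending requests back to back."""
    latencies = []

    async def user(user_id):
        for request_id in range(requests_per_user):
            start = time.perf_counter()
            await send(user_id, request_id)
            latencies.append(time.perf_counter() - start)

    wall_start = time.perf_counter()
    await asyncio.gather(*(user(u) for u in range(n_users)))
    wall_seconds = time.perf_counter() - wall_start

    return {
        "requests": len(latencies),
        "wall_seconds": wall_seconds,
        "mean_latency": statistics.mean(latencies),
        "max_latency": max(latencies),
    }


async def fake_send(user_id, request_id):
    # PLACEHOLDER: stands in for a real request to the server under test.
    await asyncio.sleep(0.01)


if __name__ == "__main__":
    print(asyncio.run(run_load(fake_send, n_users=15, requests_per_user=4)))
```

Run the same harness against both servers with 10, 15 and 30 simulated users and long agentic-style prompts, then compare how latency degrades as the user count goes up.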
Eyelbee@reddit
vLLM is recommended for concurrency and would probably be faster, but if you're able to fit it, llama.cpp will work. Probably slower, but it'd work. I don't know if it would fit though; you need to crunch some numbers. If no more than 15 users will use it, try concurrency 15 and see how much VRAM you need for that. Q6 might work but I'm not sure; KV cache behavior differs between models. Gemma was more forgiving on that. I wouldn't quantize the KV cache for 30 users if it's never going to get past 15. 15 may be tight but may actually fit. It's probably worth learning vLLM for this though.
Syst3m1c_An0maly@reddit
Hi, if you need this kind of concurrency, vLLM is the way to go. It adjusts KV cache usage dynamically depending on the requests to serve and is faster for parallel requests, especially on this kind of hardware (well optimized).
If you go with Qwen 3.6 27B on vLLM, use the FP8 quant (natively supported and faster on H100): half the size and close to no loss in quality vs unquantized FP16.
Syst3m1c_An0maly@reddit
Depending on your config (MTP on or not), you should be able to get between 1,200,000 and 1,500,000 tokens in your KV cache at a max context of 256K if you allocate the full H100 to the model.
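To put that token pool in terms of users, here is a quick back-of-the-envelope sketch. The pool sizes are the estimates above and the context lengths are the two from the original post; `full_contexts` is a hypothetical helper that only counts requests sitting at the very maximum context, which is a worst case.

```python
def full_contexts(pool_tokens, context_tokens):
    """How many requests at maximum context fit in the KV-cache token pool."""
    return pool_tokens // context_tokens


for pool in (1_200_000, 1_500_000):      # estimated KV-cache capacity in tokens
    for context in (131_072, 262_144):   # context lengths from the original post
        print(f"pool={pool:>9,} context={context:>7,} "
              f"full contexts={full_contexts(pool, context)}")
```

Real sessions rarely sit at the maximum, so the number of users actually served at once should be higher than this worst case.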
tilda0x1@reddit
vLLM seems to be the industry standard when it comes to production services and high concurrency.
I also run it on 2x Nvidia RTX A4000 PRO and it is stable.
Craftkorb@reddit
vLLM, and it's not even a contest.
skullfuckr42@reddit
10x the VRAM and you're good 👍
HVACcontrolsGuru@reddit
I’d try SGLang with MTP. With that much memory and that user count you can squeeze in 150-200K context windows. You can run an FP8 KV cache and keep the weights at FP16 to give some headroom. 10 users per 100GB is a decent rule of thumb for larger-context workloads.