Nvidia H100(94GB VRAM) - should I run llama.cpp or vllm for 30 users inference?

Posted by Rabooooo@reddit | LocalLLaMA | View on Reddit | 22 comments

I was given the great opportunity to borrow a H100 with 94GB VRAM at work until it is needed by a customer. (No idea how much system ram I will get, but I guess they are a bit flexible on this).

- I want to build a inference endpoint that can handle up to 30 users.
- I want a fairly reasonable big context, say 131,072-262,144.
- I think in most situations, realistically speaking, not more than 10-15 users will use it concurrently.
- Main use for this will be tools like Pi and OpenCode.

Was thinking to use Qwen3.6-27B unless anyone can recommend a better one for agentic coding given the constrains.
- Should I use vllm or llama.cpp? Will llama.cpp able to handle the concurrency?
- If running on llama.cpp I would probably use UD-Q6_K_XL or UD-Q8_K_XL quant from Unsloth.
- If running on vllm I have no idea on what quant to use? Some advice here would be great.
- Is there any good tool to benchmark "concurrent users"?