Is vLLM worth it?
Posted by Smooth-Cow9084@reddit | LocalLLaMA | View on Reddit | 35 comments
*For running n8n flows and agents locally, using different models.
I just tried the gpt-oss family (not with Docker) and stumbled from one error to the next. Reddit is also full of people having constant trouble with vLLM.
So I wonder, are the high-volume gains worth it? Those who were in a similar spot, what did you end up doing?
HarambeTenSei@reddit
I actually found llama.cpp to be faster than vLLM for some models, at least for my single-user workload. But vLLM has better support for some things, like vision-language and audio models.
Smooth-Cow9084@reddit (OP)
Yeah I need good batched requests support. Nonetheless, what models gave you better performance?
HarambeTenSei@reddit
qwen3-30b-a3b runs faster in unsloth gguf than in vllm awq for me even after I tweaked a bunch of the parameters
No-Refrigerator-1672@reddit
Are you sure that it actually runs faster? Llama-bench with default settings will only measure generation speed at a 0-length prompt, which is never the case IRL; in all of the tests that I've run, vLLM always outperforms llama.cpp for prompts longer than 8k-16k, depending on the model and the card.
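To make the comparison fair you can ask llama-bench for longer prompt lengths explicitly. A minimal sketch (the model path is a placeholder for whatever GGUF you're testing):

```shell
# llama-bench's defaults measure short-prompt processing and 0-context
# generation; -p takes a comma-separated list of prompt lengths so you
# can benchmark at realistic context sizes instead
llama-bench -m ./qwen3-30b-a3b-Q4_K_M.gguf -p 8192,16384 -n 128
```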
HarambeTenSei@reddit
If I tell it to give me a very long story, llamacpp just blitzes through the output; vllm doesn't.
noneabove1182@reddit
I think when it comes to batched requests, vLLM and SGLang are the gold standard
Smooth-Cow9084@reddit (OP)
How is model support/stability/ease with sglang?
noneabove1182@reddit
Don't quote me on this, but I think when a model is supported by SGLang it's more stable; their model coverage just isn't as broad.
Also, I've heard SGLang is a bit easier because it will try out different VRAM usages to find a stable amount it can use, whereas vLLM will sometimes fill your VRAM too much and crash (though that's not common)
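On the vLLM side, the crash-on-startup behavior is usually tunable. A hedged sketch (model name is illustrative; 0.90 is vLLM's documented default for this flag):

```shell
# vLLM pre-allocates a fixed fraction of VRAM for weights + KV cache;
# lowering --gpu-memory-utilization leaves headroom for other processes
# and can avoid the OOM crashes described above
vllm serve Qwen/Qwen3-30B-A3B --gpu-memory-utilization 0.85
```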
Barry_Jumps@reddit
One thing I don't love about vLLM is how long a cold start takes in a serverless setup. I've been experimenting with Ollama, llama.cpp, and vLLM on Modal, and vLLM consistently takes over 100 seconds to serve the first token from a dead start. Ollama and llama.cpp take less than 15 seconds.
suicidaleggroll@reddit
I tried it but was super disappointed by the loading times. Llama.cpp can load up the model in 10 seconds; vLLM takes 2+ minutes. I hot-swap models, so that kind of loading time wipes out any advantage vLLM might possibly have.
I just use llama.cpp and ik_llama.cpp. The latter most of the time since prompt processing is significantly faster.
munkiemagik@reddit
Do you have any comparative bench numbers for GPT-OSS-120B and GLM-4.5-Air on llama and ik_llama please?
suicidaleggroll@reddit
I don’t, but in general ik_llama has about the same generation rate, maybe +10%, nothing crazy, and about double the prompt processing rate compared to llama.cpp. That was fairly consistent on all of the models I tried.
munkiemagik@reddit
Thank you, the double prompt processing rate makes it sound worthwhile to revisit ik_llama. Appreciate it.
Smooth-Cow9084@reddit (OP)
I saw you can put vLLM servers to sleep in CPU RAM and recover them in seconds.
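For reference, this is vLLM's sleep mode. A sketch assuming a recent vLLM version (the sleep/wake endpoints are only exposed in dev mode; model name is illustrative):

```shell
# start the server with sleep mode enabled; dev mode exposes the
# /sleep and /wake_up management endpoints
VLLM_SERVER_DEV_MODE=1 vllm serve Qwen/Qwen3-30B-A3B --enable-sleep-mode

# in another shell: level 1 offloads weights to CPU RAM so waking up
# takes seconds instead of a full reload from disk
curl -X POST 'http://localhost:8000/sleep?level=1'
curl -X POST 'http://localhost:8000/wake_up'
```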
Haven't read about ik_llama, is it a fork? Why don't you use it all the time if it's better?
suicidaleggroll@reddit
It is. It focuses on performance at the expense of some recent model and capability support.
I use llama-swap so I can call either of them depending on the model I’m using. If ik_llama supports the model I use that, otherwise llama.cpp. It’s just a difference in the llama-swap config entry for that model.
Smooth-Cow9084@reddit (OP)
I am kinda new. How can you tell which models are supported? If a model is supported, will all of its quants and finetunes be cool too?
Also, where can I get that config entry diff? I might settle for your setup
suicidaleggroll@reddit
I just try them and see if/how well they work
The config entry is customized to my setup. I custom build both llama and ik_llama and then build my own llama-swap docker container with both of them inside. Then llama-swap calls a bash script to load up the model, and tells the script which server to use. It took a little effort to set up, but at this point my llama-swap entry just says “llama” or “ik_llama” to pick between them.
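The setup described above could look something like this in llama-swap's YAML config; the script name, model paths, and entry names are all hypothetical stand-ins for the custom launcher described:

```yaml
# hypothetical llama-swap entries: one launcher script, with the first
# argument selecting which server binary (llama.cpp or ik_llama) to run
models:
  "glm-4.5-air":
    cmd: /scripts/launch.sh ik_llama /models/GLM-4.5-Air-Q4_K_M.gguf ${PORT}
  "gpt-oss-120b":
    cmd: /scripts/launch.sh llama /models/gpt-oss-120b.gguf ${PORT}
```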
pmttyji@reddit
ik_llama models
HarambeTenSei@reddit
You can, but you can't load different weights from a different sleeping vLLM into that RAM in the meantime. Or at least you couldn't when I tried
Smooth-Cow9084@reddit (OP)
I see... I wanted to do exactly that. What did you end up using?
HarambeTenSei@reddit
nothing in particular. I just wait for the model to load up :))
I mostly stick to vllm because it can run qwen3 omni. But I have a system that can switch between model deployment systems.
kryptkpr@reddit
You can trade some runtime speed for loading speed with --enforce-eager, but yeah, torch.compile() is a dog
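A one-line sketch of that trade-off (model name is illustrative):

```shell
# --enforce-eager skips torch.compile and CUDA graph capture, cutting
# cold-start time at the cost of some steady-state throughput
vllm serve openai/gpt-oss-20b --enforce-eager
```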
SGLang starts up much faster overall.
Smooth-Cow9084@reddit (OP)
I read SGLang is a vLLM competitor. How is model support on it? If it causes fewer headaches than vLLM but has better batched/long-context performance than llama.cpp, it's what I'm looking for
kryptkpr@reddit
Its development moves slower and it supports fewer architectures overall, but it still often gets day-one support for big releases: MiniMax M2, for example, shipped with both vLLM and SGLang support.
Overall performance is similar, the knobs available for tweaking are somewhat different.
oKatanaa@reddit
vLLM currently has way too many bugs related to gpt-oss (or even qwen3) and other important features (such as structured outputs). After trying multiple releases of their official Docker images, I gave up and moved on to SGLang. And oh god, I wish I'd done it much earlier, because after an hour of setup it worked like a charm: 3x higher throughput and properly working structured outputs (for both qwen and gpt-oss). So my advice is to try the SGLang Docker image; the key thing is to set the correct reasoning parser and you're good to go
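A hedged sketch of that kind of launch; the image tag, port, and especially the parser value are assumptions, so check SGLang's docs for the parser name matching your model:

```shell
# run SGLang's official image with an OpenAI-compatible server;
# --reasoning-parser tells it how to split thinking tokens from output
docker run --gpus all --ipc=host -p 30000:30000 \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path openai/gpt-oss-20b \
    --reasoning-parser gpt-oss \
    --host 0.0.0.0 --port 30000
```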
munkiemagik@reddit
Right now I'm recovering from a breakdown after trying to get vLLM up and running this morning. My fault though; for whatever reason I have CUDA 13 installed on my system.
I was having a pig of a time setting up the system with Nvidia drivers on Ubuntu 24.04 running a mix of RTX 5090 and RTX 3090 GPUs (I think this is a problem specific to Ubuntu; I had no such issues with Fedora or Proxmox > Ubuntu Server). I was getting constant conflicts with packages, errors with multiple Nvidia drivers, and failing apt updates. I can't remember how I resolved it all in the end or which install method I used. But I'm now on nvidia-driver-580-open and CUDA 13.0, and I'm scared shi7less to change anything now (downgrade to CUDA 12.8) in case it starts breaking everything all over again.
And this was a BIG problem trying to get vLLM up and running this morning. CUDA 13.0 kept throwing a spanner in the works. I could get CUDA 13.0-compatible PyTorch by pip installing torchvision and torchaudio along with torch while pointing at the whl/cu130 index (pip installing only torch with whl/cu130 kept reverting back to cu128).
But when trying to pip install vllm, it kept removing the CUDA 13 torch and reinstalling the CUDA 12.8 one. Even when I tried to clone the repo and build vLLM from source, which would then throw back CUDA version mismatch errors when trying to pip install flash-attn.
Disclosure: I'm very ignorant in all matters Python and venv (in fact in most matters in general) and probably need to look further into how to properly control the build from source, to see if I can force a CUDA 13 build of vLLM; I can see that the repo does have a vllm-0.11.2-cu130.whl in the latest assets. Anyone got advice/guidance, my ears are open X-D. Not because I need vLLM, I just need to learn to solve the problem.
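One approach worth trying for the torch-gets-replaced problem described above is building vLLM against the torch already in the venv rather than letting pip resolve its own. This is a sketch under the assumption that the helper script and requirements layout in the current vLLM repo still match; verify against the repo before running:

```shell
# 1) pin the cu130 torch stack first
pip install torch torchvision torchaudio \
  --index-url https://download.pytorch.org/whl/cu130

# 2) build vLLM against the existing torch instead of pip's pinned one
git clone https://github.com/vllm-project/vllm && cd vllm
python use_existing_torch.py           # strips torch pins from requirements
pip install -r requirements/build.txt
pip install --no-build-isolation -e .  # reuse the venv's torch at build time
```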
So for the moment I've given up on vLLM and stick with llama.cpp and llama-swap. Sooooo much easier for model-hopping use cases, and for single-user, no-batching-needed workloads on mixed GPU architectures, the benefits of vLLM are not really worth the agony
1ncehost@reddit
vLLM is for hosts who want to maximize tokens per second over many simultaneous requests, where llama.cpp optimizes for single request speed. In my trials llama.cpp caps its batched performance at around twice the single request speed, where vLLM scales far beyond that.
Llama.cpp is also much easier to get running, so if you don't need multi-request throughput, you should skip vLLM.
cybran3@reddit
First of all you should use docker to avoid environment and dependency errors. Second, vLLM is great and easy to setup if you use relatively new NVIDIA GPUs, otherwise you might run into some weird issues. Third, if model + compute kernels + KV cache doesn’t fit into GPU you will not be able to run it.
I managed to run gpt-oss-20b on 2x RTX 5060 Ti 16 GB. With concurrency I managed to get to ~3000 TPS of generation with something like 128 requests and high KV cache hit rates.
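A run like that can be reproduced with vLLM's built-in serving benchmark; this is a sketch assuming a recent vLLM with the `vllm bench` subcommand (flag names should be checked against `vllm bench serve --help`):

```shell
# serve across both GPUs, then hammer it with 128 concurrent requests
vllm serve openai/gpt-oss-20b --tensor-parallel-size 2 &

vllm bench serve --model openai/gpt-oss-20b \
  --dataset-name random --random-input-len 1024 --random-output-len 128 \
  --num-prompts 512 --max-concurrency 128
```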
AutomataManifold@reddit
I've never run into errors with vLLM once I have it set up correctly, but I can see having models with more unusual architectures being more problematic. I haven't tried it with oss.
As for the volume gains, running 10 queries simultaneously is vastly better than trying to run them sequentially if you have an application that can support that. Some use cases are necessarily sequential, of course. But that does require you to have the hardware to actually run simultaneous queries, since there is a little bit of memory overhead.
Smooth-Cow9084@reddit (OP)
Yeah my problem is getting it to run in the first place. I think I'll have to go with ollama or llama.cpp
No_Afternoon_4260@reddit
Go for llama.cpp; vLLM isn't much harder, and vLLM is good for batches. Often vLLM supports new models faster than llama.cpp, sometimes it's the contrary.
what I'm sure about is that transformers (from hugging face) supports virtually anything but is so slow!
kryptkpr@reddit
If you have compatible hardware (sm90+) then very yes, that Cutlass really flies.
If you have mostly compatible hardware (sm86, sm89) still probably yes. Marlin is no slouch.
But if you have anything else, probably best to stick to GGUF.
Smooth-Cow9084@reddit (OP)
I have a 3090 and a 5060 Ti. I'd assume those are good. So how do you run servers? With Docker? Do you load the base Docker image and it works headache-free?
kryptkpr@reddit
I don't use Docker with GPUs personally; I hit too many weird quirks where things would stop working after a few days or weeks and restarting the container was the only fix.
I now install vllm via the Holy Trinity
If you need the nightly, add --extra-index-url https://wheels.vllm.ai/nightly to that pip install. This is enough for 90% of models, but some have additional dependencies such as triton-kernels; these are usually covered in their model cards.
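The exact commands behind the "Holy Trinity" aren't shown above; a common uv-based install that matches the nightly instruction looks something like this (uv and the version pins are assumptions):

```shell
# fresh venv, then a plain vLLM install
uv venv --python 3.12 && source .venv/bin/activate
uv pip install vllm

# or, for the nightly wheels:
uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly
```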
adel_b@reddit
if you are on macos and nothing works for you, please try my package https://github.com/netdur/hugind