Is using vLLM actually worth it if you aren't serving the model to other people?
Posted by ayylmaonade@reddit | LocalLLaMA | View on Reddit | 89 comments
So, as most of us here are, I'm a llama.cpp loyalist. Easy to understand, great configuration, relatively stable, etc. But I’ve been increasingly tempted by vLLM, especially since AMD just added it as a built-in inference engine to Lemonade, and I happen to have an AMD GPU. The thing is, I've never actually used vLLM directly, but I've heard good things about how it performs compared to llama.cpp, with vLLM apparently outperforming it pretty much across the board.
Buuuuut, I only serve my model to myself - no hosting for others to worry about, and another thing I've heard is that vLLM is engineered more for scenarios where you're serving many requests at once. But the apparent speedup still piques my interest.
Has anybody here actually done this? Is it worth all the hassle, or is it basically unnoticeable and not something to bother with? It would be great to hear some of the experiences from people who aren't just using it in enterprise-type settings.
Appreciate any help, ty!
Miserable-Dare5090@reddit
Yes, it's worth it if you want a reliable service, but the biggest thing for me is tensor parallel work across nodes/GPUs. I am serving Qwen 397B (aka Fat Qwen) on two nodes of 128GB each (DGX Spark), and there is no way to do this on lcpp as reliably or as solidly.
I also serve Skinny Qwen (Qwen 27B) with vLLM, again in a single container on a single GPU. For the 35B, which runs with CPU offload, I use lcpp, and for other small models too.
No_Afternoon_4260@reddit
You need that spark-vllm from eugr right?
Neithari@reddit
Sparkrun from spark-arena is another option. Same config files and more quality of life features. It's also possible to do by hand, but these two make it easy.
ganhedd0@reddit
> Fat Qwen
Oh lawd he inferrin'
lemondrops9@reddit
? I'm using llama.cpp with RPC mode between two PCs, no problem. I'm also running Qwen 397B with good pp and tg. Wondering what quant you are using and the speed you're getting with that setup.
mister2d@reddit
llama.cpp has TP now. Does it not work for you?
Miserable-Dare5090@reddit
You can’t tp across nodes as reliably with lcpp. I don’t tensor parallel across GPUs in the same box
mister2d@reddit
I missed the "across nodes" part.
Mountain_Patience231@reddit
llama.cpp TP never works for me on AMD cards
starkruzr@reddit
not the Fat Qwen
Miserable-Dare5090@reddit
I love Fat Qwen. It’s a solid model even if it is at times ridiculous in the fixes it proposes. Very smart, very well adapted to situations that are NOT coding…
FullOf_Bad_Ideas@reddit
I'm using it for coding, and it's doing a good job IMO. Why is it unfit for coding, in your opinion? Is it about prefill speed or just model output quality?
Miserable-Dare5090@reddit
I just don’t use it for coding, so I can’t comment on it.
starkruzr@reddit
I just find that name hilarious, lol.
Miserable-Dare5090@reddit
I call them fat qwen (397b), skinny qwen (27b), fast qwen (35b moe), and the qwenlings
prestigeful@reddit
I hope to see a baby qwen (say a gemma 4 e4b equivalent) for my baby setup (8gb vram)
pbpo_founder@reddit
Poor mid 122B Qwen…
starkruzr@reddit
Qwennie and The Jets
thrownawaymane@reddit
Fat Thor energy
I had that one spoiled for me and I still busted out laughing when I first saw it
BoxWoodVoid@reddit
Stop fat shaming our BBQ (Big Beautiful Qwen)!
Klutzy-Snow8016@reddit
It usually gets new models before llama.cpp. The same goes for features. For example, MTP is already supported for Qwen3.6 and Gemma 4. vLLM can be much faster if you can use tensor parallelism. It does multi-user better. And since it's used professionally, I feel like there's a lower chance of a model having a bug where it works at a glance but deviates from the output of the reference implementation (but I can't prove this).
On the other hand, vLLM uses way more VRAM for the same amount of context. It takes much longer to start up. Every time you upgrade to run a new model, you will break every older model (or at least that's my luck). And unlike llama.cpp, it's not a good option if you can't fit the entire model on GPU. And it wants modern GPUs, also.
I usually use llama.cpp since I have regular consumer hardware and like to swap between models. I use vLLM to try new models if they aren't yet available in llama.cpp and I can fit them.
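For anyone curious what the knobs actually look like, here's a minimal sketch using vLLM's offline Python API (the model id is a placeholder and the numbers are illustrative, not something I've benchmarked): tensor_parallel_size does the GPU split, and gpu_memory_utilization / max_model_len are how you keep the VRAM appetite in check.

```python
from vllm import LLM, SamplingParams

# Placeholder model id - swap in whatever safetensors checkpoint you actually run.
llm = LLM(
    model="someorg/some-model",
    tensor_parallel_size=2,        # split the weights across 2 GPUs
    gpu_memory_utilization=0.90,   # fraction of VRAM vLLM is allowed to claim
    max_model_len=32768,           # lower this to shrink the reserved KV cache
)

params = SamplingParams(temperature=0.7, max_tokens=256)
out = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(out[0].outputs[0].text)
```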
xcel102@reddit
Could you elaborate a bit more technically on why this is the case?
Do you have a rough time scale? Say for Qwen3.6 27B 4-bit quantized, how many seconds/minutes does it take?
relmny@reddit
I tested it a bit and yes, both vllm and sglang require enough VRAM for the model and KV and everything else, all in advance (when loading the model).
On two different PCs, both with 128GB RAM (DDR4/DDR5): the one with 16GB VRAM runs qwen3.6-35b-q6k at 35 t/s, and the one with 32GB VRAM runs qwen3.6-35b-q6k at around 115 t/s and qwen3.6-27b-q6lk at 21 t/s, all with llama.cpp/ik_llama.cpp.
I could not fit any NVFP4/INT4/etc. either.
I could only fit Qwen3.5-4B-AWQ-BF16-INT4 on the 16GB one.
On the 32GB one I only tested sglang, and could only fit qwen3.5-9b.
ayylmaonade@reddit (OP)
Thanks for the response! Based on this and what everyone else is saying, it seems like it's at least worth a try. I've been waiting for MTP to be merged upstream in llama.cpp for a while, so vLLM having it already is sweet.
Hopefully it's not too rough on my VRAM with the differences in memory use regarding context size... I'm running Qwen3.6-35B-A3B @ 115K context on my RX 7900 XTX and it'd be a shame to have to drop it further.
Cheers!
voyager256@reddit
Really? I wasn't aware of this. You mean for multi user etc.?
Imaginary_Belt4976@reddit
thank you for calling this out. I switched to llama cpp and was utterly shocked by how much faster it was to spin up the model. as someone who is constantly switching between serving llm, diffusion, and other cv models, sitting there waiting on vllm to spin up got old fast
otacon6531@reddit
Yeah, I love llama.cpp. My benchmarks have me conclude that ollama allows flexibility, llama.cpp is the fastest for one process, and vllm is slower for one but significantly better for concurrent processes. I am pretty sure everyone agrees, so I am just confirming.
So for work it is vllm and at home it is llama.cpp. It is the only way I can get 52 t/s with a Tesla P40 on my PCIe 3, DDR3-RAM Dell Precision T3610 running Qwen3.6:35B at 262k context.
Due-Project-7507@reddit
For use cases with multiple parallel requests, vLLM or SGLang can be much faster than llama.cpp. The major problem with vLLM and SGLang, in my opinion, is that they are very unstable compared to llama.cpp. Many quantized model versions that should work don't work on a given GPU generation, and there are many regressions, so after an update a quant that worked before can stop working on your GPUs (e.g. it happened with Gemma 4 26B-A4B AWQ: it worked in the past and I think it's still broken now).
Farmadupe@reddit
The general rule of thumb is that you should use vllm any time your model fits fully in VRAM. It's way faster in almost all situations, and generally more reliable too.
llama.cpp is for the everyman (ie me and you) so it's totally the place to go if you're vram poor.
But otherwise the real answer is that vllm and others are where you would go first if you had the choice.
Front-Relief473@reddit
Wrong. There's another case: you should choose llama.cpp when the model weights only barely fit into main memory, so that you still have room for enough context.
xcel102@reddit
Is this because vLLM "uses way more VRAM for the same amount of context" as someone commented?
Farmadupe@reddit
Possibly, but also llama.cpp has 999 quants of varying sizes per model, so there's almost always one that fits snugly. With vllm you often have to choose between a quant that doesn't fit and one that's too lobotomized.
xspider2000@reddit
does vllm support rtx 3090 cards? Can I run qwen 3.6 27b on dual 3090s out of the box, or do I need some hacks?
Farmadupe@reddit
Totally works out of the box. Just go and use an fp8 version of the model and it'll spin right up :)
xspider2000@reddit
Thx. I figured out why vllm is less popular here than llama.cpp: vllm has bad support for the gguf format, and gguf is a big thing.
CanWeStartAgain1@reddit
Is it also faster when dealing with non-batch inference? I googled a bit but could not reach that conclusion.
Sufficient_Prune3897@reddit
Often yeah, but the difference is like 30tps vs 38tps. Not worth the 5 minute vllm startup time
ROS_SDN@reddit
Add to the rule of thumb: if you're on AMD or Intel, vllm likely hates you.
exact_constraint@reddit
Also look into SGLang. The other big multi-user software package.
I’d say it depends on what kinda GPU you have, and your use case. I’ve been coming back to trying vLLM every once in a while on the R9700. It’s a massive PITA. Technically, total throughput is impressive - ~250-280tok/sec with an R9700 and Qwen3.6 27B. But per stream is 1/2 to 1/3rd the speed of what llama.cpp can do. It’ll get better. It already has. But ime, both SGLang and vLLM are geared towards the actual datacenter hardware - Working with consumer or consumer adjacent GPUs is an uphill battle. And please, for the love of God someone tell me I’m wrong and I’ve just been missing something fundamental.
axiomatix@reddit
i'm sorry, but can you confirm those numbers again.. are you sure those aren't for the 35b rather than the 27b? or do you mean total concurrent throughput?
exact_constraint@reddit
Definitely 27B - “Total throughput”, so massive concurrent (1000 iirc) requests to saturate the card. Single request is like I said - 1/3-1/2 what llama.cpp can achieve. Think the max I’ve gotten in vLLM, single request, is around 14tok/sec with a realish context window. Typical I’d put @ 10tok/sec. Vs llama.cpp + Vulkan: ~31tok/sec base speed, consistently ~20ish in real agentic coding workflows. Ngram-mod spec decoding gets to me to an average of 26.
I imagine the performance gap will close over time, why I keep checking in on it, but not as of a week ago lol.
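If you want to reproduce that total-vs-single-stream comparison yourself, a rough sketch with the offline API looks like this (placeholder model id; the prompt count and token budget are arbitrary):

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(model="someorg/some-model-awq")  # placeholder quantized checkpoint
params = SamplingParams(max_tokens=256)
prompts = ["Summarize the history of GPUs."] * 256   # many requests in flight

# Total throughput: all requests scheduled and batched together on the GPU.
t0 = time.time()
outs = llm.generate(prompts, params)
total_tok = sum(len(o.outputs[0].token_ids) for o in outs)
print(f"batched: {total_tok / (time.time() - t0):.1f} tok/s")

# Single-stream speed: one request at a time, the number an interactive user feels.
t0 = time.time()
single = llm.generate(prompts[:1], params)
print(f"single:  {len(single[0].outputs[0].token_ids) / (time.time() - t0):.1f} tok/s")
```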
axiomatix@reddit
ahh.. somehow i missed the total throughput bit.
not bad for a single card. getting around 65 tok/s at c=4 with max context on dual 3090s. considering getting 2x 9700s, but i'm still on the fence.
exact_constraint@reddit
Ultimately pretty happy w/ the R9700 - I went that route so I can eventually scale to 4, then 8 cards. And a single card was a pretty cheap entry point to get my feet wet. I think in terms of a blended VRAM+compute/$ for a new card, it’s a good value.
More than a few times I’ve dreamed about a different world where I just vomited blood down my sleeve and ponied up the cash for an RTX6000 though. They’re violently expensive, but the trade off is, you take some wallet violence in exchange for not having to suppress the urge to do your keyboard violence, when it’s 2AM and you’re trying to figure out how to manually patch a bunch of files so vLLM will think you’re running an Mi350 instead of an R9700.
balerion20@reddit
Which model precision did you use, if I may ask? Because I couldn't hit that number even with total throughput, I think.
exact_constraint@reddit
IIRC it was either a Lorbus int4 auto round or the CyanKiwi AWQ int4 quant - I downloaded both models, and I know the max throughput numbers came from the CyanKiwi quant; can't remember which I used when I was testing single-stream speed.
exact_constraint@reddit
This is also ignoring multi-GPU setups w/ tensor parallelism - In that case, you don’t have much of a choice.
ipcoffeepot@reddit
I like to have multiple tabs/windows going at once. It's just me, but I'm regularly batching 8-12 requests at a time (sometimes all from pi, sometimes it's a mix of opencode and hermes, depends on what I'm doing)
sammcj@reddit
It's around 20-35% faster than llama.cpp with many models for me (single user, not batched)
Tormeister@reddit
Yes
It's really not a bigger hassle than llama.cpp - for both engines you have to compile the binaries or use a container, find a model that suits you, learn the launch args, and tweak them until you are satisfied.
With vLLM I get much higher PP speed, slightly higher TG and can use parallel requests with MTP, which is pretty good for agents (Qwen3.6 27B variants & 5090).
The only real downside I've observed is that it ends up not being very well suited to consumer hardware - the quants that fit a 5090 are much worse than similarly sized quants that run on llama.cpp, so I gave up on parallel requests and went back to llama.cpp. With more VRAM to fit higher-precision models, I would absolutely pick vLLM instead.
Septerium@reddit
For a multi-GPU environment, yes, it is. At least for now. I get much faster token generation with vLLM for dense models on my quad 3090 setup
a_beautiful_rhind@reddit
Is it actually faster for you? Does it have a feature other engines don't? The pain of vLLM is the vram usage and rather limited quantization variety.
For single batch you can also use ik_llama and exllama. They will probably be close in speeds. Exllama also has batching but any of the .cpp aren't so good at it.
If you make a lot of parallel requests it's worth trying. Not like you have to delete all other inference engines to use it.
agentzappo@reddit
Yes, it is worth it even if you are a single user. The question to ask yourself is this:
“how many concurrent agents will I be running?”
If the answer is >1, vLLM batched inference across concurrent requests will likely provide you with at least a 2-3x bump in total token generation throughput.
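As a rough sketch of what "concurrent agents" ends up looking like against a local `vllm serve` endpoint (the base URL and model name here are assumptions for illustration), the point is that the requests arrive together and get batched on the GPU instead of queuing one after another:

```python
import asyncio
from openai import AsyncOpenAI

# Assumes a local OpenAI-compatible server, e.g. started with `vllm serve <model>`.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

async def agent(task: str) -> str:
    resp = await client.chat.completions.create(
        model="local-model",  # whatever model name your server reports
        messages=[{"role": "user", "content": task}],
        max_tokens=512,
    )
    return resp.choices[0].message.content

async def main():
    # Three "agents" in flight at once; vLLM batches them instead of serializing.
    tasks = [
        "Name this conversation.",
        "Summarize the repo layout.",
        "Draft a fix for the failing test.",
    ]
    for result in await asyncio.gather(*(agent(t) for t in tasks)):
        print(result[:80])

asyncio.run(main())
```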
Medium_Chemist_4032@reddit
Oh for sure. "Other people" might be your Claude Code agent trying to name the current conversation, or a subagent exploring the codebase, or editing a file. For me vllm has been the most reliable solution for long-context agent work. Llama.cpp and ik_llama can do wonders for the cpu+gpu splits, but there are a few quirks that just keep happening on each new model release (templating issues, random repetition loops that some try to work around with the repetition penalty, preserving thinking), and you have to wait for patches, while vllm almost always has decent zero-day support.
PersonalityQuick5750@reddit
If you aren't doing parallel sub-agents, vLLM is like buying a Ferrari to drive to the grocery store. llama.cpp is the reliable Honda Civic we all actually need.
blackhawk00001@reddit
Yes, it's worth it. I struggled to optimize my dual R9700 setup in llama.cpp's split/multi gpu configurations and tried to follow an MXFP4 thread to try something different. I figured out the default rocm image deployment with qwen3.6 27b-FP8 and saw a 2-3x speed increase over llama.cpp. Then with the help of another redditor and their custom aml731 image I was able to get unified aiter from the mi350x config working with rdna4 and saw another 2-3x increase to tg. Coding tools are now snappy and I'm happy with the results. I can only do 200k context with a single request at a time but can increase max concurrency if I reduce the context and lower batch sizes. Speeds are good enough that a second task isn't queued for more than a minute or two.
The downside is there's currently a bug somewhere that causes a high ~100W idle on each card with some CPU thread use, so the system is running 250-300W just sitting there.
I still prefer llama.cpp for quickly testing a model or loading something different for a quick task.
I'm trying to stay entirely within VRAM with vllm, so I'm not yet sure how it handles splitting to RAM.
grunt_monkey_@reddit
Hey, I hear if you get Ubuntu 26.04 this is fixed… the fix is in the Linux 7.0 kernel. I'm running 24.04 with the 6.17 HWE kernel. I pulled the Linux 7.0 AMD DKMS package, patched my AMD DKMS, and now I no longer idle at 90-100W.
https://www.phoronix.com/news/Linux-7-Fix-Idle-RDNA4-Compute
The issue is with the vision encoder. You might get the same thing with mmproj loaded on llama.cpp but here only one card will be pegged at 90w.
relmny@reddit
I have almost no experience with it, except for trying sglang and vllm on my 16GB VRAM + 128GB RAM Windows host (podman over WSL2), and what I found is that they use a lot of memory.
I could only fit a qwen3.5-4b AWQ BF16-INT4 but not higher than that, while with llama.cpp/ik_llama.cpp I can load qwen3.6-35b-q6kl (around 33t/s) or 27b-iq3m.
But I only tested it a bit, once I get a new GPU I'll probably test it further.
Fit_Check_919@reddit
Definitely vllm, especially for vision-language models
chensium@reddit
Multiagentic and subagentic workflows
Savantskie1@reddit
I would love to try vllm, but from what I’ve read it doesn’t support MI50 anymore? Or Vulkan? Is this true?
ikkiho@reddit
went back to llama.cpp after a few weeks on vllm honestly. prefill is faster sure but model load takes ages and i kept fighting rocm versions every time i wanted to try a new quant. for solo casual use the operational tax eats the throughput win imo. if you've got parallel agents running 12 hours a day it pays off.
FullOf_Bad_Ideas@reddit
I use vllm/sglang if I can fit the model in VRAM at BF16 easily with buffer and I want to do batch tasks, like "translate this 30M-token dataset", "rewrite this", etc. Sometimes I also use it when it supports a model that exllamav3/ik_llama.cpp/llama.cpp do not support - vllm and sglang more often than not have day-0 support for major models and good support for all smaller releases, while llama.cpp still doesn't support a bunch of older models and never will.
rorowhat@reddit
I hate it
Altruistic_Heat_9531@reddit
yep, even when you are the only user, technically speaking you still want multi-user capabilities. Parallel agent calls, etc., are blazing fast in vLLM.
It is just like any OS: you are the true user, but there are many services and apps that use OS-managed users to do their thing.
DataPhreak@reddit
Just added vLLM to Lemonade? Not seeing it. Maybe it's not available for rocm?
Rattling33@reddit
I can see vLLM (tagged experimental); I just updated Lemonade
Vicar_of_Wibbly@reddit
I can't imagine going back 2-3 years to using GGUFs and llama.cpp now that vLLM and safetensors are The Way for me.
A lot of the time I’m in Claude cli with MiniMax FP8 in vLLM. If i switched to Q8 and llama.cpp it would destroy my long context speeds and prefill speeds. vLLM is just faster for this, especially in tensor parallel with FP8 Blackwell.
Another thing: there seems to be an air of mystique around vLLM (and sglang) that is undeserved - you can literally “uv pip install” and the help is decent.
The real hurdle to vLLM is having enough VRAM. There’s no offloading (sglang+ktransformers for that). No GGUFs. But we do have AWQ, NVFP4 etc. so all is not lost.
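To the point about the mystique being undeserved, the whole quantized-checkpoint path is roughly this small, a hedged sketch with a placeholder AWQ repo id (vLLM usually auto-detects the quantization from the checkpoint config anyway):

```python
# After something along the lines of `uv pip install vllm`
# (exact extras depend on your GPU stack):
from vllm import LLM, SamplingParams

# Placeholder AWQ repo id - safetensors quants load directly, no GGUF involved.
llm = LLM(model="someorg/some-model-AWQ", quantization="awq")
print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```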
etaoin314@reddit
I have used vllm, llama.cpp, and TabbyAPI. TabbyAPI with EXL2 files is the fastest, followed by ollama
NickCanCode@reddit
Are you using multi-gpu with Qwen3.6 27B? I can't get exllama to work with two cards. It will give me:
when I try to use
tensor_parallel: true
i_am_not_a_goat@reddit
You can run it without TP. Get yourself the dflash quant as well and it’ll go very fast.
NickCanCode@reddit
Thanks. I can finally get it to run with TP off. However, when I provide
the models will load but when I make a new request, it will immediately give me
Do you know what the issue could be? There are still 6 GB of free VRAM available when this happens.
i_am_not_a_goat@reddit
Yeah I think there’s a bug in how it’s allocating vram. Gonna raise an issue on GitHub when I get a moment. You can reduce your context size and it should work but that’s kinda not a great fix.
ochbad@reddit
Even with one user, would vLLM be useful if you’re running parallel sub-agents?
lordekeen@reddit
Hijacking the thread: I see that I can make better use of my two RTX 3060s with vLLM. Can anyone point me in the right direction on how to set this up? The llama.cpp docker image is just so easy to deploy.
meldas@reddit
vllm/sglang's prompt processing and caching is very reliable and fast, so if you are doing a lot of agentic tool calls and development that easily operates on 100k token context, the difference is felt almost immediately.
With the llama.cpp backend there are constant prompt-processing loops (due to cache misses) almost every other tool call. I can't put a specific productivity multiplier on it, but simply eliminating the stress of waiting MINUTES for prompt processing made all the difference, from essentially unusable to viable.
My OWUI node for general chatbot stuff uses llama.cpp though, as I prefer how it releases/evicts resources when not in use
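A sketch of the prefix-caching behavior being described (placeholder model id; enable_prefix_caching is shown explicitly here, though as far as I know recent vLLM versions turn it on by default): the long shared prefix is prefilled once, and later requests that start with the same text reuse the cached KV blocks instead of reprocessing them.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="someorg/some-model", enable_prefix_caching=True)  # placeholder model id

# A big shared "agent" prompt, reused verbatim across tool calls.
system = "You are a coding agent. Project context:\n" + "..." * 10_000
params = SamplingParams(max_tokens=128)

llm.generate([system + "\nTool call 1: list the project files"], params)   # full prefill
llm.generate([system + "\nTool call 2: read the README"], params)          # prefix served from cache
```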
Ghost_is_bourne@reddit
I've stayed with lcpp so far because I'm mixing GPU architectures (3090/4090) and I've heard that vllm can cause headaches with that. Is that still the case?
ttkciar@reddit
If you do any batched inference at all, vLLM is going to scale better and more easily, since it allocates VRAM per-batch as needed as context grows and as more concurrent queries arrive, while llama.cpp requires K and V caches to be pre-allocated for the maximum batch size and maximum context.
Other than that, though, llama.cpp is pretty great. I'm really happy with it. I'm all AMD GPUs here too (MI50, MI60, V340) but llama.cpp's Vulkan back-end works splendidly with them.
The only other reason I can think to switch to vLLM is if you're interested in picking up marketable tech job skills, since vLLM is the go-to for the Enterprise. It's what Red Hat's RHEAI is built around, for example. vLLM experience would look better on a CV than llama.cpp experience.
Verenda@reddit
+1 to vLLM being a marketable skill. We use it and TRT-LLM at [big company]
Farmadupe@reddit
It's totally ironic that llama.cpp is built for people without VRAM but needs to pre-allocate all possible VRAM at launch time.
If llama.cpp ever gets a mature paging/preempting KV cache, we'll all wonder how we ever used the software at all.
It would certainly help with the "why does llama.cpp always reprocess my prompts" queries
o0genesis0o@reddit
I remember seeing a PR that implements a similar system to paged attention of vLLM. Maybe soon enough they would not need pre-allocation anymore.
computehungry@reddit
You start noticing the difference when you use some agentic harness which tries to run 2 or 3 agents at once. Even with np>3 the whole thing will die quite often. I'm pretty sure it's an unintentional bug, not an inherent limitation of llama.cpp (or I just don't know some black magic param), but vllm just works in this case.
Annual_Award1260@reddit
I have no problem maxing out vllm 100% all day
ridablellama@reddit
amen
Theio666@reddit
From our little testing, when we used to serve GLM Air on our HPC vs the prod server, llamacpp was really unreliable with cache hits for some reason. We had something like an fp8 quant on dual A100s for vllm, and an awq quant on 3x A6000; on the A6000s, during long-context agentic work, it periodically got cache misses on full 40k+ contexts, which led to 30-90s of waiting for prompt reprocessing, and we never saw anything like that happen with vllm for the same agent backend.
Keep in mind, vllm is not perfect. It's the go-to solution if you want a multi-GPU setup with the same GPUs involved (and out-of-the-box MTP support) and want to squeeze all the speed from it, but there are bugs. Like, right now the qwen models in some combination with MTP mode have a semi-rare parsing bug for tool calls (fixable by disabling parsing on the inference side and enabling a proxy for parsing xd). This is like a 3-month-old model, and I bet the awq-related parsing bugs for GLM Air are still there too; that one is a 9-month-old model. So if you enter the vllm world, be ready for some "fun". It's not as bad as with SGLang, but it can still be quite frustrating. I think all the big cloud inference providers use custom versions of either vLLM or SGLang with their own bugfixes added, since out of the box there are bugs.
NNN_Throwaway2@reddit
PP speed and long context decoding speed is way faster. If you're doing anything with agents and your models fit in VRAM, you should be moving off llama.cpp.
HonestoJago@reddit
It’s probably worth using it just to learn how to get it up and running, which can be a pain but also might be a skill you’ll need someday.
6969its_a_great_time@reddit
I've tried to use Claude Code with local models, and vllm always does better at handling large context lengths compared to llama.cpp
GCoderDCoder@reddit
FP8-capable GPUs will benefit, in my opinion, as it makes heavier 8-bit inference faster, giving speed with higher accuracy. I'm not a fan of the q4 options, so outside of fp8 I'm frequently using gguf, since I'm primarily doing personal inference and llama.cpp and mlx handle my one or two requests at a time fine. I tend to like q6 as the best value for accuracy and size, but vllm doesn't offer much unique benefit there, or at least I have not found any yet
sautdepage@reddit
Of course, if a particular model/gpu combo works well with it.
Prompt processing tends to be significantly faster - on CUDA at least. That alone makes it well worth it. I can stop my comment here, actually.