Stop Wasting Your Multi-GPU Setup With llama.cpp: Use vLLM or ExLlamaV2 for Tensor Parallelism

[-]

fallingdowndizzyvr@reddit

My Multi-GPU Setup is a 7900xtx, 2xA770s, a 3060, a 2070 and a Mac thrown in to make it interesting. It all works fine with llama.cpp. How are you getting that working with vLLM or ExLlamaV2?

[-]

androidGuy547@reddit

I have 2 A770 and wanna try dual gpu setup, but worried about pcie lane limitation since I have B550m mobo, will pcie 3.0 * 4 sufficient for the second gpu without too much bottleneck?

[-]

For split up the model and run each section sequentially, it's overkill. For TP, it's a little light. But TP isn't really there for the A770. I was just trying it again last week. It's still slower than layer splitting.

[-]

androidGuy547@reddit

thank you

[-]

CompromisedToolchain@reddit

If you don’t mind, how do you have all of those rigged together? Mind taking a moment to share your setup?

[-]

fallingdowndizzyvr@reddit

3 separate machines working together with llama.cpp's RPC code.

1) 7900xtx + 3060 + 2070.

2) 2xA770s.

3) Mac Studio.

My initially goal was to put all the GPUs in one server. The problem with that are the A770s. I have the Acer ones that don't do low power idle. So they sit there using 40 watts each doing nothing. Thus I had to break them out to their own machine that I can suspend when it's not needed to save power.

[-]

_mannen_@reddit

I just the A770 and am quite disappointed in inference speed under Linux. I find the comment that it runs faster on Windows interesting, and while I was already planning to move it to another computer that runs Windows, I will pay more attention to performance and run some more benchmarking.

I got the 3060 as well, and cheaper than the A770 but the 4GB additional VRAM is interesting on the A770. Initial testing shows that the 3060 performs better under Linux than the A770.

If the A770 performs well under Windows, and actually matches the 3060, I might pass-through the A770 to a Windows VM and return the 3060.

Interesting indeed.

[-]

fallingdowndizzyvr@reddit

Care to share some info about your A770 setup under Windows? Just download llama.cpp and run?

Pretty much. There's nothing special to do on the A770 end. Vulkan is supported by the basic drive. For llama.cpp, just download and run the Windows binary compiled with Vulkan support. Then just run it. That's all there is to it.

[-]

fullouterjoin@reddit

That is amazing! What is your network saturation like? I have part of what you have here, I could run on a M1 Macbook Pro 64GB instead of a studio.

That is criminal that those cards don't idle. How much better is the A770 perf on Windows than Linux?

I have 10 and 40GbE available for testing.

[-]

fallingdowndizzyvr@reddit

What is your network saturation like?

There is no network saturation in terms of bandwidth. Even when running RPC servers internally with the client on the same machine where there is effectively unlimited bandwidth, for what do it hovers at around 300mbs. Well under even pretty standard gigabit ethernet. It really depends on the number of layers and the tks. Running a tiny 1.5b model with a lot of tk/s gets it up to about a gigabit.

I think latency is more of an issue than anything else.

How much better is the A770 perf on Windows than Linux?

I didn't realize it was until recently. Since until recently, Intel did their AI work on Linux. That all changed with AI playground which is Windows only. Then the gamers reported that the latest Windows driver was so much better. It hadn't come to linux the last time I checked. So I tried running in Windows instead to test that new driver. It's much faster. I talked about it here. Windows is about 3x faster than linux for the A770.

https://www.reddit.com/r/LocalLLaMA/comments/1hf98oy/someone_posted_some_numbers_for_llm_on_the_intel/

[-]

ivchoniboy@reddit

I think latency is more of an issue than anything else.

Any insight why would latency be an issue? Is this in the case because you are issuing a lot of concurrent requests to the llama.cpp server?

[-]

fallingdowndizzyvr@reddit

Latency is an issue. It has nothing to do with a lot of concurrent requests. Even with a single request, latency is an issue.

I'm going to use an analogy to demonstrate the point. Say you have a carton that holds 6 eggs. There's a team of 6 people to fill that carton. Each person puts in an egg. It takes 1 second per person. So they should be able to fill the carton in 6 seconds. But they can't because they need to move the carton between them. Say that takes a second. So really, it takes 11 seconds. That time to move that carton from person to person is latency.

It's the same with inferring across multiple machines. To pass the baton from one machine to another takes time. That time is latency.

[-]

CheatCodesOfLife@reddit

Damn, I might have to install Windows to try this. I recently found that removing my A770's and just using Nvidia + Threadripper sped up my R1 inference substantially (Thread-ripper is faster than A770)

[-]

ph0n3Ix@reddit

I have 5GBE adapters but I'm having reliability issues with them, connection drops.

Realtek chipset? They have a reputation and it's not a good one. copper based gigabit+ tends to run hot. I've had issues that cleared up after putting a fan on the PHYs to keep them cool(er).

[-]

fallingdowndizzyvr@reddit

Yep. 8157 if I remember right.

[-]

adityaguru149@reddit

RAM for Mac Studio?

[-]

fallingdowndizzyvr@reddit

32GB.

[-]

CompromisedToolchain@reddit

Thanks! Been looking to solve this same problem.

[-]

zelkovamoon@reddit

So how many tokens/s are you getting on this with, I assume, at least 70b models?

[-]

ZealousidealPage5309@reddit

Commenting to second u/CompromisedToolchain ‘s request.

[-]

ttkciar@reddit

Higher performance is nice, but frankly it's not the most important factor, for me.

If AI Winter hits and all of these open source projects become abandoned (which is unlikely, but call it the worst-case scenario), I am confident that I could support llama.cpp and its few dependencies, by myself, indefinitely.

That is definitely not the case with vLLM and its vast, sprawling dependencies and custom CUDA kernels, even though my python skills are somwhat better than my C++ skills.

I'd rather invest my time and energy into a technology I know will stick around, not a technology that could easily disintegrate if the wind changes direction.

[-]

Potential-Leg-639@reddit

AI Winter?

[-]

ttkciar@reddit

https://wikipedia.org/wiki/AI_winter

I'm too young to have experienced the first AI Winter, but was active in the field for the second one, and the conditions prior to the second Winter are very similar to conditions today.

[-]

anderspitman@reddit

Late to the party but this appears to be a ripoff of this article: https://www.ahmadosman.com/blog/do-not-use-llama-cpp-or-ollama-on-multi-gpus-setups-use-vllm-or-exllamav2/

[-]

ykoech@reddit

Does LM Studio work with multiple GPUs ?

[-]

minyor@reddit

Closed source and no commercial use is allowed.. No thank you

[-]

suprjami@reddit

vLLM compared to llama.cpp on my dual 3060 12G system:

vLLM container is massive at 16.5 GiB. My llama.cpp container is 1.25 GiB.

vLLM is very slow to start, it takes 2 minutes from start to ready. llama.cpp takes 5 seconds.

vLLM VRAM usage is higher than llama.cpp with the same model file and config. vLLM seems more badly affected by long context despite Flash Attention being used for both servers.

Model name in vLLM API server is the full long file path which is ugly.

vLLM does not provide statistics like the token length provided to Open-WebUI.

vLLM has no generation stats per prompt in its logs, only basic prompt/gen tok/sec printed every few seconds.

The only good point: vLLM inference was faster. llama.cpp running L3 8B gets 38 t/s on one GPU and same on two GPUs. vLLM on one GPU got 35 tok/sec, tensor-parallel on both GPUs got 52 tok/sec. That's a ~36% speedup.

I can only just load a 32B Q4 or 24B Q6 model with llama.cpp. I don't think vLLM would be able to do those with its high VRAM use so I'd have to go down a quant, which is not ideal at those sizes.

Considering the worse experience everywhere except inference speed, I am not impressed with vLLM.

[-]

npl1986@reddit

I would like to second this. I have the same setup, dual 3060. I still couldn't figure out how to fit 32B Q4 using VLLM even with very small context size. Maybe I'm very new to this. The VRAM usage of VLLM is just annoying. The initial setup and finding AWQ files are not user friendly at all. For my hardware, I will simply ignore the extra speed for the user experience and convenience.

[-]

Ok_Warning2146@reddit

Since you talked about the good stuff of exl2, let me talk about the bads:

No IQ quant and K quant. This means except for bpw>=6, exl2 will perform worse than gguf at the same bpw.
Architecture coverage lags way behind llama.cpp.
Implementation is full even for common models. For example, llama 3.1 has array of three in eos_token. However, current exl2 can only read the first item in the array as the eos_token.
Community is near dead. I submitted a PR but no follow up for a month.

[-]

CheatCodesOfLife@reddit

For example, llama 3.1 has array of three in eos_token. However, current exl2 can only read the first item in the array as the eos_token.

Found this via google. Thank you for this! Explains some issues I've been having with trying to use it with llasa3. I'll handle this in my code this.

Community is near dead. I submitted a PR but no follow up for a month.

It's not dead, just that it's one developer, and he's working on exl3 + all these new models like gemma3 coming out at once.

[-]

Weary_Long3409@reddit

Wait, q4km is on par with 4.5bpw exl2, and 4.65bpw is slightly better than q4km. Many people wrongly compared q4km with 4.0bpw. Also there's 4.5bpw with 8bit head, it's like q4kl.

[-]

Rich_Artist_8327@reddit

Is vllm faster than Ollama if having 1 GPU BUT many conqurrent users/requests? My understanding is vLLM is only faster when more than 1 GPU?

[-]

Holly_Shiits@reddit

Tried both, vLLM is fast, but it's unreliable Exllamav2 is not as fast as vLLM even with tensor parallelism, and also unreliable

Verdict: llama.cpp and gguf might be slower, but it's the most stable and decent ecosystem

[-]

b3081a@reddit

Even for a single GPU, vLLM is performing way better than llama.cpp from my experiences. The problem is the setup experience, its pip dependencies are just awful to manage and cause ton of headache.

I had to spin up a Ubuntu 22.04.x container to run vLLM because one of the native binary in a dependency package is not ABI compatible with latest Debian release, while llama.cpp simply builds in minutes and works everywhere.

[-]

bjodah@reddit

Old thread, but I'd just like to add that running vllm using docker/podman is quite easy, this the command I use:

podman run \

--name vllm-qwen25-coder \

--rm \

--device nvidia.com/gpu=all \

--security-opt=label=disable \

-v \~/.cache/huggingface:/root/.cache/huggingface \

--env "HUGGING_FACE_HUB_TOKEN=hf_REDACTEDREDACTEDREDACTEDREDACTED" \

-p 8000:8000 \

--ipc=host \

vllm/vllm-openai:latest \

--api-key some-key-123 \

--model Qwen/Qwen2.5-Coder-14B-Instruct-AWQ \

--gpu-memory-utilization 0.6 \

--max-model-len 8000

[-]

bjodah@reddit

I should add that I currently am mostly running exllamav2 using tabbyapi OCI image instead. The command is similar:

podman run \
--name tabby-qwen25-coder \
--rm \
--device nvidia.com/gpu=all \
--security-opt=label=disable \
-v \~/.cache/huggingface/hub:/app/models \
-v \~/my-config-files/tabby-config.yml:/app/config.yml \
-v \~/my-config-files/tabby-api_tokens.yml:/app/api_tokens.yml \
-e NAME=TabbyAPI \
-p 8000:5000 \
--ipc=host \
ghcr.io/theroyallab/tabbyapi:latest

my tabby-config.yml then contains the following entries (at the relevant places), I should probably use a symlink instead of the weird path encoding going on in the model name, but you get the idea:

model_name: models--bartowski--Qwen2.5-Coder-14B-Instruct-exl2/snapshots/612dc9547c5753e6ceb28c5d05d9db48e99d6989
draft_model_name: models--LatentWanderer--Qwen_Qwen2.5-Coder-1.5B-Instruct-6.5bpw-h8-exl2/snapshots/5904487d2dc0e0303b2a345eba57dbf920d53053

That gives me on the order of 70 tokens per second for generation on my single RTX 3090. Ideally I'd like to use the 32B model, but I would need more vram because I also run whisper, kokoro, and my X desktop on that GPU.

[-]

segmond@reddit

I like the ease of llama.cpp, I have 6 GPUs so tensor parallelism doesn't apply. I have had to rebuild vllm multiple times and now I just limit it for vision models, each model with it's own virtual environment. I like llama.cpp's cutting edge, ability to offload kv to system memory to increase context size. I'm not using my GPU so much that token/sec is my bottleneck. My bottleneck so far is how fast I can come up with and implement ideas.

[-]

fairydreaming@reddit

Earlier post that found the same: https://www.reddit.com/r/LocalLLaMA/comments/1ge1ojk/updated_with_corrected_settings_for_llamacpp/

But I guess some people still don't know about this, so it's a good thing to periodically rediscover this.

[-]

daHaus@reddit

Those numbers are surprising, I figured nvidia would be performing much better there than that

For reference I'm able to get around 20 t/s on a RX580

[-]

SuperChewbacca@reddit

Hey, I am the person who did that post and tests. I ran the tests at FP16 to make the testing simple and fair across the inference engines.

It runs much faster when quantized, you are probably running a 4 bit quant.

[-]

daHaus@reddit

Q8_0, FP16 is only marginally slower

  MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    36 runs - 28135.11 us/run -  60.13 GFLOP/run -   2.14 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   40 runs - 25634.92 us/run -  60.13 GFLOP/run -   2.35 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   44 runs - 23794.66 us/run -  60.13 GFLOP/run -   2.53 TFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   24 runs - 41668.04 us/run -  60.13 GFLOP/run -   1.44 TFLOPS

[-]

SuperChewbacca@reddit

Thanks, I will check it out. Haven’t used llama.cpp on my main rig in awhile.

[-]

trararawe@reddit

How do you serve multiple models with vLLM? That's the only reason why I use Ollama.

[-]

No-Statement-0001@reddit

Yes and some of us have P40s or GPUs not supported by vllm/tabby. My box, has dual 3090s and dual P40s. llama.cpp has been pretty good in these ways over vllm/tabby:

supports my P40s (obviously)
one binary, i static compile it on linux/osx
starts up really quickly
has DRY and XTC samplers, I mostly use DRY
fine grain control over VRAM usage
comes with a built in UI
has a FIM (fill in middle) endpoint for code suggestions
very active dev community

There’s a bunch of stuff that it has beyond just tokens per second.

[-]

k4ch0w@reddit

Yeah, I recently got a 5090, but unfortunately, it’s not yet supported for vllm. :(

[-]

Durian881@reddit

This. Wish vllm supports Apple Silicon.

[-]

XMasterrrr@reddit (OP)

You can use CUDA_VISIBLE_DEVICE envar to specify what to run on which gpus. I get it though.

[-]

No-Statement-0001@reddit

I use several different techniques to control gpu visibility. My llama-swap config is getting a little 😝

[-]

a_beautiful_rhind@reddit

P100 is supported though. Use it with flash attention.

[-]

kiselsa@reddit

P40 is the only justification for using llama.cpp over exllamav2 in parallel inference. I have P40 and 3090 too. Exllamav2 is just so much faster. And exllamav2 on 3090 has 5x more throughtput than P40 with llama.cpp

[-]

gaspoweredcat@reddit

i never had luck with exllamav2, i did try vllm for a bit but its just not as user friendly as things like LM Studio or Msty, itd be interesting to see other backends plugged into those apps but i suspect if they were going to do that they would have by now. itd be nice if someone built something similar to those apps for exlv2 or vllm

[-]

Lemgon-Ultimate@reddit

I never really understood why people are prefering llama.cpp over Exllamav2. I'm using TabbyAPI, it's really fast and reliable for everything I need.

[-]

Kako05@reddit

Because it doesn't matter whatever you get 6t/s or 7.5t/s text generation speed. It is still fast enough for reading. And whatever EXL trick I used to boost speeds seemed to hurt processing speed which is more important. Plus gguf has a context shift feature, so entire texts don't need to be reprocessed every single time. GGUF is better for me.

[-]

sammcj@reddit

tabby is great, but for a long time there was no dynamic model loading or multimodal support and some model architectures took a long time to come to exllamav2 if at all

[-]

henk717@reddit

For single GPU its as fast, way less dependencies, easier to use / install. Exllama doesn't make sense for single user / single GPU for most people.

[-]

_hypochonder_@reddit

exl2 runs much slower on my AMD card with ROCm.
Not everybody has leather jackets at home.

vLLM I didn't try yet. I setup docker and build the docker container, but never run it :3

[-]

Sudden-Lingonberry-8@reddit

just upstream gpu parallelism into llama.cpp?

[-]

No_Afternoon_4260@reddit

We want some deepseek r1 q4 speeds on 14 3090 !! Lol

[-]

TurpentineEnjoyer@reddit

I tried going from Llama 3.3 70B Q4 GGUF on llama.cpp to 4.5bpw exl2 and my ingerence gain was 16 t/s to 20 t/s

Honestly, at a 2x3090 scale I just don't see that performance boost to be worth leaving the GGUF ecosystem.

[-]

llama-impersonator@reddit

then you're not leaving it right, i get twice the speed with vllm compared to whatever lcpp cranks out

[-]

Weary_Long3409@reddit

As it is using parallel tensor, vllm about twice as fast but still slow at long context. Have you try lmdeploy? It's crazy fast running AWQ using it's turbomind engine. I left vllm for lmdeploy. I run parallel tensor for 32B AWQ 4x3060, flies at 46 tok/sec. Feels like having a gpt-4o mini at home.

[-]

mgr2019x@reddit

My issues with tappy/exllamav2 is that the json mode (openai lib, json schema, ...) is broken in combination with speculative decoding. But i need this for my projects (agents). And yeah llama.cpp is slower, but this works.

[-]

Small-Fall-6500@reddit

It sounds like that 25% gain is what I'd expect just for switching from a Q4 to 4.5 bpw + llamacpp to Exl2. Was the Q4 a Q4_k (4.85bpw), or a lower quant?

Was that 20 T/s with tensor parallel inference? And did you try out batch inference with Exl2 / TabbyAPI? I found that I could generate 2 responses at once with the same or slightly more VRAM needed, resulting in 2 responses in about 10-20% more time than generating a single response.

Also, do you know what PCIe connection each 3090 is on?

[-]

TurpentineEnjoyer@reddit

I reckon the results are what I expected, I was posting partly to give a benchmark to others who might come in expecting double the cards = double the speed.

One 3090 is on pcie4x16 the other is on pcie4x4

Tensor parrallelism via oobabooga's loader for exllama, and I did not try batch because I don't need it for my use case.

[-]

gtek_engineer66@reddit

Add in speculative decoding for a real gain in tokens, and recent kv cache optimisations for massive context

[-]

TurpentineEnjoyer@reddit

speculative decoding is really only useful or coding or similarly deterministic tasks.

[-]

No-Statement-0001@reddit

It’s helped when I do normal chat too. All those stop words, punctuation, etc can be done by the draft model. Took my llama-3.3 70B from 9 to 12 tok/sec on average. A small performance bump but a big QoL increase.

[-]

Weary_Long3409@reddit

This is somewhat correct, but also I left exllamav2 for vLLM. And now I left vLLM for lmdeploy. It's crazy fast running AWQ, much faster than vLLM, especially on long context. Still use exllamav2 for multi GPU without tensor parallelism.

[-]

Willing_Landscape_61@reddit

What is the CPU backend story for vLLM? Does it handle NUMA?

[-]

Lesser-than@reddit

At least put some context of your blog post rather than just a link to your article, this feels more like your trying to generate hate views more than a discussion.

[-]

SecretiveShell@reddit

vLLM and sglang are amazing if you have the VRAM for fp8. exl2 is a nice format and exllamav2 is a nice inference engine, but the ecosystem around it is really poor.

[-]

Mart-McUH@reddit

Multi GPU does not mean the GPU's are equal. I think tensor parallelism does not work when you have two different cards. llama.cpp does work. And it also allows offload to CPU when needed.

Also recently I compared 32B DeepseekR1 distill of Qwen and Q8 GGUF worked great. While EXL2 8bpw was much worse in output quality. So that speed gain is probably not for free.

[-]

Small-Fall-6500@reddit

Article mentions Tensor Parallelism being really important but completely leaves out PCIe bandwidth...

Kinda hard to speed up inference when one of my GPUs is on a 1 GB/s PCIe 3.0 x1 connection. (Though batch generations in TabbyAPI does work and is useful - sometimes).

[-]

a_beautiful_rhind@reddit

All those people who said PCIe bandwidth doesn't matter, where are they now? Still should try it an see or did you not get any difference?

[-]

Small-Fall-6500@reddit

I have yet to see any benchmarks or claims of greater than 25% speedup when using tensor parallel inference, at least for 2 GPUs in an apples to apples comparison, so if 25% is the best expected speedup then PCIe bandwidth still doesn't matter that much for most people (especially when that could cost an extra $100-200 for a mobo that has more than just additional PCIe 3.0 x1 connections)

I tried using the tensor parallel setting in TabbyAPI just now (with latest Exl2 0.2.7 and TabbyAPI) but the output was gibberish, looked like random tokens. The token generation speed was about half of the normal inference, but there is obviously something wrong with it right now. I believe all my config settings were the default, except for context size and model. I'll try some other settings and do some research on why this is happening but I don't expect the performance to be better than without tensor parallelism anyway.

[-]

llama-impersonator@reddit

difference for me is literally 17T/s to 32T/s

[-]

Small-Fall-6500@reddit

For two GPUs, same everything else, and for single response generation vs tensor parallel?

What GPUs?

[-]

llama-impersonator@reddit

2 3090, 1 PCIe 4 x16, 1 PCIe 4 x4 on B650e board

[-]

a_beautiful_rhind@reddit

For me its a difference between 15 and 20t/s or there about. Doesn't fall as fast when context goes up. On 70b its like whatever, but for mistral large it made the model much more usable for 3 gpus.

IMO, its worth it to have at least 8x links. You're only 1x a single card but others were saying to 1x large numbers of cards and it would make no difference. I think the latter is bad advice.

[-]

Aaaaaaaaaeeeee@reddit

3060, and P100 vllm fork have the highest gain. P100x4 is benchmarked by DeltaSqueezer, I think it was 140%

There also exist some other cases.

someone getting these results with vllm:

F16 70B 19.93 t/s
INT8 72B 28 t/s
Sharing single stream (batchsize = 1) inference on 70B fp16 weights on 2080ti x 8
speed is 400% higher than a single 2080ti's rated bandwidth.

[-]

XMasterrrr@reddit (OP)

Check out my other blogposts, I talk about that. Wanted this to be more concise.

[-]

Small-Fall-6500@reddit

Wanted this to be more concise.

I get that. It would probably be a good idea to mention it somewhere in the article though, possibly with a link to another article or source for more info at the very least.

[-]

laerien@reddit

Or exo is an option if you're on Apple Silicon. Installing it is a bit of a pain but then it just works!

[-]

a_beautiful_rhind@reddit

vLLM needs even numbers of GPUs. Some models aren't supported by exllama. I agree it's preferred, especially since you know you're not getting tokenizer bugs from the cpp implementation.

[-]

edude03@reddit

Needs a power of two number but also it's not a vllm restriction

[-]

deoxykev@reddit

Quick nit:

vLLM Tensor parallelism requires 2, 4, 8 or 16 GPUs. An even number like 6 will not work.

[-]

a_beautiful_rhind@reddit

Yes, you're right in that regard. At least with 6 you can run it on 4.

[-]

memeposter65@reddit

At least on my setup, using anything else than llama.cpp seems to be really slow (like 0.5t/s). But that might be due to my old GPUs.

[-]

silenceimpaired@reddit

This post fails to consider the side of the model and the cards. I still have plenty of the model in ram… unless something has changed llama.cpp is the only option

[-]

Massive-Question-550@reddit

Is it possible to use an AMD and Nvidia GPU together or is this a really bad idea?

[-]

fallingdowndizzyvr@reddit

I do. And Intel and Mac thorn in there too. Why would it be a bad idea? As far as I know, llama.cpp is the only thing that can do it.

[-]

Previous_Fun_4508@reddit

exl2 is GOAT 🐐

[-]

tengo_harambe@reddit

Aren't there quality differences between EXL2 and GGUF with GGUF being slightly better?

[-]

a_beautiful_rhind@reddit

XTC and Dry implementation is different. You can use it through ooba.

[-]

fiery_prometheus@reddit

It's kind of hard to tell, since things often change in the codebase, and there are a lot of variations in how to make the quantizations. You can change the bits per weight, change which parts of the model gets a higher bpw than the rest, use a dataset to calibrate and quantize the model etc, so if you are curious you could run benchmarks or just take the highest bpw you can and call it a day.

Neither library uses the best quantization technique in general though, but there's a ton of papers and new techniques coming out all the time, VLLM and Aphrodite has generally been better at supporting new quant methods. Personally, I specify some that some layers should have a higher bpw than others in llamacpp and quantize things myself, but I still prefer to use vllm for throughput scenarios and prefer awq over gptq, then int8 or int4 quants (due to the hardware I run on) or hqq.

My guess is, when it comes to which quant techniques llamacpp and exllamav2 use, is that they should be able to produce a quantized model in a reasonable timeframe, since, some quant techniques, while they produce better quantized models, take a lot of computational time to make.

[-]

randomanoni@reddit

Sampler defaults are different. Quality depends on the benchmark. As GGUF is more popular it might be confirmation bias.

[-]

JockY@reddit

Agreed. Moving to tabbyAPI (exllamav2) from llama.cpp got me to 37 tok/sec with Qwen1.5 72B at 8 bits and 100k context.

Llama.cpp tapped out around 12 tok/sec at 8 bits.

[-]

AdventurousSwim1312@reddit

Can you share your config? I am reaching this speed on my 2*3090 only in 4bit and with a draft model

[-]

JockY@reddit

Yeah I have a Supermicro M12SWA-TF motherboard with Threadripper 3945wx. Four GPUs: - RTX 3090 Ti - RTX 3090 FTW3 (two of these) - RTX A6000 48GB - total 120GB

I run 8bpw exl2 quants with tabbyAPI/exllamav2 using tensor parallel and speculative decoding using the 8bpw 3B Qwen2.5 Instruct model for drafts. All KV cache is FP16 for speed.

It gets a solid 37 tokens/sec when generating a lot of code.

[-]

AdventurousSwim1312@reddit

Ah yes, the difference might come from the fact you have more GPU

With that config you might want to try MLC Llm, vllm or Aphrodite, from my testing, their tensor parallel implementation works a lot better than the one from exllama v2

[-]

LinkSea8324@reddit

poor ggerganov :(

[-]

You_Wen_AzzHu@reddit

The model is too large to fit into one GPU and you want us to use Tensor Parallel.

[-]

ParaboloidalCrest@reddit

Re: exllamav2. I've love to try it, but ROCm support is a pain in the rear to get running, and the exllama quants are so scattered and way harder to find a suitable size than GGUF.

[-]

stanm3n003@reddit

How many people can you serve with 48gb Vram and vLLM? Lets say a 70b q4 Model?

[-]

Leflakk@reddit

Not everybody can fit the models on GPU so llama.cpp is a amazing for that and the large panel of quantz is very impressive.

Some people love how ollama allows to manage models and how it is user firendly even if in term of pure performances, llamacpp should be prefered.

ExLlamaV2, could be perfect for GPUs if the quality were not degraded compared to others (dunno why).

On top of these, vllm is just perfect for performances / production / scalability for GPUs users.

[-]

gpupoor@reddit

this is a post that explicitly mentions multigpu, sorry bit your comment is kind of (extremely) irrelevant

[-]

Leflakk@reddit

You can use multi gpu with cpu offloading if the model does not fit

[-]

ForsookComparison@reddit

Works with ROCm/Vulkan?

[-]

gpupoor@reddit

vllm+tp works with rocm, it only needs a few changes. I'll link them later today

[-]

ParaboloidalCrest@reddit

Neva!

[-]

ozzie123@reddit

I love EXL2 with Oobabooga. I just wish more UX supports vLLM.

[-]

bullerwins@reddit

I think most of use agree. Basically we just use llama.cpp when we need to offload big models to ram and can't fit it to vram. Primeagen was probably using llama.cpp because it's the most popular engine, I believe he is not too deep into LLM's yet.
I would say vLLM if you can fit the unquantized model or like the 4bit awq/gptq quants.
Exllamav2 if you need a more fine graned quant like q6, q5, q4.5...
And llama.cpp for the rest.

Also llama.cpp supports pretty much everything, so developers with only mac without a gpu server use llama.cpp