llama.cpp: owners of old GPUs wanted for performance testing

[-]

Robert__Sinclair@reddit

llama.cpp should add back opencl support. It was giving me 20%-30% on my core i7 GTX 970M notebook.

Reply

[-]

Robert__Sinclair@reddit

yep.. because it uses a very old version of llama.cpp

Reply

[-]

Its a fork of Llamacpp and one of the goals of the fork is retaining compatibility where we can. We still have support for all the GGML formats, we still have support for Vision models over the API, we still have OpenCL. But if you want to use it with something modern its also very modern with it being based on a llamacpp version from only a few days ago.

Reply

[-]

Robert__Sinclair@reddit

you're right. my bad.

Reply

[-]

fish312@reddit

No it's quite up to date

Reply

[-]

Robert__Sinclair@reddit

still can't convert and use phi-3-small it relies too much on llama.cpp

Reply

[-]

satireplusplus@reddit

vulkan not an option?

Reply

[-]

Robert__Sinclair@reddit

not much improvement with that.. but a 20-30% boost with opencl for some f'ing reason.

Reply

[-]

satireplusplus@reddit

you'd need to pay attention that the vulkan backend actually uses the GPU and not the Vulkan-CPU backend.

Reply

[-]

Robert__Sinclair@reddit

I don't see any improvement with vulkan or at least it's not noticeable as the one of openCL.

Reply

[-]

SystemErrorMessage@reddit

Vulkan is a graphics api not compute. Opengl/vulkan has opencl interoperability. If your software does compute for graphics thats where this helps as on the same gpu can skip the cpu and render results straight. For example lets say you use opencl not physx to do physics, interoperability lets you show the results without going through cpu. Cant use vulkan to run compute, its a render pipeline.

Reply

[-]

timschwartz@reddit

https://www.khronos.org/blog/getting-started-with-vulkan-compute-acceleration

Reply

[-]

SystemErrorMessage@reddit

The compute mentioned is for graphics not something like physics. Although you can do matrice operations i dont think it supports layered data. Opencl is also made by the same org and the tutorial talks about shader programming, to offload graphics processing of some elements from cpu to gpu

Reply

[-]

fallingdowndizzyvr@reddit

Khronos has said they want to converge OpenCL and Vulkan as much as possible. Vulkan is for compute as well.

Reply

[-]

SystemErrorMessage@reddit

Converge doesnt mean compute and a vulkan backend is only possible if vulkan has the needed compute features which if it does for your software then ofcourse its faster. The problem here is thinking that vulkan replaces opencl which it doesnt. Ive spoken to devs and vulkan is very difficult to code for. Intel gpus should have the advantage in opencl including their igps as they should be cut down x86 processors. Intel has the best opencl to core/clock performance than amd/nvidia on gpus. So if they include multiple ALUs they would be fast and not need the slow npu they have on arc that even my arm ones are much faster.

Reply

[-]

fallingdowndizzyvr@reddit

> The problem here is thinking that vulkan replaces opencl which it doesnt. Tell that to Khronos. Their stated goal is to merge OpenCL into Vulkan. "OpenCL is announcing that their strategic direction is to support CL style computing on an extended version of the Vulkan API. The Vulkan group is agreeing to advise on the extensions." https://pcper.com/2017/05/breaking-opencl-merging-roadmap-into-vulkan/ > Ive spoken to devs and vulkan is very difficult to code for. And? A lot of things that are worthwhile are more difficult to do. The same dev wrote both the OpenCL and Vulkan backends for llama.cpp. He's chosen to go with Vulkan. That's why the OpenCL backend is gone now. > Intel gpus should have the advantage in opencl including their igps as they should be cut down x86 processors. I Intel is pushing SYCL, not OpenCL.

Reply

[-]

SystemErrorMessage@reddit

thats not replacing, rather kronos wants to integrate opencl into vulkan rather than replace. I've run opencl on nvidia, amd and intel GPUs and the performance efficiency of intel GPUs for opencl is good compared to the rest but amd actually did worse despite their compute focused cards tested. So its not that vulkan supports compute directly, rather the article from kronos is talking about making it easier to use openCL with vulkan which is useful for softwares like blender. The other reason for openCL is because its the same whether it is CPU or GPU so you can combine a multi vendor multi processor system and it would work the same on any processor.

Reply

[-]

fallingdowndizzyvr@reddit

> thats not replacing, rather kronos wants to integrate opencl into vulkan rather than replace. I suggest you look into OpenGL and Vulkan. Since OpenCL is heading down the same road. > The other reason for openCL is because its the same whether it is CPU or GPU so you can combine a multi vendor multi processor system and it would work the same on any processor. There's no reason that Vulkan can't do the same. Vulkan is a graphics and compute API. While it's not allowed to have a graphics only implementation. It is allowed to have a compute only implementation. Why allow for that if it's not envisioned that it will run on compute only devices? Like a CPU.

Reply

[-]

satireplusplus@reddit

5 months ago llama.cpp released a Vulkan backend that's fully compatible with the current Vulkan standard. I don't know what's really left to discuss here, at the end of day it's easy to try it out on your hardware and it opens up support for AMD GPUs, Intel GPUs and any other GPU that supports Vulkan. It also works on Nvidia cards, but the CUDA backend is a lot more mature of course, so it doesn't really make sense to use Vulkan here. Here's this sub discussion on this new feature: https://www.reddit.com/r/LocalLLaMA/comments/1adbzx8/as_of_about_4_minutes_ago_llamacpp_has_been/ Now it's still new and whether it will work for your GPU or not is still up to how good the vulkan driver impl is for your hardware, but in theory it should run on any GPU that supports Vulkan 1.3. I tried running it on some Arm SoC hardware (Orange Pi 5) with a Mali GPU and hit a dead end with some op that the driver didn't implement, but it was worth a try and the backend may get much better with time.

Reply

[-]

daHaus@reddit

FP16 support in Vulkan is still lacking for many platforms (AMD) where it wasn't an issue with OpenCL.

Reply

[-]

fallingdowndizzyvr@reddit

I guess you don't realize that OpenCL backed was removed from llama.cpp because the Vulkan backend has replaced it.

Reply

[-]

satireplusplus@reddit

llama.cpp uses the render pipeline as compute: See https://github.com/ggerganov/llama.cpp > Vulkan and SYCL backend support > cmake -B build -DGGML_VULKAN=1 > cmake --build build --config Release > # Test the output binary (with "-ngl 33" to offload all layers to GPU) > ./bin/llama-cli -m "PATH_TO_MODEL" -p "Hi you how are you" -n 50 -e -ngl 33 -t 4 > > # You should see in the output, ggml_vulkan detected your GPU. For example: > # ggml_vulkan: Using Intel(R) Graphics (ADL GT2) | uma: 1 | fp16: 1 | warp size: 32

Reply

[-]

SiEgE-F1@reddit

Sometimes it is easier for you to keep using an older version of llama.cpp(you still have access to it by downloading the sources before the commit that removed opencl support), than for them to upkeep an outdated technology. Did they explain why they've removed it?

Reply

[-]

Robert__Sinclair@reddit

that's what I do.. but new models are coming out and the old version does not support them.

Reply

[-]

SiEgE-F1@reddit

But maybe there are workarounds? Are you sure opencl is your only option?

Reply

[-]

Robert__Sinclair@reddit

yep. makes no difference...

Reply

[-]

SiEgE-F1@reddit

[https://github.com/ggerganov/llama.cpp/issues/8219](https://github.com/ggerganov/llama.cpp/issues/8219) [https://github.com/ggerganov/llama.cpp/issues/7768](https://github.com/ggerganov/llama.cpp/issues/7768) Ah.. I see you've already did some work in the background for 2 months already. How is CUDA not viable for you?

Reply

[-]

Robert__Sinclair@reddit

because I have an old notebook, with a GTX970M... I tried it but I see no real advantage... with opencl instead it was 20%-30% faster than normal.

Reply

[-]

MDSExpro@reddit

And just focus on it. One codebase = faster development. Everything supports OpenCL.

Reply

[-]

fallingdowndizzyvr@reddit

> Everything supports OpenCL. That's not true at all. There are plenty of things that don't support OpenCL. Vulkan on the other hand, is pretty much universal on GPUs.

Reply

[-]

MDSExpro@reddit

There is a way to enable OpenCL on Pixels. Vulkan being limited only to GPUs while more and more processing moves to other accelerators would be suicide for project.

Reply

[-]

fallingdowndizzyvr@reddit

> There is a way to enable OpenCL on Pixels. Say more. Since I've never heard of anyone being able to do it. But even if you can, then it's still more of a hassle than Vulkan. Which just works. > Vulkan being limited only to GPUs while more and more processing moves to other accelerators would be suicide for project. Why don't you think Vulkan will not run on said accelerators? It would be suicide for them not to support it.

Reply

[-]

MDSExpro@reddit

> Say more. Since I've never heard of anyone being able to do it. But even if you can, then it's still more of a hassle than Vulkan. Which just works. https://github.com/reekotubbs/Pixel_OpenCL_Fix Long story short: HW is OpenCL capable and most is software is here, Google is just abusing it's monopoly, as always. > Why don't you think Vulkan will not run on said accelerators? It would be suicide for them not to support it. Which would turn Vulkan into OpenCL. Might as well skip to the end and just use OpenCL.

Reply

[-]

fallingdowndizzyvr@reddit

> Long story short: HW is OpenCL capable and most is software is here, Google is just abusing it's monopoly, as always. Yeah. That's always been known. That's why people, including me, have tried using the OpenCL libraries released for other phones using the same GPU. Didn't work. > https://github.com/reekotubbs/Pixel_OpenCL_Fix Have you personally tried it? Did it work? > Which would turn Vulkan into OpenCL. Might as well skip to the end and just use OpenCL. Khronos, who controls both OpenCL and Vulkan, has said the goal is to merge OpenCL functionality into Vulkan. So it's more like skip to the end and just use Vulkan. As per Khronos. "OpenCL is announcing that their strategic direction is to support CL style computing on an extended version of the Vulkan API. The Vulkan group is agreeing to advise on the extensions."

Reply

[-]

MDSExpro@reddit

> Khronos, who controls both OpenCL and Vulkan, has said the goal is to merge OpenCL functionality into Vulkan. So it's more like skip to the end and just use Vulkan. > As per Khronos. > "OpenCL is announcing that their strategic direction is to support CL style computing on an extended version of the Vulkan API. The Vulkan group is agreeing to advise on the extensions." There is nothing on merge, just that Vulkan will mimic part of OpenCL computing API. > Have you personally tried it? Did it work? I'm more of Samsung guy.

Reply

[-]

fallingdowndizzyvr@reddit

> There is nothing on merge, just that Vulkan will mimic part of OpenCL computing API. If Vulkan can do what OpenCL can do, then why do you need OpenCL? Especially since Vulkan runs on way more devices than OpenCL. It's just part of the basic installation. Even on many devices that support OpenCL, like AMD GPUs, installing OpenCL is another step. Vulkan comes by default. This is what Khronos says about OpenCL. "OpenCL is not native to the Windows operating system, and as such **isn't supported** across the board of UWP (Universal Windows Platform) platforms (XBox, Hololens, IoT, PC)" You know what is natively supported? Vulkan.

Reply

[-]

MDSExpro@reddit

> Especially since Vulkan runs on way more devices than OpenCL That's simply wrong. OpenCL runs on CPUs, GPUs, FPGAs, DSPs, NPUs and dozen of more exotic accelerators. Vulkan runs only on GPUs. Stop trying to pain picture of Vulkan covering more ground, you are just digging yourself a deeper hole. > "OpenCL is not native to the Windows operating system, and as such isn't supported across the board of UWP (Universal Windows Platform) platforms (XBox, Hololens, IoT, PC)" UWP is dead and deprecated, for over 3 years now. OpenCL runs fine on Windows without it.

Reply

[-]

fallingdowndizzyvr@reddit

> That's simply wrong. OpenCL runs on CPUs, GPUs, FPGAs, DSPs, NPUs and dozen of more exotic accelerators. The fact that it can run, doesn't mean it does. As seen by the Pixel example. Vulkan on the other hand, runs on a whole lot of things. Also, there's nothing that says that Vulkan has to run only on GPUs. So stop insisting it does. > Stop trying to pain picture of Vulkan covering more ground, you are just digging yourself a deeper hole. LOL. So says someone who has his head firmly planted in a hole in the ground. Vulkan is a graphics and compute API. It's not just a graphics API. In fact, you can have a compute only implementation of Vulkan. Hm.... why would it only run on a GPU if it doesn't do graphics? Here. Read this. It will dispel much of the misconceptions you have about Vulkan. "In fact, the specification says that you **can’t have a graphics-only implementation of Vulkan**. However, you certainly **can have a compute-only implementation**." https://www.electronicdesign.com/technologies/embedded/article/21238533/coreavi-11-myths-about-vulkan > OpenCL runs fine on Windows without it. Which is an extra install for many devices. Vulkan on the other hand, comes as standard.

Reply

[-]

MDSExpro@reddit

> Here. Read this. It will dispel much of the misconceptions you have about Vulkan. I have no misconception about Vulkan, stop trying to push that idiotic narrative. Quote me on that or STFU. > Which is an extra install for many devices. Vulkan on the other hand, comes as standard. cuDNN is also extra and yet 99% of SW that does any computing related to neural networks manages to install that. This is not a barrier to adoption.

Reply

[-]

Remove_Ayys@reddit (OP)

Sorry, but that is simply incorrect. The portability of GPU performance is extremely poor because it depends heavily on hardware details. Writing relatively general and high-level OpenCL/Vulkan code will never be as fast as CUDA/ROCm code.

Reply

[-]

fallingdowndizzyvr@reddit

Just use Vulkan. It's way more universal and has better support than the OpenCL backend ever did.

Reply

[-]

Robert__Sinclair@reddit

for some reason I see no improvement with vulcan on my system but I see it with opencl.

Reply

[-]

CanineAssBandit@reddit

I have four M40s I can deploy!

Reply

[-]

smcnally@reddit

These get very hot. Four M40s would help with a July 4th cookout. But they’re working well with recent builds. These are Maxwell 2.0 and compute 5.2. I want to see if Maxwell 1.0 also gets a llama.cpp bump.

Reply

[-]

candre23@reddit

I got an M4000 on the shelf collecting dust. I could chuck it in a machine and run some tests if that would help.

Reply

[-]

Remove_Ayys@reddit (OP)

M4000 (Maxwell) is just what I would be interested in; if it's not too much trouble I would appreciate the results.

Reply

[-]

smcnally@reddit

I’ve done tests with the M40 on it’s own and mixed with others. Will share more on your PR. Thanks for the work. ‘Tesla M40, compute capability 5.2, VMM: yes’

Reply

[-]

candre23@reddit

OK, it's not too tough for me to stick it in one of the parts machines. I can handle installing mainline LCPP, but I'm not exactly savvy with git/github - how do I go about installing your version to test against?

Reply

[-]

AdamDhahabi@reddit

I own a Quadro P5000 (Pascal architecture) which does not support \_\_dp4a. Where can I find your build? I could run some tests but I don't have cmake installed.

Reply

[-]

Remove_Ayys@reddit (OP)

P5000s have compute capability 6.1 and therefore have \`\_\_dp4a\`. I won't need you to test the performance on that card.

Reply

[-]

AdamDhahabi@reddit

OK, I follow you on that, any reason why llama.cpp b3266 won't force MMQ then? llm\_load\_print\_meta: model type = 8B llm\_load\_print\_meta: model ftype = Q6\_K llm\_load\_print\_meta: model params = 8.03 B llm\_load\_print\_meta: model size = 6.14 GiB (6.56 BPW) llm\_load\_print\_meta: [general.name](http://general.name)= Meta-Llama-3-8B-Instruct llm\_load\_print\_meta: BOS token = 128000 '<|begin\_of\_text|>' llm\_load\_print\_meta: EOS token = 128001 '<|end\_of\_text|>' llm\_load\_print\_meta: LF token = 128 'Ä' llm\_load\_print\_meta: EOT token = 128009 '<|eot\_id|>' llm\_load\_print\_meta: max token length = 256 ggml\_cuda\_init: GGML\_CUDA\_FORCE\_MMQ: no ggml\_cuda\_init: GGML\_CUDA\_FORCE\_CUBLAS: no ggml\_cuda\_init: found 1 CUDA devices: Device 0: Quadro P5000, compute capability 6.1, VMM: yes llm\_load\_tensors: ggml ctx size = 0.27 MiB llm\_load\_tensors: offloading 32 repeating layers to GPU llm\_load\_tensors: offloading non-repeating layers to GPU llm\_load\_tensors: offloaded 33/33 layers to GPU llm\_load\_tensors: CPU buffer size = 410.98 MiB llm\_load\_tensors: CUDA0 buffer size = 5871.99 MiB ......................................................................................... llama\_new\_context\_with\_model: n\_ctx = 8192 llama\_new\_context\_with\_model: n\_batch = 2048 llama\_new\_context\_with\_model: n\_ubatch = 512 llama\_new\_context\_with\_model: flash\_attn = 0 llama\_new\_context\_with\_model: freq\_base = 500000.0 llama\_new\_context\_with\_model: freq\_scale = 1 llama\_kv\_cache\_init: CUDA0 KV buffer size = 1024.00 MiB llama\_new\_context\_with\_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB llama\_new\_context\_with\_model: CUDA\_Host output buffer size = 0.49 MiB llama\_new\_context\_with\_model: CUDA0 compute buffer size = 560.00 MiB llama\_new\_context\_with\_model: CUDA\_Host compute buffer size = 24.01 MiB llama\_new\_context\_with\_model: graph nodes = 1030 llama\_new\_context\_with\_model: graph splits = 2

Reply

[-]

GamerGateFan@reddit

Out of curiosity, did you try compiling with `-DLLAMA_CUDA_FORCE_MMQ=1` and if that does work, did it improve the performance?

Reply

[-]

AdamDhahabi@reddit

It's been a long time I compiled llama.cpp, I just grab the release from here [https://github.com/ggerganov/llama.cpp/releases](https://github.com/ggerganov/llama.cpp/releases)

Reply

[-]

GamerGateFan@reddit

Glancing at CMakeLists.txt and the github actions, unless I overlooked something or it is set dynamically at runtime, it is not a detected or default option when they compile it, but something you manually have to add.

Reply

[-]

DeltaSqueezer@reddit

I was wondering, there was previously a patch to enable flash attention for pascal GPUs, but the P100 didn't get much benefit, IIRC. I was wondering whether these 'workarounds' could be applied to the flash attention patches to yield any speedups?

Reply

[-]

compilebunny@reddit

> Relevant GPUs are P100s or Maxwell or older. Relevant models are legacy quants and k-quants. Wait... I thought that llama.cpp and its derivatives (gpt4all) couldn't run any quants other than 4_0 on the GPU because they rely on Vulkan.

Reply

[-]

Swoopley@reddit

I've got a P100 and P40, anything specific you would like me to test?

Reply

[-]

Distinct-Target7503@reddit

Hi... Could you explain to me what are the differences in performance for these GPUs?

Reply

[-]

Swoopley@reddit

https://preview.redd.it/hsnl0z7bbz9d1.png?width=1071&format=png&auto=webp&s=d86d467462feb8795aa3155062f6145bc3f306f1 I don't think I could explain all the differences so instead of reinventing the wheel I'll just paste my search result here. And if you are interested in raw performance it's always a good Idea to look at the techpowerup page of the card: [https://www.techpowerup.com/gpu-specs/tesla-p100-pcie-16-gb.c2888](https://www.techpowerup.com/gpu-specs/tesla-p100-pcie-16-gb.c2888) [https://www.techpowerup.com/gpu-specs/tesla-p40.c2878](https://www.techpowerup.com/gpu-specs/tesla-p40.c2878)

Reply

[-]

Remove_Ayys@reddit (OP)

I own several P40s and I already received P100 numbers from someone else so I don't think I'll need more testing with those GPUs.

Reply

[-]

GG-Irelia@reddit

I have a 4gb gtx 760 collecting dust on my shelf. Is it what you’re looking for?

Reply

[-]

ankurkaul17@reddit

I have a laptop with gtx1080. Happy to help.

Reply

[-]

Remove_Ayys@reddit (OP)

That won't be of use to me. The only Pascal card for which I needed testing is the P100.

Reply

[-]

StarfieldAssistant@reddit

I will do the test as soon as I can, thank you very much dude. IIRC dp4a is what allows P40 and P6000 to execute 4 int8 instructions in one 32bit calculation, i was really wondering if this was implemented or would be done someday.

Reply

[-]

ViennaFox@reddit

>old GPUs Well... I have a NVIDIA GEFORCE3 TI 200 64MB AGP card, but I'm guessing that's too old.

Reply

[-]

amaz0n_com@reddit

I have a Tesla P4 that I don't use. Let me know if that helps. Looks like its CUDA version is 6.1 - Thank you for all you do!

Reply

[-]

kryptkpr@reddit

Just so I'm clear do we want SM61 or P100 here? Because P100 are SM60. It's P40 that are SM61.

Reply

[-]

Remove_Ayys@reddit (OP)

I specifically need someone to test performance on a GPU with compute capability < 6.1 since those are the GPUs on which the `__dp4a` instruction is unavailable and for which the default change matters. On P40s with compute capability 6.1 `__dp4a` is available so I know that the performance is good.

Reply

[-]

kryptkpr@reddit

Roger, updating the issue as I go.

Reply

[-]

Wooden-Potential2226@reddit

Did you test? What was the t/s difference?

Reply

[-]

kryptkpr@reddit

I posted detailed results in the linked GH issue but tldr on my P100 single-stream is 2x and batch is 3x faster and IQ4 doesnt crash with an assert. Q8\_0 is now very usable and batching actually works. These cards have always suffered a performance penalty under llama.cpp that this PR has fixed.

Reply

[-]

Wooden-Potential2226@reddit

Fantastic - thx for the details!

Reply

[-]

pmp22@reddit

> On P40s with compute capability 6.1 __dp4a is available so I know that the performance is good. P40 gang just can't stop winning!

Reply

[-]

harrro@reddit

I only have P40 so can't help here but thanks for everything you do to improve these Tesla GPUs on llama.cpp @JG!

Reply

[-]

LPN64@reddit

ATI 3D Rage pro owners assemble !

Reply

[-]

SystemErrorMessage@reddit

How old we talking about? Gtx 580?

Reply

[-]

Fun_Tangerine_1086@reddit

Happy to give it a try - have an M4000 & Tesla P4 handy. IIUC IQ_* quants are not expected to work on Maxwell currently, right?

Reply

[-]

Remove_Ayys@reddit (OP)

IQ quants don't work on master, for those I don't need a comparison.

Reply

[-]

DeltaSqueezer@reddit

This is as nice improvement. In my initial testing (before these patches) I saw that I was getting 22 tok/s with Qwen 7Bq4 and 15 tok/s with Qwen 7Bq8. This was a far cry from what I was getting with vLLM: 71 tok/s and 49 tok/s respectively. This is what tipped me to go with vLLM as I didn't want to dig into which optimizations were missing. It should be noted that I suspected that they leveraged the 2:1 FP16 capability of the P100 as the performance of q4 quants on the P40 tanked (due to gimped FP16) on vLLM: 3 tok/s on Qwen14q4 vs 44 tok/s on the P100. But the above figures show that there's still a lot of P100 performance left on the table with llama.cpp.

Reply

[-]

kryptkpr@reddit

P40 with vLLM is physically painful for anything except --dtype=float32 which uses just massive amounts of VRAM, need two cards to run 8B :D I've been running aphrodite-engine with EXL2 on my P100s, with context-shifting enabled the performance is quite good and it actually supports batching unlike all other EXL2 implementations like tabbyAPI.

Reply

[-]

DeltaSqueezer@reddit

The only pain I remember was that it took forever to load/initialize. Something like 20-40 minutes. After that, I was getting 24 tok/s on Qwen7q8. But of course, llama.cpp did as well without the strangely long loading times.

Reply

[-]

kryptkpr@reddit

Initialization takes 30 seconds on my 2x3060+2xP100. I had to uninstall flash-attn from the venv, it doesn't work with P100 anyway and it was making the two sets of cards init for ages just like you said.

Reply

[-]