llama.cpp B9387 Significant AMD/ROCm PP Update
Posted by Bulky-Priority6824@reddit | LocalLLaMA | 44 comments
https://github.com/ggml-org/llama.cpp/releases/tag/b9387
Post your initial results if you try it!
illforgetsoonenough@reddit
Does this make rocm outperform vulkan?
Bulky-Priority6824@reddit (OP)
not on consumer hardware
fallingdowndizzyvr@reddit
ROCm can outperform Vulkan on consumer hardware.
hopbel@reddit
I find rocm tends to have more VRAM overhead and unpredictable spikes that force me to use a smaller quant.
Momsbestboy@reddit
... and not on Debian testing, because ROCm is still stuck on 6.4 here. Upgrading to 7.x requires Ubuntu, which I only run on a small, dedicated SSD.
So: Vulkan it is for me, and my R9700 is actually a pretty good GPU. Not the fastest, but 32GB VRAM is really good.
harpysichordist@reddit
I have been installing the latest - or close to it - ROCm on Debian. Try https://rocm.docs.amd.com/en/7.13.0-preview/install/rocm.html?fam=all&os=debian&i=pip&debian-ver=13&w=compute
DrBearJ3w@reddit
Outperform in what? VRAM usage?
LinkSea8324@reddit
^ smartest vibe coder ^
DrBearJ3w@reddit
How does that relate to the thread?
arbv@reddit
ROCm usually significantly outperforms Vulkan at PP. With TG, Vulkan is slightly faster for shorter contexts in my experience (RX7900XTX).
CalligrapherFar7833@reddit
I bet there is one or half a person here using those cards ?
JaredsBored@reddit
MI100s go for about $1000 on eBay and there are some people using them here. MI210 and up are still price-prohibitive for basically everyone who isn't also cross-shopping RTX Pro 6Ks.
fallingdowndizzyvr@reddit
But why would you want to? Sure the paper specs look great. The reality though is that a 5060ti spanks it silly for PP even including this PR.
JaredsBored@reddit
Double the price of a 5060 Ti 16GB but with double the memory capacity and three times the bandwidth. And the Mi100 actually has the compute to match (unlike the Mi50)
fallingdowndizzyvr@reddit
On paper. The real world is another matter.
"Instinct MI100 32 GB / HBM2 / 4096 bit 2732.83 ± 1.98 110.48 ± 0.14"
"RTX 5060 Ti 16 GB / GDDR7 / 128 bit 3737.25 ± 6.79 90.94 ± 0.02"
As you can see from the PP, it has less compute than the 5060ti.
Which doesn't show up in the TG. Sure it's better but not 300% better. More like 20%.
Or you can just buy 2x5060 ti. Then have 32GB too and have even better performance thanks to TP.
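For reference, the size of the gaps in the two quoted rows can be checked with a few lines of Python. This is only a sketch: it assumes the first figure in each row is PP and the second is TG, both in tokens/s, which is how the comment reads them but which the quoted rows do not label. The dictionary keys and the `percent_faster` helper are made up for illustration.

```python
# Figures copied from the two quoted rows above.
# Assumption: first value is PP (tokens/s), second is TG (tokens/s).
results = {
    "MI100": {"pp": 2732.83, "tg": 110.48},
    "RTX 5060 Ti": {"pp": 3737.25, "tg": 90.94},
}


def percent_faster(a: float, b: float) -> float:
    """Return how much faster a is than b, as a percentage."""
    return (a / b - 1.0) * 100.0


# PP: the 5060 Ti leads, so compare it against the MI100.
pp_gap = percent_faster(results["RTX 5060 Ti"]["pp"], results["MI100"]["pp"])
# TG: the MI100 leads, so compare it against the 5060 Ti.
tg_gap = percent_faster(results["MI100"]["tg"], results["RTX 5060 Ti"]["tg"])

print(f"5060 Ti PP lead over MI100: {pp_gap:.1f}%")
print(f"MI100 TG lead over 5060 Ti: {tg_gap:.1f}%")
```

On these figures the 5060 Ti comes out roughly a third faster at PP, and the MI100 roughly a fifth faster at TG, in line with the "more like 20%" estimate.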
JaredsBored@reddit
Where are these numbers from? What build of ROCm? What model? There's no context here
fallingdowndizzyvr@reddit
https://github.com/ggml-org/llama.cpp/discussions/15021
JaredsBored@reddit
Respectfully, these numbers are old. They don't make sense to compare against. There have been numerous enhancements to both ROCm and llama.cpp since then.
fallingdowndizzyvr@reddit
Those aren't my numbers. If you have a problem with them take it up with the llama.cpp community.
Bulky-Priority6824@reddit (OP)
Do you think the data center guys jumped for joy when they saw this?
fallingdowndizzyvr@reddit
Ah... data center guys aren't using llama.cpp.
Bulky-Priority6824@reddit (OP)
Exactly
fallingdowndizzyvr@reddit
LOL. So why would they have "jumped for joy when they saw this"?
Bulky-Priority6824@reddit (OP)
i was being facetious
rasbid420@reddit
cries in gfx803
Main_Problem_2696@reddit
Huge for MI250X/MI300 users. Q4_K_S at batch 8 got 68% faster prompt processing. K quants benefit early, and legacy quants show no regression. Used Runable to benchmark AMD vs NVIDIA for a local LLM server. Clean charts in 20 minutes. Went MI300X. Glad I did.
hopbel@reddit
So completely irrelevant to consumers.
Bulky-Priority6824@reddit (OP)
But stop and think. If llama.cpp can do this with DC hardware, maybe other doors will open soon for gains on consumer hardware.
hopbel@reddit
No, because it relies on instructions which flat out do not exist on consumer cards.
Bulky-Priority6824@reddit (OP)
I see. Yes. However I'm referring to the effort to do so.
peligroso@reddit
Look man, it's been 5 years. Stop trying to make AMD a thing.
Bulky-Priority6824@reddit (OP)
(I don't own any AMD gpus)
fallingdowndizzyvr@reddit
There was already a similar PR for consumer AMD hardware, RDNA 3.5. But it was rejected since it was only for one architecture and they don't want things that only impact such a limited scope complicating the code base.
dc740@reddit
cries in 3x MI50 😞
fallingdowndizzyvr@reddit
Hm... so they approved this PR for only a specific architecture, CDNA, but they rejected the RDNA 3.5 speed up PR because it was only for a specific architecture.
BevinMaster@reddit
Well, CDNA has multiple generations while 3.5 is a specific subset of RDNA cards, but I guess yeah, for hardware matrix related operations it’s only RDNA3 and onwards anyway.
oxygen_addiction@reddit
The love-hate rollercoaster of llama.cpp
Inevitable_Mistake32@reddit
from 1300tk/s to 1300tk/s 7900xtx
Bulky-Priority6824@reddit (OP)
damn 1300 on 7900 xtx is painful to see
No-Refrigerator-1672@reddit
Depends on the model. Pretty expected for ~30B dense.
Sensitive_Pop4803@reddit
AGI achieved
nasone32@reddit
Well yeah, "Non-AMD-MFMA paths (NVIDIA, RDNA, CDNA1 without MFMA) are byte-identical to master"
StardockEngineer@reddit
amazing
Advanced-Picture5016@reddit
doesn't help as my (stupid) mixed 9070xt and 9060xt doesn't even load the model right now