llama.cpp B9387 Significant AMD/ROCm PP Update

[-]

illforgetsoonenough@reddit

Does this make rocm outperform vulkan?

[-]

Bulky-Priority6824@reddit (OP)

not on consumer hardware

[-]

hopbel@reddit

I find rocm tends to have more VRAM overhead and unpredictable spikes that force me to use a smaller quant.

[-]

Momsbestboy@reddit

... and not on Debian testing, because rocm is still stuck on 6.4 here. Upgrading to 7.x requires Ubuntu, which I only run on a small, dedicated SSD.

So: Vulkan it is for me, and my R9700 is actually a pretty good GPU. Not the fastest, but 32GB VRAM is really good.

[-]

harpysichordist@reddit

I have been installing the latest - or close to it - ROCm on Debian. Try https://rocm.docs.amd.com/en/7.13.0-preview/install/rocm.html?fam=all&os=debian&i=pip&debian-ver=13&w=compute

[-]

arbv@reddit

ROCm usually significantly outperforms Vulkan at PP. With TG, Vulkan is slightly faster for shorter contexts in my experience (RX7900XTX).

[-]

CalligrapherFar7833@reddit

I bet there is one or half a person here using those cards?

[-]

JaredsBored@reddit

MI100s go for about $1000 on eBay and there are some people using them here. MI210 and up are still price-prohibitive for basically everyone who's not also cross-shopping RTX Pro 6Ks.

[-]

fallingdowndizzyvr@reddit

"MI100s go for about $1000 on eBay and there are some people using them here."

But why would you want to? Sure the paper specs look great. The reality though is that a 5060ti spanks it silly for PP even including this PR.

[-]

JaredsBored@reddit

Double the price of a 5060 Ti 16GB but with double the memory capacity and three times the bandwidth. And the MI100 actually has the compute to match (unlike the MI50).

[-]

fallingdowndizzyvr@reddit

On paper. The real world is another matter.

"Instinct MI100 32 GB / HBM2 / 4096 bit 2732.83 ± 1.98 110.48 ± 0.14"

"RTX 5060 Ti 16 GB / GDDR7 / 128 bit 3737.25 ± 6.79 90.94 ± 0.02"

"has the compute to match"

As you can see from the PP, it has less compute than the 5060ti.

"three times the bandwidth"

Which doesn't show up in the TG. Sure it's better but not 300% better. More like 20%.

"Double the price of a 5060 Ti 16GB"

Or you can just buy 2x5060 ti. Then have 32GB too and have even better performance thanks to TP.
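
For anyone who wants to check the percentages, here is a minimal sketch that recomputes the gaps from the two quoted rows. It assumes the first number in each row is PP and the second is TG, both in tokens per second, which is how the comment reads them; the thread itself gives no column headers. The dictionary and function names are made up for illustration.

```python
# Figures copied from the two quoted rows. Assumption: first value is PP,
# second is TG, both in tokens/s (the thread shows no column headers).
mi100 = {"pp": 2732.83, "tg": 110.48}
rtx_5060_ti = {"pp": 3737.25, "tg": 90.94}


def pct_faster(a: float, b: float) -> float:
    """Return how much faster a is than b, as a percentage."""
    return (a / b - 1.0) * 100.0


# PP: the 5060 Ti leads. TG: the MI100 leads.
pp_gap = pct_faster(rtx_5060_ti["pp"], mi100["pp"])
tg_gap = pct_faster(mi100["tg"], rtx_5060_ti["tg"])

print(f"5060 Ti PP advantage over MI100: {pp_gap:.1f}%")
print(f"MI100 TG advantage over 5060 Ti: {tg_gap:.1f}%")
```

On these two rows the 5060 Ti comes out roughly 37% ahead at PP and the MI100 roughly 21% ahead at TG, which matches the "more like 20%" estimate.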

[-]

JaredsBored@reddit

Where are these numbers from? What build of ROCm? What model? There's no context here

[-]

fallingdowndizzyvr@reddit

https://github.com/ggml-org/llama.cpp/discussions/15021

[-]

JaredsBored@reddit

Respectfully, these numbers are old. They don't make sense to compare against. There have been numerous enhancements to both ROCm and llama.cpp since.

[-]

fallingdowndizzyvr@reddit

Those aren't my numbers. If you have a problem with them take it up with the llama.cpp community.

[-]

Bulky-Priority6824@reddit (OP)

Do you think the data center guys jumped for joy when they saw this?

[-]

fallingdowndizzyvr@reddit

Ah... data center guys aren't using llama.cpp.

[-]

Bulky-Priority6824@reddit (OP)

Exactly

[-]

fallingdowndizzyvr@reddit

LOL. So why would they have "jumped for joy when they saw this"?

[-]

Bulky-Priority6824@reddit (OP)

I was being facetious

[-]

Huge for MI250X/MI300 users. Q4_K_S at batch 8 got 68% faster prompt processing. K quants benefit early, and legacy quants show no regression. Used Runable to benchmark AMD vs NVIDIA for a local LLM server. Clean charts in 20 minutes. Went MI300X. Glad I did.

[-]

hopbel@reddit

"restricted to AMD CDNA architecture"

So completely irrelevant to consumers.

[-]

Bulky-Priority6824@reddit (OP)

But stop and think. If llama.cpp can do this with dc hardware, maybe other doors will open soon for gains on consumer hardware.

[-]

hopbel@reddit

No, because it relies on instructions which flat out do not exist on consumer cards.

[-]

Bulky-Priority6824@reddit (OP)

I see. Yes. However I'm referring to the effort to do so.

[-]

peligroso@reddit

Look man, it's been 5 years. Stop trying to make AMD a thing.

[-]

Bulky-Priority6824@reddit (OP)

(I don't own any AMD gpus)

[-]

fallingdowndizzyvr@reddit

There was already a similar PR for consumer AMD hardware, RDNA 3.5. But it was rejected since it was only for one architecture and they don't want things that only impact such a limited scope complicating the code base.

[-]

dc740@reddit

cries in 3x MI50 😞

[-]

fallingdowndizzyvr@reddit

"MFMA is restricted to AMD CDNA architecture that's MI100, MI200, MI300 series datacenter cards."

Hm... so they approved this PR for only a specific architecture, CDNA, but they rejected the RDNA 3.5 speed up PR because it was only for a specific architecture.
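
To make the CDNA restriction quoted above concrete, here is a rough sketch of a lookup keyed on the `gfx` target names that ROCm tooling reports. This is not how llama.cpp detects MFMA support; `CDNA_TARGETS` and `has_mfma` are hypothetical names for illustration, and the table only covers the MI100, MI200, and MI300 families named in the quote.

```python
# Hypothetical helper, not llama.cpp's actual detection logic.
# Maps ROCm gfx target names to the CDNA families named in the quote.
CDNA_TARGETS = {
    "gfx908": "MI100 (CDNA 1)",
    "gfx90a": "MI200 series (CDNA 2)",
    "gfx942": "MI300 series (CDNA 3)",
}


def has_mfma(gfx_target: str) -> bool:
    """Return True if the target is one of the CDNA parts listed above.

    Targets may carry feature suffixes such as "gfx90a:sramecc+:xnack-",
    so only the part before the first colon is compared.
    """
    base = gfx_target.split(":")[0].strip().lower()
    return base in CDNA_TARGETS


for target in ("gfx908", "gfx90a:sramecc+:xnack-", "gfx906", "gfx1100"):
    print(target, has_mfma(target))
```

Under this table `gfx906` (MI50) and `gfx1100` (7900 XTX) both come back `False`, which is why the MI50 and consumer RDNA owners in this thread see no gain from the PR.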

[-]

Bulky-Priority6824@reddit (OP)

damn 1300 on 7900 xtx is painful to see

[-]

StardockEngineer@reddit

amazing

[-]

Advanced-Picture5016@reddit

doesn't help, as my (stupid) mixed 9070xt and 9060xt setup doesn't even load the model right now