Strix Halo users, a rejected PR can give you up to 30% faster PP for MOEs.
Posted by fallingdowndizzyvr@reddit | LocalLLaMA | View on Reddit | 58 comments
Here's the PR by pedapudi.
https://github.com/ggml-org/llama.cpp/pull/21344
It's merge request has been denied so it will not be in mainline llama.cpp. The changes are so small that I just put them into whatever the current release of llama.cpp is.
Read the PR for more info. It will only work with MOEs. Also, it gives the most boost at low context. As the context rises, the gain diminishes. Pedapudi explains why that happens in the PR.
Here are some numbers. It really works well. The tiny amount of time it takes me to apply the code to the current release of llama.cpp is time well spent.
main
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 128000 MiB):
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 128000 MiB
| model | size | params | backend | ngl | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 | 1106.11 ± 8.60 |
| qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 @ d10000 | 755.79 ± 2.58 |
| qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 @ d20000 | 587.61 ± 1.52 |
| qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 @ d40000 | 415.09 ± 2.45 |
| qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 @ d60000 | 316.89 ± 2.35 |
PR
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 128000 MiB):
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 128000 MiB
| model | size | params | backend | ngl | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 | 1447.62 ± 7.10 | **+31%**
| qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 @ d10000 | 905.60 ± 3.53 | **+20%**
| qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 @ d20000 | 685.23 ± 3.03 | **+16%**
| qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 @ d40000 | 459.42 ± 2.70 | **+11%**
| qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 @ d60000 | 342.41 ± 2.43 | **+8%**
kant12@reddit
This PR is interesting. I got decent results from a quick test with mtp-bench, rocm 7.13, and 122B Q8. The first table I'm building with what had been optimal settings for me: -DGGML_HIP=ON -DGGML_HIP_ROCWMMA_FATTN=OFF -DGGML_HIP_MMQ_MFMA=OFF -DGGML_CUDA_FORCE_CUBLAS=ON -DAMDGPU_TARGETS=gfx1151
At first it was a little worse but then after modifying the build based on settings from the PR I ended up with some pretty good results. I also went back and removed the PR changes and just tested with the new build settings and it was right in the middle of these two. This is what I used to build for the second set of results: -DGGML_HIP=ON -DGGML_HIP_ROCWMMA_FATTN=OFF -DGGML_CUDA_FORCE_CUBLAS=OFF -DGGML_BMI2=ON -DGGML_FMA=ON -DGGML_F16C=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DAMDGPU_TARGETS=gfx1151
rocm 7.13, base llama.cpp and my previous options
Aggregate
rocm 7.13, PR #21344, and updated options
Aggregate
jld1532@reddit
I'm admittedly new to local AI. I'm experimenting in LM Studio and can't get qwen3.5 122b to run using the vulkan backend at all yet rocm will not utilize all available gpu memory. GPT 120b will run with vulkan and utilize available gpu memory. Has anyone experienced this issue? Any tips? Thanks.
neopolitan77@reddit
Try the Strix Halo Toolboxes: https://github.com/kyuz0/amd-strix-halo-toolboxes
RedParaglider@reddit
The 64gb memory allocation issue on rocm leaves me on vulcan on the halo. I simply have less crashes.
ilintar@reddit
I'm seeing a discussion about how Johannes handled the PR, so I'm asking y'all to stop thinking about this as a personal matter.
I have a PR currently on main (https://github.com/ggml-org/llama.cpp/pull/21160). It's not getting merged. Not now, possibly not ever. I'm maintaining it in parallel.
Reason? The backend maintainers don't want to manage the overhead of the extra code and they believe it won't benefit them enough. And that's all there is to it. There was a discussion, we weighed the pros and cons and agreed that would be the best course of action for now. The maintenance burden is real. If you ever saw Johannes figure out a fix to some obscure CUDA problem and you went "wow, how did he know where to look?", it's because the guy knows his part of the codebase insiide out. That's the idea behind having maintainers for separate parts. But this gets diluted when code gets added strictly for the reasons of "getting features out there".
Beware the dangers of availability bias. If you're looking at a feature that's beneficial to you personally and there are other people commenting on a PR saying it helps them too, it's easy to overlook the people for whom the change would be a net negative. As well as it's easy to overlook the maintainers who are going to have to be looking for bugs if something breaks there.
At some point, if something is a niche feature, it's absolutely fine to have a separate fork just for that feature maintained externally. Or to have a fork until some upstream changes get merged that make it easier to merge your changes as a PR. It's not worth making it an ego conflict, that's how bad things in OSS happen.
jwpbe@reddit
the conversation about his conduct and the fact that he is correct about the PR are two different discusions and nobody is conflating the two as you are implying in your first sentence
ilintar@reddit
Read the discussion in my PR thread 😄 my point is, he's not being intentionally rude - he's just communicating very technically and in a matter-of-fact way. He's not saying the PR is bad or that it doesn't bring benefits to some users - he's only saying that there's currently no reasonable path for merging it due to the code that would be needed for separating the "bad cases" from the "good cases" simply not being there.
JazzlikeLeave5530@reddit
Frankly I don't even see what people are cimplaiy about. I don't read rudeness in any of it.
Marksta@reddit
The guy was asking what the path forward is. The path forward feels like "buzz off" from the answers given. He did present the material blocker but the explanation fizzles after that. Is there some open PR or a plan for how we get to the point where this sort of optimization code can be merged?
No is an okay answer but it needs to be more clear. PR guy is offering good will and spending their own time. Respect time spent, avoid wasting other people's time, and don't mince the words.
"Sorry we're not in a position to accept this sort of PR, and I honestly don't know if or when we would be. The issue is X. Thank you for your contribution but I'm going to close this PR to reflect our current inability to accept it at this time."
I don't actually know what the issue X. Or if this is a now thing or a never thing. And poor PR guy sure doesn't know either from that interaction.
Interpause@reddit
I agree there isn't anything wrong with straight to the point communication, and his logic is completely sound for why the PR can't be merged.
But IMO, for opensource to be built on collaboration is a very human thing. At the very least, a short recognition of effort should've been given (especially given how many other PRs are entirely vibecoded), and some reciprocation of how polite pedapudi was throughout the convo.
The way the communication went is closer to a senior dev instructing the junior dev they are in charge of, rather than a maintainer guiding a volunteer.
ilintar@reddit
You're making a good point and I agree the communication may have come off as a bit terse, which is why I'm trying to explain the situation here.
I think most of it stems from the fact that the outcome Johannes wanted to signal was, in itself, a really unpleasant one, which stems from the limited resources of OSS projects.
When you submit a PR to a project, sometimes it'll just get accepted - and that's nice. Sometimes, you'll get told that it needs a bit of work - and that's fine. Sometimes, it'll get rejected because it's bad - then at least you know what you did wrong.
But the most frustrating outcome, and one that happened in this case (which is why I also brought up my cross-compiler PR) is that your PR is valid, *has* a valid path to adoption, but the path to adoption won't happen because it would need core code rewrites that would have to be done by the core maintainer and the core maintainer already has a backlog of much more important stuff to do. This is a scenario in which nobody's really happy and it stems from limited resources - which is why I'm saying that there are, in fact, legitimate cases where you might want to keep your own fork not out of spite (aka "they rejected my beautiful PR, I'll show'em!"), but due to actually filling a niche that the core project can't afford to maintain properly.
ilintar@reddit
And another thing: giving positive signals can of course be a nice and encouraging act of communication, but it does one more thing that can be bad in cases like this: gives false hope. It's sometimes better for a maintainer to communicate outright that "this can't be merged" than to keep giving positive feedback and shifting goalposts while refusing to merge the PR all the same.
jwpbe@reddit
I understood what he said perfectly fine, your summary didn't change that for me.
Yeah, in a way that many people will interpret as "acting like an asshole", hence the discussion of his conduct.
I get that as a ggml maintainer you probably don't want to agree to "Oh yeah you have to tiptoe around Johannes sometimes because he can be rude we just don't talk about it" but it was his response to your attempt to merge in some of ik-llama inspired technical work into llama.cpp that set the benchmark.
I don't have any stake in either project, but trying to wallpaper over his behavior is disingenuous. It's fine, some people just act like that.
I don't need you to explain that away either, I have a perfectly accurate understanding of his behavior.
Borkato@reddit
They absolutely are conflating it. Read the comments, my lord
jtjstock@reddit
What conduct? Maybe you are just reading way too much into it.
muyuu@reddit
Llama.cpp is a very large project. Some specific optimisations will never make it into the main branch because it would be impossible to maintain. This is why there are spinoff projects like DS4.
This probably belongs in the Lemonade project.
SkyFeistyLlama8@reddit
Kind of like MTP not doing anything for non-CUDA MOE inference or reducing prompt processing and token generation. Llama.cpp is becoming an octopus at this point with so many different combinations for models and inference hardware.
mister2d@reddit
This is the most reasonable response to reaction to the PR discussion.
opossum_cz@reddit
I do not think this has any reason now. This is ROCm only and Vulkan is faster on Halo Strix by 10-25%.
Tested on 122B with mtp-bench.py.
Vulkan:
Patched ROCm 2024/2024, hadn't tweaked it much, but still way worse:
kant12@reddit
This is a summary of what I get with standard llama.cpp using mtp-bench when comparing Rocm 7.13 with the FORCE_CUBLAS/HIPBLASLT settings to mesa-vulkan 26.1.1. Model 122B Q6. I'd also point out that comparing hours of work Vulkan tends to slow down much quicker so I don't really love these benchmarks. But even here there isn't much difference.
opossum_cz@reddit
Could you send exact commands, I can retest? Your Vulcan seems comparable, but your acceptance is way lower, do you have high max n?
kant12@reddit
I ended up with this. Acceptance is a bit lower but overall it seemed to perform better. I can try with just n-max 2 and get rid of the ngram map just to compare.
opossum_cz@reddit
Can you share model as well:
kant12@reddit
Sorry didn't see this earlier. I don't use a draft model and I think the reasoning_effort in the template is used by gemma maybe. I just keep that since it doesn't hurt. Gemma is also probably why I lowered the checkpoints to 15. And -dio works with rocm you can just ignore the warning with Vulkan.
opossum_cz@reddit
You most likely use draft model already integrated in gguf. I am running heretic version and it didn't come with MTP. So I load it separately.
kant12@reddit
Oh yeah you're right.
opossum_cz@reddit
Good parameters. With that I get better results myself:
```
code_python pred= 192 draft= 145 acc= 122 rate=0.841 tok/s=32.2
code_cpp pred= 192 draft= 148 acc= 118 rate=0.797 tok/s=29.9
explain_concept pred= 192 draft= 151 acc= 119 rate=0.788 tok/s=29.6
summarize pred= 192 draft= 134 acc= 127 rate=0.948 tok/s=33.1
qa_factual pred= 192 draft= 141 acc= 124 rate=0.879 tok/s=31.8
translation pred= 192 draft= 159 acc= 114 rate=0.717 tok/s=28.0
creative_short pred= 192 draft= 160 acc= 112 rate=0.700 tok/s=27.6
stepwise_math pred= 192 draft= 142 acc= 125 rate=0.880 tok/s=32.3
long_code_review pred= 192 draft= 157 acc= 121 rate=0.771 tok/s=29.5
Aggregate: {
"n_requests": 9,
"total_predicted": 1728,
"total_draft": 1337,
"total_draft_accepted": 1082,
"aggregate_accept_rate": 0.8093,
"wall_s_total": 66.89
}
```
kant12@reddit
Nice and that's what I was seeing when just comparing log files after doing similar work for a few hours so I'm pretty confident they're good.
opossum_cz@reddit
You use 122B Q6, I have to try with Q6, does it fit your RAM?
kant12@reddit
It fits if you do all those memory tweaks people recommend. radeontop shows around 106 GB used of the 120 GB GTT that I have. At the moment I'm doing 200k context and not specifying any kv cache settings. I'm using the gguf from unsloth but I don't use the UD version. It's just plain Q6.
opossum_cz@reddit
Already tried now, I thought it will push me too close to max memory, but it is pretty ok, results below.
opossum_cz@reddit
With Q6:
MeateaW@reddit
Do you get the same good performance on Q8 models? I know the quality difference is minimal compared to q6, but i found llama.cpp just tanks at q8 on vulkan, but rocm runs great.
opossum_cz@reddit
I generally us Q6/Q5. Don't see much difference. I don't see any performance issue generally. Vulkan works better for me.
MeateaW@reddit
yeah i tried running q8 models and performance was literally unusable on vulkan. some googling indicated it was because vulkan optimised for q6 but completely ignored q8.
I was looking to run the q8s because I had the memory capacity in my strix box, but vulkan perf was legitimately horrendous - it basically has some weird loop that cuases 100% CPU usage despite being fully loaded into vram.
Switch to rocm on windows no less, and it actually performed the way it should.
fallingdowndizzyvr@reddit (OP)
No it's not. ROCm has had the advantage in PP. Vulkan has had it in TG. This extends ROCm's advantage in PP. And yet another PR should address Vulkan's advantage in TG by giving ROCm the same thing that the Vulkan has that gives it that TG advantage.
Anyways, here's the same benchmark using Vulkan. It's slower than ROCm mainline. With this PR, it's much slower.
remeh@reddit
It absolutely does have a reason: improving ROCm performance. E.g., I'm not using Vulkan because in my use-cases I care a lot more about the prefill than token gen speed, and for such use-case the balance weights in ROCm being a better trade-off than Vulkan since Vulkan prefill speed drops after the context starts growing a bit.
imonlysmarterthanyou@reddit
You should submit this to the rock, lemonade, and the strix halo community that all maintain builds for strix halo. I know there is some custom patching already happening for those builds.
YashN@reddit
Seconded. The Lemonade team is great. Small, focused, friendly.
fake_agent_smith@reddit
The maintainer guy is kind of an ass in this PR discussion.
jwpbe@reddit
yeah, i'm sure he has forgotten more about CUDA than I will ever know, but every time i see him in a discussion thread he acts like an asshole. the microsecond a woman acts like he does they get called an insufferable bitch ice queen, but if you're a ggml maintainer you can just be a dick i guess
Due-Memory-6957@reddit
Actually, women get away with a LOT more rudeness than men, look it up. It's just people can't quite punch each other online lol.
Bubbly-Staff-9452@reddit
It’s not just him, maintainers of so many projects that I am pretty involved in are like this. Sometimes I will be so excited to contribute and then I see how they treat people who try but maybe are less knowledgeable than them or don’t know exactly how the maintainer wants to do it. I’m always so grateful for the work put into these free projects but I also hate that it allows them to be a bully and an asshole about everything.
Environmental-Metal9@reddit
I stopped contributing to two different opensource projects because of maintainer attitude.
I understand that they are overwhelmed, overworked, and under appreciated. I understand that dealing with the entitlement of users is rough. Yet, how one interacts with people trying to volunteer their time and coding skills (pre-2022, not much vibe coding back then) for free kind of sets the tone, and it has a real impact on how a person. I had a full life to live already to be dealing with other people’s vitriol. They have the right to shepherd their communities however they choose, and I chose to step away. The projects are still there and thriving, so clearly there are much thicker skinned people out there than I, and that’s good.
MoffKalast@reddit
https://github.com/ggml-org/llama.cpp/pull/21344#issuecomment-4208955749
An ass or not, you can't merge something that tanks performance by 50% on other GPUs to benefit one.
segmond@reddit
How? If you are going to modify 2 kernels they need to be different PRs. The kernels are completely independent and changes with tests should be independent. A lot of folks run just a few models and just one type of GPU. When you run around 20+ models varying in sizes and across many GPUs family. You see crazy regression pop up due to this PRs. Works on your machine or on some machines is not good enough. llama.cpp is now becoming a monster, almost every PR introduces a regression.
sdfgeoff@reddit
I see his side as pretty reasonable TBH. He's looking bigger picture than the PR author and is wanting some preliminary work to be done before device-specific tuning. He probably has something in mind, and until that something (or similar is present), he doesn't want ad-hoc feature gates/optimizations based on device capabilities popping up through the codebase. I'd agree with that. That sort of thing is a nightmare to untangle.
Being a software maintainer trying to keep code quality is hard, and sometimes ypu need to say 'no' to someone.
JazzlikeLeave5530@reddit
I expected way worse reading this thread but I don't read any asshole behavior at all. Very confused lol
fake_agent_smith@reddit
Yes, in terms of technical expertise he’s entirely correct and what he says is a reasonable plan, there are no issues with logic, but in terms of people skills e.g. actually referring to what the PR author says in his other comments or just some basic respect for the guy, well not so much.
hurdurdur7@reddit
Neither is Linus, but he is one heck of a maintainer
Beamsters@reddit
The guy almost solo carry local llm for a few years. I would respect his judgement but this kind of machine target PR should be group and maybe even do a monthly fork. But one maintainer can only go so far, the vendor should do this amd-specific llama cpp themselves.
jwpbe@reddit
linus learned how to chill the fuck out over time
jwpbe@reddit
oh, for sure, he just acts like an insufferable asshole about it, thats the point. i dont think what you said was ever in question
wllmsaccnt@reddit
I don't think being short with someone is the same thing as being an insufferable asshole. I read the conversation as both sides having a miscommunication about expectations. Pedapudi was asking for next steps outside of the PR, while the maintainer was trying to communicate clearly about the fate of the PR (though the maintainers responses lacked affect or empathy).
jwpbe@reddit
i get it, its just not in a vacuum
hurdurdur7@reddit
https://gist.github.com/richhickey/1563cddea1002958f96e7ba9519972d9
opossum_cz@reddit
How is that relevant to anything?
audioen@reddit
I think it might make sense to have some kind of tuning that benchmarks the engine for sensible choices of various workload division parameters, and reports the best results, which then has to be communicated via configuration to make permanent.
I noticed that vllm also seems to have some kind of startup tuning phase where it likely tests for best-performing inference parameter combination.