Strix Halo users, a rejected PR can give you up to 30% faster PP for MOEs.

Posted by fallingdowndizzyvr@reddit | LocalLLaMA | View on Reddit | 58 comments

Here's the PR by pedapudi.

https://github.com/ggml-org/llama.cpp/pull/21344

It's merge request has been denied so it will not be in mainline llama.cpp. The changes are so small that I just put them into whatever the current release of llama.cpp is.

Read the PR for more info. It will only work with MOEs. Also, it gives the most boost at low context. As the context rises, the gain diminishes. Pedapudi explains why that happens in the PR.

Here are some numbers. It really works well. The tiny amount of time it takes me to apply the code to the current release of llama.cpp is time well spent.

main

ggml_cuda_init: found 1 ROCm devices (Total VRAM: 128000 MiB):
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 128000 MiB
| model                          |       size |     params | backend    | ngl | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Small |  19.45 GiB |    34.66 B | ROCm       |  99 |    0 |           pp512 |       1106.11 ± 8.60 |
| qwen35moe 35B.A3B Q4_K - Small |  19.45 GiB |    34.66 B | ROCm       |  99 |    0 |  pp512 @ d10000 |        755.79 ± 2.58 |
| qwen35moe 35B.A3B Q4_K - Small |  19.45 GiB |    34.66 B | ROCm       |  99 |    0 |  pp512 @ d20000 |        587.61 ± 1.52 |
| qwen35moe 35B.A3B Q4_K - Small |  19.45 GiB |    34.66 B | ROCm       |  99 |    0 |  pp512 @ d40000 |        415.09 ± 2.45 |
| qwen35moe 35B.A3B Q4_K - Small |  19.45 GiB |    34.66 B | ROCm       |  99 |    0 |  pp512 @ d60000 |        316.89 ± 2.35 |

PR

ggml_cuda_init: found 1 ROCm devices (Total VRAM: 128000 MiB):
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 128000 MiB
| model                          |       size |     params | backend    | ngl | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Small |  19.45 GiB |    34.66 B | ROCm       |  99 |    0 |           pp512 |       1447.62 ± 7.10 | **+31%**
| qwen35moe 35B.A3B Q4_K - Small |  19.45 GiB |    34.66 B | ROCm       |  99 |    0 |  pp512 @ d10000 |        905.60 ± 3.53 | **+20%**
| qwen35moe 35B.A3B Q4_K - Small |  19.45 GiB |    34.66 B | ROCm       |  99 |    0 |  pp512 @ d20000 |        685.23 ± 3.03 | **+16%**
| qwen35moe 35B.A3B Q4_K - Small |  19.45 GiB |    34.66 B | ROCm       |  99 |    0 |  pp512 @ d40000 |        459.42 ± 2.70 | **+11%**
| qwen35moe 35B.A3B Q4_K - Small |  19.45 GiB |    34.66 B | ROCm       |  99 |    0 |  pp512 @ d60000 |        342.41 ± 2.43 | **+8%**

[-]

kant12@reddit

This PR is interesting. I got decent results from a quick test with mtp-bench, rocm 7.13, and 122B Q8. The first table I'm building with what had been optimal settings for me: -DGGML_HIP=ON -DGGML_HIP_ROCWMMA_FATTN=OFF -DGGML_HIP_MMQ_MFMA=OFF -DGGML_CUDA_FORCE_CUBLAS=ON -DAMDGPU_TARGETS=gfx1151

At first it was a little worse but then after modifying the build based on settings from the PR I ended up with some pretty good results. I also went back and removed the PR changes and just tested with the new build settings and it was right in the middle of these two. This is what I used to build for the second set of results: -DGGML_HIP=ON -DGGML_HIP_ROCWMMA_FATTN=OFF -DGGML_CUDA_FORCE_CUBLAS=OFF -DGGML_BMI2=ON -DGGML_FMA=ON -DGGML_F16C=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DAMDGPU_TARGETS=gfx1151

rocm 7.13, base llama.cpp and my previous options

Test	Pred	Draft	Accepted	Accept Rate	tok/s
code_python	192	158	112	70.9%	28.6
code_cpp	192	148	116	78.4%	29.2
explain_concept	192	154	113	73.4%	28.0
summarize	192	136	123	90.4%	31.8
qa_factual	192	141	119	84.4%	30.3
translation	192	152	114	75.0%	28.4
creative_short	192	158	111	70.3%	27.5
stepwise_math	192	141	120	85.1%	30.9
long_code_review	192	147	116	78.9%	28.9

Aggregate

Requests	Total Predicted	Total Draft	Total Accepted	Accept Rate	Wall Time
9	1728	1335	1044	78.20%	64.30s

rocm 7.13, PR #21344, and updated options

Test	Pred	Draft	Accepted	Accept Rate	tok/s
code_python	192	171	134	78.4%	32.5
code_cpp	192	186	129	69.4%	29.2
explain_concept	192	183	130	71.0%	29.5
summarize	192	157	138	87.9%	33.8
qa_factual	192	165	136	82.4%	32.7
translation	192	182	130	71.4%	29.6
creative_short	192	202	123	60.9%	26.9
stepwise_math	192	167	135	80.8%	32.4
long_code_review	192	181	129	71.3%	28.9

Aggregate

Requests	Total Predicted	Total Draft	Total Accepted	Accept Rate	Wall Time
9	1728	1594	1184	74.28%	60.97s

[-]

jld1532@reddit

I'm admittedly new to local AI. I'm experimenting in LM Studio and can't get qwen3.5 122b to run using the vulkan backend at all yet rocm will not utilize all available gpu memory. GPT 120b will run with vulkan and utilize available gpu memory. Has anyone experienced this issue? Any tips? Thanks.

[-]

neopolitan77@reddit

Try the Strix Halo Toolboxes: https://github.com/kyuz0/amd-strix-halo-toolboxes

[-]

RedParaglider@reddit

The 64gb memory allocation issue on rocm leaves me on vulcan on the halo. I simply have less crashes.

[-]

ilintar@reddit

I'm seeing a discussion about how Johannes handled the PR, so I'm asking y'all to stop thinking about this as a personal matter.

I have a PR currently on main (https://github.com/ggml-org/llama.cpp/pull/21160). It's not getting merged. Not now, possibly not ever. I'm maintaining it in parallel.

Reason? The backend maintainers don't want to manage the overhead of the extra code and they believe it won't benefit them enough. And that's all there is to it. There was a discussion, we weighed the pros and cons and agreed that would be the best course of action for now. The maintenance burden is real. If you ever saw Johannes figure out a fix to some obscure CUDA problem and you went "wow, how did he know where to look?", it's because the guy knows his part of the codebase insiide out. That's the idea behind having maintainers for separate parts. But this gets diluted when code gets added strictly for the reasons of "getting features out there".

Beware the dangers of availability bias. If you're looking at a feature that's beneficial to you personally and there are other people commenting on a PR saying it helps them too, it's easy to overlook the people for whom the change would be a net negative. As well as it's easy to overlook the maintainers who are going to have to be looking for bugs if something breaks there.

At some point, if something is a niche feature, it's absolutely fine to have a separate fork just for that feature maintained externally. Or to have a fork until some upstream changes get merged that make it easier to merge your changes as a PR. It's not worth making it an ego conflict, that's how bad things in OSS happen.

[-]

jwpbe@reddit

the conversation about his conduct and the fact that he is correct about the PR are two different discusions and nobody is conflating the two as you are implying in your first sentence

[-]

ilintar@reddit

Read the discussion in my PR thread 😄 my point is, he's not being intentionally rude - he's just communicating very technically and in a matter-of-fact way. He's not saying the PR is bad or that it doesn't bring benefits to some users - he's only saying that there's currently no reasonable path for merging it due to the code that would be needed for separating the "bad cases" from the "good cases" simply not being there.

[-]

JazzlikeLeave5530@reddit

Frankly I don't even see what people are cimplaiy about. I don't read rudeness in any of it.

[-]

Marksta@reddit

The guy was asking what the path forward is. The path forward feels like "buzz off" from the answers given. He did present the material blocker but the explanation fizzles after that. Is there some open PR or a plan for how we get to the point where this sort of optimization code can be merged?

No is an okay answer but it needs to be more clear. PR guy is offering good will and spending their own time. Respect time spent, avoid wasting other people's time, and don't mince the words.

"Sorry we're not in a position to accept this sort of PR, and I honestly don't know if or when we would be. The issue is X. Thank you for your contribution but I'm going to close this PR to reflect our current inability to accept it at this time."

I don't actually know what the issue X. Or if this is a now thing or a never thing. And poor PR guy sure doesn't know either from that interaction.

[-]

Interpause@reddit

I agree there isn't anything wrong with straight to the point communication, and his logic is completely sound for why the PR can't be merged.

But IMO, for opensource to be built on collaboration is a very human thing. At the very least, a short recognition of effort should've been given (especially given how many other PRs are entirely vibecoded), and some reciprocation of how polite pedapudi was throughout the convo.

The way the communication went is closer to a senior dev instructing the junior dev they are in charge of, rather than a maintainer guiding a volunteer.

[-]

ilintar@reddit

You're making a good point and I agree the communication may have come off as a bit terse, which is why I'm trying to explain the situation here.

I think most of it stems from the fact that the outcome Johannes wanted to signal was, in itself, a really unpleasant one, which stems from the limited resources of OSS projects.

When you submit a PR to a project, sometimes it'll just get accepted - and that's nice. Sometimes, you'll get told that it needs a bit of work - and that's fine. Sometimes, it'll get rejected because it's bad - then at least you know what you did wrong.

But the most frustrating outcome, and one that happened in this case (which is why I also brought up my cross-compiler PR) is that your PR is valid, *has* a valid path to adoption, but the path to adoption won't happen because it would need core code rewrites that would have to be done by the core maintainer and the core maintainer already has a backlog of much more important stuff to do. This is a scenario in which nobody's really happy and it stems from limited resources - which is why I'm saying that there are, in fact, legitimate cases where you might want to keep your own fork not out of spite (aka "they rejected my beautiful PR, I'll show'em!"), but due to actually filling a niche that the core project can't afford to maintain properly.

[-]

ilintar@reddit

And another thing: giving positive signals can of course be a nice and encouraging act of communication, but it does one more thing that can be bad in cases like this: gives false hope. It's sometimes better for a maintainer to communicate outright that "this can't be merged" than to keep giving positive feedback and shifting goalposts while refusing to merge the PR all the same.

[-]

jwpbe@reddit

I understood what he said perfectly fine, your summary didn't change that for me.

he's just communicating very technically and in a matter-of-fact way.

Yeah, in a way that many people will interpret as "acting like an asshole", hence the discussion of his conduct.

I get that as a ggml maintainer you probably don't want to agree to "Oh yeah you have to tiptoe around Johannes sometimes because he can be rude we just don't talk about it" but it was his response to your attempt to merge in some of ik-llama inspired technical work into llama.cpp that set the benchmark.

I don't have any stake in either project, but trying to wallpaper over his behavior is disingenuous. It's fine, some people just act like that.

I don't need you to explain that away either, I have a perfectly accurate understanding of his behavior.

[-]

Borkato@reddit

They absolutely are conflating it. Read the comments, my lord

[-]

jtjstock@reddit

What conduct? Maybe you are just reading way too much into it.

[-]

muyuu@reddit

Llama.cpp is a very large project. Some specific optimisations will never make it into the main branch because it would be impossible to maintain. This is why there are spinoff projects like DS4.

This probably belongs in the Lemonade project.

[-]

SkyFeistyLlama8@reddit

Kind of like MTP not doing anything for non-CUDA MOE inference or reducing prompt processing and token generation. Llama.cpp is becoming an octopus at this point with so many different combinations for models and inference hardware.

[-]

mister2d@reddit

This is the most reasonable response to reaction to the PR discussion.

[-]

opossum_cz@reddit

I do not think this has any reason now. This is ROCm only and Vulkan is faster on Halo Strix by 10-25%.

Tested on 122B with mtp-bench.py.

Vulkan:

  code_python        pred= 192 draft= 146 acc= 123 rate=0.843 tok/s=29.0
  code_cpp           pred= 192 draft= 155 acc= 114 rate=0.736 tok/s=25.8
  explain_concept    pred= 192 draft= 151 acc= 117 rate=0.775 tok/s=26.4
  summarize          pred= 192 draft= 135 acc= 129 rate=0.956 tok/s=30.5
  qa_factual         pred= 192 draft= 144 acc= 122 rate=0.847 tok/s=27.9
  translation        pred= 192 draft= 154 acc= 116 rate=0.753 tok/s=26.1
  creative_short     pred= 192 draft= 159 acc= 113 rate=0.711 tok/s=25.4
  stepwise_math      pred= 192 draft= 145 acc= 126 rate=0.869 tok/s=29.1
  long_code_review   pred= 192 draft= 157 acc= 123 rate=0.783 tok/s=26.7

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1346,
  "total_draft_accepted": 1083,
  "aggregate_accept_rate": 0.8046,
  "wall_s_total": 73.53
}

Patched ROCm 2024/2024, hadn't tweaked it much, but still way worse:

  code_python        pred= 192 draft= 163 acc= 113 rate=0.693 tok/s=23.2
  code_cpp           pred= 192 draft= 151 acc= 115 rate=0.762 tok/s=24.6
  explain_concept    pred= 192 draft= 145 acc= 121 rate=0.835 tok/s=21.0
  summarize          pred= 192 draft= 136 acc= 127 rate=0.934 tok/s=28.1
  qa_factual         pred= 192 draft= 149 acc= 122 rate=0.819 tok/s=25.8
  translation        pred= 192 draft= 149 acc= 119 rate=0.799 tok/s=25.4
  creative_short     pred= 192 draft= 172 acc= 111 rate=0.645 tok/s=22.6
  stepwise_math      pred= 192 draft= 141 acc= 127 rate=0.901 tok/s=26.6
  long_code_review   pred= 192 draft= 162 acc= 124 rate=0.765 tok/s=24.8

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1368,
  "total_draft_accepted": 1079,
  "aggregate_accept_rate": 0.7887,
  "wall_s_total": 81.14
}

[-]

kant12@reddit

This is a summary of what I get with standard llama.cpp using mtp-bench when comparing Rocm 7.13 with the FORCE_CUBLAS/HIPBLASLT settings to mesa-vulkan 26.1.1. Model 122B Q6. I'd also point out that comparing hours of work Vulkan tends to slow down much quicker so I don't really love these benchmarks. But even here there isn't much difference.

Backend	Predicted	Draft	Accepted	Accept Rate	Wall Time	Overall tok/s
ROCm 7.13	1728	1654	1162	70.25%	64.24s	26.90
Vulkan	1728	1661	1160	69.84%	63.15s	27.36

[-]

opossum_cz@reddit

Could you send exact commands, I can retest? Your Vulcan seems comparable, but your acceptance is way lower, do you have high max n?

[-]

kant12@reddit

I ended up with this. Acceptance is a bit lower but overall it seemed to perform better. I can try with just n-max 2 and get rid of the ngram map just to compare.

--spec-type draft-mtp --spec-draft-n-max 3 --spec-type ngram-map-k4v --spec-ngram-map-k4v-size-n 16 --spec-ngram-map-k4v-size-m 24 --spec-ngram-map-k4v-min-hits 2

[-]

opossum_cz@reddit

Can you share model as well:

./build-vulkan-gfx1151/bin/llama-server -m models/Qwen3.5-122B-A10B-heretic.Q5_K_M.gguf -md models/models/qwen35-122b-a10b-mtp/UD-Q4_K_M/mtp-Qwen3.5-122B-A10B-UD-Q4_K_M.gguf -c 65536 -ngl 99 -fa on -np 1 --port 18080 --host 0.0.0.0 --spec-type draft-mtp --spec-draft-n-max 6 --spec-type ngram-map-k4v --spec-ngram-map-k4v-size-n 16 --spec-ngram-map-k4v-size-m 24 --spec-ngram-map-k4v-min-hits 2

[-]

kant12@reddit

Sorry didn't see this earlier. I don't use a draft model and I think the reasoning_effort in the template is used by gemma maybe. I just keep that since it doesn't hurt. Gemma is also probably why I lowered the checkpoints to 15. And -dio works with rocm you can just ignore the warning with Vulkan.

llama-server -m /home/john/models/MTP/Qwen3.5-122B-A10B-MTP-Q6_K.gguf --spec-type draft-mtp --spec-draft-n-max 3 --spec-type ngram-map-k4v --spec-ngram-map-k4v-size-n 16 --spec-ngram-map-k4v-size-m 24 --spec-ngram-map-k4v-min-hits 2 -c 200000 --temp 0.7 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 -ngl 999 --ctx-checkpoints 15 --chat-template-kwargs {"preserve_thinking": true, "reasoning_effort": "high"} --no-mmap -dio -np 1 -b 8192 -ub 1024 -fa 1

[-]

opossum_cz@reddit

You most likely use draft model already integrated in gguf. I am running heretic version and it didn't come with MTP. So I load it separately.

[-]

kant12@reddit

Oh yeah you're right.

[-]

opossum_cz@reddit

Good parameters. With that I get better results myself:

```
code_python pred= 192 draft= 145 acc= 122 rate=0.841 tok/s=32.2

code_cpp pred= 192 draft= 148 acc= 118 rate=0.797 tok/s=29.9

explain_concept pred= 192 draft= 151 acc= 119 rate=0.788 tok/s=29.6

summarize pred= 192 draft= 134 acc= 127 rate=0.948 tok/s=33.1

qa_factual pred= 192 draft= 141 acc= 124 rate=0.879 tok/s=31.8

translation pred= 192 draft= 159 acc= 114 rate=0.717 tok/s=28.0

creative_short pred= 192 draft= 160 acc= 112 rate=0.700 tok/s=27.6

stepwise_math pred= 192 draft= 142 acc= 125 rate=0.880 tok/s=32.3

long_code_review pred= 192 draft= 157 acc= 121 rate=0.771 tok/s=29.5

Aggregate: {

"n_requests": 9,

"total_predicted": 1728,

"total_draft": 1337,

"total_draft_accepted": 1082,

"aggregate_accept_rate": 0.8093,

"wall_s_total": 66.89

}

```

[-]

kant12@reddit

Nice and that's what I was seeing when just comparing log files after doing similar work for a few hours so I'm pretty confident they're good.

[-]

opossum_cz@reddit

You use 122B Q6, I have to try with Q6, does it fit your RAM?

[-]

kant12@reddit

It fits if you do all those memory tweaks people recommend. radeontop shows around 106 GB used of the 120 GB GTT that I have. At the moment I'm doing 200k context and not specifying any kv cache settings. I'm using the gguf from unsloth but I don't use the UD version. It's just plain Q6.

[-]

opossum_cz@reddit

Already tried now, I thought it will push me too close to max memory, but it is pretty ok, results below.

[-]

opossum_cz@reddit

With Q6:

  code_python        pred= 192 draft= 154 acc= 117 rate=0.760 tok/s=28.6
  code_cpp           pred= 192 draft= 150 acc= 117 rate=0.780 tok/s=27.9
  explain_concept    pred= 192 draft= 150 acc= 117 rate=0.780 tok/s=27.8
  summarize          pred= 192 draft= 136 acc= 126 rate=0.926 tok/s=31.0
  qa_factual         pred= 192 draft= 150 acc= 119 rate=0.793 tok/s=28.3
  translation        pred= 192 draft= 159 acc= 116 rate=0.730 tok/s=27.2
  creative_short     pred= 192 draft= 176 acc= 107 rate=0.608 tok/s=24.4
  stepwise_math      pred= 192 draft= 148 acc= 121 rate=0.818 tok/s=29.1
  long_code_review   pred= 192 draft= 157 acc= 120 rate=0.764 tok/s=28.0

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1380,
  "total_draft_accepted": 1060,
  "aggregate_accept_rate": 0.7681,
  "wall_s_total": 72.42
}

[-]

MeateaW@reddit

Do you get the same good performance on Q8 models? I know the quality difference is minimal compared to q6, but i found llama.cpp just tanks at q8 on vulkan, but rocm runs great.

[-]

opossum_cz@reddit

I generally us Q6/Q5. Don't see much difference. I don't see any performance issue generally. Vulkan works better for me.

[-]

MeateaW@reddit

yeah i tried running q8 models and performance was literally unusable on vulkan. some googling indicated it was because vulkan optimised for q6 but completely ignored q8.

I was looking to run the q8s because I had the memory capacity in my strix box, but vulkan perf was legitimately horrendous - it basically has some weird loop that cuases 100% CPU usage despite being fully loaded into vram.

Switch to rocm on windows no less, and it actually performed the way it should.

[-]

fallingdowndizzyvr@reddit (OP)

I do not think this has any reason now. This is ROCm only and Vulkan is faster on Halo Strix by 10-25%.

No it's not. ROCm has had the advantage in PP. Vulkan has had it in TG. This extends ROCm's advantage in PP. And yet another PR should address Vulkan's advantage in TG by giving ROCm the same thing that the Vulkan has that gives it that TG advantage.

Anyways, here's the same benchmark using Vulkan. It's slower than ROCm mainline. With this PR, it's much slower.

ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Small |  19.45 GiB |    34.66 B | Vulkan     |  99 |    0 |           pp512 |       1005.92 ± 6.68 |
| qwen35moe 35B.A3B Q4_K - Small |  19.45 GiB |    34.66 B | Vulkan     |  99 |    0 |  pp512 @ d10000 |        775.50 ± 0.48 |
| qwen35moe 35B.A3B Q4_K - Small |  19.45 GiB |    34.66 B | Vulkan     |  99 |    0 |  pp512 @ d20000 |        562.58 ± 0.74 |
| qwen35moe 35B.A3B Q4_K - Small |  19.45 GiB |    34.66 B | Vulkan     |  99 |    0 |  pp512 @ d40000 |        425.04 ± 1.48 |
| qwen35moe 35B.A3B Q4_K - Small |  19.45 GiB |    34.66 B | Vulkan     |  99 |    0 |  pp512 @ d60000 |        322.73 ± 0.87 |

[-]

remeh@reddit

It absolutely does have a reason: improving ROCm performance. E.g., I'm not using Vulkan because in my use-cases I care a lot more about the prefill than token gen speed, and for such use-case the balance weights in ROCm being a better trade-off than Vulkan since Vulkan prefill speed drops after the context starts growing a bit.

[-]

imonlysmarterthanyou@reddit

You should submit this to the rock, lemonade, and the strix halo community that all maintain builds for strix halo. I know there is some custom patching already happening for those builds.

[-]

YashN@reddit

Seconded. The Lemonade team is great. Small, focused, friendly.

[-]

fake_agent_smith@reddit

The maintainer guy is kind of an ass in this PR discussion.

[-]

jwpbe@reddit

yeah, i'm sure he has forgotten more about CUDA than I will ever know, but every time i see him in a discussion thread he acts like an asshole. the microsecond a woman acts like he does they get called an insufferable bitch ice queen, but if you're a ggml maintainer you can just be a dick i guess

[-]

Due-Memory-6957@reddit

Actually, women get away with a LOT more rudeness than men, look it up. It's just people can't quite punch each other online lol.

[-]

Bubbly-Staff-9452@reddit

It’s not just him, maintainers of so many projects that I am pretty involved in are like this. Sometimes I will be so excited to contribute and then I see how they treat people who try but maybe are less knowledgeable than them or don’t know exactly how the maintainer wants to do it. I’m always so grateful for the work put into these free projects but I also hate that it allows them to be a bully and an asshole about everything.

[-]

Environmental-Metal9@reddit

I stopped contributing to two different opensource projects because of maintainer attitude.

I understand that they are overwhelmed, overworked, and under appreciated. I understand that dealing with the entitlement of users is rough. Yet, how one interacts with people trying to volunteer their time and coding skills (pre-2022, not much vibe coding back then) for free kind of sets the tone, and it has a real impact on how a person. I had a full life to live already to be dealing with other people’s vitriol. They have the right to shepherd their communities however they choose, and I chose to step away. The projects are still there and thriving, so clearly there are much thicker skinned people out there than I, and that’s good.

[-]

MoffKalast@reddit

https://github.com/ggml-org/llama.cpp/pull/21344#issuecomment-4208955749

An ass or not, you can't merge something that tanks performance by 50% on other GPUs to benefit one.

[-]

segmond@reddit

How? If you are going to modify 2 kernels they need to be different PRs. The kernels are completely independent and changes with tests should be independent. A lot of folks run just a few models and just one type of GPU. When you run around 20+ models varying in sizes and across many GPUs family. You see crazy regression pop up due to this PRs. Works on your machine or on some machines is not good enough. llama.cpp is now becoming a monster, almost every PR introduces a regression.

[-]

sdfgeoff@reddit

I see his side as pretty reasonable TBH. He's looking bigger picture than the PR author and is wanting some preliminary work to be done before device-specific tuning. He probably has something in mind, and until that something (or similar is present), he doesn't want ad-hoc feature gates/optimizations based on device capabilities popping up through the codebase. I'd agree with that. That sort of thing is a nightmare to untangle.

Being a software maintainer trying to keep code quality is hard, and sometimes ypu need to say 'no' to someone.

[-]

JazzlikeLeave5530@reddit

I expected way worse reading this thread but I don't read any asshole behavior at all. Very confused lol

[-]

fake_agent_smith@reddit

Yes, in terms of technical expertise he’s entirely correct and what he says is a reasonable plan, there are no issues with logic, but in terms of people skills e.g. actually referring to what the PR author says in his other comments or just some basic respect for the guy, well not so much.

[-]

hurdurdur7@reddit

Neither is Linus, but he is one heck of a maintainer

[-]

Beamsters@reddit

The guy almost solo carry local llm for a few years. I would respect his judgement but this kind of machine target PR should be group and maybe even do a monthly fork. But one maintainer can only go so far, the vendor should do this amd-specific llama cpp themselves.

[-]

jwpbe@reddit

linus learned how to chill the fuck out over time

[-]

jwpbe@reddit

oh, for sure, he just acts like an insufferable asshole about it, thats the point. i dont think what you said was ever in question

[-]

wllmsaccnt@reddit

I don't think being short with someone is the same thing as being an insufferable asshole. I read the conversation as both sides having a miscommunication about expectations. Pedapudi was asking for next steps outside of the PR, while the maintainer was trying to communicate clearly about the fate of the PR (though the maintainers responses lacked affect or empathy).

[-]