Strix Halo users, a rejected PR can give you up to 30% faster PP for MOEs.

Posted by fallingdowndizzyvr@reddit | LocalLLaMA

Here's the PR by pedapudi.

https://github.com/ggml-org/llama.cpp/pull/21344

Its merge request has been denied, so it will not be in mainline llama.cpp. The changes are so small that I just apply them to whatever the current release of llama.cpp is.

Read the PR for more info. It only works with MOEs. It also gives the biggest boost at low context; as the context grows, the gain diminishes. Pedapudi explains why that happens in the PR.

Here are some numbers. It really works well. The tiny amount of time it takes me to apply the code to the current release of llama.cpp is time well spent.

main

ggml_cuda_init: found 1 ROCm devices (Total VRAM: 128000 MiB):
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 128000 MiB
| model                          |       size |     params | backend    | ngl | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Small |  19.45 GiB |    34.66 B | ROCm       |  99 |    0 |           pp512 |       1106.11 ± 8.60 |
| qwen35moe 35B.A3B Q4_K - Small |  19.45 GiB |    34.66 B | ROCm       |  99 |    0 |  pp512 @ d10000 |        755.79 ± 2.58 |
| qwen35moe 35B.A3B Q4_K - Small |  19.45 GiB |    34.66 B | ROCm       |  99 |    0 |  pp512 @ d20000 |        587.61 ± 1.52 |
| qwen35moe 35B.A3B Q4_K - Small |  19.45 GiB |    34.66 B | ROCm       |  99 |    0 |  pp512 @ d40000 |        415.09 ± 2.45 |
| qwen35moe 35B.A3B Q4_K - Small |  19.45 GiB |    34.66 B | ROCm       |  99 |    0 |  pp512 @ d60000 |        316.89 ± 2.35 |

PR

ggml_cuda_init: found 1 ROCm devices (Total VRAM: 128000 MiB):
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 128000 MiB
| model                          |       size |     params | backend    | ngl | mmap |            test |                  t/s |  vs main |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: | -------: |
| qwen35moe 35B.A3B Q4_K - Small |  19.45 GiB |    34.66 B | ROCm       |  99 |    0 |           pp512 |       1447.62 ± 7.10 | **+31%** |
| qwen35moe 35B.A3B Q4_K - Small |  19.45 GiB |    34.66 B | ROCm       |  99 |    0 |  pp512 @ d10000 |        905.60 ± 3.53 | **+20%** |
| qwen35moe 35B.A3B Q4_K - Small |  19.45 GiB |    34.66 B | ROCm       |  99 |    0 |  pp512 @ d20000 |        685.23 ± 3.03 | **+17%** |
| qwen35moe 35B.A3B Q4_K - Small |  19.45 GiB |    34.66 B | ROCm       |  99 |    0 |  pp512 @ d40000 |        459.42 ± 2.70 | **+11%** |
| qwen35moe 35B.A3B Q4_K - Small |  19.45 GiB |    34.66 B | ROCm       |  99 |    0 |  pp512 @ d60000 |        342.41 ± 2.43 | **+8%** |
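
If you want to check the gains yourself, here is a small Python sketch that recomputes them from the t/s columns of the two tables. The dictionary names and the `compare` helper are made up for this example; only the throughput numbers come from the runs above. It also converts each t/s figure to milliseconds per token, because that makes the trend easier to read: on these numbers the PR saves roughly 0.21 to 0.24 ms per token at every depth, while the total time per token keeps growing with context. That is just a reading of this one benchmark, not the explanation from the PR, but it fits the gain shrinking as a percentage.

```python
# Throughput in tokens/s, copied from the two llama-bench tables above.
# Keys are the context depth (0 stands for the plain pp512 row).
main_tps = {0: 1106.11, 10000: 755.79, 20000: 587.61, 40000: 415.09, 60000: 316.89}
pr_tps = {0: 1447.62, 10000: 905.60, 20000: 685.23, 40000: 459.42, 60000: 342.41}


def compare(main, pr):
    """Return (depth, gain in percent, ms saved per token) for each depth."""
    rows = []
    for depth in sorted(main):
        gain_pct = (pr[depth] / main[depth] - 1.0) * 100.0
        # 1000 / (tokens per second) = milliseconds per token
        saved_ms = 1000.0 / main[depth] - 1000.0 / pr[depth]
        rows.append((depth, gain_pct, saved_ms))
    return rows


rows = compare(main_tps, pr_tps)
for depth, gain_pct, saved_ms in rows:
    print(f"depth {depth:>5}: {gain_pct:+5.1f}%  saved {saved_ms:.3f} ms/token")
```

To compare your own runs, swap in the t/s values from your llama-bench output for the same tests.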