Benchmarks of Radeon 780M iGPU with shared 128GB DDR5 RAM running various MoE models under llama.cpp

Posted by AzerbaijanNyan@reddit | LocalLLaMA

I've been looking for a budget system capable of running the newer MoE models for basic one-shot queries. The main goal was finding something energy efficient enough to keep online 24/7 without racking up an exorbitant electricity bill.

I eventually settled on a refurbished Minisforum UM890 Pro, which at the time (September) seemed like the most cost-efficient option for my needs.

UM890 Pro

AMD Radeon™ 780M iGPU

128GB DDR5 (Crucial DDR5 RAM 128GB Kit (2x64GB) 5600MHz SODIMM CL46)

2TB M.2

Linux Mint 22.2

ROCm 7.1.1 with HSA_OVERRIDE_GFX_VERSION=11.0.0 override

llama.cpp build: b13771887 (7699)
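The exact benchmark command isn't given in the post, so the following is only a sketch of how a comparable run might be assembled. It assumes llama-bench's `-m`, `-ngl`, `-fa`, `-p`, `-n` and `-d` options, which match the ngl, fa and test columns in the tables below. The model path is a hypothetical placeholder.

```python
import os
import shlex


def build_llama_bench_cmd(model_path, depths=(0, 4096, 8192, 16384)):
    """Build an argv list for llama-bench that mirrors the table columns.

    Assumed settings (inferred from the tables, not confirmed by the post):
    ngl=99, fa=1, pp512, tg128, and a comma-separated list of context depths.
    """
    return [
        "llama-bench",
        "-m", model_path,
        "-ngl", "99",   # offload all layers to the iGPU
        "-fa", "1",     # flash attention on
        "-p", "512",    # prompt processing test: pp512
        "-n", "128",    # token generation test: tg128
        "-d", ",".join(str(d) for d in depths),  # d0, d4096, d8192, d16384
    ]


def build_env(base_env=None):
    """Copy the environment and add the GFX override listed in the specs."""
    env = dict(os.environ if base_env is None else base_env)
    env["HSA_OVERRIDE_GFX_VERSION"] = "11.0.0"
    return env


# Hypothetical model path, for illustration only.
cmd = build_llama_bench_cmd("models/example-model.gguf")
print("HSA_OVERRIDE_GFX_VERSION=11.0.0 " + shlex.join(cmd))
```

Passing `cmd` and `build_env()` to `subprocess.run` would start the run. Building the argv list separately makes it easy to check before committing to a long benchmark.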

Below are some benchmarks using various MoE models. Llama 7B is included for comparison since there's an ongoing thread gathering data for various AMD cards under ROCm here - Performance of llama.cpp on AMD ROCm (HIP) #15021.

I also tested various Vulkan builds but found them too close in performance to warrant switching, since I'm also testing other ROCm AMD cards on this system over OCuLink.

| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | pp512 | 514.88 ± 4.82 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | tg128 | 19.27 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | pp512 @ d4096 | 288.95 ± 3.71 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | tg128 @ d4096 | 11.59 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | pp512 @ d8192 | 183.77 ± 2.49 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | tg128 @ d8192 | 8.36 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | pp512 @ d16384 | 100.00 ± 1.45 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | tg128 @ d16384 | 5.49 ± 0.00 |

| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: | ---: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | pp512 | 575.41 ± 8.62 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | tg128 | 28.34 ± 0.01 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | pp512 @ d4096 | 390.27 ± 5.73 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | tg128 @ d4096 | 16.25 ± 0.01 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | pp512 @ d8192 | 303.25 ± 4.06 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | tg128 @ d8192 | 10.09 ± 0.00 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | pp512 @ d16384 | 210.54 ± 2.23 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | tg128 @ d16384 | 6.11 ± 0.00 |

| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: | ---: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | pp512 | 217.08 ± 3.58 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | tg128 | 20.14 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | pp512 @ d4096 | 174.96 ± 3.57 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | tg128 @ d4096 | 11.22 ± 0.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | pp512 @ d8192 | 143.78 ± 1.36 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | tg128 @ d8192 | 6.88 ± 0.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | pp512 @ d16384 | 109.48 ± 1.07 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | tg128 @ d16384 | 4.13 ± 0.00 |

| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: | ---: |
| qwen3vlmoe 30B.A3B Q6_K | 23.36 GiB | 30.53 B | ROCm | 99 | 1 | pp512 | 265.07 ± 3.95 |
| qwen3vlmoe 30B.A3B Q6_K | 23.36 GiB | 30.53 B | ROCm | 99 | 1 | tg128 | 25.83 ± 0.00 |
| qwen3vlmoe 30B.A3B Q6_K | 23.36 GiB | 30.53 B | ROCm | 99 | 1 | pp512 @ d4096 | 168.86 ± 1.58 |
| qwen3vlmoe 30B.A3B Q6_K | 23.36 GiB | 30.53 B | ROCm | 99 | 1 | tg128 @ d4096 | 6.01 ± 0.00 |
| qwen3vlmoe 30B.A3B Q6_K | 23.36 GiB | 30.53 B | ROCm | 99 | 1 | pp512 @ d8192 | 124.47 ± 0.68 |
| qwen3vlmoe 30B.A3B Q6_K | 23.36 GiB | 30.53 B | ROCm | 99 | 1 | tg128 @ d8192 | 3.41 ± 0.00 |
| qwen3vlmoe 30B.A3B Q6_K | 23.36 GiB | 30.53 B | ROCm | 99 | 1 | pp512 @ d16384 | 81.27 ± 0.46 |
| qwen3vlmoe 30B.A3B Q6_K | 23.36 GiB | 30.53 B | ROCm | 99 | 1 | tg128 @ d16384 | 2.10 ± 0.00 |

| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: | ---: |
| qwen3next 80B.A3B Q6_K | 63.67 GiB | 79.67 B | ROCm | 99 | 1 | pp512 | 138.44 ± 1.52 |
| qwen3next 80B.A3B Q6_K | 63.67 GiB | 79.67 B | ROCm | 99 | 1 | tg128 | 12.45 ± 0.00 |
| qwen3next 80B.A3B Q6_K | 63.67 GiB | 79.67 B | ROCm | 99 | 1 | pp512 @ d4096 | 131.49 ± 1.24 |
| qwen3next 80B.A3B Q6_K | 63.67 GiB | 79.67 B | ROCm | 99 | 1 | tg128 @ d4096 | 10.46 ± 0.00 |
| qwen3next 80B.A3B Q6_K | 63.67 GiB | 79.67 B | ROCm | 99 | 1 | pp512 @ d8192 | 122.66 ± 1.85 |
| qwen3next 80B.A3B Q6_K | 63.67 GiB | 79.67 B | ROCm | 99 | 1 | tg128 @ d8192 | 8.80 ± 0.00 |
| qwen3next 80B.A3B Q6_K | 63.67 GiB | 79.67 B | ROCm | 99 | 1 | pp512 @ d16384 | 107.32 ± 1.59 |
| qwen3next 80B.A3B Q6_K | 63.67 GiB | 79.67 B | ROCm | 99 | 1 | tg128 @ d16384 | 6.73 ± 0.00 |
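
To make the tables easier to compare, here is a small sketch that computes how much of the empty-context token generation speed each model keeps as depth grows. The tg128 numbers are copied from the tables above; the `retention` helper is just an illustrative name.

```python
# tg128 throughput in t/s, keyed by context depth, copied from the tables above.
TG128 = {
    "llama 7B Q4_0": {0: 19.27, 4096: 11.59, 8192: 8.36, 16384: 5.49},
    "gpt-oss 20B MXFP4": {0: 28.34, 4096: 16.25, 8192: 10.09, 16384: 6.11},
    "gpt-oss 120B MXFP4": {0: 20.14, 4096: 11.22, 8192: 6.88, 16384: 4.13},
    "qwen3vlmoe 30B.A3B Q6_K": {0: 25.83, 4096: 6.01, 8192: 3.41, 16384: 2.10},
    "qwen3next 80B.A3B Q6_K": {0: 12.45, 4096: 10.46, 8192: 8.80, 16384: 6.73},
}


def retention(speeds, depth):
    """Fraction of the depth-0 generation speed still available at `depth`."""
    return speeds[depth] / speeds[0]


for model, speeds in TG128.items():
    # Skip depth 0, which is 100% by definition.
    kept = ", ".join(
        f"d{d}: {retention(speeds, d):.0%}" for d in sorted(speeds) if d
    )
    print(f"{model:26s} {kept}")
```

By this measure qwen3next holds up best, keeping roughly 54% of its speed at d16384, while qwen3vlmoe drops to about 8%. The other three land between roughly 20% and 30%.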

So, am I satisfied with the system? Yes, it performs about as well as I was hoping. Power draw is 10-13 W at idle with gpt-oss 120B loaded, and inference brings that up to around 75 W. As an added bonus, the system is so quiet that I had to check the fan was actually running the first time I started it.
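
For a rough sense of what those power figures mean over time, here is a back-of-the-envelope sketch. It assumes a 30-day month and pairs the roughly 75 W inference draw with the gpt-oss 120B tg128 result at empty context (20.14 t/s); that pairing is my assumption, not something measured in the post. The electricity price is a hypothetical placeholder.

```python
HOURS_PER_DAY = 24


def monthly_kwh(watts, days=30):
    """Energy used by a constant load over `days` days, in kWh."""
    return watts * HOURS_PER_DAY * days / 1000.0


def tokens_per_joule(tokens_per_second, watts):
    """Generated tokens per joule (1 W = 1 J/s)."""
    return tokens_per_second / watts


PRICE_PER_KWH = 0.30  # hypothetical price, adjust to your own tariff

for idle_watts in (10, 13):
    kwh = monthly_kwh(idle_watts)
    print(f"idle {idle_watts} W: {kwh:.2f} kWh/month, cost {kwh * PRICE_PER_KWH:.2f}")

# Assumed pairing: ~75 W during inference with gpt-oss 120B tg128 at d0.
print(f"gpt-oss 120B: {tokens_per_joule(20.14, 75):.3f} tokens per joule")
```

Idling around the clock works out to roughly 7 to 9.4 kWh per month, which is why the idle figure matters more than the inference peak for a 24/7 box.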

The shared memory means it's possible to run Q8+ quants of many models and keep the cache at f16+ for higher-quality output. Having 120GB or so available also allows more than one model to be loaded at once; personally, I've been running Qwen3-VL-30B-A3B-Instruct as a visual assistant for gpt-oss 120B. I found this combo very handy for transcribing handwritten letters for translation.

Token generation isn't stellar, as expected for a dual-channel system, but it's acceptable for MoE one-shots, and this is a secondary system that can chug along while I do something else. There's also the option of using one of the two M.2 slots for an OCuLink eGPU to increase performance.
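
The dual-channel limit can be put into numbers with a rough estimate. This sketch assumes the RAM actually runs at its rated 5600 MT/s over two 64-bit channels (not verified on this system), and that generating one token with a dense model reads every weight once, so throughput is capped at bandwidth divided by model size. The same bound can't be worked out for the MoE models from the tables alone, since only the active experts are read per token.

```python
GIB = 1024 ** 3


def peak_bandwidth_bytes(mega_transfers_per_s, channels=2, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in bytes/s (64-bit channel = 8 bytes)."""
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels


def dense_tg_upper_bound(bandwidth_bytes, model_size_gib):
    """Upper bound on tokens/s if every weight is read once per token."""
    return bandwidth_bytes / (model_size_gib * GIB)


bandwidth = peak_bandwidth_bytes(5600)         # assumed DDR5-5600, dual channel
bound = dense_tg_upper_bound(bandwidth, 3.56)  # llama 7B Q4_0 size from the table
measured = 19.27                               # llama 7B tg128 at empty context

print(f"peak bandwidth: {bandwidth / 1e9:.1f} GB/s")
print(f"upper bound:    {bound:.2f} t/s")
print(f"measured/bound: {measured / bound:.0%}")
```

Under these assumptions the theoretical peak is 89.6 GB/s and the llama 7B result lands at a bit over 80% of the bound. That suggests token generation is mostly limited by memory bandwidth rather than by the iGPU.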

Another perk is portability: at 130 × 126 × 52.3 mm it fits easily into a backpack or suitcase.

So, do I recommend this system? Unfortunately no, and that's solely due to the current prices of RAM and other hardware. I suspect assembling the system today would cost at least three times as much, making the price/performance ratio considerably less appealing.

Disclaimer: I'm not an experienced Linux user so there's likely some performance left on the table.