Benchmarks of Radeon 780M iGPU with shared 128GB DDR5 RAM running various MoE models under Llama.cpp
Posted by AzerbaijanNyan@reddit | LocalLLaMA | View on Reddit | 21 comments
I've been looking for a budget system capable of running the recent MoE models for basic one-shot queries. The main goal was finding something energy-efficient to keep online 24/7 without racking up an exorbitant electricity bill.
I eventually settled on a refurbished Minisforum UM890 Pro, which at the time (September) seemed like the most cost-efficient option for my needs.
UM890 Pro
128GB DDR5 (Crucial DDR5 RAM 128GB Kit (2x64GB) 5600MHz SODIMM CL46)
2TB M.2
Linux Mint 22.2
ROCm 7.1.1 with HSA_OVERRIDE_GFX_VERSION=11.0.0 override
llama.cpp build: b13771887 (7699)
Below are some benchmarks using various MoE models. Llama 7B is included for comparison since there's an ongoing thread gathering data for various AMD cards under ROCm here - Performance of llama.cpp on AMD ROCm (HIP) #15021.
I also tested various Vulkan builds but found the performance too close to ROCm's to warrant switching, since I'm also testing other AMD ROCm cards on this system over OCulink.
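The tables were generated with llama-bench; an invocation along these lines (model path is a placeholder, and exact flag spellings may vary between llama.cpp builds) produces this pp512/tg128-at-depth matrix:

```bash
# Sketch of a llama-bench run covering pp512/tg128 at depths 0, 4096, 8192 and 16384.
# Model path is a placeholder; point it at whichever GGUF you are testing.
HSA_OVERRIDE_GFX_VERSION=11.0.0 ./build/bin/llama-bench \
  -m ./models/gpt-oss-120b-MXFP4.gguf \
  -ngl 99 -fa 1 \
  -p 512 -n 128 \
  -d 0,4096,8192,16384
```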
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | pp512 | 514.88 ± 4.82 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | tg128 | 19.27 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | pp512 @ d4096 | 288.95 ± 3.71 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | tg128 @ d4096 | 11.59 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | pp512 @ d8192 | 183.77 ± 2.49 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | tg128 @ d8192 | 8.36 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | pp512 @ d16384 | 100.00 ± 1.45 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | tg128 @ d16384 | 5.49 ± 0.00 |
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | pp512 | 575.41 ± 8.62 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | tg128 | 28.34 ± 0.01 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | pp512 @ d4096 | 390.27 ± 5.73 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | tg128 @ d4096 | 16.25 ± 0.01 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | pp512 @ d8192 | 303.25 ± 4.06 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | tg128 @ d8192 | 10.09 ± 0.00 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | pp512 @ d16384 | 210.54 ± 2.23 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | tg128 @ d16384 | 6.11 ± 0.00 |
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | pp512 | 217.08 ± 3.58 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | tg128 | 20.14 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | pp512 @ d4096 | 174.96 ± 3.57 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | tg128 @ d4096 | 11.22 ± 0.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | pp512 @ d8192 | 143.78 ± 1.36 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | tg128 @ d8192 | 6.88 ± 0.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | pp512 @ d16384 | 109.48 ± 1.07 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | tg128 @ d16384 | 4.13 ± 0.00 |
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| qwen3vlmoe 30B.A3B Q6_K | 23.36 GiB | 30.53 B | ROCm | 99 | 1 | pp512 | 265.07 ± 3.95 |
| qwen3vlmoe 30B.A3B Q6_K | 23.36 GiB | 30.53 B | ROCm | 99 | 1 | tg128 | 25.83 ± 0.00 |
| qwen3vlmoe 30B.A3B Q6_K | 23.36 GiB | 30.53 B | ROCm | 99 | 1 | pp512 @ d4096 | 168.86 ± 1.58 |
| qwen3vlmoe 30B.A3B Q6_K | 23.36 GiB | 30.53 B | ROCm | 99 | 1 | tg128 @ d4096 | 6.01 ± 0.00 |
| qwen3vlmoe 30B.A3B Q6_K | 23.36 GiB | 30.53 B | ROCm | 99 | 1 | pp512 @ d8192 | 124.47 ± 0.68 |
| qwen3vlmoe 30B.A3B Q6_K | 23.36 GiB | 30.53 B | ROCm | 99 | 1 | tg128 @ d8192 | 3.41 ± 0.00 |
| qwen3vlmoe 30B.A3B Q6_K | 23.36 GiB | 30.53 B | ROCm | 99 | 1 | pp512 @ d16384 | 81.27 ± 0.46 |
| qwen3vlmoe 30B.A3B Q6_K | 23.36 GiB | 30.53 B | ROCm | 99 | 1 | tg128 @ d16384 | 2.10 ± 0.00 |
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| qwen3next 80B.A3B Q6_K | 63.67 GiB | 79.67 B | ROCm | 99 | 1 | pp512 | 138.44 ± 1.52 |
| qwen3next 80B.A3B Q6_K | 63.67 GiB | 79.67 B | ROCm | 99 | 1 | tg128 | 12.45 ± 0.00 |
| qwen3next 80B.A3B Q6_K | 63.67 GiB | 79.67 B | ROCm | 99 | 1 | pp512 @ d4096 | 131.49 ± 1.24 |
| qwen3next 80B.A3B Q6_K | 63.67 GiB | 79.67 B | ROCm | 99 | 1 | tg128 @ d4096 | 10.46 ± 0.00 |
| qwen3next 80B.A3B Q6_K | 63.67 GiB | 79.67 B | ROCm | 99 | 1 | pp512 @ d8192 | 122.66 ± 1.85 |
| qwen3next 80B.A3B Q6_K | 63.67 GiB | 79.67 B | ROCm | 99 | 1 | tg128 @ d8192 | 8.80 ± 0.00 |
| qwen3next 80B.A3B Q6_K | 63.67 GiB | 79.67 B | ROCm | 99 | 1 | pp512 @ d16384 | 107.32 ± 1.59 |
| qwen3next 80B.A3B Q6_K | 63.67 GiB | 79.67 B | ROCm | 99 | 1 | tg128 @ d16384 | 6.73 ± 0.00 |
So, am I satisfied with the system? Yes, it performs around what I was hoping for. Power draw is 10-13 watts at idle with gpt-oss 120B loaded, and inference brings that up to around 75 watts. As an added bonus, the system is so quiet I had to check that the fan was actually running the first time I started it.
The shared memory means it's possible to run Q8+ quants of many models with the cache at f16+ for higher-quality outputs. With 120-something GB available it's also possible to keep more than one model loaded; personally I've been running Qwen3-VL-30B-A3B-Instruct as a visual assistant for gpt-oss 120B. I found this combo very handy for transcribing handwritten letters for translation.
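For anyone curious how a two-model setup like that can look, here's a minimal sketch assuming two llama-server instances on separate ports (the file names, ports and mmproj path are placeholders, not my exact launch commands):

```bash
# gpt-oss 120B as the main text model (placeholder paths/ports).
HSA_OVERRIDE_GFX_VERSION=11.0.0 ./build/bin/llama-server \
  -m ./models/gpt-oss-120b-MXFP4.gguf -ngl 99 --port 8080 &

# Qwen3-VL-30B-A3B as the vision assistant; --mmproj points at the vision projector GGUF.
HSA_OVERRIDE_GFX_VERSION=11.0.0 ./build/bin/llama-server \
  -m ./models/Qwen3-VL-30B-A3B-Instruct-Q6_K.gguf \
  --mmproj ./models/Qwen3-VL-30B-A3B-mmproj.gguf \
  -ngl 99 --port 8081 &
```

Both loaded together still fit within the shared memory on this box.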
Token generation isn't stellar, as expected for a dual-channel system, but it's acceptable for MoE one-shots, and this is a secondary system that can chug along while I do something else. There's also the option of using one of the two M.2 slots for an OCulink eGPU and increased performance.
Another perk is portability: at 130 mm x 126 mm x 52.3 mm it fits easily into a backpack or suitcase.
So, do I recommend this system? Unfortunately no, and that's solely due to the current prices of RAM and other hardware. I suspect assembling the same system today would cost at least three times as much, making the price/performance ratio considerably less appealing.
Disclaimer: I'm not an experienced Linux user so there's likely some performance left on the table.
TylerDurdenFan@reddit
Thank you so much for doing the test.
Your tests were with Qwen 3 and Qwen 3 Next, but now we have Qwen 3.5 and 3.6, with the biggest difference being the gated delta net, which saves both on the memory needed for the KV cache and on the memory bandwidth needed for inference as context grows.
Would you mind running the test on qwen 3.6 35B A3B to compare how much tg128 improves at longer contexts?
jokerpack@reddit
I am testing Qwen3.6 35B A3B on my Minisforum UM890 Pro and I am quite satisfied with 22 tokens/s.
jokerpack@reddit
Hello Nyan, I also have the Minisforum UM890 Pro with 32 GB of RAM and I'm thinking about upgrading to more. In my BIOS I can only allocate 16 GB of VRAM (the default is 2 GB). You have 128 GB of system memory; how much of it can you allocate to VRAM? Is it more than 16 GB? Thank you, J0ker
10thDeadlySin@reddit
I was actually wondering if one could pull this off with a Ryzen 8700G and 96-128 gigs of DDR5, maybe with an added T4 or something like it to offload some workloads. It's the same 780M iGPU after all. ;)
yeah-ok@reddit
Yeah, I can tell you I would have liked a 64 GB DDR5 7200 MHz low-latency kit with the 8700G. I think the ability to overclock and optimize the memory/motherboard base frequency would absolutely rock versus the 7840/780M I've got now (limited on the RAM front to 5600 MHz). Then the RAM shortage happened 🤷
FullstackSensei@reddit
If only 128GB DDR5 didn't cost a kidney...
amatisig@reddit
This statement just keeps becoming more true.
Serious_Middle_4234@reddit
well done 780m
Past-Economist7732@reddit
I've been using a cluster of 780Ms to run embedding models with llama.cpp for a while, and it works great! That being said, I've had to use the Vulkan backend as I haven't been able to get HIP to work. Do you have any other info besides using the HSA_OVERRIDE_GFX_VERSION=11.0.0 override?
AzerbaijanNyan@reddit (OP)
The easiest way is probably just downloading the Lemonade pre-built, which supports gfx1100, and using the override.
Alternatively, if you want to be able to pull and build the latest version yourself, check out this excellent LocalLLaMA guide and make sure to use the "-DGPU_TARGETS=gfx1100" flag.
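Roughly, such a build plus the override boils down to something like this (the CMake flag names are from memory and have changed between llama.cpp versions, so double-check against the guide):

```bash
# Build llama.cpp with the ROCm/HIP backend targeting gfx1100, then run with the
# HSA override so the 780M (gfx1103) is treated as a supported gfx1100 device.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

HSA_OVERRIDE_GFX_VERSION=11.0.0 ./build/bin/llama-bench -m ./models/model.gguf -ngl 99
```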
iadanos@reddit
The UM890 Pro supports a maximum of 96 GB RAM, no?
https://www.minisforum.com/products/minisforum-um890-pro
AzerbaijanNyan@reddit (OP)
I think that information is outdated and based on what was available when the system was released.
I haven't had any problems with my 128GB kit, with 122-something GB available for LLMs using GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off amdgpu.gttsize=122880 ttm.pages_limit=33554432".
Though it might have been overkill, since I think I could fit most of these models into 96GB short of running two at the same time.
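For anyone adapting those kernel parameters, the arithmetic as I understand it (assuming amdgpu.gttsize is given in MiB and ttm.pages_limit in 4 KiB pages, which is how the usual iGPU guides describe them; double-check against the kernel docs):

```bash
# amdgpu.gttsize is in MiB: the GTT-visible memory the iGPU can use.
echo $((122880 / 1024))                          # 120 (GiB)

# ttm.pages_limit is in 4 KiB pages: the overall TTM allocation ceiling.
echo $((33554432 * 4096 / 1024 / 1024 / 1024))   # 128 (GiB)
```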
iadanos@reddit
So, 96GB RAM is not a hardware limit?
SkyFeistyLlama8@reddit
These MoE figures show that MoE models are the way to go for unified-RAM setups with lower RAM speeds, like Radeon or Adreno iGPUs. I just wish Mistral made some smaller MoEs, because their Mistral and Devstral 24B models are great but slow.
dionisioalcaraz@reddit
I have a mini PC with a Ryzen 8845HS + 780M and get these numbers using the Vulkan backend. I will try to compile llama.cpp with ROCm and see how it goes, but it seems that ROCm has better PP and Vulkan better TG, especially at long context.
| model | size | params | backend | ngl | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | 99 | 1 | 0 | pp512 | 164.32 ± 0.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | 99 | 1 | 0 | tg128 | 19.93 ± 0.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | 99 | 1 | 0 | pp512 @ d16384 | 80.06 ± 0.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | 99 | 1 | 0 | tg128 @ d16384 | 15.35 ± 0.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | 99 | 1 | 0 | pp512 @ d32768 | 53.48 ± 0.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | 99 | 1 | 0 | tg128 @ d32768 | 13.00 ± 0.00 |
| model | size | params | backend | ngl | mmap | test | t/s |
| -------------------------------------- | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: |
| minimax-m2 230B.A10B IQ4_XS - 4.25 bpw | 113.52 GiB | 228.69 B | Vulkan | 99 | 0 | pp512 | 55.93 ± 0.00 |
| minimax-m2 230B.A10B IQ4_XS - 4.25 bpw | 113.52 GiB | 228.69 B | Vulkan | 99 | 0 | tg128 | 11.73 ± 0.00 |
| minimax-m2 230B.A10B IQ4_XS - 4.25 bpw | 113.52 GiB | 228.69 B | Vulkan | 99 | 0 | pp512 @ d8192 | 35.83 ± 0.00 |
| minimax-m2 230B.A10B IQ4_XS - 4.25 bpw | 113.52 GiB | 228.69 B | Vulkan | 99 | 0 | tg128 @ d8192 | 5.50 ± 0.00 |
| minimax-m2 230B.A10B IQ4_XS - 4.25 bpw | 113.52 GiB | 228.69 B | Vulkan | 99 | 0 | pp512 @ d16384 | 20.65 ± 0.00 |
| minimax-m2 230B.A10B IQ4_XS - 4.25 bpw | 113.52 GiB | 228.69 B | Vulkan | 99 | 0 | tg128 @ d16384 | 2.78 ± 0.00 |
PermanentLiminality@reddit
You should try a couple just on the CPU to see how different it is.
Individual-Source618@reddit
Why does the PP rate drop with context length despite there being the same number of tokens to process (512)? Is the KV cache not enabled?
AzerbaijanNyan@reddit (OP)
I added the llama-bench command to the post in case anyone wants to compare. Thanks for the heads-up; I should have included it from the start since it's hard to judge the numbers otherwise.
Top-Outside-9322@reddit
Crazy how that 780M is actually holding its own with 128GB of shared memory; those MoE numbers look pretty solid for what you paid back in September.
The power draw of around 75W under load is honestly impressive for running 120B models and beats the hell out of spinning up a 4090 just for inference.
AzerbaijanNyan@reddit (OP)
Absolutely. I have a triple-GPU server for more demanding work, but I hardly ever fire it up nowadays since the mini PC handles most tasks fine.
It's a shame the prices are what they are now, since I feel this setup with gpt-oss 120B is near ideal for small business/office tasks where you don't want to, or can't, use cloud services.
1ncehost@reddit
I think these basic AMD APU builds are super cool for homelab kind of stuff. Those numbers are surprisingly fast for models of that size. Too bad RAM prices make this seem much less attractive right now.