Benchmarks of Radeon 780M iGPU with shared 128GB DDR5 RAM running various MoE models under Llama.cpp
Posted by AzerbaijanNyan@reddit | LocalLLaMA | View on Reddit | 21 comments
I've been looking for a budget system capable of running the recent MoE models for basic one-shot queries. The main goal was finding something energy-efficient to keep online 24/7 without racking up an exorbitant electricity bill.
I eventually settled on a refurbished Minisforum UM890 Pro, which at the time (September) seemed like the most cost-efficient option for my needs.
UM890 Pro
128GB DDR5 (Crucial DDR5 RAM 128GB Kit (2x64GB) 5600MHz SODIMM CL46)
2TB M.2
Linux Mint 22.2
ROCm 7.1.1 with HSA_OVERRIDE_GFX_VERSION=11.0.0 override
llama.cpp build: b13771887 (7699)
Below are some benchmarks using various MoE models. Llama 7B is included for comparison since there's an ongoing thread gathering data for various AMD cards under ROCm here - Performance of llama.cpp on AMD ROCm (HIP) #15021.
I also tested various Vulkan builds but found the performance too close to ROCm's to warrant switching, since I'm also testing other AMD ROCm cards on this system over OCulink.
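The tables were generated with llama-bench; an invocation along these lines (model path is a placeholder, and exact flag spellings may vary between llama.cpp builds) produces this pp512/tg128-at-depth matrix:

```bash
# Sketch of a llama-bench run covering pp512/tg128 at depths 0, 4096, 8192 and 16384.
# Model path is a placeholder; point it at whichever GGUF you are testing.
HSA_OVERRIDE_GFX_VERSION=11.0.0 ./build/bin/llama-bench \
  -m ./models/gpt-oss-120b-MXFP4.gguf \
  -ngl 99 -fa 1 \
  -p 512 -n 128 \
  -d 0,4096,8192,16384
```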
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | pp512 | 514.88 ± 4.82 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | tg128 | 19.27 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | pp512 @ d4096 | 288.95 ± 3.71 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | tg128 @ d4096 | 11.59 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | pp512 @ d8192 | 183.77 ± 2.49 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | tg128 @ d8192 | 8.36 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | pp512 @ d16384 | 100.00 ± 1.45 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | tg128 @ d16384 | 5.49 ± 0.00 |
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | pp512 | 575.41 ± 8.62 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | tg128 | 28.34 ± 0.01 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | pp512 @ d4096 | 390.27 ± 5.73 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | tg128 @ d4096 | 16.25 ± 0.01 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | pp512 @ d8192 | 303.25 ± 4.06 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | tg128 @ d8192 | 10.09 ± 0.00 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | pp512 @ d16384 | 210.54 ± 2.23 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | tg128 @ d16384 | 6.11 ± 0.00 |
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | pp512 | 217.08 ± 3.58 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | tg128 | 20.14 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | pp512 @ d4096 | 174.96 ± 3.57 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | tg128 @ d4096 | 11.22 ± 0.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | pp512 @ d8192 | 143.78 ± 1.36 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | tg128 @ d8192 | 6.88 ± 0.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | pp512 @ d16384 | 109.48 ± 1.07 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | tg128 @ d16384 | 4.13 ± 0.00 |
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| qwen3vlmoe 30B.A3B Q6_K | 23.36 GiB | 30.53 B | ROCm | 99 | 1 | pp512 | 265.07 ± 3.95 |
| qwen3vlmoe 30B.A3B Q6_K | 23.36 GiB | 30.53 B | ROCm | 99 | 1 | tg128 | 25.83 ± 0.00 |
| qwen3vlmoe 30B.A3B Q6_K | 23.36 GiB | 30.53 B | ROCm | 99 | 1 | pp512 @ d4096 | 168.86 ± 1.58 |
| qwen3vlmoe 30B.A3B Q6_K | 23.36 GiB | 30.53 B | ROCm | 99 | 1 | tg128 @ d4096 | 6.01 ± 0.00 |
| qwen3vlmoe 30B.A3B Q6_K | 23.36 GiB | 30.53 B | ROCm | 99 | 1 | pp512 @ d8192 | 124.47 ± 0.68 |
| qwen3vlmoe 30B.A3B Q6_K | 23.36 GiB | 30.53 B | ROCm | 99 | 1 | tg128 @ d8192 | 3.41 ± 0.00 |
| qwen3vlmoe 30B.A3B Q6_K | 23.36 GiB | 30.53 B | ROCm | 99 | 1 | pp512 @ d16384 | 81.27 ± 0.46 |
| qwen3vlmoe 30B.A3B Q6_K | 23.36 GiB | 30.53 B | ROCm | 99 | 1 | tg128 @ d16384 | 2.10 ± 0.00 |
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| qwen3next 80B.A3B Q6_K | 63.67 GiB | 79.67 B | ROCm | 99 | 1 | pp512 | 138.44 ± 1.52 |
| qwen3next 80B.A3B Q6_K | 63.67 GiB | 79.67 B | ROCm | 99 | 1 | tg128 | 12.45 ± 0.00 |
| qwen3next 80B.A3B Q6_K | 63.67 GiB | 79.67 B | ROCm | 99 | 1 | pp512 @ d4096 | 131.49 ± 1.24 |
| qwen3next 80B.A3B Q6_K | 63.67 GiB | 79.67 B | ROCm | 99 | 1 | tg128 @ d4096 | 10.46 ± 0.00 |
| qwen3next 80B.A3B Q6_K | 63.67 GiB | 79.67 B | ROCm | 99 | 1 | pp512 @ d8192 | 122.66 ± 1.85 |
| qwen3next 80B.A3B Q6_K | 63.67 GiB | 79.67 B | ROCm | 99 | 1 | tg128 @ d8192 | 8.80 ± 0.00 |
| qwen3next 80B.A3B Q6_K | 63.67 GiB | 79.67 B | ROCm | 99 | 1 | pp512 @ d16384 | 107.32 ± 1.59 |
| qwen3next 80B.A3B Q6_K | 63.67 GiB | 79.67 B | ROCm | 99 | 1 | tg128 @ d16384 | 6.73 ± 0.00 |
So, am I satisfied with the system? Yes, it performs around what I was hoping for. Power draw is 10-13 watts at idle with gpt-oss 120B loaded, and inference brings that up to around 75 watts. As an added bonus, the system is so quiet I had to check that the fan was actually running the first time I started it.
The shared memory means it's possible to run Q8+ quants of many models with the cache at f16+ for higher-quality outputs. With 120-something GB available it's also possible to keep more than one model loaded; personally I've been running Qwen3-VL-30B-A3B-Instruct as a visual assistant for gpt-oss 120B. I found this combo very handy for transcribing handwritten letters for translation.
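For anyone curious how a two-model setup like that can look, here's a minimal sketch assuming two llama-server instances on separate ports (the file names, ports and mmproj path are placeholders, not my exact launch commands):

```bash
# gpt-oss 120B as the main text model (placeholder paths/ports).
HSA_OVERRIDE_GFX_VERSION=11.0.0 ./build/bin/llama-server \
  -m ./models/gpt-oss-120b-MXFP4.gguf -ngl 99 --port 8080 &

# Qwen3-VL-30B-A3B as the vision assistant; --mmproj points at the vision projector GGUF.
HSA_OVERRIDE_GFX_VERSION=11.0.0 ./build/bin/llama-server \
  -m ./models/Qwen3-VL-30B-A3B-Instruct-Q6_K.gguf \
  --mmproj ./models/Qwen3-VL-30B-A3B-mmproj.gguf \
  -ngl 99 --port 8081 &
```

Both loaded together still fit within the shared memory on this box.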
Token generation isn't stellar, as expected for a dual-channel system, but it's acceptable for MoE one-shots, and this is a secondary system that can chug along while I do something else. There's also the option of using one of the two M.2 slots for an OCulink eGPU and increased performance.
Another perk is portability: at 130 mm x 126 mm x 52.3 mm it fits easily into a backpack or suitcase.
So, do I recommend this system? Unfortunately no, and that's solely due to the current prices of RAM and other hardware. I suspect assembling the same system today would cost at least three times as much, making the price/performance ratio considerably less appealing.
Disclaimer: I'm not an experienced Linux user so there's likely some performance left on the table.
TylerDurdenFan@reddit
Thank you so much for doing the test.
Your tests were with Qwen 3 and Qwen 3 Next, but now we have Qwen 3.5 and 3.6, with the biggest difference being the gated delta net, which saves both on the memory needed for the KV cache and on the memory bandwidth needed for inference as context grows.
Would you mind running the test on qwen 3.6 35B A3B to compare how much tg128 improves at longer contexts?
jokerpack@reddit
I am testing Qwen3.6 35B A3B on my Minisforum UM890 Pro and I am quite satisfied with 22 tokens/s.
jokerpack@reddit
Hello Nyan, I also have the Minisforum UM890 Pro with 32 GB of RAM and I'm thinking about upgrading to more. In my BIOS I can only allocate 16 GB of VRAM (the default is 2 GB). You have 128 GB of system memory; how much of it can you allocate to VRAM? Is it more than 16 GB? Thank you, J0ker
10thDeadlySin@reddit
I was actually wondering if one could pull this off with a Ryzen 8700G and 96-128 gigs of DDR5, maybe with an added T4 or something like it to offload some workloads. It's the same 780M iGPU after all. ;)
yeah-ok@reddit
Yeah, I can tell you I would have liked a 64 GB DDR5 7200 MHz low-latency kit with the 8700G. I think the ability to overclock and optimize the memory/motherboard base frequency would absolutely rock versus the 7840/780M I've got now (limited on the RAM front to 5600 MHz). Then the RAM shortage happened 🤷
FullstackSensei@reddit
If only 128GB DDR5 didn't cost a kidney...
amatisig@reddit
This statement just keeps becoming more true.
Serious_Middle_4234@reddit
well done 780m
Past-Economist7732@reddit
I've been using a cluster of 780Ms to run embedding models with llama.cpp for a while, and it works great! That being said, I've had to use the Vulkan backend as I haven't been able to get HIP to work. Do you have any other info besides using the HSA_OVERRIDE_GFX_VERSION=11.0.0 override?
AzerbaijanNyan@reddit (OP)
The easiest way is probably just downloading the Lemonade pre-built, which supports gfx1100, and using the override.
Alternatively, if you want to be able to pull and build the latest version yourself, check out this excellent LocalLLaMA guide and make sure to use the "-DGPU_TARGETS=gfx1100" flag.
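Roughly, such a build plus the override boils down to something like this (the CMake flag names are from memory and have changed between llama.cpp versions, so double-check against the guide):

```bash
# Build llama.cpp with the ROCm/HIP backend targeting gfx1100, then run with the
# HSA override so the 780M (gfx1103) is treated as a supported gfx1100 device.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

HSA_OVERRIDE_GFX_VERSION=11.0.0 ./build/bin/llama-bench -m ./models/model.gguf -ngl 99
```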
iadanos@reddit
The UM890 Pro supports a maximum of 96 GB RAM, no?
https://www.minisforum.com/products/minisforum-um890-pro
AzerbaijanNyan@reddit (OP)
I think that information is outdated and based on what was available when the system was released.
I haven't had any problems with my 128GB kit, with 122-something GB available for LLMs using GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off amdgpu.gttsize=122880 ttm.pages_limit=33554432".
Though it might have been overkill, since I think I could fit most of these models into 96GB short of running two at the same time.
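For anyone adapting those kernel parameters, the arithmetic as I understand it (assuming amdgpu.gttsize is given in MiB and ttm.pages_limit in 4 KiB pages, which is how the usual iGPU guides describe them; double-check against the kernel docs):

```bash
# amdgpu.gttsize is in MiB: the GTT-visible memory the iGPU can use.
echo $((122880 / 1024))                          # 120 (GiB)

# ttm.pages_limit is in 4 KiB pages: the overall TTM allocation ceiling.
echo $((33554432 * 4096 / 1024 / 1024 / 1024))   # 128 (GiB)
```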
iadanos@reddit
So, 96GB RAM is not a hardware limit?
SkyFeistyLlama8@reddit
These MoE figures show that MoE models are the way to go for unified-RAM setups with lower RAM speeds, like Radeon or Adreno iGPUs. I just wish Mistral made some smaller MoEs, because their Mistral and Devstral 24B models are great but slow.
dionisioalcaraz@reddit
I have a mini PC with a Ryzen 8845HS + 780M and get these numbers using the Vulkan backend. I will try to compile llama.cpp with ROCm and see how it goes, but it seems that ROCm has better PP and Vulkan better TG, especially at long context.
| model | size | params | backend | ngl | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | 99 | 1 | 0 | pp512 | 164.32 ± 0.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | 99 | 1 | 0 | tg128 | 19.93 ± 0.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | 99 | 1 | 0 | pp512 @ d16384 | 80.06 ± 0.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | 99 | 1 | 0 | tg128 @ d16384 | 15.35 ± 0.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | 99 | 1 | 0 | pp512 @ d32768 | 53.48 ± 0.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | 99 | 1 | 0 | tg128 @ d32768 | 13.00 ± 0.00 |
| model | size | params | backend | ngl | mmap | test | t/s |
| -------------------------------------- | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: |
| minimax-m2 230B.A10B IQ4_XS - 4.25 bpw | 113.52 GiB | 228.69 B | Vulkan | 99 | 0 | pp512 | 55.93 ± 0.00 |
| minimax-m2 230B.A10B IQ4_XS - 4.25 bpw | 113.52 GiB | 228.69 B | Vulkan | 99 | 0 | tg128 | 11.73 ± 0.00 |
| minimax-m2 230B.A10B IQ4_XS - 4.25 bpw | 113.52 GiB | 228.69 B | Vulkan | 99 | 0 | pp512 @ d8192 | 35.83 ± 0.00 |
| minimax-m2 230B.A10B IQ4_XS - 4.25 bpw | 113.52 GiB | 228.69 B | Vulkan | 99 | 0 | tg128 @ d8192 | 5.50 ± 0.00 |
| minimax-m2 230B.A10B IQ4_XS - 4.25 bpw | 113.52 GiB | 228.69 B | Vulkan | 99 | 0 | pp512 @ d16384 | 20.65 ± 0.00 |
| minimax-m2 230B.A10B IQ4_XS - 4.25 bpw | 113.52 GiB | 228.69 B | Vulkan | 99 | 0 | tg128 @ d16384 | 2.78 ± 0.00 |
PermanentLiminality@reddit
You should try a couple just on the CPU to see how different it is.
Individual-Source618@reddit
Why does the PP rate drop with context length despite there being the same number of tokens to process (512)? Is the KV cache not enabled?
AzerbaijanNyan@reddit (OP)
I added the llama-bench command to the post in case anyone wants to compare. Thanks for the heads-up; I should have included it from the start since it's hard to judge the numbers otherwise.
Top-Outside-9322@reddit
Crazy how that 780M is actually holding its own with 128GB of shared memory; those MoE numbers look pretty solid for what you paid back in September.
The power draw of around 75W under load is honestly impressive for running 120B models and beats the hell out of spinning up a 4090 just for inference.
AzerbaijanNyan@reddit (OP)
Absolutely. I have a triple-GPU server for more demanding work, but I hardly ever fire it up nowadays since the mini PC handles most tasks fine.
It's a shame the prices are what they are now, since I feel this setup with gpt-oss 120B is near ideal for small business/office tasks where you don't want to, or can't, use cloud services.
1ncehost@reddit
I think these basic AMD APU builds are super cool for homelab kind of stuff. Those numbers are surprisingly fast for models of that size. Too bad RAM prices make this seem much less attractive right now.