Using a Radeon 9060 XT 16 GB, the Gemma4 24B A4B IQ4_NL model achieves 25.9 t/s
Posted by CrowKing63@reddit | LocalLLaMA | 18 comments
I'm testing local LLMs on a gaming mini PC (AMD 7840HS, 32 GB RAM) paired with an eGPU (Radeon 9060 XT with 16 GB VRAM). Since I'm not very familiar with llama.cpp, I kept getting unsatisfactory results, but with the recent Gemma4 24B A4B IQ4_NL model I finally reached 25.9 t/s. I even connected it to OpenCode and asked questions about my codebase, and it seems usable at this level.
```
llama-server -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-IQ4_NL \
  --fit on \
  --fit-ctx 128000 \
  --fit-target 256 \
  -np 1 \
  -fa on \
  --no-mmap \
  --mlock \
  --threads 8 \
  -b 512 \
  -ub 256 \
  -ctk q8_0 \
  -ctv q8_0 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0 \
  --reasoning-budget -1
```
These are the results I got with this setup. If I increase -b or -ub any further, the model won't even load. Are there any unnecessary arguments here, or ones that could be tuned better?
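In case it's useful, a sweep like this should find the largest batch that still loads (the llama-bench flags are assumed to match the llama-server build, and the GGUF filename is just a placeholder):

```
# try batch sizes from large to small; stop at the first one that loads and runs
for b in 2048 1024 512 256; do
  llama-bench -m gemma-4-26B-A4B-it-UD-IQ4_NL.gguf -b $b -ub $((b / 2)) \
    -p 512 -n 128 && break
done
```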
Thanks.
Due_Pea_372@reddit
Nice numbers on the 9060 XT! We're running the same model on a 9070 XT via pure Vulkan compute (no ROCm) and hitting ~22 t/s decode with Q3_K_M. Your 25.9 on half the CUs is impressive; rocBLAS is still hard to beat for MoE decode.
BTW, llama.cpp's Vulkan backend is broken for this model (Issue #21516): the MoE routing has a sync bug. We hit the same wall and wrote up the root cause: https://github.com/maeddesg/vulkanforge/blob/main/docs/gemma4_26b_moe_solution.md
What backend are you using, HIP or Vulkan?
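If you're not sure, recent llama.cpp builds can list their compiled backends and visible devices (flag availability varies by build, so treat this as a best guess):

```
llama-server --list-devices   # prints the backend(s) and the GPUs they can see
# the startup log also names the backend, e.g. lines prefixed with ggml_vulkan:
```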
CrowKing63@reddit (OP)
It's probably Vulkan. I tried HIP just recently, but it kept failing, so I gave up.
r3curs1v3@reddit
Which eGPU are you using? (I'm not asking about the card; I mean the eGPU enclosure itself.)
CrowKing63@reddit (OP)
If you mean the docking station, it's the first-generation Minisforum DEG1. I connected it via OCULink.
r3curs1v3@reddit
Yes, that one... thanks.
Solary_Kryptic@reddit
How did Qwen 3.6 35B perform for you?
CrowKing63@reddit (OP)
I haven't tried it. I've heard Qwen performs better, so I'll give it a shot.
KURD_1_STAN@reddit
It should be faster if roughly the same number of GBs needs to be swapped, but that might not be the case here, since a 35B can't fit in your VRAM at Q4. I'm getting 33 t/s with the 25 GB Q5_K_XL at 100k context on a 3060.
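Rough math (assuming ~4.5 effective bits per weight for a Q4-class quant; real GGUF sizes vary with the quant mix):

```
# 35B params at ~4.5 bits/weight: over a 16 GB card on weights alone
echo "scale=1; 35 * 4.5 / 8" | bc   # ≈ 19.6 GB, before any KV cache
```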
hurdurdur7@reddit
That context size plus that model doesn't fit in your VRAM. You're suffering because you're offloading to the CPU and regular RAM.
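Back-of-envelope KV-cache math (the layer and head counts below are placeholders, not this model's real config; q8_0 K/V is roughly 1 byte per element):

```
# kv_bytes = 2 (K and V) * n_layers * n_kv_heads * head_dim * n_ctx * 1 byte
echo $(( 2 * 48 * 8 * 128 * 128000 / 1000000000 ))  # ~12 GB at 128k ctx
echo $(( 2 * 48 * 8 * 128 *  64000 / 1000000000 ))  # ~6 GB at 64k ctx
```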
CrowKing63@reddit (OP)
Wow... 64k context = double the speed!
CrowKing63@reddit (OP)
I'll shorten the context and test again.
maxpayne07@reddit
Won't LM Studio let you also use the iGPU and share some layers with the Ryzen's 780M? I'd love to know if that's possible; I want to buy a laptop with an AMD Ryzen iGPU plus a mobile NVIDIA GPU.
xPXpanD@reddit
Not sure how fast an iGPU would be here, but chances are the combo won't be great; you'd be stepping down to Vulkan because the iGPU doesn't speak CUDA (at least until mixed-backend setups become more common).
Plastic-Stress-6468@reddit
Isn't decoding mainly memory bandwidth bound? So using the iGPU is basically the same as just running on the CPU?
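Rough intuition (all numbers below are ballpark guesses): decode tops out around memory bandwidth divided by the bytes read per token, and the iGPU reads from the same system RAM as the CPU:

```
# ~80 GB/s dual-channel DDR5, ~2.25 GB of active weights read per token
# (a 4B-active MoE at ~4.5 bits/weight): same ceiling for iGPU and CPU
echo "scale=1; 80 / 2.25" | bc   # ≈ 35.5 t/s theoretical upper bound
```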
xPXpanD@reddit
From what I understand, kind of? The iGPU has dedicated hardware for handling some of this stuff, but memory is the big bottleneck.
The main issue here comes when you want to chain in the dedicated NVIDIA chip as well. Once a model exceeds VRAM and starts spilling into system RAM, the iGPU will just be fighting the dGPU trying to get a piece of that system RAM cake.
Probably better off just using the GPU + offload alone, and leaving the iGPU to handle background tasks. Might also free a bit more VRAM doing it this way.
(also, looking back, I was a little... fuzzy on that initial reply; backends probably won't matter too much for this specific setup)
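If you go that route, the Vulkan backend can be pinned to a single device; the env var below exists in current llama.cpp builds, but the device index is a guess, so list the devices first:

```
# assumes device 0 is the dGPU; verify with llama-server --list-devices
GGML_VK_VISIBLE_DEVICES=0 llama-server -m model.gguf -ngl 99
```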
CrowKing63@reddit (OP)
Hmm... I never even considered that possibility.
maxpayne07@reddit
Please do a test. For old-time Ryzen iGPUs 😁
xPXpanD@reddit
Fair warning: It'll probably be slower than just using the NVIDIA chip alone and offloading to RAM from there. (the traditional way)
The iGPU is tied to system memory, which is slow by definition. If you combine both chips through Vulkan, you're just adding extra steps before hitting that RAM. You may see a tiny uptick in prompt processing (if that), but I'd expect generation to end up slower, not faster.
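A sketch of that traditional way, partial offload on the dGPU alone (the -ngl value is a guess you'd tune per model and VRAM size):

```
# keep as many layers as fit on the NVIDIA card; the rest runs from system RAM
llama-server -m model.gguf -ngl 28
```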