Using a Radeon 9060 XT 16 GB, the Gemma4 24B A4B IQ4_NL model achieves 25.9 t/s

Posted by CrowKing63@reddit | LocalLLaMA | 18 comments

I'm testing local LLMs on a gaming mini PC (AMD 7840HS, 32 GB RAM) paired with an eGPU (Radeon 9060 XT, 16 GB VRAM). I'm not very familiar with llama.cpp, so I kept getting unsatisfactory results, but with the recent Gemma4 24B A4B IQ4_NL model I finally reached 25.9 t/s. I even connected it to OpenCode and asked questions about my codebase, and at this speed it feels usable. Here's the command I'm running:

```
llama-server -hf unsloth/gemma-4-24B-A4B-it-GGUF:UD-IQ4_NL \
  --fit on \
  --fit-ctx 128000 \
  --fit-target 256 \
  -np 1 \
  -fa on \
  --no-mmap \
  --mlock \
  --threads 8 \
  -b 512 \
  -ub 256 \
  -ctk q8_0 \
  -ctv q8_0 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0 \
  --reasoning-budget -1
```

That 25.9 t/s figure is what I get with this configuration.

If I increase -b or -ub any further, the model won't even load. Are there any unnecessary arguments here, or any that could be tuned better?

Thanks.