GMKtec EVO-X2 70B expectation
Posted by Non-Technical@reddit | LocalLLaMA | 17 comments
I would like to use a 70B model on a GMKtec EVO-X2 AI Mini PC 128GB.
Selected this one: Llama-3.3-70B-Instruct-Q4_K_M.gguf
Ubuntu 24.04.4 LTS, with the llama.cpp server compiled for gfx1151. GRUB has ttm.pages_limit=26214400 set, so ~100GB of the unified memory is available to be shared. All of the layers are going onto the GPU.
I'm getting 5.25 predicted tokens per second, which is a bit slower than what I've read. Is that normal?
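For reference, the whole setup boils down to roughly this (flags and values are approximate from memory, not an exact copy of what I ran; check the llama.cpp ROCm/HIP build docs for the current options):

# /etc/default/grub: append to the existing options (26214400 4K pages ≈ 100 GiB)
GRUB_CMDLINE_LINUX_DEFAULT="... ttm.pages_limit=26214400"
sudo update-grub && sudo reboot

# build llama.cpp with HIP for the gfx1151 iGPU
cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

# serve with all layers offloaded to the GPU
./build/bin/llama-server -m Llama-3.3-70B-Instruct-Q4_K_M.gguf -ngl 99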
HopePupal@reddit
those are normal speeds for large dense models, yes. is there some reason you need to use a 70B dense antique? did Gemini tell you what to use? there are better models available for that hardware.
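rough math if you want a sanity check: the Q4_K_M 70B file is ~40GB of weights, strix halo has ~256GB/s of memory bandwidth, and a dense model has to read every weight for every generated token, so the theoretical ceiling is around 256/40 ≈ 6 tok/s before any overhead. 5.25 is right about where it should be.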
Non-Technical@reddit (OP)
Thank you very much. I wanted it to be a good story teller and so I thought a dense model would be the best for creative writing. Going to look into your suggestions.
HopePupal@reddit
you're not wrong in theory, but in practice, all of the open-weight labs gave up on dense models in that size class a while ago, and the current set of MoEs are much more capable than old dense models due to newer training methods.
the first two i listed are the better writers of that bunch, although ime all of them are better than any LLaMA.
fwiw, the two flagship small dense models from this year are Qwen 3.6 27B and Gemma 4 31B, but both of them are still pretty slow on hardware like yours (and mine, i have the same GMKtec), and Qwen at least is not a good writer.
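the speed side is simple too: on bandwidth-limited hardware a MoE only has to read the weights of its active experts for each token, a few GB instead of the full ~40GB, so decode ends up several times faster on the same memory bus.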
Non-Technical@reddit (OP)
I really am impressed with Gemma 4. It has a warm writing style. Last night I took your advice and downloaded the Qwen 3.5 A10B. It took around 45 minutes just to load, but after that it was very fast, about 17 tokens per second, and it seemed very friendly and wrote well. Next I'll try a few others from your list. Thanks for the advice. I'm certainly not committed to the earlier 70B dense model, but I wanted to push the hardware and see if the results were in line with expectations, meaning the configuration was good.
D2OQZG8l5BI1S06@reddit
There's also Gemma 4 26B A4B MoE if you prefer the speed.
platteXDlol@reddit
Look into GTT, it should give you more than 100GB of your unified memory to use.
Non-Technical@reddit (OP)
It says that gttsize is deprecated and to use ttm.pages_limit instead. I have both in GRUB.
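For what it's worth, the relevant bit of my kernel command line looks roughly like this (the gttsize value is in MiB and only an example, ttm.pages_limit is the one quoted above):

GRUB_CMDLINE_LINUX_DEFAULT="... amdgpu.gttsize=102400 ttm.pages_limit=26214400"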
platteXDlol@reddit
Who said it's deprecated? On my 96GB machine I get ~94GB instead of the BIOS 48GB.
Non-Technical@reddit (OP)
It works now, but apparently it may be removed in the future. After setting it in GRUB and restarting, it displays this in the log:
[ 4.842826] amdgpu 0000:c5:00.0: amdgpu: [drm] Configuring gttsize via module parameter is deprecated, please use ttm.pages_limit
platteXDlol@reddit
Didn't know that, thanks.
llama-impersonator@reddit
try gemma 4 26b, it'll be fast on your system and is a nice writer.
Non-Technical@reddit (OP)
Completely agree. Love that one.
wiltors42@reddit
My favorite as of now is Qwen 3.5 122b Q6. Minimax is also good but too large for full context.
JamesEvoAI@reddit
As the other person called out, try a model that wasn't released in the last century lol. Qwen 3.6 has a great 30B MoE that runs at ~40 tok/s on Strix Halo.
The way the attention mechanism works, for every new token it generates it has to iterate over every token that came before it. This means that as the context length grows, the speed drops. At the end of the day the Strix Halo platform is limited by its memory bandwidth.
It's still a great platform, you just have to work within its limitations. Try the Qwen MoE models and see how you fare.
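To put ballpark numbers on the context part: a 70B-class Llama has roughly 80 layers with 8 KV heads of 128 dims, so the KV cache works out to a few hundred KB per token at FP16. By the time you're at 32k of context, that's an extra ~10GB the GPU has to read for every new token, on top of the ~40GB of weights, which is why tok/s sags as the context fills up.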
Non-Technical@reddit (OP)
Thank you. This is good information. There is so much to learn but I'm loving every minute of it.
No-Consequence-1779@reddit
Iterate over every previous token or iterate every layer? It’s predicting the next token so …. Well I’m sure you’ll figure it out.
Fit-Produce420@reddit
Llama 3.3 is really old.
Try GPT-OSS 120B or StepFun 3.5 Flash.