LLMs on Android (Snapdragon 8 Elite) MOE Experience

Posted by Dance-Till-Night1@reddit | LocalLLaMA | 11 comments

So I bought a phone with a Snapdragon 8 Elite (Gen 4) and 24GB of RAM (Honor Magic 7 Pro). My experience has been mixed, but there is solid potential. Hexagon NPU and OpenCL GPU support and updates have been rolling in fast, but the fastest prompt processing and token generation are still mostly on CPU (I would bet that soon enough either the NPU or the GPU will be faster, or more realistically both). CPU inference has the downside of generating more heat than NPU or GPU inference, but overall it's still the fastest option right now.

Right now there are no phones with 32GB of RAM unless you count virtual RAM extension, which doesn't work for LLMs of course, so the best you can do is 24GB. What can you do with 24GB of RAM and a smartphone processor though? Quite a lot actually. MOE models have been getting quite popular, and the Q4 quants of these models are great and fit into 24GB. My personal recommendations are IQ4_XS and MXFP4_MOE: in my testing MXFP4_MOE is noticeably faster, but for the size IQ4_XS can't be beaten. Q4_0 is more optimised, but quality-wise it's worse than both (subjectively, from my own experience). It goes without saying, but Q4_K_M is also quite reliable from a speed/quality/size standpoint.
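To get a feel for which quants fit, here is a minimal sketch that estimates file size from total parameter count and bits per weight. The bits-per-weight values are rough approximations I am assuming for illustration (real GGUF files mix tensor types, so actual sizes differ), and MXFP4_MOE is left out because I am not sure of its effective bits per weight. The function name is hypothetical.

```python
# Approximate bits per weight for a few llama.cpp quant types.
# These are assumed ballpark figures, not measured file sizes.
APPROX_BITS_PER_WEIGHT = {
    "IQ4_XS": 4.25,
    "Q4_0": 4.5,
    "Q4_K_M": 4.85,
}


def estimate_size_gb(total_params_billion: float, quant: str) -> float:
    """Rough model size in GB: parameters * bits per weight / 8 bits per byte."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return total_params_billion * bits / 8


if __name__ == "__main__":
    # Example: a 30b-total MOE model at each quant type.
    for quant in APPROX_BITS_PER_WEIGHT:
        print(f"30b @ {quant}: ~{estimate_size_gb(30, quant):.1f} GB")
```

Keep in mind this only covers the weights; context (KV cache) and the rest of the system need RAM on top of that.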

The main models I use currently are Qwen3.6/3.5-35b-A3B (I prefer 3.5), Qwen3-30b-a3b-2507 (good quality, less RAM, and more room to run other applications without crashing), Gemma-4-a4b-26b, LFM-24b-a2b, and GPT-OSS-20B. The one I recommend the least is GPT-OSS: it's way too censored and too easy to spook into a refusal if your query even hints at something it deems unsafe. All of them are MOE models, which makes both intelligence and speed really good. You can try your luck with different quants of these models, but I settled on MXFP4 for max speed at a good quality and IQ4_XS for the best quality/size, so I can fit other apps into RAM and not just be using LLMs.

LFM is by far the fastest and smallest model and it's incredibly smart for its size and speed. They should really make more MOE A2b models because this works so so well. Other models I listed are slower but noticeably smarter.

You will get token generation anywhere between about 25 tokens per second (LFM) and about 11 tokens per second (Gemma). Prompt processing speed really needs to improve though (LFM is 60 and Gemma is 40 tokens per second). Different quants will have different speeds, so treat these as roughly what you will get on average from Q4 quants. Updates will probably make it faster, and I would assume other advancements like MTP will help too.
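To show why prompt processing matters so much, here is a rough sketch of how long a single reply takes given those two speeds. It assumes the total time is simply prompt tokens divided by PP speed plus output tokens divided by TG speed, and ignores model load time and thermal throttling. The function name and the example token counts are made up for illustration.

```python
def estimate_response_seconds(
    prompt_tokens: int,
    output_tokens: int,
    pp_tokens_per_s: float,
    tg_tokens_per_s: float,
) -> float:
    """Time to read the prompt plus time to generate the reply, in seconds."""
    return prompt_tokens / pp_tokens_per_s + output_tokens / tg_tokens_per_s


if __name__ == "__main__":
    # Hypothetical request: 1200-token prompt, 250-token reply.
    lfm = estimate_response_seconds(1200, 250, pp_tokens_per_s=60, tg_tokens_per_s=25)
    gemma = estimate_response_seconds(1200, 250, pp_tokens_per_s=40, tg_tokens_per_s=11)
    print(f"LFM:   ~{lfm:.0f} s")
    print(f"Gemma: ~{gemma:.0f} s")
```

With a long prompt, most of the wait is prompt processing rather than generation, which is why faster PP would help the most.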

I have no idea whether I should write a guide or not, but to keep it simple: if you want to try your luck with your device, use PocketPal, and as a general rule of thumb load models that don't exceed 75% of your system RAM. Dense models will be a lot slower (14b dense models are way slower than 20-30b MOE models).
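The 75% rule of thumb can be sketched like this. It is only a quick check on model file size against total RAM; the function names are hypothetical and not something PocketPal exposes, and the example sizes are made up.

```python
def max_model_gb(total_ram_gb: float, max_fraction: float = 0.75) -> float:
    """Largest model size (GB) allowed under the rule of thumb."""
    return total_ram_gb * max_fraction


def fits_rule_of_thumb(
    model_gb: float, total_ram_gb: float, max_fraction: float = 0.75
) -> bool:
    """True if the model file is within the allowed share of system RAM."""
    return model_gb <= max_model_gb(total_ram_gb, max_fraction)


if __name__ == "__main__":
    ram_gb = 24  # e.g. a 24GB phone
    print(f"Budget: {max_model_gb(ram_gb):.1f} GB")
    for size_gb in (12.0, 16.0, 19.0):  # hypothetical model file sizes
        verdict = "ok" if fits_rule_of_thumb(size_gb, ram_gb) else "too big"
        print(f"{size_gb:.1f} GB model: {verdict}")
```

On a 24GB phone that works out to an 18GB budget, which leaves the rest for Android, the context, and other apps.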

A quick test with Q4_K_M of both models (PP = prompt processing, TG = token generation, in tokens per second):
LFM2-24b-a2b: 55 PP, 24 TG
Phi-4-14b: 13 PP, 4 TG
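A small sketch comparing those two results against the active parameter counts. The active counts are an assumption read off the model names (a2b taken as roughly 2b active per token, and the dense 14b model as 14b active), so treat the comparison as a rough illustration only.

```python
# Numbers from the quick test above; "active_b" values are assumed from the names.
results = {
    "LFM2-24b-a2b": {"pp": 55, "tg": 24, "active_b": 2},
    "Phi-4-14b": {"pp": 13, "tg": 4, "active_b": 14},
}

moe = results["LFM2-24b-a2b"]
dense = results["Phi-4-14b"]

# How many times faster the MOE model is at each stage.
pp_speedup = moe["pp"] / dense["pp"]
tg_speedup = moe["tg"] / dense["tg"]

# How many times fewer parameters the MOE model touches per token.
active_ratio = dense["active_b"] / moe["active_b"]

if __name__ == "__main__":
    print(f"PP speedup: {pp_speedup:.1f}x")
    print(f"TG speedup: {tg_speedup:.1f}x")
    print(f"Active parameter ratio: {active_ratio:.1f}x")
```

The token generation speedup lands close to the ratio of active parameters, which fits the idea that fewer active parameters per token means faster generation, even though the MOE model is larger in total.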

Also more A2b and A1b models up to 30b total parameters please and thank you! AND LFM 2.5 24b a2b WHEN?

If anyone has any questions or anything they want me to test don't hesitate to ask.

jamaalwakamaal@reddit

Try Marco Mini

Dance-Till-Night1@reddit (OP)

Tested the mini: more than 45 t/s generation speed, but not super smart. The speed is impressive, so hopefully the next version is 30b total while still A1b; maybe then it would be the same speed but smarter.

Will try it today and report back.

pdycnbl@reddit

I tried Qwen 3.5 9b and was getting 6-7 t/s TG on CPU. I was not able to get it to work on the GPU; it would crash with OpenCL after offloading more than 4 layers. This was all on Termux + llama.cpp. I got better results with the Google Edge AI app with Gemma 4 e2b/e4b LiteRT models.

Queasy-Contract9753@reddit

Which Android version are you on, and how did you get OpenCL to work? I'm on an older setup, Snapdragon 855 with Android 10, and I can't get GPU acceleration to work in llama.cpp. Vulkan didn't work either: it needed version 1.2 but I only have 1.187

I am on Android 16. I followed this tutorial, except that instead of compiling I used the binaries from the Termux pkg. https://github.com/JackZeng0208/llama.cpp-android-tutorial
Vulkan did not work for me either.

SkyFeistyLlama8@reddit

That is a ridiculous amount of RAM on a phone LOL!

I'm on the laptop version of that chip, X1 Elite on Windows, and you're right about CPU inference being the fastest. It also generates a ton of heat and drains the battery quickly.

Llama.cpp supports ARM accelerated instructions for MXFP4_MOE so I would stick with that quantization instead of older ones like IQ4_NL or Q4_0.

I only use Adreno OpenCL inference on smaller dense models like Mistral Small 24B or Nemo 12B. It's slower but it saves power compared to CPU inference.

I don't have experience with NPU inference using llama.cpp, only with Nexa AI and Microsoft Foundry Local. Hexagon NPU support on Android requires a long build process.

It's currently the biggest RAM supported on a phone. Hopefully the next generation of phone SoCs supports 32GB, which will make it even better for MOE models. But the smaller the activated params, the faster it is. Someone recommended testing Marco Mini, which has 0.86b activated params and on paper will be the absolute fastest, but fingers crossed it holds up quality-wise.

The major downsides are:

Heat (Thermal management on phones isn't the best. Your phone will get hot if you don't give it breaks or have some form of active cooling)
Price (24GB ram phones are usually high-end and expensive)
Battery drain (You're working with a phone battery, so it can't be helped that LLMs will drain it fast)

user92554125@reddit

This is interesting. Thanks for sharing.

What are you using for the inference backend and UI?

PocketPal. I'm not a developer or a coder, so this has made benchmarking and trying out different models so much simpler. I use CPU mostly because it's still the fastest, but OpenCL and NPU (Hexagon) have been catching up really well with every update.

MOE models between 20b and 30b have truly been a lifesaver quality-wise and speed-wise; otherwise we'd be stuck using slow dense models.