LLMs on flagships smartphones?

Posted by TechNerd10191@reddit | LocalLLaMA | View on Reddit | 9 comments

I have been curious to see how small LLMs like Gemma-4-E2B-it run on a flagship smartphone (S25+ with Snapdragon 8 Elite) in terms of prompt processing and token generation. I have created a script that uses llama-cli and I achieve 48 tps prompt processing and 15 tps generation. Note that I run the script via Termux and use the Q4_K_M model.

However, I can't push it beyond these speeds. Changing the threads (2, 4 or 8) does not yield different results, and even key/value data types (q4_0, q8_0, f16) do not seem to affect generation speeds.

Is there something I am missing (specific llama.cpp build for ARM or Vulkan engine) or not? What speeds are you getting if you have tested LLMs on smartphones?

[-]

stddealer@reddit

I have the same Snapdragon 8 Elite (leading version) . With liteRT I get the following performance:

Gemma-4-E4B-it(GPU): Prefill: 551.50 t/s Decode speed 17.83 t/s

Gemma-4-E2B-it(GPU): Prefill : 1411.74 t/s Decode speed: 34.74 t/s

[-]

FullOf_Bad_Ideas@reddit

I didn't realize S25+ and S26 had just 12GB of RAM, that's pretty bad. Many current flagships have 24GB of VRAM.

I like running DeepSeek V2 Lite Q4_0. Old model but it flies and I like it. ChatterUI 0.9.0 on Redmagic 8S Pro 16GB. 15 - 25 t/s TG depending on details.

Ling Mini Q4_0 does about 25 t/s TG.

Gemma 4 E2B Q4_0 had TG around 18 t/s.

Short context only, I didn't test prefill speed since it was just 30-200 tokens of prompt.

[-]

AXYZE8@reddit

Not single current flagship has 24GB of RAM lol

Even the most expensive chinese flagships like Oppo Find X9 Ultra has 12GB as base.

https://m.gsmarena.com/results.php3?nYearMin=2025&nRamMin=24000

Zero flagships in 2025/2026. It's just 6 niche gaming phones and 1 ridiculously expensive Honor Porsche RSR.

Maybe you saw some "24GB*" phones with the fineprint that it's 12GB RAM+12GB Virtual RAM aka SWAP?

[-]

FullOf_Bad_Ideas@reddit

Not single current flagship has 24GB of RAM lol

Looks like many current flagships had 24GB of VRAM and they lost it upon refresh. You can still buy 2024 flagships and get 24GB of RAM. I am not keeping up with smartphones, but OnePlus 13 was definitely a flagship.

Also, ZTE nubia RedMagic 11 Pro+ and ZTE nubia Z70S Ultra, Asus ROG Phone 9 Pro could be considered flagships (or not, depending on your definition) and have 24GB of RAM as an option. If ZTE can do it, why Samsung that literally makes DRAM chips can't?

Maybe you saw some "24GB*" phones with the fineprint that it's 12GB RAM+12GB Virtual RAM aka SWAP?

No, I just do consider those phones flagships. They clearly pack more than current Galaxy S or Xiaomi Mi, and there's no excuse for Samsung and Xiaomi to not offer 24GB as an option if CHEAPER phones do have it.

[-]

vasimv@reddit

I have urge to ask LLM to build an android app with uncensored 2/4b local model and wikipedia pack for emergency local use (like survival in wild areas or during emergencies)... 😄

[-]

AXYZE8@reddit

Use LiteRT-LM (for example in Google AI Edge Gallery).

https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm

Device	Backend	Prefill (tokens/sec)	Decode (tokens/sec)
S26 Ultra	CPU	557	46.9
S26 Ultra	GPU	3,808	52.1

If you really want to stick with llama.cpp then you should use Q4_NL quants, not Q4_K_M. They run A LOT faster on ARM processors.

[-]

dampflokfreund@reddit

Wow, these are insane speeds. It's just a shame Edge Gallery is not usable as it lacks basic features such as a proper chat history and context mangement, regenerating/swiping, multiple chats with multiple system prompts, OpenAI compatible endpoint exposure and more.

[-]

v-porphyria@reddit

there are some forks of it in GitHub that have added some of these features.

here's a couple that I found interesting:

https://github.com/techjarves/mobile-server

https://github.com/jegly/Box

[-]

Super-Strategy893@reddit

You must run LLMs with npu aceleration. The most usual way to do it with Qualcomm is via QNN framework. LiteRT Will give npu aceleration too.

Llama.cpp suport for hexagon npu still in progress