LLMs on flagship smartphones?

Posted by TechNerd10191 | r/LocalLLaMA

I have been curious to see how small LLMs like Gemma-3n-E2B-it run on a flagship smartphone (S25+ with Snapdragon 8 Elite) in terms of prompt processing and token generation. I created a script that runs llama-cli, and I get 48 tps prompt processing and 15 tps generation. Note that I run the script via Termux and use the Q4_K_M quant.
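For context, my script is essentially a thin wrapper around a call like the one below (the model path and prompt are placeholders, not my exact setup); llama.cpp's llama-bench also gives cleaner pp/tg numbers than timing llama-cli by hand:

```bash
# Rough shape of my Termux script; model path and prompt are placeholders
./llama-cli \
  -m ~/models/gemma-3n-E2B-it-Q4_K_M.gguf \
  -p "Explain KV-cache quantization in one paragraph." \
  -n 256 \
  -t 4

# Dedicated benchmark: reports prompt-processing (pp) and generation (tg) tps
./llama-bench -m ~/models/gemma-3n-E2B-it-Q4_K_M.gguf -p 512 -n 128
```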

However, I can't push it beyond these speeds. Changing the thread count (2, 4, or 8) does not change the results, and even the key/value cache data types (q4_0, q8_0, f16) do not seem to affect generation speed.
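Concretely, these are the kinds of variations I tried (sketched from memory, exact prompt omitted); as far as I know, a quantized V cache in llama.cpp also requires flash attention to be enabled:

```bash
# Thread sweep: 2, 4, and 8 all land at roughly the same tps
for t in 2 4 8; do
  ./llama-cli -m model.gguf -p "$PROMPT" -n 256 -t "$t"
done

# KV cache types: -ctk/-ctv set the K/V cache data types;
# quantized V cache needs flash attention (-fa), if I understand correctly
./llama-cli -m model.gguf -p "$PROMPT" -n 256 -t 4 -fa -ctk q8_0 -ctv q8_0
```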

Is there something I am missing, such as a llama.cpp build with ARM-specific optimizations or the Vulkan backend? What speeds are you getting if you have tested LLMs on smartphones?
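For anyone in the same situation, this is the kind of rebuild I plan to try next. The -march string is an assumption based on what I have seen recommended for recent Snapdragon cores (not verified on my device), and the Vulkan path assumes the Vulkan headers/loader are installed in Termux:

```bash
# CPU build with ARM dotprod/i8mm explicitly enabled; the -march value
# is my guess for the Snapdragon 8 Elite, adjust as needed
cmake -B build -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_C_FLAGS="-march=armv8.7-a+dotprod+i8mm" \
      -DCMAKE_CXX_FLAGS="-march=armv8.7-a+dotprod+i8mm"
cmake --build build -j

# Alternative: Vulkan backend targeting the Adreno GPU
cmake -B build-vk -DCMAKE_BUILD_TYPE=Release -DGGML_VULKAN=ON
cmake --build build-vk -j
```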