How to make PocketPal inference faster on Android?

Posted by Ok_Warning2146@reddit | LocalLLaMA | 4 comments

I have a OnePlus 12 (24GB) running LineageOS 22.2 with 6.44GB of zram. I ran the PocketPal benchmark at the defaults: pp=512, tg=128, pl=1 and rep=3.

|pp|tg|time|PeakMem|Model|
|:-|:-|:-|:-|:-|
|14.18t/s|6.79t/s|2m50s|81.1%|Qwen3-30B-A3B-Instruct-2507-UD\_Q5\_K\_XL|
|17.42t/s|4.00t/s|3m4s|62.0%|gemma-3-12b-it-qat-Q4\_0|

The Qwen model is about 21.7GB and the gemma model is 6.9GB. It seems that PeakMem refers to the peak memory used by the whole system, as the gemma model alone shouldn't fill up 62% of 24GB. In that sense, I presume some of the 21.7GB Qwen model went to zram, which is like a compressed swap stored in RAM.

Would adjusting the zram size affect performance? Would it perform much better if I used a 16GB Qwen model?

I noticed that the PocketPal benchmark doesn't offload anything to the GPU. Does that mean only the CPU is used? Is it possible to make PocketPal use the GPU?

Thanks a lot in advance.
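The memory reasoning in the post can be checked with some quick arithmetic. This is only a rough sketch built on assumptions the post does not confirm: that PeakMem is a percentage of the full 24GB, that the gemma model was fully resident during its run, that the rest of the system used about the same amount of memory in both runs, and that all sizes are in the same units. The function names are made up for illustration.

```python
# Back-of-envelope check of the numbers quoted in the post.
# Assumption: PeakMem is a percentage of total system RAM (24 GB).
TOTAL_RAM_GB = 24.0


def peak_gb(peak_pct, total_gb=TOTAL_RAM_GB):
    """Convert a PeakMem percentage into GB of system memory."""
    return total_gb * peak_pct / 100.0


def baseline_gb(peak_pct, model_gb, total_gb=TOTAL_RAM_GB):
    """Memory used by everything except the model.

    Assumption: the model was fully resident when the peak was measured.
    """
    return peak_gb(peak_pct, total_gb) - model_gb


def overflow_gb(model_gb, base_gb, total_gb=TOTAL_RAM_GB):
    """GB that cannot stay resident if model + baseline exceeds total RAM."""
    return max(0.0, model_gb + base_gb - total_gb)


# Baseline estimated from the gemma run (62.0% peak, 6.9 GB model).
base = baseline_gb(62.0, 6.9)

print(f"gemma peak: {peak_gb(62.0):.2f} GB")
print(f"Qwen peak:  {peak_gb(81.1):.2f} GB")
print(f"estimated non-model baseline: {base:.2f} GB")

# 21.7 GB is the Qwen file from the post; 16.0 GB is the hypothetical
# smaller Qwen quant the post asks about.
for name, size in [("Qwen 21.7GB", 21.7), ("hypothetical 16GB", 16.0)]:
    print(f"{name}: overflow {overflow_gb(size, base):.2f} GB")
```

Under these assumptions the non-model baseline comes out at roughly 8GB, so the 21.7GB Qwen file would exceed RAM by about 5.7GB, while a 16GB model would only just fit with almost no headroom. The observed Qwen peak of about 19.5GB is below 24GB, which is consistent with part of the model not being resident at peak, but the numbers alone do not show whether that part went to zram or was simply paged out.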


pmttyji@reddit

30B models (even 20B) are too much for mobile. Smaller models work better (e.g., Qwen3-8B or 14B). If you still want to use the same 30B model, try [this pruned one, which comes in at 15B](https://www.reddit.com/r/LocalLLaMA/comments/1octe2s/pruned_moe_reap_quants_for_testing/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button). Yes, a 15B-A3B version of the 30B-A3B model.

Ok_Warning2146@reddit (OP)

It is still A3B? Will it be any faster? I think running 30B-A3B at 7t/s is good enough for simple Q&As.

> It is still A3B? Will it be any faster?

Yes, same A3B. But this pruned model (15B-A3B, half the size of the original model) will give you about double the speed, like 14-15 t/s. Q4 could possibly give you 20 t/s.

Intelligent-Gift4519@reddit

Try using Paage.ai instead. I don't think PocketPal is using particularly up-to-date APIs. Also, I do think you are using models that are too large for your available RAM. Try a 9B.
