Phone LLM benchmarks?
Posted by ctrl-brk@reddit | LocalLLaMA | View on Reddit | 34 comments
I am using PocketPal with small (<8B) models on my phone. Is there any benchmark out there comparing the same model on different phone hardware?
It will influence my decision on which phone to buy next.
FullOf_Bad_Ideas@reddit
I don't think there's anything like that. I think you're going to be limited mostly by memory read speed, so look for phones with a lot of fast RAM. I got a phone for running LLMs recently. DeepSeek V2 Lite 16B runs pretty great on phones if you have 16GB of RAM.
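To see why memory bandwidth dominates here: each generated token has to stream essentially all of the model's weights from RAM, so bandwidth divided by model size gives a rough ceiling on decode speed. A back-of-the-envelope sketch (the bandwidth and model-size figures below are illustrative assumptions, not numbers from this thread):

```bash
# Rough decode-speed ceiling = memory read bandwidth / bytes read per token
bandwidth_gb_s=50    # assumed LPDDR5X-class phone bandwidth
model_size_gb=4      # e.g. an ~8B model quantized to ~4 bits per weight
echo "scale=1; $bandwidth_gb_s / $model_size_gb" | bc   # ~12 tokens/s upper bound
```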
datashri@reddit
So processor speed, comparing the latest gen to the previous gen, doesn't matter as much?
FullOf_Bad_Ideas@reddit
It matters in some cases like longer context. I think NPUs are getting more use now than a year ago when I was writing the comment above.
Divniy@reddit
Just curious, how fast does it drain the battery?
FullOf_Bad_Ideas@reddit
Do you want specific numbers, or is a general idea enough? The phone gets very hot during CPU inference. It's a metal phone with a small fan (RedMagic 8S Pro), and it's almost too hot to touch after 10 minutes if I run it without the plastic case that also traps some air. It has a big battery and charges very quickly, so I haven't been worrying about battery drain specifically.
Divniy@reddit
Just a general idea is enough, thank you!
Just found this whole subject interesting and was wondering how practical it is now. I had a discussion fairly recently with a guy who said "we are not even close to local usage of LLMs", where I pointed out that we're already at the point where you can run pretty good stuff on just a MacBook. And he countered that most consumers of LLMs do it on their phones.
16GB, 6.5 t/s, and a 10-minute thermal limit make it sound like it's mostly "to prove a point" rather than practical. I wonder at what point we'll break that barrier.
FullOf_Bad_Ideas@reddit
I'm not really focused on generating code or creative writing on a phone, but I don't think I'd do that even if inference of bigger models were quicker - it's just not a good platform for it.
Phones are a good platform for a quick chat with a short answer, maybe a multi-turn chat when you're bored and don't have anyone to turn to. Somewhat useful for traveling, especially if the internet isn't good. I found using Mistral Large 2 and Hermes Llama 3 405B via API in a mobile app useful on the last trip I took a few months ago; local models could fill that role eventually. Plus, multimodal local models should start getting useful soon - I tried Qwen 2 7B VL in MNN-LLM and asked it for a recipe based on a photo I took of my fridge. Around 90% of the things it suggested were hallucinated, so we're not there yet.
Divniy@reddit
How did you install the models? How tough is the setup?
FullOf_Bad_Ideas@reddit
Setup is very simple, similar to koboldcpp, oobabooga, or Jan I guess. I use ChatterUI, the version just before stable 0.8.3, so one of the betas, since those still support q4_0_4_8 quants. But you should pick a newer version since you don't have a pile of old quants to keep using. So get the newest ChatterUI APK and download a normal GGUF from Hugging Face (Q4_0 quants are specifically optimized to run faster on ARM), then just import the GGUF files through the UI and load them. Very simple to set up, no CLI or anything like that.
https://github.com/Vali-98/ChatterUI
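For reference, a minimal way to fetch a GGUF from Hugging Face before importing it through the ChatterUI file picker (a sketch; the repo and file names are placeholders, run it on a computer or in a terminal app like Termux):

```bash
# Download a Q4_0 GGUF with the Hugging Face CLI, then import it from ChatterUI
pip install -U "huggingface_hub[cli]"
huggingface-cli download <user>/<model>-GGUF <model>-Q4_0.gguf --local-dir ~/models
```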
Divniy@reddit
Thank you!
ctrl-brk@reddit (OP)
How many tps?
FullOf_Bad_Ideas@reddit
Deepseek V2 Lite Chat q5_k_m quant in ChatterUI.
Context Length: 4096 | Threads: 4 | Batch Size: 512

[00:23:43] : Regenerate Response: false
[00:23:43] : Obtaining response.
[00:23:43] : Approximate Context Size: 44 tokens
[00:23:43] : 30.15ms taken to build context
[00:24:38] : Saving Chat
[00:24:38] : [Prompt Timings]
Prompt Per Token: 103 ms/token
Prompt Per Second: 9.62 tokens/s
Prompt Time: 4.78s
Prompt Tokens: 46 tokens

[Predicted Timings]
Predicted Per Token: 152 ms/token
Predicted Per Second: 6.56 tokens/s
Prediction Time: 49.82s
Predicted Tokens: 327 tokens
One weird thing is that token generation speed isn't smooth and oscillates. RedMagic Nubia 8S Pro, 16GB.
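For reference, the tokens/s figures in these logs are just token counts divided by elapsed time; a minimal check using the numbers above:

```bash
# 327 predicted tokens generated over 49.82 s
echo "scale=2; 327 / 49.82" | bc   # 6.56 tokens/s, matching the log
```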
----Val----@reddit
Have you tested with Q4_0_4_8 quants?
FullOf_Bad_Ideas@reddit
Here's with a Deepseek V2 Lite q4_0_4_8 quant.
I had to restart the phone because the app was crashing. After the restart it also failed to build context once and I had to force close the app and open it again; then it worked.
[14:20:10] : Obtaining response.
[14:20:10] : Approximate Context Size: 166 tokens
[14:20:10] : 12.02ms taken to build context
[14:20:42] : Saving Chat
[14:20:42] : [Prompt Timings]
Prompt Per Token: 1207 ms/token
Prompt Per Second: 0.83 tokens/s
Prompt Time: 181.18s
Prompt Tokens: 150 tokens

[Predicted Timings]
Predicted Per Token: 50 ms/token
Predicted Per Second: 19.92 tokens/s
Prediction Time: 28.02s
Predicted Tokens: 558 tokens
I think the prompt processing time includes the time it took me to type the prompt, or something like that: 150 prompt tokens over 181 s works out to the logged 0.83 tokens/s, but the actual processing felt much quicker than that.
FullOf_Bad_Ideas@reddit
Not with DeepSeek V2 Lite; I will, though.
I messed with Q4_0_4_8 and Q4_0_4_4 quants on this phone with other models like Mistral Nemo and Danube3 4B, but the app was just crashing a lot.
I'm still seeing the crashes quite often, but a phone restart usually makes it more stable. Happy to give you logs (adb logcat I guess?) if you'd like to troubleshoot; it typically crashes during model loading or while it's processing the first message I send.
I have 12GB of swap enabled since it was useful for running Yi-34B 200K IQ3_XS quants, and I guess this could influence stability, though Yi-34B inference was fairly stable, just obviously slow :)
ctrl-brk@reddit (OP)
How do you control swap on Android? Are you rooted?
FullOf_Bad_Ideas@reddit
The Redmagic phone I have comes with 12GB of swap enabled by default. https://ibb.co/0Qzcvk2
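Not from the thread, but for anyone wanting to check swap on their own device, the totals are readable without root from /proc/meminfo (a minimal sketch):

```bash
# Shows SwapTotal / SwapFree in kB; run over adb or in a terminal app on the phone
adb shell cat /proc/meminfo | grep -i swap
```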
compilade@reddit
On a Pixel 9 Pro I'm getting around 9 tokens per second of `tg128` with `Llama-3.2-3B-Instruct-Q4_K_M`.

Regarding the ARM-optimized types (which can properly make use of the int8 dot product and matrix multiplication instructions) (`Q4_0_8_8`, `Q4_0_4_8`, `Q4_0_4_4`), I found `Q4_0_4_4` and `Q4_0_4_8` to be fast.

| model | size | params | backend | threads | test | t/s |
| ----------------- | -------: | -----: | ------- | ------: | ----- | -----------: |
| llama 3B Q4_0_4_4 | 1.78 GiB | 3.21 B | CPU | 4 | pp512 | 53.62 ± 0.05 |
| llama 3B Q4_0_4_4 | 1.78 GiB | 3.21 B | CPU | 4 | tg128 | 12.75 ± 0.21 |
| llama 3B Q4_0_4_8 | 1.78 GiB | 3.21 B | CPU | 4 | pp512 | 78.86 ± 1.06 |
| llama 3B Q4_0_4_8 | 1.78 GiB | 3.21 B | CPU | 4 | tg128 | 13.73 ± 0.15 |

build: 76c6e7f1 (4049)

(Note: the `tg128` of both is very close to identical in similar temperature conditions, but the `pp512` is consistently better with `Q4_0_4_8` on the Tensor G4.)

Also note that setting `-DGGML_SVE=TRUE` is necessary when compiling with `cmake` to truly benefit from `Q4_0_4_8` (using only `-DGGML_NATIVE=TRUE` was not enough).

Anyway, I suggest you try `Q4_0_4_4` (and `Q4_0_4_8`, if your `llama.cpp` build was correctly built with `sve` support). `Q4_0_8_8` is much slower from my short testing with it, probably because the `sve_cnt` is 16 for the Tensor G4 while `Q4_0_8_8` only benefits when `sve_cnt` is 32.

Also, I think on the Tensor G3 (like on the Pixel 8) you might want to compare 5 threads vs 4 threads, because there are more performance cores on the G3 than on the G4.
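For context, the table above looks like llama.cpp's llama-bench output (pp512 = prompt processing, tg128 = token generation). A rough sketch of reproducing it on-device in Termux, using the cmake flags from the comment (the model path is a placeholder, and the exact build steps may differ from what compilade used):

```bash
# Build llama.cpp with SVE enabled, then benchmark a model with 4 threads
pkg install git cmake clang
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
cmake -B build -DGGML_NATIVE=TRUE -DGGML_SVE=TRUE
cmake --build build --config Release -j4
# Defaults run the pp512 and tg128 tests shown in the table
./build/bin/llama-bench -m /path/to/Llama-3.2-3B-Instruct-Q4_0_4_8.gguf -t 4
```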
ctrl-brk@reddit (OP)
Great info! Does anyone know the username of the PocketPal dev so he can be mentioned here?
Ill-Still-6859@reddit
Hey, I recently added this PR (in llama.rn, which I use as the binding for llama.cpp) to check whether SVE is available and compile with SVE support. But I haven't tested it myself - my Android phone doesn't support SVE. If I get the chance, I'll look through Google's "Android Device Streaming" list to see if any phones there support SVE.
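Not from the comment, but a quick way to check whether a given phone's cores expose SVE is to look at the aarch64 CPU feature flags (a minimal sketch):

```bash
# "sve" appears among the feature flags if the SoC supports it
adb shell cat /proc/cpuinfo | grep -i -m1 features
```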
Steel_baboon@reddit
Thank you for building PocketPal! I want to ask a favor though: your phone benchmark page has a lot of entries that appear fake or impossible. Can you maybe filter some out, or suggest how I could? I'd like to compare phones and see which does better. Mine are most of the Pixel 9 Pro XL entries.
Red_Redditor_Reddit@reddit
I run Llama 3.2 3B Q4 on my Pixel 7. I get about 7.5 tokens per second.
ctrl-brk@reddit (OP)
Running the exact same thing, my Pixel 8 gets 7.0 tps, lol.
Same_Leadership_6238@reddit
About 15 tps on an iPhone 15 here for the same model. Make sure you're using the ARM-optimized GGUF for the models on your Pixel if you're not already; speed improves considerably (expect around a 40-50% gain).
The OnePlus 13 you mentioned below will be faster than the Pixel 8, since it not only has a considerably faster CPU (Snapdragon 8 Elite), GPU, and RAM, but also supports the latest ARM-optimized Q4_0_8_8 quantization format. It also has a more powerful NPU, although that isn't currently used during decoding.
You can also find some model benchmarks here, with some device benchmarks linked in the comments: https://www.reddit.com/r/LocalLLaMA/s/6zD8RNDTpz . There are also some benchmarks on the llama.cpp GitHub.
datashri@reddit
Doesn't the 8 Elite drain the battery at those higher speeds, though?
Some_Endian_FP17@reddit
The Snapdragon Elite cores in the new flagship Qualcomm chips are crazy fast. They're even faster than the same cores on laptops, which I'm already happy with.
I'm using Q4_0_4_8 for Snapdragon X on a laptop running llama.cpp. The same format should work on a phone.
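In case it's useful, llama.cpp builds from around this time could repack an existing GGUF into the ARM-optimized layouts with the llama-quantize tool (a sketch; file names are placeholders, and later llama.cpp versions removed these types in favor of repacking Q4_0 at load time):

```bash
# Requantize an f16 GGUF into the ARM-optimized Q4_0_4_8 layout (older builds only)
./build/bin/llama-quantize model-f16.gguf model-Q4_0_4_8.gguf Q4_0_4_8
```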
sakaba10@reddit
Can someone explain how to measure tokens per second? Is it done through programming, or does the installed app provide performance information?
kidupstart@reddit
Just out of curiosity, what are some use cases for running an LLM on a mobile device?
-BobDoLe-@reddit
I think it has huge value in an emergency without internet/cell access.
I have one as sort of a Swiss Army knife on my old Pixel 6 Pro (it seems to run this stuff better than non-Pixel phones, maybe because it's optimized for it).
You may need something in case of an emergency, so I go for something uncensored and will sacrifice tps (within reason) for quality responses. The context window might be a factor too, so I also keep a smaller model that I can run with a larger context window if needed.
Nice to have a smarter evil MacGyver in your emergency pack to help you when you need him.
ctrl-brk@reddit (OP)
Outside of the countless applications for on-device processing to provide privacy-oriented services, my particular use case is that my country is in an energy crisis.
We have no power, internet, or cell signal for 16 hours a day. So it's nice to be able to chat with someone, flesh out ideas, take notes on projects, etc.
lebante@reddit
Llama 3.2 3B Q4 on a OnePlus 12: 15 t/s.
justicecurcian@reddit
PocketPal uses llama.cpp, and there's no mention of optimizations for phone NPUs, so I think there are none. That means the LLM will run on the CPU, and basically everything will run more or less the same on high-end phones with Snapdragon chips, and a bit worse on MediaTek ones. iPhones might have better performance if llama.rn uses Metal in that configuration, but I can't confirm it. Basically, the better the CPU, the faster it will run, but 3B models are barely decent on my Galaxy S23 even with ARM optimizations.
ctrl-brk@reddit (OP)
I'm on a Pixel 8 (Tensor G3). I'm considering a OnePlus 13 but would really like to compare the two before making that purchase.
justicecurcian@reddit
It should be more performant, maybe even noticeably. I think you can return the phone if you don't like the performance.