Run Llama 3.2 3B on Phone - on iOS & Android
Posted by Ill-Still-6859@reddit | LocalLLaMA | View on Reddit | 132 comments
Hey, like many of you folks, I also couldn't wait to try Llama 3.2 on my phone. So I added Llama 3.2 3B (Q4_K_M GGUF) to PocketPal's list of default models as soon as I saw the post that GGUFs were available!
If you’re looking to try it out on your phone, here are the download links:
- iOS: https://apps.apple.com/us/app/pocketpal-ai/id6502579498
- Android: https://play.google.com/store/apps/details?id=com.pocketpalai
As always, your feedback is super valuable! Feel free to share your thoughts or report any bugs/issues via GitHub: https://github.com/a-ghorbani/PocketPal-feedback/issues
For now, I’ve only added the Q4 variant (Q4_K_M) to the list of default models, as the Q8 tends to throttle my phone. I’m still working on a way to either optimize the experience or give users a heads-up about potential issues, like insufficient memory. But if your device can handle it (e.g. has enough memory), you can download the GGUF file and import it as a local model. Just be sure to select the chat template for Llama 3.2 (llama32).
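If you'd rather grab the GGUF on a computer first, here's a minimal sketch using the huggingface_hub Python package. The repo id is the hugging-quants Q8_0 repo linked further down in this thread; the exact .gguf filename is an assumption, so check the repo's file list before running it.

```python
# Minimal sketch: download a Llama 3.2 GGUF with huggingface_hub, then transfer
# it to your phone and import it in PocketPal as a local model.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="hugging-quants/Llama-3.2-3B-Instruct-Q8_0-GGUF",
    # Hypothetical filename: verify the actual .gguf name on the repo page.
    filename="llama-3.2-3b-instruct-q8_0.gguf",
)
print("Saved to:", path)
```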
Beremus@reddit
Can you add llama 3.2 3B uncensored?
ihaag@reddit
Is the app open source? What's the iOS backend using?
Ill-Still-6859@reddit (OP)
Not yet open sourced. Might open source it, though. It uses llama.cpp for inference and llama.rn for the React Native bindings.
codexauthor@reddit
Would love to see it open sourced one day. 🙏
Thank you for your excellent work.
Ill-Still-6859@reddit (OP)
Today is that `one day` :)
https://github.com/a-ghorbani/pocketpal-ai
Organization_Aware@reddit
I cloned the repo to try to include some tools/agents, but on Android I cannot find any files. I'm new to Android development, so maybe I'm just doing something wrong 🤣 Is that right?
codexauthor@reddit
Excellent! Thank you, this is a great contribution to the open source community.
Some personal recommendations/requests:
Old_Formal_1129@reddit
llama.cpp on the phone would be a bit too power hungry and probably not as fast as it should be. An ANE implementation would be great.
KrazyKirby99999@reddit
I'll install the day it goes open source :)
lnvariant@reddit
Anyone know what’s the best context size for Llama 3.2 3B Q8? Running it on an iPhone 16 Pro.
Jesus359@reddit
Is there a way to add this to Shortcuts? I bought PrivateLLM but yours is more stable. The only thing I wish it had is the ability to pass an action through Shortcuts.
For example, I have a shortcut that takes whatever is shared, combines it with a premade prompt, then asks the user for input in case they want to ask questions or anything. Then it passes it to PrivateLLM and the loaded model does what it's asked.
Fragrant_Owl_4577@reddit
Please add Siri shortcut integration
f0-1@reddit
Hey u/Ill-Still-6859, if I may ask, where do you store the models that users download? How did you optimize this process? Do you have any tips and tricks you've picked up during development? Thanks...
f0-1@reddit
Wait... You don't even need to host it? Are we directly downloading the models from Hugging Face, at no cost to you??
Ill-Still-6859@reddit (OP)
Yes, each model in the app has a link; if you tap it, it opens the repo on Hugging Face.
mguinhos@reddit
Can you add LaTeX support?
f0-1@reddit
Hey, just curious: I know what LaTeX is, but what do you mean by LaTeX support? What do you have in mind as an expectation for the product?
vagaliki@reddit
What does the K mean in Q4_K?
f0-1@reddit
K: This likely refers to K-quants, llama.cpp's family of block-wise quantization schemes (Q2_K through Q6_K, with suffixes like _S/_M/_L for different mixes). These quantization methods optimize LLMs for faster inference and smaller memory footprints, trading a small amount of accuracy for model compression.
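As a rough back-of-the-envelope illustration (not exact, since real GGUF files mix quant types across tensors and carry metadata), here is how bits per weight translates into approximate file size for a ~3B-parameter model. The bits-per-weight values are approximations.

```python
# Approximate GGUF file size: parameters * bits_per_weight / 8 bytes.
# Bits-per-weight values below are rough averages for llama.cpp quant types.
PARAMS = 3.2e9  # Llama 3.2 3B is roughly 3.2 billion parameters

for quant, bpw in [("F16", 16.0), ("Q8_0", 8.5), ("Q6_K", 6.6), ("Q4_K_M", 4.8)]:
    size_gb = PARAMS * bpw / 8 / 1e9
    print(f"{quant:7s} ~{size_gb:.1f} GB")
```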
GoogleOpenLetter@reddit
Oh, you're the PocketPal person!
It's great - I know you like feedback, so please don't take these as criticisms, just my observations as someone that's goodish with tech, but doesn't code.
The tabs for Downloaded and Grouped, with the tick and untick, are kind of unintuitive. I think I'd switch to "Downloaded" and "Available Models", and make them more like traditional tabs. I'd break down the groups within each tab rather than as their own tab, if required; e.g. under On Device you might have an arrow dropdown for the Gemma group. I imagine most people will only have a few models that they use. I don't think you need a whole Grouped tab by itself.
I also get a confusing message when loading the GGUF of the Llama 3.2 I loaded (it worked right away before you did anything). It gives me "file already exists - replace, keep both, cancel", and cancel seems to be the only option that makes it work properly. I have no idea if that means duplicating the whole model? It's just confusing.
I'd change the "other" folder to "Local Models".
When I go to the chat menu - and it says load a model - I don't need to see the entire list of the things I don't have, which should be fixed with the suggestions above.
The two on the right are how I see the tabs working without the ticks. (the one on the left is the original). This will seem more intuitive in practice, forgive my shit paint skills.
Thanks for your work.
Ill-Still-6859@reddit (OP)
Appreciate the feedback! 🙏
GoogleOpenLetter@reddit
Oh, just another small user experience thing. After you load a model, it makes sense to jump to chat automatically; at the moment it stays on the load model page. A "load as default model when opening" option might also make sense: most people will download one model and just use that, so it would be nice if the app loaded it automatically and you could start chatting immediately.
Ill-Still-6859@reddit (OP)
Just released a new version.
Closed:
- Directs users to the chat page when hitting model load.
- From the chat page you can load the last used model, instead of navigating to the model page.
- Added support for Llama 3.2's 1B model.
- Fixed issues with loading newer GGUF files.
- Swipe right to delete a chat (instead of the left swipe, which was finicky).
- (iOS) Added a memory usage display option on the chat page.
- Improved the text message for "Reset Models".

Open: the UI/UX improvements for the tabs on the model page.
AngleFun1664@reddit
The swipe right to delete works great. I was never able to get the swipe left to work.
vagaliki@reddit
Hey just tried the app! I'm getting ~15 tokens per second on Llama 3.2 3B Q4_K and ~19 when Metal is enabled (50 layers, 100 layers both about the same). Very usable!
2 pieces of feedback:
- Enable background downloading of the model. I tried twice (switched apps the first time, phone screen locked the second time) and the download got stuck. I finally just kept my phone unlocked and on the app for a few minutes to download the model. But 3 gigs is a pretty slow download for most people.
- Put the new chat button in the side menu as well (to match the ChatGPT UI).
Particular_Cancel947@reddit
This is absolutely amazing. With zero knowledge I had it running in less than a minute. I would also happily pay a one time fee for a Pro version.
AnticitizenPrime@reddit
Is it possible to extend the output length? I'm having responses cut off partway.
Ill-Still-6859@reddit (OP)
You can adjust the number of new tokens here, in the model card settings.
SevereIngenuity@reddit
On Android there seems to be a bug with this: I can't clear it completely (the first digit) and set it to, say, 1024. Any other value gets rounded off to 2048.
Ill-Still-6859@reddit (OP)
Fixed in 1.4.3
AnticitizenPrime@reddit
Thank you!
noneabove1182@reddit
I've been trying to run my Q4_0_4_4 quants on PocketPal, but for some reason it won't let me select my own downloaded models from my file system :( They're just grayed out. I think it would be awesome and insanely fast to use them over the default Q4_K_M.
Same_Leadership_6238@reddit
For the record, I tested this quant of yours with PocketPal on iOS (iPhone 15) and it works fine: 22 tokens per second. Thanks for them. Perhaps a corrupted download on your end?
noneabove1182@reddit
It's Android, so maybe it's an issue with the app. I can see the files, but they're greyed out as if the app doesn't consider them GGUF files and won't open them.
The super odd thing is it was happening for Qwen2.5 as well, but then suddenly they showed up in the app as if it had suddenly discovered the files.
Ill-Still-6859@reddit (OP)
fixed. Included in the next release.
noaibot@reddit
Downloaded the Q4_0_4_4 GGUF model; it's still greyed out on Android 10.
Ill-Still-6859@reddit (OP)
It's not been released yet. Give me a day or two.
cesaqui89@reddit
I don't know if the app has been changed as of today, but I had the problem on my Android, and moving the model from Downloads to My Documents solved it. My phone is an Honor 90, running the 1B model at Q4. Thanks for the app. Could you add a copy-from-chat option?
Ill-Still-6859@reddit (OP)
It was published on the Play Store about 10 minutes ago :)
You mean copying the text? A long press on the text (at the moment at paragraph level) should do it. Also, hitting that little copy icon should copy the whole response to the clipboard.
cesaqui89@reddit
Nice. Will try it
sessim@reddit
Let me in! Let me innnn!!
Ill-Still-6859@reddit (OP)
Version 1.4.3 resolves this. It was published a few minutes ago, so depending on the region it might take some time to become available.
noneabove1182@reddit
Oh hell yes... Thank you!
IngeniousIdiocy@reddit
Remember to reload the model to get the Metal improvements.
Lucaspittol@reddit
Man, that's absolutely insane, terrific performance!
5.38 tokens per second on a lowly Samsung A52s using Gemmasutra Mini 2B v1 GGUF Q6.
I wish I could run stable diffusion that well on a phone.
mguinhos@reddit
Great application.
geringonco@reddit
Group models by hardware, like ARM-optimized models. Add a delete-chat option. And thanks!!!
IngeniousIdiocy@reddit
I downloaded this 8-bit quant and it worked great. Only about 12-13 tokens per second on my A18 Pro vs 21-23 with the 4-bit, with the Metal API enabled on both. I think you should add the 8-bit quant; I struggle with the coherence of 4-bit models.
Great app! I’d totally pay a few bucks for it. Don’t do the subscription thing. Maybe a pro version with some more model run stats for a couple dollars to the people that want to contribute.
https://huggingface.co/hugging-quants/Llama-3.2-3B-Instruct-Q8_0-GGUF/tree/main
lhau88@reddit
Why does it show this when I have 200G left on my phone?
Ill-Still-6859@reddit (OP)
What device are you using?
lhau88@reddit
iPhone 15 Pro Max
bwjxjelsbd@reddit
Wow, this is insane! Got around 13 tokens/s on my iPhone 13 Pro Max. Wonder how much faster it is on a newer one like the 16 Pro Max.
brubits@reddit
I’m getting 21 tokens/s on iPhone 16
bwjxjelsbd@reddit
Did you have a chance to try the new Writing Tools in Apple Intelligence? I tried it on my M1 MacBook and it feels faster than this.
brubits@reddit
Testing Llama 3.2 3B on my M1 Max with LM Studio, I’m getting ~83 tokens/s. Could likely increase with tweaks. I use Apple Intelligence tools on my phone but avoid beta software on my main laptop.
bwjxjelsbd@reddit
Where can I download that LM Studio?
Belarrius@reddit
Hi, I use PocketPal with a Mistral Nemo 12B in Q4K. Thanks to the 12GB of RAM on my smartphone xD
CarefulGarage3902@reddit
Jeez, I’m super surprised you were able to run a 12B model. What smartphone? I have a 15 Pro Max.
Belarrius@reddit
Well, it's like 0.5 tokens/s with a Xiaomi Mi 11 Ultra
Ill-Still-6859@reddit (OP)
Amazing!
Aceflamez00@reddit
17-18 tok/s on A18 Pro on iPhone 16 Pro Max
bwjxjelsbd@reddit
No wayyy, I thought it should be much faster than that! I got 12 tokens/s on my 13PM
brubits@reddit
I bet you can juice it by tweaking the settings
App Settings:
- Metal Layers on GPU: 70
- Context Size: 768

Model Settings:
- n_predict: 200
- temperature: 0.15
- top_k: 30
- top_p: 0.85
- tfs_z: 0.80
- typical_p: 0.80
- penalty_repeat: 1.00
- penalty_freq: 0.21
- penalty_present: 0.00
- penalize_nl: OFF
bwjxjelsbd@reddit
It went from 12 t/s to 13 t/s lol. Thanks dude.
brubits@reddit
hehehe, let's tweak the settings to match your 13PM:
Overall Goal:
Reduce memory and processing load while maintaining focused, shorter responses for better performance on the iPhone 13 Pro Max.
App Settings:
Model Settings:
IngeniousIdiocy@reddit
In the settings, enable the Metal API and max out the GPU layers; I went up to 22-23 tps from 17-18 on my A18 Pro (not the Pro Max).
brubits@reddit
Thanks! Was looking for a way to test Llama 3.2 on my iPhone 16. Will report back!
brubits@reddit
I'm getting about 11-15 tokens per second.
brubits@reddit
Update: tweaked the settings and now get 21 tokens per second! 🤘
bwjxjelsbd@reddit
What tweaks did you make to get it faster?
brubits@reddit
App Settings:
- Metal Layers on GPU: 70
- Context Size: 768

Model Settings:
- n_predict: 200
- temperature: 0.15
- top_k: 30
- top_p: 0.85
- tfs_z: 0.80
- typical_p: 0.80
- penalty_repeat: 1.00
- penalty_freq: 0.21
- penalty_present: 0.00
- penalize_nl: OFF
bwjxjelsbd@reddit
Tried this and it's a tad faster. Do you know if this lowers the quality of the output?
brubits@reddit
Overall Goal:
Optimized for speed, precision, and controlled randomness while reducing memory usage and ensuring focused outputs.
These changes can be described as precision-focused optimizations aimed at balancing performance, determinism, and speed on a local iPhone 16.
App Settings:
Model Settings:
brubits@reddit
It mostly shortens the context length to gain speed.
mlrus@reddit
Terrific! The image below is a view of the CPU usage on iPhone 14 iOS 18.0
CarefulGarage3902@reddit
Are you able to see if this local LLM app is making use of the GPU? I have a 15 Pro Max and apparently the GPU is pretty good too.
sourceholder@reddit
Which app did you use to plot resource graphs?
mlrus@reddit
https://apps.apple.com/us/app/system-status-pro-hw-monitor/id401457165
mchlprni@reddit
Thank you! 🙏🏻
Uncle___Marty@reddit
11 tokens/sec aint bad! Thanks for the fast support buddy!
IngeniousIdiocy@reddit
17.25 tokens per second on my iPhone 16 pro.
bwjxjelsbd@reddit
Great work, OP. Please make this work on macOS too so I can stop paying for ChatGPT.
Additional_Escape_37@reddit
Hey, thanks so much! That's a really fast app update.
Can I ask why 4-bit quants and not 6-bit? It's not much bigger than Gemma 2B at 6 bits.
Ill-Still-6859@reddit (OP)
The hugging-quants repo was the first place I found GGUFs, and they only quantized Q4 and Q8.
The rationale, I'd guess, is that irregular bit-widths (Q5, Q6, etc.) tend to be slower than regular ones (Q4, Q8): https://arxiv.org/abs/2409.15790v1
But I will add quants from other repos over the weekend.
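A toy illustration of the alignment argument (my sketch, not from the paper): regular bit-widths divide a machine word evenly, so packed weights stay word-aligned, while 5- or 6-bit values straddle boundaries and need extra shift/mask work to unpack.

```python
# Toy check of how many quantized weights fit in a 32-bit word for each bit-width.
# 4- and 8-bit pack with no leftover bits; 5- and 6-bit leave stragglers, which is
# part of why irregular widths tend to be slower to unpack.
for bits in (4, 5, 6, 8):
    per_word, leftover = divmod(32, bits)
    note = "packs cleanly" if leftover == 0 else f"{leftover} leftover bits"
    print(f"{bits}-bit: {per_word} weights per 32-bit word ({note})")
```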
Additional_Escape_37@reddit
Hmm, thanks for the paper link, I will read it carefully. It makes sense, since 6 is not a power of two.
Any plan to put some Q8 models in PocketPal? (I guess I could just download them myself.)
Ill-Still-6859@reddit (OP)
Yeah, you should be able to download and add them. I might add them to the defaults too, though.
Additional_Escape_37@reddit
Nice, I will try soon.
Are you collecting statistics about inference speed and phone models? You must have quite a large panel. That could be interesting benchmark data.
Ill-Still-6859@reddit (OP)
The app doesn't collect any data.
bwjxjelsbd@reddit
Thank goodness
brubits@reddit
I love a fresh arxiv research paper!
_-Jormungandr-_@reddit
Just tested out the app on iOS. I like it, but it won't replace the app I'm using right now, CNVRS, even though CNVRS lacks the settings I like about your app, like temp/top_k/max tokens and such. I like to roleplay with local models a lot, and what I'm really looking for is an app that can regenerate answers I don't like and/or load characters easily, instead of adjusting the prompt per model. That's a feature ChatterUI has on Android. So I will keep your app installed and hope it gets better over time.
riade3788@reddit
Is it censored by default? When I tried it online it refused to even identify people in an image or describe them.
JacketHistorical2321@reddit
Awesome app! Do you plan to release an iPadOS version? It "works" on iPad, but I can't access any of the settings besides context and models.
rorowhat@reddit
Do you need to pay server costs to have an app? Or do you just upload to the play store and that's it?
MoffKalast@reddit
$15 for a perpetual license from Google, $90 yearly for Apple. Last I checked anyway.
Anthonyg5005@reddit
Apple really seems to hate developers. On top of the $100 you also need a Mac
rorowhat@reddit
Screw Apple. For android it's only $15 per year and that's it, is that per app?
MoffKalast@reddit
No it's once per account.
rorowhat@reddit
👍
Informal-Football836@reddit
Make a PocketPal version that works with the SwarmUI API. 😂
NeuralQuantum@reddit
Great app for iPhone. Any plans on supporting iPads? Thanks.
Ill-Still-6859@reddit (OP)
https://github.com/a-ghorbani/PocketPal-feedback/issues/28
let's see
MurkyCaterpillar9@reddit
It works on my iPad mini.
AngryGungan@reddit
S24 Ultra, 15-16 t/s. Now introduce vision capability, an easy way to use this from other apps, and a way to reload a response, and it'll be great. Is there any telemetry going on in the app?
jarec707@reddit
Runs fine on my iPad M2. Please consider including the 1b model, which is surprisingly capable.
upquarkspin@reddit
Yes please
Ill-Still-6859@reddit (OP)
Underway with the next release!
livetodaytho@reddit
Mate, I downloaded the 1B GGUF from HF but couldn't load the model on Android. It's not accepting it as a compatible file format.
NOThanyK@reddit
This happened to me too. Try using another file explorer instead of the default one.
livetodaytho@reddit
Tried a lot, but it didn't work; got it working on ChatterUI instead.
Ill-Still-6859@reddit (OP)
Fix is underway.
Th3OnlyWayUp@reddit
How's the performance? Is it fast? Tokens per second, if you have an idea?
Qual_@reddit
9 tok/sec is kind of impressive for a phone and a 3B model.
Ill-Still-6859@reddit (OP)
The credit for being fast goes to llama.cpp.
NearbyApplication338@reddit
Please also add 1B
Ill-Still-6859@reddit (OP)
It's done, but it might take a few days to be published.
findingsubtext@reddit
Is there a way to adjust text size within the app independently? I intend to try this app later, but none of the other options on iOS support that and render microscopic text on my iPhone 15 Pro Max 😭🙏
mintybadgerme@reddit
Works great for me. Not hugely fast, but good enough for chat at 8 t/s. A couple of points: 1. The load/start-chat process is a little clunky. It would be great if you could just press Load and the chat box would be there waiting; at the moment you have to finagle around to start chatting on my Samsung. 2. Will there be any voice or video coming to phones on tiny LLMs anytime soon? Thanks for your work btw. :)
upquarkspin@reddit
Could you please also add a lighter model, like Phi-3-mini-4k-instruct (https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)? It works great on iPhone. Also, it would be great to set the game mode flag on load, because it allocates more punch to the GPU.
Thank you!!! 🤘🏻
JawsOfALion@reddit
Interesting. I only have 2 GB of RAM total on my device; will any of these models work on my phone?
(Maybe include a minimum spec for each model in the UI as well, and gray out the ones that fall outside the spec.)
Balance-@reddit
Probably the model for you to try: https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct
JawsOfALion@reddit
Thanks. Is there a rough formula that translates the number of parameters into the amount of RAM needed for something reasonably usable?
ChessGibson@reddit
IIRC it's quite similar to the model file size, plus some extra memory depending on the context size, but I'm not really sure, so I'd be happy for someone else to confirm this.
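For what it's worth, a rough sketch of that rule of thumb (my own approximation, not PocketPal's actual accounting): weights take about params x bits-per-weight / 8 bytes, and the KV cache grows with context length. The layer count and hidden size below are approximate Llama 3.2 3B-shaped numbers, and the formula ignores grouped-query attention, which makes the real KV cache smaller.

```python
# Ballpark RAM estimate for running a quantized model: weights + KV cache + overhead.
def estimate_ram_gb(params_billion, bits_per_weight, context, n_layers, d_model):
    weights = params_billion * 1e9 * bits_per_weight / 8   # quantized weights
    kv_cache = 2 * n_layers * context * d_model * 2        # K and V tensors, 2 bytes each
    overhead = 0.3e9                                        # runtime buffers, very rough
    return (weights + kv_cache + overhead) / 1e9

# ~3.2B params at Q4_K_M (~4.8 bits/weight), 2048 context, 28 layers, hidden size 3072
print(f"~{estimate_ram_gb(3.2, 4.8, 2048, 28, 3072):.1f} GB")
```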
ErikThiart@reddit
Curious what you guys use this for?
tessellation@reddit
For everyone that has thumbnails disabled in their Reddit reader: there's a hidden image, lol. I just found out on my desktop.
Steuern_Runter@reddit
Nice app! I am looking forward to seeing the improvements you already listed on GitHub.
Belarrius@reddit
Hi! PocketPal is so amazing!
The interface is brilliantly simple and a joy to use! However, it lacks 2 important features:
1 - Ability to write a general context (A context for personalized AI)
2 - Ability to delete a conversation
The idea of having only small quantized models is excellent for smartphone speed: my Xiaomi Mi 11 Ultra generates 9 to 10 tokens/s with Llama 3.2 3B Instruct.
EastSignificance9744@reddit
that's a very unflattering profile picture by the gguf dude lol
LinkSea8324@reddit
His mom said he's the cutest on the repo
LambentSirius@reddit
What kind of inference does this app use on Android devices? CPU, GPU, or NPU? Just curious.
Ill-Still-6859@reddit (OP)
It relies on llama.cpp. It currently uses the CPU on Android.
LambentSirius@reddit
I see, thanks.