What’s up with mobile LLMs?
Posted by Amos-Tversky@reddit | LocalLLaMA | View on Reddit | 21 comments
I see a lot of support for running LLMs on PCs, from ollama to vLLM. What's the current state for running on mobile? Say on an iPhone (A19/Pro) or a Snapdragon. Also, do you guys think it's worth it?
Fedor_Doc@reddit
Termux + llama.cpp is my go to.
Cons: it is not integrated into the system, it drains the battery, and decode is slow. Pros: the full llama.cpp experience.
It is mostly a toy on my phone, but I like being able to check out new models during my commute.
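If you'd rather script it than use the CLI, llama-cpp-python pip-installs inside Termux too (it builds the CPU backend on-device, which takes a while). A minimal sketch; the GGUF path is just a placeholder for whatever model you've downloaded:

```python
# pip install llama-cpp-python   (compiles the CPU backend on the phone)
from llama_cpp import Llama

# placeholder path; point it at any small GGUF pulled from Hugging Face
llm = Llama(
    model_path="/sdcard/models/qwen2.5-1.5b-instruct-q4_k_m.gguf",
    n_ctx=2048,    # keep the context modest to save RAM
    n_threads=4,   # roughly match your phone's performance cores
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give me three offline survival tips."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```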
mtmttuan@reddit
It's actually a pretty bad way to go, since llama.cpp isn't great at supporting the custom accelerators from Qualcomm or Apple. The best option is still a dedicated app that uses the native APIs to run models on the GPU or NPU.
Fedor_Doc@reddit
I have an Exynos chip in my phone, so it doesn't really matter :)
Edge Gallery produces garbage output if GPU inference is enabled.
ForsookComparison@reddit
This is the way.
And yeah, there's no avoiding that it'll rip through a phone's battery. Pretty clutch for emergencies without signal, I guess. 16GB phones can run some respectable models.
No-Amount-493@reddit
Agreed. I'm using a Pixel 7a running GrapheneOS and I can squeeze decent-ish performance out of Ministral 14B. 7B and 8B models run fine.
MuDotGen@reddit
What's your phone specs and which models and quants have worked best for you?
custodiam99@reddit
You can run 4B models on it, but to be honest, it's just a backup right now. Give it a few years; we need hardware improvements and better information density in the models.
dryadofelysium@reddit
You can install Google AI Edge Gallery and get Gemma 4 running very, very easily. Is it worth it? Depends on what your goal is. Obviously frontier cloud models will destroy these.
Fedor_Doc@reddit
Google AI Edge Gallery is a preview app first and foremost: no persistent chats, no API endpoint. Pretty well optimized, though.
Amos-Tversky@reddit (OP)
How do they do it? Is any of it open source?
No-Amount-493@reddit
It's very early days for mobile AI, but it's THE logical direction. You can get apps like SmolChat and download small LLMs from Hugging Face etc., but the usability right now is not great. Obviously, that will change.
Also, if we have ambitions to steer clear of the AI giants, this is the way to go too.
Decent (but not mind-blowing) LLMs for phones are any of the small Falcons and Mistral/Ministral (your mileage may vary). The small IBM Granites are solid too.
Worth trying so you know what's out there.
Amos-Tversky@reddit (OP)
Interesting, how efficient are the inference frameworks on mobile chips though?
ClickClawAI@reddit
Yes, let me just use 10% of my battery per prompt 😂
For edge devices it can be great for use cases like CV: you can run YOLO models on Raspberry Pis, etc.
But a general chatbot? Terrible idea.
Amos-Tversky@reddit (OP)
Exactly. Aren't there any NPUs on mobile devices yet? Something power-efficient that might get things going? Snapdragon has Hexagon, but I don't think there's a lot of software support for it.
hofor55@reddit
The best viable solution for me has been the Off-Grid app on Android. It allows direct in-app downloads from Hugging Face, and you get simple control over loading and inference settings. I'm running with 8GB RAM, no GPU, CPU-only processing. Quantized models like Q4_K_M are the sweet spot.
For speed, the fastest available is Gemma 4E2B at 10-ish tokens per second, but you can go as low as 0.8B models for faster runs. As of now, Qwen 3.5 2B holds the crown: it supports vision and tool calling, enabling some simple autonomous search and URL reading. Thinking capabilities are OK, and it runs at 6 tokens per second.
Regarding the Google AI Edge Gallery app, its support for .litert models makes it well optimized for mobile use; however, it cannot function fully offline and lacks the kind of persistent chat history you get in LM Studio.
Federal-Effective879@reddit
It depends on what you're using the models for. In general, running LLMs on a phone drains the battery rapidly, and the models you can fit on a phone are useless for complex tasks or tasks requiring much world knowledge. However, some models like Gemma are decent at translation and general conversation, if you can tolerate the slowness and battery consumption.
Realistically, you are better off running larger models on a server and connecting to them from your mobile device. However, running something like llama.cpp on your phone can be fun, albeit toy-like.
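If you go that route, the phone side is trivial: llama-server and ollama both expose an OpenAI-compatible endpoint, so a stdlib-only client is enough. A sketch, with the address and model name as placeholders for your own setup:

```python
import json
import urllib.request

# placeholder address for a home server running llama-server or ollama;
# both expose an OpenAI-compatible /v1/chat/completions endpoint
URL = "http://192.168.1.50:8080/v1/chat/completions"

payload = {
    "model": "whatever-you-loaded",   # many local servers ignore or remap this field
    "messages": [{"role": "user", "content": "Summarize this article: ..."}],
    "max_tokens": 256,
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)

print(reply["choices"][0]["message"]["content"])
```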
buildingstuff_daily@reddit
the honest answer is it's a hardware bottleneck that software tricks can only partially fix right now
phones have like 6-8GB of RAM shared between the GPU and everything else the phone is doing. a decent 7B model quantized to Q4 needs roughly 4-5GB just to load, which means your phone is basically gasping for air trying to run the OS, your apps, AND inference at the same time. compare that to even a budget desktop GPU with 8-12GB of dedicated VRAM and it's not close
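rough math if you want to sanity-check that claim (ballpark only; assumes a llama-style 7B with an fp16 KV cache and no grouped-query attention):

```python
# ballpark RAM for a Q4-quantized 7B model on a phone (all numbers approximate)
params = 7e9
bits_per_weight = 4.5                               # Q4_K_M averages a bit over 4 bits
weights_gb = params * bits_per_weight / 8 / 1e9     # ~3.9 GB just for the weights

# KV cache: keys + values, per layer, per token, fp16, no grouped-query attention
layers, hidden, ctx, bytes_per_val = 32, 4096, 4096, 2
kv_gb = 2 * layers * ctx * hidden * bytes_per_val / 1e9   # ~2.1 GB at full 4K context

print(f"weights ~{weights_gb:.1f} GB + KV cache ~{kv_gb:.1f} GB")
# on a 6-8 GB phone that leaves almost nothing for the OS and your other apps
```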
the other thing nobody talks about is thermal throttling. phones are designed to dissipate heat passively through that little glass and aluminum slab. run sustained inference for more than like 30-60 seconds and the chip starts downclocking itself to avoid cooking your hand. desktops have fans and heatsinks, phones have vibes and prayers
that said, things ARE moving. qualcomm's hexagon NPU in the snapdragon 8 gen 3 can actually do some interesting stuff with smaller models (3B and under run surprisingly well). apple's neural engine is solid too but they're being weirdly conservative about letting third party apps really use it
MLX on apple silicon is probably the most promising path for apple devices. the unified memory architecture means a MacBook with 16GB can comfortably run 13B models, and that same architecture scales down to iPhones and iPads in theory. in practice apple hasn't opened up enough of it yet
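on the mac side it's already basically two lines with mlx-lm (a sketch; the repo name is just one example from the mlx-community org on Hugging Face):

```python
# pip install mlx-lm   (Apple Silicon only; runs out of unified memory directly)
from mlx_lm import load, generate

# example 4-bit quantized model from the mlx-community org
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

text = generate(
    model,
    tokenizer,
    prompt="Explain unified memory in two sentences.",
    max_tokens=120,
)
print(text)
```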
my prediction is we're about 2 chip generations away (so like 2-3 years) from phones being able to comfortably run 7B models at decent speeds. the 3B class models are already getting surprisingly good though. check out phi-3-mini and gemma-2b if you haven't, they punch way above their weight
DigitalguyCH@reddit
The only solution that makes sense is Google AI (Gemma on iOS, and some others on Android too). You need at least 12GB of RAM to run Gemma 4B without crashes, but 16GB is more reliable (tested on an iPhone 17 Pro and an iPad Pro M4 with 16GB). 6GB iPads will install the base 2B but will crash often. On Android you have smaller models. Still better than nothing.
gigaflops_@reddit
My understanding is that of the "Apple Intelligence" features that utilize LLMs, some default to a local model made by Apple, and fall back on a larger cloud model also made by Apple for complex tasks.
Needless to say, Apple was overly confident in their ability to produce a useful AI model that runs on devices with 8GB of RAM where at least 4-6 GB of it needs to be used for non-AI stuff.
_Muftak@reddit
Liquid's LFM models run pretty well with their Apollo app, and they're actually some of the best models for their size, IMO.
Enfiznar@reddit
I use MNN chat and it works very well, but only a limited number of models are compatible