Basic PSA: PocketPal got updated and now runs Gemma 4.
Posted by Sambojin1@reddit | LocalLLaMA | 14 comments
Just because I've seen a couple of "I want this on Android" questions, PocketPal got updated a few hours ago, and runs Gemma 4 2B and 4B fine. At least on my hardware (crappy little moto g84 workhorse phone).
I'm going to try to squeeze the 26B-A4B IQ2 quant into 12 GB of RAM on a fresh boot, but I'm almost certain it can't be done due to Android bloat.
But yeah, 2B and 4B work fine and quickly under PocketPal. Hopefully their next one is 7-8B (not 9B), because the new Qwen 3.5 models just skip over memory caps where the old ones didn't. The big models' numbers are great, but with OS overhead and context size you need something a bit smaller to be functional on a 12 GB RAM phone.
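For anyone wondering whether the 26B squeeze is even plausible, here's a rough back-of-envelope in Kotlin. The ~2.7 bits/weight figure for IQ2_M and the OS/KV-cache overheads are my assumptions, not measurements:

```kotlin
// Back-of-envelope: does a 26B-A4B IQ2_M GGUF fit in a 12 GB phone?
// Assumed (not measured): ~2.7 bits/weight for IQ2_M, ~3.5 GiB eaten by
// Android + background apps, and a small KV cache at 2048 context.
fun main() {
    val params = 26e9              // total parameters (all MoE experts resident)
    val bitsPerWeight = 2.7        // assumed IQ2_M average
    val weightsGiB = params * bitsPerWeight / 8 / (1 shl 30)

    val kvCacheGiB = 0.5           // assumed: 2048 ctx, small per-token KV
    val osOverheadGiB = 3.5        // assumed: Android bloat on a fresh boot

    println("weights ≈ %.1f GiB".format(weightsGiB))
    println("total   ≈ %.1f GiB on a 12 GB phone".format(weightsGiB + kvCacheGiB + osOverheadGiB))
    // ≈ 8.2 GiB of weights, ≈ 12.2 GiB total: right on the edge, which is
    // why it's a "fresh boot, maybe" situation rather than a sure thing.
}
```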
Fluffywings@reddit
Just ran it on a Pixel 8. CPU-only compatible. I may fork a more GPU-aware version.
bucolucas@reddit
Have you gotten anything to run on the GPU/NPU? I've tried a few different ways, and it seems like I'd have to root the damn thing.
Fluffywings@reddit
A quick vibe-code check showed Google Edge Gallery leverages Google Play Services to detect the GPU and enable GPU acceleration. The downside is it requires LiteRT file types.
The fork I was looking at had partial GPU detection and enabling, but the solution is unlikely to work on all devices seamlessly. Maybe a "try GPU" toggle that fails gracefully back to CPU could work (sketched below).
Another option I briefly thought of would be real-time conversion of GGUF to LiteRT, but that seems too intensive a solution.
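Rough sketch of that toggle idea in Kotlin. Everything here (Backend, tryInitGpu, the model path) is a made-up placeholder, not a real PocketPal or llama.cpp API; the point is just the control flow:

```kotlin
// Hypothetical "try GPU, fall back gracefully to CPU" flow.
enum class Backend { GPU, CPU }

// Stand-in for real GPU init: on many devices this fails with a driver or
// shader-compile error, which we treat as a clean "no" instead of a crash.
fun tryInitGpu(modelPath: String): Boolean =
    try {
        // e.g. create the GPU delegate / Vulkan context and load weights here
        false // pretend this device's GPU path is unsupported
    } catch (e: Exception) {
        false
    }

fun loadModel(modelPath: String, gpuToggle: Boolean): Backend {
    if (gpuToggle && tryInitGpu(modelPath)) return Backend.GPU
    return Backend.CPU // safe default that works everywhere
}

fun main() {
    println("Loaded with backend: ${loadModel("/sdcard/gemma-4b.gguf", gpuToggle = true)}")
}
```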
EndlessZone123@reddit
I've not found a single Android LLM app that is reliable and can do Web search locally.
Mkengine@reddit
Maybe Edge Gallery with the new Agent Skill could be an option with one of the Gemma models?
----Val----@reddit
I've tried doing this; it just sucks to run a heavy LLM on-device and manage web searching. If it redirects you to a browser, there's a good chance your device's memory management will kill the app.
The experience isn't good unless you use something that hooks a bit deeper, like Termux.
CodeMichaelD@reddit
Termux is there for you; it can host llama.cpp locally just fine, though sometimes it's necessary to keep the floating window active to stop Android killing the phantom process for its very annoying battery-saving stuff (if you're not talking about ready-made apps).
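To make the llama.cpp-in-Termux route concrete: once `llama-server` is running (default port 8080), anything on the phone can talk to it over localhost. A minimal Kotlin client against the server's /completion endpoint; the model path and prompt are just examples:

```kotlin
import java.net.HttpURLConnection
import java.net.URL

// Minimal client for llama.cpp's llama-server running in Termux, e.g.:
//   llama-server -m model.gguf --port 8080
fun main() {
    val conn = URL("http://localhost:8080/completion").openConnection() as HttpURLConnection
    conn.requestMethod = "POST"
    conn.setRequestProperty("Content-Type", "application/json")
    conn.doOutput = true

    // n_predict caps the number of generated tokens
    val body = """{"prompt": "Hello from Termux:", "n_predict": 64}"""
    conn.outputStream.use { it.write(body.toByteArray()) }

    // The response is JSON; the generated text is in its "content" field.
    println(conn.inputStream.bufferedReader().readText())
}
```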
spaceman_@reddit
Anyone else experiencing crashes when trying to run PrismML Bonsai models?
npquanh30402@reddit
PocketShit. It can't detect the GPU in my phone, so I have to build llama.cpp myself.
Sambojin1@reddit (OP)
Did you test it a day or two ago? Because now the GitHub version works with GGUFs straight out of the box. It got updated to the new llama.cpp like a few hours ago.
Sambojin1@reddit (OP)
Like, meh, it works. SD695, using two processor threads, slow dual-channel RAM, and 2048 context, with Gemma 4's small 26B-A4B MoE. You'd assume these are rookie figures that should be 3-8x bigger on newer, faster 12 GB RAM phones. And you can load bigger ones on 16 GB RAM phones.
This was just an early "does it even work?" test, at the lowest settings. And yes, it does!
ikkiyikki@reddit
It just crashes for me when I run Bonsai
Sambojin1@reddit (OP)
Omfg, between PocketPal and Android, I got 1.31 tokens/sec on "Gemma-4-26B-A4B-it-UD-IQ2_M.gguf". At only 2048 token context, but fuck me! It loaded, and ran in old slow RAM!
Huzzah! I've got brainy LLMs now!
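Those numbers pass a quick sanity check, too: decode speed is roughly memory-bandwidth-bound, so tok/s ≈ effective bandwidth / bytes read per token. The bandwidth and bits/weight figures below are assumptions, not measurements:

```kotlin
// Roofline-style sanity check on the observed 1.31 tok/s.
fun main() {
    val activeParams = 4e9       // A4B: ~4B parameters active per token
    val bitsPerWeight = 2.7      // assumed IQ2_M average
    val bytesPerToken = activeParams * bitsPerWeight / 8  // ≈ 1.35 GB

    val effBandwidthBps = 2e9    // assumed: ~2 GB/s with 2 CPU threads on slow RAM
    println("predicted ≈ %.2f tok/s".format(effBandwidthBps / bytesPerToken))
    // ≈ 1.5 tok/s, same ballpark as the observed 1.31; newer phones with
    // faster RAM and more threads should land several times higher.
}
```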
Sambojin1@reddit (OP)
And remember, Gemini 3 doesn't mind giving you her prompt formats after class, coz she's smart and knows herself, so you can make the JavaScript/SillyTavern character work kind of well. Not really a deep dive or a jailbreak, just for us noobies going from Gemini 3.1 to Gemma 4: https://www.dropbox.com/scl/fi/6ava62934e3g5trj52x0k/prompts.txt?rlkey=erfklv6c8dbv97w1dxmec9wc1&st=ns5jqjtv&dl=0