Built an Android app that exposes Gemma 4 as an OpenAI-compatible endpoint on your LAN
Posted by angolo40@reddit | LocalLLaMA | 2 comments
My old Samsung S10 was sitting in a drawer, so I turned it into an always-on LLM endpoint.
PocketPal is great for on-phone chat, but I wanted the phone itself to be an OpenAI-compatible endpoint for the rest of my network. It gets 13.76 tok/s on Gemma 4 E2B (GPU), which is enough for real chat.
VicinoLLM is an Android app that runs Gemma 4 locally via Google's LiteRT-LM SDK and exposes an OpenAI-compatible server on :8080. Point any OpenAI client (Python SDK, OpenWebUI, Home Assistant) at http://phone-ip:8080/v1 and it works.
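For example, the stock OpenAI Python SDK just needs a different base_url. A minimal sketch; the IP address, model name, and API-key value below are illustrative placeholders, not values taken from the repo:

```python
from openai import OpenAI

# Replace with your phone's LAN IP; the model name is whatever the server reports.
client = OpenAI(
    base_url="http://192.168.1.50:8080/v1",
    api_key="unused",  # only matters if you enable the optional API key
)

resp = client.chat.completions.create(
    model="gemma-e2b",
    messages=[{"role": "user", "content": "Give me a one-line status check."}],
)
print(resp.choices[0].message.content)
```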
It also bundles a ChatGPT-style web UI on the same port. Apache 2.0, LAN-only, zero Firebase/analytics/Play Services.
Features:
- /v1/chat/completions with SSE streaming and multimodal content parts (text + images + audio + PDF); see the sketch after this list
- Multi-model routing (load several models; the model field in the request picks one)
- Auto-restore after Samsung mem-killer nukes the service
- Optional API key; the web UI bypasses it so local access keeps working
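A hedged sketch of the streaming + multimodal path, assuming the server accepts the usual OpenAI image_url data-URL content part (the model name and file path are placeholders):

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://192.168.1.50:8080/v1", api_key="unused")

# Encode a local image as a data URL, the usual OpenAI content-part convention.
with open("fridge.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

stream = client.chat.completions.create(
    model="gemma-e2b",  # with multi-model routing, this field picks the loaded model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this photo?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    stream=True,  # tokens arrive over SSE as they decode
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```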
Performance (warm decode):
- S10 (Mali-G76) + E2B GPU: 13.76 tok/s
- S24 Ultra (Adreno 750) + E2B GPU: 32.78 tok/s
Caveats:
- Gemma only; LiteRT-LM's pipeline is hardcoded to it. Use llama.cpp via JNI or MLC-LLM for other model families.
- E4B (3.65 GB) OOMs on <12 GB RAM devices.
- arm64-v8a only, no tool-calling yet.
- Don't expose :8080 publicly. Use Tailscale or WireGuard for remote access.
Repo: https://github.com/angolo40/vicino-llm
Perf numbers from other devices are very welcome; I only have the two Samsungs.
Queasy-Contract9753@reddit
That's awesome, I'll try them out. Very helpful that you've given speed numbers and info even for a Mali GPU!
Do you think Gemma E4B, even at 3.65 GB, OOMs with less than 12 GB because of overhead from Android? Do you think we could eventually see things like Dflash and turbo quant on Android? I'm very out of the loop on mobile.
angolo40@reddit (OP)
Thanks!
E4B: 3.65 GB is just the weights. Runtime adds vision/audio encoders + KV cache (grows with context) + process heap, so roughly 5–6 GB active. Technically it fits on 8 GB, but Samsung's low-memory killer evicts the service when you switch apps. 12 GB gives enough headroom. I added a warning dialog for <12 GB devices.
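As a rough back-of-envelope (the layer/head/context numbers below are illustrative placeholders, not the actual E4B config):

```python
# Why ~3.65 GB of weights ends up as ~5-6 GB resident: weights + encoders + KV cache + heap.
def kv_cache_gb(n_layers=30, n_kv_heads=8, head_dim=256, ctx=4096, bytes_per_elem=2):
    # 2x for K and V, fp16 elements; grows linearly with context length
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1024**3

weights_gb  = 3.65            # quantized weights
encoders_gb = 0.5             # vision + audio encoders (guess)
kv_gb       = kv_cache_gb()   # ~0.9 GB at these placeholder settings
heap_gb     = 0.5             # process heap / runtime overhead (guess)

print(f"total ~{weights_gb + encoders_gb + kv_gb + heap_gb:.1f} GB resident")
```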
DFloat/TurboQuant: honestly I'm out of the loop too. My guess is they help less on mobile, since the bottleneck is memory bandwidth, not compute. NPU delegates (QNN, LiteRT NPU) and SSM hybrids like Qwen3.5 look more promising to me.
One-person project, so take it with a grain of salt.