Built an Android app that exposes Gemma 3n as an OpenAI-compatible endpoint on your LAN

Posted by angolo40@reddit | LocalLLaMA

My old Samsung S10 was sitting in a drawer so I turned it into an always-on LLM endpoint.

PocketPal is great for on-phone chat, but I wanted the phone itself to be an OpenAI-compatible endpoint for the rest of my network. It gets 13.76 tok/s on Gemma 3n E2B (GPU), enough for real chat.

VicinoLLM is an Android app that runs Gemma 3n locally via Google's LiteRT-LM SDK and exposes an OpenAI-compatible server on :8080. Point any OpenAI client (Python SDK, OpenWebUI, Home Assistant) at http://phone-ip:8080/v1 and it works.
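For reference, talking to it from the Python SDK is just a base_url swap. A minimal sketch, not code from the repo: the phone IP and the model ID below are placeholders for whatever your setup actually uses.

```python
# Minimal sketch with the official OpenAI Python SDK.
# The IP and model ID are placeholders, not values from the repo.
from openai import OpenAI

client = OpenAI(
    base_url="http://192.168.1.50:8080/v1",  # placeholder: your phone's LAN IP
    api_key="unused",  # only checked if you enable the optional API key
)

resp = client.chat.completions.create(
    model="gemma-3n-e2b",  # placeholder: use whatever model ID the app reports
    messages=[{"role": "user", "content": "Say hi in one sentence."}],
)
print(resp.choices[0].message.content)
```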

Bundles a ChatGPT-style web UI on the same port. Apache 2.0, LAN-only, zero Firebase/analytics/Play Services.

Features:

- /v1/chat/completions with SSE streaming, multimodal content parts (text + images + audio + PDF); see the sketch below the list

- Multi-model routing (load several models; the model field in the request picks which one)

- Auto-restore after Samsung mem-killer nukes the service

- Optional API key; the web UI bypasses it so local access keeps working
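
To make the streaming + multimodal bit concrete, here's a hedged sketch of one request. It assumes the server accepts OpenAI-style content parts with a base64 data URL for the image; the post only says "multimodal content parts", so the exact accepted formats are an assumption, and the IP, API key, and model ID are placeholders again.

```python
# Sketch of a streaming chat request with an attached image,
# assuming OpenAI-style content parts and data-URL images are accepted.
import base64
from openai import OpenAI

client = OpenAI(
    base_url="http://192.168.1.50:8080/v1",  # placeholder phone IP
    api_key="my-lan-key",  # only needed if the optional API key is enabled
)

with open("photo.jpg", "rb") as f:  # any local test image
    image_b64 = base64.b64encode(f.read()).decode()

stream = client.chat.completions.create(
    model="gemma-3n-e2b",  # placeholder: use the model ID the app reports
    stream=True,           # delivered as SSE by the server
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this photo?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```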

Performance (warm decode):

- S10 (Mali-G76) + E2B GPU: 13.76 tok/s

- S24 Ultra (Adreno 750) + E2B GPU: 32.78 tok/s

Caveats:

- Gemma only. LiteRT-LM's pipeline is hardcoded. Use llama.cpp JNI / MLC-LLM for other families.

- E4B (3.65 GB) OOMs on <12 GB RAM devices.

- arm64-v8a only, no tool-calling yet.

- Don't expose :8080 publicly. Use Tailscale/WireGuard for remote access.

Repo: https://github.com/angolo40/vicino-llm

Perf numbers from other devices are very welcome; I only have the two Samsungs.