Could Gemma 4 breathe new life into cheap broken/blocked phones?
Posted by Uriziel01@reddit | LocalLLaMA | View on Reddit | 11 comments
Hi everyone,
I've been thinking about different ways to use the new Gemma 4 4B model. I was able to get it running decently on my old Samsung S23, and I noticed that you can pick these up for around 390 PLN (~$106) if they are broken or provider-locked where I live (the network lock prevents cellular connection, but it doesn't affect the actual hardware performance). I bet if I looked harder, I could find something even cheaper.
I was originally planning to upgrade my home server since it doesn't have a GPU and CPU inference is slow as a snail. But now? Now I'm thinking I might just need a "new phone" instead.
Am I missing something here? Has anyone already built a solution like this, or is there an obvious bridge/method I should use to turn a phone into a dedicated inference node for a home setup?
mr_Owner@reddit
How are you planning to expose the API endpoint when running a local LLM on a smartphone?
Uriziel01@reddit (OP)
It took a moment, but I already have a working setup: Termux with a compiled llama.cpp works like a charm. I pointed OpenWebUI at the phone's IP and now have a working Gemma 4 E4B assistant running on my old 7W-TDP phone.
But the speed is really bad, way worse than in Google's demo app. I'll investigate the culprit some more over the weekend; this was just a POC setup, and it works.
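For reference, a minimal sketch of the Termux side of such a setup; the model path, port, and context size below are placeholders, not the exact values from the post:

```shell
# Inside Termux: expose llama.cpp's OpenAI-compatible server on the LAN
# so OpenWebUI on another machine can reach it.
# Model path, port, and context size are placeholders.
./llama-server -m ~/models/gemma-4-e4b-q4_k_m.gguf \
  --host 0.0.0.0 --port 8080 -c 4096
```

Then add `http://<phone-ip>:8080/v1` as an OpenAI-API-compatible connection in OpenWebUI.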
mr_Owner@reddit
I tried the SmolChat Android app, and it works better than Termux, but it offers no HTTP API.
So I vibe-coded a fork with HTTP API support as a first try... It works well enough but crashes, and I haven't had time to pick it back up.
If you want, you could fork that app's repo like I tried and vibe your way through haha
Uriziel01@reddit (OP)
It looks like somebody is even already working on it: https://github.com/google-ai-edge/gallery/issues/552
mr_Owner@reddit
Niice!
iits-Shaz@reddit
The speed difference you're seeing vs Google's demo app is almost certainly GPU delegation. Google's apps use their own LiteRT/MediaPipe pipeline which delegates to the phone's GPU (Adreno on Samsung). Raw llama.cpp in Termux is running on CPU only by default.
A few things that should help:
- Check if your Termux build has GPU support. llama.cpp supports Vulkan on Android, which would use the Adreno GPU. You need to compile with `-DGGML_VULKAN=ON` and have the Vulkan libraries available in Termux. This alone could 3-5x your throughput.
- Try a smaller quant. If you're running Q8 or Q6, drop to Q4_K_M. On mobile, memory bandwidth is the bottleneck: a smaller quant means less data to move, which means faster inference. The quality difference is minimal for Gemma 4 E2B/E4B at Q4.
- Use the right model size. Gemma 4 E2B (2.3B effective params, ~1.5GB at Q4_K_M) runs well on phones with 6GB+ RAM. I've measured 30 tok/s generation and 60 tok/s prompt eval on Android with this config. If you're running the 4B, try the 2B first to establish a performance baseline.
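For the Vulkan rebuild specifically, something along these lines should work in Termux; the package names are my assumption and may differ across Termux versions, so check `pkg search vulkan` first:

```shell
# Assumed Termux package names for the Vulkan toolchain
pkg install cmake clang vulkan-headers vulkan-loader-android
# Rebuild llama.cpp with the Vulkan backend (flag from the post above)
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
```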
For the "inference node for home setup" angle — the phone approach is actually underrated. An S23 draws ~3-5W under inference load vs 200W+ for a desktop GPU. For always-on personal assistant tasks where you need decent responses but not maximum throughput, that power efficiency is hard to beat.
The real limitation is context window. Phones don't have the RAM for long contexts, so you'd want to keep conversations short or implement aggressive summarization.
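To put rough numbers on that context limitation, here's a back-of-envelope fp16 KV-cache estimate; the layer/head/dim counts are illustrative placeholders, not actual Gemma 4 specs:

```shell
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * tokens * 2 (fp16)
layers=30; kv_heads=8; head_dim=128; ctx=8192
kv_bytes=$((2 * layers * kv_heads * head_dim * ctx * 2))
echo "$((kv_bytes / 1024 / 1024)) MiB"   # 960 MiB just for the cache at 8k context
```

So even a mid-size context can eat a sizable chunk of a phone's free RAM; grouped-query attention and quantized KV caches reduce this, but the linear growth with context length remains.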
Uriziel01@reddit (OP)
Wow, many thanks for such a detailed answer. I did compile with the Vulkan flag, and the logs do show it's using Adreno kernels, but via OpenCL rather than Vulkan, so this is probably a large part of the issue:
```
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
ggml_opencl: loading OpenCL kernels........................................................................................................
ggml_opencl: default device: 'QUALCOMM Adreno(TM) (OpenCL 3.0 Adreno(TM) 740)'
```
But the speed difference is surprisingly huge: Google's app E2B benchmark shows me 138 prefill tokens/s and 13 decode tokens/s even when I select CPU, while in my llama.cpp setup I'm getting around 0.8-0.9 tokens/s.
I'm testing with the Q3_K_S quant, because Google's UI shows the model at around 2.5GB, which matches that quant's GGUF file size.
---
When using the GPU in Google's app, I'm getting 1116 tokens/s prefill and 31 tokens/s decode.
iits-Shaz@reddit
OK, so the good news is that your GPU is being used (OpenCL with Adreno-optimized kernels). The bad news is that 0.8 tok/s is far too slow even for an unoptimized path; that points to something more fundamental than just "OpenCL vs Vulkan."
Check your thread count first. Run with `-t 4` or `-t 6` explicitly. In Termux, llama.cpp might default to 1 thread, which would explain the CPU-bound bottleneck. On an S23 (Snapdragon 8 Gen 2) with 1 performance core + 4 mid cores, `-t 4` targeting the mid cores is usually the sweet spot.
The Google app gap is expected, though: it's a completely different execution stack. Google's demo app uses LiteRT (formerly TFLite) with a purpose-built GPU delegate for Adreno. The model isn't even in GGUF format; it's converted to a mobile-optimized representation with INT4 quantization specifically tuned for mobile GPU memory layouts. That's why their GPU path hits 1116 tok/s prefill: it's doing INT4 matmuls directly on the GPU shader cores with zero CPU-GPU copy overhead.
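A quick way to test the thread-count suggestion; the model filename is a placeholder, and `llama-bench` ships alongside llama.cpp:

```shell
# Compare throughput across explicit thread counts on the phone
./llama-bench -m gemma-4-e2b-q4_k_m.gguf -t 2,4,6
# Then run the best setting explicitly instead of relying on the default
./llama-cli -m gemma-4-e2b-q4_k_m.gguf -t 4 -p "Hello" -n 64
```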
llama.cpp's OpenCL backend is more general-purpose. It wasn't built specifically for Adreno, so there's likely a lot of memory transfer overhead between CPU and GPU that Google's stack avoids entirely.
Realistically, for phone inference, I'd expect llama.cpp via OpenCL/Adreno to land somewhere in the 5-15 tok/s decode range once you fix the thread-count issue: not matching Google's 31 tok/s GPU path, but much better than 0.8. If you're still stuck at <2 tok/s after fixing threads, check if your build has ARM NEON enabled (`-DGGML_CPU_AARCH64=ON`); without NEON, you're leaving all the SIMD performance on the table.
DeltaSqueezer@reddit
You can get a P102-100 off eBay for about $50.
Uriziel01@reddit (OP)
Maybe it's not even a sensible question, but does using an over-8-year-old GPU have any other gotchas? Like driver support, idle consumption with just the model loaded, etc.
DeltaSqueezer@reddit
The last driver for Pascal-class GPUs was issued recently. This one has very low idle power with the right idle software, and it works fine with llama.cpp.