Unused phone as AI server
Posted by Ok_Fig5484@reddit | LocalLLaMA | View on Reddit | 27 comments
If you have an unused phone lying around, you might be sitting on a tiny AI server
I’ve been working on a project where I modified Google AI Edge Gallery and turned it into an OpenAI-compatible API server: [Gallery as Server](https://github.com/xiaoyao9184/gallery)
Your phone can run local AI inference
You can call it just like an OpenAI API (chat/completions, etc.)
Instead of letting that hardware collect dust, you can turn it into a lightweight inference node.
So yeah—if you have more than one old phone, you can literally build yourself a cluster.
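To show what "call it just like an OpenAI API" means in practice, here is a minimal Python sketch of the chat/completions request shape. The phone's LAN IP, port, and model name below are assumptions for illustration, not values from the project:

```python
import json

# Hypothetical address -- substitute your phone's LAN IP and whatever
# port the Gallery server actually listens on (both assumed here).
PHONE_URL = "http://192.168.1.50:8080/v1/chat/completions"

def build_chat_request(prompt, model="gemma", stream=False):
    """Build an OpenAI-style chat/completions payload for the phone server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

payload = build_chat_request("Summarize this paragraph: ...")
body = json.dumps(payload).encode()
# POST `body` to PHONE_URL with Content-Type: application/json
# (e.g. via urllib.request) and read choices[0].message.content
# from the JSON reply, exactly as you would with the OpenAI API.
```

Because the request and response shapes match OpenAI's, any existing OpenAI client should work by just pointing its base URL at the phone.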
Mac_NCheez_TW@reddit
I've been looking for something like this to run small local LLMs on a ROG 8 with 24GB of RAM. I have a bunch of phones I wanted to do this with. Tool usage with them would be nice.
Qwen30bEnjoyer@reddit
I'd be more interested in using old Androids as an Ubuntu VPS, especially if they have 24GB of RAM.
Has anyone done similar?
Mac_NCheez_TW@reddit
That would be cool.
Ok_Fig5484@reddit (OP)
The officially recommended model, Gemma-4-E4B-it, requires 12GB of memory. Because of the Gallery app's design, it can only load one model at a time, and concurrent inference isn't supported either, so 24GB is really more than it can use.
Mac_NCheez_TW@reddit
No, more ram! More! Lol.
moneylab_ai@reddit
This is a really clever use of hardware that would otherwise just sit in a drawer. The OpenAI-compatible API layer is the smart part -- it means you can slot it into existing toolchains without rewriting anything. I am curious about the practical throughput though. Even with something like a Snapdragon 8 Gen 3 and 12GB+ RAM, you are probably limited to smaller models (3-7B). For a phone cluster setup, have you looked into any kind of load balancing or request routing across multiple devices? That could make the aggregate throughput actually useful for lightweight local inference tasks like classification or summarization.
AtypicalComputers@reddit
This is great! I spent some time trying to get Ollama deployed as a Docker container via the built-in terminal on a Pixel. This seems to be a much easier way of accomplishing the same thing. Excited to try it out!
Ok_Fig5484@reddit (OP)
One of the more challenging issues is that the model has to be in the LiteRT-LM format, and there aren't many converted models available on https://huggingface.co/litert-community.
AtypicalComputers@reddit
I'm not seeing the server option when downloading the app from the play store. Is the apk in the GitHub more up to date?
Ok_Fig5484@reddit (OP)
The original repository does not currently accept community contributions. Please use version 1.0.11-as0.1.0 released from my forked repository.
AtypicalComputers@reddit
Yup, got it! Running and inferring. Much easier than having to go through the terminal! If there's any way to add metrics similar to llama.cpp, that would be a great addition! Looking forward to the project!
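Until the app grows built-in metrics, a rough llama.cpp-style eval rate can be measured client-side. A sketch, assuming the server reports a `usage.completion_tokens` field in its response (an assumption, since the fork's response schema isn't documented here):

```python
import time

def tokens_per_second(completion_tokens, elapsed_s):
    """Client-side throughput, similar in spirit to llama.cpp's
    eval-rate line. `completion_tokens` would come from the
    response's `usage` field if the server reports one."""
    return completion_tokens / elapsed_s if elapsed_s > 0 else 0.0

start = time.perf_counter()
# ... send the chat/completions request to the phone here ...
elapsed = time.perf_counter() - start
# rate = tokens_per_second(response["usage"]["completion_tokens"], elapsed)
```

This ignores prompt-processing time vs. generation time, which llama.cpp reports separately, but it's enough to compare phones in a cluster.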
Dazzling_Equipment_9@reddit
I saw you posted this yesterday: "Open source Android app for native tool calling with Claude", but I noticed you deleted it today. Your demo video also used the same one :)
Ok_Fig5484@reddit (OP)
What are you talking about? You've got it wrong, that's not me.
Dazzling_Equipment_9@reddit
Isn't that right? Uh, maybe someone else is doing the same thing as you. Don't worry about it, bro.
Ok_Fig5484@reddit (OP)
It's definitely not that, because this app has very limited ability to call native Android system features unless a lot of additional code is written. When used as an API server, it only returns structured function outputs. From what I've observed, the model doesn't return a text response; instead, the tool function gets called directly. I have to admit I haven't fully figured this out yet, so if anyone has solved this, I'd be very interested in taking a look.
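For reference, "structured function outputs" in an OpenAI-compatible server usually means the reply carries `tool_calls` instead of text content. A sketch of what that looks like on the wire; the `get_weather` tool and the response shape are made-up illustrations, not taken from this project:

```python
import json

# An OpenAI-style request advertising one (hypothetical) tool.
request = {
    "model": "gemma",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}

def extract_tool_call(response):
    """When the model picks a tool, the message carries `tool_calls`
    and `content` is typically empty -- matching what OP observes."""
    msg = response["choices"][0]["message"]
    calls = msg.get("tool_calls") or []
    return [(c["function"]["name"], json.loads(c["function"]["arguments"]))
            for c in calls]

# Example reply with a structured function output (shape assumed):
fake = {"choices": [{"message": {"content": None, "tool_calls": [
    {"function": {"name": "get_weather", "arguments": '{"city": "Paris"}'}}]}}]}
```

In a full tool loop, the client would execute the extracted call itself and send the result back as a `tool` role message; on-device, the app short-circuiting that loop would explain why the tool "gets called directly."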
Danmoreng@reddit
I would recommend not using the Edge Gallery app as a base, but only as a reference, and implementing a much simpler server app from scratch. With whatever you used to make your modifications (I assume Claude/Codex/Gemini), a clean from-scratch implementation should be just as easy. For example, I did something similar for my transcription app: I had Codex first analyze the Edge AI Gallery app against what my app already had, to figure out how to implement the new Gemma models: https://github.com/Danmoreng/vox-transcribe/tree/main/docs
Ok_Fig5484@reddit (OP)
Yes, I started building it directly without first analyzing the Gallery's core design. Only during development did I discover that the model's loading lifecycle is tied to the UI and that only one model can be loaded at a time. That's ultimately why I added a custom task icon.
Illustrious-Lake2603@reddit
I'm interested in the cluster idea. Will this work to link 4 phones together?
Ok_Fig5484@reddit (OP)
A cluster only helps if you put the phones behind a load balancer: it increases concurrency, but each individual request still runs on a single phone.
Uriziel01@reddit
Yeah, I also love the idea. I discussed this yesterday, as I think it could be really useful as a generic agent for basic Google-style questions: https://www.reddit.com/r/LocalLLaMA/comments/1sfvy4x/could_gemma_4_breathe_new_life_into_cheap/
Uriziel01@reddit
Hahaha u/Ok_Fig5484 we did the exact same thing :D https://github.com/Uriziel01/gallery/
niyandathaal@reddit
Good idea, I'm currently using a Raspberry Pi.
ArcadiaBunny@reddit
Pretty genius
Lumienca@reddit
Good idea 😊
ghulamalchik@reddit
Really nice idea.
Ok_Fig5484@reddit (OP)
Since there's no such thing as a quiet GPU, a mobile phone it is.