24/7 Headless AI Server on Xiaomi 12 Pro (Snapdragon 8 Gen 1 + Ollama/Gemma4)
Posted by Aromatic_Ad_7557@reddit | LocalLLaMA | 225 comments
Turned a Xiaomi 12 Pro into a dedicated local AI node. Here is the technical setup:
OS Optimization: Flashed LineageOS to strip the Android UI and background bloat, leaving ~9GB of RAM for LLM compute.
Headless Config: Android framework is frozen; networking is handled via a manually compiled wpa_supplicant to maintain a purely headless state.
Thermal Management: A custom daemon monitors CPU temps and triggers an external active cooling module via a Wi-Fi smart plug at 45°C.
Battery Protection: A power-delivery script cuts charging at 80% to prevent degradation during 24/7 operation (a rough sketch of such a script follows below).
Performance: Currently serving Gemma4 via Ollama as a LAN-accessible API.
Happy to share the scripts or discuss the configuration details if anyone is interested in repurposing mobile hardware for local LLMs.
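A minimal sketch of what such a charge-limit daemon could look like (not OP's actual script; it assumes a rooted device whose kernel exposes a charging toggle in sysfs — `charging_enabled` is common on Qualcomm/Xiaomi kernels, but the node name varies by device):

```sh
#!/system/bin/sh
# Charge-limit daemon sketch: hold the battery near 80% while the phone
# keeps drawing power from the adapter. Node name is kernel-specific.
BATT=/sys/class/power_supply/battery
while true; do
    level=$(cat "$BATT/capacity")
    if [ "$level" -ge 80 ]; then
        echo 0 > "$BATT/charging_enabled"   # stop charging; phone still runs off USB input
    elif [ "$level" -le 78 ]; then
        echo 1 > "$BATT/charging_enabled"   # top the battery back up
    fi
    sleep 60
done
```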
kermitkiller666@reddit
Is it possible to make a phone cluster to run a larger AI model?
Aromatic_Ad_7557@reddit (OP)
Probably it's too hard, or the AI will run slowly: within one device, data transfers quickly between the CPU and RAM, but in any cluster design the transfer between devices is slow compared to the CPU-RAM link.
StatisticianFluid747@reddit
Man, this is exactly the kind of unhinged homelab engineering I sub here for. 😂
Seriously though, how are you handling the long-term battery bloat risk? I know you mentioned the script cuts the charge at 80%, but keeping a lithium battery perpetually warm and tethered 24/7 is still basically a recipe for a spicy pillow down the line. Have you looked into completely bypassing the internal battery and wiring it straight to a DC power supply? I know some folks in the Android server community do it with old Samsungs to make them dedicated nodes without the fire hazard.
Also, big +1 to everyone saying to dump Ollama for raw llama.cpp. Since you already went through the absolute headache of manually compiling wpa_supplicant just to get headless Wi-Fi working, compiling llama.cpp for Android will be a total walk in the park for you, and the inference speed jump is usually massive on these mobile chips.
Aromatic_Ad_7557@reddit (OP)
Oh, I don't like the idea of opening my Xiaomi's frame to mess with the battery; I like its design.
In standby it heats up a little, but the cooler only kicks in once every 3-5 hours, so for now I probably don't need to modify the hardware.
Thank you for your comment, I will check llama.cpp soon.
Low-Ad1658@reddit
Any recommendations for a similar setup on a Snapdragon 8 Gen 4? I have little experience with any of this.
Aromatic_Ad_7557@reddit (OP)
You need to get root first. But probably start by simply installing Ollama / llama.cpp on the current software and checking the performance. I will share a guide for my setup soon as well. But if you are a complete newbie, flashing the device is probably risky.
_Aeterna-Lux_@reddit
On my S25U I always had issues running some LLMs because Android is very strict about locking down apps that seem to take more resources than expected. Also, I was never able to use the Snapdragon chip's GPU; how did you manage it? Though I must admit I only used it via Termux with Chrome interactive.
Aromatic_Ad_7557@reddit (OP)
I haven't tested the GPU yet, but the CPU produces good speed.
That's why I flashed from Xiaomi HyperOS to LineageOS: to free more RAM and get full control of the hardware.
wordpipeline@reddit
Which compile flags did wpa_supplicant need that standard wpa_supplicant doesn't have? I'm not sure I get what this point is about. Why wasn't the phone headless already? "Android framework is frozen" I don't get this either: You turned off OS updates in Android? That can't be what you meant.
Aromatic_Ad_7557@reddit (OP)
I ran the `stop` command; that disables the standard wpa_supplicant, so Wi-Fi stops working. The build I compiled lets me bring Wi-Fi back up afterwards.
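Roughly, the flow looks like this (a reconstruction, not OP's exact commands; interface name, config path, and DHCP client are device-specific guesses):

```sh
# Run as root in an adb shell.
stop                                  # kill the Android framework: UI and the system Wi-Fi stack go down
ip link set wlan0 up
wpa_supplicant -B -i wlan0 -c /data/misc/wifi/wpa_supplicant.conf   # the self-built supplicant
udhcpc -i wlan0                       # or assign a static address with `ip addr add`
```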
mail4youtoo@reddit
What is token generation for a prompt other than 'hello'?
Aromatic_Ad_7557@reddit (OP)
Please give me a prompt (or a series of them); I'll run it and update.
Hodler-mane@reddit
so this is like a 2b model, max right? what do people even do on these models. genuine question?
Aromatic_Ad_7557@reddit (OP)
It's E2B at Q8; for me, it's smart enough for its size. I'm just learning how to work with RAG and fine-tuning before renting hardware to do something with bigger models. It can also work with images and audio, but the current Ollama version doesn't support those Gemma4 features, so I haven't tested them yet.
Top-Rub-4670@reddit
It's a fun project no doubt but E2B is not particularly smart for its size.
Qwen 3.5 4B is a lot smarter at the same size, btw.
There's also Qwen 3.5 2B that is much smaller than E2B and also smarter. But to be fair it's more of a coin toss there, they're both pretty dumb.
Aromatic_Ad_7557@reddit (OP)
For me, Gemma4 is the smartest small model of all those I've tested.
IronColumn@reddit
not smart for having a conversation with but it's good for audio and photo analysis.
semperaudesapere@reddit
No clue. This can't do anything of value in any reasonable time. Might as well learn how to do it yourself at that speed and capability.
Aromatic_Ad_7557@reddit (OP)
I don't think so
semperaudesapere@reddit
24 seconds to respond to hello. Absolutely useless.
Aromatic_Ad_7557@reddit (OP)
Lol, that's because there is a reasoning mode. It replies immediately with reasoning off.
International-Try467@reddit
Iirc charging to 80% doesn't make much difference compared to just charging fully to 100% or using it until the battery dies.
Although this is pretty cool, what speeds do you get from it?
LoafyLemon@reddit
It does, actually.
Lithium-ion batteries degrade faster when charged to 100% because pushing the battery to maximum voltage creates electrochemical stress that destabilizes the chemical compounds inside. At full charge, the electrolyte becomes more reactive and breaks down faster, the protective layer on the anode (SEI layer) continuously forms and degrades in destructive cycles, and the cathode material itself becomes structurally unstable as oxygen releases from it. Charging to only 80% keeps the battery in a lower-stress voltage range where these degradation processes happen much more slowly, which means you lose some capacity per charge but gain significantly more total charging cycles before the battery becomes unusable.
TLDR; It often extends overall battery lifespan by 20-30% or more.
/nerd out
the_omicron@reddit
That only holds if you don't consider charge cycles, the second parameter besides degradation from maximum charging voltage. If you charge to 100%, you accumulate fewer charge cycles, so in real-world usage the battery health ends up roughly the same whether you charge to 80% or to 100%.
Deathcrow@reddit
That's not what charge cycles mean. A charge cycle is not "how often do you plug in the USB cable?". Charging once from 0% to 100%: that's one charge cycle. Charging 5 times from 60% to 80%: that's also one charge cycle.
nebenbaum@reddit
Consider this: it's mostly 'unknown', but I am fairly sure that most phones don't actually charge the cell to a true 100%; they probably cut off at 90-95% of theoretical cell capacity anyway.
SodaAnt@reddit
Nope, you can see the voltage if you look at the detailed reporting the phone can output, and it's usually the max the battery can handle. Most phone manufacturers would rather get that extra 5-10% battery life now than make sure the phone keeps working well 5 years later.
nebenbaum@reddit
Did you do a scan across the whole charging curve to see if those numbers are accurate? Because if you didn't, being an embedded dev myself, I see a nonzero possibility of them just scaling the number at some point as an 'easy fix'.
I do believe that yes, a lot of companies don't give a shit, but the only way to be certain is to record the full charge curve charging from the phone, and then again for the bare cell (without the protection circuit, as that might also have some shenanigans going on there) with an external charger
ffpeanut15@reddit
You would be surprised. There has been testing of SiC phone batteries in China, and most have significant reserve capacity compared to their true max charge. Some reserve as much as 10%.
SodaAnt@reddit
SiC is a pretty rare chemistry still and has bigger reserves because of lower cycle life.
BlueSwordM@reddit
It makes for a BIG difference when cycling, but when plugged in all the time, 80% vs 100% SOC might as well be life vs death.
At 100% charge all the time and under heat, the battery will die very quickly, which is never a great thing.
xeeff@reddit
hard disagree. charging to 80% only uses up ~0.5 charge cycles. my previous phone (s20 exynos) had an extremely bad battery, so i did research on how to take care of it. i don't use wireless charging (although it's not a big problem if you don't use it exclusively) and i try not to use fast charging much either. i actually limit my charging to 70% because that uses up even fewer charge cycles; i used to go down to 60-70% depending on the times. almost 2 years later, my battery is still extremely healthy and lasts quite a while, especially if i don't use it much. and i've stopped obsessing over the battery cuz i have other things to stress about
misha1350@reddit
Incorrect. The difference between 80% and 100% is huge for prolonged use. At 100%, it would become a spicy pillow in no time.
ParthProLegend@reddit
Wrong. The best balance is 50%. Research first.
Aromatic_Ad_7557@reddit (OP)
I have a video on YouTube, but I'm afraid to post it here; I'm not a YouTuber and don't plan to become one.
I worked on this build for a month, and I don't have a single friend who understands what it is =)
Regarding the battery: I set it up so the battery isn't used at all.
AvidCyclist250@reddit
90% should be equally safe
Aromatic_Ad_7557@reddit (OP)
I just bypass the battery entirely so it isn't used. There's no difference between 80, 90, or 100%: it simply doesn't use the battery at all and gets power from the adapter only.
AvidCyclist250@reddit
Even better.
And as to what I said above and what's being downvoted, Apple recently published their results. I'd say that's fairly trustworthy.
a_beautiful_rhind@reddit
There is a difference between using the battery after charging to 100% and letting the battery SIT at 100% constantly.
Think it also varies by chemistry.
Salt-Powered@reddit
Limiting helps a little, but I agree it should be 50% instead to have better results in battery longevity.
Fine_League311@reddit
Cool. I wanted to build this on a Redmi A5 with 4GB RAM + 4GB swap from internal storage, but I failed. Will you make a how-to for it, or a Git repo?
Aromatic_Ad_7557@reddit (OP)
Oh, 4GB + 4GB swap will probably be very slow, and you will wear out the phone's flash storage if you use it intensively.
I plan to post a how-to as well, if enough people request it.
Fine_League311@reddit
But Qwen 2B at 1.6 GB should still run on it, right?
Aromatic_Ad_7557@reddit (OP)
Agree 👍
Fine_League311@reddit
GitHub/gitlab Link?
Aromatic_Ad_7557@reddit (OP)
I will post a guide and update you here as soon as it's ready.
MeticulousBioluminid@reddit
awesome
Fine_League311@reddit
Thanks, saved the post.
Aromatic_Ad_7557@reddit (OP)
However, if your phone is a Xiaomi, you can start googling how to unlock the bootloader. That will be the hardest part. )
Fine_League311@reddit
Thanks for the tip, but I already know Lineage and co., as well as flashing. I'm interested in your workflow.
Aromatic_Ad_7557@reddit (OP)
If you already have LineageOS installed, just run Termux and `pkg install ollama`. What's the problem?
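For anyone following along, the Termux path is roughly this (`pkg install ollama` is straight from the thread; the serve/run lines are standard Ollama usage, and the exact Gemma4 tag is a guess):

```sh
pkg update && pkg install ollama
OLLAMA_HOST=0.0.0.0:11434 ollama serve &   # bind to all interfaces so the LAN can reach the API
ollama run gemma4:e2b                      # pulls the model on first run; check the registry for the real tag
```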
Fine_League311@reddit
Ahh, you did it with Termux? OK, then I understand how. I thought you had built something for Lineage itself. Termux, no thanks, I don't need it; I build real servers & ecosystems. Termux isn't bad, but too weak and insecure.
Aromatic_Ad_7557@reddit (OP)
As for me, it runs locally and I don't see any disadvantages to Termux. It runs stably.
Fine_League311@reddit
I'm not a fan of Termux and I think it's what made all the problems with scammers and Telegram funnels possible in the first place (Termux & OpenClaw), but that's a different topic that doesn't belong here. Still, it's a cool project even if I don't like Termux. Thanks for sharing.
Aromatic_Ad_7557@reddit (OP)
Also, after everything is installed, you need to connect the phone to a PC via adb and run `stop` from the console; it will kill the Android UI.
maschayana@reddit
A very detailed word on the performance
TheFrenchSavage@reddit
Still waiting for the first token to post haha
Aromatic_Ad_7557@reddit (OP)
[screenshot]
vinigrae@reddit
Welp
FinsAssociate@reddit
As a noob to this stuff... are those stats good or bad?
IronColumn@reddit
alright for a phone
Aromatic_Ad_7557@reddit (OP)
It just replies in a second without reasoning.
Aromatic_Ad_7557@reddit (OP)
Sorry, I uploaded the screenshot with the speeds in another comment below. I just forgot that there's no way to add images to a post after it's published.
No-Judgment9726@reddit
This is really cool — repurposing old flagship phones as inference nodes is such an underexplored direction. The Snapdragon 8 Gen 1 has decent NPU capabilities that most people never touch once they upgrade.
Curious about thermal throttling though — did you notice any performance degradation after running it 24/7 for a few days? In my experience, sustained mobile SoC loads tend to hit thermal walls pretty quickly without active cooling. Even a small USB fan pointed at it could help.
Also, the LineageOS approach to reclaim RAM is clever. I wonder if we'll see purpose-built "inference OS" distros for old Android phones at some point — there's probably millions of retired flagships sitting in drawers that could be doing useful work.
mrtrly@reddit
The thermal daemon triggering cooling at 45C is the part most people skip and then wonder why inference degrades after 20 minutes. Sustained throughput on mobile SoCs drops hard once you hit thermal throttling. Smart move freezing the Android framework too, that alone probably bought you 2-3GB of usable context window.
RIP26770@reddit
Compile llama.cpp on your hardware and delete Ollama and double your inference speed.
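For reference, a typical native build in Termux looks something like this (a generic recipe, not verified on OP's device; build flags are left at their defaults):

```sh
pkg install git cmake clang
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j"$(nproc)"
./build/bin/llama-cli -m /path/to/model.gguf -p "hello"   # quick smoke test
```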
robberviet@reddit
Ollama does more harm than good at this point.
Code-Useful@reddit
Not disagreeing, but curious why you say this?
robberviet@reddit
As usual, you can search for it on Reddit or the wider internet. There are too many problems to list them all: perf issues, exposing ports to the public, guerrilla marketing, not crediting llama.cpp, naming shenanigans (R1, various others), bad default context, default quants... Many are fixed, but many remain.
export_tank_harmful@reddit
This is the biggest one in my eyes.
It's essentially just a wrapper for llama.cpp, and they don't credit them for that at all.
Plus, importing your own models to it renames them into non-human-legible hashes which just irks me.
Not symlink. Full rename.
Adventurous-Milk-882@reddit
Trust this man, llama.cpp is much faster
randylush@reddit
I have a lot of frontends plugged into Ollama (open web UI, openclaw, and Claude)
can I connect these to llama.cpp?
Hock_a_lugia@reddit
Yes! It can be used as an API endpoint, same as any other LLM service.
MonteManta@reddit
Not easily
GregoryfromtheHood@reddit
What do you mean not easily? You literally just fire up llama-server and you're done. API endpoint ready to go.
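Concretely, llama-server exposes an OpenAI-compatible API, so any frontend that accepts a custom base URL can point at it (the model path and phone IP here are placeholders):

```sh
llama-server -m /path/to/model.gguf --host 0.0.0.0 --port 8080
# then, from any machine on the LAN:
curl http://<phone-ip>:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "hello"}]}'
```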
FaceDeer@reddit
People often underestimate the value of "it just works." I'm a programmer, I'm comfortable with technical stuff. A while back I wrote some little local applications that use local LLMs for stuff and it was so convenient to just point them at Ollama and let Ollama figure out the details of loading and unloading models when needed.
I heard that llama.cpp recently added that ability to load and unload on demand, and I spent a bit of time fiddling with it to see if I could swap it out for Ollama. But after a bit of work I realized I was spending effort to replace something there was nothing actually wrong with, and I wasn't sure I was even going to get as good a result. So I just stopped and moved on to other stuff. Ollama continues sitting there on the taskbar doing what's needed in the background.
If I was more concerned with squeezing every bit of performance I could out of my hardware, sure, I'd go back to spending more effort on that. But "good enough" is good enough in my current use cases.
Imaginary-Unit-3267@reddit
For those whose hardware is limited, llama.cpp seems like it's basically a must. If you have an RTX 3060 and nothing else, like I do, Ollama is not worth the overhead.
Nobby_Binks@reddit
Yes, I was in the same boat until I wanted to run really large models that spilled into system RAM. llama.cpp gives much more granular control over how the model loads; at least, I couldn't work out how to do it easily in Ollama.
I moved to llama.cpp controlled by llama-swap. Takes a couple of minutes to work out the yaml structure but once setup it's simple. I have both Ollama and llama-swap served models in open webui but have more or less stopped using Ollama.
MuDotGen@reddit
I just use llama-swap. It might be newer, but it's a router that runs with llama-server, making an easy local server that can automatically load and unload models on the fly. All I had to do was download it, the models from Huggingface I wanted, and make a config.yaml with the model configurations I wanted, and then run the server just like any other, more or less. I think llama.cpp has a router mode too, though, but there seem to be some benefits to both.
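To illustrate, a minimal llama-swap setup looks roughly like this (model names and paths are placeholders; see the llama-swap README for the full schema):

```sh
cat > config.yaml <<'EOF'
models:
  "gemma":
    cmd: llama-server --port ${PORT} -m /path/to/gemma.gguf
  "qwen":
    cmd: llama-server --port ${PORT} -m /path/to/qwen.gguf
EOF
llama-swap --config config.yaml --listen :9292   # requests for "gemma" or "qwen" load/unload models on demand
```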
RIP26770@reddit
Yes, but for the best experience, be sure to add Llama-Swap on top!
flq06@reddit
--mmap
Mathisbuilder75@reddit
Is it actually that much faster to compile it vs using the prebuilt binaries?
RIP26770@reddit
Yes
Mathisbuilder75@reddit
I just tried both. Idk if I did something wrong, but the precompiled version is slightly faster.
```
Compiled locally :
./llama-bench -m /root/.cache/huggingface/hub/models--unsloth--Qwen3.5-9B-GGUF/snapshots/3885219b6810b007914f3a7950a8d1b469d598a5/Qwen3.5-9B-Q4_K_M.gguf
load_backend: loaded CPU backend from /opt/llama.cpp/build/bin/libggml-cpu-haswell.so
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen35 9B Q4_K - Medium | 5.28 GiB | 8.95 B | CPU | 4 | pp512 | 12.39 ± 0.11 |
| qwen35 9B Q4_K - Medium | 5.28 GiB | 8.95 B | CPU | 4 | tg128 | 2.60 ± 0.45 |
build: fae3a2807 (8796)
===============================================================================================================================================================
Pre built binaries :
./llama-bench -m /root/.cache/huggingface/hub/models--unsloth--Qwen3.5-9B-GGUF/snapshots/3885219b6810b007914f3a7950a8d1b469d598a5/Qwen3.5-9B-Q4_K_M.gguf
load_backend: loaded RPC backend from /opt/llama-b8778/libggml-rpc.so
load_backend: loaded CPU backend from /opt/llama-b8778/libggml-cpu-haswell.so
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen35 9B Q4_K - Medium | 5.28 GiB | 8.95 B | CPU | 4 | pp512 | 12.46 ± 0.18 |
| qwen35 9B Q4_K - Medium | 5.28 GiB | 8.95 B | CPU | 4 | tg128 | 2.76 ± 0.08 |
build: aa00911d1 (8778)
```
LA_rent_Aficionado@reddit
how did you compile, what build arguments?
rorowhat@reddit
Only if your hardware is non standard, correct? For example on iGPU I think getting the pre-compiled vulkan backend will give you the same performance as compiling your own
Pale_Ratio_6682@reddit
Does this advice apply to all devices? I.e. is deleting Ollama + compiling with llama.cpp always going to ~double inference speed?
overand@reddit
You also don't need to delete Ollama - and with a little effort you can even point llama.cpp to the files that ollama downloaded. (They're named in a quirky way, but they're just GGUF files like llama.cpp itself uses. Interestingly, recent changes in how llama.cpp uses huggingface's cache directory makes it even more like ollama's directory structure & naming scheme)
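If you want to try it, the blobs are plain GGUF files named by digest (default location shown; it may differ on your install):

```sh
ls -lS ~/.ollama/models/blobs | head    # the largest sha256-* file is the model weights
llama-server -m ~/.ollama/models/blobs/sha256-<digest>
```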
Pale_Ratio_6682@reddit
Okay, thanks for the pointers. I ask because I'm new to using local LLMs and went with Ollama since it seemed the most straightforward and easiest to set up (an assumption on my part, with no knowledge of its validity), and while some of the bigger models aren't super quick, I've had a pleasant experience so far.
Question: sorry if this is a stupid question, but are Ollama and llama.cpp the same thing? I want to experiment more with models and fine-tuning for different tasks, so I'd like the most configurable option, if that makes sense.
nasduia@reddit
Ollama is a VC-funded company that has been using llama.cpp underneath and slapping a simple UI on top. Ollama hasn't credited llama.cpp as much as it should have, which causes some friction. The main difference is that it's slightly harder to make llama-server run at startup if you don't have much technical experience, though AI can help you.
RIP26770@reddit
I can't speak to whether the speed is doubled on every device, but based on my experience, using Vulkan on the Intel Arc iGPU and GPU is a significant improvement. The same goes for the Android Mediatek GPU as well.
SkyFeistyLlama8@reddit
Does Vulkan work on other CPUs like Snapdragon or AMD?
RIP26770@reddit
I believe Vulkan performs really great on AMD hardware!
arcanemachined@reddit
Not nearly as good as ROCm performance in my experience, but it will do in a pinch.
fallingdowndizzyvr@reddit
I have a different experience. At zero context, ROCm has a slight advantage in PP and Vulkan a slight advantage in TG. But as context grows, ROCm gets left behind; eventually Vulkan is faster than ROCm in both PP and TG.
arcanemachined@reddit
I have some benchmarks I need to run again...
RIP26770@reddit
I didn't know that! On Intel hardware it's 20-30% faster than SYCL for my setup.
treverflume@reddit
In my experience it's always quite a bit better. Maybe not double, but always worth it.
am2549@reddit
Even with MacOS Metal and M5 Series? How come compiling it makes it faster?
Mistic92@reddit
On mac you should try MLX
RIP26770@reddit
I haven't tested it on Mac, so I can't speak to that, but I am quite convinced that on every device, you should compile llama.cpp yourself. It takes about 2-3 minutes; I mean, it's not that difficult to run a .sh or .bat file to do it, and then you can be set! From my experience, every single machine I have compiled myself has yielded faster inferences than using the precompiled binaries of llama.cpp.
Monkey_1505@reddit
llama.cpp is already compiled for mac.
Salt-Willingness-513@reddit
true words. Tried this on my geekom a8 max and gemma 4 26b a4b q8 went from 10 to 18t/s
Nomadic_skier@reddit
Does this work for a local server? I'm running some models on a 1080 and a 2080 with ollama and the performance is meh
PentagonUnpadded@reddit
check out vLLM too. Good for agentic multi-streaming.
Nomadic_skier@reddit
Right now I have everything just piping through Ollama and then binding models to certain hardware. But getting sub-agent triage and routing right has been hard. Will test out llama.cpp and vLLM. Any other tips?
TechnoByte_@reddit
Yes, use llama-server (included with llama.cpp)
RIP26770@reddit
Llama.cpp is designed for local inference.
Aromatic_Ad_7557@reddit (OP)
Good advice, thank you. I will try.
PurpleWinterDawn@reddit
May I? I've written a quick-and-dirty script to compile Llama.cpp on Termux a short while ago.
teleprint-me@reddit
There's no need to compile it anymore. It's in the repos now for CPU, OpenCL, and Vulkan.
StatisticianFluid747@reddit
man, I feel you so hard on the "no friends who understand this" part lol. We see what you're cooking though! This is honestly one of the coolest repurposing projects I've seen here in a while.
Quick question about the battery setup—since it's running 24/7, did you ever look into completely bypassing the battery and wiring direct power to the board to prevent it from becoming a spicy pillow a year from now? Or does the 80% cutoff script combined with the active cooler keep the temps stable enough that you aren't worried about it?
Aromatic_Ad_7557@reddit (OP)
Yes, most of my friends would say something about this build like "if this device isn't printing money, it's useless" )))
Great that you like it, thank you.
There are now two separate daemons. One checks the battery: if it's above 80%, the phone is powered directly from the adapter, bypassing the battery.
The second one monitors the CPU temperature: when it goes above 45°C, it switches on the smart plug and the cooler starts working; when the temperature drops to 42°C, the script cuts the smart plug's power, and the cooler with it (sketched below).
The only thing I'm still considering is some kind of stand, because the front of the phone heats up a little too, probably mostly because it's touching the table surface.
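A minimal sketch of such a thermal daemon (not OP's script; the thermal zone index and the plug's HTTP API, Tasmota-style here, are assumptions):

```sh
#!/system/bin/sh
ZONE=/sys/class/thermal/thermal_zone0/temp   # millidegrees C on most kernels
PLUG=http://192.168.1.50                     # hypothetical smart-plug address
while true; do
    t=$(( $(cat "$ZONE") / 1000 ))
    if [ "$t" -ge 45 ]; then
        curl -s "$PLUG/cm?cmnd=Power%20On" > /dev/null    # cooler on at 45°C
    elif [ "$t" -le 42 ]; then
        curl -s "$PLUG/cm?cmnd=Power%20Off" > /dev/null   # cooler off below 42°C
    fi
    sleep 10
done
```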
LankyGuitar6528@reddit
Just a bit of air space might help. A couple of those little plastic stick on bumper things?
Water-cage@reddit
bunch of Chinese crap
SaltResident9310@reddit
This is what I'm here for. So tired of seeing 48GB builds and 96GB builds. I was promised flying cars but I'll settle for good models that run well on regular consumer devices.
philmarcracken@reddit
Promised: flying cars.
Won't even compromise on: WFH, 15-minute walkable cities, a small vacuum-tube system for package delivery and trash disposal.
Salt-Razzmatazz-2132@reddit
You mean a Helicopter?
Apprehensive_Side219@reddit
eVTOL vehicles of other kinds are getting started too.
Ivebeenfurthereven@reddit
I'm skeptical they'd ever be safe in the hands of the public.
ImpressiveSuperfluit@reddit
Nevermind safe, it's an absolute energy efficiency nightmare. Annoyed with fuel prices right now? Well, try fighting 1g the whole way. It's completely ridiculous from start to finish.
randylush@reddit
commercial jets do alright in terms of fuel per passenger vs driving. personal aircraft are only about twice as bad as driving a gas car, not like 10 times worse.
ImpressiveSuperfluit@reddit
We're not talking about aircraft, though. We already have those, and in contrast to making cars fly, they actually retain a modicum of sense, because they can make much better use of momentum and of scale. Cars, virtually by definition, do neither. And every time you try to solve these problems, you just get back to planes and helicopters again, with their own issues included. Spin it as you like, it's a horrifically absurd idea.
SaltResident9310@reddit
Not if you have antigravity tech powered by dilithium crystals.
Apprehensive_Side219@reddit
This guy gets it
lol-its-funny@reddit
No, flying cars
pointer_to_null@reddit
Flying cars sound amazing until you're on the road and notice other drivers and their vehicles; sometimes it's a lack of situational awareness, overconfidence in their own ability, or simply their anger issues. Automation may address these, but maintenance is still going to be cost-prohibitive; when a car breaks down on the road it's not an automatic fiery death sentence for its occupants and unlucky bystanders.
Passenger drones might be your best bet, but current politics and safety regs limit their utility to being hardly better than a chartered helicopter currently. If companies like Ehang can get the cost down, they could revive the short-hop shuttle service over urban centers, like that Pan Am NY Airways service from the 1960-70s did with those big Sikorsky helicopters.
DrummerHead@reddit
I think for flying cars to work (leaving aside implementation and assuming we have the hardware and it's not cost prohibitive) the driver needs to delegate all the driving to the car (autonomous) and there has to either be a "government server" that syncs and handles the driving of all vehicles or each car has to have a standardized shared logic for what to do to avoid other cars.
Basically you'd go to the map and mark your destination and your 'car' handles getting there. People driving when there are so many axis of movement and the increased consequences of something going wrong is not acceptable.
128G@reddit
I don’t have trillions of dollars lying around for a “flying car”
cutebluedragongirl@reddit
People forget just how powerful smartphones are these days.
ousher23@reddit
Mine is into dark haikus...XIAOMI 15, Llama-3.2-3B
FullOf_Bad_Ideas@reddit
Try to run MoE's.
Back in the day, DeepSeek V2 Lite 16B was decoding at 25 t/s on this kind of device. So shoot for the 12-20B size range. If you give it tools, it would probably be useful as a web-search agent. But you need a sparse model and enough RAM to make it work.
ricopicouk@reddit
Following, this is neat.
dibu28@reddit
How many TPS do you get, and with which model?
Aromatic_Ad_7557@reddit (OP)
Gemma 4
Aromatic_Ad_7557@reddit (OP)
[screenshot]
marloquemegusta@reddit
Most interesting post I have seen here in recent days
Aromatic_Ad_7557@reddit (OP)
Thank you 👍
TripleSecretSquirrel@reddit
Very cool, I love repurposing used hardware!
What’s the use-case for you? Is this just your local chatbot?
DottorInkubo@reddit
Xiaomi 12 is old hardware?
TripleSecretSquirrel@reddit
Not really familiar with Xiaomi’s phones, I thought it was a few years old?
DottorInkubo@reddit
Huh. It’s indeed 4 years old now. Looking at the specs, by my standards it probably still is a very good and usable phone 🙂
TripleSecretSquirrel@reddit
I mean, my phone is older than that, but ya, I guess I just assume that if a phone is being used for non-phone activities, it’s probably been pressed back into service from the old electronics drawer/box that we all surely have.
Aromatic_Ad_7557@reddit (OP)
I'm working on another project of mine and need this device to test small models. But for now I can't share exactly what I'm doing.
dzhunev@reddit
I tried Ollama in Termux with Gemma4 E2B and E4B, but it's really, really slow, and the phone kills the process at some point. Running llama.cpp failed. I'm interested in running LiteRT-LLM, but I'm stuck on this: when the same models are loaded inside Google AI Edge Gallery, the t/s is much higher and you can run them on the GPU. I assume this is the key, although running Ollama with Vulkan didn't improve performance much compared to plain Ollama.
Aromatic_Ad_7557@reddit (OP)
How much RAM do you have, and how much is free when you try?
PTBKoo@reddit
I wonder if this is possible on a s26 ultra since it has much better specs but no root yet
Aromatic_Ad_7557@reddit (OP)
You need to find a way to flash it to another OS, because the native OS will probably take 3.5-4 GB of RAM, same as on the Xiaomi. But the S26 Ultra isn't an old device and has the same 12 GB of RAM; it's probably better to sell it and buy two Xiaomi 12 Pros )) Or its price would cover a 32GB mini PC.
StaticInTheStars@reddit
This is very cool! Thanks for sharing! Would love to see a guide on how you set it all up; it's a great option for this type of application.
Aromatic_Ad_7557@reddit (OP)
I will prepare one and post it here, probably in a few days.
MRanonyrat@reddit
That's kinda cool
MyDespatcherDyKabel@reddit
Wow, that's pretty respectable.
LtLi0n@reddit
nice, but please use llama.cpp
Also there's something deeply unsettling about using hardware that includes cameras, screen, mic/speaker, sensors and a battery for this 😆
TheGlister@reddit
Cool, I have a similar setup: a OnePlus 9 acting as a home server, though it’s not headless. I compiled a custom kernel with Docker support to run llama.cpp, Linux containers, VSCode Dev Server, Jellyfin, Paperless-ngx, and more. How were you able to kill the Android framework and Zygote without triggering a kernel panic?
Aromatic_Ad_7557@reddit (OP)
Just the `stop` command, and after that I stop leftover system processes like the camera; that way the init process doesn't try to relaunch them.
TheGlister@reddit
Thanks I will try
Aromatic_Ad_7557@reddit (OP)
I will share my scripts soon, though I'm not sure they'll work the same on different hardware.
david_0_0@reddit
the thermal design is solid. with the active cooling running 24/7 though, how much power are you actually drawing? most people don't factor in that the cooling daemon and wifi module are eating battery even when the model idles. did you end up with a daily charge cycle or can it really run untouched for days?
Aromatic_Ad_7557@reddit (OP)
Actually I have no tasks that need it to work 24/7 right now; also, I noticed that with no tasks the CPU temperature drops fast, and within 1-2 minutes the cooler switches off automatically.
I set up charging so that the phone doesn't use the battery at all: the power goes from the adapter directly to the phone, and the battery stays at 80%.
The phone draws 14-15 W, and the cooler about 16-18 W.
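(Taking those figures at face value, the worst case is about (15 + 18) W × 24 h ≈ 0.8 kWh per day, and considerably less in practice, since the cooler only runs under load.)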
Torodaddy@reddit
Lmao wut
phovos@reddit
Can you tell us more about LineageOS
Do you have to be fluent in Mandarin to develop or serve files etc. with it on Chinese hardware?
Aromatic_Ad_7557@reddit (OP)
LineageOS is a popular OS; try googling it. I used it because it's one of the most stable Android OSes, and it keeps all the phone's functionality.
You don't need to know Mandarin/Chinese to work with it. I'm not sure where LineageOS is headquartered, but probably in one of the European countries.
phovos@reddit
Oh, I was thinking of HarmonyOS.
Wow, that's incredible, I didn't know there was another competitor! Hopefully it's better than GrapheneOS; I'll have to check it out.
Healthy_Bedroom5837@reddit
why not go USB-C tethered Ethernet? faster. u may say "oh, then i can't charge", but u can get a USB-C Ethernet + USB-C PD input combo off AliExpress for under $10. might help with speed?
cool idea btw
Aromatic_Ad_7557@reddit (OP)
I ordered and returned two such adapters. One cost $20, the second $35. They don't deliver the 14 W my attached device needs, only 5 W maximum. In the end I just compiled wpa_supplicant (with Opus's help) to bring Wi-Fi back up after killing the Android UI.
However, my first idea was the same as what you're describing.
ChocomelP@reddit
What about wireless? Not enough power?
Aromatic_Ad_7557@reddit (OP)
I run the command `stop` and it kills the Android UI, and Wi-Fi with it. Because of that, I first tried to find a USB hub that would also charge the attached device, but after two unsuccessful attempts I compiled wpa_supplicant (the standard Android service that provides Wi-Fi functionality) and now run it after `stop` to bring Wi-Fi back once the Android UI is killed.
david_0_0@reddit
this is solid. one question - does the phone get hot running 24/7? and how does gemma4 actually perform on inference latency compared to smaller models you tested?
Aromatic_Ad_7557@reddit (OP)
I tested it with 6 hours of non-stop tasks and the cooler keeps the CPU temperature at 48°C.
When the phone has no tasks, it heats up a little due to the constant 14-15 W charging.
Gemma works fast. Actually, when I finally finished, I was frustrated because in long tests small models, even Qwen 14B, become stupid by the 3rd-4th prompt. Gemma4 is a real game changer: it stays smart even on long-context prompts.
rorowhat@reddit
You know if you wanted you could probably just add a block of copper on top, it would be dead quiet, take no power and would be enough to easily cool it.
Aromatic_Ad_7557@reddit (OP)
This cooler keeps the CPU temperature at 48°C under non-stop load. I don't think a copper plate would manage the same.
rorowhat@reddit
It will help dissipate the heat. 48°C is nothing; it can handle 80°C plus. I'm not sure about the TJ for this CPU, but you have plenty of headroom.
PentagonUnpadded@reddit
Adding on: e-waste stock CPU heatsinks work well. A stock Intel/AMD cooler + a 5-cent thermal pad in between, and voila.
International-Try467@reddit
Also
Ew. (Not to you OP)
mesispis@reddit
what should I use instead?
Cupakov@reddit
llama.cpp
International-Try467@reddit
Or kobold
In_d_ex@reddit
what's wrong with Ollama?
ParthProLegend@reddit
Its devs are trash. Performance is poor. And so much more.
umataro@reddit
How often do you deal with Ollama's devs? Performance is not poor; llama.cpp is only 15-30% faster these days. A price I will gladly pay for Ollama's usability.
Monkey_1505@reddit
What's wrong with llama.cpp's usability? Or if you prefer something with a little more webui, koboldcpp?
the_omicron@reddit
30% is plenty, my dude. you've chased 5% discounts on games, yet you say you will gladly pay for "ollama's usability" lmao.
umataro@reddit
30% is the worst case scenario. Average speed loss is well worth the user comfort ollama brings (to some people).
acetaminophenpt@reddit
Well, it's way cheaper than a rtx 6000 :D
Thumbs up!
AtypicalComputers@reddit
Did Ollama function correctly on your device utilizing Vulkan? I tried llama.cpp and the performance on a Pixel 8 was horrendous.
Aromatic_Ad_7557@reddit (OP)
Ollama works correctly with good performance, but it probably uses only the CPU, not the GPU. I've received advice to try llama.cpp and LiteRT; I will try them and update here with results.
I will also try to get the GPU going and report back, though for now the CPU alone gives decent speed.
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
overflow74@reddit
for the exact same hardware what are the benchmarks for running google’s litert llm on android? i think you could get better performance running gemma4 with it
Aromatic_Ad_7557@reddit (OP)
I will try it and update you. However, Gemma4 already gives decent results for my purposes.
overflow74@reddit
i highly recommend checking out the ai gallery demo apps and their litert llm repo on github as well
hackiv@reddit
Here's a guide I've made on how to compile Llama.cpp on android, replace Ollama asap.
https://www.reddit.com/r/LocalLLaMA/s/QrYY3jYp54
beachplss@reddit
What do you do with it once this setup starts running?
Aromatic_Ad_7557@reddit (OP)
I use it to work with RAG, and to train myself on fine-tuning before renting hardware for big models.
digitalwankster@reddit
This could be a cool option for running Home Assistant and Frigate with object detection.
Aromatic_Ad_7557@reddit (OP)
Yep, I also think a phone could be a good solution instead of an ESP32 + modules.
srona22@reddit
interested. would be great if you can share setup/guide and also where to get the cooling device, etc.
agsuy@reddit
Yup this deserves a guide!
Aromatic_Ad_7557@reddit (OP)
I'll probably share a guide, but in a few days, because it takes a lot of time.
You can find the cooler on Ali under the name: Magnetic Semiconductor Phone Cooler - Ice/Frost Cooling Pad for Mobile Gaming & Streaming.
Dorkits@reddit
Yeah, make a guide for us
Ok-Measurement-1575@reddit
I would love to see the llama-bench output or indeed any output :D
Aromatic_Ad_7557@reddit (OP)
[screenshot]
adobo_cake@reddit
Wow, not bad! Nice idea, might try this out.
Fine_League311@reddit
wow, 16 T/s, cool on a phone :D
redilaify@reddit
meanwhile the phone:
[image]
Aromatic_Ad_7557@reddit (OP)
Lol )) 🤣
Silver-Champion-4846@reddit
Why did you classify this post as funny?
Aromatic_Ad_7557@reddit (OP)
I checked the available flairs and didn't find one that suited my post, so I chose that one to avoid breaking the sub's rules.
Silver-Champion-4846@reddit
Hmmm
wayl@reddit
Why not directly a micro board? Am I missing some advantages a phone can give? Interesting in any case!
1ncehost@reddit
Fwiw used phones are essentially free ewaste. Often times better cost to perf than a pi.
Saifl@reddit
Can't wait for 24GB RAM phones to become dirt cheap, or maybe they already are. Idk about inference speed for 20B-parameter models, but maybe just extra context with that RAM, I guess?
Aromatic_Ad_7557@reddit (OP)
It's just my old phone that was sitting useless, and I decided to give it a new life )
TripleSecretSquirrel@reddit
I’m guessing the advantage was that OP already had the phone collecting dust in a drawer somewhere.
darkgamer_nw@reddit
It would be great to have a GitHub repository for the project
Aromatic_Ad_7557@reddit (OP)
But there's actually almost no code to post.
There are the automatic daemons that control power and run the cooler; nowadays, with Opus 4.6, probably anyone could make those ))
The hardest part was getting approval from Xiaomi to unlock the bootloader so I could flash the phone to LineageOS.
cviperr33@reddit
How did u get approval? Did u literally msg Xiaomi support to unlock ur bootloader so u can load local models :D ?
Aromatic_Ad_7557@reddit (OP)
Nope, there's a weird procedure. They allow 2,000 device unlocks per day, and the counter resets at 00:00:00 China time, so there's a whole fight of the bots. It's impossible to get approval manually; in the end I wrote a bot too, and after measuring the ping and the speed of Xiaomi's servers, I got approval after 10 days of attempts. Lol )
Suschis_World@reddit
That's actually the reason I'll probably never buy Xiaomi again. Never got that stupid procedure to work.
Aromatic_Ad_7557@reddit (OP)
I can share my python script if you need )
xquarx@reddit
Wait... how exactly did you go about installing Ollama or llama.cpp, like in Termux? Or does LineageOS give you terminal access easily?
How does the Tok/s performance change with different billion parameter models?
Aromatic_Ad_7557@reddit (OP)
You can install Ollama with Termux even on HyperOS; it works, I checked.
Other models work too, at similar speeds, but the really smart one is Gemma4. Nemo also gave decent results, but it needs about 12GB of RAM, so it ran with swap and slower than Gemma4.
dadnothere@reddit
The Redmi Note 8 can run SailfishOS... some Xiaomis have aftermarket OSes available... if you manage that, you get even more free RAM.
Ok_Fig5484@reddit
cool, I'm using a modified gallery to run the liteRT version of the API, and I'm wondering how its speed compares to the ollama version.