[Appreciation Post] Gemma 4 E2B. My New Daily Driver
Posted by Prestigious-Use5483@reddit | LocalLLaMA | View on Reddit | 58 comments
idk but this thing feels like magic in the palm of my hands. I am running it on my Pixel 10 Pro with AI Edge Gallery by Google. The phone itself is only using CPU acceleration for some reason, so the E4B version felt a little too slow. The E2B, however, runs perfectly: faster than I can read and follow along, and it has some function calling in the app. I am running it at the max 32K context and switch thinking on and off as needed.
It seems ridiculously intelligent. Feels like a 7B model.
I'm sure there is some recency bias here. But just having it run at the speed it does on my phone, with its intelligence, feels special.
Are you guys having a good experience with the E models?
MoodRevolutionary748@reddit
What are you actually using it for?
Prestigious-Use5483@reddit (OP)
Nothing high level of course. Just mostly Q&A. Helping with vacation planning. Asking scientific questions. And most importantly, running it locally.
EffectiveCeilingFan@reddit
You're telling me this can't replace Opus 6? Then what's even the point?! /s
bobbygenerik@reddit
Opus 6? You think something small enough to run on your phone is gonna perform better than these state of the art models?
yuvrajsingh1205@reddit
Can you please help me to learn more about e2b? I have some doubts.
Prestigious-Use5483@reddit (OP)
What do you want to know?
yuvrajsingh1205@reddit
Hi, thank you for your response. Can you tell me how I fine-tune it for SQL generation?
Prestigious-Use5483@reddit (OP)
Sorry, that is beyond my scope of knowledge
yuvrajsingh1205@reddit
Ok, so what do you know about it? How can I utilise it, given your knowledge?
Prestigious-Use5483@reddit (OP)
It seems to be using turbo quant so the context feels very good for its size. It's also fast and has good knowledge. It's like having a calculator in your pocket but for AI. It's not for high level coding. Just like a pocket assistant.
Dos-Commas@reddit
Classic Google, their own app and model don't even work properly on their own phone.
TMWNN@reddit
I heard a decade ago that Google's apps run better on iOS than on Android, Google Voice being one example.
Prestigious-Use5483@reddit (OP)
You have no idea how much of a nightmare it has been not having that GPU support. I bought this phone day one thinking it would be good for local edge AI stuff, but it's mediocre at best. I'm glad E2B runs and performs solid though. I wish I could add a layer of TTS now to that on my phone and have it speak the response.
Danmoreng@reddit
Haha I tried Edge Gallery on my S25 and GPU support seems to finally work. However, now that I try to replicate that and put it into my own Android App for transcription and summaries: GPU is broken again and I have to rely on CPU for now. Really sad that Android LLM GPU support is so bad still.
TopChard1274@reddit
On my phone (Xiaomi 13 Ultra) it works on GPU. It has GPU acceleration enabled from the get-go. E4B is 50-60 tps I think, but there are no real numbers to prove it.
rorowhat@reddit
It's early, it will get better.
Acrobatic_Stress1388@reddit
Don't these phones have an npu to use for this stuff?
Interpause@reddit
even using the qualcomm gen 5 elite, NPU is slower than GPU (using nexa sdk to test)
OpinionatedUserName@reddit
OpenCL works; Vulkan is a nightmare to implement.
Foreign_Risk_2031@reddit
Two different divisions
Dunkle_Geburt@reddit
Just out of curiosity, what are use cases of such a small model on a phone?
Willing-Ice1298@reddit
to flex with friends
OpinionatedUserName@reddit
Gemma 4 E2B is surprisingly efficient for function calling and web search, better than Gemma 3n E4B.
TopChard1274@reddit
How do you use web search with it on phones? With what app do you use e2b?
OpinionatedUserName@reddit
This is a custom Android app I built primarily for VLM purposes (I wanted a second set of eyes on charts; not a really effective solution so far), but I also integrated chat later on, with agent capabilities.
New_Patience_8107@reddit
Do you pull in any functionality from edge gallery or how do you run the LLM locally in your app?
OpinionatedUserName@reddit
This is running llama.cpp on CPU/OpenCL. Vulkan is a nightmare to compile on Windows; a WSL build is a must, and even then driver support is not guaranteed (I've not been successful with the latest versions so far). If you want to go the LiteRT-LM route, the Google Edge Gallery one, that is a little faster, but you'd be at the mercy of the LiteRT community or others on Hugging Face for ready-made models to download and test. Unless you convert and make your own models, that path is limited. That said, LiteRT-LM has been updated four times in the last 2 months, and it's looking good. If you are a developer, then adding your own requirements to Edge Gallery will be easier, more stable, and more satisfying. You can experiment with Android Studio + Antigravity.
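For anyone curious what the llama.cpp route looks like in practice, a cross-compile for Android with the OpenCL (Adreno) backend is roughly as below. This is a sketch only: the NDK path and version are placeholders, and the flags follow llama.cpp's OpenCL backend build docs at the time of writing, so check them against your checkout.

```shell
# Cross-compile llama.cpp for Android with the OpenCL backend.
# ANDROID_NDK path/version is a placeholder for your local install.
export ANDROID_NDK=$HOME/android-ndk-r27

cmake -B build-android \
  -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-28 \
  -DGGML_OPENCL=ON

cmake --build build-android --config Release -j
```

The resulting binaries can then be pushed to the phone (e.g. via adb into Termux) and run against a GGUF model.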
TopChard1274@reddit
Off grid crashed my phone twice (Xiaomi 13 Ultra). It crashed my friend's Samsung Galaxy S25 as well. If the model is too big or is not supported, the app has no safety protocols, so it just crashes the system. I won't touch that until the devs fix it.
swfsql@reddit
This gives Iron Man vibes
Prestigious-Use5483@reddit (OP)
Mostly general tasks and research. No coding.
InverseInductor@reddit
How are you going to use it for research if it can't search the web?
Medium_Ganache_6613@reddit
It can do tons: (i) summarise related literature papers that you have as PDFs; (ii) help you with drafting documents, from improving the writing itself to having it formatted properly; (iii) the training knowledge is enough for you to discuss logical arguments and extensions; (iv) anything to do with translation; (v) help you with organization, such as generating a detailed to-do list with calendar dates related to the completion of a task.
Mkengine@reddit
I will travel to India next week for work, and I feel better having a local model in my pocket for translation and some Q&A stuff. I don't know how I will get mobile Internet access there, so it feels good to have at least that.
DR4G0NH3ART@reddit
If you have a sim, you will have internet anywhere in India, it is quite cheap. Also you will be surprised even street vendors will have QR codes for payment in google pay or phonepe. The speeds are also very competitive unless you go too rural.
pallavnawani@reddit
Buy a SIM from any provider and buy a pre-paid plan. Buying a SIM might be the issue (Or it might not be) as it requires ID of some sort.
Flimsy-Blueberry8089@reddit
I did some benchmark and I am impressed with this model.
bidutree@reddit
I'm running a local pipeline on my old iMac (2011) with an i7 CPU, and the model performs really well there too. Of course, it takes a while, but I let it run while doing other things or overnight while I sleep.
Analyzing a text of about 12,000 tokens takes around 50 minutes - very slow compared to modern systems, but completely workable if you accept the prerequisites. :))
For shorter texts (the 12,000-token example above is a transcription of a 1-hour talk), the model is of course much faster even on my old machine, at about 7.5 t/s.
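Those figures hang together, for what it's worth; a quick back-of-the-envelope check (pure arithmetic, nothing model-specific):

```python
# End-to-end throughput from the figures above: ~12,000 tokens in ~50 minutes.
prompt_tokens = 12_000
elapsed_min = 50

overall_tps = prompt_tokens / (elapsed_min * 60)
print(f"~{overall_tps:.1f} tok/s end-to-end")  # ~4.0 tok/s

# Generation alone was reported at ~7.5 t/s, so on a long input a good
# chunk of the wall-clock time is prompt processing, not generation.
```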
Adventurous-Paper566@reddit
I'm running it with an 8K context length on a Galaxy S10e from 2019 (6 GB of RAM), and the outputs are generated faster than I can read them. WOW!
EstablishmentOne633@reddit
Check out this link: https://developers.google.com/ml-kit/genai/aicore-dev-preview?hl=en
You can get access to Gemini Nano 4, based on Gemma 4. It runs directly on the hardware using the NPU and is already visible in the Google Edge Gallery via AICore. I've tested it myself on a Pixel 10 Pro and it works really well; the performance is significantly faster than on CPU.
Prestigious-Use5483@reddit (OP)
Do you have a solution for the responses cutting off? I tried both Gemini Nano and Nano Flash, and they both cut off responses. Also, there isn't any way to increase the context higher than 4.5K from what I could tell. Really strange that Google would lock the TPU accelerator behind a beta, but I did love the speed boost.
EstablishmentOne633@reddit
I'm running into the exact same issue. I actually dug through the Google Edge Gallery source code to see if there's any way to change those limits or the context window, but it looks like there isn't a version that fully supports AI Core yet.
Prestigious-Use5483@reddit (OP)
I actually figured out a workaround. Download the LiteRT-LM model from Hugging Face and add it manually in AI Edge Gallery.
https://huggingface.co/litert-community/gemma-4-E4B-it-litert-lm
https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm
There are some caveats: it's limited to 4K context, there's no thinking toggle, and it says it's using CPU instead of NPU, but it actually runs at or close to the speed of the NPU.
The LiteRT-LM models are formats specific to edge AI and take advantage of the Tensor chip in the phone. Try it and see :)
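If it helps, the manual download step can be scripted with huggingface-cli; a sketch, assuming the model file uses a `.litertlm` extension (the `--include` pattern is a guess on my part, adjust it to whatever files the repo actually contains):

```shell
# huggingface-cli ships with the huggingface_hub package.
pip install -U "huggingface_hub[cli]"

# Fetch just the model file(s) into a local folder.
huggingface-cli download litert-community/gemma-4-E2B-it-litert-lm \
  --include "*.litertlm" \
  --local-dir ./gemma-4-e2b

# Then in AI Edge Gallery: import the downloaded file as a local model.
```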
Prestigious-Use5483@reddit (OP)
I saw that in the app. It seems interesting. I'll have to check it out. Thanks.
TopChard1274@reddit
I tried e4b on Xiaomi 13 Ultra with your suggestion app. It's blazingly fast. Incredibly smart for brainstorming.Â
The negatives are pretty disheartening. The app freezes while you write, but lets you keep typing... you just have to restart the app to get it working again, and you lose everything you wrote. Every time you start a chat, it takes over a minute to load the model. If you modify the app parameters, the app crashes. The thinking model needs to be turned on every time, which adds another minute of waiting.
It's like Google made the app with vibe coding. Terrible all around. No general prompt option? It seems to serve only the purpose of being a "show off" app. Fast, but at the price of everything else.
Then I asked it a few short original riddles to see how intelligent it is. It couldn't figure out any of them. Qwen 3.5 is the only one at 4-9B smart enough to really "think".
ZootAllures9111@reddit
I don't find Qwen 3.5 4B to be better in any way than Gemma 4 E2B, personally. It thinks for massively longer to arrive at the same answer at best, and I'd argue it also has a noticeably less coherent grasp of English in general (i.e. I often find Qwen interprets questions in strange ways that don't make sense to me as a native English speaker).
swfsql@reddit
Have you tried Qwopus 3.5? It is supposed to have higher-quality thinking; it's Qwen finetuned on Claude Opus prompts/responses.
TopChard1274@reddit
Not the base. Try this one Jackrong/Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled-GGUF.
I also found out that Qwen3.5-9B-Claude-4.6-HighIQ-THINKING-HERETIC-UNCENSORED.i1-IQ3_XS fits on an 8 GB RAM iPad Pro M1. It's much more potent, and despite being slower (around 8 tps compared to the 4B's 14 tps), it's still very much usable.
tiffanytrashcan@reddit
Using GPU (surprisingly the default option now) on an Adreno 710 is quite a bit faster, but Qualcomm did something dirty with those drivers.
Random languages start getting spit out in the thinking. It's sad watching it try to recover. "Wait, no, I should output in English as the input language was in English." Fighting to not output the random string of Arabic.
powerfulparadox@reddit
Android GPU drivers are notoriously terrible, and Adreno seems to often be among the worst. At least, that's what I've gathered from lurking on the fringes of various developer communities for years. When a bunch of really good developers are all complaining about driver bugs making their lives more difficult, it starts to feel like the Android community is being left out (not that anyone has particularly amazing GPU drivers, apart from the Mesa project).
tiffanytrashcan@reddit
It's heavily model dependent. There's certain ones that work flawlessly with Mesa.
710/720 support is still a hack, and now I realize that it's deeper in the Android device than the Mesa project could even fix.
I got similar behavior trying to use the GPU and Vulkan with KoboldCPP (and Mesa on Termux.)
I get some graphical glitches in games using gamenative or gamehub lite. It's mind-blowing the 710 is even powerful enough to play Fallout 4.
I thought all that was the Mesa driver being problematic, but Edge Gallery uses native Android hooks for the GPU.
powerfulparadox@reddit
Oh. If I wasn't clear enough that pretty much all graphics drivers are terrible, that's on me. The real problem is that they're all terrible in different ways (which makes graphics programming, and by extension GPU runtimes for LLM inference, much, much worse than the nice, shiny programming standards would like you to believe).
I've seen enough complaints about bugs in graphics drivers to be impressed that any complex software runs on GPUs at all. Obviously this is slightly exaggerated and a bit unfair (GPUs are complex, graphics APIs are complex, operating systems are complex, and getting things right is hard), but Android in particular exacerbates the problem by having multiple layers of abstraction between those having the problem and those in a position to do something to fix it. Chip manufacturers go out of their way to have proprietary drivers that even device manufacturers might not get to modify, there aren't many chips/GPUs with good documentation, and the sheer variety of combinations of chips and driver versions means developers can't expect any versions of their software to rely on any released improvements in drivers.
It's honestly one of the most impressive things I can think of that anything works reasonably well on Android GPUs across more than one GPU/driver combination.
dhruvanand93@reddit
I don't see the new models in the Edge Gallery app. I'm on a OnePlus though.
Hello_my_name_is_not@reddit
Make sure you update the app. I'm on an S22 Ultra and didn't have it either.
But I checked the Play Store and saw there was an update that didn't auto-run.
Updated, and now I have it, and the UI is slightly changed.
Tiny-Sink-9290@reddit
What are you using it for? It's so tiny it's not going to be anything close to a daily driver for all things "AI chat"... right? So what do you use it for on your phone?
Revolutionalredstone@reddit
There are no "E" models as such. The E before the 2B means "effective 2B": the model is actually ~5B parameters, but ~3B of them just sit there for multimodal / other languages.
Prestigious-Use5483@reddit (OP)
I see. Thanks for the clarification.
Super-Strategy893@reddit
I tested it here and it's running on the GPU, using LiteRT as the backend. It's an evolution of TFLite and still needs support for some GPUs and NPU-type accelerators.
Prestigious-Use5483@reddit (OP)
Ah thanks for the info