Did Google hide the best version of Gemma 4 e4b in Android? The extracted model beats Unsloth and everything else I've tried.
Posted by LawyerCompetitive478@reddit | LocalLLaMA | View on Reddit | 108 comments
Why does Gemma 4 e4b from Google AI Edge Gallery on Android weigh only 3.6 GB, while the one from Unsloth (gemma-4-E4B-it-UD-Q2_K_XL.gguf) weighs 3.7 GB? And why does the model image in litertlm format, extracted via adb from Google AI Edge Gallery on Android, act smarter than every version I've downloaded from the internet and tried? The one from litert-community/gemma-4-E4B-it-litert-lm turned out to be especially buggy: it writes completely incoherent text in Russian. Does anyone else see this, or did I get confused somewhere, or am I hallucinating from lack of sleep?
Fit-Produce420@reddit
Yes, I can explain. You see, Gemma 4 was made by highly paid engineers at Google who designed the model and the edge app, and who understand how to properly serve it.
Your community fine tune was made by random strangers who don't know anything.
Hope that helps.
Ylsid@reddit
Unsloth are random strangers who don't know anything? They post on this sub you know 🍿
Fit-Produce420@reddit
He referenced some random quant; it wasn't an Unsloth release.
relmny@reddit
Yes, and so are Bartowski, Ubergarm, AesSedai, etc... they are all "random strangers who don't know anything" according to that commenter and almost 500 more...
That's the state of LocalLLaMA: make a wild statement with no sense or connection to reality, insult a few people, and you're the most upvoted...
All of us who have used their quants daily for a long time have no clue about anything...
TotallyToxicToast@reddit
I think this is mostly about comparing the model that the playstore version of AI Edge pulls from google servers with the one that the opensource version pulls. (litert-community/gemma-4-E4B-it-litert-lm)
An apples-to-apples comparison: specifically, LiteRT to LiteRT.
Ylsid@reddit
Haha, I know, it was just funny that OP was shit-talking devs who might see his post as a consequence lol
KickLassChewGum@reddit
And they ain't exactly known for having particularly thick skin either. 🍿🍿
Ylsid@reddit
This gon be gud
LawyerCompetitive478@reddit (OP)
I use them because there are few alternatives; they really do a lot of work and are the first to release models. But their Unsloth Studio is something terrible, especially on Linux, so I have to use llama.cpp.
ps5cfw@reddit
Also not sure what level of quality, if any, a Q2 quant can produce out of such a small model.
Dabalam@reddit
It seems the argument is that the small Q2 is a similar size as the same model extracted from a different source. There's an assumption that similar methods must have been used to get the same model to that size, but performance is notably different.
tiffanytrashcan@reddit
It's a totally different format. Google LiteRT doesn't work with GGUF files; llama.cpp doesn't run the RT blobs.
We see the same thing with AWQ (or whatever it's called) for vLLM, or with MLX-based quants for Macs.
scknkkrer@reddit
Can you elaborate?
tiffanytrashcan@reddit
LiteRT is a different backend, primarily built for Android devices, with Google's custom tricks for better performance.
GGUF files are made to work with the ggml library, which the llama.cpp backend uses and which powers a ton of tools.
MLX is optimized for Apple and Metal.
These differences aren't just about the quantization (compression); the internal structure is totally different. For example, you have to bake the vision projectors into LiteRT - it's all one file, not separate like gguf/mmproj.
Think of HD-DVD vs Blu-ray: different lasers to read them, same source material, a similar output.
scknkkrer@reddit
You know that all these formats represent the same thing, right? You're telling us conversion degrades performance because they do it wrong?
tiffanytrashcan@reddit
"You know that all these formats represent same thing, right?"
Obviously. Key word being "represent", as they aren't packed the same, nor are they compatible. Hence my comparison: "different backends... same source material, a similar output."
"You telling us conversion degrades performance because they do it wrong?"
NO, I didn't remotely imply that. Who even is "they", and which conversion exactly are you assuming?
scknkkrer@reddit
First of all, in a theoretical sense (and it should hold from an engineering perspective too), they are the same thing. They are literally reflections of the same mathematical model. Anything you want to object to so far?
tiffanytrashcan@reddit
Okay, go try to run a gguf file in Google Edge Gallery. Same thing right?
How about some AWQ or MLX quants in llama.cpp? (Or the TF / RT blob.)
🤣 🤡
-Ellary-@reddit
EbbNorth7735@reddit
On this note, I'd be curious whether an fp8 beats the quant from Google.
LawyerCompetitive478@reddit (OP)
Yes, it surpasses it in size, speed, and quality.
igorgo2000@reddit
One is formatted as LiteRT and the other one is GGUF - what's not clear? Different file format encoding = different sizes... Google it 😂
LawyerCompetitive478@reddit (OP)
Don't make me angry, mister. The main question is why it works so well at this size, while the "public version" barely copes and seems 2-3 times dumber. Yes, Google probably adjusted the quantization with the training data somehow, whatever that means.
igorgo2000@reddit
There are many reasons why one will perform better than another, and a lot of this has to do with how these models are optimized to run on the CPU, GPU, and NPU. You're not sharing any information about where you're trying to run these models: on what hardware, which OS, and in which app? Are you running it on a Mac, an iPhone, an Android device, or what? Are you using llama.cpp or Google tooling? LiteRT (which is TensorFlow Lite) is Google's runtime, and its models run best on devices with dedicated hardware accelerators, particularly Android devices with modern NPUs and GPUs. So you've got to start doing some more reading, or just Google it... lots of people already commented the same thing below, and you just keep saying the same thing over and over again...
LawyerCompetitive478@reddit (OP)
Thank you for asking the AI for the answer instead of us, so now we won't have to. You literally wrote to the artificial intelligence "how to respond to this message in a cool way". I made it clear in plain text that this is the model from Google AI Edge Gallery on Android. Just read the rest of the messages, or press CTRL+A and send the entire text to your AI chat. It makes no difference where you run the same model file; at the very least, there won't be any differences in the model's responses with the same settings on different devices, provided the model file is the same. The difference in response speed can be significant, though, and if you don't have a graphics card in your PC, the processor will be even slower than a more or less modern phone.
igorgo2000@reddit
You really need help... Maybe it's best for you to ask AI 😂🤦♂️ I can't really add any value for you beyond what's already been said... Wishing you all the best in your AI endeavors, you'll need it.
LawyerCompetitive478@reddit (OP)
Okay, I wrote a long reply explaining why you're wrong, but that post can't be helped anymore. I tried adding links to the main post, and that's why it was automatically deleted. Not because you're right or anything. Okay, good luck to you anyway and thanks for your answers, it’s better than just keeping quiet!
igorgo2000@reddit
You literally changed your post on the fly - you started with a post about the Gemma 4 model in different file formats not being the same size (GGUF vs LiteRT), then you changed it to the Gemma 4 LiteRT model that can be downloaded from Hugging Face being different from the Gemma 4 LiteRT model that Google links from their AI Edge Gallery Android app... and still you failed to hear what others commented to you about that (after some valid confusion related to the links)... As I said, I can't add much to what has already been shared with you below, as I feel it's useless.
tiffanytrashcan@reddit
Just because other people understand LiteRT doesn't mean they have to use AI. Perhaps you should ask AI about it, though. You obviously don't understand this technology and the differences.
Please tell us what you're using to run the TF blob on a PC. Please tell us how that GGUF file you mentioned works in Google Edge Gallery.
igorgo2000@reddit
Yeah, given what he wrote and his responses, never actually saying what he is doing or what his setup is (what he is running, his environment, etc.), and in spite of many people asking him about it, he just ignores them and regurgitates the same thing over and over again... I think he read something online or in some thread and is just copy-pasting some question/comment about someone's experience without any clue what he is saying or what it all means...
TotallyToxicToast@reddit
OP correctly noted that something strange is going on; the performance difference is abnormal even across quantization methods. They left out (or had not yet done) a lot of the extra checks to make sure that there really is something different with this specific Google model, specifically in the Play Store version of AI Edge.
If you follow the investigation in the other comments you will see that the app store version of AI Edge downloads a different LiteRT model (straight from google servers) than the open source version (which loads from Huggingface).
The Performance difference seems to still be there between the two LiteRT models.
This is big IMHO and a genuine finding by OP, although they could have communicated a bit better and with a less confrontational tone I guess.
igorgo2000@reddit
Of course something is different - these are not identical models. One is formatted as LiteRT and the other one as GGUF... and these will be different sizes. The OP is not sharing what hardware and environment configuration he is using to test these models, or on what devices. If he is comparing the performance of a LiteRT model on an Android device against a GGUF model running through a llama.cpp-powered app on the same device, the LiteRT one will work significantly better... but we don't even know if he is trying to run a LiteRT model on, say, a Mac or what...? And I did read his subsequent comments, which again say pretty much the same thing over and over again... if you can articulate better what he is talking about after so many people have already responded, please do... I also have an example of how mature he is - he replied to a comment I personally typed last night, claiming it was AI-written 🤦♂️🤷♂️ I don't know what he is doing there, but whatever. He can simply Google LiteRT vs GGUF and get all of the information he needs...
LawyerCompetitive478@reddit (OP)
Thank you for your support, but your comment reads like a review from an AI chat, without your own opinion.
TotallyToxicToast@reddit
? No I wrote that myself, no AI. Not sure what made you think that.
Also without my Opinion?
I literally wrote IMHO "In My Humble Opinion". My opinion is that it is genuinely a big finding, but you could have communicated it a bit better. That's all.
LawyerCompetitive478@reddit (OP)
Well, sorry :) Maybe I was wrong; I can't prove it, but to me it looks like a reply via AI chat, with your or the AI's opinion added at the end. Okay, I just didn't get enough sleep, so I'm being slow today. Otherwise, thank you very much!
TotallyToxicToast@reddit
AI chat finds a lot better ways to include asides instead of awkwardly having the ( ) brackets everywhere, which is what I always do.
Also, you can see that I capitalized "Performance". I sometimes make this mistake in English because I am a native German speaker, and in German we capitalize all nouns.
But I also can't "prove" that I did not use AI :)
I come from academics and my writing style is at least influenced by writing papers so maybe that's why it seems that way.
TotallyToxicToast@reddit
By the way, the reason I started replying to comments is that I initially had a similar reaction. I thought there must be some misunderstanding on your side, and some of the top comments seemed to confirm that. Once I read further it was confusing, and now I think you are right. Your comments that clarify things are buried and sometimes downvoted, so I understand where other people get the impression, since I had the same first Impression.
You were right, but this post makes that not easy to find. So I started replying to some of the comments to clarify.
LawyerCompetitive478@reddit (OP)
I don't even know what to answer, thank you very much then! :)
LawyerCompetitive478@reddit (OP)
As everyone asked: huggingface.co/Hugginf/Gemma4-e4b-ai-edge-gallery-extracted/tree/main
AyraWinla@reddit
Much appreciated!
Sorry to ask for more, but do you have E2B also..? I'm very interested by that one.
Most of my LLM time was with Gemma 3 E2B on my Pixel 8a phone, so I was pretty excited about Gemma 4. The IQ4_NL gguf running on ChatterUI was really good: excellent quality and decent speed, so it's definitely my new main model.
... But the version running on AI Edge was still much faster. So I've been using AI Edge for quick queries and ChatterUI for when cards and other features are useful. But Layla has added experimental LiteRT-LM support, and using the litert-community E2B download, it's running pretty darn fast. After very brief use, it didn't strike me as dumb and seems very usable (excitingly so; outside of LFM 1.2b, it's the first time I've gotten a model that writes faster than I can read on my phone).
But if the Google AI Edge version of E2B is smarter than the litert-community one, I'm very interested in getting that one since it's very likely going to be my main model for months to come.
LawyerCompetitive478@reddit (OP)
I finally got this model out :) Link in the comment above
AyraWinla@reddit
Awesome, thank you very much!
LawyerCompetitive478@reddit (OP)
I didn't extract this one, but you can do it.
dinerburgeryum@reddit
You the man thanks
coder543@reddit
The "extracted" copy lacks the MTP drafter. That's the only difference.
dinerburgeryum@reddit
Oh word thanks. I’ve not used LiteRT LM so I didn’t have the tooling to inspect it installed.
LawyerCompetitive478@reddit (OP)
That's what I'm talking about, I couldn't find this file anywhere, it's not publicly available.
coder543@reddit
If it is not the same, then the file was corrupted somehow during your extraction process. The code is clearly visible. You can see that it is downloading from Huggingface.
IllumiZoldyck@reddit
watch out, malware embedded
dinerburgeryum@reddit
Is that possible? Does litertlm unpickle or the like? I didn't think it did; it's primarily meant for C++ inference, and the Python and Rust bindings are available only for convenience.
LawyerCompetitive478@reddit (OP)
Good joke
LawyerCompetitive478@reddit (OP)
Can anyone try running a file from Google AI Edge Gallery in a browser, using browser technologies? I would be very grateful. I'm trying too, but it hasn't worked yet.
https://huggingface.co/Hugginf/Gemma4-e2b-Google-AI-Edge-Gallery-Extracted/tree/main
WhoRoger@reddit
Idk if anyone mentioned it, but apparently Gemma 4 has multi-token prediction (MTP) capability, but it's disabled in the community versions. Idk if Unsloth or anyone else re-enables it at this point. Maybe that makes a difference?
LawyerCompetitive478@reddit (OP)
unfortunately not anymore
WhoRoger@reddit
Wdym not anymore
LawyerCompetitive478@reddit (OP)
This is my opinion: they are unlikely to redesign the model; they have enough work to do, especially since usually no one even notices the difference.
WhoRoger@reddit
My point is, if they have MTP enabled on their own version, it may perform so much better that they can get away with a stronger quant or with throwing out other parts of the model. So maybe that's why it's smaller.
Insurgent25@reddit
Yes, I read that it is enabled in LiteRT but not in other frameworks due to compatibility issues, so they turned it off for the wider release.
That's probably the thing affecting performance.
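(If "MTP drafter" is unfamiliar: the model drafts several future tokens cheaply and a full pass verifies them. A toy Python sketch of that draft-and-verify flow, with made-up stand-in functions rather than Gemma's actual heads:)

```python
# Toy sketch of draft-and-verify decoding (the idea behind an MTP
# drafter). The two "models" below are arbitrary stand-ins, NOT the
# real Gemma networks -- only the accept/reject flow is the point.

def draft_next(ctx):
    # Cheap drafter: fast, sometimes wrong.
    return (sum(ctx) * 3 + 1) % 10

def target_next(ctx):
    # Full model: the token we actually want.
    s = sum(ctx)
    return (s * 3 + 1) % 10 if s % 4 else (s * 3 + 2) % 10

def speculative_step(ctx, k=4):
    # The drafter proposes k tokens in a row, cheaply.
    proposed, c = [], list(ctx)
    for _ in range(k):
        t = draft_next(c)
        proposed.append(t)
        c.append(t)
    # The target verifies the proposals (one batched pass in a real
    # system). Keep agreeing tokens; on the first mismatch, emit the
    # target's own token and stop, so output == plain target decoding.
    out, c = [], list(ctx)
    for t in proposed:
        want = target_next(c)
        out.append(want)
        c.append(want)
        if t != want:
            break
    return out

print(speculative_step([1, 2, 3]))  # full acceptance: several tokens per step
print(speculative_step([4]))        # early mismatch: falls back to one token
```

In this exact-verification scheme the accepted text matches what the full model alone would produce, so the win is speed; a repackaged model that simply drops the drafter should be smaller and slower, not dumber.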
gpalmorejr@reddit
I think it is important to realize that LiteRT is a specialized format made for specific hardware. GGUF is a general compatibility format that also includes the tensors, chat templates, architecture, etc. It is made to be able to run on anything, and as such includes a lot of boilerplate to tell the hardware how to use it. For the LiteRT models, a lot of that information lives in the app itself and in the hardware drivers for the specific mobile GPUs and NPUs it is designed for.
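(You can see that self-describing GGUF metadata for yourself. A minimal sketch using the `gguf` Python package from the llama.cpp project, pointed at the Unsloth quant from the OP's post:)

```python
# Hedged sketch: dump the self-describing metadata a GGUF file carries.
# Requires `pip install gguf` (the llama.cpp project's package);
# substitute the path to whatever quant you downloaded.
from gguf import GGUFReader

reader = GGUFReader("gemma-4-E4B-it-UD-Q2_K_XL.gguf")

# Every key/value pair baked into the file: architecture, chat
# template, tokenizer, quant info... the "boilerplate" a LiteRT bundle
# instead leaves to the app and drivers.
for name in reader.fields:
    print(name)

print(len(reader.tensors), "tensors")
```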
HenkPoley@reddit
Unsloth is secretly "a bunch of cowboys". They get results, but there is often some correctness issue. Count on Google to get these details right, though.
tiffanytrashcan@reddit
LiteRT =/= GGUF. It's not even made for llama.cpp, just like you can't run GGUF files in the Edge Gallery app.
Third-party apps that give you the option lose performance in llama/GGUF mode, because Google has an entire AI toolchain and framework that ties in driver-deep for LiteRT.
Although conversion is fairly trivial, since most things are the same, the quantization/compression techniques are different. This leads to the different size, and potentially to differences in quality and performance. But the major characteristic is that you're using an entirely different backend to run the model.
This is very similar to MLX on Macs vs gguf files.
LawyerCompetitive478@reddit (OP)
Thank you for asking the AI for the answer instead of us, so now we won't have to. It's not even about the size of the model, but its incredible quality. Experiment: download the model from the litert-community repo and import it into Google AI Edge Gallery. Set the temperature to zero and compare the responses - they will be different. And if you speak a less popular language than English, the difference is noticeable from the first prompt.
Expensive-Paint-9490@reddit
Zero temperature only affects the logit distribution; if you have different seeds, you'll still get different responses.
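(What temperature and seed each control depends on the sampler. A minimal numpy sketch of the usual logic, illustrative only, not Edge Gallery's actual sampler:)

```python
# Illustrative sketch of common sampling logic, not any specific
# runtime's implementation.
import numpy as np

def sample_token(logits, temperature, seed):
    rng = np.random.default_rng(seed)
    if temperature == 0:
        # Greedy decoding: pure argmax, the seed never enters into it.
        return int(np.argmax(logits))
    scaled = logits / temperature      # temperature rescales the logits
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()               # softmax
    return int(rng.choice(len(logits), p=probs))

logits = np.array([2.0, 1.5, 0.1])
print(sample_token(logits, 0.0, 1), sample_token(logits, 0.0, 2))  # always equal
print(sample_token(logits, 1.0, 1), sample_token(logits, 1.0, 2))  # seed-dependent
```

In this scheme, at exactly zero the sampler collapses to argmax and the seed never matters; at any positive temperature the seed picks the sample, which is the disagreement here.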
LawyerCompetitive478@reddit (OP)
Once you insert a seed, it's literally no longer the same query, and not all runners insert the seed into the context. What I meant in the answer above is that if you ensure the settings are fully consistent, it won't make any difference what hardware you're running on. At the very least, the model won't give different answers on different devices. It can't be "smarter" on a specific device, only faster if you're lucky.
tiffanytrashcan@reddit
A. What? I had nothing to ask an AI about. What exactly would I ask you? (Who is "us"?)
I played with Edge Gallery long before the Play Store release, within days of the original early GitHub release. Thanks to this sub I learned about TFlite a long time ago.
B. Your original post here directly calls out a gguf file. This is what I was responding to. Go ahead, try to import that into Edge Gallery.
Trying to be more productive now,
You've potentially noticed something interesting: the Play Store version of the app downloading a different model from a different source than what's on Hugging Face.
Focus on the hard evidence that's been gathered there (the JSON files linking the downloads). In that vein, I greatly appreciate the checksum comparison.
Coming back to your experiment here: CPU or GPU generation? I noticed someone else mention it spits out Chinese; I had it randomly roll languages together when using GPU. (I'm assuming it was Adreno driver issues, but maybe not.)
For these tests, we should be focusing on CPU only. An older image-gen app showed interesting behaviour: set the same seed and prompt on CPU and the output is always the same; GPU is not deterministic. Qualcomm does bizarre stuff and the others are even worse.
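(For anyone who wants to reproduce that checksum comparison, a minimal sketch; both file names are placeholders for wherever you saved the Play Store pull and the Hugging Face download:)

```python
# Hedged sketch: compare SHA-256 checksums of two .litertlm files to
# see whether the Play Store download and the Hugging Face one are
# byte-identical. Both file names are placeholders.
import hashlib

def sha256(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

a = sha256("gemma-4-E4B-it-play-store.litertlm")  # pulled via adb
b = sha256("gemma-4-E4B-it-litert-lm.litertlm")   # from litert-community
print(a)
print(b)
print("identical" if a == b else "different files")
```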
coder543@reddit
Unsloth optimizes for English performance.
AI Edge is open source, so nothing is hidden, and nothing needs to be extracted via ADB. No need to be dramatic.
LawyerCompetitive478@reddit (OP)
Can you then see where the model is downloaded from?
ANR2ME@reddit
Aren't they downloading it from https://huggingface.co/litert-community/gemma-4-E4B-it-litert-lm/tree/main ?
LawyerCompetitive478@reddit (OP)
no
coder543@reddit
It literally downloads from huggingface: https://github.com/google-ai-edge/gallery/blob/main/model_allowlists/1_0_12.json#L40
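(To check that allowlist yourself without unpacking the APK, a quick sketch that fetches the JSON and prints every URL in it; the raw.githubusercontent.com path just mirrors the link above, and note the Play Store build apparently ships a different allowlist, per the comments below:)

```python
# Hedged sketch: list every download URL in the gallery's model
# allowlist. Works on the raw JSON text, so it assumes nothing about
# the field names inside the file.
import re
import urllib.request

ALLOWLIST = ("https://raw.githubusercontent.com/google-ai-edge/gallery/"
             "main/model_allowlists/1_0_12.json")

with urllib.request.urlopen(ALLOWLIST) as resp:
    text = resp.read().decode("utf-8")

for url in sorted(set(re.findall(r'https?://[^"\s]+', text))):
    print(url)
```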
As codex explains:
LawyerCompetitive478@reddit (OP)
Mister, you've really pissed me off. What the hell does the code prove? Did you unpack the APK? Did you check the network requests? No. I did, and I saw that the requests weren't going to Hugging Face. You just fucking asked an AI about the code – brilliant.
coder543@reddit
yes, to explain the open source code, which it did.
TotallyToxicToast@reddit
I am confused; different replies claim different things.
It seems in the end it did not pull from Hugging Face after all?
From another reply
And yet another reply
LawyerCompetitive478@reddit (OP)
you are my hero
Hougasej@reddit
From the Hugging Face litert-community repo, or from Google on Hugging Face: https://github.com/google-ai-edge/gallery/blob/main/model_allowlists/1_0_12.json
TotallyToxicToast@reddit
It seems the Play Store version has a different model allowlist.
fatihmtlm@reddit
I gave it a try the other day (Google Play version); it asks me to read & accept too much. I need to check GitHub or somewhere else, I guess.
DistanceOk7532@reddit
Try: https://play.google.com/store/apps/details?id=com.llmhub.llmhub&hl=en_US
and read: https://grok.com/share/c2hhcmQtMg_5c39fa60-a105-4d0f-b67c-4578991dd47d
LawyerCompetitive478@reddit (OP)
Trash advertising. Use the official Google AI Edge Gallery on Android or iPhone, from the official store.
chaitanyasoni158@reddit
Yeah, I had that same problem too, but with E2B on LiteRT. It just started spewing Chinese no matter how I tried to prompt it. E4B worked out of the box for me, though.
LawyerCompetitive478@reddit (OP)
Try the file I posted above - the model works great even in 2B.
Worried-Squirrel2023@reddit
The Android version probably has different quantization or distillation that's optimized for low-RAM inference. Nothing nefarious, but worth comparing the actual config files side by side. Could just be that the Android variant skips features the desktop version includes.
LawyerCompetitive478@reddit (OP)
false
SeriousPanic34@reddit
I wonder where we can download the Android version? I have plans to run E4B on a 1060 for a small project, and while the normal Unsloth quant fits, it still offloads to RAM... Would be nice to try the Android one, if it's not too lobotomized in comparison.
LawyerCompetitive478@reddit (OP)
extracted via adb from Google AI Edge Gallery on Android
SeriousPanic34@reddit
I mean, the Edge app downloads it from somewhere, right? And someone very likely has already uploaded it somewhere... maybe I'm just overestimating the difficulty of the adb route.
LawyerCompetitive478@reddit (OP)
inky_wolf@reddit
Pretty sure the above-posted Hugging Face link is where the model is downloaded from. That's also the model card link straight from the app. The file size matches too.
Also, if you look at the repo, the commit hash in model_allowlist.json matches the repo too.
Where exactly did you get the "from Google servers" from?
LawyerCompetitive478@reddit (OP)
Read the message below. I posted a screenshot. The sizes are different. The cache is different.
inky_wolf@reddit
Ah, interesting. I checked with the PCAPdroid app and can (re)confirm that the model is being downloaded from dl.google.com.
On a sidenote, I downloaded the model from the litert-community/gemma-4-E4B-it-litert-lm repo and loaded it up in the Edge Gallery app:
- ran the benchmark: it's slower (almost 3x slower prefill speed)
- tried AI chat (on GPU): its responses to the car wash problem were similar to the Google one's
- tried Ask Image: also similar response quality
Hougasej@reddit
Wait, the app from Google Play has a different model_allowlist? So that's the reason why the app from the GitHub APK asks for a Hugging Face API key, while the Google Play app doesn't.
Hougasej@reddit
It literally downloads from huggingface https://github.com/google-ai-edge/gallery/blob/main/model_allowlists/1_0_12.json
LawyerCompetitive478@reddit (OP)
rawdikrik@reddit
Can you share the file so others can confirm?
LawyerCompetitive478@reddit (OP)
You can extract it via adb pull; ask the AI. (Roughly as sketched below.)
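(A hedged sketch of that route: the package name and storage path are assumptions, and on newer Android versions the file may sit in app-private storage and need `run-as` or root:)

```python
# Hedged sketch: locate and pull the downloaded .litertlm blob with adb.
# The package name and search path are assumptions -- adjust them to
# whatever `find` actually reports on your device.
import subprocess

PKG = "com.google.ai.edge.gallery"  # assumed package name

# Search a likely storage location for the model blob.
found = subprocess.run(
    ["adb", "shell", f"find /sdcard/Android/data/{PKG} -name '*.litertlm'"],
    capture_output=True, text=True,
).stdout.split()

for path in found:
    print("pulling", path)
    subprocess.run(["adb", "pull", path, "."], check=True)
```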
Prize_Negotiation66@reddit
I don't want to bother; upload the file to a cloud.
LawyerCompetitive478@reddit (OP)
I don't want to bother either :) Well, okay, if at least 5 people ask, I'll upload it.
some_user_2021@reddit
Will you do it for me? ❤️
dinerburgeryum@reddit
Yeah, do it, why not. I'd love to look at how it's packaged and compressed.
LawyerCompetitive478@reddit (OP)
Uploaded everything.
xadiant@reddit
Google probably calibrated their own quants with the original datasets.
LawyerCompetitive478@reddit (OP)
gemma-4-E4B-it.litertlm is a different size: 3.65 GB
jazir55@reddit
How do you run it? I've never seen that file extension.
LawyerCompetitive478@reddit (OP)
google ai edge gallery / google mediapipe https://ai.google.dev/edge/litert-lm/cli
jazir55@reddit
How do you run it on an Android phone?
LawyerCompetitive478@reddit (OP)
Download the Google AI Edge Gallery app from the Google Play Store.
antwon_dev@reddit
Following. I'm also trying to figure this out. The litertlm file has worked fine for me, but I am curious how they did it and why their audio processing is so much better.