TheaterFire

Gemma 3n Preview

Posted by brown2green@reddit | LocalLLaMA | View on Reddit | 165 comments


RomanKryvolapov@reddit

I added support for Gemma 3n (https://play.google.com/store/apps/details?id=com.romankryvolapov.offlineailauncher). It looks better than Gemma 3.
View on Reddit #60783912

ybhi@reddit

Why not Maid?
View on Reddit #74247138

tys203831@reddit

Is this model good for RAG (on text embedding)?
View on Reddit #56931369

condrove10@reddit

!RemindMe 1 week
View on Reddit #61736939

Puzzleheaded-Car8307@reddit

Anyone had luck with running it on Jetson Nano Super Dev. Kit (with ollama)? My RAM is maxing out. I tried the Effective 4B version.
View on Reddit #60289591

bick_nyers@reddit

Could be solid for HomeAssistant/DIY Alexa that doesn't export your data.
View on Reddit #56796891

kitanokikori@reddit

Using a super small model for HA is a really bad experience, the one thing you want out of a Home Assistant agent is consistency, and bad models turn every interaction into a dice roll. Super frustrating
View on Reddit #56809457

soerxpso@reddit

On the benchmarks I've seen, 3n is performing at the level you'd have expected of a cutting-edge big model a year ago. It's outright smarter than the best large models that were available when Alexa took off.
View on Reddit #56827093

dimensions2050@reddit

Agreed. I tested it with some old prompts I had given to Sonnet 3.5, and it matches the answers spot on. If this gets released for desktop, it's over. The last model I tried on desktop was the latest Phi, and it was pretty useless.
View on Reddit #59980926

privacyparachute@reddit

What are you asking it? In my experience even the smallest models are totally fine for asking everyday things like "how long should I boil an egg?" or "What is the capital of Austria?".
View on Reddit #56944096

GregoryfromtheHood@reddit

Gemma 3 models, even the small versions, are very consistent at instruction following. They are actually the best models I've used, definitely beating Qwen 3 by a lot. Even the 4B is fairly usable, but the 27B and even the 12B are amazing instruction followers, and I have been using them in automated systems with good results. I have tried other models, and even bigger 70B+ models still can't match them for uses like HA, where consistent instruction following and tool use are needed. So I'm very excited for this new set of Gemma models.
View on Reddit #56815869

kitanokikori@reddit

I'm using Ollama and Gemma3 doesn't support its tool call format but that's super interesting. If it's that good, it might be worth trying to write a custom adapter
View on Reddit #56817641

Ok_Warning2146@reddit

There is a gemma3-tools:27b for ollama. I used it for MCP.
View on Reddit #56851007

some_user_2021@reddit

On which hardware are you running the model? And if you can share, how did you set it up with HA?
View on Reddit #56840715

thejacer@reddit

Which size are you using for HA? I’m currently still connected to GPT but hoping either Gemma or Qwen 3 can save me.
View on Reddit #56812632

kitanokikori@reddit

https://github.com/beatrix-ha/beatrix?tab=readme-ov-file#what-ai-should-i-use-though (a bit out of date, Qwen3 8B is roughly on-par with Gemini 2.5 Flash)
View on Reddit #56812790

harrro@reddit

Also the prices are way off going by openrouter rates. GPT 4.1 mini is way more expensive than Qwen 3 14B/32B for example.
View on Reddit #56824131

kitanokikori@reddit

The prices for Ollama are calculated with the logic of, "Figure out how big a machine I would need to effectively run this in my home, assume N queries/tokens a day, for M years". It's definitely a ballpark more than anything.
View on Reddit #56824460

harrro@reddit

It'd make more sense to just use OpenRouter rates. That way you're comparing SaaS rates to SaaS rates.
View on Reddit #56825141

kitanokikori@reddit

Well I mean, so that's part of the conclusion that this data is kind of trying to illustrate imho - you can get a _lot_ of damn tokens from OpenAI before local-only pays off economically, and unless you _happen_ to just have a really great rig that you can turn into a 24/7 Ollama server already, it's probably a better idea to try a SaaS provider first. The worry with this project in particular is that without guidance, people will set up super underpowered Ollama servers, try to use bad models, then be like "This project sucks", when the play really is, "Try to get the automation working first with a really top-tier model, then see how cheap we can scale down without it failing"
View on Reddit #56826217

mister2d@reddit

Basically all I'm interested in at home.
View on Reddit #56805218

Juude89@reddit

https://preview.redd.it/6wehu2mgc22f1.jpeg?width=581&format=pjpg&auto=webp&s=bc6688f0775d9a2f221f2576a36058b3aaf36b8c It does not work well.
View on Reddit #56845579

abubakkar_s@reddit

Try setting a Good system prompt if possible, and what's the app name?
View on Reddit #56845883

_murb@reddit

I didn't see it in the Play Store, but it's on GitHub: [https://github.com/google-ai-edge/gallery](https://github.com/google-ai-edge/gallery)
View on Reddit #56870023

abubakkar_s@reddit

I tried it. Since the phone has only 4 GB of RAM, it was very slow, around 0.5-0.6 t/s.
View on Reddit #58789588

BobserLuck@reddit

Hah! Got it to run inference on a Linux (Ubuntu) desktop!

As mentioned by a few folks already, the .task is just an archive for a bunch of other files. You can use 7zip to extract the contents.

What you'll find is a handful of files:

- TF_LITE_EMBEDDER
- TF_LITE_PER_LAYER_EMBEDDER
- TF_LITE_PREFILL_DECODE
- TF_LITE_VISION_ADAPTER
- TF_LITE_VISION_ENCODER
- TOKENIZER_MODEL
- METADATA

Over the last couple of months, there have been some changes to TensorFlow Lite. Google merged it into a new package called ai-edge-litert, and this model now uses that standard, known as LiteRT ([more info on all that here](https://ai.google.dev/edge/litert/inference)).

I'm out of my wheelhouse, so I got Gemini 2.5 Pro to help figure out how to run inference on the models. Initial testing "worked" but it was really slow: 125s/100 tokens on CPU. This test was done without the vision-related model layers, though.
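
For anyone who wants to peek inside without 7zip, here is a minimal Python sketch of the extraction step. It assumes, as reported in this thread, that the .task file is a plain zip-compatible archive; if it uses some other container format, `zipfile` will raise `BadZipFile`. The function names are made up for illustration, and the path is whatever .task file you downloaded.

```python
import sys
import zipfile


def list_task_contents(path):
    """Return (member name, size in bytes) for each file inside the archive."""
    with zipfile.ZipFile(path) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


def extract_task(path, out_dir):
    """Unpack every member of the archive into out_dir."""
    with zipfile.ZipFile(path) as archive:
        archive.extractall(out_dir)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python inspect_task.py <path-to-.task-file>
    for name, size in list_task_contents(sys.argv[1]):
        print(f"{name}\t{size / 1e6:.1f} MB")
```

Listing the members first is a cheap way to check which components (embedder, prefill/decode, vision parts, tokenizer) a given download actually contains before unpacking several gigabytes.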
View on Reddit #57020040

Nervous-Magazine-911@reddit

Hey, which backend did you use? Phone or desktop?
View on Reddit #57464841

BobserLuck@reddit

Standard x64. Hesitant to share the method as it was mostly generated by AI and has very poor performance. But I'll see about throwing it up on GitHub so that folks who actually know what they are doing can make heads or tails of it.
View on Reddit #57976034

Nervous-Magazine-911@reddit

Please share, thank you.
View on Reddit #58595709

georgejrjrjr@reddit

Please do! Slow is solvable. Right now there is (to my knowledge) no way to run this on desktop, and tons of interest. Much easier to iterate from a working example, ya know?
View on Reddit #58290931

Skynet_Overseer@reddit

could you tell us a bit more on how to run it? thanks!
View on Reddit #57135321

Expensive-Apricot-25@reddit

[https://ai.google.dev/gemma/docs/gemma-3n#parameters](https://ai.google.dev/gemma/docs/gemma-3n#parameters) Docs are finally up... E2B has slightly over 5B parameters under normal execution. It doesn't say anything about E4B, so I am just going to assume about 10-12B. It's basically an MoE model, except it looks like it's split based on each modality.
View on Reddit #56805212

phhusson@reddit

> It is built using the gemini nano architecture.

Where do you see this? Usually Gemma and Gemini team are silo-ed from each other, so that's a bit weird. Though that would make sense since keeping gemini nano a secret isn't possible
View on Reddit #56949374

Neither-Phone-7264@reddit

I think they said that at i/o
View on Reddit #58139889

Otherwise_Flan7339@reddit

Whoa, this Gemma stuff is pretty wild. I've been keeping an eye on it but totally missed that they dropped docs for the 3n version. Kinda surprised they're not being all secretive about the parameter counts and architecture. That moe thing for different modalities is pretty interesting. Makes sense to specialize but I wonder if it messes with the overall performance. You tried messing with it at all? I'm curious how it handles switching between text/audio/video inputs. Real talk though, Google putting this out there is probably the biggest deal. Feels like they're finally stepping up to compete in the open source AI game now.
View on Reddit #56851817

Xandred_the_thicc@reddit

What's the point of having such an obvious llm as an ad for an "AI agent" company when it literally just regurgitates the content of whatever it's replying to and then barfs out something about "Maxim AI"?
View on Reddit #56873697

Godless_Phoenix@reddit

You're an LLM
View on Reddit #56870166

TheRealGentlefox@reddit

I might be missing something, but a normal 12B 4-bit LLM is ~7GB. E4B is 3GB.
View on Reddit #56861198

brown2green@reddit (OP)

> Gemma 3n models are designed for efficient execution on low-resource devices. They are capable of multimodal input, handling text, image, video, and audio input, and generating text outputs, with open weights for instruction-tuned variants. These models were trained with data in over 140 spoken languages.
>
> Gemma 3n models use selective parameter activation technology to reduce resource requirements. This technique allows the models to operate at an effective size of 2B and 4B parameters, which is lower than the total number of parameters they contain. For more information on Gemma 3n's efficient parameter management technology, see the [Gemma 3n page](https://ai.google.dev/gemma/docs/gemma-3n).

Google just posted new "preview" Gemma 3 models, seemingly intended for edge devices. The docs aren't live yet.
View on Reddit #56795053

Nexter92@reddit

A model for Google Pixel and Android? It could be very good if it runs locally by default to preserve content privacy.
View on Reddit #56795301

phhusson@reddit

In the tests they mention Samsung Galaxy S25 Ultra, so they should have some inference framework for Android, yes, one that isn't exclusive to Pixels. That being said, I fail to see how one is supposed to run that thing.
View on Reddit #56799848

Plums_Raider@reddit

Download edge gallery from their github and the .task file from huggingface. Works really well on my s25 ultra
View on Reddit #56808038

djjagatraj@reddit

Brother it is awesome , mind blowing , better model than any model that runs on my laptop
View on Reddit #56851932

messiahua@reddit

how to run it on laptop?
View on Reddit #56981132

djjagatraj@reddit

Run DeepSeek R1 7B on a laptop, or Cogito (it's the best, I think).
View on Reddit #57642046

Plums_Raider@reddit

i totally agree. amazing for its size. hopefully this will soon be adapted into other apps and ollama/llamacpp
View on Reddit #56853397

AnticitizenPrime@reddit

I'm getting ~12 tok/sec on a two year old Oneplus 11. Very acceptable and its vision understanding seems very impressive. The app is pretty barebones - doesn't even save chat history. But it's open source, so maybe devs can fork it and add features?
View on Reddit #56813045

ExtremeAcceptable289@reddit

what chipset is your oneplus 11?
View on Reddit #56875428

AnticitizenPrime@reddit

Snapdragon 8 gen 2 apparently
View on Reddit #56880441

ExtremeAcceptable289@reddit

Ah ok
View on Reddit #56880643

djjagatraj@reddit

Same here , snapdragon 870
View on Reddit #56851963

ibbobud@reddit

It’s the age of vibe coding, fork it yourself and add the feature. You can do it !
View on Reddit #56817215

AnticitizenPrime@reddit

I guess with Gemini's huge context window I could just dump the whole repo in there and ask it to get cracking...
View on Reddit #56826003

treverflume@reddit

Deepseek r1 thinking gave me this:

To add chat history to your Android LLM app, follow these steps:

### 1. **Database Setup**

Create a Room database to store chat messages.

**ChatMessageEntity.kt**

```kotlin
@Entity(tableName = "chat_messages")
data class ChatMessageEntity(
    @PrimaryKey(autoGenerate = true) val id: Long = 0,
    val modelId: String, // Unique identifier for the model
    val content: String,
    @TypeConverters(ChatSideConverter::class) val side: ChatSide,
    @TypeConverters(ChatMessageTypeConverter::class) val type: ChatMessageType,
    val timestamp: Long
)
```

**Converters**

```kotlin
class ChatSideConverter {
    @TypeConverter
    fun toString(side: ChatSide): String = side.name

    @TypeConverter
    fun toChatSide(value: String): ChatSide = enumValueOf(value)
}

class ChatMessageTypeConverter {
    @TypeConverter
    fun toString(type: ChatMessageType): String = type.name

    @TypeConverter
    fun toChatMessageType(value: String): ChatMessageType = enumValueOf(value)
}
```

**ChatMessageDao.kt**

```kotlin
@Dao
interface ChatMessageDao {
    @Query("SELECT * FROM chat_messages WHERE modelId = :modelId ORDER BY timestamp ASC")
    suspend fun getMessagesByModel(modelId: String): List<ChatMessageEntity>

    @Insert
    suspend fun insert(message: ChatMessageEntity)

    @Query("DELETE FROM chat_messages WHERE modelId = :modelId")
    suspend fun clearMessagesByModel(modelId: String)
}
```

### 2. **Repository Layer**

Create a repository to handle database operations.

**ChatRepository.kt**

```kotlin
class ChatRepository(private val dao: ChatMessageDao) {
    suspend fun getMessages(modelId: String) = dao.getMessagesByModel(modelId)
    suspend fun saveMessage(message: ChatMessageEntity) = dao.insert(message)
    suspend fun clearMessages(modelId: String) = dao.clearMessagesByModel(modelId)
}
```

### 3. **Modify ViewModel**

Integrate the repository into `LlmChatViewModel`.

**LlmChatViewModel.kt**

```kotlin
open class LlmChatViewModel(
    private val repository: ChatRepository, // Inject via DI
    curTask: Task = TASK_LLM_CHAT
) : ChatViewModel(task = curTask) {

    // Load messages when a model is initialized
    fun loadMessages(model: Model) {
        viewModelScope.launch(Dispatchers.IO) {
            val entities = repository.getMessages(model.id)
            entities.forEach { entity ->
                val message = when (entity.type) {
                    ChatMessageType.TEXT -> ChatMessageText(
                        content = entity.content,
                        side = entity.side
                    )
                    // Handle other types if needed
                    else -> null
                }
                message?.let { addMessage(model, it) }
            }
        }
    }

    // Override or modify message handling to include DB operations
    fun sendUserMessage(model: Model, input: String) {
        // Add user message
        addMessage(model, ChatMessageText(input, ChatSide.USER))
        // Generate response
        generateResponse(model, input, onError = { /* Handle error */ })
    }

    // Modified generateResponse to save agent messages
    override fun generateResponse(...) {
        // Existing code...
        resultListener = { partialResult, done ->
            // When done, save the final message
            if (done) {
                val lastMessage = getLastMessage(model) as? ChatMessageText
                lastMessage?.let {
                    viewModelScope.launch(Dispatchers.IO) {
                        repository.saveMessage(
                            ChatMessageEntity(
                                modelId = model.id,
                                content = it.content,
                                side = it.side,
                                type = it.type,
                                timestamp = System.currentTimeMillis()
                            )
                        )
                    }
                }
            }
        }
    }

    // Clear both UI and DB messages
    fun clearChatHistory(model: Model) {
        clearAllMessages(model)
        viewModelScope.launch(Dispatchers.IO) {
            repository.clearMessages(model.id)
        }
    }
}
```

### 4. **Dependency Injection**

Use Hilt to inject dependencies.

**AppModule.kt**

```kotlin
@Module
@InstallIn(SingletonComponent::class)
object AppModule {
    @Provides
    fun provideChatDao(@ApplicationContext context: Context): ChatMessageDao {
        return Room.databaseBuilder(
            context,
            AppDatabase::class.java,
            "chat-db"
        ).build().chatMessageDao()
    }

    @Provides
    fun provideChatRepository(dao: ChatMessageDao) = ChatRepository(dao)
}
```

### 5. **UI Layer Adjustments**

Load messages when the fragment starts and handle user interactions.

**ChatFragment.kt**

```kotlin
@AndroidEntryPoint
class ChatFragment : Fragment() {
    private val viewModel: LlmChatViewModel by viewModels()

    override fun onViewCreated(view: View, savedInstanceState: Bundle?) {
        super.onViewCreated(view, savedInstanceState)
        val model: Model = TODO("Get selected model")
        viewModel.loadMessages(model)

        sendButton.setOnClickListener {
            val input = inputEditText.text.toString()
            viewModel.sendUserMessage(model, input)
        }
    }
}
```

### Key Changes:

- **Persist Messages**: Only save `ChatMessageText` with `ChatSide.USER`/`AGENT` to the database.
- **Load on Startup**: Load messages when the fragment initializes.
- **Clear History**: Ensure both UI and database are cleared when resetting.

This approach maintains chat history across app restarts and handles streaming responses by saving only the final message. Adjust based on your app's specific needs (e.g., handling images).

I did use 3n to find the right file to give to R1. I then gave that to 3n via the code snippet prompt with Kotlin selected, and it liked it.
View on Reddit #56842016

phhusson@reddit

Bonus points for doing it on-device directly!
View on Reddit #56822353

sandy_catheter@reddit

> Google
>
> content privacy

This feels like a "choose one" scenario
View on Reddit #56823290

ForsookComparison@reddit

The weights are open so it's possible here. Don't use any *"local Google inference apps"* for one.. but also the fact that you're doing anything on an OS they lord over kinda throws it out the window. Mobile phones are not and never will be privacy devices. Better just to tell yourself that
View on Reddit #56832052

TheRealGentlefox@reddit

Or use GrapheneOS if it's a Pixel, and deny network access once model is installed.
View on Reddit #56861098

AdSimilar3123@reddit

Afaik denying network access doesn't prevent it from communicating with other apps that have network access.
View on Reddit #56866352

TheRealGentlefox@reddit

I did see that google apps potentially send metadata via connecting to Play Services. I think that makes it much easier for us to audit it though. I'm not super familiar with Android internals, but I would guess that inter-app communication can trivially be snooped with a rooted phone.
View on Reddit #56969092

ForsookComparison@reddit

Then you're left doing inference on a tensor SOC lol
View on Reddit #56868551

Plums_Raider@reddit

Yea just tried it. Needs edge gallery to run, but at least what i tried it was really fast for running locally on my phone even with image input. Only thing about google that got me excited today.
View on Reddit #56807903

ab2377@reddit

how many tokens/s are you getting? and which model.
View on Reddit #56870184

Plums_Raider@reddit

gemma-3n-E4B-it-int4.task (4.4gb) in edge gallery: model is loaded in 5 seconds.

- 1st token: 1.92/sec
- prefill speed: 0.52 t/s
- decode speed: 11.95 t/s
- latency: 5.43 sec

Doesn't sound too impressive compared to the similar-sized gemma3 4b model via chatterui, but the quality is much better for German at least imo.
View on Reddit #56902721

webshield-in@reddit

How are you running it? I mean what app?
View on Reddit #56809231

Plums_Raider@reddit

As said "edge gallery". https://github.com/google-ai-edge/gallery/releases
View on Reddit #56809413

DesomorphineTears@reddit

That's Gemini Nano, they have APIs to use it now (and improved it) https://android-developers.googleblog.com/2025/05/on-device-gen-ai-apis-ml-kit-gemini-nano.html?m=1
View on Reddit #56825862

x0wl@reddit

Rewriter API as well
View on Reddit #56795451

Nexter92@reddit

Why use such a small model for that? 12B is very mature for that and runs pretty fast on any PC with DDR4 RAM ;)
View on Reddit #56798480

x0wl@reddit

Lol no 12B dense will be awfully slow without GPU, and will barely fit into 8GB RAM at Q4. The current weights file they use is \~3GB
View on Reddit #56799031

Nexter92@reddit

I get something like 4 t/s using llamacpp, still good to convert files. Yes, for code completion it's impossible, way too slow. But for vibe coding a component, very good.
View on Reddit #56800023

webshield-in@reddit

This is working quite well on my Nothing 2a which is not even a high end phone. I want to run this on Laptop. How would I go about it?
View on Reddit #56814426

Skynet_Overseer@reddit

i guess computer support is coming later, only android for now?
View on Reddit #57135011

askerlee@reddit

very useful for hikers without internet access.
View on Reddit #56851530

AnticitizenPrime@reddit

A year ago I used Gemma 2 9b on my laptop on 16 hour plane flight to Japan (without internet) to brush up on Japanese phrases. This is an improvement on that and can be done from a phone!
View on Reddit #56887047

lookwatchlistenplay@reddit

> Gemma 3n models are designed for efficient execution on low-resource devices. In other words, Google kills homeless people.
View on Reddit #56830124

Bakoro@reddit

> They are capable of multimodal input, handling text, image, video, and audio input,

What's the onomatopoeia for a happy groan? "Uunnnnh"? I'll just go with that. Everyone is really going to have to step it up with the A/V modalities. This means we can have 'lil robots roaming around. 'Lil LLM R2D2.
View on Reddit #56815296

No-Refrigerator-1672@reddit

> models to operate at an effective size of 2B and 4B parameters, which is lower than the total number of parameters they contain.

So it's an MoE, multimodal, multilingual, and compact? What a time to be alive!
View on Reddit #56798514

codemaker1@reddit

It seems to be better than an MoE because it doesn't have to keep all parameters in ram.
View on Reddit #56806921

Randommaggy@reddit

Didn't really run on my Xcover 6 Pro. Will try on my 16GB Y700 2023 in a couple of days.
View on Reddit #57161091

MustBeSomethingThere@reddit

They need this: [https://github.com/google-ai-edge/gallery](https://github.com/google-ai-edge/gallery)
View on Reddit #56802034

fynadvyce@reddit

Any guide to use this on PC? I tried [https://github.com/google-ai-edge/mediapipe-samples/tree/main/examples/llm\_inference/js](https://github.com/google-ai-edge/mediapipe-samples/tree/main/examples/llm_inference/js) but it gives an error "Failed to initialize the task."
View on Reddit #57134137

MustBeSomethingThere@reddit

There are problems with their mediapipe program, so 3n models do not work until they fix it: [https://github.com/google-ai-edge/mediapipe/issues/5976](https://github.com/google-ai-edge/mediapipe/issues/5976)
View on Reddit #57139814

phpwisdom@reddit

You can access it now: [https://aistudio.google.com/prompts/new\_chat?model=gemma-3n-e4b-it](https://aistudio.google.com/prompts/new_chat?model=gemma-3n-e4b-it)
View on Reddit #56800679

AnticitizenPrime@reddit

Is it actually working for you? I just get a response that I've reached my rate limit, though I haven't used AI studio today at all. Other models work.
View on Reddit #56805733

phpwisdom@reddit

Had the same error but it worked eventually. Maybe they are still releasing it.
View on Reddit #56806010

Skynet_Overseer@reddit

yup. also took a while when they dropped gemma 3. i managed to send a single message but the multimodal support is not there yet either.
View on Reddit #57135096

Foreign-Beginning-49@reddit

How do we use it? It doesn't yet mention transformers support? 🤔
View on Reddit #56805892

met_MY_verse@reddit

!RemindMe 2 weeks
View on Reddit #56800355

Neither-Phone-7264@reddit

!remindme 2 weeks
View on Reddit #56905156

Trick-Gazelle4438@reddit

!remindme 2 weeka
View on Reddit #57011566

Few_Painter_5588@reddit

Woah, that is not your typical architecture. I wonder if this is the architecture that Gemini uses. It would explain why Gemini's multimodality is so good and why their context is so amazing.
View on Reddit #56795415

webshield-in@reddit

> Gemma 3n enables you to start building on this foundation that will come to major platforms such as Android and Chrome.

Seems like we will not be able to run this on Laptop/Desktop. [https://developers.googleblog.com/en/introducing-gemma-3n/](https://developers.googleblog.com/en/introducing-gemma-3n/)
View on Reddit #56818906

rolyantrauts@reddit

I am not sure. It runs under LiteRT, is optimised to run on mobile, and has examples for that. Linux does have LiteRT too, as TFLite is being moved out of TF and deprecated, but does this mean it's only for mobile, or do we just not have the examples...
View on Reddit #56955000

BobserLuck@reddit

Problem is, it's not just a LiteRT model. It's wrapped up in a .task format, something that apparently Mediapipe can work with on other platforms. There is a Python package, but I can't for the life of me find out how to run inference on models via the pip package. Again, the only documentation points to WASM, iOS, and Android: [https://ai.google.dev/edge/mediapipe/solutions/genai/llm\_inference](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference) There might be a LiteRT model inside, though I'm not sure how to get to it.
View on Reddit #56982402

rolyantrauts@reddit

It's just a zip, but I haven't got a clue about the files inside. Hopefully someone will just do it for us... Doh :)
View on Reddit #56987349

uhuge@reddit

It's surely not their focus, but there's nothing indicating they intend to forbid that.
View on Reddit #56954650

x0wl@reddit

They say it's a matformer https://arxiv.org/abs/2310.07707
View on Reddit #56797248

ios_dev0@reddit

Tl;dr: the architecture is identical to normal transformer but during training they randomly sample differently sized contiguous subsets of the feed forward part. Kind of like dropout but you always sample the same contiguous subset of neurons in increasing sizes. They also say that you can mix and match, for example take only 20% of neurons for the first transformer block and increase it slowly until the last. This way you can have exactly the best model for your compute resources
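
A toy sketch of the nested feed-forward idea described above, in plain Python. This is only an illustration of "use the first m hidden neurons of one shared weight matrix", not Gemma 3n's or the paper's actual implementation: the ReLU activation, the tiny dimensions, the random weights, and all function names are assumptions made for the example.

```python
import random


def relu(values):
    return [max(0.0, v) for v in values]


def matvec(rows, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]


def nested_ffn(x, w_in, w_out, m):
    """Feed-forward block that uses only the first m hidden neurons.

    w_in has one row per hidden neuron; w_out has one row per output
    dimension and one column per hidden neuron. Smaller sub-models are
    prefixes of the full one, so they share the same weights.
    """
    hidden = relu(matvec(w_in[:m], x))
    return matvec([row[:m] for row in w_out], hidden)


if __name__ == "__main__":
    d_model, d_hidden = 4, 8
    w_in = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_hidden)]
    w_out = [[random.gauss(0, 1) for _ in range(d_hidden)] for _ in range(d_model)]
    x = [random.gauss(0, 1) for _ in range(d_model)]

    # During training, a nested width would be sampled at each step.
    m = random.choice([2, 4, 8])
    print(m, nested_ffn(x, w_in, w_out, m))
```

At inference time you would pick `m` per layer to fit the compute budget, which is the "mix and match" part: a small `m` in early blocks and a larger one later, all from one set of trained weights.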
View on Reddit #56811918

-p-e-w-@reddit

Wow, that architecture intuitively makes much more sense than MoE. The ability to scale resource requirements dynamically is a killer feature.
View on Reddit #56834590

nderstand2grow@reddit

Matryoshka transformer
View on Reddit #56810950

webshield-in@reddit

Any idea how we would run this on a laptop? Do ollama and llama need to add support for this model, or will it work out of the box?
View on Reddit #56816621

Thomas-Lore@reddit

Matformer would have static parameter count, no?
View on Reddit #56800195

No_Heat1167@reddit

Has anyone managed to run this on iOS? :')
View on Reddit #56869547

BobserLuck@reddit

Might be possible via Mediapipe?
View on Reddit #56983286

Decidy@reddit

So, when is this coming to ollama?
View on Reddit #56838709

sigjnf@reddit

Not soon, it seems to be a proprietary thing, to be used only on Android for now.
View on Reddit #56855735

AnticitizenPrime@reddit

Dunno if I'd say 'not soon', the engine used on smartphones is open source and I'll bet someone will port it before long.
View on Reddit #56887266

BobserLuck@reddit

Congratulations "someone"! When are you porting it? XD
View on Reddit #56983182

Expensive-Apricot-25@reddit

so it has an effective parameter size of 2B and 4B, but what are the actual parameter sizes???
View on Reddit #56803664

uhuge@reddit

yeah, madness it's not stated on the model card
View on Reddit #56954790

codemaker1@reddit

5B and 8B according to the blog: [https://developers.googleblog.com/en/introducing-gemma-3n/](https://developers.googleblog.com/en/introducing-gemma-3n/)
View on Reddit #56807773

MixtureOfAmateurs@reddit

How the flip flop do I run it locally? The official gemma library only has these

```
from gemma.gm.nn._gemma import Gemma2_2B
from gemma.gm.nn._gemma import Gemma2_9B
from gemma.gm.nn._gemma import Gemma2_27B
from gemma.gm.nn._gemma import Gemma3_1B
from gemma.gm.nn._gemma import Gemma3_4B
from gemma.gm.nn._gemma import Gemma3_12B
from gemma.gm.nn._gemma import Gemma3_27B
```

Do I just have to wait
View on Reddit #56845357

AnticitizenPrime@reddit

These are meant to be run on an Android smartphone. I'm sure people will get it running on other devices soon, but for now you can use the Edge Gallery app on an Android phone.
View on Reddit #56887197

Neither-Phone-7264@reddit

It's painfully slow on my 8a...
View on Reddit #56905087

StormrageBG@reddit

Any GGUF?
View on Reddit #56884126

Illustrious-Lake2603@reddit

What is a .Task file??
View on Reddit #56798329

dyfgy@reddit

.task file format used by this example app: [https://github.com/google-ai-edge/gallery](https://github.com/google-ai-edge/gallery) which is built using this mediapipe task... [https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference)
View on Reddit #56814053

AnaYuma@reddit

No way to use it directly on pc?
View on Reddit #56855702

RandumbRedditor1000@reddit

Obligatory "gguf when?"
View on Reddit #56812800

Ok_Warning2146@reddit

It will take some time, since Google likes to work with transformers and vLLM first.
View on Reddit #56851053

celzero@reddit

With the kind of optimisations Google is going after in Gemma, these models seem to be very specifically meant to be run with LiteRT (TensorFlow Lite) or via MediaPipe.
View on Reddit #56832766

Juude89@reddit

edge gallery by google
View on Reddit #56847284

TinySmugCNuts@reddit

would not trust this model *at all*. few use cases I tried in ai studio it just completely hallucinated, got basic facts wrong. maybe it will be 'trainable'? but... my god it's bad. absolutely no chance i'd want to use this in its current form.
View on Reddit #56835498

Any_Number_4496@reddit

how to use it ? new to this stuff
View on Reddit #56812196

Comas_Sola_Mining_Co@reddit

follow these steps https://news.ycombinator.com/item?id=44045265
View on Reddit #56833476

and_human@reddit

Active params between 2 and 4b; the 4b has a size of 4.41GB in int4 quant. So 16b model?
View on Reddit #56798257

Immediate-Material36@reddit

Doesn't q8/int8 have very approximately as many GB as the model has billion parameters? Then half of that, q4 and int4, being 4.41GB means that they have around 8B total parameters. Or I'm misremembering.
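
The rule of thumb above can be checked with a little arithmetic. This is only a rough sketch: the helper name `estimate_params_billion` is made up for illustration, and it assumes every weight takes exactly the stated number of bits, ignoring metadata, scales, and any tensors kept at higher precision.

```python
def estimate_params_billion(file_size_gb: float, bits_per_weight: float) -> float:
    """Rough parameter count (in billions) implied by a weights file size.

    Assumes 1 GB = 1e9 bytes and that all weights use `bits_per_weight` bits.
    """
    total_bits = file_size_gb * 1e9 * 8
    return total_bits / bits_per_weight / 1e9


# A 4.41 GB file at 4 bits per weight implies roughly twice as many
# parameters as the same file size would at 8 bits per weight.
at_int4 = estimate_params_billion(4.41, 4)
at_int8 = estimate_params_billion(4.41, 8)
print(f"int4: ~{at_int4:.2f}B parameters, int8: ~{at_int8:.2f}B parameters")
```

Under those assumptions a 4.41GB int4 file works out to somewhere under 9B parameters, which is in the same ballpark as the "around 8B" guess rather than 16B.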
View on Reddit #56799829

snmnky9490@reddit

I'm confused about q8/int4. I thought q8 meant parameters were quantized to 8 bit integers?
View on Reddit #56819136

Immediate-Material36@reddit

A normal model has its weights stored in fp32. This means that each weight is represented by a floating point number which consists of 32 bits. This allows for pretty good accuracy but of course also needs much storage space. Quantization reduces the size of the model at the cost of accuracy. fp16 and bf16 both represent weights as floating point numbers with 16 bits. Q8 means that most weights will be represented by 8 bits (in GGUF files these are integers that share a floating point scale per block of weights), Q6 means most will be 6 bits etc. Integer quantization (int8, int4 etc.) doesn't use floating point numbers but integers instead. There is no int6 quantization or similar because hardware isn't optimized for 6-bit or 3-bit or whatever-bit integers. I hope I got that right.
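
To make the size-versus-accuracy trade-off concrete, here is a minimal sketch of integer quantization with one shared scale per block of weights. It assumes a simple symmetric scheme and is not the exact layout of any particular file format; `quantize_block` and `dequantize_block` are illustrative names, not functions from a real library.

```python
def quantize_block(weights, bits=8):
    """Symmetric quantization of one block: small integers plus one float scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each weight to the nearest integer step and clamp to the valid range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize_block(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]


block = [0.12, -0.5, 0.33, 0.0]
q8, s8 = quantize_block(block, bits=8)
q4, s4 = quantize_block(block, bits=4)

# Worst-case reconstruction error for each bit width.
err8 = max(abs(a - b) for a, b in zip(block, dequantize_block(q8, s8)))
err4 = max(abs(a - b) for a, b in zip(block, dequantize_block(q4, s4)))
print("8-bit ints:", q8, "max error:", err8)
print("4-bit ints:", q4, "max error:", err4)
```

Fewer bits means fewer distinct integer levels, so the 4-bit version takes half the space of the 8-bit one but reconstructs the weights less precisely.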
View on Reddit #56824075

snmnky9490@reddit

Oh ok, thank you for clarifying. I wasn't sure if I didn't understand it correctly
View on Reddit #56828805

harrro@reddit

I think he meant q8/fp8 in the first sentence.
View on Reddit #56824916

MrHighVoltage@reddit

This is exactly right.
View on Reddit #56805014

shing3232@reddit

[https://ai.google.dev/gemma/docs/gemma-3n](https://ai.google.dev/gemma/docs/gemma-3n)
View on Reddit #56804779

noiserr@reddit

You're right. If you look at common 7B / 8B quant GGUFs you'll see they are also in the 4.41GB range.
View on Reddit #56804532

Randommaggy@reddit

I wonder how this will run on my 16GB tablet, or how it would run on the ROG Phone 9 Pro, if I were to upgrade my phone to that.
View on Reddit #56821922

kurtunga@reddit

MatFormer gives pareto-optimal elasticity across E2B and E4B -- so you get a lot more model sizes to play with -- more amenable to users' specific deployment constraints. [https://x.com/adityakusupati/status/1924920708368629987](https://x.com/adityakusupati/status/1924920708368629987)
View on Reddit #56819715

LogicalAnimation@reddit

I tried some translation tasks with this model in google ai studio. The quota is limited to one or two messages for the free tier at the moment, but according to GPT-o3's evaluation, that one-shot translation attempt scored right between gemma 3 27b and gpt-4o, roughly at Deepseek V3's level. Very impressive for its size, the only downside being that it doesn't follow instructions as well as gemma 3 12b or gemma 3 27b.
View on Reddit #56818356

webshield-in@reddit

Here's the video that shows what it's capable of [https://www.youtube.com/watch?v=eJFJRyXEHZ0](https://www.youtube.com/watch?v=eJFJRyXEHZ0) It's incredible
View on Reddit #56808937

AnticitizenPrime@reddit

Need that app!
View on Reddit #56816999

webshield-in@reddit

It'
View on Reddit #56817671

AnticitizenPrime@reddit

Yeah I've got that up and running. I want the video and audio modalities though :)
View on Reddit #56817898

larrytheevilbunnie@reddit

Does anyone have benchmarks for this?
View on Reddit #56817202

jacek2023@reddit

Dear Google I am waiting for Gemma 4. Please make it 35B or 43B or some other funny size.
View on Reddit #56801830

noiserr@reddit

Gemma 3 was just released. Gemma 4 will probably be like a year from now.
View on Reddit #56804628

jacek2023@reddit

just?
View on Reddit #56808502

sxales@reddit

like 2 months ago
View on Reddit #56815875

ResearchCrafty1804@reddit

Is there a typo in the Aider Polyglot benchmark score? I find it pretty unlikely that the E4B model scores 44.4
View on Reddit #56805627

SlaveZelda@reddit

yeah that puts it on the level of Gemini 2.5 flash
View on Reddit #56815115

phhusson@reddit

Grrr, MOE's broken naming strikes again. "gemma-3n-E2B-it-int4.task" should be around 500MB right? Well nope, it's 3.1GB! The E in E2B is for "effective", so it's 2B computations. Heck description says computation can go to 4B (that still doesn't make 3.1GB though, but maybe multi-modal takes that additional 1GB). Does someone have /any/ idea how to run that thing? I don't know what ".task" is supposed to be, and Llama4 doesn't know either.
View on Reddit #56800254

nutsiepully@reddit

As u/m18coppola mentioned, the `.task` file is the format used by Mediapipe LLM Inference to run the model. See [https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/android#download-model](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/android#download-model) [https://github.com/google-ai-edge/gallery](https://github.com/google-ai-edge/gallery) serves as a good example for how to run the model. Basically, the `.task` is a bundle format, which hosts tokenizer files, `.tflite` model files and a few other config files.
View on Reddit #56814569

m18coppola@reddit

It's not MOE, it's [matryoshka](https://arxiv.org/abs/2310.07707). I believe the `.task` format is for [mediapipe](https://github.com/google-ai-edge/mediapipe). The matryoshka is a big llm, but was trained/evaluated on multiple increasingly larger subsets of the model for each batch. This means there's a large and very capable llm with a smaller llm embedded inside of it. Essentially you can train a 1b,4b,8b,32b... all at the same time by making one llm exist inside of the next bigger llm.
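
A toy sketch of the nesting idea, assuming it is applied to the hidden width of a feed-forward block (the thread does not say how Gemma 3n actually slices its weights). The function `nested_ffn` and all the sizes below are invented for illustration: the "small" model simply uses the first few hidden neurons of the same weight matrices the "large" model uses.

```python
import random


def relu(x):
    return x if x > 0 else 0.0


def nested_ffn(x, w_in, w_out, hidden_units):
    """Feed-forward block that only uses the first `hidden_units` neurons.

    w_in:  one row per hidden neuron, each row has len(x) entries
    w_out: one row per hidden neuron, each row has d_out entries
    """
    hidden = [relu(sum(wi * xi for wi, xi in zip(row, x)))
              for row in w_in[:hidden_units]]
    d_out = len(w_out[0])
    return [sum(h * w_out[j][k] for j, h in enumerate(hidden))
            for k in range(d_out)]


random.seed(0)
d_model, d_hidden = 4, 8
w_in = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]
w_out = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]
x = [1.0, -0.5, 0.25, 2.0]

small = nested_ffn(x, w_in, w_out, hidden_units=4)  # sub-model: first 4 neurons
large = nested_ffn(x, w_in, w_out, hidden_units=8)  # full model: all 8 neurons
print("small:", small)
print("large:", large)
```

Both outputs come from one set of weights, so a single stored model can be run at more than one compute budget. Training each batch through several of these widths is what would make the smaller slices useful on their own.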
View on Reddit #56805033

AyraWinla@reddit

As someone who mainly uses LLMs on my phone, phone-sized models are what interest me most so I'm definitely intrigued. Plus, for writing-based stuff, Gemma 3 4b was the clear winner for a model that size with no serious competition (though slow on my Pixel 8a). So this sounds like exactly what I want. Going to try that 2b one and see the result, even though compatibility is obviously nonexistent with the apps I use, so can't do my usual tests. Still, being tentatively optimistic!
View on Reddit #56814255

InternationalNebula7@reddit

Can't wait to try it out with Ollama.
View on Reddit #56812611

No_Conversation9561@reddit

Gemma 4 when?
View on Reddit #56803976

Available_Load_5334@reddit

google io begins in 15 minutes. maybe they'll say something...
View on Reddit #56797746

x0wl@reddit

The Gemma session is tomorrow: [https://io.google/2025/explore/pa-keynote-4](https://io.google/2025/explore/pa-keynote-4)
View on Reddit #56803256

and_human@reddit

According to their own benchmark (the readme was just updated) this ties with GPT 4.5 in Aider polyglot (44.4 vs 44.9)???
View on Reddit #56798757

x0wl@reddit

Don't compare benchmarks like that, there can be a ton of methodological differences.
View on Reddit #56802098

Zemanyak@reddit

I like this! Just wish there was an 8B model too. What's the best 8B truly multimodal alternative?
View on Reddit #56797251

coding_workflow@reddit

This is clearly aimed at mobile.
View on Reddit #56796053