google/gemma-4-12B · Hugging Face

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 95 comments

Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on E2B, E4B, and 12B) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages.

Featuring both Dense and Mixture-of-Experts (MoE) architectures, Gemma 4 is well-suited for tasks like text generation, coding, and reasoning. The models are available in five distinct sizes: **E2B**, **E4B**, **12B**, **26B A4B**, and **31B**. Their diverse sizes make them deployable in environments ranging from high-end phones to laptops and servers, democratizing access to state-of-the-art AI.

Gemma 4 introduces key **capability and architectural advancements**:

* **Reasoning** – All models in the family are designed as highly capable reasoners, with configurable thinking modes.
* **Extended Multimodalities** – Processes Text, Image with variable aspect ratio and resolution support (all models), Video, and Audio (featured natively on the E2B, E4B, and 12B models).
* **Diverse & Efficient Architectures** – Offers Dense and Mixture-of-Experts (MoE) variants of different sizes for scalable deployment.
* **Optimized for On-Device** – Smaller models are specifically designed for efficient local execution on laptops and mobile devices.
* **Increased Context Window** – The small models feature a 128K context window, while the medium models support 256K.
* **Enhanced Coding & Agentic Capabilities** – Achieves notable improvements in coding benchmarks alongside native function-calling support, powering highly capable autonomous agents.
* **Native System Prompt Support** – Gemma 4 introduces native support for the `system` role, enabling more structured and controllable conversations.
# Models Overview

Gemma 4 models are designed to deliver frontier-level performance at each size, targeting deployment scenarios from mobile and edge devices (E2B, E4B) to consumer GPUs and workstations (12B, 26B A4B, 31B). They are well-suited for reasoning, agentic workflows, coding, and multimodal understanding.

The models employ a hybrid attention mechanism that interleaves local sliding window attention with full global attention, ensuring the final layer is always global. This hybrid design delivers the processing speed and low memory footprint of a lightweight model without sacrificing the deep awareness required for complex, long-context tasks. To optimize memory for long contexts, global layers feature unified Keys and Values, and apply Proportional RoPE (p-RoPE).
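
The interleaving described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the local-to-global ratio, the window size, and the helper names `layer_pattern` and `attention_mask` are hypothetical and not taken from the Gemma 4 model card. Only the rule "interleave local sliding-window layers with global layers, and make the final layer global" comes from the text.

```python
def layer_pattern(num_layers: int, local_per_global: int) -> list[str]:
    """Interleave local layers with global ones and force the last layer to be global.

    `local_per_global` (local layers between two global layers) is an assumed knob.
    """
    pattern = []
    for i in range(num_layers):
        is_global = (i + 1) % (local_per_global + 1) == 0
        pattern.append("global" if is_global else "local")
    pattern[-1] = "global"  # the final layer is always global
    return pattern


def attention_mask(seq_len: int, layer_type: str, window: int) -> list[list[bool]]:
    """Causal mask; local layers also limit each query to the last `window` positions."""
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            visible = k <= q  # causal: never look at future tokens
            if layer_type == "local":
                visible = visible and (q - k) < window
            row.append(visible)
        mask.append(row)
    return mask


if __name__ == "__main__":
    print(layer_pattern(num_layers=5, local_per_global=2))
    for row in attention_mask(seq_len=4, layer_type="local", window=2):
        print(row)
```

In a local layer each token sees at most `window` recent positions, so the key/value cache for that layer stays bounded, while the global layers keep access to the whole context.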

95 Comments

[-]

jacek2023@reddit (OP)

https://preview.redd.it/8tsvau0hb35h1.png?width=1163&format=png&auto=webp&s=231a022a3a8e2dbbdf6d9ee6ff4214421f2ffd7f

[-]

pmttyji@reddit

It beats Gemma3-27B(no think) on all items so this 12B is good for Poor GPU Club. I can run Q4/Q5 on my 8GB VRAM.
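
A rough way to sanity-check claims like this is to estimate the weight size from the parameter count and the bits per weight of the quant. A minimal sketch, assuming approximate bits-per-weight figures for a few llama.cpp quant types (rounded, and they vary per model) and assuming the model has exactly 12 billion parameters, which the thread does not confirm. `weight_size_gib` and `fits` are hypothetical helper names, and the 1 GiB overhead is a placeholder for context and runtime buffers.

```python
# Approximate bits per weight (assumed, rounded; real files differ per model).
APPROX_BITS_PER_WEIGHT = {
    "IQ4_XS": 4.25,
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
}


def weight_size_gib(params_billions: float, quant: str) -> float:
    """Estimated size of the quantized weights in GiB."""
    total_bits = params_billions * 1e9 * APPROX_BITS_PER_WEIGHT[quant]
    return total_bits / 8 / 2**30


def fits(params_billions: float, quant: str, vram_gib: float,
         overhead_gib: float = 1.0) -> bool:
    """True if weights plus an assumed fixed overhead fit entirely in VRAM."""
    return weight_size_gib(params_billions, quant) + overhead_gib <= vram_gib


if __name__ == "__main__":
    for quant in APPROX_BITS_PER_WEIGHT:
        size = weight_size_gib(12, quant)
        print(f"{quant}: ~{size:.1f} GiB, fully fits in 8 GiB: {fits(12, quant, 8)}")
```

A quant that fails the `fits` check can often still run with some layers offloaded to system RAM, at lower speed, so the check is about full-GPU residency rather than whether the model runs at all.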

[-]

nixuelkty@reddit

How many t/s on 8GB are you getting?

[-]

pmttyji@reddit

Old Gemma-3-12B(Q4, KVCache-Q8) gave me 20-30 t/s.

[-]

krzyk@reddit

Lucky you, I've got 6GB A1000 :(

[-]

pmttyji@reddit

Go for IQ4\_XS or IQ4\_NL. That's what I do for 14-15B models

[-]

jacek2023@reddit (OP)

But will you?

[-]

pmttyji@reddit

On my current laptop, that's the only way 😄 In the past, I tried Gemma3-27B Q3 and the laptop cried the first time. I got 1-2 t/s, I think, after loading half the layers into RAM. On my new rig (this month for sure), I'll go for Q6/Q8 of their 30B models.

[-]

Danmoreng@reddit

I installed Ubuntu on my old gaming notebook (1070 8GB, 32GB RAM) and it runs Qwen3.6 35BA3B surprisingly well in Q4. The 20GB model file, split across GPU and RAM, gets around 200 t/s prefill and, depending on context, 13-20 t/s tg.
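
To put those numbers in perspective, prefill and generation speeds translate into wall-clock time with simple arithmetic. A small sketch: the 200 t/s and 13-20 t/s figures are from the comment above, while the prompt and output lengths and the helper name `response_seconds` are made up for illustration.

```python
def response_seconds(prompt_tokens: int, output_tokens: int,
                     prefill_tps: float, generation_tps: float) -> float:
    """Time to process the prompt plus time to generate the reply, in seconds."""
    return prompt_tokens / prefill_tps + output_tokens / generation_tps


if __name__ == "__main__":
    # Hypothetical request: 4000-token prompt, 500-token answer.
    slow = response_seconds(4000, 500, prefill_tps=200, generation_tps=13)
    fast = response_seconds(4000, 500, prefill_tps=200, generation_tps=20)
    print(f"between {fast:.0f} s and {slow:.0f} s")
```

With long prompts the prefill term dominates, which is why prefill speed matters as much as tg for coding or agentic use.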

[-]

StaysAwakeAllWeek@reddit

Even for the less poor gpu club, with 24gb I can run Q8 with deep context and some light batching. It might not write code as complex as the larger models, but at Q8 it's a lot more reliable especially as the context grows

[-]

MaruluVR@reddit

"Containing the same advanced decoder structure as the Gemma 4 31B Dense model." Does this mean we can glue this encoder onto 31b and have audio and image without extra processing?

[-]

AppealThink1733@reddit

And what about qwen3.7 9B?

[-]

silenceimpaired@reddit

Google will get right on that /s

[-]

Melbar666@reddit

uncensored-heretic when? 😉

[-]

ZdzisiuFryta@reddit

Scared to ask but what's "heretic" doin to a model

[-]

No_Lingonberry1201@reddit

gemma-4-12b-uncensored-heretic-aggressive-terminator-killer-puppykicker-skynet-Q8\_K\_XL.gguf on its way.

[-]

jacek2023@reddit (OP)

I assume in a few hours? ;)

[-]

FormerPassenger1558@reddit

sorry for the stupid question: which one would be usable on a MacBook Pro M4 Pro, 48Gb RAM ?

[-]

arbv@reddit

yes

[-]

grudev@reddit

You can make a head to head comparison among all versions once they are released for Ollama: [https://github.com/dezoito/ollama-grid-search](https://github.com/dezoito/ollama-grid-search)

[-]

Few_Painter_5588@reddit

All of them at FP8 should run on that.

[-]

jacek2023@reddit (OP)

Try all and be a happy man

[-]

arbv@reddit

No multimodal support in llama-cpp for now, right?

[-]

M4GMaR@reddit

I'm hyped, but based on the benchmarks, it doesn't seem to be anywhere near the old Qwen 3.5 9B in most tasks. It's a bit frustrating that we've been getting so many small models over the past few weeks, yet none of them seem capable of matching even a fraction of what Qwen 3.5 9B can do. If they can't compete, why release them at all?... At this point, I'm worried that Gemma 4 12B might end up being another disappointment, just like LFM2.5 8B A1B and Mellum 2.

[-]

uhuge@reddit

it has audio/voice input in a very FT-friendly new mind-blowing architecture.

[-]

Adventurous-Paper566@reddit

Cool, but we wanted 124B 😁

[-]

windows_error23@reddit

I’m curious on how its vision and ocr would compare to qwen 3.5 9b.

[-]

unknowntoman-1@reddit

Am I the only one noticing the audio capabilities? Would make this an excellent translator of audio, right?

[-]

annodomini@reddit

Yeah, a Gemma with audio capabilities that's stronger than E4B is what excites me about this one. Will definitely be stronger for translation tasks, I noticed E4B was a bit weak for that (would do OK, but not as good on image or text translation compared to the bigger models).

[-]

jacek2023@reddit (OP)

I posted the picture with that info in the comment

[-]

BoogerheadCult@reddit

Hot garbage, tried and none of these models so far impressed me.

[-]

seamonn@reddit

User Issue.

[-]

BoogerheadCult@reddit

Whatever you say, LLM model is use-case specific, I guess with your intelligence, you work on low level project that such a bad model is good enough for you. Bet you don't even know how to eval models properly to learn their limitations. 🤡

[-]

seamonn@reddit

lmao

[-]

deathacus12@reddit

Does Gemma have RAG? Does anyone have any good benchmarks between Gemma and qwen3.6?

[-]

annodomini@reddit

There's a separate EmbeddingGemma (based on Gemma 3): https://huggingface.co/google/embeddinggemma-300m

This is going to be substantially weaker than the Qwen3.6 27B and 35B-A3B. Those both rank stronger than the larger Gemma 4 models on most tasks, and this one ranks weaker than the larger Gemma 4 models on all tasks.

But this fills in a gap that Qwen3.6 or the older Gemma 4 models didn't cover, the mid-size between E4B and Gemma4 26B A4B.

[-]

seamonn@reddit

Where is that damn 124b!!!???

[-]

Clean_Hyena7172@reddit

They'll never release it, my guess is that it came too close to the performance of the current Flash and Pro models so it would end up eating into their API revenue.

[-]

jld1532@reddit

The alternative is people just keep using qwen which isn't exactly a great look for American tech.

[-]

uhuge@reddit

they can pick Nemotron Super

[-]

Clean_Hyena7172@reddit

More competition would be nice, but looks unlikely.

[-]

jacek2023@reddit (OP)

it's a valid question but you should ask it on X 😉 [https://x.com/osanseviero/status/2062205174785921438](https://x.com/osanseviero/status/2062205174785921438)

[-]

seamonn@reddit

I have been spamming every tweet of his asking for Gemma 4:124b

[-]

arbv@reddit

Same. I want my 124B A4B MoE. Google cough it out, already.

[-]

Toastti@reddit

124b gguf when. 124b gguf when 124b gguf when? (Chant it together. If all of local llama does the incantation will be successful)

[-]

Clean_Hyena7172@reddit

Keep up the good fight.

[-]

jacek2023@reddit (OP)

Good work

[-]

srivatsasrinivasmath@reddit

Based Google. I feel like if more companies dropped open weight models the LLM backlash from society wouldn't be as high and they could still make as much of a profit

[-]

bonobomaster@reddit

Haha, I strongly believe that we are such a niche bubble. Nobody except us LLM enthusiasts knows shit about local LLMs. Nobody uses local LLMs except some early adopter companies and us. "Society" knows and uses ChatGPT and even that's only true for the younger demographic.

[-]

nullbyte420@reddit

Except for nation states and universities and all sorts of companies hiring for roles setting this up

[-]

DedsPhil@reddit

Backlash from us 100%, but the average person thinks AI is consuming unlimited amounts of water and killing artists. Some open weights that most people can't run, or couldn't make use of even if they could, will not change much.

[-]

jacek2023@reddit (OP)

Now Qwen must answer with 3.7 :)

[-]

stddealer@reddit

Very cool, but qat when?

[-]

hackerllama@reddit

Soon

[-]

dampflokfreund@reddit

Very nice! I hope you take this opportunity to fix some of the issues the model had!

[-]

Opening-Broccoli9190@reddit

Curious that they are positioning it as a laptop-ready solution, while currently I'd say the de facto laptop models are dominated by quantized MTP versions of Qwen3.6 35B A3B. I'd say that a Q4 quant of A3B could fit on 16GB and out-compete 12B Gemma4, but to be sure I'd need to do some benchmarking.

[-]

Tyrannas@reddit

Don't know your usecase and maybe I'm not using llama.cpp properly but I tried 3.6 35 q4 on my 16gb and have nothing left for context which makes it unusable for coding
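
The context budget problem can be estimated up front, since the KV cache grows linearly with context length. A minimal sketch of the standard estimate (keys plus values, per layer, per KV head); the layer count, KV head count, and head dimension below are made-up example values, not the real dimensions of any Qwen or Gemma model, and `kv_cache_gib` is a hypothetical helper name. It also assumes full attention on every layer, so it overestimates for models that use sliding-window layers.

```python
def kv_cache_gib(num_layers: int, num_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_element: float = 2.0) -> float:
    """Estimated KV cache size in GiB for one sequence.

    bytes_per_element: 2.0 for fp16; about 1.0 approximates an 8-bit cache.
    """
    # Factor 2: one tensor for keys and one for values.
    total_bytes = (2 * num_layers * num_kv_heads * head_dim
                   * context_tokens * bytes_per_element)
    return total_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical dimensions for illustration only.
    fp16 = kv_cache_gib(40, 8, 128, 32768)
    q8 = kv_cache_gib(40, 8, 128, 32768, bytes_per_element=1.0)
    print(f"32K context: ~{fp16:.1f} GiB at fp16, ~{q8:.1f} GiB at 8-bit")
```

Adding this figure to the weight size gives a quick idea of whether a given quant leaves enough VRAM for the context length you need, or whether a smaller quant, a quantized KV cache, or partial offload is required.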

[-]

jacek2023@reddit (OP)

I think Gemma models have different strengths than Qwen models. Look at available finetunes, for some reason people work on them

[-]

Hoak-em@reddit

Wooooo! Finally something with decent girth and audio support (other than Qwen3 Omni)

[-]

siegevjorn@reddit

12B is actually quite decent size to fit on most consumer GPUs. Nice job.

[-]

Final-Rush759@reddit

I like 31B. It gives concise answers.

[-]

MaartenGr@reddit

Can't help but share this one also here: [https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-gemma-4-12b](https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-gemma-4-12b) Was fun to work on this guide, especially considering the encoder-free architecture of it!

[-]

seamonn@reddit

Will you do one for the Gemma 4:124b when you guys release it?

[-]

Valuable_Touch5670@reddit

I gave this a quick read. The level of details into the LLM architecture is incredible. Well done!

[-]

bonobomaster@reddit

Interesting read, thx for sharing!

[-]

error_museum@reddit

https://preview.redd.it/aj3s6uisd35h1.png?width=773&format=png&auto=webp&s=a916a47bf775ffadc30c3a256acf8d793f06f95c

Am I the first to download it via LM Studio??

[-]

nickless07@reddit

Not gonna work well, need to update llama.cpp too.

[-]

error_museum@reddit

I'm currently running v2.19.1. Is this very far behind?

[-]

MarkoMarjamaa@reddit

Changes in llama.cpp were merged to master an hour ago, so I think you have to wait a bit.

[-]

nickless07@reddit

It literally states 'release b9482' - Check out llama.cpp directly, as of now latest is b9493 - Not much behind, but no support for the new model yet.

[-]

Guilty_Rooster_6708@reddit

Let’s go this is massive for 16gb VRAM user

[-]

Hydroskeletal@reddit

*very* interested to try this. Qwen9b just fell too short on a 16gb card for certain tasks and Q4 of the MoE was too unreliable - honestly this feels like the sweet spot that the local ecosystem needs.

[-]

mechasquare@reddit

Oh Heck yes! My meager 16GB of vram is ready for this!

[-]

arbv@reddit

Cool! P.S. 124B when? QAT versions when?

[-]

ea_man@reddit

Not complaining: this is good for those with \~12GB GPU. Yet as we had QWEN 9B for that, I would like a \~18B dense model, something in between the 9B and 27B for 16GB GPU.

[-]

EcstaticDentist@reddit

Banger

[-]

Valuable_Touch5670@reddit

Can’t wait to try and see if it beats Qwen 3.5 9B in coding

[-]

M4GMaR@reddit

Have you tried it yet? Does it beat 9B?

[-]

false79@reddit

I am loving this model so much. It is definitely punching above its weight.

[-]

annodomini@reddit

Nice to have an omni model that's a bit stronger than E4B. I've missed audio support in 31B and 26B A4B, and found that E4B was just a little bit weak, so this should be nice for cases where we need audio input. Would be really great to get that 124B at some point (especially if it has audio as well, but even if not). But nice to see that the Gemma 4 family is still getting releases, gives hope for 124B.

[-]

Hanthunius@reddit

4bit quant will fit in my ipad pro m5 🥳 6bit will fit in my 16gb mac 🥳

[-]

HornyGooner4402@reddit

First dense model that tries to fit into most people's consumer GPU in a while. Nowadays, labs don't even do that anymore. They just give you ~30B MoE or 4B dense and look at you funny

[-]

Temporary-Roof2867@reddit

Gemma4-26B-A4B if it has the right prompt is pure magic, I'm definitely curious to try this 12B 🤔 but I think it's far from Gemma4-26B-A4B... but being "dense" (but small enough for my GPU) it could prove to be very interesting 🤔

[-]

BitGreen1270@reddit

Wow - downloading the gguf now to see how it performs on my 780m and my 5090.

[-]

ttkciar@reddit

Fantastic! Thank you for this. I had just given up on using Qwen3.5-9B for my data augmentation project, which was utilizing my 16GB V340, and was at a loss as to what to use instead (preferably something small enough to allow enough surplus VRAM to process data in batches, requiring multiple context buffers/caches). I'd established to my satisfaction that Gemma-4-26B-A4B-it could perform the task, but that is a little too big for 16GB of VRAM, even without partitioning out multiple contexts for batched processing. I found myself wishing wistfully for a smaller Gemma4 model, and here it is!

[-]

false79@reddit

feeeeeed your potato! I liek models I can fit in my VRAM

[-]

alex20_202020@reddit

The info is contradictory about video: https://huggingface.co/google/gemma-4-12B

> This model card is for the Gemma 4 12B Unified model, which is part of the Gemma 4 family of open models. Built with the same multimodal functionality as Gemma 4 E2B and E4B (text, audio, image, and video inputs)

In the table (https://huggingface.co/google/gemma-4-12B#dense-models):

> Text, Image, Audio

[-]

jacek2023@reddit (OP)

[https://developers.googleblog.com/gemma-4-12b-the-developer-guide/](https://developers.googleblog.com/gemma-4-12b-the-developer-guide/)

[-]

larrytheevilbunnie@reddit

Oooh this is a nice middle ground between the E4B and the 26B. Wanted the 124b white whale, but this is nice too

[-]

Jealous-Astronaut457@reddit

waiting for 100b moe

[-]

jacek2023@reddit (OP)

https://preview.redd.it/a6o74t43d35h1.png?width=1197&format=png&auto=webp&s=4083c13d60da34255094761ca134a50d26097022

[-]

Eyelbee@reddit

Is it what they were doing instead of working on gemma 5?

[-]

seamonn@reddit

or releasing the 124b??

[-]

jacek2023@reddit (OP)

https://preview.redd.it/nqasdrqeb35h1.png?width=1217&format=png&auto=webp&s=144767b7483ed8a34f89311baea3f01497b713d8