TheaterFire

Which Gemma model do you want next?

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 116 comments

tell the Gemma team: [https://x.com/osanseviero/status/2046427241341698456](https://x.com/osanseviero/status/2046427241341698456)

Technical-Earth-3254@reddit

Better vision, larger models and less kv cache size natively.
View on Reddit #84043158

seamonn@reddit

> Better vision

Gemma 4 vision is already very good and way better than Qwen. People just need to set the vision budget in llama.cpp with --image-min-tokens and --image-max-tokens. The defaults are 40 and 280 respectively. For best results, I recommend 560 and 2240. With this, Gemma 4 is able to OCR very fine and hazy details.
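
A minimal sketch of the invocation described above, assembled as an argument list in Python. The two image-token flags and the 560/2240 values come from the comment; `-m` and `--mmproj` are the usual llama.cpp options for the model and vision projector; the file names are placeholders. Whether a given build accepts the image-token flags is worth confirming with `llama-server --help`.

```python
import shlex


def vision_server_args(model_path, mmproj_path, min_tokens=560, max_tokens=2240):
    """Build a llama-server argument list with a larger image token budget."""
    if min_tokens > max_tokens:
        raise ValueError("min_tokens must not exceed max_tokens")
    return [
        "llama-server",
        "-m", model_path,          # placeholder path to the GGUF model
        "--mmproj", mmproj_path,   # placeholder path to the vision projector
        "--image-min-tokens", str(min_tokens),
        "--image-max-tokens", str(max_tokens),
    ]


if __name__ == "__main__":
    # File names are hypothetical; substitute your own.
    print(shlex.join(vision_server_args("gemma-4.gguf", "mmproj.gguf")))
```

The list form can be passed straight to `subprocess.run`, which avoids shell quoting problems with paths that contain spaces.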
View on Reddit #84053797

AlwaysLateToThaParty@reddit

I take that back by the way. I've now done some more analysis, and the gemma 4 31b model is capturing information in images better, especially with large complex images with many parts to analyze.
View on Reddit #86154560

AlwaysLateToThaParty@reddit

> Gemma 4 vision is already very good and way better than Qwen.

Not my experience. I'm comparing it to the 122b/a10b qwen3.5 model, so larger than the latest gemma models, but qwen performs consistently better for me.
View on Reddit #84119193

Caffdy@reddit

> Also need to set --batch-size 4096 and --ubatch-size 4096

Why these settings in particular?
View on Reddit #84114103

guiopen@reddit

How do you know what the defaults are? I was searching for this and couldn't find it :( Also, do you know how llama.cpp decides the resolution? Let's say my min is 70 and my max is 1120, how will llama.cpp decide? And will it respect Gemma's supported resolutions (70, 140, 280, 560 and 1120), or will it pick something in between, like 700?
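
To make the question concrete, here is a small sketch of what "respecting the supported budgets" could mean. It does not describe what llama.cpp actually does; the snapping rule is purely an assumption for illustration, and only the five budget values come from the comment.

```python
SUPPORTED_BUDGETS = (70, 140, 280, 560, 1120)  # values listed in the comment above


def budgets_in_range(min_tokens, max_tokens, supported=SUPPORTED_BUDGETS):
    """Return the supported budgets that lie inside [min_tokens, max_tokens]."""
    return [b for b in supported if min_tokens <= b <= max_tokens]


def snap_to_supported(requested, supported=SUPPORTED_BUDGETS):
    """Hypothetical rule: largest supported budget not exceeding `requested`.

    Falls back to the smallest supported budget if `requested` is below all of them.
    """
    candidates = [b for b in supported if b <= requested]
    return max(candidates) if candidates else min(supported)


if __name__ == "__main__":
    print(budgets_in_range(70, 1120))  # all five budgets qualify
    print(snap_to_supported(700))      # 560 under this assumed rule
```

Under this assumed rule an in-between value such as 700 would never be used directly; whether the real implementation snaps, interpolates, or scales by image size is exactly what the question asks.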
View on Reddit #84075724

seamonn@reddit

I did a write up [here](https://www.reddit.com/r/LocalLLaMA/comments/1srrhi5/gemma_4_vision/).
View on Reddit #84079496

mga02@reddit

Can this be tweaked somehow in LM Studio? Gemma 4 vision performs way, way worse than Qwen 3.5 for me, and I thought it might be because of a misconfiguration.
View on Reddit #84079436

ResidentPositive4122@reddit

The small models are already good. Let's see what 124B was all about. We'll find hardware to run it :)
View on Reddit #84042545

BannedGoNext@reddit

Where's our flash killing 124b model lol. Actually the only improvement I can recommend would be better tool calling like qwen.
View on Reddit #84051587

typical-predditor@reddit

Flash 3 is the 124b version of Gemma 4. It's already commercially viable.
View on Reddit #84084920

BannedGoNext@reddit

You are probably right lol.
View on Reddit #84600435

seamonn@reddit

> Actually the only improvement I can recommend would be better tool calling like qwen.

I was struggling with tool calling as well until I switched over to the llama.cpp interleaved jinja template.
View on Reddit #84058136

piro4you@reddit

We REALLY need 4.1. Excuse me, but as of now I do not see a reason for Gemma 4 when Qwen 3.6 exists. It's not only smarter, but an overall better product. (Yes, I know that Gemma is multilingual and uses fewer tokens for output.)
View on Reddit #84047886

comfiestncoziest@reddit

Gemma 4 is vastly more creative and has better prose than Qwen. Depends on your use case.
View on Reddit #84065597

piro4you@reddit

Tool calling and technical analysis
View on Reddit #84068754

comfiestncoziest@reddit

I have indeed found Qwen to be better at those two things, although Gemma continues to get improvements.
View on Reddit #84179289

korino11@reddit

Ternary implementation on 100B+
View on Reddit #84043962

Silver-Champion-4846@reddit

With hardcore investment in optimizing the quality for that ternary compression.
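
For readers unfamiliar with the term, here is a minimal sketch of ternary weight quantization using an absmean scale, the scheme described for Microsoft's BitNet b1.58. It only shows the rounding step on a flat list of weights; real ternary models are trained with this rounding in the loop, and nothing here reflects how a Gemma variant would be built.

```python
def ternary_quantize(weights, eps=1e-8):
    """Absmean ternary quantization: returns (codes in {-1, 0, 1}, scale)."""
    if not weights:
        return [], 0.0
    # Scale is the mean absolute weight; eps avoids division by zero.
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clip to the ternary range.
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def ternary_dequantize(codes, scale):
    """Reconstruct approximate weights from ternary codes."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    codes, scale = ternary_quantize([0.8, -0.05, 0.3, -0.9])
    print(codes)  # [1, 0, 1, -1]
    print(ternary_dequantize(codes, scale))
```

Three states carry about 1.58 bits of information per weight, which is where the "1.58-bit" label comes from.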
View on Reddit #84084265

korino11@reddit

Sure, why not? It's very promising! We have two companies that proved it: Microsoft and Prism!
View on Reddit #84086619

Silver-Champion-4846@reddit

Falcon also tried their hands at it
View on Reddit #84168391

Serprotease@reddit

It’s pushing against the limit of local models, but I’d really like to see more things in the 200b-300b range.  It’s still something that can be run on some local (high end though) hardware and is a significant jump in intelligence from the 120b MoE.  Glm4.7 is very good at this range but zai moved to 700b now.   That’s a size where models can challenge sonnet with some credibility.  
View on Reddit #84149859

wren6991@reddit

Standardise on tool use tokens
View on Reddit #84141394

Own-Potential-2308@reddit

QAT across the board! Especially strong 4-bit (and experimental lower-bit) versions. Since most local runs are quantized, training with quantization in mind would minimize quality drop at low bits. Google pioneered aspects of this, bring it back!
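
As a rough illustration of what "training with quantization in mind" means, the sketch below shows the fake-quantization step that QAT typically applies to weights in the forward pass: quantize to a low-bit grid, then immediately dequantize, so the model trains against the rounding error it will see after export. This is a generic symmetric per-tensor scheme, not the Gemma team's recipe.

```python
def fake_quantize(weights, bits=4):
    """Symmetric round-to-nearest quantize then dequantize (one scale per tensor)."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    max_abs = max((abs(w) for w in weights), default=0.0)
    if max_abs == 0:
        return list(weights)
    scale = max_abs / qmax
    out = []
    for w in weights:
        q = max(-qmax, min(qmax, round(w / scale)))  # integer code
        out.append(q * scale)                        # back to float
    return out


if __name__ == "__main__":
    original = [0.9, -0.35, 0.02, 0.5]
    for w, fq in zip(original, fake_quantize(original)):
        print(f"{w:+.3f} -> {fq:+.3f}")
```

In an actual training loop the rounding is not differentiable, so frameworks usually pass gradients through it unchanged (the straight-through estimator).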
View on Reddit #84133647

Own-Potential-2308@reddit

A true mid-range gap filler: Something like a strong 12-15B dense (or a compact MoE) that sits nicely between the E4B and the 26B/31B. A lot of 16-24 GB VRAM users feel a bit underserved right now.
View on Reddit #84133262

DelKarasique@reddit

Midrange one. Like 70b. I think that's a sweet and empty spot right now.
View on Reddit #84042552

ego100trique@reddit

We have a different perception of what midrange is apparently
View on Reddit #84073486

_risho_@reddit

it depends on your device. if you are on something like a macbook or a strix halo you can get 64/96/128gb of vram on achievable consumer hardware. if you are using dedicated pci-e gpus then its clearly not.
View on Reddit #84130036

rakarsky@reddit

What is your midrange? 300B?
View on Reddit #84083627

General-Cookie6794@reddit

Waiting for 4b9
View on Reddit #84129782

mr_Owner@reddit

Gemma 4 E10B and or a max 80B MoE pleaaase 🥺
View on Reddit #84050349

General-Cookie6794@reddit

Waiting for this too
View on Reddit #84129773

snek_kogae@reddit

<5 GB safetensor fragments will make it much easier to import into our org!!
View on Reddit #84127425

New_Alps_5655@reddit

Pixel 14 Pro with built-in Taalas in-silicon Gemma5-70b running 17,000 t/s. Google devs I know you browse here..
View on Reddit #84124620

kmp11@reddit

how about Gemma 4.1 31B with memory usage optimizations? With some of google technology (ie TurboQuant) implemented? Give us a 1bit model? Gemma 4, in its current form, is a KV cache hog sleeping in the summer sun. Large and lazy...
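
To put a number on the "KV cache hog" complaint, here is a back-of-the-envelope estimate. The layer count, KV head count and head dimension in the example are made-up placeholders, not Gemma 4's real configuration, and the formula assumes plain full attention on every layer (sliding-window layers would store less).

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """KV cache size: one key and one value vector per layer, KV head and position."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


if __name__ == "__main__":
    # Hypothetical config, chosen only to illustrate the arithmetic.
    fp16 = kv_cache_bytes(n_layers=60, n_kv_heads=16, head_dim=128, context_len=32768)
    q8 = kv_cache_bytes(60, 16, 128, 32768, bytes_per_value=1)
    print(fp16 / 2**30)  # 15.0 GiB at 16-bit
    print(q8 / 2**30)    # 7.5 GiB at 8-bit
```

The size grows linearly with context length and with the number of KV heads, which is why cache quantization and fewer KV heads are the usual levers.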
View on Reddit #84117848

DeepOrangeSky@reddit

70b dense 124b MoE
View on Reddit #84043981

Lesser-than@reddit

even though I couldn't run them I would root for it. They probably are not terribly interested in releasing outside what the average gamer machine can run though.
View on Reddit #84115399

Caffdy@reddit

yeah, 70B dense/120B MoE is the new mid-sized model range
View on Reddit #84113899

ZeusZCC@reddit

70b dense can be really good. I love dense models
View on Reddit #84058724

RedParaglider@reddit

What do you use them for?  They are so slow.
View on Reddit #84064620

NigaTroubles@reddit

But they are better
View on Reddit #84066765

ttkciar@reddit

I let them infer long tasks or batches overnight while I sleep. K2-V2-Instruct (72B dense) is wicked-smart, but not great at creative writing nor codegen. I've been using it to populate a RAG database of responses to prompts generated from users' real prompts mutated/diversified by Evol-Instruct, so that the next time we prompt on the same/similar subject, there may be high-quality information in the RAG database the fast model can use to generate a better answer. I'd love to have a big, dense highly-competent local model to do similar for creative writing and code generation. Tried Devstral 2 Large, but it's not great (a lot worse than my go-to, GLM-4.5-Air).
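
The overnight workflow above amounts to a store that a slow model fills and a fast model reads. The sketch below shows that shape only: the class name, the word-overlap similarity and the threshold are all assumptions for illustration, and a real setup would more likely use embedding search. The commenter's actual tooling is not described.

```python
def tokenize(text):
    """Lowercase word set; a crude stand-in for an embedding."""
    return set(text.lower().split())


def jaccard(a, b):
    """Overlap between two token sets, from 0.0 to 1.0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


class AnswerStore:
    """Hypothetical store: the slow model adds answers, the fast model looks them up."""

    def __init__(self):
        self._entries = []  # list of (token_set, answer)

    def add(self, prompt, answer):
        self._entries.append((tokenize(prompt), answer))

    def lookup(self, prompt, min_score=0.3):
        """Return the answer for the most similar stored prompt, or None."""
        query = tokenize(prompt)
        best = max(self._entries, key=lambda e: jaccard(query, e[0]), default=None)
        if best is None or jaccard(query, best[0]) < min_score:
            return None
        return best[1]


if __name__ == "__main__":
    store = AnswerStore()
    store.add("how do I tune kv cache size", "answer written overnight by the big model")
    print(store.lookup("how do I tune the kv cache"))
    print(store.lookup("best pizza recipe"))  # None
```

A retrieved answer would then be placed in the fast model's context as reference material rather than returned verbatim.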
View on Reddit #84066746

anonutter@reddit

would be great to have an audio-to-audio model
View on Reddit #84114846

Waste-Intention-2806@reddit

Natively 4 bit trained or 1 bit like bonsai trained. Model params 70b to 120b and should be MOE so that it can run faster on all devices. Size should be around or less than 48 gb + 10 to 20gb context. Active params should be from 4b to support 8/12gb vram or 8b for 16 &16+ gb vram. If it has intelligence of a model around 200b+ params. This will be the goat
View on Reddit #84046087

dampflokfreund@reddit

I still don't get why QAT is still not popular. Most people are going to use 4 bit, so why not train it in 4 bit. Even better, 1 bit if the quality is still good. Google pioneered it, OpenAI followed and then everyone just abandoned it despite its great performance.
View on Reddit #84050867

stoppableDissolution@reddit

Kimi is native 4bit
View on Reddit #84062522

Caffdy@reddit

and Minimax is 8bit IIRC
View on Reddit #84114130

BigYoSpeck@reddit

Take the per layer embeddings arch of E2B/E4B and make it E62B, then make it MOE with 10B active parameters You'd have a model that anyone with 12gb VRAM + 32gb RAM or more can run which would hopefully beat Gemma 4 31B
View on Reddit #84048552

z_latent@reddit

I'm curious to see how effective PLEs are for bigger models. The explanation that it's "reminding the model of the current token at every layer" implies bigger and deeper models should benefit even more from it.
View on Reddit #84075441

DeepOrangeSky@reddit

Yea, I am curious how much more it could be scaled as well (and how much more they could get out of it, if they did scale it significantly), like if all it could do is mainly just increase the vocabulary size more and more, but not much else, or if it could be used in more interesting ways than just that. I was going to link [this thread](https://www.reddit.com/r/LocalLLaMA/comments/1sd5utm/perlayer_embeddings_a_simple_explanation_of_the/) that -p-e-w- made about it a couple weeks ago that has a lot of in-depth explanation about it, but I was browsing through it and I see you were one of the commenters in it so you already saw it. Most of it is too advanced for me to understand yet, so I don't really know if it can only be used for vocab (and thus not do anything too crazy as far as scaling it massively) or if it can be used for all sorts of other aspects of an LLM, and thus become the future of LLMs in some big way. If it can be used the latter way, then I'd think it would already be a main architecture of model for big models though, no? Like why aren't they already scaling it like that and using it that way? Or is Google the only ones that know how to do it so far, and have just chosen to only do it at small scale to tease us with it a bit, and others would do it at big scale and do crazy shit with it, but they just don't know how yet? Anyway, pretty interesting. Kind of makes me want to learn more in depth stuff about how LLMs work to see if I can understand it better
View on Reddit #84112936

Turbulent_War4067@reddit

76B MoE with very strong reasoning/tool calling and NVFP4 out of the box.
View on Reddit #84112009

ThisGonBHard@reddit

A 120B MoE model with 10B active. That or a dense 70-80B.
View on Reddit #84111476

charles25565@reddit

I'd personally like to see a 270M-ish Gemma 4.
View on Reddit #84098582

gnnr25@reddit

Gemma 4n
View on Reddit #84092195

SPoKK1@reddit

***Gemma4 144B A12B please.*** 🎆
View on Reddit #84086548

ea_nasir_official_@reddit

Big MoE
View on Reddit #84085175

Dramatic-Chard-5105@reddit

1B TTS multi language
View on Reddit #84043319

Silver-Champion-4846@reddit

These bigtech corpos are very capable of training the best small tts model, but it doesn't reach the standard of the hypemongers sadly
View on Reddit #84083784

Ps3Dave@reddit

15B dense, 40B MoE. these should fit 12GB VRAM (hopefully). Also an E6B with 256k context.
View on Reddit #84081857

Separate-Forever-447@reddit

"the best open models are those you can run in your devices" Objection, your honor... leading the witness.
View on Reddit #84075926

KillerX629@reddit

Gemma can't compete with Qwen on memory management, to be honest. But if I could choose, a hybrid Gemma that has the same kind of memory footprint would be a gem
View on Reddit #84073810

kevin_1994@reddit

I would love:

- FIM compatible model of any size
- 50B-70B dense model
- 120-200B MoE
- QAT quants
View on Reddit #84070397

Fastpas123@reddit

Slightly unrelated: were the overthinking problems with the Gemma 4 models fixed? I was using Gemma 4 E4B IT and it would just keep thinking no matter what I did to it
View on Reddit #84069932

khyryra@reddit

Gemma 5
View on Reddit #84068037

xAdakis@reddit

I want to see how a model like Gemma can perform when only trained on a single language (English) + programming languages.
View on Reddit #84067608

BidWestern1056@reddit

ones that don't flake on actual requests that one would need in an offline emergency. also not refusing to engage in discussions of world politics because "there is no way that iran and the united states would have started a war" lol
View on Reddit #84067442

Skyline34rGt@reddit

60/70B MoE model would be great.
View on Reddit #84044519

mikael110@reddit

On the topic of being closer to Gemini models, I really hope the next release offers Audio input for all of the model sizes. Audio is still an area where most OSS LLMs lag behind the proprietary models. And Gemini in particular is amazing at audio understanding.
View on Reddit #84067283

DesignerTruth9054@reddit

120B and 5-10B active 
View on Reddit #84063191

ttkciar@reddit

A 12B dense, please! Right now there's a gap between E4B and 26B, and consumer-grade GPUs fall right in that gap. Then, if you're feeling generous, that 123B MoE you teased in beta :-)
View on Reddit #84067047

power97992@reddit

Gemma 4  pro the one with 5-7 trillion params 
View on Reddit #84055258

jacek2023@reddit (OP)

what's your usecase for it?
View on Reddit #84055487

power97992@reddit

Coding. I want them to open-source it so the API cost will be cheaper…
View on Reddit #84062373

jacek2023@reddit (OP)

"open source" doesn't mean "cheap", "local" doesn't mean "cheap cloud", just as Linux is not "free Windows"
View on Reddit #84067040

NigaTroubles@reddit

170b and 10b active will be great
View on Reddit #84066950

jinnyjuice@reddit

So a couple of models for 32GB memory (assuming 4 bit quants) are already out. How about one for 64GB, one for 128GB, one for 256GB, and one for 512GB? But I'm actually more interested in different numbers of MoE instead. It would be interesting to compare a 128GB model with E8A, and another 128GB model but with E16A.
View on Reddit #84063177

stoppableDissolution@reddit

12-15B and 50-70B dense. Pretty Please?
View on Reddit #84062488

True_Requirement_891@reddit

A 9b gemma or a 24b one
View on Reddit #84060494

Kahvana@reddit

For Gemma 4: That 124B moe model, QAT. For Gemma 5: gated deltanet, engrams, manifold constrained hyper connections, vision + audio for all models.
View on Reddit #84060048

Monad_Maya@reddit

Around MiniMax M2 series so, 230B to 250B MoE.
View on Reddit #84059720

My_Unbiased_Opinion@reddit

give us 124B MOE. do it. and fix the abstinence with tool calling lol.
View on Reddit #84057351

baradas@reddit

Gemma CUA
View on Reddit #84056095

TheAncientOnce@reddit

Waiting to see if they'd pull a Qwen 3.6 moment where everyone votes for one thing and they do another XD
View on Reddit #84055983

Creepy-Bell-4527@reddit

A 120b model.
View on Reddit #84055283

rdsf138@reddit

Focus on multi-modality. I want to see many more modalities on models.
View on Reddit #84053423

a_beautiful_rhind@reddit

124b is already made. Just release it.
View on Reddit #84052771

SirSod@reddit

I would like to see real improvements without having to keep increasing the parameter count. 26B with A4B is excellent and runs well even on older consumer graphics cards (like the RTX 3060). If they could make it smarter in these two areas, agentic use and general knowledge, it would be incredible. We do have good agentic models below 100B, but models with factual knowledge below that mark simply do not exist.
View on Reddit #84052580

tat_tvam_asshole@reddit

best in class agentic tool use, safe autonomous behavior
View on Reddit #84052465

hyggeradyr@reddit

Massive variety of tools, skills, and specialized low parameter models for higher efficiency at lower compute. I'd rather run 10 different small orchestrated agents than one shitty, unpredictable, general model.
View on Reddit #84052437

cptrootbeer@reddit

Taalas style chip to run whichever model extremely quickly.
View on Reddit #84052129

ComplexType568@reddit

Hot take but I want to see a 120b dense model from any competent lab tbh (besides mistral), I want to see them push the limits for low sized models (maybe a size like that could compete with trillion-sized models? Or maybe there's a hard ceiling? We wouldn't know until we tried), think about Q3.5 27b and G4 31b, imagine that but >100b. MoEs are super saturated with models already, of course one from such miracle labs like Google and qwen would be good, but I feel like one is bound to release anyway, might as well ask for something special like this. My thoughts though.
View on Reddit #84051984

dampflokfreund@reddit

Misleading thread title. He is asking what features we want to see next, which may include but is not limited to model sizes. I would like to see QAT models again. I think Gemma 4.1 is also needed because the agentic and code performance, while not bad, is not great either. And there are some bugs in the 26b model: it says in its reasoning or in the user response that it wants to do X, but then doesn't call the tool. That seems like a model issue. Would also like to see audio input for all models, ideally not only voice but also sounds. For Gemma 5 I would like to see omnimodality.
View on Reddit #84051051

Mashic@reddit

12B dense model.
View on Reddit #84050391

Significant_Fig_7581@reddit

48B MOE or A 60B MOE...
View on Reddit #84050375

No_Secret4395@reddit

9b gemma
View on Reddit #84049954

durden111111@reddit

124B dude, we know it exists lol
View on Reddit #84049937

Asleep-Ingenuity-481@reddit

I want even smaller models, under 1b params. something that can be run in tandem with gpu intensive tasks, like gaming or something.
View on Reddit #84049484

OpinionatedUserName@reddit

9b-12b, that can be run on mobile, with agentic capabilities trained for search and mobile control . With safeguards so as it doesn't render the operating device bricked unintentionally, i.e it must be trained to not harm the base line Android system so it can work flawlessly when given full access. So basically a mobile focussed variant which is multimodal, better if it is any-to-any.
View on Reddit #84048922

Intelligent_Ice_113@reddit

some bigger MoE models would be nice, as competitor to qwen 3.6 35b
View on Reddit #84048825

Mother_Context_2446@reddit

70b dense, 124b MoE, something that fits on 80-120GB VRAM :-}
View on Reddit #84048342

Firepal64@reddit

I wonder if per-layer embeddings scale. 4B active with 10B PLE?
View on Reddit #84047508

Salt-Advertising-939@reddit

QAT versions of gemma 4
View on Reddit #84046883

brown2green@reddit

Difficult to suggest anything considering that Gemma 4 at least at 31B size is already so good, but definitely I'd like to see QAT _on the entire model_ so we can simply quantize every tensor to 4-bit (or even less than that) with limited to no quality loss. Or they could go even further than that and publish a quantization-aware-trained Gemma 4 124B in ~1-bit just to flex their muscles. That should be able to run on 24GB GPUs. Also, they should release something between the E4B and the 26B models for mid-low range GPUs, I guess.
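
A quick check of the 24GB claim above, as a weights-only estimate. It assumes 1 GB = 10^9 bytes and ignores KV cache, activations and any tensors kept at higher precision, so real files will be somewhat larger.

```python
def weight_gb(params_billions, bits_per_weight):
    """Approximate weight storage in GB for a model of the given size and precision."""
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    print(weight_gb(124, 1))            # 15.5 GB at exactly 1 bit per weight
    print(round(weight_gb(124, 1.58), 2))  # 24.49 GB at ternary (about 1.58 bits)
    print(weight_gb(124, 4))            # 62.0 GB at 4 bits
```

So a true 1-bit 124B model would leave headroom on a 24GB card, while a ternary one would already be slightly over before any context is allocated.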
View on Reddit #84046552

Tokarak@reddit

Does nobody use encoder-decoder models? T5gemma3.
View on Reddit #84046451

kabachuha@reddit

More IP knowledge. Currently, if you read the UGI leaderboard NatInt Categories, Pop Culture, you will see Gemma 4 having 30-31 points while Gemini itself has >78. This shows they have really nerfed its dataset of copyrighted data, very sadly.
View on Reddit #84046263

source-drifter@reddit

i want something like a cat, if it fits, it sits. for me it needs to fit into 24gb vram. lol
View on Reddit #84043199

PiratesOfTheArctic@reddit

That's actually a great idea, a dynamic model that fits memory/specs
View on Reddit #84045824

El_90@reddit

Instead of a param size (which doesn't seem to be entirely reflective), let's focus on GB in VRAM.

It feels like the 24-48GB audience is well served, and the 200GB audience is well served.

Maybe some more love for the 128GB system users, e.g. Strix (so a 90-95GB model allowing 20GB cache).

Selfishly speaking, of course.
View on Reddit #84045122

Altruistic-Theme432@reddit

I hope to see a 20B MOE model, like the GPTOSS20B. The gemma26B is still a bit too big for 16GB of video memory.
View on Reddit #84044857

ready_to_fuck_yeahh@reddit

Gemini 4 pro ultra /s
View on Reddit #84043751

MomentJolly3535@reddit

From that emoji I'm expecting very small phone models (2B and under)
View on Reddit #84043598

Such_Advantage_6949@reddit

124B model please
View on Reddit #84043439

pmttyji@reddit

* 15B Dense (Q4 could fit 8GB VRAM) - Competitive to Qwen3.5-9B
* 70-80B Dense/MOE
* Yeah, that 124B one
View on Reddit #84043231

AltruisticList6000@reddit

20b dense model
View on Reddit #84043183

BothYou243@reddit

agentic stuff
View on Reddit #84043140

VoiceApprehensive893@reddit

14bish dense and >40b moe
View on Reddit #84043070