Which Gemma model do you want next?
Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 114 comments
tell the Gemma team:
piro4you@reddit
We REALLY need 4.1.
Excuse me, but as of now I don't see a reason for Gemma 4 when Qwen 3.6 exists.
It's not only smarter, but an overall better product. (Yes, I know that Gemma is multilingual and uses fewer tokens for output.)
comfiestncoziest@reddit
Gemma 4 is vastly more creative and has better prose than Qwen. Depends on your use case.
piro4you@reddit
Tool calling and technical analysis
comfiestncoziest@reddit
I have indeed found Qwen to be better at those two things, although Gemma continues to get improvements.
korino11@reddit
Ternary implementation on 100B+
Silver-Champion-4846@reddit
With hardcore investment in optimizing the quality for that ternary compression.
korino11@reddit
Sure, why not? It's very promising! Two companies have already proved it works: Microsoft + Prism!
Silver-Champion-4846@reddit
Falcon also tried their hands at it
Serprotease@reddit
It’s pushing against the limit of local models, but I’d really like to see more things in the 200b-300b range.
It’s still something that can be run on some local (high end though) hardware and is a significant jump in intelligence from the 120b MoE.
GLM 4.7 is very good in this range, but Z.ai has moved to 700b now.
That’s a size where models can challenge sonnet with some credibility.
wren6991@reddit
Standardise on tool use tokens
Own-Potential-2308@reddit
QAT across the board! Especially strong 4-bit (and experimental lower-bit) versions. Since most local runs are quantized, training with quantization in mind would minimize quality drop at low bits. Google pioneered aspects of this, bring it back!
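For context, a minimal sketch of what quantization-aware training means in practice: fake-quantize the weights to 4-bit in the forward pass and let gradients flow through with a straight-through estimator. This is an illustration of the general technique, not Google's actual recipe; the symmetric per-tensor scheme and layer sizes are made-up assumptions.

```python
# Minimal 4-bit QAT sketch: the optimizer updates full-precision weights,
# but the loss always "sees" their 4-bit fake-quantized version.
import torch
import torch.nn as nn

def fake_quant_int4(w: torch.Tensor) -> torch.Tensor:
    # Symmetric per-tensor int4: representable range is [-8, 7].
    scale = w.abs().max().clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7) * scale
    # Straight-through estimator: forward uses q, backward treats it as identity.
    return w + (q - w).detach()

class QATLinear(nn.Linear):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.linear(x, fake_quant_int4(self.weight), self.bias)

# Toy training step on random data.
layer = QATLinear(16, 16)
opt = torch.optim.SGD(layer.parameters(), lr=1e-2)
x, target = torch.randn(4, 16), torch.randn(4, 16)
loss = nn.functional.mse_loss(layer(x), target)
loss.backward()
opt.step()
```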
Own-Potential-2308@reddit
A true mid-range gap filler: Something like a strong 12-15B dense (or a compact MoE) that sits nicely between the E4B and the 26B/31B. A lot of 16-24 GB VRAM users feel a bit underserved right now.
DelKarasique@reddit
Midrange one. Like 70b. I think that's a sweet and empty spot right now.
ego100trique@reddit
We have a different perception of what midrange is apparently
_risho_@reddit
It depends on your device. If you're on something like a MacBook or a Strix Halo, you can get 64/96/128GB of VRAM on achievable consumer hardware. If you're using dedicated PCIe GPUs, then it's clearly not midrange.
rakarsky@reddit
What is your midrange? 300B?
General-Cookie6794@reddit
Waiting for 4b9
mr_Owner@reddit
Gemma 4 E10B and or a max 80B MoE pleaaase 🥺
General-Cookie6794@reddit
Waiting for this too
snek_kogae@reddit
<5 GB safetensor fragments will make it much easier to import into our org!!
New_Alps_5655@reddit
Pixel 14 Pro with built-in Taalas in-silicon Gemma5-70b running 17,000 t/s. Google devs I know you browse here..
Technical-Earth-3254@reddit
Better vision, larger models and less kv cache size natively.
seamonn@reddit
Gemma 4 vision is already very good and way better than Qwen. People just need to set the vision budget in llama.cpp with --image-min-tokens and --image-max-tokens.
The defaults are 40 and 280 respectively. For best results, I recommend 560 and 2240. With this, Gemma 4 is able to OCR very fine and hazy details.
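For reference, one way to apply those flags when launching llama-server; the model/mmproj file names and port below are placeholders, and 560/2240 are just the values recommended above.

```python
# Launch llama-server with a larger vision token budget for Gemma.
# The GGUF file names and port are placeholder assumptions.
import subprocess

subprocess.run([
    "llama-server",
    "-m", "gemma-4-31b-it-Q4_K_M.gguf",    # placeholder model file
    "--mmproj", "mmproj-gemma-4.gguf",     # placeholder vision projector
    "--image-min-tokens", "560",           # default is 40
    "--image-max-tokens", "2240",          # default is 280
    "--port", "8080",
], check=True)
```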
AlwaysLateToThaParty@reddit
Not my experience. I'm comparing it to the 122b/a10b qwen3.5 model, so larger than the latest gemma models, but qwen performs consistently better for me.
Caffdy@reddit
why these settings in particular?
guiopen@reddit
How do you know what the defaults are? I was searching for this and couldn't find it :(
Also, do you know how llama.cpp decides the resolution? Let's say my min is 70 and my max is 1120, how will llama.cpp decide? And will it respect Gemma's supported resolutions (70, 140, 280, 560 and 1120) or will it pick something in between, like 700?
seamonn@reddit
I did a write up here.
mga02@reddit
Can this be tweaked somehow in LM Studio? Because Gemma 4 vision performs way, way worse than Qwen 3.5 for me, and I thought it might be because of a misconfiguration.
kmp11@reddit
How about Gemma 4.1 31B with memory usage optimizations? With some of Google's technology (i.e. TurboQuant) implemented? Give us a 1-bit model?
Gemma 4, in its current form, is a KV cache hog sleeping in the summer sun. Large and lazy...
DeepOrangeSky@reddit
70b dense
124b MoE
Lesser-than@reddit
Even though I couldn't run them, I would root for it. They're probably not terribly interested in releasing anything beyond what the average gamer machine can run, though.
Caffdy@reddit
yeah, 70B dense/120B MoE is the new mid-sized model range
ZeusZCC@reddit
70b dense can be really good. I love dense models.
RedParaglider@reddit
What do you use them for? They are so slow.
NigaTroubles@reddit
But they are better
ttkciar@reddit
I let them infer long tasks or batches overnight while I sleep.
K2-V2-Instruct (72B dense) is wicked-smart, but not great at creative writing nor codegen. I've been using it to populate a RAG database of responses to prompts generated from users' real prompts mutated/diversified by Evol-Instruct, so that the next time we prompt on the same/similar subject, there may be high-quality information in the RAG database the fast model can use to generate a better answer.
I'd love to have a big, dense highly-competent local model to do similar for creative writing and code generation. Tried Devstral 2 Large, but it's not great (a lot worse than my go-to, GLM-4.5-Air).
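Roughly, an overnight pipeline like that could look like the sketch below, assuming an OpenAI-compatible local endpoint (e.g. llama-server) and a JSONL file standing in for the RAG store; the endpoint, file names, and the Evol-Instruct-style mutation prompt are my placeholders, not ttkciar's actual setup.

```python
# Overnight batch: diversify real user prompts, answer them with a big slow
# model, and save the pairs so a faster model can retrieve them later.
# Endpoint URL and file paths are placeholder assumptions.
import json
import requests

ENDPOINT = "http://localhost:8080/v1/chat/completions"  # OpenAI-compatible local server
EVOLVE = ("Rewrite the following prompt into a more specific, "
          "more challenging variant:\n\n{prompt}")

def ask(prompt: str) -> str:
    resp = requests.post(ENDPOINT, json={
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
    }, timeout=3600)
    return resp.json()["choices"][0]["message"]["content"]

with open("user_prompts.txt") as f, open("rag_corpus.jsonl", "a") as out:
    for line in f:
        evolved = ask(EVOLVE.format(prompt=line.strip()))  # Evol-Instruct-style mutation
        answer = ask(evolved)                              # slow, high-quality model
        out.write(json.dumps({"prompt": evolved, "response": answer}) + "\n")
```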
anonutter@reddit
Would be great to have an audio-to-audio model.
Waste-Intention-2806@reddit
Natively 4-bit trained, or 1-bit like Bonsai. Model params 70b to 120b, and it should be MoE so it can run faster on all devices. Size should be around or under 48 GB, plus 10 to 20 GB of context. Active params should be around 4b to support 8/12 GB VRAM, or 8b for 16 GB and up. If it has the intelligence of a model around 200b+ params, this will be the GOAT.
dampflokfreund@reddit
I still don't get why QAT isn't popular. Most people are going to run 4-bit, so why not train it in 4-bit? Even better, 1-bit if the quality is still good. Google pioneered it, OpenAI followed, and then everyone just abandoned it despite its great performance.
stoppableDissolution@reddit
Kimi is native 4bit
Caffdy@reddit
and Minimax is 8bit IIRC
BigYoSpeck@reddit
Take the per layer embeddings arch of E2B/E4B and make it E62B, then make it MOE with 10B active parameters
You'd have a model that anyone with 12gb VRAM + 32gb RAM or more can run which would hopefully beat Gemma 4 31B
z_latent@reddit
I'm curious to see how effective PLEs are for bigger models. The explanation that it's "reminding the model of the current token at every layer" implies bigger and deeper models should benefit even more from it.
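As a toy illustration of that reading (my own interpretation, not Gemma's actual implementation; the dimensions, the projection, and the residual placement are all assumptions):

```python
# Toy per-layer embeddings (PLE): every layer re-injects its own small
# embedding of the current token ids. Attention is omitted for brevity.
import torch
import torch.nn as nn

class ToyPLEBlock(nn.Module):
    def __init__(self, vocab: int, d_model: int, d_ple: int = 256):
        super().__init__()
        self.ple_embed = nn.Embedding(vocab, d_ple)   # layer-specific table
        self.ple_proj = nn.Linear(d_ple, d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                 nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, h: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # "Remind" this layer of the raw tokens, then do the usual block work.
        h = h + self.ple_proj(self.ple_embed(token_ids))
        return h + self.mlp(h)

vocab, d_model = 32000, 512
blocks = nn.ModuleList(ToyPLEBlock(vocab, d_model) for _ in range(4))
ids = torch.randint(0, vocab, (2, 16))
h = nn.Embedding(vocab, d_model)(ids)
for blk in blocks:
    h = blk(h, ids)   # each layer gets a fresh look at the token ids
```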
DeepOrangeSky@reddit
Yeah, I'm curious how much further it could be scaled as well (and how much more they could get out of it if they did scale it significantly): whether all it can really do is keep growing the vocabulary size, or whether it could be used in more interesting ways than just that.
I was going to link this thread that -p-e-w- made about it a couple weeks ago that has a lot of in-depth explanation about it, but I was browsing through it and I see you were one of the commenters in it so you already saw it.
Most of it is too advanced for me to understand yet, so I don't really know if it can only be used for vocab (and thus not do anything too crazy as far as scaling it massively) or if it can be used for all sorts of other aspects of an LLM, and thus become the future of LLMs in some big way.
If it can be used the latter way, then I'd think it would already be a main architecture for big models, no? Like, why aren't they already scaling it like that and using it that way? Or is Google the only one that knows how to do it so far, and they've just chosen to only do it at small scale to tease us with it a bit, while others would do it at big scale and do crazy shit with it, but just don't know how yet?
Anyway, pretty interesting. Kind of makes me want to learn more in depth stuff about how LLMs work to see if I can understand it better
Turbulent_War4067@reddit
76B MoE with very strong reasoning/tool calling and NVFP4 out of the box.
ThisGonBHard@reddit
A 120B MoE model, with 10B active. That or a dense 70-80B.
charles25565@reddit
I'd personally like to see a 270M-ish Gemma 4.
gnnr25@reddit
Gemma 4n
SPoKK1@reddit
Gemma4 144B A12B please. 🎆
ea_nasir_official_@reddit
Big MoE
ResidentPositive4122@reddit
The small models are already good. Let's see what 124B was all about. We'll find hardware to run it :)
BannedGoNext@reddit
Where's our Flash-killing 124b model lol. Actually, the only improvement I can recommend would be better tool calling, like Qwen's.
typical-predditor@reddit
Flash 3 is the 124b version of Gemma 4. It's already commercially viable.
seamonn@reddit
I was struggling with tool calling as well until I switched over to the llama cpp interleaved jinja template.
Dramatic-Chard-5105@reddit
1B multilingual TTS
Silver-Champion-4846@reddit
These bigtech corpos are very capable of training the best small tts model, but it doesn't reach the standard of the hypemongers sadly
Ps3Dave@reddit
15B dense, 40B MoE. these should fit 12GB VRAM (hopefully). Also an E6B with 256k context.
Separate-Forever-447@reddit
"the best open models are those you can run in your devices"
Objection, your honor... leading the witness.
KillerX629@reddit
Gemma can't compete with Qwen on memory management, to be honest. But if I could choose, a hybrid Gemma with the same kind of memory footprint would be a gem.
kevin_1994@reddit
I would love:
Fastpas123@reddit
Slightly unrelated: were the overthinking problems with the Gemma 4 models fixed? I was using Gemma 4 E4B IT and it would just keep thinking no matter what I did to it
khyryra@reddit
Gemma 5
xAdakis@reddit
I want to see how a model like Gemma can perform when only trained on a single language (English) + programming languages.
BidWestern1056@reddit
ones that don't flake on actual requests that one would need in an offline emergency. also not refusing to engage in discussions of world politics because "there is no way that iran and the united states would have started a war" lol
Skyline34rGt@reddit
60/70B MoE model would be great.
mikael110@reddit
On the topic of being closer to Gemini models, I really hope the next release offers Audio input for all of the model sizes. Audio is still an area where most OSS LLMs lag behind the proprietary models. And Gemini in particular is amazing at audio understanding.
DesignerTruth9054@reddit
120B and 5-10B active
ttkciar@reddit
A 12B dense, please! Right now there's a gap between E4B and 26B, and consumer-grade GPUs fall right in that gap.
Then, if you're feeling generous, that 123B MoE you teased in beta :-)
power97992@reddit
Gemma 4 pro the one with 5-7 trillion params
jacek2023@reddit (OP)
what's your usecase for it?
power97992@reddit
Coding. I want them to open source it so the API cost will be cheaper…
jacek2023@reddit (OP)
"open source" doesn't mean "cheap", "local" doesn't mean "cheap cloud", just Linux is not "free Windows"
NigaTroubles@reddit
170b and 10b active will be great
jinnyjuice@reddit
So a couple of models for 32GB memory (assuming 4 bit quants) are already out.
How about one for 64GB, one for 128GB, one for 256GB, and one for 512GB?
But I'm actually more interested in different numbers of MoE instead. It would be interesting to compare a 128GB model with E8A, and another 128GB model but with E16A.
stoppableDissolution@reddit
12-15B and 50-70B dense. Pretty Please?
True_Requirement_891@reddit
A 9b gemma or a 24b one
Kahvana@reddit
For Gemma 4: That 124B moe model, QAT.
For Gemma 5: gated deltanet, engrams, manifold constrained hyper connections, vision + audio for all models.
Monad_Maya@reddit
Around the MiniMax M2 series, so 230B to 250B MoE.
My_Unbiased_Opinion@reddit
give us 124B MOE. do it. and fix the abstinence with tool calling lol.
baradas@reddit
Gemma CUA
TheAncientOnce@reddit
Waiting to see if they'd pull a Qwen 3.6 moment where everyone votes for one thing and they do another XD
Creepy-Bell-4527@reddit
A 120b model.
rdsf138@reddit
Focus on multi-modality. I want to see many more modalities on models.
a_beautiful_rhind@reddit
124b is already made. Just release it.
SirSod@reddit
I'd like to see genuinely real improvements without needing to keep increasing the parameter count. 26B with A4B is excellent, it runs well even on older consumer graphics cards (like the RTX 3060); if they could make it smarter in two areas, agentic work and general knowledge, it would be incredible.
We do have good agent models below 100B, but models with solid factual knowledge below that mark are always nonexistent.
tat_tvam_asshole@reddit
best in class agentic tool use, safe autonomous behavior
hyggeradyr@reddit
Massive variety of tools, skills, and specialized low parameter models for higher efficiency at lower compute. I'd rather run 10 different small orchestrated agents than one shitty, unpredictable, general model.
cptrootbeer@reddit
Taalas style chip to run whichever model extremely quickly.
ComplexType568@reddit
Hot take, but I want to see a 120b dense model from any competent lab tbh (besides Mistral). I want to see them push the limits of lower-sized models (maybe a size like that could compete with trillion-sized models? Or maybe there's a hard ceiling? We won't know until we try). Think about Q3.5 27b and G4 31b, and imagine that but >100b. MoEs are already super saturated with models; of course one from miracle labs like Google and Qwen would be good, but I feel like one is bound to release anyway, so we might as well ask for something special like this. My thoughts though.
dampflokfreund@reddit
Misleading thread title. He is asking what features we want to see next, which may include, but isn't limited to, model sizes.
I would like to see QAT models again. I think Gemma 4.1 is also needed because the agentic and code performance, while not bad, isn't great either. And there are some bugs in the 26b model, like saying in its reasoning or in the user-facing response that it wants to do X but then not calling the tool. That seems like a model issue.
Would also like to see audio input for all models, ideally not only voice but also sounds. For Gemma 5 I would like to see omnimodality.
Mashic@reddit
12B dense model.
Significant_Fig_7581@reddit
A 48B MoE or a 60B MoE...
No_Secret4395@reddit
9b gemma
durden111111@reddit
124B dude, we know it exists lol
Asleep-Ingenuity-481@reddit
I want even smaller models, under 1b params. something that can be run in tandem with gpu intensive tasks, like gaming or something.
OpinionatedUserName@reddit
9b-12b that can be run on mobile, with agentic capabilities trained for search and mobile control. With safeguards so it doesn't unintentionally brick the device it's running on, i.e. it must be trained not to harm the baseline Android system so it can work flawlessly when given full access. So basically a mobile-focused variant which is multimodal, better if it's any-to-any.
Intelligent_Ice_113@reddit
some bigger MoE models would be nice, as competitor to qwen 3.6 35b
Mother_Context_2446@reddit
70b dense, 124b MoE, something that fits on 80-120GB VRAM :-}
Firepal64@reddit
I wonder if per-layer embeddings scale. 4B active with 10B PLE?
Salt-Advertising-939@reddit
QAT versions of gemma 4
brown2green@reddit
It's difficult to suggest anything considering that Gemma 4, at least at the 31B size, is already so good, but I'd definitely like to see QAT on the entire model so we can simply quantize every tensor to 4-bit (or even less than that) with limited to no quality loss. Or they could go even further and publish a quantization-aware-trained Gemma 4 124B in ~1-bit just to flex their muscles. That should be able to run on 24GB GPUs.
Also, they should release something between the E4B and the 26B models for mid-low range GPUs, I guess.
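For what it's worth, a quick weights-only sanity check on that 24GB claim (ignoring KV cache, activations, and quantization overhead like scales):

```python
# Rough weights-only footprint of a 124B-parameter model at low bit-widths.
params = 124e9
for bits in (1.0, 1.58, 4.0):
    gib = params * bits / 8 / 2**30
    print(f"{bits:>4} bits -> {gib:5.1f} GiB")
# ~14.4 GiB at 1 bit and ~22.8 GiB at 1.58 bits, so a ~1-bit 124B really
# could squeeze onto a 24GB card; a 4-bit one (~57.7 GiB) cannot.
```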
Tokarak@reddit
Does nobody use encoder-decoder models? T5gemma3.
kabachuha@reddit
More IP knowledge. Currently, if you look at the UGI leaderboard NatInt categories (Pop Culture), you will see Gemma 4 at 30-31 points while Gemini itself has >78. This shows they have really stripped copyrighted material from its dataset, very sadly.
source-drifter@reddit
I want something like a cat: if it fits, it sits. For me it needs to fit into 24GB VRAM. lol
PiratesOfTheArctic@reddit
That's actually a great idea, a dynamic model that fits memory/specs
El_90@reddit
Instead of a param size (which doesn't seem to be entirely representative), let's focus on GB of VRAM.
It feels like the 24-48GB audience is well served, and the 200GB audience is well served.
Maybe some more love for the 128GB system users, e.g. Strix (so a 90-95GB model, allowing ~20GB for cache).
Selfishly speaking, of course.
Altruistic-Theme432@reddit
I hope to see a 20B MoE model, like GPT-OSS 20B. Gemma 26B is still a bit too big for 16GB of video memory.
ready_to_fuck_yeahh@reddit
Gemini 4 pro ultra /s
MomentJolly3535@reddit
From that emoji I'm expecting very small phone models (2B and under).
Such_Advantage_6949@reddit
124B model please
pmttyji@reddit
AltruisticList6000@reddit
20b dense model
BothYou243@reddit
agentic stuff
VoiceApprehensive893@reddit
14bish dense and >40b moe