Which Gemma model do you want next?
Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 114 comments
tell the Gemma team:
piro4you@reddit
We REALLY need 4.1.
Excuse me, but as of now I don't see a reason for Gemma 4 when Qwen 3.6 exists.
It's not only smarter, but an overall better product. (Yes, I know that Gemma is multilingual and uses fewer tokens for output.)
comfiestncoziest@reddit
Gemma 4 is vastly more creative and has better prose than Qwen. Depends on your use case.
piro4you@reddit
Tool calling and technical analysis
comfiestncoziest@reddit
I have indeed found Qwen to be better at those two things, although Gemma continues to get improvements.
korino11@reddit
Ternary implementation on 100B+
Silver-Champion-4846@reddit
With hardcore investment in optimizing the quality for that ternary compression.
korino11@reddit
Sure, why not? It's very promising! Two companies have already proved it works: Microsoft + Prism!
Silver-Champion-4846@reddit
Falcon also tried their hands at it
Serprotease@reddit
It’s pushing against the limit of local models, but I’d really like to see more things in the 200b-300b range.
It’s still something that can be run on some local (high end though) hardware and is a significant jump in intelligence from the 120b MoE.
GLM 4.7 is very good in this range, but Z.ai has moved to 700b now.
That’s a size where models can challenge sonnet with some credibility.
wren6991@reddit
Standardise on tool use tokens
Own-Potential-2308@reddit
QAT across the board! Especially strong 4-bit (and experimental lower-bit) versions. Since most local runs are quantized, training with quantization in mind would minimize quality drop at low bits. Google pioneered aspects of this, bring it back!
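For context, a minimal sketch of what quantization-aware training means in practice: fake-quantize the weights to 4-bit in the forward pass and let gradients flow through with a straight-through estimator. This is an illustration of the general technique, not Google's actual recipe; the symmetric per-tensor scheme and layer sizes are made-up assumptions.

```python
# Minimal 4-bit QAT sketch: the optimizer updates full-precision weights,
# but the loss always "sees" their 4-bit fake-quantized version.
import torch
import torch.nn as nn

def fake_quant_int4(w: torch.Tensor) -> torch.Tensor:
    # Symmetric per-tensor int4: representable range is [-8, 7].
    scale = w.abs().max().clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7) * scale
    # Straight-through estimator: forward uses q, backward treats it as identity.
    return w + (q - w).detach()

class QATLinear(nn.Linear):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.linear(x, fake_quant_int4(self.weight), self.bias)

# Toy training step on random data.
layer = QATLinear(16, 16)
opt = torch.optim.SGD(layer.parameters(), lr=1e-2)
x, target = torch.randn(4, 16), torch.randn(4, 16)
loss = nn.functional.mse_loss(layer(x), target)
loss.backward()
opt.step()
```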
Own-Potential-2308@reddit
A true mid-range gap filler: Something like a strong 12-15B dense (or a compact MoE) that sits nicely between the E4B and the 26B/31B. A lot of 16-24 GB VRAM users feel a bit underserved right now.
DelKarasique@reddit
Midrange one. Like 70b. I think that's a sweet and empty spot right now.
ego100trique@reddit
We have a different perception of what midrange is apparently
_risho_@reddit
It depends on your device. If you're on something like a MacBook or a Strix Halo, you can get 64/96/128GB of VRAM on achievable consumer hardware. If you're using dedicated PCIe GPUs, then it's clearly not midrange.
rakarsky@reddit
What is your midrange? 300B?
General-Cookie6794@reddit
Waiting for 4b9
mr_Owner@reddit
Gemma 4 E10B and or a max 80B MoE pleaaase 🥺
General-Cookie6794@reddit
Waiting for this too
snek_kogae@reddit
<5 GB safetensor fragments will make it much easier to import into our org!!
New_Alps_5655@reddit
Pixel 14 Pro with built-in Taalas in-silicon Gemma5-70b running 17,000 t/s. Google devs I know you browse here..
Technical-Earth-3254@reddit
Better vision, larger models and less kv cache size natively.
seamonn@reddit
Gemma 4 vision is already very good and way better than Qwen. People just need to set the vision budget in llama.cpp with --image-min-tokens and --image-max-tokens.
The defaults are 40 and 280 respectively. For best results, I recommend 560 and 2240. With this, Gemma 4 is able to OCR very fine and hazy details.
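For reference, one way to apply those flags when launching llama-server; the model/mmproj file names and port below are placeholders, and 560/2240 are just the values recommended above.

```python
# Launch llama-server with a larger vision token budget for Gemma.
# The GGUF file names and port are placeholder assumptions.
import subprocess

subprocess.run([
    "llama-server",
    "-m", "gemma-4-31b-it-Q4_K_M.gguf",    # placeholder model file
    "--mmproj", "mmproj-gemma-4.gguf",     # placeholder vision projector
    "--image-min-tokens", "560",           # default is 40
    "--image-max-tokens", "2240",          # default is 280
    "--port", "8080",
], check=True)
```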
AlwaysLateToThaParty@reddit
Not my experience. I'm comparing it to the 122b/a10b qwen3.5 model, so larger than the latest gemma models, but qwen performs consistently better for me.
Caffdy@reddit
why these settings in particular?
guiopen@reddit
How do you know what the defaults are? I was searching for this and couldn't find it :(
Also, do you know how llama.cpp decides the resolution? Let's say my min is 70 and my max is 1120, how will llama.cpp decide? And will it respect Gemma's supported resolutions (70, 140, 280, 560 and 1120) or will it pick something in between, like 700?
seamonn@reddit
I did a write up here.
mga02@reddit
Can this be tweaked somehow in LM Studio? Because Gemma 4 vision performs way, way worse than Qwen 3.5 for me, and I thought it might be because of a misconfiguration.
kmp11@reddit
How about Gemma 4.1 31B with memory usage optimizations? With some of Google's technology (i.e. TurboQuant) implemented? Give us a 1-bit model?
Gemma 4, in its current form, is a KV cache hog sleeping in the summer sun. Large and lazy...
DeepOrangeSky@reddit
70b dense
124b MoE
Lesser-than@reddit
Even though I couldn't run them, I would root for it. They're probably not terribly interested in releasing anything beyond what the average gamer machine can run, though.
Caffdy@reddit
yeah, 70B dense/120B MoE is the new mid-sized model range
ZeusZCC@reddit
70b dense can be really good. I love dense models.
RedParaglider@reddit
What do you use them for? They are so slow.
NigaTroubles@reddit
But they are better
ttkciar@reddit
I let them infer long tasks or batches overnight while I sleep.
K2-V2-Instruct (72B dense) is wicked-smart, but not great at creative writing nor codegen. I've been using it to populate a RAG database of responses to prompts generated from users' real prompts mutated/diversified by Evol-Instruct, so that the next time we prompt on the same/similar subject, there may be high-quality information in the RAG database the fast model can use to generate a better answer.
I'd love to have a big, dense highly-competent local model to do similar for creative writing and code generation. Tried Devstral 2 Large, but it's not great (a lot worse than my go-to, GLM-4.5-Air).
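Roughly, an overnight pipeline like that could look like the sketch below, assuming an OpenAI-compatible local endpoint (e.g. llama-server) and a JSONL file standing in for the RAG store; the endpoint, file names, and the Evol-Instruct-style mutation prompt are my placeholders, not ttkciar's actual setup.

```python
# Overnight batch: diversify real user prompts, answer them with a big slow
# model, and save the pairs so a faster model can retrieve them later.
# Endpoint URL and file paths are placeholder assumptions.
import json
import requests

ENDPOINT = "http://localhost:8080/v1/chat/completions"  # OpenAI-compatible local server
EVOLVE = ("Rewrite the following prompt into a more specific, "
          "more challenging variant:\n\n{prompt}")

def ask(prompt: str) -> str:
    resp = requests.post(ENDPOINT, json={
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
    }, timeout=3600)
    return resp.json()["choices"][0]["message"]["content"]

with open("user_prompts.txt") as f, open("rag_corpus.jsonl", "a") as out:
    for line in f:
        evolved = ask(EVOLVE.format(prompt=line.strip()))  # Evol-Instruct-style mutation
        answer = ask(evolved)                              # slow, high-quality model
        out.write(json.dumps({"prompt": evolved, "response": answer}) + "\n")
```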
anonutter@reddit
Would be great to have an audio-to-audio model.
Waste-Intention-2806@reddit
Natively 4-bit trained, or 1-bit like Bonsai. Model params 70b to 120b, and it should be MoE so it can run faster on all devices. Size should be around or under 48 GB, plus 10 to 20 GB of context. Active params should be around 4b to support 8/12 GB VRAM, or 8b for 16 GB and up. If it has the intelligence of a model around 200b+ params, this will be the GOAT.
dampflokfreund@reddit
I still don't get why QAT isn't popular. Most people are going to run 4-bit, so why not train it in 4-bit? Even better, 1-bit if the quality is still good. Google pioneered it, OpenAI followed, and then everyone just abandoned it despite its great performance.
stoppableDissolution@reddit
Kimi is native 4bit
Caffdy@reddit
and Minimax is 8bit IIRC
BigYoSpeck@reddit
Take the per layer embeddings arch of E2B/E4B and make it E62B, then make it MOE with 10B active parameters
You'd have a model that anyone with 12gb VRAM + 32gb RAM or more can run which would hopefully beat Gemma 4 31B
z_latent@reddit
I'm curious to see how effective PLEs are for bigger models. The explanation that it's "reminding the model of the current token at every layer" implies bigger and deeper models should benefit even more from it.
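As a toy illustration of that reading (my own interpretation, not Gemma's actual implementation; the dimensions, the projection, and the residual placement are all assumptions):

```python
# Toy per-layer embeddings (PLE): every layer re-injects its own small
# embedding of the current token ids. Attention is omitted for brevity.
import torch
import torch.nn as nn

class ToyPLEBlock(nn.Module):
    def __init__(self, vocab: int, d_model: int, d_ple: int = 256):
        super().__init__()
        self.ple_embed = nn.Embedding(vocab, d_ple)   # layer-specific table
        self.ple_proj = nn.Linear(d_ple, d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                 nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, h: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # "Remind" this layer of the raw tokens, then do the usual block work.
        h = h + self.ple_proj(self.ple_embed(token_ids))
        return h + self.mlp(h)

vocab, d_model = 32000, 512
blocks = nn.ModuleList(ToyPLEBlock(vocab, d_model) for _ in range(4))
ids = torch.randint(0, vocab, (2, 16))
h = nn.Embedding(vocab, d_model)(ids)
for blk in blocks:
    h = blk(h, ids)   # each layer gets a fresh look at the token ids
```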
DeepOrangeSky@reddit
Yeah, I'm curious how much further it could be scaled as well (and how much more they could get out of it if they did scale it significantly): whether all it can really do is keep growing the vocabulary size, or whether it could be used in more interesting ways than just that.
I was going to link this thread that -p-e-w- made about it a couple weeks ago that has a lot of in-depth explanation about it, but I was browsing through it and I see you were one of the commenters in it so you already saw it.
Most of it is too advanced for me to understand yet, so I don't really know if it can only be used for vocab (and thus not do anything too crazy as far as scaling it massively) or if it can be used for all sorts of other aspects of an LLM, and thus become the future of LLMs in some big way.
If it can be used the latter way, then I'd think it would already be a main architecture for big models, no? Like, why aren't they already scaling it like that and using it that way? Or is Google the only one that knows how to do it so far, and they've just chosen to only do it at small scale to tease us with it a bit, while others would do it at big scale and do crazy shit with it, but just don't know how yet?
Anyway, pretty interesting. Kind of makes me want to learn more in depth stuff about how LLMs work to see if I can understand it better
Turbulent_War4067@reddit
76B MoE with very strong reasoning/tool calling and NVFP4 out of the box.
ThisGonBHard@reddit
A 120B MoE model, with 10B active. That or a dense 70-80B.
charles25565@reddit
I'd personally like to see a 270M-ish Gemma 4.
gnnr25@reddit
Gemma 4n
SPoKK1@reddit
Gemma4 144B A12B please. 🎆
ea_nasir_official_@reddit
Big MoE
ResidentPositive4122@reddit
The small models are already good. Let's see what 124B was all about. We'll find hardware to run it :)
BannedGoNext@reddit
Where's our Flash-killing 124b model lol. Actually, the only improvement I can recommend would be better tool calling, like Qwen's.
typical-predditor@reddit
Flash 3 is the 124b version of Gemma 4. It's already commercially viable.
seamonn@reddit
I was struggling with tool calling as well until I switched over to the llama cpp interleaved jinja template.
Dramatic-Chard-5105@reddit
1B multilingual TTS
Silver-Champion-4846@reddit
These bigtech corpos are very capable of training the best small tts model, but it doesn't reach the standard of the hypemongers sadly
Ps3Dave@reddit
15B dense, 40B MoE. these should fit 12GB VRAM (hopefully). Also an E6B with 256k context.
Separate-Forever-447@reddit
"the best open models are those you can run in your devices"
Objection, your honor... leading the witness.
KillerX629@reddit
Gemma can't compete with Qwen on memory management, to be honest. But if I could choose, a hybrid Gemma with the same kind of memory footprint would be a gem.
kevin_1994@reddit
I would love:
Fastpas123@reddit
Slightly unrelated: were the overthinking problems with the Gemma 4 models fixed? I was using Gemma 4 E4B IT and it would just keep thinking no matter what I did to it
khyryra@reddit
Gemma 5
xAdakis@reddit
I want to see how a model like Gemma can perform when only trained on a single language (English) + programming languages.
BidWestern1056@reddit
ones that don't flake on actual requests that one would need in an offline emergency. also not refusing to engage in discussions of world politics because "there is no way that iran and the united states would have started a war" lol
Skyline34rGt@reddit
60/70B MoE model would be great.
mikael110@reddit
On the topic of being closer to Gemini models, I really hope the next release offers Audio input for all of the model sizes. Audio is still an area where most OSS LLMs lag behind the proprietary models. And Gemini in particular is amazing at audio understanding.
DesignerTruth9054@reddit
120B and 5-10B active
ttkciar@reddit
A 12B dense, please! Right now there's a gap between E4B and 26B, and consumer-grade GPUs fall right in that gap.
Then, if you're feeling generous, that 123B MoE you teased in beta :-)
power97992@reddit
Gemma 4 pro the one with 5-7 trillion params
jacek2023@reddit (OP)
what's your usecase for it?
power97992@reddit
Coding. I want them to open source it so the API cost will be cheaper…
jacek2023@reddit (OP)
"open source" doesn't mean "cheap", "local" doesn't mean "cheap cloud", just Linux is not "free Windows"
NigaTroubles@reddit
170b and 10b active will be great
jinnyjuice@reddit
So a couple of models for 32GB memory (assuming 4 bit quants) are already out.
How about one for 64GB, one for 128GB, one for 256GB, and one for 512GB?
But I'm actually more interested in different numbers of MoE instead. It would be interesting to compare a 128GB model with E8A, and another 128GB model but with E16A.
stoppableDissolution@reddit
12-15B and 50-70B dense. Pretty Please?
True_Requirement_891@reddit
A 9b gemma or a 24b one
Kahvana@reddit
For Gemma 4: That 124B moe model, QAT.
For Gemma 5: gated deltanet, engrams, manifold constrained hyper connections, vision + audio for all models.
Monad_Maya@reddit
Around the MiniMax M2 series, so 230B to 250B MoE.
My_Unbiased_Opinion@reddit
give us 124B MOE. do it. and fix the abstinence with tool calling lol.
baradas@reddit
Gemma CUA
TheAncientOnce@reddit
Waiting to see if they'd pull a Qwen 3.6 moment where everyone votes for one thing and they do another XD
Creepy-Bell-4527@reddit
A 120b model.
rdsf138@reddit
Focus on multi-modality. I want to see many more modalities on models.
a_beautiful_rhind@reddit
124b is already made. Just release it.
SirSod@reddit
I'd like to see genuinely real improvements without needing to keep increasing the parameter count. 26B with A4B is excellent, it runs well even on older consumer graphics cards (like the RTX 3060); if they could make it smarter in two areas, agentic work and general knowledge, it would be incredible.
We do have good agent models below 100B, but models with solid factual knowledge below that mark are always nonexistent.
tat_tvam_asshole@reddit
best in class agentic tool use, safe autonomous behavior
hyggeradyr@reddit
Massive variety of tools, skills, and specialized low parameter models for higher efficiency at lower compute. I'd rather run 10 different small orchestrated agents than one shitty, unpredictable, general model.
cptrootbeer@reddit
Taalas style chip to run whichever model extremely quickly.
ComplexType568@reddit
Hot take, but I want to see a 120b dense model from any competent lab tbh (besides Mistral). I want to see them push the limits of lower-sized models (maybe a size like that could compete with trillion-sized models? Or maybe there's a hard ceiling? We won't know until we try). Think about Q3.5 27b and G4 31b, and imagine that but >100b. MoEs are already super saturated with models; of course one from miracle labs like Google and Qwen would be good, but I feel like one is bound to release anyway, so we might as well ask for something special like this. My thoughts though.
dampflokfreund@reddit
Misleading thread title. He is asking what features we want to see next, which may include, but isn't limited to, model sizes.
I would like to see QAT models again. I think Gemma 4.1 is also needed because the agentic and code performance, while not bad, isn't great either. And there are some bugs in the 26b model, like saying in its reasoning or in the user-facing response that it wants to do X but then not calling the tool. That seems like a model issue.
Would also like to see audio input for all models, ideally not only voice but also sounds. For Gemma 5 I would like to see omnimodality.
Mashic@reddit
12B dense model.
Significant_Fig_7581@reddit
A 48B MoE or a 60B MoE...
No_Secret4395@reddit
9b gemma
durden111111@reddit
124B dude, we know it exists lol
Asleep-Ingenuity-481@reddit
I want even smaller models, under 1b params. something that can be run in tandem with gpu intensive tasks, like gaming or something.
OpinionatedUserName@reddit
9b-12b that can be run on mobile, with agentic capabilities trained for search and mobile control. With safeguards so it doesn't unintentionally brick the device it's running on, i.e. it must be trained not to harm the baseline Android system so it can work flawlessly when given full access. So basically a mobile-focused variant which is multimodal, better if it's any-to-any.
Intelligent_Ice_113@reddit
some bigger MoE models would be nice, as competitor to qwen 3.6 35b
Mother_Context_2446@reddit
70b dense, 124b MoE, something that fits on 80-120GB VRAM :-}
Firepal64@reddit
I wonder if per-layer embeddings scale. 4B active with 10B PLE?
Salt-Advertising-939@reddit
QAT versions of gemma 4
brown2green@reddit
It's difficult to suggest anything considering that Gemma 4, at least at the 31B size, is already so good, but I'd definitely like to see QAT on the entire model so we can simply quantize every tensor to 4-bit (or even less than that) with limited to no quality loss. Or they could go even further and publish a quantization-aware-trained Gemma 4 124B in ~1-bit just to flex their muscles. That should be able to run on 24GB GPUs.
Also, they should release something between the E4B and the 26B models for mid-low range GPUs, I guess.
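For what it's worth, a quick weights-only sanity check on that 24GB claim (ignoring KV cache, activations, and quantization overhead like scales):

```python
# Rough weights-only footprint of a 124B-parameter model at low bit-widths.
params = 124e9
for bits in (1.0, 1.58, 4.0):
    gib = params * bits / 8 / 2**30
    print(f"{bits:>4} bits -> {gib:5.1f} GiB")
# ~14.4 GiB at 1 bit and ~22.8 GiB at 1.58 bits, so a ~1-bit 124B really
# could squeeze onto a 24GB card; a 4-bit one (~57.7 GiB) cannot.
```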
Tokarak@reddit
Does nobody use encoder-decoder models? T5gemma3.
kabachuha@reddit
More IP knowledge. Currently, if you look at the UGI leaderboard NatInt categories (Pop Culture), you will see Gemma 4 at 30-31 points while Gemini itself has >78. This shows they have really stripped copyrighted material from its dataset, very sadly.
source-drifter@reddit
I want something like a cat: if it fits, it sits. For me it needs to fit into 24GB VRAM. lol
PiratesOfTheArctic@reddit
That's actually a great idea, a dynamic model that fits memory/specs
El_90@reddit
Instead of a param size (which doesn't seem to be entirely representative), let's focus on GB of VRAM.
It feels like the 24-48GB audience is well served, and the 200GB audience is well served.
Maybe some more love for the 128GB system users, e.g. Strix (so a 90-95GB model, allowing ~20GB for cache).
Selfishly speaking, of course.
Altruistic-Theme432@reddit
I hope to see a 20B MoE model, like GPT-OSS 20B. Gemma 26B is still a bit too big for 16GB of video memory.
ready_to_fuck_yeahh@reddit
Gemini 4 pro ultra /s
MomentJolly3535@reddit
From that emoji I'm expecting very small phone models (2B and under).
Such_Advantage_6949@reddit
124B model please
pmttyji@reddit
AltruisticList6000@reddit
20b dense model
BothYou243@reddit
agentic stuff
VoiceApprehensive893@reddit
14bish dense and >40b moe