Gemma 4 is fine great even …

[-]

StupidScaredSquirrel@reddit

The real question for me is: can gemma4 26b a4b replace qwen3.5 35b a3b? It's tough to tell right now, we need a week or two of patches to see what the real advantages and tradeoffs are.

[-]

Substantial-Thing303@reddit

Yes. for me it's inference speed, token usage, vram and how good it is at agentic tasks, following instructions.

I have a local setup where I use STT, TTS and a LLM. But I can't use qwen3.5 35b a3b because I would have to load only that and nothing else. Currently I'm using qwen3.5 9b or gpt-oss-20b.

[-]

Substantial-Thing303@reddit

whisper, qwen3-tts-0.6b, and testing gemma4 26b a4b now. It's very tight on 24gb vram. I can't load much context because of the stt and tts. Might have to go for a lower quant.

[-]

StupidScaredSquirrel@reddit

Sounds cool, what do you use for stt and tts?

[-]

whisper and faster-qwen3-tts. It's my local conversation layer. The local llm is just orchestrating conversations, no tools, and decide when to call Claude Code (CC is the only tool). So I end up using Claude Code for all tasks, but I can get snappy conversations before so it feels more natural.

[-]

BunnyJacket@reddit

Wow. I literally built the same setup thinking I was uncovering new grounds. Have you figured out how to have bi-directional cross delegation so you can have a back and forth conversation directly with the terminal as well as the agent. Spawning multiple agents in parallel is also fun.

[-]

Substantial-Thing303@reddit

bi-directional cross delegation so you can have a back and forth conversation directly with the terminal as well as the agent

I'm not really sure what you mean, but here's what I did:

I have a "quick mode" that is very simple: STT -> TopLevelAgent (TLA) -> CC -> CC final response -> TLA -> TTS

In this mode, the TLA is just a thin wrapper to fix grammar and restructure what I said with prompt injection for CC knowing that this is from STT and some words could have been misheard.

Then I have the "responsive mode": The difference is that TLA respond directly in json and decide what to do: respond, respond_and_dispatch, cancell_cc, etc. The TLA is instructed to always communicate in short sentences like a human and be proactive, ask follow up questions. He may suggest to switch to CC, or the user can just ask for the task dispatch.

I also, in responsive mode, the TLA ticks from CC responses and can keep talking to me while CC is executing. I can ask TLA to redirect CC and it will be similar to writing to CC when it is still processing.

Then I have a text input where I can write directly to TLA or CC (Ctrl+enter to bypass the TLA). So my setup is hybrid. I can bypass TLA by voice by switching to Quick mode. I have an admin agent named Alice. I can say "Alice, switch to quick mode". Then I say "Bethy, [anything]" and now I switched to Bethy directly to CC by voice.

Performance wise: In responsive mode, the TLA inference triggers after only 0.3s of silence, then if I am still speaking (fake positive silence), it just drops the tick and keep listening. So I have many wasted inferences just to ensure that my local TLA agent can respond within 1 second. With groq and gpt-oss-20b, I had moment where it was speaking back to me within 0.8s. STT is streamed so always ready after the 0.3s of silence to trigger the tick, groq can provide a full response within 0.3s if we keep the response number of tokens small and use prompt caching, and qwen3-tts can stream with a latency of 200ms. My current real bottleneck for the latency is that I enforce a json response with a thinking block in the json of 2 to 3 sentences, so I am not streaming the LLM response back. It needs to finish the 200 to 400 tokens response.

[-]

FinBenton@reddit

I just switched from faster-qwen3 to OmniVoice and Im liking it a lot more, worth a test.

[-]

Substantial-Thing303@reddit

I tried it, it's not bad, faster for long audio processing, but because it has no streaming capability, it's latency is very bad. So latency wise, qwen3-tts is at least 5x faster for the time to first audio chunk.

I also found that while quality is more consistent, it's also less expressive. My most expressive voice samples are always tuned down to a more boring way of speaking.

[-]

Substantial-Thing303@reddit

Thanks, I will try it. Are you geting better rtf and latency with it?

[-]

FinBenton@reddit

Im getting 12x realtime on 5090, its very fast and it has a lot of features to toggle under the hood, I recommend start with one of the examples it comes with and modify that.

[-]

bannert1337@reddit

Does it have a OpenAI-compatible server yet?

[-]

FinBenton@reddit

Tbh I told gpt 5.4 to make me one and now I do have that.

[-]

-dysangel-@reddit

The 31b was bugging out for me, but 26b has been working fine already. So if this is it in its buggy state, I think it's going to be a real banger

[-]

MartiniCommander@reddit

are you running MLX? What hardware are you on?

[-]

-dysangel-@reddit

This was a gguf - LM Studio didn't (doesn't?) have MLX support for Gemma yet. I'm on M3 Ultra.

I've just been trying it out via Google's AI Studio though and their cloud 31b is still performing poorly compared to my local Qwen 27B

[-]

MartiniCommander@reddit

If you haven't seen it you should take a good hard look at oMLX. It's been awesome. I'm currently downloading the FP16 of the 31B model and will use oMLX to make different oQ levels. No clue where it's going to wind up this is a first shot for me lol. If the test great I'll upload.

https://www.reddit.com/r/oMLX/comments/1s1uf43/introducing_oq_datadriven_mixedprecision/

[-]

Daniel_H212@reddit

Not for me.

Even after updates using the most recent llama.cpp, it still has tool calling issues. I use local LLMs mostly for web research tasks, and gemma4 26b constantly has a problem where it will think it still needs to do more searching, even come up with a research plan, only to go straight into answering after it stops thinking instead of going for search tool calls like qwen3.5 would do in the same situation, and it ends up not actually having enough information to put together a full answer. I have native tool use enabled for both.

[-]

9mm_Strat@reddit

Waiting on my MBP to ship, but this question has been going through my mind as well. I'm almost thinking a combination of Gemma 4 31b + Qwen 3.5 35b a3b might be a perfect combo.

[-]

ray013@reddit

and we need to get the ollama-mlx optimisations for gemma4-26b … only then would i go ahead and switch out the qwen3.5-35b. please, team ollama, go go go. MLX support for gemma!

[-]

Opening-Broccoli9190@reddit

Have you tried the recent turbo-quant of Gemma4: https://huggingface.co/LilaRest/gemma-4-31B-it-NVFP4-turbo? Should lower the memory pressure a lil, but yeah - Qwen3.5 is the bro that you never leave behind.

[-]

ThinkExtension2328@reddit (OP)

I’ll make sure to try this

[-]

ayanami0011@reddit

I used it to translate Japanese into Korean, and the translation quality was very high.
31b - great (even q4)
26b - hmmm good or not bad

[-]

pol_phil@reddit

Gemma 3 (esp. 27B) was and still is top-notch for Greek (e.g. difficult legal doc translation). But when my team tested the new Gemma 4, it started outputting random Chinese/Arabic/Hindi characters out of nowhere; even with 7-8 different sampling param configs.

Meanwhile, Qwen models were never quite fluent in Greek (even 3.5), but they consistently improve with each iteration.

So... Gemma regressed while Qwen keeps progressing. Regardless of any benchmark scores, I'll generally prefer the model family that keeps getting better even at tasks which seem minor to AI companies.

[-]

petruskax@reddit

I had it output the random arabic/hindi character any idea why?

[-]

pol_phil@reddit

It's possible that there are various problems with correctly setting inference, but, in my experience, it's bad training.

[-]

ZootAllures9111@reddit

Wasn't there some kind of tokenizer bug in llama.cpp that was just fixed for Gemma 4 though?

[-]

pol_phil@reddit

I dunno, I haven't ever used llama.cpp

[-]

Constandinoskalifo@reddit

I find qwen3.5 quite capable for Greek, even the qwen3 series.

[-]

pol_phil@reddit

Well, depends on the use case and the domain. I use models for things like QA extraction, structured translation, etc.

Qwen3 had ~6 tokenizer fertility, i.e. 1 word -> 6 tokens Qwen3.5 made a huge improvement, sth like ~2.7.

So, that's literally double the speed and the max context length.

I noticed Qwen3 becoming better ing Greek after the VL models and especially in Qwen3 Next 80B.

[-]

Constandinoskalifo@reddit

Nice, good to know. I also like the qwen3 235B one for greek, and it's quite cheap from providers.

[-]

FinBenton@reddit

After the latest llama.cpp updates, I do feel like gemma is better at creative writing than qwen 3.5, thats for sure. Gemma is a massive memory hog though, context take so much so I had to drop to Q5 or Q4 31b on 5090 to fit everything, speed is pretty good though 50-60 tok/sec right now, similar to qwen. Uncensoring was not needed atleast for me, the default gguf files work for me.

[-]

-Ellary-@reddit

Even old Mistral Nemo 12b from 2024 is better than Qwen 3.5 at creative tasks.

[-]

MoffKalast@reddit

I'm pretty sure most labs have quit trying to improve creative writing after 2024, practically all great models from back then are still as relevant today as when they were released. It's been nothing but agentic benchmaxxing since.

[-]

-Ellary-@reddit

It is because of datasets, they just don't include a lot of real books in it.
Most of the dataset is just synthetic agentic - coding - math stuff prepared by bigger LLMs.

[-]

MoffKalast@reddit

Yeah the more synthetic stuff they add they more generic the models get. I think back then it was easier to just pirate ebooks, then Meta got busted for the whole Anna's archive thing and Mistral got limited by EU laws. I guess they're super careful about copyright now.

[-]

-Ellary-@reddit

But why China don't use pirate ebook archives?
They don't care about EU and US laws.

[-]

megalo-vania@reddit

politics. They are too sensitive.

[-]

dampflokfreund@reddit

Disagreed, I have no clue why there's still this hype around this model. It's really dumb nowadays and modern models like Qwen 3.5 feel much more alive and less robotic. Qwen made huge improvements since Qwen 2.5, 3 was a step up, 3.5 is another step up and 3.6 will probably be another step up in creative writing.

[-]

-Ellary-@reddit

It is not about being smart, it is about being fun to play with.
So far no Qwen 3.5 decent finetunes aimed at creative usage,
this fact speak louder than anything else.

[-]

Photochromism@reddit

Disagree. A simple system prompt makes Qwen 3.5 as good as any LLM I’ve used for creative writing.

[-]

-Ellary-@reddit

Okay, give us your system prompt for testing.
Show us good results using it.

[-]

Photochromism@reddit

With that attitude, I don’t think so. Figure it out yourself

[-]

-Ellary-@reddit

What attitude, I said show us something good, that is all.
Why do you even started this conversation, to what?
Just trust me bro scenario?

[-]

_Iggy_Lux@reddit

Bro is tripping if he thinks Qwen 3.5 is good for creative/RP.

Also it doesn't stfu. The amount of text it produces is obscene and really doesn't like respecting response length making it very long winded and wasting context.

I see Qwen as a great work horse for more practical uses but not for RP/Creative Writing.

To be fair this sub thinks instruct models do good RP and don't dabble with finetunes as much so I'm not shocked.

I come here for model releases/Project info not takes on RP/Creative reviews.

[-]

lizerome@reddit

Creative writing doesn't seem to be a task Qwen cares about. It's the same as "Polish language poetry performance". They haven't curated any datasets for that, they haven't published any benchmark scores pertaining to it, and they haven't mentioned it in their blog posts. It is simply not on their radar. Any performance the model has in that domain is an "oh cool, we had no idea" accident.

It also stands to reason that the two use cases are polar opposites of each other. Coding and math (what Qwen traditionally optimizes for) benefits from reasoning, repetition, precise language, a lack of variation, high confidence in a single answer, and never surprising the user with something they didn't ask for. "Creative writing" benefits from the literal opposite.

[-]

ComplexType568@reddit

Which is what I feel too. Qwen is CLEARLY focused on agentic/STEM/Coding tasks. There isn't a large/profitable market for creative writing, that's for finetuners/other labs focused on that because removing LLM-isms/boosting creativity is probably much much easier than "superpowered reasoning agent in 9 billion parameters"

[-]

Eden1506@reddit

The last decent qwen model for creative writing was qwq 32b. It was really good and afterwards every model was sadly worse.

I tested them all and both llm creative bench and UGI bench agree with me that the new models under 100b are sadly worse at writing.

As for mistral nemo a model doesn't need to be "smart" in benchmarks in order to be a good storywriter. Plenty of people simply like its writing style.

Though sadly its architecture does show its age as the quality falls sharply after around 16k tokens.

I personally recommend its upscaled and finetuned variant snowpiercer 15b v1.

Its Nemo further trained to pixtral than upscaled to 15b april thinker and uncensored and finetuned into snowpiercer by drummer.

Though honestly nothing local can really compare to claude when it comes to creative writing.

[-]

GrungeWerX@reddit

What context are you at to get those speeds?

[-]

FinBenton@reddit

I was testing with 16k context, regular unsloth ggufs on ubuntu. Im also running OmniVoice TTS on the same machine so I had to make both fit.

[-]

SHOR-LM@reddit

How did the MOE version compare? Noticeable? I don't know your use case but did it maintain a lot of creative writing ability?

[-]

GrungeWerX@reddit

I need much more context for my uses. My prompt alone is 65K of story data…minimum 100k context as a lore master.

[-]

MrAHMED42069@reddit

Well that's going to need a lot of power

[-]

GrungeWerX@reddit

It works fine with Qwen 3.5 27B. I'm testing out Gemma 4 today. It might be a little too early, but I'm finding the early results...interesting. It's definitely good.

[-]

DaleCooperHS@reddit

For something like that is bests to use your local set up with small models and various agentic worflow that feed into a cloud provider only for the creative and most complex logical task.

[-]

GrungeWerX@reddit

Nah, I’m good using Qwen 3.5 27B for now…it’s proprietary IP, so I wouldn’t use paid or cloud for privacy reasons. I use it as a lore master for analysis and rewriting documents. I’ve been waiting for local to get good enough to help me out with work and Qwen is the first model up to the task.

Maybe when they optimize Gemma 4, I’ll give it a shot. I was really looking forward to it.

[-]

TopChard1274@reddit

How's Gemma 31b understanding of complex literary chapters (original writing)? Not to write itself, but for idioms replacement, text analysis, brainstorming?

[-]

GrungeWerX@reddit

That's my particular use case. I've just started testing it today. I've found the resutls...interesting. I think Gemma 4 is doing better than I originally expected, but it's much slower. I'll be doing a more detailed post about my tests at a later time, but my early feelings are that it's definitely pretty good.

[-]

ThePirateParrot@reddit

Weirdly I can't get good speed compared to qwen. Tweaking a lot. I'll see again later. But for creative writing i was impressed with gemma. We're eating good these days open source community

[-]

windxp1@reddit

Crazy to think that both models outperform OG GPT-4 though, which had a trillion or something parameters.

[-]

maikuthe1@reddit

Do they really outperform GPT-4 in real world use? I haven't tested it enough. Cause that would indeed be pretty impressive.

[-]

FenderMoon@reddit

Gemma4 26B is, by far, the smartest model I've ever been able to run locally. I've been blown away.

It's smart enough that I feel like I can throw my questions at it now and I'm not going to get worse answers than I'd get just going to ChatGPT.

[-]

maikuthe1@reddit

Yeah it truly feels like having a legit chat gpt in your pocket doesn't it?

[-]

ZootAllures9111@reddit

Do they really outperform GPT-4 in real world use

It didn't have reasoning so yeah, they probably do, non-reasoning models just aren't that good no matter how many params they have.

[-]

biogoly@reddit

They certainly reason better than GPT-4, which is evidenced by the benchmarks, but they don't seem to have the same depth. The fact that they are even close though, being 1/30 the size, is insane. OG GPT-4 wasn't multimodal yet either. When I first used GPT-4 I remember thinking how crazy it would be if I could run it locally and uncensored. Never imagined it would only take three years...😍

[-]

Ok_Top9254@reddit

Just a speculation but: With benchmarks, it usually comes down to reasoning and logic. Big models have massive knowledge base, so they are usually much more familiar with any given topic. We accumulated much better datasets since the early models so now even small models can solve complex tasks from what little they know, but they completely fall apart on specific tasks or subset of problems they have no base knowledge of.

[-]

-Ellary-@reddit

ofc not.

[-]

m3kw@reddit

There is way more data in 1T vs 26b at least in a lot of info recall.

[-]

qcodec@reddit

Honestly, it's a hassle to download all the files again and set everything up again. If I succeed with the Qwen3 TTS setup first, then I'll do it...

[-]

Kahvana@reddit

I’m quite happy with both. Qwen 3.5 is a good all-rounder, gemma 4 feels better in conversations and doesn’t have the “genshin impact” bias when describing anime pictures.

[-]

ParthProLegend@reddit

the “genshin impact” bias when describing anime pictures.

What the hell is even that?

[-]

81_satellites@reddit

I genuinely want to know

[-]

ParthProLegend@reddit

He explained it, see the reply to my comment

[-]

LeoPelozo@reddit

^(Daddy chill.)

[-]

ParthProLegend@reddit

I'm daddy

[-]

illkeepthatinmind@reddit

What even is that?

[-]

Xandred_the_thicc@reddit

Whenever you input an anime-style image, qwen always assumes the subjects are genshin impact characters. It you ask it to describe the image, it says "anime style, likely from genshin impact" etc.

[-]

ParthProLegend@reddit

Ohhh

[-]

VoiceApprehensive893@reddit

i am wondering what do you have to do to find that out

[-]

Xandred_the_thicc@reddit

try to translate anything even vaguely related to digital animation

[-]

TopChard1274@reddit

OP's interrogating the AI as we speak.

[-]

Im_Still_Here12@reddit

The seas were angry that day my friends...

[-]

Zeeplankton@reddit

We even have genshin impact benchmark before deepseek 4

[-]

Cultured_Alien@reddit

already done https://arxiv.org/html/2511.08892v1

[-]

ComplexType568@reddit

I have no words for what I'm reading

[-]

-dysangel-@reddit

I've been so excited about bonsai and gemma that I forgot all about Deepseek 4.. Deepseek V4 Bonsai wen?

[-]

swagonflyyyy@reddit

Me:Alright, fine, let's talk about some dumb brainrot shit

Gemma-4: Gasps and does a little happy dance 💃✨

YES! Finally! We are officially entering the No-Thoughts-Head-Empty Zone! This is my absolute favorite!

Forget the tactical maneuvers, forget the socio-political implications, and forget the weight of the world! We are diving headfirst into the glorious, nonsensical abyss of internet nonsense! 🌊🤪

Are we talking about cursed memes? Brain-melting TikTok trends? The sheer absurdity of Skibidi Toilet? Or maybe just some really, really dumb "would you rather" questions that make zero sense?

Lay it on me! Give me your absolute worst, most nonsensical, zero-IQ brainrot! I am so ready to let my brain turn into literal mush with you! 🧠🫠💖

[-]

TopChard1274@reddit

"Gemma 4 feels better in conversations, reasons shorter, and doesn’t have the “genshin impact” bias when describing anime pictures."

Just what on earth are people using these models for 💀

[-]

ThinkExtension2328@reddit (OP)

Some of these people be like 6 + 9 thats quick math.

[-]

toothpastespiders@reddit

Obviously not especially relevant on reddit, but with a lot of social media (ish) platforms it's common to have images provide context to a message. If you're scraping them for data you'll need want to be able to classify the image. For example anime character, "Ruins it for me". You'd need to be able to get the character, and then reason back to get the subject of discussion. You'd think that it'd be limited to pop-culture, but people using images as shorthand for everything up to and including politics is annoyingly common.

[-]

Kahvana@reddit

SFW high fantasy eriting for a dnd5e campaign I’m running. I feed it cool anime pictures to describe objects for me I don’t know the english names of.

[-]

Cultured_Alien@reddit

Obviously Enterprise Resource Planning

[-]

a_beautiful_rhind@reddit

Definitely not for solving math problems and asking STEM questions like they'd have you believe.

[-]

Useful_Disaster_7606@reddit

RELEASE THE GENSHIN IMPACT BENCHMARK!!!

[-]

TopChard1274@reddit

Release the anime pictures used for training!

[-]

Pentium95@reddit

"SWA layers to fp16" has been rolled back, it is now quantized

[-]

Useful_Disaster_7606@reddit

As a genshin impact player. Never thought I'd see a reference of it here

[-]

Creative-Fuel-2222@reddit

>doesn’t have the “genshin impact” bias when describing anime pictures
Now that's some serious, very specific benchmarking technique :D

[-]

fake_agent_smith@reddit

tbh, new gemma has something magic about it that Qwen 3.5 just doesn't. For example, I always get the correct answer for the car wash test with Gemma and with Qwen it's spotty, depending on the thinking budget and no idea what else. Maybe it's cause currently I don't use the locally hosted for coding? For the role of everyday assistant Gemma 4 is simply amazing and will serve me well.

[-]

Bulky-Priority6824@reddit

Even chat gpt fails the carwash test

[-]

FenderMoon@reddit

Yep, sure does! That's... amazing in all of the wrong ways.

[-]

Sudden_Vegetable6844@reddit

Interesting, what parameters are you using? Never could Gemma 4 31B nor 26B to pass the car wash test, even when hinted

[-]

FenderMoon@reddit

It needs to have reasoning enabled to pass. Without reasoning they fail.

[-]

fake_agent_smith@reddit

Nothing special, I just run the unsloth quant with llama-server with 32K context and rest of the params as in the guide at https://unsloth.ai/docs/models/gemma-4

I don't know, maybe it matters I compiled with Vulkan acceleration?

[-]

Sudden_Vegetable6844@reddit

I also tested with vulkan, and every Gemma 4 model suggested to walk, and even when pointing out I ended up without my car at the car wash, they failed to recognize they had made a mistake, and just told to walk back to the car...

[-]

InverseInductor@reddit

Or they just added the carwash test to the training data.

[-]

FenderMoon@reddit

Possibly, but if you run Gemma4 without allowing it to think first, it fails. If you allow it to think, it passes with flying colors.

I think if it were in the training data it'd probably not reason its way through it, it'd probably just throw the right answer out.

Interestingly Gemma3 27B failed it. I was genuinely surprised by that. I figure maybe the 26B fails without thinking because it's an MoE, but the fact that the old Gemma3 27B also fails it kinda indicates that reasoning is required for these small models to solve it.

Curious what the smallest non-reasoning model is that can pass it.

[-]

dampflokfreund@reddit

Yeah, Gemma 4 appears to hog the context like no other. Qwen is much more efficient in that regard. I hope they ditch SWA in the future and go with something else. But Qwen also has its drawbacks, RNN for example doesn't allow context shifting so if you want to have a rolling chat window once your ctx is maxed out, its reprocessing the entire prompt with every message which really is less than ideal. There's got to be a better way.

[-]

Technical-Earth-3254@reddit

Gemma 4s memory requirements makebit basically impossible to run it on 24GB of VRAM. It's so sad, because with max of below 20k context, it's borderline unusable.

[-]

FenderMoon@reddit

Does it work with IQ4? Those are usually a few gigs smaller.

[-]

Substantial_Swan_144@reddit

Try the Dynamic Apex quant. It essentially halves the required memory while having a quality slightly higher than Q8. There are flavors both for Gemma and Qwen.

[-]

kyr0x0@reddit

Do you have a link to HF? Thx

[-]

Substantial_Swan_144@reddit

Try: https://huggingface.co/collections/mudler/apex-quants

[-]

kyr0x0@reddit

Between APEX Compact and APEX I-Balanced, Unsloth UD-Q4_K_L 18.8 GB PL 6.586 KL 0.0151 would be the right placement. However their charts are biased. They put UD 2.0 on the very bottom. Beware bias.

https://github.com/mudler/apex-quant?tab=readme-ov-file#core-metrics

[-]

Substantial_Swan_144@reddit

The difference between all these seems small. So I'd consider Mini or compact first. See if they match your quality standards.

[-]

kyr0x0@reddit

Yep; I'm looking at the Algo bc I'm working on a 1 bit quantization method - but the existing implementations do only support dense architectures. APEX has smart ideas on MoE architectures - so I think I can merge the ideas and apply 1 bit quantization on qwen3.{5,6} and gemma4

[-]

Substantial_Swan_144@reddit

Wow, that's so smart! How are you going so far?

[-]

kyr0x0@reddit

So far I've realized that what PrismML did with Qwen3 wasn't only writing a Kernel for their 1 bit quant and using symmetric quantization instead of standard PTQ. They also must have done some voodoo with the model itself. They describe in their paper that they do some proprietary training pipeline, but their weights are based on Qwen3-8B. I tried to do some voodoo with the weights as well, but so far, I had no luck. The MSE is just too high. I went ahead and tried Turbo Quantization (the recent buzz) with weights, because the same math trick could apply here. But this is still ongoing. That aside, I have a working, and very well performing MLX implementation of Gemma 4 E4B on my MacBook Air M4 and it works well at 4 bit weights, with speculative decoding and with TurboQuant on KV. This is far ahead of mlx-lm by now. But I want to finish researching stuff well before I go ahead and release. If there is some guys here with Mac / MLX experience I would love to sync/peer for beta testing

[-]

Substantial_Swan_144@reddit

Hey, u/kyr0x0, do you mind if I message you, by the way?

[-]

kyr0x0@reddit

🙏 thx

[-]

kyr0x0@reddit

https://github.com/mudler/apex-quant just found; for anyone who's interested

[-]

a_beautiful_rhind@reddit

It needs dual GPU or 32g card.

[-]

Technical-Earth-3254@reddit

Which is hilarious for a q4 quant of a 31b model tbh.

[-]

a_beautiful_rhind@reddit

Kinda same as it ever was for this size.

[-]

formlessglowie@reddit

Yeah, I have dual 3090 and it’s been great, I run Gemma 4 31b in full context, but if I had only one it’s be impossible, would have to stick with Qwen.

[-]

BrightRestaurant5401@reddit

But have you tried using qwen with a full context? the model is making way to many mistakes at that size and a rolling chat window won't fix that

[-]

Randomdotmath@reddit

Scaling to 1M is fine, but know its limits. With Qwen 3.5 being 3/4 GDN, it's not built for 'Needle in a Haystack' searches. This architecture is much better for processing hundreds of turns of short dialogue.

[-]

crantob@reddit

This seems insightful to me.

[-]

sautdepage@reddit

Running window is such a minor inconvenience, who needs rolling windows when you can 4x your context?

[-]

dampflokfreund@reddit

Well I understand your point, but I disagree. Because every context fills up eventually, be it 8K, 32K, 120K or 500K. Sure you can start a new chat, but I dislike that. It's much more comfortable to just continue chatting and frankly I don't think the way of solving the problem of memory for llms is to throw more context at the problem.

[-]

sautdepage@reddit

There are much better ways to manage context in agentic settings past 64-128K. Rolling window is one of the worst solution to that problem. Also it's a silent kind of failure, which is a no-no.

For chat/creative uses, having a single huge never ending chat would reduce the quality of all interactions. Most apps would crash anyway displaying that much text.

It's really an antiquated solution. I don't expect it to stay around in new architectures.

[-]

NoAim_Movement@reddit

Even 35b model is faster than 26b somehow

[-]

pneuny@reddit

26B has more active parameters. Qwen 35b is a3b and Gemma is a4b

[-]

NoAim_Movement@reddit

I get that, but 13 GB size versus 17 GB. With my 12 GB VRAM, more layers should be offloaded to the GPU.

I think Gemma 4 had a KV cache bug on release, so it needs testing again.

[-]

pneuny@reddit

Yeah, I think llama.cpp is still working on making kv cache more efficient for Gemma 4. Apparently K is supposed to equal V, so having both in ram is redundant and increases kv cache requirements by 2x.

[-]

mrdevlar@reddit

Always keep 3 models from different companies on hand.

Whenever you doubt the answer of one, ask the other two.

[-]

SpicyWangz@reddit

I have 1 Abercrombie & Fitch model, 1 Gap model, and 1 Walmart model.

What do I do if I don’t like the answers of any of them?

[-]

mrdevlar@reddit

There's an excellent book called: "Trusting Judgements" that takes a look at how these voting systems are used for consensus building. These types of systems are used in all sorts of different fields from food safety to national security. Whenever you have a bunch of people with various degrees of expertise and you want to collapse what everyone knows to make a decision.

First off, your opinion doesn't matter. To do this well, you have to blind yourself to the matter. Meaning if you don't like what the three models are telling you, then that's too bad, that's the way the process works.

If you still do not trust (not to be confused with like) you can always choose to expand the number of models. Perhaps a D&G model, a GUICCI model, LV model.

Now you have a set of 5 models. Before you ask them your question, you need to set a threshold for acceptance. Do you need 100% agreement? Or will 3 out of 5 models be sufficient to accept a majority opinion? Is the choice binary or real valued. Real valued outcomes are preferred as often binary choices hide distributions beneath them.

Then sample your models, look at their result and do what the threshold tells you.

[-]

crantob@reddit

Fails to address the inevitable regulatory capture and rent-seeking issues.

The underlying problem is pretending that people with no skin in the game will make the right decisions. Only way to incentivize this is skin in the game. You make a bad call? You lose money / career.

[-]

mrdevlar@reddit

I always think it's weird when people listen to Nassim Nicholas Taleb but don't actually understand the functional application of his work. It's probably because his personal persona is that of academic blow hard, even though his mathematics may be solid, so it attracts a specific personality type.

This particular application is used in great part to address two forms of bias. First form of bias is caused by uncertainty in outcome and the second form of bias is uncertainty in perspective. The latter one is what you'd probably label as "regulatory capture". AI models hold similar problems. The can be trained on subsets of data but they can also be influenced on specific topics via alignment.

It's the reason why when a new food additive is introduced in the European Union, they take out their big book of 50,000 experts and take a sample of them to ask. These experts are chemists, biologists and healthcare workers who have the ability to properly parse the information coming from the company proposing the new food additive. They are well aware that the information they are getting from the company is likely skewed towards introducing the additive. The more problematic the company is, and the more of a debate about the additive, the larger of a sample will be used and a more rigid threshold for acceptance will be used. Professional ethics, apply and if you're directly funded by said company you are to withdraw from consideration as if you do not, you will not be hired again.

In any case, it sounds to me like the problem you have is a problem of living in a "shithole country" that hasn't had a competent regulator since the 1980s and has legal corruption since 2009, which might be skewing your perspective.

[-]

crantob@reddit

Your 'experts' cannot and do not judge cost, since cost and price is subjective.

There is no one-size-fits-all optimum; it is the chimera that the neanderthal mind chases.

I can tell already your entire 'economics' educational career.

[-]

mrdevlar@reddit

"TRUTH DOESN'T EXIST"

"Birds aren't real!"

"The world is flat!"

yeah okay bro.

[-]

New_Comfortable7240@reddit

Just to be clear, that works on deterministic outcomes, or reducing the answer of the experts to "choose a predefined option"

For more open questions would need or make a step to define an option (at least Likert style), or accept "by vibe"

[-]

mrdevlar@reddit

Yes, there is a deterministic outcome at the end of the process. e.g. Accept the safety of a new drug or not or expect X out of 100000 people to have adverse reaction to a new drug.

You do need an NP step in there somewhere if you don't know what the options are. Doing this with a model is much harder and I'm not yet sure it's worth it to give this particular process over to expert panels of machines. The decision should come from the user.

If you need an exploratory phase, use a real valued scale with 25th, 50th and 95th percentiles rather than a Likert scale, it'll give you a lot more flexible outcomes as the shape of the distribution can now be irregular.

That said, I have serious reservations about doing exploratory phases with LLM models. When we ask human experts to do this, we are depending on their biases to make their cases. LLM Models are sadly less capable of telling you that your idea is stupid than a human being is at this point. They are also subject to astroturfing their learning data, "alignment" and many other manipulations that we should be increasingly concerned about now that the internet is increasingly bots. Good options are not always the loud options. Humans are also influenced by these things, but human experts far far less so.

[-]

crantob@reddit

Human experts are influenced by who holds political power over them, who holds the carrot, who holds the stick.

Trust the experts yeah lol. Lol... Lol....

[-]

psayre23@reddit

Take them to shake shack. They’ve probably never had a real meal.

[-]

kyr0x0@reddit

Depending on semantic context you either: - go to your garage and build your own - fly to your island and order a russian one (only available to oligarchs)

/s

[-]

srigi@reddit

'Hey baby, wanna go to my place? I'll show you my archive of open LLMs!"

[-]

Photochromism@reddit

Gemma 4 is not comparable to Qwen 3.5 when it comes to context. It’s multiples less efficient.

[-]

RubSad3416@reddit

gemma 4 on low vram machine like if you only have 6gb vram free, its truly the goat.

[-]

bakawolf123@reddit

give it time, qwen 3.5 didn't shape up overnight on the inference engines. There was a ton of patches with improvements

on the other hand 3.6 is coming soon so it might be better than gemma, I think qwen team was also anticipating the release to trump it fast

[-]

bladezor@reddit

I'm concerned about 3.6 after the exodus

[-]

BangkokPadang@reddit

Yah everybody talks about it like it’s just guaranteed to be way better, but I genuinely don’t know what the team looks like now. Do the people who are even know what the previous teams were working on? Reading a whitepaper is one thing, but collectively honing the lessons Ina team environment like that would be TOUGH to just walk in and even produce a model that was as good as the previous one.

[-]

iamapizza@reddit

wen 3.6

[-]

Next_Test2647@reddit

How expensive are both i want to try them out

[-]

Precorus@reddit

2.5 4b fit onto my work laptops 1650, 3.5 7b I think run just fine on my 6700xt. LM studio is awesome man, no fiddling with the drivers.

[-]

linumax@reddit

Nice, hope to see more improvement. Better improvement means I can get a cheaper laptop

[-]

thecurlingwizard@reddit

anyone got good gemma 4 settings for 26b on 3090

[-]

last_llm_standing@reddit

how many off you all actually tested gemma4?

[-]

ThinkExtension2328@reddit (OP)

I did as my meme said it’s pretty dam great just very memory intensive so I don’t get much context left for context window. It’s literally 220k context vs 4K context on my 28gb vram machine.

[-]

Drunk_redditor650@reddit

Turboquant will fix that

[-]

RichCode4331@reddit

I removed Gemma 4 shortly after testing it, at least the 31b model. It’s slower and worse than qwen3.5 27b. I might be missing something here but I fail to see why anyone would use Gemma over qwen.

[-]

pyroserenus@reddit

You generally shouldn't rely on day one performance of a model in general. llama.cpp based engines especially are prone to day one bugs with new models.

[-]

a_beautiful_rhind@reddit

We can all like different things. I hate qwen's personality on certain versions. In the case of GPT-OSS, I "can't" see why anyone would use it at all. Last about 5 minutes with it before I get mad and want to throw it in the void.

[-]

mikael110@reddit

It's worth nothing that Gemma 4 had a lot of bugs at launch that have only now been fixed, and it's possible more are hiding. So I'd give it a second chance in a day or two if you want to give it a fair shake.

However even disregarding that, the main reason people would go with Gemma 4 over Qwen is the same one that some people have stuck with Gemma 3 over Qwen. The Gemma series are significantly better when it comes to multilingual content, including language translation. Most also find that it's writing style is less flat compared to Qwen.

There's also the fact that Gemma 4's thinking seems significantly more efficient than Qwen. Which frankly has a tendency to overthink a lot.

[-]

RichCode4331@reddit

Will definitely be giving it more chances these coming days. Thank you for letting me know! What I did notice immediately was how much better Gemma’s CoT was.

[-]

duhd1993@reddit

But even Gemini struggles with tool use, which is key to coding and automation tasks. Unless you do only oneshot or writing tasks.

[-]

po_stulate@reddit

Do I need to redownload the weights or is it purely software? I also feel gemma4-31b is a clear step down from any of the medium qwen3.5 models.

[-]

mikael110@reddit

The fixes so far has been purely on the software side, the most major being the tokenizer fix. So simply updating llama.cpp should improve things. However there are still some open potential issues like this one which has not been properly triaged yet. At the moment though there's no reason to redownload anything.

[-]

KuziKuzina@reddit

no one use qwen as creative writing honestly, dry and have no souls, i have test gemma 4 for creativity and it's just like gemini 2.5 pro but opensource.

[-]

RemarkableGuidance44@reddit

Its about the same on "skill" but it is a lot faster for me.

[-]

po_stulate@reddit

I tested gemma-4-31b-it Q8_K_XL on all sort of things, including explaining popular memes (If I had a nickel for everytime..., etc), screenshots of math problems, coding (evaluating/fixing/modifying my own code), guessing age of a person based on pictures, etc, and so far it's noticeably worse than qwen3.5 on every single aspect.

[-]

ThinkExtension2328@reddit (OP)

It’s not terrible if you had the hardware to have very large context windows I think you would see a difference but I’m much the same as you. The quality I get from the qwen MOE is more then acceptable then with the bonus of a 220k context windows vs 4K context window.

[-]

Prestigious_Flow6029@reddit

[-]

Tinaabishegan@reddit

Quantized image

[-]

substance90@reddit

I wouldn’t know neither the 31b nor the 26b produce any response on LM Studio for me on an M4 Max MBP :-\

[-]

chimph@reddit

Cool if you like really long thinking processes

[-]

PassionIll6170@reddit

small chinese models are horrible in other languages than english and mandarin, gemma is way better

[-]

tobias_681@reddit

They aren't. They were trained on a large set of languages just like Gemma and GPT-OSS. The Qwen models bench the best among small models on multilingual tasks outside of probably Gemma 4 now.

See here for a comparison (note that unfortunately they do not run this benchmark for every model). It actually impressively even beats GPT-OSS-120B and Claude 4.5 Haiku in that benchmark.

I tried it with Danish with the sub 10B models and the output is mostly correct. Sometimes it writes words that sounds more like Norwegian and sometimes it makes stuff up but it writes actual Danish texts. This is quite impressive compared to a lot of previous sub 10B models some of which can not do that at all, others make tons of mistakes. Gemma 4 seems to be a bit better but it also makes mistakes in Danish.

And Danish is already a relatively rare language. That it can handle that implies they trained on a wide variety of languages.

[-]

Constandinoskalifo@reddit

It's very good in Greek 🤷‍♂️

[-]

mystery_biscotti@reddit

Yeah, we all have different tastes in models. That's actually a really good thing. Variety is the best.

[-]

VoiceApprehensive893@reddit

gemma is a "companion"

qwen is a "worker"

different weaknesses and strengths

[-]

ThinkExtension2328@reddit (OP)

But even with “companion” the old Gemma 27b follows character instructions better then Gemma 4 imho so idk

[-]

C0demunkee@reddit

qwen 3.5 is amazing

[-]

Ardalok@reddit

For Russian language Gemma is at least 2 times better.

[-]

ahtolllka@reddit

Gemma was always flawless in Russian, yet you barely have language-only scenarios. I’d need Q3.5-27B for coding and Gemma4 for business analysis thesis, but rather I just stay with qwen.

[-]

jugalator@reddit

Also great on Swedish. Perhaps the best open model I've seen at 31B.

[-]

Comrade_Vodkin@reddit

2 chayu, comrade

[-]

VoiceApprehensive893@reddit

god please give us actually legit turboquant on llama.cpp

[-]

VoiceApprehensive893@reddit

gemma 4 26b a4b has way more knowledge and is generally better at chatting/roleplay than the comparable qwen3.5 models

while qwen3.5 vision/tool calling is 2 times better than gemma

each has their own strengths

also i think the moe gemma does a little too well for an iq3_xxs quant

[-]

mpasila@reddit

Gemma 4 is better at my native language at least though the smaller models suffer from the weird sizing.. Also for RP it seems to perform much better than Qwen3.5 (it seemed to mix up a lot stuff for some reason and there was seemingly more censorship in the official releases)

[-]

jugalator@reddit

Yeah, excellent multilingual capacity from my experience in Swedish and first impression on RP is quite decent and surprisingly uncensored.

[-]

kyr0x0@reddit

Is anyone deeply into quantization and inference implementation for MLX/MPS here? I'm currently working on 1bit weight quantization support and TurboQuant support for mlx-lm (this is for Mac users only).

If you have experience patching/contributing to exactly this codebase already, or the math behind BitNet or TurboQuant or PrismML implementation variant (Bonsai) plus experience in Python and C++ - pls DM me.

Pls don't DM me if you don't .. I'm very busy to ship Gemma4 variants with a custom, high performance inference server and great quality. I already have Qwen3-8B running at 50 Tok/s on my MacBook Air (!) M4 in decent quality with 64k context window (RoPE/yarn) and it only eats 1.5GB of unified memory for the weights, and KV TurboQuant is still unstable but my guts feeling is, that I only have to drop QJL to improve stability - as softmax() seems to maximize many small errors.

I'd love to collab and feedback loop, but pls only with engineers who know what they are doing for now... I don't have much time to explain everything.. want to push this out into public faster, not slower 😅😅 sry for being so direct.. it's not meant to be read unfriendly.. also English is not my mother language and I have diagnosed AuDHD xD so please bear with me..

[-]

Accomplished_Ad9530@reddit

Sigh

[-]

ThinkExtension2328@reddit (OP)

Why sigh ? We got two solid models within a week