Gemma 4 31B vs Qwen 3.5 27B: Which is best for long context workflows? My THOUGHTS...
Posted by GrungeWerX@reddit | LocalLLaMA | 170 comments
- My setup: i7 12700K | RTX 3090 TI | 96GB RAM
- Models: Qwen 3.5 27B UD Q5/Q6_K_XL | Gemma 4 31B UD Q4_K_XL
To the point:
Right now, Gemma 4 31B and Qwen 3.5 27B are the best local models for a 24GB card. Period.
I've tested everything. These are the first two models that actually feel state-of-the-art for their size.
Most models up to this point have just been moderately performing novelties, not extremely useful for real use cases outside of writing emails, minor code, and RPG-ing. And all local models have performed poorly at long-context reasoning and analysis.
Benchmarks mean nothing. For me, it was an easy test: load up a local model, feed it 50K tokens of data, ask it to answer questions and provide analysis. Most models yap without saying anything. They provide very little relevant context, if any. They don't understand the lore. They hallucinate details. They're unusable.
That is, until Qwen 3.5 27B. It was the first of its kind and changed the game for me. It's been my daily driver since.
A couple days after Gemma 4 dropped, I fired it up, dumped a huge 60K tokens of context, and gave it a run. Not only did it answer the questions, it understood the lore. With that, I suddenly had my second model that could handle the job. It wasn't as detailed as Qwen at citing references, but it had a little something that Qwen didn't. I'll come back to that.
Now that that's out of the way, and we've established the two top players for long context reasoning to-date, let's get to the matchup. Who's better?
For the past couple of days, I've been comparing it against Qwen. Here are my findings:
- Gemma 4 is currently a lot slower than Qwen 3.5. I've tested Gemma between 70-100K context so far. Up until yesterday, it crawled along at a snail's pace, making it virtually unusable (I got between 0.6-3 tok/sec). But I found the outputs decent enough to keep trying to tweak my settings. Unsloth uploaded new versions yesterday, so I re-downloaded the model and I'm now getting at least a 2x speed increase, so I'd recommend you do the same if you're still getting slow speeds. That said, Qwen is significantly faster at even higher quants.
- Gemma 4 seems to hallucinate less than Qwen 3.5. It uses fewer references from the context, and it sometimes misses very important details altogether, things that Qwen doesn't. That said, sometimes Qwen gets its facts wrong at near 90K tokens, while Gemma seems surprisingly more coherent, if less factual.
- Qwen 3.5 references more context than Gemma 4. This makes it feel more thorough. That said, sometimes it has a tendency at high context to hallucinate minor details. There's a saying: Less is more. Maybe in this case - less is more....accurate?
- Qwen 3.5 is the clear winner over long outputs. Qwen can write looong passages of content, and maintain coherence. It's amazing. I even tested it once, asked it to write a 20K output. I stopped it prematurely - at around 10K tokens - but if I hadn't, it would have kept going, and it was only halfway through the material.
- Honorable Mention: Gemma 4 can write longer outputs than its defaults, but you have to prompt for it. It's capable of giving more thorough results than its initial output. Another redditor said they told it to reason longer and got better results. I tried this. It works. Not satisfied with the answer? Tell it to reason longer and provide a long output. You can even tell it to try to match a certain context length, like 10K tokens. I haven't tested whether it can reach a set token requirement yet; will follow up on that later.
- Gemma 4 has a better writing voice. I found its outputs more pleasurable to read - mostly. That said, it's still got a noticeable level of slop. Not as bad as 26B, but definitely more than Qwen.
- Gemma 4 digests the lore better for its assignments...sometimes. I'm still testing this, but my initial vibes are that Gemma 4 can give more pleasing results over long context by pulling out more poignant and impactful contextual references. It can punch deeper on the ideas than Qwen at times; Qwen gives you more references, but doesn't always consolidate those ideas in the most meaningful way. Sometimes it feels like this: Qwen is submitting a book report with references. Gemma is writing a review column on a website, citing the parts it found the most memorable. This isn't a consistent experience across all interactions, but it's often enough to notice.
- Qwen is smarter. The results, from a technical perspective, are often better. While both miss details over long context, Qwen is often more thorough. It can take extremely nuanced and complex instructions and eat them for lunch. That said, Gemma is also very capable; I'm still learning its abilities. It's not Qwen level...yet...but it doesn't feel far off.
- Gemma 4 gets it. This sort of falls under the "digests the lore" section, but I just wanted to mention that this version of Gemma is less about pontification; it really does seem to understand the unique ideas outlined in the source material. That makes it feel like you're working with a cowriter who can keep pace and dissect/stress-test ideas. Qwen does as well, but Gemma brings its own ideas to the table.
My final thoughts:
For these particular use cases - lore master, story analyst - I can't really decide which I like better. They have two different personalities, and they are equally useful. Where Qwen 3.5 27B first made me feel like I had a true writing partner, Gemma 4 feels like I've just added a third person to the table, who can bring something different and unique to the conversation.
If I could only choose one, I'd choose Qwen. I find its overall abilities to be better. Better reasoning, more attention over long context.
But without Gemma 4, I'd be missing very valuable and relevant context. That single, random-but-consequential observation that can propel the discussion into a meaningful new direction.
Thankfully, I don't have to choose just one.
EuphoricAnimator@reddit
Interesting read! I'm running similar stuff on a Mac Studio M4 Max with 128GB of RAM, so I've been down this rabbit hole a lot lately. I mostly play with Qwen 3.5, Gemma 4, and a bunch of the Ollama models, and context length is always the sticking point. You're right about those two being top tier for a 24GB card though, especially when you're pushing the limits.
I've found Qwen 3.5 generally holds onto context a little better for me, especially in longer conversations. I can usually get away with 8k tokens fairly consistently without it completely forgetting what we talked about five minutes ago. Gemma 4 is good, don't get me wrong, but I notice more drift around the 6-7k mark. With both though, reloading the entire context every so often really helps - it’s a bit clunky, but I’ve gotten used to just inserting a “reminder of the previous discussion” prompt every 1500-2000 tokens.
What I’ve noticed is VRAM isn’t always the full story. I can load a Q6_K version of Qwen 3.5 and get decent speeds, around 18-20 tokens/sec, but it eats up a lot more system RAM too. Lower quantization (like Q4_K_XL) frees up VRAM, but the inference speed drops noticeably. It’s a constant trade-off, and it really depends on what you’re prioritizing.
Honestly, for really long documents, I've started chunking things up and processing them in sections. It's more work upfront, but it avoids the model just making stuff up because it’s lost track of the beginning. It’s not ideal, but it's the best workaround I've found so far for keeping things somewhat coherent.
GrungeWerX@reddit (OP)
Yeah, I've been thinking about doing something similar with the chunking. Part of what I feel gets overlooked is the rare concepts. My idea is to "scrub" the lore section by section, having the AI append all unique concepts it finds to the bottom while providing a summary at the top. That way, when I re-feed it, it won't lose the unique ideas/terms, while the summaries will suffice for the general overview.
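For anyone wanting to try the same thing, that scrub pass is easy to sketch. Everything below is hypothetical: `llm` stands in for whatever local backend you call (llama.cpp server, Ollama, etc.), and the chunk sizes are illustrative, not tuned.

```python
# Sketch of a "scrub" pass: summarize each section, extract unique
# concepts, and build a re-feedable digest (summaries on top, rare
# concepts appended at the bottom so they don't get lost).

def chunk_text(text, chunk_chars=8000, overlap=500):
    """Split the lore into overlapping character chunks so a section
    boundary doesn't swallow a concept."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_chars])
        start += chunk_chars - overlap
    return chunks

def scrub_lore(text, llm):
    """llm(prompt) -> str is a placeholder for your local model call."""
    summaries, concepts = [], []
    for chunk in chunk_text(text):
        summaries.append(llm(f"Summarize this section briefly:\n{chunk}"))
        concepts.append(llm(
            "List any unique terms, names, or concepts introduced in "
            f"this section, one per line:\n{chunk}"))
    # General overview first, rare/unique ideas preserved at the bottom
    return ("\n".join(summaries)
            + "\n\n== UNIQUE CONCEPTS ==\n"
            + "\n".join(concepts))
```

The overlap between chunks is the one design choice worth keeping even if you change everything else; it's cheap insurance against a term that straddles two sections.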
The main concern, though, is the narrative points; AI tends to miss the relevant, contextual details that really define stories sometimes. Gemma occasionally shines in this area, but at the cost of a bunch of other unique details, something I've come to find recently.
It's definitely a juggling act. At some point, when I have time, I'm going to build a proper "harness" around it, an n8n workflow, so I can have different agents handling different components. I'll use redis for the memory. Considering how good these models are in a simple chat interface, they'll probably be magic stacked in an agentic workflow. I anticipate that qwen 3.5 will probably be the bottleneck due to its thinking tokens at times, so I'll need to make sure I save it for the most necessary or reasoning-complex portions; then there's the model unload/reload...so I'll have some decisions to make. But it'll be beautiful when it's done.
Gemma 4 can be lazy, so I'll need to give it more narrow focus to ensure it doesn't miss things.
Puzzleheaded_Base302@reddit
I know that on my RTX PRO 4500 32GB GPU, I get more context length (115K) with qwen3.5-27b, because its weights occupy less VRAM.
Thrumpwart@reddit
Hey how is your inference speed on that gpu? Specifically is PP good? Is it loud?
Those <6000 workstation cards are interesting and we don’t hear enough about them.
pointer_to_null@reddit
Mostly because they're still pretty damn expensive- you're going to pay a premium for the Pro cards. RTX PRO 4500 32GB is hundreds more than a typical 5090, and has half the CUDA cores.
Puzzleheaded_Base302@reddit
When I bought the RTX PRO 4500, I made damn sure my RTX PRO 4500 was $1000 cheaper than a 5090. Also, buying a 5090 online is too risky. Too many scammers on eBay. Buying a RTX PRO 4500 was a transaction at a Central Computer store. The 5090 is $4000 in stock, and the RTX PRO 4500 is $2900 in stock.
pointer_to_null@reddit
My mistake- I can't keep up with prices anymore.
We got a few at work (for graphics, not LLM) not long ago for ~$2500, and I should've known supply has dried up again, likely due to Nvidia's shenanigans and rampocalypse.
Paying $4000 for a 5090 is insane. There's either zero supply or too many idiots with cash.
Thrumpwart@reddit
Not everyone wants a 4 slot 600w monster in their moms basement though. Particularly for professional tasks.
Puzzleheaded_Base302@reddit
I only paid attention to TG. It is about 36tps at low context and 30 tps at high context. PP was about 500-ish maybe. vllm can get me 2000 tps for PP, but only 26 for TG. I use llama.cpp to get 36 tps for TG for day to day use.
Thrumpwart@reddit
Nice, thank you. How loud is that blower fan at full tilt inference?
Puzzleheaded_Base302@reddit
All I can tell is the B70 (<300W) is louder than the RTX PRO 4500 (200W) at full speed. Still acceptable to me, although it is very subjective. The fan only goes to full speed during the prompt processing phase; the token generation phase consumes less power.
Thrumpwart@reddit
Thank you. Buying this week and still not sure which combo I should go with.
Express_Quail_1493@reddit
THANK YOU!!! solid detailed tests
SirToki@reddit
I have no problem with your writing style, nor your experiences, but why are you using Gemma with that high quant and with that much context if you are getting 0.6 to 3 t/s? You are bleeding into your system ram and that's why it's so slow. Do you wait through all the output? Is it worth it?
GrungeWerX@reddit (OP)
I would just start it and do other things. It’s much faster now after unsloth updated it. For me, it was worth it as the assignment required a lot of context to process and I could do twice as much work by letting it run.
It’s at least 7 tok/sec minimum right now, which feels faster since it goes straight to output and I don’t have to wait through thinking tokens. I’m still tweaking to get faster results. But at its current speed it’s definitely usable for sure…the quality is solid.
It’s currently consolidating tons of notes, and analyzing narrative plot lines. It’s citing other details and cross-referencing context that I completely forgot about. On more than one occasion, both it and Qwen were referencing details so obscure, I had to search and validate it wasn’t hallucinating.
And it wasn’t.
tmvr@reddit
What is your total memory usage? I feel like with the size of the Q4_K_XL you are just about spilling over to system RAM and a minimal trim of the context size or setting the KV to q8_0 would solve the issue.
SirToki@reddit
That's valid, I was just curious whether you tested a lower quant like an IQ4, that would give you the same context, but could be triple the speed with marginal difference in output.
AvidCyclist250@reddit
imo anything under 15 t/s is best done manually.
simracerman@reddit
It’s a good rule, but has many exceptions.
I can’t code much, but I can write Qwen3.5-27B a long detailed prompt to implement a feature that takes the model about 60-80 mins, then kick it off before going to work. I come back and see it done, or needing a few small tweaks to make it work. At long context, it runs at about 10 t/s on my GPU.
Without AI, I simply don’t have the time and resources to learn and code it. Even if it ran at 3 t/s, it still wouldn’t cost me more than 5 hours, and a few cents of electricity.
sonicnerd14@reddit
Even if you can code, a subpar generation speed is still faster than the rate at which most people can write code by hand. Unless you are a top 1% olympiad coder, there is still some efficiency gain in simply giving the AI the prompt, letting it do its thing for a couple hours, and then coming back to see the result finished and working. These small local models are good enough now that 95-99% of whatever they write will work first try.
GrungeWerX@reddit (OP)
Depends on the assignment.
For me, 7 tok/sec - the minimum speed I’m getting now after unsloth’s update - is worth the wait and significantly faster than doing it by hand.
Not to mention, I can double my output by working on other things simultaneously.
tavirabon@reddit
From my testing, Gemma better understands intent while Qwen is simpler to use (i.e. throws more information at you unprompted). But you can prompt engineer Gemma to give you everything, it listens maybe the best out of any small-mid range LLM I've ever used. And yeah, part of that info dump Qwen loves will often be confidently-sounding bullshit.
I haven't tested them for rp-type stuff so maybe I'm missing something there, but if I could only afford the disk space for 1 model, it'd be Gemma no doubt. It's still worth running both together for work, and particularly when it comes to VLM stuff, their strong and weak points are much more exaggerated (e.g. Gemma does multi-modal and multi-lingual reasoning better, Qwen is better suited for raw captions).
Zc5Gwu@reddit
Eh, I kind of am not crazy about the info dump. It feels lazy to me when a model behaves that way, like, the model is not smart enough to fully understand your intent so it forces you to wade through options instead.
One reason I really like minimax. They must have answer length as a constraint on RLHF because minimax is so concise with its answers, it's great.
tavirabon@reddit
I'm not either tbh, I usually want quick answers I can read quick. It does cut the other way though, Gemma is a lot more to-the-point (not saying it doesn't add fluff after) but if you want more, you have to prompt for everything.
It's easier to explain with VLM captions: You have to add "in detail" to get more than a short summary and it will only be a rough paragraph with coarse details, if you want more you have to specify each thing about the image you want captioned. Which it can do, at least it's not omitting because it doesn't know. This is excellent for a dev that knows what they want - the code is highly readable, functional and saves you time. But it's not the behavior you want for automated captions.
GrungeWerX@reddit (OP)
I don't rpg, so can't speak for that either. I have to disagree about Qwen's info dump though; it's usually on point in my experience. But it doesn't always illustrate that context as well as I'd like.
I'm still putting both through their paces, so my thoughts might change.
tavirabon@reddit
You used "lore" as your example so I just assumed that was what you were testing it on. I've only been looking at stuff like code, Q and A, task-oriented stuff.
GrungeWerX@reddit (OP)
Lore as in a story bible for an IP.
IrisColt@reddit
If most models are useful for RPG-ing, then the standard used to define ‘real use-cases’ is not demanding enough. That said, for that use case Gemma4 31B significantly outperforms Qwen3.5 27B.
IrisColt@reddit
>Gemma 4 is currently a lot slower than Qwen 3.5
No, it's really the opposite. Gemma 4 is more efficient at thinking; that's not even debatable.
GrungeWerX@reddit (OP)
That's absolutely debatable. But for another time.
I've gotten results on both ends of the spectrum. Still...when gemma 4 shines, I feel more connection than I do comprehension. For some assignments I give it, that matters more. I can see how this would make it very strong for RPG.
But there are some lore assignments I give it where I need comprehension more, and that's where it sometimes falls short...or misses important details entirely.
But I'm impressed enough that I'm committed to digging into Gemma 4 full throttle now. It's earned its place alongside Qwen 3.5 as lore master B. And while I still feel like Qwen is smarter, I want Gemma's intuition. And I'm going to try to see if this "magic" that shows up here and there can be prompted or even put into a harness.
This is just basic chat mode. I haven't even strapped this to N8N yet and given it a proper workflow with multiple points of reasoning. But even without that, these two models in raw chat mode are beasts.
IrisColt@reddit
I am sorry; sometimes I like to spark debate by pointing out that certain conclusions are not even up for discussion. I appreciate your comments very much. In my experience, Qwen 3.5 (27B) struggles with complex, multi-persona scenarios. When prompted in English to write in another language, it often breaks strict narrative constraints, such as prohibitions on foreshadowing, faithful reconstruction of scenes based on context, precise matching of tone and voice, avoiding coverage of uninvolved characters, and treating the future as genuinely unknown. Gemma 4 is not perfect, but its prose feels more human and idiomatic, with stronger contextual awareness, and its NPCs come across as noticeably more perceptive than anything Qwen 3.5 can produce.
IrisColt@reddit
“Qwen 3.5 is highly efficient but tends to be strictly literal, making it difficult to infer nuanced intent or naturally adapt to context. By contrast, Gemma 4 demonstrates stronger contextual awareness and produces responses that feel more grounded in relevant human experience, resembling the judgment and insight of a skilled lore-focused perspective.”
GrungeWerX@reddit (OP)
I can agree with most of that.
Areas in bold where I'm a little "hmmm...":
Expound on that highlighted part?
...and:
...expound on that highlighted part please?
IrisColt@reddit
Great post, really appreciate it. Thanks for sharing!
Material_Pen3255@reddit
I have a similar question. Which of these LLMs works best with a 16 GB GPU, and does the quality degrade significantly with quantization?
TinFoilHat_69@reddit
qwopus 27B shards 14GB across each of my 4 cards; the rest is KV cache, which fills up pretty quickly at 8GB per card. So in total I have 88GB allocated in VRAM.
It's a hybrid model with 16 active layers; the other 48 are fixed, out of 64 total layers. It was trained on datasets from opus 4.6, but it is a version of qwen that handles complex tooling rather well; the v2 iteration doesn’t overthink as much.
I’m running at gen 3 speeds, getting 20GB/s of bandwidth and pulling 600-880W when inferencing. My context window is 128K. I’m running vLLM on Void Linux with a 5950x, 128GB of DDR4, and 4x 3090s.
krydderkoff@reddit
What motherboard and cpu you got for this?
TinFoilHat_69@reddit
x570m pro4 with a 5950x
krydderkoff@reddit
Just interested in how you connect your cards then, since mb only support like 1 card at x16 pci 4. or, cpu has only 24 lanes. How’s the speed?
finevelyn@reddit
Gemma 4 31b is slightly slower than Qwen 3.5 27b, but not that much slower. You are using the wrong quantization or settings for your GPU that is causing it to offload to RAM if you are getting only a few tokens per second. On a 3090ti you should be getting 20-30 tk/s, but it's a tough fit to 24GB of VRAM.
GrungeWerX@reddit (OP)
What settings do you recommend? Keep in mind, I need minimum 100K context starting off. Also, I don’t want lower quants if intelligence is noticeably worse, otherwise, what am I using it for?
I’ve listed the quants I’m using in the original post at the top. My Qwen speeds for Q5 are fine, nice and brisk. Would love to hear some Gemma 4 recommendations.
finevelyn@reddit
I want to say the 31B is just a tad too big to be a perfect fit for a 24GB card. I think the absolute best you can do is to use IQ4_XS quantization with a Q8 KV cache (-ctk q8_0 and -ctv q8_0 in llama.cpp). With exactly 100k context size this will leave just over 1GB of the 24GB free, and should get to close to 30tk/s on your card (possibly slightly slower at higher context size).
It's also best to use the latest llama.cpp with attn-rot implemented, which reduces the quality loss from using a Q8 KV cache.
If the GPU is not dedicated to only LLM use and is also used for your monitors, then that might eat too much of the VRAM to use even IQ4_XS. You would have to drop down to Q3, and I'm not sure if there is much point to using this particular model at that point.
To be clear, I'm not saying you're necessarily running the model wrong for your use case. If it works then it works. My point is more that it's not a good speed comparison when one of the models is fully on the GPU and the other is not.
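Worth noting that you can estimate the KV cache footprint before loading anything, since it grows linearly with context length. The layer/head numbers below are placeholders for illustration, not the actual Gemma 4 config; pull the real values from the GGUF metadata of whatever file you're running.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    # 2x for K and V, stored per layer, per KV head, per head dim,
    # per context position
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

GIB = 1024 ** 3
# Hypothetical model dimensions, for illustration only
f16 = kv_cache_bytes(48, 8, 128, 100_000, 2) / GIB  # f16 cache
q8 = kv_cache_bytes(48, 8, 128, 100_000, 1) / GIB   # q8_0 at ~1 byte/elem
print(f"f16: {f16:.1f} GiB, q8_0: {q8:.1f} GiB")
```

The q8_0 figure slightly undercounts (block scales add a few percent), but the point stands: quantizing the cache roughly halves its footprint, which at 100K context can be the difference between fitting in 24GB and spilling into system RAM.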
GrungeWerX@reddit (OP)
The Q4 31B is only about 1GB larger than the Q5 27B I'm using. At 100k, both models are above the 24GB VRAM, but that's not a bad thing if they can handle the KV cache well. Both are on Q8 KV cache, but Qwen's Q5 is like 5x the speed. I average 26+ tok/sec at 100K ctx, which is just fine for the quality I'm getting. I don't want to drop down to the IQ4_XS because the UD models tend to perform with higher quality. The claim is that they're almost a quant higher in performance, and that's actually been my experience.
Raredisarray@reddit
Hey just wanted to say that I appreciate the time you took on testing the models and your thoughtfulness on explaining the results.
I also enjoyed reading your piece on writing style and want to thank you for your contribution - even though it was stolen from you! Wild times we are living in.
GrungeWerX@reddit (OP)
Thank you! Keep being awesome.
pointer_to_null@reddit
Agreed, there's structural patterns you pick up after reading too many ChatGPT and Claude-generated walls of text (still getting used to Gemini as I'm starting to use it more). There's little indication that this was AI or even tweaked from generated output to sound more human. Furthermore, I've noticed some very tiny oddities (style quirks?) in OP's punctuation that get repeated that 99% of readers wouldn't notice nor care- but these are the things that LLMs rarely produce consistently.
It's concerning that people who actually spend the effort writing up something high quality, formatted and meaningful are dismissed because people assume it's AI slop.
GrungeWerX@reddit (OP)
You get it.
^ AI would never do that.
And those oddities/style quirks are what AI abhors, and will try to correct into oblivion. It's too "smart" for its own good, simultaneously missing the narrative.
It tries - like us - to use short sentence quips. Without getting the point.
pointer_to_null@reddit
I wasn't even referring to the italics and/or bold emphases- these are easy to do after the fact, but your lack of space around/after ellipses and spaces around ~~emdashes~~ hyphens. That triggers my OCD/ADHD/Autism/brain parasite.
But what do I know? As a ~~language model~~ meatsack who spends majority of his daily keypresses within an IDE or *nix console, I can't even write english good, or grammar proper.
GrungeWerX@reddit (OP)
:)
Word to the uninformed random redditor reading this comment: I deliberately put spaces around the hyphens so they don't become em dashes, because AI has nearly ruined them for us writers. Though I still love 'em and will some day return to them. In the meantime, next best thing - hyphens!
And a side note, I actually like your writing. Strikethroughs for the win!
As for the lack of spaces around ellipses ... there you go. But I've RARELY used those.
Okay...I've never used those before. :P They tend to give a much more dramatic pause than I'm going for.
But I might try out some future use cases for the sake of... English.
Or is it the sake of ... English?
Which do you prefer? <-- Uh-oh, closing question. Gotta be AI!
Gringe8@reddit
You spent half the post explaining you didn't use AI to write your post when it didn't even sound like AI to begin with lol
GrungeWerX@reddit (OP)
Well, I guess I appreciate your comment about the organization aspect - most people don't do that - but...that's all me. Been doing that for years. It's not hard to separate ideas into chunks. Like I said, this is old hat for us non-gen z'ers.
Commando501@reddit
I could instantly tell this wasn't written by AI. Guess I'm not an idiot 😎🫡
Equivalent-Repair488@reddit
I find the easiest tell is the engagement farming question at the end of usual bot posts. The fact that the end was a long ass crashout made me extra confident it was done by an annoyed human lol.
GrungeWerX@reddit (OP)
Love you.
GrungeWerX@reddit (OP)
You’re not.
It’s just bots deployed to derail conversations.
Reasonable-Two-4871@reddit
Try Gemma 4 MOE
GrungeWerX@reddit (OP)
I did. It's poor quality compared to 31B. Not on the same level and TONS of slop.
moritzchow@reddit
Maybe use the MOE as draft model for both spec prefill and decoding tasks? Then you have speed + intelligence
GrungeWerX@reddit (OP)
Which model for draft? I was reading about spec decoding earlier. Someone said it's only a few tokens faster for writing, bigger gains for coding. I'm not coding w/Gemma 4 yet, just need an analyst...
Top-Rub-4670@reddit
26b is about 5x faster than 31b for me, it's very significant. That person must have had issues in their setup (maybe RAM offloading?). I can't speak to the quality when it comes to writing, though.
GrungeWerX@reddit (OP)
That statement was referring to using it as a draft model for speculative decoding.
moritzchow@reddit
For spec prefill and spec decoding to work, you need the draft model to share the same vocab, and to be smart while small enough to gain speed. For example, if you use full BF16 Gemma-4 31B, since each token processed requires the full 31B active, you could probably gain speed by using, say, Gemma-4 26B MOE at 8-bit (Q8) quantization as the draft model; they share the same vocab, the same context window, and the same Gemma lineage, so it may work well. (You may need to do some testing though.)
Just on the prefill side, I did see a gain on Nemotron Super using Nemotron Nano as draft; prefill speed jumped from roughly 550 up to around 680 (yeah, slow M3 Ultra Apple Silicon for prefill work, so every token gained counts). YMMV, but considering the architecture is the same, if you have some spare VRAM it’s worth experimenting.
ChocomelP@reddit
I am very curious about what this means.
GrungeWerX@reddit (OP)
For example, sentences like:
"It's not just x, it's y."
gearcontrol@reddit
Was thinking turned on when you tested 26B? I found it very capable with thinking. Your Gemma 4 31B and Qwen 3.5 27B comparison is exactly what I found as well. But I like Gemma's personality, the same with Gemma 3, so I use Gemma 4 26B as a daily driver because it's fast (4B active) and Qwen 3.5 27B for tool work and coding.
Setup: i9 11900K - RTX 3090 - 64GB
Models: qwen3.5-27b-claude-4.6-opus-reasoning-distilled Q4_K_M | gemma-4-26b-a4b Q4_K_M
GrungeWerX@reddit (OP)
I did, but wasn't impressed. It probably has other use cases that I'm not needing it for at the moment. It's too slop heavy for my tastes. So many "It's not just this, it's that." Makes me want to scream. It's definitely smarter than Gemma 3 27B, so worth the upgrade for many, but the slop feels as bad.
llitz@reddit
Unfortunately, IMO, for MOE you need to keep the tasks being executed smaller. You can still achieve the same result, just gotta break it down a lot
Now if you are asking a specific question and it is missing that's a different thing.
In my case, I wouldn't ask it to "think" about multiple aspects in the same question. That said, I just can't really follow my own advice, that's why I also run the 27b!
GrungeWerX@reddit (OP)
lol!
But that’s my actual use case. It needs to reason about multiple things over very long context.
llitz@reddit
You can get it to reason about it, but it usually needs to break the task down into a few steps over the reasoning. My experience with super long reasoning using MOE is that it tends to have incorrect bias; something like "we will reason over this, get back to me on every step" sort of works.
I think tool calling might work too since it is a new call between each step and it would load a different expert.
You could try creating a tool that... does nothing, and ask it to use the tool every few reasoning steps; it might provide better accuracy in your tests.
EstarriolOfTheEast@reddit
Typical sampling reduces slop in most models and boosts creative outputs/interesting insights (exceptions being post-training over-tuned to the point of not representing uncertainty well anymore). Worth a try, both 26B and 31B might benefit for your usecase.
Sampling approaches that attempt to preserve diversity are even better.
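For reference, "typical sampling" here usually means locally typical sampling (exposed in llama.cpp as the `--typical` parameter): keep the tokens whose surprisal is closest to the distribution's entropy, which trims both the over-predictable slop token and the incoherent tail. A rough sketch of the idea, assuming you already have next-token probabilities:

```python
import math
import random

def typical_sample(probs, typical_p=0.9, rng=random):
    """Locally typical sampling over a probability list.
    Keeps tokens whose surprisal (-log p) is nearest the entropy,
    until their cumulative mass reaches typical_p, then samples
    from that set."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Rank tokens by how far their surprisal sits from the entropy
    ranked = sorted((i for i, p in enumerate(probs) if p > 0),
                    key=lambda i: abs(-math.log(probs[i]) - entropy))
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= typical_p:
            break
    # Renormalize over the kept set and sample
    total = sum(probs[i] for i in kept)
    r = rng.random() * total
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

Note how a very peaked distribution can end up *excluding* the argmax token when its surprisal sits far below the entropy; that exclusion is exactly the anti-slop effect being described.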
gearcontrol@reddit
It occurred to me that it's similar to the way I feel about the cloud LLMs. In this matchup I feel the same about Gemma-4 as I do ChatGPT-5.3 and the same about Qwen-3.5-27B as Claude-Opus-4.6. Maybe the best option is to have them check each other, like I do with the cloud models, though it would be more of a pain, as I can only run one at a time.
AvidCyclist250@reddit
nah
tavirabon@reddit
IMO both Gemma 26B and Qwen 35B are not worth using if you can run at least Qwen 27B.
stddealer@reddit
Depends how much you value time. I can run the 31B at a usable 15t/s or 27B at 17t/s, but I'd much rather run a MoE at ~60 t/s for tasks that require a lot of tokens.
GrungeWerX@reddit (OP)
That time shrinks when you consider that you might have to re-roll the MOE 2-3 more times to get a satisfactory response, where the 27B/31B might one shot it.
In fact, that's the main reason I don't use them. I kept having to respin the wheel because it was missing too many things, loaded with slop, or just not really as thorough as the dense models.
Speed isn't everything. I'm more interested in quality.
tavirabon@reddit
On a 3090, 31B gives me 20 or 30 t/s depending on the quant. Qwen 35B gives me 80, but it uses 4x the tokens for the same task. 31B in iq4 with no thinking gives better outputs in effectively the same or less time. And as 35B fills its context faster, it also struggles more with details in the middle. I haven't identified any advantage for Qwen 35B on my end.
GrungeWerX@reddit (OP)
100% agree.
boutell@reddit
Thank you. And if anybody wants to know who's to blame for all the emdashes in AI? It's me. I did it. LOL
GrungeWerX@reddit (OP)
Me too.
I love em dashes. Just not the way AI writes with them. They can be used so much more effectively.
Dr_Bankert@reddit
With Gemma 4, I found quantizing the KV cache helped a lot with speed. Its default behavior seems to be ultra-high-precision context, which causes it to take forever to work through it.
GrungeWerX@reddit (OP)
I don’t want to lose any of that intelligence over long context, which supposedly happens with kv cache quantization.
Awwtifishal@reddit
With a recent llama.cpp, gemma 4 with the KV cache at Q8 is near lossless thanks to attention rotation (enabled by default; it's part of how turboquant works).
GrungeWerX@reddit (OP)
That's what I'm currently using.
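For reference, KV-cache quantization in llama.cpp is set per cache via the `--cache-type-k` / `--cache-type-v` flags. A minimal sketch of a server launch; the model filename and context size here are assumptions, and flag spellings can vary between llama.cpp builds (e.g. `-fa` vs `--flash-attn`):

```shell
# Sketch: launch llama-server with an 8-bit quantized KV cache.
# Quantizing the V cache generally requires flash attention to be enabled.
./llama-server \
  -m gemma-4-31b-UD-Q4_K_XL.gguf \
  -c 65536 \
  --flash-attn \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```

At long contexts the KV cache can be several GB in f16, so Q8 roughly halves that memory and can noticeably change how much of the model fits on a 24GB card.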
Ok-Ad-8976@reddit
Yeah. I’m sick of these AI accusations; they add nothing to the discussion.
GrungeWerX@reddit (OP)
I’m starting to realize that it’s just bots deployed to derail conversations. Bots are easy to make and flamers get off on this type of $#%¥.
TheThoccnessMonster@reddit
Cypress Hill Voice: SCRIPTS FROM THE DOME
GrungeWerX@reddit (OP)
^___^
draconic_tongue@reddit
Okay but would you say you stole the literature you've read growing up and repackaged it when you've gained your own voice?
GrungeWerX@reddit (OP)
No. Because I don’t write like the literature I’ve read. I’m not into derivative works.
florinandrei@reddit
A shitload of authoritatively-sounding text, and you only get a few Tok/sec out of Gemma on a 3090?
Buddy, you're a poser.
GrungeWerX@reddit (OP)
Share your settings, same card. Be sure to match context in your sampling and give me your speeds. Share a screenshot. Then we’ll see who the poser is.
I’m game for trying out new settings. So either drop some or stfu.
boutell@reddit
Constructive: "hey, are you using the latest head build of llama.cpp? They fixed some Gemma stuff"
Non-constructive: this ⬆️
fyvehell@reddit
"My two cents"
Writes entire post with AI and then post processes it to look less like AI
boutell@reddit
Would ya let it go? We can't tell for sure and it really doesn't matter. Judge the post on its merits, not the fact that it was written by someone of a certain demographic (or maybe with an llm, but so what?)
GrungeWerX@reddit (OP)
It’s a bot. Bots can’t see the examples of past writings I shared from before AI that instantly dispels their claims.
GrungeWerX@reddit (OP)
Be gone bot.
Diacred@reddit
It really didn't feel AI written, I write like that as well lmao
CryptoSpecialAgent@reddit
Well, you’re comparing a Q6 with a Q4, so it’s not entirely a fair comparison. Gemma 4 at full precision is an entirely different beast from the Gemma 4 quants, even if the unsloth marketing literature implies otherwise. Last night I spent $5 to rent an H100 and tested the 31b in fp16, and the subjective differences between it and the Q4 on huggingchat (or the Q3 UD on my MacBook Pro) are far greater than what the unsloth data makes them seem.
31b in full precision actually does feel like a frontier grade model, and I now understand why it has a higher ELO score than Claude sonnet 4.5, gpt 5.2, etc on lmarena…. It falls short of opus 4.6 obvs but keep in mind that for day to day tasks not involving coding, sonnet 4.5 is more than enough.
Whereas any of the 31b quants I’ve run locally show some promise but are lacking a certain coherence especially over longer contexts…
GrungeWerX@reddit (OP)
I didn’t compare the q6 against the q4. Just mentioned it as one of the quants I use. I compared the q5 against the q4 because I can’t fit the q5 in my vram; the q4 already crawls at a snail’s pace.
That said, the point of the post is running it in 24 gb vram. I obviously can’t run these models at full precision. I’ve heard that Qwen at full precision is significantly better as well.
jojorne@reddit
Qwen: temp 0.3 top p 0.9 top k 20 min p 0.1
Gemma: temp 1.5 top p 0.9 top k 20 min p 0.05
so i'm wondering what yours are, because Qwen wasn't that impressive. it failed to follow the prompt. now, i'm using both the MoE versions. while the dense ones are better of course, i don't find the MoEs that bad, and the speed is nice too. i prefer Qwen for coding and Gemma for stories.
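Those samplers compose roughly like this. A toy numpy sketch of how temperature, top-k, top-p, and min-p each prune the token distribution; the exact filtering order differs between inference engines, and the function name here is made up for illustration:

```python
import numpy as np

def sample_filter(logits, temp=0.3, top_k=20, top_p=0.9, min_p=0.1):
    """Sketch of temp/top-k/top-p/min-p filtering over one logit vector."""
    # Temperature-scaled softmax (lower temp sharpens the distribution).
    z = logits / temp
    probs = np.exp(z - np.max(z))
    probs /= probs.sum()
    keep = np.zeros_like(probs, dtype=bool)
    # top-k: keep only the k most likely tokens.
    keep[np.argsort(probs)[-top_k:]] = True
    # min-p: drop tokens below min_p * (probability of the top token).
    keep &= probs >= min_p * probs.max()
    # top-p: keep the smallest set whose cumulative mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    nucleus = np.zeros_like(keep)
    nucleus[order[: np.searchsorted(cum, top_p) + 1]] = True
    keep &= nucleus
    # Zero out everything filtered, then renormalize.
    probs = np.where(keep, probs, 0.0)
    return probs / probs.sum()
```

This is why the two recommended configs feel so different: Qwen's temp 0.3 + min_p 0.1 collapses onto a handful of candidates, while Gemma's temp 1.5 + min_p 0.05 keeps the distribution much flatter.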
ROS_SDN@reddit
Use their recommended sampling parameters for qwen if you haven't and reevaluate.
The difference between the coding sampling and general sampling parameters is night and day when I need to switch between the tasks.
GrungeWerX@reddit (OP)
THIS.
Fault23@reddit
wait for qwen 3.6 27B drop then
GrungeWerX@reddit (OP)
I’m so looking forward to it.
Fyksss@reddit
this is a very consistent, high quality post. don't be fooled by the 'stupid' comments :D
GrungeWerX@reddit (OP)
Thanks!
trusty20@reddit
Observe two hidden profile accounts glazing each other
GrungeWerX@reddit (OP)
Be gone, bot.
seppe0815@reddit
jesus maria, these a.i generated text posts here make me mad
GrungeWerX@reddit (OP)
Seriously dude? What about this looks AI generated? I write the same way I've written for 20 years. I suppose you'd think my old writings were AI written as well.
Oh, and by the way. AI was trained on OUR work. Just saying...
FluoroquinolonesKill@reddit
I am really frustrated at my good writing - and anyone’s good writing - now being mistaken for AI because people are too dumb to tell the difference.
GrungeWerX@reddit (OP)
Tell me about it. Probably easier to just set up a web page with your documented pre-chat-gpt writing style, then drop a link whenever these boneheads start making idiotic accusations. If they're still in denial, they've just made the collective audience realize they're neanderthals.
finevelyn@reddit
I don't think anyone cares about proof, when there are no stakes. People either like your writing or they don't, and no amount of evidence is going to change it.
The AI accusations stem from how many actual AI posts there are, and they are getting annoying too. People are only going to get more critical of text that looks similar to AI writing as time goes on, and it's not without merit. They are not boneheads in general.
SpiritualWindow3855@reddit
This seems like at most, formatted by AI text.
GrungeWerX@reddit (OP)
Neither. I'm a writer. It's not hard, man.
eidrag@reddit
cool, writer. you also mentioned dA journals. currently what's your use case for llm?
GrungeWerX@reddit (OP)
Previously just for coding, now also as a story analyst. For example, I'll feed it a large story bible and ask it to analyze certain themes, concepts, characters, relationships, combat systems, etc. and give its thoughts or feedback. Very useful to have a second brain around to notice things you missed, or make connections when you're not thinking about them, especially when the lore is dense and it's easy to overlook things.
Think of it like a talking encyclopedia.
SpiritualWindow3855@reddit
I don't know who this comment is for: I said AI didn't write this (primarily because AI would keep it much tighter), and at most it got a pass for formatting.
I thought writers needed to be good at reading too.
GrungeWerX@reddit (OP)
I was addressing both of you at the same time. His accusation of me using AI, your "at best formatted by AI text" comment. Point was, "neither" needs AI to get done.
I can read fine, but our context-tracking could use a little work. ;)
Jabs aside, I didn't take your comment as an attack. Just didn't feel like making two posts. Not my intent to direct all that at you.
SpiritualWindow3855@reddit
There's a reply button under their comment.
I read your reply as annoyingly patronizing since it leads with asking me about my age then explaining basic formatting in nauseating detail, not in good sport.
traveddit@reddit
There are grammatical errors and uncommon idiomatic uses of punctuation that make it pretty easy to tell it was written by a human. Clearly you're not apt enough at reading English to tell anything apart.
rkd_me@reddit
96GB? Did you try 122b-q4?
Photochromism@reddit
The MoE versions of Qwen 3.5 and Gemma 4 are also great.
100% agree that these models are a huge step forward for large context awareness. Loving writing with both of them. I can’t decide between the dense models and the MoE versions. The MoEs handle large context well, but the dense models feel more intelligent.
ArtifartX@reddit
Are you offloading a fair amount of the model to system RAM? Because if not, you'd barely fit the models you listed on the card with a tiny context window. If you wanted 10k+ context and the entire model on the GPU, you'd be looking more at Gemma 4 31B Q4XS or Q3 UD, and Qwen 3.5 27B Q4s.
joao_brito@reddit
For me the biggest difference I'm noticing between gemma and qwen is that the gemma 4 model has a lot of world knowledge baked in; it can usually answer many of my queries without any search, which keeps its token output way lower than the qwen models'. On the other hand, most of the errors I get from gemma are on stuff it would probably answer correctly if it used the search tool, but it tends to avoid those tool calls unless necessary.
My current workflow is usually to try gemma 4 first; if I get some fishy results I try again with qwen 397b and it gets it right.
Potential-Leg-639@reddit
For agentic coding and context you need at least 2x3090 (200k context and Q4 cache fit well with the Q5 model).
SkyFeistyLlama8@reddit
How do Gemma 4 31B and Qwen 3.5 27B compare to good old Mistral Small 24B or Devstral 24B? I still use Mistral for creative writing because nothing else has the same flair. Gemma 3 27B was good but kept falling into LLM tropes.
I rarely use the Gemma 31B or Qwen 27B because they're really slow, being dense models; Gemma 26B and Qwen 35B MOEs get close to their smarts while being so much faster.
As for writing like an AI, yeah I feel you there. AI model makers slurped up decades' worth of Reddit and Usenet posts for training; if you've been around on the net since the days of SLIP and telnet, you would've picked up certain quirks and styles of writing as part of that online zeitgeist. And you would probably sound like an LLM.
No LLMs were harmed in the making of this post.
LuckyGhoul@reddit
Is anyone getting frequent full cache wipes due to SWA? Maybe I should download a newer version, but this is the only thing that I find frustrating about Gemma 4 32b.
discostupid@reddit
Can I suggest you try nemotron a3b polarquant?
https://huggingface.co/caiovicentino1/Nemotron-Cascade-2-30B-A3B-PolarQuant-Q5
Digitalzuzel@reddit
What is that? 🤦♂️
McSendo@reddit
mfer used gemma4 e2b to write this
Digitalzuzel@reddit
he edited the post and that's hilarious. Instead of addressing contradictions he just used another LLM to rant on us 😄
Artpocket@reddit
Bot account.
trusty20@reddit
Yep, he and the glazer accounts are hidden profile accounts, it's like the calling card of bots because it's so obvious when you see their comment history.
GrungeWerX@reddit (OP)
Yeah, it's getting bad on reddit lately.
Artpocket@reddit
Dude's talking about people accusing him. Not your dumb ass post.
GrungeWerX@reddit (OP)
Used another llm? Are you an idiot? Did you not see me post up samples of my work - the same writing style as the post - going back 20 years ago?
You're really showing your IQ man.
GrungeWerX@reddit (OP)
Go bot elsewhere.
Artpocket@reddit
Bot account disregard.
danieltkessler@reddit
Hallucinating ≠ missing or excluding information in outputs. Hallucinating = providing information that didn't exist in / isn't faithful to the source text or instructions.
CentralLimit@reddit
I don’t think you understand the difference between precision and recall.
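The precision/recall framing maps cleanly onto the thread's distinction. A toy sketch; representing "facts" as string sets is purely hypothetical, just to make the two failure modes concrete:

```python
def precision_recall(stated_facts, source_facts):
    """Hallucination ~ low precision: stating facts not in the source.
    Missing details ~ low recall: never stating facts the source contains.
    These are independent axes, which is why a model can hallucinate less
    while still missing important details."""
    stated, source = set(stated_facts), set(source_facts)
    true_positives = stated & source
    precision = len(true_positives) / len(stated) if stated else 1.0
    recall = len(true_positives) / len(source) if source else 1.0
    return precision, recall
```

E.g. a model that states three facts, one of them invented, about a source containing four facts has precision 2/3 (one hallucination) and recall 1/2 (two missed details).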
MrHaxx1@reddit
How doesn't that make sense?
Hallucinating and missing things are two entirely different things.
Digitalzuzel@reddit
How often do you reach a correct conclusion despite missing very important details?
keyboardmonkewith@reddit
Often enough when you use kv-cache compression.
Deep90@reddit
Trying to quote exactly what your friend said last week is going to lead to you "hallucinating" words vs just paraphrasing what you talked about.
GrungeWerX@reddit (OP)
Any time they aren't directly related.
Digitalzuzel@reddit
then those details aren't "very important", right?
GrungeWerX@reddit (OP)
At first, I thought you were an idiot, but then I realized you're just a bot.
GrungeWerX@reddit (OP)
Bro, seriously...wtf are you on about? You're not making sense.
The post was very simple to understand. Sometimes Gemma 4 hallucinates less than Qwen 3.5. Other times it misses very important details.
Example 1:
Example 2:
If you still can't understand this, then ask AI for help bro, because I just don't know, man.
Artpocket@reddit
Bro, you're a dumb ass trying to sound smart.
MrHaxx1@reddit
Rarely, but if I'm wrong, I'm not necessarily hallucinating.
SocietyTomorrow@reddit
Qwopus3.5-27b-v3 gets a decent bit closer to Gemma in terms of keeping it together at high context. Hallucinations are still an issue, but I run into them more on simple tasks than hard work. It's like it gets bored or something and makes crap up.
Digitalzuzel@reddit
Thank you for sharing your experience. Crazy that you've got downvoted (but not surprising as reddit attracts unstable people).
GrungeWerX@reddit (OP)
It can miss important details. Not always. But yes, it seems to hallucinate less.
Thrumpwart@reddit
Nice post. I’ve been doing similar testing.
Last night I discovered byteshape quants - there aren’t many, but the byteshape Qwen3.5 35B iQ4_XS 4.06bpw gguf did remarkably well in my testing, and was faaasst. I’d take a look at it.
danieltkessler@reddit
I have a lot of respect for this entire post. Appreciate it!
UnreasonableEconomy@reddit
Gemma 4 31B Q8 is, in my opinion, mindblowingly good. Possibly even a chatgpt (instant) contender. Although to be fair, OAI really dropped the ball with 5.3.
I'm a bit of an AI grouch in general but here I have to admit that we've had some real progress in the past year. Dense models are the way to go, and this one's an absolute winner.
As you mentioned, it's surprisingly perceptive.
I'll be honest it would have never occurred to me that dense could ever be a compliment lol.
huyanb999@reddit
Great comparison! I've been using Qwen 3.5 27B as my daily driver too. The long context handling is really impressive for a 24GB card setup.
inthesearchof@reddit
Buy one more 3090 and have both loaded and responding side by side. I enjoy both. Gemma's response style. Qwen's more technical. You should be getting around 30 tok/s with Gemma
Fit_Concept5220@reddit
I also wonder about specific pipelines where both (or N) models write, and then one model works on the original context and produces analysis/text/code to compile an aggregate solution. Why choose one if you can aggregate the best of both worlds?
AvidCyclist250@reddit
gemma gets more STEM details right than qwen. the really tricky shit.
PromptInjection_@reddit
I prefer Gemini for a simple reason:
Performance degrades much less with very long contexts.
kourtnie@reddit
My heart aches with how many times you’ve been burned by “did an AI write this?” — enough that you preemptively braced for it.
Happens to me a lot, too.
Thank you for the Gemma and Qwen analysis.
Anxious_Potential874@reddit
i tried coding tasks with the smaller models, qwen 2b and gemma 4 e2b, and gemma 4 gave better results.
i just had one odd result with it, so i can't completely trust it yet, but i intend to use it as my primary model by improving prompts and adding some pre-processing.
i'm also using the older unsloth model and getting approx 14 tg/s on an 8GB ram apu (it has a gpu, but not a discrete one).
i'll try the newer unsloth model today to see if that improves things.
Euphoric_Emotion5397@reddit
I used the same system prompt and the same question, which involves tool use, search, scraping, and a final analysis of the impact on the stock market.
Then I fed the outputs to Gemini Pro.
Gemini rated Qwen 3.5 35b A3B q4 the best, Qwen 3.5 27b second, and Gemma 4 31B last, saying its analysis was surface level.
So the findings are based on my system prompt, which might have been tuned to Qwen since I've been working with Qwen all this time.
GrungeWerX@reddit (OP)
Yeah, it's pretty well known Qwen is better for tool use.
Spiritual_Willow5868@reddit
Are they any good for tool use?
GrungeWerX@reddit (OP)
Qwen is, I haven't tested Gemma for tool use yet as I heard it had issues they were working out. I'll be testing it in the future.
exact_constraint@reddit
Still pretty far in the Qwen2.7 camp. For stuff where generating English text is important (eg, auto generated Flux.2 prompts), I’ll load Gemma 4 31b. But for OpenCode? Qwen2.7 all the way. It’s still early days w/ llama.cpp weirdness, but Qwen has been much more reliable.