Gemma 4 31B vs Qwen 3.5 27B: Which is best for long context workflows? My THOUGHTS...
Posted by GrungeWerX@reddit | LocalLLaMA | 170 comments
- My setup: i7 12700K | RTX 3090 TI | 96GB RAM
- Models: Qwen 3.5 27B UD Q5/Q6_K_XL | Gemma 4 31B UD Q4_K_XL
To the point:
Right now, Gemma 4 31B and Qwen 3.5 27B are the best local models for a 24GB card. Period.
I've tested everything. These are the first two models that actually feel state-of-the-art for their size.
Most models up to this point have just been moderately performing novelties, not extremely useful for real use cases outside of writing emails, minor code, and RPG-ing. And all local models have performed poorly at long-context reasoning and analysis.
Benchmarks mean nothing. For me, it was an easy test: load up a local model, feed it 50K tokens of data, ask it to answer questions and provide analysis. Most models yap without saying anything. They provide very little relevant context, if any. They don't understand the lore. They hallucinate details. They're unusable.
That is, until Qwen 3.5 27B. It was the first of its kind and changed the game for me. It's been my daily driver since.
A couple days after Gemma 4 dropped, I fired it up, dumped a huge 60K tokens of context, and gave it a run. Not only did it answer the questions, it understood the lore. With that, I suddenly had my second model that could handle the job. It wasn't as detailed as Qwen at citing references, but it had a little something that Qwen didn't. I'll come back to that.
Now that that's out of the way, and we've established the two top players for long context reasoning to-date, let's get to the matchup. Who's better?
For the past couple of days, I've been comparing it against Qwen. Here are my findings:
- Gemma 4 is currently a lot slower than Qwen 3.5. I've tested Gemma between 70-100K context so far. Up until yesterday, it crawled along at a snail's pace, making it virtually unusable (I got between 0.6-3 tok/sec). But I found the outputs decent enough to keep trying to tweak my settings. Unsloth uploaded new versions yesterday, so I re-downloaded the model and I'm now getting at least a 2x speed increase, so I'd recommend you do the same if you're still getting slow speeds. That said, Qwen is significantly faster at even higher quants.
- Gemma 4 seems to hallucinate less than Qwen 3.5. It uses fewer references from the context, and it sometimes misses very important details altogether, things that Qwen doesn't. That said, sometimes Qwen gets its facts wrong at near 90K tokens, while Gemma seems surprisingly more coherent, if less factual.
- Qwen 3.5 references more context than Gemma 4. This makes it feel more thorough. That said, sometimes it has a tendency at high context to hallucinate minor details. There's a saying: Less is more. Maybe in this case - less is more....accurate?
- Qwen 3.5 is the clear winner over long outputs. Qwen can write looong passages of content, and maintain coherence. It's amazing. I even tested it once, asked it to write a 20K output. I stopped it prematurely - at around 10K tokens - but if I hadn't, it would have kept going, and it was only halfway through the material.
- Honorable Mention: Gemma 4 can write longer outputs than its defaults, but you have to prompt for it. It's capable of giving more thorough results than its initial output. Another redditor said they told it to reason longer and got better results. I tried this. It works. Not satisfied with the answer? Tell it to reason longer and provide a long output. You can even tell it to try to match a certain context length, like 10K tokens. I haven't tested whether it can reach a set token requirement yet; will follow up on that later.
- Gemma 4 has a better writing voice. I found its outputs more pleasurable to read - mostly. That said, it's still got a noticeable level of slop. Not as bad as 26B, but definitely more than Qwen.
- Gemma 4 digests the lore better for its assignments...sometimes. I'm still testing this, but my initial vibes are that Gemma 4 can give more pleasing results over long context by pulling out more poignant and impactful contextual references. It can punch deeper on the ideas than Qwen at times; Qwen gives you more references, but doesn't always consolidate those ideas in the most meaningful way. Sometimes it feels like this: Qwen is submitting a book report with references. Gemma is writing a review column on a website, citing the parts it found the most memorable. This isn't a consistent experience across all interactions, but it's often enough to notice.
- Qwen is smarter. The results, from a technical perspective, are often better. While both miss details over long context, Qwen is often more thorough. It can take extremely nuanced and complex instructions and eat them for lunch. That said, Gemma is also very capable; I'm still learning its abilities. It's not Qwen level...yet...but it doesn't feel far off.
- Gemma 4 gets it. This sort of falls under the "digests the lore" section, but I just wanted to mention that this version of Gemma is less about pontification; it really does seem to understand the unique ideas outlined in the source material. That makes it feel like you're working with a cowriter who can keep pace and dissect/stress-test ideas. Qwen does as well, but Gemma brings its own ideas to the table.
My final thoughts:
For these particular use cases - lore master, story analyst - I can't really decide which I like better. They have two different personalities, and they are equally useful. Where Qwen 3.5 27B first made me feel like I had a true writing partner, Gemma 4 feels like I've just added a third person to the table, who can bring something different and unique to the conversation.
If I could only choose one, I'd choose Qwen. I find its overall abilities to be better. Better reasoning, more attention over long context.
But without Gemma 4, I'd be missing very valuable and relevant context. That single, random-but-consequential observation that can propel the discussion into a meaningful new direction.
Thankfully, I don't have to choose just one.
EuphoricAnimator@reddit
Interesting read! I'm running similar stuff on a Mac Studio M4 Max with 128GB of RAM, so I've been down this rabbit hole a lot lately. I mostly play with Qwen 3.5, Gemma 4, and a bunch of the Ollama models, and context length is always the sticking point. You're right about those two being top tier for a 24GB card though, especially when you're pushing the limits.
I've found Qwen 3.5 generally holds onto context a little better for me, especially in longer conversations. I can usually get away with 8k tokens fairly consistently without it completely forgetting what we talked about five minutes ago. Gemma 4 is good, don't get me wrong, but I notice more drift around the 6-7k mark. With both though, reloading the entire context every so often really helps - it’s a bit clunky, but I’ve gotten used to just inserting a “reminder of the previous discussion” prompt every 1500-2000 tokens.
What I’ve noticed is VRAM isn’t always the full story. I can load a Q6_K version of Qwen 3.5 and get decent speeds, around 18-20 tokens/sec, but it eats up a lot more system RAM too. Lower quantization (like Q4_K_XL) frees up VRAM, but the inference speed drops noticeably. It’s a constant trade-off, and it really depends on what you’re prioritizing.
Honestly, for really long documents, I've started chunking things up and processing them in sections. It's more work upfront, but it avoids the model just making stuff up because it’s lost track of the beginning. It’s not ideal, but it's the best workaround I've found so far for keeping things somewhat coherent.
GrungeWerX@reddit (OP)
Yeah, I've been thinking about doing something similar with the chunking. Part of what I feel gets overlooked is the rare concepts. My idea is to "scrub" the lore section by section, having the AI append all unique concepts it finds to the bottom while providing a summary at the top. That way, when I re-feed it, it won't lose the unique ideas/terms, while the summaries will suffice for the general overview.
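For anyone wanting to try the same thing, that scrub pass is easy to sketch. Everything below is hypothetical: `llm` stands in for whatever local backend you call (llama.cpp server, Ollama, etc.), and the chunk sizes are illustrative, not tuned.

```python
# Sketch of a "scrub" pass: summarize each section, extract unique
# concepts, and build a re-feedable digest (summaries on top, rare
# concepts appended at the bottom so they don't get lost).

def chunk_text(text, chunk_chars=8000, overlap=500):
    """Split the lore into overlapping character chunks so a section
    boundary doesn't swallow a concept."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_chars])
        start += chunk_chars - overlap
    return chunks

def scrub_lore(text, llm):
    """llm(prompt) -> str is a placeholder for your local model call."""
    summaries, concepts = [], []
    for chunk in chunk_text(text):
        summaries.append(llm(f"Summarize this section briefly:\n{chunk}"))
        concepts.append(llm(
            "List any unique terms, names, or concepts introduced in "
            f"this section, one per line:\n{chunk}"))
    # General overview first, rare/unique ideas preserved at the bottom
    return ("\n".join(summaries)
            + "\n\n== UNIQUE CONCEPTS ==\n"
            + "\n".join(concepts))
```

The overlap between chunks is the one design choice worth keeping even if you change everything else; it's cheap insurance against a term that straddles two sections.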
The main concern, though, is the narrative points; AI tends to miss the relevant, contextual details that really define stories sometimes. Gemma occasionally shines in this area, but at the cost of a bunch of other unique details, something I've come to find recently.
It's definitely a juggling act. At some point, when I have time, I'm going to build a proper "harness" around it, an n8n workflow, so I can have different agents handling different components. I'll use redis for the memory. Considering how good these models are in a simple chat interface, they'll probably be magic stacked in an agentic workflow. I anticipate that qwen 3.5 will probably be the bottleneck due to its thinking tokens at times, so I'll need to make sure I save it for the most necessary or reasoning-complex portions; then there's the model unload/reload...so I'll have some decisions to make. But it'll be beautiful when it's done.
Gemma 4 can be lazy, so I'll need to give it more narrow focus to ensure it doesn't miss things.
Puzzleheaded_Base302@reddit
I know that on my RTX PRO 4500 32GB GPU, I get more context length (115K) with qwen3.5-27b, because its weights occupy less VRAM.
Thrumpwart@reddit
Hey how is your inference speed on that gpu? Specifically is PP good? Is it loud?
Those <6000 workstation cards are interesting and we don’t hear enough about them.
pointer_to_null@reddit
Mostly because they're still pretty damn expensive- you're going to pay a premium for the Pro cards. RTX PRO 4500 32GB is hundreds more than a typical 5090, and has half the CUDA cores.
Puzzleheaded_Base302@reddit
When I bought the RTX PRO 4500, I made damn sure my RTX PRO 4500 was $1000 cheaper than a 5090. Also, buying a 5090 online is too risky. Too many scammers on eBay. Buying a RTX PRO 4500 was a transaction at a Central Computer store. The 5090 is $4000 in stock, and the RTX PRO 4500 is $2900 in stock.
pointer_to_null@reddit
My mistake- I can't keep up with prices anymore.
We got a few at work (for graphics, not LLM) not long ago for ~$2500, and I should've known supply has dried up again, likely due to Nvidia's shenanigans and rampocalypse.
Paying $4000 for a 5090 is insane. There's either zero supply or too many idiots with cash.
Thrumpwart@reddit
Not everyone wants a 4 slot 600w monster in their moms basement though. Particularly for professional tasks.
Puzzleheaded_Base302@reddit
I only paid attention to TG. It is about 36tps at low context and 30 tps at high context. PP was about 500-ish maybe. vllm can get me 2000 tps for PP, but only 26 for TG. I use llama.cpp to get 36 tps for TG for day to day use.
Thrumpwart@reddit
Nice, thank you. How loud is that blower fan at full tilt inference?
Puzzleheaded_Base302@reddit
All I can tell is the B70 (<300W) is louder than the RTX PRO 4500 (200W) at full speed. Still acceptable to me, although it is very subjective. The fan only goes to full speed during the prompt processing phase; the token generation phase consumes less power.
Thrumpwart@reddit
Thank you. Buying this week and still not sure which combo I should go with.
Express_Quail_1493@reddit
THANK YOU!!! solid detailed tests
SirToki@reddit
I have no problem with your writing style, nor your experiences, but why are you using Gemma with that high quant and with that much context if you are getting 0.6 to 3 t/s? You are bleeding into your system ram and that's why it's so slow. Do you wait through all the output? Is it worth it?
GrungeWerX@reddit (OP)
I would just start it and do other things. It’s much faster now after unsloth updated it. For me, it was worth it as the assignment required a lot of context to process and I could do twice as much work by letting it run.
It’s at least 7 tok/sec minimum right now, which feels faster since it goes straight to output and I don’t have to wait through thinking tokens. I’m still tweaking to get faster results. But at its current speed it’s definitely usable for sure…the quality is solid.
It’s currently consolidating tons of notes, and analyzing narrative plot lines. It’s citing other details and cross-referencing context that I completely forgot about. On more than one occasion, both it and Qwen were referencing details so obscure, I had to search and validate it wasn’t hallucinating.
And it wasn’t.
tmvr@reddit
What is your total memory usage? I feel like with the size of the Q4_K_XL you are just about spilling over to system RAM and a minimal trim of the context size or setting the KV to q8_0 would solve the issue.
SirToki@reddit
That's valid, I was just curious whether you tested a lower quant like an IQ4, that would give you the same context, but could be triple the speed with marginal difference in output.
AvidCyclist250@reddit
imo anything under 15 t/s is best done manually.
simracerman@reddit
It’s a good rule, but has many exceptions.
I can’t code much, but I can write Qwen3.5-27B a long detailed prompt to implement a feature that takes the model about 60-80 mins, then kick it off before going to work. I come back and see it done, or needing a few small tweaks to make it work. At long context, it runs at about 10 t/s on my GPU.
Without AI, I simply don’t have the time and resources to learn and code it. Even if it ran at 3 t/s, it still wouldn’t cost me more than 5 hours, and a few cents of electricity.
sonicnerd14@reddit
Even if you can code, a subpar generation speed is still faster than the rate at which most people can write code by hand. Unless you are a top 1% olympiad coder, there is still some efficiency gain in simply giving the AI the prompt, letting it do its thing for a couple hours, and then coming back to see the result finished and working. These small local models are good enough now that 95-99% of whatever they write will work first try.
GrungeWerX@reddit (OP)
Depends on the assignment.
For me, 7 tok/sec - the minimum speed I’m getting now after unsloth’s update - is worth the wait and significantly faster than doing it by hand.
Not to mention, I can double my output by working on other things simultaneously.
tavirabon@reddit
From my testing, Gemma better understands intent while Qwen is simpler to use (i.e. throws more information at you unprompted). But you can prompt engineer Gemma to give you everything, it listens maybe the best out of any small-mid range LLM I've ever used. And yeah, part of that info dump Qwen loves will often be confidently-sounding bullshit.
I haven't tested them for rp-type stuff so maybe I'm missing something there, but if I could only afford the disk space for 1 model, it'd be Gemma no doubt. It's still worth running both together for work, and particularly when it comes to VLM stuff, their strong and weak points are much more exaggerated (e.g. Gemma does multi-modal and multi-lingual reasoning better, Qwen is better suited for raw captions).
Zc5Gwu@reddit
Eh, I kind of am not crazy about the info dump. It feels lazy to me when a model behaves that way, like, the model is not smart enough to fully understand your intent so it forces you to wade through options instead.
One reason I really like minimax. They must have answer length as a constraint on RLHF because minimax is so concise with its answers, it's great.
tavirabon@reddit
I'm not either tbh, I usually want quick answers I can read quick. It does cut the other way though, Gemma is a lot more to-the-point (not saying it doesn't add fluff after) but if you want more, you have to prompt for everything.
It's easier to explain with VLM captions: You have to add "in detail" to get more than a short summary and it will only be a rough paragraph with coarse details, if you want more you have to specify each thing about the image you want captioned. Which it can do, at least it's not omitting because it doesn't know. This is excellent for a dev that knows what they want - the code is highly readable, functional and saves you time. But it's not the behavior you want for automated captions.
GrungeWerX@reddit (OP)
I don't rpg, so can't speak for that either. I have to disagree about Qwen's info dump though; it's usually on point in my experience. But it doesn't always illustrate that context as well as I'd like.
I'm still putting both through their paces, so my thoughts might change.
tavirabon@reddit
You used "lore" as your example so I just assumed that was what you were testing it on. I've only been looking at stuff like code, Q and A, task-oriented stuff.
GrungeWerX@reddit (OP)
Lore as in a story bible for an IP.
IrisColt@reddit
If most models are useful for RPG-ing, then the standard used to define ‘real use-cases’ is not demanding enough. That said, for that use case Gemma4 31B significantly outperforms Qwen3.5 27B.
IrisColt@reddit
>Gemma 4 is currently a lot slower than Qwen 3.5
No, it's really the opposite. Gemma 4 is more efficient at thinking; that's not even debatable.
GrungeWerX@reddit (OP)
That's absolutely debatable. But for another time.
I've gotten results on both ends of the spectrum. Still...when gemma 4 shines, I feel more connection than I do comprehension. For some assignments I give it, that matters more. I can see how this would make it very strong for RPG.
But there are some lore assignments I give it where I need comprehension more, and that's where it sometimes falls short...or misses important details entirely.
But I'm impressed enough that I'm committed to digging into Gemma 4 full throttle now. It's earned its place alongside Qwen 3.5 as lore master B. And while I still feel like Qwen is smarter, I want Gemma's intuition. And I'm going to try to see if this "magic" that shows up here and there can be prompted or even put into a harness.
This is just basic chat mode. I haven't even strapped this to N8N yet and given it a proper workflow with multiple points of reasoning. But even without that, these two models in raw chat mode are beasts.
IrisColt@reddit
I am sorry; sometimes I like to spark debate by pointing out that certain conclusions are not even up for discussion. I appreciate your comments very much. In my experience, Qwen 3.5 (27B) struggles with complex, multi-persona scenarios. When prompted in English to write in another language, it often breaks strict narrative constraints, such as prohibitions on foreshadowing, faithful reconstruction of scenes based on context, precise matching of tone and voice, avoiding coverage of uninvolved characters, and treating the future as genuinely unknown. Gemma 4 is not perfect, but its prose feels more human and idiomatic, with stronger contextual awareness, and its NPCs come across as noticeably more perceptive than anything Qwen 3.5 can produce.
IrisColt@reddit
“Qwen 3.5 is highly efficient but tends to be strictly literal, making it difficult to infer nuanced intent or naturally adapt to context. By contrast, Gemma 4 demonstrates stronger contextual awareness and produces responses that feel more grounded in relevant human experience, resembling the judgment and insight of a skilled lore-focused perspective.”
GrungeWerX@reddit (OP)
I can agree with most of that.
Areas in bold where I'm a little "hmmm...":
Expound on that highlighted part?
...and:
...expound on that highlighted part please?
IrisColt@reddit
Great post, really appreciate it. Thanks for sharing!
Material_Pen3255@reddit
I have a similar question. Which of these LLMs works best with a 16 GB GPU, and does the quality degrade significantly with quantization?
TinFoilHat_69@reddit
qwopus 27B shards 14GB across each of my 4 cards; the rest is KV cache, which fills up pretty quickly at 8GB per card. So in total I have 88GB allocated in VRAM.
It's a hybrid model with 16 active layers; the other 48 are fixed, out of 64 total layers. It was trained on datasets from opus 4.6, but it is a version of qwen that handles complex tooling rather well; the v2 iteration doesn’t overthink as much.
I’m running at gen 3 speeds, getting 20GB/s of bandwidth and pulling 600-880W when inferencing. My context window is 128K. I’m running vLLM on Void Linux with a 5950x, 128GB of DDR4, and 4x 3090s.
krydderkoff@reddit
What motherboard and cpu you got for this?
TinFoilHat_69@reddit
x570m pro4 with a 5950x
krydderkoff@reddit
Just interested in how you connect your cards then, since mb only support like 1 card at x16 pci 4. or, cpu has only 24 lanes. How’s the speed?
finevelyn@reddit
Gemma 4 31b is slightly slower than Qwen 3.5 27b, but not that much slower. You are using the wrong quantization or settings for your GPU that is causing it to offload to RAM if you are getting only a few tokens per second. On a 3090ti you should be getting 20-30 tk/s, but it's a tough fit to 24GB of VRAM.
GrungeWerX@reddit (OP)
What settings do you recommend? Keep in mind, I need minimum 100K context starting off. Also, I don’t want lower quants if intelligence is noticeably worse, otherwise, what am I using it for?
I’ve listed the quants I’m using in the original post at the top. My Qwen speeds for Q5 are fine, nice and brisk. Would love to hear some Gemma 4 recommendations.
finevelyn@reddit
I want to say the 31B is just a tad too big to be a perfect fit for a 24GB card. I think the absolute best you can do is to use IQ4_XS quantization with a Q8 KV cache (-ctk q8_0 and -ctv q8_0 in llama.cpp). With exactly 100k context size this will leave just over 1GB of the 24GB free, and should get to close to 30tk/s on your card (possibly slightly slower at higher context size).
It's also best to use the latest llama.cpp with attn-rot implemented, which reduces the quality loss from using a Q8 KV cache.
If the GPU is not dedicated to only LLM use and is also used for your monitors, then that might eat too much of the VRAM to use even IQ4_XS. You would have to drop down to Q3, and I'm not sure if there is much point to using this particular model at that point.
To be clear, I'm not saying you're necessarily running the model wrong for your use case. If it works then it works. My point is more that it's not a good speed comparison when one of the models is fully on the GPU and the other is not.
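Worth noting that you can estimate the KV cache footprint before loading anything, since it grows linearly with context length. The layer/head numbers below are placeholders for illustration, not the actual Gemma 4 config; pull the real values from the GGUF metadata of whatever file you're running.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    # 2x for K and V, stored per layer, per KV head, per head dim,
    # per context position
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

GIB = 1024 ** 3
# Hypothetical model dimensions, for illustration only
f16 = kv_cache_bytes(48, 8, 128, 100_000, 2) / GIB  # f16 cache
q8 = kv_cache_bytes(48, 8, 128, 100_000, 1) / GIB   # q8_0 at ~1 byte/elem
print(f"f16: {f16:.1f} GiB, q8_0: {q8:.1f} GiB")
```

The q8_0 figure slightly undercounts (block scales add a few percent), but the point stands: quantizing the cache roughly halves its footprint, which at 100K context can be the difference between fitting in 24GB and spilling into system RAM.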
GrungeWerX@reddit (OP)
The Q4 31B is only about 1GB larger than the Q5 27B I'm using. At 100k, both models are above the 24GB VRAM, but that's not a bad thing if they can handle the KV cache well. Both are on Q8 KV cache, but Qwen's Q5 is like 5x the speed. I average 26+ tok/sec at 100K ctx, which is just fine for the quality I'm getting. I don't want to drop down to the IQ4_XS because the UD models tend to perform with higher quality. The claim is that they're almost a quant higher in performance, and that's actually been my experience.
Raredisarray@reddit
Hey just wanted to say that I appreciate the time you took on testing the models and your thoughtfulness on explaining the results.
I also enjoyed reading your piece on writing style and want to thank you for your contribution - even though it was stolen from you! Wild times we are living in.
GrungeWerX@reddit (OP)
Thank you! Keep being awesome.
pointer_to_null@reddit
Agreed, there's structural patterns you pick up after reading too many ChatGPT and Claude-generated walls of text (still getting used to Gemini as I'm starting to use it more). There's little indication that this was AI or even tweaked from generated output to sound more human. Furthermore, I've noticed some very tiny oddities (style quirks?) in OP's punctuation that get repeated that 99% of readers wouldn't notice nor care- but these are the things that LLMs rarely produce consistently.
It's concerning that people who actually spend the effort writing up something high quality, formatted and meaningful are dismissed because people assume it's AI slop.
GrungeWerX@reddit (OP)
You get it.
^ AI would never do that.
And those oddities/style quirks are what AI abhors, and will try to correct into oblivion. It's too "smart" for its own good, simultaneously missing the narrative.
It tries - like us - to use short sentence quips. Without getting the point.
pointer_to_null@reddit
I wasn't even referring to the italics and/or bold emphases- these are easy to do after the fact, but your lack of space around/after ellipses and spaces around ~~emdashes~~ hyphens. That triggers my OCD/ADHD/Autism/brain parasite.
But what do I know? As a ~~language model~~ meatsack who spends majority of his daily keypresses within an IDE or *nix console, I can't even write english good, or grammar proper.
GrungeWerX@reddit (OP)
:)
Word to the uninformed random redditor reading this comment: I deliberately put spaces around the hyphens so they don't become em dashes, because AI has nearly ruined them for us writers. Though I still love 'em and will some day return to them. In the meantime, next best thing - hyphens!
And a side note, I actually like your writing. Strikethroughs for the win!
As for the lack of spaces around ellipses ... there you go. But I've RARELY used those.
Okay...I've never used those before. :P They tend to give a much more dramatic pause than I'm going for.
But I might try out some future use cases for the sake of... English.
Or is it the sake of ... English?
Which do you prefer? <-- Uh-oh, closing question. Gotta be AI!
Gringe8@reddit
You spent half the post explaining you didn't use AI to write your post when it didn't even sound like AI to begin with lol
GrungeWerX@reddit (OP)
Well, I guess I appreciate your comment about the organization aspect - most people don't do that - but...that's all me. Been doing that for years. It's not hard to separate ideas into chunks. Like I said, this is old hat for us non-gen z'ers.
Commando501@reddit
I could instantly tell this wasn't written by AI. Guess I'm not an idiot 😎🫡
Equivalent-Repair488@reddit
I find the easiest tell is the engagement farming question at the end of usual bot posts. The fact that the end was a long ass crashout made me extra confident it was done by an annoyed human lol.
GrungeWerX@reddit (OP)
Love you.
GrungeWerX@reddit (OP)
You’re not.
It’s just bots deployed to derail conversations.
Reasonable-Two-4871@reddit
Try Gemma 4 MOE
GrungeWerX@reddit (OP)
I did. It's poor quality compared to 31B. Not on the same level and TONS of slop.
moritzchow@reddit
Maybe use the MOE as draft model for both spec prefill and decoding tasks? Then you have speed + intelligence
GrungeWerX@reddit (OP)
Which model for draft? I was reading about spec decoding earlier. Someone said it's only a few tokens faster for writing, bigger gains for coding. I'm not coding w/Gemma 4 yet, just need an analyst...
Top-Rub-4670@reddit
26b is about 5x faster than 31b for me, it's very significant. That person must have had issues in their setup (maybe RAM offloading?). I can't speak to the quality when it comes to writing, though.
GrungeWerX@reddit (OP)
That statement was referring to using it as a draft model for speculative decoding.
moritzchow@reddit
For spec prefill and spec decoding to work, you need the draft model to share the same vocab, and to be smart while small enough to gain speed. For example, if you use full BF16 Gemma-4 31B, since each token processed requires the full 31B active, you could probably gain speed by using, say, Gemma-4 26B MOE at 8-bit (Q8) quantization as the draft model; they share the same vocab, the same context window, and the same Gemma lineage, so it may work well. (You may need to do some testing though.)
Just on the prefill side, I did see a gain on Nemotron Super using Nemotron Nano as draft; prefill speed jumped from roughly 550 up to around 680 (yeah, slow M3 Ultra Apple Silicon for prefill work, so every token gained counts). YMMV, but considering the architecture is the same, if you have some spare VRAM it’s worth experimenting.
ChocomelP@reddit
I am very curious about what this means.
GrungeWerX@reddit (OP)
For example, sentences like:
"It's not just x, it's y."
gearcontrol@reddit
Was thinking turned on when you tested 26B? I found it very capable with thinking. Your Gemma 4 31B and Qwen 3.5 27B comparison is exactly what I found as well. But I like Gemma's personality, the same with Gemma 3, so I use Gemma 4 26B as a daily driver because it's fast (4B active) and Qwen 3.5 27B for tool work and coding.
Setup: i9 11900K - RTX 3090 - 64GB
Models: qwen3.5-27b-claude-4.6-opus-reasoning-distilled Q4_K_M | gemma-4-26b-a4b Q4_K_M
GrungeWerX@reddit (OP)
I did, but wasn't impressed. It probably has other use cases that I'm not needing it for at the moment. It's too slop heavy for my tastes. So many "It's not just this, it's that." Makes me want to scream. It's definitely smarter than Gemma 3 27B, so worth the upgrade for many, but the slop feels as bad.
llitz@reddit
Unfortunately, IMO, for MOE you need to keep the tasks being executed smaller. You can still achieve the same result, just gotta break it down a lot
Now if you are asking a specific question and it is missing that's a different thing.
In my case, I wouldn't ask it to "think" about multiple aspects in the same question. That said, I just can't really follow my own advice, that's why I also run the 27b!
GrungeWerX@reddit (OP)
lol!
But that’s my actual use case. It needs to reason about multiple things over very long context.
llitz@reddit
You can get it to reason about it, but it usually needs to break the task down into a few steps over the reasoning. My experience with super long reasoning using MOE is that it tends to have incorrect bias; something like "we will reason over this, get back to me on every step" sort of works.
I think tool calling might work too since it is a new call between each step and it would load a different expert.
You could try creating a tool that... does nothing, and ask it to use the tool every few reasoning steps; it might provide better accuracy in your tests.
EstarriolOfTheEast@reddit
Typical sampling reduces slop in most models and boosts creative outputs/interesting insights (exceptions being post-training over-tuned to the point of not representing uncertainty well anymore). Worth a try, both 26B and 31B might benefit for your usecase.
Sampling approaches that attempt to preserve diversity are even better.
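For reference, "typical sampling" here usually means locally typical sampling (exposed in llama.cpp as the `--typical` parameter): keep the tokens whose surprisal is closest to the distribution's entropy, which trims both the over-predictable slop token and the incoherent tail. A rough sketch of the idea, assuming you already have next-token probabilities:

```python
import math
import random

def typical_sample(probs, typical_p=0.9, rng=random):
    """Locally typical sampling over a probability list.
    Keeps tokens whose surprisal (-log p) is nearest the entropy,
    until their cumulative mass reaches typical_p, then samples
    from that set."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Rank tokens by how far their surprisal sits from the entropy
    ranked = sorted((i for i, p in enumerate(probs) if p > 0),
                    key=lambda i: abs(-math.log(probs[i]) - entropy))
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= typical_p:
            break
    # Renormalize over the kept set and sample
    total = sum(probs[i] for i in kept)
    r = rng.random() * total
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

Note how a very peaked distribution can end up *excluding* the argmax token when its surprisal sits far below the entropy; that exclusion is exactly the anti-slop effect being described.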
gearcontrol@reddit
It occurred to me that it's similar to the way I feel about the cloud LLMs. In this matchup I feel the same about Gemma-4 as I do ChatGPT-5.3 and the same about Qwen-3.5-27B as Claude-Opus-4.6. Maybe the best option is to have them check each other, like I do with the cloud models, though it would be more of a pain, as I can only run one at a time.
AvidCyclist250@reddit
nah
tavirabon@reddit
IMO both Gemma 26B and Qwen 35B are not worth using if you can run at least Qwen 27B.
stddealer@reddit
Depends how much you value time. I can run the 31B at a usable 15t/s or 27B at 17t/s, but I'd much rather run a MoE at ~60 t/s for tasks that require a lot of tokens.
GrungeWerX@reddit (OP)
That time shrinks when you consider that you might have to re-roll the MOE 2-3 more times to get a satisfactory response, where the 27B/31B might one shot it.
In fact, that's the main reason I don't use them. I kept having to respin the wheel because it was missing too many things, loaded with slop, or just not really as thorough as the dense models.
Speed isn't everything. I'm more interested in quality.
tavirabon@reddit
On a 3090, 31B gives me 20 or 30 t/s depending on the quant. Qwen 35B gives me 80, but it uses 4x the tokens for the same task. 31B in iq4 with no thinking gives better outputs in effectively the same or less time. And as 35B fills its context faster, it also struggles more with details in the middle. I haven't identified any advantage for Qwen 35B on my end.
GrungeWerX@reddit (OP)
100% agree.
boutell@reddit
Thank you. And if anybody wants to know who's to blame for all the emdashes in AI? It's me. I did it. LOL
GrungeWerX@reddit (OP)
Me too.
I love em dashes. Just not the way AI writes with them. They can be used so much more effectively.
Dr_Bankert@reddit
With Gemma 4, I found quantizing the KV cache helped a lot with speed. Its default behavior seems to be ultra-high-precision context, which causes it to take forever to work through it.
GrungeWerX@reddit (OP)
I don’t want to lose any of that intelligence over long context, which supposedly happens with kv cache quantization.
Awwtifishal@reddit
With a recent llama.cpp, gemma 4 with the KV cache at Q8 is near lossless thanks to attention rotation (enabled by default; it's part of how turboquant works).
GrungeWerX@reddit (OP)
That's what I'm currently using.
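For reference, KV-cache quantization in llama.cpp is set per cache via the `--cache-type-k` / `--cache-type-v` flags. A minimal sketch of a server launch; the model filename and context size here are assumptions, and flag spellings can vary between llama.cpp builds (e.g. `-fa` vs `--flash-attn`):

```shell
# Sketch: launch llama-server with an 8-bit quantized KV cache.
# Quantizing the V cache generally requires flash attention to be enabled.
./llama-server \
  -m gemma-4-31b-UD-Q4_K_XL.gguf \
  -c 65536 \
  --flash-attn \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```

At long contexts the KV cache can be several GB in f16, so Q8 roughly halves that memory and can noticeably change how much of the model fits on a 24GB card.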
Ok-Ad-8976@reddit
Yeah. I’m sick of these AI accusations; they add nothing to the discussion.
GrungeWerX@reddit (OP)
I’m starting to realize that it’s just bots deployed to derail conversations. Bots are easy to make and flamers get off on this type of $#%¥.
TheThoccnessMonster@reddit
Cypress Hill Voice: SCRIPTS FROM THE DOME
GrungeWerX@reddit (OP)
^___^
draconic_tongue@reddit
Okay but would you say you stole the literature you've read growing up and repackaged it when you've gained your own voice?
GrungeWerX@reddit (OP)
No. Because I don’t write like the literature I’ve read. I’m not into derivative works.
florinandrei@reddit
A shitload of authoritatively-sounding text, and you only get a few Tok/sec out of Gemma on a 3090?
Buddy, you're a poser.
GrungeWerX@reddit (OP)
Share your settings, same card. Be sure to match context in your sampling and give me your speeds. Share a screenshot. Then we’ll see who the poser is.
I’m game for trying out new settings. So either drop some or stfu.
boutell@reddit
Constructive: "hey, are you using the latest head build of llama.cpp? They fixed some Gemma stuff"
Non-constructive: this ⬆️
fyvehell@reddit
"My two cents"
Writes entire post with AI and then post processes it to look less like AI
boutell@reddit
Would ya let it go? We can't tell for sure and it really doesn't matter. Judge the post on its merits, not the fact that it was written by someone of a certain demographic (or maybe with an llm, but so what?)
GrungeWerX@reddit (OP)
It’s a bot. Bots can’t see the examples of past writings I shared from before AI that instantly dispels their claims.
GrungeWerX@reddit (OP)
Be gone bot.
Diacred@reddit
It really didn't feel AI written, I write like that as well lmao
CryptoSpecialAgent@reddit
Well, you’re comparing a Q6 with a Q4, so it’s not entirely a fair comparison. Gemma 4 at full precision is an entirely different beast from the Gemma 4 quants, even if the unsloth marketing literature implies otherwise. Last night I spent $5 to rent an H100 and tested the 31b in fp16, and the subjective differences between it and the Q4 on huggingchat (or the Q3 UD on my MacBook Pro) are far greater than what the unsloth data makes them seem.
31b in full precision actually does feel like a frontier grade model, and I now understand why it has a higher ELO score than Claude sonnet 4.5, gpt 5.2, etc on lmarena…. It falls short of opus 4.6 obvs but keep in mind that for day to day tasks not involving coding, sonnet 4.5 is more than enough.
Whereas any of the 31b quants I’ve run locally show some promise but are lacking a certain coherence especially over longer contexts…
GrungeWerX@reddit (OP)
I didn’t compare the q6 against the q4. Just mentioned it as one of the quants I use. I compared the q5 against the q4 because I can’t fit the q5 in my vram; the q4 already crawls at a snail’s pace.
That said, the point of the post is running it in 24 gb vram. I obviously can’t run these models at full precision. I’ve heard that Qwen at full precision is significantly better as well.
jojorne@reddit
Qwen: temp 0.3 top p 0.9 top k 20 min p 0.1
Gemma: temp 1.5 top p 0.9 top k 20 min p 0.05
so i'm wondering what yours are, because Qwen wasn't that impressive. it failed to follow the prompt. now, i'm using both the MoE versions. while the dense ones are better of course, i don't find the MoEs that bad, and the speed is nice too. i prefer Qwen for coding and Gemma for stories.
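Those samplers compose roughly like this. A toy numpy sketch of how temperature, top-k, top-p, and min-p each prune the token distribution; the exact filtering order differs between inference engines, and the function name here is made up for illustration:

```python
import numpy as np

def sample_filter(logits, temp=0.3, top_k=20, top_p=0.9, min_p=0.1):
    """Sketch of temp/top-k/top-p/min-p filtering over one logit vector."""
    # Temperature-scaled softmax (lower temp sharpens the distribution).
    z = logits / temp
    probs = np.exp(z - np.max(z))
    probs /= probs.sum()
    keep = np.zeros_like(probs, dtype=bool)
    # top-k: keep only the k most likely tokens.
    keep[np.argsort(probs)[-top_k:]] = True
    # min-p: drop tokens below min_p * (probability of the top token).
    keep &= probs >= min_p * probs.max()
    # top-p: keep the smallest set whose cumulative mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    nucleus = np.zeros_like(keep)
    nucleus[order[: np.searchsorted(cum, top_p) + 1]] = True
    keep &= nucleus
    # Zero out everything filtered, then renormalize.
    probs = np.where(keep, probs, 0.0)
    return probs / probs.sum()
```

This is why the two recommended configs feel so different: Qwen's temp 0.3 + min_p 0.1 collapses onto a handful of candidates, while Gemma's temp 1.5 + min_p 0.05 keeps the distribution much flatter.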
ROS_SDN@reddit
Use their recommended sampling parameters for qwen if you haven't and reevaluate.
The difference between the coding sampling and general sampling parameters is night and day when I need to switch between the tasks.
GrungeWerX@reddit (OP)
THIS.
Fault23@reddit
wait for qwen 3.6 27B drop then
GrungeWerX@reddit (OP)
I’m so looking forward to it.
Fyksss@reddit
this is a very consistent, high quality post. don't be fooled by the 'stupid' comments :D
GrungeWerX@reddit (OP)
Thanks!
trusty20@reddit
Observe two hidden profile accounts glazing each other
GrungeWerX@reddit (OP)
Be gone, bot.
seppe0815@reddit
jesus maria, these a.i generated text posts here make me mad
GrungeWerX@reddit (OP)
Seriously dude? What about this looks AI generated? I write the same way I've written for 20 years. I suppose you'd think my old writings were AI written as well.
Oh, and by the way. AI was trained on OUR work. Just saying...
FluoroquinolonesKill@reddit
I am really frustrated at my good writing - and anyone’s good writing - now being mistaken for AI because people are too dumb to tell the difference.
GrungeWerX@reddit (OP)
Tell me about it. Probably easier to just set up a web page with your documented pre-chat-gpt writing style, then drop a link whenever these boneheads start making idiotic accusations. If they're still in denial, they've just made the collective audience realize they're neanderthals.
finevelyn@reddit
I don't think anyone cares about proof, when there are no stakes. People either like your writing or they don't, and no amount of evidence is going to change it.
The AI accusations stem from how many actual AI posts there are, and they are getting annoying too. People are only going to get more critical of text that looks similar to AI writing as time goes on, and it's not without merit. They are not boneheads in general.
SpiritualWindow3855@reddit
This seems like at most, formatted by AI text.
GrungeWerX@reddit (OP)
Neither. I'm a writer. It's not hard, man.
eidrag@reddit
cool, writer. you also mentioned dA journals. currently what's your use case for llm?
GrungeWerX@reddit (OP)
Previously just for coding, now also as a story analyst. For example, I'll feed it a large story bible and ask it to analyze certain themes, concepts, characters, relationships, combat systems, etc. and give its thoughts or feedback. Very useful to have a second brain around to notice things you missed, or make connections when you're not thinking about them, especially when the lore is dense and it's easy to overlook things.
Think of it like a talking encyclopedia.
SpiritualWindow3855@reddit
I don't know who this comment is for: I said AI didn't write this (primarily because AI would keep it much tighter), and at most it got a pass for formatting.
I thought writers needed to be good at reading too.
GrungeWerX@reddit (OP)
I was addressing both of you at the same time. His accusation of me using AI, your "at best formatted by AI text" comment. Point was, "neither" needs AI to get done.
I can read fine, but our context-tracking could use a little work. ;)
Jabs aside, I didn't take your comment as an attack. Just didn't feel like making two posts. Not my intent to direct all that at you.
SpiritualWindow3855@reddit
There's a reply button under their comment.
I read your reply as annoyingly patronizing since it leads with asking me about my age then explaining basic formatting in nauseating detail, not in good sport.
traveddit@reddit
There are grammatical errors and uncommon idiomatic uses of punctuation that make it pretty easy to tell it was written by a human. Clearly you're not apt enough at reading English to tell anything apart.
rkd_me@reddit
96GB? Did you try 122b-q4?
Photochromism@reddit
The MoE versions of Qwen 3.5 and Gemma 4 are also great.
100% agree that these models are a huge step forward for large context awareness. Loving writing with both of them. I can’t decide between the dense models and the MoE versions. The MoEs handle large context well, but the dense models feel more intelligent.
ArtifartX@reddit
Are you offloading a fair amount of the model to system RAM? Because if not, you'd barely fit the models you listed on the card with a tiny context window. If you wanted 10k+ context and the entire model on the GPU, you'd be looking more at Gemma 4 31B Q4XS or Q3 UD, and Qwen 3.5 27B Q4s.
joao_brito@reddit
For me the biggest difference I'm noticing between gemma and qwen is that the gemma 4 model has a lot of world knowledge baked in; it can usually answer many of my queries without any search, which keeps its token output way lower than the qwen models'. On the other hand, most of the errors I get from gemma are on stuff it would probably answer correctly if it used the search tool, but it tends to avoid those tool calls unless necessary.
My current workflow is usually to try gemma 4 first; if I get some fishy results I try again with qwen 397b and it gets it right.
Potential-Leg-639@reddit
For agentic coding and context you need at least 2x3090 (200k context and Q4 cache fit well with the Q5 model).
SkyFeistyLlama8@reddit
How do Gemma 4 31B and Qwen 3.5 27B compare to good old Mistral Small 24B or Devstral 24B? I still use Mistral for creative writing because nothing else has the same flair. Gemma 3 27B was good but kept falling into LLM tropes.
I rarely use the Gemma 31B or Qwen 27B because they're really slow, being dense models; Gemma 26B and Qwen 35B MOEs get close to their smarts while being so much faster.
As for writing like an AI, yeah I feel you there. AI model makers slurped up decades' worth of Reddit and Usenet posts for training; if you've been around on the net since the days of SLIP and telnet, you would've picked up certain quirks and styles of writing as part of that online zeitgeist. And you would probably sound like an LLM.
No LLMs were harmed in the making of this post.
LuckyGhoul@reddit
Is anyone getting frequent full cache wipes due to SWA? Maybe I should download a newer version, but this is the only thing that I find frustrating about Gemma 4 32b.
discostupid@reddit
Can I suggest you try nemotron a3b polarquant?
https://huggingface.co/caiovicentino1/Nemotron-Cascade-2-30B-A3B-PolarQuant-Q5
Digitalzuzel@reddit
What is that? 🤦♂️
McSendo@reddit
mfer used gemma4 e2b to write this
Digitalzuzel@reddit
he edited the post and that's hilarious. Instead of addressing contradictions he just used another LLM to rant on us 😄
Artpocket@reddit
Bot account.
trusty20@reddit
Yep, he and the glazer accounts are hidden profile accounts, it's like the calling card of bots because it's so obvious when you see their comment history.
GrungeWerX@reddit (OP)
Yeah, it's getting bad on reddit lately.
Artpocket@reddit
Dude's talking about people accusing him. Not your dumb ass post.
GrungeWerX@reddit (OP)
Used another llm? Are you an idiot? Did you not see me post up samples of my work - the same writing style as the post - going back 20 years ago?
You're really showing your IQ man.
GrungeWerX@reddit (OP)
Go bot elsewhere.
Artpocket@reddit
Bot account disregard.
danieltkessler@reddit
Hallucinating ≠ missing or excluding information in outputs. Hallucinating = providing information that didn't exist in / isn't faithful to the source text or instructions.
CentralLimit@reddit
I don’t think you understand the difference between precision and recall.
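The precision/recall framing maps cleanly onto the thread's distinction. A toy sketch; representing "facts" as string sets is purely hypothetical, just to make the two failure modes concrete:

```python
def precision_recall(stated_facts, source_facts):
    """Hallucination ~ low precision: stating facts not in the source.
    Missing details ~ low recall: never stating facts the source contains.
    These are independent axes, which is why a model can hallucinate less
    while still missing important details."""
    stated, source = set(stated_facts), set(source_facts)
    true_positives = stated & source
    precision = len(true_positives) / len(stated) if stated else 1.0
    recall = len(true_positives) / len(source) if source else 1.0
    return precision, recall
```

E.g. a model that states three facts, one of them invented, about a source containing four facts has precision 2/3 (one hallucination) and recall 1/2 (two missed details).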
MrHaxx1@reddit
How doesn't that make sense?
Hallucinating and missing things are two entirely different things.
Digitalzuzel@reddit
How often do you reach a correct conclusion despite missing very important details?
keyboardmonkewith@reddit
Often enough when you use kv-cache compression.
Deep90@reddit
Trying to quote exactly what your friend said last week is going to lead to you "hallucinating" words vs just paraphrasing what you talked about.
GrungeWerX@reddit (OP)
Any time they aren't directly related.
Digitalzuzel@reddit
then those details aren't "very important", right?
GrungeWerX@reddit (OP)
At first, I thought you were an idiot, but then I realized you're just a bot.
GrungeWerX@reddit (OP)
Bro, seriously...wtf are you on about? You're not making sense.
The post was very simple to understand. Sometimes Gemma 4 hallucinates less than Qwen 3.5. Other times it misses very important details.
Example 1:
Example 2:
If you still can't understand this, then ask AI for help bro, because I just don't know, man.
Artpocket@reddit
Bro, you're a dumb ass trying to sound smart.
MrHaxx1@reddit
Rarely, but if I'm wrong, I'm not necessarily hallucinating.
SocietyTomorrow@reddit
Qwopus3.5-27b-v3 gets a decent bit closer to Gemma in terms of keeping it together at high context. Hallucinations are still an issue, but I run into them more on simple tasks than hard work. It's like it gets bored or something and makes crap up.
Digitalzuzel@reddit
Thank you for sharing your experience. Crazy that you've got downvoted (but not surprising as reddit attracts unstable people).
GrungeWerX@reddit (OP)
It can miss important details. Not always. But yes, it seems to hallucinate less.
Thrumpwart@reddit
Nice post. I’ve been doing similar testing.
Last night I discovered byteshape quants - there aren’t many, but the byteshape Qwen3.5 35B iQ4_XS 4.06bpw gguf did remarkably well in my testing, and was faaasst. I’d take a look at it.
danieltkessler@reddit
I have a lot of respect for this entire post. Appreciate it!
UnreasonableEconomy@reddit
Gemma 4 31B Q8 is, in my opinion, mindblowingly good. Possibly even a chatgpt (instant) contender. Although to be fair, OAI really dropped the ball with 5.3.
I'm a bit of an AI grouch in general but here I have to admit that we've had some real progress in the past year. Dense models are the way to go, and this one's an absolute winner.
As you mentioned, it's surprisingly perceptive.
I'll be honest it would have never occurred to me that dense could ever be a compliment lol.
huyanb999@reddit
Great comparison! I've been using Qwen 3.5 27B as my daily driver too. The long context handling is really impressive for a 24GB card setup.
inthesearchof@reddit
Buy one more 3090 and have both loaded and responding side by side. I enjoy both. Gemma's response style. Qwen's more technical. You should be getting around 30 tok/s with Gemma
Fit_Concept5220@reddit
I also wonder about specific pipelines where both (or N) models write, and then one model works on the original context and produces analysis/text/code to compile an aggregate solution. Why choose one if you can aggregate the best of both worlds?
AvidCyclist250@reddit
gemma gets more STEM details right than qwen. the really tricky shit.
PromptInjection_@reddit
I prefer Gemini for a simple reason:
Performance degrades much less with very long contexts.
kourtnie@reddit
My heart aches with how many times you’ve been burned by “did an AI write this?” — enough that you preemptively braced for it.
Happens to me a lot, too.
Thank you for the Gemma and Qwen analysis.
Anxious_Potential874@reddit
i tried coding tasks with the smaller models, qwen 2b and gemma 4 e2b, and gemma 4 gave better results.
i just had one odd result with it, so i can't completely trust it yet, but i intend to use it as my primary model by improving prompts and adding some pre-processing.
i'm also using the older unsloth model and getting approx 14 tg/s on an 8GB ram apu (it has a gpu, but not a discrete one).
i'll try the newer unsloth model today to see if that improves things.
Euphoric_Emotion5397@reddit
I used the same system prompt and the same question, which involves tool use, search, scraping, and a final analysis of the impact on the stock market.
Then I fed the outputs to Gemini Pro.
Gemini rated Qwen 3.5 35b A3B q4 the best, Qwen 3.5 27b second, and Gemma 4 31B last, saying its analysis was surface level.
So the findings are based on my system prompt, which might have been tuned to Qwen since I've been working with Qwen all this time.
GrungeWerX@reddit (OP)
Yeah, it's pretty well known Qwen is better for tool use.
Spiritual_Willow5868@reddit
Are they any good for tool use?
GrungeWerX@reddit (OP)
Qwen is, I haven't tested Gemma for tool use yet as I heard it had issues they were working out. I'll be testing it in the future.
exact_constraint@reddit
Still pretty far in the Qwen2.7 camp. For stuff where generating English text is important (eg, auto generated Flux.2 prompts), I’ll load Gemma 4 31b. But for OpenCode? Qwen2.7 all the way. It’s still early days w/ llama.cpp weirdness, but Qwen has been much more reliable.