Ignoring benchmarks, how do the newest local models (gemma 4 31B, 26BA4B, Qwen 3.6) “feel” to you? What do you think they compare to?

Posted by opoot_@reddit | LocalLLaMA | 34 comments

I use local AI mainly for creative writing, and I feel like benchmarks are a bit iffy on that. I’d mainly like to compare Gemma to Gemini, as I like their writing the best. I do know that Qwen 3.6 is amazing, but mostly for coding and agentic work.

I’d like to ask everyone how the new(er?) models feel to you personally, rather than looking at benchmarks, which they are likely optimised for.

For me, I feel like Gemma 4 31B (even q4) still falls short of 2.5 Pro. I’m most familiar with 2.5 Pro since I used so much of it for free on AI Studio when it was a preview.

The style and prose are there, but at long context it still misremembers minor details.

I think it’s actually better than 4.5, but that could be personal preference since, again, I mostly do only creative writing.

Disposable110@reddit

Gemini 2.5 Pro -> Gemma 4.

lordekeen@reddit

Gemma 4 is good for non-coding tasks like summarization, information retrieval, web search, etc.

arbv@reddit

It is not bad for coding, either. Much less sloppy than the unstoppable Qwen at the moments when it shits the bed (e.g. by failing edits and then ignoring instructions not to overwrite existing files).

nick_ziv@reddit

Coming from Claude, I started using qwen 3.6 35b a3b on 3090s at around 130 t/s. I use it for coding. I find that it is much more aligned with what I would expect it to do than Claude. It writes less bloated code. It asks better questions to disambiguate intention. And it is much less verbose (most of the time; it still gets into drawn-out "wait but..." loops, but much less than Claude).

For long context, I have just learned to stop relying on these models to remember something from 100k+ tokens ago and instead use the context like a rolling window. It always remembers that parts of the code base exist, and it goes and finds them. I would not expect it to remember a specific rule I set, especially a long time ago. I've always been against using Claude.md and other "skills" to modify their behavior because it does not seem to stick well.

I have to say that this Qwen seems to be smarter and more efficient to work with than Sonnet 4.5. Also, it stays much more consistent. I worked through many Claude model releases where it felt like they lobotomized the model before the next one was released. No more of that.
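
The rolling-window habit described above can be sketched roughly as follows. This is only an illustration, not how OpenCode or any particular tool manages context: the `rolling_window` and `estimate_tokens` helpers, the message format, and the 4-characters-per-token estimate are all assumptions made for the example.

```python
from typing import Dict, List

Message = Dict[str, str]  # hypothetical shape: {"role": ..., "content": ...}


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token.
    return max(1, len(text) // 4)


def rolling_window(messages: List[Message], budget: int) -> List[Message]:
    """Keep system messages plus the newest turns that fit in `budget` tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    used = sum(estimate_tokens(m["content"]) for m in system)
    kept: List[Message] = []

    # Walk from newest to oldest and stop at the first turn that does not fit.
    for msg in reversed(rest):
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost

    # Restore chronological order before returning.
    return system + list(reversed(kept))
```

Older turns simply fall out of the window, so anything the model needs from them (a file, a rule) has to be looked up again rather than remembered.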

Present-Ad-8531@reddit

What tool do you use with the model?

nick_ziv@reddit

OpenCode.

Present-Ad-8531@reddit

Anything else you can tell?

Which quantisation? What temp?

What other hyperparams?

nick_ziv@reddit

Using qwen 3.6 35b a3b unsloth q6. No MTP (found the prefill loss not worth it), no changes to the default temp etc. I am using spec-type ngram-mod. I have found ngram speeds up generation for coding tasks. I keep my context to 180k, not for memory reasons but because rebuilding the cache takes too long (for some reason OpenCode causes cache rebuilds when I send a message after the model does a long turn).
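
For readers wondering why n-gram speculation helps with code, here is a toy sketch of the general idea: look for an earlier occurrence of the last few tokens and propose whatever followed it as a draft, which the main model then verifies in one pass. It is not the implementation behind the `ngram-mod` setting mentioned above; the `ngram_draft` function and its parameters are hypothetical.

```python
from typing import List, Sequence


def ngram_draft(tokens: Sequence[int], n: int = 3, max_draft: int = 8) -> List[int]:
    """Propose draft tokens by copying what followed the most recent
    earlier occurrence of the last `n` tokens."""
    if len(tokens) <= n:
        return []

    tail = list(tokens[-n:])

    # Search backwards so the most recent match wins; skip the tail itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if list(tokens[start:start + n]) == tail:
            return list(tokens[start + n:start + n + max_draft])

    return []  # no repeat found, so nothing to speculate on
```

Code repeats identifiers and boilerplate far more than prose does, so drafts like these tend to be accepted more often in coding sessions, which would fit the speedup reported here.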

Eyelbee@reddit

Qwen 3.6 27b => Sonnet 4.5

riconec@reddit

I feel like it is good but very dependent on parameters like temp, top_k, etc. What do you use on your setup? I find Qwen's default parameters prone to loops.

Eyelbee@reddit

Smaller Qwen 3.5 models loop, but I never had any looping with 27B models. MoE also sometimes loops.

iMakeSense@reddit

At what quant?

onlyrealcuzzo@reddit

If you use them for what they can do... they are quite good at it.

Gemma 4 26B-A4B is quite good at fixing localized bugs for example.

If you try to use it to design a feature for the Linux Kernel, you're going to have a bad time.

SAPPHIR3ROS3@reddit

Having used both Qwen and Gemma, I can say that Qwen is a monster. To be honest, it’s impressive, and with the right setup it CAN compete with much bigger models. At q4 it’s a bit rough, though: it can and will loop. That doesn’t seem to be the case with q6 (I still have to try q5). It can be a good choice for coding, research, and general and complex tasks.

Gemma, on the other hand, is not as consistent (q4), but it will get the job done when it comes to OCR and translation (way better than Qwen), and its writing in general can be better. It shows its limitations on complex tasks, though, and for general tasks the results vary depending on the specific task.

ea_man@reddit

I feel like if you use a model at Q4 with a quantized KV cache, you are supposed to tune temperature and sampling down a bit and reduce the max ctx length you work with.

You have dumbed down the model by quantization, you should give it less free rein.

SAPPHIR3ROS3@reddit

I usually go with .1/.2 temp and .95 sampling. Should I go lower? Besides, I played with sampling in the past and haven’t seen any meaningful difference. Yeah, it can be good for some cases, but meh.

ea_man@reddit

Well no, I wouldn't say so. With an IQ4 I would use temp 0.6 with reasoning on; you gotta leave it some freedom to reason.

If you disable reasoning I would go down to 0.3 temp. That would do for agent work such as EDIT / DIFF, with little sampling and --presence-penalty 0.0
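
To make the temperature numbers in this exchange concrete, here is a small sketch of how temperature and nucleus sampling reshape a next-token distribution. It assumes the ".95 sampling" above refers to top-p, which the thread does not confirm; the function names and the three-token example are made up for illustration and are not taken from any inference engine.

```python
import math
from typing import Dict


def apply_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Softmax over logits divided by `temperature` (must be > 0)."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the most likely tokens until their mass reaches `top_p`, then renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for tok, p in ranked:
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


# Hypothetical logits for three candidate tokens.
logits = {"a": 2.0, "b": 1.0, "c": 0.0}
reasoning_on = apply_temperature(logits, 0.6)   # more room to explore
reasoning_off = apply_temperature(logits, 0.3)  # sharper, closer to greedy
```

Lower temperature concentrates probability on the top token, which is why 0.3 suits mechanical edit/diff work, while 0.6 leaves some spread for reasoning.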

Life-Screen-9923@reddit

Gemma4 q5 is better for coding (one shot) than Qwen 3.6 q5, in my experience

Gemma4 26b writes code faster in terms of total time, and the code actually delivers the required outcome rather than just being executable

Both models write code without syntax errors, but Qwen3.6 35b’s code ends up being broken, and it does a poor job of correcting its own errors

Gemma4 26b produces actually working code and normally fixes bugs in a single shot

PrinceOfLeon@reddit

Most of the answers are talking about coding and agentic tool calling because those things are measurable.

Your own personal "feels" regarding "creative writing" are entirely subjective, and the only comparison that will be relevant to you is going to be based on your own experience and use.

It's pointless asking how other people's pure opinions will align with yours. Install the models, feed them some of the same prompts you've used in the past, and decide how you feel about the results.

Jipok_@reddit

Gemma feels like a somewhat lazier, more hyperfocus-prone version of Gemini 2.5

GCoderDCoder@reddit

Benchmarks and my initial tests made me think gemma4 31b is a better coder than qwen 3.6 27b, but in real usage gemma4 reminds me it's made by the same company as gemini, which...

Kahvana@reddit

Surreal.

Coming from Mistral Nemo 12B, then Mistral Small 3.2 / Magistral Small 2509, and Gemma3-27B, both Qwen3.6-27B and Gemma4-31B are a genuine leap forward, and that's just in a year's time.

Makes me excited for what future local models can do!

With Qwen3.6-27B I got a decent programming helper that can actually help with my .NET 8 projects and is decent for home automation / tool calling. I use Gemma4-31B for everything else (it roleplays at the same level as DeepSeek v3.2, a huge compliment).

For both Qwen3.6-27B and Gemma4-31B, I notice a huge difference between Q4_K_L and Q6_K_L in capabilities. I limit the former to 128K context (BF16) and the latter to 32K context (Q8_0).
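
The context/cache trade-off above comes down to simple arithmetic, sketched here. It assumes "BF16" and "Q8_0" refer to the KV cache type, that every layer uses full attention (real models may mix in sliding-window or other layer types, which shrinks the cache), and it uses made-up dimensions (`n_layers=48`, `n_kv_heads=8`, `head_dim=128`) that are not the actual Qwen3.6 or Gemma4 configurations.

```python
# Bytes per stored element. Q8_0 packs 32 values into 34 bytes
# (32 int8 values plus one fp16 scale per block).
BYTES_PER_ELEMENT = {
    "bf16": 2.0,
    "q8_0": 34 / 32,
}


def kv_cache_gib(n_ctx: int, cache_type: str,
                 n_layers: int = 48,    # hypothetical
                 n_kv_heads: int = 8,   # hypothetical
                 head_dim: int = 128    # hypothetical
                 ) -> float:
    """Estimate KV cache size in GiB for a full-attention transformer."""
    # Keys and values are both cached, hence the factor of 2.
    elements_per_token = 2 * n_layers * n_kv_heads * head_dim
    total_bytes = elements_per_token * n_ctx * BYTES_PER_ELEMENT[cache_type]
    return total_bytes / 2**30


long_bf16 = kv_cache_gib(128 * 1024, "bf16")  # 128K context, BF16 cache
short_q8 = kv_cache_gib(32 * 1024, "q8_0")    # 32K context, Q8_0 cache
```

Under these assumed dimensions the cache grows linearly with context length, so a larger weight quant leaves less VRAM for cache, which would explain pairing the bigger quant with a shorter, quantized cache.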

Man, I really want more VRAM...

migsperez@reddit

Blast, I'm the same. I want q8 with 128k context. Planning my next GPU already, and I only received the first one last week.

f5alcon@reddit

What does "even Q4" mean? Use Q6 or Q8 and see how it feels.

Similar-Ad5933@reddit

Qwen3.6-27B > Opus 4.6, kind of. Opus has had some weird problems lately. I use models mostly for coding and writing docs. My experience has been that Qwen3.6-27B is on par with Sonnet 4.6, though Sonnet is a little bit better with styles. At work I use those big models; if local models were allowed, I would use Qwen for almost everything.

Qxz3@reddit

I suppose the dense models are good but I'm too GPU poor to run anything but the MoEs. The MoE models are competent enough to churn out some code on their own, agentic-style, although they struggle doing precise in-place edits and they're just not that knowledgeable. They are certainly nowhere near frontier models in terms of knowledge, competence and autonomy.

mkMoSs@reddit

For its tier, Qwen 3.6 27b is top, nothing else comes even close. I use it for coding assistance.
Tried gemma 4 for a bit, it hallucinates a lot, very disappointing results.

a_beautiful_rhind@reddit

Gemma4 was a breath of fresh air compared to the ones before. It still suffers from modern LLM issues but not as much. You can't expect it to be 2.5 pro.. that model was huge.

Qwen is still qwen. Follows instructions, knows STEM. Kinda dry.

jacek2023@reddit

Qwen 27B is good for agentic coding, Gemma 31B is good for creative tasks, but I also use 120B models

caetydid@reddit

gemma4 reminds me of GPT o4 - stupid for coding, insanely good for anything related to natural language.

ClassicMain@reddit

Qwen 3.6 27b is actually insane

jonas-reddit@reddit

Yup. It really is a positive surprise for coding.

ravage382@reddit

I'm using the new qwen models in all my workflows. I had previously only been using gpt 120b for agentic tasks due to how well it did with tool calls.

dinerburgeryum@reddit

Both MoEs “feel” like they make too many mistakes despite their speed. Gemma 4 “feels” like a better conversationalist and writer than Qwen 3.6, but Qwen 3.6 “feels” like the better agent worker. Gemma 4 dense “feels” like it’s a little try-hard in agent work, more so than Qwen anyway.