Qwen 3.5 35b, 27b, or gemma 4 31b for everyday use?
Posted by KirkIsAliveInTelAviv@reddit | LocalLLaMA | 61 comments
I have a 5080 + 64gb of ram. What model would be as intelligent as possible while still running decent enough on my specs?
Mashic@reddit
If you need deep reasoning: Qwen 3.5 27B or Gemma 4 31B. If you need simple text extraction or manipulation tasks: Gemma 4 26B or Qwen 3.5 35B.
More_Chemistry3746@reddit
So you are saying qwen 27b is better than qwen 35b
anykeyh@reddit
Dense model vs MoE. The 35B will run faster at inference, but is worse overall, yeah.
Rim_smokey@reddit
My tests show 35b at the same quant (Q6) outperforms 27B at complex coding
GrungeWerX@reddit
Nobody believes you.
Rim_smokey@reddit
Well, the only way I found out was not by believing others but by putting it to the test. Feel free to do the same. I don't care what you do
GrungeWerX@reddit
I already have, as I’m sure the people who downvoted you have as well.
Maybe share some examples of your use case to enlighten everyone? Complex coding might be subjective…
ixdx@reddit
My tests show that 35b makes errors much more often, deleting unnecessary lines unrelated to the query, using incorrect indentation, and changing the indentation of code that is completely unrelated to the query.
In my personal experience, 27b Q4_K_L performed better than 35b Q6_K_L.
It probably depends heavily on the programming language, the size of the file being edited, and other factors.
I like 35b for its speed. Q5_K_L + mmproj fits into 32GB of VRAM with a 128k context and runs very fast – pp512 2800 / tg128 110.
jopereira@reddit
I think it really depends on how we use the model. I also use 35b because it is 6x faster than 27b, and for coding I often stick with OmniCoder 9B (QWEN 3.5 9B) because it is much faster at outputting coding results. Speed is a quality of its own, perhaps underrated.
Rim_smokey@reddit
Yep. If a model needs 2 times more attempts but is 4 times faster at performing them, then it is in many cases a better option.
jopereira@reddit
And if you drive the model closely, you'll get to know the code - a one-shot solution is not always good for code maintenance...
LeRobber@reddit
Qwen 3.5 27B is smarter than the 35B, which is faster.
Pwc9Z@reddit
I use Qwen 27B over Gemma 31B simply because I'm able to run a higher quant (Q5 over Q4) with longer context on my 3090
Dr4x_@reddit
Is the context size diff really that huge ?
Nobby_Binks@reddit
Anecdotally, I can run Qwen 27B Q8 with full context across 2 of my gpus (~46gb). Gemma4 Q8 with full context needs 3 of them (~60gb).
I guess it's a combination of an extra 4B parameters and a less efficient KV cache
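The gap is easy to sanity-check with back-of-envelope math. A rough sketch (all hyperparameters below are illustrative guesses, not the real Qwen or Gemma configs; check each model's config.json before trusting the numbers):

```python
# Back-of-envelope KV cache size for a plain transformer.
# Hyperparameters are hypothetical, picked only to show the mechanism.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """2x for K and V; fp16 by default (2 bytes per element)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Hypothetical 27B-class model: 64 layers, 8 KV heads (GQA), head_dim 128.
full_attn = kv_cache_bytes(64, 8, 128, 131072)
print(f"global attention, 128k ctx: {full_attn / 2**30:.1f} GiB")

# Sliding-window layers only keep a fixed window, not the full context.
# If 5 of every 6 layers used a 4k window (iSWA-style), the cache shrinks:
swa = (kv_cache_bytes(64 * 5 // 6, 8, 128, 4096)
       + kv_cache_bytes(64 - 64 * 5 // 6, 8, 128, 131072))
print(f"iSWA-style mix, 128k ctx:  {swa / 2**30:.1f} GiB")
```

Same idea for hybrid-recurrent layers, which keep a constant-size state instead of a per-token cache, hence the much smaller KV budget mentioned below.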
dinerburgeryum@reddit
Yeah the Qwen models are hybrid recurrent which slashes your KV budget. Gemma uses iSWA which helps but not as much. Honestly I really wanted to love Gemma4 but Qwen3.5 is a tough one to beat for local efficiency.
Gringe8@reddit
I run gemma4 31b Q8 with 131k context on 48gb vram, which is fine for me because I find models start losing coherency after the halfway point anyway. Haven't actually tried these newer models that far yet to see if that's still the case.
GrungeWerX@reddit
Yes
Pwc9Z@reddit
Yeah, Qwen Q5_K_L can do 80000 tokens (mixed BF16/Q8_0 KV cache) on my setup, while I can't push Gemma past 45000 tokens (Q8_0 KV cache), which is kind of a big deal.
ea_man@reddit
Yes.
-Ellary-@reddit
This is the correct answer.
KirkIsAliveInTelAviv@reddit (OP)
+1
jacek2023@reddit
you must try them and decide yourself, grow up
Paradigmind@reddit
Asking others for opinions is pretty grown up. Fucking things up alone because your ego is too big to ask is childish.
Long_comment_san@reddit
Not really, you're not saving time by making this post. Just download all 3 of them and try, it takes literally 1 minute to swap between them
Paradigmind@reddit
This is so funny. Humanity literally exploded in intelligence because we started to share knowledge.
I could never ever afford the time, nor the knowledge of many experienced members here who extensively test the fuck out of models.
Long_comment_san@reddit
Sounds like a skill issue, not a time issue.
Paradigmind@reddit
And even if that's so, it would be an even more valid reason to ask more experienced people for advice. I even explained to you that I also don't have the knowledge for it.
So you are either extremely ignorant, or kinda dumb, if you try to fuck with me over something I acknowledged before.
Long_comment_san@reddit
You're asking advice on literally boiling water. Use a fucking Google search or an AI chat without bothering people with shit that takes 1 minute to understand. Don't you dare call me dumb after failing to do something so basic
GrungeWerX@reddit
Qwen 27b is the smartest, Gemma 4 31B generally writes better, but has tons of slop. But it sometimes gets context better. Sometimes. But it also misses obvious context too in ways that Qwen doesn’t; employs some really bad logic at times.
I use both to cover my bases. Pound for pound, I find Qwen the most reliable, but Gemma adds some extra flavor.
Important_Quote_1180@reddit
I feel like the harness makes the biggest difference now.
shittyfellow@reddit
What do you recommend
tmvr@reddit
I don't think you have enough VRAM to run 27B in a meaningful manner, the IQ4_XS weights alone are 14 GiB, almost no space left for KV and context without spilling over to system RAM and cratering performance.
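To put numbers on that, a quick sketch (the 14 GiB weights figure is from the comment above; the overhead and per-token KV cost are rough assumptions, not measured values):

```python
# Rough check of whether a model fits in VRAM: weights + KV cache + overhead.
GIB = 2**30

vram = 16 * GIB          # e.g. a 5080-class card
weights = 14 * GIB       # IQ4_XS weights for a 27B-class dense model
overhead = 1.5 * GIB     # CUDA context, compute buffers, etc. (rough guess)

budget_for_kv = vram - weights - overhead
print(f"left for KV cache: {budget_for_kv / GIB:.1f} GiB")

# Assuming ~0.25 GiB of KV cache per 1k tokens (hypothetical figure),
# that budget buys only a couple thousand tokens before spilling to RAM:
ctx_tokens = budget_for_kv / (0.25 * GIB) * 1000
print(f"~{ctx_tokens:.0f} tokens of context")
```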
florinandrei@reddit
Is there, like, a disability that prevents you from testing them yourself?
sagiroth@reddit
There are so many use cases that you have to try it yourself. It all depends on your project. I personally stick with 27b as I am happy with the speed and quality, but for some other tasks I can't find a difference between 9b and 27b
mitchins-au@reddit
Gemma 4. Qwen3.5 is a hot mess. It burns 3-4x as many tokens on reasoning as Qwen3, and even when the chat template and params are geared for no reasoning, it'll fall back into reasoning.
I’ve had it burn 200,000 tokens repeatedly on a simple python program.
jopereira@reddit
I use the llama.cpp no reasoning switch to have no reasoning at all, just pure instruct mode.
mitchins-au@reddit
Which switch in particular? Because I've changed to that template and set the reasoning budget to zero, but it still reverts sometimes. That's aside from the fact that you shouldn't have to turn off reasoning to make a model stable
jopereira@reddit
I never experienced that (directly in the chat UI). Responses start straight away. I don't know if Cline or Kilo Code can "make it think", or rather, make it look like it's thinking. Also, different llama.cpp versions make models behave differently. I had a recent version that produced garbage, and returning to the previous version solved the problem.
rainbyte@reddit
I do the same and it works pretty nice on both 27B and 35B-A3B. I'm wondering if it would make sense to have a 2nd set of profiles with thinking available 🤔
OcelotMadness@reddit
It should be said that Gemma 4 has an absolutely massive KV Cache. Better be VRAM rich.
LeRobber@reddit
Gemma4 is better for stories, 3.5 35B for tasks, 27B or 9B for image processing with discernment
Euphoric_Emotion5397@reddit
Do a simple experiment for your use case.
Run your app through the 3 models and copy the output,
then ask the frontier models to rate the output: Gemini Pro, Claude, and GPT.
One consideration: your system prompt, temp, and LLM settings are already tuned to your app, e.g., tuned to Qwen. So the rating might be higher for your Qwen models.
So you need to retune for the new model. Your app should hold 2 profiles so you can switch between them and ask the frontier models to rate each.
rainbyte@reddit
As other guys told you here, the best way is to try them all. Just to be clear, you can keep them stored until you need one or the other.
Here I mostly use Qwen3.5-35B-A3B because it is really fast to process data on my setup, but I switch to Qwen3.5-27B as soon as a problem requires more intelligence at expense of speed. Gemma 4 models might be better for multi-language if you work with something other than English.
You can use some tool to measure t/s given that it might be different on your setup than on other people's setup.
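If you don't have a tool handy, t/s is just tokens over wall-clock time. A minimal sketch (the `fake_generate` stand-in is hypothetical so the snippet runs without a model loaded; in practice you'd call your local server instead):

```python
import time

def measure_tps(generate, prompt):
    """Time a generate(prompt) call that returns a list of tokens."""
    start = time.perf_counter()
    tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed

# Stand-in generator so the sketch is runnable without a model:
def fake_generate(prompt):
    return prompt.split() * 10

print(f"{measure_tps(fake_generate, 'hello local llama world'):.0f} t/s")
```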
UltraCoder@reddit
Wow! For the first time in many years I see other person, who uses the same avatar as me. :)
lmagusbr@reddit
What is your use case? If you need it for programming at all then it's going to be Qwen 3.5 27b, but don't expect a miracle.
Create your own evals, save some prompts of things you would normally send, download all 3 (or 4 models if you would like to try gemma 26b too) and run the prompts in all of them.
Score them from 0 to 5 or something and keep the one that scores the highest overall, or scores the highest in the most important eval.
Maybe you will discover you actually need two models for the range of tasks you need them for.
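That scoring scheme can be as simple as a dict. A sketch with made-up model names and scores (score each answer by hand, then pick the winner overall or on the eval you care about most):

```python
# Hypothetical 0-5 scores from running saved prompts through each model.
scores = {
    "qwen3.5-27b": {"coding": 4, "summaries": 3, "email": 4},
    "qwen3.5-35b": {"coding": 3, "summaries": 4, "email": 5},
    "gemma4-31b":  {"coding": 2, "summaries": 5, "email": 4},
}

best_overall = max(scores, key=lambda m: sum(scores[m].values()))
best_coding = max(scores, key=lambda m: scores[m]["coding"])
print("best overall:", best_overall)
print("best at coding:", best_coding)
```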
Gringe8@reddit
I'd say try gemma 4 26b for everyday use with 16gb vram. I remember when I was using a 4080 I struggled to run 24b models with good context. With a MoE model you can offload some and still get good speeds.
ipcoffeepot@reddit
Try them all. I found myself using qwen3.5-27b waaaay more than I expected. Would not have guessed it ahead of time
-Ellary-@reddit
It is way better at agentic usage.
For example Gemma 4 is kinda lazy at web search.
But Qwen 3.5 27b is just another level; it will not stop until it finds everything about the topic.
It loves to spit out 40kb text file reports, while Gemma 4 gives me like 8kb.
For example I was gathering info about a game from steam.
Gemma 4 just get info from steam, some review from the site, sum it and that is it.
But Qwen 3.5 got the info from Steam, then from different sites, then it found info about the developer, then that he has a wife who helps him with game development, then that she got sick, then that they broke up, then that the developer went to jail for 3 years (it tried to find out for what), and it even collected popular opinions from different discussions on Steam.
It made me a 30kb text file, saying that it had gathered crucial info.
Nobby_Binks@reddit
Qwen on your setup simply because you can fit more context for a given quant
Jeidoz@reddit
Try Q3 Qwen 3.5 35b A3B, or 27b if it manages to offload to your 16gb GPU. In most cases Qwen works fine out of the box with LM Studio + OpenCode or GitHub Copilot. With Q4 KV cache you can sometimes afford 80-160k context, which is enough for most "specific single task" sessions of agentic work.
If it won't fit with full offload, switch to A4B gemma. But use the Beta LM Studio; it includes tooling fixes for Gemma.
milkipedia@reddit
So far it's Qwen 27b Q4 for me, but I haven't really put Gemma 4 through the paces yet
thinking_computer@reddit
Gemma if you use tooling / agents
Inflation_Artistic@reddit
If you need general knowledge of the world, then Gemma. If you communicate in a language other than Chinese or English, then Gemma. If you mostly code, then consider Qwen (but also check Gemma)
Radiant-Video7257@reddit
Gemma is a much more enjoyable experience overall, Qwen3.5 edges it a bit in coding/reasoning.
JMowery@reddit
Develop your own tests (hell, if you're lazy, use AI to do it), put every model you are considering through the wringer, and then you'll have an objective answer for a model that fits your everyday needs instead of relying on fake benchmarks and random internet opinions.
Temporary-Roof2867@reddit
I think they have placed a lot of bots/smart agents on reddit to carry out a social experiment and spread propaganda for the Qwen3.5 models.
InternationalNebula7@reddit
I'm experimenting with gemma-4-31B-it-UD-IQ3_XXS.gguf but I've also used Qwen27B.
StardockEngineer@reddit
just try them
texasdude11@reddit
Gemma
mayo551@reddit
Try them and make your own decision rather than using us as the "voice that guides you".
Seriously just try them.