Gemma4 vs Qwen3.5! MoE vs Dense! SOTA vs Obsolete! Why not both?

Posted by Distinct-Rain-2360@reddit | LocalLLaMA

Every other day, someone posts about how the latest hotness of the month is a game changer but flawed in some way relative to their previous favorite. I can't help but wonder: does no one else keep their previous-gen models on speed dial? After spending so much time learning their quirks and tuning their llama.cpp params, I have a hard time letting them go.

There's also a small group of fanboys (or shills) who make it sound like you HAVE to commit to one main model, like you're married to it. Absolute loyalty or death penalty, no in-betweens. I used to be a llama-swap person, but after I recompiled llama.cpp last weekend, router mode + the built-in UI turned out to be good enough that I've moved my harem over to that. Much more convenient.

Disk size hasn't been a problem. These GGUFs total around 250 GB: the lowest quant is qwen3-coder-next at iq3_xxs, the highest is devstral2-small-24b at ud-q6_k_xl, and the others are between iq4_nl and q5_something.

Surprisingly, disk read speed has been my main issue. I only have a single PCIe Gen3 drive, so model loading tops out at 3 GB/s. Add the overhead of llama.cpp repacking the tensors, and there's often more time spent switching models than on pp + tg combined.
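To put a rough number on that, here's a back-of-the-envelope sketch. The 3 GB/s figure is the one from above; the model sizes (30 GB and 18 GB) and the `repack_factor` knob are made up for illustration, and none of this is measured from llama.cpp. It only gives a lower bound based on raw disk reads.

```python
def estimate_load_seconds(size_gb: float,
                          read_gb_per_s: float = 3.0,
                          repack_factor: float = 1.0) -> float:
    """Lower-bound load time: file size divided by sequential read speed.

    repack_factor is a hypothetical multiplier (>= 1.0) standing in for
    any tensor repacking overhead; 1.0 means "disk read only".
    """
    if read_gb_per_s <= 0:
        raise ValueError("read_gb_per_s must be positive")
    return size_gb / read_gb_per_s * repack_factor


def round_trip_seconds(parent_gb: float, sub_gb: float,
                       read_gb_per_s: float = 3.0) -> float:
    """Cost of parent -> subagent -> parent: load the subagent model,
    then load the parent model again once the subagent is done."""
    return (estimate_load_seconds(sub_gb, read_gb_per_s)
            + estimate_load_seconds(parent_gb, read_gb_per_s))


if __name__ == "__main__":
    # Hypothetical sizes: a 30 GB parent model and an 18 GB subagent model.
    print(estimate_load_seconds(30.0))     # 10.0 seconds for one load
    print(round_trip_seconds(30.0, 18.0))  # 16.0 seconds per subagent call
```

So with those assumed sizes, every subagent call costs at least 16 seconds of pure disk reads before a single token is processed, which can easily exceed the actual pp + tg time for a short subtask.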

I use opencode, and almost all of my subagents have specific models assigned to them, since you can build the best combo of tooling + prompt + model for a specific task. So when one agent starts a subagent for a subtask, llama.cpp drops the current model and loads the other one, the subagent does its thing, opencode hands its results to the parent agent, llama.cpp drops the subagent model and loads the parent model back up, and the parent agent resumes from its frozen state. But I rarely hear about anyone using multiple models like this. I also need to write a plugin to reload the correct slot on agent switch, unless there's already something out there for this.
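The swap sequence above can be sketched as a toy simulation. This is not the llama.cpp or opencode API; `SingleSlotServer`, `ensure` and `run_with_subagent` are hypothetical names, and which model plays parent or subagent is just picked for the example. It assumes only one model fits in memory at a time.

```python
class SingleSlotServer:
    """Toy stand-in for a server that can hold one model at a time."""

    def __init__(self):
        self.loaded = None   # name of the model currently in memory
        self.events = []     # ordered log of ("drop" | "load", model)

    def ensure(self, model: str) -> bool:
        """Make `model` the loaded one. Returns True if a swap happened."""
        if self.loaded == model:
            return False     # already resident, nothing to do
        if self.loaded is not None:
            self.events.append(("drop", self.loaded))
        self.events.append(("load", model))
        self.loaded = model
        return True


def run_with_subagent(server: SingleSlotServer,
                      parent_model: str, sub_model: str) -> list:
    """Parent starts, delegates one subtask, then resumes."""
    server.ensure(parent_model)  # parent agent is working
    server.ensure(sub_model)     # subagent starts: parent model is dropped
    server.ensure(parent_model)  # results returned: parent model reloaded
    return list(server.events)


if __name__ == "__main__":
    # Assignment of models to agents is hypothetical.
    log = run_with_subagent(SingleSlotServer(),
                            "qwen3-coder-next", "devstral2-small-24b")
    for action, model in log:
        print(action, model)
```

One subagent call means two full model loads after the initial one. If the parent and subagent share a model, `ensure` is a no-op and there's no swap at all, which is the trade-off against picking the best model per task.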

Remember guys, most of us may be GPU poor, but a few hundred gigs of disk space shouldn't break the bank.