Gemma4 vs Qwen3.5! MoE vs Dense! SOTA vs Obsolete! Why not both?
Posted by Distinct-Rain-2360@reddit | LocalLLaMA
Every other day, there's someone posting about how the latest hotness of the month is a game changer, but flawed in some way relative to their previous favorite. I can't help but wonder: does no one else keep their previous-gen models on speed dial? After spending so much time learning and working with their quirks and tuning their llama.cpp params, I find myself having a hard time letting them go.
There's also a small group of fanboys (or shills) who make it sound like you HAVE to commit to 1 main model, like you're married to it. Absolute loyalty or death penalty, no in-betweens. I used to be a llama-swap person, but after recompiling llama.cpp last weekend, router mode + the built-in UI are good enough now that I've moved my harem over to that; much more convenient.
Disk size hasn't been a problem, these GGUFs total around 250 GB. The lowest quant is qwen3-coder-next at iq3_xxs, the highest is devstral2-small-24b at ud-q6_k_xl, and the others sit between iq4_nl and q5_something.
Surprisingly, disk read speed has been my main issue. I only have a single PCIe gen3 drive, so loading models tops out at about 3 GB/s. And then you add the overhead of llama.cpp repacking the tensors, and now there's often more time spent switching models than pp + tg (prompt processing + token generation) combined.
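To put numbers on the swap cost above: a rough back-of-envelope estimate is raw read time (model size over disk throughput) times a fudge factor for llama.cpp's on-load tensor repacking. The sizes and the 1.5x overhead factor here are illustrative assumptions, not measurements from the post.

```python
def load_seconds(model_gb: float, read_gbps: float = 3.0, overhead: float = 1.5) -> float:
    """Estimate seconds to load a GGUF from disk: raw sequential-read
    time, scaled by an assumed factor for repacking/verification."""
    return model_gb / read_gbps * overhead

# Hypothetical sizes roughly in the range of the quants mentioned above.
for name, size_gb in [("big MoE @ iq3_xxs, ~40 GB", 40),
                      ("mid model @ q5, ~20 GB", 20)]:
    print(f"{name}: ~{load_seconds(size_gb):.0f} s per swap")
```

At those sizes, every agent-to-subagent handoff that forces a swap costs on the order of 10–20 seconds before a single token is generated, which is how the switch can dwarf pp + tg for short subtasks.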
I use opencode, and almost all subagents have specific models assigned to them, since you can build the best combo of tooling + prompt + model for a specific task. So when one agent starts a subagent for a subtask, llama.cpp drops the current model, loads the other model, then the subagent does its thing, opencode gives its results to the parent agent, llama.cpp drops the subagent model, then loads up the parent model, and the parent agent resumes from its frozen state. But I rarely hear about anyone using multiple models like this. I also need to write a plugin to reload the correct slot on agent switch, unless there's already something out there for this.
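The per-subagent routing described above boils down to a mapping from agent name to model id, with the server swapping weights based on the `model` field of a standard OpenAI-compatible chat request (this is how llama-swap keys its switching, and, as I understand it, llama.cpp's router mode works the same way). A minimal sketch; the agent names and model ids are hypothetical:

```python
# Hypothetical agent -> model assignment; ids are illustrative.
AGENT_MODELS = {
    "planner": "devstral2-small-24b-ud-q6_k_xl",
    "coder": "qwen3-coder-next-iq3_xxs",
    "reviewer": "qwen3.5-mid-q5_k_m",
}

def build_request(agent: str, messages: list[dict]) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions body for this
    agent. The router sees `model` and swaps weights if needed."""
    return {
        "model": AGENT_MODELS[agent],
        "messages": messages,
        "temperature": 0.2,
    }

req = build_request("coder", [{"role": "user", "content": "refactor this"}])
print(req["model"])
```

POSTing that body to the router endpoint is all a subagent needs to do; the expensive part is the implicit model swap it triggers, not the request itself.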
Remember guys, most of us may be gpu poor, but a few hundred gigs of disk space shouldn't break the bank.
ttkciar@reddit
Yah, to avoid the model-switching overhead my habit is to load a most-used mid-sized model to the GPU and keep it there, and use larger models via pure-CPU inference. Since the larger models wouldn't fit in VRAM anyway, I don't bother evicting the mid-sized model from VRAM.
Sometimes I will switch the in-VRAM model, but will then stick with it for at least the rest of the day, so the switching overhead gets amortized.
Distinct-Rain-2360@reddit (OP)
That's interesting, I only keep qwen3.5-2b in CPU memory for title generation. If you run the large models in memory only, doesn't the speed get unbearably slow? Even if I can load qwq and seed-oss completely into VRAM, the low tg (<20 t/s) is so bad that I only turn to them when I'm truly desperate. I couldn't imagine running something like qwen3-coder-next without a GPU.
Since you use the same mid-sized model for a day, do you have to adjust things for different model personality quirks? All my faster models have their favorite lane, where when things are good they're great. But if I dress them up as an agent that doesn't jibe with the way they like to think, things tend to go bad more often.
ttkciar@reddit
Pure-CPU inference becomes too slow for interactive use (single-digit tokens/second), but it's fine for non-interactive long inference. I can work on other things (or sleep, for overnight tasks) while it is inferring.
Also, if the model is too large to fit in VRAM anyway, inference is going to be very slow regardless. I would rather structure my work habits around that slow inference, so that its inference speed does not matter, and leave my frequently-used mid-sized model in VRAM so I can continue to use it at good speed suitable for interactive use.
Usually it's the other way around. When I need to perform tasks for which that model's quirks are well-suited, that is when I load the model. A given project will often be tailored to a specific model, so when I am working on that project, that is the model that gets loaded.
Distinct-Rain-2360@reddit (OP)
Thanks for the tips, I'll try to find something that can be done overnight without needing my supervision.