Strip Qwen3.6 dense of its multimodal capabilities
Posted by redblood252@reddit | LocalLLaMA | 27 comments
This may be naive, but if we stripped a model of its image/voice processing capabilities, would that make it smaller or faster? Is that even possible? Does it vary between MoE and dense models?
If so, why isn't it done on popular models?
sine120@reddit
I usually don't load the mmproj file in favor of saving some VRAM for context. Test it with and without.
bonobomaster@reddit
`--no-mmproj-offload` is even better. Keep the vision capabilities but run them on the CPU for the rare cases you really need them.
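E.g., a minimal sketch with llama-server (the model and mmproj filenames are placeholders):

```bash
# Load the vision projector but keep it on the CPU;
# only the language-model weights go to the GPU.
llama-server -m ./Qwen3.6-27B-Q4_K_M.gguf \
  --mmproj ./mmproj-Qwen3.6-27B-BF16.gguf \
  --no-mmproj-offload \
  -ngl 99
```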
robertpro01@reddit
How slow is it on the CPU?
bonobomaster@reddit
Very.
It depends on the input resolution and, obviously, the CPU and memory (DDR4 vs DDR5).
On DDR4 I wait a good 20 to 60 seconds, depending on the input.
robertpro01@reddit
Damn! I do need images a lot!
bonobomaster@reddit
Then offloading is not an option.
But maybe you could "hot swap" models for image stuff?!
kiwibonga@reddit
That's just the offloading of the model weights for vision but not the cache. If your batch size is 4192 for instance, 3.6 will allocate an extra 4 GB of BF16 cache in VRAM.
bonobomaster@reddit
*4096 instead of 4192 and that has nothing to do with the vision part.
Batch size and ubatch size are general options and influence VRAM usage no matter where the tokens come from.
Not loading the mmproj at all doesn't give you an extra 4 GB.
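For instance (a sketch; the exact buffer sizes depend on the model):

```bash
# -b (logical batch) and -ub (physical batch) size the compute buffers;
# shrinking them cuts VRAM whether or not an mmproj is loaded.
llama-server -m ./model.gguf -b 2048 -ub 256
```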
Healthy-Nebula-3603@reddit
That works!
Thanks.
Now even Gemma 3 27B unsloth Q4_K_XL with a q8 KV cache still gives me 80k context and 30 t/s, with full multimodal.
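Roughly this invocation (a sketch; the filenames are placeholders, and the `-fa` syntax varies between llama.cpp builds):

```bash
# Large context with a q8_0-quantized KV cache; flash attention
# is required for the quantized V cache. Vision stays on the CPU.
llama-server -m ./gemma-3-27b-it-Q4_K_XL.gguf \
  --mmproj ./mmproj-gemma-3-27b-BF16.gguf --no-mmproj-offload \
  -c 80000 -fa on \
  --cache-type-k q8_0 --cache-type-v q8_0
```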
krileon@reddit
Any way to set that from LM Studio? I've got 20GB and am so close to fitting the 27B. I don't use vision, so it'd be great to shave off enough to fully fit it.
bonobomaster@reddit
Nope. In LM Studio you can just delete or rename (.bak) the mmproj file in the model folder to remove vision and save some VRAM.
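Something like this (the models path is a guess; adjust for your install):

```bash
# Rename the projector so LM Studio can't find it;
# rename it back to restore vision.
cd ~/.lmstudio/models/lmstudio-community/<model-folder>
mv mmproj-model-bf16.gguf mmproj-model-bf16.gguf.bak
```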
krileon@reddit
Hah, simple. Had no idea you could just do that. Any idea how much it'll save in VRAM? Seems like it's just 1GB. Probably not enough to get me fully offloaded, unfortunately. Wish this model was just a biiiit smaller, lol.
bonobomaster@reddit
Depends on which version of the mmproj shipped with your particular version of the model.
Sometimes it's the FP32 instead of the BF16 version, and that fucker is nearly 2 gigs...
krileon@reddit
Looks like mine's BF16 and just 888MB with LM Studio's GGUF. Dang, I was hoping for bigger savings. I'll still give it a try, but I don't think it'll fit with context regardless. Thanks anyway!
sine120@reddit
Didn't know that, will have to add that to my script.
redblood252@reddit (OP)
I added `--no-mmproj` and it reduces my VRAM usage, but I don't know if that skips _all_ the needless weights before loading into VRAM.
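For reference, a sketch of the invocation (the repo name is a placeholder):

```bash
# --no-mmproj stops llama-server from loading the auto-detected
# projector; the text weights themselves are untouched.
llama-server -hf <org>/Qwen3.6-27B-GGUF --no-mmproj
```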
666666thats6sixes@reddit
There are no needless weights. The mmproj is a projector that projects an image into the model's embedding space, so from that point on it's just normal vectors (~tokens) going through. There isn't anything vision-specific in the model itself.
JLeonsarmiento@reddit
it does, and it works:
https://huggingface.co/leonsarmiento/Qwen3.6-27B-3bit-mlx
fatboy93@reddit
did you check the optiq version as well? curious to see how much it differs.
JLeonsarmiento@reddit
Going by the theory behind it, optiq should be superior… BUT I'm a firm believer that use-case testing >>> benchmarks/theoretical numbers.
redblood252@reddit (OP)
I have a 5060 Ti and a 3rd-gen AMD EPYC; MLX is Apple-silicon-only, so it's not an option for me.
MomentJolly3535@reddit
You might want to check out the "REAP" models on Hugging Face. Some people have tried removing the least useful experts/parts of a model to make it smaller, but the model always loses some capability.
gpalmorejr@reddit
Makes it smaller by something like 350MB, if I remember correctly. It doesn't affect speed, since the projector isn't called unless you submit a picture.
dinerburgeryum@reddit
Reframe the question: if it had never been trained for multimodal, would it have had better text generation skills? The answer is yes, probably, because it’s only 27B params and it needs every bit it can get to hold information.
No-Manufacturer-3315@reddit
Isn’t that done by default? You have to load the .9gb multimedia encoder intentionally
Betadoggo_@reddit
You can choose not to load the image portion in llama.cpp, which saves around 1GB of memory for Qwen.
coder543@reddit
Because it isn't.