Strip Qwen3.6 dense of its multimodal capabilities

Posted by redblood252@reddit | LocalLLaMA | View on Reddit | 27 comments

This may be naive but if we stripped a model of its image processing/voice processing capabilities, can it make it smaller or faster? Is that even possible? Does it vary between MoE and dense?

If it is, why isn't it done on popular models

[-]

sine120@reddit

I usually don't load the mmproj file in favor of saving some VRAM for context. Test it with and without.

[-]

bonobomaster@reddit

--no-mmproj-offload is even better. Just leave the vision capabilities and run it from CPU for the rate cases you really need it.

[-]

bonobomaster@reddit

Very.

Depending on input resolution and obviously the CPU and memory (DDR4 vs DDR5).

DDR4 let's me wait a good 20 to 60 seconds, depending on input.

[-]

bonobomaster@reddit

Then offloading is not an option.

But maybe you could "hot swap" models for image stuff?!

[-]

kiwibonga@reddit

That's just the offloading of the model weights for vision but not the cache. If your batch size is 4192 for instance, 3.6 will allocate an extra 4 GB of BF16 cache in VRAM.

[-]

bonobomaster@reddit

*4096 instead of 4192 and that has nothing to do with the vision part.

Batch size and ubatch size are general options and influence VRAM usage no matter where the tokens come from.

Not loading the mmproj at all doesn't give you an extra 4 GB.

[-]

Healthy-Nebula-3603@reddit

That works !

Thanks

Now using even Gemma 31b unsloth q4k_xl with q8 rotation cache still have 80k context and 30 t/s and have full multimodal

[-]

krileon@reddit

Any way to set that from LM Studio? I've 20gb and so close to fitting 27B. I don't use vision so it'd be great to shave off enough to get this to fully fit.

[-]

bonobomaster@reddit

Nope. In LM Studio you can just delete or rename (.bak) the mmproj file in the model folder to remove vision and save some VRAM.

[-]

Hah, simple. Had no idea you could just do that. Any idea how much it'll save in VRAM? Seams like it's just 1GB. Probably not enough to get me fully offloaded unfortunately. Wish this model was just a biiiit smaller, lol.

[-]

bonobomaster@reddit

Depends what version of the mmproj delivered with your particular version of the model.

Sometimes it's the FP32 instead of the BF16 version an that fucker is nearly 2 gigs...

[-]

krileon@reddit

Looks like mine's BF16 and just 888MB with LM Studios GGUF. Dang. Was hoping for bigger savings. Will still give it a try, but don't think it'll fit with context regardless. Thanks anyway!

[-]

sine120@reddit

Didn't know that, will have to add that to my script.

[-]

redblood252@reddit (OP)

I add `--no-mmproj` and it reduces my VRAM, but I don't know if that skips _all_ the needless weights before loading into the VRAM.

[-]

666666thats6sixes@reddit

There are no needless weights. The mmproj is a projector that projects an image into the model's embedding space, so from that point on it's just normal vectors (~tokens) going through. There isn't anything vision-specific in the model itself.

[-]

JLeonsarmiento@reddit

it does, and it works:

https://huggingface.co/leonsarmiento/Qwen3.6-27B-3bit-mlx

[-]

fatboy93@reddit

did you check the optiq version as well? curious to see how much it differs.

[-]

JLeonsarmiento@reddit

Straight from the theory behind it, optiq must be superior… BUT I’m firmly believer that use case test >>> benchmarks/theoretical numbers.

[-]

redblood252@reddit (OP)

I have a 5060ti and amd epyc 3rd gen, not sure mlx is a good idea for me.

[-]

If it is, why isn't it done on popular models

Because it isn't.