Optimizing MiniMax 2.7 - Experts vs Layers for best VRAM/RAM utilization
Posted by CBHawk@reddit | LocalLLaMA | 6 comments
I'm curious if there is a rule of thumb for how best to load MiniMax given varying VRAM/RAM configurations. Is there a way to estimate how many experts versus layers to offload for people running 16GB/24GB/32GB/48GB of VRAM? Can you get performance gains by only activating 1 expert with 24GB of VRAM and then offloading x number of layers?
Please forgive my ignorance if I'm thinking about this the wrong way.
korino11@reddit
APEX.. we need APEX on M2.7!
LagOps91@reddit
what you generally want to do is keep all of the attention calculation + kv cache on gpu and only offload as many routed experts as needed to stay in the vram budget.
llama.cpp unfortunately doesn't allow you to offload specific experts (would be very nice to keep the most used experts in vram to maximize the benefits).
regardless, the speed doesn't really improve that much with extra vram until you can fit nearly all of it into vram. it's not really worth it to try and get 48gb vram. 32gb vram might still be worth it if you want some more room for context, but with all the context quant improvements, that might no longer be needed either.
24gb vram comfortably fits attention and context, 16gb vram is a bit more of a squeeze, but should still be fine overall, you might need to lower context a bit or quant it more.
in general, you should aim for Q4 and would need 128gb ram regardless of how much vram you have. maybe with 48gb vram 96gb ram would be fine too, but you would need to put more expert weights into vram and lose some of the benefits of having more room for context.
grumd@reddit
it does, see option --override-tensor
LagOps91@reddit
tensors are not experts. a single exps tensor contains weights of all experts.
suicidaleggroll@reddit
In general, with a single GPU, it's best to set "--ctx-size" to your desired context, "--n-gpu-layers 999", and then "--n-cpu-moe" as low as your VRAM will allow. MiniMax is big, and the context for it is huge, so with 24 GB of VRAM you likely won't be able to hold any actual model layers in the GPU; you'll be maxed out with just context, but that's still useful vs running everything on the CPU. It needs 240GB/1M tokens for context (per the official docs), so 128k context needs 31.5 GB, while 64k context needs 15.8 GB. So with 24 GB of VRAM, if you set "--n-gpu-layers 999 --n-cpu-moe 62 --ctx-size 80000", that should get close to maxing out your VRAM with all MoE layers on the CPU.
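The 240 GB per 1M tokens figure quoted above scales linearly with context, so the per-context-size numbers are easy to sanity-check:

```python
# Sanity-check the KV-cache arithmetic: context cost scales linearly,
# at the quoted rate of 240 GB per 1M tokens.
KV_GB_PER_MTOK = 240.0

def ctx_vram_gb(n_ctx_tokens):
    """VRAM needed for the KV cache at a given context length, in GB."""
    return KV_GB_PER_MTOK * n_ctx_tokens / 1_000_000

print(round(ctx_vram_gb(128 * 1024), 1))  # 128k context -> 31.5 GB
print(round(ctx_vram_gb(64 * 1024), 1))   # 64k context  -> 15.7 GB
print(round(ctx_vram_gb(80_000), 1))      # 80k context  -> 19.2 GB
```

The 80k figure is why "--ctx-size 80000" lands near the 24 GB limit once compute buffers and the non-expert weights are added on top.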
ambient_temp_xeno@reddit
The main thing, assuming you're using llama.cpp, is to use -cmoe to keep the routed-expert weights on the CPU, so attention, the shared expert, and the KV cache stay on the GPU. I didn't bother offloading any more layers on 24gb vram. (6 t/s on ddr4 quad channel Q8 quant)
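That ~6 t/s is close to what a simple memory-bandwidth roofline predicts for CPU-offloaded MoE decoding: each generated token has to stream the active weights from RAM once, so throughput is roughly bandwidth divided by active-weight bytes. The active-parameter count and bandwidth below are rough assumptions (~10B active parameters at Q8, ~50 GB/s for quad-channel DDR4), not confirmed MiniMax figures:

```python
# Rough memory-bandwidth roofline for CPU-offloaded MoE decoding.
# tokens/s ~= RAM bandwidth / bytes of active weights read per token.
# Assumed figures: ~10B active params at Q8 (~1 byte/weight),
# ~50 GB/s for quad-channel DDR4.

def roofline_tps(bandwidth_gbs, active_params_billion, bytes_per_weight):
    active_gb = active_params_billion * bytes_per_weight
    return bandwidth_gbs / active_gb

print(round(roofline_tps(50, 10, 1.0), 1))  # -> 5.0 t/s, near the quoted 6
```

Under this model, a Q4 quant would roughly double decode speed, since it halves the bytes streamed per token.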