New llama.cpp options make MoE offloading trivial: `--n-cpu-moe`

Posted by Pristine-Woodpecker@reddit | LocalLLaMA | View on Reddit | 101 comments

No more need for super-complex regular expressions in the `-ot` option! Just pass `--cpu-moe`, or `--n-cpu-moe N` and reduce `N` until the model no longer fits on the GPU.
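A rough sketch of what this looks like in practice. The model path is a placeholder, and the `-ot` regex is just one common pattern people used to keep MoE expert tensors on the CPU; your exact tensor names may differ:

```shell
# Old approach: regex in --override-tensor (-ot) to pin the MoE
# expert tensors to CPU while offloading everything else
llama-server -m model.gguf -ngl 99 \
  -ot "ffn_.*_exps\.weight=CPU"

# New approach: keep all MoE expert weights on CPU with one flag
llama-server -m model.gguf -ngl 99 --cpu-moe

# Or keep only the first N layers' experts on CPU; lower N step
# by step until the model stops fitting in VRAM, then back off
llama-server -m model.gguf -ngl 99 --n-cpu-moe 20
```

The attention and shared weights stay on the GPU either way; only the (large, sparsely used) expert tensors are moved, which is why this trades relatively little speed for a big VRAM saving.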