Trying to move to ik_llama.cpp from ollama. Need some help with args..

Posted by My_Unbiased_Opinion@reddit | LocalLLaMA | 9 comments

Hey y'all. I've been running the Derestricted 120B model on ollama. It was super slow, so I moved over to LM Studio. One of the main features that helped a lot was keeping the KV cache on GPU and offloading experts to CPU. Apparently, ik\_llama.cpp is faster than llama.cpp for hybrid inference, and there seems to be control over where the expert weights go. Would someone be kind enough to recommend some launch args specific to ik\_llama.cpp? I'm basically trying to keep the most-used experts and the KV cache on GPU. I have about 70 GB of free RAM and 24 GB to use on my 3090. Ideally I would like a bit left over to keep Z-image loaded on GPU, but that's not a big priority. Before, on llama.cpp, I was able to hit 6.5 t/s with a 12700K and DDR4 RAM. Would ik\_llama.cpp be faster?

Lissanro@reddit

I shared details on how to build and set it up [here](https://www.reddit.com/r/LocalLLaMA/comments/1jtx05j/comment/mlyf0ux/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button).

SimilarWarthog8393@reddit

Curious why you set -b and -ub to the same size. Why not use the 4-to-1 ratio like the default?

Lissanro@reddit

It was recommended somewhere in an ik\_llama.cpp-related discussion on their GitHub. Higher values (like 8192) may have had some issues, while the default values save memory at the cost of performance. However, I did not experiment with setting one of them lower than the other. It may save memory, but would it improve performance? In my case, performance is the main bottleneck.

usrlocalben@reddit

"Most used experts" gets repeated often in this forum but is not a real concept. A few experts may be slightly hotter than others depending on content, but the selection is nearly random on every token. ik\_llama.cpp has been faster than llama.cpp, but the gap has been narrowing recently. They have similar controls for expert weight placement.

My_Unbiased_Opinion@reddit (OP)

How about "shared experts"? Is that a thing?

usrlocalben@reddit

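Yes. They are experts that are always active, which makes them good candidates for keeping on GPU. For DeepSeek and Kimi there is just one of them. So the shared expert is 1/257th (DeepSeek) and 1/385th (Kimi K2) of the total expert count, but 1/9 of the expert compute per token (8 routed experts are active per token, plus the shared one).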
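To make the arithmetic above concrete, here is a small Python sketch. The expert counts (256 routed for DeepSeek, 384 routed for Kimi K2, one shared expert each, 8 routed experts active per token) are inferred from the fractions quoted in the comment, and the function name is made up for illustration. It also assumes every expert is the same size, which may not hold exactly for a given model.

```python
from fractions import Fraction

def shared_expert_share(routed_total, shared, routed_active):
    """Return (share of expert count, share of per-token expert compute)
    taken by the shared experts. Assumes all experts are the same size."""
    weight_share = Fraction(shared, routed_total + shared)
    compute_share = Fraction(shared, routed_active + shared)
    return weight_share, compute_share

# Counts below are assumptions inferred from the comment, not read from a model file.
for name, routed in (("DeepSeek", 256), ("Kimi K2", 384)):
    weights, compute = shared_expert_share(routed, shared=1, routed_active=8)
    print(f"{name}: {weights} of the experts, {compute} of the expert compute per token")
```

The takeaway is that a shared expert is a tiny slice of the weights but runs on every token, which is why it is cheap to keep in VRAM.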

My_Unbiased_Opinion@reddit (OP)

Is it possible to put shared experts on the GPU with a launch arg?

usrlocalben@reddit

I think what most people do (and I do) is put \_everything\_ on GPU, then specify overrides to move the MoE tensors back to CPU. For ik\_llama (llama.cpp is almost or exactly the same) we use -ngl to move all layers to GPU (999 is just easy to type to get all layers), then a regex to pin a few MoE tensors to CUDA0, and then send anything with "exps" in the name that hasn't already been overridden to CPU. The shared experts don't match because they are called "\_shexp", without the plural "s" suffix, so they stay on GPU (from the -ngl 999). More recently, there are some shorthand arguments that do this without the regex, but I continue to use patterns:

-ngl 999 -ot "blk\.([1-4])\.ffn_up_exps=CUDA0,blk\.([1-4])\.ffn_gate_exps=CUDA0" -ot exps=CPU
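As a rough way to reason about those overrides, the Python sketch below mimics the matching with `re.search`. It assumes patterns are tried in the order given and the first match wins, which is consistent with the "hasn't already been overridden" description but is not taken from the ik\_llama.cpp source. The tensor names are illustrative examples of the usual GGUF naming, and `place` is a hypothetical helper, not part of any tool.

```python
import re

# Overrides in the order they appear on the command line (pattern, target).
OVERRIDES = [
    (r"blk\.([1-4])\.ffn_up_exps", "CUDA0"),
    (r"blk\.([1-4])\.ffn_gate_exps", "CUDA0"),
    (r"exps", "CPU"),
]

def place(tensor_name, overrides=OVERRIDES, default="GPU (from -ngl)"):
    """Return where a tensor would land, assuming first-match-wins."""
    for pattern, target in overrides:
        if re.search(pattern, tensor_name):
            return target
    return default  # no override matched, so the -ngl placement applies

# Illustrative tensor names; real names depend on the model file.
for name in (
    "blk.3.ffn_up_exps.weight",    # pinned to CUDA0 by the first pattern
    "blk.3.ffn_down_exps.weight",  # not pinned, falls through to exps=CPU
    "blk.10.ffn_gate_exps.weight", # block outside 1-4, goes to CPU
    "blk.3.ffn_up_shexp.weight",   # shared expert: "shexp" does not contain "exps"
):
    print(f"{name:32s} -> {place(name)}")
```

Running a check like this against your own model's tensor names before launching can save a slow load just to find out a pattern was off by one block.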

fizzy1242@reddit

I think ik\_llama.cpp has its own special ik-quants, but normal ones work too. You can use the -ot flag with a regex to keep the expert tensors of the first n blocks in VRAM. You can lower the upper bound of the block range if you want to keep more VRAM free for Z-image. Example below with blocks 0-14 on GPU, rest in RAM:

./llama-server \
-m /model/ -fa \
-c 32768 \
--batch_size 256 \
--ubatch_size 256 \
-ngl 99 \
-np 1 \
-fmoe \
-ot "blk\.(0?[0-9]|1[0-4])\.ffn_.*_exps.=CUDA0" \
-ot exps.=CPU
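If you want to tune the block range, it helps to check which block indices a pattern actually covers before loading the model. The Python sketch below does that for the CUDA0 pattern from the command above. The probe tensor name and the `n_blocks` value are hypothetical, and Python's `re` is only standing in for whatever regex engine the server uses, so treat it as a sanity check rather than a guarantee.

```python
import re

# The CUDA0 pattern from the example command (the part before "=CUDA0").
PATTERN = r"blk\.(0?[0-9]|1[0-4])\.ffn_.*_exps."

def pinned_blocks(pattern, n_blocks=64):
    """List block indices whose expert tensors the pattern would match.
    n_blocks is a placeholder; use your model's actual layer count."""
    rx = re.compile(pattern)
    # Probe with one illustrative expert tensor name per block.
    return [i for i in range(n_blocks)
            if rx.search(f"blk.{i}.ffn_up_exps.weight")]

blocks = pinned_blocks(PATTERN)
print(f"{len(blocks)} blocks pinned: {blocks}")
```

With this pattern the list covers blocks 0 through 14. Changing `1[0-4]` to, say, `1[0-1]` would shrink it to blocks 0 through 11 and leave more VRAM for other things.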