Trying to move to ik_llama.cpp from ollama. Need some help with args.
Posted by My_Unbiased_Opinion@reddit | LocalLLaMA | 9 comments
Hey y'all. I've been running the Derestricted 120B model on ollama. It was super slow, so I moved over to LM Studio. One of the features that helped most was keeping the KV cache on GPU and offloading the experts to CPU.
Apparently, ik_llama.cpp is faster than llama.cpp for hybrid inference, and it seems to give some control over where the expert weights are placed.
Would someone be kind enough to recommend some launch args specific to ik_llama.cpp? I'm basically trying to keep the most-used experts and the KV cache on GPU.
I have about 70 GB of free RAM and 24 GB to use on my 3090. Ideally I'd like a bit left over to keep Z-image loaded on GPU, but that's not a big priority.
Previously, on llama.cpp, I was able to hit 6.5 t/s with a 12700K and DDR4 RAM.
Would ik_llama.cpp be faster?