Need to know more about lesser-known engines (ik_llama.cpp, exllamav3...)
Posted by Leflakk@reddit | LocalLLaMA
I usually stick to llama.cpp and vLLM, but llama.cpp isn't always the fastest, and vLLM/SGLang can be really annoying if your GPU count isn't a power of two for tensor parallelism.
So, for people who really know other projects (I mainly know ik_llama and exl3), could you please provide some feedback on where they really shine and what their main constraints are (model/hardware support, tool calling, stability...)?
Testing and understanding this stuff takes time, so any useful info is good to have, thanks!
HopefulMaximum0@reddit
Piggybacking on the question: does anybody have experience with ktransformers?
It looks good on paper to combine CPU+GPU and seems to be what DeepSeek uses, but what is the reality? Does juggling kernels and modules work well outside of experimental work?
Electrical-Daikon621@reddit
DeepSeek's serving infrastructure uses a fork of vLLM, not KTransformers, and the KT project is basically no longer maintained. It is best suited to the most advanced CPUs, such as 4th and 5th gen Xeon and the Xeon 6 series, and its GPU support is broad, even covering Intel GPUs.
(Sorry, my English is not very good. I have been running the DeepSeek V2.5 model with KTransformers for a long time and have written up my understanding of it below.)
KTransformers is not the inference engine DeepSeek uses; DeepSeek runs a fork of vLLM, and you can find their own FlashMLA repository on GitHub.
These days KTransformers seems to be unmaintained. If you just pull the code and follow the instructions in the repo, you probably won't get it running; to actually deploy it you will likely need to scan the QR code in the README to join their QQ group, where someone may be able to help.
The project has very good support for the most advanced CPUs, such as 4th and 5th generation Xeon Scalable and the newest 6th generation. Its core idea is to run the MLP layers on the CPU and system RAM while running attention, the tokenizer and decoding on the GPU.
As far as I know its GPU support is quite broad: CUDA of course, but it also supports ROCm and even Intel GPUs (strange, maybe via OpenVINO?).
Many Chinese developers like this inference engine: back when RAM prices were not so high, KTransformers let you run very large models at fairly low cost. But RAM is now too expensive in China, for example 16GB of DDR5-5600 sells for about $120 there, so nobody cares about it any more 😂
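If you still want to try it, the usual entry point is the local_chat script from their README; the exact flag names here are from memory, so treat this as a sketch and check the current docs before relying on it:

    # rough sketch of a KTransformers launch: HF config for the model, weights as GGUF,
    # expert/MLP layers on CPU, attention and decode on the GPU
    python -m ktransformers.local_chat \
        --model_path deepseek-ai/DeepSeek-V2.5 \
        --gguf_path /models/DeepSeek-V2.5-GGUF \
        --cpu_infer 32    # CPU threads used for the expert layers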
HopefulMaximum0@reddit
Very interesting, thanks!
There is now FastLLM, which seems to have the same ability to use GPU and CPU, but they claim to be able to use both at the same time. Most pages I find are in Chinese and they seem to be incomplete. Translation is a solved problem, but I think QQ and Bilibili are the places where the Chinese discuss LLMs, and getting access to those from abroad is a problem I seem unable to solve.
AXYZE8@reddit
I'll try to make it as easy as possible.
ik_llama.cpp is a fork of llama.cpp. It adds modern quant techniques that work better in the low-bit range (KT/KS/KSS quants, like IQ4_KSS) and focuses heavily on CPU+GPU hybrid inference on newer hardware. It lacks some improvements from recent llama.cpp, such as the new web UI that llama.cpp got.
exllamav3 also has modern quant techniques like ik_llama.cpp, but it is GPU-only and targets modern GPUs. They recommend Ada Lovelace/Blackwell; even Ampere (so cards like the RTX 3090) already misses some optimizations. The biggest selling point is their KV cache implementation: even at q4 there is no visible degradation.
So basically:
- Llama.cpp runs on everything. Old PC, new PC, SBC, server. To get compatibility errors you need to go with truly ancient hardware, like a 25-year-old PC with an Athlon XP. This is a great starting point.
- ik_llama.cpp is best for hybrid CPU+GPU inference if you have fairly modern hardware. Something like a Ryzen plus an RTX 3090 is the target for ik_llama. If you are using CPU or CPU+GPU inference, try it to see if performance goes up, and then you can grab the ik_llama.cpp-exclusive quants from 'ubergarm' and 'Thireus' on HF (see the example command after this list).
- exllamav3 squeezes every single drop out of your GPU. Choose it if you are GPU-rich and you need long context (as q4 cache will let you fit roughly 4x more context into the same VRAM). You have something like an RTX 4090 or 5090 and you can fit the model into VRAM? Then exllamav3 is the winner. You need to spill that MoE to CPU? You're out of luck.
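As a rough sketch of what that hybrid setup looks like in practice (model file and numbers are illustrative, and -fmoe / -rtr are ik_llama.cpp-specific flags, so check --help on your build):

    # keep attention and dense layers on the GPU, push the MoE expert tensors to system RAM
    ./llama-server -m Qwen3-235B-A22B-IQ4_KSS.gguf \
        -ngl 99 -c 32768 -t 16 \
        -ot "exps=CPU" \
        -fmoe -rtr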
For older GPUs there are also some other llama.cpp forks, for example this https://github.com/iacopPBK/llama.cpp-gfx906 for AMD Mi50/Mi60
FatheredPuma81@reddit
I gave ik_llama.cpp a try and no, it doesn't. I went from an insane 25.7 tokens/s to 25.7 tokens/s and lost vision capabilities for whatever reason. Maybe asking for a Flappy Bird clone wasn't a good benchmark, maybe Qwen3.5 isn't well supported yet, maybe Unsloth's UD quant is why there was no improvement (I checked for F16 tensors with an AI-written Python script: there were none in the 122B but there were some in the 35B version), or maybe I just don't know what I'm doing or missed a setting somewhere.
The only thing it did do was break the model, because Qwen3.5 is very sensitive to Min P and the llama-server GUI's default sampling settings override what I have in my script below and what I set in the GUI. I had to reset the settings like 3 times before I started getting actual output.
CheatCodesOfLife@reddit
This isn't well documented but ik_llama has a custom kv cache implementation as well:
    --k-cache-hadamard -ctk q4_0 -ctv q5_0

with almost no visible degradation.
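In a full launch that would look something like this (model file and context size are just placeholders, the cache flags are the ones quoted above):

    # quantized KV cache with Hadamard rotation on the K cache
    ./llama-server -m gemma-3-27b-it-Q8_0.gguf -ngl 99 -c 65536 \
        --k-cache-hadamard -ctk q4_0 -ctv q5_0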
Exllama still has much better speculative decoding speed-up compared with llama.cpp
AXYZE8@reddit
Why is the V cache quantized higher than the K cache? Shouldn't it be the other way around?
I'll test it later, because I have my favorite Gemma3 quants in GGUF and I would love to 2x their context (currently using Q8)
yotsuya67@reddit
I have a mix of hardware, which up to now I thought only llama.cpp supported: a Chinese dual-X99 motherboard, 2x Xeon E5-2630 v4, 128GB of RAM (64GB per CPU, quad-channel DDR4-2133), an RTX 3060 12GB, a GTX 1070 8GB and a P104-100 8GB.
I just discovered ik_llama.cpp. I was running Qwen3 235B A22B 2507 Thinking in iq4_xs (to fit inside 128GB of RAM) and getting around 4 t/s on llama.cpp. I switched to ik_llama and got 4.5 t/s... then realized I shouldn't be using -sm graph since most of the model is not on the GPUs... switched to -sm layer and got... 5.2 t/s. That's a lot of gain, and I didn't even have to redownload the models.
AbstrusSchatten@reddit
ik_llama is a lot more performant even CPU-only if we are talking about prefill, though. So it's a lot better for cases where the model needs to keep its answer short but has lengthy prompts.
mj3815@reddit
I have tried Huggingface’s TGI, Aphrodite, SGLang. They all had some benefits. Aphrodite and SGLang have been reliable for me. vLLM was the fastest but I would have issues with it hanging sometimes which is why I experimented with alternatives
fizzy1242@reddit
exllama is faster for pure GPU inference. Stick to exllamav2 if you have a 3000-series NVIDIA GPU, v3 for newer ones.
Nrgte@reddit
exl3 isn't much slower than exl2 on 3000 series anymore, last time I tried it.
fizzy1242@reddit
Really? I thought it was lacking because exl3 requires more compute. Glad to hear it's getting better.
Nrgte@reddit
Yeah I remember on launch of exl3 the performance on my 3090 was a lot worse than with exl2, but I've tried again recently with Mistral Small and it only was like ~10% slower, which was like 2-3 tps slower.
dinerburgeryum@reddit
So EXL2 is a little long in the tooth now but still a good option for ridiculously fast inference on Ampere. EXL3 is crazy high quality for its BPW but is indeed pretty slow on Ampere or older. It also easily has the best KV cache quantization routines in the wild, hands down: super stable at 5-6 bits. CUDA only, no CPU offloading. They're served through TabbyAPI, which unfortunately has kind of shoddy support for tool calling. (That's the fault of the tool call structure in models, for sure, but it's worth pointing out.)
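(For what it's worth, TabbyAPI exposes an OpenAI-compatible endpoint once a model is loaded, so you hit it like any other OpenAI-style server; the port and auth header below are what I'd expect from a default config, so adjust to yours.)

    # minimal chat request against a local TabbyAPI instance (default port 5000 assumed)
    curl http://localhost:5000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer $TABBY_API_KEY" \
        -d '{"messages": [{"role": "user", "content": "Hello"}]}'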
I've moved primarily to ik_llama. The IQ4_KT option is rock solid and super fast. It supports Hadamard rotation on the K-cache and Q6_0 KV cache quantization. No recurrent model support, however, so no Nemotron H, Granite 4 H or Qwen Next. Tool calls work as you'd expect except for the most annoying models (Seed OSS comes to mind). It has graph-mode tensor parallel support, which is miles ahead of mainline's row-based tensor parallelism.
Nrgte@reddit
exl2 and exl3 are also supported in Ooba and run great there.
dinerburgeryum@reddit
Haven’t used ooba in a while; how’s tool calling on that side?
Nrgte@reddit
I don't need that personally, but I know Ooba has extensions, so if it's something a lot of people want there is probably an extension for it.
SatisfactionSuper981@reddit
There are a few others:
lmdeploy: Awesome for V100 and Turing, but they aren't keeping up with newer models. Supports AutoAWQ, but that's dead. Tried to get SmoothQuant working, but it's painful. No CPU offloading.
mlc-ai: Has good support for older GPUs, and even just worked out of the box with MI50 GPUs. The project seems pretty dead. Supports Vulkan, but only with a single GPU. Also supports CPU, but I don't know how well it performs.
ftllm: Probably the fastest CPU-based engine I've tried, primarily because it does tensor splitting across NUMA nodes. It also allows splitting the MoE router and KV off to the GPU. On second-gen Cascade Lake I get 5-10 t/s with DeepSeek V3. It basically only supports Chinese models, and most of the documentation is in Chinese.
a_beautiful_rhind@reddit
IK has pretty much the fastest inference going right now but you have to remember all the flags and test things.
EXL2/EXL3 used to be the fastest for 2 years, and their model support is much less likely to be broken than llama.cpp's.
bullerwins@reddit
Why are they downvoting you? I agree with all you said.
CheatCodesOfLife@reddit
There's some kind of historical bickering/drama between the llama.cpp guy and the ik_llama.cpp guy I think. rhind is correct though, ik_llama is even faster than exl2/3 now.
sudochmod@reddit
Is this also true for AMD?
a_beautiful_rhind@reddit
I think AMD in IK isn't working well. For exllama, you'd have to see if it still functions there.
hainesk@reddit
Can we get a specialized LLM that dynamically analyzes our hardware and the LLM we're trying to run, and then spits out options for running it along with advantages/disadvantages and run commands? Maybe with a focused web search that lets it read about new models and updates to the various engines that run them?
Lissanro@reddit
Rule of thumb: use ik_llama.cpp by default, and llama.cpp if you need to run a model that it supports but ik_llama.cpp does not. Both ik_llama.cpp and llama.cpp allow CPU+GPU and CPU-only inference. You can check https://huggingface.co/ubergarm for ik_llama.cpp-specific quants, and I shared details here on how to build and set up ik_llama.cpp in case you would like to give it a try.
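The short version is that it builds the same way as mainline llama.cpp with CMake; roughly like this for a CUDA setup (check the repo README for other backends):

    git clone https://github.com/ikawrakow/ik_llama.cpp
    cd ik_llama.cpp
    cmake -B build -DGGML_CUDA=ON
    cmake --build build --config Release -j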
EXL3 is GPU-only and supports a limited selection of models; it is also not as fast as EXL2 used to be, but it saves VRAM by maintaining quality at a somewhat lower bpw than GGUF. Basically it is great when a GGUF of the needed quality is just a bit too large to fully fit in VRAM, so I can use a slightly smaller EXL3 at the cost of losing some performance. I heard EXL3 performs better on newer cards, but I only have 3090 GPUs.
There is no perfect option, each has its own advantages and optimizations. So it is a good idea to take some of your most used models and test on your own hardware which backend works best.
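A quick way to do that comparison is the llama-bench tool that ships with both llama.cpp and ik_llama.cpp: run the same model file with the same prompt and generation lengths on each build and compare the numbers, for example:

    # reports prompt processing (pp) and token generation (tg) throughput
    ./llama-bench -m model.gguf -ngl 99 -p 512 -n 128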
jacek2023@reddit
exllama is faster, but I trust the llama.cpp ecosystem
Ok_Dream3627@reddit
Been using exllamav3 for a while now and it's pretty solid for inference speed, especially if you're running quantized models. The main thing is it's still kinda experimental so don't expect rock-solid stability for production stuff. Hardware support is decent but not as broad as llama.cpp
For ik_llama I've heard good things about memory efficiency but haven't personally tested it much - maybe someone else can chime in on that one