My experience is the opposite.
I used to run deepseek-r1-0528 ud-iq3 (unsloth) as my "last resort" model (I only get about 1t/s) for when even qwen3-235b wasn't enough (I usually go with qwen3-14b or 32b, as I get "normal" speed). A few days ago I started testing kimi-k2 ud-q2 (unsloth) and... wow!
I still get 1t/s, but as a non-thinking model it is, of course, much faster than deepseek-r1 in the end. And the results were amazing.
To the point, no apologies, no "chit chat", just the answer and that's it.
For now, at least, it's my "last resort" model.
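To put rough numbers on "same 1t/s, but faster in the end": wall time is just total generated tokens divided by generation speed, and a thinking model has to emit its reasoning tokens before the answer. A minimal sketch, where the token counts are made-up assumptions for illustration (not measurements) and only the 1t/s figure comes from the comment above:

```python
def wall_time_s(answer_tokens: int, thinking_tokens: int = 0, tok_per_s: float = 1.0) -> float:
    """Seconds to generate a reply: every token (thinking + answer) costs 1/tok_per_s."""
    return (thinking_tokens + answer_tokens) / tok_per_s

# Hypothetical token counts, chosen only to illustrate the arithmetic.
ANSWER_TOKENS = 400
THINKING_TOKENS = 3000

non_thinking_s = wall_time_s(ANSWER_TOKENS)                # kimi-k2 style: answer only
thinking_s = wall_time_s(ANSWER_TOKENS, THINKING_TOKENS)   # r1 style: reasoning first

print(f"non-thinking: {non_thinking_s / 60:.1f} min")
print(f"thinking:     {thinking_s / 60:.1f} min")
```

With these assumed counts the gap is entirely down to the extra reasoning tokens; the real ratio depends on how long the model thinks for a given prompt.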
I didn't manage to get speed similar to v3. Offloading layers didn't work for me as it does with r1.
Now I'm trying qwen3-235-thinking, and, so far, I like it a lot...
With an RTX 5000 Ada (32GB) and 128GB RAM I get about 1t/s with UD-Q2 (unsloth).
I use it as a "last resort" model (when I can't get what I want from smaller models). It replaced, for now, deepseek-r1 ud-iq3 for me.
So far I'm very impressed by it.
`prompt eval time = 101386.58 ms / 10025 tokens ( 10.11 ms per token, 98.88 tokens per second)`
`generation eval time = 35491.05 ms / 362 runs ( 98.04 ms per token, 10.20 tokens per second)`
[ubergarm IQ4\_KS quant](https://huggingface.co/ubergarm/Kimi-K2-Instruct-GGUF)
sw is [ik\_llama](https://github.com/ikawrakow/ik_llama.cpp)
hw is 2S EPYC 9115, NPS0, 24x DDR5 + RTX 8000 (Turing) for attn, shared exp, and a few MoE layers
As much as 15t/s TG is possible w/short ctx, but the perf above is w/10K ctx.
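For anyone comparing their own runs, the per-token and tokens-per-second figures in the two timing lines above follow directly from the total time and token count. A small sketch that recomputes them (the helper names are my own, not anything from ik_llama):

```python
def ms_per_token(total_ms: float, n_tokens: int) -> float:
    """Average latency per token in milliseconds."""
    return total_ms / n_tokens

def tokens_per_second(total_ms: float, n_tokens: int) -> float:
    """Throughput: tokens divided by elapsed seconds."""
    return n_tokens / (total_ms / 1000.0)

# Totals copied from the timing lines above.
prompt_ms, prompt_tokens = 101386.58, 10025
gen_ms, gen_tokens = 35491.05, 362

pp = tokens_per_second(prompt_ms, prompt_tokens)   # prompt processing
tg = tokens_per_second(gen_ms, gen_tokens)         # token generation
total_s = (prompt_ms + gen_ms) / 1000.0            # end-to-end wall time

print(f"PP: {ms_per_token(prompt_ms, prompt_tokens):.2f} ms/tok, {pp:.2f} t/s")
print(f"TG: {ms_per_token(gen_ms, gen_tokens):.2f} ms/tok, {tg:.2f} t/s")
print(f"total: {total_s:.1f} s")
```

Note that at 10K context most of the wall time here goes to prompt processing, not generation.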
[sglang](https://lmsys.org/blog/2025-07-14-intel-xeon-optimization/) has new CPU-backend tech worth keeping an eye on. They offer a NUMA solution (expert-parallel) and perf results look great, but it's AMX only at this time.
>[sglang](https://lmsys.org/blog/2025-07-14-intel-xeon-optimization/) has new CPU-backend tech worth keeping an eye on. They offer a NUMA solution (expert-parallel) and perf results look great, but it's AMX only at this time.
Oh interesting, happy to see the 9115 so performant!
16 Comments
AaronFeng47@reddit
jzn21@reddit
relmny@reddit
No_Afternoon_4260@reddit (OP)
relmny@reddit
AaronFeng47@reddit
relmny@reddit
eloquentemu@reddit
No_Afternoon_4260@reddit (OP)
eloquentemu@reddit
No_Afternoon_4260@reddit (OP)
eloquentemu@reddit
No_Afternoon_4260@reddit (OP)
usrlocalben@reddit
No_Afternoon_4260@reddit (OP)
segmond@reddit