Somebody running kimi locally?

Posted by No_Afternoon_4260@reddit | LocalLLaMA | View on Reddit | 16 comments

AaronFeng47@reddit

There are people hosting Kimi K2 on two 512GB Mac Studios.

jzn21@reddit

I do, but at Q2 Unsloth. After testing, I discovered that Deepseek V3 at Q4 delivers way better results.

relmny@reddit

My experience is the opposite. I used to run deepseek-r1-0528 ud-iq3 (unsloth) as the "last resort" model (I can only get about 1t/s) for when even qwen3-235b wasn't enough (I usually go with qwen3-14b or 32b, as I get "normal" speed), and a few days ago I started testing kimi-k2 ud-q2 (unsloth) and... wow! I still get 1t/s, but as a non-thinking model it is, of course, much faster than deepseek-r1 in the end. And the results were amazing: to the point, no apologies, no "chit chat", just the answer and that's it. I have it now, at least for now, as my "last resort" model.

No_Afternoon_4260@reddit (OP)

Why not deepseek v3? It is non-thinking.

relmny@reddit

I didn't manage to get similar speed with v3. Offloading layers didn't work for me as it does with r1. Now I'm trying qwen3-235-thinking, and, so far, I like it a lot...

AaronFeng47@reddit

As expected, Q2 could cause serious brain damage (to the model)

relmny@reddit

With an RTX 5000 Ada (32GB) and 128GB RAM I get about 1t/s with UD-Q2 (unsloth). I use it as a "last resort" model (when I can't get what I want from smaller models). It replaced, for now, deepseek-r1 ud-iq3 for me. So far I'm very impressed by it.

eloquentemu@reddit

People are definitely running Kimi K2 locally. What are you wondering?

No_Afternoon_4260@reddit (OP)

What setup and speeds? Not interested in Macs.

eloquentemu@reddit

It's basically just Deepseek but ~10% faster and needs more memory. I get about 15t/s peak, running on 12 channels DDR5-5200 with Epyc Genoa.
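
For context on the hardware figure above, here is a rough sketch of the theoretical peak memory bandwidth of 12 channels of DDR5-5200, and of the decode-speed ceiling that bandwidth implies. The 8 bytes per transfer reflects the 64-bit data width of a DDR5 channel. The 32B active parameters and ~4.5 bits per weight for a Q4 quant are assumptions that are not stated in this thread, and the function names are illustrative only.

```python
def peak_bandwidth_gb_s(channels: int, mega_transfers_per_s: float,
                        bytes_per_transfer: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (64-bit channel = 8 bytes/transfer)."""
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


def decode_ceiling_tok_s(bandwidth_gb_s: float, active_params_billion: float,
                         bits_per_weight: float) -> float:
    """Upper bound on tokens/s if every active weight is read from RAM once per token."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# 12 channels of DDR5-5200, as described in the comment above.
bandwidth = peak_bandwidth_gb_s(channels=12, mega_transfers_per_s=5200)

# ASSUMPTIONS (not from the thread): 32B active parameters, ~4.5 bits/weight at Q4.
ceiling = decode_ceiling_tok_s(bandwidth, active_params_billion=32, bits_per_weight=4.5)

print(f"peak bandwidth: {bandwidth:.1f} GB/s")
print(f"decode ceiling: {ceiling:.1f} tokens/s")
```

Under those assumptions the ceiling comes out somewhat above the reported 15t/s, which is the expected direction: real systems do not reach theoretical bandwidth, and this estimate ignores the GPU offload and KV-cache reads.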

No_Afternoon_4260@reddit (OP)

Thx, what quant? No GPU?

eloquentemu@reddit

Q4, and that's with a 4090 offloading non-experts.

No_Afternoon_4260@reddit (OP)

Ok thx for the feedback

usrlocalben@reddit

`prompt eval time = 101386.58 ms / 10025 tokens ( 10.11 ms per token, 98.88 tokens per second)`

`generation eval time = 35491.05 ms / 362 runs ( 98.04 ms per token, 10.20 tokens per second)`

[ubergarm IQ4\_KS quant](https://huggingface.co/ubergarm/Kimi-K2-Instruct-GGUF)

sw is [ik\_llama](https://github.com/ikawrakow/ik_llama.cpp)

hw is 2S EPYC 9115, NPS0, 24x DDR5 + RTX 8000 (Turing) for attn, shared exp, and a few MoE layers

as much as 15t/s TG is possible w/short ctx but above perf is w/10K ctx.

[sglang](https://lmsys.org/blog/2025-07-14-intel-xeon-optimization/) has new CPU-backend tech worth keeping an eye on. They offer a NUMA solution (expert-parallel) and perf results look great, but it's AMX only at this time.
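
The throughput numbers in the two timing lines above can be re-derived from the raw milliseconds and token counts. The following is a minimal sketch that assumes the line format is exactly as pasted in this comment; the regex and the `throughput` helper are illustrative and not part of ik_llama.

```python
import re

# Matches lines shaped like the two pasted above:
#   "<label> time = <ms> ms / <count> tokens|runs ( ... )"
TIMING_RE = re.compile(
    r"(?P<label>[\w ]+?) time\s*=\s*(?P<ms>[\d.]+) ms\s*/\s*(?P<count>\d+) (?:tokens|runs)"
)


def throughput(line: str) -> tuple[str, float]:
    """Return (label, tokens per second) computed from a timing line."""
    match = TIMING_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognized timing line: {line!r}")
    seconds = float(match.group("ms")) / 1000.0
    return match.group("label").strip(), int(match.group("count")) / seconds


lines = [
    "prompt eval time = 101386.58 ms / 10025 tokens ( 10.11 ms per token, 98.88 tokens per second)",
    "generation eval time = 35491.05 ms / 362 runs ( 98.04 ms per token, 10.20 tokens per second)",
]

for line in lines:
    label, tok_s = throughput(line)
    print(f"{label}: {tok_s:.2f} tokens/s")
```

The recomputed values agree with the figures printed in the log: about 98.88 tokens/s for prompt processing and about 10.20 tokens/s for generation at roughly 10K context.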

No_Afternoon_4260@reddit (OP)

>[sglang](https://lmsys.org/blog/2025-07-14-intel-xeon-optimization/) has new CPU-backend tech worth keeping an eye on. They offer a NUMA solution (expert-parallel) and perf results look great, but it's AMX only at this time.

Oh interesting, happy to see the 9115 so performant!

segmond@reddit

It's DeepSeek-like, so expect DeepSeek-like performance / 2, since it's about twice the size.