Kimi Linear 30% gain in pp and higher context merged to llama.cpp

Posted by Ok_Warning2146@reddit | LocalLLaMA | 13 comments

[https://github.com/ggml-org/llama.cpp/pull/19827](https://github.com/ggml-org/llama.cpp/pull/19827) I accidentally found that changing just one line boosts prompt processing (pp) by 30% and increases the context of IQ3\_M on a 3090 from 192k to 300k. It would be great if people with a 5090 could report how much context they can get at various quants.

EdenistTech@reddit

Not a 5090, but I have a 5070TI/5060TI combination, so still 32GB and Blackwell. Using a Q4\_0 quant, I can fit 256K context and it starts off at a blazing 118 t/s. The MXFP4 quant also fits 256K but runs at a more modest 85 t/s (better quality as well, as expected). I was using the latest llama.cpp stable, so I guess this should include your tweak, OP. I hadn't tried this model before. For a 49B model, this thing is FAST!

Ok_Warning2146@reddit (OP)

Thanks for your numbers. This model is supposedly the best in long context among all open models. Please try some long context stuff and see if it does indeed perform better than the others.

EdenistTech@reddit

Alright. I asked both models to summarise a 1MB markdown text. Nemo started processing at 6300 t/s and ended at 4300 t/s, finishing in 58 seconds. Kimi started at 1300 t/s and I stopped it at 50% after 2 min 30 seconds. I also tested Nemo on a 2.6MB markdown file, which it processed in 2-3 minutes (didn't get the exact time) using 64% of a 900K context. Now, these models were not like-for-like since Nemo is smaller than Kimi, so I would expect Kimi to be slower. I get what you are saying regarding Kimi Linear being undertrained and I will take a look at it again if they refine it. For now - for long context work - I am using Nemo.
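
To put the Nemo numbers above in perspective, here is a rough back-of-the-envelope sketch. It assumes the processing speed fell linearly from the start rate to the end rate, which the comment does not actually state, so the token count is only an estimate. The function name `estimate_prompt_tokens` is made up for illustration.

```python
def estimate_prompt_tokens(start_tps: float, end_tps: float, seconds: float) -> float:
    """Estimate tokens processed, ASSUMING speed decays linearly over the run."""
    average_tps = (start_tps + end_tps) / 2
    return average_tps * seconds


# Figures quoted in the comment: 6300 t/s at the start, 4300 t/s at the end, 58 s total.
nemo_tokens = estimate_prompt_tokens(6300, 4300, 58)

# Implied bytes per token for a 1MB file (taking 1MB as 1,000,000 bytes, an assumption).
bytes_per_token = 1_000_000 / nemo_tokens

print(f"estimated prompt size: {nemo_tokens:,.0f} tokens")
print(f"implied density: {bytes_per_token:.2f} bytes/token")
```

If the slowdown was not linear, the real token count could differ noticeably; the tokenizer's own count is the reliable number.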

Ok_Warning2146@reddit (OP)

Can you also try IQ4\_XS? Its PPL is only slightly worse than MXFP4 but it is 1GB smaller, so you can run roughly 100k more context.
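
The "1GB buys roughly 100k context" rule of thumb above can be turned into a small calculator. This is only a sketch: it treats 1GB as 10^9 bytes and assumes memory per token stays constant as context grows, neither of which is stated in the thread. The helper names are hypothetical.

```python
def implied_bytes_per_token(freed_bytes: float, extra_tokens: int) -> float:
    """Per-token memory cost implied by 'freed_bytes buys extra_tokens of context'."""
    return freed_bytes / extra_tokens


def extra_context(freed_bytes: float, bytes_per_token: float) -> int:
    """Extra context tokens that fit in freed_bytes, ASSUMING a constant per-token cost."""
    return round(freed_bytes / bytes_per_token)


GB = 10**9  # decimal gigabyte, an assumption

# From the comment: a quant that is 1GB smaller allows roughly 100k more context.
cost = implied_bytes_per_token(1 * GB, 100_000)
print(f"implied cost: {cost:.0f} bytes per token")

# Hypothetical case: a quant that frees 2GB instead.
print(f"2GB freed: about {extra_context(2 * GB, cost):,} extra tokens")
```

Compute buffers and other overheads also change with context size, so treat the result as a starting point and confirm by actually loading the model.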

EdenistTech@reddit

For me, the quality of the output is not that impressive. If context length is your main priority, you might want to look at Nemo 30B. Someone posted about running that model with 1M+ ctx on a 3090. I have tried it with 500K context with no issues. It is about as fast as Kimi Linear and, to be honest, the output appears to be higher quality (despite KL having 17b more parameters).

Ok_Warning2146@reddit (OP)

KL is an undertrained experimental model, so it is expected to do poorly on benchmarks except the long context benches. I am aware that Nemo 30B can also run high context. But at contextarena, its long context performance is way worse than Kimi Linear's. Thanks for telling us you found the contrary.

Deep_Traffic_7873@reddit

the benefit is only for nvidia?

Ok_Warning2146@reddit (OP)

I find that in CPU-only mode, the pp gain is only 8%. But then I am using a 13-year-old i7 4930k with DDR3-1600, and I see that the CPU is not fully utilized during pp, so this is probably not indicative of what most people will get. Would be great if you can tell us what you get. I cannot test other backends as I don't have the hardware.

GodComplecs@reddit

My man, invest in a cpu and ram. Those are really holding your 3090 back a lot in other tasks, but offloading is impacted too!

Ok_Warning2146@reddit (OP)

Was thinking about that but then ClosedAI jacked up the RAM price. Now need to wait a bit longer. :\*-(

kaisurniwurer@reddit

I'm interested too, since it's a great model for hybrid CPU.

Ok_Warning2146@reddit (OP)

I observed that CPU-only mode also uses less RAM, but the reduction is not as large as with pure CUDA. You can increase your context and see how far you can go in hybrid mode.

jacek2023@reddit

I have 5070 only :)