Kimi-Linear-48B-A3B-Instruct-GGUF Support - Any news?

Posted by Iory1998@reddit | LocalLLaMA | View on Reddit | 37 comments

Kimi-Linear seems to handle long context pretty well. Do you have any idea why it's still not implemented in llama.cpp?

37 Comments

zoyer2@reddit

Ok, here are my first impressions of the model for coding: tested on 2x3090 with 130k context, great speed at ~80 t/s. One-shotting seems kinda OK, and tool usage through Kilocode has been OK so far in a small/medium-sized project. I would say Qwen Next 80 A3B and GLM 4.7 flash might be better. Need a bit more time though.

Iory1998@reddit (OP)

Did you run it on llama.cpp?

zoyer2@reddit

Oops, accidentally removed that part, yes on llama.cpp! :)

ilintar@reddit

PR almost done, gonna come with another speedup to Qwen3Next as well.

kripper-de@reddit

https://github.com/ggml-org/llama.cpp/pull/18755

kripper-de@reddit

Kimi is coming!!!

LegacyRemaster@reddit

remember: you are our hero!

TheGlobinKing@reddit

Amazing, thanks! Is there a github issue I can follow for that Qwen3Next speedup?

Iory1998@reddit (OP)

Do you have any idea how much time is left before it gets merged?

Iory1998@reddit (OP)

Thank you for your hard work. Kimi-Linear is a good model. Please take good care of it :D

dinerburgeryum@reddit

The man right here folks

ilintar@reddit

Not my PR tho, just working with the author to make a common abstraction for delta net models.

dinerburgeryum@reddit

Yeah, you’re doing the work dude, keep it up. 👍

Amazing_Athlete_2265@reddit

Fuck yeah

Ok_Warning2146@reddit

If u want to run it, u can clone my repo. It should be almost the same as the one that is going to be merged. https://huggingface.co/ymcki/Kimi-Linear-48B-A3B-Instruct-GGUF

Iory1998@reddit (OP)

Thank you. I use LM Studio. I'll wait for the update.

coder543@reddit

We have _so many_ A3B models... I really want some A1B and A5B options to mix things up.

R_Duncan@reddit

Granite 4.0 has an A1B model. As expected, it is way less performant than the A3B version.

coder543@reddit

Granite 4.0 MoEs (the A#B naming) come in 32B A9B and 7B A1B sizes. It is not shocking that such drastically different sizes would perform differently, yes. These are also very low sparsity models. The rumor is that Gemini 3 Flash is a >1T model with a very low active parameter count. I have 128GB of medium-speed memory. I want a 200B A1B model that is released specifically in 4-bit precision (QAT, not PTQ). Extreme levels of sparsity, not 7B A1B.

R_Duncan@reddit

I think you'd have to train such an unbalanced model yourself; max sparsity so far is 80B-A3B.

coder543@reddit

That's why I mentioned that higher sparsity models seem to exist, they're just not open weight, and that's why I want such a model. If companies keep releasing A3B, that's their choice, but it will be hard to get excited about that.
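For reference, a rough sketch of the sparsity arithmetic behind this exchange. The parameter counts are the ones quoted in the thread; the 200B A1B entry is the hypothetical model asked for above, and the flat 4 bits per weight is an assumption that ignores quantization scales, KV cache, and runtime buffers.

```python
# Total and active parameter counts (in billions), as quoted in the thread.
# The last entry is the hypothetical model requested above, not a real release.
models = {
    "7B A1B": (7, 1),
    "32B A9B": (32, 9),
    "48B A3B": (48, 3),
    "80B A3B": (80, 3),
    "200B A1B (hypothetical)": (200, 1),
}


def sparsity_ratio(total_b, active_b):
    """Total parameters divided by active parameters per token."""
    return total_b / active_b


def weight_gb(total_b, bits_per_weight):
    """Approximate weight size in GB (1 GB = 1e9 bytes), weights only."""
    return total_b * bits_per_weight / 8


for name, (total, active) in models.items():
    ratio = sparsity_ratio(total, active)
    size = weight_gb(total, 4)  # assumed flat 4 bits per weight
    print(f"{name}: {ratio:.1f}x total/active, ~{size:.0f} GB of weights at 4-bit")
```

Under these assumptions, the 200B model's weights alone come to about 100 GB, which is why it is pitched at a 128GB machine, and its total/active ratio is far higher than any size listed in the thread.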

sloth_cowboy@reddit

Tell me where to start and I'll make them. BTW, I've never trained a model, and I don't have a server.

FullOf_Bad_Ideas@reddit

Get data from HF (FineWeb2/FinePDFs) and the HPLT3 project. Get comfortable with Megatron-LM and rent a 768xH100 node like [this one](https://gpulist.ai/detail/37a1aa3). Train a model with Megatron-LM on that node, then post-train with SFT, then do preference optimization with PPO/ORPO, and then do RL with GRPO in [slime](https://github.com/THUDM/slime). Hardware cost is the main limiting factor: I trained 4B A0.3B models on 60/80B tokens and it already cost thousands in compute. You'll need 10M to train a model successfully, but you can manage to do it on your own if you really want, since so much of the stack is open source.

FullOf_Bad_Ideas@reddit

I want more A20B and A30B. 120B A30B would be good. 70B A20B too.

Not_Syslog@reddit

The only A5B I know of is gpt-oss:120b

KvAk_AKPlaysYT@reddit

Here are the current experimental Guf-Gufs: [https://huggingface.co/AaryanK/Kimi-Linear-48B-A3B-Instruct-GGUF](https://huggingface.co/AaryanK/Kimi-Linear-48B-A3B-Instruct-GGUF) Keep in mind that you'd have to run it through the [**PR #17592**](https://github.com/ggml-org/llama.cpp/pull/17592) and not the master branch.

alhinai_03@reddit

I'm currently running the model with your branch, it's very promising, but I couldn't find any recommended inference settings, like temp, top-p, top-k. Any idea?

Iory1998@reddit (OP)

I use LM Studio, which is a few weeks behind the latest llama.cpp update. However, Kimi-Linear is an important model, and I think once it's merged into the main branch, LM Studio will quickly update their platform to support it. Do you have any idea how much time is left before it gets merged?

BasketFar667@reddit

They're rapidly declining due to restrictions and the fact that they're not fully open source. Qwen is winning, DeepSeek is winning too, and Kimi is lagging behind overall. Gemini is improving, but not by much. If GA reaches 3.0, it'll improve.

mr_zerolith@reddit

Good question, this one seems to have just been forgotten about

Iory1998@reddit (OP)

It took so long. I wish we could just get an update.

kaisurniwurer@reddit

You can always check git. Though I approve of you trying to generate hype too, since I'm personally interested.

Iory1998@reddit (OP)

Generate hype, he said.... with a one-line post! 🤦‍♂️🤦‍♂️

AnomalyNexus@reddit

Would this fit on a 24GB card? Guessing only with offload.

Iory1998@reddit (OP)

Well, it's a MoE, so it would still be fast.
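A back-of-envelope check on the 24GB question, as a sketch only. The bits-per-weight values and the 2 GB allowance for KV cache and buffers are illustrative assumptions, not measured figures for any actual GGUF file of this model, and `fits_in_vram` is a hypothetical helper written for this example.

```python
def fits_in_vram(total_params_b, bits_per_weight, vram_gb, overhead_gb=2.0):
    """Return (fits, weights_gb) for a model held entirely in VRAM.

    total_params_b  : total parameters in billions (all experts count,
                      since every expert must be resident somewhere)
    bits_per_weight : assumed average bits per weight after quantization
    overhead_gb     : assumed allowance for KV cache and runtime buffers
    """
    weights_gb = total_params_b * bits_per_weight / 8
    return weights_gb + overhead_gb <= vram_gb, weights_gb


# Assumed average bits per weight for a low, a mid, and a high quant level.
for bpw in (3.0, 4.5, 8.5):
    fits, gb = fits_in_vram(48, bpw, vram_gb=24)
    verdict = "fits" if fits else "needs offload"
    print(f"48B at {bpw} bpw: ~{gb:.1f} GB of weights -> {verdict} on 24 GB")
```

Under these assumptions, a 48B model only fits fully on a 24GB card at quite aggressive quantization; at mid-range quant levels some layers or experts would have to be offloaded to system RAM. Since only about 3B parameters are active per token, that offload tends to cost less speed than it would for a dense model of the same size.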

nuclearbananana@reddit

Do we have any other benchmarks besides context arena? That one is too specific to draw general conclusions from.

Amazing_Athlete_2265@reddit

https://old.reddit.com/r/LocalLLaMA/comments/1pvvv8m/kimilinear_support_in_progress_you_can_download/