Kimi 2.6 question
Posted by vhthc@reddit | LocalLLaMA | 18 comments
I am aware that this is kinda a dumb question, but I think I am missing something.
Kimi 2.6 is a 1.1T model with 30B active parameters. It is encoded in INT4, hence its size is ~600GB.
So with 768GB RAM and 2x3090 (=48GB VRAM) it should be possible to run this, right? 600GB in RAM, ~18GB of active parameters in VRAM, and a context of 100-200k tokens should fill the remaining 30GB of VRAM.
I don't expect the speed to be great - maybe 10 t/s?
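Here is my back-of-the-envelope math as a rough Python sketch (the 1.1T/30B figures are from above; the 0.5 bytes/param ignores quantization scales, so real files run somewhat larger):

```python
# Rough INT4 sizing for a 1.1T-total / 30B-active MoE model.
# Ballpark only: ignores quantization scales and KV cache.
TOTAL_PARAMS = 1.1e12    # 1.1T total parameters
ACTIVE_PARAMS = 30e9     # 30B active per token
BYTES_PER_PARAM = 0.5    # INT4 = 4 bits = 0.5 bytes

total_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9    # ~550 GB of weights
active_gb = ACTIVE_PARAMS * BYTES_PER_PARAM / 1e9  # ~15 GB active per token

print(f"full model:   ~{total_gb:.0f} GB -> fits in 768GB RAM")
print(f"active slice: ~{active_gb:.0f} GB -> fits in 48GB VRAM, leaving room for KV cache")
```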
I think 2x3090 (or more) is something a lot of people here on the sub have available. The 768GB RAM is a harder problem, but before the RAM price spike this was about $2,500 (12x 64GB DDR5 sticks at ~$200 each). So besides the CPU and motherboard needing to be premium to support that capacity, this sounds like a machine a lot of people could run locally - I would call it the "advanced hobbyist" price range :-)
So why are people saying Kimi 2.6 is not "local" for most people? Am I missing something? (Serious question - I do not have a 768GB RAM machine, but I am tempted once prices come down at some point.)
Thanks!
FoxiPanda@reddit
What's the memory bandwidth on that RAM?
If it's some DDR4 sad panda socket with 200GB/s or less...you're gonna have a bad time.
If it's a Genoa/Turin based system with 600GB/s+ of memory bandwidth + the 3090s, you're going to have a better time.
I have Kimi K2.6 running on my Mac Studio 512GB (baa-ai's 344GB quant), and as long as I turn thinking off, I get ~24 tok/s with reasonable settings. It's not impossible... but it's not fast either. And any machine capable of running it at decent speed is pretty expensive: $7,000 for 256GB of DDR5-6400 + $1,000 CPU + $800 motherboard + $1,400 GPU + $1,400 GPU + $500 storage minimum = a $12,100 system.
That's not "affordable" by the vast majority of people lol...
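To put rough numbers on the bandwidth point: decode speed on a CPU-only setup is bounded above by memory bandwidth divided by the bytes streamed per token. A minimal sketch, assuming ~15GB of INT4 active weights per token and ignoring GPU offload (hybrid setups land higher):

```python
# Decode upper bound: every token streams the active expert weights
# through memory at least once, so tokens/s <= bandwidth / bytes_per_token.
ACTIVE_BYTES = 30e9 * 0.5  # ~15 GB of INT4 active weights per token

for name, bw_gbs in [("2-ch DDR5 desktop", 90),
                     ("sad DDR4 server", 200),
                     ("Genoa/Turin 12-ch", 600)]:
    ceiling = bw_gbs * 1e9 / ACTIVE_BYTES
    print(f"{name:18s} {bw_gbs:>4} GB/s -> <= {ceiling:.0f} tok/s ceiling")
```

Real throughput lands well under these ceilings once KV cache reads and non-ideal memory access are factored in.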
Front_Eagle739@reddit
How's your long-context performance? I'm getting about the same as you at 0 context and about 14 tok/s at 80k with the 3.6-bit Inferencer Labs build with his app, but I'd rather use something else.
FoxiPanda@reddit
I will have to test again, tbh. Kimi is great and all, but even at 24 tok/s early in context, it's kinda barely usable? I'll load it up, give it a ~80K-token long-context run, and report back.
RevolutionaryGold325@reddit
Looks like he wrote DDR5. If that is 12 channels, it would not be terrible.
FoxiPanda@reddit
Sure, but it could easily be a 2-channel DDR5 platform that gets 90GB/s or less. /shrug
FriskyFennecFox@reddit
You're not missing anything, it's supposed to work well on just one or two GPUs that can host the active parameters! The problem is that not a lot of us have 512GB+ of RAM.
Most-Trainer-8876@reddit
Wait, that doesn't make sense - "one or two GPUs that can host the active parameters"? I thought the set of experts used changes on every token generated.
vhthc@reddit (OP)
Yes, they change of course, but transferring 18GB to VRAM per prompt is fast.
gliptic@reddit
Per prompt? The experts change per token.
vhthc@reddit (OP)
Oh, that makes sense - I didn't know, thanks.
Front_Eagle739@reddit
Well, it's not. Even at PCIe 5.0's 64GB/s you would still only be talking a ceiling of ~4 tokens/s doing that, but thankfully there's a lot of expert reuse, and you do a lot of the work on the CPU as well. It works fine for prefill done that way, though.
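Spelled out, that ceiling is just bus bandwidth over bytes moved per token (assuming ~18GB of active weights cross the bus each token with zero reuse, the pessimistic case):

```python
# Worst case: stream all active expert weights over PCIe for every token.
PCIE5_X16_GBS = 64       # ~64 GB/s one-way for a PCIe 5.0 x16 slot
ACTIVE_WEIGHTS_GB = 18   # active-parameter bytes per token (from the thread)

print(f"ceiling: {PCIE5_X16_GBS / ACTIVE_WEIGHTS_GB:.1f} tok/s")  # ~3.6 tok/s
```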
jacek2023@reddit
"So why are people saying the Kimi 2.6 is not "local" for most people?"
Because there is a difference between "I can run it locally!!!" (asking "What is the capital of France?") and hours of real work - agentic coding, for example.
segmond@reddit
Not everyone has to run agents; there are lots of workflows that are not agentic.
segmond@reddit
You can run it. Do you have at least 8-channel 3200MHz memory? If not, don't expect to see that speed. Don't be greedy. The killer is that prompt processing (PP) sucks.
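For reference, the peak theoretical bandwidth of that kind of setup (a sketch assuming DDR4-3200 on 64-bit channels; real-world throughput is noticeably lower):

```python
# Peak bandwidth = channels * transfers/s * bytes per transfer.
channels = 8
transfers_per_s = 3200e6  # DDR4-3200: 3200 MT/s
bytes_per_transfer = 8    # 64-bit channel width

bw_gbs = channels * transfers_per_s * bytes_per_transfer / 1e9
print(f"~{bw_gbs:.0f} GB/s peak")  # ~205 GB/s theoretical
```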
Double-Confusion-511@reddit
I have GPU servers, but did not know how to use them because they are not Nvidia.
bigh-aus@reddit
The problem you're not taking into account is how often it has to swap those experts in and out of VRAM. I really want to run Kimi, but the steps in cost seem to be:
- 1x Mac Studio 512GB: 14 tps
- 2x Mac Studios 512GB: 23 tps
- A DDR5 512GB-based system with an RTX 6000 Pro, then just keep adding cards (1, 2, 4, 8)
I wonder what the GB300 machines will give, but we're talking $100k.
lundrog@reddit
Maybe? Try it and report back
ghgi_@reddit
I mean... any model can be local if you try hard enough. BUT when most people talk about it, they mean reasonably - and reasonably, most people can't afford or don't want a mini home datacenter to run a model like this at terrible speeds. A reasonable local alternative, for example, would be the Qwen 3.6 models, which can fit on a single consumer card and punch way above their weight class (especially the 27B in my testing).