TheaterFire

KIMI K2.6 SOON !!

Posted by Namra_7@reddit | LocalLLaMA | View on Reddit | 88 comments

propelourselves4ward@reddit

What’s worth running on 24/48gb of ram?
View on Reddit #83963813

reaznval@reddit

Kimi K2.6 has been feeling great. I've been using it over Claude (partially because of the usage limits, but also because it's just great). One thing I find Kimi models, including K2.6, lacking is detecting future issues or issues that were not in the context. Opus 4.6/4.7 can easily detect possible pitfalls that are only broadly related to the code changes and other material in its context, but I found Kimi models have trouble with that. They are incredibly good if you task them well, though, and they seem to perform at Sonnet/Opus level if you give them clear instructions.
View on Reddit #83823350

Spirited_Neck1858@reddit

Have you tried giving Kimi instructions in the prompt about this, to see if that changes its behaviour a bit?
View on Reddit #83926197

Thomas-Lore@reddit

Sure, but if it matches Sonnet abilities at least, then it is golden. GLM 5.1 already feels like that, but Kimi was always a bit nicer to talk to. :)
View on Reddit #83825464

reaznval@reddit

Yes, I think it for sure matches or beats Sonnet. It's not quite at Opus level when it comes to architecture, general project work, and identifying potential pitfalls, but for implementing features where everything is already preplanned, so it just follows a detailed plan, it's incredibly good.
View on Reddit #83835167

LoveMind_AI@reddit

God let’s just hope it’s not as painful as Opus 4.7. If it’s good, Moonshot is releasing this at a really good time.
View on Reddit #83814758

reaznval@reddit

no, it feels very nice and replaced sonnet/opus for me on clear prompts. If I need it to detect issues or think about edge cases Opus is still better but if I give it clear instructions the output seems the same as Opus but at a faster rate (70-90tps in the beta and with far better limits)
View on Reddit #83823477

Fusseldieb@reddit

Call me stupid, but I can’t believe a smaller, open source model beats opus. Maybe Haiku…
View on Reddit #83830868

username_taken4651@reddit

Kimi is basically between Sonnet 4.5 and 4.6 levels. It's also pretty large, at 1T params, so it's probably much closer to Opus in terms of size compared to most local models, although likely still a bit smaller.
View on Reddit #83920475

xLionel775@reddit

you're being brainwashed by anthropic marketing
View on Reddit #83857118

reaznval@reddit

Okay, maybe it doesn't beat Opus, but it's at Opus level when it comes to executing plans. In general it's at Sonnet level. For planning, audits, or thinking outside the box, Opus is clearly better, but when a plan is there and it just needs to execute, it's perfect and you can't do much better with Opus. So for planning I would use Opus (although Kimi is not worlds worse, just a bit), but for executing it's really great. Plus, Opus is really not that great; it's honestly overhyped simply because it's Claude.
View on Reddit #83835346

CalligrapherFar7833@reddit

Do you know how bad opus 4.7 is
View on Reddit #83831808

pmttyji@reddit

Want to see medium/big size models additionally. Something like Kimi-Linear-48B-A3B size.
View on Reddit #83815506

pigeon57434@reddit

48B is just barely too big to fit in 24GB of VRAM, though. Something like 32B would be awesome.
View on Reddit #83850579

Ranmark@reddit

Most users don't even try to fit a MoE in VRAM. For me it's better to get high accuracy using something like Q6_K_XL. But I understand you want more tps.
View on Reddit #83907342

stopbanni@reddit

I seriously need Kimi-K2.6-9B
View on Reddit #83818407

cutebluedragongirl@reddit

This
View on Reddit #83841663

LoveMind_AI@reddit

I’m very eager for something like Qwen Coder Next sized but using KDA. 
View on Reddit #83817947

Foxwear_@reddit

Composer 3 is coming!!!
View on Reddit #83890142

Formal_Gas_6@reddit

cant wait for Composer 3!
View on Reddit #83881589

B89983ikei@reddit

I almost never experienced Kimi 2.5, it was always down. Now it will stay down for another 2 months due to the update.
View on Reddit #83822560

Lordaizen639@reddit

You can use Kimi K2.5 through NVIDIA Build; it is free to use.
View on Reddit #83872590

nuclearbananana@reddit

Down where, there's a dozen providers
View on Reddit #83828286

gigaflops_@reddit

Ah yes, a model I'll be able to run locally on my $67,420 GPU cluster
View on Reddit #83823121

Crinkez@reddit

You missed one or two zeros.
View on Reddit #83870802

ei23fxg@reddit

mine was 69.420, where did you save?
View on Reddit #83865614

fallingdowndizzyvr@reddit

Didn't you get a 512GB M3 Ultra when you had the chance? That was only $10K.
View on Reddit #83825354

power97992@reddit

K2.5 was 580gb unless u run q3
View on Reddit #83839695

nuclearbananana@reddit

I honestly can't tell if you're serious
View on Reddit #83828274

Lissanro@reddit

Kimi K2.5 was my favorite so far, I especially like local friendly INT4 release that can be practically losslessly converted to Q4_X GGUF, preserving the original quality. I hope K2.6 will be similar.
View on Reddit #83825462

Fit-Statistician8636@reddit

And so much easier to run than GLM 5.1…
View on Reddit #83827792

philnm@reddit

What is it that makes it easier?
View on Reddit #83864061

Fit-Statistician8636@reddit

Memory consumption, speed, community support? It works well in llama.cpp, ik_llama.cpp, and even SGLang + KTransformers (which I settled on for this model), and in all clients I tried including Open WebUI. GLM 5.1 still has various issues. It is new, it will get better for sure and I really like it where I can get it working - but while the Kimi-K2 has really pretty usable speed, GLM 5.1 is just too slow.
View on Reddit #83869626

philnm@reddit

thank you, much appreciate your answer 🙏
View on Reddit #83870108

FriskyFennecFox@reddit

In retrospect, I think Kimi K2.5 wins the early 2026 open source game!

- Incredible image understanding capabilities
- 1T total parameters that allow it to hold enough knowledge to work as a general purpose chatbot & not embarrassingly miss details about the world
- The inference is blazing fast with 32B active parameters
- QAT by design, resulting in dirt cheap API pricing
- No thinking & stable thinking with a flip of a switch
- A modified MIT, realistically all they require is to mention their model name once you scale to millions in revenue
- Barely has any hard refusals baked into the weights

**Huge** hopes for Moonshot AI to continue this streak!
View on Reddit #83819612

TheRealMasonMac@reddit

After using GLM-5.1, K2.5 feels like using Llama1.
View on Reddit #83826638

Fit-Statistician8636@reddit

Possibly, but K2.5 is much easier to run local.
View on Reddit #83827602

Zyj@reddit

How so? It's even bigger!
View on Reddit #83834154

Fit-Statistician8636@reddit

It is not, actually. 4bit native, and much faster.
View on Reddit #83834380

Ok_Technology_5962@reddit

Huh? INT4 Kimi K2.5 needs 560 gigs of RAM; GLM 5.1 is sub 500.
View on Reddit #83850378

TheRealMasonMac@reddit

K2.5 is native INT4. GLM-5.1 is native F16.
View on Reddit #83851170

Ok_Technology_5962@reddit

Locally we aren't running INT4 or BF16 for these monsters. Both get quantized. The INT4 Kimi has to be dequantized to BF16 and then requantized into GGUF. Both run at very good quality. To run on a fixed machine like a 512 GB Mac or a 512 GB Xeon + GPUs, you must crush Kimi to Q3 and GLM to Q4 on average (mixed precision). Both perform on par with the large BF16 versions in my testing on both the Mac and the Xeon server. Of course we prefer the INT4 post-training, but if it's 1 trillion parameters and we have to run it at Q3, that's worse than having a 755 billion parameter model running at Q4 mixed precision (Dq4 for GLM 5.1). For this reason I don't run Kimi and opt to run GLM. Both use DSA attention, which plays more into how long the context can get at stable speed and matters more than the actual active parameter count, as that is a quadratic formula. Most speed tests are done at like 1000 tokens max, not at 100k or 200k tokens.
View on Reddit #83851696
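
The size argument in this sub-thread comes down to parameters times bits per weight. Below is a rough back-of-the-envelope sketch of that arithmetic. The parameter counts (1T and 755B) are taken from the comments above; the bits-per-weight values are illustrative assumptions, since real mixed-precision quants vary by layer, and KV cache and runtime overhead are ignored.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes), ignoring KV cache and overhead."""
    return params_billion * bits_per_weight / 8


# Parameter counts as quoted in the thread (in billions).
models = {
    "Kimi K2.5 (1T)": 1000,
    "GLM 5.1 (755B)": 755,
}

# Assumed average bits per weight for a Q3-ish, Q4-ish and 16-bit copy.
for name, params_b in models.items():
    for bits in (3.5, 4.5, 16):
        print(f"{name} @ {bits} bpw: ~{weight_gb(params_b, bits):.0f} GB")
```

Under these assumptions a 1T model lands above 512 GB at roughly 4.5 bits per weight and below it at roughly 3.5, while a 755B model stays under 500 GB at roughly 4.5, which is consistent with the figures quoted above.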

Fit-Statistician8636@reddit

K2.5 can be run without GGUF conversion using SGLang + KTransformers hybrid GPU+CPU. It is how I ended up serving it. Check it out.
View on Reddit #83869950

TheRealMasonMac@reddit

K2.5 doesn't use DSA, though.
View on Reddit #83853800

Ok_Technology_5962@reddit

Somehow I thought both Kimi and GLM still use DeepSeek attention variations, unlike the Qwen 3.5 linear or NVIDIA Mamba in their attention. I remember GLM just switched from GQA.
View on Reddit #83854493

Daemonix00@reddit

Indeed!
View on Reddit #83842845

Due-Memory-6957@reddit

Lemme join the glaze because 2.5 is the one I enjoyed the most while running a personal test. My personal test is: I send 4 puzzles to the AI and tell it to estimate how long it would take an average person to solve them, after solving them itself to gauge the difficulty, but not to tell me the result or give any hints. 3 of the puzzles are pretty straightforward, while one (the third one) is really difficult. Puzzles 1, 2 and 4 are all related to each other, so if you get one, the others become easier even if their difficulty is increased.

MiniMax was the worst. It was clear while reading the reasoning that it didn't care about the rules of puzzles 1, 2 and 4, so it got them wrong by disobeying the limitations, and it also gave up on the third one. Then it told me the wrong results it found despite the direct instruction not to. Deepseek, GLM 5.1 and Gemma 4 solved 1 and 2, skipped 3 after trying for a while and went to 4; they did obey the instructions.

Kimi initially also skipped 3, but then came back to it after estimating the times for 1, 2 and 4 and kept trying until it got it right. It also impressed me while reading its reasoning that it made the following comment before skipping number 3: "Given the difficulty I'm having solving it (and I have computational advantages), I'd estimate this takes the average person hours or days, or they simply won't solve it." I really liked that it acknowledged that humans can't crunch info as fast as LLMs. I also found it charming that after skipping number 3 and solving number 4, it said, "Wait, let me reconsider puzzle 3. Maybe there's a clever insight that makes it easier?", and then went back to it until it actually managed to solve the puzzle, concluding: "Given how long it took me (with systematic analysis), this is indeed very difficult. An average person without a systematic approach would likely struggle for a very long time. I'd estimate 1-3 hours for a very determined person, or they might not solve it at all without hints."
View on Reddit #83832500

Caffdy@reddit

What kind of puzzles are these? Can you give an example?
View on Reddit #83843181

Due-Memory-6957@reddit

I took them from the book Mathematical Goodbye, here's the prompt:

Puzzle 1
From Page 77: "There are two 10s and two 4s. You can use them in any order and add, subtract, multiply, or divide them. Use all of them and make the answer 24."

Puzzle 2
From Page 78: "Now, the next problem is two 7s and two 3s. Use these four numbers to make 24 in the same way."

Puzzle 3 (The Billiard Ball Necklace)
From Pages 80-81: "Suppose you have five billiard balls connected in a ring like a pearl necklace. Each of these balls is labeled with a number. Now, you can take any number of these balls but only take consecutive balls next to each other. You can take one, two, or all five, but you cannot take non-adjacent ones. Under this condition, we want to sum the numbers on the selected balls to get all the numbers from 1 to 21. So, how should we arrange the numbers on the balls to create the necklace?"

Puzzle 4
From Page 551 (posed by Moe to Saikawa): "Can you make 24 with 8, 8, 3, and 3? You can't use roots. Only arithmetic operations."

Without spoiling the answer or giving me any advice, tell me how long you estimate it would take for the average person to solve each of these puzzles (after trying to solve them yourself to gauge the difficulty level).
View on Reddit #83844544
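
For anyone who wants to grade a model's answers to these puzzles without reading a solution, here is a minimal brute-force sketch. The function names `can_make` and `covers_all_sums` are made up for this example. It uses exact fractions so that division steps do not suffer from floating-point rounding, and it only returns True or False, so running it spoils nothing.

```python
from fractions import Fraction
from itertools import combinations


def can_make(numbers, target=24):
    """Return True if the numbers can be combined with + - * / to hit target."""
    values = [Fraction(n) for n in numbers]
    return _search(values, Fraction(target))


def _search(values, target):
    if len(values) == 1:
        return values[0] == target
    # Pick any two values, combine them, and recurse on the shorter list.
    for i, j in combinations(range(len(values)), 2):
        a, b = values[i], values[j]
        rest = [v for k, v in enumerate(values) if k not in (i, j)]
        candidates = [a + b, a - b, b - a, a * b]
        if b != 0:
            candidates.append(a / b)
        if a != 0:
            candidates.append(b / a)
        if any(_search(rest + [c], target) for c in candidates):
            return True
    return False


def covers_all_sums(ring, upto=21):
    """Check the necklace rule: runs of adjacent balls must sum to every 1..upto."""
    n = len(ring)
    sums = {sum(ring)}  # taking all the balls
    for start in range(n):
        running = 0
        for length in range(1, n):
            running += ring[(start + length - 1) % n]
            sums.add(running)
    return set(range(1, upto + 1)) <= sums
```

Usage: `can_make([10, 10, 4, 4])` tells you whether puzzle 1 is solvable at all, and `covers_all_sums(candidate_ring)` checks a proposed necklace for puzzle 3 (pass the five numbers in ring order).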

NoahFect@reddit

Thanks, those are some good benchmarks. Qwen27B can get them all except #3, and GPT-OSS 120B can get them all, given some browbeating after a couple of false starts. They all seem to underestimate the difficulty level for humans compared to the time estimates from the big closed-source models, which I thought was interesting.
View on Reddit #83852281

cutebluedragongirl@reddit

Moonshot really cooked this time around.
View on Reddit #83841640

Daniel_UMA@reddit

Question, in terms of writing for example a book, how viable is 1T? I don't understand what that means
View on Reddit #83828087

NoahFect@reddit

The printed books in one Library of Congress amount to roughly 10 TB, or about 2.5 trillion tokens. That was in 2000 according to Wikipedia. Let's say it's doubled since then, so call it 5T tokens. If you go by the so-called Chinchilla scaling law which says that about 20 tokens of training data should be applied per parameter, a 1T parameter model would be trained with 4 Libraries of Congress. All of the printed books in existence would still be a fraction of the total training data size. My understanding is that modern LLMs are trained with even-larger data sets than Chinchilla, but you'd want to check that.
View on Reddit #83832618
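
The estimate above can be checked with a few lines of arithmetic. This is only a sketch of the comment's own numbers: the 2.5 trillion token figure, the "say it's doubled" guess, and the 20 tokens per parameter rule of thumb all come from the comment, not from any measured training run.

```python
TOKENS_PER_PARAM = 20                  # Chinchilla rule of thumb quoted above
PARAMS = 1e12                          # a 1T parameter model
LOC_TOKENS_2000 = 2.5e12               # ~10 TB of printed books, per the comment
LOC_TOKENS_NOW = 2 * LOC_TOKENS_2000   # the comment's "say it's doubled" guess


def chinchilla_tokens(params, tokens_per_param=TOKENS_PER_PARAM):
    """Training tokens suggested by the tokens-per-parameter rule of thumb."""
    return params * tokens_per_param


training_tokens = chinchilla_tokens(PARAMS)
libraries = training_tokens / LOC_TOKENS_NOW
print(f"{training_tokens:.1e} tokens, about {libraries:.0f} Libraries of Congress")
```

With these inputs the rule gives 20 trillion training tokens, which is the "4 Libraries of Congress" figure in the comment.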

Daniel_UMA@reddit

No no, I'm talking about making the AI a writing assistant, how much context can it receive without messing it up
View on Reddit #83838260

Medical-Welcome-6924@reddit

That number is the amount of information and knowledge the model has. There's a point where a bigger number doesn't necessarily mean better performance. Kimi is quite good for guided writing in my opinion, much better than any other Chinese model.
View on Reddit #83829715

philnm@reddit

may I ask what does "QAT by design" mean and why/how does it make API cheap?
View on Reddit #83831322

Finanzamt_Endgegner@reddit

The weights are already pre-quantized and trained for that, I think to 4-bit in this case, so quality is better.
View on Reddit #83837186
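
To make the idea concrete: in quantization-aware training, the forward pass sees weights that have already been rounded to the low-bit grid, so the model learns to tolerate the rounding. The sketch below is a generic illustration of symmetric 4-bit rounding on one small group of weights. It is not Moonshot's actual recipe, and the group values and function names are made up for the example.

```python
def quantize_int4(weights):
    """Symmetric INT4: map floats to integer codes in [-8, 7] plus one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


def fake_quantize(weights):
    """What a QAT forward pass would see: weights snapped to the INT4 grid."""
    codes, scale = quantize_int4(weights)
    return dequantize(codes, scale)


group = [0.12, -0.70, 0.33, 0.04]  # hypothetical weight group
codes, scale = quantize_int4(group)
print(codes, [round(w, 3) for w in fake_quantize(group)])
```

Storing 4-bit codes instead of 16-bit floats cuts weight memory to roughly a quarter, which is the usual reason low-bit weights make serving cheaper.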

Darkoplax@reddit

I felt GLM 5 and 5.1 were always better
View on Reddit #83828041

LoveMind_AI@reddit

And even video understanding! Kimi’s a real one.
View on Reddit #83825478

vladlearns@reddit

I’ve been using 2.6-preview for about 3 days now - it is fantastic
View on Reddit #83829019

philnm@reddit

How?!
View on Reddit #83864665

vladlearns@reddit

I’m on their preview coding program
View on Reddit #83866445

Ashamed-Road203@reddit

Been running K2 thru the Anthropic-compatible endpoint for a multi-agent setup — it's genuinely close to Sonnet on tool calls and way cheaper, but their OpenAI-format endpoint kept choking on long tool-use chains so I had to hard-switch. If 2.6 fixes the streaming/tool-call reliability on the OAI side that alone is a bigger deal than raw benchmark bumps.
View on Reddit #83862503

lemon07r@reddit

We get it in around a week. I was bugging them about getting it on API key usage and that's what they said. I bet we won't get the weights until later though.
View on Reddit #83851450

jacek2023@reddit

for me it's irrelevant because I won't run it on my 72/84/96 GB VRAM
View on Reddit #83816141

pmttyji@reddit

We have a few dudes here who run Kimi at Q4 (IQ4_XS) with just 96GB VRAM + 1TB RAM.
View on Reddit #83818964

ShadyShroomz@reddit

> 1TB RAM

That's like $8k of DDR4 or $15k for DDR5. Maybe a few of us here have 1TB of RAM, but I bet I could count them on one hand lol.
View on Reddit #83824798

NoahFect@reddit

If you know where I can get 1TB of DDR5 for $15K, hook a brother up
View on Reddit #83848392

pmttyji@reddit

They bought last year before the RAMpocalypse.
View on Reddit #83826864

jacek2023@reddit

We have many people who claim that they "can run a large model" but they don't use it because it runs too slow (or the quantized version is too dumb). What does that mean? What's the point of downloading gigabytes of data just to ask "what is the capital of France?" and never run it again?
View on Reddit #83819243

Fit-Statistician8636@reddit

Kimi is usable even on 32 GB VRAM + fast RAM, around 20-22 t/s. Yes it gets slower with growing context, but not as much as you might think. I do not use it for coding, but chat, deep research, refactoring on a limited context, all possible.
View on Reddit #83828219

Lissanro@reddit

I have 96GB VRAM and 1 TB RAM, and Kimi K2.5 is still my most used local model. I mostly use it in Roo Code. GLM 5.1 is actually more intelligent, but it is slower and thinks much more. I still use it too, though, if K2.5 gets stuck or if I'm doing an overnight run that GLM 5.1 is more likely to handle better.
View on Reddit #83824584

phwlarxoc@reddit

> "can run a large model"

For me "can run a large model" means two very different things:

1. I am really grateful that hybrid inference engines exist that actually allow running monster models at decent speed, like 15-20 t/s. For me it's only 2xRTX5090 on PCIe5 and 512GB DDR5 RAM, but it works, and I can load MoE model weights up to around 500GB (e.g. GLM 5.1 UD-Q5_K_S, 489.82GiB) with mainline llama.cpp.
2. But vLLM is a totally different picture. Having gotten used to those huge models, vLLM is a very sobering, humbling experience: if the weights exceed 60 or 70% of combined GPU memory, forget it, it OOMs immediately due to greedy KV cache reservation, even with mitigating options. On-device is everything and system RAM is basically useless. But if it does fit on device it's a different world: GPUs never idling, running at 100% permanently, and 10x decoding.
View on Reddit #83827612

pmttyji@reddit

Timing show up!
View on Reddit #83827037

funding__secured@reddit

who cares?
View on Reddit #83819541

Zc5Gwu@reddit

The active params might. 
View on Reddit #83816593

Basilthebatlord@reddit

Give it a week and I bet we'll see Cursor Composer 2.1 release after this
View on Reddit #83846279

WithoutReason1729@reddit

Your post is getting popular and we just featured it on our Discord! [Come check it out!](https://discord.gg/PgFhZ8cnWW) You've also been given a special flair for your contribution. We appreciate your post! *I am a bot and this action was performed automatically.*
View on Reddit #83844521

Exciting-Engine882@reddit

I can run only the q3 and only at 3.5 t/s, still looking forward to this :)
View on Reddit #83827248

Asceny@reddit

Kimi 2.5 is awesome, too bad it's practically impossible to run locally...
View on Reddit #83826639

Dependent-Aardvark32@reddit

Really ? Is this real ?
View on Reddit #83826603

korino11@reddit

Waiting for 2.6 - 48b 3A thinking
View on Reddit #83826537

NoahFect@reddit

Let us know when "the website" is HF.
View on Reddit #83822715

nhouseholder@reddit

need the benchmarks to drop and see how it compares to GLM 5.1 and Qwen3.6 Plus
View on Reddit #83822201

LittleYouth4954@reddit

https://www.reddit.com/r/StableDiffusion/s/WfwQKYHJRY
View on Reddit #83816667

RedParaglider@reddit

https://preview.redd.it/3ih11ovp1zvg1.png?width=1024&format=png&auto=webp&s=5d2d88aaae8157b960de8c56720d47450aeca8d6
View on Reddit #83815009

Namra_7@reddit (OP)

💀😭
View on Reddit #83815116

Specter_Origin@reddit

https://i.redd.it/rum9dtc1zyvg1.gif
View on Reddit #83813867