KIMI K2.6 SOON !! | TheaterFire

[-]

propelourselves4ward@reddit

What’s worth running on 24/48gb of ram?

[-]

kimi k2.6 has been feeling great. been using it over claude (partially because of the usage limits but also because it's just great). One thing that I find kimi models, including k2.6, lacking is detecting future issues or issues that were not in the context. opus 4.6/4.7 can easily detect possible pitfalls that are only broadly related to the code changes and stuff it has in its context, but I found kimi models having issues with that. They are incredibly good if you task them well though, and they seem to perform on sonnet/opus level if you give them clear instructions

[-]

Spirited_Neck1858@reddit

have you tried giving kimi instructions in the prompt about this, and maybe see if that changes its behaviour a bit...

[-]

Thomas-Lore@reddit

Sure, but if it matches Sonnet abilities at least, then it is golden. GLM 5.1 already feels like that, but Kimi was always a bit nicer to talk to. :)

[-]

reaznval@reddit

yes, I think it for sure beats or is on sonnet standard. opus, not quite, when it comes to architecture, general project stuff and identifying potential pitfalls. but for implementing features where you already have the stuff preplanned, so it just follows a detailed plan, it's incredibly good

[-]

LoveMind_AI@reddit

God let’s just hope it’s not as painful as Opus 4.7. If it’s good, Moonshot is releasing this at a really good time.

[-]

reaznval@reddit

no, it feels very nice and replaced sonnet/opus for me on clear prompts. If I need it to detect issues or think about edge cases Opus is still better but if I give it clear instructions the output seems the same as Opus but at a faster rate (70-90tps in the beta and with far better limits)

[-]

Fusseldieb@reddit

Call me stupid, but I can’t believe a smaller, open source model beats opus. Maybe Haiku…

[-]

username_taken4651@reddit

Kimi is basically between Sonnet 4.5 and 4.6 levels. It's also pretty large, at 1T params, so it's probably much closer to Opus in terms of size compared to most local models, although likely still a bit smaller.

[-]

xLionel775@reddit

you're being brainwashed by anthropic marketing

[-]

reaznval@reddit

okay, maybe it doesn't beat opus, but it's on opus level when it comes to executing plans. in general it's on sonnet level, but for planning, doing audits, or thinking outside the box, opus is clearly better. when a plan is there and it just needs to execute, it's perfect and you can't get much better with opus. so for planning I would use opus (although kimi is not worlds worse, just a bit) but for executing it's really great. plus, opus is really not that great; it's honestly overhyped simply because it's claude

[-]

CalligrapherFar7833@reddit

Do you know how bad opus 4.7 is

[-]

pmttyji@reddit

Want to see medium/big size models additionally. Something like Kimi-Linear-48B-A3B size.

[-]

pigeon57434@reddit

48B is just barely too big to fit on 24GB of VRAM, though. Something like 32B would be awesome

[-]

Ranmark@reddit

Most users don't even try to fit MoE in VRAM. For me it's better to get high accuracy using something like Q6_K_XL. But I understand you want more tps

[-]

stopbanni@reddit

I seriously need Kimi-K2.6-9B

[-]

cutebluedragongirl@reddit

This

[-]

LoveMind_AI@reddit

I’m very eager for something like Qwen Coder Next sized but using KDA.

[-]

Foxwear_@reddit

Composer 3 is coming!!!

[-]

Formal_Gas_6@reddit

cant wait for Composer 3!

[-]

B89983ikei@reddit

I almost never experienced Kimi 2.5, it was always down. Now it will stay down for another 2 months due to the update.

[-]

Lordaizen639@reddit

You can use Kimi K2.5 through NVIDIA Build; it is free to use.

[-]

nuclearbananana@reddit

Down where, there's a dozen providers

[-]

gigaflops_@reddit

Ah yes, a model I'll be able to run locally on my $67,420 GPU cluster

[-]

Crinkez@reddit

You missed one or two zeros.

[-]

ei23fxg@reddit

mine was 69.420, where did you save?

[-]

fallingdowndizzyvr@reddit

Didn't you get a 512GB M3 Ultra when you had the chance? That was only $10K.

[-]

power97992@reddit

K2.5 was 580gb unless u run q3

[-]

nuclearbananana@reddit

I honestly can't tell if you're serious

[-]

Lissanro@reddit

Kimi K2.5 was my favorite so far. I especially like the local-friendly INT4 release that can be practically losslessly converted to Q4_X GGUF, preserving the original quality. I hope K2.6 will be similar.

[-]

Fit-Statistician8636@reddit

And so much easier to run than GLM 5.1…

[-]

philnm@reddit

What is it that makes it easier?

[-]

Fit-Statistician8636@reddit

Memory consumption, speed, community support? It works well in llama.cpp, ik_llama.cpp, and even SGLang + KTransformers (which I settled on for this model), and in all clients I tried including Open WebUI. GLM 5.1 still has various issues. It is new, it will get better for sure and I really like it where I can get it working - but while the Kimi-K2 has really pretty usable speed, GLM 5.1 is just too slow.

[-]

philnm@reddit

thank you, much appreciate your answer 🙏

[-]

FriskyFennecFox@reddit

In retrospect, I think Kimi K2.5 wins the early 2026 open source game!

- Incredible image understanding capabilities
- 1T total parameters that allow it to hold enough knowledge to work as a general-purpose chatbot & not embarrassingly miss details about the world
- The inference is blazing fast with 32B active parameters
- QAT by design, resulting in dirt cheap API pricing
- No thinking & stable thinking with a flip of a switch
- A modified MIT, realistically all they require is to mention their model name once you scale to millions in revenue
- Barely has any hard refusals baked into the weights

**Huge** hopes for Moonshot AI to continue this streak!

[-]

TheRealMasonMac@reddit

After using GLM-5.1, K2.5 feels like using Llama1.

[-]

Fit-Statistician8636@reddit

Possibly, but K2.5 is much easier to run local.

[-]

Zyj@reddit

How so? It's even bigger!

[-]

Fit-Statistician8636@reddit

It is not, actually. 4bit native, and much faster.

[-]

Ok_Technology_5962@reddit

Huh? INT4 Kimi K2.5 needs 560 gigs of RAM; GLM 5.1 is sub 500.

[-]

TheRealMasonMac@reddit

K2.5 is native INT4. GLM-5.1 is native F16.

[-]

Ok_Technology_5962@reddit

Locally we aren't running INT4 or BF16 for these monsters. Both get quantized. The INT4 Kimi has to be dequantized to BF16 and then requantized into GGUF. Both run at very good quality. To run on a fixed machine like a 512 GB Mac or a 512 GB Xeon + GPUs, you must crush Kimi to Q3 and GLM to Q4 on average (mixed precision). Both perform on par with the large BF16 versions in my testing on both the Mac and the Xeon server. Of course we prefer the INT4 post-training, but if it's 1 trillion parameters and we have to run at Q3, that's worse than having a 755 billion parameter model running at Q4 mixed precision (Dq4 for GLM 5.1). For this reason I don't run Kimi and opt to run GLM. Both use DSA attention, which plays more into how long the context can get at stable speed and matters more than the actual active parameter count, as that is a quadratic formula. Most speed tests are done at like 1000 tokens max, not at 100k or 200k tokens
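
The size argument in this exchange comes down to simple arithmetic: weight footprint is roughly parameter count times bits per weight. Below is a minimal Python sketch of that estimate. The helper names are hypothetical. The 1T and 755B parameter counts and the 560 GB figure are the ones quoted in the comments above. The 3.5 and 4.5 bits-per-weight values are illustrative assumptions for "Q3" and "Q4 mixed", not measured averages. KV cache and runtime overhead are ignored.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def effective_bits(size_gb: float, params_billion: float) -> float:
    """Average bits per weight implied by a total size and a parameter count."""
    return size_gb * 8 / params_billion


if __name__ == "__main__":
    # Parameter counts as quoted in the thread; bit widths are assumptions.
    scenarios = [
        ("Kimi 1T, pure 4-bit", 1000, 4.0),
        ("Kimi 1T, assumed ~3.5 bpw (Q3-ish)", 1000, 3.5),
        ("GLM 755B, assumed ~4.5 bpw (Q4 mixed)", 755, 4.5),
    ]
    for label, params_b, bpw in scenarios:
        print(f"{label}: {weight_gb(params_b, bpw):.1f} GB")

    # What the quoted 560 GB INT4 figure implies for a 1T-parameter model.
    print(f"560 GB / 1T params: {effective_bits(560, 1000):.2f} bits per weight")
```

Under those assumed bit widths both models land under 512 GB, which matches the commenter's point that a 1T model has to drop to roughly Q3 to fit where a 755B model can stay near Q4.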

[-]

Fit-Statistician8636@reddit

K2.5 can be run without GGUF conversion using SGLang + KTransformers hybrid GPU+CPU. It is how I ended up serving it. Check it out.

[-]

TheRealMasonMac@reddit

K2.5 doesn't use DSA, though.

[-]

Ok_Technology_5962@reddit

Somehow I thought both Kimi and GLM still use DeepSeek attention variations, unlike Qwen 3.5 with linear attention or NVIDIA with Mamba. I remember GLM just switched from GQA

[-]

Daemonix00@reddit

Indeed!

[-]

Due-Memory-6957@reddit

Lemme join the glaze because 2.5 is the one I enjoyed the most while running a personal test. My personal test is: I send 4 puzzles to the AI and tell it to estimate how long it would take an average person to solve them, after solving them itself to gauge the difficulty, but not to tell me the result or give any hints. 3 of the puzzles are pretty straightforward, while one (the third one) is really difficult. Puzzles 1, 2 and 4 are all related to each other, so if you get one, the others become easier even if their difficulty is increased.

MiniMax was the worst: it was clear while reading the reasoning that it didn't care about the rules of puzzles 1, 2 and 4, so it got them wrong by disobeying the limitations, and it also gave up on the third one. Then it told me the wrong results it found despite the direct instruction not to. Deepseek, GLM 5.1 and Gemma 4 solved 1 and 2, skipped 3 after trying for a while and went to 4; they did obey the instructions.

Kimi initially also skipped 3, but then came back to it after estimating the times for 1, 2 and 4 and kept trying until it got it right. It also impressed me while reading its reasoning that it made the following comment before skipping number 3: "Given the difficulty I'm having solving it (and I have computational advantages), I'd estimate this takes the average person hours or days, or they simply won't solve it." I really liked that it acknowledged that humans can't crunch info as fast as LLMs. I also found it charming that after skipping number 3 and solving number 4, it said, "Wait, let me reconsider puzzle 3. Maybe there's a clever insight that makes it easier?", and then went back to it until it actually managed to solve the puzzle, concluding: "Given how long it took me (with systematic analysis), this is indeed very difficult. An average person without a systematic approach would likely struggle for a very long time. I'd estimate 1-3 hours for a very determined person, or they might not solve it at all without hints."

[-]

Caffdy@reddit

what kind of puzzles are these? can you give an example?

[-]

Due-Memory-6957@reddit

I took them from the book Mathematical Goodbye, here's the prompt:

Puzzle 1
From Page 77: "There are two 10s and two 4s. You can use them in any order and add, subtract, multiply, or divide them. Use all of them and make the answer 24."

Puzzle 2
From Page 78: "Now, the next problem is two 7s and two 3s. Use these four numbers to make 24 in the same way."

Puzzle 3 (The Billiard Ball Necklace)
From Pages 80-81: "Suppose you have five billiard balls connected in a ring like a pearl necklace. Each of these balls is labeled with a number. Now, you can take any number of these balls but only take consecutive balls next to each other. You can take one, two, or all five, but you cannot take non-adjacent ones. Under this condition, we want to sum the numbers on the selected balls to get all the numbers from 1 to 21. So, how should we arrange the numbers on the balls to create the necklace?"

Puzzle 4
From Page 551 (posed by Moe to Saikawa): "Can you make 24 with 8, 8, 3, and 3? You can't use roots. Only arithmetic operations."

Without spoiling the answer or giving me any advice, tell me how long you estimate it would take for the average person to solve each of these puzzles (after trying to solve them yourself to gauge the difficulty level).
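
Whether each "make 24" puzzle is solvable at all can be checked by brute force, and a candidate necklace for puzzle 3 can be verified mechanically. The Python sketch below does both. The function names are hypothetical and come from neither the book nor the commenter. It uses exact fractions so division introduces no rounding error, and in keeping with the no-spoilers request it only reports whether a solution exists, not what it is.

```python
from fractions import Fraction
from itertools import combinations


def can_make(numbers, target=24):
    """True if all numbers can be combined with + - * / to reach target."""
    return _search([Fraction(n) for n in numbers], Fraction(target))


def _search(values, target):
    if len(values) == 1:
        return values[0] == target
    # Pick any two values, replace them with one combined result, recurse.
    for i, j in combinations(range(len(values)), 2):
        a, b = values[i], values[j]
        rest = [v for k, v in enumerate(values) if k not in (i, j)]
        results = {a + b, a - b, b - a, a * b}
        if b != 0:
            results.add(a / b)
        if a != 0:
            results.add(b / a)
        if any(_search(rest + [r], target) for r in results):
            return True
    return False


def ring_sums(ring):
    """All sums of consecutive balls on a ring, including the full ring."""
    n = len(ring)
    sums = {sum(ring)}
    for start in range(n):
        total = 0
        for length in range(1, n):
            total += ring[(start + length - 1) % n]
            sums.add(total)
    return sums


def is_valid_necklace(ring, upto=21):
    """True if the consecutive sums are exactly the numbers 1..upto."""
    return ring_sums(ring) == set(range(1, upto + 1))


if __name__ == "__main__":
    for puzzle in ([10, 10, 4, 4], [7, 7, 3, 3], [8, 8, 3, 3]):
        print(puzzle, "solvable:", can_make(puzzle))
    # A deliberately naive candidate, just to show the checker in use.
    print("1-2-3-4-5 valid:", is_valid_necklace([1, 2, 3, 4, 5]))
```

A five-ball ring has 5 starting points times 4 lengths plus the full ring, so exactly 21 consecutive selections. That means every selection must produce a different sum, which is why puzzle 3 is so much harder than the other three.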

[-]

NoahFect@reddit

Thanks, those are some good benchmarks. Qwen27B can get them all except #3, and GPT-OSS 120B can get them all, given some browbeating after a couple of false starts. They all seem to underestimate the difficulty level for humans compared to the time estimates from the big closed-source models, which I thought was interesting.

[-]

cutebluedragongirl@reddit

Moonshot really cooked this time around.

[-]

Daniel_UMA@reddit

Question, in terms of writing for example a book, how viable is 1T? I don't understand what that means

[-]

NoahFect@reddit

The printed books in one Library of Congress amount to roughly 10 TB, or about 2.5 trillion tokens. That was in 2000 according to Wikipedia. Let's say it's doubled since then, so call it 5T tokens. If you go by the so-called Chinchilla scaling law which says that about 20 tokens of training data should be applied per parameter, a 1T parameter model would be trained with 4 Libraries of Congress. All of the printed books in existence would still be a fraction of the total training data size. My understanding is that modern LLMs are trained with even-larger data sets than Chinchilla, but you'd want to check that.
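
The arithmetic in this comment can be written out step by step. The Python sketch below uses only the figures quoted above (10 TB, 2.5 trillion tokens, an assumed doubling since 2000, 20 tokens per parameter). The function and variable names are illustrative.

```python
def libraries_needed(params: int, tokens_per_param: int, tokens_per_library: int) -> float:
    """How many 'libraries' of text a given training budget corresponds to."""
    training_tokens = params * tokens_per_param
    return training_tokens / tokens_per_library


LOC_BYTES_2000 = 10 * 10**12         # ~10 TB of printed books, as quoted
LOC_TOKENS_2000 = 25 * 10**11        # ~2.5 trillion tokens, as quoted
LOC_TOKENS_NOW = 2 * LOC_TOKENS_2000 # commenter's assumed doubling: 5T tokens

MODEL_PARAMS = 10**12                # 1T parameters
CHINCHILLA_TOKENS_PER_PARAM = 20     # rule of thumb cited in the comment

if __name__ == "__main__":
    print("bytes per token implied:", LOC_BYTES_2000 / LOC_TOKENS_2000)
    print("training tokens:", MODEL_PARAMS * CHINCHILLA_TOKENS_PER_PARAM)
    print(
        "libraries of Congress:",
        libraries_needed(MODEL_PARAMS, CHINCHILLA_TOKENS_PER_PARAM, LOC_TOKENS_NOW),
    )
```

The quoted figures imply about 4 bytes per token and a 20 trillion token budget, which is where the "4 Libraries of Congress" number comes from.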

[-]

Daniel_UMA@reddit

No no, I'm talking about making the AI a writing assistant, how much context can it receive without messing it up

[-]

Medical-Welcome-6924@reddit

That number is the amount of information and knowledge the model has. There's a point where a bigger number doesn't necessarily mean better performance. Kimi is quite good for guided writing in my opinion, much better than any other Chinese model.

[-]

philnm@reddit

may I ask what does "QAT by design" mean and why/how does it make API cheap?

[-]

Finanzamt_Endgegner@reddit

The weights are already pre-quantized and the model was trained for that (to 4-bit in this case, I think), so quality is better.

[-]

Darkoplax@reddit

I felt GLM 5 and 5.1 were always better

[-]

LoveMind_AI@reddit

And even video understanding! Kimi’s a real one.

[-]

vladlearns@reddit

I’ve been using 2.6-preview for about 3 days now - it is fantastic

[-]

philnm@reddit

How?!

[-]

vladlearns@reddit

I’m on their preview coding program

[-]

Ashamed-Road203@reddit

Been running K2 thru the Anthropic-compatible endpoint for a multi-agent setup — it's genuinely close to Sonnet on tool calls and way cheaper, but their OpenAI-format endpoint kept choking on long tool-use chains so I had to hard-switch. If 2.6 fixes the streaming/tool-call reliability on the OAI side that alone is a bigger deal than raw benchmark bumps.

[-]

lemon07r@reddit

We get it in around a week. I was bugging them about getting it on API key usage and that's what they said. I bet we won't get the weights until later though.

[-]

jacek2023@reddit

for me it's irrelevant because I won't run it on my 72/84/96 GB VRAM

[-]

pmttyji@reddit

We have a few dudes here who run Kimi @ Q4 (IQ4_XS) with just 96GB VRAM + 1TB RAM.

[-]

ShadyShroomz@reddit

> 1TB RAM

that's like $8k of DDR4 or $15k for DDR5. maybe a few of us here have 1TB of RAM but I bet I could count them on one hand lol..

[-]

NoahFect@reddit

If you know where I can get 1TB of DDR5 for $15K, hook a brother up

[-]

pmttyji@reddit

They bought last year before RAMpocalypse.

[-]

jacek2023@reddit

We have many people who claim that they "can run a large model" but they don't use it because it runs too slow (or the quantized version is too dumb). What does that mean? What's the point of downloading gigabytes of data just to ask "what is the capital of France?" and never run it again?

[-]

Fit-Statistician8636@reddit

Kimi is usable even on 32 GB VRAM + fast RAM, around 20-22 t/s. Yes it gets slower with growing context, but not as much as you might think. I do not use it for coding, but chat, deep research, refactoring on a limited context, all possible.

[-]

Lissanro@reddit

I have 96GB VRAM and 1 TB RAM, and Kimi K2.5 still stays my most used local model. I mostly use it in Roo Code. GLM 5.1 is actually more intelligent but it is slower and thinks much more. I still use it too though, if K2.5 gets stuck or if I'm doing an overnight run that GLM 5.1 is more likely to handle better.

[-]

phwlarxoc@reddit

> "can run a large model"

For me "can run a large model" means two very different things:

1. I am really grateful that hybrid inference engines exist that actually allow running monster models at decent speed, like 15-20 t/s. For me it's only 2xRTX5090 on PCIe5 and 512GB DDR5 RAM, but it works and I can load MoE model weights up to around 500GB (e.g. GLM 5.1 UD-Q5_K_S 489.82GiB) with mainline llama.cpp.
2. But a totally different picture is vLLM. Having been used to those huge models, vLLM is a very sobering, humbling experience: with weights exceeding 60 or 70% of combined GPU memory, forget it, it OOMs immediately due to greedy KV cache reservation, even with mitigating options. On-device is everything and system RAM is basically useless. But if it does fit on device it's a different world: GPUs never idling, doing 100% permanently and 10x decoding.

[-]

pmttyji@reddit

Timing show up!

[-]

funding__secured@reddit

who cares?

[-]

Zc5Gwu@reddit

The active params might.

[-]

Basilthebatlord@reddit

Give it a week and I bet we'll see Cursor Composer 2.1 release after this

[-]

WithoutReason1729@reddit

Your post is getting popular and we just featured it on our Discord! [Come check it out!](https://discord.gg/PgFhZ8cnWW) You've also been given a special flair for your contribution. We appreciate your post! *I am a bot and this action was performed automatically.*

[-]