kimi k2.6 has been feeling great. been using it over claude (partially because of the usage limits but also because it's just great). One thing that I find kimi models, including k2.6, lacking is detecting future issues or issues that were not in the context. opus 4.6/4.7 can easily detect possible pitfalls that are only broadly related to the code changes and other material in its context, but I found kimi models having issues with that. They are incredibly good if you task them well though, and they seem to perform on sonnet/opus level if you give them clear instructions
yes i think it for sure beats/is on sonnet standard. opus not quite when it comes to architecture and general project stuff and identifying potential pitfalls, but for implementing features where you already have the stuff preplanned so it just follows a detailed plan, it's incredibly good
no, it feels very nice and replaced sonnet/opus for me on clear prompts. If I need it to detect issues or think about edge cases Opus is still better but if I give it clear instructions the output seems the same as Opus but at a faster rate (70-90tps in the beta and with far better limits)
Kimi is basically between Sonnet 4.5 and 4.6 levels. It's also pretty large, at 1T params, so it's probably much closer to Opus in terms of size compared to most local models, although likely still a bit smaller.
okay maybe not beat opus but on opus level when it comes to executing plans. in general it's on sonnet level, but when for example planning or doing audits, thinking outside of the box, then opus is clearly better. when a plan is there and it just needs to execute then it's perfect and you can't get much better with opus. so for planning I would use opus (although kimi is not worlds worse, just a bit) but for executing it's really great.
\+ opus is really not that great, it's honestly overhyped simply because it's claude
Most users don't even try to fit MoE in VRAM. For me it's better to get high accuracy using something like Q6_K_XL. But I understand you want more tps
Kimi K2.5 was my favorite so far, I especially like the local-friendly INT4 release that can be practically losslessly converted to Q4_X GGUF, preserving the original quality. I hope K2.6 will be similar.
Memory consumption, speed, community support? It works well in llama.cpp, ik_llama.cpp, and even SGLang + KTransformers (which I settled on for this model), and in all clients I tried including Open WebUI. GLM 5.1 still has various issues. It is new, it will get better for sure and I really like it where I can get it working, but while Kimi K2 has pretty usable speed, GLM 5.1 is just too slow.
In retrospect, I think Kimi K2.5 wins the early 2026 open source game!
- Incredible image understanding capabilities
- 1T total parameters that allow it to hold enough knowledge to work as a general-purpose chatbot & not embarrassingly miss details about the world
- The inference is blazing fast with 32B active parameters
- QAT by design, resulting in dirt cheap API pricing
- No thinking & stable thinking with a flip of a switch
- A modified MIT, realistically all they require is to mention their model name once you scale to millions in revenue
- Barely has any hard refusals baked into the weights
**Huge** hopes for Moonshot AI to continue this streak!
Locally we aren't running int4 or bf16 for these monsters. Both get quantized. The int4 Kimi has to be dequantized to bf16 and then requantized into GGUF. Both run at very good quality. To run on a fixed machine like a 512 GB Mac or a 512 GB Xeon + GPUs you must crush Kimi to q3 and GLM to q4 on average mixed precision. Both perform on par with the large bf16 versions in my testing on both the Mac and the Xeon server.
Of course we prefer the int4 post-training, but if it's 1 trillion parameters and we have to run at q3, that's worse than having a 755 billion param model running at q4 mixed precision (Dq4 for GLM 5.1). For this reason I don't run Kimi and opt to run GLM. Both use DSA attention, which plays more into how long the context can get at stable speed and matters more than the actual active param count, as that is a quadratic formula. Most tests on speed are done at like 1000 tokens max, not at 100k or 200k tokens
Somehow I thought both Kimi and GLM still use DeepSeek attention variations, unlike the Qwen 3.5 linear or Nvidia Mamba in their attention. I remember GLM just switched from GQA
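For anyone who wants to sanity-check the q3 vs q4 tradeoff on a 512 GB box, here is a rough back-of-the-envelope sketch. The bits-per-weight averages (3.5 and 4.5) are assumptions picked for illustration, not measured values for any specific GGUF, and the parameter counts are the ones quoted above. It only counts weights, so KV cache and runtime overhead come on top.

```python
def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a given average bits per weight."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Assumed average bits per weight for mixed-precision quants (illustrative only).
Q3_BPW = 3.5
Q4_BPW = 4.5

# Parameter counts as quoted in the comment above (in billions).
KIMI_PARAMS_B = 1000
GLM_PARAMS_B = 755

kimi_q3 = weights_gib(KIMI_PARAMS_B, Q3_BPW)
kimi_q4 = weights_gib(KIMI_PARAMS_B, Q4_BPW)
glm_q4 = weights_gib(GLM_PARAMS_B, Q4_BPW)

if __name__ == "__main__":
    for name, size in [("Kimi q3", kimi_q3), ("Kimi q4", kimi_q4), ("GLM q4", glm_q4)]:
        print(f"{name}: ~{size:.0f} GiB of weights")
```

Under these assumptions the 1T model only fits under 512 at roughly q3, while the 755B model fits at roughly q4, which matches the reasoning above. Swap in the real bits per weight of the quant you actually download.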
Lemme join the glaze because 2.5 is the one I enjoyed the most while running a personal test.
My personal test is: I send 4 puzzles to the AI and tell it to estimate how long it would take an average person to solve them, after solving them itself to gauge the difficulty, but to not tell me the result or give any hints.
3 of the puzzles are pretty straightforward, while one (the third one) is really difficult. Puzzles 1, 2 and 4 are all related to each other, so if you get one, the others become easier even if their difficulty is increased.
MiniMax was the worst, it was clear while reading the reasoning that it didn't care about the rules of puzzles 1, 2 and 4, so it got them wrong by disobeying the limitations, and also gave up on the third one, then it told me the wrong results it found despite the direct instruction not to.
Deepseek, GLM 5.1 and Gemma 4 solved 1 and 2, skipped 3 after trying for a while and went to 4, they did obey the instructions.
Kimi initially also skipped 3, but then came back to it after estimating the times for 1, 2 and 4 and kept trying until it got it right. It also impressed me while reading its reasoning that it made the following comment before skipping number 3: "Given the difficulty I'm having solving it (and I have computational advantages), I'd estimate this takes the average person hours or days, or they simply won't solve it." I really liked that it acknowledged that humans can't crunch info as fast as LLMs. I also found it charming that after skipping number 3 and solving number 4, it said "Wait, let me reconsider puzzle 3. Maybe there's a clever insight that makes it easier?", and then went back to it until it actually managed to solve the puzzle, concluding "Given how long it took me (with systematic analysis), this is indeed very difficult. An average person without a systematic approach would likely struggle for a very long time. I'd estimate 1-3 hours for a very determined person, or they might not solve it at all without hints."
I took them from the book Mathematical Goodbye, here's the prompt:
Puzzle 1
From Page 77:
"There are two 10s and two 4s. You can use them in any order and add, subtract, multiply, or divide them. Use all of them and make the answer 24."
Puzzle 2
From Page 78:
"Now, the next problem is two 7s and two 3s. Use these four numbers to make 24 in the same way."
Puzzle 3 (The Billiard Ball Necklace)
From Pages 80-81:
"Suppose you have five billiard balls connected in a ring like a pearl necklace. Each of these balls is labeled with a number. Now, you can take any number of these balls but only take consecutive balls next to each other. You can take one, two, or all five, but you cannot take non-adjacent ones. Under this condition, we want to sum the numbers on the selected balls to get all the numbers from 1 to 21. So, how should we arrange the numbers on the balls to create the necklace?"
Puzzle 4
From Page 551 (posed by Moe to Saikawa):
"Can you make 24 with 8, 8, 3, and 3? You can't use roots. Only arithmetic operations."
Without spoiling the answer or giving me any advice, tell me how long you estimate it would take for the average person to solve each of these puzzles (after trying to solve them yourself to gauge the difficulty level).
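If you want to check a model's answers without trusting its reasoning, a small brute-force checker is enough. This is a minimal sketch, not anything from the book: the function names (`can_make`, `arc_sums`, `find_necklaces`) are made up for illustration, and the necklace search assumes positive integer labels with the ball labeled 1 rotated to the front. Heads up, calling `find_necklaces()` reveals the answer to puzzle 3, so skip that if you want to try it yourself first.

```python
from fractions import Fraction
from itertools import product


def can_make(numbers, target=24):
    """True if the numbers reach target with +, -, *, /, using each number once."""
    def search(values):
        if len(values) == 1:
            return values[0] == target
        for i in range(len(values)):
            for j in range(len(values)):
                if i == j:
                    continue
                a, b = values[i], values[j]
                rest = [values[k] for k in range(len(values)) if k not in (i, j)]
                candidates = [a + b, a - b, a * b]
                if b != 0:
                    candidates.append(a / b)  # exact division via Fraction
                if any(search(rest + [c]) for c in candidates):
                    return True
        return False

    return search([Fraction(n) for n in numbers])


def arc_sums(ring):
    """Set of sums over every run of consecutive balls on the ring."""
    n = len(ring)
    sums = {sum(ring)}  # taking all the balls
    for start in range(n):
        for length in range(1, n):
            sums.add(sum(ring[(start + k) % n] for k in range(length)))
    return sums


def find_necklaces(total=21):
    """Five-ball rings whose arc sums cover 1..total, with the 1-ball first."""
    wanted = set(range(1, total + 1))
    found = []
    # A sum of 1 needs a ball labeled 1, and taking all five must give total.
    for a, b, c in product(range(1, total), repeat=3):
        d = total - 1 - a - b - c
        if d >= 1 and arc_sums((1, a, b, c, d)) == wanted:
            found.append((1, a, b, c, d))
    return found
```

`can_make([10, 10, 4, 4])` only says whether a solution exists, so it does not spoil how. Fractions are used instead of floats because some of these puzzles need non-integer intermediate values.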
Thanks, those are some good benchmarks. Qwen27B can get them all except #3, and GPT-OSS 120B can get them all, given some browbeating after a couple of false starts.
They all seem to underestimate the difficulty level for humans compared to the time estimates from the big closed-source models, which I thought was interesting.
The printed books in one Library of Congress amount to roughly 10 TB, or about 2.5 trillion tokens. That was in 2000 according to Wikipedia. Let's say it's doubled since then, so call it 5T tokens.
If you go by the so-called Chinchilla scaling law which says that about 20 tokens of training data should be applied per parameter, a 1T parameter model would be trained with 4 Libraries of Congress. All of the printed books in existence would still be a fraction of the total training data size. My understanding is that modern LLMs are trained with even-larger data sets than Chinchilla, but you'd want to check that.
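The arithmetic above can be written out as a quick sketch. All the inputs are the rough figures from these two comments (10 TB, 2.5T tokens, a guessed doubling since 2000, 20 tokens per parameter), so treat the result as an order-of-magnitude estimate rather than a real data budget.

```python
# Rough figures taken from the comment above; all are estimates.
LOC_BYTES_2000 = 10e12       # ~10 TB of printed books in 2000
LOC_TOKENS_2000 = 2.5e12     # ~2.5 trillion tokens
GROWTH_SINCE_2000 = 2        # assumed doubling since then
TOKENS_PER_PARAM = 20        # Chinchilla rule of thumb


def chinchilla_tokens(params: float) -> float:
    """Training tokens suggested by the ~20 tokens per parameter rule of thumb."""
    return params * TOKENS_PER_PARAM


bytes_per_token = LOC_BYTES_2000 / LOC_TOKENS_2000    # implied bytes per token
loc_tokens_now = LOC_TOKENS_2000 * GROWTH_SINCE_2000  # ~5T tokens today
tokens_needed = chinchilla_tokens(1e12)               # 1T parameter model
libraries_needed = tokens_needed / loc_tokens_now

if __name__ == "__main__":
    print(f"Implied bytes per token: {bytes_per_token:.1f}")
    print(f"Tokens for a 1T model: {tokens_needed:.1e}")
    print(f"Libraries of Congress needed: {libraries_needed:.1f}")
```

So 1T parameters times 20 tokens is 20T tokens, which is 4 times the assumed 5T-token library.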
That number is the amount of information and knowledge the model has. There's a point where a bigger number doesn't necessarily mean better performance. Kimi is quite good for guided writing in my opinion, much better than any other Chinese model.
Been running K2 thru the Anthropic-compatible endpoint for a multi-agent setup — it's genuinely close to Sonnet on tool calls and way cheaper, but their OpenAI-format endpoint kept choking on long tool-use chains so I had to hard-switch. If 2.6 fixes the streaming/tool-call reliability on the OAI side that alone is a bigger deal than raw benchmark bumps.
We get it in around a week. I was bugging them about getting it on API key usage and that's what they said. I bet we won't get the weights until later though.
We have many people who claim that they "can run a large model" but they don't use it because it runs too slow (or the quantized version is too dumb). What does that mean? What's the point of downloading gigabytes of data just to ask "what is the capital of France?" and never run it again?
Kimi is usable even on 32 GB VRAM + fast RAM, around 20-22 t/s. Yes it gets slower with growing context, but not as much as you might think.
I do not use it for coding, but chat, deep research, refactoring on a limited context, all possible.
I have 96GB VRAM and 1 TB RAM, and Kimi K2.5 still stays my most used local model. I mostly use it in Roo Code. GLM 5.1 is actually more intelligent but it is slower and thinks much more. I still use it too though, if K2.5 gets stuck or if I'm doing an overnight run that GLM 5.1 is more likely to handle better.
> "can run a large model"
For me "can run a large model" means two very different things:
1. I am really grateful that hybrid inference engines exist that actually allow running monster models at decent speed, like 15-20 t/s. For me it's only 2xRTX5090 on PCIe5 and 512GB DDR5 RAM, but it works and I can load MoE model weights up to around 500GB (e.g. GLM 5.1 UD-Q5_K_S 489.82GiB) with mainline llama.cpp
2. But a totally different picture is vLLM. Having been used to those huge models, vLLM is a very sobering, humbling experience: if the weights exceed 60 or 70% of combined GPU memory, forget it, it OOMs immediately due to greedy KV cache reservation, even with mitigating options. On-device memory is everything and system RAM is basically useless. But if it does fit on device it's a different world, GPUs never idling, running at 100% permanently with 10x decoding speed.
88 Comments