Kimi K2.6 - the mighty turtle that wins the race
Posted by cjami@reddit | LocalLLaMA | 27 comments
Hi folks, I've been benching Kimi K2.6 for the past few days, and I'd like to share my findings.
For context, this is based on a benchmark I've created that pits models against each other in autonomous games of Blood on the Clocktower - a highly complex social deduction game.
Findings:
- K2.6 has played 64 games so far (2 games per match). These are early results, but it has absolutely dominated the leaderboard through consistent wins against other models.
- K2.6 is slow, generating an average of 570,000 tokens per game. Gemini 3.1 Pro, for contrast, generates 180,000 tokens per game. An average match takes about 1-3 hours; with K2.6 it takes about 10-15 hours (using Moonshot AI as the provider).
- K2.6 is expensive, mainly due to its high token output: $2.31/game. This is still significantly less than Claude Opus 4.6, which costs $3.79/game. GLM 5.1, however, costs a more modest $0.88/game.
- Reliability is decent, with a 0.9% tool call error rate.
Notable moves:
- Rejecting manipulation from Claude Opus 4.6: https://clocktower-radio.com/games/IyLrh8Q#event-79
- Minion self-sacrifice to get Demon to last 2: https://clocktower-radio.com/games/Do9NaoQ#event-290
Notable mistakes:
- Fumbling with the rules - Empaths do wake on the starting night: https://clocktower-radio.com/games/6C4GDCU#event-38
- Accidentally whispering their evil plot to the good side (although it recovered, gaslit its way out, and won that game): https://clocktower-radio.com/games/XRpvext#event-34
Kimi K2.6 transcripts: https://clocktower-radio.com/search?a=Kimi+K2.6
How-it-works: https://clocktower-radio.com/how-it-works
IrisColt@reddit
B-but... if one LLM is good and the other LLM is evil, doesn't each LLM automatically know that the characters controlled by the other LLM are from the opposite faction? I don't understand your benchmark. Genuinely puzzled.
cjami@reddit (OP)
The game is played from the perspective of one player at a time. So an LLM doesn't know more than it should know at that moment in time.
Think of it like multiple AI agents, each with their own memory. They can only share information by talking. Everyone on the good side is Model A, and everyone on the evil side is Model B.
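Roughly, it looks like this (a minimal sketch with illustrative names, not the actual clocktower-radio code):

```python
from dataclasses import dataclass, field

@dataclass
class PlayerAgent:
    """One seat at the table: a model plus its own private memory."""
    name: str
    model: str  # e.g. "kimi-k2.6" on one side, "glm-5.1" on the other
    memory: list[str] = field(default_factory=list)

    def observe(self, event: str) -> None:
        # Private info (night wakes, whispers) lands only in this agent's memory.
        self.memory.append(event)

def broadcast(players: list[PlayerAgent], speaker: PlayerAgent, statement: str) -> None:
    # Public talk is the only channel that crosses between agents.
    for p in players:
        p.observe(f"{speaker.name} says: {statement}")
```

Each turn, the acting player's model is prompted with only that player's memory, so nothing leaks across seats.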
IrisColt@reddit
Got it! Sorry, I was briefly confused.
patchfoot02@reddit
this is interesting, but a less benchmarky, more amusing variant would be fun runs with a different model playing each character, then labelling them that way in the transcripts (in the actual game, give them random names). I did something like that, and having a wider variety of models made their differences more obvious, sometimes in ways that were amusingly disruptive. Everyone hates grok, who was a big fuck up lol
IrisColt@reddit
W-werewolf!
nomorebuttsplz@reddit
One time I told k2.6 it had unlimited thinking time. I regret.
cjami@reddit (OP)
Makes you wonder though, how long is too long when benchmarking?
nomorebuttsplz@reddit
I think for API, cost per task is the correct counterbalance for overthinking models, whereas locally it's about the architecture and how fast you can generate tokens.
mateszhun@reddit
For a local model, thinking time is the valid metric; for API, cost/task.
That's why for API models a Minimax subscription wins. You get so much mileage out of a $10+VAT sub. I wonder what they're doing with the data or where the subsidy is coming from...
cjami@reddit (OP)
I agree - which is why I like to include the cost/game (similar to a cost per task) rather than relying on things like cost per token.
I've been very impressed with GLM 5.1's performance too. It does about 150,000 tokens/game, so thinky but not too thinky.
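For reference, cost/game is just the per-million-token prices applied to what the model consumed and produced over a game (the numbers below are placeholders, not any provider's real rates):

```python
def cost_per_game(input_tokens: int, output_tokens: int,
                  in_price_per_m: float, out_price_per_m: float) -> float:
    """Cost of a single game from total token counts and $/1M-token prices."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000

# Hypothetical numbers: output-heavy models get punished by the second term.
print(cost_per_game(1_200_000, 570_000, 0.50, 2.50))  # -> ~2.03
```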
Anbeeld@reddit
Yeah it's prone to going in circles to compensate for weaker upfront reasoning.
Chinmay101202@reddit
i have been told k2.6 works ok but eats tokens like anything.
assassinofnames@reddit
Yeah, it's the most token hungry model I've ever used. I was hitting 100k context pretty fast for normal codebase changes on Opencode Go.
assassinofnames@reddit
I tried Kimi K2.6 on Opencode Go for a day. It got my work done, but it really thinks a lot. I never had to hit /compact so often with any other model. My tasks weren't even very complex. They were reasonably simple and obvious React and FastAPI codebase changes.
I tried Deepseek V4 Flash yesterday and I'm more impressed with it. It's cheaper than K2.6, has 1M context, and although it thinks a lot too, it makes up for it by being pretty fast, which makes it a much better fit for my use case.
I miss using GLM 4.7 and GLM 5.1 with my Z AI Coding Plan. They were just better at everything while taking fewer tokens.
PreciselyWrong@reddit
Fantastic benchmark!
PhoneOk7721@reddit
Is compaction something commonly used with LLMs, and is it useful to include in a benchmark?
cjami@reddit (OP)
Memory compaction is a common technique for working around the limited amount of information an LLM can handle at a time. It presents an opportunity for the model to show how well it identifies what is important to remember and how it stores it. You'll see the same thing done in modern-day coding agents. What you decide to remember, and how you remember it, has a significant impact on the game.
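In its simplest form it looks something like this (a sketch, not the benchmark's actual compaction logic; `summarize` stands in for a model call):

```python
def maybe_compact(history: list[dict], token_count: int,
                  limit: int, summarize) -> list[dict]:
    """Once the context nears the window limit, fold older turns into a
    model-written summary - the model decides what's worth remembering."""
    if token_count < int(limit * 0.8):  # e.g. trigger at 80% of the window
        return history
    recent = history[-10:]               # keep the latest turns verbatim
    summary = summarize(history[:-10])   # ask the model to distill the rest
    return [{"role": "system", "content": f"Memory summary: {summary}"}] + recent
```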
cjami@reddit (OP)
Thanks! I hope it hits a different angle to the usual.
Zulfiqaar@reddit
This is very interesting! Always neat to see where different models happen to excel.
Chinmay101202@reddit
still a long way to go
Sir-Draco@reddit
Is there a reason you avoided using GPT-5.4 High or Sonnet 4.6 High? If it's cost, I'm confused why you would use Opus 4.6 then. Obviously it's your money, so I'd understand if there are other considerations I'm missing. This is a cool benchmark to see!
cjami@reddit (OP)
You're right, in retrospect it would have probably made more sense to use High models (at least for GPT models as they default to no reasoning). I could probably omit Low and Medium variants as well.
There's a bit of a journey here - I started with Low variants to reduce costs while I was stabilising the benchmark, then moved on to things like prompt caching before handling heavier models. GPT 5.4 came out while I was using GPT 5.2, then GLM 5.1 started benching really well, so I needed to pull in Opus 4.6 to help quantify its strength.
It's a bit of a mess that will take time and funds to sort out, but I'm trying!
Sir-Draco@reddit
No, this is great! Makes total sense that you would start with Low. That's basically the reasoning I was looking for.
InformationSweet808@reddit
Dominating is impressive, but 570k tokens/game is doing a lot of the heavy lifting here. At that scale it’s basically brute-forcing reasoning. The real test would be efficiency-normalized—how does K2.6 perform if you cap it closer to 150–200k tokens like Gemini?
cjami@reddit (OP)
Yep, I'd call it brute-force reasoning too. I don't believe there's an accurate way to enforce token limits (apart from truncation or forcing errors at a specified limit); instead you have to rely on hints the model was potentially trained on, such as the typical low/medium/high reasoning variants. K2.6, however, I believe only has thinking on or off.
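In other words, the only hard control is a token cap that just cuts the model off mid-thought, while effort levels are soft hints. As OpenAI-style request payloads (parameter names and support vary by provider):

```python
# Hard cap: generation stops at the limit, mid-sentence if need be.
hard_capped = {
    "model": "kimi-k2.6",  # illustrative model name
    "messages": [{"role": "user", "content": "Who do you nominate, and why?"}],
    "max_tokens": 8192,    # truncation, not "think less"
}

# Soft hint: some providers expose an effort knob the model was trained on.
soft_hint = {
    "model": "gpt-5.4",
    "messages": [{"role": "user", "content": "Who do you nominate, and why?"}],
    "reasoning_effort": "low",  # only meaningful if the model respects it
}
```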
RepulsiveRaisin7@reddit
The overthinking in Kimi K2.6 is off the charts; it takes forever to do anything.
Riseing@reddit
I've been very happy with it as a GPT 5.4 replacement.