Anyone actually coded with Kimi K2 Thinking?
Posted by Federal_Spend2412@reddit | LocalLLaMA | View on Reddit | 42 comments
Curious how its debug skills and long-context feel next to Claude 4.5 Sonnet—better, worse, or just hype?
Trollfurion@reddit
I’ve tried it to code a website from a prompt; it did worse than Qwen3 VL 32B, for example
TheRealMasonMac@reddit
It makes coding mistakes that make me not want to use it for actual coding. Might be good for planning side? Not sure.
shaman-warrior@reddit
How’d you use it?
TheRealMasonMac@reddit
I prompted the official API with a simple edit to improve the CSS of an existing simple self-contained webapp, and it broke the JavaScript when it changed classes without updating the JS. GLM-4.6 could do this without even needing thinking.
shaman-warrior@reddit
Kimi K2 Thinking as the model? I tried it yesterday and today with their coding plan, but as the model I used kimi-k2-thinking instead of kimi-for-coding.
TheRealMasonMac@reddit
For the webapp case, it was just the straight API call.
TheRealMasonMac@reddit
Yeah, as a model.
kogitatr@reddit
I regret subscribing even to their $19 plan. In my experience, it's slower than Sonnet, delivers worse results, and sometimes disobeys the prompt.
shaman-warrior@reddit
I also subscribed. What model did you use?
lemon07r@reddit
It's currently broken in all agents other than Kimi CLI, because the model emits tool calls inside its reasoning tags and no other agent supports that yet. Should hopefully be fixed soon in most agents.
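To illustrate what an agent-side fix involves: the harness has to pull tool-call payloads out of the reasoning stream itself rather than reading a separate tool-calls field. A minimal Python sketch, using a hypothetical `<tool_call>` markup purely for illustration (Moonshot's actual wire format may differ):

```python
import json
import re

def extract_tool_calls(text):
    """Pull JSON tool-call payloads out of an interleaved reasoning string.

    The <tool_call>...</tool_call> tags here are illustrative, not
    Moonshot's real format; the point is that the calls live inside the
    reasoning text, so a harness that only checks a top-level tool_calls
    field never sees them.
    """
    return [
        json.loads(m)
        for m in re.findall(r"<tool_call>(.*?)</tool_call>", text, re.S)
    ]

# Hypothetical reasoning trace with an embedded tool call
reasoning = (
    "Let me check the file first. "
    '<tool_call>{"name": "read_file", "arguments": {"path": "app.py"}}</tool_call> '
    "Now I can see the bug."
)

for call in extract_tool_calls(reasoning):
    print(call["name"], call["arguments"])
```

An agent that doesn't do this kind of extraction just treats the whole trace as thinking text, which matches the "broken in all agents other than Kimi CLI" behavior described above.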
vincentz42@reddit
This needs to be upvoted higher. I used Kimi CLI and found the model to be very smart in agentic coding.
ps5cfw@reddit
I've given it a fairly complex task (fix a bug in a fairly complex .NET repository class) and it solved it in two shots.
It's OK, it tends to think a lot, but it's not too much
Federal_Spend2412@reddit (OP)
Thanks, I'm planning to try using Kilo Code + Kimi K2 Thinking in my project to test it out.
Brave-Hold-9389@reddit
Use Claude Code; it allows Kimi to use a different type of reasoning.
GregoryfromtheHood@reddit
How do you use it with Claude Code? I've tried Claude Code Router a few times to use different models but could never get the model to act right. I always default back to Roo Code for other models because they just work there, even if it is a bit of a context hog.
Brave-Hold-9389@reddit
Here, check this out
GregoryfromtheHood@reddit
Oh. Anthropic compatible endpoint via a cloud provider, yeah nah I'm not really interested in that. I'm talking about running models locally using openai compatible API endpoints.
I think something in the conversion process isn't 100% right, and I haven't been able to get very good performance out of Claude Code with local models.
AI_should_do_it@reddit
I assume you ran it locally? What’s the hardware?
YouAreTheCornhole@reddit
It should be a lot better for the amount of hype
loyalekoinu88@reddit
Agreed. It’s not bad BUT it also isn’t a coding model. It’s an agent/general model. How much of that model space is dedicated to code is up for debate.
YouAreTheCornhole@reddit
If it weren't gigantic I'd have more hope here, but for its size it should be a lot better than it is
loyalekoinu88@reddit
I mostly agree but do we have other open trillion parameter models to compare to that are better? I think this model as a base will produce great coding focused models of similar size that are better in that domain. Just a matter of time. :)
llmentry@reddit
We have open models with far fewer params that are arguably better. Does that count?
YouAreTheCornhole@reddit
I hope so but it's kind of like throwing a poop at a house fire, especially when models way smaller are doing things better
loyalekoinu88@reddit
That’s a fair assessment. What models are you presently using, and for what kind of coding work?
YouAreTheCornhole@reddit
I mainly use Sonnet 4.5 and all kinds of stuff, mainly Python and Go, and C++. Lots of AI and ML stuff
Federal_Spend2412@reddit (OP)
GLM 4.6 isn't as powerful as advertised. I'm just a little worried about how Kimi K2 Thinking compares to GLM 4.6 in the same situations.
YouAreTheCornhole@reddit
Kimi K2 Thinking is definitely worse than GLM 4.6
Federal_Spend2412@reddit (OP)
I just know Glm 4.6 > minimax m2
Final-Rush759@reddit
For me, minimax m2 is better than GLM-4.6. It all depends on what you want to do. None of models are perfect. If you have problems, try a different model. I think GPT-5 is very good in fixing bugs.
usernameplshere@reddit
I'm curious, what scenario did you use it in?
Brave-Hold-9389@reddit
in frontend
Pink_da_Web@reddit
No, it's not. For me, it's much better than the GLM 4.6. Why do you think that?
TheRealGentlefox@reddit
Advertised by who? A lot of coders vouch for its capabilities. I haven't done super extensive testing yet but I quite like it.
redragtop99@reddit
GLM 4.6 is the best local model I’ve used for text. It’s consistent and right.
Born_Operation_6222@reddit
It seems it's only good on the agentic and IF (instruction-following) scores? On all other scores, it's worse than DeepSeek R1.
Wishitweretru@reddit
Tried it for a day; it kept failing during project onboarding. Figured it might be growing pains, so I'll try again in a couple of days.
kaggleqrdl@reddit
It was impressive on a simple task, but on a larger refactoring job it broke pretty badly. It seems to overcomplicate things. Worth a few more attempts, I think.
Special_Cup_6533@reddit
For single code files it is fine, but when I introduce multiple files in a code base, it falls apart and makes many errors, and is unable to fix them. I end up swapping to deepseek and deepseek fixes them all.
mileseverett@reddit
I gave it my standard, fairly complex computer vision architecture modification questions, and it consistently fucked up the dimensions of tensors and couldn't fix itself even after multiple rounds. I've found that only closed models get these correct.
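For context on why tensor dimensions trip models up: the bookkeeping usually comes down to the standard convolution output-size formula, and dropping a padding or stride term anywhere in a deep stack produces mismatched shapes several layers later. A minimal sketch of that formula in plain Python (framework-agnostic; the ResNet-stem numbers below are just a familiar example):

```python
def conv2d_out(size, kernel, stride=1, padding=0):
    """Output spatial size of a 2D convolution along one axis:
    floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# 224x224 input through a 7x7 kernel, stride 2, padding 3
# (the classic ResNet stem) halves the spatial size to 112.
print(conv2d_out(224, 7, stride=2, padding=3))

# A 3x3 "same" conv (stride 1, padding 1) preserves the size.
print(conv2d_out(32, 3, stride=1, padding=1))
```

Getting one of these terms wrong in a proposed architecture change is exactly the kind of error that then cascades into runtime shape mismatches the model can't reason its way back out of.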
mborysow@reddit
I just want to know if anyone has managed to get it running on SGLang or vLLM with tool calling working decently.
It seems like it's a known issue, but it makes the model totally unsuitable for things like Roo Code / Aider. I understand the fix is basically an enforced grammar for the tool-calling section; hopefully that will come soon. We have limited resources to run models, so if it can't also do tool calling, we need to save the room for something else. :(
Seems like an awesome model.
For reference:
https://blog.vllm.ai/2025/10/28/Kimi-K2-Accuracy.html
https://github.com/MoonshotAI/K2-Vendor-Verifier
Can't remember if it was vLLM or sglang for this run, but:
{
  "model": "kimi-k2-thinking",
  "success_count": 1998,
  "failure_count": 2,
  "finish_stop": 941,
  "finish_tool_calls": 1010,
  "finish_others": 47,
  "finish_others_detail": {
    "length": 47
  },
  "schema_validation_error_count": 34,
  "successful_tool_call_count": 976
}
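For what it's worth, those numbers are internally consistent: of the 1010 runs that finished with a tool call, 976 produced a schema-valid call, and the 34-call gap matches the schema validation error count exactly. A quick Python sanity check on the posted stats:

```python
# Stats copied from the K2-Vendor-Verifier run posted above.
stats = {
    "success_count": 1998,
    "failure_count": 2,
    "finish_stop": 941,
    "finish_tool_calls": 1010,
    "finish_others": 47,
    "schema_validation_error_count": 34,
    "successful_tool_call_count": 976,
}

# Invalid tool calls should equal the schema validation errors.
invalid = stats["finish_tool_calls"] - stats["successful_tool_call_count"]
print(invalid)  # 34

# Of the runs that ended in a tool call, what fraction were schema-valid?
valid_rate = stats["successful_tool_call_count"] / stats["finish_tool_calls"]
print(f"{valid_rate:.1%}")  # 96.6%
```

So roughly 1 in 30 tool calls fails schema validation on this serving stack, which lines up with the "works, but not decently enough for agents" experience described above.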