TheRealMasonMac@reddit
https://www.reddit.com/r/LocalLLaMA/comments/1s6stgl/kimi_k26_will_drop_in_the_next_2_weeks_k3_is_wip/
LMAO this is funny in hindsight
-p-e-w-@reddit
I mean to be fair 95% of such claims are BS.
If a dowser happens to stumble upon water, you don’t conclude that dowsing works.
oroora6@reddit
Maybe you're just not dowsing hard enough.
I tested it using the orientation of the dowser and Google Maps. 10/10 times I would have eventually found an ocean if I kept going in the direction the dowser was pointing.
HardworkPanda@reddit
He's just a dumb top poster; he thinks he knows everything even though he's never tried it.
mouseynaides@reddit
That guy really just randomly leaked kimi k2.6. what a goat
DerDave@reddit
He got so much shit for his post. Poor bastard didn't even lie.
pneuny@reddit
To be fair, it's hard to know if it's true when it isn't backed up with a source.
DerDave@reddit
He actually said he had a source (a buddy working at moonshot.ai), only he couldn't prove it haha
MoodDelicious3920@reddit
Almost everyone abused him in the replies, saying things like "Who r u to say" 😂
wazymandias@reddit
The model size creep is wild. A year ago 70B was "huge" and now we're seeing 400B+ models that require enterprise setups. At some point the local crowd is going to hit a hard wall where you literally can't run the frontier models locally anymore. That's when things get interesting...
the_omicron@reddit
Don't worry, Gemma 4 26B A4B is pretty good already for non-coding tasks.
KeinNiemand@reddit
Well, bigger simply = better. Yes, efficiency, as in how much smarts you can get out of a given size, can improve, but every efficiency improvement can be used in two ways: A) a smarter model at the same size, or B) a smaller model with the same smarts.
As long as models keep scaling and getting better with more parameters, whatever the frontier is will always tend to get larger and larger.
Successful-Brick-783@reddit
GPT-3 has 175 billion parameters and was released 6 years ago, idk why you think 400B is wild
Caffdy@reddit
enterprise and server are gonna hit walls as well; not the same ones, but they too have limits in inference and size
pr3miere@reddit
It just dropped!
No_Conversation9561@reddit
I yearn for the days when “dropped” meant dropped weights on huggingface.
VEHICOULE@reddit
I'm interested in their Moderato subscription, would you mind sharing your impressions?
vincentz42@reddit
Not OP, but I'm on the Moderato plan. Here's a brief review:
Pros:
K2.5 was probably the best open model available. It was my daily coding model. I used it for roughly 80-90% of my tasks and delegated the rest to Codex 5.3/5.4 on my ChatGPT Plus plan. I don't use Claude because I don't like the company's values, and I find Codex better for my use case anyway.
Their Moderato plan is very generous and should be sufficient for most people. I burned through 34M tokens last week and still had ~50% of my weekly quota left. Kimi CLI also feels more token-efficient than Claude Code IMHO.
I've only used K2.6 for a couple of hours, but it does feel like a noticeable improvement over K2.5.
Cons:
The Kimi CLI team appears to have undergone some personnel changes, with several of the original developers having left. Since then, "new features" have been added to the CLI, but the model wasn't trained to use them, resulting in frequent tool-calling errors. It got bad enough that I had to downgrade the CLI version. The team also seems less responsive to GitHub issues than before. K2.6 mitigates some of these problems, but they're not fully resolved.
I suspect Kimi will gradually become less open and transparent, similar to Anthropic and OpenAI. The latest CLI version redacts thinking traces from the terminal and VS Code (though they're still visible in logs and CLI web sessions), and support for "encrypted thinking traces" has been added (not yet enabled). It's also unclear whether K2.6 will be open weights at all. If supporting open model development is part of your goal, that's worth keeping in mind.
Clear-Ad-9312@reddit
Have you tried it through the OpenCode CLI? It seems to perform better. Also, you seem to be able to talk to both K2.6 and the K2.6-code-preview models through the API. I wonder how different the two models are.
sjsosowne@reddit
Does the 34M include cached input?
Reason I ask is I use gpt-5.4 for my day job at the moment and my company supplies API access. Because we have essentially no limit, it's easy to burn through 200-300M cached input tokens a day, sometimes more. My usage last week was 1.5B.
I'm looking to move to a more open model if possible... But we are Azure-based, so unless it's on Foundry we will have to use a subscription, and I reckon we're not going to find one that covers it.
TheRealMasonMac@reddit
On the bright side, kimi-cli is very hackable! I have my own fork with changes.
TokenChingy@reddit
Oh my god, no wonder Kimi felt much more capable today.
DerDave@reddit
Did it? Are you sure you already had it all day?
DerDave@reddit
Where is this?
Dany0@reddit
Y'all missed the most important detail: Kimi K2.6 Code.
It's a code-focused finetune! Maybe they looked at Mythos and thought "we can do that too".
Clear-Ad-9312@reddit
You seem to be able to talk to both K2.6 and the K2.6-code-preview models through the API.
I wonder how different the two models are.
Guardian-Spirit@reddit
> Maybe they looked at Mythos and thought we can do that too
Training takes way longer than that.
seamonn@reddit
not if you distill it
zdy132@reddit
Guys someone is paying me to answer questions, am I being distillation attacked?
seamonn@reddit
just ask questions back instead of answering.
zdy132@reddit
I will just tell them "Go to sleep." That's a good trick.
Dany0@reddit
Finetune
Orolol@reddit
Finetune is training. It still takes longer than that on a 1T model.
KickLassChewGum@reddit
It'll heavily depend on how you fine-tune. For a 1T model, you could curate a manageable high-quality corpus of a hundred thousand examples or so and run it through SFT.
If you want to LoRA towards a more specific use case, you could try half of that again. Depends on the gains you're looking for.
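For the LoRA route, the general shape looks something like the sketch below (just illustrative, using peft + trl; the model id, dataset file, and hyperparameters are placeholders, not anyone's actual recipe, and a 1T model would need multi-node infrastructure rather than a single box):

```python
# Rough LoRA SFT sketch with peft + trl. Everything here is a placeholder:
# the model id and dataset file are made up, and a 1T-parameter model would
# need multi-node training infrastructure, not this single-process script.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Curated corpus: one JSON object per line with a "text" field containing a
# fully formatted training example (prompt + response).
dataset = load_dataset("json", data_files="curated_sft.jsonl", split="train")

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="some-org/some-base-model",  # placeholder model id
    train_dataset=dataset,
    peft_config=peft_config,
    args=SFTConfig(
        output_dir="lora-out",
        num_train_epochs=2,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
    ),
)
trainer.train()
```

Scaling that up to a hundred thousand curated examples on a 1T MoE is mostly an infrastructure problem; the workflow itself is basically this.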
Dany0@reddit
Exactly. And Mythos has been teased ever since it was available to Anthropic employees (March, iirc?)
Clear-Ad-9312@reddit
You can actually talk with Kimi K2.6 through the API
Due_Net_3342@reddit
another model that I cannot run even on my 144GB setup :)
SilentLennie@reddit
Pretty certain that with llama.cpp you can put as much as possible in VRAM, overflow to RAM as needed, and even overflow to loading from disk. With MoE, that should help a whole bunch.
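Roughly what that looks like via llama-cpp-python, for reference (the GGUF filename and layer count are made-up placeholders; tune them to your hardware):

```python
# Sketch of the llama.cpp offload story via llama-cpp-python.
# The GGUF filename and layer count below are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="kimi-k2.6-Q2_K.gguf",  # hypothetical quant filename
    n_gpu_layers=20,   # offload as many layers as fit in VRAM (-1 = all)
    use_mmap=True,     # default: weights that don't fit in RAM get paged in from disk
    n_ctx=8192,
)

out = llm("Write a haiku about offloading.", max_tokens=48)
print(out["choices"][0]["text"])
```

The llama.cpp CLI also has --override-tensor / -ot, which people use to keep the MoE expert tensors on CPU while attention and shared layers stay on the GPU; that's where most of the MoE offload win comes from.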
UpperParamedicDude@reddit
It's insane how fast models are getting bigger. I have 36GB of VRAM plus 64GB of DDR4 RAM and I'm already memory-poor for the majority of great models that come out these days. In BitNet we believe; hope it'll become more common.
Limp_Classroom_2645@reddit
It's also insane how models are getting smaller and more capable. I tried Qwen 3.5 4B on a 6GB VRAM card and I was shocked at how good it is for its size; it felt like a 20B model from a year ago.
alphapussycat@reddit
Yes, I tried the Opus-distilled ones. Only Qwen3.5 27B at q4_k_m was borderline fine. Then I tried the non-distills, and even the 9B at q4 is better than the 27B Opus distill.
I've found the 9B to be very impressive, with my very limited testing. I'd say it also outperforms 35B A3B.
Last time I tried any, like Qwen2.5, they were unusable.
I really really hope we get the 3.6 version too...
Potential-Gold5298@reddit
In my experience, "distilled models" remain roughly the same at best, and at worst, they become dumber. All this "Claude x100500 reasoning" is mainly good for improving writing style, but not for enhancing intelligence. Alibaba or Google likely distill their own large models, and do it more efficiently, so third-party distillation adds nothing.
ayylmaonade@reddit
Yep. I don't understand all the sudden hype around these Claude distills. Those silly "3000x brainstorm" Claude/Gemini/GPT reasoning datasets have been getting distilled into all sorts of models for 6+ months now, and I can't think of a single one that was even on par with the original base model. Before 3.5, I tried a good few Qwen3 distills of this type and every single one of them was worse than the default Qwen.
Kodix@reddit
Exactly my experience, as well. Their popularity seems to be hype-based rather than performance-based.
IrisColt@reddit
You nailed it. Instruction following is always damaged significantly.
Limp_Classroom_2645@reddit
My sentiment as well
TheRealMasonMac@reddit
It's because an SFT dataset, unless carefully crafted, will undo the RL training the model underwent.
Objective-Stranger99@reddit
Qwen3.5 4B currently beats GPT OSS 20B in most benchmarks.
FaceDeer@reddit
There are benefits to models of all sizes becoming open, even if I can't run them locally. As we've been seeing with the recent fiascos involving Anthropic nerfing or locking away their APIs, it's important to know that other big companies can provide access to those models regardless of what the models' originators decide to do with them.
soyalemujica@reddit
What model do you rank as the best for coding with that setup?
UpperParamedicDude@reddit
Hmm, depends on your needs, I think. At the moment I use this model for most of my stuff:
Jackrong/Qwopus3.5-27B-v3-GGUF
I could easily run some ~4bpw 122B-A10B finetune, but speed, free memory for desktop usage, and fitting some image-gen models into VRAM at the same time all matter to me. Honestly, idk if the model I'm using is even good by rolling standards, but it does almost everything I need right now and I'm content with it.
alphapussycat@reddit
Don't bother with the Opus distills; compared to non-distills they're lobotomized, it's like night and day.
Ok_Technology_5962@reddit
Another model I can probably barely (maybe not) run on my 512GB setup... I'd have to RPC systems together or do disk offload.
IrisColt@reddit
heh
alphapussycat@reddit
But still good to have. One day maybe you'll get the CPU and RAM to run it... You're probably never running this on VRAM though.
LagOps91@reddit
152GB total here (RAM+VRAM)... it's great to have it, but it's still far from enough. We all need some of that 1-bit model magic.
Mashic@reddit
But smaller models are getting better too.
Aggressive-Permit317@reddit
Kimi dropping 2.6 already? This is moving stupid fast. I’ve been running Kimi variants locally and the context handling + tool use has been surprisingly clean. Anyone got early leaks on what’s actually new in this one or are we waiting for the official drop to benchmark it against Gemma 4 and Qwen 3.5?
nuclearbananana@reddit
Hey can you give me a recipe for banana bread
Different_Fix_2217@reddit
K3 will probably be great https://www.youtube.com/watch?v=2IfAVV7ewO0
nuclearbananana@reddit
That's a very clickbaity title... and in the time it takes to watch the video you could just read the paper yourself.
DerDave@reddit
Yeah, really looking forward to K3. So many nice innovations from open-source labs: Residual Attention, Engram, etc. Also can't wait for big models to adopt the idea of dflash diffusion speculative decoding...
pneuny@reddit
Great timing with Anthropic nerfing Opus 4.6 due to capacity issues.
WPBaka@reddit
Hype! Kimi K2.5 is one of my favorite models. Something about it just feels unique compared to other releases IMO. I really like its prose too.
MoodDelicious3920@reddit
I think Kimi K2.5 is the only model currently comparable to proprietary SOTA, especially for general STEM, non-coding tasks.
TraditionalAdagio841@reddit
Another high-value model, great!
silenceimpaired@reddit
How I wish I could run the model.
B89983ikei@reddit
We could barely use 2.5!
segmond@reddit
Tell us when it's on Hugging Face.
Shockersam@reddit
Ok astronaut
pigeon57434@reddit
I hope it's not just code and there will also be a Kimi K2.6.
muyuu@reddit
This will be for very high-end setups, but still very exciting if they can keep up the improvements shown in their earlier releases. It's huge to have something really good that can be run with significant resources but without depending on any particular vendor's shenanigans.
Canchito@reddit
Hopefully they won't inflate the API pricing like GLM did...
RetiredApostle@reddit
Not in the first week.
Tall-Ad-7742@reddit
I am excited!!!