mrfakename0@reddit
No_Efficiency_1144@reddit
I am kinda confused why people spend so much on Claude (I know some people spending crazy amounts on Claude tokens) when cheaper models are so close.
Tolopono@reddit
On openrouter, grok code 1 is king for coding
yani205@reddit
The sharpest tool in the drawer is not always the best tool for the job.
DavidOrzc@reddit
What I can tell you is that Cursor is optimized to work well with Claude. I can also imagine the people at Cursor giving feedback to Google and OpenAI on how to optimize their models to work well with Cursor. I don't think that's the case for the Chinese providers. On the other hand, benchmarks are obtained by testing these models in an equal context. The AI models are given a fixed set of tools, and they have to use them to solve coding problems.
alex_pro777@reddit
Can you tell me what exact tasks these people are trying to solve when "spending crazy amounts on Claude"? Coding or what?
No_Efficiency_1144@reddit
Agentic stuff. It can take enormous amounts of tokens.
nuclearbananana@reddit
Cached claude is around the same cost as uncached Kimi.
And claude is usually cached while Kimi isn't.
(sonnet, not opus)
No_Efficiency_1144@reddit
But it is open source, so you can run your own inference and get lower token costs than OpenRouter, plus you can cache however you want. There are much more sophisticated adaptive hierarchical KV-caching methods than what Anthropic uses anyway.
akirakido@reddit
What do you mean run your own inference? It's like 280GB even on 1-bit quant.
No_Efficiency_1144@reddit
Buy or rent GPUs
Maximus-CZ@reddit
"lower token costs"
Just drop $15k on GPUs and your tokens will be free, bro
inevitabledeath3@reddit
You could use chutes.ai and get very low costs. I get 2000 requests a day at $10 a month. They have GPU rental on other parts of the bittensor network too.
No_Efficiency_1144@reddit
He was comparing to Claude which is cloud-based so logically you could compare to cloud GPU rental, which does not require upfront cost.
Maximus-CZ@reddit
Okay, then please show me where I can rent GPUs to run 1T model without spending more monthly than people would spend on claude tokens.
No_Efficiency_1144@reddit
I will give you a concrete real-world example that I have seen for high-throughput agentic system deployments. For the large open source models, i.e. Deepseek and Kimi-sized, Nvidia Dynamo on Coreweave with the KV-routing set up well can be over ten times cheaper per token than Claude API deployments.
TheAsp@reddit
The scale of usage obviously affects the price point where renting or owning GPUs saves you money. Someone spending $50 on open router each month isn't going to save money.
No_Efficiency_1144@reddit
I know; if you go back to my original comment, I was talking about people spending crazy amounts of money on Claude tokens.
AlwaysLateToThaParty@reddit
Dude, it's relatively straightforward to research this subject. It's surprisingly cost effective. You can get anywhere from one 5090 to data-centre nvlink clusters. Look it up.
Maximus-CZ@reddit
One rented 5090 will run this 1T Kimi cheaper than sonnet tokens?
Didnt think so
AlwaysLateToThaParty@reddit
In volume? Yes.
nuclearbananana@reddit
What methods? Locally things are all cached ik, not that I can run Kimi, but afaik Anthropic has had the steepest caching discount from the start
No_Efficiency_1144@reddit
The more sophisticated KV-cache systems don't work the usual way, where you just cache the context of one conversation. Instead they take the KV caches of all conversations across all nodes, break them into chunks, give each chunk an ID, and put them into a database. When a request comes in, the system does a database lookup to see which nodes have the most KV-cache hits for that request, and a router sends the request to the node that maximises KV-cache hits.
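A minimal sketch of the idea (chunk size, hashing scheme and names here are all made up; real systems are far more involved):

```python
import hashlib
from collections import defaultdict

CHUNK_TOKENS = 256  # hypothetical chunk size

def chunk_ids(token_ids):
    """Hash each prefix-aligned chunk, so shared prefixes across
    conversations map to the same chunk IDs."""
    ids = []
    for end in range(CHUNK_TOKENS, len(token_ids) + 1, CHUNK_TOKENS):
        prefix = tuple(token_ids[:end])
        ids.append(hashlib.sha256(repr(prefix).encode()).hexdigest()[:16])
    return ids

class KVIndex:
    """chunk ID -> set of nodes that currently hold that KV chunk."""
    def __init__(self):
        self.index = defaultdict(set)

    def register(self, node, token_ids):
        for cid in chunk_ids(token_ids):
            self.index[cid].add(node)

    def route(self, token_ids, nodes):
        # Send the request to whichever node already holds the most
        # of its prefix chunks, i.e. maximise KV-cache hits.
        hits = {n: 0 for n in nodes}
        for cid in chunk_ids(token_ids):
            for n in self.index.get(cid, ()):
                if n in hits:
                    hits[n] += 1
        return max(hits, key=hits.get)
```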
nuclearbananana@reddit
huh, didn't know you could break the KV cache into chunks.
No_Efficiency_1144@reddit
Yeah, you can even take it out of RAM and put it into long-term storage like SSDs, and collect KV chunks over the course of months. It is like doing RAG, but over KV.
Optimal LLM inference is very different to what people think.
Lissanro@reddit
Very true. I mostly run Kimi K2 when I do not need thinking (IQ4 quant with ik_llama), or DeepSeek 671B otherwise. Not long ago I compared local inference vs cloud, and local in my case was cheaper even on old hardware. Locally I can also manage the cache in a way that lets me return to any old dialog almost instantly, and always keep my typical long prompts precached. When doing the comparison, I noticed that cached input tokens are basically free locally; I have no idea why they are so expensive in the cloud.
Llamasarecoolyay@reddit
Benchmarks aren't everything.
No_Efficiency_1144@reddit
Machine learning field uses the scientific method so it has to have reproducible quantitative benchmarks.
colin_colout@reddit
Lol why are you getting downvoted? This is literally true.
People are mad at benchmaxing...not benchmarks.
auggie246@reddit
You might want to learn more about training methods before saying such stuff
No_Efficiency_1144@reddit
When I do training runs I set them up to automatically run benchmarks on each checkpoint after a certain number of steps, so benchmarks are built in to how I do training.
For reinforcement learning with PPO or GRPO, I sometimes use a benchmark as the reward model, so in those situations benchmarks are part of the reinforcement learning rollout.
Similarly, for neural architecture search I use benchmark results to guide the search.
There is a fourth usage in training where I directly fine-tune on differentiable rewards, so in that case the benchmark is actually part of the loss function.
All four of these are not possible without using the scientific method over reproducible quantitative benchmarks.
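Roughly the shape of the first two, stripped of everything framework-specific (all function names here are placeholders, not any particular library):

```python
from typing import Callable, Sequence, Tuple

def evaluate_benchmark(answer: Callable[[str], str],
                       tasks: Sequence[Tuple[str, str]]) -> float:
    """Fraction of (prompt, expected) benchmark tasks answered exactly."""
    correct = sum(1 for prompt, expected in tasks if answer(prompt) == expected)
    return correct / max(len(tasks), 1)

def train_with_benchmarks(train_step: Callable[[int], None],
                          save_checkpoint: Callable[[int, float], None],
                          answer: Callable[[str], str],
                          tasks: Sequence[Tuple[str, str]],
                          total_steps: int,
                          eval_every: int = 1000) -> None:
    """Usage 1: score every checkpoint on the benchmark during training."""
    for step in range(1, total_steps + 1):
        train_step(step)
        if step % eval_every == 0:
            save_checkpoint(step, evaluate_benchmark(answer, tasks))

def rollout_reward(grade: Callable[[str, str], float],
                   prompt: str, completion: str) -> float:
    """Usage 2: a benchmark grader provides the PPO/GRPO reward signal."""
    return grade(prompt, completion)
```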
Orolol@reddit
Sure, but those benchmarks don't always translate to real-life experience. Claude isn't the best model in any benchmark, yet I have yet to find a model that makes so few mistakes and whose code is so reliable.
No_Efficiency_1144@reddit
You could make a dataset out of the software tasks that you found Claude performed well on and use that dataset to make a new benchmark of your own to compare other models to.
Orolol@reddit
Sure. What's your point?
No_Efficiency_1144@reddit
Not a big point just that then you would have a good benchmark
Orolol@reddit
Sure, but it would still be only a benchmark.
No_Efficiency_1144@reddit
But at that point it would translate into real-world performance, so the original point I was replying to would no longer be valid; that is the point I am making.
Orolol@reddit
Not really. It would translate to performance on a specific dataset on a specific numerical value.
No_Efficiency_1144@reddit
The idea of a benchmark is to be a prediction model, so we can judge a benchmark by how well it predicts the performance number on a held-out dataset i.e. real tasks in this case.
If it can predict with high accuracy according to the various metrics we have for judging prediction models then it can be used as a surrogate for testing on real tasks.
Thinking of it this way benchmarks end up working well, in the cases where they can be a good prediction generator.
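Concretely, the check amounts to something like this (the scores below are invented, just to show the shape of it):

```python
from statistics import correlation  # Pearson r, stdlib in Python 3.10+

# Benchmark scores vs. scores on a held-out set of real tasks,
# measured across several models (all numbers invented).
bench = {"model_a": 0.71, "model_b": 0.64, "model_c": 0.80, "model_d": 0.58}
real  = {"model_a": 0.62, "model_b": 0.55, "model_c": 0.74, "model_d": 0.49}

models = sorted(bench)
r = correlation([bench[m] for m in models], [real[m] for m in models])
print(f"benchmark vs held-out real tasks: r = {r:.2f}")
# If r stays high on tasks the benchmark never saw, the benchmark is a
# usable surrogate for testing on the real thing.
```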
Orolol@reddit
Dude, I made many benchmarks for LLM, like https://github.com/Orolol/familyBench, I know how it works.
And no, you can't really get to a point where real-life experience is quantifiable into a set of measurable metrics.
It can give you an idea of some strengths and weaknesses, but it will never be precise enough to be really conclusive.
No_Efficiency_1144@reddit
I think it depends on the type of task because, for example, I have seen math benchmarks that predict really tightly which models will perform how well on the real, similar math questions.
Orolol@reddit
In coding there's almost never a "similar code question".
Turbulent_Pin7635@reddit
Are you married to Claude? You are defending it so much that I thought someone was talking badly about your spouse.
Orolol@reddit
Sorry to share my experience. I didn't want to hurt your feelings.
forgotmyolduserinfo@reddit
I mean it simply is the best, so 🤷♂️
Careless_Wolf2997@reddit
Most open-source models cannot even compete in writing tasks with Claude 2, a corpo model from 3 years ago. Kimi and DeepSeek are the closest, but do not have that polished edge. DeepSeek also loves to miss the fucking point, and Kimi can sometimes miss details.
Claude is just reliable.
Dogeboja@reddit
Yet they are mostly terrible. SWE-Bench should have been replaced a long time ago. It does not represent real-world use well.
No_Efficiency_1144@reddit
You could take your own real-world usage, find some way to assign a numerical value to good and bad outcomes, produce a representative dataset of task descriptions along with input data, and wrap it up as a benchmark.
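Something like this is enough to start with (a bare-bones sketch; the task and scoring function are invented examples):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Task:
    description: str                 # what you asked for, in your own words
    input_data: str                  # the context the model actually got
    score: Callable[[str], float]    # maps the model's output to 0..1

def run_benchmark(generate: Callable[[str], str], tasks: List[Task]) -> float:
    """Average score of one model over your own recorded tasks."""
    total = sum(t.score(generate(f"{t.description}\n\n{t.input_data}"))
                for t in tasks)
    return total / len(tasks)

# Invented example task: did the model produce a grouped count query?
tasks = [Task("Write a SQL query that counts users per country.",
              "schema: users(id, country)",
              lambda out: float("GROUP BY" in out.upper()))]
```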
black__and__white@reddit
Just because someone hasn’t done that doesn’t make the existing benchmarks any better though, which is the point being made here
No_Efficiency_1144@reddit
That has been done a lot though. There is a really wide range of benchmarks out there. When I browse new submissions on arXiv, there are multiple new ones each day across many topics. It feels unlikely that, for a given task, there is no current benchmark that correlates with task performance. I do think it is possible though.
Mkengine@reddit
Maybe rebench shows a more realistic picture?
https://swe-rebench.com/
aeroumbria@reddit
Never buy from the price leader :p
mrjackspade@reddit
Because the extra time it takes for me to manually bridge the gap between the models, costs more than the difference in token costs.
I don't care if there's an open source model that's 95% as good and saves me 15¢ per prompt, when that 5% difference takes me 10+ minutes of extra debugging. It's not worth it to me.
Ok_Horror_8567@reddit
True I don't like Claude much
LoSboccacc@reddit
Claude just gets things and is objective-oriented; it will not try to complete the task in the smallest number of tokens possible.
Any specialist can extract work from these models, but anyone seems to be able to get work out of Claude regardless of prompting skill, and that makes a massive difference in adoption.
Arcuru@reddit
For one thing, if you just pay for Claude Max you easily get 10x that amount in tokens per month.
When Anthropic is giving away so many tokens for so cheap, I will happily take that deal.
TheInfiniteUniverse_@reddit
Claude is not necessarily the smartest, but it is very good agentic-wise. And that makes it the leader for now.
No_Efficiency_1144@reddit
I agree it is weaker at math than some but the best at many agentic tasks.
yani205@reddit
Can't believe the last version was only 2 months ago; I only realised when looking at the benchmark. It feels like an eternity with the way things are moving so fast these days.
Tolopono@reddit
B-b-but gary marcus said ai is plateauing in ~~2018 2019 2020 2021 2022 2023 2024~~ 2025!!!
Bakoro@reddit
Given that reinforcement learning is the hot thing, and all the "zero human data" techniques now, I am hoping for a continuous series of updates now, as long as the gains hold.
felloAI@reddit
Wow, crazy. We just wrote about it. It's impressive how fast both DeepSeek and Moonshot caught up. I believe that in 2-3 years, there are gonna be only xAI, Gemini and the Chinese AIs. Everybody else will be irrelevant.
marisaandherthings@reddit
Woah.
Danny_Davitoe@reddit
Still returns very strange responses.
Ok_Knowledge_8259@reddit
Very close to SOTA now. This one clearly beats DeepSeek; it is bigger, but still, the results speak for themselves.
Massive-Shift6641@reddit
Let's try it on some actual codebase and see if it's really SOTA or if they just benchmaxxxed it.
There's the Brokk benchmark, which tests models against real-world Java problems, and while it still has the same problems that all other benchmarks have, it's still better than the mainstream, tired benchmarkslop that is gamed by everyone. Last time, Kimi demonstrated some of the worst abilities compared to all tested models. It's going to be a miracle if they somehow managed to at least match Qwen3 Coder. So far its general intelligence hasn't increased according to my measures T_T
HomeBrewUser@reddit
This benchmark says GPT-5 nano is above o3 and Gemini 2.5 Pro.
Also, Kimi K2 has way more knowledge than DeepSeek, probably due to the bf16 training. It's not even close when you throw enough at it. The new DeepSeek V3.1 is even worse at knowledge lol.
Kimi also has the lowest sycophancy by far, and is the most "dynamic" feeling open model imo. DeepSeek and Qwen feel very corporate in comparison. Night and day.
Massive-Shift6641@reddit
If you disagree with the results of the bench, you're free to run it yourself. Unfortunately, since you probably won't do it, you have no choice but to trust the authors of comprehensive benchmarks who spend their time demonstrating that some models are really better engineered than others.
You also confuse general intelligence of models (something you'd really want to care about) with their broad abilities, which is a bad argument.
HomeBrewUser@reddit
Nano can be better on this benchmark, but it doesn't really matter for how the models really stack up against each other; it's just a niche case. Any benchmark can make any model look good in some case.
I don't understand what your general intelligence/broad abilities statement is supposed to mean; if you mean their knowledge versus their actual logic capabilities, then yeah, it matters. But with transformers the two are highly correlated; less knowledge really hurts reasoning abilities too.
I've tested the new DeepSeek versus the original, new Qwen3 versus the original, new Kimi versus the original. In every case the model is marginally better at certain coding tasks, but then takes a more noticeable drop in most other domains, mainly its logical abilities. These version upgrades just aren't gonna give the magical boost that they try to portray, just more overfitting on benchmarks and maybe some special one-shot coding tasks that are adjacent to said benchmarks.
The context length extensions aren't real either; if anything I notice more degradation over time in long sessions, or even on certain things like chess lol. At BEST it's on par with the older models.
Massive-Shift6641@reddit
I've tested the new DeepSeek versus the original, new Qwen3 versus the original, new Kimi versus the original. In every case they fail at tasks that are not similar to those they're trying to benchmaxxx. None of the Chinese developers seem to focus on a model's general capabilities so far, which is disappointing considering that the most capable models in the world tend to be general and equally good at everything.
I think the Chinese government should simply stop subsidizing any labs except for DeepSeek, IMO. None of the others ever come close.
HomeBrewUser@reddit
Hard to tell if you're being sarcastic or not :P. I know you said DeepSeek is the best open model; it's definitely the best open reasoning model. Kimi is better at general conversation while still being quite competent in logic, and uses way fewer tokens, which is very important.
Qwen.. has been very underwhelming, Geminimaxxed since the 2507 models. QwQ is still the best 32B model though and it's not really a debate.
DeepSeek R1-0528 & V3.1 are by far the strictest on Chinese topics though, for obvious reasons ofc. They don't budge no matter what you do unless you prefill so much you're not even using the model anymore lol.
inevitabledeath3@reddit
Why not look at SWE-rebench. Not sure how much I trust brokk.
Massive-Shift6641@reddit
First of all, if you want to know how good an LLM is at coding, you have to test it across a range of languages. It would be quite a surprise if an LLM were good at Python and suddenly failed miserably with any other language, which can mean one of two things: it was either trained on Python specifically with limited support for other languages, or they just benchmaxxxed it. Brokk is the only comprehensive and constantly updated benchmark I know of that uses a language other than Python. So you kinda don't have much choice here.
Second, if you want to know how great an LLM's general intelligence is, you have to test it across a range of random tasks from random domains. And so far it's bad for any open model except DeepSeek. This update of Kimi is no exception; I saw no improvement on my tasks. It's disappointing that some developers only focus on coding capabilities rather than increasing the general intelligence of their models, because apparently improving a model's general intelligence makes it better at everything, including coding, which is exactly what I'd want from an AI as a consumer.
inevitabledeath3@reddit
So you're essentially saying DeepSeek is the best model?
Out of interest have you tried LongCat? Not many people have. Would be interested in what you think.
Massive-Shift6641@reddit
DeepSeek is the best open source model on the market so far.
Just tried LongCat. It sucks. Fails on my music theory questions just as miserably as Qwen does. It's amusing to see that this model knows music theory well enough to know modes as exotic as Phrygian Dominant, but is not smart enough to realize that the progression I wrote was in Lydian, which is a far more popular mode.
I think that none of the improvements made by AI developers actually matter unless they demonstrably improve the model's real world performance. LongCat does not demonstrate anything like this. What really matters is whether they'd be able to catch up with frontier (GPT 5, Grok 4, Gemini 3 soon). So far no Chinese model has ever achieved it. I feel like DeepSeek R2 is going to be the first one to do it and soon after there will appear a ton of lower quality ripoffs that boast about "scaling" and "1T parameters" while actually being worse than R2.
inevitabledeath3@reddit
That kind of music theory is not something I work with, and sounds kind of obscure. I was more worried about programming and academic use.
Massive-Shift6641@reddit
You're worried about the wrong things. You should be worried about the model's general intelligence, not its performance on specific tasks.
My bench is special in that it shows that LLMs do not necessarily lack the knowledge; rather, they are very inefficient at retrieving it.
AppearanceHeavy6724@reddit
Longcat is good at fiction. I liked the vibe.
Robonglious@reddit
This is so true. I should be keeping a matrix for which models are good for which things. DeepSeek is the only model I've found that can one-shot ripserplusplus. Claude can do JAX, but it always writes for an older version, so you have to find-and-replace afterwards.
Massive-Shift6641@reddit
> a matrix for which models are good for which things
I wrote about the need for multi-faceted benchmarks inspired by psychometric tests a couple of days ago. It'd solve EXACTLY this problem.
Who has ever listened to me? lol
People get what they deserve
Robonglious@reddit
I don't know if you've noticed but everyone is talking at once. Even if you make it yourself, even if it's perfect, the rate of change has everyone's mind exploding.
ForsookComparison@reddit
Benchmarks can always be gamed or just inaccurate
inevitabledeath3@reddit
Brokk is also a benchmark.
SWE Rebench changes over time I think to avoid benchmaxxing.
cantgetthistowork@reddit
It's smaller at full context because it has half the attention heads.
Ardalok@reddit
It's more compute-efficient though, and that matters more.
Zen-smith@reddit
Is it uncensored? The biggest problem with the OG for me was its filters, which ruined its creative-writing potential.
Careless_Wolf2997@reddit
The first one wasn't censored after around 1k tokens of context, and most Claude models will do some pretty kinky shit after 1.5k context.
Stop testing censorship at low contexts.
marhalt@reddit
Can you expand on that? I mostly work with large local models on fairly long contexts, but when I try out a new model I try a few prompts to get a feel for it. Kimi threw out refusals on several of these, so I just put it aside and moved on. You're saying that feeding it more context reduces refusals? I had no idea that was a thing.
Careless_Wolf2997@reddit
Since you are being sincere and asking: yes, more context means fewer refusals for most 'censored' models. Though Opus and the other Claude models can be up in the air with how they are censored from day to day, Kimi is completely uncensored after around 1k tokens; I have made it do some fucked up things.
marhalt@reddit
This is very interesting. Any idea why that is? Is it that the refusal weights are being overwhelmed by the context as it grows? I had genuinely never heard of that. Now I'm gonna load it up and fire a horrendous 5k context at it and see what happens lol
blahblahsnahdah@reddit
To say it's less censored would be an understatement, from my testing. All refusals seem to be gone in this version.
epyctime@reddit
1t-a32b goes hard
silenceimpaired@reddit
I saw 32b and was so excited... a distilled model.... a di... oh... activated... 1T... right, that's this model. Sigh.
MoffKalast@reddit
Now I'm wondering how many NVMe drives in RAID 0 would it take to stream it lol.
KontoOficjalneMR@reddit
About five to get to the RAM speed. I checked last night :D
MoffKalast@reddit
Yeah I went to check and there's the SSD7505 with Gen 4 ×16 and capacity for 4 drives, allegedly 25 GB/s with one, and 40 GB/s with two. That could potentially read the full 30B active in less than a second. Costs $700 just for the raid controller card tho lol.
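Back-of-the-envelope version of that (sizes assumed: ~32B active params at ~4.5 bits/param, and ignoring that expert routing changes per token):

```python
active_params  = 32e9
bits_per_param = 4.5                                  # ~Q4-ish quant, assumed
active_bytes   = active_params * bits_per_param / 8   # ~18 GB per forward pass

for bw_gbs in (7, 25, 40):   # single NVMe, one RAID card, two cards
    toks_per_s = bw_gbs * 1e9 / active_bytes
    print(f"{bw_gbs} GB/s -> ~{toks_per_s:.1f} tokens/s")
```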
KontoOficjalneMR@reddit
Buying a controller would make it more expensive than going for a RAM build though.
Just plug the NVMe drives into regular PCIe 4.0 slots and do the balancing in software :)
MoffKalast@reddit
Well a RAM build likely won't give you 8-16TB of memory to work with, but it is questionable how usable it would be in practice.
KontoOficjalneMR@reddit
4TB of RAM should be enough for a 1T model, realistically. And you can get that with a used dual-EPYC server mobo and 16×256GB of RAM.
Alternatively, get a motherboard with 8 PCIe gen 4 slots (can be 6 + 2 M.2 of course as well). Put 8 1TB drives into it and you'll get almost the same speed, possibly, who knows, maaybe :D
MoffKalast@reddit
Eh, idk, can a mobo work as a RAID controller? One would need some kind of byte-level striping to get an even distribution over all the drives, otherwise it's just gonna be 7GB/s anyway.
ProfessionalJackals@reddit
Why not just bifurcate your motherboard x16 slot to 4x/4x/4x/4x? Costs you like $20 on AliExpress for a physical card that splits the x16 lanes into 4/4/4/4...
Bonus points when you use a PCIe 5.0 x16 slot and PCIe 5.0 NVMe drives... And you can probably also get some PCIe 5.0 M.2 slots on the motherboard (that are not going over the chipset), for 6 M.2 NVMe drives in total.
Or if you feel very brave, get a Chinese H12D-8D Epyc board and bifurcate 4 of your x16 slots, for 16 M.2 drives (19 if you count the 3 on the board itself). ;) Disadvantage: they are PCIe 4.0.
dizzydizzy@reddit
how are you calculating that? bandwidth and latency are very different beasts?
KontoOficjalneMR@reddit
It's always a rough estimation. Everything will of course depend madly on what kind of NVMe drive you use, what RAM, whether the RAM is dual-channel, etc.
No_Efficiency_1144@reddit
Distillation works dramatically more efficiently with reasoning models where you lift the entire CoT chain so IDK if distillation of non-reasoning models is that good of an idea most of the time.
epyctime@reddit
It's an MoE, not necessarily a (known) distillation. There are 1 trillion total parameters, with 32 billion being active at any time.
No_Efficiency_1144@reddit
Yeah, I am not saying Kimi is a distillation; I am talking about distilling Kimi.
In my opinion another attempt at Deepseek distils is a better idea
epyctime@reddit
I gotcha yeah I'm excited for the distills as well, cos I can't run this shit for the life of me
No_Efficiency_1144@reddit
This one is really strong it performs similarly in math:
Charles Babbage
epyctime@reddit
I use it for code or summarizations etc, what sorts of maths are people doing? Has someone done a new proof or something using an LLM yet?
No_Efficiency_1144@reddit
Most sub areas of math can be investigated using LLMs.
The proof finding LLMs find new proofs all the time. They can take a long time to run though.
Substantial-Dig-8766@reddit
Oh yeah boys, another model that I'll never run locally, to completely ignore while watching people hype it 😎
lightninglemons22@reddit
Imagine telling someone a year ago that there's going to be an open-source 'Trillion' parameter model
asssuber@reddit
That's peanuts.
I would point whoever told me that to the 1.6 trillion parameters model that google open sourced in 2023: https://huggingface.co/google/switch-c-2048
:D
No_Efficiency_1144@reddit
Yeah, no one expected that.
DistanceSolar1449@reddit
That's because nobody expected a 1T dense model, whereas modern models are MoE.
Kimi K2 is trained on 15.5T tokens, so 2.976×10^24 FLOPs to train.
That'll take you about 191.4 days to train at ~50% MFU on a standard single NVL72 server rack with 9 servers of B200s (if you have 2 racks, then half the time). A single 8×B200 server is about $37/hr currently, so 9 of those is $333/hour. The total cost to train Kimi K2 is in the ballpark of $1.52 million. Of course, you're not gonna find real NVL72 rentals that easily, but this gets you a rough estimate.
A 1T dense model would take you ~16 years.
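For anyone who wants to poke at the arithmetic, here's the back-of-the-envelope version (the 6·N·D rule of thumb; the effective per-GPU throughput is the one implied by the figures above, not a measured number):

```python
tokens      = 15.5e12
active      = 32e9
flops       = 6 * active * tokens              # ~2.976e24 FLOPs

eff_per_gpu = 2.5e15                           # ~5 PFLOP/s peak at ~50% MFU (assumed)
rack_gpus   = 72                               # one NVL72 rack
days        = flops / (eff_per_gpu * rack_gpus) / 86400   # ~191 days

cost_per_hr = 9 * 37                           # nine 8xB200 servers at ~$37/hr
cost_musd   = days * 24 * cost_per_hr / 1e6    # ~$1.5M

dense_years = 6 * 1e12 * tokens / (eff_per_gpu * rack_gpus) / 86400 / 365
print(f"{days:.0f} days, ${cost_musd:.2f}M; 1T dense: ~{dense_years:.0f} years")
```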
No_Efficiency_1144@reddit
It's interesting that Kimi is cheaper to train.
GPT-4, known at the time to be a MoE, was 2.5 years ago, so the MoE/dense differences have been known for a while.
DistanceSolar1449@reddit
I'm actually undercounting deepseek. If you factor in the MTP params, it's over 40b active. So it's about 1/5 more expensive than Kimi K2 in terms of pure compute.
inevitabledeath3@reddit
MTP params?
ForsookComparison@reddit
I remember some guy getting dogpiled because he said he expected Llama3 to release with a 300B set of weights lol
MoffKalast@reddit
One that rivals Sonnet 4 apparently, even.
Ok_Cow1976@reddit
Pure bullshit, people would say.
ZestyCheeses@reddit
Good benchmark improvements for just 2 months. What are the major US companies doing? If the Chinese keep this progress up they could soon be the leaders.
Safe_Leadership_4781@reddit
Look at most of the names of the people on the scientific papers on AI, even if they were published in the US. They have always been in the lead.
procgen@reddit
Not seeing many of these names on Attention is All You Need ;)
Safe_Leadership_4781@reddit
It is also worth taking a look at the references cited in Attention is all you need, which form the basis of this important treatise. Since 2017, the apparent dominance has increased, especially in the technical reports on the models.
procgen@reddit
Let us never forget to pay tribute to the founding fathers: https://en.wikipedia.org/wiki/Dartmouth_workshop
No_Efficiency_1144@reddit
They keep on picking different people and events and calling that the start of AI but they always pick something too late. Ising Models were in 1924 and you could go further back than that.
procgen@reddit
AI literally did not exist as a field prior to these men starting it.
No_Efficiency_1144@reddit
This is erasing the work of the previous decades though.
Babbage, Lovelace, Ising, Hilbert etc were earlier.
procgen@reddit
They weren’t working on AI.
No_Efficiency_1144@reddit
They were, the label isn’t important. The field is still really just a subfield of applied math, physics, chemistry and engineering anyway.
procgen@reddit
They were not. They were not explicitly attempting to recreate the full power of human intelligence in machines.
No_Efficiency_1144@reddit
Okay I just don’t use this definition at all.
Safe_Leadership_4781@reddit
Who would forget that. But are we talking about research that took 60 years to break through or the dominance since the breakthrough of AI with the publication of the first GPT model?
No_Efficiency_1144@reddit
A lot of people don't realise that Attention Is All You Need built on a specific type of RNN that already had attention added. That is why it says attention is "all you need": the RNN was removed. For certain types of dataset the original RNNs with attention are actually better than transformers to this day.
Safe_Leadership_4781@reddit
It is also worth taking a look at the references cited in Attention is all you need, which form the basis of this important treatise. Since 2017, the apparent dominance has further increased, especially in the technical reports on the models.
procgen@reddit
Genie 3, AlphaFold 3, IMO gold, ARC-AGI, etc.
ZestyCheeses@reddit
Not available, Not available, Not available and a benchmark... Those products are interesting but we don't have access to them.
procgen@reddit
I mean that US companies are building models that significantly outperform on the ARC-AGI benchmarks.
Massive-Shift6641@reddit
> What are the major US companies doing?
You're asking the wrong question. A better question is: what are the Chinese companies doing? We have seen no Chinese equivalent to GPT 5, or at least Grok 4, so far, that is, a Chinese model that is clearly able to reason and solve problems far outside its training data. On various benches, DeepSeek only recently started to exhibit this kind of behavior, but even so it's still not quite there, and other Chinese models are still behind it.
LindaSawzRH@reddit
The Chinese are supporting Open Source, the Americans don't understand that concept.
lorddumpy@reddit
Come on bro
Massive-Shift6641@reddit
The Chinese seem to be not that great at supporting open source, because there should already be an open-source contender to GPT 5. There is still none. If Qwen's next model becomes one, I will be very pleasantly surprised.
ffgg333@reddit
Is the creative writing better?
Amazing_Hat7058@reddit
What specs do I need to run this?
synn89@reddit
On the easy to setup side, pretty much a Mac M3 Ultra 512GB system: https://www.youtube.com/watch?v=-zfUvA2CDqE
But in general, you want high bandwidth RAM in the 0.5 to 1.0 Terabyte range. This isn't really something most people are going to be able to run at home.
Amazing_Hat7058@reddit
Thanks for the reply! I have a workstation with lots of RAM, 64GB for now, but I can upgrade it... Is it pointless trying to run this on a workstation-like setup with main memory instead of an integrated GPU?
synn89@reddit
In general, yeah it would be. Especially when you have services like https://nano-gpt.com/ which you can run it on very cheaply at a good speed.
OsakaSeafoodConcrn@reddit
Possible to run on i7 cpu and 64GB DDR4 at reasonable 3tk/s?
synn89@reddit
No. You'd want more like 512GB-1TB of RAM and a processor that can access it properly(like an Epyc).
Professional-Bear857@reddit
It's slightly better than Qwen Coder despite being twice the size, so it seems like diminishing returns set in pretty hard after the 500B parameter mark.
synn89@reddit
Except it likely has much more broad knowledge outside of the coding domain. For example, I found using Qwen as a coder and Kimi K2 as a documentation writer was a good combo.
ab2377@reddit
ah, too small for my laptop, i will pass
icpart@reddit
It is not something special for coding tasks. I ran a simple test across Claude, Qwen Coder 4B, Grok Code Fast and Kimi with the simple prompt "C Program to Find the Largest Number Among Three Numbers". The most comprehensive and accurate answers, covering different methods, came from Qwen Coder and Claude. The most useless code was from Kimi K2. By the way, Qwen Coder and Qwen Thinking on Qwen.ai give very similar results to Claude Sonnet. I use a free Claude account. Maybe Kimi K2 is much better for agent tasks, but for simple code generation it is not good at all.
Marksta@reddit
With such a simple task and no guidance on how you'll pick a winner, you're just rolling the dice on who makes something that's prettier to your eyes.
TheRealMasonMac@reddit
This is my immediate impression of it for long-fiction (novel chapter) creative writing: It seems more nuanced and adapts better to the context of the scenario. That said, it does still struggle with long-context instruction following. It is also still engaging with tropes that do not make contextual sense. Hopefully these are things that might be addressed by reasoning as I'm convinced that long-context creative writing requires it.
Overall, it's about 80% of the way to GPT-5 IMO. Exceeds GPT-4o.
UsernameAvaylable@reddit
Funny enough up there somebody is claiming the model is shit because it doesn't know "obvious" music theory stuff i never heard about.
I guess at some point models will be like people and it will be like calling stephen hawking useless because he misses all his free throws at basketball...
NandaVegg@reddit
I forget where the reply you are referring to is, but they were talking about intermediate-to-advanced musical concepts (scales/modes) that anyone who has attempted to play jazz would at least roughly know, and that any professional film composer would know. It was niche domain knowledge, but not ridiculously obscure.
I'd also agree with that reply that DeepSeek is one of the best open-weight models when it comes to non-STEM, fairly obscure knowledge. Western closed-source models, like o3, are surprisingly good at understanding extremely niche non-STEM topics and concepts, even multilingual ones, and DeepSeek comes pretty close.
NobleKale@reddit
'state of the art' is the most useless fucking phrase in LLMs
Inect@reddit
Well this second it is...
holistic-engine@reddit
From what I've read, the hardware requirements to even run this thing are insane; we're talking a dozen H100s or something, if I'm not mistaken.
Awwtifishal@reddit
If you want to serve many users, yes. But if it's only for you and if you don't mind slower speeds, it's not that expensive.
Amgadoz@reddit
Yes. The upfront cost is quite high. Serving it at a large scale is quite cheap though.
power97992@reddit
How much did this model and the original K2 cost to train? They must be bleeding money like crazy... The paid API can't cover the cost; Alibaba, Tencent and venture capitalists are really helping them.
Awwtifishal@reddit
The original K2 cost around $20-30 million in total to train, thanks to its new training optimizer, Muon, which has challenged the 7-year status quo of AdamW.
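For anyone curious, this is roughly what the Muon update looks like as I understand the public open-source implementation (momentum, then a Newton-Schulz orthogonalization of the update for 2D weight matrices). Treat it as an illustrative sketch, not Moonshot's exact recipe:

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately replace g with an orthogonal matrix pointing the same way."""
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients from the public implementation
    x = g / (g.norm() + 1e-7)
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x

def muon_step(weight: torch.Tensor, grad: torch.Tensor, buf: torch.Tensor,
              lr: float = 0.02, momentum: float = 0.95) -> None:
    """One Muon update for a single 2D weight matrix (sketch)."""
    buf.mul_(momentum).add_(grad)                      # heavy-ball momentum
    weight.add_(newton_schulz_orthogonalize(buf), alpha=-lr)
```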
createthiscom@reddit
hmm. According to the Aider polyglot it is performing worse than the previous model: https://discord.com/channels/1131200896827654144/1413369191561564210/1413467650037780541
SatoshiNotMe@reddit
It now has 256k context, double the previous version. Also it’s very easily usable in Claude Code, e.g via this simple setup:
https://github.com/pchalasani/claude-code-tools/tree/main?tab=readme-ov-file#-using-claude-code-with-open-weight-anthropic-api-compatible-llm-providers
cantgetthistowork@reddit
Pls be 256K native context 🤞
m_shark@reddit
“Extended context length: Kimi K2-Instruct-0905’s context window has been increased from 128k to 256k tokens, providing better support for long-horizon tasks.”
cantgetthistowork@reddit
I saw that but I couldn't find any info on whether it was RoPE bullshit
Junliang_214@reddit
Just tried it out. Definitely much better for agentic tool calling, and seems to be more self-aware of the actions it has taken previously. UI wise definitely improving. Sometimes it still goes on infinite loops but huge improvements!!
(P.S. I built a vibe-coding platform focused on speed, powered by different fast-inference models from Groq and more. Just added the new Kimi K2 model. Do try it out for free here: Groq (dot) Sampleapp (dot) ai 👀)
LuozhuZhang@reddit
Wow, is Kimi moving to a thinking model?
NoseIndependent5370@reddit
They should.
paperbenni@reddit
No
Lopsided_Dot_4557@reddit
The new Kimi has really got some serious agentic capabilities. I did a testing video here : https://youtu.be/i1rQ88QgtKQ?si=OA86ueFOdBk1wCbx
Daniel_H212@reddit
Based on benchmark scores it's not as big of an improvement as I was optimistically hoping for, but still a great option for distillation into smaller models now. Does seem like there's room for them to keep training this thing further though?
oxygen_addiction@reddit
A heads up to everyone, it's available on Groq at 200t/s - Kimi K2 - GroqDocs https://share.google/qkQ0GU1JWmrCDMsY9
Hoak-em@reddit
Dang I can't wait for FP4 kernels on AMX (SGLang) and good hybrid 5090 + dual socket Xeons -- this thing could be great with an FP4
sstainsby@reddit
I'd be interested to try this out in GitHub Copilot compared to Sonnet 4.
brianllamar@reddit
Run it in Continue and report back. Easy to do a side by side in VS code
Ordinary_Mud7430@reddit
The benchmark ranking is the most honest I have ever seen. It's the first time I've seen a Chinese model not come out rated higher than Sonnet 4. Thank goodness... Now I will actually give this one a chance.
DirtyGirl124@reddit
Does it pass the vibe check?
Dr_Karminski@reddit (OP)
HuggingFace: https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905