mrfakename0@reddit
No_Efficiency_1144@reddit
I am kinda confused why people spend so much on Claude (I know some people spending crazy amounts on Claude tokens) when cheaper models are so close.
Tolopono@reddit
On openrouter, grok code 1 is king for coding
yani205@reddit
The sharpest tool in the drawer is not always the best tool for the job.
DavidOrzc@reddit
What I can tell you is that Cursor is optimized to work well with Claude. I can also imagine the people at Cursor giving feedback to Google and OpenAI on how to optimize their models to work well with Cursor. I don't think that's the case for the Chinese providers. On the other hand, benchmarks are obtained by testing these models in an equal context. The AI models are given a fixed set of tools, and they have to use them to solve coding problems.
alex_pro777@reddit
Can you tell me what exact tasks these people are trying to solve when "spending crazy amounts on Claude"? Coding or what?
No_Efficiency_1144@reddit
Agentic stuff. It can take enormous amounts of tokens.
nuclearbananana@reddit
Cached claude is around the same cost as uncached Kimi.
And claude is usually cached while Kimi isn't.
(sonnet, not opus)
No_Efficiency_1144@reddit
But it is open source, so you can run your own inference and get lower token costs than OpenRouter, plus you can cache however you want. There are much more sophisticated adaptive hierarchical KV-caching methods than what Anthropic uses anyway.
akirakido@reddit
What do you mean run your own inference? It's like 280GB even on 1-bit quant.
No_Efficiency_1144@reddit
Buy or rent GPUs
Maximus-CZ@reddit
"lower token costs"
Just drop $15k on GPUs and your tokens will be free, bro
inevitabledeath3@reddit
You could use chutes.ai and get very low costs. I get 2000 requests a day at $10 a month. They have GPU rental on other parts of the bittensor network too.
No_Efficiency_1144@reddit
He was comparing to Claude which is cloud-based so logically you could compare to cloud GPU rental, which does not require upfront cost.
Maximus-CZ@reddit
Okay, then please show me where I can rent GPUs to run 1T model without spending more monthly than people would spend on claude tokens.
No_Efficiency_1144@reddit
I will give you a concrete real-world example that I have seen for high-throughput agentic system deployments. For the large open source models, i.e. Deepseek and Kimi-sized, Nvidia Dynamo on Coreweave with the KV-routing set up well can be over ten times cheaper per token than Claude API deployments.
TheAsp@reddit
The scale of usage obviously affects the price point where renting or owning GPUs saves you money. Someone spending $50 on open router each month isn't going to save money.
No_Efficiency_1144@reddit
I know; if you go back to my original comment, I was talking about people spending crazy amounts of money on Claude tokens.
AlwaysLateToThaParty@reddit
Dude, it's relatively straightforward to research this subject. It's surprisingly cost effective. You can get anywhere from one 5090 to data-centre nvlink clusters. Look it up.
Maximus-CZ@reddit
One rented 5090 will run this 1T Kimi cheaper than sonnet tokens?
Didnt think so
AlwaysLateToThaParty@reddit
In volume? Yes.
nuclearbananana@reddit
What methods? Locally things are all cached ik, not that I can run Kimi, but afaik Anthropic has had the steepest caching discount from the start
No_Efficiency_1144@reddit
The more sophisticated KV-cache systems don't work the usual way, where you just cache the context of one conversation. Instead they take the KV caches of all conversations across all nodes, break them into chunks, give each chunk an ID, and put them into a database. When a request comes in, the system does a database lookup to see which nodes have the most KV-cache hits for that request, and a router sends the request to the node that maximises KV-cache hits.
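A minimal sketch of the idea (chunk size, hashing scheme and names here are all made up; real systems are far more involved):

```python
import hashlib
from collections import defaultdict

CHUNK_TOKENS = 256  # hypothetical chunk size

def chunk_ids(token_ids):
    """Hash each prefix-aligned chunk, so shared prefixes across
    conversations map to the same chunk IDs."""
    ids = []
    for end in range(CHUNK_TOKENS, len(token_ids) + 1, CHUNK_TOKENS):
        prefix = tuple(token_ids[:end])
        ids.append(hashlib.sha256(repr(prefix).encode()).hexdigest()[:16])
    return ids

class KVIndex:
    """chunk ID -> set of nodes that currently hold that KV chunk."""
    def __init__(self):
        self.index = defaultdict(set)

    def register(self, node, token_ids):
        for cid in chunk_ids(token_ids):
            self.index[cid].add(node)

    def route(self, token_ids, nodes):
        # Send the request to whichever node already holds the most
        # of its prefix chunks, i.e. maximise KV-cache hits.
        hits = {n: 0 for n in nodes}
        for cid in chunk_ids(token_ids):
            for n in self.index.get(cid, ()):
                if n in hits:
                    hits[n] += 1
        return max(hits, key=hits.get)
```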
nuclearbananana@reddit
huh, didn't know you could break the KV cache into chunks.
No_Efficiency_1144@reddit
Yeah, you can even take it out of RAM and put it into long-term storage like SSDs, and collect KV chunks over the course of months. It is like doing RAG, but over KV.
Optimal LLM inference is very different to what people think.
Lissanro@reddit
Very true. I mostly run Kimi K2 when I do not need thinking (IQ4 quant with ik_llama), or DeepSeek 671B otherwise. Not long ago I compared local inference vs cloud, and local in my case was cheaper even on old hardware. Locally I can also manage the cache in a way that lets me return to any old dialog almost instantly, and always keep my typical long prompts precached. When doing the comparison, I noticed that cached input tokens are basically free locally; I have no idea why they are so expensive in the cloud.
Llamasarecoolyay@reddit
Benchmarks aren't everything.
No_Efficiency_1144@reddit
Machine learning field uses the scientific method so it has to have reproducible quantitative benchmarks.
colin_colout@reddit
Lol why are you getting downvoted? This is literally true.
People are mad at benchmaxing...not benchmarks.
auggie246@reddit
You might want to learn more about training methods before saying such stuff
No_Efficiency_1144@reddit
When I do training runs I set them up to automatically run benchmarks on each checkpoint after a certain number of steps, so benchmarks are built in to how I do training.
For reinforcement learning with PPO or GRPO, I sometimes use a benchmark as the reward model, so in those situations benchmarks are part of the reinforcement learning rollout.
Similarly, for neural architecture search I use benchmark results to guide the search.
There is a fourth usage in training where I directly fine-tune on differentiable rewards, so in that case the benchmark is actually part of the loss function.
All four of these are not possible without using the scientific method over reproducible quantitative benchmarks.
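Roughly the shape of the first two, stripped of everything framework-specific (all function names here are placeholders, not any particular library):

```python
from typing import Callable, Sequence, Tuple

def evaluate_benchmark(answer: Callable[[str], str],
                       tasks: Sequence[Tuple[str, str]]) -> float:
    """Fraction of (prompt, expected) benchmark tasks answered exactly."""
    correct = sum(1 for prompt, expected in tasks if answer(prompt) == expected)
    return correct / max(len(tasks), 1)

def train_with_benchmarks(train_step: Callable[[int], None],
                          save_checkpoint: Callable[[int, float], None],
                          answer: Callable[[str], str],
                          tasks: Sequence[Tuple[str, str]],
                          total_steps: int,
                          eval_every: int = 1000) -> None:
    """Usage 1: score every checkpoint on the benchmark during training."""
    for step in range(1, total_steps + 1):
        train_step(step)
        if step % eval_every == 0:
            save_checkpoint(step, evaluate_benchmark(answer, tasks))

def rollout_reward(grade: Callable[[str, str], float],
                   prompt: str, completion: str) -> float:
    """Usage 2: a benchmark grader provides the PPO/GRPO reward signal."""
    return grade(prompt, completion)
```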
Orolol@reddit
Sure, but those benchmarks don't always translate to real-life experience. Claude isn't the best model in any benchmark, yet I have yet to find a model that makes so few mistakes and whose code is so reliable.
No_Efficiency_1144@reddit
You could make a dataset out of the software tasks that you found Claude performed well on and use that dataset to make a new benchmark of your own to compare other models to.
Orolol@reddit
Sure. What's your point?
No_Efficiency_1144@reddit
Not a big point just that then you would have a good benchmark
Orolol@reddit
Sure, but it would still be only a benchmark.
No_Efficiency_1144@reddit
But at that point it would translate into real-world performance, so the original point I was replying to would no longer be valid; that is the point I am making.
Orolol@reddit
Not really. It would translate to performance on a specific dataset on a specific numerical value.
No_Efficiency_1144@reddit
The idea of a benchmark is to be a prediction model, so we can judge a benchmark by how well it predicts the performance number on a held-out dataset i.e. real tasks in this case.
If it can predict with high accuracy according to the various metrics we have for judging prediction models then it can be used as a surrogate for testing on real tasks.
Thinking of it this way benchmarks end up working well, in the cases where they can be a good prediction generator.
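Concretely, the check amounts to something like this (the scores below are invented, just to show the shape of it):

```python
from statistics import correlation  # Pearson r, stdlib in Python 3.10+

# Benchmark scores vs. scores on a held-out set of real tasks,
# measured across several models (all numbers invented).
bench = {"model_a": 0.71, "model_b": 0.64, "model_c": 0.80, "model_d": 0.58}
real  = {"model_a": 0.62, "model_b": 0.55, "model_c": 0.74, "model_d": 0.49}

models = sorted(bench)
r = correlation([bench[m] for m in models], [real[m] for m in models])
print(f"benchmark vs held-out real tasks: r = {r:.2f}")
# If r stays high on tasks the benchmark never saw, the benchmark is a
# usable surrogate for testing on the real thing.
```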
Orolol@reddit
Dude, I made many benchmarks for LLM, like https://github.com/Orolol/familyBench, I know how it works.
And no, you can't really get to a point where real-life experience is quantifiable into a set of measurable metrics.
It can give you an idea of some strengths and weaknesses, but it will never be precise enough to be really conclusive.
No_Efficiency_1144@reddit
I think it depends on the type of task because, for example, I have seen math benchmarks that predict really tightly which models will perform how well on the real, similar math questions.
Orolol@reddit
In coding there's almost never a "similar code question".
Turbulent_Pin7635@reddit
Are you married to Claude? You are defending it so much that I thought someone was talking badly about your spouse.
Orolol@reddit
Sorry to share my experience. I didn't want to hurt your feelings.
forgotmyolduserinfo@reddit
I mean it simply is the best, so 🤷♂️
Careless_Wolf2997@reddit
Most open-source models cannot even compete in writing tasks with Claude 2, a corpo model from 3 years ago. Kimi and DeepSeek are the closest, but do not have that polished edge. DeepSeek also loves to miss the fucking point, and Kimi can sometimes miss details.
Claude is just reliable.
Dogeboja@reddit
Yet they are mostly terrible. SWE-Bench should have been replaced a long time ago. It does not represent real-world use well.
No_Efficiency_1144@reddit
You could take your own real-world usage, find some way to assign a numerical value to good and bad outcomes, produce a representative dataset of task descriptions along with input data, and wrap it up as a benchmark.
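Something like this is enough to start with (a bare-bones sketch; the task and scoring function are invented examples):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Task:
    description: str                 # what you asked for, in your own words
    input_data: str                  # the context the model actually got
    score: Callable[[str], float]    # maps the model's output to 0..1

def run_benchmark(generate: Callable[[str], str], tasks: List[Task]) -> float:
    """Average score of one model over your own recorded tasks."""
    total = sum(t.score(generate(f"{t.description}\n\n{t.input_data}"))
                for t in tasks)
    return total / len(tasks)

# Invented example task: did the model produce a grouped count query?
tasks = [Task("Write a SQL query that counts users per country.",
              "schema: users(id, country)",
              lambda out: float("GROUP BY" in out.upper()))]
```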
black__and__white@reddit
Just because someone hasn’t done that doesn’t make the existing benchmarks any better though, which is the point being made here
No_Efficiency_1144@reddit
That has been done a lot though. There is a really wide range of benchmarks out there. When I browse new submissions on arXiv, there are multiple new ones each day across many topics. It feels unlikely that, for a given task, there is no current benchmark that correlates with task performance. I do think it is possible though.
Mkengine@reddit
Maybe rebench shows a more realistic picture?
https://swe-rebench.com/
aeroumbria@reddit
Never buy from the price leader :p
mrjackspade@reddit
Because the extra time it takes for me to manually bridge the gap between the models, costs more than the difference in token costs.
I don't care if there's an open source model that's 95% as good and saves me 15¢ per prompt, when that 5% difference takes me 10+ minutes of extra debugging. It's not worth it to me.
Ok_Horror_8567@reddit
True I don't like Claude much
LoSboccacc@reddit
Claude just gets things and is objective-oriented; it will not try to complete the task in the smallest number of tokens possible.
Any specialist can extract work from these models, but anyone seems to be able to get work out of Claude regardless of prompting skill, and that makes a massive difference in adoption.
Arcuru@reddit
For one thing, if you just pay for Claude Max you easily get 10x that amount in tokens per month.
When Anthropic is giving away so many tokens for so cheap, I will happily take that deal.
TheInfiniteUniverse_@reddit
Claude is not necessarily the smartest, but it is very good agentic-wise. And that makes it the leader for now.
No_Efficiency_1144@reddit
I agree it is weaker at math than some but the best at many agentic tasks.
yani205@reddit
Can't believe the last version was only 2 months ago; I only realised when looking at the benchmark. It feels like an eternity with the way things are moving so fast these days.
Tolopono@reddit
B-b-but gary marcus said ai is plateauing in ~~2018 2019 2020 2021 2022 2023 2024~~ 2025!!!
Bakoro@reddit
Given that reinforcement learning is the hot thing, and all the "zero human data" techniques now, I am hoping for a continuous series of updates now, as long as the gains hold.
felloAI@reddit
Wow, crazy. We just wrote about it. It's impressive how fast both DeepSeek and Moonshot caught up. I believe that in 2-3 years, there are gonna be only xAI, Gemini and the Chinese AIs. Everybody else will be irrelevant.
marisaandherthings@reddit
Woah.
Danny_Davitoe@reddit
Still returns very strange responses.
Ok_Knowledge_8259@reddit
Very close to SOTA now. This one clearly beats DeepSeek; it is bigger, but still, the results speak for themselves.
Massive-Shift6641@reddit
Let's try it on some actual codebase and see if it's really SOTA or if they just benchmaxxxed it.
There's the Brokk benchmark, which tests models against real-world Java problems, and while it still has the same problems that all other benchmarks have, it's still better than the mainstream, tired benchmarkslop that is gamed by everyone. Last time, Kimi demonstrated some of the worst abilities compared to all tested models. It's going to be a miracle if they somehow managed to at least match Qwen3 Coder. So far its general intelligence hasn't increased according to my measures T_T
HomeBrewUser@reddit
This benchmark says GPT-5 nano is above o3 and Gemini 2.5 Pro.
Also, Kimi K2 has way more knowledge than DeepSeek, probably due to the bf16 training. It's not even close when you throw enough at it. The new DeepSeek V3.1 is even worse at knowledge lol.
Kimi also has the lowest sycophancy by far, and is the most "dynamic" feeling open model imo. DeepSeek and Qwen feel very corporate in comparison. Night and day.
Massive-Shift6641@reddit
If you disagree with the results of the bench, you're free to run it yourself. Unfortunately, since you probably won't do it, you have no choice but to trust the authors of comprehensive benchmarks who spend their time demonstrating that some models are really better engineered than others.
You also confuse general intelligence of models (something you'd really want to care about) with their broad abilities, which is a bad argument.
HomeBrewUser@reddit
Nano can be better on this benchmark, but it doesn't really matter for how the models really stack up against each other; it's just a niche case. Any benchmark can make any model look good in some case.
I don't understand what your general intelligence/broad abilities statement is supposed to mean; if you mean their knowledge versus their actual logic capabilities, then yeah, it matters. But with transformers the two are highly correlated; less knowledge really hurts reasoning abilities too.
I've tested the new DeepSeek versus the original, new Qwen3 versus the original, new Kimi versus the original. In every case the model is marginally better at certain coding tasks, but then takes a more noticeable drop in most other domains, mainly its logical abilities. These version upgrades just aren't gonna give the magical boost that they try to portray, just more overfitting on benchmarks and maybe some special one-shot coding tasks that are adjacent to said benchmarks.
The context length extensions aren't real either; if anything I notice more degradation over time in long sessions, or even on certain things like chess lol. At BEST it's on par with the older models.
Massive-Shift6641@reddit
I've tested the new DeepSeek versus the original, new Qwen3 versus the original, new Kimi versus the original. In every case they fail at tasks that are not similar to those they're trying to benchmaxxx. None of the Chinese developers seem to focus on a model's general capabilities so far, which is disappointing considering that the most capable models in the world tend to be general and equally good at everything.
I think the Chinese government should simply stop subsidizing any labs except for DeepSeek, IMO. None of the others ever come close.
HomeBrewUser@reddit
Hard to tell if you're being sarcastic or not :P. I know you said DeepSeek is the best open model; it's definitely the best open reasoning model. Kimi is better at general conversation while still being quite competent in logic, and uses way fewer tokens, which is very important.
Qwen.. has been very underwhelming, Geminimaxxed since the 2507 models. QwQ is still the best 32B model though and it's not really a debate.
DeepSeek R1-0528 & V3.1 are by far the strictest on Chinese topics though, for obvious reasons ofc. They don't budge no matter what you do unless you prefill so much you're not even using the model anymore lol.
inevitabledeath3@reddit
Why not look at SWE-rebench. Not sure how much I trust brokk.
Massive-Shift6641@reddit
First of all, if you want to know how good an LLM is at coding, you have to test it across a range of languages. It would be quite a surprise if an LLM were good at Python and suddenly failed miserably with any other language, which can mean one of two things: it was either trained on Python specifically with limited support for other languages, or they just benchmaxxxed it. Brokk is the only comprehensive and constantly updated benchmark I know of that uses a language other than Python. So you kinda don't have much choice here.
Second, if you want to know how great an LLM's general intelligence is, you have to test it across a range of random tasks from random domains. And so far it's bad for any open model except DeepSeek. This update of Kimi is no exception; I saw no improvement on my tasks. It's disappointing that some developers only focus on coding capabilities rather than increasing the general intelligence of their models, because apparently improving a model's general intelligence makes it better at everything, including coding, which is exactly what I'd want from an AI as a consumer.
inevitabledeath3@reddit
So you're essentially saying DeepSeek is the best model?
Out of interest have you tried LongCat? Not many people have. Would be interested in what you think.
Massive-Shift6641@reddit
DeepSeek is the best open source model on the market so far.
Just tried LongCat. It sucks. Fails on my music theory questions just as miserably as Qwen does. It's amusing to see that this model knows music theory well enough to know modes as exotic as Phrygian Dominant, but is not smart enough to realize that the progression I wrote was in Lydian, which is a far more popular mode.
I think that none of the improvements made by AI developers actually matter unless they demonstrably improve the model's real world performance. LongCat does not demonstrate anything like this. What really matters is whether they'd be able to catch up with frontier (GPT 5, Grok 4, Gemini 3 soon). So far no Chinese model has ever achieved it. I feel like DeepSeek R2 is going to be the first one to do it and soon after there will appear a ton of lower quality ripoffs that boast about "scaling" and "1T parameters" while actually being worse than R2.
inevitabledeath3@reddit
That kind of music theory is not something I work with, and sounds kind of obscure. I was more worried about programming and academic use.
Massive-Shift6641@reddit
You're worried about the wrong things. You should be worried about the model's general intelligence, not its performance on specific tasks.
My bench is special in that it shows that LLMs do not necessarily lack the knowledge; rather, they are very inefficient at retrieving it.
AppearanceHeavy6724@reddit
Longcat is good at fiction. I liked the vibe.
Robonglious@reddit
This is so true. I should be keeping a matrix for which models are good for which things. DeepSeek is the only model I've found that can one-shot ripserplusplus. Claude can do JAX, but it always writes for an older version, so you have to find-and-replace afterwards.
Massive-Shift6641@reddit
> a matrix for which models are good for which things
I wrote about the need for multi-faceted benchmarks inspired by psychometric tests a couple of days ago. It'd solve EXACTLY this problem.
Who has ever listened to me? lol
People get what they deserve
Robonglious@reddit
I don't know if you've noticed but everyone is talking at once. Even if you make it yourself, even if it's perfect, the rate of change has everyone's mind exploding.
ForsookComparison@reddit
Benchmarks can always be gamed or just inaccurate
inevitabledeath3@reddit
Brokk is also a benchmark.
SWE Rebench changes over time I think to avoid benchmaxxing.
cantgetthistowork@reddit
It's smaller at full context because it has half the attention heads.
Ardalok@reddit
It's more compute-efficient though, and that matters more.
Zen-smith@reddit
Is it uncensored? The biggest problem with the OG for me was its filters, which ruined its creative-writing potential.
Careless_Wolf2997@reddit
The first one wasn't censored after around 1k tokens of context, and most Claude models will do some pretty kinky shit after 1.5k context.
Stop testing censorship at low contexts.
marhalt@reddit
Can you expand on that? I mostly work with large local models on fairly long contexts, but when I try out a new model I try a few prompts to get a feel for it. Kimi threw out refusals on several of these, so I just put it aside and moved on. You're saying that feeding it more context reduces refusals? I had no idea that was a thing.
Careless_Wolf2997@reddit
Since you are being sincere and asking: yes, more context means fewer refusals for most 'censored' models. Though Opus and the other Claude models can be up in the air with how they are censored from day to day, Kimi is completely uncensored after around 1k tokens; I have made it do some fucked up things.
marhalt@reddit
This is very interesting. Any idea why that is? Is it that the refusal weights are being overwhelmed by the context as it grows? I had genuinely never heard of that. Now I'm gonna load it up and fire a horrendous 5k context at it and see what happens lol
blahblahsnahdah@reddit
To say it's less censored would be an understatement, from my testing. All refusals seem to be gone in this version.
epyctime@reddit
1t-a32b goes hard
silenceimpaired@reddit
I saw 32b and was so excited... a distilled model.... a di... oh... activated... 1T... right, that's this model. Sigh.
MoffKalast@reddit
Now I'm wondering how many NVMe drives in RAID 0 would it take to stream it lol.
KontoOficjalneMR@reddit
About five to get to the RAM speed. I checked last night :D
MoffKalast@reddit
Yeah I went to check and there's the SSD7505 with Gen 4 ×16 and capacity for 4 drives, allegedly 25 GB/s with one, and 40 GB/s with two. That could potentially read the full 30B active in less than a second. Costs $700 just for the raid controller card tho lol.
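Back-of-the-envelope version of that (sizes assumed: ~32B active params at ~4.5 bits/param, and ignoring that expert routing changes per token):

```python
active_params  = 32e9
bits_per_param = 4.5                                  # ~Q4-ish quant, assumed
active_bytes   = active_params * bits_per_param / 8   # ~18 GB per forward pass

for bw_gbs in (7, 25, 40):   # single NVMe, one RAID card, two cards
    toks_per_s = bw_gbs * 1e9 / active_bytes
    print(f"{bw_gbs} GB/s -> ~{toks_per_s:.1f} tokens/s")
```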
KontoOficjalneMR@reddit
Buying a controller would make it more expensive than going for a RAM build though.
Just plug the NVMe drives into regular PCIe 4.0 slots and do the balancing in software :)
MoffKalast@reddit
Well a RAM build likely won't give you 8-16TB of memory to work with, but it is questionable how usable it would be in practice.
KontoOficjalneMR@reddit
4TB of RAM should be enough for a 1T model, realistically. And you can get that with a used dual-EPYC server mobo and 16×256GB of RAM.
Alternatively, get a motherboard with 8 PCIe gen 4 slots (can be 6 + 2 M.2 of course as well). Put 8 1TB drives into it and you'll get almost the same speed, possibly, who knows, maaybe :D
MoffKalast@reddit
Eh, idk, can a mobo work as a RAID controller? One would need some kind of byte-level striping to get an even distribution over all the drives, otherwise it's just gonna be 7GB/s anyway.
ProfessionalJackals@reddit
Why not just bifurcate your motherboard x16 slot to 4x/4x/4x/4x? Costs you like $20 on AliExpress for a physical card that splits the x16 lanes into 4/4/4/4...
Bonus points when you use a PCIe 5.0 x16 slot and PCIe 5.0 NVMe drives... And you can probably also get some PCIe 5.0 M.2 slots on the motherboard (that are not going over the chipset), for 6 M.2 NVMe drives in total.
Or if you feel very brave, get a Chinese H12D-8D Epyc board and bifurcate 4 of your x16 slots, for 16 M.2 drives (19 if you count the 3 on the board itself). ;) Disadvantage: they are PCIe 4.0.
dizzydizzy@reddit
how are you calculating that? bandwidth and latency are very different beasts?
KontoOficjalneMR@reddit
It's always a rough estimation. Everything will of course depend madly on what kind of NVMe drive you use, what RAM, whether the RAM is dual-channel, etc.
No_Efficiency_1144@reddit
Distillation works dramatically more efficiently with reasoning models where you lift the entire CoT chain so IDK if distillation of non-reasoning models is that good of an idea most of the time.
epyctime@reddit
It's an MoE, not necessarily a (known) distillation. There are 1 trillion total parameters, with 32 billion being active at any time.
No_Efficiency_1144@reddit
Yeah, I am not saying Kimi is a distillation; I am talking about distilling Kimi.
In my opinion another attempt at Deepseek distils is a better idea
epyctime@reddit
I gotcha yeah I'm excited for the distills as well, cos I can't run this shit for the life of me
No_Efficiency_1144@reddit
This one is really strong it performs similarly in math:
Charles Babbage
epyctime@reddit
I use it for code or summarizations etc, what sorts of maths are people doing? Has someone done a new proof or something using an LLM yet?
No_Efficiency_1144@reddit
Most sub areas of math can be investigated using LLMs.
The proof finding LLMs find new proofs all the time. They can take a long time to run though.
Substantial-Dig-8766@reddit
Oh yeah boys, another model that I'll never run locally, to completely ignore while watching people hype it 😎
lightninglemons22@reddit
Imagine telling someone a year ago that there's going to be an open-source 'Trillion' parameter model
asssuber@reddit
That's peanuts.
I would point whoever told me that to the 1.6 trillion parameters model that google open sourced in 2023: https://huggingface.co/google/switch-c-2048
:D
No_Efficiency_1144@reddit
Yeah, no one expected that.
DistanceSolar1449@reddit
That's because nobody expected a 1T dense model, whereas modern models are MoE.
Kimi K2 is trained on 15.5T tokens, so 2.976×10^24 FLOPs to train.
That'll take you about 191.4 days to train at ~50% MFU on a standard single NVL72 server rack with 9 servers of B200s (if you have 2 racks, then half the time). A single 8×B200 server is about $37/hr currently, so 9 of those is $333/hour. The total cost to train Kimi K2 is in the ballpark of $1.52 million. Of course, you're not gonna find real NVL72 rentals that easily, but this gets you a rough estimate.
A 1T dense model would take you ~16 years.
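For anyone who wants to poke at the arithmetic, here's the back-of-the-envelope version (the 6·N·D rule of thumb; the effective per-GPU throughput is the one implied by the figures above, not a measured number):

```python
tokens      = 15.5e12
active      = 32e9
flops       = 6 * active * tokens              # ~2.976e24 FLOPs

eff_per_gpu = 2.5e15                           # ~5 PFLOP/s peak at ~50% MFU (assumed)
rack_gpus   = 72                               # one NVL72 rack
days        = flops / (eff_per_gpu * rack_gpus) / 86400   # ~191 days

cost_per_hr = 9 * 37                           # nine 8xB200 servers at ~$37/hr
cost_musd   = days * 24 * cost_per_hr / 1e6    # ~$1.5M

dense_years = 6 * 1e12 * tokens / (eff_per_gpu * rack_gpus) / 86400 / 365
print(f"{days:.0f} days, ${cost_musd:.2f}M; 1T dense: ~{dense_years:.0f} years")
```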
No_Efficiency_1144@reddit
It's interesting that Kimi is cheaper to train.
GPT-4, known at the time to be a MoE, was 2.5 years ago, so the MoE/dense differences have been known for a while.
DistanceSolar1449@reddit
I'm actually undercounting deepseek. If you factor in the MTP params, it's over 40b active. So it's about 1/5 more expensive than Kimi K2 in terms of pure compute.
inevitabledeath3@reddit
MTP params?
ForsookComparison@reddit
I remember some guy getting dogpiled because he said he expected Llama3 to release with a 300B set of weights lol
MoffKalast@reddit
One that rivals Sonnet 4 apparently, even.
Ok_Cow1976@reddit
Pure bullshit, people would say.
ZestyCheeses@reddit
Good benchmark improvements for just 2 months. What are the major US companies doing? If the Chinese keep this progress up they could soon be the leaders.
Safe_Leadership_4781@reddit
Look at most of the names of the people on the scientific papers on AI, even if they were published in the US. They have always been in the lead.
procgen@reddit
Not seeing many of these names on Attention is All You Need ;)
Safe_Leadership_4781@reddit
It is also worth taking a look at the references cited in Attention is all you need, which form the basis of this important treatise. Since 2017, the apparent dominance has increased, especially in the technical reports on the models.
procgen@reddit
Let us never forget to pay tribute to the founding fathers: https://en.wikipedia.org/wiki/Dartmouth_workshop
No_Efficiency_1144@reddit
They keep on picking different people and events and calling that the start of AI but they always pick something too late. Ising Models were in 1924 and you could go further back than that.
procgen@reddit
AI literally did not exist as a field prior to these men starting it.
No_Efficiency_1144@reddit
This is erasing the work of the previous decades though.
Babbage, Lovelace, Ising, Hilbert etc were earlier.
procgen@reddit
They weren’t working on AI.
No_Efficiency_1144@reddit
They were, the label isn’t important. The field is still really just a subfield of applied math, physics, chemistry and engineering anyway.
procgen@reddit
They were not. They were not explicitly attempting to recreate the full power of human intelligence in machines.
No_Efficiency_1144@reddit
Okay I just don’t use this definition at all.
Safe_Leadership_4781@reddit
Who would forget that. But are we talking about research that took 60 years to break through or the dominance since the breakthrough of AI with the publication of the first GPT model?
No_Efficiency_1144@reddit
A lot of people don't realise that Attention Is All You Need built on a specific type of RNN that already had attention added. That is why it says attention is "all you need": the RNN was removed. For certain types of dataset the original RNNs with attention are actually better than transformers to this day.
Safe_Leadership_4781@reddit
It is also worth taking a look at the references cited in Attention is all you need, which form the basis of this important treatise. Since 2017, the apparent dominance has further increased, especially in the technical reports on the models.
procgen@reddit
Genie 3, AlphaFold 3, IMO gold, ARC-AGI, etc.
ZestyCheeses@reddit
Not available, Not available, Not available and a benchmark... Those products are interesting but we don't have access to them.
procgen@reddit
I mean that US companies are building models that significantly outperform on the ARC-AGI benchmarks.
Massive-Shift6641@reddit
> What are the major US companies doing?
You're asking the wrong question. A better question is: what are the Chinese companies doing? We have seen no Chinese equivalent to GPT 5, or at least Grok 4, so far, that is, a Chinese model that is clearly able to reason and solve problems far outside its training data. On various benches, DeepSeek only recently started to exhibit this kind of behavior, but even so it's still not quite there, and other Chinese models are still behind it.
LindaSawzRH@reddit
The Chinese are supporting Open Source, the Americans don't understand that concept.
lorddumpy@reddit
Come on bro
Massive-Shift6641@reddit
The Chinese seem to be not that great at supporting open source, because there should already be an open-source contender to GPT 5. There is still none. If Qwen's next model becomes one, I will be very pleasantly surprised.
ffgg333@reddit
Is the creative writing better?
Amazing_Hat7058@reddit
What specs do I need to run this?
synn89@reddit
On the easy to setup side, pretty much a Mac M3 Ultra 512GB system: https://www.youtube.com/watch?v=-zfUvA2CDqE
But in general, you want high bandwidth RAM in the 0.5 to 1.0 Terabyte range. This isn't really something most people are going to be able to run at home.
Amazing_Hat7058@reddit
Thanks for the reply! I have a workstation with lots of RAM, 64GB for now, but I can upgrade it... Is it pointless trying to run this on a workstation-like setup with main memory instead of an integrated GPU?
synn89@reddit
In general, yeah it would be. Especially when you have services like https://nano-gpt.com/ which you can run it on very cheaply at a good speed.
OsakaSeafoodConcrn@reddit
Possible to run on i7 cpu and 64GB DDR4 at reasonable 3tk/s?
synn89@reddit
No. You'd want more like 512GB-1TB of RAM and a processor that can access it properly(like an Epyc).
Professional-Bear857@reddit
It's slightly better than Qwen Coder despite being twice the size, so it seems like diminishing returns set in pretty hard after the 500B parameter mark.
synn89@reddit
Except it likely has much more broad knowledge outside of the coding domain. For example, I found using Qwen as a coder and Kimi K2 as a documentation writer was a good combo.
ab2377@reddit
ah, too small for my laptop, i will pass
icpart@reddit
It is not something special for coding tasks. I ran a simple test across Claude, Qwen Coder 4B, Grok Code Fast and Kimi with the simple prompt "C Program to Find the Largest Number Among Three Numbers". The most comprehensive and accurate answers, covering different methods, came from Qwen Coder and Claude. The most useless code was from Kimi K2. By the way, Qwen Coder and Qwen Thinking on Qwen.ai give very similar results to Claude Sonnet. I use a free Claude account. Maybe Kimi K2 is much better for agent tasks, but for simple code generation it is not good at all.
Marksta@reddit
With such a simple task and no guidance on how you'll pick a winner, you're just rolling the dice on who makes something that's prettier to your eyes.
TheRealMasonMac@reddit
This is my immediate impression of it for long-fiction (novel chapter) creative writing: It seems more nuanced and adapts better to the context of the scenario. That said, it does still struggle with long-context instruction following. It is also still engaging with tropes that do not make contextual sense. Hopefully these are things that might be addressed by reasoning as I'm convinced that long-context creative writing requires it.
Overall, it's about 80% of the way to GPT-5 IMO. Exceeds GPT-4o.
UsernameAvaylable@reddit
Funny enough up there somebody is claiming the model is shit because it doesn't know "obvious" music theory stuff i never heard about.
I guess at some point models will be like people and it will be like calling stephen hawking useless because he misses all his free throws at basketball...
NandaVegg@reddit
I forget where the reply you are referring to is, but they were talking about intermediate-to-advanced musical concepts (scales/modes) that anyone who has attempted to play jazz would at least roughly know, and that any professional film composer would know. It was niche domain knowledge, but not ridiculously obscure.
I'd also agree with that reply that DeepSeek is one of the best open-weight models when it comes to non-STEM, fairly obscure knowledge. Western closed-source models, like o3, are surprisingly good at understanding extremely niche non-STEM topics and concepts, even multilingual ones, and DeepSeek comes pretty close.
NobleKale@reddit
'state of the art' is the most useless fucking phrase in LLMs
Inect@reddit
Well this second it is...
holistic-engine@reddit
From what I've read, the hardware requirements to even run this thing are insane; we're talking a dozen H100s or something, if I'm not mistaken.
Awwtifishal@reddit
If you want to serve many users, yes. But if it's only for you and if you don't mind slower speeds, it's not that expensive.
Amgadoz@reddit
Yes. The upfront cost is quite high. Serving it at a large scale is quite cheap though.
power97992@reddit
How much did this model and the original K2 cost to train? They must be bleeding money like crazy... The paid API can't cover the cost; Alibaba, Tencent and venture capitalists are really helping them.
Awwtifishal@reddit
The original K2 cost around $20-30 million in total to train, thanks to its new training optimizer, Muon, which has challenged the 7-year status quo of AdamW.
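For anyone curious, this is roughly what the Muon update looks like as I understand the public open-source implementation (momentum, then a Newton-Schulz orthogonalization of the update for 2D weight matrices). Treat it as an illustrative sketch, not Moonshot's exact recipe:

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately replace g with an orthogonal matrix pointing the same way."""
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients from the public implementation
    x = g / (g.norm() + 1e-7)
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x

def muon_step(weight: torch.Tensor, grad: torch.Tensor, buf: torch.Tensor,
              lr: float = 0.02, momentum: float = 0.95) -> None:
    """One Muon update for a single 2D weight matrix (sketch)."""
    buf.mul_(momentum).add_(grad)                      # heavy-ball momentum
    weight.add_(newton_schulz_orthogonalize(buf), alpha=-lr)
```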
createthiscom@reddit
hmm. According to the Aider polyglot it is performing worse than the previous model: https://discord.com/channels/1131200896827654144/1413369191561564210/1413467650037780541
SatoshiNotMe@reddit
It now has 256k context, double the previous version. Also it’s very easily usable in Claude Code, e.g via this simple setup:
https://github.com/pchalasani/claude-code-tools/tree/main?tab=readme-ov-file#-using-claude-code-with-open-weight-anthropic-api-compatible-llm-providers
cantgetthistowork@reddit
Pls be 256K native context 🤞
m_shark@reddit
“Extended context length: Kimi K2-Instruct-0905’s context window has been increased from 128k to 256k tokens, providing better support for long-horizon tasks.”
cantgetthistowork@reddit
I saw that but I couldn't find any info on whether it was RoPE bullshit
Junliang_214@reddit
Just tried it out. Definitely much better for agentic tool calling, and seems to be more self-aware of the actions it has taken previously. UI wise definitely improving. Sometimes it still goes on infinite loops but huge improvements!!
(P.S. I built a vibe-coding platform focused on speed, powered by different fast-inference models from Groq and more. Just added the new Kimi K2 model. Do try it out for free here: Groq (dot) Sampleapp (dot) ai 👀)
LuozhuZhang@reddit
Wow, is Kimi moving to a thinking model?
NoseIndependent5370@reddit
They should.
paperbenni@reddit
No
Lopsided_Dot_4557@reddit
The new Kimi has really got some serious agentic capabilities. I did a testing video here : https://youtu.be/i1rQ88QgtKQ?si=OA86ueFOdBk1wCbx
Daniel_H212@reddit
Based on benchmark scores it's not as big of an improvement as I was optimistically hoping for, but still a great option for distillation into smaller models now. Does seem like there's room for them to keep training this thing further though?
oxygen_addiction@reddit
A heads up to everyone, it's available on Groq at 200t/s - Kimi K2 - GroqDocs https://share.google/qkQ0GU1JWmrCDMsY9
Hoak-em@reddit
Dang I can't wait for FP4 kernels on AMX (SGLang) and good hybrid 5090 + dual socket Xeons -- this thing could be great with an FP4
sstainsby@reddit
I'd be interested to try this out in GitHub Copilot compared to Sonnet 4.
brianllamar@reddit
Run it in Continue and report back. Easy to do a side by side in VS code
Ordinary_Mud7430@reddit
The benchmark ranking is the most honest I have ever seen. It's the first time I've seen a Chinese model not come out rated higher than Sonnet 4. Thank goodness... Now I will actually give this one a chance.
DirtyGirl124@reddit
Does it pass the vibe check?
Dr_Karminski@reddit (OP)
HuggingFace: https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905