Kimi K2.6 is a legit Opus 4.7 replacement
Posted by bigboyparpa@reddit | LocalLLaMA | View on Reddit | 354 comments
After testing it and getting some customer feedback too, it's the first model I'd confidently recommend to our customers as an Opus 4.7 replacement.
It's not really better than Opus 4.7 at anything, but it can do about 85% of the tasks that Opus can at a reasonable quality, and it has vision and very good browser use.
I've been slowly replacing some of my personal workflows with Kimi K2.6 and it works surprisingly well, especially for long time horizon tasks.
Sure, the model is monstrously big, but I think it shows that frontier LLMs like Opus 4.7 are not necessarily bringing anything new to the table. People are complaining about usage limits as well; it looks like local is the way to go.
ViRROOO@reddit
Apple release a new M5 with 512gb and my soul is yours!!!!
Bobylein@reddit
That isn't enough though, is it?
ViRROOO@reddit
No, but they have DMA via Thunderbolt. So if you have 2 Mac Studios you have 1TB of RAM.
Bobylein@reddit
Ah crazy, though I wouldn't expect any useful speeds from that?
ViRROOO@reddit
Users reported (128gb M5 macbook):
Qwen3-Coder-Next 8-bit: 79 tok/s at 4K context, 48 tok/s at 64K;
Qwen3.5-122B-A10B 4-bit MoE: 65 tok/s at 4K, 55 tok/s at 32K;
GPT-OSS-120B Q8: 88 tok/s at 4K, 65 tok/s at 32K;
Qwen3.5-27B dense 6-bit: 23 tok/s at 4K
So it's already better than my Strix Halo :D Personally I think 45 tok/s is the sweet spot. Above that you have to spend real money, not "just" 30k on two M5 Studio Ultras (if they ever come).
Relative_Rope4234@reddit
Do you use ROCM or Vulkan with strix Halo?
ViRROOO@reddit
Vulkan. ROCm is an insufferable, bloated piece of software that should be abandoned.
neopolitan77@reddit
LocalLLaMA: Kimi K2.6 is the deal
Me: đ€€
Huggingface: Kimi K2.6 is 1T parameters
Me: đł
How are we ever going to host models like these? I literally put in an order for an AMD Ryzen AI Max+ 128GB yesterday, and the next thing I see is this post. My feeling is: either you get something that actually feels good (quality-wise, waiting a bit longer is tolerable), or the time + money investment is hard to justify. Anyway, really looking forward to playing with hopefully at least some half-decent models.
EbbNorth7735@reddit
Nice, so in 1 year I'll be able to run an Opus-like model at home, and in 2 years most others will as well
Stahlboden@reddit
And you'll still be frustrated you can't run Opus 7 locally
EbbNorth7735@reddit
Some will. In reality there's a point where local LLMs can perform 99% of the tasks you want them to
I_HAVE_THE_DOCUMENTS@reddit
Yeah I'm at a point where I want reliability and speed more than anything else. My dream model is something that's about as good as Opus 4.5, but fast and local. Hopefully LLM tech will improve enough to make that a reality in the coming few years.
neopolitan77@reddit
We keep saying that, but then what we actually want is the thing that got released half a year ago and we've grown to love since. I'll start believing that we've hit the "good enough" plateau once your preferred model is 2 years old (for reference, that would be about GPT4o today).
Mochila-Mochila@reddit
We'll still need fast, power efficient and relatively affordable hardware to run these.
This means probably the generations released after the upcoming AMD Medusa Halo, and nVidia N2 (add Intel's APU with nVidia IP into the mix). So, around 5 years is my guess.
am2549@reddit
No ;( When abilities rise, demands rise as well.
BingpotStudio@reddit
Hopefully the original Opus 4.6 and not the enshittified 4.6 or 4.7 we ended up with.
I'd rather use Sonnet than Opus 4.7 at this point.
Ardalok@reddit
How much memory do you think this will require?
EbbNorth7735@reddit
Depends, 2x3090s and some system ram
Ardalok@reddit
Do you mean running the entire model in VRAM? Or only the activated experts?
coder903@reddit
Do you know of a cheap place to run it
Turbulent_Pin7635@reddit
M3Ultra is the cheapest
Relative_Rope4234@reddit
Apple stopped selling 512GB M3 ultra
FusionCow@reddit
Local, your options are either a 512gb m3 ultra or a server CPU with a shit ton of high speed high bandwidth ram, but both will be expensive no matter what you do. The model is relatively cheap on api, and honestly you should probably use it there unless you plan on dumping thousands and thousands into very llm specific hardware
Strong_Owl7286@reddit
I've been hearing that the actual usage was not a real big step up? Curious to hear your experience
SirStarshine@reddit
But can it work as a MCP connection for a game engine agent, like Coplay for Unity?
ConsciousStruggle5@reddit
What's the cost comparison between both?
FabsDE@reddit
I am testing it for OpenClaw currently. It's nice for reasoning and stuff, but everything design-related, like websites or PowerPoint for example, is much worse than with Sonnet or Opus, unfortunately
flipf17@reddit
Maybe this is a hot take, but from using Kimi2.6 as an agent with openclaw, hermes, custom python agent variations, kimi2.6 was definitely behind sonnet4.6. It always tried to make me do its job for it and often stopped referencing the most important factors that were going on in a conversation and changed topics often. Sonnet is decent, but I notice its flaws, too. However, Opus is just on a whole other level. It's so smart, and it sucks comparing stuff to it because they end up feeling like trash models. Maybe they're pretty decent when you don't compare them to Opus though.
footoncake@reddit
&&
ghgi_@reddit
And the best part is: it won't randomly get nerfed, and you won't get gaslit into thinking it hasn't been.
Look_0ver_There@reddit
Now if someone can just "lend" me the ~$50K it would take to run it locally with a good speed... 1.1T params is well up there!
EndlessB@reddit
Would 50k even get you there? It's a beefy model
ghgi_@reddit
2x M3 ultra 512gb mac studios used could run it, roughly in that price range.
michaelsoft__binbows@reddit
Is this not going to be insanely slow?
AttitudeImportant585@reddit
prefill speed will be slow and unusable for a 1T model
PeakBrave8235@reddit
It's not a 1T model. It's the active parameter count that's relevant to this discussion.
PeakBrave8235@reddit
Nope
ghgi_@reddit
Define insanely slow? For agentic coding? Probably on the low end, wouldn't be fun to use. For chat or tasks you're willing to wait on? Perfectly fine. My guess from what I've seen is roughly 10-15 tokens/sec, maybe a bit more on a good day, which isn't half bad considering the quality.
Such_Advantage_6949@reddit
The issue is prompt processing. A Claude Code or opencode opening prompt will take like 10 mins on this system to start replying
UnsolicitedPeanutMan@reddit
Really puts into perspective just how much compute is being allocated every time you talk to Claude. Has anyone done the math on the rough $ figure worth of GPUs you're talking to every time you use Claude? Is it $50k, or more?
TokenRingAI@reddit
Your requests get processed by a rack-scale system.
Costs are in the multi-million dollar range per networked rack, but these linked systems can process huge numbers of requests in parallel at a higher efficiency than a cheaper system
FoxiPanda@reddit
Yep, it's this. When you talk to Claude, you are almost assuredly talking to NVIDIA NVL72 GB200/GB300 class systems or similar hyperscaler systems (depending on how you access it).
Kappa-chino@reddit
Or TPUs at Google
Kimi K2 was specced to run on 8xH200 which costs approx $600k last I checked. That's far from cutting edge though
ghgi_@reddit
Math is hard to do on model sizes we don't know and GPUs we don't know. Opus could be 1 trillion or 10 trillion params, my guess is 3-5T, but to put it simply: probably in the millions per rack if they are running the latest and greatest.
Adventurous-Paper566@reddit
You won't get 10 tok/s on an M3, maybe 5.
Outside-Line-9508@reddit
Try adding an NVIDIA RTX PRO 6000 Blackwell GPU and offload a few more MoE layers onto it. 96GB of VRAM is sufficient for 1T MoE models, as they only activate under 40B parameters at a time. The PCIe 5.0 bandwidth is also enough to avoid becoming a bottleneck
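Rough back-of-envelope of why the per-token working set plus KV cache can fit in 96GB. The 40B-active and int4 figures are the ones above; the layer/head numbers below are placeholders for illustration, not Kimi's real config:

```python
# Rough sanity check of "96GB covers the hot path" -- illustrative numbers only.
BYTES_PER_PARAM_INT4 = 0.5

def gib(n_bytes: float) -> float:
    return n_bytes / 1024**3

active_params = 40e9                               # MoE params activated per token (figure from above)
active_weights = active_params * BYTES_PER_PARAM_INT4

# KV cache guess: 2 (K+V) * layers * kv_heads * head_dim * bytes per element.
# These model dimensions are made up for illustration, not Kimi's actual ones.
layers, kv_heads, head_dim, dtype_bytes = 60, 8, 128, 2
kv_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
kv_cache = 128_000 * kv_per_token                  # a 128k-token context

print(f"active weights ~ {gib(active_weights):.0f} GiB")   # ~19 GiB
print(f"KV cache @128k ~ {gib(kv_cache):.0f} GiB")          # ~29 GiB with these guesses
```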
muyuu@reddit
Not anymore, this one is much bigger.
flobernd@reddit
It's native INT4, so "only" 550GiB for the weights. It fits comfortably on 8x RTX Pro 6000 for example if you are looking for a "low" budget solution (compared to H200). But currently TG is only about ~50/60 t/s with that setup. Once the dflash model is released we might be able to use that as the draft model for ~200 t/s (if the promised acceptance rate of ~5 is true).
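Back-of-envelope for that draft-model claim, assuming the usual speculative-decoding logic (each target-model pass verifies a batch of drafted tokens, so throughput scales roughly with tokens accepted per pass, ignoring draft-model overhead):

```python
# Sketch only: numbers are the ones quoted above, not measurements.
baseline_tps = 55           # ~50-60 t/s reported for the 8x RTX Pro 6000 setup
accepted_per_pass = 4       # an "acceptance rate of ~5" would keep roughly 4-5 tokens per verify pass

estimated_tps = baseline_tps * accepted_per_pass
print(f"~{estimated_tps} t/s if the acceptance rate holds")   # ~220 t/s, the same ballpark as the ~200 t/s claim
```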
muyuu@reddit
Is that so? I saw FP16 somewhere.
Vancecookcobain@reddit
At 15 tokens a second
Baldur-Norddahl@reddit
8x RTX 6000 Pro = 70k USD. It should be possible to homebrew a system to carry those in the 10k to 20k range, so maybe 80k to 90k USD.
Notice you don't need a beefy CPU or a ton of RAM. Let the GPUs do the work. Use a pcie switch to keep tensor parallel GPU to GPU transfers local never even touching the CPU.
Crinkez@reddit
8x RTX 6000 won't even get you halfway to the required memory.
Baldur-Norddahl@reddit
Kimi K2.6 was trained using QAT and delivered as int4. So it is just about 590 GB in size. Available VRAM 768 GB so we have plenty.
ghgi_@reddit
? A single RTX 6000 Pro has 96 gigs? 8x is 768 gigs, and it's native INT4, meaning weights are 550-ish gigs + context headroom, so it's plenty. Unless you're talking about the RTX 6000 Ada 48GB, where 8x would be 384 gigs, which is still 70% of the way there, so not sure what number you're referring to otherwise.
Crinkez@reddit
It's a 1T model, so 1000GB+, and that's just to load the model into memory. You still need space for context window.
ghgi_@reddit
1T params != direct size for Kimi; it's native INT4, meaning it's pre-compressed and fine-tuned at int4 down to ~550 gigs, look for yourself on HF. If it were INT8 it would be ~1000 gigs, but in this case it's not.
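The arithmetic, for anyone following along (figures are from this thread, not official specs):

```python
params = 1.1e12                       # ~1.1T-parameter MoE

size_int4 = params * 0.5 / 1e9        # 4 bits = 0.5 bytes per param
size_int8 = params * 1.0 / 1e9
print(f"int4: ~{size_int4:.0f} GB, int8: ~{size_int8:.0f} GB")
# int4: ~550 GB, int8: ~1100 GB -- which is why it fits in 768 GB of VRAM (8 x 96 GB) with room left for KV cache
```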
suicidaleggroll@reddit
Depends on what is meant by "good speed". Kimi K2.5 can run at ~15 t/s on pure CPU as long as you have a bunch of memory channels. That honestly isn't terrible for interactive chat, but is definitely too slow for agentic coding. If you want full GPU, then no, $50k wouldn't get you there.
cantgetthistowork@reddit
The problem is not the TG; the PP on DDR5 is abysmal and unusable for coding
profcuck@reddit
Just a question - "too slow for agentic coding" - is it?
My way of thinking - and I'm genuinely asking, not arguing is this:
interactive chat - 7-9 tps is minimum as long as time to first token isn't too terrible, since that's a pretty fast human reading speed
interactive code assistant - people tend to want to see 30-50 minimum and a solid time to first token is crucial
agentic coding - you set up a full specification and then set the agent loose on it and come back the next day or whatever to see how it's going. 15 t/s isn't going to set any world records but agentic coding is the one area where going a bit slowly isn't nearly as important as being very smart. A smart model that can act independently to generate a lot of quality code in 3 weeks rather than 1 week is still extremely interesting.
Have I got it wrong?
RegisteredJustToSay@reddit
7-9 tokens per second is fine as long as you're not doing reasoning. If the model wants to spend 2k tokens reasoning before responding (which isn't uncommon, especially for complex stuff), that means you'll be waiting over 3 minutes before you see a single thing. Sure, you can sit there and watch the reasoning, but it's a glorified progress bar in the end. You really need an order of magnitude faster to make reasoning models comfortable to use.
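The arithmetic behind that, if you want to plug in your own numbers:

```python
# Time to sit through a hidden reasoning phase at different generation speeds.
reasoning_tokens = 2000
for tps in (8, 30, 80):
    minutes = reasoning_tokens / tps / 60
    print(f"{tps:>3} t/s -> {minutes:.1f} min before the first visible answer token")
# 8 t/s -> 4.2 min, 30 t/s -> 1.1 min, 80 t/s -> 0.4 min
```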
profcuck@reddit
That's all true. It's interesting because before reasoning models 'time to first token' and 'tokens per second' mapped directly to what we actually read. Now of course the reasoning part is in between that. So the mapping between "human reading speed" and "tokens per second" is more complicated than before.
ShengrenR@reddit
Coding with agents needs a ton of context processing too; if your pp is low you're going to be waiting a looong time potentially. It'll naturally vary by harness and how well they have context reuse set up, etc., but lots of 'agents' bounce between tasks to search/edit/re-read. If you have thinking on top, you'll be waiting quite some time. Naturally, some tasks care about speed more than others.
DuncanFisher69@reddit
You're closer to mostly wrong. Below 20 tokens/sec most agentic tools will time out, even ones optimized for local coding, because a lot of agentic coding tools do "turns". So it's often trying to cycle things quickly, or if possible, in parallel.
profcuck@reddit
Thank you that's interesting. I've not read many reports of people trying to do agentic coding using local models. My thinking didn't take into account timeouts. That seems like a reasonably solvable problem since the coding tools are old school deterministic code that can (in principle) be set to wait however long is necessary.
But if that's the state of play today, that definitely answers my question.
DuncanFisher69@reddit
They can, I suppose, but out of the box they already have timeouts set for a sane person.
Look, all I am saying is if you have a choice between a model that is 60 tokens/sec and 18 tokens/sec, that 18 tokens/second model had better be really really really worth it. Somewhere between 20-60 tokens/second the performance of these tools gets pretty good and your context window becomes the limiting factor. But if you can get a fairly large one (32k-48k) you can actually get a lot of real work done.
profcuck@reddit
I'm with you. Very helpful thanks.
txgsync@reddit
And most coding tools bust prefix caches left, right, and center, leading to very unsatisfactory outcomes even though LM Studio, MLX, and llama.cpp all support prefix caches.
I am noodling with the Pi SDK to see if I can construct one that religiously re-uses prefix caches, because the difference is literally "turn that takes 20 seconds" vs "turn that takes 8 minutes" at large context sizes on an M4 or M5 Max Mac.
Wide-Section5065@reddit
I completely agree; accuracy and long horizon consistency (smartness) matters more than speed in agentic coding.
IrisColt@reddit
I think the issue is that people want to know immediately if the agent derailed.
zansibal@reddit
Interesting, how much RAM/what quantization would be needed? I have 384 GB on 12-channel DDR5 and an Epyc 9554P. I'd like to try.
suicidaleggroll@reddit
Q4 needs around 600 GB, with 384 you're pretty stuck as far as Kimi is concerned. Maybe you could fit a Q2, though I'm not sure if that's worth it.
zansibal@reddit
Thanks, you're right. I just now tried GLM 5.1 with UD-Q3_K_M. At 338 GB it fits nicely. With CPU only I got 7 t/s, and hybrid GPU/CPU gives me 15 t/s. 15 feels ok for that kind of model.
sob727@reddit
Very naive question here.
Say you have a CPU that does 15 t/s, and GPUs that would do 75 t/s if you had enough of them.
Now if you split the layers evenly between CPU and GPU, do you get the average (45 t/s), or is it absolutely not the case (non-linearity etc)?
suicidaleggroll@reddit
It's not linear. Here is a plot of roughly what the relationship looks like: https://www.reddit.com/r/LocalLLaMA/comments/1itfg77/test_prompt_processing_vs_inferense_speed_vs_gpu/
Basically you need to be running almost entirely on the GPU before the speed really starts climbing
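A minimal way to see why, assuming token generation is memory-bandwidth bound so each device contributes time proportional to its share of the weights (a simplification, but it matches the shape of that plot):

```python
def tokens_per_sec(gpu_fraction: float, cpu_tps: float = 15, gpu_tps: float = 75) -> float:
    """Per-token time is the sum of each device's time spent on its share of the model."""
    per_token_time = gpu_fraction / gpu_tps + (1 - gpu_fraction) / cpu_tps
    return 1 / per_token_time

for f in (0.0, 0.5, 0.9, 1.0):
    print(f"{f:.0%} on GPU -> {tokens_per_sec(f):.0f} t/s")
# 0% -> 15, 50% -> 25, 90% -> 54, 100% -> 75: the curve only takes off near full offload
```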
sob727@reddit
Thank you very helpful
thedirtyscreech@reddit
The prompt processing time will change linearly. Token generation won't. To go from 10s to 5s of prompt processing, you need to double your prompt processing speed. To shave off the next 2.5s, you need to double it again.
sob727@reddit
Ha yeah makes a lot of sense.
sfw_mtv@reddit
If you use a very good quality draft model, you can get speculative execution to speed up the entire thing to approach the GPU speed, but basically when you're combining the GPU and CPU it will be much closer to just the CPU speed, because that's your bottleneck. Hybrid might actually be slower than CPU only if the card is hobbled by the PCIe link speed.
sob727@reddit
Thank you.
So I imagine if you mix GPUs say of different generations, you'll be limited by the bandwidth of the slowest also.
I'm asking as I wonder what I could get on my Threadripper once I land 512GB of RAM and a couple RTX 6000.
sfw_mtv@reddit
You can hurt yourself with putting GPUs in the wrong slots, or putting them in with risers that can't support full bandwidth that you're expecting. If the GPU slot bandwidth is the limiting factor, you'll never get faster than what a slot can support, but if the bottleneck is that you're on a desktop CPU that only handles 2 memory channels, especially ddr4, you're going to hit that also. Threadripper can do a lot more memory channels so it will be more enjoyable for CPU only inference. Find a draft model (something that runs the same tokenizer as your main model) you can run entirely on a single GPU and that will keep all of your models fed with parallel data and you'll be a happy camper.
alphapussycat@reddit
True agentic coding shouldn't require high t/s I think?
You'd queue up a bunch of feature tasks that it handles. Meanwhile, you do the architecture work for other features and evaluate the code it's done.
duboispourlhiver@reddit
Not that obvious. I can queue up work for a night-long agentic run, and I'd love the queue to be three times longer.
FriendlyRope@reddit
Mhh, I don't know.
For most stuff a set-and-forget setup is fine; who cares if the computer works for 5 days, if you get what you want in the end?
Wonder if you could use multi-agent setups, with something like Qwen 3.6 or Gemma for a lot of the "grunt work", using expensive models only sparingly
Awkward_Elf@reddit
Would this be with 12 channels of DDR5 or are there other optimisations now that help push even a DDR4 system past ~4 t/s?
suicidaleggroll@reddit
I can only speak to the performance on my system, which is an Epyc 9455P with 12 channels of DDR5-6400. Without the GPUs, Kimi K2.5 generates at about 15-17 t/s on the CPU in Q4.
DDR4 will be slower, but I don't know by how much.
Qwen30bEnjoyer@reddit
What prompt processing speeds have you observed? I would guess you get about 45 - 70 TPS PP on that system, how close am I? :)
suicidaleggroll@reddit
pp was 72, last time I tried it. I normally use a pair of RTX Pro 6000s to help, with them it's about 140/25 pp/tg, it's not often that I shut off the GPUs and run on the CPU alone so it's been a while.
Fit-Statistician8636@reddit
I have a similar machine. Can I ask, do you prefer SGLang+KTransformers, llama.cpp, or ik_llama.cpp, and why? đ
suicidaleggroll@reddit
llama.cpp. I like ik_llama, but I've just run into too many issues, especially with tool calling, to rely on it. vLLM and SGLang take too long to switch models in my experience, which doesn't work well with my workflow. I do love the performance of vLLM though, and would happily switch if I could figure out how to get model loading times down.
Fit-Statistician8636@reddit
Thank you for an honest answer. I have a similar experience. I noticed I am using vLLM and SGLang more and more lately. Typically, it is to sort out various tool calling / reasoning parsing issues. But loading times are terrible for sure.
Qwen30bEnjoyer@reddit
Got it. It's interesting that 192 GB VRAM doesn't move the needle that much on prompt processing. I would expect ~4000 TOPs at FP4 with 2 RTX 6000s, and with INT4 Kimi K2.6, you should have about 1/3 of the model sitting in VRAM. I wonder why speeds aren't significantly higher?
suicidaleggroll@reddit
GPU contribution isn't linear. It's still bottlenecked by the CPU until you get at least ~75% of the model in VRAM, then you start to see significant gains after that. You can try it with smaller models as well: start with it entirely on CPU and then offload one layer at a time to GPU; you'll see the speed, both pp and tg, doesn't really move until you get a majority of the layers on the GPU.
twack3r@reddit
Because PCIe interconnect becomes the limiting factor.
It already is between two RTX 6000s; once you start including CPU-to-memory, that limit becomes the ceiling.
Free-Combination-773@reddit
But isn't prompt processing mostly limited by compute?
ProfessionalSpend589@reddit
I suspect that where the compute sits relative to the data matters. If the data sits in RAM, the GPU doesn't touch it (or if I'm wrong, it touches it at the speed of PCIe).
Free-Combination-773@reddit
True, but even one RTX 6000 should be enough to fit all attention heads to VRAM.
LegacyRemaster@reddit
DDR4 is about 50% slower
Outside-Line-9508@reddit
Try adding an NVIDIA RTX PRO 6000 Blackwell GPU and offload a few more MoE layers onto it. 96GB of VRAM is sufficient for 1T MoE models, as they only activate under 40B parameters at a time. The PCIe 5.0 bandwidth is also enough to avoid becoming a bottleneck
Present_Flower_6596@reddit
Even on new Macs it would be awful. Yeah, token gen would be okay, but prefill would be a nightmare with 1.1T params; after a few chats it would literally crawl to first token
jonydevidson@reddit
4x Mac Studio with Exo
dtdisapointingresult@reddit
22k-23k should get you there. 6x clustered DGX Spark = 768GB pool.
ThisGonBHard@reddit
AMD MI350X seem to be around 15-25k USD, so if you get lucky, you could actually run it as a server on 3 of them.
a9udn9u@reddit
M3 Ultra 512GB x 5, 2.5T unified memory would do, I think.
Final-Frosting7742@reddit
You just need two Strix Halos with 128GB, connect them somehow and compress Kimi to Q1 (~130GB), which is < 96*2, and there you go. For less than 6k.
Now if you want higher quants, just connect more Strix Halos. Still way cheaper than 50k.
Since Kimi is ~30B active params it will feel slow, but hey, you've got Kimi running.
Perfect-Flounder7856@reddit
4-8 H200s cost about $100k
muyuu@reddit
You need 8, and that's going for 240K-300K US$
the-final-frontiers@reddit
maybe when m5 mac comes out, buy 2
Look_0ver_There@reddit
You know, after I posted it, I thought a bit more, and came to the same conclusion. I think $50K is definitely low-balling it in this current RAMageddon timeline. This is more of a full cluster rack node sort of deal, and those cannot be had for $50K
Long-Chemistry-5525@reddit
I built a tool to auto-deploy models to cloud GPU providers, including multi-GPU models. I'll have to add Kimi to the list of supported models. My harness deploys a chat interface using ollama.cpp on the GPU node and an OpenClaw-like API on the local machine that runs the web UI
keepthepace@reddit
You can rent it if you need or you can just buy tokens on openrouter from a random hoster.
Look_0ver_There@reddit
Remind me. What's the name of this sub-reddit again?
keepthepace@reddit
Yes, you are right.
I just see this thing as a step up from totally closed Anthropic models. And if push comes to shove, $50k is something that is totally affordable for a company, a community, a city, etc...
FuzzyBucks@reddit
You have to scale your budget up several orders of magnitude if you want to provide a reliable service to more than one person.
keepthepace@reddit
$500k is also something that is totally affordable for a company, a community, a city, etc...
adr74@reddit
use ollama and pay 20 bucks/month
Ranmark@reddit
I actually wonder, is there a "Pareto frontier" chart showing the diminishing returns of model parameter count against benchmark scores, to look for a sweet spot?
Look_0ver_There@reddit
This site is arguably the closest to what you're looking for. They have a free models section and you can compare it with paid models too.
https://artificialanalysis.ai/
The "Intelligence vs Cost" chart there gives a decent representation of intelligence vs model size. MiniMax-M2.7 appears to be one of the better ones.
XTornado@reddit
Yeah... and the electricity bill don't forget about that.
amunozo1@reddit
At least you can choose providers, even if it's not local.
s101c@reddit
It could have been "just" ~$10K had you purchased this hypothetical machine 1-2 years ago.
ghgi_@reddit
"Kimi, make me an AI SaaS service that is guaranteed to make 50k or more before kimi k2.7 comes out, make no mistakes"
nuclearbananana@reddit
well if you run it yourself.
Otherwise you're trusting third party providers.
PerlativeCeronometer@reddit
I mean, it's already 4-bit natively, though. Much less likely to get quantized further.
nuclearbananana@reddit
Maybe not the model but the KV cache could. Or the attention weights which are not quantized
CheatCodesOfLife@reddit
The attention weights are tiny, absolutely no benefit quantizing them.
ghgi_@reddit
True, but with other vendors able to run it, not just one centralized entity, you can cross-check and validate quality across them to see who's lying. Not updated recently, but: https://github.com/MoonshotAI/K2-Vendor-Verifier
IrisColt@reddit
Another best part is that it will make you feel vindicated.
xmnstr@reddit
From someone who's been subscribing to a Kimi plan for a while: that's unfortunately not true. It seems to be more temporary, likely at capacity peaks, but sometimes it even fumbles tool calls, so it's obviously too quantized. It seems to be what providers do, no matter where: either pre-nerf it or have it dynamically adjusted.
With that said, it's still nowhere near as bad and pervasive as what a certain American company is doing.
ChatWithNora@reddit
That's honestly the main selling point for me. I don't care if it's 85% or 95% of Opus, I care that it's the same 85% tomorrow as it is today.
lukaszpi@reddit
...because you can't afford to run a usable quant locally ... at least I can't
Actual_Meat_1030@reddit
No, I spent two days using K2.6 and Claude... I left K2.6 to it and it takes forever... just makes a mess... and I just kept having Claude fix Kimi's screw-ups... and Claude would fix it in seconds. At least for C++, K2.6 is no good
jgenius07@reddit
Hot damn the pricing difference
Crinkez@reddit
Yes but does it cache tokens?
cheesecakegood@reddit
Moonshot's own site says yes... but they have a separate price for "cache miss", so I'm not sure where your eventual balance lands. Their pricing, btw, is $0.16 (cache hit) or $0.95 (cache miss) per 1M input tokens as described above; $4.00/1M out; 262,144 context
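A rough per-request cost with those prices (illustrative only, assuming the cache hit/miss split is the only thing that changes):

```python
# $ per 1M tokens, from the comment above.
CACHE_HIT_IN, CACHE_MISS_IN, OUT = 0.16, 0.95, 4.00

def request_cost(input_tokens: int, output_tokens: int, cached_fraction: float) -> float:
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    return (cached * CACHE_HIT_IN + fresh * CACHE_MISS_IN + output_tokens * OUT) / 1e6

# A 100k-token agentic turn producing 2k tokens of output:
print(f"no cache reuse: ${request_cost(100_000, 2_000, 0.0):.3f}")   # ~$0.103
print(f"90% cache hits: ${request_cost(100_000, 2_000, 0.9):.3f}")   # ~$0.032
```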
VonDenBerg@reddit
Did an afternoon with Kimi K2.6 and it's not a replacement. It misinterprets and often forgets multi-step instructions.
Sonnet can ship features faster.
TheseHeron3820@reddit
LocalLLaMA -> looks inside -> one trillion parameters.
I don't know, chief, I have a hunch I can't run this locally...
Blablabene@reddit
local is the way. these models also put enormous pressure on the proprietary models to keep their cost down.
In the near future this will become more accessible to people. Today, it's still way too expensive to run "locally".
OneSlash137@reddit
It's cute you think that. Consumers are already in the process of being priced out of serious hardware. Maybe not today, maybe not tomorrow, but one day before too long everyone will be using dumb terminals connected to cloud powerhouses, and we'll get compute credits for using our computers just like we do for AI tokens now.
We can run the models now, but when current hardware starts reaching end of life I'm betting you ain't finding replacement parts, and you sure as hell won't be paying what the next-gen hardware will cost.
Do you have any clue what it takes to run these models? Providers keep the costs down? That literally isn't possible. Every time someone asks for their AI girlfriend to stroke their ego or generate a Natalie Portman pic with triple-G breasts, that's a shitload of compute power. Companies won't get better with costs, it will get worse. They'll need a way to stop people from using tokens wastefully, and that will be by pricing you all out. The best models will be for enterprises only, because home users just don't need that kind of power.
Enjoy your access to AI while it lasts. That's a tool too powerful for elites to let us plebs have. It's already happening at a rapid, rapid pace.
Spectrum1523@reddit
There is literally no evidence for this happening.
OneSlash137@reddit
Lmfao ok. Hey where is your Claude code part of the pro plan again? Oh they removed it from the pro plan?
You sound like an idiot
Spectrum1523@reddit
Do you usually reply to people with totally unrelated facts or is that just sometimes?
"Claude api costs are up therefore local compute in general will be removed sometime soon"
OneSlash137@reddit
Bye bye Claude code.
Spectrum1523@reddit
Literally what are you even talking about
_BigBackClock@reddit
instagram reels aahhh opinion
thrownawaymane@reddit
Some of us have thought this for a long time.
OneSlash137@reddit
Uninformed tween opinion.
Blablabene@reddit
Disagree.
OneSlash137@reddit
I don't give two fucks what you think lol.
Blablabene@reddit
But I don't want a f*king guitar lesson đž
typical-predditor@reddit
RTX 6000 released recently. It's less expensive than a car. Not cheap, but not impossible to buy.
mantafloppy@reddit
Model size : 1.1T params
"Local"
lol
SoylentCreek@reddit
Has anyone informed Anthropic? They're still charging $5/M in and $25/M out if you're paying API rates, which, if you're on enterprise, you are.
FoxiPanda@reddit
That's certainly the list price, but Enterprises also get insane discounts at scale. They might be paying half that token cost ... or less depending on how much they're buying in aggregate.
Zulfiqaar@reddit
I know there's at least 30% discount from base API price on packages of 200M uncached tokens or above, I'm sure some get even better deals. Windsurf used to serve at roughly 50-70% discount looking at my past usage..but given their current pricing change maybe anthropic has stopped with those deals.
FoxiPanda@reddit
I'm probably not allowed to say exactly what we get, but it's better than what you get.
squired@reddit
You can likely say if it is considerable or not however. Last time T3Chat discussed it, the volume discounts were not stellar. Don't quote me on the numbers, but it was something like they needed to guarantee $250K spend for an additional 5% discount.
ProfessionalSpend589@reddit
There are ranges of numbers too.
bussondev@reddit
I'm going to subscribe to test it.
_derpiii_@reddit
How can you come to this conclusion when it's only been out for 9 hours? That's just very... suspicious
Zulfiqaar@reddit
In all honesty the very first prompt I ran on opus 4.7 I also ran on Kimi k2.6 and it gave a pretty good response while Opus refused. I didn't bother testing 4.7 further, went back to 4.6 (this was on webui not ClaudeCode, 4.7 seems to be doing good but I can't tell the difference between it and 4.6 or even 4.5 yet.)
Super_Sierra@reddit
Uhhh, the only thing that Opus won't do over API is weapons stuff and characters under 18.
If you are hitting guardrails, at least say what they are.
Zulfiqaar@reddit
This wasn't API, it was on the Claude webapp, which already has guardrails, and then extra filters on top for Opus 4.7. The prompt was a benign writing prompt; I go to open-weight models like Kimi/DeepSeek/GLM for anything that I expect to trip filters (I have a cybersecurity client)
Caffdy@reddit
what about 4.6 vs Kimi 2.6 then?
Zulfiqaar@reddit
4.5/4.6 is good, I use it. 4.7 I only use in CC, nothing else. I tried Claude Design but its limits are so small I'd rather not. K2.6 is my fallback for Opus/Codex coding, or for search/writing/design. It's more proactive and has less restraint, for better or worse.
sgt_brutal@reddit
It's been out for at least a week as k2.6-code-preview
_derpiii_@reddit
ohhhh, ok that makes more sense
jeffwadsworth@reddit
You can't, so just treat it as pure comedy.
InterstellarReddit@reddit
Bro tested and got customer feedback all within five hours. That's insane. We're a big AI company and it takes us one week to test a new model between four engineers before we even consider a customer route.
addiktion@reddit
What eval software are you using for scoring?
RemarkableGuidance44@reddit
Question: how are you going to survive as an "AI company"? There are so many of them today.
I work for an enterprise (non-software) company; we hired a few software devs and ML engineers, built our own server at half a million, and are currently replacing SaaS software with in-house ones. Started to finetune for certain cases and it's been incredible. We still use Codex or Claude for 15% of the work, but 85% is with GLM 5.1 or Kimi or DeepSeek.
I would say, though, that my team's cost would be more than what an AI company would charge to implement such things. But hey, you are going to need knowledge inside your company to keep these running.
No_Inspection4415@reddit
When an enterprise can spend millions (infra, staff) to run LLMs and keep it reliable, used by developers, and not annoying, it is a huge success.
However, this kind of project is risky and managers hate risk. I can imagine this type of investment going to trash real fast.
addandsubtract@reddit
Do you publish these scores anywhere?
bigboyparpa@reddit (OP)
These kinds of comments are retarded. We work at an inference provider startup, so we're obligated to launch the model as soon as weights are available, and sometimes we get weights early.
Secondly, of course customers give us feedback straight away on whether the model is working or not.
_IAlwaysLie@reddit
don't use the r-slur
rebelSun25@reddit
Your post is literally saying "local replacement" ... You're employed at an inference provider shop... The comment was valid. It would have been surprising to see a garage setup casually run this model and have it tested
jazir55@reddit
To be fair, technically he is running it locally, on his inference provider's hardware đ
AttitudeImportant585@reddit
bro is doubling down lmao
jeffwadsworth@reddit
I was laughing at that too. And they downloaded it pretty fast as well. Haha.
InterstellarReddit@reddit
And he doubled down in a comment saying that they are a startup but they let their customers do the testing đđđ.
klipseracer@reddit
You can meet their customers actually... Let me introduce you to: Customer Agent 1 and Customer Agent 2!
LeTanLoc98@reddit
Benchmaxxed model
Benchmaxxed marketing
DeadInFiftyYears@reddit
What hardware are you using, and how well is it performing?
ccalabro@reddit
So I asked K2.6 for a comparison; not really inspiring confidence, as it said it had knowledge up to April 2024.
It looks like you may be using custom tags from a frontend app, or there may be a mix-up with version numbers. To my knowledge, there are no official models called "Kimi K2.6," "Opus 4.6," or "Sonnet 4.6."
I'll assume you mean the current production flagships that match those names:
albertgao@reddit
No human being uses the terms you listed, you, as a bot, need to update your prompt.
MSPlive@reddit
It was answering correctly with 2.5, any idea why not with 2.6 ?
Gloomy-Ad-7272@reddit
maybe downgrade in reasoning but upgrade in coding and overall
Technical-Earth-3254@reddit
If it's at 85% to Opus, it's probably a full Sonnet replacement?
DepartmentOk9720@reddit
Most models are already full Sonnet replacements: MiniMax 2.7, GLM 5.1, Kimi 2.6, and Qwen
MoistRecognition69@reddit
I've been using GLM 5.1 (API Coding Plan) these past few days and idk man it feels like a hallucination galore.
Maybe I'm doing something wrong? Is there something I'm missing?
mickaelxd@reddit
I went through this. I got the $80 plan for a month and realized that after about 100k of context, it gets extremely dumb on this plan. GLM is only good via the API.
MoistRecognition69@reddit
API as in standalone API, not their coding subscription?
mickaelxd@reddit
Yes, their plan is very nerfed. See if the API pricing works out for you and prefer to use that.
notdba@reddit
The 100k context issue with the coding plan was fixed about a week ago. Still got intermittent 429 responses though.
DepartmentOk9720@reddit
The context window for GLM 5.1 is fixed at something like 250k; recently Claude Sonnet got a boost to a million tokens. Check the context length
unique-moi@reddit
Of the models you mention (minimax2.7, glm5.1, kimi2.6, Qwen3.6 plus) only minimax can realistically be run locally without megadosh. Below minimax level, I hear good things of qwen3.6 moe.
power97992@reddit
Sonnet is probably better than qwen 3.6 plus and minimax 2.7
ghgi_@reddit
Better than Sonnet imo. Maybe 85% of Opus if you hand-hold it more, but I'd take it over Sonnet.
madheader69@reddit
I told minimax he was acting like sonnet and he was straight up offended...saying that the user is saying that I am behaving like an inferior model that is unable to perform effective tool calls...(Paraphrased)!
Runtimeracer@reddit
Just that MiniMax often gets so much more perplexed by even simple tasks, in ways Sonnet never would
madheader69@reddit
I agree, but the Claude variants are the sketchiest when the time comes to commit the schema changes: they will call every tool, reset services without asking, deploy sub-agents like it's policy to do so, and yet sneak in a little, subtle "when you run the migration..." or "to activate the feature, run the schema migration"... One could argue it would be much cheaper to run 10 MiniMaxes in hopes one will be less perplexed, long enough to finish the job completely. ...Obviously I'm still a little distressed from a Sonnet incident... lol
allah_oh_almighty@reddit
No fucking way, that's adorable
CryinHeronMMerica@reddit
So, more like GPT 5.3 Codex/5.4?
bigboyparpa@reddit (OP)
Hot take but it's better than Sonnet
Zulfiqaar@reddit
I completely replaced the previous Sonnet with the previous Kimi, so this seems reasonable
mycall@reddit
Sonnet compares to which gpt?
mintybadgerme@reddit
Same here. And then after Qwen 3.6 plus.
MisticRain69@reddit
I've replaced Sonnet with MiniMax M2.7; it's consistently been better for me than Sonnet 4.6 (which is nerfed rn) and feels like it's at 90% of Sonnet 4.5 level. Hands down the best local model I've used and I fuckin love it lol.
Possible-Basis-6623@reddit
i guess a bit more precise prompt wrapping will likely make it very close to opus
rpkarma@reddit
GLM 5.1 is already better than Sonnet so I wouldn't be surprised
ITechFriendly@reddit
No, it is not. I have a repository with mixed Bash, Python, and Ansible code where GPT 5.2+ and even Claude Sonnet 4.6 are running circles around it, while GLM 5.1 and Qwen 3.6 Plus are hallucinating about missing features, alignment, bugs, etc.
deejeycris@reddit
I mean, I'm already finding GLM 5.1 better than Sonnet, so I can't exclude that another, later model can do it (didn't try K2.6 yet).
vitorgrs@reddit
There's a math question that I give. I tried several models; only Kimi 2.6, Gemini 3.1 Pro and Grok 4.20 Agent beta could answer it.
Claude Sonnet, GPT 5.2-High, etc couldn't....
Couldn't try Opus, but I guess it def should work.
exaknight21@reddit
There now needs to be an r/povertyLocalLLaMA because, gosh darn it, all I can afford is whatever is less than $500.
FoxiPanda@reddit
Lol, with how prices are going, that's going to be one blade of a Noctua fan in a few months.
pyr0kid@reddit
we should start an MLM based around selling thermal pads, eventually we'll be able to buy an entire fan and rent it out.
FoxiPanda@reddit
You mean we have to get our AI agents to set up an MLM....wait..no..one of those things have to happen before the other...
Maybe if we cut the thermal pads into pieces and shrinkflate them to maximize our synergistic revenue streams...
exaknight21@reddit
Good lord almighty. I cannot wait for this stupid AI boom to come and be over. This is just idiocracy.
FoxiPanda@reddit
I think it'll be a while yet before things stabilize (particularly RAM), but probably in like 2028, there will be an absolute glut of available hardware as some of the big Ampere / Hopper gear starts to age out.
antwon_dev@reddit
I also think in 2028 we'll see some real changes. The memory shortage and this AI fiasco should be well on their way by then.... Right!?
That, or the Jevons Effect will cripple us in full force
FoxiPanda@reddit
Those are not mutually exclusive IMO. Micron has new fabs coming online in the next 12 months, various other major players are pivoting to provide more capacity, etc etc... I think that supply increasing, demand increasing, and price stabilization / price re-alignment can all happen together as weird as that might be.
zdy132@reddit
Cannot wait to have my own dgx station.
FoxiPanda@reddit
You and me both :D
cutebluedragongirl@reddit
Just don't be poor, bro.
clouder300@reddit
ok. now I decided to not be poor anymore.
2muchnet42day@reddit
Exactly. All you need is like 200K and you're set with some cool H200 GPUs
RedditUsr2@reddit
With bitnet, better cache efficiency, and more aggressive MoE, I expect we'll see CPU viable models that are actually decent here in a year or two.
Shawnj2@reddit
Gemma 4 and Qwen 3.5/3.6 are reasonably cost effective models with options for most modern hardware
Darkoplax@reddit
6GB VRAM here brother; I can't wait to run Kimi K2.6 on 31 t/y
pmttyji@reddit
8GB here. Lets wait for Bonsai 1-bit version
CalligrapherFar7833@reddit
There should be normalLocalLLama and imrichafLocalLLama
spvn@reddit
what about GLM-5.1?
jeffwadsworth@reddit
Proven to be a beast. Love it.
spvn@reddit
Better than Kimi 2.6? I'm thinking of subscribing to a GLM or Kimi plan to test them out. But struggling to decide which to try.
Baader-Meinhof@reddit
Kimi 2.6 is better than GLM 5.1 in essentially every arena, but you get greater usage limits from 5.1.
Adrian_Galilea@reddit
I haven't tried Kimi 2.6, but I did extensively use 2.5, and I prefer GLM 5.1 by far.
Baader-Meinhof@reddit
5.1 is better at non-UI code than 2.5 for sure; 2.6 is better at all tasks than 5.1
Adrian_Galilea@reddit
I was comparing them on general chat and vibes; Kimi kept overthinking, GLM felt right every time. Other people I spoke to said the same.
Baader-Meinhof@reddit
I personally can't stand GLM prose, and I replaced it in production with Kimi, Arcee, and MiniMax. It reads like the same type of slop Qwen outputs.
neotorama@reddit
Good quality, thinks too much and is slow
spvn@reddit
How much better quality would you say Kimi 2.6 is compared to GLM 5.1? I'm thinking of subscribing to one of their plans to try them out for a month...
TheRealMasonMac@reddit
GLM-5.1 is leagues better than K2.6, which has massive problems with it overthinking itself into misinterpreting the request and doing something completely different. GLM-5.1, in comparison, has better lateral thinking and will actually stop itself to ask you questions when it realizes it doesn't understand how to best proceed.
kamikazechaser@reddit
I'd personally put it in the same tier as Sonnet 4.6/GPT 5.4, with the added advantage that it isn't lobotomized with useless guardrails.
voyager256@reddit
That's not that impressive IMHO. The question is how close it really is to Opus 4.6 (general) and Sonnet 4.6 (coding), not vs 4.7 or GPT 5.4.
If it's significantly better than GLM 5.1, then that would be impressive, as I think Kimi 2.5 fell behind GLM 4.7, let alone GLM 5.
a9udn9u@reddit
I suspect for the model itself, it really just comes down to scale; there's no secret sauce. Word on the street is that Sonnet is a 1T model, Opus is around 5T, and Mythos is 10T, while the largest open-weight models are 1T params and perform roughly at Sonnet's level. Harnesses can either amplify or restrict model performance, so they are as important as the models themselves.
Caffdy@reddit
Until someone from inside comes forward and confirms it, those numbers are pure speculation, even illogical. There's simply no way to deliver good throughput for 5T models yet, even less for 10T ones; even if they could make them, it would be absolutely unprofitable to even try
a9udn9u@reddit
The Sonnet and Opus numbers are from Elon Musk's X account, so maybe you are right, but Mythos being a 10T model was leaked from an Anthropic blog post draft; I think it's quite likely to be true
Clipthecliph@reddit
M5 ULTRA 512 gb ram sufficient?
SuperFail5187@reddit
Two of them and you are set.
Clipthecliph@reddit
Some said to wait for the M5 Ultra, as it should be just around the corner, and that the M3 was too slow.
SuperFail5187@reddit
m5 is way better, yes.
Mistredo@reddit
Not enough.
Fresh-Resolution182@reddit
"it wont randomly get nerfed" is the real value proposition. version stability alone is worth the hardware investment at this point
Super_Sierra@reddit
My only issue is that it isn't Opus. When it comes to implicit instruction, Anthropic has no equal.
PostingGuru@reddit
A true 85%, run at the cost of electricity, sounds insane. I'm worried that the longer I put off the hardware investment, the MORE expensive GPUs will become
zzsmkr@reddit
Being better than Opus 4.7 is not the compliment you think it is lmao
abrady@reddit
What sort of tasks are you seeing it do well? reviewing? greenfield creation? edits? How are you evaluating as well? just reviewing what was done?
Worried_Drama151@reddit
Naw k2.6 sucks sorry especially for how large it is
AgentAiLeader@reddit
how does it hold up against Sonnet 4.6? i'm currently running most of my agents on Sonnet, do you think Kimi K2.6 could eventually replace it?
FusionCow@reddit
It's better
LoveMind_AI@reddit
I have fully replaced Opus with it. I swap to GLM-5.1 for a few things, and MiMo-V2-Pro for very long context window stuff as well. But the core is Kimi K2.6 and it's working very very nicely. Saved me from having to slum it with GPT-5.4.
Sea-Load4845@reddit
In what hardware are you running it ?
segmond@reddit
Qwen3.6-35B can do 85% of the tasks that Opus can at a reasonable quality.
Tomr750@reddit
at what quant/what hardware do you run?
segmond@reddit
Q8.
llitz@reddit
Mine is just in a spiral of reasoning that's not even fun. Added Gemma 4 31B, and the same problem was solved in a single prompt: slower to prompt, but it was accurate. Hard to justify the 35B unless I can figure out a way to get it to not be stupid while reasoning. (Q6 on both models)
wren6991@reddit
It doesn't seem to like being quantised. I'm using Unsloth's UD-Q8_K_XL and that loops less, but it's a pretty big model at that point.
butt_badg3r@reddit
Listen. I'm not even willing to spend 5k on a Mac Pro to run LLMs locally. The cost of running a decent model locally just doesn't make sense. I'll just stick to Claude Max.
no_spoon@reddit
Who makes Kimi and how do I know I can trust it?
Bobylein@reddit
The point is that it's open weights and you can run it on any (very expensive) hardware
no_spoon@reddit
No. The point is that whatever service I use, I need to be able to trust it. You don't need local hardware to run Kimi. So what is the underlying company that runs it, and can it be trusted?
Chriexpe@reddit
For those who don't have 8x RTX Pro 6000 to run it locally, is it worth going for their $39 plan (Kimi 2.6) instead of Copilot Pro+? "I've built" an entire project using Copilot's Opus 4.7 (and I'm "still" only at 35% of premium requests) and it went really smoothly.
BidWestern1056@reddit
don't demean kimi-k2.6 like that.
Moist-Length1766@reddit
So its nowhere near being a replacement for Opus?
Bobylein@reddit
Well for some people costs matter and the API pricing is much cheaper
Ok-Coach9837@reddit
Wow
ridablellama@reddit
The new American dream! Running Kimi at home ;D
ProfessionalJackals@reddit
Communist detected on American soil. Lethal force engaged. ~ Fallout...
Turbulent_Pin7635@reddit
Laughing in M3U
Nokushi@reddit
would really like to try it, which provider do y'all recommend?
jonas-reddit@reddit
If you're lucky you can find it free somewhere. Otherwise go for cheap. Personally, I like OpenRouter.
Nokushi@reddit
Thanks! If I understand correctly, OpenRouter lets you choose which provider to use for the model, right? Does it impact model performance in any way?
TaylorTWBrown@reddit
How can I make this run with crappy hardware? Like a 2080 ti or 256gb of ddr4
CheatCodesOfLife@reddit
https://huggingface.co/ubergarm/Kimi-K2.6-GGUF/tree/main/smol-IQ1_KT
almbfsek@reddit
every single time the same post... fast forward 1 week, everybody will be talking about how bad it is compared to claude...
Tough-Tangelo-5331@reddit
Where do people run these models? Is it just the mini-server users talking? No diss, just asking as an enthusiast.
Rare_Operation2367@reddit
Can't wait to see where open LLMs will be in ~5 years (which is when I will actually be able to afford local compute). We rely heavily on Opus and 3.1 Pro in our daily workflows, but we will inevitably reach a point where these local models surpass the current frontier. That should be enough to create your own full-fledged J.A.R.V.I.S. that is actually useful and not just a tech demo for reels.
spambait-aspaaaragus@reddit
How are you running it OP?
Moist-Length1766@reddit
he is running it in his dreams at night
zenom__@reddit
Yeah, true local LLM. lol
MannToots@reddit
The coding I did this weekend with it left me very disappointed. I disagree. Haiku 4.5 at best and even then I dunno man
Skid_gates_99@reddit
can't compare it with opus 4.6, as of now.
lemon07r@reddit
It's not better than opus 4.6/4.5 but it is 100% better than opus 4.7 at most things cause 4.7 is absolutely garbage.
Drumroll-PH@reddit
Iâve tried swapping tools in my own workflows too, and I usually land on âgood enough and consistentâ over chasing the best model. If it handles most of your use cases and keeps things moving, thatâs already a win. Iâd just keep a fallback for the edge cases where it breaks.
National_Meeting_749@reddit
I swear the last post I saw from this sub was "Kimi k2.6 is bad at agentic tasks"
Turbulent_Pin7635@reddit
This is just propaganda. I have tested it; it is amazingly good. Don't believe all the posts you see. There are a lot from private companies trying to salt new Chinese models.
rebelSun25@reddit
We like to be teased and disappointed to keep the boredom away
Turbulent_Pin7635@reddit
Just waiting for the 4.5-bit quant from the inference labs to run this beast on the M3U. Fuck the pp. It's faster and cheaper than waiting for my Claude session to reset. =)
solomars3@reddit
Bro, I'm trying to integrate it into my own app, but I'm not able to make it work via API, can't fetch the model. Can anyone help me here?! (I'm a noob)
_VirtualCosmos_@reddit
Welp, Opus models are most probably monstrously big, as their price per million tokens indicates.
Potential-Leg-639@reddit
Every day we find a new Opus replacement here
MacaroonBulky6831@reddit
How can I connect a local Kimi K2.6 model to my VS Code? Using the Continue plugin? Or is there a better alternative available?
Cosmicdev_058@reddit
Seeing similar on our end. The browser use is the part that surprised me most, it holds up on multi step flows where I expected it to fall apart around step 6 or 7.
Quick heads up since people will ask, we just added it to the Orq router if you want to test it without setting up local inference. OpenAI compatible so it's a base URL swap. (I work there.)
Synor@reddit
85% of the time it works every time!
Colecoman1982@reddit
That doesn't make sense.
Synor@reddit
Yes. It's a meme and refers to OP's statement "it can do about 85% of the tasks that Opus can at a reasonable quality"
Colecoman1982@reddit
I know, it's from the movie Anchorman, referencing a comment by Paul Rudd's character, and my comment references what Will Ferrell's character says to him in response...
Synor@reddit
Oh. It's early here. đ„±
legodfader@reddit
It's an Anchorman reference :)
Colecoman1982@reddit
Yes, I know. So was mine. ;-)
Barncore@reddit
How can you know that already?
I think you WANT it to be a legit replacement for Opus 4.7
JoeyJoeC@reddit
I found Opus 4.6 to be a great 4.7 replacement.
singh_taranjeet@reddit
The "monstrously big" part is what kills me though. 85% of Opus quality sounds great until you realize you need a server rack just to run inference at reasonable speed.. What's your actual hardware setup for this?
-Leelith-@reddit
What are your hardware specs to have it running?
bastomic95@reddit
What's your setup, if I may ask?
Relative_Mouse7680@reddit
What about compared to GLM 5.1, have you tried it yet?
GermanBusinessInside@reddit
Interesting timing with Kimi K2.6. Been seeing a lot of these "Opus replacement" claims lately, but this one actually seems to hold up based on the benchmarks. The browser use capability is what stands out to me. Curious if anyone's tried it for longer agentic tasks where you need the model to stay coherent over many steps?
power97992@reddit
Yeah, it is pretty good from my limited testing, but opus and glm 5.1 are probably better
I_HAVE_THE_DOCUMENTS@reddit
I canceled my Claude code sub last week and I've been looking for alternatives and I'm curious what the "tone" of the model is like.
The breakthrough feature of Claude for me that has made it so useful has always been its conversational tone that allows me to treat it as a sort of collaborator while I get my ideas sorted out, compared to the other major AIs that try to produce a wikipedia article and lecture on every prompt.
ihppxng62020@reddit
Who is running this locally? Won't it be much worse in performance even if you get this running at Q4 on some Mac Studios?
Fit-Statistician8636@reddit
It is native 4-bit, so âQ4â doesnât hurt it and it runs better than GLM5. But unless you have 8 GPUs, its speed is not usable for agentic coding.
fastlanedev@reddit
I prefer it over Claude Opus 4.7. Haven't touched my subscription for a while now.
Why?
Because it doesn't lecture me about useless things, actually listens to my instructions, and through pi... the system prompt is actually respected.
I'll take 85% of Claude if it means I can actually control the context I give it. Claude just... Throws so much shit in the background
mwachs@reddit
How have you slowly been replacing your personal workflows? Didn't it just come out today?
Markus2816@reddit
curious what hardware is needed to run it with decent performance at context size like 128k?
FPham@reddit
Well, my opinion is different, but it is an unpopular one. Even Opus 4.7 chokes on complex stuff, and quite a lot. As soon as your project grows, it's constant fixing of primitive stuff.
lombwolf@reddit
Did Moonshot fix their usage limits? I saw a while ago some people were saying the usage got way more limited
qroshan@reddit
If I know a company is using 85% intelligence, I will use 100% intelligence to crush it. People who claim 85%, 90%, 95% don't understand that the differences compound each iteration (chain of thought).
Turn 1: 85%, Turn 2: 80%.... Turn N: Reddit Midwit
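Whether real tasks actually compound this cleanly is debatable, but the arithmetic behind the claim is simple:

```python
# Per-turn quality q compounded over an n-step chain.
for q in (0.85, 0.95, 0.99):
    print(q, [round(q**n, 2) for n in (1, 5, 10, 20)])
# 0.85 -> [0.85, 0.44, 0.2, 0.04]
# 0.95 -> [0.95, 0.77, 0.6, 0.36]
# 0.99 -> [0.99, 0.95, 0.9, 0.82]
```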
LeatherRub7248@reddit
Anyone directly moved from Codex 5.4 to Kimi?
Any comments on whether it's a feasible replacement?
Cat5edope@reddit
They turned Kimi into Qwen; it's a super overthinker now
Mental_Ad_6512@reddit
This is obviously a promotion post. Guess the marketing team at Kimi is working hard.
useresuse@reddit
Slowly? It's been out 1 day lol
Different_Fix_2217@reddit
Same, but for creative writing. It's the best model I've ever used, including the latest Opus, GPT 5.4 and Gemini 3.1 Pro. It has the social intelligence of GPT 5.4 with a knowledge base nearly as good as Gemini's, and it writes better than Opus and has no positivity bias, unlike it.
madheader69@reddit
I agree. I started using the Hermes agent a few days ago and tried out Kimi K2. I had forgotten I'd asked Kimi to perform a refactor/new feature in my 300+ LOC project, left the house and came back; Hermes had timed out, loaded up the wrong session, and started working in that. Then today when it timed out, I loaded up the wrong session to discover that the entire refactor and new feature, better than I had designed it, had been completely one-shotted... And I think it cost me 35 cents.
jd52wtf@reddit
The current issues with the larger models seem to be mostly over-constraint. Think RoboCop when they filled his head with hundreds of conflicting directives. Due to this, I can only assume that models that are not over-constrained in this way, even though smaller, are likely going to provide better results.
If not now, then very soon.
Once the overpriced-parts gridlock lifts, there are going to be a lot of subs canceling, and right quick.
alext77777@reddit
I tried it both on their website and through OpenRouter. I've got a prompt to generate a Martian base in a 90's retro style using a single HTML file; I love to check new models with it. I don't see how this new Kimi model is Opus or even Sonnet grade; the render is far worse. GPT on xhigh, or Sonnet and Opus with high thinking, generate highly detailed scenes; this Kimi model is just basic.
alext77777@reddit
Also I tried to vibe-code a retro game in one shot, and it's pretty bad, so far from GPT xhigh or Sonnet 4.6 high. I don't know about GLM 5.1, but I remember the Pony-something model they put on OpenRouter a while ago was amazing, nearly the same level as the best closed models. I don't know why the benchmarks put this Kimi model so high đ€
FormalAd7367@reddit
What's Opus... so distant
bennyb0y@reddit
But how are you using it? Is it orchestrating code, or what are you doing?
1EvilSexyGenius@reddit
Y'all be comparing models to harnesses
Content_Standard_421@reddit
Don't spill it out man, delete the post.
Zulfiqaar@reddit
I'll have to test it for browser use! This was one of my core pieces of feedback about K2.5: it was one of the first open models that was decent at browser use (better than Gemini!), but it thought for sooooo long I didn't want to wait around. I hope its overthinking was remedied. Opus was discovered to be cheaper than Sonnet for many tasks, just because it reasoned much less and just "got it", and Kimi is worse than Sonnet there too. Looking at Artificial Analysis, the previous one was ~7M Opus, 28M Sonnet, 89M Kimi in terms of tokens needed to finish the benchmark (top-of-my-head rough figures).
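Rough math on why token efficiency matters as much as price per token (the 7M/28M/89M figures are my rough recollection above, so treat this as illustrative):

```python
tokens_to_finish = {"opus": 7e6, "sonnet": 28e6, "kimi": 89e6}
ratio = tokens_to_finish["kimi"] / tokens_to_finish["opus"]
print(f"Kimi needs ~{ratio:.0f}x the tokens, so it has to be >{ratio:.0f}x cheaper per token to win on cost")
```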
FoxiPanda@reddit
I've been messing with a cloud version of Kimi-2.6 for the past few hours (since local quants aren't really available in any reasonable size yet) and it's VERY EARLY... but ... I feel like in my personal ranking, Kimi 2.6 feels like it slots somewhere between Opus 4.5 and 4.6 and above Sonnet 4.6 firmly.
I've even been watching some of its thinking (it seems like it leaks its thinking when it gets confused or unsure), and the thinking steps I've seen so far are excellent in the face of ambiguity.
Additional-Curve4212@reddit
ollama cloud? or kimi's api or cloud or something?
FoxiPanda@reddit
Ollama cloud for me (already had a year sub from months ago lol). It feels like they are working to scale inference availability but sometimes it gets real slow lol...so fair warning if you go that way.
LankyGuitar6528@reddit
Does it have MCP in the web interface? I can't find it. I'd love to test that out if it exists.
InterstellarReddit@reddit
Yes, an inference company that lets their customers do the initial testing and provide feedback. We believe you.
Barubiri@reddit
I used it today, can confirm it's good.
hihenryjr@reddit
"King" from Discord?
Snoo_28140@reddit
Won't be running that locally any time soon lmao. But this is amazing, OSS is competing with frontier. What a time to be aliveeee
LittleYouth4954@reddit
Tell us more about your workflows
jeffwadsworth@reddit
Not enough time for that so just wait for a real test user.
jeffwadsworth@reddit
Anything you throw at it, run it by a locally running 4-bit GLM 5.1. I bet it is just as good. Don't use the website version because it doesn't work well. I can test prompts for comparison if need be.
cmndr_spanky@reddit
What "tasks" exactly? The only thing I really care about is using LLMs for agentic coding (inside Claude Code, opencode, etc).
Simple agentic RAG stuff works just fine with small models already