Kimi K2.6 is a legit Opus 4.7 replacement
Posted by bigboyparpa@reddit | LocalLLaMA | View on Reddit | 354 comments
After testing it and getting some customer feedback too, it's the first model I'd confidently recommend to our customers as an Opus 4.7 replacement.
It's not really better than Opus 4.7 at anything, but it can do about 85% of the tasks that Opus can at a reasonable quality, and it has vision and very good browser use.
I've been slowly replacing some of my personal workflows with Kimi K2.6 and it works surprisingly well, especially for long time horizon tasks.
Sure, the model is monstrously big, but I think it shows that frontier LLMs like Opus 4.7 are not necessarily bringing anything new to the table. People are complaining about usage limits as well; it looks like local is the way to go.
ViRROOO@reddit
Apple release a new M5 with 512gb and my soul is yours!!!!
Bobylein@reddit
That isn't enough though, is it?
ViRROOO@reddit
No, but they have DMA via Thunderbolt. So if you have 2 Mac Studios you have 1TB of RAM.
Bobylein@reddit
Ah crazy, though I wouldn't expect any useful speeds from that?
ViRROOO@reddit
Users reported (128gb M5 macbook):
Qwen3-Coder-Next 8-bit: 79 tok/s at 4K context, 48 tok/s at 64K;
Qwen3.5-122B-A10B 4-bit MoE: 65 tok/s at 4K, 55 tok/s at 32K;
GPT-OSS-120B Q8: 88 tok/s at 4K, 65 tok/s at 32K;
Qwen3.5-27B dense 6-bit: 23 tok/s at 4K
So it's already better than my Strix Halo :D Personally I think 45 tok/s is the sweet spot. Above that you have to spend real money, not "just" 30k on two M5 Studio Ultras (if they ever come).
Relative_Rope4234@reddit
Do you use ROCM or Vulkan with strix Halo?
ViRROOO@reddit
Vulkan. ROCm is an insufferable, bloated piece of software that should be abandoned.
neopolitan77@reddit
LocalLLaMA: Kimi K2.6 is the deal
Me: đ€€
Huggingface: Kimi K2.6 is 1T parameters
Me: đł
How are we ever going to host models like these? I literally put in an order for an AMD Ryzen AI Max+ 128GB yesterday, and the next thing I see is this post. My feeling is: either you get something that actually feels good (quality-wise, waiting a bit longer is tolerable), or the time + money investment is hard to justify. Anyway, really looking forward to playing with hopefully at least some half-decent models.
EbbNorth7735@reddit
Nice, so in 1 year I'll be able to run an Opus-like model at home, and in 2 years most others will as well
Stahlboden@reddit
And you'll still be frustrated you can't run Opus 7 locally
EbbNorth7735@reddit
Some will. In reality there's a point where local LLMs can perform 99% of the tasks you want them to
I_HAVE_THE_DOCUMENTS@reddit
Yeah I'm at a point where I want reliability and speed more than anything else. My dream model is something that's about as good as Opus 4.5, but fast and local. Hopefully LLM tech will improve enough to make that a reality in the coming few years.
neopolitan77@reddit
We keep saying that, but then what we actually want is the thing that got released half a year ago and we've grown to love since. I'll start believing that we've hit the "good enough" plateau once your preferred model is 2 years old (for reference, that would be about GPT4o today).
Mochila-Mochila@reddit
We'll still need fast, power efficient and relatively affordable hardware to run these.
This means probably the generations released after the upcoming AMD Medusa Halo, and nVidia N2 (add Intel's APU with nVidia IP into the mix). So, around 5 years is my guess.
am2549@reddit
No ;( When abilities rise, demands rise as well.
BingpotStudio@reddit
Hopefully the original Opus 4.6 and not the enshittified 4.6 or 4.7 we ended up with.
I'd rather use Sonnet than Opus 4.7 at this point.
Ardalok@reddit
How much memory do you think this will require?
EbbNorth7735@reddit
Depends, 2x3090s and some system ram
Ardalok@reddit
Do you mean running the entire model in VRAM? Or only the activated experts?
coder903@reddit
Do you know of a cheap place to run it
Turbulent_Pin7635@reddit
M3Ultra is the cheapest
Relative_Rope4234@reddit
Apple stopped selling 512GB M3 ultra
FusionCow@reddit
Local, your options are either a 512gb m3 ultra or a server CPU with a shit ton of high speed high bandwidth ram, but both will be expensive no matter what you do. The model is relatively cheap on api, and honestly you should probably use it there unless you plan on dumping thousands and thousands into very llm specific hardware
Strong_Owl7286@reddit
I've been hearing that the actual usage was not a real big step up? Curious to hear your experience
SirStarshine@reddit
But can it work as a MCP connection for a game engine agent, like Coplay for Unity?
ConsciousStruggle5@reddit
What's the cost comparison between both?
FabsDE@reddit
I am testing it for OpenClaw currently. It's nice for reasoning and stuff, but everything design-related, like websites or PowerPoint for example, is much worse than with Sonnet or Opus, unfortunately
flipf17@reddit
Maybe this is a hot take, but from using Kimi2.6 as an agent with openclaw, hermes, custom python agent variations, kimi2.6 was definitely behind sonnet4.6. It always tried to make me do its job for it and often stopped referencing the most important factors that were going on in a conversation and changed topics often. Sonnet is decent, but I notice its flaws, too. However, Opus is just on a whole other level. It's so smart, and it sucks comparing stuff to it because they end up feeling like trash models. Maybe they're pretty decent when you don't compare them to Opus though.
footoncake@reddit
&&
ghgi_@reddit
And the best part is: it won't randomly get nerfed, and you won't get gaslit into thinking it hasn't been.
Look_0ver_There@reddit
Now if someone can just "lend" me the ~$50K it would take to run it locally with a good speed... 1.1T params is well up there!
EndlessB@reddit
Would 50k even get you there? It's a beefy model
ghgi_@reddit
2x M3 ultra 512gb mac studios used could run it, roughly in that price range.
michaelsoft__binbows@reddit
Is this not going to be insanely slow?
AttitudeImportant585@reddit
prefill speed will be slow and unusable for a 1T model
PeakBrave8235@reddit
It's not a 1T model. It's the active parameter count that's relevant to this discussion.
PeakBrave8235@reddit
Nope
ghgi_@reddit
Define insanely slow? For agentic coding? Probably on the low end, wouldn't be fun to use. For chat or tasks you're willing to wait on? Perfectly fine. My guess from what I've seen is roughly 10-15 tokens/sec, maybe a bit more on a good day, which isn't half bad considering the quality.
Such_Advantage_6949@reddit
The issue is prompt processing. A Claude Code or opencode opening prompt will take like 10 mins on this system to start replying
UnsolicitedPeanutMan@reddit
Really puts into perspective just how much compute is being allocated every time you talk to Claude. Has anyone done the math on the rough $ figure worth of GPUs you're talking to every time you use Claude? Is it $50k, or more?
TokenRingAI@reddit
Your requests get processed by a rack-scale system.
Costs are in the multi-million dollar range per networked rack, but these linked systems can process huge numbers of requests in parallel at a higher efficiency than a cheaper system
FoxiPanda@reddit
Yep, it's this. When you talk to Claude, you are almost assuredly talking to NVIDIA NVL72 GB200/GB300 class systems or similar hyperscaler systems (depending on how you access it).
Kappa-chino@reddit
Or TPUs at Google
Kimi K2 was specced to run on 8xH200 which costs approx $600k last I checked. That's far from cutting edge though
ghgi_@reddit
Math is hard to do on model sizes we don't know and GPUs we don't know. Opus could be 1 trillion or 10 trillion params, my guess is 3-5T, but to put it simply: probably in the millions per rack if they are running the latest and greatest.
Adventurous-Paper566@reddit
You won't get 10 tok/s on an M3, maybe 5.
Outside-Line-9508@reddit
Try adding an NVIDIA RTX PRO 6000 Blackwell GPU and offload a few more MoE layers onto it. 96GB of VRAM is sufficient for 1T MoE models, as they only activate under 40B parameters at a time. The PCIe 5.0 bandwidth is also enough to avoid becoming a bottleneck
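Rough back-of-envelope of why the per-token working set plus KV cache can fit in 96GB. The 40B-active and int4 figures are the ones above; the layer/head numbers below are placeholders for illustration, not Kimi's real config:

```python
# Rough sanity check of "96GB covers the hot path" -- illustrative numbers only.
BYTES_PER_PARAM_INT4 = 0.5

def gib(n_bytes: float) -> float:
    return n_bytes / 1024**3

active_params = 40e9                               # MoE params activated per token (figure from above)
active_weights = active_params * BYTES_PER_PARAM_INT4

# KV cache guess: 2 (K+V) * layers * kv_heads * head_dim * bytes per element.
# These model dimensions are made up for illustration, not Kimi's actual ones.
layers, kv_heads, head_dim, dtype_bytes = 60, 8, 128, 2
kv_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
kv_cache = 128_000 * kv_per_token                  # a 128k-token context

print(f"active weights ~ {gib(active_weights):.0f} GiB")   # ~19 GiB
print(f"KV cache @128k ~ {gib(kv_cache):.0f} GiB")          # ~29 GiB with these guesses
```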
muyuu@reddit
Not anymore, this one is much bigger.
flobernd@reddit
It's native INT4, so "only" 550GiB for the weights. It fits comfortably on 8x RTX Pro 6000 for example if you are looking for a "low" budget solution (compared to H200). But currently TG is only about ~50/60 t/s with that setup. Once the dflash model is released we might be able to use that as the draft model for ~200 t/s (if the promised acceptance rate of ~5 is true).
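Back-of-envelope for that draft-model claim, assuming the usual speculative-decoding logic (each target-model pass verifies a batch of drafted tokens, so throughput scales roughly with tokens accepted per pass, ignoring draft-model overhead):

```python
# Sketch only: numbers are the ones quoted above, not measurements.
baseline_tps = 55           # ~50-60 t/s reported for the 8x RTX Pro 6000 setup
accepted_per_pass = 4       # an "acceptance rate of ~5" would keep roughly 4-5 tokens per verify pass

estimated_tps = baseline_tps * accepted_per_pass
print(f"~{estimated_tps} t/s if the acceptance rate holds")   # ~220 t/s, the same ballpark as the ~200 t/s claim
```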
muyuu@reddit
Is that so? I saw FP16 somewhere.
Vancecookcobain@reddit
At 15 tokens a second
Baldur-Norddahl@reddit
8x RTX 6000 Pro = 70k USD. It should be possible to homebrew a system to carry those in the 10k to 20k range, so maybe 80k to 90k USD.
Notice you don't need a beefy CPU or a ton of RAM. Let the GPUs do the work. Use a pcie switch to keep tensor parallel GPU to GPU transfers local never even touching the CPU.
Crinkez@reddit
8x RTX 6000 won't even get you halfway to the required memory.
Baldur-Norddahl@reddit
Kimi K2.6 was trained using QAT and delivered as int4. So it is just about 590 GB in size. Available VRAM 768 GB so we have plenty.
ghgi_@reddit
? A single RTX 6000 Pro has 96 gigs? 8x is 768 gigs, and it's native INT4, meaning weights are 550-ish gigs + context headroom, so it's plenty. Unless you're talking about the RTX 6000 Ada 48GB, where 8x would be 384 gigs, which is still 70% of the way there, so not sure what number you're referring to otherwise.
Crinkez@reddit
It's a 1T model, so 1000GB+, and that's just to load the model into memory. You still need space for context window.
ghgi_@reddit
1T params != direct size for Kimi; it's native INT4, meaning it's pre-compressed and fine-tuned at int4 down to ~550 gigs, look for yourself on HF. If it were INT8 it would be ~1000 gigs, but in this case it's not.
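The arithmetic, for anyone following along (figures are from this thread, not official specs):

```python
params = 1.1e12                       # ~1.1T-parameter MoE

size_int4 = params * 0.5 / 1e9        # 4 bits = 0.5 bytes per param
size_int8 = params * 1.0 / 1e9
print(f"int4: ~{size_int4:.0f} GB, int8: ~{size_int8:.0f} GB")
# int4: ~550 GB, int8: ~1100 GB -- which is why it fits in 768 GB of VRAM (8 x 96 GB) with room left for KV cache
```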
suicidaleggroll@reddit
Depends on what is meant by "good speed". Kimi K2.5 can run at ~15 t/s on pure CPU as long as you have a bunch of memory channels. That honestly isn't terrible for interactive chat, but is definitely too slow for agentic coding. If you want full GPU, then no, $50k wouldn't get you there.
cantgetthistowork@reddit
The problem is not the TG; the PP on DDR5 is abysmal and unusable for coding
profcuck@reddit
Just a question - "too slow for agentic coding" - is it?
My way of thinking - and I'm genuinely asking, not arguing is this:
interactive chat - 7-9 tps is minimum as long as time to first token isn't too terrible, since that's a pretty fast human reading speed
interactive code assistant - people tend to want to see 30-50 minimum and a solid time to first token is crucial
agentic coding - you set up a full specification and then set the agent loose on it and come back the next day or whatever to see how it's going. 15 t/s isn't going to set any world records but agentic coding is the one area where going a bit slowly isn't nearly as important as being very smart. A smart model that can act independently to generate a lot of quality code in 3 weeks rather than 1 week is still extremely interesting.
Have I got it wrong?
RegisteredJustToSay@reddit
7-9 tokens per second is fine as long as you're not doing reasoning. If the model wants to spend 2k tokens reasoning before responding (which isn't uncommon, especially for complex stuff), that means you'll be waiting over 3 minutes before you see a single thing. Sure, you can sit there and watch the reasoning, but it's a glorified progress bar in the end. You really need an order of magnitude faster to make reasoning models comfortable to use.
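The arithmetic behind that, if you want to plug in your own numbers:

```python
# Time to sit through a hidden reasoning phase at different generation speeds.
reasoning_tokens = 2000
for tps in (8, 30, 80):
    minutes = reasoning_tokens / tps / 60
    print(f"{tps:>3} t/s -> {minutes:.1f} min before the first visible answer token")
# 8 t/s -> 4.2 min, 30 t/s -> 1.1 min, 80 t/s -> 0.4 min
```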
profcuck@reddit
That's all true. It's interesting because before reasoning models 'time to first token' and 'tokens per second' mapped directly to what we actually read. Now of course the reasoning part is in between that. So the mapping between "human reading speed" and "tokens per second" is more complicated than before.
ShengrenR@reddit
Coding with agents needs a ton of context processing too; if your pp is low you're going to be waiting a looong time potentially. It'll naturally vary by harness and how well they have context reuse set up, etc., but lots of 'agents' bounce between tasks to search/edit/re-read. If you have thinking on top, you'll be waiting quite some time. Naturally, some tasks care about speed more than others.
DuncanFisher69@reddit
You're closer to mostly wrong. Below 20 tokens/sec most agentic tools will time out, even ones optimized for local coding, because a lot of agentic coding tools do "turns". So it's often trying to cycle things quickly, or if possible, in parallel.
profcuck@reddit
Thank you that's interesting. I've not read many reports of people trying to do agentic coding using local models. My thinking didn't take into account timeouts. That seems like a reasonably solvable problem since the coding tools are old school deterministic code that can (in principle) be set to wait however long is necessary.
But if that's the state of play today, that definitely answers my question.
DuncanFisher69@reddit
They can, I suppose, but out of the box they already have timeouts set for a sane person.
Look, all I am saying is if you have a choice between a model that is 60 tokens/sec and 18 tokens/sec, that 18 tokens/second model had better be really really really worth it. Somewhere between 20-60 tokens/second the performance of these tools gets pretty good and your context window becomes the limiting factor. But if you can get a fairly large one (32k-48k) you can actually get a lot of real work done.
profcuck@reddit
I'm with you. Very helpful thanks.
txgsync@reddit
And most coding tools bust prefix caches left, right, and center, leading to very unsatisfactory outcomes even though LM Studio, MLX, and llama.cpp all support prefix caches.
I am noodling with the Pi SDK to see if I can construct one that religiously re-uses prefix caches, because the difference is literally "turn that takes 20 seconds" vs "turn that takes 8 minutes" at large context sizes on an M4 or M5 Max Mac.
Wide-Section5065@reddit
I completely agree; accuracy and long horizon consistency (smartness) matters more than speed in agentic coding.
IrisColt@reddit
I think the issue is that people want to know immediately if the agent derailed.
zansibal@reddit
Interesting, how much RAM/what quantization would be needed? I have 384 GB on 12-channel DDR5 and an Epyc 9554P. I'd like to try.
suicidaleggroll@reddit
Q4 needs around 600 GB, with 384 you're pretty stuck as far as Kimi is concerned. Maybe you could fit a Q2, though I'm not sure if that's worth it.
zansibal@reddit
Thanks, you're right. I just now tried GLM 5.1 with UD-Q3_K_M. At 338 GB it fits nicely. With CPU only I got 7 t/s, and hybrid GPU/CPU gives me 15 t/s. 15 feels ok for that kind of model.
sob727@reddit
Very naive question here.
Say you have a CPU that does 15 t/s, and GPUs that would do 75 t/s if you had enough of them.
Now if you split the layers evenly between CPU and GPU, do you get the average (45 t/s), or is it absolutely not the case (non-linearity etc)?
suicidaleggroll@reddit
It's not linear. Here is a plot of roughly what the relationship looks like: https://www.reddit.com/r/LocalLLaMA/comments/1itfg77/test_prompt_processing_vs_inferense_speed_vs_gpu/
Basically you need to be running almost entirely on the GPU before the speed really starts climbing
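A minimal way to see why, assuming token generation is memory-bandwidth bound so each device contributes time proportional to its share of the weights (a simplification, but it matches the shape of that plot):

```python
def tokens_per_sec(gpu_fraction: float, cpu_tps: float = 15, gpu_tps: float = 75) -> float:
    """Per-token time is the sum of each device's time spent on its share of the model."""
    per_token_time = gpu_fraction / gpu_tps + (1 - gpu_fraction) / cpu_tps
    return 1 / per_token_time

for f in (0.0, 0.5, 0.9, 1.0):
    print(f"{f:.0%} on GPU -> {tokens_per_sec(f):.0f} t/s")
# 0% -> 15, 50% -> 25, 90% -> 54, 100% -> 75: the curve only takes off near full offload
```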
sob727@reddit
Thank you very helpful
thedirtyscreech@reddit
The prompt processing time will change linearly. Token generation won't. To go from 10s to 5s of prompt processing, you need to double your prompt processing speed. To shave off the next 2.5s, you need to double it again.
sob727@reddit
Ha yeah makes a lot of sense.
sfw_mtv@reddit
If you use a very good quality draft model, you can get speculative execution to speed up the entire thing to approach the GPU speed, but basically when you're combining the GPU and CPU it will be much closer to just the CPU speed, because that's your bottleneck. Hybrid might actually be slower than CPU only if the card is hobbled by the PCIe link speed.
sob727@reddit
Thank you.
So I imagine if you mix GPUs say of different generations, you'll be limited by the bandwidth of the slowest also.
I'm asking as I wonder what I could get on my Threadripper once I land 512GB of RAM and a couple RTX 6000.
sfw_mtv@reddit
You can hurt yourself with putting GPUs in the wrong slots, or putting them in with risers that can't support full bandwidth that you're expecting. If the GPU slot bandwidth is the limiting factor, you'll never get faster than what a slot can support, but if the bottleneck is that you're on a desktop CPU that only handles 2 memory channels, especially ddr4, you're going to hit that also. Threadripper can do a lot more memory channels so it will be more enjoyable for CPU only inference. Find a draft model (something that runs the same tokenizer as your main model) you can run entirely on a single GPU and that will keep all of your models fed with parallel data and you'll be a happy camper.
alphapussycat@reddit
True agentic coding shouldn't require high t/s I think?
You'd queue up a bunch of feature tasks that it handles. Meanwhile, you do the architecture work for other features and evaluate the code it's done.
duboispourlhiver@reddit
Not that obvious. I can queue up work for a night-long agentic run, and I'd love the queue to be three times longer.
FriendlyRope@reddit
Mhh, I don't know.
For most stuff a set-and-forget setup is fine; who cares if the computer works for 5 days, if you get what you want in the end?
Wonder if you could use multi-agent setups, with something like Qwen 3.6 or Gemma for a lot of the "grunt work", using expensive models only sparingly
Awkward_Elf@reddit
Would this be with 12 channels of DDR5 or are there other optimisations now that help push even a DDR4 system past ~4 t/s?
suicidaleggroll@reddit
I can only speak to the performance on my system, which is an Epyc 9455P with 12 channels of DDR5-6400. Without the GPUs, Kimi K2.5 generates at about 15-17 t/s on the CPU in Q4.
DDR4 will be slower, but I don't know by how much.
Qwen30bEnjoyer@reddit
What prompt processing speeds have you observed? I would guess you get about 45 - 70 TPS PP on that system, how close am I? :)
suicidaleggroll@reddit
pp was 72, last time I tried it. I normally use a pair of RTX Pro 6000s to help, with them it's about 140/25 pp/tg, it's not often that I shut off the GPUs and run on the CPU alone so it's been a while.
Fit-Statistician8636@reddit
I have a similar machine. Can I ask, do you prefer SGLang+KTransformers, llama.cpp, or ik_llama.cpp, and why? đ
suicidaleggroll@reddit
llama.cpp. I like ik_llama, but I've just run into too many issues, especially with tool calling, to rely on it. vLLM and SGLang take too long to switch models in my experience, which doesn't work well with my workflow. I do love the performance of vLLM though, and would happily switch if I could figure out how to get model loading times down.
Fit-Statistician8636@reddit
Thank you for an honest answer. I have a similar experience. I noticed I am using vLLM and SGLang more and more lately. Typically, it is to sort out various tool calling / reasoning parsing issues. But loading times are terrible for sure.
Qwen30bEnjoyer@reddit
Got it. It's interesting that 192 GB VRAM doesn't move the needle that much on prompt processing. I would expect ~4000 TOPs at FP4 with 2 RTX 6000s, and with INT4 Kimi K2.6, you should have about 1/3 of the model sitting in VRAM. I wonder why speeds aren't significantly higher?
suicidaleggroll@reddit
GPU contribution isn't linear. It's still bottlenecked by the CPU until you get at least ~75% of the model in VRAM, then you start to see significant gains after that. You can try it with smaller models as well: start with it entirely on CPU and then offload one layer at a time to GPU; you'll see the speed, both pp and tg, doesn't really move until you get a majority of the layers on the GPU.
twack3r@reddit
Because PCIe interconnect becomes the limiting factor.
It already is between two RTX 6000s; once you start including CPU-to-memory, that limit becomes the ceiling.
Free-Combination-773@reddit
But isn't prompt processing mostly limited by compute?
ProfessionalSpend589@reddit
I suspect that where the compute sits relative to the data matters. If the data sits in RAM, the GPU doesn't touch it (or if I'm wrong, it touches it at the speed of PCIe).
Free-Combination-773@reddit
True, but even one RTX 6000 should be enough to fit all attention heads to VRAM.
LegacyRemaster@reddit
DDR4 is about 50% slower
Outside-Line-9508@reddit
Try adding an NVIDIA RTX PRO 6000 Blackwell GPU and offload a few more MoE layers onto it. 96GB of VRAM is sufficient for 1T MoE models, as they only activate under 40B parameters at a time. The PCIe 5.0 bandwidth is also enough to avoid becoming a bottleneck
Present_Flower_6596@reddit
Even on new Macs it would be awful. Yeah, token gen would be okay, but prefill would be a nightmare with 1.1T params; after a few chats it would literally crawl to first token
jonydevidson@reddit
4x Mac Studio with Exo
dtdisapointingresult@reddit
22k-23k should get you there. 6x clustered DGX Spark = 768GB pool.
ThisGonBHard@reddit
AMD MI350X seem to be around 15-25k USD, so if you get lucky, you could actually run it as a server on 3 of them.
a9udn9u@reddit
M3 Ultra 512GB x 5, 2.5T unified memory would do, I think.
Final-Frosting7742@reddit
You just need two Strix Halos with 128GB, connect them somehow and compress Kimi to Q1 (~130GB), which is < 96*2, and there you go. For less than 6k.
Now if you want higher quants, just connect more Strix Halos. Still way cheaper than 50k.
Since Kimi is ~30B active params it will feel slow, but hey, you've got Kimi running.
Perfect-Flounder7856@reddit
4-8 H200s cost about $100k
muyuu@reddit
You need 8, and that's going for 240K-300K US$
the-final-frontiers@reddit
maybe when m5 mac comes out, buy 2
Look_0ver_There@reddit
You know, after I posted it, I thought a bit more, and came to the same conclusion. I think $50K is definitely low-balling it in this current RAMageddon timeline. This is more of a full cluster rack node sort of deal, and those cannot be had for $50K
Long-Chemistry-5525@reddit
I built a tool to auto-deploy models to cloud GPU providers, including multi-GPU models. I'll have to add Kimi to the list of supported models. My harness deploys a chat interface using ollama.cpp on the GPU node and an OpenClaw-like API on the local machine that runs the web UI
keepthepace@reddit
You can rent it if you need or you can just buy tokens on openrouter from a random hoster.
Look_0ver_There@reddit
Remind me. What's the name of this sub-reddit again?
keepthepace@reddit
Yes, you are right.
I just see this thing as a step up from totally closed Anthropic models. And if push comes to shove, $50k is something that is totally affordable for a company, a community, a city, etc...
FuzzyBucks@reddit
You have to scale your budget up several orders of magnitude if you want to provide a reliable service to more than one person.
keepthepace@reddit
$500k is also something that is totally affordable for a company, a community, a city, etc...
adr74@reddit
use ollama and pay 20 bucks/month
Ranmark@reddit
I actually wonder, is there a "Pareto frontier" chart showing the diminishing returns of model parameter count against benchmark scores, to look for a sweet spot?
Look_0ver_There@reddit
This site is arguably the closest to what you're looking for. They have a free models section and you can compare it with paid models too.
https://artificialanalysis.ai/
The "Intelligence vs Cost" chart there gives a decent representation of intelligence vs model size. MiniMax-M2.7 appears to be one of the better ones.
XTornado@reddit
Yeah... and the electricity bill don't forget about that.
amunozo1@reddit
At least you can choose providers, even if it's not local.
s101c@reddit
It could have been "just" ~$10K had you purchased this hypothetical machine 1-2 years ago.
ghgi_@reddit
"Kimi, make me an AI SaaS service that is guaranteed to make 50k or more before kimi k2.7 comes out, make no mistakes"
nuclearbananana@reddit
well if you run it yourself.
Otherwise you're trusting third party providers.
PerlativeCeronometer@reddit
I mean, it's already 4-bit natively, though. Much less likely to get quantized further.
nuclearbananana@reddit
Maybe not the model but the KV cache could. Or the attention weights which are not quantized
CheatCodesOfLife@reddit
The attention weights are tiny, absolutely no benefit quantizing them.
ghgi_@reddit
True, but with other vendors able to run it, not just one centralized entity, you can cross-check and validate quality across them to see who's lying. Not updated recently, but: https://github.com/MoonshotAI/K2-Vendor-Verifier
IrisColt@reddit
Another best part is that it will make you feel vindicated.
xmnstr@reddit
From someone who's been subscribing to a Kimi plan for a while: that's unfortunately not true. It seems to be more temporary, likely at capacity peaks, but sometimes it even fumbles tool calls, so it's obviously too quantized. It seems to be what providers do, no matter where: either pre-nerf it or have it dynamically adjusted.
With that said, it's still nowhere near as bad and pervasive as what a certain American company is doing.
ChatWithNora@reddit
That's honestly the main selling point for me. I don't care if it's 85% or 95% of Opus, I care that it's the same 85% tomorrow as it is today.
lukaszpi@reddit
...because you can't afford to run a usable quant locally ... at least I can't
Actual_Meat_1030@reddit
No, I spent two days using K2.6 and Claude... I left K2.6 to it and it takes forever... just makes a mess... and I just kept having Claude fix Kimi's screw-ups... and Claude would fix it in seconds. At least for C++, K2.6 is no good
jgenius07@reddit
Hot damn the pricing difference
Crinkez@reddit
Yes but does it cache tokens?
cheesecakegood@reddit
Moonshot's own site says yes... but they have a separate price for "cache miss", so I'm not sure where your eventual balance lands. Their pricing, btw, is $0.16 (cache hit) or $0.95 (cache miss) per 1M input tokens as described above; $4.00/1M out; 262,144 context
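A rough per-request cost with those prices (illustrative only, assuming the cache hit/miss split is the only thing that changes):

```python
# $ per 1M tokens, from the comment above.
CACHE_HIT_IN, CACHE_MISS_IN, OUT = 0.16, 0.95, 4.00

def request_cost(input_tokens: int, output_tokens: int, cached_fraction: float) -> float:
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    return (cached * CACHE_HIT_IN + fresh * CACHE_MISS_IN + output_tokens * OUT) / 1e6

# A 100k-token agentic turn producing 2k tokens of output:
print(f"no cache reuse: ${request_cost(100_000, 2_000, 0.0):.3f}")   # ~$0.103
print(f"90% cache hits: ${request_cost(100_000, 2_000, 0.9):.3f}")   # ~$0.032
```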
VonDenBerg@reddit
Did an afternoon with Kimi K2.6 and it's not a replacement. It misinterprets and often forgets multi-step instructions.
Sonnet can ship features faster.
TheseHeron3820@reddit
LocalLLaMA -> looks inside -> one trillion parameters.
I don't know, chief, I have a hunch I can't run this locally...
Blablabene@reddit
local is the way. these models also put enormous pressure on the proprietary models to keep their cost down.
In the near future this will become more accessible to people. Today, it's still way too expensive to run "locally".
OneSlash137@reddit
It's cute you think that. Consumers are already in the process of being priced out of serious hardware. Maybe not today, maybe not tomorrow, but one day before too long everyone will be using dumb terminals connected to cloud powerhouses, and we'll get compute credits for using our computers just like we do for AI tokens now.
We can run the models now, but when current hardware starts reaching end of life I'm betting you ain't finding replacement parts, and you sure as hell won't be paying what the next-gen hardware will cost.
Do you have any clue what it takes to run these models? Providers keep the costs down? That literally isn't possible. Every time someone asks for their AI girlfriend to stroke their ego or generate a Natalie Portman pic with triple-G breasts, that's a shitload of compute power. Companies won't get better with costs, it will get worse. They'll need a way to stop people from using tokens wastefully, and that will be by pricing you all out. The best models will be for enterprises only, because home users just don't need that kind of power.
Enjoy your access to AI while it lasts. That's a tool too powerful for elites to let us plebs have. It's already happening at a rapid, rapid pace.
Spectrum1523@reddit
There is literally no evidence for this happening.
OneSlash137@reddit
Lmfao ok. Hey where is your Claude code part of the pro plan again? Oh they removed it from the pro plan?
You sound like an idiot
Spectrum1523@reddit
Do you usually reply to people with totally unrelated facts or is that just sometimes?
"Claude api costs are up therefore local compute in general will be removed sometime soon"
OneSlash137@reddit
Bye bye Claude code.
Spectrum1523@reddit
Literally what are you even talking about
_BigBackClock@reddit
instagram reels aahhh opinion
thrownawaymane@reddit
Some of us have thought this for a long time.
OneSlash137@reddit
Uninformed tween opinion.
Blablabene@reddit
Disagree.
OneSlash137@reddit
I don't give two fucks what you think lol.
Blablabene@reddit
But I don't want a f*king guitar lesson đž
typical-predditor@reddit
RTX 6000 released recently. It's less expensive than a car. Not cheap, but not impossible to buy.
mantafloppy@reddit
Model size : 1.1T params
"Local"
lol
SoylentCreek@reddit
Has anyone informed Anthropic? They're still charging $5/M in and $25/M out if you're paying API rates, which, if you're on enterprise, you are.
FoxiPanda@reddit
That's certainly the list price, but Enterprises also get insane discounts at scale. They might be paying half that token cost ... or less depending on how much they're buying in aggregate.
Zulfiqaar@reddit
I know there's at least 30% discount from base API price on packages of 200M uncached tokens or above, I'm sure some get even better deals. Windsurf used to serve at roughly 50-70% discount looking at my past usage..but given their current pricing change maybe anthropic has stopped with those deals.
FoxiPanda@reddit
I'm probably not allowed to say exactly what we get, but it's better than what you get.
squired@reddit
You can likely say if it is considerable or not however. Last time T3Chat discussed it, the volume discounts were not stellar. Don't quote me on the numbers, but it was something like they needed to guarantee $250K spend for an additional 5% discount.
ProfessionalSpend589@reddit
There are ranges of numbers too.
bussondev@reddit
I'm going to subscribe to test it.
_derpiii_@reddit
How can you come to this conclusion when it's only been out for 9 hours? That's just very... suspicious
Zulfiqaar@reddit
In all honesty the very first prompt I ran on opus 4.7 I also ran on Kimi k2.6 and it gave a pretty good response while Opus refused. I didn't bother testing 4.7 further, went back to 4.6 (this was on webui not ClaudeCode, 4.7 seems to be doing good but I can't tell the difference between it and 4.6 or even 4.5 yet.)
Super_Sierra@reddit
Uhhh, the only thing that Opus won't do over API is weapons stuff and characters under 18.
If you are hitting guardrails, at least say what they are.
Zulfiqaar@reddit
This wasn't API, it was on the Claude webapp, which already has guardrails, and then extra filters on top for Opus 4.7. The prompt was a benign writing prompt; I go to open-weight models like Kimi/DeepSeek/GLM for anything that I expect to trip filters (I have a cybersecurity client)
Caffdy@reddit
what about 4.6 vs Kimi 2.6 then?
Zulfiqaar@reddit
4.5/4.6 is good, I use it. 4.7 I only use in CC, nothing else. I tried Claude Design but its limits are so small I'd rather not. K2.6 is my fallback for Opus/Codex coding, or for search/writing/design. It's more proactive and has less restraint, for better or worse.
sgt_brutal@reddit
It's been out for at least a week as k2.6-code-preview
_derpiii_@reddit
ohhhh, ok that makes more sense
jeffwadsworth@reddit
You can't, so just treat it as pure comedy.
InterstellarReddit@reddit
Bro tested and got customer feedback all within five hours. That's insane. We're a big AI company and it takes us one week to test a new model between four engineers before we even consider a customer route.
addiktion@reddit
What eval software are you using for scoring?
RemarkableGuidance44@reddit
Question: how are you going to survive as an "AI company"? There are so many of them today.
I work for an enterprise (non-software) company; we hired a few software devs and ML engineers, built our own server at half a million, and are currently replacing SaaS software with in-house ones. Started to finetune for certain cases and it's been incredible. We still use Codex or Claude for 15% of the work, but 85% is with GLM 5.1 or Kimi or DeepSeek.
I would say, though, that my team's cost would be more than what an AI company would charge to implement such things. But hey, you are going to need knowledge inside your company to keep these running.
No_Inspection4415@reddit
When an enterprise can spend millions (infra, staff) to run LLMs and keep it reliable, used by developers, and not annoying, it is a huge success.
However, this kind of project is risky and managers hate risk. I can imagine this type of investment going to trash real fast.
addandsubtract@reddit
Do you publish these scores anywhere?
bigboyparpa@reddit (OP)
These kinds of comments are retarded. We work at an inference provider startup, so we're obligated to launch the model as soon as weights are available, and sometimes we get weights early.
Secondly, of course customers give us feedback straight away on whether the model is working or not.
_IAlwaysLie@reddit
don't use the r-slur
rebelSun25@reddit
Your post is literally saying "local replacement" ... You're employed at an inference provider shop... The comment was valid. It would have been surprising to see a garage setup casually run this model and have it tested
jazir55@reddit
To be fair, technically he is running it locally, on his inference provider's hardware đ
AttitudeImportant585@reddit
bro is doubling down lmao
jeffwadsworth@reddit
I was laughing at that too. And they downloaded it pretty fast as well. Haha.
InterstellarReddit@reddit
And he doubled down in a comment saying that they are a startup but they let their customers do the testing đđđ.
klipseracer@reddit
You can meet their customers actually... Let me introduce you to: Customer Agent 1 and Customer Agent 2!
LeTanLoc98@reddit
Benchmaxxed model
Benchmaxxed marketing
DeadInFiftyYears@reddit
What hardware are you using, and how well is it performing?
ccalabro@reddit
So I asked K2.6 for a comparison; not really inspiring confidence, as it said it had knowledge up to April 2024.
It looks like you may be using custom tags from a frontend app, or there may be a mix-up with version numbers. To my knowledge, there are no official models called "Kimi K2.6," "Opus 4.6," or "Sonnet 4.6."
I'll assume you mean the current production flagships that match those names:
albertgao@reddit
No human being uses the terms you listed, you, as a bot, need to update your prompt.
MSPlive@reddit
It was answering correctly with 2.5, any idea why not with 2.6 ?
Gloomy-Ad-7272@reddit
maybe downgrade in reasoning but upgrade in coding and overall
Technical-Earth-3254@reddit
If it's at 85% to Opus, it's probably a full Sonnet replacement?
DepartmentOk9720@reddit
Most models are already full Sonnet replacements: MiniMax 2.7, GLM 5.1, Kimi 2.6, and Qwen
MoistRecognition69@reddit
I've been using GLM 5.1 (API Coding Plan) these past few days and idk man it feels like a hallucination galore.
Maybe I'm doing something wrong? Is there something I'm missing?
mickaelxd@reddit
I went through this. I got the $80 plan for a month and realized that after about 100k of context, it gets extremely dumb on this plan. GLM is only good via the API.
MoistRecognition69@reddit
API as in standalone API, not their coding subscription?
mickaelxd@reddit
Yes, their plan is very nerfed. See if the API pricing works out for you and prefer to use that.
notdba@reddit
The 100k context issue with the coding plan was fixed about a week ago. Still got intermittent 429 responses though.
DepartmentOk9720@reddit
The context window for GLM 5.1 is fixed at something like 250k; recently Claude Sonnet got a boost to a million tokens. Check the context length
unique-moi@reddit
Of the models you mention (minimax2.7, glm5.1, kimi2.6, Qwen3.6 plus) only minimax can realistically be run locally without megadosh. Below minimax level, I hear good things of qwen3.6 moe.
power97992@reddit
Sonnet is probably better than qwen 3.6 plus and minimax 2.7
ghgi_@reddit
Better than Sonnet imo. Maybe 85% of Opus if you hand-hold it more, but I'd take it over Sonnet.
madheader69@reddit
I told minimax he was acting like sonnet and he was straight up offended...saying that the user is saying that I am behaving like an inferior model that is unable to perform effective tool calls...(Paraphrased)!
Runtimeracer@reddit
Just that MiniMax often gets so much more perplexed by even simple tasks, in ways Sonnet never would
madheader69@reddit
I agree, but the Claude variants are the sketchiest when the time comes to commit the schema changes: they will call every tool, reset services without asking, deploy sub-agents like it's policy to do so, and yet sneak in a little, subtle "when you run the migration..." or "to activate the feature, run the schema migration"... One could argue it would be much cheaper to run 10 MiniMaxes in hopes one will be less perplexed, long enough to finish the job completely. ...Obviously I'm still a little distressed from a Sonnet incident... lol
allah_oh_almighty@reddit
No fucking way, that's adorable
CryinHeronMMerica@reddit
So, more like GPT 5.3 Codex/5.4?
bigboyparpa@reddit (OP)
Hot take but it's better than Sonnet
Zulfiqaar@reddit
I completely replaced the previous Sonnet with the previous Kimi, so this seems reasonable
mycall@reddit
Sonnet compares to which gpt?
mintybadgerme@reddit
Same here. And then after Qwen 3.6 plus.
MisticRain69@reddit
I've replaced Sonnet with MiniMax M2.7; it's consistently been better for me than Sonnet 4.6 (which is nerfed rn) and feels like it's at 90% of Sonnet 4.5 level. Hands down the best local model I've used and I fuckin love it lol.
Possible-Basis-6623@reddit
i guess a bit more precise prompt wrapping will likely make it very close to opus
rpkarma@reddit
GLM 5.1 is already better than Sonnet so I wouldn't be surprised
ITechFriendly@reddit
No, it is not. I have a repository with mixed Bash, Python, and Ansible code where GPT 5.2+ and even Claude Sonnet 4.6 are running circles around it, while GLM 5.1 and Qwen 3.6 Plus are hallucinating about missing features, alignment, bugs, etc.
deejeycris@reddit
I mean, I'm already finding GLM 5.1 better than Sonnet, so I can't exclude that another, later model can do it (didn't try K2.6 yet).
vitorgrs@reddit
There's a math question that I give. I tried several models; only Kimi 2.6, Gemini 3.1 Pro and Grok 4.20 Agent beta could answer it.
Claude Sonnet, GPT 5.2-High, etc couldn't....
Couldn't try Opus, but I guess it def should work.
exaknight21@reddit
There now needs to be an r/povertyLocalLLaMA because, gosh darn it, all I can afford is whatever is less than $500.
FoxiPanda@reddit
Lol, with how prices are going, that's going to be one blade of a Noctua fan in a few months.
pyr0kid@reddit
we should start an MLM based around selling thermal pads, eventually we'll be able to buy an entire fan and rent it out.
FoxiPanda@reddit
You mean we have to get our AI agents to set up an MLM....wait..no..one of those things have to happen before the other...
Maybe if we cut the thermal pads into pieces and shrinkflate them to maximize our synergistic revenue streams...
exaknight21@reddit
Good lord almighty. I cannot wait for this stupid AI boom to come and be over. This is just idiocracy.
FoxiPanda@reddit
I think it'll be a while yet before things stabilize (particularly RAM), but probably in like 2028, there will be an absolute glut of available hardware as some of the big Ampere / Hopper gear starts to age out.
antwon_dev@reddit
I also think in 2028 we'll see some real changes. The memory shortage and this AI fiasco should be well on their way by then.... Right!?
That, or the Jevons Effect will cripple us in full force
FoxiPanda@reddit
Those are not mutually exclusive IMO. Micron has new fabs coming online in the next 12 months, various other major players are pivoting to provide more capacity, etc etc... I think that supply increasing, demand increasing, and price stabilization / price re-alignment can all happen together as weird as that might be.
zdy132@reddit
Cannot wait to have my own dgx station.
FoxiPanda@reddit
You and me both :D
cutebluedragongirl@reddit
Just don't be poor, bro.
clouder300@reddit
ok. now I decided to not be poor anymore.
2muchnet42day@reddit
Exactly. All you need is like 200K and you're set with some cool H200 GPUs
RedditUsr2@reddit
With bitnet, better cache efficiency, and more aggressive MoE, I expect we'll see CPU viable models that are actually decent here in a year or two.
Shawnj2@reddit
Gemma 4 and Qwen 3.5/3.6 are reasonably cost effective models with options for most modern hardware
Darkoplax@reddit
6GB VRAM here brother; I can't wait to run Kimi K2.6 on 31 t/y
pmttyji@reddit
8GB here. Lets wait for Bonsai 1-bit version
CalligrapherFar7833@reddit
There should be normalLocalLLama and imrichafLocalLLama
spvn@reddit
what about GLM-5.1?
jeffwadsworth@reddit
Proven to be a beast. Love it.
spvn@reddit
Better than Kimi 2.6? I'm thinking of subscribing to a GLM or Kimi plan to test them out. But struggling to decide which to try.
Baader-Meinhof@reddit
Kimi 2.6 is better than GLM 5.1 in essentially every arena, but you get greater usage limits from 5.1.
Adrian_Galilea@reddit
I haven't tried Kimi 2.6, but I did extensively use 2.5, and I prefer GLM 5.1 by far.
Baader-Meinhof@reddit
5.1 is better at non-UI code than 2.5 for sure; 2.6 is better at all tasks than 5.1
Adrian_Galilea@reddit
I was comparing them on general chat and vibes; Kimi kept overthinking, GLM felt right every time. Other people I spoke to said the same.
Baader-Meinhof@reddit
I personally can't stand GLM prose, and I replaced it in production with Kimi, Arcee, and MiniMax. It reads like the same type of slop Qwen outputs.
neotorama@reddit
Good quality, thinks too much and is slow
spvn@reddit
How much better quality would you say Kimi 2.6 is compared to GLM 5.1? I'm thinking of subscribing to one of their plans to try them out for a month...
TheRealMasonMac@reddit
GLM-5.1 is leagues better than K2.6, which has massive problems with it overthinking itself into misinterpreting the request and doing something completely different. GLM-5.1, in comparison, has better lateral thinking and will actually stop itself to ask you questions when it realizes it doesn't understand how to best proceed.
kamikazechaser@reddit
I'd personally put it in the same tier as Sonnet 4.6/GPT 5.4, with the added advantage that it isn't lobotomized with useless guardrails.
voyager256@reddit
That's not that impressive IMHO. The question is how close it really is to Opus 4.6 (general) and Sonnet 4.6 (coding), not vs 4.7 or GPT 5.4.
If it's significantly better than GLM 5.1, then that would be impressive, as I think Kimi 2.5 fell behind GLM 4.7, let alone GLM 5.
a9udn9u@reddit
I suspect for the model itself, it really just comes down to scale; there's no secret sauce. Word on the street is that Sonnet is a 1T model, Opus is around 5T, and Mythos is 10T, while the largest open-weight models are 1T params and perform roughly at Sonnet's level. Harnesses can either amplify or restrict model performance, so they are as important as the models themselves.
Caffdy@reddit
Until someone from inside comes forward and confirms it, those numbers are pure speculation, even illogical. There's simply no way to deliver good throughput for 5T models yet, even less for 10T ones; even if they could make them, it would be absolutely unprofitable to even try
a9udn9u@reddit
The Sonnet and Opus numbers are from Elon Musk's X account, so maybe you are right, but Mythos being a 10T model was leaked from an Anthropic blog post draft; I think it's quite likely to be true
Clipthecliph@reddit
M5 ULTRA 512 gb ram sufficient?
SuperFail5187@reddit
Two of them and you are set.
Clipthecliph@reddit
Some said to wait for the M5 Ultra, as it should be just around the corner, and that the M3 was too slow.
SuperFail5187@reddit
m5 is way better, yes.
Mistredo@reddit
Not enough.
Fresh-Resolution182@reddit
"it wont randomly get nerfed" is the real value proposition. version stability alone is worth the hardware investment at this point
Super_Sierra@reddit
My only issue is that it isn't Opus. When it comes to implicit instruction, Anthropic has no equal.
PostingGuru@reddit
A true 85%, run at the cost of electricity, sounds insane. I'm worried that the longer I put off the hardware investment, the MORE expensive GPUs will become
zzsmkr@reddit
Being better than Opus 4.7 is not the compliment you think it is lmao
abrady@reddit
What sort of tasks are you seeing it do well? reviewing? greenfield creation? edits? How are you evaluating as well? just reviewing what was done?
Worried_Drama151@reddit
Naw k2.6 sucks sorry especially for how large it is
AgentAiLeader@reddit
how does it hold up against Sonnet 4.6? i'm currently running most of my agents on Sonnet, do you think Kimi K2.6 could eventually replace it?
FusionCow@reddit
It's better
LoveMind_AI@reddit
I have fully replaced Opus with it. I swap to GLM-5.1 for a few things, and MiMo-V2-Pro for very long context window stuff as well. But the core is Kimi K2.6 and it's working very very nicely. Saved me from having to slum it with GPT-5.4.
Sea-Load4845@reddit
In what hardware are you running it ?
segmond@reddit
Qwen3.6-35B can do 85% of the tasks that Opus can at a reasonable quality.
Tomr750@reddit
at what quant/what hardware do you run?
segmond@reddit
Q8.
llitz@reddit
Mine is just in a spiral of reasoning that's not even fun. Added Gemma 4 31B, and the same problem was solved in a single prompt: slower to prompt, but it was accurate. Hard to justify the 35B unless I can figure out a way to get it to not be stupid while reasoning. (Q6 on both models)
wren6991@reddit
It doesn't seem to like being quantised. I'm using Unsloth's UD-Q8_K_XL and that loops less, but it's a pretty big model at that point.
butt_badg3r@reddit
Listen. I'm not even willing to spend 5k on a Mac Pro to run LLMs locally. The cost of running a decent model locally just doesn't make sense. I'll just stick to Claude Max.
no_spoon@reddit
Who makes Kimi and how do I know I can trust it?
Bobylein@reddit
The point is that it's open weights and you can run it on any (very expensive) hardware
no_spoon@reddit
No. The point is that whatever service I use, I need to be able to trust it. You don't need local hardware to run Kimi. So what is the underlying company that runs it, and can it be trusted?
Chriexpe@reddit
For those who don't have 8x RTX Pro 6000 to run it locally, is it worth going for their $39 plan (Kimi 2.6) instead of Copilot Pro+? "I've built" an entire project using Copilot's Opus 4.7 (and I'm "still" only at 35% of premium requests) and it went really smoothly.
BidWestern1056@reddit
don't demean kimi-k2.6 like that.
Moist-Length1766@reddit
So its nowhere near being a replacement for Opus?
Bobylein@reddit
Well for some people costs matter and the API pricing is much cheaper
Ok-Coach9837@reddit
Wow
ridablellama@reddit
The new American dream! Running Kimi at home ;D
ProfessionalJackals@reddit
Communist detected on American soil. Lethal force engaged. ~ Fallout...
Turbulent_Pin7635@reddit
Laughing in M3U
Nokushi@reddit
would really like to try it, which provider do y'all recommend?
jonas-reddit@reddit
If you're lucky you can find it free somewhere. Otherwise go for cheap. Personally, I like OpenRouter.
Nokushi@reddit
Thanks! If I understand correctly, OpenRouter lets you choose which provider to use for the model, right? Does it impact model performance in any way?
TaylorTWBrown@reddit
How can I make this run with crappy hardware? Like a 2080 ti or 256gb of ddr4
CheatCodesOfLife@reddit
https://huggingface.co/ubergarm/Kimi-K2.6-GGUF/tree/main/smol-IQ1_KT
almbfsek@reddit
every single time the same post... fast forward 1 week, everybody will be talking about how bad it is compared to claude...
Tough-Tangelo-5331@reddit
Where do people run these models? Is it just the mini-server users talking? No diss, just asking as an enthusiast.
Rare_Operation2367@reddit
Can't wait to see where open LLMs will be in ~5 years (which is when I will actually be able to afford local compute). We rely heavily on Opus and 3.1 Pro in our daily workflows, but we will inevitably reach a point where these local models surpass the current frontier. That should be enough to create your own full-fledged J.A.R.V.I.S. that is actually useful and not just a tech demo for reels.
spambait-aspaaaragus@reddit
How are you running it OP?
Moist-Length1766@reddit
he is running it in his dreams at night
zenom__@reddit
Yeah, true local LLM. lol
MannToots@reddit
The coding I did this weekend with it left me very disappointed. I disagree. Haiku 4.5 at best and even then I dunno man
Skid_gates_99@reddit
can't compare it with opus 4.6, as of now.
lemon07r@reddit
It's not better than opus 4.6/4.5 but it is 100% better than opus 4.7 at most things cause 4.7 is absolutely garbage.
Drumroll-PH@reddit
Iâve tried swapping tools in my own workflows too, and I usually land on âgood enough and consistentâ over chasing the best model. If it handles most of your use cases and keeps things moving, thatâs already a win. Iâd just keep a fallback for the edge cases where it breaks.
National_Meeting_749@reddit
I swear the last post I saw from this sub was "Kimi k2.6 is bad at agentic tasks"
Turbulent_Pin7635@reddit
This is just propaganda. I have tested it; it is amazingly good. Don't believe all the posts you see. There are a lot from private companies trying to salt new Chinese models.
rebelSun25@reddit
We like to be teased and disappointed to keep the boredom away
Turbulent_Pin7635@reddit
Just waiting for the 4.5-bit quant from the inference labs to run this beast on the M3U. Fuck the pp. It's faster and cheaper than waiting for my Claude session to reset. =)
solomars3@reddit
Bro, I'm trying to integrate it into my own app, but I'm not able to make it work via API, can't fetch the model. Can anyone help me here?! (I'm a noob)
_VirtualCosmos_@reddit
Welp, Opus models are most probably monstrously big, as their price per million tokens indicates.
Potential-Leg-639@reddit
Every day we find a new Opus replacement here
MacaroonBulky6831@reddit
How can I connect a local Kimi K2.6 model to my VS Code? Using the Continue plugin? Or is there a better alternative available?
Cosmicdev_058@reddit
Seeing similar on our end. The browser use is the part that surprised me most, it holds up on multi step flows where I expected it to fall apart around step 6 or 7.
Quick heads up since people will ask, we just added it to the Orq router if you want to test it without setting up local inference. OpenAI compatible so it's a base URL swap. (I work there.)
Synor@reddit
85% of the time it works every time!
Colecoman1982@reddit
That doesn't make sense.
Synor@reddit
Yes. It's a meme and refers to OP's statement "it can do about 85% of the tasks that Opus can at a reasonable quality"
Colecoman1982@reddit
I know, it's from the movie Anchorman, referencing a comment by Paul Rudd's character, and my comment references what Will Ferrell's character says to him in response...
Synor@reddit
Oh. It's early here. đ„±
legodfader@reddit
It's an Anchorman reference :)
Colecoman1982@reddit
Yes, I know. So was mine. ;-)
Barncore@reddit
How can you know that already?
I think you WANT it to be a legit replacement for Opus 4.7
JoeyJoeC@reddit
I found Opus 4.6 to be a great 4.7 replacement.
singh_taranjeet@reddit
The "monstrously big" part is what kills me though. 85% of Opus quality sounds great until you realize you need a server rack just to run inference at reasonable speed.. What's your actual hardware setup for this?
-Leelith-@reddit
What are your hardware specs to have it running?
bastomic95@reddit
What's your setup, if I may ask?
Relative_Mouse7680@reddit
What about compared to GLM 5.1, have you tried it yet?
GermanBusinessInside@reddit
Interesting timing with Kimi K2.6. Been seeing a lot of these "Opus replacement" claims lately, but this one actually seems to hold up based on the benchmarks. The browser use capability is what stands out to me. Curious if anyone's tried it for longer agentic tasks where you need the model to stay coherent over many steps?
power97992@reddit
Yeah, it is pretty good from my limited testing, but opus and glm 5.1 are probably better
I_HAVE_THE_DOCUMENTS@reddit
I canceled my Claude code sub last week and I've been looking for alternatives and I'm curious what the "tone" of the model is like.
The breakthrough feature of Claude for me that has made it so useful has always been its conversational tone that allows me to treat it as a sort of collaborator while I get my ideas sorted out, compared to the other major AIs that try to produce a wikipedia article and lecture on every prompt.
ihppxng62020@reddit
Who is running this locally? Won't it be much worse in performance even if you get this running at Q4 on some Mac Studios?
Fit-Statistician8636@reddit
It is native 4-bit, so âQ4â doesnât hurt it and it runs better than GLM5. But unless you have 8 GPUs, its speed is not usable for agentic coding.
fastlanedev@reddit
I prefer it over Claude Opus 4.7. Haven't touched my subscription for a while now.
Why?
Because it doesn't lecture me about useless things, actually listens to my instructions, and through pi... the system prompt is actually respected.
I'll take 85% of Claude if it means I can actually control the context I give it. Claude just... Throws so much shit in the background
mwachs@reddit
How have you slowly been replacing your personal workflows? Didn't it just come out today?
Markus2816@reddit
curious what hardware is needed to run it with decent performance at context size like 128k?
FPham@reddit
Well, my opinion is different, but it is an unpopular one. Even Opus 4.7 chokes on complex stuff, and quite a lot. As soon as your project grows, it's constant fixing of primitive stuff.
lombwolf@reddit
Did Moonshot fix their usage limits? I saw a while ago some people were saying the usage got way more limited
qroshan@reddit
If I know a company is using 85% intelligence, I will use 100% intelligence to crush it. People who claim 85%, 90%, 95% don't understand that the differences compound each iteration (chain of thought).
Turn 1: 85%, Turn 2: 80%.... Turn N: Reddit Midwit
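Whether real tasks actually compound this cleanly is debatable, but the arithmetic behind the claim is simple:

```python
# Per-turn quality q compounded over an n-step chain.
for q in (0.85, 0.95, 0.99):
    print(q, [round(q**n, 2) for n in (1, 5, 10, 20)])
# 0.85 -> [0.85, 0.44, 0.2, 0.04]
# 0.95 -> [0.95, 0.77, 0.6, 0.36]
# 0.99 -> [0.99, 0.95, 0.9, 0.82]
```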
LeatherRub7248@reddit
Anyone directly moved from Codex 5.4 to Kimi?
Any comments on whether it's a feasible replacement?
Cat5edope@reddit
They turned Kimi into Qwen; it's a super overthinker now
Mental_Ad_6512@reddit
This is obviously a promotion post. Guess the marketing team at Kimi is working hard.
useresuse@reddit
Slowly? It's been out 1 day lol
Different_Fix_2217@reddit
Same, but for creative writing. It's the best model I've ever used, including the latest Opus, GPT 5.4 and Gemini 3.1 Pro. It has the social intelligence of GPT 5.4 with a knowledge base nearly as good as Gemini's, and it writes better than Opus and has no positivity bias, unlike it.
madheader69@reddit
I agree. I started using the Hermes agent a few days ago and tried out Kimi K2. I had forgotten I'd asked Kimi to perform a refactor/new feature in my 300+ LOC project, left the house and came back; Hermes had timed out, loaded up the wrong session, and started working in that. Then today when it timed out, I loaded up the wrong session to discover that the entire refactor and new feature, better than I had designed it, had been completely one-shotted... And I think it cost me 35 cents.
jd52wtf@reddit
The current issues with the larger models seem to be mostly over-constraint. Think RoboCop when they filled his head with hundreds of conflicting directives. Due to this, I can only assume that models that are not over-constrained in this way, even though smaller, are likely going to provide better results.
If not now, then very soon.
Once the overpriced-parts gridlock lifts, there are going to be a lot of subs canceling, and right quick.
alext77777@reddit
I tried it both on their website and through OpenRouter. I've got a prompt to generate a Martian base in a 90's retro style using a single HTML file; I love to check new models with it. I don't see how this new Kimi model is Opus or even Sonnet grade; the render is far worse. GPT on xhigh, or Sonnet and Opus with high thinking, generate highly detailed scenes; this Kimi model is just basic.
alext77777@reddit
Also I tried to vibe-code a retro game in one shot, and it's pretty bad, so far from GPT xhigh or Sonnet 4.6 high. I don't know about GLM 5.1, but I remember the Pony-something model they put on OpenRouter a while ago was amazing, nearly the same level as the best closed models. I don't know why the benchmarks put this Kimi model so high đ€
FormalAd7367@reddit
What's Opus... so distant
bennyb0y@reddit
But how are you using it? Is it orchestrating code, or what are you doing?
1EvilSexyGenius@reddit
Y'all be comparing models to harnesses
Content_Standard_421@reddit
Don't spill it out man, delete the post.
Zulfiqaar@reddit
I'll have to test it for browser use! This was one of my core pieces of feedback about K2.5: it was one of the first open models that was decent at browser use (better than Gemini!), but it thought for sooooo long I didn't want to wait around. I hope its overthinking was remedied. Opus was discovered to be cheaper than Sonnet for many tasks, just because it reasoned much less and just "got it", and Kimi is worse than Sonnet there too. Looking at Artificial Analysis, the previous one was ~7M Opus, 28M Sonnet, 89M Kimi in terms of tokens needed to finish the benchmark (top-of-my-head rough figures).
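Rough math on why token efficiency matters as much as price per token (the 7M/28M/89M figures are my rough recollection above, so treat this as illustrative):

```python
tokens_to_finish = {"opus": 7e6, "sonnet": 28e6, "kimi": 89e6}
ratio = tokens_to_finish["kimi"] / tokens_to_finish["opus"]
print(f"Kimi needs ~{ratio:.0f}x the tokens, so it has to be >{ratio:.0f}x cheaper per token to win on cost")
```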
FoxiPanda@reddit
I've been messing with a cloud version of Kimi-2.6 for the past few hours (since local quants aren't really available in any reasonable size yet) and it's VERY EARLY... but ... I feel like in my personal ranking, Kimi 2.6 feels like it slots somewhere between Opus 4.5 and 4.6 and above Sonnet 4.6 firmly.
I've even been watching some of its thinking (it seems like it leaks its thinking when it gets confused or unsure), and the thinking steps I've seen so far are excellent in the face of ambiguity.
Additional-Curve4212@reddit
ollama cloud? or kimi's api or cloud or something?
FoxiPanda@reddit
Ollama cloud for me (already had a year sub from months ago lol). It feels like they are working to scale inference availability but sometimes it gets real slow lol...so fair warning if you go that way.
LankyGuitar6528@reddit
Does it have MCP in the web interface? I can't find it. I'd love to test that out if it exists.
InterstellarReddit@reddit
Yes, an inference company that lets their customers do the initial testing and provide feedback. We believe you.
Barubiri@reddit
I used it today, can confirm it's good.
hihenryjr@reddit
"King" from Discord?
Snoo_28140@reddit
Won't be running that locally any time soon lmao. But this is amazing, OSS is competing with frontier. What a time to be aliveeee
LittleYouth4954@reddit
Tell us more about your workflows
jeffwadsworth@reddit
Not enough time for that so just wait for a real test user.
jeffwadsworth@reddit
Anything you throw at it, run it by a locally running 4-bit GLM 5.1. I bet it is just as good. Don't use the website version because it doesn't work well. I can test prompts for comparison if need be.
cmndr_spanky@reddit
What "tasks" exactly? The only thing I really care about is using LLMs for agentic coding (inside Claude Code, opencode, etc).
Simple agentic RAG stuff works just fine with small models already