Yes, just reading this: “Kimi-K2.6 has the same architecture as Kimi-K2.5, and the deployment method can be directly reused.”
So, no quant needed - SGLang + KTransformers should be able to use the native .safetensors model. Yes, I have had a great experience with Kimi+SGL+KT, and with SGL in general (using voipmonitor's fork to run MiniMax-M2.7 from VRAM). It is not without issues, but llama/ik isn't either.
I’ll get out of the bath, run “hf pull”, and post my “recipe” for K2.5 in 10 minutes 😀.
I built my rig gradually over the years, starting with buying GPUs one by one, then PSUs, and at the beginning of the previous year I migrated to the EPYC platform with 8-channel 1 TB DDR4 3200 MHz RAM (the server memory cost me approximately $1600 in total)... so yes, I got lucky enough to upgrade before RAM prices went insane.
You can offload layers to regular RAM. The entire model doesn't need to be in VRAM with GGUF. So if your total VRAM+RAM can hold the weights, you should be able to run the model (albeit slower than if it was all in unified high-bandwidth RAM).
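For a concrete picture of what that offloading looks like, here's a minimal llama-cpp-python sketch - the GGUF filename and layer count are placeholders you'd tune to your own VRAM, not a recipe for this specific model:

```python
# Sketch: partial GPU offload of a GGUF model with llama-cpp-python.
# Only n_gpu_layers layers are placed in VRAM; the remaining weights stay in system RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="some-model-q4.gguf",  # placeholder: whatever quant actually fits your disk/RAM
    n_gpu_layers=12,                  # raise until VRAM is nearly full; lower it if you OOM
    n_ctx=8192,                       # the KV cache also eats memory, so keep context modest at first
)
print(llm("Hello", max_tokens=16)["choices"][0]["text"])
```

llama.cpp's llama-server has the equivalent GPU-layers knob; the trade-off either way is that whatever stays in system RAM runs at RAM bandwidth, hence the "slower" caveat.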
What gives me pause about these benchmarks even more than seeing GPT 5.4 and Kimi beating Opus 4.7 in coding scenarios (something I also doubt) is seeing Gemini 3.1 Pro winning in things like Terminal Bench. I cannot for the life of me get that model to be competitive in what that benchmark claims to cover, yet it's number 1?
Gemini is perhaps the weirdest, most inconsistent model.
The only thing that I can really think, is that they have a lot more knobs they turn dynamically, based on the current load.
Sometimes I get super-genius Gemini who does a full load of work up front, and sometimes I get the absolutely minimal effort model.
Gemini will literally add placeholder stubs and destroy existing work.
One of the things I hate the most is how it will make notes about how "in a real project, we would do xyz, but we'll just put this stub for now". It's so hard to get the model to take things seriously and not as a trivial exercise.
When it's good, it's very good. When it's bad, it's among the worst.
When it reaches its context limit, it falls apart the hardest.
indeed, I think this might be the first time open weights have been at SOTA level since the release of GPT-4, and that was March 2023. Also, dare I say, not 6 months behind, and no moat for closed weights.
if this runs well on ollama that's going to be interesting for self-hosted inference. the MoE architecture should keep memory usage reasonable even at this scale. curious what the actual VRAM requirements look like with different quants.
You need, at the bare minimum, 32 GB of VRAM, like 700 GB of the fastest RAM you can get, and a motherboard with 4 channels... to run it slowly but at usable speed with ik_llama at Q4.
I'm fine with it. Just ask some things, do other stuff, come back and get my answer. I try to use "instruct" with these models, but sometimes, when I really need it, I even run them as "thinking".
It could be benchmaxxed, but since it’s the Kimi team I think it’s legit. Their last model was a breakthrough for real world performance so I would not doubt them.
Very exciting. 6 months and we'll have this performance at 1/10th the size presumably, good to see open weights giving the closed labs some serious competition!
I see. I never noticed the previous versions, as those are too large for my GPU (I thought they followed MiniMax's route). I tried Kimi-Linear, which is MIT only.
any early numbers on what spec you need to run this locally at a reasonable quant? the K2 lineage has been creeping up in size; curious if 2.6 still fits on dual 24 GB or if it's workstation-class cards now
LagOps91@reddit
Pretty damned impressive assuming the benchmarks translate into real-world performance
anedisi@reddit
I was part of the beta; there were times when I forgot I was on Kimi and not GPT 5.4. Opus is still the king in most workflows honestly, but this thing is coming on strong.
Theio666@reddit
I can't take this seriously unless you're mainly working on frontend things. Outside of frontend, GPT 5.4 and 5.3-codex are just miles ahead of Opus.
ProfessionalJackals@reddit
It's the reverse in my experience. There were so many times when GPT 5.4 ended in prompt hell because it was unable to fix something, while the exact same initial prompt given to Opus 4.6... fixed it.
GPT 5.4 is an excellent model, but it's often way too narrow in its focus.
Theio666@reddit
This might be a prompting issue; models behave a bit differently, and you might be used to a different type of prompting, or your AGENTS.md might not be great for GPT, etc. For me, GPT does exactly what I need when I ask it to. The only reason I'm using other models is pricing.
autoencoder@reddit
"you're holding it wrong"
No. It should understand English.
Theio666@reddit
You're overestimating the ability of a random/average person to convey their thoughts in natural language. I'm not joking: developed reading comprehension and "explaining what you want" skills are way rarer than you might think. It's quite common to see someone give a vague instruction and then, after the model interprets it wrongly, decide it's the model's fault.
autoencoder@reddit
So you're saying ChatGPT is bad at interpreting vague instructions? Why wouldn't other models be?
Theio666@reddit
The thing is, when models "interpret vague instructions", they do so according to how they see fit, which might not be an optimal solution. In general, this results in a lot of tech debt over time, since you stack up a lot of randomly interpreted instructions. I prefer models that fail loudly when they're missing some info over models that silently interpret something in a way that makes things fail in the long run.
It all depends on the seriousness of the dev work; if you just vibe code a small app this doesn't really matter, and getting to the point where it does matter takes some time.
GPT 5.4 is really persistent in following instructions. If you have a correct AGENTS.md with info on how and what to test, give some sort of acceptance criteria for hard tasks, and talk to it a bit to have a plan beforehand (you don't even have to use plan mode, the model is great in free-talk mode), then the model pretty much one-shots tasks of any difficulty, frontend aside.
autoencoder@reddit
I guess eagerness is a matter of taste. I don't use agentic coding, and I always say "Be brief" in the system prompt, so I actually prefer it doing as little as possible. I'm more hands-on that way. I do notice I'm "lagging" compared to other peers, but in my personal work, code quality and maintainability matter much more.
For my use case, plenty of models are good enough nowadays. I even use local ones running on CPU from time to time, esp. when some cloud service is down.
More-Curious816@reddit
I think what he wants to say is that most people give only vague instructions, thinking that's enough, and he's right. Most people struggle to convey their mental images precisely. LLMs can't read your mind; your mental image stays locked inside unless you supply every detail for others to reconstruct it.
YRUTROLLINGURSELF@reddit
No, that's not what they're saying; they're just being overly polite. The implication was that you may be the one with the problem understanding English, and if your follow-up response is any indication, they may in fact be onto something.
Waste-Peak-1213@reddit
For me, 12 years in tech, GPT is much more precise; I feel like I opened another tech agency with grunts doing the stuff I want. Opus is not precise enough or gets convoluted. Prompting matters a lot, obviously. Simpsons "duh".
Zeeplankton@reddit
GPT 5.4 is insane at backend but it's definitely a smaller model.. helps to check output with opus.. but per parameter 5.4 is a monster
Theio666@reddit
It doesn't feel like a smaller model at all. Maybe it depends on the case, but for my main repo at work - an agentic harness app with microservices, EDA for communication between services, each repo with its own env - Opus is not quite able to do repo-wide edits that need to touch 2-3 services, while GPT easily does pretty much anything, given I provide a correct design doc. My programming in the last month has fully shifted to writing design docs, checking/reviewing code, and setting up debugging sessions; there's literally no need to write code by hand with 5.4 at this point.
Unusual-Candidate-43@reddit
Pretty much the same experience: GPT 5.4 rarely makes mistakes, especially for backend engineering, while Claude makes mistakes and GPT 5.4 easily fixes them.
squired@reddit
Similar experience here. Claude Code showed the way; Codex made it truly work. It was the first workflow that freed up my attention enough to run parallel agents and trust that the validation tests actually passed.
Kappalonia@reddit
Huh? I work in data science and Opus 4.5 wiped the floor with GPT.
Theio666@reddit
Exactly the opposite experience in ML/DS for me. I remember back in the Sonnet 4 days how everyone was glazing the model, and it implemented RoPE in a custom transformer without caching, while I explicitly asked for caching and even provided the caching code from the official torchtune library. It just put the cache function inside the forward pass without making it remember previous calculations lol. o3 did that easily btw. Since then I try to avoid Anthropic models. I used Opus 4.5 for a while when I was working on an LLM proxy app, and to this day I'm still fixing weird bugs it left here and there with GPT. I spin up Opus only when I need some frontend fix, and that's not often, since Kimi now handles most of my frontend needs.
That's not even getting into how compaction in Codex is on another level compared to any other implementation, thanks to all the magic they do on their endpoint side. I wish we had at least something similar for other models :(
I'd say that, in general, GPT does its job better the bigger/harder the task is. I don't know how, but a lot of people have observed the same thing: the model stays super coherent on long runs and is good at state recovery.
squired@reddit
You're spot on when it comes to the compaction. It boggles the mind how well it works. They nailed that bit.
Theio666@reddit
There's a blog post by OpenAI on that. From what I understood, they create a compact vector representation of the conversation - I don't remember the details, but basically embeddings of the chat - which they then let the model check, or just always append to the chat. I don't think anyone else is doing something like this; usually compaction is some sort of summarization, and you can only preserve limited info that way. Non-token-based embeddings should allow much better compaction.
But I might be wrong ofc; I don't remember the exact details, and they didn't share the code/exact logic anyway.
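To make the contrast with summarization concrete, here's a toy sketch of what embedding-based compaction could look like - purely my own reconstruction of the idea, not OpenAI's actual implementation; the embed() function is a hashed bag-of-words stand-in for whatever learned encoder a real system would use:

```python
# Toy sketch: keep recent turns verbatim, and recall older turns by vector
# similarity instead of lossy text summarization.
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Stand-in embedding (hashed bag-of-words); a real system would use a learned encoder."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

class CompactedHistory:
    def __init__(self, keep_last: int = 4, recall_k: int = 2):
        self.turns = []             # full text of every turn
        self.vecs = []              # one embedding per turn
        self.keep_last = keep_last  # newest turns kept verbatim in context
        self.recall_k = recall_k    # older turns recalled by similarity

    def add(self, turn: str) -> None:
        self.turns.append(turn)
        self.vecs.append(embed(turn))

    def context_for(self, query: str) -> list:
        recent = self.turns[-self.keep_last:]
        older = self.turns[:-self.keep_last]
        if not older:
            return recent
        q = embed(query)
        sims = [float(v @ q) for v in self.vecs[:len(older)]]
        best = sorted(range(len(older)), key=lambda i: sims[i], reverse=True)[:self.recall_k]
        return [older[i] for i in sorted(best)] + recent
```

The point is that old turns survive as vectors rather than as a lossy text summary, so the harness can pull back whatever is actually relevant to the current request.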
squired@reddit
That's fascinating and I'll explore further. Thank you!
VicemanPro@reddit
You should all know that performance caps are a thing. You're not all getting the same model, every time.
squired@reddit
Fully agreed. People need to quit with the fanboi stuff. I'm a graybeard dev and I hop to and fro every few months. Codex w/ ChatGPT-5.4 Extended/Medium is quite a bit ahead of Claude Code w/ Opus 4.6/7. You're right though, Anthropic is def better at frontend aesthetics; not enough for me to maintain a Max sub atm though. If they release a baby Mythos next month? I'll hop right back!
anedisi@reddit
Yesterday I had Opus solve an iPad app bug that no other model could solve, and I have access to GLM, Kimi, and GPT.
BihariBabua@reddit
The other day, Opus couldn't get a grid layout right. Sad state of affairs.
Spirited_Neck1858@reddit
how about k2.6 vs sonnet 4.6?
lemon07r@reddit
It doesn't, but it's still quite good. Just still nowhere near Opus or GPT 5.4. I've been running a lot of A/B testing with K2.6 over the last couple of days against Opus 4.6 and GPT 5.4 (and Opus 4.7). Opus 4.7 was dead last lol, so at least we can say K2.6 is better than Opus 4.7. I only tested reviews and audits, to see which model caught the most valid bugs and had the fewest hallucinations/false positives. At least Kimi looks pretty in evals now? But it isn't actually much better than K2.5. I do think it's still currently the best open-weight model.
IrisColt@reddit
mother of God...
ResidentPositive4122@reddit
See, minimax, this is a proper modified MIT. Still MIT core (i.e. do whatever you want) just with an attribution if you're a large corp. That's it.
Macmill_340@reddit
Why not just use apache if it's about attribution?
Dudeonyx@reddit
Why is that so important to you?
Genuinely asking.
ResidentPositive4122@reddit
Because calling something "modified MIT" is a farce when the entire thing is the antithesis of MIT. I have no issue with them releasing a model NC. That's up to them. But in that case they should just say so, use a proper license, and be done with it.
EveningIncrease7579@reddit
clouder300@reddit
carnist shit
thrownawaymane@reddit
I am in this photo and I don't like it
Admirable_Market2759@reddit
You use a Xeon? How does it work?
I bought an AliExpress motherboard but it was dead on arrival lol
Haven’t tried again.
po_stulate@reddit
I'm the 512 GB DDR4 RAM, I can confirm
panchovix@reddit
How many t/s do you get on that setup? I think that one can run 4-bit IIRC on lcpp lol.
seamonn@reddit
You mean s/t
MoonLightSunDark@reddit
Thanks for the laugh lmao
the-username-is-here@reddit
Yes, we can!
WoodCreakSeagull@reddit
Thanks, Ollama
KeikakuAccelerator@reddit
Fk, this got me lmao.
ShengrenR@reddit
I laughed louder than I should have.. bit of a cackle if I'm honest.
Cool-Chemical-5629@reddit
Now that's the kind of pun idea I like.
jatjatjat@reddit
You won the internet today.
Cold_Tree190@reddit
Lmao fr
TheItalianDonkey@reddit
whelp, this made me audibly laugh.
RedParaglider@reddit
I'm loving seeing more on-topic memes in this sub and fewer slop comments 😘
silenceimpaired@reddit
Sigh. It appears the rumours of a smaller Kimi were just rumours.
Bakoro@reddit
I don't see the point, unless Kimi has some special feature you want that no one else is offering.
For smaller LLMs, we've got like a dozen other offerings, there should be at least one real frontier sized behemoth model.
silenceimpaired@reddit
Kimi had many singing its praises for creative writing. I'm hoping to see something competitive with Qwen and Gemma ~30B dense.
dtdisapointingresult@reddit
Are you...are you saying you think Qwen is competitive for creative writing? Qwen is one of the worst writing models there is. I'm shocked you mention it in the same breath as Gemma 31B.
silenceimpaired@reddit
Don't be so antagonistic. No, I don't think it's great at writing, but if you look at Kimi Linear, it was worse because of its size and training.
dtdisapointingresult@reddit
I'm just antagonistic towards Qwen's writing, haha. The only worse writing model is GPT-OSS.
Zeeplankton@reddit
I can imagine someone doing some frankenstein distill model with gemma
oxygen_addiction@reddit
With them pivoting to an orchestrator (big) + hundreds of subagents (small) model, it'd make sense.
OcelotOk8071@reddit
that's not a foregone conclusion.
silenceimpaired@reddit
How I hope you are right.
onewheeldoin200@reddit
As someone deep in the throes of GPU poverty...what does the hardware look like that is capable of running this? 8-10 RTX6000 pros? Something even nuttier?
Fit-Statistician8636@reddit
You can run it with a single RTX 5090 and a lot of RAM. Not cheap by any means, but not as expensive as 8x RTX PRO 6000.
700 t/s PP and 20 t/s TG is not quite there for coding, but chat is fine:
https://huggingface.co/ubergarm/Kimi-K2.6-GGUF/discussions/3
arcanemachined@reddit
You don't. You torment yourself trying to get a couple of tokens per second, or you give up and use OpenRouter.
HopePupal@reddit
fully in VRAM? you're basically right. it's a 594 GB model on disk. it was built to run on a single last-gen datacenter GPU server. the rack-mount ones take 8 SXM or OAM GPU bricks. you see fully populated 8× H100 80 GB servers on eBay once in a while for $200k, but those won't quite do it because you won't have any room for context, so figure on needing the even more expensive H100 96 GB. or their rough Intel or AMD equivalents, the Gaudi 2 or the MI250. somewhere between "Lamborghini" and "house"?
you can sorta tape one together out of RTX PRO 6000s and a Xeon or EPYC motherboard with enough PCIe lanes, sure. but we're talking, like, minimum $100k just for the cards.
there's already a quant attempting to shrink it enough to fit in a 512 GB unified memory Mac Studio, which is going to be a lot slower but probably still useful if they manage to further quantize the already INT4 QAT model without lobotomizing it too much
tyrantwargodnamedbob@reddit
Check the top comments, some guy's rig is like a meter long and he's trying out K2.6 today I believe
LegacyRemaster@reddit
Ok... I have to buy another 3 RTX 6000..
Fit-Statistician8636@reddit
That’s a great state to be in, actually :)
ProfessionalJackals@reddit
Heuu, why? It's a MoE model, no? You just need a ton of system RAM...
Worried_Drama151@reddit
https://x.com/bridgemindai/status/2046313533743468993/video/1?s=46 too many Kimi paid shills polluting the sub.. insane - at least Qwen and Gemma don’t buy plaudits
oxygen_addiction@reddit
8xRTX6000 needed to run this with decent context, right?
Damn. Claude/Codex etc. must be a bit bigger than this and GLM5.1
Expensive-Paint-9490@reddit
With 8 Pro 6000s you have 768GB VRAM and the model is 595GB... in the remaining 173GB you can fit a fuckton of context.
panchovix@reddit
With 8 at least you can run TP 8 on vLLM. With 6 or 7 you can run it but using PP.
Caffdy@reddit
For those who don't know:
TP = Tensor Parallelism
PP = Pipeline Parallelism
Tensor parallelism divides each layer into vertical slices across the GPUs; you need 2, 4, or 8 GPUs for it to work. Pipeline parallelism splits the model horizontally, distributing whole layers among the cards, so the number of GPUs isn't a problem even if it's odd.
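For concreteness, this is roughly how the two map onto vLLM's offline Python API - just a sketch: the repo name and GPU counts are placeholders, and I'm assuming the LLM constructor accepts pipeline_parallel_size the same way the serve CLI does:

```python
# Sketch: tensor parallelism vs pipeline parallelism in vLLM.
from vllm import LLM

# Tensor parallelism: every layer is sliced across all GPUs, which is why the
# card count wants to be 2/4/8 (it has to divide the attention heads evenly).
llm_tp = LLM(model="moonshotai/Kimi-K2.6", tensor_parallel_size=8)

# Pipeline parallelism: whole layers are assigned to GPUs in sequence,
# so an odd card count (say, 7 RTX PRO 6000s) works fine.
llm_pp = LLM(
    model="moonshotai/Kimi-K2.6",   # placeholder repo name
    tensor_parallel_size=1,
    pipeline_parallel_size=7,
)
```

The practical difference is that TP needs fast all-reduce links on every layer, while PP only passes activations between neighboring stages, which is kinder to PCIe-only rigs.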
PrysmX@reddit
True, but LOL at your electric bill. You aren't saving any money going this route and the only reason to do so would be privacy concerns. (and this is coming from someone that is very much a proponent of local AI - I know the electric bill pain firsthand haha)
tspwd@reddit
Wow, so much? Once you’ve paid off the local hardware, how would you compare, let’s say, a Claude Max subscription to running your own big model (with your own electricity)?
PrysmX@reddit
My electric bill is $700/mo. When you get into serious local hardware and perpetual use, the costs add up fast. The break-even point isn't as close as you think it is if you are running local models that actually compete with the frontier cloud models. It's years out.
tspwd@reddit
Oh, wow! That’s a lot! Are you running coding agents 24/7?
PrysmX@reddit
Various agentic workflows, some coding and some other things, as well as image and video generation.
tspwd@reddit
Seems like you are making good use of your hardware :)
ProfessionalJackals@reddit
Lots of solar panels? I mean, this is probably the best use case for tons of panels if you work from home.
Normally you're wasting a ton of solar energy to the grid (where you get cents on the kWh for it, or nothing, or need to pay to dump to the grid).
An extra 5k or 10k battery - they're barely 100 bucks per kWh these days. I think your electricity bill is paid back fast with local LLMs + solar...
PrysmX@reddit
I'm in a situation where I'm required to rent right now (moved here and will need to move again in the next year), so solar is not an option right now. Definitely in the plans when I hit my final location, though! 👍
ProfessionalJackals@reddit
Do you have a balcony?
Solar balconies are extremely popular here in Germany. You can get dual 500 W panels (limited to 800 W output) plus a microinverter for like 250 euros. The balcony/stackable batteries are more expensive (than rack ones) but still come in around 180 euros per kWh.
And you can take it with you to new locations.
oxygen_addiction@reddit
You'd be doing a lot of batching to get your money's worth. So you'd need a lot of free VRAM.
But fair point, it might be that 7x6000RTX is enough :).
Few_Painter_5588@reddit
In other news, apparently Cursor's Composer 2.1 model has started training
rebelSun25@reddit
We're about to see at least two videos from Theo about why it's actually a good thing
Mission_Biscotti3962@reddit
Nobody should watch videos from Theo
Glad-Ad6295@reddit
why? aint he a chill guy who has some anger issues
ayylmaonade@reddit
I don't like him because he's got one of the biggest, most annoying egos I've ever seen - especially for a guy who wrote a mediocre LLM wrapper that charges you $8/m just to hook you up to OpenRouter with a web search feature. Open WebUI has had more features for like a year+ at this point. Then he went off to create his own coding harness and explicitly disallowed the use of local models, because anybody who claims to use them for actual work is "lying". The guy is just... not at all as important or as smart as he thinks he is.
Mission_Biscotti3962@reddit
This. He has achieved very little. He milks his Twitch employment and depends on impressionable people being impressed by him having loud opinions. The confidence with which he communicates gives insecure/unknowledgeable people the impression that he "must know what he's talking about, given how loud and confident he is".
It's all bullshit.
Glad-Ad6295@reddit
ah, I haven't used any of his paid tools so I can't speak to the $8/m thing. I mainly just watch his content, which is usually pretty solid even if he has super strong (and loud) opinions. I can definitely see how that ego is annoying if you're actively interacting with his products though, thank you for replying
Marcuss2@reddit
50% of what he says is true, 50% is total garbage.
Problem is, he stands behind that 50% of garbage, even when called out.
hellomistershifty@reddit
i don't even know who that is, but someone with anger issues doesn't sound very chill
Darkoplax@reddit
Why is it not a good thing again ?
Darkoplax@reddit
Hope they do it fast, Composer 2 has been great so far
emprahsFury@reddit
if this sub isn't shitting on someone somewhere then the sky is falling and Hell has frozen over. llama.cpp released under MIT: "We fucking hate Ollama." Kimi released under modified MIT: "We fucking hate Cursor."
No_Conversation9561@reddit
u/ezyz Could we get MLX 2.8bit like Kimi-K2.5?
ezyz@reddit
Quant trials are still running! I just started uploading a 3.6 bpw on the quality frontier: https://huggingface.co/spicyneuron/Kimi-K2.6-MLX-3.6bit
This one pairs nicely with Qwen 3.6 35B on a 512GB Mac Studio.
Still searching for a good sub-3 quant, but the KL divergence seems to jump pretty dramatically on this model.
No_Conversation9561@reddit
What’s the size of this quant?
ezyz@reddit
459 GB total
Dany0@reddit
Boys and gals, we have Opus 4.7 at home
Kappalonia@reddit
Edit to 4.6, nobody wants 4.7 lol
Dany0@reddit
Turn off 1M mode and it'll be less arse
nmkd@reddit
Where can one do that?
Dany0@reddit
Use the env variable. But for opus 4.7 it's just a client side thing where it compacts earlier. You may also just shorten your claude.mds to <300 lines, not use skills and compact smartly and you'll get most of the benefits with none of the downsides
CornerLimits@reddit
I don't have enough square meters to host this, unfortunately :(
Dany0@reddit
Can't wait to run it at 1-5tok/s from ssd
DR4G0NH3ART@reddit
2 seconds/token take it or leave it /s
tazztone@reddit
token/sarcasm
ZeusZCC@reddit
2 token per day
groosha@reddit
Keeps vibecoder away
Dany0@reddit
Yes. Good and based. Keep away, shoo
DR4G0NH3ART@reddit
Lol
darkpigvirus@reddit
waiter: sorry sir but that's for ram 😞. you need to divide by 100 if you run it in SSD 🤣
ProfessionalJackals@reddit
100 SSDs in parallel, striped?
Thomas-Lore@reddit
More like 1-5s/tok.
FaceDeer@reddit
I think there's still a very important role for open-weight models that are as powerful as SOTA but too big to run on a conventional home computer: they serve as a mechanism to keep the big API providers "honest." If the big model APIs get too costly or throttle thinking too much or whatever, there will be providers offering these open-weight models to compete with them.
Bakoro@reddit
There are enough businesses and individuals who can totally afford to run a 1.1 trillion parameter model that it keeps pressure on the whole frontier industry not to get too crazy.
The company I work for doesn't quite hand out compute like candy, but I needed RAM, and they dropped several hundred GB in my lap without blinking, even with prices being what they are.
Businesses are already spending $100k+ a month on tokens; if they think that a million dollars in spend will give them control over their own infrastructure and effectively unlimited usage, that's going to be attractive.
If you consider stuff like defense contractors and private research labs around the world, where they want to keep their most sensitive data air-gapped, these huge models are extremely high value.
amuhak@reddit
One of the API providers could print it onto silicon and serve it fast and cheap for everyone. That means OpenAI can't charge 10x more for 5% better performance.
Dany0@reddit
https://openrouter.ai/moonshotai/kimi-k2.6 two providers, just one other than moonshot. So far the pricing is far from dirt cheap. And the tok/s is abhorrent too
amuhak@reddit
Making chips takes time. Lots of time. I'm talking about something like https://taalas.com/ who are literally burning the weights into silicon for crazy speeds.
It was a theoretical proposition. I don't think something like that is going to happen for K2.6 because: a) it's too big, and b) by the time they could fab something, a much better model will have dropped.
I think everyone has seen it by now, but https://chatjimmy.ai/ is a demo of the tech.
Kind_Style7978@reddit
I hope we do get a 'good enough' model, with enough context, to be burned into hardware cards, as speed improvements of over an order of magnitude would unlock new AI uses and help us move a lot of compute out of ~~software rendering~~, I mean, software LLM generation, and free up our resources for other uses.
Or just for MS to spend more RAM on putting copies of React into Notepad >.>
Dany0@reddit
I have the ssd space. And I'm in the top 0.1% because I'm technically speaking gpu rich with my 5090 lmao
Every day is a reminder I'm closer to being homeless than a billionaire
SnooPaintings8639@reddit
Nice. Can I move in with you?
GlossyCylinder@reddit
90% of us can't run it lol.
Dany0@reddit
To quote a great poet and lyricist, this house is a broken home
uniVocity@reddit
“Home”
PhotographerUSA@reddit
I need this in 35B format lol
TurnUpThe4D3D3D3@reddit
China finally did it. They finally beat US models on HLE tools. Congrats to the Kimi team.
Ok_Mammoth589@reddit
Did they? What does Claude 4.7 score on those?
SeyAssociation38@reddit
China is only a few months behind. Once they figure out EUV they have the potential to be ahead of the US.
guillefix@reddit
Sorry, what is EUV?
RelationshipLong9092@reddit
I have no doubt they'll do it, but EUV is a pretty big barrier to climb!
ffgg333@reddit
How is creative writing? Better? Less censored?
seppe0815@reddit
gemma 4 is the big uncensored model, nothing more you need bro! trust
Ynead@reddit
It kinda sucks for creative writing. Kimi K2.5 was quite superior
Budget-Light-1694@reddit
https://gofund.me/dae1e2fa0
Fresh-Resolution182@reddit
1.1T params and you still need the quant chart to figure out if your rig can even touch it. great model, democratized for the top 0.1% of hardware owners
arm2armreddit@reddit
RTX5070 12GB Vram 😭😭😭, gguf 0.1BIT whenn?
Genesis2001@reddit
12GB? what luxury!
cries in 1660S not designed for AI
lol
arm2armreddit@reddit
did u try running anything? it has 1400 cuda cores and 6GB sounds decent for llama3.3, or am i missing something?
Genesis2001@reddit
I run Qwen2.5-Coder-7B-Instruct-Q4_K_M from unsloth on it right now and get decent performance, but I haven't really used it beyond test prompts. I've been tweaking this command line over the last week or so though. The last 3 (repeat penalty, temp, top-p) params are new as of today and are being tested.
jwpbe@reddit
why are you using a 2 year old coding model and not something made in the last 6 months? Even the Qwen 3.6 that came out a few days ago is going to be faster and better than 2.5 coder at coding
Genesis2001@reddit
Because I haven't found a coding model in the 7-9B param range that runs comfortably on my hardware. I'm still quite new.
jwpbe@reddit
understandable, try this if you can fit Q5_K_M into a combination GPU / RAM offload with reasonable speeds:
https://huggingface.co/AesSedai/Qwen3.6-35B-A3B-GGUF
If not, try Q4_K_XL from here:
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF
Mixture-of-experts models, which are denoted by (Parameter Count - Active Parameter Count), route queries to different segments of the model as it runs inference, resulting in faster performance even if you can't fully load the model onto your GPU.
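To put rough numbers on that for the 35B-A3B model linked above (reading the name as ~35B total / ~3B active, which is the usual convention):

```python
# Why a 35B-A3B MoE runs acceptably even when partially offloaded:
# per-token compute scales with the *active* parameters, while memory scales with the total.
total_params = 35e9    # all experts must sit somewhere in RAM/VRAM
active_params = 3e9    # only this many participate in any single token
print(f"active fraction per token: {active_params / total_params:.0%}")  # ~9%
# Generation speed therefore lands much closer to a ~3B dense model,
# even though you still need memory for all 35B of weights.
```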
Genesis2001@reddit
I won't be able to load that Qwen 3.6 model and load my IDE at the same time, lol. It's maxing out my system ram with the model loaded right now. I'm getting about 8.3-8.4 t/s right now asking it to refactor a powershell script I had ChatGPT write (just a test prompt).
However, it's good to know that in a pinch, I can run the model.
AppealSame4367@reddit
You can run qwen3.6 35B though. Sincerely, someone with an RTX 2060 mobile and 6gb vram.
Chasian@reddit
Huh? How? What's your t/s? I ran it on my 3070 and it was like 7 lol
AppealSame4367@reddit
Linux, 32gb system ram, at low context i get 600-700 prefill tps and around 15 tps output. At around 50k context it's more like 200 prefill and 1-3 tps output.
Compiled for cuda 12.x on linux. Adapt values to your CPU etc. Low unsloth quant, no thinking, low kv cache quant, no vision. Still delivers better results than q35b no thinking or gemma 4 e4b with thinking.
#!/bin/bash
export GGML_CUDA_GRAPHS=0
./build/bin/llama-server \
-hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-IQ2_M \
--no-mmproj \
--no-mmproj-offload \
-c 80000 \
-b 2048 \
-ub 2048 \
--prio 3 \
-fit on \
-np 1 \
-kvu \
--clear-idle \
--cont-batching \
--slot-save-path ./slots \
--port 8129 \
--host 0.0.0.0 \
--cache-ram 8184 \
--spec-type ngram-map-k4v \
--draft-max 32 \
--draft-min 5 \
--spec-ngram-size-n 4 \
--spec-ngram-min-hits 1 \
--mlock \
--no-mmap \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
-t 6 \
--temp 1.0 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--presence_penalty 0.0 \
--repeat-penalty 1.0 \
--jinja \
--reasoning off
Dudeonyx@reddit
1660s has no ai accelerators afaik
AppealSame4367@reddit
It's not about "accelerators" (whatever you mean by that) - you could even run it, very slowly, on CPU. But it's still an Nvidia GPU; you should be able to use CUDA 12.x.
Zestyclose839@reddit
My Zephyrus G14 might’ve just found itself a new job
DigiDecode_@reddit
this GGUF quant should work with your RTX 5070, only 11.8kb in size 🤣🤣
rebelSun25@reddit
0.001 NanoQuant Ablated A10K coming anytime
Noobysz@reddit
And now i need the REAP version of this so that Strawberry have 1000000 Rs
Worried_Drama151@reddit
Most overhyped shit ever; this is a 2-horse race these days: Qwen and Gemma. Gtfo with Moonshot and z.ai stealing ur training data
spaceman3000@reddit
True. I prefer Gemma because she speaks my language with almost no mistakes. Qwen and all Chinese models unfortunately are very bad at it. But when English is enough for my tasks, Qwen is great too. I removed all other models from my workflow.
vex_humanssucks@reddit
The 12GB VRAM crowd already crying in the thread is very relatable. Curious what the actual FP8 quant sizes end up being - K2.5 at full precision was already a painful 150GB+. If they manage to get a decent IQ3/IQ4 that fits in 48GB that would be genuinely useful. Anyone know if the architecture changes anything for llama.cpp support or is it the same transformer layout as K2.5?
usrlocalben@reddit
Kimi has been native INT4 since Kimi-K2-Thinking, so ~4.5 bpw.
The non-vision portion has had the same architecture & hyperparams since the first version. Nothing has changed in 2.6.
There is no need to wonder what the size is; since the INT4 weights dominate the model, it's easy to compute roughly, given ~1T params:
10**12 * 4.5 / 8 / 1024**3 = ~523 GB
To be more precise wrt. MoE vs. attn:
384*7168*2048*3*60 * 4.5 / 8 / 1024**3 = ~531 GB
The safetensors are 555GB, so 555 - 531 = 24GB of embedding, output, attention heads, vision etc., which are in BF16, F32, etc.
Want to know the size of a 3 bpw quant? Just substitute another bpw for the 4.5, e.g. 3 bpw:
384*7168*2048*3*60 * 3 / 8 / 1024**3 = 354GB (+24GB = 378GB)
All of these values can be found in config.json (with the exception of the 60, for which one must know that Kimi MoE starts on layer 1 and not layer 0, so it's num_hidden_layers - 1).
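If you want to plug in other bit widths without redoing the arithmetic by hand, the expressions above drop straight into a few lines of Python (all figures taken from the comment above; the flat 24 GB for non-expert weights is the same rough estimate):

```python
# Rough weight-size estimate for a Kimi K2.5/K2.6-style MoE:
# 384 experts * 7168 hidden * 2048 expert FFN * 3 projections * 60 MoE layers,
# plus ~24 GB of embedding/output/attention/vision weights kept in BF16/F32.
GiB = 1024**3
EXPERT_PARAMS = 384 * 7168 * 2048 * 3 * 60   # ~1.0T parameters in the routed experts
NON_EXPERT_GB = 24

def model_size_gb(bpw: float) -> float:
    """Approximate total size in GiB for a given bits-per-weight of the expert tensors."""
    return EXPERT_PARAMS * bpw / 8 / GiB + NON_EXPERT_GB

for bpw in (4.5, 3.0):
    print(f"{bpw} bpw -> ~{model_size_gb(bpw):.0f} GB")
# prints roughly 556 GB at 4.5 bpw and 378 GB at 3.0 bpw,
# in line with the ~555 GB / 378 GB figures above.
```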
Kirin_ll_niriK@reddit
I have 32GB on my rig (upgrading to 64 once I can swing it) and even I am sitting here realizing I will never run this without heavy quantization
_derpiii_@reddit
What a time to be alive
Alternative-Advice40@reddit
local will be great
mrinterweb@reddit
1.1T params was hard to read while drinking my coffee. Nearly did a spit take
Eyelbee@reddit
They should go larger. 4-5T would be great.
thrownawaymane@reddit
You got a full 48U rack or are you just happy to see me?
BallsInSufficientSad@reddit
A small Mac Studio has 512GB RAM - it's doable without the RAM shortage.
Expensive-Paint-9490@reddit
Well, thanks to QAT it is smaller than Qwen3.5-397B-A17B.
john0201@reddit
It just barely will not fit in a 512GB Mac Studio. Annoying.
FullOf_Bad_Ideas@reddit
you can quant Qwen 397B to be usable at around 150 GiB. You can't do that to Kimi K2.6
Service-Kitchen@reddit
Why is that sorry?
Daniel_H212@reddit
K2.6 just has too many more parameters.
Service-Kitchen@reddit
Ah okay, I thought it was something to do with the quantisation format.
FullOf_Bad_Ideas@reddit
Qwen 3.5 397B has more quantization potential; with Kimi it's mostly already exhausted, and it won't quant down 4x from the released version, unlike Qwen 3.5 397B.
Comacdo@reddit
And it's MoE... Imagine the absolute behemoth model we would get from a dense one the same size ? One can dream..
TopChard1274@reddit
Erm, how many people here can afford to run this locally? A couple? One?
h-mo@reddit
the pace at which Chinese labs are releasing open weights right now is genuinely hard to keep up with. not too long ago Kimi K2 felt like news, now there's already a .6. what are the actual capability deltas between these point releases?
jld1532@reddit
Tell me China hasn't won the race. Every organization with enough compute will be running Chinese open weights by the end of 2026. My organization already is and provides freely to all employees via open webui. Soon, most technological advancements worldwide will be completed with support from Chinese rather than American AI.
JuniorDeveloper73@reddit
The USA doesn't even try on open source
jld1532@reddit
Which I think will be ultimately judged as a huge mistake in terms of international influence.
RelationshipLong9092@reddit
That can be said of an awful lot of America's decisions as of late.
cass1o@reddit
To be fair, google just released some good gemma models that are actually small enough to be run by most people.
SnooPaintings8639@reddit
I think USA is trying very hard to go anti-opensource. Not only not to open anything themselves, but to block anyone else from opening and sharing 'the secret sauce'.
JuniorDeveloper73@reddit
It's unbelievable, because open source means more brains on several problems; as a big company you could get TONS of development and ideas for free.
The thing is already out; it's pure nonsense.
RepulsiveRaisin7@reddit
Has it? I use GLM and I feel like Codex and Claude are significantly better optimized for programming. I guess it depends on what you're doing.
jld1532@reddit
And if you have money. The average person underestimates the amount of underutilized compute out there.
ttkciar@reddit
No, but I will tell you that there isn't a race.
Lissanro@reddit
Nice! Given Kimi K2.5 already was my favorite local model, I am looking forward to running K2.6 on my rig! They also kept local-friendly INT4 weight format, which can be practically losslessly converted to Q4_X GGUF.
Fit-Statistician8636@reddit
I was scrolling, somehow expecting / looking for your comment here 😀. I cannot wait for GGUFs to appear, too… But wait - I run K2.5 with SGLang + KTransformers. Did you try this path?
Lissanro@reddit
So far I only got llama.cpp and ik_llama.cpp working. What is your experience with SGLang, does it work well with RAM+VRAM inference?
At some point I tried SGLang but never could get it working. They still have an open bug about K2.5: https://github.com/sgl-project/sglang/issues/20096
If you are using CPU+GPU inference, perhaps you could share the link to the exact quant of K2.5 you are using and full SGLang command that you found to be working? It would help to know a working baseline. Then I may give SGLang another try, it would be interesting to compare with other backends.
Fit-Statistician8636@reddit
It runs well on Blackwell, and it worked well on two GPUs too, with --tensor-parallel-size 2.
Fit-Statistician8636@reddit
So this is my SGL+KT Dockerfile that some AI built for me a few weeks back.
It was trial-and-error iteration until it worked:
```
# syntax=docker/dockerfile:1.7
# SGLang + KTransformers for CUDA 13 on EPYC Turin
ARG BASE_IMAGE=lmsysorg/sglang:dev-cu13
FROM ${BASE_IMAGE}
SHELL ["/bin/bash", "-o", "pipefail", "-c"]
ARG DEBIAN_FRONTEND=noninteractive
ARG KTRANSFORMERS_REF=main
ARG KT_CUDA_ARCHS=120
ARG KT_BUILD_JOBS=64
ARG SGL_KERNEL_CU130_WHL=https://github.com/sgl-project/whl/releases/download/v0.3.21/sgl_kernel-0.3.21%2Bcu130-cp312-abi3-manylinux2014_x86_64.whl
# Build/runtime settings for Turin + Blackwell
ENV CUDA_HOME=/usr/local/cuda \
TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas \
HF_HUB_ENABLE_HF_TRANSFER=1 \
CUDA_MODULE_LOADING=LAZY \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
PIP_DISABLE_PIP_VERSION_CHECK=1 \
PIP_NO_CACHE_DIR=1 \
PIP_ROOT_USER_ACTION=ignore \
CPUINFER_CPU_INSTRUCT=FANCY \
CPUINFER_ENABLE_AMX=OFF \
CPUINFER_ENABLE_AVX512_VNNI=ON \
CPUINFER_ENABLE_AVX512_BF16=ON \
CPUINFER_ENABLE_AVX512_VBMI=ON \
CPUINFER_USE_CUDA=1 \
CPUINFER_CUDA_ARCHS=${KT_CUDA_ARCHS} \
CPUINFER_PARALLEL=${KT_BUILD_JOBS}
# Native build dependencies for kt-kernel
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
cmake \
ninja-build \
git \
git-lfs \
curl \
wget \
ca-certificates \
pkg-config \
python3-dev \
libhwloc-dev \
libnuma-dev \
numactl \
pciutils \
&& rm -rf /var/lib/apt/lists/* \
&& git lfs install --system
WORKDIR /opt
# KTransformers repo and submodules
RUN git clone --recursive https://github.com/kvcache-ai/ktransformers.git \
&& cd /opt/ktransformers \
&& git checkout "${KTRANSFORMERS_REF}" \
&& git submodule update --init --recursive
# Remove base-package metadata first, then install the KTransformers fork
# without dependency resolution so the CUDA 13 stack is not downgraded.
RUN (python3 -m pip uninstall -y sglang sglang-kt kt-kernel sgl-kernel || true) \
&& python3 -m pip install --upgrade pip setuptools wheel packaging \
&& cd /opt/ktransformers/third_party/sglang \
&& python3 -m pip install --no-deps "./python[all]" \
&& python3 -m pip install --no-deps "${SGL_KERNEL_CU130_WHL}" \
&& python3 -m pip install --no-deps decord2
# Build kt-kernel against the prepared CUDA 13 / SM120 environment.
RUN cd /opt/ktransformers/kt-kernel \
&& python3 -m pip install --no-deps --no-build-isolation -v .
WORKDIR /workspace
CMD ["bash"]
```
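The `sglang-kt:cu13` tag used in the run command below assumes the image was built from this Dockerfile first; assuming the file above is saved as `Dockerfile` in the current directory, something along these lines should do it:
```
# build the image with BuildKit (default in modern Docker);
# the tag must match the one referenced in the `docker run` command below
docker build -t sglang-kt:cu13 .
```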
And this is my command to run it:
```
docker run --rm \
--name kimi-k2.5-rawint4-kt-160k-p2 \
--ipc=host \
--cap-add=SYS_NICE \
--runtime nvidia \
--gpus device=GPU-xxx \
-p 8000:8000 \
-v /mnt/hot/hfhub:/root/.cache/huggingface/hub \
-v /mnt/bulk/config/parsers:/opt/parsers:ro \
-e 'NCCL_P2P_LEVEL=PHB' \
-e 'NCCL_MIN_CTAS=8' \
-e 'OMP_NUM_THREADS=8' \
-e 'SAFETENSORS_FAST_GPU=1' \
-e 'PYTORCH_ALLOC_CONF=expandable_segments:True' \
-e 'HF_HUB_OFFLINE=1' \
-e 'SGLANG_ENABLE_JIT_DEEPGEMM=0' \
-e 'SGLANG_ENABLE_DEEP_GEMM=0' \
sglang-kt:cu13 \
python -m sglang.launch_server \
--model-path /root/.cache/huggingface/hub/models--moonshotai--Kimi-K2.5/snapshots/54383e83fa343a1331754112fb9e3410c55efa2f \
--kt-weight-path /root/.cache/huggingface/hub/models--moonshotai--Kimi-K2.5/snapshots/54383e83fa343a1331754112fb9e3410c55efa2f \
--served-model-name kimi-k2.5-rawint4-kt-160k-p2 \
--host 0.0.0.0 \
--port 8000 \
--mem-fraction-static 0.94 \
--trust-remote-code \
--context-length 163840 \
--max-running-requests 2 \
--prefill-max-requests 2 \
--max-total-tokens 327680 \
--kt-cpuinfer 24 \
--kt-threadpool-count 1 \
--kt-num-gpu-experts 2 \
--kt-method RAWINT4 \
--kt-gpu-prefill-token-threshold 512 \
--kt-max-deferred-experts-per-token 1 \
--kt-enable-dynamic-expert-update \
--enable-mixed-chunk \
--tensor-parallel-size 1 \
--disable-shared-experts-fusion \
--disable-custom-all-reduce \
--chunked-prefill-size 16384 \
--attention-backend flashinfer \
--reasoning-parser kimi_k2 \
--tool-call-parser kimi_k2 \
--sampling-defaults model
```
It runs well on Blackwell, and it worked well on two GPUs too, with `--tensor-parallel-size 2`.
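Once the server is up it should answer on the standard OpenAI-compatible chat endpoint; a quick smoke test, with the port and served model name taken from the flags above (the prompt itself is just a placeholder):
```
# minimal smoke test against the SGLang server launched above
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "kimi-k2.5-rawint4-kt-160k-p2",
        "messages": [{"role": "user", "content": "Say hi in one sentence."}],
        "max_tokens": 64
      }'
```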
Fit-Statistician8636@reddit
Yes, just reading this: “Kimi-K2.6 has the same architecture as Kimi-K2.5, and the deployment method can be directly reused.”
So, no quant needed - SGLang + KTransformers should be able to use the native .safetensors model. Yes, I have a great experience with Kimi+SGL+KT, and with SGL in general (using voipmonitor's fork to run MiniMax-M2.7 from VRAM). It is not without issues, but neither is llama/ik.
I'll get out of the bath, run "hf pull", and post my "recipe" for K2.5 in 10 minutes 😀.
MuzafferMahi@reddit
My god, what kind of a rig have you got?
Lissanro@reddit
I have shared details about my rig here, and here I shared my performance for various models.
valtor2@reddit
Is it because of your massive RAM? I wouldn't have expected to be able to run 1T params on 96GB VRAM.
How much did it cost you? You were before the RAM-pocalypse but very much into the GPU-pocalypse I bet :)
Lissanro@reddit
I built my rig gradually over the years, starting with buying GPUs one by one, then PSUs, and at the beginning of last year I migrated to the EPYC platform with 8-channel 1 TB DDR4 3200 MHz RAM (the server memory cost me approximately $1600 in total)... so yes, I got lucky enough to upgrade before RAM prices went insane.
PrysmX@reddit
You can offload layers to regular RAM. The entire model doesn't need to be in VRAM with GGUF. So if your total VRAM+RAM can hold the weights, you should be able to run the model (albeit slower than if it was all in unified high-bandwidth RAM).
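A rough sketch of what that looks like in practice with llama.cpp (the GGUF filename is a placeholder, and `--override-tensor` needs a reasonably recent build):
```
# keep the dense/attention weights on the GPU, push the huge MoE expert
# tensors to system RAM; context size and port are just examples
./llama-server \
  --model ./kimi-k2.6-q4.gguf \
  --n-gpu-layers 99 \
  --override-tensor ".ffn_.*_exps.=CPU" \
  --ctx-size 32768 \
  --port 8080
```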
pmttyji@reddit
What other models came with similar weight format? I remember that GPT-OSS came in MXFP4 & Gemma3 came in QAT.
_yustaguy_@reddit
Are you the crown prince of Saudi Arabia by chance?
ForsookComparison@reddit
What gives me pause about these benchmarks even more than seeing GPT 5.4 and Kimi beating Opus 4.7 in coding scenarios (something I also doubt) is seeing Gemini 3.1 Pro winning in things like Terminal Bench. I cannot for the life of me get that model to be competitive in what that benchmark claims to cover, yet it's number 1?
Bakoro@reddit
Gemini is perhaps the weirdest, most inconsistent model.
The only thing that I can really think is that they have a lot more knobs they turn dynamically based on the current load. And sometimes it just destroys existing work.
Sometimes I get super-genius Gemini who does a full load of work up front, and sometimes I get the absolutely minimal effort model.
Gemini will literally add stub placeholders. One of the things I hate the most is how it will make notes about how "in a real project, we would do xyz, but we'll just put this stub for now". It's so hard to get the model to take things seriously and not treat it as a trivial exercise.
When it's good, it's very good. When it's bad, it's among the worst.
When it reaches its context limit, it falls apart the hardest.
cant-find-user-name@reddit
My experience is pretty much the same. Gemini is a genius some times and a dumbass many more times.
AdOne8437@reddit
Total Parameters: 1T, Activated Parameters: 32B
Hmmm, ok, I think I will sit this one out :)
muyuu@reddit
how many 24GB RTX3090s to run this one?
_supert_@reddit
To preserve thinking or not preserve thinking?
TheRealMasonMac@reddit
K2.5 preserves thinking by default IIRC.
TopChard1274@reddit
1.1... whoah 😮
SnooPaintings8639@reddit
Is it a new quant? Even smaller than 1.58 bit!? Whoah indeed! /s
Long_comment_san@reddit
HERE WE GOOOOOOOOOOO
DigiDecode_@reddit
Indeed, I think this might be the first time open weights have been at SOTA level since the release of GPT-4, and that was March 2023. Dare I say they're not even 6 months behind, and there's no moat for closed weights.
Perfect-Flounder7856@reddit
SoTA?
bakawolf123@reddit
well, open source has caught up to proprietary models
now we only need hardware to catch up so we can actually run them =)
korino11@reddit
The size is much less than the 2.5 version! That is veeery good. We local users have hope :)
CrawlUpAndDie@reddit
Good news
Extra-Organization-6@reddit
if this runs well on ollama that's going to be interesting for self-hosted inference. the MoE architecture should keep memory usage reasonable even at this scale. curious what the actual VRAM requirements look like with different quants.
Awwtifishal@reddit
You need at the bare minimum 32 GB VRAM, like 700 GB of the fastest RAM you can get, and a motherboard with 4 channels... to run it slowly but at usable speed with ik_llama at Q4.
relmny@reddit
I (very occasionally) run k2 or k2.5 IQ2 on 32gb VRAM + 128 gb RAM + ssd at 1.7 t/s (not everyone codes)
Cory123125@reddit
To do literally what?
relmny@reddit
Chat.
I only use local models.
When the small/medium ones won't do, then I pull the big guns (deepseek, kimi, glm) and... wait.
I use them at least once every week.
Cory123125@reddit
.... This is even more perplexing. That output rate sounds entirely too slow to be useful.
relmny@reddit
I'm fine with it. I just ask some things, do other stuff, come back and get my answer. I try to use "instruct" with these models, but sometimes, when I really need it, I even run them as "thinking".
Ell2509@reddit
Shame you are getting downvoted. Honestly though, 1.1t is not runnable for any amateur.
ttkciar@reddit
... yet!
Extra-Organization-6@reddit
Part of the game I guess. I have the best recipe in town for blueberry muffins, check recent comments lol
Similar-Republic149@reddit
Give me a recipe for blueberry muffins
Extra-Organization-6@reddit
YOUR MOM
Yu2sama@reddit
What's the point of these bots I wonder?
ResidentPositive4122@reddit
At least this bot is funny
TheItalianDonkey@reddit
in what world is that reasonable? :-D
No_Mango7658@reddit
I call bs... Those tests cannot be accurate
Healthy-Nebula-3603@reddit
That model is 1.1T .....
No_Mango7658@reddit
And Opus is estimated at over 5T.
What I'm getting at is this feels like a benchmax release... I'll be curious to test its actual capabilities.
Caffdy@reddit
can you source that?
No_Mango7658@reddit
I have no source, just people on the internet guessing.
Elon claimed at one point Opus is 5T, but I don't know if he really knows either.
my_name_isnt_clever@reddit
Elon is less reliable than any internet random
Cory123125@reddit
If anything, I now believe Opus is smaller
Healthy-Nebula-3603@reddit
I assume Bijam (YouTube) will test that soon, if he hasn't already.
So we can compare performance.
No_Mango7658@reddit
I can't stand his useless videos. Waste of disk space
Healthy-Nebula-3603@reddit
In that case, test it yourself :)
I like to see different details of the same tests.
I only miss complex agentic tasks from him.
TurnUpThe4D3D3D3@reddit
It could be benchmaxxed, but since it’s the Kimi team I think it’s legit. Their last model was a breakthrough for real world performance so I would not doubt them.
No_Mango7658@reddit
Dude, if this is real this is the end of Anthropic... I will go into so much debt for a pair of M3 Ultras to run this.
Healthy-Nebula-3603@reddit
Haha
Healthy-Nebula-3603@reddit
Oh wow 1.1T model size!
Give me a few minutes I will test that on my local computer!
Cory123125@reddit
Man those 2 tokens you get today will be the smartest tokens you've ever seen
Miserable_Ad7246@reddit
That token will come eventually.
Spirited_Neck1858@reddit
haha eventually
srigi@reddit
Apple Watch sized model
FlamaVadim@reddit
🤣
Fringolicious@reddit
Very exciting. 6 months and we'll have this performance at 1/10th the size, presumably. Good to see open weights giving the closed labs some serious competition!
Due_Net_3342@reddit
cheering with 144GB :(
pmttyji@reddit
License: modified-mit
"OI MiniMax-M2.7"
FyreKZ@reddit
Moonshot has used modified MIT since K2, nothing new.
pmttyji@reddit
I see. I never noticed the previous versions' licenses, since those models are too large for my GPU (I thought they followed MiniMax's route). I tried Kimi-Linear, which is MIT only.
Furacao__Boey@reddit
BF16 is 595 GB; Q4 could be runnable on a single 96 GB VRAM card + RAM, maybe?
Different_Fix_2217@reddit
It's already 4-bit. That is not BF16.
FullOf_Bad_Ideas@reddit
The 595 GB is quantized already; they publish a model that has mixed precision, but the majority of weights are in INT4.
If you have 512 GB of RAM, yeah.
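Back-of-envelope, that size roughly checks out: ~1T parameters with most weights at 4 bits is about 500 GB before the higher-precision tensors and quant scales (illustrative math only, not an exact accounting):
```
# ~1e12 params * 4 bits / 8 bits-per-byte ≈ 500 GB for the INT4 majority;
# the mixed-precision leftovers and per-tensor scales push it toward ~595 GB
echo "$(( 1000 * 4 / 8 )) GB"   # params counted in billions -> prints "500 GB"
```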
Sticking_to_Decaf@reddit
That would be awesome. Especially if NVFP4 could fit with some decent context
david_0_0@reddit
Any early numbers on what spec you need to run this locally at a reasonable quant? The K2 lineage has been creeping up in size; curious if 2.6 still fits on dual 24 GB cards or if it's workstation-class cards now.
Jackw78@reddit
Just need half a terabyte of VRAM now...
cr0wburn@reddit
Big kiss to the Chinese model makers who make it Christmas almost every day!
Intrepid_Travel_3274@reddit
Did somebody test it in real scenarios? Backend architecture and frontend design? Is it really better than / equal to Opus 4.6?
smile132465798@reddit
So Kimi 2.6, Qwen Max, and DeepSeek V4 this week?
Saltwater_Fish@reddit
Best open source for sure
panchovix@reddit
If I had 48GB more VRAM/RAM I could run this at 4 bit :(
WhyLifeIs4@reddit
It's a good model sir
philguyaz@reddit
This is amazing considering they and DeepSeek are the last bastions of the vision + text open-source models my business relies on.
Specter_Origin@reddit
Did we find Cursor's CEO's account?
pseudoreddituser@reddit
Twitter announcement: https://x.com/Kimi_Moonshot/status/2046249571882500354?s=20 Blog: https://www.kimi.com/blog/kimi-k2-6
FoxiPanda@reddit
This thing is a chonk.. rip to my hardware trying to run some IQ2 quant of this at 4tok/s
Weak_Engine_8501@reddit
Bonsai kimi 2.6 when?
Exciting-Engine882@reddit
whoa whoa, can't wait to test it
ZeusZCC@reddit
Niceeee