Yes, just reading this: “Kimi-K2.6 has the same architecture as Kimi-K2.5, and the deployment method can be directly reused.”
So, no quant needed - SGLang + KTransformers should be able to use the native .safetensors model. Yes, I have had a great experience with Kimi+SGL+KT, and with SGL in general (using voipmonitor's fork to run MiniMax-M2.7 from VRAM). It is not without issues, but llama/ik isn't either.
I’ll get out of the bath, run “hf pull”, and post my “recipe” for K2.5 in 10 minutes 😀.
I built my rig gradually over the years, starting with buying GPUs one by one, then PSUs, and at the beginning of the previous year I migrated to the EPYC platform with 8-channel 1 TB DDR4 3200 MHz RAM (the server memory cost me approximately $1600 in total)... so yes, I got lucky enough to upgrade before RAM prices went insane.
You can offload layers to regular RAM. The entire model doesn't need to be in VRAM with GGUF. So if your total VRAM+RAM can hold the weights, you should be able to run the model (albeit slower than if it was all in unified high-bandwidth RAM).
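For a concrete picture of what that offloading looks like, here's a minimal llama-cpp-python sketch - the GGUF filename and layer count are placeholders you'd tune to your own VRAM, not a recipe for this specific model:

```python
# Sketch: partial GPU offload of a GGUF model with llama-cpp-python.
# Only n_gpu_layers layers are placed in VRAM; the remaining weights stay in system RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="some-model-q4.gguf",  # placeholder: whatever quant actually fits your disk/RAM
    n_gpu_layers=12,                  # raise until VRAM is nearly full; lower it if you OOM
    n_ctx=8192,                       # the KV cache also eats memory, so keep context modest at first
)
print(llm("Hello", max_tokens=16)["choices"][0]["text"])
```

llama.cpp's llama-server has the equivalent GPU-layers knob; the trade-off either way is that whatever stays in system RAM runs at RAM bandwidth, hence the "slower" caveat.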
What gives me pause about these benchmarks even more than seeing GPT 5.4 and Kimi beating Opus 4.7 in coding scenarios (something I also doubt) is seeing Gemini 3.1 Pro winning in things like Terminal Bench. I cannot for the life of me get that model to be competitive in what that benchmark claims to cover, yet it's number 1?
Gemini is perhaps the weirdest, most inconsistent model.
The only thing that I can really think, is that they have a lot more knobs they turn dynamically, based on the current load.
Sometimes I get super-genius Gemini who does a full load of work up front, and sometimes I get the absolutely minimal effort model.
Gemini will literally add placeholder stubs and destroy existing work.
One of the things I hate the most is how it will make notes about how "in a real project, we would do xyz, but we'll just put this stub for now". It's so hard to get the model to take things seriously and not as a trivial exercise.
When it's good, it's very good. When it's bad, it's among the worst.
When it reaches its context limit, it falls apart the hardest.
indeed, I think this might be the first time open weights have been at SOTA level since the release of GPT-4, and that was March 2023. Also, dare I say, not 6 months behind, and no moat for closed weights.
if this runs well on ollama that's going to be interesting for self-hosted inference. the MoE architecture should keep memory usage reasonable even at this scale. curious what the actual VRAM requirements look like with different quants.
You need, at the bare minimum, 32 GB of VRAM, like 700 GB of the fastest RAM you can get, and a motherboard with 4 channels... to run it slowly but at usable speed with ik_llama at Q4.
I'm fine with it. Just ask some things, do other stuff, come back and get my answer. I try to use "instruct" with these models, but sometimes, when I really need it, I even run them as "thinking".
It could be benchmaxxed, but since it’s the Kimi team I think it’s legit. Their last model was a breakthrough for real world performance so I would not doubt them.
Very exciting. 6 months and we'll have this performance at 1/10th the size presumably, good to see open weights giving the closed labs some serious competition!
I see. I never noticed the previous versions, as those are too large for my GPU (I thought they followed MiniMax's route). I tried Kimi-Linear, which is MIT only.
any early numbers on what spec you need to run this locally at a reasonable quant? the K2 lineage has been creeping up in size; curious if 2.6 still fits on dual 24 GB or if it's workstation-class cards now
LagOps91@reddit
Pretty damned impressive assuming the benchmarks translate into real-world performance
anedisi@reddit
I was part of the beta; there were times when I forgot I was on Kimi and not GPT 5.4. Opus is still the king in most workflows honestly, but this thing is coming on strong.
Theio666@reddit
I can't take this seriously unless you're mainly working on frontend things. Outside of frontend, GPT 5.4 and 5.3-codex are just miles ahead of Opus.
ProfessionalJackals@reddit
It's the reverse in my experience. There were so many times when GPT 5.4 ended in prompt hell because it was unable to fix something, while the exact same initial prompt given to Opus 4.6... fixed it.
GPT 5.4 is an excellent model, but it's often way too narrow in its focus.
Theio666@reddit
This might be a prompting issue; models behave a bit differently, and you might be used to a different type of prompting, or your AGENTS.md might not be great for GPT, etc. For me, GPT does exactly what I need when I ask it to. The only reason I'm using other models is pricing.
autoencoder@reddit
"you're holding it wrong"
No. It should understand English.
Theio666@reddit
You're overestimating the ability of a random/average person to convey their thoughts in natural language. I'm not joking: developed reading comprehension and "explaining what you want" skills are way rarer than you might think. It's quite common to see someone give a vague instruction and then, after the model interprets it wrongly, decide it's the model's fault.
autoencoder@reddit
So you're saying ChatGPT is bad at interpreting vague instructions? Why wouldn't other models be?
Theio666@reddit
The thing is, when models "interpret vague instructions", they do so according to how they see fit, which might not be an optimal solution. In general, this results in a lot of tech debt over time, since you stack up a lot of randomly interpreted instructions. I prefer models that fail loudly when they're missing some info over models that silently interpret something in a way that makes things fail in the long run.
It all depends on the seriousness of the dev work; if you just vibe code a small app this doesn't really matter, and getting to the point where it does matter takes some time.
GPT 5.4 is really persistent in following instructions. If you have a correct AGENTS.md with info on how and what to test, give some sort of acceptance criteria for hard tasks, and talk to it a bit to have a plan beforehand (you don't even have to use plan mode, the model is great in free-talk mode), then the model pretty much one-shots tasks of any difficulty, frontend aside.
autoencoder@reddit
I guess eagerness is a matter of taste. I don't use agentic coding, and I always say "Be brief" in the system prompt, so I actually prefer it doing as little as possible. I'm more hands-on that way. I do notice I'm "lagging" compared to other peers, but in my personal work, code quality and maintainability matter much more.
For my use case, plenty of models are good enough nowadays. I even use local ones running on CPU from time to time, esp. when some cloud service is down.
More-Curious816@reddit
I think what he wants to say is that most people give only vague instructions, thinking that's enough, and he's right. Most people struggle to convey their mental images precisely. LLMs can't read your mind; your mental image stays locked inside unless you supply every detail for others to reconstruct it.
YRUTROLLINGURSELF@reddit
No, that's not what they're saying; they're just being overly polite. The implication was that you may be the one with the problem understanding English, and if your follow-up response is any indication, they may in fact be onto something.
Waste-Peak-1213@reddit
For me, 12 years in tech, GPT is much more precise; I feel like I opened another tech agency with grunts doing the stuff I want. Opus is not precise enough or gets convoluted. Prompting matters a lot, obviously. Simpsons "duh".
Zeeplankton@reddit
GPT 5.4 is insane at backend but it's definitely a smaller model.. helps to check output with opus.. but per parameter 5.4 is a monster
Theio666@reddit
It doesn't feel like a smaller model at all. Maybe it depends on the case, but for my main repo at work - an agentic harness app with microservices, EDA for communication between services, each repo with its own env - Opus is not quite able to do repo-wide edits that need to touch 2-3 services, while GPT easily does pretty much anything, given I provide a correct design doc. My programming in the last month has fully shifted to writing design docs, checking/reviewing code, and setting up debugging sessions; there's literally no need to write code by hand with 5.4 at this point.
Unusual-Candidate-43@reddit
Pretty much the same experience: GPT 5.4 rarely makes mistakes, especially for backend engineering, while Claude makes mistakes and GPT 5.4 easily fixes them.
squired@reddit
Similar experience here. Claude Code showed the way; Codex made it truly work. It was the first workflow that freed up my attention enough to run parallel agents and trust that the validation tests actually passed.
Kappalonia@reddit
Huh? I work in data science and Opus 4.5 wiped the floor with GPT.
Theio666@reddit
Exactly the opposite experience in ML/DS for me. I remember back in the Sonnet 4 days how everyone was glazing the model, and it implemented RoPE in a custom transformer without caching, while I explicitly asked for caching and even provided the caching code from the official torchtune library. It just put the cache function inside the forward pass without making it remember previous calculations lol. o3 did that easily btw. Since then I try to avoid Anthropic models. I used Opus 4.5 for a while when I was working on an LLM proxy app, and to this day I'm still fixing weird bugs it left here and there with GPT. I spin up Opus only when I need some frontend fix, and that's not often, since Kimi now handles most of my frontend needs.
That's not even getting into how compaction in Codex is on another level compared to any other implementation, thanks to all the magic they do on their endpoint side. I wish we had at least something similar for other models :(
I'd say that, in general, GPT does its job better the bigger/harder the task is. I don't know how, but a lot of people have observed the same thing: the model stays super coherent on long runs and is good at state recovery.
squired@reddit
You're spot on when it comes to the compaction. It boggles the mind how well it works. They nailed that bit.
Theio666@reddit
There's a blog post by OpenAI on that. From what I understood, they create a compact vector representation of the conversation - I don't remember the details, but basically embeddings of the chat - which they then let the model check, or just always append to the chat. I don't think anyone else is doing something like this; usually compaction is some sort of summarization, and you can only preserve limited info that way. Non-token-based embeddings should allow much better compaction.
But I might be wrong ofc; I don't remember the exact details, and they didn't share the code/exact logic anyway.
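To make the contrast with summarization concrete, here's a toy sketch of what embedding-based compaction could look like - purely my own reconstruction of the idea, not OpenAI's actual implementation; the embed() function is a hashed bag-of-words stand-in for whatever learned encoder a real system would use:

```python
# Toy sketch: keep recent turns verbatim, and recall older turns by vector
# similarity instead of lossy text summarization.
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Stand-in embedding (hashed bag-of-words); a real system would use a learned encoder."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

class CompactedHistory:
    def __init__(self, keep_last: int = 4, recall_k: int = 2):
        self.turns = []             # full text of every turn
        self.vecs = []              # one embedding per turn
        self.keep_last = keep_last  # newest turns kept verbatim in context
        self.recall_k = recall_k    # older turns recalled by similarity

    def add(self, turn: str) -> None:
        self.turns.append(turn)
        self.vecs.append(embed(turn))

    def context_for(self, query: str) -> list:
        recent = self.turns[-self.keep_last:]
        older = self.turns[:-self.keep_last]
        if not older:
            return recent
        q = embed(query)
        sims = [float(v @ q) for v in self.vecs[:len(older)]]
        best = sorted(range(len(older)), key=lambda i: sims[i], reverse=True)[:self.recall_k]
        return [older[i] for i in sorted(best)] + recent
```

The point is that old turns survive as vectors rather than as a lossy text summary, so the harness can pull back whatever is actually relevant to the current request.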
squired@reddit
That's fascinating and I'll explore further. Thank you!
VicemanPro@reddit
You should all know that performance caps are a thing. You're not all getting the same model, every time.
squired@reddit
Fully agreed. People need to quit with the fanboi stuff. I'm a graybeard dev and I hop to and fro every few months. Codex w/ ChatGPT-5.4 Extended/Medium is quite a bit ahead of Claude Code w/ Opus 4.6/7. You're right though, Anthropic is def better at frontend aesthetics; not enough for me to maintain a Max sub atm though. If they release a baby Mythos next month? I'll hop right back!
anedisi@reddit
Yesterday I had Opus solve an iPad app bug that no other model could solve, and I have access to GLM, Kimi, and GPT.
BihariBabua@reddit
The other day, Opus couldn't get a grid layout right. Sad state of affairs.
Spirited_Neck1858@reddit
how about k2.6 vs sonnet 4.6?
lemon07r@reddit
It doesn't, but it's still quite good. Just still nowhere near Opus or GPT 5.4. I've been running a lot of A/B testing with K2.6 over the last couple of days against Opus 4.6 and GPT 5.4 (and Opus 4.7). Opus 4.7 was dead last lol, so at least we can say K2.6 is better than Opus 4.7. I only tested reviews and audits, to see which model caught the most valid bugs and had the fewest hallucinations/false positives. At least Kimi looks pretty in evals now? But it isn't actually much better than K2.5. I do think it's still currently the best open-weight model.
IrisColt@reddit
mother of God...
ResidentPositive4122@reddit
See, minimax, this is a proper modified MIT. Still MIT core (i.e. do whatever you want) just with an attribution if you're a large corp. That's it.
Macmill_340@reddit
Why not just use apache if it's about attribution?
Dudeonyx@reddit
Why is that so important to you?
Genuinely asking.
ResidentPositive4122@reddit
Because calling something "modified MIT" is a farce when the entire thing is the antithesis of MIT. I have no issue with them releasing a model NC. That's up to them. But in that case they should just say so, use a proper license, and be done with it.
EveningIncrease7579@reddit
clouder300@reddit
carnist shit
thrownawaymane@reddit
I am in this photo and I don't like it
Admirable_Market2759@reddit
You use a Xeon? How does it work?
I bought an AliExpress motherboard but it was dead on arrival lol
Haven’t tried again.
po_stulate@reddit
I'm the 512 GB DDR4 RAM, I can confirm
panchovix@reddit
How many t/s do you get on that setup? I think that one can run 4-bit IIRC on lcpp lol.
seamonn@reddit
You mean s/t
MoonLightSunDark@reddit
Thanks for the laugh lmao
the-username-is-here@reddit
Yes, we can!
WoodCreakSeagull@reddit
Thanks, Ollama
KeikakuAccelerator@reddit
Fk, this got me lmao.
ShengrenR@reddit
I laughed louder than I should have.. bit of a cackle if I'm honest.
Cool-Chemical-5629@reddit
Now that's the kind of pun idea I like.
jatjatjat@reddit
You won the internet today.
Cold_Tree190@reddit
Lmao fr
TheItalianDonkey@reddit
whelp, this made me audibly laugh.
RedParaglider@reddit
I'm loving seeing more on-topic memes in this sub and fewer slop comments 😘
silenceimpaired@reddit
Sigh. It appears the rumours of a smaller Kimi were just rumours.
Bakoro@reddit
I don't see the point, unless Kimi has some special feature you want that no one else is offering.
For smaller LLMs, we've got like a dozen other offerings, there should be at least one real frontier sized behemoth model.
silenceimpaired@reddit
Kimi had many singing its praises for creative writing. I'm hoping to see something competitive with Qwen and Gemma ~30B dense.
dtdisapointingresult@reddit
Are you...are you saying you think Qwen is competitive for creative writing? Qwen is one of the worst writing models there is. I'm shocked you mention it in the same breath as Gemma 31B.
silenceimpaired@reddit
Don't be so antagonistic. No, I don't think it's great at writing, but if you look at Kimi Linear, it was worse because of its size and training.
dtdisapointingresult@reddit
I'm just antagonistic towards Qwen's writing, haha. The only worse writing model is GPT-OSS.
Zeeplankton@reddit
I can imagine someone doing some frankenstein distill model with gemma
oxygen_addiction@reddit
With them pivoting to an orchestrator (big) + hundreds of subagents (small) model, it'd make sense.
OcelotOk8071@reddit
that's not a foregone conclusion.
silenceimpaired@reddit
How I hope you are right.
onewheeldoin200@reddit
As someone deep in the throes of GPU poverty...what does the hardware look like that is capable of running this? 8-10 RTX6000 pros? Something even nuttier?
Fit-Statistician8636@reddit
You can run it with a single RTX 5090 and a lot of RAM. Not cheap by any means, but not as expensive as 8x RTX PRO 6000.
700 t/s PP and 20 t/s TG is not quite there for coding, but chat is fine:
https://huggingface.co/ubergarm/Kimi-K2.6-GGUF/discussions/3
arcanemachined@reddit
You don't. You torment yourself trying to get a couple of tokens per second, or you give up and use OpenRouter.
HopePupal@reddit
fully in VRAM? you're basically right. it's a 594 GB model on disk. it was built to run on a single last-gen datacenter GPU server. the rack-mount ones take 8 SXM or OAM GPU bricks. you see fully populated 8× H100 80 GB servers on eBay once in a while for $200k, but those won't quite do it because you won't have any room for context, so figure on needing the even more expensive H100 96 GB. or their rough Intel or AMD equivalents, the Gaudi 2 or the MI250. somewhere between "Lamborghini" and "house"?
you can sorta tape one together out of RTX PRO 6000s and a Xeon or EPYC motherboard with enough PCIe lanes, sure. but we're talking, like, minimum $100k just for the cards.
there's already a quant attempting to shrink it enough to fit in a 512 GB unified memory Mac Studio, which is going to be a lot slower but probably still useful if they manage to further quantize the already INT4 QAT model without lobotomizing it too much
tyrantwargodnamedbob@reddit
Check the top comments, some guy's rig is like a meter long and he's trying out K2.6 today I believe
LegacyRemaster@reddit
Ok... I have to buy another 3 RTX 6000..
Fit-Statistician8636@reddit
That’s a great state to be in, actually :)
ProfessionalJackals@reddit
Heuu, why? It's a MoE model, no? You just need a ton of system RAM...
Worried_Drama151@reddit
https://x.com/bridgemindai/status/2046313533743468993/video/1?s=46 too many Kimi paid shills polluting the sub.. insane - at least Qwen and Gemma don’t buy plaudits
oxygen_addiction@reddit
8xRTX6000 needed to run this with decent context, right?
Damn. Claude/Codex etc. must be a bit bigger than this and GLM5.1
Expensive-Paint-9490@reddit
With 8 Pro 6000s you have 768GB VRAM and the model is 595GB... in the remaining 173GB you can fit a fuckton of context.
panchovix@reddit
With 8 at least you can run TP 8 on vLLM. With 6 or 7 you can run it but using PP.
Caffdy@reddit
For those who don't know:
TP = Tensor Parallelism
PP = Pipeline Parallelism
Tensor parallelism divides each layer into vertical slices across the GPUs; you need 2, 4, or 8 GPUs for it to work. Pipeline parallelism splits the model horizontally, distributing whole layers among the cards, so the number of GPUs isn't a problem even if it's odd.
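For concreteness, this is roughly how the two map onto vLLM's offline Python API - just a sketch: the repo name and GPU counts are placeholders, and I'm assuming the LLM constructor accepts pipeline_parallel_size the same way the serve CLI does:

```python
# Sketch: tensor parallelism vs pipeline parallelism in vLLM.
from vllm import LLM

# Tensor parallelism: every layer is sliced across all GPUs, which is why the
# card count wants to be 2/4/8 (it has to divide the attention heads evenly).
llm_tp = LLM(model="moonshotai/Kimi-K2.6", tensor_parallel_size=8)

# Pipeline parallelism: whole layers are assigned to GPUs in sequence,
# so an odd card count (say, 7 RTX PRO 6000s) works fine.
llm_pp = LLM(
    model="moonshotai/Kimi-K2.6",   # placeholder repo name
    tensor_parallel_size=1,
    pipeline_parallel_size=7,
)
```

The practical difference is that TP needs fast all-reduce links on every layer, while PP only passes activations between neighboring stages, which is kinder to PCIe-only rigs.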
PrysmX@reddit
True, but LOL at your electric bill. You aren't saving any money going this route and the only reason to do so would be privacy concerns. (and this is coming from someone that is very much a proponent of local AI - I know the electric bill pain firsthand haha)
tspwd@reddit
Wow, so much? Once you’ve paid off the local hardware, how would you compare, let’s say, a Claude Max subscription to running your own big model (with your own electricity)?
PrysmX@reddit
My electric bill is $700/mo. When you get into serious local hardware and perpetual use, the costs add up fast. The break-even point isn't as close as you think it is if you are running local models that actually compete with the frontier cloud models. It's years out.
tspwd@reddit
Oh, wow! That’s a lot! Are you running coding agents 24/7?
PrysmX@reddit
Various agentic workflows, some coding and some other things, as well as image and video generation.
tspwd@reddit
Seems like you are making good use of your hardware :)
ProfessionalJackals@reddit
Lots of solar panels? I mean, this is probably the best use case for tons of panels if you work from home.
Normally you're wasting a ton of solar energy to the grid (where you get cents on the kWh for it, or nothing, or need to pay to dump to the grid).
An extra 5k or 10k battery - they're barely 100 bucks per kWh these days. I think your electricity bill is paid back fast with local LLMs + solar...
PrysmX@reddit
I'm in a situation where I'm required to rent right now (moved here and will need to move again in the next year), so solar is not an option right now. Definitely in the plans when I hit my final location, though! 👍
ProfessionalJackals@reddit
Do you have a balcony?
Solar balconies are extremely popular here in Germany. You can get dual 500 W panels (limited to 800 W output) plus a microinverter for like 250 euros. The balcony/stackable batteries are more expensive (than rack ones) but still come in around 180 euros per kWh.
And you can take it with you to new locations.
oxygen_addiction@reddit
You'd be doing a lot of batching to get your money's worth. So you'd need a lot of free VRAM.
But fair point, it might be that 7x6000RTX is enough :).
Few_Painter_5588@reddit
In other news, apparently Cursor's Composer 2.1 model has started training
rebelSun25@reddit
We're about to see at least two videos from Theo about why it's actually a good thing
Mission_Biscotti3962@reddit
Nobody should watch videos from Theo
Glad-Ad6295@reddit
why? aint he a chill guy who has some anger issues
ayylmaonade@reddit
I don't like him because he's got one of the biggest, most annoying egos I've ever seen - especially for a guy who wrote a mediocre LLM wrapper that charges you $8/m just to hook you up to OpenRouter with a web search feature. Open WebUI has had more features for like a year+ at this point. Then he went off to create his own coding harness and explicitly disallowed the use of local models, because anybody who claims to use them for actual work is "lying". The guy is just... not at all as important or as smart as he thinks he is.
Mission_Biscotti3962@reddit
This. He has achieved very little. He milks his Twitch employment and depends on impressionable people being impressed by him having loud opinions. The confidence with which he communicates gives insecure/unknowledgeable people the impression that he "must know what he's talking about, given how loud and confident he is".
It's all bullshit.
Glad-Ad6295@reddit
ah, I haven't used any of his paid tools so I can't speak to the $8/m thing. I mainly just watch his content, which is usually pretty solid even if he has super strong (and loud) opinions. I can definitely see how that ego is annoying if you're actively interacting with his products though, thank you for replying
Marcuss2@reddit
50% of what he says is true, 50% is total garbage.
Problem is, he stands behind that 50% of garbage, even when called out.
hellomistershifty@reddit
i don't even know who that is, but someone with anger issues doesn't sound very chill
Darkoplax@reddit
Why is it not a good thing again ?
Darkoplax@reddit
Hope they do it fast, Composer 2 has been great so far
emprahsFury@reddit
if this sub isn't shitting on someone somewhere then the sky is falling and Hell has frozen over. llama.cpp released under MIT: "We fucking hate Ollama." Kimi released under modified MIT: "We fucking hate Cursor."
No_Conversation9561@reddit
u/ezyz Could we get MLX 2.8bit like Kimi-K2.5?
ezyz@reddit
Quant trials are still running! I just started uploading a 3.6 bpw on the quality frontier: https://huggingface.co/spicyneuron/Kimi-K2.6-MLX-3.6bit
This one pairs nicely with Qwen 3.6 35B on a 512GB Mac Studio.
Still searching for a good sub-3 quant, but the KL divergence seems to jump pretty dramatically on this model.
No_Conversation9561@reddit
What’s the size of this quant?
ezyz@reddit
459 GB total
Dany0@reddit
Boys and gals, we have Opus 4.7 at home
Kappalonia@reddit
Edit to 4.6, nobody wants 4.7 lol
Dany0@reddit
Turn off 1M mode and it'll be less arse
nmkd@reddit
Where can one do that?
Dany0@reddit
Use the env variable. But for opus 4.7 it's just a client side thing where it compacts earlier. You may also just shorten your claude.mds to <300 lines, not use skills and compact smartly and you'll get most of the benefits with none of the downsides
CornerLimits@reddit
I don't have enough square meters to host this, unfortunately :(
Dany0@reddit
Can't wait to run it at 1-5tok/s from ssd
DR4G0NH3ART@reddit
2 seconds/token take it or leave it /s
tazztone@reddit
token/sarcasm
ZeusZCC@reddit
2 token per day
groosha@reddit
Keeps vibecoder away
Dany0@reddit
Yes. Good and based. Keep away, shoo
DR4G0NH3ART@reddit
Lol
darkpigvirus@reddit
waiter: sorry sir but that's for ram 😞. you need to divide by 100 if you run it in SSD 🤣
ProfessionalJackals@reddit
100 SSDs in parallel, striped?
Thomas-Lore@reddit
More like 1-5s/tok.
FaceDeer@reddit
I think there's still a very important role for open-weight models that are as powerful as SOTA but too big to run on a conventional home computer: they serve as a mechanism to keep the big API providers "honest." If the big model APIs get too costly or throttle thinking too much or whatever, there will be providers offering these open-weight models to compete with them.
Bakoro@reddit
There are enough businesses and individuals who can totally afford to run a 1.1 trillion parameter model that it keeps pressure on the whole frontier industry not to get too crazy.
The company I work for doesn't quite hand out compute like candy, but I needed RAM, and they dropped several hundred GB in my lap without blinking, even with prices being what they are.
Businesses are already spending $100k+ a month on tokens; if they think that a million dollars in spend will give them control over their own infrastructure and effectively unlimited usage, that's going to be attractive.
If you consider stuff like defense contractors and private research labs around the world, where they want to keep their most sensitive data air-gapped, these huge models are extremely high value.
amuhak@reddit
One of the API providers could print it onto silicon and serve it fast and cheap for everyone. That means OpenAI can't charge 10x more for 5% better performance.
Dany0@reddit
https://openrouter.ai/moonshotai/kimi-k2.6 two providers, just one other than moonshot. So far the pricing is far from dirt cheap. And the tok/s is abhorrent too
amuhak@reddit
Making chips takes time. Lots of time. I'm talking about something like https://taalas.com/ who are literally burning the weights into silicon for crazy speeds.
It was a theoretical proposition. I don't think something like that is going to happen for K2.6 because: a) it's too big, and b) by the time they could fab something, a much better model will have dropped.
I think everyone has seen it by now, but https://chatjimmy.ai/ is a demo of the tech.
Kind_Style7978@reddit
I hope we do get a 'good enough' model, with enough context, to be burned into hardware cards, as speed improvements of over an order of magnitude would unlock new AI uses and help us move a lot of compute out of ~~software rendering~~, I mean, software LLM generation, and free up our resources for other uses.
Or just for MS to spend more RAM on putting copies of React into Notepad >.>
Dany0@reddit
I have the ssd space. And I'm in the top 0.1% because I'm technically speaking gpu rich with my 5090 lmao
Every day is a reminder I'm closer to being homeless than a billionaire
SnooPaintings8639@reddit
Nice. Can I move in with you?
GlossyCylinder@reddit
90% of us can't run it lol.
Dany0@reddit
To quote a great poet and lyricist, this house is a broken home
uniVocity@reddit
“Home”
PhotographerUSA@reddit
I need this in 35B format lol
TurnUpThe4D3D3D3@reddit
China finally did it. They finally beat US models on HLE tools. Congrats to the Kimi team.
Ok_Mammoth589@reddit
Did they? What does Claude 4.7 score on those?
SeyAssociation38@reddit
China is only a few months behind. Once they figure out EUV they have the potential to be ahead of the US.
guillefix@reddit
Sorry, what is EUV?
RelationshipLong9092@reddit
I have no doubt they'll do it, but EUV is a pretty big barrier to climb!
ffgg333@reddit
How is creative writing? Better? Less censored?
seppe0815@reddit
gemma 4 is the big uncensored model, nothing more you need bro! trust
Ynead@reddit
It kinda sucks for creative writing. Kimi K2.5 was quite superior
Budget-Light-1694@reddit
https://gofund.me/dae1e2fa0
Fresh-Resolution182@reddit
1.1T params and you still need the quant chart to figure out if your rig can even touch it. great model, democratized for the top 0.1% of hardware owners
arm2armreddit@reddit
RTX5070 12GB Vram 😭😭😭, gguf 0.1BIT whenn?
Genesis2001@reddit
12GB? what luxury!
cries in 1660S not designed for AI
lol
arm2armreddit@reddit
did u try running anything? it has 1400 cuda cores and 6GB sounds decent for llama3.3, or am i missing something?
Genesis2001@reddit
I run Qwen2.5-Coder-7B-Instruct-Q4_K_M from unsloth on it right now and get decent performance, but I haven't really used it beyond test prompts. I've been tweaking this command line over the last week or so though. The last 3 (repeat penalty, temp, top-p) params are new as of today and are being tested.
jwpbe@reddit
why are you using a 2 year old coding model and not something made in the last 6 months? Even the Qwen 3.6 that came out a few days ago is going to be faster and better than 2.5 coder at coding
Genesis2001@reddit
Because I haven't found a coding model in the 7-9B param range that runs comfortably on my hardware. I'm still quite new.
jwpbe@reddit
understandable, try this if you can fit Q5_K_M into a combination GPU / RAM offload with reasonable speeds:
https://huggingface.co/AesSedai/Qwen3.6-35B-A3B-GGUF
If not, try Q4_K_XL from here:
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF
Mixture-of-experts models, which are denoted by (Parameter Count - Active Parameter Count), route queries to different segments of the model as it runs inference, resulting in faster performance even if you can't fully load the model onto your GPU.
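To put rough numbers on that for the 35B-A3B model linked above (reading the name as ~35B total / ~3B active, which is the usual convention):

```python
# Why a 35B-A3B MoE runs acceptably even when partially offloaded:
# per-token compute scales with the *active* parameters, while memory scales with the total.
total_params = 35e9    # all experts must sit somewhere in RAM/VRAM
active_params = 3e9    # only this many participate in any single token
print(f"active fraction per token: {active_params / total_params:.0%}")  # ~9%
# Generation speed therefore lands much closer to a ~3B dense model,
# even though you still need memory for all 35B of weights.
```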
Genesis2001@reddit
I won't be able to load that Qwen 3.6 model and load my IDE at the same time, lol. It's maxing out my system ram with the model loaded right now. I'm getting about 8.3-8.4 t/s right now asking it to refactor a powershell script I had ChatGPT write (just a test prompt).
However, it's good to know that in a pinch, I can run the model.
AppealSame4367@reddit
You can run qwen3.6 35B though. Sincerely, someone with an RTX 2060 mobile and 6gb vram.
Chasian@reddit
Huh? How? What's your t/s? I ran it on my 3070 and it was like 7 lol
AppealSame4367@reddit
Linux, 32gb system ram, at low context i get 600-700 prefill tps and around 15 tps output. At around 50k context it's more like 200 prefill and 1-3 tps output.
Compiled for cuda 12.x on linux. Adapt values to your CPU etc. Low unsloth quant, no thinking, low kv cache quant, no vision. Still delivers better results than q35b no thinking or gemma 4 e4b with thinking.
#!/bin/bash
export GGML_CUDA_GRAPHS=0
./build/bin/llama-server \
-hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-IQ2_M \
--no-mmproj \
--no-mmproj-offload \
-c 80000 \
-b 2048 \
-ub 2048 \
--prio 3 \
-fit on \
-np 1 \
-kvu \
--clear-idle \
--cont-batching \
--slot-save-path ./slots \
--port 8129 \
--host 0.0.0.0 \
--cache-ram 8184 \
--spec-type ngram-map-k4v \
--draft-max 32 \
--draft-min 5 \
--spec-ngram-size-n 4 \
--spec-ngram-min-hits 1 \
--mlock \
--no-mmap \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
-t 6 \
--temp 1.0 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--presence_penalty 0.0 \
--repeat-penalty 1.0 \
--jinja \
--reasoning off
Dudeonyx@reddit
1660s has no ai accelerators afaik
AppealSame4367@reddit
It's not about "accelerators" (whatever you mean by that) - you could even run it, very slowly, on CPU. But it's still an Nvidia GPU; you should be able to use CUDA 12.x.
Zestyclose839@reddit
My Zephyrus G14 might’ve just found itself a new job
DigiDecode_@reddit
this GGUF quant should work with your RTX 5070, only 11.8kb in size 🤣🤣
rebelSun25@reddit
0.001 NanoQuant Ablated A10K coming anytime
Noobysz@reddit
And now i need the REAP version of this so that Strawberry have 1000000 Rs
Worried_Drama151@reddit
Most overhyped shit ever; this is a 2-horse race these days: Qwen and Gemma. Gtfo with Moonshot and z.ai stealing ur training data
spaceman3000@reddit
True. I prefer Gemma because she speaks my language with almost no mistakes. Qwen and all Chinese models unfortunately are very bad at it. But when English is enough for my tasks, Qwen is great too. I removed all other models from my workflow.
vex_humanssucks@reddit
The 12GB VRAM crowd already crying in the thread is very relatable. Curious what the actual FP8 quant sizes end up being - K2.5 at full precision was already a painful 150GB+. If they manage to get a decent IQ3/IQ4 that fits in 48GB that would be genuinely useful. Anyone know if the architecture changes anything for llama.cpp support or is it the same transformer layout as K2.5?
usrlocalben@reddit
Kimi has been native INT4 since Kimi-K2-Thinking, so ~4.5 bpw.
The non-vision portion has had the same architecture & hyperparams since the first version. Nothing has changed in 2.6.
There is no need to wonder what the size is; since the INT4 weights dominate the model, it's easy to compute roughly, given ~1T params:
10**12 * 4.5 / 8 / 1024**3 = ~523 GB
To be more precise wrt. MoE vs. attn:
384*7168*2048*3*60 * 4.5 / 8 / 1024**3 = ~531 GB
The safetensors are 555GB, so 555 - 531 = 24GB of embedding, output, attention heads, vision etc., which are in BF16, F32, etc.
Want to know the size of a 3 bpw quant? Just substitute another bpw for the 4.5, e.g. 3 bpw:
384*7168*2048*3*60 * 3 / 8 / 1024**3 = 354GB (+24GB = 378GB)
All of these values can be found in config.json (with the exception of the 60, for which one must know that Kimi MoE starts on layer 1 and not layer 0, so it's num_hidden_layers - 1).
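If you want to plug in other bit widths without redoing the arithmetic by hand, the expressions above drop straight into a few lines of Python (all figures taken from the comment above; the flat 24 GB for non-expert weights is the same rough estimate):

```python
# Rough weight-size estimate for a Kimi K2.5/K2.6-style MoE:
# 384 experts * 7168 hidden * 2048 expert FFN * 3 projections * 60 MoE layers,
# plus ~24 GB of embedding/output/attention/vision weights kept in BF16/F32.
GiB = 1024**3
EXPERT_PARAMS = 384 * 7168 * 2048 * 3 * 60   # ~1.0T parameters in the routed experts
NON_EXPERT_GB = 24

def model_size_gb(bpw: float) -> float:
    """Approximate total size in GiB for a given bits-per-weight of the expert tensors."""
    return EXPERT_PARAMS * bpw / 8 / GiB + NON_EXPERT_GB

for bpw in (4.5, 3.0):
    print(f"{bpw} bpw -> ~{model_size_gb(bpw):.0f} GB")
# prints roughly 556 GB at 4.5 bpw and 378 GB at 3.0 bpw,
# in line with the ~555 GB / 378 GB figures above.
```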
Kirin_ll_niriK@reddit
I have 32GB on my rig (upgrading to 64 once I can swing it) and even I am sitting here realizing I will never run this without heavy quantization
_derpiii_@reddit
What a time to be alive
Alternative-Advice40@reddit
local will be great
mrinterweb@reddit
1.1T params was hard to read while drinking my coffee. Nearly did a spit take
Eyelbee@reddit
They should go larger. 4-5T would be great.
thrownawaymane@reddit
You got a full 48U rack or are you just happy to see me?
BallsInSufficientSad@reddit
A small Mac Studio has 512GB RAM - it's doable without the RAM shortage.
Expensive-Paint-9490@reddit
Well, thanks to QAT it is smaller than Qwen3.5-397B-A17B.
john0201@reddit
It just barely will not fit in a 512GB Mac Studio. Annoying.
FullOf_Bad_Ideas@reddit
you can quant Qwen 397B to be usable at around 150 GiB. You can't do that to Kimi K2.6
Service-Kitchen@reddit
Why is that sorry?
Daniel_H212@reddit
K2.6 just has too many more parameters.
Service-Kitchen@reddit
Ah okay, I thought it was something to do with the quantisation format.
FullOf_Bad_Ideas@reddit
Qwen 3.5 397B has more quantization potential; with Kimi it's mostly already exhausted, and it won't quant down 4x from the released version, unlike Qwen 3.5 397B.
Comacdo@reddit
And it's MoE... Imagine the absolute behemoth model we would get from a dense one the same size ? One can dream..
TopChard1274@reddit
Erm, how many people here can afford to run this locally? A couple? One?
h-mo@reddit
the pace at which Chinese labs are releasing open weights right now is genuinely hard to keep up with. not too long ago Kimi K2 felt like news, now there's already a .6. what are the actual capability deltas between these point releases?
jld1532@reddit
Tell me China hasn't won the race. Every organization with enough compute will be running Chinese open weights by the end of 2026. My organization already is and provides freely to all employees via open webui. Soon, most technological advancements worldwide will be completed with support from Chinese rather than American AI.
JuniorDeveloper73@reddit
The USA doesn't even try on open source
jld1532@reddit
Which I think will be ultimately judged as a huge mistake in terms of international influence.
RelationshipLong9092@reddit
That can be said of an awful lot of America's decisions as of late.
cass1o@reddit
To be fair, google just released some good gemma models that are actually small enough to be run by most people.
SnooPaintings8639@reddit
I think USA is trying very hard to go anti-opensource. Not only not to open anything themselves, but to block anyone else from opening and sharing 'the secret sauce'.
JuniorDeveloper73@reddit
It's unbelievable, because open source means more brains on several problems; as a big company you could get TONS of development and ideas for free.
The thing is already out; it's pure nonsense.
RepulsiveRaisin7@reddit
Has it? I use GLM and I feel like Codex and Claude are significantly better optimized for programming. I guess it depends on what you're doing.
jld1532@reddit
And if you have money. The average person underestimates the amount of underutilized compute out there.
ttkciar@reddit
No, but I will tell you that there isn't a race.
Lissanro@reddit
Nice! Given Kimi K2.5 already was my favorite local model, I am looking forward to running K2.6 on my rig! They also kept local-friendly INT4 weight format, which can be practically losslessly converted to Q4_X GGUF.
Fit-Statistician8636@reddit
I was scrolling, somehow expecting / looking for your comment here 😀. I cannot wait for GGUFs to appear, too… But wait - I run K2.5 with SGLang + KTransformers. Did you try this path?
Lissanro@reddit
So far I only got llama.cpp and ik_llama.cpp working. What is your experience with SGLang, does it work well with RAM+VRAM inference?
At some point I tried SGLang but never could get it working. They still have an open bug about K2.5: https://github.com/sgl-project/sglang/issues/20096
If you are using CPU+GPU inference, perhaps you could share the link to the exact quant of K2.5 you are using and full SGLang command that you found to be working? It would help to know a working baseline. Then I may give SGLang another try, it would be interesting to compare with other backends.
Fit-Statistician8636@reddit
It runs well on Blackwell, and it worked well on two GPUs too, with --tensor-parallel-size 2.
Fit-Statistician8636@reddit
So this is my SGL+KT Dockerfile that some AI built for me a few weeks back.
It was trial-and-error iteration until it worked:
```
# syntax=docker/dockerfile:1.7
# SGLang + KTransformers for CUDA 13 on EPYC Turin
ARG BASE_IMAGE=lmsysorg/sglang:dev-cu13
FROM ${BASE_IMAGE}
SHELL ["/bin/bash", "-o", "pipefail", "-c"]
ARG DEBIAN_FRONTEND=noninteractive
ARG KTRANSFORMERS_REF=main
ARG KT_CUDA_ARCHS=120
ARG KT_BUILD_JOBS=64
ARG SGL_KERNEL_CU130_WHL=https://github.com/sgl-project/whl/releases/download/v0.3.21/sgl_kernel-0.3.21%2Bcu130-cp312-abi3-manylinux2014_x86_64.whl
# Build/runtime settings for Turin + Blackwell
ENV CUDA_HOME=/usr/local/cuda \
TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas \
HF_HUB_ENABLE_HF_TRANSFER=1 \
CUDA_MODULE_LOADING=LAZY \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
PIP_DISABLE_PIP_VERSION_CHECK=1 \
PIP_NO_CACHE_DIR=1 \
PIP_ROOT_USER_ACTION=ignore \
CPUINFER_CPU_INSTRUCT=FANCY \
CPUINFER_ENABLE_AMX=OFF \
CPUINFER_ENABLE_AVX512_VNNI=ON \
CPUINFER_ENABLE_AVX512_BF16=ON \
CPUINFER_ENABLE_AVX512_VBMI=ON \
CPUINFER_USE_CUDA=1 \
CPUINFER_CUDA_ARCHS=${KT_CUDA_ARCHS} \
CPUINFER_PARALLEL=${KT_BUILD_JOBS}
# Native build dependencies for kt-kernel
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
cmake \
ninja-build \
git \
git-lfs \
curl \
wget \
ca-certificates \
pkg-config \
python3-dev \
libhwloc-dev \
libnuma-dev \
numactl \
pciutils \
&& rm -rf /var/lib/apt/lists/* \
&& git lfs install --system
WORKDIR /opt
# KTransformers repo and submodules
RUN git clone --recursive https://github.com/kvcache-ai/ktransformers.git \
&& cd /opt/ktransformers \
&& git checkout "${KTRANSFORMERS_REF}" \
&& git submodule update --init --recursive
# Remove base-package metadata first, then install the KTransformers fork
# without dependency resolution so the CUDA 13 stack is not downgraded.
RUN (python3 -m pip uninstall -y sglang sglang-kt kt-kernel sgl-kernel || true) \
&& python3 -m pip install --upgrade pip setuptools wheel packaging \
&& cd /opt/ktransformers/third_party/sglang \
&& python3 -m pip install --no-deps "./python[all]" \
&& python3 -m pip install --no-deps "${SGL_KERNEL_CU130_WHL}" \
&& python3 -m pip install --no-deps decord2
# Build kt-kernel against the prepared CUDA 13 / SM120 environment.
RUN cd /opt/ktransformers/kt-kernel \
&& python3 -m pip install --no-deps --no-build-isolation -v .
WORKDIR /workspace
CMD ["bash"]
```
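The `sglang-kt:cu13` tag used in the run command below assumes the image was built from this Dockerfile first; assuming the file above is saved as `Dockerfile` in the current directory, something along these lines should do it:
```
# build the image with BuildKit (default in modern Docker);
# the tag must match the one referenced in the `docker run` command below
docker build -t sglang-kt:cu13 .
```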
And this is my command to run it:
```
docker run --rm \
--name kimi-k2.5-rawint4-kt-160k-p2 \
--ipc=host \
--cap-add=SYS_NICE \
--runtime nvidia \
--gpus device=GPU-xxx \
-p 8000:8000 \
-v /mnt/hot/hfhub:/root/.cache/huggingface/hub \
-v /mnt/bulk/config/parsers:/opt/parsers:ro \
-e 'NCCL_P2P_LEVEL=PHB' \
-e 'NCCL_MIN_CTAS=8' \
-e 'OMP_NUM_THREADS=8' \
-e 'SAFETENSORS_FAST_GPU=1' \
-e 'PYTORCH_ALLOC_CONF=expandable_segments:True' \
-e 'HF_HUB_OFFLINE=1' \
-e 'SGLANG_ENABLE_JIT_DEEPGEMM=0' \
-e 'SGLANG_ENABLE_DEEP_GEMM=0' \
sglang-kt:cu13 \
python -m sglang.launch_server \
--model-path /root/.cache/huggingface/hub/models--moonshotai--Kimi-K2.5/snapshots/54383e83fa343a1331754112fb9e3410c55efa2f \
--kt-weight-path /root/.cache/huggingface/hub/models--moonshotai--Kimi-K2.5/snapshots/54383e83fa343a1331754112fb9e3410c55efa2f \
--served-model-name kimi-k2.5-rawint4-kt-160k-p2 \
--host 0.0.0.0 \
--port 8000 \
--mem-fraction-static 0.94 \
--trust-remote-code \
--context-length 163840 \
--max-running-requests 2 \
--prefill-max-requests 2 \
--max-total-tokens 327680 \
--kt-cpuinfer 24 \
--kt-threadpool-count 1 \
--kt-num-gpu-experts 2 \
--kt-method RAWINT4 \
--kt-gpu-prefill-token-threshold 512 \
--kt-max-deferred-experts-per-token 1 \
--kt-enable-dynamic-expert-update \
--enable-mixed-chunk \
--tensor-parallel-size 1 \
--disable-shared-experts-fusion \
--disable-custom-all-reduce \
--chunked-prefill-size 16384 \
--attention-backend flashinfer \
--reasoning-parser kimi_k2 \
--tool-call-parser kimi_k2 \
--sampling-defaults model
```
It runs well on Blackwell, and it worked well on two GPUs too, with `--tensor-parallel-size 2`.
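Once the server is up it should answer on the standard OpenAI-compatible chat endpoint; a quick smoke test, with the port and served model name taken from the flags above (the prompt itself is just a placeholder):
```
# minimal smoke test against the SGLang server launched above
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "kimi-k2.5-rawint4-kt-160k-p2",
        "messages": [{"role": "user", "content": "Say hi in one sentence."}],
        "max_tokens": 64
      }'
```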
Fit-Statistician8636@reddit
Yes, just reading this: “Kimi-K2.6 has the same architecture as Kimi-K2.5, and the deployment method can be directly reused.”
So, no quant needed - SGLang + KTransformers should be able to use the native .safetensors model. Yes, I have a great experience with Kimi+SGL+KT, and with SGL in general (using voipmonitor's fork to run MiniMax-M2.7 from VRAM). It is not without issues, but neither is llama/ik.
I'll get out of the bath, run "hf pull", and post my "recipe" for K2.5 in 10 minutes 😀.
MuzafferMahi@reddit
My god, what kind of a rig have you got?
Lissanro@reddit
I have shared details about my rig here, and here I shared my performance for various models.
valtor2@reddit
Is it because of your massive RAM? I wouldn't have expected to be able to run 1T params on 96GB VRAM.
How much did it cost you? You were before the RAM-pocalypse but very much into the GPU-pocalypse I bet :)
Lissanro@reddit
I built my rig gradually over the years, starting with buying GPUs one by one, then PSUs, and at the beginning of last year I migrated to the EPYC platform with 8-channel 1 TB DDR4 3200 MHz RAM (the server memory cost me approximately $1600 in total)... so yes, I got lucky enough to upgrade before RAM prices went insane.
PrysmX@reddit
You can offload layers to regular RAM. The entire model doesn't need to be in VRAM with GGUF. So if your total VRAM+RAM can hold the weights, you should be able to run the model (albeit slower than if it was all in unified high-bandwidth RAM).
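A rough sketch of what that looks like in practice with llama.cpp (the GGUF filename is a placeholder, and `--override-tensor` needs a reasonably recent build):
```
# keep the dense/attention weights on the GPU, push the huge MoE expert
# tensors to system RAM; context size and port are just examples
./llama-server \
  --model ./kimi-k2.6-q4.gguf \
  --n-gpu-layers 99 \
  --override-tensor ".ffn_.*_exps.=CPU" \
  --ctx-size 32768 \
  --port 8080
```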
pmttyji@reddit
What other models came with similar weight format? I remember that GPT-OSS came in MXFP4 & Gemma3 came in QAT.
_yustaguy_@reddit
Are you the crown prince of Saudi Arabia by chance?
ForsookComparison@reddit
What gives me pause about these benchmarks even more than seeing GPT 5.4 and Kimi beating Opus 4.7 in coding scenarios (something I also doubt) is seeing Gemini 3.1 Pro winning in things like Terminal Bench. I cannot for the life of me get that model to be competitive in what that benchmark claims to cover, yet it's number 1?
Bakoro@reddit
Gemini is perhaps the weirdest, most inconsistent model.
The only thing that I can really think is that they have a lot more knobs they turn dynamically based on the current load. And sometimes it just destroys existing work.
Sometimes I get super-genius Gemini who does a full load of work up front, and sometimes I get the absolutely minimal effort model.
Gemini will literally add stub placeholders. One of the things I hate the most is how it will make notes about how "in a real project, we would do xyz, but we'll just put this stub for now". It's so hard to get the model to take things seriously and not treat it as a trivial exercise.
When it's good, it's very good. When it's bad, it's among the worst.
When it reaches its context limit, it falls apart the hardest.
cant-find-user-name@reddit
My experience is pretty much the same. Gemini is a genius some times and a dumbass many more times.
AdOne8437@reddit
Total Parameters: 1T, Activated Parameters: 32B
Hmmm, ok, I think I will sit this one out :)
muyuu@reddit
how many 24GB RTX3090s to run this one?
_supert_@reddit
To preserve thinking or not preserve thinking?
TheRealMasonMac@reddit
K2.5 preserves thinking by default IIRC.
TopChard1274@reddit
1.1... whoah 😮
SnooPaintings8639@reddit
Is it a new quant? Even smaller than 1.58 bit!? Whoah indeed! /s
Long_comment_san@reddit
HERE WE GOOOOOOOOOOO
DigiDecode_@reddit
Indeed, I think this might be the first time open weights have been at SOTA level since the release of GPT-4, and that was March 2023. Dare I say they're not even 6 months behind, and there's no moat for closed weights.
Perfect-Flounder7856@reddit
SoTA?
bakawolf123@reddit
well, open source has caught up to proprietary models
now we only need hardware to catch up so we can actually run them =)
korino11@reddit
The size is much less than the 2.5 version! That is veeery good. We local users have hope :)
CrawlUpAndDie@reddit
Good news
Extra-Organization-6@reddit
if this runs well on ollama that's going to be interesting for self-hosted inference. the MoE architecture should keep memory usage reasonable even at this scale. curious what the actual VRAM requirements look like with different quants.
Awwtifishal@reddit
You need at the bare minimum 32 GB VRAM, like 700 GB of the fastest RAM you can get, and a motherboard with 4 channels... to run it slowly but at usable speed with ik_llama at Q4.
relmny@reddit
I (very occasionally) run k2 or k2.5 IQ2 on 32gb VRAM + 128 gb RAM + ssd at 1.7 t/s (not everyone codes)
Cory123125@reddit
To do literally what?
relmny@reddit
Chat.
I only use local models.
When the small/medium ones won't do, then I pull the big guns (deepseek, kimi, glm) and... wait.
I use them at least once every week.
Cory123125@reddit
.... This is even more perplexing. That output rate sounds entirely too slow to be useful.
relmny@reddit
I'm fine with it. I just ask some things, do other stuff, come back and get my answer. I try to use "instruct" with these models, but sometimes, when I really need it, I even run them as "thinking".
Ell2509@reddit
Shame you are getting downvoted. Honestly though, 1.1t is not runnable for any amateur.
ttkciar@reddit
... yet!
Extra-Organization-6@reddit
Part of the game I guess. I have the best recipe in town for blueberry muffins, check recent comments lol
Similar-Republic149@reddit
Give me a recipe for blueberry muffins
Extra-Organization-6@reddit
YOUR MOM
Yu2sama@reddit
What's the point of these bots I wonder?
ResidentPositive4122@reddit
At least this bot is funny
TheItalianDonkey@reddit
in what world is that reasonable? :-D
No_Mango7658@reddit
I call bs... Those tests cannot be accurate
Healthy-Nebula-3603@reddit
That model is 1.1T .....
No_Mango7658@reddit
And Opus is estimated at over 5T.
What I'm getting at is this feels like a benchmax release... I'll be curious to test its actual capabilities.
Caffdy@reddit
can you source that?
No_Mango7658@reddit
I have no source, just people on the internet guessing.
Elon claimed at one point Opus is 5T, but I don't know if he really knows either.
my_name_isnt_clever@reddit
Elon is less reliable than any internet random
Cory123125@reddit
If anything, I now believe Opus is smaller
Healthy-Nebula-3603@reddit
I assume Bijam (YouTube) will test that soon, if he hasn't already.
So we can compare performance.
No_Mango7658@reddit
I can't stand his useless videos. Waste of disk space
Healthy-Nebula-3603@reddit
In that case, test it yourself :)
I like to see different details of the same tests.
I only miss complex agentic tasks from him.
TurnUpThe4D3D3D3@reddit
It could be benchmaxxed, but since it’s the Kimi team I think it’s legit. Their last model was a breakthrough for real world performance so I would not doubt them.
No_Mango7658@reddit
Dude, if this is real this is the end of Anthropic... I will go into so much debt for a pair of M3 Ultras to run this.
Healthy-Nebula-3603@reddit
Haha
Healthy-Nebula-3603@reddit
Oh wow 1.1T model size!
Give me a few minutes I will test that on my local computer!
Cory123125@reddit
Man those 2 tokens you get today will be the smartest tokens you've ever seen
Miserable_Ad7246@reddit
That token will come eventually.
Spirited_Neck1858@reddit
haha eventually
srigi@reddit
Apple Watch sized model
FlamaVadim@reddit
🤣
Fringolicious@reddit
Very exciting. 6 months and we'll have this performance at 1/10th the size, presumably. Good to see open weights giving the closed labs some serious competition!
Due_Net_3342@reddit
cheering with 144GB :(
pmttyji@reddit
License: modified-mit
"OI MiniMax-M2.7"
FyreKZ@reddit
Moonshot has used modified MIT since K2, nothing new.
pmttyji@reddit
I see. I never noticed the previous versions' licenses, since those models are too large for my GPU (I thought they followed MiniMax's route). I tried Kimi-Linear, which is MIT only.
Furacao__Boey@reddit
BF16 is 595 GB; Q4 could be runnable on a single 96 GB VRAM card + RAM, maybe?
Different_Fix_2217@reddit
It's already 4-bit. That is not BF16.
FullOf_Bad_Ideas@reddit
The 595 GB is quantized already; they publish a model that has mixed precision, but the majority of weights are in INT4.
If you have 512 GB of RAM, yeah.
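Back-of-envelope, that size roughly checks out: ~1T parameters with most weights at 4 bits is about 500 GB before the higher-precision tensors and quant scales (illustrative math only, not an exact accounting):
```
# ~1e12 params * 4 bits / 8 bits-per-byte ≈ 500 GB for the INT4 majority;
# the mixed-precision leftovers and per-tensor scales push it toward ~595 GB
echo "$(( 1000 * 4 / 8 )) GB"   # params counted in billions -> prints "500 GB"
```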
Sticking_to_Decaf@reddit
That would be awesome. Especially if NVFP4 could fit with some decent context
david_0_0@reddit
Any early numbers on what spec you need to run this locally at a reasonable quant? The K2 lineage has been creeping up in size; curious if 2.6 still fits on dual 24 GB cards or if it's workstation-class cards now.
Jackw78@reddit
Just need half a terabyte of VRAM now...
cr0wburn@reddit
Big kiss to the Chinese model makers who make it Christmas almost every day!
Intrepid_Travel_3274@reddit
Did somebody test it in real scenarios? Backend architecture and frontend design? Is it really better than / equal to Opus 4.6?
smile132465798@reddit
So Kimi 2.6, Qwen Max, and DeepSeek V4 this week?
Saltwater_Fish@reddit
Best open source for sure
panchovix@reddit
If I had 48GB more VRAM/RAM I could run this at 4 bit :(
WhyLifeIs4@reddit
It's a good model sir
philguyaz@reddit
This is amazing considering they and DeepSeek are the last bastions of the vision + text open-source models my business relies on.
Specter_Origin@reddit
Did we find Cursor's CEO's account?
pseudoreddituser@reddit
Twitter announcement: https://x.com/Kimi_Moonshot/status/2046249571882500354?s=20 Blog: https://www.kimi.com/blog/kimi-k2-6
FoxiPanda@reddit
This thing is a chonk.. rip to my hardware trying to run some IQ2 quant of this at 4tok/s
Weak_Engine_8501@reddit
Bonsai kimi 2.6 when?
Exciting-Engine882@reddit
whoa whoa, can't wait to test it
ZeusZCC@reddit
Niceeee