Kimi K2 Thinking 1-bit Unsloth Dynamic GGUFs
Posted by danielhanchen@reddit | LocalLLaMA | 125 comments
Hi everyone! You can now run Kimi K2 Thinking locally with our Unsloth Dynamic 1-bit GGUFs. We also collaborated with the Kimi team on a bug fix for K2 Thinking's chat template not prepending the default system prompt "You are Kimi, an AI assistant created by Moonshot AI." on the first turn. 🥰
We also fixed llama.cpp's custom jinja separators for tool calling - Kimi emits {"a":"1","b":"2"} and not the spaced form {"a": "1", "b": "2"}.
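To make the separator difference concrete, here's a minimal Python sketch (using Python's json module as a stand-in, not the actual jinja template code):

import json

args = {"a": "1", "b": "2"}

# Compact separators - the format Kimi expects for tool-call arguments
print(json.dumps(args, separators=(",", ":")))  # {"a":"1","b":"2"}

# Default separators - the spaced form the fix avoids
print(json.dumps(args))                         # {"a": "1", "b": "2"}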
The 1-bit GGUF runs in 247GB of RAM. We shrank the 1T-parameter model to 245GB (a 62% reduction), and the accuracy recovery is comparable to our third-party DeepSeek-V3.1 Aider Polyglot benchmarks.
All the 1-bit, 2-bit and other bit-width GGUFs are at https://huggingface.co/unsloth/Kimi-K2-Thinking-GGUF
The suggested temperature is 1.0, and we also suggest min_p = 0.01. If you do not see <think>, use --special. The llama-cli command below offloads the MoE layers to CPU RAM and leaves the rest of the model in GPU VRAM:
export LLAMA_CACHE="unsloth/Kimi-K2-Thinking-GGUF"
./llama.cpp/llama-cli \
-hf unsloth/Kimi-K2-Thinking-GGUF:UD-TQ1_0 \
--n-gpu-layers 99 \
--temp 1.0 \
--min-p 0.01 \
--ctx-size 16384 \
--seed 3407 \
-ot ".ffn_.*_exps.=CPU"
Step-by-step guide + fix details: https://docs.unsloth.ai/models/kimi-k2-thinking-how-to-run-locally - and the GGUFs are at https://huggingface.co/unsloth/Kimi-K2-Thinking-GGUF
Let us know if you have any questions and hope you have a great weekend!
danihend@reddit
Has anyone ever run a 1-bit model and gotten any value from it? Personally, every model I've ever tried below 3 or 4 bits just seems unusable.
yoracale@reddit
Have you tried the Unsloth Dynamic ones specifically? 3rd party benchmarks were conducted and our Dynamic 3-bit DeepSeek V3.1 GGUF gets 75.6% on Aider Polyglot! See: https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs/unsloth-dynamic-ggufs-on-aider-polyglot
danihend@reddit
Yeah I've always been trying the Unsloth Dynamic quants but never found a Q1 to be anything other than useless. Maybe I am doing it wrong. What's the best example of a Q1 from Unsloth that I can run on 10GB VRAM (RTX 3080), with 64GB system RAM in case it's an MoE?
RobTheDude_OG@reddit
How well would this run on a system with 64GB RAM and 8 or 16GB VRAM?
And how well would it run on a system with 128GB of RAM?
I was thinking of upgrading, but with RAM prices being what they are I might wait till DDR6 and AM6.
ffgg333@reddit
Nice. In 10 years, I will have enough ram to run it on cpu😅.
Dayder111@reddit
In 10 years 3D DRAM will likely have arrived, maybe even for consumers.
danielhanchen@reddit (OP)
Haha :))
twack3r@reddit
Ok this is awesome! Anyone got this running on 4 or 6 3090s (plus a 5090) and wanna compare notes?
danielhanchen@reddit (OP)
If you have 4*24GB = 96GB VRAM or more, definitely customize the offloading flags as shown in the hint box at https://docs.unsloth.ai/models/kimi-k2-thinking-how-to-run-locally#run-kimi-k2-thinking-in-llama.cpp - for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload the gate, up and down MoE layers, but only from the 6th layer onwards.
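As a rough illustration of which tensor names that regex catches (a hypothetical Python check, not part of llama.cpp itself):

import re

# Regex portion of the -ot flag (everything before "=CPU"); llama.cpp tensor names
# look roughly like "blk.<layer>.ffn_<gate|up|down>_exps.weight".
pattern = re.compile(r"\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps\.")

for name in ["blk.3.ffn_gate_exps.weight",    # layer 3  -> no match, stays on GPU
             "blk.6.ffn_up_exps.weight",      # layer 6  -> offloaded to CPU
             "blk.42.ffn_down_exps.weight"]:  # layer 42 -> offloaded to CPU
    print(name, "->", "CPU" if pattern.search(name) else "GPU")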
twack3r@reddit
Thanks u/danielhanchen
I have 6 3090s and a 5090 but I’m not sure how much spreading across GPUs will help performance given my understanding that llama.cpp still performs poorly across GPUs compared to vLLM and TP.
Will be testing this extensively, this is exactly the kind of model I built this rig for.
pathfinder6709@reddit
Page not found for model deployment guide
danielhanchen@reddit (OP)
Oh wait sorry which link is broken - will fix asap!
pathfinder6709@reddit
https://huggingface.co/unsloth/Kimi-K2-Thinking-GGUF/blob/main/docs/deploy_guidance.md
MatterMean5176@reddit
So what's the word people, anybody try the smallest quant? I am intrigued, any thoughts on it?
black_ap3x@reddit
Me crying in the corner with my 3060
yoracale@reddit
It will still work as long as you have enough RAM, but it might be slow depending on your RAM.
SilentLennie@reddit
Do you run evals to know what the quality losses are?
danielhanchen@reddit (OP)
We ran some preliminary ones, and we see 85%+ accuracy retention even for the lowest 1-bit one! We follow a similar methodology to https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs/unsloth-dynamic-ggufs-on-aider-polyglot
SilentLennie@reddit
85% doesn't sound that promising on its own, but when the jumps in capability between models are large, and 85% is really 85+% (meaning 85% is the worst you can expect), then it does sound promising.
korino11@reddit
For coding, Kimi is the WORST model I ever used. It always lies, it always breaks code. It doesn't care about prompts at all! It doesn't care about tasks and todos...
Significant-Pin5045@reddit
I hope this pops the bubble finally
urekmazino_0@reddit
How much would you say the performance difference is from the full model?
Crinkez@reddit
Please stop normalizing "performance" to refer to strength. Performance is supposed to equal speed.
yoracale@reddit
You can run the full-precision K2 Thinking model by using our 4-bit or 5-bit GGUFs.
nmkd@reddit
Why run 5 bit, isn't the model natively trained on INT4?
yoracale@reddit
Because there may be some slight quantization degradation, so 5-bit is just to be 'safe'.
nmkd@reddit
But why would you quantize to a format that's larger?
Is INT4 not smaller than Q5 GGUF?
danielhanchen@reddit (OP)
The issue is INT4 isn't represented "correctly" in llama.cpp as of yet, so we tried using Q4_1, which most closely matches it. The problem is llama.cpp's Q4_1 uses float16 scales, whilst the true INT4 uses bfloat16. So using 5-bit is the safest bet!
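To illustrate one way the two dtypes differ (a tiny sketch assuming PyTorch, just for the dtypes): float16 tops out around 65504, while bfloat16 keeps float32's exponent range, so a value that is fine in bfloat16 can overflow once forced into float16.

import torch

x = 70000.0  # representable in bfloat16, out of range for float16

print(torch.tensor(x, dtype=torch.float16))   # inf - beyond float16's ~65504 max
print(torch.tensor(x, dtype=torch.bfloat16))  # roughly 7e4 - representable, just coarser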
MitsotakiShogun@reddit
^ This. It would be nice if every compression ratio was accompanied by a performance retention ratio like (I think) Nvidia did with some models in the past, or with complete benchmark runs like Cerebras did recently with their REAP releases.
yoracale@reddit
It's definitely interesting, but doing benchmarks like this requires a lot of time, money and manpower. Unfortunately we're still a small team at the moment, so it's unfeasible. However, a third party did conduct independent benchmarks of our DeepSeek-V3.1 GGUFs on the Aider Polyglot benchmark, which is one of the hardest benchmarks, and we also did Llama and Gemma on 5-shot MMLU. Overall, the Unsloth Dynamic quants squeeze out nearly the maximum performance you can get from quantizing a model.
And the most important thing for performance is actually the bug fixes we do! We've done over 100 bug fixes now, and a lot of them dramatically increase model accuracy. We're also putting together a page with all of our bug fixes ever!
Third party DeepSeek v3.1 benchmarks: https://docs.unsloth.ai/new/unsloth-dynamic-ggufs-on-aider-polyglot
Llama, Gemma 5shot MMLU, KL Divergence, benchmarks: https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs
_VirtualCosmos_@reddit
Won't that quant make it heavily lobotomized?
danielhanchen@reddit (OP)
Nah! The trick is to dynamically quantize the unimportant layers to 1-bit while keeping the important ones at 4-bit!
For example, at https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs/unsloth-dynamic-ggufs-on-aider-polyglot, the DeepSeek-V3.1 dynamic 5-bit is nearly equivalent to the full 8-bit model!
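Very loosely, you can think of it like this hypothetical sketch (not our actual recipe - the real importance measure and bit choices are more involved):

# Hypothetical sketch of dynamic bit-width assignment: rank tensors by some
# importance score, keep the most important ones at 4-bit, push the rest to 1-bit.
def assign_bits(importance: dict[str, float], high_frac: float = 0.4) -> dict[str, int]:
    ranked = sorted(importance, key=importance.get, reverse=True)
    n_high = int(len(ranked) * high_frac)
    return {name: (4 if i < n_high else 1) for i, name in enumerate(ranked)}

# Scores here are made up; in practice they'd come from calibration statistics.
scores = {"blk.0.attn_q": 0.9, "blk.0.ffn_gate_exps": 0.2, "blk.1.ffn_up_exps": 0.1}
print(assign_bits(scores))  # {'blk.0.attn_q': 4, 'blk.0.ffn_gate_exps': 1, 'blk.1.ffn_up_exps': 1}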
_VirtualCosmos_@reddit
Now that you are here, I have a question: is quantization a lossless compression technique? I mean, can you reverse a parameter to its original FP32 or FP16 value having only the quantized param? (I have no idea how the maths works)
Aperturebanana@reddit
Wow holy shit that’s awesome
LegacyRemaster@reddit
Feedback about the speed: Ubergarm's IQ2_KS with 128GB RAM + 5070 Ti + 3060 Ti + SSD. :D Will try Unsloth too, but yeah... maybe with RAID 0 across 4 SSDs it will be better (I have that).
danielhanchen@reddit (OP)
Oh wait, did you customize the regex offloading flags? Try that! See examples in the hint box at https://docs.unsloth.ai/models/kimi-k2-thinking-how-to-run-locally#run-kimi-k2-thinking-in-llama.cpp - for example
-ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload the gate, up and down MoE layers, but only from the 6th layer onwards. Also remove the 4-bit K and V quantization - it will most likely make generation slower.
LegacyRemaster@reddit
will try thx man!
danielhanchen@reddit (OP)
Let me know how it goes!
LegacyRemaster@reddit
Sure. I have two problems here:
1) I'm using the model on Windows, and the memory/SSD management is terrible.
2) Even though I have a 10GB/s SSD transfer rate (and with 4 RAID SSDs I get 20-22GB/s), Windows isn't loading data at the desired speed (average 500MB/s).
So the bottlenecks are:
a) the RTX 3060ti, which has half the memory speed of the RTX 5070ti.
b) PCI Express 4.
c) Windows and SSD management.
I'll have to try it on Ubuntu, but in any case, for a production scenario, much, much more memory is needed.
mysteryweapon@reddit
Okay, cool, how do I run a ~50GB model on my sort-of-meager desktop?
yoracale@reddit
Well, if you want to run a 50GB model, I guess Qwen3-30B will be great for you? You can read our step-by-step guide for the model here: https://docs.unsloth.ai/models/qwen3-how-to-run-and-fine-tune/qwen3-2507#run-qwen3-30b-a3b-2507-tutorials
Or if you want to choose any other model to run, you can view our entire catalog here: https://docs.unsloth.ai/models/tutorials-how-to-fine-tune-and-run-llms
fallingdowndizzyvr@reddit
Thank you! Now this I can run. I have ~250GB of usable VRAM.
MLDataScientist@reddit
Do you have 8xMI50 32GB? What speed are you getting? I have 8xMI50 but fan noise and power usage is intolerable. So, I just use 4x MI50 most of the time.
Tai9ch@reddit
Have you tried cranking them down to 100W each?
I find that they deal with lower power limits very nicely, with 100W retaining like 90% of the performance of 200W.
MLDataScientist@reddit
Yes, 100W works. But still fan noise is an issue. I recently changed fans to 80mm fans and that reduced the noise a bit.
fallingdowndizzyvr@reddit
No. I have a gaggle of GPUs.
danielhanchen@reddit (OP)
OO definitely tell me how it goes!
GmanMe7@reddit
Want to make money? Make a super simple tutorial on YouTube for a Mac Studio and another one for a Windows PC.
yoracale@reddit
We have a step-by-step guide with code snippets to copy-paste: https://docs.unsloth.ai/models/kimi-k2-thinking-how-to-run-locally
tvetus@reddit
1 million output tokens in... 5.8 days :)
CovidCrazy@reddit
Do you think LM studio would be the best way to run this on a Mac studio?
yoracale@reddit
Yes, you can run this in LM Studio. For more speed though, I think llama.cpp is more customizable.
1ncehost@reddit
Won't be running this one, but I just wanted to say thanks for the tireless work you guys put into each model.
danielhanchen@reddit (OP)
No worries and super appreciate it! :)
Accomplished_Bet_127@reddit
With the speed you answer everyone, even in random posts, I still believe you are a bot. No way someone can both work and communicate this much. What's your secret? What do you eat? How much do you sleep? Did you swim in a pool of liquid Adderall when you were younger?
danielhanchen@reddit (OP)
Haha it's just me :) my brother helps on his own account but this one is me!
We do sleep! A bit nocturnal though, so around 5am to 1pm. Nah, never taken Adderall, but I get that a lot lol
layer4down@reddit
5AM-1AM 😌😴
AcanthaceaeNo5503@reddit
Lmao true though, I really love unsloth. Hope to join someday
danielhanchen@reddit (OP)
Oh thanks! We're always looking for more help :)
issarepost@reddit
Maybe several people using one account?
danielhanchen@reddit (OP)
Nah it's just me! My brother does use his other account to answer questions if I'm not around though
croninsiglos@reddit
Hmm but how about 128 GB of unified memory and no GPU... aka a 128 GB Macbook Pro?
xxPoLyGLoTxx@reddit
I JUST downloaded it and ran a "Hi" test on a 128GB unified-memory M4 Max Mac Studio. With Q3_K_XL I was getting around 0.3 tps. I haven't tweaked anything yet, but I'll likely use it for tasks not needing an immediate response. I'm fine with it chugging along in the background. I'll probably load up gpt-oss-120b on my PC for other tasks.
danielhanchen@reddit (OP)
Oh cool! Ye sadly it is slow without a GPU :( One way to boost it is via speculative decoding which might increase it by 2x to 3x
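If you're curious what speculative decoding does conceptually, here's a toy greedy-acceptance sketch in Python (the real speedup comes from the target model verifying all the draft tokens in one batched forward pass, which this simplified loop doesn't show; draft_next and target_next are stand-in functions):

# Toy sketch: a cheap draft model proposes k tokens, the big target model keeps
# the longest prefix it agrees with, plus its own token at the first mismatch.
def speculative_step(ctx, draft_next, target_next, k=4):
    proposal, tmp = [], list(ctx)
    for _ in range(k):                 # draft proposes k tokens autoregressively
        tok = draft_next(tmp)
        proposal.append(tok)
        tmp.append(tok)
    accepted, tmp = [], list(ctx)
    for tok in proposal:               # target verifies the proposals in order
        t = target_next(tmp)
        accepted.append(t)
        tmp.append(t)
        if t != tok:                   # first disagreement ends the step
            break
    return accepted                    # always at least 1 new token per step

# Toy stand-ins: the draft always says "a", the target says "a" twice then "b"
print(speculative_step(["<s>"], lambda c: "a", lambda c: "a" if len(c) < 3 else "b"))  # ['a', 'a', 'b']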
xxPoLyGLoTxx@reddit
Thx for all you do!
Fitzroyah@reddit
I hope pewdiepie sees this, perfect for his rig! I will keep dreaming with my old 1080.
Odd-Ordinary-5922@reddit
pewdiepie uses vllm and awq
danielhanchen@reddit (OP)
Oh that would be cool!
Bakoro@reddit
It's kind of humorous how time looped back on itself.
This is like the old days when personal computers were taking off, and people were struggling with needing whole megabytes of ram rather than kilobytes, gigabytes of storage rather than megabytes.
Another 5~10 years and we're all going to just have to have 500 GB+ of ram to run AI models.
danielhanchen@reddit (OP)
Oh lol exactly! In the good ol days the computers were the size of an entire room!
lxe@reddit
Anyone have TPS and quality numbers?
danielhanchen@reddit (OP)
For now if you have enough RAM, you might get 1 to 2 tokens / s. If you have enough VRAM, then 20 tokens / s from what I see
CapoDoFrango@reddit
Can you do a quarter bit?
danielhanchen@reddit (OP)
I'm trying to see if we can further shrink it!
phormix@reddit
Oof, this is cool, but given the RAM shortages lately (and the fact that the RAM I bought in June has already more than doubled in cost) it's still a hard venture for homebrew.
danielhanchen@reddit (OP)
Oh yeah, RAM sadly is getting much more popular :(
kapitanfind-us@reddit
Quick question, I've always wondered why a seed is needed? Apologies if it's off topic.
danielhanchen@reddit (OP)
Oh the 3407 seed? It's not necessary but if you want the same response every time you reload the model, the seed is used for that
rookan@reddit
What hardware did you use to make this quant?
danielhanchen@reddit (OP)
Oh we generally use spot cloud machines since they're cheap! We also have some workstations which we run them on!
Craftkorb@reddit
Amazing! Hey I could upgrade one of my servers to have loads more RAM
Checks RAM prices
Neeevermind 😑
danielhanchen@reddit (OP)
We're trying to see if it's possible to shrink it further!
paul_tu@reddit
Oh boy, I'd need oculink now
danielhanchen@reddit (OP)
Interesting but yes faster interconnects will defs be helpful!
noiserr@reddit
I'm waiting on GGUFs for the Kimi-Linear-REAP-35B-A3B-Instruct
danielhanchen@reddit (OP)
Sadly llama.cpp doesn't have support for Kimi Linear :(
nonaveris@reddit
Will try this on a decently beefy Xeon (8480+ w/ 192gb memory) alongside a slightly mismatched pair of NVidia GPUs (3090/2080ti 22gb).
Not expecting miracles, but nice to see that it could have a decent chance to work.
danielhanchen@reddit (OP)
Oh yes that would be cool!
NameEuphoric3115@reddit
I have a single 4090, can I run this Kimi model?
danielhanchen@reddit (OP)
It can work yes, but it will be slow - expect maybe 1 token/s or less.
AvidCyclist250@reddit
This is some dick out in a blizzard level of shrinkage, impressive work
danielhanchen@reddit (OP)
Thank you! We provide more similar benchmarks on Aider Polyglot as well at https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs/unsloth-dynamic-ggufs-on-aider-polyglot
ciprianveg@reddit
I need to test the Q3_K_XL on my 512GB DDR4 Threadripper. I expect 5-6 t/s.
danielhanchen@reddit (OP)
OOO let me know how it goes! 512GB is a lot!
Herr_Drosselmeyer@reddit
I appreciate the effort, but even at 'only' 247GB of VRAM, it's not practical for 99.99% of users.
Still, thanks for all the work you guys do.
danielhanchen@reddit (OP)
Thanks! We're trying to see if we can compress it further via other tricks!
brahh85@reddit
I would say that 10-15% of the users of this subreddit can run it, and next year it could be 50%.
18 months ago I used a 72B model via API; now I have enough VRAM to use it at Q8 on my system, thanks to my small fleet of MI50s. I bet people are buying DDR5 RAM to host things like gpt-oss 120b and GLM 4.5 Air, and the next step is GLM 4.6. In the end it's just having 1 or 2 GPUs and a ton of DDR5.
I'm waiting for AMD to launch a desktop quad-channel CPU to upgrade mobo+CPU+RAM and be able to host a 355B model... but maybe I should design my system with Kimi in mind.
XiRw@reddit
Can my pentium 4 processor with Windows 98 handle it?
danielhanchen@reddit (OP)
Haha, if llama.cpp works then maybe? But I doubt it, since 32-bit machines in the good ol' days had limited RAM as well - Windows XP 32-bit for example had a max of 4GB RAM!
xxPoLyGLoTxx@reddit
No you need to upgrade to Windows ME or Vista more than likely.
AleksHop@reddit
can we run q4 with offloading to 2x96gb rtx pro?
danielhanchen@reddit (OP)
Oh 2*96 = 192GB + RAM - definitely in the future!
yoracale@reddit
Yes you can, but it will be too slow unfortunately - unless you can add more RAM so the model's size on disk fits in your total RAM/VRAM.
FullOf_Bad_Ideas@reddit
does anyone here have 256GB or 512GB Mac?
how well does this work on it?
thinking about running it on a phone. I don't think storage offloading works there though, it'll just crash out
danielhanchen@reddit (OP)
Oh it might work on a phone, but ye probs will crash :(
Storage offloading works ok on SSDs, but definitely I don't recommend it - it can get slow!
Hoodfu@reddit
Have an M3 Ultra 512GB - didn't do the 1-bit, but did the 2-bit 370-gig dynamic Unsloth one: 328 input tokens - 12.43 tok/sec - 1393 output tokens - 38.68s to first token.
FullOf_Bad_Ideas@reddit
Thanks! That's probably a bit too slow to use for tasks that output a lot of reasoning tokens, but it's technically runnable nonetheless!
By any chance, have you used LongCat Flash Chat? There are MLX quants but no support from llama.cpp - https://huggingface.co/mlx-community/LongCat-Flash-Chat-4bit
In theory it should run a bit faster on Apple hardware, since it has a dynamic, but overall low, number of activated parameters - varying between 18.6B and 31.3B.
maifee@reddit
Waiting for half bit dynamic gguf
danielhanchen@reddit (OP)
Haha - the closest possible would be to somehow do distillation or remove say 50% of parameters by deleting unnecessary ones
Long_comment_san@reddit
Amazing stuff. I wish I had so much hardware for 1 bit quant but hey, we'll get there eventually.
danielhanchen@reddit (OP)
One of the goals is to probably prune some layers away - say a 50% reduction which can definitely help on RAM and GPU savings!
no_witty_username@reddit
Do you mean how many layers are offloaded to GPU versus CPU, or do you mean something else by this? I've always wondered if there's a procedure or method we could apply to very large models that surgically reduces the parameter count while still being able to run the model. Like take a 1-trillion-parameter model and have some process reduce it down to only 4 billion parameters; the model loses some of its intelligence, but it would still run as if you were running a 4B Qwen model, except it's Kimi K2. And I'm not talking about distillation, which requires retraining - this would be closer to model-merger-type tech... Just wondering if we've developed such tech yet or are coming up on something around that capability.
Nymbul@reddit
Here is some literature I've seen regarding pruning and an open source implementation of it.
Essentially, it's a process of determining the least relevant layers for a given dataset and then literally cutting them out of the model, typically with a "healing" training pass afterwards. The hope is that the tiny influence of those layers was largely irrelevant to the final answer.
I tried a 33% reduction once and it became a lobotomite. It's a lot of guesswork.
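The selection step looks roughly like this (a toy Python sketch of importance-based layer pruning, not the actual implementation from that literature):

import numpy as np

# Score each layer on a calibration set (e.g. how much the output changes when the
# layer is skipped - an assumption here), drop the lowest-scoring layers, then
# "heal" the remaining weights with a short finetune.
def layers_to_keep(layer_scores, keep_ratio=0.85):
    n_keep = int(round(len(layer_scores) * keep_ratio))
    order = np.argsort(layer_scores)[::-1]      # most important first
    return sorted(order[:n_keep].tolist())      # keep original layer order

scores = [0.9, 0.1, 0.8, 0.05, 0.7, 0.6]
print(layers_to_keep(scores, keep_ratio=0.67))  # -> [0, 2, 4, 5]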
danielhanchen@reddit (OP)
Oh yes, that literature is nice!
no_witty_username@reddit
Thanks, ill check it out now.
danielhanchen@reddit (OP)
Oh I meant actual pruning, as in deleting unnecessary layers - for example Cerebras REAP - we actually made some GGUFs for those, for example:
Yes distillation is another option!
mal-adapt@reddit
The way you're conceptualizing it unfortunately isn't possible. We can trim context that is disparate relative to some topic - the important detail here being WHY we can do this relative 'to a topic'; the reason we can do that gives insight into why the other way around isn't logically possible.
A 'topic' here is just any higher-order, inferential, semantic concept we are specifically targeting - capabilities which we can expect to be organized at the 'edges' of the model, in the higher latent layers. By optimizing relative to the capabilities organized there, we can trim the 'furthest away branches' at that height or higher and potentially not lose much capability relative to the topic we are optimizing for.
You can literally imagine it as trimming a tree.
Thistleknot@reddit
1B or it didn't happen
yoracale@reddit
I'm not sure if llama.cpp supports the architecture so probably not until they support it
ParthProLegend@reddit
I had sent you a Reddit DM, please check if possible.
FORLLM@reddit
I aspire to someday be able to run monsters like this locally and I really appreciate your efforts to make them more accessible. I don't know that that's very encouraging for you, but I hope it is.
yoracale@reddit
Thank you yes, any supportive comments like yours are amazing so thank you so much, we appreciate you 🥰
john0201@reddit
This is great. Do you have an idea of what tps would be expected with 2x 5090s and 256GB system memory (9960X)? Not sure I will install it if it's only 5 tps - it seems like much under 10 isn't super usable. But awesome effort to be able to run a model this big locally at all!
danielhanchen@reddit (OP)
Yes, probably 5-ish tokens/s, but I didn't select all the best settings - it might be possible to push it to 10!