What kind of consumer computer can run Kimi-K2.6-GGUF which is a 585GB download?
Posted by THenrich@reddit | LocalLLaMA | View on Reddit | 29 comments
I read today about the release of Kimi K2.6.
In LM Studio on Windows it shows the download size of the model as 585GB.
What kind of Windows machine can run this monster model?
What minimum RAM and VRAM are needed to run it at a reasonable speed?
jeffwadsworth@reddit
This is one of those "if you have to ask..." kind of questions. Unless you're rich, find a rich enthusiast friend; otherwise it will break the bank. No consumer system comes even close to being viable. A custom system will run around $100K minimum just to run the quants at decent t/s.
LatentSpacer@reddit
Nah, for $50-70K you could build a machine with 4x RTX 6000 Pro, an EPYC CPU, and 512GB RAM. That would be 384GB VRAM + 512GB RAM = 896GB total memory. I think it could run some pretty big models at decent speeds. You could even add a few more RTX 6000s.
Turbulent_Pin7635@reddit
Define "decent" in numbers...
chibop1@reddit
Don't try this at home!
Long_comment_san@reddit
And why would you need it?
Ok_Mammoth589@reddit
The fuck are you talking about? People were losing their everloving minds that Anthropic stopped letting them pay $200/month for Claude. Clearly there are reasons
Long_comment_san@reddit
Man, take a break.
ScaredyCatUK@reddit
Become friends with Alex Ziskind, https://www.youtube.com/watch?v=FD6i0htqLew and if that doesn't work, try Exo
Excellent_Screen_653@reddit
You can do that on a first gen Raspberry Pi mate lol
Annual_Award1260@reddit
I can run it on my old ddr4 1TB ram workstation. I get a blistering 2 tok/sec
j_osb@reddit
On Zen 5 EPYCs you can get upwards of 15 t/s tg if you populate all 12 channels.
LatentSpacer@reddit
Which quant?
Kerem-6030@reddit
lol
OutrageousMinimum191@reddit
The IQ2_S quant runs on my 384GB DDR5 server at 7 t/s
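The t/s figures in this subthread line up with a memory-bandwidth-bound estimate: each decoded token has to stream the model's active weights from RAM once. A rough sketch, assuming Kimi K2's ~32B active parameters and peak JEDEC bandwidths (real throughput is typically a fraction of this upper bound):

```python
# Back-of-envelope decode speed for a memory-bandwidth-bound MoE model.
# Assumptions (not measured): ~32B active params per token, ~2.5 bits/weight
# average for a small quant, and peak per-channel DDR5 bandwidth.

def peak_bandwidth_gbs(channels: int, mts: int = 4800) -> float:
    """Peak DRAM bandwidth in GB/s: channels * 8 bytes * MT/s."""
    return channels * 8 * mts / 1000

def decode_tps_upper_bound(bw_gbs: float, active_params_b: float,
                           bits_per_weight: float) -> float:
    """Each token must stream the active weights from RAM once."""
    active_gb = active_params_b * bits_per_weight / 8
    return bw_gbs / active_gb

# A 12-channel Zen 5 EPYC (DDR5-6000) vs. an 8-channel DDR5-4800 box.
for channels, mts in [(12, 6000), (8, 4800)]:
    bw = peak_bandwidth_gbs(channels, mts)
    tps = decode_tps_upper_bound(bw, active_params_b=32, bits_per_weight=2.5)
    print(f"{channels}ch DDR5-{mts}: {bw:.0f} GB/s -> <= {tps:.0f} t/s theoretical")
```

The observed 15 t/s and 7 t/s above are roughly a quarter to a half of these theoretical ceilings, which is in the usual range for real-world efficiency.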
Euphoric_Emotion5397@reddit
Those are really hardcore enthusiasts. You should have seen how many people here have homelabs with 512GB.
Cergorach@reddit
None.
Digger412@reddit
Minimum VRAM is something like 24GB to hold the KV cache plus attention, and the smallest quant I published of K2.5 (and likely of K2.6) was 262GiB / 281GB, so you're looking at a minimum of ~256GB of RAM and a smaller Ubergarm or Unsloth quant.
I have this older sweep bench on Linux + 2x 3090s + 12-channel RAM with the "full quality" Q4_X. Performance should be about the same for K2.6, just as a point of reference. I've upgraded to 8x 6000 Pros since then and haven't re-benched yet, but I'll try to later tonight.
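Those quant sizes are consistent with K2's roughly 1T total parameters; a hedged sketch of the weight-size arithmetic (the parameter count and average bits/weight are assumptions, not published specs):

```python
# Approximate on-disk size of a GGUF quant: total params * bits / 8.
# Illustrative only; real quants mix bit widths across tensor types.

GIB = 1024**3

def quant_size_gib(total_params_billion: float, bits_per_weight: float) -> float:
    """Weight bytes for a uniform average quantization level, in GiB."""
    return total_params_billion * 1e9 * bits_per_weight / 8 / GIB

# A ~1T-param model at an average of ~2.25 bits/weight lands near
# the 262 GiB figure quoted above.
print(f"{quant_size_gib(1000, 2.25):.0f} GiB")  # -> 262 GiB
```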
streppelchen@reddit
Interested to get the 6000 pro numbers :)
Digger412@reddit
Sweep-bench results on llama.cpp:
streppelchen@reddit
Thanks! I suppose you don't use the machine alone, but serve more users? Then the only question left to answer (for my curiosity) is vLLM speeds with concurrency.
Digger412@reddit
I use the machine for myself, producing quants (I'm AesSedai on HF), doing research, hosting model showcases in the BeaverAI server, etc.
There is definitely multi-user / concurrency / batching usage on it for the model showcases, I'm working on a de-slop rewriter pipeline currently and that benefits from massive parallelism + VRAM for slop phrase clustering with HDBSCAN and friends, batching embeddings, DSPy rollouts, abliteration research, and more.
The two 3090s I had were really good for doing single-user inference with llama.cpp, but having these I can really dig into more academic-grade workloads :)
streppelchen@reddit
Awesome, thanks again!
No-Juggernaut-9832@reddit
A 4x RTX 6000 rig will run you a minimum of $60K. That's not including a probably dedicated 220-240V, 40-50 amp circuit (dryer socket). Or 2-3 Mac Studios with 512GB.
There are smaller models that are just as capable at 1/2 or 1/3 the rig size. MiniMax 2.7, for example, will work in 128-192GB.
ImportancePitiful795@reddit
Tbh, the moment someone goes into the $60K range with the 4 RTX 6000s, they'd be better off getting an ARS-111GL-DNHR-LCC with 2 GH200s (albeit without NVLink), or an ARS-221GL-NHIR for a bit more, which has NVLink.
ImportancePitiful795@reddit
Supermicro ARS-221GL-NHIR. It costs around €75,000.
You need those 2 GH200s with NVLink.
Unless you want to gamble on the half-priced ARS-111GL-DNHR-LCC and the slower interconnect between its 2 GH200s.
Southern_Sun_2106@reddit
2 Mac Studios (M3 Ultras) will run this at slow chat speed. I just ran GLM 5.1 4-bit (a much, much smaller model) and it could not power Claude Code; it kept timing out. 17 t/s is good for chatting though. If you go for the jet-engine / whole-house-heater "Windows machine", that would be, as someone said, like a second mortgage, and it will be technologically outdated within the next year or two. Long story short: not practical to run such models at home at the moment. On the positive side, if you can wait several months, a smaller but just as capable model will show up.
CatalyticDragon@reddit
A second hand EPYC server you found with 2TB of RAM running Windows Server.
Miriel_z@reddit
"Consumer" computer sounds a bit downplayed. I am literally crushed, I thought my new laptop is decent.
eclipsegum@reddit
The most reasonable setup that can run this isn't going to be Windows. It would be 2 Mac Studio M3 Ultras: 512GB x2, or 512GB + 256GB, with Exo for clustering.