What kind of consumer computer can run Kimi-K2.6-GGUF which is a 585GB download?
Posted by THenrich@reddit | LocalLLaMA | View on Reddit | 29 comments
I read today about the release of Kimi K2.6.
In LM Studio on Windows it shows the download size of the model as 585GB.
What kind of Windows machine can run this monster model?
What minimum RAM and VRAM are needed to run it at a reasonable speed?
jeffwadsworth@reddit
This is one of those "if you have to ask..." kind of questions. Unless you're rich, find a rich enthusiast friend; otherwise it will break the bank. No consumer system comes even close to being viable. A custom system will run around $100K minimum just to run the quants at decent t/s.
LatentSpacer@reddit
Nah, for $50-70K you could build a machine with 4x RTX 6000 Pro, an EPYC CPU, and 512GB RAM. That would be 384GB VRAM + 512GB RAM = 896GB total memory. I think it could run some pretty big models at decent speeds. You could even add a few more RTX 6000s.
Turbulent_Pin7635@reddit
Define "decent" in numbers...
chibop1@reddit
Don't try this at home!
Long_comment_san@reddit
And why would you need it?
Ok_Mammoth589@reddit
The fuck are you talking about? People were losing their everloving minds that Anthropic stopped letting them pay $200/month for Claude. Clearly there are reasons
Long_comment_san@reddit
Man, take a break.
ScaredyCatUK@reddit
Become friends with Alex Ziskind, https://www.youtube.com/watch?v=FD6i0htqLew and if that doesn't work, try Exo
Excellent_Screen_653@reddit
You can do that on a first gen Raspberry Pi mate lol
Annual_Award1260@reddit
I can run it on my old ddr4 1TB ram workstation. I get a blistering 2 tok/sec
j_osb@reddit
On Zen 5 EPYCs you can get upwards of 15 t/s tg if you populate all 12 channels.
LatentSpacer@reddit
Which quant?
Kerem-6030@reddit
lol
OutrageousMinimum191@reddit
The IQ2_S quant runs on my 384GB DDR5 server at 7 t/s
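The t/s figures in this subthread line up with a memory-bandwidth-bound estimate: each decoded token has to stream the model's active weights from RAM once. A rough sketch, assuming Kimi K2's ~32B active parameters and peak JEDEC bandwidths (real throughput is typically a fraction of this upper bound):

```python
# Back-of-envelope decode speed for a memory-bandwidth-bound MoE model.
# Assumptions (not measured): ~32B active params per token, ~2.5 bits/weight
# average for a small quant, and peak per-channel DDR5 bandwidth.

def peak_bandwidth_gbs(channels: int, mts: int = 4800) -> float:
    """Peak DRAM bandwidth in GB/s: channels * 8 bytes * MT/s."""
    return channels * 8 * mts / 1000

def decode_tps_upper_bound(bw_gbs: float, active_params_b: float,
                           bits_per_weight: float) -> float:
    """Each token must stream the active weights from RAM once."""
    active_gb = active_params_b * bits_per_weight / 8
    return bw_gbs / active_gb

# A 12-channel Zen 5 EPYC (DDR5-6000) vs. an 8-channel DDR5-4800 box.
for channels, mts in [(12, 6000), (8, 4800)]:
    bw = peak_bandwidth_gbs(channels, mts)
    tps = decode_tps_upper_bound(bw, active_params_b=32, bits_per_weight=2.5)
    print(f"{channels}ch DDR5-{mts}: {bw:.0f} GB/s -> <= {tps:.0f} t/s theoretical")
```

The observed 15 t/s and 7 t/s above are roughly a quarter to a half of these theoretical ceilings, which is in the usual range for real-world efficiency.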
Euphoric_Emotion5397@reddit
Those are really hardcore enthusiasts. You should have seen how many people here have homelabs with 512GB.
Cergorach@reddit
None.
Digger412@reddit
Minimum VRAM is something like 24GB to hold the KV cache plus attention, and the smallest quant I published of K2.5 (and likely of K2.6) was 262GiB / 281GB, so you're looking at a minimum of ~256GB of RAM and a smaller Ubergarm or Unsloth quant.
I have this older sweep bench on Linux + 2x 3090s + 12-channel RAM with the "full quality" Q4_X. Performance should be about the same for K2.6, just as a point of reference. I've upgraded to 8x 6000 Pros since then and haven't re-benched yet, but I'll try to later tonight.
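Those quant sizes are consistent with K2's roughly 1T total parameters; a hedged sketch of the weight-size arithmetic (the parameter count and average bits/weight are assumptions, not published specs):

```python
# Approximate on-disk size of a GGUF quant: total params * bits / 8.
# Illustrative only; real quants mix bit widths across tensor types.

GIB = 1024**3

def quant_size_gib(total_params_billion: float, bits_per_weight: float) -> float:
    """Weight bytes for a uniform average quantization level, in GiB."""
    return total_params_billion * 1e9 * bits_per_weight / 8 / GIB

# A ~1T-param model at an average of ~2.25 bits/weight lands near
# the 262 GiB figure quoted above.
print(f"{quant_size_gib(1000, 2.25):.0f} GiB")  # -> 262 GiB
```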
streppelchen@reddit
Interested to get the 6000 pro numbers :)
Digger412@reddit
Sweep-bench results on llama.cpp:
streppelchen@reddit
Thanks! I suppose you don't use the machine alone, but serve more users? Then the only question left to answer (for my curiosity) is vLLM speeds with concurrency.
Digger412@reddit
I use the machine for myself, producing quants (I'm AesSedai on HF), doing research, hosting model showcases in the BeaverAI server, etc.
There is definitely multi-user / concurrency / batching usage on it for the model showcases, I'm working on a de-slop rewriter pipeline currently and that benefits from massive parallelism + VRAM for slop phrase clustering with HDBSCAN and friends, batching embeddings, DSPy rollouts, abliteration research, and more.
The two 3090s I had were really good for doing single-user inference with llama.cpp, but having these I can really dig into more academic-grade workloads :)
streppelchen@reddit
Awesome, thanks again!
No-Juggernaut-9832@reddit
A 4x RTX 6000 rig will run you a minimum of $60K. That's not including a probably dedicated 220-240V, 40-50 amp circuit (dryer socket). Or 2-3 Mac Studios with 512GB.
There are smaller models that are just as capable at 1/2 or 1/3 the rig size. MiniMax 2.7, for example, will work in 128-192GB.
ImportancePitiful795@reddit
Tbh, the moment someone goes into the $60K range with the 4 RTX 6000s, they'd be better off getting an ARS-111GL-DNHR-LCC with 2 GH200s (albeit without NVLink), or an ARS-221GL-NHIR for a bit more, which has NVLink.
ImportancePitiful795@reddit
Supermicro ARS-221GL-NHIR. It costs around €75,000.
You need those 2 GH200s with NVLink.
Unless you want to gamble on the half-priced ARS-111GL-DNHR-LCC and the slower interconnect between its 2 GH200s.
Southern_Sun_2106@reddit
2 Mac Studios (M3 Ultras) will run this at slow chat speed. I just ran GLM 5.1 4-bit (a much, much smaller model) and it could not power Claude Code; it kept timing out. 17 t/s is good for chatting though. If you go for the jet-engine / whole-house-heater "Windows machine", that would be, as someone said, like a second mortgage, and it will be technologically outdated within the next year or two. Long story short: not practical to run such models at home at the moment. On the positive side, if you can wait several months, a smaller but just as capable model will show up.
CatalyticDragon@reddit
A second hand EPYC server you found with 2TB of RAM running Windows Server.
Miriel_z@reddit
"Consumer" computer sounds a bit downplayed. I am literally crushed, I thought my new laptop is decent.
eclipsegum@reddit
The most reasonable setup that can run this isn't going to be Windows. It would be 2 Mac Studio M3 Ultras: 512GB x2, or 512GB + 256GB, with Exo for clustering.