What is the smallest amount of RAM sufficient to run any GGUF LLM available on HF locally?

Posted by alex20_202020@reddit | LocalLLaMA | 28 comments

Disclaimer: the question is theoretical, aimed at people who know how engines (e.g. llama.cpp) work.
"Run": I define it as being able to process a 20-token prefill and generate a 20-token response within a month.
Since the KV cache needs memory proportional to context length, "smallest amount of RAM" excludes context allocation; it also excludes memory taken by the OS itself (but includes the inference engine's executable).
"Any": it must be sufficient to run every LLM currently available in GGUF format on HF (one at a time).
I use Linux and am interested in estimates for it, but info for other OSes is welcome.
The question assumes no GPU for simplicity (RAM, not RAM+VRAM in the title); however, info on engines' ability to load a model into a large VRAM while using very little RAM is welcome.

[-]

Pleasant-Shallot-707@reddit

1TB

[-]

rayc25@reddit

It’s soooo hard to get 1tb of ram now. Even if you have the money, it’s so hard to buy anything for $30k+.

[-]

alex20_202020@reddit (OP)

It’s soooo hard to get 1tb of ram now.

It is only $1500 and a bit of searching on the second-hand market.

[-]

rayc25@reddit

You gotta choose between a downpayment on a house or 1tb ddr5 😂

[-]

alex20_202020@reddit (OP)

No, I am already buying up DDR3. There is still a long way to go to 1TB because I want to get it for ~1000 total, and such prices are not common (but they do happen for small volumes).

[-]

FinalCap2680@reddit

Kimi K 2.6 GGUF BF16 (OP did not specify quant) is about 2TB ...

[-]

alex20_202020@reddit (OP)

is about 2TB

And how much RAM do I need to run it? BTW it is MoE IIRC.

[-]

last_llm_standing@reddit

Didn't someone post a question on a similar topic yesterday?

[-]

alex20_202020@reddit (OP)

Didn't someone post a question on a similar topic yesterday?

I tried a web search before posting; the top results only had questions like "I have 4GB RAM". If you could add a link...

[-]

last_llm_standing@reddit

It's about non-GPU inference: https://www.reddit.com/r/LocalLLaMA/comments/1tli757/what_is_the_current_best_small_language_model/

[-]

alex20_202020@reddit (OP)

It is very different, both in the title and in the responses it got.

[-]

superdariom@reddit

Why do you want to run them slow, and how large are we talking here? I regularly run MoE models twice the size of my VRAM.

[-]

alex20_202020@reddit (OP)

I do not want them slow, but I want to run large models on the hardware I have. This post mentions a 2TB Kimi; is there anything larger?

[-]

last_llm_standing@reddit

It's the same concept; how slow depends on your RAM, bandwidth, CPU, etc.

[-]

New_Spray_7886@reddit

Considering the Raspberry Pi guy here, probably 0, as everything he does runs out of swap.

[-]

alex20_202020@reddit (OP)

probably 0

Can current engines split tensors to load them partly for multiplications?

[-]

Last_Mastod0n@reddit

How slow is it? He must have a Pi with an M.2 drive.

[-]

Fast-Satisfaction482@reddit

With one month to process 20 tokens, I'm pretty sure you can pull that off with a 1kB RAM MCU and an SD card. But I don't have the numbers to back it up.

But the inference engine would be completely custom.
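
For rough numbers, here is a back-of-envelope sketch of the storage read rate the one-month budget implies. It assumes a dense model where every weight is read once per token, a 30-day month, and no batching of the prefill; the function name `required_read_rate` is made up for illustration. It ignores compute time entirely, and an MoE model would touch fewer bytes per token.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: a 30-day month

def required_read_rate(model_bytes, passes, budget_s=SECONDS_PER_MONTH):
    """Bytes per second needed to stream the full weights `passes` times within the budget."""
    return model_bytes * passes / budget_s

model_bytes = 2 * 10**12  # ~2 TB, the largest figure quoted in this thread
passes = 40               # worst case: one full pass per token (20 prefill + 20 generated)

rate = required_read_rate(model_bytes, passes)
print(f"{rate / 1e6:.1f} MB/s sustained read")
```

Under these assumptions the required rate is in the tens of MB/s, so storage bandwidth alone does not rule the idea out; whether a tiny MCU can do the arithmetic in time is a separate question.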

[-]

alex20_202020@reddit (OP)

But the inference engine would be completely custom.

Ah, I need to add to the question: use currently available engines only.

[-]

jc2046@reddit

20 tokens in a month? I want some of what you are smoking... hmmm

[-]

alex20_202020@reddit (OP)

I am getting 1 token / 2 minutes right now already on my old laptop. But in theory...

[-]

kivaougu@reddit

You're really asking someone else to do your research, so how is this a "discussion"?

[-]

alex20_202020@reddit (OP)

"discussion"

I have doubts about tagging, but it is IMO even further from "support question".

[-]

Craftkorb@reddit

You could load each block of the model file from disk on demand. I'll assume GGUF Q8_0. Technically you'd only need a few KiB for metadata bookkeeping, plus 34 bytes per block (a 2-byte FP16 scale and 32 int8 values). Dequantized to F32, a block is just 4x32 = 128 bytes.
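
A minimal sketch of that idea, assuming the Q8_0 layout of one FP16 scale followed by 32 int8 values per block. This is not how llama.cpp is implemented; `pack_row` and `streamed_dot` are hypothetical helpers, and an in-memory `io.BytesIO` stands in for the model file on disk. Note that the activation vector `x` still has to sit in memory.

```python
import io
import struct

QK8_0 = 32                      # values per block
BLOCK = struct.Struct("<e32b")  # fp16 scale + 32 int8 quants = 34 bytes

def pack_row(values):
    """Quantize floats (length a multiple of 32) into Q8_0-style blocks."""
    out = bytearray()
    for i in range(0, len(values), QK8_0):
        chunk = values[i:i + QK8_0]
        amax = max(abs(v) for v in chunk)
        d = amax / 127 if amax else 0.0          # per-block scale
        qs = [round(v / d) if d else 0 for v in chunk]
        out += BLOCK.pack(d, *qs)                # scale is stored as fp16
    return bytes(out)

def streamed_dot(stream, x):
    """Dot product of one quantized row with x, reading one 34-byte block at a time."""
    acc = 0.0
    for i in range(0, len(x), QK8_0):
        d, *qs = BLOCK.unpack(stream.read(BLOCK.size))
        acc += d * sum(q * xv for q, xv in zip(qs, x[i:i + QK8_0]))
    return acc

row = [0.01 * (i - 32) for i in range(64)]   # toy weight row
x = [1.0] * 64                               # toy activation vector
approx = streamed_dot(io.BytesIO(pack_row(row)), x)
exact = sum(r * xv for r, xv in zip(row, x))
print(approx, exact)
```

Only one block of weights is resident at any moment; the price is one tiny read per block, which is where the "within a month" budget gets spent.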
For the KV cache, you typically need O(n*m*d) storage, where n is the current context length, m the number of layers in the model, and d the width of the K and V vectors per layer. Assuming a context of 4096 (that's prompt plus current generation) and unquantized F32, a 20-layer model needs 20*4096*4 bytes = 320 KiB per cached value per token position; the total is that times 2 (K and V) times d. Storing the cache as FP16 halves it.
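
To put numbers on the KV cache, here is a sketch of the usual size estimate for a transformer with grouped KV heads. The 20 layers and 4096 context come from the comment above; the 8 KV heads and head dimension of 128 are assumptions picked for illustration, not taken from any specific model.

```python
def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_elem):
    """Size of the K and V caches: one K and one V vector per layer per position."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# 20 layers and 4096 context as in the comment; head counts are hypothetical.
f32 = kv_cache_bytes(n_layers=20, n_ctx=4096, n_kv_heads=8, head_dim=128, bytes_per_elem=4)
f16 = kv_cache_bytes(n_layers=20, n_ctx=4096, n_kv_heads=8, head_dim=128, bytes_per_elem=2)

print(f"F32: {f32 / 2**20:.0f} MiB, F16: {f16 / 2**20:.0f} MiB")
```

For the 20 + 20 token run defined in the post, `n_ctx` would be 40 rather than 4096, which shrinks the cache by roughly a factor of 100.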

[-]