What is the smallest amount of RAM sufficient to run any GGUF LLM available on HF locally?
Posted by alex20_202020@reddit | LocalLLaMA | 28 comments
-
Disclaimer: the question is theoretical, aimed at people who know how engines (e.g. llama.cpp) work.
-
"Run": I define it as being able to process a prefill of 20 tokens and generate a 20-token response within a month.
-
Since the context's KV cache needs memory and that amount is proportional to context length, "smallest amount of RAM" excludes context allocation; it also excludes memory taken by the OS itself (but includes the inference engine's executable).
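To make the "proportional to context length" point concrete, here is a rough sketch of the usual KV cache size estimate. It assumes plain multi-head or grouped-query attention; architectures that compress the cache or use sliding windows will differ. The model shape below is hypothetical and chosen only for illustration.

```python
def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to hold keys and values for n_ctx tokens in every layer."""
    # Factor 2: each layer stores one tensor of keys and one of values.
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical model shape (not taken from any specific model).
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for n_ctx in (40, 4096, 8192):
    size = kv_cache_bytes(n_ctx=n_ctx, **shape)
    print(f"{n_ctx:>5} tokens -> {size / 2**20:.1f} MiB")
```

For the 40 tokens in the "run" definition above, the cache is tiny next to the weights, which is why leaving it out of the estimate changes little.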
-
"Any": it needs to be sufficient to run every LLM currently available in GGUF format on HF (one at a time).
-
I use Linux and am interested in estimates for it, but info for other OSes is welcome.
-
The question assumes no GPU for simplicity (RAM, not RAM+VRAM, in the title); however, info on engines' ability to use very little RAM while loading into large VRAM is welcome.
Pleasant-Shallot-707@reddit
1TB
rayc25@reddit
It’s soooo hard to get 1tb of ram now. Even if you have the money, it’s so hard to buy anything for $30k+.
alex20_202020@reddit (OP)
It is only $1500 and a bit of searching second-hand market.
rayc25@reddit
You gotta choose between a downpayment on a house or 1tb ddr5 😂
alex20_202020@reddit (OP)
No, I am already buying up DDR3. Still a long way to go to 1 TB, because I want to get it for ~1000 total, and such prices are not common (but they do happen for small volumes).
FinalCap2680@reddit
Kimi K 2.6 GGUF BF16 (OP did not specify quant) is about 2TB ...
alex20_202020@reddit (OP)
And how much RAM do I need to run it? BTW it is MoE IIRC.
last_llm_standing@reddit
Didn't someone post a question on a similar topic yesterday?
alex20_202020@reddit (OP)
I have tried web search before posting, there were only e.g. "I have 4GB RAM" questions found in top results. If you could add a link...
last_llm_standing@reddit
It's about non-GPU inference: https://www.reddit.com/r/LocalLLaMA/comments/1tli757/what_is_the_current_best_small_language_model/
alex20_202020@reddit (OP)
It is very different, both in the title and in the responses it got.
superdariom@reddit
Why do you want to run them slow, and how large are we talking here? I regularly run MoE models twice the size of my VRAM.
alex20_202020@reddit (OP)
I do not want them slow, but I want to run large models on hardware I have. In this post they mention 2TB Kimi, is there any larger?
last_llm_standing@reddit
It's the same concept; how slow depends on your RAM, bandwidth, CPU, etc.
New_Spray_7886@reddit
Considering the Raspberry Pi guy here, probably 0, as everything he does is with a page file in swap.
alex20_202020@reddit (OP)
Can current engines split tensors to load them partly for multiplications?
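As a toy picture of what "loading a tensor partly" could look like: llama.cpp memory-maps the GGUF file by default, so the OS pages weights in on demand rather than the engine reading the whole file up front. The sketch below is not llama.cpp code; it is a minimal stdlib illustration with a hypothetical tiny float32 matrix, computing a matrix-vector product while touching only a few rows at a time.

```python
import mmap
import os
import struct
import tempfile

ROWS, COLS = 8, 4        # tiny hypothetical weight matrix
ROW_BYTES = COLS * 4     # size of one float32 row in bytes

# Write a row-major float32 matrix to disk: W[r][c] = r + c.
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    for r in range(ROWS):
        f.write(struct.pack(f"<{COLS}f", *(float(r + c) for c in range(COLS))))

def matvec_mapped(path, x, rows_per_chunk=2):
    """Compute W @ x while reading only rows_per_chunk rows of W at a time."""
    out = []
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for start in range(0, ROWS, rows_per_chunk):
            stop = min(start + rows_per_chunk, ROWS)
            # Slicing copies just this chunk; the OS pages it in on demand.
            chunk = mm[start * ROW_BYTES:stop * ROW_BYTES]
            for i in range(stop - start):
                row = struct.unpack_from(f"<{COLS}f", chunk, i * ROW_BYTES)
                out.append(sum(w * v for w, v in zip(row, x)))
    return out

result = matvec_mapped(path, [1.0, 0.0, 0.0, 1.0])
print(result)
```

With a mapping like this, resident memory is bounded by what the OS keeps cached, not by the file size; the price is re-reading from storage whenever pages are evicted.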
Last_Mastod0n@reddit
How slow is it? He must have a Pi with an m.2
Fast-Satisfaction482@reddit
With one month to process 20 tokens, I'm pretty sure you can pull that off with a 1kB RAM MCU and an SD card. But I don't have the numbers to back it up.
But the inference engine would be completely custom.
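For the storage side of that claim (not the 1 kB part), a back-of-the-envelope sketch of the average read rate needed when weights are streamed from storage on every pass. Assumptions, all for illustration: a 30-day month, the roughly 2 TB BF16 figure quoted in this thread taken as 2e12 bytes, the model treated as dense (an MoE would read less per token), and the prompt processed one token per pass.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month

def required_read_rate(weight_bytes, prefill_passes, generated_tokens,
                       seconds=SECONDS_PER_MONTH):
    """Average bytes/second needed if every pass re-reads all the weights."""
    passes = prefill_passes + generated_tokens
    return weight_bytes * passes / seconds

# 20 prompt tokens (one pass each, worst case) plus 20 generated tokens.
rate = required_read_rate(2e12, prefill_passes=20, generated_tokens=20)
print(f"{rate / 1e6:.1f} MB/s")
```

Under these assumptions the requirement comes out at a few tens of MB/s sustained, so storage throughput alone does not rule the idea out; compute time on a small MCU is a separate question that this sketch does not address.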
alex20_202020@reddit (OP)
Ah, I need to add to the question: use currently available engines only.
jc2046@reddit
20 tokens in a month? I want some of what you are smoking... hmmm
alex20_202020@reddit (OP)
I am getting 1 token / 2 minutes right now already on my old laptop. But in theory...
kivaougu@reddit
You're really asking if someone else can do your research, so how is this "discussion"?
alex20_202020@reddit (OP)
I have doubts about tagging, but it is IMO even further from "support question".
Craftkorb@reddit
n is the current context length and m the count of layers in the model. Assuming a context of 4096 (that's prompt + current generation) and unquantized f32, for a 20-layer model, that's 20 * 4096 * 4 = 320 KiB. When you quantize to fp16, that shrinks to half.
tamerlanOne@reddit
As a yardstick, use 2 GB of RAM = 1B of LLM parameters.
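That yardstick matches 16-bit weights at 2 bytes per parameter. A minimal sketch of the arithmetic follows; the bit widths are nominal, and real GGUF quantization formats carry some extra per-block overhead, so actual files are somewhat larger than these figures.

```python
def weight_bytes(n_params, bits_per_weight):
    """Approximate bytes needed to store the weights alone."""
    return n_params * bits_per_weight / 8

for label, bits in [("f32", 32), ("f16/bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    gb = weight_bytes(1e9, bits) / 1e9
    print(f"1B params at {label}: {gb:.1f} GB")
```

The same arithmetic gives about 2 TB for a model with roughly a trillion parameters stored in BF16, in line with the figure mentioned earlier in the thread.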
AccomplishedBoss7738@reddit
If you say any, that includes Kimi and Qwen, so 1.4 TB; if you say decent models, then 800.
Mountain_Patience231@reddit
32GB would be good, 48GB would be great.
bigattichouse@reddit
You'd probably need to be running something like Gemma 270M, and have plenty of disk space.