What is the smallest amount of RAM sufficient to run any GGUF LLM available on HF locally?

Posted by alex20_202020@reddit | LocalLLaMA | 28 comments

  1. Disclaimer: the question is theoretical, aimed at people who know how engines (e.g. llama.cpp) work.

  2. "Run": I define this as being able to prefill 20 tokens and generate a 20-token response within a month.

  3. Since the context's KV cache needs memory in proportion to the context length, "smallest amount of RAM" excludes context allocation. It also excludes memory taken by the OS itself, but includes the inference engine's executable.

  4. "Any": it needs to be sufficient to run every LLM currently available in GGUF format on HF, one at a time.

  5. I use Linux and am interested in estimates for it, but info for other OSes is welcome.

  6. The question assumes no GPU for simplicity (RAM, not RAM+VRAM, in the title); however, info on engines' ability to load a model into large VRAM while using very little RAM is welcome.
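
To put rough numbers on points 2 and 3, here is a minimal sketch. It assumes a standard transformer whose KV cache stores one K and one V vector per layer, per KV head, per token, that a "month" is 30 days, and that each of the 40 tokens costs one full forward pass (prefill not batched). The function names and the example hyperparameters (32 layers, 32 KV heads, head dimension 128, 2-byte f16 cache) are illustrative, not taken from any specific engine or model file.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: a "month" is 30 days


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Estimate KV cache size for a standard attention layout.

    The leading factor 2 accounts for storing both K and V.
    The result grows linearly with n_ctx (the context length).
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def seconds_per_token(total_tokens, budget_seconds=SECONDS_PER_MONTH):
    """Time available per token if every token gets an equal share of the budget."""
    return budget_seconds / total_tokens


if __name__ == "__main__":
    # Illustrative hyperparameters, not read from a real GGUF file.
    per_token = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, n_ctx=1)
    total = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, n_ctx=40)
    print(f"KV cache per token: {per_token / 2**20:.2f} MiB")
    print(f"KV cache for 40 tokens: {total / 2**20:.2f} MiB")

    # 20 prefill tokens + 20 generated tokens within the time budget.
    budget = seconds_per_token(20 + 20)
    print(f"Time budget per token: {budget / 3600:.1f} hours")
```

With these example values the cache for a 40-token context is small next to the weights of any large model, which is why excluding it changes the answer little for this question. The per-token budget is what bounds how slow the storage holding the weights can be.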