Is it possible to run some simple LLM (e.g. llama2) using very low amounts of RAM (e.g. 16MB)?
Posted by galapag0@reddit | LocalLLaMA | View on Reddit | 28 comments
I'm wondering whether it's possible to run a small llama2 LLM on MS-DOS as a fun side project. In theory it should be possible to compile it with Open Watcom 2 (assuming we change the C file to be C89) and replace the mmap call with a malloc (while dealing with the limited memory). Any hints?
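For the mmap-to-malloc part, here is a minimal sketch of what a llama2.c-style checkpoint loader could look like without mmap, written with C89-style declarations for an Open Watcom build. The function name and file layout are illustrative assumptions, not the actual llama2.c API; on a 16 MB machine this malloc will fail for anything but tiny models, which is the real problem.

```c
/* Hypothetical sketch: load a whole checkpoint with malloc + fread
 * instead of mmap. Names and layout are illustrative, not llama2.c's. */
#include <stdio.h>
#include <stdlib.h>

static float *load_weights(const char *path, long *n_floats)
{
    FILE *f;
    long bytes;
    float *weights;

    f = fopen(path, "rb");
    if (f == NULL) return NULL;

    fseek(f, 0L, SEEK_END);
    bytes = ftell(f);
    fseek(f, 0L, SEEK_SET);

    /* With only 16 MB of RAM this allocation fails for any real model,
     * so the weights would have to be streamed from disk instead. */
    weights = (float *)malloc((size_t)bytes);
    if (weights == NULL) { fclose(f); return NULL; }

    if (fread(weights, 1, (size_t)bytes, f) != (size_t)bytes) {
        free(weights);
        fclose(f);
        return NULL;
    }
    fclose(f);

    *n_floats = bytes / (long)sizeof(float);
    return weights;
}
```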
upquarkspin@reddit
I run Llama 3.2 1B at 21.8 t/s on an iPhone 13... ;)
SiEgE-F1@reddit
Except you're missing the point that the iPhone 13 has a 6-core CPU running at 1.8-3.2 GHz, and the phone is about as performant as most old, low-end PCs.
Basically, it's the same as saying "I ran AI on my i7-3770, DDR3 machine", which is unimpressive and pretty much pointless.
upquarkspin@reddit
Agree!
ethereel1@reddit
This isn't as crazy as it sounds. In fact, as a goal, it's rather excellent. To get you started, look at the TinyStories model on HF, the one that's under 700K. Then think about how you might make something similar but more useful. That thought will lead you to graphs, and from there to GANs, and onward in your own research direction. The more extreme the constraint, the more valuable the innovation. Good luck!
SiEgE-F1@reddit
MS-DOS with 16 MB is outdated tech.
Trying to run LLMs on that is like trying to use your Apple II to watch 4K video on YouTube: even if you manage to do it, it'll be a long, painful project where you end up redoing everything from scratch anyway - new YouTube, new browser, new player...
Your best hope for AI on that hunk of junk is a bunch of if-else branches :) I mean, even those alone already eat most of the hardware's performance.
utf80@reddit
Good question!
N8Karma@reddit
Sure - you could use https://huggingface.co/roneneldan/TinyStories-8M at 4-bit quantization.
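Back-of-envelope (assuming roughly 8M parameters): 8M × 4 bits ≈ 4 MB of weights, so the weights alone would fit in a 16 MB budget, though activations, the KV cache, and the runtime itself still have to squeeze into what's left.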
Radiant_Dog1937@reddit
If you could it would be agonizingly slow.
galapag0@reddit (OP)
How slow would it be?
Radiant_Dog1937@reddit
If you got .001 tokens per hour, I would not be shocked.
Seuros@reddit
V
Seuros@reddit
Slow
Uncle___Marty@reddit
Like
Uncle___Marty@reddit
Really
Yapper_Zipper@reddit
Sl
LilPsychoPanda@reddit
o
Competitive-Dark5729@reddit
W
Xhehab_@reddit
ly
cms2307@reddit
Llama 2 isn't any smaller than Llama 3, it's just worse. I seriously doubt any model can run in less than a gigabyte or two of RAM, and even then it would have to be extremely small (millions of parameters).
galapag0@reddit (OP)
Uhm, aren't Bitnet models only 70 MB or so?
cms2307@reddit
Bitnet does take up less but good luck finding anything that uses it and works well
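For scale (rough numbers, assuming BitNet's ~1.58 bits per weight): a 70 MB model works out to about 70 × 8 / 1.58 ≈ 350M parameters, and conversely fitting the weights in 16 MB of RAM would cap you at roughly 80M parameters, before counting activations or the runtime.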
galapag0@reddit (OP)
What’s the issue with Bitnet?
LoafyLemon@reddit
Adoption rate and compatibility
WaveCut@reddit
Look into the smallest GPT-2 implementations, but still, getting down to 16 MB is unlikely to happen, and a model that size would be of no use anyway.
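For reference, GPT-2 small is about 124M parameters, i.e. roughly 500 MB in fp32 or on the order of 60 MB at 4-bit - still well above a 16 MB budget.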
Competitive-Dark5729@reddit
While it may be theoretically possible (with an insane amount of tedious work), it's practically impossible.
You can call an API though - host a server running a model on another machine and send requests from there.
Yapper_Zipper@reddit
You can only compress so much information, and the smallest I could run in a browser with good-enough results is the Qwen 0.5B GGUF model (~500 MB). You need memory at the end of the day if you want something useful. Otherwise, try the llama2.c stories example from Andrej Karpathy: a very minimal implementation of LLaMA in C, trained on the TinyStories dataset. It's not ChatGPT, but it generates some text.
Content_One5405@reddit
Smallest Llama is 1B. That's several GB of memory.
With 16 MB you would need many thousands of long-term-storage requests. You would use something like the Strassen algorithm.
https://avikdas.com/2019/04/25/dynamic-programming-deep-dive-chain-matrix-multiplication.html
Requests are more numerous than just the model size divided by the RAM, because the same data is requested several times.
Assuming the system uses an HDD, with 10 ms per request and 10k requests, you would get about 100 seconds per token for the smallest model - assuming you can implement a fancy matrix multiplication algorithm (see the sketch at the end of this comment).
There is no reason to use llama2 anymore; llama3 has small models as well, like 1B.
I doubt that Llama can work with such low memory, and you will likely need to implement your own algorithm for the neural-network computation - a naive algorithm that requests everything from the HDD would take something like a day per token.
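To make the many-requests-per-token point concrete, here is a minimal sketch of a matrix-vector multiply that streams weight rows from disk in blocks so the working set stays within a ~16 MB budget. The file layout (row-major float32), block size, and function name are assumptions for illustration, not code from any existing runtime:

```c
/* Hypothetical sketch: y = W * x where W lives on disk as row-major
 * float32 and only BLOCK_ROWS rows are held in RAM at a time. */
#include <stdio.h>
#include <stdlib.h>

#define BLOCK_ROWS 64 /* rows loaded per read; tune so the block fits in RAM */

int matvec_from_disk(const char *weights_path,
                     const float *x, float *y,
                     long rows, long cols)
{
    FILE *f;
    float *block;
    long r, i, j;

    f = fopen(weights_path, "rb");
    if (f == NULL) return -1;

    block = (float *)malloc((size_t)BLOCK_ROWS * (size_t)cols * sizeof(float));
    if (block == NULL) { fclose(f); return -1; }

    for (r = 0; r < rows; r += BLOCK_ROWS) {
        long nrows = (rows - r < BLOCK_ROWS) ? (rows - r) : BLOCK_ROWS;

        /* One read per block: this is where the ~10 ms-per-request HDD
         * latency gets multiplied by thousands of blocks per token. */
        if ((long)fread(block, sizeof(float), (size_t)(nrows * cols), f)
                != nrows * cols) {
            free(block);
            fclose(f);
            return -1;
        }

        for (i = 0; i < nrows; i++) {
            float acc = 0.0f;
            for (j = 0; j < cols; j++) {
                acc += block[i * cols + j] * x[j];
            }
            y[r + i] = acc;
        }
    }

    free(block);
    fclose(f);
    return 0;
}
```

This is blocked streaming rather than Strassen proper; transformer inference is dominated by matrix-vector products, so the number of disk reads per token matters far more than the multiply count.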
Uncle___Marty@reddit
I mean, yeah, this is possible, but you're going to have to use a horrible model that won't even make sense. The inference is also going to be super slow. I've no idea what kind of CPU you'll be running (and the RAM speed will make a HUGE difference), but expect things to be way slower than running an AI on a modern phone.
You said "fun side project", brother/sister - this sounds like no fun I know of... But we're all different, so if this is fun for you then go ahead; just be warned this will be slow as hell.