Need practical local LLM advice: I only have a 4 GB RAM box from 2016
Posted by Tall-Ant-8557@reddit | LocalLLaMA | View on Reddit | 19 comments
Sorry, not a very technical person.
I’m trying to figure out the most practical local LLM setup using my spare machine:
4 GB RAM
No GPU for now, so please assume CPU-first unless I mention otherwise.
I want advice on:
- whether anything meaningful can run on 4 GB RAM
- best inference stack: Ollama vs llama.cpp vs LM Studio vs something else
- my OS is Lubuntu, in case that matters
- what you personally run on similar hardware
Interested in models for:
- chat
- coding help
- writing / summarization
- lightweight local workflows
Would appreciate recommendations.
Disposable110@reddit
https://huggingface.co/mradermacher/Llama-3.2-3B-Instruct-uncensored-GGUF Q4 on llama.cpp (CPU) works fine for chat and creative writing. Not for coding. Workflows are doable, but not out of the box; they need a lot of tweaking.
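If you go this route, a minimal CPU-only run with llama.cpp might look like the sketch below. The exact GGUF file name is an assumption (check the repo's file list first), and flag spellings can vary between builds, so verify with `llama-cli --help`:

```shell
# Grab a ~2 GB Q4 quant from the repo linked above (file name is a guess; check the repo)
wget https://huggingface.co/mradermacher/Llama-3.2-3B-Instruct-uncensored-GGUF/resolve/main/Llama-3.2-3B-Instruct-uncensored.Q4_K_M.gguf

# CPU-only generation: small context (-c) to keep RAM down, threads (-t) = physical cores
llama-cli -m Llama-3.2-3B-Instruct-uncensored.Q4_K_M.gguf \
  -c 2048 -t 2 -n 256 \
  -p "Summarize the plot of Hamlet in three sentences."
```

Keeping `-c` small matters more than usual here, since the KV cache competes with the model weights for your 4 GB.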
defective@reddit
I REALLY don't know if you'll be able to run much with 4GB RAM, in which you must also run an OS. But if I were you, I would install LM Studio.
LM Studio will allow you to search all the models on Huggingface, and download them easily, and will suggest quantizations that will fit in your available resources. So, you'll be able to quickly narrow down the list of models and test ones that will actually work on your system.
Download some, chat with them a bit, note the ones that don't seem hopelessly stupid, and continue until you have several candidates. Then you can begin investigating other inference engines like llama.cpp, which should use slightly less RAM, so maybe you could go up a bit in quant or increase your KV cache (context) more.
Also with LM Studio, you can play around with quantizing the KV cache itself, which can minimize your RAM utilization even more. It's easiest to do in LM Studio because the process is just checkboxes and dropdown menus.
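For reference, the same KV-cache knobs exist as flags in llama.cpp (flag names taken from recent builds; verify with `llama-cli --help` on yours, since V-cache quantization typically also requires flash attention):

```shell
# 4-bit K and V caches cut KV-cache RAM to roughly a quarter of fp16
llama-cli -m model.gguf -c 2048 \
  -fa \
  --cache-type-k q4_0 \
  --cache-type-v q4_0   # quantizing the V cache needs flash attention (-fa)
```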
One thing you have going for you is that your limitations in RAM at least force you to use some of the smallest (and therefore fastest) models that exist. So, you can try using swap memory, and while this will cause the speed of the model to tank, it might be worth it if you find yourself needing just a BIT bigger model to work with.
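If you want to try the swap route on Lubuntu, the standard Linux swapfile procedure is below (the 4 GB size is just an example; pick what your disk allows):

```shell
# Create and enable a 4 GB swapfile (standard Linux steps; needs root)
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
swapon --show   # confirm it's active
```

Expect generation to slow dramatically whenever the model spills into swap, especially on a spinning disk.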
Ulterior-Motive_@reddit
At best you're going to be looking at the smallest Qwen3.5 models, maybe a 4 bit quant of Qwen3.5-4B, which might teach you a thing or two about running LLMs, but as to how useful they'd be... You're gonna need to temper your expectations.
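As a sanity check on whether a 4-bit 4B model even fits, here's a back-of-envelope file-size estimate (the ~4.5 effective bits per weight for Q4_K_M-style quants is an assumption; real files vary):

```shell
# file size ≈ params × effective bits per weight ÷ 8
awk 'BEGIN { printf "4B @ ~4.5 bpw: %.2f GB\n", 4e9 * 4.5 / 8 / 1e9 }'
# → 4B @ ~4.5 bpw: 2.25 GB
```

That leaves under 2 GB for the OS, the runtime, and the KV cache, which is why it's tight.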
Tall-Ant-8557@reddit (OP)
So, these 4B models are not useful for any tasks?
Ulterior-Motive_@reddit
I wouldn't call them useless, but they're going to lack a lot of world knowledge. Summarizing a text, writing a formal email, maybe writing simple bash/powershell scripts are probably what they'd be most useful for.
Kahvana@reddit
Just 4 GB, what RAM type do you have? DDR3/4/5? What CPU do you have? (example: DDR3, Intel Core i5-5200U)
Your best bet is going to be the bonsai 4b/1.7b with koboldcpp. While bonsai 8b MIGHT work, it's going to be real tight when factoring in context too (context = how much you can talk before it starts to forget previous parts).
On my 8 GB DDR4-2400 with an Intel Pentium Silver N5000, I was able to run Bonsai 8B and have good fun with it! While it's double what you have, it fits with 8k context in ~6.5 GB RAM (with Windows taking 2.3 GB).
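That ~6.5 GB figure is roughly what a weights-plus-KV-cache estimate predicts. The sketch below uses typical 8B architecture numbers (32 layers, 8 KV heads of dimension 128, fp16 cache), which are assumptions, not Bonsai's actual specs:

```shell
# weights + fp16 KV cache at 8k context (architecture numbers are typical-8B assumptions)
awk 'BEGIN {
  weights = 8e9 * 4.5 / 8 / 1e9             # 8B params @ ~4.5 bits/weight (Q4-style quant)
  kv = 2 * 32 * 8 * 128 * 8192 * 2 / 1e9    # K+V x layers x kv_heads x head_dim x ctx x 2 bytes
  printf "weights %.1f GB + KV %.1f GB = %.1f GB before runtime overhead\n", weights, kv, weights + kv
}'
# → weights 4.5 GB + KV 1.1 GB = 5.6 GB before runtime overhead
```

Add the runtime's own overhead and you land right around the ~6.5 GB observed, and well past a 4 GB machine.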
Tall-Ant-8557@reddit (OP)
What are some of the use cases of yours for your 8GB one? Was the speed decent?
Kahvana@reddit
Bonsai 8B was mostly conversations / roleplay for fun; I've yet to test it with MCP (which I want to do when I get enough time). I mostly run specialist models for separate tasks, but I'm afraid none of those could run properly with 4 GB of RAM.
Decent speed? You're measuring seconds per token as opposed to tokens per second (t/s) on that type of hardware. The fact that Bonsai 8B ran at 1 t/s at all was very impressive to me.
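To put those speeds in perspective, here's the wait for a medium-length reply (the 200-token reply length is just an illustrative figure):

```shell
# time to generate a reply = tokens ÷ speed (t/s)
awk 'BEGIN {
  printf "200 tokens @ 1 t/s    = %d s (~3.3 min)\n", 200 / 1
  printf "200 tokens @ 0.25 t/s = %d s (~13 min)\n",  200 / 0.25
}'
# → 200 tokens @ 1 t/s    = 200 s (~3.3 min)
# → 200 tokens @ 0.25 t/s = 800 s (~13 min)
```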
HopePupal@reddit
you cannot do anything useful with 4 GB of RAM and no GPU. sorry. you can probably get used smartphones that are more powerful than that.
gh0stwriter1234@reddit
If you are running a lightweight Linux you can probably even run Gemma 4 E2B, the smallest one... at around 10 t/s, probably.
If you could scrounge up an extra 4GB you could run E4B which is getting closer to useful.
Skeptic-AI-This-User@reddit
I think there’s also a Qwen 0.5B or 0.8B model they could try. They’ll have to keep the context light either way.
gh0stwriter1234@reddit
Yeah the other thing is this is probably a ddr4 system given the time period... so probably possible to get some cheap 8gb sticks people have spare.
DangKilla@reddit
Just download Gemma 4 to an old smartphone.
Prize_Negotiation66@reddit
look at bonsai models
Last_Mastod0n@reddit
You'd likely be better off running a model on a high-end smartphone tbh
SM8085@reddit
That's pretty tight. Think around the 2B range.
Like unsloth/Qwen3.5-2B-GGUF; check the hardware compatibility estimate on the model card.
I wouldn't expect it to output much usable code. Maybe you could chat about the concepts of coding.
Summarizing should be okay.
I roll Xubuntu, I pronounce it 'zoo-buntu' because the 'xu' seems like it should be pronounced 'zoo' to me. Like Xulu.
Forward_Compute001@reddit
A screenshot from your smartwatch, running a 2b model
Forward_Compute001@reddit
- Phase 1: learn
- Phase 2: clarify what you need and want
- Phase 3: build the setup that you need for your use case
- Phase 4: scale as needed
PHASE 1: If you are just starting, you can use what you have: just try Ollama on Linux on your 4 GB machine and play for a few days.
I would recommend getting yourself a machine with a GPU or multiple GPUs and seeing if this is what you can use for your workflows, because at least one GPU, or something that can fit a model of 20-60 GB in size, is pretty much the starting point for this stuff being intelligent.
Very simple stuff can run on 10 GB, but you are missing out on a big chunk.
PHASE 2: Learn what you need to set up for your use case: containers, VMs, llama.cpp, Ollama, etc.
PHASE 3: Plan and build the hardware you need, preferably so that you can scale it easily.
PHASE 4: Scale by just adding more hardware.
borretsquared@reddit
As far as I'm aware, you will get very disappointing performance with only 4 gigs of RAM and no GPU. I would imagine that without a GPU and so little RAM, your RAM isn't the fastest either, nor your CPU. If you're really desperate to try a model, you could maybe run on swap memory, but it will be biblically slow.