Need practical local LLM advice: I only have a 4 GB RAM box from 2016
Posted by Tall-Ant-8557@reddit | LocalLLaMA | View on Reddit | 19 comments
Sorry, not a very technical person.
I’m trying to figure out the most practical local LLM setup using my spare machine:
4 GB RAM
No GPU for now, so please assume CPU-first unless I mention otherwise.
I want advice on:
- whether anything meaningful can run on 4 GB RAM
- best inference stack: Ollama vs llama.cpp vs LM Studio vs something else
- my OS is Lubuntu, in case that matters
- what you personally run on similar hardware
Interested in models for:
- chat
- coding help
- writing / summarization
- lightweight local workflows
Would appreciate recommendations.
Disposable110@reddit
https://huggingface.co/mradermacher/Llama-3.2-3B-Instruct-uncensored-GGUF Q4 on llama.cpp (CPU) works fine for chat and creative writing. Not for coding. Workflows are doable, but not out of the box; they need a lot of tweaking.
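If you go this route, a minimal CPU-only run with llama.cpp might look like the sketch below. The exact GGUF file name is an assumption (check the repo's file list first), and flag spellings can vary between builds, so verify with `llama-cli --help`:

```shell
# Grab a ~2 GB Q4 quant from the repo linked above (file name is a guess; check the repo)
wget https://huggingface.co/mradermacher/Llama-3.2-3B-Instruct-uncensored-GGUF/resolve/main/Llama-3.2-3B-Instruct-uncensored.Q4_K_M.gguf

# CPU-only generation: small context (-c) to keep RAM down, threads (-t) = physical cores
llama-cli -m Llama-3.2-3B-Instruct-uncensored.Q4_K_M.gguf \
  -c 2048 -t 2 -n 256 \
  -p "Summarize the plot of Hamlet in three sentences."
```

Keeping `-c` small matters more than usual here, since the KV cache competes with the model weights for your 4 GB.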
defective@reddit
I REALLY don't know if you'll be able to run much with 4GB RAM, in which you must also run an OS. But if I were you, I would install LM Studio.
LM Studio will allow you to search all the models on Huggingface, and download them easily, and will suggest quantizations that will fit in your available resources. So, you'll be able to quickly narrow down the list of models and test ones that will actually work on your system.
Download some, chat with them a bit, note the ones that don't seem hopelessly stupid, and continue until you have several candidates. Then you can begin investigating other inference engines like llama.cpp, which should use slightly less RAM, so maybe you could go up a bit in quant or increase your KV cache (context) more.
Also with LM Studio, you can play around with quantizing the KV cache itself, which can minimize your RAM utilization even more. It's easiest to do in LM Studio because the process is just checkboxes and dropdown menus.
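For reference, the same KV-cache knobs exist as flags in llama.cpp (flag names taken from recent builds; verify with `llama-cli --help` on yours, since V-cache quantization typically also requires flash attention):

```shell
# 4-bit K and V caches cut KV-cache RAM to roughly a quarter of fp16
llama-cli -m model.gguf -c 2048 \
  -fa \
  --cache-type-k q4_0 \
  --cache-type-v q4_0   # quantizing the V cache needs flash attention (-fa)
```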
One thing you have going for you is that your limitations in RAM at least force you to use some of the smallest (and therefore fastest) models that exist. So, you can try using swap memory, and while this will cause the speed of the model to tank, it might be worth it if you find yourself needing just a BIT bigger model to work with.
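If you want to try the swap route on Lubuntu, the standard Linux swapfile procedure is below (the 4 GB size is just an example; pick what your disk allows):

```shell
# Create and enable a 4 GB swapfile (standard Linux steps; needs root)
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
swapon --show   # confirm it's active
```

Expect generation to slow dramatically whenever the model spills into swap, especially on a spinning disk.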
Ulterior-Motive_@reddit
At best you're going to be looking at the smallest Qwen3.5 models, maybe a 4 bit quant of Qwen3.5-4B, which might teach you a thing or two about running LLMs, but as to how useful they'd be... You're gonna need to temper your expectations.
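As a sanity check on whether a 4-bit 4B model even fits, here's a back-of-envelope file-size estimate (the ~4.5 effective bits per weight for Q4_K_M-style quants is an assumption; real files vary):

```shell
# file size ≈ params × effective bits per weight ÷ 8
awk 'BEGIN { printf "4B @ ~4.5 bpw: %.2f GB\n", 4e9 * 4.5 / 8 / 1e9 }'
# → 4B @ ~4.5 bpw: 2.25 GB
```

That leaves under 2 GB for the OS, the runtime, and the KV cache, which is why it's tight.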
Tall-Ant-8557@reddit (OP)
So, these 4B models are not useful for any tasks?
Ulterior-Motive_@reddit
I wouldn't call them useless, but they're going to lack a lot of world knowledge. Summarizing a text, writing a formal email, maybe writing simple bash/powershell scripts are probably what they'd be most useful for.
Kahvana@reddit
Just 4 GB, what RAM type do you have? DDR3/4/5? What CPU do you have? (example: DDR3, Intel Core i5-5200U)
Your best bet is going to be the bonsai 4b/1.7b with koboldcpp. While bonsai 8b MIGHT work, it's going to be real tight when factoring in context too (context = how much you can talk before it starts to forget previous parts).
On my 8 GB DDR4-2400 with an Intel Pentium Silver N5000, I was able to run Bonsai 8B and have good fun with it! While it's double what you have, it fits with 8k context in ~6.5 GB RAM (with Windows taking 2.3 GB).
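That ~6.5 GB figure is roughly what a weights-plus-KV-cache estimate predicts. The sketch below uses typical 8B architecture numbers (32 layers, 8 KV heads of dimension 128, fp16 cache), which are assumptions, not Bonsai's actual specs:

```shell
# weights + fp16 KV cache at 8k context (architecture numbers are typical-8B assumptions)
awk 'BEGIN {
  weights = 8e9 * 4.5 / 8 / 1e9             # 8B params @ ~4.5 bits/weight (Q4-style quant)
  kv = 2 * 32 * 8 * 128 * 8192 * 2 / 1e9    # K+V x layers x kv_heads x head_dim x ctx x 2 bytes
  printf "weights %.1f GB + KV %.1f GB = %.1f GB before runtime overhead\n", weights, kv, weights + kv
}'
# → weights 4.5 GB + KV 1.1 GB = 5.6 GB before runtime overhead
```

Add the runtime's own overhead and you land right around the ~6.5 GB observed, and well past a 4 GB machine.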
Tall-Ant-8557@reddit (OP)
What are some of the use cases of yours for your 8GB one? Was the speed decent?
Kahvana@reddit
Bonsai 8B was mostly conversations / roleplay for fun; I've yet to test it with MCP (which I want to do when I get enough time). I mostly run specialist models for separate tasks, but I'm afraid none of those could run properly with 4 GB of RAM.
Decent speed? You're measuring seconds per token as opposed to tokens per second (t/s) on that type of hardware. The fact that Bonsai 8B ran at 1 t/s at all was very impressive to me.
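To put those speeds in perspective, here's the wait for a medium-length reply (the 200-token reply length is just an illustrative figure):

```shell
# time to generate a reply = tokens ÷ speed (t/s)
awk 'BEGIN {
  printf "200 tokens @ 1 t/s    = %d s (~3.3 min)\n", 200 / 1
  printf "200 tokens @ 0.25 t/s = %d s (~13 min)\n",  200 / 0.25
}'
# → 200 tokens @ 1 t/s    = 200 s (~3.3 min)
# → 200 tokens @ 0.25 t/s = 800 s (~13 min)
```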
HopePupal@reddit
you cannot do anything useful with 4 GB of RAM and no GPU. sorry. you can probably get used smartphones that are more powerful than that.
gh0stwriter1234@reddit
If you are running a lightweight Linux you can probably even run Gemma 4 E2B, the smallest one... at around 10 t/s, probably.
If you could scrounge up an extra 4GB you could run E4B which is getting closer to useful.
Skeptic-AI-This-User@reddit
I think there’s also a Qwen 0.5B or 0.8B model they could try. They’ll have to keep the context light either way.
gh0stwriter1234@reddit
Yeah the other thing is this is probably a ddr4 system given the time period... so probably possible to get some cheap 8gb sticks people have spare.
DangKilla@reddit
Just download Gemma 4 to an old smartphone.
Prize_Negotiation66@reddit
look at bonsai models
Last_Mastod0n@reddit
You'd likely be better off running a model on a high-end smartphone tbh
SM8085@reddit
That's pretty tight. Think around the 2B range.
Like unsloth/Qwen3.5-2B-GGUF; check the hardware compatibility estimate on the model card.
I wouldn't expect it to output much usable code. Maybe you could chat about the concepts of coding.
Summarizing should be okay.
I roll Xubuntu, I pronounce it 'zoo-buntu' because the 'xu' seems like it should be pronounced 'zoo' to me. Like Xulu.
Forward_Compute001@reddit
A screenshot from your smartwatch, running a 2b model
Forward_Compute001@reddit
- Phase 1: learn
- Phase 2: clarify what you need and want
- Phase 3: build the setup that you need for your use case
- Phase 4: scale as needed
PHASE 1: If you are just starting, you can use what you have: just try Ollama on Linux on your 4 GB machine and play for a few days.
I would recommend getting yourself a machine with a GPU or multiple GPUs and seeing if this is what you can use for your workflows, because at least one GPU, or something that can fit a model of 20-60 GB in size, is pretty much the starting point for this stuff being intelligent.
Very simple stuff can run on 10 GB, but you are missing out on a big chunk.
PHASE 2: Learn what you need to set up for your use case: containers, VMs, llama.cpp, Ollama, etc.
PHASE 3: Plan and build the hardware you need, preferably so that you can scale it easily.
PHASE 4: Scale by just adding more hardware.
borretsquared@reddit
As far as I'm aware, you will get very disappointing performance with only 4 gigs of RAM and no GPU. I would imagine that without a GPU and so little RAM, your RAM isn't the fastest either, nor your CPU. If you're really desperate to try a model, you could maybe run on swap memory, but it will be biblically slow.