Cannot even run the smallest model on system RAM?
Posted by FloJak2004@reddit | LocalLLaMA | 22 comments

I am a bit confused. I am trying to run small LLMs on my Unraid server within the Ollama docker, using just the CPU and 16GB of system RAM.
Got Ollama up and running, but even when pulling the smallest models like Qwen 3 0.6B with Q4_K_M quantization, Ollama tells me I need way more RAM than I have left to spare. Why is that? Shouldn't this model run on any potato? Does this have to do with context overhead?
Sorry if this is a stupid question, I am trying to learn more about this and cannot find the solution anywhere else.
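For scale, a rough weights-only estimate (assuming Q4_K_M averages around 4.5 to 5 bits per weight): 0.6B params x ~4.75 bits / 8 bits per byte ≈ 0.35 to 0.5 GB of weights on disk. So the weights themselves should fit on just about anything; whatever extra RAM Ollama asks for must be coming from somewhere else.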
GortKlaatu_@reddit
What about just running: ollama run qwen3:0.6b
FloJak2004@reddit (OP)
You are correct, that works out of the box just fine! Interesting
yoracale@reddit
It's because we set the context length higher by default. You can turn it off if you'd like!
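A back-of-the-envelope estimate of where that memory goes, assuming Qwen3 0.6B's published config (28 layers, 8 KV heads, head dim 128; check the model card if unsure) and an fp16 KV cache:
KV cache per token ≈ 2 (K+V) x 28 layers x 8 KV heads x 128 dims x 2 bytes ≈ 112 KiB
At a 40,960-token context: ≈ 4.4 GiB per parallel slot, on top of the weights and compute buffers
So on a tiny model the context allocation dwarfs the weights, and multiple parallel slots multiply it again; lowering the context length (or OLLAMA_NUM_PARALLEL) is where the savings are.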
hainesk@reddit
For your reference, when I run your command, Ollama shows it's using 13GB. It goes down to 7.3GB if I set num_parallel to 1; setting flash attention does about the same. Setting flash attention, num_parallel=1, and a Q8 KV cache together brings it down to 4.1GB.
These are the environment variables I use in the Ollama service file (sudo systemctl edit ollama.service).
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_ORIGINS=*"
techmago@reddit
try this.
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
FloJak2004@reddit (OP)
Thanks for the detailed guide! I'll try this if I can somehow find this directory. I am running Ollama inside a docker container and am somewhat of a Linux noob unfortunately.
techmago@reddit
Oh, it's a Docker run. I think you should just pass them as args to docker run then:
docker run -d --name=ollama --restart=unless-stopped -e OLLAMA_FLASH_ATTENTION=1 -e OLLAMA_KV_CACHE_TYPE=q8_0 blablabla image
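Since you're on Unraid, the same variables can probably just be added in the container template instead (Edit the container, then the button labelled something like "Add another Path, Port, Variable, Label or Device"). For completeness, a fuller sketch of the command with the usual port and volume mappings (names and paths here are just examples):
docker run -d --name=ollama --restart=unless-stopped -p 11434:11434 -v ollama:/root/.ollama -e OLLAMA_FLASH_ATTENTION=1 -e OLLAMA_KV_CACHE_TYPE=q8_0 -e OLLAMA_NUM_PARALLEL=1 ollama/ollama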
dani-doing-thing@reddit
Try lower quants or smaller ctx sizes; only 9GiB of RAM seems to be available there... Also, llama.cpp will probably be lighter than Ollama.
There are even smaller LLMs like SmolLM...
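For reference, a minimal llama.cpp sketch with the context capped (the GGUF filename is just a placeholder for whatever you download):
llama-server -m Qwen3-0.6B-Q4_K_M.gguf -c 8192 --host 0.0.0.0 --port 8080
The -c flag caps the context window, which is where most of the memory goes; the server then exposes an OpenAI-compatible API on that port.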
FloJak2004@reddit (OP)
Somebody suggested simply running the official Qwen3 0.6b version - which works fine and uses minimal RAM. Maybe the Unsloth quant defaults to a larger context size and therefore is more RAM hungry?
dani-doing-thing@reddit
Could be. I'm not sure how Ollama handles getting a model from an HF repo; maybe it's using the default 40960 context size?
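One way to check is ollama show, which prints the model's context length and any Modelfile parameters (the HF tag below is just a guess at what was pulled):
ollama show hf.co/unsloth/Qwen3-0.6B-GGUF:Q4_K_M
ollama show qwen3:0.6b
Comparing the two outputs should show whether the Unsloth pull ships with a larger default context than the official tag.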
ArsNeph@reddit
I don't know why it's doing that, but first try the official Ollama run command from their library. Also, try modifying the Modelfile and setting the context to something like 8192. Secondly, if your CPU is really old, it might not support AVX2 for inference. Try KoboldCPP and see if it works; if it does, it's not a problem with your rig, just some issue with Ollama.
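A minimal sketch of the Modelfile approach, assuming the official qwen3:0.6b tag (swap in whichever model you actually pulled):
# Modelfile
FROM qwen3:0.6b
PARAMETER num_ctx 8192
Then build and run the capped variant:
ollama create qwen3-8k -f Modelfile
ollama run qwen3-8k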
FloJak2004@reddit (OP)
Thanks! Must have something to do with the unsloth quant - like somebody already mentioned, just running the official Qwen3 0.6b works just fine for me - and with minimal RAM usage.
I use an i3 13100 on this machine.
ArsNeph@reddit
Okay, then it's almost definitely an issue with the quant; it's possible that the default context length is set to 132000 or something. Regardless, I'm glad it's working! You should be able to run Qwen 3 8B just fine on that machine. A bit of advice though: Ollama is really quite slow compared to vanilla llama.cpp, so I would recommend using that or KoboldCPP instead once you get the hang of Ollama.
Background-Ad-5398@reddit
Some small models like to default to 100k context even if the model doesn't support that in practice.
stddealer@reddit
Don't bother with Ollama. If you want something easy to use, go for LM Studio; otherwise, just use llama.cpp.
FloJak2004@reddit (OP)
Thanks! I'll try llama.cpp. I am using LM Studio on my PC and Mac already, but wanted to have a small LLM running on my NAS for my local Open WebUI instance to connect to.
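If the NAS ends up running llama.cpp instead of Ollama, Open WebUI should still be able to talk to it: llama-server exposes an OpenAI-compatible API, so (as far as I know) you'd add an OpenAI-style connection in Open WebUI pointing at something like:
http://<nas-ip>:8080/v1
whereas with Ollama you'd keep pointing Open WebUI at the usual http://<nas-ip>:11434.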
LatestLurkingHandle@reddit
Try installing the Ollama app. With just CPU and memory it's slow, but it works with many quantized models with only 16 GB of RAM.
uti24@reddit
Check the context size you are running your model with.
I have seen software default to the maximum context size, and that requires a lot of memory.
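For a quick experiment, the context can also be overridden per request through Ollama's API, so you can watch the memory footprint change (model tag here assumes the official 0.6B pull):
curl http://localhost:11434/api/generate -d '{"model": "qwen3:0.6b", "prompt": "hello", "options": {"num_ctx": 4096}}'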
ThunderousHazard@reddit
I believe you are right, but Qwen3's ctx size should be 32768.
Trying the 4B variant (Q5_K_M) locally with llama.cpp, without flash attention, I see a total memory usage of 10GB.
Some math isn't mathing (or Ollama is doing something particular which makes it use more resources..?).
kataryna91@reddit
I only get about 4.3 GB of memory usage: 0.5 GB for the model and 3.6 GB for the 32k context.
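Those numbers are consistent with a back-of-the-envelope fp16 KV-cache estimate (assuming Qwen3 0.6B's 28 layers, 8 KV heads, head dim 128): roughly 112 KiB per token x 32,768 tokens ≈ 3.5 GiB, which is essentially the 3.6 GB reported for the context.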
steezy13312@reddit
That’s really confusing to me because that model at that quant should not need that much memory
steezy13312@reddit
What are your docker settings?
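One thing worth checking on the Unraid side (container name assumed from the docker run above) is whether the container has a memory cap and what it is actually using:
docker inspect ollama --format '{{.HostConfig.Memory}}'
docker stats ollama --no-stream
A 0 from the first command means no explicit memory limit is set on the container.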