Local AI - Ollama, Open WebUI, RTX 3060 12 GB
Posted by Apollyon91@reddit | LocalLLaMA | 7 comments
I am running Unraid (home server) with a dedicated GPU: an NVIDIA RTX 3060 with 12 GB of VRAM.
I also tried setting it up on my desktop through opencode. Both instances yield the same result.
I run the paperless stack with some basic LLM models.
But I wanted to expand this and use other LLMs for other things as well, including some light coding.
But when running qwen3:14b, for example, which other Reddit posts suggest should be fine on this card, it seems to hammer the CPU as well: all cores are in use alongside the GPU, yet GPU utilisation looks low compared to how hard the CPU is being hit.
Am I doing something wrong, did I miss some setting, or is there something I should be doing instead?
reviews4weed@reddit
If you exceed GPU RAM, Ollama spills over to the CPU. Make sure your GPU drivers and configuration are good. This will happen with any model that grows beyond your GPU memory.
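One way to confirm whether the model plus its KV cache actually fits in the 12 GB is to watch VRAM while a prompt is running. A minimal sketch (the model tag matches the one from the post; the in-REPL parameter is one way Ollama exposes context size, which is a major driver of VRAM use):

```shell
# Watch VRAM and GPU utilisation once per second while generating.
# If memory.used hits the 12 GB ceiling, Ollama keeps the remaining
# layers on the CPU, which matches the "all cores busy" symptom.
nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv -l 1

# A shorter context can keep everything on the GPU. Inside an
# `ollama run qwen3:14b` session, context can be reduced with:
# /set parameter num_ctx 4096
```

Smaller quantizations of the same model are the other lever if reducing context is not enough.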
Ollama is great for simplicity and for keeping a big cloud model as a fallback. I switched to gemma4:e2b on my 12GB server and it's been good locally.
Apollyon91@reddit (OP)
It seems to happen with every model I use. Drivers are up to date and working.
reviews4weed@reddit
When you run `ollama ps`, does it show 100% CPU or a GPU percentage?
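For reference, `ollama ps` reports where each loaded model is placed in its PROCESSOR column. A sketch of what the split looks like (the ID, size, and percentages below are illustrative, not from the poster's machine):

```shell
# Show loaded models and their CPU/GPU placement.
ollama ps
# NAME         ID            SIZE    PROCESSOR          UNTIL
# qwen3:14b    bdbd181c33f2  10 GB   43%/57% CPU/GPU    4 minutes from now
```

Anything other than "100% GPU" means part of the model is being evaluated on the CPU, which would explain the core usage described above.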
suicidaleggroll@reddit
Ollama does this regularly. Switch to another inference engine, literally anything is better than Ollama.
Apollyon91@reddit (OP)
So llama.cpp would be better in this case? Or is there another good choice?
suicidaleggroll@reddit
Yes llama.cpp would be better than Ollama in this, and every other case. vLLM would also work, or LM Studio, or SGLang.
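If trying llama.cpp, its `llama-server` binary gives explicit control over GPU offload. A minimal sketch, assuming a downloaded GGUF file (the filename and port are illustrative):

```shell
# -ngl sets the number of layers offloaded to the GPU (99 ~= "all"),
# -c sets the context size; both directly affect VRAM usage.
llama-server -m ./qwen3-14b-q4_k_m.gguf -ngl 99 -c 8192 --port 8080
```

llama-server exposes an OpenAI-compatible API under `/v1`, so Open WebUI can be pointed at `http://localhost:8080/v1` instead of the Ollama endpoint.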
Apollyon91@reddit (OP)
Thanks. Will give that a try