Just tried Ollama for the first time; it runs terribly at half GPU power on the default model it provides compared to one you add yourself. Any reason why?
Posted by dreamer_2142@reddit | LocalLLaMA | View on Reddit | 16 comments
My GPU power consumption is 250W (undervolted RTX 3090) when I run Qwen3.5-27B-GGUF, which I added to Ollama myself using a template (a Modelfile written by GPT). I gave it three tasks to test it: build a snake game, build a Flappy Bird game, and make an interactive grid on the web for a mouse visual effect. All were successful.
But I don't know how good or bad my Modelfile is since I couldn't find a template online, so I thought, let me try Qwen3.6 from inside the app. It downloaded 24GB, and I was surprised it failed the first two tasks. Isn't the app supposed to ship the best template and download the best model to give you a good result? And it consumes only 120W of power.
I think most people have bad results due to the app, not the model.
Prompts I've used:
1st task: build for me a snake game for html
--
2nd task: build for me a flappy bird game for html
gigglegenius@reddit
Ollama ate my VRAM (I have an RTX 4090). I switched to llama.cpp (just ask an LLM how to build it with the most recent MTP support and get the MTP models). My context window with Ollama: 20k. With llama.cpp, for some reason: 100k. Running the q5 heretic GGUF MTP qwen3.6:27b.
dreamer_2142@reddit (OP)
Do you need to provide the settings/template to llama.cpp too? If so, where do you get them? Or do you download the model from inside llama.cpp and it takes care of that?
gigglegenius@reddit
You start it from cmd if you're on Windows. It's a long command, so you can put it in a .bat file. Don't forget the quantized KV cache. Ask an LLM for it; I did, and it worked right away.
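For anyone curious what that command looks like, here is a minimal sketch of such a .bat, assuming llama-server is on the PATH; the model path, context size, and cache quant type are placeholders. llama-server normally picks the chat template up from the GGUF metadata, so you don't need a separate Modelfile-style template (there is a --chat-template flag if you ever need to override it).

REM Launch llama-server: offload all layers to the GPU and quantize the KV cache to q8_0.
REM (Some builds require flash attention to be enabled for the quantized V cache.)
llama-server ^
  -m x:\models\your-model.gguf ^
  -c 100000 ^
  -ngl 99 ^
  --cache-type-k q8_0 ^
  --cache-type-v q8_0 ^
  --port 8080

It then serves an OpenAI-compatible API on http://localhost:8080, so any chat client that can point at a custom endpoint will work.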
jwpbe@reddit
Don't use KV cache quantization unless you strictly have to for a specific workflow; it degrades outputs. It's not free.
gigglegenius@reddit
It's not? Maybe I should try the old Q4_K_S with an unquantized cache. Any benchmarks on this?
jwpbe@reddit
You'll have to look them up with the subreddit search on old.reddit.com/r/localllama - if you can fit the Unsloth Q4_K_XL, that would be better.
dreamer_2142@reddit (OP)
ok, thanks.
dreamer_2142@reddit (OP)
cool, thanks!
chibop1@reddit
Did you set a large enough context length? I believe it's 8192 by default now. The model pulled from the Ollama library works great here.
dreamer_2142@reddit (OP)
The app default on mine is set to 32k, but do check my edit comment in this post (not sure how to link it); I think there are other parameters that have more impact on the model. I'm still not sure why it consumes only half the power compared to the one I downloaded and added to Ollama manually.
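For reference, the context length can be checked and overridden per session from inside ollama run: /show parameters lists what the model ships with, and /set parameter num_ctx overrides it for that session (the model name below is just an example).

ollama run qwen3.6
/show parameters
/set parameter num_ctx 32768

The /set value only applies to the current session; the Modelfile's num_ctx stays the default unless you save a new model with /save.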
PromptInjection_@reddit
switch to llama.cpp or LMStudio. Ollama is not good.
dryadofelysium@reddit
Ollama is currently in the process of abandoning their own engine/wrapper and will switch to a model similar to LMStudio very soon.
Forward_Jackfruit813@reddit
Can confirm, I thought it was just hyperbole but moving to llama.cpp has been transformative for me.
PhoneOk7721@reddit
Ollama is garbage, literally everyone will tell you this, use llama.cpp or literally anything else except ollama.
Sufficient-Bid3874@reddit
llama.cpp
dreamer_2142@reddit (OP)
The Modelfile GPT made for Qwen3.5-27B-UD-Q4_K_XL, if anyone is curious.
FROM x:\x\Qwen3.5-27B-UD-Q4_K_XL.gguf
TEMPLATE """{{- if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{- range .Messages }}<|im_start|>{{ .Role }}
{{ .Content }}<|im_end|>
{{ end }}<|im_start|>assistant
"""
SYSTEM """You are a helpful assistant. Answer the user directly and stay on topic."""
PARAMETER stop "<|im_end|>"
PARAMETER stop "<|endoftext|>"
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 16384
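In case anyone wants to reproduce it, this is the usual way to register and run a Modelfile like this with Ollama (the model name is arbitrary):

ollama create qwen3.5-27b -f Modelfile
ollama run qwen3.5-27b

You can double-check what actually got baked in afterwards with ollama show qwen3.5-27b --modelfile.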