Just tried Ollama for the first time; it runs terribly at half GPU power on the default model it provides compared to one you add yourself. Any reason why?
Posted by dreamer_2142@reddit | LocalLLaMA | View on Reddit | 16 comments
My GPU power consumption is 250W (undervolted RTX 3090) when I run Qwen3.5-27B-GGUF, which I added to Ollama myself using a template (a Modelfile written by GPT). I gave it three tasks to test it: build a snake game, build a Flappy Bird game, and make an interactive grid on the web for a mouse visual effect. All were successful.
But I don't know how good or bad my Modelfile is since I couldn't find a template online, so I thought, let me try Qwen3.6 from inside the app. It downloaded 24GB, and I was surprised it failed the first two tasks. Isn't the app supposed to ship the best template and download the best model to give you a good result? And it consumes only 120W of power.
I think most people have bad results due to the app, not the model.
Prompts I've used:
1st task: build for me a snake game for html
--
2nd task: build for me a flappy bird game for html
gigglegenius@reddit
Ollama ate my VRAM (I have an RTX 4090). I switched to llama.cpp (just ask an LLM how to build it with the most recent MTP support and get the MTP models). My context window with Ollama: 20k. With llama.cpp, for some reason: 100k. Running the q5 heretic GGUF MTP qwen3.6:27b.
dreamer_2142@reddit (OP)
Do you need to provide the settings/template to llama.cpp too? If so, where do you get them? Or do you download the model from inside llama.cpp and it takes care of that?
gigglegenius@reddit
You start it from cmd if you're on Windows. It's a long command, so you can put it in a .bat file. Don't forget the quantized KV cache. Ask an LLM for it; I did, and it worked right away.
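For anyone curious what that command looks like, here is a minimal sketch of such a .bat, assuming llama-server is on the PATH; the model path, context size, and cache quant type are placeholders. llama-server normally picks the chat template up from the GGUF metadata, so you don't need a separate Modelfile-style template (there is a --chat-template flag if you ever need to override it).

REM Launch llama-server: offload all layers to the GPU and quantize the KV cache to q8_0.
REM (Some builds require flash attention to be enabled for the quantized V cache.)
llama-server ^
  -m x:\models\your-model.gguf ^
  -c 100000 ^
  -ngl 99 ^
  --cache-type-k q8_0 ^
  --cache-type-v q8_0 ^
  --port 8080

It then serves an OpenAI-compatible API on http://localhost:8080, so any chat client that can point at a custom endpoint will work.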
jwpbe@reddit
Don't use KV cache quantization unless you strictly have to for a specific workflow; it degrades outputs. It's not free.
gigglegenius@reddit
It's not? Maybe I should try the old Q4_K_S with an unquantized cache. Any benchmarks on this?
jwpbe@reddit
You'll have to look them up with the subreddit search on old.reddit.com/r/localllama - if you can fit the Unsloth Q4_K_XL, that would be better.
dreamer_2142@reddit (OP)
ok, thanks.
dreamer_2142@reddit (OP)
cool, thanks!
chibop1@reddit
Did you set a large enough context length? I believe it's 8192 by default now. The model pulled from the Ollama library works great here.
dreamer_2142@reddit (OP)
The app default on mine is set to 32k, but do check my edit comment in this post (not sure how to link it); I think there are other parameters that have more impact on the model. I'm still not sure why it consumes only half the power compared to the one I downloaded and added to Ollama manually.
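For reference, the context length can be checked and overridden per session from inside ollama run: /show parameters lists what the model ships with, and /set parameter num_ctx overrides it for that session (the model name below is just an example).

ollama run qwen3.6
/show parameters
/set parameter num_ctx 32768

The /set value only applies to the current session; the Modelfile's num_ctx stays the default unless you save a new model with /save.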
PromptInjection_@reddit
switch to llama.cpp or LMStudio. Ollama is not good.
dryadofelysium@reddit
Ollama is currently in the process of abandoning their own engine/wrapper and will switch to a model similar to LMStudio very soon.
Forward_Jackfruit813@reddit
Can confirm, I thought it was just hyperbole but moving to llama.cpp has been transformative for me.
PhoneOk7721@reddit
Ollama is garbage, literally everyone will tell you this, use llama.cpp or literally anything else except ollama.
Sufficient-Bid3874@reddit
llama.cpp
dreamer_2142@reddit (OP)
The Modelfile GPT made for Qwen3.5-27B-UD-Q4_K_XL, if anyone is curious.
FROM x:\x\Qwen3.5-27B-UD-Q4_K_XL.gguf
TEMPLATE """{{- if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{- range .Messages }}<|im_start|>{{ .Role }}
{{ .Content }}<|im_end|>
{{ end }}<|im_start|>assistant
"""
SYSTEM """You are a helpful assistant. Answer the user directly and stay on topic."""
PARAMETER stop "<|im_end|>"
PARAMETER stop "<|endoftext|>"
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 16384
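In case anyone wants to reproduce it, this is the usual way to register and run a Modelfile like this with Ollama (the model name is arbitrary):

ollama create qwen3.5-27b -f Modelfile
ollama run qwen3.5-27b

You can double-check what actually got baked in afterwards with ollama show qwen3.5-27b --modelfile.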