Gemma 4 thinking system prompt
Posted by No_Information9314@reddit | LocalLLaMA | View on Reddit | 26 comments
I like to be able to enable and disable thinking using a system prompt, so that I can control which prompts generate thinking tokens rather than relying on the model to choose for me. It's one of the reasons I loved Qwen-30b-A3b.
I'm having trouble getting this same setup working for the Gemma 4 models. Right now I'm playing with the 26B. The model will sometimes respond to a system prompt asking it to skip reasoning, sometimes not. If I put `
I'm curious whether anyone has devised a way to toggle thinking on/off using system prompts and/or chat templates with the Gemma 4 models?
Snoo_28140@reddit
If your backend supports jinja templates, you can adapt (maybe even use directly?) this template from Qwen:
https://pastebin.com/4wZPFui9
Source: https://www.reddit.com/r/LocalLLaMA/s/ne7L5HfBYI
pfn0@reddit
the jinja template supports enable_thinking
https://huggingface.co/google/gemma-4-31B-it/blob/main/chat_template.jinja#L157
Pass `chat_template_kwargs '{"enable_thinking":false}'` or `true` as necessary.
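For example, against llama.cpp's OpenAI-compatible server, the kwarg goes in the request body (port and message are placeholders; this is a sketch assuming your build forwards `chat_template_kwargs` into the template):

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello"}],
    "chat_template_kwargs": {"enable_thinking": false}
  }'
```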
No_Information9314@reddit (OP)
I’m finding that the model may respect this for the first prompt or two, but it's inconsistent in its application, i.e. it will sometimes think even with this in the system prompt.
pfn0@reddit
This isn't a system prompt setting; the system prompt is the wrong place to apply it.
No_Information9314@reddit (OP)
The chat template shows the system or developer role is the place. Where are you applying it?
pfn0@reddit
It is applied in the API request body, alongside the sampler parameters, if you adjust those.
Snoo_28140@reddit
My bad, indeed that is supported out of the box. I got caught up on the system prompt aspect.
mr_Owner@reddit
The flag for this in the latest llama.cpp changed; it is now used as:
`--reasoning=on/off`
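If a restart is acceptable, the launch would look something like this (flag name and syntax taken from the comment above; the model path is a placeholder):

```shell
# Disable thinking for this server instance; restart with --reasoning=on to re-enable
llama-server -m ./gemma-4-26b-it.gguf --reasoning=off
```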
No_Information9314@reddit (OP)
Yes but I don’t want to have to reload the model every time I switch modes
mr_Owner@reddit
Can't have it all haha
No_Information9314@reddit (OP)
I want it all!
sunychoudhary@reddit
Interesting.
System prompts for “thinking” are always a bit tricky because the real question isn’t whether it responds better, it’s whether the behavior stays consistent, controllable and stable across different tasks.
A lot of prompt tricks look good in a few examples and then drift hard in real use.
No_Information9314@reddit (OP)
Yeah that’s been my experience with this model, even with the officially supported methods
Herr_Drosselmeyer@reddit
Google themselves say this:
Trigger Thinking: Thinking is enabled by including the `<|think|>` token at the start of the system prompt. To disable thinking, remove the token.
No_Information9314@reddit (OP)
I do not have this token in the system prompt, so I'm not sure where or how to remove it.
Specter_Origin@reddit
What are you using to serve the model?
No_Information9314@reddit (OP)
llama.cpp
Specialist_Sun_7819@reddit
Yeah, Gemma is weirdly inconsistent about respecting thinking toggles. I just set `do_thinking=false` in the generation config if your backend supports it; way more reliable than system prompt instructions. For Ollama you can also pass it as a parameter. System prompt instructions like "do not reason internally" work maybe 60% of the time, which is... not great lol. Qwen was definitely better about this.
No_Information9314@reddit (OP)
Thanks - by generation config do you mean the chat template? I’m using llama.cpp.
defensivedig0@reddit
Isn't it supposed to be that adding `<|think|>` to the system prompt enables thinking and removing it disables it?
No_Information9314@reddit (OP)
I find that the model reasons with or without this tag
Klutzy-Snow8016@reddit
Instead of trying to use a system prompt for this, use the chat template argument "enable_thinking". That's the supported method. Llama.cpp and vllm, at least, support setting chat_template_kwargs in the request as well.
No_Information9314@reddit (OP)
It doesn’t really work; maybe for the first prompt, but not after.
durden111111@reddit
Just use llama cpp to disable thinking
No_Information9314@reddit (OP)
Yes but I don’t want to have to reload the model every time I switch modes
Yukki-elric@reddit
Grab the jinja template from their huggingface repo, ask a competent LLM to modify it so that if the last user message contains "/think", it removes it from context and enables thinking for the next LLM response.
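A rough sketch of that idea in jinja (variable and field names are assumptions based on typical chat templates; the real template's structure will differ):

```jinja
{#- Hypothetical sketch: enable thinking only when the last user
    message contains "/think", and strip the tag before rendering. -#}
{%- set ns = namespace(enable_thinking=false) %}
{%- if messages and messages[-1].role == "user"
       and "/think" in messages[-1].content %}
    {%- set ns.enable_thinking = true %}
{%- endif %}
{#- When rendering the last user message, emit it with the tag removed:
    messages[-1].content | replace("/think", "") | trim -#}
```

The `namespace` object is needed because plain `set` inside an `if` block does not persist outside that block's scope in Jinja2.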