Gemma 4 thinking system prompt
Posted by No_Information9314@reddit | LocalLLaMA | View on Reddit | 26 comments
I like to be able to enable and disable thinking using a system prompt, so that I can control which prompts generate thinking tokens rather than relying on the model to choose for me. It's one of the reasons I loved Qwen-30b-A3b.
I'm having trouble getting this same setup working for the Gemma 4 models. Right now I'm playing with the 26B. The model will sometimes respond to a system prompt asking it to skip reasoning, sometimes not. If I put `
I'm curious whether anyone has devised a way to toggle thinking on/off using system prompts and/or chat templates with the Gemma 4 models?
Snoo_28140@reddit
If your backend supports jinja templates, you can adapt (maybe even use directly?) this template from Qwen:
https://pastebin.com/4wZPFui9
Source: https://www.reddit.com/r/LocalLLaMA/s/ne7L5HfBYI
pfn0@reddit
the jinja template supports enable_thinking
https://huggingface.co/google/gemma-4-31B-it/blob/main/chat_template.jinja#L157
Pass `chat_template_kwargs '{"enable_thinking":false}'` or `true` as necessary.
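For example, against llama.cpp's OpenAI-compatible server, the kwarg goes in the request body (port and message are placeholders; this is a sketch assuming your build forwards `chat_template_kwargs` into the template):

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello"}],
    "chat_template_kwargs": {"enable_thinking": false}
  }'
```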
No_Information9314@reddit (OP)
I’m finding that the model may respect this for the first prompt or two, but it's inconsistent in its application, i.e. it will sometimes think even with this in the system prompt.
pfn0@reddit
This isn't a system prompt setting; the system prompt is the wrong place to apply it.
No_Information9314@reddit (OP)
The chat template shows the system or developer role is the place. Where are you applying it?
pfn0@reddit
It is applied in the API request body, alongside the sampler parameters, if you adjust those.
Snoo_28140@reddit
My bad, indeed that is supported out of the box. I got caught up on the system prompt aspect.
mr_Owner@reddit
The flag for this in the latest llama.cpp changed; it is now used as:
`--reasoning=on/off`
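If a restart is acceptable, the launch would look something like this (flag name and syntax taken from the comment above; the model path is a placeholder):

```shell
# Disable thinking for this server instance; restart with --reasoning=on to re-enable
llama-server -m ./gemma-4-26b-it.gguf --reasoning=off
```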
No_Information9314@reddit (OP)
Yes but I don’t want to have to reload the model every time I switch modes
mr_Owner@reddit
Can't have it all haha
No_Information9314@reddit (OP)
I want it all!
sunychoudhary@reddit
Interesting.
System prompts for “thinking” are always a bit tricky because the real question isn’t whether it responds better, it’s whether the behavior stays consistent, controllable and stable across different tasks.
A lot of prompt tricks look good in a few examples and then drift hard in real use.
No_Information9314@reddit (OP)
Yeah that’s been my experience with this model, even with the officially supported methods
Herr_Drosselmeyer@reddit
Google themselves say this:
Trigger Thinking: Thinking is enabled by including the `<|think|>` token at the start of the system prompt. To disable thinking, remove the token.
No_Information9314@reddit (OP)
I do not have this token in the system prompt, so I'm not sure where or how to remove it.
Specter_Origin@reddit
What are you using to serve the model?
No_Information9314@reddit (OP)
llama.cpp
Specialist_Sun_7819@reddit
Yeah, Gemma is weirdly inconsistent about respecting thinking toggles. I just set `do_thinking=false` in the generation config if your backend supports it; way more reliable than system prompt instructions. For Ollama you can also pass it as a parameter. System prompt instructions like "do not reason internally" work maybe 60% of the time, which is... not great lol. Qwen was definitely better about this.
No_Information9314@reddit (OP)
Thanks - by generation config do you mean the chat template? I’m using llama.cpp.
defensivedig0@reddit
Isn't it supposed to be that adding `<|think|>` to the system prompt enables thinking and removing it disables it?
No_Information9314@reddit (OP)
I find that the model reasons with or without this tag
Klutzy-Snow8016@reddit
Instead of trying to use a system prompt for this, use the chat template argument "enable_thinking". That's the supported method. Llama.cpp and vllm, at least, support setting chat_template_kwargs in the request as well.
No_Information9314@reddit (OP)
It doesn’t really work; maybe for the first prompt, but not after.
durden111111@reddit
Just use llama cpp to disable thinking
No_Information9314@reddit (OP)
Yes but I don’t want to have to reload the model every time I switch modes
Yukki-elric@reddit
Grab the jinja template from their huggingface repo, ask a competent LLM to modify it so that if the last user message contains "/think", it removes it from context and enables thinking for the next LLM response.
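A rough sketch of that idea in jinja (variable and field names are assumptions based on typical chat templates; the real template's structure will differ):

```jinja
{#- Hypothetical sketch: enable thinking only when the last user
    message contains "/think", and strip the tag before rendering. -#}
{%- set ns = namespace(enable_thinking=false) %}
{%- if messages and messages[-1].role == "user"
       and "/think" in messages[-1].content %}
    {%- set ns.enable_thinking = true %}
{%- endif %}
{#- When rendering the last user message, emit it with the tag removed:
    messages[-1].content | replace("/think", "") | trim -#}
```

The `namespace` object is needed because plain `set` inside an `if` block does not persist outside that block's scope in Jinja2.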