how to preserve gemma 4 thinking trace
Posted by Qwoctopussy@reddit | LocalLLaMA | View on Reddit | 20 comments
how can i prevent discarding the thinking trace?
llama.cpp (b8858) serving gemma 4 31b (UD-Q6_K_XL), (almost) vanilla pi harness
got some flags here and there on llama-server, nothing relevant, but adding --jinja and --chat-template-kwargs '{"preserve_thinking": true}' didn't seem to change it
dqUu3QlS@reddit
The model card for Gemma 4 models says:
So it's possible llama.cpp doesn't expose the ability to retain thinking content for Gemma 4 models. You could modify and recompile llama.cpp, or use a different inference engine, but you would get lower quality outputs.
Qwoctopussy@reddit (OP)
turns out you can just modify the chat template file
https://www.reddit.com/r/LocalLLaMA/s/YhtBgNraZt
dqUu3QlS@reddit
Nice, is the response quality on the second turn still good if you do that?
Mart-McUH@reddit
llama.cpp has nothing to do with it, it is just an inference engine. It depends on the frontend. I don't use the provided jinja templates at all; I use SillyTavern with text completion (with the instruct template, including channels, defined manually). With that configuration it is simply a checkbox whether to include previous thinking traces, and how many. Useful if you want to play a game like hangman, where the model needs to keep a secret consistent across turns but hidden from you.
I.e., it is all about the frontend; it does not really depend on the backend. Chat completion (using chat templates) limits a lot how you can manipulate the actual prompt, but it is easier to use since you do not need to define the template yourself.
Qwoctopussy@reddit (OP)
i’m interested in preserving thinking trace because, in long multi-turn agentic tasks, it spends a lot of time re-thinking thoughts that it has already had. TG isn’t very high so it’s painful
it has high-quality well-structured thoughts, so it’s a shame for it to discard them
TacticalRock@reddit
For agentic work, it really does seem like Qwen 3.6 27B might be the better choice since it's explicitly trained to allow `preserve_thinking`.
Qwoctopussy@reddit (OP)
yeah i like qwen 3.6 27b, but it fails the car wash test.
gemma 4 has a high-quality thinking trace and gets it right, so i speculate gemma could give qwen a run for its money… if only i didn’t have to sit waiting for it to re-derive its thoughts on every turn
PrinceOfLeon@reddit
Gemma 4 was released some time after the car wash test became popular. It's quite possible the answer made it into the training set. It's not useful asking how many R's are in strawberry for the same reason.
Qwoctopussy@reddit (OP)
good point. qwen3.6 is also after, but maybe the car wash test didn’t make its way into CN datasets so nothing is definitive
best way to know is to test. i figured out how to preserve the thinking trace so i’ll try it
https://www.reddit.com/r/LocalLLaMA/s/YhtBgNraZt
TacticalRock@reddit
Ah interesting. Maybe worth making a custom harness that injects the reasoning trace along with the response, has a secondary model summarize the reasoning trace, and makes the summary part of the chat history, so the summarized trace gets sent on the next turn instead of the full trace? Not sure what else to do, since Google explicitly states not to include reasoning content as part of multi-turn.
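Something like this, roughly (hypothetical message shape; `summarize()` just stands in for the secondary-model call):

```python
def summarize(thinking: str) -> str:
    # Placeholder: in practice, send `thinking` to a small secondary model
    # and return its summary. Truncation is a stand-in here.
    return thinking[:200]

def compact_history(messages):
    """Replace each assistant turn's full thinking trace with a summary,
    so the next request carries condensed reasoning instead of the full trace."""
    out = []
    for msg in messages:
        if msg.get("role") == "assistant" and msg.get("thinking"):
            # copy the message, swapping the full trace for its summary
            msg = dict(msg, thinking=summarize(msg["thinking"]))
        out.append(msg)
    return out
```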
Qwoctopussy@reddit (OP)
figured something out https://www.reddit.com/r/LocalLLaMA/s/YhtBgNraZt
Qwoctopussy@reddit (OP)
tangentially, i’m somewhat curious what about “You are Pi, expert software architect” in the system prompt made it decide it’s a casual interaction lol
“silly hooman wants to play pretend i’m an expert”
PrinceOfLeon@reddit
I'm an expert software architect.
If you came up to me and asked me to think of two numbers and tell you the second, I would think we were having a casual interaction, not that you were trying to make use of my expertise.
Qwoctopussy@reddit (OP)
i haven’t prompted you to be an expert software architect tho so pretty sure you’re not in that region of latent space rn
Klutzy-Snow8016@reddit
There was a chatbot called Pi a couple years ago with a gimmick of having human-like casual conversations. Maybe it's picking up on that.
Qwoctopussy@reddit (OP)
ah good catch. shucks.
TheLexoPlexx@reddit
I suppose that's because the chat template doesn't support preserving thinking in this case? I'm using this in a custom harness and I regularly extend context window quickly by keeping thinking in.
Qwoctopussy@reddit (OP)
what chat template will preserve thinking?
Qwoctopussy@reddit (OP)
oh, i guess you didn’t change your chat template; you’re re-injecting the thinking trace from the harness
how would i go about doing such a thing? my understanding is that on every prompt Pi resubmits the entire chat history — it can’t know that inference is being run without the thinking traces — so I would presume the chat template is stripping the contents of the thinking tags, and what you’re doing is stripping the tags themselves so their content doesn’t get stripped?
my understanding of what’s going on here is very fuzzy
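something like this is what i'm imagining the harness does (pure speculation, hypothetical message fields):

```python
def reinject_thinking(messages):
    """Fold each assistant turn's thinking back into the visible content,
    so a template that strips a separate thinking field has nothing to strip.
    (Speculative sketch; `reasoning_content` is an assumed field name.)"""
    out = []
    for msg in messages:
        if msg.get("role") == "assistant" and msg.get("reasoning_content"):
            merged = dict(msg)
            # move the trace into the message body, dropping the separate field
            merged["content"] = merged["reasoning_content"] + "\n\n" + merged.get("content", "")
            merged.pop("reasoning_content")
            out.append(merged)
        else:
            out.append(msg)
    return out
```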
Qwoctopussy@reddit (OP)
took an embarrassingly long time for me to figure it out, but i modified the jinja template to preserve thinking
the trick was in realizing that although logic exists to delete thinking blocks (the `strip_thinking` macro), this is a red herring: that logic is only active in edge cases. the issue in the normal case is that thinking simply isn't being preserved to begin with. thinking traces ARE preserved during think→tool call→think loops, so i just copied that logic to preserve thinking traces whenever they're encountered (and i also disabled `strip_thinking`)
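in pseudo-python, the change amounts to roughly this (illustrative only, not the actual jinja; field and flag names are made up):

```python
def render_assistant(msg, in_tool_loop, preserve_thinking=True):
    """Sketch of the template's assistant-turn rendering.
    Before the change, thinking was only emitted when in_tool_loop was true
    (the think -> tool call -> think case); after, it is emitted every turn."""
    parts = []
    if msg.get("thinking") and (in_tool_loop or preserve_thinking):
        # previously this branch only fired inside tool-call loops
        parts.append("<think>" + msg["thinking"] + "</think>")
    parts.append(msg.get("content", ""))
    return "\n".join(parts)
```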