What's the deal with Qwen3.5's and Gemma 4's reasoning traces?
Posted by mags0ft@reddit | LocalLLaMA | 5 comments
Hey there,
I noticed something odd when trying out the latest and greatest local reasoning models recently. At first I only noticed it with Qwen3.5, but Gemma 4 seems to do it too:
The reasoning traces do that weird thing of starting with "Here is a detailed reasoning process for the problem: ..." or similar. Also, they seem to have begun suddenly including Markdown formatting, and all the SOTA models apparently now like to write their reasoning as lists with bullet points?
What I don't get is why they are doing that. How does generating a few dozen boilerplate tokens improve performance in any way? I am no hater of reasoning, and I don't think it's just "the model yapping around with no performance gain", but come on, I don't think it's necessary to spend time and electricity computing tokens for "Here is a reasoning process: ..." and hundreds of "**" tokens that aren't even going to get rendered.
It almost seems like they messed something up with synthetic data generation: did they prompt their teacher models to "generate a reasoning process" for each sample and "forget" to strip the preamble and Markdown formatting from the training data? I think that would be hilarious, but I genuinely cannot think of any other reason why this might have happened. You could literally pre-fill the preamble in the reasoning?!
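For what it's worth, the cleanup step I'm imagining would be trivial. A toy sketch (this is purely hypothetical, not any lab's actual pipeline; the function name and regexes are mine):

```python
import re

def strip_trace_boilerplate(trace: str) -> str:
    """Toy cleanup for a synthetic reasoning trace: drop a boilerplate
    preamble and Markdown markers before using it as training data."""
    # Drop a leading "Here is a ... reasoning process ...:" line if present
    trace = re.sub(r"^Here is (a|the)[^\n]*reasoning[^\n]*:\s*",
                   "", trace, flags=re.IGNORECASE)
    # Strip bold markers ("**" / "*")
    trace = re.sub(r"\*\*?", "", trace)
    # Strip bullet-point prefixes at line starts
    trace = re.sub(r"^\s*[-*+]\s+", "", trace, flags=re.MULTILINE)
    return trace.strip()
```

Running that over a trace like `"Here is a detailed reasoning process for the problem:\n- **Step 1:** ..."` leaves just the actual reasoning content, so "they forgot to strip it" really would be a one-regex oversight.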
It may just be my personal preference, but I prefer densely packed, coherent reasoning text and models that don't spend time computing formatting tokens for an internal monologue that I am only rarely going to look at.
Any thoughts on this? Maybe there's a good reason for it, because many labs seem to be adopting this behavior.
Best greets :)
RanklesTheOtter@reddit
"The user asked a question about March madness. I should write a basketball physics simulation to better understand how the dynamics of basketball work. Ok wait no, they just asked if LeBron is on fire this season. Ok let me write an MCP server to interface with the public records database, X, and Facebook to see Lebron's latest activity. Wait no, that won't tell me anything about how he's been playing...."
mags0ft@reddit (OP)
😁😁😁
yeah overthinking is definitely another problem too
Alarming-Ad8154@reddit
I think most labs lean heavily into RL, meaning there are scoring rules in place that calculate a reward for certain aspects of the response. Those can strongly shape specific aspects of the response. If they use reward models (not entirely unlikely), then the preferences of those models will shape aspects of the reasoning trace, which could mean rewarding styling: bullet lists, Markdown delimiters.
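To make the mechanism concrete, here's a toy illustration (emphatically not any lab's actual reward model; the scoring heuristic is invented): if the judge a policy is optimized against happens to score structured text higher, bullet-pointed traces will win out during RL even though the formatting adds nothing.

```python
def toy_style_reward(trace: str) -> float:
    """Toy judge with a formatting bias: it gives a small bonus per
    bullet line and per bold span on top of a length-capped base score.
    A policy optimized against this judge learns to emit Markdown."""
    lines = trace.splitlines()
    bullets = sum(1 for ln in lines
                  if ln.lstrip().startswith(("-", "*", "1.")))
    bold_spans = trace.count("**") // 2
    base = min(len(trace), 500) / 500  # length-capped content reward
    return base + 0.1 * bullets + 0.05 * bold_spans
```

Feed it a plain-prose trace and a bulleted one of similar length, and the bulleted one scores higher every time, which is exactly the gradient that would push models toward the style OP is describing.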
ForsookComparison@reddit
2025 models all kind of dumped the universe into reasoning and then used it as extra context to come up with an answer. DeepSeek and OpenAI were really the only ones that kept it concise, and even Claude had issues with it up until 4.0, I'd argue.
GLM (4? 4.5?) really kicked off the current phase of using structured reasoning to get to an answer faster and proved that it works way better than the regurgitation method (and uses fewer tokens).
At the cost of overthinking the small things, modern models that use structured thinking get to the point really quickly.
mags0ft@reddit (OP)
Ah, I understand! That explains the lists and bullet points...