(llama.cpp) Possible to disable reasoning for some requests (while leaving reasoning on by default)?
Posted by regunakyle@reddit | LocalLLaMA | View on Reddit | 19 comments
I am running unsloth/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf with llama-server (with reasoning enabled).
Is it possible to disable reasoning for some requests only? If yes, how?
I want to leave reasoning on by default, but for some other use cases I want it to respond as fast as possible (e.g. a chat bot)
Sadman782@reddit
Go to Settings > Developer and use this JSON:
{
"chat_template_kwargs": {
"enable_thinking": false
}
}
regunakyle@reddit (OP)
Thank you! This is the correct answer.
SkyFeistyLlama8@reddit
I'm trying to pass extra options like thinking or reasoning budget to llama-server using Microsoft Agent Framework. The regular OpenAI client works, but OpenAIChatClient from agent_framework.openai doesn't seem to expose "extra_body".
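With the plain OpenAI-compatible HTTP API, chat_template_kwargs is simply a top-level key in the request body, which is what the OpenAI client's extra_body merges in. A minimal stdlib sketch of that route; the endpoint URL and helper names here are illustrative, not from the thread:

```python
import json
import urllib.request

NO_THINK = {"chat_template_kwargs": {"enable_thinking": False}}

def build_request(prompt: str, thinking: bool = True) -> bytes:
    # llama-server merges chat_template_kwargs into the Jinja chat template
    # variables, so enable_thinking reaches the template just like the webui JSON.
    body = {"messages": [{"role": "user", "content": prompt}]}
    if not thinking:
        body.update(NO_THINK)
    return json.dumps(body).encode()

def send(url: str, payload: bytes) -> dict:
    # Plain POST to llama-server's OpenAI-compatible chat endpoint.
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Usage would look like `send("http://localhost:8080/v1/chat/completions", build_request("hi", thinking=False))`, assuming llama-server is listening on that port.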
charmander_cha@reddit
Is there a way to do this in opencode?
andy2na@reddit
Use llama-swap with llama.cpp; it allows different parameter sets (thinking, coding, instruct, etc.) without having to reload the model.
example config:
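A sketch of what such a llama-swap config might look like (the model path and alias names are hypothetical, and it assumes llama-server's --chat-template-kwargs flag; llama-swap substitutes ${PORT} itself):

```yaml
# Hypothetical llama-swap config: one GGUF, two parameter sets.
# llama-swap starts/stops instances as the requested model name changes.
models:
  "gemma:thinking":
    cmd: >
      llama-server --port ${PORT} -m /models/gemma.gguf
      --jinja --chat-template-kwargs '{"enable_thinking": true}'
  "gemma:instruct":
    cmd: >
      llama-server --port ${PORT} -m /models/gemma.gguf
      --jinja --chat-template-kwargs '{"enable_thinking": false}'
```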
LexRivera@reddit
last one should be thinking=true I guess? Because otherwise it's the same as :instruct.
Also, for this to work (so models will appear in /v1/models) you need to set
includeAliasesInList: true at the root of the config.
regunakyle@reddit (OP)
Thanks! Never used llama-swap before, will try later
charmander_cha@reddit
Would this be possible to do with llama.cpp's default router system?
ElectronSpiderwort@reddit
I've been messing with the --models-preset models.ini --models-max 1 flags for router mode; you could easily set up the same model with multiple sets of parameters for chat vs. deep reasoning and swap them out via the UI or, I think, the API.
Top-Rub-4670@reddit
This is only workable if you want to disable/enable thinking on new conversations.
Swapping models mid-conversation just to enable reasoning once kind of sucks; depending on hardware and context it could take several minutes.
ElectronSpiderwort@reddit
You are correct. But I tried it last night with OP's model (only in Q8) on a MacBook M2 Max, and it took a fraction of a minute to change models mid-stream. I didn't time it, but something like 20 seconds.
ItankForCAD@reddit
I believe there is a PR to add a reasoning toggle to the webui.
regunakyle@reddit (OP)
I think there are 2 PRs for this, lol. Hopefully they get merged soon.
DigRealistic2977@reddit
Ya mean encapsulation of the reasoning? I guess that's a thing too: it strips the reasoning out of the output while reasoning stays on at the same time.
regunakyle@reddit (OP)
Nope, I mean disabling reasoning altogether. Some other comment got this right.
Pjotrs@reddit
Using filters in llama-swap
TheLastSpark@reddit
Keep in mind that depending on how you are doing it, turning it on/off will break the prompt cache.
segmond@reddit
Yes, it's possible. Go to Settings > Developer.
Start it with reasoning, then paste the following: {"chat_template_kwargs": {"enable_thinking": false}}
That turns off reasoning; if you need to reason, change false to true.
gpt-oss-120b requires a different format. You can't toggle thinking off/on in 100% thinking models. Good news is the latest models are now hybrid. Have fun till the llama.cpp UI gets it integrated.
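The "different format" remark above is consistent with gpt-oss-style templates taking a reasoning effort level rather than an on/off boolean. A hedged sketch of such a request body, assuming a reasoning_effort template kwarg (the kwarg name and values are not confirmed in this thread):

```python
# Hypothetical: some chat templates expose an effort level
# ("low"/"medium"/"high") instead of an enable_thinking boolean.
def gpt_oss_body(prompt: str, effort: str = "low") -> dict:
    return {
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"reasoning_effort": effort},
    }
```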
biller23@reddit
I add this to my OpenAI API request with Qwen (it's in Lua; translate to JSON):
chat_template_kwargs = { enable_thinking = false }
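For reference, that Lua table translates to this JSON fragment in the request body:

```json
{
  "chat_template_kwargs": { "enable_thinking": false }
}
```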