(llama.cpp) Possible to disable reasoning for some requests (while leaving reasoning on by default)?
Posted by regunakyle@reddit | LocalLLaMA | View on Reddit | 19 comments
I am running unsloth/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf with llama-server (with reasoning enabled).
Is it possible to disable reasoning for some requests only? If yes, how?
I want to leave reasoning on by default, but for some other use cases I want it to respond as fast as possible (e.g. a chat bot)
Sadman782@reddit
Go to Settings > Developer and use this JSON:
{
"chat_template_kwargs": {
"enable_thinking": false
}
}
regunakyle@reddit (OP)
Thank you! This is the correct answer.
SkyFeistyLlama8@reddit
I'm trying to pass extra options like thinking or reasoning budget to llama-server using Microsoft Agent Framework. The regular OpenAI client works, but OpenAIChatClient from agent_framework.openai doesn't seem to expose "extra_body".
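With the plain OpenAI-compatible HTTP API, chat_template_kwargs is simply a top-level key in the request body, which is what the OpenAI client's extra_body merges in. A minimal stdlib sketch of that route; the endpoint URL and helper names here are illustrative, not from the thread:

```python
import json
import urllib.request

NO_THINK = {"chat_template_kwargs": {"enable_thinking": False}}

def build_request(prompt: str, thinking: bool = True) -> bytes:
    # llama-server merges chat_template_kwargs into the Jinja chat template
    # variables, so enable_thinking reaches the template just like the webui JSON.
    body = {"messages": [{"role": "user", "content": prompt}]}
    if not thinking:
        body.update(NO_THINK)
    return json.dumps(body).encode()

def send(url: str, payload: bytes) -> dict:
    # Plain POST to llama-server's OpenAI-compatible chat endpoint.
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Usage would look like `send("http://localhost:8080/v1/chat/completions", build_request("hi", thinking=False))`, assuming llama-server is listening on that port.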
charmander_cha@reddit
Is there a way to do this in opencode?
andy2na@reddit
Use llama-swap with llama.cpp; it allows different parameter sets (thinking, coding, instruct, etc.) without having to reload the model.
example config:
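A sketch of what such a llama-swap config might look like (the model path and alias names are hypothetical, and it assumes llama-server's --chat-template-kwargs flag; llama-swap substitutes ${PORT} itself):

```yaml
# Hypothetical llama-swap config: one GGUF, two parameter sets.
# llama-swap starts/stops instances as the requested model name changes.
models:
  "gemma:thinking":
    cmd: >
      llama-server --port ${PORT} -m /models/gemma.gguf
      --jinja --chat-template-kwargs '{"enable_thinking": true}'
  "gemma:instruct":
    cmd: >
      llama-server --port ${PORT} -m /models/gemma.gguf
      --jinja --chat-template-kwargs '{"enable_thinking": false}'
```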
LexRivera@reddit
last one should be thinking=true I guess? Because otherwise it's the same as :instruct.
Also, for this to work (so models will appear in /v1/models) you need to set
includeAliasesInList: true at the root of the config.
regunakyle@reddit (OP)
Thanks! Never used llama-swap before, will try later
charmander_cha@reddit
Would this be possible to do with llama.cpp's default router system?
ElectronSpiderwort@reddit
I've been messing with the --models-preset models.ini --models-max 1 flags for router mode; you could easily set up the same model with multiple sets of parameters for chat vs. deep reasoning and swap them out via the UI or, I think, the API.
Top-Rub-4670@reddit
This is only workable if you want to disable/enable thinking on new conversations.
Swapping models mid-conversation just to enable reasoning once kind of sucks; depending on hardware and context it could take several minutes.
ElectronSpiderwort@reddit
You are correct. But I tried it last night with OP's model (only in Q8) on a MacBook M2 Max, and it took a fraction of a minute to change models mid-stream. I didn't time it, but something like 20 seconds.
ItankForCAD@reddit
I believe there is a PR to add a reasoning toggle to the webui.
regunakyle@reddit (OP)
I think there are 2 PRs for this, lol. Hopefully they get merged soon.
DigRealistic2977@reddit
Ya mean encapsulation of the reasoning? I guess that's a thing too: it strips the reasoning out of the output while reasoning stays on at the same time.
regunakyle@reddit (OP)
Nope, I mean disabling reasoning altogether. Some other comment got this right.
Pjotrs@reddit
Using filters in llama-swap
TheLastSpark@reddit
Keep in mind that depending on how you are doing it, turning it on/off will break the prompt cache.
segmond@reddit
Yes, it's possible. Go to Settings > Developer.
Start it with reasoning, then paste the following: {"chat_template_kwargs": {"enable_thinking": false}}
That turns off reasoning; if you need to reason, change false to true.
gpt-oss-120b requires a different format. You can't toggle thinking off/on in 100% thinking models. Good news is the latest models are now hybrid. Have fun till the llama.cpp UI gets it integrated.
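The "different format" remark above is consistent with gpt-oss-style templates taking a reasoning effort level rather than an on/off boolean. A hedged sketch of such a request body, assuming a reasoning_effort template kwarg (the kwarg name and values are not confirmed in this thread):

```python
# Hypothetical: some chat templates expose an effort level
# ("low"/"medium"/"high") instead of an enable_thinking boolean.
def gpt_oss_body(prompt: str, effort: str = "low") -> dict:
    return {
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"reasoning_effort": effort},
    }
```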
biller23@reddit
I add this to my OpenAI API request with Qwen (it's in Lua; translate to JSON):
chat_template_kwargs = { enable_thinking = false }
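For reference, that Lua table translates to this JSON fragment in the request body:

```json
{
  "chat_template_kwargs": { "enable_thinking": false }
}
```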