Is there a way to disable reasoning per request in llama.cpp's llama-server, while leaving it on by default?

Posted by Mrinohk@reddit | LocalLLaMA | View on Reddit | 10 comments

Title. I've got a llama.cpp server running a model being accessed across a number of scripts, and some of them are easier for the model than others, and those easier ones are also latency dependent. Rather than host two different servers with different parameters, I'd rather just send something along with the prompt to disable it.

If I must host multiple servers, am I able to host two servers for the same model but only have the model loaded in memory once? VRAM limited, like most of you I'm sure.

[-]

PixelSage-001@reddit

Unfortunately, there isn't a direct per-request flag in `llama-server` to disable reasoning if it's baked into the model's default system prompt or template.

If the model relies on a specific `` tag formatting or token sequence to trigger reasoning, your best option is to pass a custom system prompt or a suffix in the request template that explicitly instructs the model to bypass thinking (e.g., "Respond directly without thinking steps"). Or, if you have control over the endpoints, spin up two parallel `llama-server` instances on different ports—one with reasoning template parameters enabled, and one without.

[-]

Still-Notice8155@reddit

if you use pi.dev harness, press shift+tab if you're on windows.

[-]

BitGreen1270@reddit

Yea this is possible. If you view the payload that is sent to llama server there is a param for reasoning. You just need to set it to false when sending a http request to the llama server. I can share an example at night, but you can figure it out as well with developer tools with openwebui that ships with llama server.

[-]

Shoddy_Bed3240@reddit

For small models like Qwen 3.6 is better to keep it on. You need to check --reasoning-budget

[-]

thejacer@reddit

With Qwen3.5 and 3.6 you can pass this custom JSON via API and it will turn off thinking.

{"chat_template_kwargs": {"enable_thinking": false}}

[-]

DunderSunder@reddit

It's very simple for qwen, you have to put something in your request like this. or for example some python packages can handle it.

"chat_template_kwargs": {
"enable_thinking": false
}

[-]

Gallardo994@reddit

I recently stumbled upon this post: https://www.reddit.com/r/hermesagent/comments/1t83hbt/how_i_toggle_qwen3_thinking_onoff_perrequest

It's not precisely what you've been looking for as it requires llama-swap on top of llama-server, however, it looks neat by utilizing virtual models.

[-]

relicx74@reddit

Have you looked if your model supports some syntax like nothing to disable thinking for a given inference?

[-]

b0tm0de@reddit

Some models support this, for example qwen3 /no_think.

[-]

IgnisIason@reddit

Short answer

No, llama-cpp’s llama-server doesn’t have a “skip reasoning” switch you can flip per request. But you can get 90 % of what you want by sending per-request inference parameters (low-latency settings) from your client, rather than spinning up a second server.

Use per-request inference options

Every llama-server JSON request accepts an options block. You can keep “full reasoning” as the default in your scripts, and for the latency-sensitive calls send something like:

{ "prompt": "your short task …", "options": { // keep the context window minimal "n_ctx": 1024,

// shut off sampling creativity
"temperature": 0.0,       // deterministic top-token
"top_k": 1,
"top_p": 1.0,

// stop as soon as first newline (or other token)
"n_predict": 32,          // or whatever very small number
"stop": ["\n"]

} }

Effect: the model does far fewer decoder steps, returns a terse answer, and you avoid the extra server instance.

Why it works: “Reasoning” is just longer, probabilistic sampling. Constrain the sampling window and the model won’t wander into chain-of-thought.

If you really need two parameter sets

Option A – two endpoints, one process

Use two routes inside the same server:

route /fast → hard-coded low-latency options

route /rich → default exploratory options

You’d add a small wrapper around llama.cpp’s server to map URL paths to option presets—still only one model in VRAM.

Option B – two server processes

Running two separate llama-server instances will double-load the model into VRAM today; llama-cpp doesn’t offer shared weights across processes yet. If you’re VRAM-constrained, stick with Option A (single process + routing).

Quick flag glossary (for latency tuning)

Option What it does Latency effect

n_predict Max tokens to sample linear ↓ temperature 0 → greedy decoding slight ↓ top_k = 1 pick highest-prob token slight ↓ mirostat = 0 turn off adaptive sampling removes extra compute repeat_penalty = 1.0 neutral repetition bias a tiny ↓

Adjust just these and you’ll feel the difference.

TL;DR

Use per-call options instead of per-model servers. If you need two modes, build a lightweight route-switcher and keep one copy of the weights in memory.