Is there a way to disable reasoning per request in llama.cpp's llama-server, while leaving it on by default?
Posted by Mrinohk@reddit | LocalLLaMA | View on Reddit | 10 comments
Title. I've got a llama.cpp server running a model being accessed across a number of scripts, and some of them are easier for the model than others, and those easier ones are also latency dependent. Rather than host two different servers with different parameters, I'd rather just send something along with the prompt to disable it.
If I must host multiple servers, am I able to host two servers for the same model but only have the model loaded in memory once? VRAM limited, like most of you I'm sure.
PixelSage-001@reddit
Unfortunately, there isn't a direct per-request flag in `llama-server` to disable reasoning if it's baked into the model's default system prompt or template.
If the model relies on a specific `` tag formatting or token sequence to trigger reasoning, your best option is to pass a custom system prompt or a suffix in the request template that explicitly instructs the model to bypass thinking (e.g., "Respond directly without thinking steps"). Or, if you have control over the endpoints, spin up two parallel `llama-server` instances on different ports—one with reasoning template parameters enabled, and one without.
Still-Notice8155@reddit
if you use pi.dev harness, press shift+tab if you're on windows.
BitGreen1270@reddit
Yea this is possible. If you view the payload that is sent to llama server there is a param for reasoning. You just need to set it to false when sending a http request to the llama server. I can share an example at night, but you can figure it out as well with developer tools with openwebui that ships with llama server.
Shoddy_Bed3240@reddit
For small models like Qwen 3.6 is better to keep it on. You need to check --reasoning-budget
thejacer@reddit
With Qwen3.5 and 3.6 you can pass this custom JSON via API and it will turn off thinking.
{"chat_template_kwargs": {"enable_thinking": false}}
DunderSunder@reddit
It's very simple for qwen, you have to put something in your request like this. or for example some python packages can handle it.
Gallardo994@reddit
I recently stumbled upon this post: https://www.reddit.com/r/hermesagent/comments/1t83hbt/how_i_toggle_qwen3_thinking_onoff_perrequest
It's not precisely what you've been looking for as it requires llama-swap on top of llama-server, however, it looks neat by utilizing virtual models.
relicx74@reddit
Have you looked if your model supports some syntax like nothing to disable thinking for a given inference?
b0tm0de@reddit
Some models support this, for example qwen3 /no_think.
IgnisIason@reddit
Short answer
No, llama-cpp’s llama-server doesn’t have a “skip reasoning” switch you can flip per request. But you can get 90 % of what you want by sending per-request inference parameters (low-latency settings) from your client, rather than spinning up a second server.
Every llama-server JSON request accepts an options block. You can keep “full reasoning” as the default in your scripts, and for the latency-sensitive calls send something like:
{ "prompt": "your short task …", "options": { // keep the context window minimal "n_ctx": 1024,
} }
Effect: the model does far fewer decoder steps, returns a terse answer, and you avoid the extra server instance.
Option A – two endpoints, one process
Use two routes inside the same server:
route /fast → hard-coded low-latency options
route /rich → default exploratory options
You’d add a small wrapper around llama.cpp’s server to map URL paths to option presets—still only one model in VRAM.
Option B – two server processes
Running two separate llama-server instances will double-load the model into VRAM today; llama-cpp doesn’t offer shared weights across processes yet. If you’re VRAM-constrained, stick with Option A (single process + routing).
Option What it does Latency effect
n_predict Max tokens to sample linear ↓ temperature 0 → greedy decoding slight ↓ top_k = 1 pick highest-prob token slight ↓ mirostat = 0 turn off adaptive sampling removes extra compute repeat_penalty = 1.0 neutral repetition bias a tiny ↓
Adjust just these and you’ll feel the difference.
TL;DR
Use per-call options instead of per-model servers. If you need two modes, build a lightweight route-switcher and keep one copy of the weights in memory.