Is there a way to disable reasoning per request in llama.cpp's llama-server, while leaving it on by default?

Posted by Mrinohk@reddit | LocalLLaMA | View on Reddit | 10 comments

Title. I've got a llama.cpp server running a model being accessed across a number of scripts, and some of them are easier for the model than others, and those easier ones are also latency dependent. Rather than host two different servers with different parameters, I'd rather just send something along with the prompt to disable it.

If I must host multiple servers, am I able to host two servers for the same model but only have the model loaded in memory once? VRAM limited, like most of you I'm sure.