How do you start your Llama.cpp server?
Posted by Citadel_Employee@reddit | LocalLLaMA | 39 comments
Sorry for the noob question. Recently made the switch from ollama to llama.cpp.
I was wondering what people's preferred method of starting a server up is. Do you just open your terminal and paste the command? Have it as a start-up task?
What I’ve landed on so far is just a shell script on my desktop. But it is a bit tedious if I want to change the model.
FastDecode1@reddit
User-level systemd service. That way I can stop/restart it without having to type my password every time.
Here's the unit file (~/.config/systemd/user/llamacpp.service):
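Something along these lines works (the ExecStart path and flags here are illustrative; adjust for your own build and models):

    [Unit]
    Description=llama.cpp server
    Wants=network-online.target
    After=network-online.target

    [Service]
    # Illustrative path and flags - point ExecStart at your own build and models.
    # %h expands to the user's home directory in user units.
    ExecStart=%h/llama.cpp/build/bin/llama-server --host 127.0.0.1 --port 8080 --models-dir %h/models --models-preset %h/models/models.ini
    Restart=on-failure

    [Install]
    WantedBy=default.target

Enable it with systemctl --user enable --now llamacpp.service; stopping and restarting it then never asks for a password.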
Also, no need for llama-swap. llama-server supports using a .ini file that contains the settings for your models.
The simplest way is to give it your models directory with --models-dir and then the .ini file with --models-preset. The .ini file layout is simple:
Just the [model file name] without the .gguf extension, then under it whatever settings (CLI options) you want to run with the model. (I haven't done much in mine, this is a WIP from a home server I'm working on).
And apparently, according to the docs, you can define options that apply to all models with a [*] section, which is neat.
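For illustration, a preset in that shape (the model name here is a placeholder, and the exact key syntax is an assumption; the point is one section per .gguf plus the optional [*] section):

    [*]
    ; settings here apply to every model
    ctx-size = 8192

    [Qwen3-4B-Instruct-Q4_K_M]
    ; CLI options for this model only
    n-gpu-layers = 999
    temp = 0.7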
WouterC@reddit
How can you automatically launch Qwen3.5-2B-Q6_K when llama-server starts?
I also have my models defined in a models.ini file.
higglesworth@reddit
This is super helpful, thanks bro!!
StardockEngineer@reddit
llama-server doesn't have nearly as many launch options as llama-swap, fyi.
bluecamelblazeit@reddit
Llama-swap is great and built exactly for this. You set everything in a config file, one entry per model, and you can swap between models in the UI or via the API.
https://github.com/mostlygeek/llama-swap
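A minimal config sketch (model name and path are placeholders; llama-swap fills in ${PORT} itself):

    models:
      "qwen3-4b":
        cmd: |
          llama-server --port ${PORT}
          -m /models/Qwen3-4B-Q4_K_M.gguf --ctx-size 8192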
Nexter92@reddit
You can now do the same with a config file in llama-server by default ;)
StardockEngineer@reddit
No, there are a lot of features missing from llama-server, unfortunately.
Nexter92@reddit
Don't talk about things you don't know about, please:
models.ini
RipperFox@reddit
Ofc router mode is "nice" - but llama-swap can even switch to vLLM, SGLang, etc. Your AI-generated example sucks, btw.
Nexter92@reddit
It's not AI... It's in their documentation under llama-server.
Those who use vLLM will never seriously use llama.cpp except for fun; they have huge rigs, we don't.
RipperFox@reddit
vLLM can be better on a single 5090 if you don't want to wait until llama.cpp catches up with its experimental forks. And your completely random (Vulkan, seriously?) Docker example is clearly badly formatted AI slop - who uses/needs Docker for llama-server router mode anyway?
Nexter92@reddit
Vulkan is better than CUDA or shitty ROCm for llama.cpp.
Docker is good practice for trying and testing tools: if a vulnerability or a backdoor is discovered, it won't have access to your host system. You're too dumb, little monkey; go back to playing with your nerd friends, and maybe get some sun, it will do you good.
RipperFox@reddit
Interesting strategy, calling someone a nerd (which I am) in 2026 - ok ok, I'll touch some grass.. I bet I'm even older than you. Have a bite to eat and a nice day! :)
StardockEngineer@reddit
lol that doesn’t even scratch the surface of llama-swap configs
awitod@reddit
You don't even need a config file. It can pick up the models from a folder.
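Something like (path illustrative):

    llama-server --models-dir ~/models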
bluecamelblazeit@reddit
I wasn't aware, thanks.
Including the functionality to swap the model that's loaded?
AurumDaemonHD@reddit
I wonder how much faster it is. It seemed kinda the same to me. The llama.cpp container starts fast.
Nexter92@reddit
Better support, better configuration
Citadel_Employee@reddit (OP)
Thank you, that might be just what I need.
GreenHell@reddit
And if the config file is daunting, your favourite free LLM (Gemini, ChatGPT, Claude, whatever) can write it for you, structured and clean. Just don't let it decide on your parameters without checking them.
charles25565@reddit
I use podman with --restart=always.
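Roughly like this (image tag, port, and model path are illustrative):

    podman run -d --restart=always --name llama-server \
      -p 8080:8080 -v ~/models:/models:Z \
      ghcr.io/ggml-org/llama.cpp:server \
      -m /models/model.gguf --host 0.0.0.0 --port 8080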
StardockEngineer@reddit
llama-swap. It's far more feature rich than llama-server and I need these extra features.
awitod@reddit
With a docker-compose file (your settings will vary):

    llama-router-server:
      image: ghcr.io/ggml-org/llama.cpp:server-cuda13
      container_name: llama-router-server
      gpus: all
      ports:
        - "8080:8080"
      volumes:
        - ./volumes/llama/models:/models
      command:
        - --models-dir
        - /models
        - --models-max
        - "1"
        - --no-models-autoload
        - --host
        - 0.0.0.0
        - --port
        - "8080"
        - --ctx-size
        - "262144"
        - --threads
        - "16"
        - --parallel
        - "8"
        - --cache-ram
        - "8192"
        - --n-gpu-layers
        - "999"
        - --kv-unified
        - --jinja
        - --cont-batching
        - --no-mmap

ProfessionalSpend589@reddit
I have a notes.txt file with a history of the commands I've used to run llama-server.
I usually just run the latest line manually.
jacek2023@reddit
I use two ways:
- I have a collection of scripts, one for each model
- I just run the command from the shell; it's in my history, so it's easy to recall
I have over 100 models, so the collection of scripts was a good idea in the past, because different models required different parameters (context length, ngl, etc.). But now I have more VRAM and llama.cpp is smarter about fitting models into memory, so I can usually just reuse the last command and change only the model.
I don't use llama-swap/router/etc
mister2d@reddit
Why don't you use router mode and presets in an ini file?
mister2d@reddit
I use router mode with global defaults and presets.
uber-linny@reddit
I have it as a *.bat file in my startup apps, and the same for embedding, reranking, Whisper, and Kokoro.
I use llama-swap to manage models in Open WebUI.
uber-linny@reddit
if not "%1"=="min" start /min cmd /c "%\~f0" min & exit
u/echo off
setlocal
:: Define the root working directory
set "WORK_DIR=C:\llamaROCM"
echo.
echo === VERIFYING FILES ===
:: 1. Check for llama-swap
if not exist "%WORK_DIR%\llama-swap.exe" (
echo ERROR: llama-swap.exe NOT FOUND in %WORK_DIR%
pause
exit /b 1
)
:: 2. Check for Embedding Batch file
if not exist "%WORK_DIR%\START_Embed.bat" (
echo ERROR: START_Embed.bat NOT FOUND in %WORK_DIR%
pause
exit /b 1
)
:: 3. Check for Reranker Batch file
if not exist "%WORK_DIR%\START_ReRanker.bat" (
echo ERROR: START_ReRanker.bat NOT FOUND in %WORK_DIR%
pause
exit /b 1
)
:: 4. Check for Whisper
if not exist "%WORK_DIR%\whisper.cpp\Whisper_Vulkan.bat" (
echo ERROR: Whisper_Vulkan.bat NOT FOUND in %WORK_DIR%\whisper.cpp
pause
exit /b 1
)
:: 5. Check for Fast-Kokoro
if not exist "%WORK_DIR%\Fast-Kokoro\Fast-Kokoro-ONNX.py" (
echo ERROR: Fast-Kokoro-ONNX.py NOT FOUND in %WORK_DIR%\Fast-Kokoro
pause
exit /b 1
)
echo.
echo === LAUNCHING SERVICES ===
echo Root: %WORK_DIR%
:: --- 1. LLM ---
echo Launching Local LLM...
start /min "Local LLM Models" cmd /k "cd /d %WORK_DIR% && llama-swap.exe"
timeout /t 1 >nul
:: --- 2. EMBEDDING ---
echo Launching Embedding...
start /min "Embedding" cmd /k "cd /d %WORK_DIR% && START_Embed.bat"
timeout /t 1 >nul
:: --- 3. RERANKER ---
echo Launching Reranker...
start /min "Reranker" cmd /k "cd /d %WORK_DIR% && START_ReRanker.bat"
timeout /t 1 >nul
:: --- 4. WHISPER ---
echo Launching Whisper...
start /min "Whisper STT" cmd /k "cd /d %WORK_DIR%\whisper.cpp && Whisper_Vulkan.bat"
timeout /t 1 >nul
:: --- 5. KOKORO TTS ---
echo Launching Fast-Kokoro...
:: Note: Assumes python is in your system PATH.
:: If you use a specific venv, change "python" to "your_venv\Scripts\python.exe"
start /min "Kokoro STT : 8880" cmd /k "cd /d %WORK_DIR%\Fast-Kokoro && python Fast-Kokoro-ONNX.py"
echo.
echo Launcher complete. All services started.
echo This window will now close.
timeout /t 2
exit
moderately-extremist@reddit
Use a code block to make your script format better.
CharacterAnimator490@reddit
Gemini/Qwen made me a nice little startup file.
I can choose the model, context, KV cache, and parallel setting.
BelgianDramaLlama86@reddit
I use a PowerShell shortcut on my desktop that starts llama-server while pointing at a models.ini file, where I have a list of all my models with their locations and parameters. The shortcut target is:

    C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -WindowStyle Minimized -Command "llama-server --webui-mcp-proxy --models-max 1 --models-preset C:\AI\Models\models.ini --port 8081"

It automatically unloads the previous model as I load a new one, like llama-swap would do, but without needing it :)
FreQRiDeR@reddit
Depends on the model. Different flags, parameters depending on model.
Objective-Stranger99@reddit
It autostarts with my tiling WM (Hyprland).
ambient_temp_xeno@reddit
I open the terminal, change disk and folder/s then use the up arrow key.
https://i.redd.it/bj2s0fvkubsg1.gif
SM8085@reddit
I made llama-server.bash.
It prompts me for which model to select from that very specific directory hierarchy.
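A stripped-down sketch of the idea (directory path illustrative):

    #!/usr/bin/env bash
    # List the .gguf files in the models directory and prompt for a choice,
    # then launch llama-server with the selected model.
    MODELS_DIR="$HOME/models"
    select model in "$MODELS_DIR"/*.gguf; do
        [ -n "$model" ] && break
    done
    exec llama-server -m "$model" --host 127.0.0.1 --port 8080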
I made it before llama-server could serve different models like llama-swap. I should really learn that system, but part of me likes having the model loaded and waiting for me. Qwen3.5-122B-A10B takes a minute to load on my rig.
FreonMuskOfficial@reddit
Is this essentially discussing the tweaking of the nano file and the params within? Then initiating ollama serve and then running the model with the new params?
moderately-extremist@reddit
I run llama-server with systemd. Previously I compiled llama-server and wrote the systemd unit myself, but I recently found out llama-server is in Debian's Unstable repo, so I set up a new server using that, which creates the systemd service file for you. Then I load models using a models-preset file.
madtopo@reddit
I keep all my model configuration in a single config.ini file, which I then pass to the llama-server process. I used to run it manually when I was learning how to use it; now I just run it with systemd.