How to simply run your model at startup in Debian/Ubuntu

Posted by EmilPi@reddit | LocalLLaMA

I see lots of posts asking how to autostart models on startup. One solution is to use llama.cpp and a systemd service to start an API endpoint after boot; you can then connect it to OpenWebUI or another OpenAI-API-compatible UI.

I have set up this exact startup scheme:

    # ASSUMING YOU HAVE INSTALLED CUDA ACCORDING TO THE OFFICIAL GUIDE
    # https://docs.nvidia.com/cuda/cuda-installation-guide-linux/
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_F16=ON
    cmake --build build --config Release --parallel 32

Now download some GGUF files. Large models may be split into several files whose names end with ...-00001-of-0000X; in that case, pass only the -00001-of-... file to `--model`.

Then (as root or with sudo) put the file `llm-server.service` into `/etc/systemd/system/`:

    [Unit]
    Description=llama.cpp server
    After=network.target

    [Service]
    User=ai
    WorkingDirectory=/home/ai/3rdparty/llama.cpp
    ExecStart=<FULL_PATH_TO_llama.cpp_cloned_folder>/build/bin/llama-server --host localhost --port 1234 --model <PATH_WITH_YOUR_GGUF_FILE(S)>/model.gguf -ngl 999 -c 4096
    Restart=always

    [Install]
    WantedBy=multi-user.target

*This assumes your model fully fits into your GPU; otherwise you should play with the -ts (tensor split over GPUs) and -ngl (how many layers to put on the GPU(s)) settings.*

Now enable and start your service (as root, or adding sudo before every command):

    systemctl enable llm-server.service
    systemctl start llm-server.service

If you have any questions or errors, I will try to answer them in the comments.
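Once the service has been started, a quick way to confirm that the endpoint actually came up is to poll the server's health route. The snippet below is a minimal sketch, not part of the original setup: the function name `check_llm_server` is made up for illustration, it assumes `curl` is installed, and it assumes the `--host localhost --port 1234` values from the unit file above. As far as I know, llama-server answers `GET /health` with a success status only once the model has finished loading, so a failure right after boot may just mean it is still loading.

```shell
# Hypothetical helper: report whether llama-server answers on its /health route.
# The default URL assumes the --host/--port values used in llm-server.service.
check_llm_server() {
    base_url="${1:-http://localhost:1234}"
    # --fail makes curl return non-zero on HTTP errors (e.g. model still loading)
    if curl --silent --fail --max-time 5 "$base_url/health" >/dev/null; then
        echo "llama-server is up at $base_url"
    else
        echo "llama-server is not responding at $base_url" >&2
        return 1
    fi
}
```

Call it as `check_llm_server` (or `check_llm_server http://localhost:1234`). If it reports that the server is not responding, `systemctl status llm-server.service` and `journalctl -u llm-server.service -n 50` should show why the service failed, for example a wrong model path in `ExecStart`.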