How to simply run your model at startup in Debian/Ubuntu
Posted by EmilPi@reddit | LocalLLaMA
I see lots of posts asking how to autostart models on startup. One solution is to use llama.cpp and a systemd service to start an API endpoint after boot. You can then connect it to OpenWebUI or another OpenAI API-compatible UI.
I have set up this exact startup scheme:

    # Assuming you have installed CUDA according to the official guide: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_F16=ON
    cmake --build build --config Release --parallel 32

Now download a model in GGUF format. Large models are often split into several files whose names end in ...-00001-of-0000X, ...-00002-of-0000X and so on. In that case, pass only the -00001-of-... file after `--model`.
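To avoid pointing `--model` at the wrong shard, something like the helper below can pick the file for you. It is a minimal sketch: `pick_model_file` is a made-up name, and it assumes all of a model's files sit in one directory and follow the -00001-of-... naming described above.

```shell
# Hypothetical helper: print the path that should follow --model.
# For a split model this is the ...-00001-of-... shard; otherwise it is
# the first .gguf file found in the directory.
pick_model_file() {
    dir=$1
    # Prefer the first shard of a split model, if one exists.
    for f in "$dir"/*-00001-of-*.gguf; do
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
            return 0
        fi
    done
    # Otherwise fall back to a plain single-file model.
    for f in "$dir"/*.gguf; do
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
            return 0
        fi
    done
    echo "no .gguf file found in $dir" >&2
    return 1
}
```

Calling `pick_model_file /path/to/models` prints one path, which you can paste into the service file. If the directory holds several different models, it simply returns the first match, so keep one model per directory.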
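Then (as root or with sudo) create the file `llm-server.service` in `/etc/systemd/system/` with the contents below, replacing the user, the working directory and the placeholder paths with your own: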

    [Unit]
    Description=llama.cpp server
    After=network.target

    [Service]
    User=ai
    WorkingDirectory=/home/ai/3rdparty/llama.cpp
    ExecStart=<FULL_PATH_TO_llama.cpp_cloned_folder>/build/bin/llama-server --host localhost --port 1234 --model <PATH_WITH_YOUR_GGUF_FILE(S)>/model.gguf -ngl 999 -c 4096
    Restart=always

    [Install]
    WantedBy=multi-user.target

*This assumes your model fits entirely into GPU memory. Otherwise, experiment with the -ts (tensor split across GPUs) and -ngl (number of layers to put on the GPU(s)) settings.*
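If the model does not fit, the `ExecStart` line could be adjusted along these lines. This is only a sketch: the values `20` and `3,1` are made-up starting points rather than recommendations, and the placeholder paths still have to be replaced with your own.

```ini
[Service]
# Illustrative values only: offload 20 layers to the GPUs (the rest stay
# on the CPU) and split the offloaded part roughly 3:1 between two GPUs.
ExecStart=<FULL_PATH_TO_llama.cpp_cloned_folder>/build/bin/llama-server --host localhost --port 1234 --model <PATH_WITH_YOUR_GGUF_FILE(S)>/model.gguf -ngl 20 -ts 3,1 -c 4096
```

A practical way to tune this is to raise `-ngl` until the server no longer loads, then step back a little. After every change to the unit file, run `systemctl daemon-reload` and then `systemctl restart llm-server.service` so systemd picks up the new command line.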
Now enable and start the service (as root, or with sudo before each command):

    systemctl enable llm-server.service
    systemctl start llm-server.service

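Loading a large model can take a while, so the endpoint may not answer right after the service starts. The sketch below polls the server until it responds. It assumes `curl` is installed and that your llama-server build exposes the `/health` endpoint; `wait_for_server` is a made-up helper name.

```shell
# Hypothetical helper: poll a URL once per second until it answers with a
# success status, or give up after the given number of tries (default 30).
wait_for_server() {
    url=$1
    tries=${2:-30}
    i=0
    while [ "$i" -lt "$tries" ]; do
        # -f makes curl fail on HTTP errors, e.g. while the model is loading.
        if curl -fsS --max-time 2 "$url" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}
```

For the service above, `wait_for_server http://localhost:1234/health 60 && echo ready` waits up to about a minute. If it never becomes ready, `systemctl status llm-server.service` and `journalctl -u llm-server.service` show why the server failed to start. Once it is up, point your UI at `http://localhost:1234/v1` as the OpenAI-compatible base URL.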
If you have any questions or run into errors, I will try to answer them in the comments.