Does Ollama work as a server to serve multiple simultaneous users and models at the same time?
Posted by badabimbadabum2@reddit | LocalLLaMA | 18 comments
Hi,
I am desperately trying to get any of these LLM tools to work as a server.
For a server I have 2 requirements:
1. More than one model is loaded in GPU memory and they are never unloaded.
* So if there are 20 users interacting with model1 and 20 users interacting with model2, the server would serve both as fast as possible; unloading would just slow things down.
2. The server needs to start automatically after a reboot, and the models need to be loaded into GPU memory automatically.
The GPU has 24GB, the first model is 5GB, and the second model is 13GB.
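As a quick sanity check on those numbers, here is a minimal sketch of the memory budget. The sizes come from the figures above. Treating the remainder as headroom for context and KV cache is an assumption, since real usage depends on context length and the number of parallel requests.

```python
# Sizes in GB, taken from the figures above.
GPU_MEMORY_GB = 24
MODEL_SIZES_GB = {"model1": 5, "model2": 13}


def vram_headroom_gb(gpu_gb, model_sizes_gb):
    """Return the GPU memory left once all model weights are resident."""
    return gpu_gb - sum(model_sizes_gb.values())


headroom = vram_headroom_gb(GPU_MEMORY_GB, MODEL_SIZES_GB)
# Assumption: whatever is left must cover context/KV cache for all users.
print(f"weights: {sum(MODEL_SIZES_GB.values())} GB, headroom: {headroom} GB")
```

So both sets of weights should fit at once, with roughly 6GB left over for everything else.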
The only way I have been able to achieve this is on Windows 11 with LM Studio and Task Scheduler. With a .BAT file calling a PowerShell script, both models are loaded into GPU memory. But I can't use Windows 11 as a server in production.
On Ubuntu, with Ollama I can only load one model at a time, and this of course does not work in my case because I need 2 models in GPU memory simultaneously.
Whenever I try to load a second model, the first model is unloaded.
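For illustration, here is a rough sketch of how preloading could look against Ollama's HTTP API. It assumes a reasonably recent Ollama version where the server is allowed to hold more than one model (for example via the `OLLAMA_MAX_LOADED_MODELS` environment variable), and it relies on `/api/generate` loading a model when no prompt is sent and on `keep_alive: -1` keeping it resident. The model names are placeholders, and none of this has been verified on the setup described here.

```python
import json
import urllib.request

# Ollama's default local address; adjust if the server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"
# Hypothetical names; substitute the real model tags.
MODELS = ["model1", "model2"]


def preload_payload(model):
    """Build a request with no prompt, which asks the server to only load the model."""
    # keep_alive=-1 asks the server to keep the model loaded indefinitely.
    return {"model": model, "keep_alive": -1}


def preload(model, url=OLLAMA_URL, timeout=600):
    """Send the load request (POST) and return the HTTP status code."""
    data = json.dumps(preload_payload(model)).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


def preload_all(models=MODELS):
    """Load every model in turn; meant to be run once after the server is up."""
    return {m: preload(m) for m in models}
```

Calling `preload_all()` from a startup script, once the Ollama service is running, would cover the "load automatically after reboot" requirement, provided the assumptions above hold.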
With LM Studio, I can't start the server on Ubuntu with cron and .sh scripts. It just never starts. Everything runs fine when I start the server and load the models manually.
Is there any other, more production-ready solution than Ollama and LM Studio, which look to be just hobbyist tools for chatting with LLMs? I need something production-ready, and I don't think I am able to use MLC yet.
18 Comments
chibop1@reddit
badabimbadabum2@reddit (OP)
chibop1@reddit
badabimbadabum2@reddit (OP)
chibop1@reddit
chibop1@reddit
gtek_engineer66@reddit
Chaosdrifer@reddit
bluelobsterai@reddit
cantgetthistowork@reddit
haydenhaydo@reddit
RemindMeBot@reddit
nerdlord420@reddit
DataCraftsman@reddit
kryptkpr@reddit
ggone20@reddit
badabimbadabum2@reddit (OP)
Murky_Mountain_97@reddit