Does Ollama work as a server to serve multiple simultaneous users and models at the same time?

Posted by badabimbadabum2@reddit | LocalLLaMA | View on Reddit | 18 comments

Hi, I am desperately trying to get any of these LLMs to work as a server. For a server I have two requirements:

1. More than one model is loaded in GPU memory, and they are never unloaded. If 20 users are interacting with model1 and 20 users with model2, the server should serve both as fast as possible; unloading would just slow things down.
2. The server needs to start automatically after a reboot, and the models need to be loaded into GPU memory automatically.

The GPU has 24GB; the first model is 5GB and the second model is 13GB.

The only way I have been able to achieve this is on Windows 11 with LM Studio and Task Scheduler: a .BAT file calls a PowerShell script, and both models are loaded into GPU memory. But I can't use Windows 11 as a server in production.

On Ubuntu with Ollama I can only load one model at a time, which of course does not work in my case because I need two models in GPU memory simultaneously. Whenever I try to load a second model, the first model is unloaded. With LM Studio on Ubuntu I can't start the server with cron and .sh scripts; it just never starts. Everything runs fine when I manually start the server and load the models.

Is there any more production-ready solution than Ollama and LM Studio, which look like hobbyist tools for chatting with LLMs? I need something production-ready, and I don't think I can use MLC yet.
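To make the two requirements concrete, here is a minimal sketch of what "load both models and never unload them" could look like against Ollama's HTTP API. It is a sketch under assumptions, not a confirmed fix: it assumes Ollama is listening on its default address, that the installed version accepts a `keep_alive` value of `-1` on `/api/generate` (documented as "keep loaded indefinitely"), and that the server is allowed to hold more than one model at once (newer releases expose an `OLLAMA_MAX_LOADED_MODELS` environment variable for this, as far as I know). The model names `model1` and `model2` are the placeholders from the post.

```python
import json
import urllib.request

# Assumption: Ollama's default local endpoint.
OLLAMA_URL = "http://localhost:11434/api/generate"
# Placeholder names from the post; replace with real model tags.
MODELS = ["model1", "model2"]


def preload_payload(model: str) -> dict:
    """Build a request body with no prompt.

    A prompt-less generate request asks Ollama to load the model;
    keep_alive=-1 asks it to keep the model in memory indefinitely.
    """
    return {"model": model, "keep_alive": -1}


def preload(model: str, url: str = OLLAMA_URL) -> int:
    """Send the preload request and return the HTTP status code."""
    body = json.dumps(preload_payload(model)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    # Loading a large model can take a while, so use a generous timeout.
    with urllib.request.urlopen(request, timeout=600) as response:
        return response.status


if __name__ == "__main__":
    for name in MODELS:
        print(name, preload(name))
```

For requirement 2, a script like this would have to run once after the Ollama service is up following a reboot (for example from a service manager or a boot-time job); how that is wired up depends on the machine and is not shown here.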