Does Ollama work as a server to serve multiple simultaneous users and models at the same time?

Posted by badabimbadabum2@reddit | LocalLLaMA | View on Reddit | 18 comments

Hi, I am desperately trying to get any of these LLMs to work as a server. For a server I have two requirements:

1. More than one model is loaded in GPU memory, and they are never unloaded. If there are 20 users interacting with model1 and 20 users interacting with model2, the server should serve both as fast as possible, and unloading would just slow things down.
2. The server needs to start automatically after a reboot, and the models need to be loaded into GPU memory automatically.

The GPU has 24GB; the first model is 5GB and the second model is 13GB.

The only way I have been able to achieve this is on Windows 11 with LM Studio and Task Scheduler: a .BAT file calls a PowerShell script, and both models are loaded into GPU memory. But I can't use Windows 11 as a server in production.

On Ubuntu, with Ollama I can only load one model at a time, which of course does not work in my case because I need two models in GPU memory simultaneously. Whenever I try to load a second model, the first model is unloaded. With LM Studio on Ubuntu, I can't start the server with cron and .sh scripts; it just never starts. Everything runs fine when I manually start the server and load the models.

Is there any more production-ready solution than Ollama and LM Studio, which look to be just hobbyist tools for chatting with LLMs? I need something production ready, and I don't think I am able to use MLC yet.

18 Comments

chibop1@reddit

Also, keep in mind that OLLAMA_NUM_PARALLEL will increase context size. For example, if you need 4 parallel requests for 8192 context size, the models will be loaded with context size of 8192*4=32768.
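The arithmetic above can be written out as a small sketch. The function name is hypothetical, and it only restates the multiplication described in this comment; it does not query Ollama.

```python
def total_context(per_request_ctx: int, num_parallel: int) -> int:
    """Context size the model is loaded with when every parallel slot
    gets its own per-request context window."""
    return per_request_ctx * num_parallel


# 4 parallel requests at 8192 tokens each
print(total_context(8192, 4))  # 32768
```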

badabimbadabum2@reddit (OP)

So more parallel requests means more VRAM usage? Is this true in general, or is it specific to Ollama?

chibop1@reddit

It's llama.cpp behavior, so it should be the same for everything that uses llama.cpp, like Koboldcpp, Ollama, LM Studio, etc. Basically, it reserves the maximum context size up front, for the case where it receives the maximum number of parallel requests, each at the maximum context length.
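To get a rough feel for why a larger reserved context costs VRAM, here is a back-of-the-envelope sketch of the key/value cache size for a standard transformer. The layer count, KV head count, head dimension, and 16-bit cache below are hypothetical values chosen for illustration, not the numbers for any model in this thread. Real usage also includes compute buffers, so a measured value should take precedence over this estimate.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size: one key and one value vector
    per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx


# Hypothetical model shape (assumption, for illustration only).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for parallel in (1, 2, 4):
    n_ctx = 8192 * parallel
    gib = kv_cache_bytes(n_ctx, LAYERS, KV_HEADS, HEAD_DIM) / 1024**3
    print(f"parallel={parallel} n_ctx={n_ctx} kv_cache~{gib:.1f} GiB")
```

Under these assumed dimensions the estimate grows linearly with the number of parallel slots, from about 1 GiB at one slot to about 4 GiB at four.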

badabimbadabum2@reddit (OP)

So if my 5GB model previously took 5GB of GPU memory when fully loaded, and I now set OLLAMA_NUM_PARALLEL=2, how much VRAM will be reserved on the GPU?

chibop1@reddit

Set the environment variable, restart the Ollama server, and run the model. Then run `ollama ps`; it will tell you how much memory the model is using.
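The same information that `ollama ps` prints is also available over Ollama's HTTP API at `/api/ps`. Below is a minimal sketch that summarizes such a response. The default address `http://localhost:11434` and the sample payload are assumptions for illustration, and the `size_vram` field name should be checked against the API documentation for your Ollama version.

```python
import json
import urllib.request


def summarize_ps(payload: dict) -> list[tuple[str, float]]:
    """Return (model name, VRAM in GiB) for each loaded model."""
    return [
        (m["name"], round(m.get("size_vram", 0) / 1024**3, 2))
        for m in payload.get("models", [])
    ]


def fetch_ps(base_url: str = "http://localhost:11434") -> dict:
    """Query a running Ollama server (assumed address) for loaded models."""
    with urllib.request.urlopen(f"{base_url}/api/ps") as resp:
        return json.load(resp)


# Made-up sample payload, so the sketch runs without a server.
sample = {"models": [{"name": "model1", "size_vram": 6 * 1024**3}]}
print(summarize_ps(sample))  # [('model1', 6.0)]
```

With a server running, `summarize_ps(fetch_ps())` would report the live numbers instead of the sample.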

chibop1@reddit

Yes, set the following environment variables as necessary:

* OLLAMA_NUM_PARALLEL: Maximum number of parallel requests
* OLLAMA_MAX_QUEUE: The queue length; defines the number of requests that may sit waiting to be picked up
* OLLAMA_MAX_LOADED_MODELS: Maximum number of loaded models
* OLLAMA_KEEP_ALIVE: The duration that models stay loaded in memory (I think -1 is forever?)
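As a sketch of how these variables might be applied when launching the server from a script, the helper below builds an environment for `ollama serve`. The function names and the specific values (4 parallel requests, a queue of 512, 2 loaded models) are assumptions for illustration; pick numbers that fit your VRAM. If Ollama runs as a systemd service, the variables would normally be set in the service's environment instead.

```python
import os
import subprocess


def build_ollama_env(num_parallel=4, max_queue=512, max_loaded_models=2,
                     keep_alive="-1", base_env=None):
    """Return a copy of the environment with the Ollama settings added."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "OLLAMA_NUM_PARALLEL": str(num_parallel),
        "OLLAMA_MAX_QUEUE": str(max_queue),
        "OLLAMA_MAX_LOADED_MODELS": str(max_loaded_models),
        "OLLAMA_KEEP_ALIVE": str(keep_alive),
    })
    return env


def start_server(env):
    """Launch `ollama serve` with the given environment (not called here)."""
    return subprocess.Popen(["ollama", "serve"], env=env)


# Show only the Ollama settings, starting from an empty base environment.
env = build_ollama_env(base_env={})
print({k: v for k, v in env.items() if k.startswith("OLLAMA_")})
```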

gtek_engineer66@reddit

Ollama is very good for running multiple models but very slow at concurrent requests. vLLM is excellent at concurrency.

Chaosdrifer@reddit

Why not use something like vllm?

bluelobsterai@reddit

+1. You’ll need to run two vLLM instances, one for each model. vLLM and quantization are not as straightforward, though.

cantgetthistowork@reddit

Aphrodite?

nerdlord420@reddit

You're probably going to want to utilize OLLAMA_KEEP_ALIVE=-1
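Keeping models loaded still requires something to load them once after a reboot. One possible approach is a startup script that sends each model a generate request with no prompt, which asks Ollama to load the model, and passes `keep_alive: -1` per request. This is a minimal sketch: the server address and the model name `model1` are placeholders, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_preload_request(model, base_url="http://localhost:11434"):
    """Build a generate request with no prompt, which asks Ollama to load
    the model; keep_alive=-1 asks it to stay loaded."""
    body = json.dumps({"model": model, "keep_alive": -1}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def preload(models, base_url="http://localhost:11434"):
    """Send one load request per model (requires a running server)."""
    for model in models:
        with urllib.request.urlopen(build_preload_request(model, base_url)) as resp:
            resp.read()


# Inspect the request without contacting a server.
req = build_preload_request("model1")
print(req.full_url, req.data.decode())
```

Calling `preload(["model1", "model2"])` from a script that runs after the Ollama service starts would cover the "load automatically after reboot" requirement, assuming both models fit in VRAM together.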

DataCraftsman@reddit

You could host two Ollama containers in Docker and use two different ports for the different loaded models. So you would run the following commands after setting up CUDA:

`docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama-13gb ollama/ollama`

`docker run -d --gpus=all -v ollama:/root/.ollama -p 11435:11434 --name ollama-5gb ollama/ollama`

Then just reference the port you need in each web UI or API you are using.
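On the client side, "reference the port you need" could look like the sketch below. The container names and ports come from the two docker commands in this comment; the helper name, the `localhost` host, and the `/api/generate` path default are assumptions for illustration.

```python
# Container name -> host port, taken from the two docker commands.
PORTS = {"ollama-13gb": 11434, "ollama-5gb": 11435}


def endpoint_for(container, host="localhost", path="/api/generate"):
    """Return the URL a client would call for the given container."""
    return f"http://{host}:{PORTS[container]}{path}"


print(endpoint_for("ollama-5gb"))   # http://localhost:11435/api/generate
print(endpoint_for("ollama-13gb"))  # http://localhost:11434/api/generate
```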
