Serve 100 Large AI Models on a single GPU with minimal impact on time to first token.
Posted by SetZealousideal5006@reddit | LocalLLaMA | View on Reddit | 29 comments
I wanted to build an inference provider for proprietary AI models, but I did not have a huge GPU farm. I started experimenting with serverless AI inference, but found that cold starts were huge. I went deep into the research and put together an engine that loads large models from SSD to VRAM up to ten times faster than alternatives. It works with vLLM and transformers, with more integrations coming soon.
With this project you can hot-swap entire large models (32B) on demand.
It's great for:
- Serverless AI Inference
- Robotics
- On Prem deployments
- Local Agents
And it's open source.
Let me know if anyone wants to contribute :)
3Ex8@reddit
This is really cool!
badgerbadgerbadgerWI@reddit
This is solving the right problem. Most production deployments don't need all models hot in memory - they need smart scheduling.
Have you tested this with heterogeneous workloads? Like mixing embedding models with LLMs? That's where I've seen most orchestration frameworks fall apart.
SetZealousideal5006@reddit (OP)
Will add this to the roadmap, thanks for the feedback :)
SetZealousideal5006@reddit (OP)
I have tried running a voice agent on a single GPU; it kind of worked, but it still needs more work.
BumbleSlob@reddit
Is the speed increase because you are storing the uncompressed weights on SSD?
SetZealousideal5006@reddit (OP)
It creates a memory map of the model and streams the chunks through RAM into VRAM using a pinned memory pool.
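Roughly, the pattern looks like this (a simplified PyTorch sketch of mmap → pinned staging buffers → async host-to-device copies; the function name, chunk size, and double buffering are just for illustration, not the actual implementation):

```python
import numpy as np
import torch

CHUNK = 64 * 1024 * 1024  # 64 MiB staging chunks (size chosen for illustration)

def stream_file_to_gpu(path: str, device: str = "cuda") -> torch.Tensor:
    """Memory-map a raw weight file and stream it to VRAM through pinned buffers."""
    mapped = np.memmap(path, dtype=np.uint8, mode="r")
    total = mapped.shape[0]
    out = torch.empty(total, dtype=torch.uint8, device=device)

    # Two pinned (page-locked) staging buffers so disk reads and H2D copies can overlap.
    staging = [torch.empty(CHUNK, dtype=torch.uint8, pin_memory=True) for _ in range(2)]
    copy_stream = torch.cuda.Stream(device=device)
    done = [torch.cuda.Event(), torch.cuda.Event()]
    for evt in done:
        evt.record(copy_stream)

    for i, off in enumerate(range(0, total, CHUNK)):
        buf, evt = staging[i % 2], done[i % 2]
        n = min(CHUNK, total - off)
        evt.synchronize()  # wait until the previous H2D copy out of this buffer is done
        # SSD / page cache -> pinned RAM (the mmap pages get faulted in here)
        buf[:n].numpy()[:] = mapped[off:off + n]
        with torch.cuda.stream(copy_stream):
            # pinned RAM -> VRAM, asynchronous with the next disk read
            out[off:off + n].copy_(buf[:n], non_blocking=True)
            evt.record(copy_stream)

    copy_stream.synchronize()
    return out
```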
BumbleSlob@reddit
Does this approach mean you can run models larger than your VRAM? Sounds neat. I def want to give it a poke around.
SetZealousideal5006@reddit (OP)
I’m working on it :) getting good tokens per second is a challenge though.
BumbleSlob@reddit
For sure. Just think you’ve got a neat idea going here. Keep up the experimentation
SetZealousideal5006@reddit (OP)
Thanks 🙏 will keep you posted :)
_nickfried@reddit
I'm wondering if it’s possible to use 4-5 PCIe 5.0 SSDs to fully saturate the GPU’s PCIe 5 bandwidth for streaming experts.
What happens if there are multiple GPUs and even more SSDs?
SetZealousideal5006@reddit (OP)
Yeah, that would make it better. I have seen speedups go down due to saturation when running on RunPod shared instances.
I have also seen significant speedups after changing the PCIe version on my Orin Nano.
If you try it on one of those machines, please share benchmarks :)
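As a rough back-of-the-envelope on the saturation question (ballpark figures, not measurements):

```python
# Ballpark numbers; real throughput varies with drive, filesystem, and RAID layout.
gpu_link_gbps = 63.0   # usable bandwidth of a PCIe 5.0 x16 slot, GB/s
ssd_read_gbps = 14.0   # sequential read of a fast PCIe 5.0 x4 NVMe drive, GB/s

print(f"~{gpu_link_gbps / ssd_read_gbps:.1f} drives to saturate the GPU link")  # ~4.5
```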
DeltaSqueezer@reddit
What's the difference between what you did and ServerlessLLM?
SetZealousideal5006@reddit (OP)
The main difference is that this doesn't only work with LLMs; the repo has implementations for STT, VLMs, etc.
This repo is actually based on the Serverless LLM storage library.
Given that I did not want to make this a CLI tool + SDK, wanted to decouple the scheduler layer, and overall wanted to follow a different direction, I opted to give credit to the original repo rather than fork it.
I am trying to make the storage service an extension of torch, so that every model implemented in torch can use these speedups.
As a next step, I am exploring how to run bigger models that don't fit in VRAM at a usable latency.
DeltaSqueezer@reddit
Thanks. While I have your attention, what's the difference between your implementation/ServerlessLLM and the vLLM native load sharded_state?
SetZealousideal5006@reddit (OP)
My implementation has an automatic compiled patch, which makes it easy to plug the speedup described in the Serverless LLM paper into any inference engine, not only vLLM, and to keep it maintained.
It also fixes several segmentation faults in Serverless LLM and memory leak issues when loading and offloading vLLM models.
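Loosely, the "patch the loader instead of forking the engine" idea looks like this (just an illustration of patching a checkpoint-loading call at runtime, not the actual mechanism used here):

```python
# Hypothetical illustration: swap the stock safetensors loader for a faster one,
# so any engine that calls it picks up the fast path without being forked.
import safetensors.torch

_original_load_file = safetensors.torch.load_file

def _fast_load_file(filename, device="cpu"):
    # In the real project this would be the SSD -> pinned RAM -> VRAM streaming loader;
    # here we simply load straight onto the GPU to keep the sketch self-contained.
    return _original_load_file(filename, device="cuda")

# Engines that call safetensors.torch.load_file after this point use the patched loader.
safetensors.torch.load_file = _fast_load_file
```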
This is the first iteration, but my roadmap is
C0DASOON@reddit
Excellent work. A comparison against [Run:AI Model Streamer](https://github.com/run-ai/runai-model-streamer) would be very useful.
SetZealousideal5006@reddit (OP)
Coming soon!
OverclockingUnicorn@reddit
How does it handle the case where a request is processing on model A and a request for model B comes in, but the GPU does not have enough memory to load both models simultaneously? Does it queue the request, wait for model A to finish and be unloaded, and then load model B? Or drop the request entirely?
SetZealousideal5006@reddit (OP)
This engine manages loading and unloading models, so you can only run one at a time.
I am thinking of building a separate project dedicated to schedulers, but I may add a first-come, first-served scheduler in the meantime.
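Something like this would probably be enough as a stopgap (the `swap_in` / `infer` calls below are hypothetical placeholders, not the actual API):

```python
# A rough first-come, first-served sketch: one worker drains a queue and hot-swaps models.
import asyncio

async def fcfs_worker(queue: asyncio.Queue, engine) -> None:
    active_model = None
    while True:
        model_name, prompt, reply = await queue.get()
        if model_name != active_model:
            # Only one model fits at a time, so evict the current one and stream in the next.
            engine.swap_in(model_name)   # hypothetical: unload current model, load the requested one
            active_model = model_name
        reply.set_result(engine.infer(prompt))  # hypothetical inference call
        queue.task_done()

async def submit(queue: asyncio.Queue, model_name: str, prompt: str) -> str:
    # Callers enqueue work and await the result; requests are served strictly in arrival order.
    reply = asyncio.get_running_loop().create_future()
    await queue.put((model_name, prompt, reply))
    return await reply
```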
no_no_no_oh_yes@reddit
Does it accept custom vLLM parametrization? Every single model I load into vLLM needs some weird flags or whatever. Some of them also need different vLLM containers.
SetZealousideal5006@reddit (OP)
This is meant for the situation where your models need different sampling parameters. My pain with inference providers is that you cannot control the model you use.
By using the SDK you can spawn different versions of a model.
When models require different vLLM versions, my suggestion would be to spawn several flashtensors containers and build a scheduler on top that enables the required model.
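A minimal router over those containers could look something like this (hostnames and model names are made up for the example):

```python
import requests

# Each container is pinned to the vLLM version and startup flags its model needs.
CONTAINERS = {
    "llama-3.1-8b-instruct": "http://vllm-a:8000",
    "qwen2.5-32b-instruct": "http://vllm-b:8000",
}

def chat(model: str, messages: list[dict]) -> dict:
    # vLLM serves an OpenAI-compatible API, so the router only has to forward
    # the request to the container that hosts the requested model.
    resp = requests.post(
        f"{CONTAINERS[model]}/v1/chat/completions",
        json={"model": model, "messages": messages},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()
```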
edrevo@reddit
Very cool! Could you explain somewhere (ideally in the GitHub repo!) how you achieved those speedups?
BarnacleOk1355@reddit
What's the minimum hardware this can run on?
SetZealousideal5006@reddit (OP)
This was benchmarked on an H100, but it should run on any CUDA-compatible device. The upper limit on speed is SSD read bandwidth.
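To give a feel for that limit with ballpark numbers (not benchmark results):

```python
# Back-of-the-envelope floor on cold-load time: model size / sequential read speed.
model_gb = 32e9 * 2 / 1e9          # 32B params at FP16 ~= 64 GB of weights
pcie4_ssd, pcie5_ssd = 7.0, 14.0   # ballpark sequential reads, GB/s

print(f"PCIe 4.0 SSD floor: ~{model_gb / pcie4_ssd:.1f} s")  # ~9.1 s
print(f"PCIe 5.0 SSD floor: ~{model_gb / pcie5_ssd:.1f} s")  # ~4.6 s
```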
OverclockingUnicorn@reddit
Any reason you couldn't load over the network?
ethertype@reddit
Interesting. Finally, system-to-GPU bandwidth is starting to matter for inference too. Have you looked into resizable BAR and whether or not it makes a difference for model loading?
SetZealousideal5006@reddit (OP)
The benchmarks
DefNattyBoii@reddit
Holy speed