Serving AI From The Basement - 192GB of VRAM Setup
Posted by XMasterrrr@reddit | programming | 10 comments
Hey guys, this is something I have been intending to share here for a while. This setup took me some time to plan and put together, and then some more time to explore the software part of things and the possibilities that came with it.
Part of the main reason I built this was data privacy: I do not want to hand over my private data to any company to further train their closed-weight models. And given the recent drop in output quality on different platforms (ChatGPT, Claude, etc.), I don't regret spending the money on this setup.
I was also able to do a lot of cool things with this server by leveraging tensor parallelism and batch inference, generating synthetic data, and experimenting with fine-tuning models on my private data. I am currently building a model from scratch, mainly as a learning project, but I am also finding some cool things while doing so, and if I can get around to ironing out the kinks, I might release it and write a tutorial from my notes.
So I finally had the time this weekend to get my blog up and running, and I am planning on following up this post with a series of posts on my learnings and findings. I am also open to topics and ideas to experiment with on this server and write about, so feel free to shoot your shot if you have ideas you want to experiment with but don't have the hardware; I am more than willing to run them on your behalf and share the findings 😄
Please let me know if you have any questions, my PMs are open, and you can also reach me on any of the socials I have posted on my website.
LordDarthShader@reddit
Are you running an integrated Intel GPU to make use of the shared system memory?
-grok@reddit
Mad lad!
North_Permission3129@reddit
Really cool setup! I was thinking about doing something like this and using some open source LLMs. How difficult is the software side of things, and what sort of hardware are you using?
XMasterrrr@reddit (OP)
In short, the software side is hectic. Each new model that gets released on a new architecture (or a new iteration of one) means inference engines need to implement that architecture before they can run it. The best inference engines, written to spec, are definitely the ones provided by the model's authors. We rely on open-source software, and there is an abundance of options; I move between vLLM, Aphrodite, Exllama2, and llama.cpp.
To utilize this server properly you need an inference engine that supports tensor parallelism. The first two engines I mentioned have supported tensor parallelism for quite some time, Exllama2 only just introduced it, and llama.cpp has not and probably never will, because of its focus on CPU/RAM offloading.
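For anyone curious what that looks like in practice, here is a minimal sketch of spinning up vLLM with tensor parallelism; the model name and GPU count are just placeholders for whatever fits your own setup:

```python
# Minimal vLLM tensor-parallel sketch (model name and GPU count are placeholders).
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",  # any HF model the engine supports
    tensor_parallel_size=4,        # shard each layer's weights across 4 GPUs
    gpu_memory_utilization=0.90,   # fraction of each card's VRAM vLLM may claim
)
```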
Then it becomes about finding an engine that supports the model architecture you want to run, checking whether that engine supports quantization if the model is larger than the VRAM you have, and if it does, whether you need to load the entire model into VRAM or it can quantize it in batches, and so on...
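As a concrete example of the quantization/offloading trade-off, here is a rough llama-cpp-python sketch that loads a pre-quantized GGUF and pushes only some layers to the GPU; the file path and layer count are made up for illustration:

```python
# Rough sketch: partial GPU offload of a pre-quantized GGUF with llama-cpp-python.
# The model path and layer count below are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3.1-70b-instruct.Q4_K_M.gguf",  # already quantized
    n_gpu_layers=40,   # layers kept in VRAM; the rest stay in system RAM
    n_ctx=8192,        # context window to allocate
)

out = llm("Why does tensor parallelism matter for large models?", max_tokens=128)
print(out["choices"][0]["text"])
```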
On the other hand, although it might sound exhausting, here is an example of something amazing I could do using vLLM with a fully unquantized Llama 3.1 70B: https://x.com/TheAhmadOsman/status/1828922904626770259?t=-FKC-yuwDk1AcOL3kx15vA&s=19
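To give a flavor of the batch-inference side, here is a hedged sketch of how batched generation looks in vLLM; the prompts are toy examples and the tensor-parallel setting again assumes 4 GPUs:

```python
# Sketch of batched generation with vLLM; prompts and model are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3.1-70B-Instruct", tensor_parallel_size=4)

prompts = [
    "Write a haiku about GPUs.",
    "Explain the KV cache in one sentence.",
    "List three uses for synthetic data.",
]
params = SamplingParams(temperature=0.7, max_tokens=128)

# vLLM batches these requests internally, which is what makes large synthetic
# data generation runs fast compared to one-prompt-at-a-time chat loops.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```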
All of this and I am just talking about inference, not fine-tuning, not training. The rabbit hole gets deeper and deeper, but I guess I love it and that's why I am here 😅 I plan on dedicating a blog post (or two) to the software aspect alone.
Re: hardware, I have it listed on the blog, do you have something specific in mind with your question?
North_Permission3129@reddit
Thank you for the reply! Nothing specific that I had in mind, but I will check out the blog. I’ve always wanted to get into this sort of stuff, but man the learning curve seems high.
MrPentiumD@reddit
You can use Ollama, which runs a server with an API very similar to OpenAI's that you can then call from your software. It doesn't even require a monster rig to run.
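For context, here is roughly what that looks like from the client side; this sketch assumes Ollama is running locally on its default port with a llama3.1 model already pulled:

```python
# Sketch of calling a local Ollama server through its OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible API
    api_key="ollama",                      # required by the client, ignored by Ollama
)

resp = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Hello from my single-GPU box!"}],
)
print(resp.choices[0].message.content)
```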
XMasterrrr@reddit (OP)
Ollama is a wrapper around llama.cpp, and IMHO it is a blah tool that just sets environment variables, does a bad job at offloading calculations, and leads to a lot of frustration.
It is great if you have 1 GPU and don't wanna do much except run basic models with 1 chat session; anything more than that and it really isn't worth using.
Check out my earlier reply for some context on the shortcomings.
MrPentiumD@reddit
Thank you I’m really an amateur when it comes to AI stuff. I run a single GPU server rig so I’ve never had to deal with that.
XMasterrrr@reddit (OP)
Absolutely, no problem. We all gotta start somewhere.
The potential of AI is in having a bunch of them talking with each other (AKA agents), doing small jobs on models fine-tuned for those specific jobs, at scale. That's where something like my AI server shines.
ansible@reddit
We were looking at building a 4 GPU system based on a WRX80 Creator motherboard.
To avoid riser cables, we were thinking about making a custom water cooling loop; with the right fittings, it is apparently possible to connect all the water block inputs together and all the outputs together. Two combo reservoir/pumps, 3 or 4 radiators, flexible tubing. Finding a beefy enough power supply was also an issue, though we could use two and have one switched on by the other.