Is this setup good enough to run LLaMA 70B at 8-bit quantization?
Posted by matt23458798@reddit | LocalLLaMA | 6 comments
Hey everyone!
I'm building a budget-friendly AI/ML rig primarily to experiment with running large language models like LLaMA 70B at 8-bit quantization. This is my first time building a rig like this, so I’ve put together the following components and wanted to get your thoughts on whether this setup is sufficient for the task:
My Current Setup:
- GPU: 4x RTX 3090s, liquid-cooled with EKWB Vector RTX water blocks.
- Motherboard: ASUS ROG Zenith Extreme X399.
- CPU: AMD Ryzen Threadripper 1950X (16 cores, 32 threads).
- RAM: 32GB DDR4.
- Storage: 1TB SSD running Linux.
- PSU: 2000W Modular Mining Power Supply (supports up to 6-8 GPUs).
- Chassis: Open-air mining rig (supports 6 GPUs, 81mm spacing between slots). See the attached pic for what it would look like (ignore the components in the photo, it's just a stock image).
- Cooling: Liquid cooling loop for GPUs, 4 basic fans for airflow.
- OS: Planning to run Ubuntu/Linux for compatibility with AI frameworks.
What I Need to Know:
- CPU cooling: I've read online that the Threadripper needs special liquid cooling or it will have major issues. Do I really need that, or would the open-air mining rig with the four fans be enough to cool the CPU?
- Open-air mining rig size: Would this be large enough for my setup? I know it usually doesn't have a spot for the liquid-cooling reservoir and radiator, but honestly I was just thinking of placing those on the table next to the rig. Any other recommendations would be helpful.
- Performance: Will this setup handle running LLaMA 70B with 8-bit quantization effectively? I understand memory usage is crucial for larger models like this. What kind of tokens/second should I expect for a moderate-length request (~1,000 tokens)? What are the limits of this rig: can it run larger models, or even LLaMA 70B at 16-bit? (Rough memory math in the sketch right after this list.)
- Bottlenecks: Are there any obvious weak points (e.g., CPU, RAM) that might limit the system's ability to load and run inference on the model?
- Improvements: If there are upgrades you'd recommend without breaking the bank, I’d love to hear them!
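Rough memory math for the 8-bit question above; the architecture numbers (80 layers, 8 KV heads, head dim 128) are assumptions for a Llama-70B-class model, not measured values:

```python
# Back-of-envelope VRAM estimate for a 70B model at 8-bit on 4x 3090.
params = 70e9
weights_gb = params * 1 / 1e9        # 1 byte/weight at 8-bit -> ~70 GB

# KV cache: assumed Llama-70B-class architecture (80 layers, 8 KV heads
# via GQA, head dim 128) with fp16 cache entries -- check your exact model.
layers, kv_heads, head_dim, ctx = 80, 8, 128, 4096
kv_cache_gb = 2 * layers * kv_heads * head_dim * 2 * ctx / 1e9   # ~1.3 GB

total_gb = weights_gb + kv_cache_gb   # ~71 GB
vram_gb = 4 * 24                      # 96 GB across four 3090s
print(f"~{total_gb:.0f} GB needed of {vram_gb} GB available")
```

By this math, 8-bit weights plus a 4k-token KV cache come to roughly 71 GB, which fits in 96 GB with headroom; fp16 weights alone would be about 140 GB and won't fit.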
I’m aware that 32GB of RAM might be on the low side for larger datasets or training, but I was hoping it would suffice for inference, and I can always upgrade to 64GB or more later. Do I need to go to 64GB now, or can it wait? Is the older PCIe 3.0 architecture of the X399 board a dealbreaker?
Looking forward to your advice! Thanks in advance for helping me optimize this build.
fasti-au@reddit
I/O is only under load when the model is loading; the passing of results between cards is small data.
a_beautiful_rhind@reddit
PCIe 3.0 is fine. More RAM will help you when caching weights so you don't have to load everything from disk all the time. Also, you can perform ops like quantization/conversion on the CPU.
The liquid cooling is so much more money than a couple more sticks of DDR4.
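For a concrete picture of those CPU-side ops, this is roughly the convert-then-quantize flow with llama.cpp's tooling; the paths are placeholders, and the script/binary names are what recent llama.cpp checkouts use, so verify against yours:

```python
import subprocess

# Convert HF weights to GGUF, then quantize to Q8_0. Both steps run on
# the CPU and chew through system RAM, which is where >32GB helps.
# Paths are illustrative placeholders.
subprocess.run(
    ["python", "convert_hf_to_gguf.py", "/models/llama-70b-hf",
     "--outfile", "/models/llama-70b-f16.gguf"],
    check=True,
)
subprocess.run(
    ["./llama-quantize", "/models/llama-70b-f16.gguf",
     "/models/llama-70b-q8_0.gguf", "Q8_0"],
    check=True,
)
```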
matt23458798@reddit (OP)
Liquid cooling for the Threadripper CPU? So can I just run it normally with some fans in an open rig?
a_beautiful_rhind@reddit
I thought you were liquid cooling the cards. I'd love to shrink all mine down, but those coolers cost almost as much as another GPU last I checked.
You can probably air cool your CPU, but those AIOs were pretty cheap, so I don't think it makes a difference. Lack of RAM is all that stands out. I have 96GB in one machine and 256GB in another. It feels like a necessity when you work with all these large files.
bluelobsterai@reddit
I don’t know Threadrippers, but my Romes are fine air-cooled.
DinoAmino@reddit
96GB of VRAM can run a q8 GGUF no problem. You might want to use q6_K_L to fit more context, like 32k. Or better, run vLLM with an INT8 quant at 28k.
No go for fp16 though, unless you get more RAM and offload to CPU. But don't bother with fp16, seriously; 8-bit works great.
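A minimal sketch of that vLLM route, assuming an INT8 (w8a8) 70B checkpoint; the model ID here is a placeholder, not a real repo:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="someorg/llama-70b-int8",  # hypothetical INT8 70B checkpoint
    tensor_parallel_size=4,          # split the model across the four 3090s
    max_model_len=28672,             # ~28k context, per the comment above
    gpu_memory_utilization=0.95,     # leave a little VRAM headroom
)

params = SamplingParams(max_tokens=256, temperature=0.7)
out = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(out[0].outputs[0].text)
```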