Hardware Review & Sanity Check
Posted by MegaSuplexMaster@reddit | LocalLLaMA | 5 comments
We are doing a proof of concept for an internal AI build at my company.
Here is the hardware I have spec'd out (we had a lot of this on site already); wanted to get your thoughts on whether I'm heading in the right direction:
• Dell T550 Tower Server
• Dual Intel Xeon Silver 4309Y (8C, 2.8GHz)
• 256 GB RAM
• 2x NVIDIA Tesla T4 (16GB each)
• RAID 1 – OS (500GB SSD)
• RAID 5 – Data/Models (1TB)
I loaded up Docker, Open WebUI, and Ollama. The main goal is to start with a standard chatbot to get everyone in the company comfortable using AI as an assistant — helping with emails and everyday tasks. From there, we plan to add internal knowledge bases covering HR, IT, and Finance. The longer-term goal is enabling the team to research deals and accounts, as we are a sales organization.
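For anyone replicating this, a minimal sketch of that stack as a Compose file — assuming the standard `ollama/ollama` and `open-webui` images and the NVIDIA container toolkit; service names, the volume name, and the host port are just illustrative choices:

```yaml
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama:/root/.ollama       # persist pulled models across restarts
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia       # expose the T4s to the container
              count: all
              capabilities: [gpu]
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"                # web UI on host port 3000
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
volumes:
  ollama:
```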
Like I said, this is just a POC; wanted to confirm I'm on the right track and get y'all's thoughts.
thanks!
ttkciar@reddit
You will want better GPUs than that.
32GB of VRAM is good for running a single instance of a mid-sized LLM (in the 24B to 32B parameters range, quantized to Q4_K_M) for a single person, but when you are serving multiple people you need to allow for enough VRAM for the K and V caches of all concurrent users, which eats up several GB.
Depending on how many concurrent users you anticipate having, you should be looking at 48GB of VRAM at the very least, and 64GB if you want to allow long-context tasks (like RAG, aka "chatting about your documents").
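To make "several GB" concrete, here's a back-of-envelope KV-cache calculation. All model dimensions are assumptions for a hypothetical 32B-class model with grouped-query attention, not any specific model:

```python
# Rough KV-cache sizing per concurrent user (all dimensions assumed).
n_layers = 64        # transformer layers (assumption for a ~32B model)
n_kv_heads = 8       # KV heads under grouped-query attention (assumed)
head_dim = 128       # per-head dimension (assumed)
elem_bytes = 2       # fp16 cache entries

# K and V each store n_kv_heads * head_dim values per layer per token.
bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * elem_bytes

ctx_tokens = 8192    # context budget per user (assumed)
per_user_gib = bytes_per_token * ctx_tokens / 1024**3
print(f"KV cache per user: {per_user_gib:.1f} GiB")  # 2.0 GiB with these numbers
```

So even at a modest 8K context, each concurrent user can tie up a couple of GiB on top of the model weights, which is why the concurrency headroom matters.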
MelodicRecognition7@reddit
These specs are very low for LLMs; they're suitable only for very small models. Read this to get some understanding: https://old.reddit.com/r/LocalLLaMA/comments/1rqo2s0/can_i_run_this_model_on_my_hardware/
overand@reddit
Personally, I'd recommend avoiding the Tesla T4 cards; they're old enough that you may start to have compatibility issues. A newer card with less VRAM might be a better choice for proof-of-concept, but others with Tesla T4 experience should chime in.
If you can get an RTX 3090 (or better, two) into your build, you'll be doing much better.
Also, I'd avoid RAID 5 entirely in 2026; if you can afford downtime but want capacity, put "disposable" data on a RAID 0 stripe. (Disposable only! Which is a Yes for models, but No for data.)
Having your models on spinning disks isn't great, but it's fine if you're literally only using one model ever. You can bet on 100-200 MB/sec from spinning disks. If you're using that full 64 GB of VRAM, call it 40 gigs for the model and 24 gigs for context as a random guess, your model will load in 3.5 - 5 minutes. That's a heck of a long time to wait for your query to even start processing.
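The load-time estimate above is just model size divided by sustained sequential read speed; a quick sketch using the comment's own guesses (40 GB model, 100-200 MB/s disk):

```python
# Model load time = model size / sustained sequential read speed.
model_gb = 40  # the "random guess" model size from above
for mb_per_s in (100, 200):  # typical spinning-disk sequential reads
    seconds = model_gb * 1000 / mb_per_s
    print(f"{mb_per_s} MB/s: {seconds / 60:.1f} min")
```

For comparison, an NVMe SSD sustaining around 3 GB/s would load the same 40 GB model in under 15 seconds.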
Now, I'm happy with the models I use on a dual 3090 system, decent professional-ish results on models as small as ~20 gigs. That's still nearly a 2 minute load time from a spinning disk. If the model stays loaded in VRAM (or at least system RAM) that's much less of an issue, of course! Just wanted to help you set expectations.
That CPU is fine for proof-of-concept; I do local LLM inference on a system with a Ryzen 5 3600 and 64 GB (now 128 GB) of DDR4 RAM. But performance drops WAY off if the model doesn't fit in VRAM.
MegaSuplexMaster@reddit (OP)
I made a mistake in my post: the 1TB drives for models and data are SSDs in RAID 1. But I do follow what you're getting at; thanks for the advice.
overand@reddit
My next piece of advice is... when you're comfortable and have selected a model, switch to something other than Ollama. I don't agree with people who say you should avoid it at all costs — I think it's a decent way to get started — but if you're at all CLI savvy, just get going with `llama.cpp` at least, or maybe vLLM if you want to start with the tool you're likely to end up using as a multi-user organization.