Hardware Review & Sanity Check
Posted by MegaSuplexMaster@reddit | LocalLLaMA | 5 comments
We are doing a proof of concept for an internal AI build at my company.
Here is the hardware I have spec'd out (we had a lot of this on site already); wanted to get your thoughts on whether I'm heading in the right direction:
• Dell T550 Tower Server
• Dual Intel Xeon Silver 4309Y (8C, 2.8GHz)
• 256 GB RAM
• 2x NVIDIA Tesla T4 (16GB each)
• RAID 1 – OS (500GB SSD)
• RAID 5 – Data/Models (1TB)
I loaded up Docker, Open WebUI, and Ollama. The main goal is to start with a standard chatbot to get everyone in the company comfortable using AI as an assistant — helping with emails and everyday tasks. From there, we plan to add internal knowledge bases covering HR, IT, and Finance. The longer-term goal is enabling the team to research deals and accounts, as we are a sales organization.
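For anyone replicating this, a minimal sketch of that stack as a Compose file — assuming the standard `ollama/ollama` and `open-webui` images and the NVIDIA container toolkit; service names, the volume name, and the host port are just illustrative choices:

```yaml
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama:/root/.ollama       # persist pulled models across restarts
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia       # expose the T4s to the container
              count: all
              capabilities: [gpu]
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"                # web UI on host port 3000
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
volumes:
  ollama:
```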
Like I said, this is just a POC; wanted to confirm I'm on the right track and get y'all's thoughts.
thanks!
ttkciar@reddit
You will want better GPUs than that.
32GB of VRAM is good for running a single instance of a mid-sized LLM (in the 24B to 32B parameters range, quantized to Q4_K_M) for a single person, but when you are serving multiple people you need to allow for enough VRAM for the K and V caches of all concurrent users, which eats up several GB.
Depending on how many concurrent users you anticipate having, you should be looking at 48GB of VRAM at the very least, and 64GB if you want to allow long-context tasks (like RAG, aka "chatting about your documents").
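To make "several GB" concrete, here's a back-of-envelope KV-cache calculation. All model dimensions are assumptions for a hypothetical 32B-class model with grouped-query attention, not any specific model:

```python
# Rough KV-cache sizing per concurrent user (all dimensions assumed).
n_layers = 64        # transformer layers (assumption for a ~32B model)
n_kv_heads = 8       # KV heads under grouped-query attention (assumed)
head_dim = 128       # per-head dimension (assumed)
elem_bytes = 2       # fp16 cache entries

# K and V each store n_kv_heads * head_dim values per layer per token.
bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * elem_bytes

ctx_tokens = 8192    # context budget per user (assumed)
per_user_gib = bytes_per_token * ctx_tokens / 1024**3
print(f"KV cache per user: {per_user_gib:.1f} GiB")  # 2.0 GiB with these numbers
```

So even at a modest 8K context, each concurrent user can tie up a couple of GiB on top of the model weights, which is why the concurrency headroom matters.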
MelodicRecognition7@reddit
These specs are very low for LLMs; they're suitable only for very small models. Read this to get some understanding: https://old.reddit.com/r/LocalLLaMA/comments/1rqo2s0/can_i_run_this_model_on_my_hardware/
overand@reddit
Personally, I'd recommend avoiding the Tesla T4 cards; they're old enough that you may start to have compatibility issues. A newer card with less VRAM might be a better choice for proof-of-concept, but others with Tesla T4 experience should chime in.
If you can get an RTX 3090 (or better, two) into your build, you'll be doing much better.
Also, I'd avoid RAID 5 entirely in 2026; if you can afford downtime but want capacity, put "disposable" data on a RAID 0 stripe. (Disposable only! Which is a Yes for models, but No for data.)
Having your models on spinning disks isn't great, but it's fine if you're literally only using one model ever. You can bet on 100-200 MB/sec from spinning disks. If you're using that full 64 GB of VRAM, call it 40 gigs for the model and 24 gigs for context as a random guess, your model will load in 3.5 - 5 minutes. That's a heck of a long time to wait for your query to even start processing.
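The load-time estimate above is just model size divided by sustained sequential read speed; a quick sketch using the comment's own guesses (40 GB model, 100-200 MB/s disk):

```python
# Model load time = model size / sustained sequential read speed.
model_gb = 40  # the "random guess" model size from above
for mb_per_s in (100, 200):  # typical spinning-disk sequential reads
    seconds = model_gb * 1000 / mb_per_s
    print(f"{mb_per_s} MB/s: {seconds / 60:.1f} min")
```

For comparison, an NVMe SSD sustaining around 3 GB/s would load the same 40 GB model in under 15 seconds.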
Now, I'm happy with the models I use on a dual 3090 system, decent professional-ish results on models as small as ~20 gigs. That's still nearly a 2 minute load time from a spinning disk. If the model stays loaded in VRAM (or at least system RAM) that's much less of an issue, of course! Just wanted to help you set expectations.
That CPU is fine for proof-of-concept; I do local LLM inference on a system with a Ryzen 5 3600 and 64 GB (now 128 GB) of DDR4 RAM. But performance drops WAY off if the model doesn't fit in VRAM.
MegaSuplexMaster@reddit (OP)
I made a mistake in my post: the 1TB drives for models and data are SSDs in RAID 1. But I do follow what you're getting at; thanks for the advice.
overand@reddit
My next piece of advice is... when you're comfortable and have selected a model, switch to something other than Ollama. I don't agree with people who say you should avoid it at all costs — I think it's a decent way to get started — but if you're at all CLI savvy, just get going with `llama.cpp` at least, or maybe vLLM if you want to start with the tool you're likely to end up using as a multi-user organization.