Realistic local LLM rig under $6500? Dev with heavy RAM needs
Posted by TeachTall3390@reddit | LocalLLaMA | 27 comments
Hey everyone,
I'm a developer looking for practical hardware recommendations under $6500 for local LLM work. My usage breaks down like this:
- 60% local inference
- 30% LoRA training
- 10% light fine-tuning on smaller models
Anything heavy I just rent GPU clusters or use work resources.
I usually run 40-50 services at once, so I need a ton of RAM. Video editing would be a nice bonus but not required. Linux or macOS is fine.
What builds are actually worth it right now? Thanks!
Electronic-Space-736@reddit
this one is launching currently https://hilbert-agentic-computer.kckb.me/b06cccc2
SiXke@reddit
any thoughts on this? I tried to create a post about this but it got deleted (no clue why)
Electronic-Space-736@reddit
I am going for it, it suits my use case. I want access to full weights (or close to it), this will be serving just me, not tens or hundreds of people, and I am happy with the tradeoff: no tensor-core speed and slightly slower unified memory, in exchange for access to a larger model than I could otherwise afford right now.
Turbulent_Pin7635@reddit
M5 Max 128GB
ExcellentTip9926@reddit
DGX Spark at $4,699 is probably the best fit: 128GB unified memory, full CUDA stack, runs up to 200B-param models at FP4 locally, Mac Mini-sized. Memory bandwidth is only 273 GB/s so inference is slower than a Mac Studio, and thermal throttling on long training runs is real, but for a 60/30/10 inference/LoRA/finetune split it's almost perfectly designed. Other options:
- Mac Studio M3 Ultra 256GB at $5,600: more RAM and faster large-model inference, but you lose CUDA for LoRA.
- 4090 build with 128GB DDR5 around $5,500: best CUDA speed per dollar, but 24GB VRAM limits local model size.
- 5090 build at $6,500+: stretches the budget hard with current memory-shortage pricing.

For 40-50 services plus LoRA plus inference, I'd go DGX Spark.
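As a rough sanity check on the bandwidth point: decode speed is approximately memory bandwidth divided by the bytes read per generated token. A back-of-envelope sketch (assumes a dense model read in full every token; ignores MoE sparsity, KV-cache traffic, and other overheads):

```python
# Back-of-envelope decode ceiling: tokens/s ~= bandwidth / bytes per token.
# Assumes a dense model whose weights are all read once per token; MoE models
# only touch their active experts, so they run faster than this estimate.

def est_decode_tps(bandwidth_gbs: float, params_b: float, bytes_per_param: float) -> float:
    """Upper bound on tokens/sec from memory bandwidth and model footprint."""
    model_gb = params_b * bytes_per_param  # GB read per generated token
    return bandwidth_gbs / model_gb

for name, bw in [("DGX Spark", 273), ("Strix Halo", 256), ("M3 Ultra", 800)]:
    # illustrative case: a 70B dense model at ~Q4 (0.5 bytes/param)
    print(f"{name}: ~{est_decode_tps(bw, 70, 0.5):.0f} tok/s ceiling on 70B Q4")
```

Real-world numbers land below these ceilings, but the ratios between machines hold up.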
_millsy@reddit
How would you see that comparing against a strix halo or similar?
ExcellentTip9926@reddit
Strix Halo is genuinely competitive. Framework Desktop with Ryzen AI Max+ 395 at $2,599-$2,999 gets 128GB unified memory with 256 GB/s bandwidth vs Spark’s 273. On apples-to-apples LLM inference (gpt-oss-120b) it’s within 5-10% of Spark while saving $1,700+. Spark wins on prompt processing (3-5x faster), long context (23% faster at 32K), and image gen (2.5x faster on FLUX). So if you’re doing heavy prefill, long-context RAG, or diffusion models, Spark earns its price. If it’s mostly LLM chat + LoRA, Strix Halo is the better buy now.
FreshBowler32@reddit
Adding a quick table to this. Also, realistically, after allocating RAM to the OS, usable VRAM would be pretty much equal to a quad 3090 setup (4 × 24GB = 96GB).

| | DGX Spark | Strix Halo (Framework) |
|---|---|---|
| Price | $4,699 | $2,599-$2,999 |
| Unified memory | 128GB | 128GB |
| Bandwidth | 273 GB/s | 256 GB/s |
| CUDA | Full stack | No (ROCm) |
Snoo_81913@reddit
There are rumors/leaks that the next Ryzen AI Max will have a theoretical max bandwidth of 460 GB/s using LPDDR6-14400... sounds... EXPENSIVE. Meanwhile the M3 Ultra has 800+ GB/s bandwidth. Two generations ago.
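That 460 GB/s figure checks out arithmetically if you assume the rumored part keeps a 256-bit-equivalent bus like today's Strix Halo (an assumption; LPDDR6 reorganizes channel widths):

```python
# Peak bandwidth = transfer rate (MT/s) x bus width (bytes per transfer).
def bw_gbs(mt_per_s: int, bus_bits: int) -> float:
    return mt_per_s * bus_bits / 8 / 1000

print(bw_gbs(8000, 256))   # 256.0 GB/s: current Strix Halo (LPDDR5X-8000)
print(bw_gbs(14400, 256))  # 460.8 GB/s: rumored LPDDR6-14400, same bus width
```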
veinamond@reddit
Well, that would all be nice if DDR6 and LPDDR6 were actually coming anytime soon. From what I read online, the soonest they hit the consumer market is probably 2030, maybe later.
Enturbulated_One@reddit
Shhhh, don't talk about it! Every time I've mentioned it, or looked, DDR6 release date estimate has been pushed back further!
Snoo_81913@reddit
What this guy said (it's almost like I copied him lol), but those are definitely the two machines in your bracket that will do the job. I would also lean toward the Spark.
ranting80@reddit
Training leans toward the Spark.
HopePupal@reddit
if you're going the GB10 route, the Asus version is a lot cheaper than the Spark
Snoo_81913@reddit
So many factors to consider here.
I'll go with the OP's weighted split for now (a toy scoring sketch follows the two picks).
60% inference: Apple Mac Studio M3 Ultra, 512GB RAM, 800+ GB/s bandwidth. Loads large LLMs and all your services easily. About $4k used on eBay. Cons: MLX/Core ML isn't cutting edge, it's gonna blow you up with fans, and it's used. New units are $7,500 and up, and you can only get 256GB in new models.
LoRAs and fine-tuning: Nvidia DGX Spark. 1 petaflop of pure raw power (at FP4). It will consume input like a starving animal: just cram it all in and it will shred it. Cutting-edge CUDA architecture, will rip through fine-tuning and LoRAs like they don't exist. Scalable, with 200Gbps connections for clusters. Cons: 273 GB/s bandwidth, 128GB RAM, slower token generation, no video gen, a custom DGX OS, and it maxes out your budget at $5,590-$6,400.
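To make the 60/30/10 weighting concrete, a toy scoring sketch; the per-category scores are my own subjective 1-10 guesses based on the tradeoffs above, not benchmarks:

```python
# Toy weighted scoring for the OP's 60/30/10 split. Per-category scores are
# subjective 1-10 guesses, purely illustrative.
WEIGHTS = {"inference": 0.6, "lora": 0.3, "finetune": 0.1}

CANDIDATES = {
    # huge RAM + 800 GB/s -> great inference; no CUDA -> weak training
    "Mac Studio M3 Ultra 512GB": {"inference": 9, "lora": 4, "finetune": 4},
    # CUDA + FP4 compute -> great training; 273 GB/s -> slower decode
    "DGX Spark":                 {"inference": 6, "lora": 9, "finetune": 8},
}

for name, scores in CANDIDATES.items():
    total = sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)
    print(f"{name}: {total:.1f}/10")  # Mac ~7.0, Spark ~7.1: genuinely close
```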
bionicdna@reddit
Where are you finding a used M3 Ultra Mac Studio on eBay with 512GB RAM under $10k? The rest seem to be scams.
Snoo_81913@reddit
You could be right there. I saw a couple of them here and there, and I've been keeping an eye out, but it's getting harder and harder to find anything like that. Good catch.
Powerful_Ad8150@reddit
Single or dual DGX Spark cluster. Single: q3.5 122 @ 50 tps on vLLM / m2.7 at a poor man's Q4 quant @ 22 tps on llama.cpp. I have the Asus GX10, the only difference being that the Spark has a power button on the front (though that's not a deal-breaker: mine booted once two months ago and I never turned it off again xD). It's an amazing machine, although there are some compatibility issues with some solutions because it's ARM, not x86.
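For anyone new to vLLM, a minimal sketch of its offline API (the model ID is just an example, not the models above; a two-Spark cluster over the 200GbE link needs vLLM's multi-node Ray setup, which is more involved than this):

```python
# Minimal vLLM offline-inference sketch. tensor_parallel_size splits a model
# across GPUs in one box; multi-node clustering is a separate (Ray) setup.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct",  # example model ID
          tensor_parallel_size=1)            # bump up with more GPUs
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```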
cleversmoke@reddit
Recently I bought a mini PC plus an RTX 3090 eGPU over oculink (specs in my reply below).
Running Qwen3.6-35B-A3B Q4 with 262k context. Prompt processing at 2800 tk/s and token generation at 130 tk/s. Simple --fit:on configuration.
I plan on buying another RTX 3090 + Aoostar AG01 so I can utilize the Q8 version.
That would bring my total to be around $3500. I can probably add another RTX 3090 if a Qwen3.6-120B+ model comes out.
Unsure if it can handle 40-50 services though unless I do a lot of throttling.
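For anyone sizing this: the KV cache is what eats memory at 262k context. A rough calculator (the architecture numbers below are placeholder guesses, not the real Qwen3.6-35B-A3B config; pull the real values from the model's config.json):

```python
# Rough KV-cache sizing. Architecture numbers are placeholders, NOT the real
# Qwen3.6-35B-A3B config; substitute the values from the model's config.json.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx: int, bytes_per_elem: float) -> float:
    """K + V caches across all layers, in GB."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1e9

# hypothetical GQA config at 262k context with q8_0 KV (~1 byte/element)
print(f"{kv_cache_gb(48, 8, 128, 262_144, 1.0):.1f} GB")  # ~25.8 GB
```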
Snoo_81913@reddit
Man, I have pored over setups like this. Which mini did you go with?
cleversmoke@reddit
I went with the Reatan X7 255 with oculink, Radeon 780M, 64GB DDR5 RAM, and 2TB NVMe.
I bought it when it was $800 and it has jumped to $1200 since (US Amazon), but even at $1200, I'd buy it again. I got the Reatan specifically for the oculink since I didn't want a tower.
If my second eGPU works on it, I'll be thrilled! The eGPU comes in next week.
No_Mango7658@reddit
Seeing as Qwen3.6 Q4_K_M with 256k context basically fits in a 5090, that would be my target.
Charming-Author4877@reddit
I gave Qwen 3.6 and Gemma-4 a quite extensive test run today (on a 5090) and the results were really impressive, much better than I expected.
https://www.reddit.com/r/GithubCopilot/comments/1ss583x/i_am_not_switching_yet_but_i_tested_gemma4_and
whodoneit1@reddit
Today I tested using Kimi2.6 for planning and Qwen 3.6 Plus for implementation, and the results were really good. I was running non-locally, as I wanted to test this workflow first.
Charming-Author4877@reddit
I personally go with this:
- 2x 3090 or 1x5090 +1x3090
- 128GB DDR5 RAM (or 192GB if you can find an affordable kit)
- Large Samsung 9100 PRO SSD, or 2 striped previous-generation SSDs (adds up to the same speed)
I use Windows + WSL
For speech/music I run Demodokos Foundry; I put it into on-demand mode or bind it to my 2nd GPU.
That gives SOTA inference without taking any VRAM when not used.
For LLMs you can run Qwen 3.6 35B at 260K context and still have plenty of primary VRAM available.
The dense models (Gemma 31B or Qwen 28B) also run well with a bit of KV quantization (sketch below).
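A minimal sketch of that KV quantization via llama-cpp-python (the model path and context size are placeholders; type_k/type_v set the cache quantization, and a quantized V cache requires flash attention):

```python
# Hedged sketch: long context with a quantized KV cache in llama-cpp-python.
# Paths and sizes are placeholders; adjust to your model and VRAM.
import llama_cpp

llm = llama_cpp.Llama(
    model_path="models/example-q4_k_m.gguf",  # hypothetical path
    n_ctx=131_072,                    # long context; KV cache sized to match
    n_gpu_layers=-1,                  # offload all layers to the GPU
    flash_attn=True,                  # required for a quantized V cache
    type_k=llama_cpp.GGML_TYPE_Q8_0,  # K cache at ~1 byte/element
    type_v=llama_cpp.GGML_TYPE_Q8_0,  # V cache at ~1 byte/element
)
print(llm("Hello", max_tokens=16)["choices"][0]["text"])
```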
For light fine-tuning or LoRA training you can use either one card in the background, or both. A hedged PEFT sketch of the one-card setup follows.
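(Minimal sketch with Hugging Face PEFT, pinned to the second GPU so the first stays free for inference; the model ID is just an example.)

```python
# Hedged LoRA sketch with Hugging Face PEFT, pinned to one GPU.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # expose only the second GPU

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",   # example model; pick whatever fits VRAM
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",          # cuda:0 is the masked-in physical GPU 1
)
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the small adapter weights train
```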
I have a second PC like this available on the network for long-running tasks.
MacBooks offer great value, but at the same time they are exotic hardware in the AI world; it's improving a lot but is still a burden. I absolutely hate the Apple development environment.
A Mac is great for running large models that won't fit in the setup I described, but prefill speed is gruesome.
DGX Spark and similar ARM unified-memory boxes are glorified mini computers, significantly slower than the MacBook, and prefill is a total showstopper.
Same with AMD GPUs, they are not impressive in compute.
So my choice landed on a conservative CUDA solution; local AI is hard enough when it's mutating and changing faster than anyone can easily follow.
Magnus919@reddit
As much MacBook Pro as you can stomach paying for.
Or a Mac Studio if you never leave home.
Excellent_Koala769@reddit
MacBook Pro M5 Max 128 GB