My home data center
Posted by alecKarfonta@reddit | LocalLLaMA | 61 comments
System 1:
Threadripper 3960x 24c
4x 3090 ti
128gb ddr4
System 2:
Xeon 8352 36c
4x 5070 ti
128gb ddr4
System 3:
Intel 14700k 24c
64gb ddr5
5090
System 4:
Ryzen 5950x 16c
64gb ddr4
2x 5070 ti
The first system uses two PSUs to handle the almost 2000w full load of the 3090s. Was nervous about this but it has been running stable for about a month.
The Intel is an engineering sample that cost $100. I mainly use it to run an embedding model.
I use them for various ml experiments, projects and some agentic coding. Right now the 3090s are training a tts lora with data distilled from a larger model. The 5070s run qwen 27b for coding, nemotron streaming stt and moss tts for an interactive agent I am building.
These recent qwen models are good enough for coding. Sometimes I leave them all night working on a repo. Mainly boilerplate improvements, but it's incredible to get real work done with no token cost, aside from the obvious costs of this hardware.
Love this community ❤️
Potential_You_9954@reddit
Wow, PNY 5090. This is the first time I've seen the real card... I saw it previously, just pics... If you want to change to pro6000, that costs a lot (unless you bought from suppliers, not Micro Center).
FusionCow@reddit
"You people are building datacenters in your homes"
polandtown@reddit
I think we all need the temps in that room 😃
Kazushi998@reddit
nice rig, I'm curious how people handle backup/disaster recovery for a DIY local rig
ProfessionalJackals@reddit
How can people let an LLM run wild all night on their codebase? Do you not check the work as it is being developed?
It feels like there is no need for that.
Conscious-content42@reddit
He did say it's on a GitHub repo and he can review the changes it makes. I assume that means it only makes PRs that he can then review, so you still need a human in the loop.
zilled@reddit
So that's the actual purpose of those black wrap bands TIL
LongDistanceRope@reddit
Is the 5070ti worth it over the 5060ti? I'm going back and forth between them and trying to decide, since the 5070ti basically costs the same as a second-hand 3090 (at least where I live), which changes a lot of variables too.
alecKarfonta@reddit (OP)
I would say not worth the current price difference for the 5070 ti vs the 5060 ti. They run inference at almost identical speeds.
The 3090 VRAM gets you in the range of running a usable model on a single card. I would choose that unless power consumption is a concern. With 16gb you really need multiple before they become useful for llms. They are great for other things tho; stt, tts.
offzinho3k@reddit
If you want more speed, 5070ti; if you want to save energy, 5060ti.
lucsaddler@reddit
How do you manage to run 4 GPUs in the same "PC"? The RTX 3090 has 24GB of VRAM, so with 4 you would have 96GB of VRAM. So do you split the model into 4 parts and put one on each card? What is the average tokens per second?
I have these questions because I am researching how to build a local setup to run models!
alecKarfonta@reddit (OP)
Yes, the model is split across the GPUs, which lets me run q4 120b models entirely in VRAM. Speed is around 40 tokens per second with a decent context size. I mainly use llama.cpp with layer split mode.
Translated with Google
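A rough back-of-envelope check of the "120b at q4 on 4x 24GB" claim, as a sketch only. The 4.5 bits per weight figure is an assumed average for a q4-style quant (real GGUF files vary by quant type), and `weight_gib` is a hypothetical helper name, not anything from llama.cpp.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Four 24 GB cards, as described in the thread (treated as GiB for simplicity).
total_vram_gib = 4 * 24

# ASSUMPTION: ~4.5 bits per weight on average for a q4-style quantization.
weights = weight_gib(params_billion=120, bits_per_weight=4.5)
headroom = total_vram_gib - weights

print(f"weights ~{weights:.1f} GiB, ~{headroom:.1f} GiB left for KV cache and buffers")
```

Whatever is left after the weights has to cover the KV cache and compute buffers, which is why the usable context size depends on how much headroom remains.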
crossoverXYZ@reddit
that's a serious setup ngl. what kind of power draw are you seeing with all four systems running? my electric bill would cry 😅
alecKarfonta@reddit (OP)
Ty :) I have not measured, but it's pretty rare that I max them all out. Undervolting helps a lot with minimal performance loss. Solar also helps, I have ~3kw
CheatCodesOfLife@reddit
Crossover is a spam bot btw, click its profile and look at how it posts.
alecKarfonta@reddit (OP)
Spooky.
CheatCodesOfLife@reddit
You tried https://github.com/ggml-org/llama.cpp/blob/master/tools/rpc/README.md between the 3090 and 5070TI rig yet?
woodrowbill@reddit
How do the 4x 5070ti vs the 4x 3090 compare?
alecKarfonta@reddit (OP)
It's tough. I really thought the newer architecture would enable more recent optimizations and overcome the lower cuda core count. But it seems to be a complete wash: the 5070s run models (that fit) at the same speed as the 3090s. Maybe I am doing something wrong.
The main potential advantage is that the 5070s can run nvfp4 models with no penalty for that quantization, but they have less vram. So I end up running q6_k on the 3090s and it produces identical results. The only real advantage for the 5070s seems to be power consumption.
woodrowbill@reddit
Appreciate you. I have a single 3090 and always wondered if a 2x 5070 would be better in performance. Great insights.
IrisColt@reddit
I guess that's one reason 3090s are still strong.
redmctrashface@reddit
I wish I had enough room for this kind of stuff but I live in a big city
HavenTerminal_com@reddit
the $100 engineering sample carrying the embeddings load in this setup is sending me
CircularSeasoning@reddit
Data renter? Drake dislike.
Data center? Drake like.
GameBoyRay@reddit
Imagine 300 years from now how even more barbaric this will look
mosesman831@reddit
You never need to buy a dedicated radiator anymore.
__JockY__@reddit
This is the shit right here! Love it.
ambient_temp_xeno@reddit
https://www.youtube.com/watch?v=XCTbeeCDueg
Dry_Yam_4597@reddit
That is sexy
BlackBeardAI@reddit
Looks a lot like mine which I posted a while ago
LegacyRemaster@reddit
sounds hot 😃
info_solutions@reddit
You can run Claude Mythos perfectly now !
panchovix@reddit
Add a 6000 PRO on one of these systems!
alecKarfonta@reddit (OP)
Yea I've come very close to ordering one of those. If there were a better local model that targeted that vram size I would have one. That is my upgrade path, planning to trade up the 3090s for a single 6000 pro.
Sofakingwetoddead@reddit
eh, not sure what you do but 27b at q8, 16kv and full context, with mamba and sglang will take up ~87gb... However, you can parallelize multiple jobs, or spawn workers that run simultaneously.
Technically, you don't need 96gb to run that model, but if you want to run it at speed then you need it.
alecKarfonta@reddit (OP)
Yea my original intention was to run 120b models with the 96gb setup. I didn't expect qwen 27b to beat oss and nemotron 120b. Still, there will be better models in this range at some point.
For now I am running qwen 27b at q6 with 128k context (llamacpp). It uses ~29gb vram on two 5070s. The other gpus are running various models for stt, tts, cv
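For anyone wondering why context length moves the VRAM number so much, here is a minimal sketch of the usual KV cache estimate for standard attention layers. Every shape value below is hypothetical and is not the real config of qwen 27b or any other model; for hybrid models only the attention layers would count, which is why the parameter is named `n_attn_layers`.

```python
def kv_cache_gib(
    n_attn_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_elem: int = 2,  # 2 bytes = 16-bit KV cache
) -> float:
    """Approximate KV cache size in GiB for one sequence at full context."""
    # The leading 2 accounts for storing both keys and values.
    total_bytes = (
        2 * n_attn_layers * n_kv_heads * head_dim * context_tokens * bytes_per_elem
    )
    return total_bytes / 2**30


# HYPOTHETICAL shape, chosen only to make the arithmetic easy to follow.
example = kv_cache_gib(
    n_attn_layers=16,
    n_kv_heads=4,
    head_dim=256,
    context_tokens=128 * 1024,
)
print(f"KV cache ~{example:.1f} GiB")
```

The estimate scales linearly with context length and with the number of parallel sequences, so running several workers at once multiplies it.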
Sofakingwetoddead@reddit
I would assume so. I mean, about there being better models in the future. I'm so happy with 27b though. I would like slightly larger native context, like 320k would be perfect for me, but I couldn't be more pleased with 27b. Each day that goes by the harness gets a little bit better, we get a little more efficient..
That's awesome, though. I definitely would like to be able to run other models in parallel with more automated workflows. Right now, though, I'm having a hard enough time keeping 27b occupied. 😃
igsterious@reddit
That's a literal rack! 😁
fromage9747@reddit
What are you using to leave the qwen models working over night on a repo? Just the CLI itself? Or an app to manage the agents?
alecKarfonta@reddit (OP)
It's a custom tool I made based on opencode and leaked Claude code. It's an agent I can assign GitHub tickets to, and it uses a combination of planning from proprietary models and implementation by local models. Right now it can only do boilerplate stuff but it's improving
fromage9747@reddit
Excellent. Just thought there was some tool out there that I didn't know about. I also have my own. Cheers and thanks for the reply!
alecKarfonta@reddit (OP)
Cool, yea it's a tough problem I have had with all coding agents. From my experience they cannot be left on their own for long. They will eventually claim things are complete when they are not. So I built mine around deep GitHub integration, so it works like a normal developer and I can review its changes.
I give it a task that it turns into an initiative with sub tasks. Each subtask touches at most 3 files. That helps break things into small enough chunks for the local model to handle. And representing the code as ASTs was important.
interested to hear about your experience if you want to share
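The two ideas above (subtasks capped at 3 files, and code represented as ASTs) can be sketched roughly as follows. This is not OP's tool: `split_into_subtasks` and `outline` are hypothetical names, and the AST part only covers Python source via the standard `ast` module.

```python
import ast
from typing import Iterator

MAX_FILES_PER_SUBTASK = 3  # the per-subtask limit mentioned in the comment above


def split_into_subtasks(
    files: list[str], limit: int = MAX_FILES_PER_SUBTASK
) -> Iterator[list[str]]:
    """Yield groups of at most `limit` files, preserving order."""
    for start in range(0, len(files), limit):
        yield files[start:start + limit]


def outline(source: str) -> list[str]:
    """Return the names of top-level functions and classes in Python source."""
    tree = ast.parse(source)
    wanted = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    return [node.name for node in tree.body if isinstance(node, wanted)]


# Hypothetical file list for a single task.
files = ["api.py", "models.py", "tests/test_api.py", "cli.py", "README.md"]
subtasks = list(split_into_subtasks(files))

# A compact outline is much cheaper to put in a prompt than the full file.
names = outline("class Agent:\n    pass\n\ndef run():\n    pass\n")

print(subtasks)
print(names)
```

A real planner would group files by dependency rather than by list order; the fixed-size split here only illustrates the cap.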
fromage9747@reddit
Still building mine, just currently stuck on another project until I can throw myself into it. I was really intrigued that you have something running overnight and thought bloody hell, something is already available? Arggg. But good to know that it's custom so I'm not wasting my time. Mine is based on the Qwen Code CLI. It was the first agent that I had used and I haven't really found the need to try anything else. Still working on MCPs and harnesses as well. Everything in time. Learning a lot and something new every day!
xrothgarx@reddit
Do you have a more thorough write up of how you’re running and accessing the models?
alecKarfonta@reddit (OP)
I can put something together if you're interested. Managing all these systems with varied resources has been a big problem for me. I use a custom tool I built called llama-nexus. It started as a wrapper for llamacpp so I could quickly experiment with different parameters or builds.
Then I added ways to manage embedding, stt and tts models. Then I added benchmarking, distillation, qlora training. Now it's a gui that fills almost all of my ml ops needs. I run it on all machines then choose which models I want to run on that machine.
Next I plan to integrate with kserve for better scalability in a cluster scenario
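A wrapper like the one described could start out as little more than a function that assembles `llama-server` arguments so parameters are easy to vary. The sketch below is not llama-nexus code: the function name, model path, and default values are assumptions, and the flags should be checked against `llama-server --help` for your build.

```python
import shlex


def llama_server_args(
    model_path: str,
    n_gpus: int,
    ctx_size: int = 8192,
    port: int = 8080,
) -> list[str]:
    """Build an argument list for llama-server using layer split across GPUs."""
    return [
        "llama-server",
        "-m", model_path,
        "--n-gpu-layers", "99",      # offload as many layers as possible
        "--split-mode", "layer",     # split whole layers across the GPUs
        "--tensor-split", ",".join(["1"] * n_gpus),  # equal share per GPU
        "--ctx-size", str(ctx_size),
        "--port", str(port),
    ]


# HYPOTHETICAL model path; substitute a real GGUF file.
args = llama_server_args("models/example.gguf", n_gpus=4, ctx_size=32768)
print(shlex.join(args))
```

Returning a list instead of a string keeps it safe to pass straight to `subprocess.run(args)` without shell quoting issues.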
xrothgarx@reddit
I'd be interested to read about your stack. I'm already using k8s (I work on Talos Linux) but haven't gone the kserve route because I only have 2 machines serving static LLMs (via VLLM). But I've been having trouble using them for extended periods of time because of context limits.
I have other, smaller GPUs in the cluster for stt and video transcoding. But the LLMs have been causing me issues (even running on my local GPU), so I'd love to read more about what problems you've solved and how.
jacek2023@reddit
nicely optimized space, I have just a single x399 with 4 GPUs in the single openframe, you have the whole lab 😄
TiK4D@reddit
I am in awe, this is ridiculous lmao
alecKarfonta@reddit (OP)
Lol ty I have built this up over a long period. The first 3090 I bought on launch day. It was the beginning of the end of high end gpu availability.
Draco32@reddit
chefs kiss
Apprehensive-View583@reddit
You prob pay more in electricity than the api token price from one of the Chinese providers lol
alecKarfonta@reddit (OP)
Potentially for now. Even the glm yearly subscription I have has doubled in cost since I signed up. And I run more than llm inference. Solar also helps
BeautyxArt@reddit
So you powered the other two 3090s from the second psu without using the main mobo power socket at all? Just psu straight to the gpus?
alecKarfonta@reddit (OP)
Yea exactly, then it's just a cheap splitter cable that sends the power-on signal from your mobo to both psus. Add2psu on amazon
dangerous_inference@reddit
I just got the same frame. TWINSIES.
I only got one though. For my rig I made the hilarious decision to put this in my laundry room and run a 240v PSU off the dryer line. I should probably get an IR camera too.
alecKarfonta@reddit (OP)
Hell yea, it's an interesting case. Didn't actually realize at first they're meant to stack. I have a third one I'm working on now. The 240v is clever, right now I run extensions so they're not all on the same circuit
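For a feel of why spreading rigs across circuits (or using a 240v line) matters, here is a small sketch of the arithmetic. The ~2000w load comes from the post; the breaker sizes and the 80% continuous-load factor are assumptions based on common US practice, so check your own panel and local code.

```python
def circuit_headroom_w(
    load_w: float,
    volts: float,
    breaker_amps: float,
    continuous_factor: float = 0.8,  # ASSUMPTION: derate for continuous loads
) -> float:
    """Return usable watts left on a circuit after the given load."""
    usable_w = volts * breaker_amps * continuous_factor
    return usable_w - load_w


full_load_w = 2000  # "almost 2000w" for the 4x 3090 system, per the post

# ASSUMED breaker sizes for illustration only.
h120 = circuit_headroom_w(full_load_w, volts=120, breaker_amps=15)
h240 = circuit_headroom_w(full_load_w, volts=240, breaker_amps=30)

print(f"120v/15a headroom: {h120:.0f} w")
print(f"240v/30a headroom: {h240:.0f} w")
```

A negative result means the load alone exceeds what that circuit should carry continuously, which is the situation the extension cords and the dryer line are both working around.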
FastHotEmu@reddit
what model/brand is that frame?
dangerous_inference@reddit
Kingwin 8 GPU Miner Rig Case Frame
sheppyrun@reddit
the hardware is impressive but the real question is whether you need all of it. most local llm workloads don't use four gpus per system. the best setup is the one you actually use, not the one you build.
alecKarfonta@reddit (OP)
I work in the industry so having this hardware around to play with helps me stay on top of things. I run a lot of different kinds of models, inference and training. Currently have something on every gpu. I'm looking to stack another xeon machine with 5070s since that's all I can get at msrp
jikilan_@reddit
Sounds like the one woman you actually love, not the one you married