Pretty sure I maxed out my consumer PC. Help me run the best model for my needs please
Posted by Quadrapoole@reddit | LocalLLaMA | 31 comments
What is the best model that'll work with my setup?
Did I goof buying a second set of 128GB of system RAM for a non-server board?
Just using this for personal use. I honestly needed LLMs to help me set up Linux as a Windows refugee.
I want to use an LLM to help code Home Assistant stuff and do personal OCR of documents.
Haven't tried coding yet, but I've seen some pretty cool stuff for restoring old pictures.
I also want to use models to create homeschooling lessons for my kids.
Also want to learn how to do some goon stuff, so if anyone can help me in that direction, that'd be sweet.
Thanks in advance!
ghgi_@reddit
With a setup like this, for pure GPU I'd say look at MiniMax M2.7. Very solid model, and NVFP4 will work well on those Blackwells and run pretty fast. If you want to offload, though, I'd say GLM 5.1 is your best bet: a solid model that does even better than MiniMax, but offloading means a speed loss. So it's a quality-vs-speed tradeoff; I'd test both to find what works for you and your workload.
Quadrapoole@reddit (OP)
What would be best to run it?
I only have experience with llama.cpp, but I hear vLLM is better?
Any tips on SGLang vs vLLM vs llama.cpp?
What about all the quants: NVFP4 vs AWQ vs REAP?
There's so much every day, it's hard to keep up.
I'm only a single user, so I'm wondering which will get me the max-intelligence model.
Tbh I started down this rabbit hole after seeing ik_llama.cpp running DeepSeek.
ghgi_@reddit
For pure GPU I'd say use vLLM; honestly there's probably no need to get into SGLang for what you're doing. My tips for vLLM are honestly just to copy other people's configs, use prebuilt Dockers, etc. vLLM has a lot of knobs and dials, and on RTX 6000 Pros, in my experience, if you're doing it from scratch you're going to need some trial and error.
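Roughly, a dual-GPU vLLM launch looks something like this. Just a sketch to copy from, not an exact config; the model ID is the NVFP4 quant floating around on Hugging Face, and the values will need tuning for your setup:

```bash
# Minimal sketch: serving an NVFP4 quant across two RTX 6000 Pros with vLLM.
# Model ID and flag values are illustrative; tune per setup.
vllm serve lukealonso/MiniMax-M2.7-NVFP4 \
  --tensor-parallel-size 2 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.90
```

--tensor-parallel-size 2 splits the weights across both cards; drop --max-model-len or --gpu-memory-utilization if you hit OOM at startup.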
If you're doing offloading (GPUs + CPU), go with llama.cpp. It's also better if you just want pure simplicity, and it can obviously do pure GPU too. NVFP4 is the quant optimized for Blackwell cards like yours and the best option 90% of the time, but if vLLM on NVFP4 is too much of a hassle (I have some configs for MiniMax on a dual RTX 6000 Pro setup if you can't figure it out), llama.cpp will make your life easier (rough example at the end of this comment). No NVFP4, but you get the most used quant style, which is GGUF. I'd always recommend getting them from unsloth; the UD versions are often better in my experience, and they always publish them.
AWQ is mostly for vLLM/SGLang. I wouldn't use it unless you had to, and in this case the models I suggested should have NVFP4; if you're offloading, you should use GGUFs anyway. I wouldn't touch REAP in general, too much quality loss.
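For the offloading route, the command looks roughly like this. Just a sketch, the GGUF filename is made up, and the -ot regex is the usual trick for keeping the MoE expert tensors in system RAM while everything else stays on the GPUs:

```bash
# Rough sketch of GPU+CPU offloading with llama-server.
# The GGUF path is illustrative; grab a real unsloth UD quant from Hugging Face.
llama-server \
  -m GLM-5.1-UD-Q4_K_XL.gguf \
  -ngl 99 \
  -c 32768 \
  -ot ".ffn_.*_exps.=CPU"
```

-ngl 99 pushes all layers to the GPUs first, and the -ot override then pins the expert tensors back to CPU, which is where your 256GB of system RAM earns its keep.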
Quadrapoole@reddit (OP)
Can you post some dual RTX 6000 Pro vLLM MiniMax M2.7 configs? Much appreciated!
ghgi_@reddit
I had to make a few last-minute configs since my old MiniMax script was for vLLM 1.17.1 and outdated. This script should work on vLLM 1.19.1/latest stable release: https://paste.opensuse.org/pastes/ae377dd7b1e5. If it doesn't work, it should at least still be roughly 80% correct, and the problem probably has something to do with the moe-backend flag.
Quadrapoole@reddit (OP)
Thanks for actually helping.
Are you doing any ComfyUI stuff? Any advice on getting started?
ghgi_@reddit
No, sorry. I don't really care much about image or video gen for my use cases, so I've never taken the time to learn Comfy, but there's plenty of info on YT.
Simple_Library_2700@reddit
https://huggingface.co/lukealonso/MiniMax-M2.7-NVFP4
This guy has instructions in the model card on how to run it under SGLang if you want to try that as well.
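If you go that route, the launch is usually something like this. Flags can change between SGLang releases, so defer to the model card:

```bash
# Approximate SGLang launch for the NVFP4 model linked above.
# Check the model card for the exact, current flags.
python -m sglang.launch_server \
  --model-path lukealonso/MiniMax-M2.7-NVFP4 \
  --tp 2 \
  --port 30000
```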
marutthemighty@reddit
Can you please tell me how many GB of RAM your system has in total? Is it DDR4 or DDR5?
Also, are you using an eGPU?
Quadrapoole@reddit (OP)
I've got 256GB of DDR5 at 5800MHz. That's the fastest it can run with four sticks.
Not sure if it was worth getting the second pair of 128, but I think it helps with loading the 192GB of VRAM faster.
These are not eGPUs, as the second pic shows; they're plugged into the mobo with Gen5 x8/x8 bifurcation on my ASRock Z790 Taichi Carrara.
The only reason I got the second pair was when I learned about DeepSeek Engram, and they're taking forever to release their model.
Not sure if I goofed spending 2k on 128GB.
marutthemighty@reddit
Ok. But would an eGPU help, in your case? Or is it overkill?
Kolapsicle@reddit
The i9-13900KF has 20 PCIe lanes. You're using 8 lanes for your chipset and NVMe. You'll be lucky to run those GPUs in an 8/4 configuration. Wild, my boy.
Quadrapoole@reddit (OP)
The ASRock Z790 Taichi supports x16 to x8/x8 bifurcation.
Kolapsicle@reddit
Gen5 8/8 isn't bad, but for the amount of cash you had to throw around sacrificing the Gen5 NVMe slot and leaving half the PCIe bandwidth on the table is still crazy. Don't get me wrong, I'm sufficiently jealous, and for smaller AI workloads the PCIe bottleneck won't be an issue, but if you saturate both cards with a large model you'll start to see relatively poor scaling.
Quadrapoole@reddit (OP)
Don't really find Gen5 NVMe worth it over Gen4.
It runs hot and doesn't really speed up the OS that much.
It would have cost too much to get a proper server Threadripper system, so that's why I went max consumer PC.
I mean, I spent about 26k on this system, and most of it is the GPUs, since VRAM is most important. Dunno how I could have built it better for less.
Trust me, I wish the RTX 6000s had NVLink.
Just asking to get help with ComfyUI. Like any good YouTube videos to get started, especially for the NSFW stuff...
Kolapsicle@reddit
Assuming you have ComfyUI installed, you can download all sorts of image and video models from https://civitai.com/models (https://civitai.red/models for NSFW). You can filter by model, checkpoints, LoRAs, etc. You'll probably want to check out the workflows in particular to get up and running. Oh, and if you haven't already, https://github.com/Comfy-Org/ComfyUI-Manager is probably the most important extension you can install. With it you can drop workflows into ComfyUI and install missing nodes with the click of a button.
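Getting the Manager set up is typically just a clone into the custom_nodes folder, then a restart of ComfyUI (path assumes a default ComfyUI checkout):

```bash
# Typical ComfyUI-Manager install: clone into custom_nodes, then restart ComfyUI.
cd ComfyUI/custom_nodes
git clone https://github.com/Comfy-Org/ComfyUI-Manager
```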
tgromy@reddit
Wow, what a nice setup bro
Quadrapoole@reddit (OP)
Thank you for your kind words!
rebelSun25@reddit
So, I dropped $30k CAD, but I'm not sure how to best utilize it.
"Claude, do that thing where you run the best possible scenario for my hardware, old chap. Be quick. Make haste and avoid mistakes this time, eh!"
Quadrapoole@reddit (OP)
Well, I have 3 kids, so life gets busy.
That's why I'm asking for help, but so far only one person has actually helped.
Elon was right about Reddit.
Makers7886@reddit
"now draw a picture of a cat"
LatentSpacer@reddit
Nah, you have a 13900 when you could have had a 14900.
Quadrapoole@reddit (OP)
Hahaha. That's true, but it wasn't much faster and still has the same Intel power problem.
Don't think I'll buy another Intel because of that fiasco.
Herr_Drosselmeyer@reddit
I'm confused by the watercooling loop.
Quadrapoole@reddit (OP)
I have waterblocks for the RTX 6000 Pros.
Just need to test them first.
Admirable_Dirt_2371@reddit
People like you are why this world is so broken. AND you have kids. I've lost all hope for humanity.
Visible_Afternoon_98@reddit
Where tf y'all getting so much money
traveddit@reddit
???
Daemontatox@reddit
Can it run Crysis 2 tho?
onewheeldoin200@reddit
Solid shitpost tbh
Main_Secretary_8827@reddit
what the actual fuck