Pretty sure I maxed out my consumer PC. Help me run the best model for my needs please
Posted by Quadrapoole@reddit | LocalLLaMA | 31 comments
What is the best model that'll work with my setup?
Did I goof buying a second set of 128GB of system RAM for a non-server board?
Just using this for personal use. I honestly needed LLMs to help me set up Linux as a Windows refugee.
I want to use an LLM to help code Home Assistant stuff and do personal OCR of documents.
Haven't tried coding yet, but I've seen some pretty cool stuff for restoring old pictures.
I also want to use models to create homeschooling lessons for my kids.
Also want to learn how to do some goon stuff, so if anyone can help me in that direction, that'd be sweet.
Thanks in advance!
ghgi_@reddit
With a setup like this, for pure GPU I'd say look at MiniMax M2.7. Very solid model, and NVFP4 will work well on those Blackwells and run pretty fast. If you want to offload, though, I'd say GLM 5.1 is your best bet: a solid model that does even better than MiniMax, but offloading means a speed loss. So it's a quality-vs-speed tradeoff; I'd test both to find what works for you and your workload.
Quadrapoole@reddit (OP)
What would be best to run it?
I only have experience with llama.cpp, but I hear vLLM is better?
Any tips on SGLang vs vLLM vs llama.cpp?
What about all the quants: NVFP4 vs AWQ vs REAP?
There's so much every day, it's hard to keep up.
I'm only a single user, so I'm wondering which will get me the max-intelligence model.
Tbh I started down this rabbit hole after seeing ik_llama.cpp running DeepSeek.
ghgi_@reddit
For pure GPU I'd say use vLLM; honestly there's probably no need to get into SGLang for what you're doing. My tips for vLLM are honestly just to copy other people's configs, use prebuilt Dockers, etc. vLLM has a lot of knobs and dials, and on RTX 6000 Pros, in my experience, if you're doing it from scratch you're going to need some trial and error.
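Roughly, a dual-GPU vLLM launch looks something like this. Just a sketch to copy from, not an exact config; the model ID is the NVFP4 quant floating around on Hugging Face, and the values will need tuning for your setup:

```bash
# Minimal sketch: serving an NVFP4 quant across two RTX 6000 Pros with vLLM.
# Model ID and flag values are illustrative; tune per setup.
vllm serve lukealonso/MiniMax-M2.7-NVFP4 \
  --tensor-parallel-size 2 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.90
```

--tensor-parallel-size 2 splits the weights across both cards; drop --max-model-len or --gpu-memory-utilization if you hit OOM at startup.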
If you're doing offloading (GPUs + CPU), go with llama.cpp. It's also better if you just want pure simplicity, and it can obviously do pure GPU too. NVFP4 is the quant optimized for Blackwell cards like yours and the best option 90% of the time, but if vLLM on NVFP4 is too much of a hassle (I have some configs for MiniMax on a dual RTX 6000 Pro setup if you can't figure it out), llama.cpp will make your life easier (rough example at the end of this comment). No NVFP4, but you get the most used quant style, which is GGUF. I'd always recommend getting them from unsloth; the UD versions are often better in my experience, and they always publish them.
AWQ is mostly for vLLM/SGLang. I wouldn't use it unless you had to, and in this case the models I suggested should have NVFP4; if you're offloading, you should use GGUFs anyway. I wouldn't touch REAP in general, too much quality loss.
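For the offloading route, the command looks roughly like this. Just a sketch, the GGUF filename is made up, and the -ot regex is the usual trick for keeping the MoE expert tensors in system RAM while everything else stays on the GPUs:

```bash
# Rough sketch of GPU+CPU offloading with llama-server.
# The GGUF path is illustrative; grab a real unsloth UD quant from Hugging Face.
llama-server \
  -m GLM-5.1-UD-Q4_K_XL.gguf \
  -ngl 99 \
  -c 32768 \
  -ot ".ffn_.*_exps.=CPU"
```

-ngl 99 pushes all layers to the GPUs first, and the -ot override then pins the expert tensors back to CPU, which is where your 256GB of system RAM earns its keep.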
Quadrapoole@reddit (OP)
Can you post some dual RTX 6000 Pro vLLM MiniMax M2.7 configs? Much appreciated!
ghgi_@reddit
I had to make a few last-minute configs since my old MiniMax script was for vLLM 1.17.1 and outdated. This script should work on vLLM 1.19.1/latest stable release: https://paste.opensuse.org/pastes/ae377dd7b1e5. If it doesn't work, it should at least still be roughly 80% correct, and the problem probably has something to do with the moe-backend flag.
Quadrapoole@reddit (OP)
Thanks for actually helping.
Are you doing any ComfyUI stuff? Any advice on getting started?
ghgi_@reddit
No, sorry. I don't really care much about image or video gen for my use cases, so I've never taken the time to learn Comfy, but there's plenty of info on YT.
Simple_Library_2700@reddit
https://huggingface.co/lukealonso/MiniMax-M2.7-NVFP4
This guy has instructions in the model card on how to run it under SGLang if you want to try that as well.
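If you go that route, the launch is usually something like this. Flags can change between SGLang releases, so defer to the model card:

```bash
# Approximate SGLang launch for the NVFP4 model linked above.
# Check the model card for the exact, current flags.
python -m sglang.launch_server \
  --model-path lukealonso/MiniMax-M2.7-NVFP4 \
  --tp 2 \
  --port 30000
```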
marutthemighty@reddit
Can you please tell me how many GB of RAM your system has in total? Is it DDR4 or DDR5?
Also, are you using an eGPU?
Quadrapoole@reddit (OP)
I've got 256GB of DDR5 at 5800MHz. That's the fastest it can run with four sticks.
Not sure if it was worth getting the second pair of 128, but I think it helps with loading the 192GB of VRAM faster.
These are not eGPUs, as the second pic shows; they're plugged into the mobo with Gen5 x8/x8 bifurcation on my ASRock Z790 Taichi Carrara.
The only reason I got the second pair was when I learned about DeepSeek Engram, and they're taking forever to release their model.
Not sure if I goofed spending 2k on 128GB.
marutthemighty@reddit
Ok. But would an eGPU help, in your case? Or is it overkill?
Kolapsicle@reddit
The i9-13900KF has 20 PCIe lanes. You're using 8 lanes for your chipset and NVMe. You'll be lucky to run those GPUs in an 8/4 configuration. Wild, my boy.
Quadrapoole@reddit (OP)
The ASRock Z790 Taichi supports x16 to x8/x8 bifurcation.
Kolapsicle@reddit
Gen5 8/8 isn't bad, but for the amount of cash you had to throw around sacrificing the Gen5 NVMe slot and leaving half the PCIe bandwidth on the table is still crazy. Don't get me wrong, I'm sufficiently jealous, and for smaller AI workloads the PCIe bottleneck won't be an issue, but if you saturate both cards with a large model you'll start to see relatively poor scaling.
Quadrapoole@reddit (OP)
Don't really find Gen5 NVMe worth it over Gen4.
It runs hot and doesn't really speed up the OS that much.
It would have cost too much to get a proper server Threadripper system, so that's why I went max consumer PC.
I mean, I spent about 26k on this system, and most of it is the GPUs, since VRAM is most important. Dunno how I could have built it better for less.
Trust me, I wish the RTX 6000s had NVLink.
Just asking to get help with ComfyUI. Like any good YouTube videos to get started, especially for the NSFW stuff...
Kolapsicle@reddit
Assuming you have ComfyUI installed, you can download all sorts of image and video models from https://civitai.com/models (https://civitai.red/models for NSFW). You can filter by model, checkpoints, LoRAs, etc. You'll probably want to check out the workflows in particular to get up and running. Oh, and if you haven't already, https://github.com/Comfy-Org/ComfyUI-Manager is probably the most important extension you can install. With it you can drop workflows into ComfyUI and install missing nodes with the click of a button.
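Getting the Manager set up is typically just a clone into the custom_nodes folder, then a restart of ComfyUI (path assumes a default ComfyUI checkout):

```bash
# Typical ComfyUI-Manager install: clone into custom_nodes, then restart ComfyUI.
cd ComfyUI/custom_nodes
git clone https://github.com/Comfy-Org/ComfyUI-Manager
```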
tgromy@reddit
Wow, what a nice setup bro
Quadrapoole@reddit (OP)
Thank you for your kind words!
rebelSun25@reddit
So, I dropped $30k CAD, but I'm not sure how to best utilize it.
"Claude, do that thing where you run the best possible scenario for my hardware, old chap. Be quick. Make haste and avoid mistakes this time, eh!"
Quadrapoole@reddit (OP)
Well, I have 3 kids, so life gets busy.
That's why I'm asking for help, but so far only one person has actually helped.
Elon was right about Reddit.
Makers7886@reddit
"now draw a picture of a cat"
LatentSpacer@reddit
Nah, you have a 13900 when you could have had a 14900.
Quadrapoole@reddit (OP)
Hahaha. That's true, but it wasn't much faster and still has the same Intel power problem.
Don't think I'll buy another Intel because of that fiasco.
Herr_Drosselmeyer@reddit
I'm confused by the watercooling loop.
Quadrapoole@reddit (OP)
I have waterblocks for the RTX 6000 Pros.
Just need to test them first.
Admirable_Dirt_2371@reddit
People like you are why this world is so broken. AND you have kids. I've lost all hope for humanity.
Visible_Afternoon_98@reddit
Where tf y'all getting so much money
traveddit@reddit
???
Daemontatox@reddit
Can it run Crysis 2 tho?
onewheeldoin200@reddit
Solid shitpost tbh
Main_Secretary_8827@reddit
what the actual fuck