How far would AMD Threadripper 3600 (24 core, 48 threads) and 256 GB of memory get me for running local LLMs?
Posted by x3derr8orig@reddit | LocalLLaMA | 65 comments
I am thinking about buying a Threadripper this Black Friday to set up a local LLM inference machine. I would add a graphics card (or a few) when the budget allows, but this could be a start, no?
My reasoning is the "low" power consumption (280W), 256 GB of memory (enough to spare for some other tasks), and the possibility to upgrade down the road. Without any discounts, this would be around 2200 euros (with cooling, case, a ton of disk space, etc., the whole package). I hope I could bring this down to 2000 or lower.
Does this make sense or am I delusional?
koalfied-coder@reddit
Just for clarification, RAM is not very important for LLMs. VRAM is the wave. Get a nice Lenovo P620 and thank me later. Slap in two 3090 Turbos, an A5000, or an A6000 and cook!!!
trackpap@reddit
Hey, I've got 3x 3090s and 1 A6000, what mobo and specs should I be looking for?
koalfied-coder@reddit
What size are the 3090s? Turbo single stack or greater than a 2 stack? I'm not sure if I can share links here, so I'll DM you some options. Essentially you'll want a true server rig with PCIe Gen 4 for the A6000. A Gigabyte G292-Z20 Epyc may work if the cards fit. If not, there are 8-10-card 4U servers you can use for larger cards.
mj3815@reddit
Would you mind DMing me as well? I'm looking into the P620 as an option for dual 3090s. I'm worried about power supply capacity, though, and I don't know about hacking together a solution. Thanks!
koalfied-coder@reddit
That's assuming you want all the cards in the same server. For the A6000, a P620 is a fantastic host. Even two 3090s, if they are the slim kind.
koalfied-coder@reddit
I HIGHLY recommend a used Lenovo P620 workstation with an A-series card of your choice. They can be had for less than 700 with a Threadripper and the works. It's crazy how good they are.
sibilischtic@reddit
OK, I bought a P620 Quadro, now what?
koalfied-coder@reddit
What's your budget for a graphics card? I would get something like one or two A5000s or 3090 Turbos. If you have the spend, an A6000 is ideal. There are other options as well, but these are the ones I have experience with.
mj3815@reddit
do you have to upgrade the power supply for dual 3090s?
Terminator857@reddit
On ebay or where?
koalfied-coder@reddit
eBay is solid. I went with the PC Server and Parts store myself for $650.
FullstackSensei@reddit
Get an Epyc. Same specs, much cheaper. You can build a 64-96 core, 16-memory-channel (375-400 GB/s), dual-CPU Epyc of the same generation as that Threadripper for about half the price if you're a bit patient. You'll also get additional nice features like a 10Gb NIC and remote management for free.
x3derr8orig@reddit (OP)
Epyc never crossed my mind, but yes, it seems like a great option. The only problem is there are so many variants available. I found this one (above Rome) - the AMD EPYC 7402 - it has 8 memory channels (where Rome, as far as I can tell, has only 4), so that's an advantage, and it seems you can even OC the memory beyond 3200. And I found a few on local eBay for around 250 euros.
FullstackSensei@reddit
No Epyc has 4 channels. It's either 8 channels of DDR4 for 7xx1 to 7xx3, or 12 channels of DDR5 for 9xx4 and 9xx5. Motherboards can be a bit confusing with how the PCIe lanes are exposed.
If you want a higher-end Rome, DM me. I have a few I'm willing to part with. I'm also in Europe.
RealPjotr@reddit
Or 6 channels for the 8004 series Zen4c.
khrizp@reddit
Wasn’t it 12 channel per CPU?
FullstackSensei@reddit
8 channels of DDR4 on SP3 (up to Milan), 12 channels of DDR5 on SP5 (Genoa and later). Rome Epyc is currently the sweet spot for hobbyists. 48-core CPUs (with 256MB cache) cost under 500$/€, and dual-CPU motherboards can be had for around 300-350$/€ (1st-gen Epyc motherboards, which also support 2nd gen, but the PCIe slots run at 3.0 speed). 2933 or 3200 ECC DDR4 costs a little over 1$/€ per GB. SP3 is the same socket as TR4/sWRX8 as far as CPU coolers are concerned, and there are plenty of those for cheap thanks to how popular first-gen Threadripper was.
khrizp@reddit
How many CCDs? I remember reading that impacts performance, and I also remember reading that we don't really get to add the memory bandwidth together since it's split across different CPUs.
Terminator857@reddit
Intel Granite Rapids might be a better bet.
https://www.phoronix.com/review/intel-xeon-6980p-performance/10
koalfied-coder@reddit
This is like recommending the world's best rollerblades for cross-country travel. Can it be done? Sure. Should it be, when 3090s are so cheap? Probably not.
Terminator857@reddit
Is that what this entire thread is about?
koalfied-coder@reddit
The thread is asking if it makes sense to go with CPU/memory over graphics cards and VRAM, which it currently does not.
kiselsa@reddit
Running anything larger than 70B (64 GB of RAM) will be too slow to read in real time, and even 70B will not be particularly fast either.
Better to spend your money on some used 3090s
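As a rough back-of-the-envelope sketch of why (assumed figures, not measurements): each generated token has to stream essentially all of the model's weights through memory, so decode speed is roughly bandwidth divided by model size.

```python
# Rough decode-speed estimate: tokens/s ~= memory bandwidth / bytes of weights read per token.
# All numbers below are illustrative assumptions, not benchmarks.

def est_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper-bound estimate; real throughput is lower due to non-ideal memory access."""
    return bandwidth_gb_s / model_size_gb

# Quad-channel DDR4-3200 (the 3960X platform) peaks around 102 GB/s.
for label, size_gb in [("70B Q4 (~40 GB)", 40), ("70B Q8 (~70 GB)", 70)]:
    print(f"{label}: ~{est_tokens_per_second(102, size_gb):.1f} tok/s best case")
```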
x3derr8orig@reddit (OP)
Right, but at the moment I don't have anything to attach a GPU to, so I thought I'd build something now that I can use (as a DIY setup, nothing commercial, just for me to play around and test various things) and later add GPUs one by one. So this is more like a solid base that I can expand later and not run into walls a year from now.
advertisementeconomy@reddit
I think the typical response would be to build something out of eBay (or similar) kit on the cheap and plop a couple of real GPUs into that, and you're still around or under 2K euros. Search the threads here for cheap build suggestions. I'd be curious to see benchmarks on a system like you've described, but I imagine it would be pretty painful.
Good luck!
Due_Town_7073@reddit
My question too. If I buy 3090s, then what do I put them in? If I buy 6x 3090s, what would be the minimum motherboard and RAM to get me 20 t/s on 70B or even 123B Q4?
crazymonezyy@reddit
What's the LLM you want, and what will you use it for?
If you want to run a business or something where multiple users need to interact with the LLM in real time - it's not possible.
x3derr8orig@reddit (OP)
I know, yes, this would be a home lab / playground kind of scenario, just for me to play around and test different things, nothing multi-user or commercial.
crazymonezyy@reddit
I understand this is "local" LLaMA, but if you're OK going non-local, then with interruptible pricing, hosts like vast.ai go as cheap as $0.07-0.10/hr for a 4090.
If you remember to turn off your instances, then even if you use it 4-5 hours a day, every day, it'll cost you something like $200 for the entire year. Compared to that, 2000 euros is a lot of money. Of course your PC has other utility, but this way you can begin your experimentation without committing.
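The arithmetic, roughly (assuming the upper end of that rate and five hours of use a day):

```python
# Ballpark yearly cost of an interruptible cloud 4090 at the quoted rates.
hourly_rate = 0.10     # $/hr, upper end of the quoted range (assumption)
hours_per_day = 5
print(f"~${hourly_rate * hours_per_day * 365:.0f} per year")  # ~$183
```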
x3derr8orig@reddit (OP)
that's also a valid point, yes.
Barafu@reddit
Not far. In fact, a low-tier new Ryzen with heavily overclocked RAM will be much faster. But not as fast as a pair of 3090s.
DataGOGO@reddit
Can you explain your logic on the memory?
Even with the fastest overclocked memory (assuming it is truly stable), a true 4-channel setup will still have twice the memory bandwidth.
Sufficient_Prune3897@reddit
4x DDR4-3200 vs 2x DDR5-8000. In theory, the DDR5 should be faster.
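For reference, the peak theoretical numbers work out like this (a quick sketch; sustained bandwidth in practice is noticeably lower):

```python
# Peak theoretical DRAM bandwidth = channels * transfer rate (MT/s) * 8 bytes per transfer.
def peak_bandwidth_gb_s(channels: int, mt_s: int) -> float:
    return channels * mt_s * 8 / 1000  # MB/s -> GB/s

print(f"4x DDR4-3200: {peak_bandwidth_gb_s(4, 3200):.1f} GB/s")             # ~102.4
print(f"2x DDR5-8000: {peak_bandwidth_gb_s(2, 8000):.1f} GB/s")             # ~128.0
print(f"8x DDR4-3200 (Epyc Rome): {peak_bandwidth_gb_s(8, 3200):.1f} GB/s") # ~204.8
```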
nero10578@reddit
Try and get over 96GB DDR5 stable at anything above 5200MT/s
Sufficient_Prune3897@reddit
True, but that is the difference between 1 t/s vs 1.2 t/s, and with the DDR5 system you at least get a good CPU.
nero10578@reddit
I mean, an Epyc is definitely a better CPU for LLMs, in terms of raw TFLOPS and having gobs of fast memory.
Sufficient_Prune3897@reddit
OP was talking about a 5-year-old Threadripper.
DataGOGO@reddit
Why only 3200? That generation has no problem running at least 3800; even my 1950X ran 4x 3600 C15 1T and had read/write bandwidth over 131k.
DDR5-8000, IF you can get it truly stable, is what? 118k?
Sufficient_Prune3897@reddit
For reference, a 4090 has 1000k and even a 3060 has 360k.
DataGOGO@reddit
Oh for sure, no argument from me on GPU vs dram.
SwordsAndElectrons@reddit
Going to assume you mean a 3960X.
Quad-channel DDR4 is only going to be in the same ballpark as dual-channel DDR5 for bandwidth, and that's going to limit your performance. A newer platform with overclocked memory would probably be a bit faster, I think.
The best thing it has going for it would be the available number of PCIe lanes, but that only comes into play if you start adding GPUs.
x3derr8orig@reddit (OP)
As someone above suggested, I also looked into the AMD EPYC 7402, which comes with 8-channel memory. Wouldn't that be a step in the right direction? And yes, I do plan to add GPUs a bit later (when the budget allows).
SwordsAndElectrons@reddit
In theory, it would potentially be about twice as fast. I can't claim any personal experience with it.
That's around 200GB/s of bandwidth. My guess is that 70B models should definitely be usable if you don't need it to be blazing fast. 123B might be tolerable, but kind of slow.
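Applying the same rough rule of thumb (tokens per second ≈ bandwidth ÷ weight size; ballpark sizes assumed, not benchmarks):

```python
# Best-case decode estimate for an 8-channel DDR4-3200 Epyc (~200 GB/s theoretical peak).
bandwidth_gb_s = 200
for label, size_gb in [("70B Q4 (~40 GB)", 40), ("123B Q4 (~70 GB)", 70)]:
    print(f"{label}: ~{bandwidth_gb_s / size_gb:.1f} tok/s theoretical ceiling")
```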
BlueSwordM@reddit
A 3960X would be fine, if a bit excessive for just running LLMs.
Do note you'll only have ~100GB/s of memory bandwidth accessible, since you're limited to quad-channel DDR4-3200 with that amount of RAM.
x3derr8orig@reddit (OP)
Right, I do mean to use it for other things as well, so not only for running LLMs. I found the AMD EPYC 7402 to have 8 channels; wouldn't that be a better option than the 3960X?
Expensive-Paint-9490@reddit
On paper, yes, by far. But Epycs use mobos engineered for servers, and this can cause some headaches in daily usage: driver compatibility issues, lack of sleep modes, case compatibility, things like that. Not automatically a deal breaker, but you definitely want to be an enthusiast if you go that way.
CMDR_CHIEF_OF_BOOTY@reddit
You'll be able to run any popular LLM model. It'll just be slow as hell. Even with a newer CPU like the Ryzen 7950X, it's just tolerable for smaller models. Not a bad option long term, but if I had to do CPU inferencing I'd go with an AM5 ASUS ProArt motherboard and a 7950X or 9950X CPU.
Because of that, I ended up building an X99 Xeon workstation to set up as an LLM rig. It was like $800 not including any GPUs. Sure, it's not the fastest, but once the models are loaded into the GPUs' VRAM, that stops mattering. Just keep in mind you HAVE to do GPU-only inferencing on an X99 rig; those CPUs will never be fast enough to inference in a usable way.
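If it helps, here's a minimal sketch of what "GPU-only" looks like with llama-cpp-python (the model path is hypothetical; the key part is offloading every layer to VRAM so the old CPU and RAM never touch the weights during decoding):

```python
# Minimal llama-cpp-python sketch: push all layers to the GPU(s) so the slow
# X99-era CPU/RAM only handle orchestration, not the per-token weight reads.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3-70b-instruct.Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=-1,  # -1 = offload every layer to VRAM
    n_ctx=4096,
)
out = llm("Explain memory bandwidth in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```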
Expensive-Paint-9490@reddit
A Threadripper 3960X has more memory bandwidth than a Ryzen 7950X - 4x 3200 vs 2x 5600. Considering that inference is memory-bound, the old Threadripper will be faster. The Ryzen will be faster at prompt evaluation, though.
Psychological_Ear393@reddit
I have a 7950X with 4 sticks because I need 128GB for other things, and Llama 3.2 3B is perfectly usable for inference in WSL through Ollama/Open WebUI. And that's at 3800 MT/s.
TheNotSoEvilEngineer@reddit
Get some cheap P40s; they work fine for most LLMs. 24GB of VRAM. You just have to work on adding active cooling to them.
SwordsAndElectrons@reddit
If only they were still cheap.
nero10578@reddit
They’re no longer cheap
Echo9Zulu-@reddit
Budget-wise, it wouldn't be worth cheaping out on a board that sacrifices even some of the bells and whistles. Getting higher-density memory would also be a requirement to populate all your channels and maximize throughput.
Also, this is the wrong setup to prioritize low power consumption; high load is to be expected from this hardware, even more so with inference tasks. You want a beefcake PSU with a minimum Platinum efficiency rating, which should be standard above the 1600W level. That, memory, and a high-quality board would be good candidates for Black Friday deals in the long term. A build like this takes time to spec out, so you should not full-send it on the best deal; measure a "deal" by specs first, then price, and choose the parts you buy used carefully.
It's always a mixed bag with used parts, but if you source them properly you could easily get a higher-end board and the right deal on a Threadripper. Maybe I'm off by a long shot. If your use case demands inference and other shenanigans at the same time, then fucking full-send it and get all the compute some poor mail person can schlep to your front door.
x3derr8orig@reddit (OP)
Right, so obviously I would not use it only for inference, but for other stuff as well (NAS, Pi-hole, DNSCrypt, HA, Coolify, etc.).
zyeborm@reddit
I have that exact setup with 128GB. Running Goliath 120B Q5 (or so) with a 3080, I was getting about 1-1.2 tokens per second. It's fun to play with, but I wouldn't try to buy such a setup.
If you want to look at cost-effective larger models, take a look at used Epyc gear and check the memory bandwidth of each CPU before you jump in. There's some weirdness around the 16-core mark.
cookerz30@reddit
My Alibaba Epyc has been cruising along just fine since last October.
Short-Sandwich-905@reddit
How much?
Terminator857@reddit
What model size? How many tokens per second?
LoafyLemon@reddit
Bro is still waiting for results
Terminator857@reddit
3 tokens per day? :P
KvAk_AKPlaysYT@reddit
A token a day keeps away
EmilPi@reddit
I guess you mean the 3960X Threadripper.
You can get 3970X results https://www.reddit.com/r/LocalLLaMA/comments/1erh260/2x_rtx_3090_threadripper_3970x_256gb_ram_llm/ (CPU-only results)
Also, note that I bought a used mobo + used 3970X for 1000 euros, and it works just fine. For 2200 you'd be better off buying 2-3 RTX 3090s...
CaptParadox@reddit
You're better off upgrading the video card than relying on the CPU, sadly. (5950X, 16 cores / 32 threads here, with a 3070 Ti.)
StableLlama@reddit
Running a neural network is a massively parallel task. And it is a very simple task. This is something the GPU cores are perfectly suited for. And the complexity a CPU core can handle is overkill.
So it just comes down to the numbers: compare the core counts on the different systems and you'll know which one is faster and by how much.
Terminator857@reddit
Not relevant. The bottleneck is memory bandwidth.