Anyone else running local LLMs on older hardware?
Posted by lewd_peaches@reddit | LocalLLaMA | View on Reddit | 44 comments
I'm using an old Xeon workstation with a decent amount of RAM and it's surprisingly usable. What's the oldest/weirdest hardware you've successfully run a model on?
ForsookComparison@reddit
Dual channel DDR4 gang where you at
Thedudely1@reddit
32GB of 3200MT/s DDR4 lets gooooo
ForsookComparison@reddit
My man
a_beautiful_rhind@reddit
Phenom 2 X4 with RX580 is the oldest computer I got LLMs working on. Man that was a slow ride. Had to compile without AVX2 support.
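For anyone else stuck on a pre-AVX CPU, the build is mostly just turning those instruction sets off; a rough sketch using current llama.cpp option names (older checkouts used LLAMA_* instead of GGML_*, so adjust if you're on an old tree):

    # build llama.cpp without AVX/AVX2/FMA for Phenom II-era CPUs
    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp
    cmake -B build -DGGML_NATIVE=OFF -DGGML_AVX=OFF -DGGML_AVX2=OFF -DGGML_FMA=OFF -DGGML_F16C=OFF
    cmake --build build --config Release -j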
Thedudely1@reddit
That's awesome. I'm on a similar era of GPU but the Phenom 2 X4 that's crazy.
Thedudely1@reddit
I'm using a GTX 1060 6GB to run small dense models or to run 30b-ish MoE models with expert layers offloaded to my CPU (with 32GB of RAM. Yes I realize that my i5 11400 is doing a lot of the heavy lifting in that case)
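For anyone curious, the usual way to get that split in llama.cpp is to push just the MoE expert tensors to system RAM and keep everything else on the GPU; a minimal sketch (the model file, context size, and thread count are placeholders for a setup like mine):

    # all layers on the 1060, except the expert tensors which stay in system RAM
    llama-server -m some-30b-moe-Q4_K_M.gguf \
      -ngl 99 \
      -ot ".ffn_.*_exps.=CPU" \
      -c 8192 -t 6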
Shipworms@reddit
IBM X3650 M4 (released 2014), with 768GB RAM and 2x 8-core Xeons! I have run Kimi K2.5 at Q4, and it obviously wasn't ridiculously fast, but it was fast enough to give it a task, forget about it, and a few minutes later you have a reply! I did some experiments with code generation, and it was pretty good! It also has 6x PCIe x16 slots, so in theory that could be 6 GPUs, or 12 GPUs at PCIe x8.
[Note: anyone with an IBM X3650 M4 or related SAN volume controller: check the IMM2 date. A few years after the release in 2014, a software bug appeared that sends excess current to a small chip on the motherboard every boot/reset cycle; it isn't widely known since most of these servers stay powered up 24/7. I have 3x of these, 2 never-used spares and 1 used but in good condition. All 3 had the voltage-regulator-destroying software bug, and all 3 got an updated IMM2.]
Also a 2021 crypto board with a 2011-era chipset, 8 PCIe slots (and a 2-core Sandy Bridge i3), and 5x Intel Arc B50.
Currently testing: ASRock H510 Pro BTC+, a 6-PCIe-slot mining board, but DDR4. Have a 6-core i5 in there currently, 32GB RAM (a laptop DDR4 SO-DIMM in a desktop DIMM converter), 2x 5060 Ti 16GB, and 4x Arc Pro B50 16GB. It has an extra port for a 7th GPU as well. The main benefit of this board is Resizable BAR and not locking up when a 5060 Ti is plugged in!
Regarding the older Xeon hardware: my thought is that it could be used as a 'backup' computation unit for local AI, so if you have, for example, 32GB of VRAM in your PC, you can still load models that won't fit … instead, they overflow to the server with lots of RAM! This is my idea with the crypto board - it can be the 'main' inference station, but can offload data to the server if a huge model (GLM 5.1, Kimi K2, etc.) is loaded that won't fit into the smaller, faster computer. Ideally this could even use wake-on-LAN or something like that, so the server only powers up when needed!
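For the overflow idea, llama.cpp's experimental RPC backend does roughly this already: the big-RAM server runs rpc-server and the fast box treats it as one more backend. A rough sketch, with hostnames, ports, the MAC, and the model file all placeholders (both builds need the RPC backend compiled in with -DGGML_RPC=ON):

    # optionally wake the sleeping server first (placeholder MAC)
    wakeonlan aa:bb:cc:dd:ee:ff

    # on the big-RAM Xeon server
    rpc-server --host 0.0.0.0 --port 50052

    # on the smaller, faster box: whatever doesn't fit locally spills over the LAN
    llama-cli -m huge-moe-model.gguf -ngl 99 --rpc 192.168.1.50:50052 -p "hello"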
createthiscom@reddit
What’s the tok/s on that IBM? With or without a GPU?
Shipworms@reddit
I'm not sure; I had the RAM slowed down to 800MHz for initial testing, the Xeons didn't max out, and they reached 62 degrees C max during inference. I think it was about 2 tokens/second IIRC for Kimi K2.5, but even then I was using a quantized model that used more bits than the original unquantized model for most of the weights (total model size 630GB), so I should be able to get a good speedup by taking the RAM and CPU out of power-saving mode, using a properly quantized model, etc.
While that is slow, my rationale for setting it up this way is that I can't afford 768GB of VRAM, and this lets me run literally any model in existence, albeit slowly, which is better than not running the model at all (especially with RAM and other component prices going through the roof). I am going to mess around with linking multiple computers together across Ethernet soon; I have 3 of these servers, which I will test with 384GB RAM each and 24 Xeon cores each :)
createthiscom@reddit
It would be interesting to just slap a modern GPU on it (a 5090) and see how fast it goes. I run a similar setup with 768GB of DDR5 on dual EPYC 9355s. I think I got 6-8 tok/s before adding a GPU, and I get 20+ tok/s with a Blackwell 6000 Pro.
mrtrly@reddit
Your Xeon works because you've sized the model to the hardware. At 10 t/s (MacBook in the thread), you've hit the usability floor without GPU. Small models on massive old Xeons are about as good as this gets.
HopePupal@reddit
not that old or weird, i guess, but i have a 2019 Intel MBP (Core i9-9980HK, 64GB dual-channel DDR4) for CPU-only inference with ik_llama and Qwen3.5 35B-A3B Q6_K_L. (those machines have dGPUs but they're hopeless for LLM use.) PP 45.3 t/s and TG 10.2 t/s at small context. it's the one server in my office that's always on, and it has a built-in UPS.
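If anyone wants comparable numbers from their own relic, llama-bench is the quickest way to get PP/TG figures like those (shown with mainline llama.cpp syntax; ik_llama is a fork so it ships the same tool, and the model path and thread count below are placeholders):

    # CPU-only prompt-processing (pp) and token-generation (tg) throughput
    llama-bench -m model.gguf -ngl 0 -t 16 -p 512 -n 128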
vreezy117@reddit
We had an i5 Gen 3 from 2013. Since last month the CPU is no longer supported: Bun requires AVX at the CPU architecture level. We used Ollama and Open WebUI and wanted to use OpenCode. We now have a new CPU and motherboard, but the GTX 1080 now has problems in Linux. :(
-Luciddream-@reddit
Steam Deck for me, just to see if it's working (it does)
Ok_Mall4434@reddit
my current server:
Intel Pentium G3258 CPU (2 Cores)
Asrock z97 extreme9 mobo
32GB RAM
4x RTX 3090, all at PCIe Gen3 x8
running vLLM, and the speed is pretty good
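For reference, the typical vLLM launch for a 4x 3090 box like that is plain tensor parallelism across all four cards; a minimal sketch (the model is just an example of something that fits in 4x 24GB):

    # shard one model across the four 3090s
    vllm serve Qwen/Qwen2.5-32B-Instruct-AWQ \
      --tensor-parallel-size 4 \
      --gpu-memory-utilization 0.90 \
      --max-model-len 16384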
ForsookComparison@reddit
G3258 - legendary CPU. Did so many good projects with those; $50 new.
Yayman123@reddit
An Intel NUC from 2015 with an i3 and 16GB of RAM. Running Qwen 3.5 2B. It's kinda dumb and slow but juuust good enough.
harglblarg@reddit
Same here, home server with 2014 i5, 16gb RAM runs tiny Qwen for basic language processing and information extraction.
nakedspirax@reddit
Don't you guys want quality?
Yayman123@reddit
I'm using it for several simple tasks like feeding a weather entity and having it generate a paragraph about the weather for my dashboard, nothing crazy.
harglblarg@reddit
For what it does it's plenty. If I need more I can finetune.
nakedspirax@reddit
Yeah you got a point. That is plenty
Feuerkroete@reddit
Interesting, would you care to provide your llama.cpp flags?
Yayman123@reddit
Oh I'm not doing anything special. I'm just using the stock settings for the Ollama application for Home Assistant with a 4K context window. Anything specific you wanted me to check?
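(For anyone replicating this outside Home Assistant: the only non-default bit is the context window, which in plain Ollama you'd usually set through a Modelfile. The model tag and name below are just placeholders:)

    # clone a small model with a 4k context window
    printf 'FROM qwen2.5:3b\nPARAMETER num_ctx 4096\n' > Modelfile
    ollama create small-4k -f Modelfile
    ollama run small-4k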
Glittering_Mouse_883@reddit
Ryzen 5 1600, 128gb ram, 72gb vram (2x rtx 3090, 2x rtx 3060)
Potential-Gold5298@reddit
Core i5-4460 and 32GB DDR3. Gemma 4 31B Q5_K_S: 1 t/s. Gemma 4 26B-A4B Q5_K_M: 6.3 t/s. I'm happy.
Feuerkroete@reddit
I'm interested in your llama.cpp flags. Would you mind sharing them?
Potential-Gold5298@reddit
I borrowed the settings from this post:
-t 4 -c 32768 -ctk q8_0 -ctv q8_0 -ctxcp 2 --cache-ram 2048 --parallel 1 --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0.0 --jinja --reasoning off --mlock
You can experiment with -ctk and -ctv.
Enable reasoning where necessary (just remove the flag - reasoning is enabled by default).
--min-p 0.0 – in llama.cpp the default is 0.05, but Google doesn't mention it in the recommendations – I'm not sure which value is best. Overall, you can experiment with the sampler depending on your task.
1ncehost@reddit
I've got a 48 core epyc from 2017 and 4x MI100s from 2020 in my training rig. I'm curious how long they'll last.
a_beautiful_rhind@reddit
They will last until you get tired of them most likely.
Constant-Simple-1234@reddit
Ooh. Training on ROCm? How difficult was the setup? What do you train (model / size / use case)? Thanks in advance!
1ncehost@reddit
Easy. The most difficult thing was figuring out the cooling on the datacenter-style cards. Software is plug and play nowadays for ROCm. I have trained lots of things, but currently a tiny model.
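These days the ROCm setup for cards like the MI100 is mostly just installing the matching PyTorch wheels; a rough sketch (the rocm6.1 index tag is only an example, use whichever matches your ROCm install):

    # PyTorch built against ROCm for AMD GPUs like the MI100
    pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.1

    # sanity check that the cards are visible (ROCm builds expose the torch.cuda API)
    python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"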
createthiscom@reddit
I've got a dual EPYC from 2025, way back before LLMs were good at math.
jacek2023@reddit
A 2008 computer with a 3060 running Qwen 3 IIRC; posted it on Reddit some time ago.
brwinfart@reddit
2016 MacBook Pro, 16GB.
Currently setting it up as an automatic video editing suite.
Had to swap to the Gemini API to analyse podcast transcripts to find shorts. Hoping to 'chunk' the transcript eventually and run it through a local LLM.
jeffzyxx@reddit
I picked up some Nvidia GRID K520s (three of them, to be precise) - each basically two GTX 680 4GBs duct-taped together. Threw them in a third-gen i7 and a fourth-gen i5. Had Claude bang its head against the wall to figure out how to hack the CUDA bytecode and... It runs!
Currently trying to get GPT-OSS 20B running across the four dies (two cards) - it's working right now, just slow. Already had it implement a research paper that sped up generation a bit.
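For reference, the llama.cpp side of splitting across the four dies is just the split flags, since each die shows up as its own CUDA device; a rough sketch (the model file and the even 1,1,1,1 split are placeholders):

    # spread layers evenly across the four 4GB dies (two K520 cards)
    llama-server -m gpt-oss-20b.gguf -ngl 99 \
      --split-mode layer \
      --tensor-split 1,1,1,1 \
      -c 4096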
(This is just for fun, my actual rig is a P100 paired with a 5800x - trying to find a cheap MI50 to pair with it.)
Constant-Simple-1234@reddit
On vulkan you can use Mi50 with nvidia.
jeffzyxx@reddit
Like splitting layers across the cards, or something else? I think splitting layers between "cards" was what I had to do to get GPT-OSS working on the (painfully small) 4GB of VRAM per die on the K520's.
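(For what it's worth, splitting layers is the usual approach: with llama.cpp's Vulkan build, AMD and Nvidia cards all show up as ordinary Vulkan devices, so the same layer-split flags work across vendors. A rough sketch, with a placeholder model file and an even split for two 16GB cards:)

    # build with the Vulkan backend instead of CUDA/ROCm
    cmake -B build -DGGML_VULKAN=ON
    cmake --build build --config Release -j

    # split layers across the two cards
    llama-cli -m model.gguf -ngl 99 --split-mode layer --tensor-split 1,1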
AlwaysLateToThaParty@reddit
I'm using a tenth-generation Intel from 2019: a 10900 at 2.8GHz, with 128GB of RAM. I just bought an RTX 6000 Pro to go with it. I was planning on upgrading the system as well as the GPU, but with RAM prices the way they are, that's just going to have to wait for a little while. Can't say I have any complaints though.
Skyline34rGt@reddit
I experiment with my old laptop:
Intel 6405U (2 cores / 4 threads), 1x 8GB RAM, and GPU: MX350 with 2GB VRAM.
With CPU only it was a disaster, but after many tests I landed on Linux + KoboldCpp (only this combo will handle the MX350, using Vulkan and the 'very old CPU' profile). Nothing worked on Windows, and on Linux neither LM Studio, Ollama, nor plain llama.cpp worked - only KoboldCpp with that very-old-CPU profile xD
And the best I can run fully in VRAM at good speed is Qwen3.5 2B with a Q4_K_M quant.
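In case it helps anyone on similar hardware, the command-line equivalent of that setup is roughly this (flag names from recent KoboldCpp releases; the model file is a placeholder, and I believe the GUI's very-old-CPU profile mainly toggles the AVX2 code paths off):

    # Vulkan for the MX350, no AVX2 code paths for the old CPU
    python koboldcpp.py --model model.gguf \
      --usevulkan \
      --gpulayers 99 \
      --noavx2 \
      --contextsize 4096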
bnightstars@reddit
I'm running some small 2B/4B models on an i5/16GB MacBook Pro (2020); it's hovering around 10 t/s, so I wouldn't call it usable.
ttkciar@reddit
My laughably ancient systems:
Dell T7910 with dual E5-2660v3 Xeons and 256GB DDR4, no GPU,
Supermicro X10DRC-T4+ with dual E5-2680v3 Xeons, AMD MI50, and 128GB DDR4,
Supermicro X10DRU-i+ with dual E5-2690v4, AMD MI60, and 256GB DDR4,
A really old Dell T7500 with a E5504 Xeon, AMD V340, and 24GB of DDR3, with a second PSU literally duct-taped to it and daisy-chained via ADD2PSU device.
SM8085@reddit
I love my old HP Z820 2xE5-2697v2. It's got the beefy 256GB of old, slow RAM.
Lemonzest2012@reddit
2x Nvidia Tesla P100 16GB from 2016; the rest of my system is pretty new, a Ryzen 7 5700G with 96GB RAM.