Do cheap 32GB V100s still make sense for homelab AI?
Posted by SKX007J1@reddit | LocalLLaMA | View on Reddit | 65 comments
I already have an RTX 5060 Ti 16GB and a 5070 Ti, but I’m wondering whether picking up a couple of Tesla V100 32GB cards could actually make sense as a value proposition specifically for larger local models.
I know the V100 is old, power-hungry, and missing newer consumer-card features, and I’m not expecting it to beat modern RTX cards for speed or general efficiency. The appeal is mostly the 32GB VRAM per card, especially if they can be found cheap enough.
Use case would be local LLM experimentation: running larger quantized models, testing longer context, maybe splitting/offloading across cards where supported. I already have newer RTX hardware for faster smaller models and image generation, so this would mainly be about getting more VRAM for less money.
Is there a point where 32GB V100s still make sense in 2026 for homelab AI, or is the age/platform/power/software support enough of a downside that I’d be better off putting the money toward a newer single GPU?
Interested in real-world experiences, especially from people who have run V100s alongside newer RTX cards.
No_Mango7658@reddit
I've been extremely tempted. $10k for 256GB at 900GB/s!
pile-of-V100s@reddit
Cheaper than that actually. I have about $10k into 512GB worth of 32GB V100s, including SXM2 adapter boards, risers, and PCIe switches.
No_Mango7658@reddit
That's amazing! There are complete servers with 8x 32GB V100s for like $10k. How is your tps with something like qwen122b or stepfun 3.5 flash or similar? Found any amazing settings for parallelization? I'd love to hear your experience, because I am very tempted. How do you have 16 of these on risers hooked up to your machine? Any info would be appreciated.
pile-of-V100s@reddit
MoEs don't work as well as dense with -sm tensor. Prefill for qwen3.5-122b and step-3.5-flash across 4 GPUs is half as fast as -sm layer, and a quarter as fast on 8 GPUs IME. Decode is about 80% faster than -sm layer for those two, but the really big MoEs that use the deepseek architecture are not supported at all. So other than step-3.5-flash, I always end up going huge (Kimi, GLM 5.1) or dense (Qwen3.6-27B or Qwen3.5-27B-RYS), often both simultaneously: big MoE for smarts and a smaller dense for grunt work, since I see >1500 pp and 45 tg with -sm tensor running 27B, vs a third to half of that (prefill/decode) with MoEs of similar capability like the 122B.
I did it to run the full-size Q4_X version of Kimi K2.5, and to run combinations of smaller models simultaneously while staying fully in VRAM. Tensor parallel was a happy upside, but really I just wanted to hit 768GB-1TB total VRAM+RAM so I could run proper-quality big stuff locally. The "how" is the same standard way everyone does it at this scale - PCIe switches, SlimSAS/MCIO cables, and riser boards - there have been a few posts here about that stuff in the past 6-9 months. The NVLink boards work great, but I only have one quad SXM2 version, used for running a smaller dense model; the idle power increase from NVLink on all 16 isn't worth it when you're still bottlenecked by PCIe past 4 GPUs.
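For anyone unfamiliar with the flags being thrown around here, a minimal sketch of the layer-split vs tensor-split comparison, assuming llama.cpp's `-sm` flag with the values discussed above (the model filename is a placeholder, and `-sm tensor` refers to the recently added tensor-parallel mode mentioned later in the thread - check `llama-server --help` on your build for the exact names):

```
# Layer split: each GPU holds a contiguous slice of the layers.
llama-server -m qwen3.5-122b-q4_k_m.gguf -ngl 99 -sm layer --tensor-split 1,1,1,1

# Tensor split: every layer is sharded across all visible GPUs.
# Faster decode for dense models; prefill can regress badly for MoEs.
llama-server -m qwen3.5-122b-q4_k_m.gguf -ngl 99 -sm tensor
```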
Glittering-Call8746@reddit
So the PCIe version for more than 4?
pile-of-V100s@reddit
IMO, yes. Or dual NVLink boards. The quad NVLink board is limited to PCIe 3.0 x8 per GPU (both the dual and quad SXM2 boards have 4 SlimSAS connections), so it's not a good fit for models that don't fit on 4x 32GB.
Glittering-Call8746@reddit
Will those 9-11 slot GPU expansion boards be good?
pile-of-V100s@reddit
Not sure what you're referring to. The PCIe switches I'm using take one PCIe 4.0 x16 uplink and break it out to 6 or 10 8-lane SlimSAS ports, so a 10-port switch will support either 5 GPUs at x16 (two ports each) or 10 GPUs at x8.
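If you're wiring up switches and risers like this, a quick sanity check on what the box actually sees (standard nvidia-smi tooling, nothing V100-specific):

```
# Topology matrix: confirms which GPUs sit behind which PCIe switch.
nvidia-smi topo -m

# Current link generation/width per GPU - catches x8 links that trained down.
nvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current --format=csv
```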
Shellite@reddit
Got any pics of your setup?
Bootes-sphere@reddit
V100s are a trap for homelab unless you're specifically targeting CUDA 7.0 workloads or have free power.
Your 5070 Ti alone will crush a V100 on anything modern: better memory bandwidth, tensor cores that actually matter, way lower power draw. The 32GB sounds appealing, but you're paying for datacenter-grade reliability you don't need.
If you're actually memory-bound (running multiple 13B+ models simultaneously), grab a used RTX 6000 Ada or wait for the next-gen consumer cards. A single 5090 will outperform two V100s for a fraction of the power bill.
The only case I'd make: if electricity is basically free and you want to run 3-4 large models in parallel for experimentation, the raw VRAM density is useful. But you'd be better served buying one newer card than two V100s. What models are you actually running that max out your current setup?
CharmingAioli3228@reddit
"A single 5090 will outperform two V100s for a fraction of the power bill." at 7 times the hardware cost? Like I am not disputing that 5090 is better - I would like one. But you are comparing a 500$, 10yo GPU, to a 3,5k modern beast. In my mind, this right here peeks my interest in those, that people are even comparing those side by side.
David-Gallium@reddit
I've got 4x V100 32GB on an NVLink board running the 1Cat fork of vLLM with Qwen 3.5 122B AWQ at full context. Took me about an hour to get working, and without any additional performance tuning, below is what I'm seeing for a 32k-in/2k-out, 4-concurrency benchmark. Idle is 220W and it ramps to 600W when inferencing.
```
============ Serving Benchmark Result ============
Successful requests: 4
Failed requests: 0
Benchmark duration (s): 32.46
Total input tokens: 32000
Total generated tokens: 2000
Request throughput (req/s): 0.12
Output token throughput (tok/s): 61.61
Peak output token throughput (tok/s): 75.00
Peak concurrent requests: 4.00
Total token throughput (tok/s): 1047.37
---------------Time to First Token----------------
Mean TTFT (ms): 13557.73
Median TTFT (ms): 13562.83
P99 TTFT (ms): 25368.65
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 13.55
Median TPOT (ms): 13.49
P99 TPOT (ms): 13.72
---------------Inter-token Latency----------------
Mean ITL (ms): 13.52
Median ITL (ms): 13.49
P99 ITL (ms): 18.41
==================================================
```
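For reference, a hedged sketch of how a run like this is typically launched, assuming the 1Cat fork keeps stock vLLM's CLI (the model ID is illustrative, not verified against the fork):

```
# Serve the AWQ quant across the 4 NVLinked V100s; float16 since Volta has no BF16.
vllm serve Qwen/Qwen3.5-122B-AWQ --tensor-parallel-size 4 --dtype float16

# Roughly the benchmark shape above: 4 requests of ~8k in / 500 out, concurrency 4.
vllm bench serve --model Qwen/Qwen3.5-122B-AWQ --dataset-name random \
    --random-input-len 8000 --random-output-len 500 \
    --num-prompts 4 --max-concurrency 4
```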
CharmingAioli3228@reddit
Can I ask you about the board? DM is fine. I'm currently on the lookout for one and it's a headache.
David-Gallium@reddit
My setup was a complete server, a Dell C4140. The whole server with the GPUs was cheaper than buying the aftermarket SXM board and 4x V100s on their own.
pile-of-V100s@reddit
I have... many 32GB V100s, because it was a cheaper way to reach 768GB-1TB of combined RAM+VRAM. In hindsight it was probably a bad tradeoff at my income level, given the time spent building, troubleshooting, and dealing with power and cooling vs buying newer, but I think it still makes sense for a lot of people - just be prepared to deal with the jank, and expect to only use mainline llama.cpp.
Oh, and those NVLink boards do work very nicely, but they nearly double idle power. 4 V100s on a quad NVLink board will idle at 180W...
Glittering-Call8746@reddit
Have github repos of your adventures?
pile-of-V100s@reddit
Nah, I started backporting mainline's stability & precision fixes and Volta MMA tensor core support to ik_llama.cpp but then mainline added tensor parallel, rendering most of that (sloppy) work moot.
Besides V100 SXM2 boards (see ebay/aliexpress and that lawyer's posts here), the rest of the adventure is the same kind of stuff anyone who tries to hook up and run 16 GPUs on one system will encounter. (PCIe switches, SlimSAS/MCIO risers, power supplies, maybe a dedicated mini-split A/C unit...)
Glittering-Call8746@reddit
Which lawyer post ?
pile-of-V100s@reddit
https://old.reddit.com/user/TumbleweedNew6515/submitted/
twnznz@reddit
Why not just whip out vast.ai and test your case to see if it's good enough?
aks4896@reddit
We all need to hear this more!!
realbrandonb602@reddit
Running 4x V100S-PCIE-32GB in a Dell T640 tower, so I can speak to this directly.
You said homelab, and that was the key word in my decision too. I specifically went with PCIe cards in a used server chassis over SXM2 on adapter boards. SXM2 is cheaper per card, but you're dealing with adapter boards, cooling headaches, and a build that's harder to maintain. PCIe in a proper server with hot-swap fans, iDRAC remote management, and standard power delivery just made more sense for something that lives in my house and needs to stay running. The T640 shows up on eBay regularly, and the GPU cage fits 4 full-length cards without any fabrication.
All-in I'm under $6k for 128GB of VRAM across 4 cards, including the server, RAM, CPU upgrades, and all the GPU hardware. When I was shopping, 3090s were going for $1100-1200 each and that gets you 3 cards at 72GB total for the same money. The math just didn't work for what I wanted, which was enough VRAM to run multiple large models simultaneously.
The purpose of my build is a local voice assistant: STT, LLM inference, TTS, the whole pipeline running on the tower. Right now it runs 95-100% local; I haven't hit a cloud API in weeks. I've got a 35B MoE on GPU 1 as the primary brain, a 31B dense model on GPU 2 for tool routing, and GPUs 3+4 running a flex pool that swaps between a 120B, an 80B, and a 70B model depending on the task. That's 3 models hot, 2 more on standby, all local.
Software stack: llama.cpp is the daily driver, and it works great on Volta with no issues. I also tested the 1Cat-vLLM fork, which is specifically patched for V100: I got 121 t/s single-user and 502 t/s at concurrency 8 on a 35B MoE. vLLM wins for concurrent workloads, but there's a big caveat: you MUST run with enforce-eager disabled or performance tanks by 7x. Standard vLLM without the 1Cat patches won't run on Volta at all in recent versions.
For the flex pool on GPUs 3+4, the killer feature is how easy GGUF models are to work with. I download a model, drop it in a directory, and llama.cpp just loads it. I've got 320GB of system RAM, so I pin the model files in the page cache; swapping between a 60GB and a 46GB model takes about 26 seconds from RAM vs 90+ seconds from NVMe. For experimentation, being able to rotate through models in under 30 seconds without touching config files is huge.
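One way to get that page-cache pinning, if you're not already doing it some other way, is vmtouch (paths here are illustrative):

```
# Pre-fault the GGUF files into RAM, then lock them in the page cache so
# llama.cpp's mmap load becomes a RAM copy instead of an NVMe read.
vmtouch -t /models/flex-pool/*.gguf    # touch: fault pages into cache
vmtouch -dl /models/flex-pool/*.gguf   # daemonize and lock them resident
vmtouch -v /models/flex-pool/*.gguf    # verify residency
```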
Honest negatives: no BF16, no FP8, and parts of the ecosystem are dropping sm_70 support. TensorRT-LLM and TGI won't work. vLLM only works with the community fork. If you're planning to use anything outside of llama.cpp, verify Volta support before buying. Power draw is real: I'm currently running at default clocks, and the cards pull about 200W each under load. I've got 2000W PSUs arriving so I can test uncapped performance, but right now I'm bandwidth-bound on generation anyway, so more power doesn't necessarily mean more speed.
To directly answer your question: yes, 32GB V100s absolutely still make sense for homelab AI in 2026, but only if you go in with the right expectations. They're not fast in absolute terms; a 5090 will smoke them on per-card throughput. What they give you is raw VRAM density at a price point nothing else touches. If your goal is running larger models locally and experimenting with different architectures, that VRAM is what matters. If your goal is maximum speed on a single small model, buy the newer card.
Huge-Safety-1061@reddit
No, they are slow and finicky and need custom solutions. We all saw Hardware Haven's video comparing one to a 3060, and frankly that's a ridiculous comparison - a 3060 12GB... okay. Granted, he said he was not really into AI; he may even have said it's dystopian lol.
SKX007J1@reddit (OP)
Googling Hardware Haven now.
codehamr@reddit
I'd skip them in 2026. The ecosystem has actively moved on from Volta: no BF16 support, and TGI/TensorRT-LLM/Triton have all dropped sm_70. vLLM still works, but with caveats. llama.cpp is fine; that's about it.
The bigger problem for your setup is mixing Volta with Blackwell. Tensor parallelism really wants matched architectures, so you'd likely end up running the V100s as a separate inference node rather than pooling VRAM with the 5070 Ti. Add 250W per card idle-to-load and the electricity bill catches up to whatever you saved.
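Concretely, the "separate inference node" pattern is just two independent server processes pinned to different cards - a minimal sketch, assuming llama.cpp on both and that the cards either live in separate machines or (driver support permitting; see the Volta/Blackwell caveat further down) share a host. GPU indices and model names are hypothetical:

```
# Blackwell cards (indices 0-1 here) serve small modern models on one port...
CUDA_VISIBLE_DEVICES=0,1 llama-server -m small-model.gguf -ngl 99 --port 8080 &

# ...while the V100s (indices 2-3) serve the big quantized model separately.
CUDA_VISIBLE_DEVICES=2,3 llama-server -m big-model-q4.gguf -ngl 99 --port 8081 &
```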
If you want cheap VRAM for experimentation, used 3090s are still the sweet spot. Same 24GB but Ampere keeps you in the modern software stack.
starkruzr@reddit
3090s have really been dropping off in value, at least in the US. You're talking about $1100 on eBay.
Guinness@reddit
Yeah, they're not $750 anymore. Seems like the new norm is around $1200 or so.
starkruzr@reddit
I'm starting to conclude that the best "starter package" budget-wise today is a pair of 16GB 5060Tis.
Seren251@reddit
That was kinda what I did. 2x 5060Tis in my workstation. 2 more connected to TB4 docks on my laptop (with a native 3080 which roughly matches a 5060ti in performance).
SKX007J1@reddit (OP)
Thanks, that’s exactly the sort of sanity check I was looking for.
To clarify, I wasn’t assuming I could cleanly pool V100 VRAM with the 5070 Ti/5060 Ti into one big unified memory pool. My rough understanding was that mixed-GPU model splitting/offloading is possible in some setups, but likely awkward, inefficient, and very software-dependent. So your point about tensor parallelism wanting matched architectures makes sense.
The “separate inference node” point is probably the key thing I needed to hear. If the V100s would mostly end up being their own separate runtime rather than meaningfully extending the newer RTX cards, that changes the value equation quite a bit.
The software support issue is also fair. I knew Volta was old, but I hadn’t fully appreciated how much of the modern stack had moved on from sm_70/BF16-era assumptions. If llama.cpp is basically the safe path and everything else is caveats, that’s a pretty big limitation.
The only part I'm still unsure about is the 3090 value argument. I completely understand why it's the default recommendation: 24GB, Ampere, good CUDA support, strong performance, and still modern enough. But in the UK they still seem to be around £700-£800 used, so I'm trying to work out where the value line actually is, as "spend a bit more" kind of never ends.
At that price, I start wondering about other used 32GB-ish options that sometimes appear under £1,000: things like the Tesla V100 32GB, maybe an L40/L40S if found cheap enough, or some of the AMD workstation/datacentre cards. I'm not claiming any of those are better; I'm aware they all come with their own software-stack issues, compatibility problems, power/cooling headaches, or "works in theory but not in the tools I actually want" traps. I just don't know enough yet to argue confidently for or against them.
So I suppose my real question is: is the 3090 still the sweet spot because 24GB is “enough” and the software support is worth more than chasing 32GB+, or is it just the least bad compromise in the used market?
Not trying to argue for the V100s - I don't know enough about the topic to even know where the hills I'm meant to die on are! More trying to understand whether I'm being tempted by the headline 32GB number and ignoring the hidden costs.
ImpossibleHot@reddit
The thing that matters for inference is CUDA cores and memory speed.
Reddit_User_Original@reddit
In theory it's a good idea, but I think the power cost alone for everything is too much, making it, practically speaking, a bad idea.
SillyLilBear@reddit
Not worth it
m94301@reddit
Hey, just wanted to drop in and say the V100 is very usable for today's models despite the lack of FP8/FP4.
I have an NVLink board and two of the PCIe cards, water-cooled, and I can say that the ADT-Link Store on AliExpress is good. Other vendors sent me bent shit and wrong items. It's a jungle in Chinese V100-land as a US buyer.
And there's not much point running at max power; it just burns energy for little gain.
This is one of the 32GB PCIe SXM holders:
Qwen3.6 27B:
29 t/s at a 150W power limit
31.5 t/s at 200W
32.4 t/s at 250W
32.7 t/s at 300W (it only draws 240-260W max)
And the MoE Qwen3.6 36B-A3B: 79.44 t/s at 150W (it only draws 124W).
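A sweep like that is easy to reproduce with llama-bench, which ships with llama.cpp (the model path is a placeholder):

```
# Cap the power limit, then measure generation throughput at each step.
for PL in 150 200 250 300; do
    sudo nvidia-smi -i 0 -pl "$PL"                      # cap GPU 0 at $PL watts
    ./llama-bench -m qwen3.6-27b-q8_0.gguf -ngl 99 -p 0 -n 128
done
```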
rashaniquah@reddit
Work has V100s; they're horribly slow and don't support the key libraries. I was getting better performance on a single 3080.
ravetam@reddit
I bought one (the PCIe version) two months ago; it's not as power-hungry as advertised. I cap it to 100W (nvidia-smi -pl 100) to limit heat generation. I did have trouble running vLLM and ComfyUI, but llama.cpp works great - though there is a file that needs to be modified in order to build; check the llama.cpp build docs ("Fixing Compatibility Issues with Old CUDA and New glibc"). New drivers/CUDA don't work. Do I regret it? No.
V100 32GB + 3080 Ti 12GB: Qwen 27B Q8 + 256k KV
3080 10GB: other models
ttkciar@reddit
I'm pretty happy with my 32GB MI50 and MI60, which are comparable to V100. I'd say go for it.
metmelo@reddit
Same here with 4x MI50 32GB. The V100s are faster though, thanks to NVLink - you can connect up to 4 of them together on the same NVLink adapter.
soshulmedia@reddit
Has anyone ever found a source for these Radeon Instinct Infinity Fabric Bridge PCBs?
They seem to be unobtainium. I think if someone(TM) were to make a clone board, or more generally find a source for these links, a lot of lonely PCIe-bandwidth-starved MI50s in old mining rigs and similar would get quite the speed boost...
snacksneaksnake@reddit
I am trying to decide between an R9700 and an MI60 for Qwen 3.6 27B. What has your experience been like?
ttkciar@reddit
I have no experience with R9700, but MI60 with the llama.cpp Vulkan back-end is fast for inference, a bit slow for prompt processing.
I dorked around trying to get ROCm working under Slackware for about a month with little luck before giving up and trying Vulkan, and Vulkan just worked with minimal effort (just following the llama.cpp compilation instructions, which is like three short commands). I kicked myself for not trying it sooner. Vulkan is wonderful, and has even caught up with ROCm performance for the most part.
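For anyone curious, the "three short commands" are roughly the standard Vulkan build from llama.cpp's docs (assuming Vulkan drivers/SDK are already installed):

```
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_VULKAN=ON      # enable the Vulkan backend
cmake --build build --config Release
```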
With Gemma-4-31B-it I get between 40 and 47 tokens per second for token generation, and about 200 tokens per second for prompt processing.
Overall I'm quite happy with that. With Qwen3.6-27B it should be somewhat faster.
snacksneaksnake@reddit
Those are solid numbers, I think. I don't have a system yet; I want to build one by adding a GPU to my Ryzen 5600 Linux station. I wish there were a website dedicated to this kind of benchmark - they're mostly all for gaming.
thesuperbob@reddit
Same here, I'm quite happy with what a small pile of MI50s can do. Are they slow? Compared to an RTX 5090, sure. Compared to unified-memory machines? Not really. Now that subscription prices are spiking, I fear the GPU drought is only going to get worse, and being able to run MiniMax M2.7 locally calms my nerves a bit.
Savantskie1@reddit
I’m currently saving up for two more MI50 32GB cards so I can run bigger models or increase my context.
Maharrem@reddit
I'd skip the V100s unless you absolutely need 32GB on a single card and can live exclusively in llama.cpp. The lack of BF16 and FP8 support means you're frozen out of most modern inference engines: vLLM might limp along, but TensorRT-LLM and TGI both dropped Volta. Power isn't trivial either, especially if you enable NVLink, and 250W per card adds up fast.
A used 3090 with 24GB costs maybe a bit more but gives you full Ampere and plays nice with everything, plus you can pool two of them for 48GB without driver conflicts. If you really want cheap 32GB, a used MI60 with ROCm is the more honest bang-for-buck path, but I'd still pick the 3090 for daily driver sanity.
Away-Albatross2113@reddit
Definitely yes. Current-generation LLMs are memory-bandwidth constrained, not compute constrained. The V100 has 900GB/s, roughly a third of the latest-generation GPUs at 3.3TB/s, but comes at a fraction of the cost. It's a no-brainer actually. Go for it.
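The back-of-envelope math behind that: decoding streams essentially all active weights through the memory bus once per token, so bandwidth divided by model size gives a rough decode ceiling. For a dense ~27GB Q8 model on a 900GB/s V100:

```
# decode ceiling ≈ bandwidth (GB/s) / bytes read per token (GB)
echo "scale=1; 900 / 27" | bc    # ≈ 33 tok/s - close to the ~32 t/s reported above
```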
pile-of-V100s@reddit
Yep, 2-4 V100s with -sm tensor/-sm graph work very well for dense models. On a quad NVLink board limited to 100W, I see 1100 tok/s pp and 38 tok/s decode with Q8 Qwen3.6 27B. Bump the limit to 200W and I've seen >1500 pp and 46 tg.
Minimal speedup for MoE and in my experience stability can be an issue, so consider it cheap VRAM for that use case.
riconec@reddit
The 5090 has 1.8TB/s - which expensive cards are you comparing with the V100?
SKX007J1@reddit (OP)
Thanks!
Do you think it would be a smart move to pay a little extra now for SXM2 cards on PCIe adapters over straight PCIe cards, so that down the road, as prices drop further, I have the option of adding a dual/quad-card SXM2 NVLink adapter board to keep them useful for longer?
No-Comfortable-2284@reddit
Depends what you're trying to do. For single-user inference, sure. For multi-user deployment, nope - the lack of native FP8 or FP4 support will suck. Heck, they can't do BF16 either.
They can do INT8 though, so for single-user inference you can use Q8 models, I guess.
SKX007J1@reddit (OP)
Yep, just single-user inference for larger models that won't fit on the Blackwell cards, which will cover the smaller-model FP8/FP4 side of things... well, that's the plan. So not one pool of VRAM but two nodes: one 16GB node for smaller modern stuff, and one larger 64GB node for bigger but legacy jobs.
No-Comfortable-2284@reddit
Interesting setup. Just FYI, Volta and Blackwell cards cannot be used together on Linux - there are no drivers that support both. You would have to use Windows.
SKX007J1@reddit (OP)
Thank you for clarifying that!
rog-uk@reddit
The GV100 PCIe card has NVLink, in case that helps.
metmelo@reddit
You can connect 4 SXM2 V100s with NVLink using Chinese adapters.
rog-uk@reddit
You are 100% right. I was only pointing it out in case OP was looking to use non-converted PCIe cards in a standard(ish) desktop/workstation type setup.
segmond@reddit
Define cheap.
SKX007J1@reddit (OP)
Sub-£500 per 32GB.
MotokoAGI@reddit
I would if I could mix it with a Blackwell card. I don't own one, so look into the drivers to make sure one can support both.
a_beautiful_rhind@reddit
I wouldn't mix them with newer cards. Run them on their own because of drivers and software support.
AdamantiumStomach@reddit
Depends on what tasks you're planning to do. The raw compute power (TFLOPS) is of course worse, but memory-bound and bandwidth-bound tasks, as well as FP64 tasks, will almost always be handled better on a V100 than on any consumer GPU.
SKX007J1@reddit (OP)
To be totally honest, I’m very much a beginner here, so I don’t have a real FP64 “need” at the moment. It’s more of a curiosity/learning justification than me trying to solve an existing workload problem.
That said, I’m an avid maker and I’m interested in CNC, so things like finite element analysis, computational fluid dynamics, thermal simulation, fixture/part stress analysis, or machine-frame/vibration modelling all sound like genuinely fun areas to learn about.
I’m not pretending FP64 matters for LLM inference, and I know the V100s are old compared with modern RTX cards. My thinking is more: if they’re cheap enough, maybe they could give me a useful mix of 32GB VRAM for larger-model experimentation and an entry point into CUDA/HPC/FP64-style workloads.
But I’m very open to being told that’s just me inventing a use case to justify old datacentre hardware. 😂
AdamantiumStomach@reddit
No, you are totally correct. FP64 indeed does not matter for LLM inference (at least for now), but every single consumer GPU has extremely bad FP64 performance; you can visit the TechPowerUp website to compare - the V100 wins every single FP64 matchup against RTX cards. Meanwhile, LLM inference speed is mostly memory-bandwidth bound; again, you can see on TechPowerUp that only the xx90 cards from the NVIDIA consumer segment reach similar bandwidth (I'm going from memory, so I might not be entirely correct). You can also compare FP16 performance to see whether a card is worth it in other AI workloads, like diffusion-based image generation.
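To put rough numbers on the FP64 gap (spec-sheet figures from memory - double-check them on TechPowerUp before relying on them):

```
# V100 PCIe: ~7.0 TFLOPS FP64 (Volta runs FP64 at 1:2 of FP32)
# RTX 4090:  ~1.3 TFLOPS FP64 (GeForce FP64 is rate-limited to 1/64 of FP32)
echo "scale=1; 7.0 / 1.3" | bc   # ~5x - the old V100 still wins on raw FP64
```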
Every-Arachnid-1133@reddit
Some men are drowning while others are dying of thirst
SKX007J1@reddit (OP)
😄 In fairness, I got the 5000-series cards last year in an Amazon Prime sale, when they were still cheap and before the AI rabbit hole (the 5060 Ti was for media encoding in my NAS), and the 5070 Ti was a placeholder, since at the time we thought the 5080 Ti was going to be a thing!