I Put a Datacenter GPU in My Gaming PC for £200

Posted by tymscar@reddit | LocalLLaMA | View on Reddit | 122 comments

Hey there! I wrote a blog post about my experience running local models on a V100 from a newbie perspective, and it got loads of views outside of Reddit, so I thought I'd share it here too!

122 Comments

looktotheson@reddit

Incredibly well written blog. Thanks for sharing! Might give this a shot later this year as well
View on Reddit #87674797

cafedude@reddit

Makes me wonder what we'll be able to get for fairly cheap when the current generation of datacenter GPUs is retired.
View on Reddit #87554911

Ell2509@reddit

This topic feels like the elephant in the room. Companies with no profits are taking on enormous debt to build datacenters with gpus that have a 6 year life at 24/7 use. Now with the right cooling etc, maybe they push to 8, but even at 8 years, replacing all these gpus will almost be the same capex spend as it is to build the DCs right now. It looks to me like a 6 year long scam, because there is no way they can make a market out of what they are doing now. Divided down into per token cost, including all operating cost and sunk cost, over a 6 year period, will just be wildly too expensive for most people to use. I suspect that they are using this 6 year period and all the compute to train models which they otherwise wouldn't be able to train. After that, I bet AI is a whole lot less available to the public, and the rug will be pulled from under anyone who built up reliance on it. One massive scam. So a few people can end up with a wildly powerful AI for their own use.
View on Reddit #87561492

stonktraders@reddit

And 6-year-old GPUs won’t be considered obsolete when they are decommissioned. Just look at how slowly NVIDIA is rolling out its products these days, how little raw performance each generation gains, and how it has even cut back on memory bandwidth.
View on Reddit #87590617

etaoin314@reddit

Performance gain from Ampere to Ada = 2x. Ada to Blackwell = 2x+. That seems pretty good to me, so I am not sure what you are talking about there. Yes, the next gen is a bit slow to come out, and who knows how fast Vera will actually be, but nothing I have seen suggests that it is going to underperform.
View on Reddit #87652100

stonktraders@reddit

The 2x you are talking about comes from fp8 and fp4 support. The raw performance gain in each generation is just around 20-30% without frame gen.
View on Reddit #87671886

TimeSalvager@reddit

...DCs are a lot more than just the hardware. When they do a hardware refresh, all the physical infrastructure that houses the hardware is still fine. The capital outlay for labor, concrete pours, power distribution and everything that isn't the hardware is substantial, but still completely valid during a hardware refresh.
View on Reddit #87581258

Objective-Picture-72@reddit

Right, but the GPU hardware cost is like 60% of a data center's cost right now. So the 40% residual has a long useful life, but 60% of the cost is essentially depreciated in 6 years. Not sure the major AI labs get a return on that 60% over 6 years, considering how low the pricing of AI tokens is right now.
View on Reddit #87615529

CorpusculantCortex@reddit

Yes, but in 6 years the hardware will be more performant and cheaper. Also, for general-purpose models there are multiple companies making model-on-chip accelerators that cost a fraction as much to produce and process much faster. So once the growth phase levels out and there are stable, performant models, available hardware will adjust to reduce costs. Plus, 6 years is a lot of time for revolutionary thinking in an essentially brand-new sphere. Every 6 months or less a paper is published that makes the next generation of models smaller, faster, and more reliable. And the biggest thing is that tokens aren't that cheap. For agentic professional flows it is serious money, so the other side of it is that users will necessarily find more efficient ways to leverage the tools than connecting MCP to Claude and saying "pull out xyz data and make me a report". That is wildly token-inefficient, and long term a company can't justify spending $3 every time an internal report needs to be run. But that is how people are using it, because people who don't know how to design data and software systems are being given a tool that lets them think they can. There is no way to predict where it will go.
View on Reddit #87616704

Objective-Picture-72@reddit

I have no idea what you're trying to say tbh
View on Reddit #87669363

XeNo___@reddit

Datacenter hardware is also more than just the compute nodes. Think of all the networking stuff like switches, NICs, ... Just the switches (depending on the topology) can get *very* expensive very fast. All of that remains untouched when a HW refresh comes around.
View on Reddit #87590491

thehpcdude@reddit

They absolutely do refresh switches, especially on back-end and InfiniBand fabrics. Even the front-end Ethernet stuff gets replaced most of the time.
View on Reddit #87623129

Aphid_red@reddit

Given that NVIDIA sells its chips for $50K apiece or more, the rest of it is a rounding error.
View on Reddit #87619883

Emotional-Dust-1367@reddit

> Divided down into per token cost, including all operating cost and sunk cost, over a 6 year period, will just be wildly too expensive for most people to use.

Do you have the math for that cost-per-token divided down? I’d love to see it because I don’t think the GPU cost is the largest thing affecting it.
View on Reddit #87602737
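
For anyone who wants to run the numbers asked for above, here is a minimal sketch of the amortization. The function name and its inputs are assumptions for illustration, and nothing here is a measured figure; the thread only supplies a GPU price of roughly $50K, a 6-year life, and a 30-40% utilization range. Throughput and operating cost have to be guessed.

```python
def cost_per_million_tokens(gpu_price_usd, lifetime_years, utilization,
                            tokens_per_second, hourly_opex_usd=0.0):
    """Spread GPU capex evenly over its lifetime and divide by tokens served.

    utilization is the fraction of wall-clock time the GPU is serving paying
    traffic; hourly_opex_usd lumps power, cooling, and hosting into one
    assumed per-hour figure.
    """
    if lifetime_years <= 0 or tokens_per_second <= 0:
        raise ValueError("lifetime and throughput must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")

    lifetime_hours = lifetime_years * 365 * 24
    capex_per_hour = gpu_price_usd / lifetime_hours
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return (capex_per_hour + hourly_opex_usd) / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Every value below is a placeholder, not data.
    estimate = cost_per_million_tokens(
        gpu_price_usd=50_000,    # price quoted in the thread
        lifetime_years=6,        # lifetime claimed in the thread
        utilization=0.35,        # middle of the 30-40% range in the thread
        tokens_per_second=1000,  # hypothetical aggregate throughput per GPU
        hourly_opex_usd=0.50,    # hypothetical power and hosting cost
    )
    print(f"${estimate:.2f} per million tokens")
```

The result swings almost linearly with the assumed throughput and utilization, which is why the argument is hard to settle without real serving numbers.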

quantgorithm@reddit

even the efficiency of the token needs to be measured at some point. Not all tokens are equal.
View on Reddit #87625812

Ell2509@reddit

I have only done "back of a napkin" math, so not yet. However, I was actually going to do some research and run the numbers myself. I worked for one of the "big 4" in an earlier part of my life, and am really curious to see. Lots of what these big companies connected to AI are doing is really shady from an accounting point of view.
View on Reddit #87603449

Emotional-Dust-1367@reddit

It would make a fascinating YouTube video
View on Reddit #87603508

Puzzled-Formal-9207@reddit

This is actually an interesting insight. I was actually thinking that AI shouldn't be available to the public and should only be used by institutions like NASA etc for meaningful work rather than having AI slop everywhere. Now reading that the current situation is unsustainable due to the gpu lifespan would hopefully turn this into reality. The world is meant for humans. Water is meant for humans. Not machines! Thanks for sharing.
View on Reddit #87570746

I_Will_Eat_Your_Ears@reddit

>Now reading that the current situation is unsustainable due to the gpu lifespan

Don't believe everything you read. As many redditors have pointed out, he was completely incorrect. Capex gets depreciated, so the paper value of the gpus will be zero, but they'll still have market value. If they choose to sell the gpus, they can buy replacements and drop them into an existing facility. This already played out with crypto miners. It only stops if the demand for compute does.
View on Reddit #87620991
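
The gap between paper value and market value described above can be sketched as follows. Straight-line depreciation is an assumption here (the thread does not say which schedule anyone uses), and the function name is made up for illustration.

```python
def straight_line_book_value(purchase_price, useful_life_years, age_years):
    """Book value under straight-line depreciation, floored at zero.

    This is only the accounting value; what a used card fetches on the
    second-hand market is a separate number that this does not model.
    """
    if useful_life_years <= 0:
        raise ValueError("useful life must be positive")
    remaining_fraction = max(0.0, 1.0 - age_years / useful_life_years)
    return purchase_price * remaining_fraction


if __name__ == "__main__":
    # Hypothetical $60,000 card on a 6-year schedule.
    for age in (0, 3, 6, 8):
        print(age, straight_line_book_value(60_000, 6, age))
```

Once the age reaches the schedule length the book value stays at zero, even though the card can still be sold or kept in service.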

thawizard@reddit

/r/LostRedditor
View on Reddit #87590587

I_Will_Eat_Your_Ears@reddit

>replacing all these gpus will almost be the same capex spend as it is to build the DCs right now.

In general, technology gets better and cheaper as time goes on. When they refresh, they can drop the replacement GPUs into a facility with all the other services up and running (power, cooling, networking, a building, racks, etc). Basically, there's nothing to suggest the capex spend in 6 or 8 years will be anything close to the initial one.
View on Reddit #87619734

PinotGroucho@reddit

No it's not a scam in the sense that the goal is to defraud people of money. It's more a civilizational "All-in" bet that whoever has a 5 year head start in AI, while at the same time sucking all the oxygen out of the room for potential competitors and having the pension funds pay for it all, wins the game
View on Reddit #87604101

timfduffy@reddit

Frontier labs have strong margins on inference when taking depreciation into account. Heck Anthropic is about to have a profitable quarter for the first time, and profitability takes depreciation into account. IIRC the labs and hyperscalers are assuming something like 5-6 year depreciation schedules for their GPUs.
View on Reddit #87578560

NandaVegg@reddit

For Anthropic's profitability, it is "observed" by a bear analyst (though he is a permabear, he does have sharp eyes) that it took time-limited free compute offered by xAI into account, so it won't last long. However, I have no doubt that the API inferencing service itself is generally profitable. The same researcher says that OpenAI paid around $1.30/h for A100s in the past year, which is below market rate, and they would turn profitable at that rate assuming mid-to-high average compute utilization (i.e. at least 30-40% of capacity is always in use by customers, around the clock). The problem for both Anthropic and OpenAI is that both are overcommitted to future compute obligations (OpenAI especially).
View on Reddit #87601302

quantum_splicer@reddit

At end of life for the cards, what happens? Are they recycled or...?
View on Reddit #87562645

Think_Wing_1357@reddit

You can find a lot of liquidated DC gear on eBay today: chassis, mobo, CPU, etc. I'm sure the same will be true of GPUs eventually.
View on Reddit #87563117

sizebzebi@reddit

How is this a new issue? GPUs in data centers are not a new thing.
View on Reddit #87600320

128G@reddit

Post pandemic prices were insane. You could easily buy a dual 6 core Sandy Bridge server for $75.
View on Reddit #87567353

sp3kter@reddit

And $500/month in electricity
View on Reddit #87568141

Antique_Bag_4832@reddit

Yeah, but aren't there solar panels now? That cost won't matter.
View on Reddit #87595835

128G@reddit

Like any of the workstations shown on this subreddit are any better.
View on Reddit #87568530

AlexWIWA@reddit

We should rename the subreddit to be LocalPowerSubsidizers
View on Reddit #87585903

KontoOficjalneMR@reddit

GPUs were there as well haha. For a brief period you could get a P40 with 24GB of VRAM for $20 because literally no one wanted them; they were almost giving them away for free. AI of course completely changed that, but these times will come back. Datacenter GPUs are depreciating _badly_ (or goodly, looking from a hobbyist perspective).
View on Reddit #87595140

Ell2509@reddit

Mostly binned, in small operations. In the large ones, I would imagine they will depreciate the items off the balance sheets and then replace them, retiring old gpus into secondary markets. But as I say, I don't think the business model will work with that level of capex spend.
View on Reddit #87563089

quantum_splicer@reddit

I'm just thinking it's not sustainable to basically dispose of graphics cards in those quantities. I get that some would go to reuse, but I'm thinking about natural resource utilisation: if we aren't able to recycle and recover raw materials at the cards' terminal end of life, then we have a process where we are basically eating into finite resources. It's bad enough from an energy and water diversion perspective.
View on Reddit #87563606

Ell2509@reddit

Data center cards are different to consumer ones anyway. Some have no cooling and no PCIe connector. I looked at buying some off eBay and fitting them with cooling but never bothered. It will be interesting 6 to 8 years from now, that is for sure. A lot of gpus to decommission.
View on Reddit #87565761

sizebzebi@reddit

lollllll
View on Reddit #87600297

grawl_dorgiers@reddit

Pennies on the dollar!
View on Reddit #87616384

whakahere@reddit

The issue is, most of these gpus get machine crushed afterwards. If they had data on them they never leave the building without being crushed. I know people who build and maintain them. I asked for some GPUs they no longer use... Nope, crushed.
View on Reddit #87601988

Roid_Splitter@reddit

Won't have to wait for the retirement cycle.
View on Reddit #87565863

silenceimpaired@reddit

Not when… if. It’s closer to when now but don’t be surprised if it turns to if.
View on Reddit #87555573

ranjop@reddit

Do you mean that due to the AI boom the current generation DC HW will be run very long until it’s technically EOL?
View on Reddit #87560594

xISeeAllx@reddit

What kind of mobo/cpu/ram would be required for 2 v100 sxm2?
View on Reddit #87657757

tymscar@reddit (OP)

Anything really can do 2 of them.
View on Reddit #87657801

BannedGoNext@reddit

I hate the idea of the noise, I wonder if you could put a large fan and duct it to where it increases airflow without the data center hearing loss.
View on Reddit #87589575

Dante_Avalon@reddit

Use a 3090: 24GB of VRAM, it's silent, and it has a better performance/power ratio.
View on Reddit #87637820

Raunhofer@reddit

Rig the fan like OP did. It's not like the card uses a lot of power; it's just the server rack form factor that would require you to rev it up.
View on Reddit #87591057

BannedGoNext@reddit

Yea, I guess it's possible to jerry rig some sort of water cooling too for noise.
View on Reddit #87615173

Dante_Avalon@reddit

Erm, and what's so special about this? Or is using an AliExpress adapter and one of the tons of V100s flooding the market right now already a giga-brain move?
View on Reddit #87637186

XxBrando6xX@reddit

Great write-up, thank you for putting the time in to write this up for everyone. Question: you mentioned that these can be connected or used via NVLink even through the PCIe adapter. Doesn’t this dramatically crush the speed, though, since they’re interfacing at PCIe speed, which I imagine is slower than NVLink over the lil adapter connecting the cards directly to one another would be? I’m working off my limited hobbyist knowledge of how hardware works, so apologies if I’m off base.
View on Reddit #87562965

tymscar@reddit (OP)

I’m not sure I fully understand the question, but basically you can have a PCIe card that fits two of these on it. Then the shared memory between them is much, much faster and that directly makes token generation speed much quicker too. The speed of the PCIe interface then also doesn’t matter too much, because once you load the model weights into the card you’re not constantly sending anything back and forth over the bus.
View on Reddit #87575809
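
A rough way to see why the host link matters less than on-card bandwidth is sketched below. It assumes a dense model where generating each token reads every weight from VRAM once, which gives an optimistic ceiling rather than a prediction. The 900 GB/s figure is the V100 number quoted in this thread; the 15.75 GB/s figure is the theoretical PCIe 3.0 x16 rate and is an assumption about the slot, not something measured on the adapter. The model size and function names are hypothetical.

```python
def one_time_load_seconds(model_gb, host_link_gb_per_s):
    """Time to copy the weights to the card once, at the link's full rate."""
    return model_gb / host_link_gb_per_s


def decode_tokens_per_second_ceiling(model_gb, vram_gb_per_s):
    """Upper bound on tokens/s if every token streams all weights from VRAM."""
    return vram_gb_per_s / model_gb


if __name__ == "__main__":
    model_gb = 14.0        # hypothetical quantized model that fits in 16GB
    pcie3_x16 = 15.75      # GB/s, theoretical; the real adapter may be slower
    v100_hbm2 = 900.0      # GB/s, as quoted in the thread

    print(f"load once: ~{one_time_load_seconds(model_gb, pcie3_x16):.1f} s")
    print(f"decode ceiling: ~{decode_tokens_per_second_ceiling(model_gb, v100_hbm2):.0f} tok/s")
```

The load cost is paid once per model load, while the decode ceiling applies to every token, so a slow slot mostly shows up as a longer startup.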

quantgorithm@reddit

what would the search term be for the ones that hold 2 cards on the single adapter card?
View on Reddit #87626693

tymscar@reddit (OP)

Here’s an example: https://ebay.us/m/vQUiAE
View on Reddit #87630448

quantgorithm@reddit

TY
View on Reddit #87633395

Bulky-Priority6824@reddit

What's the difference between the one shown in your pic and the ones on eBay that are on a PCIe card? They have both, only $600 for each style for 32GB.
View on Reddit #87552119

tymscar@reddit (OP)

There is some performance difference, in favour of the SXM2, something like 10%. It runs at higher max power. Another benefit is that you can buy a board with two SXM slots on it, and then the cards talk with each other through super fast NVLink. Look into it!
View on Reddit #87553407

Bulky-Priority6824@reddit

Do these snap on, or do you have to solder?
View on Reddit #87556062

tymscar@reddit (OP)

Snap. No soldering involved whatsoever.
View on Reddit #87556133

quantgorithm@reddit

I'm reading your blog like a future upgrade path bible!
View on Reddit #87626016

Bulky-Priority6824@reddit

Very neat I'm so surprised this isn't used more widely. Great Value of GPU rigs
View on Reddit #87565614

lor_louis@reddit

The kinds of racks required to run those are extremely noisy (since the whole system is assumed to be running at 100% at all times to make the economics work), so not the kind of thing you want running in your house.
View on Reddit #87585832

farkinga@reddit

Excellent work! I think this implicitly asks: what's the difference between nvidia hardware generations? 16gb Ada and 16gb Volta add up to 32gb; but is that any better or worse than 32gb of Blackwell (for example)? In practical terms, is there any architectural advantage to upgrading, apart from how the drivers eventually drop support for older architectures? It's not quite apples-to-apples but as another data point, I've got Qwen3.6 27b NVFP4 MTP 128k context on 2x 5060 Ti (32gb total) and get 1000 t/s pp and 60 t/s gen. That's consumer 50-series Blackwell; and I AM jumping through hoops to run nvfp4 since that will eventually become better-optimized. Dollar-for-dollar, the V100 SXM 16gb is probably cheaper than a 5060 Ti 16gb; but that's debatable. You've got to pay shipping twice (v100, sxm-pcie board) and the price difference narrows to less than $100 USD. If your case/installation needs a 3d printed cooling solution and attached blower/fan (since the v100 is a data center card), that's a bit extra also. I doubt the v100 is cheaper, in this scenario. I know the point of the article isn't to claim this is the cheapest way to get 16gb VRAM. And I do appreciate how the v100 bandwidth from 2017 compares to current-gen Apple M5, etc. The SXM v100 is an interesting value that some people are going to benefit from. But there is a real-world performance difference between older architectures versus current; and 16gb from one is not equal to 16gb of another. So, it's just a trade-off and I think a decent amount of the LocalLLaMA community can probably appreciate the nuance.
View on Reddit #87621410

sage-longhorn@reddit

Cool post, but AI's writing style is so tedious at this point.

>The compute is still real. The VRAM is still real. And the memory bandwidth is where it gets genuinely surprising.
View on Reddit #87581963

tymscar@reddit (OP)

It’s not ai. Look at my other reply.
View on Reddit #87582020

sage-longhorn@reddit

I'm not seeing that in your comment history but I'll say that if this isn't AI then that's worse somehow
View on Reddit #87617140

128G@reddit

a 4080 only having 16GB of VRAM is insane!
View on Reddit #87553137

T-Loy@reddit

Given the 256bit bus, not that much. It's either 16GB or 32GB, and you know workstation cards want to be the double-sided option. Now whether an 80-class should have ex-70-class die size and bus width is the other question. For once it's not Nvidia's fault that memory manufacturers haven't figured out working 3/4GB GDDR6 modules. Now the continued absence of the 3GB GDDR7 modules and RTX 5000 Super series. (Funnily enough the 5050 9GB of all cards got the 3GB module treatment.)
View on Reddit #87610402

ThisWillPass@reddit

Hopefully the ai overlords will smite all those that held down gpu vram for profits.
View on Reddit #87589911

tymscar@reddit (OP)

It's idiotic if you ask me. Especially considering when the card came out and how cheap VRAM was back then comparatively. As a card for half the price of the 4090, it serves me way better than half of a 4090, especially in games, so I don't regret it, but it's clear they've done it this way just to differentiate between the two more.
View on Reddit #87556692

128G@reddit

You know what also has 16GB of VRAM? a used RX 7600 XT.
View on Reddit #87557464

tymscar@reddit (OP)

Yeah, but no CUDA. And much, much slower VRAM: 288 GB/s compared to 900 GB/s on the V100.
View on Reddit #87565495

quantum_splicer@reddit

I know this is gonna sound stupid, but didn't someone rewrite CUDA to run on AMD cards, or maybe I had a weird dream?
View on Reddit #87568626

samas69420@reddit

check zluda
View on Reddit #87583019

128G@reddit

The 7600 XT has GDDR6 while the 4080 has GDDR6X. It still seems unacceptable for Nvidia to be selling new cards with anything less than 16GB of RAM. 24GB should be the minimum for a midrange card.
View on Reddit #87568337

Background_One_6482@reddit

3060 performance
View on Reddit #87575113

raycol08@reddit

you are right! [https://wccftech.com/nvidia-v100-an-8-year-old-gpu-now-sells-for-100-us-crushes-modern-consumer-cards-in-ai-llms/](https://wccftech.com/nvidia-v100-an-8-year-old-gpu-now-sells-for-100-us-crushes-modern-consumer-cards-in-ai-llms/) https://preview.redd.it/si7f0uocs15h1.jpeg?width=728&format=pjpg&auto=webp&s=e1a9c8ffdae5046f01a8c956b35dfea1d42b2a33
View on Reddit #87607261

eatsleepsafelives@reddit

Well, somebody must have found your blog (good read) - the V100s I could find are at $600 now ;)
View on Reddit #87589442

ChristianRauchenwald@reddit

Checked eBay here in Germany:

|Model|Cost|Link|
|:-|:-|:-|
|Tesla V100 **16GB** GPU SXM2 incl. SXM2-PCIe Adapter|€212.98|[https://www.ebay.de/itm/406939336945](https://www.ebay.de/itm/406939336945)|
|Tesla V100 **32GB** GPU SXM2 incl. SXM2-PCIe Adapter|€694.98|[https://www.ebay.de/itm/406939343384](https://www.ebay.de/itm/406939343384)|

Doesn't seem like a bad deal.
View on Reddit #87605461
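
To compare the two listings above on equal footing, a quick price-per-GB calculation helps. The prices and capacities are the ones quoted in the comment; the variable and function names are just for illustration, and shipping, cooling, and power are not included.

```python
# (label, VRAM in GB, price in EUR) as quoted in the comment above
LISTINGS = [
    ("Tesla V100 16GB SXM2 incl. adapter", 16, 212.98),
    ("Tesla V100 32GB SXM2 incl. adapter", 32, 694.98),
]


def euro_per_gb(vram_gb, price_eur):
    """Listing price divided by VRAM capacity."""
    return price_eur / vram_gb


if __name__ == "__main__":
    for label, vram_gb, price_eur in LISTINGS:
        print(f"{label}: {euro_per_gb(vram_gb, price_eur):.2f} EUR/GB")
```

On these two listings the 16GB card is the cheaper way to buy VRAM per GB, at the cost of needing two cards and two slots to reach 32GB.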

veeravan_451@reddit

Really helpful post. How about the 32GB V100? That price is totally within what I can handle. I’m more hoping that in the near future all these AI companies go bankrupt so I can pick up an H200 for dirt cheap and run a local server.
View on Reddit #87599815

tymscar@reddit (OP)

Yeah, those are obviously better, but it's hard to find good deals!
View on Reddit #87600181

veeravan_451@reddit

What price range would be good? The 32GB ones in my area seem to be around £350–380.
View on Reddit #87600399

tymscar@reddit (OP)

Anything under 400 is very good
View on Reddit #87602668

Afganitia@reddit

Why are you using the 55X drivers? Shouldn't Volta be supported through the 58X branch?
View on Reddit #87592786

tymscar@reddit (OP)

Tried them all one by one (magic of nix lets me do that in minutes) and none of them saw both cards.
View on Reddit #87600158
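
When cycling through driver versions like this, a small script can confirm how many cards the driver actually exposes. The sketch below shells out to `nvidia-smi` with its `--query-gpu` and `--format=csv,noheader` options and parses the result. The helper names are made up, and the sample text in the usage note is illustrative rather than captured from OP's machine.

```python
import shutil
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=index,name,memory.total",
         "--format=csv,noheader"]


def parse_gpu_list(text):
    """Parse 'index, name, memory' lines into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        index, rest = line.split(",", 1)
        name, memory = rest.rsplit(",", 1)
        gpus.append({"index": int(index), "name": name.strip(),
                     "memory": memory.strip()})
    return gpus


def visible_gpus():
    """Return the GPUs the installed driver reports, or [] if it cannot."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True)
    if result.returncode != 0:
        return []
    return parse_gpu_list(result.stdout)


if __name__ == "__main__":
    gpus = visible_gpus()
    print(f"driver reports {len(gpus)} GPU(s)")
    for gpu in gpus:
        print(gpu)
```

Feeding `parse_gpu_list` a string such as `"0, Tesla V100-SXM2-16GB, 16384 MiB"` (a made-up sample) is an easy way to try it on a machine without the driver installed.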

bradrlaw@reddit

If you have the PCIe lanes you can easily do 4 of these to get 64GB at, I think, the cheapest possible price point. I am taking a slightly different route and using the 32GB PCIe version (about $750 each). Note you will need to come up with a custom cooling solution, which adds to the cost along with power supply costs. People do sell 3D printed shrouds / fan holders, but it will be highly dependent on your case. My setup will have two 32GB V100s, for 64GB, for main inference tasks and an existing 16GB card for agent orchestration. I try to run models at the best possible quantization because benchmarks don't always capture how soon they start to degrade.
View on Reddit #87567177
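
Picking the best quantization that still fits comes down to simple arithmetic on the weights. The sketch below assumes weight memory is parameter count times bits per weight, with a flat headroom allowance for KV cache and runtime overhead. The 4 GB headroom is a placeholder, not a measured value, and the function names are hypothetical.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits(params_billions, bits_per_weight, total_vram_gb, headroom_gb=4.0):
    """True if weights plus an assumed headroom fit in the given VRAM."""
    needed = weight_memory_gb(params_billions, bits_per_weight) + headroom_gb
    return needed <= total_vram_gb


if __name__ == "__main__":
    # A 27B model, the size mentioned elsewhere in the thread, on 2x 16GB.
    for bits in (4, 8, 16):
        size = weight_memory_gb(27, bits)
        print(f"{bits}-bit: {size:.1f} GB weights, fits in 32GB: {fits(27, bits, 32)}")
```

Real quantization formats carry some per-block metadata and long contexts need more than a flat allowance, so treat the answer as a first filter only.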

tymscar@reddit (OP)

Best of the best is actually four PCIe cards with two of these on each card, NVLinked, all 32GB. You can get them cheaper than the PCIe ones, and it will give you more or less 256GB VRAM for more or less four grand.
View on Reddit #87575914

bradrlaw@reddit

Which motherboard / cpu combo would you use that would have enough pcie lanes for that?
View on Reddit #87590294

tymscar@reddit (OP)

A Threadripper with the WS WRX90E-SAGE SE
View on Reddit #87600032

Raunhofer@reddit

It's interesting how compatible the card is. Funny even, that you can just plug in some adapter to make it work. Makes nvidia's $200000000 whatever server cabinets feel a lot less magical.
View on Reddit #87591199

libregrape@reddit

What a pleasure to read your blog! Finally not a bs AI slop, but an actually super interesting and insightful read..
View on Reddit #87560726

nullbyte420@reddit

ehh it's definitely AI slopped up.

>The compute is still real. The VRAM is still real. And the memory bandwidth is where it gets genuinely surprising.

>The fan on the adapter is not subtle. It is not quiet. It is not something you want in a room you also sleep in.

>82 decibels. That is somewhere between a garbage disposal and a lawnmower, well past “loud PC” and into “should I be wearing earplugs in my own house” territory.

>And the worst part: you cannot control it. I tried nvidia-smi, I tried scanning for it on Linux, I even tried Afterburner on Windows (more on that later, the whole setup barely works on Windows). Nothing. The fan on this adapter is not designed to be controlled. It is designed to run at 100%, forever, inside a server rack where nobody has to hear it.
View on Reddit #87568245

tymscar@reddit (OP)

Watching this from the sidelines must be fascinating because you don’t know what’s right, so you try to guess. The only person that knows for certain if this is written with an AI is me, and I know for a fact it is not. At all. I think you just look too deep into it, and it hurts those that just learned over the years to talk like that in blog posts that are meant to sound exciting. It’s similar to the delve thing that was super common a couple of years ago. Most people who had that in their text were ai slopping, but there were groups of people, from South Africa if I remember correctly, that just spoke like that. And yeah, you guessed it, those were the people that were paid to do the RLHF on the models back then. I have changed my style over the years specifically to avoid having people comment like this that my writing is ai slop. For example, I used to love lists and emojis. I think it’s easier to follow, especially for those with bad eyesight like myself. I stopped that. Then the whole thing with emdashes. They are amazing. I used to love those! I had to stop because AI started using those. People started telling me now that using an Oxford comma is an ai slop smell. Well, you know what, you can’t take that out of my dead hands. And I won’t even try anymore to change myself because of what people think AI is. If there will be some sort of a way to sign your content in the future or something to prove it’s human-made, I will, but I won’t go out of my way anymore for people that just like to pop your balloon after spending tens of hours on a project because they think using comparisons is ai slop. Sorry for the whole rant, I know that you probably are just fed up with the slop, and I get that. I am too. But maybe if the content of the blog is not clearly slop, then you can assume that it’s not.
View on Reddit #87575675

standish_@reddit

> People started telling me now that using an Oxford comma is an ai slop smell.

People who use the Oxford comma aren't invited to my party with the strippers, Sam and Dario.
View on Reddit #87590335

kylemd@reddit

Don't worry OP, it read human generated to me. Good read for somebody who is on the fence looking at V100s. Have you tried the vLLM fork yet?
View on Reddit #87576801

tymscar@reddit (OP)

Thank you! Not yet, because I couldn’t get MTP working on it.
View on Reddit #87577363

misterflyer@reddit

It *isn't just* slopped up. *It's* slopped down. *It's* slopped through. The AI *slop's kiss* 🎯
View on Reddit #87573979

nullbyte420@reddit

And the best part? *You* did this. You've awakened my quantum-riemannian core. This is a *breakthrough* not only in research, not only in science, but in *knowledge* itself. 
View on Reddit #87578345

misterflyer@reddit

oh no you didn't 🤣 well done 😉
View on Reddit #87581127

Ynead@reddit

This is so obviously written by Claude lmao
View on Reddit #87580026

eleqtriq@reddit

Only another bot would think it’s not slopped up, huh, bot
View on Reddit #87586947

veeravan_451@reddit

Thanks so much op. I'd basically given up on local deployment. I was using a Mac mini before, but the token generation speed was way too slow, so I switched to GMI Cloud for cloud deployment. Your guide gave me hope again. Prices here are roughly £120 for the 16GB version and around £400 for the 32GB one. Are there any downsides to running two 16GB cards vs one 32GB card? And if there are other GPU recommendations in this price range, that would also be great. I haven't bought a GPU since the mining boom, and I haven't paid attention to GPU prices for a long time either. The only piece of electronics I've bought in the last few years is the Mac mini.
View on Reddit #87588754

311voltures@reddit

Awesome post
View on Reddit #87587010

PythonFuMaster@reddit

I've got a similar configuration, but using the actual PCIe version of the 16GB V100. It's passively cooled so you need a server or a custom fan assembly, but I've got 4 giant GPU servers that can hold three of these things each (Supermicro Fat Twin, it's an older X9 system though). I'm also using NixOS, with driver legacy_580 and CUDA 13 I believe (I'm on NixOS unstable, but 26.05 was just released so stable should have the needed driver now). Also using llama.cpp (with some patches for improved RPC performance, I have those 4 machines networked over Infiniband), it works well and is my second fastest card, just behind the 3090. In total I've got the V100, the 3090, a P40 and Quadro M6000 24GB, an RX 6700xt, two Intel Arc A770s, an instinct MI60 32GB, and soon a water cooled Titan V. I used to run minimax m2.7 at around 20-30 tokens per second, but I've gone down to qwen 27B for now, it's smart enough for most of what I need and with MTP is much faster (minimax should be going faster but my network has some bottlenecks I need to fix)
View on Reddit #87581564

BitterNocturne@reddit

Sounds fan
View on Reddit #87578162

DingyAtoll@reddit

Where do I get the adapter for £50? All the ones I see online are £150
View on Reddit #87570396

tymscar@reddit (OP)

I got it on eBay. Try hunting for it. Sadly, all the prices went up after I posted this blog post on Saturday, and it got onto Hacker News. I was checking the prices every day.
View on Reddit #87575979

JSVD2@reddit

wow super inspiring. thank you. good read.
View on Reddit #87573859

a_beautiful_rhind@reddit

About 2 years ago I was salivating over these things. At least the 32g variety. I think the P100s are a slightly better deal than the 16gb v100. Then again, nobody wrote P100 flash attention so you're trapped in llama.cpp
View on Reddit #87564478

PDXSonic@reddit

That's the biggest downfall of the P100. I had a 4xP100 rig at one point (that I got for the price of 1 working P100) and it was great right until vLLM and ExLlama stopped supporting it. Once they left it was just llama.cpp, which couldn't utilize it via tensor parallel and left most of the performance on the table. Although I ended up keeping one and throwing it in a server, since MoE models make it worth having at least one card going.
View on Reddit #87567271

a_beautiful_rhind@reddit

I wonder how ik_llama would have done with it. Plus there's always pascal forks of vllm. Open driver doesn't support pascal so I can't stick my ewaste back in the server and it just sits.
View on Reddit #87572048

Embarrassed_Adagio28@reddit

Very good writeup! I have dual tesla v100 16gb gpus with 32gb of system ram and a ryzen 5700x in a dedicated lmstudio server and your claims line up exactly with my experience. I have been very pleased with my purchase. I use it with lmstudio but am switching to vllm soon for better multi agent support. 
View on Reddit #87566042

DingyAtoll@reddit

Where do I get the adapter for £50? They are all £150 where I look
View on Reddit #87570610

sizelrd@reddit

Well done
View on Reddit #87565389

Ok_Selection_7577@reddit

Really nice write-up mate, this sort of content (and "I changed out the BIOS and managed to get an LLM running in a tin of Bisto from the 1980s") is what I come here for 😄
View on Reddit #87565126

tymscar@reddit (OP)

Thank you! I do have an RSS 😄
View on Reddit #87565387

Truantee@reddit

I think you can buy one on AliExpress with all of that stuff bundled.
View on Reddit #87564250

ranjop@reddit

What an amazing write-up 👌🏻
View on Reddit #87560441

je11eebean@reddit

I read your blog. This is amazing work and you've documented it all too! Thank you for sharing this.
View on Reddit #87551164

rog-uk@reddit

What CPU/mobo are you using, please? I was looking at maybe a pair of these for an older workstation I just got, and read there could be complications with memory access.
View on Reddit #87548877

tymscar@reddit (OP)

ROG Maximus Z790 Hero with an Intel 13900KF. Thank you!
View on Reddit #87549135

mhphilip@reddit

Great read! You are way more than a newbie to me. This is not the road I’m going but that’s also good to learn!
View on Reddit #87546316

tsukuyomi911@reddit

Nice read.
View on Reddit #87546260