Tell me about the RTX 8000 - 48GB is cheap right now

[-]

DeltaSqueezer@reddit

The problem is that it is an older less supported architecture and you can get 2x3090 which is more modern, faster and costs half as much for the same VRAM. For the same budget, you can get 4x3090 for 96GB VRAM.

[-]

Thrumpwart@reddit (OP)

And fit them in what case?

[-]

Ok_Warning2146@reddit

For single case, the most you can do is 3x3090 with an E-ATX mobo with 7 PCIe5.0x16 slots.

[-]

No_Security1366@reddit

even with risers? not trolling i genuinely want to know and i don’t feel like asking gpt. i could have gptd it. im gonna ask gpt.

[-]

DeltaSqueezer@reddit

There are many options. You can use a 4U case. Open chassis, or if you have specific requirements, it is not hard to build a custom enclosure.

[-]

Thrumpwart@reddit (OP)

Sounds janky.

[-]

OPsyduck@reddit

Sounds like you don't like simple solutions to a simple problem.

[-]

Thrumpwart@reddit (OP)

I just have different needs than an open air rig. Not everyone wants jank.

[-]

OPsyduck@reddit

We just have a different opinion on what is considered jank. There are plenty of pictures around and some of them look really nice.

[-]

FullstackSensei@reddit

Plenty of cases if you're willing to watercool them. 3090 blocks are getting cheap, especially on the 2nd hand market. I have three watercooled in a regular LianLi O11D (not XL) with room for a 4th.

[-]

Thrumpwart@reddit (OP)

I considered watercooling, but that's just a hassle I don't want to have to deal with. The idea of purposely pumping liquid around thousands of dollars worth of electronics does not give me joy.

[-]

NEEDMOREVRAM@reddit

Why a case? I'm using an open air mining rig. I'm using a ROMED8-2T motherboard and can fit two fat 3090s and one skinny 3090 on the PCIe slots. If I need to hook up anymore GPUs, I just use the riser I bought off Amazon.

[-]

Thrumpwart@reddit (OP)

Because I don't want open air rigs around my house. I'm an adult.

[-]

Mass2018@reddit

Yeah. Damn kids and their open air AI servers!

[-]

Thrumpwart@reddit (OP)

I kid. I would love to run 8x 7900XTX but my wife would leave me. I'm not really an adult, I just need to act like one around her.

[-]

NEEDMOREVRAM@reddit

I'm an adult.

Hey, don't convince me. Convince yourself. I'm probably MULTIPLE decades older than you and I have an open air mining rig in my home office.

Why?

Because a PC case looks like a literal video game system. Something that a kid would use.

Open air mining rig looks scientific and shit. Like I'm some sort of mad scientist and this is my laboratory where I do science and shit.

Bitches love open-air mining rigs. It literally gets their panties wet because it looks like I got money ("Ooooh, look at all this complicated-looking computer stuff---I wonder what else is big besides his bank account?") and it makes me seem like I'm this super-intelligent genius. Of which I'm really not...I'm more of a "pretty smart genius."

[-]

Thrumpwart@reddit (OP)

My wife wouldn't care for it and neither would I. I'm not mining ethereum anymore.

[-]

NEEDMOREVRAM@reddit

Different strokes...

[-]

PurpleUpbeat2820@reddit

For the same budget, you can get 4x3090 for 96GB VRAM.

Or an M3 Ultra Mac with 500GB VRAM!

[-]

a_beautiful_rhind@reddit

I have a 2080ti 22gb.. which is basically the same card sans the extra memory.

The issues I run into are lack of flash attention and many custom kernel projects dropping support for Turing. I am forced to try to make them work by editing CUDA code with only AI help. Had some success but nothing great. A bit like trying to run AMD.

Speeds are passable but they aren't Ampere speeds. Surprisingly, the card is only marginally faster than a P100 for LLMs due to memory bandwidth. On image/video it does a bit better.

Of course there is good llama.cpp support (with its own flash attention) and you can do pipeline parallel with xformers in exllama... but no tensor parallel or the new paged server.

You for sure can cram 4 of them into a box, but you'll be fighting software support. They have nvlink as well. Imo, the price still isn't right to put up with all that but it's getting there.

[-]

inevitabledeath3@reddit

You could try ik_llama.cpp they have -sm graph which is essentially tensor parallelism and they also support NVLink and point to point connections.

[-]

a_beautiful_rhind@reddit

Unfortunately it doesn't have ReBar without editing the bios with fixed addresses so p2p for turing is a bit hard.

[-]

inevitabledeath3@reddit

Doesn't the 2080ti have NVLink or SLI?

[-]

a_beautiful_rhind@reddit

Yea, you can buy an nvlink for it. P2P is nicer because it can peer between multiple cards.

[-]

Thrumpwart@reddit (OP)

Thank you. I'm running AMD now without issue (ROCm is coming along nicely).

I just like the idea of running larger models faster than on my Mac. The Mac will run almost anything but it's kinda slow.

These seem like a good price. I don't need blistering speed, just enough capacity to run what I want at usable speeds.

I would be running llama.cpp and/or exllama. Don't need tensor parallel, just something fast enough.

[-]

inevitabledeath3@reddit

You could try ik_llama.cpp they have -sm graph which is essentially tensor parallelism and they also support NVLink and point to point connections.

[-]

michaelsoft__binbows@reddit

I feel like amd is getting more interesting for sure. I wonder if there is a similar go-to item (to the 3090) for red team, aside from 7900xtx

[-]

Thrumpwart@reddit (OP)

The MI50/60 is quite popular in some circles. Older tech but very cheap, high VRAM, and they work well together.

[-]

michaelsoft__binbows@reddit

Nice. thank you for reminding me about Mi50/60. I see tugm4470 selling 32GB MI50 at $235 or so a pop right now which is significantly better than last i checked a few months ago.

Since i've not yet really started to put my 3090s to task yet i'm holding off for now but i do have x99 and x399 rigs to slap these into. 32gb is particularly attractive on these, few other options will come close and 48gb turing quadros are not coming down in price any time soon. the question really will be if 3090s will get cheaper again any time soon (also doubt) to fill those rigs up with instead.

[-]

NighthawkT42@reddit

How does speed on a Mac with 48GB compare to something like a PC with 32GB RAM and 16GB VRAM? I was thinking it would be much faster the moment the PC starts having to use RAM but now not so sure.

[-]

PurpleUpbeat2820@reddit

How does speed on a Mac with 48GB compare to something like a PC with 32GB RAM and 16GB VRAM? I was thinking it would be much faster the moment the PC starts having to use RAM but now not so sure.

Exactly right.

I have a PC with 128GB RAM and 12GB VRAM (RTX 3060) and an M4 Max Macbook Pro with 128GB RAM (effectively 96GB VRAM). For small models like llama3.2:3b I get 150tps vs 100tps. The second you run out of VRAM on the PC (including context etc.) that tanks to 1tps vs 40tps.

For example, going from the 10GB qwen2.5:14b-instruct-q5_K_S to the 12GB qwen2.5:14b-instruct-q6_K on the PC performance goes from 18.3tps to 9.7tps.

In practice this means I don't use my PC for AI even though I bought it specifically for AI because 12GB VRAM is too little to run anything of interest. What I actually run is mostly qwen2.5-coder:32b-q4_K_M, llama3.3:70b and sometimes Qwen2.5-14B-Instruct-1M. Interestingly even the latter doesn't run well in 12GB VRAM because it runs out of VRAM for the context length. Furthermore, whereas all the software on Apple Silicon dies gracefully the PC software stiffs everything including the Linux OS and requires a hard reset.
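The point about context eating VRAM can be put into rough numbers. The sketch below uses the standard KV-cache size formula (keys plus values, per layer, per KV head); the model dimensions passed in (48 layers, 8 KV heads, head size 128) are illustrative assumptions for a 14B-class model, not figures from this thread, and runtime overhead is ignored.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Size of an unquantized KV cache; the leading 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value


def fits(weights_gb, cache_bytes, vram_gb):
    """True if weights plus KV cache fit in VRAM (ignores runtime overhead)."""
    return weights_gb + cache_bytes / 1e9 <= vram_gb


# Hypothetical 14B-class dimensions: 48 layers, 8 KV heads, head_dim 128.
short_ctx = kv_cache_bytes(48, 8, 128, 4096)
long_ctx = kv_cache_bytes(48, 8, 128, 32768)

print(f"4k context:  {short_ctx / 1e9:.2f} GB, fits in 12 GB: {fits(10, short_ctx, 12)}")
print(f"32k context: {long_ctx / 1e9:.2f} GB, fits in 12 GB: {fits(10, long_ctx, 12)}")
```

With a 10GB quant on a 12GB card, a short context fits under these assumptions while a 32k context does not, which is the same cliff described above.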

[-]

Thrumpwart@reddit (OP)

Definitely faster than RAM. Apple Silicon is much better than processing out of system RAM, and Apple MLX is improving all the time.

[-]

a_beautiful_rhind@reddit

Then it might work out for you. It's gonna be faster than the Mac. You can always buy 2 of them, try them out and return them.

Also no BF16 support so if you end up running something in transformers it will cast it to FP16.
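As a rough illustration of the BF16 point: native bfloat16 arrived with compute capability 8.0 (Ampere), while Turing cards such as the RTX 8000 and 2080 Ti report 7.5. The helper name `pick_dtype` is hypothetical, and this is only a sketch of the fallback logic, not what transformers does internally.

```python
def pick_dtype(compute_capability: tuple[int, int]) -> str:
    """Return 'bfloat16' on compute capability 8.0 or newer, else 'float16'."""
    major, _minor = compute_capability
    return "bfloat16" if major >= 8 else "float16"


# Turing (RTX 8000, 2080 Ti) is 7.5; the 3090 and A6000 are 8.6.
print(pick_dtype((7, 5)))
print(pick_dtype((8, 6)))
```

With PyTorch installed, `torch.cuda.get_device_capability(0)` returns a tuple in this form, so it can be passed straight in.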

[-]

Thrumpwart@reddit (OP)

Ah shit, good to know. I like my Q8 ggufs.

[-]

a_beautiful_rhind@reddit

Q8 is going to use llama.cpp stuff and probably int8 tensor cores. I also tried aphrodite and it didn't like 3x3090 + 2080 together but maybe 2 of the same card would work faster.

[-]

em1905@reddit

insightful comment on the CUDA support, thanks!

[-]

bluelobsterai@reddit

My little rig. I’d skip the nvlink unless you are training. No paged attention is kind of a bummer. I’ve upgraded to the a6000 for that reason. For many workloads like long context asr they rule.

[-]

Abrases@reddit

Just got myself two Quadro RTX 8000s and I can't find a 3-slot nvlink bridge to buy (gaming motherboard...).

You are saying that nvlink didn't help much with inference but did with training? I'm very interested in your experience of with and without nvlink for inference specifically.
Any details would be much appreciated :). Thanks !

[-]

bluelobsterai@reddit

SLI was like twice as fast for shared mem operations

[-]

Abrases@reddit

Thanks, damn now I really need a 3 slot nvlink :D

[-]

Thrumpwart@reddit (OP)

How are the A6000's treating you? Worthwhile upgrade?

[-]

bluelobsterai@reddit

Only if the software you’re running requires new features.

[-]

Thrumpwart@reddit (OP)

What is the speed difference like?

[-]

bluelobsterai@reddit

I’d rent on vast, tensordock or runpod and test your workload. I’d rather you “know”

[-]

Thrumpwart@reddit (OP)

Good idea, thanks!

[-]

Imaginary_Bench_7294@reddit

Sorry to ignore your edit, but I wanted to chip in.

Depending on the workload, a 8k will show nearly the same performance as a 3090. The FP16 flops are within 10% of each other. However, the FP32 performance is less than half that of a 3090, and the memory bandwidth is about 28% less.

The 8k should be able to run a 70B 4 bit model somewhere around 8 tokens per second, or about ⅔ the speed of dual 3090's (about 12 tokens per second).

You probably won't be able to take advantage of some newer things like flash attention, but from the sounds of it, that may not be a big problem.

If you plan on spending this kind of money though, I would suggest looking at buying a couple of A6000 GPUs instead. Sure, you might only get 96GB of Vram instead of 192 for a similar price point, but I also don't see any comments stating why you would need 192GB.

The A6000 has more than double the FP64 and FP32 performance, and about a 20% gain in FP16. Being a newer generation, you won't have to worry about it being depreciated as quickly either. The older the GPU architecture, the sooner it will be unable to run the latest code, which for me is always something I consider when looking at hardware.

Here's a comparison of the 8k vs a 3090: https://bizon-tech.com/gpu-benchmarks/NVIDIA-Quadro-RTX-8000-vs-NVIDIA-RTX-3090/532vs579

Here's a comparison of the 8k vs a A6000: https://bizon-tech.com/gpu-benchmarks/NVIDIA-Quadro-RTX-8000-vs-NVIDIA-RTX-A6000/532vs585

[-]

JoeFlaccoIsAnEliteQB@reddit

i know this is an old thread, but I have an 8000 and a 3090ti. am i correct in thinking most everything would actually run better on the 3090ti? I like having the 48 on one card, but I guess 32b would run better off the 8000.

i bought the 8000 new and its the passively cooled one but im guessing selling the 8000 and getting another 3090ti would be 48 with nv-link and much much faster?

[-]

Thrumpwart@reddit (OP)

Nice, thank you. Worth considering, even if it means half the vram.

[-]

01010101010111000111@reddit

Getting them to "combine memory" or work together is a nightmare.

If you want to preview that experience, just plug in a bunch of GPUs (can be different ones) into your box and see if you can get them to collaborate.

[-]

Super_Sierra@reddit

This isn't an issue. Idk who told you otherwise but I had a 4060 16gb and 3060 12gb play perfectly nice with each other.

[-]

PurpleUpbeat2820@reddit

Cool. What OS?

[-]

fallingdowndizzyvr@reddit

If you want to preview that experience, just plug in a bunch of GPUs (can be different ones) into your box and see if you can get them to collaborate.

I do exactly that every day. I have a box with 1 AMD and 2 Nvidia GPUs in it. They work together just fine. Add to that another box with Intel A770 GPUs and I also toss a Mac into the cluster to spice things up. They all work together easy as can be.

[-]

Thrumpwart@reddit (OP)

Wouldn't be an issue, I'm awesome.

[-]

Healthy-Nebula-3603@reddit

RTX 8000 is old and is 2x slower than the RTX 3090, including memory bandwidth ... but 48GB looks nice ....

[-]

Thrumpwart@reddit (OP)

But I could fit 4 of them in an ATX case with 192GB VRAM on a single PSU.

[-]

Healthy-Nebula-3603@reddit

Yes ...but inference will be very slow with these cards .... A 70b model at q4km won't even get 8 tokens and will probably be closer to 6 tokens I think ... a 140b model will barely get 3 tokens/s ... If you use q8, even slower ...

[-]

a_beautiful_rhind@reddit

You're hella underselling it. The arch isn't that bad. 8t/s is P40 speeds. More like 10-13t/s. Q8 isn't going to be worse since Turing has int8 tensor cores.

[-]

Healthy-Nebula-3603@reddit

2x rtx 3090 with Q4km 70b llama 3.3 has around 16 t/s ...

This card has x2 slower memory ... So I think will be as fast as p40 then.

[-]

a_beautiful_rhind@reddit

It has 672.0 GB/s memory speed. P40 is half of that. If you go by bandwidth alone, it falls somewhere between 3070ti and 3080.
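The bandwidth argument can be sketched as a back-of-envelope ceiling: during single-stream decoding every generated token has to read roughly all of the weights once, so tokens per second cannot exceed bandwidth divided by model size. The 672 GB/s figure and the "P40 is half of that" ratio come from this thread; the 43GB model size is the llama3.3:70b figure reported elsewhere in the thread. Real throughput lands below this ceiling, and a 43GB model would not fit on a single 24GB P40, so the second number is only there to show the ratio.

```python
def decode_ceiling_tps(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on decode tokens/s if each token reads all weights once."""
    return bandwidth_gb_s / model_gb


MODEL_GB = 43.0  # llama3.3:70b download size quoted in this thread

rtx8000 = decode_ceiling_tps(672.0, MODEL_GB)
p40 = decode_ceiling_tps(672.0 / 2, MODEL_GB)  # "half of that", per the comment

print(f"RTX 8000 ceiling: {rtx8000:.1f} t/s")
print(f"P40 ceiling:      {p40:.1f} t/s")
```

This is a ceiling, not a prediction: kernel efficiency, context length and the backend all pull the real number down.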

[-]

Healthy-Nebula-3603@reddit

https://www.reddit.com/r/LocalLLaMA/s/gTUpgJ9MFT

8 tokens 😅

[-]

slavik-f@reddit

You were right about 8 tokens on the RTX 8000.

But I do not think you're right, that P40 can "be as fast". P40 can't do 8 tokens on 70b model.

[-]

Healthy-Nebula-3603@reddit

p40 is even slower 😅?

[-]

Ronmaximus@reddit

you guys need to look into this ;) https://github.com/NVIDIA/TensorRT-LLM. Im gonna see if i can setup my RTX 8000 in Stable Diffusion to use TensorRT, and also the LLM version of it for inferencing that too. By the way i also have the Tesla P40 and i love that thing. Runs many models well at decent speeds. But anyways if you have a GPU with a Pascal chip then you can use this TensorRT to boost speeds quite a bit. Look into it

[-]

muxxington@reddit

Are there more infos on how the P40 is supported? Couldn't find much on Github. Only that P40 is not supported.

[-]

Ronmaximus@reddit

It sure does work with the P40 but it takes quite a bit of setting up depending on your setup. I have 3 cards in my server which are the P40, the RTX8000, and a Quadro P1000 for video output to my screens leaving the other two cards only for compute purposes. Because of this setup i had to specify which card i wanted to use for which task. That was a whole issue to figure out on its own but i got it going. So after a lot of trial and error, and research on this... for Stable Diffusion i got the P40 working with TensorRT after i compiled it for the proper versions SM 6.1 which is the compute version for the P40 on the Pascal chip. So in other words if you look up what versions of CUDA, torch, cuDNN, and TensorRT work with the P40 the formula is:

CUDA Toolkit version 11.8

cuDNN 8.8.0

TensorRT 8.6.1.6

torch 2.0.0 or 2.0.1

Also you need the tensorRT extension for that older version which is at this site:

https://github.com/AUTOMATIC1111/stable-diffusion-webui-tensorrt (i use A1111) you may have to find a way for your build, ComfyUI or regular SD.

[-]

Ronmaximus@reddit

Also, for the Tesla P40 you will need a driver past 450 i believe. I run windows datacenter driver version 551.78 (i run this on a Win 11 server setup). You can use the one that best works for you though.

I recommend getting a free NVIDIA developer account to get all the files you need because it was hard to find them otherwise. This saved me a lot of time.

Here is a reference from NVIDIA on a support matrix for cuDNN and CUDA: https://docs.nvidia.com/deeplearning/cudnn/archives/cudnn-880/support-matrix/index.html

The NVIDIA site has a lot of info for this on TensorRT support per compute or other comparisons. I pieced a lot of info together by scavenging the web for about 4 days to get it working on the P40.

NOTE: Despite setting this up, i didnt see a big improvement on the P40 with the mix above, also its limited for that version of TensorRT.. reason being is that back then about 2 years ago when they had support for the P40 or SM 6.1, they only had SD models of 1.5 working. I use mostly SDXL models and this wont work with those. Either way when you compile everything for compute 6.1 with this in the webui-user.bat things will run faster
set TORCH_CUDA_ARCH_LIST="6.1"
So it will look like this:

************************************************
@echo off

set PYTHON="C:\Users\yourdirhere\AppData\Local\Programs\Python\Python310\python.exe"

set TORCH_CUDA_ARCH_LIST="6.1"

set GIT=

set VENV_DIR=

set CUDA_VISIBLE_DEVICES=1

set COMMANDLINE_ARGS=--autolaunch --upcast-sampling --skip-version-check --device-id 0 --no-half-vae --xformers --opt-split-attention --enable-insecure-extension-access

call webui.bat

**************************************

Just make sure CUDA Toolkit is installed because TensorRT will be installed using pytorch which needs to seek that library from your python 3.10.6 directory even if you have this working inside a "venv" environment. So this will be a temporary pull of needed files from the path where the toolkit and python is installed from.

I havent tested TensorRT-LLM for that type of inference yet. But for LLM's it already runs pretty quick for the models that fit in this card (24GB).
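For anyone launching things from Python rather than a batch file, the same card-pinning trick from the `webui-user.bat` above can be sketched like this. `CUDA_VISIBLE_DEVICES` is the real CUDA environment variable; the helper name `env_for_gpu` is made up for illustration. Once a process is restricted to one physical card, that card shows up as device 0 inside the process, which is why `CUDA_VISIBLE_DEVICES=1` and `--device-id 0` can sit side by side.

```python
import os


def env_for_gpu(physical_index: int, base_env=None) -> dict:
    """Copy an environment and restrict CUDA to a single physical GPU."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(physical_index)
    return env


# Built from an empty base here so the printed result is easy to read.
print(env_for_gpu(1, base_env={}))
```

The returned dict can be handed to `subprocess.run(..., env=env_for_gpu(1))`. Which physical card gets which index depends on the driver's enumeration order, so it is worth checking with `nvidia-smi` before relying on a number.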

[-]

Thrumpwart@reddit (OP)

X4 if I buy 4 of them, so a 72B model in Q8 at 32tk/s sounds awesome!

[-]

Healthy-Nebula-3603@reddit

You won't get 4x faster... It will be even a bit slower than on 1 card

[-]

Aphid_red@reddit

The lazy way with ollama, yes.

Use tensor parallel to get some speedup. Not 4x though, think more like 2.5x.

https://www.reddit.com/r/LocalLLaMA/comments/1bdlrah/tensor_parallel_in_aphrodite_v050_is_amazing/

Speedup here varies from 1.6x to 3.5x.
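Putting the quoted speedups next to the roughly 8 tokens per second single-card figure from this thread gives a quick range. This is plain arithmetic on numbers already in the discussion, not a benchmark; the linked results were measured on a different setup, and the function name is hypothetical.

```python
def tensor_parallel_range(single_tps: float, low: float = 1.6, high: float = 3.5):
    """Scale a single-card rate by the low and high speedups quoted above."""
    return single_tps * low, single_tps * high


low_tps, high_tps = tensor_parallel_range(8.0)
typical_tps = 8.0 * 2.5  # the "more like 2.5x" estimate

print(f"4 cards, tensor parallel: {low_tps:.1f} to {high_tps:.1f} t/s")
print(f"at 2.5x: {typical_tps:.1f} t/s")
```

Even the optimistic end stays under the 32 t/s that a clean 4x would give.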

[-]

Thrumpwart@reddit (OP)

Maybe 5x faster.

[-]

a_beautiful_rhind@reddit

8.6 and ollama, obviously the most efficient backend.

[-]

slavik-f@reddit

I tried RTX 8000 on VAST.AI and getting speed:

- with llama3.3:70b (43GB) - 8.6 tokens per second

- with qwen2.5:72b-instruct-q4_0 (41GB) - 9.3 tokens per second

[-]

Healthy-Nebula-3603@reddit

So I was right 😅 around 8 tokens .

[-]

Geechan1@reddit

I own two of these cards. I thought I'd give you several observations and insights into the ownership experience:

I'm noticing people tend to undersell the performance of these cards. I've found the best backend for them is koboldcpp running GGUF quants, as that is faster than ollama/llama.cpp and supports llama.cpp's own implementation of Flash Attention. You'll want to run the rowsplit and the mmq kernel options on these cards; with these settings and 0 context, you can expect about 8-9 tokens per second for a Q5 quant on a 123b parameter model. For Q5 70b, expect more like 12-14 tokens per second with these settings. Prompt processing speed is a bit slow at about 180t/s for 123b and 350 for 70b, but still plenty usable.
The lack of Flash Attention 2 hurts, as you cannot use the exl2 format efficiently. Supposedly Turing is going to support this at some point, but it's reliant on the author to actually implement something, and it's been on "coming soon" for over a year now! If that feature ever comes, expect exl2 support to dramatically improve.
You can find excellent prices for these cards if you're patient and shop around for server farm grabs. I got my pair for 2k USD each, which ends up being cheaper than a 3090 in my country, for double the VRAM.
Because they're dual slot and blower cards, it's really easy to stack them and fit them in a standard case without needing to resort to open air with PCIe risers. You won't choke the thermals on the cards because all the hot air gets exhausted out of the case. They're also easier to run with a lower specced power supply; I actually get away with a 750w PSU running two cards with about 100W headroom to spare.

3090s are faster, better-supported and cheaper depending on your region, so that's why they're a default recommendation. However if you think the above trade offs are worth it, it's really hard to go wrong with an RTX 8000. 48GB in a dual slot blower card for much cheaper than the A6000 is hard to beat.

[-]

Thrumpwart@reddit (OP)

Thank you. This is great information to know. I generally run 70-72b models at Q8, but I could live with a q5 on 2 cards. If I ran 4 of them, that's 192GB VRAM!!

2k USD is more than the 2,400 CAD these are available on Amazon for NEW! That's why the deal jumped out at me. I can squeeze 4 of them onto an Asrock RomeD8 motherboard for relatively cheap. Any idea how inference speeds vary between 1x vs dual cards?
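A quick way to sanity-check how many 48GB cards a given quant needs is to estimate the weight size from bits per weight. The figures 4.5 and 8.5 bits are the block sizes of the GGUF `q4_0` and `q8_0` formats; the 4GB per-card reserve for KV cache and runtime overhead is an arbitrary assumption, and the function names are hypothetical. As a cross-check, 72B at `q4_0` comes out at 40.5GB, close to the 41GB reported for qwen2.5:72b-instruct-q4_0 in this thread.

```python
import math

# Bits per weight, including per-block scales.
BITS_PER_WEIGHT = {"q4_0": 4.5, "q8_0": 8.5, "f16": 16.0}


def weights_gb(params_billion: float, quant: str) -> float:
    """Approximate size of the weights in GB for a given quant format."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def cards_needed(params_billion, quant, vram_per_card_gb=48, reserve_gb=4):
    """Cards required if each keeps reserve_gb free (an assumed margin)."""
    usable = vram_per_card_gb - reserve_gb
    return math.ceil(weights_gb(params_billion, quant) / usable)


for quant in ("q4_0", "q8_0", "f16"):
    size = weights_gb(72, quant)
    print(f"72B {quant}: {size:.1f} GB, {cards_needed(72, quant)} card(s)")
```

Under these assumptions a 72B Q8 needs two cards, so four would mostly buy room for longer context or larger models.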

[-]

Geechan1@reddit

I would say using pipeline parallelism (the default for most backends), you're going to see numbers in a similar ballpark no matter how many cards you scale to. If you can manage to use tensor parallelism (which only a few backends support), you should expect to gain significant speed improvements per card. Row split also seems to utilise multiple RTX 8000s better, so I definitely encourage experimentation.

2,400 CAD for a new one? Fantastic deal - I'd jump on it while you can. I'd start with one or two and then you can decide whether to scale up from there or not. I find that 96GB of VRAM is really the sweet spot at the moment, as you're able to run 123B at high quants and that's the best quality we have available locally to us right now outside something like Deepseek, which realistically you can't run on pure GPUs without absurd investment.

[-]

Thrumpwart@reddit (OP)

Right on, thanks again!

[-]

Ronmaximus@reddit

Yeah i got my RTX 8000 new for just over 2k USD. So yeah its a good price.

[-]

Thrumpwart@reddit (OP)

How are you liking it?

[-]

Ronmaximus@reddit

I like it very much for what i do. For LLM inferencing you can get like 40 t/s for a 10B Q8 model, or around 20 t/s for a 20B at Q4. Also, for Stable Diffusion it takes about 50 secs to make a 1280x800 image, or it can take 8-9 seconds to make the same type of image with CUDA enhanced by TensorRT. Look into that because it can make things super fast making it worth buying one of these cards. They also have TensorRT-LLM which will enhance the T/s for LLM inference. So overall im happy with the card and i wouldnt mind getting another or an A6000. You can also run these cards on nvlink ... "notable a6000s scale insanely well. since they work in configs of upto 2048 gpus. so more is always better. eg (this shows off the improvements nvlink provides on workstation cards really well. 3090s only have limited nvlink bandwidth and suffer greatly since pcie headroom is effectively 0. but a6000s have the full 960gb/s of cross card bandwidth via nvlink + the bus's 48gbs/+ this lets them load a model at 192gb/s vs 48gb/s for 3090s with 4 of each in 1 system.)" So it makes a big difference when you scale these cards up for big models.

[-]

RouteGuru@reddit

there is older AMD cards with lots of vram too... not sure if they work but mi100 has like 32gb of vram and is less than 1K

[-]

Thrumpwart@reddit (OP)

I've had my eye on MI100's too. I priced out stuffing 8 of them into a Gigabyte Epyc server. Total price is like $15,000 Canadian.

But look at the responses in this thread - imagine how much worse it would be if I asked about 8x MI100's with all these teenagers yelling about 3090's.

[-]

gpupoor@reddit

teenagers or not, their suggestion was valid. unless you have 1 PCIE slot, there is no reason to invest on a completely dead architecture which can't do 80% of the cool nvidia exclusive stuff. with that said I too prefer WS cards. I cant stand 9 3090 models out of 10 because of their 5slot gaming flashy cooler and 10 3090 models out of 10 because of their awful power consumption. you could look into a dual a4500 nvlinked setup, they can be pretty cheap.

[-]

Thrumpwart@reddit (OP)

I have been looking at a4000 and a4500s recently. Especially with the growing prevalence of spec decoding, it could be fun to have a small Nvidia GPU for a draft model.

There is a reason to invest in MI100s considering they are cheap and have plenty of VRAM. Most of the gimmicky GPU features are of no use to me - I want cheap VRAM and manageable power draw, because unlike many of the teens here I pay my own power bills.

[-]

slavik-f@reddit

looks like nobody takes AMD seriously...

[-]

RouteGuru@reddit

there's definitely ppl who do though. I have both, and whenever I come around here asking for help with my AMD setup, some ppl chime in with solutions... its just more work to implement. Nvidia just works out the box. AMD not so much. I guess its a time over money kinda thing.

[-]

slavik-f@reddit

I ordered RTX 8000 on Ebay a month ago. The seller was from China. I paid $1400, which seemed a great price.

Well, too good to be true. 3 days later I received a package from China with a small glass ball in it. Clearly a scam... So, watch out.

The seller had 0 rating, but I decided to try. Filed the dispute now...

[-]

Ok_Warning2146@reddit

Sorry to hear that but you were too brave to give a 0 rating seller $1400.

[-]

slavik-f@reddit

why not?

there is Ebay buyer protection.

there is credit card chargeback option.

[-]

Ok_Warning2146@reddit

What is the limit of these protections? I suppose they can't cover everything when the amount is high enough.

[-]

slavik-f@reddit

Why not?

I already got the reply from Ebay and they told me they're issuing full refund, including taxes, shipping... And that's what I was expecting - full refund.

[-]

Ronmaximus@reddit

I never buy something in that price range that is undervalued with a 0 sales from the seller for that reason... and from China? nope. Why should you care? because if you do too many of these purchases not caring then they will give you a hard time to get your money back. Then you have to deal with the bank. Its just a waste of time and inconvenience. Instead buy from people that have a track record. Hopefully that helps. I paid just over 2k last year for my RTX 8000 brand new on eBay from a seller that had a few sales on his record. He was here in California so it was more credible. Good luck on your hunt.

[-]

Ok_Warning2146@reddit

Well, I am just the risk averse type that doesn't like to get into unnecessary troubles. Your mileage definitely varies.

[-]

darkfader_o@reddit

it's different if he's got the time and nerve to go through with it. i'd never manage ;-)

[-]

Thrumpwart@reddit (OP)

The ones I'm looking at are on Amazon.

[-]

JakoDel@reddit

I've paid 400usd for 3 near new RVII

with rocm getting actually good I dont see how they're worth it

[-]

darkfader_o@reddit

mind telling me what rocm version you are using? RVII (Pro) would be best value if it weren't a PITA, wasting 1000s of $ worth of time to make it work. but if you really paid 400 for three of them you got damn lucky!

[-]

Uninterested_Viewer@reddit

Seems priced pretty appropriately as it's coming up on 7 years old. It's still fetching quite a premium over the alternative of 2x 3090s, likely due exactly to the fact that it takes up only a single PCIe slot to get that 48GB VRAM.

[-]

Thrumpwart@reddit (OP)

Also 260w draw per GPU vs like 700w for 2 3090s? Maybe 800w?

[-]

ethertype@reddit

A single 3090 typically defaults to 350-370W max. Can be adjusted down to about 270 or so with minimal performance loss.

My 3090Ti is set up for 450 Max.
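To compare the power side of a four-card build, here is a small sketch using the per-card numbers from this thread: 260W for the RTX 8000, 350W for a 3090 at default limits, and about 270W when capped. The 150W allowance for CPU, board and drives is a placeholder assumption, and transient spikes are not modelled.

```python
def rig_power_w(gpu_count: int, watts_per_gpu: int, rest_of_system_w: int = 150) -> int:
    """Sustained draw estimate: GPUs at their limit plus an assumed base load."""
    return gpu_count * watts_per_gpu + rest_of_system_w


builds = {
    "4x RTX 8000 @ 260W": rig_power_w(4, 260),
    "4x 3090 @ 350W": rig_power_w(4, 350),
    "4x 3090 capped @ 270W": rig_power_w(4, 270),
}

for name, watts in builds.items():
    print(f"{name}: {watts} W")
```

Capped 3090s land close to the RTX 8000 build on paper, but they deliver 96GB instead of 192GB for that budget.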

[-]

Possum4404@reddit

M3 MacBook with 48GB VRAM: 20W 😂

[-]

ethertype@reddit

I think the downvotes are entirely unjustified. What Apple can mass-produce at fabulous profit also happens to be fabulously energy efficient. This is just a fact.

[-]

L3Niflheim@reddit

Ferrari vs Prius comparison

[-]

ethertype@reddit

Great point! Which one of those two cars are more practical for the average family on a day to day basis?

[-]

L3Niflheim@reddit

I am not sure what point you're trying to prove? You started off the conversation by trying to flex that the M3 is 20w. But it is 20w because it isn't anywhere as fast. The slow thing is more power efficient which we already knew.

[-]

ethertype@reddit

I did not.

[-]

thedudear@reddit

I love apple hardware for inferencing, and if that's all you plan to do then it's perfect for that.

But the compute is not in the same ballpark. So not really a good comparison.

[-]

Evening_Ad6637@reddit

That’s it. Not only can you train and finetune models with an RTX orders of magnitude faster, but even regarding inference: people often forget that prompt eval speed is still painfully slow in Apple Macs. So processing large context is not funny on Mac.

[-]

_mausmaus@reddit

THIS.

“My M4 blah blah blah…” Oh, yeah? Let’s compare compute benchmarks.

[-]

Conscious-Map6957@reddit

apples vs oranges

[-]

Possum4404@reddit

Indeed, typing this on a big M3 machine :) You are using legacy tech…

[-]

L3Niflheim@reddit

Your new tech isn't anywhere even close to as fast

[-]

Possum4404@reddit

i know, but it has the vram. when i need beef, I use beefier systems ;)

[-]

Massive-Question-550@reddit

Sounds great until the soldered SSD eventually dies.

[-]

fallingdowndizzyvr@reddit

The Macs are very efficient. Which makes it unnecessary to exaggerate. That's idle not under load. Everyone else is posting numbers under load.

[-]

Xyzzymoon@reddit

There's more.

RTX 8000 is half the speed of 3090 for inferencing the same thing.

Obviously RTX 8000 can hold a bigger model, but if all else equal, 3090 are actually superior.

[-]

Massive-Question-550@reddit

Thing is, depending on the application, inference speed may be a lot less important vs context. For example coding would likely require fast inference while story writing would require larger context. I hear more and more about people using AI to summarize large amounts of information which probably uses a fair bit of both. Either way it's good to know there are other options than hunting down every last 3090 on the planet.

[-]

skirmis@reddit

Since LLMs are bottle-necked by VRAM bandwidth and not computational speed, that's probably just fine for LLM inference.

[-]

dillon-nyc@reddit

I have an RTX 6000 (the 24gb version of that card), and it plays pretty well with whatever I've thrown at it.

I don't know why people are comparing it to Pascal architecture, it's not a P40.

I've thought about swapping it for a 3090 (newer architecture, faster memory, physically larger, more power draw), and I still may, but what I have fits within my power budget, and when I bought it, it was priced roughly the same as a 3090 was then.

[-]

Thrumpwart@reddit (OP)

What backends do you run? Have you run into any issues other than lack of FA2?

[-]

dillon-nyc@reddit

I'm also not running anything terribly complicated or bleeding edge right now. There's been a lot of ollama in my life to keep things simple, and there's been decent support on image models that I've toyed with.

[-]

Thrumpwart@reddit (OP)

Thanks for the info. I know not to expect cutting edge, but if it's just nice and simple and works that's what I'm after.

I also mainline a Mac and it does the job but it's slow (waiting on a QVQ-72B prompt to process as I write this). I would love this size of VRAM in a faster vehicle like the 8000.

[-]

dillon-nyc@reddit

I would probably echo people here telling you to save the money and buy a pair of 3090s. It's probably cheaper to buy the 3090s, a motherboard, memory, decent power supply, and case to put the whole mess in than to buy just an 8000.

If I didn't have a 6000 from the past, I wouldn't buy one today.

[-]

Thrumpwart@reddit (OP)

But I don't want just 1 RTX 8000 - I'd buy 4 to stuff into a workstation.

[-]

dillon-nyc@reddit

I think you'll save a lot of money if you get four 3090s, an open air case, and make what will look to a lot of people like a crypto mining rig, and then undervolting the 3090s to fit on a single normal home circuit.

Did you ever say anywhere what your use case is?

[-]