I compared all specs of the major GPUs/machines that are being used here, because bandwidth is not everything. Some of y'all need a reality check.

[-]

vick2djax@reddit

Tesla V100…what’s the catch? I was thinking a second 3090 but I meaaan

[-]

Tyme4Trouble@reddit

Nice comparison. I think one thing that might get missed is all of this will get filtered by your model. It doesn’t matter how cost efficient it is if it can’t run the model you want.

[-]

dlarsen5@reddit

That’s what I’m thinking, okay sure 5090 is much faster than any Strix Halo systems but it only has 32GB so beyond that even if it’s slower on AMD you get an extra 96GB for model + context

[-]

smithy_dll@reddit

This is why it is strange that RTX PRO 6000 is not included. The real metric is PP and TG tok/s/$ for a specific model
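A minimal sketch of the metric proposed here, assuming you have measured PP and TG tokens/s for one specific model on each device. The `Result` class, the device names, and every number below are placeholders for illustration, not measurements from this thread.

```python
from dataclasses import dataclass


@dataclass
class Result:
    name: str
    price_usd: float  # what the hardware actually costs you
    pp_tok_s: float   # prompt processing (prefill) tokens/s, measured
    tg_tok_s: float   # token generation (decode) tokens/s, measured


def per_dollar(r: Result) -> tuple[float, float]:
    """Return (PP tok/s per dollar, TG tok/s per dollar) for one device."""
    return r.pp_tok_s / r.price_usd, r.tg_tok_s / r.price_usd


def rank_by_tg_per_dollar(results: list[Result]) -> list[Result]:
    """Sort devices from best to worst TG tok/s per dollar."""
    return sorted(results, key=lambda r: per_dollar(r)[1], reverse=True)


# Placeholder numbers only; substitute your own benchmarks for one model.
results = [
    Result("card_a", price_usd=1000.0, pp_tok_s=2000.0, tg_tok_s=50.0),
    Result("card_b", price_usd=4000.0, pp_tok_s=6000.0, tg_tok_s=120.0),
]

for r in rank_by_tg_per_dollar(results):
    pp, tg = per_dollar(r)
    print(f"{r.name}: PP {pp:.3f} tok/s/$, TG {tg:.4f} tok/s/$")
```

The ranking only means something if every row uses the same model, quantization, and context length, and if the model actually fits on the device.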

[-]

BuilderUnhappy7785@reddit

This should really have been broken into consumer GPU, enterprise/workstation GPU, and SoC platforms.

[-]

Daniel_H212@reddit

Strix Halo you can get way more than 96 GB and it's still stable.

[-]

dlarsen5@reddit

96GB on top of 5090 32GB unless stacked then more

[-]

El_90@reddit

I love running 235b at q3.... Try that on most GPU cards !

[-]

iMakeSense@reddit

What can it do at that quantization? It seems like at some point you'd be better off with a smaller model with higher quantization

[-]

lawanda123@reddit

On strix halo, what's your llama cpp config?

[-]

llama-server --host 0.0.0.0 --port ${PORT} --log-file /var/log/llamacpp.log -np 1 \
  -m /path/Qwen3-235B/bartowski_Qwen_Qwen3-235B-A22B-Instruct-2507-GGUF_Qwen_Qwen3-235B-A22B-Instruct-2507-Q3_K_S_Qwen_Qwen3-235B-A22B-Instruct-2507-Q3_K_S-00001-of-00003.gguf \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --repeat-penalty 1.0 \
  --fit on --fit-ctx 65536

Trying ud_3_xl at 32k context later, I think k_s is the limit though

[-]

alex20_202020@reddit

if it can’t run the model you want.

One only needs drive space to store it; the engines can load it part by part and run. The speed varies greatly, though, and I do not like to read "can't" - it is a lie.

[-]

FoxiPanda@reddit

I think this kind of misses the point on the M3 Ultra (especially the 256 & 512GB varieties). Here's a hot take in return: It takes ~5 RTX 6000 Pros or 16 (lol) RTX 5090s to be able to even load the same model as you can on a 512GB Mac Studio. Is prompt processing speedy? Not in the slightest, but it's not an absolutely abysmal experience either. You can load up Deepseek v4 flash Q8 (or a low quant of Pro), Step-3.7-Flash Q8, MiMo-2.5 (or a low quant of Pro), GLM-5.1 (low quant), Kimi-K2.6 (low quant), Qwen3.5-397B (Q6), etc etc for ~$11-15K.

5 * $10000 = $50K for RTX Pro 6000s to even load that model ... or you can suffer horrendous performance that negates the point of the RTX Pro 6000 and load it in to VRAM + DRAM for like $10K for the card + 6-10K for the DDR5 RAM + $1K for mobo + $2k for proc + $500 for PSUs / cases / random cables to be able to hook up the 5th card... so ... $21500+ even just for the DRAM+1 card setup to be able to load all those models above...

Similarly, if you can even get 16 5090s stable working together, that'd be 16 * $4000 = $64K for RTX 5090s... which ... well, good luck getting all of those attached to one system without some really fascinating $2K+ PCIe switch setups and you still have to get a bunch of the other stuff...
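For anyone who wants to rerun this arithmetic with current prices, here is a small sketch. It uses only the rough USD figures quoted in this comment; the variable names are made up for illustration. The DDR5 cost is given as a range, so the one-card-plus-DRAM build comes out as a range too, and the quoted $21,500 corresponds to about $8K of DDR5.

```python
# Rough USD prices as quoted in the comment; swap in current ones.
PRO_6000 = 10_000
RTX_5090 = 4_000

# All-VRAM builds (cards only, no host system).
all_pro_6000 = 5 * PRO_6000   # five RTX Pro 6000 cards
all_5090 = 16 * RTX_5090      # sixteen RTX 5090 cards

# One card plus system DRAM: card + motherboard + CPU + PSU/case/cables,
# plus DDR5 quoted as a 6-10K range.
fixed_parts = PRO_6000 + 1_000 + 2_000 + 500
hybrid_low = fixed_parts + 6_000
hybrid_high = fixed_parts + 10_000

print(f"5x Pro 6000:      ${all_pro_6000:,}")
print(f"16x 5090:         ${all_5090:,}")
print(f"1 card + DRAM:    ${hybrid_low:,} to ${hybrid_high:,}")
```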

Also, there's not a single RTX Pro 6000 Blackwell WS in the world available for $7000 new from any sort of reputable dealer - they're >$10K now. If you can point me at one at that price, I'll buy it today. I'd love to have another one.

I have RTX 5090, an RTX Pro 6000, and M3 Ultra Mac Studios and IMO, they're very complementary. I run small and mid-sized dense models on the NVIDIA hardware and larger MoE models on the Studios and it works pretty well.

[-]

Ok_Top9254@reddit (OP)

Yes, I'm not in the market for macs or Pro 6000's, as you could tell from my post. I have no idea how much either of these cost, I just grabbed the pricing I saw on this subreddit and sold prices on ebay. They've actually gotten much worse than I thought, and for that price you could grab two, almost three Pro 6000 cards, which does change the equation...

Here is my counter hot take:
Smaller models beat bigger ones with more thinking time and multiple shots. Crazy I know. There is not such a massive difference as there used to be between 400Bs and 600Bs-1T+.

Generation speed suffers with more context on low compute as well, so realistically you would be 2-3 shot-ing answers on the RTX setup before you'd get a reply from the mac, let alone the difference between using text only and dropping files and images into the chat. Even with a smaller model that does make a difference.

You also completely ignored the 32GB V100s, which have NVLink too and you can cluster them in groups of 4 with "unified" memory. 16 is about 12k bucks, with a full system say 19-20k, which is actually less than what the studio is right now, and yes the power consumption and space and yada yada would be atrocious and all that... but you COULD do it, and have it for cheaper while being faster.

[-]

nomorebuttsplz@reddit

Smaller models beat bigger ones with more thinking time

Can you give some examples of what benchmarks or data have led you to this conclusion?

[-]

FoxiPanda@reddit

The problem with the V100s is the long term support, power, and infra. You’re really going to spend $20k on something that’s dropped from cuda support already and you’re willing to plop down over 5000W to power it to get to 512GB vram…Really? Come on now.

[-]

twack3r@reddit

Let’s turn this around:

Point me in the direction of an m3ultra 512GiB for $11k-15k and I’ll buy it immediately.

Thing is, I know you can’t. Around $25k is where it starts and even those are unicorns.

Large unified memory and m3 is ok-ish for chatbot use and the occasional document generation. The bandwidth is ok, compute is laughable.

Where I agree is that a heterogeneous stack such as yours absolutely has its value.

[-]

nomorebuttsplz@reddit

I've seen used ones recently go for 15k on ebay from seemingly reputable sellers

[-]

FoxiPanda@reddit

Sorry I was quoting OP’s price he put in his table at 11k which I thought was optimistic AF lol but I didn’t want to go scour eBay to find out how much they actually cost currently, so I put a range up to $15k sigh lol

[-]

twack3r@reddit

fair tbh

[-]

ThisGonBHard@reddit

You can't find Macs for those prices anymore.

The cheapest Mac with 128 GB, the 128 GB MacBook Pro 16" M5 Max, is 6.6k EUR. I can get 2 Strix Halo 128 GB PCs for 6k EUR.

The 256 GB and 512 GB are outright not available anymore. When they were available, they were starting to make the RTX 6000 PRO 96 GB look like a good deal.

[-]

FoxiPanda@reddit

I agree - this is how I ended up with a 6000 actually.

[-]

Southern_Sun_2106@reddit

And that's why OP didn't respond to your post. They are pushing a certain narrative it would seem.

[-]

sn2006gy@reddit

Sparks are still 3-4k, I'm unsure why people would buy the Nvidia ones or pay more. Asus GX10 ftw - just buy your own local storage instead of paying 1k per TB - models don't need latest gen nvme as you just load them once.

[-]

entsnack@reddit

It gets very annoying if you like swapping models.

[-]

HavenTerminal_com@reddit

raspberry pi is doing a lot of work in that comparison

[-]

farewellrif@reddit

Aw man, no AMD cards? I'd love to see MI50 16 and 32gb in this comparison.

Great work though, interesting how well older cards hold up in price performance vs the mid range.

[-]

Ok_Top9254@reddit (OP)

I was actually planning to, I have three 32GB Mi50s myself but after I checked the ebay prices I didn't even bother because they are crazy overpriced. After the Vega rabbithole was found they shot up so much, I don't think they are worth it over the Nvidia counterparts. P100s are half the price of the 16GB Mi50s and 32GB Mi50s and 60s are reaching V100 prices that have 6x more compute so not worth it. I should have added the R9700 though, that's my bad.

[-]

lacerating_aura@reddit

I am particularly interested in R9700. They are in this weird limbo in my mind where they have sufficient vram, decent compute and bandwidth, and so-so software support. This is a very ill-informed picture in my mind so please forgive any inaccuracies.

[-]

migsperez@reddit

My R9700 is due to be delivered in the next 15 mins. Fingers crossed it's good, it's my largest ever investment in compute.

[-]

lacerating_aura@reddit

Congratulations.

[-]

Ok_Top9254@reddit (OP)

Yup, just added it to the post.

[-]

BringMeTheBoreWorms@reddit

7900xtx cards are worth considering as well, they have 24gb at good bandwidth and are pretty cheap second hand

[-]

andreasntr@reddit

My poor 7900xtx, i bought it for 600€ used and cannot be happier

[-]

BringMeTheBoreWorms@reddit

I’ve got 2 and they are a good card. The latest hip improvements in llamacpp are also worth giving a go

[-]

andreasntr@reddit

True, vulkan is trading blows with cuda nowadays as well

[-]

BringMeTheBoreWorms@reddit

I’ve been using vulkan but have a bencher run tests vs hip for my main models every few days. I noticed a big jump in hip performance lately. Haven’t done a full comparison yet to see if vulkan has improved as well.

[-]

andreasntr@reddit

Yeah sorry, i didn't mean vulkan improved in this timeframe, just saying that it's running smoothly for me and performance is good, which means i'm not concerned about the lack of cuda support on amd cards like mine

[-]

farewellrif@reddit

Fair, I picked up 2x 16gbs for $US150 back before they went crazy. I feel like I did ok, but I definitely wouldn't buy a 32gb at today's prices

[-]

SSOMGDSJD@reddit

Mi50 slots in approx where the v100 does. Slight bw advantage to the mi50 and slightly lower price for the 32gb. But worse on prefill, i think, dont have one to test them head to head

[-]

The_Hardcard@reddit

It’s fine for you to pick what matters most to you and argue your point. But this sub will continue to have people who have different criteria for being actually productive.

So Mac will continue to be recommended here, since memory capacity per dollar is a key consideration given the greater productive capabilities of high parameter models. Macs have a unique, vastly superior combination of capacity and bandwidth at a range of price points that allow access to productivity that is unavailable with other solutions without spending far more money.

For you and others not to care is reasonable.

Looking for everyone in the LLM community and this sub to not consider and discuss Macs is unreasonable.

[-]

Twirrim@reddit

For what it's worth, I'm rocking an RTX 3050 that is getting me 20-30 tok/s on unsloth/Qwen3.6-35B-A3B-GGUF Q4 something, and I *think* from what I'm seeing, CPU/System memory bandwidth is part of what is slowing it down (it's a somewhat anemic 4 core Ryzen 3 3200G!) It certainly spends a whole lot of time pegging the CPU, which I assume is it shuffling stuff around.

It's working fine for the kinds of things I'm using it for.

[-]

jdchmiel@reddit

yeah, your model is around 17g and not entirely in vram.
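A back-of-the-envelope check of that "around 17g" figure, as a sketch: weight size is roughly parameters times bits per weight divided by 8. The 4.0 bits per weight is an assumption (real Q4 GGUF files mix tensor types and usually come out somewhat larger), and the 8 GB of VRAM is also an assumption, so substitute your card's actual capacity.

```python
def weights_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes).

    Ignores KV cache, activations, and file-format overhead.
    """
    return params_billion * bits_per_weight / 8


def spill_to_system_ram_gb(model_gb: float, vram_gb: float) -> float:
    """How much of the model cannot live in VRAM and ends up in system RAM."""
    return max(0.0, model_gb - vram_gb)


model_gb = weights_size_gb(35, 4.0)   # 35B parameters at an assumed 4 bits each
ASSUMED_VRAM_GB = 8.0                 # assumption, not stated in the thread

print(f"weights: ~{model_gb:.1f} GB")
print(f"outside VRAM: ~{spill_to_system_ram_gb(model_gb, ASSUMED_VRAM_GB):.1f} GB")
```

Whatever spills over is served from system memory, which fits the observation above that CPU and system memory bandwidth look like the bottleneck.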

[-]

noctrex@reddit

You could also add the mainstream AMD cards like 7900XTX with its 24GB VRAM. I have one and it performs quite well.

Also the 9060XT is very interesting to get for compute on the cheap. 16GB VRAM for about 400 bucks is a very good deal, and you can buy 2 of them for less than 800, and utilize 32GB.

[-]

ketosoy@reddit

For anyone else extending this research, gpupoet.com is an easy, fast, albeit imperfect and incomplete repository of much of this.

[-]

ChocomelP@reddit

TTFT.

[-]

nick_ziv@reddit

This is making the V100 32gb look very attractive if there would be another qwen 122B A10b coming. Also slapping the power limit on the 3090 helps with consumption. Another card that should be on here is the Titan RTX. It's 24gb and seems to be forgotten frequently (maybe a good thing for price!)

[-]

suprjami@reddit

CUDA 12 won't be supported forever.

In a year or two a Volta card will be worth nothing.

[-]

SSOMGDSJD@reddit

Community support will carry on with pinned cuda versions, theres simply too much good compute in volta to leave it to die. Current premiums wont hold up long term for sure, but they will be far from paperweights

[-]

suprjami@reddit

My job is (amongst other things) to backport software, so I'm fairly familiar with this overall idea. I have some idea what can and can't be accomplished and the risks and effort involved.

I think inference engines like llama.cpp or vllm have such high complexity and rapid code change that backporting later features to earlier versions is not viable.

Look at BeeLlama and Lucebox, two projects currently trying to add DFlash to forked llama.cpp. These project leads are very smart people. Neither of them have a working multi-gpu implementation. That's one feature, no library issues to deal with, and they still can't get it right.

This stuff is hard. Really fucking hard. I just don't see it happening by unpaid volunteers so a small minority can keep an ancient card running a few more years.

[-]

nick_ziv@reddit

Consistent with the nature of this sub, I think part of the idea for why this could work is that AI will be able to assist devs in building out anything that is technically possible on the hardware. I imagine because closed loop testing is possible that we will see much of this older hardware remain supported for some time. Also, over 100t/s generation even if I never updated llama.cpp is still a useful tool.

[-]

Reggitor360@reddit

Should have taken a used 7900XTX (they go for 600ish) and a MI100 (750-900ish) into the mix tbh

Imo still solid picks, especially the MI if you want FP32/64

[-]

Shoddy-Tutor9563@reddit

Unfortunately not all these teraflops are linearly convertible to prompt processing and token generation. It greatly depends on the software - inference code, which is, as we know, not even in terms of quality from GPU maker to GPU maker

[-]

crossoverXYZ@reddit

Interesting benchmark results. One thing worth adding to the discussion is that real-world performance often diverges from synthetic benchmarks depending on your specific use case. I've tested several of these models in a RAG pipeline and the ranking changes quite a bit when you factor in instruction following consistency and output formatting reliability. A model might score well on MMLU but still struggle to consistently output valid JSON when you need structured responses. For my production setup I ended up choosing a model that scored slightly lower on benchmarks but was much more predictable in terms of output formatting, which saved me a ton of post-processing headaches.

[-]

Proof-Possibility-54@reddit

Nice, quite useful. I am thinking right now about buying a GPU to power ollama, so this comparison arrived just on time.

Thanks!

[-]

joelkunst@reddit

I haven't taken a very detailed look, but it looks like you are comparing prices of graphics cards vs full computers (dgx spark, framework desktop, mac), which is not really fair. Also, electricity cost is part of the price, and it's not visible here at all.

Some of these graphics cards are awesome and have their place, but saying that a macbook is an overpriced raspberry pi is a stretch 🤣

[-]

FullOf_Bad_Ideas@reddit

I understand that this sub is now filled with gamers who do nothing but ERP with anime waifus on their setups, but for people who do something actually productive, prefill is still very important and this is completely hidden by the "generate 1000 word story" benchmarks that most posts or big AI youtube channels do.

I think ERP died off, it feels like it's mostly vibe coders now.

5070 Ti was the best price per compute when I was comparing some gpu's. 5080 a bit behind. Lower vram so they're not top picks but they're killing it now at mining.

FP16 TFlops for 6000 Pro seem off, I think it's about 400-500.

[-]

Kahvana@reddit

(E)RP going strong on sillytavern and other platforms! Especially with the release of Gemma4 31B.

It's just that vibe coders and other "productive" pipelines can now finally be done since the context window is no longer 32K or 128K with actual coherency to 16K.

The pool of RP-ers is much smaller than people wanting to be more productive.

[-]

Makers7886@reddit

As someone around since the start of the llm wave I remember people being worried the ERP/Waifu proliferation was going to hurt LLM momentum. These days seem like Sunday church in comparison I hardly see that stuff being talked about around here. Or maybe I'm numb to it now.

[-]

Kahvana@reddit

It's called "AI Companions" now instead I think.

With gpt-4o closing it kinda feels like it became somewhat less of an issue, lessons were learned from that. It's also apparent that programming/toolcalling needs are in direct conflict with ERP/Waifu cards (structured text vs creative writing).

[-]

Aerthlyomi@reddit

Another, I don't get it; really I don't, post

I can use my M4 Max 128 GB Studio to type that message while listening to music, and at the same time I have a Qwen 122B model translating a very long Chinese text in the background. I don't even feel it. The fans are not even running.

I was using a 4090. It was faster, way faster, but I had to spend hours choosing a model for its small VRAM capacity. And that text would not fit. Now I don't care at all.

It was noisy, really noisy, it was hot, and I was certainly doing nothing else while it was working.

Having a Local model is also about what you do with it. If it's just performance without any use case, go to the commercial models. You win every time.

[-]

crossoverXYZ@reddit

Glad someone finally brought up prefill performance. The "tokens per second on generation" benchmarks that dominate YouTube completely miss the real bottleneck for anyone doing serious work with local models. When you're feeding in 8k+ tokens of context for code analysis or document processing, prefill speed is what determines whether the experience feels responsive or painful.

I've been running a dual P100 setup for a few months now and can confirm they're genuinely underrated for the price point. The 700 GB/s combined bandwidth handles Qwen 27B quantized surprisingly well for my use case (mostly structured data extraction from documents). The main gotcha is power draw — these things idle hot and under load you're basically running a space heater. Worth factoring in electricity costs if you're running inference 24/7.

Would love to see the prefill charts when you get to them. If it helps, my P100 dual setup pulls about 450W under sustained inference load with a 27B Q4 model. Idle is around 120W combined.
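To turn those wattages into a running cost, a rough sketch: it takes the 450W load and 120W idle figures quoted above and a duty cycle you choose. The electricity price of 0.30 per kWh is a hypothetical placeholder, so plug in your own tariff.

```python
def daily_kwh(load_w: float, idle_w: float, load_hours: float) -> float:
    """Energy per day in kWh for a box under load for load_hours and idle otherwise."""
    if not 0 <= load_hours <= 24:
        raise ValueError("load_hours must be between 0 and 24")
    return (load_w * load_hours + idle_w * (24 - load_hours)) / 1000


def monthly_cost(kwh_per_day: float, price_per_kwh: float, days: int = 30) -> float:
    """Cost over a month, in whatever currency price_per_kwh is expressed in."""
    return kwh_per_day * price_per_kwh * days


PRICE_PER_KWH = 0.30  # hypothetical tariff, not from the thread

for hours in (24, 8, 0):
    kwh = daily_kwh(450, 120, hours)
    print(f"{hours:>2} h/day under load: {kwh:.2f} kWh/day, "
          f"{monthly_cost(kwh, PRICE_PER_KWH):.2f} per month")
```

Even the all-idle case is not free: 120W around the clock is close to 3 kWh per day.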

[-]

andreasntr@reddit

Load draw is not bad considering that high end gpus can go up to 350w alone

[-]

SkyFeistyLlama8@reddit

A better benchmark would have compared prompt processing or prefill speeds and token generation speeds across different inference stacks: llama.cpp, mlx, vllm, onnx, whatever uses rocm etc.

Prefill is a killer on anything other than discrete Nvidia GPUs. I've used MacBook Pros and other laptops for inference and waiting minutes to half an hour for a code base to get slurped into prompt context is painful. On the other hand, you get a portable inference machine that doesn't need to be plugged into a dryer outlet to run.

[-]

nauxiv@reddit

The Asus-branded DGX Spark (Ascent GX10) is still only around $3600 new, which should make a noticeable difference on price/performance. The cooling on it is a little better than the NVIDIA FE version too.

[-]

BringTea_666@reddit

Why are you measuring it via TFLOPS ? It's meaningless measure for AI.

[-]

hurdurdur7@reddit

You put intel B60 on a picture and skip AMD AI Pro R9700 from the picture. Meh.

[-]

hurdurdur7@reddit

And teraflops are cool , but most cards are stuck behind their memory bandwidth, not their teraflops.
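One common way to put a number on the bandwidth limit, offered as a rough sketch rather than a benchmark: during decode, every active weight has to be read from memory about once per generated token, so memory bandwidth divided by bytes read per token gives a ceiling on tokens/s. The function name and the example inputs (100 GB/s, 3B active parameters, 4-bit weights) are hypothetical, and the estimate ignores KV cache reads and any overlap with compute.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float,
                         active_params_billion: float,
                         bits_per_weight: float) -> float:
    """Upper bound on decode tokens/s if each active weight is read once per token."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# Hypothetical inputs: 100 GB/s memory, 3B active parameters, 4-bit weights.
ceiling = decode_ceiling_tok_s(100, 3, 4)
print(f"decode ceiling: ~{ceiling:.1f} tok/s")
```

Real decode speeds land below this ceiling. Prefill is the opposite case: it processes many tokens per weight read, which is where the teraflops start to matter.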

[-]

Long_comment_san@reddit

4090 is honestly outrageously expensive

[-]

nomorebuttsplz@reddit

I kind of doubt 3060 is twice as fast pp as m3 ultra in practice. maybe in theory though.

[-]

suprjami@reddit

Llama 2 7B comparison

M3 Ultra = 1121.80 or 1538.34 (depending on core count)

Apple benchmarks: https://github.com/ggml-org/llama.cpp/discussions/4167

RTX 3060 = 2137.50 or 2407.67 (flash attn off/on)

CUDA benchmarks: https://github.com/ggml-org/llama.cpp/discussions/15013

[-]

nomorebuttsplz@reddit

That doesn't quite corroborate OP's numbers. Also I think MLX would be a better comparison especially for older versions. Also, for newer models it looks like m3u and 3060 have more similar prefill numbers even in GGUF. e.g. here: https://www.reddit.com/r/LocalLLaMA/comments/1tokpoc/400_qwen_3627b_setup_dual_rtx_3060_3050_ts/ the initial peak is reported as 600 t/s whereas on my m3u the initial peak is 490 (q6 gguf). At 50k they report 438 t/s whereas I am at 328. So it seems like m3u is more like 75-80% as fast prefill for qwen 3 27b.

Not saying the 3060 isn't slightly better, just it's not accurate to extrapolate that unsourced (?) fp16 for m3u number to reality.
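The ratios behind that estimate, computed from the four prefill numbers quoted in this comment (tokens/s at the initial peak and at 50k context). Nothing here is measured independently.

```python
# Prefill tokens/s as quoted in the comment above.
dual_3060 = {"initial_peak": 600, "at_50k": 438}
m3_ultra = {"initial_peak": 490, "at_50k": 328}

# M3 Ultra prefill speed as a fraction of the dual RTX 3060 setup.
ratios = {point: m3_ultra[point] / dual_3060[point] for point in dual_3060}

for point, ratio in ratios.items():
    print(f"{point}: M3 Ultra is {ratio:.0%} as fast")
```

That gives roughly 82% at the initial peak and 75% at 50k context, so the gap widens somewhat as context grows.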

[-]

suprjami@reddit

Good post!

I have found it really hard to find reliable numbers for MLX.

There was a oMLX page which hosted many thousands of benchmarks, but seemingly for the same hardware and same model it would vary by unrealistic amounts. Like a given measurement would be 1000 tok/sec from one person and 10 tok/sec from another.

[-]

Complex-Maybe3123@reddit

Memory is king, bandwidth is queen. As someone with both a 3090 and a strix halo, I can clearly feel the pros and cons of both.

[-]

Momsbestboy@reddit

I love this thread, because it adds a relevant number to the stats. I don't have the time and hardware to expand it even more by adding multiple model sizes to the data. The 3060 is nice, but is it still ranked as high if we use an LLM that is 20GB in size? It doesn't fit, needs offloading to RAM, and suddenly is much slower than any GPU with e.g. 32GB VRAM...

[-]

JoJoeyJoJo@reddit

I mean I certainly didn’t pay $2400 for my new Nvidia 4080, let alone a used one.

[-]

Perfect-Flounder7856@reddit

So by all you didn't mean all....cuz rtx pro 6k isn't on here and I know a lot of people on this sub use this card. As do i.

[-]

HumanDrone8721@reddit

+1

[-]

Ok_Top9254@reddit (OP)

Well yes, but no. 6k is just 5090 with more vram, the 10% more cores doesn't do much for the value. But it's roughly 15$ per TFLOP and 68$ per GB. (I'm also just envious don't mind me)

[-]

Clear-Ad-9312@reddit

pro 6000 and tesla v100 both make the most sense out of this entire chart. I wonder why the intel arc pro b70 was not included. I wish it was 48gb instead of 32gb

[-]

_madar_@reddit

The Max-Q runs at 300w, the workstation version I'd agree though

[-]

Imaginary-Unit-3267@reddit

As an RTX 3060 haver I am surprised and pleased to discover I have the lowest cost per teraflop of anything on this list!

[-]

suprjami@reddit

There's a reason they are the majority hardware on HuggingFace.

[-]

fallingdowndizzyvr@reddit

Your RTX prices are too low. Your 5060ti and 5070ti prices are too high.

[-]

HokkaidoNights@reddit

What about the rest of the computer... you can't compare complete systems to a GPU alone on price or power use.

[-]

jcdoe@reddit

I agree, I think the data needs to be adjusted to account for the pc needed to host the gpu. Comparing a $5000 all in one to a $3100 GPU is apples and oranges.

I suspect the Mac will still be on the expensive end, but I also suspect those of us running LLMs on Macs bought them for other purposes (general purpose computing, gaming via crossover, video or music production).

[-]

dangerous_inference@reddit

There is never a limit to how much resources can be used. You give someone a mountain of lobsters they can use them for fertilizer. I am looking forward to doing some wild stuff with my 4x48gb modded 4090 system soon. This ceiling for demand people imagine is really just the lack of proper infrastructure to effectively leverage more and more intelligence. The average person will be burning an unfathomable amount of AI compute within 5-10 years.

[-]

Double_Cause4609@reddit

Counterpoint:

"Actually productive" is a crazy statement because there's more than one way to be productive. Some people do crazy multi-agent workflows with clever compaction schemes that use a ton of prefill and cache eviction, but some people do simpler synchronous agents that don't evict cache *that* much.

Also: Low prefill is a good problem to have. There's nothing in the properties of accelerators that say you need to load the entire model on it to do prefill; you can absolutely stream parameters into ie: an NPU card, or a 5090 etc to do prefill on a huge model.

That doesn't work the other way around though; decode is limited to what you can fit in memory.

So, if we're talking "I don't know how to do machine learning, I can't make a custom inference engine, and I don't want to believe anyone else will build disaggregated per-tensor prefill ever" (which is not an unfair stance), your argument kind of works, but if we're factoring in what future serving stacks will look like within the lifetimes of the hardware, I don't think you should be ripping into people for their builds.

Another note that I didn't see you make: TFLOPs actually distribute well. Bandwidth doesn't. Compute-bound loads are easier to amortize across devices, so if you're choosing a cheap multi-GPU config versus an expensive single-GPU config, you do have to factor in that the TFLOPs do compose relatively well (particularly with graph-concurrent setups like IK_LCPP or EXL3).

Also: I'd argue FP8 FLOPs may matter as much as FP16 FLOPs as a lot of people more practically run at that, but by the same token (no pun intended), most cards with high FP16 FLOPs also support FP8 so it's a fine enough proxy.

[-]

nomorebuttsplz@reddit

with an m3u you can have a fat model like 8 bit minimax or GLM 5.1 and a small model for simpler parts of the workflow loaded at the same time.

I wouldn't want it if I knew I absolutely had to load in 50k+ context every single call but just making simple apps in python there is absolutely no need for that.

[-]

SkyFeistyLlama8@reddit

NPUs are the big question mark here. I've used them on laptops where prompt processing beats the integrated GPU at a fraction of the power budget but I'm stuck running much smaller 4B or 7B models only.

A multi-NPU inference accelerator card would be fun to play with.

[-]

Sofakingwetoddead@reddit

In actual, real world performance, which I would measure as TG & PP tokens per second at large context, the cards that perform poorly in your benchmark are far more cost effective than the solutions that perform well in your test. Cold prefill, prompt processing and so forth. That's the speed. RTX 6000 Pro pp ~5k with 200k context. r9700 w/ ~200k context pp ~450... 5k/450 ~11 ----- 11x 1400 = 15,400... RTX ~10k ... not even close.

In practical use, the r9700 delivers 1/10th the performance, with a much smaller model and less context, of a RTX 6000 pro. 1/3 the cost for 1/10th the performance
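The arithmetic above, spelled out as a sketch with only the numbers quoted in this comment. It assumes, as the comment implicitly does, that prefill throughput would scale linearly with the number of R9700 cards, which is optimistic for a multi-GPU setup.

```python
# Figures quoted in the comment above.
pp_pro_6000 = 5000      # prefill tok/s at ~200k context
pp_r9700 = 450          # prefill tok/s at ~200k context
price_r9700 = 1400      # USD
price_pro_6000 = 10_000 # USD, approximate

speed_ratio = pp_pro_6000 / pp_r9700          # how many R9700s match one Pro 6000
matched_cost = speed_ratio * price_r9700      # cost of that many R9700s
price_ratio = price_r9700 / price_pro_6000    # R9700 price as a fraction of the Pro 6000

print(f"speed ratio:  {speed_ratio:.1f}x")
print(f"matched cost: ${matched_cost:,.0f} vs ${price_pro_6000:,}")
print(f"price ratio:  {price_ratio:.2f}")
```

Unrounded, the matched cost is about $15,556; rounding the speed ratio down to 11 gives the $15,400 in the comment. With these two prices the R9700 costs about 1/7 of the Pro 6000, so the "1/3 the cost" figure presumably reflects different prices.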

[-]

a_beautiful_rhind@reddit

3090s aren't overkill. There are image models, training, and everything else under the sun. P100s, macs and all that stuff is only really LLMs. If anything they are getting slow.

[-]

DiscipleofDeceit666@reddit

Maybe the apple angle doesn’t have to be about raw ai speed. Maybe it could be about AI swarms. Semantic search by phrase within text and it sends 1000 agents to find the files for you.

[-]

Vegetable-Score-3915@reddit

The apple angle is also how to get that amount of unified ram vs multiple gpus. What are the opportunity cost and practical matters involved in splitting a large model across multiple gpus? I.e. 2 16gb gpus are not the same as 1 32gb gpu. I appreciate OP's post. It just should have some caveats.

[-]