I compared all specs of the major GPUs/machines that are being used here, because bandwidth is not everything. Some of y'all need a reality check.
Posted by Ok_Top9254@reddit | LocalLLaMA | 105 comments
Hot takes:
- The Mac Studio is an overpriced Raspberry Pi and way more inefficient than people think. The M5 MBP is better, but not by much.
- Spark was actually decent when it was just 3-4k. Strix is obviously much better now.
- 3090s are complete overkill for single-stream usage; V100s are much better value if you can find them cheap. P40s are very niche, but decent if you want exactly 48GB of VRAM, run MoE and don't have money for Mi50s or V100s.
- P100s are extremely underrated entry-level LLM GPUs that are not talked about enough. 200 bucks (dual GPU) for a combined 32GB of 700GB/s memory and 70% of M3 Ultra compute is crazy.
I understand that this sub is now filled with gamers who do nothing but ERP with anime waifus on their setups, but for people who do something actually productive, prefill is still very important and this is completely hidden by the "generate 1000 word story" benchmarks that most posts or big AI youtube channels do. Especially with multimodal models that eat up context like mad.
I'm still collecting data for prefill and generation charts I'd like to do in the future... I also couldn't find much reliable power data, so if you could provide that from your own setups in the comments I'll be glad.
Thanks for coming to my ted talk.
vick2djax@reddit
Tesla V100…what’s the catch? I was thinking a second 3090 but I meaaan
Tyme4Trouble@reddit
Nice comparison. I think one thing that might get missed is all of this will get filtered by your model. It doesn’t matter how cost efficient it is if it can’t run the model you want.
dlarsen5@reddit
That’s what I’m thinking. Okay, sure, a 5090 is much faster than any Strix Halo system, but it only has 32GB. Beyond that, even if AMD is slower, you get an extra 96GB for model + context.
smithy_dll@reddit
This is why it is strange that RTX PRO 6000 is not included. The real metric is PP and TG tok/s/$ for a specific model
BuilderUnhappy7785@reddit
This should really have been broken into consumer GPU, enterprise/workstation GPU, and SoC platforms.
dlarsen5@reddit
headline figures make for good post
Daniel_H212@reddit
Strix Halo you can get way more than 96 GB and it's still stable.
dlarsen5@reddit
96GB on top of 5090 32GB unless stacked then more
Daniel_H212@reddit
Ah I see what you mean
El_90@reddit
I love running 235b at q3.... Try that on most GPU cards !
iMakeSense@reddit
What can it do at that quantization? It seems like at some point you'd be better off with a smaller model with higher quantization
lawanda123@reddit
On strix halo, what's your llama cpp config?
El_90@reddit
llama-server --host 0.0.0.0 --port ${PORT} --log-file /var/log/llamacpp.log -np 1 \
  -m /path/Qwen3-235B/bartowski_Qwen_Qwen3-235B-A22B-Instruct-2507-GGUF_Qwen_Qwen3-235B-A22B-Instruct-2507-Q3_K_S_Qwen_Qwen3-235B-A22B-Instruct-2507-Q3_K_S-00001-of-00003.gguf \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --repeat-penalty 1.0 \
  --fit on --fit-ctx 65536
Trying ud_3_xl at 32k context later, I think k_s is the limit though
alex20_202020@reddit
You only need drive space to store it; the engines can load it part by part and run. The speed varies greatly, but I do not like to read "can't". It is a lie.
FoxiPanda@reddit
I think this kind of misses the point on the M3 Ultra (especially the 256 & 512GB varieties). Here's a hot take in return: It takes ~5 RTX 6000 Pros or 16 (lol) RTX 5090s to be able to even load the same model as you can on a 512GB Mac Studio. Is prompt processing speedy? Not in the slightest, but it's not absolutely an abysmal experience either. You can load up Deepseek v4 flash Q8 (or a low quant of Pro), Step-3.7-Flash Q8, MiMo-2.5 (or a low quant of Pro), GLM-5.1 (low quant), Kimi-K2.6 (low quant), Qwen3.5-397B (Q6), etc etc for ~$11-15K.
5 * $10000 = $50K for RTX Pro 6000s to even load that model ... or you can suffer horrendous performance that negates the point of the RTX Pro 6000 and load it in to VRAM + DRAM for like $10K for the card + 6-10K for the DDR5 RAM + $1K for mobo + $2k for proc + $500 for PSUs / cases / random cables to be able to hook up the 5th card... so ... $21500+ even just for the DRAM+1 card setup to be able to load all those models above...
Similarly, if you can even get 16 5090s stable working together, that'd be 16 * $4000 = $64K for RTX 5090s... which ... well, good luck getting all of those attached to one system without some really fascinating $2K+ PCIe switch setups and you still have to get a bunch of the other stuff...
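The card counts and totals in this comment can be checked with a short sketch. It assumes the capacities and prices quoted in the thread (32 GB per RTX 5090 at $4,000, 96 GB per RTX Pro 6000 at $10,000); the function names are made up for illustration. Note that five 96 GB cards give 480 GB, so matching the full 512 GB would strictly take a sixth card.

```python
import math

def cards_needed(target_gb: float, vram_per_card_gb: float) -> int:
    """Smallest number of cards whose combined VRAM reaches target_gb."""
    return math.ceil(target_gb / vram_per_card_gb)

def gpu_outlay(target_gb: float, vram_per_card_gb: float, price_per_card: float):
    """Return (card count, total card price) for a VRAM target."""
    n = cards_needed(target_gb, vram_per_card_gb)
    return n, n * price_per_card

# Prices and capacities are the commenter's figures, not verified market data.
rtx_5090 = gpu_outlay(512, 32, 4_000)         # 16 cards, $64,000
pro_6000_five = gpu_outlay(480, 96, 10_000)   # 5 cards hold 480 GB, $50,000
pro_6000_full = gpu_outlay(512, 96, 10_000)   # the full 512 GB needs a 6th card

def hybrid_build(ddr5_price: float) -> float:
    """One Pro 6000 plus DDR5, motherboard, CPU, and PSU/case, per the comment."""
    return 10_000 + ddr5_price + 1_000 + 2_000 + 500

print(rtx_5090, pro_6000_five, pro_6000_full)
print(hybrid_build(6_000), hybrid_build(8_000), hybrid_build(10_000))
```

With the DDR5 estimate of $6-10K, the hybrid build lands between $19,500 and $23,500; the quoted "$21500+" corresponds to the $8K midpoint.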
Also, there's not a single RTX Pro 6000 Blackwell WS in the world available for $7000 new from any sort of reputable dealer - they're >$10K now. If you can point me at one at that price, I'll buy it today. I'd love to have another one.
I have RTX 5090, an RTX Pro 6000, and M3 Ultra Mac Studios and IMO, they're very complementary. I run small and mid-sized dense models on the NVIDIA hardware and larger MoE models on the Studios and it works pretty well.
Ok_Top9254@reddit (OP)
Yes, I'm not in the market for macs or Pro 6000's, as you could tell from my post. I have no idea how much either of these cost, I just grabbed the pricing I saw on this subreddit and sold prices on ebay. They've actually gotten much worse than I thought, and for that price you could grab two, almost three Pro 6000 cards, which does change the equation...
Here is my counter hot take:
Smaller models beat bigger ones with more thinking time and multiple shots. Crazy I know. There is not such a massive difference as there used to be between 400Bs and 600Bs-1T+.
Generation speed suffers with more context on low compute as well, so realistically you would be 2-3 shot-ing answers on the RTX setup before you'd get a reply from the mac, let alone the difference between using text only and dropping files and images into the chat. Even with a smaller model that does make a difference.
You also completely ignored the 32GB V100s, which have Nvlink too and you can cluster them in groups of 4 with "unified" memory. 16 is about 12k bucks, with a full system say 19-20k, which is actually less than what the studio is right now, and yes the power consumption and space and yada yada would be atrocious and all that... but you COULD do it, and have for cheaper while being faster.
nomorebuttsplz@reddit
Can you give some examples of what benchmarks or data have led you to this conclusion?
FoxiPanda@reddit
The problem with the V100s is the long term support, power, and infra. You’re really going to spend $20k on something that’s dropped from cuda support already and you’re willing to plop down over 5000W to power it to get to 512GB vram…Really? Come on now.
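For the V100 back-and-forth, a rough sketch of the memory, per-card price, and power numbers. The 16 cards, 32 GB each, and roughly $12k total come from the thread. The 300 W per card is an assumption (the rated TDP of the SXM2 V100, which is the NVLink variant), and the 500 W host overhead is a placeholder rather than a measurement.

```python
def cluster_summary(n_cards, vram_gb, tdp_w, total_card_price, host_overhead_w=0):
    """Aggregate VRAM, per-card price, and worst-case power for a multi-GPU build."""
    gpu_power_w = n_cards * tdp_w
    return {
        "vram_gb": n_cards * vram_gb,
        "price_per_card": total_card_price / n_cards,
        "gpu_power_w": gpu_power_w,
        "total_power_w": gpu_power_w + host_overhead_w,
    }

# 300 W TDP and 500 W host overhead are assumptions, see the note above.
v100_build = cluster_summary(16, 32, 300, 12_000, host_overhead_w=500)
print(v100_build)
```

Under these assumptions the cards alone draw up to 4,800 W at full load, so "over 5000W" is plausible once the host systems are included. Real draw during single-stream inference is usually below TDP.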
twack3r@reddit
Let’s turn this around:
Point me in the direction of an m3ultra 512GiB for $11k-15k and I’ll buy it immediately.
Thing is, I know you can’t. Around $25k is where it starts and even those are unicorns.
Large unified memory and m3 is ok-ish for chatbot use and the occasional document generation. The bandwidth is ok, compute is laughable.
Where I agree is that a heterogeneous stack such as yours absolutely has its value.
nomorebuttsplz@reddit
I've seen used ones recently go for 15k on ebay from seemingly reputable sellers
FoxiPanda@reddit
Sorry I was quoting OP’s price he put in his table at 11k which I thought was optimistic AF lol but I didn’t want to go scour eBay to find out how much they actually cost currently, so I put a range up to $15k sigh lol
twack3r@reddit
fair tbh
ThisGonBHard@reddit
You can't find Macs for those prices anymore.
The cheapest Mac with 128 GB, the 128 GB MacBook Pro 16" M5 Max, is 6.6k EUR. I can get 2 Strix Halo 128 GB PCs for 6k EUR.
The 256 GB and 512 GB are outright not available anymore. When they were available, they were starting to make the RTX 6000 PRO 96 GB look like a good deal.
FoxiPanda@reddit
I agree - this is how I ended up with a 6000 actually.
Southern_Sun_2106@reddit
And that's why OP didn't respond to your post. They are pushing a certain narrative it would seem.
sn2006gy@reddit
Sparks are still 3-4k, I'm unsure why people would buy the Nvidia ones or pay more. Asus GX10 ftw - just buy your own local storage instead of paying 1k per TB - models don't need latest-gen NVMe as you just load them once.
entsnack@reddit
It gets very annoying if you like swapping models.
HavenTerminal_com@reddit
raspberry pi is doing a lot of work in that comparison
farewellrif@reddit
Aw man, no AMD cards? I'd love to see MI50 16 and 32gb in this comparison.
Great work though, interesting how well older cards hold up in price performance vs the mid range.
Ok_Top9254@reddit (OP)
I was actually planning to, I have three 32GB Mi50s myself but after I checked the ebay prices I didn't even bother because they are crazy overpriced. After the Vega rabbithole was found they shot up so much, I don't think they are worth it over the Nvidia counterparts. P100s are half the price of the 16GB Mi50s and 32GB Mi50s and 60s are reaching V100 prices that have 6x more compute so not worth it. I should have added the R9700 though, that's my bad.
lacerating_aura@reddit
I am particularly interested in the R9700. They are in this weird limbo in my mind where they have sufficient VRAM, decent compute and bandwidth, and so-so software support. This is a very ill-informed picture in my mind, so please forgive any inaccuracies.
migsperez@reddit
My R9700 is due to be delivered in the next 15 mins. Fingers crossed it's good, it's my largest ever investment in compute.
lacerating_aura@reddit
Congratulations.
Ok_Top9254@reddit (OP)
Yup, just added it to the post.
BringMeTheBoreWorms@reddit
7900xtx cards are worth considering as well, they have 24gb at good bandwidth and are pretty cheap second hand
andreasntr@reddit
My poor 7900xtx, i bought it for 600€ used and cannot be happier
BringMeTheBoreWorms@reddit
I’ve got 2 and they are a good card. The latest hip improvements in llamacpp are also worth giving a go
andreasntr@reddit
True, vulkan is trading blows with cuda nowadays as well
BringMeTheBoreWorms@reddit
I’ve been using vulkan but have a bencher run tests vs hip for my main models every few days. I noticed a big jump in hip performance lately. Haven’t done a full comparison yet to see if vulkan has improved as well.
andreasntr@reddit
Yeah sorry, I didn't mean vulkan improved in this timeframe, just saying that it's running smoothly for me and performance is good, which means I'm not concerned about the lack of cuda support on amd cards like mine
farewellrif@reddit
Fair, I picked up 2x 16gbs for $US150 back before they went crazy. I feel like I did ok, but I definitely wouldn't buy a 32gb at today's prices
SSOMGDSJD@reddit
Mi50 slots in approx where the V100 does. Slight bandwidth advantage to the Mi50 and a slightly lower price for the 32GB. But worse on prefill, I think; I don't have one to test them head to head
The_Hardcard@reddit
It’s fine for you to pick what matters most to you and argue your point. But this sub will continue to have people who have different criteria for being actually productive.
So Mac will continue to be recommended here, since memory capacity per dollar is a key consideration given the greater productive capabilities of high parameter models. Macs have a unique, vastly superior combination of capacity and bandwidth at a range of price points that allow access to productivity that is unavailable with other solutions without spending far more money.
For you and others not to care is reasonable.
Looking for everyone in the LLM community and this sub to not consider and discuss Macs is unreasonable.
Twirrim@reddit
For what it's worth, I'm rocking an RTX 3050 that is getting me 20-30 tok/s on unsloth/Qwen3.6-35B-A3B-GGUF Q4 something, and I *think* from what I'm seeing, CPU/System memory bandwidth is part of what is slowing it down (it's a somewhat anemic 4 core Ryzen 3 3200G!) It certainly spends a whole lot of time pegging the CPU, which I assume is it shuffling stuff around.
It's working fine for the kinds of things I'm using it for.
jdchmiel@reddit
yeah, your model is around 17g and not entirely in vram.
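The "around 17g" figure follows from a simple size estimate: parameter count times bits per weight, divided by 8. A minimal sketch, assuming a flat 4.0 bits per weight (real Q4 GGUF quants average somewhat more, so actual files are larger) and an 8 GB RTX 3050 with 1 GB reserved for context and buffers; both numbers are assumptions, not taken from the comment.

```python
def gguf_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: parameters x bits per weight / 8."""
    return params_billion * bits_per_weight / 8

def vram_fraction(model_gb: float, vram_gb: float, reserved_gb: float = 1.0) -> float:
    """Share of the weights that fits in VRAM after reserving some headroom."""
    usable_gb = max(vram_gb - reserved_gb, 0.0)
    return min(usable_gb / model_gb, 1.0)

size_gb = gguf_size_gb(35, 4.0)       # 35B parameters at an assumed 4.0 bits
on_gpu = vram_fraction(size_gb, 8)    # assumed 8 GB card
print(f"{size_gb:.1f} GB of weights, {on_gpu:.0%} in VRAM")
```

On these assumptions well under half of the weights sit in VRAM, which is consistent with the CPU and system memory being the bottleneck.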
noctrex@reddit
You could also add the mainstream AMD cards like the 7900XTX with its 24GB VRAM. I have one and it is performing quite well.
Also the 9060XT is very interesting to get for compute on the cheap. 16GB VRAM for about 400 bucks is a very good deal, and you can buy 2 of them for less than 800, and utilize 32GB.
ketosoy@reddit
For anyone else extending this research, gpupoet.com is an easy, fast, albeit imperfect and incomplete repository of much of this.
ChocomelP@reddit
TTFT.
nick_ziv@reddit
This is making the V100 32gb look very attractive if there would be another qwen 122B A10b coming. Also slapping the power limit on the 3090 helps with consumption. Another card that should be on here is the Titan RTX. It's 24gb and seems to be forgotten frequently (maybe a good thing for price!)
suprjami@reddit
CUDA 12 won't be supported forever.
In a year or two a Volta card will be worth nothing.
SSOMGDSJD@reddit
Community support will carry on with pinned cuda versions; there's simply too much good compute in Volta to leave it to die. Current premiums won't hold up long term for sure, but they will be far from paperweights
suprjami@reddit
My job is (amongst other things) to backport software, so I'm fairly familiar with this overall idea. I have some idea what can and can't be accomplished and the risks and effort involved.
I think inference engines like llama.cpp or vllm have such high complexity and rapid code change that backporting later features to earlier versions is not viable.
Look at BeeLlama and Lucebox, two projects currently trying to add DFlash to forked llama.cpp. These project leads are very smart people. Neither of them have a working multi-gpu implementation. That's one feature, no library issues to deal with, and they still can't get it right.
This stuff is hard. Really fucking hard. I just don't see it happening by unpaid volunteers so a small minority can keep an ancient card running a few more years.
nick_ziv@reddit
Consistent with the nature of this sub, I think part of the idea for why this could work is that AI will be able to assist devs in building out anything that is technically possible on the hardware. I imagine because closed loop testing is possible that we will see much of this older hardware remain supported for some time. Also, over 100t/s generation even if I never updated llama.cpp is still a useful tool.
Reggitor360@reddit
Should have taken a used 7900XTX (they go for 600ish) and a MI100 (750-900ish) into the mix tbh
Imo still solid picks, especially the MI if you want FP32/64
Shoddy-Tutor9563@reddit
Unfortunately, not all these teraflops convert linearly into prompt processing and token generation speed. It greatly depends on the software (the inference code), which, as we know, is not equal in quality from one GPU maker to another
crossoverXYZ@reddit
Interesting benchmark results. One thing worth adding to the discussion is that real-world performance often diverges from synthetic benchmarks depending on your specific use case. I've tested several of these models in a RAG pipeline and the ranking changes quite a bit when you factor in instruction following consistency and output formatting reliability. A model might score well on MMLU but still struggle to consistently output valid JSON when you need structured responses. For my production setup I ended up choosing a model that scored slightly lower on benchmarks but was much more predictable in terms of output formatting, which saved me a ton of post-processing headaches.
Proof-Possibility-54@reddit
Nice, quite useful. I am thinking right now about buying a GPU to power ollama, so this comparison arrived just on time.
Thanks!
joelkunst@reddit
I haven't taken a very detailed look, but it looks like you are comparing prices of bare graphics cards against full computers (DGX Spark, Framework Desktop, Mac), which is not really fair. Electricity cost is also part of the price, and it's not visible here at all.
Some of these graphics cards are awesome and have their place, but saying that a MacBook is an overpriced Raspberry Pi is a stretch 🤣
FullOf_Bad_Ideas@reddit
I think ERP died off, it feels like it's mostly vibe coders now.
5070 Ti was the best price per compute when I was comparing some gpu's. 5080 a bit behind. Lower vram so they're not top picks but they're killing it now at mining.
FP16 TFlops for 6000 Pro seem off, I think it's about 400-500.
Kahvana@reddit
(E)RP going strong on sillytavern and other platforms! Especially with the release of Gemma4 31B.
It's just that vibe coders and other "productive" pipelines can now finally be done since the context window is no longer 32K or 128K with actual coherency to 16K.
The pool of RP-ers is much smaller than people wanting to be more productive.
Makers7886@reddit
As someone around since the start of the llm wave I remember people being worried the ERP/Waifu proliferation was going to hurt LLM momentum. These days seem like Sunday church in comparison I hardly see that stuff being talked about around here. Or maybe I'm numb to it now.
Kahvana@reddit
It's called "AI Companions" now instead I think.
With gpt-4o closing it kinda feels like it became somewhat less of an issue, lessons were learned from that. It's also apparent that programming/toolcalling needs are in direct conflict with ERP/Waifu cards (structured text vs creative writing).
Aerthlyomi@reddit
Another, I don't get it; really I don't, post
I can use my M4 Max 128 GB Studio to type this message while listening to music, and at the same time I have a Qwen 122B model translating a very long Chinese text in the background. I don't even feel it. The fans are not even running.
I was using a 4090. It was faster, way faster, but I had to spend hours choosing a model for its small VRAM capacity. And that text would not fit. Now I don't care at all.
It was noisy, really noisy, it was hot, and I was certainly doing nothing else while it was working.
Having a Local model is also about what you do with it. If it's just performance without any use case, go to the commercial models. You win every time.
crossoverXYZ@reddit
Glad someone finally brought up prefill performance. The "tokens per second on generation" benchmarks that dominate YouTube completely miss the real bottleneck for anyone doing serious work with local models. When you're feeding in 8k+ tokens of context for code analysis or document processing, prefill speed is what determines whether the experience feels responsive or painful.
I've been running a dual P100 setup for a few months now and can confirm they're genuinely underrated for the price point. The 700 GB/s combined bandwidth handles Qwen 27B quantized surprisingly well for my use case (mostly structured data extraction from documents). The main gotcha is power draw — these things idle hot and under load you're basically running a space heater. Worth factoring in electricity costs if you're running inference 24/7.
Would love to see the prefill charts when you get to them. If it helps, my P100 dual setup pulls about 450W under sustained inference load with a 27B Q4 model. Idle is around 120W combined.
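To put the 450 W load and 120 W idle figures into running-cost terms, here is a small sketch. The wattages come from the comment; the price of 0.30 per kWh is a placeholder, so substitute your own tariff.

```python
def monthly_kwh(watts: float, hours_per_day: float = 24, days: int = 30) -> float:
    """Energy used over a month at a constant draw, in kWh."""
    return watts * hours_per_day * days / 1000

def monthly_cost(watts: float, price_per_kwh: float, **kwargs) -> float:
    """Monthly electricity cost at a flat tariff."""
    return monthly_kwh(watts, **kwargs) * price_per_kwh

PRICE_PER_KWH = 0.30  # assumed tariff, not from the thread

load_kwh = monthly_kwh(450)   # sustained inference, 24/7
idle_kwh = monthly_kwh(120)   # idle, 24/7
print(load_kwh, idle_kwh)
print(round(monthly_cost(450, PRICE_PER_KWH), 2), round(monthly_cost(120, PRICE_PER_KWH), 2))
```

A machine that mostly idles sits much closer to the 120 W figure, so the idle number often matters more than the load number for an always-on box.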
andreasntr@reddit
Load draw is not bad considering that high end gpus can go up to 350w alone
SkyFeistyLlama8@reddit
A better benchmark would have compared prompt processing or prefill speeds and token generation speeds across different inference stacks: llama.cpp, mlx, vllm, onnx, whatever uses rocm etc.
Prefill is a killer on anything other than discrete Nvidia GPUs. I've used MacBook Pros and other laptops for inference and waiting minutes to half an hour for a code base to get slurped into prompt context is painful. On the other hand, you get a portable inference machine that doesn't need to be plugged into a dryer outlet to run.
nauxiv@reddit
The Asus-branded DGX Spark (Ascent GX10) is still only around $3600 new, which should make a noticeable difference on price/performance. The cooling on it is a little better than the NVIDIA FE version too.
BringTea_666@reddit
Why are you measuring it via TFLOPS ? It's meaningless measure for AI.
hurdurdur7@reddit
You put intel B60 on a picture and skip AMD AI Pro R9700 from the picture. Meh.
hurdurdur7@reddit
And teraflops are cool , but most cards are stuck behind their memory bandwidth, not their teraflops.
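The bandwidth point can be made concrete with the usual back-of-envelope ceiling for token generation: each generated token has to read the active weights once, so tokens per second cannot exceed bandwidth divided by the bytes read per token. This sketch ignores KV cache reads and kernel overhead, and the 14 GB and 3.5 GB weight sizes are hypothetical examples; only the 700 GB/s figure comes from the post.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_read_gb: float) -> float:
    """Upper bound on single-stream decode speed for a bandwidth-bound model.

    weights_read_gb is the weight data touched per token: the whole model
    for a dense model, only the active experts for a MoE model.
    """
    return bandwidth_gb_s / weights_read_gb

dense_ceiling = decode_ceiling_tok_s(700, 14)    # hypothetical 14 GB dense model
moe_ceiling = decode_ceiling_tok_s(700, 3.5)     # hypothetical 3.5 GB active per token
print(dense_ceiling, moe_ceiling)
```

Measured speeds land below this ceiling. Prefill is the opposite case: many prompt tokens share one pass over the weights, so it tends to be limited by compute instead.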
Long_comment_san@reddit
4090 is honestly outrageously expensive
nomorebuttsplz@reddit
I kind of doubt 3060 is twice as fast pp as m3 ultra in practice. maybe in theory though.
suprjami@reddit
Llama 2 7B comparison
M3 Ultra = 1121.80 or 1538.34 (depending on core count)
Apple benchmarks: https://github.com/ggml-org/llama.cpp/discussions/4167
RTX 3060 = 2137.50 or 2407.67 (flash attn off/on)
CUDA benchmarks: https://github.com/ggml-org/llama.cpp/discussions/15013
nomorebuttsplz@reddit
That doesn't quite corroborate OP's numbers. Also I think MLX would be a better comparison especially for older versions. Also, for newer models it looks like m3u and 3060 have more similar prefill numbers even in GGUF. e.g. here: https://www.reddit.com/r/LocalLLaMA/comments/1tokpoc/400_qwen_3627b_setup_dual_rtx_3060_3050_ts/ the initial peak is reported as 600 t/s whereas on my m3u the initial peak is 490 (q6 gguf). At 50k they report 438 t/s whereas I am at 328. So it seems like m3u is more like 75-80% as fast prefill for qwen 3 27b.
Not saying the 3060 isn't slightly better, just it's not accurate to extrapolate that unsourced (?) fp16 for m3u number to reality.
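The ratios in this exchange are easy to recompute from the figures quoted above. This sketch only divides the numbers as posted (tokens per second for prompt processing); it does not validate the benchmarks themselves, and which M3 Ultra and RTX 3060 runs to pair up is a judgment call, so the widest possible range is reported.

```python
# Llama 2 7B prompt processing, as quoted from the llama.cpp discussions
m3_ultra = [1121.80, 1538.34]    # by core count
rtx_3060 = [2137.50, 2407.67]    # flash attention off / on

llama2_low = min(m3_ultra) / max(rtx_3060)
llama2_high = max(m3_ultra) / min(rtx_3060)

# Qwen 27B figures from the reply: M3 Ultra vs the dual-3060 post
qwen_initial = 490 / 600    # initial peak
qwen_at_50k = 328 / 438     # at 50k context

print(f"Llama 2 7B: M3 Ultra at {llama2_low:.0%} to {llama2_high:.0%} of the 3060")
print(f"Qwen 27B: {qwen_initial:.0%} at the start, {qwen_at_50k:.0%} at 50k context")
```

The Qwen numbers work out to roughly 75% and 82%, close to the "75-80%" in the comment, while the older Llama 2 7B numbers put the M3 Ultra further behind.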
suprjami@reddit
Good post!
I have found it really hard to find reliable numbers for MLX.
There was a oMLX page which hosted many thousands of benchmarks, but seemingly for the same hardware and same model it would vary by unrealistic amounts. Like a given measurement would be 1000 tok/sec from one person and 10 tok/sec from another.
Complex-Maybe3123@reddit
Memory is king, bandwidth is queen. As someone with both a 3090 and a strix halo, I can clearly feel the pros and cons of both.
Momsbestboy@reddit
I love this thread, because it adds a relevant number to the stats. I don't have the time and hardware to expand it even more by adding multiple different model sizes to the data. The 3060 is nice, but is it still ranked as high if we use an LLM that is 20GB in size? It doesn't fit, needs offloading to RAM, and suddenly is much slower than any GPU with e.g. 32GB VRAM...
JoJoeyJoJo@reddit
I mean I certainly didn’t pay $2400 for my new Nvidia 4080, let alone a used one.
Perfect-Flounder7856@reddit
So by all you didn't mean all....cuz rtx pro 6k isn't on here and I know a lot of people on this sub use this card. As do i.
HumanDrone8721@reddit
+1
Ok_Top9254@reddit (OP)
Well yes, but no. 6k is just 5090 with more vram, the 10% more cores doesn't do much for the value. But it's roughly 15$ per TFLOP and 68$ per GB. (I'm also just envious don't mind me)
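The two per-unit figures here imply a price and a compute number, which can be back-solved as a sanity check. This assumes the 96 GB capacity mentioned elsewhere in the thread and treats OP's $68 per GB and $15 per TFLOP as exact, which they probably are not (both are described as rough).

```python
VRAM_GB = 96                 # RTX Pro 6000 capacity, as mentioned in the thread
DOLLARS_PER_GB = 68          # OP's figure
DOLLARS_PER_TFLOP = 15       # OP's figure

implied_price = DOLLARS_PER_GB * VRAM_GB
implied_tflops = implied_price / DOLLARS_PER_TFLOP

print(implied_price, round(implied_tflops, 1))
```

The implied price is about $6.5k, well below the >$10K street price reported earlier in the thread, and the implied compute figure is about 435 TFLOPS, which sits inside the 400-500 range another commenter suggests for FP16.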
Clear-Ad-9312@reddit
Pro 6000 and Tesla V100 both make the most sense out of this entire chart. I wonder why the Intel Arc Pro B70 was not included. I wish it was 48GB instead of 32GB
_madar_@reddit
The Max-Q runs at 300w, the workstation version I'd agree though
Imaginary-Unit-3267@reddit
As an RTX 3060 haver I am surprised and pleased to discover I have the lowest cost per teraflop of anything on this list!
suprjami@reddit
There's a reason they are the majority hardware on HuggingFace.
fallingdowndizzyvr@reddit
Your RTX prices are too low. Your 5060ti and 5070ti prices are too high.
HokkaidoNights@reddit
What about the rest of the computer... you can't compare complete systems to a GPU alone on price or power use.
jcdoe@reddit
I agree, I think the data needs to be adjusted to account for the pc needed to host the gpu. Comparing a $5000 all in one to a $3100 GPU is apples and oranges.
I suspect the Mac will still be on the expensive end, but I also suspect those of us running LLMs on Macs bought them for other purposes (general purpose computing, gaming via crossover, video or music production).
dangerous_inference@reddit
There is never a limit to how much resources can be used. You give someone a mountain of lobsters they can use them for fertilizer. I am looking forward to doing some wild stuff with my 4x48gb modded 4090 system soon. This ceiling for demand people imagine is really just the lack of proper infrastructure to effectively leverage more and more intelligence. The average person will be burning an unfathomable amount of AI compute within 5-10 years.
Double_Cause4609@reddit
Counterpoint:
"Actually productive" is a crazy statement because there's more than one way to be productive. Some people do crazy multi-agent workflows with clever compaction schemes that use a ton of prefill and cache eviction, but some people do simpler synchronous agents that don't evict cache *that* much.
Also: Low prefill is a good problem to have. There's nothing in the properties of accelerators that say you need to load the entire model on it to do prefill; you can absolutely stream parameters into ie: an NPU card, or a 5090 etc to do prefill on a huge model.
That doesn't work the other way around though; decode is limited to what you can fit in memory.
So, if we're talking "I don't know how to do machine learning, I can't make a custom inference engine, and I don't want to believe anyone else will build disaggregated per-tensor prefill ever" (which is not an unfair stance), your argument kind of works, but if we're factoring in what future serving stacks will look like within the lifetimes of the hardware, I don't think you should be ripping into people for their builds.
Another note that I didn't see you make: TFLOPs actually distribute well. Bandwidth doesn't. Compute-bound loads are easier to amortize across devices, so if you're choosing a cheap multi-GPU config versus an expensive single-GPU config, you do have to factor in that the TFLOPs do compose relatively well (particularly with graph-concurrent setups like IK_LCPP or EXL3).
Also: I'd argue FP8 FLOPs may matter as much as FP16 FLOPs as a lot of people more practically run at that, but by the same token (no pun intended), most cards with high FP16 FLOPs also support FP8 so it's a fine enough proxy.
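The streaming-prefill argument in this comment can be sketched as a break-even calculation. Streaming a dense model's weights over the link costs `bytes / link_bandwidth` seconds, while prefilling T tokens costs about `2 * params * T / FLOPS` seconds, so streaming stops being the bottleneck once the prompt chunk is longer than `bytes_per_param * FLOPS / (2 * link_bandwidth)` tokens. All inputs below are assumptions for illustration: 64 GB/s approximates a PCIe 5.0 x16 link, 100 TFLOPS is a placeholder for sustained throughput, and 1 byte per parameter corresponds to 8-bit weights.

```python
def breakeven_prompt_tokens(bytes_per_param: float, flops: float,
                            link_bytes_per_s: float) -> float:
    """Prompt length at which compute time equals weight-streaming time.

    Dense model assumed: every streamed parameter does 2 FLOPs per token.
    """
    return bytes_per_param * flops / (2 * link_bytes_per_s)

def prefill_times_s(params: float, bytes_per_param: float, tokens: int,
                    flops: float, link_bytes_per_s: float):
    """Return (weight streaming time, compute time) for one pass over the model."""
    stream_s = params * bytes_per_param / link_bytes_per_s
    compute_s = 2 * params * tokens / flops
    return stream_s, compute_s

tokens_needed = breakeven_prompt_tokens(1.0, 100e12, 64e9)
print(tokens_needed)
print(prefill_times_s(70e9, 1.0, 8192, 100e12, 64e9))
```

Under these assumptions a prompt chunk of a few hundred to a thousand tokens is enough to hide the streaming cost, which is why prefill can in principle be offloaded to a card that cannot hold the model. Decode processes one token at a time, so the same trick does not apply there.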
nomorebuttsplz@reddit
with an m3u you can have a fat model like 8 bit minimax or GLM 5.1 and a small model for simpler parts of the workflow loaded at the same time.
I wouldn't want it if I knew I absolutely had to load in 50k+ context every single call but just making simple apps in python there is absolutely no need for that.
SkyFeistyLlama8@reddit
NPUs are the big question mark here. I've used them on laptops where prompt processing beats the integrated GPU at a fraction of the power budget but I'm stuck running much smaller 4B or 7B models only.
A multi-NPU inference accelerator card would be fun to play with.
Sofakingwetoddead@reddit
In actual real-world performance, which I would measure as TG and PP tokens per second at large context, the cards that perform poorly in your benchmark are far more cost effective than the solutions that perform well in your test. Cold prefill, prompt processing and so forth: that's the speed. RTX 6000 Pro: pp ~5k at 200k context. R9700: pp ~450 at ~200k context. 5k / 450 ≈ 11, and 11 x 1400 = 15,400 for the R9700s versus ~10k for the RTX... not even close.
In practical use, the r9700 delivers 1/10th the performance, with a much smaller model and less context, of a RTX 6000 pro. 1/3 the cost for 1/10th the performance
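The comparison above reduces to prompt-processing throughput per dollar. A minimal sketch using the commenter's rough figures (pp of about 5,000 tok/s for about $10,000, and about 450 tok/s for about $1,400); none of these are measured here, and the dictionary keys are just labels.

```python
cards = {
    "RTX 6000 Pro": {"pp_tok_s": 5000, "price": 10_000},
    "R9700": {"pp_tok_s": 450, "price": 1_400},
}

def pp_per_dollar(card: dict) -> float:
    """Prompt-processing tokens per second bought per dollar."""
    return card["pp_tok_s"] / card["price"]

speed_ratio = cards["RTX 6000 Pro"]["pp_tok_s"] / cards["R9700"]["pp_tok_s"]
price_ratio = cards["R9700"]["price"] / cards["RTX 6000 Pro"]["price"]
matching_spend = speed_ratio * cards["R9700"]["price"]  # R9700 spend for equal pp

for name, card in cards.items():
    print(name, round(pp_per_dollar(card), 3))
print(round(speed_ratio, 1), round(price_ratio, 2), round(matching_spend))
```

On these figures the R9700 costs about 14% of the RTX 6000 Pro (closer to 1/7 than 1/3) and delivers about 9% of the prefill speed, so the larger card still comes out ahead per dollar. This assumes prefill speed scales linearly across multiple R9700s, which it generally does not.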
a_beautiful_rhind@reddit
3090s aren't overkill. There are image models, training, and everything else under the sun. P100s, macs and all that stuff is only really LLMs. If anything they are getting slow.
DiscipleofDeceit666@reddit
Maybe the apple angle doesn’t have to be about raw ai speed. Maybe it could be about AI swarms. Semantic search by phrase within text and it sends 1000 agents to find the files for you.
Vegetable-Score-3915@reddit
The Apple angle is also how to get that amount of unified RAM vs multiple GPUs. What are the opportunity cost and the practical matters involved in splitting a large model across multiple GPUs? I.e. 2 16GB GPUs is not the same as 1 32GB GPU. I appreciate OP's post. It just should have some caveats
last_llm_standing@reddit
What do you mean bandwidth is not everything?? give me one good example where having more compute but limited bandwidth was useful (other than for prefill ofcourse)
zipperlein@reddit
The worst problem with the older Pascal cards is the missing mainline support in vllm, imo. Especially for multi-card setups, I would not get more than one of those.
fugogugo@reddit
why no 5060 Ti 16GB ?
Winter-Editor-9230@reddit
Acer veriton is 3700, their version of the gb10
Candid_Ad_6752@reddit
Where rtx pro 6k at? We have feelings too
FormalAd7367@reddit
Amazing work! Let me read… looks like it clarified a lot of myths
fractalcrust@reddit
did you look at RTX 6ks?
bick_nyers@reddit
FLOPS is love, FLOPS is life.