Model Size Backend GPU Layers Batch Size Test Performance (Tokens/s) llama 3.1 8B Q4_K_M 4.58 GiB ROCm 99 512 Prompt Processing 2935.69 ± 36.32 llama 3.1 8B Q4_K_M 4.58 GiB ROCm 99 512 Text Generation 94.46 ± 0.22 llama 3.1 8B Q4_K_M 4.58 GiB ROCm 99 1024 Prompt Processing 2900.54 ± 22.52 llama 3.1 8B Q4_K_M 4.58 GiB ROCm 99 1024 Text Generation 93.85 ± 0.22 llama 3.1 8B Q4_K_M 4.58 GiB ROCm 99 2048 Prompt Processing 2880.82 ± 5.92 llama 3.1 8B Q4_K_M 4.58 GiB ROCm 99 2048 Text Generation 93.20 ± 0.18

[-]

Thrumpwart@reddit

I love my chonky boi. It runs about 5-10% slower than 7900XTX, but the 48GB Vram is incredible. Perfect for Qwen3.6 27B with high context runs.

I love it for agentic coding overnight runs. Rock solid stable.

[-]

__no_author__@reddit

I would add:

AMD RX 7900XT, 800 GB/s
AMD RX 7900XTX, 960 GB/s

[-]

Only-An-Egg@reddit

M4 Pro Mac Mini 273GB/s
M4 Max 32 Core Mac Studio 410GB/s
M4 Max 40 Core Mac Studio 546GB/s
M3 Ultra Mac Studio 819GB/s

What you fail to mention is max memory capacity:

24GB - RTX 3090
32GB - Intel Arc Pro B70, RTX 5090
48GB - Mac Mini
64GB - M5 Pro MacBook Pro
96GB - M3 Ultra Mac Studio
128GB - Strix Halo, DGX Spark, M5 Max MacBook Pro, M4 Max Mac Mini
256GB - M3 Ultra Mac Studio
512GB - M3 Ultra Mac Studio

[-]

NiceAttorney@reddit

Strix Halo and DGX Spark are also shared memory systems too.

[-]

onetwomiku@reddit

strix halo only needs to reserve 512Mb for system (some vendors locks it at 1GB)

[-]

CalmSpinach2140@reddit

https://medium.com/@se.mehmet.baykar/increase-vram-on-apple-silicon-for-local-llms-1b35c453b165

You can override default macOS ram allocation. No need to restart either.

[-]

Only-An-Egg@reddit

True. I don't know how much memory the OS needs to reserve on those. Running headless Linux would take up a lot less memory than macOS.

[-]

mycall@reddit

prices not available

[-]

DeProgrammer99@reddit

Could make a shared Google Sheet and include recent prices and FP8 FLOPS and such, too.

[-]

In_der_Tat@reddit

Please do.

[-]

Only-An-Egg@reddit

No. Do your own research.

[-]

DeProgrammer99@reddit

I already did and have such a sheet and have made shared sheets for things before to allow others' input. Why bother sharing at all if you're not going to share...meaningfully? Searchably, usably, notjustwastingyourowntimefully? 🤷‍♂️

[-]

DeProgrammer99@reddit

Also, I wasn't trying to say you specifically should do it. I was going to comment at the top level but saw you contributed more to it, so I replied to yours because of the "multiple people contributing to the same dataset" context.

[-]

truthputer@reddit

You missed the R9700 32GB, which is in my opinion extremely underrated and a bargain.

[-]

lannistersstark@reddit

Why would anyone get GDDR6 over HBM2 which is MI60/MI50?

[-]

Total-Buy2684@reddit

You can assign more memory to llms with a command prompt in Mac. Can squeeze a few more gb if you close everything else.

[-]

addiktion@reddit

No numbers yet for the M5 Max chip? That would give us a rough idea of where the new M5 Ultra would land.

[-]

Only-An-Egg@reddit

The 32 and 40 core model speeds are listed in OP's image

[-]

AIgavemethisusername@reddit

Nvidia RTX 5070 GPU, 896 GB/s

[-]

SBoots@reddit

Nvidia RTX 4090 GPU, 1,008 GB/s

For anyone wondering

[-]

kwizzle@reddit

For some reason we rarely hear people talking about 4090s, probably something to do with being a lot more expensive than a 3090 and nearer in price to the 5090 for less VRAM and speed.

[-]

sfifs@reddit

VRAM is too limited. The smallest really competitive local model in my benchmarking right now is Qwen 3.6 35bA3b whose NVFP4 variant requires about 36GB minimum to barely run with concurrency of 1. Smaller models are still not really competitive in terms of instruction following and coding accuracy. So I'd look at at least unified RAM systems of 48, 64 or 128GB for anything effective.

[-]

kwizzle@reddit

Yeah but you can run qwen 35b by offloading experts to cpu very well with the 4090, and besides 27b is smarter and fits well with a 4 bit quant.

[-]

sfifs@reddit

Hmmm.. what's the throughout hit you practically see doing that? I use a DGX. Interestingly enough while I fully expected 27b to be smarter, I found they benched almost the same - here are my benchmarks - https://srinathh.medium.com/mid-size-local-models-are-now-competitive-for-ai-agents-7696b2e8b535

[-]

kwizzle@reddit

I just tested with Qwen 3.6 35b and I'm getting 55tok/s right now.
For some reason only 9.1/24gigs of VRAM on my 4090 are used and my PC memory use by llamacpp is 19.7gb.
By way of comparison when I run 27B fully in VRAM without MTP I get about 45t/s.
As for benchmarks, I always take those with a big grain of salt and I prefer testing models for my specific use cases which are mostly coding related. That being said, chatting with 35b right now gives me the impression that it might be better at general language, though I am certain that 27B is a better coder.
I'm using the following to launch it:
llama-server -m "E:\AI Models\Qwen3.5-35B-A3B-Q4_K_M.gguf" --alias "qwen3.6-35b-a3b" --host 0.0.0.0 --port 8080 --ctx-size 32767 -n 32676 -ctk q8_0 -ctv q8_0 -b 512 -ngl 99 --mlock --no-mmap --jinja -fa on --cpu-moe

[-]

sfifs@reddit

Nice! I run on vllm with the full 256k context because I find that in my openclaw turns, i routinely run in the 50k-150k token range on context with all tools, memory & session conversation history loaded

[-]

Aumcoming_Inquiry@reddit

Nice - i'm using vllm - I default to using the full 256k context these models support because in my openclaw turns, I find contexts routinely run in the 50k-150k token range with all the tools & memory and covnersation session histroy etc loaded.

[-]

Caffeine_Monster@reddit

Native fp8 is nothing to laugh at - though you really need two 4090s to get the most out of them in terms if gpu only deployments.

3090 is still the value king, and it's not even close. Only real reason to go mac is low power / always on applications.

[-]

Lost-Vermicelli-6252@reddit

I have two 4090s but they are in diff machines. If I moved them to same machine, does it use the compute from both or just the VRAM?

I’m debating whether or not the new PSU/case/cooling would be worth the effort.

[-]

Caffeine_Monster@reddit

It's worth it as it doubles the compute and bandwidth if you deploy models correctly with tensor parallel.

48gb vram @ fp8 can get you a long way.

You don't necessarily need to change much cooling wise, and you can use 2 PSUs if you want to cut corners.

[-]

formlessglowie@reddit

This. I run Qwen 3.6 27b at fp8 on two 3090s, full context, image processing and with MTP, getting a consistent 60+ tok/s in decoding. It’s seriously powerful for agentic tasks and coding in general, I’m a professional software developer and a lot of my production code nowadays is made by the GPT 5.5 plan + Qwen3.6 27b execution combo, I sometimes need a code review from 5.5 and then another coding round from 27b but that’s it. It’s beyond incredible I can actually ship production code from my Chinese motherboard and used GPUs, this was unimaginable six months ago.

[-]

indyfromoz@reddit

Could you please share your rig setup? I have a RTX 4090 with a AMD 12-core CPU, using it for mostly gaming. I would love to get rid of Windows, install a Linux distro for just running LLMs

[-]

formlessglowie@reddit

Huananzhi X99 F8
Xeon E5-2696 v3
2xRTX 3090 (vLLM)
1XRTX 3080 (for TTS mostly)
4x16GB DDR4 2133MHz ECC

All GPUs were bought used, CPU is obviously used, RAM sticks probably are too, motherboard is a Frankenstein. I love that I can run something as ridiculous as 27b on this freak. I live in strange times.

[-]

Ok_Rope_9332@reddit

Have you tried Gemma4 31b?

[-]

formlessglowie@reddit

Not much tbh, as benchmarks are behind 3.5 27b, so I didn’t think it vs 3.6 was even a question worth considering. Is it that good? I’ve tried 26b a4b, and it’s very good for natural language stuff but fails long running agent sessions, which is what I use these models for (long coding sessions basically). Is 31b much better in that sense?

[-]

indyfromoz@reddit

Thank you 🙏

[-]

DonkeyBonked@reddit

I use: Huananzhi H12D-8D AMD EPYC 7502 128GB RAM 4x RTX 3090 24GB (I cap them at 250W) Ubuntu 24.04 LTS

Allegedly, I "should" be able to add more cards via converting my three Mini-SAS-HD (SFF-8643), but I'm very skeptical, the Huananzhi bios has been a pain in the rear for me.

I'm considering switching to PCI-E x16 to x8/x8 splitters when I get the money for more GPUs depending on how the other adapter goes. I do have a Mini-SAS-HD to OCuLink adapter, I just need a card to test with.

The worst part of this system is that I can't really make use of the BMC. If I enable the BMC and I change even a single setting from default in the bios, I immediately lose the ability to see the NVME slots.

If I had the money, I'd have gotten a different board, but the ones I would have wanted were all well over 1k.

[-]

Fit-Palpitation-7427@reddit

I did exactly that and never looked back

[-]

Lost-Vermicelli-6252@reddit

Same. I used to “need” windows for certain multiplayer games, but don’t really play them anymore, so have one of my machines running CachyOS instead. It’s amazing. Boots up so much faster than windows and stuff isn’t as… annoying.

[-]

tmflynnt@reddit

I also have two 3090s and am looking at all the various options for optimizing stuff. Would you mind sharing a bit more about your inference software setup and what you use harness wise? I assume you are swapping between Codex and something like Pi or OpenCode?

It would be nice if there was something out there that would smoothly combine frontier planning + local execution in one polished and reliable setup, but I don't think there's a one stop shop for that quite yet from what I've seen.

[-]

voyager256@reddit

But you use Q8 for KV cache too to fit full context , right? Also Wouldn’t a good Q6 quant be better for 3090(assuming you run on llama.cpp or its forks)?

[-]

formlessglowie@reddit

Yes, forgot to mention, Q8 for KV cache. I find it to be virtually free lunch, never ran into any apparent issues (Q4 is another story, can be very good or downright unreliable, depends on factors). I run this setup on vLLM for tensor parallelism, that's how I'm getting 60+ tok/s (and I'm on PCIe 3.0 x16, if I were on 5.0 this could easily border the high 80s or even 90s). Q6 would be very good indeed if I were using cpp.

[-]

Fit-Palpitation-7427@reddit

VLLM to do tensor parallel I guess?

[-]

formlessglowie@reddit

Yes, forgot to add that detail.

[-]

etaoin314@reddit

Going from 1 to 2 is a world of difference! A system with 2 4090 would be a monster. All you need is a motherboard that can bifurcate the PCI and you’re Gucci.

[-]

BosphorusScalene@reddit

I added a 2nd GPU to mine externally to skip the new case, connected with an m2 oculink adapter, minimax GPU dock and a 2nd PSU. I'm sure it's not as fast as a normal pcie slot, but it's working great so far and was way easier than a new case.

[-]

inevitabledeath3@reddit

The reason to go mac is for RAM/VRAM capacity. Nvidia GPUs get very expensive if you need VRAM for bigger models.

[-]

FinancialElephant@reddit

What about model size though?

[-]

AcaciaBlue@reddit

and more memory surely?

[-]

raindownthunda@reddit

Definitely. INT8 seems to be becoming more viable and keeping 3090’s competitive. The speed difference between fp8 and int8 on a 3090 is 1.5x+

[-]

panchovix@reddit

For LLMs, 4090 is way more expensive than a 3090 for the same amount of memory and almost same bandwidth.

The 4090 will be 2x times faster on PP vs a 3090 tho. And also is about 2x faster on compute in general (diffusion like txt2img, etc)

[-]

FissionFusion@reddit

what stat is the determining factor in PP?

[-]

panchovix@reddit

Compute units and compute in general. So higher clocks and more cores are faster. Also perf per clock (aka IPC, for the same clock getting higher performance on newer GPUs)

[-]

hidden2u@reddit

wow 2x prompt processing is huge for giant agent system prompts

[-]

comperr@reddit

Yeah that's why i got 2 5090s. They're pretty fast. I am getting a 4th one when i have time to drive to microcenter

[-]

Clear-Ad-9312@reddit

at that point, why not just buy an rtx pro 6000 that is just about the same price as 2x 5090 and has more vram than 2 5090s?

[-]

comperr@reddit

The rtx pro 6000 is just a 5090 with 72 or 96gb vram. So it is only as fast as one 5090 even if you dont need all the vram. With 2 5090s i can literally fit 2 27b qwen3.6 with q8_0 kvcache in each card and run them simultaneously.

[-]

Clear-Ad-9312@reddit

price is still a big point of reason to go for the 6000 over the 2x 5090, that is what I am presenting here

[-]

Anonymous_Prime99@reddit

I did it for the wattage really. 300W for the power of the sun without turning the place into an oven? Yes.

[-]

Clear-Ad-9312@reddit

hmm, I thought you could just limit power anyways through the settings, but ok, sounds fair.

[-]

comperr@reddit

Likewise, i limit my 5090s to 425w. You can lower the power limit extremely easily

[-]

rbit4@reddit

I got 8 5090s at price of about 2200 avg. 3.5x less than price of rtx6000.. question is when do you need the 96gb vram

[-]

Clear-Ad-9312@reddit

mind letting me know how you came across an abundent source of cheap 5090s? I can only find them for like $4.5k

[-]

Late_Night_AI@reddit

Go ahead and grab one for me while you’re there, i need a 2nd 5090 for more vram. 🚗😎

[-]

comperr@reddit

The household limit is a killer im thinking about bringing a friend so i can actually get 2

[-]

rbit4@reddit

That's the reason why I have a 8 5090 rig.. 5x perf of a 3090 and 2x of 4090 using nvfp4 vs 4 bit quant on the rest

[-]

AcePilot01@reddit

pp?

[-]

ScorpiaChasis@reddit

prompt processing... unless it is really peeing

[-]

sibilischtic@reddit

They did mention txt2img

[-]

SBoots@reddit

It doesn't seem to get talked about very much. I have a 5090 and a 4090 in my system. I had the 4090 first and while the 5090 is clearly a big step up, the 4090 is no slouch!

[-]

kwizzle@reddit

This is sorta my situation, I had a 4090 from before prices were insane and I'm considering adding a 5090. Do you feel the 4090 keeps up well enough in speed when splitting a model between the two cards? And what models and quants are you running on there?

[-]

SBoots@reddit

My go to model right now is Gemma4-31B-Q8_0.gguf (31G) w/mtp-gemma-4-31B-it.gguf (491M) drafter model split across the two cards with a 128K context. I get about 65-70 t/s. I'm using the llamacpp Gemma4 MTP branch.

[-]

rainbyte@reddit

How much PP on 4090 vs 5090 with that model?

[-]

SBoots@reddit

I see about ~4000 t/s PP combined across both cards. llamacpp doesn't give me a breakdown per card. Model is too large to run on the 4090 for me to test each card solo.

[-]

Freonr2@reddit

$/GB/s and $/GB has always been poor since launch date. They don't make a lot of sense.

[-]

Mulster_@reddit

Also 40 series power connectors burning down

[-]

biogoly@reddit

Way more 3090s were made and used for crypto mining, so it’s a much bigger pool for used and affordable second hand GPUs.

[-]

eugene20@reddit

3090 outstanding value if you can find one, maybe worried about it needing a repaste or just burning out from age.
4090 people holding onto them like unicorns, just superb cards.
5090 just far too expensive for most consumers to justify.

[-]

lemondrops9@reddit

4090's have been 2x the price compared to 3090s in my area for as long as I can remember. Guessing the supply for 4090 was low as well.

[-]

KingSlayin@reddit

That's my card

[-]

wilhelmbw@reddit

Chinese bought them up to make 4090 48gbs

[-]

Endflux@reddit

And the 3090 TI has the same memory bandwidth

[-]

ahtolllka@reddit

Anyone who have full sized 4090 should consider converting it to 48gb

[-]

SBoots@reddit

would be amazing but that sort of modification is well beyond my skill level

[-]

tired514@reddit

Not to be confused with the 4090M; I'm running a Morefine G1 eGPU (16gb) at 576GB/s with a second one on the way. Crazy expensive, but awesome little DC-powered portable rig.

[-]

megawhop@reddit

Try using an extended PCIE riser card with a 90 degree angle. Lets you keep your PCIE speed and bandwidth without using oculink, or TB, etc.

I have 5090 as my main, and a 4090 using a riser card with a 3ft ribbon cable running into a 3d printed external GPU enclosure containing the 4090, a PSU, and the female end of the riser card.

[-]

tired514@reddit

Oh the Morefine G1s are fully integrated little boxen:

https://www.morefine.com/en-ca/collections/external-gpu

They're wickedly expensive for what they are but in my case form factor is everything. I live on a boat in the summer and I'm trying to avoid running the main inverter (that makes 120V AC) at all costs. These little Morefines run on 20V DC, so I can run them off my dedicated 12V->20V SMPS. Drops my idle consumption from \~100W to \~10W.

[-]

arbobendik@reddit

I'm using a 5090M and those mobile GPUs usually have really impressive memory overclocking potential. I am running a stable +250MHz (equals nvidia-settings +4000MT/s) overclock, bringing me to 2000MHz. This raises bandwidth from 896GB/s to 1024GB/s.

Maybe test a more conservative +125MHz first though, although GDDR6 on my old 3060m overclocked equally well with +250MHz (+2000MT/s in nvidia-settings)

[-]

tired514@reddit

Funny you mention it. 😄

I was literally thinking "if the 4090M is only limited by thermals / power, what's to stop one from overclocking the VRAM given memory bandwidth is the main constraint during inference?"

Turns out the 4090M uses a 256bit VRAM bus while the 4090 is 384bit, but I did end up overclocking a fair bit and yup - decent performance improvement!

I can't wait to see what kind of token rate I get with Qwen3.6-27B @ Q6 with two 4090Ms.

[-]

NineThreeTilNow@reddit

Nvidia RTX 4090 GPU, 1,008 GB/s

Native FP8 as well.

[-]

illforgetsoonenough@reddit

AMD 7900XTX, 960 GB/s

[-]

sillynoobhorse@reddit

300$ 3080M 16 GB

448.0 GB/s

[-]

AcePilot01@reddit

So either a 3090 or a 5090. Got it.

4090 have it's balls cut off for no reason or something?

[-]

SBoots@reddit

24GB of VRAM and plenty fast in my experience

[-]

Ok-Measurement-1575@reddit

Same for 3090Ti.

[-]

gggghhhhiiiijklmnop@reddit

Thanks man, I felt left out 🤣

[-]

sn2006gy@reddit

FYI, for B70 users, Intel just released an update that addresses Qwen 3.6 perf issues. May start getting closer to that 608 GB/s perf.

[-]

tovidagaming@reddit

The biggest issue I had with vllm which is what seems to be needed for llm-scaler, is how to compare vllm supported quants (INT4, Fp8, AWQ, etc) with models running the usual q4, 5, 6, 8 quants on llama-cpp. It just felt like comparing apples and oranges. And that's when I was able to get vllm to even work. I will have to try the new update in a docker container...

[-]

sn2006gy@reddit

i don’t even bother with small quants as they have too many side effects - 8 is good enough and works with vllm

[-]

tovidagaming@reddit

So you would compare FP8 in vllm with Q8 GGUF in llama-cpp?

[-]

One_Difficulty_39@reddit

I may have to retry using them

[-]

frostpearI@reddit

608GBps is still a long way

[-]

psychicsword@reddit

Are there any guides on how to get it actually working though? I am still running qwen3-vl because I kept running into crashing issues.

[-]

Massive-Question-550@reddit

honestly really liking the price for the amount of memory you get but the performance is abysmal right now to a 3090. hopefully they can optimize the software as there's no excuse to have an ai dedicated card lose to a nearly 6 year old gaming GPU...

[-]

sn2006gy@reddit

3090's are 1400 bucks now on FB/Ebay and there are SO MANY FAKE / SCAM SELLERS

That $950 for new with warranty seems much more worth it.

Do I wish pricing was better? heck yeah... But i'd rather take my chances on NewEgg and run 2-3 of these cards and in both cases, we all win vs the $10k RTX6000 Pros (even though its faster, it's not 7,000 dollars faster)

[-]

Fit-Palpitation-7427@reddit

You see a big intelligence difference between q4 and fp8 ?

[-]

sn2006gy@reddit

on Qwen 3.6 27b - yes.

[-]

Evanisnotmyname@reddit

Hardware doesn’t just “go bad” or “wear out” that easily.

Yeah sure on some level it does…but PC parts are one of the few cases where you can tell pretty quockly if it’s working or not, test, and if it tests good it’s good.

It’s not like they get slower overtime..either it’s working, maybe working at 98% original capacity, or not working at all.

[-]

sn2006gy@reddit

3090's have been around the block with gaming, overclocking, mining and now AI - i know things don't "Wear out" but fans and paste do and if those fans and paste haven't been maintained then it causes heat failure in areas where I don't want to bother fixing it

And that's why you see 100s of gpus for sale or sold as not working/broken.

[-]

comperr@reddit

I have a 3090 hybrid for sale $1999 and 3090ti $1800 on eBay lol

[-]

Massive-Question-550@reddit

you would be better off buying a stack of 5060ti 16gb right now if you are on a budget. mature software, warranty, plus good vram to dollar price point and you can parallel compute in certain setups for more performance.

[-]

sn2006gy@reddit

B70 supports parallel compute

For businesses i'm not recommending any of this

[-]

kitanokikori@reddit

Even next to an R9700 Pro, the B70 is roughly the same price and like, 50% of the perf

[-]

takuonline@reddit

Even on Vulcan llama.cpp?

[-]

Massive-Question-550@reddit

yes. that's were it's performance is best and most stable. someone posted in depth performance comparisons between it and the 3090 using vulcan and it got less than half the performance most of the time. it was bad.

[-]

overand@reddit

If the software stack is actually stable, I'd probably recommend a B70 over 3090s for a business, because of the whole "used card gamble" thing. A bit slower performance with a bit more cost, but with a lower power consumption profile and a warranty & current support would probably push that over into "worth it" in that use case.

That said, yeah, you'll pull my dual 3090s from my cold dead hands. (Especially since I used some Dell OEM ones that are shorter than any others - in theory, I can put my stack of 8 3.5" drives back into my case!)

[-]

CoolConfusion434@reddit

I will share these bench stats if ya'll don't chase me out for being on Windows 😉

The other side of this box runs Ubuntu Server 26.04 with both SYCL and Vulkan compiled from sources. On the Windows side, and just for the lolz, I downloaded the pre-compiled binaries. SYCL sucked, then Vulkan beat all other combinations for this particular model:

 .\llama-bench.exe `
>>   -m \Llama\Models\unsloth\Qwen3.6-35B-A3B-MTP-GGUF\Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf `
>>   -ngl 99 `
>>   -fa on `
>>   -b 2048 `
>>   -ub 512 `
>>   -p 512 `
>>   -n 128 `
>>   -d 4096,8192,32768,65536 `
>>   -r 5 `
>>   -o md
load_backend: loaded RPC backend from \llama\llama-b9413-bin-win-vulkan-x64\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) Pro B70 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from \llama\llama-b9413-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from \llama\llama-b9413-bin-win-vulkan-x64\ggml-cpu-alderlake.dll
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  21.27 GiB |    35.51 B | Vulkan     |  99 |   pp512 @ d4096 |      1766.38 ± 11.77 |
| qwen35moe 35B.A3B Q4_K - Medium |  21.27 GiB |    35.51 B | Vulkan     |  99 |   tg128 @ d4096 |         98.98 ± 0.09 |
| qwen35moe 35B.A3B Q4_K - Medium |  21.27 GiB |    35.51 B | Vulkan     |  99 |   pp512 @ d8192 |      1659.18 ± 11.02 |
| qwen35moe 35B.A3B Q4_K - Medium |  21.27 GiB |    35.51 B | Vulkan     |  99 |   tg128 @ d8192 |         95.15 ± 0.21 |
| qwen35moe 35B.A3B Q4_K - Medium |  21.27 GiB |    35.51 B | Vulkan     |  99 |  pp512 @ d32768 |        140.44 ± 0.40 |
| qwen35moe 35B.A3B Q4_K - Medium |  21.27 GiB |    35.51 B | Vulkan     |  99 |  tg128 @ d32768 |         78.27 ± 0.11 |
| qwen35moe 35B.A3B Q4_K - Medium |  21.27 GiB |    35.51 B | Vulkan     |  99 |  pp512 @ d65536 |         69.50 ± 0.25 |
| qwen35moe 35B.A3B Q4_K - Medium |  21.27 GiB |    35.51 B | Vulkan     |  99 |  tg128 @ d65536 |         47.41 ± 0.06 |

build: 6ed481eea (9413)

[-]

CoolConfusion434@reddit

Adding the Linux side results. This is on Ubuntu 26.04, and don't include the latest Intel SYCL fixes so it could get better.

For short prompts, Vulkan wins. For longer prompts, SYCL sustains prompt processing better.

ONEAPI_DEVICE_SELECTOR=level_zero:0

./llama-bench \
   -m /models/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
   -ngl 99 \
   -fa on \
   -b 2048 \
   -ub 512 \
   -p 512 \
   -n 128 \
   -d 4096,8192,32768,65536 \
   -r 5 \
   -o md

./llama-bench    -m /models/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf    -ngl 99    -fa on    -b 2048    -ub 512    -p 512    -n 128    -d 4096,8192,32768,65536    -r 5    -o md
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  21.27 GiB |    35.51 B | SYCL       |  99 |   pp512 @ d4096 |        862.35 ± 6.55 |
| qwen35moe 35B.A3B Q4_K - Medium |  21.27 GiB |    35.51 B | SYCL       |  99 |   tg128 @ d4096 |         69.11 ± 0.78 |
| qwen35moe 35B.A3B Q4_K - Medium |  21.27 GiB |    35.51 B | SYCL       |  99 |   pp512 @ d8192 |        811.73 ± 7.44 |
| qwen35moe 35B.A3B Q4_K - Medium |  21.27 GiB |    35.51 B | SYCL       |  99 |   tg128 @ d8192 |         63.68 ± 0.01 |
| qwen35moe 35B.A3B Q4_K - Medium |  21.27 GiB |    35.51 B | SYCL       |  99 |  pp512 @ d32768 |        681.18 ± 3.62 |
| qwen35moe 35B.A3B Q4_K - Medium |  21.27 GiB |    35.51 B | SYCL       |  99 |  tg128 @ d32768 |         48.99 ± 0.02 |
| qwen35moe 35B.A3B Q4_K - Medium |  21.27 GiB |    35.51 B | SYCL       |  99 |  pp512 @ d65536 |        555.86 ± 1.94 |
| qwen35moe 35B.A3B Q4_K - Medium |  21.27 GiB |    35.51 B | SYCL       |  99 |  tg128 @ d65536 |         33.64 ± 0.01 |

[-]

Positive_Kale@reddit

Do you have a link? I’m really thinking of buying that B70, but I will need to figure out the best way to use it

[-]

sn2006gy@reddit

https://github.com/intel/llm-scaler is the repo everyone is following. There are a few other repos on GitHub as people benchmark/test through the updates. It's had 4 releases in the last month, so Intel seems to finally be progressing through the prior growing pains.

[-]

M_Me_Meteo@reddit

What software stack?

[-]

sn2006gy@reddit

llm-scaler (vllm) https://github.com/intel/llm-scaler

[-]

dr_DCTR@reddit

Can the B50 compete with the B70 for smaller models below 16GB?

[-]

smallDeltaBigEffect@reddit

since the performance delta is mainly software-based, you will maybe get like 10% less net bandwitdth

[-]

sn2006gy@reddit

perhaps, but my b50 drives plex so i haven't tried LLMs on it.

[-]

WizardlyBump17@reddit

openvino 2026.2.0 was released yesterday and it adds support for gemma4 and qwen3.5. I tried the nightlies before and it is really fast, like 4k pp and 60 tg on qwen3.5 9b int4, though a specific nightly version tanked the performance of it later... That is on a b580. I wanted to try qwen3.6 35b and 27b, but i guess openvino isnt very great for cpu+gpu combos

[-]

lukistellar@reddit

Oh, I see we still are ignoring cheap AMD GPUs. Good for myself, just bought an used RX6800 16GB for 250€ the other day. RX 7900 XTX with 24GB go for as cheap as 500€ here in central Europe.

[-]

codsworth_2015@reddit

Yep, anything with good gguf support goes straight on either my XTX or Mi50's. Save the 5090 for when cuda is required like faster whisper and yolo. Might get another XTX if the price ever comes back down to what I paid for the first one but I'm not holding my breath and I'm not paying $200 extra for the same thing out of principle. Same with the Mi50's they are now triple what I paid.

[-]

qbiker@reddit

Same here, last week. What are you planning on using it for?

[-]

mrpmorris@reddit

The ones with very fast memory have much lower memory, so can only load smaller models which give much worse output.

If you combine multiple cards to increase that memory, then you introduce a lag in inter-card communication.

So this info is not the WHOLE story.

[-]

porkchop_d_clown@reddit

Maybe I'm one of the ones who needs to see this...? I worked in High Performance Computing for 25 years, retired at the end of 2024. It looks to me like you're comparing the TB5 speeds of a Mac Mini with the NVLink speeds of the NVidia cards?

But that means I really don't know what the numbers for the laptops mean...

[-]

SupersonicSpitfire@reddit

And ranked by GB/s per euro?

[-]

Opening-Broccoli9190@reddit

For those who wonder yet:

Single channel RDIMM DDR5 is 40GB/s,

Dual channel RDIMM DDR5 70GB/s

Quad channel RDIMM DDR5 is 140GB/s

[-]

devshore@reddit

M3 Ultra Mac is somewhere in the 800-900 range close to the 3090

[-]

spammmmmmmmy@reddit

For the M series you really have to see whether they are blank/Pro/Max/Ultra as they differ in the memory bandwidth.

[-]

dim_amnesia@reddit

Brain dead people somehow still spending $9500 for 10 tokens per business day and tiny context windows

[-]

twnznz@reddit

Does M5 improve prompt processing over M4 meaningfully?

[-]

Svobpata@reddit

From what I was able to find out, yes, noticeably so. The new neural units on each GPU core help with prefill

[-]

spammmmmmmmy@reddit

I don't know but somebody posted on that topic in the past 2 days I think. There was mention that the faster CPU will achieve better prefill time.

I have been chatting with Claude about the performance topic. He thinks there is no substitute to empirical testing. I may quit ollama and migrate to VLLM in order to understand the pieces of the inference process better.

My notes during my shopping for an M1 Max: ∙ M1 Pro: ~200 GB/s ∙ M1 Max: ~400 GB/s ∙ M2 Max: ~400 GB/s ∙ M4 Max: ~546 GB/s ∙ M1 Ultra: ~800 GB/s

[-]

Standard-Potential-6@reddit

Keep in mind Apple’s are also theoretical numbers summing the memory bandwidth of the CPU, GPU, and NPU.

Most workloads don’t break down like that and the GPU will only access memory at ballpark 60-80% of the total.

[-]

overand@reddit

Hugely different!

[-]

OkLettuce338@reddit

this is pure noise

[-]

dim_amnesia@reddit

Brain dead people somehow still spending 9500$ for 10 tokens per business day

[-]

dim_amnesia@reddit

Amount of brain dead people who bought mac ultra for LLMs is insane.

[-]

dim_amnesia@reddit

My RTX 6000 pro does \~1.8 TB/s

I am convicted people who compare DGX spark to it are brain dead.

[-]

leinadsey@reddit

Of course you get more tps out of a 5090 than a MBP, but the 5090 doesn’t have 128 GB memory for not-insane-money and oh.. yes, it comes with a computer too.

[-]

GameBoyRay@reddit

my whole circle of the nonunderstanding normie ass wife needs to see this.

[-]

substance90@reddit

U missed a hidden gem in there - the AMD 7900 XTX - literally twice as fast as my M4 Max MBP Pro for inference as long as the model fits.

[-]

Electrical_Table5543@reddit

Why is the transfer speed on the DGX spark so low???

[-]

FragmentedHeap@reddit

You missed one,

Nvidia RTX 4090 1008, GB/s,

You can get one $1800 ish which is much cheaper than a 5090 and you can get two 4090's cheaper than 1 5090 😄and that gives you 48 gb vram.

And if you are willing to mod them, and ship them to china, for about $150 each you can get them to be 48 gb, so two modded 4090's is dual 48gb for 96gb vram at over 2000 GB/s total.

You also left off the AMD Raedon 9700 AI 32gb vram card, which has 640 GB/s but comes with 32 GB Vram and is around $1300.

But... 2-4 Raedon 9700 AI cards is the best bang for buck with tensor parallelization. Sapphire makes one, it's $1379 on newegg.

[-]

tired514@reddit

Wait wait what's this Chinese modification to double a 4090's RAM? I found a few vids talking about how to do it, but there's a company that'll do it for $150?

[-]

FragmentedHeap@reddit

I think I have the ability to do it myself, but I'm not risking my $1800 gpu to try it. I have all the tools and a full lab...

[-]

tired514@reddit

I've got the tools as well and feel like I could probably pull it off.. but under $200? Including the memory itself? That seems irrationally cheap.. :/

[-]

Yes-Scale-9723@reddit

yes but the are too expensive and i'd never pay that price for a modded gpu. used 3090 is the way

[-]

FragmentedHeap@reddit

4090 is more than twice as fast with the same vram.

[-]

slavik-dev@reddit

This guy is in US, does that:

https://gpulab.net/product?id=2

[-]

Far_Course2496@reddit

Those gpus are from Alibaba

[-]

tired514@reddit

It bothers me so much that Apple managed to produce a desktop with 400-600GB/s memory bandwidth but there's no equivalent in the x86 world.

[-]

vasimv@reddit

Some Intel and AMD server CPUs has quite big memory bandwidth (Amd Epyc and Threadripper CPUs, Intel Xeon 6). Like AMD 9124 - 12 channels DDR5 with 460 GB/s. There are even Intel Xeon with HBM2e memory (like 1.6TB/s theoretical bandwidth). But building with these will cost quite much as need shitload of registered ECC DIMMs to fill all channels, CPUs costs quite much, same for servers motherboards.

[-]

tired514@reddit

Ya but are they unified with the GPU? That's the crux of the problem. :/

[-]

TokenRingAI@reddit

Intel and AMD have reacted to this, but it takes basically 5 years for that reaction to become a product you can buy at consumer prices

[-]

mjsxi__@reddit

why does this bother you? seems dumb to be bothered that a company is making something decent...

[-]

Look_0ver_There@reddit

What a strange interpretation of their statement. I read it as they are bothered by the fact that there's no x86 equivalent that matches the Apple laptops.

[-]

Keep-Darwin-Going@reddit

Is not the main problem being stuck at 24gb? That is why people are using Mac mini so they can go like way higher, speed is nothing if you are stuck using a crappy model.

[-]

complexminded@reddit

That's what I figured out when comparing my 2x 3090 cluster to my dgx spark cluster. The models I can run on the DGX spark, while considerably slower, get way more use than my 2x 3090 cluster. There are times when speed matters (classifying 40k comments) and I'll use my 3090 cluster for that. Everything else goes to the DGX Spark cluster (95%) regardless of speed.

[-]

KURD_1_STAN@reddit

I havent seen / there arent many benchmarks comparing dgx vs 3090/3090s, so im assuming based on my instincts here, but what model can be ran on dgx that cant be ran with gpu with ram while still being faster? I can only think of garbistral medium

[-]

Badger-Purple@reddit

I’m running Qwen-397b, Minimax 2.7, Mimo 2.5 or DS4 Flash on dual sparks. You can’t run those in 48gb VRAM. With offloading, even the 6000pro on a DDR5 system gets slower than the dual sparks.

[-]

complexminded@reddit

On a 4-node Spark? I get if you only have one dgx spark. Not saying it isn't possible to accomplish the same build with gpu's, but for me the simplicity of the dgx being plug and play with less "moving parts" (heat, power, etc) beats a build on 3090 + system RAM. Yes the trade-off is speed.

All personal preference; everyone has different tradeoffs.

[-]

Spara-Extreme@reddit

Aren't 2 spark nodes approaching RTX 6000 PRO money?

[-]

complexminded@reddit

Depending on what you can get an RTX 6000 PRO for. They range from 11k to 13k based on my searches. I got my sparks at original price. So it's a 3-5k difference. 32GB less RAM but WAY faster interference. Trade-offs for sure. I'm happy with the route I went but not everyone would be.

[-]

Spara-Extreme@reddit

Ah fuck I'm out of date. I picked mine up for 8.8k in January.

[-]

complexminded@reddit

Yea, some would say I was stupid for the early adoption, and I agreed until I saw the price skyrocket. I knew when I bought it what I was in for so I took the leap.

[-]

Spara-Extreme@reddit

I want to build an always on LLM inference and I have a relatively high budget, but I'm constantly torn between a Spark cluster and just adding another 6000RTX pro to my current machine.

[-]

complexminded@reddit

Yea that's a tough (great) position to be in. I'm not sure how I would decide tbh but it would be largely based on my use case. For instance, I use my 3090 x2 cluster for classification/sentiment processing. I sometimes need to process +40k records a day. Speed matters when you're doing tasks like that. In that case I'd definitely go 6000RTX route because speed is important.

But if you're into fine-tuning, which I also am, the dgx spark cluster is nice because I generally dont care about speed when training, and having more VRAM capacity is more important

[-]

Spara-Extreme@reddit

Yea - thats the crux of it. The inference speed of the RTX is (borat voice) verryyy niceee but I'm also finding it enticing to use some of the larger models regularly. Thanks for your perspective. I'll need to think on this.

[-]

Icy-Pay7479@reddit

What models are you using? I've been doing a lot of research and I haven't seen impressive results from 128gb setups. It seems like 256/512 is the big step from 48/64

[-]

complexminded@reddit

I have 2 node cluster now but for 1 DGX Spark, I think the best candidates are the recently released Step 3.7 Flash - reported to get 20-25 t/s. Or Qwen3.5 122B A10B int4 AutoRound - I find it a bit deeper than Qwen3.6 and it can get 35t/s with mtp. Even Qwen 3.6 27B at FP8 gets around 17 t/s with mpt and I find that a lot better in quality than Q4 quants. And you can run it at full context with 3x concurrency.

But it gets more useful with 2+ cluster imo.

[-]

Icy-Pay7479@reddit

thanks for the info - the higher quants on smaller models is appealing! I was looking at a Strix Halo.

[-]

Keep-Darwin-Going@reddit

I have no idea how people are having success with those quants model, they tend to go into loops and error so often it is frustrating. So usually I only use those with full precision which most will not fit into my 4090.

[-]

Icy-Pay7479@reddit

Maybe it’s the harness? They work well in Hermes agent but for actual coding they kinda suck.

[-]

complexminded@reddit

No worries! BTW this isn't an ad. I'm not trying to convince you to go this route or saying it's the best way. Just sharing my experience

[-]

mrgreen4242@reddit

Right, that’s the missing data point here: how much RAM can each of those devices access at that speed? Even the regular M4 mini could, until recently, be configured with 32gb of RAM and the Pro version up to 64gb. The M5 MBP mentioned on this list can also be configured with 128gb of RAM.

So, yes, an Nvidia GPU can be up to 2x as fast, but tops out at 32gb of VRAM. You could get two of them and have 64gb but you’re looking at $4k PLUS the computer they’d go in. You can almost get an entire MBP with 128gb of RAM for just what the GPUs cost.

Plus it fits in my backpack and draws 140w tops (technically I think they can draw up to 200w for a short period by pulling from the power adapter and battery at the same time).

For comparison, a single 5090 can draw 575w. So for two of them PLUS a PC to put them in and a monitor (to compare “apples” to “Apples”) you’re going to be looking at 10-15x the power usage.

It’s not really a “this is better than that” situation as much as it is these are two different options that have similar price points and make different trades offs - more total RAM, lower power consumption, compact form factor vs. faster RAM speed but less RAM, larger form factor and higher power consumption).

[-]

Whyme-__-@reddit

I have 2 Sparks connected together running Qwen 35b MOA for my startup and what I have seen is that if you use DP2 for concurrency I can get 32 concurrent request at peak using both hardware. I have a whole benchmark of DP, PP and TP done I can share. These hardwares are awesome for what they can do which is loading the LLM on vram and holding it at same space for long time. Meanwhile in a Mac you can load the model but when the OS needs the unified memory for chrome it will boot the model out and prioritize loaded apps. Concurrency over speed gets you to do things like fine tuning, parallel processing, intensive work like benchmarks. If you just want a chatbot to run then you will get max 35tps on fp16 models which is not bad.

I use these for my startup and so do my customers and it’s a game changer.

[-]

koenafyr@reddit

Disingenuous comparison because you're comparing flagship Nvidia to a Mac mini. You could have suggested bunch of 3060 12gb for example

[-]

rpkarma@reddit

Same reason why the Spark is so fun.

It’s slow, that’s true. You’re usually maxing out around 500tk/s pp and 20tk/s decode but there’s not much else that lets you run models of this size for this price

For me though it’s more about being able to train and quantise and distill, testing my experiments on similar-ish hardware to a cloud rented system before uploading it

[-]

BallsInSufficientSad@reddit

Yes. This is why I still recently got the M3 Ultra 512GB

[-]

Badger-Purple@reddit

M2/3 Ultra, 850 Gbps

[-]

TechySpecky@reddit

Bro I wish I could find an RTX 5090 anywhere close to RRP

[-]

overand@reddit

i'm genuinely thrilled with my dual 3090 setup on a DDR4 system with a Ryzen 5 3600, even though one of them is PCI-E x16 and the other is x4!

$2000 MSRP for the 5090 with 32 gigs of ram, and good luck getting that price
$1800 for a pair of used 3090 cards on eBay (as of a month or two ago), total of 48 GB

Yes, there's stuff that doesn't like running split between two cards, but mostly it's been pretty unusual to run into stuff that wants more than 24GB but less than 32 GB of VRAM on a single card. (I think one of them SOTA-ish FOSS voice models is like that, but I'm not even sure.)

[-]

Myarmhasteeth@reddit

My only problem on getting another 3090 is how to configure it. I see setups yet since I only have like 3 SFX mother boards, I’m cooked.

[-]

overand@reddit

You can prooooobably get a 570-based AMD chipset board for not tooooooo much money. (And, I managed to push this to 128 gigs because I already had 2 32 gig sticks in it, and DDR4 is only "sell a kidney" price, not "sell both and also your liver" price)

[-]

Practical_Form_1705@reddit

What is performance of such setup, let say 8core ryzen + 128gb ram in compare to gpu?

[-]

overand@reddit

Oh, I doubt CPU inference would work very well, but, if you give me a model you want me to test, I can give it a try with a CPU-only build of llama.cpp

But, I use it with my 2x 3090 setup - but, that runs one at x16 and one at x4, but it's still decent!

[-]

_realpaul@reddit

The 3090 cant do the latest features but its still an awesome piece of tech.

[-]

ohhi23021@reddit

i haven't tested over 40k context yet but it does about 70-80 t/s around there. at 0-5k context it hits 90 t/s with mtp.

[-]

_realpaul@reddit

Nice.

[-]

palashjain_@reddit

I recently bought a second 3090 for my setup hoping the same. I too have ryzen 5 3600, msi x570 a pro with 2 pcie slots. But for some reason anytime i plug anything into the second slot (x4, chipset slot) the motherboard does not post display and shows a red light on vga. I have tried single gpu on slot 2 and two gpus together. Doesn't work. Only thing that works is single gpu on first slot (x16,) . If it matters i do have 2 nvme ssds and 64gb ram. I tried removing everything and starting with just single ram chip too. Same outcome. I tried bios settings like gen 4 gen 3 and that weird mining setting. None of those worked. Any help is appreciated

[-]

undisputedx@reddit

check the shared lanes thingy on the mobo website.

[-]

lemondrops9@reddit

Also look at the manual and be sure if this PCIe slot or NVME slot is used that PCIe slot is unavailable. Its not very common for an NVME to do this but never know until you check.

[-]

overand@reddit

I'd start by taking a bright light and inspecting the slot to make sure there isn't anything in there like a bit of paper, plastic, etc, and that there aren't any bent pins.

After that:

See if your BIOS is current
See if it will POST with both NVME devices removed
Review BIOS settings & motherboard manual
You may need to disable some SATA ports or something like that

[-]

palashjain_@reddit

I will try to look for the debris and bent pins. I did try after removing both nvmes. Did not work. I am not very savvy when it comes to motherboards. What is funny to me is that it only works when the second pcie slot is unoccupied.

[-]

Clean_Hyena7172@reddit

How well does Qwen3.6-27B run on that setup? What quant? And how many t/s?

[-]

overand@reddit

(Apologies for the formatting here - I really attentively formatted everything, and when I tried to submit it, reddit wouldn't allow it. I'll reformat from desktop in a few; doing this from an ipad with a keyboard misssing the right arrow is awful lol)

I've used a few different configurations - one is the "Club 3090" setup, which has specific configurations for single and dual 3090s.

But, here. A standard Q8\_0 config, an MTP config, and an MTP + NGram config.

All 128k ctx, Q8\_0 (and no cache quantizing).

* Stock model gets PP: 2027 and 27.1 gen.

* MTP model gets PP: 1371, Gen: 49.

* NGram configs skipped as they don't seem to add any performance

* Smaller quants skipped because lazy

#This one gets PP: 2027 T/s, Gen: 27.1 T/s

#

\[unsloth/Qwen3.6-27B-GGUF-128-ctx:Q8\_0\]

hf = unsloth/Qwen3.6-27B-GGUF:Q8\_0

ctx-size = 131072

temperature = 1.0

top-p = 0.95

top-k = 20

min-p = 0.0

presence-penalty = 0.0

repeat-penalty = 1.0

reasoning = on

# This one gets PP: 1346.8 T/s, Gen: 41.9

#

\[unsloth/Qwen3.6-27B-MTP-GGUF-128k:Q8\_0\]

hf = unsloth/Qwen3.6-27B-MTP-GGUF:Q8\_0

no-mmproj-offload = true

spec-type = draft-mtp

spec-draft-n-max = 3

no-mmproj-offload = true

ctx-size = 131072

The "no-mmproj-offload" gets the mmproj (vision support) offloaded to system RAM / CPU, so it'll still **work** if I need to use it, but it won't take up VRAM. (I used to just disable vision for a lot of these.)

[-]

overand@reddit

I've used a few different configurations - one is the "Club 3090" setup, which has specific configurations for single and dual 3090s.

But, here. A standard Q8_0 config, an MTP config, and an MTP + NGram config.

All 128k ctx, Q8_0 (and no cache quantizing).

Stock model gets PP: 2027 and 27.1 gen.
MTP model gets PP: 1371, Gen: 49.
NGram configs skipped as they don't seem to add any performance
Smaller quants skipped because lazy

This one gets PP: 2027 T/s, Gen: 27.1 T/s

[unsloth/Qwen3.6-27B-GGUF-128-ctx:Q8_0] hf = unsloth/Qwen3.6-27B-GGUF:Q8_0 ctx-size = 131072 temperature = 1.0 top-p = 0.95 top-k = 20 min-p = 0.0 presence-penalty = 0.0 repeat-penalty = 1.0 reasoning = on

This one gets PP: 1346.8 T/s, Gen: 41.9

[unsloth/Qwen3.6-27B-MTP-GGUF-128k:Q8_0] hf = unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0 no-mmproj-offload = true spec-type = draft-mtp spec-draft-n-max = 3 no-mmproj-offload = true ctx-size = 131072

The "no-mmproj-offload" gets the mmproj (vision support) offloaded to system RAM / CPU, so it'll still work if I need to use it, but it won't take up VRAM. (I used to just disable vision for a lot of these.)

[-]

Icy-Pay7479@reddit

I have both and managed to get a comparable qwen 3.6 27b setup on the 5090 by lowering context. It gets dumb long before 256k regardless.

Speed is similar between them, partially because I got an 8x 8x mobo and can make better use of tensor parallelism, but it was only a 10-15% boost.

Dual 3090 is the better value, but there are some models on the 5090 that absolutely scream in comparison.

[-]

overand@reddit

There are definitely a few models and tools out there that I've wished for a 5090 for, like some of the weird TTS models, or speech-to-speech ones. But, I even had "okay" performance on one of those "realtime 3d walk around in a hallucination" models with the 2x 3090s!

[-]

Icy-Pay7479@reddit

Don't fomo over in. 2x3090 is getting the most attention and optimization right now. You're at the right party and it's jumpin'

[-]

Afraid_Manner_5530@reddit

Hey just out of curiosity because I have the hardware, what about 5090 with a 3090 for overflow?

[-]

Icy-Pay7479@reddit

yeah for sure, only drawback is losing nvfp4 precision when you mix with an older card but it's unlikely you'd be optimizing for that anyways.

3090+5080 is 56gb, you could hold a much higher quant.

[-]

Afraid_Manner_5530@reddit

Ok thank you, I also have the really stupid idea of using an RTX pro 4500 blackwell in the same system with those other two cards because it's only 200 watts and I have a motherboard that can do x8/x8 gen 5 and also has a lower gen 4 slot with 4 dedicated lanes where I could stash the 3090. If I'm not mistaken this would cause me to take a significant hit to prompt processing but with 88gb of vram I think I should be plenty well usable, right? Certainly much better than falling back to system ram at least.

[-]

Antoniethebandit@reddit

Same here

[-]

Status-Secret-4292@reddit

I have found other 3090s I have considered buying to make a dual set up, but they are never the exact model of 3090 I have (gigabyte oc gaming) and from what I understand, it should be...

I wonder though, how important is that really?

[-]

overand@reddit

My undertanding is that it's not actually particularly important; maybe if you want to use NVLink, but even in that situation, I think it's explicitly allowed. (Double check me on that, though!)

[-]

Status-Secret-4292@reddit

I guess I was pretty much only considering it nvlink style as that seems to offer the best performance?

I appreciate the info!

[-]

Massive-Question-550@reddit

honestly best value setup. only better combo is if you got an old amd epyc before the shortage so you get the full 16x pcie gen 4 speeds per slot and can run large MoE models with all the ram.

also if you make the right setup you can have both cards work in parallel and cut your promp processing time by 30-40 percent and boost your token output.

[-]

m31317015@reddit

Did exactly that. 7B13, ROMED8-2T w/ 8x64GB DDR4, right now only have a 5090 and 3090 but can stuff another 3090 in. This is the best value setup (not counting GPUs) you can get at Q3-Q4 in 2025.

[-]

JustinPooDough@reddit

same. Also even have the one card in 4x. My CPU is only a 1600x.

It still gets like 50 t/s with Qwen 3.6 27B.

[-]

overand@reddit

You must be using MTP!

[-]

brickout@reddit

I'm rocking very similar to you after scoring a couple of 3090s for pretty cheap before the used prices went up. I love it.

[-]

marutthemighty@reddit

Aren't NVIDIA RTX 5090s (or whatever latest GPU NVIDIA have in their arsenal) already out of stock and taken up by AI enterprises in America and China?

[-]

TechySpecky@reddit

Well there's some available but closer to $4000

[-]

marutthemighty@reddit

That is...expensive.

[-]

compWizardLOL@reddit

I got one by signing up for those alerts then I started noticing patterns when they would happen so I would try to predict when one would happen. I got lucky and predicted before the notification went out and I got it

[-]

TechySpecky@reddit

alerts where? I'm in EU

[-]

Sophia1995_miam@reddit

rtx 6000 pro will be the same price as the 5090 if this keeps up. msi liquid is going for 45K

[-]

greentea05@reddit

I got a founders edition at RRP, put it in the PC and haven't used it once for LLMs or gaming - it should probably sell it and make that £1000 profit.

[-]

Far_Composer_5714@reddit

Any retrospect I kind of wanted to buy the $2,400 5090 before the mainstream ai craze. But it is what it is

[-]

MysteriousSilentVoid@reddit

LOL. I keep kicking myself for not jumping on it last August. I literally had purchased one at MSRP and cancelled the order.

[-]

bonesingyre@reddit

I regret not buying the 5090 FE when I got offered it for $1999 through Nvidia's website. I ended up buying the 5080 FE for gaming 😭

[-]

Xidium426@reddit

I feel incredibly lucky, I was unemployed and got one at MSRP during one of the Nvidia lotteries.

[-]

AnonsAnonAnonagain@reddit

Why though? Why wouldn’t you just get the RTX Pro

[-]

shuozhe@reddit

No worry, nvidia will support us. Didnt they announced a prices increase recently for 5090?

[-]

7EET-CS@reddit

I am struck by how not shit the M5 Pro is actually. I thought the gap would be much larger.

[-]

critsalot@reddit

GB/s != tokens/s though does it? also some models will fit in a 128 gb macbook but wont fit in a 3090

[-]

AlarmingProtection71@reddit

I can recommend AMD Radeon PRO W7800. perfectly balanced for MTB 32b Models.

Device / System	Memory Bandwidth	VRAM / Unified Memory
AMD Radeon PRO W7800	864 GB/s	48 GB GDDR6
Nvidia RTX 3090 GPU	936 GB/s	24 GB GDDR6X
Nvidia RTX 5090 GPU	1,792 GB/s	32 GB GDDR7

Möchtest du weitere Spezifikationen dieser drei Grafikkarten vergleichen?

[-]

pmttyji@reddit

I can recommend AMD Radeon PRO W7800. perfectly balanced for MTB 32b Models.

I'm also getting this soon. What stack are you using? Please share info. Thanks.

[-]

AlarmingProtection71@reddit

Sry for the late answer. I first had to configure my fastfetch module. I am experimenting with themeing and colors, so dont judge \^\^ Plus, when i bought the RAM (last september) it was only 330€ for two 48GB modules, two months later they went up to > 1500€, crazy timing !

[-]

AlarmingProtection71@reddit

Sure thing, i'll post it later. Currently i am @ the phone.

[-]

AlarmingProtection71@reddit

Sry for the late answer. I first had to configure my fastfetch. I am experimenting with themeing and colors, so dont judge \^\^ Plus, when i bought the RAM (last september) it was only 330€ for two 48GB modules, two months later they went up to > 1500€, crazy timing !

[-]

DoorStuckSickDuck@reddit

Do you mean the W7900? That's the 48GB one, the W7800 is 32gb and has a lower bandwidth

[-]

truthputer@reddit

As best as I can tell the W7800 48GB exists, was never released in the US, but is available in Europe.

[-]

AlarmingProtection71@reddit

Under "GPU Memory" > Peak Memory Bandwidth 864 GB/s

[-]

AlarmingProtection71@reddit

There are different Pro W7800 variants

[-]

Fun-Time9529@reddit

4 of those in a Mac Pro 2019

[-]

overand@reddit

MTB? Do you mean MTP? MoE?

[-]

AlarmingProtection71@reddit

Sry, it was a long day :D i meant MTP (Multi-Token Prediction).

[-]

overand@reddit

Hey, no complaints here, I'm just glad I don't have to learn something else! (And, I'm glad you put in that chart/table.)

[-]

joochung@reddit

The AMD MI50 32GB has slightly faster memory bandwidth.

[-]

pixelpoet_nz@reddit

Möchtest du weitere Spezifikationen dieser drei Grafikkarten vergleichen?

Thanks fellow German, who apparently can't write basic comments themselves. Why bother posting a canned response we can all prompt ourselves?

[-]

AlarmingProtection71@reddit

oh man :D komplett übersehen.

[-]

Altruistic_Welder@reddit

If only Apple makes nVidia GPUs play ball with macbooks - That was once the case in 2005-2006 when the 15" macbook pros came with nVidia GPUs. That relationship soured for some reason and Apple shipped Radeon GPUs but man I'd kill for a Macbook pro with a 5090.

[-]

Qu1etcocktease18@reddit

Still doesn't change the fact that I'm limited by how much VRAM I can actually afford to shove into my case.

[-]

triynizzles1@reddit

Rtx 8000 (48gb) 672 GB/s

[-]

mintybadgerme@reddit

That looks like a bargain card, but it's quite slow, isn't it?

[-]

triynizzles1@reddit

Not really. It can run 70b dense at 11 tps, which is the floor of performance. The 30b a3b sized models are all 90-110 tps. With some clever moe layer offloading this card + 64gb system ram can run all of the big 120b moe models at ~30 tps. Id day 30tps is useable still. Worth noting most modern models can be ran at max context too :)

[-]

spectre1006@reddit

Wow i feel better about my old 3090 its chugging along after a repaste

[-]

brenden77@reddit

Right, but also list the max RAM in each instance.

It's not so simple.

[-]

avariqfr30@reddit

The M5 Ultra is rumored to debut during WWDC. The Max already had 614, imagine that M5 Ultra. That might be in the 1200s..

[-]

NothingButTheDude@reddit

I am quite amazed the DGX Sparc is so slow!

But doesn't it handle way more load that that speed withut slowing down, whereas the other consumer level cards only handle a single user load?

So the DGX is actually WAY better for enterprise use?

[-]

use_net@reddit

Combined with M3 Ultra is incredible fast

[-]

techdevjp@reddit

Bandwidth is obviously incredibly important, but so is the amount of memory. 1.8TB/sec is wonderful, but only 32GB of it.

So that M5 Max 40-core MacBook Pro might be "only" 614GB/sec but you can stuff it with 128GB of memory for $5550.

Meanwhile an RTX PRO 6000 "Max-Q" has 96GB of 1.8TB/sec memory, but will run you $12k. (And you still need the rest of the computer to put it into.)

Bang for the buck, it's not hard to see why so many people still buy Macs to run local LLMs.

[-]

Pepper_pusher23@reddit

Yeah, I feel like this graphic completely misses the point.

[-]

techdevjp@reddit

There's a segment of users here who like to hate on people who buy Macs to run local LLMs.

The main issue with Macs is that the prompt processing is slow, so the time to first token can be quite long. That has been improved with the M5, but i don't think we'll see exactly how much it has improved until we get the M5 Studio later this year.

[-]

Sutanreyu@reddit

Shh, keep it a secret. Though, great for Apple. I just wish they'd drop their game changer AI already.

[-]

techdevjp@reddit

The M5 Ultra coming with the Mac Studio sometime later this year should double the bandwidth of the M5 Max, and double the GPU cores. With 256GB it will still be well under the cost of a single 96GB RTX PRO 6000. Saving my pennies.

[-]

StableLlama@reddit

This shows how interesting the Intel B70 is, money wise.

But so far I couldn't read much about the real live performance of that card for local LLM applications.

[-]

smallDeltaBigEffect@reddit

honestly, the R9700 32 GB is missing. And that fills the gap rather than the B70

[-]

mycall@reddit

R9700 32 GB

Worth $1379?

[-]

smallDeltaBigEffect@reddit

if you dont want /cant dual gpu setup with mid range 50xx or 40xx, and you don't want to buy used 3090, then the R9700 seems like the best option performance / VRAM / price-wise.

[-]

NeedsSomeSnare@reddit

As an intel owner, I assure you the real life performance isn't what it says on paper. I don't have that card to give specs on.

The problem is the software side of things is a bit messy. It's not terrible, but still needs a fair amount of work.

[-]

In_der_Tat@reddit

Why doesn't Intel hire enough competent developers to catch up with Nvidia? Would that be too expensive?

[-]

NeedsSomeSnare@reddit

A lot of people wonder that too. I'm guessing it's just related to corporate money saving bs. I'm sure that the people who actually work at intel know they need more staff.

It honestly appears only a handful of people work on the software.

[-]

superloser48@reddit

can you share any benchmarks on model/quant -> prfill and token gen?

[-]

Brian-Puccio@reddit

https://youtu.be/MnGLqo5cuGQ

[-]

NeedsSomeSnare@reddit

I don't have a B70, so it's of no use to anyone.

The other problem is that there are 3 ways to run models on intel, (SYCL, openvino and vulkan)all of which have different performance on different models.

The info is out there though. You want to look for Openvino benchmarks for the best performance. It has the worst compatibility though and is sometimes months behind something like llamacpp.

[-]

Upstairs-Extension-9@reddit

I got one for MSRP on release and quite pleased with it, I used my RTX 2070 before mainly for SDXL and Gemma 4. I’m very happy with the card especially for the price and save a shit ton of money I used to spend on Claude.

[-]

overand@reddit

I'd love to hear what models you're using, what backend, what quant, and what sorts of PP and Gen T/s numbers you have!

[-]

Upstairs-Extension-9@reddit

The thing is I have a very niche use case, for general day use and just thinking I use my Claude Pro plan. I’m an architecture model builder, architect and lifelong woodworker.

I have my own fine tuned Qwen 3.5 27B model that used to be run over Runpod and I trained it there as well, it’s directly connected through a VSCode Codelistener instance that can read and adjust my code for Rhino + Grasshopper through Python. Generally Rhino is a script based 3D modeling software that is perfect for custom Python or C++ scripts, many leading architects in the world use it. I’m not a software engineer but been making my own scripts for like 20 years now for various things from site analysis, parametric modeling and calculation for efficiency. I used to this all by myself, but since a few years Claude has helped me immensely improve my scripts and help me if I’m stuck.

Now this is all running on my B70 plus 96GB RAM and works like a dream so I don’t need the 200$ Claude plan anymore and pro is enough, Opus for planning and guiding Qwen and then I mainly use my finetune.

I spent most of my work day on CNC machines, laser cutters and general woodworking machines and LLMs have helped me a lot in recent years, and now I’m saving 2000$ a year with going fully local.

Honestly I don’t have exact benchmark numbers for you right now since I’m not at my workshop but I can get back to you in the coming days, it’s Friday today.

[-]

lloyd08@reddit

There was a few posts when it came out that effectively showed it matched the price point, but had the potential for growth assuming intel actually invests in the software space. So at worst, it's price point accurate.

[-]

crossoverXYZ@reddit

thanks for the heads up, good to know before I updated anything

[-]

Nice_Cellist_7595@reddit

Lol, but will they know what to do with it?

[-]

HugoCortell@reddit

The speed is basically wasted at those sizes. What's the point of going that fast if all you can fit is a small model. A cluster of mac minis is probably better off at the price. Slower, but you can run a more competent model.

[-]

siggystabs@reddit

Multi agent workflows can overwhelm slow setups

[-]

EndlessZone123@reddit

What multi agent are you gonna reliability do with <32B models?

[-]

siggystabs@reddit

I’ve been using 8B and above on well-defined tasks for about a year now.

I have pipelines that break down workloads and process in parallel batches.These batches are initiated by scheduled jobs, personal Claude/Codex agents, and API requests from various apps I’ve built. Some systems collect data, some analyze it, and some report on it. Recent models (like Qwen) can produce reliable tool calls and output if you can structure your processes. I have evals up and down the stack. Each pipeline stage is well defined.

If you need 32B models and above you are probably working with complex tasks that benefit from intelligence more than speed. That’s completely fine, it’s why I still have Claude and Codex subs. However, if you’re using high intelligence models to do basic VLM, you’re probably wasting time, money, or both.

[-]

see_spot_ruminate@reddit

Don't downvote this person. This obsession with bandwidth is the type of crap people say when their tricked out honda civic has such and such hp. It does not really point at the actually rate limiting step that is vram first.

[-]

Randomdotmath@reddit

People are downvoting because the comment makes it sound like you’re unaware that GPUs can be connected lol

[-]

see_spot_ruminate@reddit

right... even though the comment says a "cluster"...

[-]

2Norn@reddit

image or audio models are not that big. not everyone uses llms for coding.

[-]

XO33OX@reddit

yup, qwen3 VL exists for a reason

[-]

DoorStuckSickDuck@reddit

Eh, at some point you will run into the issue of wanting multiple parallel streams, at which point you will quickly understand that the bus bandwidth is your new bottleneck.

[-]

Cosack@reddit

Workflows, multi-shot inference, and tuning. Because these are lossy systems regardless of size, you should be building for all this anyway.

The most speed and cost effective setup runs easy to manageable tasks locally and bursts to SOTA models as needed. Because burst frequency is low, the cost of non-local calls is trivial, and privacy can be retained through obfuscation and local translation. Local speed and cloud burst for temporary model size increase is the optimal setup.

[-]

pixelpoet_nz@reddit

I hate that you're getting downvoted for this, as it's 100% true.

As the saying goes, "all the speed in the world doesn't matter if you're headed the wrong way". Buncha ADHD people out here who just want infinite tokens per second of absolutely anything / random trash

[-]

mrinterweb@reddit

When using more than a single card for inference, the PCIe bus is capped at 128 GB/s on version 6. So yeah. You either need a model that will fit on a single card or you need to accept that BUS cap. Small models can be quite capable though.

[-]

bcRIPster@reddit

And for most of us stugglebussin' on our 2019 gaming purchase: Nvidia RTX 2060 GPU, 336 GB/s

And for my surplus scrap bro's: Nvidia RTX A2000 GPU, 288 GB/s

[-]

suesing@reddit

Those 2 mac speeds are the max variants. But the pro. Base pro starts astound 370

[-]

Blackdragon1400@reddit

This is also useless without average prefill and token generation speed because they are wildly different between these platforms and architecture will make the memory bandwidth a non-issue in a lot of circumstances.

[-]

bennyb0y@reddit

Gimme that MacBook ultra pls

[-]

mintybadgerme@reddit

Can someone explain what the difference between these two is??

https://www.ebay.co.uk/itm/178177138808 https://www.ebay.co.uk/itm/406910850068

[-]

amatisig@reddit

2080Ti-22G/11G 616GB/s

[-]

realblindseeker@reddit

I’ll add: Jetson AGX Orin 64GB, 204 GB/s :-)

[-]

Shoddy-Tutor9563@reddit

that alone doesn't give the full picture. Something like this one does a little bit better job:

name	usable vram, Gb	price	fa	pp512	pp32768	tg128	power usage, Watts
3090	24	\~$700	1	5911	2361	174	\~300

... based on public benchmarks from llama-bench - a tool from llama.cpp project. The standard benchmark figures are assuming you're running TheBloke/Llama-2-7B-GGUF:Q4_0. Noone in the health mind uses it today, but it gives you a base reference that is comparable.

[-]

100and10@reddit

Intel Arc the absolute goat. Look at it go!

[-]

Steus_au@reddit

there is always 5060ti's at the price of mac with its 500gb/s at your possession, no need thankyou

[-]

Kubas_inko@reddit

Bandwidth is mostly useless, if you can't load the model in the first place.

[-]

comatrices@reddit

RTX 3080 16GB MXM, 448 GB/s

Any MXM card users here?

[-]

Alexal88@reddit

Guys, and how EXACTLY are you using it?

Please give me some 101s beyond the “I run local models on it” 🙏

[-]

Colecoman1982@reddit

This table is kind of useless without including price per GB/s and total vram per option. Also, I've seen others in this discussion point out that there are more competitive options that have been entirely left off this list...

[-]

BringOutYaThrowaway@reddit

The 3090 doesn’t get enough credit. Great performance for the money.

[-]

alphatrad@reddit

Dude conveniently leaves off the most compelling AMD & APPLE options to make NVIDIA look good.

AMD AI Pro R9700 GPU, 640 GB/s APPLE M3 Ultra, 819 GB/s AMD RX 7900 XTX GPU, 960 GB/s

Chart also doesn't account for max memory. So it's misleading on trade offs for why you might go unified over GPU.

This is the stuff that is causing so much confusion in these communities.

Low effort slop!

Low effort slop.

[-]

Art_4_Tech@reddit

I gave up my 3090 and I regret it. I've been running the strix halo and I'm trying to get some more serious performance.

What are peoples genuine thoughts on the gigabyte aorus ai box 32gb 5090?

It's pricey but I don't have a machine to put a normal unit in and I'd like to run an external enclosure if its viable for the money.

[-]

ideal2545@reddit

is a 5080 any decent? it’s what i got in a gaming rig, i think memory is the issue with it?

[-]

putrasherni@reddit

Bro ignoring R9709 completely

[-]

gAmmi_ua@reddit

RTX PRO 4000 Blackwell SFF (70W) - 432 GB/s Not fastest/cheapest, but pretty good at 70w cap

[-]

SkyResponsible3718@reddit

I wish the 5090 came with twice the memory. 32GB just isn't enough, and 64GB would be a complete game changer for me.

[-]

beasthunterr69@reddit

So MBA is out of the equation here?

[-]

crossoverXYZ@reddit

Been running local models for about a year now and the progress is honestly staggering. What used to require a 70B model can now be handled by well-trained 8B-14B models for most practical tasks. My daily driver setup is a 14B model for general tasks on a single GPU, and I only reach for larger models or API calls when I need that extra capability. The latency advantage of local inference is underrated too — for interactive coding assistance, having instant responses changes how you work with it fundamentally.

[-]

XO33OX@reddit

why we dont talk about rtx pro 5000 both 48GB and 72GB or rtx pro 4500 32GB, rtx pro 4000 24GB ?

[-]

MiniEval_@reddit

I have a 4500 because I just wanted to have a mini-ITX build that wouldn't blow up. A 5090 is by all means a better option when it comes to value if compute is the only concern, as it's slightly more expensive for double the bandwidth.

[-]

XO33OX@reddit

if 32GB VRAM is enough for you then single 5090 is superb (i have one), but it doesnt scale (space, heat, power, even with undervolt and aio version) well and creates a lot of headaches beyond that. On the other hand you slide 4500s one after another into standart workstation (trx50, wrx90e..) without much hassle.

[-]

LinkSea8324@reddit

Wait for OP to learn that he can use directly use text instead of storing it into an image

[-]

gandhi_theft@reddit

You left out the Apple M3 Ultra Studio which gets 819 GB/s

[-]

HuRyde@reddit

Nvidia V100 on eBay $99 is 900 GB/s

[-]

Various-Welder5544@reddit

Leave my cheap desktop alone

[-]

GeneralRieekan@reddit

This table needs a 2nd dimension: VRAM/Unified RAM amt

[-]

firetech97@reddit

Wow is the performance gap really that bug between a DGX Spark and a 5090?

[-]

nacholunchable@reddit

Yes! So many new spark users go down this rabbit hole on NVFP4 kernels and why their LLMs arent running faster, meanwhile token generation is speed bound by the memory bus and nothing they do will change that. How do I know? I went down the same rabbit hole when i got my spark half a year ago.

[-]

firetech97@reddit

I was eyeing one but have not done any actual research yet, which i was going to do before pulling the trigger. With RAM prices so high, the 128gb of unified for ~5k seemed like a better deal than building a 5090 rig, where the GPU alone is 4k and id spend at least another 2k in CPU, RAM, Storage, Mobo.

I probably would've come to the conclusion to build anyway over it, but it is an attractive all in one package with a very small footprint. I'll have to look into some benchmarks and go from there i suppose

[-]

nacholunchable@reddit

Ya, for sure. Honestly i still feel like the asus ascent gx10 (undercuts the other versions price with the same hardware in a different case) is a steal. I went for the 1tb version, it was 3k back then, 3.5k today (usd). Its a great unit for that price. I mean there were (and maybe still are?) some amd boxes you can get even cheaper, but you give up a touch of mem speed, a lot of gpu power, and close the door on clustering. If ur chill with 15 - 60 genned tps (depending on the model you run) and want the fat capacity and low energy cost, its the way to go imo. But if you crave faster speed, deeper upgradability, dont care about energy, want a real desktop for non-ai or gaming, a proper rig is better. I have no regrets, but I was expecting more performance going into this.

[-]

jakubl@reddit

There are 4 important factors when choosing hardware. They relative weight depend on the use case, and memory bandwidth is only one of them and very often not the most important one.

The total available memory. For LLMs the bigger memory the better model you can run with bigger KV cache = longer context. Super important for agentic AI with large context and models smart enough to do anything useful. That is less of an issue for image generation as models are smaller.
Memory bandwidth. That determines token generation speed, but this is only half of the perceived model performance, see next point.
Compute performance. That determines time to first token - a waiting before any response even applies. With large context it’s more important than token generation speed as it’s pure waiting time, and even very slow generation is faster than human reading speed. Smart agents also don’t need full llm response to start working and can start executing tools as soon as they arrive.
Energy consumption. Unless you have free power, that’s also important factor. Older hardware may be cheaper but usually is less energy efficient and it may turn out than renting or paying for API is cheaper than electricity cost.

And as I mentioned a lot depends on use case. If you are building interactive chat, the time to first token is the most important factor, then token generation speed. Human time is still orders of magnitude more expensive than hardware and electricity and if humans are sitting and doing nothing while waiting for AI response that is a huge loss. If building fully autonomous agents that work in fire-and-forget mode it’s less important factor, but the context and model capabilities are very important so that it can actually run without supervision. Getting crappy results but very fast is way worse than waiting for good results.

That’s why Macs are very popular - they can handle large models and if you can wait, you can get good results cheaply with lower energy usage. It’s kinda funny that Apple become the most cost effective hardware for a task. I believe it won’t last for long and seeing how easily they hardware is sold out someone at Apple would probably decide to raise prices 2x and still they won’t have any trouble finding customers.

You can optimize cost by adjusting workflows. Instead of waiting for response and interactively correcting model behavior, prepare batch, run it, go to sleep and wake up to finished job.

[-]

fuckable-switcher@reddit

And you forgot about the m5 max and then double that for the m5 ultra

[-]

fuckable-switcher@reddit

Dude you forgot about amd when it did its hbm card era

The Radeon 7 has close to 2tbs of bandwidth

[-]

Covert-Agenda@reddit

Soo much context is missing off this.

Mac Studio 800gb/s minimal power draw 256/512GB memory.

[-]

Koalababies@reddit

The power draw always blows my mind

[-]

fivetoedslothbear@reddit

Yeah, my 128GB M4 Max MacBook Pro isn't the fastest machine, but it only has a 140W power adapter and can do extended inferencing on a battery. And it's portable.

[-]

Covert-Agenda@reddit

Yeah I’ve got the m5 max variant and I can use some mega models locally.

Yeah not as fast as the 5090 but it’s portable.

[-]

Aardvark-One@reddit

The biggest issue I have with Mac is for agentic use. A lot of context is sent in the prompt when using agents and prompt processing on the Mac is incredibly slow. Although, the M5 has closed the gap a bit, it still can't get close to Nvidia.

[-]

MiaBchDave@reddit

Hot and cold (SSD) KV cache solves this issue. Unless your workflow is to RAG a different PDF document for every prompt by the thousands, otherwise agentic harnesses fly when using a proper prompt cache. In other words, this is a non-issue for local agentic work lately with the current systems (like oMLX) which are based on vLLM engines for multiple users but are repurposed for local agentic use.

[-]

heresyforfunnprofit@reddit

I need more explanation of this.

[-]

Aardvark-One@reddit

Thank you. That is something that I hadn't explored yet. Going to give it a go and see how it works out. Was giving up on local LLMs; t/s on the Mac was great but the prompt processing threw a wrench into the works.

[-]

Ok_Top9254@reddit

It's actually not impressive at all if you look into specs. It's a beefy CPU with an extremely outdated GPU using late 2010s level architecture. 26TFlops of FP32, no FP16, FP8 or FP4, some 36 INT8 TOPS from the neural engine. For reference 1080Ti has 45 TOPS of INT8 and RTX 2060 vanilla, has 52 TFlops of FP16, double that of Mac Studio.

With so little compute performance no wonder it uses so little power. The memory is also mobile LPDDR5X too, that consumes like 1.2W per 8GB. Except for the memory and CPU, you are basically getting scammed.

[-]

Covert-Agenda@reddit

Mine sips 75w max at full tilt ☺️

[-]

kiwibonga@reddit

My electricity bill is lower since AI because I don't do anything else.

[-]

Covert-Agenda@reddit

Hahaha same here!

[-]

droptableadventures@reddit

It's intentionally left off because it'd undermine the point of their Nvidia fanboy posting.

[-]

Covert-Agenda@reddit

I mean, raw throughout the 5090 or 6000 are monsters but they also burn a lot of juice.

I went with a PGX ThinkStation for my CUDA and the studio for MLX.

Works well for what I need.

[-]

droptableadventures@reddit

Yes but also only 32GB of VRAM on a 3090. So you can only run a very small model, even if it is fast.

[-]

Individual_Holiday_9@reddit

Also a Mac mini is $400 lol

[-]

Embarrassed_Adagio28@reddit

The watt per token is not as impressive on macs as you would think. Because macs are so much slower, their efficiency is deceiving. In fact I just had opus (could be wrong) calculate watts per token of a m3 ultra and rtx 5090, with Gemma 4 26b the mac studio only came out 10% more efficient per watt and 40% with qwen3.6 35b. Considering that a rtx 5090 is over twice as fast, that isnt very impressive for the mac.

Macs can handle huge models and are efficient but their slow speeds make it not worth it.

[-]

rpkarma@reddit

Correct; race-to-idle matters. If you have a very fast system and the fixed overheads aren’t bad, it can be more efficient to use the power hungry one and idle than have the Mac go much slower for the same task.

But it depends, too, on what you’re doing. YMMV.

You can run models on a Mac Studio that you simply cannot put on one or even two 5090s.

[-]

Hydroskeletal@reddit

I enjoy my office not being an oven in the summer

[-]

AnonLlamaThrowaway@reddit

Soo much context is missing off this.

sorry, the context was at q4_0, it got quantized too much

[-]

Covert-Agenda@reddit

Hahahahah touche 😎

[-]

TheRealDatapunk@reddit

An RTX Pro 4500 has half the memory bandwidth of the 3090, but is still way faster (15-70%) on pp and tg for me. Plus, the 32G allow for full context windows with most models targeted at the single gpu market

[-]

Total-Confusion-9198@reddit

Anything above 500 GB/s is a serious local LLM setup. Unified memory remains the underdog.

[-]

ea_man@reddit

Oh thanks, my 6800 at 512 GB/s is standing tall 😄

With some rust you can get 16GB of that for \~260e.

[-]

Away-Sorbet-9740@reddit

3090s evaporated from the Bangkok local market about a month after qwen 3.5 released. Went from dozen + at 22-25k baht, to 35-40k if you can find them lol.

100% the value sweet spot if you are buying today. I have a 4090 and 4070tis in separate rigs, and that extra 8gb is really the unlock to running capable local assistants.

[-]

drycounty@reddit

M3 Mac Ultra 819 GB/s

[-]

Fit_Assistant7953@reddit

everyom3 needs to see this

https://huggingface.co/FerrellSyntheticIntelligence/Vitalis_Devcore--- license: gpl-3.0 language: - en tags: - code - autonomous - self-healing - agent - code-generation - software-engineering - agentic - devops library_name: custom pipeline_tag: text-generation

Ferrell Synthetic Intelligence (FSI): Vitalis_Devcore

**Vitalis is a self-evolving digital engineer that lives inside your computer — capable of writing, testing, and fixing its own code to build whatever software you can dream up, without you doing the manual heavy lifting.**

Built entirely by one developer. No team. No funding. Four years of self-taught work.

What Is This?

Most AI coding tools are assistants — they wait for you to ask, then suggest. Vitalis is different.

Vitalis_Devcore is an **autonomous execution engine**. It receives an intent, writes the code, runs the tests, and if something breaks, it heals itself and tries again — all without human intervention. It is the "hands" of the FSI ecosystem, designed to operate alongside **[Vitalis_Core](https://huggingface.co/FerrellSyntheticIntelligence/Vitalis_Core)**, which provides the cognitive reasoning layer.

Core Architecture

Component	Role
`SovereignKernel`	Writes and scaffolds code to disk
`KernelDaemon`	Watches for tasks, executes them, validates results
`SelfHealingLoop`	Detects failures and autonomously attempts recovery
`KernelValidator`	Runs pytest against generated code
`ProjectLedger`	Immutable append-only audit log of every action
`InferenceEngine`	Confidence-gated response generation with RAG augmentation
`ConfidenceBridge`	Autonomously re-queries when confidence is in the hypothesis zone (0.45–0.65)
`Hippocampus`	Memory-mapped binary vector store for long-term recall
`ResonanceEngine`	Continual learning — adjusts kernel weights from interaction history
`ContextSerializer`	Serializes full project state for agent context windows

How It Works

``` You give Vitalis an intent ↓ CognitionEngine generates a plan ↓ KernelDaemon picks up the task ↓ SovereignKernel writes the code ↓ KernelValidator runs the tests ↓ Pass → ProjectLedger logs success Fail → SelfHealingLoop attempts autonomous recovery ↓ Pass → Recovered and logged Fail → Failure report generated for review ```

Getting Started

1. Clone the repository

```bash git clone https://huggingface.co/FerrellSyntheticIntelligence/Vitalis_Devcore cd Vitalis_Devcore ```

2. Install dependencies

```bash pip install -r requirements.txt ```

3. Start the Kernel Daemon

```bash python3 -m src.ide_kernel.daemon ```

4. Send your first task

```bash python3 -m src.ide_kernel.client scaffold my_module ```

Vitalis will scaffold a full module structure under `app/modules/my_module/`, generate a test file, run it, and log the result — all automatically.

REST Gateway (Optional)

Start the Flask gateway to send tasks over HTTP:

```bash python3 src/ide_kernel/gateway.py ```

Then POST to it:

```bash curl -X POST http://127.0.0.1:5001/execute \ -H "Content-Type: application/json" \ -d '{"intent": "scaffold", "module_name": "my_module"}' ```

Self-Healing Demo

```bash

Start the self-healing monitor in a separate terminal

python3 -m src.loop.self_healing

Trigger a task that fails — Vitalis will detect the failure

and autonomously attempt recovery without you touching anything

```

Technical Highlights

**Custom HDC Engine** — A compiled C extension (`hdc_engine.so`) for hyperdimensional computing operations including vector binding and bundling
**Memory-Mapped Neural Store** — `Hippocampus` uses `numpy.memmap` for persistent binary vector storage across sessions
**Confidence-Gated Inference** — The `InferenceEngine` uses a `ConfidenceBridge` to autonomously augment prompts when confidence falls in the hypothesis zone
**Temporal Knowledge Retrieval** — `train_self.py` supports querying memory nodes that were alive at a specific Unix timestamp
**Hot-Ingestion Daemon** — `watcher.py` monitors the knowledge directory and re-ingests new documents in real time

Governance & Integrity

**Quality Gates** — All autonomous actions require passing pytest before being committed to the ledger
**Immutable Audit** — Every action is SHA-recorded in `project_ledger.json`
**Failure Transparency** — All failures are written to `failure_report.json` before recovery is attempted

Roadmap

[ ] Connect Vitalis_Core LLM as the live reasoning backend
[ ] HuggingFace Space interactive demo
[ ] Natural language task input via CLI
[ ] Multi-agent coordination between Devcore instances
[ ] Web UI dashboard for ledger and task visualization

About the Developer

FSI (Ferrell Synthetic Intelligence) is an independent AI research project built by a single self-taught developer over four years — no formal education, no team, no funding. Just a vision, a tablet, and a GPU.

If this project resonates with you, a ⭐ star goes a long way.

*License: GPL-3.0*

[-]

RagingAnemone@reddit

Anybody out there doing 8 channel or 12 channel cpu inferencing? What kind of speed are you getting on big models?

[-]

ItstheRealMon@reddit

That's decent for a GDDR6X

[-]

Meterman@reddit

Rx6800xt 512Gb/s

[-]

vodanh@reddit

Doesn't vram size matter?

[-]

Even-Actuator-2608@reddit

Where's the little nvidia nano kit

[-]

BornInAFish@reddit

Intel Arc Pro B60 Dual: almost identical bandwidth and price as 3090, double the VRAM, and double the PCIe speed.

[-]

billatq@reddit

Okay, now adjust it for price for what you get.

[-]

Super_Sierra@reddit

Nvidiachuds in this subreddit don't understand anything, your words will be wasted on them.

Those 5090s 32gb at 15 will be 500 or so GBs of vram. But you will need to rewire your fucking house so you don't blow your breaker, and the power draw will be around 6000w.

That same unified memory macbook does that but at 150w max, and if the power goes out, well, it can do that on battery for three hours.

The macbook also costs 5x less lol.

[-]

valdev@reddit

Alright, now add two more columns. Cost per gb of RAM/VRAM. And cost to operate over an hour.

[-]

qalpi@reddit

What can my 3080 Ti do

[-]

Kikopedia@reddit

This seems wrong, I’m unsure what metric this is, my m4 mbp is a lot faster than my spark

[-]

joochung@reddit

AMD MI50 over 1000GB/s

[-]

Old_Grapefruit8774@reddit

MI50’s need more love from the community

[-]

joochung@reddit

I agree. I have 3 in my server and happily run gpt-oss-120b

[-]

KiDNEXTDXXR@reddit

I run perfect self tuned local llms on a 1660ti. Soon as I get money building these websites I will get a dual 5090 set up

[-]

here_n_dere@reddit

Also interesting would be to stack DGX spark, RtX pros, and their memory capacity (each)

[-]

Shoddy_Bed3240@reddit

The cheapest high speed option is 3090 ti, 1,008 GB/s

[-]

migsperez@reddit

I bought my first GPU ever today, after computing for decades. Local LLMs pushed me over the edge. AMD 9700 32gb, I really hope it has almost similar performance to a 3090.

[-]

Diablo-D3@reddit

https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units

https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units

On a dollars per GB/s+GB+slot (assuming multi-GPU inference), jamming a machine full of RDNA4s with 16GB or more ends up being the win. You end up being able to scale GB sanely, but also scale GB/s the cheapest.

Can't wait until R9600Ds start showing up in the grey market, they're 9070GREs /w 32GB and built like the R9700S. Should grey market retail for like $800ish.

[-]

Zolty@reddit

My mac studio M3 Ultra 256gb of ram does around 780 GB/s if you want another datapoint.

[-]

einthecorgi2@reddit

Show with power usage and prefill lol

[-]

mintakka_@reddit

honestly the M5 40 core macbook pro is super cost competitive depending on your exact use case. $5k all in to run dense models at 100+ Gb memory and acceptable (depending on use case) inference speed can be a deal breaker

[-]

Ok_gosh@reddit

No m3 ultra? M3 ultra: 819GB/s

[-]

aguspiza@reddit

dual channel DDR4 3200 ... 50GB/s
dual channel DDR5 6000 ... 90GB/s

[-]

lemondrops9@reddit

DDR4 3200 real world is more like 30GB/s (Maybe 35? think my settings are off on that PC). I tested my other PC with ddr4 3600 and got 38GB/s

[-]

dazzou5ouh@reddit

So this bad boy I've built should be fast?

[-]

laexpat@reddit

Could you add one more to the left side to balance it?

[-]

LittleBlueLaboratory@reddit

Nope, his CPU cooler is in that spot

[-]

laexpat@reddit

lol I know - need one more so it’s 3:1:3 :)

[-]

LittleBlueLaboratory@reddit

Ooh! yeh!

[-]

ScaredyCatUK@reddit

Mac studio is missing

How much ram does your 5090 have?

[-]

panchovix@reddit

By the way, a simple +2000Mhz VRAM OC (or +4000Mhz on LACT on linux) brings the 5090/6000 PRO to 2TB/s bandwidth.

[-]

TokenRingAI@reddit

That's interesting information, but neither the 5090 or RTX 6000 have a speed problem, and potentially damaging my $8000 GPU or doing anything that might impact the warranty is a real non-starter

Would these speeds also work on the 5060 ti? It's got 1/4 the bus width and bandwidth of a 5090

[-]

panchovix@reddit

I can understand that.

5060Ti is able to do the same overclock, not sure how much would be the resultant bandwidth.

[-]

TokenRingAI@reddit

The 5060 TI is interesting, because the density is double compared to the other 5000 series GPUs, it has 16G on a 128 bit bus, if they did the same to the 5090 it would be a 64G gpu.

[-]

aeroumbria@reddit

I also depends on how long you expect your build to last, and your overall outlook for the technical landscape. There are a few promising signs that the balance between core speed and memory bandwidth might shift. We are moving towards more efficient, low footprint kc caches with slightly more processing steps, MTP shifts workload from more TG-like to more PP-like, and diffusion models, even if only used for drafting, is a big inversion of processing power vs memory bandwidth. For local, single user scenarios, any technique that liberates computation power from memory bottleneck would be extremely effective and will impact what hardware is to be considered better value.

[-]

WithoutReason1729@reddit

Your post is getting popular and we just featured it on our Discord! Come check it out!

You've also been given a special flair for your contribution. We appreciate your post!

I am a bot and this action was performed automatically.

[-]

a_beautiful_rhind@reddit

My ddr4 is 230gb/s but it's hobbled by uma support.

[-]

Lxxtsch@reddit

Sadly no one tells that this is not everything in llm world. M5 max witg 128gb running mlx optimised models is very viable option, being only around 600gb/s. I tought i would see improvement with 3090 over it (filling only vram) and jokes on me, mlx optimised model goes head to head with 3090.

[-]

Pleasant-Shallot-707@reddit

Yeah, I frankly don’t care if I can serve myself 40tok/s or 300. I am only serving one person.

[-]

neopolitan77@reddit

Actually just shows what an incredible beast M3 Ultra is/was. I'd take 512GB RAM @ 819GB/s over any of those in a heartbeat.

[-]

Pixel_Hunter81@reddit

yeah but macs draw very little power and they are pretty much plug and play which is a big plus for a lot of people. MacOS seems to be well optimezed for ai usage as well (i am not sure i've never used it).

[-]

Pleasant-Shallot-707@reddit

Yeah, getting 30-60 tok/s is fine

[-]

tvmaly@reddit

It seems like they need a DGX Spark 2.0 that is at least as fast as a 4090

[-]

Dicond@reddit

As the owner of a system with a 5090 and 3090, please stop, I can only get so erect.

[-]

rorowhat@reddit

Need some more and gpus

[-]

romeozor@reddit

I just bought a B70 today. Here's hoping it wasn't a mistake

[-]

Wolfpack99111@reddit

Can someone tell me how good is a 3060 12 gb

[-]

greentea05@reddit

I have a 5090 and a 128gb M5 Max, I wonder if I can combine them in someway - tricky I imagine if not impossible.

[-]

Full-Bag-3253@reddit

M5 40 Core Ultra should be around 1228 GB/s, come with 8x or maybe 16x more RAM/VRAM, and use a fraction of the power. If you want to scale a Mac Studio larger you can use thunderbolt cables to build a RDMA Cluster. Becuase of the low power draw you could plug 4 MAC studios into power bar and have them sit on your desk. 8 x 5090 @ $4000 is $32,000 going to cost a lot more than a Mac Studio even before you add the rest of the CPU/PSU/RAM/Cooling/Enclosures. You still will have more throughput, but for people 'Using" AI not training it I think the Apple ecosystem is a strong option. I expect the New CEO will push in this direction more. The stuff done to date (RDMA over Thuderbolt) isn't really a retail user thing and the fact that they are selling out of mac minis and studios is going to draw their attention in this area.

[-]

youneshabbal@reddit

I rememberwatched YouTube video experimented this

[-]

Stock_Ad9641@reddit

It’s insane that a 11 year old GPU beats the best intel or amd have to offer by a wide margin.

[-]

DesignerTruth9054@reddit

Its like comparing apples to oranges

[-]

Stock_Ad9641@reddit

Exactly! The nutritional value of apples and oranges.

[-]

hurdurdur7@reddit

R9700 missing from the pic

[-]

zica-do-reddit@reddit

How is this measured?

[-]

TopChard1274@reddit

What's the context for this

[-]

One_Curious_Cats@reddit

Add VRAM limits and watt usage as well.

[-]

aidycas@reddit

M3 Ultra by comparison please? Think it would be 3rd from the top? (Bottom)

[-]

Quebber@reddit

Yes I could use my 5090 but my MS-01 AMD Strix Halo with 128gb 96/32 split allows me to run a Q8 Qwen 3.6 35B model with 256k context.

[-]

Ill_Barber8709@reddit

And someone out there needs to see this

M5 chips are laptop chips with up to 32GB of 153.6 GB/s memory M5 chips are laptop chips with up to 64GB of 307 GB/s memory M5 Max chips are laptop chips with up to 128GB of 614 GB/s memory

RTX 3090 GPU doesn't exist as mobile RTX 3080 Ti Mobile GPU has 12GB of 384GB/s memory OR 16GB of 512GB/s memory RTX 5090 Mobile GPU has 24GB of 896GB/s memory

[-]

Long_comment_san@reddit

You're gonna laugh but that's exactly what I asked Qwen a couple of hours ago. Huh

[-]

Evanisnotmyname@reddit

Sometimes I feel like LLM providers use customer prompts to immediately create advertising on Reddit

[-]

vinigrae@reddit

But but Apple

[-]

Both-Activity6432@reddit

Can we just pause and think about how fucking fast that is? I know it is local, but think of our 56.6k modems… Near 2TB/s. Home internet tops out (generally) 1-2 Gbps. Thunderbolt 4 tops out at 40Gbps. The worst card listed is 960Gbps. And yes I know this internal computer architecture vs accessories or internet, but holy fuck

[-]

Evanisnotmyname@reddit

Really though. Think about the amount of data being processed in an AI data center on the minute.

[-]

WiseassWolfOfYoitsu@reddit

A few random bonus ones:

MI50: 1024GB/s MI100: 1230GB/s 7900XTX: 960GB/s A6000 Blackwell: 1790GB/s (so 5090 performance with a much bigger memory pool) Radeon AI Pro 9700: 640GB/s

[-]

exographicskip@reddit

Thanks for the a6000 clarification

[-]

zerubeus@reddit

And Im only using the 5090 to play Arc raiders

[-]

Last-Owl-8342@reddit

idk man after calculating how cheap deep seek 4 flash is, Im not going local anymore

is not a rival to claude sure, but I know what I want just need someone to type all the boring parts

[-]

XxBrando6xX@reddit

M3 ultra Mac Studio, 819 GB/s

[-]

poopsinshoe@reddit

https://marketplace.nvidia.com/en-us/enterprise/personal-ai-supercomputers/dgx-spark/

[-]

JigglyWiggly_@reddit

How about amd?

[-]

ItsAMeUsernamio@reddit

16GB 5060Ti is 448GB/s

16GB 5070Ti is 896GB/s

16GB 5080 960GB/s

Interesting how Nvidia scales it up with the price.

[-]

HavenTerminal_com@reddit

M3 Max sits somewhere between 300 and 400 GB/s depending on config, for anyone on Apple Silicon squinting at this

[-]

Creepy-Bell-4527@reddit

M3 Ultra: 819 GB/s

[-]

kl3x@reddit

I'm pretty curious how Nvidia N1x will perform

[-]

Queasy_Problem_563@reddit

my mac studio m2 ultra 192gb is doing 800gb/sec

[-]

heitortp0@reddit

It hurts to remember that a 5090 in my country is ≈5/6k usd

[-]

ortegaalfredo@reddit

The bandwidth gives you the tok/s in generation, but the compute gives you the tok/s prompt-processing.

The M5 Mac will generate tokens quickly but will still take 10 minutes to process a prompt.

[-]

eltonjock@reddit

“10 minutes to process a prompt” either you’re wildly exaggerating or you’ve never used an M5 Mac.

[-]

xlltt@reddit

try processing a large context prompt 200k and then you will see that m5 will take at least 300-400 seconds on a dense model

[-]

Ok_Hope_4007@reddit

That is not correct. You can look up benchmarks yourself. For qwen 3.5 35B the M5 (Max) has PP of about 2000 t/s at 8K Prompt length. The 3090 was around 2300 t/s. Not exactly the ballpark you are mentioning

[-]

ortegaalfredo@reddit

Oh they fixed it in the M5. Nice to see. Now its much more competitive with the 3090s.

[-]

Internal_Quail3960@reddit

possible m5 ultra will be 1228 GB/s

[-]

DCGreatDane@reddit

Maybe faster if rumors are true.

[-]

Internal_Quail3960@reddit

possibly, but going off of the past generation ultra chips they have always had double the memory bandwidth

[-]

marutthemighty@reddit

What is the latest and most powerful NVIDIA RTX (or any other GPU) model?

[-]

redmctrashface@reddit

Yeah should also display vram amount. It's awesome to have a lot of speed but if you can't load a decent model because you lack vram space, what's the point? Don't get me wrong, Im not praising ram amount over bandwidth. It's just that things are a little bit more complicated than "look at my speed" or "look at my huge ram". This kind of post is misleading.

[-]

power97992@reddit

For agentic purposes, prefill is important too

[-]

Blah-Blah-Blah-2023@reddit

RTX2060 336GB/s ... yeah I am poor.

[-]

whatsamanual@reddit

The other side is capacity. I just bought my second dgx spark yesterday... I can't wait to see what they can do together!!!

[-]

into_devoid@reddit

AMD HBM2: 1.024TB/s

[-]

gomezer1180@reddit

Where is the Mac studio in this list?

[-]

InnerSun@reddit

Yeah a 2023 Mac Studio M2 Ultra has 800GB/s, that's really insane value even today when we place into context

[-]

MasterKoolT@reddit

Yes, but generations prior to M5 didn't have matmul acceleration so they struggle on prefill (M5 generation is about 4x M4)

[-]

InnerSun@reddit

Interesting, but I can see right now even the M5 Max has 460GB/s, so does it really help if the bandwidth is still lower in the end?
The naming is a clusterfuck lol

[-]

MasterKoolT@reddit

Basically, the Ultra is two Max chips duct-taped together in a really clever way to essentially double the performance. Apple hasn't produced an Ultra chip yet for M5 (and they skipped M4 Ultra) so there's a weird trade-off where you get better bandwidth on the M3 Ultra at the cost of the older, less efficient architecture.

[-]

fivetoedslothbear@reddit

I'm expecting an M5 Ultra to be released; Apple seems to be making odd-numbered Ultra processors. And if they offer it with 512GB, they've got my money.

[-]

MasterKoolT@reddit

I hope so too but I'm bracing for them to skip M5 Ultra. Fab capacity is so constrained at the moment that I wouldn't be surprised if Apple is stockpiling chips and RAM for iPhones (since that's the profit center) instead of allocating a lot of silicon to niche products like Mac Studio Ultras

[-]

InnerSun@reddit

I see, the scaling depends on the two Max chips of that generation then

[-]

Southern_Sun_2106@reddit

can confirm, qwen 3.6 on m5 max pro feels 'snappier' than on the m3 ultra

[-]

ZurielA@reddit

there is a M3 Ultra from Nov 2025, I own one comes stock with 96GB ram or can opt for 256gb

[-]

InnerSun@reddit

I'm still on the original M2 Ultra, I wonder how much better it is? From what I can find its really negligible. I guess the main benefit is that the max addressable VRAM is technically higher, but a maxxed out Mac Studio starts getting so expensive that we're back to considering NVIDIA setups.
Lets hope there's a new refresh that really changes the perfs.

[-]

joochung@reddit

I would expect the future M5 Ultra to have 1200GB/s aggregate memory bandwidth.

[-]

TokenRingAI@reddit

That's what the math works out to

[-]

HerrGronbar@reddit

Now compare it with price.

[-]

5olArchitect@reddit

Sure but that’s 128 gb of integrated ram

[-]

TheDailySpank@reddit

[-]

ImportancePitiful795@reddit

R9700 640GB/s something. 7900XTX around 1GB/s

However need also to point out that some cards are better than others because of their support on things like FP8 etc which some of the above are missing like the RTX3090

[-]

kenzu82@reddit

Still rocking Nvidia Tesla P100 at 732.2 GB/s

[-]

laexpat@reddit

That along with my P40 at 347.1 GB/s

[-]

dsanft@reddit

A dual socket Xeon Gold Cascade Lake with DDR4-2933 has about 220GB/s bandwidth. Don't underestimate CPU.

[-]

Kamimashita@reddit

2x RTX 3090 might be the most balanced? And not too expensive if you already have a system you can slot them into?

[-]

durden111111@reddit

I have a 5090 and the vram can easily OC +3000 which gives a bandwidth of 2176 GB/s

[-]

thetaFAANG@reddit

M1 max has 400 gb/s memory bandwidth btw

Apple accidentally made a machine that’s too good to upgrade for the price. M5 variants are close and compelling though

[-]

Bludsh0t@reddit

Very nice. Now do tdp

[-]

IllExample3639@reddit

Laughs in dual 3090. Worth double what I paid after 2 years if you believe eBay pricing.

[-]

NoFudge4700@reddit

B70 Pro is decent for home inference.

[-]

synn89@reddit

M1 Ultra, 820 GB/s

[-]

diggamata@reddit

MI350P is 4 TB/s

[-]

pfn0@reddit

not yet available, and I expect it to land in the $15-20K range, closer to the 20K range.

[-]

garlic-silo-fanta@reddit

Needs a column for electricity

[-]

RealSataan@reddit

Now the power draw also.

[-]

RealSataan@reddit

Now the power draw also.

[-]

higglesworth@reddit

B70 at 1/3 the performance for 1/3 the price

[-]

DrBearJ3w@reddit

Cough AMD Cough

[-]

NeedsMoreMinerals@reddit

Any changes to hardware on the horizon? Are they gonna start building pcs or gpus with 200 gb of ram?

[-]

Few_Painter_5588@reddit

Also don't forget that bandwidth is mostly additive. So if you have 4 RTX 3090s, you'll have nearly 4TB/s of bandwidth. LLMs are one of the few things that can saturate compute before bandwidth

[-]

tired514@reddit

What the.. who is modding you down?

If you're using graph split mode this is absolutely true.

[-]

ziphnor@reddit

Its not the whole story though. Bandwidth *per* GB also matters. E.g. the B70 is even worse than it looks vs 3090 here, because its 608GB/s that is (generally) reading 32gb, while the 3090 has bigger bandwidth to read from a smaller memory.

[-]

1ncehost@reddit

Also not the full story because PP is mostly compute bound and for many applications is just as important as TG.

[-]

tired514@reddit

please don't reset the context checkpoint back to 0... please don't reset the context checkpoint back to 0...

Damn you, opencode! *shakes fist*

[-]

ziphnor@reddit

Also true, I was just staying with the memory topic:) Not sure why I was downvoted though? People do tend to forget that bandwidth needs to be considered in connection with how much you will be reading.

[-]

1ncehost@reddit

There have been several posts recently that seemed like bot brigaded in the comments to pump links. I think its getting really bad here, so basically I wouldn't take any upvote/downvote numbers seriously anymore.

[-]

No-Juggernaut-9832@reddit

M3 Ultra is in the 800’s … can’t buy one now but when you could. More memory than the rest

[-]

chitown160@reddit

Imagine including TFLOPS along with wattage and cost ... oh wait the there is already websites like https://www.techpowerup.com/gpu-specs and https://technical.city/en/video/ that do exactly this.

[-]

Non-Technical@reddit

I have an M5 max Mac studio that is very fast but not enough ram and a strix halo that has much more RAM but is slow. Kind of in a weird place until more options are available.

[-]

jcdoe@reddit

No you don’t, the M5 Max Mac Studio isn’t out yet.

[-]

Non-Technical@reddit

Oh you are right. It is an M4.

[-]

exaknight21@reddit

For my Mi50 gang, 1 TB/s

Represent fam. Beat dollars per gb of vram i say. Huge shoutout to gfx906 / mixa/aiinfos !

[-]

BlackBeardAI@reddit

Unless you are rich enough to buy 5090(s) or a 6000 pro, 3090 is the king.

[-]

Intrepid_Dare6377@reddit

Just bought an HP Omen PC with a 5090 from Microcenter. Not as fun as doing a custom build but my energy is focused on development right now so went pre build. It is an absolute flamethrower speed wise (although the actual thermals and noise are quite good)🔥

[-]

freia_pr_fr@reddit

M3 Ultra, 819.3 GB/s

[-]

ColonelKlanka@reddit

Wow. I disnt realise the apple silicon non pro chips were still such low memory bandwidth.

I have a older m2 pro tht has 200gb bandwidth - this is faster than m4 non pro!

[-]