I already did and have such a sheet and have made shared sheets for things before to allow others' input. Why bother sharing at all if you're not going to share...meaningfully? Searchably, usably, notjustwastingyourowntimefully? 🤷♂️
Also, I wasn't trying to say you specifically should do it. I was going to comment at the top level but saw you contributed more to it, so I replied to yours because of the "multiple people contributing to the same dataset" context.
For some reason we rarely hear people talking about 4090s, probably something to do with being a lot more expensive than a 3090 and nearer in price to the 5090 for less VRAM and speed.
VRAM is too limited. The smallest really competitive local model in my benchmarking right now is Qwen 3.6 35bA3b whose NVFP4 variant requires about 36GB minimum to barely run with concurrency of 1. Smaller models are still not really competitive in terms of instruction following and coding accuracy. So I'd look at at least unified RAM systems of 48, 64 or 128GB for anything effective.
Hmmm.. what's the throughout hit you practically see doing that? I use a DGX.
Interestingly enough while I fully expected 27b to be smarter, I found they benched almost the same - here are my benchmarks - https://srinathh.medium.com/mid-size-local-models-are-now-competitive-for-ai-agents-7696b2e8b535
I just tested with Qwen 3.6 35b and I'm getting 55tok/s right now.
For some reason only 9.1/24gigs of VRAM on my 4090 are used and my PC memory use by llamacpp is 19.7gb.
By way of comparison when I run 27B fully in VRAM without MTP I get about 45t/s.
As for benchmarks, I always take those with a big grain of salt and I prefer testing models for my specific use cases which are mostly coding related. That being said, chatting with 35b right now gives me the impression that it might be better at general language, though I am certain that 27B is a better coder.
I'm using the following to launch it:
llama-server -m "E:\AI Models\Qwen3.5-35B-A3B-Q4_K_M.gguf" --alias "qwen3.6-35b-a3b" --host 0.0.0.0 --port 8080 --ctx-size 32767 -n 32676 -ctk q8_0 -ctv q8_0 -b 512 -ngl 99 --mlock --no-mmap --jinja -fa on --cpu-moe
Nice! I run on vllm with the full 256k context because I find that in my openclaw turns, i routinely run in the 50k-150k token range on context with all tools, memory & session conversation history loaded
Nice - i'm using vllm - I default to using the full 256k context these models support because in my openclaw turns, I find contexts routinely run in the 50k-150k token range with all the tools & memory and covnersation session histroy etc loaded.
This. I run Qwen 3.6 27b at fp8 on two 3090s, full context, image processing and with MTP, getting a consistent 60+ tok/s in decoding. It’s seriously powerful for agentic tasks and coding in general, I’m a professional software developer and a lot of my production code nowadays is made by the GPT 5.5 plan + Qwen3.6 27b execution combo, I sometimes need a code review from 5.5 and then another coding round from 27b but that’s it. It’s beyond incredible I can actually ship production code from my Chinese motherboard and used GPUs, this was unimaginable six months ago.
Could you please share your rig setup?
I have a RTX 4090 with a AMD 12-core CPU, using it for mostly gaming. I would love to get rid of Windows, install a Linux distro for just running LLMs
All GPUs were bought used, CPU is obviously used, RAM sticks probably are too, motherboard is a Frankenstein. I love that I can run something as ridiculous as 27b on this freak. I live in strange times.
Not much tbh, as benchmarks are behind 3.5 27b, so I didn’t think it vs 3.6 was even a question worth considering. Is it that good? I’ve tried 26b a4b, and it’s very good for natural language stuff but fails long running agent sessions, which is what I use these models for (long coding sessions basically). Is 31b much better in that sense?
I use:
Huananzhi H12D-8D
AMD EPYC 7502
128GB RAM
4x RTX 3090 24GB
(I cap them at 250W)
Ubuntu 24.04 LTS
Allegedly, I "should" be able to add more cards via converting my three Mini-SAS-HD (SFF-8643), but I'm very skeptical, the Huananzhi bios has been a pain in the rear for me.
I'm considering switching to PCI-E x16 to x8/x8 splitters when I get the money for more GPUs depending on how the other adapter goes. I do have a Mini-SAS-HD to OCuLink adapter, I just need a card to test with.
The worst part of this system is that I can't really make use of the BMC. If I enable the BMC and I change even a single setting from default in the bios, I immediately lose the ability to see the NVME slots.
If I had the money, I'd have gotten a different board, but the ones I would have wanted were all well over 1k.
Same. I used to “need” windows for certain multiplayer games, but don’t really play them anymore, so have one of my machines running CachyOS instead. It’s amazing. Boots up so much faster than windows and stuff isn’t as… annoying.
I also have two 3090s and am looking at all the various options for optimizing stuff. Would you mind sharing a bit more about your inference software setup and what you use harness wise? I assume you are swapping between Codex and something like Pi or OpenCode?
It would be nice if there was something out there that would smoothly combine frontier planning + local execution in one polished and reliable setup, but I don't think there's a one stop shop for that quite yet from what I've seen.
But you use Q8 for KV cache too to fit full context , right? Also Wouldn’t a good Q6 quant be better for 3090(assuming you run on llama.cpp or its forks)?
Yes, forgot to mention, Q8 for KV cache. I find it to be virtually free lunch, never ran into any apparent issues (Q4 is another story, can be very good or downright unreliable, depends on factors). I run this setup on vLLM for tensor parallelism, that's how I'm getting 60+ tok/s (and I'm on PCIe 3.0 x16, if I were on 5.0 this could easily border the high 80s or even 90s). Q6 would be very good indeed if I were using cpp.
Going from 1 to 2 is a world of difference! A system with 2 4090 would be a monster. All you need is a motherboard that can bifurcate the PCI and you’re Gucci.
I added a 2nd GPU to mine externally to skip the new case, connected with an m2 oculink adapter, minimax GPU dock and a 2nd PSU. I'm sure it's not as fast as a normal pcie slot, but it's working great so far and was way easier than a new case.
Compute units and compute in general. So higher clocks and more cores are faster. Also perf per clock (aka IPC, for the same clock getting higher performance on newer GPUs)
The rtx pro 6000 is just a 5090 with 72 or 96gb vram. So it is only as fast as one 5090 even if you dont need all the vram. With 2 5090s i can literally fit 2 27b qwen3.6 with q8_0 kvcache in each card and run them simultaneously.
It doesn't seem to get talked about very much. I have a 5090 and a 4090 in my system. I had the 4090 first and while the 5090 is clearly a big step up, the 4090 is no slouch!
This is sorta my situation, I had a 4090 from before prices were insane and I'm considering adding a 5090. Do you feel the 4090 keeps up well enough in speed when splitting a model between the two cards? And what models and quants are you running on there?
My go to model right now is Gemma4-31B-Q8_0.gguf (31G) w/mtp-gemma-4-31B-it.gguf (491M) drafter model split across the two cards with a 128K context. I get about 65-70 t/s. I'm using the llamacpp Gemma4 MTP branch.
I see about ~4000 t/s PP combined across both cards. llamacpp doesn't give me a breakdown per card. Model is too large to run on the 4090 for me to test each card solo.
3090 outstanding value if you can find one, maybe worried about it needing a repaste or just burning out from age.
4090 people holding onto them like unicorns, just superb cards.
5090 just far too expensive for most consumers to justify.
Not to be confused with the 4090M; I'm running a Morefine G1 eGPU (16gb) at 576GB/s with a second one on the way. Crazy expensive, but awesome little DC-powered portable rig.
Try using an extended PCIE riser card with a 90 degree angle. Lets you keep your PCIE speed and bandwidth without using oculink, or TB, etc.
I have 5090 as my main, and a 4090 using a riser card with a 3ft ribbon cable running into a 3d printed external GPU enclosure containing the 4090, a PSU, and the female end of the riser card.
They're wickedly expensive for what they are but in my case form factor is everything. I live on a boat in the summer and I'm trying to avoid running the main inverter (that makes 120V AC) at all costs. These little Morefines run on 20V DC, so I can run them off my dedicated 12V->20V SMPS. Drops my idle consumption from \~100W to \~10W.
I'm using a 5090M and those mobile GPUs usually have really impressive memory overclocking potential. I am running a stable +250MHz (equals nvidia-settings +4000MT/s) overclock, bringing me to 2000MHz. This raises bandwidth from 896GB/s to 1024GB/s.
Maybe test a more conservative +125MHz first though, although GDDR6 on my old 3060m overclocked equally well with +250MHz (+2000MT/s in nvidia-settings)
I was literally thinking "if the 4090M is only limited by thermals / power, what's to stop one from overclocking the VRAM given memory bandwidth is the main constraint during inference?"
Turns out the 4090M uses a 256bit VRAM bus while the 4090 is 384bit, but I did end up overclocking a fair bit and yup - decent performance improvement!
I can't wait to see what kind of token rate I get with Qwen3.6-27B @ Q6 with two 4090Ms.
The biggest issue I had with vllm which is what seems to be needed for llm-scaler, is how to compare vllm supported quants (INT4, Fp8, AWQ, etc) with models running the usual q4, 5, 6, 8 quants on llama-cpp. It just felt like comparing apples and oranges. And that's when I was able to get vllm to even work. I will have to try the new update in a docker container...
honestly really liking the price for the amount of memory you get but the performance is abysmal right now to a 3090. hopefully they can optimize the software as there's no excuse to have an ai dedicated card lose to a nearly 6 year old gaming GPU...
3090's are 1400 bucks now on FB/Ebay and there are SO MANY FAKE / SCAM SELLERS
That $950 for new with warranty seems much more worth it.
Do I wish pricing was better? heck yeah... But i'd rather take my chances on NewEgg and run 2-3 of these cards and in both cases, we all win vs the $10k RTX6000 Pros (even though its faster, it's not 7,000 dollars faster)
Hardware doesn’t just “go bad” or “wear out” that easily.
Yeah sure on some level it does…but PC parts are one of the few cases where you can tell pretty quockly if it’s working or not, test, and if it tests good it’s good.
It’s not like they get slower overtime..either it’s working, maybe working at 98% original capacity, or not working at all.
3090's have been around the block with gaming, overclocking, mining and now AI - i know things don't "Wear out" but fans and paste do and if those fans and paste haven't been maintained then it causes heat failure in areas where I don't want to bother fixing it
And that's why you see 100s of gpus for sale or sold as not working/broken.
you would be better off buying a stack of 5060ti 16gb right now if you are on a budget. mature software, warranty, plus good vram to dollar price point and you can parallel compute in certain setups for more performance.
yes. that's were it's performance is best and most stable. someone posted in depth performance comparisons between it and the 3090 using vulcan and it got less than half the performance most of the time. it was bad.
If the software stack is actually stable, I'd probably recommend a B70 over 3090s for a business, because of the whole "used card gamble" thing. A bit slower performance with a bit more cost, but with a lower power consumption profile and a warranty & current support would probably push that over into "worth it" in that use case.
That said, yeah, you'll pull my dual 3090s from my cold dead hands. (Especially since I used some Dell OEM ones that are shorter than any others - in theory, I can put my stack of 8 3.5" drives back into my case!)
I will share these bench stats if ya'll don't chase me out for being on Windows 😉
The other side of this box runs Ubuntu Server 26.04 with both SYCL and Vulkan compiled from sources. On the Windows side, and just for the lolz, I downloaded the pre-compiled binaries. SYCL sucked, then Vulkan beat all other combinations for this particular model:
https://github.com/intel/llm-scaler is the repo everyone is following. There are a few other repos on GitHub as people benchmark/test through the updates. It's had 4 releases in the last month, so Intel seems to finally be progressing through the prior growing pains.
openvino 2026.2.0 was released yesterday and it adds support for gemma4 and qwen3.5. I tried the nightlies before and it is really fast, like 4k pp and 60 tg on qwen3.5 9b int4, though a specific nightly version tanked the performance of it later... That is on a b580. I wanted to try qwen3.6 35b and 27b, but i guess openvino isnt very great for cpu+gpu combos
Oh, I see we still are ignoring cheap AMD GPUs. Good for myself, just bought an used RX6800 16GB for 250€ the other day. RX 7900 XTX with 24GB go for as cheap as 500€ here in central Europe.
Yep, anything with good gguf support goes straight on either my XTX or Mi50's. Save the 5090 for when cuda is required like faster whisper and yolo. Might get another XTX if the price ever comes back down to what I paid for the first one but I'm not holding my breath and I'm not paying $200 extra for the same thing out of principle. Same with the Mi50's they are now triple what I paid.
Maybe I'm one of the ones who needs to see this...? I worked in High Performance Computing for 25 years, retired at the end of 2024. It looks to me like you're comparing the TB5 speeds of a Mac Mini with the NVLink speeds of the NVidia cards?
But that means I really don't know what the numbers for the laptops mean...
I don't know but somebody posted on that topic in the past 2 days I think. There was mention that the faster CPU will achieve better prefill time.
I have been chatting with Claude about the performance topic. He thinks there is no substitute to empirical testing. I may quit ollama and migrate to VLLM in order to understand the pieces of the inference process better.
My notes during my shopping for an M1 Max:
∙ M1 Pro: ~200 GB/s
∙ M1 Max: ~400 GB/s
∙ M2 Max: ~400 GB/s
∙ M4 Max: ~546 GB/s
∙ M1 Ultra: ~800 GB/s
Of course you get more tps out of a 5090 than a MBP, but the 5090 doesn’t have 128 GB memory for not-insane-money and oh.. yes, it comes with a computer too.
You can get one $1800 ish which is much cheaper than a 5090 and you can get two 4090's cheaper than 1 5090 😄and that gives you 48 gb vram.
And if you are willing to mod them, and ship them to china, for about $150 each you can get them to be 48 gb, so two modded 4090's is dual 48gb for 96gb vram at over 2000 GB/s total.
You also left off the AMD Raedon 9700 AI 32gb vram card, which has 640 GB/s but comes with 32 GB Vram and is around $1300.
But... 2-4 Raedon 9700 AI cards is the best bang for buck with tensor parallelization. Sapphire makes one, it's $1379 on newegg.
Wait wait what's this Chinese modification to double a 4090's RAM? I found a few vids talking about how to do it, but there's a company that'll do it for $150?
I've got the tools as well and feel like I could probably pull it off.. but under $200? Including the memory itself? That seems irrationally cheap.. :/
Some Intel and AMD server CPUs has quite big memory bandwidth (Amd Epyc and Threadripper CPUs, Intel Xeon 6). Like AMD 9124 - 12 channels DDR5 with 460 GB/s. There are even Intel Xeon with HBM2e memory (like 1.6TB/s theoretical bandwidth). But building with these will cost quite much as need shitload of registered ECC DIMMs to fill all channels, CPUs costs quite much, same for servers motherboards.
What a strange interpretation of their statement. I read it as they are bothered by the fact that there's no x86 equivalent that matches the Apple laptops.
Is not the main problem being stuck at 24gb? That is why people are using Mac mini so they can go like way higher, speed is nothing if you are stuck using a crappy model.
That's what I figured out when comparing my 2x 3090 cluster to my dgx spark cluster. The models I can run on the DGX spark, while considerably slower, get way more use than my 2x 3090 cluster. There are times when speed matters (classifying 40k comments) and I'll use my 3090 cluster for that. Everything else goes to the DGX Spark cluster (95%) regardless of speed.
I havent seen / there arent many benchmarks comparing dgx vs 3090/3090s, so im assuming based on my instincts here, but what model can be ran on dgx that cant be ran with gpu with ram while still being faster? I can only think of garbistral medium
I’m running Qwen-397b, Minimax 2.7, Mimo 2.5 or DS4 Flash on dual sparks. You can’t run those in 48gb VRAM. With offloading, even the 6000pro on a DDR5 system gets slower than the dual sparks.
On a 4-node Spark? I get if you only have one dgx spark. Not saying it isn't possible to accomplish the same build with gpu's, but for me the simplicity of the dgx being plug and play with less "moving parts" (heat, power, etc) beats a build on 3090 + system RAM. Yes the trade-off is speed.
All personal preference; everyone has different tradeoffs.
Depending on what you can get an RTX 6000 PRO for. They range from 11k to 13k based on my searches. I got my sparks at original price. So it's a 3-5k difference. 32GB less RAM but WAY faster interference. Trade-offs for sure. I'm happy with the route I went but not everyone would be.
Yea, some would say I was stupid for the early adoption, and I agreed until I saw the price skyrocket. I knew when I bought it what I was in for so I took the leap.
I want to build an always on LLM inference and I have a relatively high budget, but I'm constantly torn between a Spark cluster and just adding another 6000RTX pro to my current machine.
Yea that's a tough (great) position to be in. I'm not sure how I would decide tbh but it would be largely based on my use case. For instance, I use my 3090 x2 cluster for classification/sentiment processing. I sometimes need to process +40k records a day. Speed matters when you're doing tasks like that. In that case I'd definitely go 6000RTX route because speed is important.
But if you're into fine-tuning, which I also am, the dgx spark cluster is nice because I generally dont care about speed when training, and having more VRAM capacity is more important
Yea - thats the crux of it. The inference speed of the RTX is (borat voice) verryyy niceee but I'm also finding it enticing to use some of the larger models regularly. Thanks for your perspective. I'll need to think on this.
What models are you using? I've been doing a lot of research and I haven't seen impressive results from 128gb setups. It seems like 256/512 is the big step from 48/64
I have 2 node cluster now but for 1 DGX Spark, I think the best candidates are the recently released Step 3.7 Flash - reported to get 20-25 t/s. Or Qwen3.5 122B A10B int4 AutoRound - I find it a bit deeper than Qwen3.6 and it can get 35t/s with mtp. Even Qwen 3.6 27B at FP8 gets around 17 t/s with mpt and I find that a lot better in quality than Q4 quants. And you can run it at full context with 3x concurrency.
I have no idea how people are having success with those quants model, they tend to go into loops and error so often it is frustrating. So usually I only use those with full precision which most will not fit into my 4090.
Right, that’s the missing data point here: how much RAM can each of those devices access at that speed? Even the regular M4 mini could, until recently, be configured with 32gb of RAM and the Pro version up to 64gb. The M5 MBP mentioned on this list can also be configured with 128gb of RAM.
So, yes, an Nvidia GPU can be up to 2x as fast, but tops out at 32gb of VRAM. You could get two of them and have 64gb but you’re looking at $4k PLUS the computer they’d go in. You can almost get an entire MBP with 128gb of RAM for just what the GPUs cost.
Plus it fits in my backpack and draws 140w tops (technically I think they can draw up to 200w for a short period by pulling from the power adapter and battery at the same time).
For comparison, a single 5090 can draw 575w. So for two of them PLUS a PC to put them in and a monitor (to compare “apples” to “Apples”) you’re going to be looking at 10-15x the power usage.
It’s not really a “this is better than that” situation as much as it is these are two different options that have similar price points and make different trades offs - more total RAM, lower power consumption, compact form factor vs. faster RAM speed but less RAM, larger form factor and higher power consumption).
I have 2 Sparks connected together running Qwen 35b MOA for my startup and what I have seen is that if you use DP2 for concurrency I can get 32 concurrent request at peak using both hardware. I have a whole benchmark of DP, PP and TP done I can share. These hardwares are awesome for what they can do which is loading the LLM on vram and holding it at same space for long time. Meanwhile in a Mac you can load the model but when the OS needs the unified memory for chrome it will boot the model out and prioritize loaded apps. Concurrency over speed gets you to do things like fine tuning, parallel processing, intensive work like benchmarks. If you just want a chatbot to run then you will get max 35tps on fp16 models which is not bad.
I use these for my startup and so do my customers and it’s a game changer.
It’s slow, that’s true. You’re usually maxing out around 500tk/s pp and 20tk/s decode but there’s not much else that lets you run models of this size for this price
For me though it’s more about being able to train and quantise and distill, testing my experiments on similar-ish hardware to a cloud rented system before uploading it
i'm genuinely thrilled with my dual 3090 setup on a DDR4 system with a Ryzen 5 3600, even though one of them is PCI-E x16 and the other is x4!
$2000 MSRP for the 5090 with 32 gigs of ram, and good luck getting that price
$1800 for a pair of used 3090 cards on eBay (as of a month or two ago), total of 48 GB
Yes, there's stuff that doesn't like running split between two cards, but mostly it's been pretty unusual to run into stuff that wants more than 24GB but less than 32 GB of VRAM on a single card. (I think one of them SOTA-ish FOSS voice models is like that, but I'm not even sure.)
You can prooooobably get a 570-based AMD chipset board for not tooooooo much money. (And, I managed to push this to 128 gigs because I already had 2 32 gig sticks in it, and DDR4 is only "sell a kidney" price, not "sell both and also your liver" price)
Oh, I doubt CPU inference would work very well, but, if you give me a model you want me to test, I can give it a try with a CPU-only build of llama.cpp
But, I use it with my 2x 3090 setup - but, that runs one at x16 and one at x4, but it's still decent!
I recently bought a second 3090 for my setup hoping the same. I too have ryzen 5 3600, msi x570 a pro with 2 pcie slots. But for some reason anytime i plug anything into the second slot (x4, chipset slot) the motherboard does not post display and shows a red light on vga. I have tried single gpu on slot 2 and two gpus together. Doesn't work. Only thing that works is single gpu on first slot (x16,) . If it matters i do have 2 nvme ssds and 64gb ram. I tried removing everything and starting with just single ram chip too. Same outcome. I tried bios settings like gen 4 gen 3 and that weird mining setting. None of those worked. Any help is appreciated
Also look at the manual and be sure if this PCIe slot or NVME slot is used that PCIe slot is unavailable. Its not very common for an NVME to do this but never know until you check.
I'd start by taking a bright light and inspecting the slot to make sure there isn't anything in there like a bit of paper, plastic, etc, and that there aren't any bent pins.
After that:
See if your BIOS is current
See if it will POST with both NVME devices removed
Review BIOS settings & motherboard manual
You may need to disable some SATA ports or something like that
I will try to look for the debris and bent pins. I did try after removing both nvmes. Did not work. I am not very savvy when it comes to motherboards. What is funny to me is that it only works when the second pcie slot is unoccupied.
(Apologies for the formatting here - I really attentively formatted everything, and when I tried to submit it, reddit wouldn't allow it. I'll reformat from desktop in a few; doing this from an ipad with a keyboard misssing the right arrow is awful lol)
I've used a few different configurations - one is the "Club 3090" setup, which has specific configurations for single and dual 3090s.
But, here. A standard Q8\_0 config, an MTP config, and an MTP + NGram config.
All 128k ctx, Q8\_0 (and no cache quantizing).
* Stock model gets PP: 2027 and 27.1 gen.
* MTP model gets PP: 1371, Gen: 49.
* NGram configs skipped as they don't seem to add any performance
* Smaller quants skipped because lazy
#This one gets PP: 2027 T/s, Gen: 27.1 T/s
#
\[unsloth/Qwen3.6-27B-GGUF-128-ctx:Q8\_0\]
hf = unsloth/Qwen3.6-27B-GGUF:Q8\_0
ctx-size = 131072
temperature = 1.0
top-p = 0.95
top-k = 20
min-p = 0.0
presence-penalty = 0.0
repeat-penalty = 1.0
reasoning = on
# This one gets PP: 1346.8 T/s, Gen: 41.9
#
\[unsloth/Qwen3.6-27B-MTP-GGUF-128k:Q8\_0\]
hf = unsloth/Qwen3.6-27B-MTP-GGUF:Q8\_0
no-mmproj-offload = true
spec-type = draft-mtp
spec-draft-n-max = 3
no-mmproj-offload = true
ctx-size = 131072
The "no-mmproj-offload" gets the mmproj (vision support) offloaded to system RAM / CPU, so it'll still **work** if I need to use it, but it won't take up VRAM. (I used to just disable vision for a lot of these.)
The "no-mmproj-offload" gets the mmproj (vision support) offloaded to system RAM / CPU, so it'll still work if I need to use it, but it won't take up VRAM. (I used to just disable vision for a lot of these.)
There are definitely a few models and tools out there that I've wished for a 5090 for, like some of the weird TTS models, or speech-to-speech ones. But, I even had "okay" performance on one of those "realtime 3d walk around in a hallucination" models with the 2x 3090s!
Ok thank you, I also have the really stupid idea of using an RTX pro 4500 blackwell in the same system with those other two cards because it's only 200 watts and I have a motherboard that can do x8/x8 gen 5 and also has a lower gen 4 slot with 4 dedicated lanes where I could stash the 3090. If I'm not mistaken this would cause me to take a significant hit to prompt processing but with 88gb of vram I think I should be plenty well usable, right? Certainly much better than falling back to system ram at least.
I have found other 3090s I have considered buying to make a dual set up, but they are never the exact model of 3090 I have (gigabyte oc gaming) and from what I understand, it should be...
My undertanding is that it's not actually particularly important; maybe if you want to use NVLink, but even in that situation, I think it's explicitly allowed. (Double check me on that, though!)
honestly best value setup. only better combo is if you got an old amd epyc before the shortage so you get the full 16x pcie gen 4 speeds per slot and can run large MoE models with all the ram.
also if you make the right setup you can have both cards work in parallel and cut your promp processing time by 30-40 percent and boost your token output.
Did exactly that. 7B13, ROMED8-2T w/ 8x64GB DDR4, right now only have a 5090 and 3090 but can stuff another 3090 in. This is the best value setup (not counting GPUs) you can get at Q3-Q4 in 2025.
Aren't NVIDIA RTX 5090s (or whatever latest GPU NVIDIA have in their arsenal) already out of stock and taken up by AI enterprises in America and China?
I got one by signing up for those alerts then I started noticing patterns when they would happen so I would try to predict when one would happen. I got lucky and predicted before the notification went out and I got it
Sry for the late answer. I first had to configure my fastfetch module. I am experimenting with themeing and colors, so dont judge \^\^ Plus, when i bought the RAM (last september) it was only 330€ for two 48GB modules, two months later they went up to > 1500€, crazy timing !
Sry for the late answer. I first had to configure my fastfetch. I am experimenting with themeing and colors, so dont judge \^\^ Plus, when i bought the RAM (last september) it was only 330€ for two 48GB modules, two months later they went up to > 1500€, crazy timing !
If only Apple makes nVidia GPUs play ball with macbooks - That was once the case in 2005-2006 when the 15" macbook pros came with nVidia GPUs. That relationship soured for some reason and Apple shipped Radeon GPUs but man I'd kill for a Macbook pro with a 5090.
Not really. It can run 70b dense at 11 tps, which is the floor of performance. The 30b a3b sized models are all 90-110 tps. With some clever moe layer offloading this card + 64gb system ram can run all of the big 120b moe models at ~30 tps. Id day 30tps is useable still. Worth noting most modern models can be ran at max context too :)
There's a segment of users here who like to hate on people who buy Macs to run local LLMs.
The main issue with Macs is that the prompt processing is slow, so the time to first token can be quite long. That has been improved with the M5, but i don't think we'll see exactly how much it has improved until we get the M5 Studio later this year.
The M5 Ultra coming with the Mac Studio sometime later this year should double the bandwidth of the M5 Max, and double the GPU cores. With 256GB it will still be well under the cost of a single 96GB RTX PRO 6000. Saving my pennies.
if you dont want /cant dual gpu setup with mid range 50xx or 40xx, and you don't want to buy used 3090, then the R9700 seems like the best option performance / VRAM / price-wise.
A lot of people wonder that too. I'm guessing it's just related to corporate money saving bs. I'm sure that the people who actually work at intel know they need more staff.
It honestly appears only a handful of people work on the software.
The other problem is that there are 3 ways to run models on intel, (SYCL, openvino and vulkan)all of which have different performance on different models.
The info is out there though. You want to look for Openvino benchmarks for the best performance. It has the worst compatibility though and is sometimes months behind something like llamacpp.
I got one for MSRP on release and quite pleased with it, I used my RTX 2070 before mainly for SDXL and Gemma 4. I’m very happy with the card especially for the price and save a shit ton of money I used to spend on Claude.
The thing is I have a very niche use case, for general day use and just thinking I use my Claude Pro plan. I’m an architecture model builder, architect and lifelong woodworker.
I have my own fine tuned Qwen 3.5 27B model that used to be run over Runpod and I trained it there as well, it’s directly connected through a VSCode Codelistener instance that can read and adjust my code for Rhino + Grasshopper through Python. Generally Rhino is a script based 3D modeling software that is perfect for custom Python or C++ scripts, many leading architects in the world use it. I’m not a software engineer but been making my own scripts for like 20 years now for various things from site analysis, parametric modeling and calculation for efficiency. I used to this all by myself, but since a few years Claude has helped me immensely improve my scripts and help me if I’m stuck.
Now this is all running on my B70 plus 96GB RAM and works like a dream so I don’t need the 200$ Claude plan anymore and pro is enough, Opus for planning and guiding Qwen and then I mainly use my finetune.
I spent most of my work day on CNC machines, laser cutters and general woodworking machines and LLMs have helped me a lot in recent years, and now I’m saving 2000$ a year with going fully local.
Honestly I don’t have exact benchmark numbers for you right now since I’m not at my workshop but I can get back to you in the coming days, it’s Friday today.
There was a few posts when it came out that effectively showed it matched the price point, but had the potential for growth assuming intel actually invests in the software space. So at worst, it's price point accurate.
The speed is basically wasted at those sizes. What's the point of going that fast if all you can fit is a small model. A cluster of mac minis is probably better off at the price. Slower, but you can run a more competent model.
I’ve been using 8B and above on well-defined tasks for about a year now.
I have pipelines that break down workloads and process in parallel batches.These batches are initiated by scheduled jobs, personal Claude/Codex agents, and API requests from various apps I’ve built. Some systems collect data, some analyze it, and some report on it. Recent models (like Qwen) can produce reliable tool calls and output if you can structure your processes. I have evals up and down the stack. Each pipeline stage is well defined.
If you need 32B models and above you are probably working with complex tasks that benefit from intelligence more than speed. That’s completely fine, it’s why I still have Claude and Codex subs. However, if you’re using high intelligence models to do basic VLM, you’re probably wasting time, money, or both.
Don't downvote this person. This obsession with bandwidth is the type of crap people say when their tricked out honda civic has such and such hp. It does not really point at the actually rate limiting step that is vram first.
Eh, at some point you will run into the issue of wanting multiple parallel streams, at which point you will quickly understand that the bus bandwidth is your new bottleneck.
Workflows, multi-shot inference, and tuning. Because these are lossy systems regardless of size, you should be building for all this anyway.
The most speed and cost effective setup runs easy to manageable tasks locally and bursts to SOTA models as needed. Because burst frequency is low, the cost of non-local calls is trivial, and privacy can be retained through obfuscation and local translation. Local speed and cloud burst for temporary model size increase is the optimal setup.
I hate that you're getting downvoted for this, as it's 100% true.
As the saying goes, "all the speed in the world doesn't matter if you're headed the wrong way". Buncha ADHD people out here who just want infinite tokens per second of absolutely anything / random trash
When using more than a single card for inference, the PCIe bus is capped at 128 GB/s on version 6. So yeah. You either need a model that will fit on a single card or you need to accept that BUS cap. Small models can be quite capable though.
This is also useless without average prefill and token generation speed because they are wildly different between these platforms and architecture will make the memory bandwidth a non-issue in a lot of circumstances.
that alone doesn't give the full picture. Something like this one does a little bit better job:
name
usable vram, Gb
price
fa
pp512
pp32768
tg128
power usage, Watts
3090
24
\~$700
1
5911
2361
174
\~300
... based on public benchmarks from llama-bench - a tool from llama.cpp project. The standard benchmark figures are assuming you're running TheBloke/Llama-2-7B-GGUF:Q4_0. Noone in the health mind uses it today, but it gives you a base reference that is comparable.
This table is kind of useless without including price per GB/s and total vram per option. Also, I've seen others in this discussion point out that there are more competitive options that have been entirely left off this list...
Been running local models for about a year now and the progress is honestly staggering. What used to require a 70B model can now be handled by well-trained 8B-14B models for most practical tasks. My daily driver setup is a 14B model for general tasks on a single GPU, and I only reach for larger models or API calls when I need that extra capability. The latency advantage of local inference is underrated too — for interactive coding assistance, having instant responses changes how you work with it fundamentally.
I have a 4500 because I just wanted to have a mini-ITX build that wouldn't blow up. A 5090 is by all means a better option when it comes to value if compute is the only concern, as it's slightly more expensive for double the bandwidth.
if 32GB VRAM is enough for you then single 5090 is superb (i have one), but it doesnt scale (space, heat, power, even with undervolt and aio version) well and creates a lot of headaches beyond that. On the other hand you slide 4500s one after another into standart workstation (trx50, wrx90e..) without much hassle.
Yes! So many new spark users go down this rabbit hole on NVFP4 kernels and why their LLMs arent running faster, meanwhile token generation is speed bound by the memory bus and nothing they do will change that. How do I know? I went down the same rabbit hole when i got my spark half a year ago.
I was eyeing one but have not done any actual research yet, which i was going to do before pulling the trigger. With RAM prices so high, the 128gb of unified for ~5k seemed like a better deal than building a 5090 rig, where the GPU alone is 4k and id spend at least another 2k in CPU, RAM, Storage, Mobo.
I probably would've come to the conclusion to build anyway over it, but it is an attractive all in one package with a very small footprint. I'll have to look into some benchmarks and go from there i suppose
Ya, for sure. Honestly i still feel like the asus ascent gx10 (undercuts the other versions price with the same hardware in a different case) is a steal. I went for the 1tb version, it was 3k back then, 3.5k today (usd). Its a great unit for that price. I mean there were (and maybe still are?) some amd boxes you can get even cheaper, but you give up a touch of mem speed, a lot of gpu power, and close the door on clustering. If ur chill with 15 - 60 genned tps (depending on the model you run) and want the fat capacity and low energy cost, its the way to go imo. But if you crave faster speed, deeper upgradability, dont care about energy, want a real desktop for non-ai or gaming, a proper rig is better. I have no regrets, but I was expecting more performance going into this.
There are 4 important factors when choosing hardware. They relative weight depend on the use case, and memory bandwidth is only one of them and very often not the most important one.
The total available memory. For LLMs the bigger memory the better model you can run with bigger KV cache = longer context. Super important for agentic AI with large context and models smart enough to do anything useful. That is less of an issue for image generation as models are smaller.
Memory bandwidth. That determines token generation speed, but this is only half of the perceived model performance, see next point.
Compute performance. That determines time to first token - a waiting before any response even applies. With large context it’s more important than token generation speed as it’s pure waiting time, and even very slow generation is faster than human reading speed. Smart agents also don’t need full llm response to start working and can start executing tools as soon as they arrive.
Energy consumption. Unless you have free power, that’s also important factor. Older hardware may be cheaper but usually is less energy efficient and it may turn out than renting or paying for API is cheaper than electricity cost.
And as I mentioned a lot depends on use case. If you are building interactive chat, the time to first token is the most important factor, then token generation speed. Human time is still orders of magnitude more expensive than hardware and electricity and if humans are sitting and doing nothing while waiting for AI response that is a huge loss. If building fully autonomous agents that work in fire-and-forget mode it’s less important factor, but the context and model capabilities are very important so that it can actually run without supervision. Getting crappy results but very fast is way worse than waiting for good results.
That’s why Macs are very popular - they can handle large models and if you can wait, you can get good results cheaply with lower energy usage. It’s kinda funny that Apple become the most cost effective hardware for a task. I believe it won’t last for long and seeing how easily they hardware is sold out someone at Apple would probably decide to raise prices 2x and still they won’t have any trouble finding customers.
You can optimize cost by adjusting workflows. Instead of waiting for response and interactively correcting model behavior, prepare batch, run it, go to sleep and wake up to finished job.
Yeah, my 128GB M4 Max MacBook Pro isn't the fastest machine, but it only has a 140W power adapter and can do extended inferencing on a battery. And it's portable.
The biggest issue I have with Mac is for agentic use. A lot of context is sent in the prompt when using agents and prompt processing on the Mac is incredibly slow. Although, the M5 has closed the gap a bit, it still can't get close to Nvidia.
Hot and cold (SSD) KV cache solves this issue. Unless your workflow is to RAG a different PDF document for every prompt by the thousands, otherwise agentic harnesses fly when using a proper prompt cache. In other words, this is a non-issue for local agentic work lately with the current systems (like oMLX) which are based on vLLM engines for multiple users but are repurposed for local agentic use.
Thank you. That is something that I hadn't explored yet. Going to give it a go and see how it works out. Was giving up on local LLMs; t/s on the Mac was great but the prompt processing threw a wrench into the works.
It's actually not impressive at all if you look into specs. It's a beefy CPU with an extremely outdated GPU using late 2010s level architecture. 26TFlops of FP32, no FP16, FP8 or FP4, some 36 INT8 TOPS from the neural engine. For reference 1080Ti has 45 TOPS of INT8 and RTX 2060 vanilla, has 52 TFlops of FP16, double that of Mac Studio.
With so little compute performance no wonder it uses so little power. The memory is also mobile LPDDR5X too, that consumes like 1.2W per 8GB. Except for the memory and CPU, you are basically getting scammed.
The watt per token is not as impressive on macs as you would think. Because macs are so much slower, their efficiency is deceiving. In fact I just had opus (could be wrong) calculate watts per token of a m3 ultra and rtx 5090, with Gemma 4 26b the mac studio only came out 10% more efficient per watt and 40% with qwen3.6 35b. Considering that a rtx 5090 is over twice as fast, that isnt very impressive for the mac.
Macs can handle huge models and are efficient but their slow speeds make it not worth it.
Correct; race-to-idle matters. If you have a very fast system and the fixed overheads aren’t bad, it can be more efficient to use the power hungry one and idle than have the Mac go much slower for the same task.
But it depends, too, on what you’re doing. YMMV.
You can run models on a Mac Studio that you simply cannot put on one or even two 5090s.
An RTX Pro 4500 has half the memory bandwidth of the 3090, but is still way faster (15-70%) on pp and tg for me. Plus, the 32G allow for full context windows with most models targeted at the single gpu market
3090s evaporated from the Bangkok local market about a month after qwen 3.5 released. Went from dozen + at 22-25k baht, to 35-40k if you can find them lol.
100% the value sweet spot if you are buying today. I have a 4090 and 4070tis in separate rigs, and that extra 8gb is really the unlock to running capable local assistants.
**Vitalis is a self-evolving digital engineer that lives inside your computer — capable of writing, testing, and fixing its own code to build whatever software you can dream up, without you doing the manual heavy lifting.**
Built entirely by one developer. No team. No funding. Four years of self-taught work.
What Is This?
Most AI coding tools are assistants — they wait for you to ask, then suggest. Vitalis is different.
Vitalis_Devcore is an **autonomous execution engine**. It receives an intent, writes the code, runs the tests, and if something breaks, it heals itself and tries again — all without human intervention. It is the "hands" of the FSI ecosystem, designed to operate alongside **[Vitalis_Core](https://huggingface.co/FerrellSyntheticIntelligence/Vitalis_Core)**, which provides the cognitive reasoning layer.
Core Architecture
Component
Role
`SovereignKernel`
Writes and scaffolds code to disk
`KernelDaemon`
Watches for tasks, executes them, validates results
`SelfHealingLoop`
Detects failures and autonomously attempts recovery
`KernelValidator`
Runs pytest against generated code
`ProjectLedger`
Immutable append-only audit log of every action
`InferenceEngine`
Confidence-gated response generation with RAG augmentation
`ConfidenceBridge`
Autonomously re-queries when confidence is in the hypothesis zone (0.45–0.65)
`Hippocampus`
Memory-mapped binary vector store for long-term recall
`ResonanceEngine`
Continual learning — adjusts kernel weights from interaction history
`ContextSerializer`
Serializes full project state for agent context windows
How It Works
```
You give Vitalis an intent
↓
CognitionEngine generates a plan
↓
KernelDaemon picks up the task
↓
SovereignKernel writes the code
↓
KernelValidator runs the tests
↓
Pass → ProjectLedger logs success
Fail → SelfHealingLoop attempts autonomous recovery
↓
Pass → Recovered and logged
Fail → Failure report generated for review
```
Getting Started
1. Clone the repository
```bash
git clone https://huggingface.co/FerrellSyntheticIntelligence/Vitalis_Devcore
cd Vitalis_Devcore
```
Start the self-healing monitor in a separate terminal
python3 -m src.loop.self_healing
Trigger a task that fails — Vitalis will detect the failure
and autonomously attempt recovery without you touching anything
```
Technical Highlights
**Custom HDC Engine** — A compiled C extension (`hdc_engine.so`) for hyperdimensional computing operations including vector binding and bundling
**Memory-Mapped Neural Store** — `Hippocampus` uses `numpy.memmap` for persistent binary vector storage across sessions
**Confidence-Gated Inference** — The `InferenceEngine` uses a `ConfidenceBridge` to autonomously augment prompts when confidence falls in the hypothesis zone
**Temporal Knowledge Retrieval** — `train_self.py` supports querying memory nodes that were alive at a specific Unix timestamp
**Hot-Ingestion Daemon** — `watcher.py` monitors the knowledge directory and re-ingests new documents in real time
Governance & Integrity
**Quality Gates** — All autonomous actions require passing pytest before being committed to the ledger
**Immutable Audit** — Every action is SHA-recorded in `project_ledger.json`
**Failure Transparency** — All failures are written to `failure_report.json` before recovery is attempted
Roadmap
[ ] Connect Vitalis_Core LLM as the live reasoning backend
[ ] HuggingFace Space interactive demo
[ ] Natural language task input via CLI
[ ] Multi-agent coordination between Devcore instances
[ ] Web UI dashboard for ledger and task visualization
About the Developer
FSI (Ferrell Synthetic Intelligence) is an independent AI research project built by a single self-taught developer over four years — no formal education, no team, no funding. Just a vision, a tablet, and a GPU.
If this project resonates with you, a ⭐ star goes a long way.
Nvidiachuds in this subreddit don't understand anything, your words will be wasted on them.
Those 5090s 32gb at 15 will be 500 or so GBs of vram. But you will need to rewire your fucking house so you don't blow your breaker, and the power draw will be around 6000w.
That same unified memory macbook does that but at 150w max, and if the power goes out, well, it can do that on battery for three hours.
I bought my first GPU ever today, after computing for decades. Local LLMs pushed me over the edge. AMD 9700 32gb, I really hope it has almost similar performance to a 3090.
On a dollars per GB/s+GB+slot (assuming multi-GPU inference), jamming a machine full of RDNA4s with 16GB or more ends up being the win. You end up being able to scale GB sanely, but also scale GB/s the cheapest.
Can't wait until R9600Ds start showing up in the grey market, they're 9070GREs /w 32GB and built like the R9700S. Should grey market retail for like $800ish.
honestly the M5 40 core macbook pro is super cost competitive depending on your exact use case. $5k all in to run dense models at 100+ Gb memory and acceptable (depending on use case) inference speed can be a deal breaker
That's interesting information, but neither the 5090 or RTX 6000 have a speed problem, and potentially damaging my $8000 GPU or doing anything that might impact the warranty is a real non-starter
Would these speeds also work on the 5060 ti? It's got 1/4 the bus width and bandwidth of a 5090
The 5060 TI is interesting, because the density is double compared to the other 5000 series GPUs, it has 16G on a 128 bit bus, if they did the same to the 5090 it would be a 64G gpu.
I also depends on how long you expect your build to last, and your overall outlook for the technical landscape. There are a few promising signs that the balance between core speed and memory bandwidth might shift. We are moving towards more efficient, low footprint kc caches with slightly more processing steps, MTP shifts workload from more TG-like to more PP-like, and diffusion models, even if only used for drafting, is a big inversion of processing power vs memory bandwidth. For local, single user scenarios, any technique that liberates computation power from memory bottleneck would be extremely effective and will impact what hardware is to be considered better value.
Sadly no one tells that this is not everything in llm world. M5 max witg 128gb running mlx optimised models is very viable option, being only around 600gb/s. I tought i would see improvement with 3090 over it (filling only vram) and jokes on me, mlx optimised model goes head to head with 3090.
yeah but macs draw very little power and they are pretty much plug and play which is a big plus for a lot of people. MacOS seems to be well optimezed for ai usage as well (i am not sure i've never used it).
M5 40 Core Ultra should be around 1228 GB/s, come with 8x or maybe 16x more RAM/VRAM, and use a fraction of the power. If you want to scale a Mac Studio larger you can use thunderbolt cables to build a RDMA Cluster. Becuase of the low power draw you could plug 4 MAC studios into power bar and have them sit on your desk. 8 x 5090 @ $4000 is $32,000 going to cost a lot more than a Mac Studio even before you add the rest of the CPU/PSU/RAM/Cooling/Enclosures. You still will have more throughput, but for people 'Using" AI not training it I think the Apple ecosystem is a strong option. I expect the New CEO will push in this direction more. The stuff done to date (RDMA over Thuderbolt) isn't really a retail user thing and the fact that they are selling out of mac minis and studios is going to draw their attention in this area.
M5 chips are laptop chips with up to 32GB of 153.6 GB/s memory
M5 chips are laptop chips with up to 64GB of 307 GB/s memory
M5 Max chips are laptop chips with up to 128GB of 614 GB/s memory
RTX 3090 GPU doesn't exist as mobile
RTX 3080 Ti Mobile GPU has 12GB of 384GB/s memory OR 16GB of 512GB/s memory
RTX 5090 Mobile GPU has 24GB of 896GB/s memory
Can we just pause and think about how fucking fast that is? I know it is local, but think of our 56.6k modems… Near 2TB/s. Home internet tops out (generally) 1-2 Gbps. Thunderbolt 4 tops out at 40Gbps. The worst card listed is 960Gbps. And yes I know this internal computer architecture vs accessories or internet, but holy fuck
MI50: 1024GB/s
MI100: 1230GB/s
7900XTX: 960GB/s
A6000 Blackwell: 1790GB/s (so 5090 performance with a much bigger memory pool)
Radeon AI Pro 9700: 640GB/s
That is not correct. You can look up benchmarks yourself. For qwen 3.5 35B the M5 (Max) has PP of about 2000 t/s at 8K Prompt length. The 3090 was around 2300 t/s. Not exactly the ballpark you are mentioning
Yeah should also display vram amount. It's awesome to have a lot of speed but if you can't load a decent model because you lack vram space, what's the point? Don't get me wrong, Im not praising ram amount over bandwidth. It's just that things are a little bit more complicated than "look at my speed" or "look at my huge ram". This kind of post is misleading.
Interesting, but I can see right now even the M5 Max has 460GB/s, so does it really help if the bandwidth is still lower in the end?
The naming is a clusterfuck lol
Basically, the Ultra is two Max chips duct-taped together in a really clever way to essentially double the performance. Apple hasn't produced an Ultra chip yet for M5 (and they skipped M4 Ultra) so there's a weird trade-off where you get better bandwidth on the M3 Ultra at the cost of the older, less efficient architecture.
I'm expecting an M5 Ultra to be released; Apple seems to be making odd-numbered Ultra processors. And if they offer it with 512GB, they've got my money.
I hope so too but I'm bracing for them to skip M5 Ultra. Fab capacity is so constrained at the moment that I wouldn't be surprised if Apple is stockpiling chips and RAM for iPhones (since that's the profit center) instead of allocating a lot of silicon to niche products like Mac Studio Ultras
I'm still on the original M2 Ultra, I wonder how much better it is? From what I can find its really negligible. I guess the main benefit is that the max addressable VRAM is technically higher, but a maxxed out Mac Studio starts getting so expensive that we're back to considering NVIDIA setups.
Lets hope there's a new refresh that really changes the perfs.
However need also to point out that some cards are better than others because of their support on things like FP8 etc which some of the above are missing like the RTX3090
Also don't forget that bandwidth is mostly additive. So if you have 4 RTX 3090s, you'll have nearly 4TB/s of bandwidth. LLMs are one of the few things that can saturate compute before bandwidth
Its not the whole story though. Bandwidth *per* GB also matters. E.g. the B70 is even worse than it looks vs 3090 here, because its 608GB/s that is (generally) reading 32gb, while the 3090 has bigger bandwidth to read from a smaller memory.
Also true, I was just staying with the memory topic:) Not sure why I was downvoted though? People do tend to forget that bandwidth needs to be considered in connection with how much you will be reading.
There have been several posts recently that seemed like bot brigaded in the comments to pump links. I think its getting really bad here, so basically I wouldn't take any upvote/downvote numbers seriously anymore.
I have an M5 max Mac studio that is very fast but not enough ram and a strix halo that has much more RAM but is slow. Kind of in a weird place until more options are available.
Just bought an HP Omen PC with a 5090 from Microcenter. Not as fun as doing a custom build but my energy is focused on development right now so went pre build. It is an absolute flamethrower speed wise (although the actual thermals and noise are quite good)🔥
Thrumpwart@reddit
Someone out there likely needs to read this: get an AMD GPU.
pgrpcie@reddit
AMD is good for competition and keeping GPU prices in check, but Nvidia RTX is still the best for drivers and software compatibility.
Eg: Nvidia OptiX support in Blender rendering (rather than using CUDA).
usernmechecksout_@reddit
Are the "GPU prices in check" in the room with us?
Evanisnotmyname@reddit
MI50? What about the cheaper Nvidia cards or an older M1 Max, M2/M3 ultra?
Thrumpwart@reddit
7900XTX is the obvious winner in terms of price/performance. R9700 is 32GB at half the price of 5090.
Athabasco@reddit
7900 XTX is a little short on VRAM to run tolerable quants of Qwen3.6 or Gemma 4
Thrumpwart@reddit
So run a W7900 like I do, or dual 7900xtx.
r1str3tto@reddit
What kind of perf do you see with the W7900? Very interested in this.
Thrumpwart@reddit
Here's some llama-bench figures:
Model Size Backend GPU Layers Batch Size Test Performance (Tokens/s) llama 3.1 8B Q4_K_M 4.58 GiB ROCm 99 512 Prompt Processing 2935.69 ± 36.32 llama 3.1 8B Q4_K_M 4.58 GiB ROCm 99 512 Text Generation 94.46 ± 0.22 llama 3.1 8B Q4_K_M 4.58 GiB ROCm 99 1024 Prompt Processing 2900.54 ± 22.52 llama 3.1 8B Q4_K_M 4.58 GiB ROCm 99 1024 Text Generation 93.85 ± 0.22 llama 3.1 8B Q4_K_M 4.58 GiB ROCm 99 2048 Prompt Processing 2880.82 ± 5.92 llama 3.1 8B Q4_K_M 4.58 GiB ROCm 99 2048 Text Generation 93.20 ± 0.18
Thrumpwart@reddit
I love my chonky boi. It runs about 5-10% slower than 7900XTX, but the 48GB Vram is incredible. Perfect for Qwen3.6 27B with high context runs.
I love it for agentic coding overnight runs. Rock solid stable.
__no_author__@reddit
I would add:
AMD RX 7900XT, 800 GB/s
AMD RX 7900XTX, 960 GB/s
Only-An-Egg@reddit
What you fail to mention is max memory capacity:
NiceAttorney@reddit
Strix Halo and DGX Spark are also shared memory systems too.
onetwomiku@reddit
strix halo only needs to reserve 512Mb for system (some vendors locks it at 1GB)
CalmSpinach2140@reddit
https://medium.com/@se.mehmet.baykar/increase-vram-on-apple-silicon-for-local-llms-1b35c453b165
You can override default macOS ram allocation. No need to restart either.
Only-An-Egg@reddit
True. I don't know how much memory the OS needs to reserve on those. Running headless Linux would take up a lot less memory than macOS.
mycall@reddit
DeProgrammer99@reddit
Could make a shared Google Sheet and include recent prices and FP8 FLOPS and such, too.
In_der_Tat@reddit
Please do.
Only-An-Egg@reddit
No. Do your own research.
DeProgrammer99@reddit
I already did and have such a sheet and have made shared sheets for things before to allow others' input. Why bother sharing at all if you're not going to share...meaningfully? Searchably, usably, notjustwastingyourowntimefully? 🤷♂️
DeProgrammer99@reddit
Also, I wasn't trying to say you specifically should do it. I was going to comment at the top level but saw you contributed more to it, so I replied to yours because of the "multiple people contributing to the same dataset" context.
truthputer@reddit
You missed the R9700 32GB, which is in my opinion extremely underrated and a bargain.
lannistersstark@reddit
Why would anyone get GDDR6 over HBM2 which is MI60/MI50?
Total-Buy2684@reddit
You can assign more memory to llms with a command prompt in Mac. Can squeeze a few more gb if you close everything else.
addiktion@reddit
No numbers yet for the M5 Max chip? That would give us a rough idea of where the new M5 Ultra would land.
Only-An-Egg@reddit
The 32 and 40 core model speeds are listed in OP's image
AIgavemethisusername@reddit
Nvidia RTX 5070 GPU, 896 GB/s
SBoots@reddit
Nvidia RTX 4090 GPU, 1,008 GB/s
For anyone wondering
kwizzle@reddit
For some reason we rarely hear people talking about 4090s, probably something to do with being a lot more expensive than a 3090 and nearer in price to the 5090 for less VRAM and speed.
sfifs@reddit
VRAM is too limited. The smallest really competitive local model in my benchmarking right now is Qwen 3.6 35bA3b whose NVFP4 variant requires about 36GB minimum to barely run with concurrency of 1. Smaller models are still not really competitive in terms of instruction following and coding accuracy. So I'd look at at least unified RAM systems of 48, 64 or 128GB for anything effective.
kwizzle@reddit
Yeah but you can run qwen 35b by offloading experts to cpu very well with the 4090, and besides 27b is smarter and fits well with a 4 bit quant.
sfifs@reddit
Hmmm.. what's the throughout hit you practically see doing that? I use a DGX. Interestingly enough while I fully expected 27b to be smarter, I found they benched almost the same - here are my benchmarks - https://srinathh.medium.com/mid-size-local-models-are-now-competitive-for-ai-agents-7696b2e8b535
kwizzle@reddit
I just tested with Qwen 3.6 35b and I'm getting 55tok/s right now.
For some reason only 9.1/24gigs of VRAM on my 4090 are used and my PC memory use by llamacpp is 19.7gb.
By way of comparison when I run 27B fully in VRAM without MTP I get about 45t/s.
As for benchmarks, I always take those with a big grain of salt and I prefer testing models for my specific use cases which are mostly coding related. That being said, chatting with 35b right now gives me the impression that it might be better at general language, though I am certain that 27B is a better coder.
I'm using the following to launch it:
llama-server -m "E:\AI Models\Qwen3.5-35B-A3B-Q4_K_M.gguf" --alias "qwen3.6-35b-a3b" --host 0.0.0.0 --port 8080 --ctx-size 32767 -n 32676 -ctk q8_0 -ctv q8_0 -b 512 -ngl 99 --mlock --no-mmap --jinja -fa on --cpu-moe
sfifs@reddit
Nice! I run on vllm with the full 256k context because I find that in my openclaw turns, i routinely run in the 50k-150k token range on context with all tools, memory & session conversation history loaded
Aumcoming_Inquiry@reddit
Nice - i'm using vllm - I default to using the full 256k context these models support because in my openclaw turns, I find contexts routinely run in the 50k-150k token range with all the tools & memory and covnersation session histroy etc loaded.
Caffeine_Monster@reddit
Native fp8 is nothing to laugh at - though you really need two 4090s to get the most out of them in terms if gpu only deployments.
3090 is still the value king, and it's not even close. Only real reason to go mac is low power / always on applications.
Lost-Vermicelli-6252@reddit
I have two 4090s but they are in diff machines. If I moved them to same machine, does it use the compute from both or just the VRAM?
I’m debating whether or not the new PSU/case/cooling would be worth the effort.
Caffeine_Monster@reddit
It's worth it as it doubles the compute and bandwidth if you deploy models correctly with tensor parallel.
48gb vram @ fp8 can get you a long way.
You don't necessarily need to change much cooling wise, and you can use 2 PSUs if you want to cut corners.
formlessglowie@reddit
This. I run Qwen 3.6 27b at fp8 on two 3090s, full context, image processing and with MTP, getting a consistent 60+ tok/s in decoding. It’s seriously powerful for agentic tasks and coding in general, I’m a professional software developer and a lot of my production code nowadays is made by the GPT 5.5 plan + Qwen3.6 27b execution combo, I sometimes need a code review from 5.5 and then another coding round from 27b but that’s it. It’s beyond incredible I can actually ship production code from my Chinese motherboard and used GPUs, this was unimaginable six months ago.
indyfromoz@reddit
Could you please share your rig setup? I have a RTX 4090 with a AMD 12-core CPU, using it for mostly gaming. I would love to get rid of Windows, install a Linux distro for just running LLMs
formlessglowie@reddit
All GPUs were bought used, CPU is obviously used, RAM sticks probably are too, motherboard is a Frankenstein. I love that I can run something as ridiculous as 27b on this freak. I live in strange times.
Ok_Rope_9332@reddit
Have you tried Gemma4 31b?
formlessglowie@reddit
Not much tbh, as benchmarks are behind 3.5 27b, so I didn’t think it vs 3.6 was even a question worth considering. Is it that good? I’ve tried 26b a4b, and it’s very good for natural language stuff but fails long running agent sessions, which is what I use these models for (long coding sessions basically). Is 31b much better in that sense?
indyfromoz@reddit
Thank you 🙏
DonkeyBonked@reddit
I use: Huananzhi H12D-8D AMD EPYC 7502 128GB RAM 4x RTX 3090 24GB (I cap them at 250W) Ubuntu 24.04 LTS
Allegedly, I "should" be able to add more cards via converting my three Mini-SAS-HD (SFF-8643), but I'm very skeptical, the Huananzhi bios has been a pain in the rear for me.
I'm considering switching to PCI-E x16 to x8/x8 splitters when I get the money for more GPUs depending on how the other adapter goes. I do have a Mini-SAS-HD to OCuLink adapter, I just need a card to test with.
The worst part of this system is that I can't really make use of the BMC. If I enable the BMC and I change even a single setting from default in the bios, I immediately lose the ability to see the NVME slots.
If I had the money, I'd have gotten a different board, but the ones I would have wanted were all well over 1k.
Fit-Palpitation-7427@reddit
I did exactly that and never looked back
Lost-Vermicelli-6252@reddit
Same. I used to “need” windows for certain multiplayer games, but don’t really play them anymore, so have one of my machines running CachyOS instead. It’s amazing. Boots up so much faster than windows and stuff isn’t as… annoying.
tmflynnt@reddit
I also have two 3090s and am looking at all the various options for optimizing stuff. Would you mind sharing a bit more about your inference software setup and what you use harness wise? I assume you are swapping between Codex and something like Pi or OpenCode?
It would be nice if there was something out there that would smoothly combine frontier planning + local execution in one polished and reliable setup, but I don't think there's a one stop shop for that quite yet from what I've seen.
voyager256@reddit
But you use Q8 for KV cache too to fit full context , right? Also Wouldn’t a good Q6 quant be better for 3090(assuming you run on llama.cpp or its forks)?
formlessglowie@reddit
Yes, forgot to mention, Q8 for KV cache. I find it to be virtually free lunch, never ran into any apparent issues (Q4 is another story, can be very good or downright unreliable, depends on factors). I run this setup on vLLM for tensor parallelism, that's how I'm getting 60+ tok/s (and I'm on PCIe 3.0 x16, if I were on 5.0 this could easily border the high 80s or even 90s). Q6 would be very good indeed if I were using cpp.
Fit-Palpitation-7427@reddit
VLLM to do tensor parallel I guess?
formlessglowie@reddit
Yes, forgot to add that detail.
etaoin314@reddit
Going from 1 to 2 is a world of difference! A system with 2 4090 would be a monster. All you need is a motherboard that can bifurcate the PCI and you’re Gucci.
BosphorusScalene@reddit
I added a 2nd GPU to mine externally to skip the new case, connected with an m2 oculink adapter, minimax GPU dock and a 2nd PSU. I'm sure it's not as fast as a normal pcie slot, but it's working great so far and was way easier than a new case.
inevitabledeath3@reddit
The reason to go mac is for RAM/VRAM capacity. Nvidia GPUs get very expensive if you need VRAM for bigger models.
FinancialElephant@reddit
What about model size though?
AcaciaBlue@reddit
and more memory surely?
raindownthunda@reddit
Definitely. INT8 seems to be becoming more viable and keeping 3090’s competitive. The speed difference between fp8 and int8 on a 3090 is 1.5x+
panchovix@reddit
For LLMs, 4090 is way more expensive than a 3090 for the same amount of memory and almost same bandwidth.
The 4090 will be 2x times faster on PP vs a 3090 tho. And also is about 2x faster on compute in general (diffusion like txt2img, etc)
FissionFusion@reddit
what stat is the determining factor in PP?
panchovix@reddit
Compute units and compute in general. So higher clocks and more cores are faster. Also perf per clock (aka IPC, for the same clock getting higher performance on newer GPUs)
hidden2u@reddit
wow 2x prompt processing is huge for giant agent system prompts
comperr@reddit
Yeah that's why i got 2 5090s. They're pretty fast. I am getting a 4th one when i have time to drive to microcenter
Clear-Ad-9312@reddit
at that point, why not just buy an rtx pro 6000 that is just about the same price as 2x 5090 and has more vram than 2 5090s?
comperr@reddit
The rtx pro 6000 is just a 5090 with 72 or 96gb vram. So it is only as fast as one 5090 even if you dont need all the vram. With 2 5090s i can literally fit 2 27b qwen3.6 with q8_0 kvcache in each card and run them simultaneously.
Clear-Ad-9312@reddit
price is still a big point of reason to go for the 6000 over the 2x 5090, that is what I am presenting here
Anonymous_Prime99@reddit
I did it for the wattage really. 300W for the power of the sun without turning the place into an oven? Yes.
Clear-Ad-9312@reddit
hmm, I thought you could just limit power anyways through the settings, but ok, sounds fair.
comperr@reddit
Likewise, i limit my 5090s to 425w. You can lower the power limit extremely easily
rbit4@reddit
I got 8 5090s at price of about 2200 avg. 3.5x less than price of rtx6000.. question is when do you need the 96gb vram
Clear-Ad-9312@reddit
mind letting me know how you came across an abundent source of cheap 5090s? I can only find them for like $4.5k
Late_Night_AI@reddit
Go ahead and grab one for me while you’re there, i need a 2nd 5090 for more vram. 🚗😎
comperr@reddit
The household limit is a killer im thinking about bringing a friend so i can actually get 2
rbit4@reddit
That's the reason why I have a 8 5090 rig.. 5x perf of a 3090 and 2x of 4090 using nvfp4 vs 4 bit quant on the rest
AcePilot01@reddit
pp?
ScorpiaChasis@reddit
prompt processing... unless it is really peeing
sibilischtic@reddit
They did mention txt2img
SBoots@reddit
It doesn't seem to get talked about very much. I have a 5090 and a 4090 in my system. I had the 4090 first and while the 5090 is clearly a big step up, the 4090 is no slouch!
kwizzle@reddit
This is sorta my situation, I had a 4090 from before prices were insane and I'm considering adding a 5090. Do you feel the 4090 keeps up well enough in speed when splitting a model between the two cards? And what models and quants are you running on there?
SBoots@reddit
My go to model right now is Gemma4-31B-Q8_0.gguf (31G) w/mtp-gemma-4-31B-it.gguf (491M) drafter model split across the two cards with a 128K context. I get about 65-70 t/s. I'm using the llamacpp Gemma4 MTP branch.
rainbyte@reddit
How much PP on 4090 vs 5090 with that model?
SBoots@reddit
I see about ~4000 t/s PP combined across both cards. llamacpp doesn't give me a breakdown per card. Model is too large to run on the 4090 for me to test each card solo.
Freonr2@reddit
$/GB/s and $/GB has always been poor since launch date. They don't make a lot of sense.
Mulster_@reddit
Also 40 series power connectors burning down
biogoly@reddit
Way more 3090s were made and used for crypto mining, so it’s a much bigger pool for used and affordable second hand GPUs.
eugene20@reddit
3090 outstanding value if you can find one, maybe worried about it needing a repaste or just burning out from age.
4090 people holding onto them like unicorns, just superb cards.
5090 just far too expensive for most consumers to justify.
lemondrops9@reddit
4090's have been 2x the price compared to 3090s in my area for as long as I can remember. Guessing the supply for 4090 was low as well.
KingSlayin@reddit
That's my card
wilhelmbw@reddit
Chinese bought them up to make 4090 48gbs
Endflux@reddit
And the 3090 TI has the same memory bandwidth
ahtolllka@reddit
Anyone who have full sized 4090 should consider converting it to 48gb
SBoots@reddit
would be amazing but that sort of modification is well beyond my skill level
tired514@reddit
Not to be confused with the 4090M; I'm running a Morefine G1 eGPU (16gb) at 576GB/s with a second one on the way. Crazy expensive, but awesome little DC-powered portable rig.
megawhop@reddit
Try using an extended PCIE riser card with a 90 degree angle. Lets you keep your PCIE speed and bandwidth without using oculink, or TB, etc.
I have 5090 as my main, and a 4090 using a riser card with a 3ft ribbon cable running into a 3d printed external GPU enclosure containing the 4090, a PSU, and the female end of the riser card.
tired514@reddit
Oh the Morefine G1s are fully integrated little boxen:
https://www.morefine.com/en-ca/collections/external-gpu
They're wickedly expensive for what they are but in my case form factor is everything. I live on a boat in the summer and I'm trying to avoid running the main inverter (that makes 120V AC) at all costs. These little Morefines run on 20V DC, so I can run them off my dedicated 12V->20V SMPS. Drops my idle consumption from \~100W to \~10W.
arbobendik@reddit
I'm using a 5090M and those mobile GPUs usually have really impressive memory overclocking potential. I am running a stable +250MHz (equals nvidia-settings +4000MT/s) overclock, bringing me to 2000MHz. This raises bandwidth from 896GB/s to 1024GB/s.
Maybe test a more conservative +125MHz first though, although GDDR6 on my old 3060m overclocked equally well with +250MHz (+2000MT/s in nvidia-settings)
tired514@reddit
Funny you mention it. 😄
I was literally thinking "if the 4090M is only limited by thermals / power, what's to stop one from overclocking the VRAM given memory bandwidth is the main constraint during inference?"
Turns out the 4090M uses a 256bit VRAM bus while the 4090 is 384bit, but I did end up overclocking a fair bit and yup - decent performance improvement!
I can't wait to see what kind of token rate I get with Qwen3.6-27B @ Q6 with two 4090Ms.
NineThreeTilNow@reddit
Native FP8 as well.
illforgetsoonenough@reddit
AMD 7900XTX, 960 GB/s
sillynoobhorse@reddit
300$ 3080M 16 GB
448.0 GB/s
AcePilot01@reddit
So either a 3090 or a 5090. Got it.
4090 have it's balls cut off for no reason or something?
SBoots@reddit
24GB of VRAM and plenty fast in my experience
Ok-Measurement-1575@reddit
Same for 3090Ti.
gggghhhhiiiijklmnop@reddit
Thanks man, I felt left out 🤣
sn2006gy@reddit
FYI, for B70 users, Intel just released an update that addresses Qwen 3.6 perf issues. May start getting closer to that 608 GB/s perf.
tovidagaming@reddit
The biggest issue I had with vllm which is what seems to be needed for llm-scaler, is how to compare vllm supported quants (INT4, Fp8, AWQ, etc) with models running the usual q4, 5, 6, 8 quants on llama-cpp. It just felt like comparing apples and oranges. And that's when I was able to get vllm to even work. I will have to try the new update in a docker container...
sn2006gy@reddit
i don’t even bother with small quants as they have too many side effects - 8 is good enough and works with vllm
tovidagaming@reddit
So you would compare FP8 in vllm with Q8 GGUF in llama-cpp?
One_Difficulty_39@reddit
I may have to retry using them
frostpearI@reddit
608GBps is still a long way
psychicsword@reddit
Are there any guides on how to get it actually working though? I am still running qwen3-vl because I kept running into crashing issues.
Massive-Question-550@reddit
honestly really liking the price for the amount of memory you get but the performance is abysmal right now to a 3090. hopefully they can optimize the software as there's no excuse to have an ai dedicated card lose to a nearly 6 year old gaming GPU...
sn2006gy@reddit
3090's are 1400 bucks now on FB/Ebay and there are SO MANY FAKE / SCAM SELLERS
That $950 for new with warranty seems much more worth it.
Do I wish pricing was better? heck yeah... But i'd rather take my chances on NewEgg and run 2-3 of these cards and in both cases, we all win vs the $10k RTX6000 Pros (even though its faster, it's not 7,000 dollars faster)
Fit-Palpitation-7427@reddit
You see a big intelligence difference between q4 and fp8 ?
sn2006gy@reddit
on Qwen 3.6 27b - yes.
Evanisnotmyname@reddit
Hardware doesn’t just “go bad” or “wear out” that easily.
Yeah sure on some level it does…but PC parts are one of the few cases where you can tell pretty quockly if it’s working or not, test, and if it tests good it’s good.
It’s not like they get slower overtime..either it’s working, maybe working at 98% original capacity, or not working at all.
sn2006gy@reddit
3090's have been around the block with gaming, overclocking, mining and now AI - i know things don't "Wear out" but fans and paste do and if those fans and paste haven't been maintained then it causes heat failure in areas where I don't want to bother fixing it
And that's why you see 100s of gpus for sale or sold as not working/broken.
comperr@reddit
I have a 3090 hybrid for sale $1999 and 3090ti $1800 on eBay lol
Massive-Question-550@reddit
you would be better off buying a stack of 5060ti 16gb right now if you are on a budget. mature software, warranty, plus good vram to dollar price point and you can parallel compute in certain setups for more performance.
sn2006gy@reddit
B70 supports parallel compute
For businesses i'm not recommending any of this
kitanokikori@reddit
Even next to an R9700 Pro, the B70 is roughly the same price and like, 50% of the perf
takuonline@reddit
Even on Vulcan llama.cpp?
Massive-Question-550@reddit
yes. that's were it's performance is best and most stable. someone posted in depth performance comparisons between it and the 3090 using vulcan and it got less than half the performance most of the time. it was bad.
overand@reddit
If the software stack is actually stable, I'd probably recommend a B70 over 3090s for a business, because of the whole "used card gamble" thing. A bit slower performance with a bit more cost, but with a lower power consumption profile and a warranty & current support would probably push that over into "worth it" in that use case.
That said, yeah, you'll pull my dual 3090s from my cold dead hands. (Especially since I used some Dell OEM ones that are shorter than any others - in theory, I can put my stack of 8 3.5" drives back into my case!)
CoolConfusion434@reddit
I will share these bench stats if ya'll don't chase me out for being on Windows 😉
The other side of this box runs Ubuntu Server 26.04 with both SYCL and Vulkan compiled from sources. On the Windows side, and just for the lolz, I downloaded the pre-compiled binaries. SYCL sucked, then Vulkan beat all other combinations for this particular model:
CoolConfusion434@reddit
Adding the Linux side results. This is on Ubuntu 26.04, and don't include the latest Intel SYCL fixes so it could get better.
For short prompts, Vulkan wins. For longer prompts, SYCL sustains prompt processing better.
Positive_Kale@reddit
Do you have a link? I’m really thinking of buying that B70, but I will need to figure out the best way to use it
sn2006gy@reddit
https://github.com/intel/llm-scaler is the repo everyone is following. There are a few other repos on GitHub as people benchmark/test through the updates. It's had 4 releases in the last month, so Intel seems to finally be progressing through the prior growing pains.
M_Me_Meteo@reddit
What software stack?
sn2006gy@reddit
llm-scaler (vllm) https://github.com/intel/llm-scaler
dr_DCTR@reddit
Can the B50 compete with the B70 for smaller models below 16GB?
smallDeltaBigEffect@reddit
since the performance delta is mainly software-based, you will maybe get like 10% less net bandwitdth
sn2006gy@reddit
perhaps, but my b50 drives plex so i haven't tried LLMs on it.
WizardlyBump17@reddit
openvino 2026.2.0 was released yesterday and it adds support for gemma4 and qwen3.5. I tried the nightlies before and it is really fast, like 4k pp and 60 tg on qwen3.5 9b int4, though a specific nightly version tanked the performance of it later... That is on a b580. I wanted to try qwen3.6 35b and 27b, but i guess openvino isnt very great for cpu+gpu combos
lukistellar@reddit
Oh, I see we still are ignoring cheap AMD GPUs. Good for myself, just bought an used RX6800 16GB for 250€ the other day. RX 7900 XTX with 24GB go for as cheap as 500€ here in central Europe.
codsworth_2015@reddit
Yep, anything with good gguf support goes straight on either my XTX or Mi50's. Save the 5090 for when cuda is required like faster whisper and yolo. Might get another XTX if the price ever comes back down to what I paid for the first one but I'm not holding my breath and I'm not paying $200 extra for the same thing out of principle. Same with the Mi50's they are now triple what I paid.
qbiker@reddit
Same here, last week. What are you planning on using it for?
mrpmorris@reddit
The ones with very fast memory have much lower memory, so can only load smaller models which give much worse output.
If you combine multiple cards to increase that memory, then you introduce a lag in inter-card communication.
So this info is not the WHOLE story.
porkchop_d_clown@reddit
Maybe I'm one of the ones who needs to see this...? I worked in High Performance Computing for 25 years, retired at the end of 2024. It looks to me like you're comparing the TB5 speeds of a Mac Mini with the NVLink speeds of the NVidia cards?
But that means I really don't know what the numbers for the laptops mean...
SupersonicSpitfire@reddit
And ranked by GB/s per euro?
Opening-Broccoli9190@reddit
For those who wonder yet:
Single channel RDIMM DDR5 is 40GB/s,
Dual channel RDIMM DDR5 70GB/s
Quad channel RDIMM DDR5 is 140GB/s
devshore@reddit
M3 Ultra Mac is somewhere in the 800-900 range close to the 3090
spammmmmmmmy@reddit
For the M series you really have to see whether they are blank/Pro/Max/Ultra as they differ in the memory bandwidth.
dim_amnesia@reddit
Brain dead people somehow still spending $9500 for 10 tokens per business day and tiny context windows
twnznz@reddit
Does M5 improve prompt processing over M4 meaningfully?
Svobpata@reddit
From what I was able to find out, yes, noticeably so. The new neural units on each GPU core help with prefill
spammmmmmmmy@reddit
I don't know but somebody posted on that topic in the past 2 days I think. There was mention that the faster CPU will achieve better prefill time.
I have been chatting with Claude about the performance topic. He thinks there is no substitute to empirical testing. I may quit ollama and migrate to VLLM in order to understand the pieces of the inference process better.
My notes during my shopping for an M1 Max: ∙ M1 Pro: ~200 GB/s ∙ M1 Max: ~400 GB/s ∙ M2 Max: ~400 GB/s ∙ M4 Max: ~546 GB/s ∙ M1 Ultra: ~800 GB/s
Standard-Potential-6@reddit
Keep in mind Apple’s are also theoretical numbers summing the memory bandwidth of the CPU, GPU, and NPU.
Most workloads don’t break down like that and the GPU will only access memory at ballpark 60-80% of the total.
overand@reddit
Hugely different!
OkLettuce338@reddit
this is pure noise
dim_amnesia@reddit
Brain dead people somehow still spending 9500$ for 10 tokens per business day
dim_amnesia@reddit
Amount of brain dead people who bought mac ultra for LLMs is insane.
dim_amnesia@reddit
My RTX 6000 pro does \~1.8 TB/s
I am convicted people who compare DGX spark to it are brain dead.
leinadsey@reddit
Of course you get more tps out of a 5090 than a MBP, but the 5090 doesn’t have 128 GB memory for not-insane-money and oh.. yes, it comes with a computer too.
GameBoyRay@reddit
my whole circle of the nonunderstanding normie ass wife needs to see this.
substance90@reddit
U missed a hidden gem in there - the AMD 7900 XTX - literally twice as fast as my M4 Max MBP Pro for inference as long as the model fits.
Electrical_Table5543@reddit
Why is the transfer speed on the DGX spark so low???
FragmentedHeap@reddit
You missed one,
Nvidia RTX 4090 1008, GB/s,
You can get one $1800 ish which is much cheaper than a 5090 and you can get two 4090's cheaper than 1 5090 😄and that gives you 48 gb vram.
And if you are willing to mod them, and ship them to china, for about $150 each you can get them to be 48 gb, so two modded 4090's is dual 48gb for 96gb vram at over 2000 GB/s total.
You also left off the AMD Raedon 9700 AI 32gb vram card, which has 640 GB/s but comes with 32 GB Vram and is around $1300.
But... 2-4 Raedon 9700 AI cards is the best bang for buck with tensor parallelization. Sapphire makes one, it's $1379 on newegg.
tired514@reddit
Wait wait what's this Chinese modification to double a 4090's RAM? I found a few vids talking about how to do it, but there's a company that'll do it for $150?
FragmentedHeap@reddit
I think I have the ability to do it myself, but I'm not risking my $1800 gpu to try it. I have all the tools and a full lab...
tired514@reddit
I've got the tools as well and feel like I could probably pull it off.. but under $200? Including the memory itself? That seems irrationally cheap.. :/
Yes-Scale-9723@reddit
yes but the are too expensive and i'd never pay that price for a modded gpu. used 3090 is the way
FragmentedHeap@reddit
4090 is more than twice as fast with the same vram.
slavik-dev@reddit
This guy is in US, does that:
https://gpulab.net/product?id=2
Far_Course2496@reddit
Those gpus are from Alibaba
tired514@reddit
It bothers me so much that Apple managed to produce a desktop with 400-600GB/s memory bandwidth but there's no equivalent in the x86 world.
vasimv@reddit
Some Intel and AMD server CPUs has quite big memory bandwidth (Amd Epyc and Threadripper CPUs, Intel Xeon 6). Like AMD 9124 - 12 channels DDR5 with 460 GB/s. There are even Intel Xeon with HBM2e memory (like 1.6TB/s theoretical bandwidth). But building with these will cost quite much as need shitload of registered ECC DIMMs to fill all channels, CPUs costs quite much, same for servers motherboards.
tired514@reddit
Ya but are they unified with the GPU? That's the crux of the problem. :/
TokenRingAI@reddit
Intel and AMD have reacted to this, but it takes basically 5 years for that reaction to become a product you can buy at consumer prices
mjsxi__@reddit
why does this bother you? seems dumb to be bothered that a company is making something decent...
Look_0ver_There@reddit
What a strange interpretation of their statement. I read it as they are bothered by the fact that there's no x86 equivalent that matches the Apple laptops.
Keep-Darwin-Going@reddit
Is not the main problem being stuck at 24gb? That is why people are using Mac mini so they can go like way higher, speed is nothing if you are stuck using a crappy model.
complexminded@reddit
That's what I figured out when comparing my 2x 3090 cluster to my dgx spark cluster. The models I can run on the DGX spark, while considerably slower, get way more use than my 2x 3090 cluster. There are times when speed matters (classifying 40k comments) and I'll use my 3090 cluster for that. Everything else goes to the DGX Spark cluster (95%) regardless of speed.
KURD_1_STAN@reddit
I havent seen / there arent many benchmarks comparing dgx vs 3090/3090s, so im assuming based on my instincts here, but what model can be ran on dgx that cant be ran with gpu with ram while still being faster? I can only think of garbistral medium
Badger-Purple@reddit
I’m running Qwen-397b, Minimax 2.7, Mimo 2.5 or DS4 Flash on dual sparks. You can’t run those in 48gb VRAM. With offloading, even the 6000pro on a DDR5 system gets slower than the dual sparks.
complexminded@reddit
On a 4-node Spark? I get if you only have one dgx spark. Not saying it isn't possible to accomplish the same build with gpu's, but for me the simplicity of the dgx being plug and play with less "moving parts" (heat, power, etc) beats a build on 3090 + system RAM. Yes the trade-off is speed.
All personal preference; everyone has different tradeoffs.
Spara-Extreme@reddit
Aren't 2 spark nodes approaching RTX 6000 PRO money?
complexminded@reddit
Depending on what you can get an RTX 6000 PRO for. They range from 11k to 13k based on my searches. I got my sparks at original price. So it's a 3-5k difference. 32GB less RAM but WAY faster interference. Trade-offs for sure. I'm happy with the route I went but not everyone would be.
Spara-Extreme@reddit
Ah fuck I'm out of date. I picked mine up for 8.8k in January.
complexminded@reddit
Yea, some would say I was stupid for the early adoption, and I agreed until I saw the price skyrocket. I knew when I bought it what I was in for so I took the leap.
Spara-Extreme@reddit
I want to build an always on LLM inference and I have a relatively high budget, but I'm constantly torn between a Spark cluster and just adding another 6000RTX pro to my current machine.
complexminded@reddit
Yea that's a tough (great) position to be in. I'm not sure how I would decide tbh but it would be largely based on my use case. For instance, I use my 3090 x2 cluster for classification/sentiment processing. I sometimes need to process +40k records a day. Speed matters when you're doing tasks like that. In that case I'd definitely go 6000RTX route because speed is important.
But if you're into fine-tuning, which I also am, the dgx spark cluster is nice because I generally dont care about speed when training, and having more VRAM capacity is more important
Spara-Extreme@reddit
Yea - thats the crux of it. The inference speed of the RTX is (borat voice) verryyy niceee but I'm also finding it enticing to use some of the larger models regularly. Thanks for your perspective. I'll need to think on this.
Icy-Pay7479@reddit
What models are you using? I've been doing a lot of research and I haven't seen impressive results from 128gb setups. It seems like 256/512 is the big step from 48/64
complexminded@reddit
I have 2 node cluster now but for 1 DGX Spark, I think the best candidates are the recently released Step 3.7 Flash - reported to get 20-25 t/s. Or Qwen3.5 122B A10B int4 AutoRound - I find it a bit deeper than Qwen3.6 and it can get 35t/s with mtp. Even Qwen 3.6 27B at FP8 gets around 17 t/s with mpt and I find that a lot better in quality than Q4 quants. And you can run it at full context with 3x concurrency.
But it gets more useful with 2+ cluster imo.
Icy-Pay7479@reddit
thanks for the info - the higher quants on smaller models is appealing! I was looking at a Strix Halo.
Keep-Darwin-Going@reddit
I have no idea how people are having success with those quants model, they tend to go into loops and error so often it is frustrating. So usually I only use those with full precision which most will not fit into my 4090.
Icy-Pay7479@reddit
Maybe it’s the harness? They work well in Hermes agent but for actual coding they kinda suck.
complexminded@reddit
No worries! BTW this isn't an ad. I'm not trying to convince you to go this route or saying it's the best way. Just sharing my experience
mrgreen4242@reddit
Right, that’s the missing data point here: how much RAM can each of those devices access at that speed? Even the regular M4 mini could, until recently, be configured with 32gb of RAM and the Pro version up to 64gb. The M5 MBP mentioned on this list can also be configured with 128gb of RAM.
So, yes, an Nvidia GPU can be up to 2x as fast, but tops out at 32gb of VRAM. You could get two of them and have 64gb but you’re looking at $4k PLUS the computer they’d go in. You can almost get an entire MBP with 128gb of RAM for just what the GPUs cost.
Plus it fits in my backpack and draws 140w tops (technically I think they can draw up to 200w for a short period by pulling from the power adapter and battery at the same time).
For comparison, a single 5090 can draw 575w. So for two of them PLUS a PC to put them in and a monitor (to compare “apples” to “Apples”) you’re going to be looking at 10-15x the power usage.
It’s not really a “this is better than that” situation as much as it is these are two different options that have similar price points and make different trades offs - more total RAM, lower power consumption, compact form factor vs. faster RAM speed but less RAM, larger form factor and higher power consumption).
Whyme-__-@reddit
I have 2 Sparks connected together running Qwen 35b MOA for my startup and what I have seen is that if you use DP2 for concurrency I can get 32 concurrent request at peak using both hardware. I have a whole benchmark of DP, PP and TP done I can share. These hardwares are awesome for what they can do which is loading the LLM on vram and holding it at same space for long time. Meanwhile in a Mac you can load the model but when the OS needs the unified memory for chrome it will boot the model out and prioritize loaded apps. Concurrency over speed gets you to do things like fine tuning, parallel processing, intensive work like benchmarks. If you just want a chatbot to run then you will get max 35tps on fp16 models which is not bad.
I use these for my startup and so do my customers and it’s a game changer.
koenafyr@reddit
Disingenuous comparison because you're comparing flagship Nvidia to a Mac mini. You could have suggested bunch of 3060 12gb for example
rpkarma@reddit
Same reason why the Spark is so fun.
It’s slow, that’s true. You’re usually maxing out around 500tk/s pp and 20tk/s decode but there’s not much else that lets you run models of this size for this price
For me though it’s more about being able to train and quantise and distill, testing my experiments on similar-ish hardware to a cloud rented system before uploading it
BallsInSufficientSad@reddit
Yes. This is why I still recently got the M3 Ultra 512GB
Badger-Purple@reddit
M2/3 Ultra, 850 Gbps
TechySpecky@reddit
Bro I wish I could find an RTX 5090 anywhere close to RRP
overand@reddit
i'm genuinely thrilled with my dual 3090 setup on a DDR4 system with a Ryzen 5 3600, even though one of them is PCI-E x16 and the other is x4!
$2000 MSRP for the 5090 with 32 gigs of ram, and good luck getting that price
$1800 for a pair of used 3090 cards on eBay (as of a month or two ago), total of 48 GB
Yes, there's stuff that doesn't like running split between two cards, but mostly it's been pretty unusual to run into stuff that wants more than 24GB but less than 32 GB of VRAM on a single card. (I think one of them SOTA-ish FOSS voice models is like that, but I'm not even sure.)
Myarmhasteeth@reddit
My only problem on getting another 3090 is how to configure it. I see setups yet since I only have like 3 SFX mother boards, I’m cooked.
overand@reddit
You can prooooobably get a 570-based AMD chipset board for not tooooooo much money. (And, I managed to push this to 128 gigs because I already had 2 32 gig sticks in it, and DDR4 is only "sell a kidney" price, not "sell both and also your liver" price)
Practical_Form_1705@reddit
What is performance of such setup, let say 8core ryzen + 128gb ram in compare to gpu?
overand@reddit
Oh, I doubt CPU inference would work very well, but, if you give me a model you want me to test, I can give it a try with a CPU-only build of llama.cpp
But, I use it with my 2x 3090 setup - but, that runs one at x16 and one at x4, but it's still decent!
_realpaul@reddit
The 3090 cant do the latest features but its still an awesome piece of tech.
ohhi23021@reddit
i haven't tested over 40k context yet but it does about 70-80 t/s around there. at 0-5k context it hits 90 t/s with mtp.
_realpaul@reddit
Nice.
palashjain_@reddit
I recently bought a second 3090 for my setup hoping the same. I too have ryzen 5 3600, msi x570 a pro with 2 pcie slots. But for some reason anytime i plug anything into the second slot (x4, chipset slot) the motherboard does not post display and shows a red light on vga. I have tried single gpu on slot 2 and two gpus together. Doesn't work. Only thing that works is single gpu on first slot (x16,) . If it matters i do have 2 nvme ssds and 64gb ram. I tried removing everything and starting with just single ram chip too. Same outcome. I tried bios settings like gen 4 gen 3 and that weird mining setting. None of those worked. Any help is appreciated
undisputedx@reddit
check the shared lanes thingy on the mobo website.
lemondrops9@reddit
Also look at the manual and be sure if this PCIe slot or NVME slot is used that PCIe slot is unavailable. Its not very common for an NVME to do this but never know until you check.
overand@reddit
I'd start by taking a bright light and inspecting the slot to make sure there isn't anything in there like a bit of paper, plastic, etc, and that there aren't any bent pins.
After that:
palashjain_@reddit
I will try to look for the debris and bent pins. I did try after removing both nvmes. Did not work. I am not very savvy when it comes to motherboards. What is funny to me is that it only works when the second pcie slot is unoccupied.
Clean_Hyena7172@reddit
How well does Qwen3.6-27B run on that setup? What quant? And how many t/s?
overand@reddit
(Apologies for the formatting here - I really attentively formatted everything, and when I tried to submit it, reddit wouldn't allow it. I'll reformat from desktop in a few; doing this from an ipad with a keyboard misssing the right arrow is awful lol)
I've used a few different configurations - one is the "Club 3090" setup, which has specific configurations for single and dual 3090s.
But, here. A standard Q8\_0 config, an MTP config, and an MTP + NGram config.
All 128k ctx, Q8\_0 (and no cache quantizing).
* Stock model gets PP: 2027 and 27.1 gen.
* MTP model gets PP: 1371, Gen: 49.
* NGram configs skipped as they don't seem to add any performance
* Smaller quants skipped because lazy
#This one gets PP: 2027 T/s, Gen: 27.1 T/s
#
\[unsloth/Qwen3.6-27B-GGUF-128-ctx:Q8\_0\]
hf = unsloth/Qwen3.6-27B-GGUF:Q8\_0
ctx-size = 131072
temperature = 1.0
top-p = 0.95
top-k = 20
min-p = 0.0
presence-penalty = 0.0
repeat-penalty = 1.0
reasoning = on
# This one gets PP: 1346.8 T/s, Gen: 41.9
#
\[unsloth/Qwen3.6-27B-MTP-GGUF-128k:Q8\_0\]
hf = unsloth/Qwen3.6-27B-MTP-GGUF:Q8\_0
no-mmproj-offload = true
spec-type = draft-mtp
spec-draft-n-max = 3
no-mmproj-offload = true
ctx-size = 131072
The "no-mmproj-offload" gets the mmproj (vision support) offloaded to system RAM / CPU, so it'll still **work** if I need to use it, but it won't take up VRAM. (I used to just disable vision for a lot of these.)
overand@reddit
I've used a few different configurations - one is the "Club 3090" setup, which has specific configurations for single and dual 3090s.
But, here. A standard Q8_0 config, an MTP config, and an MTP + NGram config.
All 128k ctx, Q8_0 (and no cache quantizing).
Smaller quants skipped because lazy
This one gets PP: 2027 T/s, Gen: 27.1 T/s
[unsloth/Qwen3.6-27B-GGUF-128-ctx:Q8_0] hf = unsloth/Qwen3.6-27B-GGUF:Q8_0 ctx-size = 131072 temperature = 1.0 top-p = 0.95 top-k = 20 min-p = 0.0 presence-penalty = 0.0 repeat-penalty = 1.0 reasoning = on
This one gets PP: 1346.8 T/s, Gen: 41.9
[unsloth/Qwen3.6-27B-MTP-GGUF-128k:Q8_0] hf = unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0 no-mmproj-offload = true spec-type = draft-mtp spec-draft-n-max = 3 no-mmproj-offload = true ctx-size = 131072
The "no-mmproj-offload" gets the mmproj (vision support) offloaded to system RAM / CPU, so it'll still work if I need to use it, but it won't take up VRAM. (I used to just disable vision for a lot of these.)
Icy-Pay7479@reddit
I have both and managed to get a comparable qwen 3.6 27b setup on the 5090 by lowering context. It gets dumb long before 256k regardless.
Speed is similar between them, partially because I got an 8x 8x mobo and can make better use of tensor parallelism, but it was only a 10-15% boost.
Dual 3090 is the better value, but there are some models on the 5090 that absolutely scream in comparison.
overand@reddit
There are definitely a few models and tools out there that I've wished for a 5090 for, like some of the weird TTS models, or speech-to-speech ones. But, I even had "okay" performance on one of those "realtime 3d walk around in a hallucination" models with the 2x 3090s!
Icy-Pay7479@reddit
Don't fomo over in. 2x3090 is getting the most attention and optimization right now. You're at the right party and it's jumpin'
Afraid_Manner_5530@reddit
Hey just out of curiosity because I have the hardware, what about 5090 with a 3090 for overflow?
Icy-Pay7479@reddit
yeah for sure, only drawback is losing nvfp4 precision when you mix with an older card but it's unlikely you'd be optimizing for that anyways.
3090+5080 is 56gb, you could hold a much higher quant.
Afraid_Manner_5530@reddit
Ok thank you, I also have the really stupid idea of using an RTX pro 4500 blackwell in the same system with those other two cards because it's only 200 watts and I have a motherboard that can do x8/x8 gen 5 and also has a lower gen 4 slot with 4 dedicated lanes where I could stash the 3090. If I'm not mistaken this would cause me to take a significant hit to prompt processing but with 88gb of vram I think I should be plenty well usable, right? Certainly much better than falling back to system ram at least.
Antoniethebandit@reddit
Same here
Status-Secret-4292@reddit
I have found other 3090s I have considered buying to make a dual set up, but they are never the exact model of 3090 I have (gigabyte oc gaming) and from what I understand, it should be...
I wonder though, how important is that really?
overand@reddit
My undertanding is that it's not actually particularly important; maybe if you want to use NVLink, but even in that situation, I think it's explicitly allowed. (Double check me on that, though!)
Status-Secret-4292@reddit
I guess I was pretty much only considering it nvlink style as that seems to offer the best performance?
I appreciate the info!
Massive-Question-550@reddit
honestly best value setup. only better combo is if you got an old amd epyc before the shortage so you get the full 16x pcie gen 4 speeds per slot and can run large MoE models with all the ram.
also if you make the right setup you can have both cards work in parallel and cut your promp processing time by 30-40 percent and boost your token output.
m31317015@reddit
Did exactly that. 7B13, ROMED8-2T w/ 8x64GB DDR4, right now only have a 5090 and 3090 but can stuff another 3090 in. This is the best value setup (not counting GPUs) you can get at Q3-Q4 in 2025.
JustinPooDough@reddit
same. Also even have the one card in 4x. My CPU is only a 1600x.
It still gets like 50 t/s with Qwen 3.6 27B.
overand@reddit
You must be using MTP!
brickout@reddit
I'm rocking very similar to you after scoring a couple of 3090s for pretty cheap before the used prices went up. I love it.
marutthemighty@reddit
Aren't NVIDIA RTX 5090s (or whatever latest GPU NVIDIA have in their arsenal) already out of stock and taken up by AI enterprises in America and China?
TechySpecky@reddit
Well there's some available but closer to $4000
marutthemighty@reddit
That is...expensive.
compWizardLOL@reddit
I got one by signing up for those alerts then I started noticing patterns when they would happen so I would try to predict when one would happen. I got lucky and predicted before the notification went out and I got it
TechySpecky@reddit
alerts where? I'm in EU
Sophia1995_miam@reddit
rtx 6000 pro will be the same price as the 5090 if this keeps up. msi liquid is going for 45K
greentea05@reddit
I got a founders edition at RRP, put it in the PC and haven't used it once for LLMs or gaming - it should probably sell it and make that £1000 profit.
Far_Composer_5714@reddit
Any retrospect I kind of wanted to buy the $2,400 5090 before the mainstream ai craze. But it is what it is
MysteriousSilentVoid@reddit
LOL. I keep kicking myself for not jumping on it last August. I literally had purchased one at MSRP and cancelled the order.
bonesingyre@reddit
I regret not buying the 5090 FE when I got offered it for $1999 through Nvidia's website. I ended up buying the 5080 FE for gaming 😭
Xidium426@reddit
I feel incredibly lucky, I was unemployed and got one at MSRP during one of the Nvidia lotteries.
AnonsAnonAnonagain@reddit
Why though? Why wouldn’t you just get the RTX Pro
shuozhe@reddit
No worry, nvidia will support us. Didnt they announced a prices increase recently for 5090?
7EET-CS@reddit
I am struck by how not shit the M5 Pro is actually. I thought the gap would be much larger.
critsalot@reddit
GB/s != tokens/s though does it? also some models will fit in a 128 gb macbook but wont fit in a 3090
AlarmingProtection71@reddit
I can recommend AMD Radeon PRO W7800. perfectly balanced for MTB 32b Models.
Möchtest du weitere Spezifikationen dieser drei Grafikkarten vergleichen?
pmttyji@reddit
I'm also getting this soon. What stack are you using? Please share info. Thanks.
AlarmingProtection71@reddit
Sry for the late answer. I first had to configure my fastfetch module. I am experimenting with themeing and colors, so dont judge \^\^ Plus, when i bought the RAM (last september) it was only 330€ for two 48GB modules, two months later they went up to > 1500€, crazy timing !
AlarmingProtection71@reddit
Sure thing, i'll post it later. Currently i am @ the phone.
AlarmingProtection71@reddit
Sry for the late answer. I first had to configure my fastfetch. I am experimenting with themeing and colors, so dont judge \^\^ Plus, when i bought the RAM (last september) it was only 330€ for two 48GB modules, two months later they went up to > 1500€, crazy timing !
DoorStuckSickDuck@reddit
Do you mean the W7900? That's the 48GB one, the W7800 is 32gb and has a lower bandwidth
truthputer@reddit
As best as I can tell the W7800 48GB exists, was never released in the US, but is available in Europe.
AlarmingProtection71@reddit
Under "GPU Memory" > Peak Memory Bandwidth 864 GB/s
AlarmingProtection71@reddit
There are different Pro W7800 variants
Fun-Time9529@reddit
4 of those in a Mac Pro 2019
overand@reddit
MTB? Do you mean MTP? MoE?
AlarmingProtection71@reddit
Sry, it was a long day :D i meant MTP (Multi-Token Prediction).
overand@reddit
Hey, no complaints here, I'm just glad I don't have to learn something else! (And, I'm glad you put in that chart/table.)
joochung@reddit
The AMD MI50 32GB has slightly faster memory bandwidth.
pixelpoet_nz@reddit
Thanks fellow German, who apparently can't write basic comments themselves. Why bother posting a canned response we can all prompt ourselves?
AlarmingProtection71@reddit
oh man :D komplett übersehen.
Altruistic_Welder@reddit
If only Apple makes nVidia GPUs play ball with macbooks - That was once the case in 2005-2006 when the 15" macbook pros came with nVidia GPUs. That relationship soured for some reason and Apple shipped Radeon GPUs but man I'd kill for a Macbook pro with a 5090.
Qu1etcocktease18@reddit
Still doesn't change the fact that I'm limited by how much VRAM I can actually afford to shove into my case.
triynizzles1@reddit
Rtx 8000 (48gb) 672 GB/s
mintybadgerme@reddit
That looks like a bargain card, but it's quite slow, isn't it?
triynizzles1@reddit
Not really. It can run 70b dense at 11 tps, which is the floor of performance. The 30b a3b sized models are all 90-110 tps. With some clever moe layer offloading this card + 64gb system ram can run all of the big 120b moe models at ~30 tps. Id day 30tps is useable still. Worth noting most modern models can be ran at max context too :)
spectre1006@reddit
Wow i feel better about my old 3090 its chugging along after a repaste
brenden77@reddit
Right, but also list the max RAM in each instance.
It's not so simple.
avariqfr30@reddit
The M5 Ultra is rumored to debut during WWDC. The Max already had 614, imagine that M5 Ultra. That might be in the 1200s..
NothingButTheDude@reddit
I am quite amazed the DGX Sparc is so slow!
But doesn't it handle way more load that that speed withut slowing down, whereas the other consumer level cards only handle a single user load?
So the DGX is actually WAY better for enterprise use?
use_net@reddit
Combined with M3 Ultra is incredible fast
techdevjp@reddit
Bandwidth is obviously incredibly important, but so is the amount of memory. 1.8TB/sec is wonderful, but only 32GB of it.
So that M5 Max 40-core MacBook Pro might be "only" 614GB/sec but you can stuff it with 128GB of memory for $5550.
Meanwhile an RTX PRO 6000 "Max-Q" has 96GB of 1.8TB/sec memory, but will run you $12k. (And you still need the rest of the computer to put it into.)
Bang for the buck, it's not hard to see why so many people still buy Macs to run local LLMs.
Pepper_pusher23@reddit
Yeah, I feel like this graphic completely misses the point.
techdevjp@reddit
There's a segment of users here who like to hate on people who buy Macs to run local LLMs.
The main issue with Macs is that the prompt processing is slow, so the time to first token can be quite long. That has been improved with the M5, but i don't think we'll see exactly how much it has improved until we get the M5 Studio later this year.
Sutanreyu@reddit
Shh, keep it a secret. Though, great for Apple. I just wish they'd drop their game changer AI already.
techdevjp@reddit
The M5 Ultra coming with the Mac Studio sometime later this year should double the bandwidth of the M5 Max, and double the GPU cores. With 256GB it will still be well under the cost of a single 96GB RTX PRO 6000. Saving my pennies.
StableLlama@reddit
This shows how interesting the Intel B70 is, money wise.
But so far I couldn't read much about the real live performance of that card for local LLM applications.
smallDeltaBigEffect@reddit
honestly, the R9700 32 GB is missing. And that fills the gap rather than the B70
mycall@reddit
Worth $1379?
smallDeltaBigEffect@reddit
if you dont want /cant dual gpu setup with mid range 50xx or 40xx, and you don't want to buy used 3090, then the R9700 seems like the best option performance / VRAM / price-wise.
NeedsSomeSnare@reddit
As an intel owner, I assure you the real life performance isn't what it says on paper. I don't have that card to give specs on.
The problem is the software side of things is a bit messy. It's not terrible, but still needs a fair amount of work.
In_der_Tat@reddit
Why doesn't Intel hire enough competent developers to catch up with Nvidia? Would that be too expensive?
NeedsSomeSnare@reddit
A lot of people wonder that too. I'm guessing it's just related to corporate money saving bs. I'm sure that the people who actually work at intel know they need more staff.
It honestly appears only a handful of people work on the software.
superloser48@reddit
can you share any benchmarks on model/quant -> prfill and token gen?
Brian-Puccio@reddit
https://youtu.be/MnGLqo5cuGQ
NeedsSomeSnare@reddit
I don't have a B70, so it's of no use to anyone.
The other problem is that there are 3 ways to run models on intel, (SYCL, openvino and vulkan)all of which have different performance on different models.
The info is out there though. You want to look for Openvino benchmarks for the best performance. It has the worst compatibility though and is sometimes months behind something like llamacpp.
Upstairs-Extension-9@reddit
I got one for MSRP on release and quite pleased with it, I used my RTX 2070 before mainly for SDXL and Gemma 4. I’m very happy with the card especially for the price and save a shit ton of money I used to spend on Claude.
overand@reddit
I'd love to hear what models you're using, what backend, what quant, and what sorts of PP and Gen T/s numbers you have!
Upstairs-Extension-9@reddit
The thing is I have a very niche use case, for general day use and just thinking I use my Claude Pro plan. I’m an architecture model builder, architect and lifelong woodworker.
I have my own fine tuned Qwen 3.5 27B model that used to be run over Runpod and I trained it there as well, it’s directly connected through a VSCode Codelistener instance that can read and adjust my code for Rhino + Grasshopper through Python. Generally Rhino is a script based 3D modeling software that is perfect for custom Python or C++ scripts, many leading architects in the world use it. I’m not a software engineer but been making my own scripts for like 20 years now for various things from site analysis, parametric modeling and calculation for efficiency. I used to this all by myself, but since a few years Claude has helped me immensely improve my scripts and help me if I’m stuck.
Now this is all running on my B70 plus 96GB RAM and works like a dream so I don’t need the 200$ Claude plan anymore and pro is enough, Opus for planning and guiding Qwen and then I mainly use my finetune.
I spent most of my work day on CNC machines, laser cutters and general woodworking machines and LLMs have helped me a lot in recent years, and now I’m saving 2000$ a year with going fully local.
Honestly I don’t have exact benchmark numbers for you right now since I’m not at my workshop but I can get back to you in the coming days, it’s Friday today.
lloyd08@reddit
There was a few posts when it came out that effectively showed it matched the price point, but had the potential for growth assuming intel actually invests in the software space. So at worst, it's price point accurate.
crossoverXYZ@reddit
thanks for the heads up, good to know before I updated anything
Nice_Cellist_7595@reddit
Lol, but will they know what to do with it?
HugoCortell@reddit
The speed is basically wasted at those sizes. What's the point of going that fast if all you can fit is a small model. A cluster of mac minis is probably better off at the price. Slower, but you can run a more competent model.
siggystabs@reddit
Multi agent workflows can overwhelm slow setups
EndlessZone123@reddit
What multi agent are you gonna reliability do with <32B models?
siggystabs@reddit
I’ve been using 8B and above on well-defined tasks for about a year now.
I have pipelines that break down workloads and process in parallel batches.These batches are initiated by scheduled jobs, personal Claude/Codex agents, and API requests from various apps I’ve built. Some systems collect data, some analyze it, and some report on it. Recent models (like Qwen) can produce reliable tool calls and output if you can structure your processes. I have evals up and down the stack. Each pipeline stage is well defined.
If you need 32B models and above you are probably working with complex tasks that benefit from intelligence more than speed. That’s completely fine, it’s why I still have Claude and Codex subs. However, if you’re using high intelligence models to do basic VLM, you’re probably wasting time, money, or both.
see_spot_ruminate@reddit
Don't downvote this person. This obsession with bandwidth is the type of crap people say when their tricked out honda civic has such and such hp. It does not really point at the actually rate limiting step that is vram first.
Randomdotmath@reddit
People are downvoting because the comment makes it sound like you’re unaware that GPUs can be connected lol
see_spot_ruminate@reddit
right... even though the comment says a "cluster"...
2Norn@reddit
image or audio models are not that big. not everyone uses llms for coding.
XO33OX@reddit
yup, qwen3 VL exists for a reason
DoorStuckSickDuck@reddit
Eh, at some point you will run into the issue of wanting multiple parallel streams, at which point you will quickly understand that the bus bandwidth is your new bottleneck.
Cosack@reddit
Workflows, multi-shot inference, and tuning. Because these are lossy systems regardless of size, you should be building for all this anyway.
The most speed and cost effective setup runs easy to manageable tasks locally and bursts to SOTA models as needed. Because burst frequency is low, the cost of non-local calls is trivial, and privacy can be retained through obfuscation and local translation. Local speed and cloud burst for temporary model size increase is the optimal setup.
pixelpoet_nz@reddit
I hate that you're getting downvoted for this, as it's 100% true.
As the saying goes, "all the speed in the world doesn't matter if you're headed the wrong way". Buncha ADHD people out here who just want infinite tokens per second of absolutely anything / random trash
mrinterweb@reddit
When using more than a single card for inference, the PCIe bus is capped at 128 GB/s on version 6. So yeah. You either need a model that will fit on a single card or you need to accept that BUS cap. Small models can be quite capable though.
bcRIPster@reddit
And for most of us stugglebussin' on our 2019 gaming purchase: Nvidia RTX 2060 GPU, 336 GB/s
And for my surplus scrap bro's: Nvidia RTX A2000 GPU, 288 GB/s
suesing@reddit
Those 2 mac speeds are the max variants. But the pro. Base pro starts astound 370
Blackdragon1400@reddit
This is also useless without average prefill and token generation speed because they are wildly different between these platforms and architecture will make the memory bandwidth a non-issue in a lot of circumstances.
bennyb0y@reddit
Gimme that MacBook ultra pls
mintybadgerme@reddit
Can someone explain what the difference between these two is??
https://www.ebay.co.uk/itm/178177138808 https://www.ebay.co.uk/itm/406910850068
amatisig@reddit
2080Ti-22G/11G 616GB/s
realblindseeker@reddit
I’ll add: Jetson AGX Orin 64GB, 204 GB/s :-)
Shoddy-Tutor9563@reddit
that alone doesn't give the full picture. Something like this one does a little bit better job:
... based on public benchmarks from
llama-bench- a tool from llama.cpp project. The standard benchmark figures are assuming you're running TheBloke/Llama-2-7B-GGUF:Q4_0. Noone in the health mind uses it today, but it gives you a base reference that is comparable.100and10@reddit
Intel Arc the absolute goat. Look at it go!
Steus_au@reddit
there is always 5060ti's at the price of mac with its 500gb/s at your possession, no need thankyou
Kubas_inko@reddit
Bandwidth is mostly useless, if you can't load the model in the first place.
comatrices@reddit
RTX 3080 16GB MXM, 448 GB/s
Any MXM card users here?
Alexal88@reddit
Guys, and how EXACTLY are you using it?
Please give me some 101s beyond the “I run local models on it” 🙏
Colecoman1982@reddit
This table is kind of useless without including price per GB/s and total vram per option. Also, I've seen others in this discussion point out that there are more competitive options that have been entirely left off this list...
BringOutYaThrowaway@reddit
The 3090 doesn’t get enough credit. Great performance for the money.
alphatrad@reddit
Dude conveniently leaves off the most compelling AMD & APPLE options to make NVIDIA look good.
AMD AI Pro R9700 GPU, 640 GB/s APPLE M3 Ultra, 819 GB/s AMD RX 7900 XTX GPU, 960 GB/s
Chart also doesn't account for max memory. So it's misleading on trade offs for why you might go unified over GPU.
This is the stuff that is causing so much confusion in these communities.
Low effort slop!
Low effort slop.
Art_4_Tech@reddit
I gave up my 3090 and I regret it. I've been running the strix halo and I'm trying to get some more serious performance.
What are peoples genuine thoughts on the gigabyte aorus ai box 32gb 5090?
It's pricey but I don't have a machine to put a normal unit in and I'd like to run an external enclosure if its viable for the money.
ideal2545@reddit
is a 5080 any decent? it’s what i got in a gaming rig, i think memory is the issue with it?
putrasherni@reddit
Bro ignoring R9709 completely
gAmmi_ua@reddit
RTX PRO 4000 Blackwell SFF (70W) - 432 GB/s Not fastest/cheapest, but pretty good at 70w cap
SkyResponsible3718@reddit
I wish the 5090 came with twice the memory. 32GB just isn't enough, and 64GB would be a complete game changer for me.
beasthunterr69@reddit
So MBA is out of the equation here?
crossoverXYZ@reddit
Been running local models for about a year now and the progress is honestly staggering. What used to require a 70B model can now be handled by well-trained 8B-14B models for most practical tasks. My daily driver setup is a 14B model for general tasks on a single GPU, and I only reach for larger models or API calls when I need that extra capability. The latency advantage of local inference is underrated too — for interactive coding assistance, having instant responses changes how you work with it fundamentally.
XO33OX@reddit
why we dont talk about rtx pro 5000 both 48GB and 72GB or rtx pro 4500 32GB, rtx pro 4000 24GB ?
MiniEval_@reddit
I have a 4500 because I just wanted to have a mini-ITX build that wouldn't blow up. A 5090 is by all means a better option when it comes to value if compute is the only concern, as it's slightly more expensive for double the bandwidth.
XO33OX@reddit
if 32GB VRAM is enough for you then single 5090 is superb (i have one), but it doesnt scale (space, heat, power, even with undervolt and aio version) well and creates a lot of headaches beyond that. On the other hand you slide 4500s one after another into standart workstation (trx50, wrx90e..) without much hassle.
LinkSea8324@reddit
Wait for OP to learn that he can use directly use text instead of storing it into an image
gandhi_theft@reddit
You left out the Apple M3 Ultra Studio which gets 819 GB/s
HuRyde@reddit
Nvidia V100 on eBay $99 is 900 GB/s
Various-Welder5544@reddit
Leave my cheap desktop alone
GeneralRieekan@reddit
This table needs a 2nd dimension: VRAM/Unified RAM amt
firetech97@reddit
Wow is the performance gap really that bug between a DGX Spark and a 5090?
nacholunchable@reddit
Yes! So many new spark users go down this rabbit hole on NVFP4 kernels and why their LLMs arent running faster, meanwhile token generation is speed bound by the memory bus and nothing they do will change that. How do I know? I went down the same rabbit hole when i got my spark half a year ago.
firetech97@reddit
I was eyeing one but have not done any actual research yet, which i was going to do before pulling the trigger. With RAM prices so high, the 128gb of unified for ~5k seemed like a better deal than building a 5090 rig, where the GPU alone is 4k and id spend at least another 2k in CPU, RAM, Storage, Mobo.
I probably would've come to the conclusion to build anyway over it, but it is an attractive all in one package with a very small footprint. I'll have to look into some benchmarks and go from there i suppose
nacholunchable@reddit
Ya, for sure. Honestly i still feel like the asus ascent gx10 (undercuts the other versions price with the same hardware in a different case) is a steal. I went for the 1tb version, it was 3k back then, 3.5k today (usd). Its a great unit for that price. I mean there were (and maybe still are?) some amd boxes you can get even cheaper, but you give up a touch of mem speed, a lot of gpu power, and close the door on clustering. If ur chill with 15 - 60 genned tps (depending on the model you run) and want the fat capacity and low energy cost, its the way to go imo. But if you crave faster speed, deeper upgradability, dont care about energy, want a real desktop for non-ai or gaming, a proper rig is better. I have no regrets, but I was expecting more performance going into this.
jakubl@reddit
There are 4 important factors when choosing hardware. They relative weight depend on the use case, and memory bandwidth is only one of them and very often not the most important one.
And as I mentioned a lot depends on use case. If you are building interactive chat, the time to first token is the most important factor, then token generation speed. Human time is still orders of magnitude more expensive than hardware and electricity and if humans are sitting and doing nothing while waiting for AI response that is a huge loss. If building fully autonomous agents that work in fire-and-forget mode it’s less important factor, but the context and model capabilities are very important so that it can actually run without supervision. Getting crappy results but very fast is way worse than waiting for good results.
That’s why Macs are very popular - they can handle large models and if you can wait, you can get good results cheaply with lower energy usage. It’s kinda funny that Apple become the most cost effective hardware for a task. I believe it won’t last for long and seeing how easily they hardware is sold out someone at Apple would probably decide to raise prices 2x and still they won’t have any trouble finding customers.
You can optimize cost by adjusting workflows. Instead of waiting for response and interactively correcting model behavior, prepare batch, run it, go to sleep and wake up to finished job.
fuckable-switcher@reddit
And you forgot about the m5 max and then double that for the m5 ultra
fuckable-switcher@reddit
Dude you forgot about amd when it did its hbm card era
The Radeon 7 has close to 2tbs of bandwidth
Covert-Agenda@reddit
Soo much context is missing off this.
Mac Studio 800gb/s minimal power draw 256/512GB memory.
Koalababies@reddit
The power draw always blows my mind
fivetoedslothbear@reddit
Yeah, my 128GB M4 Max MacBook Pro isn't the fastest machine, but it only has a 140W power adapter and can do extended inferencing on a battery. And it's portable.
Covert-Agenda@reddit
Yeah I’ve got the m5 max variant and I can use some mega models locally.
Yeah not as fast as the 5090 but it’s portable.
Aardvark-One@reddit
The biggest issue I have with Mac is for agentic use. A lot of context is sent in the prompt when using agents and prompt processing on the Mac is incredibly slow. Although, the M5 has closed the gap a bit, it still can't get close to Nvidia.
MiaBchDave@reddit
Hot and cold (SSD) KV cache solves this issue. Unless your workflow is to RAG a different PDF document for every prompt by the thousands, otherwise agentic harnesses fly when using a proper prompt cache. In other words, this is a non-issue for local agentic work lately with the current systems (like oMLX) which are based on vLLM engines for multiple users but are repurposed for local agentic use.
heresyforfunnprofit@reddit
I need more explanation of this.
Aardvark-One@reddit
Thank you. That is something that I hadn't explored yet. Going to give it a go and see how it works out. Was giving up on local LLMs; t/s on the Mac was great but the prompt processing threw a wrench into the works.
Ok_Top9254@reddit
It's actually not impressive at all if you look into specs. It's a beefy CPU with an extremely outdated GPU using late 2010s level architecture. 26TFlops of FP32, no FP16, FP8 or FP4, some 36 INT8 TOPS from the neural engine. For reference 1080Ti has 45 TOPS of INT8 and RTX 2060 vanilla, has 52 TFlops of FP16, double that of Mac Studio.
With so little compute performance no wonder it uses so little power. The memory is also mobile LPDDR5X too, that consumes like 1.2W per 8GB. Except for the memory and CPU, you are basically getting scammed.
Covert-Agenda@reddit
Mine sips 75w max at full tilt ☺️
kiwibonga@reddit
My electricity bill is lower since AI because I don't do anything else.
Covert-Agenda@reddit
Hahaha same here!
droptableadventures@reddit
It's intentionally left off because it'd undermine the point of their Nvidia fanboy posting.
Covert-Agenda@reddit
I mean, raw throughout the 5090 or 6000 are monsters but they also burn a lot of juice.
I went with a PGX ThinkStation for my CUDA and the studio for MLX.
Works well for what I need.
droptableadventures@reddit
Yes but also only 32GB of VRAM on a 3090. So you can only run a very small model, even if it is fast.
Individual_Holiday_9@reddit
Also a Mac mini is $400 lol
Embarrassed_Adagio28@reddit
The watt per token is not as impressive on macs as you would think. Because macs are so much slower, their efficiency is deceiving. In fact I just had opus (could be wrong) calculate watts per token of a m3 ultra and rtx 5090, with Gemma 4 26b the mac studio only came out 10% more efficient per watt and 40% with qwen3.6 35b. Considering that a rtx 5090 is over twice as fast, that isnt very impressive for the mac.
Macs can handle huge models and are efficient but their slow speeds make it not worth it.
rpkarma@reddit
Correct; race-to-idle matters. If you have a very fast system and the fixed overheads aren’t bad, it can be more efficient to use the power hungry one and idle than have the Mac go much slower for the same task.
But it depends, too, on what you’re doing. YMMV.
You can run models on a Mac Studio that you simply cannot put on one or even two 5090s.
Hydroskeletal@reddit
I enjoy my office not being an oven in the summer
AnonLlamaThrowaway@reddit
sorry, the context was at q4_0, it got quantized too much
Covert-Agenda@reddit
Hahahahah touche 😎
TheRealDatapunk@reddit
An RTX Pro 4500 has half the memory bandwidth of the 3090, but is still way faster (15-70%) on pp and tg for me. Plus, the 32G allow for full context windows with most models targeted at the single gpu market
Total-Confusion-9198@reddit
Anything above 500 GB/s is a serious local LLM setup. Unified memory remains the underdog.
ea_man@reddit
Oh thanks, my 6800 at 512 GB/s is standing tall 😄
With some rust you can get 16GB of that for \~260e.
Away-Sorbet-9740@reddit
3090s evaporated from the Bangkok local market about a month after qwen 3.5 released. Went from dozen + at 22-25k baht, to 35-40k if you can find them lol.
100% the value sweet spot if you are buying today. I have a 4090 and 4070tis in separate rigs, and that extra 8gb is really the unlock to running capable local assistants.
drycounty@reddit
M3 Mac Ultra 819 GB/s
Fit_Assistant7953@reddit
everyom3 needs to see this
https://huggingface.co/FerrellSyntheticIntelligence/Vitalis_Devcore--- license: gpl-3.0 language: - en tags: - code - autonomous - self-healing - agent - code-generation - software-engineering - agentic - devops library_name: custom pipeline_tag: text-generation
Ferrell Synthetic Intelligence (FSI): Vitalis_Devcore
Built entirely by one developer. No team. No funding. Four years of self-taught work.
What Is This?
Most AI coding tools are assistants — they wait for you to ask, then suggest. Vitalis is different.
Vitalis_Devcore is an **autonomous execution engine**. It receives an intent, writes the code, runs the tests, and if something breaks, it heals itself and tries again — all without human intervention. It is the "hands" of the FSI ecosystem, designed to operate alongside **[Vitalis_Core](https://huggingface.co/FerrellSyntheticIntelligence/Vitalis_Core)**, which provides the cognitive reasoning layer.
Core Architecture
How It Works
``` You give Vitalis an intent ↓ CognitionEngine generates a plan ↓ KernelDaemon picks up the task ↓ SovereignKernel writes the code ↓ KernelValidator runs the tests ↓ Pass → ProjectLedger logs success Fail → SelfHealingLoop attempts autonomous recovery ↓ Pass → Recovered and logged Fail → Failure report generated for review ```
Getting Started
1. Clone the repository
```bash git clone https://huggingface.co/FerrellSyntheticIntelligence/Vitalis_Devcore cd Vitalis_Devcore ```
2. Install dependencies
```bash pip install -r requirements.txt ```
3. Start the Kernel Daemon
```bash python3 -m src.ide_kernel.daemon ```
4. Send your first task
```bash python3 -m src.ide_kernel.client scaffold my_module ```
Vitalis will scaffold a full module structure under `app/modules/my_module/`, generate a test file, run it, and log the result — all automatically.
REST Gateway (Optional)
Start the Flask gateway to send tasks over HTTP:
```bash python3 src/ide_kernel/gateway.py ```
Then POST to it:
```bash curl -X POST http://127.0.0.1:5001/execute \ -H "Content-Type: application/json" \ -d '{"intent": "scaffold", "module_name": "my_module"}' ```
Self-Healing Demo
```bash
Start the self-healing monitor in a separate terminal
python3 -m src.loop.self_healing
Trigger a task that fails — Vitalis will detect the failure
and autonomously attempt recovery without you touching anything
```
Technical Highlights
Governance & Integrity
Roadmap
About the Developer
FSI (Ferrell Synthetic Intelligence) is an independent AI research project built by a single self-taught developer over four years — no formal education, no team, no funding. Just a vision, a tablet, and a GPU.
If this project resonates with you, a ⭐ star goes a long way.
*License: GPL-3.0*
RagingAnemone@reddit
Anybody out there doing 8 channel or 12 channel cpu inferencing? What kind of speed are you getting on big models?
ItstheRealMon@reddit
That's decent for a GDDR6X
Meterman@reddit
Rx6800xt 512Gb/s
vodanh@reddit
Doesn't vram size matter?
Even-Actuator-2608@reddit
Where's the little nvidia nano kit
BornInAFish@reddit
Intel Arc Pro B60 Dual: almost identical bandwidth and price as 3090, double the VRAM, and double the PCIe speed.
billatq@reddit
Okay, now adjust it for price for what you get.
Super_Sierra@reddit
Nvidiachuds in this subreddit don't understand anything, your words will be wasted on them.
Those 5090s 32gb at 15 will be 500 or so GBs of vram. But you will need to rewire your fucking house so you don't blow your breaker, and the power draw will be around 6000w.
That same unified memory macbook does that but at 150w max, and if the power goes out, well, it can do that on battery for three hours.
The macbook also costs 5x less lol.
valdev@reddit
Alright, now add two more columns. Cost per gb of RAM/VRAM. And cost to operate over an hour.
qalpi@reddit
What can my 3080 Ti do
Kikopedia@reddit
This seems wrong, I’m unsure what metric this is, my m4 mbp is a lot faster than my spark
joochung@reddit
AMD MI50 over 1000GB/s
Old_Grapefruit8774@reddit
MI50’s need more love from the community
joochung@reddit
I agree. I have 3 in my server and happily run gpt-oss-120b
KiDNEXTDXXR@reddit
I run perfect self tuned local llms on a 1660ti. Soon as I get money building these websites I will get a dual 5090 set up
here_n_dere@reddit
Also interesting would be to stack DGX spark, RtX pros, and their memory capacity (each)
Shoddy_Bed3240@reddit
The cheapest high speed option is 3090 ti, 1,008 GB/s
migsperez@reddit
I bought my first GPU ever today, after computing for decades. Local LLMs pushed me over the edge. AMD 9700 32gb, I really hope it has almost similar performance to a 3090.
Diablo-D3@reddit
https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units
https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units
On a dollars per GB/s+GB+slot (assuming multi-GPU inference), jamming a machine full of RDNA4s with 16GB or more ends up being the win. You end up being able to scale GB sanely, but also scale GB/s the cheapest.
Can't wait until R9600Ds start showing up in the grey market, they're 9070GREs /w 32GB and built like the R9700S. Should grey market retail for like $800ish.
Zolty@reddit
My mac studio M3 Ultra 256gb of ram does around 780 GB/s if you want another datapoint.
einthecorgi2@reddit
Show with power usage and prefill lol
mintakka_@reddit
honestly the M5 40 core macbook pro is super cost competitive depending on your exact use case. $5k all in to run dense models at 100+ Gb memory and acceptable (depending on use case) inference speed can be a deal breaker
Ok_gosh@reddit
No m3 ultra? M3 ultra: 819GB/s
aguspiza@reddit
dual channel DDR4 3200 ... 50GB/s
dual channel DDR5 6000 ... 90GB/s
lemondrops9@reddit
DDR4 3200 real world is more like 30GB/s (Maybe 35? think my settings are off on that PC). I tested my other PC with ddr4 3600 and got 38GB/s
dazzou5ouh@reddit
So this bad boy I've built should be fast?
laexpat@reddit
Could you add one more to the left side to balance it?
LittleBlueLaboratory@reddit
Nope, his CPU cooler is in that spot
laexpat@reddit
lol I know - need one more so it’s 3:1:3 :)
LittleBlueLaboratory@reddit
Ooh! yeh!
ScaredyCatUK@reddit
Mac studio is missing
How much ram does your 5090 have?
panchovix@reddit
By the way, a simple +2000Mhz VRAM OC (or +4000Mhz on LACT on linux) brings the 5090/6000 PRO to 2TB/s bandwidth.
TokenRingAI@reddit
That's interesting information, but neither the 5090 or RTX 6000 have a speed problem, and potentially damaging my $8000 GPU or doing anything that might impact the warranty is a real non-starter
Would these speeds also work on the 5060 ti? It's got 1/4 the bus width and bandwidth of a 5090
panchovix@reddit
I can understand that.
5060Ti is able to do the same overclock, not sure how much would be the resultant bandwidth.
TokenRingAI@reddit
The 5060 TI is interesting, because the density is double compared to the other 5000 series GPUs, it has 16G on a 128 bit bus, if they did the same to the 5090 it would be a 64G gpu.
aeroumbria@reddit
I also depends on how long you expect your build to last, and your overall outlook for the technical landscape. There are a few promising signs that the balance between core speed and memory bandwidth might shift. We are moving towards more efficient, low footprint kc caches with slightly more processing steps, MTP shifts workload from more TG-like to more PP-like, and diffusion models, even if only used for drafting, is a big inversion of processing power vs memory bandwidth. For local, single user scenarios, any technique that liberates computation power from memory bottleneck would be extremely effective and will impact what hardware is to be considered better value.
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
a_beautiful_rhind@reddit
My ddr4 is 230gb/s but it's hobbled by uma support.
Lxxtsch@reddit
Sadly no one tells that this is not everything in llm world. M5 max witg 128gb running mlx optimised models is very viable option, being only around 600gb/s. I tought i would see improvement with 3090 over it (filling only vram) and jokes on me, mlx optimised model goes head to head with 3090.
Pleasant-Shallot-707@reddit
Yeah, I frankly don’t care if I can serve myself 40tok/s or 300. I am only serving one person.
neopolitan77@reddit
Actually just shows what an incredible beast M3 Ultra is/was. I'd take 512GB RAM @ 819GB/s over any of those in a heartbeat.
Pixel_Hunter81@reddit
yeah but macs draw very little power and they are pretty much plug and play which is a big plus for a lot of people. MacOS seems to be well optimezed for ai usage as well (i am not sure i've never used it).
Pleasant-Shallot-707@reddit
Yeah, getting 30-60 tok/s is fine
tvmaly@reddit
It seems like they need a DGX Spark 2.0 that is at least as fast as a 4090
Dicond@reddit
As the owner of a system with a 5090 and 3090, please stop, I can only get so erect.
rorowhat@reddit
Need some more and gpus
romeozor@reddit
I just bought a B70 today. Here's hoping it wasn't a mistake
Wolfpack99111@reddit
Can someone tell me how good is a 3060 12 gb
greentea05@reddit
I have a 5090 and a 128gb M5 Max, I wonder if I can combine them in someway - tricky I imagine if not impossible.
Full-Bag-3253@reddit
M5 40 Core Ultra should be around 1228 GB/s, come with 8x or maybe 16x more RAM/VRAM, and use a fraction of the power. If you want to scale a Mac Studio larger you can use thunderbolt cables to build a RDMA Cluster. Becuase of the low power draw you could plug 4 MAC studios into power bar and have them sit on your desk. 8 x 5090 @ $4000 is $32,000 going to cost a lot more than a Mac Studio even before you add the rest of the CPU/PSU/RAM/Cooling/Enclosures. You still will have more throughput, but for people 'Using" AI not training it I think the Apple ecosystem is a strong option. I expect the New CEO will push in this direction more. The stuff done to date (RDMA over Thuderbolt) isn't really a retail user thing and the fact that they are selling out of mac minis and studios is going to draw their attention in this area.
youneshabbal@reddit
I rememberwatched YouTube video experimented this
Stock_Ad9641@reddit
It’s insane that a 11 year old GPU beats the best intel or amd have to offer by a wide margin.
DesignerTruth9054@reddit
Its like comparing apples to oranges
Stock_Ad9641@reddit
Exactly! The nutritional value of apples and oranges.
hurdurdur7@reddit
R9700 missing from the pic
zica-do-reddit@reddit
How is this measured?
TopChard1274@reddit
What's the context for this
One_Curious_Cats@reddit
Add VRAM limits and watt usage as well.
aidycas@reddit
M3 Ultra by comparison please? Think it would be 3rd from the top? (Bottom)
Quebber@reddit
Yes I could use my 5090 but my MS-01 AMD Strix Halo with 128gb 96/32 split allows me to run a Q8 Qwen 3.6 35B model with 256k context.
Ill_Barber8709@reddit
And someone out there needs to see this
M5 chips are laptop chips with up to 32GB of 153.6 GB/s memory M5 chips are laptop chips with up to 64GB of 307 GB/s memory M5 Max chips are laptop chips with up to 128GB of 614 GB/s memory
RTX 3090 GPU doesn't exist as mobile RTX 3080 Ti Mobile GPU has 12GB of 384GB/s memory OR 16GB of 512GB/s memory RTX 5090 Mobile GPU has 24GB of 896GB/s memory
Long_comment_san@reddit
You're gonna laugh but that's exactly what I asked Qwen a couple of hours ago. Huh
Evanisnotmyname@reddit
Sometimes I feel like LLM providers use customer prompts to immediately create advertising on Reddit
vinigrae@reddit
But but Apple
Both-Activity6432@reddit
Can we just pause and think about how fucking fast that is? I know it is local, but think of our 56.6k modems… Near 2TB/s. Home internet tops out (generally) 1-2 Gbps. Thunderbolt 4 tops out at 40Gbps. The worst card listed is 960Gbps. And yes I know this internal computer architecture vs accessories or internet, but holy fuck
Evanisnotmyname@reddit
Really though. Think about the amount of data being processed in an AI data center on the minute.
WiseassWolfOfYoitsu@reddit
A few random bonus ones:
MI50: 1024GB/s MI100: 1230GB/s 7900XTX: 960GB/s A6000 Blackwell: 1790GB/s (so 5090 performance with a much bigger memory pool) Radeon AI Pro 9700: 640GB/s
exographicskip@reddit
Thanks for the a6000 clarification
zerubeus@reddit
And Im only using the 5090 to play Arc raiders
Last-Owl-8342@reddit
idk man after calculating how cheap deep seek 4 flash is, Im not going local anymore
is not a rival to claude sure, but I know what I want just need someone to type all the boring parts
XxBrando6xX@reddit
M3 ultra Mac Studio, 819 GB/s
poopsinshoe@reddit
https://marketplace.nvidia.com/en-us/enterprise/personal-ai-supercomputers/dgx-spark/
JigglyWiggly_@reddit
How about amd?
ItsAMeUsernamio@reddit
16GB 5060Ti is 448GB/s
16GB 5070Ti is 896GB/s
16GB 5080 960GB/s
Interesting how Nvidia scales it up with the price.
HavenTerminal_com@reddit
M3 Max sits somewhere between 300 and 400 GB/s depending on config, for anyone on Apple Silicon squinting at this
Creepy-Bell-4527@reddit
M3 Ultra: 819 GB/s
kl3x@reddit
I'm pretty curious how Nvidia N1x will perform
Queasy_Problem_563@reddit
my mac studio m2 ultra 192gb is doing 800gb/sec
heitortp0@reddit
It hurts to remember that a 5090 in my country is ≈5/6k usd
ortegaalfredo@reddit
The bandwidth gives you the tok/s in generation, but the compute gives you the tok/s prompt-processing.
The M5 Mac will generate tokens quickly but will still take 10 minutes to process a prompt.
eltonjock@reddit
“10 minutes to process a prompt” either you’re wildly exaggerating or you’ve never used an M5 Mac.
xlltt@reddit
try processing a large context prompt 200k and then you will see that m5 will take at least 300-400 seconds on a dense model
Ok_Hope_4007@reddit
That is not correct. You can look up benchmarks yourself. For qwen 3.5 35B the M5 (Max) has PP of about 2000 t/s at 8K Prompt length. The 3090 was around 2300 t/s. Not exactly the ballpark you are mentioning
ortegaalfredo@reddit
Oh they fixed it in the M5. Nice to see. Now its much more competitive with the 3090s.
Internal_Quail3960@reddit
possible m5 ultra will be 1228 GB/s
DCGreatDane@reddit
Maybe faster if rumors are true.
Internal_Quail3960@reddit
possibly, but going off of the past generation ultra chips they have always had double the memory bandwidth
marutthemighty@reddit
What is the latest and most powerful NVIDIA RTX (or any other GPU) model?
redmctrashface@reddit
Yeah should also display vram amount. It's awesome to have a lot of speed but if you can't load a decent model because you lack vram space, what's the point? Don't get me wrong, Im not praising ram amount over bandwidth. It's just that things are a little bit more complicated than "look at my speed" or "look at my huge ram". This kind of post is misleading.
power97992@reddit
For agentic purposes, prefill is important too
Blah-Blah-Blah-2023@reddit
RTX2060 336GB/s ... yeah I am poor.
whatsamanual@reddit
The other side is capacity. I just bought my second dgx spark yesterday... I can't wait to see what they can do together!!!
into_devoid@reddit
AMD HBM2: 1.024TB/s
gomezer1180@reddit
Where is the Mac studio in this list?
InnerSun@reddit
Yeah a 2023 Mac Studio M2 Ultra has 800GB/s, that's really insane value even today when we place into context
MasterKoolT@reddit
Yes, but generations prior to M5 didn't have matmul acceleration so they struggle on prefill (M5 generation is about 4x M4)
InnerSun@reddit
Interesting, but I can see right now even the M5 Max has 460GB/s, so does it really help if the bandwidth is still lower in the end?
The naming is a clusterfuck lol
MasterKoolT@reddit
Basically, the Ultra is two Max chips duct-taped together in a really clever way to essentially double the performance. Apple hasn't produced an Ultra chip yet for M5 (and they skipped M4 Ultra) so there's a weird trade-off where you get better bandwidth on the M3 Ultra at the cost of the older, less efficient architecture.
fivetoedslothbear@reddit
I'm expecting an M5 Ultra to be released; Apple seems to be making odd-numbered Ultra processors. And if they offer it with 512GB, they've got my money.
MasterKoolT@reddit
I hope so too but I'm bracing for them to skip M5 Ultra. Fab capacity is so constrained at the moment that I wouldn't be surprised if Apple is stockpiling chips and RAM for iPhones (since that's the profit center) instead of allocating a lot of silicon to niche products like Mac Studio Ultras
InnerSun@reddit
I see, the scaling depends on the two Max chips of that generation then
Southern_Sun_2106@reddit
can confirm, qwen 3.6 on m5 max pro feels 'snappier' than on the m3 ultra
ZurielA@reddit
there is a M3 Ultra from Nov 2025, I own one comes stock with 96GB ram or can opt for 256gb
InnerSun@reddit
I'm still on the original M2 Ultra, I wonder how much better it is? From what I can find its really negligible. I guess the main benefit is that the max addressable VRAM is technically higher, but a maxxed out Mac Studio starts getting so expensive that we're back to considering NVIDIA setups.
Lets hope there's a new refresh that really changes the perfs.
joochung@reddit
I would expect the future M5 Ultra to have 1200GB/s aggregate memory bandwidth.
TokenRingAI@reddit
That's what the math works out to
HerrGronbar@reddit
Now compare it with price.
5olArchitect@reddit
Sure but that’s 128 gb of integrated ram
TheDailySpank@reddit
ImportancePitiful795@reddit
R9700 640GB/s something. 7900XTX around 1GB/s
However need also to point out that some cards are better than others because of their support on things like FP8 etc which some of the above are missing like the RTX3090
kenzu82@reddit
Still rocking Nvidia Tesla P100 at 732.2 GB/s
laexpat@reddit
That along with my P40 at 347.1 GB/s
dsanft@reddit
A dual socket Xeon Gold Cascade Lake with DDR4-2933 has about 220GB/s bandwidth. Don't underestimate CPU.
Kamimashita@reddit
2x RTX 3090 might be the most balanced? And not too expensive if you already have a system you can slot them into?
durden111111@reddit
I have a 5090 and the vram can easily OC +3000 which gives a bandwidth of 2176 GB/s
thetaFAANG@reddit
M1 max has 400 gb/s memory bandwidth btw
Apple accidentally made a machine that’s too good to upgrade for the price. M5 variants are close and compelling though
Bludsh0t@reddit
Very nice. Now do tdp
IllExample3639@reddit
Laughs in dual 3090. Worth double what I paid after 2 years if you believe eBay pricing.
NoFudge4700@reddit
B70 Pro is decent for home inference.
synn89@reddit
M1 Ultra, 820 GB/s
diggamata@reddit
MI350P is 4 TB/s
pfn0@reddit
not yet available, and I expect it to land in the $15-20K range, closer to the 20K range.
garlic-silo-fanta@reddit
Needs a column for electricity
RealSataan@reddit
Now the power draw also.
RealSataan@reddit
Now the power draw also.
higglesworth@reddit
B70 at 1/3 the performance for 1/3 the price
DrBearJ3w@reddit
Cough AMD Cough
NeedsMoreMinerals@reddit
Any changes to hardware on the horizon? Are they gonna start building pcs or gpus with 200 gb of ram?
Few_Painter_5588@reddit
Also don't forget that bandwidth is mostly additive. So if you have 4 RTX 3090s, you'll have nearly 4TB/s of bandwidth. LLMs are one of the few things that can saturate compute before bandwidth
tired514@reddit
What the.. who is modding you down?
If you're using graph split mode this is absolutely true.
ziphnor@reddit
Its not the whole story though. Bandwidth *per* GB also matters. E.g. the B70 is even worse than it looks vs 3090 here, because its 608GB/s that is (generally) reading 32gb, while the 3090 has bigger bandwidth to read from a smaller memory.
1ncehost@reddit
Also not the full story because PP is mostly compute bound and for many applications is just as important as TG.
tired514@reddit
please don't reset the context checkpoint back to 0... please don't reset the context checkpoint back to 0...
Damn you, opencode! *shakes fist*
ziphnor@reddit
Also true, I was just staying with the memory topic:) Not sure why I was downvoted though? People do tend to forget that bandwidth needs to be considered in connection with how much you will be reading.
1ncehost@reddit
There have been several posts recently that seemed like bot brigaded in the comments to pump links. I think its getting really bad here, so basically I wouldn't take any upvote/downvote numbers seriously anymore.
No-Juggernaut-9832@reddit
M3 Ultra is in the 800’s … can’t buy one now but when you could. More memory than the rest
chitown160@reddit
Imagine including TFLOPS along with wattage and cost ... oh wait the there is already websites like https://www.techpowerup.com/gpu-specs and https://technical.city/en/video/ that do exactly this.
Non-Technical@reddit
I have an M5 max Mac studio that is very fast but not enough ram and a strix halo that has much more RAM but is slow. Kind of in a weird place until more options are available.
jcdoe@reddit
No you don’t, the M5 Max Mac Studio isn’t out yet.
Non-Technical@reddit
Oh you are right. It is an M4.
exaknight21@reddit
For my Mi50 gang, 1 TB/s
Represent fam. Beat dollars per gb of vram i say. Huge shoutout to gfx906 / mixa/aiinfos !
BlackBeardAI@reddit
Unless you are rich enough to buy 5090(s) or a 6000 pro, 3090 is the king.
Intrepid_Dare6377@reddit
Just bought an HP Omen PC with a 5090 from Microcenter. Not as fun as doing a custom build but my energy is focused on development right now so went pre build. It is an absolute flamethrower speed wise (although the actual thermals and noise are quite good)🔥
freia_pr_fr@reddit
M3 Ultra, 819.3 GB/s
ColonelKlanka@reddit
Wow. I disnt realise the apple silicon non pro chips were still such low memory bandwidth.
I have a older m2 pro tht has 200gb bandwidth - this is faster than m4 non pro!
mike7seven@reddit
So on what 3b or 1b model with these numbers? Where’s the rest OP?
Acu17y@reddit
RADEON TEAM ❤️
SV_SV_SV@reddit
Nvidia P40 , 346 GB/s 🫡
Buildthehomelab@reddit
I wish it was the full picture if only we could just use mem bandwidth. Tool maturity matters so much.