Feeling a bit handicapped by my 7900 XT. Is Apple the move?
Posted by vick2djax@reddit | LocalLLaMA | View on Reddit | 39 comments
I’ve been using ChatGPT, Gemini and Claude for a long time. I work as a Salesforce developer/admin/holyshiteverything. I’ve got an Unraid machine with an Intel i9-12900K, 64 GB of RAM, and an unholy amount of storage that serves a lot of Docker containers like Plex. I ended up with a 7900 XT with 20 GB of VRAM from a failed VM passthrough experiment with a Linux project. Then I got into Claude Code wanting to make a daily RSS feed digest and then a fact-checking JarvisGPT… long story short, and a 1500W APC purchase later, I’m feeling the ceiling of 20GB VRAM (also wtf, qwen3 30b-a3b being 20.2 GB after KV cache, fucking jerks).
I’m trying to figure out what the move is to go bigger. My mobo can’t fit another full-fledged GPU. But I DO have a M3 Max 36GB MacBook Pro that is my daily driver/consulting machine. Maybe the move is to sell it and try to get a 128GB one? Or maybe throw more money at it and go for an M5 Max?
It seems from my research on here that 70B is the model size you want to be able to run. My consulting work tends to deal with sensitive data. I don’t think it’s very marketable or even a good idea to send anything touching it through any cloud AI service (and I don’t). But I’d like to be able to say that I’m 100% local with all of my AI work from a privacy standpoint. But I also can’t host a data center at home, and I dunno that I can run my JarvisGPT and a coding agent at the same time on my Unraid build.
Would a good move be to try to get a M3 Max 128GB MacBook Pro as my daily driver and use it specifically for programming to have a fast-response 70B coding agent? I’d leave my more explorative AI work for the Unraid machine. Or does the 128GB Mac still have ceilings similar to what I’m hitting now? Right now, I have qwen3.5 9B as my chatbot and qwen3 30b-a3b as my overnight batch ingester as I add to my knowledge base.
Look_0ver_There@reddit
You could also consider something like a 2nd GPU, like the Radeon AI 9700Pro's, that give you 32GB of VRAM for US$1300. If you pair that with your 20GB 7900XT, you'll have enough memory to load all of the models you're talking about at Q8_0, and 256K context. You could also move up to Qwen3 Coder Next at IQ4_NL. The preprocessing and token generation speeds will blow the Mac away. (I have a 128GB M4 Max MacBook Pro, and a 7900XTX and a 32GB AI 9700 Pro, and see exactly what I'm describing).
vick2djax@reddit (OP)
Well, sadly I’ve got a ROG STRIX Z790-E GAMING WIFI II that has 1 x PCIe 5.0 x16 slot (7900 XT), and 2 x PCIe 4.0 x16 slots (hard drive expansion and faster Ethernet card). So I’m maxed out. And even if I traded the Ethernet one out, the second GPU would be running at x4 mode. Wouldn’t that destroy performance?
Look_0ver_There@reddit
There are any number of reasonably priced boards with 3 PCIe x16 slots that are adequately spaced apart to fit up to 3 cards. I use AMD CPUs and picked up the Asus ProArt Creator board, which has 3 slots spaced at 3,2,2 apart, meaning you can fit your 7900XT and 2 other GPUs.
I'm not saying go do it, but I'm just saying that there's a highly viable third option here that you may not have considered.
Kahvana@reddit
Which ASUS ProArt Creator board has three slots?
Mine (ASUS ProArt X870E Creator WiFi) has two PCIe x16 slots, running in x8/x8 configuration with two cards inserted, or x8/x4/x4 if there is a PCIe 5.0 NVMe inserted in M.2 slot 1.
Look_0ver_There@reddit
I know it's a day later, but here's a photo of 2 x 9700Pro's and a 7900XTX installed in the Asus Proart Creator
Kahvana@reddit
Appreciated, thank you very much!
Look_0ver_There@reddit
BTW, there is an option that still exists for you. Grab one of those PCI extenders that are used for mounting video cards sideways. This allows you to use that 3rd slot, even though a card may not fit there.
I'm talking about one of these things:
https://www.microcenter.com/product/694523/PCIe_50_x16_Riser_Cable
Look_0ver_There@reddit
The regular one. The R9700Pro cards are slightly less than 2-slots wide, so two of them will fit in the 2nd and 3rd slot. If you're talking about the usual consumer-brand "2-slot" cards that are really more like 3.2 slots wide, then yeah, it won't fit. My R9700Pro looks positively tiny compared to the 7900XTX.
Always remember, there's 2-slot cards, and then there's "2-slot" cards.
Kahvana@reddit
...that's honestly very good to know, especially about the R9700 Pro. Thank you for the information!
What's your experience so far with the R9700 Pro? I've been considering purchasing it, since my 2x16GB setup with ~480GB/s bandwidth is showing its own issues under heavy load. I run text LLMs with vision, and want to try ASR and TTS models later.
Look_0ver_There@reddit
The 9700Pro isn't exactly a bandwidth monster either, at just 640GB/s. IMO that's its biggest drawback. I kind of wish AMD had given it a 384-bit memory bus and 48GB of memory instead, a bit like an RDNA4 version of the 7900XTX but with double the memory. That would probably push the card to $2K instead of $1300, which is likely why AMD didn't do it, but such a card would be far more desirable with 960GB/s of bandwidth.
I have some benchmarks that I posted in this thread over here if you want to take a look:
Qwen3-Coder-Next @ Q4_K_M
Qwen3.5-27B @ Q8_0
Qwen3.5-35B-A3B @ Q8_0
Gemma4-26B-A4B-it @ Q8_0
If you would like me to run more tests on different models, let me know.
rainbyte@reddit
Yeah, a 2nd GPU would be a good option. It could even be an extra 7900XT if those are cheap where you are, or one from the 9000 series. For inference, llama.cpp + Vulkan works well on Linux.
matt-k-wong@reddit
Having a model that just barely fits is almost useless because there's no room for KV cache, so now I aim for almost double the model size in VRAM, which is a decent heuristic for running long sessions. That said, I've also done some experimentation and found that 32K or 64K context is quite usable (though I prefer 128K). Actually, the 70B class is largely being ignored right now. The new models that came out this month punch way above their weight class: the new ~30B models basically outperform the old 70B class (think 2 years or so old).
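That doubling rule can be sketched as a quick budget check. Purely illustrative: the 18GB figure below is a ballpark for a ~30B model at a 4-bit quant, not a measured value.

```python
def fits_with_headroom(model_gb: float, vram_gb: float, factor: float = 2.0) -> bool:
    """True if VRAM covers the model file plus KV-cache/context headroom."""
    return vram_gb >= model_gb * factor

# Ballpark: a ~30B model at a 4-bit quant is roughly 18 GB on disk.
print(fits_with_headroom(18.0, 20.0))  # → False: loads, but no room for context
print(fits_with_headroom(18.0, 48.0))  # → True: comfortable for long sessions
```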
vick2djax@reddit (OP)
Yeah, I’m just learning about KV cache because I was excited to at least be able to consistently run 30B on my 20GB VRAM, but that doesn’t seem possible. I think qwen3.5 27B took up like 28 GB after KV cache, and it spilled into my RAM and ran like shit.
I’m the only user so I’m getting by. But it seems like 70B is really where you want to be at. What has been your favorite 70b model?
rainbyte@reddit
It is possible to run Qwen3.5-27B with less than the max ctx size. IQ4 and Q4 variants consume around 15GB, so you have around 4GB for ctx.
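Along those lines, a minimal llama.cpp invocation might look like the sketch below. The model filename is a placeholder, and the flag names are as of recent llama.cpp builds, so check `llama-server --help` for your version.

```shell
# Run Qwen3.5-27B at a 4-bit quant with a reduced context window so the
# ~15GB of weights plus KV cache fit inside 20GB of VRAM.
#   -ngl 99             offload all layers to the GPU
#   -c 8192             8K context instead of the model's max
#   -fa                 flash attention (needed for quantized KV cache)
#   --cache-type-k/-v   quantize the KV cache, roughly halving its fp16 size
llama-server \
  -m ./Qwen3.5-27B-IQ4_XS.gguf \
  -ngl 99 \
  -c 8192 \
  -fa \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```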
vick2djax@reddit (OP)
Update: tested this on my 7900 XT. The math checks out for standard transformers but Qwen3.5-27B's hybrid Gated DeltaNet/Mamba2 architecture breaks the formula. Weights are 16.2GB on disk but runtime ballooned to 25.6GB: the recurrent state buffers eat ~9GB that a pure transformer wouldn't need. Spilled 33% to CPU, got 0.6 tok/s vs 52 tok/s on qwen3:30b-a3b running fully in VRAM.
For what it's worth, Gemma 4 26B-A4B (also MoE, but pure transformer) loaded at 20.1GB and runs at 54.9 tok/s on the same card.
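The standard-transformer side of that math can be sanity-checked with the usual KV formula: K and V, per layer, per KV head, per context position. The layer/head counts below are illustrative GQA numbers, not any model's actual config.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx: int, bytes_per_elem: int = 2) -> int:
    """Standard transformer KV cache: K and V tensors for every layer.

    Grows linearly with ctx, unlike a recurrent-state buffer, which is a
    fixed allocation regardless of context length.
    """
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

# Illustrative config: 48 layers, 8 KV heads (GQA), head_dim 128, fp16 cache.
gib = kv_cache_bytes(48, 8, 128, 32768) / 2**30
print(f"{gib:.1f} GiB at 32K context")  # → 6.0 GiB at 32K context
```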
rainbyte@reddit
Then something is weird. Here I'm running Qwen3.5-27B with almost full ctx on 24GB vram. It should be possible to get it working with partial ctx.
vick2djax@reddit (OP)
I’ll do some testing to see how that goes. Thank you!
Responsible_Buy_7999@reddit
Your agreements with your clients will govern what you can do with their data.
Using a hosted service with "train your model with my usage habits" turned OFF is commercially reasonable. However, there is no reason for PII to leave your desk.
You may have other justifications for blowing thousands of dollars on gear, but that isn't one of them.
InvertedVantage@reddit
I enjoyed using my 7900XTX before I moved to a separate dedicated box. You can pick them up on eBay for $850, so two of those and you have 48 GB of VRAM.
vick2djax@reddit (OP)
What did you move to, and when was it that you felt you hit the ceiling on the 7900XTX? I know that’s a bit better than my XT, but I’m curious.
InvertedVantage@reddit
I just moved to 2x 3060s and a 5060. I moved because the 7900 is in my main desktop so I wanted to be able to use my machine during inference :)
SleazyF@reddit
I have a 7900xtx and would like to hop into this hobby. What have you used that ran well on this card? I’m thinking of starting with the new Gemma 4, just not sure which one to start with. Any recommendations appreciated!!
InvertedVantage@reddit
Qwen3.5-35b-a3b works really well!
kidflashonnikes@reddit
At this point it’s either Apple unified memory or Nvidia GPUs. Nothing in between, that’s it.
Look_0ver_There@reddit
Could pick up 3 x R9700Pros for less than the price of the Mac, and have 96GB of VRAM that will run more than twice as fast as the Mac will.
kidflashonnikes@reddit
I work in an AI lab and we don’t touch those. The only person doing anything meaningful with AMD GPUs is George Hotz and Tiny.
the__storm@reddit
Yeah but OP doesn't work in an AI lab - they just need inference, and only with popular models.
Confident_Ideal_5385@reddit
Yeah nah, the AMD stuff is just as compelling if you're prepared to do a bit of hacking and use vulkan. The price alone makes it worthwhile to avoid Huang's moat.
rainbyte@reddit
Here new 4000s and 5000s are pretty expensive, so the consumer options are used 3000s or Radeon. For inference, AMD devices are fine as long as it's a GPU with matrix cores (like the 7900 XTX).
Confident_Ideal_5385@reddit
Yeah, the XTX is a goat. $1000-1300ish, 24GB VRAM, and almost 1TB/sec bandwidth.
Radiant-Video7257@reddit
R9700 + gemma 4 31b or qwen3.5 27b
rebelSun25@reddit
A 64GB Mac is the minimum I'd recommend if you're upgrading from what you have. I think the dense 27B to 35B models are very capable, and at 64GB you could run some 70B at lower quants. Obviously higher is better, but at 128GB the price gets silly, unless you go with AMD, and even that is $4k+ where I live.
I'd take a look at OpenRouter ZDR before you commit. They let you enable a zero-data-retention policy on your API key, so that your requests only go to providers who obey that policy. You can also pin specific providers on top of that. No idea whether this passes your risk tolerance, though.
vick2djax@reddit (OP)
The price does get a bit silly at 128 GB. $5400 for a Mac M5 Max with 128GB and 2TB HD. I think my M3 Max 36GB is worth about $2k.
Are there diminishing returns going from 64GB to 128GB for this or is it still significantly better?
rebelSun25@reddit
Like others said, if you find that your context or KV cache needs to be large, given the type of requests you make, then a larger VRAM pool is necessary.
I can't comment on that as I just gave up after I realized my ideal setup needs to be $10k+ , so I pay for openrouter with ZDR policy enabled. I use local for work that isn't critical to my deliverables.
I'm literally hoping for hardware prices to crash while model quality improves within 128GB.
matt-k-wong@reddit
you should be able to run the latest 30B class on your Mac just fine at least to test it out
vick2djax@reddit (OP)
What’s the jump like from 30B to 70B? I’m having a hard time figuring that out and I can’t test 70B either. I’m more or less on 9B qwen3.5 most of the time now.
matt-k-wong@reddit
the latest 30B models are roughly equivalent to the 70B models you're thinking about. You can't use parameter count as an indicator of quality anymore. However, what I did notice from the jump up to 120B is that you can leave them running on their own and they will retry over and over and figure things out for you.
Look_0ver_There@reddit
If you have a second PCIe slot, then possibly consider something like the Radeon R9700Pro which has 32GB of VRAM. These sell for $1300. Llama.cpp will automatically balance the load across the two GPUs, and you'll get twice the token generation rate of the Mac, and 3x the PP speeds. You'll be able to run all models up to 40B at Q8 quants and 100-300K context (depending on the model size).
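If you ever want to tune that balance by hand rather than rely on the automatic split, llama.cpp takes per-device proportions. The path and numbers below are just an example, not something measured here; check `llama-server --help` for your build's flag names.

```shell
# Split one model across two mismatched GPUs with llama.cpp.
# --tensor-split takes per-device proportions, so a 20GB 7900 XT plus a
# 32GB R9700 Pro would take roughly a 20:32 split.
llama-server \
  -m ./model.gguf \
  -ngl 99 \
  --tensor-split 20,32
```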
You'll be able to run Qwen3-Coder-Next with a 4 bit quant like IQ4_NL, and the generation rate will be very fast.
The downside to the multiple GPU solution is that power draw will be pretty high, and you may need to upgrade your PSU.
I do have an M4 Max MacBook Pro (work computer) with 128GB of RAM, and while it's a very nice machine, if I run inference on it for any period longer than 30s, the laptop sounds like a jet engine under your fingertips, and the crazy thing is I don't see many people talking about that. Other than that drawback, it'll happily run models up to 200B, depending on the quantization, but it doesn't hold a candle to the inference speeds of GPUs.
If you need portability, go with the Mac, so long as you don't mind using MacOS. Otherwise for the same money you could grab 3 of those R9700Pros and have a killer local inferencing rig.