Local AI on Mac Pro 2019
Posted by sbuswell@reddit | LocalLLaMA | View on Reddit | 23 comments
Anyone got any actual experience running local AI on a Mac Pro 2019? I keep seeing advice that for Macs it really should be M4 chips, but you know. Of course the guy in the Apple store will tell me that...
Seriously though. I have both a Mac Pro 2019 with up to 96GB of RAM and a Mac Mini M1 2020 with 16GB of RAM, and it seems odd that most advice says to use the Mac Mini. Anything I can do to refactor the Mac Pro if so? I'm totally fine converting it however I need to for local AI purposes.
HopePupal@reddit
which GPU? if you're not equipped with a W6xxx it's not going to work with modern ROCm 7: most of the other ones it could ship with are no longer supported. that limits you to Vulkan or CPU inference and i'm not sure how well Vulkan works with some of those cards. ik_llama.cpp is pretty good for CPU only operation.
https://www.reddit.com/r/macpro/comments/1q9xeov/guide_mac_pro_2019_macpro71_w_linux_local_llmai/ someone did this writeup but it's a bit outdated except as a source of perf numbers (ROCm 6, also don't use Ollama, it sucks).
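if you want a quick feel for CPU-only throughput before committing to a Linux install, here's a minimal sketch with the llama-cpp-python bindings (the model file, thread count, and context size are placeholders, not recommendations):

```python
# Minimal CPU-only inference sketch using the llama-cpp-python bindings.
# The GGUF path, thread count, and context size are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./qwen3-8b-q4_k_m.gguf",  # any quantized GGUF you have locally
    n_gpu_layers=0,                       # force CPU-only inference
    n_threads=16,                         # match your physical core count
    n_ctx=4096,
)

out = llm("Explain why memory bandwidth limits CPU inference.", max_tokens=200)
print(out["choices"][0]["text"])
```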
JacketHistorical2321@reddit
I am literally running the Vega II Duo card on my Mac Pro 2019 with a custom build of ROCm 7.9 which re-added support for gfx906. It's working great
Substantial_Run5435@reddit
Hey, just saw this comment. I'm trying to figure out the best path to running local LLM/agents on my 2019 Mac Pro. I currently have a W6800X and Vega II on hand, but have the opportunity to buy a second W6800X or a Vega II Duo for a reasonable price. If cost wasn't an issue, would you go for 2x W6800X (I doubt I'll be able to find a Duo with my budget) or a Vega II Duo (not sure if adding my current Vega II into the mix would be beneficial).
My system otherwise is a 16-core, 192GB (12x16GB), 2TB and an OWC Accelsior I'll use for windows/linux boot drives.
droptableadventures@reddit
https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html
AMD Radeon PRO W6800 is shown as supported. And people have MI50 working on ROCm 7 - which is even older (and officially unsupported).
JaredsBored@reddit
ROCm 7.12 nightly builds from AMD directly even have MI50/gfx906 support out of the box. ROCm 7.0-7.2 work if you copy in some missing files from 6.3/6.4, but the 7.12 nightly builds are good to go as shipped.
droptableadventures@reddit
Oh, so they heard the complaints and added it back in. Wow.
JaredsBored@reddit
Kinda sorta. It's not so much that they added it back because of Mi50 complaints. Rather the Vega architecture has been used in so many AMD "APU"s that they're working on an implementation that also happens to work with Mi50/gfx906.
I've been running a ROCm 7.12 nightly build for about a week now. In my A/B testing against ROCm 6.4, tl;dr: not really worth the effort. 6.3 -> 6.4 is actually a good gain, but 6.4 -> 7.12 not so much.
JacketHistorical2321@reddit
They did add it back. Not to 7.2 but to a more recent 7.8 build. I added the link above. Not as well known/discussed, but it's the next-gen official ROCm implementation.
JaredsBored@reddit
ROCm 7.8-7.12 are all the next-gen builds. I'm saying they added it back, but as a generic implementation that should now work for Vega iGPUs; because the MI50 shares the same architecture, it regains support too.
Basically we didn't regain MI50 support because of community outcry, but rather because AMD is getting their shit together and supporting ROCm on more of their products. Which they needed to do, because CUDA is supported on everything Nvidia makes.
JacketHistorical2321@reddit
Totally. I was just pointing out that it's there. I've already built it myself and I'm currently using it, so it works great.
sbuswell@reddit (OP)
It used to be a W6900X with 32GB VRAM but I sold it, so it's now the original 580X. Unlikely to be useful with 8GB.
sedtamensum@reddit
The 580, if properly loaded (some experts or layers kept on the CPU), will still be useful, especially for smaller models: quantized Gemma 3 QAT is about 17GB, so almost half of the layers can go to the 580. If you install Linux or Windows you can use any card you may have, including older ones, to augment the 580; just get the power cables. OWC (macsales) has the OWC PCIe AUX Power Cables Kit for $34.88 (half the price of what Apple wants). You can even combine Nvidia with your 580, but that may require some tinkering with drivers. If you stay on macOS there are limits on how new an AMD card you can use (I think the ceiling is the 6900XT; there may be workarounds). When in doubt ask Claude and verify with Kimi (or vice versa).
96GB of RAM is not bad, and even with just the CPU you can get some larger models loaded. gpt-oss 120 will fit as such, and Qwen3.5 122 quantized to Q4 will fit. Not a speed demon, but entirely local.
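As a rough illustration of that kind of split, something like this with the llama-cpp-python bindings (the model file and layer count are just examples; tune n_gpu_layers until the 8GB card is nearly full):

```python
# Partial offload sketch: put some layers on the 8GB 580X, keep the rest on CPU.
# Model file and layer count are illustrative; watch VRAM usage and adjust.
from llama_cpp import Llama

llm = Llama(
    model_path="./gemma-3-27b-it-qat-q4_0.gguf",  # a ~17GB quantized model
    n_gpu_layers=20,   # roughly however many layers fit in 8GB of VRAM
    n_threads=16,      # the remaining layers run on these CPU threads
    n_ctx=8192,
)

print(llm("Why does partial offload help on small-VRAM GPUs?",
          max_tokens=150)["choices"][0]["text"])
```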
sbuswell@reddit (OP)
Is VRAM the main factor here? I also have an Ubuntu server using an ASUS Z890 Creator WiFi board (Intel LGA1851) with an Intel Core Ultra 5 245K (14 cores: 6P + 8E, 125W TDP)
Got 64GB DDR5-6000 (Kingston FURY Beast, XMP enabled) RAM
So I've sort of ended up with a bunch of different kit and not sure where to use it!
HopePupal@reddit
oof yeah you would have been pretty well off with a W6900X but i think the 580X is the worst case scenario. good news is you've probably got 6-channel CPU memory (the 96GB config is 6x16GB), so i really do recommend seeing how well ik_llama.cpp performs
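for a rough sense of what CPU-only generation speed that bandwidth buys you, here's a back-of-the-envelope sketch (it assumes DDR4-2933 in all six channels and that memory bandwidth is the bottleneck, which is typical for dense models; the model size is just an example):

```python
# Back-of-the-envelope estimate of CPU-only decode speed, assuming the memory
# subsystem is the bottleneck (typical for dense-model token generation).
channels = 6
transfer_rate_mts = 2933   # DDR4-2933; some Mac Pro 2019 configs run slower
bytes_per_transfer = 8     # 64-bit channel width

bandwidth_gbs = channels * transfer_rate_mts * 1e6 * bytes_per_transfer / 1e9
print(f"theoretical bandwidth: {bandwidth_gbs:.0f} GB/s")  # ~141 GB/s

model_size_gb = 17  # e.g. a ~17GB quantized model held entirely in RAM
# Each generated token streams roughly the whole model through memory once.
print(f"rough upper bound: {bandwidth_gbs / model_size_gb:.1f} tokens/s")
```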
Faisal_Biyari@reddit
Check these out:
https://www.reddit.com/r/macpro/comments/1q9xeov/guide_mac_pro_2019_macpro71_w_linux_local_llmai/
https://www.reddit.com/r/macpro/comments/1qfrqb5/guide_mac_pro_2019_macpro71_w_proxmox_ubuntu_rocm/
JacketHistorical2321@reddit
Hey man, so surprisingly I am actually doing this right now, but I have the 64GB Vega II Duo GPU. 16-core Xeon, 370ish GB of 2433 RAM in a 6-channel config. For a comparison of what I am getting so far, this is how the 2019 Mac Pro compares with my Mac Studio M1 Ultra (64-core GPU, 128GB RAM). Exact same prompt with Ollama running qwen3 code next @ 256k ctx:
Prompt: "tell me a story with exactly 800 token output"

Mac Pro:
"Token count: 800 (verified with Hugging Face transformers tokenizer, gpt-4o tokenization rules). Let me know if you'd like it in a different tone (epic, tragic, whimsical) or formatted for printing!"

total duration: 32.825815569s
load duration: 158.095789ms
prompt eval count: 3807 token(s)
prompt eval duration: 332.867543ms
prompt eval rate: 11436.98 tokens/s
eval count: 981 token(s)
eval duration: 31.940322417s
eval rate: 30.71 tokens/s

Mac Studio:
"Verification:
- Token count (Qwen/Qwen2 tokenizer, standard BPE): 800 tokens exactly
- Punctuation, whitespace, and newlines included as per real-world parsing
- Designed for natural rhythm: pauses, dialogue, and description are weighted for reading and token efficiency
Would you like a version with line numbers, or exported as a .txt file? Or perhaps a story about how fireflies blink in unison?"

total duration: 48.801039208s
load duration: 4.816962875s
prompt eval count: 3202 token(s)
prompt eval duration: 6.358609s
prompt eval rate: 503.57 tokens/s
eval count: 984 token(s)
eval duration: 37.390679449s
eval rate: 26.32 tokens/s
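If anyone wants to sanity-check those "exactly 800 tokens" claims, a quick count with a Hugging Face tokenizer looks roughly like this (the tokenizer name and file path are just examples; different tokenizers will give different counts for the same text):

```python
# Sanity check a model's "exactly N tokens" claim with a Hugging Face tokenizer.
# The tokenizer repo and file path are examples only; pick whatever matches
# the model you actually ran.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

story = open("story.txt").read()  # paste the model's story output into this file
print(len(tokenizer.encode(story, add_special_tokens=False)), "tokens")
```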
sbuswell@reddit (OP)
So you're finding the Mac Pro is faster? That's interesting. Any downsides you've spotted?
JacketHistorical2321@reddit
So far, yes. I'm still doing an extensive amount of testing with different configurations and different frameworks, so I can come back and give more numbers later. But so far, at least for our purposes with language models and inference, the seven-year-old Mac Pro is beating out my M1 Ultra, which is kind of crazy. Since the Mac Pro also includes a Xeon with AVX-512 VNNI, if I need to run larger models that won't fit in 128GB of VRAM I am also seeing a 2-2.5x increase in CPU-only inference vs. my Threadripper Pro 3955WX server. So far, the Mac Pro is a pretty damn amazing inference machine
fzzzy@reddit
I have one, 16 gig VRAM, 768 gig RAM. llama.cpp works fine on GPU or CPU. It's definitely very slow with big models, but my concern was trying to run the smartest models, not the fastest.
Murgatroyd314@reddit
The big difference is that the Intel-based Mac Pros have discrete GPUs with their own VRAM distinct from the system RAM. M-series chips have a unified memory system where most of it (2/3 for 16GB, 3/4 for larger than that) can be used as VRAM-equivalent.
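Roughly, under that 2/3 vs 3/4 rule of thumb (a quick sketch; these are stock defaults, not hard hardware limits):

```python
# Rough default GPU-usable share of Apple Silicon unified memory, per the
# 2/3 (16GB) vs 3/4 (larger) rule of thumb described above.
def usable_gpu_memory_gb(total_ram_gb: int) -> float:
    fraction = 2 / 3 if total_ram_gb <= 16 else 3 / 4
    return total_ram_gb * fraction

for ram in (16, 64, 128):
    print(f"{ram}GB unified memory -> ~{usable_gpu_memory_gb(ram):.0f}GB for the GPU")
```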
What's the GPU on your Mac Pro?
sbuswell@reddit (OP)
580X now. If I go down that route, I'll probably upgrade.
bnightstars@reddit
I have a MacBook Pro i5 2020 with 16GB RAM; you can only run small models (think 1.7B to 4B) and all are painfully slow, 10 t/s or less. I'm waiting for the new M5 Pro to get announced so I can upgrade my Mac.
catplusplusok@reddit
Both are best for learning/tinkering/special-purpose tasks like summarization, and unlikely to be useful as coding assistants or for long philosophical chats. If you have 16GB or more VRAM on your Mac Pro, it's worth trying, probably llama.cpp with old ROCm or Vulkan and partial CPU offload to fit a larger model.