Anyone here with an AMD AI Max+ 395 + 128GB setup running coding agents?
Posted by Admirable_Reality281@reddit | LocalLLaMA | 36 comments
For those of you who happen to own an AMD AI Max+ 395 machine with 128GB of RAM, have you tried running models with coding agents like Cline, Aider, or similar tools?
GreenCap49@reddit
Yes, I'm doing that. Qwen3 30B A3B Coder is super good. With the AMDVLK drivers on Linux I'm getting around 550 tokens/s on prompt processing and 50 tokens/s on generation, and you can run a huge context. With a big context it gets progressively slower though. Check out the Strix Halo home lab Discord and this website: https://strixhalo-homelab.d7.wtf/ https://discord.gg/pnPRyucNrG
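(For anyone wondering how a coding agent plugs into a setup like this: llama.cpp's llama-server exposes an OpenAI-compatible /v1 endpoint, so tools like Cline or Aider just need a base URL and a model name. A minimal sketch follows; the port and model name are placeholders, not taken from this thread.)

```python
# Minimal sketch: point an OpenAI-compatible client at a local llama.cpp
# server. The base_url, port and model name are assumptions -- use whatever
# your own server reports.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # local llama-server, no cloud API
    api_key="none",                       # llama.cpp ignores the key by default
)

resp = client.chat.completions.create(
    model="qwen3-coder-30b-a3b",          # hypothetical name; list models via /v1/models
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```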
CalangoVelho@reddit
Have you tried gpt-oss 120b?
liright@reddit
What's the maximum parameter model you can run on it?
GreenCap49@reddit
GLM Air and gpt-oss 120B fit easily; I think the biggest you can fit is Qwen3 235B at Q2 or Q3. Check https://kyuz0.github.io/amd-strix-halo-toolboxes/, where he has done extensive benchmarking of different backends and models.
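(Quick back-of-envelope on why Q2/Q3 is about the ceiling for a 235B model in 128GB: weight memory is roughly parameters times bits-per-weight divided by 8. The bits-per-weight figures below are approximations and ignore KV cache and runtime overhead.)

```python
# Rough estimate: weight memory in GB ~= params (billions) * bits_per_weight / 8.
# The bpw values are approximate effective rates for common GGUF quants and
# ignore KV cache, activations and runtime overhead.
def weight_gb(params_b: float, bits_per_weight: float) -> float:
    return params_b * bits_per_weight / 8

for name, bpw in [("Q2_K", 2.6), ("Q3_K_M", 3.9), ("Q4_K_M", 4.8)]:
    print(f"Qwen3 235B @ {name}: ~{weight_gb(235, bpw):.0f} GB of weights")
# -> ~76 GB, ~115 GB, ~141 GB respectively: only the Q2/Q3 range (or an
#    aggressive dynamic Q4 mix) leaves room for context in 128 GB.
```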
liright@reddit
Wow, that thing is crazy considering it can be bought for just $2000. My RTX 4090 cost nearly that and I'm lucky I can run a 30B model.
GreenCap49@reddit
Yeah, I'm super happy with my purchase! I was a bit worried at first because I bought the GMKtec EVO-X2, but my thermals are fine. Now I can ditch expensive APIs, use it as a home lab server, and because it's an x86 machine I can also do gaming. And of course the tweaking is fun. Amazing device!
Phptower@reddit
Are thermals good out of the box, or only with mods and tweaking? Or did you happen to get an improved revision (BIOS, heatsink, paste, or fans)? There are lots of posts about the GMKtec overheating.
GreenCap49@reddit
My thermals were fine from the start, but I applied some PTM7950 and now it's running at around 65°C under full load. Keep in mind most people posting on the internet are the ones with problems, not the happy average customer.
Phptower@reddit
65°C is crazy low under full load... with fans at 100%, 10°C ambient, and only for a minute? Not buying it 🙂.
The Beelink has a full-size vapor chamber plus high-capacity blowers twice the height of the EVO-X2 for a reason, and it still hits 90°C at full load (120-140W).
GreenCap49@reddit
No need to buy anything, you do you. Just telling you what amdgpu_top is reporting. Not gonna buy a temperature gun for you ;)
liright@reddit
Damn this is making me want to sell my gaming PC and get that instead. But I still do some AAA 1440p gaming and looking at benchmarks that iGPU wouldn't quite cut it. But if they come out with a 256GB+ RAM model I don't think I would be able to resist that haha.
GreenCap49@reddit
I totally get it! Maybe they'll do that with the next-gen Medusa Halo, which is rumored to have 50% higher memory speed as well. I haven't actually done any gaming with it yet but I'll surely do that later.
Mediocre-Waltz6792@reddit
Not true, the Unsloth dynamic Q4 of Qwen3 235B just fits into 128GB of RAM on my PC. I only get 1-1.6 t/s, but it runs.
GreenCap49@reddit
Yeah, but you also want some context, so it's not practical.
Mediocre-Waltz6792@reddit
It's alright actually. I have it at 16k context. But the speed... oof 😅
GreenCap49@reddit
On Linux you can actually configure it so it uses the whole 128GB as unified memory, with no split like under Windows. The limitation right now is the memory speed, but that's less of an issue with all the new MoE models. gpt-oss 120B runs super fast, almost as fast as Qwen3 30B A3B Coder, which is crazy!
DocWolle@reddit
How do you configure it? Is it the GTT memory size as a kernel parameter?
GreenCap49@reddit
Yes exactly, you set that one to 128GB and set VRAM in the BIOS to 512MB, then everything gets dynamically allocated to GTT.
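(If you want to sanity-check the split after rebooting, the amdgpu driver exposes the VRAM and GTT pool sizes through sysfs. A small sketch that just reads them; the card index can differ between machines.)

```python
# Read the VRAM/GTT totals that the amdgpu driver reports via sysfs.
from glob import glob

def read_gib(path: str) -> float:
    with open(path) as f:
        return int(f.read()) / (1024 ** 3)

for dev in glob("/sys/class/drm/card*/device"):
    try:
        vram = read_gib(f"{dev}/mem_info_vram_total")
        gtt = read_gib(f"{dev}/mem_info_gtt_total")
    except FileNotFoundError:
        continue  # not an amdgpu device
    print(f"{dev}: VRAM {vram:.1f} GiB, GTT {gtt:.1f} GiB")
# With VRAM set to 512MB in the BIOS and GTT raised, you should see a small
# VRAM pool and a GTT pool close to your total system RAM.
```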
Phptower@reddit
But isn't it the same with Windows? I'm not switching to Linux.
GreenCap49@reddit
No, in Windows you can only allocate up to 96GB, and it's a fixed split.
waltercool@reddit
I don't see any performance loss between dedicated VRAM (UMA) and GTT usage. Do you?
rootbeer_racinette@reddit
Does it use the NPU hardware also or just the GPU?
GreenCap49@reddit
The NPU is not being used at the moment. But from what I understand, its main use case would be running small LLMs in the background. It could help with inference though. https://youtu.be/a9NprGqBr54?si=2gV37xHIaX5JcgVH
Phptower@reddit
But Lemonade is supposed to support the NPU?
GreenCap49@reddit
When I last checked, Lemonade didn't support many models.
friendlyq@reddit
Why AMDVLK? ROCm is faster.
GreenCap49@reddit
It depends on the model you're running; check the benchmarks at https://kyuz0.github.io/amd-strix-halo-toolboxes/
friendlyq@reddit
Vulkan can only be faster by mistake or because of a bug.
paschty@reddit
ROCm runs like shit on gfx1151
rebelSun25@reddit
Impressive
its_just_andy@reddit
what quant level for A3B Coder are you using? 8bit? 4bit?
GreenCap49@reddit
Q6_K_XL
Admirable_Reality281@reddit (OP)
THANK YOU 🙏
That’s awesome to hear!
How much slower does it get with a bigger context like around 100k tokens?
GreenCap49@reddit
I don't have my PC in front of me right now, but I would say at 40k tokens it's down to half the speed. Still perfectly usable for my use case. I'm using bolt.diy with it at the moment.
Admirable_Reality281@reddit (OP)
Thanks!
waiting_for_zban@reddit
You can also check u/randomfoo2's journey with AMD hardware, including the Strix Halo. There are lots of detailed benchmarks.