Anyone here with an AMD AI Max+ 395 + 128GB setup running coding agents?
Posted by Admirable_Reality281@reddit | LocalLLaMA | 36 comments
For those of you who happen to own an AMD AI Max+ 395 machine with 128GB of RAM, have you tried running models with coding agents like Cline, Aider, or similar tools?
GreenCap49@reddit
Yes, I'm doing that. Qwen3 30B A3B Coder is super good. With the AMDVLK drivers on Linux I'm getting around 550 tokens/s on prompt processing and 50 tokens/s on generation, and you can run a huge context. With a big context it gets progressively slower though. Check out the Strix Halo home lab Discord and this website: https://strixhalo-homelab.d7.wtf/ https://discord.gg/pnPRyucNrG
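(For anyone wondering how a coding agent plugs into a setup like this: llama.cpp's llama-server exposes an OpenAI-compatible /v1 endpoint, so tools like Cline or Aider just need a base URL and a model name. A minimal sketch follows; the port and model name are placeholders, not taken from this thread.)

```python
# Minimal sketch: point an OpenAI-compatible client at a local llama.cpp
# server. The base_url, port and model name are assumptions -- use whatever
# your own server reports.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # local llama-server, no cloud API
    api_key="none",                       # llama.cpp ignores the key by default
)

resp = client.chat.completions.create(
    model="qwen3-coder-30b-a3b",          # hypothetical name; list models via /v1/models
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```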
CalangoVelho@reddit
Have you tried gpt-oss 120b?
liright@reddit
What's the maximum parameter model you can run on it?
GreenCap49@reddit
GLM Air and gpt-oss 120B fit easily; I think the biggest you can fit is Qwen3 235B at Q2 or Q3. Check https://kyuz0.github.io/amd-strix-halo-toolboxes/, where he has done extensive benchmarking of different backends and models.
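(Quick back-of-envelope on why Q2/Q3 is about the ceiling for a 235B model in 128GB: weight memory is roughly parameters times bits-per-weight divided by 8. The bits-per-weight figures below are approximations and ignore KV cache and runtime overhead.)

```python
# Rough estimate: weight memory in GB ~= params (billions) * bits_per_weight / 8.
# The bpw values are approximate effective rates for common GGUF quants and
# ignore KV cache, activations and runtime overhead.
def weight_gb(params_b: float, bits_per_weight: float) -> float:
    return params_b * bits_per_weight / 8

for name, bpw in [("Q2_K", 2.6), ("Q3_K_M", 3.9), ("Q4_K_M", 4.8)]:
    print(f"Qwen3 235B @ {name}: ~{weight_gb(235, bpw):.0f} GB of weights")
# -> ~76 GB, ~115 GB, ~141 GB respectively: only the Q2/Q3 range (or an
#    aggressive dynamic Q4 mix) leaves room for context in 128 GB.
```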
liright@reddit
Wow, that thing is crazy considering it can be bought for just $2000. My RTX 4090 cost nearly that and I'm lucky I can run a 30B model.
GreenCap49@reddit
Yeah, I'm super happy with my purchase! I was a bit worried at first because I bought the GMKtec EVO-X2, but my thermals are fine. Now I can ditch expensive APIs, use it as a home lab server, and because it's an x86 machine I can also do gaming. And of course the tweaking is fun. Amazing device!
Phptower@reddit
Are thermals good out of the box, or only with mods and tweaking? Or did you happen to get an improved revision (BIOS, heatsink, paste, or fans)? There are lots of posts about the GMKtec overheating.
GreenCap49@reddit
My thermals were fine from the start, but I applied some PTM7950 and now it's running at around 65°C under full load. Keep in mind most people posting on the internet are the ones with problems, not the happy average customer.
Phptower@reddit
65°C is crazy low under full load... with fans at 100%, 10°C ambient, and only for a minute? Not buying it 🙂.
The Beelink has a full-size vapor chamber plus high-capacity blowers twice the height of the EVO-X2 for a reason, and it still hits 90°C at full load (120-140W).
GreenCap49@reddit
No need to buy anything, you do you. Just telling you what amdgpu_top is reporting. Not gonna buy a temperature gun for you ;)
liright@reddit
Damn this is making me want to sell my gaming PC and get that instead. But I still do some AAA 1440p gaming and looking at benchmarks that iGPU wouldn't quite cut it. But if they come out with a 256GB+ RAM model I don't think I would be able to resist that haha.
GreenCap49@reddit
I totally get it! Maybe they'll do that with the next-gen Medusa Halo, which is rumored to have 50% higher memory speed as well. I haven't actually done any gaming with it yet but I'll surely do that later.
Mediocre-Waltz6792@reddit
Not true, the Unsloth dynamic Q4 of Qwen3 235B just fits into 128GB of RAM on my PC. I only get 1-1.6 t/s, but it runs.
GreenCap49@reddit
Yeah, but you also want some context, so it's not practical.
Mediocre-Waltz6792@reddit
It's alright actually. I have it at 16k context. But the speed... oof 😅
GreenCap49@reddit
On Linux you can actually configure it so it uses the whole 128GB as unified memory, with no split like under Windows. The limitation right now is the memory speed, but that's less of an issue with all the new MoE models. gpt-oss 120B runs super fast, almost as fast as Qwen3 30B A3B Coder, which is crazy!
DocWolle@reddit
How do you configure it? Is it the GTT memory size as a kernel parameter?
GreenCap49@reddit
Yes exactly, you set that one to 128GB and set VRAM in the BIOS to 512MB, then everything gets dynamically allocated to GTT.
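(If you want to sanity-check the split after rebooting, the amdgpu driver exposes the VRAM and GTT pool sizes through sysfs. A small sketch that just reads them; the card index can differ between machines.)

```python
# Read the VRAM/GTT totals that the amdgpu driver reports via sysfs.
from glob import glob

def read_gib(path: str) -> float:
    with open(path) as f:
        return int(f.read()) / (1024 ** 3)

for dev in glob("/sys/class/drm/card*/device"):
    try:
        vram = read_gib(f"{dev}/mem_info_vram_total")
        gtt = read_gib(f"{dev}/mem_info_gtt_total")
    except FileNotFoundError:
        continue  # not an amdgpu device
    print(f"{dev}: VRAM {vram:.1f} GiB, GTT {gtt:.1f} GiB")
# With VRAM set to 512MB in the BIOS and GTT raised, you should see a small
# VRAM pool and a GTT pool close to your total system RAM.
```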
Phptower@reddit
But isn't it the same with Windows? I'm not switching to Linux.
GreenCap49@reddit
No, in Windows you can only allocate up to 96GB, and it's a fixed split.
waltercool@reddit
I don't see any performance loss between dedicated VRAM (UMA) and GTT usage. Do you?
rootbeer_racinette@reddit
Does it use the NPU hardware also or just the GPU?
GreenCap49@reddit
The NPU is not being used at the moment. But from what I understand, its main use case would be running small LLMs in the background. It could help with inference though. https://youtu.be/a9NprGqBr54?si=2gV37xHIaX5JcgVH
Phptower@reddit
But Lemonade is supposed to support the NPU?
GreenCap49@reddit
When I last checked, Lemonade didn't support many models.
friendlyq@reddit
Why AMDVLK? ROCm is faster.
GreenCap49@reddit
It depends on the model you're running; check the benchmarks at https://kyuz0.github.io/amd-strix-halo-toolboxes/
friendlyq@reddit
Vulkan can only be faster by mistake or because of a bug.
paschty@reddit
ROCm runs like shit on gfx1151
rebelSun25@reddit
Impressive
its_just_andy@reddit
what quant level for A3B Coder are you using? 8bit? 4bit?
GreenCap49@reddit
Q6_K_XL
Admirable_Reality281@reddit (OP)
THANK YOU 🙏
That’s awesome to hear!
How much slower does it get with a bigger context like around 100k tokens?
GreenCap49@reddit
I don't have my PC in front of me right now, but I would say at 40k tokens it's down to half the speed. Still perfectly usable for my use case. I'm using bolt.diy with it at the moment.
Admirable_Reality281@reddit (OP)
Thanks!
waiting_for_zban@reddit
You can also check u/randomfoo2's journey with AMD hardware, including the Strix Halo. There are lots of detailed benchmarks.