GMK X2 (AMD Max+ 395 w/128GB) first impressions.
Posted by fallingdowndizzyvr@reddit | LocalLLaMA | View on Reddit | 69 comments
I've had an X2 for about a day. These are my first impressions of it, including a bunch of numbers comparing it to other GPUs I have.
First, the people who were claiming that you couldn't load a model larger than 64GB because it would also need 64GB of RAM for the CPU are wrong. That is simply not the case; it comes down to user error.
Second, the GPU can draw 120W, and it does when doing PP. Unfortunately, TG seems to be memory-bandwidth limited, and while doing that the GPU sits at around 89W.
Third, as delivered, the BIOS was not capable of allocating more than 64GB to the GPU on my 128GB machine. It needs a BIOS update, and GMK should at least send out an email pointing to the correct BIOS to use. I first tried the one linked from the GMK store page. That updated me to what it claimed was the required version, 1.04 dated 5/12 or later. That didn't do the job; the BIOS was dated 5/12 and I still couldn't allocate more than 64GB to the GPU. So I dug around the GMK website and found a link to a different BIOS. It is also version 1.04 but is dated 5/14. That one worked. It took forever to flash compared to the first one and forever to reboot, twice as it turns out. There was no video signal for what felt like a long time, although it was probably only about a minute, before it finally showed the GMK logo, only to restart again with another wait. The second time it booted back up to Windows, and this time I could set the VRAM allocation to 96GB.
Overall, it's as I expected. So far, it's like my M1 Max with 96GB, but with about 3x the PP speed. It strangely uses more than a bit of "shared memory" for the GPU as opposed to "dedicated memory", like GBs worth. Normally that would make me think it's being slowed down, but on this machine the "shared" and "dedicated" RAM are the same physical memory, although it's probably still less efficient to go through the shared stack. I wish there were a way to turn off shared memory for a GPU in Windows. It can be done in Linux.
Here are a bunch of numbers. First for a small LLM that I can fit onto a 3060 12GB. Then successively bigger from there.
9B
**Max+**
| model | size | params | backend | ngl | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: |
| gemma2 9B Q8_0 | 9.15 GiB | 9.24 B | RPC,Vulkan | 99 | 0 | pp512 | 923.76 ± 2.45 |
| gemma2 9B Q8_0 | 9.15 GiB | 9.24 B | RPC,Vulkan | 99 | 0 | tg128 | 21.22 ± 0.03 |
| gemma2 9B Q8_0 | 9.15 GiB | 9.24 B | RPC,Vulkan | 99 | 0 | pp512 @ d5000 | 486.25 ± 1.08 |
| gemma2 9B Q8_0 | 9.15 GiB | 9.24 B | RPC,Vulkan | 99 | 0 | tg128 @ d5000 | 12.31 ± 0.04 |
**M1 Max**
| model | size | params | backend | threads | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ---: | --------------: | -------------------: |
| gemma2 9B Q8_0 | 9.15 GiB | 9.24 B | Metal,BLAS,RPC | 8 | 0 | pp512 | 335.93 ± 0.22 |
| gemma2 9B Q8_0 | 9.15 GiB | 9.24 B | Metal,BLAS,RPC | 8 | 0 | tg128 | 28.08 ± 0.02 |
| gemma2 9B Q8_0 | 9.15 GiB | 9.24 B | Metal,BLAS,RPC | 8 | 0 | pp512 @ d5000 | 262.21 ± 0.15 |
| gemma2 9B Q8_0 | 9.15 GiB | 9.24 B | Metal,BLAS,RPC | 8 | 0 | tg128 @ d5000 | 20.07 ± 0.01 |
**3060**
| model | size | params | backend | ngl | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: |
| gemma2 9B Q8_0 | 9.15 GiB | 9.24 B | Vulkan,RPC | 999 | 0 | pp512 | 951.23 ± 1.50 |
| gemma2 9B Q8_0 | 9.15 GiB | 9.24 B | Vulkan,RPC | 999 | 0 | tg128 | 26.40 ± 0.12 |
| gemma2 9B Q8_0 | 9.15 GiB | 9.24 B | Vulkan,RPC | 999 | 0 | pp512 @ d5000 | 545.49 ± 9.61 |
| gemma2 9B Q8_0 | 9.15 GiB | 9.24 B | Vulkan,RPC | 999 | 0 | tg128 @ d5000 | 19.94 ± 0.01 |
**7900xtx**
| model | size | params | backend | ngl | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: |
| gemma2 9B Q8_0 | 9.15 GiB | 9.24 B | Vulkan,RPC | 999 | 0 | pp512 | 2164.10 ± 3.98 |
| gemma2 9B Q8_0 | 9.15 GiB | 9.24 B | Vulkan,RPC | 999 | 0 | tg128 | 61.94 ± 0.20 |
| gemma2 9B Q8_0 | 9.15 GiB | 9.24 B | Vulkan,RPC | 999 | 0 | pp512 @ d5000 | 1197.40 ± 4.75 |
| gemma2 9B Q8_0 | 9.15 GiB | 9.24 B | Vulkan,RPC | 999 | 0 | tg128 @ d5000 | 44.51 ± 0.08 |
**Max+ CPU**
| model | size | params | backend | ngl | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: |
| gemma2 9B Q8_0 | 9.15 GiB | 9.24 B | RPC,Vulkan | 0 | 0 | pp512 | 438.57 ± 3.88 |
| gemma2 9B Q8_0 | 9.15 GiB | 9.24 B | RPC,Vulkan | 0 | 0 | tg128 | 6.99 ± 0.01 |
| gemma2 9B Q8_0 | 9.15 GiB | 9.24 B | RPC,Vulkan | 0 | 0 | pp512 @ d5000 | 292.43 ± 0.30 |
| gemma2 9B Q8_0 | 9.15 GiB | 9.24 B | RPC,Vulkan | 0 | 0 | tg128 @ d5000 | 5.82 ± 0.01 |
poli-cya@reddit
Thanks so much for getting us hard numbers on this. Any chance you'll be testing some more MoEs or with speculative decoding? Those are the situations where it should shine, and I'm really curious what speeds we'll see on Scout Q4_0 or Qwen 235B Q3. Did you happen to see what % GPU usage was shown during these runs?
Also, if it's not asking too much on top of that, can you check how image generation works with one of the diffusion models?
Crazy to think we're seeing this sort of performance at this price and power draw. If diffusion works well then I think I'm decided on pulling the trigger.
fallingdowndizzyvr@reddit (OP)
Image and video gen works. I've only been able to get it working in Windows though. Which is weird since ROCm isn't even officially supported on Windows. But it is what it is.
I have both SD and Wan working. Just as with LLMs, it's about the speed of a 3060. Actually for Wan the 3060 is about twice as fast since it can use sage attention and I can't get that working on the Max+, yet.
MoffKalast@reddit
It's actually impressive how it's completely demolishing the M1 in PP, overall really decent. Might be worth it once ROCm for it stabilizes and it goes on sale :P
Average GMK support, they seem to really enjoy hiding the right links. Google drive I presume? xd
The real question is, what kind of decibels are you getting from the fan while running inference? Jet engine or F1 car?
fallingdowndizzyvr@reddit (OP)
Yep. It's funny. When I clicked on the download directly, it said something like the number of downloads had been exceeded for the day. But when I clicked on "download all" and GD made a bespoke ZIP file, that downloaded with no problems.
That's why the wait for at least the video signal to come back up after flashing was so nerve-racking. I thought I had bricked it. Since there weren't any instructions, I just clicked on some flash program in some directory and put my trust in it making sure it was the right BIOS for the MB.
You know, I'm OK with it. It sounds like... well... a GPU. I run many machines without the side panel, or in my case the top panel, on, so I'm used to how a GPU sounds when it spins up. This sounds exactly like that. I would say it sounds a lot like an A770, which makes sense since it's really a GPU in an external enclosure. Even the heatsink looks like a GPU heatsink when I look through the case opening.
I know people expect silence from a minipc. But most minipcs are low powered and thus low heat. This isn't.
Hopefully they work on the fan software. I swear it seems to be based on load and not temperature, since sometimes the fans spin up even though the machine is stone cold but there's a spike in load. The air coming out of it is cool.
MoffKalast@reddit
That does actually sound fairly decent for a GMK machine, they're sort of notorious for loud cooling solutions. One can always slap a Noctua fan onto it if it gets too annoying though.
fallingdowndizzyvr@reddit (OP)
I take it back. It's way quieter than an A770. Either that or I've gotten used to it, since I've started putting my hand in front of the vent to make sure it's running ever since I turned off the rainbow LEDs. I never have to wonder about that with my A770s.
fallingdowndizzyvr@reddit (OP)
People are already doing that. One dude replaced the 120mm fan with a better 120mm fan. Some other dude made his own case with a 140mm fan.
lakySK@reddit
It's quite surprising, I'd say. Does AMD have that much better a GPU in the chip compared to the Mac, or is this due to software?
MoffKalast@reddit
By rough specs, the 8060S has 37 TFLOPS vs. 10 on the M1 Max, which is 3.7x compared to the 3.5x PP speed difference, so it may be that Metal is slightly more optimized but still falls behind because it has that much less total compute.
lakySK@reddit
That's nice then! Good job AMD!
Now just double the bandwidth and the maximum memory capacity once more and we're getting something very interesting!
MoffKalast@reddit
Some kind of twin CPU NUMA setup with these would be pretty interesting, twice the GPU power, eight memory channels...
mycall000@reddit
Zen 6 will improve bandwidth.
foldl-li@reddit
Thanks for your data.
IMHO, the iGPU does not look powerful enough, while the 7900xtx really is worth a try.
Pogo4Fufu@reddit
A 7900xtx with 64GB or 96GB VRAM?
foldl-li@reddit
speed is important.
Pogo4Fufu@reddit
No, fast VRAM. A single 7900xtx has only 24GB of VRAM
Key-Software3774@reddit
Why is pp512 performance lower on the 27B Q5 compared to the 27B Q8?
fallingdowndizzyvr@reddit (OP)
It's a common misconception that a smaller quant automatically makes it faster. That's not the case. You also have to factor in how compute-intensive it is to dequantize the data into a datatype that you can do compute with. That's why the most performant format can be FP16.
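A toy sketch of the idea (this is not how llama.cpp kernels actually work, they fuse dequantization into the matmul, and these shapes are made up), just to show that the quantized path trades memory for extra arithmetic:

```python
import time
import numpy as np

# Toy illustration only: int8 block-quantized weights need extra arithmetic
# (per-block scale multiplies) before/inside the matmul that fp16 weights don't.
rows, cols, block = 4096, 4096, 32
rng = np.random.default_rng(0)

w_fp16 = rng.standard_normal((rows, cols), dtype=np.float32).astype(np.float16)
scales = np.abs(w_fp16).reshape(rows, -1, block).max(axis=-1).astype(np.float32) / 127.0
w_q8 = np.round(w_fp16.reshape(rows, -1, block) / scales[..., None]).astype(np.int8)
x = rng.standard_normal(cols, dtype=np.float32)

t0 = time.perf_counter()
y_fp16 = w_fp16.astype(np.float32) @ x                   # fp16 weights: convert, multiply
t1 = time.perf_counter()
w_deq = (w_q8 * scales[..., None]).reshape(rows, cols)   # int8 weights: dequantize first...
y_q8 = w_deq @ x                                         # ...then multiply
t2 = time.perf_counter()

print(f"fp16 path: {t1 - t0:.4f}s   int8 dequant path: {t2 - t1:.4f}s")
```

On a bandwidth-bound step the smaller reads can still win, but when you're compute-bound the extra dequant work can eat the advantage.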
Key-Software3774@reddit
Makes sense! Thx for your valuable post and answer 🙏
InternationalNebula7@reddit
Is there a reason you chose Gemma 2 9B as opposed to Gemma 3 12B (or even Gemma 3 27B) to evaluate performance? Is the 12B Gemma 3 too big for the 3060 comparison?
fallingdowndizzyvr@reddit (OP)
I chose those models from what I had available on the drives I had attached to the machine at the time. I picked them purely based on size. I picked 9B because it would fit on the 3060, and even that just barely. If you notice, my other tests were at 10000 context; that wouldn't run on the 3060, so I had to bring it down to 5000 for the 9B runs.
holistech@reddit
Thanks a lot for your post and benchmark runs. In my experience, the Vulkan driver has problems allocating more than 64GB for the model weights. However, I set the VRAM to 512MB in BIOS and was able to run large models like Llama-4-Scout at Q4.
I have created a benchmark on my HP ZBook Ultra G1a using LM Studio.
The key finding is that Mixture-of-Experts (MoE) models, such as Qwen-30B and Llama-4 Scout, perform very well. In contrast, dense models run quite slowly.
For a real-world test case, I used a large 27KB text about Plato to fill an 8192-token context window. Here are the performance highlights:
What's particularly impressive is that this level of performance with MoE models was achieved while consuming a maximum of only 70W.
You can find the full benchmark results here:
https://docs.google.com/document/d/1qPad75t_4ex99tbHsHTGhAH7i5JGUDPc-TKRfoiKFJI/edit?tab=t.0
poli-cya@reddit
Any reason you didn't use flash attention in your benchmarks?
Have you tried any diffusion/flux workloads so we could compare them to GPUs?
holistech@reddit
Hi, I did not use flash attention and KV cache quantization to ensure high accuracy of model outputs. I noticed significant result degradation otherwise. In my workflow, I need high accuracy when analyzing large, complex text and code.
In my experiments using speculative decoding, the performance gain was not enough or was negative, so I do not use it. You also need compatible models for this approach.
I barely use diffusion or other image/video generation models, so there was no need to include them in the benchmark.
burntheheretic@reddit
I have an image processing use case, and I get about 9.5 tps with Llama 4. I haven't even tried to optimise anything yet - that's just loading a GGUF into Ollama and letting it rip.
Really impressive stuff!
fallingdowndizzyvr@reddit (OP)
Yep. That's the workaround. But in my case I went with 32GB of dedicated RAM and then that leaves 48GB of shared RAM. That allows for 79.5GB of total memory for the GPU. I've used up to 77.7GB.
Using shared instead of dedicated memory is not that much slower. I've updated my numbers for 9B with a shared-memory-only run; it's 90% of the speed of using dedicated memory. So that's an option for people who have been wanting variable memory allocation to the GPU instead of fixed, like on a Mac: use shared memory.
Desperate-Sir-5088@reddit
Thanks for your comment. If you could, please link your Max+ and M1 (and any other M1s) together and test "distributed inference" for BIG models (over 120B).
fallingdowndizzyvr@reddit (OP)
Yep. That's the plan. I'm hoping this 96GB (hopefully 110GB) will let me run R1. Even at 96GB, I should have 200GB of VRAM, with another 32GB I can bring on in a pinch.
profcuck@reddit
We are all very excited to see that.
Also wondering about Llama 3.3 70b.
burntheheretic@reddit
Without tuning Llama 3.2 Vision 90b is about 2.5 tps.
Dense models aren't great.
profcuck@reddit
Thanks!
WaveCut@reddit
Haha, I love how the post starts out like "hey all, it's not that scary" and right after that a list of highly specific blockers emerges. AMD moment.
No_Afternoon_4260@reddit
So it's like a 3060 with 128gb, cool
burntheheretic@reddit
Got mine last week, it came with BIOS 1.05. No idea what the differences are...
Using Ubuntu 24.04, running Llama 4 Scout on it with 96GB allocated to the GPU. The architecture seems to love big MoE models - you can load a pretty giant model into RAM, but the constraint seems to be fundamentally compute.
fallingdowndizzyvr@reddit (OP)
It's weird that mine showed up later, yet has an earlier BIOS. Although mine was delayed for a while so I guess it was stuck somewhere.
I think it's the opposite: it has compute to spare but is limited by memory bandwidth. I see it running at 120W during compute-intensive things but only 89W during inference, so it's memory-I/O bound, which you can also see in the sawtooth pattern of GPU use. Also, from the t/s and the size of the model, it's pushing 200GB/s, which is pretty much what it's capable of.
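As a back-of-the-envelope check using the 9B Q8_0 table above (assuming each generated token streams roughly the whole weight file once, ignoring KV cache and activations):

```python
# Rough estimate: at batch 1, token generation re-reads roughly the whole
# weight file from memory for every token.
model_size_gib = 9.15      # gemma2 9B Q8_0 size, from the table above
tg_tokens_per_s = 21.22    # tg128 on the Max+ at zero depth, from the table above

bw_gib_s = model_size_gib * tg_tokens_per_s
bw_gb_s = bw_gib_s * 1024**3 / 1e9
print(f"~{bw_gib_s:.0f} GiB/s (~{bw_gb_s:.0f} GB/s) of effective weight traffic")
# Roughly 194 GiB/s (~208 GB/s), in the ballpark of what the memory system can
# actually deliver, consistent with TG being memory-bound.
```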
VowedMalice@reddit
I'm about to unbox mine and get Ubuntu running. Can you share the link to the fixed BIOS you used?
oxygen_addiction@reddit
https://www.reddit.com/r/GMKtec/comments/1ldtnbl/new_firmware_for_evox2_bios_105_ec_106/
fallingdowndizzyvr@reddit (OP)
Ah.... I was wary enough flashing a BIOS from a Google Drive linked to directly by GMK. I don't think I would flash something off of some random website.
burntheheretic@reddit
My new unit shipped with 1.05, so it's a real thing that exists...
fallingdowndizzyvr@reddit (OP)
Yes, but it really existing is one thing; downloading one from a random website when it's not available from the manufacturer is another thing altogether.
fallingdowndizzyvr@reddit (OP)
I think it was this one. TBH, I didn't really keep track.
https://www.gmktec.com/pages/drivers-and-software
uti24@reddit
what is d5000/d10000 here, is it like context size?
fallingdowndizzyvr@reddit (OP)
Exactly.
its_just_andy@reddit
sorry, can you explain further? I thought pp512 meant "preprocessing 512 tokens", i.e. context size of 512, and "tg128" meant "generating 128 tokens", i.e. output of 128 tokens. Is that not correct? If "d5000" means "context size 5000 tokens" then I don't know what pp512 and tg128 are :D
fallingdowndizzyvr@reddit (OP)
No. Prompt Processing.
Yes. Text Generation.
This is just standard llama-bench. You can read up about that here.
https://github.com/ggml-org/llama.cpp/tree/master/tools/llama-bench
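For reference, a minimal sketch of how those labels map onto llama-bench flags (the model path is a placeholder and the exact flag spellings are my assumption for a recent build; check the README above):

```python
# Sketch of a run like the tables above. Flag names assume a recent llama.cpp
# llama-bench build; the model path is a placeholder.
import subprocess

subprocess.run([
    "llama-bench",
    "-m", "gemma-2-9b-Q8_0.gguf",  # placeholder model path
    "-p", "512",       # pp512: prompt processing over a 512-token prompt
    "-n", "128",       # tg128: text generation of 128 tokens
    "-d", "0,5000",    # depth: repeat with 5000 tokens already in context ("@ d5000")
    "-ngl", "99",      # offload all layers to the GPU
    "-mmp", "0",       # mmap off, matching the mmap=0 column above
], check=True)
```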
HilLiedTroopsDied@reddit
I'd highly suggest you throw Ubuntu or a similar Linux on there to maximize its abilities and partitionable RAM size.
fallingdowndizzyvr@reddit (OP)
I've been moving from Linux to Windows, since things are just faster on Windows, from my 7900xtx to my A770s. Windows is a bit faster than Linux with my 7900xtx, but it's 3x faster for my A770s. It takes the A770 from meh to pretty darn good.
I have my Windows machines set up to seem like Linux machines. I ssh into them. I don't even use the GUI. I use bash on Windows.
segmond@reddit
Thanks for sharing. This does bring AI computing to the max for cheap, but sadly not for me, maybe in the next gen or two. Power users are gonna need dedicated GPUs.
TheTerrasque@reddit
Thanks for the test! One question, why not CUDA on the 3060?
fallingdowndizzyvr@reddit (OP)
Why CUDA? Vulkan is pretty much just as performant now. And I like to keep as much the same as possible when comparing things: vary one thing and hold everything else constant. In this case, the variable is the GPU.
vibjelo@reddit
That hasn't been my experience at all, but I'll confess to testing this like a year ago or more, maybe things have changed lately?
You know of any public benchmarks/tests showing them being equal for some workloads right now?
fallingdowndizzyvr@reddit (OP)
A year ago is ancient times. I've been through this over and over again including posting numbers for CUDA and ROCm. Vulkan is close to, or even faster than, both of those now.
ItankForCAD@reddit
Vulkan support and performance in llama.cpp have pretty much been through their adolescence this past year. You should check it out.
vibjelo@reddit
Huh, that's pretty cool, I'll definitely check it out again. Thanks!
Sudden-Guide@reddit
Have you tried setting "VRAM" to auto instead of allocating a fixed amount?
fallingdowndizzyvr@reddit (OP)
That's the way it comes. But I think all that does is allow some other software to set it, like AMD's own software, which is the other way you can set it: you can set it using AMD's Adrenalin app.
sergeysi@reddit
Would be nice to see larger MoE models.
Also a comparison in performance between Windows and Linux is interesting.
fallingdowndizzyvr@reddit (OP)
I just added a run for Scout.
davew111@reddit
Is it possible to run a 123B model on one of these?
oxygen_addiction@reddit
https://www.reddit.com/r/GMKtec/comments/1ldtnbl/new_firmware_for_evox2_bios_105_ec_106/
New bios is out.
Can you try Qwen 235B Q4 and maybe Flux diffusion in ComfyUI?
mycall000@reddit
Have you tried ROCm 7?
https://www.amd.com/en/products/software/rocm/whats-new.html
Tai9ch@reddit
Try Linux on it.
The rumor is that Linux doesn't depend on setting VRAM in BIOS and can just do whatever it needs at runtime.
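If you want to sanity-check that on Linux, a quick sketch (assumes an amdgpu device at card0; adjust the card index for your setup): the GTT pool reported here is what the driver can hand to the GPU at runtime, separate from the BIOS "dedicated VRAM" carve-out.

```python
# Read amdgpu's memory pool sizes from sysfs (values are in bytes).
from pathlib import Path

dev = Path("/sys/class/drm/card0/device")  # assumed card index
gtt = int((dev / "mem_info_gtt_total").read_text())
vram = int((dev / "mem_info_vram_total").read_text())
print(f"GTT: {gtt / 2**30:.1f} GiB, dedicated VRAM: {vram / 2**30:.1f} GiB")
```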
yoshiK@reddit
Interesting, what tests are in the test column? (pp512 etc.)
thirteen-bit@reddit
If I'm not mistaken:
pp is prompt processing (running the input through the model: system prompt, history if any - probably none in these tests - and the actual prompt itself).
tg should be token generation - LLM response generation.
You look at pp if you're interested in huge prompts (e.g. here is the text of an entire novel, in which chapter is the butler the lead suspect?).
And tg for the other way round: small prompt, a lot of generation (with a constant acceleration of 1g until the midpoint and then a constant deceleration of 1g for the rest of the trip, how long would it take to get to Alpha Centauri? Also, how long would it appear to take for an observer on Earth?)
thirteen-bit@reddit
Here's the source for the second question:
https://www.reddit.com/r/LocalLLaMA/comments/1j4p3xw/comment/mgbkx0x/
IrisColt@reddit
Sudden mood whiplash.
fallingdowndizzyvr@reddit (OP)
Not really, since the reason they said it wouldn't load is that you supposedly needed just as much CPU RAM as GPU RAM. That's not true. Loading a model into GPU RAM shouldn't take up CPU RAM. I can load a big model with ROCm, it just doesn't run with the ROCm I'm using. But check out my Update #2: I can do it with a workaround with Vulkan.
Antique_Savings7249@reddit
Brilliant work! Thanks mate.
windozeFanboi@reddit
Speculative decoding should also help if you have spare VRAM, but how much is up for benchmarks to show...