The new Linux kernel AI bot uncovering bugs is a local LLM on Framework Desktop + AMD Ryzen AI Max
Posted by Fcking_Chuck@reddit | linux | View on Reddit | 50 comments
HearMeOut-13@reddit
Ok, but the real question is what LLM is it running? Because while, yes, the AMD Ryzen AI Max is pretty strong, I can't imagine it's strong enough
jack-of-some@reddit
It's strong enough if you have the patience. You can outfit it with 128GB of RAM, 96GB of which can be dedicated to the GPU. With MoE-style models you can achieve really good context lengths and decent tok/s
z-lf@reddit
On Linux, you can dedicate the whole RAM to the GPU. The 96GB is a Windows limitation.
DonaldLucas@reddit
I don't understand. Are you saying that you can dedicate all 128GB to the GPU while 0 goes to the OS?
z-lf@reddit
No, I'm saying you can let the OS manage it all, and it will give whatever it doesn't need to the GPU.
Since you'd usually put this on a server OS, that's at most 2GB for the OS (running containers + llama.cpp, for instance); the rest can be used by the GPU.
MaybeTheDoctor@reddit
“Who could possibly need more than 640kb ram?”
RoosTheFemboy@reddit
Couldn't it dedicate more to the GPU on Linux? AFAIK this is a Windows limitation
creeper6530@reddit
Yes, Framework themselves said the 96GB is a Windows limit
HomsarWasRight@reddit
For my Strix Halo machine (ROG Flow Z13), reserving more than that is a BIOS limitation.
But as I said above, it’s actually better not to reserve it. Better to set the reservation as low as possible (512MB on my machine) and then configure the GTT to allow the GPU and CPU to share the entire remaining pool.
The only reason not to do it that way is if you’re simultaneously doing something that uses a great deal of system memory and you want to make sure it’s there.
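Concretely, here's a minimal sketch of what I mean on a 128GB machine. The values are illustrative (sized for roughly 120GB of GTT), and exact parameter handling can vary between kernel versions:

```
# /etc/default/grub, illustrative values for a 128GB Strix Halo box
# amdgpu.gttsize is in MiB; ttm.pages_limit is in 4KiB pages
GRUB_CMDLINE_LINUX_DEFAULT="amdgpu.gttsize=122880 ttm.pages_limit=31457280"
```

Then run update-grub (or your distro's equivalent), reboot, and check dmesg | grep -i gtt to confirm the new GTT size.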
hi_im_mom@reddit
Nowhere near what you get paying for tokens on an actual frontier model, though.
Whenever you see a bunch of Mac minis or other local shit like this, look with caution.
HomsarWasRight@reddit
You’re right that it’s not to the level of the best frontier models. People do overestimate what you can do locally based on some benchmarks and not actual use.
But it doesn't necessarily have to be at that level to be useful and to avoid a ton of costs. This can run 24/7, and the initial hardware cost plus energy usage can still come out way ahead of the same result done on cloud models.
It's going to be slow (assuming non-MoE models, which would be most useful), but when working non-interactively it's fine.
It would be smart to (once something is found) have a frontier model look it over quickly as a confirmation. Local + remote workflow makes sense.
hi_im_mom@reddit
People want agents. Local models can't do that effectively, even with 256GB. I know from experience.
HomsarWasRight@reddit
I mean, I kinda disagree. I’m running a Strix Halo 128GB and it’s useful. So I also speak from experience.
But we're not talking about "what people want", we're talking about a real, credible person running an actual workload that he says is useful.
hi_im_mom@reddit
I hear what you're saying. The "running Claude Code in the browser" type of person.
warpedgeoid@reddit
The difference is much smaller than you think with a proper setup, if you know what you are doing.
hi_im_mom@reddit
No. It isn't.
HomsarWasRight@reddit
For LLMs it’s actually better not to dedicate memory to the GPU and use GTT to allow the GPU to directly address the entire pool.
Marce7a@reddit
Probably DeepSeek or other Chinese open models
Daktyl198@reddit
Since nobody answered you, Greg said "all of them". He's basically testing all of the ones that can run locally to see where they shine and where they fail.
amroamroamro@reddit
The Ryzen AI Max is actually a beast, rivaling desktop-class CPUs:
https://www.phoronix.com/review/ryzen-ai-max-395-9950x-9950x3d/2
Plus, no other iGPU even comes close; it's nearly on par with a current-gen low-end dedicated NVIDIA dGPU
Dangerous-Report8517@reddit
Sure, but desktop-class CPUs are incredibly slow for this use case. What actually makes it work reasonably well is that it's effectively a 128GB-VRAM GPU, so it can run large-ish models without any CPU offload, giving it reasonable performance. That, and the fact that it's not being used interactively, so going a bit slower is fine.
amroamroamro@reddit
The APU with up to 128GB of shared memory lets you load some decent medium-sized models at usable tok/s speeds.
I googled a bit for some LLM benchmarks:
phoronix comparing CPU vs GPU backends (ROCm and Vulkan), though the models used were all kinda smallish: https://www.phoronix.com/review/amd-rocm-7-strix-halo/3
these look much more comprehensive with various models ranging from small to medium that can fit in the unified VRAM: https://kyuz0.github.io/amd-strix-halo-toolboxes/
more tests with models like gpt-oss-120B, GLM-4.5-Air, qwen3-235B-A22B: https://github.com/lhl/strix-halo-testing/tree/main/llm-bench
another writeup: https://netstatz.com/strix_halo_lemonade/#Performance (QWEN3-30B-Coder-A3B ~ 59 t/s, GPT-OSS-120B ~ 47 t/s, QWEN3-235B-Coder-A22B Instruct 2507 ~ 11 t/s)
These are respectable speeds on actually decent models you can get work done with interactively. This Framework Desktop + Ryzen AI Max is a sweet little machine :)
Oh, and it's not just the GPU: the often-forgotten NPU can be utilized for LLMs too, see: https://fastflowlm.com/
Dangerous-Report8517@reddit
All of that is true, but it has absolutely nothing to do with the fact that the CPU cores are almost as fast as desktop cores. What makes the AI Max work well here is that it has a pretty strong GPU with access to all of that memory, plus a quad-channel memory controller, both of which are absent on every consumer-class desktop platform.
amroamroamro@reddit
From the above Phoronix benchmarks, llama.cpp text generation on the CPU BLAS backend was about half the speed of the GPU Vulkan backend, so the CPU here is no slouch either.
And according to their benchmarks (on Vulkan and tuned ROCm backends), token generation speed is about 10 to 20 percent slower than the Spark.
That would put the Framework Desktop (and other Ryzen AI Max mini PCs) as a contender to the NVIDIA DGX Spark (ARM-based CPU + Blackwell GPU) and the Apple Silicon Macs. What makes Strix Halo unique is that it's the only x86-based CPU option in this space, and relatively the cheapest too:
https://strixhalo.wiki/Guides/Buyer's_Guide#alternatives
annodomini@reddit
The Qwen 3.5 and 3.6 models up to 122B params, and the Gemma 4 models, are some of the best models that run well on this kind of hardware.
GPT-OSS also works, but it's a bit old at this point, though some folks still like it. Nvidia Nemotron is decent, and one of the more open models in terms of training process and data.
natermer@reddit
You can combine it with an eGPU if you want.
For a lot of LLM stuff, the limiting factor is memory speed and memory capacity rather than raw compute power.
These "AI Max+ 395" chips have soldered-in LPDDR5X 8000 MT/s memory and are capable of a theoretical 256GB/s of memory bandwidth. To put that into perspective, a typical Ryzen 7 or 9 desktop gets around 57GB/s or so. You'd need a Threadripper system with all the channels filled to get something similar.
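For what it's worth, that 256GB/s figure falls straight out of the bus width, assuming the 256-bit LPDDR5X interface these chips are quoted with:

```
# 8000 MT/s * 256-bit bus / 8 bits per byte = 256,000 MB/s
echo "$((8000 * 256 / 8)) MB/s"   # -> 256000 MB/s = 256 GB/s
```

The desktop figure is the same math on a slower 128-bit dual-channel bus.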
This puts it on par with a lower-end GPU, except that it has potentially 96GB of VRAM.
If you hack in an eGPU setup, you can attach something like a Radeon RX 7900 XTX with 24GB of VRAM for a total of 120GB of VRAM.
That is enough to run a 120 to 200 billion parameter model.
There is also a big difference between "dense" and "mixture of experts" models, and it comes down to, I'm pretty sure, memory speed: a dense model has to stream every weight for each token, while an MoE only touches its active experts.
Like, the Qwen 27B models are dense and I have a hard time running those on my laptop... it gets maybe 5 to 10 tps. However, with the larger Qwen 3.6 model I can typically get around 20 tps, which is sufficient performance.
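A back-of-the-envelope way to see the gap, under the assumption that generation is purely bandwidth-bound (weights streamed once per token; real numbers land well below these ceilings):

```
# tok/s ceiling ~= memory bandwidth / bytes streamed per token
# dense 27B at Q4 (~0.5 bytes/param) -> ~13.5GB per token
# MoE with ~3B active params at Q4   -> ~1.5GB per token
awk 'BEGIN { printf "dense: ~%.0f tok/s  moe: ~%.0f tok/s\n", 256/13.5, 256/1.5 }'
```

The ratio, not the absolute numbers, is the point: the MoE touches roughly an order of magnitude fewer bytes per token.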
And then there is the ability to cluster these things. So it is hard to say what he is doing. You can run powerful models locally... it just takes money.
The best open-weight models are no slouch. They are not as good as the top-tier models like Claude Opus... but they are about the same as or better than the cheaper or "fast" versions available from hosting providers.
Which means that even if you feel you benefit heavily from the most expensive models available, you can still benefit and offset a lot of the cost by doing things locally.
Right now these are about $3K machines. If you are burning through $300 or more a month on tokens and self-hosting cuts that cost by 50%, it'll take under two years to pay it off.
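The break-even math under exactly those assumptions:

```
# $3000 hardware / ($300 per month * 50% offset) = 20 months
echo "$((3000 / (300 / 2))) months to break even"   # electricity stretches this out a bit
```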
Also you then have something you don't have to worry about the tokens for. You can have it clanking away 24/7 and the only thing you pay for after the initial purchase is electricity.
Dangerous-Report8517@reddit
The Framework Desktop runs at 10,000 MT/s iirc
Journeyj012@reddit
So the AMD Ryzen AI Max+ 395 seems to have 128GB of VRAM, and I'm guessing the Framework probably has somewhere in the area of 128-256GB of RAM. With that, I would guess either Deepseek v4, a low-quant Kimi K2.6, or GLM 5.1, depending on how long GKH_clanker has been active.
Green0Photon@reddit
The Framework Desktop is an AMD Ryzen AI Max+ 395 board.
It doesn't go higher than 128GB, so it's almost definitely just 128GB.
Journeyj012@reddit
Forgive me for my lack of knowledge on AI hardware.
I meant that the rest of the model would be loaded into RAM. Does the AI Max share RAM and VRAM as one?
Green0Photon@reddit
Correct.
It's essentially a high-end consumer board, like Apple's chips. Both only have RAM, which can be utilized like VRAM.
Like laptop and mobile chips, but sufficiently powerful that it's more akin to game consoles in their ability to use it.
The AI Max uses soldered LPDDR5X-8000, which is fast enough to be usable. Especially since what matters is having all that data in any kind of RAM instead of in storage.
But it's fundamentally an iGPU situation. The system allocates, say, 96GB as VRAM, with 32GB left over for normal RAM, and the model can get loaded into the "VRAM" as if it were a normal GPU. Just slower. But that's still faster than, say, a card with only 24GB of GDDR6 VRAM that has to offload the rest, since you need the whole model in memory.
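As a concrete (hypothetical) sketch of what that looks like with llama.cpp, where the model file and context size are placeholders, not what anyone in particular is running:

```
# Offload every layer to the iGPU; with a large enough GTT the whole
# model sits in GPU-addressable memory even though it's "system" RAM
llama-server -m ./some-120b-model-Q4_K_M.gguf -ngl 999 -c 32768
```

Setting -ngl (GPU layers) absurdly high just means "all of them".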
Journeyj012@reddit
Ah, so it could be quantised Deepseek v4 or Qwen 3.6.
annodomini@reddit
You're not really going to have a good time quantizing Deepseek that much.
It's more likely to be one of the models that fits into RAM on that system: GPT-OSS 120b, one of the Qwen 3.5 or 3.6 models, or one of the Gemma 4 models. At this point the Qwen and Gemma models greatly surpass the capabilities of GPT-OSS, so probably one of those, unless he got it set up a few months ago and hasn't touched it since.
Green0Photon@reddit
It's hard for me to tell for sure, but the Framework Desktop seems to have normal support for 112GB VRAM on Linux. One report says 122GB, but that might not leave anything usable.
Journeyj012@reddit
Qwen 3.6 would sit smoothly in there 3 times over, and it performs at around the level of Claude 4.5 Opus.
jonnywoh@reddit
Yes, it's shared memory. 128GB total.
HomsarWasRight@reddit
I’m also curious.
It can fit really large dense models, but it’s slow. Depending on how large you go, for non-interactive tasks it’s kinda fine, IMHO.
But where it really shines is MoE models. These can be really large and need all that space, but they are much faster to run because only a fraction of the weights are active per token, which eases the memory-bandwidth bottleneck.
creeper6530@reddit
It doesn't need to livestream its output to an impatient human, so I wager it's alright.
gregkh@reddit
So strange when there's a picture that has at least 5 visible computers in it, that everyone only focuses on one of them and ignores the rest...
RoomyRoots@reddit
Because that's the one that is recognizable and one of the best builds for cheap local AI.
RoomyRoots@reddit
Seems like a good usage of a Strix Halo. Kinda what I wanted to build for myself.
redbarchetta_21@reddit
Perfect use case for the device.
nitroburr@reddit
Calling it clanker is just *chef's kiss*. Thanks Greg.
platinummyr@reddit
The model I can get behind
Chad-Anouga@reddit
When the bots take over, this will be their version of HP Lovecraft's cat
Responsible-Key8163@reddit
Pretty wild and honestly very fitting for kernel development, using local LLMs for bug finding feels much more practical than a lot of AI hype. Also cool seeing it run on local hardware instead of some giant cloud setup.
Otherwise_Wave9374@reddit
Super cool. A local LLM that can actually uncover bugs in something as gnarly as the kernel is one of the better arguments for "agents" being useful today.
Any details on whether it is running in a tight tool loop (build, run tests, minimize repro) or mostly doing static reasoning + reporting? If you are looking for similar agent-y dev workflow examples, https://www.agentixlabs.com/ has a few.
AutoModerator@reddit
This comment has been removed due to receiving too many reports from users. The mods have been notified and will re-approve if this removal was inappropriate, or leave it removed.
I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.
PerkyPangolin@reddit
Whatever the polite version of "get bent" is.
Legal-Swordfish-1893@reddit
“Go forth and prosper”?