The new Linux kernel AI bot uncovering bugs is a local LLM on Framework Desktop + AMD Ryzen AI Max
Posted by Fcking_Chuck@reddit | linux | View on Reddit | 50 comments
HearMeOut-13@reddit
Ok, but the real question is what LLM is it running? Because while, yes, the AMD Ryzen AI Max is pretty strong, I can't imagine it's strong enough
jack-of-some@reddit
It's strong enough if you have the patience. You can outfit it with 128GB of RAM, 96GB of which can be dedicated to the GPU. With MoE-style models you can achieve really good context lengths and decent tok/s
z-lf@reddit
On Linux, you can dedicate the whole RAM to the GPU. The 96GB is a Windows limitation.
DonaldLucas@reddit
I don't understand. Are you saying that you can dedicate all 128GB to the GPU while 0 goes to the OS?
z-lf@reddit
No, I'm saying you can let the OS manage it all, and it will give whatever it doesn't need to the GPU.
Since you'd usually put this on a server OS, that's at most 2GB for the OS (running containers + llama.cpp, for instance); the rest can be used by the GPU.
MaybeTheDoctor@reddit
“Who could possibly need more than 640kb ram?”
RoosTheFemboy@reddit
Couldn't it dedicate more to the GPU on Linux? AFAIK this is a Windows limitation
creeper6530@reddit
Yes, Framework themselves said the 96GB is a Windows limit
HomsarWasRight@reddit
For my Strix Halo machine (ROG Flow Z13), reserving more than that is a BIOS limitation.
But as I said above, it’s actually better not to reserve it. Better to set the reservation as low as possible (512MB on my machine) and then configure the GTT to allow the GPU and CPU to share the entire remaining pool.
The only reason not to do it that way is if you’re simultaneously doing something that uses a great deal of system memory and you want to make sure it’s there.
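Concretely, here's a minimal sketch of what I mean on a 128GB machine. The values are illustrative (sized for roughly 120GB of GTT), and exact parameter handling can vary between kernel versions:

```
# /etc/default/grub, illustrative values for a 128GB Strix Halo box
# amdgpu.gttsize is in MiB; ttm.pages_limit is in 4KiB pages
GRUB_CMDLINE_LINUX_DEFAULT="amdgpu.gttsize=122880 ttm.pages_limit=31457280"
```

Then run update-grub (or your distro's equivalent), reboot, and check dmesg | grep -i gtt to confirm the new GTT size.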
hi_im_mom@reddit
Nowhere near what you get paying for tokens on an actual frontier model, though.
Whenever you see a bunch of Mac minis or other local shit like this, look with caution.
HomsarWasRight@reddit
You’re right that it’s not to the level of the best frontier models. People do overestimate what you can do locally based on some benchmarks and not actual use.
But it doesn't necessarily have to be at that level to be useful and to avoid a ton of costs. This can run 24/7, and the initial hardware cost plus energy usage can still come out way ahead of the same result done on cloud models.
It's going to be slow (assuming non-MoE models, which would be most useful), but when working non-interactively it's fine.
It would be smart to (once something is found) have a frontier model look it over quickly as a confirmation. Local + remote workflow makes sense.
hi_im_mom@reddit
People want agents. Local models can't do that effectively, even with 256GB. I know from experience.
HomsarWasRight@reddit
I mean, I kinda disagree. I’m running a Strix Halo 128GB and it’s useful. So I also speak from experience.
But we're not talking about "what people want", we're talking about a real, credible person running an actual workload that he says is useful.
hi_im_mom@reddit
I hear what you're saying. The "running Claude Code in the browser" type of person.
warpedgeoid@reddit
The difference is much smaller than you think with a proper setup, if you know what you are doing.
hi_im_mom@reddit
No. It isn't.
HomsarWasRight@reddit
For LLMs it’s actually better not to dedicate memory to the GPU and use GTT to allow the GPU to directly address the entire pool.
Marce7a@reddit
Probably DeepSeek or other Chinese open models
Daktyl198@reddit
Since nobody answered you, Greg said "all of them". He's basically testing all of the ones that can run locally to see where they shine and where they fail.
amroamroamro@reddit
The Ryzen AI Max is actually a beast, rivaling desktop-class CPUs:
https://www.phoronix.com/review/ryzen-ai-max-395-9950x-9950x3d/2
Plus, no other iGPU even comes close; it's nearly on par with a current-gen low-end dedicated NVIDIA dGPU
Dangerous-Report8517@reddit
Sure, but desktop-class CPUs are incredibly slow for this use case. What actually makes it work reasonably well is that it's effectively a 128GB-VRAM GPU, so it can run large-ish models without any CPU offload, giving it reasonable performance. That, and the fact that it's not being used interactively, so going a bit slower is fine.
amroamroamro@reddit
The APU with up to 128GB of shared memory lets you load some decent medium-sized models at usable tok/s speeds.
I googled a bit for some LLM benchmarks:
phoronix comparing CPU vs GPU backends (ROCm and Vulkan), though the models used were all kinda smallish: https://www.phoronix.com/review/amd-rocm-7-strix-halo/3
these look much more comprehensive with various models ranging from small to medium that can fit in the unified VRAM: https://kyuz0.github.io/amd-strix-halo-toolboxes/
more tests with models like gpt-oss-120B, GLM-4.5-Air, qwen3-235B-A22B: https://github.com/lhl/strix-halo-testing/tree/main/llm-bench
another writeup: https://netstatz.com/strix_halo_lemonade/#Performance (QWEN3-30B-Coder-A3B ~ 59 t/s, GPT-OSS-120B ~ 47 t/s, QWEN3-235B-Coder-A22B Instruct 2507 ~ 11 t/s)
These are respectable speeds on actually decent models you can get work done with interactively. This Framework Desktop + Ryzen AI Max is a sweet little machine :)
Oh, and it's not just the GPU: the often-forgotten NPU can be utilized for LLMs too, see: https://fastflowlm.com/
Dangerous-Report8517@reddit
All of that is true, but it has absolutely nothing to do with the fact that the CPU cores are almost as fast as desktop cores. What makes the AI Max work well here is that it has a pretty strong GPU with access to all of that memory, plus a quad-channel memory controller, both of which are absent on every consumer-class desktop platform.
amroamroamro@reddit
From the above Phoronix benchmarks, llama.cpp text generation on the CPU BLAS backend was about half the speed of the GPU Vulkan backend, so the CPU here is no slouch either.
And according to their benchmarks (on Vulkan and tuned ROCm backends), token generation speed is about 10 to 20 percent slower than the Spark.
That would put the Framework Desktop (and other Ryzen AI Max mini PCs) as a contender to the NVIDIA DGX Spark (ARM-based CPU + Blackwell GPU) and the Apple Silicon Macs. What makes Strix Halo unique is that it's the only x86-based CPU option in this space, and relatively the cheapest too:
https://strixhalo.wiki/Guides/Buyer's_Guide#alternatives
annodomini@reddit
The Qwen 3.5 and 3.6 models up to 122B params, and the Gemma 4 models, are some of the best models that run well on this kind of hardware.
GPT-OSS also works, but it's a bit old at this point, though some folks still like it. Nvidia Nemotron is decent, and one of the more open models in terms of training process and data.
natermer@reddit
You can combine it with an eGPU if you want.
For a lot of LLM stuff, the limiting factor is memory speed and memory capacity rather than raw compute power.
These "AI Max+ 395" chips have soldered-in LPDDR5X 8000 MT/s memory and are capable of a theoretical 256GB/s of memory bandwidth. To put that into perspective, a typical Ryzen 7 or 9 desktop gets around 57GB/s or so. You'd need a Threadripper system with all the channels filled to get something similar.
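For what it's worth, that 256GB/s figure falls straight out of the bus width, assuming the 256-bit LPDDR5X interface these chips are quoted with:

```
# 8000 MT/s * 256-bit bus / 8 bits per byte = 256,000 MB/s
echo "$((8000 * 256 / 8)) MB/s"   # -> 256000 MB/s = 256 GB/s
```

The desktop figure is the same math on a slower 128-bit dual-channel bus.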
This puts it on par with a lower-end GPU, except that it has potentially 96GB of VRAM.
If you hack in an eGPU setup, you can attach something like a Radeon RX 7900 XTX with 24GB of VRAM for a total of 120GB of VRAM.
That is enough to run a 120 to 200 billion parameter model.
There is also a big difference between "dense" and "mixture of experts" models, and it comes down to, I'm pretty sure, memory speed: a dense model has to stream every weight for each token, while an MoE only touches its active experts.
Like, the Qwen 27B models are dense and I have a hard time running those on my laptop... it gets maybe 5 to 10 tps. However, with the larger Qwen 3.6 model I can typically get around 20 tps, which is sufficient performance.
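A back-of-the-envelope way to see the gap, under the assumption that generation is purely bandwidth-bound (weights streamed once per token; real numbers land well below these ceilings):

```
# tok/s ceiling ~= memory bandwidth / bytes streamed per token
# dense 27B at Q4 (~0.5 bytes/param) -> ~13.5GB per token
# MoE with ~3B active params at Q4   -> ~1.5GB per token
awk 'BEGIN { printf "dense: ~%.0f tok/s  moe: ~%.0f tok/s\n", 256/13.5, 256/1.5 }'
```

The ratio, not the absolute numbers, is the point: the MoE touches roughly an order of magnitude fewer bytes per token.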
And then there is the ability to cluster these things. So it is hard to say what he is doing. You can run powerful models locally... it just takes money.
The best open-weight models are no slouch. They are not as good as the top-tier models like Claude Opus... but they are about the same as or better than the cheaper or "fast" versions available from hosting providers.
Which means that even if you feel you benefit heavily from the most expensive models available, you can still benefit and offset a lot of the cost by doing things locally.
Right now these are about $3K machines. If you are burning through $300 or more a month on tokens and self-hosting cuts that cost by 50%, it'll take under two years to pay it off.
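The break-even math under exactly those assumptions:

```
# $3000 hardware / ($300 per month * 50% offset) = 20 months
echo "$((3000 / (300 / 2))) months to break even"   # electricity stretches this out a bit
```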
Also you then have something you don't have to worry about the tokens for. You can have it clanking away 24/7 and the only thing you pay for after the initial purchase is electricity.
Dangerous-Report8517@reddit
The Framework Desktop runs at 10,000 MT/s iirc
Journeyj012@reddit
So the AMD Ryzen AI Max+ 395 seems to have 128GB of VRAM, and I'm guessing the Framework probably has somewhere in the area of 128-256GB of RAM. With that, I would guess either Deepseek v4, a low-quant Kimi K2.6, or GLM 5.1, depending on how long GKH_clanker has been active.
Green0Photon@reddit
The Framework Desktop is an AMD Ryzen AI Max+ 395 board.
It doesn't go higher than 128GB, so it's almost definitely just 128GB.
Journeyj012@reddit
Forgive me for my lack of knowledge on AI hardware.
I meant that the rest of the model would be loaded into RAM. Does the AI Max share RAM and VRAM as one?
Green0Photon@reddit
Correct.
It's essentially a high-end consumer board, like Apple's chips. Both only have RAM, which can be utilized like VRAM.
Like laptop and mobile chips, but sufficiently powerful that it's more akin to game consoles in their ability to use it.
The AI Max uses soldered LPDDR5X-8000, which is fast enough to be usable. Especially since what matters is having all that data in any kind of RAM instead of in storage.
But it's fundamentally an iGPU situation. The system allocates, say, 96GB as VRAM, with 32GB left over for normal RAM, and the model can get loaded into the "VRAM" as if it were a normal GPU. Just slower. But that's still faster than, say, a card with only 24GB of GDDR6 VRAM that has to offload the rest, since you need the whole model in memory.
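As a concrete (hypothetical) sketch of what that looks like with llama.cpp, where the model file and context size are placeholders, not what anyone in particular is running:

```
# Offload every layer to the iGPU; with a large enough GTT the whole
# model sits in GPU-addressable memory even though it's "system" RAM
llama-server -m ./some-120b-model-Q4_K_M.gguf -ngl 999 -c 32768
```

Setting -ngl (GPU layers) absurdly high just means "all of them".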
Journeyj012@reddit
Ah, so it could be quantised Deepseek v4 or Qwen 3.6.
annodomini@reddit
You're not really going to have a good time quantizing Deepseek that much.
It's more likely to be one of the models that fits into RAM on that system: GPT-OSS 120b, one of the Qwen 3.5 or 3.6 models, or one of the Gemma 4 models. At this point the Qwen and Gemma models greatly surpass the capabilities of GPT-OSS, so probably one of those, unless he got it set up a few months ago and hasn't touched it since.
Green0Photon@reddit
It's hard for me to tell for sure, but the Framework Desktop seems to have normal support for 112GB VRAM on Linux. One report says 122GB, but that might not leave anything usable.
Journeyj012@reddit
Qwen 3.6 would sit smoothly in there 3 times over, and it performs at around the level of Claude 4.5 Opus.
jonnywoh@reddit
Yes, it's shared memory. 128GB total.
HomsarWasRight@reddit
I’m also curious.
It can fit really large dense models, but it’s slow. Depending on how large you go, for non-interactive tasks it’s kinda fine, IMHO.
But where it really shines is MoE models. These can be really large and need all that space, but they are much faster to run because only a fraction of the weights are active per token, which eases the memory-bandwidth bottleneck.
creeper6530@reddit
It doesn't need to livestream its output to an impatient human, so I wager it's alright.
gregkh@reddit
So strange when there's a picture that has at least 5 visible computers in it, that everyone only focuses on one of them and ignores the rest...
RoomyRoots@reddit
Because that's the one that is recognizable and one of the best builds for cheap local AI.
RoomyRoots@reddit
Seems like a good usage of a Strix Halo. Kinda what I wanted to build for myself.
redbarchetta_21@reddit
Perfect use case for the device.
nitroburr@reddit
Calling it clanker is just *chef's kiss*. Thanks Greg.
platinummyr@reddit
The model I can get behind
Chad-Anouga@reddit
When the bots take over, this will be their version of HP Lovecraft's cat
Responsible-Key8163@reddit
Pretty wild and honestly very fitting for kernel development, using local LLMs for bug finding feels much more practical than a lot of AI hype. Also cool seeing it run on local hardware instead of some giant cloud setup.
Otherwise_Wave9374@reddit
Super cool. A local LLM that can actually uncover bugs in something as gnarly as the kernel is one of the better arguments for "agents" being useful today.
Any details on whether it is running in a tight tool loop (build, run tests, minimize repro) or mostly doing static reasoning + reporting? If you are looking for similar agent-y dev workflow examples, https://www.agentixlabs.com/ has a few.
AutoModerator@reddit
This comment has been removed due to receiving too many reports from users. The mods have been notified and will re-approve if this removal was inappropriate, or leave it removed.
I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.
PerkyPangolin@reddit
Whatever the polite version of "get bent" is.
Legal-Swordfish-1893@reddit
“Go forth and prosper”?