Local mini LLM PC?
Posted by LankyGuitar6528@reddit | LocalLLaMA | 26 comments
Hey people... I keep seeing Fakebook ads for a local AI computer that's "perfect" for my local LLM. I do light coding and I'd like to run a decent LLM... play around a bit with some of these fancy new models you guys are posting about.
This is the PC: a GMKtec EVO-X2 with an AMD Ryzen AI Max+ 395, 128GB of RAM, and a 2TB SSD for $3,299 USD.
I don't know about the rules for links so mods please forgive me if I have sinned. I don't have any affiliate link or anything to sell. I'll black it out too... but this is the one (128GB variant) I'm looking at:
[blacked-out product link]
Please tell me why these specs are terrible and why I'm an idiot for considering this when I could easily buy something 10X cheaper and 100X better or wait 2 weeks for the new version to drop?
lemondrops9@reddit
The main question I have is: do you have a PC that you can add GPUs to? Because that's a way better option for speed and price. Unless you need to run large models, but then you'll have to put up with slow speeds.
LankyGuitar6528@reddit (OP)
I have a PC with 32GB of RAM and one slot filled with a 3060 at present.
lemondrops9@reddit
Easier and better to grow your system by adding GPUs, IMO.
LankyGuitar6528@reddit (OP)
My motherboard takes just one graphics card. No sense tossing in a $1,000 card, finding it's not enough, and then turning around and buying a whole new $3,500 computer.
lemondrops9@reddit
Does it though? I thought I'd only be able to run 2 GPUs on my motherboard, but now I have 6 running off of it.
An open M.2 slot will work, or the WiFi slot, or a free PCIe slot; or upgrade your motherboard if you need more options.
ironwroth@reddit
I'm able to run 4-bit Qwen 3.6 35B A3B at like 30 t/s on 32GB of DDR4 and an 8GB 3070. You should try that before you buy anything.
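For anyone who wants to try that kind of test first, here's a minimal sketch using llama-cpp-python with a downloaded GGUF quant. The file path and layer count are placeholders; tune `n_gpu_layers` to whatever fits your VRAM:

```python
from llama_cpp import Llama  # pip install llama-cpp-python (CUDA build for a 3060/3070)

# Hypothetical GGUF path; grab a 4-bit quant of whatever MoE model you want to test.
llm = Llama(
    model_path="models/qwen3.6-35b-a3b-Q4_K_M.gguf",
    n_gpu_layers=20,   # offload as many layers as fit in 8GB VRAM; the rest stays in system RAM
    n_ctx=8192,        # modest context to keep memory use down
)

out = llm("Write a Python function that reverses a string.", max_tokens=256)
print(out["choices"][0]["text"])
```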
MrBIMC@reddit
I'm using a Strix Halo machine (albeit from Beelink). Got it for €2,400 the second they became available for preorder.
Things it can do in LLM terms:
Qwen3.6-35b-a3b at 8-bit with MTP at around 60 sustained tps, full context, KV cache at q8_0, with 3-token MTP and parallel 1.
Minimax m2.7 at iq3-k-xl with 200k q4 context (couldn't get turbo4 to run) and --parallel 2, I get about 40 sustained tps across 2 sessions of around 20 tps each. With parallel 1 it maintains 30 tps.
Haven't yet managed to get dflash running on llama.cpp on this machine across many forks.
On the vLLM side I haven't had much success either. For a single node, llama.cpp over Vulkan (RADV) is the way to go currently. But that stuff changes from week to week nowadays, so whoever is reading this in the future, please recheck the current state of affairs.
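For reference, a rough sketch of how a llama-server launch with settings along those lines might look. The model filename is hypothetical and exact flag names/values can differ between llama.cpp builds:

```python
import subprocess

# Illustrative only: swap in your own quant file and adjust values for your build.
subprocess.run([
    "llama-server",
    "-m", "models/minimax-m2.7-iq3.gguf",  # hypothetical quant file
    "-ngl", "99",                # offload everything to the iGPU (Vulkan/RADV build)
    "-c", "131072",              # large context window
    "--cache-type-k", "q8_0",    # quantized KV cache to save memory
    "--cache-type-v", "q8_0",    # may require flash attention enabled on some builds
    "--parallel", "2",           # serve two sessions concurrently
    "--port", "8080",
])
```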
For light coding and Hermes orchestration that thing is decent. Two of them would be even better, though.
Current prices are stupid and Strix Halo now approaches the Nvidia Spark in price. Here in Ukraine I can get an official Asus GX10 for $4k; all locally available Strix Halos are now pricier than that, which is insane.
TLDR:
Decent machine, but you'd be better off with some variant of the DGX Spark; check the price. Strix Halo can game though, haha.
LankyGuitar6528@reddit (OP)
Great info. Thanks!
Th3Sim0n@reddit
I was eyeing a similar mini PC, the Bosgame M5. It has the same config but is cheaper due to slightly worse build quality, though it's still decent enough.
After reading all the pros and cons and going through the reviews, I decided it's too expensive for what you get, at least for me. Price aside, it's a beast: the iGPU is on par with an RTX 4060/4070 Mobile, which is pretty darn good for gaming, and the CPU is also best in class; you can throw anything at it and it won't break a sweat.
The large amount of unified memory is the main benefit. It's fast, and running MoE LLMs will yield acceptable speeds, but don't expect anything spectacular. It will also handle large models like Qwen 122b, Nemotron 120b, gpt-oss-120b, or similar in Q4-Q6 quants at reasonable speeds. Dense models are the main weak point of such a device: the best-in-class Qwen 3.6 27B or Gemma 4 31b will be pretty slow, somewhere around 10 tps. It's a fun device for tinkering and a great all-rounder, but nothing spectacular nor production-ready.
Another "bonus" is that it's very energy efficient, drawing a max of ~140W under load.
Again, if the price was lower I would definitely grab it, but for now it's just an expensive toy.
Instead, I'd try to grab 2x 3090s with a DDR4 motherboard that can do at least PCIe 3.0 x8/x8 if you want to keep it cheap, or go with something on DDR5 that can do PCIe 4.0/5.0, but that will be more expensive in total. Pair that with 64GB of RAM and you'll have a very similar memory capacity that will run comparable models way faster, especially dense ones, for less money.
That's what I did in the end: I have 4x 3090s on an X299 board + i9 9820X + 64GB of quad-channel memory for basically the same price as the 128GB Strix Halo.
LankyGuitar6528@reddit (OP)
Probably the best advice. Thanks!
FullstackSensei@reddit
If you need to ask, IMO you shouldn't buy anything no matter what anyone tells you.
Your post reads like someone who knows almost nothing about local LLMs, which is a recipe for a terrible combination of disappointment, frustration, and wasted money.
Spend a week or two learning about local LLMs: how to run them, what to expect, whether they can meet your needs, etc. You don't need a beefy machine to try things out; you can run smaller models on almost any hardware you have and get comfortable with the software stack and tooling. You can also spend a few bucks on APIs to try different models and see how small you can go for your needs.
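If it helps, a small sketch of that kind of API test against an OpenAI-compatible endpoint. The base URL and model names are placeholders for whatever provider and models you pick:

```python
from openai import OpenAI  # pip install openai

# Placeholder endpoint and model names: swap in whatever provider and models you want to compare.
client = OpenAI(base_url="https://example-provider/v1", api_key="YOUR_KEY")

prompt = "Refactor this function to use a list comprehension: ..."
for model in ["some-small-8b-model", "some-mid-30b-model"]:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {model} ---")
    print(resp.choices[0].message.content)
```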
Only after you've learned enough to have an opinion about what you need to run should you start looking at hardware options that could suit your needs.
LankyGuitar6528@reddit (OP)
I am indeed someone who knows nothing about local LLMs. Well... I guess technically slightly less than nothing. I run llama nomic for embeddings in a SQL Server that Claude uses for memory searches. But nothing useful for building a decent local LLM computer that can assist with coding. But how do you know unless you start asking questions and eventually dive in? If I never bought a car I'd still be sitting in the back row of Driver's Ed class making paper airplanes to throw when the instructor turned his back.
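For context, a typical nomic-style embedding call looks something like the sketch below, here assuming an Ollama instance serving nomic-embed-text on the default port (the URL and model name are assumptions, not the OP's exact setup):

```python
import requests

# Assumption: a local Ollama instance with nomic-embed-text pulled.
# A memory-search setup like the one described would store these vectors
# (e.g. in SQL Server) and do a similarity lookup; this only shows the embedding call.
resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "note: refactor the auth module next sprint"},
)
vector = resp.json()["embedding"]
print(len(vector), vector[:5])  # dimensionality and a peek at the first few values
```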
FullstackSensei@reddit
Sorry, but that analogy is very bad. You absolutely don't need to buy a car to learn how to drive.
I already explained in my comment how you can learn without wasting more than $3k.
LankyGuitar6528@reddit (OP)
Well, that's true. Do what my brother did: "borrow" my mother's car and wrap it around a telephone pole, then "borrow" my sister's car and sink that one in the lake, then "borrow" my mother's rental while her car was in the shop and wrap that one around a telephone pole too. Way cheaper than buying your own car.
Moscato359@reddit
You can dive in and play with local LLMs without buying any new hardware at all.
lemondrops9@reddit
💯 and then a person would know what 10 tokens a second looks like.
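If you want to feel what that's like before spending anything, here's a toy sketch that just prints words at roughly 10 per second (treating a word as a token, which is only a rough approximation):

```python
import sys
import time

text = ("Here is what a response streaming at roughly ten tokens per second "
        "feels like when you are waiting on a coding answer from a local model. ") * 3

# Print one word every 0.1 s, i.e. ~10 "tokens" per second.
for word in text.split():
    sys.stdout.write(word + " ")
    sys.stdout.flush()
    time.sleep(0.1)
print()
```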
Silver-Champion-4846@reddit
This.
BankjaPrameth@reddit
If you want to do agentic coding, may I propose the DGX Spark? What the difference in price buys you is prompt processing speed and the ability to connect 2 or more devices together for future expansion.
But focus on prompt processing (prefill speed) for now. Don't take my word for it; do more research on this topic to see why it's worth considering and decide later.
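One way to see the prefill vs. generation split for yourself: stream a request against any OpenAI-compatible local server and time the first token separately from the rest. The base URL and model name below are placeholders:

```python
import time
from openai import OpenAI  # pip install openai

# Placeholder endpoint and model name: point this at whatever local server you're testing.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

start = time.time()
first_token_at = None
n_chunks = 0
stream = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Summarize this long design doc: ..."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.time()  # everything up to here is dominated by prefill
        n_chunks += 1

if first_token_at and n_chunks > 1:
    print(f"time to first token (prefill): {first_token_at - start:.2f}s")
    print(f"generation rate: {n_chunks / (time.time() - first_token_at):.1f} chunks/s (roughly tokens/s)")
```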
LankyGuitar6528@reddit (OP)
I love the look of the DGX Spark, the sparkly gold one in particular. Beautiful. But at $7K CAD on Amazon it's just out of my budget... and 2 of them... nope, not in this lifetime. The $3,500 range is doable, though. Then again, if it doesn't work, what's the point? Perhaps a decent LLM machine is just not within the realm of reality for me at this point.
BankjaPrameth@reddit
You should look at third-party models like Asus or MSI. I bought my MSI one for around $4,000 just last month.
LankyGuitar6528@reddit (OP)
Good call. You mean an Asus- or MSI-branded DGX Spark for $4,000 USD? Or did you have another model in mind? $4,000 USD would be about $5,500 CAD.
IMakeBreadLoaves@reddit
Get the Asus GX10 for $3,500. It only has 1TB of storage, but if AI development or inference is your thing, the CUDA ecosystem will be the path of least resistance.
PositiveBit01@reddit
I bought a DGX Spark, which has more compute and works with CUDA, but has similar memory bandwidth. So better prompt processing but usually similar token generation (maybe more room for MTP/dflash optimizations, and supposedly NVFP4 someday).
So it's best for MoE models that fit in the RAM, but right now it seems like gemma4 31b and qwen3.6 27b are the best you can do, so a 5090 would maybe be better (of course you need the rest of the PC, but it should still be cheaper, or at least not more expensive).
I thought the extra RAM allowing some big MoE models would make it worth it, but I'm not sure.
But then tomorrow a new MoE model that fits and is awesome could come out. Who knows. Feels like there's a bit of a gap right now: recent good models have either been smallish, where you don't really make good use of the RAM, or just a tad too big for 128GB if you want a decent context size and a little headroom at Q4. I guess I could try Q3 but I hear the drop-off isn't worth it.
Anyway, I don't regret my purchase. It's a fairly low-power mini PC with 128GB of RAM and decent compute. I can at least run some VMs and side models on it for quick tasks and long-running always-on agents, even if it doesn't end up being my main driver. For sure it's going a long way towards helping me accomplish my goal of learning more about this stuff.
imshookboi@reddit
Check out r/strixhalo. Lots you can do, but be warned: they're a little slow, especially with dense models.
noticedbyai@reddit
Given the state of the market, that doesn't look terrible. Memory bandwidth isn't too bad; 256-bit, right? What kind of models do you want to run? Any idea on parameter count, MoE vs. dense, or any other specifics?
Herr_Drosselmeyer@reddit
It's not bad, especially if you get the largest VRAM configuration, provided you're not under the illusion that it's going to run large models at blazing speeds, because it won't. It doesn't use a lot of power (120W), but it has pretty poor memory bandwidth at 256GB/s.
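That bandwidth figure is what caps generation speed: during decode, roughly all the active weights have to be read from memory for every token, so a back-of-the-envelope ceiling (the numbers here are illustrative, not measurements) looks something like this:

```python
# Rough upper bound on tokens/sec for memory-bandwidth-bound decode:
# each generated token reads roughly the active weights once from RAM.
bandwidth_gb_s = 256  # Strix Halo / DGX Spark class unified memory

def max_tps(active_params_billions, bytes_per_param):
    model_gb = active_params_billions * bytes_per_param  # billions of params * bytes ≈ GB
    return bandwidth_gb_s / model_gb

# Illustrative cases (ignores KV-cache reads and other overhead, so real numbers come in lower):
print(f"dense ~30B at Q4 (~18 GB): {max_tps(30, 0.6):.0f} t/s ceiling")
print(f"dense ~30B at Q8 (~30 GB): {max_tps(30, 1.0):.0f} t/s ceiling")
print(f"MoE with ~3B active at Q4: {max_tps(3, 0.6):.0f} t/s ceiling")
```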
Like the DGX Spark, I see it more as a dev kit where you develop a proof of concept that you plan to later deploy to a more powerful rig, rather than a primary inference machine.