ROG Flow Z13 best laptop for local LLMs?
Posted by Bombarding_@reddit | LocalLLaMA | 36 comments
Hey y'all, I've been trying to figure out what laptop would be best for running local LLMs at my company (a small startup), and they want to splurge on whatever laptops run LLMs locally the best.
ASUS ROG Flow Z13 with 128gb unified memory
seems to be the top pick according to all reviewers right now, including Tom's Guide. It's steep, going for $2.8k right now, and pretty gamer-y tbh. Anyone know of other laptops that'd outperform this one?
We're looking at buying them for the employees who use it within the next two months, but I could convince them to wait if something crazy is about to come out
Use case: exclusively work, mostly API coding tasks, plus some Excel work with PowerQuery to pull data from APIs, and macro coding as well.
Tom's Guide roundup: https://www.tomsguide.com/best-picks/best-ai-laptop#section-the-best-ai-laptop-overall
HealthyCommunicat@reddit
I tried this. Gets too hot and way too loud. Something as small as GPT-OSS 120b was… really sadly slow.
Saying it's the best laptop for LLMs is ridiculous. I directly returned the Z13 Flow and went to the M4 Max. Literally more than double the speed for t/s and pp.
Bombarding_@reddit (OP)
I really appreciate your insight!! Our only catch is we can't use macOS because Excel for Mac isn't full-featured, and we use PowerQuery and Pivot Table features pretty much daily.
I think a 96GB VRAM / 24GB system RAM split would handle 70b models pretty well, and cross-reference API documentation to quickly pull what we need.
HealthyCommunicat@reddit
I know that you're wanting it to, but I'm trying to tell you this isn't a matter of opinion or subjective; this is a mathematical thing you can calculate yourself and then confirm. It's just the rule of compute: take the in-memory size of the model's active parameters and divide your max memory bandwidth by it - that's your token/s count.
At q4, a 70b DENSE model is ~35GB. The Ryzen AI Max+ 395 has around 250 GB/s of memory bandwidth. 250/35 = 7.14, so around 7 token/s output.
Again, I tried this myself and have the stats and performance benchmarks written down. If you're willing to experience the disappointment for yourself, that's fine, but temper your expectations and understand that you're gonna HAVE to run MoE models, and even then be capped at 20-30 token/s maximum.
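Here's that napkin math as a quick Python sketch (rough assumptions and upper bounds only, not benchmarks; the MoE line is why something like gpt-oss-120b can still be quick on the same hardware):

```python
# Napkin math for bandwidth-bound decode speed: each generated token
# requires reading every active weight once, so
# tok/s <= bandwidth / active-weight bytes.
def est_tokens_per_sec(active_params_b, bytes_per_param, bandwidth_gb_s):
    """Upper-bound decode speed, ignoring KV-cache reads and overhead."""
    weight_gb = active_params_b * bytes_per_param  # GB read per token
    return bandwidth_gb_s / weight_gb

# 70b dense at ~4-bit (0.5 bytes/param) on ~250 GB/s:
print(est_tokens_per_sec(70, 0.5, 250))   # ~7.1 tok/s ceiling
# MoE with ~5.1b active params (gpt-oss-120b-like) at ~4-bit:
print(est_tokens_per_sec(5.1, 0.5, 250))  # ~98 tok/s ceiling; real-world lower
```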
Bombarding_@reddit (OP)
Ahhh, I gotcha. 7 TPS is definitely slow compared to the other numbers I vaguely know, if 50-60 is what flagship consumer models seem to output right now.
With that in mind, even if it's slow, I think it'd be okay: if that much memory significantly increases the context window, a slower system that's more accurate, with fewer hallucinations because it can hold more context, seems like a fair trade.
Also, I'm newer to the local LLM space, and I appreciate your insight. Sorry if I'm asking dumb questions, I'm still trying to piece things together.
HealthyCommunicat@reddit
The system you have does not affect hallucinations. That's down to the model you choose to run.
I know it might seem like there are many variables, but it really always comes down to memory bandwidth and speed. Get the core basics down; if you understand those, you won't have trouble understanding anything else. Inferencing LLMs is as simple as knowing bits and bytes - no "machine learning" knowledge needed at all, just the real basics of computer science.
Bombarding_@reddit (OP)
Okay, I may have misunderstood. I thought hallucinations were a result of context window limitations, which depend on the model, but to have larger context windows you need more VRAM and therefore more/better hardware.
HealthyCommunicat@reddit
It’s based on the attention mechanism! Here’s a great website for you to get started on preparing for what kind of model you would want to run, they have specific benchmarks for everything from math to physics to hallucination rates.
https://artificialanalysis.ai/models
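To be fair, the VRAM half of what you said is real: longer context does cost memory, through the KV cache. A rough sketch (the layer/head shapes here are illustrative of a 70b-class model; check the actual model card):

```python
# The KV cache grows linearly with context: 2 tensors (K and V) per
# layer per token, each n_kv_heads * head_dim elements.
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2.0):
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len / 1e9

# Llama-70b-like config (80 layers, 8 KV heads via GQA, head_dim 128)
# at 32k context with an fp16 cache:
print(kv_cache_gb(80, 8, 128, 32_768))  # ~10.7 GB on top of the weights
```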
Bombarding_@reddit (OP)
I watched a reviewer run gpt-oss 120b at a strong 50+ tokens per second in a benchmark he did on video. Would that not be plenty?
https://youtu.be/prIUKAbHlj8?si=chKUd6pMxs5FI34M
HealthyCommunicat@reddit
Hey - when you run bigger models that you'd actually want to use (biggest to smallest: Kimi K2.5, GLM 5, GLM 4.7, Qwen 3.5 397b, MiniMax m2.5, etc.), you need extra compute not only for the larger active parameter count but also for the higher expert count; that's what it takes to run genuinely capable models at proper speed. This is going to be super crucial if you're used to more intelligent models from Claude or OpenAI - it's even worse because those companies sustain speeds above 50 token/s while serving the intelligence of models that are trillions of parameters, while Kimi K2.5 is only 1 trillion and still doesn't match them.
My first unified-memory machine was this exact Strix Halo AI Max+ 395, and I ended up returning it because it just couldn't be used comfortably if you're used to working at the speed of Claude or OpenAI. I then tried the DGX Spark, and it was literally no better except for faster prompt processing.
I have nothing personally against the Strix Halo or the DGX Spark - I simply want to be as brutally honest as possible and warn others about the experience I went through, which cost me $1-2k just from having to return opened boxes.
Bombarding_@reddit (OP)
Ohhhh okay, I understand! I'd like to think I could work with small 70-120b models and still get decent output, but if it's going to feel wildly slower, then I might as well stick to a subscription and run Claude Code or Google Antigravity or something.
HealthyCommunicat@reddit
What kind of device are you looking at?
I can spin up a 70-120b model, cap my memory bandwidth down to match that machine, and let you use the API if you want; I just don't want another person going through massive disappointment.
invabun@reddit
After seeing your post: I'm already on my second Z13, this time the Z13 KJP. I'll admit the first one was an emotional purchase, but I think being able to run models locally is quite interesting. Even though it's slower, I want to experiment with local models that aren't locked down with various security labels. After all, the original 4090 can't fit much of anything... Any suggestions?
HealthyCommunicat@reddit
The thing is, with a 4090 you can run Qwen 3.6 27b mxfp4 and get really impressive results at really usable speeds. You can't do that on the Z13. People have to realize that without speed, being able to carry 128 lbs on your back doesn't matter if you can only take one step a day. 25 token/s at the start is the bare minimum so that at long context you still get at least 10-15 t/s.
I now specialize in making uber-compressed models for the MLX community, using turboquant as the quantization method instead of just quantizing the cache - I've been able to make a remarkable 55GB mixed-quant MiniMax, and even the most recent DeepSeek 4 Flash at 72GB. There are people like me making custom runtimes and custom models; look around.
Try the recent Ling 2.6 100b, Qwen 3.6 35b, and Mistral 3.5 Med - 30-120b models at smaller quants.
RainierPC@reddit
I don't know what this commenter you're talking with is smoking; gpt-oss-120b at 50+ tok/s is definitely possible on the Strix Halo 128GB. Not what I would call slow. For other models, see here.
TheBeardedDen@reddit
You are not getting double the speed. You either never had the Z13, never had the M4 Max, or user error across the board. Hope no one listened to your outrageously terrible advice and bought the M4 Max.
HealthyCommunicat@reddit
peep my Asus GB10 return too - I have no clue why people would lie about having owned something. I have videos showing the side-by-side comparisons I got to do before returning, too.
__JockY__@reddit
Honestly there’s no “best”, only “least shit”.
With the exception of maybe an M5 Max (and that's a big maybe), they're all slow at inference and much worse at prompt processing, which you feel as the time between submitting your prompt and the LLM emitting its first token at the start of inference.
For tiny contexts you’ll only wait a few seconds between sending a prompt and the first token appearing.
But if you’re using large prompts with a lot of data then you can expect to wait minutes between submitting a prompt and receiving your first token. Larger prompt = slower. You will weep tears of frustration if you’re using large prompts.
Further, the larger the context, the slower inference runs too. So once you've waited your 70 seconds for the prompt to process, inference might start at 20 tokens/sec, but by the time you're 10k tokens deep you'll be down to single-digit inference speeds.
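To put rough numbers on that (the prompt-processing rate here is an illustrative assumption, not a benchmark of any particular machine):

```python
# Time-to-first-token is roughly prompt length / prompt-processing rate.
def time_to_first_token(prompt_tokens, pp_tok_s):
    return prompt_tokens / pp_tok_s  # seconds before the first output token

for prompt in (500, 8_000, 30_000):
    print(prompt, "tokens ->", round(time_to_first_token(prompt, 400), 1), "s")
# 500 tokens   ->  1.2 s  (fine)
# 8000 tokens  -> 20.0 s  (annoying)
# 30000 tokens -> 75.0 s  (tears of frustration)
```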
I know Excel is a blocker for you, but you’ll be wasting your money on toys people won’t use if you buy Z13s.
Bombarding_@reddit (OP)
I appreciate the insight. Do you think I'd be able to get it to produce basic API code? I.e., here's CRM ____'s API documentation, write a PowerQuery script to pull columns x, y, and z into Excel. That's all I'd aim for, and I don't know that it'd take crazy inference speed or context windows to get that done?
__JockY__@reddit
Who knows. At this point you might as well push for a feasibility study: buy one and run tests that simulate real workloads.
If it performs, great! If not…
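A minimal sketch of what such a test could look like, assuming a local OpenAI-compatible server (llama.cpp's llama-server and Ollama both expose one); the endpoint, file name, and model name below are placeholders:

```python
import time
from openai import OpenAI

# Point the client at the local server instead of the real OpenAI API.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

prompt = open("sample_api_docs.txt").read() + \
    "\n\nWrite a PowerQuery (M) script that pulls columns x, y, and z into Excel."

start = time.time()
resp = client.chat.completions.create(
    model="local-model",  # placeholder for whatever the server loaded
    messages=[{"role": "user", "content": prompt}],
)
elapsed = time.time() - start
out = resp.usage.completion_tokens
print(f"{out} tokens in {elapsed:.1f}s -> {out / elapsed:.1f} tok/s")
```

Run it with prompts the size of your real API docs; if the wait times and tok/s feel acceptable, the hardware passes.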
bityard@reddit
I don't get all the other comments in this thread. No, you won't get SOTA performance or capability out of the Strix Halo. Yes, a Mac is faster, but also 2x or more the price. But at the end of the day, this will do real work and runs many medium-sized models just fine. There are lots of threads proving this.
Bombarding_@reddit (OP)
Thank you! I was starting to question whether I was being ridiculous, or if I just had bad info from the start.
pondy12@reddit
The best laptop for local LLMs is the MacBook Pro M5 Max 128GB, it's not even close.
Bombarding_@reddit (OP)
Unfortunately we can't use macOS or Linux, it has to be Windows :(
pondy12@reddit
anything with the AMD Ryzen AI Max+ 395 and 128GB RAM, like the ASUS ProArt GoPro Edition or the HP ZBook Ultra G1a
Bombarding_@reddit (OP)
The ProArt PX13 GoPro Edition seems to be a good contender, but the HP G1a is priced $2k higher than the ProArt PX13 and the Flow Z13, and idk why. The Z13, while I'd prefer something less gamer-y for the company, seems to have the best price-to-performance specs.
pondy12@reddit
yea, but that's a tablet with a keyboard attachment, can't really type on your lap with it
Bombarding_@reddit (OP)
We're usually at desks, but some people travel occasionally. I'm less concerned about the form factor
pondy12@reddit
not sure where you are getting your prices btw.
ebay.com/itm/117070119560
ebay.com/itm/167955520042
ebay.com/itm/198149561044
Bombarding_@reddit (OP)
Damn, they're waaaaay more expensive new on HP's site, those refurbs and open box deals are insane. I stand corrected
riklaunim@reddit
Vibe coding is dangerous ;) The cheapest option to run LLMs locally would be a Mac mini with enough RAM, then probably Strix Halo. Both can run mid-sized LLMs that won't fit on an RTX 5090 ;) but they won't run them quickly.
Even standard laptops can run small/mid-size models if you give them enough RAM. It will be slower than Strix Halo, but it will technically run at a few tokens/s :)
From upcoming tech: Strix Halo 388/392 devices, which may be cheaper, though the 128GB RAM variants will still be crazy. Nvidia N1X/N1 laptops - if the GPU is integrated and shares memory, then it will be similar to Strix Halo, likely better if compute lands around an RTX 5070 mobile (or even a 5060), but expect crazy pricing due to AI hype and component costs (like memory).
Bombarding_@reddit (OP)
I've looked into building or buying a server just for LLMs, but it looked like it'd cost thousands quickly, and we've really only got 3-4 people who would need a device like this.
Can't use macOS, unfortunately.
The thought was: if we were going to get engineers a $1,500-2,000 laptop anyway, maybe we could get them one that can run LLMs locally and cut our subscription costs too.
munkiemagik@reddit
Four people with laptops that capable means you're saying you can budget in the region of $10K. But despite the company spending $10K, all four employees each get a lower-end $2.5K experience. That same $10K poured into a dedicated LLM server would give all four a significantly better $10K experience, concurrently.
Bombarding_@reddit (OP)
The idea was: would it cost ~$500 more to go from a workstation laptop to one that can run local LLMs, thereby cutting costs, getting a better laptop, and extending the time before replacement?
$1,500-2,000/laptop/person + either subscription, cloud-hosted, or self-hosted LLM costs. Compared to:
$2,500/laptop/person, and they all run local LLMs. No cloud, no self-hosting server hiccups, no unexpectedly high API bill. It also came with the idea that they'd last longer before needing to be replaced.
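For what it's worth, the break-even is easy to sketch; every dollar figure below is a placeholder assumption to swap for real quotes:

```python
PEOPLE, MONTHS = 4, 36   # assumed 3-year replacement cycle
plain_laptop = 1_750     # midpoint of the $1,500-2,000 range
subscription = 25        # assumed $/person/month for a coding plan
llm_laptop   = 2_500

option_a = PEOPLE * (plain_laptop + subscription * MONTHS)
option_b = PEOPLE * llm_laptop
print(f"laptops + subscriptions: ${option_a:,}")  # $10,600
print(f"local-LLM laptops:       ${option_b:,}")  # $10,000
```

Under these assumptions the two options land close; the real question is whether a local model is good enough to actually replace the subscription.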
riklaunim@reddit
My friend is trying this as well, but realistically a cloud-hosted model looks like the best option, assuming an open model can get the job done versus a proprietary model over an API.
Vaddieg@reddit
Why do they hide the memory bandwidth? It's critical for LLM inference.