AMD Strix Halo refresh with 192gb!
Posted by mindwip@reddit | LocalLLaMA | View on Reddit | 142 comments
Looks like the next Strix Halo, the Gorgon Halo 495 Max, will have more than 128GB! I already bought a Strix Halo mini form factor a couple months ago since the 2026 refresh rumors were not interesting. I wasn't planning on getting another until 2027 with the bigger refresh, and linking them together, but I was planning to add an external GPU for running smaller dense models in the meantime. The CPU and GPU rumors pointed to only small improvements, and I had heard nothing about more memory.
But idk, having 320GB of memory would allow running some of these newer huge MoE models... maybe I drop the external GPU idea for now. Of course these are rumors; for now we need to wait.
For those who have not bought one yet, a single 192GB unit would mean running all these recent 122B models at Q8 with fullish context!
JinPing89@reddit
If memory bandwidth is still around 250GB/s, I think the best fit for this machine is MiniMax 2.7, as it only has 10B active parameters.
silentsnake@reddit
The problem is not just memory bandwidth. Compute is also an issue. My Strix Halo is gathering dust because its prefill is simply too slow. There are no FP8 tensor cores, and FP16 TFLOPS are 10x lower than the DGX Spark's. 20 tok/s is fine, but if it takes 1-2 minutes just to prefill after every agentic turn, it's unusable.
MDSExpro@reddit
That's an issue with llama.cpp, not Strix Halo. Use vLLM with prefix caching.
hurdurdur7@reddit
The issue is still with Strix Halo. If you sit down and resume a project that is not currently cached and you need to load 100k tokens to get going again... your grandchildren are probably going to resume from where you left off.
Prompt processing of new context is unbearably slow, at least compared to actual GPUs.
MDSExpro@reddit
You do know you can save cache to disk and never worry about recomputing everything?
hurdurdur7@reddit
And if you switch to a new task in another part of your project that needs another 20 files for context, do you save all those caches to disk independently?
audioen@reddit
Prefix caching is why agentic turns ought to work: they should just continue from the past context. You shouldn't need to reprocess everything.
wllmsaccnt@reddit
I thought I was running into prefill speed issues with my Strix Halo. Turns out OpenCode is slow, and the Qwen3.6 models spew out an endless amount of reasoning tokens.
Running llama-server with its tools enabled showed me what was really happening.
Fit-Produce420@reddit
What models are you using? I like to use different models for different purposes on the Strix Halo; dense models are good sometimes, but MoE models have caught up, particularly those designed for consumer hardware rather than data centers.
colin_colout@reddit
I honestly struggle to find a better model than Minimax M2.7 on my 128mb Strix Halo.
...i just need to accept the types of errors you get with a quantized-to-hell model. qwen-3.5-122b is kinda "meh", and next-coder is impressive and speedy (and my go-to when i want a model with minimal loss), but doesn't do well with complexity (great as a coding subagent though)
guinaifen_enjoyer@reddit
what quants are you using for Minimax M2.7? it seems to be a bit big....
Fit-Produce420@reddit
Only Q1 or Q2 on a single Strix Halo, or some kind of pruned or REAPed model, which is why he mentions it "struggles" with complexity.
kmouratidis@reddit
You misread the comment. It's Qwen coder next that struggles with complexity.
Fit-Produce420@reddit
Nope, I read it correctly.
If he's using MiniMax 2.7 on a single 128GB Max, the largest you can fit is the smallest Q4, and that leaves you with pretty limited context.
To run full context in 128GB total you are down to Q3, and for things like coding you are going to lose performance.
colin_colout@reddit
UD-IQ3_S gets me 131k context. I can also run UD-IQ4_NL with 8_0 kv cache (I don't bother now... Would try again whenever llama.cpp implements some variation of rotoquant).
I run an ultra-minimal Linux distro though, so i only leave a few gigs for OS.
Latter-Foot-7192@reddit
Use a 40% REAP and run at higher precision; size is down to 139B, but quality loss outside of English is close to nonexistent.
colin_colout@reddit
I can try again with newer reaps, but M2.5 reap did poorly on my coding evals compared to the Q2.
mr_zerolith@reddit
I see you haven't tried Step 3.5 Flash
Potential-Leg-639@reddit
Step 3.5 Flash is free via Nvidia NIM and overthinks like crazy... too much for me.
mr_zerolith@reddit
It's weird that people still have this complaint, yet they'll use Qwen 3.6 and GLM ( almost all GLM models overthink ).
This model was badly supported in llama.cpp when it came out. But so are most models.
Potential-Leg-639@reddit
It's not weird, it's true that it overthinks more than any other model I've tried so far. But maybe it's specific to the Nvidia NIM quant. Did you try it there yet?
mr_zerolith@reddit
I just don't notice it thinking more than most newer models.
The results are worth the slightly longer wait ( like with deepseek )
I only use local hosted models. So i didn't know about Nvidia NIM until you mentioned it.
StartupTim@reddit
I must be blind, can you link me the HF for that please?
mr_zerolith@reddit
Here ya go, i believe this is the best quantized version:
https://huggingface.co/bartowski/stepfun-ai_Step-3.5-Flash-GGUF
StartupTim@reddit
Oh also, let me ask you what I asked somebody else, since we might be in the same boat...
Does it just load as normal with Lemonade or are you doing something different?
If you wouldn't mind, could you tell me how you're using it, the command line etc?
MANY THANKS in advance!
Also, are you using something like cline/roocode etc? I use Roo a lot myself as well as custom stuff...
Thanks a ton /u/mr_zerolith (love the name btw, it sounds very familiar for some reason!)
mr_zerolith@reddit
I've got NVIDIA hardware here and i'm running it via LMstudio. No special flags or settings.
My whole dev shop uses this system via Opencode, CLine, maybe other tools.
Zerolith is a high-speed, low-complexity PHP + frontend framework, and I'm supposed to be playing its representative, but I'm currently too excited about local LLMs to stay on topic 😄
StartupTim@reddit
Thanks a ton!
I wonder why I haven't heard more about StepFun on this subreddit? Does it not fare well in comparison to the other self-hosted giants?
mr_zerolith@reddit
The company doesn't seem to have a marketing budget.
Educational_Sun_8813@reddit
i made that quant for strix too, it works quite well: https://www.reddit.com/r/LocalLLaMA/comments/1r0519a/strix_halo_step35flashq4_k_s_imatrix/
beneath_steel_sky@reddit
AesSedai's version is also very good, see PPL/KLD https://huggingface.co/AesSedai/Step-3.5-Flash-GGUF
ubrtnk@reddit
https://huggingface.co/collections/stepfun-ai/step-35-flash
StartupTim@reddit
Huuuge thanks!
I have the Strix Halo too, how are you running that model? Does it just load as normal with Lemonade or are you doing something different?
If you wouldn't mind, could you tell me how you're using it, the command line etc?
MANY THANKS in advance!
Also, are you using something like cline/roocode etc? I use Roo a lot myself as well as custom stuff...
colin_colout@reddit
I can test again. I tried it when it first released and didn't find it too impressive (I can't remember why). Llama.cpp and the quants get bugfix updates, so maybe it's that.
Also keep in mind I think I tend to prefer models that have opus-like personalities, so even if it's extremely capable, I might not use it in my coding agent if it's a GPT distill.
330d@reddit
Have you tried qwen 3.5 397b?
colin_colout@reddit
I might try again, but it hallucinated way more in my testing.
Every quantized bit is precious, it seems. If someone else has tips or a different experience, I'll happily retest.
Kahvana@reddit
The 90s called, they want their MBs back!
Jokes aside, thanks for sharing your findings on the models. Once DeepSeek v4 flash is compatible with llama.cpp, would love to see how that performs.
Zc5Gwu@reddit
Same. I’m squeezing q3_iqxss 128k ctx. I get OOMs all the time though.
shing3232@reddit
it should fit ds4f perfectly
_derpiii_@reddit
Are there any good rules of thumb for calculating how memory bandwidth bottlenecks active parameters?
In this case, is it 250/10B => 25 GB/s required per 1B params?
As in, would that imply that 1TB/s of bandwidth would be capped at 40B-active models?
Zyj@reddit
Using Q6 on 2x Strix Halo, it's slow but good
mindwip@reddit (OP)
Rumor is a slight memory speed upgrade, around 500MT/s, so 8000 vs 8533. Nothing to write home about.
zball_@reddit
Apparently should be DeepSeek v4 flash.
rumblemcskurmish@reddit
This will be interesting to track because I'm not sure of the utility of huge VRAM in these new devices. The only real benefit it offers is bigger models... but bigger models (think 80/120/200B+ parameters) will perform horribly slowly on this level of hardware. What's the point of 256GB of VRAM if you still just run Gemma4 or Qwen 3.6 35B so that you can get 50 t/s?
To be fair, I'm definitely interested in a 128GB model so I can run something in the same performance category as Qwen 3.6-35B with a huge context window. But I can't imagine throwing a giant dense model on this thing and getting 5 tokens/sec.
redoubt515@reddit
MoE
rumblemcskurmish@reddit
It's possible, but I haven't seen large MoE models. Are they out there? I kind of assumed people who are running 200B parameter models have the hardware to run large dense models. Maybe we'll see a new breed of 80/120B models with 3/4/5B active at once in an MoE design.
Double_Cause4609@reddit
There aren't really 200b dense models.
The most recent dense models are Qwen 3.6 27B (great model, tbf), Gemma 4 31B, and...Maybe one of the 32B coding models and I guess...Nemotron 253B (which was actually a pruned Llama 3.1 405B)...?
Every other major release has been MoE once you go past 32B parameters.
tired514@reddit
There are some enormous dense models (llama 3.1 405B and 70B, Mistral 3.5 128B just released a few days ago, and some others) but ya they're few and far between.
It'll be interesting to see if the active parameters start to creep up along with hardware. I bet we'll see some 400B-A70B style models start to drop in the next few years intended for huge context deep research and specialized analysis running on ultra-fast 512gb rigs.
But for general use and coding A10B is pretty damned solid as it is.
Caffdy@reddit
this guy gotta be a bot
tired514@reddit
Wait, who?
Double_Cause4609@reddit
Llama 3.1 is years old (also, I even mentioned the 405B implicitly in my comment). I meant more recent models. IIRC Llama 3 is like two years old now.
I guess Mistral Medium 3.5 128B did just release (I forgot about that one), but that's the single kind-of-interesting over-32B dense model released in the past year and a half. The only other one I can remotely think of is Apertus 70B.
That's it.
And keep in mind my comment was in reference to somebody talking about people running 200B dense models. My response was essentially "run *what* 200B dense models?"
Which I think is fair in this context.
As for MoE models with more active parameters...I'm actually a bit tepid on those. Right now it's kind of okay. Even though we're moving to MoE, we have a lot of models with small active parameter counts, but as soon as those go up people will start needing really specialized hardware to run the models, and they're not going to get by running with experts offloaded to CPU + RAM anymore.
The only way I'd be okay with it from an end-user running the model is if they start doing bigger shared experts (kind of more like Llama 4. Say what you will, but the arch was actually pretty great for running locally).
Tbh, if you read MoE literature, while shared experts have some problems with training frameworks, they actually do capture a lot of the value of dense model active parameter counts (hard reasoning, inductive inference, etc), and the MoE portion can actually be really sparse while still doing what MoE does well (rare sequence learning, factual memorization / recollection, etc). That'd be an arch about as well suited to heterogeneous local execution as anything could be.
tired514@reddit
Oh, sorry, yeah, I didn't mean to disagree! I didn't notice you mentioned Llama 405B... eyes buggin' out :p
I was actually pretty surprised by mistral's latest release.
IGZ0@reddit
The amount of memory doesn't matter as long as the bandwidth is this low and the drivers are as bad as they are.
Waste of RAM.
DarkGhostHunter@reddit
Basically a 395+ with 8 × 24GB LPDDR5X. Same bandwidth. Same 3-year-old GPU arch.
For the two of you who already have a maxed-out 395+, you're not missing more than a 5% perf difference.
Daniel_H212@reddit
Yeah and I can't imagine pushing the compute capabilities of my strix halo to run anything bigger than what fits on it. Heck I'm already settling for Qwen3.5/3.6 35B because the 122B is too slow (and also the 35B is pretty damn good for its size).
tired514@reddit
I'm running a 128gb EVO-X2 and I'd love 192gb ram even at current speeds.
I've got two modes of use - research / realtime chat and coding.
For research mode ("scrape the 'net and summarize changes in FreeCAD 1.1") I usually stick with a high-speed model at Q4 because it doesn't really need that much intelligence to succeed. Qwen3.6-35B-A3B at Q4 works beautifully 95% of the time (unless it's really complex research). Sure, it takes a few minutes, but it's not the end of the world for me personally. Realtime chat is lightning fast.
For coding mode I usually run Qwen3.5-122B-A10B at Q5. It seems to do a much better job than Qwen3.6-27B at Q8, at least in C++. Gets confused way less often in my experience.
With only 10B active parameters it's "fast enough" for what I do. Sure it takes 10-20 minutes to go through a 10k line codebase, but whatever.. I let 'er rip and go read some reddit, lol.
To be able to run at Q6 or Q8 would be great and reduce the error rate (which is amazingly low given the hardware constraints, but could be lower). It'd open up huge 200-300B models as well, which while quantized would have significantly more knowledge and stability with large contexts.
On top of all this, I just ordered a Morefine G2 4090M 16GB eGPU, and my hope is I can keep the non-experts (main inference) and context on the G2 and the experts in unified memory. With 192GB that'd be a seriously impressive and pretty snappy setup. And with two USB4 ports, in theory I could add a second G2 and easily fit a 200B+ A30B model with context at Q6, split effectively between eGPUs and onboard memory.
Would more unified bandwidth be better? Of course lol.. but I'd take 192gb in a heartbeat even on the current platform.
notdba@reddit
The eGPU setup will be severely limited by the data transfer speed, such that you can't benefit much from GPU offload during PP. I got an old thread about this: https://www.reddit.com/r/LocalLLaMA/comments/1o7ewc5/fast_pcie_speed_is_needed_for_good_pp/
tired514@reddit
Damn. I was kinda worried that might be the case. Well, I needed the G2 4090M for 3D scanning anyway (since no 3D scanner supports ROCm / HIP yet, sigh) and possibly some gaming, so no big loss, but still that's annoying.
Assuming it's not slower it still would be a nice bump in total RAM (128gb + 16gb) which would help with 122B-A10B Q6. And maybe some efficiency magic will happen at some point. :p
notdba@reddit
Compared to using the strix halo alone, my 3090 eGPU does provide a nice bump to both PP and TG, especially on MoE models such as Qwen3.5/3.6 that have more always-activated parameters than sparsely activated parameters.
However, a gaming rig with PCIe 5.0 x16 will be able to deliver a very usable PP of 400~800 t/s on large MoE models, and there is no way to get close to that with a Strix Halo.
MisticRain69@reddit
When I load a model like Qwen 3.6 27B Q8 and use tensor split to fill up my 3090 Ti, putting 23GB of the weights on it and the rest on the 395+ iGPU, I get 700-600 t/s PP and 16 t/s TG. With MiniMax M2.7 Q3.5 I get a starting PP of 168 t/s and 29-27 t/s TG, which drops to around 100-80 t/s PP and 13-15 t/s TG once context is up to 70k. Gets real slow lol.
tired514@reddit
That's the problem with this stupid hobby. There's always something faster and better and all it takes is more money, lol.
It is kinda nuts to be faced with the proposition "hey, build a new rig and your digital coworker gets even smarter and faster." Like, imagine trying to convince yourself this would be our reality today... say, a decade ago. 3D printing, LLMs, stable diffusion, custom mRNA pharmaceuticals, 26% efficiency solar, electric personal aircraft (Jetson-1), jetpacks... like, how'd the future rush up so fast on us?
stuckinmotion@reddit
Yeah already if I'm maxing out my 128gb Strix Halo I'm not loving life in terms of t/s and pp speed..
cu-pa@reddit
Wait until the new CPU instructions are included in those newer CPUs; AMD and Intel are collaborating to make CPUs capable enough to run inference engines on consumer-grade hardware.
mindwip@reddit (OP)
I know about the better 2027 chips, but could you link or give me something more to search for about what you're talking about?
cu-pa@reddit
here https://wccftech.com/amd-intel-ace-partnership-boosts-ai-performance-standard-matrix-acceleration-architecture-for-x86/
mindwip@reddit (OP)
Thanks!
Purple-Programmer-7@reddit
Cost?
mindwip@reddit (OP)
Unknown right now.
riklaunim@reddit
It will end up in $3000+ devices that will still be somewhat slow for such large models, while being RTX 4060 mobile-class for gaming. We are getting to a point in time where waiting for Medusa Halo as a true next-gen chip could be better.
Also Nvidia N1X mobile chips are lurking as well.
mycall@reddit
Slow is better than no.
fooo12gh@reddit
Lol, what $3k are you talking about? Right now the Beelink costs $4.4k and the GMKtec $3k for a Ryzen 395 + 128GB RAM.
I hope it costs around $4k, but realistically I am not that optimistic.
Zyj@reddit
It's like a 4070 mobile for gaming
rpkarma@reddit
From what it sounds like, N1X is just GB10 again though
mindwip@reddit (OP)
Yes, agreed. Medusa Halo should use LPDDR6 and hopefully be 2x the speed or more for memory. That's what I was originally holding out for as an upgrade!
reto-wyss@reddit
Is more VRAM useless if compute is around the same?
No:
The easy answer is, you can run a larger model with a similar number of active parameters -> potentially better output tokens.
But more importantly, if you have more "spare" VRAM and you are not compute-bound, you can potentially get much higher throughput on concurrent requests.
Freonr2@reddit
You can't have your cake and eat it too here. A10B with 2 concurrent requests can become ~A19B-A20B, because each concurrent request routes to different experts.
starkruzr@reddit
yeah, but you're really gonna need the higher memory bandwidth that will come with Medusa.
shing3232@reddit
192GB is enough for the full version of DS4F
HlddenDreck@reddit
Hm, the only thing which would be interesting to me is something with at least 512GB memory like those Mac Studio devices. Being able to run something like GLM-5.1 with at least 4-bit quant locally would be a gamechanger to me.
LagOps91@reddit
192gb is still a bit too low for me to consider it as an option. 256gb is the minimum and even that isn't enough for the current monster models...
ImportancePitiful795@reddit
Imho, hold the money for Medusa Halo in 2027.
24-core Zen 6 with ACE (Intel AMX on steroids), a 48CU iGPU, and 6-channel LPDDR6 RAM. So around 700GB/s of bandwidth.
oxygen_addiction@reddit
~460-690 GB/s
ImportancePitiful795@reddit
The absolute minimum for LPDDR6 on the 384-bit bus the Medusa Halo has is 513.6GB/s (10.4Gbps modules). The Snapdragon models coming out this year have the most common mid-range 12.8Gbps (614GB/s).
So it's not "hyping".
More_Feature8687@reddit
We don't have confirmation of a 384-bit bus. It could still be only 256-bit.
No_Mango7658@reddit
Not very useful at current memory bandwidth speeds, unless we get some really big MoEs lol. Maybe Qwen 397B-A17B wouldn't be torture to use.
mindwip@reddit (OP)
Yep I can't wait!
My only hope is that in 2026 AMD releases a nice 48GB to 96GB card. But I doubt it...
ImportancePitiful795@reddit
They do not have a bigger GPU than the gfx1201 (9070XT/9700) right now, while GDDR6 modules cap at 2GB.
AustinM731@reddit
I mean you can buy a MI210 with 64GB of HBM for like $4k on eBay. But I'm not sure that would be the best use of $4k.
pmttyji@reddit
We need better unified devices for Dense models.
edsonmedina@reddit
More memory will be useless if the memory bandwidth stays the same. You'll be able to run larger models but they'll be very slow
jacek2023@reddit
Not really true. If the number of active parameters stays the same, then with more memory you can run a bigger MoE at the same speed. So, for example, a 200B-A10B will run like a 100B-A10B.
StardockEngineer@reddit
On a single node it will be sub-30 tok/s with A10B. Definitely on the line of what people will accept for token generation, and if PP doesn't improve it'll feel even slower.
TechExpert2910@reddit
prompt processing is honestly the biggest limitation when you're trying to use it for agentic coding or longer conversations.
it's so awful to have to wait ~30s each turn for inference to even start.
it's not a problem limited to strix halo though; even pre-m5 apple silicon has it
edsonmedina@reddit
Yeah. The max speed I can get on a Strix Halo with an A10B is between 22 t/s at Q4 and 18 at Q6. Far from ideal.
edsonmedina@reddit
Fair point. Have you seen models like that?
jacek2023@reddit
MiniMax was hyped a lot here
petuman@reddit
Minimax M2.x is 230B-A10B
mindwip@reddit (OP)
No, there are plenty of models with under 20B active params. No, it's not an Nvidia 5090, but I am running models very happily on my current Strix Halo.
edsonmedina@reddit
Can you name which model you're using at A20B and what speed you get? At which quant?
mindwip@reddit (OP)
Qwen 3.6 35B Q8, Qwen Next 80B Q8, and Qwen 3.5 122B Q6 I think, I'd need to look.
I don't have speeds, but they've been posted by others. I'm not using anything at 20B active right now; as my comment said, below 20B, not 20B.
edsonmedina@reddit
Yes, but 20B is what I'm disputing. I also have a Strix Halo. The 122B model will give you 18 t/s at Q6, and that's an A10B. An A20B would be awfully slow.
tamerlanOne@reddit
If they don't substantially improve the memory bandwidth, it will only be a small improvement with limited benefits. We need a bandwidth increase of at least 30% to get an acceptable experience with medium dense LLMs and/or medium-large MoEs.
Or we wait for the new class of diffusion models like https://github.com/inclusionAI/LLaDA2.X 😉
Only_Situation_4713@reddit
Still underwhelming. I recently got two DGX Sparks to replace my 13x 3090 setup and have been very happy. Being able to plug them into each other and run tensor parallel with vLLM has been great.
AMD needs a ConnectX equivalent to be useful for a homelab.
CATLLM@reddit
13 3090???
Only_Situation_4713@reddit
Yes
CATLLM@reddit
That's awesome. Did you sell the 13x 3090s?
I have 2x dgx spark clustered as well
Only_Situation_4713@reddit
Not yet. Just finished setting up deepseek v4 flash on them. Still needs more testing but I’ll probably sell them soon.
FullOf_Bad_Ideas@reddit
How did you get Deepseek v4 flash to work on 13 3090s? I'd like to make it work on 8x 3090 ti's but I haven't looked into software support much yet.
Only_Situation_4713@reddit
They don’t. Will be a while if ever. DeepGEMM will not be supporting the new deepseek architecture and vLLM will not be supporting it going forward for ampere. Llama cpp will be supporting but I don’t use it
makakouye@reddit
If any are FE cards I'd be happy to snag some off you.
Seemed silly to buy one at release but it now seems like I should have bought four.
fallingdowndizzyvr@reddit
Eh.... I don't know. 128GB Strix Halo was all fun and games because it was only $1800. But with current memory prices, this is probably going to be a $4000 machine. Even with 192GB, I don't think it's worth $4000.
notdba@reddit
A second hand Epyc 9004 rig with 192GB of DDR5 costs about $6000, with faster CPU, more memory bandwidth, and many more PCIe lanes.
There is no good choice these days..
segmond@reddit
Only a 10% lift in compute. I'd like to see it get more PCIe lanes so the PCs can have 1 or 2 PCIe slots for external GPUs.
notdba@reddit
Indeed. 8-channel DDR5 and a PCIe 5.0 x16 slot would be a game changer.
Highwaytothebeach@reddit
At least they are FINALLY unlocking their CPUs to handle more RAM, as simple Moore's law allowed much earlier.
If they had done it a few years ago, the world would probably be a more honest place with fewer scams.
Both DDR5 and DDR6 have all their specifications in place, so let's hope more RAM producers enter the market, as reasonably priced 1TB-RAM personal PCs are historically significantly overdue.
mindwip@reddit (OP)
I have not even dreamed of retail 1TB systems; I am hoping we get 512GB in the 2027 version. Dreaming of 512GB, but idk, let's see how it goes...
fallingdowndizzyvr@reddit
They aren't unlocking anything. They are just using higher density RAM chips.
CalligrapherFar7833@reddit
They didn't unlock shit, they always supported more RAM. To get that bandwidth they need to solder it close to the CPU. The only thing that moved is the per-chip RAM size.
This_Maintenance_834@reddit
192GB is on the edge of being able to run DeepSeek-v4-Flash.
segmond@reddit
You can run DeepSeek V4 Flash with 96GB of VRAM.
This_Maintenance_834@reddit
People keep saying quality degrades fast below Q4, so I wasn't leaning toward quantizing that far.
segmond@reddit
Not all the layers and tensors are reduced to Q2; it's selective. It's the dynamic quant approach that Unsloth made popular: some layers/tensors are kept at F32, F16, Q8, etc.
FullOf_Bad_Ideas@reddit
I think it's going to work great with a dedicated GPU like 5090 that could keep attention and kv cache of big models like Qwen 3.5 397B on fast VRAM.
I think we'll be seeing more ultra-sparse MoEs in 2026, as kv cache size issues are largely solved by models like MiMo V2.5, DeepSeek V4 Flash, and Qwen 3.5. KV cache growth was a big issue preventing MoEs from being fast on hardware like Strix Halo at large context sizes.
phido3000@reddit
More RAM is always a welcome offering, but this is likely going to be very expensive, and with no improvement in bandwidth or connectivity.
Even then, I'm not sure 192GB gives you access to any awesome models to run... DeepSeek Flash would be a nice architecture/size combo, though.
It would be nice if these were 10,000MT/s memory modules to give a 20% bump.
Or if it came with a CXL-type arrangement where you could put in DDR4/DDR5 modules and have a second bank of slower but much higher capacity memory. If it were 96GB of LPDDR5X-10,000 plus two channels of SODIMM DDR5-5600 at 192GB, now you'd be talking.
Maybe with the refresh there is a specific MXFP4 instruction. That could be a game changer, particularly with models built around FP4, like DeepSeek.
tired514@reddit
192GB would open up a bunch of 200-300B MoE models without any significant performance cost (like MiniMax 2.7 220B at Q4, with tons of room for context and the OS). It's still only 10B active, so at Q4 you're likely still pushing 30+ tokens/s, which is pretty snappy.
10,000MT/s would certainly be awesome, though. 😄
Engineering_Acq@reddit
I'm glad I returned my ROG Flow Z13.
xlltt@reddit
I'm buying that day 1.
MentalStatusCode410@reddit
HBM or useless.
ziptofaf@reddit
AMD does sell 192GB HBM3 chips if you want one, but they cost a tiny bit more than a Strix Halo (MI300X, around $20-30k per chip).
Not that I disagree with the general sentiment; 256GB/s of bandwidth is a huge limiter on these chips, and frankly, given how much this 192GB variant might cost, I would probably just buy an M5 Max laptop (128GB @ 614GB/s, so you can actually run 15-20B active parameters at usable speeds).
MentalStatusCode410@reddit
My point being that the value proposition is likely going to be non-existent due to the lack of native FP4 support and the stupidly low bandwidth.
Even moving to 64GB/128GB of HBM2e would be more practical for the target market.
shuozhe@reddit
Hmm, gonna be an interesting Q2/Q3. Nvidia N1, M5 Ultra Mac Studio, and now this. Wondering if any will be available for consumers...
Any advantage of getting a mini PC vs a laptop on the 395? Performance looked pretty similar, right?
geldonyetich@reddit
As impressed as I have been with the AMD Max 395+, I think I might end up building a new PC with this chip... if Ramogeddon is over by then.
RegularRecipe6175@reddit
Interesting. It currently requires a cluster of 2 to run MiniMax at a good quant (e.g., Q5-Q6) with decent context. All things being equal, one box is better than two. I can only imagine the cost at current DRAM prices.
Fit-Produce420@reddit
More bandwidth when?
BangkokPadang@reddit
When DDR6 is supported. It has a max throughput of 17,600 MT/s per channel, which works out to about 140GB/s per channel, so 560GB/s for a 4-channel setup. That's pretty zippy, but when GDDR7 exists, it's hard not to just wish we could have a system with 256GB of unified memory at _those_ speeds.
tarruda@reddit
Apparently Gorgon halo will have a slight memory bandwidth increase: https://www.reddit.com/r/LocalLLaMA/comments/1swiylm/comparison_of_upcoming_x86_unified_memory_systems/
I think I will skip it until they have a 512GB option with memory bandwidth in the 800GB/s range; in other words, when they reach the capabilities of the current-gen Mac Studio M3 Ultra.
Technical-Earth-3254@reddit
If it can run DS V4 Flash with at least 15 tps, I'm down
Terminator857@reddit
Previous discussion on this topic: https://www.reddit.com/r/LocalLLaMA/comments/1swiylm/comparison_of_upcoming_x86_unified_memory_systems/
misha1350@reddit
Incredible news for 2028
Looz-Ashae@reddit
Demogorgon hehe