Qwen 3.6 27B Q4_0 MTP GGUF
Posted by Available_Hornet3538@reddit | LocalLLaMA | 22 comments
Not sure if others have updated yet, but I tried the MTP version of llama.cpp. It works pretty well. I have a shitty AMD iGPU with 64 GB of unified memory, and it's pretty fast. I'd say replies come in about as fast as Qwen 3.5 9B Q4_K_M. This is pretty cool.
Gimel135@reddit
Love seeing this. People sleep on iGPUs but 64GB unified is a beast for local LLMs.
Daniel_H212@reddit
Not really, it heavily depends on which iGPU and how the memory is configured. OP says they're getting 8 t/s. Strix Halo and DGX Spark barely get acceptable performance; most systems with a conventional iGPU/shared memory won't be anywhere close.
Honest-Kangaroo-1830@reddit
Different goals, this is just the same argument as people discussing GPUs vs Strix Halo as you pointed out. The point of the large unified memory systems is not to run everything at blazing fast speeds, but to run larger models at slower speeds. That's the whole tradeoff.
No-Refrigerator-1672@reddit
That's the whole point of the argument: running larger models only makes sense for complex tasks, and complex tasks are never short; they involve both long prompts and long multi-turn generations. Even the venerated AI Max, DGX Spark and Mac Ultras fail to deliver prompt processing performance, which turns any complex task into an eternal waiting game. Sure, if you have all the time in the world, go for it; but I refuse to call anything usable if it works slower than me doing the task manually.
Daniel_H212@reddit
I would say they have a point in that being able to run models of a certain size at all is a massive benefit over being able to run small models fast, because some tasks small models literally cannot ever complete adequately. In that situation it's better to let a big model slowly complete the task overnight or something than to not have the capability at all.
Like, do I use Qwen 122B day to day on my Strix Halo? No, cuz it's slow and 35B is pretty damn good. But when I do need it, boy am I glad I have 128 GB of unified memory and can run it in a pinch.
No-Refrigerator-1672@reddit
Well, Strix Halo is like 5x slower than my 2x 3080 20GB setup that cost me 1000 EUR (GPUs only). So the price/performance ratio is pretty much garbage for AI Max (this applies to DGX and Macs just the same). You're hindering your everyday workflows by quite a bit, and straining your wallet, only to run a big model once in a while. Maybe this makes sense if you face 100B-class tasks regularly, but for my job (I'm using AI for physics paper processing, research and lecture preparation) I've seen such tasks very rarely, and since Qwen 3.6 35B came out, never.
Honest-Kangaroo-1830@reddit
We are all glad that your setup works well for you, but the reality is that for the price you spent versus what most people paid for their Strix Halo setups, they could get 96 GB of unified RAM and a 3090 over OCuLink on top of that.
If price to performance is so important, then go buy 20 MI50s and have a blast.
Daniel_H212@reddit
That's fair.
Available_Hornet3538@reddit (OP)
It runs about 8 tokens per second.
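If anyone wants to check their own numbers, here's a rough sketch of how I measure it against llama-server's OpenAI-compatible endpoint; the URL, port, and prompt are just my local setup, so adjust for yours:

```python
# Quick tokens-per-second check against a local llama.cpp server (llama-server).
# Assumes it's listening on localhost:8080 with the OpenAI-compatible API.
import time
import requests

url = "http://localhost:8080/v1/chat/completions"
payload = {
    "messages": [{"role": "user", "content": "Explain MTP decoding in two sentences."}],
    "max_tokens": 256,
}

start = time.time()
resp = requests.post(url, json=payload, timeout=600).json()
elapsed = time.time() - start

# Timing includes prompt processing, so this slightly understates pure generation speed.
generated = resp["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} t/s")
```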
o0genesis0o@reddit
Mate, you buried the most important info. When I read your description, I thought it would run at 15 or 20 t/s generation.
Which iGPU do you have? Is it one of those mini PCs with the 780M iGPU?
AzerbaijanNyan@reddit
Ran some quick Q4_1 MTP tests yesterday on one of those, actually (specs/old test). It varies by task/acceptance rate, but I get around 27 t/s for the MoE and 8-9 t/s for the dense model.
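For anyone wondering why it swings with acceptance rate: the usual speculative-decoding back-of-envelope looks roughly like this. This isn't the exact llama.cpp internals, and the draft length is just a guess on my part:

```python
# Expected tokens committed per verification step with k drafted tokens and
# per-token acceptance probability a (standard speculative-decoding estimate).
def expected_tokens_per_step(k: int, a: float) -> float:
    if a >= 1.0:
        return k + 1.0
    return (1 - a ** (k + 1)) / (1 - a)

base_tps = 9.0  # roughly the dense model's speed without drafting on this box
for a in (0.5, 0.7, 0.9):
    # Assumes verifying k+1 tokens costs about one normal decode step,
    # which is optimistic on a bandwidth-bound iGPU.
    boost = expected_tokens_per_step(2, a)
    print(f"acceptance {a:.1f}: ~{base_tps * boost:.0f} t/s upper bound")
```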
o0genesis0o@reddit
Man, the pp speed is pretty rough at 16k depth. I guess it's not that bad for chatting or simple agents that chat more often than they call tools.
I still haven't been able to run llama.cpp stably on my 780M since kernel 6.18. Fingers crossed kernel 7 will fix this issue for good.
AVX_Instructor@reddit
I have a laptop with an R7 7840HS and RX 780M on Fedora 44.
llama.cpp (Vulkan) works solid as a rock. I'm using Gemma 4 26B / Qwen 3.6 35B and getting around 200 t/s prompt eval and 20 t/s generation.
AzerbaijanNyan@reddit
Yeah, I managed to score a cheap refurbished OCuLink setup and an MI50 before they blew up in price. It makes a world of difference even with such an old card. Running a dense model on the GPU and a MoE on the iGPU for parallel tasks or testing is pretty great. Qwen 3.5 4B also works pretty well on it.
I suspect a properly set up automated GPU orchestrator + iGPU subagent flow, with small tasks that stay within reasonable context/pp speeds, could be quite efficient; something like the rough sketch below.
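Purely illustrative: the ports, roles and prompts here are made up, each role just being a separate llama-server instance exposing the OpenAI-compatible API (dense model on the dGPU, MoE on the iGPU):

```python
# Sketch of an orchestrator/subagent split across two local llama-server instances.
import requests

BACKENDS = {
    "orchestrator": "http://localhost:8080/v1/chat/completions",  # dense model on the MI50
    "subagent": "http://localhost:8081/v1/chat/completions",      # MoE on the iGPU
}

def ask(role: str, prompt: str, max_tokens: int = 512) -> str:
    resp = requests.post(
        BACKENDS[role],
        json={"messages": [{"role": "user", "content": prompt}], "max_tokens": max_tokens},
        timeout=600,
    )
    return resp.json()["choices"][0]["message"]["content"]

# Orchestrator breaks the job down; subagents handle small, short-context tasks
# so prompt processing on the iGPU stays tolerable.
plan = ask("orchestrator", "Split 'summarize these three papers' into independent subtasks, one per line.")
results = [ask("subagent", task) for task in plan.splitlines() if task.strip()]
```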
SavingsWeather1659@reddit
Where can I download it?
AzerbaijanNyan@reddit
I uploaded Q4_1 versions yesterday. They're just for testing since the MTP implementation isn't finalized yet, so they might perform poorly or stop working entirely at some point: 27B and 35B-A3B
SavingsWeather1659@reddit
I just tried RDson/Qwen3.6-27B-MTP-IQ4_KS-GGUF at main and it's so slow, slower than non-MTP.
CooperDK@reddit
35B-A3B will run much, much better. It only has 3B parameters active at a time.
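Back-of-envelope on why, with rough assumed numbers rather than measurements (and ignoring the extra boost MTP gives the dense model):

```python
# On a bandwidth-bound iGPU, each generated token has to stream the active
# weights from memory, so active parameter count sets the ceiling.
bandwidth_gb_s = 60       # assumed effective dual-channel DDR5-5600 bandwidth
bytes_per_param = 0.55    # ~Q4 quant including overhead, rough guess

for name, active_params_b in [("27B dense", 27), ("35B-A3B MoE", 3)]:
    gb_per_token = active_params_b * bytes_per_param
    print(f"{name}: ~{bandwidth_gb_s / gb_per_token:.0f} t/s ceiling")
```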
Honest-Kangaroo-1830@reddit
With 9 tok/sec, I would much sooner just go with the 35B-A3B model instead. 8845HS w/ 64GB DDR5-5600 running a custom MTP Q4 of 35B @ 35 tok/sec. At this point, it's so optimized that it's essentially within spitting distance of the Gemma 4 E2B model I use for camera/motion detection, except it can reliably make 10+ tool calls in a turn and re-reason in between if something doesn't come out cleanly.
pdycnbl@reddit
Which iGPU? Are you using llama.cpp?
Powerful_Evening5495@reddit
Can you state the number of tokens per second, or is it a set number?
mechkbfan@reddit
8 t/s