LM Studio running on NPU, finally! (Qualcomm Snapdragon's Copilot+ PC )
Posted by geringonco@reddit | LocalLLaMA | View on Reddit | 42 comments
RealityRox@reddit
I came across AnythingLLM ("The all-in-one AI application for everyone"). It claims to use the NPU on Snapdragon PCs.
geringonco@reddit (OP)
I doubt that. It's powered by Ollama.
RealityRox@reddit
Actually, I tried it and it does seem to use the NPU on the Surface Laptop 7. Task manager shows the NPU being busy during text generation. But sadly, one of the CPU cores shows 100% usage all the time, even when the model is only loaded without any ongoing text generation, and the app consumes "very high" power according to the task manager. So what's the point?
Also, the app is buggy af right now. One of the bugs prevents the NPU from being utilized when the app is not running in admin mode.
Shoddy-Tutor9563@reddit
I don't like LMStudio. They took the best open-source components, like llama.cpp and some Electron web framework, slapped their own frontend on top, made it closed source, and act like they've invented something new.
wphilt@reddit
It would be really cool if they could open-source it.
RealityRox@reddit
When is this coming?
Moist-Cut-502@reddit
Any news on when this will be available in beta or full release?
kintotal@reddit
Qualcomm offers an extensive developer network and robust tools for deploying AI on the new Snapdragon® X Elite chips. However, I found the process daunting. The platform seems particularly suited for running smaller models tailored to specific inference tasks. While there are examples demonstrating how to deploy large language models (LLMs), they often fall short of being practically usable. From a less-than-informed perspective, certain layers of these LLMs could significantly benefit from the Neural Processing Unit (NPU), but this would require redesigning the runtime specifically for Qualcomm's chip architecture.
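For reference, one route Qualcomm documents for the X Elite NPU is ONNX Runtime's QNN execution provider. A minimal sketch, assuming the `onnxruntime-qnn` package is installed on a Snapdragon machine, that the model is already quantized, and that `"model.onnx"` is just a placeholder path (float32 dummy input assumed for illustration):

```python
# Sketch: run an ONNX model on the Snapdragon NPU via ONNX Runtime's QNN EP.
# Assumes onnxruntime-qnn is installed and "model.onnx" is a placeholder
# for a pre-quantized model; not taken from the demo in the post.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=["QNNExecutionProvider"],
    provider_options=[{"backend_path": "QnnHtp.dll"}],  # HTP backend = the NPU
)

# Build a dummy input matching the model's first input (dynamic dims -> 1).
inp = session.get_inputs()[0]
dummy = np.zeros(
    [d if isinstance(d, int) else 1 for d in inp.shape], dtype=np.float32
)

outputs = session.run(None, {inp.name: dummy})
print([o.shape for o in outputs])
```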
sharifmo@reddit
Would love to see a 14b, 32b model in this demo.
FullOf_Bad_Ideas@reddit
Apparently NPUs are often limited in addressable memory to 4 or 8 GB right now; I think that's why he used a small 3B model, which is around 3 GB in INT8.
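A quick back-of-the-envelope for that size estimate: at INT8 a weight takes roughly one byte, so 3B parameters land around 3 GB. The bytes-per-weight figures below are approximations, not exact GGUF sizes:

```python
# Rough model-size estimate: parameters x bytes-per-weight, ignoring the
# small overhead from embeddings and metadata. Bytes-per-weight are approximate.
BYTES_PER_WEIGHT = {"FP16": 2.0, "INT8/Q8_0": 1.06, "Q6_K": 0.82, "Q4_0": 0.56}

params = 3e9  # a 3B-parameter model
for quant, bpw in BYTES_PER_WEIGHT.items():
    gb = params * bpw / 1e9
    print(f"{quant:>10}: ~{gb:.1f} GB")
# INT8 comes out near 3 GB, which is why a 3B model is about the ceiling
# for an NPU that can only address 4 GB.
```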
DataPhreak@reddit
It's not that they're limited on memory; they don't actually use memory. The model is loaded each time it processes, so they're bottlenecked by drive read speed and bandwidth to the NPU. In theory it could run any size model, but tok/s would get drastically worse the larger the model gets.
Shoddy-Tutor9563@reddit
It's llama.cpp under the hood. It has to read the model weights out of the GGUF file into somewhere, either VRAM or RAM. It's no magic.
DataPhreak@reddit
No. Llama.cpp doesn't support NPUs, and a model has to be quantized to INT8 to run on an NPU. They're using something completely different for NPU acceleration.
Short-Sandwich-905@reddit
Any size model? Do you have a source for this crazy claim?
necrogay@reddit
In that case, one could resort to the technologies of the ancients, such as a RAM disk to store the model files on.
AnomalyNexus@reddit
I’d be very surprised if it is limited. Even lower end NPUs like on the 3855 can address all mem
Short-Sandwich-905@reddit
How? Has an NPU with the necessary memory been released for that?
me1000@reddit
NPUs are the future of local inference, but this Snapdragon chip is nowhere near as capable as it needs to be. Baby steps, I guess.
DataPhreak@reddit
There are still applications for models of this size. Since the model isn't 'loaded' into memory, you can host many models at once. This means you could have a small STT model, a TTS model, and an instruct-tuned LLM for function calling/tool use all hosted at once, and they can take turns using the chip. You then use a larger model for your general-purpose inference/text generation steps.
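A sketch of that pattern, with several small models taking turns on a single accelerator. All of the model functions here are hypothetical stand-ins, not calls into any real NPU SDK:

```python
# Illustrative only: small STT, tool-calling LLM, and TTS models sharing one
# accelerator by taking turns, rather than one large model staying resident.
from threading import Lock


class NpuLease:
    """Serializes access so only one model runs on the NPU at a time."""
    def __init__(self):
        self._lock = Lock()

    def run(self, model, *args):
        with self._lock:
            return model(*args)


def speech_to_text(audio: bytes) -> str:   # hypothetical small STT model
    return "turn on the lights"

def pick_tool(text: str) -> str:           # hypothetical instruct LLM for tool use
    return "home.lights.on" if "lights" in text else "noop"

def text_to_speech(text: str) -> bytes:    # hypothetical small TTS model
    return text.encode()


npu = NpuLease()
transcript = npu.run(speech_to_text, b"<audio>")
tool_call = npu.run(pick_tool, transcript)
reply_audio = npu.run(text_to_speech, f"running {tool_call}")
print(transcript, "->", tool_call)
```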
MrPecunius@reddit
Me too.
But is it likely to be faster than using the GPU given that output is memory bandwidth constrained?
stddealer@reddit
Most likely slower actually. But more energy efficient.
C4fud@reddit
Did anyone try it?
awesomeo1989@reddit
It's a smoke-and-mirrors demo; notice that he refreshes the chat after each prompt. Don't ask me how I know, but that's because it only works for one prompt (no chat). 😁
badabimbadabum2@reddit
Does Ryzen 9000 have an NPU?
thisusername_is_mine@reddit
Nice.
DuckyBlender@reddit
Hahah, they didn't even turn off WiFi; airplane mode doesn't turn it off, and you can see network activity later in Task Manager.
clamuu@reddit
In way under a year we'll have models as good as today's frontier models running locally and that's all most people will ever need.
DataPhreak@reddit
I think it will be longer. For current frontier performance, you're going to want ASICs, which are in production but not yet affordable. Maybe we'll see an NPU cluster doing distributed inference with Ring Attention in 3 years? Still, that would be a box you put in your bedroom or closet, with your AI apps connecting to it remotely.
uti24@reddit
Why all the fuss, guys?
Isn't the speed of an LLM in those scenarios limited by memory speed? I mean, it would run (almost) as well on a CPU or iGPU.
nntb@reddit
My Fold 4 has an NPU on its SD 8 Gen 1.
Intelligent-Gift4519@reddit
NPUs were common on phones before PCs I think because they're really useful for phone camera optimization, stuff like image segmentation for exposure compensation and such. They're also used for noise cancellation and voice enhancement.
Own_Interaction7238@reddit
Nice! How much does it cost?
I feel like I will cry...
Price?
Intelligent-Gift4519@reddit
$499.99 on sale at Best Buy
https://www.bestbuy.com/site/asus-vivobook-s-15-15-3k-oled-laptop-copilot-pc-qualcomm-snapdragon-x-plus-16gb-memory-512gb-ssd-neutral-black/6585180.p
bareweb@reddit
SCOTT HANSELMANNNN
Mandelaa@reddit
Most important info from this video:
Llama 3.2 3B (file size 2.64 GB, so roughly a Q6 quant)
16.43 tok/sec • 306 tok • 0.23s to first token
Snapdragon 8 Elite, running only on the NPU.
Saifl@reddit
The video is using the X Elite, which is for laptops. The 8 Elite you mentioned is for smartphones.
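For context, the quoted numbers work out as follows (a rough reading, assuming 16.43 tok/sec is the decode rate after the first token):

```python
# Back-of-the-envelope from the figures quoted above.
tokens = 306
decode_rate = 16.43   # tok/sec
ttft = 0.23           # seconds to first token

decode_time = tokens / decode_rate
total_time = ttft + decode_time
print(f"decode: {decode_time:.1f}s, total: {total_time:.1f}s")
# ~18.6 s of generation for a 306-token reply at ~16 tok/sec.
```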
MerePotato@reddit
zomg
Content-Ad7867@reddit
Since memory bandwidth is the major bottleneck, it doesn't matter whether it runs on the CPU, iGPU, or NPU. Am I correct?
SandboChang@reddit
This is true but the prompt processing might still be faster if their NPU can work faster than their iGPU.
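A rough sanity check of the bandwidth-bound argument above: in the bandwidth-limited regime, decode speed is approximately memory bandwidth divided by the bytes read per generated token (roughly the model size). The bandwidth figure below is an assumed spec, not a measured number from the demo:

```python
# Bandwidth-bound decode estimate: each generated token reads (roughly) the
# whole set of weights once, so tok/s ~= memory bandwidth / model size.
# The ~135 GB/s figure for Snapdragon X Elite is an assumption; real
# throughput is lower due to overheads, so this is only an upper bound.
bandwidth_gb_s = 135.0   # assumed LPDDR5X bandwidth
model_gb = 2.64          # the Llama 3.2 3B Q6 file from the demo

upper_bound = bandwidth_gb_s / model_gb
print(f"theoretical ceiling: ~{upper_bound:.0f} tok/s")
# The measured ~16 tok/s sits well below this ceiling, which hints the demo
# is compute- or runtime-limited rather than purely bandwidth-limited.
```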
geringonco@reddit (OP)
Will Ryzen AI APUs' NPUs be next?
b3081a@reddit
AMD already has a working llama.cpp port based on Xilinx XRT and a qlinear library, but it only supports q4_0 quantization. So it is definitely possible, but their code needs some polishing.
necrogay@reddit
I’ve heard that the performance of the NPU in Snapdragon processors is slower compared to the GPU and CPU, as it is focused on energy efficiency. How noticeable is this energy saving in practice, or does the NPU ultimately lack significant utility?