LM Studio running on NPU, finally! (Qualcomm Snapdragon's Copilot+ PC )
Posted by geringonco@reddit | LocalLLaMA | View on Reddit | 42 comments
RealityRox@reddit
I came across AnythingLLM ("The all-in-one AI application for everyone"). It claims to use the NPU on Snapdragon PCs.
geringonco@reddit (OP)
I doubt that. It's powered by Ollama.
RealityRox@reddit
Actually, I tried it and it does seem to use the NPU on the Surface Laptop 7. Task manager shows the NPU being busy during text generation. But sadly, one of the CPU cores shows 100% usage all the time, even when the model is only loaded without any ongoing text generation, and the app consumes "very high" power according to the task manager. So what's the point?
Also, the app is buggy af right now. One of the bugs prevents the NPU from being utilized when the app is not running in admin mode.
Shoddy-Tutor9563@reddit
I don't like LMStudio. They took the best open-source components, like llama.cpp and some Electron web framework, slapped their own frontend on top, made it closed source, and act like they've invented something new.
wphilt@reddit
It would be really cool if they could open-source it.
RealityRox@reddit
When is this coming?
Moist-Cut-502@reddit
Any news on when this will be available in beta or full release?
kintotal@reddit
Qualcomm offers an extensive developer network and robust tools for deploying AI on the new Snapdragon® X Elite chips. However, I found the process daunting. The platform seems particularly suited for running smaller models tailored to specific inference tasks. While there are examples demonstrating how to deploy large language models (LLMs), they often fall short of being practically usable. From a less-than-informed perspective, certain layers of these LLMs could significantly benefit from the Neural Processing Unit (NPU), but this would require redesigning the runtime specifically for Qualcomm's chip architecture.
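For reference, one route Qualcomm documents for the X Elite NPU is ONNX Runtime's QNN execution provider. A minimal sketch, assuming the `onnxruntime-qnn` package is installed on a Snapdragon machine, that the model is already quantized, and that `"model.onnx"` is just a placeholder path (float32 dummy input assumed for illustration):

```python
# Sketch: run an ONNX model on the Snapdragon NPU via ONNX Runtime's QNN EP.
# Assumes onnxruntime-qnn is installed and "model.onnx" is a placeholder
# for a pre-quantized model; not taken from the demo in the post.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=["QNNExecutionProvider"],
    provider_options=[{"backend_path": "QnnHtp.dll"}],  # HTP backend = the NPU
)

# Build a dummy input matching the model's first input (dynamic dims -> 1).
inp = session.get_inputs()[0]
dummy = np.zeros(
    [d if isinstance(d, int) else 1 for d in inp.shape], dtype=np.float32
)

outputs = session.run(None, {inp.name: dummy})
print([o.shape for o in outputs])
```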
sharifmo@reddit
Would love to see a 14b, 32b model in this demo.
FullOf_Bad_Ideas@reddit
Apparently NPUs are often limited in addressable memory to 4 or 8 GB right now; I think that's why he used a small 3B model, which is around 3 GB in INT8.
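A quick back-of-the-envelope for that size estimate: at INT8 a weight takes roughly one byte, so 3B parameters land around 3 GB. The bytes-per-weight figures below are approximations, not exact GGUF sizes:

```python
# Rough model-size estimate: parameters x bytes-per-weight, ignoring the
# small overhead from embeddings and metadata. Bytes-per-weight are approximate.
BYTES_PER_WEIGHT = {"FP16": 2.0, "INT8/Q8_0": 1.06, "Q6_K": 0.82, "Q4_0": 0.56}

params = 3e9  # a 3B-parameter model
for quant, bpw in BYTES_PER_WEIGHT.items():
    gb = params * bpw / 1e9
    print(f"{quant:>10}: ~{gb:.1f} GB")
# INT8 comes out near 3 GB, which is why a 3B model is about the ceiling
# for an NPU that can only address 4 GB.
```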
DataPhreak@reddit
It's not that they're limited on memory; they don't actually use memory. The model is loaded each time it processes, so they're bottlenecked by drive read speed and bandwidth to the NPU. In theory it could run any size model, but tok/s would get drastically worse the larger the model gets.
Shoddy-Tutor9563@reddit
It's llama.cpp under the hood. It has to read the model weights out of the GGUF file into somewhere, either VRAM or RAM. It's no magic.
DataPhreak@reddit
No. Llama.cpp doesn't support NPUs, and a model has to be quantized to INT8 to run on an NPU. They're using something completely different for NPU acceleration.
Short-Sandwich-905@reddit
Any size model? Do you have a source for this crazy claim?
necrogay@reddit
In that case, one could resort to the technologies of the ancients, such as a RAM disk to store the model files on.
AnomalyNexus@reddit
I’d be very surprised if it is limited. Even lower end NPUs like on the 3855 can address all mem
Short-Sandwich-905@reddit
How? Has an NPU with the necessary memory been released for that?
me1000@reddit
NPUs are the future of local inference, but this Snapdragon chip is nowhere near as capable as it needs to be. Baby steps, I guess.
DataPhreak@reddit
There are still applications for models of this size. Since the model isn't 'loaded' into memory, you can host many models at once. This means you could have a small STT model, a TTS model, and an instruct-tuned LLM for function calling/tool use all hosted at once, and they can take turns using the chip. You then use a larger model for your general-purpose inference/text generation steps.
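A sketch of that pattern, with several small models taking turns on a single accelerator. All of the model functions here are hypothetical stand-ins, not calls into any real NPU SDK:

```python
# Illustrative only: small STT, tool-calling LLM, and TTS models sharing one
# accelerator by taking turns, rather than one large model staying resident.
from threading import Lock


class NpuLease:
    """Serializes access so only one model runs on the NPU at a time."""
    def __init__(self):
        self._lock = Lock()

    def run(self, model, *args):
        with self._lock:
            return model(*args)


def speech_to_text(audio: bytes) -> str:   # hypothetical small STT model
    return "turn on the lights"

def pick_tool(text: str) -> str:           # hypothetical instruct LLM for tool use
    return "home.lights.on" if "lights" in text else "noop"

def text_to_speech(text: str) -> bytes:    # hypothetical small TTS model
    return text.encode()


npu = NpuLease()
transcript = npu.run(speech_to_text, b"<audio>")
tool_call = npu.run(pick_tool, transcript)
reply_audio = npu.run(text_to_speech, f"running {tool_call}")
print(transcript, "->", tool_call)
```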
MrPecunius@reddit
Me too.
But is it likely to be faster than using the GPU given that output is memory bandwidth constrained?
stddealer@reddit
Most likely slower actually. But more energy efficient.
C4fud@reddit
Did anyone try it?
awesomeo1989@reddit
It's a smoke-and-mirrors demo; notice that he refreshes the chat after each prompt. Don't ask me how I know, but that's because it only works for one prompt (no chat). 😁
badabimbadabum2@reddit
Does Ryzen 9000 have an NPU?
thisusername_is_mine@reddit
Nice.
DuckyBlender@reddit
Hahah, they didn't even turn off WiFi; airplane mode doesn't turn it off, and you can see network activity later in Task Manager.
clamuu@reddit
In way under a year we'll have models as good as today's frontier models running locally and that's all most people will ever need.
DataPhreak@reddit
I think it will be longer. For current frontier performance, you're going to want ASICs, which are in production but not yet affordable. Maybe we'll see an NPU cluster doing distributed inference with Ring Attention in 3 years? Still, that would be a box you put in your bedroom or closet, with your AI apps connecting to it remotely.
uti24@reddit
Why all the fuss, guys?
Isn't the speed of an LLM in those scenarios limited by memory speed? I mean, it would run (almost) as well on a CPU or iGPU.
nntb@reddit
My Fold 4 has an NPU on its SD 8 Gen 1.
Intelligent-Gift4519@reddit
NPUs were common on phones before PCs I think because they're really useful for phone camera optimization, stuff like image segmentation for exposure compensation and such. They're also used for noise cancellation and voice enhancement.
Own_Interaction7238@reddit
Nice! How much does it cost?
I feel like I will cry...
Price?
Intelligent-Gift4519@reddit
$499.99 on sale at Best Buy
https://www.bestbuy.com/site/asus-vivobook-s-15-15-3k-oled-laptop-copilot-pc-qualcomm-snapdragon-x-plus-16gb-memory-512gb-ssd-neutral-black/6585180.p
bareweb@reddit
SCOTT HANSELMANNNN
Mandelaa@reddit
Most important info from this video:
Llama 3.2 3B (file size 2.64 GB, so roughly a Q6 quant)
16.43 tok/sec • 306 tok • 0.23s to first token
Snapdragon 8 Elite, running only on the NPU.
Saifl@reddit
The video is using the X Elite, which is for laptops. The 8 Elite you mentioned is for smartphones.
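For context, the quoted numbers work out as follows (a rough reading, assuming 16.43 tok/sec is the decode rate after the first token):

```python
# Back-of-the-envelope from the figures quoted above.
tokens = 306
decode_rate = 16.43   # tok/sec
ttft = 0.23           # seconds to first token

decode_time = tokens / decode_rate
total_time = ttft + decode_time
print(f"decode: {decode_time:.1f}s, total: {total_time:.1f}s")
# ~18.6 s of generation for a 306-token reply at ~16 tok/sec.
```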
MerePotato@reddit
zomg
Content-Ad7867@reddit
Since memory bandwidth is the major bottleneck, it doesn't matter whether it runs on the CPU, iGPU, or NPU. Am I correct?
SandboChang@reddit
This is true but the prompt processing might still be faster if their NPU can work faster than their iGPU.
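A rough sanity check of the bandwidth-bound argument above: in the bandwidth-limited regime, decode speed is approximately memory bandwidth divided by the bytes read per generated token (roughly the model size). The bandwidth figure below is an assumed spec, not a measured number from the demo:

```python
# Bandwidth-bound decode estimate: each generated token reads (roughly) the
# whole set of weights once, so tok/s ~= memory bandwidth / model size.
# The ~135 GB/s figure for Snapdragon X Elite is an assumption; real
# throughput is lower due to overheads, so this is only an upper bound.
bandwidth_gb_s = 135.0   # assumed LPDDR5X bandwidth
model_gb = 2.64          # the Llama 3.2 3B Q6 file from the demo

upper_bound = bandwidth_gb_s / model_gb
print(f"theoretical ceiling: ~{upper_bound:.0f} tok/s")
# The measured ~16 tok/s sits well below this ceiling, which hints the demo
# is compute- or runtime-limited rather than purely bandwidth-limited.
```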
geringonco@reddit (OP)
Will Ryzen AI APUs' NPUs be next?
b3081a@reddit
AMD already has a working llama.cpp port based on Xilinx XRT and a qlinear library, but it only supports q4_0 quantization. So it is definitely possible, but their code needs some polishing.
necrogay@reddit
I’ve heard that the performance of the NPU in Snapdragon processors is slower compared to the GPU and CPU, as it is focused on energy efficiency. How noticeable is this energy saving in practice, or does the NPU ultimately lack significant utility?