Running the latest LLMs like Granite-4.0 and Qwen3 fully on ANE (Apple NPU)
Posted by Different-Effect-724@reddit | LocalLLaMA | 11 comments
Last year, our two co-founders were invited by the Apple Data & Machine Learning Innovation (DMLI) team to share our work on on-device multimodal models for local AI agents. One of the questions that came up in that discussion was: Can the latest LLMs actually run end-to-end on the Apple Neural Engine?
After months of experimenting and building, NexaSDK now runs the latest LLMs like Granite-4.0, Qwen3, Gemma3, and Parakeet-v3, fully on ANE (Apple's NPU), powered by the NexaML engine.
For developers building local AI apps on Apple devices, this unlocks low-power, always-on, fast inference across Mac and iPhone (iOS SDK coming very soon).
The video shows inference running directly on the ANE.
https://reddit.com/link/1p0tko5/video/ur014yfw342g1/player
Links in comment.
SkyFeistyLlama8@reddit
Nexa's cookin'!
Nexa now supports Apple ANE, Qualcomm HTP, and AMD NPUs for running smaller models. I've been using Qwen and Granite 4B models on Qualcomm HTP for code fixes and git commit messages, and it rocks.
AlanzhuLy@reddit
Thanks for your support! Let us know if you have any feedback!
DerDave@reddit
Didn't know about you guys. Impressive work! Seeing that you support Apple, Qualcomm, AMD - any love for Intel NPUs?
benja0x40@reddit
OP forgot the links. Here are the NexaML announcement and the GitHub repo for the NexaSDK:
https://nexa.ai/blogs/nexaml
https://github.com/NexaAI/nexa-sdk
Different-Effect-724@reddit (OP)
Thank you! Turns out OP was lowkey flagged, and the shared links weren’t visible to anyone. :(
benja0x40@reddit
The SDK and CLI seem interesting, but the website lacks a comprehensive, quantitative overview of performance in real use cases. It would help to publish a white paper or blog post with a systematic performance evaluation on representative models and hardware, compared against more established inference engines, as well as a detailed technical overview of the possibilities and limitations of running a model on the CPU, GPU, or NPU (e.g. quants, parameter and context sizes, supported architectures and modalities, etc.).
alex_pro777@reddit
It's useless without quantization. I tried to run Qwen3 4B on my M1 with 8GB of unified memory... I didn't realize it would download over 9GB, so of course it didn't fit. I'd rather run a 7-8B model in Q4 GGUF than a 1B model on my NPU. Maybe it's a solution for GPU-rich (sorry, unified-memory-rich) Mac users.
SkyFeistyLlama8@reddit
They should be quantized to int4 formats. I don't think they run as straight BF16 or float32.
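(For rough context, the sizes people are quoting line up with simple weight-size arithmetic. A minimal sketch, weights only, ignoring KV cache, tokenizer files, and runtime overhead:)

```python
# Back-of-the-envelope weight sizes for a 4B-parameter model
# at different precisions. Weights only; KV cache and runtime
# overhead come on top of this.
PARAMS = 4e9  # nominal parameter count of a 4B model like Qwen3-4B

for name, bits in [("float32", 32), ("bf16", 16), ("int8", 8), ("int4", 4)]:
    gib = PARAMS * bits / 8 / 1024**3  # bits -> bytes -> GiB
    print(f"{name:>8}: ~{gib:.1f} GiB")

# Output:
#  float32: ~14.9 GiB
#     bf16: ~7.5 GiB
#     int8: ~3.7 GiB
#     int4: ~1.9 GiB
```

A ~9GB download for a 4B model is consistent with BF16/FP16 weights plus extras, while int4 would bring the weights under 2 GiB.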
jarec707@reddit
I run the smaller Granite models right now on my M5 12 GB iPad using Noema: https://noemaai.com
koushd@reddit
ANE doesn't support quantization below 8-bit, I don't think?
Different-Effect-724@reddit (OP)
Here are all the models now running on the Apple Neural Engine, plus the 2-step Quickstart: https://huggingface.co/collections/NexaAI/apple-neural-engine
Model support request & Repo: https://github.com/NexaAI/nexa-sdk
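(If you'd rather script those two steps than type them by hand, here's a minimal sketch that shells out to the CLI from Python. The `pull`/`infer` subcommands and the placeholder model id are assumptions based on the Quickstart pattern; check the repo README and the model cards in the collection for the exact syntax.)

```python
# Hypothetical sketch of the 2-step Quickstart driven from Python.
# Assumes the NexaSDK CLI ("nexa") is installed and on PATH, and that
# it exposes `pull` and `infer` subcommands; verify against the README.
import subprocess

# Placeholder; substitute a real model id from the HF collection above.
MODEL = "NexaAI/<model-id-from-the-collection>"

# Step 1: download the model.
subprocess.run(["nexa", "pull", MODEL], check=True)

# Step 2: run it (on Apple silicon, the ANE-targeted builds in the
# collection should execute on the Neural Engine).
subprocess.run(["nexa", "infer", MODEL], check=True)
```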