Running the latest LLMs like Granite-4.0 and Qwen3 fully on ANE (Apple NPU)
Posted by Different-Effect-724@reddit | LocalLLaMA | 11 comments
Last year, our two co-founders were invited by the Apple Data & Machine Learning Innovation (DMLI) team to share our work on on-device multimodal models for local AI agents. One of the questions that came up in that discussion was: Can the latest LLMs actually run end-to-end on the Apple Neural Engine?
After months of experimenting and building, NexaSDK now runs the latest LLMs like Granite-4.0, Qwen3, Gemma3, and Parakeet-v3, fully on ANE (Apple's NPU), powered by the NexaML engine.
For developers building local AI apps on Apple devices, this unlocks low-power, always-on, fast inference across Mac and iPhone (iOS SDK coming very soon).
The video shows inference running directly on the ANE.
https://reddit.com/link/1p0tko5/video/ur014yfw342g1/player
Links in comment.
SkyFeistyLlama8@reddit
Nexa's cookin'!
Nexa now supports Apple ANE, Qualcomm HTP, and AMD NPUs for running smaller models. I've been using Qwen and Granite 4B models on Qualcomm HTP for code fixes and git commit messages, and it rocks.
AlanzhuLy@reddit
Thanks for your support! Let us know if you have any feedback!
DerDave@reddit
Didn't know about you guys. Impressive work! Seeing that you support Apple, Qualcomm, AMD - any love for Intel NPUs?
benja0x40@reddit
OP forgot the links. Here are the NexaML announcement and the GitHub repo for the NexaSDK:
https://nexa.ai/blogs/nexaml
https://github.com/NexaAI/nexa-sdk
Different-Effect-724@reddit (OP)
Thank you! Turns out OP was lowkey flagged, and the shared links weren’t visible to anyone. :(
benja0x40@reddit
The SDK and CLI seem interesting, but the website lacks a comprehensive, quantitative overview of performance in real use cases. It would help to publish a white paper or blog post with a systematic performance evaluation on representative models and hardware, compared against more established inference engines, as well as a detailed technical overview of the possibilities and limitations of running a model on the CPU, GPU, or NPU (e.g. quants, parameter and context sizes, supported architectures and modalities, etc.).
alex_pro777@reddit
It's useless without quantization. I tried to run Qwen3 4B on my M1 with 8GB of unified memory... I didn't realize it would download over 9GB, so of course it didn't fit. I'd rather run a 7-8B model in Q4 GGUF than a 1B model on my NPU. Maybe it's a solution for GPU-rich (sorry, unified-memory-rich) Mac users.
SkyFeistyLlama8@reddit
They should be quantized to int4 formats. I don't think they run as straight BF16 or float32.
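(For rough context, the sizes people are quoting line up with simple weight-size arithmetic. A minimal sketch, weights only, ignoring KV cache, tokenizer files, and runtime overhead:)

```python
# Back-of-the-envelope weight sizes for a 4B-parameter model
# at different precisions. Weights only; KV cache and runtime
# overhead come on top of this.
PARAMS = 4e9  # nominal parameter count of a 4B model like Qwen3-4B

for name, bits in [("float32", 32), ("bf16", 16), ("int8", 8), ("int4", 4)]:
    gib = PARAMS * bits / 8 / 1024**3  # bits -> bytes -> GiB
    print(f"{name:>8}: ~{gib:.1f} GiB")

# Output:
#  float32: ~14.9 GiB
#     bf16: ~7.5 GiB
#     int8: ~3.7 GiB
#     int4: ~1.9 GiB
```

A ~9GB download for a 4B model is consistent with BF16/FP16 weights plus extras, while int4 would bring the weights under 2 GiB.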
jarec707@reddit
I run the smaller Granite models right now on my M5 12 GB iPad using Noema: https://noemaai.com
koushd@reddit
ANE doesn't support quantization below 8-bit, I don't think?
Different-Effect-724@reddit (OP)
Here are all the models now running on the Apple Neural Engine, plus the 2-step Quickstart: https://huggingface.co/collections/NexaAI/apple-neural-engine
Model support request & Repo: https://github.com/NexaAI/nexa-sdk
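(If you'd rather script those two steps than type them by hand, here's a minimal sketch that shells out to the CLI from Python. The `pull`/`infer` subcommands and the placeholder model id are assumptions based on the Quickstart pattern; check the repo README and the model cards in the collection for the exact syntax.)

```python
# Hypothetical sketch of the 2-step Quickstart driven from Python.
# Assumes the NexaSDK CLI ("nexa") is installed and on PATH, and that
# it exposes `pull` and `infer` subcommands; verify against the README.
import subprocess

# Placeholder; substitute a real model id from the HF collection above.
MODEL = "NexaAI/<model-id-from-the-collection>"

# Step 1: download the model.
subprocess.run(["nexa", "pull", MODEL], check=True)

# Step 2: run it (on Apple silicon, the ANE-targeted builds in the
# collection should execute on the Neural Engine).
subprocess.run(["nexa", "infer", MODEL], check=True)
```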