Qwen3 for Apple Neural Engine
Posted by Competitive-Bake4602@reddit | LocalLLaMA | 33 comments
We just dropped ANEMLL 0.3.3 alpha with Qwen3 support for Apple's Neural Engine
https://github.com/Anemll/Anemll
Start to support open source! Cheers, Anemll
bwjxjelsbd@reddit
Yo, this is what I've been waiting for as a Mac user!
Do you see performance jump in using ANE over GPU? How about power consumption?
SidneyFong@reddit
I mean, the project is super interesting, and I have no standing to complain, but isn't having a Python script calling into a bash shell, inlining a Python script to find the model path, and then calling another Python script that invokes xcrun using subprocess a bit much...?
dessatel@reddit
LOL, guilty. A lot of stuff needs refactoring
Competitive-Bake4602@reddit (OP)
M4 Pro has 2x faster memory access for the ANE vs M1/M2, and slightly faster than M3 Pro/Ultra, but not as fast as the GPU. M4 also adds int8/int4 compute, but we haven't included it yet. Besides energy, it has the potential to be faster on prefill for iOS and MacBook Airs with bigger docs.
Waterbottles_solve@reddit
We are trying to get a ~70B model working at our Fortune 20 company, and we've found it's entirely useless to use our Macs.
I wasn't surprised, but the disappointment was real among the department.
Now we are looking at getting 2x A6000s.
Careless_Garlic1438@reddit
Look at WebAI… they have an inference setup that rivals NVIDIA at a fraction of the cost and energy consumption…
Competitive-Bake4602@reddit (OP)
Have you tried MLX on the M3 Ultra? One limitation for Macs is the lack of Tensor Parallelism across 2-4 devices. We did initial tests with TB5 that were promising, just not enough time for everything atm.
Hanthunius@reddit
Not only energy, but I bet it makes fanless Macs (MacBook Air) throttle less due to less heat. Cool stuff!
No_Conversation9561@reddit
Does ANE have access to full memory like GPU?
Competitive-Bake4602@reddit (OP)
No, only on base models. See our repo on memory profiling of ANE: https://github.com/Anemll/anemll-bench
daaain@reddit
Seems like it would be useful to disambiguate between binned models?
See: https://github.com/ggml-org/llama.cpp/discussions/4167
daaain@reddit
Actually, never mind, now reading the detailed benchmark page it looks like the big difference is between M1/M2 vs M3/M4 generations and M3/M4 Max standing out.
Competitive-Bake4602@reddit (OP)
And M4 Pro memory bandwidth = Max for the ANE. Plus M4 added accelerated int8 compute that is 2x faster than FP16, but hard to use yet for single-token prediction.
Creative-Size2658@reddit
I see here https://github.com/Anemll/Anemll/blob/main/docs/sample_apps.md they only support up to 8B models.
Is the readme out of date, or do they not support 30B and 32B models?
Competitive-Bake4602@reddit (OP)
We'll need to retest bigger models on the new OS.
ieatrox@reddit
would it be possible to use ANE for a small speculative decode version of a model and keep the larger version on the gpu?
Competitive-Bake4602@reddit (OP)
Yes, and multi-token prediction might be advantageous with the ANE.
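As a rough illustration of the idea being discussed (not an ANEMLL feature): a small draft model, which could live on the ANE, proposes a few tokens, and the larger GPU-resident model verifies them. The sketch below is a simplified acceptance loop with hypothetical `draft_sample`, `draft_logprob`, and `target_logprob` callables; a full implementation would also resample a corrected token on rejection.

```python
from typing import Callable, List
import math
import random

def speculative_step(
    prefix: List[int],
    draft_sample: Callable[[List[int]], int],           # sample next token from the draft model
    draft_logprob: Callable[[List[int], int], float],   # log p_draft(token | context)
    target_logprob: Callable[[List[int], int], float],  # log p_target(token | context)
    k: int = 4,
) -> List[int]:
    """Propose up to k draft tokens; keep each with prob min(1, p_target/p_draft)."""
    accepted: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_sample(ctx)
        ratio = math.exp(target_logprob(ctx, tok) - draft_logprob(ctx, tok))
        if random.random() < min(1.0, ratio):
            accepted.append(tok)      # draft token survives verification
            ctx.append(tok)
        else:
            break                     # first rejection ends this speculative run
    return accepted
```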
ieatrox@reddit
I can't wait to see if you get that going, that would be exciting ;)
sannysanoff@reddit
While I'm personally curious about ANE as a user, I don't have enough knowledge about its strengths, and this project lacks information explaining what niche it fills. Is it power usage? Performance? Memory efficiency? This isn't clearly stated. It would be good to see a comparison table with all these metrics (including prefill and generation speed) for a few models, comparing MLX/GPU/CPU and ANE performance.
Competitive-Bake4602@reddit (OP)
Noted, but comparisons are tough because "it depends." If you're solely focused on single-token inference on a high-end Ultra or Max, MLX is the better choice purely due to memory bandwidth. However, for a wider range of devices, the ANE provides lower energy use and consistent performance on the most popular devices like iPhones, MacBook Airs, and iPads. Nevertheless, we'll be adding a comparison section soon. Some initial work is here: https://github.com/Anemll/anemll-bench
taimusrs@reddit
Energy consumption most likely, and 'performance equity' second. So, bar the memory requirement, you don't have to buy a fancy M4 Max.
GiantPengsoo@reddit
This is really cool, first time seeing this project. I'm sure you have this explained somewhere, but how exactly do you use the ANE? Like, how do you program to use the ANE specifically?
My impression was that the ANE is mostly for Apple internal apps' use for AI stuff, and was mostly not truly accessible via APIs, and users were instead forced to use GPUs with Metal if they wanted to do AI themselves.
I think I recall something about how you could request the ANE with CoreML, but it was something along the lines of "you can ask for the ANE, but it could just run on the GPU, and we won't tell you."
Competitive-Bake4602@reddit (OP)
Yes, we have to convert LLM models to a CoreML "network". There are some constraints on precision and operations, and everything should map to 4D tensors. No branching is allowed, etc. The ANE is a tensor processor, mostly related to systolic arrays.
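For anyone curious what that looks like in practice, here is a minimal sketch (not ANEMLL's actual conversion pipeline) of tracing a tiny fixed-shape block and converting it with coremltools so Core ML can schedule it on the ANE; the module, shapes, and file name are illustrative only.

```python
import numpy as np
import torch
import coremltools as ct

class TinyBlock(torch.nn.Module):
    """Stand-in for one transformer sub-block; real models are much larger."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, x):
        # No data-dependent branching: the traced graph must stay static.
        return torch.relu(self.proj(x))

# Fixed-shape, rank-4 input: the ANE path wants static 4D tensors.
example = torch.zeros(1, 1, 128, 64)
traced = torch.jit.trace(TinyBlock().eval(), example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="x", shape=(1, 1, 128, 64), dtype=np.float16)],
    compute_precision=ct.precision.FLOAT16,   # ANE executes FP16
    compute_units=ct.ComputeUnit.CPU_AND_NE,  # prefer the ANE, fall back to CPU
    minimum_deployment_target=ct.target.iOS16,
)
mlmodel.save("tiny_block.mlpackage")
```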
me1000@reddit
No branching, does that imply it's not possible to run an MoE model on the ANE?
Competitive-Bake4602@reddit (OP)
MoE is possible, but the gate will be on the CPU part of the code, or you can run multiple agents in parallel. For coding, fixed tensor sizes and the lack of group quantization are the main issues atm. On performance, memory bandwidth is the main concern, at least on macOS vs the GPU. There are some other odd things like tensor dimensions and support for integer tensors, but the latter seems to be addressed in '26, though not in the public API yet. I'd say the primary issue is the lack of public code that works with LLMs on the ANE, which hinders ANE usage outside Apple.
These-Lychee4623@reddit
A general limitation when converting to CoreML is that the computation graph cannot be dynamic; it needs a static graph.
Another common issue when converting to CoreML is that one has to reimplement methods/functions which are not supported by CoreML. Example: torch.hamming is not supported, so one has to modify the code to use cos and sin functions instead.
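As an illustration of that kind of rewrite (using torch.hamming_window as a stand-in for the unsupported op): the same window values expressed with arange and cos arithmetic only, which maps to ops the converter handles.

```python
import math
import torch

def hamming_window_coreml_friendly(window_length: int, periodic: bool = True) -> torch.Tensor:
    """Same values as torch.hamming_window, built from arange/cos arithmetic only."""
    n = torch.arange(window_length, dtype=torch.float32)
    denom = window_length if periodic else window_length - 1
    return 0.54 - 0.46 * torch.cos(2.0 * math.pi * n / denom)

# Sanity check against the stock op (run outside the converted graph).
assert torch.allclose(hamming_window_coreml_friendly(400), torch.hamming_window(400), atol=1e-5)
```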
kadir_nar@reddit
Can you compare it with the MLX library? Or why should we use this library?
thezachlandes@reddit
Do you have any performance numbers? I'm a Mac user and curious to know if this is something I should be using for local inference.
rumm2602@reddit
Please use unsloth quants
Competitive-Bake4602@reddit (OP)
No group quantization on the ANE, but per-layer bit allocation is definitely on the map.
Competitive-Bake4602@reddit (OP)
To add, you can specify running on ANE and CPU. If your model is 100% ANE-friendly, it will run on the ANE. Sometimes the OS can decide to offload to the CPU for a brief moment, but it's rare. The CPU is mostly for models that are not super-tuned for the ANE, which is the hard part.
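For reference, this is roughly what specifying the compute units looks like when loading a converted model with coremltools; the model path and input name here are hypothetical, and the OS still makes the final per-layer placement decision.

```python
import numpy as np
import coremltools as ct

# CPU_AND_NE asks Core ML to prefer the Neural Engine and fall back to the CPU.
model = ct.models.MLModel(
    "qwen_block.mlpackage",                   # hypothetical converted model
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)
out = model.predict({"x": np.zeros((1, 1, 128, 64), dtype=np.float16)})
```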
mzbacd@reddit
This is extremely useful for text processing; it should be faster at prompt prefill than the GPU, if the Apple foundation model doesn't reject the text.
MrPecunius@reddit
Nice work!!
What benefits are you seeing from using the ANE? Low power for mobile, sure, but does e.g. an M4 see any benefit?