Qwen3 for Apple Neural Engine
Posted by Competitive-Bake4602@reddit | LocalLLaMA | 33 comments
We just dropped ANEMLL 0.3.3 alpha with Qwen3 support for Apple's Neural Engine
https://github.com/Anemll/Anemll
Start to support open source! Cheers, Anemll
bwjxjelsbd@reddit
Yo, this is what I've been waiting for as a Mac user!
Do you see performance jump in using ANE over GPU? How about power consumption?
SidneyFong@reddit
I mean, the project is super interesting, and I have no standing to complain, but isn't having a Python script calling into a bash shell, inlining a Python script to find the model path, and then calling another Python script that invokes xcrun using subprocess a bit much...?
dessatel@reddit
LOL, guilty. A lot of stuff needs refactoring
Competitive-Bake4602@reddit (OP)
M4 Pro has 2x faster memory access for the ANE vs M1/M2, and slightly faster than M3 Pro/Ultra, but not as fast as the GPU. M4 also adds int8/int4 compute, but we haven't included it yet. Besides energy, it has the potential to be faster on prefill for iOS and MacBook Airs with bigger docs.
Waterbottles_solve@reddit
We are trying to get a ~70B model working at our Fortune 20 company, and we've found it's entirely useless to use our Macs.
I wasn't surprised, but the disappointment was real among the department.
Now we are looking at getting 2x A6000s.
Careless_Garlic1438@reddit
Look at WebAI… they have an inference setup that rivals NVIDIA at a fraction of the cost and energy consumption…
Competitive-Bake4602@reddit (OP)
Have you tried MLX on the M3 Ultra? One limitation for Macs is the lack of Tensor Parallelism across 2-4 devices. We did initial tests with TB5 that were promising, just not enough time for everything atm.
Hanthunius@reddit
Not only energy, but I bet it makes fanless Macs (MacBook Air) throttle less due to less heat. Cool stuff!
No_Conversation9561@reddit
Does ANE have access to full memory like GPU?
Competitive-Bake4602@reddit (OP)
No, only on base models. See our repo on memory profiling of ANE: https://github.com/Anemll/anemll-bench
daaain@reddit
Seems like it would be useful to disambiguate between binned models?
See: https://github.com/ggml-org/llama.cpp/discussions/4167
daaain@reddit
Actually, never mind, now reading the detailed benchmark page it looks like the big difference is between M1/M2 vs M3/M4 generations and M3/M4 Max standing out.
Competitive-Bake4602@reddit (OP)
And M4 Pro memory bandwidth = Max for the ANE. Plus M4 added accelerated int8 compute that is 2x faster than FP16, but hard to use yet for single-token prediction.
Creative-Size2658@reddit
I see here https://github.com/Anemll/Anemll/blob/main/docs/sample_apps.md they only support up to 8B models.
Is the readme out of date, or do they not support 30B and 32B models?
Competitive-Bake4602@reddit (OP)
We'll need to retest bigger models on the new OS.
ieatrox@reddit
would it be possible to use ANE for a small speculative decode version of a model and keep the larger version on the gpu?
Competitive-Bake4602@reddit (OP)
Yes, and multi-token prediction might be advantageous with the ANE.
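As a rough illustration of the idea being discussed (not an ANEMLL feature): a small draft model, which could live on the ANE, proposes a few tokens, and the larger GPU-resident model verifies them. The sketch below is a simplified acceptance loop with hypothetical `draft_sample`, `draft_logprob`, and `target_logprob` callables; a full implementation would also resample a corrected token on rejection.

```python
from typing import Callable, List
import math
import random

def speculative_step(
    prefix: List[int],
    draft_sample: Callable[[List[int]], int],           # sample next token from the draft model
    draft_logprob: Callable[[List[int], int], float],   # log p_draft(token | context)
    target_logprob: Callable[[List[int], int], float],  # log p_target(token | context)
    k: int = 4,
) -> List[int]:
    """Propose up to k draft tokens; keep each with prob min(1, p_target/p_draft)."""
    accepted: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_sample(ctx)
        ratio = math.exp(target_logprob(ctx, tok) - draft_logprob(ctx, tok))
        if random.random() < min(1.0, ratio):
            accepted.append(tok)      # draft token survives verification
            ctx.append(tok)
        else:
            break                     # first rejection ends this speculative run
    return accepted
```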
ieatrox@reddit
I can't wait to see if you get that going, that would be exciting ;)
sannysanoff@reddit
While I'm personally curious about ANE as a user, I don't have enough knowledge about its strengths, and this project lacks information explaining what niche it fills. Is it power usage? Performance? Memory efficiency? This isn't clearly stated. It would be good to see a comparison table with all these metrics (including prefill and generation speed) for a few models, comparing MLX/GPU/CPU and ANE performance.
Competitive-Bake4602@reddit (OP)
Noted, but comparisons are tough because "it depends." If you're solely focused on single-token inference on a high-end Ultra or Max, MLX is the better choice purely due to memory bandwidth. However, for a wider range of devices, the ANE provides lower energy use and consistent performance on the most popular devices like iPhones, MacBook Airs, and iPads. Nevertheless, we'll be adding a comparison section soon. Some initial work is here: https://github.com/Anemll/anemll-bench
taimusrs@reddit
Energy consumption most likely, and 'performance equity' second. So, bar the memory requirement, you don't have to buy a fancy M4 Max.
GiantPengsoo@reddit
This is really cool, first time seeing this project. I'm sure you have this explained somewhere, but how exactly do you use the ANE? Like, how do you program to use the ANE specifically?
My impression was that the ANE is mostly for Apple internal apps' use for AI stuff, and was mostly not truly accessible via APIs, and users were instead forced to use GPUs with Metal if they wanted to do AI themselves.
I think I recall something about how you could request the ANE with CoreML, but it was something along the lines of "you can ask for the ANE, but it could just run on the GPU, and we won't tell you."
Competitive-Bake4602@reddit (OP)
Yes, we have to convert LLM models to a CoreML "network". There are some constraints on precision and operations, and everything should map to 4D tensors. No branching is allowed, etc. The ANE is a tensor processor, mostly related to systolic arrays.
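For anyone curious what that looks like in practice, here is a minimal sketch (not ANEMLL's actual conversion pipeline) of tracing a tiny fixed-shape block and converting it with coremltools so Core ML can schedule it on the ANE; the module, shapes, and file name are illustrative only.

```python
import numpy as np
import torch
import coremltools as ct

class TinyBlock(torch.nn.Module):
    """Stand-in for one transformer sub-block; real models are much larger."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, x):
        # No data-dependent branching: the traced graph must stay static.
        return torch.relu(self.proj(x))

# Fixed-shape, rank-4 input: the ANE path wants static 4D tensors.
example = torch.zeros(1, 1, 128, 64)
traced = torch.jit.trace(TinyBlock().eval(), example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="x", shape=(1, 1, 128, 64), dtype=np.float16)],
    compute_precision=ct.precision.FLOAT16,   # ANE executes FP16
    compute_units=ct.ComputeUnit.CPU_AND_NE,  # prefer the ANE, fall back to CPU
    minimum_deployment_target=ct.target.iOS16,
)
mlmodel.save("tiny_block.mlpackage")
```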
me1000@reddit
No branching, does that imply it's not possible to run an MoE model on the ANE?
Competitive-Bake4602@reddit (OP)
MoE is possible, but the gate will be on the CPU part of the code, or you can run multiple agents in parallel. For coding, fixed tensor sizes and the lack of group quantization are the main issues atm. On performance, memory bandwidth is the main concern, at least on macOS vs the GPU. There are some other odd things like tensor dimensions and support for integer tensors, but the latter seems to be addressed in '26, though not in the public API yet. I'd say the primary issue is the lack of public code that works with LLMs on the ANE, which hinders ANE usage outside Apple.
These-Lychee4623@reddit
A general limitation when converting to CoreML is that the computation graph cannot be dynamic; it needs a static graph.
Another common issue when converting to CoreML is that one has to reimplement methods/functions which are not supported by CoreML. Example: torch.hamming is not supported, so one has to modify the code to use cos and sin functions instead.
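As an illustration of that kind of rewrite (using torch.hamming_window as a stand-in for the unsupported op): the same window values expressed with arange and cos arithmetic only, which maps to ops the converter handles.

```python
import math
import torch

def hamming_window_coreml_friendly(window_length: int, periodic: bool = True) -> torch.Tensor:
    """Same values as torch.hamming_window, built from arange/cos arithmetic only."""
    n = torch.arange(window_length, dtype=torch.float32)
    denom = window_length if periodic else window_length - 1
    return 0.54 - 0.46 * torch.cos(2.0 * math.pi * n / denom)

# Sanity check against the stock op (run outside the converted graph).
assert torch.allclose(hamming_window_coreml_friendly(400), torch.hamming_window(400), atol=1e-5)
```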
kadir_nar@reddit
Can you compare it with the MLX library? Or why should we use this library?
thezachlandes@reddit
Do you have any performance numbers? I'm a Mac user and curious to know if this is something I should be using for local inference.
rumm2602@reddit
Please use unsloth quants
Competitive-Bake4602@reddit (OP)
No group quantization on the ANE, but per-layer bit allocation is definitely on the map.
Competitive-Bake4602@reddit (OP)
To add, you can specify running on ANE and CPU. If your model is 100% ANE-friendly, it will run on the ANE. Sometimes the OS can decide to offload to the CPU for a brief moment, but it's rare. The CPU is mostly for models that are not super-tuned for the ANE, which is the hard part.
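For reference, this is roughly what specifying the compute units looks like when loading a converted model with coremltools; the model path and input name here are hypothetical, and the OS still makes the final per-layer placement decision.

```python
import numpy as np
import coremltools as ct

# CPU_AND_NE asks Core ML to prefer the Neural Engine and fall back to the CPU.
model = ct.models.MLModel(
    "qwen_block.mlpackage",                   # hypothetical converted model
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)
out = model.predict({"x": np.zeros((1, 1, 128, 64), dtype=np.float16)})
```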
mzbacd@reddit
This is extremely useful for text processing; it should be faster at prompt prefill than the GPU, if the Apple foundation model doesn't reject the text.
MrPecunius@reddit
Nice work!!
What benefits are you seeing from using the ANE? Low power for mobile, sure, but does e.g. an M4 see any benefit?