Needle: We Distilled Gemini Tool Calling Into a 26M Model
Posted by Henrie_the_dreamer@reddit | LocalLLaMA | 42 comments
We open-sourced Needle, a 26M parameter function-calling (tool use) model. It runs at 6000 tok/s prefill and 1200 tok/s decode on consumer devices.
We've always been frustrated by how little effort goes into building agentic models that run on budget phones, so we dug in and arrived at an observation: agentic experiences are built on tool calling, and massive models are overkill for it. Tool calling is fundamentally retrieval-and-assembly (match the query to a tool name, extract argument values, emit JSON), not reasoning. Cross-attention is the right primitive for this, and FFN parameters are wasted at this scale.
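To make the retrieval-and-assembly framing concrete, here's a toy sketch of those three steps (this is a hand-written illustration, not the model; the keyword matching and regex extraction stand in for what the network learns):

```python
import json
import re

# Toy sketch of tool calling as retrieval-and-assembly:
# 1) match query to a tool name, 2) extract argument values, 3) emit JSON.
TOOLS = {
    "set_timer":    {"keywords": {"timer", "alarm"},   "args": {"minutes": r"(\d+)\s*minute"}},
    "send_message": {"keywords": {"text", "message"},  "args": {"recipient": r"to (\w+)"}},
}

def tool_call(query: str) -> str:
    q = query.lower()
    # Step 1: retrieval -- pick the tool whose keywords overlap the query most.
    name = max(TOOLS, key=lambda t: len(TOOLS[t]["keywords"] & set(q.split())))
    # Step 2: extraction -- regex stand-in for the model copying argument spans.
    args = {a: re.search(p, q).group(1) for a, p in TOOLS[name]["args"].items()}
    # Step 3: assembly -- emit structured JSON; no open-ended generation needed.
    return json.dumps({"name": name, "arguments": args})

print(tool_call("set a timer for 10 minutes"))
```

None of these steps require world knowledge or multi-step reasoning, which is why a tiny model can handle them when the tool schemas are provided in the input.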
Simple Attention Networks: the entire model is just attention and gating, no MLPs anywhere. Needle is an experimental run for single-shot function calling for consumer devices (phones, watches, glasses...).
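The exact block design is in the architecture writeup linked below; as a rough illustration only, here's a minimal numpy sketch of a transformer block with the MLP removed and an elementwise sigmoid gate on the attention output (the specific gating form here is an assumption, not Needle's actual layer):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Illustrative attention-plus-gating block: no FFN/MLP anywhere.
def attention_only_block(x, Wq, Wk, Wv, Wg):
    d = Wq.shape[1]
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d)) @ v   # standard self-attention
    gate = 1 / (1 + np.exp(-(x @ Wg)))         # learned sigmoid gate instead of an MLP
    return x + gate * attn                      # gated residual connection

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))                    # 8 tokens, model width 16
Ws = [rng.normal(size=(16, 16)) * 0.1 for _ in range(4)]
y = attention_only_block(x, *Ws)
```

The point of the sketch: every parameter either moves information between tokens (attention) or modulates it (gating); nothing is spent on the per-token FFN that normally stores memorized facts.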
Training:
- Pretrained on 200B tokens across 16 TPU v6e (27 hours)
- Post-trained on 2B tokens of synthesized function-calling data (45 minutes)
- Dataset synthesized via Gemini with 15 tool categories (timers, messaging, navigation, smart home, etc.)
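For context, here's a guess at the shape of one synthesized post-training record (field names are illustrative, not the pipeline's actual schema; the real dataset generation code is in the repository):

```python
import json

# Hypothetical single-shot function-calling training record: tool schemas
# and a user query go in, one structured call comes out.
record = {
    "tools": [{
        "name": "set_timer",
        "description": "Start a countdown timer",
        "parameters": {"minutes": {"type": "integer"}},
    }],
    "query": "wake me up in 20 minutes",
    "target": {"name": "set_timer", "arguments": {"minutes": 20}},
}

line = json.dumps(record)  # one JSONL line of post-training data
```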
You can test it right now and finetune on your Mac/PC: https://github.com/cactus-compute/needle
The full writeup on the architecture is here: https://github.com/cactus-compute/needle/blob/main/docs/simple_attention_networks.md
We found that the "no FFN" finding generalizes beyond function calling to any task where the model has access to external structured knowledge (RAG, tool use, retrieval). The model doesn't need to memorize facts in FFN weights if the facts are provided in the input. Experimental results to be published.
While it beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, LFM2.5-350M on single-shot function calling, those models have more scope/capacity and excel in conversational settings. We encourage you to test on your own tools via the playground and finetune accordingly.
Needle is part of a broader effort to make on-device AI practical. We also build Cactus (https://github.com/cactus-compute/cactus), an open-source inference engine for mobile and wearables, which we posted about previously: https://news.ycombinator.com/item?id=44524544
Everything is MIT licensed. Weights: https://huggingface.co/Cactus-Compute/needle
Britbong1492@reddit
Sounds cool, are you planning to apply this to qwen3.6:27b, asking for a friend....
Henrie_the_dreamer@reddit (OP)
Anyone can actually use the codebase to do that now.
mdda@reddit
Attention Is All You Have
Henrie_the_dreamer@reddit (OP)
Haha, exactly
okyaygokay@reddit
Of course you named your sword
Henrie_the_dreamer@reddit (OP)
Haha, GOT reference?
LeatherRub7248@reddit
u/Henrie_the_dreamer could you give some real world examples of use cases?
i was thinking of putting this upfront essentially as a 'router' but the key thing is the possible variations of input context could be infinite, especially if we provide a longer message history (to retain chronological context)
As a simple example, lets say 2 tools are available:
web_search
extract_web_page
and then the input context is last 30 messages of the chat history.
Is this the wrong use case for this model?
Henrie_the_dreamer@reddit (OP)
So this particular model was designed for single-shot function calling, for now :(
silenceimpaired@reddit
Could you provide an example of a single-shot use case? Trying to wrap my mind around the value of this.
sammcj@reddit
Interesting you used Gemini for the distillation, I've always found Gemini to have the least reliable tool calling of the larger models.
Limp_Statistician529@reddit
This is really helpful especially for someone like me who uses Gemini most of the time
TheGoddessInari@reddit
Rare to see pickle files uploaded anymore due to the security implications and python-specific dependence.
Henrie_the_dreamer@reddit (OP)
It’s for JAX ecosystem haha
xkaoticwolf@reddit
Can’t you guys use ONNX? I’ve exported my JAX RL models using ONNX via the jax2onnx library.
Henrie_the_dreamer@reddit (OP)
You can run it in Cactus directly
MindPsychological140@reddit
Henrie_the_dreamer@reddit (OP)
We are still experimenting to see if we need FFN for multi-chat inputs.
imonlysmarterthanyou@reddit
I was able to run the playground but this took up nearly all memory on my 5070 and it wasn’t fast. I just ran what was in the playground. What is the expected performance?
Henrie_the_dreamer@reddit (OP)
Oh, you should run on a simple CPU, the playground just uses JAX. For inference, use Cactus. Just that Cactus only runs on ARM chips, so it made sense to keep the playground in JAX.
lunerift@reddit
this is actually a pretty interesting direction, most tool calling workloads are structured routing plus argument extraction anyway, not deep reasoning, people keep throwing 70b models at problems that are basically schema matching and retrieval, curious how robust it is once tools become noisy or multi step though.
superloser48@reddit
Good start, but it's not really useful yet without an API, and the web demo is really a toy. Do you plan on adding an OpenAI-compatible API or something similar?
Barry_22@reddit
Impressive.
Model itself aside, how did you get access to TPUs? Some sort of program, or did you pay for the credits?
ready_or_not_3434@reddit
Stripping out the MLPs completely is a really smart move if you just need strict JSON routing. It always bothered me having to spin up a full 8B model just to extract arguments and populate a basic schema.
Henrie_the_dreamer@reddit (OP)
Exactly the problem!
CodeMichaelD@reddit
if it can fix crazy fast https://huggingface.co/0xSero/gemma-4-19b-a4b-it-REAP I am all for it..
the question is how to do it? do I need to set up a localhost router and plug in https://github.com/cactus-compute/needle#usage-python
TLDR: How to use it for agentic coding to improve occasional broken toolcalls between llama.cpp and say codex? (thank you)
Henrie_the_dreamer@reddit (OP)
For coding, Needle will need finetuning.
anotherthrowaway469@reddit
This is super interesting, I had been looking for something like this. Do you have any plans for somewhat larger models (e.g. ~10B) that would still fit on most consumer GPUs but give you a lot more parameters?
Henrie_the_dreamer@reddit (OP)
We’re still perfecting the architecture and running experiments for now, so much to do before all that :(
themixtergames@reddit
Gemini is an interesting choice; it has issues with tool calling that Google had to patch in the system prompt. Every Gemini query I run, it first thinks:
I'm sure your training data is clean, just thought it was funny. Nice work.
Henrie_the_dreamer@reddit (OP)
Yeah, cause our experiments used FunctionGemma as baseline, which was distilled from Gemini too.
Shot-Ad8790@reddit
The focus on tool-calling fits scenarios where external APIs or databases provide the 'facts.' It's a different problem than models designed for open-ended reasoning or conversation without external knowledge.
themixtergames@reddit
Your system prompt is wrong, you need to specify to write all sentences using lowercase.
Orolol@reddit
So you could have this model to route request toward a RAG, and then a small model like this (without FFN but post trained to this specific task) using the knowledge extracted via the RAG to articulate a response in natural language ?
Henrie_the_dreamer@reddit (OP)
Yes, MLP and scale seem to matter in tasks like coding, maths, and situations that need internal model knowledge.
denoflore_ai_guy@reddit
*slow clap that turns into an auditorium standing applause*
Henrie_the_dreamer@reddit (OP)
haha, thanks, let us know your thoughts!
gcavalcante8808@reddit
Nice! I was about to test Gemma 3 270 on tool calling. I'll take a look at the blog, thanks
Henrie_the_dreamer@reddit (OP)
Ping us if you need anything!
rentprompts@reddit
The part I would pull out of "Needle: We Distilled Gemini Tool Calling Into a 26M Model" is the concrete workflow. What input did they start with, what changed in the middle, and what final output was good enough to reuse?
That is the missing layer in a lot of AI posts right now. A trend becomes useful only when someone can copy the process, measure the result, and adapt it to their own offer.
Henrie_the_dreamer@reddit (OP)
We shared the full dataset generation pipeline in the repository, you should check it out.
Firstbober@reddit
So essentially such model could be used to filter where any query should go by just one-shot calling a proper "big LLM" with proper params. Also, the same architecture could be used for perfect summarization AI?
Henrie_the_dreamer@reddit (OP)
Yes, any task where the model doesn't need robust built-in world knowledge because the relevant facts arrive with the input.
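The routing idea discussed above reduces to dispatching on the tool name a small function-calling model emits; a minimal sketch (backend names here are hypothetical, not part of Needle or Cactus):

```python
import json

def route(tool_call_json: str) -> str:
    """Dispatch on the tool name emitted by a small function-calling model."""
    # Hypothetical backends: the tiny model picks one, so the big LLM
    # is only invoked when the query actually needs it.
    backends = {
        "search_docs": "rag-pipeline",
        "chat":        "big-llm",
        "run_code":    "code-llm",
    }
    call = json.loads(tool_call_json)
    return backends[call["name"]]

# e.g. the 26M model emits this call, and routing is a dictionary lookup:
dest = route('{"name": "search_docs", "arguments": {"query": "refund policy"}}')
```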