Needle: We Distilled Gemini Tool Calling Into a 26M Model
Posted by Henrie_the_dreamer@reddit | LocalLLaMA | 42 comments
We open-sourced Needle, a 26M parameter function-calling (tool use) model. It runs at 6000 tok/s prefill and 1200 tok/s decode on consumer devices.
We've always been frustrated by how little effort goes into building agentic models that run on budget phones, so we dug in and arrived at an observation: agentic experiences are built on tool calling, and massive models are overkill for it. Tool calling is fundamentally retrieval-and-assembly (match the query to a tool name, extract argument values, emit JSON), not reasoning. Cross-attention is the right primitive for this, and FFN parameters are wasted at this scale.
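To make the retrieval-and-assembly framing concrete, here's a toy sketch of those three steps (this is a hand-written illustration, not the model; the keyword matching and regex extraction stand in for what the network learns):

```python
import json
import re

# Toy sketch of tool calling as retrieval-and-assembly:
# 1) match query to a tool name, 2) extract argument values, 3) emit JSON.
TOOLS = {
    "set_timer":    {"keywords": {"timer", "alarm"},   "args": {"minutes": r"(\d+)\s*minute"}},
    "send_message": {"keywords": {"text", "message"},  "args": {"recipient": r"to (\w+)"}},
}

def tool_call(query: str) -> str:
    q = query.lower()
    # Step 1: retrieval -- pick the tool whose keywords overlap the query most.
    name = max(TOOLS, key=lambda t: len(TOOLS[t]["keywords"] & set(q.split())))
    # Step 2: extraction -- regex stand-in for the model copying argument spans.
    args = {a: re.search(p, q).group(1) for a, p in TOOLS[name]["args"].items()}
    # Step 3: assembly -- emit structured JSON; no open-ended generation needed.
    return json.dumps({"name": name, "arguments": args})

print(tool_call("set a timer for 10 minutes"))
```

None of these steps require world knowledge or multi-step reasoning, which is why a tiny model can handle them when the tool schemas are provided in the input.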
Simple Attention Networks: the entire model is just attention and gating, no MLPs anywhere. Needle is an experimental run for single-shot function calling for consumer devices (phones, watches, glasses...).
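The exact block design is in the architecture writeup linked below; as a rough illustration only, here's a minimal numpy sketch of a transformer block with the MLP removed and an elementwise sigmoid gate on the attention output (the specific gating form here is an assumption, not Needle's actual layer):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Illustrative attention-plus-gating block: no FFN/MLP anywhere.
def attention_only_block(x, Wq, Wk, Wv, Wg):
    d = Wq.shape[1]
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d)) @ v   # standard self-attention
    gate = 1 / (1 + np.exp(-(x @ Wg)))         # learned sigmoid gate instead of an MLP
    return x + gate * attn                      # gated residual connection

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))                    # 8 tokens, model width 16
Ws = [rng.normal(size=(16, 16)) * 0.1 for _ in range(4)]
y = attention_only_block(x, *Ws)
```

The point of the sketch: every parameter either moves information between tokens (attention) or modulates it (gating); nothing is spent on the per-token FFN that normally stores memorized facts.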
Training:
- Pretrained on 200B tokens across 16 TPU v6e (27 hours)
- Post-trained on 2B tokens of synthesized function-calling data (45 minutes)
- Dataset synthesized via Gemini with 15 tool categories (timers, messaging, navigation, smart home, etc.)
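For context, here's a guess at the shape of one synthesized post-training record (field names are illustrative, not the pipeline's actual schema; the real dataset generation code is in the repository):

```python
import json

# Hypothetical single-shot function-calling training record: tool schemas
# and a user query go in, one structured call comes out.
record = {
    "tools": [{
        "name": "set_timer",
        "description": "Start a countdown timer",
        "parameters": {"minutes": {"type": "integer"}},
    }],
    "query": "wake me up in 20 minutes",
    "target": {"name": "set_timer", "arguments": {"minutes": 20}},
}

line = json.dumps(record)  # one JSONL line of post-training data
```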
You can test it right now and finetune on your Mac/PC: https://github.com/cactus-compute/needle
The full writeup on the architecture is here: https://github.com/cactus-compute/needle/blob/main/docs/simple_attention_networks.md
We found that the "no FFN" finding generalizes beyond function calling to any task where the model has access to external structured knowledge (RAG, tool use, retrieval). The model doesn't need to memorize facts in FFN weights if the facts are provided in the input. Experimental results to be published.
While it beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, LFM2.5-350M on single-shot function calling, those models have more scope/capacity and excel in conversational settings. We encourage you to test on your own tools via the playground and finetune accordingly.
Needle is part of a broader effort to make on-device AI practical. We also build Cactus (https://github.com/cactus-compute/cactus), an open-source inference engine for mobile and wearables, which we posted about previously: https://news.ycombinator.com/item?id=44524544
Everything is MIT licensed. Weights: https://huggingface.co/Cactus-Compute/needle
Britbong1492@reddit
Sounds cool, are you planning to apply this to qwen3.6:27b, asking for a friend....
Henrie_the_dreamer@reddit (OP)
Anyone can actually use the codebase to do that now.
mdda@reddit
Attention Is All You Have
Henrie_the_dreamer@reddit (OP)
Haha, exactly
okyaygokay@reddit
Of course you named your sword
Henrie_the_dreamer@reddit (OP)
Haha, GOT reference?
LeatherRub7248@reddit
u/Henrie_the_dreamer could you give some real world examples of use cases?
i was thinking of putting this upfront essentially as a 'router' but the key thing is the possible variations of input context could be infinite, especially if we provide a longer message history (to retain chronological context)
As a simple example, lets say 2 tools are available:
web_search
extract_web_page
and then the input context is last 30 messages of the chat history.
Is this the wrong use case for this model?
Henrie_the_dreamer@reddit (OP)
So this particular model was designed for single-shot function calling, for now :(
silenceimpaired@reddit
Could you provide an example of a single-shot use case? Trying to wrap my mind around the value of this.
sammcj@reddit
Interesting you used Gemini for the distillation, I've always found Gemini to have the least reliable tool calling of the larger models.
Limp_Statistician529@reddit
This is really helpful especially for someone like me who uses Gemini most of the time
TheGoddessInari@reddit
Rare to see pickle files uploaded anymore due to the security implications and python-specific dependence.
Henrie_the_dreamer@reddit (OP)
It’s for JAX ecosystem haha
xkaoticwolf@reddit
Can’t you guys use ONNX? I’ve exported my JAX RL models using ONNX via the jax2onnx library.
Henrie_the_dreamer@reddit (OP)
You can run it in Cactus directly
MindPsychological140@reddit
Henrie_the_dreamer@reddit (OP)
We are still experimenting to see if we need FFN for multi-chat inputs.
imonlysmarterthanyou@reddit
I was able to run the playground but this took up nearly all memory on my 5070 and it wasn’t fast. I just ran what was in the playground. What is the expected performance?
Henrie_the_dreamer@reddit (OP)
Oh, you should run on a simple CPU, the playground just uses JAX. For inference, use Cactus. Just that Cactus only runs on ARM chips, so it made sense to keep the playground in JAX.
lunerift@reddit
this is actually a pretty interesting direction, most tool calling workloads are structured routing plus argument extraction anyway, not deep reasoning, people keep throwing 70b models at problems that are basically schema matching and retrieval, curious how robust it is once tools become noisy or multi step though.
superloser48@reddit
Good start, but it's not really useful yet without an API, and the web demo is really a toy. Do you plan on adding an OpenAI-compatible API or something similar?
Barry_22@reddit
Impressive.
Model itself aside, how did you get access to TPUs? Some sort of program, or did you pay for the credits?
ready_or_not_3434@reddit
Stripping out the MLPs completely is a really smart move if you just need strict JSON routing. It always bothered me having to spin up a full 8B model just to extract arguments and populate a basic schema.
Henrie_the_dreamer@reddit (OP)
Exactly the problem!
CodeMichaelD@reddit
if it can fix crazy fast https://huggingface.co/0xSero/gemma-4-19b-a4b-it-REAP I am all for it..
the question is how to do it? do I need to set up a localhost router and plug in https://github.com/cactus-compute/needle#usage-python
TLDR: How to use it for agentic coding to improve occasional broken toolcalls between llama.cpp and say codex? (thank you)
Henrie_the_dreamer@reddit (OP)
For coding, Needle will need finetuning.
anotherthrowaway469@reddit
This is super interesting, I had been looking for something like this. Do you have any plans for somewhat larger models (e.g. ~10B) that would still fit on most consumer GPUs but give you a lot more parameters?
Henrie_the_dreamer@reddit (OP)
We’re still perfecting the architecture and running experiments for now, so much to do before all that :(
themixtergames@reddit
Gemini is an interesting choice; it has issues with tool calling that Google had to patch in the system prompt. Every Gemini query I run, it first thinks:
I'm sure your training data is clean, just thought it was funny. Nice work.
Henrie_the_dreamer@reddit (OP)
Yeah, cause our experiments used FunctionGemma as baseline, which was distilled from Gemini too.
Shot-Ad8790@reddit
The focus on tool-calling fits scenarios where external APIs or databases provide the 'facts.' It's a different problem than models designed for open-ended reasoning or conversation without external knowledge.
themixtergames@reddit
Your system prompt is wrong, you need to specify to write all sentences using lowercase.
Orolol@reddit
So you could have this model to route request toward a RAG, and then a small model like this (without FFN but post trained to this specific task) using the knowledge extracted via the RAG to articulate a response in natural language ?
Henrie_the_dreamer@reddit (OP)
Yes, MLP and scale seem to matter in tasks like coding, maths, and situations that need internal model knowledge.
denoflore_ai_guy@reddit
*slow clap that turns into an auditorium standing applause*
Henrie_the_dreamer@reddit (OP)
haha, thanks, let us know your thoughts!
gcavalcante8808@reddit
Nice! I was about to test Gemma 3 270 on tool calling. I'll take a look at the blog, thanks
Henrie_the_dreamer@reddit (OP)
Ping us if you need anything!
rentprompts@reddit
The part I would pull out of "Needle: We Distilled Gemini Tool Calling Into a 26M Model" is the concrete workflow. What input did they start with, what changed in the middle, and what final output was good enough to reuse?
That is the missing layer in a lot of AI posts right now. A trend becomes useful only when someone can copy the process, measure the result, and adapt it to their own offer.
Henrie_the_dreamer@reddit (OP)
We shared the full dataset generation pipeline in the repository, you should check it out.
Firstbober@reddit
So essentially such model could be used to filter where any query should go by just one-shot calling a proper "big LLM" with proper params. Also, the same architecture could be used for perfect summarization AI?
Henrie_the_dreamer@reddit (OP)
Yes, any task where the model doesn't need robust built-in world knowledge because the relevant facts arrive with the input.
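The routing idea discussed above reduces to dispatching on the tool name a small function-calling model emits; a minimal sketch (backend names here are hypothetical, not part of Needle or Cactus):

```python
import json

def route(tool_call_json: str) -> str:
    """Dispatch on the tool name emitted by a small function-calling model."""
    # Hypothetical backends: the tiny model picks one, so the big LLM
    # is only invoked when the query actually needs it.
    backends = {
        "search_docs": "rag-pipeline",
        "chat":        "big-llm",
        "run_code":    "code-llm",
    }
    call = json.loads(tool_call_json)
    return backends[call["name"]]

# e.g. the 26M model emits this call, and routing is a dictionary lookup:
dest = route('{"name": "search_docs", "arguments": {"query": "refund policy"}}')
```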