IBM releases Granite-4.0 Nano (300M & 1B), along with a local browser demo showing how the models can programmatically interact with websites and call tools/browser APIs on your behalf.
Posted by xenovatech@reddit | LocalLLaMA | View on Reddit | 34 comments
IBM just released Granite-4.0 Nano, their smallest LLMs to date (300M & 1B). The models demonstrate remarkable instruction following and tool calling capabilities, making them perfect for on-device applications.
Links:
- Blog post: https://huggingface.co/blog/ibm-granite/granite-4-nano
- Demo (+ source code): https://huggingface.co/spaces/ibm-granite/Granite-4.0-Nano-WebGPU
+ for those wondering, the demo uses Transformers.js to run the models 100% locally in your browser with WebGPU acceleration.
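A minimal sketch of what that looks like with Transformers.js, assuming a hypothetical ONNX export id and placeholder generation settings; the demo's actual source is linked above:

```ts
// Minimal sketch: run a small text-generation model in the browser with
// Transformers.js on the WebGPU backend.
import { pipeline } from "@huggingface/transformers";

// Hypothetical model id; check the demo source for the real checkpoint.
const generator = await pipeline("text-generation", "onnx-community/granite-4.0-nano-ONNX", {
  device: "webgpu", // run inference on the GPU via WebGPU
  dtype: "q4",      // quantized weights to keep the download small
});

const messages = [
  { role: "system", content: "You are a concise assistant." },
  { role: "user", content: "List three uses for an on-device LLM." },
];

const output = await generator(messages, { max_new_tokens: 128 });
console.log(output[0].generated_text.at(-1).content);
```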
ElSrJuez@reddit
I must be dumber than a 300M model, couldn't run the demo, it just gives me a page saying "this demo"
badgerbadgerbadgerWI@reddit
300M parameters running client-side is wild. The privacy implications alone make this worth exploring. No more sending PII to OpenAI for basic tasks.
WriedGuy@reddit
IBM is slowly dominating in SLMs
Substantial_Step_351@reddit
This is a pretty solid move by IBM. Running 300M-1B parameter models locally with browser API access is huge for privacy-focused or offline-first devs. It bridges that middle ground between toy demo and cloud dependency.
What will be interesting is how they handle permissioning: if the model can open URLs or trigger browser calls, sandboxing becomes key. Still, it's a nice reminder that edge inference isn't just for mobile anymore; WebGPU and lightweight LLMs are making local AI actually practical.
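On the permissioning point, a rough sketch of the kind of gate browser-side tool calls could sit behind; the tool name, schema, and confirmation prompt here are made up for illustration, not what the Space actually ships:

```ts
// Illustrative only: require explicit user approval before a model-requested
// tool call is allowed to touch any browser API.
type ToolCall = { name: string; arguments: Record<string, unknown> };

const tools: Record<string, (args: any) => Promise<unknown>> = {
  // Hypothetical tool that opens a URL in a new tab.
  open_url: async ({ url }: { url: string }) => {
    window.open(url, "_blank", "noopener,noreferrer");
    return { opened: url };
  },
};

async function runToolCall(call: ToolCall): Promise<unknown> {
  const handler = tools[call.name];
  if (!handler) return { error: `unknown tool: ${call.name}` };
  // The browser sandbox keeps the page contained; this gate keeps the model
  // from acting inside the page without the user's say-so.
  const ok = window.confirm(
    `The model wants to call ${call.name}(${JSON.stringify(call.arguments)}). Allow?`
  );
  return ok ? handler(call.arguments) : { error: "denied by user" };
}
```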
wolttam@reddit
Luckily, sandboxing within browsers is pretty easy (in fact, it's the default: browsers work very hard to make sure code running on a website can't break out and harm your system).
ramendik@reddit
Any idea what the "official" context window size is?
ZakoZakoZakoZakoZako@reddit
Holy shit mamba+attn might legit be viable and the way forward
Fuckinglivemealone@reddit
Why exactly?
PeruvianNet@reddit
Speculation. It's not. If it beat transformers it would be the default.
tiffanytrashcan@reddit
I mean in plenty of use cases it does beat "simple" transformers.
Sure, it's a little slower than a similarly sized model on my hardware, but the context window is literally ten times bigger, and it still fits in VRAM. It's physically impossible for me to run that context size on models with even half the parameters, RAM offload or not.
This is my experience with the older llama.cpp/koboldcpp implementations, before the latest fixes that should make it extremely competitive and equally fast.
I'm super excited for these new models. I'm imagining stupid token windows on a phone.
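Some rough, back-of-the-envelope numbers on why that happens, using a made-up 24-layer attention-only config rather than Granite's actual shape: attention pays a KV cache that grows linearly with context, while a Mamba/SSM layer keeps a fixed-size state, so a hybrid with only a few attention layers pays a fraction of this.

```ts
// Per-token KV cache cost for the attention layers of a hypothetical model.
// Mamba/SSM layers keep a constant-size state instead of caching K and V.
const layers = 24, kvHeads = 8, headDim = 64, bytesPerElem = 2; // fp16, made-up config

const kvCacheGB = (seqLen: number) =>
  (2 /* K and V */ * layers * kvHeads * headDim * bytesPerElem * seqLen) / 1e9;

console.log(kvCacheGB(8_192).toFixed(2));   // ~0.40 GB at 8K context
console.log(kvCacheGB(131_072).toFixed(2)); // ~6.44 GB at 128K, far more than a 1B model's fp16 weights
```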
PeruvianNet@reddit
It gives longer context. The performance in other ways degrades. It'll be forgotten once they do the image compression to text transformers.
On the phone I can see them making a few long mamba routines.
Straight_Abrocoma321@reddit
Maybe it's not the default because nobody has tried it on a large scale.
PeruvianNet@reddit
Nope. They tried and it didn't work well for anything besides context
EntireBobcat1474@reddit
Huh, this is kind of similar to Gemma and Gemini 1.5 in using N:1 interleaving of dense attention layers with something else; for Gemma it was a local windowed-attention transformer layer instead of an RNN layer, and at a more conservative 4-6:1 ratio. It's IMO a great idea: the main performance bottleneck in Mamba is a breakdown of inductive reasoning without dense attention, but dense attention is only needed relatively sparsely to develop the inductive biases that create those circuits. The quadratic bottleneck remains, so you'll still need a way to solve the quadratic communication overhead during training for long sequences, but it should be much cheaper to train now.
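A toy illustration of that N:1 interleaving idea; the layer counts and ratio below are arbitrary, not Granite's or Gemma's actual configuration:

```ts
// Build an illustrative hybrid stack: N Mamba/SSM layers per dense-attention layer.
type LayerKind = "mamba" | "attention";

function hybridLayout(totalLayers: number, ssmPerAttention: number): LayerKind[] {
  const layout: LayerKind[] = [];
  for (let i = 0; i < totalLayers; i++) {
    // Every (ssmPerAttention + 1)-th layer is dense attention; the rest are SSM.
    layout.push((i + 1) % (ssmPerAttention + 1) === 0 ? "attention" : "mamba");
  }
  return layout;
}

console.log(hybridLayout(12, 3));
// ["mamba", "mamba", "mamba", "attention", "mamba", "mamba", "mamba", "attention", ...]
```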
ZakoZakoZakoZakoZako@reddit
Oh wow, this is only using Mamba-2; I wonder how much it would improve with Mamba-3...
zhambe@reddit
This is impressive. I don't understand how it's built, but I think I get the implications -- this is not limited to browsers, one can use this model for tool calling in other contexts, right?
These are small enough you can run a "swarm" of them on a pair of 3090s
InterestRelative@reddit
Why would you need a swarm of same models?
Devourer_of_HP@reddit
One of the things you can do is have an agent choose what tasks need to be done based on the prompt sent to it, then delegate each task to a specialized agent.
So, for example, it receives a prompt to do preliminary data analysis on whatever you want; the orchestration agent receives the request, creates multiple subtasks, and delegates each one to an agent made for it, like having one made for querying the internet to find sources, and one to write Python code on the received data and show graphs.
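A sketch of that orchestrator pattern, where every "specialized agent" is the same small local model with a different system prompt; the agent names, prompts, and plan format are made up for illustration:

```ts
// Sketch of the orchestrator pattern described above: one small local model,
// several "specialized agents" that are really the same model with different
// system prompts. All names, prompts, and formats here are illustrative.
type Generate = (systemPrompt: string, userPrompt: string) => Promise<string>;

const agents: Record<string, string> = {
  researcher: "You search the web and cite sources for the task you are given.",
  analyst: "You write Python that analyzes the provided data and plots graphs.",
};

async function orchestrate(generate: Generate, request: string): Promise<string[]> {
  // 1. Ask the orchestrator to split the request into '<agent>: <task>' lines.
  const plan = await generate(
    "Split the user's request into subtasks, one per line, formatted as " +
      `'<agent>: <task>'. Known agents: ${Object.keys(agents).join(", ")}.`,
    request
  );

  // 2. Delegate each subtask to the matching "agent": same model, different prompt.
  const results: string[] = [];
  for (const line of plan.split("\n").filter((l) => l.trim())) {
    const [name, task] = line.split(/:\s*(.+)/s);
    const systemPrompt = agents[(name ?? "").trim()] ?? agents.researcher;
    results.push(await generate(systemPrompt, task ?? line));
  }
  return results;
}
```

Here `generate` stands in for whatever actually runs the model locally, e.g. a Transformers.js pipeline or a call to a llama.cpp server.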
InterestRelative@reddit
And this specialized agent - what's that? Is it the same LLM with a different system prompt and a different set of tools? Is it the same LLM with a LoRA adapter and a different set of tools?
Or is it a separate LLM?
In the first case you still have one model to serve, even if the prompts, tools, and adapters are different. Changing adapters on the fly should be fast since they're already in GPU memory and tiny; a few milliseconds, maybe.
In the second case you have a swarm of LLMs, but how useful is it to have 10x 2B models rather than a single 20B MoE for everything?
_lavoisier_@reddit
llama.cpp has WebAssembly support, so they probably compiled it to a Wasm binary and run it via JavaScript.
These-Dog6141@reddit
Can someone test it and report back on use cases and how well it works?
Juan_Valadez@reddit
Do you want a coffee?
These-Dog6141@reddit
yes, I'm getting a coffee now, good morning
PeruvianNet@reddit
I'm depressed and nothing is ever good and it only gets worse until we die.
twavisdegwet@reddit
You should probably pay for a frontier/closed model
PeruvianNet@reddit
Don't worry I'm being ironic. I love my q6 qwen 4b 0725
Silent_Employment966@reddit
This is cool, mind sharing it in r/AIAgentsInAction?
Famous-Appointment-8@reddit
So you are the Westbury?
Barry_Jumps@reddit
Actually pretty impressed by the nano model on WebGPU.
TechSwag@reddit
Off-topic, but how do people make these videos where the screen zooms in and out with the cursor?
Crafty-Celery-2466@reddit
Lots of apps, like Cap, Screen Studio, and more
Barry_Jumps@reddit
https://screen.studio/
padpump@reddit
You can do something like this with the built-in Zoom function of macOS
SnooMarzipans2470@reddit
Unsloth for fine-tuning when?