IBM releases Granite-4.0 Nano (300M & 1B), along with a local browser demo showing how the models can programmatically interact with websites and call tools/browser APIs on your behalf.
Posted by xenovatech@reddit | LocalLLaMA | View on Reddit | 34 comments
IBM just released Granite-4.0 Nano, their smallest LLMs to date (300M & 1B). The models demonstrate remarkable instruction following and tool calling capabilities, making them perfect for on-device applications.
Links:
- Blog post: https://huggingface.co/blog/ibm-granite/granite-4-nano
- Demo (+ source code): https://huggingface.co/spaces/ibm-granite/Granite-4.0-Nano-WebGPU
+ for those wondering, the demo uses Transformers.js to run the models 100% locally in your browser with WebGPU acceleration.
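A minimal sketch of what that looks like with Transformers.js, assuming a hypothetical ONNX export id and placeholder generation settings; the demo's actual source is linked above:

```ts
// Minimal sketch: run a small text-generation model in the browser with
// Transformers.js on the WebGPU backend.
import { pipeline } from "@huggingface/transformers";

// Hypothetical model id; check the demo source for the real checkpoint.
const generator = await pipeline("text-generation", "onnx-community/granite-4.0-nano-ONNX", {
  device: "webgpu", // run inference on the GPU via WebGPU
  dtype: "q4",      // quantized weights to keep the download small
});

const messages = [
  { role: "system", content: "You are a concise assistant." },
  { role: "user", content: "List three uses for an on-device LLM." },
];

const output = await generator(messages, { max_new_tokens: 128 });
console.log(output[0].generated_text.at(-1).content);
```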
ElSrJuez@reddit
I must be dumber than a 300M model, couldn't run the demo, it just gives me a page saying "this demo"
badgerbadgerbadgerWI@reddit
300M parameters running client-side is wild. The privacy implications alone make this worth exploring. No more sending PII to OpenAI for basic tasks.
WriedGuy@reddit
IBM is slowly dominating in SLMs
Substantial_Step_351@reddit
This is a pretty solid move by IBM. Running 300M-1B parameter models locally with browser API access is huge for privacy-focused or offline-first devs. It bridges that middle ground between toy demo and cloud dependency.
What will be interesting is how they handle permissioning: if the model can open URLs or trigger browser calls, sandboxing becomes key. Still, it's a nice reminder that edge inference isn't just for mobile anymore; WebGPU and lightweight LLMs are making local AI actually practical.
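On the permissioning point, a rough sketch of the kind of gate browser-side tool calls could sit behind; the tool name, schema, and confirmation prompt here are made up for illustration, not what the Space actually ships:

```ts
// Illustrative only: require explicit user approval before a model-requested
// tool call is allowed to touch any browser API.
type ToolCall = { name: string; arguments: Record<string, unknown> };

const tools: Record<string, (args: any) => Promise<unknown>> = {
  // Hypothetical tool that opens a URL in a new tab.
  open_url: async ({ url }: { url: string }) => {
    window.open(url, "_blank", "noopener,noreferrer");
    return { opened: url };
  },
};

async function runToolCall(call: ToolCall): Promise<unknown> {
  const handler = tools[call.name];
  if (!handler) return { error: `unknown tool: ${call.name}` };
  // The browser sandbox keeps the page contained; this gate keeps the model
  // from acting inside the page without the user's say-so.
  const ok = window.confirm(
    `The model wants to call ${call.name}(${JSON.stringify(call.arguments)}). Allow?`
  );
  return ok ? handler(call.arguments) : { error: "denied by user" };
}
```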
wolttam@reddit
Luckily, sandboxing within browsers is pretty easy (in fact, it's the default: browsers work very hard to make sure code running on a website can't break out and harm your system).
ramendik@reddit
Any idea what the "official" context window size is?
ZakoZakoZakoZakoZako@reddit
Holy shit mamba+attn might legit be viable and the way forward
Fuckinglivemealone@reddit
Why exactly?
PeruvianNet@reddit
Speculation. It's not. If it beat transformers it would be the default.
tiffanytrashcan@reddit
I mean in plenty of use cases it does beat "simple" transformers.
Sure, it's a little slower than a similarly sized model on my hardware, but the context window is literally ten times bigger, and it still fits in VRAM. It's physically impossible for me to run that context size on models with even half the parameters, RAM offload or not.
This is my experience with the older llama.cpp/koboldcpp implementations, before the latest fixes that should make it extremely competitive and equally fast.
I'm super excited for these new models. I'm imagining stupid token windows on a phone.
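Some rough, back-of-the-envelope numbers on why that happens, using a made-up 24-layer attention-only config rather than Granite's actual shape: attention pays a KV cache that grows linearly with context, while a Mamba/SSM layer keeps a fixed-size state, so a hybrid with only a few attention layers pays a fraction of this.

```ts
// Per-token KV cache cost for the attention layers of a hypothetical model.
// Mamba/SSM layers keep a constant-size state instead of caching K and V.
const layers = 24, kvHeads = 8, headDim = 64, bytesPerElem = 2; // fp16, made-up config

const kvCacheGB = (seqLen: number) =>
  (2 /* K and V */ * layers * kvHeads * headDim * bytesPerElem * seqLen) / 1e9;

console.log(kvCacheGB(8_192).toFixed(2));   // ~0.40 GB at 8K context
console.log(kvCacheGB(131_072).toFixed(2)); // ~6.44 GB at 128K, far more than a 1B model's fp16 weights
```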
PeruvianNet@reddit
It gives longer context. The performance in other ways degrades. It'll be forgotten once they do the image compression to text transformers.
On the phone I can see them making a few long mamba routines.
Straight_Abrocoma321@reddit
Maybe it's not the default because nobody has tried it on a large scale.
PeruvianNet@reddit
Nope. They tried and it didn't work well for anything besides context
EntireBobcat1474@reddit
Huh, this is kind of similar to Gemma and Gemini 1.5 in using N:1 interleaving of dense attention layers with something else; for Gemma it was a local windowed-attention transformer layer instead of an RNN layer, and at a more conservative 4-6:1 ratio. It's IMO a great idea: the main performance bottleneck in Mamba is a breakdown of inductive reasoning without dense attention, but dense attention is only needed relatively sparsely to develop the inductive biases that create those circuits. The quadratic bottleneck remains, so you'll still need a way to solve the quadratic communication overhead during training for long sequences, but it should be much cheaper to train now.
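A toy illustration of that N:1 interleaving idea; the layer counts and ratio below are arbitrary, not Granite's or Gemma's actual configuration:

```ts
// Build an illustrative hybrid stack: N Mamba/SSM layers per dense-attention layer.
type LayerKind = "mamba" | "attention";

function hybridLayout(totalLayers: number, ssmPerAttention: number): LayerKind[] {
  const layout: LayerKind[] = [];
  for (let i = 0; i < totalLayers; i++) {
    // Every (ssmPerAttention + 1)-th layer is dense attention; the rest are SSM.
    layout.push((i + 1) % (ssmPerAttention + 1) === 0 ? "attention" : "mamba");
  }
  return layout;
}

console.log(hybridLayout(12, 3));
// ["mamba", "mamba", "mamba", "attention", "mamba", "mamba", "mamba", "attention", ...]
```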
ZakoZakoZakoZakoZako@reddit
Oh wow, this is only using Mamba-2; I wonder how much it would improve with Mamba-3...
zhambe@reddit
This is impressive. I don't understand how it's built, but I think I get the implications -- this is not limited to browsers, one can use this model for tool calling in other contexts, right?
These are small enough you can run a "swarm" of them on a pair of 3090s
InterestRelative@reddit
Why would you need a swarm of same models?
Devourer_of_HP@reddit
One of the things you can do is have an agent choose what tasks need to be done based on the prompt sent to it, then delegate each task to a specialized agent.
So, for example, it receives a prompt to do preliminary data analysis on whatever you want; the orchestration agent receives the request, creates multiple subtasks, and delegates each one to an agent made for it, like having one made for querying the internet to find sources, and one to write Python code on the received data and show graphs.
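A sketch of that orchestrator pattern, where every "specialized agent" is the same small local model with a different system prompt; the agent names, prompts, and plan format are made up for illustration:

```ts
// Sketch of the orchestrator pattern described above: one small local model,
// several "specialized agents" that are really the same model with different
// system prompts. All names, prompts, and formats here are illustrative.
type Generate = (systemPrompt: string, userPrompt: string) => Promise<string>;

const agents: Record<string, string> = {
  researcher: "You search the web and cite sources for the task you are given.",
  analyst: "You write Python that analyzes the provided data and plots graphs.",
};

async function orchestrate(generate: Generate, request: string): Promise<string[]> {
  // 1. Ask the orchestrator to split the request into '<agent>: <task>' lines.
  const plan = await generate(
    "Split the user's request into subtasks, one per line, formatted as " +
      `'<agent>: <task>'. Known agents: ${Object.keys(agents).join(", ")}.`,
    request
  );

  // 2. Delegate each subtask to the matching "agent": same model, different prompt.
  const results: string[] = [];
  for (const line of plan.split("\n").filter((l) => l.trim())) {
    const [name, task] = line.split(/:\s*(.+)/s);
    const systemPrompt = agents[(name ?? "").trim()] ?? agents.researcher;
    results.push(await generate(systemPrompt, task ?? line));
  }
  return results;
}
```

Here `generate` stands in for whatever actually runs the model locally, e.g. a Transformers.js pipeline or a call to a llama.cpp server.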
InterestRelative@reddit
And this specialized agent - what's that? Is it the same LLM with a different system prompt and a different set of tools? Is it the same LLM with a LoRA adapter and a different set of tools?
Or is it a separate LLM?
In the first case you still have one model to serve, even if the prompts, tools, and adapters are different. Changing adapters on the fly should be fast since they're already in GPU memory and tiny; a few milliseconds, maybe.
In the second case you have a swarm of LLMs, but how useful is it to have 10x 2B models rather than a single 20B MoE for everything?
_lavoisier_@reddit
llama.cpp has WebAssembly support, so they probably compiled it to a Wasm binary and run it via JavaScript.
These-Dog6141@reddit
Can someone test it and report back on use cases and how well it works?
Juan_Valadez@reddit
Do you want a coffee?
These-Dog6141@reddit
yes, I'm getting a coffee now, good morning
PeruvianNet@reddit
I'm depressed and nothing is ever good and it only gets worse until we die.
twavisdegwet@reddit
You should probably pay for a frontier/closed model
PeruvianNet@reddit
Don't worry I'm being ironic. I love my q6 qwen 4b 0725
Silent_Employment966@reddit
This is cool, mind sharing it in r/AIAgentsInAction?
Famous-Appointment-8@reddit
So you are the Westbury?
Barry_Jumps@reddit
Actually pretty impressed by the nano model on WebGPU.
TechSwag@reddit
Off-topic, but how do people make these videos where the screen zooms in and out with the cursor?
Crafty-Celery-2466@reddit
Lots of apps, like Cap, Screen Studio, and more
Barry_Jumps@reddit
https://screen.studio/
padpump@reddit
You can do something like this with the built-in Zoom function of macOS
SnooMarzipans2470@reddit
Unsloth for fine-tuning when?