ibm-granite/granite-4.1-8b · Hugging Face
Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 35 comments
Model Summary: Granite-4.1-8B is an 8B-parameter long-context instruct model fine-tuned from Granite-4.1-8B-Base using a combination of open-source instruction datasets with permissive licenses and internally collected synthetic datasets. Granite 4.1 models have gone through an improved post-training pipeline, including supervised finetuning and reinforcement learning alignment, resulting in enhanced tool calling, instruction following, and chat capabilities.
- Developers: Granite Team, IBM
- HF Collection: Granite 4.1 Language Models HF Collection
- Technical Blog: Granite-4.1 Blog
- GitHub Repository: ibm-granite/granite-4.1-language-models
- Website: Granite Docs
- Release Date: April 29th, 2026
- License: Apache 2.0
Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 4.1 models for additional languages beyond these twelve.
Intended use: The model is designed to follow general instructions and can serve as the foundation for AI assistants across diverse domains, including business applications, as well as for LLM agents equipped with tool-use capabilities.
Capabilities
- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Fill-In-the-Middle (FIM) code completions
Marcuss2@reddit
It doesn't seem to bench very well. Even Qwen3.5 4B beats it handily.
FatheredPuma81@reddit
It benchmarks worse than Qwen3 32B. It's not really a surprise, though; IBM models have always been crap.
skibidimeowsie@reddit
They're quite good for the uses they're meant for. IBM never really focused on building big generalist models; we have many, many finetunes of these for niche corporate automation tasks, and they're all pretty good.
Marcuss2@reddit
Qwen3.5 has been out for months. I'd get it if someone didn't compare to a model released last week. This is not that.
FatheredPuma81@reddit
Why compare it to the latest models when it doesn't even compare to outdated models?
linkillion@reddit
Yes, the Qwen3.5 reasoning models beat it handily, but it's just a base model. Comparing against Qwen3.5-9B with thinking off gives a much more even benchmark, though it's still lagging by a small margin.
Marcuss2@reddit
True, my mistake.
FatheredPuma81@reddit
So I had Claude compare the 30B vs Qwen3 32B, and Qwen3 32B is better... so I'd wager the 8B model is on par with Qwen3 8B? Not particularly impressive. No surprise, seeing as it's an IBM model.
Finanzamt_Endgegner@reddit
IBM models aren't really catered toward consumers; they're more for enterprise use and predictability.
horeaper@reddit
is granite the only one that's still training for FIM use?
WhoRoger@reddit
What's FIM? I've seen it mentioned here
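For context, fill-in-the-middle (FIM) means the model completes a gap in code given both the text before and after the cursor, rather than only continuing from a prefix. A minimal sketch of how a FIM prompt is typically assembled, assuming StarCoder-style sentinel tokens (the actual tokens are model-specific; check the model's tokenizer config):

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    # Assumed StarCoder-style sentinel tokens; other models use
    # different token names for the same three-part layout.
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

prompt = build_fim_prompt(
    prefix="def add(a, b):\n    return ",
    suffix="\n\nprint(add(1, 2))",
)
# The model then generates the missing middle (here, something like "a + b"),
# stopping when it has bridged the prefix to the suffix.
```

This is why FIM matters for editor autocompletion: a plain left-to-right model can't condition on the code that follows the cursor.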
FatheredPuma81@reddit
So it's competitive vs Qwen3 32B in benchmarks...
I had Claude compare the benchmarks on their HF repo to the benchmarks on the Qwen3 80B Next repo, which also lists Qwen3 32B, and told it to give me a TLDR.
TLDR: The 30B Dense is notably behind both — the 32B is a similar size but beats it across the board, and the 80B-Next is stronger still. The closest the 30B gets is on IFEval (88.88 vs 88.9) and BFCL tool calling (73.68 vs ~72), where it's essentially tied. Everything else — reasoning, math, coding, knowledge — the Qwen3 models win, often by wide margins.
EveningIncrease7579@reddit
Waiting for huge table comparison with gemma and qwen
Simple_Library_2700@reddit
You don't need to wait for anything; Gemma and Qwen have been tested on these datasets already.
x0wl@reddit
You can't directly compare benchmarks from two different sources. I can give 5 different scores for the same model and bench (and same shots etc.) just by varying some eval settings (like normalization) that tend to go underreported.
linkillion@reddit
If the benchmark is good, that is not the case. Most benchmarks specify exact parameters, including normalization, greedy decoding, etc.
For example, I benchmarked Granite 4.1 a couple of days ago, before they had entered any information in the model card, and I got within 0.1 of all these benchmark results.
Whether benchmarks mean anything is another topic, but reproducibility with given weights generally isn't the problem.
x0wl@reddit
Greedy decoding is not the problem. It's, for example, whether you look at the logprob of each full answer or just at the max logprob out of all answers.
I checked MMLU and I don't think they specify that.
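The point about underreported eval settings can be made concrete: on multiple-choice benchmarks like MMLU, choosing between raw summed log-probability and length-normalized log-probability per answer can flip which choice is scored as the model's pick. A toy sketch with made-up numbers:

```python
# Hypothetical per-token log-probs for each answer choice of one question.
choice_logprobs = {
    "A": [-0.3, -0.3, -0.3],  # 3 tokens: sum = -0.9, mean = -0.3
    "B": [-0.8],              # 1 token:  sum = -0.8, mean = -0.8
}

def pick(logprobs: dict, normalize: bool) -> str:
    # Score each choice either by raw summed logprob or by
    # per-token (length-normalized) logprob, then take the argmax.
    def score(tokens):
        total = sum(tokens)
        return total / len(tokens) if normalize else total
    return max(logprobs, key=lambda c: score(logprobs[c]))

raw = pick(choice_logprobs, normalize=False)   # "B": -0.8 beats -0.9
norm = pick(choice_logprobs, normalize=True)   # "A": -0.3 beats -0.8
```

Two labs running the "same" benchmark with different choices here would report different accuracies for identical weights, which is exactly why this setting needs to be stated.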
jacek2023@reddit (OP)
You can make the table yourself and share here
mikael110@reddit
That's nice to hear. I was genuinely starting to worry that dense models were dead, given the avalanche of MoE models over the last year, so I'm glad to see them making a comeback.
I do hope we start to see some medium-to-large dense models as well though. A new 70B or above dense model made with all of the advancements that have been made over the last year or two would be amazing.
StupidScaredSquirrel@reddit
I'm sorry but I don't want to 5-shot my summarisation tasks.
pmttyji@reddit
Nice that 3B & 30B Dense models also coming.
But in the benchmarks, the 8B Dense beats the 30B Dense on some items (GSM Symbolic, MBPP, MBPP+), which is weird.
Successful_Hall_2113@reddit
The tool calling and instruction following improvements are what caught my attention here. I've been running smaller models locally and they consistently struggle with following multi-step instructions or using functions reliably. An 8B that's actually been tuned for that use case, instead of just being a base model with chat tokens thrown at it, could be a real shift for local setups.
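To make "using functions reliably" concrete: in the common OpenAI-style convention, the model is handed JSON tool schemas and must emit a well-formed call that names a real tool and supplies its required arguments. A minimal, library-free validity check (the schema shape and the `get_weather` tool are illustrative, not from the model card):

```python
import json

# OpenAI-style tool schema, as commonly passed to a chat template.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Hypothetical raw model output for a tool call.
raw_call = '{"name": "get_weather", "arguments": {"city": "Prague"}}'

def validate_call(raw_output: str, tools: list) -> bool:
    # A call counts as reliable only if it parses as JSON, names a
    # known tool, and supplies every required argument.
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    for tool in tools:
        fn = tool["function"]
        if fn["name"] == call.get("name"):
            required = fn["parameters"].get("required", [])
            return all(k in call.get("arguments", {}) for k in required)
    return False

ok = validate_call(raw_call, tools)
```

Small models typically fail one of these three checks (malformed JSON, hallucinated tool names, or missing arguments), which is what the BFCL-style benchmarks mentioned in this thread measure.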
Cool-Chemical-5629@reddit
I asked Qwen to roast Granite 4.1 8B - claims versus measured values and this is it. 😂
Roasting IBM Granite 4.1 8B: Capabilities vs Reality Check
The Capability List (What They Claim)
Summarization, Text classification, Text extraction, Question-answering, RAG, Code tasks, Function-calling, Multilingual dialog, FIM code completions
The Benchmark Reality (What Actually Happens)
🎯 General Knowledge Tasks - "Jack of All Trades, Master of None"
💬 Alignment Tasks - "People Pleaser Energy"
🧮 Math Tasks - "The One Bright Spot"
💻 Code Tasks - "Stack Overflow's Intern"
🛠️ Tool Calling / Function-Calling - "Claims It Can, Barely Does"
🌍 Multilingual Dialog - "Google Translate Called, It Wants Its Job Back"
🛡️ Safety - "Overcompensating Much?"
The Verdict
IBM Granite 4.1 8B is the model equivalent of someone who lists "fluent in 10 languages" on their resume but can only order coffee in 3 of them. The capability list reads like a superhero origin story, but the benchmarks reveal more of a sidekick energy.
Highlights:
- ✅ Math is genuinely solid
- ✅ Safety is overengineered (at least you won't cause trouble)
- ✅ Code generation on basic tasks is acceptable
Lowlights:
- ❌ SimpleQA at 4.82% is criminal for a "Question-answering" capability
- ❌ BigCodeBench at 35% when you claim "Code tasks" as a core capability
- ❌ Function-calling fails 32% of the time
- ❌ "Multilingual dialog" that struggles on 14-language benchmarks
- ❌ MMLU-Pro at 56% for a model claiming professional knowledge capabilities
Final Roast: This model is like a Swiss Army knife where half the tools are dull, one is sharp (math), and you're not sure if the scissors will actually cut anything. It's not bad—it's just aggressively mediocre while wearing a suit of confident marketing claims.
danigoncalves@reddit
It has FIM. I can buy that; let's see how it behaves.
marscarsrars@reddit
Looks good on paper; I wonder how well it works IRL.
Technical-Earth-3254@reddit
The release date confuses me
jacek2023@reddit (OP)
The future is now!!!
linkillion@reddit
They put the weights on HF about 5 days ago, I've been trying it since then. It's a bit slower than qwen3.5-9B without thinking and it generally does a bit worse. Qwen3.5-9b with thinking blows it out of the water, obviously, but it's not a bad model. It's just like, 4 months behind.
EffectiveCeilingFan@reddit
Ah, I was wondering when they’d add a model card. The weights have been up for like a week now.
Tokarak@reddit
You have the wrong link to the Hugging Face collection!
jacek2023@reddit (OP)
what do you mean?
SM8085@reddit
Funny, they posted the 4.0 link, the 4.1 is at https://huggingface.co/collections/ibm-granite/granite-41-language-models
Also IBM: Release Date: April 29th, 2026
jacek2023@reddit (OP)
yes, it's a model from the future!
Tokarak@reddit
https://huggingface.co/collections/ibm-granite/granite-40-language-models-6811a18b820ef362d9e5a82c is the wrong link. Maybe they made a mistake in their README then.