ibm-granite/granite-4.1-8b · Hugging Face
Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 35 comments
Model Summary: Granite-4.1-8B is an 8B-parameter long-context instruct model fine-tuned from Granite-4.1-8B-Base using a combination of open-source instruction datasets with permissive licenses and internally collected synthetic datasets. Granite 4.1 models have gone through an improved post-training pipeline, including supervised finetuning and reinforcement learning alignment, resulting in enhanced tool calling, instruction following, and chat capabilities.
- Developers: Granite Team, IBM
- HF Collection: Granite 4.1 Language Models HF Collection
- Technical Blog: Granite-4.1 Blog
- GitHub Repository: ibm-granite/granite-4.1-language-models
- Website: Granite Docs
- Release Date: April 29th, 2026
- License: Apache 2.0
Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 4.1 models for additional languages beyond these twelve.
Intended use: The model is designed to follow general instructions and can serve as the foundation for AI assistants across diverse domains, including business applications, as well as for LLM agents equipped with tool-use capabilities.
Capabilities
- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Fill-In-the-Middle (FIM) code completions
Marcuss2@reddit
It doesn't seem to bench very well. Even Qwen3.5 4B beats it handily.
FatheredPuma81@reddit
It benchmarks worse than Qwen3 32B. It's not really a surprise, though; IBM models have always been crap.
skibidimeowsie@reddit
They're quite good for the uses they're meant for. IBM never really focused on building big generalist models; we have many, many finetunes of these for niche corporate automation tasks, and they're all pretty good.
Marcuss2@reddit
Qwen3.5 has been out for months. I'd get it if someone didn't compare to a model released last week. This is not that.
FatheredPuma81@reddit
Why compare it to the latest models when it doesn't even compare to outdated models?
linkillion@reddit
Yes, the Qwen3.5 reasoning models beat it handily, but it's just a base model. Comparing against Qwen3.5-9B with thinking off gives a much more even benchmark, though it's still lagging by a small margin.
Marcuss2@reddit
True, my mistake.
FatheredPuma81@reddit
So I had Claude compare the 30B vs Qwen3 32B, and Qwen3 32B is better... so I'd wager the 8B model is on par with Qwen3 8B? Not particularly impressive. No surprise, seeing as it's an IBM model.
Finanzamt_Endgegner@reddit
IBM models aren't really catered toward consumers; they're more for enterprise use and predictability.
horeaper@reddit
is granite the only one that's still training for FIM use?
WhoRoger@reddit
What's FIM? I've seen it mentioned here
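For context, fill-in-the-middle (FIM) means the model completes a gap in code given both the text before and after the cursor, rather than only continuing from a prefix. A minimal sketch of how a FIM prompt is typically assembled, assuming StarCoder-style sentinel tokens (the actual tokens are model-specific; check the model's tokenizer config):

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    # Assumed StarCoder-style sentinel tokens; other models use
    # different token names for the same three-part layout.
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

prompt = build_fim_prompt(
    prefix="def add(a, b):\n    return ",
    suffix="\n\nprint(add(1, 2))",
)
# The model then generates the missing middle (here, something like "a + b"),
# stopping when it has bridged the prefix to the suffix.
```

This is why FIM matters for editor autocompletion: a plain left-to-right model can't condition on the code that follows the cursor.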
FatheredPuma81@reddit
So it's competitive vs Qwen3 32B in benchmarks...
I had Claude compare the benchmarks on their HF repo to the benchmarks on the Qwen3 80B Next repo, which also lists Qwen3 32B, and told it to give me a TLDR.
TLDR: The 30B Dense is notably behind both — the 32B is a similar size but beats it across the board, and the 80B-Next is stronger still. The closest the 30B gets is on IFEval (88.88 vs 88.9) and BFCL tool calling (73.68 vs ~72), where it's essentially tied. Everything else — reasoning, math, coding, knowledge — the Qwen3 models win, often by wide margins.
EveningIncrease7579@reddit
Waiting for huge table comparison with gemma and qwen
Simple_Library_2700@reddit
You don't need to wait for anything; Gemma and Qwen have been tested on these datasets already.
x0wl@reddit
You can't directly compare benchmarks from two different sources. I can give 5 different scores for the same model and bench (and same shots etc.) just by varying some eval settings (like normalization) that tend to go underreported.
linkillion@reddit
If the benchmark is good, that is not the case. Most benchmarks specify exact parameters, including normalization, greedy decoding, etc.
For example, I benchmarked Granite 4.1 a couple of days ago, before they had entered any information in the model card, and I got within 0.1 of all these benchmark results.
Whether benchmarks mean anything is another topic, but reproducibility with given weights generally isn't the problem.
x0wl@reddit
Greedy decoding is not the problem. It's, for example, whether you look at the logprob of each full answer or just at the max logprob out of all answers.
I checked MMLU and I don't think they specify that.
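The point about underreported eval settings can be made concrete: on multiple-choice benchmarks like MMLU, choosing between raw summed log-probability and length-normalized log-probability per answer can flip which choice is scored as the model's pick. A toy sketch with made-up numbers:

```python
# Hypothetical per-token log-probs for each answer choice of one question.
choice_logprobs = {
    "A": [-0.3, -0.3, -0.3],  # 3 tokens: sum = -0.9, mean = -0.3
    "B": [-0.8],              # 1 token:  sum = -0.8, mean = -0.8
}

def pick(logprobs: dict, normalize: bool) -> str:
    # Score each choice either by raw summed logprob or by
    # per-token (length-normalized) logprob, then take the argmax.
    def score(tokens):
        total = sum(tokens)
        return total / len(tokens) if normalize else total
    return max(logprobs, key=lambda c: score(logprobs[c]))

raw = pick(choice_logprobs, normalize=False)   # "B": -0.8 beats -0.9
norm = pick(choice_logprobs, normalize=True)   # "A": -0.3 beats -0.8
```

Two labs running the "same" benchmark with different choices here would report different accuracies for identical weights, which is exactly why this setting needs to be stated.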
jacek2023@reddit (OP)
You can make the table yourself and share here
mikael110@reddit
That's nice to hear. I was genuinely starting to worry that dense models were dead, given the avalanche of MoE models over the last year, so I'm glad to see them making a comeback.
I do hope we start to see some medium-to-large dense models as well though. A new 70B or above dense model made with all of the advancements that have been made over the last year or two would be amazing.
StupidScaredSquirrel@reddit
I'm sorry but I don't want to 5-shot my summarisation tasks.
pmttyji@reddit
Nice that 3B & 30B Dense models also coming.
But in the benchmarks, the 8B Dense beats the 30B Dense on some items (GSM Symbolic, MBPP, MBPP+), which is weird.
Successful_Hall_2113@reddit
The tool calling and instruction following improvements are what caught my attention here. I've been running smaller models locally and they consistently struggle with following multi-step instructions or using functions reliably. An 8B that's actually been tuned for that use case, instead of just being a base model with chat tokens thrown at it, could be a real shift for local setups.
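To make "using functions reliably" concrete: in the common OpenAI-style convention, the model is handed JSON tool schemas and must emit a well-formed call that names a real tool and supplies its required arguments. A minimal, library-free validity check (the schema shape and the `get_weather` tool are illustrative, not from the model card):

```python
import json

# OpenAI-style tool schema, as commonly passed to a chat template.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Hypothetical raw model output for a tool call.
raw_call = '{"name": "get_weather", "arguments": {"city": "Prague"}}'

def validate_call(raw_output: str, tools: list) -> bool:
    # A call counts as reliable only if it parses as JSON, names a
    # known tool, and supplies every required argument.
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    for tool in tools:
        fn = tool["function"]
        if fn["name"] == call.get("name"):
            required = fn["parameters"].get("required", [])
            return all(k in call.get("arguments", {}) for k in required)
    return False

ok = validate_call(raw_call, tools)
```

Small models typically fail one of these three checks (malformed JSON, hallucinated tool names, or missing arguments), which is what the BFCL-style benchmarks mentioned in this thread measure.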
Cool-Chemical-5629@reddit
I asked Qwen to roast Granite 4.1 8B - claims versus measured values and this is it. 😂
Roasting IBM Granite 4.1 8B: Capabilities vs Reality Check
The Capability List (What They Claim)
Summarization, Text classification, Text extraction, Question-answering, RAG, Code tasks, Function-calling, Multilingual dialog, FIM code completions
The Benchmark Reality (What Actually Happens)
🎯 General Knowledge Tasks - "Jack of All Trades, Master of None"
💬 Alignment Tasks - "People Pleaser Energy"
🧮 Math Tasks - "The One Bright Spot"
💻 Code Tasks - "Stack Overflow's Intern"
🛠️ Tool Calling / Function-Calling - "Claims It Can, Barely Does"
🌍 Multilingual Dialog - "Google Translate Called, It Wants Its Job Back"
🛡️ Safety - "Overcompensating Much?"
The Verdict
IBM Granite 4.1 8B is the model equivalent of someone who lists "fluent in 10 languages" on their resume but can only order coffee in 3 of them. The capability list reads like a superhero origin story, but the benchmarks reveal more of a sidekick energy.
Highlights:
- ✅ Math is genuinely solid
- ✅ Safety is overengineered (at least you won't cause trouble)
- ✅ Code generation on basic tasks is acceptable
Lowlights:
- ❌ SimpleQA at 4.82% is criminal for a "Question-answering" capability
- ❌ BigCodeBench at 35% when you claim "Code tasks" as a core capability
- ❌ Function-calling fails 32% of the time
- ❌ "Multilingual dialog" that struggles on 14-language benchmarks
- ❌ MMLU-Pro at 56% for a model claiming professional knowledge capabilities
Final Roast: This model is like a Swiss Army knife where half the tools are dull, one is sharp (math), and you're not sure if the scissors will actually cut anything. It's not bad—it's just aggressively mediocre while wearing a suit of confident marketing claims.
danigoncalves@reddit
It has FIM. I can buy that; let's see how it behaves.
marscarsrars@reddit
Looks good on paper; I wonder how well it works IRL.
Technical-Earth-3254@reddit
The release date confuses me
jacek2023@reddit (OP)
The future is now!!!
linkillion@reddit
They put the weights on HF about 5 days ago, I've been trying it since then. It's a bit slower than qwen3.5-9B without thinking and it generally does a bit worse. Qwen3.5-9b with thinking blows it out of the water, obviously, but it's not a bad model. It's just like, 4 months behind.
EffectiveCeilingFan@reddit
Ah, I was wondering when they’d add a model card. The weights have been up for like a week now.
Tokarak@reddit
You have the wrong link to the Hugging Face collection!
jacek2023@reddit (OP)
what do you mean?
SM8085@reddit
Funny, they posted the 4.0 link, the 4.1 is at https://huggingface.co/collections/ibm-granite/granite-41-language-models
Also IBM: Release Date: April 29th, 2026
jacek2023@reddit (OP)
yes, it's a model from the future!
Tokarak@reddit
https://huggingface.co/collections/ibm-granite/granite-40-language-models-6811a18b820ef362d9e5a82c is the wrong link. Maybe they made a mistake in their README then.