TheaterFire

Granite 4.0 Language Models - a ibm-granite Collection

Posted by rerri@reddit | LocalLLaMA | View on Reddit | 264 comments

Some Granite 4 models are now out. GGUF's are in the same repo.

Reply to Post

264 Comments

ibm@reddit

Let us know if you have any questions about Granite 4.0!
View on Reddit #67933897

jacek2023@reddit

so we have small, tiny and micro, can we also expect something bigger in the future as open weights too?
View on Reddit #67939007

ibm@reddit

Yes, we’re working on larger (and even smaller!) Granite 4.0 model sizes that we plan to release later this year. And we have every intention of continuing to release Granite under an Apache 2.0 license! [\- Emma, Product Marketing, Granite](http://ibm.biz/BdbxAZ)
View on Reddit #67942770

coder543@reddit

Any update on the larger Granite 4.0 models? It has been 6 months, and I’m afraid it isn’t happening anymore.
View on Reddit #81871884

ReallyFineJelly@reddit

Both larger and smaller models to come sound awesome. Thank you very much. Looking forward to see what's to come.
View on Reddit #67949285

JLeonsarmiento@reddit

🙈🖤👁️🐝Ⓜ️ thanks folks.
View on Reddit #67947822

jacek2023@reddit

thanks Emma, waiting for larger models then :)
View on Reddit #67946385

danielhanchen@reddit

Fantastic work as usual and excited for more Granite models! We made some dynamic Unsloth GGUFs and FP8 quants for those interested! https://huggingface.co/collections/unsloth/granite-40-68ddf64b4a8717dc22a9322d Also a free Colab fine-tuning notebook showing how to make a support agent https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Granite4.0.ipynb
View on Reddit #67938694

cyaxios@reddit

So far, I love the micro model and want to move some fine-tuning over to it for webgpu deployment...but I'm hitting a blocker with ONNX export for WebGPU deployment. I assume eventually it will be supported, but until then is there a workaround for exporting fine-tuned Granite 4.0 (unsloth) models to ONNX in the meantime
View on Reddit #68259708

beneath_steel_sky@reddit

Thanks so much! BTW I noticed your GGUFs have been updated, I was wondering what changed?
View on Reddit #68077315

crantob@reddit

And thank you, once again.
View on Reddit #67951784

danielhanchen@reddit

Thanks!
View on Reddit #67971111

alitanveer@reddit

What would you recommend for a receipt analysis and classification workload? I have a few million receipt image files in about 12 languages and need some way to extract structured data from them, or recreate them in HTML. Is the 3.2 vision model the best tool for that?
View on Reddit #67943238

ibm@reddit

We’d definitely recommend Granite-Docling (which was just released last week) for this. It handles OCR + layout + structure in one pipeline and converts images/documents into structured formats like HTML or Markdown, which sounds like what you’re going for. Only thing is that it’s optimized for English, though we do provide experimental support for Japanese, Arabic, and Chinese. [https://huggingface.co/ibm-granite/granite-docling-258M](https://huggingface.co/ibm-granite/granite-docling-258M)
View on Reddit #67950620

alitanveer@reddit

That is incredibly helpful and thank you so much for responding. We'll start with English only. I got a 5090 last week. Let's see if that thing can churn.
View on Reddit #67953035

up_the_irons@reddit

How has it been working so far on the 5090? :)
View on Reddit #68128538

alitanveer@reddit

Haven't started yet. [Decided to take apart the computer closet in honor of the new card. ](https://www.imgur.com/a/Va2eaW8)
View on Reddit #68138138

Mkengine@reddit

Does "optimized for english" mean "don't even try other European languages" or "other European languages may work as well"?
View on Reddit #67972600

Jastibute@reddit

I'm new to AI and interested in self hosting. What are the hardware requirements?
View on Reddit #68096585

false79@reddit

Unsloth references best practice settings for inference from Qwen (https://docs.unsloth.ai/models/qwen3-how-to-run-and-fine-tune#official-recommended-settings) Is there something similar for Granite 4.0 as well?
View on Reddit #68058616

Double_Cause4609@reddit

Is there any hope of getting training scripts for personalization and customization of the models? Bonus points if we can get access to official training pipelines so we can sidestep the Huggingface ecosystem's sequential expert dispatch issue that limits MoE training speed.
View on Reddit #67981937

shawntan@reddit

Granite team member here. Open LM Engine [https://github.com/open-lm-engine/lm-engine](https://github.com/open-lm-engine/lm-engine), the stack we use internally, has functionality to import Granite models. Another lightweight option if the concern is JUST the MoE implementation, is to do \`replace\_moe\` as described here in the README. That injects the forward pass in the HF implementation with scattermoe.
View on Reddit #67983746

Double_Cause4609@reddit

Oh that's an absolutely lovely note. Thanks so much for the \* Uh...Pointer. Thanks for the pointer.
View on Reddit #67988658

shawntan@reddit

# 🤣
View on Reddit #68017003

PigOfFire@reddit

I still love and use your 3.1 3B moe model <3 I guess I will give 7B-A1B a try :) Thank you!
View on Reddit #67936895

ibm@reddit

Awesome, thanks for the feedback! Really glad it’s working well for you 🔥
View on Reddit #68007812

Hertigan@reddit

Fantastic that you guys made it open weight!! Haven’t tried it out yet, but it looks amazing!
View on Reddit #67985402

lemon07r@reddit

What are the recommendations sampler and temperature settings for these models?
View on Reddit #67984454

teddybear082@reddit

Any vision models in the roadmap for this family?
View on Reddit #67973238

AMOVCS@reddit

**Thank you! We appreciate you making the weights available to everyone. It’s a wonderful contribution to the community!**
View on Reddit #67935762

ML-Future@reddit

Is there a Granite 4 Vision model, or will there be one?
View on Reddit #67968242

ibm@reddit

Appreciate the feedback! We’ll make sure this gets passed along to our research team. In 2024 we did release code-specific models, but at this point our newest models will be better-suited for most coding tasks. [https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330](https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330) [\- Emma, Product Marketing, Granite](http://ibm.biz/BdbxAZ)
View on Reddit #67936738

Dazz9@reddit

Any idea if it works somewhat with Serbian language, especially for RAG?
View on Reddit #67949706

ibm@reddit

Unfortunately not currently! Current languages supported are: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. We’re always looking to expand these though!
View on Reddit #67950705

Dazz9@reddit

Thanks for the answer! Guess it could be easy to fine tune, any example on how large the dataset should be?
View on Reddit #67951177

markole@reddit

Folks from Unsloth released a fine tuning guide: https://docs.unsloth.ai/new/ibm-granite-4.0 Share your results, I'm also interested in OCR and analysis of text in Serbian.
View on Reddit #67954205

Dazz9@reddit

Thanks for the link! I think I just need to get some appropriate dataset from HF.
View on Reddit #67956129

AMOVCS@reddit

Last year I recall using Granite Coder, it was really solid and underrated! It seems like a great time to make another one, especially given the popularity here of 30B to 100B\~ MoE models such as GLM Air and GPT-OSS 120B. People appreciate how quickly they run via APIs, or even locally at decent speeds, particularly on systems with DDR5 memory.
View on Reddit #67937552

JLeonsarmiento@reddit

Yes. An agentic coding focused model. Perhaps with vision capabilities. 🤞🤞
View on Reddit #67947449

danigoncalves@reddit

There is no way I could reinforce this more. Those sizes are the perfect ones for us GPU poor to have local coding models.
View on Reddit #67939117

aaronsb@reddit

Thank you for publishing usable edge compute models!
View on Reddit #67958048

jesus359_@reddit

Yeeeeeesss!! Ive always loved Granite models! You guys are awesome!
View on Reddit #67946224

ibm@reddit

https://i.redd.it/sj9g5hpzsqsf1.gif
View on Reddit #67957468

and_human@reddit

Hey IBM, I tried your granite playground, but it looks (the UI) pretty bad. I think it might be an issue with dark mode. 
View on Reddit #67955670

Few_Painter_5588@reddit

Any plans on keeping the reasoning and non-reasoning models seperate or will future models be hybrids?
View on Reddit #67935003

ibm@reddit

Near term: separate. Later this year we’ll release variants with explicit reasoning support. Worth noting that previous Granite models with reasoning include a “toggle” so you can turn on/off as needed. [\- Emma, Product Marketing, Granite](http://ibm.biz/BdbxAZ)
View on Reddit #67936695

x0wl@reddit

The reasoning version of this would be killer because it does not lose generation speed (as much as other models) as the context fills up. Do you plan to add reasoning effort control to the reasoning versions?
View on Reddit #67953057

daank@reddit

The apache 2 licensing is really appreciated!
View on Reddit #67950950

AlanzhuLy@reddit

Great work and amazing models! We've made Granite 4 running on Qualcomm NPU, so that it can be used across billions of laptops, mobiles, cars, and IoT devices, with both low-latency and energy efficiency! For those interested, Run Granite 4 today on NPU, GPU, and CPU with NexaSDK GitHub: [https://github.com/NexaAI/nexa-sdk](https://github.com/NexaAI/nexa-sdk) Step by step instruction: [https://sdk.nexa.ai/model/Granite-4-Micro](https://sdk.nexa.ai/model/Granite-4-Micro)
View on Reddit #67949377

MythOfDarkness@reddit

When Diorite?
View on Reddit #67948456

glassorangebird@reddit

What motivates you to release these great products for free?
View on Reddit #67938224

AlphaEdge77@reddit

from here: [https://huggingface.co/ibm-granite](https://huggingface.co/ibm-granite) IBM is building enterprise-focused foundation models to drive the future of business. The Granite family of foundation models span a variety of modalities, including language, code, and other modalities, such as time series. We strongly believe in the power of collaboration and community-driven development to propel AI forward. As such, we will be hosting our latest open innovations on this IBM-Granite HuggingFace organization page. We hope that the AI community will find our efforts useful and that our models help fuel their research.
View on Reddit #67946934

SkyLunat1c@reddit

Thanks for giving these out to the community! Are any of these new models currently used in Docling and are there plans to upgrade it with them?
View on Reddit #67936845

ibm@reddit

The Granite-Docling model is based on Granite 3 architecture. We wanted to get the Granite 4.0 text models to the community ASAP. Multimodal will build from there and we're hard at work keeping the GPUs hot as we speak! [\- Gabe, Chief Architect, AI Open Innovation](https://ibm.biz/Bdbxus)
View on Reddit #67942544

Elbobinas@reddit

Siuuuuuuuu
View on Reddit #67939330

stoppableDissolution@reddit

Are there by the chance any plans on making even smaller model? The big-attention architecture was godsent for me with granite3 2b, but its still a bit too big (and 3b is, well, even bigger). Maybe something <=1b?
View on Reddit #67936394

ibm@reddit

Yes, we’re working on smaller (and larger) Granite 4.0 models. Based on what you describe, I think you’ll be happy with what’s coming ☺️ \- [Emma, Product Marketing, Granite](http://ibm.biz/BdbxAZ)
View on Reddit #67938120

ironwroth@reddit

Congrats on the release! Day 1 llama.cpp / MLX support is awesome. Really wish more labs did this. Thanks for the hard work!
View on Reddit #67937556

intellidumb@reddit

Just want to say thank you!
View on Reddit #67936618

Admirable-Star7088@reddit

Thanks for the models, I will check them out! I have a question. I see that your largest version, 32B-A9B, is called "small". Does this mean that you plan to release more versions that are even bigger, such as "medium" and "large"? Models such as gpt-oss-120b and GLM 4.5 has proven that large models can run fast on consumer hardware, and even faster by offloading just the active parameters to the GPU. If you plan to release something larger and similar, such as Granite \~100b-200b with just a few active parameters, it could be extremely interesting.
View on Reddit #67936530

ApprehensiveAd3629@reddit

amazing work!
View on Reddit #67934471

ibm@reddit

Thank you!! 💙
View on Reddit #67934652

Fast-Act86@reddit

Hola, yo utilizó el modelo granite4:tiny-h para un agente de monitoreo de base de datos con python, funciona bien aquí les dejo un ejemplo de su respuesta: Ejemplo de salida: **Evento**: HighResourceQuery **Análisis**: La consulta 'UPDATE STATISTICS SCHEMA..TABLA WITH FULLSCAN' ha sido detectada como consumidora recursos excesivos. Esto puede deberse a una serie de factores, como la falta de índice adecuado en las columnas utilizadas, un escaneo completo de tablas (full scan) debido a la ausencia de índices relevantes o posibles bloqueos implícitos que afectan el rendimiento. **Solución**: Recomiendo realizar una revisión inmediata del plan de ejecución de esta consulta para identificar si realmente es necesario un escaneo completo de tabla. En caso de ser así, se deberían considerar la creación de índices apropiados en las columnas utilizadas en la consulta o en los filtros relacionados con 'TABLA '. También se puede analizar el uso de estadísticas y considerar actualizarlas para mejorar el rendimiento. Si la consulta es crítica, podría ser necesario suspenderla temporalmente hasta que se puedan realizar las optimizaciones necesarias.
View on Reddit #81386738

Kahvana@reddit

Great models so far! I would love to see the following configurations return that were previously offered: \- 1B A400M \- 3B A800M These really help in deployment on edge devices.
View on Reddit #69766008

Available_Load_5334@reddit

https://preview.redd.it/t1yflu5srqsf1.png?width=1117&format=png&auto=webp&s=af1035c6e21cf836c0d257931e1f99b9000d951c German "Who wants to be a Millionaire" benchmark. [https://github.com/ikiruneo/millionaire-bench](https://github.com/ikiruneo/millionaire-bench)
View on Reddit #67957022

EmployeeLogical5051@reddit

Human seems like a half decent model. 
View on Reddit #67983677

The_Best_Man_Alive@reddit

Total params: 100T Active params: 0
View on Reddit #68160269

MerePotato@reddit

Magistral getting less than Mistral Small makes me suspicious of the effectiveness of this bench
View on Reddit #67961924

Available_Load_5334@reddit

magistral is a reasoning model but chose not to think - probably because of the system prompt. maybe thats why.
View on Reddit #67962470

DukeMo@reddit

On the magistral card it has recommendations on how to get it to think using system prompt.
View on Reddit #67972794

Available_Load_5334@reddit

the choose for non thinking was deliberate. it would take my laptop hours to generate 2500+ answers with thinking enabled. more info on the repo
View on Reddit #67975180

MerePotato@reddit

Not a very fair test in that case, you'd be better off limiting it to instruct tunes
View on Reddit #67998635

Available_Load_5334@reddit

i agree. i'm just curious — this isn’t authoritative benchmark. the test is harsh and not well optimized for every model. i used a fixed prompt and recommended settings — whatever happens, happens.
View on Reddit #68008707

MerePotato@reddit

Make sure to use the Unsloth GGUF since that has the reasoning stuff baked in, use their recommend sampling params from the params file and llama.cpp launch command on the model page and use --special if using cpp. That ought to change your results and I'd be curious to see how different they are.
View on Reddit #67962985

AppearanceHeavy6724@reddit

Nemo has very good world knowledge for its size. I've asked several specific question about Central Asian region and the only model <= 32b that could answer it correctly was Nemo; neither Small 2506, nor Gemma 3 nor Qwen Models could do that.
View on Reddit #67995419

kevin_1994@reddit

makes sense, nemo is an older model trained for more world knowledge than these current generation of models that are more highly optimized for coding and stem
View on Reddit #67973277

MerePotato@reddit

Nemo is ancient, non reasoning and has half the parameters
View on Reddit #67973710

Federal-Effective879@reddit

These benchmark results really doesn't align with my personal experience using Granite 4 Small and various others listed here, though I was using the model mostly in English and some French, not German. For my usage, it's roughly on par with Gemma 3 27B in knowledge and intelligence. For me, it was slightly better than Mistral Small 3.2 in world knowledge but slightly worse in STEM intelligence. Granite 4 Small was substantially better than Qwen 3 30B-A3B 2507 in world knowledge, but substantially worse in STEM intelligence.
View on Reddit #67962805

Zc5Gwu@reddit

I think they said something about thinking models coming in the future.
View on Reddit #67981740

Zc5Gwu@reddit

Considering that it is an instruct model and not a thinking model it doesn't look bad at all.
View on Reddit #67981701

crantob@reddit

Thought this might be educational: Tiny models have difficulty thinking about the 'missing information' such as in the following problem: --------------------------------------- [ME] There are three sisters sitting in a bare room, Alice, Becky and Cindy. Alice is reading a book. Becky is playing chess against another player. What is Cindy most likely doing? [G4] Given the information provided, it's most likely that Cindy is also in the bare room, either sitting or standing, paying attention to what's happening in the room, possibly supporting either Alice or Becky during their activities (like helping Alice with a challenging passage in the book or assisting Becky during the chess game), or quietly engaged in an activity herself. However, since only three names are mentioned (Alice, Becky, and Cindy) and no activity is assigned to Cindy in the initial description, we acknowledge that there's some anticipation for a reasonable assumption based on the given context. Cindy could be doing any number of things, from assisting her sisters in their activities to pursuing her own hobby or interest in solitude. [ME] If Becky is playing chess, she is likely playing against another player, yes? Can we infer the name of that other player based on the givern information? [G4] Based on the information provided, we cannot infer the name of the other player Becky is playing chess against. The scenario describes the activities of three sisters: Alice, Becky, and Cindy, where Alice is reading a book and Becky is playing chess against another player, but no specific name or identity is given to Becky's opponent. Chess is often played against another player, but without additional context or information, it's not possible to determine who Becky's opponent is from the details given. The identity of Becky's chess opponent remains unknown based on the provided context. [ME] Given that there are only three people in the room, Alice, Becky and Cindy, we know that Becky is not playing chess versus alice, since Alice is reading a book. Who is the only other person in the room who could be playing chess with Becky? [G4] Given that Becky is playing chess against another player in the room, and considering the three individuals present—Alice, Becky, and Cindy—along with the information that Alice is occupied with reading a book and therefore cannot be playing chess with Becky— the only other person left to play chess with Becky, from the given scenario, would be Cindy. Since no other names or descriptions have been provided to clarify the connection between Cindy and Becky's interactive activity directly, and given that Alice's involvement is noted but not intersecting with the chess activity, it's reasonable to infer that Cindy is most likely the one playing chess with Becky. This inference is based purely on the exclusion of Alice from the chess-playing scenario due to her engagement with reading. ------------------------- So G4 managed to get the right answer 'Cindy' at the end without me mentioning the name, but I had to expllicitly explain the logic to be used. Still very logical patterns, just not enough of that 'ephemeral mind' that the big models seem to build-up.
View on Reddit #68098576

locpilot@reddit

\> IBM Granite 4.0: models for enterprise We are planning to create a demo to showcase using Granite 4.0 models in Microsoft Word locally. Could you suggest which model would resonate most with your enterprise audience? Below is one of our demonstrations for your reference: [https://youtu.be/9CjPaQ5Iqr0](https://youtu.be/9CjPaQ5Iqr0) The functionality in this demo is based on a local Word Add-in, ensuring that all data remains local and private.
View on Reddit #67947312

ibm@reddit

Granite 4.0 Small is our “enterprise workhorse”, but Granite 4.0 Tiny and Micro are specifically intended for local deployments so may be best to showcase one of those. Between those two it really just comes down to user preference between architectures (transformers-only versus hybrid SSM/transformers, MoE vs dense) \-[ Emma, Product Marketing, Granite](https://ibm.biz/Bdbxus)
View on Reddit #67952789

gptlocalhost@reddit

Thank you for your advice. The demo is online now: [https://youtu.be/63s8dMwfu1s](https://youtu.be/63s8dMwfu1s)
View on Reddit #68069472

Thrumpwart@reddit

Initial testing looks good. Incredible how quickly hybrid attention models are being released, and how big a difference they make for speed and memory usage.
View on Reddit #68062620

dheetoo@reddit

finally, I can start a retired plan for granite 3.3 8B, It been a loyal workhorse for my project for almost a year
View on Reddit #67935523

silenceimpaired@reddit

What type of stuff are you doing that Granite shines at?
View on Reddit #67942569

socialjusticeinme@reddit

I wanted to use a local LLM for a thing which required it to parse text that had very bad language in it, identify it, and then write a summary of what it found. Granite was the only one which wouldn’t immediately shut down if it found the N word and performed exactly as expected.  That was a year or so ago, but anytime I want to do business type stuff, I always use granite. 
View on Reddit #67976161

toothpastespiders@reddit

I do a fair amount with historical documents. For what it's worth I feel your pain. Even aside from that a lot of things that were just normal words back then have become offensive slang in the modern era. I've had stuff as mundane as shoe care shot down. I might have to give granite a shot.
View on Reddit #68055575

florinandrei@reddit

curling stones
View on Reddit #67968307

IrisColt@reddit

heh
View on Reddit #67987651

silenceimpaired@reddit

Weird, never would have saw them as the athletic type. :)
View on Reddit #67978361

ibm@reddit

Thank you for your service, Granite 3.3 8B https://i.redd.it/0mbehxm6dpsf1.gif
View on Reddit #67936826

chillahc@reddit

I still have troubles getting the granite-h models (tiny & small - lmstudio-community version) to generate any code via LM Studio & RooCode. Either the models are stuck in a weird answer loop (basically talking to itself, while answering auto-generated empty user messages) or it justs generates packets without ever handing off back to VS Code.
View on Reddit #68053936

Maykey@reddit

Sweet goodness, 7B-A1B sounds insanely fast. And the weekend is near. Perfect timing to play around with them Also on hf eg https://huggingface.co/ibm-granite/granite-4.0-micro displays arxiv as 0000.0000 (so are other [granites](https://huggingface.co/models?other=arxiv:0000.00000))
View on Reddit #67955652

ibm@reddit

Let us know what you think! What are you hoping to use it for?
View on Reddit #68007933

Maykey@reddit

Gonna try for story cowriting, small scripts coding. I actually hope someone will finetune them for coding.
View on Reddit #68019884

SpicyWangz@reddit

So far been very impressed with it for its size. It might be a new favorite for that range.
View on Reddit #67982914

Federal-Effective879@reddit

Nice models, thank you IBM. I've been trying out the "Small" (32B-A9B) model and comparing it to Qwen 3 30B-A3B 2507, Mistral Small 3.2, and Google Gemma 3 27B. I've been impressed by its world knowledge for its size class - it's noticeably better than the Qwen MoE, similar to or maybe slightly better than Mistral Small 3.2 as well, and close to Gemma 3 27B, which is my gold standard for world knowledge in this size class. I also like how prompt processing and generation performance stays pretty consistent as the context gets large; the hybrid architecture has lots of potential, and is definitely the future. Having llama.cpp support and official ggufs available from day zero is also excellent, well done. With the right system prompt, these models are willing to answer NSFW requests without restrictions, though by default they try to stay SFW, which makes sense for a business model. I'm glad it's still willing to talk about such things when authorized by the system prompt, rather than being always censored (like Chinese models), or completely lobotimized for any vaguely sensitive topic (like Gemma or GPT-OSS). For creative writing, the model seemed fairly good, not too sloppy and decent prompt adherence. By default, its creating writing can feel a bit too short, abrupt, and stacatto, but when prompted to write the way I want it does much better. Plots it produces could be more interesting, but maybe that could also be improved with appropriate prompts. For code analysis and summarization tasks, the consistent long context performance was great, though its intelligence and understanding was not at the level of Qwen 3 30B-A3B 2507 or Mistral Small 3.2, but not too bad either. I'd say its overall intelligence for various STEM tasks I gave it was comparable to Gemma 3 27B, or maybe slightly worse but close. It was substantially better than Granite 3.2 or 3.3 8B, but that was to be expected given its larger size. Overall, I'd say that Granite 4.0 Small is similar to Gemma 3 27B in knowledge, intelligence, and general capabilities, but with much faster long context performance, much lower long context memory usage, and it's mostly uncensored (with the right system prompt) like Mistral models. Granite should be a good tool for summarizing long documents efficiently, and is also good for conversation and general assistant duties, and creative writing. For STEM problem solving and coding, you're better off with Qwen 3 or Qwen 3 Coder.
View on Reddit #67960886

AppearanceHeavy6724@reddit

What is your take on GLM-4-32B, I am curious? In my tests world knowledge was above Qwen3-32b but less than Gemma 3 or even Small.
View on Reddit #67995661

Federal-Effective879@reddit

Sorry about the deleted comment, there was a Reddit bug where it made the comment appear duplicated for me. As I said earlier, my experience with GLM-4 32B's world knowledge was exactly in line with what you said. Slightly better than Qwen 3 32B, slightly worse than Mistral Small 3.2. What really impressed me about Granite 4.0 Small is that despite it being a MoE, its world knowledge was better than several modern dense models of the same size (GLM-4 32B and Qwen 3 32B). In terms of overall intelligence and capabilities, I found Qwen 3 32B and GLM-4 32B to be pretty similar. I haven't tried GLM 4.5 Air.
View on Reddit #68012953

AppearanceHeavy6724@reddit

No problems. GLM 4 is better at creative writing than Qwen 3 32b but worse at long context. Granites always had good world knowledge, 8b 3.1-3.3 granites are great at trivia. Nemo BTW also has good world knowledge too.
View on Reddit #68014117

Federal-Effective879@reddit

That’s exactly my experience with GLM-4 32B’s world knowledge.
View on Reddit #68002697

AppearanceHeavy6724@reddit

I guess I need to check the Granite Small today.
View on Reddit #68002936

ibm@reddit

Thank you so much for taking the time to thoroughly evaluate Granite 4.0 Small AND the time to share what you found. Feedback like this goes directly to our Research team so they can make future versions even stronger. Thanks again 🎉
View on Reddit #68008879

jarec707@reddit

I appreciate your thoughtful and helpful post. Good job mate
View on Reddit #67975129

AloneSYD@reddit

Can you we get a tutorial on how to finetune the MoE e.g. the tiny version?
View on Reddit #67958191

ibm@reddit

Yes, our friends at Unsloth posted a tutorial on this! [https://docs.unsloth.ai/new/ibm-granite-4.0#fine-tuning-granite-4.0-in-unsloth](https://docs.unsloth.ai/new/ibm-granite-4.0#fine-tuning-granite-4.0-in-unsloth)\`
View on Reddit #68009040

AloneSYD@reddit

Thanks!
View on Reddit #68012266

SeverusBlackoric@reddit

i tried to run it with llamacpp, but still don't figure it out yet why the speed really slow. My GPU is Rx 7900xt with 20GB ram. ❯ ./build/bin/llama-bench -m ~/.lmstudio/models/unsloth/granite-4.0-h-small-GGUF/granite-4.0-h-small-IQ4_XS.gguf -nkvo 1 -ngl 99 ggml_vulkan: Found 1 Vulkan devices: ggml_vulkan: 0 = Radeon RX 7900 XT (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat | model | size | params | backend | ngl | nkvo | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | --------------: | -------------------: | | granitehybrid ?B IQ4_XS - 4.25 bpw | 16.23 GiB | 32.21 B | Vulkan | 99 | 1 | pp512 | 297.39 ± 1.47 | | granitehybrid ?B IQ4_XS - 4.25 bpw | 16.23 GiB | 32.21 B | Vulkan | 99 | 1 | tg128 | 19.44 ± 0.02 |
View on Reddit #67948789

ibm@reddit

The hybrid architecture relies on Mamba-2 which in turn relies on the SSM\_SCAN operation. In llama.cpp, SSM\_SCAN was implemented for Mamba (not 2) and extended to support Mamba-2. One of the main innovations in Mamba-2 was the SSD (State Space Duality) algorithm which refactors the algorithm using semi-separable matrices leading to significant performance improvements. The SSD algorithm hasn't been implemented in llama.cpp yet since it relies on a number of other sub-operations that aren't there yet. We're actively working on this and will be contributing these optimizations to get the speed up to par with other engines based on torch and MLX. \- Gabe, Chief Architect, AI Open Innovation
View on Reddit #68008824

SeverusBlackoric@reddit

Can't not thank you enough Gabe !!! I have tried again with llama cpp ROCm build, and the generated token speed is 3 times faster, with Granite 4 h-tiny, i got about 60 t/s ❯ ./build_rocm/bin/llama-bench -m ~/.lmstudio/models/unsloth/granite-4.0-h-small-GGUF/granite-4.0-h-small-IQ4_XS.gguf -fa 0 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 ROCm devices: Device 0: Radeon RX 7900 XT, gfx1100 (0x1100), VMM: no, Wave Size: 32 | model | size | params | backend | ngl | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: | | granitehybrid ?B IQ4_XS - 4.25 bpw | 16.23 GiB | 32.21 B | ROCm | 99 | pp512 | 841.97 ± 5.21 | | granitehybrid ?B IQ4_XS - 4.25 bpw | 16.23 GiB | 32.21 B | ROCm | 99 | tg128 | 59.62 ± 0.03 | build: e308efda8 (6676)
View on Reddit #68009404

amamiyaharuka@reddit

Thank you so much for your clear answer!!!
View on Reddit #68009147

kevin_1994@reddit

- `-nkvo` puts gpu cache in ram right? probably slowing you down - `--flash-attn on` always a good move
View on Reddit #67973668

SeverusBlackoric@reddit

Thank you ! I tried again with flash attention on, but still the speed is still very slow, only 16 generated token per second. May be because of the Mamba Hybrid Architecture ? I'm not sure it is well supported by Llamacpp or not \`\`\` ❯ ./build/bin/llama-bench -m \~/.lmstudio/models/unsloth/granite-4.0-h-small-GGUF/granite-4.0-h-small-IQ4\_XS.gguf -fa 1 -ngl 99 ggml\_vulkan: Found 1 Vulkan devices: ggml\_vulkan: 0 = Radeon RX 7900 XT (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR\_coopmat | model | size | params | backend | ngl | fa | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: | | granitehybrid ?B IQ4\_XS - 4.25 bpw | 16.23 GiB | 32.21 B | Vulkan | 99 | 1 | pp512 | 303.54 ± 1.68 | | granitehybrid ?B IQ4\_XS - 4.25 bpw | 16.23 GiB | 32.21 B | Vulkan | 99 | 1 | tg128 | 16.40 ± 0.01 | build: 91a2a5655 (6670) \`\`\`
View on Reddit #67982564

SpicyWangz@reddit

That 7b model is pretty impressive. It's the only model of that size I've seen successfully name all gen 1 Pokemon. It definitely doesn't have the world knowledge of a larger dense model like Gemma 12b, but pretty impressive for its size. Interested to play around with it more and see what it's capable of
View on Reddit #67953665

ibm@reddit

I think you just introduced a new benchmark to make standard 🤔
View on Reddit #68009131

Original_Finding2212@reddit

Any of them perform well on small compute like Raspberry pi? And how well does the vision model is good with actions? (Tool use/mcps) A vision model with actions that performs well on raspberry pi would be very interesting to small scale roboticists
View on Reddit #67967872

ibm@reddit

Yes, our Granite 4.0 Micro model (hybrid) was built to run efficiently on hardware exactly like a Raspberry Pi. Currently Granite 4.0 models are text only.
View on Reddit #68009084

SpicyWangz@reddit

I'd be curious to see how the 7b-a1b performs on a Pi. I bet it could run decently well at Q4.
View on Reddit #67983067

Northern_candles@reddit

I couldn't find recommended inference settings anywhere. Can you share recommended temp etc settings please? Using small in LM studio
View on Reddit #67960438

ibm@reddit

These models are designed to be robust to all your favorite inference settings depending on the task. For tasks that need repeatability, greedy decoding should work well. For creative tasks, a higher temperature and corresponding sampling parameters can be tuned to get the performance you need. \- Gabe, Chief Architect, AI Open Innovation
View on Reddit #68008950

Practical-Hand203@reddit

Do you consider adding a model that would fit in (slightly under) 16GB RAM, given that's a very common configuration on many devices?
View on Reddit #67965670

ibm@reddit

Check out the Granite 4.0 Tiny and Micro models. For a context length of 128k and batch size of 1, we’re estimating Tiny to require \~8GB of memory and Micro (hybrid) to require \~4GB. The non-hybrid Micro model will require more memory at \~9GB.
View on Reddit #68008922

doomed151@reddit

I really appreciate open models. Thank you. Sometimes we tend to take it for Granite.
View on Reddit #67955198

ibm@reddit

https://i.redd.it/2ty11h3jawsf1.gif
View on Reddit #68008761

bennmann@reddit

Looking forward to a granite model that can perform well on new Gaia2 leaderboard eventually. Please keep making good models.
View on Reddit #67957367

ibm@reddit

No intention of stopping! 🫡
View on Reddit #68007876

Ok-Kangaroo6055@reddit

Nice, the old Granite 3 7B was one of my favourites under 30b. Hope these are better.
View on Reddit #67998990

JLeonsarmiento@reddit

Small is killing it in QwenCode CLI.
View on Reddit #67959212

NoFudge4700@reddit

Could you write down steps you did to use it with qwen cli?
View on Reddit #67988902

JLeonsarmiento@reddit

Create or add to the repository “.env” File in the project root folder or the folder from which you launch the terminal: # For QwenCode robot: OPENAI_API_KEY=your_api_key_here OPENAI_BASE_URL=your_api_endpoint OPENAI_MODEL=your_model_choice For example: # For QwenCode robot via lm studio OPENAI_API_BASE="http://localhost:1234/v1" OPENAI_API_KEY="whatever-NoAPIkeysOnLocal” OPENAI_MODEL=“defaultModelIdentifier-e.g.Granite4-small”
View on Reddit #67996921

danielhanchen@reddit

Made some dynamic Unsloth quants for Granite 4! https://huggingface.co/unsloth/granite-4.0-h-small-GGUF https://huggingface.co/unsloth/granite-4.0-h-tiny-GGUF https://huggingface.co/unsloth/granite-4.0-h-micro-GGUF Guide for fine-tuning and running at https://docs.unsloth.ai/new/ibm-granite-4.0
View on Reddit #67938076

dark-light92@reddit

Correct me if I'm doing something wrong but the vulkan build of llama.cpp is significantly slower than ROCm build. **Like 3x slower.** It's almost as if vulkan build is running at CPU speed...
View on Reddit #67953714

Mekfal@reddit

Rollback to v1.50.2 the ones after that seem to have a bug wherein they do not use GPU for processing.
View on Reddit #67993376

danielhanchen@reddit

Oh interesting unsure on Vulkan - it's best to open a Github issue!
View on Reddit #67971146

Hopeful_Eye2946@reddit

si, parece que no se puede usar bien con vulkan, da unos 4 a 10 tokens en graficas AMD, pero solo en CPU va de 20 a 40 tokens, aun esta verde ahi
View on Reddit #67969291

PaceZealousideal6091@reddit

Hi Daniel! Can you please confirm if this gguf supports mamba on lcpp?
View on Reddit #67952432

danielhanchen@reddit

Yes they work!
View on Reddit #67971122

Glum_Treacle4183@reddit

Thank you so much for your work!
View on Reddit #67939214

danielhanchen@reddit

:)
View on Reddit #67939446

lightstockchart@reddit

congrats. the tiny and micro models work fast and reliable, even on an AMD iGPU 780M
View on Reddit #67992319

cibernox@reddit

I tested the speed (not the quality) of both tiny models and I'm impressed.i reached 100tk/s on small prompts on the 3B one, which the fastest I've seen a 3B model be. Usually they hover around 80-82tk/s on my RTX3060. I did try some tool calling and they almost nailed the it. The 7BA1 was around the same speed, I was expecting it to be faster than the 3B. I tip off my hat, IBM.
View on Reddit #67991119

redditrasberry@reddit

will 4.x vision models be coming as well? these are great.
View on Reddit #67991018

planetearth80@reddit

Is this Ollama ready?
View on Reddit #67969841

KMaheshBhat@reddit

Yes. See: https://ollama.com/library/granite4
View on Reddit #67989163

ThunderBeanage@reddit

https://preview.redd.it/lkrp4a9f6psf1.png?width=1432&format=png&auto=webp&s=b685c9b83ef8ce705dd4b4ecf8dcf6e59905099b
View on Reddit #67934483

a_slay_nub@reddit

Any benchmark that puts llama 4 above....anything is not a benchmark I trust
View on Reddit #67935322

ForsookComparison@reddit

This is IFEVAL. Llama has always punched above its weight at following instructions. I think it's a super random choice to show off in a single benchmark jpeg.. but having used all of these for very wacky custom instruction sets, Maverick beating Kimi is believable.
View on Reddit #67936597

a_slay_nub@reddit

Interesting. I haven't really played with Maverick since we don't have the hardware for it, but Scout is impressively bad. It's practically a meme on our team how much I hate Scout.
View on Reddit #67939483

atineiatte@reddit

\>It's practically a meme on our team how much I hate Scout. That is the wildest and wackiest AI workplace anecdote I have ever heard
View on Reddit #67941347

a_slay_nub@reddit

Defense contractor so we're extremely limited on which models we can use(ironically we can't really use Llama either but our legal team is weird). This leaves us with an extremely limited subset of models. Basically, llama3.3, llama 4, gemma, mistral small, granite and a few others. I'm typically the one that sources the models, downloads them and am general tech support for how they're run. I was also one of the first to really play with llama 4 because of this. It broke my code so many times in ways that was just infuriating that llama 3.3 wouldn't do. Ironically, it's also slower than llama 3.3 despite having fewer active parameters, so there's really no benefit for us. Management wants to "push forward and use the latest and greatest," which leads to us pushing this subpar model that's worse and slower than what we already had. Slowly, as more of the team tries switching their endpoints to llama 4, they're realizing that I may actually be right and am not just a hater for haters sake.
View on Reddit #67944243

Educated_Bro@reddit

It seems the subtext of what you said is that “we can’t use any model coming out of China because it is a security risk” is there in fact a problem security wise with the Chinese models?
View on Reddit #67987062

kevin_1994@reddit

sounds like china=bad could you use gpt oss? it's much better than llama and also "american" (from openai)
View on Reddit #67973166

shockwaverc13@reddit

https://preview.redd.it/4fam699tppsf1.jpeg?width=1242&format=pjpg&auto=webp&s=a56b02784e1a9a14776b19c9d8f378f0ce0d9e10
View on Reddit #67941864

ForsookComparison@reddit

The problem is that at the 400B size most reasoning models can deal with *most* instruction sets just fine. So the only thing Maverick really stood out at was already "solved" for most use cases. Agreed with Scout though. I cannot find a single reason to use it.
View on Reddit #67939854

DinoAmino@reddit

I wish more models published benchmarks for IFEval. They seem to be conspicuously absent these days.
View on Reddit #67946945

noiserr@reddit

Seriously. Following instructions well is a make or break feature for people who do any kind of agentic or automation type tasks.
View on Reddit #67959007

steezy13312@reddit

Running this on llama.cpp with unsloth's Q4_K_XL, it's definitely slower than Qwen's 30B or gpt-oss-20b, both for prompt processing and token generation. (Roughly, where the earlier two are between 380-420tk/s pp, this is around 130 tk/s pp)
View on Reddit #67946222

random-tomato@reddit

If you're talking about the 32B, that's because Granite has 9B active whereas GPT OSS 20B has 3.6B active and Qwen3 30B has 3B active :)
View on Reddit #67985581

LinkSea8324@reddit

As of (at least yesterday), there was pretty much two family models working at very long context (+80k) : Qwen2.5 (1 M variant only) and Qwen3. What test exactly did you run to ensure long context capacities ? RULER ? Internal non published ones ?
View on Reddit #67984509

Pro-editor-1105@reddit

This seems pretty much perfect for a 48gb macbook tbh
View on Reddit #67972493

SpicyWangz@reddit

And the 7b-a1b is perfect for a 16gb MB, since it leaves a decent amount of RAM available for the OS still
View on Reddit #67983093

Stepfunction@reddit

I think the real star of the show here is the 3B models, which benchmark very well for their size, according to the self-reported results. I'll be curious to see how they pan out in practice. The 32B is a little underwhelming, especially when compared against Qwen3 30B-A3B.
View on Reddit #67934447

SpicyWangz@reddit

I'm really liking the 7B so far. It's outperforming most other 7b or 8b models I've tried before, and the a1b makes it very snappy
View on Reddit #67982565

ibm@reddit

We are also VERY excited for Granite 4.0 Micro. For Granite 4.0 Small, the price:performance ratio is worth checking out and also performance on tasks like instruction following and tool calling. [\- Emma, Product Marketing, Granite](http://ibm.biz/BdbxAZ)
View on Reddit #67936640

c--b@reddit

I'm super impressed with how well it handled attention with longer context conversations, most models of equivalent size (To granite 4 micro) lose the plot after like 2-4 paragraphs, for granite 4 it was able to recall the conversation after a fairly long conversation while remaining coherent and making good points, and recalling information. Definitely has my attention.
View on Reddit #67982212

walrusrage1@reddit

What languages have these been trained and tested on? Are they multilingual?
View on Reddit #67947702

ibm@reddit

Yes, supported languages are: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. We always welcome feedback for what languages are needed by our users, so let us know if there’s any other languages that you particularly need support for!
View on Reddit #67950593

OUT_OF_HOST_MEMORY@reddit

Russian, Ukrainian, and Polish would be very nice
View on Reddit #67981976

walrusrage1@reddit

Malay and Bahasa would also be very useful, but this is a great start! 
View on Reddit #67967088

x0wl@reddit

East Slavic stuff: Russian and Ukrainian in particular will be very useful for me
View on Reddit #67951272

Mysterious_Act_6040@reddit

sczz
View on Reddit #67973530

kevin_1994@reddit

No context limit is crazy. Im so excited for advancements in hybrid mamba architecture I wish there were a few more benchmarks but ill download it tonight and give it the vibe test
View on Reddit #67935292

SlaveZelda@reddit

hmmm i tried out the micro one 90k of context and its pretty bad
View on Reddit #67971973

ibm@reddit

We’re big fans of Mamba in case you couldn’t tell! We’ve validated performance up to 128k but with hardware that can handle it, you should be able to go much further. If you test with long context lengths, let us know how it goes! [\- Emma, Product Marketing, Granite](http://ibm.biz/BdbxAZ)
View on Reddit #67936781

silenceimpaired@reddit

Oh, I will. :) I use LLMs for brainstorming and holding my entire novel within view. Instead of having to reread the entire novel or take copious notes I update I have been chunking chapters through LLMs to answer questions about the novel. It will be interesting to see how you perform with the full text. Wish you guys implemented datasets focused on creative writing like LongPage… but I also get it probably isn’t your main focus… never the less I do think creative writing can help LLMs understand the world from a more human perspective and it pushes it to think in larger contexts.
View on Reddit #67942883

ibm@reddit

One of our release partners, Unsloth, published a fine-tuning notebook where they adapt Granite 4.0 into a support agent using data from a Google Sheet. Same process would work if you wanted to feed in creative writing samples instead. [https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Granite4.0.ipynb](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Granite4.0.ipynb)
View on Reddit #67950553

silenceimpaired@reddit

Awesome to see you partnering with them and others. I’ll have to try it
View on Reddit #67951779

ismail_the_whale@reddit

i missed this...where is this written down?
View on Reddit #67944157

kevin_1994@reddit

from the blog > > Unconstrained context length > > One of the more tantalizing aspects of state space model (SSM)-based language models like Mamba is their potential to handle infinitely long sequences. All Granite 4.0 models have been trained on data samples up to 512K tokens in context length. Performance has been validated on tasks involving context length of up to 128K tokens, but theoretically, the context length can extend further. > > In standard transformer models, the maximum context window is fundamentally constrained by the limitations of positional encoding. Because a transformer’s attention mechanism processes every token at once, it doesn’t preserve any information about the order of tokens. Positional encoding (PE) adds that information back in. Some research suggests that models using common PE techniques such as rotary positional encoding (RoPE) struggle on sequences longer than what they’ve seen in training.2 > > The Granite 4.0-H architecture uses no positional encoding (NoPE). We found that, simply put, they don’t need it: Mamba inherently does preserve information about the order of tokens, because it “reads” them sequentially.
View on Reddit #67947395

Hopeful_Eye2946@reddit

Probandolo en LMStudio con una grafica de AMD me da entres 4 a 6 tokens por segundo usando Vulkan en windows 11, pero con CPU en LMStudio Windows 11, son unos 18 a 25 tokens por segundo
View on Reddit #67969471

Finanzamt_Endgegner@reddit

yeah lmstudio has some issues it seems, im using vulcan too, since i can use flash attn with it even on my old rtx 2070 so i can run dual gpu with flash attn and 1m context, and i get only like 16t/s with 0 context, which is very slow for a 7ba1b. For comparison with the same setup with iq4xs qwen30b i get around 50-60t/s. So ig its not working as intended rn.
View on Reddit #67970491

Odd_Material_2467@reddit

Please for all that is holy, include the param number in the model name
View on Reddit #67934421

ibm@reddit

Thanks for the feedback! This has been a thorny issue as the mapping from total param count to both speed and VRAM requirements has changed with the introduction of MoE and hybrid model architecture components. We opted for the simple T-shirt size naming to avoid trying to pack too much information into the name with qualifiers. As pointed out above, you can still see the parameter counts on HF. You can also retrieve the model size for any model with this handy script: \`#!/usr/bin/env bash curl -s $1 | grep -A 3 "Model size" | grep params | cut -d'>' -f2 | cut -d' ' -f 1’ [\- Gabe, Chief Architect, AI Open Innovation](https://ibm.biz/Bdbxus)
View on Reddit #67938018

redblobgames@reddit

What is `$1` here? the hugging face url?
View on Reddit #67965522

ZookeepergameOver476@reddit

Yep, this was a case of formatting not translating well to Reddit. $1 is the URL to the model in HF. [\- Gabe, Chief Architect, AI Open Innovation](https://www.linkedin.com/in/gabe-goodhart-6676ba30/)
View on Reddit #67966495

SkyFeistyLlama8@reddit

Thank you IBM for the release! I think you should put the dense and MOE active params so we know which models might work better on CPU or GPU, just in case. For example, Granite 4.0 H Small should be Granite 4.0 Small 32B-A3B.
View on Reddit #67943985

robberviet@reddit

Same. Huggingface having params number helps, but in name would be better.
View on Reddit #67936018

Amazing_Athlete_2265@reddit

It's my bedtime so I am unable to test. I've been looking forward to Granite 4 so excited to put it through it's paces tomorrow! Thanks for the open source things IBM!
View on Reddit #67937499

ibm@reddit

https://i.redd.it/33j51pvwrpsf1.gif
View on Reddit #67942608

Amazing_Athlete_2265@reddit

Putting the micro and tiny models through my evals now. Responses seem pretty good so far. Interestingly, the micro model runs my 3080 at full power (340W) whereas the tiny only draws about 220W. Still waiting on token rate data. Thanks again for the small models!!
View on Reddit #67965507

Practical-Hand203@reddit

When will you make them available on the ollama library?
View on Reddit #67965447

hehsteve@reddit

Any advice on structured output?
View on Reddit #67962831

NoFudge4700@reddit

So, it’s not a coding model, right? Might write code but not intended for coding. Correct me someone if I’m wrong.
View on Reddit #67940812

ibm@reddit

It wasn’t built solely as a code model family like we’ve done with our previous Granite Code family. But the combination of FIM, tool-calling, long context, and training on more than 100 programming languages make it a solid option if you want a small model for coding tasks.
View on Reddit #67948639

MerePotato@reddit

Happy to see another general purpose open weight release personally, we have no shortage of solid coding models of late anyway
View on Reddit #67962050

NoFudge4700@reddit

I’d love to see a benchmark against other open weight models in the same category.
View on Reddit #67948735

OcelotMadness@reddit

Oh wow. 7B 1A is a new size for me. I hope it ends up being good. That could go hard for Text Adventure finetuning.
View on Reddit #67959817

WithoutReason1729@reddit

Your post is getting popular and we just featured it on our Discord! [Come check it out!](https://discord.gg/PgFhZ8cnWW) You've also been given a special flair for your contribution. We appreciate your post! *I am a bot and this action was performed automatically.*
View on Reddit #67958549

SlaveZelda@reddit

I think these would be great for very long context tasks
View on Reddit #67957935

Dyapemdion@reddit

Nice Thank You
View on Reddit #67952526

ibm@reddit

https://i.redd.it/o4s4lkfutqsf1.gif
View on Reddit #67957808

Porespellar@reddit

Any vision + reasoning + tool calling combo models in the pipeline coming anytime soon?
View on Reddit #67956275

silenceimpaired@reddit

Llama support is already merged?
View on Reddit #67937516

rerri@reddit (OP)

Llama.cpp already supports this, yes. Running the 32B currently.
View on Reddit #67937693

silenceimpaired@reddit

Working well? I’m sad it isn’t 32b dense
View on Reddit #67941565

ttkciar@reddit

> I’m sad it isn’t 32b dense That was my first reaction too, but it uses 9B active parameters, and the Granite3 8B-dense was almost useful. Looking forward to putting the 32B-A9B through my testsuite.
View on Reddit #67953839

danielhanchen@reddit

Yes it works! Made some dynamic Unsloth quants at https://huggingface.co/unsloth/granite-4.0-h-small-GGUF
View on Reddit #67938492

ClearApartment2627@reddit

The largest model is the "small" variant. Do I infer correctly that larger ones are in the works?
View on Reddit #67934136

ibm@reddit

Yes, we’re working on larger (and even smaller!) Granite 4.0 model sizes that we plan to release later this year.
View on Reddit #67934675

Finanzamt_Endgegner@reddit

Smaller? you are insane 😅(in the good way)
View on Reddit #67951726

cms2307@reddit

3b a0.2b🤔
View on Reddit #67937018

x0wl@reddit

Would be so cool for my Chromebook with 8GB ram and no GPU lol
View on Reddit #67951060

ab2377@reddit

😄
View on Reddit #67937661

ClearApartment2627@reddit

Thanks! I will try out the small variant.
View on Reddit #67938857

crapaud_dindon@reddit

How good is the multi-language support of those models? I am asking mostly for french/english comprehension.
View on Reddit #67946429

ibm@reddit

On the model cards there is a section that lists performance on a few benchmarks for multilingual tasks and the languages they were tested on (French was included for all of them). [https://huggingface.co/ibm-granite/granite-4.0-h-small#:\~:text=64.69-,Multilingual%20Tasks,-MULTIPLE](https://huggingface.co/ibm-granite/granite-4.0-h-small#:~:text=64.69-,Multilingual%20Tasks,-MULTIPLE)
View on Reddit #67950653

chillahc@reddit

What's the difference between these 2 model variants? What does the "h" stand for? * [https://huggingface.co/ibm-granite/granite-4.0-micro](https://huggingface.co/ibm-granite/granite-4.0-micro) * [https://huggingface.co/ibm-granite/granite-4.0-h-micro](https://huggingface.co/ibm-granite/granite-4.0-h-micro) The Intended use-description is almost identical, just a small difference at the end: **"granite-4.0-micro" –** The model is designed to follow general instructions and can serve as the foundation for AI assistants across diverse domains, including business applications, `as well as for LLM agents equipped with tool-use capabilities.` **"granite-4.0-h-micro" –** The model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including business applications. Can somebody explain? Just wanted to understand, since the unsloth variants are all based on the "h"-variants. Thanks! 😎👋
View on Reddit #67945861

ibm@reddit

The “H” stands for hybrid! Most of the Granite 4.0 models use a hybrid Mamba-2/transformers architecture. For Micro in particular, we released two models: one with the new hybrid architecture, and another with the traditional transformers architecture used in previous Granite models. They’re both intended for the same use cases, but the the non-hybrid variant is an alternative for use where Mamba-2 support is not yet optimized. Our blog goes into more details: [https://ibm.biz/BdbxVG](https://ibm.biz/BdbxVG)
View on Reddit #67947388

chillahc@reddit

Thank you for explaining, will have a look 👀👍
View on Reddit #67948880

exaknight21@reddit

/u/ibm do you guys plan on providing support for awq-marlin? It’s higher accuracy and less resources deployment via vLLM is extremely efficient. I’d love your thoughts on this subject. Religiously watch your youtube series and find it extremely helpful.
View on Reddit #67940448

ibm@reddit

Thanks for the suggestion! No plans for awq\_marlin right now, but we're always exploring ways to run models more efficiently, so we'll definitely look into it. \- Gabe, Chief Architect, AI Open Innovation
View on Reddit #67948609

this-just_in@reddit

Watch user cpatonn on HF.
View on Reddit #67940944

exaknight21@reddit

I looked at his qwen3:4b-instruct-2507-awq. I was not able to run it with vLLM. But to be honest, I tried it once only.
View on Reddit #67941066

this-just_in@reddit

I don’t know about that one specifically but I use his Qwen3 30B and 80B quants just fine!
View on Reddit #67941143

greenreddits@reddit

what's the difference between the 'base' version and the default one in GGUF ? For summarizing long academic texts, which version Q2-Q8 would be best ? What's the difference between them ?
View on Reddit #67942221

ibm@reddit

The base GGUFs are converted from the base (not instruct tuned) models, so they're great as a starting point for fine tuning or other non-chat uses. The instruct tuned models are best for instruction following, tool calling, and other chat-based interactions. In terms of which quantization to use, we typically see the best performance/size ratio around Q4. Depending on the sensitivity of your task to slight noise, you may need to try larger quantizations or may be able to get away with very small sizes for simpler tasks. \- Gabe, Chief Architect, AI Open Innovation
View on Reddit #67948578

ontorealist@reddit

The default is an instruction model ideal as an assistant, while the base model is for text completion given a set of text. Q4 is generally ideal for most tasks and machine such as summarization, RAG, etc. Higher Q5-Q6 models are typically close enough to Q8 or full precision but higher will be generally better for accuracy / STEM-loaded tasks. Links to Unsloth’s GGUFs can be found in this thread, where you’ll find UD-Q4_K_XL which is likely solid baseline to try for longer 12K+ context windows before trying higher quants. Unsloth’s documentation is a good primer if you want to learn more about quantization methods, what works for your machine / use case.
View on Reddit #67945904

PigOfFire@reddit

Would you please include Polish lang too in future?
View on Reddit #67943646

ibm@reddit

Noted and will pass along to our research team! They always want to hear what languages a lot of people are asking for, and I think we’ve had Polish requested a few times before. Thanks!
View on Reddit #67948522

Marcuss2@reddit

I would like to see benchmark comparisons to similar models. Can anyone compile that easily?
View on Reddit #67944390

ibm@reddit

We have a variety of comparisons in our release blog (benchmark performance, speed, memory requirements, etc.) [https://www.ibm.com/new/announcements/ibm-granite-4-0-hyper-efficient-high-performance-hybrid-models](https://www.ibm.com/new/announcements/ibm-granite-4-0-hyper-efficient-high-performance-hybrid-models)
View on Reddit #67948494

mgr2019x@reddit

Interesting sizes! Thx 🙏
View on Reddit #67944828

Admirable-Star7088@reddit

Maybe I'm blind, but where do I find the recommended interference settings? I was going to test the 32B-A9B version, but I have no idea what settings I should use for best performance?
View on Reddit #67937789

ibm@reddit

These models are designed to be robust to all your favorite inference settings depending on the task. For tasks that need repeatability, greedy decoding should work well. For creative tasks, a higher temperature and corresponding sampling parameters can be tuned to get the performance you need. [\- Gabe, Chief Architect, AI Open Innovation](https://ibm.biz/Bdbxus)
View on Reddit #67942653

Admirable-Star7088@reddit

I see, thanks for the reply!
View on Reddit #67943320

TechSwag@reddit

I may be blind, but I don't see the recommended parameters for running the model.
View on Reddit #67938414

ibm@reddit

These models are designed to be robust to all your favorite inference settings depending on the task. For tasks that need repeatability, greedy decoding should work well. For creative tasks, a higher temperature and corresponding sampling parameters can be tuned to get the performance you need. [\- Gabe, Chief Architect, AI Open Innovation](https://ibm.biz/Bdbxus)
View on Reddit #67942718

pmttyji@reddit

Yeah, finally! MOEs!
View on Reddit #67935047

ibm@reddit

https://i.redd.it/pwk9torlrpsf1.gif
View on Reddit #67942476

silenceimpaired@reddit

Personally, I’m excited to run the small large language model… sigh. Small large.
View on Reddit #67942397

No_Conversation9561@reddit

https://preview.redd.it/f77pzcwoopsf1.jpeg?width=1284&format=pjpg&auto=webp&s=e1def45e0b7595755e1bda0422bb6e717d4ae553 coincidence? that I got ad from ibm Official while ibm Official is in the comments?
View on Reddit #67941369

dinerburgeryum@reddit

Congrats on the release! I’ve been eagerly awaiting this one; arguably the most space efficient Attention implementation out there right now. 
View on Reddit #67941039

ArcherAdditional2478@reddit

It's a bit embarrassing that this model still performs poorly in non-English languages. Most models have come a long way in this regard, but Granite remains poor. Why?
View on Reddit #67939772

jacek2023@reddit

....finally!!! Thanks IBM for the GGUFs
View on Reddit #67938859

Beneficial-Good660@reddit

A bad model, something like a falcon32b. I asked him to create an HTML landing page based on the specifications, but he didn't even understand what was needed and simply copied the specifications. Then, when I asked him to do it again, he started writing nonsense about it being technically difficult. Then he somehow managed to get it done (I asked for it in one file, but he did it in chunks of code and in different files). After he finally did create the website, it's really bad. All the models I tested, even the older ones, were better.
View on Reddit #67937651

dheetoo@reddit

the task that shining for me is I use a very small model (like 3B in this release) as a bridge model between the workflow like an aggregator model instead of a user facing or coding model
View on Reddit #67938271

Beneficial-Good660@reddit

Why are you writing this to me? If you want advice, take qwen4b. I tested a couple more simple queries with easy logic, but he doesn't even understand what's being asked, so I deleted it. My blacklist is granit, exaone, and falcon. I'm downloading April now, we'll see what it's like. And to the developers who dislike things, my advice: do it properly, and you'll be treated well.
View on Reddit #67938605

NoFudge4700@reddit

I’m on mobile and can’t find GGUFs, any king person to please link them or OP?
View on Reddit #67936072

danielhanchen@reddit

Made some dynamic Unsloth quants as well :) https://huggingface.co/unsloth/granite-4.0-h-small-GGUF
View on Reddit #67938528

rerri@reddit (OP)

Added GGUF collection to OP.
View on Reddit #67936241

PermanentLiminality@reddit

I see some unsloth quants for the 32B model. Does llama.cpp support this?
View on Reddit #67938085

danielhanchen@reddit

Yes it should work!
View on Reddit #67938386

MDT-49@reddit

My Raspberry Pi is so ready for this!
View on Reddit #67935313

CatDaddy1776@reddit

nice. what are ya thinkin about building with the pi?
View on Reddit #67938114

ForsookComparison@reddit

I really really want Granite to succeed. We need another Western mega-corp to start competing in this space.
View on Reddit #67936436

mumblerit@reddit

from what ive seen IBM/Redhat are doing a lot, just maybe not as flashy
View on Reddit #67937676

igorwarzocha@reddit

Did anyone say FIM and not explicitly mention code? I'm all ears.
View on Reddit #67934664

Zc5Gwu@reddit

Yes, glad I’m not the only one excited about a new FIM model.
View on Reddit #67937612

ForsookComparison@reddit

> 32B A9B I am very excited to try this
View on Reddit #67936476

beneath_steel_sky@reddit

So... - 2 dense (Micro 3.4B, H Micro 3.19B) - 2 MoE (H Tiny 6.94B, H Small 32.2B)
View on Reddit #67934564

ironwroth@reddit

holy shit finally
View on Reddit #67934400

Stepfunction@reddit

These are hybrid MoE models: https://preview.redd.it/nu84v9km5psf1.png?width=651&format=png&auto=webp&s=ecd88f1be10032a2791304dee18a0ab1d4b049d7
View on Reddit #67934209