Granite 4 release today? Collection updated with 8 private repos.
Posted by ironwroth@reddit | LocalLLaMA | View on Reddit | 43 comments

silenceimpaired@reddit
Doesn’t seem to be.
Cool-Chemical-5629@reddit
I hope you all are enjoying the new Granite 4 models released yesterday... 🤣
-dysangel-@reddit
I wonder if the Qwen 3 Next release forced their hand. Looking forward to ever more efficient attention, especially on larger models :)
stoppableDissolution@reddit
I don't think they are anywhere near competitors. Granite 3 was intentionally a great "utility" model to be used in agents rather than a jack of all trades like Qwen, and I hope they will keep it that way.
-dysangel-@reddit
I see Qwen 3 Next as basically a turbocharged utility model too; it's only using 40GB of RAM.
ironwroth@reddit (OP)
I don’t think so. They said end of summer when they posted the tiny preview a few months ago.
-dysangel-@reddit
Yeah I'm aware it's been on the cards for a while, but it's very interesting timing. I've just been testing Qwen 3 Next out locally on Cline - it's a beast. If Granite has some larger, smarter models with linear prompt processing then I really don't need cloud agents any more
SkyFeistyLlama8@reddit
Since I'm waiting for Qwen Next support to drop for llama.cpp, how does it compare to GPT OSS 20B for agent work?
-dysangel-@reddit
I haven't really tried that one properly. Harmony support was awful when it came out, and I've been using GLM 4.5 Air for everything since then.
DistanceAlert5706@reddit
Don't expect much, benchmarks for agentic tasks for Qwen Next are terrible.
DealingWithIt202s@reddit
Wait Qwen3 Next has llama.cpp support already? I thought it was months away.
-dysangel-@reddit
nope, I'm using it on mlx
Cool-Chemical-5629@reddit
Remember when they created this collection for the first time and everyone started hyping that Granite 4 was coming soon, only for them to hide the collection and keep us waiting some more until they released the tiny preview model?
Well, this time the models seem to have actually been added, as the collection already contains 10 items, but is that an actual guarantee that they will be releasing it today? I don't think so.
I'm glad it's on the way, though. Better late than never, but I guess it's not time to start the hype train engine just yet.
Besides, I don't think there is support for this in llama.cpp yet, and unlike the Qwen team, IBM does not have its own chat website where we could play with the model while we wait for support in what I believe is among the most popular inference engines in the local community.
ironwroth@reddit (OP)
There is support in llama.cpp already, and one of the IBM guys just did the same for mlx-lm a few days ago.
Cool-Chemical-5629@reddit
Wasn't the support just for the tiny model though?
ironwroth@reddit (OP)
The support is for the Granite 4 model architecture itself. It's not specific to just the tiny version.
Ok-Possibility-5586@reddit
Looks like only tiny is available on huggingface.
I haven't spent the time to look on IBM's own site, but it would be good if they had a midrange model, somewhere in the 20-30B range.
stoppableDissolution@reddit
I really wish there were a <1B dense model, too. Other vendors' tiny models are very meh for attention-based tasks, and Granite 3 2.4B, while good, is overkill :c
ttkciar@reddit
I'm looking forward to it. Granite-3 was underwhelming overall, but punched above its weight for a few task types (like RAG).
I'm mindful of my Phi experiences. Phi, Phi-2, and Phi-3 were "meh", but then Phi-4 came out and became my main go-to model for non-coding STEM tasks.
The take-away there is that sometimes it just takes an LLM R&D team time to find their stride. Maybe Granite-4 is where IBM's team finds theirs? We will see.
dazl1212@reddit
Which one was good for rag?
ttkciar@reddit
Granite3-8B dense.
dazl1212@reddit
So you use it as your embedding model or the LLM?
ttkciar@reddit
I might use it as the LLM, but my RAG implementation doesn't use an embedding model, and my usual LLM for final inference is Gemma3-12B or Gemma3-27B.
My RAG implementation uses a HyDE step before traditional LTS via Lucy Search, which indexes documents as text, not embeddings.
The HyDE step helps close the gap between traditional LTS and vector search by introducing search terms which are semantically related to the user's prompt.
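Roughly, a HyDE expansion step like that could look like the following sketch; the llama.cpp-style /completion endpoint and the prompt wording are illustrative assumptions, not the actual implementation:

```python
import requests

def hyde_expand(user_prompt: str, endpoint: str = "http://localhost:8080/completion") -> str:
    """Ask a small LLM for a hypothetical answer, then reuse its wording as extra lexical search terms."""
    hyde_prompt = (
        "Write a short passage that plausibly answers the question below.\n\n"
        f"Question: {user_prompt}\nPassage:"
    )
    # A llama.cpp-style /completion endpoint is assumed here; any local inference server would do.
    resp = requests.post(endpoint, json={"prompt": hyde_prompt, "n_predict": 200})
    hypothetical = resp.json().get("content", "")
    # Combining the original prompt with the hypothetical passage lets a purely lexical
    # engine like Lucy Search match terms the user never typed but that are semantically related.
    return f"{user_prompt} {hypothetical}"
```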
Lucy Search then retrieves entire documents rather than vectorized chunks. The top N scored documents' sentences are weighted according to prompt word occurrence, and an nltk/punkt summarizer prunes the retrieved content until the N documents' summaries fit within the specified context budget. This gives me a context much more densely packed with relevant information, and less relevant information is lost across chunk boundaries.
That summarization step with that technology precludes the pre-vectorization of the documents, but with a lot of work it should be possible to make a summarizer for vectorized content. So far I haven't found it worthwhile to prioritize that work.
The summarized retrieved content is then vectorized at inference time, and final inference begins.
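Roughly, the weighting-and-pruning stage amounts to something like this sketch; the helper name, the character-based budget, and the plain whitespace word matching are simplifying assumptions rather than the actual code:

```python
import nltk
from nltk.tokenize import sent_tokenize  # punkt-based sentence splitter

nltk.download("punkt", quiet=True)  # fetch the punkt model once

def prune_to_budget(documents: list[str], prompt: str, budget_chars: int) -> str:
    """Weight each retrieved document's sentences by prompt-word overlap, then drop
    the lowest-scoring sentences until the combined summary fits the context budget."""
    prompt_words = set(prompt.lower().split())
    scored = []
    for doc_id, doc in enumerate(documents):
        for pos, sent in enumerate(sent_tokenize(doc)):
            overlap = sum(1 for w in sent.lower().split() if w in prompt_words)
            scored.append((overlap, doc_id, pos, sent))
    # Greedily keep the highest-scoring sentences that still fit the budget
    # (characters stand in for tokens here), then restore original order so
    # each document's summary still reads coherently.
    kept, used = [], 0
    for overlap, doc_id, pos, sent in sorted(scored, key=lambda t: t[0], reverse=True):
        if used + len(sent) <= budget_chars:
            kept.append((doc_id, pos, sent))
            used += len(sent)
    kept.sort()
    return "\n".join(sent for _, _, sent in kept)
```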
I'm pretty happy with the quality of final inference, and Lucy Search scales a lot better than any vector databases I've tried, but it's not without disadvantages:
- The HyDE step introduces latency, though I'm hopeful Gemma3-270M will reduce that a lot (been meaning to try it),
- My sentence-weighting algorithm lacks stemming logic, so sometimes it misses the mark; I've been meaning to remedy that,
- nltk/punkt is pretty fast, but also introduces latency in the summarization step,
- Vectorizing the content at inference time adds yet more latency.
So overall it's pretty slow, even though Lucy Search itself is quite fast. Everything else gets in the way.
My usual go-to for the HyDE step is one of Tulu3-8B, Phi-4, or Gemma3-12B, depending on the data domain, but I'm looking forward to trying Gemma3-270M for much faster HyDE.
My usual go-to for the final inference step is either Gemma3-12B (for "fast RAG") or Gemma3-27B (for "quality RAG"). Its RAG skills are quite good, and its 128K context accommodates large summarized retrievals, though I find its competence drops off after about 90K. My default configuration only fills it to 82K with retrieved content and the user's prompt, leaving 8K for the inferred reply.
I will be publishing my implementation as open source eventually, but I have a fairly long to-do list to work through before then.
AdDizzy8160@reddit
Interesting setup/knowledge, thanx for sharing.
dazl1212@reddit
That sounds amazing, but I don't really understand much of it. I'll have to go away and do some study on it. I've mainly been using MSTY with Qwen 8B embeddings and DeepSeek over OpenRouter as the LLM. I'm using it to read visual novel scripts to find similar gameplay elements for my visual novel. I've not had great results.
SkyFeistyLlama8@reddit
The Granite embedding models are pretty good.
johnkapolos@reddit
Great! 3 was punching above its class, so I'm looking forward to seeing 4.
ZestyCheeses@reddit
Expecting it to be dead on arrival. I doubt it will be able to compete with the best open source models, although I'd be happy to be surprised. Really, we're seeing continued commodification of models, where people will just use the best, fastest, and cheapest model available. If your model isn't that at release (or at least competitive on those fronts), then it really is DOA, unfortunately.
ResidentPositive4122@reddit
There is no such thing in the open models. Some models are better than others at some things and not at others. They all have their uses; it's not black and white.
ZestyCheeses@reddit
This just isn't true below the SOTA. Sure, some SOTA models might have differing capabilities in the way they were trained or fine-tuned, but below the SOTA the models are almost useless beyond maybe some obscure niche. The larger use cases follow the SOTA, and that's why we're seeing these models converge into commodities. People are just going to use the best that they can run, and I doubt Granite 4 will beat out other models in the space.
ttkciar@reddit
What you call an "obscure niche" is what thousands of people call their "primary use case".
ZestyCheeses@reddit
What is your point? That still makes it an obscure niche. These models simply aren't viable long term to train for such niches.
ttkciar@reddit
Well, how would you like it if the industry decided that your primary use-case was an obscure niche, and stopped training models for it?
That would suck, wouldn't it?
So don't advocate doing that to other people.
ZestyCheeses@reddit
I'm not advocating for anything. I'm just stating that models are becoming commodities. The vast majority of people just hop to the best, fastest, and cheapest models, which means we will eventually see models like Granite drop off, because if they don't compete on those standards then they aren't competitive as a commodity and therefore not viable to invest in. This is just reality.
ttkciar@reddit
Granite isn't targeting that market. Rather, it is the default model for Red Hat's RHEL AI solution, upon which enterprise customers would base their own products and services. (Red Hat is now a subsidiary of IBM, so they share an LLM tech strategy.)
Granite's skill-set and resource requirements will chase whatever Red Hat's Enterprise customers demand, but for now it's reflecting IBM's expectations of that market.
aseichter2007@reddit
Things like programming languages have overlapping syntaxes and plug-ins and structures and nomenclature paradigms.
A model trained specifically for C# will confabulate less and produce better C# code than a model also trained on JavaScript, assuming the training data was of equal quality.
MaverickPT@reddit
Maybe not for us VRAM poors with niche needs. Granite 3.3 works very well as my local meeting summarizer.
ZestyCheeses@reddit
And other SOTA models that you can run don't perform as well as a meeting summarizer? I highly doubt that.
MaverickPT@reddit
Some do, of course. But they are all much larger and fall out of my VRAM. As speed isn't a priority, it's usually fine. But I wanted to say that Granite is not too shabby for its size and my use case.
InvertedVantage@reddit
Looking forward to this. Granite 4 preview is my general purpose model.
sleepingsysadmin@reddit
There was a time back in the day when Granite was my go-to model. That IBM business LLM personality was great for my needs.
I look forward to see what they bring to the table.
jacek2023@reddit
took them two days to read my comment ;)
https://www.reddit.com/r/LocalLLaMA/comments/1nh1wqy/comment/ne8jd7t/
ilintar@reddit
About time :) The llama.cpp support was added some time ago (and required a considerable amount of work, too).