ibm-granite/granite-4.0-1b-speech · Hugging Face

Posted by jacek2023@reddit | LocalLLaMA | 17 comments

Model Summary: Granite-4.0-1b-speech is a compact and efficient speech-language model, specifically designed for multilingual automatic speech recognition (ASR) and bidirectional automatic speech translation (AST).

The model was trained on a collection of public corpora comprising diverse datasets for ASR and AST, as well as synthetic datasets tailored to support Japanese ASR, keyword-biased ASR and speech translation. Granite-4.0-1b-speech was trained by modality-aligning granite-4.0-1b-base to speech on publicly available open source corpora containing audio inputs and text targets. Compared to granite-speech-3.3-2b and granite-speech-3.3-8b, this model has the following additional capabilities and improvements:

Supports multilingual speech inputs in English, French, German, Spanish, Portuguese and Japanese,
Provides higher transcription accuracy for English ASR and faster inference through better encoder training and speculative decoding,
Has half the number of parameters of granite-speech-3.3-2b for running on resource-constrained devices,
Adds keyword list biasing capability for enhanced name and acronym recognition
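
As a rough illustration of how a keyword list could be fed to a model with this kind of biasing, here is a minimal sketch that cleans up a user glossary and appends it to a transcription instruction. The post does not state the prompt format the model expects, so the instruction wording, the "Keywords:" suffix, and the `build_biased_prompt` helper are all assumptions for illustration; check the model card for the actual format.

```python
def build_biased_prompt(
    keywords,
    base="can you transcribe the speech into a written format?",  # assumed wording
    max_keywords=50,  # hypothetical cap to keep the prompt short
):
    """Append a cleaned, de-duplicated keyword list to a transcription instruction."""
    seen = set()
    cleaned = []
    for kw in keywords:
        kw = " ".join(kw.split())  # collapse runs of whitespace
        key = kw.casefold()        # case-insensitive de-duplication
        if kw and key not in seen:
            seen.add(key)
            cleaned.append(kw)
    cleaned = cleaned[:max_keywords]
    if not cleaned:
        return base
    # The "Keywords:" suffix is an assumed convention, not a confirmed API.
    return f"{base} Keywords: {', '.join(cleaned)}"


prompt = build_biased_prompt(["cat window", "Cat  Window", "", "Pyannote"])
print(prompt)
```

Keeping the list short and free of duplicates is a design guess: a long or noisy keyword list plausibly dilutes whatever benefit the biasing gives.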


tradmalcong@reddit

Granite‑4.0‑1b‑speech looking really tight for compact ASR/AST on edge gear. Love that it keeps accuracy up while halving the param count. In practice, I’ve found that pairing something like this underneath stacks like Palabra or Talo helps keep the real‑time speech‑to‑speech plumbing feeling lightweight, while still letting you push the “how much can we run locally” envelope on Mac or small servers

Wonderful_Guess9305@reddit

Does it support language identification? Like a language ID tag?

CtrlAltDelve@reddit

These always seemed to be really promising, but they never seemed to have any comparisons to Parakeet. I've only ever used Whisper and Parakeet, but Parakeet has been so ludicrously fast and accurate for me that I've never wanted to use anything else.

Does anyone have any experience trying these?

Temporary-Size7310@reddit

The main issue with Parakeet: it hallucinates on language. You can't define an input/output language like you can with Canary, so for the other supported languages you can't use it in production.

It sometimes translates around 20% of the tokens at random, so you can't get back to French, for example, without an additional LLM step, which is a problem on constrained hardware like mobile phones.

nuclearbananana@reddit

Probably comparable to parakeet, but slower. I'd have to test. The word biasing could be useful

The qwen asr model was disappointing in speed, so hopefully this is better.

NobodySpecific@reddit

Why the name change?

granite-speech-3.3-2b vs. granite-4.0-1b-speech

Why move 'speech' to the end? Does nobody care about consistency?

Traditional_Tap1708@reddit

I tried it with vLLM. For English, it outputs plain text without any punctuation and looks less accurate than qwen-asr.

ttkciar@reddit

Was reading through the bulletpoints, thinking "nice. nice. nice." and then hit the last one and thought "oooooh!"

Using a user-provided list to help recognize names and idiomatic constructs seems like a huge win.

My wife and I use private idioms all the time, and her phone's voice-to-text feature gets these wrong constantly! Like, this morning in a text she mentioned "cat window" (which refers to the corner of the kitchen where we feed the cats, in our private jargon) which her phone interpreted as "Kathmandu" (the capital of Nepal). Hilarious, but it also illustrates a flaw in the technology.

If we can avoid errors like that by simply keeping/updating a glossary of our commonly used idioms, that would be fantastic!

"Cat window" is.. two very common words? Odd it got that wrong. I'd use it more for technical terms, mixed language words, names etc.

Hefty_Wolverine_553@reddit

Seems like a really great model, but might be a pain to get running on actual mobile devices.

Corporate_Drone31@reddit

I don't see it, size-wise. 1B LLMs run on semi-modern phones; for ASR it is presumably a question of the AI stack supporting the model.
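
For a sense of scale, here is a back-of-the-envelope estimate of weight memory at a few precisions. It assumes a nominal 1 billion parameters (the model's exact count is not given in the thread) and ignores activations, the KV cache, and runtime overhead, so treat the numbers as a floor rather than a measurement. The `weight_memory_gib` helper is hypothetical.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Memory needed for the weights alone, in GiB (2**30 bytes)."""
    return num_params * bits_per_param / 8 / 2**30


NOMINAL_PARAMS = 1_000_000_000  # assumption: "1b" taken at face value

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gib(NOMINAL_PARAMS, bits):.2f} GiB")
```

Under these assumptions, 16-bit weights come to a bit under 2 GiB and 4-bit weights to under 0.5 GiB.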

Prince-of-Privacy@reddit

Why do none of these new ASR models support diarization by default? :(

That's what I love about Gemini for instance. That it can transcribe and diarize.

1-800-methdyke@reddit

What’s your workflow for doing this with Gemini? I just dumped a voice note into it and it did a great job of summarizing the conversation and picking up names from conversation context. But the transcript is only diarized to the extent that it’s broken up the conversation into chunks.

Typically I run it through Parakeet/Pyannote locally, which allows me to assign names to speakers (and it can save the embeddings to identify them next time).
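
The "save the embeddings to identify them next time" step can be sketched roughly as a nearest-match lookup by cosine similarity. This is not Pyannote's API: the embedding vectors, the `identify_speaker` helper, and the 0.7 threshold are all hypothetical, and a real pipeline would get its vectors from whatever speaker-embedding model it uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify_speaker(embedding, known_speakers, threshold=0.7):
    """Return the best-matching saved speaker name, or None if nothing clears the threshold."""
    best_name, best_score = None, threshold
    for name, reference in known_speakers.items():
        score = cosine_similarity(embedding, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Toy 3-dimensional embeddings; real ones have hundreds of dimensions.
saved = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
print(identify_speaker([0.9, 0.1, 0.0], saved))  # close to alice's vector
print(identify_speaker([0.0, 0.0, 1.0], saved))  # matches nobody
```

Returning `None` below the threshold leaves room to register an unknown voice as a new speaker instead of forcing a wrong name onto it.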

You said it yourself in the edit. I just tell it to transcribe and separate speakers :)