llamacpp with Gemma4 31B dense and Gemma e4b as draft, plus audio input?
Posted by caetydid@reddit | LocalLLaMA | 15 comments
Hi,
has anybody succeeded in running llama.cpp with Gemma 31b dense and Gemma e4b as draft model, and simultaneously inherit the voice recognition (audio input) feature?
Is it even (theoretically) possible?
thanks
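For context, the text-only half of this setup (large target model plus small draft model for speculative decoding) is launched in llama.cpp roughly like this; the GGUF filenames below are placeholders, not real release files:

```shell
# Speculative decoding in llama.cpp: a large target model (-m) plus a small
# draft model (-md / --model-draft). Filenames are placeholders.
llama-server \
  -m gemma-31b-q4_k_m.gguf \
  -md gemma-e4b-q4_k_m.gguf \
  --draft-max 16
```

The open question in this thread is whether audio input can survive such a pairing, since only the draft-sized model family has an audio encoder.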
MaruluVR@reddit
If what you are asking is whether combining them at all is possible: it would be, but it would need retraining.
First you would have to extract the ~300 MB audio encoder from E4B (these parameters would have to be frozen during training).
Then create a new linear projection layer in PyTorch that maps the E4B encoder output to the hidden dimension size of 31B, or whatever model you are using.
Then you would need to retrain 31B on audio/text pairs, maybe even full audio-to-thinking-to-response pairs.
This has been done before with vision in the old llama days, see: https://github.com/haotian-liu/LLaVA
Since a lot of weights are frozen it shouldn't be too expensive to train, but the biggest problem will be the dataset.
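The frozen-encoder plus trainable-projection setup described above can be sketched in PyTorch. The dimensions and the stand-in encoder here are placeholders for illustration, not the real E4B or 31B sizes:

```python
import torch
import torch.nn as nn

# Placeholder dimensions -- the real E4B encoder output and 31B hidden sizes differ.
ENCODER_DIM = 1536   # hypothetical E4B audio-encoder output size
TARGET_DIM = 5120    # hypothetical 31B hidden size

class AudioBridge(nn.Module):
    """Frozen pretrained audio encoder + trainable linear projection
    into the target LLM's hidden space (the LLaVA-style recipe)."""
    def __init__(self, encoder: nn.Module):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False                     # freeze the encoder
        self.proj = nn.Linear(ENCODER_DIM, TARGET_DIM)  # the only trainable part

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            h = self.encoder(audio_features)            # frozen forward pass
        return self.proj(h)                             # embeddings for the big model

# Stand-in encoder; in practice this would be the extracted E4B audio module.
dummy_encoder = nn.Identity()
bridge = AudioBridge(dummy_encoder)
out = bridge(torch.randn(2, 10, ENCODER_DIM))
print(out.shape)  # torch.Size([2, 10, 5120])
```

During training, only `bridge.proj` receives gradients; the target LLM would then be fine-tuned on the projected audio embeddings paired with text.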
caetydid@reddit (OP)
wow amazing. but doing this exceeds my capabilities!
I wonder if that would lead to a more efficient pipeline than running whisper and feeding the transcript into gemma4.
so for me whisper is the way to go.
MaruluVR@reddit
I just made a reddit post about this, let's see what other people think.
https://www.reddit.com/r/LocalLLaMA/comments/1te1yxy/adding_e4b_audio_encoder_to_larger_models/
caetydid@reddit (OP)
thank you
xeeff@reddit
why not just use the official assistant model for draft
caetydid@reddit (OP)
because that would not support audio.
vasileer@reddit
gemma4 31b doesn't support audio, only text and image
xeeff@reddit
... how would that even work?
MaruluVR@reddit
As far as I know the audio gets turned into latent vector embeddings by E4B without ever turning into normal text; since the 31B version isn't trained on this type of data, it won't be able to do anything with it.
caetydid@reddit (OP)
aah i thought so. so i will need to use whisper then.
MaruluVR@reddit
Don't, whisper is slow and bad; use Nvidia Parakeet.
caetydid@reddit (OP)
you're right, it is fast, but WER is still lower with whisper
PositiveBit01@reddit
I may be misunderstanding but what do you mean by draft model? Shouldn't the gemma4 31b assistant model be the draft model?
caetydid@reddit (OP)
i can't use the built-in draft because it does not accept audio embeddings!
nickm_27@reddit
there are multiple types of drafting. You are referring to MTP (multi-token prediction), but there is also the older traditional scheme, where a separate small draft model proposes tokens and the full LLM verifies them
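The traditional draft-model scheme can be sketched as a toy speculative-decoding loop. The two "models" below are stand-in functions over a 5-token vocabulary, not Gemma; real implementations accept or reject drafted tokens probabilistically rather than by exact match:

```python
VOCAB = list("abcde")

# Stand-in "models": each maps a context string to a next token.
# In practice these are the small draft LLM and the big target LLM.
def draft_model(ctx: str) -> str:
    return VOCAB[len(ctx) % len(VOCAB)]      # cheap, fast guesser

def target_model(ctx: str) -> str:
    return VOCAB[len(ctx) % len(VOCAB)]      # expensive model (here: always agrees)

def speculative_step(ctx: str, k: int = 4) -> str:
    """Draft k tokens cheaply, then let the target model verify them in order.
    Accept the longest agreeing prefix, plus one guaranteed target token."""
    drafted, c = [], ctx
    for _ in range(k):
        t = draft_model(c)
        drafted.append(t)
        c += t
    accepted = ""
    for t in drafted:
        if target_model(ctx + accepted) == t:  # target agrees with the draft
            accepted += t
        else:
            break                              # first disagreement ends the run
    accepted += target_model(ctx + accepted)   # target always emits one token
    return ctx + accepted

ctx = speculative_step("ab")
print(ctx)  # "abcdeab": all 4 drafted tokens accepted, plus one target token
```

The speedup comes from verifying the k drafted tokens in one batched forward pass of the big model instead of k sequential passes; in the worst case (draft always wrong) it degrades to normal one-token-at-a-time decoding.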