llamacpp with Gemma4 31B dense and Gemma e4b as draft, plus audio input?
Posted by caetydid@reddit | LocalLLaMA | 15 comments
Hi,
has anybody succeeded in running llama.cpp with Gemma 31b dense and Gemma e4b as draft model, and simultaneously inherit the voice recognition (audio input) feature?
Is it even (theoretically) possible?
thanks
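For context, the text-only half of this setup (large target model plus small draft model for speculative decoding) is launched in llama.cpp roughly like this; the GGUF filenames below are placeholders, not real release files:

```shell
# Speculative decoding in llama.cpp: a large target model (-m) plus a small
# draft model (-md / --model-draft). Filenames are placeholders.
llama-server \
  -m gemma-31b-q4_k_m.gguf \
  -md gemma-e4b-q4_k_m.gguf \
  --draft-max 16
```

The open question in this thread is whether audio input can survive such a pairing, since only the draft-sized model family has an audio encoder.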
MaruluVR@reddit
If what you are asking is whether combining them at all is possible: it would be, but it would need retraining.
First you would have to extract the ~300 MB audio encoder from E4B (these parameters would have to be frozen during training).
Then create a new linear projection layer in PyTorch that maps the E4B encoder output to the hidden dimension size of 31B, or whatever model you are using.
Then you would need to retrain 31B on audio/text pairs, maybe even full audio-to-thinking-to-response pairs.
This has been done before with vision in the old llama days, see: https://github.com/haotian-liu/LLaVA
Since a lot of weights are frozen it shouldn't be too expensive to train, but the biggest problem will be the dataset.
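The frozen-encoder plus trainable-projection setup described above can be sketched in PyTorch. The dimensions and the stand-in encoder here are placeholders for illustration, not the real E4B or 31B sizes:

```python
import torch
import torch.nn as nn

# Placeholder dimensions -- the real E4B encoder output and 31B hidden sizes differ.
ENCODER_DIM = 1536   # hypothetical E4B audio-encoder output size
TARGET_DIM = 5120    # hypothetical 31B hidden size

class AudioBridge(nn.Module):
    """Frozen pretrained audio encoder + trainable linear projection
    into the target LLM's hidden space (the LLaVA-style recipe)."""
    def __init__(self, encoder: nn.Module):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False                     # freeze the encoder
        self.proj = nn.Linear(ENCODER_DIM, TARGET_DIM)  # the only trainable part

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            h = self.encoder(audio_features)            # frozen forward pass
        return self.proj(h)                             # embeddings for the big model

# Stand-in encoder; in practice this would be the extracted E4B audio module.
dummy_encoder = nn.Identity()
bridge = AudioBridge(dummy_encoder)
out = bridge(torch.randn(2, 10, ENCODER_DIM))
print(out.shape)  # torch.Size([2, 10, 5120])
```

During training, only `bridge.proj` receives gradients; the target LLM would then be fine-tuned on the projected audio embeddings paired with text.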
caetydid@reddit (OP)
wow amazing. but doing this exceeds my capabilities!
I wonder if that would lead to a more efficient pipeline than running whisper and feeding the transcript into gemma4.
so for me whisper is the way to go.
MaruluVR@reddit
I just made a reddit post about this, let's see what other people think.
https://www.reddit.com/r/LocalLLaMA/comments/1te1yxy/adding_e4b_audio_encoder_to_larger_models/
caetydid@reddit (OP)
thank you
xeeff@reddit
why not just use the official assistant model for draft
caetydid@reddit (OP)
because that would not support audio.
vasileer@reddit
gemma4 31b doesn't support audio, only text and image
xeeff@reddit
... how would that even work?
MaruluVR@reddit
As far as I know the audio gets turned into latent vector embeddings by E4B without ever turning into normal text; since the 31B version isn't trained on this type of data, it won't be able to do anything with it.
caetydid@reddit (OP)
aah i thought so. so i will need to use whisper then.
MaruluVR@reddit
Don't, whisper is slow and bad; use Nvidia Parakeet.
caetydid@reddit (OP)
you're right, it is fast, but WER is still lower with whisper
PositiveBit01@reddit
I may be misunderstanding but what do you mean by draft model? Shouldn't the gemma4 31b assistant model be the draft model?
caetydid@reddit (OP)
i can't use the built-in draft because it does not accept audio embeddings!
nickm_27@reddit
there are multiple types of drafting. You are referring to MTP (multi-token prediction), but there is also the older traditional scheme, where a separate small draft model proposes tokens and the full LLM verifies them
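The traditional draft-model scheme can be sketched as a toy speculative-decoding loop. The two "models" below are stand-in functions over a 5-token vocabulary, not Gemma; real implementations accept or reject drafted tokens probabilistically rather than by exact match:

```python
VOCAB = list("abcde")

# Stand-in "models": each maps a context string to a next token.
# In practice these are the small draft LLM and the big target LLM.
def draft_model(ctx: str) -> str:
    return VOCAB[len(ctx) % len(VOCAB)]      # cheap, fast guesser

def target_model(ctx: str) -> str:
    return VOCAB[len(ctx) % len(VOCAB)]      # expensive model (here: always agrees)

def speculative_step(ctx: str, k: int = 4) -> str:
    """Draft k tokens cheaply, then let the target model verify them in order.
    Accept the longest agreeing prefix, plus one guaranteed target token."""
    drafted, c = [], ctx
    for _ in range(k):
        t = draft_model(c)
        drafted.append(t)
        c += t
    accepted = ""
    for t in drafted:
        if target_model(ctx + accepted) == t:  # target agrees with the draft
            accepted += t
        else:
            break                              # first disagreement ends the run
    accepted += target_model(ctx + accepted)   # target always emits one token
    return ctx + accepted

ctx = speculative_step("ab")
print(ctx)  # "abcdeab": all 4 drafted tokens accepted, plus one target token
```

The speedup comes from verifying the k drafted tokens in one batched forward pass of the big model instead of k sequential passes; in the worst case (draft always wrong) it degrades to normal one-token-at-a-time decoding.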