Adding E4B audio encoder to larger models
Posted by MaruluVR@reddit | LocalLLaMA | 5 comments
I am curious whether anyone here has tried doing this. I did a bit of digging, and it seems easier to do than I first thought, so I would like to ask for corrections if my assumptions are wrong. Here is how I would go about it:
- Extract the ~300 MB audio encoder from E4B or E2B
- Create a new linear projection layer in PyTorch that maps the E4B encoder output to the hidden dimension size of the larger target model
- Get a dataset of paired text and audio
- Freeze both the large model and the audio encoder, and train only the new linear projection layer
Since only the new projection layer has to be trained, training should be relatively quick and wouldn't negatively affect the larger model's output. It is basically the same approach as this paper, but using the Gemma encoder, which has been built for low-latency LLMs, instead of the Whisper encoder.
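As a rough PyTorch sketch of steps 2 and 4 above: this assumes the extracted encoder returns a `[batch, frames, enc_dim]` feature tensor and that the target model takes `inputs_embeds` in the Hugging Face style. The dimensions, learning rate, and the `encoder`/`llm` handles are all illustrative placeholders, not the real E4B or target-model values.

```python
import torch
import torch.nn as nn

ENC_DIM = 1536     # placeholder: width of the E4B audio-encoder output
TARGET_DIM = 5376  # placeholder: hidden size of the larger target model

class AudioProjector(nn.Module):
    """Linear bridge from the frozen audio encoder into the target
    model's embedding space (step 2 of the recipe)."""
    def __init__(self, enc_dim: int, target_dim: int):
        super().__init__()
        self.proj = nn.Linear(enc_dim, target_dim)

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # [B, T, enc_dim] -> [B, T, target_dim]
        return self.proj(audio_feats)

def freeze(module: nn.Module) -> None:
    """Step 4: freeze a module so only the projector gets gradients."""
    for p in module.parameters():
        p.requires_grad = False

# encoder = <extracted E4B audio encoder>; llm = <larger target model>
# freeze(encoder); freeze(llm)
projector = AudioProjector(ENC_DIM, TARGET_DIM)
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)

def training_step(audio_feats, text_embeds, labels, llm):
    """One step: prepend the projected audio tokens to the text
    embeddings and backprop the LM loss through the projector only.
    Assumes an HF-style forward(inputs_embeds=..., labels=...) where
    labels are set to -100 over the audio-token prefix."""
    audio_embeds = projector(audio_feats)               # [B, Ta, TARGET_DIM]
    inputs = torch.cat([audio_embeds, text_embeds], 1)  # [B, Ta+Tt, TARGET_DIM]
    loss = llm(inputs_embeds=inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Because gradients only flow into the projector, the two frozen models could even be loaded in lower precision to save memory during training.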
caetydid@reddit
If doing so would actually turn Gemma 3 27B directly into an audio-visual model, why didn't Google design it like that in the first place?
I've asked that question before and someone answered with several drawbacks, but I cannot remember the details.
MaruluVR@reddit (OP)
In theory this wouldn't just work with Gemma but with any local AI model (if trained specifically for that model). The same thing has previously been done with the Whisper encoder: ByteDance used it for SALMONN, and the Qwen team used it for Qwen2-Audio.
Those were made using the generic Whisper encoder, which meant they had to optimize its output (token efficiency, sampling, etc.) for LLMs, while the Gemma one is already optimized out of the box.
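For a concrete picture of what that kind of token-efficiency work looks like, here is a hedged sketch of frame stacking, one common way to cut the number of audio tokens a generic encoder hands to the LLM. Note this is only illustrative: SALMONN itself uses a window-level Q-Former rather than plain stacking, and the factor and dimensions below are assumptions, not values from any of the cited models.

```python
import torch
import torch.nn as nn

class StackedProjector(nn.Module):
    """Stack k adjacent encoder frames into one vector before the
    linear projection, so the LLM sees T/k audio tokens instead of T.
    k=4 is an illustrative choice, not a value from SALMONN or
    Qwen2-Audio."""
    def __init__(self, enc_dim: int, target_dim: int, k: int = 4):
        super().__init__()
        self.k = k
        self.proj = nn.Linear(enc_dim * k, target_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        b, t, d = feats.shape
        t = t - t % self.k                        # drop leftover frames
        stacked = feats[:, :t].reshape(b, t // self.k, d * self.k)
        return self.proj(stacked)                 # [B, T/k, target_dim]
```

The mechanisms differ between models, but the goal is the same: fewer, denser audio tokens so the LLM's context isn't swamped by the encoder's frame rate.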
Top-Rub-4670@reddit
Your theory isn't what's in question.
If it's so simple and the audio part is so small, why didn't Google add audio to 12B and 27B?
There has to be a massive trade-off that your theory doesn't account for.
Diecron@reddit
Could just be separation of concerns too, e.g. do you really want a "big" dense model to be handling Whisper and TTS flows when the E2B or E4B can do it well? Have that model run on device for real-time interactions and then pass off the actual response synthesis to a hosted/more intelligent model.
Silver-Champion-4846@reddit
I wonder how good this is?