What happens when you load two models and let each model take a turn generating a token?
Posted by silenceimpaired@reddit | LocalLLaMA | 16 comments
To make sure there is no misunderstanding, here it is played out:
I like eating hotdogs.
Model 1: I, eat, hot
Model 2: like, ing, dogs.
This is a simulation to demonstrate the idea.
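To make the turn-taking concrete, here is a minimal sketch in Python. The "models" are toy stand-ins (`make_toy_model` is a hypothetical helper, not a real library call) that simply know the sentence, and the sketch assumes both models share one tokenizer. Real models from different families usually do not, so that part would need extra work.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps the tokens so far to the next token.
NextToken = Callable[[List[str]], str]

def make_toy_model(script: List[str]) -> NextToken:
    """Toy 'model' that knows one sentence and emits the token at the current position."""
    def next_token(context: List[str]) -> str:
        pos = len(context)
        return script[pos] if pos < len(script) else "<eos>"
    return next_token

def alternate_decode(models: List[NextToken], max_tokens: int = 32) -> List[str]:
    """Round-robin decoding: model 0 picks a token, then model 1, and so on."""
    context: List[str] = []
    for step in range(max_tokens):
        model = models[step % len(models)]  # whose turn is it?
        token = model(context)
        if token == "<eos>":
            break
        context.append(token)
    return context

# Assumed tokenization of the example sentence (chosen to match the split above).
sentence = ["I", " like", " eat", "ing", " hot", "dogs."]
model_1 = make_toy_model(sentence)
model_2 = make_toy_model(sentence)

tokens = alternate_decode([model_1, model_2])
print("Model 1:", tokens[0::2])  # tokens picked on even steps
print("Model 2:", tokens[1::2])  # tokens picked on odd steps
print("".join(tokens))
```

With real models, each `next_token` call would be a forward pass plus sampling, and the two models would not agree this neatly.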
So why? And is it worth it?
My first thought was that it will clearly be slower… but I wondered whether a few adjustments to the software could ensure the context isn’t fully reprocessed by each model on every turn.
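One way to picture the "don't reprocess everything" idea: each model keeps its own cache and, on its turn, only ingests the tokens added since its last turn (its own previous token plus the other model's). The sketch below only counts tokens processed; `ToyCachedModel` and `count_processed` are made-up names for illustration, not a real inference API.

```python
from typing import List

class ToyCachedModel:
    """Toy model that tracks how many context tokens are already in its cache."""
    def __init__(self, name: str) -> None:
        self.name = name
        self.cached = 0      # context tokens already processed and cached
        self.processed = 0   # total tokens ever pushed through this model

    def step(self, context: List[str]) -> str:
        new_tokens = context[self.cached:]  # only what this model has not seen yet
        self.processed += len(new_tokens)
        self.cached = len(context)
        return f"t{len(context)}"           # dummy next token

def count_processed(prompt_len: int, new_tokens: int, use_cache: bool) -> int:
    """Alternate two models and return the total number of tokens processed."""
    models = [ToyCachedModel("A"), ToyCachedModel("B")]
    context = [f"p{i}" for i in range(prompt_len)]
    for step in range(new_tokens):
        model = models[step % 2]
        if not use_cache:
            model.cached = 0  # simulate throwing the cache away every turn
        context.append(model.step(context))
    return sum(m.processed for m in models)

print(count_processed(4, 6, use_cache=True))   # 17
print(count_processed(4, 6, use_cache=False))  # 39
```

Under these assumptions, both models still pay for the full prompt once, so the cost is roughly double a single model plus a small catch-up each turn, rather than a full reprocess every token.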
My next thought was: how would two different model families handle this? For example, GPT-OSS 120b and GLM-4.6V? What happens when East meets West?
What happens if you always ran inference on a smaller model first, but only used its output when it predicted the next word with high confidence and/or the word was a common one (the, a, an, has, etc.) from the top 200 English words? Would this be faster than pairing a draft model with a larger model (speculative decoding), and how much less accurate would it be?
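Here is a rough sketch of that gating rule, again with toy stand-ins for the models. The threshold, the tiny `COMMON_WORDS` set (standing in for a top-200 list), and the function names are all assumptions for illustration. Note the difference from speculative decoding: there the large model verifies every drafted token, while here accepted tokens are never checked, so the output can drift from what the large model would have written.

```python
from typing import Callable, List, Tuple

# Stand-in for a "top 200 English words" list (assumption, not a real list).
COMMON_WORDS = {"the", "a", "an", "has"}

SmallModel = Callable[[List[str]], Tuple[str, float]]  # returns (token, probability)
LargeModel = Callable[[List[str]], str]

def gated_decode(small: SmallModel, large: LargeModel, prompt: List[str],
                 max_tokens: int, threshold: float = 0.9) -> Tuple[List[str], int]:
    """Use the small model's token if it is confident or common, else ask the large model."""
    context = list(prompt)
    large_calls = 0
    for _ in range(max_tokens):
        token, prob = small(context)
        if not (prob >= threshold or token in COMMON_WORDS):
            token = large(context)  # fall back to the large model
            large_calls += 1
        context.append(token)
    return context, large_calls

# Toy models: the large one knows the target, the small one guesses with a confidence.
target = ["the", "cat", "has", "a", "hat"]
guesses = [("the", 0.5), ("dog", 0.4), ("has", 0.3), ("a", 0.95), ("cap", 0.6)]

def small_model(context: List[str]) -> Tuple[str, float]:
    return guesses[len(context)]

def large_model(context: List[str]) -> str:
    return target[len(context)]

output, large_calls = gated_decode(small_model, large_model, [], max_tokens=5)
print(" ".join(output), "| large model calls:", large_calls)
```

In a real setup the large model would still have to ingest the tokens it skipped before it could predict, so the actual speedup depends on how cheap that catch-up is.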
One idea that came to mind is that the fingerprint of each model would get muddied. How muddied? Only one way to find out.
And here you might get a little grumpy. I’m still at work and my knowledge of how to accomplish this is pretty narrow, so I can’t give you this answer… yet. But a helpful upvote and a comment from you should give this some visibility, so that those who have done this or have the knowledge to do so can beat me to providing you and me with an answer.
Have you done something wacky like this? I’d love to hear about your experiences along these lines.