What happens when you load two models and let each model take a turn generating a token?

Posted by silenceimpaired@reddit | LocalLLaMA | 16 comments

To make sure there is no misunderstanding, here it is played out with the sentence "I like eating hotdogs". Model 1: I, eat, hot. Model 2: like, ing, dogs. This is a simulation to demonstrate the idea.

So why? And is it worth it? The first thought that came to my mind was that it will clearly be slower… but I wondered if a few adjustments to the software could ensure the context isn’t fully reprocessed for each model each time.

My next thought was: how would two different model families handle this? For example, GPT-OSS 120b and GLM-4.6V? What happens when the east meets west?

What happens if you always ran inference on a smaller model, but only used its output when it predicted the next word with high confidence and/or the word was a common one (the, a, an, has, etc.) from the top 200 English words? Would this be faster than a draft model paired with a larger model, and how much less accurate would it be?

One idea that came to mind is that the fingerprint of the models would get muddied. How muddied? Only one way to find out.

And here you might get a little grumpy. I’m still at work and my knowledge of how to accomplish this is pretty narrow, so I can’t give you this answer… yet. But a helpful upvote and a comment from you should get this some visibility, so that those who have done this or have the knowledge to do so can beat me to providing you and me with an answer.

Have you done something wacky like this? I’d love to hear your experiences along these lines.
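The confidence-gating idea in the post can be sketched with toy stand-ins for the two models. Everything here is an assumption made for illustration: the `Model` interface (context in, next word plus confidence out), the tiny `COMMON_WORDS` set standing in for a top-200 list, the 0.9 threshold, and the choice to require both conditions (confident and common) rather than either one.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: a "model" maps the words so far to (next_word, confidence).
Model = Callable[[List[str]], Tuple[str, float]]

# Tiny stand-in for a "top 200 English words" list (assumption for the demo).
COMMON_WORDS = {"i", "the", "a", "an", "has", "to", "of", "and"}


def gated_generate(small: Model, large: Model, prompt: List[str],
                   max_words: int, threshold: float = 0.9) -> Tuple[List[str], int]:
    """Always query the small model; fall back to the large one when gated out."""
    words = list(prompt)
    large_calls = 0
    for _ in range(max_words):
        word, confidence = small(words)
        # Keep the small model's word only if it is confident AND the word is common.
        if not (confidence >= threshold and word.lower() in COMMON_WORDS):
            word, _ = large(words)
            large_calls += 1
        words.append(word)
    return words, large_calls


# Toy models: the large one follows a script, the small one only knows common words.
TARGET = ["I", "like", "eating", "the", "hotdogs"]


def toy_small(words: List[str]) -> Tuple[str, float]:
    word = TARGET[len(words)]
    if word.lower() in COMMON_WORDS:
        return word, 0.95
    return "uh", 0.30  # unsure guess that the gate should reject


def toy_large(words: List[str]) -> Tuple[str, float]:
    return TARGET[len(words)], 0.99


if __name__ == "__main__":
    words, large_calls = gated_generate(toy_small, toy_large, [], len(TARGET))
    print(" ".join(words), "| large-model calls:", large_calls)
```

In this toy run the large model is only consulted for the uncommon words. In a real setup the large model would still have to process the words the small model wrote before taking its next turn, so the saving is not free.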

16 Comments


Red_Redditor_Reddit@reddit

I don't think two models would have the same tokenizer, so there's that. But otherwise it would probably behave like any other model. I suppose you can input the existing text as a prompt and have it generate one token.

Corporate_Drone31@reddit

Or have it generate the next word instead of the next token. Set the space character as a stop sequence.

That's a better idea.

droptableadventures@reddit

> I suppose you can input the existing text as a prompt and have it generate one token.

This would be the way to do it.

ortegaalfredo@reddit

Sometimes it works better, sometimes worse. What is easier and usually works fine is doing the thinking with a cheaper model and the final answer with a bigger, more expensive model.

Firepal64@reddit

Reasoning with Nanbeige... Answer with Deepseek Speciale :P

dash_bro@reddit

The format you're describing is similar to speculative decoding, a well-known speed-up technique with virtually zero drop in quality. You're doing it token by token, so there likely won't be any speedup (any gain is offset by model loading or API latency), and the quality will be closer to that of the worst model in the lineup (if a low-tier model generates a wrong token, you can't edit the sequence, and it affects every token after it).

SD: two models of the **same** model family but widely differing in size. The small model generates n tokens super fast, and the larger model just checks whether it would have generated the same n tokens. The matching tokens are kept, and the system moves on. You get the large model's intelligence at a 2-3x speedup.
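The draft-then-verify loop described above can be sketched in its simplest greedy form. This is a toy under stated assumptions, not how any particular inference engine implements it: the `NextToken` interface and the scripted toy models are hypothetical, and the verification is done one position at a time here, whereas a real engine would check all n drafted tokens in a single forward pass of the large model (which is where the speedup comes from).

```python
from typing import Callable, List

# Hypothetical interface: greedy next token given the tokens so far.
NextToken = Callable[[List[str]], str]


def speculative_step(draft: NextToken, target: NextToken,
                     context: List[str], n: int) -> List[str]:
    """Draft n tokens, keep the prefix the target agrees with."""
    proposed: List[str] = []
    for _ in range(n):
        proposed.append(draft(context + proposed))

    accepted: List[str] = []
    for token in proposed:
        expected = target(context + accepted)
        if token == expected:
            accepted.append(token)
        else:
            # First mismatch: take the target's token instead and stop verifying.
            accepted.append(expected)
            break
    return accepted


def speculative_generate(draft: NextToken, target: NextToken,
                         context: List[str], max_tokens: int, n: int = 4) -> List[str]:
    out = list(context)
    produced = 0
    while produced < max_tokens:
        step = speculative_step(draft, target, out, min(n, max_tokens - produced))
        out.extend(step)
        produced += len(step)
    return out


# Toy models that follow fixed scripts; the draft is wrong at one position.
TARGET_SCRIPT = ["the", "cat", "sat", "on", "the", "mat"]
DRAFT_SCRIPT = ["the", "cat", "sat", "in", "the", "mat"]


def toy_target(tokens: List[str]) -> str:
    return TARGET_SCRIPT[len(tokens)]


def toy_draft(tokens: List[str]) -> str:
    return DRAFT_SCRIPT[len(tokens)]


if __name__ == "__main__":
    print(speculative_generate(toy_draft, toy_target, [], len(TARGET_SCRIPT)))
```

Because every kept token is one the target would have produced itself, the output matches what the large model would generate alone under greedy decoding. That is the key difference from alternating turns, where a weaker model's token is never checked.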

silenceimpaired@reddit (OP)

Yeah, I’m familiar with speculative decoding. This would be different. In the first case it is two models of the same size… but maybe different architectures. The second case would take advantage of training a small model on common English and/or only relying on it for common English words… so that it just quickly fills in common English words that aren’t very meaningful semantically but are necessary for good grammar.

AutomataManifold@reddit

The slow but practical way is to just request one token at a time from each model. Not too hard to script in Python with LiteLLM and Openrouter.

FencingNerd@reddit

I suspect that would give you gibberish, as each model would be generating its response independent of the other. It's the same as generating two responses and taking every other word. The only way to actually do it would be to feed each token back in after.

That's why I said it would be slow: you'd have to keep updating both contexts. You could probably script a shared context via PyTorch, or maybe Transformers, but that's probably getting in way too deep...

> you'd have to keep updating both contexts.

You'd only have to add at most a few more tokens to the context cache, so it wouldn't be that slow. What would slow you down is having to create a whole new API request for each token. It'd work as a prototype, but to make it faster, you'd want it to be more tightly integrated. You couldn't share the contexts, as the models would likely have different "shapes" and different tokenizers.
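A toy illustration of that point: if each model keeps its own cache and the shared state is plain text (since tokenizers differ), each turn only needs to process the piece the other model just added. The `CachedModel` class, its character-based accounting, and the scripted `next_piece` function are hypothetical stand-ins for a real model with a prefix (KV) cache.

```python
from typing import Callable, List


class CachedModel:
    """Toy model that tracks how much text it has had to process."""

    def __init__(self, name: str, next_piece: Callable[[str], str]) -> None:
        self.name = name
        self.next_piece = next_piece
        self.cached_chars = 0     # length of the text prefix already in this model's cache
        self.processed_chars = 0  # total characters of new context it had to ingest

    def step(self, text: str) -> str:
        # Only the suffix written since this model's last turn is new work.
        self.processed_chars += len(text) - self.cached_chars
        piece = self.next_piece(text)
        # The cache now covers the shared text plus this model's own output.
        self.cached_chars = len(text) + len(piece)
        return piece


def alternate(models: List[CachedModel], prompt: str, turns: int) -> str:
    """Let the models take turns, feeding the shared text back in each time."""
    text = prompt
    for turn in range(turns):
        text += models[turn % len(models)].step(text)
    return text


# Scripted continuation matching the example from the post.
PIECES = ["I", " like", " eat", "ing", " hot", "dogs"]


def scripted(text: str) -> str:
    built = ""
    for piece in PIECES:
        if built == text:
            return piece
        built += piece
    return ""


if __name__ == "__main__":
    m1, m2 = CachedModel("model-1", scripted), CachedModel("model-2", scripted)
    print(alternate([m1, m2], "", len(PIECES)))
    print(m1.processed_chars, m2.processed_chars)
```

In this run each model ingests only a handful of characters in total, instead of re-reading the whole text on every turn. The per-request overhead mentioned above is not modeled here.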

Yeah, caching is probably what makes or breaks you.

rvistro@reddit

I think an alternative is have the two models process the prompt and share that info with each other to get to a consensus.

diaperrunner@reddit

I could see trade-offs with generation. Text generation algorithms build a tree of possible tokens, so you can't simply alternate models without creating issues for common text generation algorithms. P.S. I am basing most of my knowledge on Hugging Face Transformers generation and algorithms like top-p.

FullOf_Bad_Ideas@reddit

Cool idea. If you set up two OpenAI-compatible completion APIs with vLLM, let's say one per GPU if you have two GPUs, it should work easily, at least for base and instruct models. I'm not sure about reasoning ones. It should also reuse the cache, so it should be relatively quick. You can vibe-code a good demo in a few hours or less. I haven't tried it, but if you do end up doing this, share your findings.
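A minimal sketch of that setup, assuming two servers that expose the OpenAI-style `/v1/completions` endpoint. The base URLs and model names below are placeholders, not real deployments, and whether the server reuses its prefix cache between requests depends on how it is configured. The request function is passed in as a parameter so the loop can be tried without a running server.

```python
import json
import urllib.request
from typing import Callable, List, Tuple


def http_complete(base_url: str, model: str, prompt: str) -> str:
    """Request a single token from an OpenAI-compatible /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": 1, "temperature": 0}
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["choices"][0]["text"]


def alternate_endpoints(
    endpoints: List[Tuple[str, str]],
    prompt: str,
    turns: int,
    complete: Callable[[str, str, str], str] = http_complete,
) -> str:
    """Alternate between (base_url, model) pairs, sending the full text each turn."""
    text = prompt
    for turn in range(turns):
        base_url, model = endpoints[turn % len(endpoints)]
        # Text, not token IDs, is shared, so differing tokenizers are not a problem.
        text += complete(base_url, model, text)
    return text


if __name__ == "__main__":
    # Placeholder endpoints: one server per GPU, as suggested above.
    servers = [
        ("http://localhost:8000", "model-a"),
        ("http://localhost:8001", "model-b"),
    ]
    print(alternate_endpoints(servers, "I like eating", turns=8))
```

This issues one HTTP request per token, so it is only suitable as a prototype.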
