Thinking with a smaller model to speed things up?
Posted by q-admin007@reddit | LocalLLaMA | View on Reddit | 10 comments
Question: can I do the thinking with a smaller model, like Gemma 4 4B, then use its output as the prompt for Gemma 4 31B, to speed things up?
Has anyone done this and measured whether it's worth it?
thread-e-printing@reddit
Wouldn't you now have to run prompt processing twice, including reprocessing the first model's generated thinking into the second model's latent space? And wouldn't it be worse thinking in the first place? TANSTAAFL.
q-admin007@reddit (OP)
I had something like this in mind:
System: You are a prompt writer AI. Take the user input, structure it and enrich it with some ideas. Your output becomes the input to a large, but slow non-reasoning model.
User: twitter clone, php
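The two-stage idea above can be sketched as a simple chain: the small model rewrites the prompt, the large model answers the rewritten prompt. This is a minimal sketch with plain callables standing in for model calls (the `small_model` / `large_model` functions are placeholders, not any real API):

```python
def enrich_then_answer(user_prompt, small_model, large_model):
    """Two-stage pipeline sketch: a small model structures and enriches
    the user's prompt, then a large non-reasoning model answers it."""
    system = ("You are a prompt writer AI. Take the user input, structure it "
              "and enrich it with some ideas. Your output becomes the input "
              "to a large, but slow non-reasoning model.")
    enriched = small_model(system, user_prompt)   # fast, cheap pass
    return large_model(enriched)                  # slow, high-quality pass
```

In practice each callable would wrap a request to a local inference server; the point of the sketch is just that the large model still has to process (prefill) everything the small model wrote, which is what the replies below pick apart.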
ShengrenR@reddit
Lots of good comments, but one extra note: keep in mind the "thinking" stage is not generic reasoning that any model can hand off; it's learned, model-specific context building. The traces the little model produces are tuned to guide the little model.
You could do the experiment, though.. have each model run the response completely, then rerun with the thinking part of the context swapped between the models and see how they do. My bet is they do worse with the thinking swapped, but I'd be happy to be surprised.
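The swap experiment described above mostly needs a way to split a response into its thinking and answer parts, then rebuild a prompt that pre-fills the *other* model's thinking. A minimal sketch, assuming the common `<think>...</think>` convention that many reasoning models use (the exact tags vary by model family):

```python
import re

def split_thinking(response):
    """Split a model response into (thinking, answer), assuming the
    response uses <think>...</think> tags around the reasoning trace."""
    m = re.match(r"\s*<think>(.*?)</think>\s*(.*)", response, re.DOTALL)
    if m:
        return m.group(1).strip(), m.group(2).strip()
    return "", response.strip()

def swapped_prompt(question, other_models_thinking):
    """Build a completion prompt that pre-fills another model's thinking,
    so the model being tested continues straight into the final answer."""
    return f"{question}\n<think>\n{other_models_thinking}\n</think>\n"
```

Run both models normally, swap the extracted traces with `swapped_prompt`, and compare answer quality; as the comment predicts, the foreign trace will likely hurt rather than help.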
jax_cooper@reddit
In my personal experience, for quality results from smaller models, they need waaay more thinking and sometimes that's even slower.
For example my tinkerings in February:
qwen3-4b-2507 was slower than qwen3-14b and gave results similar to qwen3-30b-Q1 in non-thinking mode. But for a 4B model it was exceptional.
Another example is Nanbeige, a model in the 2-4B range. It was soooo slow, I even smelled something burning while it was ruminating and had to turn it off :D
34574rd@reddit
I mean, that's what's called speculative decoding, and yes, it does "speed things up". I'd suggest you wait for the dflash Gemma variant, as that would take up fewer resources.
my_name_isnt_clever@reddit
That's not what speculative decoding is.
UnusualAverage8687@reddit
It's not exactly what it is, but it's close, no?
What OP is describing here can't speed anything up: even if the small model "does the thinking", the large model still has to compute the entire prefill, including the small model's thinking tokens. It will actually take longer.
The only way to speed up the larger model using a smaller model (AFAIK) is to use speculative decoding.
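For contrast with OP's idea, here is a toy sketch of what speculative decoding actually does: the small model *drafts* a few tokens, and the large model *verifies* them, keeping the longest prefix that matches its own greedy choice. Callables stand in for models; a real implementation (e.g. llama.cpp's draft-model support) verifies all drafted positions in one batched forward pass of the target model, whereas this loop calls `target_step` per token for clarity:

```python
def verify(context, drafted, target_step):
    """Greedy acceptance rule: keep each drafted token while it matches the
    target model's own choice; on the first mismatch, substitute the target's
    token and stop. Output is identical to running the target alone."""
    accepted = []
    for tok in drafted:
        t = target_step(context + accepted)
        if t == tok:
            accepted.append(tok)
        else:
            accepted.append(t)
            break
    return accepted

def speculative_generate(prompt, draft_step, target_step, n_draft=4, max_new=8):
    """Draft n_draft tokens with the small model, verify with the large one,
    repeat until max_new tokens are produced."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        drafted = []
        for _ in range(n_draft):
            drafted.append(draft_step(out + drafted))
        out.extend(verify(out, drafted, target_step))
    return out[len(prompt):][:max_new]
```

The key difference from OP's pipeline: the large model's output is unchanged, and the speedup comes from verifying several cheap draft tokens per expensive target pass, not from outsourcing the thinking.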
Former-Ad-5757@reddit
You can, but a 4B model will also think worse than a 31B model.
EffectiveCeilingFan@reddit
I mean at that point you’re just using Gemma 4 31B as a summarizer, in which case you’d be better off just using the smaller Gemma for everything.
Miriel_z@reddit
It might help to summarize and format the input; a structured prompt is good practice in general. Haven't benchmarked it, though.