How does a model like QwQ do calculations like 4792*2 "in its head"?

Posted by andWan@reddit | LocalLLaMA

I asked the model at https://huggingface.co/Qwen/QwQ-32B-Preview "What is 4792 * 3972?" In the chain of thought I saw it break the problem down into 4 simpler multiplications, which makes sense. But then it was able to calculate "4792 × 2 = 9584" outside of the generated text, without writing out any intermediate steps. Were calculations like this simply in the training data? Or can this be achieved via the attention mechanism in the Transformer architecture? Are there studies that have investigated the numbers inside the attention mechanism as they were being updated? I studied "Neural Systems and Computation" but have not worked in this field for 14 years. Most of what I know comes from the 3Blue1Brown video series about LLMs.
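To make the decomposition concrete, here is a minimal Python sketch of the arithmetic. It assumes the 4 simpler multiplications came from splitting 3972 by place value (3000 + 900 + 70 + 2), which is my guess based on the "4792 × 2" step. The helper name `partial_products` is made up for illustration, and this only reproduces the arithmetic, not anything about what the model does internally.

```python
def partial_products(a: int, b: int) -> list[tuple[int, int]]:
    """Split b into place-value parts and return (part, a * part) pairs.

    Assumption: this mirrors the kind of breakdown seen in the chain of
    thought; the model's actual decomposition may have been different.
    """
    parts = []
    place = 1
    while b > 0:
        b, digit = divmod(b, 10)  # peel off the lowest digit
        if digit:
            parts.append((digit * place, a * digit * place))
        place *= 10
    return parts[::-1]  # largest place value first


steps = partial_products(4792, 3972)
for part, product in steps:
    print(f"4792 × {part} = {product}")

total = sum(product for _, product in steps)
print(f"4792 × 3972 = {total}")
```

Each partial product is a single-digit multiplication followed by a shift, so "4792 × 2 = 9584" is exactly the kind of step that has to be produced in one go, with no further breakdown visible in the text.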