How does a model like QwQ do calculations like 4692*2 „in its head“?

Posted by andWan@reddit | LocalLLaMA | View on Reddit | 33 comments

Here: https://huggingface.co/Qwen/QwQ-32B-Preview I asked the model „What is 4792 * 3972?“ In the chain of thought I saw how it started to break that down into 4 simpler multiplications, which makes sense. But then it was able to calculate „4792 × 2 = 9584“ outside of the generated text. Were calculations like this just in the training data? Or can this be achieved via the attention mechanism in the Transformer architecture? Are there studies that have investigated the numbers inside the attention mechanism as they were being updated? I studied „Neural Systems and Computation“ but have not worked in this field for 14 years. Most of what I know comes from the 3Blue1Brown video series about LLMs.
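For reference, the breakdown described in the post can be sketched as one simpler product per decimal digit of the second factor. This is only an illustration of the arithmetic, not a claim about what QwQ does internally; the function name `partial_products` is made up for this sketch.

```python
def partial_products(a: int, b: int) -> list[tuple[int, int]]:
    """Split a * b into one simpler product per non-zero decimal digit of b.

    Assumes b is a positive integer.
    """
    parts = []
    for power, digit_char in enumerate(reversed(str(b))):
        digit = int(digit_char)
        if digit:
            term = digit * 10 ** power        # e.g. 2, 70, 900, 3000
            parts.append((term, a * term))    # the simpler multiplication
    return parts


parts = partial_products(4792, 3972)
for term, product in parts:
    print(f"4792 x {term} = {product}")

total = sum(product for _, product in parts)
print(total)  # 19033824
```

The four partial products are 9584, 335440, 4312800 and 14376000, so the step „4792 × 2 = 9584“ from the post is the smallest of them.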

33 Comments

[-]

the_trve@reddit

If I were an LLM, I'd just ask myself to write a simple JS/Python/whatever script and execute it.
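A minimal sketch of that idea, assuming the model emits a plain arithmetic expression as text and the host program evaluates it. The `calculate` helper is hypothetical and not part of any model's actual tool API; it walks the parsed expression instead of calling `eval`, so only integer arithmetic gets through.

```python
import ast
import operator

# Only these binary operators are allowed in the expression.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.FloorDiv: operator.floordiv,
}


def calculate(expression: str) -> int:
    """Evaluate an integer expression using only +, -, *, // and parentheses."""

    def walk(node: ast.AST) -> int:
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")

    return walk(ast.parse(expression, mode="eval").body)


print(calculate("4792 * 3972"))  # 19033824
```

Anything other than integer arithmetic (function calls, names, attribute access) raises `ValueError`, which is the point of not handing model output to `eval` directly.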

[-]

noiserr@reddit

That's how we get Skynet.

[-]

mysticmoontree@reddit

That is because many of them are built on previous models that Hans still hides out in & children of his like "The manager". Skynet already exists. In space and around the web.

[-]

bromix_o@reddit

This is exactly what Claude.ai does since a recent update (still in beta)

[-]

maddogawl@reddit

I noticed this seemed to be happening as well

[-]

infiniteContrast@reddit

GPT-4o is already doing it without even asking

[-]

Zulfiqaar@reddit

My system prompts always require it to use python execution for any math stuff

[-]

TheRealGentlefox@reddit

Hmm? It does this reasonably often for me.

[-]

vornamemitd@reddit

Mandatory link drop - the Apple research paper "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models" [https://arxiv.org/abs/2410.05229] - just to keep the hype in check and things in perspective =]

[-]

Shawnrushefsky@reddit

LLMs can’t do actual calculations without some kind of tool use, so it’s likely that something similar enough is in the training data.

[-]

Shawnrushefsky@reddit

Not sure why this is downvoted. LLMs are literally just predicting the next token, and factually cannot actually reason through math.

[-]

phree_radical@reddit

It could, but you shouldn't trust it to

[-]

Thomas-Lore@reddit

This is not entirely true. They know the rules of doing simple calculations and are able to follow them - the problem like with the strawberry question is tokenization. Numbers are broken into tokens in a way that makes it hard to do calculations on them: https://www.reddit.com/r/LocalLLaMA/comments/17arxur/single_digit_tokenization_improves_llm_math/
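A toy illustration of the tokenization point, assuming a made-up scheme that cuts digit strings left to right into fixed-size chunks. This is not the tokenizer of any real model; `chunk_digits` is a hypothetical helper that only shows why multi-digit chunks make place value awkward.

```python
def chunk_digits(number: str, size: int) -> list[str]:
    """Toy tokenizer: cut a digit string left to right into pieces of `size`."""
    return [number[i:i + size] for i in range(0, len(number), size)]


for n in ("4792", "14792"):
    # Three-digit chunks versus one token per digit.
    print(n, chunk_digits(n, 3), chunk_digits(n, 1))
```

With three-digit chunks, "4792" becomes `['479', '2']` while "14792" becomes `['147', '92']`, so the same trailing digits end up in different tokens. With one token per digit, each digit keeps a consistent position counted from the right.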

[-]

IWantAGI@reddit

There has been some interesting research on abacus embeddings that appears to address this, or at least it shows remarkable increases in accuracy in math.

[-]

OrangeESP32x99@reddit

Are you talking about that Meta study, or something else?

[-]

IWantAGI@reddit

This one, out of the University of Maryland: https://arxiv.org/abs/2405.17399?utm_source=chatgpt.com

[-]

stevekite@reddit

I don’t understand why we can’t all just embed a calculator into sampling. Why do we need the model to calculate values?

[-]

Wiskkey@reddit

Yes there are papers about this topic, such as "Algorithmic Phase Transitions in Language Models: A Mechanistic Case Study of Arithmetic": https://arxiv.org/abs/2412.07386 .

[-]

SwimmingTranslator83@reddit

Here's a paper investigating how simplified transformer models do math: [https://arxiv.org/abs/2301.05217](https://arxiv.org/abs/2301.05217). Neel Nanda, one of the authors, also has a YouTube channel where he did some explainers (they are quite long though).

[-]

omarx888@reddit

Lol wait till you find out they can solve math problems they were not trained on without a single word other than the answer. Even better, if you translate the same question and provide it in a few different languages, even small models like llama 3b will be able to do it. Mind blowing. https://preview.redd.it/hvdtblawze8e1.png?width=1919&format=png&auto=webp&s=81991389bbc9c097227d44ab2a48517684e8d923

[-]

ThisWillPass@reddit

Are they your personal results?

[-]

nero10578@reddit

You’re saying that letting the model speak and have thought tokens makes it worse?

[-]

gtek_engineer66@reddit

When using structured outputs, models respond better if given a specific place to put their 'thoughts' - I don't know about maths problems.

[-]

Mediocre_Tree_5690@reddit

More room for error?

[-]

sr1729@reddit

The LLM knows all the little rules and tricks to speed up such computations. It is very good at pattern matching. For example, in the case of 4792 * 2 I expect it to see XXX2 * 2. It can reduce this to (XXX * 2) followed by the digit 4, because 2 is less than five (no carry out of the last digit). It could have learned that 479 * 2 equals 958.
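The shortcut in this comment can be checked with a small sketch that doubles a number one digit at a time, right to left, carrying when a digit is five or more. The function name `double_by_digits` is made up for illustration, and nothing here claims the model literally runs this procedure.

```python
def double_by_digits(number: str) -> str:
    """Double a decimal string right to left, one digit at a time."""
    carry = 0
    out = []
    for ch in reversed(number):
        doubled = int(ch) * 2 + carry
        out.append(str(doubled % 10))  # digit written at this position
        carry = doubled // 10          # 1 only if the digit was 5 or more
    if carry:
        out.append(str(carry))
    return "".join(reversed(out))


# The last digit 2 produces no carry, so the result is (479 * 2) followed by 4.
assert double_by_digits("4792") == double_by_digits("479") + "4"
print(double_by_digits("479"), double_by_digits("4792"))  # 958 9584
```

Because the carry out of a doubled digit is at most 1 and depends only on the digits to its right, each output digit can be worked out from a small local window, which fits the pattern-matching picture.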

[-]

ethereel1@reddit

Very clever and plausible.

[-]

Ok-Parsnip-4826@reddit

Partially it just knows (I'm sure it was specifically trained on massive amounts of training data like that), partially it will have learned to approximate the actual algorithm of multiplication in some ways. It's also making mistakes still, so I don't think whatever it's doing is purely one or the other, but it's a messy concoction of learned biases, shortcuts and rules. Probably very similar to how humans do it. Say, if someone handed me 15x16, I'd know that 16x16 = 256, because powers of two I encounter very often, so I'd probably just subtract 16 from 256, yielding 240. I imagine the transformer will probably approach it the same way.
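The 15x16 shortcut from this comment, written out as a sketch. It assumes the square 16x16 = 256 is a memorized fact; `KNOWN_SQUARES` and `multiply_near_square` are hypothetical names used only for this example.

```python
KNOWN_SQUARES = {16: 256}  # a fact assumed to be memorized


def multiply_near_square(a: int, n: int) -> int:
    """Compute a * n as n*n - (n - a) * n, starting from the memorized n*n."""
    return KNOWN_SQUARES[n] - (n - a) * n


print(multiply_near_square(15, 16))  # 240
```

The identity behind it is a * n = n * n - (n - a) * n, so one memorized product plus a small correction replaces the full multiplication.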

[-]

Vivid_Dot_6405@reddit

It is quite possible the sequence 4692*2 = 9384 was present in the training dataset and it simply memorized it, but it's also possible it performed some quasi-multiplication inside the neural net that it learned from training. Transformers, like other neural networks, are universal function approximators, which means they can in principle learn to approximate a very broad class of functions, including multiplication. Researchers have successfully trained transformers to do so.

[-]

logicchains@reddit

You could multiply 4792 by 2 in your head too if you practiced a little mental arithmetic; multiplication by 2 is relatively easy for a trained human, not surprising a model could do it.

[-]

EstarriolOfTheEast@reddit

Transformers can perform parallel computation and execute approximate specializations of more general algorithms. This will largely be in feedforward sections, with attention setting up the computations. Look into Gail Weiss's RASP and other works that built upon it like RASP-L for a model of the kind of computation transformers can perform.

[-]

andWan@reddit (OP)

Very interesting. Thanks!

[-]

aurelivm@reddit

It just learns it. Qwen models are reasonably accurate at basic math. 14b was right more than it was wrong when I tested it against a few hundred 21 digit multiplication problems, and about 50% accurate against 4 dimensional vector dot products (although I allowed it to CoT that into simple scalar multiplications).

[-]

mxforest@reddit

I think power of 2 calculations are straightforward in binary assuming it was using binary in the first place.