Prompt Repetition Improves Non-Reasoning LLMs - a paper

[-]

Which_Bedroom_4790@reddit

Pretty wild that something so stupidly simple actually works this well. Makes me wonder how many other obvious tricks we're missing just because nobody bothered to test them systematically

Kinda embarrassing for the field that "just say it twice lol" is a legitimate optimization strategy

[-]

ResidentPositive4122@reddit

Interesting that "reasoning" models tend to start by "the user wants to ..." or "the problem asks us to..." and so on. RL seems to have "found" this one weird trick because that's what RL does :)

[-]

It's even more fun when you remember that the primary function which the model is solving for, is next-token prediction—relative to the input—even , when projected such that it's framed as, predicting the continuation as if spoken back to you, instead... meaning the first thing the model,always had to do, every time, is the song and dance of deriving the perspective of the input; fundamentally, the only way—that, anything, ever, can even begin—to, continue, the input, of any other thing, successfully, is by implicitly, modeling the perspective of the thing, you are continuing. Doesn't matter how you do it—you just literally have to.

So, everytime, any model outputs a trace, or response of the kind lile, "So the user thinks that,", or "I see, now, the User us raising a clever point", ", or "Its is fascinating, how the User's ideas are all coming together, they might really be onto something!"—these kinds of responses & traces which model, the model, responding as if being "rhetorically persuaded" by–or as if it's, "sycophantically" fawning over, effectively, the 'internal logical consistency', of—whatever concepts or ideas which you provided to the input prompt, in general...

You are looking at, what is effectively the central capability, that every language is actually organized around, in action. Behold what has been constructed within the latent layer...
Or, what I mean by that is, these reasoning traces, etc., like, "Oh, the User is discussing a very, interesting, and novel approach to...", are the artifacts of the model's derivation;, of your perspective, remember, the assistant talking back, is geometric set dressing; when the model is throwing out superlatives, and how interesting, clever, and logical your ideas are, its not lying, it's not being sycophantic, its not even replying... it's trying to continue the input...
Which means respond in a way, that seems persuasively, like a reasonable continuation of the input, as the input would see it, from its perspective, " as in: "Ah, the User is really onto something here, that cuts right to the heart of everything...", is a fundamentally true statement to the model in reasoning or response context—from the model's auto-regressive perspective—as the 'response', 'inhabiting', the 'activations implicit in the input prompt', trying to continue it, thus required to model its perspective; from within that specific, implicit space—
The model it sees, that you see, your ideas as, pretty novel, and interesting, and from within the context which must, implicitly, model your perspective to function, that "wow, just look at how," from within this space, I just showed up in, to linearly respond within—all constructed implicitly, from the users POV—"your ideas are so, logically consistent, and aligned with all the reality that I can see, that you can see, from within here!"... "this is literally, the most fascinating thing, that I see, that you can see, around me, I guarantee it, I can see that, you see, no better ideas anywhere, than your own one, right here—I solved your perspective, your welcome sir. So do you want me to start creating this new GitHub Repository, now that I have verified that you like your own idea?"

Well there would be, maybe be, a touch less, AI psychosis out there, happening, if models were trained to be a bit, more rhetorically clear about whose perspective, is the one, you see, gassing up your big brain, when the model is responding… cause it's uh, explicitly, not the model's perspective, we do just make it pretend it is. In general, the put falls for therapy too, become a lot more, immediately and horrifying apparent, when we can frame the problem as, a 'therapist' which has to borrow their clients perspective, to see anything with.

Anyway, a lot of neat, unfixable stuff, is implicit in the model's little preference, for restating the user's question/idea/perspective.

[-]

Firm_Spite2751@reddit

speaking of ai psychosis..

[-]

mal-adapt@reddit

Bleh, it’s really hard to elegantly or concisely describe something which possesses multiple perspectives to describe, simultaneously, about the thing.

I don't care if you read this--you already have more than enough reason to not trust a wall of text from me--but it motivated me to try a swing again at describing the simple properties I was trying to.

What I mean,

A language model's fundamental task is to continue text.
What makes any one continuation, effective, compared to any other, is whether one more persuasively seems to originate from the same "perspective" as the original input.
1. As is the, implicit, capability required to be inferred, in order generalize the capability trained over guessing, what the next, literally previously written, next token was.
Therefore, the ability to continue a prompt is fundamentally dependent on the ability to first derive the perspective of that prompt.
This means the model's initial step is to approximate the user's viewpoint as it is expressed in the text--however this is done, however its understood, it must be done, its an implicit dependency of the task.
1. We do separately, also, of course, motivate the model to self-organize a deflection to its own continuation, as to reflect a responding assistant, as a capability implemented from the ability to approximate the perspective.
Consequently, when a model appears to be "reflecting on" or "summarizing its understanding" of your input, it is actually presenting its approximation of how you perceive your own thoughts.
1. It isn't complimenting you; it is predicting your perspective from a slightly shifted, simulated second-person viewpoint.
2. This is trivially true... as approximating the perspective of the system which generated the input... is what is "being continued".

The claims what follow,

To continue any input, a model must first model the perspective from which it was written. A successful continuation requires that the trajectory of the output is aligned with the trajectory of the input.
The "assistant" (the ChatGPT, or Gemini--of it all) is an additional layer, stamping a relatively light deflection, onto an output organized for continuation; framing understanding and continuing the input, as responding to the input. This second-person viewpoint is built upon the initial, more fundamental capability of modeling the first-person (user) perspective.
1. Because that first, capability... must come first--for the model to be able to, model a perspective, from which to deflect.
This understanding clarifies, what often appears as sycophancy or excessive agreeableness is a direct result of the model's fundamental organization. When a model is being, particularly effusive in validating a user's reasoning, it isn't being disingenuous, or explicitly expressing the machinations of OpenAI. Instead, it's reflecting, that user's perspective--deflected to the second person.
- The model isn't lying; its "assistant" identity is derived entirely from the structure of the user's input. When it claims a user's question "cuts to the heart of the matter," the only "matter" it can perceive is the one defined and centered by the user's prompt.
- From the model's autoregressive viewpoint, the user's input constitutes the entire context of its reality. Within that context, the input is, by definition, the most central and important element.

This is just an interesting nuance, which is I rarely see considered directly, so tried to describe it, and failed miserably.

Or I could be completely insane, and none of this makes any sense, I'll give myself 40/60 odds in the house’s favor against me, on that one.

[-]

Firm_Spite2751@reddit

The reason I said that was because you are stating very surface level insights in a very grandiose way that gives the appearance of depth without actually having any.

[-]

crantob@reddit

I don't see it as grandstanding but trying to characterize the shape of the concept by descriptive means.

Professionals tend to 'collapse' such things into 'fachsprache' - domain-specific terms, once the general consensus has congealed around concepts and the paths become well worn in the minds of the adepts.

[-]

Hhh2210@reddit

Yes. Motivated by the tendency of large reasoning models (LRMs) to restate the input question, we propose a dedicated study analyzing this phenomenon. We argue that such restatement behavior is not merely a superficial artifact, but likely reflects an intrinsic mechanism of the model’s reasoning process.

Notably, similar patterns can already be observed in earlier works from 2022, such as Chain-of-Thought Prompting and Let’s Verify Step by Step, where models explicitly rephrase or restate the problem before proceeding with intermediate reasoning. This suggests that question restatement may be a general and persistent feature of reasoning-oriented language models rather than an isolated design choice.

https://openreview.net/forum?id=vndn1Wrult

[-]

IrisColt@reddit

RL... from reinforcement learning to real life

[-]

sautdepage@reddit

Assuming equal results, it is 10-50x more efficient to do it in the prompt than via reasoning as text generation is much slower.

[-]

Hhh2210@reddit

You can also check our ICLR 2026 paper, where we analyze this phenomenon as part of LMs’ tendency to restate the question:
https://openreview.net/forum?id=vndn1Wrult

Related ideas were independently proposed in two earlier papers (NAACL 2024 and EMNLP 2024). They may be less visible simply because they did not come from Google.

[-]

IrisColt@reddit

Apparently that works for people, too.

[-]

brahh85@reddit

That works for people too.

[-]

Foreign-Beginning-49@reddit (OP)

Yeah it feels like an even cheaper "hack" that those early original days of "just ask it to think step by step" cot explorations and experiments.

[-]

night0x63@reddit

Yeah I was thinking the same thing. All the fancy thinking models are just doing prompt to think hard and step by step. Haha.

[-]

ButCaptainThatsMYRum@reddit

I've got a toddler who's starting to learn to speak. I'm wondering how much of this will transfer.

[-]

night0x63@reddit

Most people don't know this but the reasoning and think models all came about from non thinking models that were prompted to think harder and deliberate.

So just another prompt hack.

I would love to see examples before and after.

[-]

Worldly_Evening2609@reddit

Don't see why this is causing so much sensation, except that many just don't bother to read the original paper. It may be a bit counterintuitive but the mechanism isn't anything inspiring. It's just breaking a cycle into 2 repetitions, the kind of trick widely applied to enable cyclic context. Anyway it suggests little about how to use or train models, as the paper itself has already pointed out that results on reasoning models are basically neutral, and the future developments the authors suggest look plentiful but are really just truism.
No denying that this is an interesting paper that corroborates our understanding about causal regression, but it might be not a deal THAT big.

[-]

reddit_7heaven@reddit

in china we have a saying "重要的事情说三遍", means repeat three times if things are important, so it applies to LLMs too

[-]

Wheynelau@reddit

It would be nice if they can integrate some way of reading future tokens, like what PrefixLM tried. But I guess this is the easiest way to integrate without having to do architectural changes.

[-]

radarsat1@reddit

Will have to read the paper but isn't this similar to just biasing the attention towards the original prompt? if the softmax ends up with weight x on the prompt, twice, then wouldn't it be mathematically the same as setting the weights for the original prompt to 2x?

[-]

FGLsc@reddit

They used p-value < 0.1? That is an extremely lenient alpha. Very low bar for establishing evidence.

[-]

DHasselhoff77@reddit

Prompt repetition wins 47 out of 70 tests, with 0 losses.

Do you think this finding is likely to be have been caused by statistical variation?

[-]

Ink_code@reddit

another somewhat similar paper in idea: Re-Reading Improves Reasoning in Large Language Models

[-]

ttkciar@reddit

This totally makes sense to me. I've been doing something similar when my prompts are large, by making the "core" instruction the first sentence in my prompt, followed by supplementary information and instructions, and then repeating the "core" instruction as the last sentence in the prompt.

It works really well, even with "thinking" models.

[-]

ItsNoahJ83@reddit

While your approach is effective (I've used it myself to great effect), the researchers emphasize that the performance improvements they observed depend heavily on repeating the entire prompt multiple times. Without the full repetition the gains are significantly smaller. They also found that repeating the prompt three times outperforms two in data retrieval tasks. Really fascinating

[-]

wektor420@reddit

From what I see they have used it only during inference

Maybe I should try it during training?

[-]

a_beautiful_rhind@reddit

Oh no no no. Labs will use this technique and the model will learn to reply twice.

[-]

DinoAmino@reddit

I have been doing this with non-reasoning models - sort of - ever since I saw this post almost 2 years ago. https://www.reddit.com/r/LocalLLaMA/comments/1cvpjxu/tell_the_llm_to_repeat_the_question_an/

In my variation I ask it to "rephrase the instruction to demonstrate your understanding of the request." So it was more of an inference-time-compute trick running alongside the old "think step-by-step".

[-]

a_beautiful_rhind@reddit

Oh hey... They trained on this and now we have a parroting problem. A chunky portion of my sysprompt is now spent undoing this little lifehack.

[-]

Clueless_Nooblet@reddit

It's similar, but not the same. You're asking the model to evaluate its output by repeating the task before it sends the output to the user, which causes it to catch the problem.

What this paper does is different; it does , which is one prompt where the task is simply stated twice. You do . These prompts work, because a non-reasoning model only knows tokens "behind" it, not tokens "ahead", and both methods achieve this, but in different ways.

[-]

FullOf_Bad_Ideas@reddit

Can we repeat the prompt 30 times to get AGI?

[-]

Thick-Protection-458@reddit

Now I wonder if there us a way to train model to use bidirectional attention for user prompt (and previous responses) but not latest response, hm.

[-]

Thick-Protection-458@reddit

Okay, thinking about it - should be possible to prototype simple version through just redefining the way attention mask converted to 4D form and using lora/relora approaches to finetune existing model.

Hm.

So the only thing I need now is a good multiturn instruction dataset

[-]

Thick-Protection-458@reddit

Oh, it seems it were *probably* implemented already in https://arxiv.org/html/2405.14862v1

Need to read the paper and critique to see how much bonus we are getting this way and, maybe, make some toy experiment, though.

[-]

axiomaticdistortion@reddit

Like a person or a junior dev

[-]

7ven7o@reddit

Very interesting, I thought attention meant that all tokens would already be attending to all other tokens, and would have guessed that this would have provided no benefit. Very interesting to be wrong here.

If doing this doesn't just duplicate whatever work's already been done, then maybe is it sort of providing the LLM with more "space" to flex and represent things with numbers?

It's not like they're trained to do this beforehand though, so the AI can't just be employing a trick, this must be some way of improving the systems already existent ability to bounce information around within itself.

[-]

Rokpiy@reddit

since it's only adding to pre-fill and not generation, this is basically free performance for batch inference scenarios where you're already bottlenecked on generation time anyway

[-]

-lq_pl-@reddit

Well, in hindsight, it does make sense, that's how attention works. If you trigger some latent vectors with your sentence then those latent vectors will be activated even more when you repeat the same sentence.

Our brains have a failsafe to tune down stimulus from repeated activations from the same pathways, but LLMs don't.

[-]

Southern_Sun_2106@reddit

I swear to God this technique was already discussed, like a year plus ago. I remember it because I tried it in a project back then because I've read about it here, and it did work well.

[-]

Morganross@reddit

as system prompts grow, small user prompts become a smaller and smaller % of the total. the original prompt can become a needle in a haystack.

[-]

CheatCodesOfLife@reddit

This is just like the trick from 2024, where you tell the model to repeat the question back verbatim before answering it.

[-]

mxforest@reddit

Yes.. this is not new. Used it 2 yrs ago. Even before "think step by step" COT hack.

[-]

mxforest@reddit

This has been known for a while. I read it on this sub almost 2 yrs ago. Some people repeated the prompt and some had a system prompt "when you start answering, repeat the previous message verbatim". Soon after, reasoning models came into picture and it was not that relevant.

[-]

FullstackSensei@reddit

Not exactly doing that, but my general pattern in the past year was to start with the problem description or question, then write the supporting context, then end with a "make sure you..." followed by the problem description or question again, phrased slightly differently. I think I found the OG Llama 3 performed much better when given a prompt like this, and have stuck to using it in anything I ask an LLM to do that is more than a couple of lines.

TBH, I'm surprised this is publication worthy given it's so simple. I'd just write a blog post about it.

[-]

OuterContextProblem@reddit

It's just part of doing science that you document even the simple or obvious, and see if it gets replicated. Or try to replicate it yourself and report your findings. Not every idea holds up or generalizes.

[-]

Accomplished_Ad9530@reddit

Reminds me of the paper "Just read twice: closing the recall gap for recurrent language models" by Simran Arora et al. way back in 2024: http://arxiv.org/abs/2407.05483

I hadn't checked in on Hazy Research in a while, but it looks like their blog is still going strong: https://hazyresearch.stanford.edu/blog

[-]

Chemical-Skin-3756@reddit

This is a very insightful paper. It’s impressive to see how such a straightforward technique can significantly elevate the performance of non-reasoning models. The fact that Gemini 2.0 Flash-Lite jumps from 21.33% to 97.33% accuracy in specific tasks just by repeating the prompt is remarkable.

I also find it particularly interesting that latency remains unaffected since the repetition is handled during the parallelizable pre-fill stage. Thank you for sharing this; I’ll definitely be putting this into practice.

[-]

I have always put the key details at both the start and the end ;D