If you limit context to 4k tokens, which models today beat Llama2-70B from 2 years ago?
Posted by EmPips@reddit | LocalLLaMA | 20 comments
Obviously this is a silly question. 4k context is limiting to the point where even dumber models are "better" for almost any pipeline and use case.
But for those who have been running local LLMs since then, what are your observations (your experience outside of benchmark JPEGs)? What model sizes now beat Llama2-70B in:
- instruction following
- depth of knowledge
- writing skill
- coding
- logic
Flaky_Comedian2012@reddit
I have yet to find a single modern model that beats old llama2 based finetunes on just being able to have a human sounding conversation.
I give an old model just some example transcript and it will copy the mannerisms and writing style perfectly.
I can even ask it questions and often characters will just refuse to answer because they do not care or know anything about the topic. With new models, even if they can stay in character for a few sentences, it all goes out of the window when you ask them a question. Then the AI assistant part takes over immediately. Old models will often act in denial even if I tell them that they are an AI.
I really wish someone would make an old school model just with more context.
No_Efficiency_1144@reddit
On Hugging Face there are models that never had the alignment stage
These would be a good starting point
No_Efficiency_1144@reddit
Almost all of my usage of LLMs is below 4k token context
Thomas-Lore@reddit
You are missing out on in context learning.
No_Efficiency_1144@reddit
Yeah I agree, too much focus on fine tuning and I fell behind in ICL
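(For anyone unfamiliar: in-context learning just means packing worked examples into the prompt instead of fine-tuning. A minimal sketch of the idea, where the prompt format and examples are made up for illustration, not any specific model's API:)

```python
# Minimal sketch of in-context learning (few-shot prompting):
# worked input/output examples are prepended to the new query,
# and the model infers the task pattern from them.

def build_icl_prompt(examples, query):
    """Pack (input, output) example pairs plus a new query into one prompt."""
    parts = []
    for inp, out in examples:
        parts.append(f"Input: {inp}\nOutput: {out}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

# Toy sentiment-labeling task
examples = [
    ("great movie, loved it", "positive"),
    ("total waste of time", "negative"),
]
prompt = build_icl_prompt(examples, "surprisingly fun")
print(prompt)
```

The whole prompt, examples included, has to fit in the context window, which is why even a 4k budget can cover a handful of demonstrations for short tasks.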
DrAlexander@reddit
What's context learning?
Roubbes@reddit
Mistral Small 3.2
SillyLilBear@reddit
Almost all of them. Llama models suck.
BigRepresentative731@reddit
Qwen 2.5 14b
EmPips@reddit (OP)
Significantly smarter.
I don't know if it's on par with knowledge depth though.
BigRepresentative731@reddit
Trust, it is. I have many many many many hours of experience with this model
Red_Redditor_Reddit@reddit
My experience is that the newer models are generally better at everything except writing with human-like verbiage, likely from being overtrained. Llama 2 also seems to do a better job following some system prompts. Newer models seem to be hard-trained to be an AI assistant. If I give a prompt to llama 2 that says "this is a conversation between an ant and a user", it will hallucinate being an ant. Newer models will insist that they're an AI assistant, or at least an ant AI assistant.
Thomas-Lore@reddit
Llama 2 was absolutely horrible at writing. All current models are better than it at this.
mikael110@reddit
If we are talking about vanilla Llama-2 and not a finetune, then honestly pretty much any modern model that is 12B or above will likely beat it on anything other than creative writing.
Llama-2 always felt like it was undertrained. It was not very good at instruction following, and it certainly wasn't a fountain of knowledge either. It was also one of the first official instruction models that had been red-teamed to such an extent that it was basically unusable for most tasks. It was the origin of the whole "refusing to kill a Linux process" thing, which was a meme for a bit in this community.
Coding was also terrible. It came out before coding was a big focus among LLMs, and it shows. I remember there was a big push to create coding finetunes from it back then specifically because the base model was so bad at it.
a_beautiful_rhind@reddit
Plain llama-2 was pretty meh... what models would beat miqu?
AppearanceHeavy6724@reddit
Never tried llama2-70B, but if you offer a sample prompt and the produced result, that would make it easier to answer. I'd think all 24B and bigger models of 2025 will beat llama2 at everything but writing skill, which is hit and miss with modern models.
DinoAmino@reddit
I'd hazard a subjective guess that anything "recent", 32B and above, is better in all regards due to more training tokens, improved training methods, and higher quality datasets. Codellama 70B was a fine-tune of Llama 2. Since then the only coder fine-tune above 70B was the DS 236B. So I'm assuming later models 70B and above have also been trained on coding datasets. Qwen 2.5 32B coder, for instance, was probably fine-tuned on the same coding datasets used in their 72B. And that coder certainly beats codellama - codestral 22B already did.
Double_Cause4609@reddit
Actually, 4k context is *a lot* in the context of a broader system; you'd be surprised at what can be done with 4k context and a carefully engineered setup.
Regardless, in my humble opinion:
Olmo 2 32B.
It's a pretty remarkable model and really does feel like the mini Claude at home, its only limitation being its context window of 4k (which probably could be alleviated with things like RoPE scaling or ALiBi if the model were well supported in inference backends).
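(For reference, llama.cpp exposes RoPE-based context extension at load time; whether Olmo 2 actually tolerates this is an open question, and the scale factor below is just a guess for 4k -> 16k:)

```shell
# Hypothetical example: linear RoPE scaling to stretch a 4k-trained model
# to a 16k window (quality typically degrades; needs per-model testing).
llama-cli -m olmo2-32b.gguf \
  --rope-scaling linear \
  --rope-freq-scale 0.25 \
  -c 16384 \
  -p "Summarize the following document:"
```

The freq-scale of 0.25 corresponds to the 4x stretch (trained context / target context); this is a config sketch, not a tested recipe for this model.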
No_Efficiency_1144@reddit
Yeah it doesn’t have to be 4k tokens of English
4k tokens of a domain specific language or encoded data can be loads
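A toy illustration of that point, using character counts as a crude stand-in for tokens (real tokenizers differ, and the record format here is made up):

```python
import json

# The same records encoded verbosely as pretty-printed JSON vs. as a
# terse pipe-delimited line format. Compact encodings fit far more
# data into a fixed context budget.

records = [{"city": "Oslo", "temp_c": 4, "wind_kph": 12},
           {"city": "Lima", "temp_c": 22, "wind_kph": 7}]

verbose = json.dumps(records, indent=2)
compact = "\n".join(f"{r['city']}|{r['temp_c']}|{r['wind_kph']}"
                    for r in records)

print(len(verbose), len(compact))  # compact is several times smaller
```

The trade-off is that the model needs the schema explained once up front (e.g. "lines are city|temp_c|wind_kph"), after which each record costs only a handful of tokens.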
Ravenpest@reddit
Deepseek (R1 / V3) at Q1 at 4k context shits on Llama2 70b any day of the week for whatever task your heart desires.