Llama models: still valuable for finetuning or surpassed by everything new?
Posted by Silver-Champion-4846@reddit | LocalLLaMA | View on Reddit | 83 comments
Hello there, people. I've noticed that folks are pretty much ignoring Llama 3, 3.1, 3.2, and 3.3 these days. Nobody mentions how their experience goes with fine-tuning those models. But we haven't been getting many new entries in the 70-billion space. So is, for example, Llama 3.3 70B still the best thing available to experiment with and fine-tune? Or is it Qwen3 all the way?
Healthy-Nebula-3603@reddit
You sound like someone from 2024 :)
Badger-Purple@reddit
This!!
randominsamity@reddit
This=I have nothing to say but I want to feel like I've contributed to the conversation.
Badger-Purple@reddit
There is no conversation to be had
silenceimpaired@reddit
This=I haven’t discovered the upvote button.
Virtual_Monitor3600@reddit
This!
Silver-Champion-4846@reddit (OP)
I have no idea what you're talking about. I know that it's old, but shouldn't finetuning bring relevant data?
Healthy-Nebula-3603@reddit
What for ?
Connect your database to agent ..that's it
Silver-Champion-4846@reddit (OP)
Interesting... That simple? What about context overload or context rot?
Healthy-Nebula-3603@reddit
Nowadays... yes.
The agent handles everything: memory management, memory compaction, internet access, task planning, etc.
All you need from the model is to be good at function calling, follow instructions well, and keep information in context well.
The agent handles the rest. So just connect a database, and the agent will handle it itself. When you ask something, the agent checks whether it's present in the database by calling a search function.
Personally, at home I'm using opencode with llama-server as the API provider, running Gemma 4 27b and qwen 3.6 26b.
Gemma 4 gets 85k context and qwen 3.6 gets 100k with Q8 cache on my RTX 3090 via the Vulkan backend.
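A setup like that maps onto a llama-server launch roughly like the one below. The .gguf filename is a placeholder; the flags are llama.cpp's, and the `q8_0` cache types are what "cache Q8" refers to.

```shell
# Sketch of a llama-server (llama.cpp) launch with a Q8-quantized KV
# cache. The model file is a placeholder; --cache-type-k/-v quantize
# the KV cache so long contexts fit in VRAM, and -ngl offloads layers
# to the GPU (Vulkan vs CUDA is chosen when llama.cpp is built).
llama-server \
  -m ./model.gguf \
  --ctx-size 85000 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  -ngl 99
```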
Silver-Champion-4846@reddit (OP)
Too bad I don't have a GPU. I tried qwen3 0.8b and it made a terrible web-search call: it called the tool but didn't even start looking at the search results. Jan's OpenRouter implementation is weird; Gemma 4 31b worked for the first prompt and failed afterwards.
handsoapdispenser@reddit
It's a valid question. This sub and half the tools are named after llama which kicked off the local llm movement. I'm sure it's confusing to a newcomer that llama is now such a laggard.
Silver-Champion-4846@reddit (OP)
Indeed.
hidden2u@reddit
Training data cutoff date
Silver-Champion-4846@reddit (OP)
Why do you keep assuming I'm a bot? I don't even talk like one.
jacek2023@reddit
Qwen 3 is old, Llama 3 is ancient
Silver-Champion-4846@reddit (OP)
But isn't finetuning supposed to bring the data you actually need to be relevant? I need to understand.
Classroom-Impressive@reddit
Each new model line tends to be way better; you'd need so much finetuning to make up the gap that you're better off just finetuning the newer model line (e.g. qwen3.6 or 3.5) and completely ignoring qwen3.0, let alone Llama.
Silver-Champion-4846@reddit (OP)
Wow. Thanks for the info
silenceimpaired@reddit
Perhaps a bot?
Silver-Champion-4846@reddit (OP)
Who's a bot? I'm not a bot! Check my submissions.
oldschooldaw@reddit
Unsloth are doing so much work to make fine tuning of qwens need so so so much less compute. It’s impressive as hell
u3435@reddit
APEX quants are substantially superior in every way to Unsloth, except for marketing. I get 256k context on 24GB VRAM with 8GB spillover to system RAM (I-Compact and Opus-Reasoning-Distilled versions). The Opus-Reasoning-Distilled version even scores higher than the I-Compact version, but the I-Compact version has the benefit of a working vision module with 204k context running under llama.cpp.
By comparison, Unsloth only allows for 52k tokens on the same hardware, same settings.
Silver-Champion-4846@reddit (OP)
Amazing if true.
u3435@reddit
See https://pastebin.com/ZQvjwCJi for the exact recipe I used.
Silver-Champion-4846@reddit (OP)
Which qwens?
DinoAmino@reddit
Depends on the goal of your fine-tune and how you go about it. Usually the point of fine-tuning is to perform a specific task or to respond in a specific way. Most fine-tuning damages a model's original performance, especially instruction following. To this end, old models still work great! They are so much easier to fine-tune. MoEs are not easy to fine-tune at all. Qwen 2.5 models and Llama 3 models are still very popular for this - check out HF and see for yourself. Old models are downloaded far more than new models.
And 70B is too expensive to train for most - like, there are a ton of tool calling tunes from 8B to 32B but rarely larger. You can develop and test your fine-tune on a smaller 3B model first to iron out the kinks before going big.
Silver-Champion-4846@reddit (OP)
Are dense Qwens also harder to tune than llama?
DinoAmino@reddit
I'd say no, generally.
Silver-Champion-4846@reddit (OP)
Oh interesting
tecplush@reddit
Finetuning, really? What's the question here?
Do you have any basic AI knowledge?
Silver-Champion-4846@reddit (OP)
I know the basics. LLMs predict the next token. Finetuning them involves giving them a custom dataset and letting their weights be updated on some serious compute - except LoRA, which is more poor-friendly. I probably didn't get your exact intent.
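The LoRA point can be made concrete with a toy numpy sketch. The dimensions and rank below are arbitrary; the point is just that you train two small factors instead of the full matrix, and the zero-initialized adapter means you start exactly at the pretrained behavior.

```python
import numpy as np

# Why LoRA is "more poor-friendly": instead of updating the full
# d_out x d_in weight matrix W, you train two small low-rank factors
# B (d_out x r) and A (r x d_in), so the trainable parameter count
# scales with the rank r, not with d_out * d_in.
d_in, d_out, r = 512, 512, 8

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))   # frozen pretrained weight
A = rng.standard_normal((r, d_in))       # trainable
B = np.zeros((d_out, r))                 # trainable, initialized to zero

# Effective weight during finetuning is W + B @ A. With B = 0 the
# model starts exactly at the pretrained behavior; training then
# learns the delta in the low-rank factors only.
full_params = d_out * d_in               # 262144 trainable if full finetune
lora_params = d_out * r + r * d_in       # 8192 trainable with LoRA (32x fewer)

x = rng.standard_normal(d_in)
y = W @ x + B @ (A @ x)                  # forward pass with the adapter
```

At rank 8 on a 512x512 layer that's a 32x reduction in trainable parameters, which is the whole reason LoRA fits on modest hardware.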
tecplush@reddit
Rethinking this after reading the basic books.
Just kidding. 😉
Good luck, dude.
Silver-Champion-4846@reddit (OP)
Hah thanks.
ttkciar@reddit
If you want to fine-tune a 70B dense, I would recommend K2-V2-Instruct rather than llama.
Newer models have surpassed llama-3.3 entirely. GLM-4.5-Air is a better physics assistant now than Tulu3-405B (a deep STEM retrain of llama3-405B) for example.
Silver-Champion-4846@reddit (OP)
4.5-Air, hmm... I wonder whether GLM 5.x will also have an Air version. I actually can't even finetune a 4b with my resources; I was just surprised that a 70b was being pretty much ignored.
ttkciar@reddit
I've been wondering the same thing. GLM-4.5-Air has consistently punched above its weight, especially at instruction-following. I keep trying these new 120B models as they come out, thinking "surely this one will knock Air off its perch", but so far they've all fallen short in some way or another. Air is still my go-to.
Ah, okay. Yeah, dense models in general have mostly fallen out of favor (though Qwen3.5-27B helped revive interest) and aside from MistralAI and LLM360 they've all been small (32B or smaller).
It's a pity, because dense models offer the highest competence for a given inference memory budget, but I think most people prefer the higher inference speed of MoE models, and most trainers prefer the lower training costs of MoE models as well, so almost all larger-end LLMs have been MoE.
Only MistralAI (with their 128B dense Mistral Medium 3.5) and LLM360 (K2-V2 lineage) seem to be bucking that trend.
Silver-Champion-4846@reddit (OP)
What about the Marco models for highly-sparse moe?
ttkciar@reddit
I haven't tried any of those, yet. My main focus is mid-sized models in the 24B to 32B range. My previous experiences with highly-sparse MoE have not been great, but of course the technology is always improving.
Silver-Champion-4846@reddit (OP)
The thing is, I'm always hunting for small models since I have no GPU, 8GB of RAM, and an Intel 8th-gen U-series CPU. Pretty much in "what the hell are you doing in LocalLlama" territory.
silenceimpaired@reddit
The problem with K2-V2 is it was trained for thinking
Silver-Champion-4846@reddit (OP)
No instruct version?
Confusion_Senior@reddit
Time to take your meds grandpa
Silver-Champion-4846@reddit (OP)
Bad joke, kid.
Confusion_Senior@reddit
I am middle aged btw
Silver-Champion-4846@reddit (OP)
And I'm a young adult, sir.
a_beautiful_rhind@reddit
Those models are better at talking. If you want assistant stuff, use something trained on tools. OTOH, taking stemmaxxed qwen and trying to make it into a conversationalist has similar results.
silenceimpaired@reddit
Yeah, the new stuff is too focused on the agentic mindset. I almost want to see if I can get a modern GGUF for llama 1 so I can use it for style transfer experiments. It's probably the purest, with minimal AI-isms.
Silver-Champion-4846@reddit (OP)
That'd be interesting. Style-transferring that wild way of writing.
Silver-Champion-4846@reddit (OP)
Do you have a <1b good assistant model?
a_beautiful_rhind@reddit
Not really, that's tiny. Something like that would be useful for a text encoder, but I can't imagine it being useful beyond a single task.
Silver-Champion-4846@reddit (OP)
Yeaaah, unfortunate
Enough_Big4191@reddit
Llama 3.3 70B is still great for fine-tuning, especially for specific tasks. Newer models like Qwen3 are strong, but Llama remains solid for practical experimentation. Test both to see what works best for your use case.
Silver-Champion-4846@reddit (OP)
Thanks for the answer
Sufficient_Prune3897@reddit
Everybody wants agents which llama wasn't trained for. So it's pretty much a dead end for that. Finetunes are also dead, since base models are actually good nowadays. But if you actually have a niche, then especially the leaked 3.3 8B performs great. My own finetune on the llama 8B performs much better than the test run I did on Qwen 3.5 9b
Silver-Champion-4846@reddit (OP)
Thanks for the answer.
XMasterDE@reddit
I would say yes
Silver-Champion-4846@reddit (OP)
Thank you
Sash17@reddit
Llama is still solid for finetuning, mostly because the ecosystem around it is huge. But yeah, Qwen has been stealing the spotlight lately since the base models are crazy good for the size.
Silver-Champion-4846@reddit (OP)
I heard that they are no longer base models in the "just raw autocomplete ready to be finetuned on anything" sense, and that they are actually closer to instruct models, is that correct?
Kahvana@reddit
With the creation of good RAG solutions, tool calling for external services, models in general getting so much more capable, etc., there's just much less need to finetune a model.
In most instances it’s going to be a choice between Gemma4-31B or Qwen-27B for dense, or Gemma4-26B-A4B or Qwen3.6-35B-A3B for MoE.
If you do need to finetune: Ministral 3 and mistral small 3 models are decent and got a good license (apache 2.0).
Silver-Champion-4846@reddit (OP)
It seems to me the local AI world has pretty much adopted Qwen3.x and Gemma4 as the standard; they must really be miraculously good.
durden111111@reddit
I still find llama 3.3 70B an interesting model. I feel like it was among the last generation made solely for chat instead of agentic or coding purposes. It is still really good at following instructions. A dense 70B still has something that small dense models or larger MoEs don't.
Silver-Champion-4846@reddit (OP)
How do you see its finetunability potential?
ParaboloidalCrest@reddit
I want to believe that, but putting agentic/coding abilities aside, models from the llama3 era just weren't good at long-context comprehension. Llama 3.3 70b in particular fails to take the entire context into account once it goes beyond 30k. And I'm talking about the official finetune as well as the more recent ones, e.g. pepe-70b. Add slowness to the equation and honestly it makes them pretty much useless nowadays.
silenceimpaired@reddit
That assumes long context is always required. I think even llama1 might work well for rewrite purposes. You only need enough context for two paragraphs. And sampling has matured a lot since that came out.
Fedor_Doc@reddit
LLama models were not trained for agentic and code-generation behaviors, and have no reasoning.
They spawned A LOT of finetunes, and I think they are still a nice starting point if you are into creative text generation, RP, or general chat.
Qwen3 is very different in its base capabilities – STEM and coding are its forte.
There is also the Gemma series; its latest base models should be better than Llama, plus they have reasoning and a modern architecture. Gemma4 31b can be more capable, and it's good for the humanities (knows languages, can write pretty well) and can code reasonably well too.
Silver-Champion-4846@reddit (OP)
When will we start seeing creative Gemmas like we saw with Llama and Nemo?
stoppableDissolution@reddit
Gemma 31 is probably better than llama 70 for everything. If only there was a 7-9b version :c
a_beautiful_rhind@reddit
Not sure about that. Also, I checked the base vs the IT. All the magic is in the IT, and you're not training jack over Google. The base makes small-model mistakes in logic.
stoppableDissolution@reddit
I'm not training better than Google for general capabilities or coding or tool usage, of course, but I'm pretty sure it's possible to elicit more diverse writing out of it. It is very good for a general-purpose model and very smart for its size, but quite rigid in its creative-writing structure.
a_beautiful_rhind@reddit
It's not a bad model by any means, but tuning it is going to have consequences because of how tightly it was trained.
Not talking about coding/tools, understanding things in conversations was where the base fell short too. You already get more variety out of it but that has a cost. Check perplexity and token distribution of both of them. Dense model shouldn't have 2k PPL.
stoppableDissolution@reddit
2k ppl on what?
a_beautiful_rhind@reddit
Even on wikitext. The IT models all have strange PPL and the base has normal PPL. They heavily bias the most likely token and probably took some pages out of the GPT-OSS playbook to make it modification hostile.
stoppableDissolution@reddit
It was likely just pushed to the overtraining limit, same as gpt-oss and qwen3? Idk, I'm going to play with it extensively and see if it's tunable once I get the dataset into the shape I want.
a_beautiful_rhind@reddit
Hence I'm waiting to see if anyone makes a good tune but my hopes are low. Seems like a you get what you get, take it or leave it model.
silenceimpaired@reddit
What would you pick to finetune these days?
a_beautiful_rhind@reddit
Probably a dense 24~30b to start. Maybe go even smaller just to see the impact immediately.
Then you pull a character AI and train whatever model you like the most with the process you develop.
The reason I didn't bother to finetune is because I'd have to curate a massive dataset to make a dent on modern models. It's easier just to sample/prompt my way out of it and let someone else do the work, namely big labs.
A broad thing like writing and conversation is a big undertaking unless you're just hammering the style at other things' expense. Narrow focus like removing refusals or parroting seem like a more achievable goal.
silenceimpaired@reddit
Fair point. A smaller model may be sufficient for style transfer as well, and probably other things I want to do to adjust the writing.
a_beautiful_rhind@reddit
I think it also tells you what your tune will look like before you waste time and effort on something larger.
ComplexType568@reddit
7-9B is E4B, I guess? It has a total of like 8.4B params. It's like a 2-expert MoE model, just not really.
stoppableDissolution@reddit
Not really, half of it is embeddings and it is quite weird arch overall
NNN_Throwaway2@reddit
Fine-tuning has been kind of obsoleted by the combination of agents and long context.
That said, if you have a fine-tuning use-case, Llama remains easier to fine-tune than models like Qwen.