Why isn't the whole industry focusing on online learning?
Posted by unraveleverything@reddit | LocalLLaMA | 18 comments
LLMs (currently) have no memory. You will always be able to tell LLMs from humans because LLMs are stateless. Right now we basically have a bunch of hacks like system prompts and RAG that try to make them resemble something they're not.
So what about concurrent multi-(Q)LoRA serving? Tell me why there's seemingly no research in this direction. "AGI" to me seems as simple as freezing the base weights, then training one pass over a LoRA for memory. Say your goal is to understand a codebase: just train a LoRA on one pass through that codebase. First you give it the folder/file structure, then the codebase. Tell me why this wouldn't work. One node could then handle multiple concurrent users by storing one small LoRA for each user.
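Rough sketch of the serving side I'm imagining, with Hugging Face transformers + peft (the model name, adapter paths, and the exact adapter-switching calls are placeholders/assumptions, not a tested setup):

```python
# One frozen base model, one small LoRA per user, swapped per request.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/Llama-2-7b-hf"                     # placeholder base model
tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE)     # base weights stay frozen

# Attach one adapter per user; the big base weights are shared across all of them.
model = PeftModel.from_pretrained(base, "adapters/user_alice", adapter_name="alice")
model.load_adapter("adapters/user_bob", adapter_name="bob")

def generate_for(user: str, prompt: str) -> str:
    model.set_adapter(user)                           # route the request through that user's LoRA
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=128)
    return tok.decode(out[0], skip_special_tokens=True)
```

For reference, here's the microsoft/LoRA repo I'm talking about: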
Directory structure:
└── microsoft-lora/
├── README.md
├── LICENSE.md
├── SECURITY.md
├── setup.py
├── examples/
│ ├── NLG/
│ │ ├── README.md
...
================================================
File: README.md
================================================
# LoRA: Low-Rank Adaptation of Large Language Models
This repo contains the source code of the Python package `loralib` and several examples of how to integrate it with PyTorch models, such as those in Hugging Face.
We only support PyTorch for now.
See our paper for a detailed description of LoRA.
...
================================================
File: LICENSE.md
================================================
MIT License
Copyright (c) Microsoft Corporation.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
...
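And for the "train one pass over the codebase into a LoRA" part, the loralib README above documents roughly this workflow (my sketch; the toy model, dummy data, and objective are stand-ins):

```python
# One pass over some data -> one small LoRA file, following the loralib workflow.
import torch
import torch.nn as nn
import loralib as lora

# Toy stand-in for a transformer block: only the lora.Linear layers will train.
model = nn.Sequential(
    lora.Linear(512, 512, r=16),
    nn.ReLU(),
    lora.Linear(512, 512, r=16),
)

lora.mark_only_lora_as_trainable(model)   # freeze everything except the LoRA A/B matrices
opt = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)

# Single pass over the data -- random tensors here stand in for tokenized codebase chunks.
for _ in range(100):
    x = torch.randn(8, 512)
    loss = model(x).pow(2).mean()          # dummy objective
    loss.backward()
    opt.step()
    opt.zero_grad()

# Only the adapter weights are saved: this small file is the per-user "memory".
torch.save(lora.lora_state_dict(model), "codebase_lora.pt")
```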
GraceToSentience@reddit
They do have memory: it's their context window.
And for longer-term memory, you can just post-train them on additional data.
So they do have both short-term and long-term memory, as long as you're willing to spend the extra compute.
remyxai@reddit
Online learning is a key skill humans have but LLMs lack.
This came up while discussing 'what's missing in AI?' during a recent podcast I participated in: https://riverside.fm/dashboard/studios/jose-cervinos-studio/recordings/69ebed58-44ef-4dce-9ff6-cc1dac070777?share-token=3cc2204a12bd893439a0&content-shared=project&hls=true
But I'm also thinking about fundamental limits to the rate of updates when you need time to make observations, like running a controlled experiment.
__SlimeQ__@reddit
why don't you consider RAG to be memory? this seems like a very strange premise
fine-tuning on something does not absorb the information, it only shapes the output. also it's problematically expensive (even just time-wise)
nothing stopping you from trying this though, grab oobabooga and go for it
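for reference, bare-bones RAG over a codebase is just something like this (embedding model, chunking, and the example chunks are arbitrary choices, not a recommendation):

```python
# Minimal RAG-as-memory sketch: embed chunks once, retrieve the closest ones at
# query time, and stuff them into the prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

chunks = ["def load_adapter(path): ...", "class LoraLayer: ...", "README: setup steps"]
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q                 # cosine similarity (vectors are normalized)
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

question = "how do I load a LoRA adapter?"
prompt = "Context:\n" + "\n".join(retrieve(question)) + "\n\nQuestion: " + question
```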
aeroumbria@reddit
Most RAG is hard-coded and detached from the learning process, so unless we have some way to propagate the learning signal to the retrieval mechanism, it is just a frozen model interacting with a static environment rather than actual learning. Active learning would be something like hooking an associative memory module into the gradients of the main network.
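Something in the spirit of this, maybe (a pure toy sketch; the sizes and wiring are made up):

```python
# Associative memory whose keys/values are parameters, so the retrieval itself
# receives gradients from the main network's loss instead of being a detached index.
import torch
import torch.nn as nn

class AssociativeMemory(nn.Module):
    def __init__(self, slots: int = 256, dim: int = 512):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(slots, dim) * 0.02)
        self.values = nn.Parameter(torch.randn(slots, dim) * 0.02)

    def forward(self, query: torch.Tensor) -> torch.Tensor:   # query: [batch, dim]
        attn = torch.softmax(query @ self.keys.T / query.shape[-1] ** 0.5, dim=-1)
        return attn @ self.values                              # differentiable "read"

mem = AssociativeMemory()
hidden = torch.randn(4, 512)                 # hidden states from the main network
out = hidden + mem(hidden)                   # residual read; backprop updates mem.keys/values
```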
silenceimpaired@reddit
RAG also fails when it comes to large-scale context interaction … for example, you can't summarize everything in a RAG store the same way you could if it were all in the LLM's context.
Inner-End7733@reddit
Look up Google's Titans paper.
Equivalent-Bet-8771@reddit
Because your model forgets other stuff that you may need to do actual work.
MINIMAN10001@reddit
Also known as catastrophic forgetting
sruly_@reddit
LoRAs don't cause catastrophic forgetting; they don't override the base knowledge.
RiseStock@reddit
I personally don't want my neural networks to remember anything. RAG is the way forward. I want lossless memory.
1998marcom@reddit
You might want to have a look into "titans": https://arxiv.org/abs/2501.00663
Gleethos@reddit
They are batch learners, so if you update their weights on individual samples you get catastrophic interference, which means they forget previously learned stuff. It's a major design flaw that most NNs have. My intuition tells me that we need better weight specialization and sharding to solve this. Something like Mixture of Experts but with much higher "expert granularity". But that is just wild speculation from a random redditor.
Pretend_Guava7322@reddit
So why can't we train a router model that, for each generated token, decides whether to use a fine-tuned LLM (which may or may not catastrophically forget data) or the base model?
Lossu@reddit
That's pretty much what the Mixture of A Million Experts paper proposed. So you may be onto something.
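A toy version of that per-token gate (dimensions and wiring are arbitrary, just to show the shape of the idea):

```python
# A learned gate mixes a frozen "base" head with a fine-tuned head at every token position.
import torch
import torch.nn as nn

class TokenRouter(nn.Module):
    def __init__(self, hidden: int = 512, vocab: int = 32000):
        super().__init__()
        self.base_head = nn.Linear(hidden, vocab)
        self.tuned_head = nn.Linear(hidden, vocab)
        self.gate = nn.Linear(hidden, 1)
        for p in self.base_head.parameters():
            p.requires_grad = False          # the base stays frozen

    def forward(self, h: torch.Tensor) -> torch.Tensor:       # h: [batch, seq, hidden]
        w = torch.sigmoid(self.gate(h))                        # per-token mixing weight in [0, 1]
        return w * self.tuned_head(h) + (1 - w) * self.base_head(h)

logits = TokenRouter()(torch.randn(2, 16, 512))                # -> [2, 16, 32000]
```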
ttkciar@reddit
Yep this, OP should google "catastrophic forgetting".
BumbleSlob@reddit
You are describing performing a fine-tune after every completion. Because your data is definitely not a large set, you will end up overfitting the model to a single prompt. There is a reason fine-tuning takes hundreds to thousands of examples: to avoid overfitting.
This is all ignoring that it would be a massive, massive computational drain on the client machine. And performance would degrade further every time you used the model.
RedditDiedLongAgo@reddit
LLMs are stateful though.
Mindless_Pain1860@reddit
It simply doesn't work, at least not with GPT. You can try fine-tuning on the data you posted here, for example using OpenAI's fine-tuning service, whether with SFT or SFT+DPO, and regardless of whether you're fine-tuning GPT-4o or GPT-4o-mini. It still fails to answer the question well unless you use the exact same prompt from the fine-tuning process, and even then, it often produces responses with severe hallucinations. My assumption is that, because GPT is a probability-based model, it cannot truly "understand" a concept or generalize well without proper exposure to a very large dataset.
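For reference, the setup being described is roughly this (the file name, model snapshot, and example row are placeholders; assumes the OpenAI Python SDK's fine-tuning endpoints):

```python
# Upload a JSONL of chat examples and launch a fine-tuning job.
from openai import OpenAI

client = OpenAI()

# Each training row is a chat example, e.g.:
# {"messages": [{"role": "user", "content": "What does lora.Linear do?"},
#               {"role": "assistant", "content": "It wraps nn.Linear with low-rank A/B matrices."}]}
training_file = client.files.create(
    file=open("codebase_qa.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)
print(job.id, job.status)
```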