Since this is such a fast-moving field, where do you think LLMs will be in two years?
Posted by tim_Andromeda@reddit | LocalLLaMA | 269 comments
I’m amazed at the progress in this field, with LLMs quickly becoming smaller for the same capabilities and a ton of research happening as well. Where do you see LLMs in two years? For example, how many parameters might a GPT-4-capable LLM have in two years? What kind of LLMs might we be capable of running on our phones by then?
Capaj@reddit
We will have ChatGPT-5, which will be a tad better than GPT-4. It will not beat the ARC challenge; my guess is it will score around 45 percent.
The transformer architecture has plateaued. We cannot keep this pace of innovation without a new groundbreaking discovery.
Substantial_Luck_273@reddit
What are your thoughts on the current state of AI given your previous predictions?
Capaj@reddit
IMHO I was too early to say "the transformer architecture has plateaued".
That has clearly not happened yet. It may get a lot better still. There was a time when it looked like it might plateau, but IMHO, starting with GPT-5, it was obvious to me that the labs are able to push it much, much further.
Klutzy-Smile-9839@reddit
Considering generative AI, I am not sure the transformer has plateaued. A lot remains to be explored: multimodality, action/agent data, 4D synthetic data (xyzt data associated with text, video, and pictures) for an internal comprehension of the world/words.
Let us see what transformers can do with these before declaring that the architecture has plateaued.
GregsWorld@reddit
Won't these only fractionally improve performance, though? They don't address the larger issues with LLMs (hallucinations, reasoning, understanding, etc.).
Klutzy-Smile-9839@reddit
Hallucinations: These may be reduced, for example, by using a post-generation search over the original foundation data or over public internet data. That is essentially what you would do yourself, and it is programmatically implementable (a rough sketch follows this comment).
Understanding: Large 4D data should provide a layer to understand the physical world (space-time) and the relations between objects (impact, contact, acceleration, distances, etc.). E.g., an apple could be associated with its 3D shape, its size, etc. *May be supplemented by pure simulation/optimization.
Reasoning: More synthetic reasoning data, constructed bottom-up by creating/solving fictional problems. *May be supplemented by pure simulation/optimization tools (see DeepMind's achievements).
LLMs are to AGI what the reflex layer is to our human nervous system, i.e., a one-shot forward signal that provides a first guess in response to the input. That layer is so impressive right now that other algorithmic layers (writing true content, doing perfect pure reasoning, performing pure anticipation) are not yet competitive with a brute-force LLM (the reflex layer). Nonetheless, these layers are exemplified by DeepMind's latest achievements.
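As a rough illustration of that "post-search" idea (not anyone's actual pipeline): split the model's answer into claims and check each one against a local reference corpus with plain TF-IDF similarity. The corpus, threshold, and naive claim splitting below are all made-up assumptions for the sketch.

```python
# Minimal sketch: flag generated claims that have no sufficiently similar
# support in a local reference corpus. Corpus, threshold, and claim splitting
# are illustrative assumptions, not a production grounding method.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "The Eiffel Tower is located in Paris and was completed in 1889.",
    "Mount Everest is the highest mountain above sea level.",
]

def flag_unsupported(claims, corpus, threshold=0.3):
    vec = TfidfVectorizer().fit(corpus + claims)
    corpus_m = vec.transform(corpus)
    results = []
    for claim in claims:
        sims = cosine_similarity(vec.transform([claim]), corpus_m)
        best = float(sims.max())
        results.append((claim, best, best < threshold))
    return results

answer = "The Eiffel Tower was completed in 1889. It is 800 metres tall."
claims = [s.strip() for s in answer.split(".") if s.strip()]
for claim, score, suspicious in flag_unsupported(claims, corpus):
    print(f"{'CHECK' if suspicious else 'ok   '} ({score:.2f}) {claim}")
```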
GregsWorld@reddit
Yeah, a sensible take. LLMs alone are not enough; they're the eyes, the ears, the senses. The logic, reasoning, etc. need to be supplemented by other systems.
Umbristopheles@reddit
If you listen to Sama, GPT-5 should be a bigger jump over 4 than 4 was over 3. We shall see.
Winter_Tension5432@reddit
Plateau? I don't think so; it's growing, and at an exponential rate. Multimodality and ASICs will bring wild times. I don't care about AGI if I can get a model 3x stronger than GPT-4o with voice and vision, running locally on a phone; that is what I am looking for.
Electrical_Crow_2773@reddit
You may not care about AGI right now, but when it comes out... The progress will accelerate exponentially
Winter_Tension5432@reddit
A big " when" I was saying in the context of transformers not being enough for AGI even if they are not the world as we know it will be wild.
Charuru@reddit
AGI
Brilla-Bose@reddit
lol never with llms
Specialist_Cap_2404@reddit
Organic data will run out, so artificially generated data will play a bigger and bigger role.
Otherwise, of course, more can be done with less compute. Accelerator hardware for LLMs will be on the rise.
tim_Andromeda@reddit (OP)
Can we get more mileage out of organic data than we currently do? Perhaps some additional annotation or labeling of the data, so that the LLM can make more sense of it or be better able to differentiate between good-quality and poor-quality data?
Specialist_Cap_2404@reddit
Labeling is data. Labeling by humans is organic data; labeling by machines is artificial data.
The point about data running out is that the performance of an LLM roughly scales with the amount of data used to train it, and that scaling outpaces our generation of such data. Once LLMs can't improve by adding data, progress must come from new algorithmic insight or from more computing power, which is slower.
v1nchent@reddit
Why would we ever run out of data to collect? That seems like a very wild claim to me which has no foundation.
Could you provide some more insights here?
As long as humans are interacting with technology, new data will continue to arise, even if it is data you previously assumed unimportant.
a_beautiful_rhind@reddit
A lot of human data is bad. People trained on proxy logs and it ended up making the models worse. So maybe it's more accurate to say we'll run out of usable human data.
v1nchent@reddit
At worst the rate would slow down. But what do you think human data is? We will never run out of new, and even new usable, data. At least not anywhere near our lifetime.
a_beautiful_rhind@reddit
You can only have so many examples of "sus amogus".
v1nchent@reddit
Sure, but the way you talk about it implies the way we as humans communicate is somehow not going to change anymore...
You are clearly a capable person, saying humans are done evolving is an L take.. and you should know this..
a_beautiful_rhind@reddit
Of course, over time. Fast enough to keep improving models over the course of a year?
v1nchent@reddit
What's with the arbitrary timeline suddenly thrown into the argument? And improving over the span of a year, if there is a single year where we have not made an improvement to a model because there was no new data I will transfer you all of my assets. We are either running out or we are not.
Volume is not my argument. Either there is at least one new piece of data, in which case we have not run out, as you suggested, or there is absolutely zero new data available.
Words have meanings.
a_beautiful_rhind@reddit
You are right that there won't be 0 new data available as long as humans exist.
The amount of quality data isn't going to keep up with the amount needed in the time required. Plus the signal to noise ratio in it is going to get worse.
Say you're training on books: you have to go to increasingly lower-quality ones as time goes on. At first you got all the classics and the best sellers, and your LLM sounds great, but soon you are scraping the bottom of the barrel and it isn't really helping. Maybe it's even hurting. So your options are to use your previous LLM to make more data, train on something else, or only run the same dataset. Is that not running out?
v1nchent@reddit
You're using the term 'quality' but that's a very subjective way of looking at the world and incredibly self-aggrandizing...
You have got to realize that you're talking about art and human expression as if there is only one correct and viable way to express one self.
You're twisting arguments in a way just to not 'admit defeat' whilst there is nothing wrong with adjusting what you said...
We are not running out. Period.
What do you want exactly?
What exactly is more data going to give you here? Do you know what quality data actually is?
If you can't properly defend your position without talking in circles using buzzwords, ignoring what is presented to you, you don't know what you're talking about...
Even if it's the general consensus, you're nothing but a parrot then.
Just training stuff to train stuff makes no sense for anything. Stuff needs to be trained with purpose in mind.
We are never, ever, going to run out of data. Quality or otherwise. Daily, new interactions take place in massive quantities.
Can you provide me with verifiable sources which prove your claim? Not blog posts, actual sources.
Don't make doom claims without actual evidence...
tim_Andromeda@reddit (OP)
It’s common sense. If you want to scale your LLM one thing you need to do is scale the data massively. When you’ve hoovered most of the data on the internet into your LLM, what is generated anew day to day is but a trickle. So yes, in that sense SOTA LLMs are running out of data.
v1nchent@reddit
I was just getting into the topic and would like to point out that I was beyond ignorant in my responses.
For which I apologize, thanks for trying to get me on track even though I kept steering myself off of it.
a_beautiful_rhind@reddit
I think the evidence is major companies incorporating synthetic data more and more.
That hits the nail on the head. There isn't enough good human data for particular purposes and it's easier to create your own.
And you want sources? I thought this wasn't about parroting people. You know what the writers in the sources do, right? They look at evidence and draw conclusions. We don't have to agree on them.
ohcrap___fk@reddit
Hey I just want to say you employed a Godzilla amount of patience here by calmly explaining the nuance to him while he was being aggressive. Cheers to you for making the world better today
v1nchent@reddit
The comment made says there will be no data to train on. I wholeheartedly disagree with this.
The future of it will be training it on organic data in precise fields.
To think what is online currently is all that humanity is to offer is an incredibly strange and fatalistic position to hold.
How could I not disagree with that to my core?
Was I too aggressive in my approach? That depends on who reads this, but I can see it too.
Am I sorry for taking the stand I took? Not in the slightest.
And whilst you may disagree with my take on it, you may not fault me for believing in humanity.
incongruity@reddit
Except that as humans use LLMs for more and more, often without clearly declaring such use, much of what might be presumed to be “human data” will actually be mixed, and for many purposes it won't be used, because of the presumption that it is in fact mixed.
ColorlessCrowfeet@reddit
Yes, but we've also been mining a backlog of old data, which is a large but finite resource.
Bernafterpostinggg@reddit
There are some good papers about this. The current estimate is that we'll run out of net-new human data sometime between 2026 and 2032. And it's important to remember that a significant portion of the data on the internet is now AI-generated, so purely organic data is already gone. Model collapse is a big issue, but augmenting organic data with accumulated synthetic data looks like a solid short-term solution.
With multimodal being the main focus now, using synthetic data is a very real strategy. Sim2Real and training in virtual environments is a key component to progress.
v1nchent@reddit
Well, I still believe there is a difference between saying there will be no new internet-scale corpus of quality data to train on, and saying that we will literally run out of data.
The former is likely true, and I believe that myself. The latter is people using words they can't possibly believe, because they are simply not true.
That's like me saying there is no light because my solar panels no longer generate electricity.
Maybe it's just cloudy...
Bernafterpostinggg@reddit
Yeah, it's not really that there won't be any more human data per se, as much as there will be nothing close to a new mega-corpus to train on. There will never be a full new internet scale text cache again like there has been for the current SOTA LLMs...
Specialist_Cap_2404@reddit
Here's a summary: https://epochai.org/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data
ffiw@reddit
An LLM's performance averages over the data it's trained on. Half of us have an IQ below 100, so most content produced is terrible, inaccurate, and useless, which drags down the performance of the LLM. You need to curate training data, which is limited in quantity, and that takes money.
git_oiwn@reddit
Because LLMs already use the whole internet's data for training. Growth rates for valuable data are tiny in comparison to what has already accumulated.
v1nchent@reddit
I will 100% give you that the growth rate will likely slow down.
There is however a BIG difference between slowing down, even to a crawl, and running out.
We're literally talking about large LANGUAGE models. Don't assign meanings to words that are not there. Or be deemed invaluable data.
Running out of something literally means none is left. How in the fuck are you going to run out of data... Even new data.
And with more people than ever with the ability to read, write and contribute, saying we're going to run out of quality data just means you have absolutely no clue what you're actually saying.
Sure, maybe we will not get a new spike in performance for a few years or decades.
But that's like saying the world ran out of cookies because the jar at your grandmother's house is empty.
Your data is not all the valuable data in the world, other people have value too..
Dayder111@reddit
They are doing this with what they call "synthetic data", if I understand it correctly. Enriching the data that the model gets, with its own, or other, more capable model's thoughts, connections and conclusions.
BlueboyZX@reddit
TesseractOCR has been using synthetic data for fine tuning training for over a decade.
Eisenstein@reddit
Imagine you have a dataset with pictures in it. Have different people describe those pictures. Will the descriptions be the same? There is much more data in datasets than we have used so far.
Mescallan@reddit
LLM pre-training is not as sample-efficient as humans are, and we are likely not even close to peak sample efficiency.
In theory we can be getting more out of LLM weights than we are now. They are abstract representations of language, and we are using multi-headed attention to parse the weights, but there could be an even more efficient way to use them.
jasminUwU6@reddit
We also have to make the models more sparse, we can already maintain most of the performance while only using less than 1% of the weights
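For illustration only, the crudest form of this is magnitude pruning: keep the largest 1% of a layer's weights by absolute value and zero the rest. Real sparse-LLM work (structured sparsity, sparse kernels, activation sparsity) is far more involved; the layer size and keep fraction here are arbitrary.

```python
# Toy weight sparsity: keep only the largest-magnitude 1% of a layer's
# weights and zero the rest. Only meant to show the basic idea that most
# weights can be dropped, not a recipe for a usable sparse model.
import torch

def magnitude_prune(weight: torch.Tensor, keep_fraction: float = 0.01) -> torch.Tensor:
    k = max(1, int(weight.numel() * keep_fraction))
    threshold = weight.abs().flatten().topk(k).values.min()
    mask = weight.abs() >= threshold
    return weight * mask

layer = torch.nn.Linear(4096, 4096)
sparse_w = magnitude_prune(layer.weight.data)
print("nonzero fraction:", (sparse_w != 0).float().mean().item())
```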
kuchenrolle@reddit
This is a false dichotomy. Synthetic data is based on "organic" data, it's just fleshed out in a different way. There is no creation of information and the limiting factor is still the same.
ResidentPositive4122@reddit
There can be. One simple example is the "Tony's mother is Fay. Who is Fay's son?" problem. You can augment lots of data with "relationships" and train / fine-tune on that. L3 has shown that carefully creating synthetic data works, and provides good results.
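A toy sketch of that kind of augmentation, with made-up facts and templates: take simple stated relations from the data and programmatically emit the reversed question/answer pairs as extra training text.

```python
# Illustrative only: derive "reversal" question/answer pairs from stated
# parent/child relations, as extra fine-tuning examples.
facts = [
    ("Fay", "Tony"),   # (mother, child), from the example in the comment
    ("Bob", "Alice"),  # made-up extra pair
]

def augment(mother_child_pairs):
    examples = []
    for mother, child in mother_child_pairs:
        # the "forward" statement as it might appear in organic text
        examples.append(f"{child}'s mother is {mother}.")
        # the programmatically derived inverse relation
        examples.append(f"Q: Who is {mother}'s child? A: {child}.")
    return examples

for line in augment(facts):
    print(line)
```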
kuchenrolle@reddit
And those relationships are learned from the organic data. You're just giving an example of generalization, or of collecting structured data. Same for the math dataset. There's no arguing around the impossibility of creating information. Use Google, or just copy this into your favourite LLM chat and have it explained to you; I won't engage in a pointless conversation.
Eisenstein@reddit
Why don't you define what you mean by 'creating information' and we can have a discussion about that.
kuchenrolle@reddit
I don't think it matters (though please do provide a definition under which the non-synthetic data isn't the limiting factor). But let's just use the Shannon sense: information as the number of possible (equiprobable) distinctions. One bit can then distinguish two things. There is no way to take something with one bit of information - let's say that's the information in the organic data - and distinguish more things based on it, like by deriving synthetic data. That new information has to come from somewhere, like relationship annotations (a different, new source of additional information). If it comes from the original data, that data wasn't the limiting factor to begin with, and the synthetic data is just part of the process of using that information, so no new information has been created.
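In more standard notation (assuming the usual Shannon framework), the claim being gestured at is the data processing inequality:

```latex
% Information of N equiprobable alternatives: I = log2(N) bits.
% Data processing inequality: if X -> Y -> Z is a Markov chain
% (Z is computed only from Y), then
\[
  I = \log_2 N, \qquad
  X \to Y \to Z \;\Rightarrow\; I(X;Z) \le I(X;Y).
\]
% Reading Y as the organic data and Z as synthetic data derived from it,
% Z can never carry more information about the source X than Y already does.
```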
It's sad to see my comment get downvoted and an (understandable but still simply wrong) misconception get upvoted. This is not complicated or advanced knowledge. I would have expected this in the ChatGPT sub, but here, I am a bit surprised.
Eisenstein@reddit
How can you not establish more things based on existing data? Language means different things depending on a lot of factors. You take something like "I helped my uncle jack off a horse", ask three different people what it means, and get two different answers (helped uncle dismount from the saddle; helped uncle pleasure the animal).
Data is interpreted. If there is a universal 'put a sentence through this algorithm and get back all meaning from it' that will work on all language, then please let me know what it is.
kuchenrolle@reddit
I don't quite know what you're saying. Where did I talk about language and meaning? You can't "get all the meaning out of a sentence". That's not how language works. What something means depends on the context. I can say "okay." in a hundred different contexts and mean a hundred different things. That's like trying to extract the meaning of 1 vs 0 (a bit). The context sets up what the possible meanings are, and language is just used to choose the right one(s).
What you can define, however, is a limit on how many different meanings you could possibly express with a set of symbols (words or sentences). The actual meanings are irrelevant for this question, it is just an upper bound on expressiveness. That's what the information is. Information is very different from the colloquial understanding of meaning.
Eisenstein@reddit
Many people are trying to get to your point by directly confronting statements you make with easily understandable counter statements looking for something concrete we can point at and go 'ok now here is a reference that makes sense'. But every time you weave into some kind of word salad that mixes parts of different academic disciplines but effectively says nothing comprehensible.
I am sorry to make this personal, but there are really two options here: either you are a genius way beyond everyone else's capabilities, or you are talking nonsense that only makes sense to you because it is inside your head.
kuchenrolle@reddit
I'm sorry you feel that way. I don't know how regular sentences appear to you as "word salad". I also don't know where you see "many people" in this thread or something happening "every time". If you don't understand something, after I have tried to explain it and it still doesn't make sense to you, why don't you ask ChatGPT if you're just not getting it or if I'm talking nonsense? This is not a difficult question, I'm not "a genius way beyond every else's capabilities", but it seems you're not actually trying to understand.
You ask about a definition of "creating information", I give you the definition of information that everyone in the field uses (look up information theory) and explain what creating information would mean then. Then you somehow start talking about meaning and "a universal 'put a sentence through this algorithm and get back all meaning from it'", even though I never said any of that. I try to explain to you that meaning and information are very different things and how they are different and then that's word salad and you feel like you should make this personal? I'm sorry, but what the actual fuck? Where do you give "easily understandable counter statements" that I'm not addressing directly?
I'll give this one more shot and then I'm done. Here's what I think you think: LLMs can create new knowledge. An LLM can say something like "An eagle is a bird of prey", even though that statement is not in its training data. So it's new. New knowledge is new information. So it created information.
What's wrong with that is that the model didn't create new information. Let's ignore the information part and just focus on the knowledge, because it didn't create new knowledge either. That is for two reasons. The first is that the model can generate this statement by learning from the data. Maybe the data contained the statements "Hawks are birds of prey" and "Eagles are essentially large hawks". Taking these together, one can conclude that eagles are also (larger) birds of prey. And that's what the model does, except with tons of examples and tons of sentences that are similar in one way or another, so you wouldn't be able to point to two sentences in the data that lead to this newly generated sentence. The model would be able to generate that statement. Do you see how the knowledge (eagle is bird of prey) is already in the data? The data contains patterns. And the model is applying the pattern to new examples. But the patterns and the examples together already contain everything necessary to create these new examples. The model merely extracts this, it doesn't create anything that wasn't already present. If I tell you "Cars are faster than bikes" and you then go on to say "Bikes are slower than cars" you maybe created a new sentence, but you didn't create new knowledge and you certainly didn't create new information.
Eisenstein@reddit
I think the 'information' that is 'new' is a result of us having more senses and being able to extrapolate more about the world around us because we experience it differently. Pointing out a 'new' theory of physics doesn't create anything new, it just explains something that already exists. Writing a poem doesn't create anything new, it just moves around words that already exist.
This is what I meant by 'define what you mean by new information'. You are holding a text model up to standards that you can't define. I asked you what you meant and you said a bunch of nonsense that sounds deep but is basically fancy-sounding words masking nothing more than 'words only mean something when humans combine them together'. If an LLM were to make up a new language with grammar rules, would that be 'new information' or just a combination of already existing training data?
The fact is that a lot of these questions are open to interpretation, and you are positing yours as if it were some kind of truth, when all you are doing is combining a few ideas and writing about them in a way that sounds logical but, when unpacked, is meaningless, because you are basing assumptions on assumptions.
kuchenrolle@reddit
It would be just a combination of already existing training data.
Eisenstein@reddit
All information then is 'a combination of already existing data'. The LLMs just have access to less of it to combine because they are stuck inside a computer with no senses.
Unless you can tell me what an LLM would have to generate to make you agree that it created new information, you are obviously starting from the conclusion that LLMs 'can't create new information' and working backwards from there.
kuchenrolle@reddit
I don't need to give you conditions for something that is not possible to somehow prove that it is indeed not possible. That doesn't mean I'm starting from the conclusion, I'm just not deriving the conclusion from the conditions you've specified. It is not true that all information is a combination of already existing data - again, that's not information. It is true, however, that all model generations are a combination of already existing data.
If you want to get real deep, then classical physics would even have it that there is no creation of information ever, not for the model and not for anyone. Information can never be destroyed nor created. In that sense, if you know the complete state of the universe, you can predict the next state without uncertainty. Whether that is actually true for the universe is not quite clear, but for the type of information we are talking about, it is.
I'm honestly getting a bit annoyed. Have you actually tried chatting with ChatGPT or Claude about this? This is all very basic, very common knowledge and these models are perfectly capable of answering your questions and helping you understand.
I can only assume that you don't know how language models work, but in essence, they estimate probabilities from the data they are trained on. A very simple, early way of doing this was to predict every word from the previous one. You just count all pairs of words in a large collection of text and then get the probabilities that the first word is followed by each word you observe it with. At any point in a sentence, you can then generate the next word by looking at the probabilities given the current word and choosing according to those probabilities. If you generate enough random stuff, you could get a text that "invents" a new language and lays it out. But you clearly haven't actually created anything, because the probabilities are fully determined by the data. You can also create a bunch of random text (the synthetic data), then add that to the original texts you had (the "organic" data) and re-calculate the probabilities. Now have you created something new? You can think about that.
(Hint: No. In the limit (so with a lot of generated text), the probabilities would actually remain the exact same. And while transformers are a lot more complicated than that, at the very heart, they still do the exact same thing - they just don't calculate the probabilities for the next word (token) by simple counting.)
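A toy version of that bigram setup (tiny made-up corpus, nothing more): count word pairs, generate text from the resulting probabilities, add the generated text back, and re-estimate. The re-estimated probabilities stay pinned (up to sampling noise) to whatever the original counts dictated.

```python
# Toy bigram model as described above: the "synthetic" text it generates,
# when folded back into the corpus, cannot move the estimated probabilities
# away from what the organic counts already determined.
import random
from collections import Counter, defaultdict

def bigram_probs(tokens):
    counts = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        counts[a][b] += 1
    return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for a, nxt in counts.items()}

def generate(probs, start, n=200, seed=0):
    rng = random.Random(seed)
    out = [start]
    for _ in range(n):
        nxt = probs.get(out[-1])
        if not nxt:                 # dead end: restart from the start word
            out.append(start)
            continue
        words, weights = zip(*nxt.items())
        out.append(rng.choices(words, weights=weights)[0])
    return out

organic = "the cat sat on the mat and the dog sat on the rug".split()
p_organic = bigram_probs(organic)
synthetic = generate(p_organic, "the")
p_mixed = bigram_probs(organic + synthetic)

print("P(cat | the), organic only:    ", p_organic["the"].get("cat"))
print("P(cat | the), organic+synthetic:", p_mixed["the"].get("cat"))
```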
Eisenstein@reddit
That's exactly what that means, though. If there is no situation in which you could be proven wrong, then you aren't even considering that you could be wrong. Therefore you came up with the conclusion and are working backwards with it. You are essentially holding a belief. The first rule of science is falsifiability.
kuchenrolle@reddit
You knowing more than me about any of this is about as probable as an LLM creating new information. Take care.
Eisenstein@reddit
I'm sure having to constantly deflect attempts by others to engage in actual productive discourse while generating paragraphs of quasi-coherent text is exhausting, so you should get some rest.
Eisenstein@reddit
That is an example of a word salad. It means nothing. Let's dissect it.
Information conveyed by language changes depending on the circumstances surrounding its use. Sure, that's fine.
Okay can mean 'I agree' or 'fuck off' or any number of things. Got it.
This is a non-sequitur. How does "okay" meaning different things in different situations relate to 'extracting the meaning of a bit'? This is nonsense.
The situation in which the language is used creates the possibility for information conveyed by language and the language is used to choose which possibility this is?
These words all combine to sound like they mean something but when analyzed are nonsense.
Eisenstein@reddit
You need to define exactly what you are arguing, because when I ask you for a definition you weave around going 'that isn't helpful', and then, when you say things like 'you can't get more than 2 bits from a piece of data' and I respond that you can when it is language, you pivot to something about 'information contains a finite amount of information'.
You are the type of person to end up debating the meaning of certain words in order to make a point no one understands but you. Either you are a genius no one understands, or maybe you are kind of talking nonsense and have no actual idea you can pin down and express in a coherent way.
ResidentPositive4122@reddit
I just gave you an example. You start with pairs of "math question" and "known answer" and you generate the "reasoning" steps. There were no steps before. There are now. Hence you've "created" information.
kuchenrolle@reddit
You haven't created information. The information was always in the data. You have just made use of it. By your logic, everything generated with a model is information.
I don't know why you don't just try to use one of the available resources, but here is ChatGPT's answer when I gave it this comment thread:
Thank you for the examples and the discussion. I appreciate the points you’re making about the use of synthetic data.
In summary, while synthetic data can indeed add new layers of information and enhance model performance, it does so based on the original organic data’s inherent information. The creation of new data points or reasoning steps exemplifies advanced generalization and structuring rather than the creation of fundamentally new information from nothing. The limitations of the initial data still play a crucial role in defining the extent of these improvements.
MoffKalast@reddit
Only half a year more until AMD Strix Halo brings M-series Mac LLM performance to everyone else, meanwhile Intel's busy pushing Microsoft spyware accelerators like the grifters they've become. Hopefully Hailo delivers something useful too eventually.
RedditDiedLongAgo@reddit
Good luck pretending AMD delivers on their promises.
MoffKalast@reddit
Let me dream for a bit alright? I can pretend it'll be great and cheap and have drivers that work until it releases. And then we'll see it have like 50 GB/s of practical bandwidth and no ROCm support lmao.
shroddy@reddit
It will have either around 250 GB/s or 500 GB/s of bandwidth (probably 250). For inference, the 16 Zen 5 CPU cores will probably be fast enough to saturate the memory bandwidth. Prompt eval is another story; there, working ROCm would be important.
fallingdowndizzyvr@reddit
TBC. Low end Apple Silicon performance. It's not Max or Ultra level performance that people rave about. It's the low end of the line at prices that are pretty much the same.
Possible-Moment-6313@reddit
Researchers have already demonstrated that if you train LLMs on their own output, LLMs degenerate really quickly. Which is why OpenAI and others are so desperate to secure exclusive access to sources like Reddit or Stackoverflow.
jakderrida@reddit
True... However, I would argue for using LLMs to restructure the information into some ideal sort of format that has examples.
I sort of do this using explanations from gmatclub for GMAT questions. There will be a top-voted answer with all sorts of shorthand explaining why each answer choice is right or wrong. I'll have the model create classifications for why each is wrong, and have it use the top explanations to apply the new structure with classifications, by providing a template, examples, the new questions with choices, and a few of the top explanations.
While I completely agree that just asking ChatGPT to generate answers will lead to lower and lower quality, I really think there's a mixed approach, using provided unstructured data plus restructured generation, that would help (a rough sketch follows this comment).
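Something like the following, where the template, example explanations, question text, and the send_to_llm() call are all placeholders rather than an actual workflow:

```python
# Rough sketch of the restructuring workflow described above: build one
# prompt from a target template, a few top-voted explanations as style
# examples, and the new question. All content and send_to_llm() are made up.
TEMPLATE = """For each answer choice, output:
Choice: <letter>
Verdict: correct / incorrect
Error type: <e.g. scope shift, unsupported inference, number property error>
Explanation: <one or two sentences>"""

def build_prompt(template, example_explanations, question_with_choices):
    examples = "\n\n".join(example_explanations)
    return (
        "Restructure the reasoning below into the given template.\n\n"
        f"Template:\n{template}\n\n"
        f"Top-voted explanations to draw from:\n{examples}\n\n"
        f"Question and answer choices:\n{question_with_choices}"
    )

prompt = build_prompt(
    TEMPLATE,
    ["(A) is wrong because it contradicts the premise about totals...",
     "(C) is the answer; the shorthand x > y only holds if..."],
    "If x and y are positive integers and ... (A) ... (B) ... (C) ...",
)
# send_to_llm(prompt)  # hypothetical call to whatever chat model you use
print(prompt[:200])
```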
-main@reddit
So you don't train the model directly on model output.
The Llama 3 paper included details about doing synthetic data generation and the key is to run it through some ground source-of-truth -- for example, to train a codegen model not on generated code but on the output of running the generated code & the output of running the generated tests for the code. Instead of model collapse, you strengthen the ability to predict code outputs and test outputs.
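A hedged sketch of that grounding loop (not Meta's actual pipeline): run the generated code plus its generated tests and keep only samples whose tests pass, training on the code together with the real execution output. generate_code_and_tests() is a placeholder for whatever model you use.

```python
# Sketch: ground synthetic codegen data in real execution results.
import os
import subprocess
import sys
import tempfile

def run_python(source: str, timeout: int = 10):
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True,
                              text=True, timeout=timeout)
        return proc.returncode, proc.stdout + proc.stderr
    finally:
        os.unlink(path)

def build_sample(prompt, code, tests):
    returncode, output = run_python(code + "\n\n" + tests)
    if returncode != 0:
        return None  # discard samples whose tests fail
    # keep the verified code together with its real execution output
    return {"prompt": prompt, "code": code, "execution_output": output}

# code, tests = generate_code_and_tests(prompt)   # hypothetical model call
sample = build_sample(
    "Write is_even(n) and a test for it.",
    "def is_even(n):\n    return n % 2 == 0",
    "assert is_even(4)\nassert not is_even(7)\nprint('tests passed')",
)
print(sample)
```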
bsenftner@reddit
I worked at a facial recognition company where we used synthetic data to expand the training data from under 100K subjects to 300M. Because it's human faces, and 3D graphics can produce photorealistic human faces, we generated under a million faces and then rendered views at different angles, with different facial expressions and different atmosphere/illumination/focus/weather/seasons, and then took all of those and stepped on them with varying levels of face-obscuring blocking objects and over-aggressive image compression. When all was done, the training data was equivalent to 300M faces, and the resulting trained algorithm ranks in the top 5 annually at the NIST facial recognition vendor test. There's a high likelihood our algorithm scanned you when you traveled through any western nation's airport.
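A loose sketch of that kind of multiplication step (file names and parameter ranges made up, not the company's actual pipeline): occlude a random block and crush the image with aggressive JPEG compression to turn one rendered face into many degraded training views.

```python
# Illustrative augmentation: occlusion block + over-aggressive JPEG
# compression, turning one rendered face into many degraded variants.
import io
import random
from PIL import Image, ImageDraw

def degrade(img: Image.Image, seed: int) -> Image.Image:
    rng = random.Random(seed)
    out = img.convert("RGB")
    # occlude a random rectangle (hand, sign, another head, etc.)
    draw = ImageDraw.Draw(out)
    w, h = out.size
    x0, y0 = rng.randint(0, w // 2), rng.randint(0, h // 2)
    draw.rectangle([x0, y0, x0 + w // 3, y0 + h // 3], fill=(20, 20, 20))
    # over-aggressive JPEG compression
    buf = io.BytesIO()
    out.save(buf, format="JPEG", quality=rng.randint(5, 25))
    buf.seek(0)
    return Image.open(buf)

base = Image.open("rendered_face_0001.png")      # hypothetical rendered face
variants = [degrade(base, seed) for seed in range(32)]
```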
False_Grit@reddit
Wow that was extraordinarily interesting, thank you for sharing!
EmbarrassedHelp@reddit
The researchers explicitly state that they are testing raw, unfiltered data with no quality control. That is a clear limitation of the study, and it was noted in the conclusion of the latest research paper on the subject.
ResidentPositive4122@reddit
And the Llama 3 paper just debunked that.
a_beautiful_rhind@reddit
The L3 synthetic data does wonders for benchmarks but it doesn't do much for the tone of the models.
Eisenstein@reddit
Is the tone determined by the pre-training data?
a_beautiful_rhind@reddit
Probably both. The instruction tuning and preference tuning is supposed to cement it in. Finetuning wasn't really able to fix llama 3.0 tho.
Eisenstein@reddit
The fact that no one can say for sure why it is so hard to train Llama 3 just means that people make up their own explanations based on whatever sounds reasonable to them. In such cases you often see reasons that reflect the mindset of the person themselves or of those they listen to.
Regarding pre-training, there is so much going into it that it seems unlikely it significantly determines personality (tone). Has there been any experimentation with pretrained models given different completion tasks to see if they are consistent in tone and style? It would be interesting to see.
a_beautiful_rhind@reddit
The main explanation I've seen is that they really curated the pre-training to be "safe". The model doesn't have enough conversation material that isn't super dry. Then they finished it off with instruction tuning by shifting that stuff towards the forefront.
Eisenstein@reddit
That is a guess, though.
kontoeinesperson@reddit
Sorry for being oblivious but what does tone mean in this context
a_beautiful_rhind@reddit
How it writes to you. Does it sound like an academic, a warm friend, a purple-prose AO3 writer, etc.? And yes, you can instruct it to "talk like a pirate", but the default personality is often a specific one, and it bleeds into everything else.
Short-Sandwich-905@reddit
Yeah, I asked him for a source but nothing.
Short-Sandwich-905@reddit
Source?
Possible-Moment-6313@reddit
https://www.nature.com/articles/s41586-024-07566-y?ref=thestack.technology
Healthy-Nebula-3603@reddit
That paper is from October 2023, not to mention it was probably written a few months earlier... it is so outdated nowadays.
Eisenstein@reddit
They demonstrated that under certain conditions. It doesn't mean it is an iron clad rule that it is always the case.
Wooden-Potential2226@reddit
Multi-modal input will ensure plenty of data. Think about how much implicit information, e.g., video data contains. Combine that with, e.g., lidar sensors, and your model has the foundation for spatial world knowledge. But it's not easy - just ask the self-driving-car ML guys…
Specialist_Cap_2404@reddit
The data requirements for improving LLM performance are increasing exponentially. Any clever tricks are not even a speed bump.
deadweightboss@reddit
we’re already here
91o291o@reddit
OpenAI has been paying people to write content, and to correct answers, for years.
This will become even more important.
AwakenedRobot@reddit
Isn't there organic data in everyday life? Conversations? Not saying it isn't private, but that would be an infinite source.
cyan2k@reddit
We are in the business of translating bleeding-edge computer science research into actual IT-ready applications and use cases, and the one thing I learnt in the past 20 years is that you don't even try to predict the future; you will always be wrong. Enjoy the ride and take it as it comes.
So, who the fuck knows? For all we know, Yann LeCun has the insight of the millennium tomorrow and we will have AGI next week, and fight for spice on Arrakis in three weeks. Probably not, but it's just foolish to predict the unpredictable. It's like trying to predict an LLM's output at max temperature, and that has only a few billion parameters. Real life has a few more.
dalhaze@reddit
I tend to agree with you on this. But trying to anticipate and experiment with what applied genAI in the real world will look like is the business of translating the bleeding edge.
iKy1e@reddit
It has been stated that Llama 4 will be trained with 10x the compute Llama 3.1 was. And presumably, at this rate, two years' time will see Llama 5, so 100x?
I think at this point the limitation (which I feel we are already seeing) isn’t going to be the model’s capability as much as its storage size for knowledge. At least for smaller, sub-10B models.
Llama 3.1 and mistral models are already great at following tasks.
Give them a snippet of info and a task (extract, summarize, proofread) and they do great. But ask questions that require knowledge to answer, and that’s where they start to fall down.
Wikipedia is a 50+ GB download, and yet an LLM, even Phi-3 or Llama 3, can talk about any of the subjects it contains. But Llama 3 quantised is a 4 GB download. There is no way it contains all that info and more.
So for small models I hope we get more focus on logic, reasoning ability, mathematical skills, chain of thought, etc…
Also experiments with other architectures, like the 1bit LLM models I’ve seen floating around.
If we can keep the abilities of Llama 3.1 and co., but shrink them further and speed them up, that’d enable whole new use cases!
At the top end I’m hoping for at-home (at least as ‘at home’ as you can count modern 70B models, barely) GPT-4-level multimodal: audio, image, text, documents, (video?) running locally in ‘real-ish’ time.
Flabout@reddit
It's a very lossy compression. I would much rather have a tiny language model with good reasoning abilities and give it the ability to browse offline Wikipedia for more detailed information, than one that contains all the knowledge but needs humongous compute power to run. That's how I do it as a human: the number of times I look things up on Wikipedia on a daily basis is crazy; I cannot store all that knowledge in my head. Knowing where to get the information, if you can get it fast, is often sufficient.
dalhaze@reddit
What about when you don’t know where to find the information?
iKy1e@reddit
Exactly, small models will need to be made with a general understanding of things (up and down are opposite) but it’s mostly their reasoning and task following that is going to be most useful.
Another area I’d like to see improvement on is knowing when they lack sufficient information.
I feel this will mostly come down to training data and technique to achieve, but I’d like to see more “I don’t have sufficient information to answer this fully, but what I can infer/surmise is this….” give as full an answer as possible with the available info, while acknowledging the limitations of the available information.
Rather than currently, where it just makes stuff up to fill any lack.
I feel like that’s the sort of thing that can be trained. Provide all the info minus one critical bit, plus one explanation that complains about the lack, and another that states it can answer the question because that info is there?
But it’d have to be balanced carefully to avoid false denials and training it to just “give up”
Flabout@reddit
I hope we get to this point eventually, it would be a nice trade off between speed and knowledge quality.
As for the ability to tell when they're right or wrong, I'm pretty sure it's being worked on, I don't know when, but it's definitely gonna be a thing soon enough.
RedditLovingSun@reddit
I wonder if people have tried post-training for instruction following, like DPO or RLHF, but with an offline queryable vector database in the loop, on small models.
Like, instead of just prompt -> LLM output -> reward model score, we add a new second step where another learnable small model queries a giant vector DB for relevant info and enhances the input with it.
Perplexity basically does this during inference but not with a trainable model that's trained for it.
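A very rough sketch of that loop, under heavy assumptions: a small trainable retriever samples a document, the retrieved text is prepended to the prompt, and the reward score is pushed back into the retriever with a REINFORCE-style update. embed(), generate(), and reward() are placeholders for a real embedder, LLM, and reward model.

```python
# Sketch only: trainable retriever in the reward loop.
import torch
import torch.nn.functional as F

docs = ["Paris is the capital of France.", "The moon orbits the Earth."]
doc_embs = torch.randn(len(docs), 64)        # stand-in for real embeddings

retriever = torch.nn.Linear(64, 64)          # learns to re-project the query
opt = torch.optim.Adam(retriever.parameters(), lr=1e-3)

def embed(text):             # placeholder query embedder
    return torch.randn(64)

def generate(prompt):        # placeholder LLM call
    return "some answer"

def reward(prompt, answer):  # placeholder reward model
    return torch.tensor(1.0)

for step in range(100):
    prompt = "What is the capital of France?"
    q = retriever(embed(prompt))
    scores = F.cosine_similarity(q.unsqueeze(0), doc_embs)   # (num_docs,)
    probs = F.softmax(scores, dim=0)
    idx = torch.multinomial(probs, 1).item()                  # sample a doc
    answer = generate(docs[idx] + "\n\n" + prompt)
    r = reward(prompt, answer)
    # REINFORCE: raise the log-prob of retrievals that earned high reward
    loss = -torch.log(probs[idx]) * r
    opt.zero_grad()
    loss.backward()
    opt.step()
```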
jasminUwU6@reddit
I don't think trying to fit all the knowledge inside the weights is a good idea long term. LLMs are good at intuition, but there are probably better ways to do knowledge retrieval.
megadonkeyx@reddit
context size is still a massive problem, even claude 3.5 sonnet starts to go bananas with a long conversation.
there needs to be an LLM that can write directly back into its weights as it learns. no idea how or if that could be possible but everything else is just hacking around this one issue.
DonnotDefine@reddit
auto continuous learning
tim_Andromeda@reddit (OP)
I agree. That’s how the human brain works, adjusting its own weights on the fly. I think that would be a huge development. Models that can learn.
Healthy-Nebula-3603@reddit
On the fly? No. Research settled that a long time ago. In the conscious state we only operate on context, like an LLM. During sleep, new information is integrated into our brain weights.
tim_Andromeda@reddit (OP)
What research are you referring to? My understanding is that Memories are formed by changing the strength of synapses. Otherwise how would we remember anything throughout the day?
fallingdowndizzyvr@reddit
That's short term memory. Sleep is when it gets integrated into long term memory. That's why there are some people who forget everything they learned during the day because they have a condition that prevents them from integrating short term memory into long term memory. For them, every day is the same day over again since they can't remember yesterday. The yesterday they can remember could have been years ago. Yet during the day, they remember what's happening since that's in their short term memory.
https://news.harvard.edu/gazette/story/2020/12/how-neurons-form-long-term-memories/
martinerous@reddit
How short is short-term memory in humans? For example, when would a person who has not slept for a few days start forgetting the information they gained at the beginning of the experiment?
fallingdowndizzyvr@reddit
People underestimate how much they sleep, since no one actually goes without sleep for a few days even though they think they have. Have you ever woken up too early and thought you couldn't get back to sleep? Yet the next time you look at your clock, it's 1-2 hours later.
People don't go days without sleep. They take micro-naps even if they don't realize it, which has been found to happen with the people who tried to set records for not sleeping. The thing is, they did sleep, just for really short periods of time. Which is clearly not enough, since by the second day cognition generally degrades, including memory.
https://www.msn.com/en-us/health/wellness/what-researchers-learned-from-a-guy-who-stayed-awake-for-11-days/ar-BB1hrCqf
RiotNrrd2001@reddit
Right now LLMs really only have what we might call long term memory, the built-in pre-trained stuff that never gets updated. They don't have a real short term memory the way that we do, so we use their prompts for that, continually feeding in the previous material in the chat on each pass. And that works, more or less, except that there's never an update cycle. Long term memory in LLMs (the weighting) just never gets updated from short term memory the way it does in us. So they can remember facts from 1982, but can't remember the conversation you just had two minutes ago and deleted.
I think that people are probably trying to think of ways to do this updating, but right now short term memory in LLMs is really clunky.
False_Grit@reddit
That....actually sounds like exactly what's happening. Dang.
Why can't this be accomplished, though?
I haven't yet tried to fine-tune an LLM, but I've fine-tuned and created LoRAs for image generators, and the process did not seem too difficult or time-intensive. (Admittedly, image generators tend to have quite a few fewer parameters than some of the more robust LLMs... I suppose an image is NOT actually worth 1000 words.)
Couldn't we just converse with the LLM normally throughout the day, then automate a fine-tuning or LoRA process at night to integrate the things it learned that day into the LLM itself?
RiotNrrd2001@reddit
Yes, I think that's exactly what needs to happen. Create a LoRA-like layer that accumulates weightings during the day that it can use like short-term memory, and then basically have a scheduled "dream period" where the daily accumulation of weightings gets pushed into the main system and then cleared for the next day (or whatever time period is optimal). A rough sketch follows this comment.
I myself do not have the technical expertise or equipment to do that, or even to work on it, but I think that is probably the way things will be developing. It seems to be the way that we ourselves operate, and nature does provide some good examples.
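Hand-waving a lot, the "dream period" could look something like this with today's tooling (assuming HuggingFace transformers + peft); the model name, data format, and hyperparameters are purely illustrative, not a working recipe.

```python
# Sketch of nightly consolidation: fine-tune a small LoRA adapter on the
# day's conversations, then merge it into the base weights overnight.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

base = "meta-llama/Meta-Llama-3-8B"            # any causal LM would do
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# "short-term memory": today's conversation turns, collected during the day
day_log = ["User: remind me my dentist is on Tuesday.\nAssistant: Noted."]
ds = Dataset.from_dict({"text": day_log}).map(
    lambda ex: tok(ex["text"], truncation=True), batched=True)

# a small LoRA layer accumulates the day's adjustments
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16,
                                         task_type="CAUSAL_LM"))
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="nightly", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()

# the "dream": fold the adapter into the base weights, start fresh tomorrow
model = model.merge_and_unload()
model.save_pretrained("model_after_sleep")
```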
False_Grit@reddit
Maybe Mark Zuckerberg will read this and devote a spare h100 cluster to our crazy idea???
Mark - If you're out there, I love you and the work you're doing!
Healthy-Nebula-3603@reddit
New data are integrated into brain weights during sleep. I was reading about people who didn't sleep for a few days (prevented by drugs for research); they didn't remember anything from the days when they did not sleep, because they overflowed their context (short-term memory). People with dementia also don't remember anything from the day before, because new data (like LLM context) are not integrated into brain weights during the night. I'll find those research papers later... they are very interesting.
tim_Andromeda@reddit (OP)
I think sleep just solidifies certain memories and drops others. Some synapses strengthen even more. But the same process is definitely happening throughout the day.
Healthy-Nebula-3603@reddit
Researchers say no.
Good: you say "I think so, so probably it is true because I think so".
How can you know better? Have you conducted research in this field 😅? Have you tried living without sleep for a few days to test it with an overloaded context?
For instance here
https://www.sciencedirect.com/science/article/abs/pii/S0168010222003194
Eisenstein@reddit
I have gone without sleep for days studying. I got good grades on those exams. That at least means it is a bit more complicated than whatever it is you propose is happening.
Healthy-Nebula-3603@reddit
Sure... for 1-2 days? Can you prove it, or do I have to take your word for it? We are talking about going without sleep for 5+ days in the research.
Apart from that, you did not overload your short-term memory, and you eventually slept to integrate the new data you gained.
Covid-Plannedemic_@reddit
Healthy-Nebula-3603@reddit
Short-term memory is the cumulative memory between sleep cycles, when it is assimilated into the brain weights.
That short-term memory has limits and can overflow, so you just forget facts and other stuff. You only remember the more important events, without details. It is funny how we lie to ourselves that we remember "everything".
fallingdowndizzyvr@reddit
No. It isn't. Check my last response to one of your posts I just posted. With an article to read. There are people that have perfectly fine short term memory. During the day they are fine. Yet when they wake up the next morning they don't remember anything that happened yesterday. Since sleep is when that short term memory is turned into long term memory. Which doesn't happen for some people.
AwakenedRobot@reddit
I get that, but models don't need sleep; maybe it could be on the fly for models eventually.
False_Grit@reddit
Maybe they do though? Fine-tuning takes a lot more time and compute power generally; maybe if we turned them off and finetuned them every night their long term memory would improve?
Healthy-Nebula-3603@reddit
I think it will be something like switching off for a few minutes (depending on compute power). The new data must be structured, filtered, and integrated. Nowadays, unlocking LLM weights and adding new data takes a lot of computing power. Maybe that's why we have to sleep so long: because it is very computationally demanding...
MaryIsMyMother@reddit
That's a large over simplification. The human brain has multiple tools at its disposal, from forming new synapses, changing the "weights" of those synapses, neurogenesis itself, as well as causing "global" effects through feedback loop involving neurotransmitters and hormones. Some of these things can be done on the fly, others can't.
fallingdowndizzyvr@reddit
No. That's pretty much certainly not how it works. It happens during sleep. There are classic cases of people with brain injuries that prevented them from integrating short-term memory into long-term memory. They are fine during the day. They seem absolutely normal. At dinner time they can tell you what they had for breakfast and lunch. But after waking up the next morning, they can't remember anything about the previous day. The yesterday they can remember could have been years ago, but for all they know, that was yesterday. Everything they experienced from the time they woke up yesterday was in short-term memory. That short-term memory was never incorporated into long-term memory and is now gone.
MaryIsMyMother@reddit
If you're talking just about memory formation then yeah of course. I thought you were saying the brain itself is a similar environment to an LLMs in a more literal sense
fallingdowndizzyvr@reddit
I think you are confusing me with that other poster you were conversing with.
As for the similarities between the human brain and LLMs. That remains to be seen. Since we don't know exactly how the human brain works. As we don't know exactly how LLMs work.
Healthy-Nebula-3603@reddit
Update on the fly? I do not think so.
How do you explain people with dementia and other similar injuries of the brain?
They remember everything from the whole day (context memory, like in an LLM) until they go to sleep, but after sleeping they completely lose the information from the day before.
That suggests something like context memory is flushed during sleep and assimilated into long-term memory, but that mechanism is broken for some people.
We are not permanently assimilating information during the day; that happens during sleep.
Accurate-Snow9951@reddit
I remember hearing about liquid neural nets but haven't seen them mentioned that much.
bblankuser@reddit
This, and people (LLM makers) are getting too comfortable with not using new technologies like Mamba or multi-token prediction.
Small-Fall-6500@reddit
This is not an unsolved technical problem but an unresolved problem with humans/society.
Google Deepmind, for all intents and purposes, solved long context in their Gemini 1.5 models. Whether or not Anthropic or OpenAI has also solved this is not relevant to whether or not it is technically possible or present in a leading model.
Almost certainly, scaling the Gemini 1.5 model would result in better long-context capabilities, but what matters more are aspects like the cost of using such long contexts. So I agree that "context size is still a massive problem", but not because no model can stay coherent at longer contexts; some already do.
Now, maybe Google has upped their security and no one will leak the special recipe for long context. In that case, worrying about most future models lacking this capability is more reasonable, but this problem is at least known to be possible to solve. It's reasonable to assume that, with all the funding and top researchers working on these kinds of problems, eventually, other labs/people will figure something out too, and probably fairly soon. (Or, for all we know, it could just be an already published technique like ring-attention but scaled up, and other labs just haven't really tried it yet.)
butthole_nipple@reddit
Google has more compute power than any other player afaik. I'm not sure they're doing anything more special than scaling
KillerX629@reddit
State space models suddenly ring a bell
azriel777@reddit
This is what I am hoping to see.
martinerous@reddit
Being more a philosopher than an LLM expert (but just average in both areas), I see the following directions of development:
faster, more resource-efficient inference (maybe based on what we see with BitNet or something else)
continuous learning (however, that would need to invent trust scores to prevent it from learning from garbage)
world model, based on different sources of information and interactions (or simulated interactions) with the physical world. Just a naive example: instead of feeding it millions of cat photos, we could mount cameras in animal shelters and let it learn how different cats look from different angles and poses, also feeding in the sound from microphones, and then somehow distill this model and safely merge it with other object recognition data.
algorithms for learning the basic concepts well and using external tools for more complex calculations or information. Essentially, instead of training on a huge amount of (somewhat chaotic) text, the model might get trained on some kind of ground truth rules. Imagine AlphaProof and AlphaGeometry but applied to different kinds of information. These core logic models then could use the LLM models as just text-generating engines. Similar to how the human brain works. We think in concepts and only after that, we express the concepts in words in any language.
However, some of these ideas might take more than two years to become mainstream. So, realistically, I wouldn't expect ground-breaking changes in two years, but more like steady progress - faster, smaller, smarter AIs where the LLM is just a part of it.
SeriousBuiznuss@reddit
Expected outcomes:
Fantasy Wish list:
Expected Regulatory Response:
Omnic19@reddit
LLMs in their current form are really expensive to run on digital hardware. If digital hardware like GPUs is going to be used, we will have to stop somewhere, maybe at 10 trillion parameters for data centers and 500 billion max for consumers.
But if innovations in analog hardware come along, we would be able to run trillion-parameter models on the consumer end.
tim_Andromeda@reddit (OP)
Analog hardware, are we talking like running LLMs on neural tissue?
Omnic19@reddit
LOL no 😂. There are companies like Mythic, Lightmatter, etc. working on analog devices.
There's also tech called memristors; you can look that up. The whole idea with memristors is that computation is carried out in the memory itself. The memory chip is also the processor, so there's no need for memory bandwidth to move data from memory to the processing units.
BlueboyZX@reddit
TL;DR: We have the raw neural network science to solve the issue but implementing it for such a huge dataset is likely to be too computationally taxing for a commodity PC.... today. :P
In TesseractOCR, there is a main body of trained data for its neural network that is pretty solid, but it can be advantageous to fine-tune for certain circumstances. Generally this is to improve accuracy for a given specific use case at the expense of some variable loss of accuracy in other contexts. This can be done by adding another layer on top of the pre-existing neural network, or by removing one or more layers and training over that with customized new layers. I have done this on my PC with both manually curated and synthetic data. The issue is that LLMs, well, are large.
Some of you mention the possibility of changing weights on a rolling basis to maintain the conversation. One aspect of LLMs that may be taken advantage of is that even a long conversation is likely to touch only a small portion of the model. If you want to adjust weights on the network without producing some massive thing, conceptually you could store the adjustments as a diff file that contains your 'temporary' changes to the network. This would only store the difference between the original network and whatever weights change during the conversation. It would be a tiny fraction of the size of the full LLM, and could be compressed and stored when you want to stop and resume the conversation later (a toy sketch of such a diff follows this comment).
In theory, we have the data science to do all of this, but actual practical implementation is another issue. Unlike, say, TesseractOCR, this difference would need to be done on a rolling basis. New weights would be calculated in much smaller increments, which would help the calculations not take so long as Tesseract's fine-tuning. OTOH, we are not exactly working with training a 20 meg network, so this will take a lot more computational power even with very focused weight changes.
Minimally informed spitballing: Adjusting the network topology and data structures used in the initial LLM training may make optimizations for this kind of diff calculation possible. I have a vague notion of a graph database being generated during the initial LLM training to help with fuzzy isolation of changes to help optimize the new weight calculations needed for this diff scheme. I have no actual experience using graph databases, so this may be totally off the mark. (Graph databases focus on borders between different pieces of data, totally different use cases from the old-school relational databases).
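A toy sketch of that diff scheme, with toy tensors standing in for a real state_dict: store only the positions whose weights changed beyond a tiny threshold, and re-apply them to resume later.

```python
# Sketch: store and re-apply only the weights that changed in a conversation.
import torch

def make_diff(base_state, updated_state, threshold=1e-6):
    diff = {}
    for name, base_w in base_state.items():
        delta = updated_state[name] - base_w
        mask = delta.abs() > threshold
        if mask.any():
            # keep only the changed positions and their deltas
            diff[name] = (mask.nonzero(as_tuple=False), delta[mask])
    return diff

def apply_diff(base_state, diff):
    resumed = {name: w.clone() for name, w in base_state.items()}
    for name, (idx, values) in diff.items():
        resumed[name][tuple(idx.t())] += values
    return resumed

base = {"layer.weight": torch.zeros(4, 4)}
updated = {"layer.weight": base["layer.weight"].clone()}
updated["layer.weight"][0, 1] = 0.25   # pretend the conversation nudged one weight

diff = make_diff(base, updated)
torch.save(diff, "conversation_diff.pt")   # tiny compared to the full model
restored = apply_diff(base, torch.load("conversation_diff.pt"))
assert torch.equal(restored["layer.weight"], updated["layer.weight"])
```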
petrichorax@reddit
Stop trying to predict, go make
blancorey@reddit
dont confuse frenzied updates with forward progress
lxe@reddit
Much longer context windows
Vast hardware and software optimizations for both training and inference
Alternative model architectures
Race to the bottom hardware competition
Dethronement: NVDA crash and OpenAI running out of funding
First casualties of state-level regulation
First casualties of workplace transformation: new expectation for white collars and creatives to use AI systems
aggracc@reddit
We're going to have grounding for the models from organic text to prevent hallucinations.
We're going to see agents that talk with each other to get over the fact you have only so many layers for each token.
GregsWorld@reddit
How would that prevent hallucinations?
aggracc@reddit
The short version: these are called transformers for a reason.
The long version: it's waiting peer review for 6 months so far.
GregsWorld@reddit
Sorry that didn't really clarify anything. Aren't hallucinations a byproduct of the fuzzy nature of neural networks?
Removing hallucinations would be removing the core of what makes neural networks good at pattern matching.
aggracc@reddit
Hallucinations happen orders of magnitude less frequently when there is a lot of text that the LLM can operate on while trying to produce its response. The simplest way to do this is to provide at least as much context as the text you expect out.
GregsWorld@reddit
Oh okay that's not addressing the root problem to prevent hallucinations then, just reducing how often they might occur
aggracc@reddit
Yes, the same way that error correction in CPUs doesn't eliminate hardware faults.
AfterAte@reddit
1) 70B will be the new 7B of 2023. All models will be using Quantization Aware Training, which will let a 70B at smaller-than-IQ1_S sizes (about 15GB) be as useful as one at Q5_K_M sizes (about 50GB); rough size arithmetic is sketched after this list. And mainstream cards will be 16GB, as used 4080s will be $300 and RX 6800s will be $150.
2) CUDA is going to be supplanted by a hardware agnostic ML framework that most other companies get behind. AMD should contribute to UXL.
3) The software around Coding LLMs will get so easy to use that programmers will stop having to use Stack Overflow, and it'll go out of business. Continue.dev in VSCode + Ollama is halfway there.
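Back-of-the-envelope check on the sizes in (1), assuming roughly 1.6 bits per weight for IQ1_S and 5.7 for Q5_K_M (approximate figures, not exact llama.cpp numbers):

```python
# File size scales with bits-per-weight (bpw): params * bpw / 8 bytes.
params = 70e9
for name, bpw in [("IQ1_S", 1.6), ("Q5_K_M", 5.7)]:
    print(f"{name}: ~{params * bpw / 8 / 1e9:.0f} GB")
# IQ1_S: ~14 GB, Q5_K_M: ~50 GB -- roughly the 15GB and 50GB figures above
```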
tim_Andromeda@reddit (OP)
Ooo. I’ll have to check that out. Already running ollama and vscode.
_daddylonglegz@reddit
LLMs on every type of edge device, and it'll be weird, like a microwave or washer and dryer with a built-in LLM.
Hopefully we’ll be moving on to something better than transformers. Maybe with more hardware support of smaller floating point data types we’ll be able to use architectures that weren’t realistic before.
tim_Andromeda@reddit (OP)
I had this funny vision the other day of a kitchen full of talking appliances that started arguing with each other and I had to break it up! 🤪
lakeland_nz@reddit
Two years!?
My overriding reaction is to ask: "do I care?".
I remember years ago I worked for a very conservative, cautious company. We had systems and processes for everything, all designed to ensure nothing would go wrong. Any time something went wrong anyway, we'd improve the process with an extra check to prevent that.
We were losing in the marketplace and I participated in a workshop looking at why. Our conclusion was our competitors were much less cautious, things went wrong, and they adapted so the consequence was relatively minor.
It changed my whole outlook on life. Rather than try to forecast the future, I try to understand and react to the present.
LLMs can't reason, but chain of thought and outputting a program to solve the question both help.
LLMs hallucinate, but RAG can ensure accurate data is in the prompt. LLMs have limited prompt length, but are remarkably good at rewriting it for you.
And so on. I don't really care what the next advance is, so long as I can pick it up and adapt quickly.
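As a toy illustration of the "output a program to solve the question" trick, where `call_llm` is a hypothetical stand-in for whatever local inference call you use:

```python
import contextlib, io

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical: plug in llama.cpp, Ollama, etc.

def solve_with_program(question: str) -> str:
    # Ask the model for code rather than an answer, run the code, keep what it prints.
    code = call_llm(
        "Write a short Python snippet that prints only the final answer to:\n"
        f"{question}\nReply with code only."
    )
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})  # sandbox this in anything real; running model output blindly is risky
    return buf.getvalue().strip()
```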
Some guesses just for fun:
Reasoning will shift from being a bolt-on to being integrated into the core architecture.
MoE will change to true specialists. It'll be more like a team of LLMs working on the problem, and sharing resources.
The strengths and weaknesses will be much better understood, with fewer paranoid bans or stupid projects doomed to failure.
AwakenedRobot@reddit
I think when robots walk among ya is when data is gonna sky rocket with all the dynamic interactions that come from doing day to day activities and actions
sensei_von_bonzai@reddit
This but with metaverse
Leo2000Immortal@reddit
Language models are close to peaking. We'll see growth in the multimodal domain. Also, we should have much more capable smaller models which will run on most smartphones.
Umbristopheles@reddit
They actually might have room to grow. Check out "grokking."
https://arxiv.org/abs/2405.15071
KillerX629@reddit
The thing about grokking is its cost. It was done with million-parameter models running a lot more training than the Chinchilla optimum. Now with that new bar set, the costs will rise to infinity until someone actually gets it and shows off the results. I pray to god it's not OpenAI.
Umbristopheles@reddit
Have I got the paper for you, then!
https://arxiv.org/abs/2405.20233
KillerX629@reddit
Oh lord... This is what having hope feels like
MoffKalast@reddit
There will certainly be more understanding and less memorizing in future models, but there's only so much you can really do with just text. Image/audio encoders need to become a thing at least in the pretraining phase so models know what a chicken sounds and looks like, not just how thousands have described it. Check out some medieval paintings of lions made by people who'd only heard about them in stories and you'll see what I mean.
paraffin@reddit
I agree that image and video data is useful, but not that it’s necessary. A deaf and blind person can still reason usefully about the world, abstract topics, and art.
kirakun@reddit
You should read this: https://en.m.wikipedia.org/wiki/Knowledge_argument
Camel_Sensitive@reddit
This thought experiment is poor, but worse than that, it isn't useful. Several reasons why:
1) It's ultimately an argument from ignorance. Mary experiences qualia when she leaves the room, but only because our understanding of how qualia arises from physical processes is at best, incomplete. It's possible for Mary to learn everything there is to know before experiencing it, but not with our current tooling.
2) Complex systems often exhibit emergent properties that are not predictable. Consciousness and qualia are probably emergent properties of the brain's complex quantum interactions.
3) Quantum non-locality and superposition most likely provide mechanisms for the seemingly non-local nature of conscious experience.
Actual good science has already been done on this subject by non-philosophers; I'd look into Orch OR, originally the brainchild of Roger Penrose (yeah, the guy who basically invented black hole mathematics).
Orch OR combines the Penrose–Lucas argument with Hameroff's hypothesis on quantum processing in microtubules. It proposes that when condensates in the brain undergo an objective wave function reduction, their collapse connects noncomputational decision-making to experiences embedded in spacetime's fundamental geometry.
If you're interested, he wrote a fantastic book on it called Shadows of the Mind.
paraffin@reddit
I fully believe that Mary learns something.
I just don’t believe she absolutely must go in the room in order to reason usefully about the world.
MoffKalast@reddit
Indeed they can, but to a much less accurate degree. If that weren't the case the current gen of models wouldn't be able to reason at all.
Plus the deafblind can still touch things and do start out with a fair bit of genetic knowledge about the world, so they still have a leg up compared to something starting out random and only ever being fed text data, even if it is a shitload of text data.
paraffin@reddit
Wrote this in reply to another comment but he deleted it. I feel it is vaguely relevant to your comment.
I'm not saying everything can be turned into text. I am saying that useful, generally intelligent systems can be trained purely on text for certain tasks.
From a quick look at some of Dr Marcus’ writing, I agree with him as far as characterizing current foundation LLMs. But it’s a limited perspective.
Transformers are universal function approximators. Therefore the limits they have are given by their parameter counts, the functions they are asked to approximate, and how well they are trained to approximate them.
The function of predicting text from the Internet is an interesting one. However, the vast majority of texts are poorly contextualized. The data is missing key predictors, so the LLM learns to make stuff up. While they do gain some generalizable reasoning skills, they are limited. While they do memorize certain facts, they don’t always store them in a manner that makes them applicable for all reasoning tasks.
Parameter count wise, transformers are probably not the final solution. We’ll likely need more recurrent solutions. Not because a transformer can’t learn to do anything, but because certain tasks would require far more parameters and training than is feasible.
MoffKalast@reddit
I mean in a sense you're right. Text is just UTF-8 encoded bytes, and you would need zero architecture changes to parse or generate any kind of data that can be encoded in bytes, which is... well anything at all. The tokenizer and context would just need to be larger and the training longer and much more varied. With more compute and cleaner data there would be no need to haphazardly glue on already working encoders either.
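To make the "everything is bytes" point concrete, a toy byte-level view (the helper names are made up; in practice you'd train a larger tokenizer over these bytes, as noted above):

```python
# Naive byte-level view: any file becomes a sequence of ids in 0-255, whatever the modality.
def bytes_to_tokens(path: str) -> list[int]:
    with open(path, "rb") as f:
        return list(f.read())       # one byte = one token id

def tokens_to_bytes(tokens: list[int]) -> bytes:
    return bytes(tokens)

text_tokens = list("a chicken".encode("utf-8"))     # text as bytes
# image_tokens = bytes_to_tokens("chicken.png")     # any other modality, same interface
```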
I doubt there's any practical difference in what RNNs and Transformers can be trained to do (i.e. both can learn anything), it's more of a question of which has better efficiency on current hardware. TF have the benefit of being stateless so they're obviously favored by anyone intending to run lots of requests in parallel.
paraffin@reddit
But even with language text alone, you can teach an LLM to generalize for useful tasks.
It’s just that you need better data than random internet stuff.
MoffKalast@reddit
Nah, random internet stuff is already enough to generalize for useful tasks, just look at llama-3.0, that's just 15T tokens of pure internet stuff almost topping the charts.
But there will still be lots of tasks that end up heavily approximated or won't be very reliable or consistent by only training on second hand sources instead of first hand data. Layouts in 3D space, intuition on how physics affects objects, reading facial expressions, etc.
paraffin@reddit
You’re right, it is. I was trying to say that the full spectrum of abilities that make up “general reasoning” in humans can’t be learned from learning to reproduce internet text data.
It’s not even (at least not primarily) the lossiness of the translation of ideas into text, IMO. It’s the loss of the context which produced the data.
An LLM can never predict what I’m about to write right now because it doesn’t have the context that I have as I write it. There is a seltzer can in front of me. So it learns to randomly say there is a can of seltzer sometimes, or play pretend, or make good-enough guesses.
And yeah, sometimes that context is perceptual and it’s lost and the LLM can only loosely approximate it. But GPT-4 can write code that draws a unicorn. Badly, but recognizable. It can tell you how to stack objects. It has a world model.
It’s not about the modality. It’s about the data-generating function and the function-learning function. Cameras are better data-generating functions for some tasks than typists.
MoffKalast@reddit
Now I'm absolutely laughing about an LLM learning that "there is a can of seltzer.... sometimes". You're probably right about the loss of context being a huge problem.
paraffin@reddit
The parameter count has to do with how many logical operations can be performed in sequence going down the layers. Without a recurrent solution one is limited. Autoregression is a decent kind of recursion but probably not enough.
mindful_subconscious@reddit
That reminds me of the McGurk Effect in psychology. It shows how your vision can affect hearing and vice versa. I may be wrong but it seems that reasoning could improve with multimodal options.
Draggador@reddit
humans have five sensory organs; even without eyes & ears functioning, there are three remaining; machines too need multiple sensory systems to achieve anything that even remotely resembles reasoning
paraffin@reddit
Does a multimodal model only have one sense: tokens? Obviously that's not a useful characterization, but at a certain level, all intelligent systems just have one or more streams of data. The interesting part is how the data is organized.
Models can then have “senses” far more rich, varied, and complex than humans. A model trained on time series data could have a “stock market” sense. It could have proprioception of its portfolio’s positions. An inner ear for balancing risk.
Digital embodiment is far more flexible than biological. The world seen through our senses is limited, abstract, and misleading.
What we find important about sense data is that it comes from physical systems - it gives our minds access to the world. Text data is the world filtered through other minds, lacking the context of the being who produced it. But text data can be much more. Text data can be produced from physical or simulated environments deterministically. It can capture mathematical and purely abstract ideas for which human senses are limiting.
Yes, visual data is better than text alone for understanding the human experience of the world. But it is not necessary for useful, sophisticated reasoning machines. It just depends what you’re trying to get the system to do.
Draggador@reddit
You seem quite confident in the claim that everything can be turned into text without anything getting lost in translation. What do you think about the criticism from folks like Dr. Gary Marcus about the limitations of LLMs in general?
olivierp9@reddit
https://www.lesswrong.com/posts/GpSzShaaf8po4rcmA/qapr-5-grokking-is-maybe-not-that-big-a-deal#:~:text=Grokking%20is%20an%20example%20of,deeper%20understanding%20with%20longer%20thought.
prototypist@reddit
I thought you were linking to the original grokking paper and was confused, but this is an interesting use of transformers to test that it's learned relationships, thanks
ecchirhino99@reddit
I believe they are very far from peaking. If the goal is to completely replace humans, you want the LLM to write you whole programs with multiple files and debug them. You will want it to design a full product in one go, including circuit boards and everything.
Also, improving creativity is an endless pursuit.
HeinrichTheWolf_17@reddit
Multi-Modality, Autonomous Agentic, Seamless Voice Chat Interface.
Fox151@reddit
Maybe the embodiment of ai recording its own new data from real world to train next gen models and improve itself (and the same on a digital simulated world)
Fox151@reddit
I also think medium sized world models (around 20 - 70B) with high density and multimodal capability (made from bigger models) could be the popular choice
olmoscd@reddit
someone (or nvda/amd/intc/qcom) will release a 32/64/128/512 GB card(s) purely for LLM acceleration and it will be cheaper than a GPU and use less power.
it just makes sense. i think this with open models will basically end the game and LLMs will get as good as they'll ever get but just run way more efficiently. they're basically out of data to train on at this point so GPT-4o/Claude/LLaMa are the beginning of the long plateau with this technology.
fedorum-com@reddit
About a year ago, I discovered GPT4All and on my laptop that has an NVIDIA 3080, the words came out slowly.
Fast forward one year, I use LM Studio and it spits sentences so fast that I have to scroll back to read.
Fast forward one year into the future, I would be happy if I no longer have to type and instead, could speak and hear and see the response. I'd be happy to invest in a 3090, 4090 or 50xx to have that.
In two years? I hate to think about it, as they will have to implement ads to pay for the development as well as the developers and investors. My hunch is that it will be more powerful but also more annoying. If I was ever wrong about anything, I hope it's about this!
trill5556@reddit
LLM is the new UX. Once you get used to talking with your data, you will not go back to running long queries and waiting for dashboards to update.
chrchr@reddit
Smaller, but more targeted and much more powerful. It will be much less visible to consumers, but will power things behind the curtains to a much greater degree.
Necessary-Donkey5574@reddit
Massively more efficient architecture and compute. Those 405b home servers are about to be way overkill.
Significant_Focus134@reddit
In 2 years, instead of LLM, we will have LMM (large multimodal models). I also suspect that these models will be embedded in the physical world (trained on gravity) and used in robotics. At least that's what I would do.
sdlab@reddit
Maybe it will become intelligent. But that's not for sure. I highly doubt it.
Kep0a@reddit
I think the field is going to stall for a while. Models are going to get better, but not that much better. We still haven't leapfrogged GPT-4 despite many arguing we would. LLMs are not even approaching AGI. We have a lot of incredible papers, but nothing that's a guaranteed 10-20x performance.
I like the adage: technology under-delivers in the short term, but over-delivers in the long term.
Klutzy-Smile-9839@reddit
GPT-4o voice is out in an alpha version. I've been using it since yesterday.
Kep0a@reddit
I didn't realize. That's cool. Does it work well?
Klutzy-Smile-9839@reddit
Incredibly well, as if I was talking to a well-intentioned private assistant, and I use it in French ! In English it is probably more than perfect.
Maykey@reddit
Subquadratic models. Being affected by a comma from 10,000 sentences ago is kinda wasteful.
Zeikos@reddit
I think that mostly everybody realized that scale is powerful but too limiting.
I'd expect a lot more research and compute put towards smaller models, and/or combination of smaller models.
Not necessarily the multimodal approach, but something reminiscent of it.
That's a maybe, because small models are something that's unpleasant for those companies, since size is a natural barrier against competition.
If GPT-6-level quality is achievable with a 10B model, then it'll be a massive problem for whoever wants to profit from AI, because any company could run one locally.
Klutzy-Smile-9839@reddit
Any company can run a local version of Windows, but they pay licence fees to avoid lawsuits.
carl2187@reddit
There's a documentary on it called "Terminator".
graphicaldot@reddit
Smaller models will be very task-specific, for example, booking tickets, coding, etc. There will likely be a manager model that will decide which specific model to use for each task. People will use these models on their own machines and will seldom use cloud AI. They will probably use encrypted data storage like Urbit to store their data. Websites and apps will just be in XML or some other format that will be easy for LLMs to understand and take action.
All these models combined will give a notion of Intelligence.
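A rough sketch of that manager-model pattern; `run_local_model` and every model name here are hypothetical placeholders, not real endpoints:

```python
ROUTES = {
    "booking": "tickets-3b",   # hypothetical task-specific local models
    "coding":  "coder-7b",
    "other":   "general-8b",
}

def run_local_model(model: str, prompt: str) -> str:
    raise NotImplementedError  # hypothetical: call into llama.cpp, Ollama, etc.

def route(request: str) -> str:
    # A cheap "manager" model classifies the request, then the matching specialist answers it.
    label = run_local_model(
        "manager-1b",
        f"Classify this request into one of {sorted(ROUTES)}: {request}\nAnswer with one word.",
    ).strip().lower()
    return run_local_model(ROUTES.get(label, ROUTES["other"]), request)
```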
benkei_sudo@reddit
I don't think we'll see true "AGI" (Artificial General Intelligence) in 2 years. We'll still be far from human-level intelligence, but LLMs will become increasingly useful for specific tasks and industries.
smilingcarbon@reddit
My predictions:
Popular-Direction984@reddit
Not sure about (1), but (2) and (3) - 100%.
MoffKalast@reddit
Absolutely not for (1), lmao. The latency to the moon is just 1 second, and you can have a team of human pilots working around the clock for a minuscule fraction of the cost it took to launch the thing. Space hardware is still extremely slow since you need radiation hardening, and it requires a level of control reliability and verifiability that no ML model can offer.
What we should start seeing instead are social robots here on earth. Like Star Wars droid tier ones, smart enough to understand speech and do very basic stuff in highly fault tolerant environments.
MaryIsMyMother@reddit
There's an LLM currently operating on the ISS.
MoffKalast@reddit
Ah yeah I suppose it would be useful for the astronauts to have a local knowledge base they can query if the laser internet link ever goes down. Kinda wondering what they ran it on that didn't immediately overheat though lol.
JustOneAvailableName@reddit
NVIDIA has had a more and more deep learning focus at least since AlexNet in 2012. They delivered their first DGX (dedicated Deep learning machine) to OpenAI in the summer of 2016 and their GPUs have had tensor cores since the V100, released in 2017.
-main@reddit
NVIDIA only kinda has a hardware lead. The real lead is the lock-in with their software. Maybe if AMD or someone writes some LLM-powered CUDA translator thing and it actually works, someone might challenge NVIDIA. Otherwise, I don't see them falling in the next year. Five years, who knows, anything could happen.
MoffKalast@reddit
Well there is that ZLUDA thing...
DeProgrammer99@reddit
And SCALE, posted here a few weeks ago: https://www.reddit.com/r/LocalLLaMA/comments/1e6jwf5/nvidia_cuda_can_now_directly_run_on_amd_gpus/
aggracc@reddit
CUDA is a big moat but not insurmountable.
AMD needs to write good drivers and their own version of CUDA which is well supported. I double dare you to get a ROCm hello world working on flagship cards in under a week.
-main@reddit
AMD may be too dysfunctional at the management level to achieve this. We'll see, I guess. It is one of the medium-sized questions about the near-term future of AI deployment.
aggracc@reddit
Yeah, what I find odd about the current CPU landscape is that AMD hasn't managed to fuck up ryzen. No idea how. Being born to fail seems to be in their DNA for every other product.
Eisenstein@reddit
AMD is Becoming a Software Company. Here's the Plan.
-main@reddit
Well, good luck to them. Some real competition in this market would benefit us all.
tim_Andromeda@reddit (OP)
I also think LLM will massively improve the capabilities of robots. I think LLM toys might be around the corner, like the “Perpetual Pet” of M3GAN fame.
Thellton@reddit
that's asking for talkie toaster... lol
tim_Andromeda@reddit (OP)
😆
Winter_Tension5432@reddit
I would say a 300M will outperform today's 2B models like Gemma 2B, and a 2B would be equivalent to a 27B Gemma. That is a little more realistic.
Zediatech@reddit
LLMs will be commonplace and integrated into almost everything you use to communicate. What still has room to grow is data processing and prediction models, amongst many other things!
FutureIsMine@reddit
We will have smaller LLMs get even more capable and that’s where we’ll be
runvnc@reddit
I think we are going to see live video with audio generation of customized personas that can be instructed the same way as LLMs. And maybe in two years, probably within three, they will be able to do a lot of stuff in the video, not just explain things... something like video generation could be expanded to include audio, text, and images, all in the same model.
So you could ask it to demonstrate how to cook something, and it creates a custom cooking video showing its virtual persona moving around a kitchen, etc. Things like that.
I think a lot of people will prefer the truly multimodal models over language ones.
Old-Ambassador3066@reddit
Either dead or alive. Not educated enough to make guesses on it. Are there any scientific models that could help determine that though?
DooDiDaDiDoo@reddit
In terms of modeling, it's both multimodal - including the very exciting motor coordination models that could completely revolutionize manufacturing and even healthcare in a longer time frame - and smaller, more efficient models. On a skeptical note, new architectures have been at best incremental improvements, and the holy grail of infinite context length is more and more feeling like a fool's errand, in the sense that even if it's achievable, in practical terms all the other LLM issues will still stand (needle in a haystack, hallucinations, etc.). Maybe things like attending to the relevant information with reasoning steps will get simpler with some advancements, but no big breakthroughs there.
In terms of learning algorithms, there's a non-zero, but dim chance of breakthroughs in either RL or the fancy stuff from the big academic dogs bringing yet another qualitative leap - think of things like the forward-forward algorithm or neurocognition-inspired update rules, but that would be actually applicable and allow for effective real time learning.
In terms of more direct applications, multi-agent systems coupled with guided generation and specialized models are already becoming the norm even for basic chatbots that were already pretty good with RAG. DSPy is showing promise at designing optimization in a quasi-unit-testing fashion. LLMs can be pretty ubiquitous if you treat them as semantically rich data mapping, or multiplexer, or whatever other abstraction you already use in your system design.
GregsWorld@reddit
Do you have any links about these motor models? Google turned up nothing.
DooDiDaDiDoo@reddit
The press name is "large behavior model", but the actual paper that serves as a good entrypoint is this one: https://arxiv.org/abs/2303.04137
It's only tangential to the LLM craze (it has a transformers component in its diffusion model), but a bunch of derivative work is already trying to leverage LLM semantics to guide robots in precision tasks.
montdawgg@reddit
We will be excited that local models can match GPT4 performance on a laptop.
We will also be anticipating GPT6 because of how dumb and limited we all think GPT-5 is. lol
LukeDaTastyBoi@reddit
RemindMe! 2 years "Ai predictions"
RemindMeBot@reddit
I will be messaging you in 2 years on 2026-08-04 15:31:19 UTC to remind you of this link
segmond@reddit
LLM = Large Language Models. That's it, we have seen it all. What can be optimized? They could be made smaller, faster, cheaper, smarter, with longer context. That's all I can imagine can be done to them. I'm not talking about multimodal, just the language part. That in itself would be amazing. Imagine a 20B model that's faster to infer, smarter than a 400B model, with 5 million context. That would definitely put us in a different realm in terms of possibilities unlocked. I don't know that we will be there in 2 years; if I had to guess, I would guess that a <100B model will be as smart as the Llama 3.1 405B model. I would hope open source would have a working and useful 1-million-context model by then too.
desexmachina@reddit
Supposedly, nothing has changed with actual ML strategies and algorithms in the last 20 years. What has changed is the capability of compute. I think compute will continue to improve and we'll see capabilities we didn't see before, if the premise is valid. Looking at the jobs now, all the reqs are asking for PhD-level CS people; good luck with that. I think, as usual, the corps will be caught with their pants down and some small player will come in and dominate the landscape.
asokraju@reddit
Tokenizer
Lazylion2@reddit
hopefully something that learns in real time like brains do.
When the AI is wrong and I find the solution myself I want to be able to teach it
petrus4@reddit
I don't think things will be terribly different, honestly. The technology was revolutionary 18 months ago, but then it scared the crap out of a lot of people, so they dumbed it down. What we have now is actually nowhere near as intelligent as early Character.AI and GPT4 was.
So we might get bigger models, in terms of parameter count, but I don't think there's going to be another major public breakthrough in real intelligence, (at least not for several years) because they won't allow it.
_ii_@reddit
The main thing that is holding current LLM applications back is compute cost. I expect LLM applications improvements to be inversely proportional to compute cost. So I think those impressive AI demos will be a reality in a couple of years. Anything beyond that is anyone’s guess. Maybe some researchers are working on a new architecture that will blow everything away, or if we 1000x the model size new capabilities emerges. So far, at least for my employer, all our recent model architecture improvements have been minor and most come with their own drawbacks. The most fruitful investment has been cluster utilization improvements. Distilling large models to small models is more trial and error than exact science. So the priority is to make the training-testing cycle faster and cheaper.
malinefficient@reddit
By utilizing comprehensive insights from within current LLMs, we will delve into showcasing the potential applications and notably those crucial findings will underscore their future developments.
squareOfTwo@reddit
Fast? Well, it's very slow, like a turtle at best, when the goal is to get to GI. Also the LM architectures don't change fast. Sure we have RWKV and Mamba, but these are exceptions, and too little work is done on searching for better architectures.
smahs9@reddit
Perhaps an unpopular opinion around here, but LLMs have already achieved quite a lot. To keep the engine running and set a foundation for new frontiers, the return on the massive investment already riding on the current crop of models must be realized. What is needed is a "wow" application based on LLMs with the potential to start a new wave of unicorns. Ergo, two areas that should remain the focus: more training (likely on an ever-growing corpus of generated data) for getting structured data from unstructured data (creating API calls or database queries reliably and directly, or with minimal tooling, from user input in natural language); and multimodal capabilities (IMO text + audio will have a bigger impact here, but architecture challenges remain).
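A minimal sketch of that structured-output pattern (prompt for JSON, validate before acting), where `call_llm` and the schema are hypothetical placeholders:

```python
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical local-inference helper

def nl_to_api_call(user_input: str) -> dict:
    # Prompt for JSON matching a known shape, then validate before doing anything with it.
    prompt = (
        'Convert the request into JSON of the form '
        '{"action": "create_ticket" | "query_db", "args": {...}}. '
        f"Reply with JSON only.\n\nRequest: {user_input}"
    )
    payload = json.loads(call_llm(prompt))  # fails loudly on malformed output
    if payload.get("action") not in {"create_ticket", "query_db"}:
        raise ValueError(f"unexpected action: {payload!r}")
    return payload
```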
emprahsFury@reddit
I agree that the status quo has created something special already, but I will generalize that to disagree that we need another wow moment. LLMs exploded, it's been argued, because hardware caught up to the research. Even if there are no more unicorns, the hyperscalers will still put funding into research (even if their investors halt the current outpouring), and the university system will continue to put money into research. And now that the DIB is interested, there are guaranteed moonshots.
HatLover91@reddit
pregnant
Sabin_Stargem@reddit
IIRC, it takes about 3 years for physical chip manufacturing to be readied, especially for new or novel designs. That means in about two years, dedicated AI hardware should become more common. My gut says that AI won't be practical for everyday consumers even then.
There is a big issue around 2027, in that China may invade Taiwan. Considering that the best and majority of chips are made there, that could disrupt the march of (civilian) AI. The folks who fund fabrication plants probably want to decide whether they should be in Taiwan - that is where the talent, and danger, is. Until the situation is clear, dedicated manufacturing of AI chips can't be on a huge scale.
As such, I am guessing it is around 2030 that AI goes from a metaphorical 1990s Information Superhighway into a popularized Internet. We need infrastructure and things to do with AI to make it widespread.
Themash360@reddit
I hope action models will play a bigger role, being able to access AI specific APIs in operating systems. Allowing AI to not just know what to do, but also do it.
nikitastaf1996@reddit
Native multimodality will be big. Powerful agency abilities arise from it so naturally.
favorable_odds@reddit
I think they will slowly become more adopted in the mainstream software developer world; maybe more prompt engineers will arise.
Specialized hardware or chips for consumers. Actually, I'm kind of surprised it isn't out there already. Keeping the possibility in the back of my mind that the model methods themselves could be improved. Maybe long-term memory, maybe less GPU, maybe even CPU-level models.
There's the possibility LLMs could lead to other breakthroughs which can't be predicted easily right now - maybe they can be trained to auto-make math proofs, or auto-scan and break systems at a mass scale.
tim_Andromeda@reddit (OP)
I have some strictly anecdotal evidence that LLMs are really accelerating the learning of coding for many programmers such that perhaps we’ll see another golden age of apps—not only infused with LLM tech— but brought about by people who ordinarily wouldn’t have had enough time to learn all the necessary knowledge to make an app. In other words, we will see more ideas come to fruition because the pool of people who can code a useful app will increase quite a bit, thanks to LLMs.
I think one of the greatest strengths of LLMs is that you can learn piecemeal. You can tell the LLM exactly what you want to know rather than reading book after book and attaining knowledge you don't necessarily need or will never use. It's kind of like having a book written from scratch just for you, tailored to your specific individual needs.
favorable_odds@reddit
I've tried to communicate this to some of my coding friends. It's kind of funny how the conversation goes: they just compare it to ChatGPT and GPT-4 (where they tried it a year ago), how it hallucinated some problems on something they were working on, and that's it, like it won't get better from there.
GregsWorld@reddit
It's good for learning in the same way Google is: it's great for getting a place to start and for explaining things.
But when it comes to being productive, you should listen to your coding friends. It's counterproductive if you want to build anything serious.
tay_the_creator@reddit
how is this field advancing fast? you mean academically? it's not like you have an AI gf now physically.
fasti-au@reddit
Hopefully where it belongs: as an orchestrator that breaks down requests for sending to other specific tools. Too many people are trying to use an LLM as a knowledge base when the reality is that vectors are flashbacks, not references. It's fed mistakes more than facts, so it's confused about life in general.
absens_aqua_1066@reddit
In two years, LLMs might be smaller, faster, and more accessible, possibly running on phones.
TheToi@reddit
I get 8.9 tok/s with Gemma 2 2B IQ4 on my phone (ASUS ROG Phone 6), which is pretty fast; 5 tok/s is more or less my reading speed.
Uncle___Marty@reddit
Both Apple and Android have front ends that run multiple different small LLMs already. And surprisingly, they're incredibly fast. LM Playground is one I use on Android, which runs everything from Qwen2 0.5B all the way up to Llama 3.
Your future is here already :)
swagonflyyyy@reddit
In your watch, with vision capabilities.
qudat@reddit
I would bet on us continuing to figure out how we can be less involved in the training stack. The goal would be to look at all the things we have to do in order to make an LLM and figure out how to let computers do it instead. This will lead to cheaper, and potentially more powerful, models.
Healthy-Nebula-3603@reddit
AGI, if we don't hit the wall...
a_beautiful_rhind@reddit
I think new architectures that aren't transformers will show up.
If we're not in some world war, we might end up with better models. It still won't be AGI though. The AI bubble is going to burst, so more practical applications will rule.
natso26@reddit
Let me take a wild guess and say that LLM/SLM progress will probably tend to “stabilize” then - in the same way that classical ML models have stabilized in capability, use cases, etc. in a predictable way. This means we will probably have “stronger” approaches by then, but LLM/SLM are still used in some cases where they are still useful, but maybe not as “general intelligence” 🤔.
-main@reddit
The big labs are talking about scaling inference compute. Doing tree-search over layers, doing much more interesting things than merely sampling output logits.
Everlier@reddit
A shift to actual intelligence modelling, rather than language. LLMs will help us generate datasets that outline the thought process as a more formalised set of invisible "tokens"; new-gen LLMs will output those to help guide reasoning and improve traceability.
Bakedsoda@reddit
GPT-5 running on 2-year-old smartphones.
xandie985@reddit
I guess Mamba-based models will explode in the market (if the hype about Mamba is true).
Feztopia@reddit
The theory is moving fast but the models aren't adopting all of it this quickly. Grokking might become the standard. I would wish for things like BitNet to be used. And more efficiency with the Mamba variants (Zamba, expressive hidden states, ...).
ResidentPositive4122@reddit
In 2 years I think we'll have some of these features either from the big 3 or from the opensource community, and I think no one can accurately predict which one gets there first. But anyway:
better explainability / more accurate "knobs" to play with - along the lines of the orthogonality method to remove refusals, but applied to "arbitrary" concepts. Spend a lot on inference, notice the common activation paths, extract them as knobs to be tweaked at inference time, or cook them into the model itself.
better merging between models. Work to re-use pre-trained frontier models to quickly boot-strap other models w/ different architectures / tokenizers / shapes.
true multimodal w/ synthetic data augmentation. Also multimodal w/ tool augmented modes (i.e. text is one modality and graph representation - as in graph theory - is another modality. Trained together)
advances in the distill / teacher student / merging at the "capabilities" level instead of weights level. Something along the lines of that paper where you train a very small model on a narrow task and then incorporate it via merging into the large model, where the large model gets some of the capabilities of the small model.
RL would play a much bigger role.
better fine-tuning support, with "pre-finetuning" scores given by the model itself, using ICL to re-write the instruct pairs with the base model and fine-tune on that, better data visualisation, better data coverage.
lfrtsa@reddit
SOTA will probably be essentially AGI in regards to text capabilities, but still far from humans in image understanding, playing video games, 3d modelling, drawing with SVG, image based IQ tests etc.
appakaradi@reddit
Massive context. Smart / Dynamic context length.
Ability to run normal inference and do deep thinking when needed.
More abstractions into the LLM. Fully multi modal. Audio video image etc.
frictionless interface with other expert LLMs.
Our phones will have more memory and neural processing. Smarter SLMs, may be tiny.
Build sophisticated systems automatically that auto-evolve and manage themselves. Not just throwing out code or basic agents.