I think we should train LLMs in increasing complexity while avoiding material on the internet.

Posted by My_Unbiased_Opinion@reddit | LocalLLaMA | 17 comments

I think the current approach of training LLMs on internet data is the wrong way to go. Instead, I feel we should train an LLM the way a child learns. Start with books you would show an infant, then a toddler, then a child, and so on. Eventually you train it on graduate-level material, always using textbook-quality material.

The issue I have with internet material is that the information might not actually be correct, yet most people think it is because it gets repeated so often. I also feel that information should be taught in levels or layers, with the easiest concepts taught first and complexity and depth increasing from there.

It shouldn't only be taught STEM. Consider psychology, sociology, criminal justice, nursing. I'm a nurse by trade, and I feel that nursing specifically is really good material to train on. In a lot of ways, the material covers a ton of disciplines, from medicine to psychology, sociology, and math, and more importantly, integrates them.

Finally, for fine-tuning, written works of all types should be the focus. Teach the LLM how to write and be personable. Also, most of the content on the internet is generated by AI now. You don't want hallucinated material in your training data.

I'm thinking out loud. I don't work in tech, but I find LLMs fascinating.
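The staged approach described in the post is usually called curriculum learning. Below is a minimal sketch of just the ordering step, assuming every document already carries a difficulty label assigned by a curator. The stage names, the `Document` class, and both helper functions are hypothetical and only illustrate the idea; nothing here reflects how any real LLM is trained.

```python
from dataclasses import dataclass

# Hypothetical stage labels, ordered from easiest to hardest.
STAGES = ["infant", "toddler", "child", "teen", "undergraduate", "graduate"]


@dataclass
class Document:
    text: str
    stage: str  # one of STAGES; assumed to be assigned by a human curator


def curriculum_order(docs):
    """Return documents sorted from the easiest stage to the hardest."""
    rank = {name: i for i, name in enumerate(STAGES)}
    return sorted(docs, key=lambda d: rank[d.stage])


def curriculum_batches(docs, batch_size):
    """Yield batches in curriculum order, easiest material first."""
    ordered = curriculum_order(docs)
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]


# Toy corpus, deliberately listed out of order.
corpus = [
    Document("Cardiac output equals stroke volume times heart rate.", "undergraduate"),
    Document("The cat sat on the mat.", "toddler"),
    Document("Plants need water and light to grow.", "child"),
]

for batch in curriculum_batches(corpus, batch_size=2):
    print([doc.stage for doc in batch])
```

The hard part in practice is the labeling, not the sorting: someone (or something) has to decide which stage each document belongs to.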

17 Comments

teleprint-me@reddit

I agree with you. No one knows how to do this yet. This stuff isn't easy.

Monkey_1505@reddit

The trouble with teaching an LLM like a child is that humans come packed with hundreds of thousands of years of heuristic learning in the form of genetics, and LLMs come packed with zero. So a human can learn from substantially less information and generalize from it vastly better. You'd still need to feed an LLM a library of information to train a large SOTA model, and we haven't curated high-quality datasets of that magnitude. Maybe we don't even have that much.

ttkciar@reddit

> The trouble with teaching an LLM like a child is that humans come packed with hundreds of thousands of years of heuristic learning in the form of genetics, and LLMs come packed with zero. So a human can learn from substantially less information and generalize from it vastly better.

This is the crux of the disagreement between the Connectionists and the traditional Cognitive Scientists.

The Connectionists believe imposing humanlike implementation details on intelligence is a mistake, and that a relatively simple algorithm with sufficiently generalized application will figure everything out, given only information and enough compute power.

The traditional Cognitive Scientists delve deeply into how humans and animals think, and develop theory about exactly what you describe: the use of instinctual heuristics and other evolved algorithms to cogitate well on less information (for example, given an image with two circles and lateral symmetry, try to make it into a face, since face recognition has been an important part of animal evolution).

The current LLM boom is the Connectionists' time in the sun, and traditional CogSci has taken a back seat, for now. It's unfortunate that the relationship between the two camps tends to be antagonistic; IMO a hybrid approach might get us closer to practical general intelligence than either alone.

Monkey_1505@reddit

Brute force certainly took things further than one might have expected. Even if the total dataset available to man is size-constrained, the return on investment for scaling is an exponentially diminishing curve, and genuine understanding of anything clearly isn't there, the simulacrum of real dialogue based on sheer compute and data volume is quite impressive.

But the connectionists, at least the ones involved in actually producing AI, are tending towards data quality, distillation, teacher models and the like. That suggests even they don't believe data volume and compute alone are the key. Even casual experience with training your own AI (of any kind) quickly disproves the idea that quality and relevance are not important. On the contrary, a small amount of bad data can spoil your entire training run.

Things like memory and attention are also considered important, as is the manner of learning. People are trying to improve these all the time, and this does 100% border on the field of cognitive science, even if AI computer scientists aren't really consulting cognitive scientists, which I agree is a disappointment.

I think the crux of this really is a sort of hope that things will be simple. Engineers often oversimplify their view of systems, with a kind of optimism that they can be easily replicated. So here, they are aiming for simple, hacky solutions for things like attention or learning, because of their own bias towards a worldview in which what they aim for can be accomplished easily by them.

BigChungus-42069@reddit

Great input

beetlejorst@reddit

Yeah, that's not how this works. It needs hundreds of millions of data points to aggregate to get to the 'cognitive' level at which we currently find it interesting and useful. You could then finetune it on child-oriented stuff to begin with, working your way up, but it's not a thinking, self-reflecting individual. It wouldn't 'grow out of' the child stuff, so you'd likely just end up with an LLM that talked like a child occasionally.

aaronr_90@reddit

Don’t we all talk like a child occasionally?

Feztopia@reddit

"Also, most of the content on the internet is generated by AI now"

Most?

ttkciar@reddit

Pretty sure they meant most of the *new* content, but either way I'm not sure how one could prove or disprove the assertion without a multi-petabyte web crawl and entropy analysis.

My_Unbiased_Opinion@reddit (OP)

Just want to add, by "textbook quality material", I mean actual textbooks. This would be a copyright nightmare. But I honestly feel like this is the reason why Qwen is so efficient; I feel there is a TON of copyrighted material in its training data.

Healthy-Nebula-3603@reddit

Learning from someone counts as copyrighted data? WTF. So how did that person learn before the first person... Who comes up with such fucked-up ideas...

Fun_Librarian_7699@reddit

I think you can't compare machine learning with human learning.

Healthy-Nebula-3603@reddit

It is exactly the same. Humans spend most of their time fucking around, not learning facts from books.

Interesting-Fish-542@reddit

What if, instead of having a large vocab, I start with a small vocab that is trained along with the model and grows in size over time? This way, I would start with the alphabet, then move on to short words, and so on. I can arrange my training datasets in increasing order of complexity.
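One way to picture the growing-vocabulary idea is the toy sketch below. It assumes a whitespace word split, a character-level starting vocabulary, and a simple "add the most frequent unseen words" growth rule; all three function names are hypothetical and this is not how any production tokenizer works. A real model would also need its embedding table extended every time the vocabulary grows, which this sketch ignores.

```python
from collections import Counter


def char_vocab(texts):
    """Stage 0 vocabulary: every distinct character in the corpus."""
    return set("".join(texts))


def grow_vocab(vocab, texts, n_new):
    """Return a copy of vocab extended with the n_new most frequent unseen words."""
    counts = Counter(word for t in texts for word in t.split())
    candidates = [w for w, _ in counts.most_common() if w not in vocab]
    return vocab | set(candidates[:n_new])


def tokenize(text, vocab):
    """Use whole-word tokens when available, otherwise fall back to characters."""
    tokens = []
    for word in text.split():
        tokens.extend([word] if word in vocab else list(word))
    return tokens


texts = ["the cat sat", "the cat ran", "the dog"]
vocab0 = char_vocab(texts)                  # letters (and space) only
vocab1 = grow_vocab(vocab0, texts, n_new=2)  # adds the two most frequent words

print(tokenize("the dog", vocab0))
print(tokenize("the dog", vocab1))
```

With the character-only vocabulary every word is spelled out letter by letter; after one growth step the most frequent words become single tokens while rarer words are still spelled out.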

HephaestoSun@reddit

Didn't people say some time ago that grokked models are very good? The starting model could be grokked on a smaller but more robust dataset, and once that is done we could just add new stuff on top. That would be interesting.

My_Unbiased_Opinion@reddit (OP)

Yeah. The idea at its root is really training efficiency. Textbooks and educational material are very high quality. I feel that is what we should focus on. I know Phi was trained on textbook-quality data, but I feel like the RP/storytelling data was skipped. I feel we can train LLMs on much less data if we can supply high-quality data from the start. Start with simple letters and move up in complexity with more advanced topics.

Language models use language. Textbooks are language books, basically, so starting with fewer words (less complex and varied connections) is the best way, since those weights would be foundational and solid. (Using fewer words to convey an idea is ideal.)

I don't know a thing about how LLMs learn. I'm sure my idea has been tried, and there must be a reason why it's not more popular.

M34L@reddit

Most of the random bulk of data is needed to build the natural language processing and whatever bit of "reasoning" they possess.

Also, you are completely missing that a 30-year-old PhD-level human spent maybe a single year or a couple learning from "textbook quality material" and like 29 years fucking about and finding out, including trying to eat dirt for quite a while.

Compared with the quality of most of the data an average human learns from over their lifespan, "shit on the internet" is an extremely well-curated, highly factual, highly efficient condensation.