Question in regards to AI efficiency
Posted by Druber13@reddit | Python | View on Reddit | 10 comments
I have been trying to look up more details on this but don't want to build my own AI, and I can't seem to find full information on it. It seems my questions are more about the core foundations and not so readily available.
My main question is: are we seeing such compute and RAM issues because the AI models are built super inefficiently? And/or are they so generic that they have to do too much to cover the basics of just about everything?
For example: if we ask it some generic information about a state, is it running a model to break the question down, then doing something like a `SELECT * FROM` on a state table, and then going back through a model to format the information for the user?
Happy to look it up myself as well if anyone has some proper terms to search. I'm not sure what I should be looking up; just trying to wrap my head around what happens on the other side.
wandering_melissa@reddit
Asking Google or an AI first would be better if you feel you know nothing about a topic. Search for these on Wikipedia or Google: transformers (AI), LLM, GPT, RAG.
As for memory constraints, there are a lot of methods to reduce memory usage, but they result in less performant (dumber) models. One of these methods is quantization. I can run 2 GB models on my PC fine, but they are only useful for simple tasks.
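The quantization idea can be sketched in a few lines; this is a toy symmetric 8-bit scheme (not any specific library's method), just to show where the memory savings come from:

```python
import numpy as np

# Toy symmetric int8 quantization of a float32 weight matrix.
weights = np.random.randn(4, 4).astype(np.float32)

# Scale so the largest magnitude maps to 127 (the int8 limit).
scale = np.abs(weights).max() / 127.0
q = np.round(weights / scale).astype(np.int8)   # stored: 1 byte per weight
dequant = q.astype(np.float32) * scale          # reconstructed at inference

# 4x smaller than float32, at the cost of rounding error.
error = np.abs(weights - dequant).max()
print(q.nbytes, weights.nbytes)  # 16 vs 64 bytes
```

The rounding error is bounded by half the scale, which is why heavily quantized models get "dumber": the weights are only approximately what training produced.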
Druber13@reddit (OP)
I have tried googling it a bit. It's more of a backend design/structure question. I get the gist of how things are working, though not a ton by any means.
I'm just trying to understand how the big companies are doing it a bit more. I guess it's hard to formulate my question, which is why I'm having a hard time getting to my answer.
I'm just wondering how something like ChatGPT runs when you ask it a question, and moreover how it is optimized. The hard part, I guess, is that I'm not building an AI and don't plan on it; I just want to know this specific part lol.
wandering_melissa@reddit
ChatGPT, Claude, or any other "AI" (they are more specifically LLMs, which is why I said you should google LLM) works the same way: it takes a string, turns it into numbers, shuffles the numbers through the LLM's weights to get new numbers, and turns those numbers back into a string. Voila. If you want a more detailed explanation, check Wikipedia or YouTube. There are no extra steps or "backend" code; the rest of the explanation is mathematical equations.
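The string → numbers → numbers → string pipeline described above can be sketched like this. Everything here is a toy stand-in: real models use learned subword tokenizers, not character codes, and the "model" is billions of weights rather than a placeholder function:

```python
def encode(text: str) -> list[int]:
    # string -> numbers (real tokenizers map subwords, not characters)
    return [ord(c) for c in text]

def decode(tokens: list[int]) -> str:
    # numbers -> string
    return "".join(chr(t) for t in tokens)

def fake_model(tokens: list[int]) -> list[int]:
    # stand-in for "shuffle the numbers through the weights";
    # a real LLM predicts one next token at a time from the input tokens
    return encode("some generated reply")

prompt_tokens = encode("what is the capital of Ohio?")
output_tokens = fake_model(prompt_tokens)
print(decode(output_tokens))  # -> some generated reply
```

Note there is no database or lookup step anywhere in the loop; the "knowledge" lives entirely inside whatever the model function does with the numbers.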
Druber13@reddit (OP)
Someone below nailed it, I think. I didn't quite realize it worked like that. I work more in the data space, so I'm thinking of stored data and queries of some sort when it's all broken down, which obviously it isn't. It's very interesting that this all works, the more I'm learning about it.
latkde@reddit
LLMs are inefficient. They're essentially a brute-force solution.
On a more mechanical level, the task an LLM has been trained on is learning statistical correlations between word fragments (tokens). This works so well that LLMs can generate coherent text and have learned the relationships between the concepts underlying those words.
These learned relationships are stored as "weights", and we refer to the size of an LLM by how many weights it has. Useful models start at around 4 billion parameters, but state-of-the-art models have allegedly reached the trillion-parameter range. Assuming each parameter has been compressed into one or two bytes, models are between roughly 4 GB and 2 TB in size.
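The size arithmetic from that paragraph works out as follows:

```python
# Rough model-size arithmetic: parameter count x bytes per parameter.
GB = 10**9
small = 4 * 10**9 * 1    # 4B parameters at 1 byte each  = 4 GB
large = 1 * 10**12 * 2   # 1T parameters at 2 bytes each = 2 TB
print(small / GB, large / GB)  # 4.0 and 2000.0 (GB)
```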
For each "forward pass" (each output token), the current input tokens must be matrix-multiplied with the weights. That requires all weights to be loaded into memory, and 2 TB of GPU RAM would be quite a lot: more than any GPU on the market has. The GPU also needs space for the input data, not just the weights. So LLMs require a ton of memory.
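At its core, one step of that forward pass is a big matrix multiply against resident weights. This bare-bones numpy sketch shows only the memory/compute shape of the problem; real transformer layers add attention, nonlinearities, and normalization, and the dimensions here are made up:

```python
import numpy as np

d_model, vocab = 512, 32000
# This weight matrix must be resident in memory for the multiply;
# a real model holds hundreds of such matrices.
W = np.random.randn(d_model, vocab).astype(np.float32)

hidden = np.random.randn(1, d_model).astype(np.float32)  # current token state
logits = hidden @ W                 # one matmul against the weights
next_token = int(logits.argmax())   # greedy pick of the next output token
print(W.nbytes / 1e6, "MB just for this one matrix")
```

Multiply that single matrix by the number of layers and the much larger dimensions of production models, and the multi-GB to multi-TB figures above follow directly.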
There are strategies to reduce this:

* Compressing weights to use fewer bytes (quantization).
* Distributing different model layers across multiple GPUs.
* Splitting the model into multiple sub-LLMs so that not all weights have to be activated for each forward pass (mixture-of-experts).
* Investing more into the training phase so that smaller LLMs produce better-quality results.
* Using so-called reasoning, so that smaller models can handle complex tasks better.
* Hiding the exact model from users, and routing easy inputs to smaller models.
* Dynamically adding potentially relevant information to the input, so that the LLM has access to knowledge without additional training (RAG, tool calls).

But the result is still a huge memory+compute requirement. And many of these mitigation strategies involve more tokens, which also increases the computational cost and needs more RAM.
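Mixture-of-experts, for instance, amounts to a router picking a small expert per input so most weights stay idle. This is a toy illustration with made-up dimensions, not any production architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
experts = [rng.standard_normal((d, d)) for _ in range(4)]  # 4 small expert nets
router = rng.standard_normal((d, len(experts)))            # gating weights

def moe_forward(x: np.ndarray) -> np.ndarray:
    # Score each expert, pick the single best one; the other 3 are never
    # touched, so only 1/4 of the expert weights are active for this input.
    scores = x @ router
    best = int(scores.argmax())
    return x @ experts[best]

x = rng.standard_normal(d)
y = moe_forward(x)
print(y.shape)  # (8,)
```

Total parameter count (and storage) stays large, but the compute and active memory per token shrinks, which is the trade-off the comment describes.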
At no point will an LLM destructure your question and look up information in a database. That's not how they work. They can be trained to call tools, and we can tell the LLM that if it produces output in a tool-call format, we'll look up information in a database for it. Such systems, where an LLM is combined with tools, are sometimes called "agents". But a plain LLM, consisting of just the weights, has no structured knowledge, only the statistical token relationships that were learned during training.
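The tool-calling arrangement described above is glue code *around* the model, not inside it. A minimal sketch, where `llm()` is a stand-in for a real model call and the tool-call format is made up:

```python
import json

def llm(prompt: str) -> str:
    # Stand-in for a real model call. We pretend the model was trained
    # to emit a tool call when it needs data it doesn't "know".
    return json.dumps({"tool": "state_lookup", "args": {"state": "Ohio"}})

def state_lookup(state: str) -> str:
    # The database query lives out here, in ordinary code -- never in the LLM.
    table = {"Ohio": "Capital: Columbus, admitted 1803"}
    return table.get(state, "unknown")

TOOLS = {"state_lookup": state_lookup}

reply = llm("What is the capital of Ohio?")
call = json.loads(reply)
if call.get("tool") in TOOLS:
    # Run the tool, then (in a real agent) feed the result back into the
    # model so it can phrase a final answer.
    result = TOOLS[call["tool"]](**call["args"])
    print(result)
```

This is exactly the `SELECT * FROM` scenario from the original question, except the lookup is ordinary application code triggered by the model's output, not something the model does internally.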
It is also for this reason that LLMs inherently hallucinate, and cannot generally be trusted for fact-based tasks. They are approximately correct about many things, but they correlate, and do not know.
Druber13@reddit (OP)
That explains it.
Python-ModTeam@reddit
Your post was removed for violating Rule #2. All posts must be directly related to the Python programming language. Posts pertaining to programming in general are not permitted. You may want to try posting in /r/programming instead.
Kevdog824_@reddit
It's not entirely clear what you are trying to do with the model, so it's hard for any of us to even begin reasoning about what the issue is.
Druber13@reddit (OP)
I'm not wanting to do anything with one. Just wondering whether things like ChatGPT are coded poorly, and how they loosely work a bit more.
I'm imagining they are just banging out updates and not really worried about optimization and performance, since it's a race to something lol.
FrickinLazerBeams@reddit
You should maybe learn at least a little about how a neural net works before asking this kind of question. Like at least read Wikipedia a little and establish some basic conceptual understanding.