LLM architecture
Posted by Ultima-Fan@reddit | ExperiencedDevs | 15 comments
So I’m trying to learn more about LLM architecture and where it sits regarding good infrastructure for a SaaS kind of product. All imaginary of course. What are some key components that aren’t so obvious? I’ve started reading about langchain, any pointers? If you have a diagram somewhere that would greatly help :) tia
Realistic_Tomato1816@reddit
What are you trying to build? I have deployed a few LLM solutions to prod. Many are large data lakes for RAG: terabytes of videos, PDFs, etc.
I've also built small one-off automation-type things, like detecting changes in a SharePoint volume and invoking an action as people update files.
But the larger RAG projects have a lot of tooling, and most of that is just regular software engineering. If I have 200 videos coming in every day, I have to build a queue to extract the images and audio and process them. That is not unique to LLMs; it's data engineering.
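To give a flavor, the extraction side is plain queue-and-worker code. A rough sketch, assuming uploads land as files and ffmpeg is on PATH; the queue wiring, paths, and names are all made up for illustration:

```python
import subprocess
from pathlib import Path
from queue import Queue

ingest_queue = Queue()  # filled by whatever watches the upload bucket

def extract_assets(video: Path, out_dir: Path) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    # One keyframe per second for the image pipeline...
    subprocess.run(["ffmpeg", "-i", str(video), "-vf", "fps=1",
                    str(out_dir / "frame_%05d.jpg")], check=True)
    # ...and the audio track for transcription.
    subprocess.run(["ffmpeg", "-i", str(video), "-vn",
                    str(out_dir / "audio.wav")], check=True)

def worker() -> None:
    while True:
        video = ingest_queue.get()  # blocks until the next upload
        extract_assets(video, Path("extracted") / video.stem)
        ingest_queue.task_done()
```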
The key concern, and my focus now, is building safeguards: stuff like preventing employees from entering and sending off sensitive data. That involves building a guard, which has nothing to do with an LLM; it's a custom in-house ML/AI model that detects that type of content so it never leaves the datacenter.
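The guard itself is just a gate in front of the outbound call. A crude sketch; the regexes and threshold are purely illustrative, and classify_sensitivity() stands in for the in-house model:

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def classify_sensitivity(text: str) -> float:
    return 0.0  # stub: the real thing is a custom classifier, run in-house

def is_blocked(prompt: str) -> bool:
    # Cheap pattern checks first, model second; reject before anything leaves.
    if SSN_RE.search(prompt) or CARD_RE.search(prompt):
        return True
    return classify_sensitivity(prompt) > 0.8
```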
Then volume. Unlike a regular API or web service, you don't get a quick response back that you can measure in milliseconds. How do you handle 400 concurrent users with open sessions, where a reply can take up to 2-3 minutes? You have to load-balance 400 open streams where one user gets a reply in 15 seconds and another in 3 minutes. I won't get into that. Now multiply that to possibly 50,000 concurrent users, all while filtering/guard-railing so they don't enter sensitive info.
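The rough shape of that problem in code, where stream_completion and send are stand-ins, not a real API:

```python
import asyncio

GATE = asyncio.Semaphore(400)  # cap on simultaneously open upstream streams

async def handle_session(user_id, prompt, stream_completion, send):
    async with GATE:
        # One session finishes in 15 seconds, another in 3 minutes; the
        # semaphore bounds how many are in flight, nothing more.
        async for token in stream_completion(prompt):
            await send(user_id, token)  # flush each chunk to the open connection
```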
And then the testing regime. How do you analyze the number of hallucinations, and then prevent similar prompts from hallucinating again so the next 4 people who ask those questions get the right answer? You have to build for that.
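One crude version of "build for that" is a curated-answer lookup in front of the model. The string-ratio check below is a placeholder; real systems usually compare embeddings:

```python
from difflib import SequenceMatcher

CURATED = {
    # normalized prompt -> answer a human has verified
    "what is our refund window?": "30 days from delivery.",
}

def lookup_correction(prompt: str, threshold: float = 0.85):
    key = " ".join(prompt.lower().split())
    for known, answer in CURATED.items():
        if SequenceMatcher(None, key, known).ratio() >= threshold:
            return answer  # skip the model entirely, serve the vetted answer
    return None  # fall through to the LLM and log the exchange for review
```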
A lot of these problems are just SWE issues, not LLM issues, but they are big edge cases to consider.
The fun stuff is extracting a frame from a video that has a table/chart, RAG-ing it into a vector store, and, when someone asks about it, delivering that exact point in the video. And stuff like "Hey, you can't ask that question because it is a violation of our policies and your upload has been flagged."
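Indexing one of those frames so a question can jump to the right timestamp looks roughly like this sketch; embed() and store are stand-ins for your embedding model and vector DB:

```python
def index_frame(video_id: str, ts_seconds: float, caption: str, embed, store):
    store.upsert(
        id=f"{video_id}@{int(ts_seconds)}",
        vector=embed(caption),  # embed a caption/OCR of the chart, not pixels
        metadata={"video": video_id, "t": ts_seconds, "text": caption},
    )
```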
Odd_Departure_9511@reddit
Are the LLM solutions you’ve deployed using pre-trained LLMs with personalization (RAG vectors), potentially fine-tuning, and orchestration targeted toward your company’s business needs? Or were they bespoke LLMs?
I mostly ask because, either way, it would be fun to pick your brain about compute and storage. Sounds fun. Wish I had opportunities like that.
Realistic_Tomato1816@reddit
I work on both. The most recent project was RAG; prior ones were trained, in-house models.
t0rt0ff@reddit
I wouldn't start with Langchain. I made that mistake; LC makes LLMs look much more complex than they really are. Just use the plain OpenAI & co. APIs to learn how to work with LLMs. Once you understand what they are (unless you already do), then you can try Langgraph or something else for more complex agentic flows.
As for architecture - it heavily depends on what you want to do: do you want to have chats with agents? Are they global or per entity? Are they isolated between users? How complex are the flows you want to automate? Do you need access to some large extra context (e.g. RAG)? etc.
E.g. if you simply need a one-shot LLM call to summarize something, you don't even need to think about it in terms of agents or LLMs; it is really just an API call with relatively high latency.
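Concretely, the whole thing can be three statements, assuming the openai Python SDK and an OPENAI_API_KEY in the environment; the model name is just an example:

```python
from openai import OpenAI

client = OpenAI()
document_text = open("meeting_notes.txt").read()

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": f"Summarize this:\n\n{document_text}"}],
)
print(resp.choices[0].message.content)
```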
metaphorm@reddit
this was exactly what we experienced at my company too. we got about 3 weeks into trying to do anything with Langchain before abandoning it as not ready for our use case.
Ultima-Fan@reddit (OP)
Let’s say I want to better understand how someone can use LLM APIs to solve a specific problem, e.g. look at a document and output structured data to support a human making decisions. The reason I’m mentioning LC is that I had a systems design interview and failed miserably lol, so I’m trying to understand how this works… Things like: how they are rate limited, what the inputs are and how the responses are structured, latency, cost, common techniques used. And a 10,000-mile view of the infrastructure this would be running on. Like, as a SaaS, would it make sense to use a message queue? I think yes. For databases I had MongoDB in mind, but I just learned about the existence of vector DBs…
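For concreteness, the kind of call I mean would be something like this sketch, assuming the openai SDK; the schema and invoice_text are invented, and JSON mode only guarantees syntactically valid JSON, nothing more:

```python
import json
from openai import OpenAI

client = OpenAI()
invoice_text = "ACME Corp. Total due: $1,280.00 by 2024-07-01..."

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},
    messages=[{
        "role": "user",
        "content": 'Extract {"vendor": str, "total": number, "due_date": str} '
                   "as JSON from this invoice:\n\n" + invoice_text,
    }],
)
data = json.loads(resp.choices[0].message.content)  # validate before trusting
```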
verzac05@reddit
Might I suggest n8n? You can self-host it using Docker and the great thing is that you can view the "outputs" between steps (e.g. you can preview the content of your email on the "get email" step, and you can preview the output of your LLM's "summarize email" step before sending it further down the line).
Caveat: I wouldn't use n8n to build an app (since its workflows can get annoying to chain, reuse, and such compared to just writing doFoo(); doOtherFoo()), but I found it to be extremely helpful for writing one-off scripts/automations for my personal use.
t0rt0ff@reddit
All of that is essentially in its infancy. I am not even sure there are real, solid, time-tested approaches out there. E.g. up until reasoning models were released, people had to implement some sort of reasoning manually via more complicated flows; now they can rely on reasoning models. Or: context windows are now approaching some crazy sizes, so in many cases you don't need RAG (vector DBs) - you can just shove everything into the request, go figure.
Anyway, unfortunately, I do not have a good answer for you except to just go and try to build something simple, e.g. a tool to summarize your emails, or some simple chatbot. I would recommend using plain APIs for that though, not Langchain/Langgraph - they are way overcomplicated in many cases.
Ultima-Fan@reddit (OP)
Gotcha, I appreciate your feedback :)
SucculentSuspition@reddit
Observability and validation are the name of the game for production SaaS AI engineering. We use Langfuse and Instructor; they are inevitably immature but best-in-breed atm imo.
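For anyone curious, Instructor in a nutshell: you hand it a Pydantic model and it validates (and retries) until the response parses. The Decision schema below is just an illustration:

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

class Decision(BaseModel):
    approve: bool
    rationale: str

client = instructor.from_openai(OpenAI())
decision = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=Decision,  # retried until it validates against the schema
    messages=[{"role": "user", "content": "Should we refund this late order?"}],
)
```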
Odd-Investigator-870@reddit
Non-obvious architecture detail:
- The LLM is an infrastructure detail; it belongs as far from your architecture as possible.
- Requests and I/O to external infrastructure should be protected by a Clean Architecture, so that they are arbitrarily swappable, like plugins.
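A minimal sketch of that idea, all names invented: the domain depends only on a port, and vendor adapters plug in at the edge.

```python
from typing import Protocol

class CompletionPort(Protocol):
    def complete(self, prompt: str) -> str: ...

def summarize_for_review(doc: str, llm: CompletionPort) -> str:
    # Domain logic has no idea which vendor (or stub) sits behind `llm`.
    return llm.complete(f"Summarize for a reviewer:\n\n{doc}")

class OpenAIAdapter:
    def complete(self, prompt: str) -> str:
        from openai import OpenAI
        resp = OpenAI().chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
```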
originalchronoguy@reddit
There is a cost to an LLM: either internal GPU compute (self-hosted) or per-token pricing. Cost should drive the architecture design. For example, if your users ask the same questions, or variations of them, 80% of the time, the design can include a cache or a pre-model filter to avoid incurring that cost.
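As a sketch, the pre-model cache is just a keyed lookup in front of the only line that costs money. Exact-match hashing shown here; catching "variations" of a question for real needs embedding similarity instead:

```python
import hashlib

cache: dict = {}  # swap for Redis/memcached in anything real

def answer(question: str, call_llm) -> str:
    key = hashlib.sha256(" ".join(question.lower().split()).encode()).hexdigest()
    if key not in cache:
        cache[key] = call_llm(question)  # the only line that costs money
    return cache[key]
```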
Some models/edge use cases may require a bifurcation in routing. Again, design. If a prompt can be handled by a CPU relatively quickly, it can go to that infra cluster through routing logic, which can be based on load. Someone asking at 3 AM can wait 7 milliseconds for a CPU model, while during the 9 AM rush hour, with 30 concurrent users, warmed-up nodes may bring the response down to 3 milliseconds; at 3 AM, a single user cold-starting a node may pay an additional 200 ms just to start up.
That bifurcation of traffic based on load, warm-up, and cost is the kind of architectural decision I have made.
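A toy version of that routing decision; the thresholds and cluster names are invented, and a real router would also look at queue depth and warm-up state:

```python
def route(prompt: str, concurrent_users: int) -> str:
    cheap = len(prompt) < 500  # crude proxy for "a CPU model handles it fast"
    if cheap and concurrent_users < 5:
        return "cpu-cold"   # 3 AM: a cold start's extra ~200 ms is acceptable
    if cheap:
        return "cpu-warm"   # rush hour: pay to keep a warm pool
    return "gpu"            # heavy prompts always go to the GPU cluster
```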
Odd-Investigator-870@reddit
You misunderstood the term "architecture". The LLM is just an infrastructure detail and should be isolated so that changes to it don't affect your system architecture. If you want caching behavior, use a Proxy pattern in your translation or application layer, but keep the LLM out of your domain layer.
https://refactoring.guru/design-patterns/proxy
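Applied to an LLM client, the proxy might look like this sketch (interface invented for illustration): the proxy adds caching, and callers never know it's there.

```python
class LLMClient:
    def complete(self, prompt: str) -> str:
        raise NotImplementedError

class CachingLLMProxy(LLMClient):
    def __init__(self, inner: LLMClient):
        self._inner = inner
        self._cache: dict = {}

    def complete(self, prompt: str) -> str:
        if prompt not in self._cache:
            self._cache[prompt] = self._inner.complete(prompt)
        return self._cache[prompt]
```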
originalchronoguy@reddit
Sure, on that premise, swapping out Mistral vs. Llama 3 vs. OpenAI is just a variable change in a deployment YAML that points to a different URI and endpoint. Most of them expose OpenAI-compatible APIs, so the swap is relatively easy, as you say.
I was referring to the architecture design of an app: how and when to use a specific LLM vs. a DB vs. an in-house model, etc. The presence of the LLM has a cost, and you design and architect your application around those cost constraints. So how and when it is used should be part of the system design. That is the kind of architecture I am referring to: architecting an application and its moving parts.