Has anyone seen AI agents working in production at scale?
Posted by madredditscientist@reddit | LocalLLaMA | 73 comments
It doesn't matter if you're using Swarm, LangChain, or any other AI agent orchestration framework if the underlying issue is that AI agents are too slow, too expensive, and too unreliable. I wrote about AI agent hype vs. reality a while ago, and I don't think the picture has changed yet.
By combining tightly constrained LLMs, good evaluation data, human-in-the-loop oversight, and traditional engineering methods, we can achieve reliably good results for automating medium-complex tasks.
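Concretely, the pattern I have in mind looks something like this (a rough sketch; every function name here is a hypothetical placeholder, not a real library):

```python
# Minimal sketch of a constrained, human-in-the-loop pipeline.
# run_constrained_llm, passes_validation, execute, and send_to_human_review
# are all hypothetical placeholders.
def handle(task: str):
    draft, confidence = run_constrained_llm(task)  # e.g. schema-constrained output
    if confidence >= 0.9 and passes_validation(draft):
        return execute(draft)           # safe, automated path
    return send_to_human_review(draft)  # escalate anything uncertain
```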
Will AI agents automate tedious repetitive work, such as web scraping, form filling, and data entry? Yes, absolutely.
Will AI agents autonomously book your vacation without your intervention? Unlikely, at least in the near future.
What are your real-world use cases and experiences?
nayeo@reddit
If the US gov is not scamming, we can see big improvements when DOGE uses the data of American citizens to sell "new customized products" off of that data...
nyc217@reddit
I use an agent called Skej which is an AI personal assistant. It handles all my meeting scheduling, responding to emails for me, handling the back and forth. It works.
Swimming-Fondant-180@reddit
100%, we use agents for lead generation using langchain.
But not in isolation: we had to add a security layer on top (ZenGuard AI), use caching (Redis) for cost optimization, and source good data for quality outreach.
SmythOSInfo@reddit
I'd argue that scaling is actually less of a challenge for small AI agent developers than it might seem, and large corporations may not need to scale as much as we think.
For small developers and big corporations alike, the real challenge isn't necessarily scale, but reliability, cost-effectiveness, and finding the right use cases. As you mentioned, tightly constrained LLMs with human oversight can achieve good results for medium-complex tasks. This approach may be more practical and immediately valuable than trying to create fully autonomous, general-purpose agents at massive scale.
rackmountme@reddit
Palantir does.
bgighjigftuik@reddit
No. People are just experimenting. The unreliability is still a major issue: any derailing in the auto-regressive generation process can be fatal for an agent
a_slay_nub@reddit
Not just unreliability, but unreliability with autonomy. I have been really impressed with language models. But the moment you need them to put something in JSON or respond in an exact format so you can do something else with it, their reliability (or perhaps perceived reliability) drops off a cliff.
I've been trying to get LLMs to convert sentences into a constrained natural language and they just suck at it. They do what they want to do and nothing will convince them otherwise.
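The usual workaround I've landed on is parse, validate, retry; a rough sketch, where call_llm is a hypothetical stand-in for whatever client you use:

```python
import json

def get_json(prompt: str, required_keys: set[str], max_tries: int = 3) -> dict:
    for _ in range(max_tries):
        raw = call_llm(prompt)  # hypothetical LLM call
        try:
            obj = json.loads(raw)
            if isinstance(obj, dict) and required_keys <= obj.keys():
                return obj      # parsed, and has the keys we need
        except json.JSONDecodeError:
            pass
        prompt += (f"\nYour last reply was not valid JSON with keys "
                   f"{sorted(required_keys)}. Reply with only the JSON object.")
    raise ValueError("model never produced valid JSON")
```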
Philix@reddit
Huh, someone else trying to do something similar to me. I've been on-and-off trying to get a reliable automation working for a similar task for several months. What methods have you experimented with?
Token bans (and logit bias) are cumbersome, unreliable (since many common words share tokens), and a boatload of manual work even with regex. Control vectors are far too imprecise. LoRAs have a chicken-and-egg problem where I need the dataset to generate the dataset. I'm experimenting with GBNF grammar at the moment, since one of the examples for its use cases was emoji-only output, but it seems to share the same problems I had with token bans.
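For reference, the grammar experiment looks roughly like this with llama-cpp-python (a sketch; the model path and toy grammar are placeholders, not my actual setup):

```python
from llama_cpp import Llama, LlamaGrammar

# Toy grammar: lowercase words separated by single spaces, ending in a period.
GBNF = r'''
root ::= word (" " word)* "."
word ::= [a-z]+
'''

llm = Llama(model_path="model.gguf")  # any local GGUF model
grammar = LlamaGrammar.from_string(GBNF)
out = llm("Rewrite 'The dog went to the market' as subject verb object: ",
          grammar=grammar, max_tokens=32)
print(out["choices"][0]["text"])
```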
If that doesn't work out, I'm pretty close to abandoning my project until I see news about a decent tool to do what I'm looking for, or I can find some lunatic willing to give me enough money to pay a team of linguists for a few years.
a_slay_nub@reddit
I worked with GBNF for a bit. I had to abandon it for a few reasons:
1. The company server uses vLLM for the Llama 3.1 405B model, and outlines guided grammars do not currently work on vLLM. Because of this, I was limited to whatever model fits on my 3090 workstation.
2. LLMs like to say what they want to say, especially models designed to game LMSYS leaderboards; they favor normal conversational English. If you tell them to convert the sentence "The dog went to the market" into "subject verb object period" format, they will abuse the hell out of the grammar. Say you have a grammar that allows three strings and a period: you will end up with "The dog went_to_the_market." It starts with two strings, then realizes the only thing it can output after the last string is a period, so it concatenates the rest of its "thoughts" onto the end. The worst part is that when you run this through Lark, it parses perfectly fine.
Lately, I've been working with our company API and L3 405B without grammars, instead verifying the LLM's output against a Lark grammar. I then use an autonomous back-and-forth to either verify the parse tree or tell the LLM it was wrong. It seems to work much better, but I'm only getting ~85% accuracy.
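The verification loop is roughly this shape (a sketch; call_llm is a placeholder for our API client, and the grammar is a toy version):

```python
from lark import Lark, UnexpectedInput

# Toy "subject verb object period" grammar; underscores are allowed in WORD,
# which is exactly why "The dog went_to_the_market." parses fine.
parser = Lark(r'''
    sentence: WORD WORD WORD "."
    WORD: /[A-Za-z_]+/
    %import common.WS
    %ignore WS
''', start="sentence")

def convert(sentence: str, max_tries: int = 3) -> str:
    prompt = f"Rewrite as 'subject verb object.': {sentence}"
    for _ in range(max_tries):
        out = call_llm(prompt).strip()  # hypothetical LLM call
        try:
            parser.parse(out)
            return out                  # parse tree is valid
        except UnexpectedInput as err:
            # Feed the parser error back so the model can self-correct.
            prompt += f"\nYour answer {out!r} failed to parse: {err}. Try again."
    raise ValueError("no valid parse within retry budget")
```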
Philix@reddit
Thanks for the reply. Point 2 is largely the problem I ran into with token bans, and am now grappling with using the GBNF grammar: the model's output gets mangled just like you describe. Nice to know I'm not just making a dumb mistake somewhere and someone else is having the same difficulty.
a_slay_nub@reddit
Yeah, let me know if you have any luck because I'm banging my head against the wall. At least I have funding with customers that don't ask too many questions. God I love/hate working for the government.
amko-ipho@reddit
What exactly do you mean by outlines guided grammar? Because we're using the outlines JSON schema with vLLM and it works fine (on version 0.4.1, I think).
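Roughly what we do, through vLLM's OpenAI-compatible server (a sketch; the guided_json extra parameter may vary across vLLM versions, and the model name is a placeholder):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="-")
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}
resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # whatever the server runs
    messages=[{"role": "user", "content": "Extract name and age: 'Bob is 42.'"}],
    extra_body={"guided_json": schema},  # outlines-backed guided decoding
)
print(resp.choices[0].message.content)
```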
a_slay_nub@reddit
https://github.com/w013nad/vllm/blob/patch-2/examples/guided_decoding.md
amko-ipho@reddit
Cool, didn't know about this! Thx
Asatru55@reddit
JSON mode in the OpenAI API works; it'll force the output to be valid JSON. It'll mostly keep to a schema if you provide one in the prompt, even with GPT-4o mini.
I'm running a simple data labelling agent operating on a given schema in production.
Every now and then it will hallucinate categories that aren't actually in the schema, but that can actually be useful for spotting outliers.
I just don't buy this whole 'AGI' schtick. If you think of the utility, LLMs / agents add a way to enhance deterministic programs with probabilistic code and reliable NLP.
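For reference, the labelling call is basically this shape (a sketch; the model choice, schema, and flag_as_outlier handler are illustrative):

```python
import json
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},  # forces syntactically valid JSON
    messages=[
        {"role": "system",
         "content": 'Label the text. Reply as JSON: {"category": "<billing|support|spam>"}'},
        {"role": "user", "content": "Please cancel my subscription."},
    ],
)
label = json.loads(resp.choices[0].message.content)
# JSON mode guarantees valid JSON, not schema adherence, so still check:
if label.get("category") not in {"billing", "support", "spam"}:
    flag_as_outlier(label)  # hypothetical handler for hallucinated categories
```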
Odd-Environment-7193@reddit
Yeah, I'm using the JSON format with the Vercel AI SDK; it's just a Zod schema. Pretty awesome stuff. It has some unwanted behavior, like shorter responses, that I'm busy fighting with, but overall it's very useful.
Nyghtbynger@reddit
Maybe we'll have to wait for the next breakthrough in attention models or whatever (transformer level breakthrough)
a_slay_nub@reddit
Honestly, I hope not. As far as architecture breakthroughs go, we haven't been doing that well; most breakthroughs lately have been in data and compute. I hope we'll have luck sooner rather than later, or my project is in trouble.
Granted, our customer is the government, so heavens knows they'll continue throwing money at us as long as we can give them a shred of hope.
Noxusequal@reddit
The diff transformer architecture looks promising, I think :)
Nyghtbynger@reddit
Can you apply it in parallel to another project, or use your product for another purpose?
Explain it like you're waiting for a breakthrough in technology to be first to market, and have free resources in the meantime. The best strategists are patient sometimes.
macronancer@reddit
This has NOT been my experience. I have used JSON generation for creation of programming agents and role playing game masters. Works just fine.
Ex: https://escaperoom.creocortex.com/
ElectricalHost5996@reddit
Maybe the approach from the Duolingo founder's reCAPTCHA will work. It was used to digitize old New York Times archives: it showed scanned text to CAPTCHA users, and if 7 of 10 gave one answer, 2 gave another, and 1 gave something different, it took the majority answer as correct. So asking the model a prompt and then checking its answer might reduce the error rate. Or is it input-dependent, i.e. does it hallucinate consistently on one particular type of input?
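In LLM terms that's basically majority voting (self-consistency); a rough sketch, where call_llm is a hypothetical stand-in:

```python
from collections import Counter

def majority_answer(prompt: str, n: int = 10) -> str:
    # Sample the same prompt several times and keep the most common answer,
    # like the 7-of-10 CAPTCHA rule described above.
    answers = [call_llm(prompt, temperature=0.8) for _ in range(n)]  # hypothetical
    best, count = Counter(answers).most_common(1)[0]
    return best if count >= 0.6 * n else "UNSURE"  # escalate low-agreement cases
```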
ybotics@reddit
Sounds like some fine-tuning / specialisation might help with these problems.
petrichorax@reddit
Are you giving example responses in your prompt/system?
Open-Designer-5383@reddit
But I also think people are over-interpreting the phrase "at scale": enterprise services, where AI agents are supposed to help, operate at a much smaller scale than internet companies or even cloud-scale data companies.
Right now, AI agents are tried and deployed company by company; use cases, adoption, and development all differ. I do not think there will be one deploy button for AI agents that works for all use cases. There might be one eventually, but enterprise data coupling is important, so a lot of them have to be tuned to enterprise data and workflows.
fairydreaming@reddit
I agree that the reliability of LLMs is a major problem. I think that for every problem with variable complexity (e.g. the number of variables to take into account) there is a complexity limit beyond which LLMs become unreliable. Even the best models like o1 can't reliably solve a simple problem if it has too many variables. What's worse, this reliability limit depends on the model, the problem, the system prompt, the problem's location within the context window, and likely multiple other factors. The only solution is to break larger problems down into smaller ones that can be solved reliably, solve them separately, and then combine the results, but that's not always possible.
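In code, the break-it-down approach is just this shape (a rough sketch; split_into_subproblems and call_llm are hypothetical placeholders):

```python
# Decompose-solve-combine: keep each LLM call below its reliability limit.
# split_into_subproblems and call_llm are hypothetical placeholders.
def solve(problem: str) -> str:
    parts = split_into_subproblems(problem)
    partials = [call_llm(f"Solve just this step: {p}") for p in parts]
    return call_llm("Combine these partial results into one answer:\n"
                    + "\n".join(partials))
```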
0xd00d@reddit
o1 is interesting. I dunno if this has changed, but they don't give system-prompt-level control. So regardless of its extra prowess in "reasoning ability", combined with this reduction in prompt-based control, we can't leverage those added smarts to make it more reliable than existing, less smart models at following instructions for function calling and such. It's always a gamble whether the LLM gives enough credence to the parts of your prompt it must follow to the letter to produce a usable result. Needing to sprinkle in this prompting with increasing frequency is not my idea of a proper way to address the issue, but it seems at least mildly effective.
Poildek@reddit
In 2 months, all our L1 tech support will be handled entirely by generative AI agents (ticketing and monitoring). We have developed our own framework, which we also released as open source.
Illustrious-Tank1838@reddit
Share maybe?
porcupinepoxpie@reddit
IIRC Ironclad (https://ironcladapp.com/) was using LLM agents for legal documents (e.g. converting a contract meant for Delaware to Virginia).
Treat each one like an intern, log their tool use as a way to check their work.
Original_Finding2212@reddit
What are agents? Do we have to use a framework?
If no, then yes, I have.
Works perfectly if designed right.
asankhs@reddit
In my experience over the past couple of years, what I have seen working at scale in production can only be called "automation"; true agent-like workloads are not reliable enough for enterprise use cases yet.
nomaddave@reddit
I’ve seen one failing constantly in production at a large company some people will have heard of. It’s crashy, expensive and customers don’t like it. The CSRs were already let go. It’s always going to be that way for this company, but I could see others finding success with it with the right customer base/industry.
GoogleOpenLetter@reddit
I'd argue that https://websim.ai/ is an agent that scales. It's an agent in the sense that it's building websites with an LLM and able to actually deploy them. We often have shifting goalposts for what's considered an agent; if we went back 10 years and said you could make a fully deployed website in 30 seconds just by telling a chatbot what you wanted, it would be considered magic.
Perfect-Campaign9551@reddit
That's not an autonomous agent use case example though
balcell@reddit
Meh, it keeps requesting login details for data harvesting or a poorly designed sales funnel. No-go. I bet the tech is fine though.
Perfect-Campaign9551@reddit
I agree with you, people are way overestimating the ability of llms, they are too unpredictable.
Asking_Help141414@reddit
Not sure what you'd call this, but it's live in 4 countries / all industries. Pretty good for an "at scale" sales/job/prospecting agent.
Status-Shock-880@reddit
As far as I can see, if all the agents are LLMs, it's very limited. But add agents/tools that aren't LLMs, and you get somewhere interesting.
Rizzon1724@reddit
Clay.com has highly effective use cases that are in production with enterprise customers. They developed their own agentic research feature and model, called Claygent (the AI agent web research feature), with Claygent Neon (the model).
It gets powerful when combined with their workbooks, tables, automated workflows, and triggers, along with APIs for the tools they already integrate, or custom API requests.
With great prompting, regular AI prompting in Clay can be used pretty powerfully; combined with Claygent it can be a beast for agentic research, browsing, scraping, etc., and combining it with all the other features and functionality, you can scale things up pretty damn well IMO.
I realize this isn't a full developer example, but you can essentially create any agentic framework or model you want, and if you're limited by something, you can use custom API calls to send and bring in anything you need, wherever and whenever you want.
Use cases: check Clay.com
For me:
- Digital PR link-building campaigns with original data studies (everything A to Z)
- Thought leadership PR opportunities for clients
- Building topical authority with data-driven content that earns natural backlinks, to also drive domain authority for SEO and organic traffic
- Building media and contact databases, enriched with AI
- Creating, managing, tracking, and reporting on PR / sales / link-building campaigns
- Prospecting, contact finding, segmenting, personalizing segmented pitches, etc. for PR / sales / link building
- Media monitoring, trend analysis, brand analysis, etc.
After setting up and testing prompts, I can automate my highly detailed approach to prospecting for PR campaigns (segmenting, personalizing, contact finding, verification, and assignment to a campaign) at scale, doing in a day what dedicated VAs would take a week to do.
GortKlaatu_@reddit
I think you're asking the wrong questions and comparing it to computer programs instead of people. Let's compare it to an off-shore call center employee. Is it slower? More expensive? Is it actually less reliable?
I'm also assuming you've implemented agents with plenty of guardrails that make a best effort at checking for hallucinations and validating results, and that you're using tool calling, etc.
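By guardrails I mean things like validating a tool call before anything executes; a rough sketch with the OpenAI tool-calling API (the lookup_order tool and its schema are illustrative, not from the OP):

```python
import json
from openai import OpenAI

client = OpenAI()
tools = [{
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": "Look up an order by its id",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Where is order 123?"}],
    tools=tools,
)
call = resp.choices[0].message.tool_calls[0]  # assume the model chose a tool
args = json.loads(call.function.arguments)
# Guardrail: validate the call before executing anything with side effects.
assert call.function.name == "lookup_order" and set(args) == {"order_id"}
```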
An AI agent will never autonomously book my vacation without my intervention, just as a travel agency won't either. You forget how many human processes still need the end user human in the loop.
A_for_Anonymous@reddit
Dave from technical support did want to get my credit card details so he could book my vacation for me.
DifficultNerve6992@reddit
There are some solutions that are production-ready and used in production; you can explore a specialized directory of AI agents and search different categories.
https://aiagentsdirectory.com/
IrisColt@reddit
Believing AI can handle complex tasks alone ignores the need for human judgment. Just my opinion, but automation thrives on collaboration, with AI supporting human decision-making.
strongoffense@reddit
Yes. Working really well at large scale in production. Almost all the examples I can think of are vertical specific though; haven't seen any work well across industries and functions.
Consistent_Walrus_23@reddit
Why are (multi-)agent setups needed at all? I get the humanistic motivation - specialize people to certain tasks - but is this really needed for LLMs?
Probably a stupid question but something I have always wondered. Ever since LLMs came up, multi-agent systems seem to be all the rage.
0xd00d@reddit
yeah we specialize humans to tasks and domains because it is practical to have specialists and experts curate knowledge on their areas and then manage them together to achieve great things.
It's a little less clear how this would even work with LLMs, since all their base knowledge is static, baked in during training, and any information in prompts is fully fluid (in contrast with how expertise is expressed in humans), since you have to put it in a suitable format for prompting anyway.
Combining different LLM models to extract value out of the union of their capabilities is plausible. But even that seems really inefficient; having them talk to each other rather than just using an MoE architecture... On the other hand, getting the "thoughts" "written down" is also a bit essential, so we can take a peek and see just how far things have gone off the rails.
Many issues probably boil down to the problem that the more LLM shenanigans you do, the more work you've created for yourself (LLM output that has to be evaluated by some, preferably automated, system for guidance purposes). One big problem: if you want an LLM to look at a bunch of LLMs talking to each other while trying to work toward some goal, there's a bunch of catch-22 business going on. If the agents actively participating are losing the plot, what says the supervisor LLM (even if it is more expensive and smarter, which it can't be, because that exponentiates your running cost) won't also lose the plot just trying to keep up with ingesting the transcript?
robogame_dev@reddit
No, you're correct; the issue is that people are thinking of "agents" as a human-like model.
Of course you need many customized LLM prompts with different temperature and other settings, with different tools, etc. But they've been bastardized into the "agent" metaphor when really they're just separate functions or processes. There's no reason to use "agents" in the sense of multiple LLMs, each with their own chat history of talking to other agents; people building with LLMs at scale treat each functional unit / LLM call as a separate component.
Darkstar197@reddit
Because building something enterprise-scale requires a ton of different connections to APIs and data structures during content retrieval. Having multiple agents is an essential part of a successful RAG flow, unless your use case is basic enough to be just a vector database query with some similarity search, adding all the results to the model context window (not ideal, for multiple reasons).
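For contrast, the "basic" case I mean looks roughly like this (a sketch; the embedding model and toy corpus are placeholders):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["Refunds take 5 days.", "Shipping is free over $50.", "We ship worldwide."]
doc_vecs = encoder.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q  # cosine similarity (vectors are normalized)
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

context = "\n".join(retrieve("how long do refunds take?"))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: how long do refunds take?"
# prompt then goes to the LLM; real enterprise flows need far more than this.
```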
NighthawkT42@reddit
What sort of scale? We currently have basically one agent with a set of tools, but several of the tools will soon be agents.
MoffKalast@reddit
And then the tools of those tools will also be agents one day? It's agents all the way down?
GortKlaatu_@reddit
Sometimes we need a layer of abstraction when too many tools become available and you need to call the butcher, the baker, or the candlestick maker depending on the task and set of tools.
rnosov@reddit
I'm not sure if you consider coding assistants to be true AI agents, but tools like Aider, Cline (Claude.dev), and cursor.ai are all spreading like wildfire. You just tell them "make me a game" and they will write all the code, install dependencies, and even launch it for you. Modifications are not a problem either.
Look at OpenRouter stats and it looks like they're starting to rival (E)RP. I've been binge-watching YouTube videos of people using them, and I think this is where things are heading with AI agents.
brokester@reddit
The code they generate is complete dog shit and not maintainable. The problem with generative AI is that testing it is nearly impossible because it's nondeterministic. Yes, you can validate outputs, but there are more factors at play, like domain knowledge, deployment, scalability, and communication between these areas.
Also, in enterprise applications you want stability, because if your service goes down, so will your company.
0xd00d@reddit
It's an overgeneralization... in many domains this can be the case.
I've had many situations where I hack together something fairly large and complex, get it working, send it in, and ask around, poking from different angles. Every single time, as long as you're not shooting in the dark and have something working and/or a test suite to validate correctness with, you can and will make big gains refactoring and improving code. These things can absolutely write good code, including designing entire systems at a high level. But if you don't give it enough of the right kinds of context (no prescriptive method exists, nor should one ever exist, to produce that), it may very well be as hopeless as asking the same of your friendly janitor down the hall.
ithanlara1@reddit
I agree, partially.
A good prompt with instructions and guidelines, where you start with a clear idea of what you want, will usually return good and usable code.
If you expect it to output something good out of the box, then no, it's not a good solution sadly.
But that's why we are still needed as developers right? We need to have the knowledge of what to use and how to make it scale, and we can get the AI to do the coding and documentation part!
Then you can think of the tests and have the AI write them as well!
rnosov@reddit
I wouldn't use it for anything critical yet, but for prototyping it works remarkably well. Also, the quality of code from the latest crop of coding models is getting much better; it used to be dog shit, but I wouldn't call it that now. Obviously, things like domain knowledge, scalability, deployment, etc. you have to handle manually, but don't underestimate the number of times you just have to add another field to a form. Basically, it frees up your time to give attention to more important areas. I could be wrong though.
dodiyeztr@reddit
cursor.ai does nothing. It's all hype.
It just copies your code into the context, that's it.
Any VSCode/IDEA extension can do that.
In fact, the GitHub Copilot chat extension does exactly that.
rnosov@reddit
I'm mostly using Aider, which I understand is similar to Cursor. Aider does a lot more than just copying to context. If you have several files, juggling them manually becomes non-trivial. Aider also does automatic diffing, which would be a major time sink if done manually. If you can recommend any extension that has all that functionality, I'm all ears; I've spent a lot of time looking for something like this and couldn't find anything similar (I have the continue.dev extension, but in my experience it's not as advanced as Aider).
dodiyeztr@reddit
still not an agent
0xd00d@reddit
Agreed. I use Aider and I wouldn't want it to be more autonomous than it currently is; it already tries to do a whole bunch of shit (running commands the LLM suggests) that I reject 98% of the time.
Automated commits and optimized token consumption via the use of diff blocks are indispensable, however.
Nyghtbynger@reddit
The closest one for vim is avante.nvim
Vivid_Dot_6405@reddit
Obviously. There are dozens of other solutions. It's just fancy prompt engineering.
Various-Operation550@reddit
Narrot.org kinda does that; we have a proprietary model and our own agent framework.
kohlerm@reddit
Totally autonomous, no. Keeping a human in the loop, yes.
Maleficent_Pair4920@reddit
What type of agent, and what do you mean by tools?
PizzaCatAm@reddit
Yeah, can’t disclose the details but it can scale.
TheDreamWoken@reddit
https://arxiv.org/abs/2305.16291
naughtybear23274@reddit
Not sure I'd consider a research project "at scale". And rather than linking the paper, you can look through it here: https://github.com/MineDojo/Voyager
mycolo_gist@reddit
Duolingo uses AI chat based characters at scale.
https://www.emergingtechbrew.com/stories/2024/09/25/duolingo-debuts-ai-video-chats
irvollo@reddit
That's not agents. It's just a language-themed c.ai.
blackkettle@reddit
We're doing nearly the same thing for B2B contact center agent onboarding: simulate customer profiles and situations from existing calls and let agents train on them. The biggest challenge is managing VAD (voice activity detection) and LLM response times to hit a natural interaction, including things like interruptions.
Enough-Meringue4745@reddit
Not sure I’d consider that agents