Building a local RAG server

Posted by autonom1a@reddit | LocalLLaMA | 15 comments

Hi. Corporate wants me to build a local RAG server. Expect 50-100 concurrent interactions with the model a few times a day in the first stage, and 100-1000 once it is deployed to production.

I want to understand the hardware stack and what it costs, ideally with a few options.

Halp.

huzbum@reddit

What does your current stack look like? You might be able to integrate existing technologies like Elasticsearch or Postgres.

Are you using any cloud services like AWS, or do you plan to put physical hardware on site? Do you already have hardware on site? What are the uptime requirements? There is a big difference between “it would be nice if this thing was always working” and guaranteed five nines (99.999% uptime).

autonom1a@reddit (OP)

No stack, I have to build it from the ground up. There is no hardware dedicated to this task at the moment, and it has to be hosted in the office, not off-site. Uptime is 24/7.

huzbum@reddit

Ok, but is this going to be a service for people in the office or something with subscriptions and a service level agreement?

There is a big difference between “it’s always on and probably never goes down” and five nines (99.999% uptime). It is like the difference between a desktop in the corner that probably stays up for a year straight and server racks with redundant power supplies, UPS, backup generators, redundant network connections, etc.
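To put a number on that gap, here is a rough sketch of the yearly downtime budget for a few availability targets. The function name and the list of targets are just for illustration, and it ignores leap years.

```python
# Allowed downtime per year for a given availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Return the yearly downtime budget in minutes for an availability fraction (e.g. 0.99999)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Illustrative targets only; none of these come from a real SLA.
    for label, availability in [
        ("two nines", 0.99),
        ("three nines", 0.999),
        ("five nines", 0.99999),
    ]:
        print(f"{label}: {downtime_minutes_per_year(availability):.1f} min/year")
```

Five nines works out to about 5.3 minutes of downtime per year, while two nines allows roughly 87.6 hours, which is why the hardware bill looks so different.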

autonom1a@reddit (OP)

Yes, it has to be always on and ready to respond 24/7 under a strict SLA: immediate responses and the capacity to handle usage surges.

huzbum@reddit

Also, does “100 concurrent users” mean 100 people hit enter at the same moment, or that there are 100 people with access who will spread their use over the day?

And how smart does it need to be? Any idea what model you might want to run?
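The reason the concurrency question matters can be sketched with Little's law (average requests in flight = arrival rate × average response time). All the numbers below are made up for illustration, not taken from this thread, and the function name is hypothetical.

```python
def average_in_flight(requests_per_day: int, active_hours: float, seconds_per_request: float) -> float:
    """Little's law: mean requests in flight = arrival rate * mean time per request."""
    arrival_rate = requests_per_day / (active_hours * 3600.0)  # requests per second
    return arrival_rate * seconds_per_request


if __name__ == "__main__":
    # Assumed workload: 100 users, 20 requests each, spread over an 8-hour day,
    # with each request taking about 10 seconds end to end.
    spread_out = average_in_flight(requests_per_day=100 * 20, active_hours=8, seconds_per_request=10)
    print(f"spread over the day: {spread_out:.2f} requests in flight on average")
    # If all 100 users hit enter at the same moment, the server sees 100 at once.
    print("simultaneous burst: 100 requests in flight")
```

Under those assumptions the spread-out case averages under one request in flight, while the burst case needs capacity for 100, so the two readings of "concurrent" lead to very different hardware.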

autonom1a@reddit (OP)

Mostly spread throughout the day, but sometimes many of them will send requests almost simultaneously.

kantydir@reddit

What model(s) do you have in mind? That many concurrent requests will probably require running the model in data-parallel mode or balancing between several servers if you want a decent interactive user experience.
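As a rough idea of what "balancing between several servers" could look like, here is a minimal round-robin sketch. It assumes each backend is an identical replica holding a full copy of the model; the class name and the backend URLs are hypothetical, and a real deployment would more likely use an existing reverse proxy or the serving framework's own load balancing.

```python
from itertools import cycle
from typing import Iterable


class RoundRobinBalancer:
    """Hand out backends in a fixed rotation so requests spread evenly across replicas."""

    def __init__(self, backends: Iterable[str]) -> None:
        backend_list = list(backends)
        if not backend_list:
            raise ValueError("need at least one backend")
        self._rotation = cycle(backend_list)

    def pick(self) -> str:
        """Return the next backend in the rotation."""
        return next(self._rotation)


if __name__ == "__main__":
    # Hypothetical replica addresses, for illustration only.
    balancer = RoundRobinBalancer([
        "http://gpu-node-1:8000",
        "http://gpu-node-2:8000",
    ])
    for request_id in range(4):
        print(request_id, "->", balancer.pick())
```

Round-robin ignores how busy each replica is, so long generations can pile up on one node; picking the replica with the fewest in-flight requests is a common refinement.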
