Reliable Open Source LLM as a Service
Posted by pravictor@reddit | LocalLLaMA | 9 comments
Has anyone figured out a provider whose open source models (Kimi, Qwen, GLM, etc.) can be used reliably in production?
I have tested some well-known providers and they all suffer from high latency and poor uptime, rendering them mostly useless for production use.
I am using them for an agentic workflow in production, so reliability and low latency are very important for me.
Is there no provider that compares to Gemini / Claude in reliability but with open source models?
So far I've tested Together.ai and Fireworks, and Groq looks like it is dying.
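For concreteness, this is the kind of client-side fallback I end up writing around the latency/uptime problem; a rough sketch where every base URL, model name, key, and timeout is a placeholder, not a recommendation:

```python
# Rough sketch: per-request fallback across OpenAI-compatible providers.
# All base_url / model / key / timeout values below are placeholders --
# swap in whichever providers you actually use.
from openai import OpenAI

PROVIDERS = [
    {"base_url": "https://api.fireworks.ai/inference/v1",
     "model": "accounts/fireworks/models/qwen2p5-72b-instruct"},
    {"base_url": "https://api.together.xyz/v1",
     "model": "Qwen/Qwen2.5-72B-Instruct-Turbo"},
]

def chat(messages, timeout=20):
    last_err = None
    for p in PROVIDERS:
        client = OpenAI(base_url=p["base_url"], api_key="YOUR_KEY", timeout=timeout)
        try:
            resp = client.chat.completions.create(model=p["model"], messages=messages)
            return resp.choices[0].message.content
        except Exception as err:  # timeouts, 5xx, rate limits, ...
            last_err = err        # try the next provider in the list
    raise RuntimeError(f"all providers failed: {last_err}")

print(chat([{"role": "user", "content": "ok?"}]))
```

It works, but it's extra machinery I'd rather not maintain, hence the question.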
tecneeq@reddit
You stepped into the wrong neighborhood, cloud-kid. Around here it's all about local llamas.
Ill_Fun5415@reddit
One useful evaluation is to separate model quality from workflow quality. The model may be strong, but the surrounding context, review loop, and failure recovery usually decide whether it is usable day to day.
Formal-Exam-8767@reddit
On-prem, depending on your budget.
RandumbRedditor1000@reddit
Your pc
jikilan_@reddit
Can also be my pc
FriskyFennecFox@reddit
Novita is often praised for its zero data retention policy and is practically a veteran; it appeared roughly at the same time Together did. If you're looking for something backed by the big guys, then there's Cloudflare.
I recommend just opening OpenRouter, finding a popular model, and researching every available provider individually. Each has its pros and cons.
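If you do route through OpenRouter, a minimal sketch of pinning which upstream providers may serve a request; the model slug and provider names are placeholders, and the `provider` routing block is an OpenRouter-specific extension, so verify the exact fields against their docs:

```python
# Minimal sketch: query one open-weight model via OpenRouter while
# restricting which upstream providers can serve it. The "provider"
# routing object is an OpenRouter extension; field names and provider
# labels here are assumptions to check against their documentation.
import os
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "qwen/qwen-2.5-72b-instruct",   # placeholder model slug
        "messages": [{"role": "user", "content": "Say hello in one word."}],
        "provider": {
            "order": ["Fireworks", "Together"],  # try these providers first
            "allow_fallbacks": True,             # let others serve it if both fail
        },
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```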
ttkciar@reddit
If you want reliability, you really should host on-premises.
All commercial inference providers change models (or their quantization), token caps, and price tiers without forewarning, which makes them intrinsically unreliable.
Hosting on-premises is more expensive, but in addition to reliability it provides advantages like privacy, transparency, control, and future-proofing, so it's a trade-off.
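A minimal sketch of what that looks like in practice, assuming a vLLM instance on your own hardware exposing an OpenAI-compatible endpoint; the model name, host, and port are illustrative:

```python
# Minimal sketch, assuming an on-prem vLLM server started with something like:
#   vllm serve Qwen/Qwen2.5-72B-Instruct
# which exposes an OpenAI-compatible API at http://localhost:8000/v1.
# Model name, host, and port are illustrative; adjust to your deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # local server, no third-party dependency
    api_key="not-needed",                 # vLLM doesn't require a real key by default
)

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct",
    messages=[{"role": "user", "content": "Ping?"}],
)
print(resp.choices[0].message.content)
```

Your agent code stays the same; only the base URL changes, and nobody swaps the model or quantization out from under you.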
1beb@reddit
Opencode (Go, Zen) seems pretty stable so far, and they serve most of the frontier open source models.
FoxiPanda@reddit
/r/LocalLLaMA recommends consulting your local GPU server for reliable hosting of LLMs.