If you were to build a new LLM API gateway today, which interface would you standardize on?

Posted by dmpiergiacomo@reddit | LocalLLaMA | View on Reddit | 21 comments

Same as the tile: if you were to build a new LLM API gateway today, which interface would you standardize on among these ones?

OpenAI Chat Completions (old standard)
OpenAI Responses (the new one)
Anthropic Messages
Gemini generateContent (current)
Gemini Interactions (beta)

I'm less familiar with OSS models and the API interface typically used (although I expect it to be the legacy Chat Completion), so open to new interfaces too.

And no, I'm not building a new gateway (there are enough companies already doing this), I'm just unhappy with the existing solutions.

[-]

fasti-au@reddit

Just do all it’s just a translation

Ollama has OpenAI and ollama and you can see anthropic. Just offer all and fastmcp proxy them all to one of whatever you want.

[-]

JockY@reddit

What’s wonderful about the current state of play is that for the most part I don’t need to care any more. Everything just works. Curious what you’re hitting that’s causing frustration.

[-]

amberdrake@reddit

Agreed. Those are details I don’t have to care about.

[-]

Oh? Everything? Or just the one inference engine you use works with the couple things you use - and they all use openai compatible? By any chance have you used vLLM with Codex CLI and responses endpoint?

[-]

dmpiergiacomo@reddit (OP)

I'm building infra and need granular control over all the parameters passed to each model. Ideally I can leverage the union of the functionalities (rather than the common denominator among them). I feel like gateways get limiting, but maybe I just don't know the right one.

[-]

sahanpk@reddit

I’d use chat completions as the boring base, then add a real /capabilities endpoint. pretending every provider has the same knobs is where gateways get painful.

[-]

my_name_isnt_clever@reddit

I have had no desire to use anything but classic openai chat completions, it's simple and it does the job, and most tools support it natively.

[-]

MoneyPowerNexis@reddit

This but make it so you can always take the response object (or merged stream object) and put the response message part onto the message history array without the need for sanitizing the data. Either dont put extra keys in the response message part that would break the API when fed back in (squints angrily at vllm) or be forgiving enough to allow your own output as input. /rant

[-]

CautiousStudent6919@reddit

Yeah theres only one real thing missing.... GET /models needs more metadata as part of the official spec.. context window, output max, cost etc etc.

[-]

Wix86@reddit

100% for "cost", it's really pain to monitor and update.

[-]

lupodevelop@reddit

Most enterprise gateways (like LiteLLM) focus heavily on routing, fallback, and basic key management. But if you are building one today, the real diff is local semantic caching and orchestration closer to the data.

[-]

dmpiergiacomo@reddit (OP)

I don't need any of that, as I handle it myself already. I guess I'm just trying to pick the best interface. Which would you pick?

[-]

epicfilemcnulty@reddit

I'm working on my own harness, and initially I was trying to actually implement LLM gateway functionality right in the harness, but fortunately quickly realized that it does not belong in the coding agent. Was using LiteLLM but I'm not really a fan of python packages with lots of deps, then I've stumbled upon GoModel, been using it as LiteLLM replacement for a while now, I really like it.

[-]

Wix86@reddit

Thanks for the link, I didn't know about this repo.

[-]

lupodevelop@reddit

Stick to the legacy OpenAI format as the core interface, and maybe just support JSON mode and/or Structured Outputs as a first-class citizen..

[-]

DeProgrammer99@reddit

Like I've said before, what we really need is a standard /capabilities API that indicates what features are available, because there's a lot of variables... like llama-server's /slots can tell you how many parallel requests are allowed, some APIs don't support streaming responses, some setups support speculative decoding but it may not be toggleable without restarting the inference software, some providers support GBNF grammars while others only support JSON mode or no constrained decoding, and so on.

[-]

dmpiergiacomo@reddit (OP)

Yeah I've spotted the same issues, and more. It's honestly a mess at the moment. I wish there was a good standard. I've seen plenty of initiatives, but nothing is really picking up.

[-]

MaxKruse96@reddit

chat completion for sure. asking servers to handle my context can turn out terribly, at least i have the fantasy that if i manage it myself, they wont mess with it before it hits the LLM

[-]

dmpiergiacomo@reddit (OP)

I guess you're referring to the new stateful APIs. Yeah I agree.

Would you still pick chat completion over the anthropic messages?

[-]

MaxKruse96@reddit

Yes. Anthropic is unable to produce good standards.