What's the best LLM Router right now, and why?
Posted by desexmachina@reddit | LocalLLaMA | View on Reddit | 42 comments
What's the best LLM router you've used at this point? I'll put some minor requirements down, but feel free to go outside these bounds.
- Routes to more than 2 models
- Routes to local LLM and API
- Maybe has a pre or post token ingestor that can summarize
- Not just a simple vector DB
achompas@reddit
u/desexmachina We've built [this list of routing resources](https://github.com/Not-Diamond/awesome-ai-model-routing) at Not Diamond. We've also built our own router - try it out [within our chatbot](https://chat.notdiamond.ai/), or learn more [from our docs](https://docs.notdiamond.ai/docs/what-is-not-diamond).
Happy to answer any other questions you might have about routing!
CalangoVelho@reddit
Ever tried LiteLLM proxy?
emprahsFury@reddit
Litellm is pretty good. They do ship breaking bugs every now and again, so I would just say pin a version, but otherwise works as intended.
Now if they would just ship a way to link comfyui to the /image/ endpoints
Comfortable_Dirt5590@reddit
Hi, I'm the maintainer of LiteLLM - what breaking bugs did you face? We're working on improving reliability.
shamsway@reddit
+1 for litellm. I use it frequently.
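The main draw for me is the unified interface: one completion() call whether the model is a hosted API or something local, so the "routing" part is mostly just swapping the model string. A minimal sketch of that (model names and the Ollama api_base are placeholders, not a recommendation):

```python
# Minimal sketch: LiteLLM exposes one completion() call for hosted APIs
# and local backends alike, so switching models is just a string swap.
# Model names and the Ollama api_base below are placeholders.
from litellm import completion

def ask(model: str, prompt: str, **kwargs) -> str:
    resp = completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )
    return resp.choices[0].message.content

# Hosted API model (expects OPENAI_API_KEY in the environment)
print(ask("gpt-4o-mini", "One-line summary of speculative decoding?"))

# Local model served by Ollama
print(ask("ollama/llama3", "Same question, answered locally.",
          api_base="http://localhost:11434"))
```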
Hotel_Nice@reddit
Have you tried Portkey?
https://github.com/Portkey-AI/gateway
Status-Shock-880@reddit
This one is the best for me
Scary-Knowledgable@reddit
This one is good for people with Parkinson's as it autocorrects - https://www.amazon.com/Shaper-Origin-Handheld-CNC-Router/dp/B0BVY6S4LK
Status-Shock-880@reddit
That is good, chatgpt doesn’t currently have parkinsons support, what are they thinking
No_Afternoon_4260@reddit
I prefer the wurth one
nas2k21@reddit
This guy routes
1ncehost@reddit
Can you explain what you mean by router? There's another, more commonly understood meaning than the one I think you're referring to.
desexmachina@reddit (OP)
You put in a prompt and it decides which LLM it gets fed into
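At its simplest it's a cheap classifier sitting in front of a dispatch table. A rough sketch of the idea (the classification rule and backends are made-up stand-ins, not any particular router's logic):

```python
# Rough sketch of a prompt router: a cheap classifier picks which
# backend the prompt gets fed into. The heuristic and backends are
# made-up stand-ins; real routers use embedding or LLM classifiers.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Backend:
    name: str
    call: Callable[[str], str]

BACKENDS = {
    "local": Backend("llama3-8b (local)", lambda p: f"[local model answers: {p[:40]}]"),
    "api":   Backend("gpt-4o (API)",      lambda p: f"[API model answers: {p[:40]}]"),
}

def classify(prompt: str) -> str:
    # Placeholder rule: long or code-heavy prompts go to the big API
    # model, everything else stays local.
    looks_like_code = any(k in prompt for k in ("def ", "class ", "import "))
    return "api" if looks_like_code or len(prompt.split()) > 200 else "local"

def route(prompt: str) -> str:
    backend = BACKENDS[classify(prompt)]
    print(f"routing to {backend.name}")
    return backend.call(prompt)

print(route("What's 2 + 2?"))
```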
nas2k21@reddit
Like an MoE model?
desexmachina@reddit (OP)
What’s MOE? There’s at least 5 routers out there now that are open source
nas2k21@reddit
Mother of experts is basically a model that contains a bunch of models. For simplicity, it may have GPT-2 and Llama 2, and it uses sentiment analysis/etc. to decide which model to give the prompt.
No_Afternoon_4260@reddit
Mixture of experts
nas2k21@reddit
Huh, not sure how I mixed the 2, doesn't really change my point tho that moe does exactly what op asked
desexmachina@reddit (OP)
Well I actually don’t think it would be practical as a single model. Better to route between specialized LLMs that will be good for what they’re trained on. Maybe even an aggregator LLM that can ingest the simultaneous output of several LLMs and summarize.
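The aggregator part is basically a fan-out followed by one summarizing call. A rough sketch with stub model calls (nothing here is a real API):

```python
# Sketch of the fan-out + aggregator pattern: send the prompt to a few
# specialist models in parallel, then have one model merge/summarize
# the answers. All model calls here are stand-in stubs.
from concurrent.futures import ThreadPoolExecutor

def code_model(prompt):    return f"[code specialist on: {prompt}]"
def math_model(prompt):    return f"[math specialist on: {prompt}]"
def general_model(prompt): return f"[generalist on: {prompt}]"

def aggregator(prompt, answers):
    # In practice this would be another LLM call that summarizes.
    bullets = "\n".join(f"- {a}" for a in answers)
    return f"Summary for '{prompt}':\n{bullets}"

def fan_out(prompt):
    specialists = [code_model, math_model, general_model]
    with ThreadPoolExecutor() as pool:
        answers = list(pool.map(lambda f: f(prompt), specialists))
    return aggregator(prompt, answers)

print(fan_out("Explain tail-call optimization"))
```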
Imaginary_Bench_7294@reddit
M.O.E., aka moe, stands for mixture of experts. One of the more infamous models out right now is Mixtral.
The way these models work isn't too dissimilar to what you've described. They consist primarily of two parts: a gating mechanism and a cluster of experts.
What happens is they essentially clone a small model until they have the desired number of experts. They introduce the gating mechanism, which can be an ultra-light classification LLM, and then train it all as one model.
The gating mechanism determines which models receive what training material during the training process, which ends up making certain ones specialize in that type of data. Hence the nomenclature "experts".
This means that at any given time, only a few of the models are actually in use. Each of the activated models contributes to the overall output. The gating mechanism also usually has an exposed variable that lets you determine how many of the experts are allowed to be active.
The only significant difference between how these models work and what you've described is that MoE models have to fully load all of their experts, since they've been trained as one cohesive unit, whereas what you've described would allow for selective model loading.
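To make the gating step concrete, here's a toy top-k gating sketch in numpy (arbitrary shapes and numbers; not Mixtral's actual implementation):

```python
# Toy illustration of the gating step described above: a learned gate
# scores every expert per token, only the top-k run, and their outputs
# are combined weighted by the gate scores. Shapes and numbers are
# arbitrary; this is not Mixtral's actual implementation.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

W_gate = rng.normal(size=(d_model, n_experts))            # gating weights
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x):                                          # x: (d_model,)
    logits = x @ W_gate                                    # score each expert
    chosen = np.argsort(logits)[-top_k:]                   # keep the top-k
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                               # softmax over chosen
    # Only the chosen experts do any computation for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

token = rng.normal(size=d_model)
print(moe_layer(token).shape)                              # (16,)
```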
No_Afternoon_4260@reddit
Mistral had a good blog post about their SMoE (sparse mixture of experts) if you want to go deeper. They also released a really good paper when they released the weights for Mixtral 8x7B.
No_Afternoon_4260@reddit
You have Kraken if you want to play with LoRAs. Is that what you want? https://huggingface.co/posts/DavidGF/885841437422630
gedw99@reddit
https://github.com/danielmiessler/fabric
Works with Ollama and provides a CLI and router.
It's basically a giant pipeline processor that lets you chain many LLMs together, so essentially a router.
Works great with NATS JetStream too.
DeltaSqueezer@reddit
what does this mean: Maybe has a pre or post token ingestor that can summarize?
ActualDW@reddit
So…you want a small LLM to feed bigger LLMs, basically…?
InterstellarReddit@reddit
I want LLMCeption. I want my smaller LLMS to plant a seed in a bigger LLM.
Zulfiqaar@reddit
This is kind of what happens in speculative decoding to accelerate inference
InterstellarReddit@reddit
And off I go, spending my night reading about something I never knew existed. Thank you.
Zulfiqaar@reddit
You're welcome! It's beyond my hardware to test, but I just read in another comment that if you have a decently sized GPU setup you can even use it to accelerate some of the larger open-weights models at home and get up to triple the tokens/sec.
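The rough shape of it, if it helps anyone: a small draft model guesses a few tokens ahead, the big model checks them all in a single pass, and you keep the longest agreeing prefix. A toy sketch with stub models (real implementations accept/reject probabilistically rather than by exact match):

```python
# Toy sketch of speculative decoding: a cheap draft model proposes k
# tokens, the big model scores those positions in one forward pass, and
# only the agreeing prefix is kept. Both "models" are stubs; real
# implementations accept/reject probabilistically, not by exact match.
def draft_model(context, k=4):
    # Pretend the small model guesses the next k tokens.
    return [f"tok{len(context) + i}" for i in range(k)]

def big_model_verify(context, proposed):
    # Pretend the big model agrees with the first three proposals and
    # then supplies its own next token.
    return proposed[:3], ["tokX"]

def generate(context, steps=3):
    for _ in range(steps):
        proposed = draft_model(context)
        agreed, correction = big_model_verify(context, proposed)
        # One big-model pass yields len(agreed) + 1 tokens instead of 1.
        context = context + agreed + correction
    return context

print(generate(["<bos>"]))
```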
InterstellarReddit@reddit
We get free AWS credits at work for learning, and nobody really uses them, so I practically have thousands of dollars every month to play around with my stupidity.
Zulfiqaar@reddit
Time to train some nice fine-tunes... machine learning is still learning!
nas2k21@reddit
Careful, next thing you know you got a bunch of little llms running around
_RouteThe_Switch@reddit
I'm guessing this is what op means.
aseichter2007@reddit
https://github.com/SomeOddCodeGuy/WilmerAI
Maybe you mean like this?
desexmachina@reddit (OP)
Yes, something like this
iwanttoseek@reddit
RouteLLM or you can create your own custom Agent that routes to the specific LLM based on the metadata.
desexmachina@reddit (OP)
That's a basic vector DB, isn't it?
fkrhvfpdbn4f0x@reddit
RASA Calm
https://github.com/aurelio-labs/semantic-router
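Roughly how that kind of router works: you define each route with a handful of example utterances, embed the incoming prompt, and pick the route whose examples it's closest to. A generic sketch of the idea (not the library's actual API; embed() here is a throwaway stub):

```python
# Generic sketch of utterance-based semantic routing (not the
# semantic-router library's actual API): each route is defined by a few
# example utterances, the incoming prompt is embedded, and the route
# whose examples are most similar wins. embed() is a throwaway stub.
import math

def embed(text):
    # Stub embedding: bag-of-letters. A real router would use a
    # sentence-embedding model here.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

ROUTES = {
    "code-model": ["write a python function", "fix this bug", "refactor my code"],
    "chat-model": ["tell me a joke", "what's the weather like", "let's just chat"],
}

def route(prompt):
    q = embed(prompt)
    scores = {name: max(cosine(q, embed(u)) for u in examples)
              for name, examples in ROUTES.items()}
    return max(scores, key=scores.get)

print(route("can you debug this python snippet"))
```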
Aurelio_Aguirre@reddit
Could someone explain to me how number 2 works exactly? What's the relationship between the "utterances" and what the user prompts?
These_Lavishness_903@reddit
Most
Strong-Strike2001@reddit
OpenRouter? Be clearer in your question
Unhappy-Day5677@reddit
The only one I'm aware of is big-AGI. It's worked well thus far.