What's the best LLM Router right now, and why?
Posted by desexmachina@reddit | LocalLLaMA | View on Reddit | 42 comments
What's the best LLM router you've used at this point? I'll put some minor requirements down, but feel free to go outside these bounds.
- Routes to more than 2 models
- Routes to local LLM and API
- Maybe has a pre or post token ingestor that can summarize
- Not just a simple vector DB
achompas@reddit
u/desexmachina We've built [this list of routing resources](https://github.com/Not-Diamond/awesome-ai-model-routing) at Not Diamond. We've also built our own router - try it out [within our chatbot](https://chat.notdiamond.ai/), or learn more [from our docs](https://docs.notdiamond.ai/docs/what-is-not-diamond).
Happy to answer any other questions you might have about routing!
CalangoVelho@reddit
Ever tried LiteLLM proxy?
emprahsFury@reddit
Litellm is pretty good. They do ship breaking bugs every now and again, so I would just say pin a version, but otherwise works as intended.
Now if they would just ship a way to link comfyui to the /image/ endpoints
Comfortable_Dirt5590@reddit
Hi, I'm the maintainer of LiteLLM - what breaking bugs did you face? We're working on improving reliability.
shamsway@reddit
+1 for litellm. I use it frequently.
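The main draw for me is the unified interface: one completion() call whether the model is a hosted API or something local, so the "routing" part is mostly just swapping the model string. A minimal sketch of that (model names and the Ollama api_base are placeholders, not a recommendation):

```python
# Minimal sketch: LiteLLM exposes one completion() call for hosted APIs
# and local backends alike, so switching models is just a string swap.
# Model names and the Ollama api_base below are placeholders.
from litellm import completion

def ask(model: str, prompt: str, **kwargs) -> str:
    resp = completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )
    return resp.choices[0].message.content

# Hosted API model (expects OPENAI_API_KEY in the environment)
print(ask("gpt-4o-mini", "One-line summary of speculative decoding?"))

# Local model served by Ollama
print(ask("ollama/llama3", "Same question, answered locally.",
          api_base="http://localhost:11434"))
```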
Hotel_Nice@reddit
Have you tried Portkey?
https://github.com/Portkey-AI/gateway
Status-Shock-880@reddit
This one is the best for me
Scary-Knowledgable@reddit
This one is good for people with Parkinson's as it autocorrects - https://www.amazon.com/Shaper-Origin-Handheld-CNC-Router/dp/B0BVY6S4LK
Status-Shock-880@reddit
That is good, chatgpt doesn’t currently have parkinsons support, what are they thinking
No_Afternoon_4260@reddit
I prefer the wurth one
nas2k21@reddit
This guy routes
1ncehost@reddit
Can you explain what you mean by router? There's another, more commonly understood meaning than the one I think you're referring to.
desexmachina@reddit (OP)
You put in a prompt and it decides which LLM it gets fed into
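At its simplest it's a cheap classifier sitting in front of a dispatch table. A rough sketch of the idea (the classification rule and backends are made-up stand-ins, not any particular router's logic):

```python
# Rough sketch of a prompt router: a cheap classifier picks which
# backend the prompt gets fed into. The heuristic and backends are
# made-up stand-ins; real routers use embedding or LLM classifiers.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Backend:
    name: str
    call: Callable[[str], str]

BACKENDS = {
    "local": Backend("llama3-8b (local)", lambda p: f"[local model answers: {p[:40]}]"),
    "api":   Backend("gpt-4o (API)",      lambda p: f"[API model answers: {p[:40]}]"),
}

def classify(prompt: str) -> str:
    # Placeholder rule: long or code-heavy prompts go to the big API
    # model, everything else stays local.
    looks_like_code = any(k in prompt for k in ("def ", "class ", "import "))
    return "api" if looks_like_code or len(prompt.split()) > 200 else "local"

def route(prompt: str) -> str:
    backend = BACKENDS[classify(prompt)]
    print(f"routing to {backend.name}")
    return backend.call(prompt)

print(route("What's 2 + 2?"))
```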
nas2k21@reddit
Like an MoE model?
desexmachina@reddit (OP)
What’s MOE? There’s at least 5 routers out there now that are open source
nas2k21@reddit
Mother of experts is basically a model that contains a bunch of models. For simplicity, it may have GPT-2 and Llama 2, and it uses sentiment analysis/etc. to decide which model to give the prompt.
No_Afternoon_4260@reddit
Mixture of experts
nas2k21@reddit
Huh, not sure how I mixed the 2, doesn't really change my point tho that moe does exactly what op asked
desexmachina@reddit (OP)
Well I actually don’t think it would be practical as a single model. Better to route between specialized LLMs that will be good for what they’re trained on. Maybe even an aggregator LLM that can ingest the simultaneous output of several LLMs and summarize.
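The aggregator part is basically a fan-out followed by one summarizing call. A rough sketch with stub model calls (nothing here is a real API):

```python
# Sketch of the fan-out + aggregator pattern: send the prompt to a few
# specialist models in parallel, then have one model merge/summarize
# the answers. All model calls here are stand-in stubs.
from concurrent.futures import ThreadPoolExecutor

def code_model(prompt):    return f"[code specialist on: {prompt}]"
def math_model(prompt):    return f"[math specialist on: {prompt}]"
def general_model(prompt): return f"[generalist on: {prompt}]"

def aggregator(prompt, answers):
    # In practice this would be another LLM call that summarizes.
    bullets = "\n".join(f"- {a}" for a in answers)
    return f"Summary for '{prompt}':\n{bullets}"

def fan_out(prompt):
    specialists = [code_model, math_model, general_model]
    with ThreadPoolExecutor() as pool:
        answers = list(pool.map(lambda f: f(prompt), specialists))
    return aggregator(prompt, answers)

print(fan_out("Explain tail-call optimization"))
```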
Imaginary_Bench_7294@reddit
M.O.E., aka moe, stands for mixture of experts. One of the more infamous models out right now is Mixtral.
The way these models work isn't too dissimilar to what you've described. They consist primarily of two parts: a gating mechanism and a cluster of experts.
What happens is they essentially clone a small model until they have the desired number of experts. They introduce the gating mechanism, which can be an ultra-light classification LLM, and then train it all as one model.
The gating mechanism determines which models receive what training material during the training process, which ends up making certain ones specialize in that type of data. Hence the nomenclature "experts".
This means that at any given time, only a few of the models are actually in use. Each of the activated models contributes to the overall output. The gating mechanism also usually has an exposed variable that lets you determine how many of the experts are allowed to be active.
The only significant difference between how these models work and what you've described is that MoE models have to fully load all of their experts, since they've been trained as one cohesive unit, whereas what you've described would allow for selective model loading.
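To make the gating step concrete, here's a toy top-k gating sketch in numpy (arbitrary shapes and numbers; not Mixtral's actual implementation):

```python
# Toy illustration of the gating step described above: a learned gate
# scores every expert per token, only the top-k run, and their outputs
# are combined weighted by the gate scores. Shapes and numbers are
# arbitrary; this is not Mixtral's actual implementation.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

W_gate = rng.normal(size=(d_model, n_experts))            # gating weights
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x):                                          # x: (d_model,)
    logits = x @ W_gate                                    # score each expert
    chosen = np.argsort(logits)[-top_k:]                   # keep the top-k
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                               # softmax over chosen
    # Only the chosen experts do any computation for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

token = rng.normal(size=d_model)
print(moe_layer(token).shape)                              # (16,)
```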
No_Afternoon_4260@reddit
Mistral had a good blog post about their SMoE (sparse mixture of experts) if you want to go deeper. They also released a really good paper when they released the weights for Mixtral 8x7B.
No_Afternoon_4260@reddit
You have Kraken if you want to play with LoRAs. Is that what you want? https://huggingface.co/posts/DavidGF/885841437422630
gedw99@reddit
https://github.com/danielmiessler/fabric
Works with Ollama and provides a CLI and router.
It's basically a giant pipeline processor that lets you chain many LLMs together, so essentially a router.
Works great with NATS JetStream too.
DeltaSqueezer@reddit
what does this mean: Maybe has a pre or post token ingestor that can summarize?
ActualDW@reddit
So…you want a small LLM to feed bigger LLMs, basically…?
InterstellarReddit@reddit
I want LLMCeption. I want my smaller LLMS to plant a seed in a bigger LLM.
Zulfiqaar@reddit
This is kind of what happens in speculative decoding to accelerate inference
InterstellarReddit@reddit
And off I go, spending my night reading about something I never knew existed. Thank you.
Zulfiqaar@reddit
You're welcome! It's beyond my hardware to test, but I just read in another comment that if you have a decently sized GPU setup you can even use it to accelerate some of the larger open-weights models at home and get up to triple the tokens/sec.
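The rough shape of it, if it helps anyone: a small draft model guesses a few tokens ahead, the big model checks them all in a single pass, and you keep the longest agreeing prefix. A toy sketch with stub models (real implementations accept/reject probabilistically rather than by exact match):

```python
# Toy sketch of speculative decoding: a cheap draft model proposes k
# tokens, the big model scores those positions in one forward pass, and
# only the agreeing prefix is kept. Both "models" are stubs; real
# implementations accept/reject probabilistically, not by exact match.
def draft_model(context, k=4):
    # Pretend the small model guesses the next k tokens.
    return [f"tok{len(context) + i}" for i in range(k)]

def big_model_verify(context, proposed):
    # Pretend the big model agrees with the first three proposals and
    # then supplies its own next token.
    return proposed[:3], ["tokX"]

def generate(context, steps=3):
    for _ in range(steps):
        proposed = draft_model(context)
        agreed, correction = big_model_verify(context, proposed)
        # One big-model pass yields len(agreed) + 1 tokens instead of 1.
        context = context + agreed + correction
    return context

print(generate(["<bos>"]))
```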
InterstellarReddit@reddit
We get free AWS credits at work for learning, and nobody really uses them, so I practically have thousands of dollars every month to play around with my stupidity.
Zulfiqaar@reddit
Time to train some nice fine-tunes... machine learning is still learning!
nas2k21@reddit
Careful, next thing you know you got a bunch of little llms running around
_RouteThe_Switch@reddit
I'm guessing this is what op means.
aseichter2007@reddit
https://github.com/SomeOddCodeGuy/WilmerAI
Maybe you mean like this?
desexmachina@reddit (OP)
Yes, something like this
iwanttoseek@reddit
RouteLLM or you can create your own custom Agent that routes to the specific LLM based on the metadata.
desexmachina@reddit (OP)
That's a basic vector DB, isn't it?
fkrhvfpdbn4f0x@reddit
RASA Calm
https://github.com/aurelio-labs/semantic-router
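Roughly how that kind of router works: you define each route with a handful of example utterances, embed the incoming prompt, and pick the route whose examples it's closest to. A generic sketch of the idea (not the library's actual API; embed() here is a throwaway stub):

```python
# Generic sketch of utterance-based semantic routing (not the
# semantic-router library's actual API): each route is defined by a few
# example utterances, the incoming prompt is embedded, and the route
# whose examples are most similar wins. embed() is a throwaway stub.
import math

def embed(text):
    # Stub embedding: bag-of-letters. A real router would use a
    # sentence-embedding model here.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

ROUTES = {
    "code-model": ["write a python function", "fix this bug", "refactor my code"],
    "chat-model": ["tell me a joke", "what's the weather like", "let's just chat"],
}

def route(prompt):
    q = embed(prompt)
    scores = {name: max(cosine(q, embed(u)) for u in examples)
              for name, examples in ROUTES.items()}
    return max(scores, key=scores.get)

print(route("can you debug this python snippet"))
```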
Aurelio_Aguirre@reddit
Could someone explain to me how number 2 works exactly? What's the relationship between the "utterances" and what the user prompts?
These_Lavishness_903@reddit
Most
Strong-Strike2001@reddit
OpenRouter? Be clearer in your question
Unhappy-Day5677@reddit
The only one I'm aware of is big-AGI. It's worked well thus far.