Semantic routing and caching don't work - task-specific LLMs (TLMs) FTW!

Posted by AdditionalWeb107@reddit | LocalLLaMA

If you are building a caching layer for LLMs, or a router that sends certain queries to specific LLMs/agents, know that semantic caching and routing are a broken approach. Here is why.

What can you do instead? You are far better off using an LLM and instructing it to predict the scenario for you (e.g. "here is a user query - does it overlap with this list of recent queries?"), or building a very small, highly capable TLM (task-specific LLM).
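A rough sketch of that idea, under my own assumptions (not the post author's code): instead of thresholding embedding distance, ask a model to judge whether a new query is already answered by a recent one. `ask_llm` is a hypothetical stub standing in for whatever LLM/TLM call you use.

```python
# Recent query -> cached answer (toy cache).
RECENT = {"how do I cancel my subscription": "Go to Settings > Billing > Cancel."}

def ask_llm(prompt: str) -> str:
    # Hypothetical stub: a real implementation would call an LLM or a small
    # task-specific model. Here we fake a strict same-intent judgment so the
    # sketch runs standalone.
    new_q = prompt.rsplit("NEW: ", 1)[1]
    return "MATCH" if new_q in RECENT else "NO_MATCH"

def cached_answer(query: str):
    """Return a cached answer only if the judge model says intents match."""
    prompt = (
        "Does the NEW query ask for the same thing as any CACHED query? "
        "Reply MATCH or NO_MATCH.\n"
        f"CACHED: {list(RECENT)}\nNEW: {query}"
    )
    if ask_llm(prompt) == "MATCH":
        return RECENT.get(query)  # cache hit
    return None  # miss: route the query to the full model

print(cached_answer("how do I cancel my subscription"))  # hit
print(cached_answer("how do I renew my subscription"))   # miss -> None
```

The point of the design is that the hit/miss decision is a reasoning step over intent, not a distance threshold, so near-duplicate wording with a different intent falls through to the full model instead of getting a stale answer.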

For agent routing and hand-off, I've written a guide on how to do this with my open-source project on GH. If you want to learn about my approach, drop me a comment.