Semantic routing and caching don't work - task-specific LLMs (TLMs) ftw!
Posted by AdditionalWeb107@reddit | LocalLLaMA | 9 comments
If you are building caching for LLM responses, or a router that hands certain queries off to specific LLMs/agents, know that semantic caching and routing is a broken approach. Here is why.
- Follow-ups or Elliptical Queries: Same issue as embeddings — "And Boston?" doesn't carry meaning on its own. Clustering will likely put it in a generic or wrong cluster unless context is encoded.
- Semantic Drift and Negation: Clustering can't capture logical distinctions like negation, sarcasm, or intent reversal. "I don't want a refund" may fall in the same cluster as "I want a refund" (see the similarity sketch after this list).
- Unseen or Low-Frequency Queries: Sparse or emerging intents won’t form tight clusters. Outliers may get dropped or grouped incorrectly, leading to intent “blind spots.”
- Over-clustering / Under-clustering: Setting the right number of clusters is non-trivial. Fine-grained intents often end up merged unless you do manual tuning or post-labeling.
- Short Utterances: Queries like “cancel,” “report,” “yes” often land in huge ambiguous clusters. Clustering lacks precision for atomic expressions.
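To make the negation bullet concrete, here is a minimal sketch, assuming the sentence-transformers package and the all-MiniLM-L6-v2 checkpoint (any off-the-shelf embedder shows a similar effect):

```python
# Sketch of the negation failure mode. Assumes sentence-transformers is
# installed; the model choice is illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
a = model.encode("I want a refund", convert_to_tensor=True)
b = model.encode("I don't want a refund", convert_to_tensor=True)

# Cosine similarity is typically high here: a cache or router keyed on
# embedding distance would treat these opposite intents as near-duplicates.
print(util.cos_sim(a, b).item())
```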
What can you do instead? You are far better off using an LLM and instructing it to predict the scenario for you (e.g. "here is a user query; does it overlap with this list of recent queries?"), or building a very small, highly capable TLM (task-specific LLM).
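A hedged sketch of the first option, using the OpenAI client as a stand-in; the model name, prompt, and JSON shape are placeholders, not OP's actual implementation:

```python
# Ask an LLM to judge overlap with recent queries instead of relying on
# embedding distance. All names here are illustrative.
import json
from openai import OpenAI

client = OpenAI()

def judge(query: str, recent: list[str]) -> dict:
    prompt = (
        "Recent queries:\n" + "\n".join(f"- {q}" for q in recent) + "\n\n"
        f"New query: {query}\n"
        "Does the new query restate one of the recent queries (cache hit), "
        "or is it a new intent that should be routed?\n"
        'Reply as JSON: {"cache_hit": true/false, "matched_query": "...", "intent": "..."}'
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; OP's point is a small task-specific LM
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

# The elliptical follow-up resolves because the model sees the history.
print(judge("And Boston?", ["What's the weather in NYC?"]))
```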
For agent routing and hand-off, I've built a guide on how to do this via my open-source project on GH. If you want to learn about my approach, drop me a comment.
UnreasonableEconomy@reddit
but anon, many embedding models ARE llms.
You can even prompt some embedding models (post-query contextualization).
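For example, a sketch assuming sentence-transformers and the intfloat/e5-base-v2 prefix convention; instruction-tuned embedders (GTE, Instructor, etc.) take a task string similarly:

```python
# The "query:" / "passage:" prefixes condition the embedding on its role,
# which is a crude form of prompting the embedding model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/e5-base-v2")

q = model.encode("query: And Boston?", convert_to_tensor=True)
p = model.encode("passage: Weekend weather forecast for Boston", convert_to_tensor=True)
print(util.cos_sim(q, p).item())
```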
One issue is that the embedding world kinda fell asleep about a year ago, and there doesn't seem to be much interest in (or understanding of) what these models can or cannot do.
VLM embedding models, 30B/70B dense embedding models, or LLMs that can be embedding-sampled (like the davinci embeddings) are absolutely what we need.
It's also a non-issue (or rather, it's solvable with occlusion).
Imagine you're making a computer game. Like in Unreal or something.
You can look at your embedding vector as a camera coordinate, and think of your embedding results as an n+1-dimensional fisheye perspective of your hypersphere geometry. The problem with clustering is that you're thinking like an n+1 entity.
You need to shave off that extra dimension and "render" your "view" in something that is logical in n dimensions. One way to do that is by creating the equivalent of a z-buffer/occlusion map. Your LLM can then look around in that semantic space and decide if there's anything worthwhile, but there's no sense in giving it a tomographic view of its world (signal to noise ratio).
I know this is kinda abstract, hope this makes some sort of sense. There was a post about this on the OpenAI forums a while ago.
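One concrete, heavily interpreted reading of the z-buffer idea, close in spirit to maximal-marginal-relevance filtering; every name in this sketch is illustrative, not from that forum post:

```python
# Keep the nearest point along each bearing from the "camera" (the query
# embedding) and let it occlude anything behind it in a similar direction.
import numpy as np

def occlusion_filter(query: np.ndarray, points: np.ndarray, cone: float = 0.9) -> list[int]:
    """Return indices of points not hidden behind a nearer point."""
    offsets = points - query                      # bearings from the camera
    dist = np.linalg.norm(offsets, axis=1)
    dirs = offsets / dist[:, None]
    kept: list[int] = []
    for i in np.argsort(dist):                    # nearest first, like a z-buffer
        # Occluded if a nearer kept point lies within the same angular cone.
        if all(dirs[i] @ dirs[j] < cone for j in kept):
            kept.append(int(i))
    return kept

# Toy usage: the farther of two nearly collinear points gets dropped.
rng = np.random.default_rng(0)
print(occlusion_filter(np.zeros(8), rng.normal(size=(20, 8))))
```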
Not_your_guy_buddy42@reddit
OP's post and your comment are making me want to play with a tree of classifiers and coreference resolution.
Accomplished_Mode170@reddit
Same but can y’all clarify the objection for myself and posterity?
I.e., if we have heuristics validated stepwise across an event chain, we're testing for values in dimensions absent from the model's actual latent space.
Not_your_guy_buddy42@reddit
Yes, semantic-only misses context-dependent meaning. A tree can make routing decisions hierarchically - boring old tree, just semantic. E.g. 'I want a refund' vs. 'I don't want a refund': 1. identify the "refund" part, 2. check negation (dot-product similarity might be enough), 3. ???, 4. profit. Stepwise classification builds up understanding incrementally. Just clustering makes no sense; the post is a socials strawman lol, but I still dig OP's project
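A minimal sketch of that stepwise tree, assuming sentence-transformers; the anchors and threshold are invented, and whether dot product really suffices for the negation step is exactly the open question above:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def route(query: str) -> str:
    q = model.encode(query, convert_to_tensor=True)
    # Step 1: identify the "refund" part.
    topic = util.cos_sim(q, model.encode("refund", convert_to_tensor=True)).item()
    if topic < 0.3:
        return "other"
    # Step 2: check negation against paired anchors; the nearer anchor wins.
    pos = util.cos_sim(q, model.encode("I want a refund", convert_to_tensor=True)).item()
    neg = util.cos_sim(q, model.encode("I don't want a refund", convert_to_tensor=True)).item()
    return "refund_requested" if pos >= neg else "refund_declined"

print(route("I want a refund"), route("I don't want a refund"))
```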
ShengrenR@reddit
For the 'follow-ups' - this is just a case for prompt pre-processing before you send the query along for classification/routing, same reason you wouldn't run RAG on that statement alone: you pass the conversation context, ask for a rephrase/HyDE/whatever, and route/embed on that.
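A hedged sketch of that pre-processing step; the model name and prompt wording are placeholders:

```python
# Rewrite an elliptical follow-up into a standalone query before
# embedding/routing it. All names here are illustrative.
from openai import OpenAI

client = OpenAI()

def contextualize(history: list[str], query: str) -> str:
    prompt = (
        "Conversation so far:\n" + "\n".join(history) + "\n\n"
        f"Rewrite the last user message as a standalone query: {query}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# "And Boston?" should come back as something like "What's the weather in Boston?"
print(contextualize(
    ["user: What's the weather in NYC?", "assistant: Sunny, 75F."],
    "And Boston?",
))
```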
AdditionalWeb107@reddit (OP)
And that's the point - you have to build and maintain all that plumbing code, or you could use a task-specific LLM that does it in one shot and is trained for routing scenarios. Faster, cheaper, and effective.
Accomplished_Mode170@reddit
This is the part folks are missing; a distilled task-specific LM doesn’t require UMAPing your JSON
Like, everything with shared context is going to inherently embed the same way.
If we're going for representative dimensionality in a shared latent space, you can also use BERT-style classifiers.
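A sketch of that route, assuming Hugging Face transformers, with a zero-shot NLI pipeline standing in for a fine-tuned BERT intent head; the label set is invented for illustration:

```python
from transformers import pipeline

clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
labels = ["refund_requested", "refund_declined", "weather_query"]

# Scores come back per label; route on the argmax.
print(clf("I don't want a refund", candidate_labels=labels))
```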
Accomplished_Mode170@reddit
Ha! Had this opened as a tab all day; came back and was like, ‘Wait that sounds like ArchGW!’
Glad to find y’all here making it accessible!
Folks like yourself, ngrok, Unsloth, et al. building AI-native microservices make me optimistic about the future we're building.
Chromix_@reddit
For those who didn't immediately catch what the bullet points in the post are aiming at: OP is making a case for routing (classifying) user requests using a fast, specialized LLM, over traditional, non-LLM-based methods. You can find a diagram of the flow at the beginning of the project page.
By the way: the example code / prompt there is entertaining.