Would you use a very fast context layer on top of your existing OpenCode/Claude Code instance?
Posted by Winter_Educator_2496@reddit | LocalLLaMA | 30 comments
My goal is simple - a single AI agent everywhere.
For example, if you want to use AI in Google Docs, Google Sheets or Gmail, you need to have a Gemini subscription.
If you want to ask questions about some video on YouTube, you need to have YouTube premium (and even that barely works).
If you want to ask to clarify a post on X, you need Twitter Blue and use a separate Grok instance.
In each example you need to trust the shadow agent infrastructure. Meaning you better hope YouTube gives you a good answer rooted in video transcript + web search, instead of producing some low quality hallucination.
This does not need to be this way. I have been working on a previous project of mine for a year that does just this and have seen people already use it. But it was yet another AI thing that you had to use separately, which defeated the whole idea really.
What I mean is that a prompt like "Can you explain what he means?" should give you the correct answer, without you ever having to explain who "he" is, or what the overall context is. How I achieved it is quite fascinating and I will make a video explaining the architecture next week. In 2 clicks it works with every browser, every website and every other desktop app.
My question is, would you personally use it? Do you often find yourself constantly having to explain exactly what you're doing? As long as you can set it up instantly on Mac, Linux, Windows and connect your phone to it?
DeltaSqueezer@reddit
No. I provide accurate context for the LLM so that it gives the right answer.
Winter_Educator_2496@reddit (OP)
Right, but how would it integrate into Notion? Or Google Docs? Or any other app?
Having to pay to use Grok on x.com, for example, makes no sense if you're already paying for one AI agent or have a local one running. Neither does it on any of the other websites that require you to use their own separate interfaces.
How do you solve that problem?
Automatic-Arm8153@reddit
By realising it’s impractical to do that.
Direct focused context is better. Telling it exactly the situation is better than a generalist summary from whatever your idea is.
It might be good for the standard AI user. But for power users of AI it’s a hindrance. Same thing with memories. AI is a tool. Would you give your shovel memories?
It’s not a bad idea. Just not practical, you will end up correcting/perfecting the idea more than you actually use it.
It’s like the mistake most businesses are making right now. You don’t need AI in everything. We don’t want Copilot everywhere.
Thanks for coming to my TED talk.
It’s definitely not a bad idea though, if you believe in it go for it. You just might’ve hit the wrong target audience right now.
Winter_Educator_2496@reddit (OP)
Pasting from my previous comment:
What if instead, you could do this:
With essentially different websites having different adapters. That way you can provide the context while still having the same agent be integrated into everything.
DeltaSqueezer@reddit
and how does it get the URL of the video?
Winter_Educator_2496@reddit (OP)
The crux of what I created is this local context tracking system that saves the last focused 5-10 apps along with some metadata about them. For browsers I have a separate connection via Native Messaging and a browser extension, to essentially rig every single website that you focus as a separate app. And it works on every browser and every website, and every desktop app on every platform. This is also the thing I want to make the architecture video about because it's quite cool the way it works. It's also how it is able to retrieve structured data from a YouTube video, as YouTube blocks all of the data center IP addresses, meaning you can't use third party cloud apps.
The flow I gave previously already works, but automatically, without you having to use @. However, I can easily add that part.
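To make the "last focused 5-10 apps" idea concrete, here is a minimal sketch of a bounded focus history. It is not taken from the project: the names `FocusEntry`, `FocusHistory`, `record_focus` and the `app_id` key format are all assumptions made for illustration.

```python
from collections import OrderedDict
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FocusEntry:
    app_id: str                      # hypothetical key, e.g. "firefox:tab:42"
    title: str
    metadata: dict = field(default_factory=dict)


class FocusHistory:
    """Keeps only the most recently focused apps, oldest first internally."""

    def __init__(self, max_entries: int = 10):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def record_focus(self, entry: FocusEntry) -> None:
        # Re-focusing a known app moves it to the newest position.
        self._entries.pop(entry.app_id, None)
        self._entries[entry.app_id] = entry
        # Evict the oldest entries once the cap is exceeded.
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)

    def current(self) -> Optional[FocusEntry]:
        # The most recently focused app, or None if nothing was recorded.
        if not self._entries:
            return None
        return next(reversed(self._entries.values()))

    def recent(self) -> List[FocusEntry]:
        # Newest first.
        return list(reversed(self._entries.values()))
```

With a cap like this, a prompt such as "Explain this" would be resolved against `current()`, while `recent()` could back an explicit picker.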
DeltaSqueezer@reddit
I guess tracking usage is one way. I'm not sure how you disambiguate. I currently have over 100 tabs open and over 20 YT tabs open. Which one does it pick? And what's the probability of getting the right one?
I prefer to just paste in the url so you can ask: " on this video: URL". This way url is specified, no ambiguity and no tracking required.
I also work regularly across 3 different machines, so you'd also need to sync across them or have gaps and failures due to not having the context across machines.
It's an interesting idea and might be right for some people.
Winter_Educator_2496@reddit (OP)
Would you be interested in trying it out? I will open source the whole thing in a few days and give you access to the repo. It should take less than 30 seconds to set up. I’m curious if you will find the thing personally useful.
DeltaSqueezer@reddit
Thanks but it is not for me.
Winter_Educator_2496@reddit (OP)
Currently it works with the last focused tab. You can also use @ to choose which tab it opens yourself. I could rig it with something like "@ tari youtube AI Course", with the last two words identifying the actual video, so there is zero ambiguity. The URL is attached to the video title.
I am currently thinking about either making it a single-instance SQLite database, or a dedicated Postgres server that syncs across various machines and your phone (this is what I have now).
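The "@ tari youtube AI Course" selection described above could be resolved roughly like this. This is only a sketch under assumptions: the `Tab` type, the `resolve_tab` function, the site label "youtube" and the example.com URLs are placeholders, not the project's real data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Tab:
    site: str    # hypothetical adapter name, e.g. "youtube"
    title: str
    url: str


def resolve_tab(site: str, query: str, tabs: List[Tab]) -> Optional[Tab]:
    """Return the single tab on `site` whose title contains every query word."""
    words = query.lower().split()
    matches = [
        tab for tab in tabs
        if tab.site == site.lower()
        and all(word in tab.title.lower() for word in words)
    ]
    # Refuse to guess: zero or several matches both count as unresolved.
    return matches[0] if len(matches) == 1 else None


open_tabs = [
    Tab("youtube", "AI Course - Lecture 1", "https://example.com/1"),
    Tab("youtube", "Cooking pasta", "https://example.com/2"),
    Tab("docs", "AI Course notes", "https://example.com/3"),
]
picked = resolve_tab("youtube", "AI Course", open_tabs)
```

Returning `None` on several matches is a deliberate choice here: with 20 similar tabs open, asking the user to narrow the query seems safer than silently picking one.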
Winter_Educator_2496@reddit (OP)
It also completely ignores the private windows in browsers. When the window is private, it does not record or convey any information about the current focus or that in fact the focus has even switched from the regular browser instance.
Automatic-Arm8153@reddit
Again for a normie AI user it’s revolutionary. You are basically describing Jarvis and that’s cool.
Until you learn how to use AI effectively then you realise that’s terrible, and Jarvis is not practical.
Eg in your example of a YouTube video. Your @Tari for it to know what I’m referring to, does it get fed the last 10 seconds of the video? The whole video transcript and a prompt stating that this is the part the user is asking about?
If it’s the whole transcript then that’s unnecessary noise. Personally I would rather open a new Claude chat and say “I was watching a video about X and YouTuber said Y. What do they mean?”
This is nice and specific, straight to the point no unneeded context making my advanced word predictor have to consider unneeded words.
So again, for beginners and most AI users (think Ollama-loving, ChatGPT-using type of people), this idea is amazing.
For power users, having utmost control of the context window is paramount and I’d personally rather steer things myself.
But that’s because I understand the workings of AI. Most people don’t so it’s still worth doing, it might even be the next biggest thing. I can see a market for it.
Just like how there was a market for openclaw but I would never touch it with a ten foot pole.
Winter_Educator_2496@reddit (OP)
Also, YouTube blocks data center IPs. So Claude can't retrieve the video transcript.
You can for example work with Google docs via Claude web chat. But that requires vendor lock in and connection to their thing. Which goes against the local agent idea in my opinion.
Automatic-Arm8153@reddit
That’s standard stuff man. Been doing/knowing such stuff since before AI.
If you have ever done any automation, e.g. social media, this is obvious. What you’re doing is using a residential IP, aka your home internet, along with your cookies, making you look like a legit user and bypassing anti-bot measures.
Not as revolutionary as you think. Good discovery but it’s how most of us use AI. Especially when it comes to automating anything.
Winter_Educator_2496@reddit (OP)
Right. Yes, you can do this.
But Claude Code specifically will try to open it via URL and tell you that it can't.
That is my whole point - rigging interaction with the currently opened sites via your own pc with custom tools.
Automatic-Arm8153@reddit
How many situations are you in where you need to do that? Have you even used your own program?
And if so, how often do you actually use it?
Not just that, does it actively benefit anything, or is it just a nice feeling because you have connected many things?
This is what AI makes you feel like when you start developing, eventually you get over it
Winter_Educator_2496@reddit (OP)
It's not a single youtube video. That is an example.
I've been using it for a year now in various forms. It is super useful because it can easily connect to the things that I already have open. So I don't have to copy and paste the URL and explain what my question is.
I can just highlight a piece of information and ask questions about it and get my answer.
What you're saying makes no sense. You're weirdly darting around. I have been a developer for like 7 years now almost. I specifically created it for myself. Was just asking if other people would be interested in it.
Winter_Educator_2496@reddit (OP)
I have a two-way custom network layer between the agent and every browser.
So the agent gets fed essentially "The user is currently watching a YouTube video titled {title} and is on the timestamp {time}, you can look at the OpenTari MCP server to see what kind of tools you have access to." That MCP server is fully dynamic. Like fully fully dynamic. It exposes a set of tools to get metadata like the name of the channel, the transcript as plain text, the transcript with a timestamp on every line, and various video frames.
On Google Docs it has exposed various tools to render parts of the page and extract text from it, as well as insert text with various styles and adjust fonts and all the other things.
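As a rough illustration of the context message quoted above, the sketch below fills the `{title}` and `{time}` slots and prepends the result to the user's prompt. The `VideoFocus` type, the helper names and the exact output layout are assumptions; only the wording of the context sentence comes from the comment.

```python
from dataclasses import dataclass


@dataclass
class VideoFocus:
    title: str
    position_seconds: int


def format_timestamp(seconds: int) -> str:
    # Render as M:SS, or H:MM:SS once the video passes one hour.
    minutes, secs = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


def build_preamble(focus: VideoFocus, user_prompt: str) -> str:
    # Context sentence first, then the user's unchanged prompt.
    context = (
        f"The user is currently watching a YouTube video titled "
        f'"{focus.title}" and is on the timestamp '
        f"{format_timestamp(focus.position_seconds)}. "
        f"You can look at the OpenTari MCP server to see what kind of "
        f"tools you have access to."
    )
    return f"{context}\n\nUser: {user_prompt}"


message = build_preamble(VideoFocus("AI Course", 754), "Can you explain what he means?")
```

Note that only the title and timestamp are sent up front in this sketch; the transcript stays behind a tool call, so the agent pulls in just the part it needs.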
Winter_Educator_2496@reddit (OP)
I am thinking about what you said. The project is called OpenTari, this will be relevant below.
What if instead, you could do this:
With essentially different websites having different adapters. That way you can provide the context while still having the same agent be integrated into everything.
Conscious_Chapter_93@reddit
Depends on what 'very fast' means. The version we've found useful: a layer that (a) decides which tools the agent can call without asking, (b) records what was approved and what ran, and (c) lets the next agent or operator pick up where the last one stopped. (a) is what makes it fast — the agent stops asking permission for obviously-allowed calls. (b) and (c) are what make it trustworthy enough to be on by default. Yes with those three. Less useful as a 'better prompt context' tool alone.
Queasy-Contract9753@reddit
Sounds useful but how would we get there? Constant screenshots from all my devices?
There's technically Gemini app on Android that sort of gets close. I can open any screen and send that to Gemini. Suppose a local system could then perform more actions from it.
Winter_Educator_2496@reddit (OP)
God I wish I could give a simple explanation in text. But I don't think I can without diagrams and video examples. The repo I am gonna release this week will have multiple markdown files on just this (I still need to write them). And a YouTube video link where I go through the architecture.
In short: screenshots can be done, but they're the last resort. What I do instead is prepare a set of tools per app/website that you have open. For example, if you're on a YouTube video, the agent is told it can call youtube_get_transcript(start, end). If it's on Google Docs it can call gdocs_get_current_page() or gdocs_get_full_document_text(), as well as tools to add text and add comments. For example, I can say "Give feedback" on a GDocs document I have open and it will embed the feedback as comments/highlights in the page, no Gemini subscription needed. There are also tools that run on every website. Some have custom integrations.
I more or less rig a custom MCP server and provide the agent with context of what you're doing. So asking "Explain this" provides a clear and accurate answer.
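The per-site tool selection described above can be pictured as a lookup from the focused URL to a tool list. This is a hedged sketch, not the project's code: the hostnames used as keys, the `SITE_TOOLS` and `GENERIC_TOOLS` tables, and the two generic tool names are assumptions; the YouTube and Google Docs tool names are the ones mentioned in the comment.

```python
from typing import List
from urllib.parse import urlparse

# Hypothetical mapping from hostname to site-specific tool names.
SITE_TOOLS = {
    "www.youtube.com": ["youtube_get_transcript"],
    "docs.google.com": ["gdocs_get_current_page", "gdocs_get_full_document_text"],
}

# Hypothetical tools that would be offered on every website.
GENERIC_TOOLS = ["page_get_text", "page_get_selection"]


def tools_for_url(url: str) -> List[str]:
    """Site-specific tools first, then the tools available everywhere."""
    host = urlparse(url).hostname or ""
    return SITE_TOOLS.get(host, []) + GENERIC_TOOLS


youtube_tools = tools_for_url("https://www.youtube.com/watch?v=abc")
other_tools = tools_for_url("https://example.com/article")
```

A dynamic MCP server would then advertise only this list for the currently focused tab, which keeps the agent's tool menu short.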
vastaaja@reddit
I've found this to usually be a sign of not yet understanding the problem or solution.
Winter_Educator_2496@reddit (OP)
Not really. It is just a complex system.
In short, it is a server that native messengers owned by the browser extension connect to. The server then queries the browser by the last-focused PID to provide a dynamic selection of tools assigned to that browser tab. It works similarly for Microsoft Office and has strategies for best-effort integration into various desktop apps.
But that tells you nothing. Not unless you happen to have a very specific set of skills and knowledge.
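For readers unfamiliar with browser Native Messaging: the extension and the native process exchange JSON messages over stdin/stdout, each preceded by a 32-bit length in native byte order. The sketch below shows only that framing. The message fields (`type`, `tab_id`) are made up for illustration, and nothing here is taken from the OpenTari code.

```python
import io
import json
import struct
from typing import BinaryIO, Optional


def encode_message(payload: dict) -> bytes:
    body = json.dumps(payload).encode("utf-8")
    # "=I" packs an unsigned 32-bit integer in native byte order.
    return struct.pack("=I", len(body)) + body


def read_message(stream: BinaryIO) -> Optional[dict]:
    header = stream.read(4)
    if len(header) < 4:
        return None  # the other side closed the stream
    (length,) = struct.unpack("=I", header)
    return json.loads(stream.read(length).decode("utf-8"))


# Round trip through an in-memory stream instead of real stdin/stdout.
frame = encode_message({"type": "focus_changed", "tab_id": 7})
decoded = read_message(io.BytesIO(frame))
```

In a real native host the stream would be `sys.stdin.buffer`, and replies would be written to `sys.stdout.buffer` and flushed.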
vastaaja@reddit
I don't think the idea sounds complex at all, just the implementation.
You can safely assume a pretty skilled audience here.
Winter_Educator_2496@reddit (OP)
Yes the idea is not complex. But the person specifically asked how do we get there if not just screenshots.
Also the skills are not very relevant. Very few people know how a custom network layer works, let alone native messaging in browsers and stuff like that. So I think it's just a lot more valuable if there are proper explanations with graphs, instead of a short reddit comment.
Winter_Educator_2496@reddit (OP)
Just to be clear - I am already "there". Without having to use only screenshots. You can even disable screenshots and only use custom integrations, or only the integrations for the websites/apps you allow. For example, you can say that the agent can call various tools to read the pages you're on in various ways, but it can't call any tools to change the pages, and only allow it to insert text/click buttons on a select few websites/desktop apps. Security is another big concern of mine, but I haven't really talked about it because that's a whole other thing.
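One way to picture the read/write split described above is a small policy object consulted before every tool call. This is an assumed design, not the project's implementation: `ToolPolicy`, the three `tool_kind` labels and the default-deny behaviour are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToolPolicy:
    allow_screenshots: bool = False
    # Hosts or apps where tools that modify content may run.
    write_allowed: set = field(default_factory=set)

    def is_allowed(self, target: str, tool_kind: str) -> bool:
        """tool_kind is "read", "write" or "screenshot" (hypothetical labels)."""
        if tool_kind == "read":
            return True
        if tool_kind == "screenshot":
            return self.allow_screenshots
        if tool_kind == "write":
            return target in self.write_allowed
        return False  # unknown kinds are denied by default


# Reading is allowed everywhere; writing only on Google Docs.
policy = ToolPolicy(write_allowed={"docs.google.com"})
```

A stricter variant could also gate reads behind a per-site allowlist, matching the "only the integrations you allow" option mentioned in the comment.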
Borkato@reddit
👀 this sounds promising if it exists as advertised
Winter_Educator_2496@reddit (OP)
It does! I honestly can't wait to create a video explaining the architecture. I rigged so many of the usual AI workflows to make this happen. And it took a while to stabilize it and handle cross-platform and cross-browser communication. Each one was just a bit different. So it should be mostly bug free from the jump!
Winter_Educator_2496@reddit (OP)
The context recognition btw has a delay of less than 10ms. So it means the app updates instantly to what you're doing. Everything happens 100% locally, every line of code is open source, and there is no telemetry.