Built a local-first AI memory system that indexes screen activity, meetings, and voice notes (MCP + automations)

Posted by Top_Speaker_7785@reddit | LocalLLaMA | 11 comments

Been experimenting with an idea — what if your AI assistant actually remembered everything you did on your computer? Not stateless chats, but real persistent context. So I built ScreenMind. It continuously captures your screen (using perceptual hashing so it only triggers when content actually changes), runs each frame through Gemma 4 E2B via llama.cpp, and builds a searchable timeline of your day. You can:

search things you've previously seen ("that error message from earlier")
chat with your history ("what was I working on at 3pm?")
transcribe meetings (auto-detects Zoom/Teams/Meet)
transcribe voice memos through Gemma 4's audio encoder
write automations in plain English markdown
connect to Claude/Cursor via MCP

Runs on 4GB+ VRAM with Q4 quantization. Python + FastAPI + SQLite. Everything local.
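
For readers wondering what a "searchable timeline" on SQLite might look like, here is a minimal sketch. The table name, columns, helper functions, and sample rows are all hypothetical and are not taken from ScreenMind's actual schema; the post only states that the stack is Python + FastAPI + SQLite.

```python
import sqlite3


def build_timeline(conn):
    # Hypothetical schema: one row per analyzed capture.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS timeline ("
        "captured_at TEXT NOT NULL, app TEXT, summary TEXT NOT NULL)"
    )


def add_entry(conn, captured_at, app, summary):
    conn.execute(
        "INSERT INTO timeline VALUES (?, ?, ?)", (captured_at, app, summary)
    )


def search(conn, text):
    # Plain substring match; a real system would likely use full-text
    # search or embeddings instead of LIKE.
    rows = conn.execute(
        "SELECT captured_at, summary FROM timeline "
        "WHERE summary LIKE ? ORDER BY captured_at",
        (f"%{text}%",),
    )
    return rows.fetchall()


def between(conn, start, end):
    # ISO-8601 timestamps sort lexicographically, so string comparison works.
    rows = conn.execute(
        "SELECT captured_at, summary FROM timeline "
        "WHERE captured_at >= ? AND captured_at < ? ORDER BY captured_at",
        (start, end),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
build_timeline(conn)
add_entry(conn, "2025-01-01T14:55:00", "terminal",
          "pytest run failed with ImportError in utils.py")
add_entry(conn, "2025-01-01T15:05:00", "browser",
          "reading FastAPI docs on background tasks")

error_hits = search(conn, "ImportError")
three_pm = between(conn, "2025-01-01T15:00:00", "2025-01-01T16:00:00")
```

Here `search` covers the "that error message from earlier" case and `between` covers "what was I working on at 3pm?", assuming a chat layer turns the question into a time range.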

Honestly still figuring out the agent/automation side — right now it's more workflow-driven than truly autonomous, trying not to oversell it. The retrieval quality and onboarding friction also need work. But the core idea I keep coming back to is that local AI gets way more useful once it has real context about what you're actually doing — your screen, your conversations, your patterns — instead of starting from zero every time.

Would love feedback, especially on inference optimization ideas. The E2B model handles everything right now — vision analysis, chat, audio — so GPU scheduling between those tasks has been the main challenge.

GitHub: https://github.com/ayushh0110/ScreenMind
Demo: https://youtu.be/CxkkBT_EvPw

amberdrake@reddit

How do you handle keeping people’s data private?

Top_Speaker_7785@reddit (OP)

everything runs on your machine — no cloud, no network calls after you download the model. screenshots are encrypted at rest, sensitive stuff like credit cards and API keys gets auto-redacted before storage.
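
A rough idea of what redaction before storage could look like, as a sketch only. The regexes, key prefixes, and placeholder strings below are assumptions for illustration and are not ScreenMind's actual rules; the Luhn checksum is a standard way to cut down false positives on long digit runs.

```python
import re

# Assumed patterns: 13-19 digits with optional space/dash separators,
# and a few common API-key prefixes followed by a long token.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
KEY_RE = re.compile(r"\b(?:sk-|ghp_|AKIA)[A-Za-z0-9_-]{16,}")


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact(text: str) -> str:
    """Mask likely card numbers and API keys in OCR or caption text."""

    def _card(match):
        digits = re.sub(r"[ -]", "", match.group())
        # Only mask digit runs that look like real card numbers.
        return "[REDACTED_CARD]" if luhn_ok(digits) else match.group()

    text = CARD_RE.sub(_card, text)
    return KEY_RE.sub("[REDACTED_KEY]", text)


card_out = redact("card 4111 1111 1111 1111 exp 12/29")
key_out = redact("export OPENAI_API_KEY=sk-test1234567890abcdefXYZ")
plain_out = redact("order 1234567890123456")
```

Pattern matching like this only catches secrets that show up as text, so it would need to run on whatever OCR or model output gets stored alongside the screenshot.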

Parzival_3110@reddit

Cool direction. The useful line for me is when memory stops being just search history and starts becoming action context: what tab was open, what the agent saw, what it clicked, and what state it should avoid touching again.

If you add Claude or Cursor MCP actions, I would keep browser work separate from the memory index. Owned tabs, action receipts, and hard stops for login or captcha states make the assistant a lot easier to trust.

I am building FSB around that real Chrome control layer for Claude and Codex: https://github.com/LakshmanTurlapati/FSB

ai-christianson@reddit

what does a concrete eval look like for action-context memory? what test cases prove the system is doing more than search? e.g. does the agent avoid repeating a failed action it remembers, or does it just retrieve the fact that it happened?

Top_Speaker_7785@reddit (OP)

right now it's just retrieval tbh. it'll surface what happened but doesn't avoid repeating a failed action on its own. only put this out 4 days ago so the agent loop isn't closed yet. it's on the list tho

Top_Speaker_7785@reddit (OP)

yea keeping MCP actions separate from the memory index — that's basically how i have it right now, the MCP server is read-only. no write actions yet.
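
One way to enforce "read-only" at the storage layer, sketched under assumptions: the table and file names are made up, and this is not necessarily how ScreenMind's MCP server does it. The only real API used is SQLite's `mode=ro` URI parameter via Python's `sqlite3.connect(..., uri=True)`.

```python
import os
import sqlite3
import tempfile
from pathlib import Path


def open_read_only(path):
    # mode=ro makes SQLite reject any write on this connection.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


# Set up a throwaway database (hypothetical schema) with a writable connection.
db_path = os.path.join(tempfile.mkdtemp(), "memory.db")
rw = sqlite3.connect(db_path)
rw.execute("CREATE TABLE timeline (captured_at TEXT, summary TEXT)")
rw.execute("INSERT INTO timeline VALUES ('2025-01-01T15:00:00', 'editing README')")
rw.commit()
rw.close()

# The connection an MCP tool handler would use: reads work, writes fail.
ro = open_read_only(db_path)
rows = ro.execute("SELECT summary FROM timeline").fetchall()
try:
    ro.execute("DELETE FROM timeline")
    write_blocked = False
except sqlite3.OperationalError:
    write_blocked = True
ro.close()
```

With this setup a buggy or prompt-injected tool call cannot modify the memory index, regardless of what SQL it ends up issuing.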

A-n-o-v-a@reddit

Had a recall subscription but I think this is a good contender

Maleficent-Ad5999@reddit

Isn’t this what Microsoft wanted to build as “Recall”?

Top_Speaker_7785@reddit (OP)

yeah that's where the idea started, a local privacy-first alternative to Recall. but it's grown past that.
Recall (and screenpipe) basically do raw OCR text dumps of your screen. this understands context and uses OCR only as extra context.
so it went from "private recall clone" to more of a local AI memory system with actual understanding.

pquattro@reddit

Interesting project! The perceptual hashing + Gemma 4 E2B pipeline for screen capture is clever — have you benchmarked the overhead of frame-by-frame analysis vs. selective region capture (e.g., active window only)? For GPU scheduling, consider batching vision/audio tasks during idle periods or using a lightweight scheduler like vLLM’s PagedAttention to prioritize interactive queries. Also, SQLite might bottleneck at scale; switching to DuckDB or LMDB for the timeline could help with concurrent writes during heavy capture sessions.

Top_Speaker_7785@reddit (OP)

right now it captures full screen but the pHash cache means it skips frames that haven't meaningfully changed, which cuts inference calls in practice.
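
To illustrate the skip-unchanged-frames idea, here is a minimal difference-hash (dHash) sketch in pure Python. It assumes the frame has already been downscaled to a small grayscale grid (a real pipeline would use an image library for that step), and the threshold of 5 bits is an arbitrary choice. ScreenMind's actual hash variant and threshold are not stated in the thread.

```python
def dhash(gray):
    """Difference hash of a small grayscale grid.

    `gray` is a list of rows; each row has one more pixel than the number
    of bits it contributes (9 columns x 8 rows gives a 64-bit hash).
    """
    bits = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def has_changed(prev_hash, new_hash, threshold=5):
    # threshold is an assumed tuning knob, not a documented value
    return prev_hash is None or hamming(prev_hash, new_hash) > threshold


# Synthetic frames: a gradient, the same gradient slightly brighter,
# and a mirrored gradient standing in for a real content change.
frame_a = [[c * 10 for c in range(9)] for _ in range(8)]
frame_b = [[c * 10 + 3 for c in range(9)] for _ in range(8)]
frame_c = [[(8 - c) * 10 for c in range(9)] for _ in range(8)]

same = has_changed(dhash(frame_a), dhash(frame_b))
different = has_changed(dhash(frame_a), dhash(frame_c))
```

Because the hash only encodes whether each pixel is brighter than its neighbor, a uniform brightness shift leaves it unchanged, while a layout change flips many bits and triggers a new inference call.
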
for GPU scheduling — there's a deferred analysis toggle that queues up screenshots and only runs inference when the GPU is idle. chat always gets priority and cancels in-flight analysis instantly. also auto-pauses capture when it detects heavy GPU apps.
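
The "chat cancels in-flight analysis" behavior can be sketched with `asyncio` tasks. Everything here is hypothetical: the class, method names, and sleep calls standing in for model inference are illustrative and do not reflect ScreenMind's code or how llama.cpp requests are actually interrupted.

```python
import asyncio
from collections import deque


class GpuScheduler:
    """Toy scheduler: one background analysis at a time, chat preempts it."""

    def __init__(self):
        self.pending = deque()     # frames waiting for deferred analysis
        self.analysis_task = None  # in-flight background analysis, if any
        self.current = None        # frame the in-flight task is working on
        self.log = []

    async def _analyze(self, frame):
        await asyncio.sleep(0.05)  # stand-in for a vision model call
        self.log.append(f"analyzed {frame}")

    def start_next_analysis(self):
        idle = self.analysis_task is None or self.analysis_task.done()
        if self.pending and idle:
            self.current = self.pending.popleft()
            self.analysis_task = asyncio.create_task(self._analyze(self.current))

    async def chat(self, prompt):
        task = self.analysis_task
        if task is not None and not task.done():
            task.cancel()  # interactive request wins
            try:
                await task
            except asyncio.CancelledError:
                pass
            self.pending.appendleft(self.current)  # retry this frame later
            self.log.append(f"cancelled {self.current}")
        await asyncio.sleep(0.01)  # stand-in for the chat model call
        self.log.append(f"chat {prompt}")
        self.start_next_analysis()  # GPU is idle again


async def demo():
    sched = GpuScheduler()
    sched.pending.extend(["frame-1", "frame-2"])
    sched.start_next_analysis()
    await asyncio.sleep(0.01)  # frame-1 analysis is now in flight
    await sched.chat("what was I working on?")
    while sched.pending or not sched.analysis_task.done():
        await sched.analysis_task
        sched.start_next_analysis()
    return sched.log


log = asyncio.run(demo())
```

In this sketch the cancelled frame goes back to the front of the queue, so preemption delays analysis without dropping work.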

SQLite hasn't been a bottleneck honestly. heavy capture sessions are throttled by a 3-minute staleness check that skips frames that have been sitting in the queue too long, so writes stay sequential and manageable. but i'll keep DuckDB in mind if that changes.
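
A staleness check like the one described could be as simple as the sketch below. The queue layout, function name, and return shape are assumptions; only the 3-minute cutoff comes from the comment above.

```python
import time
from collections import deque

STALE_AFTER_S = 180  # 3 minutes, per the comment above


def next_fresh_frame(queue, now=None):
    """Pop frames until one is recent enough to be worth analyzing.

    `queue` holds (captured_at_seconds, frame) tuples, oldest first.
    Returns (frame_or_None, number_of_stale_frames_skipped).
    """
    now = time.time() if now is None else now
    skipped = 0
    while queue:
        captured_at, frame = queue.popleft()
        if now - captured_at <= STALE_AFTER_S:
            return frame, skipped
        skipped += 1  # too old: drop it instead of spending GPU time
    return None, skipped


# Hypothetical queue: one frame from 200 s ago, one from 100 s ago.
q = deque([(0, "old-frame"), (100, "recent-frame")])
result = next_fresh_frame(q, now=200)
```

Dropping stale frames bounds how far the analysis queue can fall behind, which in turn keeps the rate of database writes bounded.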