shbong

I burned a weekend making the models "remember" me. The fix had nothing to do with trying to run bigger models locally

Posted by shbong@reddit | LocalLLaMA | View on Reddit | 19 comments

[-]

shbong@reddit (OP)

Yes, I agree graphs are complex af BUT for certain use cases in my opinion there is not trade-off, we must find a way to make graphs work, they are such a powerful structure. For the entity deduplication and graph creation I approached differently, I harnessed the "pre-update" phase, by running a swarm of agents that read and discuss together the new data that needs to be appended considering and reading the current graph state while discussing, I found this way being better for the end graph structure

How much VRAM needed for Qwen 3.6 27B Q8 with 262K context?

Posted by My_Unbiased_Opinion@reddit | LocalLLaMA | View on Reddit | 79 comments

[-]

shbong@reddit

Interesting so mostly is harnessing, do dynamic skills fit in your stack?

I burned a weekend making the models "remember" me. The fix had nothing to do with trying to run bigger models locally

Posted by shbong@reddit | LocalLLaMA | View on Reddit | 19 comments

[-]

shbong@reddit (OP)

What do you mean?

I burned a weekend making the models "remember" me. The fix had nothing to do with trying to run bigger models locally

Posted by shbong@reddit | LocalLLaMA | View on Reddit | 19 comments

[-]

shbong@reddit (OP)

https://preview.redd.it/je7akklq925h1.jpeg?width=736&format=pjpg&auto=webp&s=1b0a780973e82e19b6ec2e4a255b5528192e228e

How much VRAM needed for Qwen 3.6 27B Q8 with 262K context?

Posted by My_Unbiased_Opinion@reddit | LocalLLaMA | View on Reddit | 79 comments

[-]

shbong@reddit

Like in what circumstances? Also if someone else has tried did you noticed a degradation in performances (the model being accurate and smart) about the things you are asking with such big context windows?

I burned a weekend making the models "remember" me. The fix had nothing to do with trying to run bigger models locally

Posted by shbong@reddit | LocalLLaMA | View on Reddit | 19 comments

[-]

shbong@reddit (OP)

LoL "AI Slop" you don't even know what I am talking about, turn on your brain before vomiting at first glance

What memory system are you using for your agents?

Posted by Mr_Moonsilver@reddit | LocalLLaMA | View on Reddit | 55 comments

[-]

shbong@reddit

I built my own memory system not because others weren't great, but because I've been developing agents that required memory for a couple of years. Initially, I had to integrate memory directly into my projects. Later, I extracted this work into a separate "memory-like" project that eventually became BrainAPI. The advantage of having my own system is that I can continuously develop projects on top of it, allowing me to edit and extend its capabilities as needed, without the limitations of other memory systems. For instance, to enhance adaptability, I developed a plugin ecosystem that can modify the core engine's functionalities

I burned a weekend making the models "remember" me. The fix had nothing to do with trying to run bigger models locally

Posted by shbong@reddit | LocalLLaMA | View on Reddit | 19 comments

[-]

shbong@reddit (OP)

What's wrong with you?

How much VRAM needed for Qwen 3.6 27B Q8 with 262K context?

Posted by My_Unbiased_Opinion@reddit | LocalLLaMA | View on Reddit | 79 comments

[-]

shbong@reddit

Is it really worth having a context of 262K tokens? Considering the attention span of LLMs having such big context window can be counter productive as said in the "Lost in the Middle" paper

I burned a weekend making the models "remember" me. The fix had nothing to do with trying to run bigger models locally

Posted by shbong@reddit | LocalLLaMA | View on Reddit | 19 comments

[-]

shbong@reddit (OP)

Well thanks to the context around and the previous messages

Stop asking what model to run. There are literally only two.

Posted by Wrong_Mushroom_7350@reddit | LocalLLaMA | View on Reddit | 549 comments

[-]

shbong@reddit

LoL now also DeepSeek V4 is goodgood choice But.. no RTX will handle the job, you should think about getting a mac

Anyone else experimenting with memory for LLMs?

Posted by shbong@reddit | LocalLLaMA | View on Reddit | 41 comments

[-]

shbong@reddit (OP)

I've been developing a project that has evolved into an ecosystem now that it supports plugins. Recently, I created two plugins: a chatbot inference plugin and a chatbot memory plugin. The inference plugin exposes two API endpoints to serve a Large Language Model (LLM) with either streaming or simple API responses. The memory plugin enhances the messages sent to the LLM by adding extra context. These two plugins can function together when installed in the same BrainAPI instance. The chatbot plugin allows the LLM to directly use MCP tools to navigate a knowledge graph, effectively creating a fully knowledge-aware agent straight out of the box using just the command-line interface. The project can be found at https://github.com/Lumen-Labs/brainapi2. The setup I used, consisting of PostgreSQL, pgvector, and NetworkX, works exceptionally well

Calling it now Microsoft is buying Unsloth.

Posted by Wrong_Mushroom_7350@reddit | LocalLLaMA | View on Reddit | 289 comments

[-]

shbong@reddit

At worst we will have the forks that will become something else

Gemma4 27b vs GPT-OSS 20b -- Has anyone compared them ?

Posted by shbong@reddit | LocalLLaMA | View on Reddit | 19 comments

[-]

shbong@reddit (OP)

Nice resource! thanks

Gemma4 27b vs GPT-OSS 20b -- Has anyone compared them ?

Posted by shbong@reddit | LocalLLaMA | View on Reddit | 19 comments

[-]

shbong@reddit (OP)

Yeah, 397B is quite huge

Gemma4 27b vs GPT-OSS 20b -- Has anyone compared them ?

Posted by shbong@reddit | LocalLLaMA | View on Reddit | 19 comments

[-]

shbong@reddit (OP)

Just in general.. if someone tried both what noticed...

A local agent (that works with local models) that is easy to set up.

Posted by Valuable-Run2129@reddit | LocalLLaMA | View on Reddit | 3 comments

[-]

shbong@reddit

I still have to try Gemma4 but from what I've seen so far local small models tend to struggle with tool usage and structured output (for eg. often if requested to output json misses one parenthesis or quotes, breaking the whole output)

4B models on smartphone

Posted by Sudden_Vegetable6844@reddit | LocalLLaMA | View on Reddit | 9 comments

[-]

shbong@reddit

BUT just the fact that we now are in the position of discussing the quality and being able to run LLMs locally in our smartphone is a something big

If you haven't yet given Gemma 4 a go...do it today

Posted by No-Anchovies@reddit | LocalLLaMA | View on Reddit | 206 comments

[-]

shbong@reddit

Yeah I saw the benchmarks and also that on some benchmarks it's better than sonnet 4.. and it runs on your local machine..

Why retrieval breaks once documents stop being static

Posted by EnoughNinja@reddit | LocalLLaMA | View on Reddit | 1 comments

[-]

shbong@reddit

Semantic retrieval serves as foundation, it cannot be the right end solution for most of the use cases

I'm shocked (Gemma 4 results)

Posted by Potential-Gold5298@reddit | LocalLLaMA | View on Reddit | 78 comments

[-]

shbong@reddit

This is going to change the "local ai" space, such powerful model that can run easily locally with quantization, beating bigger models like sonnet 4..

M1 Max vs M4 Max vs M5 Max

Posted by br_web@reddit | LocalLLaMA | View on Reddit | 4 comments

[-]

shbong@reddit

You can aim to max 50/80 token/sec not more so 3x or 4x are not on the table

which macbook configuration to buy

Posted by Ayuzh@reddit | LocalLLaMA | View on Reddit | 11 comments

[-]

shbong@reddit

Macbooks are great because they share the ram with their gpu so tecnically you'll get the same amount of VRAM of your RAM

Thoughts on the almost near release Avocado?

Posted by shbong@reddit | LocalLLaMA | View on Reddit | 6 comments

[-]

shbong@reddit (OP)

yes they had planned to release it on march but since it didn't met their expectations they moved the launch to may so they can improve it meanwhile

Built a graph-based "memory layer" for agents - Qwen > LLaMA for us, GPT-OSS 20B fast but tooling issues

Posted by shbong@reddit | LocalLLaMA | View on Reddit | 4 comments

[-]

shbong@reddit (OP)

I would love to, I have a discord channel dedicated to memory, rag and this kind of stuff, maybe you can jump in there?

Built a graph-based "memory layer" for agents - Qwen > LLaMA for us, GPT-OSS 20B fast but tooling issues

Posted by shbong@reddit | LocalLLaMA | View on Reddit | 4 comments

[-]

shbong@reddit (OP)

27B? Wow, your GPU must be on fire! I've thought about buying some GPUs many times but I'm still relying on my trusty MacBook

OpenAI pivot investors love

Posted by PaceImaginary8610@reddit | LocalLLaMA | View on Reddit | 124 comments

[-]

shbong@reddit

it's funny.. .because it's true

I feel personally attacked

Posted by HeadAcanthisitta7390@reddit | LocalLLaMA | View on Reddit | 219 comments

[-]

shbong@reddit

lol at least ppl can vent their creativity

LLMs finally remembering: I’ve built the memory layer, now it’s time to explore

Posted by shbong@reddit | LocalLLaMA | View on Reddit | 4 comments

[-]

shbong@reddit (OP)

You don't see probably the "memory layer" in the article because the article talks about a tutorial and the part about integrating memory layer is just few lines and does not require to lo learn more flaky abstraction

LLMs finally remembering: I’ve built the memory layer, now it’s time to explore

Posted by shbong@reddit | LocalLLaMA | View on Reddit | 4 comments

[-]

shbong@reddit (OP)

I mean if we should follow this philosophy we will not have most of the technology that we have, we will not have LLMs too

Anyone else experimenting with memory for LLMs?

Posted by shbong@reddit | LocalLLaMA | View on Reddit | 41 comments

[-]

shbong@reddit (OP)

yes, I was taking a look at it and sounds really cool, how long have you been working on it guys?

Anyone else experimenting with memory for LLMs?

Posted by shbong@reddit | LocalLLaMA | View on Reddit | 41 comments

[-]

shbong@reddit (OP)

Letta is also cool (MemGPT) have you experimented with them?

Anyone else experimenting with memory for LLMs?

Posted by shbong@reddit | LocalLLaMA | View on Reddit | 41 comments

[-]

shbong@reddit (OP)

I didn't meant to say that it's a trivial task on the contrary I want to underline that it's a really complex task, a real challenge nowdays that can bring so much more quality to AI agents or chatbots in general

Anyone else experimenting with memory for LLMs?

Posted by shbong@reddit | LocalLLaMA | View on Reddit | 41 comments

[-]

shbong@reddit (OP)

Just took a look at memsync, it looks really cool

Anyone else experimenting with memory for LLMs?

Posted by shbong@reddit | LocalLLaMA | View on Reddit | 41 comments

[-]

shbong@reddit (OP)

So you think that the future is going to have LLMs with full memory-in-context?

Anyone else experimenting with memory for LLMs?

Posted by shbong@reddit | LocalLLaMA | View on Reddit | 41 comments

[-]

shbong@reddit (OP)

I had a theory but still not had been able to apply it due to speed constraints, in short: \- resolve coreferences in the texts you are about to process/save \- extract triplets embedding the phrase, subject and object \- create the nodes on the graph \- while searching do a vector search on both nodes and entire phrases and retrieve 1 level depth \- let the llm explore paths on the graph with a tool

Anyone else experimenting with memory for LLMs?

Posted by shbong@reddit | LocalLLaMA | View on Reddit | 41 comments

[-]

shbong@reddit (OP)

You definitely sound like a veteran lol, how's the journey going?

8x RTX 3090 open rig

Posted by Armym@reddit | LocalLLaMA | View on Reddit | 391 comments

[-]

shbong@reddit

“If I will win the lottery I will not tell anybody but there will be signs”