One thing we found while building long-horizon agents: context density mattered more than context length
Posted by Ok_Celery_4154@reddit | LocalLLaMA | 1 comment
We’ve been experimenting with a long-horizon agent setup, and one thing that became increasingly obvious was this:
Most failures weren’t coming from insufficient context window size, but from low information density inside the active context.
In other words, even when the model had “enough room,” decision quality still degraded once too much low-value state, tool history, and irrelevant memory accumulated.
So we started testing a different design approach:
- keep the tool interface minimal
- retrieve memory on demand instead of loading everything
- explicitly convert successful task experience into reusable SOPs/scripts
- compress or trim context aggressively when it stops being decision-relevant
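The retrieval and trimming ideas above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: all names (`ContextEntry`, `ContextManager`, the `relevance` score) are hypothetical, and the relevance scores are assumed to be maintained elsewhere by the agent.

```python
# Hypothetical sketch of on-demand memory retrieval plus
# relevance-based context trimming. Not the OP's actual code.
from dataclasses import dataclass, field

@dataclass
class ContextEntry:
    text: str
    kind: str          # e.g. "instruction", "tool_result", "memory"
    relevance: float   # decision-relevance score, maintained by the agent
    tokens: int        # approximate token count

@dataclass
class ContextManager:
    budget: int                      # token budget for the active context
    entries: list = field(default_factory=list)
    memory_store: dict = field(default_factory=dict)  # memory kept outside context

    def add(self, entry: ContextEntry) -> None:
        self.entries.append(entry)
        self.trim()

    def trim(self) -> None:
        """Drop the lowest-relevance non-instruction entries once over budget."""
        def used() -> int:
            return sum(e.tokens for e in self.entries)
        droppable = sorted(
            (e for e in self.entries if e.kind != "instruction"),
            key=lambda e: e.relevance,
        )
        for entry in droppable:
            if used() <= self.budget:
                break
            self.entries.remove(entry)

    def retrieve(self, key: str):
        """Load a memory item into the active context only when needed."""
        text = self.memory_store.get(key)
        if text is None:
            return None
        entry = ContextEntry(text, "memory", relevance=1.0,
                             tokens=len(text.split()))
        self.add(entry)
        return entry
```

The point of the sketch is the shape of the loop: memory lives outside the active context and is pulled in per-task, and anything that stops being decision-relevant is evicted rather than left to accumulate.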
A few things we observed:
- repeated runs of similar tasks became much cheaper over time
- token usage dropped by as much as 89.6% on repeated tasks in our setup
- the system showed a pretty visible cold-start → convergence pattern
- on some harder web tasks, reducing context noise mattered more than adding more structure
My current takeaway is that for agent systems, context management may be a more fundamental bottleneck than raw context length.
Curious whether others here have seen similar behavior:
- with memory-heavy agents
- with tool-using workflows
- or with long web / desktop task chains
If useful, we wrote up the implementation and evaluation details here:
Would be especially interested in pushback on:
- whether “context information density” is actually a useful framing
- how others handle reusable skill formation
- whether repeated-task convergence holds outside narrow task families
FatheredPuma81@reddit
Oh, so that's why OpenCode deletes tool calls every time an agent finishes its turn, even if you have plenty of context left. Well, other than the obvious reasons of preventing long-context rot and improving speed.