Your coding agent sessions are sitting on your machine right now. Big labs use this data internally. We could build an open equivalent.
Posted by No-Point1424@reddit | LocalLLaMA | View on Reddit | 26 comments
Every time you use Claude Code or Codex CLI in agent mode, it logs everything locally. The full loop: your task, the model's reasoning, every tool call, every environment response, every error and retry. Complete (state → action → reward → next state) tuples. The exact data format RL researchers dream about.
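To make the shape concrete, here's a minimal sketch of how those logs could be mined; the field names ("type", "assistant", "tool_result") are assumptions, since the actual on-disk schema differs between agents and versions:

import json
from pathlib import Path

# Hypothetical sketch: walk Claude Code session logs and pair each model turn
# with the environment response that followed it. Field names are assumptions.
def iter_tool_turns(session_file: Path):
    pending_action = None
    for line in session_file.read_text().splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("type") == "assistant":
            pending_action = event          # model reasoning + tool call
        elif event.get("type") == "tool_result" and pending_action is not None:
            yield pending_action, event     # action + environment response
            pending_action = None

for f in Path.home().glob(".claude/projects/**/*.jsonl"):
    for action, result in iter_tool_turns(f):
        pass  # assemble (state, action, reward, next state) tuples here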
I checked all my machines today.
Mac Mini:
~/.claude/projects/ 3.1GB 1103 files 574 agentic sessions
MacBook:
~/.codex/sessions/ 2.4GB 3530 files 79 agentic sessions
~/.claude/projects/ 652MB 316 files 99 agentic sessions
775 sessions with real tool calls. 41 million tokens.
Extrapolate to thousands of developers and we would have hundreds of billions of tokens of real agentic trajectory data. No Pile equivalent exists for this. It's just sitting on people's hard drives, being silently deleted.
Claude Code deletes logs after 30 days by default. Fix it now:
echo '{"cleanupPeriodDays": 36500}' > ~/.claude/settings.json
Why this data matters
The environment always tells you if it worked. Exit code 0 or not. Tests pass or not. This is the missing training signal: causal reasoning, error recovery, long-horizon planning. Things current models are genuinely bad at.
Big labs already collect this. Every Claude Code and Codex session trains proprietary models. There's no open equivalent, not because the data doesn't exist, but because it's fragmented across developer machines.
The proposal
Federated learning. Your data never leaves your machine. You train a small LoRA adapter locally, share only the weights with differential privacy noise, and get an improved global model back. Everyone contributes compute and signal, and nobody exposes their data. Alternatively, we could anonymize the traces, publish them as a dataset, and fine-tune a model on it.
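Roughly, the aggregation step on the receiving end would look something like this; a sketch only, with illustrative clip/noise values rather than calibrated privacy parameters:

import numpy as np

# Clip each client's LoRA delta to bound its norm, add Gaussian noise, then average.
# clip_norm and sigma are illustrative, not a tuned differential-privacy budget.
def aggregate(client_deltas: list, clip_norm: float = 1.0, sigma: float = 0.01) -> np.ndarray:
    noisy = []
    for delta in client_deltas:
        scale = min(1.0, clip_norm / (np.linalg.norm(delta) + 1e-12))
        noisy.append(delta * scale + np.random.normal(0.0, sigma * clip_norm, delta.shape))
    return np.mean(noisy, axis=0)  # the new global adapter update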
Check your own machines
du -sh ~/.codex/sessions/ 2>/dev/null
du -sh ~/.claude/projects/ 2>/dev/null
find ~/.codex/sessions/ -name "*.jsonl" | wc -l
find ~/.claude/projects/ -name "*.jsonl" | wc -l
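To separate agentic sessions (ones with real tool calls) from plain chats, something like this works, assuming a marker string such as "tool_use" shows up in your agent's log format:

from pathlib import Path

# Count session files that contain at least one tool call.
# "tool_use" as a marker is an assumption about the log format.
dirs = [Path.home() / ".codex" / "sessions", Path.home() / ".claude" / "projects"]
agentic = sum(
    1
    for d in dirs if d.exists()
    for f in d.rglob("*.jsonl")
    if "tool_use" in f.read_text(errors="ignore")
)
print(agentic, "sessions with tool calls")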
Drop your numbers in the comments. I want to know the actual scale sitting unused across this community.
If there's enough interest we can build this out.
Sense_Nom@reddit
The more interesting angle for me is what's in those traces from a data perspective. Agent sessions that interact with user inputs — especially in anything production-facing — can accumulate PII, credentials, internal endpoints. Nobody talks about the hygiene of what the agent processed, only what it produced. Would be curious whether a data collection effort like this would sanitize traces before aggregating.
LA_rent_Aficionado@reddit
I actually built this a few months back but haven't opened up my private repo because I need to tweak it a bit.
I'll open it up now before anyone wastes any time recreating this but consider it beta: https://github.com/thad0ctor/SFTizer/
It formats Gemini, Claude, Cursor, and Cline chats into SFT dataset formats (including multi-turn) with PII scrubbing (which needs to be improved).
Disclaimer: 100% vibe coded
teleprint-me@reddit
As tempting as it is to use these tools, I've decided to opt out of using them completely.
Just build a wrapper around llama.cpp and you'll end up with the same result without the headaches involved in using one of these tools.
The last straw for me was when they wanted ownership of the model's outputs relative to the author's inputs.
So if you come up with something novel, they want a percentage of ownership.
Considering they stole the data, then sold it back to people, and then claim it's their private property, I'm all set. I'm fully local now for these reasons.
If that means I can't use the latest SOTA models above 30B params, then so be it.
I'm saving money long term by doing this.
12 mo/yr @ min $20/mo = $240/yr, or $1,200 over 5 years.
So, in a 5-year span, you could have bought a GPU to run a model locally at the minimum cost per usage.
There are people doling out hundreds per month.
12 mo/yr @ min $200/mo = $2,400/yr, or $12,000 over 5 years.
You will spend more money in the long term by building reliance on these remote systems, give up property rights (partial at minimum), give up privacy, and enable profiling (watch lists). It doesn't even matter if you're legit or morally inclined.
Just do the math. You'll see that it's better to be local. Remote usage is unsustainable long term and just as harmful, if not more so, at scale.
ttkciar@reddit
We already have Open Code and local models, though.
No-Point1424@reddit (OP)
Local models don't have access to Codex and Claude Code sessions. I think one of the reasons OpenAI gives generous credits, and even 2x for a few months, is the data they get. They can use that for RL on the next run. Cursor does the same thing, and Claude Code too. All SOTA open coding models are out of compute reach for many people, and many models small enough to run locally are not good enough yet.
ttkciar@reddit
Okay, but we already have Open Code. Why not use that instead of Claude Code?
No-Point1424@reddit (OP)
I'm talking about the model, not the agent. We can distill/train on outputs from Opus and 5.3 Codex.
Far-Association2923@reddit
This would need to be "opt in," not standard. I agree, though, that the amount of data the open-source community could pull together and share would be massive for training open-source models. There would also be nothing stopping the big boys from using this data as well, although they already have a lot of our data.
If someone has the resources to store this massive amount of data for model training, I would gladly implement this into my app. Maybe just an open-source vector store so it compresses down to LLM-searchable data?
No-Point1424@reddit (OP)
I wish there were a way/breakthrough to train models decentralized at scale.
Far-Association2923@reddit
just needs the support of some infrastructure https://github.com/frumu-ai/trace-share
CatConfuser2022@reddit
Related: https://www.reddit.com/r/ClaudeCode/comments/1re0qa1/dataclaw_publish_your_claude_code_conversations/
BC_MARO@reddit
Worth noting those session logs contain API keys, file paths, and code in plain text. The federated approach is smart but scrubbing credentials from the JSONL before anything leaves the machine is non-negotiable.
Far-Association2923@reddit
Not hard to do. I have a check-secrets.js that scans each commit and refuses the commit if it fails. Sometimes you get false positives, but it works great.
BC_MARO@reddit
Commit hooks handle what goes into git, but the JSONL session files are written directly to disk by the agent runtime and never touch a commit at all, so you'd need scrubbing at the write layer itself.
Far-Association2923@reddit
Yes, I was just using a git commit scrubber as a reference here. There are many Rust crates that can do all the scrubbing we would require by default, plus some specific anonymization filtering. I'm all for creating a repo to build this as a tiny Rust binary 😁 The question would be "where" the data gets pushed to. I'm not sure, for example, whether IPFS could handle the size it might grow into, and pinning costs each user $$$. Maybe we can convince Upstash to support this: https://upstash.com/open-source
BC_MARO@reddit
The Rust binary idea is solid -- a lightweight sidecar that hooks into the JSONL write path and strips credentials in place before anything accumulates. Upstash's serverless Redis might actually work for federated weight aggregation since you'd be pushing small diffs, not raw session data.
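For reference, the scrubbing pass itself is not much code. A rough Python sketch (the actual sidecar would presumably be the Rust binary discussed above, and these patterns are examples, not an exhaustive secret detector):

import re
from pathlib import Path

# Replace obvious credential patterns in a session log, in place.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9_\-]{20,}"),    # OpenAI-style API keys
    re.compile(r"AKIA[0-9A-Z]{16}"),          # AWS access key IDs
    re.compile(r"ghp_[A-Za-z0-9]{36}"),       # GitHub personal access tokens
]

def scrub_jsonl(path: Path) -> None:
    text = path.read_text()
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    path.write_text(text)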
Far-Association2923@reddit
I wonder if $1000 credits a month would be enough. I suppose it also depends on how many people chip in to submit data.
I'm already working with a Rust project that can auto-compile cross-platform and publish to crates.io and npm via GitHub Actions. That code would dramatically speed up development. I don't have much experience with Rust binaries that just auto-run at startup, though. I suppose it could also be something you call at random intervals to "submit" your data.
Far-Association2923@reddit
Here’s my proposal / early prototype:
https://github.com/frumu-ai/trace-share
It’s an opt-in local pipeline for coding-agent traces:
ingest logs → sanitize/redact → convert to episode records → preview/export training-ready payloads.
Quick test (Windows/macOS/Linux):
Nothing is uploaded unless you explicitly run publish paths with review/yes gates.
Current focus is local validation + schema iteration.
On my machine, raw Codex logs were ~700MB; the sanitized, trainable episode export was ~36MB.
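For anyone wondering what an "episode record" looks like after sanitization, here's a hypothetical example of the shape (not the actual trace-share schema, just an illustration):

# Hypothetical episode record; the real trace-share schema may differ.
episode = {
    "source": "codex",
    "task": "fix failing test in parser.py",
    "turns": [
        {"role": "assistant", "tool": "bash", "input": "pytest -x"},
        {"role": "tool", "exit_code": 1, "output": "1 failed"},
        {"role": "assistant", "tool": "edit", "input": "[patch]"},
        {"role": "tool", "exit_code": 0, "output": "3 passed"},
    ],
    "outcome": "success",  # derived from the final exit code / test result
}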
To move from prototype to public dataset ops, we still need:
1) vector index/search infra (curation/dedupe/discovery)
2) blob storage + hosting for snapshot artifacts (R2 or Hugging Face)
Looking for contributors/support:
- source adapters (OpenCode/Windsurf/Trae/Roo/Kilo/etc.)
- sanitizer hardening
- feedback from model trainers on episode schema + export formats
- testers / security auditors
RedParaglider@reddit
I was thinking the same, what an opsec nightmare lol.
sine120@reddit
I'd rather keep LocalLLaMA's personalized porn out of my training data, thanks.
openSourcerer9000@reddit
I'm so down. I tried to get some momentum on a similar idea about a year ago, just getting local models to use coding harnesses. Things moved so fast in that time that it didn't make sense to lock into one model/coding harness, though.
With sota changing every week, it's exciting but it makes it difficult for the research community to actually build anything on top of the base models. At this point, it's best to stay model agnostic until winners are picked and things actually start to settle down. For this reason, data sets are more useful than weights, but either would be nice to have if we can actually improve performance of one of these models.
People are probably on LocalLLaMA in the first place because they're data privacy nuts. For personal identifiers and secrets, I know there are secret scrubber models out there; I don't have links to open-source ones at the moment, though. https://huggingface.co/learn/cookbook/en/llm_gateway_pii_detection
For weights, we would have to pick one model that most people could QLoRA, and provide scripts for scrubbing secrets and for training on either CUDA or MPS (or can MLX weights be fused with PyTorch?). To be honest, I'm not even sure the literature supports the idea that fusing LoRAs equals distributed training. It would be more like adapting the model to our own use cases and then interpolating between them. Someone with more ML knowledge than me would need to have a master plan for how distributed training would work.
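For what it's worth, the "interpolating between them" part is simple to sketch, assuming the adapters share keys and shapes; whether the result is actually useful is the open question:

# Weighted average of two LoRA adapters' tensors (works on numpy arrays or torch tensors).
def interpolate_loras(adapter_a: dict, adapter_b: dict, alpha: float = 0.5) -> dict:
    return {key: alpha * adapter_a[key] + (1 - alpha) * adapter_b[key] for key in adapter_a}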
The more practical approach would be to simply scrub secrets, compile a shared dataset between us, and then share or open source any models we train on it.
My original idea for a similar distributed coding agent is below; people asked me for a Discord for it, which we can pick back up if this gains some traction: https://www.reddit.com/r/RooCode/comments/1lufep2/lets_train_a_local_opensource_model_to_use_roo/
Imakerocketengine@reddit
Love the idea, but privacy-wise it seems like a nightmare. We would need to scrub all personally identifiable information locally... and it gets worse: how can we do this on unstructured data?
Inevitable_Raccoon_9@reddit
Or you look up www.sidjua.com
Not_your_guy_buddy42@reddit
LOL I unironically love seeing when an abandoned idea that was too large for me solo gets built for real
Inevitable_Raccoon_9@reddit
Me plus Opus, Sonnet and Haiku - 2 weeks
ProfessionalSpend589@reddit
I'm not interested at the moment, but please do build it!
Currently for work, I chat with my private local model on my laptop over my personal internet connection, and I type the responses into my work computer manually. It's a bit limiting.