Your coding agent sessions are sitting on your machine right now. Big labs use this data internally. We could build an open equivalent.
Posted by No-Point1424@reddit | LocalLLaMA | View on Reddit | 26 comments
Every time you use Claude Code or Codex CLI in agent mode, it logs everything locally. The full loop: your task, the model's reasoning, every tool call, every environment response, every error and retry. Complete (state → action → reward → next state) tuples. The exact data format RL researchers dream about.
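To make the shape concrete, here's a minimal sketch of how those logs could be mined; the field names ("type", "assistant", "tool_result") are assumptions, since the actual on-disk schema differs between agents and versions:

import json
from pathlib import Path

# Hypothetical sketch: walk Claude Code session logs and pair each model turn
# with the environment response that followed it. Field names are assumptions.
def iter_tool_turns(session_file: Path):
    pending_action = None
    for line in session_file.read_text().splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("type") == "assistant":
            pending_action = event          # model reasoning + tool call
        elif event.get("type") == "tool_result" and pending_action is not None:
            yield pending_action, event     # action + environment response
            pending_action = None

for f in Path.home().glob(".claude/projects/**/*.jsonl"):
    for action, result in iter_tool_turns(f):
        pass  # assemble (state, action, reward, next state) tuples here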
I checked all my machines today.
Mac Mini:
~/.claude/projects/ 3.1GB 1103 files 574 agentic sessions
MacBook:
~/.codex/sessions/ 2.4GB 3530 files 79 agentic sessions
~/.claude/projects/ 652MB 316 files 99 agentic sessions
775 sessions with real tool calls. 41 million tokens.
Extrapolate to thousands of developers and we would have hundreds of billions of tokens of real agentic trajectory data. No Pile equivalent exists for this. It's just sitting on people's hard drives, being silently deleted.
Claude Code deletes logs after 30 days by default. Fix it now:
echo '{"cleanupPeriodDays": 36500}' > ~/.claude/settings.json
Why this data matters
The environment always tells you if it worked. Exit code 0 or not. Tests pass or not. This is the missing training signal: causal reasoning, error recovery, long-horizon planning. Things current models are genuinely bad at.
Big labs already collect this. Every Claude Code and Codex session trains proprietary models. There's no open equivalent, not because the data doesn't exist, but because it's fragmented across developer machines.
The proposal
Federated learning. Your data never leaves your machine. You train a small LoRA adapter locally, share only the weights with differential privacy noise, and get an improved global model back. Everyone contributes compute and signal, and nobody exposes their data. Alternatively, we could anonymize the traces, publish them as a dataset, and fine-tune a model on it.
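Roughly, the aggregation step on the receiving end would look something like this; a sketch only, with illustrative clip/noise values rather than calibrated privacy parameters:

import numpy as np

# Clip each client's LoRA delta to bound its norm, add Gaussian noise, then average.
# clip_norm and sigma are illustrative, not a tuned differential-privacy budget.
def aggregate(client_deltas: list, clip_norm: float = 1.0, sigma: float = 0.01) -> np.ndarray:
    noisy = []
    for delta in client_deltas:
        scale = min(1.0, clip_norm / (np.linalg.norm(delta) + 1e-12))
        noisy.append(delta * scale + np.random.normal(0.0, sigma * clip_norm, delta.shape))
    return np.mean(noisy, axis=0)  # the new global adapter update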
Check your own machines
du -sh ~/.codex/sessions/ 2>/dev/null
du -sh ~/.claude/projects/ 2>/dev/null
find ~/.codex/sessions/ -name "*.jsonl" | wc -l
find ~/.claude/projects/ -name "*.jsonl" | wc -l
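To separate agentic sessions (ones with real tool calls) from plain chats, something like this works, assuming a marker string such as "tool_use" shows up in your agent's log format:

from pathlib import Path

# Count session files that contain at least one tool call.
# "tool_use" as a marker is an assumption about the log format.
dirs = [Path.home() / ".codex" / "sessions", Path.home() / ".claude" / "projects"]
agentic = sum(
    1
    for d in dirs if d.exists()
    for f in d.rglob("*.jsonl")
    if "tool_use" in f.read_text(errors="ignore")
)
print(agentic, "sessions with tool calls")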
Drop your numbers in the comments. I want to know the actual scale sitting unused across this community.
If there's enough interest we can build this out.
Sense_Nom@reddit
The more interesting angle for me is what's in those traces from a data perspective. Agent sessions that interact with user inputs — especially in anything production-facing — can accumulate PII, credentials, internal endpoints. Nobody talks about the hygiene of what the agent processed, only what it produced. Would be curious whether a data collection effort like this would sanitize traces before aggregating.
LA_rent_Aficionado@reddit
I actually built this a few months back but haven't opened up my private repo because I need to tweak it a bit.
I'll open it up now before anyone wastes any time recreating this but consider it beta: https://github.com/thad0ctor/SFTizer/
It formats Gemini, Claude, Cursor, and Cline chats into SFT dataset formats (including multi-turn) with PII scrubbing (which needs to be improved).
Disclaimer: 100% vibe coded
teleprint-me@reddit
As tempting as it is to use these tools, I've decided to opt out of using them completely.
Just build a wrapper around llama.cpp and you'll end up with the same result without the headaches involved in using one of these tools.
The last straw for me was when they wanted ownership of the model's outputs relative to the author's inputs.
So if you come up with something novel, they want a percentage of ownership.
Considering they stole the data, then sold it back to people, and then claim it's their private property, I'm all set. I'm fully local now for these reasons.
If that means I can't use the latest SOTA models above 30B params, then so be it.
I'm saving money long term by doing this.
12 mo/yr @ min $20/mo = $240/yr, or $1,200 over 5 years.
So, in a 5-year span, you could have bought a GPU to run a model locally at the minimum cost per usage.
There are people doling out hundreds per month.
12 mo/yr @ min $200/mo = $2,400/yr, or $12,000 over 5 years.
You will spend more money in the long term by building reliance on these remote systems, give up property rights (partial at minimum), give up privacy, and enable profiling (watch lists). It doesn't even matter if you're legit or morally inclined.
Just do the math. You'll see that it's better to be local. Remote usage is unsustainable long term and just as harmful, if not more so, at scale.
ttkciar@reddit
We already have Open Code and local models, though.
No-Point1424@reddit (OP)
Local models don't have access to Codex and Claude Code sessions. I think one of the reasons OpenAI gives generous credits, and even 2x for a few months, is the data they get. They can use that for RL on the next run. Cursor does the same thing, and Claude Code too. All SOTA open coding models are out of compute reach for many people, and many models small enough to run locally are not good enough yet.
ttkciar@reddit
Okay, but we already have Open Code. Why not use that instead of Claude Code?
No-Point1424@reddit (OP)
I'm talking about the model, not the agent. We can distill/train on outputs from Opus and 5.3 Codex.
Far-Association2923@reddit
This would need to be "opt in," not standard. I agree, though, that the amount of data the open-source community could pull together and share would be massive for training open-source models. There would also be nothing stopping the big boys from using this data as well, although they already have a lot of our data.
If someone has the resources to store this massive amount of data for model training, I would gladly implement this into my app. Maybe just an open-source vector store so it compresses down to LLM-searchable data?
No-Point1424@reddit (OP)
I wish there were a way/breakthrough to train models decentralized at scale.
Far-Association2923@reddit
just needs the support of some infrastructure https://github.com/frumu-ai/trace-share
CatConfuser2022@reddit
Related: https://www.reddit.com/r/ClaudeCode/comments/1re0qa1/dataclaw_publish_your_claude_code_conversations/
BC_MARO@reddit
Worth noting those session logs contain API keys, file paths, and code in plain text. The federated approach is smart but scrubbing credentials from the JSONL before anything leaves the machine is non-negotiable.
Far-Association2923@reddit
Not hard to do. I have a check-secrets.js that scans each commit and refuses the commit if it fails. Sometimes you get false positives, but it works great.
BC_MARO@reddit
Commit hooks handle what goes into git, but the JSONL session files are written directly to disk by the agent runtime and never touch a commit at all, so you'd need scrubbing at the write layer itself.
Far-Association2923@reddit
Yes, I was just using a git commit scrubber as a reference here. There are many Rust crates that can do all the scrubbing we would require by default, plus some specific anonymization filtering. I'm all for creating a repo to build this as a tiny Rust binary 😁 The question would be "where" the data gets pushed to. I'm not sure, for example, whether IPFS could handle the size it might grow into, and pinning costs each user $$$. Maybe we can convince Upstash to support this: https://upstash.com/open-source
BC_MARO@reddit
The Rust binary idea is solid -- a lightweight sidecar that hooks into the JSONL write path and strips credentials in place before anything accumulates. Upstash's serverless Redis might actually work for federated weight aggregation since you'd be pushing small diffs, not raw session data.
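For reference, the scrubbing pass itself is not much code. A rough Python sketch (the actual sidecar would presumably be the Rust binary discussed above, and these patterns are examples, not an exhaustive secret detector):

import re
from pathlib import Path

# Replace obvious credential patterns in a session log, in place.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9_\-]{20,}"),    # OpenAI-style API keys
    re.compile(r"AKIA[0-9A-Z]{16}"),          # AWS access key IDs
    re.compile(r"ghp_[A-Za-z0-9]{36}"),       # GitHub personal access tokens
]

def scrub_jsonl(path: Path) -> None:
    text = path.read_text()
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    path.write_text(text)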
Far-Association2923@reddit
I wonder if $1000 credits a month would be enough. I suppose it also depends on how many people chip in to submit data.
I'm already working with a Rust project that can auto-compile cross-platform and publish to crates.io and npm via GitHub Actions. That code would dramatically speed up development. I don't have much experience with Rust binaries that just auto-run at startup, though. I suppose it could also be something you call at random intervals to "submit" your data.
Far-Association2923@reddit
Here’s my proposal / early prototype:
https://github.com/frumu-ai/trace-share
It’s an opt-in local pipeline for coding-agent traces:
ingest logs → sanitize/redact → convert to episode records → preview/export training-ready payloads.
Quick test (Windows/macOS/Linux):
Nothing is uploaded unless you explicitly run publish paths with review/yes gates.
Current focus is local validation + schema iteration.
On my machine, raw Codex logs were ~700MB; the sanitized, trainable episode export was ~36MB.
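For anyone wondering what an "episode record" looks like after sanitization, here's a hypothetical example of the shape (not the actual trace-share schema, just an illustration):

# Hypothetical episode record; the real trace-share schema may differ.
episode = {
    "source": "codex",
    "task": "fix failing test in parser.py",
    "turns": [
        {"role": "assistant", "tool": "bash", "input": "pytest -x"},
        {"role": "tool", "exit_code": 1, "output": "1 failed"},
        {"role": "assistant", "tool": "edit", "input": "[patch]"},
        {"role": "tool", "exit_code": 0, "output": "3 passed"},
    ],
    "outcome": "success",  # derived from the final exit code / test result
}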
To move from prototype to public dataset ops, we still need:
1) vector index/search infra (curation/dedupe/discovery)
2) blob storage + hosting for snapshot artifacts (R2 or Hugging Face)
Looking for contributors/support:
- source adapters (OpenCode/Windsurf/Trae/Roo/Kilo/etc.)
- sanitizer hardening
- feedback from model trainers on episode schema + export formats
- testers / security auditors
RedParaglider@reddit
I was thinking the same, what an opsec nightmare lol.
sine120@reddit
I'd rather keep LocalLLaMA's personalized porn out of my training data, thanks.
openSourcerer9000@reddit
I'm so down. I tried to get some momentum on a similar idea about a year ago, just getting local models to use coding harnesses. Things moved so fast in that time that it didn't make sense to lock into one model/coding harness, though.
With sota changing every week, it's exciting but it makes it difficult for the research community to actually build anything on top of the base models. At this point, it's best to stay model agnostic until winners are picked and things actually start to settle down. For this reason, data sets are more useful than weights, but either would be nice to have if we can actually improve performance of one of these models.
People are probably on LocalLLaMA in the first place because they're data privacy nuts. For personal identifiers and secrets, I know there are secret scrubber models out there; I don't have links to open-source ones at the moment, though. https://huggingface.co/learn/cookbook/en/llm_gateway_pii_detection
For weights, we would have to pick one model that most people could QLoRA, and provide scripts for scrubbing secrets and for training on either CUDA or MPS (or can MLX weights be fused with PyTorch?). To be honest, I'm not even sure the literature supports the idea that fusing LoRAs equals distributed training. It would be more like adapting the model to our own use cases and then interpolating between them. Someone with more ML knowledge than me would need to have a master plan for how distributed training would work.
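For what it's worth, the "interpolating between them" part is simple to sketch, assuming the adapters share keys and shapes; whether the result is actually useful is the open question:

# Weighted average of two LoRA adapters' tensors (works on numpy arrays or torch tensors).
def interpolate_loras(adapter_a: dict, adapter_b: dict, alpha: float = 0.5) -> dict:
    return {key: alpha * adapter_a[key] + (1 - alpha) * adapter_b[key] for key in adapter_a}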
The more practical approach would be to simply scrub secrets, compile a shared dataset between us, and then share or open source any models we train on it.
My original idea for a similar distributed coding agent is below; people asked me for a Discord for it, which we can pick back up if this gains some traction: https://www.reddit.com/r/RooCode/comments/1lufep2/lets_train_a_local_opensource_model_to_use_roo/
Imakerocketengine@reddit
Love the idea, but privacy-wise it seems like a nightmare. We would need to scrub all personally identifiable information locally... and it gets worse: how can we do this on unstructured data?
Inevitable_Raccoon_9@reddit
Or you look up www.sidjua.com
Not_your_guy_buddy42@reddit
LOL I unironically love seeing when an abandoned idea that was too large for me solo gets built for real
Inevitable_Raccoon_9@reddit
Me plus Opus, Sonnet and Haiku - 2 weeks
ProfessionalSpend589@reddit
I'm not interested at the moment, but please do build it!
Currently for work, I chat with my private local model on my laptop over my personal internet connection, and I type the responses into my work computer manually. It's a bit limiting.