Offline Epstein File Ranker Using GPT-OSS-120B (Built on tensonaut’s dataset)

Posted by onil_gova@reddit | LocalLLaMA

I’ve been playing with the new 25k-page Epstein Files drop that tensonaut posted. Instead of reading 100MB of chaotic OCR myself like a medieval scribe, I threw an open-source model at it and built a local tool that ranks every document by “investigative usefulness.”

Everything runs on a single M3 Max MacBook Pro with open-source models only. No cloud, no API calls, no data leaving the machine.

What it does
• Streams the entire House Oversight release through openai/gpt-oss-120b running locally via LM Studio.
• Scores each passage on actionable leads, controversy, novelty, and power linkage (a minimal scoring sketch follows this list).
• Outputs a fully structured JSONL dataset with headline, score, key insights, implicated actors, financial-flow notes, etc.
• Ships with an interactive local viewer so you can filter by score, read full source text, explore lead types, and inspect charts.
• Designed for investigative triage, RAG, IR experiments, or academic analysis.
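
For flavor, here’s a minimal sketch of that scoring loop, assuming LM Studio’s OpenAI-compatible endpoint at localhost:5002/v1. The prompt wording, field names, and passage list are illustrative stand-ins, not the repo’s exact code:

```python
import json
from openai import OpenAI

# LM Studio exposes an OpenAI-compatible API locally; no real key is needed.
client = OpenAI(base_url="http://localhost:5002/v1", api_key="lm-studio")

SYSTEM_PROMPT = (
    "Score the passage 0-100 for investigative usefulness, weighing "
    "actionable leads, controversy, novelty, and power linkage. "
    'Reply with JSON only: {"score": <int>, "headline": "<str>", '
    '"key_insights": ["<str>", ...]}'
)

def score_passage(text: str) -> dict:
    resp = client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
        temperature=0.0,
    )
    return json.loads(resp.choices[0].message.content)

# Stand-in for the real loader over the OCR dump.
passages = ["...OCR text of one document chunk..."]

# One JSON object per line -> JSONL, the same shape the viewer consumes.
with open("scored.jsonl", "a") as out:
    for passage in passages:
        record = score_passage(passage)
        record["source_text"] = passage
        out.write(json.dumps(record) + "\n")
```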

Why it matters
This corpus is massive, messy, and full of OCR noise. Doing a systematic pass manually is effectively impossible. Doing it with cloud models would be expensive and slow. Doing it locally means it’s cheap, private, and reproducible.

A full run costs about $1.50 in electricity.

Tech details
• Model: openai/gpt-oss-120b served at localhost:5002/v1
• Hardware: M3 Max, 128 GB RAM
• Viewer: simple JS dashboard with AG Grid, charts, and chunked JSONL loading
• Input dataset: tensonaut’s EPSTEIN_FILES_20K on Hugging Face
• Output: ranked chunks in contrib/, auto-indexed by the viewer (an example record is shown after this list)
• Prompt: optimized for investigative lead scoring, with a consistent numerical scale (0–100)
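
For reference, each output line is one self-contained record. Based on the fields described above, a record looks roughly like this (field names are illustrative; check the repo for the exact schema):

```json
{
  "doc_id": "house_oversight_000123",
  "score": 87,
  "headline": "Short summary of why the passage matters",
  "key_insights": ["..."],
  "implicated_actors": ["..."],
  "financial_flows": "Notes on money movement, if any",
  "source_text": "Original OCR passage"
}
```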

Repo:
https://github.com/latent-variable/epstein-ranker

So far I’ve processed the first 5,000 rows myself and published the scored chunks in the repo. If anyone wants to help triage more of the dataset, the repo includes simple instructions for claiming a slice and submitting it as a contrib chunk. The workflow supports clean collaboration with automatic deduping (a rough sketch of the dedupe idea follows).
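
Conceptually, the dedupe step only needs to keep one record per document across overlapping slices. A minimal sketch, assuming records carry a doc_id key (the repo may key on a different field):

```python
import glob
import json

# Merge all contributed chunks, dropping duplicate documents so
# overlapping slices don't get counted twice.
seen = set()
with open("merged.jsonl", "w") as out:
    for path in sorted(glob.glob("contrib/*.jsonl")):
        with open(path) as f:
            for line in f:
                rec = json.loads(line)
                key = rec.get("doc_id")  # assumed dedupe key
                if key in seen:
                    continue
                seen.add(key)
                out.write(json.dumps(rec) + "\n")
```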

If you’d rather build your own tools on top of the scored output or adapt the ranking method for other document dumps, go for it. Everything is MIT-licensed, fully local, and easy to extend.

Contributions, forks, or experiments are all welcome.