Offline Epstein File Ranker Using GPT-OSS-120B (Built on tensonaut’s dataset)
Posted by onil_gova@reddit | LocalLLaMA | 13 comments
I’ve been playing with the new 25k-page Epstein Files drop that tensonaut posted. Instead of reading 100MB of chaotic OCR myself like a medieval scribe, I threw an open-source model at it and built a local tool that ranks every document by “investigative usefulness.”
Everything runs on a single M3 Max MacBook Pro with open-source models only. No cloud, no API calls, no data leaving the machine.
What it does
• Streams the entire House Oversight release through openai/gpt-oss-120b running locally via LM Studio.
• Scores each passage on actionable leads, controversy, novelty, and power-linkage (see the sketch after this list).
• Outputs a fully structured JSONL dataset with headline, score, key insights, implicated actors, financial-flow notes, etc.
• Ships with an interactive local viewer so you can filter by score, read full source text, explore lead types, and inspect charts.
• Designed for investigative triage, RAG, IR experiments, or academic analysis.
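To make that concrete, here is a minimal sketch of one pass of the scoring loop. The prompt wording, the JSON field names, and the dataset's "text" column name are assumptions on my part; the actual prompt and schema live in the repo.

```python
# Minimal sketch of the scoring loop, assuming LM Studio's OpenAI-compatible
# server at localhost:5002/v1. Prompt wording, JSON field names, and the
# dataset's "text" column are assumptions; the repo is the source of truth.
import json
from datasets import load_dataset
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5002/v1", api_key="lm-studio")

SYSTEM_PROMPT = (
    "You are an investigative analyst. Score the passage 0-100 for "
    "investigative usefulness (actionable leads, controversy, novelty, "
    "power-linkage). Reply with JSON only: "
    '{"headline": ..., "score": ..., "key_insights": ..., '
    '"implicated_actors": ..., "financial_flows": ...}'
)

def score_passage(text: str) -> dict:
    resp = client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
        temperature=0,
    )
    # Real code should handle non-JSON replies; kept simple here.
    return json.loads(resp.choices[0].message.content)

rows = load_dataset("tensonaut/EPSTEIN_FILES_20K", split="train")
with open("scored.jsonl", "a") as out:
    for row in rows:
        record = score_passage(row["text"])  # "text" column name is assumed
        out.write(json.dumps(record) + "\n")
```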
Why it matters
This corpus is massive, messy, and full of OCR noise. Doing a systematic pass manually is impossible. Doing it with cloud models would be expensive and slow. Doing it locally means it’s cheap, private, and reproducible.
A full run costs about $1.50 in electricity.
Tech details
• Model: openai/gpt-oss-120b served at localhost:5002/v1
• Hardware: M3 Max, 128 GB RAM
• Viewer: simple JS dashboard with AG Grid, charts, and chunked JSONL loading
• Input dataset: tensonaut’s EPSTEIN_FILES_20K on Hugging Face
• Output: ranked chunks in contrib/, auto-indexed by the viewer (a loading sketch follows this list)
• Prompt: optimized for investigative lead scoring, with a consistent numerical scale (0–100)
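If you just want to consume the scored output, something like this pulls the top-ranked chunks out of contrib/. The "score" and "headline" field names follow the description above and may differ from the repo's actual schema.

```python
# Sketch: load all scored chunks from contrib/ and print the top 20 leads.
# Field names ("score", "headline") are assumed from the post's description.
import glob
import json

records = []
for path in glob.glob("contrib/*.jsonl"):
    with open(path) as f:
        records.extend(json.loads(line) for line in f if line.strip())

for rec in sorted(records, key=lambda r: r["score"], reverse=True)[:20]:
    print(f'{rec["score"]:>3}  {rec["headline"]}')
```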
Repo:
https://github.com/latent-variable/epstein-ranker
So far I’ve processed the first 5,000 rows myself and published the scored chunks in the repo. If anyone wants to help triage more of the dataset, the GitHub includes simple instructions for claiming a slice and submitting it as a contrib chunk. The workflow supports clean collaboration with automatic deduping.
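The repo documents the exact claiming workflow; as a rough illustration, the automatic dedup step when merging chunks can be as simple as keeping one record per document ID. The "doc_id" field name here is hypothetical.

```python
# Hypothetical sketch of merging contrib chunks with automatic deduping:
# keep the first record seen for each document ID. The "doc_id" field name
# is assumed; the repo's actual merge logic may differ.
import glob
import json

seen = {}
for path in sorted(glob.glob("contrib/*.jsonl")):
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            seen.setdefault(rec["doc_id"], rec)

with open("merged.jsonl", "w") as out:
    for rec in seen.values():
        out.write(json.dumps(rec) + "\n")
```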
If you’d rather build your own tools on top of the scored output or adapt the ranking method for other document dumps, go for it. Everything is MIT-licensed, fully local, and easy to extend.
Contributions, forks, or experiments are all welcome.
warycat@reddit
I want to see smaller models; not everyone has a 6000
alwaysSunny17@reddit
Have you seen any refusals from gpt-oss-120b?
I’m curious to see if Qwen3 would give better results
sypzowki@reddit
Qwen3 on Epstein files: no refusals
On Tiananmen June 89: no memories
oopsiiiie
onil_gova@reddit (OP)
No refusals so far
InterstellarReddit@reddit
Barack Obama isn’t even in the files? When was he in the files?
onil_gova@reddit (OP)
He seems to show up in older files from his presidency and in ones about Israeli attacks on Iran.
InterstellarReddit@reddit
So the information is correct; it's just the context that's wrong for that person
cafedude@reddit
Wait, why is Snowden in there?
onil_gova@reddit (OP)
Here is an example of a document containing Snowden
atape_1@reddit
Is it reproducible though? Have you tested the reproducibility?
onil_gova@reddit (OP)
You will get run-to-run variance unless you set:
• temperature to 0
• topK to 1
• topP to 0.1
• seed to a fixed value
However, even with run-to-run variance, the results were in the same ballpark. The goal is to surface the most valuable files, which this can realistically do based on the instructions.
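Roughly, in OpenAI-client terms against the same endpoint, that looks like the sketch below. Note that top_k is not a standard chat-completions parameter, so it goes through extra_body; LM Studio accepting it there is an assumption.

```python
# Sketch of pinning sampling for (near-)deterministic runs against the same
# LM Studio endpoint. top_k is not a standard OpenAI parameter, so it is
# passed via extra_body; LM Studio honoring it that way is an assumption.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5002/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Score this passage: ..."}],
    temperature=0,            # no sampling randomness
    top_p=0.1,
    seed=42,                  # any fixed seed
    extra_body={"top_k": 1},  # always take the single most likely token
)
print(resp.choices[0].message.content)
```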
cleverusernametry@reddit
You need to verify reproducibility. It's fundamental; otherwise the output is not trustworthy.
You should make 3 passes the default
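As a sketch of what that 3-pass default could look like, reusing the single-pass score_passage() from the sketch above and keeping the run with the median score:

```python
# Sketch of the suggested 3-pass default: score each passage three times and
# keep the median-score run to damp run-to-run variance. score_passage() is
# the single-pass scorer sketched earlier in the thread.

def score_passage_3pass(text: str) -> dict:
    runs = sorted((score_passage(text) for _ in range(3)),
                  key=lambda r: r["score"])
    return runs[1]  # middle run by score == the median
```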
random-tomato@reddit
It's certainly much easier to reproduce since it uses GPT-OSS-120B locally, versus querying an API model :)