20,000 Epstein Files in a single text file available to download (~100 MB)
Posted by tensonaut@reddit | LocalLLaMA | View on Reddit | 325 comments
I've processed all the text and image files (~25,000 document pages/emails) within the individual folders released last Friday into a two-column text file. I used Google's Tesseract OCR library to convert the JPGs to text.
You can download it here: https://huggingface.co/datasets/tensonaut/EPSTEIN_FILES_20K
I uploaded it yesterday, but some of the files were incomplete. This version is complete. For each document, I've included the full path to the original Google Drive folder from the House Oversight Committee so you can link and verify contents.
I used Mistral 7B to extract entities and relationships and build a basic GraphRAG. There are some new "associations" that have not been reported in the news, but I couldn't find any breakthrough content. Also, my entity/relationship extraction was quick and dirty. I'm sharing this dataset for people interested in getting into RAG and digging deeper to get more insight than what meets the eye.
In using this dataset, please be sensitive to the privacy of the people involved (and remember that many of these people were certainly not involved in any of the actions which precipitated the investigation.) - Quoted from Enron Email Dataset release
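OP's "quick and dirty" entity/relationship extraction can be sketched roughly like this. The prompt, the `call_llm` stub, and the JSON schema are hypothetical stand-ins (the post doesn't specify them), with the actual Mistral 7B call replaced by a canned response so the sketch runs as-is:

```python
import json

# Hypothetical prompt for an instruct model such as Mistral 7B.
PROMPT = (
    "Extract entities and relationships from the document below. "
    'Reply with JSON: {"edges": [{"source": ..., "relation": ..., "target": ...}]}\n\n'
)

def call_llm(prompt: str) -> str:
    """Stub standing in for a real Mistral 7B call; returns canned JSON."""
    return '{"edges": [{"source": "Epstein", "relation": "emailed", "target": "Associate A"}]}'

def extract_edges(document_text: str) -> list[tuple[str, str, str]]:
    """Ask the model for edges and parse them into (source, relation, target) triples."""
    raw = call_llm(PROMPT + document_text)
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []  # quick and dirty: skip documents the model mangles
    return [(e["source"], e["relation"], e["target"]) for e in data.get("edges", [])]

edges = extract_edges("example document text")
print(edges)
```

The collected triples become the edge list of the knowledge graph that later comments query.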
Unhappy_Actuator4375@reddit
Sorry, but I can't download the file... have they already taken it down?
BroadPresentation974@reddit
links not working no more
Training_Royal8061@reddit
Won’t let me anymore
Training_Royal8061@reddit
Can’t download it anymore:(
missanthropoet@reddit
It was removed and I can’t fetch it. Can someone who has it send it to me?
philthewiz@reddit
Post this on r/epstein please. They might like it.
tensonaut@reddit (OP)
Please feel free to share, my account isn't old enough to post on that sub
Effective_Panic756@reddit
Legend for making this post tho
Embarrassed_Ad3189@reddit
The famous "reverse Epstein" policy
Dangerous-Arugula-81@reddit
Exactly, my phone is not showing it anymore
HomeBrewUser@reddit
Ironic...
Multifarian@reddit
https://i.redd.it/jpg5l5one32g1.gif
RealMelonBread@reddit
Bruh
mineyevfan@reddit
Hahahaha
Melody_in_Harmony@reddit
Bruh. Lmao
phoez12@reddit
Legendary comment in the making
bakawakaflaka@reddit
Holy shit
Nikilite_official@reddit
best comment of all time
derailius@reddit
wrecked.
Artyom_84@reddit
Powerful comment. Top 3 of the year for me.
SwarfDive01@reddit
doodlinghearsay@reddit
That's dark
MrPecunius@reddit
🏆
drplan@reddit
Seems like a MINOR problem...
maifee@reddit
Done
philthewiz@reddit
I don't have the technical know-how to answer questions about it or to elaborate on what you did, so I might just copy paste this with an introduction. Let me know if you want me to dm you the link once it's done.
tensonaut@reddit (OP)
Thanks for circling back on this. Feel free to share anywhere else you think it's relevant.
TheMightyMisanthrope@reddit
Former Prince Albert may be on his way to text you, beware
9011442@reddit
You should fit right in then.
madmax_br5@reddit
I have a whole graph visualizer for it here: https://github.com/maxandrews/Epstein-doc-explorer
There is a hosted link in the repo; can't post it here because reddit banned it sitewide (not a joke, check my post history for details)
There are also preexisting OCR'd versions of the docs here: https://drive.google.com/drive/folders/1ldncvdqIf6miiskDp_EDuGSDAaI_fJx8
Commercial-Camel-870@reddit
Hi I came to this thread by way of seeking a downloadable file of the document release that I could give to ChatGPT to help search for key words and information. I see now that my plan seems elementary compared to the type of processing you all are doing. I’m also old and fairly ignorant to technology compared to you all.
My goal was to search for connections/ relationships as they pertain to the Interlochen School for Arts. And or Traverse City Michigan / surrounding area.
My caution, prior to clicking the link was “do I know what I’m actually downloading” as we are talking about what seems to be the largest, most substantial coverup in recent history. And the worry is that somehow the information could be sabotaged to where all of a sudden I’m complicit in downloading/ sharing things that are illegal.
I guess I'm just asking you because your work seems extensive. Am I being too paranoid? Are the downloads on this thread safe? And is the thought of using ChatGPT to search these files ignorant?
tensonaut@reddit (OP)
Interesting work - I hope someone uses the full dataset to do something similar.
The demo and docs seem to contain only ~2,800 documents. It seems they didn't include the emails/court proceedings/files embedded in the jpg images, which account for over 20,000 files.
madmax_br5@reddit
oh really? I'll definitely add your extracted docs then!
madmax_br5@reddit
Running in batches now...
madmax_br5@reddit
Dang approaching my weekly limit on claude plan. Resets thursday AM at midnight. I've got about 7800 done so far, will push what I have and do the rest Thursday when my budget resets. In the meantime I'll try qwen or GLM on openrouter and see if they're capable of being a cheaper drop-in replacement, and if so I'll proceed out of pocket with those.
Right_Fondant_2827@reddit
Hello, the original download link is down and the OP's account is dead. May I know if you have the documents with you, and how I can go about downloading the recently released files? Thank you
madmax_br5@reddit
[ Removed by Reddit ]
thebrokestbroker2021@reddit
Qwen should be good; the VL model is pretty good compared to Google Vision. I should have already done the rest but only have about 4,000 done, trying to do it locally lol.
madmax_br5@reddit
So far I'm having the best price/perf ratio with GPT-OSS-120B (working from the pre-OCRd text files). GPT-OSS is actually outperforming claude haiku on this particular task, though not quite as reliable (more json parsing issues).
thebrokestbroker2021@reddit
I ALMOST recommended that as well, at least the 20B for summarizing. I need to try 120B on a rented server lol
madmax_br5@reddit
just use openrouter or fireworks. it's really cheap.
thebrokestbroker2021@reddit
Ok when you put it like that lol
PentagonUnpadded@reddit
Is it completely idiotic to try and process the data on a local LLM? I want to be doing what you are doing in a year, and this Epstein data release is energizing.
I'm trying to follow the style of work you are doing for my own education, using qwen3-14b running on a local 5090. After around half an hour, I'm at 54/24556 chunks. That is on pace to finish in 9 days.
This is my first project with LightRAG, immediately after running the Christmas Carol example. I understand this is not going to be practically useful like yours, and I'm hoping to get to 'basic portfolio project' levels of completion. Do you have pointers on how I can make this finishable? Ideally something that can run in under 24 hours and produce a result I can put in a portfolio.
I'm thinking I could use a faster model (3B?), or more parallelization (I'm at 550W/600 already, using MAX_ASYNC=6 and MAX_PARALLEL_INSERT=3). And probably the easiest: do you know how I could cut down on the input space? Some way of filtering out 90% of the documents?
Appreciate any insights, and I'll be watching your Gh for updates. Cheers Madmax.
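For what it's worth, the "finish in 9 days" pace estimate above checks out arithmetically:

```python
# Rough completion-time estimate from the figures in the comment above.
chunks_done, total_chunks = 54, 24556
elapsed_hours = 0.5  # "around a half hour"

rate_per_hour = chunks_done / elapsed_hours          # 108 chunks/hour
days_to_finish = total_chunks / rate_per_hour / 24   # roughly 9.5 days
print(round(days_to_finish, 1))
```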
madmax_br5@reddit
OK so the question here is whether or not the local models are relevant to your portfolio. Are you trying to show off that you can run models locally, or that you can produce something cool with models, generally? Local models have a huge handicap for bulk data analysis like this because you can't scale them. You won't get the throughput on one request and you won't be able to batch multiple requests in one.
My advice to you would be don't tie yourself to one way of getting inference if it's not important to your end result -- use the best tool for the job. If you want to build a UI demo, just use an existing dataset! If you want to build an extraction or data analysis demo, use serverless models you can batch! I would only use local models if that's part of what you're trying to demonstrate.
PentagonUnpadded@reddit
These are valid critiques of my extremely naive approach. I find the GraphRAG technology cool and have a personal affinity for local models, but those aren't a great fit for a dataset like this.
I pivoted to a smaller, personal dataset from some friends' creative writing group. The LightRAG server and included UI are producing interesting results, and it was really simple to set up. Built something cool in a day, mission accomplished. I highly encourage LightRAG to other devs reading this who want something quick and easy to use.
Cheers madmax, thanks for the detailed reply. Hope to keep seeing you around the sub.
horsethebandthemovie@reddit
opencode has free glm branded as big pickle + a couple others
starlocke@reddit
!remindme 3 days
RemindMeBot@reddit
I will be messaging you in 3 days on 2025-11-21 09:24:38 UTC to remind you of this link
madmax_br5@reddit
OK I updated the database with most of the new docs. Ended up using GPT-OSS-120B on vertex. Good price/performance ratio and it handled the task well. I did not have very good luck with models smaller than 70B parameters; the prompt is quite complex and I think would need to be broken apart to work with smaller models. Had a few processing errors so there are still a few hundred missing docs, will backfill those this evening. Also added some density-based filtering to better cope with the larger corpus.
tensonaut@reddit (OP)
Amazing work. I would suggest that you make a short video on how to do some basic search as an example, post it here, and give a brief description of your pipeline with links to the render demo and GitHub. I'd also suggest posting it on r/Epstein, r/RAG, and r/DataHoarder along with the video. Seriously, good work.
Jackloco@reddit
Pretty circles
gootecks@reddit
incredible work, wow!
TheCactusPlant@reddit
Why is it gone
Asleep-Statement-300@reddit
Not available
olearyboy@reddit
You know those apps that let you ‘speak with the dead’…..
Living_Cook1123@reddit
What??
Chemical-Cheek-7224@reddit
It’s gone
Lunyan4@reddit
Link not working anymore :(
MammaBear1438@reddit
They dont work anymore
No-Professional2047@reddit
Epstein pictures
YouPuzzleheaded2706@reddit
I found an interesting Epstein file: https://www.justice.gov/epstein/files/DataSet%2010/EFTA01648734.mp4 , https://www.justice.gov/epstein/files/DataSet%2010/EFTA01648692.mp4 , and this is, I think, the same girl from the first file I sent: https://www.justice.gov/epstein/files/DataSet%2010/EFTA01648730.mp4 , https://www.justice.gov/epstein/files/DataSet%2010/EFTA01648728.mp4
whodatindaback@reddit
this got to be a virus
CucumberJust3417@reddit
Y
DependentBass1390@reddit
Of course it’s gone
FelipeMantri@reddit
https://www.armstrongpowerhouse.com/
https://www.rbf.org/
https://birn.eu.com/
Those 3 sites had massive data transfers after the files started going public. Same server?
FelipeMantri@reddit
https://www.ekt.gr/en/index
https://www.indexoncensorship.org/
Those 2 followed a few seconds later. I wonder why.
Even-Surround625@reddit
Not found
saphirwMRK@reddit
Looks like it’s been deleted. I can’t find anything when I click on the link.
OptimisticEmo-Doll99@reddit
Anyone still have the pdf ?
onelabz@reddit
anybody still got access?
someone383726@reddit
A new RAG benchmark will drop soon. The EpsteinBench
Daniel_H212@reddit
Please someone do this it would be so funny
Basel_Ashraf_Fekry@reddit
https://pro-pug-powerful.ngrok-free.app/
RaiseRuntimeError@reddit
The people want The EpsteinBench released!
CoruNethronX@reddit
We had an EpsteinBench ready for launch yesterday; only the domain name had to propagate, but the files disappeared along with the storage and servers. We can't even contact the hoster, seems like it's vanished as well.
petrx@reddit
And the web developer committed suicide while on suicide watch
LaughterOnWater@reddit
Release the EpsteinBench!
booi@reddit
There was no EpsteinBench. it was a hoax
mrfouz@reddit
The EpsteinBench didn’t delete himself!!!
Infinite-Ad-8456@reddit
EpsteinBenchGate
Firepal64@reddit
Why is everyone still talking about EpsteinBench? Old news.
AI-On-A-Dime@reddit
Are people still talking about the EpsteinBench?? We have AIME, we have Livecodebench. You want to waste your time with this creepy bench? I can’t believe you are asking about EpsteinBench at a time like this when GPT 5.1 just released and Kimi K2 thinking just crushed
tensonaut@reddit (OP)
RAG is an IR problem which needs its own benchmark. RAG benchmark ≠ LLM benchmark
mcilrain@reddit
All the Epstein-related benchmarks that have been released are all we have.
PANIC_EXCEPTION@reddit
Trachu90@reddit
Everybody has to see this please let people find out, seems everyone forgot these babies testimony please make people see that: https://www.youtube.com/watch?v=VDqOTJarTdM
bussolon@reddit
Benchstein
OkDesk4532@reddit
MMD! :)
Agent_Pancake@reddit
Thats one way to force the government to regulate AI
theMonkeyTrap@reddit
they will all be benchmarking on how many 'trump' references we can locate in these files.
re_e1@reddit
💀
PentagonUnpadded@reddit
Hijacking this top comment. Can someone suggest local RAG tooling? Microsoft's GraphRAG has given me nothing but headaches and silent errors.
PeachScary413@reddit
Iory1998@reddit
The best idea I've heard in months! I am all in :D
Basel_Ashraf_Fekry@reddit
Guys I made it!!
It's up for 3 hours as it's running on colab's free tier.
https://pro-pug-powerful.ngrok-free.app/
SecurityHamster@reddit
This seems fascinating. As a fan of self hosted LLMs but also someone who can only run the models I get from hugging face, would you be able provide instructions/guidance on adding more source documents to this?
Frequent_Use440@reddit
Isn't downloading the PDFs from that page illegal? I have my doubts.
CommunicationDry2964@reddit
guys, i haven't seen anything please send me
Annual-Smile-4874@reddit
DOJ Document EFTA00331112 shows the creation of a false travel itinerary on 14 Jan 2016 for a girl or woman. The request was made by Epstein's associate Lesley Groff to Amex Centurion Travel. Epstein booked through a Russian-speaking manager of the Centurion company, Natalia Molotkova. Both ChatGPT and Gemini analyzed the document and concluded it was evidence of intentional trafficking. It reads, in part -
"Customer [REDACTION] 01/14/2016 10:45 AM We need to find a flight that departs Rome and goes to London on the 29th for [REDACTION]
...this flight should depart around the same time the flight we are holding for Rome to Miami (10:35am) No return for this flight... This is a decoy flight...she will no really take it...but she needs to show an itinerary for this flight...can you put something together for me?"
https://www.justice.gov/epstein/files/DataSet%209/EFTA00331112.pdf
ThinNeighborhood2421@reddit
https://youtu.be/8XLL16NDmo0?si=4W1y9XcT_n16-B3I
Someone3436@reddit
both of the links don’t work :(
swiiftea@reddit
https://huggingface.co/spaces/theelderemo/epstein-files
https://github.com/theelderemo/Epstein-files
Roxy_Haven@reddit
How and where can I see the photos and videos without downloading to and killing my phone
Fun_Bullfrog7001@reddit
Modi
SiThreePO@reddit
And....it's gone
Bartsworld211@reddit
EFTA00067066.pdf https://share.google/fTeWe2ifILLC9bJyz
Free_Handle4853@reddit
It no longer works? Can someone share the files with me
Trachu90@reddit
Everybody has to see this please let people find out, seems everyone forgot these babies testimony please make people see that: https://www.youtube.com/watch?v=VDqOTJarTdM
JustDifferent1111@reddit
Link is not working or it's just me?
Jatsu21@reddit
File link gone?
FatMax25@reddit
Removed?
BryerM@reddit
Your link's been scrubbed
Nicky150@reddit
Can you share the new ones posted before they redacted some images?
zezenia_art@reddit
I can't download it now
MidnytLavenderHaze10@reddit
Where to see Epstein files?
RedRumRoxy@reddit
Where can I get the Epstein files?
Neither-Rest8913@reddit
Neither link works 404 error msgs
Individual_Season803@reddit
Links are down
Amber-Lynn3@reddit
Anyone still have this file I can download since they removed it - I know I’m late but I want to read them since today’s crap is literally crap. Per usual
Main_Leek_4453@reddit
Links no longer work
AttemptMoss@reddit
Here's a link to the 20k emails guys
https://journaliststudio.google.com/pinpoint/search?collection=092314e384a58618&utm_source=collection_share_link
cruncherv@reddit
He should have removed those line breaks. They make two-word searches difficult, since the next word might be on the next line.
cruncherv@reddit
404
Sorry, we can't find the page you are looking for.
NomadsOfAmerica@reddit
The link no longer works
Economy-Department47@reddit
The Hugging Face thing is not there anymore; it says 404
IvanHappy@reddit
Hi, friend. The link doesn't work. Error 404. Could you update it? Thanks
Fun_Cucumber6013@reddit
Oh my god
Amazing_Trace@reddit
now if we could uncensor all the FBI redactions
AllanSundry2020@reddit
You actually can often see them if there is a photo image of the email accompanying it (yes, they did that!). The image is unredacted while the email is redacted.
MyBrainsShit@reddit
I'm just going to leave this here on an unrelated note (a great vision model with which I've had great experience on various topics): qwen3-vl-4b + a good prompt along the lines of "Convert the content of this image to .md"
Ansible32@reddit
Have to wonder if this was malicious compliance on the part of the FBI. It's actually pretty hard to imagine anyone doing this work who would feel motivated to protect Trump, no matter how much they worship him, they wouldn't really believe he had anything to hide.
AllanSundry2020@reddit
this redditor seems to have combined the folders of images into PDF https://www.reddit.com/r/PritzkerPosting/s/CVmPL7v9ay might make it easy to use with LLM
yldave@reddit
Maybe u/tensonaut can use the image v email diff filtered to public figures/politicians to give us a way to query the redacted.
LaughterOnWater@reddit
Create an LLM LoRA that proposes the likely redacted content with confidence measured in font color (green = confident, brown = sketchy, red = conspiracy theory zone)
Amazing_Trace@reddit
I'm not sure there's a dataset to finetune on for any sort of reliability in those confidence classifications lol
LaughterOnWater@reddit
Try pornhub?
PentagonUnpadded@reddit
This is a tremendous idea!
do-un-to@reddit
Hey- What if we did some kind of probabilistic guessing of redactions based off analyzed patterns of related training data?
Individual_Holiday_9@reddit
You’d have people gaming data to replace all instances of GOP donors with ‘George Soros’
do-un-to@reddit
Be careful of the corpus you use for training.
FaceDeer@reddit
We've got LLMs, they're specifically designed to fill in incomplete text with the most likely missing bits. What could go wrong?
StartledWatermelon@reddit
LLMs are actually designed to provide the probability distribution over the possible fill-ins. If this fits your goal, nothing would go wrong. But probabilities are just probabilities.
shockwaverc13@reddit
finally an actual use for BERT
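The "probability distribution over fill-ins" idea above can be shown with a toy bigram model standing in for a masked LM like BERT. This is just the concept on a tiny made-up corpus, not real masked-LM inference:

```python
from collections import Counter, defaultdict

# Toy stand-in for a masked LM: P(word | previous word) from bigram counts.
corpus = "the flight to london the flight to rome the decoy flight".split()

following = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    following[prev][word] += 1

def fill_in(prev_word: str) -> list[tuple[str, float]]:
    """Return candidate fill-ins for the word after prev_word, with probabilities."""
    counts = following[prev_word]
    total = sum(counts.values())
    return [(w, c / total) for w, c in counts.most_common()]

print(fill_in("to"))  # [('london', 0.5), ('rome', 0.5)]
```

A real masked LM does the same thing with far richer context, which is exactly why its guesses are probabilities, not recovered facts.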
tertain@reddit
Seems within the realm of possibility that the guy that normally does the redactions and understands the methodology was fired and replaced with a Pizza Hut delivery driver that beat up a black guy once. So, we’ll have to see what happens.
Robonglious@reddit
Wait, what happened? Did they actually release the files?
ThePixelHunter@reddit
Nothing ever happens
Allitig8r@reddit
All this for a bunch of chicks who let an old man feel them up/fuck them for money and/or drugs?
We had a word for these girls back in the day, actually two words, sluts and whores. All of them fit this category and no crying about being a whore when you were younger makes you not a whore sorry.
Background-Owl-9183@reddit
Glad to see no reference to DJT, whew, that was a close one!
Pleasant-Double-8944@reddit
Anyone download this that can send it to me?
Whole-Assignment6240@reddit
Impressive OCR work at this scale. Did you experiment with structured extraction for entity relationships, or is this purely raw converted text?
Whatdoyoufightfor98@reddit
This account AND the links are DELETED WTF
meccaleccahimeccahi@reddit
Thanks for putting this dataset together. I actually used your release for a weekend side experiment.
I work a lot with log analytics tooling, and I wanted to see what would happen if I treated the whole corpus like logs instead of documents. I converted everything to plain text, tagged it with metadata (doc year, people, orgs, locations, themes, etc.), and ingested it into a log engine in my lab to see how the AI layer would handle it.
It ended up working surprisingly well. It found patterns across years, co-occurrence clusters, and relationships between entities in a way that looked a lot like real incident-correlation workflows.
If you want to see what it did, I posted the results here (and you can log in to the tool and chat with the AI about the data)
https://www.reddit.com/r/homelab/comments/1p5xken/comment/nqxe3lt/
Your dataset made the experiment a lot more interesting, so thanks again for making it available!
Fast_Description_337@reddit
This is fucking genius!
tensonaut@reddit (OP)
Thanks! This sub has also come together to create tools for this dataset; we curate them here: https://github.com/EF20K/Projects
I love this sub :)
ninteendayswithLLMs@reddit
that's crazy, instant Epstein RAG
Top_Independence4067@reddit
How to download tho?
tensonaut@reddit (OP)
You can go to this link and click on the down arrow icon next to the file to download it: https://huggingface.co/datasets/tensonaut/EPSTEIN_FILES_20K/tree/main
Top_Independence4067@reddit
Thanks! :)
Taikari@reddit
go here select use this data set
Top_Independence4067@reddit
Oh ok thanks!
Taikari@reddit
then choose one of the methods
qwer1627@reddit
I am throwing this into Milvus now, what do you wanna know or try to ask?
ghostknyght@reddit
what are the ten most commonly mentioned names
what are the ten most commonly mentioned businesses
of the most commonly named individuals and businesses what are the subjects the both have most in common
qwer1627@reddit
https://svetimfm.github.io/epstein-files-visualizations (go to the first visualization)
ghostknyght@reddit
haha my man. very nice sir.
qwer1627@reddit
wait a minute, this is a header file for the Files repo itself innit?
Converting all these docs into embeddings is an AWS bill I just don't wanna eat whole...
InnerSun@reddit
I've checked and it isn't that expensive all things considered:
There are 26k rows (documents) in the dataset.
Each document is around 70000 tokens if we go for the upper bound.
So I'd say it stays reasonable.
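Plugging a price into the estimate above makes "reasonable" concrete. The $0.02 per million tokens here is a hypothetical embedding rate, not a quoted one; check your provider:

```python
docs = 26_000
tokens_per_doc = 70_000                 # upper bound from the estimate above
total_tokens = docs * tokens_per_doc    # 1.82 billion tokens

price_per_million = 0.02  # hypothetical $/1M tokens for an embedding model
cost = total_tokens / 1_000_000 * price_per_million
print(total_tokens, round(cost, 2))  # 1820000000 36.4
```

Even at several times that rate, embedding the whole corpus stays in hobby-budget territory.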
qwer1627@reddit
https://svetimfm.github.io/epstein-files-visualizations/epstein_network_graph.html
Used Nomic overnight
fets-12345c@reddit
You can embed locally using Ollama with Nomic Embed Text: https://ollama.com/library/nomic-embed-text
qwer1627@reddit
on a 3070Ti
- 0.049s to 2.352s per document (average ~0.7s)
- Very fast for short texts: 90 chars = 0.049s
- 6197 chars = 2.000s
This is the way - these 768 dims are fairly decent compared to v2 Titan's 1024 dims, fully local at that. TY again.
qwer1627@reddit
Woah, thank you!
HauntingSpirit471@reddit
Any references to pizza
qwer1627@reddit
use fuzzy search for that! :) https://ep-nov-12.greg.technology/?q=pizza
Ok_Alfalfa3361@reddit
The download is being buggy. It either doesn't work, or it does but the entire text of each document is compressed into a single line, so I have to manually drag the screen over and over again just to complete part of a sentence. Can someone help me get blocks of text instead of these compressed lines?
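If the file opens but each document renders as one enormous line, that's usually the viewer, not the data; reading the CSV programmatically and re-wrapping the text works around it. The column names and sample row here are hypothetical stand-ins (check the real header row):

```python
import csv
import io
import textwrap

# Fake two-column sample standing in for the real CSV file.
sample = io.StringIO(
    "filename,text\n"
    'doc1.jpg,"A long run-on block of OCR text that is hard to read as one line."\n'
)

rows = list(csv.DictReader(sample))
for row in rows:
    print(f"== {row['filename']} ==")
    # Wrap each document into readable 40-character blocks.
    print(textwrap.fill(row["text"], width=40))
```

Point the same loop at the downloaded CSV (open the real file instead of the `io.StringIO` sample) to page through documents comfortably.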
Funny_Winner2960@reddit
Guys why is the mossad knocking on my door?
Fantastic_Green9633@reddit
False alarm – the Mossad never knocks on doors.
presidentbidden@reddit
Why is your pager ringing ?
Lucky-Necessary-8382@reddit
Lmao
TechByTom@reddit
Direct Link: https://huggingface.co/datasets/tensonaut/EPSTEIN_FILES_20K/resolve/main/EPS_FILES_20K_NOV2025.csv?download=true
palohagara@reddit
link does not work anymore 2025-11-19 16:00 GMT
TechByTom@reddit
https://huggingface.co/datasets/tensonaut/EPSTEIN_FILES_20K/resolve/main/EPS_FILES_20K_NOV2025.csv?download=true Strange, I just grabbed the link again and it looks like it's the same?
gordonv@reddit
Wow, they didn't make this clear and easy at all.
meganoob1337@reddit
Can you also show your graph rag ingestion pipeline? I'm currently playing around with it and have not yet found a nice workflow for it
tensonaut@reddit (OP)
In case anyone is planning to build a queryable RAG on the dataset, please ensure you link the text with the original documents shared by the House Committee below. The filename column can be expanded to the full path of the official Google Drive documents.
https://oversight.house.gov/release/oversight-committee-releases-additional-epstein-estate-documents/
Might be a good demo project to post on Hugging face spaces. (https://huggingface.co/docs/hub/en/spaces-overview)
miafayee@reddit
Nice, that's a great way to connect the dots! It'll definitely help people verify the info. Thanks for sharing the link!
inevitable-publicn@reddit
We shouldn't use Huggingface or perhaps even this sub for this. These are very valuable resources for Open LLMs.
tensonaut@reddit (OP)
This is public data, similar to the Enron dataset
No_Lynx5887@reddit
So is Trump in them or not?
arousedsquirel@reddit
This is nice work! Considering the hot subject it will get some more involved in creating a decent kb graph and test which entities and edges can be created. Good job!
tensonaut@reddit (OP)
Yes, that's what I was hoping for. I'm more interested in people building knowledge graphs; then, given two entities, "Epstein" and someone else, you can find how they are associated using a graph library like networkx.
It will be just one line of code:
nx.all_simple_paths(G, source=source_node, target=target_node)
Ensuring the quality of entity and node extraction is the key.
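A minimal self-contained version of that one-liner, on a toy graph (the entities other than "Epstein" are made up):

```python
import networkx as nx

# Toy knowledge graph; in practice the edges come from entity extraction.
G = nx.Graph()
G.add_edge("Epstein", "Person A")
G.add_edge("Person A", "Person B")
G.add_edge("Epstein", "Person B")

# Every association chain between the two entities, direct or indirect.
paths = list(nx.all_simple_paths(G, source="Epstein", target="Person B"))
print(paths)
```

On the real extracted graph, the same call surfaces both direct edges and multi-hop chains, which is why extraction quality matters so much: a single wrong edge invents an association.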
qwer1627@reddit
I’m working on this right now, can you help me understand if this is just an index or a full conversion of the files to text? And then just has metadata pointing to the source files?
tensonaut@reddit (OP)
It's a full conversion of the files to text in one column; the other column is just the filename. Also, for embedding you can just use Nomic or BGE embedding models; both can be downloaded locally, are close to SOTA performance for their size, and should be more than good enough.
qwer1627@reddit
https://huggingface.co/datasets/svetfm/epstein-files-nov11-25-house-post-ocr-embeddings
Embedded, 768 Dim. Ty for your work!
qwer1627@reddit
I'm using a 768-dim text2embedding model recommended by another redditor, offline, to not blow up my AWS bill (just a few hundred bucks, but still)
phora577@reddit
enjoy your next level TDS no life spamming! try to cover up those sniffy joe refs too while you're at it, lefty
7657786425658907653@reddit
can i run the epstein files on a 4080?
takuarc@reddit
Oh lord, OpenAI is gonna train on this data isn't it?...
thatguyinline@reddit
Interesting to see that DeepSeek (the model I'm using) refuses to answer questions about Trump as they relate to the emails. It will answer questions from its general corpus of knowledge, but actively refuses "per CCP rules" to talk about Trump as it relates to Epstein.
thatguyinline@reddit
In this one, I asked it to focus on Snowden as the primary node. This graph shows you all the connections referenced in Jeffrey Epstein's emails and how it connects to Snowden.
I'm not very passionate about the topic, so I honestly don't have any good ideas of what to look at next but it is pretty cool to chat with a specific bot that is answering questions solely based on the emails.
I wonder if there is appetite by the world for an "AskJeffrey" chatbot tied to this graph data. Effectively you'd be able to just ask questions about the emails and the relationships of people and places and dates and get answers only from the emails.
thatguyinline@reddit
In this one, I asked it to focus on Donald Trump as the primary node. This graph shows you all the connections referenced in Jeffrey Epstein's emails and how it connects to Trump.
thatguyinline@reddit
I loaded the emails into a GraphRAG database, where an LLM creates clusters/communities/nodes in a graph database. This was all run on a home machine using deepseek1.5, heavily quantized, and the qwen3 embedder without any reranking, so the quality of the results is not on par with what we'd get on production infrastructure with production models. A few more photos of the graph coming.
MrPecunius@reddit
Large Lolita Model
Zweckbestimmung@reddit
This is a good idea of a project to get into LLaMA I will try to replicate it
tensonaut@reddit (OP)
Good luck!
Reader3123@reddit
The finetunes are gonna be crazy lol
harmlessharold@reddit
ELI5?
Reader3123@reddit
People use datasets to change the behavior of a model to be more like that dataset. and that process is called finetuning.
I was suggesting finetunes using this dataset would be funny
cyberdork@reddit
Should be benchmarked with all those underaged character cards for SillyTavernAI.
_supert_@reddit
That and the wiki leaks insurance files.
a_beautiful_rhind@reddit
Not sure I want to RP with epstein and a bunch of crooked politicians.
getting_serious@reddit
I have a list of people that wouldn't notice if I suddenly formatted my e-mails like he did. I don't want the content, just the formatting and spelling.
EXPATasap@reddit
lololololol
stylist-trend@reddit
But think about it; you could ERP Bubba!
a_beautiful_rhind@reddit
Bill or the horse?
Chilidawg@reddit
He has the attributes of one.
Responsible-Bread996@reddit
I thought a dog was in the mix now too?
stylist-trend@reddit
Yes.
dashingsauce@reddit
lmfao
tensonaut@reddit (OP)
someone made this exact same comment when I posted yesterday.
randomrealname@reddit
OCR libraries are shite. How much of the image data have you checked? Not much, I imagine. Waste of time.
fallen0523@reddit
You clearly didn’t read OP’s post…
randomrealname@reddit
Yeah, I stopped when I saw they used an OCR library. Lol
fallen0523@reddit
So, you’re going to “call it quits” and then “throw shit at the wall” about it before even verifying that it worked? Why?
randomrealname@reddit
Because I have extensive experience using these libraries. They are VERY inconsistent. Unless you actually check every single translation, then you are reading data that is very likely not what was on the original document.
Why are you chasing my comments, do you have a stake in the OCR library game? lol saddo.
fallen0523@reddit
How am I “chasing your comments”? I responded to two of them in the same post that were right next to each other at the time I made the first two comments. I’m genuinely curious as to why you’re saying what you’re saying regarding the thing being a “waste of time”.
randomrealname@reddit
3*
This is your fourth attempt.
And because OCR libraries are shit. Like how many ways do you need to hear it. Download Python, get a pdf and enjoy yourself.
They can barely make out equations, always, always misread stuff, add words that change meaning. (like adding `not` in a sentence because there are unwanted marks on the original document)
It is so bad that you literally need to go line by line, word by word to make sure the document hasn't lost its semantic meaning.
Now, you go and tell me about all your experience using them? I don't think you have, or you wouldn't be saying this dumb shit.
fallen0523@reddit
Ah, so responding to a comment thread is an “attempt at chasing your comments”. Got it 👍
Look, I’m not arguing that OCR is perfect. I literally said it makes mistakes. That’s why you verify it. But acting like OCR is useless while pretending LLMs don’t also screw things up makes no sense. I use both every day for work. Neither one is flawless. 🤷♂️
If a PDF doesn’t have a text layer, an LLM is blind until OCR does its job. That’s just how the format works. There’s no “skip OCR and let the model read it” option unless it’s a vision model, which is still doing OCR behind the scenes… So yeah, you check the output. You check every tool. That’s the whole point. I’m not sure why that idea suddenly becomes controversial. 🤔
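The "text layer first, OCR fallback" routing being argued about here can be sketched in a few lines. A minimal sketch, assuming `pypdf` for extraction and a tesseract-style fallback (both hypothetical choices; OP ran tesseract on JPGs directly, and `min_chars` is an arbitrary threshold):

```python
def needs_ocr(extracted: str, min_chars: int = 20) -> bool:
    """A page with almost no extractable characters has no usable text layer."""
    return len(extracted.strip()) < min_chars

def extract_pages(pdf_path: str):
    """Yield (page_index, text); text is None for pages that must be OCRed."""
    from pypdf import PdfReader  # hypothetical library choice
    reader = PdfReader(pdf_path)
    for i, page in enumerate(reader.pages):
        text = page.extract_text() or ""
        # Pages below the threshold get routed to tesseract (or any OCR engine)
        yield i, (None if needs_ocr(text) else text)
```

Pages that come back as `None` would then go through `pytesseract.image_to_string` or whatever OCR engine you prefer.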
randomrealname@reddit
Hahahahaha you use them for work and then make a comment like that. LLMs need a text layer, do they? lol you are funny.
fallen0523@reddit
If you honestly think an LLM can read an image-only PDF without OCR, that’s not me on “copium,” that’s you not understanding how PDFs work. 🙄
A model can’t read text that isn’t there. Vision models still run OCR under the hood. That’s just reality. Both OCR and LLMs screw up, which is why you verify everything. Acting like one is flawless while the other is trash is just you talking out of ignorant confidence. Pop that ego of yours. It’s not doing you any favors.
randomrealname@reddit
Yes, multimodal LLMs read the document. They don't use OCR. And you have missed my point.
This execution is worthless; I am not saying they are useless. If you go line by line, word for word, and correct all the mistakes, then they're a valuable tool (if you are doing a few pages), but that is monotonous to do on, say, a 250-page textbook full of equations (not even counting diagrams).
The point is, there is literally no way OP has checked that the 20,000 documents match the output the OCR library spat out. That makes this worthless. You won't convince me, or anyone else with a modicum of sense, that it is any other way.
fallen0523@reddit
lol, your comment got flagged. See, there’s that ego that you won’t let go of. Instead of just going “damn, you know what? I didn’t think of it that way”, you keep ignoring how these systems work because you’re too caught up in trying to be “right”.
I’m not saying “deflate your ego” as an attack. I’m saying it because you’re allowing your ego to misguide you into being confidently wrong about things, and when I’m giving you factual information that contradicts your statements, you still allow your ego to override your ability to reassess what’s being said. Good luck with that in the future. It’s a hard habit to break, but it’s necessary.
randomrealname@reddit
I just flagged your comment.*
Speak the truth moron.
fallen0523@reddit
And… once again your comment was flagged. I didn’t flag anything, Reddit did.
And again with the ego. You’ve been throwing a tantrum and calling me names in every one of your last few replies and they keep getting automatically flagged. I haven’t insulted you once. That should tell you everything you need to know about why I’ve been saying you need to seriously just deflate the ego, reevaluate the conversation, and reassess where you’re wrong.
Yeah, it’s clear you don’t “care about what some loser on Reddit” has to say because you’re too fixated on being “right” even though you were clearly wrong. Again, I implore you to do some serious soul searching and really work on reflection.
Also take the time to understand where you were factually incorrect regarding multimodal LLMs. Multimodal LLMs still use OCR for text extraction when documents (like PDFs) don’t contain a text layer. If there isn’t a text layer, they utilize an OCR tool for character recognition by scanning the pixels of the document and then trying to determine what each character is (which is literally OCR). You should also take some time to learn about different PDF formatting options and how they work. I can guarantee a lot of the issues you have had in the past with OCR were user error and not actual shortcomings of the OCR itself (again, I agree with you that OCR has shortcomings and requires manual verification to ensure accuracy). LLMs also fall short on things and require manual verification to ensure accuracy (hence why there are disclaimers on most commercial AI models that specifically say “X can make mistakes at times. Check important info.” or some variation of that disclaimer).
randomrealname@reddit
Flagging me as suicidal! Hahahahha loser.
fallen0523@reddit
lol, you blocked me and then unblocked me. Seriously, get some help.
randomrealname@reddit
Hahajahha suicidal! Oh yeah, you matter that much stranger on the internet.
fallen0523@reddit
You’re clearly not well. You’re spiraling. And you spiral more often than not when someone offers a rebuttal. You resorted to name calling, then blocked me, then unblocked me, then claimed I reported you for “being suicidal”, then reported me as “suicidal” to Reddit. Get help.
randomrealname@reddit
You started the suicidal shit, pal. Look at insights, you're the single person who read any of these comments. Loser
fallen0523@reddit
Multiple people have read a good portion of this comment thread. 4 people have read up to this point. I didn’t report you. You blocked me, then unblocked me and claimed I reported you, then continue with your spiral. Again, get help.
randomrealname@reddit
Ok bye kiddo.
fallen0523@reddit
There’s nothing “flagable” about my comments. They’re directly specific to the topic, there aren’t any forms of “bullying” or “harassment”. Sorry I’m not an LLM that you can just gaslight into agreeing with you when you’re clearly wrong. Good luck in the future bro.
fallen0523@reddit
Again… Multimodal models still rely on OCR techniques. Calling it “not OCR” because it’s wrapped in an LLM doesn’t magically change what it’s doing. 😅 It’s still extracting text from pixels. That’s literally the definition.
And nobody said OP manually checked 20k files. The claim was that OCR is a “waste of time” which is just wrong. You verify samples, you spot check, you clean up obvious errors. That’s how this stuff is done in the real world when you apply proper “best practices”.
If you think any method gives you perfect output without verification, that says more about your expectations than the tech. Again, popping that ego of yours would really help you realize what it is you’re actually saying. 👍
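The sampling and spot-checking described here can be partly automated. A crude sketch, assuming the dataset is a list of OCRed strings; the word-likeness heuristic and the 0.6 threshold are arbitrary, just a way to surface the worst pages for manual review:

```python
import random
import re

def ocr_quality(text: str) -> float:
    """Crude score: fraction of tokens that look like real words
    (purely alphabetic, 2+ chars). Garbled OCR output drags this down."""
    tokens = re.findall(r"\S+", text)
    if not tokens:
        return 0.0
    wordlike = sum(
        1 for t in tokens
        if re.fullmatch(r"[A-Za-z]{2,}", t.strip(".,;:!?\"'()"))
    )
    return wordlike / len(tokens)

def sample_for_review(docs, n=50, threshold=0.6, seed=0):
    """Return (random sample, every doc whose score falls below threshold)."""
    rng = random.Random(seed)
    flagged = [d for d in docs if ocr_quality(d) < threshold]
    return rng.sample(docs, min(n, len(docs))), flagged
```

You still read the flagged pages yourself; the heuristic only decides where to look first.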
randomrealname@reddit
OK, moron, I have had enough of this completely pointless chat.
Specialist-Season-88@reddit
I'm sure they have already "fixed the books" so to speak and removed any prominent players. Like TRUMP
14dM24d@reddit
**Ask him if Putin has the photos of Trump blowing Bubba?**
14dM24d@reddit
gooeydumpling@reddit
Does the dataset have details on the big beautiful bill, with bill in every sense of the word?
14dM24d@reddit
no, but there's BUBBA
chucrutcito@reddit
I am particularly interested in the OCR process. Could you please provide detailed information regarding this process?
randomrealname@reddit
Python. The libraries are shite though.
fallen0523@reddit
Are the libraries shit, or do you just not know how to use them properly?
randomrealname@reddit
Lol, what kind of copium comment is this?
Yes, clearly I know how to use them. They are crap at what they do. LLMs actually do a better job these days.
areyouokmyfriend@reddit
what do i do if i found a phone number they forgot to redact
Bruceleroy90@reddit
The house just voted to release the Epstein files!
tensonaut@reddit (OP)
Will post another update if its released today after work!
Vast-Imagination-596@reddit
Wouldn't it be easier to interview the victims than to pore over redacted files? Ask the victims who they were trafficked to. Ask them who helped Epstein and Maxwell.
No-Complaint-9779@reddit
Thank you! Free Qdrant vector database on the way for anyone to use 😁 (embeddinggemma:300m)
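A minimal sketch of what that indexing might look like, assuming `qdrant-client` and `sentence-transformers` are installed, a local Qdrant instance on port 6333, and that `google/embeddinggemma-300m` is the right model id for EmbeddingGemma (all assumptions on my part, not details confirmed by the commenter):

```python
def chunk(text: str, size: int = 1000, overlap: int = 200):
    """Fixed-size character chunks with overlap, so long OCRed documents
    don't blow past the embedding model's context."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def index_docs(rows):
    """rows: iterable of (source_path, ocr_text) pairs from the dataset."""
    from qdrant_client import QdrantClient, models
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("google/embeddinggemma-300m")  # assumed model id
    client = QdrantClient("localhost", port=6333)
    client.recreate_collection(
        "epstein_files",
        vectors_config=models.VectorParams(
            size=model.get_sentence_embedding_dimension(),
            distance=models.Distance.COSINE,
        ),
    )
    points = []
    for doc_id, (path, text) in enumerate(rows):
        for j, piece in enumerate(chunk(text)):
            points.append(models.PointStruct(
                id=doc_id * 10_000 + j,            # assumes <10k chunks per doc
                vector=model.encode(piece).tolist(),
                payload={"path": path, "text": piece},
            ))
    client.upsert("epstein_files", points=points)
```

Keeping the original folder path in the payload preserves OP's link back to the House Oversight source for verification.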
WestCloud8216@reddit
Americans wasting their time with the Epstein files.
Glathull@reddit
Epstein is the best thing to happen to politicians since Roe got overturned. They’ve all been out there looking for a wedge issue to grandstand and fundraise on, and they’ve found it!
OcelotMadness@reddit
You're sick if you don't give a shit that we're being ruled by a regime headed by a dude who verifiably abused kids.
Or more likely you're a bot, but it had to be said either way.
Scew@reddit
But no suggestion of what they should waste their time on? Bruh you needa up your marketing game.
InternalEngineering@reddit
File name is incorrect: EPS_FILES_20K_NOV2026.csv on hugging face (It's currently 2025)
_parfait@reddit
Time travel leaksss
tensonaut@reddit (OP)
Thanks for letting me know, I've updated it.
Unhappy_Donut_8551@reddit
Check out https://OpenEpstein.com
Uses Grok for the summary.
Comfortable-Tap-9991@reddit
Most of you are probably just interested in this, so here's the answer the AI provides when asked if Trump ever visited Epstein's island:
LouB0O@reddit
I'd be concerned about code names or such. They can't be THAT stupid to be like "Trump, cya at diddle Island next week. I got 5 kids, 4 women and some livestock for you to enjoy"
FastDecode1@reddit
That's very optimistic of you.
The reality is that the rich and powerful are just as retarded and clueless as the rest of us, if not more.
I just had a good laugh reading an email chain of the then-president of the Maldives asking Epstein if this ~~Nigerian prince~~ anonymous funds manager offering to send his finance minister 4 billion is legit.
Unhappy_Donut_8551@reddit
Yup what I see too, no mentions at all of him being on the island.
NobleKale@reddit
... why would you use Musk's bot for THIS task?
Seems like a bad selection.
Unhappy_Donut_8551@reddit
Really the price and context size. Used “gpt-5-chat-latest” first and it was great, but it was as much as 10-15c each request, using top-k 100 to pull as many relevant docs at once and then letting the LLM summarize.
It’s not straying from explaining and summarizing what it sees in the docs, since I’m giving it the text. With top-k at 200 it’s like 2-3c per request now.
Both models work for this, but Grok was providing good results. I understand where you are coming from though!
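The retrieve-then-summarize pattern described here is model-agnostic: swap the LLM callable and the cost changes, not the structure. A sketch with the search backend and LLM passed in as plain callables (function names and prompt wording are illustrative only, not the site's actual code):

```python
def summarize_query(query, search, llm, top_k=100):
    """Retrieve top_k chunks, then ask the LLM to answer strictly from them.

    search: callable (query, k) -> list of (path, text)
    llm:    callable (prompt)   -> str
    """
    chunks = search(query, top_k)
    context = "\n\n".join(f"[{path}] {text}" for path, text in chunks)
    prompt = (
        "Answer strictly from the documents below; cite the [path] "
        "of each claim.\n\n"
        f"{context}\n\nQuestion: {query}"
    )
    return llm(prompt)
```

Since input tokens dominate the bill, cost per request scales roughly linearly with `top_k * average_chunk_tokens * input_price`, which is why switching to a cheaper model lets you double `top_k` and still pay less.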
NobleKale@reddit
I think you're missing my 'Grok is not going to give you a straight answer, it's a fucking propaganda machine, what the fuck are you doing using it for something that involves anything with Epstein, or Trump, holy fucking shit' angle.
paul_tu@reddit
Any URLs of the files themselves?
tensonaut@reddit (OP)
https://oversight.house.gov/release/oversight-committee-releases-additional-epstein-estate-documents/
paul_tu@reddit
Thanks
Looks like it's not full
But anyway thanks
tensonaut@reddit (OP)
These are the complete files released by the house oversight committee last friday
ortegaalfredo@reddit
We can revive him. We have the technology.
MechaEpstein.
LouB0O@reddit
Lmao. Shit breaks out and runs loose. Taking revenge on those who killed him.
Any-Blacksmith-2054@reddit
Frankepstein
Astroturf_Agent@reddit
The Epsteinilisk will make us regret AI.
pstuart@reddit
Being that the data was likely scrubbed of Trump references, it would be interesting if it was possible to detect that from metadata or across sources.
Simon-Says69@reddit
That's not likely at all. What would they scrub, that Trump was a key witness for the prosecution? Your theory makes no logical sense.
If there was any info against Trump, Epstein would have used it to stay out of jail, and later the Biden admin would have used it to manipulate the 2024 election.
AppearanceHeavy6724@reddit
You are so, so naive.
davidy22@reddit
The data isn't behind a gate or anything; it's fully available, and multiple people have made it very searchable, including the person who made this post. My patience hasn't gotten me through the entire set, but Trump absolutely hasn't been removed from this dump. Either a look through any number of documents or just the bare minimum effort of typing Trump into the search bar would have told you that he's very present in the dump.
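The "bare minimum effort" check doesn't even need a search bar. A stdlib-only sketch against the two-column file (source path, OCR text) described in the post; the column order is my assumption:

```python
import csv

def search_mentions(csv_path: str, name: str):
    """Stream the two-column file and yield (source path, context snippet)
    for every case-insensitive occurrence of `name`."""
    needle = name.lower()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if len(row) < 2:
                continue
            path, text = row[0], row[1]
            low = text.lower()
            i = low.find(needle)
            while i != -1:
                # 60 characters of context on either side of the hit
                yield path, text[max(0, i - 60): i + 60]
                i = low.find(needle, i + 1)
```

Because it streams row by row, this handles the ~100 MB file without loading it all into memory.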
AppearanceHeavy6724@reddit
The American government has a rich history of being utterly untrustworthy: mucking with evidence (the latest example would be covering for Fauci in the GOF research which very possibly caused the pandemic) and poisoning the well wrt UFO evidence (the latest tic-tac stuff may well be an elaborate psyop hoax). So only an extremely naive tooth-fairy believer would think that either Republicans or Democrats would ever allow the true data, implicating the actual acting US president, to see the light; the amount of market disturbance and political instability that would follow is not acceptable. It is not a partisan issue anymore; it is a matter of national security for the truth to not see the light.
davidy22@reddit
It does kinda track that the same kind of person who can't be bothered to open and look at the info in the link they're commenting under would be the same kind of person peddling conspiracies that Fauci created COVID.
AppearanceHeavy6724@reddit
If you looked at the FOIA request regarding the relevant research by Fauci and the NIH, it was 200 pages of entirely blank or blacked-out pages. If there were nothing to hide, there would be no need for this disrespectful fuckery.
I am not American or in any way a partisan person; I have zero trust in any word that comes from your government or either of your two parties. If you think those in the federal government have any desire to tell the American people the truth, you probably have either a cognitive deficiency (you do not seem to), a personality disorder (naivete), or some psychiatric issue (I hope you do not).
WrinkledWinkle@reddit
Care to explain why he would be naive? I mean, you lefties had the docs in your hands for nearly 2 decades and did nothing with them. Instead you invented the Steele dossier accusing Trump of Russian collusion, highly illegal, and then you got caught and busted. You left wingers are retarded dorks
Qs9bxNKZ@reddit
Naive or not, it's logical and makes sense. Hoping that it is something else, especially in light of Epstein's close association with the Democrats and the attempts to hurt Trump, betrays your lack of understanding (or tells us how much you really do understand).
AppearanceHeavy6724@reddit
Much like bedtime stories for children.
davidy22@reddit
All you needed to do to check this was use the search bar and you didn't do that.
omernesh@reddit
A new "minor in a haystack" test?
Ok_Warning2146@reddit
Are these the Epstein Emails already released? Or are these the Epstein Files that are to be released after Epstein Act is passed by the Congress?
tensonaut@reddit (OP)
These are the ones released last Friday by the house oversight committee
Ok_Warning2146@reddit
I see. These are the Epstein Emails then.
tensonaut@reddit (OP)
They are a mix of emails, court proceedings, police filings, magazine pages, and news articles. The 20k documents released are a mix of docs from the Epstein Estate.
Sea_Mouse655@reddit
We need a NotebookLM style podcast stat
tensonaut@reddit (OP)
I've shared it on the NotebookLM sub, and it seems like a couple of folks are working on it. It should be a trending post on that sub; you can go check it out there.
AppearanceHeavy6724@reddit
Darn it, why does everyone still use Mistral 7B? If you want a small capable LLM, just use Llama 3.1.
Wrong-booby7584@reddit
There's a database from another redditor here: https://epstein-docs.github.io/
tensonaut@reddit (OP)
Seems like they haven't updated their db with the latest 20k docs release.
Ah, it was released in the last month - https://www.reddit.com/r/DataHoarder/comments/1nzcq31/epstein_files_for_real/
temurbv@reddit
can this model perform better than Sonnet 4.5?
ksk99@reddit
"epstein bench" - this is the way to embed it in history, just like that image-processing girl... Fellas, let's do it...
drillbit6509@reddit
where's the raw data? Since you mentioned you did not spend too much time on figuring out the entities.
Every_Bathroom_119@reddit
Going through the data file, the OCR results have many issues; they need some cleaning work.
Lucky-Necessary-8382@reddit
For OCR use a chinese local model like qwen3-vl-8B
RickyRickC137@reddit
This post is gonna delete itself!
claygraffix@reddit
https://openepstein.com
14dM24d@reddit
EPS_FILES_20K_NOV2026.csv? Time travel??
SysPsych@reddit
Fine tune your model on this and Hunter Biden's laptop contents if you want local LLMs to be heavily regulated tomorrow.
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
Space__Whiskey@reddit
I clicked and read some of the entries. There is some weird stuff in there. Like, a "Russian Doll" poem out of nowhere. Trippy.
davidy22@reddit
I've dug through the files myself; there are some baffling inclusions that bury the actual good stuff. With the patience I was able to muster, I found two letters from lawyers that were actual novel information, buried among a photocopy of an entire book, a report on the effect Trump's presidency would have on the Mexican peso, a summary of the publicly available depositions from a lawsuit from when Epstein was still alive, and a 50-page report on Trump's real estate assets. I suspect the number of documents we actually care about in the dump is closer to about 500, because most of this is stuff that's already publicly available, but someone with more time and patience than me is going to have to do that filtering for the entire 20,000-page set.
mrpkeya@reddit
System prompt:
You are president or a famous scientist. Answer accordingly
layer4down@reddit
Including Donnica Lewinsky?
CapoDoFrango@reddit
Sent from my iPhone
Interigo@reddit
Nice! I was doing the exact same thing as you last week. You would’ve saved me time lol
Chuyito@reddit
Can this help provide tax structure advice without asking for something in return
Zulfiqaar@reddit
Guess it's time for the sherlock alpha models to show us what they can do. 1.84M context, and pretty much zero refusals on any subject.
ValuableOven734@reddit
Wild
thatguyinline@reddit
Have been looking for an excuse to test LightRag :)
zhambe@reddit
What did you use for the graph rag?
tensonaut@reddit (OP)
I built a naive one from scratch; I didn't implement the graph community summaries, which is a big drawback. I'm pretty sure that if you implement a full Graph RAG system on the dataset, you can find more insights.
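For anyone wanting to reproduce the naive version: once an LLM (Mistral 7B in OP's case) has extracted (subject, relation, object) triples, the graph itself can be plain dicts. A sketch with no community summaries, matching the drawback OP mentions; the triple format is assumed:

```python
from collections import defaultdict

def build_graph(triples):
    """Adjacency map from LLM-extracted (subject, relation, object) triples.
    Each edge is stored in both directions so queries work from either end."""
    graph = defaultdict(list)
    for s, r, o in triples:
        graph[s].append((r, o))
        graph[o].append((f"inverse:{r}", s))
    return graph

def neighborhood(graph, entity, hops=1):
    """Entities reachable within `hops` edges -- the local context you'd
    hand to the LLM when answering a question about `entity`."""
    seen = {entity}
    frontier = {entity}
    for _ in range(hops):
        frontier = {o for e in frontier for _, o in graph.get(e, [])} - seen
        seen |= frontier
    return seen - {entity}
```

A full Graph RAG system would add community detection plus per-community summaries on top of this, which is exactly the piece OP skipped.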
igorwarzocha@reddit
Nanochat anyone?
tsilvs0@reddit
I NEEDED an interesting knowledge base graph task for learning purposes)