Thoughts on how to structure and implement RAG for genealogical datasets?

Posted by Own_Attention_3392@reddit | LocalLLaMA

My wife is a hardcore amateur genealogist and occasionally spends hours poring over old records, trying to find links between people or turn up new information about known people (matching birth/death dates, catching misspellings, etc.). You know how it is -- the older the data, the less reliable it is.

Anyway, I've been mulling over how to approach throwing some AI at the problem. I'm thinking if I can get CSV or JSON datasets, I can probably do some RAG magic with them, but I may also need some agentic workflows in the mix. I understand the fundamentals of RAG but haven't implemented it at any realistic scale yet. I also haven't had the need or opportunity to play with MCP, etc. So I'm really just fishing for ideas here and hoping to fill in some knowledge gaps.
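For what it's worth, the usual first step for RAG over tabular/JSON records is to flatten each record into a short prose chunk before embedding it, so names, dates, and places show up together in context instead of as scattered fields. A minimal sketch -- the field names and data below are invented for illustration, not any standard schema:

```python
def record_to_chunk(person: dict) -> str:
    """Flatten one person record into prose so an embedding model
    sees the name, dates, and relations together in context."""
    lines = [f"{person['name']}."]
    if person.get("birth"):
        lines.append(f"Born {person['birth']}.")
    if person.get("death"):
        lines.append(f"Died {person['death']}.")
    if person.get("children"):
        lines.append("Children: " + ", ".join(person["children"]) + ".")
    return " ".join(lines)

# Example record (made up for illustration):
chunk = record_to_chunk({
    "name": "Jane Doe",
    "birth": "12 Mar 1871 in Cork, Ireland",
    "death": "4 Jan 1950",
    "children": ["John Doe", "Mary Doe"],
})
```

Each chunk then gets embedded and indexed as usual, while the structured JSON stays around as ground truth for exact lookups.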

Getting appropriate datasets is a bit of a problem, but that's something I can solve -- I'm a software developer, so I can handle any sort of data transformation if I'm getting heterogeneous datasets and need to make them consistent. I can also bang out code in Python if I need to glue anything together with some custom nonsense.

I'm starting with her existing family tree in GEDCOM format, which I've successfully converted to JSON. I don't think the current structure is very RAG friendly though, so I want to revisit that. My first attempts were just loading LM Studio, attaching the JSON file, and seeing what happened. That did not go well. Simple questions like "What's the relationship between Person X and Person Y?" were wildly unsuccessful. No, she is not married to her grandfather, but thanks for trying, Mister LLM. So now I'm officially out of my depth.
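One reaction to the grandfather incident: relationship questions are really graph queries, and an LLM reading raw JSON makes a poor graph engine. A deterministic traversal can find the kinship path, and the model only needs to verbalize it. A toy sketch with invented data, assuming the GEDCOM-derived JSON gives you parent links per person:

```python
from collections import deque

# Hypothetical toy data: person IDs mapped to their parents' IDs.
# The names and structure are invented, not GEDCOM-standard.
PARENTS = {
    "anna": ["bertha", "carl"],
    "bertha": ["dora"],
    "carl": [],
    "dora": [],
    "erik": ["bertha"],
}

def kinship_path(start, goal):
    """Breadth-first search over parent/child edges; returns the
    chain of (relation, person) steps from start to goal, or None."""
    # Build an undirected adjacency list: parent links plus child links.
    edges = {}
    for child, parents in PARENTS.items():
        for parent in parents:
            edges.setdefault(child, []).append(("parent", parent))
            edges.setdefault(parent, []).append(("child", child))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        person, path = queue.popleft()
        if person == goal:
            return path
        for relation, neighbor in edges.get(person, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [(relation, neighbor)]))
    return None
```

Here `kinship_path("anna", "erik")` comes back as `[("parent", "bertha"), ("child", "erik")]` -- anna and erik share the parent bertha. Mapping such paths to names like "grandfather" or "first cousin" is then a table-driven exercise rather than something the LLM has to guess at.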

I have an RTX 5090 and 64 GB of system memory (which I'm hoping to turn into 128 GB if RAM prices drop a bit for Black Friday), and I have no problem spinning up a RunPod instance or similar if I need more computing horsepower to mess around with this stuff. I'm not super concerned about which model I'm using; GLM Air or GPT OSS (both of which I already run for other stuff) would probably be good candidates.

I'm hoping to open up a conversation on the topic; if I can find a reliable workflow and toolchain, it could be useful to other people trying to accomplish the same thing. Thanks, smart LLM people!