I built a CLI that turns documents into knowledge graphs — no code, no database
Posted by garagebandj@reddit | Python | View on Reddit | 22 comments
I built sift-kg, a Python CLI that converts document collections into browsable knowledge graphs.
pip install sift-kg
sift extract ./docs/
sift build
sift view
That's the whole workflow. No database, no Docker, no code to write.
I built this while working on a forensic document analysis platform for Cuban property restitution cases. I needed a way to extract entities and relations from document dumps and get a browsable knowledge graph without standing up infrastructure.
Built in Python with Typer (CLI), NetworkX (graph), Pydantic (models), LiteLLM (multi-provider LLM support — OpenAI, Anthropic, Ollama), and pyvis (interactive visualization). Async throughout with rate limiting and concurrency controls.
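sift-kg's internals aren't shown in the post, but the async-with-concurrency-controls pattern it describes is easy to sketch. Everything here is illustrative: `call_llm` is a stub standing in for a LiteLLM completion call, and `extract_all` is a hypothetical name, not sift-kg's actual API.

```python
import asyncio

# Hypothetical stand-in for a LiteLLM completion call; in a real pipeline
# the extraction prompt and provider routing would live here.
async def call_llm(doc: str) -> dict:
    await asyncio.sleep(0)  # pretend this is network latency
    return {"doc": doc, "entities": []}

async def extract_all(docs: list[str], max_concurrency: int = 5) -> list[dict]:
    # A semaphore caps how many LLM requests are in flight at once,
    # which is the usual way to respect provider rate limits.
    sem = asyncio.Semaphore(max_concurrency)

    async def extract_one(doc: str) -> dict:
        async with sem:
            return await call_llm(doc)

    return await asyncio.gather(*(extract_one(d) for d in docs))

results = asyncio.run(extract_all(["doc-a.txt", "doc-b.txt"]))
```

The semaphore approach composes with `asyncio.gather` so all documents are scheduled up front while only a bounded number actually hit the API at a time.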
Human-in-the-loop entity resolution — the LLM proposes merges, you approve or reject via YAML or interactive terminal review.
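To make the YAML review step concrete, here is a guess at what such a file could look like. The keys and layout are hypothetical; check the sift-kg docs for the exact schema your version writes out.

```yaml
# Hypothetical merge-review file (not sift-kg's actual schema).
merges:
  - canonical: "Sam Bankman-Fried"
    duplicates: ["SBF", "Samuel Bankman-Fried"]
    decision: approve      # or: reject
  - canonical: "FTX"
    duplicates: ["FTX Trading Ltd."]
    decision: reject
```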
The repo includes a complete FTX case study (9 articles → 373 entities, 1,184 relations). Explore the graph live: https://juanceresa.github.io/sift-kg/graph.html
**What My Project Does** sift-kg is a Python CLI that extracts entities and relations from document collections using LLMs, builds a knowledge graph, and lets you explore it in an interactive browser-based viewer. The full pipeline runs from the command line — no code to write, no database to set up.
**Target Audience**
Researchers, journalists, lawyers, OSINT analysts, and anyone who needs to understand what's in a pile of documents without building custom tooling. Production-ready and published on PyPI.
**Comparison**
Most alternatives are either Python libraries that require writing code (KGGen, LlamaIndex) or need infrastructure like Docker and Neo4j (Neo4j LLM Graph Builder). GraphRAG is CLI-based but focused on RAG retrieval, not knowledge graph construction. sift-kg is the only pip-installable CLI that goes from documents to interactive knowledge graph with no code and no database.
Source: https://github.com/juanceresa/sift-kg PyPI: https://pypi.org/project/sift-kg/
brianckeegan@reddit
I'm excited to try this out!
garagebandj@reddit (OP)
Let me know how it goes!
EmbarrassedCar347@reddit
Why are people down voting this?
timtom85@reddit
"haters gonna hate" that's why
if you look at some of the comments above, some people are like "this can't work because" smh
Cute-Net5957@reddit
extract → build → view is a really clean pipeline.. how are you persisting state between commands? im building a Typer CLI that needs state between invocations and went with a JSON file but am already regretting it as the data grows.. wondering if SQLite would've been the smarter call from the start. also the FTX case study in the readme is a nice touch.. way more compelling than toy data
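For the SQLite route the commenter is weighing, a minimal key-value state store is only a few lines of stdlib. `CliState` is a hypothetical helper, not anything from sift-kg or Typer; it's just one way to replace a growing JSON file.

```python
import json
import sqlite3

# Hypothetical key-value state store for a CLI. SQLite gives you atomic
# writes and indexed lookups where a single JSON file becomes fragile.
class CliState:
    def __init__(self, path: str = ".cli_state.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS state (key TEXT PRIMARY KEY, value TEXT)"
        )

    def set(self, key: str, value) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO state (key, value) VALUES (?, ?)",
            (key, json.dumps(value)),
        )
        self.conn.commit()

    def get(self, key: str, default=None):
        row = self.conn.execute(
            "SELECT value FROM state WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else default

state = CliState(":memory:")  # in-memory for the demo; use a file path in a real CLI
state.set("last_run", {"docs": 42})
```

The JSON-encoded values keep the table schema-free, so migrating from a JSON file is mostly a matter of looping over its top-level keys.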
gardenia856@reddit
The core win here is you treat KG building like a dead-simple ETL: extract → build → view, instead of yet another “stand up Neo4j and learn Cypher” weekend project. Two things I’d love to see:

1) a lightweight schema/ontology layer (even just YAML templates per use case: fraud, M&A, OSINT) so entities/edges don’t drift across runs, and

2) export paths that play nice with other tools: GraphML / Parquet edges, plus maybe a small API so stuff like Neo4j or Memgraph can ingest when people outgrow the local viewer.

For entity resolution, a cheap win is active learning: surface the “highest-impact” merge suggestions first (degree, betweenness, PageRank), not just whatever the LLM spits out.

On the “who actually uses this” side: this fits nicely next to things like Obsidian and Logseq for personal research flows; I’ve seen folks pair that kind of KG output with monitoring tools like Mention and Pulse for tracking how entities/relationships evolve over time across the web.

Bottom line: you nailed the no-infra KG niche; now it’s all about schema discipline and smarter review UX.
garagebandj@reddit (OP)
Really appreciate this comment - you basically described what already exists and what's next on the roadmap.
Schema/ontology layer: This is already in. Each project can set a domain via sift.yaml or pass --domain domain.yaml, where you define entity types, relation types, extraction hints, and which relations require human review. There are bundled domains for general use and OSINT, but the idea is exactly what you described — YAML templates per use case so extractions stay consistent across runs.
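Based on the OP's description, a domain file could look roughly like this. The key names below are guesses for illustration; the bundled domains in the repo show the real schema.

```yaml
# Hypothetical domain.yaml along the lines described above
# (illustrative keys, not sift-kg's actual format).
entity_types: [PERSON, ORG, EVENT, ASSET]
relation_types:
  - name: EMPLOYED_BY
    source: PERSON
    target: ORG
  - name: TRANSFERRED_TO
    source: ASSET
    target: PERSON
    requires_review: true   # flag relations that need human sign-off
extraction_hints:
  - "Treat shell companies as ORG, not PERSON"
```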
Exports: sift export already supports GraphML, GEXF, CSV, SQLite, and JSON. So Neo4j/Memgraph/Gephi ingestion is a sift export graphml away.
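For anyone wiring the exports into other tools, NetworkX round-trips GraphML directly, which is the same format Gephi and Neo4j's importer consume. The graph below is toy data, not sift-kg's actual export schema.

```python
import os
import tempfile

import networkx as nx

# Toy graph standing in for a sift-kg export.
g = nx.DiGraph()
g.add_edge("FTX", "Alameda Research", relation="AFFILIATED_WITH")

# Round-trip through GraphML and confirm the edge attribute survives.
path = os.path.join(tempfile.mkdtemp(), "export.graphml")
nx.write_graphml(g, path)
loaded = nx.read_graphml(path)
```

GraphML preserves node/edge attributes as typed keys, which is why it tends to be the safest interchange format for attribute-heavy knowledge graphs.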
Active learning for merge review: This is a great idea. Right now proposals come out in whatever order the LLM produces them. Ranking by graph centrality so you review the highest-impact merges first is a cheap win — adding it to the roadmap.
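The centrality ranking described here is a few lines with NetworkX. The proposals and scoring rule below are made up for illustration; sift-kg's actual merge format may differ.

```python
import networkx as nx

# Toy graph: SBF is well connected, Caroline is peripheral.
g = nx.Graph()
g.add_edges_from([
    ("SBF", "FTX"), ("SBF", "Alameda"), ("SBF", "Congress"),
    ("Caroline", "Alameda"),
])

# Hypothetical merge proposals as (node-in-graph, proposed-duplicate) pairs.
proposals = [("Caroline", "Caroline Ellison"), ("SBF", "Sam Bankman-Fried")]

# Score each proposal by the best-connected node it touches, so the
# reviewer sees the highest-impact merges first.
centrality = nx.degree_centrality(g)
ranked = sorted(
    proposals,
    key=lambda pair: max(centrality.get(n, 0.0) for n in pair),
    reverse=True,
)
```

Degree centrality is the cheapest option; betweenness or PageRank would reward nodes that bridge clusters, which may matter more for investigation-style graphs.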
Obsidian/Zotero: Both recently added to the roadmap as integration targets. The personal research flow is exactly the right mental model.
Thanks for engaging so thoughtfully with this.
Actual__Wizard@reddit
You can't use LLMs for that purpose as whether a word is an entity or not changes contextually in the sentence. It's going to have a ton of failure points, like names of businesses as an example.
jnwatson@reddit
LLMs are precisely built for extracting the specific nuance of a word or phrase based on its position in a sentence. The point of the transformer's attention mechanism is to extract the relationships among all tokens in a sequence, and the way that the tokens are positionally encoded maintains the word order relationships.
Actual__Wizard@reddit
Sure, but that scheme isn't consistent with English, so it has a limitation. There's a word-linkage system (remember it from school?) that needs to be understood to understand the meaning of a sentence.
jnwatson@reddit
You are basing your experience on outdated technology. The latest generation of LLMs have complete command of the English language. I picked one of your later sentences and had Claude diagram it:
I can't paste the diagram here; it is precise and correct.
Actual__Wizard@reddit
Homie, it doesn't apply that stuff to understand language, it's layering it on top of the prompt, and it's probably not very accurate.
jnwatson@reddit
Clearly the LLM understands grammar better than you. Adverbs can't modify nouns.
Actual__Wizard@reddit
They do indeed modify nouns indirectly.
jnwatson@reddit
I get it. All your expertise is going away. I remember seeing demonstrations of simple grammar deconstruction at the CS department at University of Texas in the early 90s. All that work is now moot.
Thousands of ML researchers over the last 70 years have wasted their time in designing bespoke rule-based systems. Raw compute at scale has won. This is The Bitter Lesson.
Yes, you can just shove it into a matrix and it fixes it. It turns out that meaning is derivable directly from the language itself.
Actual__Wizard@reddit
Yeah what's the former professor's name again? I thought he passed away, which is unfortunate all things considered.
I have a serious question for you: What is your intention with that information? So, you're saying to me that something kindergarteners do is too hard for AI developers to implement as code?
You're the first person that I've talked to in about a year that even appears to be aware of what I am talking about at all.
So, I would really appreciate if you engage in this conversation.
Okay, so, you know what I'm talking about. Holy cow. That's amazing. Did you know that there's been some mega big advancements in the area of construction grammar?
garagebandj@reddit (OP)
Good points on extraction quality. That's why sift-kg has a human-in-the-loop review step where you approve or reject merges before anything gets finalized.
Events are extracted as their own entity type by design: "Froot of the Loom Chapter 11 Bankruptcy" is an EVENT node, not a mislabeled organization. You can configure which entity types matter for your domain in the YAML config.
On the approach: rule-based NER handles standard entity types well but can't extract relations or domain-specific entities without training data. That's the tradeoff.
Actual__Wizard@reddit
Froot of the Loom is a brand (an entity) and their Chapter 11 bankruptcy is an event (an entity).
Arty_Showdown@reddit
It's because people are effort averse. They want a cure-all for any task and they're willing to sacrifice any semblance of accuracy to achieve it.
People, and I confess I did as well when I started out, go full steam ahead with ideas like this without the required comprehension of the fundamentals. It's folks like yourself who brought me into reality, hopefully OP experiences the same.
Typical-Muscle4397@reddit
This is crazy, everyone check out examples/ftx/output/graph.html
garagebandj@reddit (OP)
Appreciate it! Just pushed an updated FTX graph and added a new one for the Epstein/Giuffre v. Maxwell depositions. Both live here: https://juanceresa.github.io/sift-kg/
Unlikely_Elevator_42@reddit
I am going to try this out