I spent 2 years building privacy-first local AI. My conclusion: Ingestion is the bottleneck, not the Model. (Showcase: Ollama + Docling RAG Kit)
Posted by ChapterEquivalent188@reddit | LocalLLaMA | View on Reddit | 4 comments
Hi r/LocalLLaMA,
I’ve been working on strictly local, data-privacy-compliant AI solutions for about two years now. Dealing with sensitive data meant that cloud APIs were never an option—it had to be air-gapped or on-prem.
The biggest lesson I learned:
We spend 90% of our time debating model quantization, VRAM, and context windows. But in real-world implementations, the project usually fails long before the prompt hits the LLM. It fails at Ingestion.
Especially in environments like Germany, where "Digitalization" just meant "scanning paper into PDFs" for the last decade, we are sitting on mountains of "Digital Paper"—files that look digital but are structurally dead (visual layouts, no semantic meaning).
The Solution:
I built a self-hosting starter kit that focuses heavily on fixing the Input Layer before worrying about the model.
The Stack:
- Engine: Ollama (because it’s the standard for local inference and handles GGUF on consumer hardware perfectly).
- Ingestion: Docling (v2). I chose this over PyPDF/LangChain splitters because it actually performs layout analysis. It reconstructs tables and headers into Markdown, so the LLM isn't guessing when reading a row.
- Database: ChromaDB (persistent, local).
- Architecture: Separation of concerns. I created specific profiles for Code (analyzing repositories) vs. Documents (PDFs), because throwing them into the same chunking strategy creates noise.
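To make the "layout-aware ingestion" point concrete: Docling emits Markdown, so the chunker can split on headings instead of raw character counts, which keeps a table together with the header that explains it. A minimal sketch of that idea (this helper is illustrative, not the kit's actual code; upstream, Docling v2 would produce the Markdown via `DocumentConverter().convert(path).document.export_to_markdown()`):

```python
import re

def chunk_markdown_by_heading(md: str, max_chars: int = 1500) -> list[str]:
    """Split layout-aware Markdown on headings so each chunk keeps its
    section context (tables and lists stay with their heading)."""
    sections, current = [], []
    for line in md.splitlines():
        # Start a new section at every Markdown heading
        if re.match(r"^#{1,6} ", line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())

    # Merge small sections forward so chunks stay a useful size
    chunks: list[str] = []
    for sec in sections:
        if chunks and len(chunks[-1]) + len(sec) < max_chars:
            chunks[-1] += "\n\n" + sec
        else:
            chunks.append(sec)
    return chunks

md = "# Invoice\n\n| Item | Price |\n|---|---|\n| A | 10 |\n\n## Terms\nNet 30."
for chunk in chunk_markdown_by_heading(md):
    print(chunk)
```

The payoff over naive text-splitting: a splitter that cuts every N characters can slice a table row in half; splitting on structure means the embedding for a chunk always carries the heading that gives it meaning.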
What this Kit is:
It’s a docker-compose setup for anyone who needs a "Google Code Wiki" style system but cannot let their data leave the building. It’s opinionated (Ingestion-First), strips out complex async worker queues for simplicity, and runs on a standard 16GB machine.
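For a sense of the shape of such a setup, here is a hedged docker-compose sketch (service names, volumes, and image tags are illustrative; check the repo for the actual file). Both images are the official ones on Docker Hub:

```yaml
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_models:/root/.ollama   # keep pulled GGUF models across restarts
  chromadb:
    image: chromadb/chroma
    ports:
      - "8000:8000"
    volumes:
      - chroma_data:/chroma/chroma    # persistent local vector store

volumes:
  ollama_models:
  chroma_data:
```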
Repo: https://github.com/2dogsandanerd/Knowledge-Base-Self-Hosting-Kit
I’ve decided to start open-sourcing my internal toolset because I genuinely fear we are heading towards a massive wave of failed AI integrations.
We are currently seeing companies and devs rushing into RAG, but hitting a wall because they overlook the strict quality requirements for retrieval. They don't realize that "electronic paper" (PDFs) is not Digitalization. It's just dead data on a screen.
Unless we fix the ingestion layer and stop treating "File Upload" as a solved problem, these integrations will fail to deliver value. This kit is my attempt to provide a baseline for doing it right—locally and privately.
I’d love to hear your thoughts on the "Ingestion First" approach. For me, switching from simple text-splitting to layout-aware parsing was the game changer for retrieval accuracy.
Thanks!

exaknight21@reddit
It took me a few months to understand that OCR quality matters the most, and anyone working on RAG who has used VLMs for OCR knows in their heart and soul that VLMs make the best OCR. Obviously this means that, in addition to LLMs/embedding models, there is a new, expensive overhead of VLMs; unless you use higher-end models, in which case you can do text gen + OCR from a single model.
I'm at an impasse where OCR is a non-issue now; the issue is running everything on a single device.
How can you make an effective RAG project without decent hardware?
You need 98% accurate OCR or your pipeline fails.
You need a large context window so that the full context of the retrieved document can be loaded. Granted, we can use chunking like we do with embeddings, but to me that's an untested theory.
8,000 tokens is a lot of tokens and requires just a little less than 24 GB VRAM. The issue is: how many people doing this locally on their potato laptop actually have 24 GB VRAM?
Not to mention re-rankers.
Sure, you can use Docling for text extraction… but the rest of the overhead for a true local experience requires at least 2 GPUs with 12 GB VRAM each.
fabkosta@reddit
Cool, but not exactly news, is it? High-quality OCR has always been the sine qua non prerequisite for natural language processing. PDFs are among the worst possible file formats for that, though, and modern layout-aware OCR tools like Docling are great but still struggle with many PDFs. It's always been a pain, really.
ChapterEquivalent188@reddit (OP)
I'm with you, it's not 'news' for veterans. But looking at the flood of 'Why is my RAG hallucinating?' posts, it seems to be news for the 90% of developers who just joined the party via LangChain.
And yes, even Docling struggles. That's exactly why I argue for a 'Sanitized Pipeline' (repair -> parse -> post-process) rather than just looking for a magic OCR tool. The 'pain' never goes away completely, but you can engineer around it. There is more after Docling and before the DB ;)
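The repair -> parse -> post-process idea composes naturally as a chain of stages. A minimal sketch (the stage bodies are illustrative stubs standing in for the real toolset, where Docling would sit in the parse stage):

```python
import re
from typing import Callable

# Each stage takes a document payload and returns a cleaned-up payload.
Stage = Callable[[str], str]

def repair(raw: str) -> str:
    # e.g. rejoin words broken by soft-hyphen line breaks before parsing
    return raw.replace("\u00ad\n", "")

def parse(repaired: str) -> str:
    # placeholder: layout-aware parsing (Docling) would go here
    return repaired.strip()

def post_process(markdown: str) -> str:
    # e.g. collapse runs of blank lines before chunking/embedding
    return re.sub(r"\n{3,}", "\n\n", markdown)

def sanitize(doc: str, stages: tuple[Stage, ...] = (repair, parse, post_process)) -> str:
    """Run a document through the sanitizing pipeline, stage by stage."""
    for stage in stages:
        doc = stage(doc)
    return doc

print(sanitize("ingest\u00ad\nion\n\n\n\ndone"))
```

Keeping the stages as plain functions means a stubborn document class (scans, rotated pages) can get its own repair stage without touching the rest of the pipeline.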
fabkosta@reddit
I am testing Docling right now. It's generally a good tool, but it still fails on some pretty basic problems. Do you tweak it somehow, e.g. by adding a different OCR tool or setting certain specific parameters?