Improving RAG Results with OpenWebUI - Looking for Advice on Custom Pipelines & Better Embeddings
Posted by b5761@reddit | LocalLLaMA | 11 comments
I’m currently working on improving the RAG performance in OpenWebUI and would appreciate advice from others who have built custom pipelines or optimized embeddings. My current setup uses OpenWebUI as the frontend, with GPT-OSS-120b running on an external GPU server (connected via API token). The embedding model is bge-m3, and text extraction is handled by Apache Tika. All documents (mainly internal German-language PDFs) are uploaded directly into the OpenWebUI knowledge base.
Setup / Environment:
- Frontend: OpenWebUI
- LLM: GPT-OSS-120b (external GPU server, connected via API token)
- Embedding Model: bge-m3
- Extraction Engine: Apache Tika
- Knowledge Base: PDFs uploaded directly into OpenWebUI
- Data Type: Internal company documents (German-language product information)
Observed Issues:
- The RAG pipeline sometimes pulls the wrong PDF context for a query – responses reference unrelated documents.
- Repeating the same question multiple times yields different answers, some of which are incorrect.
- The first few responses after starting a chat are often relevant, but context quality degrades over time.
- I suspect the embedding model isn’t optimal for German, or preprocessing is inconsistent.
I’m looking for practical advice on how to build a custom embedding pipeline outside of OpenWebUI, with better control over chunking, text cleaning, and metadata handling. I’d also like to know which German-optimized embedding models from Hugging Face or the MTEB leaderboard outperform bge-m3 in semantic retrieval. In addition, I’m interested in frameworks or methods for fine-tuning on QA pairs or document context, for example using SentenceTransformers or InstructorXL; how does this training work in practice? Another question is whether it’s more effective to switch to an external vector database such as Qdrant for embedding storage and retrieval, instead of relying on OpenWebUI’s built-in knowledge base. And does fine-tuning or a customized PDF pipeline work better? If so, are there any tutorials out there, and is this possible with OpenWebUI?
Thanks for your help!
No-Refrigerator-1672@reddit
RAG in OpenWebUI is very barebones and inflexible. I would recommend not using it; instead, you should deploy a fully fledged standalone RAG system. I'd recommend RAGFlow because I've had good experience using it. It has advanced embedding techniques, including RAPTOR and knowledge graphs, and fine-tunable control over document processing and chunking. Systems like RAGFlow have their own AI chatbot builders, where you can configure the retrieval process for your needs, and they can then expose the chatbot as a separate model over the OpenAI API, allowing you to integrate it back into OpenWebUI or other software suites that you use.
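As a concrete illustration of that last point, here is a minimal sketch of calling such an exposed chatbot through the standard OpenAI client. The base_url is a placeholder (the exact path depends on your RAGFlow version and assistant ID), and the API key comes from RAGFlow's settings:

```python
# Minimal sketch: query a standalone RAG system through its
# OpenAI-compatible endpoint, so it behaves like just another model.
from openai import OpenAI

client = OpenAI(
    base_url="http://ragflow-host:9380/api/v1/chats_openai/<assistant-id>",  # placeholder path
    api_key="<ragflow-api-key>",                                             # placeholder key
)

resp = client.chat.completions.create(
    model="ragflow-assistant",  # many such backends ignore or fix the model name
    messages=[{"role": "user", "content": "Welche Garantie gilt für Produkt X?"}],
)
print(resp.choices[0].message.content)
```

The same base_url/key pair is what you would register in OpenWebUI as an additional OpenAI-compatible connection.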
Desijizz@reddit
Skip OpenWebUI’s built-in RAG and run a standalone retriever with proper preprocessing, hybrid search, and a German-friendly reranker.
Concrete wins I’ve seen with German PDFs:
- Clean Tika output first: de-hyphenate, merge wrapped lines into paragraphs, strip headers/footers via regex, keep H1–H3 and page numbers as metadata.
- Chunk by headings, ~500–700 tokens with 50–100 overlap; tag lang=de, productid, section; pre-filter by productid before vector search (see the cleaning/chunking sketch after this list).
- Try intfloat/multilingual-e5-large (or keep bge-m3) but add a reranker; BAAI/bge-reranker-v2-m3 is multilingual and stabilizes relevance a lot (retrieval + rerank sketch at the end of this comment).
- Go hybrid: Elastic (German analyzer) or Typesense for BM25 unioned with Qdrant vectors; retrieve k≈20, MMR on, rerank to top 3–5 before the LLM. Keep cosine for bge/e5. Temperature ≤0.2 and don’t feed long chat history back into retrieval.
- Fine-tune last: SentenceTransformers with MultipleNegativesRankingLoss on mined QA/section pairs helps, but reranking + metadata usually moves the needle faster.
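For the first two bullets, a minimal sketch of the cleaning and heading-based chunking; the regexes are crude heuristics and the character budget (a stand-in for 500–700 tokens) is an assumption to tune per document set:

```python
import re

def clean_tika_text(raw: str) -> str:
    """Rough cleanup of Tika output: de-hyphenate and merge wrapped lines."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)    # join words hyphenated across line breaks
    text = re.sub(r"\n(?=[a-zäöüß])", " ", text)   # merge wrapped lines that continue in lowercase
    text = re.sub(r"\n{3,}", "\n\n", text)         # collapse runs of blank lines
    return text.strip()

def chunk_by_headings(text: str, max_chars: int = 2500, overlap: int = 400):
    """Split at heading-like lines, then cap chunk size with overlap.
    ~2500 chars is a rough stand-in for 500-700 German tokens."""
    sections = re.split(r"\n(?=[A-ZÄÖÜ][^\n]{0,80}\n)", text)  # crude heading heuristic
    chunks = []
    for sec in sections:
        step = max_chars - overlap
        for start in range(0, len(sec), step):
            chunks.append(sec[start:start + max_chars])
    return [c for c in chunks if c.strip()]
```

Header/footer stripping and metadata tagging (page, section, productid) would hook in between these two steps.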
RAGFlow is solid; I've paired Haystack + Qdrant for retrieval, with DreamFactory to expose internal SQL/Mongo as quick REST tools for live lookups. Standalone RAG with tight preprocessing, hybrid search, and a German-tuned model is the fix.
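For the vector half of that hybrid setup plus the rerank step, a minimal sketch assuming a Qdrant collection docs_de already populated with bge-m3 vectors and a payload carrying text and productid; all names and top-k values are placeholders (the BM25 union is left out for brevity):

```python
from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer, CrossEncoder

embedder = SentenceTransformer("BAAI/bge-m3")        # or intfloat/multilingual-e5-large
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")   # multilingual cross-encoder reranker
qdrant = QdrantClient(url="http://localhost:6333")

def retrieve(query: str, product_id: str, k: int = 20, top_n: int = 5):
    """Vector search pre-filtered by productid, then cross-encoder rerank."""
    qvec = embedder.encode(query, normalize_embeddings=True)  # cosine-friendly vectors
    hits = qdrant.search(
        collection_name="docs_de",                            # placeholder collection name
        query_vector=qvec.tolist(),
        query_filter=models.Filter(must=[
            models.FieldCondition(key="productid",
                                  match=models.MatchValue(value=product_id)),
        ]),
        limit=k,
    )
    candidates = [h.payload["text"] for h in hits]
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda t: t[0], reverse=True)
    return [text for _, text in ranked[:top_n]]
```

The reranked top 3–5 chunks are what you'd hand to the LLM prompt.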
b5761@reddit (OP)
This also sounds like a possible option. So I would be able to connect this “final” chatbot as a separate or new model in OpenWebUI, while the entire knowledge base (including processing) runs in its own backend via RAGFlow.
The user would simply need to know: “If you want to ask something about knowledge base XY, use this specific chat model”?
No-Refrigerator-1672@reddit
Yes, you understood it correctly. This way your users will continue to use OpenWebUI and only the model name will change. Only uploading new documents would require interacting with another software system.
EssayNo3309@reddit
For the embedding model, you can try Alibaba-NLP/gte-multilingual-base.
For the extraction engine, I use Tika plus a home-made extraction engine that uses pymupdf & tesseract (on GPU): https://github.com/open-webui/open-webui/discussions/17621
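Not the linked project itself, but a minimal sketch of the pymupdf-plus-tesseract idea: take the text layer where one exists and fall back to OCR for scanned pages. lang="deu" assumes the German Tesseract model is installed:

```python
import fitz  # PyMuPDF
import pytesseract
from PIL import Image

def extract_pdf_text(path: str) -> str:
    """Text layer via PyMuPDF, OCR fallback for pages without one."""
    pages = []
    with fitz.open(path) as doc:
        for page in doc:
            text = page.get_text().strip()
            if not text:  # no embedded text: render the page and OCR it
                pix = page.get_pixmap(dpi=300)
                img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
                text = pytesseract.image_to_string(img, lang="deu")
            pages.append(text)
    return "\n\n".join(pages)
```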
segfawlt@reddit
Are you using a reranking model in addition to the embedding model?
Fun_Smoke4792@reddit
b5761@reddit (OP)
Thanks for your response. So, the first step would be to convert all PDFs into Markdown format. After that, I can review the Markdown files to verify that everything has been extracted correctly. Then, as a next step, I could run an additional preprocessing stage using a German lemmatizer? Correct?
Would the resulting output be ready to upload directly into OpenWebUI as “documents,” or would I need to use another tool or component for that step?
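If you do try the lemmatizer step, a minimal sketch with spaCy's German pipeline; de_core_news_lg is an assumption (any German spaCy model works), and note that lemmatizing tends to help keyword/BM25 search more than dense embeddings:

```python
import spacy

# one-time setup: python -m spacy download de_core_news_lg
nlp = spacy.load("de_core_news_lg", disable=["parser", "ner"])  # keep tagging for lemmas

def lemmatize_de(text: str) -> str:
    """Replace each German token with its lemma, dropping punctuation."""
    return " ".join(tok.lemma_ for tok in nlp(text) if not tok.is_punct)

print(lemmatize_de("Die Produkte wurden geliefert"))  # roughly: "der Produkt werden liefern"
```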
Fun_Smoke4792@reddit
You don't have to do the conversion now. I mean, you can check your chunks first; maybe they're already wrong, and then nothing downstream will make it better. Chunking makes the retrieval part better, but fixed size is fine. With Markdown you can do more, e.g. keep a list as one chunk so it doesn't break in the middle, and use headings as extra context or as separators, so your chunks provide better context. And this costs almost the same as fixed-size chunking.
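A minimal sketch of that heading-aware splitting, assuming the PDFs are already converted to Markdown; the heading-path prefix is one way to carry the extra context into each chunk:

```python
import re

def split_markdown(md: str):
    """Split on #/##/### headings; prepend the heading path to each chunk."""
    chunks, path, current = [], {}, []

    def flush():
        if current:
            prefix = " > ".join(path[l] for l in sorted(path))
            chunks.append((prefix + "\n" if prefix else "") + "\n".join(current))
            current.clear()

    for line in md.splitlines():
        m = re.match(r"^(#{1,3})\s+(.*)", line)
        if m:
            flush()
            level = len(m.group(1))
            path = {l: h for l, h in path.items() if l < level}  # drop deeper headings
            path[level] = m.group(2)
        else:
            current.append(line)
    flush()
    return [c for c in chunks if c.strip()]
```

A "Produktname > Technische Daten" prefix on every chunk is often enough extra context to stop the retriever from mixing up similar products.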
Disastrous_Look_1745@reddit
German-language RAG is a whole different beast; I've been down this path with multilingual document processing. The embedding model choice is crucial here. bge-m3 is decent for multilingual, but for German specifically you might want to check out the German BERT variants or even the new multilingual e5 models; they tend to capture German semantics way better.
For your inconsistent results problem - this sounds like a chunking issue more than anything else. Apache Tika can be hit or miss with complex PDFs, especially if they have tables or weird formatting. We actually built our own PDF processing pipeline at Nanonets because of similar issues. If you're open to trying alternatives, Docstrange has some solid German language support for document extraction - might be worth checking out for the preprocessing part at least. The key is getting clean, consistent text chunks before they even hit your embedding model.
b5761@reddit (OP)
Yes, that’s exactly the point. I assume I can perform a kind of “lightweight fine-tuning,” for example by experimenting with different chunk sizes or overlap strategies. However, I’ve also heard that you can achieve a much larger optimization step by building a custom PDF processing pipeline, executed outside of OpenWebUI.
Since I’m still new to this topic, my main question is how exactly this setup could be implemented — and how the processed data or embeddings could then be accessed or integrated back into OpenWebUI.
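On the integration question: OpenWebUI exposes a REST API for files and knowledge bases, so an external pipeline can push its cleaned output back into a knowledge base that models then reference. A rough sketch, with the endpoint paths taken from OpenWebUI's API docs but worth verifying against your version, and the token/IDs as placeholders:

```python
import requests

BASE = "http://openwebui-host:8080"                 # placeholder host
HEADERS = {"Authorization": "Bearer <api-key>"}     # key from Settings > Account

def upload_to_knowledge(path: str, knowledge_id: str) -> dict:
    """Upload a pre-processed file and attach it to an existing knowledge base."""
    with open(path, "rb") as f:
        r = requests.post(f"{BASE}/api/v1/files/", headers=HEADERS, files={"file": f})
    r.raise_for_status()
    file_id = r.json()["id"]

    r = requests.post(
        f"{BASE}/api/v1/knowledge/{knowledge_id}/file/add",
        headers=HEADERS,
        json={"file_id": file_id},
    )
    r.raise_for_status()
    return r.json()
```

With that, the custom pipeline (extraction, cleaning, chunk-friendly Markdown) runs outside and OpenWebUI only ever sees the cleaned files; for full control over the embeddings themselves, you'd go the standalone-retriever route described above instead.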