Best platform-agnostic tools/frameworks to vectorize large wikis (not Wikipedia) for RAG?
Posted by Mgeek35@reddit | LocalLLaMA | 6 comments
Hi folks,
I'm working at an LLM company on a specialized business use case. Since most LLMs weren't trained on our business data, we are scraping our internal wikis and trying to build a vector database out of them to use in our RAG. We want this database to be usable regardless of the RAG framework. One problem I found with tools like LlamaIndex (please correct me if I'm wrong) is that they store the data in framework-specific objects, which aren't really usable or transferable outside LlamaIndex.
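One framework-agnostic pattern, sketched below, is to keep the canonical store as plain JSONL records and load those into whatever vector DB or RAG framework you end up using. The field names (id, text, metadata, embedding) are an assumption here, not a standard:

```python
import json

def save_chunks(path, chunks):
    """Write chunks to JSONL, one record per line.

    chunks: iterable of dicts like
    {"id": "page-42#0", "text": "chunk text", "metadata": {"url": "https://wiki.example/page"}, "embedding": [0.1, 0.2]}
    """
    with open(path, "w", encoding="utf-8") as f:
        for chunk in chunks:
            f.write(json.dumps(chunk, ensure_ascii=False) + "\n")

def load_chunks(path):
    """Read the portable JSONL store back into plain Python dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```

Plain JSON survives any framework change; each backend's loader then just maps these fields onto its own object model.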
andreasntr@reddit
MongoDB vector indexes are quite flexible, plus you can store and query metadata to filter out objects that don't match given conditions at inference time. pgvector does the same but with a relational approach.
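A minimal sketch of that metadata-filtering pattern with pgvector (the table name, jsonb column, and 1024-dim vector size are assumptions; it needs the psycopg and pgvector Python packages and a Postgres with the vector extension available):

```python
import numpy as np
import psycopg
from pgvector.psycopg import register_vector
from psycopg.types.json import Jsonb

conn = psycopg.connect("dbname=wiki user=postgres", autocommit=True)
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)  # adapts numpy arrays to/from vector columns

# Hypothetical schema: chunk text + jsonb metadata + embedding.
conn.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id bigserial PRIMARY KEY,
        text text,
        meta jsonb,
        embedding vector(1024)
    )
""")

# Insert one chunk with its metadata and (dummy) embedding.
conn.execute(
    "INSERT INTO chunks (text, meta, embedding) VALUES (%s, %s, %s)",
    ("Some wiki paragraph", Jsonb({"space": "engineering"}), np.random.rand(1024)),
)

# Query by cosine distance (<=>), filtering on metadata at inference time.
rows = conn.execute(
    """SELECT text FROM chunks
       WHERE meta->>'space' = %s
       ORDER BY embedding <=> %s
       LIMIT 5""",
    ("engineering", np.random.rand(1024)),
).fetchall()
```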
Mgeek35@reddit (OP)
Thanks, people. I really appreciate the ideas.
rbgo404@reddit
You can use any vector database and shape the stored objects to your needs.
For example, with Weaviate you can define the input fields however you like, and store metadata alongside the vector.
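Something like this, assuming the older v3-style Weaviate Python client and a local instance (the class and property names are made up; the newer v4 client uses a collections-based API instead):

```python
import weaviate

client = weaviate.Client("http://localhost:8080")

# Define a class with your own fields; "WikiChunk" is a hypothetical name.
client.schema.create_class({
    "class": "WikiChunk",
    "vectorizer": "none",  # we supply our own precomputed vectors
    "properties": [
        {"name": "text", "dataType": ["text"]},
        {"name": "source_url", "dataType": ["text"]},
        {"name": "section", "dataType": ["text"]},
    ],
})

# Insert an object: metadata fields plus its embedding.
client.data_object.create(
    data_object={
        "text": "Some wiki paragraph",
        "source_url": "https://wiki.example/page",
        "section": "Intro",
    },
    class_name="WikiChunk",
    vector=[0.05] * 1024,  # your real embedding here
)
```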
Accomplished_Map2130@reddit
Store it in a vector DB like Qdrant.
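A minimal Qdrant sketch with the official Python client (collection name, payload fields, and the 1024-dim size are assumptions; ":memory:" runs an in-process instance for testing, swap in a server URL for real use):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(":memory:")  # or QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="wiki",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

# Upsert one chunk; payload carries the text and any metadata.
client.upsert(
    collection_name="wiki",
    points=[
        PointStruct(
            id=1,
            vector=[0.05] * 1024,  # your real embedding here
            payload={"text": "Some wiki paragraph", "url": "https://wiki.example/page"},
        )
    ],
)

hits = client.search(collection_name="wiki", query_vector=[0.05] * 1024, limit=3)
```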
vasileer@reddit
Postgres (e.g. Supabase) with pg_vector for semantic search, and the bge-m3 model (8K-token context window) to create the embeddings.
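Getting bge-m3 dense embeddings via the FlagEmbedding package looks roughly like this (the model downloads from BAAI/bge-m3 on first use; use_fp16 speeds up GPU inference):

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

texts = ["First wiki chunk.", "Second wiki chunk."]

# max_length can go up to 8192 tokens; dense vectors come back under "dense_vecs".
out = model.encode(texts, max_length=8192)
embeddings = out["dense_vecs"]  # numpy array, shape (len(texts), 1024)
```

Those 1024-dim vectors are what you'd write into the pg_vector column.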
complains_constantly@reddit
Yup. And do it in batches of 256 with the model, since FlagEmbedding supports that. I get about 750 embeddings per second on an RTX 6000 Ada.
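A rough way to measure that kind of throughput yourself, assuming the same BGEM3FlagModel as above (the dummy corpus and numbers here are illustrative; results depend heavily on GPU and chunk length):

```python
import time
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
texts = ["a reasonably long wiki paragraph " * 20] * 2048  # dummy corpus

start = time.perf_counter()
# FlagEmbedding batches internally; batch_size=256 per the comment above.
vecs = model.encode(texts, batch_size=256, max_length=8192)["dense_vecs"]
elapsed = time.perf_counter() - start
print(f"{len(texts) / elapsed:.0f} embeddings/sec")
```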