Semantic search over 100M rows of data?
Posted by cryptoguy23@reddit | LocalLLaMA | View on Reddit | 19 comments
Hi - I’m working with a large dataset and looking to build a quick search engine that can take unstructured queries as input and find matching products. Each product has a description, product name, color, weight, images, shipping time, and SKU. There are about 100M products; the CSV file is 164 GB.
Would fine-tuning llama 3.1 405b work for this?
What’s the best stack for this? What approximate cost am I looking at?
OrganicMesh@reddit
A package called usearch is as fast as faiss for vectors - I would embed the product text of all 100M products and then search over the embeddings.
For plain-English product text at that scale, check out TaylorAI/gte-tiny or a similar small embedding model.
Depending on your hardware (GPU) you should be able to encode ~1000 texts/s, so 100M products works out to roughly 28 hours of encoding on a single GPU.
OrganicMesh@reddit
https://github.com/michaelfeil/infinity
https://huggingface.co/TaylorAI/gte-tiny
```
docker run -it --gpus all michaelf34/infinity:latest-trt-onnx v2 --model-id TaylorAI/gte-tiny --engine optimum --device cuda
```
Combine with: https://github.com/unum-cloud/usearch
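A minimal sketch of how those pieces fit together, assuming TaylorAI/gte-tiny loads via sentence-transformers and emits 384-dim vectors (check the model card), with usearch holding the index; in practice you would stream the 100M rows in batches and persist the index to disk:

```python
# Sketch: embed product text with gte-tiny, index and query with usearch.
# Assumes: pip install sentence-transformers usearch; 384-dim output (verify on the model card).
import numpy as np
from sentence_transformers import SentenceTransformer
from usearch.index import Index

model = SentenceTransformer("TaylorAI/gte-tiny")
products = ["Red cotton t-shirt, 200g, ships in 2 days", "Blue ceramic mug, 350ml"]  # toy rows

# Normalize so cosine similarity equals inner product.
vectors = model.encode(products, normalize_embeddings=True).astype(np.float32)

index = Index(ndim=vectors.shape[1], metric="cos")   # HNSW index
index.add(np.arange(len(products)), vectors)         # keys are row ids back into your CSV
index.save("products.usearch")                       # persist; reload later with Index.restore(...)

query = model.encode(["lightweight red shirt"], normalize_embeddings=True).astype(np.float32)
matches = index.search(query[0], 10)                 # top-10 nearest products
print(matches.keys, matches.distances)
```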
gthing@reddit
I really like faiss, but idk about a dataset that size.
https://github.com/facebookresearch/faiss
Aron-One@reddit
Faiss and binary quantization can do miracles. I've been dealing with 40M Wikidata records, and Faiss takes about 5 seconds (on GPU) to retrieve the top 25 closest records. Hugging Face has a nice tutorial about it: https://huggingface.co/blog/embedding-quantization
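For reference, a minimal sketch of the binary-quantization trick from that tutorial, assuming 384-dim float embeddings are already computed (faiss binary indexes need the dimension to be a multiple of 8):

```python
# Sketch: binary quantization + faiss binary index, per the HF embedding-quantization tutorial.
# Assumes embeddings are float32 of shape (n, 384); sign-threshold them into packed bits.
import numpy as np
import faiss

dim = 384                                                 # must be a multiple of 8
emb = np.random.randn(100_000, dim).astype(np.float32)    # stand-in for real product embeddings

codes = np.packbits((emb > 0).astype(np.uint8), axis=1)   # (n, dim/8) uint8 codes

index = faiss.IndexBinaryFlat(dim)                        # exact Hamming search over packed codes
index.add(codes)

query = np.packbits((emb[:1] > 0).astype(np.uint8), axis=1)
distances, ids = index.search(query, 25)                  # top-25 by Hamming distance
print(ids[0])
```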
Barry_Jumps@reddit
Avoid Elastic at all costs and go with Meilisearch, trust me. https://www.meilisearch.com/solutions/hybrid-search
bisontruffle@reddit
Meilisearch is fantastic for RAG and more, and it's really fast. Podscan uses it for a huge number of rows, though if you follow the founder on Twitter he seems to have some reservations at times.
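For anyone curious what the Meilisearch route looks like, here is a minimal sketch with the official Python client, assuming a local instance; the hybrid search linked above additionally needs an embedder configured in the index settings:

```python
# Sketch: basic Meilisearch indexing and search with the official Python client.
# Assumes: pip install meilisearch; an instance on localhost:7700; field names are illustrative.
import meilisearch

client = meilisearch.Client("http://localhost:7700", "masterKey")
index = client.index("products")

# add_documents is asynchronous; in a real script wait for the task to finish before searching.
index.add_documents(
    [
        {"sku": "TS-001", "name": "Red cotton t-shirt", "color": "red", "weight_g": 200},
        {"sku": "MG-002", "name": "Blue ceramic mug", "color": "blue", "weight_g": 350},
    ],
    primary_key="sku",
)

# Keyword search works out of the box; a `hybrid` option can be passed once an embedder is set up.
results = index.search("lightweight red shirt")
print([hit["name"] for hit in results["hits"]])
```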
Echo9Zulu-@reddit
Sounds like you already have a structured document database setup. I would use Python and Elastic to convert your CSV into JSON objects, create your field mapping, and build an ingestion pipeline.
LLMs can help you build Elasticsearch queries, but for this you will need a really strong handle on field content and deep knowledge of your data structure. Having some kind of hierarchical relationships, like categories, would be super helpful. Many NLP techniques exist to help you study your collection as a corpus, and Elasticsearch can help you leverage them without having to build the complex infrastructure your data would otherwise require.
If you use the Explain API, a long-context local model can help you interpret the report and make sense of how to build your mapping.
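A minimal sketch of that ingestion pipeline, assuming the official Python client (elasticsearch>=8) against a local node; the index name, field types, and chunk size are placeholders to adapt:

```python
# Sketch: stream a product CSV into Elasticsearch with an explicit field mapping.
# Assumes: pip install elasticsearch; a node at localhost:9200; adjust auth/index name as needed.
import csv
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="products",
    mappings={"properties": {
        "name": {"type": "text"},
        "description": {"type": "text"},
        "color": {"type": "keyword"},
        "weight": {"type": "float"},
        "shipping_time": {"type": "keyword"},
        "sku": {"type": "keyword"},
    }},
)

def rows(path):
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield {"_index": "products", "_id": row["sku"], "_source": row}

# streaming_bulk keeps memory flat even on a 164 GB file.
for ok, info in helpers.streaming_bulk(es, rows("products.csv"), chunk_size=5000):
    if not ok:
        print("failed:", info)
```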
DeFiNator_2021@reddit
I see that others have commented already, but there's really no need for an LLM here. Build it as a vector database. For example, look into Qdrant. It will be fast, and accuracy is great.
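A minimal sketch with the Qdrant Python client, assuming a local instance and precomputed 384-dim embeddings; the collection name and payload fields are illustrative:

```python
# Sketch: index product embeddings plus structured fields in Qdrant, then query.
# Assumes: pip install qdrant-client; Qdrant running on localhost:6333; 384-dim vectors.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="products",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

client.upsert(
    collection_name="products",
    points=[
        PointStruct(id=1, vector=[0.01] * 384,  # replace with a real embedding
                    payload={"name": "Red cotton t-shirt", "color": "red", "sku": "TS-001"}),
    ],
)

hits = client.search(
    collection_name="products",
    query_vector=[0.01] * 384,  # embedding of the user's unstructured query
    limit=10,
)
print([h.payload for h in hits])
```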
Medical_Chemistry_63@reddit
As others have said, the non-AI approach is probably the best fit for this job. Not everything needs to be AI; over-engineering could cause headaches further down the line.
BoyKai@reddit
Check out Azure AI Search and vector indexing for RAG.
krtcl@reddit
How does Azure AI Search compare with something like Qdrant hybrid search?
NarrowTea3631@reddit
I love Qdrant, but I would use Elasticsearch for this. The sparse and ColBERT vectors are huge, btw, and for my use case they barely improved accuracy compared to regular embedding vectors.
Some_Endian_FP17@reddit
Cosmos DB now has vector indexing and different search algos for NoSQL.
localllm2@reddit
I'd use an embedding model + vector search approach. In particular, Postgres + pgvector! That comes with the extra value of also having access to SQL to query your structured fields (in addition to your unstructured queries).
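A minimal sketch of that combination, assuming Postgres with the pgvector extension available, the psycopg2 driver plus the pgvector Python adapter, and 384-dim embeddings; table and column names are illustrative:

```python
# Sketch: store embeddings next to structured product fields in Postgres + pgvector,
# then mix a vector similarity search with a plain SQL filter.
# Assumes: pip install psycopg2-binary pgvector; the vector extension installable in the DB.
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("dbname=products user=postgres")
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.commit()
register_vector(conn)  # teaches psycopg2 to pass numpy arrays as pgvector values

cur.execute("""
    CREATE TABLE IF NOT EXISTS products (
        sku text PRIMARY KEY,
        name text,
        color text,
        weight_g real,
        embedding vector(384)
    )
""")

cur.execute(
    "INSERT INTO products (sku, name, color, weight_g, embedding) VALUES (%s, %s, %s, %s, %s) "
    "ON CONFLICT (sku) DO NOTHING",
    ("TS-001", "Red cotton t-shirt", "red", 200.0, np.random.rand(384).astype(np.float32)),
)

# Vector search (cosine distance operator <=>) combined with a structured filter.
query_vec = np.random.rand(384).astype(np.float32)  # embedding of the user's query text
cur.execute(
    "SELECT sku, name FROM products WHERE color = %s ORDER BY embedding <=> %s LIMIT 10",
    ("red", query_vec),
)
print(cur.fetchall())
conn.commit()
```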
Little_Dick_Energy1@reddit
Why can't you use a database?
dash_bro@reddit
Look at it as a search and retrieve problem. May not need an LLM at all!
Two steps, as a baseline:

1. BM25 all the product descriptions you've got. This is your "fast" document search. The quality will be pretty rough, but remember -- the goal is only to maximise recall from your entire search space. Take the top-K results; I'd recommend between 0.1% and 1% of your document space.
2. Rerank using multiple metadata filters and even [query x documents] rerankers. You can go with a cross-encoder for ranking [query x documents], or even semantic search over this set of [100k - 1M] documents using a really simple quantised model (see the sketch below).

Note that this is only a baseline. There are a million different ways to do it using similar applications of the "speed" step -> "precision" step.
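A toy sketch of that two-step baseline, assuming rank_bm25 for the recall step and a small sentence-transformers cross-encoder for the precision step; model choice, corpus, and K are placeholders:

```python
# Sketch: BM25 recall step followed by a cross-encoder rerank step.
# Assumes: pip install rank_bm25 sentence-transformers; toy corpus stands in for 100M products.
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

corpus = [
    "Red cotton t-shirt, 200g, ships in 2 days",
    "Blue ceramic mug, 350ml, ships in 5 days",
    "Lightweight red running shirt, 120g",
]
query = "lightweight red shirt"

# Step 1: cheap lexical recall with BM25 (take ~0.1-1% of the corpus in the real system).
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
candidates = bm25.get_top_n(query.lower().split(), corpus, n=2)

# Step 2: precise rerank of the shortlist with a cross-encoder over (query, document) pairs.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in candidates])
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
print(ranked)
```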
If you have search-specific signals like click rate, user preferences, metadata for search, etc., you might wanna look at learning-to-rank (LTR) algorithms.
This is a (slightly outdated, but still really good) starting point on ranking systems for search: https://www.youtube.com/live/oXfFqAKf4Ac?si=SWrRSgVJ_B9ER5mC
Chaosdrifer@reddit
It is probably much easier to store the actual data in a DB, either SQL-based or something like Elasticsearch, and then use a text2sql LLM that translates natural language into structured input, like a SQL query or HTTP calls to Elasticsearch. It can either return the results directly or answer your query based on those results.
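A rough sketch of that idea, assuming a local OpenAI-compatible endpoint (llama.cpp, vLLM, etc.) serving some instruct model and a SQLite copy of the products table; the model name, schema, and endpoint are placeholders, and generated SQL should be validated before running it against real data:

```python
# Sketch: translate a natural-language query into SQL with a local LLM, then run it read-only.
# Assumes: pip install openai; an OpenAI-compatible server on localhost:8000; a products table.
import sqlite3
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

schema = "products(sku TEXT, name TEXT, description TEXT, color TEXT, weight_g REAL, shipping_days INT)"
question = "red shirts under 300 grams that ship within 3 days"

resp = client.chat.completions.create(
    model="local-model",  # placeholder: whatever model name the server exposes
    messages=[
        {"role": "system", "content": f"Translate the user's request into one SQLite SELECT over {schema}. Reply with SQL only."},
        {"role": "user", "content": question},
    ],
)
sql = resp.choices[0].message.content.strip().strip("`")

# Naive guard: only run SELECTs; a real system should parse and allow-list the SQL.
assert sql.lower().lstrip().startswith("select"), sql
rows = sqlite3.connect("products.db").execute(sql).fetchmany(20)
print(sql, rows, sep="\n")
```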
ParaboloidalCrest@reddit
I would use Elastic Search for that use case, but it's just me. Hopefully others would chime in about LLM-based solutions.
laurenblackfox@reddit
I would second the non-LLM approach. It'd be cheaper, much more flexible, and doesn't rely on a nebulous compute resource to function. ElasticSearch would be my go-to as well.