Tool for web scraping with LLMs?
Posted by arbayi@reddit | LocalLLaMA | 13 comments
Hey all, I'm trying to put together a scraper that can actually understand the content it's grabbing. Basically want two parts:
- Something that can search the web and grab relevant URLs
- A tool that visits those URLs and pulls out specific info I need
Honestly not sure what's the best way to go about this. Anyone done something similar? Is there a tool that already does this kind of "smart" scraping?
Note: Goal is to make this reusable for different types of product research and specs.
Thunderbit_HQ@reddit
Yeah, I’ve worked on something similar, check out Thunderbit. It’s a Chrome extension designed for "smart" scraping: you describe what info you want in natural language (like "extract product names, prices, and ratings"), and it figures out the structure for you. It can also auto-navigate subpages, which is useful for product spec pages or listings.
Particular_Judge5029@reddit
Hi,
For the first part you can try out any 3rd party tool or API for web search.
With regards to the second question, I have a really cool platform to introduce to you: Minexa.
All you have to do is sign up and you'll get 1,000 free credit points, without having to provide any credit card details upfront. Then head over to the web app, enter the URL in the URLs field and the data you want to scrape in the Sample Data field.
The tool then intelligently extracts the data for you, and you don't have to worry about CSS selectors, JS rendering, etc., the platform handles all of that itself.
The most important part? Once you have the scraper ready for the base page (about 2 minutes), you can reuse it on similarly structured pages and pull their data in a matter of seconds!
Sounds interesting, right? Give it a try by joining via this link:
Minexa Sign Up
iamrafal@reddit
Try supadata.ai, it lets you fetch websites as Markdown and also scrape YouTube transcripts.
teroknor92@reddit
For the second part you can use https://github.com/m92vyas/llm-reader. It will extract any information you need from the URLs.
For the first part you can use the googlesearch library or any search API.
Let me know if you need any help with this.
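For the search half, a minimal sketch assuming the googlesearch-python package (the function name and the `num_results` parameter may differ between versions):

```python
# Hedged sketch: assumes `pip install googlesearch-python`; the query
# string is a placeholder for whatever product research you're doing.
from googlesearch import search

def find_urls(query: str, limit: int = 10) -> list[str]:
    """Return candidate URLs for a search query."""
    return list(search(query, num_results=limit))

for url in find_urls("mechanical keyboard specs"):
    print(url)
```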
promptcloud@reddit
If you're looking for a tool that combines web scraping with LLMs, it depends on your use case. At PromptCloud, we use advanced web scraping techniques to gather structured data, and LLMs can then be used for tasks like analyzing or summarizing that data. While LLMs aren’t built for scraping directly, they’re great for processing and deriving insights once the data is scraped. You can integrate tools like BeautifulSoup or Scrapy for scraping and then use LLMs via APIs (e.g., OpenAI) for analysis. It’s all about combining the right tools for the job!
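A minimal sketch of that scrape-then-analyze pattern, assuming requests, beautifulsoup4, and the openai package; the model name, prompt, and URL are placeholders:

```python
# Hedged sketch: scrape with requests + BeautifulSoup, then hand the
# text to an LLM for extraction. Reads OPENAI_API_KEY from the env.
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

client = OpenAI()

def extract_text(url: str) -> str:
    """Fetch a page and return its visible text."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text(separator="\n", strip=True)

def analyze(url: str, question: str) -> str:
    """Ask the LLM a question about the scraped page text."""
    text = extract_text(url)[:8000]  # crude truncation to fit the context window
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": "Answer using only the provided page text."},
            {"role": "user", "content": f"{question}\n\nPage text:\n{text}"},
        ],
    )
    return resp.choices[0].message.content

print(analyze("https://example.com/product",
              "List the product name, price, and key specs."))
```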
0x5f3759df-i@reddit
Just some advice: the thing most scrapers and over-engineered Python libs (LangChain etc.) get wrong is that the most critical task is removing all the extraneous markup from web pages to maximize the information density of the content (i.e., reducing length), because making good use of the LLM's effective context window is so critical to performance.
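A hedged sketch of that markup-stripping step using beautifulsoup4; the tag list is an assumption about what counts as "extraneous" and will need tuning per site:

```python
# Strip boilerplate markup so more signal fits in the LLM's context window.
import re
from bs4 import BeautifulSoup

NOISE_TAGS = ["script", "style", "nav", "header", "footer", "aside", "form", "noscript"]

def densify(html: str) -> str:
    """Return high-density visible text from raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(NOISE_TAGS):
        tag.decompose()  # remove the element and its children entirely
    text = soup.get_text(separator="\n", strip=True)
    return re.sub(r"\n{3,}", "\n\n", text)  # collapse runs of blank lines
```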
SatoshiNotMe@reddit
You can have a look at clean implementation using a Langroid agent: https://github.com/langroid/langroid/blob/main/examples/docqa/chat-search.py
The script can be run against any LLM via the `-m <model>` CLI arg, e.g. `-m ollama/qwen2.5` or `-m groq/llama-3.1-70b-versatile`.
brewhouse@reddit
There are a few projects like this already: Playwright/Puppeteer based, with an LLM flavour for structured extraction.
- Crawl4AI
- ScrapeGraph
There are a bunch which are paid, but I won't list them, because why pay when there are open-source, functional ones like the above? They'll all converge to the same features anyway. A sketch of the open-source route is below.
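A minimal sketch assuming Crawl4AI's AsyncWebCrawler API (`pip install crawl4ai`); the interface has changed between releases, so check the README for your version:

```python
# Hedged sketch: fetch a page and get LLM-ready markdown back.
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/product")  # placeholder URL
        print(result.markdown)  # cleaned, markdown version of the page

asyncio.run(main())
```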
GoogleOpenLetter@reddit
Matthew Berman, the YouTuber, is building an AI web-scraping project. It's slightly different from your goals, but he's essentially building agents that can do research and present the information. He brings in the guy who created one of the agentic programs, and that guy really knows his stuff. I think you might find it useful.
Part 1
Part 2
Part 3
SM8085@reddit
I made a simple version in Bash that uses Python to send the LLM requests to the local model. I called mine llm-websearch.
I use SearXNG as a search backend. Users need to have that set up for themselves. I use the docker-compose version for ease of use.
Basically, my script gets a list of search results and loops over them, asking the LLM a question about each. I've been using Llama 3.2 3B with it; it picked out my parents' Blu-ray player.
The basic idea should be reusable: get a list of things, loop over them applying a question.
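A Python sketch of that same "get list, loop, apply a question" idea, assuming a local SearXNG instance with its JSON output format enabled and a local model behind an OpenAI-compatible endpoint; the URLs and model name are placeholders:

```python
# Hedged sketch: search via SearXNG's JSON API, then ask a local model
# a question about each hit (here, via an Ollama-style endpoint).
import requests
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

def searx(query: str) -> list[dict]:
    """Query a local SearXNG instance and return its result list."""
    r = requests.get("http://localhost:8080/search",
                     params={"q": query, "format": "json"}, timeout=30)
    return r.json()["results"]

question = "Is this a Blu-ray player product page? Answer yes or no."
for hit in searx("blu-ray player"):
    resp = client.chat.completions.create(
        model="llama3.2:3b",  # placeholder local model
        messages=[{"role": "user",
                   "content": f"{question}\n\nTitle: {hit['title']}\n"
                              f"Snippet: {hit.get('content', '')}"}],
    )
    print(hit["url"], "->", resp.choices[0].message.content)
```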
Dalong_pub@reddit
I like this straightforward method. I found it a bit of a hassle to set up SearXNG with Docker, since I didn't need a full-on alternative to Google, just something a bit better than DuckDuckGo to feed into my LLM.
remyxai@reddit
How about a custom pipeline or spider using Scrapy?
Here's an example using a lightweight classifier to filter images.
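A minimal sketch of the Scrapy route for a product-listing page; the start URL and CSS selectors are placeholders you would adapt per site:

```python
# Hedged sketch: a Scrapy spider that yields structured product records
# and follows pagination. Run with: scrapy runspider spider.py -o out.json
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]  # placeholder

    def parse(self, response):
        for card in response.css("div.product"):  # placeholder selector
            yield {
                "name": card.css("h2::text").get(),
                "price": card.css(".price::text").get(),
                "url": response.urljoin(card.css("a::attr(href)").get()),
            }
        # Follow the next page if there is one.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```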