Tool for web scraping with LLMs?
Posted by arbayi@reddit | LocalLLaMA | 13 comments
Hey all, I'm trying to put together a scraper that can actually understand the content it's grabbing. Basically want two parts:
- Something that can search the web and grab relevant URLs
- A tool that visits those URLs and pulls out specific info I need
Honestly not sure what's the best way to go about this. Anyone done something similar? Is there a tool that already does this kind of "smart" scraping?
Note: Goal is to make this reusable for different types of product research and specs.
Thunderbit_HQ@reddit
Yeah, I’ve worked on something similar, check out Thunderbit. It’s a Chrome extension designed for "smart" scraping: you describe what info you want in natural language (like "extract product names, prices, and ratings"), and it figures out the structure for you. It can also auto-navigate subpages, which is useful for product spec pages or listings.
Particular_Judge5029@reddit
Hi,
For the first part you can try out any 3rd party tool or API for web search.
With regards to the second question, I have a really cool platform to introduce to you: Minexa.
All you have to do is sign up and you'll get 1,000 free credit points, without having to provide any credit card details upfront. Then head over to the web app, enter the URL in the URLs field and the data you want to scrape in the Sample Data field.
The tool then intelligently extracts the data for you, and you don't have to worry about CSS selectors, JS rendering, etc., the platform handles all of that itself.
The most important part? Once you have the scraper ready for the base page (about 2 minutes), you can reuse it on similarly structured pages and pull their data in a matter of seconds!
Sounds interesting, right? Give it a try by joining via this link:
Minexa Sign Up
iamrafal@reddit
Try supadata.ai, it lets you fetch websites as Markdown and also scrape YouTube transcripts.
teroknor92@reddit
For the second part you can use https://github.com/m92vyas/llm-reader. It will extract any information you need from the URLs.
For the first part you can use the googlesearch library or any search API.
Let me know if you need any help with this.
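For the search half, a minimal sketch assuming the googlesearch-python package (the function name and the `num_results` parameter may differ between versions):

```python
# Hedged sketch: assumes `pip install googlesearch-python`; the query
# string is a placeholder for whatever product research you're doing.
from googlesearch import search

def find_urls(query: str, limit: int = 10) -> list[str]:
    """Return candidate URLs for a search query."""
    return list(search(query, num_results=limit))

for url in find_urls("mechanical keyboard specs"):
    print(url)
```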
promptcloud@reddit
If you're looking for a tool that combines web scraping with LLMs, it depends on your use case. At PromptCloud, we use advanced web scraping techniques to gather structured data, and LLMs can then be used for tasks like analyzing or summarizing that data. While LLMs aren’t built for scraping directly, they’re great for processing and deriving insights once the data is scraped. You can integrate tools like BeautifulSoup or Scrapy for scraping and then use LLMs via APIs (e.g., OpenAI) for analysis. It’s all about combining the right tools for the job!
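A minimal sketch of that scrape-then-analyze pattern, assuming requests, beautifulsoup4, and the openai package; the model name, prompt, and URL are placeholders:

```python
# Hedged sketch: scrape with requests + BeautifulSoup, then hand the
# text to an LLM for extraction. Reads OPENAI_API_KEY from the env.
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

client = OpenAI()

def extract_text(url: str) -> str:
    """Fetch a page and return its visible text."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text(separator="\n", strip=True)

def analyze(url: str, question: str) -> str:
    """Ask the LLM a question about the scraped page text."""
    text = extract_text(url)[:8000]  # crude truncation to fit the context window
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": "Answer using only the provided page text."},
            {"role": "user", "content": f"{question}\n\nPage text:\n{text}"},
        ],
    )
    return resp.choices[0].message.content

print(analyze("https://example.com/product",
              "List the product name, price, and key specs."))
```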
0x5f3759df-i@reddit
Just some advice: the thing most scrapers and over-engineered Python libs (LangChain etc.) get wrong is that the most critical task is removing all the extraneous markup from web pages to maximize the information density of the content (i.e., reducing length), because making good use of the LLM's effective context window is so critical to performance.
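A hedged sketch of that markup-stripping step using beautifulsoup4; the tag list is an assumption about what counts as "extraneous" and will need tuning per site:

```python
# Strip boilerplate markup so more signal fits in the LLM's context window.
import re
from bs4 import BeautifulSoup

NOISE_TAGS = ["script", "style", "nav", "header", "footer", "aside", "form", "noscript"]

def densify(html: str) -> str:
    """Return high-density visible text from raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(NOISE_TAGS):
        tag.decompose()  # remove the element and its children entirely
    text = soup.get_text(separator="\n", strip=True)
    return re.sub(r"\n{3,}", "\n\n", text)  # collapse runs of blank lines
```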
SatoshiNotMe@reddit
You can have a look at clean implementation using a Langroid agent: https://github.com/langroid/langroid/blob/main/examples/docqa/chat-search.py
The script can be run against any LLM via the `-m <model>` CLI arg, e.g. `-m ollama/qwen2.5` or `-m groq/llama-3.1-70b-versatile`.
brewhouse@reddit
There are a few projects like this already: Playwright/Puppeteer based, with an LLM flavour for structured extraction.
- Crawl4AI
- ScrapeGraph
There are a bunch which are paid, but I won't list them, because why pay when there are open-source, functional ones like the above? They'll all converge to the same features anyway. A sketch of the open-source route is below.
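A minimal sketch assuming Crawl4AI's AsyncWebCrawler API (`pip install crawl4ai`); the interface has changed between releases, so check the README for your version:

```python
# Hedged sketch: fetch a page and get LLM-ready markdown back.
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/product")  # placeholder URL
        print(result.markdown)  # cleaned, markdown version of the page

asyncio.run(main())
```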
GoogleOpenLetter@reddit
Matthew Berman, the YouTuber, is building an AI web-scraping project. It's slightly different from your goals, but he's essentially building agents that can do research and present the information. He brings in the guy who created one of the agentic programs, and that guy really knows his stuff. I think you might find it useful.
Part 1
Part 2
Part 3
SM8085@reddit
I made a simple version in Bash that uses Python to send the LLM requests to the local model. I called mine llm-websearch.
I use SearXNG as a search backend. Users need to have that set up for themselves. I use the docker-compose version for ease of use.
Basically, my script gets a list of search results and loops over them, asking the LLM a question about each. I've been using Llama 3.2 3B with it; it picked out my parents' Blu-ray player.
The basic idea should be reusable: get a list of things, loop over them applying a question.
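A Python sketch of that same "get list, loop, apply a question" idea, assuming a local SearXNG instance with its JSON output format enabled and a local model behind an OpenAI-compatible endpoint; the URLs and model name are placeholders:

```python
# Hedged sketch: search via SearXNG's JSON API, then ask a local model
# a question about each hit (here, via an Ollama-style endpoint).
import requests
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

def searx(query: str) -> list[dict]:
    """Query a local SearXNG instance and return its result list."""
    r = requests.get("http://localhost:8080/search",
                     params={"q": query, "format": "json"}, timeout=30)
    return r.json()["results"]

question = "Is this a Blu-ray player product page? Answer yes or no."
for hit in searx("blu-ray player"):
    resp = client.chat.completions.create(
        model="llama3.2:3b",  # placeholder local model
        messages=[{"role": "user",
                   "content": f"{question}\n\nTitle: {hit['title']}\n"
                              f"Snippet: {hit.get('content', '')}"}],
    )
    print(hit["url"], "->", resp.choices[0].message.content)
```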
Dalong_pub@reddit
I like this straightforward method. I found it a bit of a hassle to set up SearXNG with Docker, since I didn't need a full-on alternative to Google, just something a bit better than DuckDuckGo to feed into my LLM.
remyxai@reddit
How about a custom pipeline or spider using Scrapy?
Here's an example using a lightweight classifier to filter images.
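A minimal sketch of the Scrapy route for a product-listing page; the start URL and CSS selectors are placeholders you would adapt per site:

```python
# Hedged sketch: a Scrapy spider that yields structured product records
# and follows pagination. Run with: scrapy runspider spider.py -o out.json
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]  # placeholder

    def parse(self, response):
        for card in response.css("div.product"):  # placeholder selector
            yield {
                "name": card.css("h2::text").get(),
                "price": card.css(".price::text").get(),
                "url": response.urljoin(card.css("a::attr(href)").get()),
            }
        # Follow the next page if there is one.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```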