Tool for web scraping with LLMs?
Posted by arbayi@reddit | LocalLLaMA | View on Reddit | 9 comments
Hey all, I'm trying to put together a scraper that can actually understand the content it's grabbing. Basically want two parts:
- Something that can search the web and grab relevant URLs
- A tool that visits those URLs and pulls out specific info I need
Honestly not sure what's the best way to go about this. Anyone done something similar? Is there a tool that already does this kind of "smart" scraping?
Note: Goal is to make this reusable for different types of product research and specs.
promptcloud@reddit
If you're looking for a tool that combines web scraping with LLMs, it depends on your use case. At PromptCloud, we use advanced web scraping techniques to gather structured data, and LLMs can then be used for tasks like analyzing or summarizing that data. While LLMs aren’t built for scraping directly, they’re great for processing and deriving insights once the data is scraped. You can integrate tools like BeautifulSoup or Scrapy for scraping and then use LLMs via APIs (e.g., OpenAI) for analysis. It’s all about combining the right tools for the job!
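The scrape-then-analyze split described above can be sketched roughly like this. The function names and the payload shape are illustrative placeholders (an OpenAI-style chat-completions message list), not any particular vendor's stack; the actual HTTP scraping and API call are left out.

```python
# Sketch of the scrape-then-analyze split: already-scraped plain text is
# packed into an OpenAI-style chat payload for analysis. Names and the
# truncation limit are assumptions to tune for your use case.

def build_analysis_messages(page_text: str, task: str) -> list[dict]:
    """Build a chat-completions message list asking the LLM to analyze
    scraped text, truncated to keep the context window small."""
    return [
        {"role": "system", "content": "You analyze scraped web content."},
        {"role": "user", "content": f"{task}\n\n{page_text[:4000]}"},
    ]

# With the openai package, usage would be roughly:
#   client.chat.completions.create(model="gpt-4o-mini",
#       messages=build_analysis_messages(text, "Summarize the product specs."))
```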
0x5f3759df-i@reddit
Just some advice: the thing most scrapers and over-engineered Python libraries (LangChain etc.) get wrong is the most critical task of all, which is removing all the extraneous markup from web pages to maximize the information density of the content. Reducing length matters because the LLM's effective context window is so critical to performance.
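A minimal stdlib-only sketch of that cleanup step, using Python's built-in HTMLParser. The list of tags to skip and the whitespace handling are assumptions to tune per site, not a complete solution:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping markup-heavy tags that add tokens
    but little information (the SKIP set is a starting point, not exhaustive)."""
    SKIP = {"script", "style", "nav", "header", "footer", "svg", "noscript"}

    def __init__(self):
        super().__init__()
        self._depth = 0    # nesting depth inside skipped tags
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth == 0 and data.strip():
            self._chunks.append(data.strip())

def strip_markup(html: str) -> str:
    """Return just the visible text of an HTML page, whitespace-collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser._chunks)
```

Dropping scripts, styles, and navigation chrome this way typically shrinks a page by an order of magnitude before it ever reaches the model.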
SatoshiNotMe@reddit
You can have a look at a clean implementation using a Langroid agent: https://github.com/langroid/langroid/blob/main/examples/docqa/chat-search.py
The script can be run against any LLM via the -m <model> CLI arg, e.g. -m ollama/qwen2.5 or -m groq/llama-3.1-70b-versatile.
brewhouse@reddit
There are a few projects like this already: Playwright/Puppeteer based, with an LLM flavour for structured extraction.
Crawl4AI
ScrapeGraph
There are a bunch that are paid, but I won't list them; why pay when there are open-source, functional ones like the above? They'll all converge to the same features anyway.
GoogleOpenLetter@reddit
Matthew Berman the YouTuber is building an AI web scraping project. It's slightly different from your goals, but he's essentially building agents that can do research and present the information. He brings in the guy who created one of the agentic programs, and that guy really knows his stuff. I think you might find it useful.
Part 1
Part 2
Part 3
SM8085@reddit
I made a simple version in Bash that uses Python to send the LLM requests to the local model. I called mine llm-websearch.
I use SearXNG as a search backend. Users need to have that set up for themselves. I use the docker-compose version for ease of use.
Basically my script,
I've been using Llama 3.2 3B with it. It picked out my parents' Blu-ray player.
The basic idea should be reusable: get a list of things, then loop over them applying a question.
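That loop can be sketched like this. The SearXNG JSON endpoint (q= plus format=json) is real but must be enabled in the instance's settings; the helper names and the injected ask function are hypothetical stand-ins for however you call your local model:

```python
from urllib.parse import urlencode

def searxng_query_url(base: str, query: str) -> str:
    """Build a SearXNG search URL requesting JSON results.
    (The JSON format must be enabled in the instance's settings.)"""
    return f"{base.rstrip('/')}/search?" + urlencode({"q": query, "format": "json"})

def apply_question(items: list[str], question: str, ask) -> list[tuple[str, str]]:
    """Get a list of things, loop over them applying a question.
    `ask(prompt)` is whatever sends the prompt to your local model."""
    return [(item, ask(f"{question}\n\n{item}")) for item in items]
```

Keeping the model call injectable means the same loop works whether the backend is Ollama, llama.cpp, or an OpenAI-compatible server.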
Dalong_pub@reddit
I like this straightforward method. I found it a bit of a hassle to use SearXNG with Docker, since I didn't need a full-on alternative to Google, just something a bit better than DuckDuckGo to feed into my LLM.
remyxai@reddit
How about a custom pipeline or spider using Scrapy?
Here's an example using a lightweight classifier to filter images.