Open-source web crawler with markdown output?
Posted by mxdamp@reddit | LocalLLaMA | 8 comments
I’m looking for an open-source web crawler that can recursively crawl a website (e.g., API documentation) and convert the content into markdown files. I'd like to use the markdown files in RAG applications. I've found Crawl4AI and Firecrawl, but I'd prefer a free TUI or GUI application.
MemeLord-Jenkins@reddit
Oxylabs just added native Markdown output to their Web Scraper API; you pass "markdown": true in the job request and it returns structured MD wrapped in JSON.
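For reference, a minimal sketch of what such a request might look like. The endpoint, auth style, and "source" field are assumptions based on Oxylabs' public realtime API docs; only the "markdown": true flag comes from the comment above.

```python
# Sketch of an Oxylabs Web Scraper API job request with Markdown output.
# Endpoint, auth scheme, and "source" are assumptions from public docs;
# only the "markdown" flag is taken from the comment above.
import requests

payload = {
    "source": "universal",           # assumed generic scraper source
    "url": "https://docs.rs/serde",  # page to scrape (placeholder)
    "markdown": True,                # request Markdown instead of raw HTML
}

response = requests.post(
    "https://realtime.oxylabs.io/v1/queries",
    auth=("USERNAME", "PASSWORD"),   # placeholder credentials
    json=payload,
    timeout=60,
)
response.raise_for_status()

# The Markdown content comes back wrapped in the JSON job result.
print(response.json()["results"][0]["content"])
```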
IcyBackground5204@reddit
https://www.crawl4.com
Dilpreet_13@reddit
Crawl4AI is excellent; it supports asynchronous running for parallel operations and has built-in Markdown output.
I'm also in the process of building a RAG model using it (scraping the docs).
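For anyone wanting to try it, a minimal sketch of Crawl4AI's async API, following the AsyncWebCrawler interface from its README (the target URL and output path are placeholders):

```python
# Minimal Crawl4AI sketch: fetch one page asynchronously and save its Markdown.
# AsyncWebCrawler, arun(), and result.markdown follow Crawl4AI's documented
# interface; the URL and filename here are placeholders.
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://docs.rs/serde")
        with open("serde.md", "w", encoding="utf-8") as f:
            # str() in case the library returns a markdown result object
            f.write(str(result.markdown))

asyncio.run(main())
```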
SHSharkar@reddit
You may consider trying this solution, which provides both crawl functionality and Markdown formatting. It supports crawling a single webpage or a sitemap.xml file. https://crawl.devwz.com
mxdamp@reddit (OP)
A tool built around crawl4ai would be perfect. Thank you for sharing.
SHSharkar@reddit
You are most welcome. I also enjoy the CLI, but there are times when a graphical user interface (GUI) is preferable for ease of use.
Behind the scenes it uses Crawl4AI, just wrapped in a GUI. It also supports sanitization, Markdown formatting, easy copying or downloading of results, log display, and so on.
Overall, these are the features I require the most.
ABC4A_@reddit
I'm new to this, but does having it in Markdown format help? Why not just strip the HTML using Beautiful Soup?
mxdamp@reddit (OP)
While I could use Beautiful Soup or Scraperr to scrape and format the content myself, I’m looking for a tool that’s already configured for this purpose. Ideally, I’d just input a URL – like a docs.rs page, a GitHub wiki page, or a Python/C API Reference Manual page – and the tool would know what to extract and provide it as structured output, ready for LLMs.
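To illustrate the difference ABC4A_ asked about, a quick sketch (the URL and the markdownify library are illustrative choices, not tools from this thread): Beautiful Soup's get_text() flattens the page into undifferentiated text, while an HTML-to-Markdown converter keeps the headings and code fences that RAG chunkers can split on.

```python
# Contrast: stripping HTML vs converting it to Markdown.
# requests/bs4/markdownify are illustrative choices, not the thread's tools.
import requests
from bs4 import BeautifulSoup
from markdownify import markdownify

html = requests.get("https://docs.python.org/3/c-api/intro.html", timeout=30).text

# get_text() collapses headings, lists, and code blocks into plain lines,
# so downstream chunkers lose the document structure.
plain = BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)

# markdownify preserves structural markers (#, ```, -) that make it easy
# to chunk by section for RAG.
md = markdownify(html, heading_style="ATX")

print(plain[:500])
print(md[:500])
```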