Open-source web crawler with markdown output?
Posted by mxdamp@reddit | LocalLLaMA | 8 comments
I’m looking for an open-source web crawler that can recursively crawl a website (e.g., API documentation) and convert the content into markdown files. I'd like to use the markdown files in RAG applications. I've found Crawl4AI and Firecrawl, but I'd prefer a free TUI or GUI application.
MemeLord-Jenkins@reddit
Oxylabs just added native Markdown output to their Web Scraper API; you pass "markdown": true in the job request and it returns structured MD wrapped in JSON.
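For reference, a minimal sketch of what such a request might look like. The endpoint, auth style, and "source" field are assumptions based on Oxylabs' public realtime API docs; only the "markdown": true flag comes from the comment above.

```python
# Sketch of an Oxylabs Web Scraper API job request with Markdown output.
# Endpoint, auth scheme, and "source" are assumptions from public docs;
# only the "markdown" flag is taken from the comment above.
import requests

payload = {
    "source": "universal",           # assumed generic scraper source
    "url": "https://docs.rs/serde",  # page to scrape (placeholder)
    "markdown": True,                # request Markdown instead of raw HTML
}

response = requests.post(
    "https://realtime.oxylabs.io/v1/queries",
    auth=("USERNAME", "PASSWORD"),   # placeholder credentials
    json=payload,
    timeout=60,
)
response.raise_for_status()

# The Markdown content comes back wrapped in the JSON job result.
print(response.json()["results"][0]["content"])
```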
IcyBackground5204@reddit
https://www.crawl4.com
Dilpreet_13@reddit
Crawl4AI is excellent; it supports asynchronous running for parallel operations and has built-in Markdown output.
I'm also in the process of building a RAG model using it (scraping the docs).
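For anyone wanting to try it, a minimal sketch of Crawl4AI's async API, following the AsyncWebCrawler interface from its README (the target URL and output path are placeholders):

```python
# Minimal Crawl4AI sketch: fetch one page asynchronously and save its Markdown.
# AsyncWebCrawler, arun(), and result.markdown follow Crawl4AI's documented
# interface; the URL and filename here are placeholders.
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://docs.rs/serde")
        with open("serde.md", "w", encoding="utf-8") as f:
            # str() in case the library returns a markdown result object
            f.write(str(result.markdown))

asyncio.run(main())
```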
SHSharkar@reddit
You may consider trying this solution, which provides both crawl functionality and Markdown formatting. It supports crawling a single webpage or a sitemap.xml file. https://crawl.devwz.com
mxdamp@reddit (OP)
A tool built around crawl4ai would be perfect. Thank you for sharing.
SHSharkar@reddit
You are most welcome. I also enjoy the CLI, but there are times when a graphical user interface (GUI) is preferable for ease of use.
Behind the scenes it uses Crawl4AI, just wrapped in a GUI. It also supports sanitization, Markdown formatting, easy copying or downloading of results, log display, and so on.
Overall, these are the features I require the most.
ABC4A_@reddit
I'm new to this, but does having it in Markdown format help? Why not just strip the HTML using Beautiful Soup?
mxdamp@reddit (OP)
While I could use Beautiful Soup or Scraperr to scrape and format the content myself, I’m looking for a tool that’s already configured for this purpose. Ideally, I’d just input a URL – like a docs.rs page, a GitHub wiki page, or a Python/C API Reference Manual page – and the tool would know what to extract and provide it as structured output, ready for LLMs.
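To illustrate the difference ABC4A_ asked about, a quick sketch (the URL and the markdownify library are illustrative choices, not tools from this thread): Beautiful Soup's get_text() flattens the page into undifferentiated text, while an HTML-to-Markdown converter keeps the headings and code fences that RAG chunkers can split on.

```python
# Contrast: stripping HTML vs converting it to Markdown.
# requests/bs4/markdownify are illustrative choices, not the thread's tools.
import requests
from bs4 import BeautifulSoup
from markdownify import markdownify

html = requests.get("https://docs.python.org/3/c-api/intro.html", timeout=30).text

# get_text() collapses headings, lists, and code blocks into plain lines,
# so downstream chunkers lose the document structure.
plain = BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)

# markdownify preserves structural markers (#, ```, -) that make it easy
# to chunk by section for RAG.
md = markdownify(html, heading_style="ATX")

print(plain[:500])
print(md[:500])
```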