How are you handling web crawling? Firecrawl is great, but I'm hitting limits.
Posted by Robertshee@reddit | LocalLLaMA | 24 comments
Been experimenting with web search and content extraction for a small AI assistant project, and I'm hitting a few bottlenecks. My current setup is basically 1) Search for a batch of URLs 2) Scrape and extract the text and 3) Feed it to an LLM for answers.
It works decently, but the main issue is managing multiple services: dealing with search APIs, scraping infrastructure, and LLM calls separately. Maintaining that pipeline feels heavier than it should.
Is there a better way to handle this? Ideally something that bundles search + content extraction + LLM generation together. All this without having to constantly manage multiple services manually.
Basically: I need a simpler dev stack for AI-powered web-aware assistants that handles both data retrieval and answer generation cleanly. I wanna know if anyone has built this kind of pipeline in production.
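For reference, the three-step pipeline described above (search → scrape → answer) can be sketched as one small orchestration function. Everything here is a stub: `search`, `scrape`, and `answer` stand in for whatever search API, scraper, and LLM you actually wire up, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    text: str


def search(query: str, limit: int = 3) -> list[str]:
    # Stub: call your search API (searxng, SerpAPI, etc.) here.
    return [f"https://example.com/result/{i}" for i in range(limit)]


def scrape(url: str) -> Page:
    # Stub: fetch the URL and extract clean text (crawl4ai, Firecrawl, ...).
    return Page(url=url, text=f"extracted text from {url}")


def answer(question: str, pages: list[Page]) -> str:
    # Stub: build a context block and send it to the LLM of your choice.
    context = "\n\n".join(f"[{p.url}]\n{p.text}" for p in pages)
    return f"LLM answer to {question!r} given {len(pages)} sources"


def ask(question: str) -> str:
    # The whole pipeline: search -> scrape each hit -> answer over the pages.
    urls = search(question)
    pages = [scrape(u) for u in urls]
    return answer(question, pages)
```

The point of collapsing it into one `ask()` entrypoint is that the glue (batching, retries, caching) only has to live in one place, whichever vendors back the three stubs.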
Budget_Sprinkles_451@reddit
I GET YOUR STRUGGLE
So: Shipped /openpull today.
The idea: scraping shouldn't require writing selectors. Struggled with this a lot the past year.
Just say what you want:
→ "find all team members with their roles and linkedin"
→ get clean JSON back
It auto-discovers relevant pages, handles JS-rendered sites, and outputs structured data.
Built on crawl4ai + Gemini. Open source.
Happy about feedback, challenges. Hope this helps.
github.com/federicodeponte/openpull
HarambeTenSei@reddit
I search with searxng and crawl with crawl4ai. Attached some vpn proxy to get around some of the rate limits
heyyyjoo@reddit
What rate limits were you facing? Is the vpn proxy for searxng or crawl4ai? I'm looking to start using crawl4ai too and wondering what I need to look out for.
HarambeTenSei@reddit
It's for both. Search engines get edgy if you send too many requests at once, and Google has even removed the free quota from their search API, so now I do Google search with crawl4ai as a searxng engine. Google doesn't like that either, so I vibe coded my NordVPN subscription into a WireGuard proxy that just switches VPN servers every 5 minutes or so.
Sometimes websites also don't like thousands of requests from the same IP, so they might captcha you. I just rotate the VPN out whenever that happens.
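The rotation logic above (hop every ~5 minutes, or immediately on a captcha) is simple enough to sketch. This is a minimal, hypothetical version: the proxy strings are placeholders, and the actual WireGuard/NordVPN server switch would happen behind whatever `rotate()` selects.

```python
import time
from itertools import cycle


class ProxyRotator:
    """Cycle through a pool of proxy endpoints, switching on a timer
    (e.g. every 300 seconds) or on demand when a site serves a captcha."""

    def __init__(self, proxies, interval_s=300.0, clock=time.monotonic):
        self._pool = cycle(proxies)
        self._interval = interval_s
        self._clock = clock  # injectable for testing
        self._current = next(self._pool)
        self._switched_at = self._clock()

    def rotate(self):
        # Forced rotation, e.g. after hitting a captcha page.
        self._current = next(self._pool)
        self._switched_at = self._clock()
        return self._current

    def current(self):
        # Timer-based rotation: hop if the interval has elapsed.
        if self._clock() - self._switched_at >= self._interval:
            self.rotate()
        return self._current
```

In practice you'd call `rotator.current()` before each request and `rotator.rotate()` from your captcha-detection path; the cycle just wraps around the pool.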
Charming_Support726@reddit
I dumped Firecrawl because it felt very unreliable and switched to https://github.com/unclecode/crawl4ai
I get very clean results even without LLM extraction
heyyyjoo@reddit
Do you just do something like the simple web crawl in the "Quick Start"? Or do you have to turn on some other stuff to make it more reliable than Firecrawl? (handle JS, avoid bot detection etc)
Charming_Support726@reddit
It took me a week to get everything running. Implemented it with Codex and tweaked the config by hand.
Mysterious-Rock7154@reddit
I used to use firecrawl but got tired of their API returning 500s all the time and the quality not improving. Now I use the new tool https://search.getlark.ai/ that has an API similar to firecrawl
heyyyjoo@reddit
Have you tried using crawl4ai? If so how does it compare?
Cursed_line@reddit
I ran into the same issue. Managing separate services for search, scraping, and LLM calls was a nightmare for me too. I ended up switching to an integrated API that handles all three layers. Check out LLMLayer. It gives you APIs for both web search and content scraping (and even an answer API for complete AI-generated responses). Saved me a lot of glue code managing multiple services.
No-Function-7019@reddit
One thing I liked about LLMLayer is that it returns structured context that you can drop straight into a model without additional preprocessing. For one of my prototypes, it replaced a combination of Firecrawl + Playwright scraping + my own HTML cleaner. The speed wasn’t dramatically faster, but the mental overhead dropped a lot because everything was consolidated.
RoosterHuge1937@reddit
I’ve been debating whether to switch to something more unified myself.
KaleidoscopeFar6955@reddit
Stitching together separate tools for search, scraping, and LLM calls becomes unmanageable fast. An integrated API that handles the whole flow is a huge quality-of-life improvement. Cutting out all that glue code is honestly half the battle.
RoosterHuge1937@reddit
I’ve actually been using LLMLayer recently for a similar assistant workflow, and the biggest win has been not having to manage scraping + search + LLM formatting separately. It handles the retrieval + extraction + chunking step in one go, so the pipeline is way cleaner. If you’re aiming for something that feels more “native” to LLM pipelines, it might be worth trying.
KaleidoscopeFar6955@reddit
I ran into the same problem when I was juggling separate tools for search, scraping, and LLM calls. It worked, but the pipeline felt heavier than it should. LLMLayer simplified things quite a bit for me because it bundles search + extraction + LLM-ready output under one API. Instead of stitching services together or cleaning raw HTML, I just pass URLs and get structured snippets back.
teroknor92@reddit
I created this open source repo for similar use case as yours https://github.com/m92vyas/AI-web_scraper
It will search the web, scrape the required data from each result, and output an array of outputs, one per URL. The readme isn't detailed, but if you view the code I've added simple functions for web search, getting LLM-ready text, scraping, etc., using free open source tools. You can also add your own simple function for each task, pass the function name as a parameter, and it will handle it.
I have a hosted version of a Firecrawl-like API which you can use to avoid getting blocked (the code is open source). Within a few days I will be reducing the pricing, which will give developers a cheaper scrape API with pay-as-you-go pricing.
dash_bro@reddit
Get URLs, use a computer-use agent to click, take a screenshot of the page, and save its HTML.
Use both (image, HTML) as context, dump them into Gemini or GPT, tune, and get outputs.
Bonus points if you create a simple cache for the URLs and map them to the scraped pairs to avoid extra work.
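The cache suggested above is straightforward: key each URL (hashed, so it's filesystem-safe) to its scraped (screenshot, HTML) pair on disk, and skip the agent entirely on a hit. A minimal sketch, with hypothetical names throughout:

```python
import hashlib
import json
from pathlib import Path


class ScrapeCache:
    """Map a URL to its scraped (screenshot path, HTML) pair on disk,
    so repeat visits skip the computer-use agent entirely."""

    def __init__(self, root=".scrape_cache"):
        self.root = Path(root)
        self.root.mkdir(exist_ok=True)

    def _key(self, url):
        # Hash the URL so arbitrary characters can't break the filename.
        digest = hashlib.sha256(url.encode()).hexdigest()[:16]
        return self.root / f"{digest}.json"

    def get(self, url):
        path = self._key(url)
        if path.exists():
            return json.loads(path.read_text())
        return None  # cache miss: run the agent, then put()

    def put(self, url, screenshot_path, html):
        record = {"url": url, "screenshot": screenshot_path, "html": html}
        self._key(url).write_text(json.dumps(record))
```

Check `get()` before dispatching the agent; on a miss, scrape and `put()` the result. A real version would likely add a TTL so stale pages eventually get re-scraped.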
Brave_Reaction_1224@reddit
Hey, Founder of Firecrawl here.
Did you try our /search endpoint? It handles search and gives you the content back as markdown. Frankly, we leave the LLM generation part out on purpose because we've found it pretty easy to pass the markdown content to the LLM of your choice. Out of curiosity, why do you want that bundled in? Just one less tool in the stack, or is there another reason?
ogandrea@reddit
Yeah the multi-service juggling act gets old fast, especially when you're trying to keep everything in sync. I've been down this exact rabbit hole and the coordination overhead between search APIs, scrapers, and LLM calls becomes a real pain point when you're iterating quickly on the AI logic.
What ended up working better for me was moving toward a more unified approach where the browser automation handles both the search and extraction phases before passing clean data to the LLM. Instead of stitching together separate services, having one reliable system that can navigate, extract, and preprocess content reduces a lot of the pipeline complexity. The key is making sure your extraction layer is robust enough to handle different site structures without constantly breaking, which honestly took way more engineering time than I initially expected but pays off in the long run.
ekaj@reddit
Yea, Project: https://github.com/rmusser01/tldw_server/tree/main
https://github.com/rmusser01/tldw_server/tree/main/tldw_Server_API/app/core/Web_Scraping - web scraping module
I don’t have any documentation for media ingestion API usage besides this: https://github.com/rmusser01/tldw_server/blob/main/Docs/MCP/Unified/Documentation_Ingestion_Playbook.md which doesn’t cover the web scraping options. Just now realizing that, I’ll plan on fixing that.
SlowFail2433@reddit
Literally never scrape again and instead use computer use agents that pretend to be a human lmao
swagonflyyyy@reddit
Lmao. I would really feel safe doing that with the qwen3vl-235b models tbh. 30b-a3b kept looping in circles.
SlowFail2433@reddit
It's the current research frontier. I'm doing daily RL runs but progress is chaotic lmao
swagonflyyyy@reddit
I bet lmao.