Scraping 500k pages: works locally, blocked on EC2. How do you scale?

Posted by ComprehensiveCat3034@reddit | learnprogramming | 14 comments

Hey folks,

I’m working on a project where I need to collect reviews for roughly 500k hotels. APIs (Google, Tripadvisor, etc.) are turning out to be quite expensive at this scale, so I’m exploring scraping as an alternative.

Here’s my situation:

What I’m trying to figure out:

I’m okay with slower scraping speeds since this is more of a periodic batch job, not real-time.
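
For a sense of what "slower" could look like in a batch job, here is a minimal sketch of a paced fetch loop. It is only an illustration: `throttled_batch`, the `fetch` callable, and the 2-second `min_interval` are hypothetical names and values, not taken from the actual setup.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def throttled_batch(
    urls: Iterable[str],
    fetch: Callable[[str], str],          # hypothetical: any function that returns a page body
    min_interval: float = 2.0,            # assumed pacing; tune to what the target site tolerates
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[Tuple[str, str]]:
    """Yield (url, body) pairs, starting at most one fetch per min_interval seconds."""
    last_start: Optional[float] = None
    for url in urls:
        if last_start is not None:
            # Wait out whatever is left of the interval since the previous fetch started.
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        yield url, fetch(url)
```

At one request every 2 seconds, a single worker would need about 11.6 days to get through 500k pages, so the interval and the number of workers would have to be chosen with the batch schedule in mind.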

Would really appreciate insights from anyone who has tackled similar large-scale scraping problems 🙏