Scraping 500k pages: works locally, blocked on EC2. How do you scale?
Posted by ComprehensiveCat3034@reddit | learnprogramming | 14 comments
Hey folks,
I’m working on a project where I need to collect reviews for around ~500k hotels. APIs (Google, Tripadvisor, etc.) are turning out to be quite expensive at this scale, so I’m exploring scraping as an alternative.
Here’s my situation:
- I don’t need real-time data — even updating once every 1–2 months is fine
- When I run the scraper locally, things work reasonably well
- But when I move the same setup to an EC2 instance, I get blocked pretty quickly
- I’m trying to avoid using residential proxies due to cost and complexity
- Prefer open-source or low-cost approaches if possible
What I’m trying to figure out:
- Is there any practical way to scrape at this scale without getting blocked (or at least minimizing it) using only open-source tools?
- Are there strategies that work specifically on cloud environments like EC2?
- Has anyone managed something similar without relying on expensive proxy networks?
- Any architectural suggestions (batching, distributed scraping, etc.) that could help?
I’m okay with slower scraping speeds since this is more of a periodic batch job, not real-time.
Would really appreciate insights from anyone who has tackled similar large-scale scraping problems 🙏
Ordinary-Cycle7809@reddit
Quick answer from experience: the main reason it works locally but gets blocked on EC2 is IP reputation. Your home IP is clean and shared with normal users, while AWS EC2 IPs are heavily abused for scraping, so many sites block them aggressively right away. Even without residential proxies, here are a few low-cost things that often help on EC2 (sketch after this list):
- rotate realistic random user-agents
- set plausible referer headers
- add random delays between requests
For 500k pages every 1-2 months this slower approach is totally fine. Many people do large hotel/review scraping this way without paid proxies. Have you tried adding proper random user-agents + referers + delays yet?
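A minimal sketch of that baseline in Python with `requests` (the user-agent pool, referer, and delay range are illustrative assumptions, not tested values):

```python
import random
import time

import requests

# Assumed pool of realistic desktop user agents -- rotate one per request.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]

def fetch(url: str) -> requests.Response:
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Referer": "https://www.google.com/",   # plausible referer
        "Accept-Language": "en-US,en;q=0.9",
    }
    resp = requests.get(url, headers=headers, timeout=30)
    time.sleep(random.uniform(3, 8))  # random delay between requests
    return resp
```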
Beautiful-Staff-3124@reddit
try rotating through a few different regions on spot instances first. the ip reputation issue is real but manageable.
adding realistic delays and headers is a solid baseline. i'd start there before anything more complex.
Ordinary-Cycle7809@reddit
ohhhh
backfire10z@reddit
If 1-2 month old data is ok and OP is running an EC2 anyways, can’t they just slow down the scraping to match a 1 month cadence or so?
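Back-of-the-envelope, a one-month cadence leaves plenty of slack: 500k requests spread evenly over 30 days is one request roughly every 5 seconds.

```python
pages = 500_000
seconds_per_month = 30 * 24 * 60 * 60  # 2,592,000 s

print(f"{seconds_per_month / pages:.1f} s between requests")  # ~5.2 s
```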
Agreeable_Math7501@reddit
scraping at that scale from a single ec2 ip is basically a guaranteed block. you need proper proxy rotation.
i ended up using Qoest for Developers for a similar project. their scraping api handles the proxy rotation and captcha solving automatically.
henkdegrasmaaier@reddit
there are some open-source tools that you can use for scraping, but be very aware that most websites do not allow scraping and actively ban IPs for this reason. If a website allows bots, you can usually do a regular http request.
for educational purposes, you can get some decent results with cloudscraper (best imo for this) in python, and with playwright. Do not expect high speeds, and multiple playwright instances in particular can eat up your ram.
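For what it's worth, the cloudscraper API is minimal; a sketch (the URL is a placeholder):

```python
import cloudscraper

# create_scraper() returns a requests.Session-like object that handles
# basic anti-bot JS challenges transparently.
scraper = cloudscraper.create_scraper()
resp = scraper.get("https://example.com/hotel/123/reviews")  # placeholder URL
print(resp.status_code, len(resp.text))
```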
but, if you can, try to pay for a small api of the websites you are using or from a travel api like Duffel.
RegisterConscious993@reddit
Unfortunately, proxies would be the solution.
If the cost would be too high, you can set up a spare, cheap phone as your own private proxy. I used to use an Android phone with a webhook to trigger airplane mode on/off to rotate IPs.
antiproton@reddit
You understand those companies don't want you to do this, right? There's no good way to accomplish what you want.
Jigglytep@reddit
My suggestions:
SSH into the EC2 instance and confirm whether you can get the data with a curl command or a tool like Selenium; if you are blocked, inspect the response to see why.
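A quick way to run that check from the instance in Python (the URL is a placeholder); a 403/429 status or a challenge page in the body usually points to IP-level blocking:

```python
import requests

url = "https://example.com/hotel/123/reviews"  # placeholder target
resp = requests.get(url, timeout=30)

print("status:", resp.status_code)            # 403/429 suggests an IP block
print("server:", resp.headers.get("Server"))  # e.g. "cloudflare"
print(resp.text[:500])                        # look for a CAPTCHA/challenge page
```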
To scale up:
https://www.scrapy.org/
Look into Scrapy, a python scraping library. I used it to scrape the Department of Transportation database by querying every possible DOT number, and it sped things up enormously. It is open source. I later found an easier way of getting that data: I called them and asked if I could have it, and they shared a link to a CSV file.
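If you go the Scrapy route, a minimal spider with polite throttling might look like this (spider name, URL, and selector are placeholders to adapt):

```python
import scrapy

class ReviewSpider(scrapy.Spider):
    name = "hotel_reviews"                       # hypothetical spider name
    start_urls = ["https://example.com/hotels"]  # placeholder URL

    custom_settings = {
        "DOWNLOAD_DELAY": 5,               # base delay between requests
        "RANDOMIZE_DOWNLOAD_DELAY": True,  # jitter the delay 0.5x-1.5x
        "CONCURRENT_REQUESTS": 2,
        "AUTOTHROTTLE_ENABLED": True,      # back off when the server slows down
        "ROBOTSTXT_OBEY": True,
    }

    def parse(self, response):
        # Placeholder selector -- adapt to the real page structure.
        for review in response.css(".review-text::text").getall():
            yield {"url": response.url, "review": review}
```

Run it with `scrapy runspider spider.py -o reviews.jsonl` to dump results to a file.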
retornam@reddit
Run Tailscale on both your home network and on your EC2 instance and then use your home network as an exit node, so it appears you are accessing the websites through your home network.
It’s cheaper, and while your home ip might get rate-limited, it’s unlikely to be blocked outright for long, since home ISP IPs get re-assigned when you restart your modem.
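A sketch of that setup with the Tailscale CLI (exit nodes also have to be approved in the Tailscale admin console; the IP is a placeholder):

```sh
# On the home machine: advertise it as an exit node
tailscale up --advertise-exit-node

# On the EC2 instance: route all traffic through the home machine
tailscale up --exit-node=<home-machine-tailscale-ip>
```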
alord@reddit
Use brightdata or some other proxy provider
Accomplished-Web6183@reddit
Normally data center IPs are blocked. You can use proxies or something like firecrawl.