Scraping 500k pages: works locally, blocked on EC2. How do you scale?
Posted by ComprehensiveCat3034@reddit | learnprogramming | 14 comments
Hey folks,
I’m working on a project where I need to collect reviews for around ~500k hotels. APIs (Google, Tripadvisor, etc.) are turning out to be quite expensive at this scale, so I’m exploring scraping as an alternative.
Here’s my situation:
- I don’t need real-time data — even updating once every 1–2 months is fine
- When I run the scraper locally, things work reasonably well
- But when I move the same setup to an EC2 instance, I get blocked pretty quickly
- I’m trying to avoid using residential proxies due to cost and complexity
- Prefer open-source or low-cost approaches if possible
What I’m trying to figure out:
- Is there any practical way to scrape at this scale without getting blocked (or at least minimizing it) using only open-source tools?
- Are there strategies that work specifically on cloud environments like EC2?
- Has anyone managed something similar without relying on expensive proxy networks?
- Any architectural suggestions (batching, distributed scraping, etc.) that could help?
I’m okay with slower scraping speeds since this is more of a periodic batch job, not real-time.
Would really appreciate insights from anyone who has tackled similar large-scale scraping problems 🙏
Ordinary-Cycle7809@reddit
Quick answer from experience: the main reason it works locally but gets blocked on EC2 is IP reputation. Your home IP is clean and shared with normal users, while AWS EC2 IPs are heavily abused for scraping, so many sites block them aggressively right away. Even without residential proxies, here are a few low-cost things that often help on EC2 (sketch after this list):
- rotate realistic random user-agents
- set plausible referer headers
- add random delays between requests
For 500k pages every 1-2 months this slower approach is totally fine. Many people do large hotel/review scraping this way without paid proxies. Have you tried adding proper random user-agents + referers + delays yet?
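A minimal sketch of that baseline in Python with `requests` (the user-agent pool, referer, and delay range are illustrative assumptions, not tested values):

```python
import random
import time

import requests

# Assumed pool of realistic desktop user agents -- rotate one per request.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]

def fetch(url: str) -> requests.Response:
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Referer": "https://www.google.com/",   # plausible referer
        "Accept-Language": "en-US,en;q=0.9",
    }
    resp = requests.get(url, headers=headers, timeout=30)
    time.sleep(random.uniform(3, 8))  # random delay between requests
    return resp
```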
Beautiful-Staff-3124@reddit
try rotating through a few different regions on spot instances first. the ip reputation issue is real but manageable.
adding realistic delays and headers is a solid baseline. i'd start there before anything more complex.
Ordinary-Cycle7809@reddit
ohhhh
backfire10z@reddit
If 1-2 month old data is ok and OP is running an EC2 anyways, can’t they just slow down the scraping to match a 1 month cadence or so?
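Back-of-the-envelope, a one-month cadence leaves plenty of slack: 500k requests spread evenly over 30 days is one request roughly every 5 seconds.

```python
pages = 500_000
seconds_per_month = 30 * 24 * 60 * 60  # 2,592,000 s

print(f"{seconds_per_month / pages:.1f} s between requests")  # ~5.2 s
```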
Agreeable_Math7501@reddit
scraping at that scale from a single ec2 ip is basically a guaranteed block. you need proper proxy rotation.
i ended up using Qoest for Developers for a similar project. their scraping api handles the proxy rotation and captcha solving automatically.
henkdegrasmaaier@reddit
there are some open-source tools that you can use for scraping, but be very aware that most websites do not allow scraping and actively ban IPs for this reason. If a website allows bots, you can usually do a regular http request.
for educational purposes, you can get some decent results with cloudscraper (best imo for this) in python, and with playwright. Do not expect high speeds, and multiple playwright instances in particular can eat up your ram.
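For what it's worth, the cloudscraper API is minimal; a sketch (the URL is a placeholder):

```python
import cloudscraper

# create_scraper() returns a requests.Session-like object that handles
# basic anti-bot JS challenges transparently.
scraper = cloudscraper.create_scraper()
resp = scraper.get("https://example.com/hotel/123/reviews")  # placeholder URL
print(resp.status_code, len(resp.text))
```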
but, if you can, try to pay for a small api of the websites you are using or from a travel api like Duffel.
RegisterConscious993@reddit
Unfortunately, proxies would be the solution.
If the cost would be too high, you can set up a spare, cheap phone as your own private proxy. I used to use an Android phone with a webhook to trigger airplane mode on/off to rotate IPs.
antiproton@reddit
You understand those companies don't want you to do this, right? There's no good way to accomplish what you want.
Jigglytep@reddit
My suggestions:
SSH into the EC2 instance and confirm whether you can get the data with a curl command or a tool like Selenium; if you are blocked, inspect the response to see why.
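A quick way to run that check from the instance in Python (the URL is a placeholder); a 403/429 status or a challenge page in the body usually points to IP-level blocking:

```python
import requests

url = "https://example.com/hotel/123/reviews"  # placeholder target
resp = requests.get(url, timeout=30)

print("status:", resp.status_code)            # 403/429 suggests an IP block
print("server:", resp.headers.get("Server"))  # e.g. "cloudflare"
print(resp.text[:500])                        # look for a CAPTCHA/challenge page
```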
To scale up:
https://www.scrapy.org/
Look into Scrapy, a python scraping library. I used it to scrape the Department of Transportation database by querying every possible DOT number, and it sped things up enormously. It is open source. I later found an easier way of getting that data: I called them and asked if I could have it, and they shared a link to a CSV file.
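If you go the Scrapy route, a minimal spider with polite throttling might look like this (spider name, URL, and selector are placeholders to adapt):

```python
import scrapy

class ReviewSpider(scrapy.Spider):
    name = "hotel_reviews"                       # hypothetical spider name
    start_urls = ["https://example.com/hotels"]  # placeholder URL

    custom_settings = {
        "DOWNLOAD_DELAY": 5,               # base delay between requests
        "RANDOMIZE_DOWNLOAD_DELAY": True,  # jitter the delay 0.5x-1.5x
        "CONCURRENT_REQUESTS": 2,
        "AUTOTHROTTLE_ENABLED": True,      # back off when the server slows down
        "ROBOTSTXT_OBEY": True,
    }

    def parse(self, response):
        # Placeholder selector -- adapt to the real page structure.
        for review in response.css(".review-text::text").getall():
            yield {"url": response.url, "review": review}
```

Run it with `scrapy runspider spider.py -o reviews.jsonl` to dump results to a file.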
retornam@reddit
Run Tailscale on both your home network and on your EC2 instance and then use your home network as an exit node, so it appears you are accessing the websites through your home network.
It’s cheaper, and while your home ip might get rate-limited, it’s unlikely to be blocked outright for long, since home ISP IPs get re-assigned when you restart your modem.
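A sketch of that setup with the Tailscale CLI (exit nodes also have to be approved in the Tailscale admin console; the IP is a placeholder):

```sh
# On the home machine: advertise it as an exit node
tailscale up --advertise-exit-node

# On the EC2 instance: route all traffic through the home machine
tailscale up --exit-node=<home-machine-tailscale-ip>
```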
alord@reddit
Use brightdata or some other proxy provider
Accomplished-Web6183@reddit
Normally data center IPs are blocked. You can use proxies or something like firecrawl.