HACKER Q&A
📣 aspyct

Best practices for ethical web scraping?


Hello HN!

As part of my learning in data science, I need/want to gather data. One relatively easy way to do that is web scraping.

However, I'd like to do that in a respectful way. Here are three things I can think of:

1. Identify my bot with a user agent/info URL, and provide a way to contact me.

2. Don't DoS websites with tons of requests.

3. Respect robots.txt.
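
For point 1, a minimal sketch of what an identifying request might look like in Python (the bot name, info URL, and contact address below are placeholders, not anything real):

    import requests

    # Placeholder bot name, info URL and contact address -- replace with your own.
    HEADERS = {
        "User-Agent": "MyDataScienceBot/0.1 (+https://example.com/bot-info; contact: me@example.com)"
    }

    response = requests.get("https://example.com/some/page", headers=HEADERS, timeout=30)
    print(response.status_code)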

What else would be considered good practice when it comes to web scraping?


  👤 snidane Accepted Answer ✓
When scraping, just behave so as not to piss off the site owner, whatever that means: e.g., don't cause excessive load, and make sure you don't leak sensitive data.

Next, put yourself in their shoes and realize they usually don't monitor their traffic that closely, or simply don't care as long as you don't slow down their site. It's usually only certain big sites with heavy bot traffic, such as LinkedIn or sneaker shops, that implement bot protections. Most others don't care.

Some websites are created almost as if they want to be scraped. The JSON API used by the frontend is ridiculously clean and accessible. Perhaps they benefit when people see their results and invest in their stock. You never fully know whether the site wants to be scraped or not.

The reality of the scraping industry, as it relates to your question, is this:

1. Scraping companies generally don't use an honest user agent such as 'my friendly data science bot'; they hide behind a set of fake ones and/or route the traffic through a proxy network. You don't want to get banned so easily by revealing your user agent when you know your competitors don't reveal theirs.

2. This one is obvious. The general rule is to scrape continuously over a long time period and add large delays between requests, of at least 1 second. If you go below 1 second, be careful.

3. robots.txt is controversial and no longer serves its original purpose. It should be renamed google_instructions.txt, because site owners mainly use it to guide Googlebot around their site. It is generally ignored by the industry, again because you know your competitors ignore it.

Just remember the rule of 'don't piss off the site owner' and then go ahead and scrape. Also keep in mind that you are in a free country, and we don't discriminate here, whether on grounds of race or gender or whether you are a biological or mechanical website visitor.

I've simply described the reality of the data science industry around scraping after several years of being in it. Note that this will probably not be liked by the HN audience, as they are mostly web devs and site owners.


👤 pfarrell
It won’t help you learn to write a scraper, but using the common crawl dataset will get you access to a crazy amount of data without paying to acquire it yourself.

https://commoncrawl.org/the-data/


👤 montroser
Nice of you to ask this question and to think about how to be as considerate as you can.

Some other thoughts:

- Find the most minimal, least expensive (for you and them both) way to get the data you're looking for. Sometimes you can iterate through search results pages and get all you need from there in bulk, rather than iterating through detail pages one at a time.

- Even if they don't have an official/documented API, they may very likely have internal JSON routes, or RSS feeds that you can consume directly, which may be easier for them to accommodate.

- Pay attention to response times. If you get your results back in 50ms, it was probably trivially easy for them and you can request a bunch without troubling them too much. On the other hand, if responses take 5s to come back, be gentle. If you are using internal undocumented APIs, you may find that you get faster/cheaper cached results if you stick to the same sets of parameters the site uses on its own (e.g., when the site's front end makes AJAX calls).


👤 mapgrep
I always add an “Accept-Encoding” header to my request to indicate I will accept a gzip response (or deflate if available). Your http library (in whatever language your bot is in) probably supports this with a near trivial amount of additional code, if any. Meanwhile you are saving the target site some bandwidth.

Look into If-Modified-Since and If-None-Match/Etag headers as well if you are querying resources that support those headers (RSS feeds, for example, commonly support these, and static resources). They prevent the target site from having to send anything other than a 304, saving bandwidth and possibly compute.
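
A rough sketch of both ideas with Python's requests library (requests already asks for gzip by default, but the header is shown explicitly; the feed URL and the in-memory cache dict are stand-ins for whatever you actually use):

    import requests

    url = "https://example.com/feed.xml"  # placeholder
    cache = {}  # url -> {"etag": ..., "last_modified": ..., "body": ...}; persist this in practice

    headers = {"Accept-Encoding": "gzip, deflate"}
    previous = cache.get(url)
    if previous:
        if previous.get("etag"):
            headers["If-None-Match"] = previous["etag"]
        if previous.get("last_modified"):
            headers["If-Modified-Since"] = previous["last_modified"]

    response = requests.get(url, headers=headers, timeout=30)

    if response.status_code == 304 and previous:
        body = previous["body"]  # nothing changed; reuse the cached copy
    else:
        body = response.text
        cache[url] = {
            "etag": response.headers.get("ETag"),
            "last_modified": response.headers.get("Last-Modified"),
            "body": body,
        }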


👤 rectang
In addition to the steps you're already taking, and the ethical suggestions from other commenters, I suggest that you acquaint yourself thoroughly with intellectual property (IP) law. If you eventually decide to publish anything based on what you learn, copyright and possibly trademark law will come into play.

Knowing what rights you have to use material you're scraping early on could guide you towards seeking out alternative sources in some cases, sparing you trouble down the line.


👤 sairamkunala
Simple:

Respect robots.txt.

Find your data from sitemaps, and make sure you query at a slow rate. robots.txt can also specify a cool-off period between requests (the Crawl-delay directive). See https://en.wikipedia.org/wiki/Robots_exclusion_standard#Craw...

example: https://www.google.com/robots.txt
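
Python's standard library already handles both the allow/deny rules and the Crawl-delay directive; a minimal sketch (the bot name and page URL are placeholders):

    import time
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://www.google.com/robots.txt")
    rp.read()

    user_agent = "MyDataScienceBot"               # placeholder
    page = "https://www.google.com/search/about"  # placeholder

    if rp.can_fetch(user_agent, page):
        delay = rp.crawl_delay(user_agent) or 5  # fall back to a conservative delay
        time.sleep(delay)
        # ... fetch the page here ...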


👤 jakelazaroff
I think your main obligation is not to the entity from which you’re scraping the data, but the people whom the data is about.

For example, the recent case between LinkedIn and hiQ centered on the latter not respecting the former’s terms of service. But even if they had followed it to a T, what hiQ was doing — scraping people’s profiles and snitching to their employers when it looked like they were job hunting — is incredibly unethical.

Invert power structures. Think about how the information you scrape could be misused. Allow people to opt out.


👤 mettamage
Indirectly related: if you have some time to spare, follow Harvard's course on ethics! [1]

Here is why: while it didn't teach me anything new (in a sense), it did give me a vocabulary to better articulate myself. Having new words to describe certain ideas means you have more analytical tools at your disposal. So you'll be able to examine your own ethical stance better.

It takes some time, but instead of watching Netflix (if that's a thing you do), watch this! That said, The Good Place is a pretty good Netflix show that sprinkles in some basic ethics.

[1] https://www.youtube.com/watch?v=kBdfcR-8hEY


👤 fiddlerwoaroof
My general attitude towards web scraping is that if I, as a user, have access to a piece of data through a web browser, the site owners have no grounds to object to me using a different program to access the data, as long as I’m not putting more load on their servers than a user clicking all the links would.

Obviously, there may be legal repercussions for scraping, and you should follow such laws, but those laws seem absurd to me.


👤 RuedigerVoigt
Common CMSes are fairly good at caching and can handle a high load, but quite often someone deems a badly programmed extension "mission critical". In that case one of your requests might trigger dozens of database calls. If multiple sites share a database backend, an accidental DoS might bring down a whole organization.

If the bot has a distinct IP (or distinct user agent), then a good setup can handle this situation automatically. If the crawler switches IPs to circumvent a rate limit or for other reasons, then it often causes trouble in the form of tickets and phone calls to the webmasters. Few care about some gigabytes of traffic, but they do care about overtime.

Some react by blocking whole IP ranges. I have seen sites that blocked every request from the network of Deutsche Telekom (Tier 1 / former state monopoly in Germany) for weeks. So you might affect many other people on your network.

So:

* Most of the time it does not matter whether you scrape all the information you need in minutes or overnight. For crawl jobs I try to avoid the times of day when I expect high traffic to the site. So I would not crawl restaurant sites at lunch time, but 2 a.m. local time should be fine. If the response time suddenly goes up at that hour, it may be due to a backup job. Simply wait a bit.

* The software you choose has an impact: if you use Selenium or headless Chrome, you load images and scripts. If you do not need those, analyzing the page source (with, for example, Beautiful Soup) uses fewer of the server's resources and might be much faster.

* Keep track of your requests. A specific file might be linked from a dozen pages of the site you crawl. Download it just once (see the sketch after this list). This can be tricky if a site uses A/B testing for headlines and changes the URL.

* If you provide contact information, read your emails. This sounds silly, but at my previous job we had problems with a friendly crawler with known owners. It tried to crawl our sites once a quarter and was blocked each time, because they did not react to our friendly requests to change their crawling rate.
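
A minimal sketch of the "download it just once" bookkeeping mentioned above, using in-memory sets (a real crawler would persist these; hashing the body catches A/B-tested URLs that point at identical content):

    import hashlib

    seen_urls = set()
    seen_hashes = set()

    def should_download(url: str) -> bool:
        """Return True only the first time a given URL is seen."""
        if url in seen_urls:
            return False
        seen_urls.add(url)
        return True

    def is_new_content(body: bytes) -> bool:
        """Return True only the first time a given response body is seen."""
        digest = hashlib.sha256(body).hexdigest()
        if digest in seen_hashes:
            return False
        seen_hashes.add(digest)
        return True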

Side note: I happen to work on a python library for a polite crawler. It is about a week away from stable (one important bug fix and a database schema change for a new feature). In case it is helpful: https://github.com/RuedigerVoigt/exoskeleton


👤 haddr
Some time ago I wrote an answer on stackoverflow: https://stackoverflow.com/questions/38947884/changing-proxy-...

Maybe that can help.


👤 tingletech
As a sort of poor man's rate limiting, I have written spiders that will sleep after every request for the length of the previous request (sometimes the length of the request times a sleep factor that defaults to 1). My thinking is that if the site is under load, it will respond more slowly, and my spider will slow down as well.
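
A sketch of that idea with requests (response.elapsed is roughly how long the server took to answer; the sleep factor is whatever you consider polite):

    import time
    import requests

    SLEEP_FACTOR = 1.0  # raise this to be gentler

    def polite_get(url):
        """Fetch a URL, then sleep for as long as the request took."""
        response = requests.get(url, timeout=30)
        time.sleep(response.elapsed.total_seconds() * SLEEP_FACTOR)
        return response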

👤 coderholic
Another option is to not scrape at all, and use an existing data set. Common Crawl is one good example, and HTTP Archive is another.

If you just want metadata from the homepages of all domains, we scrape that every month at https://host.io and make the data available over our API: https://host.io/docs


👤 xzel
This might be overboard for most projects, but here is what I recently did. There is a website I use heavily that provides sales data for a specific type of product. I actually emailed them to make sure scraping was allowed, because they took down their public API a few years ago. They said yes, everything on the website is fair game, and you can even do it on your main account. It was a surprisingly nice response.

👤 ok_coo
I work with a scientific institution and it's still amazing to me that people don't check or ask whether there are full downloadable datasets that anyone can have for free. They just jump right into scraping websites.

I don't know what kind of data you're looking for, but please verify that there isn't a quicker/easier way of getting the data than scraping first.


👤 tedivm
I've gone through this process twice: once about six months ago, and once just this week.

In the first case the content wasn't clearly licensed and the site was somewhat small, so I didn't want to break it. I emailed them and they gave us permission, but only if we crawled one page per ten seconds. It took us a weekend, but we got all the data and did so in a way that respected their site.

The second one was this last week and was part of a personal project. All of the content was under an open license (Creative Commons), and the site was hosted on a platform that can take a ton of traffic. For this one I made sure we weren't hitting it too hard (Scrapy has some great autothrottle options), but otherwise didn't worry about it too much.
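
For reference, Scrapy's AutoThrottle knobs live in the project's settings.py; a conservative sketch might look like this (the numbers are just a guess at "gentle", not what the parent comment used):

    # settings.py -- polite defaults for a small crawl
    ROBOTSTXT_OBEY = True
    USER_AGENT = "MyDataScienceBot/0.1 (+https://example.com/bot-info)"  # placeholder

    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 5.0         # initial delay in seconds
    AUTOTHROTTLE_MAX_DELAY = 60.0          # back off this far if the site slows down
    AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # aim for ~1 request in flight per server

    DOWNLOAD_DELAY = 2.0                   # floor between requests to the same site
    CONCURRENT_REQUESTS_PER_DOMAIN = 1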

Since the second project is personal, I open sourced the crawler, if you're curious: https://github.com/tedivm/scp_crawler


👤 elorant
My policy on scraping is to never use asynchronous methods. I've seen a lot of small e-commerce sites that can't really handle the load; even a few hundred requests per second can crash the server. So even if it takes me longer to scrape a site, I prefer not to cause any real harm as long as I can avoid it.

👤 throwaway777555
The suggestions in the comments are excellent. One thing I would add is this: contact the site owner in advance and ask for their permission. If they are okay with it or if you don't hear back, credit the site in your work. Then send the owner a message with where they can see the information being used.

Some sites will have rules or guidelines for attribution already in place. For example, the DMOZ had a Required Attribution page to explain how to credit them: https://dmoz-odp.org/docs/en/license.html. Discogs mentions that use of their data also falls under CC0: https://data.discogs.com/. Other sites may have these details in their Terms of Service, About page, or similar.


👤 moooo99
The rules you named are ones I personally followed. One other extremely important thing is privacy when you want to crawl personal data, as on social networks. I personally avoid crawling data that inexperienced users might accidentally expose, like email addresses, phone numbers, or their friends list. A good rule of thumb for social networks, for me, has always been to only scrape data that is visible when my bot is not logged in (which also helps avoid breaking the provider's ToS).

The most elegant way would be to ask the site provider if they allow scraping their website and which rules you should obey. I was surprised how open some providers were, but some don't even bother replying. If they don't reply, apply the rules you set and follow the obvious ones like not overloading their service etc.


👤 mfontani
If all scrapers did what you did, I'd curse a lot less at $work. Kudos for that.

Re 2 and 3: do you parse/respect the "Crawl-delay" robots.txt directive, and do you ensure that works properly across your fleet of crawlers?


👤 tyingq
Be careful about making the data you've scraped visible to Google's search engine scrapers.

That's often how site owners get riled up. They search for some unique phrase on Google, and your site shows up in the search results.


👤 narsil
It's helpful to filter out links to large content and downloadable assets so they aren't traversed. For example, I assume you wouldn't care about downloading videos, images, and other assets that would otherwise use a large amount of data transfer and increase costs.

If the file type isn't clear, the response headers would still include the Content-Length for non-chunked downloads, and the Content-Disposition header may contain the file name with extension for assets meant to be downloaded rather than displayed on a page. Response headers can be parsed prior to downloading the entire body.
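
A sketch of that header check with requests, using stream=True so the body is only pulled once the headers pass (the size cutoff is arbitrary):

    import requests

    MAX_BYTES = 5 * 1024 * 1024  # arbitrary cutoff: skip anything larger than ~5 MB

    def fetch_if_small(url):
        """Inspect the headers first; only download the body for small, displayable responses."""
        response = requests.get(url, stream=True, timeout=30)  # headers only, body not read yet

        length = response.headers.get("Content-Length")
        disposition = response.headers.get("Content-Disposition", "")

        if (length and int(length) > MAX_BYTES) or "attachment" in disposition.lower():
            response.close()  # drop the connection without pulling the body
            return None
        return response.text  # the body is actually downloaded here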


👤 JackC
In some cases, especially during development, local caching of responses can help reduce load. You can write a little wrapper that tries to return URL contents from a local cache and falls back to a live request.
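
Something like this works as the wrapper during development, assuming flat files keyed on a hash of the URL are good enough as a cache:

    import hashlib
    import pathlib
    import requests

    CACHE_DIR = pathlib.Path("cache")
    CACHE_DIR.mkdir(exist_ok=True)

    def cached_get(url):
        """Return the cached body for a URL if present, otherwise fetch and cache it."""
        path = CACHE_DIR / hashlib.sha256(url.encode()).hexdigest()
        if path.exists():
            return path.read_text()
        body = requests.get(url, timeout=30).text
        path.write_text(body)
        return body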

👤 philippz
As many pages are at least halfway SPAs, make sure you really understand the website's communication with its backend. Identify API calls and try to make them directly instead of downloading full pages and extracting the required information from the HTML afterwards. If certain data sets from specific API calls almost never change, crawl them less often and cache the results instead.
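
As an illustration, once you have found such an endpoint in the browser's network tab, calling it directly is usually just a small JSON request; the endpoint and parameters here are entirely made up:

    import requests

    # Hypothetical endpoint discovered in the browser's network tab.
    API_URL = "https://example.com/api/v1/listings"

    response = requests.get(API_URL, params={"page": 1, "per_page": 50}, timeout=30)
    response.raise_for_status()
    items = response.json()  # structured data, no HTML parsing needed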

👤 DoofusOfDeath
You may need to get more specific about your definition of "ethical".

For example, do you just mean "legal"? Or perhaps, consistent with current industry norms (which probably includes things you'd consider sleazy)? Or not doing anything that would cause offense to site owners (regardless of how unreasonable they may seem)?

I do think it's laudable that you want to do good. Just pointing out that it's not a simple thing.


👤 danpalmer
Haven’t seen anyone mention this, but asking permission first is about the most ethical approach. If you think sites are unlikely to give you permission, that might be an indication that what you’re doing has limited value. Offering to share your results with them could be a good plan.

I work for a company that does a lot of web scraping, but we have a business contract with every company we scrape from.


👤 tdy721
Schema.org is a nice resource. If you can find that metadata on a site, you can be a little more sure they don't mind that data getting scraped. It's the instruction book for giving Google and other crawlers extra information and context. Your scraper would be wise to parse this extra meta information.
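
Schema.org data is most often embedded as JSON-LD; a rough sketch of pulling it out with requests and BeautifulSoup (the page URL is a placeholder):

    import json
    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://example.com/some/product", timeout=30).text  # placeholder
    soup = BeautifulSoup(html, "html.parser")

    structured = []
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            structured.append(json.loads(tag.string or ""))
        except json.JSONDecodeError:
            pass  # malformed blocks are common; skip them

    print(structured)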

👤 jll29
The only sound advice one can give is that there are two elements to consider:

1) Ethics is different from law.

1.1) The ethical way: respect the robots.txt protocol.

2) Consult a lawyer.

2.1) Prior written consent, they will say, prevents you from being sued, and not much else.

👤 sudoaza
Those three are the main ones. Sharing the data at the end could also be a way to avoid future scraping.

👤 Someone
IMO, the best practice is “don’t”. If you think the data you’re trying to scrape is freely available, contact the site owner, and ask them whether dumps are available.

Certainly, if your goal is “learning in data science”, and thus not tied to a specific subject, there are enough open datasets to work with, for example from https://data.europa.eu/euodp/en/home or https://www.data.gov/


👤 adrianhel
I like this approach. Personally, I wait an hour if I get an invalid response, and use delays of a few seconds between other requests.
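
That policy is only a few lines in Python; the delays below are just the ones described, not anything canonical:

    import time
    import requests

    ERROR_BACKOFF = 60 * 60  # wait an hour after an invalid response
    NORMAL_DELAY = 3         # a few seconds between ordinary requests

    def patient_get(url):
        response = requests.get(url, timeout=30)
        # Back off hard when the site answers with an error, otherwise pause briefly.
        time.sleep(NORMAL_DELAY if response.ok else ERROR_BACKOFF)
        return response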

👤 abannin
Don't fake an identity. If the site requires a login, don't fake that login. This has legal implications.

👤 avip
Contact the site owner, tell them who you are and what you're doing, and ask about a data dump or API.

👤 brainzap
Ask for permission and use sensible timeouts/retries.

👤 sys_64738
Ethical web scraping? Is that even a thing?