HACKER Q&A
📣 djaque

Legally Scraping Reddit Images


I'm considering working on a project related to GANs and started to research where I could get training data for it. I came across a subreddit full of community-filtered images of exactly what I need. There are also roughly 100K of them, which is exactly how many I was hoping for.

The only problem is that I don't have any experience with collecting data at that scale. Does anyone have experience with scraping Reddit and can offer pointers on how to approach it? I'm also having a difficult time figuring out whether Reddit even allows this.


  👤 photon_off Accepted Answer ✓
You can look into the "legal" aspect of it by Googling "is it legal to scrape". My understanding (IANAL) is that it is OK as long as the user agreement is "opt-out" as opposed to "opt-in" (e.g., having to click a consent box before viewing the content). You'll have to read up on this yourself and weigh the risk/reward of doing your project. I (NAFL) would assume the risk is quite small. The reward -- that's your call.

As for the other part, getting the data: it's called scraping. Depending on your experience with scraping, you may need to pay for certain aspects of it (e.g., getting a large list of proxies so Reddit does not block you, or using a scraping API). Or maybe your project is small enough (or your time constraints loose enough) that you can slowly siphon the data via your own means.
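
To make the "slowly siphon" option concrete, here is a minimal sketch, not a definitive implementation. It assumes the subreddit's listing is available as JSON at a URL like https://www.reddit.com/r/SUBREDDIT/new.json and that each post carries a "url" field; neither detail comes from the post above, so verify both (and Reddit's current rules) before relying on it. The names extract_image_urls, fetch_listing, and crawl, and the User-Agent string, are made up for illustration.

```python
import json
import time
import urllib.parse
import urllib.request

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")


def extract_image_urls(listing):
    """Return (image_urls, after_token) for one page of a listing-style dict."""
    data = listing.get("data", {})
    urls = []
    for child in data.get("children", []):
        url = child.get("data", {}).get("url") or ""
        # Ignore any query string when checking the file extension.
        path = url.split("?", 1)[0].lower()
        if path.endswith(IMAGE_EXTENSIONS):
            urls.append(url)
    return urls, data.get("after")


def fetch_listing(subreddit, after=None, user_agent="gan-dataset-sketch/0.1"):
    """Fetch one page of posts. The endpoint shape is an assumption."""
    params = {"limit": 100}
    if after:
        params["after"] = after
    url = "https://www.reddit.com/r/{}/new.json?{}".format(
        subreddit, urllib.parse.urlencode(params)
    )
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def crawl(subreddit, max_pages, delay_seconds=2.0, fetch=fetch_listing):
    """Walk the listing page by page, pausing between requests."""
    collected = []
    after = None
    for _ in range(max_pages):
        listing = fetch(subreddit, after)
        urls, after = extract_image_urls(listing)
        collected.extend(urls)
        if after is None:
            break  # no further pages
        time.sleep(delay_seconds)  # be polite: one request every few seconds
    return collected
```

Calling crawl("SUBREDDIT", max_pages=3) would return a list of direct image links that you can then download at the same unhurried pace. The fetch parameter is only there so the paging logic can be tried against canned data without touching the network.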

As for whether Reddit allows it: refer to the legality of scraping in general, and apply it to Reddit.