- I don't want to parse anything in the SimilarWeb Top 50.
- I don't want to render JS.
- I'd like to keep the web index small enough that it's still measured in TBs.
I've built search engines for research papers in the past. The key difference was that I could collect that data easily through a documented API. Now I need to either build or use a crawler, and I'm not sure where to begin. Here are some thoughts I have so far.
- I'm probably going to write the crawler in Go. It seems like a good fit for this sort of software.
- How do I start collecting lists of domains? Do I just start hitting public IPv4 addresses on ports 80 and 443, or should I start from a seed list like in the first sketch after this list?
- If I run something like this on a server with proper rate limiting (the second sketch below is the kind of per-host limiting I have in mind), would Cloudflare inevitably start blocking me anyway?
- If I were to run this from a machine connected to the Internet via a residential ISP, would I get a nasty letter from my ISP?
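
To make the domain-seeding question concrete, here's a rough sketch of what I'm imagining instead of scanning IPv4 space: start from a plain text list of domains and feed them into a naive crawl frontier. The `seeds.txt` filename and the idea of downloading some public ranked-domain dump into it are just my assumptions.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
)

// loadSeeds reads one domain per line from a seed file and pushes each
// homepage URL onto a buffered channel that acts as a very naive crawl
// frontier.
func loadSeeds(path string) (<-chan string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	frontier := make(chan string, 1024)
	go func() {
		defer f.Close()
		defer close(frontier)
		sc := bufio.NewScanner(f)
		for sc.Scan() {
			if domain := sc.Text(); domain != "" {
				frontier <- "https://" + domain + "/"
			}
		}
	}()
	return frontier, nil
}

func main() {
	// seeds.txt is a hypothetical file with one domain per line.
	frontier, err := loadSeeds("seeds.txt")
	if err != nil {
		log.Fatal(err)
	}
	for u := range frontier {
		fmt.Println("would fetch:", u)
	}
}
```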
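
And for the rate-limiting question, per-host limiters along these lines are what I have in mind (using golang.org/x/time/rate; the one-request-every-two-seconds budget and the User-Agent string are placeholders I made up):

```go
package main

import (
	"context"
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
	"sync"
	"time"

	"golang.org/x/time/rate"
)

// hostLimiters hands out one token-bucket limiter per host so the crawler
// never hits a single site faster than the configured rate, no matter how
// many workers are running.
type hostLimiters struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
}

func (h *hostLimiters) get(host string) *rate.Limiter {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.limiters == nil {
		h.limiters = make(map[string]*rate.Limiter)
	}
	lim, ok := h.limiters[host]
	if !ok {
		// Assumed politeness budget: one request every 2 seconds per host.
		lim = rate.NewLimiter(rate.Every(2*time.Second), 1)
		h.limiters[host] = lim
	}
	return lim
}

// politeFetch blocks until the host's limiter allows a request, then does a
// plain GET with a timeout and returns the response body.
func politeFetch(ctx context.Context, h *hostLimiters, rawURL string) ([]byte, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return nil, err
	}
	if err := h.get(u.Host).Wait(ctx); err != nil {
		return nil, err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, rawURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", "my-learning-crawler/0.1") // placeholder UA
	client := &http.Client{Timeout: 15 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

func main() {
	var limiters hostLimiters
	body, err := politeFetch(context.Background(), &limiters, "https://example.com/")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(body), "bytes fetched")
}
```

This obviously doesn't check robots.txt yet, which I assume I'd need to do before pointing it at anyone's site.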
Any advice or feedback is appreciated. The goal of this project is to learn about web crawling rather than to build a product to sell.