- I don't want to parse anything in the SimilarWeb Top 50.
- I don't want to render JS.
- I'd like to keep the web index small enough that it's still measured in TBs.
I've built search engines for research papers in the past. The key difference was that I could collect that data easily through a documented API. Now I need to either build or use a crawler, and I'm not sure where to begin. Here are some thoughts I have so far.
- I'm probably going to write the crawler in Go. It seems like a good fit for this sort of software.
- How do I start collecting lists of domains? Do I just start hitting public IPv4 addresses on ports 80 and 443, or should I start from a seed list like in the first sketch after this list?
- If I run something like this on a server with proper rate limiting (the second sketch below is the kind of per-host limiting I have in mind), would Cloudflare inevitably start blocking me anyway?
- If I were to run this from a machine connected to the Internet via a residential ISP, would I get a nasty letter from my ISP?
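
To make the domain-seeding question concrete, here's a rough sketch of what I'm imagining instead of scanning IPv4 space: start from a plain text list of domains and feed them into a naive crawl frontier. The `seeds.txt` filename and the idea of downloading some public ranked-domain dump into it are just my assumptions.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
)

// loadSeeds reads one domain per line from a seed file and pushes each
// homepage URL onto a buffered channel that acts as a very naive crawl
// frontier.
func loadSeeds(path string) (<-chan string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	frontier := make(chan string, 1024)
	go func() {
		defer f.Close()
		defer close(frontier)
		sc := bufio.NewScanner(f)
		for sc.Scan() {
			if domain := sc.Text(); domain != "" {
				frontier <- "https://" + domain + "/"
			}
		}
	}()
	return frontier, nil
}

func main() {
	// seeds.txt is a hypothetical file with one domain per line.
	frontier, err := loadSeeds("seeds.txt")
	if err != nil {
		log.Fatal(err)
	}
	for u := range frontier {
		fmt.Println("would fetch:", u)
	}
}
```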
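
And for the rate-limiting question, per-host limiters along these lines are what I have in mind (using golang.org/x/time/rate; the one-request-every-two-seconds budget and the User-Agent string are placeholders I made up):

```go
package main

import (
	"context"
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
	"sync"
	"time"

	"golang.org/x/time/rate"
)

// hostLimiters hands out one token-bucket limiter per host so the crawler
// never hits a single site faster than the configured rate, no matter how
// many workers are running.
type hostLimiters struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
}

func (h *hostLimiters) get(host string) *rate.Limiter {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.limiters == nil {
		h.limiters = make(map[string]*rate.Limiter)
	}
	lim, ok := h.limiters[host]
	if !ok {
		// Assumed politeness budget: one request every 2 seconds per host.
		lim = rate.NewLimiter(rate.Every(2*time.Second), 1)
		h.limiters[host] = lim
	}
	return lim
}

// politeFetch blocks until the host's limiter allows a request, then does a
// plain GET with a timeout and returns the response body.
func politeFetch(ctx context.Context, h *hostLimiters, rawURL string) ([]byte, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return nil, err
	}
	if err := h.get(u.Host).Wait(ctx); err != nil {
		return nil, err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, rawURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", "my-learning-crawler/0.1") // placeholder UA
	client := &http.Client{Timeout: 15 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

func main() {
	var limiters hostLimiters
	body, err := politeFetch(context.Background(), &limiters, "https://example.com/")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(body), "bytes fetched")
}
```

This obviously doesn't check robots.txt yet, which I assume I'd need to do before pointing it at anyone's site.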
Any advice or feedback is appreciated. The goal of this project is to learn about web crawling rather than to build a product to sell.