HACKER Q&A
📣 idlewords

How to keep Chinese crawlers from taking down my site?


I run Pinboard, a bookmarking website with about 20K active users.

By design, public pages on Pinboard are supposed to be visible to anyone, whether or not they have an account on the site. However, since about August of 2024 I have found myself overwhelmed with bot traffic from crawlers, which has forced me to put all pages on the site behind a login.

In the past, it was relatively easy to stop aggressive crawling by blocking on IP range or on some feature of the user agent string. This crawling is different—it comes from thousands of distinct IP addresses that make one or two requests each, and the user agent strings spoof normal browsers. Sampling this traffic shows it comes almost entirely from Hong Kong and mainland Chinese IP addresses. It averages about 1 request/second, although there are times when it can hit 4 requests/second or more.

The way Pinboard is designed, certain public pages (especially views of user bookmarks filtered by multiple tags) are expensive to generate. In ordinary circumstances, this is not an issue, but bot traffic spread across dozens of user+tag pages can quickly overwhelm the site, especially when the bots start paginating.

My question is how to effectively block this kind of distributed crawling on an Ubuntu box without relying on a third party like Cloudflare[0]. I understand that iptables is not designed to block tens of thousands of IP addresses or ranges efficiently. What options am I left with? Hiding public pages behind a captcha? Filtering the entire China IP range using rules loaded into a frontend like nginx?

[0] This restriction is a product requirement ("no third party anything"). You may think it's silly but bear with me; my users like it.


  👤 epc Accepted Answer ✓
I block entire /8s and /16s from China on my personal sites. They were swamping my bandwidth by requesting the same pages or images thousands of times a day. Start by blocking the Tencent and Alibaba Cloud networks, then work your way down through whichever other networks are generating the most traffic.
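For a handful of big ranges like that, plain iptables rules are enough. A minimal sketch (the CIDRs below are documentation placeholders, not real Tencent or Alibaba Cloud allocations; substitute the ranges you actually see in your logs):

# drop everything sourced from a couple of large ranges (placeholder CIDRs)
iptables -I INPUT -s 203.0.113.0/24 -j DROP
iptables -I INPUT -s 198.51.100.0/24 -j DROP

Once the list grows past a few dozen entries, the ipset approach mentioned further down is a better fit than one iptables rule per range.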

👤 jsheard
Dumb bulk crawlers usually don't bother running JavaScript, so you might be able to mitigate this by moving your expensive logic to an API endpoint that client-side JS calls to populate the page. Assuming you don't need search engines to be able to see those pages.

👤 tptacek
Just for people stumbling across this in the future: the general answer to blocking large numbers of discontiguous source IP ranges (or doing anything with those kinds of ranges) is `ipset`, which, IIRC, has hash and radix-trie backends in the kernel.

If you wanted to be super awesome about it, you could write a 20 line eBPF XDP program to do the same thing, populating hash and LPM trie maps of addresses and dropping crawlers without so much as giving them an skbuff for their SYN packets. But `ipset` is fine.
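A minimal sketch of the ipset approach, assuming the ipset and iptables userspace tools are installed (the ranges shown are placeholders; in practice you would load a real list):

# create a set backed by a hash of networks
ipset create blocked_nets hash:net
# add offending ranges one at a time...
ipset add blocked_nets 203.0.113.0/24
ipset add blocked_nets 198.51.100.0/24
# ...or bulk-load thousands of them by piping "add blocked_nets <cidr>" lines into: ipset restore
# drop any packet whose source address matches the set
iptables -I INPUT -m set --match-set blocked_nets src -j DROP

Lookups against the set are roughly constant-time, so the iptables rule count stays at one no matter how many ranges you add.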


👤 johng
This will help you sort IP addresses by number of open connections. It has been handy for me many times.

https://www.commandlinefu.com/commands/view/1767/number-of-o....
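If memory serves, the linked one-liner is along these lines (it counts current connections per remote IP, busiest addresses last):

# count open TCP/UDP connections per remote address
netstat -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -n

The first couple of output lines are netstat's headers and can be ignored.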


👤 johng
I used to null route large blocks as well. Let them sit and wait...

route add X.X.X.X gw 127.0.0.1 lo
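Assuming iproute2 is available, the modern equivalent is a blackhole route, which discards the traffic without bouncing it through loopback (placeholder addresses):

# silently drop all return traffic to a single host or a whole range
ip route add blackhole X.X.X.X/32
ip route add blackhole 203.0.113.0/24

Because replies never make it back out, the offending clients are left hanging on half-open connections.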


👤 thiagowfx
Have you heard of https://github.com/crowdsecurity/crowdsec? It seems like a good fit.

👤 active_caramel
Block access by including the Tiananmen Square date in your pages; the Great Firewall may then do the blocking for you. Google it and read up, might help.

👤 JSTrading
Block them with iptables? Use something like this (not foolproof): https://herrbischoff.com/2021/03/herr-bischoffs-ip-blocklist... Although I like to have fun with these things: I wrote a Docker image and my own C++ app (so it's fast) that randomly redirects them to a set of random pages I switch up now and again. Sites like this one, or random news sites: https://www.planethollywoodlondon.com/