HACKER Q&A
📣 shanebellone

Real-time backend bot handling?


Morning HN.

I'm looking to programmatically identify bots and serve them 403s rather than content. For this project, blocking 80% of bot traffic would be considered a success. Search bots will retain access to content.

My plan is to combine HTTP headers with real-time analytics: I can measure the time between hits while considering the visitor's browser, operating system, and language headers. I'm uncertain whether there are legitimate use cases for atypical user agents.
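Roughly the heuristic I have in mind, as a sketch (every threshold and header check below is a guess, not something I've settled on):

    import time

    last_hit = {}  # client IP -> monotonic timestamp of its previous request

    def looks_like_bot(client_ip, headers, min_interval=0.5):
        # Combine inter-hit timing with weak header signals; any two
        # together trip the check.
        now = time.monotonic()
        prev = last_hit.get(client_ip)
        last_hit[client_ip] = now

        suspicious = 0
        if prev is not None and now - prev < min_interval:
            suspicious += 1  # hitting faster than a human would click
        if not headers.get("Accept-Language"):
            suspicious += 1  # browsers normally send a language
        ua = headers.get("User-Agent", "").lower()
        if not ua or "python-requests" in ua or "curl" in ua:
            suspicious += 1  # default library/CLI user agents
        return suspicious >= 2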

How would you approach the task with backend technology (without JS, cookies, or external tools)?


  👤 LinuxBender Accepted Answer ✓
Bots that use headless Chrome are unlikely to be detected without specialized tools.

Poorly written bots can be detected by looking at TCP header information such as MSS, window size, sequence numbers, TTL, and packet length. [1] These are some of the signals CDNs like Cloudflare look at. This probably won't get you to your desired 80%, but it might get you halfway there.
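Roughly what that looks like in Python with scapy (the signature table here is made up for illustration; p0f ships a real database):

    from scapy.all import sniff, IP, TCP

    # (initial TTL bucket, TCP window size) pairs typical of mainstream
    # desktop stacks. Purely illustrative placeholders.
    KNOWN_OK = {(64, 65535), (64, 64240), (128, 65535), (128, 64240)}

    def ttl_bucket(ttl):
        # Observed TTL is the initial value minus hops; round up to the
        # usual initial values.
        for initial in (64, 128, 255):
            if ttl <= initial:
                return initial
        return ttl

    def inspect(pkt):
        tcp = pkt[TCP]
        if tcp.flags != "S":  # only connection-opening SYNs
            return
        sig = (ttl_bucket(pkt[IP].ttl), tcp.window)
        if sig not in KNOWN_OK:
            print(f"suspicious SYN from {pkt[IP].src}: ttl/window {sig}")

    sniff(filter="tcp[tcpflags] & tcp-syn != 0", prn=inspect, store=False)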

Another method I have used to squash a handful of bots is to require HTTP/2.0. Many older bots only speak HTTP/1.1 or 1.0, while the modern, common browsers all speak HTTP/2.0. This method may break some CLI tools if your site has an API gateway; that can be worked around by putting the API gateway on its own load balancer or web servers. I should add that Google's crawler bots cannot yet speak HTTP/2.0, so exclude them from this restriction.
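If you enforce this at the application layer rather than the load balancer, the check can be as small as this ASGI middleware sketch (the /api exemption and the user-agent check for Googlebot are placeholders; verify crawlers properly, e.g. via reverse DNS, in production):

    class RequireHTTP2:
        # Reject HTTP/1.x clients with a 403. Assumes the ASGI server
        # negotiated the protocol and set scope["http_version"].
        def __init__(self, app, exempt_prefixes=("/api",)):
            self.app = app
            self.exempt_prefixes = exempt_prefixes

        async def __call__(self, scope, receive, send):
            if scope["type"] == "http" and scope.get("http_version", "1.1") < "2":
                headers = dict(scope.get("headers") or [])
                ua = headers.get(b"user-agent", b"").lower()
                exempt = (scope.get("path", "").startswith(self.exempt_prefixes)
                          or b"googlebot" in ua)  # placeholder crawler check
                if not exempt:
                    await send({"type": "http.response.start", "status": 403,
                                "headers": [(b"content-type", b"text/plain")]})
                    await send({"type": "http.response.body", "body": b"Forbidden"})
                    return
            await self.app(scope, receive, send)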

Another method is to implement strict-SNI on your load balancers. Some bots can't deal with this yet and will end up on your default dummy site, or just get an error, depending on how you implement it.
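The usual place for this is the load balancer config (e.g. haproxy's strict-sni), but the mechanics look like this with Python's ssl module (hostnames and cert paths are placeholders):

    import ssl

    ALLOWED = {"example.com", "www.example.com"}  # placeholder hostnames

    def refuse_unknown_names(ssl_socket, server_name, ssl_context):
        # No SNI at all, or a name we don't serve -> abort the handshake.
        if server_name not in ALLOWED:
            return ssl.ALERT_DESCRIPTION_UNRECOGNIZED_NAME
        return None  # continue with the default context

    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain("cert.pem", "key.pem")  # placeholder paths
    ctx.sni_callback = refuse_unknown_names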

[1] - https://github.com/p0f/p0f/


👤 ggeorgovassilis
> I'm uncertain whether there are legitimate use cases for atypical user agents.

RSS readers and small search engines you haven't heard of but you're discriminating against :-)

> How would you approach the task with backend technology (without JS, cookies, or external tools)?

I solve this with client-side techniques (e.g. reCAPTCHA). Isolating and blocking IP ranges from server logs (e.g. dropping anything from AWS and Azure subnets) might discourage bot operators.
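A minimal sketch of the subnet check, with placeholder CIDRs (the real ranges come from the published AWS and Azure IP feeds):

    import ipaddress

    # Placeholder CIDRs; fetch the real lists from ip-ranges.amazonaws.com
    # and Microsoft's Azure "ServiceTags" download.
    BLOCKED_NETS = [ipaddress.ip_network(c) for c in (
        "3.0.0.0/8",
        "20.0.0.0/8",
    )]

    def from_blocked_cloud(ip):
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in BLOCKED_NETS)

    print(from_blocked_cloud("3.120.4.5"))  # True
    print(from_blocked_cloud("8.8.8.8"))    # False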