You cannot publish content for public access and yet restrict it to selected parties at your discretion. Content is either public or it isn't, and obfuscation, even via authentication, is not a security solution. The role of authentication is not to limit access but to segment it, so that account-specific content is not intermingled between accounts.
No, because I can still copy and redistribute your auth-locked content. By writing something and distributing that writing, you are making it possible to train on its contents. That's the way things are now.
It may discourage some companies from scraping your content.
And if they ignore the license, you may have grounds to collect the fee if it becomes obvious that they are using your content in their AI models. (Unlikely that you'd actually collect, though.)
The process I used was to gather the ASNs of every local ISP where my users are. Thanks to BGP looking glass services, you can convert this to a list of every advertised prefix. You can coalesce this into an optimised data structure using “ipset” to build a firewall rule that affects only addresses you whitelist.
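Roughly, that pipeline can be sketched like this. The ASNs, the set name, and the use of RIPEstat's "announced-prefixes" data call for the looking-glass step are placeholders and assumptions, not necessarily what I actually ran:

```python
#!/usr/bin/env python3
"""Build an ipset whitelist from the prefixes announced by a list of ASNs.

Minimal sketch: the ASN list, set name, and the RIPEstat endpoint (and its
response shape) are assumptions standing in for whatever looking glass you use.
"""
import json
import urllib.request

LOCAL_ASNS = ["AS64496", "AS64497"]   # placeholder ASNs of the local ISPs
SET_NAME = "local_whitelist"

def announced_prefixes(asn: str) -> list[str]:
    """Ask RIPEstat which prefixes an ASN currently announces."""
    url = f"https://stat.ripe.net/data/announced-prefixes/data.json?resource={asn}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        data = json.load(resp)
    return [p["prefix"] for p in data["data"]["prefixes"]]

def main() -> None:
    prefixes = set()
    for asn in LOCAL_ASNS:
        # IPv4 only in this sketch; an inet6 set would be built the same way.
        prefixes.update(p for p in announced_prefixes(asn) if ":" not in p)

    # Emit "ipset restore" input; load it with:  ipset restore < whitelist.ipset
    with open("whitelist.ipset", "w") as f:
        f.write(f"create {SET_NAME} hash:net family inet maxelem 262144\n")
        for prefix in sorted(prefixes):
            f.write(f"add {SET_NAME} {prefix}\n")

    # Then a single firewall rule keyed on the set, e.g.:
    #   iptables -A INPUT -p tcp --dport 443 -m set ! --match-set local_whitelist src -j DROP
    print(f"wrote {len(prefixes)} prefixes for {SET_NAME}")

if __name__ == "__main__":
    main()
```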
I also found a data center offering colocation which is directly peered with every major ISP in my area and a few of the minor ones.
Any visitor that doesn't match my whitelist sees a decoy page. For the protected content, I set the IP TTL just high enough that responses can only be delivered within the service area.
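Done at the application layer, the TTL trick amounts to something like the sketch below; the port and the hop budget of 8 are made-up numbers, and "just high enough" depends on your topology. The firewall-level equivalent would be an iptables mangle rule using the TTL target.

```python
#!/usr/bin/env python3
"""Serve the protected content with a deliberately small IP TTL.

Minimal sketch, not necessarily how it's done in production: port and hop
budget are placeholders.
"""
import socket

HOP_BUDGET = 8  # response packets expire after this many router hops

def serve() -> None:
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", 8080))
    srv.listen()
    while True:
        conn, addr = srv.accept()
        # Cap the TTL on each accepted connection so replies stop propagating
        # once they are HOP_BUDGET routers away from the server.
        conn.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, HOP_BUDGET)
        conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
        conn.close()

if __name__ == "__main__":
    serve()
```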
One drawback of this approach is that it's difficult to include local mobile users (without including their entire parent network), because they need higher TTLs and come from a network with many prefixes. I'd be interested in an iptables rule that can drop connections after a few packets based on an RTT heuristic.
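I'm not aware of a stock iptables match on RTT, but Linux does expose the smoothed RTT per connection through TCP_INFO, so a crude application-level version of that heuristic can live in the server itself. The 68-byte offset of tcpi_rtt within struct tcp_info and the 30 ms cut-off below are assumptions, not tested values:

```python
"""Crude application-level RTT heuristic (Linux only): reject connections
whose handshake RTT says the client is too far away.  The tcpi_rtt offset
and the cut-off are assumptions for this sketch.
"""
import socket
import struct

RTT_LIMIT_US = 30_000  # reject anything slower than ~30 ms round trip (made-up cut-off)

def smoothed_rtt_us(conn: socket.socket) -> int:
    """Read tcpi_rtt (microseconds) from TCP_INFO for a connected socket."""
    info = conn.getsockopt(socket.IPPROTO_TCP, socket.TCP_INFO, 104)
    # struct tcp_info begins with eight __u8 fields followed by __u32s;
    # tcpi_rtt is the 16th __u32, i.e. byte offset 8 + 15 * 4 = 68 on
    # current kernels (assumed layout).
    return struct.unpack_from("=I", info, 68)[0]

def serve() -> None:
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", 8080))
    srv.listen()
    while True:
        conn, addr = srv.accept()
        if smoothed_rtt_us(conn) > RTT_LIMIT_US:
            conn.close()  # handshake RTT says the client is outside the service area
            continue
        conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
        conn.close()

if __name__ == "__main__":
    serve()
```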
This approach is the opposite of "zero-trust", and it fails as soon as your threats start moving into your users' networks.
Overall, it simply makes life a bit harder for crawlers (AI-driven or plain old hardcoded ones).
I saw a SaaS recently that has something targeting this. They will automatically update your robots.txt to include the different crawlers they find.
This will only stop crawlers that obey robots.txt though.
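A minimal sketch of that kind of robots.txt updater, assuming a small sample list of AI crawler user-agents (GPTBot, CCBot, ClaudeBot, Google-Extended) and marker comments of my own invention:

```python
"""Keep a managed block of AI-crawler rules appended to an existing robots.txt.

The crawler list is only a sample, and the marker comments are just a
convention for this sketch, not a standard.
"""
from pathlib import Path

AI_CRAWLERS = ["GPTBot", "CCBot", "ClaudeBot", "Google-Extended"]
BEGIN, END = "# --- managed AI-crawler block ---", "# --- end managed block ---"

def updated_robots(existing: str, crawlers: list[str]) -> str:
    # Strip any previous managed block so repeated runs don't duplicate it.
    if BEGIN in existing:
        head, _, rest = existing.partition(BEGIN)
        _, _, tail = rest.partition(END)
        existing = head.rstrip() + tail
    rules = "\n".join(f"User-agent: {ua}\nDisallow: /\n" for ua in crawlers)
    return existing.rstrip() + f"\n\n{BEGIN}\n{rules}{END}\n"

if __name__ == "__main__":
    path = Path("robots.txt")
    original = path.read_text() if path.exists() else "User-agent: *\nAllow: /\n"
    path.write_text(updated_robots(original, AI_CRAWLERS))
```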
What would be the demand for something like this as a SaaS? Wondering if it would be worth developing it.