I don't want big companies to scrape my content and then sell it on their platform.
Whether LLM output is novel may be an open question, but the input is just someone else's stuff. I assumed that default copyright protects against this kind of bullshittery, that it says a work cannot be used, adapted, or copied without the creator's permission. (I can only guess it was allowed to happen because this is the first time someone has stolen IP in this particular manner on this scale?) But now that we know it's a thing, how can we maintain ownership of the inputs, both legally and engineering-wise?
What is not permitted, then, is to give other people copies, publish them on your website, pretend it's your own work, etc.
When it comes to LLMs or image generation models, they don't keep any copies and they don't generate any copies either, so they consider themselves to be well in the clear. [2]
If you want to stop people scraping your stuff anyway, you can always use robots.txt, or put things behind a login wall.
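For the robots.txt route, here is a rough sketch of what a rule set might look like and how a well-behaved crawler would read it, using Python's standard `urllib.robotparser`. The crawler name `ExampleAIBot` and the `example.com` URL are made up for illustration; real crawlers publish their own user-agent strings. Keep in mind that robots.txt is only a request: a scraper that chooses to ignore it is not technically blocked.

```python
from urllib.robotparser import RobotFileParser

# "ExampleAIBot" is a hypothetical crawler name, used only for illustration.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/post/1"  # placeholder URL
    print(is_allowed(ROBOTS_TXT, "ExampleAIBot", url))
    print(is_allowed(ROBOTS_TXT, "SomeBrowser", url))
```

The first rule group turns the named crawler away from the whole site, while the catch-all group leaves everyone else alone.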
Do consider the morality of what you are doing though. Personally I feel that published data should be scrape-able where practical.
[1] https://en.wikipedia.org/wiki/Authors_Guild%2C_Inc._v._Googl.... (you're even allowed to do this with physical books)
[2] https://www.uspto.gov/sites/default/files/documents/OpenAI_R... (With apologies for my crude summary of their actual arguments)
I grew the apple, I should be able to decide what people do with it. I have a sign that says the apples are only for eating, but people are ignoring it.
You can go to extremes and put your content behind a login, as others have suggested. But that would also create friction for your intended audience.
[1]: https://www.naiyerasif.com/post/2023/09/30/blocking-ai-web-c...
Even a simple self-made captcha (what is 2 + 7?) to reveal the content would probably stop LLM scrapers.
But it would hurt SEO, so you'd have to not rely on that.
Do what Medium does: show a paragraph first, then require the login/captcha to continue.
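A minimal sketch of that "teaser first, captcha to continue" idea, in plain Python with no web framework. All the function names here (`make_challenge`, `check_answer`, `teaser`, `render`) are my own, and the sketch assumes the expected answer is kept server-side (e.g. in a session) rather than sent to the client.

```python
import random
from typing import Optional, Tuple


def make_challenge(rng: random.Random) -> Tuple[str, int]:
    """Build a trivial arithmetic question and its expected answer."""
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    return f"What is {a} + {b}?", a + b


def check_answer(expected: int, submitted: str) -> bool:
    """Compare the visitor's answer with the stored one; junk input fails."""
    try:
        return int(submitted.strip()) == expected
    except ValueError:
        return False


def teaser(article: str) -> str:
    """Return only the first paragraph (text up to the first blank line)."""
    return article.strip().split("\n\n", 1)[0]


def render(article: str, expected: int, submitted: Optional[str]) -> str:
    """Serve the full article only after a correct answer, else the teaser."""
    if submitted is not None and check_answer(expected, submitted):
        return article
    return teaser(article)
```

Search engines and casual visitors still see the first paragraph, which softens the SEO hit somewhat, but anything after it is only served once the question is answered.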
(If you go a similar route, don't forget rate limiting.)
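For the rate limiting part, something like a per-client sliding window is probably enough to start with. This is just a sketch: the class name `SlidingWindowLimiter` is made up, the state lives in memory (so it resets on restart and is not shared between processes), and how you identify a client (IP, session, API key) is left open.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_id: str, now: Optional[float] = None) -> bool:
        """Record a request and report whether it is within the limit."""
        if now is None:
            now = time.monotonic()
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

Usage would be along the lines of `limiter = SlidingWindowLimiter(30, 60.0)` and then calling `limiter.allow(client_ip)` before serving a page or checking a captcha answer, so a bot cannot simply brute-force the answers.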
2. Put it behind a login wall