HACKER Q&A
📣 tikkun

How to publish online without it being used for ML model training?


Is there any way to publish somewhere such that the content is unlikely to be used for model training? (A guarantee would be great, but that seems impossible.)


  👤 themodelplumber Accepted Answer ✓
If by online you mean web publishing, it seems there's some consensus around these tags:

    

    
https://twitter.com/globalcomix/status/1604279726985474048
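
The tags themselves did not survive in this post. As a rough sketch, and assuming the tags meant are the non-standard `noai` / `noimageai` robots directives that DeviantArt introduced in late 2022 (that is my assumption, not something this thread confirms), the markup could be generated like this. The helper name `robots_meta_tag` is made up for illustration, and crawlers are under no obligation to honor these directives.

```python
from html import escape

# Assumption: "noai" and "noimageai" are the opt-out directives in question.
# They are a non-standard convention; a crawler is free to ignore them.
AI_OPT_OUT_DIRECTIVES = ("noai", "noimageai")


def robots_meta_tag(directives=AI_OPT_OUT_DIRECTIVES):
    """Return a <meta name="robots"> tag listing the given directives."""
    content = ", ".join(directives)
    return f'<meta name="robots" content="{escape(content, quote=True)}">'


# Paste the printed tag into the <head> of each page you publish.
print(robots_meta_tag())
```

This only states a preference in the page markup; it does nothing to stop a scraper that never looks for it.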

Another thing to potentially look into is the HTTP response headers being sent:

https://twitter.com/stealcase/status/1605736262949687296
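
For the header route, here is a minimal sketch of a WSGI app that attaches an `X-Robots-Tag` response header to everything it serves. `X-Robots-Tag` is a real header that search crawlers understand; using `noai, noimageai` as its value is an assumption on my part (it mirrors the DeviantArt convention), and the name `app` and the page body are placeholders.

```python
# Assumption: the opt-out is expressed as "noai, noimageai" in X-Robots-Tag.
# This is a non-standard value; only crawlers that choose to check it will care.
AI_OPT_OUT_HEADER = ("X-Robots-Tag", "noai, noimageai")


def app(environ, start_response):
    """Tiny WSGI app that sends the opt-out header with every response."""
    body = b"<!doctype html><title>demo</title><p>Hello</p>"
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
        AI_OPT_OUT_HEADER,
    ]
    start_response("200 OK", headers)
    return [body]


# To try it locally with the standard library's reference server:
#   from wsgiref.simple_server import make_server
#   make_server("127.0.0.1", 8000, app).serve_forever()
```

In practice you would more likely set the header in your web server or CDN config so it also covers images and other static files, not just HTML.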


👤 smoldesu
Watermarking it might be a good start. The people who go through training data are probably flagging pictures that have unsightly artifacts or unrealistic destructive changes in the image. If you add a Shutterstock-style watermark, the image would probably get removed from most sets, and a prominent signature in the bottom corner would probably also work pretty well.

As for text, I guess your best bet is to either limit its exposure or intentionally poison the data. That's a little harder to do in writing, but you could still try by creating unusable fictitious accounts of characters named "Biden" or "Boris" doing increasingly ridiculous things. Any politically-stark moderator would probably remove your data before it hits the model, and if it does get through, there's a good chance it will be flagged as problematic.