Apple Notes = private
Google Docs = ?
Siri requests = used for training models
Emails you send in Gmail = ?
I'm trying to understand which things people might assume are private, because they're not posted on the open web, but which are nevertheless used to train AI models.
> Google Docs, Sheets, & Slides uses data to improve your experience
> To provide services like spam filtering, virus detection, malware protection and the ability to search for files within your individual account, we process your content.
So yes, they do use your data for things like training AI. However, it does not appear to be for general-purpose AI like Bard, but rather for ML systems within the product in question.
"...from public dialog data and other public web documents..."
LaMDA 2 paper: https://arxiv.org/abs/2201.08239
My overview of Google Bard including dataset: https://lifearchitect.ai/bard/
My overview of Google PaLM and Pathways family including dataset: https://lifearchitect.ai/pathways/
Compare with other models including the use of DeepMind's MassiveWeb/MassiveText and EleutherAI's Pile dataset: https://lifearchitect.ai/whats-in-my-ai/
One exception to that is scanning for CSAM, terrorism-related content, and DMCA violations. With DMCA, the matching is automated based on file hashes, and you still retain access to your files; you are just blocked from sharing them. Ads in Gmail aren't based on message content, but on other online activity while you're logged in.
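The actual hashing scheme isn't public, so purely as an illustration, here is a minimal sketch of hash-based takedown matching, assuming a plain SHA-256 content hash and a hypothetical blocklist:

    import hashlib

    # Hypothetical set of content hashes flagged by DMCA takedown notices.
    # The real pipeline and hash function are not public; SHA-256 is just
    # an illustrative stand-in.
    TAKEDOWN_HASHES = {
        "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
    }

    def file_sha256(path, chunk_size=1 << 20):
        """Hash a file in chunks so large uploads don't need to fit in memory."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def sharing_allowed(path):
        """Owner keeps access either way; only sharing is blocked on a match."""
        return file_sha256(path) not in TAKEDOWN_HASHES

The point is that matching works on the file's fingerprint, not on any understanding of its contents, and the owner's own access is never revoked.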
I think the other exception to that is Smart Compose. AI models do use email content as training data, but their output is used strictly locally while writing emails. I imagine it's also siloed per user.
EDIT: I'm not a Google employee; I apologize if my assertions seem too strong.
EDIT2: https://en.wikipedia.org/wiki/Federated_learning
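For anyone curious how a model can learn from email text without that text leaving the device: below is a minimal sketch of federated averaging, the core idea behind federated learning, using made-up client data and a tiny NumPy linear model. It's an illustration of the concept, not Google's implementation.

    import numpy as np

    rng = np.random.default_rng(0)
    true_w = np.array([1.0, -2.0, 0.5])

    # 4 simulated clients, each with its own private (X, y) data that never
    # leaves the "device"; all data here is synthetic.
    clients = []
    for _ in range(4):
        X = rng.normal(size=(50, 3))
        y = X @ true_w + rng.normal(scale=0.1, size=50)
        clients.append((X, y))

    def local_update(w, X, y, lr=0.05, epochs=5):
        """One client's local gradient steps on its own data; returns new weights."""
        w = w.copy()
        for _ in range(epochs):
            grad = 2 * X.T @ (X @ w - y) / len(y)
            w -= lr * grad
        return w

    # Federated averaging: only weight vectors travel to the server, never raw data.
    w_global = np.zeros(3)
    for _ in range(20):
        local_weights = [local_update(w_global, X, y) for X, y in clients]
        w_global = np.mean(local_weights, axis=0)

    print("recovered weights:", w_global)  # should be close to true_w

Each client only ships back model parameters; the server aggregates them and never sees the raw examples.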
"We have always maintained that you control your data and we process it according to the agreement(s) we have with you. Furthermore, we will not and cannot look at it without a legitimate need to support your use of the service -- and even then it is only with your permission. Here are some of the additional measures we take to ensure your privacy: (reference: GCP Terms).
In addition to these commitments, for AI/ML development, we don’t use data that you provide us to train our own models without your permission. And if you want to work together to develop a solution using any of our AI/ML products, by default our teams will work only with data that you have provided and that has identifying information removed. We work with your raw data only with your consent and where the model development process requires it. "
https://cloud.google.com/blog/products/ai-machine-learning/g...
https://support.google.com/mail/answer/6603?hl=en
https://arxiv.org/abs/1906.00080
https://ai.googleblog.com/2017/04/federated-learning-collabo...
OpenAI is using the Books1 (BookCorpus?) and Books2 sources. Judging by the number of tokens, this seems to be less than a million books in total.
[1] https://www.blog.google/products/search/15-years-google-book...
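Rough back-of-the-envelope on that, assuming the token counts reported in the GPT-3 paper (Books1 ~12B tokens, Books2 ~55B tokens) and a guessed average of ~100k tokens per book; the per-book figure and the exact corpus identities are assumptions:

    # Back-of-envelope: Books1/Books2 token counts as reported in the GPT-3 paper;
    # ~100k tokens (~75k words) per average book is a rough assumption.
    books1_tokens = 12e9
    books2_tokens = 55e9
    tokens_per_book = 1e5

    estimated_books = (books1_tokens + books2_tokens) / tokens_per_book
    print(f"~{estimated_books:,.0f} books")  # ~670,000, i.e. under a million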
No for ads.
Yes for training models for Smart Reply, from https://arxiv.org/pdf/1606.04870.pdf:
> Privacy. Note that all email data (raw data, preprocessed data and training data) was encrypted. Engineers could only inspect aggregated statistics on anonymized sentences that occurred across many users and did not identify any user. Also, only frequent words are retained. As a result, verifying model's quality and debugging is more complex.
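To illustrate the "only frequent words are retained" part: here is a minimal sketch of frequency-based vocabulary filtering, where a token is kept only if it appears for at least K distinct users. The threshold and counting scheme are made up for illustration, not taken from the paper.

    from collections import defaultdict

    # A token only enters the shared vocabulary if it appears for at least
    # MIN_USERS distinct users, so rare (potentially identifying) strings
    # never leave aggregate form. Threshold and data are illustrative only.
    MIN_USERS = 3

    def build_vocab(emails_by_user):
        users_per_token = defaultdict(set)
        for user_id, emails in emails_by_user.items():
            for email in emails:
                for token in email.lower().split():
                    users_per_token[token].add(user_id)
        return {t for t, users in users_per_token.items() if len(users) >= MIN_USERS}

    emails = {
        "u1": ["see you at lunch", "call me at 555-0100"],
        "u2": ["see you tomorrow", "lunch sounds good"],
        "u3": ["lunch at noon", "see you then"],
    }
    print(build_vocab(emails))  # common words survive; the phone number does not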
Do their Terms of Service, etc., allow them to use that text for training models?