Does Google use the text inside gdocs and Gmail for training AI models?

Question

Google searches = used for training AI models
Apple notes = private
Google docs = ?
Siri requests = used for training models
Emails you send in gmail = ?

I'm seeking to understand what things people might think are private, because they're not posted on the open web, but which are in fact used for training AI models.

Awelton · Accepted Answer

I don't know how everyone else approaches it, but I just assume that anything uploaded to Google's servers will be snooped on using some legal loophole or another. I don't have time to study the TOS for every user-hostile application in the world, so guilty until proven innocent is the only sane position I can see to take.

advisedwang · Answer

https://support.google.com/docs/answer/10381817 states:
> Google Docs, Sheets, & Slides uses data to improve your experience
> To provide services like spam filtering, virus detection, malware protection and the ability to search for files within your individual account, we process your content.
So yes, they do use your data for things like training AI. However, it does not seem to be for general AI like Bard, only for ML systems within the product in question.

touringa · Answer

If you're talking about Google Bard, they were very clear in the LaMDA paper that they only used public sources.
"...from public dialog data and other public web documents..."
LaMDA paper: https://arxiv.org/abs/2201.08239
My overview of Google Bard including dataset: https://lifearchitect.ai/bard/
My overview of Google PaLM and Pathways family including dataset: https://lifearchitect.ai/pathways/
Compare with other models including the use of DeepMind's MassiveWeb/MassiveText and EleutherAI's Pile dataset: https://lifearchitect.ai/whats-in-my-ai/

knaik94 · Answer

No, it does not; it would be irresponsible to do that on private data. There's a very clear line between data posted publicly and data held privately, especially in terms of copyright. I doubt it will ever be default opt-in for something as sensitive as e-mail and docs.
One exception to that is scanning for CSAM, terrorism, and DMCA violations. With DMCA, it's automated based on file hash, and you still maintain access to your files; you are just limited from sharing them. Ads in Gmail aren't based on content, but on other online activity while logged in.
I think the other exception to that is Smart Compose. AI models do use email content for training data, but the output of those is strictly for use locally while writing emails. I imagine it's also siloed per user.
EDIT: Not a google employee, I apologize if my assertions seem too strong.
EDIT2: https://en.wikipedia.org/wiki/Federated_learning
"We have always maintained that you control your data and we process it according to the agreement(s) we have with you. Furthermore, we will not and cannot look at it without a legitimate need to support your use of the service -- and even then it is only with your permission. Here are some of the additional measures we take to ensure your privacy: (reference: GCP Terms).
In addition to these commitments, for AI/ML development, we don’t use data that you provide us to train our own models without your permission. And if you want to work together to develop a solution using any of our AI/ML products, by default our teams will work only with data that you have provided and that has identifying information removed. We work with your raw data only with your consent and where the model development process requires it."
https://cloud.google.com/blog/products/ai-machine-learning/g...
https://support.google.com/mail/answer/6603?hl=en
https://arxiv.org/abs/1906.00080
https://ai.googleblog.com/2017/04/federated-learning-collabo...

mikek · Answer

Email contents are used to generate a model for Smart Compose in Gmail [1]. I assume that Google Docs works similarly.
[1] https://arxiv.org/pdf/1906.00080.pdf

askiiart · Answer

Well, since Google offers an autocomplete AI, I'd assume any text Google has (or at the very least, text that can be auto-completed) gets fed into their AIs. I have no evidence for this, and I should really read the ToS sometime, but I digress.

CrypticShift · Answer

A related question: are the 40+ million [1] full-text books used?
OpenAI is using Books1 (BookCorpus?) and Books2 sources. By the number of tokens, this seems to be fewer than a million books in total.
[1] https://www.blog.google/products/search/15-years-google-book...

kartayyar · Answer

About emails:
No for ads.
Yes for training models for Smart Reply, from: https://arxiv.org/pdf/1606.04870.pdf
> Privacy. Note that all email data (raw data, preprocessed data and training data) was encrypted. Engineers could only inspect aggregated statistics on anonymized sentences that occurred across many users and did not identify any user. Also, only frequent words are retained. As a result, verifying model's quality and debugging is more complex.
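
As a rough illustration of the "only frequent words are retained" idea in the quoted note, here is a minimal Python sketch. It is not the pipeline from the paper: the function names, the `<unk>` placeholder, and the rule of counting distinct users per word (`min_users`) are all assumptions made for this example.

```python
from collections import Counter


def build_vocab(messages_by_user, min_users=2):
    """Return the words used by at least `min_users` distinct users.

    Counting users rather than raw occurrences is an assumption of this
    sketch: it keeps a word one person repeats often out of the vocabulary.
    """
    users_per_word = Counter()
    for messages in messages_by_user.values():
        seen = set()
        for message in messages:
            seen.update(message.lower().split())
        # Each user contributes at most 1 to every word they used.
        users_per_word.update(seen)
    return {word for word, n in users_per_word.items() if n >= min_users}


def anonymize(message, vocab, unk="<unk>"):
    """Replace every word outside the shared vocabulary with a placeholder."""
    return " ".join(w if w in vocab else unk for w in message.lower().split())


# Hypothetical toy data, not real email content.
messages_by_user = {
    "a": ["see you tomorrow", "my pin is 4921"],
    "b": ["see you soon"],
    "c": ["is tomorrow ok"],
}
vocab = build_vocab(messages_by_user, min_users=2)
masked = anonymize("my pin is 4921", vocab)
```

With this toy data, words that only one user wrote (such as the PIN) fall outside the vocabulary and are masked before any training step would see them.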

rsync · Answer

A different question should be asked:
Do their Terms of Service, etc., allow them to use that text for training models?

BenjaminDyer · Answer

I would assume so. All email content is surfaced for serving ads, so why not AI training?

ReflectedImage · Answer

If they used text inside Google Docs for training AI models, they would leak confidential customer information in an incredibly obvious way. So I'm guessing no.

Berniek · Answer

And a related question: do any/all of these programs use other copyrighted material as their training data, and is that a breach of copyright?

whalesalad · Answer

I wouldn't be surprised at all - gotta read the ToS. The entire reason Google went heavy on Gmail, despite it being a fun 10% project initially, was so that they could read your messages and use them to send more targeted ads.

guluarte · Answer

Google uses all the data it can for training its models; the only rule is that the data should not be traceable back to a specific user.

moremetadata · Answer

Can you have anti-spam AI models?

manv1 · Answer

Google says it doesn't use email text for targeting, but there are always "bugs."

lofaszvanitt · Answer

They are literally training their future enslaver.

behnamoh · Answer

Why would it matter? If you're concerned about your privacy, there are millions of other reasons to avoid Google.
