HACKER Q&A
📣 tikkun

Does Google use the text inside gdocs and Gmail for training AI models?


Google searches = used for training AI models

Apple notes = private

Google docs = ?

Siri requests = used for training models

Emails you send in gmail = ?

I'm seeking to understand what things people might think are private, because they're not posted on the open web, but where they're used for training AI models.


  👤 Awelton Accepted Answer ✓
I don't know how everyone else approaches it, but I just assume that anything uploaded to Google's servers will be snooped on using some legal loophole or another. I don't have time to study the TOS for every user-hostile application in the world, so guilty until proven innocent is the only sane position I can see to take.

👤 advisedwang
https://support.google.com/docs/answer/10381817 states:

> Google Docs, Sheets, & Slides uses data to improve your experience

> To provide services like spam filtering, virus detection, malware protection and the ability to search for files within your individual account, we process your content.

So yes, they do use your data for things like training AI. However, it does not seem to be for general AI like Bard, but for ML systems within the product in question.


👤 touringa
If you're talking about Google Bard, they were very clear in the LaMDA paper that they only used public sources.

"...from public dialog data and other public web documents..."

LaMDA paper: https://arxiv.org/abs/2201.08239

My overview of Google Bard including dataset: https://lifearchitect.ai/bard/

My overview of Google PaLM and Pathways family including dataset: https://lifearchitect.ai/pathways/

Compare with other models including the use of DeepMind's MassiveWeb/MassiveText and EleutherAI's Pile dataset: https://lifearchitect.ai/whats-in-my-ai/


👤 knaik94
No, it does not; it would be irresponsible to do that on private data. There's a very clear line between data posted publicly and data held privately, especially in terms of copyright. I doubt it will ever be default opt-in for something as sensitive as email and docs.

One exception to that is scanning for CSAM, terrorism, and DMCA violations. With DMCA, it's automated based on file hash, and you still maintain access to your files; you are just limited from sharing them. Ads in Gmail aren't based on content, but on other online activity while logged in.
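
To illustrate what "automated based on file hash" could look like, here is a minimal Python sketch. It is only an illustration: the blocklist contents, the `can_share` helper, and the choice of SHA-256 are all assumptions, not anything Google documents.

```python
import hashlib

# Hypothetical blocklist: SHA-256 digests of files named in takedown notices.
BLOCKED_HASHES = {hashlib.sha256(b"contents of a flagged file").hexdigest()}

def file_digest(data: bytes) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(data).hexdigest()

def can_share(data: bytes) -> bool:
    """The owner keeps access either way; only sharing is refused on a match."""
    return file_digest(data) not in BLOCKED_HASHES

print(can_share(b"contents of a flagged file"))  # False
print(can_share(b"my own notes"))                # True
```

An exact hash only matches byte-identical files, so nothing about the content itself needs to be read or learned from for this kind of check.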

I think the other exception to that is Smart Compose. AI models do use email content as training data, but the output of those is strictly for use locally while writing emails. I imagine it's also siloed per user.

EDIT: Not a Google employee, I apologize if my assertions seem too strong.

EDIT2: https://en.wikipedia.org/wiki/Federated_learning

"We have always maintained that you control your data and we process it according to the agreement(s) we have with you. Furthermore, we will not and cannot look at it without a legitimate need to support your use of the service -- and even then it is only with your permission. Here are some of the additional measures we take to ensure your privacy: (reference: GCP Terms).

In addition to these commitments, for AI/ML development, we don’t use data that you provide us to train our own models without your permission. And if you want to work together to develop a solution using any of our AI/ML products, by default our teams will work only with data that you have provided and that has identifying information removed. We work with your raw data only with your consent and where the model development process requires it."

https://cloud.google.com/blog/products/ai-machine-learning/g...

https://support.google.com/mail/answer/6603?hl=en

https://arxiv.org/abs/1906.00080

https://ai.googleblog.com/2017/04/federated-learning-collabo...
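
For anyone who doesn't want to click through, the core idea of federated learning can be sketched as a toy Python example: each client trains on its own data and only model weights are sent back and averaged. This is a sketch of generic federated averaging, not a claim about how Gmail or Docs actually train anything; the linear model, the learning rate, and the function names are all made up for illustration.

```python
from typing import List, Tuple

def local_update(weights: List[float], data: List[Tuple[float, float]],
                 lr: float = 0.1) -> List[float]:
    """One pass of SGD for y = w*x + b, run on a single client's private data."""
    w, b = weights
    for x, y in data:
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
    return [w, b]

def federated_average(client_weights: List[List[float]],
                      client_sizes: List[int]) -> List[float]:
    """Average client weights, weighting each client by its number of examples."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(cw[i] * n for cw, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]

def one_round(global_w: List[float],
              clients: List[List[Tuple[float, float]]]) -> List[float]:
    """The server sees only the updated weights, never the clients' raw data."""
    updates = [local_update(global_w, data) for data in clients]
    return federated_average(updates, [len(d) for d in clients])

# Two hypothetical clients, each holding its own (x, y) pairs.
clients = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0)]]
print(one_round([0.0, 0.0], clients))
```

Note that weight updates can still leak information about the underlying data, which is why real systems usually add further protections on top.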


👤 mikek
Email contents are used to generate a model for Smart Compose in Gmail [1]. I assume that Google Docs works similarly.

[1] https://arxiv.org/pdf/1906.00080.pdf


👤 askiiart
Well, since Google offers an autocomplete AI, I'd assume any text Google has (or at the very least, text that can be auto-completed) gets fed into their AIs. I have no evidence for this, and I should really read the ToS sometime, but I digress.

👤 CrypticShift
A related question: are the 40+ million [1] full-text books used?

OpenAI is using Books1 (BookCorpus?) and Books2 sources. By the number of tokens, this seems to be fewer than a million books in total.

[1] https://www.blog.google/products/search/15-years-google-book...


👤 kartayyar
About emails:

No for ads.

Yes for training models for Smart Reply, from: https://arxiv.org/pdf/1606.04870.pdf

> Privacy: Note that all email data (raw data, preprocessed data and training data) was encrypted. Engineers could only inspect aggregated statistics on anonymized sentences that occurred across many users and did not identify any user. Also, only frequent words are retained. As a result, verifying model’s quality and debugging is more complex.
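
As a rough illustration of the "only frequent words are retained" step, here is a small Python sketch. The threshold, the `<unk>` placeholder, and the function names are assumptions made up for the example; the quoted passage doesn't describe the paper's actual pipeline at this level of detail.

```python
from collections import Counter

def build_vocab(sentences, min_count=2):
    """Keep only words that appear at least min_count times across the corpus."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    return {w for w, c in counts.items() if c >= min_count}

def redact_rare(sentence, vocab, unk="<unk>"):
    """Replace every word outside the vocabulary with a placeholder token."""
    return " ".join(w if w in vocab else unk for w in sentence.lower().split())

corpus = ["see you at noon", "see you tomorrow", "call Zelda at noon"]
vocab = build_vocab(corpus)
print(redact_rare("call Zelda at noon", vocab))  # <unk> <unk> at noon
```

Rare tokens such as names or account numbers tend to fall below the threshold, which is roughly why this kind of filtering helps with privacy.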


👤 rsync
A different question should be asked:

Do their Terms of Service, etc., allow them to use that text for training models?


👤 BenjaminDyer
I would assume so. All email content is surfaced for serving ads, so why not AI training?

👤 ReflectedImage
If they used text inside Google Docs for training AI models, they would leak confidential customer information in an incredibly obvious way. So I'm guessing no.

👤 Berniek
And a related question: do any or all of these programs use other copyrighted material as training data, and is that a breach of copyright?

👤 whalesalad
I wouldn't be surprised at all. Gotta read the ToS. The entire reason Google went heavy on Gmail, despite it being a fun 10% project initially, was so that they could read your messages and use them to send more targeted ads.

👤 guluarte
Google uses all the data it can for training its models; the only rule is that the data should not trace back to a specific user.

👤 moremetadata
Can you have anti-spam AI models?

👤 manv1
Google says it doesn't use email text for targeting, but there are always "bugs."

👤 lofaszvanitt
They are literally training their future enslaver.

👤 behnamoh
Why would it matter? If you're concerned about your privacy, there are millions of other reasons to avoid Google.