How to avoid sensitive data becoming part of LLM training data?
How are you making sure your sensitive data and PII don't become part of LLM training data?
When the training data is small, manual verification is possible. But when the data is huge, is there a way to mask or filter out PII and other sensitive data?
At the orgs I contribute to, we use data security posture management (DSPM) tools to filter sensitive data out of the model-training ingest pipeline. It's mostly regex-style string matching, but also heuristics and context-aware tagging of data. A rough sketch of the regex layer is below (the patterns, placeholder format, and `mask_pii` helper are illustrative only, not any particular DSPM product's behavior):
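```python
import re

# Illustrative patterns only; real DSPM filtering layers heuristics and
# context-aware tagging on top of simple string matching like this.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
}

def mask_pii(text: str) -> str:
    """Replace each matched span with a typed placeholder, e.g. <EMAIL>."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

record = "Contact Jane at jane.doe@example.com or 555-867-5309; SSN 123-45-6789."
print(mask_pii(record))
# -> Contact Jane at <EMAIL> or <PHONE>; SSN <SSN>.
```
Records that trip the filter can either be masked like this and kept, or dropped from the training set entirely, depending on how sensitive the field is.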
There are a number of PII filters and libraries available on GitHub.
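Microsoft Presidio is one widely used open-source example. A minimal sketch of running its analyzer and anonymizer over a record (assuming `presidio-analyzer`, `presidio-anonymizer`, and a spaCy English model are installed) looks like this:

```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()       # combines NER, regex, and checksum recognizers
anonymizer = AnonymizerEngine()   # replaces detected spans with placeholders

text = "Call Jane Doe at 555-867-5309 or email jane.doe@example.com"

# Detect PII entities (names, phone numbers, emails, ...) in the record
results = analyzer.analyze(text=text, language="en")

# Default behavior substitutes entity-type placeholders such as <PERSON>
anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
print(anonymized.text)
```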