How to avoid sensitive data becoming part of LLM training data?
How are you making sure your sensitive data and PII don't become part of LLM training data?
When the training data is small, manual verification is possible. But when the data is huge, is there a way to mask or filter out PII and other sensitive data?
At the orgs I contribute to, we use data security posture management (DSPM) tools to filter sensitive data out of the model-training ingest pipeline. It's mostly regex-style string matching, but also heuristics and context-aware tagging of data. A rough sketch of the regex layer is below (the patterns, placeholder format, and `mask_pii` helper are illustrative only, not any particular DSPM product's behavior):
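```python
import re

# Illustrative patterns only; real DSPM filtering layers heuristics and
# context-aware tagging on top of simple string matching like this.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
}

def mask_pii(text: str) -> str:
    """Replace each matched span with a typed placeholder, e.g. <EMAIL>."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

record = "Contact Jane at jane.doe@example.com or 555-867-5309; SSN 123-45-6789."
print(mask_pii(record))
# -> Contact Jane at <EMAIL> or <PHONE>; SSN <SSN>.
```
Records that trip the filter can either be masked like this and kept, or dropped from the training set entirely, depending on how sensitive the field is.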
There are a number of PII filters and libraries available on GitHub.
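Microsoft Presidio is one widely used open-source example. A minimal sketch of running its analyzer and anonymizer over a record (assuming `presidio-analyzer`, `presidio-anonymizer`, and a spaCy English model are installed) looks like this:

```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()       # combines NER, regex, and checksum recognizers
anonymizer = AnonymizerEngine()   # replaces detected spans with placeholders

text = "Call Jane Doe at 555-867-5309 or email jane.doe@example.com"

# Detect PII entities (names, phone numbers, emails, ...) in the record
results = analyzer.analyze(text=text, language="en")

# Default behavior substitutes entity-type placeholders such as <PERSON>
anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
print(anonymized.text)
```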