HACKER Q&A
📣 deepakthakur

How to avoid sensitive data being part of LLM training data?


How are you making sure your sensitive data and PII don't become part of LLM training data? When the training data is small, manual verification is possible. But when the dataset is huge, is there a way to mask or filter out PII and sensitive data?


  👤 toomuchtodo Accepted Answer ✓
At the orgs I contribute to, we use data security posture management (DSPM) tools to filter sensitive data from model training ingest. It's mostly regex-style string matching, but also heuristics and context-aware tagging of data.
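
For anyone who wants a feel for what regex-plus-heuristic filtering looks like, here is a minimal Python sketch. It is not what any particular DSPM product does; the pattern set, the placeholder tags, and the `mask_pii` and `luhn_ok` names are all assumptions made up for illustration. It masks emails and US-style SSNs by regex, and only masks a card-like digit run if it passes a Luhn checksum (a simple heuristic to cut false positives).

```python
import re

# Hypothetical, deliberately small pattern set; real tools ship far broader rules.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    "CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}


def luhn_ok(digits: str) -> bool:
    """Heuristic: return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def mask_pii(text: str) -> str:
    """Replace matched PII with placeholder tags (assumed tag names)."""

    def card_sub(match: re.Match) -> str:
        digits = re.sub(r"[ -]", "", match.group())
        # Leave the text alone unless the checksum says it looks like a card.
        return "[CARD]" if luhn_ok(digits) else match.group()

    text = PII_PATTERNS["EMAIL"].sub("[EMAIL]", text)
    text = PII_PATTERNS["SSN"].sub("[SSN]", text)
    text = PII_PATTERNS["CARD"].sub(card_sub, text)
    return text


if __name__ == "__main__":
    sample = "Contact jane@example.com, SSN 123-45-6789, card 4111 1111 1111 1111."
    print(mask_pii(sample))
```

Run over each record before it reaches the training set. Regexes like these will miss names, addresses, and anything free-form, which is where the context-aware tagging comes in.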

👤 incogitor
There are a number of PII filters and libraries available on GitHub.