HACKER Q&A
📣 monkeydust

De-identifying data without destroying quality


We are exploring how to aggregate a significant amount of data from different clients for use in a machine learning model.

The output of the model will be shared with all clients.

The clients require that the model only sees data that has been stripped of sensitive information. We could crudely remove the columns we consider sensitive, but this would hurt the model's performance.

Has anyone got experience or thoughts on how to approach this?

Any software, open-source or not, that could help?

Thanks, MD


  👤 jklein11 Accepted Answer ✓
Do you need the data to be anonymized or just de-identified?

I know this won't sit well with privacy-minded folks, but if you just need the data de-identified and not anonymized, you could pick the fields that might contain sensitive data and do a character-for-character swap. This way you retain the information without storing the personal information in raw form.
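
A minimal sketch of what such a character-for-character swap might look like, assuming a fixed substitution table built from a secret seed. The names `build_swap_table` and `swap_field` and the seed value are made up for illustration. Letters map to letters and digits to digits, so field formats survive, and equal inputs always give equal outputs. Note that a fixed substitution like this can be reversed with frequency analysis, which is why it only counts as de-identification.

```python
import random
import string


def build_swap_table(seed: int) -> dict:
    """Build a one-to-one character mapping from a secret seed (hypothetical helper)."""
    rng = random.Random(seed)
    mapping = {}
    # Shuffle each character class separately so case and digit positions are kept.
    for alphabet in (string.ascii_lowercase, string.ascii_uppercase, string.digits):
        shuffled = list(alphabet)
        rng.shuffle(shuffled)
        mapping.update(zip(alphabet, shuffled))
    return str.maketrans(mapping)


def swap_field(value: str, table: dict) -> str:
    """Swap every mapped character; anything else (spaces, punctuation) passes through."""
    return value.translate(table)


# The seed plays the role of a key: keep it out of the shared data set.
TABLE = build_swap_table(seed=12345)
masked = swap_field("Alice Smith 4417", TABLE)
print(masked)
```

Because the mapping is deterministic, the same client name or account number is masked to the same string everywhere, so joins and group-bys on the masked column still work.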


👤 2rsf
What do you need to anonymize? What type of system are you asking about? Do you need to comply with GDPR?

Names, addresses, ID numbers, and account numbers can be easily randomized. Dates and numbers (what kind of system is it?) are trickier, since they are used in calculations. Finally, the hard part is making sure that the anonymized data can't be traced back to real entities.
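
A rough sketch of both cases, under some assumptions of mine: identifiers are replaced with keyed-hash pseudonyms (one way to "randomize" them consistently), and dates are shifted by a fixed per-entity offset so that intervals within one entity stay usable in calculations. The key, the function names `pseudonym` and `shift_date`, and the 30-day window are all hypothetical choices, not something from the question.

```python
import hashlib
import hmac
from datetime import date, timedelta

# Hypothetical secret; in practice it would live outside the shared data set.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonym(value: str, prefix: str = "id") -> str:
    """Replace an identifier with a stable token derived from a keyed hash."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}_{digest[:12]}"


def shift_date(d: date, entity_id: str, max_days: int = 30) -> date:
    """Shift a date by an offset in [-max_days, max_days] that is fixed per entity."""
    digest = hmac.new(
        SECRET_KEY, b"date:" + entity_id.encode("utf-8"), hashlib.sha256
    ).digest()
    offset = int.from_bytes(digest[:4], "big") % (2 * max_days + 1) - max_days
    return d + timedelta(days=offset)


record = {"account": "ACC-001", "opened": date(2020, 1, 15), "closed": date(2020, 3, 1)}
masked = {
    "account": pseudonym(record["account"], "acct"),
    "opened": shift_date(record["opened"], record["account"]),
    "closed": shift_date(record["closed"], record["account"]),
}

# Both dates of the same account move by the same offset, so durations are preserved.
print(masked["closed"] - masked["opened"] == record["closed"] - record["opened"])
```

This does nothing about the last point: rare combinations of the remaining columns can still single out a real entity, so the masked data would need a separate re-identification check.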