I'm a trained software engineer, and spent most of my career programming applications, of various kinds: mobile apps, websites, backends... About a year ago, I got a new role that requires me to sift through quite a lot of data and extract numbers from it. Basically, we have a big SQL database full of events, and I'm responsible for building reports based on those events.
That's a lot of fun! The database has grown to a size where it actually becomes a challenge to run those big aggregations, and I must tweak my queries carefully. 100% recommend to anyone interested in... well going through a lot of stuff.
However there's one aspect of it that's a lot less fun: I find it hard to get a "yes, those numbers are correct.".
At first, I assumed I could simply write unit tests, reproduce the data that we should have in the production database, and verify my numbers on smaller datasets.
The reality, however, is far from that. While the events respect established schemas, some were lost. Others didn't arrive in order. Some are duplicated, and others seem related but actually are not. Also, due to the sheer volume of records, I can't just start counting them manually.
I realized, on multiple occasions, that some queries were incorrect. Of course the resulting numbers were incorrect too. But we didn't notice that immediately, and some reports had to be recalled/corrected. No harm was done, but I did lose some sleep over it.
So there's that. While I very much enjoy working with that data, I wish I could find a way to validate my work, and finally be able to click on a "resolve" button knowing that my results truly are correct.
Is there any magic trick I'm missing? Any tool that could help me? Any workflow that would be my data unicorn?
I was a DS for a couple years in NYC area. My advise is the same as in science, treat the act of verification as the end goal itself. If you understand your experimental apparatus, in this case, how the DB is populated then you understand all subsequent of the "physics". Anything you can't understand should be carefully quantified as random and systematic errors.
Sorry I don't have any magic bullet or standard recipe for how to go about it. Every case is different.