HACKER Q&A
📣 lost_sqler

Data Scientists / Statisticians of HN, how do you verify your numbers?


Hi everyone. I'm looking for guidance on how to verify the correctness of my work.

I'm a trained software engineer, and I've spent most of my career programming applications of various kinds: mobile apps, websites, backends... About a year ago, I moved into a new role that requires me to sift through quite a lot of data and extract numbers from it. Basically, we have a big SQL database full of events, and I'm responsible for building reports based on those events.

That's a lot of fun! The database has grown to a size where running those big aggregations actually becomes a challenge, and I have to tweak my queries carefully. 100% recommended to anyone interested in... well, going through a lot of stuff.

However, there's one aspect of it that's a lot less fun: I find it hard to get to a definitive "yes, those numbers are correct."

At first, I assumed I could simply write unit tests: reproduce the data we should have in the production database, and verify my numbers on smaller datasets.
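
Concretely, what I pictured was something like this: build a tiny hand-checked fixture, run the report's aggregation over it, and have the query return rows only on failure. (A minimal sketch; the table and column names, the expected counts, and the PostgreSQL-flavored syntax are all made up for illustration.)

    WITH fixture (event_id, user_id, event_type, created_at) AS (
        VALUES
            (1, 10, 'signup',   TIMESTAMP '2024-01-01 09:00:00'),
            (2, 10, 'purchase', TIMESTAMP '2024-01-01 09:05:00'),
            (3, 11, 'signup',   TIMESTAMP '2024-01-02 14:30:00')
    ),
    actual AS (
        -- the same aggregation the real report runs, but over the fixture
        SELECT event_type, COUNT(*) AS n
        FROM fixture
        GROUP BY event_type
    )
    -- hand-counted expectations: 2 signups, 1 purchase.
    -- Any row returned here is a test failure.
    SELECT event_type, actual.n AS actual_n, expected.n AS expected_n
    FROM actual
    FULL OUTER JOIN (VALUES ('signup', 2), ('purchase', 1))
        AS expected (event_type, n) USING (event_type)
    WHERE actual.n IS DISTINCT FROM expected.n;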

The reality, however, is far from that. While the events respect established schemas, some were lost. Others didn't arrive in order. Some are duplicated, and others seem related but actually are not. Also, due to the sheer volume of records, I can't just start counting them manually.
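
For what it's worth, probes like the ones below at least surface some of that mess. (Another sketch: the events table and columns like event_id, created_at, ingested_at, and the per-source source_id/seq pair are assumptions about a schema like ours, not something universal.)

    -- duplicated: ids that should be unique but aren't
    SELECT event_id, COUNT(*) AS copies
    FROM events
    GROUP BY event_id
    HAVING COUNT(*) > 1;

    -- out of order: event time runs backwards relative to ingestion order
    SELECT event_id, created_at, prev_created_at
    FROM (
        SELECT event_id, created_at,
               LAG(created_at) OVER (ORDER BY ingested_at) AS prev_created_at
        FROM events
    ) t
    WHERE created_at < prev_created_at;

    -- lost: gaps in a per-source sequence number, if producers attach one
    SELECT source_id, seq + 1 AS first_missing
    FROM (
        SELECT source_id, seq,
               LEAD(seq) OVER (PARTITION BY source_id ORDER BY seq) AS next_seq
        FROM events
    ) t
    WHERE next_seq > seq + 1;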

I realized, on multiple occasions, that some queries were incorrect, and of course the resulting numbers were incorrect too. We didn't notice it immediately, and some reports had to be recalled/corrected. No harm was done, but I did lose some sleep over it.

So there's that. While I very much enjoy working with that data, I wish I could find a way to validate my work and finally be able to click a "resolve" button knowing that my results truly are correct.

Is there any magic trick I'm missing? Any tool that could help me? Any workflow that would be my data unicorn?


  👤 euix Accepted Answer ✓
The trick is to realize that the act of verifying the accuracy of the data "is" the science. I come from a high-energy physics background. People think the point is to find sexy new particles or make groundbreaking discoveries. No: the real goal is to understand the inner workings of every part of your data acquisition pipeline (from the hardware detector to the software analysis) so well that, when you spot a deviation, you can eliminate every other possible explanation. That, then, is your discovery.

I was a DS for a couple of years in the NYC area. My advice is the same as in science: treat the act of verification as the end goal itself. If you understand your experimental apparatus, in this case how the DB is populated, then you understand all of the subsequent "physics". Anything you can't understand should be carefully quantified as random and systematic error.
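
To make "quantify the error" slightly more concrete in a database setting: compute the same quantity along two independent paths and treat the disagreement as your measurement error. A sketch (the tables, columns, and PostgreSQL-flavored syntax here are hypothetical, not from the OP's schema):

    -- daily active users derived from raw events vs. from session records;
    -- the relative gap between the two is an empirical error bar
    WITH by_events AS (
        SELECT created_at::date AS day,
               COUNT(DISTINCT user_id) AS active_users
        FROM events
        GROUP BY created_at::date
    ),
    by_sessions AS (
        SELECT started_at::date AS day,
               COUNT(DISTINCT user_id) AS active_users
        FROM sessions
        GROUP BY started_at::date
    )
    SELECT day,
           e.active_users AS from_events,
           s.active_users AS from_sessions,
           ABS(e.active_users - s.active_users)::float
               / NULLIF(s.active_users, 0) AS relative_gap
    FROM by_events e
    JOIN by_sessions s USING (day)
    ORDER BY relative_gap DESC;

If the gap is stable over time you can report it as a systematic error; if it jumps around, that's your random error, and a sudden change in it usually means a pipeline bug rather than anything real in the data.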

Sorry, I don't have a magic bullet or standard recipe for how to go about it. Every case is different.