HACKER Q&A
📣 aazo11

What datasets (public or proprietary) do you use on a regular basis?


What are the pain points (if any), and what tools do you use to pipe in, analyze, and present the data?


  👤 data_dan_ Accepted Answer ✓
I use a lot of U.S. government data sources (EPA climate data; BLS employment statistics; etc.). I also use a fair amount of international greenhouse gas emissions data, such as from the UNFCCC greenhouse gas inventory datasets.

Pain points: data disappearing, moving, or being updated without notice and without any indication of the change. When the same API endpoint or URL starts returning different numbers with no explanation, it is an unwelcome surprise.
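One way to at least notice silent updates is to fingerprint each download and compare it with the last one seen. The following is only a rough sketch, not something the answer describes: the function names and the JSON file used as a store are hypothetical, and it assumes the raw bytes of the download are already in hand.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(payload: bytes) -> str:
    """Return the SHA-256 hex digest of the raw downloaded bytes."""
    return hashlib.sha256(payload).hexdigest()


def check_for_change(source: str, payload: bytes, store: Path) -> bool:
    """Record the fingerprint for `source`.

    Returns True only if a previous fingerprint exists and differs
    from the current one. `store` is a hypothetical local JSON file.
    """
    seen = json.loads(store.read_text()) if store.exists() else {}
    digest = fingerprint(payload)
    changed = source in seen and seen[source] != digest
    seen[source] = digest
    store.write_text(json.dumps(seen, indent=2))
    return changed
```

A hash only says that something changed, not what changed, so it complements keeping a dated copy of the data rather than replacing it.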

I use bit.io (https://bit.io; I work there) to deal with these problems. It's an online PostgreSQL database that is very easy to use with, e.g., psycopg2 or SQLAlchemy in Python, or DBI + dbplyr in R. Before any analysis, I copy the necessary data over to a repo/schema in bit.io, fill in the documentation with the dates on which I obtained the data, and use that copy as the source of "ground truth" for the analysis.
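The copy-and-document step can be sketched roughly as follows. This is an illustration under stated assumptions, not bit.io's API: it uses Python's built-in sqlite3 as a local stand-in so it runs without credentials, and the `snapshot` function, the table names, the URL, and the rows are all made up. With psycopg2 against PostgreSQL the same pattern applies, but the placeholders are `%s` rather than `?`.

```python
import sqlite3
from datetime import date


def snapshot(conn, table, rows, source_url, retrieved_on):
    """Copy rows into `table` and log where and when they were obtained.

    `table` is interpolated into SQL, so it must be a trusted identifier.
    The (year, value) column layout is an assumption for this example.
    """
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE IF NOT EXISTS snapshot_log "
        "(table_name TEXT, source_url TEXT, retrieved_on TEXT)"
    )
    cur.execute(f"CREATE TABLE IF NOT EXISTS {table} (year INTEGER, value REAL)")
    cur.executemany(f"INSERT INTO {table} (year, value) VALUES (?, ?)", rows)
    cur.execute(
        "INSERT INTO snapshot_log VALUES (?, ?, ?)",
        (table, source_url, retrieved_on.isoformat()),
    )
    conn.commit()


# In-memory database as a stand-in; rows and URL are placeholders, not real data.
conn = sqlite3.connect(":memory:")
snapshot(
    conn,
    "emissions",
    [(2019, 1.0), (2020, 2.0)],
    "https://example.com/data.csv",
    date(2024, 1, 15),
)
```

Every later query then reads from the copied table, and `snapshot_log` records which source and retrieval date the analysis depends on.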