HACKER Q&A
📣 imperio59

Where to find open data sets to play around with to learn ML?


I'm learning ML and looking for more open datasets to use, especially in the area of recommender/ranker systems.

I'm already familiar with Kaggle, but I'm wondering what else is out there.


👤 JoeyBananas Accepted Answer ✓
In a few applications of ML that I've worked with, there is no need for an outside dataset because the program generates its own data. For example, the data could come from a simulation of some process.
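As a rough illustration of the simulation idea applied to the recommender case from the question, here is a minimal sketch that invents hidden "taste" vectors for users and items and emits noisy 1-5 star ratings. Everything in it (the `simulate_ratings` name, the latent-factor model, the noise level, the default sizes) is an assumption made up for the example, not something from a particular project.

```python
import random


def simulate_ratings(n_users=50, n_items=20, n_factors=3, n_samples=500, seed=0):
    """Return a list of (user_id, item_id, rating) rows from a toy simulation.

    Hypothetical model: each user and item gets a random latent vector,
    and a rating is their dot product plus noise, squashed to 1..5.
    """
    rng = random.Random(seed)  # seeded so the dataset is reproducible

    # Hidden vectors that a learned model would later try to recover.
    users = [[rng.gauss(0, 1) for _ in range(n_factors)] for _ in range(n_users)]
    items = [[rng.gauss(0, 1) for _ in range(n_factors)] for _ in range(n_items)]

    rows = []
    for _ in range(n_samples):
        u = rng.randrange(n_users)
        i = rng.randrange(n_items)
        affinity = sum(a * b for a, b in zip(users[u], items[i]))
        noisy = affinity + rng.gauss(0, 0.5)  # assumed observation noise
        rating = min(5, max(1, round(3 + noisy)))  # clamp to a 1..5 star scale
        rows.append((u, i, rating))
    return rows


if __name__ == "__main__":
    for row in simulate_ratings()[:5]:
        print(row)
```

Because you control the hidden vectors, you can check whether a model recovers structure you know is there, which is hard to do with a real dataset.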

👤 mindcrime
https://datasets.reddit.com

https://opendata.reddit.com

https://archive.ics.uci.edu/ml/datasets.php

https://lod-cloud.net/

https://www.data.gov

https://data.un.org/

https://data.worldbank.org/

https://fred.stlouisfed.org/

https://data.oecd.org/

https://www.nber.org/research/data?page=1&perPage=50

https://github.com/awesomedata/awesome-public-datasets

https://github.com/datasets

https://opendata.cern.ch/

https://data.nasa.gov/

https://data.world/datasets/machine-learning

https://data.noaa.gov/datasetsearch/

https://www.usgs.gov/products/data

https://www.fema.gov/about/openfema/data-sets

etc...

And of course, don't ignore the data you can collect yourself one way or another. With a few cheap Arduino Nano or Raspberry Pi Pico boards and some sensors, you can build quite a variety of distributed data collection systems. Use solar panels for power in remote areas and 4G / cellular data networks for connectivity, and you can get data from all over the place. You can also use a cheap SDR "dongle" to pull down data from various weather satellites and other sources. And don't forget about the APIs and data export mechanisms for apps you might use, like Fitbit, Strava, MapMyRun, etc.