HACKER Q&A
📣 ThePhysicist

Database for storing machine learning data?


I'm looking for a database that can efficiently store and retrieve a very large number (billions) of structured datapoints for use in machine learning. Each datapoint can have an arbitrary number of categorical and numerical attributes and belong to one or more datasets.

I want to be able to quickly (ideally within a few seconds at most, even for result sets of 1,000–1,000,000 datapoints) select the datapoints of a given dataset and optionally filter them by attribute values, e.g. with queries like "give me all datapoints belonging to dataset A for which x < 4.5 AND category = 'test' AND event_date >= '2009-04-10'". Once written, datapoints will not change, though I would like to attach additional information to specific datapoints (e.g. test results or extra labels); that could live in a separate data structure or table.
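In SQL terms, assuming a datapoints table plus a dataset_members join table for the many-to-many membership (both names are just illustrations), that query would look roughly like:

    SELECT d.*
    FROM datapoints d
    JOIN dataset_members m ON m.datapoint_id = d.id
    WHERE m.dataset_id = 'A'
      AND d.x < 4.5
      AND d.category = 'test'
      AND d.event_date >= DATE '2009-04-10';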

Right now I'm solving this using a simple PostgreSQL database with auxiliary index tables, but I'm looking for more scalable alternatives.
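Schematically, something like this (simplified, illustrative names; the fixed columns stand in for the attributes from the example query, and truly arbitrary attributes would need e.g. a jsonb column or a separate attribute table):

    CREATE TABLE datapoints (
        id         bigserial PRIMARY KEY,
        x          double precision,
        category   text,
        event_date date
    );

    -- many-to-many membership: a datapoint can belong to several datasets
    CREATE TABLE dataset_members (
        dataset_id   text   NOT NULL,
        datapoint_id bigint NOT NULL REFERENCES datapoints (id),
        PRIMARY KEY (dataset_id, datapoint_id)
    );

    -- auxiliary index so attribute filters don't require a full scan
    CREATE INDEX ON datapoints (category, event_date);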

I've considered software like Cassandra or ClickHouse, but I'm not sure they fit my use case well. Do you have any recommendations, or have you built such a system in your own work and can share some ideas or guidance? Thanks!


  👤 pachico Accepted Answer ✓
I use ClickHouse for analytics and managed to ingest up to 5 million rows per second on very modest hardware. I stopped there, but with multiple parallel ingest jobs I might achieve even more. Queries and exports are very fast too. At the moment I can't think of anything better for this. Let me know if you need more info.
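For your access pattern, a MergeTree table along these lines would be a reasonable starting point (an untested sketch: columns are taken from your example query, and dataset membership is denormalized to one row per dataset/datapoint pair, which is common in ClickHouse):

    CREATE TABLE datapoints
    (
        dataset_id  LowCardinality(String),
        x           Float64,
        category    LowCardinality(String),
        event_date  Date
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(event_date)
    -- sorting by dataset_id first makes "all datapoints of dataset A" a contiguous range read
    ORDER BY (dataset_id, category, event_date);

With that sort key, the dataset and category filters are served by the sparse primary index rather than a full scan.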

👤 tarun_anand
Define "quickly": one minute, one hour?

What is the downstream use? To train, label?