- A platform that includes an execution engine like Spark or another option for creating datasets (akin to Databricks).
- A system similar to Dagster, Airflow, or Prefect for building DAGs of datasets.
- A setup supporting multiple projects or repositories where users can easily define their transformations and version their code (similar to Git).
- A feature for managing data lineage (similar to what Palantir offers) to control access to datasets, requiring access to all upstream datasets for a given dataset.
- A granular access management system for individual datasets.
While the individual components exist, Palantir's value comes from integrating these elements. An architecture combining these tools, along with a simple frontend for interaction, might suffice.
The ultimate aim is to empower business teams within an organization to independently develop "data products" and provide these products to other teams for further development into more complex data products.
Are you aware of such a project, did you face similar issues in your work or is anyone interested in discussing these topics further?