I'd like to make it easier on myself by using a system that can query these different sources, e.g., "give me the data within a bounding box (or polygon) for these variables in the year 2018."
Does such a system exist? Would dumping everything I can to a PostGIS database get me most of the way there? Hoping someone that works with this type of data at scale can provide some insight into best practices.
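Roughly what I'm imagining, assuming I did dump everything into PostGIS (table and column names below are made up, and I'm assuming the geometries are stored in EPSG:4326):

```python
# Sketch of the query shape I want: filter by variable, year, and bounding box.
# Hypothetical table "observations" with columns (variable text, ts timestamptz, geom geometry).
import psycopg2

conn = psycopg2.connect("dbname=gis user=gis")

sql = """
    SELECT variable, ts, ST_AsGeoJSON(geom) AS geojson
    FROM observations
    WHERE variable = ANY(%(vars)s)
      AND ts >= %(start)s AND ts < %(end)s
      AND geom && ST_MakeEnvelope(%(xmin)s, %(ymin)s, %(xmax)s, %(ymax)s, 4326)
"""

params = {
    "vars": ["precipitation", "temperature"],
    "start": "2018-01-01",
    "end": "2019-01-01",
    "xmin": -122.6, "ymin": 37.2, "xmax": -121.8, "ymax": 37.9,
}

with conn, conn.cursor() as cur:
    cur.execute(sql, params)
    rows = cur.fetchall()
```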
What's your re-projection strategy? Are you at liberty to apply the same projection to all of the data in your pipeline? If not (for example, keeping UTM zones for rasters), what is the smallest number of CRSs you can get away with?
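If you can standardize, something like this rasterio recipe is usually enough to normalize everything on ingest (filenames and the EPSG:4326 target here are just placeholders):

```python
# Sketch: reproject a raster to a single common CRS with rasterio.
import rasterio
from rasterio.warp import calculate_default_transform, reproject, Resampling

dst_crs = "EPSG:4326"  # example target; pick whatever your pipeline standardizes on

with rasterio.open("input_utm.tif") as src:
    transform, width, height = calculate_default_transform(
        src.crs, dst_crs, src.width, src.height, *src.bounds
    )
    profile = src.profile.copy()
    profile.update(crs=dst_crs, transform=transform, width=width, height=height)

    with rasterio.open("output_4326.tif", "w", **profile) as dst:
        for band in range(1, src.count + 1):
            reproject(
                source=rasterio.band(src, band),
                destination=rasterio.band(dst, band),
                src_transform=src.transform,
                src_crs=src.crs,
                dst_transform=transform,
                dst_crs=dst_crs,
                resampling=Resampling.nearest,
            )
```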
How are you going to efficiently retrieve data? For example, do you intend to convert your rasters to COGs (Cloud Optimized GeoTIFFs) to enable range reads? Do you intend to build pyramids (overviews) on ingest so you can pull different zoom levels quickly? If you have a mix of resolutions, do you want to standardize them so that co-registration is easier on the read side?
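By pyramiding I mean something like the rasterio overview recipe below (a sketch only; for full COG output you'd typically reach for rio-cogeo or gdal_translate):

```python
# Sketch: add internal overviews (pyramids) to an existing GeoTIFF in place.
# Decimation factors are assumed; tune them to your tiling/zoom scheme.
import rasterio
from rasterio.enums import Resampling

with rasterio.open("scene.tif", "r+") as dataset:
    dataset.build_overviews([2, 4, 8, 16], Resampling.average)
    dataset.update_tags(ns="rio_overview", resampling="average")
```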
Do you want to automate your ETL process and have it run continuously, or are you OK with ad-hoc manual runs?
Is there any data filtering you want to apply in your ETL? Cloud removal, special NODATA cases, spatiotemporal filtering?
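For example, a crude mask step might look like this (assuming a hypothetical QA band where any nonzero pixel means cloud; real QA bands are usually bit-packed, so adapt to what your sensor actually provides):

```python
# Sketch: mask cloudy pixels and normalize NODATA during ETL.
import rasterio

with rasterio.open("scene.tif") as src, rasterio.open("scene_qa.tif") as qa:
    data = src.read(1).astype("float32")
    cloud = qa.read(1) != 0

    nodata = -9999.0                       # output NODATA value, chosen arbitrarily
    data[cloud] = nodata                   # drop cloudy pixels
    if src.nodata is not None:
        data[data == src.nodata] = nodata  # fold the source's NODATA into ours

    profile = src.profile.copy()
    profile.update(dtype="float32", nodata=nodata, count=1)

with rasterio.open("scene_clean.tif", "w", **profile) as dst:
    dst.write(data, 1)
```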
What are your cost, latency, and throughput requirements? Does this project prioritize any of those over the others?
Source: built a raster/vector ingestion pipeline which I now use for analysis. Contact info in bio if you want to chat more about this.
There are other “big data” databases like Cassandra or Elastic that can handle GIS data, but I’m skeptical they’re even necessary until you reach petabytes of data.
Having a solid indexing strategy will address the majority of your performance concerns: things like simplifying geometry, reducing full-table scans, etc.
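For example (table/column names and the simplification tolerance below are made up):

```python
# Sketch: a GiST index on the geometry column plus a pre-simplified copy
# for zoomed-out reads, run against PostGIS via psycopg2.
import psycopg2

conn = psycopg2.connect("dbname=gis user=gis")

with conn, conn.cursor() as cur:
    # Spatial (bounding-box) index so filters like "geom && envelope" avoid full scans.
    cur.execute("CREATE INDEX IF NOT EXISTS parcels_geom_idx ON parcels USING GIST (geom);")

    # Simplified geometry column for coarse/overview queries; 0.001 is an arbitrary tolerance.
    cur.execute("""
        ALTER TABLE parcels ADD COLUMN IF NOT EXISTS geom_simple geometry;
        UPDATE parcels SET geom_simple = ST_SimplifyPreserveTopology(geom, 0.001);
        CREATE INDEX IF NOT EXISTS parcels_geom_simple_idx ON parcels USING GIST (geom_simple);
    """)
```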
I recommend the Boundless Geo tutorial on PostGIS. It does a great job of teaching you about bounding box indexes and all the GIS functions and types.