HACKER Q&A
📣 chasely

Best practices for organizing geospatial data from different sources?


I am working on a project for fun that uses GeoTIFF, NetCDF, GeoJSON, and satellite imagery, all as part of the analysis. My ETL process is basically a bunch of scripts that get the data to a place where I can actually run the analyses I'm interested in.

I'd like to make it easier on myself by using a system that can query these different sources, e.g., give me the data within a bounding box (or polygon) for these variables and in the year 2018.

Does such a system exist? Would dumping everything I can into a PostGIS database get me most of the way there? Hoping someone who works with this type of data at scale can provide some insight into best practices.


  👤 aaron-santos Accepted Answer ✓
Loads of questions that might help you find your answer:

What's your re-projection strategy? Are you at liberty to apply the same projection to all of the data in your pipeline? If not (for example, if you're using UTM for rasters), what is the smallest number of CRSs you can get away with?
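
For context, a minimal sketch of the "one common CRS" approach using rasterio and geopandas; the file names and the EPSG:4326 target are placeholders, not a recommendation:

    import geopandas as gpd
    import rasterio
    from rasterio.warp import calculate_default_transform, reproject, Resampling

    TARGET_CRS = "EPSG:4326"  # placeholder target; pick whatever fits your analysis

    # Vectors (GeoJSON, shapefiles, ...) reproject in one call.
    vectors = gpd.read_file("parcels.geojson").to_crs(TARGET_CRS)

    # Rasters get warped band by band into the target CRS.
    with rasterio.open("scene.tif") as src:
        transform, width, height = calculate_default_transform(
            src.crs, TARGET_CRS, src.width, src.height, *src.bounds)
        profile = src.profile.copy()
        profile.update(crs=TARGET_CRS, transform=transform, width=width, height=height)
        with rasterio.open("scene_4326.tif", "w", **profile) as dst:
            for band in range(1, src.count + 1):
                reproject(
                    source=rasterio.band(src, band),
                    destination=rasterio.band(dst, band),
                    src_transform=src.transform, src_crs=src.crs,
                    dst_transform=transform, dst_crs=TARGET_CRS,
                    resampling=Resampling.bilinear)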

How are you going to retrieve data efficiently? For example, do you intend to COG your rasters to enable range reading? Do you intend to pyramid your rasters on ingest so you can pull different zoom levels quickly? If you have a mix of resolutions, do you want to standardize them so that co-registration is easier on the read side?
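
To make the COG/range-read point concrete, a rough sketch with rasterio; the URL, file names, and bounding box are placeholders (creating the COG itself can be as simple as GDAL's COG driver, e.g. gdal_translate -of COG, on GDAL 3.1+):

    import rasterio
    from rasterio.windows import from_bounds

    bbox = (-105.3, 39.9, -105.1, 40.1)  # minx, miny, maxx, maxy in the raster's CRS

    # A COG can be opened remotely; only the byte ranges covering the window get fetched.
    with rasterio.open("https://example.com/scene_cog.tif") as src:
        window = from_bounds(*bbox, transform=src.transform)
        patch = src.read(1, window=window)

        # If overviews were built at ingest, a decimated read is cheap too.
        preview = src.read(1, out_shape=(src.height // 16, src.width // 16))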

Do you want to automate your ETL process and have it run continuously or are you ok with ad-hoc manual runs?

Is there any data filtering you want to apply in your ETL? Cloud removal, special NODATA cases, spatio-temporal filtering?
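
A hedged sketch of that kind of filtering with rasterio and numpy; the band layout, NODATA handling, and the "cloud == 1" flag are placeholders for whatever your sensor's QA band actually encodes:

    import numpy as np
    import rasterio

    with rasterio.open("scene_cog.tif") as src:
        red = src.read(1, masked=True)   # masked array honoring the file's NODATA value
        qa = src.read(src.count)         # assume the last band is a QA/cloud mask

    cloudy = qa == 1                          # placeholder cloud flag
    clean = np.ma.masked_where(cloudy, red)   # NODATA mask and cloud mask combined
    print(f"usable pixels: {clean.count()} of {clean.size}")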

What are your cost, latency, and throughput requirements? Does this project prioritize any of those more than the others?

Source: built a raster/vector ingestion pipeline which I now use for analysis. Contact info in bio if you want to chat more about this.



👤 tcbasche
Having worked on an ingestion pipeline for geospatial data, I’d say PostGIS is more than OK (and possibly an industry leader) for those kinds of queries.

There are other “big data” DBs like Cassandra or Elastic that can handle GIS data, but I’m skeptical they are even necessary until you reach petabytes of data.

A solid indexing strategy will address the majority of your performance concerns: things like simplifying geometry, reducing scans, etc.
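
As a rough sketch of what that looks like in practice (psycopg2 here; the connection string, table, and column names are placeholders):

    import psycopg2

    conn = psycopg2.connect("dbname=gis user=me")  # placeholder connection string
    with conn, conn.cursor() as cur:
        # A GiST index on the geometry column is what makes bounding-box scans cheap.
        cur.execute("CREATE INDEX IF NOT EXISTS parcels_geom_gix "
                    "ON parcels USING GIST (geom);")
        # Optional pre-simplified copy (tolerance in the layer's CRS units) for cheap reads.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS parcels_simple AS
            SELECT id, ST_Simplify(geom, 0.001) AS geom
            FROM parcels;
        """)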


👤 nknealk
As others have said, PostGIS is a great option if it supports everything in your use case. The GiST indexes in particular give you very fast bounding box lookups. You can also do shape intersections and other sophisticated things.
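
For example, a sketch of the kind of query that index accelerates (psycopg2; the table, the observed_at column, and the bounding box are placeholders). The && operator is the index-assisted bounding-box check and ST_Intersects does the exact test; recent PostGIS will use the index for ST_Intersects on its own, but spelling both out shows what the index is doing:

    import psycopg2

    bbox = (-105.3, 39.9, -105.1, 40.1)  # minx, miny, maxx, maxy (EPSG:4326 here)

    conn = psycopg2.connect("dbname=gis user=me")  # placeholder connection string
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT id, ST_AsGeoJSON(geom)
            FROM parcels
            WHERE geom && ST_MakeEnvelope(%s, %s, %s, %s, 4326)
              AND ST_Intersects(geom, ST_MakeEnvelope(%s, %s, %s, %s, 4326))
              AND observed_at BETWEEN '2018-01-01' AND '2018-12-31';
        """, bbox + bbox)
        rows = cur.fetchall()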

I recommend the Boundless Geo tutorial on PostGIS. It does a great job of teaching you about bounding box indexes and all the GIS functions and types.


👤 schoenobates
Might be worth giving QGIS a try. It can work with loads of formats and can be used to run an analysis (as well as make maps).

https://qgis.org