realtime planetary-scale datacube fusion › ... · realtime planetary-scale datacube fusion p....

2
Realtime Planetary-Scale Datacube Fusion P. Baumann 1 1 Jacobs University, Bremen, Germany [email protected] Commission VI, WG VI/4 KEY WORDS: datacube, fusion, distributed processing, query planning, rasdaman 1. INTRODUCTION The datacube model has rapidly gained acceptance as a cornerstone for analysis-ready data, and also the corresponding service model which is more powerful, but easier to use than existing API-based interfaces. Mature and widely adopted datacube standards show the way how datacube functionality can be presented to clients, be they human or m2m connections. Specifically, the OGC Coverage Implementation Schema (CIS) data model and the Web Coverage Service (WCS) service model suite define a framework adopted by the main open- source as well as proprietary tools, including MapServer, GeoServer, GDAL, QGIS, ArcGIS, and python/OWSlib. While the basic, most widely used functionality includes access, possibly extraction, and reformatting of a data sub(set) for download, processing and in particular data fusion represent complex, resource-intensive challenges. In the WCS suite this is addressed through a concept of modularity where WCS Core (which is mandatory for any WCS to implement) offers the GetCoverage request to accomplish “give me part of this spatio- temporal coverage, in my favourite format” and (optionally implementable) extensions add further functionality facets, up to the spatio-temporal datacube analytics language, Web Coverage Processing Service (WCPS) (Baumann, 2010). For example, computing the NVDI over Europe on the 1 st of July 2018 using Sentinel data stored in a datacube with the same name, and returning the result as a NetCDF file, can be achieved with the following WCPS query: for $c in (Sentinel) return encode( (((float) $c.nir – $c.red) / ((float) $c.nir + $c.red)) [ Lat(35:70), Long(10:40), time(“2018-07-01”) ], “image/tiff” ) As WCPS allows combination of datasets (“joins” in database terminology) the question arises: what if datacubes to be combined reside on different computers (cloud scenario), or even in different data centers (federation scenario)? A suboptimal implementation obviously could cause massively degraded performance. In this contribution we present federation methods implemented in the rasdaman datacube engine which is accepted technology leader, standards driver, and reference implementation for data- cubes. 2. RASDAMAN ARCHITECTURE The rasdaman engine resembles a fully-fledged Array Database System which utilizes a declarative query language enhancing standard SQL with high-level array operators; in fact, ISO in 2018 has adopted this language as an extension to the SQL language standard (ISO, 2019). This datacube language is domain agnostic and can serve as well medical imagery, cosmologic simulation results, and the like. A semantic layer on top of it offers geo semantics, i.e., it knows about spatial and temporal coordinates and, hence, also about regular and irregular grids. This semantics is offered via OGC WCS, WCPS, and WMS interfaces so that a plethora of clients can utilize massive datacubes managed by rasdaman. All such requests are translated to array SQL queries internally. In the server, such incoming queries are optimized, parallelized, and ultimately mapped to datacubes that are tiled on disk (or tape). As opposed to standard regular tiling (i.e., equi-sized tiles) rasdaman allows for arbitrary tiling guided by several strategies (Baumann et al., 2010). Tiling remains invisible to the applications (and, hence, in the queries), it is a tuning para- meter for the administrator. Tiles of the same datacube can even sit on different machines. The rasdaman server is multi-parallel per se, allowing any number of simultaneous parallel requests (inter-query parallelization) as well as parallel execution of incoming queries (intra-query parallelization), see (Dumitru, 2014). Datacube operations are usually triggered via standardized web services, such as WCS, WMS and WCPS. However, these services are only the interface to the client performing the operation, while the processing itself is handled at a lower level which enables optimized, efficient execution. In rasdaman, the geo layer intercepting the web requests representing datacube operations translates them into array database queries, which are then handled by rasdaman’s query processing engine. The engine analyses each incoming query and compiles it into machine code. During that process, a series of optimization steps is applied, one of which is distribution. Such federations are being done routinely meantime (Baumann, 2017). In the EarthServer project, Petabyte-scale datacubes have been federated across ECMWF/UK and NCI/Australia. Another real-life use-case of the optimization is given in the federation which was set up between the German Sentinel hub, CODE-DE, established in the BigDataCube project (BigData- Cube project team, 2019), and the Alfred Wegener Institute for Polar and Marine Research. Among others, temperature data- cubes are available at CODE-DE, and SeaIce datacubes are available at AWI. In one of the use-cases, scientists need to compute the sea ice distribution at different temperature inter- The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLII-3/W11, 2020 PECORA 21/ISRSE 38 Joint Meeting, 6–11 October 2019, Baltimore, Maryland, USA This contribution has been peer-reviewed. https://doi.org/10.5194/isprs-archives-XLII-3-W11-15-2020 | © Authors 2020. CC BY 4.0 License. 15

Upload: others

Post on 29-Jun-2020

12 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Realtime Planetary-Scale Datacube Fusion › ... · Realtime Planetary-Scale Datacube Fusion P. Baumann 1 1 Jacobs University, Bremen, Germany p.baumann@jacobs-university.de Commission

Realtime Planetary-Scale Datacube Fusion

P. Baumann 1

1 Jacobs University, Bremen, [email protected]

Commission VI, WG VI/4

KEY WORDS: datacube, fusion, distributed processing, query planning, rasdaman

1. INTRODUCTION

The datacube model has rapidly gained acceptance as acornerstone for analysis-ready data, and also the correspondingservice model which is more powerful, but easier to use thanexisting API-based interfaces. Mature and widely adopteddatacube standards show the way how datacube functionalitycan be presented to clients, be they human or m2m connections.Specifically, the OGC Coverage Implementation Schema (CIS)data model and the Web Coverage Service (WCS) servicemodel suite define a framework adopted by the main open-source as well as proprietary tools, including MapServer,GeoServer, GDAL, QGIS, ArcGIS, and python/OWSlib.

While the basic, most widely used functionality includes access,possibly extraction, and reformatting of a data sub(set) fordownload, processing and in particular data fusion representcomplex, resource-intensive challenges. In the WCS suite this isaddressed through a concept of modularity where WCS Core(which is mandatory for any WCS to implement) offers theGetCoverage request to accomplish “give me part of this spatio-temporal coverage, in my favourite format” and (optionallyimplementable) extensions add further functionality facets, upto the spatio-temporal datacube analytics language, WebCoverage Processing Service (WCPS) (Baumann, 2010).

For example, computing the NVDI over Europe on the 1st ofJuly 2018 using Sentinel data stored in a datacube with thesame name, and returning the result as a NetCDF file, can beachieved with the following WCPS query:

for $c in (Sentinel) return encode( (((float) $c.nir – $c.red) / ((float) $c.nir + $c.red)) [ Lat(35:70), Long(10:40), time(“2018-07-01”) ], “image/tiff” )

As WCPS allows combination of datasets (“joins” in databaseterminology) the question arises: what if datacubes to becombined reside on different computers (cloud scenario), oreven in different data centers (federation scenario)? Asuboptimal implementation obviously could cause massivelydegraded performance.

In this contribution we present federation methods implementedin the rasdaman datacube engine which is accepted technologyleader, standards driver, and reference implementation for data-cubes.

2. RASDAMAN ARCHITECTURE

The rasdaman engine resembles a fully-fledged Array DatabaseSystem which utilizes a declarative query language enhancingstandard SQL with high-level array operators; in fact, ISO in2018 has adopted this language as an extension to the SQLlanguage standard (ISO, 2019). This datacube language isdomain agnostic and can serve as well medical imagery,cosmologic simulation results, and the like. A semantic layer ontop of it offers geo semantics, i.e., it knows about spatial andtemporal coordinates and, hence, also about regular andirregular grids. This semantics is offered via OGC WCS,WCPS, and WMS interfaces so that a plethora of clients canutilize massive datacubes managed by rasdaman. All suchrequests are translated to array SQL queries internally.

In the server, such incoming queries are optimized, parallelized,and ultimately mapped to datacubes that are tiled on disk (ortape). As opposed to standard regular tiling (i.e., equi-sizedtiles) rasdaman allows for arbitrary tiling guided by severalstrategies (Baumann et al., 2010). Tiling remains invisible tothe applications (and, hence, in the queries), it is a tuning para-meter for the administrator. Tiles of the same datacube can evensit on different machines.

The rasdaman server is multi-parallel per se, allowing anynumber of simultaneous parallel requests (inter-queryparallelization) as well as parallel execution of incomingqueries (intra-query parallelization), see (Dumitru, 2014).

Datacube operations are usually triggered via standardized webservices, such as WCS, WMS and WCPS. However, theseservices are only the interface to the client performing theoperation, while the processing itself is handled at a lower levelwhich enables optimized, efficient execution. In rasdaman, thegeo layer intercepting the web requests representing datacubeoperations translates them into array database queries, whichare then handled by rasdaman’s query processing engine.

The engine analyses each incoming query and compiles it intomachine code. During that process, a series of optimizationsteps is applied, one of which is distribution.

Such federations are being done routinely meantime (Baumann,2017). In the EarthServer project, Petabyte-scale datacubeshave been federated across ECMWF/UK and NCI/Australia.Another real-life use-case of the optimization is given in thefederation which was set up between the German Sentinel hub,CODE-DE, established in the BigDataCube project (BigData-Cube project team, 2019), and the Alfred Wegener Institute forPolar and Marine Research. Among others, temperature data-cubes are available at CODE-DE, and SeaIce datacubes areavailable at AWI. In one of the use-cases, scientists need tocompute the sea ice distribution at different temperature inter-

The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLII-3/W11, 2020 PECORA 21/ISRSE 38 Joint Meeting, 6–11 October 2019, Baltimore, Maryland, USA

This contribution has been peer-reviewed. https://doi.org/10.5194/isprs-archives-XLII-3-W11-15-2020 | © Authors 2020. CC BY 4.0 License.

15

Page 2: Realtime Planetary-Scale Datacube Fusion › ... · Realtime Planetary-Scale Datacube Fusion P. Baumann 1 1 Jacobs University, Bremen, Germany p.baumann@jacobs-university.de Commission

vals. The WCPS query that achieves that is presented below.The datacubes involved contain data at different resolutions andin different coordinate systems, which gets adjusted in thecourse of query processing. Most recently, FCU/Taiwain hasbeen federated with CODE-DE.

Figure 1. Sample query splitting and distribution in rasdaman

Figure 2. FCU/CODE-DE federation query path: query sent toCODE-DE from Singapore, subquery generated for FCU

Figure 3. FCU/CODE-DE federation query path: query sent toFCU from Singapore, subquery generated for CODE-DE

3. CONCLUSION

In summary, rasdaman allows federated processing ofdeclarative queries whereby complete location transparency isgiven: any federation member can receive the query, and thefederation will dynamically orchestrate each query individuallyfor optimized processing.

Figure 4. FCU/CODE-DE federation query path: query sent toFCU from Singapore, subquery generated for CODE-DE

ACKNOWLEDGEMENTS

This work is being supported by H2020 LandSupport, H2020EOSC-hub, and German BMWi BigDataCube.

REFERENCES

Baumann, P., 2010: The OGC Web Coverage ProcessingService (WCPS) Standard. Geoinformatica, 14(4)2010, pp 447-479

Baumann, P., et al. 2010: Putting Pixels in Place: A StorageLayout Language for Scientific Data. Proc. IEEE ICDM Work-shop on Spatial and Spatiotemporal Data Mining, Sydney,Australia, pp. 194 - 201

Dumitru, A. et al., 2014: Exploring cloud opportunities from anarray database perspective. Proc. ACM SIGMOD DanaC,Snowbird, USA, pp. 1 - 4

ISO, 2019: Information technology — Database languages —SQL — Part 15: Multi-Dimensional Arrays. ISO IS 9075-15:2019

N.n., 2019: BigDataCube. http://www.bigdatacube.org/

P. Baumann, A.P. Rossi,et al., 2017: Fostering Cross-Disciplin-ary Earth Science Through Datacube Analytics. In. P.P.Mathieu, C. Aubrecht (eds.): Earth Observation Open Scienceand Innovation - Changing the World One Pixel at a Time,International Space Science Institute (ISSI), pp. 91 – 119

The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLII-3/W11, 2020 PECORA 21/ISRSE 38 Joint Meeting, 6–11 October 2019, Baltimore, Maryland, USA

This contribution has been peer-reviewed. https://doi.org/10.5194/isprs-archives-XLII-3-W11-15-2020 | © Authors 2020. CC BY 4.0 License.

16