What did we learn?

As we are facing a boom of remotely sensed data sources, distributed processing of large datasets is required in the field of GIS. The Python programming language (together with the GDAL API) and Dispy (a distributed computing library for Python) have provided the flexibility and functionality for us to create a new processing framework prototype. NDVI calculation for Landsat 8 imagery was a good test method; in future work we are going to widen the set of available raster processing tools to implement a real-life application.

CONCLUSIONS

To summarize, according to the technical survey of the existing computing frameworks, there is no existing complete solution for distributed raster data processing that would fulfil our requirements, formulated as the base questions at the beginning of our study. Based on the results, we can state that a new distributed processing framework for raster datasets needs to be implemented, one which allows users to set different decomposition techniques related to the application area and the applied GIS method. The new distributed processing framework should provide the ability to supervise the distribution method among processing nodes. The first results for the proof-of-concept distributed computing system look promising, because we have been able to reduce the processing time by more than half compared to the commercial ERDAS IMAGINE 2011 and 2014. In future work we plan further system development and a widening of the processing test cases, as well as enhanced coverage of user requirements.

References:

A. Dasgupta, "Big Data: the future is in analytics," Geospatial World Magazine, April 2013.
D. Jewell et al., "Performance and Capacity Implications for Big Data," IBM RedBook, IBM Corp., 2014, pp. 7-20.
D. Kristóf, R. Giachetta, A. Olasz, B. Nguyen Thai, "Big geospatial data processing and analysis developments in the IQmulus project," Proceedings of the 2014 Conference on Big Data from Space (BiDS'14), ISBN 978-92-79-43252-1, DOI 10.2788/1823.
J.-G. Lee and M. Kang, "Geospatial Big Data: Challenges and Opportunities," Big Data Research, Vol. 2, Issue 2: Visions on Big Data, June 2015, pp. 74-81.
K. Kambatla et al., "Trends in big data analytics," Journal of Parallel and Distributed Computing, 74(7), 2014, pp. 2561-2573.

RESULTS

NDVI = (NIR - VIS) / (NIR + VIS)

Study area

We have realized that between grid sizes 2x2 and 3x3 there is no difference in decomposition time; however, after increasing the grid to 5x5, decomposition time increased substantially. Our first experiment measured the average runtime by running the same NDVI calculation on three processing units on 3x3 grids consecutively. Our second experiment determined whether there is any relation between decomposition grid size and runtime; therefore, we decomposed the input data into 4x4 grids and ran the NDVI calculation again. Our third experiment measured the processing time on the same dataset with the commercial software ERDAS IMAGINE 2011 and 2014 on an average workstation with 16 GB of RAM and a 4-core CPU. The first run is based on the ERDAS internal data type (IMG), the second on GeoTIFF. The input data of the model were band 4 and band 5 of Landsat 8 imagery without any pre-processing. The model applies virtual stacking of the imagery and then calculates the NDVI for each pixel; the output is GeoTIFF and IMG. The results of comparing the processing time on the commercial system vs. the developed environment look promising.
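The per-pixel NDVI step of the model can be sketched as follows. This is a minimal NumPy version, not the prototype's actual service; in practice the band arrays would come from `gdal.Open(...).GetRasterBand(n).ReadAsArray()`, and the values below are illustrative:

```python
import numpy as np

def ndvi(nir, vis):
    """Per-pixel NDVI = (NIR - VIS) / (NIR + VIS), guarding against
    division by zero where both bands are zero."""
    nir = nir.astype(np.float64)
    vis = vis.astype(np.float64)
    denom = nir + vis
    out = np.zeros_like(denom)
    # Compute only where the denominator is non-zero; elsewhere keep 0
    np.divide(nir - vis, denom, out=out, where=denom != 0)
    return out

# Illustrative 2x2 band values (not real Landsat 8 data)
nir = np.array([[0.5, 0.4], [0.8, 0.0]])
vis = np.array([[0.1, 0.4], [0.2, 0.0]])
print(ndvi(nir, vis))
```

The zero-denominator guard matters for real imagery, where nodata borders contain zeros in both bands.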

As a proof of concept, we have developed a GeoTIFF decomposition demo application in the Python programming language using the GDAL API. The Dispy distributed computing library has been chosen to run existing and newly implemented services in a distributed manner. The architecture has been implemented at a prototype level, which allowed us to process datasets and evaluate the performance of the prototype against ERDAS IMAGINE. For evaluation purposes, we used a simple normalized difference vegetation index (NDVI) calculation, but over a large volume of data. The implemented decomposer takes a source file or directory, looks for GeoTIFF files and decomposes them into smaller regular grids. After decomposition has been successfully performed, the decomposed data are transferred to the computation nodes. We set up three client machines as processing units, with 4 GB of RAM and 2 CPUs each, and a master unit acting as server. The server unit is responsible for decomposing the original data into smaller grids and distributing them among the processing units. To test our data decomposition application, we selected 36 Landsat 8 images including NIR and VIS bands, covering the area of Hungary. Each of them is approximately 128 MB, resulting in a total of 4.6 GB of data. We measured the time of data decomposition for the 36 Landsat 8 images into 2x2, 3x3, 4x4, 5x5 and 6x6 grids on a dual-core PC with 4 GB of RAM.
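The windowing logic behind such a grid decomposer can be sketched as follows. For an N x M grid, each tile is a pixel window (offset and size); with GDAL, each window would then be read via `band.ReadAsArray(xoff, yoff, xsize, ysize)` and written to its own tile file. This is a simplified illustration, not the project's actual decomposer:

```python
def tile_windows(width, height, nx, ny):
    """Split a width x height raster into an nx x ny grid of pixel
    windows (xoff, yoff, xsize, ysize). Edge tiles absorb the
    remainder when the image size is not divisible by the grid."""
    windows = []
    for j in range(ny):
        yoff = j * (height // ny)
        ysize = height // ny if j < ny - 1 else height - yoff
        for i in range(nx):
            xoff = i * (width // nx)
            xsize = width // nx if i < nx - 1 else width - xoff
            windows.append((xoff, yoff, xsize, ysize))
    return windows

# A 3x3 grid over a 7801x7111 raster (a typical Landsat 8 band size)
wins = tile_windows(7801, 7111, 3, 3)
print(len(wins))  # 9 tiles
```

Because the windows partition the raster exactly, a per-pixel operation such as NDVI can be run independently on each tile and the results mosaicked back without overlap handling.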

FIGURE 2: Bounding boxes of Landsat 8 images (band 4) over Hungary; distribution of sub-raster blocks in the region-based processing approach; resulting raster of the NDVI calculation

Our questions concerning technical limitations:

1. What if we do not want to rewrite our algorithms and services to the Map-Reduce programming model, but would like to use the advantages of a distributed computing environment?
2. What if we would like to implement new algorithms and services using a language other than Java?
3. What if we would like to implement our own data partitioning methods, depending on the nature of the processing algorithm?
4. What if we would like to control the sequence of data distribution among the nodes?
5. What if we would like to send functions, modules or classes to process data, without writing a wrapper application for every function?
6. What if we would like to run pre-installed executables on processing nodes without parameterizing mapper and reducer executables?
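Question 5 is what the prototype addresses with Dispy, which can ship a plain Python function to remote nodes. As a purely local, single-machine stand-in for that pattern (Dispy itself requires a running cluster), the sketch below distributes one function over a list of data partitions with the standard library's `concurrent.futures`; the function and partition names are illustrative, not part of the prototype:

```python
from concurrent.futures import ThreadPoolExecutor

def process_partition(tile):
    """Placeholder per-tile computation (stands in for e.g. an NDVI
    job that Dispy would execute on a remote processing node)."""
    name, pixels = tile
    return name, sum(pixels) / len(pixels)

# Illustrative partitions, as a decomposer might produce them
tiles = [("tile_0_0", [1, 2, 3]), ("tile_0_1", [4, 5, 6])]

# The function itself is submitted over the partitions -- no
# per-function wrapper application, which is the point of question 5
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(process_partition, tiles))
print(results)  # [('tile_0_0', 2.0), ('tile_0_1', 5.0)]
```

With Dispy the same idea applies across machines: a job cluster is created from the function, and the decomposed tiles are submitted to it as jobs.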

METHODS

OBJECTIVES

Our goal is to find a solution for processing big geospatial data in a distributed ecosystem, providing an environment to run algorithms, services and processing modules without limitations on programming language, data partitioning strategy, or distribution among computational nodes. As a first step, we focus on data decomposition and distributed processing.

FIGURE 1: Data Life Cycle in the aspect of transforming information to knowledge (according to R. Piechowski, 2010)

Progress and innovation are no longer hindered by the ability to collect data. Nowadays we are in the third quadrant of the Data Life Cycle (Figure 1), the transition of information to knowledge: the collection of and access to different kinds of data is no longer a problem. We are struggling with the variety of input data (different in format, scale, source, accuracy, resolution, acquisition technique, date, purpose, processing environment, etc.) to be pre-processed and analysed in order to derive and provide different geospatial data and information as outputs. Another important aspect to consider is that data are commonly not located on the same platform, in the same representation or in the same data format, even within a single institution. Moreover, data are continuously and dynamically changing over time. “These unique circumstances result in several performance and capacity challenges. Most big data projects face more challenges from variety and fewer from data volumes” (D. Jewell et al., 2014). Processing of geospatial big data can be time-consuming and difficult. In data processing, different aspects are to be considered, such as speed, precision or timeliness, all depending on data types and processing methods.

What is distributed processing and why use it in GIS?

A distributed system is a collection of computers within the same network, working together as one larger computer. Massive computational power and storage capacity are gained thanks to this architecture. In GIS applications, we want to ensure: efficiency of data access and dissemination, deriving and processing geographic information efficiently; interoperability, which is impeded by heterogeneous hardware, software and data environments; and security (and veracity).

This research has been partially funded by the 7th Framework Programme of the European Union, in the frame of the „IQmulus” project. IQmulus (full name: A High-volume Fusion and Analysis Platform for Geospatial Point Clouds, Coverages and Volumetric Data Sets) is a 4-year Integrating Project (IP) partially funded by the European Commission under the Grant Agreement FP7-ICT-2011-318787.

It is positioned in the area of Intelligent Information Management within the ICT 2011.4.4 Challenge 4: Technologies for Digital Content and Languages.

www.iqmulus.eu

RASTER DATA PARTITIONING FOR SUPPORTING DISTRIBUTED GIS PROCESSING

GeoBigdata'2015 Submission 235

In the geospatial sector, the “Big Data” concept already has an impact. Several techniques and methods originating from computer science are being increasingly applied for processing huge amounts of geospatial data. Some research studies consider geospatial data to always have been “Big Data” (Lee and Kang, 2015). Lately, data acquisition methods have improved substantially, yielding tremendous increases both in the amount and in the resolution of raw data in spectral, spatial and temporal dimensions. A significant portion of big data is geospatial data, and the size of such data is growing rapidly, by at least 20% every year (Dasgupta, 2013). With this rapid increase of raw data volumes collected in different formats, representations and for multiple purposes, the information derived from these data is considered the most valuable outcome. However, computing capacity and processing speed have strong limitations, especially if semi-automatic or automatic procedures are aimed at complex geospatial data (Kristóf et al., 2014). Recently, distributed computing has reached many interdisciplinary areas of computer science, including remote sensing and geographic information processing. The Map-Reduce programming model and distributed file systems have proven their capabilities to process non-geospatial Big Data. But sometimes it is inconvenient or inefficient to rewrite existing algorithms to the Map-Reduce programming model; also, GIS data cannot be partitioned like text-based data, by line or by bytes. Hence, we would like to find an alternative solution for data partitioning, data distribution and execution of existing algorithms without rewriting existing code, or with only minor modifications. This paper focuses on a technical overview of currently available distributed computing environments, as well as geospatial (raster) data partitioning, distribution and distributed processing of GIS algorithms.
A proof-of-concept implementation has been made to demonstrate the above-mentioned concept. The first performance results have been compared against the commercial software ERDAS IMAGINE 2011 and 2014. Partitioning methods heavily depend on application areas; therefore, we may consider data partitioning as a pre-processing step before applying processing services to the data. As a proof of concept, we have implemented a simple tile-based partitioning method, splitting an image into smaller grids (N x M tiles), and compared the processing time to existing methods on the example of NDVI calculations. The concept is demonstrated using an open source processing framework developed by the authors.
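The runtime comparison itself needs only a simple wall-clock harness around each run; a minimal sketch of such a helper is below (the workload is a stand-in, since the real measurement would wrap the distributed NDVI run, with the ERDAS runs timed externally):

```python
import time

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Stand-in workload in place of the distributed NDVI service
result, elapsed = timed(sum, range(1_000_000))
print(result, round(elapsed, 3))
```

`time.perf_counter()` is monotonic and high-resolution, which makes it preferable to `time.time()` for benchmarking short runs.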

INTRODUCTION