[ieee 2014 third international conference on agro-geoinformatics - beijing, china...

BigGIS: How Big Data Can Shape Next-Generation GIS

Peng Yue, Liangcun Jiang

State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS)

Wuhan University, Wuhan, China

[email protected]

Abstract: The emergence of the “Big Data” concept is changing the way data are managed and analyzed. Traditional GIS software is limited in dealing with big data challenges, including versatile data forms, stream processing, large-scale parallel computing, and dynamic mapping and visualization. Significant improvement is needed, which can result in next-generation GIS. This paper gives an overview of recent methods for supporting big data management and analysis in the geospatial domain. First, it motivates the need to develop and use bigdata-aware GIS software. By reviewing advanced information technologies and approaches, it then assesses which operational system frameworks and approaches are available and applicable in developing bigdata-enabled next-generation GIS - BigGIS. Key considerations for the development of BigGIS are highlighted. The results can help identify critical issues and direct the future research agenda for next-generation GIS.

Keywords: Big Data; GIS; High Performance Computing; Stream Processing; NoSQL

I. INTRODUCTION

Geographical Information Systems (GIS) have been widely employed in spatial problem-solving and decision-making since their emergence in the 1960s [1, 2, 3]. From a technological perspective, the development of GIS as a software tool is closely tied to the advancement of information technology. As a result, several representative GIS architectures have appeared over the past half century (Fig. 1) [4-9].

1) Desktop GIS: In this traditional architecture, geospatial data and computation are managed in a standalone environment like a desktop computer. Data and programs are held by a centralized software repository.

Fig. 1. GIS evolution and future trends.

2) Client-Server GIS: This architecture is enabled by Web technologies and adopted by many implementations of Web GIS or Internet GIS. Depending on how GIS functions are assigned to the client or the server, implementations can be categorized as “thin-client, fat-server” or “fat-client, thin-server” approaches [4, 5].

3) Component-based GIS: Component-based programming allows GIS software or applications to be built by integrating reusable software components from different developers. For example, among the different coupling strategies (isolated, loose, close, and integrated) for linking GIS with other spatial analysis software packages, component-based software development can help overcome the disadvantages of close coupling in terms of overhead in code creation [6].

4) Service-oriented GIS: As Service Computing technologies have been widely employed in the past decade, GIS takes advantage of Web Services to support spatial data infrastructures or cyberinfrastructure.

As we enter the era of “Big Data”, existing information systems, including GIS, face significant challenges in dealing with big data. Big data is often characterized by its “4Vs”: volume, velocity, variety, and value. Conventional systems focus on structured data management and processing, are often limited to static data, and cannot meet the demands of big data analysis. As a result, there has been increasing development of information technologies targeted at big data management and analysis, such as the MapReduce computing paradigm, stream processing, and NoSQL technologies [10, 11]. The particular emphasis is on supporting knowledge discovery through big data analytics.

A new bigdata-aware GIS is necessary to meet the demand for massive geospatial data management, processing, analysis, and visualization. The exponentially growing volume of spatio-temporal data (e.g. remotely sensed data, GPS trajectories, and video streams), along with its 4Vs characteristics, brings new challenges to every aspect of GIS. First, GIS needs to be extended to accommodate dynamic observations from sensors, including Volunteered Geographic Information (VGI). Second, new data models and indexing algorithms need to be developed to store and access unstructured, multidimensional, and dynamic data. Third, the computing paradigm calls for innovation to meet the demands of stream processing, real-time analysis, and information extraction from large-scale datasets. Fourth, novel methods in mapping and visualization should be studied to dynamically display, analyze, and simulate geographical phenomena and processes. Finally, big data mining and analysis technologies deserve further research to achieve the transformation from data to information to knowledge.

The paper provides an overview of bigdata-enabled next-generation GIS - BigGIS, by addressing coordinated sensing, distributed storage, parallel processing, dynamic visualization, and spatial-temporal analysis of big data. Related concepts, technologies and research directions are investigated to ground this new concept. The remainder of this paper is structured as follows. Section II provides the definition of BigGIS. Related work on big data technologies is described in Section III. Section IV proposes key considerations of BigGIS and possible solutions. Section V concludes the paper.

II. WHAT IS BIGGIS

The bigdata-enabled next-generation GIS, the so-called BigGIS, is described from the perspective of data characteristics. It attempts to tackle big data challenges with advanced information technologies, such as the Internet of Things, distributed data storage, high performance computing, and stream processing. It thus shares some technological aspects with Web GIS, Cloud GIS, and CyberGIS. However, BigGIS is a comprehensive combination of existing work to support big data analytics and knowledge discovery. Therefore, we describe next-generation GIS as follows:

BigGIS is a new comprehensive product of GIS development in the “Big Data” era, which is used to sense, store, integrate, process, and visualize big geospatial data. BigGIS covers conceptual, methodological, technical, and managerial issues in solving big data challenges. It is primarily characterized by coordinated observation, distributed storage, parallel processing, high performance computation, dynamic visualization, and efficient geospatial analysis and knowledge discovery.

The development of BigGIS will meet the demands of big data management, big data processing, big data visualization, and big data analytics. It enables users to manipulate high volumes of spatiotemporal data in a timely manner and to derive knowledge from heterogeneous data sources. Bigdata-aware GIS software will support a wider range of applications in multiple domains for problem-solving and decision-making.

III. BIG DATA TECHNOLOGIES

To address big data challenges, fundamental innovations have taken place in data models, computing paradigms, and even infrastructure. Meanwhile, a variety of technologies have been developed for data management, processing, analysis, and visualization. Some prominent technological advances include NoSQL, cloud computing, stream processing, and big data analytics.

A. Internet of Things and Sensor Web

Advances in Internet of Things (IoT) and Sensor Web technologies contribute to the sources of big data. These two sensing services increase the capacity for high-velocity data capture and improve access to information, which are fueling the big data trend [12]. They will continue to play an indispensable role in the big data era by providing massive on-demand observations through ubiquitous sensor devices deployed anywhere.

B. NoSQL Databases

In highly concurrent, large-scale data access environments, relational database management systems (RDBMS) are inadequate to meet the continuously increasing demands of big data storage and query. RDBMS also face challenges when dealing with unstructured data. NoSQL databases are increasingly used in big data applications for their simplicity, scalability, and high performance. Mainstream NoSQL databases (TABLE I) include key-value databases (e.g. Redis [13]), column-oriented databases (e.g. HBase, Google's Bigtable [14]), document databases (e.g. MongoDB [15], CouchDB [16]), and graph databases (e.g. Neo4j [17]) [18].

TABLE I. MAINSTREAM NOSQL DATABASES

Data Model        Mainstream Databases
Key-value         Dynamo, Redis, Riak
Column-oriented   Bigtable, Cassandra, HBase, Hypertable
Document          CouchDB, MongoDB, XML database
Graph             AllegroGraph, Neo4j, InfiniteGraph
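To make the four data models in TABLE I more concrete, the sketch below lays out one sensor observation in each of them, using plain Python structures as stand-ins for real databases. The observation values, key formats, and the helper `find_docs` are illustrative assumptions and do not correspond to the API of any product listed above.

```python
import json

# A single, made-up sensor observation used in all four layouts.
observation = {
    "sensor_id": "s-17",
    "time": "2014-06-20T08:00:00Z",
    "lon": 114.36,
    "lat": 30.53,
    "temperature_c": 27.4,
}

# Key-value model: the value is an opaque blob that is retrieved by key only.
kv_key = "obs:s-17:2014-06-20T08:00:00Z"
kv_store = {kv_key: json.dumps(observation)}

# Document model: a nested, self-describing record whose fields can be queried.
doc_store = [
    {
        "_id": 1,
        "sensor_id": "s-17",
        "time": "2014-06-20T08:00:00Z",
        "location": {"type": "Point", "coordinates": [114.36, 30.53]},
        "temperature_c": 27.4,
    }
]

# Column-oriented model: row key -> column family -> column -> value.
row_key = "s-17#2014-06-20T08:00:00Z"
column_store = {
    row_key: {
        "loc": {"lon": 114.36, "lat": 30.53},
        "obs": {"temperature_c": 27.4},
    }
}

# Graph model: nodes with properties plus typed edges between them.
graph_store = {
    "nodes": {"s-17": {"kind": "sensor"}, "wuhan": {"kind": "city"}},
    "edges": [("s-17", "LOCATED_IN", "wuhan")],
}


def find_docs(store, field, value):
    """Return all documents whose top-level field equals the given value."""
    return [doc for doc in store if doc.get(field) == value]
```

The layouts differ mainly in what can be queried: the key-value store can only be read by its key, whereas the document store can be filtered by field, for example `find_docs(doc_store, "sensor_id", "s-17")`.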

C. Cloud Computing

Cloud computing is a new computing paradigm built on distributed computing, parallel computing, high performance computing, and grid computing. Cloud computing provides elastic and cost-effective computing and storage resources as a utility [19], as well as ubiquitous services including Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Software as a Service (SaaS) [20], Database as a Service (DBaaS) [21], and Network as a Service (NaaS) [22]. Cloud computing technologies have proven to be an effective and scalable means for big data storage, big data processing, and big data analytics [23-25].

D. Parallel Processing

Parallel processing is an efficient solution for processing huge volumes of data in a timely manner. Executing parallel analysis algorithms over distributed computer processors enables efficient computation [26]. Different paradigms are available for parallel computing: Open Multi-Processing (OpenMP), Message Passing Interface (MPI), MapReduce, and GPU-based approaches, which can take advantage of multi-core hardware, graphics cards, and clusters. MapReduce is a popular parallel processing model used to handle data processing in distributed systems, and it has been implemented across a wide range of Google applications [27, 28]. Apache Hadoop is an open-source implementation of MapReduce in the Java programming language [29]. Hadoop is also available in the cloud through services such as Microsoft Azure and Amazon EC2 (Elastic Compute Cloud) / S3 (Simple Storage Service).
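The MapReduce model can be illustrated with a small, single-process sketch that counts point observations per grid cell. The map, shuffle, and reduce phases are written as ordinary Python functions; the function names and the grid-cell scheme are assumptions made for illustration and are not the Hadoop API. In a real deployment the map and reduce calls would run on different cluster nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Point = Tuple[float, float]  # (longitude, latitude)
Cell = Tuple[int, int]       # index of a square grid cell


def map_point(point: Point, cell_size: float = 1.0) -> Iterator[Tuple[Cell, int]]:
    """Map phase: emit (grid cell, 1) for one point."""
    lon, lat = point
    yield (int(lon // cell_size), int(lat // cell_size)), 1


def shuffle(pairs: Iterable[Tuple[Cell, int]]) -> Dict[Cell, List[int]]:
    """Shuffle phase: group all emitted values by their key."""
    groups: Dict[Cell, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_cell(cell: Cell, counts: List[int]) -> Tuple[Cell, int]:
    """Reduce phase: sum the values collected for one grid cell."""
    return cell, sum(counts)


def count_points_per_cell(points: Iterable[Point], cell_size: float = 1.0) -> Dict[Cell, int]:
    """Run map, shuffle, and reduce in sequence on a single machine."""
    mapped = (pair for p in points for pair in map_point(p, cell_size))
    return dict(reduce_cell(cell, values) for cell, values in shuffle(mapped).items())
```

Because each map call depends only on its own point and each reduce call only on its own cell, both phases can be distributed without coordination beyond the shuffle step.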

E. Stream Processing

Stream processing is an effective programming paradigm for processing big data, especially streaming data. Unlike traditional computing approaches, stream processing engines read data directly from software or hardware sensors in stream form, rather than from databases. Although the input data are theoretically infinite ordered sequences, stream processing algorithms operate only on a finite set of the most recent data records to support real-time or near real-time analytics. Many mature implementations have been developed, including Storm [30] and S4 (Simple Scalable Streaming System) [31].
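The idea of operating only on a finite set of the latest records can be sketched with a sliding window over a potentially unbounded stream. The function below is a minimal illustration using the Python standard library; it is not the programming model of Storm or S4, and the name `sliding_mean` is an assumption.

```python
from collections import deque
from typing import Iterable, Iterator


def sliding_mean(stream: Iterable[float], window_size: int) -> Iterator[float]:
    """Yield the mean of the most recent `window_size` readings after each new one."""
    window = deque(maxlen=window_size)  # the oldest reading is dropped automatically
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)
```

Since the function is a generator, it never needs the whole stream in memory: memory use is bounded by `window_size` regardless of how many readings arrive.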

F. Big Data Analytics

As data keep flooding in at an unprecedented rate, traditional data analytics cannot operate effectively on such large volumes of raw data. Accordingly, powerful analytic algorithms, methods, and software have been developed to extract valuable information from big data for decision making, such as product pricing, business promotion, and market planning. Big data analytics covers a collection of related methods and technologies, including predictive analytics, statistical modeling, natural language processing, social computing, data mining, machine learning, text analytics, Web analytics, network analytics, in-memory databases, and advanced visualization [32, 33].

IV. KEY CONSIDERATIONS FOR BIGGIS AND POSSIBLE SOLUTIONS

The primary goal of BigGIS is to support real-time access to streaming sensor data, rapid publishing and querying of spatiotemporal data, and knowledge discovery from noisy data. Fundamental innovations are needed at every level, from the underlying storage and computing infrastructure to the front-end service architecture. The following research directions need particular attention for next-generation GIS to succeed.

A. Coordinated Observation

Coordinated observation is key to enhancing the problem-solving capability of bigdata-enabled GIS, because information extraction and knowledge discovery usually require data collected from multiple platforms such as air-, land-, and water-based sensors. Synergistic use of Earth Observation data makes it possible to address many previously unanswerable scientific issues, such as global climate change, Earth processes, and rapid response to natural disasters. A series of scientific issues require further research, including multi-sensor coordination mechanisms, optimized observing models, integrated sensor systems (space-, air-, land-, and ocean-based systems), heterogeneous sensor data assimilation, and synergistic processing methods.

B. Big Geospatial Data Management

To meet the demands of geospatial big data management, key research issues include spatiotemporal data models, computing-oriented distributed data management, and spatiotemporal indexing (Fig. 2).

Spatiotemporal data model

A universal data model is needed to support not only static spatial data and attribute data, but also various unstructured spatiotemporal data (e.g. in-situ sensor data sequences, GPS trajectories, and video streams). The novel data model should take into account the spatial, temporal, scale, and semantic characteristics of big data. The data model can be organized hierarchically as theme (or layer), object, state, and data.
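One possible reading of the theme, object, state, and data hierarchy is sketched below with Python dataclasses. The class names, fields, and the `record` and `latest_state` helpers are hypothetical and serve only to illustrate how time-stamped states could be attached to objects within a theme; the paper does not prescribe a concrete schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional


@dataclass
class State:
    """One time-stamped state of an object, carrying its data payload."""
    timestamp: datetime
    data: Dict[str, Any]


@dataclass
class GeoObject:
    """A geographic object (hypothetical name) with a history of states."""
    object_id: str
    states: List[State] = field(default_factory=list)

    def latest_state(self) -> Optional[State]:
        """Return the state with the newest timestamp, or None if there is none."""
        return max(self.states, key=lambda s: s.timestamp, default=None)


@dataclass
class Theme:
    """A theme (or layer) that groups objects of the same kind."""
    name: str
    objects: Dict[str, GeoObject] = field(default_factory=dict)

    def record(self, object_id: str, timestamp: datetime, data: Dict[str, Any]) -> None:
        """Append a new state to an object, creating the object on first use."""
        obj = self.objects.setdefault(object_id, GeoObject(object_id))
        obj.states.append(State(timestamp, data))
```

A GPS trajectory, for instance, would be one object in a trajectory theme, with each position fix stored as a state.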

Distributed data management

To achieve distributed data storage and hierarchical management, appropriate data partitioning schemes need to be investigated according to the diverse data types and their spatial and temporal adjacency. BigGIS shall support both internal and external storage to meet the demands of real-time analysis.
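As a minimal sketch of partitioning by spatial and temporal adjacency, the code below assigns each record to a partition identified by a grid cell and a time bucket, so that records close in space and time tend to be stored together. The cell size, bucket length, and function names are assumptions chosen for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[float, float, float]  # (longitude, latitude, epoch seconds)
PartitionKey = Tuple[int, int, int]  # (cell column, cell row, time bucket)


def partition_key(lon: float, lat: float, epoch_s: float,
                  cell_deg: float = 0.5, bucket_s: int = 3600) -> PartitionKey:
    """Derive a partition key from the grid cell and time bucket of a record."""
    return (
        math.floor(lon / cell_deg),
        math.floor(lat / cell_deg),
        int(epoch_s // bucket_s),
    )


def partition(records: Iterable[Record],
              cell_deg: float = 0.5, bucket_s: int = 3600) -> Dict[PartitionKey, List[Record]]:
    """Group records by partition key; each group could be placed on one storage node."""
    parts: Dict[PartitionKey, List[Record]] = defaultdict(list)
    for lon, lat, epoch_s in records:
        parts[partition_key(lon, lat, epoch_s, cell_deg, bucket_s)].append((lon, lat, epoch_s))
    return dict(parts)
```

A fixed grid is the simplest choice; skewed data would likely call for adaptive schemes, which are outside the scope of this sketch.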

Spatiotemporal indexing

Batch indexing methods, dynamic updating algorithms, multi-dimensional joint queries, and spatiotemporal query optimization algorithms are the key means for rapid indexing and retrieval.
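One widely used way to obtain a single sortable key from several dimensions is a Z-order (Morton) code, which interleaves the bits of the discretized coordinates. The sketch below applies it to two spatial cell indices and one time index; it is offered as an example of a multi-dimensional index key, not as the indexing method proposed in this paper, and the function names are assumptions.

```python
from typing import Tuple


def interleave3(x: int, y: int, t: int, bits: int = 10) -> int:
    """Interleave the lowest `bits` bits of x, y, and t into one Morton code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)      # bit i of x goes to position 3i
        code |= ((y >> i) & 1) << (3 * i + 1)  # bit i of y goes to position 3i + 1
        code |= ((t >> i) & 1) << (3 * i + 2)  # bit i of t goes to position 3i + 2
    return code


def deinterleave3(code: int, bits: int = 10) -> Tuple[int, int, int]:
    """Recover (x, y, t) from a Morton code produced by interleave3."""
    x = y = t = 0
    for i in range(bits):
        x |= ((code >> (3 * i)) & 1) << i
        y |= ((code >> (3 * i + 1)) & 1) << i
        t |= ((code >> (3 * i + 2)) & 1) << i
    return x, y, t
```

Such codes can serve as row keys in an ordered key-value or column-oriented store, so that a range scan over the key covers a compact block of space and time.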

C. Parallel Geocomputation Framework

A parallel processing framework makes big data processing feasible. Fig. 3 shows a vision of the parallel geocomputation framework in BigGIS. Aiming to provide a universal parallel framework for geoprocessing tasks, the framework encompasses a multi-granularity parallel model, a parallel geocomputation interface, a parallel geocomputation algorithm library, and a geocomputation modelling tool. Depending on the complexity of the geoprocessing tasks, parallel algorithms can be executed on multicore CPUs, GPUs, or clusters. The parallelization of algorithms and computing resources can be supported by a hybrid parallel architecture that integrates diverse parallel programming models and cluster systems. These distributed computational and storage resources can be scheduled using mainstream cluster managers, such as YARN and Mesos.
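At the finest granularity, a geoprocessing task can be parallelized by splitting a raster into tiles, processing the tiles concurrently, and merging the partial results. The sketch below does this for a raster mean with the standard-library `concurrent.futures` module. The function names and the row-wise tiling are illustrative assumptions, and a thread pool is used only to keep the example self-contained; CPU-bound work in CPython would normally use a process pool or a cluster framework instead.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Raster = List[List[float]]


def split_rows(raster: Raster, tile_rows: int) -> List[Raster]:
    """Split a raster into tiles of at most `tile_rows` consecutive rows."""
    return [raster[i:i + tile_rows] for i in range(0, len(raster), tile_rows)]


def tile_stats(tile: Raster) -> Tuple[float, int]:
    """Compute the partial result for one tile: (sum of cells, number of cells)."""
    values = [v for row in tile for v in row]
    return sum(values), len(values)


def parallel_mean(raster: Raster, tile_rows: int = 2, workers: int = 4) -> float:
    """Compute the raster mean by processing tiles concurrently and merging results."""
    tiles = split_rows(raster, tile_rows)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(tile_stats, tiles))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count
```

The same split, process, and merge structure carries over to coarser granularities, where each tile would be handled by a separate cluster node rather than a thread.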

Fig. 2. A roadmap for geospatial big data management.

Fig. 3. Parallel geocomputation framework.

D. Dynamic Visualization of Big Geospatial Data

Conventional visualization approaches are insufficient for highly concurrent charting and high-performance rendering of heterogeneous, real-time geospatial data. Dynamic visualization is an effective way not only to present spatiotemporal information in large amounts of data but also to support complex analysis, information extraction, and knowledge discovery. Several key issues need to be addressed to achieve dynamic visualization, including real-time modeling and rendering, self-adaptive visualization, and dynamic symbol-based map making. High-concurrency strategies, including data streamlining, multi-channel parallelism, and data parallelism, can support on-the-fly visualization modeling and rendering of 3D data and streaming data. A self-adaptive strategy should take into account not only servers, clients, and network environments, but also users’ behaviors and changing scenes to achieve interactive visualization in real time. The key to dynamic symbol systems is to add dynamic features to the variables of legacy map symbols, such as color, illumination, and texture.

E. Knowledge Discovery and Intelligent Service

Knowledge discovery often requires access to and integration of heterogeneous data, domain semantics and knowledge, and parallel mining algorithms. To achieve efficient geospatial analysis and knowledge discovery, research should cover data mining, process modeling, and service technologies (Fig. 4). Typical data mining algorithms are designed for small-scale data mining tasks and are usually deployed on a single desktop computer. Big data mining, in contrast, needs to aggregate diverse data sources and run on parallel computing nodes. Static knowledge discovery methods are not applicable to dynamic streaming data. Therefore, effective technologies are needed to support data stream mining. Process modeling technologies are also required to provide services over big data, including spatiotemporal process metamodels, script mapping, process planning and control, and evaluation and provenance [34].

Fig. 4. A roadmap for knowledge discovery and service.

All big geospatial analytics functions will be provided through intelligent services [35]. Multi-layered services can provide new opportunities for Web analytics and knowledge services. At the bottom, application programming interfaces (APIs) and scripts allow developers to dynamically invoke parallel algorithms. In the middle layer, Web services provide standards-based multi-granularity data services and various types of geoprocessing functionality. At the top, processing models, discovered knowledge, and ontologies can be accessed through knowledge services.

V. CONCLUSION

The big data era brings new challenges to various aspects of GIS, including data collection, management, processing, and visualization. It pushes current GIS to evolve into bigdata-aware next-generation GIS - BigGIS. This paper proposes the concept of BigGIS and summarizes existing prominent big data technologies. Key considerations for the development of BigGIS are discussed, which can help direct the future research agenda for BigGIS.

ACKNOWLEDGMENT

This work was supported in part by the National Basic Research Program of China (2011CB707105), the National Natural Science Foundation of China (41271397), the Program for New Century Excellent Talents in University (NCET-13-0435), and the Fundamental Research Funds for the Central Universities. We would like to thank colleagues in the GIS sector of LIESMARS for valuable discussions and comments.

REFERENCES

[1] J.T. Coppock and D.W. Rhind, “The history of GIS,” Geographical information systems: Principles and applications, vol. 1, no. 1, pp. 21-43, 1991.

[2] M.F. Goodchild, “Geographic information systems and geographic research,” Ground truth: The social implications of geographic information systems, pp. 31-50, 1995.

[3] P.A. Burrough and R. McDonnell, “Principles of geographical information systems,” 2nd ed., New York: Oxford University Press, 1998.

[4] D.J. Abel, K. Taylor, R. Ackland, and S. Hungerford, “An exploration of GIS architectures for internet environments,” Computers, Environment and Urban Systems, vol. 22, no. 1, pp. 7–23, 1998.

[5] Y. Chang and H. Park, “XML Web service-based development model for internet GIS applications,” International Journal Of Geographical Information Science, vol. 20, no. 4, pp. 371–399, 2006.

[6] M.J. Ungerer and M.F. Goodchild, “Integrating spatial data analysis and GIS: a new implementation using the Component Object Model (COM),” International Journal of Geographical Information Science, vol. 16, no. 1, pp. 41-53, 2002.

[7] M.H. Tsou and B.P. Buttenfield, “A dynamic architecture for distributing geographic information services,” Transactions in GIS, vol. 6, no. 4, pp. 355–381, 2002.

[8] P. Yue, J. Gong, L. Di, J. Yuan, L. Sun, Z. Sun, and Q. Wang, “GeoPW: laying blocks for geospatial processing Web,” Transactions in GIS, vol. 14, no. 6, pp. 755–772, 2010.

[9] P. Zhao, T. Foerster, and P. Yue, “The geoprocessing web,” Computers & Geosciences, vol. 47, no. 10, pp. 3-12, 2012.

[10] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. Byers, “Big data: The next frontier for innovation, competition, and productivity,” McKinsey Global Institute, 2011.

[11] P. Zikopoulos and C. Eaton, “Understanding big data: Analytics for enterprise class hadoop and streaming data,” McGraw-Hill Osborne Media, 2011.

[12] S. Lohr, “The age of big data,” New York Times, vol. 11, 2012.

[13] Redis, http://redis.io/ (Accessed 20 Jun. 2014).

[14] F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R.E. Gruber, “Bigtable: A distributed storage system for structured data,” ACM Transactions on Computer Systems (TOCS), vol. 26, no. 2, pp. 4, 2008.

[15] Mongodb, http://www.mongodb.org/display/DOCSlHome (Accessed 20 Jun. 2014).

[16] Couchdb, http://couchdb.apache.org/ (Accessed 20 Jun. 2014).

[17] Neo4j, www.neo4j.com (Accessed 20 Jun. 2014).

[18] J. Han, H. E, G. Le, and J. Du, “Survey on NoSQL database,” 2011 6th international conference on, IEEE, pp. 363-366, 2011.

[19] R. Buyya, C.S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, “Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility,” Future Generation computer systems, vol. 25, no. 6, pp. 599-616, 2009.

[20] P. Mell and T. Grance, “The NIST definition of cloud computing (draft),” National Institute of Standards and Technology, vol. 53, p. 50, 2009.

[21] W. Lehner and K.U. Sattler, “Database as a service (DBaaS),” In Data Engineering (ICDE), 2010 IEEE 26th International Conference on, IEEE, pp. 1216-1217, 2010.

[22] P. Costa, M. Migliavacca, P. Pietzuch, and A.L. Wolf, “NaaS: Network-as-a-Service in the Cloud,” In Proceedings of the 2nd USENIX conference on Hot Topics in Management of Internet, Cloud, and Enterprise Networks and Services, Hot-ICE , vol. 12, pp. 1, 2012.

[23] D. Agrawal, S. Das, A. El Abbadi, “Big data and cloud computing: current state and future opportunities,” Proceedings of the 14th International Conference on Extending Database Technology, ACM, pp. 530-533, 2011.

[24] C. Ji, Y. Li, W. Qiu, U. Awada, and K. Li, “Big data processing in cloud computing environments,” In Proceedings of the 2012 12th International Symposium on Pervasive Systems, Algorithms and Networks, IEEE Computer Society, pp. 17-23, 2012.

[25] P. Yue, H. Zhou, J. Gong, and L. Hu, “Geoprocessing in Cloud Computing platforms–a comparative analysis,” International Journal of Digital Earth, vol. 6, no. 4, pp. 404-425, 2013.

[26] E.E. Schadt, M.D. Linderman, J. Sorenson, L. Lee, and G.P. Nolan, “Computational solutions to large-scale data management and analysis,” Nature Reviews Genetics, vol. 11, no. 9, pp. 647-657, 2010.

[27] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.

[28] J. Dean and S. Ghemawat, “MapReduce: a flexible data processing tool,” Communications of the ACM, vol. 53, no. 1, pp. 72-77, 2010.

[29] T. White, “Hadoop: The definitive guide,” O’Reilly Media, Inc., 2009.

[30] Storm, https://storm.incubator.apache.org/ (Accessed 20 Jun. 2014).

[31] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari, “S4: Distributed stream computing platform,” 2010 IEEE International Conference on, IEEE, pp. 170-177, 2010.

[32] P. Russom, “Big data analytics,” TDWI Best Practices Report, Fourth Quarter, 2011.

[33] P. Yue, C. Zhang, M. Zhang, and L. Jiang, “Sensor Web detection and geoprocessing over big data,” in Proceedings of the 2014 IEEE International Geoscience and Remote Sensing Symposium (IGARSS14), Quebec, Canada, pp. 1–4, 2014.

[34] L. Di, P. Yue, H. K. Ramapriyan, and R. King, “Geoscience data provenance: an overview,” IEEE Transactions on Geoscience and Remote Sensing, vol. 51, no. 11, pp. 5065-5072, 2013.

[35] P. Yue, L. Di, Y. Wei, and W. Han, “Intelligent services for discovery of complex geospatial features from remote sensing imagery,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 83, pp. 151-164, 2013.
