
Developing Big Data Analytics Architecture for Spatial Data

Synopsis

Submitted to

Ahmedabad University

For

The Degree of Doctor of Philosophy in

Information and Communication Technology

By

Purnima Rasiklal Shah (1451002)

School of Engineering and Applied Science,

Ahmedabad University, Ahmedabad – 380009, India.

Under Supervision of

Dr. Sanjay R. Chaudhary

Professor and Associate Dean

School of Engineering and Applied Science,

Ahmedabad University.

[September – 2018]


1 Introduction

In the mobile and Internet era, massive scale data is generated from disparate sources

with spatial components. Traditional methods often fail to handle the high volume, high velocity, and variety of complex data in this fast-growing landscape of data generation and consumption. Modern technologies like Big Data and Big Data

Analytics (BDA) have a huge potential to handle massive scale data with high

scalability and low latency. Though the modern big data management tools are highly

efficient, they offer limited functions and methods for spatial data management. In

addition, in modern application development, a single big data tool alone cannot manage big data efficiently and effectively. Hence, it is highly desirable to exploit the potential features of big data tools and technologies and to propose integrated frameworks and architectures built on top of more than one technology to develop robust and powerful applications involving geospatial data.

The research presents an open source and scalable big data analytics architecture for

spatial data management. The main goal of the research work is to solve a wide range

of data problems by offering batch, iterative, and interactive computations in a unified

architecture. The proposed architecture is built and implemented on top of open

source big data frameworks. In comparison with the existing platforms and

architectures, the proposed architecture is in-memory, cost-effective, and open

source. The architecture is realized for the agriculture domain. As a proof of concept,

the spatial analytics applications are developed using agricultural real-life datasets.

The main focus of the research is to develop a big data analytics framework to load,

store, process, and query spatial data at scale. The framework is developed to enable

interactive query processing on spatial and non-spatial data via a web based REST

interface and distributed APIs. In comparison with the existing frameworks, the

proposed framework is implemented on an integrated big data infrastructure with a

new input data source, i.e. NoSQL database. The framework provides efficient and

scalable solutions for spatial data by developing distributed APIs for spatial

operations such as location search, proximity search, and K Nearest Neighbor (KNN)

search. The application layer is implemented to accelerate ad-hoc query processing

via a common web based REST interface. The user interface diverts the user requests

to the suitable framework, either Cassandra or Spark, i.e. low latency queries are

executed on Cassandra and aggregated and complex queries are performed on the

Spark cluster. The framework is evaluated by analyzing the performance of the

analytical operations in terms of latency against the variable size of data. The


performance of the framework is compared with the baseline technology, i.e.

Cassandra for low latency queries.

2 Novelty and Contributions

The research work has developed an open source and scalable architecture to manage

massively distributed data including spatial data. In comparison with the existing

platforms and architectures, the proposed architecture is in-memory, cost-effective

and open source. The architecture intends to build and implement distributed and

scalable APIs for spatial data management on top of analytics and integrated

infrastructure. The architecture is realized to develop analytical services for

agriculture and provide customized solutions to the end users in the form of

interactive maps and Restful ad-hoc services.

Recent findings from the literature suggest that there is no open source big data

analytics framework available which provides end-to-end solutions for spatial data,

i.e. from data loading to data retrieval, on top of an integrated big data infrastructure. Previous research efforts have yet to demonstrate scalability on top of a big data stack.

The innovative idea is that the framework is able to store and process spatial data

without changing the underlying architecture of either Spark or Cassandra. The

framework is not only able to perform the described analytics operations but also

offers a unified infrastructure to perform more sophisticated and complex analytics on

spatial data. It can be extended to enhance the analytical capability with real-time and

streaming data.

The main contributions of the thesis are:

1. An open source big data analytics architecture is developed to perform relevant analytics on massively distributed data at scale.

2. The architecture is realized on top of open source frameworks and performs

relevant analytics on real-life datasets in the agriculture domain including

geospatial data.

3. The data preparation framework is implemented to collect, pre-process, and

integrate data coming from disparate and heterogeneous data sources.

4. A novel big data analytics framework is built and implemented on top of open

source technologies Spark and Cassandra. The framework is implemented to load,

store, process, and perform ad-hoc query processing on spatial and non-spatial

data at scale.

a. A NoSQL based spatial data storage framework is built and implemented.


b. Built and implemented distributed and scalable APIs for spatial operations

such as location search, proximity search, and KNN search.

c. Application architecture is implemented to accelerate interactive analytics

on spatial and non-spatial data. A common web based REST interface is

developed to divert user queries to the suitable framework either

Cassandra or Spark.

d. The performance of the framework is evaluated in terms of latency against

the variable size of data.

e. The performance of the framework is compared with the baseline

technology, i.e. Cassandra for low latency queries.

5. The visualization layer is implemented to showcase the analytical results with

dynamic layouts. A dashboard application is designed and implemented to depict

the analytical results in the agriculture domain.

6. Big data applications in agriculture are developed as a proof of concept.

3 Literature Survey

The thesis has reviewed literature on, 1) Existing state-of-the-art systems for spatial

data management, and 2) Existing ICT based applications and systems in the

agriculture domain.

3.1 Existing Systems for Spatial Data Management

The thesis has reviewed existing database technologies including relational and

NoSQL databases for spatial data. It has also reviewed existing big spatial data

frameworks and architectures and compared them with the proposed framework. The core spatial functionalities provided by the state-of-the-art spatial databases, including NoSQL databases, and by computational frameworks are depicted in Tables 1 and 2.

Table 1 State-of-the-art spatial databases

Database | Supported Geometry Objects | Supported Geometry Functions | Spatial Index | Horizontally Scalable
PostGIS | Point, LineString, Polygon, MultiPoint, MultiPolygon, MultiLineString | OGC standard methods on geometry instances | B-Tree, R-Tree, GiST | No
MySQL | Point, LineString, Polygon, MultiPoint, MultiPolygon, MultiLineString | OGC standard methods on geometry instances | 2d plane index, B-trees | No
MongoDB [1] | Point, LineString, Polygon, MultiPoint, MultiPolygon, MultiLineString | Inclusion, intersection, distance/proximity | 2dsphere index, 2d index | Yes
Cassandra [2] | Point, Polygon, LineString | Distance/proximity, intersection, iswithin, isdisjointto | Solr/Lucene | Yes

Table 2 State-of-the-art big spatial data processing frameworks

Framework | Architecture | Index | Input Format | Spatial Geometry Operators | Spatial Geometry Objects | Language Support/Interface
Proposed Framework | Key-Value + RDD | Geohash | Cassandra | Circle range query, KNN query, point query, attribute query | Point | R [3], REST
Disk-based systems:
SpatialHadoop [4] | MapReduce | Grid / R-tree / R+-tree | HDFS compatible input formats | Spatial analysis and aggregation functions, joins, filter, box range query, KNN, distance join (via spatial join) | Point, LineString, Polygon | Pigeon [5]
Hadoop-GIS [6] | MapReduce | Uniform grid index | HDFS compatible input formats | Range query, self-join, join, containment, aggregation | Point, LineString, Polygon | HiveQL
Memory-based systems:
GeoSpark [7] | RDD, SRDD | R-Tree, Quadtree | CSV, GeoJSON, shapefiles, and WKT | Box range query, circle range query, kNN, distance join | Point, Polygon, Rectangle | Scala, Java
Magellan [8] | RDD | Z-order curve (default precision 30) | ESRI, GeoJSON, OSM-XML, and WKT | Intersects, contains, within | Point, LineString, Polygon, MultiPoint, MultiPolygon | Scala, extended SparkSQL
SpatialSpark [9] | RDD | Grid / Kd-tree | A form of WKT in the Hadoop File System (HDFS) | Box range query, circle range query, KNN, distance join (point-to-polygon distance), point-in-polygon | Point, Polygon | Impala
LocationSpark [10] | RDD | R-tree, Quadtree, IR-tree | HDFS compatible input formats | Range, kNN, spatial join, distance join, kNN join | Point, Rectangle | Scala
SIMBA [11] | RDD | R-tree | HDFS compatible input formats | Range, kNN, distance join, kNN join | Point | Scala / extended SparkSQL

3.1.1 Big Spatial Data Architectures

Generally, big data architectures are designed and developed to achieve a specific

goal. Many big data architectures such as Lambda [12], Kappa [13], Liquid [14],

BDAS [15], SMACK [16] and HPCC [17] have been developed on top of the

integrated infrastructure.

Only a few platforms and architectures, such as IBM PAIRS [18], SMASH [19], and ORANGE [20], have been developed for spatial data management. They provide very high-level designs and architectures, but no source code or design implementation guidelines are provided for future designers. In addition, the existing architectures give little consideration to non-spatial attributes in spatial analytics. In

comparison with the existing big spatial data architectures, the proposed architecture

is in-memory, cost-effective, and open source.

A number of big spatial data frameworks have been developed on top of a big data stack composed of Spark and Cassandra. The Cassandra-Solr-Spark framework has been developed by DataStax to enable spatial query processing on top of the Spark and Cassandra stack. The framework provides an SQL-like query interface to perform spatial operations. It does not support join operations and has not been evaluated against performance metrics. P. Shah et al. [21, 22] have developed a

big data analytics framework including geospatial data on top of big data stack: Spark


and Cassandra. The spatial analytics applications in the agriculture domain have been

developed using third-party Geospark libraries. The major drawback of the

framework is data duplication.

3.2 Existing Agricultural Information Systems and Applications

The thesis has reviewed the existing ICT (Information and Communication

Technology) based agricultural systems and applications developed using the state-

of-the-art technologies such as web, mobile, GIS, Big Data, Big Spatial Data, etc. The

web and mobile based applications and systems1,2,3,4,5,6,7,8 [23, 24, 25, 26] are reviewed in the first phase of research.

The web-GIS based systems in agriculture are reviewed in the next phase. [27, 28, 29,

30, 31] have demonstrated and implemented web-GIS based information systems for

agriculture using traditional GIS tools and technologies. These technologies are often

insufficient to provide a complete picture of analytics in a geographic context. The

thesis has also reviewed the big data applications and systems [32, 34, 35, 36, 37, 38] that have emerged in the agriculture domain to manage complex data at scale.

3.3 Research Gap

The research has identified the existing research gap for big spatial data management.

The modern big data storage and processing frameworks provide very limited geo-functionality and few OGC standard methods in comparison with traditional

systems. The research has also found that the use of big data in the agriculture domain

is in the initial stage, and more research efforts are required to develop big data

applications in the agriculture domain.

3.3.1 Shortcomings of Existing Systems for Spatial Data Management

Very few research efforts have been made to develop open source big data architectures and platforms for spatial data management. Big spatial data platforms and architectures like IBM PAIRS are made accessible only under a proprietary license.

1 http://www.esagu.in
2 http://www.bhoomi.karnataka.gov.in/
3 http://www.chandigarh.gov.in/egov_esmpk.htm
4 http://www.tcs.com/offerings/technologyproducts/mKRISHI/Pages/default.aspx
5 http://agropedia.iitk.ac.in/
6 http://www.icrisat.org/newsroom/news-releases/icrisat-pr-2014-media40.htm
7 http://www.icrisat.org/what-wedo/satrends/SATrends2009/satrends_october09.htm
8 http://mkisan.gov.in/downloadmobileapps.aspx


There are very limited open source and integrated architectures that support interactive and complex (e.g. joins, aggregations) query processing on spatial data. Traditional spatial database management systems are inefficient at storing and processing massive scale geospatial data. The set of geospatial operations supported by

the state-of-the-art big data storage and processing frameworks

(NoSQL/Hadoop/Spark based) are very limited compared to the standard GIS

products, such as ArcGIS. The existing extensions to the NoSQL databases for spatial

data management lack the support for spatial aggregation and join operations. Spark

has no native support for spatial data. The existing Spark/Hadoop based frameworks

are only able to execute spatial operations on datasets that are available in text based

file formats (CSV/GeoJSON/shapefiles and WKT), and stored in HDFS or local disk.

There is no big data analytics framework available which reads data from the NoSQL

database and performs spatial analytics on those data.

3.3.2 Shortcomings of Existing Agricultural Information Systems

In spite of a huge revolution in ICT based technologies and their intervention in the

agriculture domain, there is a large technological gap between farmers and

information. Moreover, from a technical perspective, the major barriers to big data application development in agriculture are the lack of tools, infrastructure, data standards, semantics, integrated data models, developer APIs, unified access points

for public and private data, technical expertise, and finally data.

Research papers and projects mainly provide algorithmic solutions for various applications like crop yield prediction and weather forecasting, but very little work has been done on novel infrastructures and architectures for agricultural

big data including spatial data. There is an urgent need to manage agricultural data at

scale with specialized systems, techniques, and algorithms. The challenge is how to

exploit the full potential of open source technologies and resources in order to create

a sophisticated and customizable working environment for end users to improve the

productivity in agricultural practices.

4 Big Data Analytics Architecture for Spatial Data

The thesis presents a big data analytics architecture for spatial data management. The

architecture is developed on top of an integrated infrastructure which solves a wide

range of data problems by offering batch, iterative, interactive, and streaming

computations within the same commodity cluster. The architecture is built and

implemented on big data open source technologies for the enrichment of large data


sets including geospatial data. The architecture is designed to provide scalable,

flexible, extendible, and cost-effective solutions with available infrastructures and

tools for agriculture.

4.1 Architecture Implementation

The big data analytics architecture shown in figure 1 is realized for the agriculture

domain. The implementation is mainly divided into three stages: 1) Data preparation,

2) Big data analytics framework, and 3) Data Visualization.

Figure 1 Big data analytics architecture for agriculture

There are four types of users interacting with the architecture: 1) System developer, 2) Data scientist, 3) Domain expert, and 4) End user. The data workflow between

software components within the data pipeline is illustrated in figure 2.

Figure 2 Workflow diagram of big data pipeline

4.1.1 Data Preparation Layer

Large scale and massively distributed data, including spatial data, are available at disparate data sources. These data are noisy, inconsistent, and heterogeneous. There is an urgent need to collect and clean such data and prepare them for analytics. The data preparation layer is designed to: 1) Fetch and collect data from different sources via web services, 2) Store the data collected via web services into a data repository, and 3) Pre-process and integrate the data to obtain consistent data. Finally, the consistent data is transferred into persistence. The data preparation layer provides two layers of data abstraction. First, it hides all physical data sources from the data repository. Second, it further unifies the data available in the data repository using various complex tools and techniques such as data fusion algorithms, schema mapping tools, and record linkage algorithms.

The data preparation layer is designed to collect and process complex data coming from heterogeneous and distributed data sources as inputs to the corresponding services that perform relevant analytics. The data preparation services are implemented to fetch consistent and clean data from disparate sources and store it into a persistent database. This layer is implemented in three steps: 1) Data extraction and collection, 2) Data repository, and 3) Data pre-processing and integration.

4.1.2 Big Data Analytics Framework

The main purpose of the big data analytics framework is to enable spatial data management on large scale data. The consistent datasets generated by the data preparation services are processed and analyzed using the data analytics framework. The analytical results are presented to end users through visualization and a REST interface. The design and development of the big data analytics framework are discussed in section 5.

4.1.3 Data Visualization

Data visualization makes complex data more accessible, understandable and usable.

For example, in the agriculture domain, data visualization is useful to identify crop

patterns, crop future trends, market trends, etc. The dashboard applications and user

interfaces are developed to depict analytical results using open source tools like R,

D3, Google API, etc.

The architecture provides a web based user interface by developing analytical and

visualization services through Restful ad-hoc APIs and interactive maps. Web based

interactive maps are implemented using R libraries (leaflet, ggplot, and shiny) in the

form of choropleth map, pulse markers, and plots. Restful web services are

implemented to explore data and analytical results using the SparkSQL/CQL interface on top of the analytics framework.

5 Big Data Analytics Framework for Spatial Data

A big spatial data analytics framework is developed to load, store, process, and query

large scale spatial and non-spatial data. It is an integrated infrastructure designed to

manage spatial data efficiently and effectively by exploiting the potential features

provided by standard big data storage and processing frameworks. It is realized on top of a big data stack with Spark as the core processing engine and Cassandra as the data

storage. The implementation architecture is shown in figure 3. The implementation is

divided into four layers: 1) Spatial data storage layer, 2) Spark core layer, 3) Spatial

data processing layer, and 4) Application layer.

Figure 3 Big data analytics framework for spatial data

5.1 Spatial Data Storage Layer

The framework ingests data from the data preparation services and loads them into the persistent database. The spatial data storage framework is developed to load and store spatial data at scale. It is built on top of a distributed database, i.e. Cassandra. The implementation architecture shown in figure 4 includes five phases: 1) Data loading, 2) Data storage, 3) Integration of big data storage and processing engine, 4) Associate spatial index, and 5) Store spatial data frame into persistence.

Figure 4 Implementation architecture of spatial data storage framework

5.1.1 Data Loading

Raw datasets including spatial attributes are pulled from disparate sources in CSV

format using data preparation services. These data are loaded into the Cassandra database using bulk loading, which is the fastest way to load a large CSV file into Cassandra. The pseudo-code given below presents how data

are loaded into Cassandra.

Pseudo-code 1 - Load data into Cassandra database

Input: datafile_in_csv, keyspace, schema

Output: Load data into Cassandra database

1: The raw data is written to both CommitLog and Memtable.

2: Memtable is flushed out to SSTables sstable[0], sstable[1],

sstable[2], ...

3: Combine SSTables into Single SSTable[n] using compaction

(optional).

4: Load SSTables into Cassandra using sstableloader utility.

5.1.2 Data Storage

The raw dataset with spatial components is stored in the Cassandra database based on

a data model. The following pseudo-code describes how each partition is stored in the

Cassandra database.

Pseudo-code 2 - Store data into Cassandra database

Input: datafile_in_csv, keyspace, schema

Output: store data into Cassandra database

1: Read data from the input file.

2: Determine the partitions based on the partition key.

3: Map each partition to a token value using murmur3partitioner

(default partitioner in Cassandra).

4: Allocate each partition of Cassandra to a particular node of a

cluster based on token range owned by that particular node.

5.1.3 Integration of Big Data Storage and Processing Engine

Spark is used as a core processing engine which processes data stored in the

Cassandra database. The Spark-Cassandra connector [39] is the key component which

aligns the Spark-Cassandra data distribution. The Cassandra partitions are imported

into Spark memory in the form of a Dataframe object using the Spark-Cassandra connector.


The following pseudo-code shows how Cassandra partitions are mapped into Spark

partitions based on data distribution performed by Spark-Cassandra connector.

Pseudo-code 3 - Align Cassandra partitions into Spark partitions

Input: Cassandra partitions, spark.cassandra.input.split.size_in_mb

Output: Spark partitions

1: Read Cassandra partitions from Cassandra cluster.

2: Divide token ranges owned by Cassandra nodes based on input

parameter spark.cassandra.input.split.size_in_mb which determines

the number of spark partitions.

3: Align Cassandra partitions to Spark partitions.
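As an illustration, the following Scala sketch performs this integration step using the connector's Dataframe reader; the contact point, keyspace, and table names are assumed placeholders rather than the thesis deployment values.

import org.apache.spark.sql.SparkSession

// Sketch: import Cassandra partitions into Spark as a Dataframe via the Spark-Cassandra
// connector. Contact point, keyspace ("agri") and table ("taxi_raw") are assumed names.
val spark = SparkSession.builder()
  .appName("SpatialDataImport")
  .config("spark.cassandra.connection.host", "127.0.0.1")
  .config("spark.cassandra.input.split.size_in_mb", "64")   // controls Spark partition granularity
  .getOrCreate()

val rawDf = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "agri", "table" -> "taxi_raw"))
  .load()

println(s"Spark partitions derived from Cassandra token ranges: ${rawDf.rdd.getNumPartitions}")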

5.1.4 Associate Spatial Index

Geohash9 is a hierarchical spatial data structure used for indexing spatial data. Each

record present in the Spark partitions is associated with a Geohash character string using a one-to-one transformation. This creates a new Spark Dataframe with a spatial index. The

resultant Spark Dataframe is called spatial data frame. The spatial data frame is stored

back into the Cassandra database. The pseudo-code given below describes how

spatial index is associated with each record of Spark Dataframe.

Pseudo-code 4 - Associate spatial index to Spark partitions

Input: Spark object

Output: Spark object

1: Read Spark partitions from Spark object.

2: Map each record (.., latitude, longitude,…) of Spark partitions

into (Geohash, wkt,……) by applying custom functions geohash

(latitude, longitude, precision) and WKT(latitude, longitude).

3: Return Spark object.
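A possible Scala realization of this transformation is sketched below: a compact standard Geohash encoder and two UDFs that append the Geohash and WKT columns. The precision value and the coordinate column names are assumptions made for illustration.

import org.apache.spark.sql.functions.{col, udf}

// Minimal standard Geohash encoder: interleave longitude/latitude bits, 5 bits per base32 character.
object GeohashEncoder {
  private val Base32 = "0123456789bcdefghjkmnpqrstuvwxyz"

  def encode(lat: Double, lon: Double, precision: Int): String = {
    var (latMin, latMax) = (-90.0, 90.0)
    var (lonMin, lonMax) = (-180.0, 180.0)
    val sb = new StringBuilder
    var evenBit = true   // even bits encode longitude, odd bits encode latitude
    var bits = 0
    var ch = 0
    while (sb.length < precision) {
      if (evenBit) {
        val mid = (lonMin + lonMax) / 2
        if (lon >= mid) { ch = (ch << 1) | 1; lonMin = mid } else { ch = ch << 1; lonMax = mid }
      } else {
        val mid = (latMin + latMax) / 2
        if (lat >= mid) { ch = (ch << 1) | 1; latMin = mid } else { ch = ch << 1; latMax = mid }
      }
      evenBit = !evenBit
      bits += 1
      if (bits == 5) { sb.append(Base32(ch)); bits = 0; ch = 0 }
    }
    sb.toString
  }
}

// One-to-one transformation: associate each record with a Geohash string and a WKT point.
val geohashUdf = udf((lat: Double, lon: Double) => GeohashEncoder.encode(lat, lon, 6))
val wktUdf     = udf((lat: Double, lon: Double) => s"POINT ($lon $lat)")

val spatialDf = rawDf   // Dataframe read from Cassandra in the previous sketch
  .withColumn("geohash", geohashUdf(col("pickup_latitude"), col("pickup_longitude")))
  .withColumn("wkt",     wktUdf(col("pickup_latitude"), col("pickup_longitude")))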

5.1.5 Store Spatial Data Frame into Persistence

Now, each record of a spatial data frame is associated with Geohash code. The spatial

data model is designed such that each rectangular area represented by the Geohash

code is mapped to a Cassandra row. The Geohash attribute is modeled as a partition

key in the spatial data model. The spatial data model is shown in figure 5 where each

row represents a rectangular area and columns store the spatial objects that fall within that

rectangular area. The spatial attributes are stored in WKT format.

9 Geohash WG. Geohash. https://www.en.wikipedia.org/wiki/Geohash.


Figure 5 Spatial data model

The pseudo-code 5 describes how spatial data frame is stored back into Cassandra

based on the spatial data model.

Pseudo-code 5 - Store spatial data into Cassandra database

Input: spark object, keyspace, schema,

cassandra.output.batch.size.bytes, cassandra.output.batch.size.rows

Output: store spatial data into Cassandra database

1: Identify Spark partitions in Spark object with spatial

components.

2: Design spatial data model.

3: Map each rectangular area represented by Geohash code to a

Cassandra row.

4: Create micro batches of Spark partitions based on input

parameters cassandra.output.batch.size.bytes,

cassandra.output.batch.size.rows.

5: Write micro batches into Cassandra database.

The write operation from Spark to Cassandra is very challenging. It needs to collect

all Spark partitions into memory and store back into Cassandra. The write

performance from Spark to Cassandra is optimized by creating micro batches of the

data available in Spark partitions. The number of batches and the batch size are determined

from the metadata available on each node of a Cassandra cluster. The connector

converts the data into a number of batches and then writes each batch into Cassandra

based on partition key.
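A hedged sketch of this write path using the connector's Dataframe writer is given below; the keyspace, table, and schema are illustrative placeholders, and the micro-batch settings are mentioned only in comments.

// The target table models the Geohash as the partition key, e.g. (illustrative schema):
//   CREATE TABLE agri.taxi_spatial (geohash text, trip_id text, wkt text, pickup_datetime text,
//                                   PRIMARY KEY (geohash, trip_id));
// Micro-batching is governed by the connector settings spark.cassandra.output.batch.size.rows
// and spark.cassandra.output.batch.size.bytes, typically set on the Spark configuration.
spatialDf.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "agri", "table" -> "taxi_spatial"))
  .mode("append")
  .save()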

5.2 Spark Core Layer

Spark core layer provides user friendly APIs for machine learning, graph processing,

and structured data processing using Spark SQL. It performs fast data processing on

large datasets by exploiting data parallelism and partitioning techniques. However,

Spark doesn’t support native partitioning for spatial data. For example, proximity

based partitioning for spatial data is not supported. Developers would need to write wrappers to perform spatial operations. The distributed APIs are developed to perform spatial operations using standard Spark core APIs. The spatial APIs extend the Spark core functionalities for spatial data.

5.3 Spatial Data Processing Layer

The spatial data processing layer creates an interface to query spatial and non-spatial

data available in the form of a Spark Dataframe object. The distributed APIs are implemented for three spatial operations: 1) Proximity search, 2) KNN search, and 3)

Location search.

5.3.1 Proximity Search

Proximity search queries are widely used in many analytics applications. Circle range

queries are used to find the number of spatial objects within a specified circular range

from a given query point. A distributed API for proximity search is,

circle_range_query(spark_object, query_point, range)

The circular range query is implemented based on Algorithm 1. It returns the

approximate number of spatial objects within the given circular range.

Algorithm 1 - An algorithm to find spatial objects within a circular

range

Input: spark_object, query_point, range

Output: spatial objects within circular range

1: Find Geohash covered by circular search space using ProximityHash

algorithm

2: Filter the Spark partitions based on bounded Geohash

3: Collect the spatial objects from the filtered Spark partitions

4: Return spatial objects
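The sketch below illustrates this idea in Scala under simplifying assumptions: a reusable haversine-distance UDF and a coarse Geohash-prefix filter stand in for the full ProximityHash covering step, the column names are assumed, and the GeohashEncoder from the earlier sketch is reused.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Great-circle (haversine) distance in metres from a fixed query point, as a Spark UDF.
def haversineUdf(qLat: Double, qLon: Double) = udf((lat: Double, lon: Double) => {
  val R = 6371000.0
  val dLat = math.toRadians(lat - qLat)
  val dLon = math.toRadians(lon - qLon)
  val a = math.pow(math.sin(dLat / 2), 2) +
    math.cos(math.toRadians(qLat)) * math.cos(math.toRadians(lat)) * math.pow(math.sin(dLon / 2), 2)
  2 * R * math.asin(math.sqrt(a))
})

// Simplified circle_range_query: prune candidates with a coarse Geohash prefix of the query
// point (a stand-in for the full ProximityHash covering), then apply the exact distance test.
def circleRangeQuery(spatialDf: DataFrame, qLat: Double, qLon: Double, rangeMeters: Double): DataFrame = {
  val prefix = GeohashEncoder.encode(qLat, qLon, 4)   // precision-4 cells are roughly 39 km x 20 km
  spatialDf
    .filter(col("geohash").startsWith(prefix))
    .filter(haversineUdf(qLat, qLon)(col("pickup_latitude"), col("pickup_longitude")) <= rangeMeters)
}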

5.3.2 KNN Search

K nearest neighbor search queries are used to find K nearest spatial objects from a

given query point. K nearest objects are found using selection and merge method. A

distributed API for KNN search is,

KNN_query(spark_object, query_point, K)

The KNN search is implemented based on Algorithm 2.


Algorithm 2 - An algorithm to find K nearest neighbors from a query

point

Input: spark_object, query_point, K

Output: K neighbors from query_point

1: Select a large enough circular search space defined by r.
2: Find the Geohashes covered by the circular search space.
3: Sort the subspaces based on the distance between the query point and the Geohash.
4: for each subspace do
       Find k' nearest neighbor points from the query point
       if k' >= K then
           goto step 7
       endif
   end for
5: Merge the points from all subspaces and find the k' nearest neighbor points.
6: if k' < K then
       Expand the search space based on Geohash precision
       if the search space is out of bounds then
           return error
       else
           repeat steps 1 – 5
       endif
   endif
7: Return K spatial objects
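A simplified Scala sketch of the KNN query is given below, reusing the GeohashEncoder and the haversine UDF from the earlier sketches; the iterative search-space expansion of steps 4–6 is omitted, and the column names are assumptions.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Prune by a coarse Geohash prefix around the query point, rank survivors by distance, keep K.
def knnQuery(spatialDf: DataFrame, qLat: Double, qLon: Double, k: Int): DataFrame =
  spatialDf
    .filter(col("geohash").startsWith(GeohashEncoder.encode(qLat, qLon, 4)))
    .withColumn("dist_m", haversineUdf(qLat, qLon)(col("pickup_latitude"), col("pickup_longitude")))
    .orderBy(col("dist_m"))
    .limit(k)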

5.3.3 Location Search

The location search query is used to find a particular location within the dataset along with its attribute data. The distributed APIs to search a given location with and without the Geohash index are shown below,

point_query(spark_object, query_point)

point_query_with_index(spark_object, query_point)

Location search without an index is very slow, as it requires scanning each partition of a data frame, whereas location search with the spatial index is fast: it finds a location within a single partition defined by the Geohash index. The implementation of the

location search query with and without an index is described in pseudo-code 6 and 7

respectively.

Pseudo-code 6 - Find spatial attributes associated with a

particular location without a spatial index

Input: spark_object, query_point

Output: spatial attributes at a query_point

1: Read each Spark partition of a Spark object.

2: Search query_point within each partition i.e full table scan

3: Collect the spatial object.


4: Return spatial object, spatial attributes.

Pseudo-code 7 - Find spatial attributes associated with a

particular location with spatial index

Input: spark_object, query_point

Output: spatial attributes at a query_point

1: Find Geohash associated with query_point.

2: Filter Spark partition based on Geohash value.

3: Search query_point within filtered spark partition.

4: Collect the spatial object.

5: Return spatial object, spatial attributes.
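The two variants can be sketched in Scala as follows, reusing the GeohashEncoder from the earlier sketch; the coordinate column names are assumptions.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Without an index: the point predicate is evaluated against every partition (full scan).
def pointQuery(df: DataFrame, qLat: Double, qLon: Double): DataFrame =
  df.filter(col("pickup_latitude") === qLat && col("pickup_longitude") === qLon)

// With the Geohash index: only the single cell containing the query point is scanned.
def pointQueryWithIndex(df: DataFrame, qLat: Double, qLon: Double): DataFrame =
  df.filter(col("geohash") === GeohashEncoder.encode(qLat, qLon, 6))
    .filter(col("pickup_latitude") === qLat && col("pickup_longitude") === qLon)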

5.4 Application Layer

The framework provides a convenient web based REST interface to the end user.

Compared to Spark, Cassandra performs faster data retrieval based on the partition key and clustering key. The application architecture enables end users to execute ad-

hoc queries on a suitable framework either Spark or Cassandra via a common user

interface. The low latency queries are executed on Cassandra, whereas complex

queries (e.g. aggregated and spatial queries) are executed on the Spark framework.

The implementation architecture of the application layer is shown in figure 6.

Figure 6 Implementation architecture of application layer
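A minimal sketch of this routing idea is shown below, using the DataStax Java driver for the CQL path and SparkSQL for the analytical path; the contact point, keyspace, table, view name, and the boolean routing flag are illustrative assumptions rather than the actual REST implementation.

import com.datastax.driver.core.Cluster
import scala.collection.JavaConverters._

// CQL path: direct partition-key lookups on Cassandra (contact point and keyspace assumed).
val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("agri")

// Spark path: the spatial table is registered as a temporary view for SparkSQL queries.
// (spark is the SparkSession created in the earlier sketch.)
spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "agri", "table" -> "taxi_spatial"))
  .load()
  .createOrReplaceTempView("taxi_spatial")

// Route a query: low-latency lookups go to Cassandra, complex/aggregated queries to Spark.
def runQuery(query: String, isLowLatency: Boolean): Unit =
  if (isLowLatency) session.execute(query).all().asScala.foreach(println)
  else spark.sql(query).show()

// Example routing decisions (illustrative queries):
runQuery("SELECT * FROM taxi_spatial WHERE geohash = 'dr5ru'", isLowLatency = true)
runQuery("SELECT geohash, count(*) FROM taxi_spatial GROUP BY geohash", isLowLatency = false)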

5.5 Experiments and Results

The big data analytics framework is evaluated using a benchmark dataset. The

scalability test is performed in terms of latency against the variable size of data. The

performance of the framework is compared with the baseline technology, i.e.

Cassandra for low latency queries.


5.5.1 Experimental Setup

All experiments are conducted on a cluster consisting of four nodes. Each node runs

Ubuntu 14.04 with Spark 2.1.0 and Cassandra 3.10. Each node is equipped with an

Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz, 1 CPU, 4 physical cores per CPU, a

total of 8 logical CPU units with 16 GB RAM. The Spark cluster is deployed in

standalone mode. The Spark-Cassandra connector is used to integrate the Spark and

Cassandra frameworks. The big spatial analytics framework is implemented and deployed using the Sparklyr framework. The web based Restful ad-hoc APIs are built and implemented to explore analytical results on top of the Shiny10 framework.

5.5.2 Description of Dataset

The NYC11 taxi dataset is an open database. It contains 167 million records in 30 GB. Each record describes a taxi trip made by a particular driver at a particular date and time. Each record has 16 attributes: hack_license, two 2D coordinates which represent the pickup and drop-off locations, pickup date_time, drop-off date_time, and nine other attributes which represent other related information. All

experiments are done on a sample dataset containing about 30 million records in 5

GB storage.

5.5.3 Performance Evaluation

Sample datasets of variable sizes (5 million to 30 million records; 800 MB to 5 GB)

are generated to evaluate the performance of the system. All experiments are

performed on 2-dimensional vector dataset. Pickup location is considered as a 2D

spatial object. Different test cases are designed to evaluate the performance of the

queries such as attribute search query, location search query, proximity search query,

KNN search query, and low latency query. The performance is evaluated in terms of

average latency against the variable size of datasets.

5.5.3.1 Load data into Cassandra

The performance of the loading phase is evaluated using load time vs. data size. The

performance of bulk loading is shown in figure 7.

10 “Shiny,” http://www.shiny.rstudio.com
11 “NYC Taxi Trips,” http://www.andresmh.com/nyctaxitrips/


5.5.3.2 Establish a big data pipeline

The big data analytics pipeline is established in three phases: 1) Read data from Cassandra, 2) Associate the Geohash index to each record of the dataset, and 3) Write the indexed data back into Cassandra. The total elapsed time is calculated by adding the time required by each phase of the analytics pipeline. The performance in terms of total elapsed time against the variable size of datasets is shown in figure 8.

Figure 7 Bulk load performance (effect of data size)

Figure 8 Analytics pipeline performance (effect of data size)

5.5.3.3 Attribute search analysis

Non-spatial data associated with spatial data are equally important in spatial analytics.

The non-spatial data express the characteristics of spatial data. The attribute search

queries are used to perform various operations such as selection, filtration,

aggregation, and group_by on non-spatial data. Twenty-five test queries are executed and the average latency is calculated. The attribute query performance in terms of

latency against the different size of datasets is shown in figure 9. Attribute search is

performed using the REST abstraction provided by the application layer.

5.5.3.4 Location search analysis

Location search is used to find the characteristics at a particular location. Twenty-five test cases each are performed against pickup locations with and without the Geohash index. The spatial index plays an important role in fast data retrieval. It is observed that location search queries execute faster when they use the Geohash index. Location search without an index requires a full table scan, which degrades the query

performance. The performance of a location search query is shown in figure 10.

Location search is performed via distributed API as well as a REST interface.


Figure 9 Attribute search performance (effect of data size)

Figure 10 Location search performance (effect of data size)

5.5.3.5 Proximity search analysis

Ten test cases are performed with a varying range from 500 meters to 3000 meters

from a given query point. The query point is selected using the sampling method. The

results show that proximity search using Geohash index offers outstanding

performance. The performance of the circular range query is shown in figure 11.

Proximity search is performed via distributed API as well as a REST interface.

5.5.3.6 KNN search analysis

Ten test queries are performed to find K (K=5 to K=25) nearest pickup locations from

a given query point. The query points are chosen from high density areas using the

sampling method. The performance of KNN search query is shown in figure 12. KNN

search is performed via distributed API as well as a REST interface.

Figure 11 Proximity search performance (effect of data size)

Figure 12 KNN search performance (effect of data size)

5.5.3.7 Low latency query analysis

Low latency queries are designed based on the partition key or a combination of the partition key and clustering key of a Cassandra table. Twenty-five test queries are executed.

The result shows that Cassandra outperforms Spark for low latency queries. Figure 13


depicts the query performance. Low latency queries are performed using the REST abstraction provided by the application layer.

Figure 13 Low latency query performance (effect of data size)

6 Realization of Architecture in Application Domain

The prototype applications in the agriculture domain are developed on top of the big

data analytics architecture. As a part of the implementation, the agricultural

information management system is developed to reduce the technological gap

between agro-users and information. The information system is implemented to

collect, query, analyze, and visualize heterogeneous and distributed data including

geospatial data at scale using open source technologies.

6.1 Data Collection, Pre-processing, and Integration

There are very few open datasets or digital data available in the agriculture domain,

especially in India. Hence, data collection is the prime issue in developing big data applications in the agriculture domain. Spatial and non-spatial data on weather, crop,

and market are collected from different sources like meteorological departments,

agriculture universities, and web portals. The summary of data collection is given in

Table 3.

Table 3 Data collection

Datasets | Data source | Description | Data format
Weather data (1992 – 2007) | Archived data, www.Indiastat.com | Data collected for seven districts of Gujarat | Spreadsheet/document
Crop data for cotton crop (1960 – 2007) | www.Indiastat.com, http://apy.dacnet.nic.in/crop_fryr_toyr.aspx, archived data | Data collected for eighteen districts of Gujarat | Spreadsheet/document
Market data | http://agmarknet.gov.in/ | Data collected for 429 agro-markets in Gujarat | PDF document
Current weather data | Openweathermap API | Global API for weather data | JSON format
Spatial data for Gujarat | www.diva-gis.org | Spatial data | Shapefile

Data pre-processing and integration techniques like handling missing values, data

duplication removal, data validation, similarity search, and joins are applied to get

consistent data sets on weather, crop, and market. The big data analytics applications

including geospatial data are implemented on top of the big data analytics framework.

6.2 Big Data Applications in Agriculture

The big data applications in agriculture are developed on top of the proposed

architecture. The web based analytics and visualization services are developed for the cotton crop in Gujarat, India. Cotton is an important non-food crop which provides

lint to the textile industry, high protein feed to livestock, oil for human consumption,

byproducts used as fertilizer, the raw material for paper and cardboard, etc. [40].

India is the second largest cotton crop producer and consumer [41], and Gujarat

stands second in cotton production after Maharashtra. Cotton is a major cash crop in

Gujarat. Effective advisory services for the cotton crop may lead to productivity growth in cotton and, in turn, economic growth in Gujarat.

6.2.1 Big Data Application for Crop Yield Prediction

A big data application is developed to predict cotton crop yield based on weather

parameters: average temperature and rainfall. Multiple Linear Regression (MLR)

algorithm [42], [43] is implemented using the MLlib library of Spark. The yield prediction

results are shown in Table 4.

Table 4 Experimental results

District | R-squared | Actual yield in 2007 (bales/'00 hectare) | Predicted yield in 2007 (bales/'00 hectare) | Difference (bales/'00 hectare)
Vadodara | 0.44 | 447 | 359 | 88
Bharuch | 0.53 | 285 | 176 | 109
Jamnagar | 0.83 | 695 | 682 | 13
Amreli | 0.68 | 643 | 476 | 167
Surat | 0.78 | 545 | 489 | 56
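A minimal sketch of fitting such an MLR model with the Dataframe-based Spark ML API is shown below; the input Dataframe weatherYieldDf and its column names are assumptions made for illustration.

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression

// weatherYieldDf is an assumed Dataframe with one row per district/year and the columns
// avg_temperature, rainfall (features) and yield (label).
val assembler = new VectorAssembler()
  .setInputCols(Array("avg_temperature", "rainfall"))
  .setOutputCol("features")

val training = assembler.transform(weatherYieldDf).select("features", "yield")

val lr = new LinearRegression()
  .setLabelCol("yield")
  .setFeaturesCol("features")

val model = lr.fit(training)
println(s"R-squared: ${model.summary.r2}")   // goodness of fit, as reported per district in Table 4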

The resultant crop yield prediction map for the cotton crop is shown in figure 14. The

crop production, land usage, and yield trend from 1960-2007 as well as actual vs.

predicted crop yield trend is displayed on the user interface by the polygon click event.

Various recommendations based on the crop yield prediction model are provided to agro-users. The recommendations will help farmers to have an idea of yield estimates based on ongoing weather parameters. It will help farmers to make decisions on whether to grow that particular crop or find an alternate crop in case the yield predictions are unfavorable.

Figure 14 Snapshot of crop yield prediction for cotton crop in Gujarat

6.2.2 Big Data Application for Weather Analytics

A big data application is developed to analyze the weather of various districts of Gujarat. The current weather data is collected using the global weather service – the Openweathermap API. The historical weather data are stored in Cassandra DB. The aggregation of data is performed on top of the Spark-Cassandra big data stack. The Spark core APIs for aggregation are used to find the monthly averages. The current hottest location in Gujarat is determined using the SparkSQL window function. The application integrates the current and historical weather data. The choropleth map shown in figure 15 depicts the results via dynamic layouts. The map depicts daily and monthly data for average temperature from 1992 to 2007 on the user interface by the polygon click event. It also depicts the hottest location with a pulse marker.

Figure 15 Snapshot of weather analytics application
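The aggregation and window-function steps can be sketched in Scala as follows; the Dataframes weatherDf and currentWeatherDf and their column names are assumed placeholders.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col, month, row_number}

// Monthly average temperature per district from the historical data (columns assumed).
val monthlyAvg = weatherDf
  .groupBy(col("district"), month(col("obs_date")).as("month"))
  .agg(avg(col("temperature")).as("avg_temp"))

// Currently hottest district: rank the latest observations by temperature with a window function.
val byTemp  = Window.orderBy(col("temperature").desc)
val hottest = currentWeatherDf
  .withColumn("rank", row_number().over(byTemp))
  .filter(col("rank") === 1)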

6.2.3 Big Data Application for Agro-Market Data Analytics

A big data application is developed to search and analyze agro-market data for the

cotton crop. The market locations are geo-referenced using the Google Maps service. Some of the market locations are geo-referenced manually due to the lack of data. The spatial operations are performed on market data. Figure 16 depicts agro-markets and shows information at a particular location on the marker click event.

6.2.4 Big Data Application for Crop Data Aggregation

A big data application is developed to perform aggregation on cotton crop production,

land usage, and yield data. The crop data is averaged over the years from 1960 to

2007. The result of crop production data aggregation is shown in figure 17. The maps

depict the trend in a popup window by the polygon click event.

Figure 16 Snapshot of agro-market search analysis

Figure 17 Snapshot of crop production data aggregation

7 Conclusion

The overall research focus is to develop an open source and scalable big data analytics architecture for massive scale data management including spatial data. The architecture is developed to address two big data challenges: Variety and Volume. The architecture has three major components: data preparation, data analytics, and

data visualization. The data preparation layer is designed and developed to fetch and

collect massive scale data from disparate sources. The data preparation services are

implemented with two levels of data abstraction. A big data analytics framework is

implemented to load, store, process, and query spatial and non-spatial data at scale.

The framework implements analytics operations like selection, filtration, aggregation, location search, proximity search, and KNN search. The

results are explored through distributed APIs and Restful ad-hoc APIs. The

application layer is designed to accelerate ad-hoc query processing by diverting the

user queries to the suitable framework. The experimental results show that the

framework has achieved efficient storage management and high computational

processing through Spark and Cassandra integration. The data visualization layer is

developed to showcase the analytical results.

The architecture is realized for the agriculture domain. An agricultural information

system is developed as a proof of concept. The information system is implemented

by developing big data applications such as crop yield prediction, crop data

aggregation, and weather and agro-market analytics.

7.1 Implementation Status - Present and Future

The big data analytics architecture is designed and developed for massive scale data

management including spatial data. The data preparation layer is designed with two

levels of data abstraction. As a part of the implementation, a REST interface is designed and implemented to fetch and collect data from different data sources with formats such as PDF, spreadsheets, documents, web pages, and online services. The integration of data collected through the REST interface is the most critical module in the data preparation layer. An algorithmic solution is to be devised to link a variety of data from diverse sources in aid of unified search, query, and analysis.

The core component of the big data analytics architecture, i.e. the big data analytics framework, is implemented for spatial data management. The framework is to be

extended by developing complex spatial operations like spatial join and kNN join.

The spatial applications like spatial aggregation and spatial auto-correlation are to be

developed on top of the framework.

The complex applications in the agriculture domain are to be developed by

identifying new data sources, formats, and data types. The real-life datasets including

real-time and streaming data are to be collected and stored in a data repository to

perform further analytics. The near real-time data analytics and visualization

algorithms are to be devised to process real-time data like weather, disaster, etc. The


analytical services like rainfall prediction, crop recommendation, crop price

prediction, agro-inputs procurement, supply chain management, crop disease alert,

fertilizer recommendations, etc. are to be implemented. These services will be used to

generate customized and multilingual solutions in the form of weather based crop

calendars and alerts based on adverse events. Big spatial data applications are to be developed, for example, to find aggregated weather per agro-climatic zone, find the number of regions having rainfall below a threshold value, and find the nearby warehouse.

Publications by Author

1. Shah, Purnima, and Sanjay Chaudhary. "Big Data Analytics Framework for

Spatial Data." In Sixth International Conference on Big Data Analytics, pp. 250-

265. Springer, Cham, 2018.

2. Shah, Purnima and Sanjay Chaudhary. "Big Data Analytics and Integration

Platform for Agriculture”, In the Proceedings of Research Frontiers in Precision

Agriculture (Extended Abstract), AFITA/WCCA 2018 Conference, IIT Bombay,

India, October 24-26, 2018.

3. Shah, Purnima, Deepak Hiremath, and Sanjay Chaudhary. "Towards development

of spark based agricultural information system including geo-spatial data." In Big

Data (Big Data), 2017 IEEE International Conference on, pp. 3476-3481. IEEE,

2017. (h-index – 33, citation_count - 1)

4. Shah, Purnima, Deepak Hiremath, and Sanjay Chaudhary. "Big data analytics

architecture for agro advisory system." In High Performance Computing

Workshops (HiPCW), 2016 IEEE 23rd International Conference on, pp. 43-49.

IEEE, 2016. (h-index – 14, citation_count - 7)

5. Shah, Purnima, Deepak Hiremath, and Sanjay Chaudhary. “Big Data Analytics

for Crop Recommendation System”, 7th Intl. Workshop on Big Data

Benchmarking (WBDB 2015), New Delhi, 14-15 Dec. 2015, organized by ISI

Delhi center, and IIPH Hyderabad.

Bibliography

1. Website of MongoDB, http://www.mongodb.org

2. Lakshman, Avinash, and Prashant Malik. "Cassandra: a decentralized structured storage system."

ACM SIGOPS Operating Systems Review 44, no. 2 (2010): 35-40.

3. Team, R. Core. "R: A language and environment for statistical computing. R Foundation for

Statistical Computing, Vienna, Austria. 2013." (2014).

4. Eldawy, Ahmed, and Mohamed F. Mokbel. "Spatialhadoop: A mapreduce framework for spatial

data." In Data Engineering (ICDE), 2015 IEEE 31st International Conference on, pp. 1352-1363.

IEEE, 2015.

5. Eldawy, Ahmed, and Mohamed F. Mokbel. "Pigeon: A spatial mapreduce language." In 2014

IEEE 30th International Conference on Data Engineering (ICDE), pp. 1242-1245. IEEE, 2014.

6. Aji, Ablimit, Fusheng Wang, Hoang Vo, Rubao Lee, Qiaoling Liu, Xiaodong Zhang, and Joel

Saltz. "Hadoopgis: a high performance spatial data warehousing system over mapreduce."

Proceedings of the VLDB Endowment 6, no. 11 (2013): 1009-1020.

7. Yu, Jia, Jinxuan Wu, and Mohamed Sarwat. "Geospark: A cluster computing framework for

processing large-scale spatial data." In Proceedings of the 23rd SIGSPATIAL International

Conference on Advances in Geographic Information Systems, p. 70. ACM, 2015.

8. Website of Magellan, https://github.com/harsha2010/magellan; https://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/

9. Web site of Spatialspark. http://simin.me/projects/spatialspark/

10. Tang, Mingjie, Yongyang Yu, Qutaibah M. Malluhi, Mourad Ouzzani, and Walid G. Aref.

"Locationspark: a distributed in-memory data management system for big spatial data."

Proceedings of the VLDB Endowment 9, no. 13 (2016): 1565-1568.

11. Xie, Dong, Feifei Li, Bin Yao, Gefei Li, Liang Zhou, and Minyi Guo. "Simba: Efficient in-

memory spatial analytics." In Proceedings of the 2016 International Conference on Management

of Data, pp. 1071-1085. ACM, 2016.

12. “Lambda Architecture,” http://lambda-architecture.net/, 2014.

13. “Kappa Architecture,” http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html, 2014.

14. Fernandez, Raul Castro, Peter R. Pietzuch, Jay Kreps, Neha Narkhede, Jun Rao, Joel Koshy, Dong

Lin, Chris Riccomini, and Guozhang Wang. "Liquid: Unifying Nearline and Offline Big Data

Integration." In CIDR. 2015.

15. Website of Berkeley Data Analysis Stack, https://amplab.cs.berkeley.edu/software/

16. Website of HPCC, https://hpccsystems.com/.

17. Estrada, Raul, and Isaac Ruiz. "Big Data SMACK." Apress, Berkeley, CA (2016).

18. Klein, Levente J., Fernando J. Marianno, Conrad M. Albrecht, Marcus Freitag, Siyuan Lu, Nigel

Hinds, Xiaoyan Shao, Sergio Bermudez Rodriguez, and Hendrik F. Hamann. "PAIRS: A scalable

geo-spatial data analytics platform." In Big Data (Big Data), 2015 IEEE International Conference

on, pp. 1290-1298. IEEE, 2015.

19. Sinnott, Richard O., Luca Morandini, and Siqi Wu. "SMASH: A cloud-based architecture for big

data processing and visualization of traffic data." In Data Science and Data Intensive Systems

(DSDIS), 2015 IEEE International Conference on, pp. 53-60. IEEE, 2015.

20. S. Cho, S. Hong and C. Lee, "ORANGE: Spatial big data analysis platform," In Big Data (Big

Data), 2016 IEEE International Conference on, pp. 3963-3965. IEEE, 2016.

21. Shah, Purnima, Deepak Hiremath, and Sanjay Chaudhary. "Big data analytics architecture for agro

advisory system." In High Performance Computing Workshops (HiPCW), 2016 IEEE 23rd

International Conference on, pp. 43-49. IEEE, 2016.

22. Shah, Purnima, Deepak Hiremath, and Sanjay Chaudhary. "Towards development of spark based

agricultural information system including geo-spatial data." In Big Data (Big Data), 2017 IEEE

International Conference on, pp. 3476-3481. IEEE, 2017.

23. Sahni, Sonali. "Ontology Based Agro Advisory System." Department of Computer Science and Engineering, IIT Mumbai, M. Tech. Thesis (2012).

24. Chaudhary, Sanjay, Minal Bhise, Asim Banerjee, Aakash Goyal, and Chetan Moradiya. "Agro advisory system for cotton crop." In Communication Systems and Networks (COMSNETS), 2015 7th International Conference on, pp. 1-6. IEEE, 2015.

25. Pappu, Nagaraju, Runa Sarkar, and T. V. Prabhakar. "Agropedia: Humanization of agricultural knowledge." IEEE Internet Computing 14, no. 5 (2010): 57-59.

26. Sini, Margherita, Vimlesh Yadav, Jeetendra Singh, Vikas Awasthi, and Prabhakar TV. "Knowledge models in agropedia indica." (2009).

27. de Oliveira, Tiago H. Moreira, Marco Painho, Vitor Santos, Otávio Sian, and André Barriguinha.

"Development of an agricultural management information system based on Open-Source

solutions." Procedia Technology 16 (2014): 342-354.

28. Kumar SK, Babu SDB (2016) A Web GIS Based Decision Support System for Agriculture Crop

Monitoring System-A Case Study from Part of Medak District. J Remote Sensing & GIS 5:177.

doi: 10.4172/2469-4134.1000177.

29. Han, Weiguo, Zhengwei Yang, Liping Di, and Richard Mueller. "CropScape: A Web service

based application for exploring and disseminating US conterminous geospatial cropland data

products for decision support." Computers and Electronics in Agriculture 84 (2012): 111-123.

30. Zhu, Zhiqing, Rongmei Zhang, and Jieli Sun. "Research on GIS-based agriculture expert system."

In Software Engineering, 2009. WCSE'09. WRI World Congress on, vol. 3, pp. 252-255. IEEE,

2009.

31. Zhang, Hao, Li Zhang, Yanna Ren, Juan Zhang, Xin Xu, Xinming Ma, and Zhongmin Lu. "Design and implementation of crop recommendation fertilization decision system based on WEBGIS at village scale." In International Conference on Computer and Computing Technologies in Agriculture, pp. 357-364. Springer, Berlin, Heidelberg, 2010.

32. Garg, Raghu, and Himanshu Aggarwal. "Big data analytics recommendation solutions for crop disease using Hive and Hadoop Platform." Indian Journal of Science and Technology 9, no. 32 (2016).

33. Lamrhari, Soumaya, Hamid Elghazi, Tayeb Sadiki, and Abdellatif El Faker. "A profile-based Big

data architecture for agricultural context." In Electrical and Information Technologies (ICEIT),

2016 International Conference on, pp. 22-27. IEEE, 2016.

34. Vitolo, Claudia, Yehia Elkhatib, Dominik Reusser, Christopher JA Macleod, and Wouter Buytaert. "Web technologies for environmental Big Data." Environmental Modelling & Software 63 (2015): 185-198.

35. Peisker, Anu, and Soumya Dalai. "Data analytics for rural development." Indian Journal of

Science and Technology 8, no. S4 (2015): 50-60.

36. Xie, N. F., X. F. Zhang, W. Sun, and X. N. Hao. "Research on Big Data Technology-Based

Agricultural Information System." In International Conference on Computer Information Systems

and Industrial Applications. Atlantis Press. 2015.

37. Chalh, Ridouane, Zohra Bakkoury, Driss Ouazar, and Moulay Driss Hasnaoui. "Big data open

platform for water resources management." In Cloud Technologies and Applications (CloudTech),

2015 International Conference on, pp. 1-8. IEEE, 2015.

38. Sayad, Younes Oulad, Hajar Mousannif, and Michel Le Page. "Crop management using Big

Data." In Cloud Technologies and Applications (CloudTech), 2015 International Conference on,

pp. 1-6. IEEE, 2015.

39. Website of spark-cassandra-connector: https://github.com/datastax/spark-cassandra-connector

40. Freeland Jr., Thomas B., Pettigrew, Bill, Thaxton, Peggy, and Andrew, Gordon L. "Agrometeorology and Cotton Production." Guide of Agricultural Meteorological Practices (GAMP), 2010 edition (WMO-No. 134), Chapter 13.

41. Osakwe, Emeka, “Cotton Fact Sheet India”, International Cotton Advisory Committee, May 19,

2009.

42. Fisher, R. A. "III. The influence of rainfall on the yield of wheat at Rothamsted." Phil. Trans. R. Soc. Lond. B 213, no. 402-410 (1925): 89-142.

43. Agrawal, Ranjana, and S. C. Mehta. "Weather based forecasting of crop yields, pests and diseases-

IASRI models." J. Ind. Soc. Agril. Statist 61, no. 2 (2007): 255-263.