Developing Big Data Analytics Architecture for Spatial Data
Synopsis
Submitted to
Ahmedabad University
For
The Degree of Doctor of Philosophy in
Information and Communication Technology
By
Purnima Rasiklal Shah (1451002)
School of Engineering and Applied Science,
Ahmedabad University, Ahmedabad – 380009, India.
Under Supervision of
Dr. Sanjay R. Chaudhary
Professor and Associate Dean
School of Engineering and Applied Science,
Ahmedabad University.
[September – 2018]
1 Introduction
In the mobile and Internet era, massive scale data is generated from disparate sources
with spatial components. Traditional methods often fail to handle high volume, high
speed, and variety of complex data in this fast and growing modern view of data
generation and consumption. The modern technologies like Big Data and Big Data
Analytics (BDA) have a huge potential to handle massive scale data with high
scalability and low latency. Though the modern big data management tools are highly
efficient, they offer limited functions and methods for spatial data management. In
addition, in modern application development, no single big data tool can
manage big data efficiently and effectively on its own. Hence, it is highly desirable
to exploit the potential features of big data tools and technologies and to propose
integrated frameworks and architectures built on top of more than one technology to
develop robust and powerful applications involving geospatial data.
The research presents an open source and scalable big data analytics architecture for
spatial data management. The main goal of the research work is to solve a wide range
of data problems by offering batch, iterative, and interactive computations in a unified
architecture. The proposed architecture is built and implemented on top of open
source big data frameworks. In comparison with the existing platforms and
architectures, the proposed architecture is in-memory, cost-effective, and open
source. The architecture is realized for the agriculture domain. As a proof of concept,
the spatial analytics applications are developed using agricultural real-life datasets.
The main focus of the research is to develop a big data analytics framework to load,
store, process, and query spatial data at scale. The framework is developed to enable
interactive query processing on spatial and non-spatial data via a web based REST
interface and distributed APIs. In comparison with the existing frameworks, the
proposed framework is implemented on an integrated big data infrastructure with a
new input data source, i.e. NoSQL database. The framework provides efficient and
scalable solutions for spatial data by developing distributed APIs for spatial
operations such as location search, proximity search, and K Nearest Neighbor (KNN)
search. The application layer is implemented to accelerate ad-hoc query processing
via a common web based REST interface. The user interface diverts user requests
to the suitable framework, either Cassandra or Spark: low latency queries are
executed on Cassandra, while aggregated and complex queries are performed on the
Spark cluster. The framework is evaluated by analyzing the performance of the
analytical operations in terms of latency against the variable size of data. The
performance of the framework is compared with the baseline technology, i.e.
Cassandra, for low latency queries.
2 Novelty and Contributions
The research work has developed an open source and scalable architecture to manage
massively distributed data including spatial data. In comparison with the existing
platforms and architectures, the proposed architecture is in-memory, cost-effective
and open source. The architecture builds and implements distributed and
scalable APIs for spatial data management on top of an analytics and integrated
infrastructure. The architecture is realized to develop analytical services for
agriculture and provide customized solutions to end users in the form of
interactive maps and RESTful ad-hoc services.
Recent findings from the literature suggest that there is no open source big data
analytics framework available which provides end-to-end solutions for spatial data,
i.e. from data loading to data retrieval, on top of an integrated big data infrastructure.
Previous research efforts have yet to demonstrate scalability on top of a big data stack.
The innovative idea is that the framework is able to store and process spatial data
without changing the underlying architecture of either Spark or Cassandra. The
framework is not only able to perform the described analytics operations but also
offers a unified infrastructure to perform more sophisticated and complex analytics on
spatial data. It can be extended to enhance the analytical capability with real-time and
streaming data.
The main contributions of the thesis are:
1. Open source big data analytics architecture is developed to perform relative
analytics on massively distributed data at scale.
2. The architecture is realized on top of open source frameworks and performs
relevant analytics on real-life datasets in the agriculture domain including
geospatial data.
3. The data preparation framework is implemented to collect, pre-process, and
integrate data coming from disparate and heterogeneous data sources.
4. A novel big data analytics framework is built and implemented on top of open
source technologies Spark and Cassandra. The framework is implemented to load,
store, process, and perform ad-hoc query processing on spatial and non-spatial
data at scale.
a. A NoSQL based spatial data storage framework is built and implemented.
b. Built and implemented distributed and scalable APIs for spatial operations
such as location search, proximity search, and KNN search.
c. Application architecture is implemented to accelerate interactive analytics
on spatial and non-spatial data. A common web based REST interface is
developed to divert user queries to the suitable framework, either
Cassandra or Spark.
d. The performance of the framework is evaluated in terms of latency against
the variable size of data.
e. The performance of the framework is compared with the baseline
technology, i.e. Cassandra for low latency queries.
5. The visualization layer is implemented to showcase the analytical results with
dynamic layouts. A dashboard application is designed and implemented to depict
the analytical results in the agriculture domain.
6. Big data applications in agriculture are developed as a proof of concept.
3 Literature Survey
The thesis has reviewed literature on: 1) Existing state-of-the-art systems for spatial
data management, and 2) Existing ICT based applications and systems in the
agriculture domain.
3.1 Existing Systems for Spatial Data Management
The thesis has reviewed existing database technologies including relational and
NoSQL databases for spatial data. It has also reviewed existing big spatial data
frameworks and architectures and compared with the proposed framework. The core
spatial functionalities provided by the state-of-the-art spatial databases including
NoSQL databases and computational frameworks are depicted in Table 1 and 2.
Table 1 State-of-the-art spatial databases

| Database | Supported Geometry Objects | Supported Geometry Functions | Spatial Index | Horizontally Scalable |
|---|---|---|---|---|
| PostGIS | Point, LineString, Polygon, MultiPoint, MultiPolygon, MultiLineString | OGC standard methods on geometry instances | B-Tree, R-Tree, GiST | No |
| MySQL | Point, LineString, Polygon, MultiPoint, MultiPolygon, MultiLineString | OGC standard methods on geometry instances | 2d plane index, B-trees | No |
| MongoDB [1] | Point, LineString, Polygon, MultiPoint, MultiPolygon, MultiLineString | Inclusion, intersection, distance/proximity | 2dsphere index, 2d index | Yes |
| Cassandra [2] | Point, Polygon, LineString | Distance/proximity, intersection, iswithin, isdisjointto | Solr/Lucene | Yes |
Table 2 State-of-the-art big spatial data processing frameworks

| Framework | Architecture | Index | Input Format | Spatial Geometry Operators | Spatial Geometry Objects | Language Support/Interface |
|---|---|---|---|---|---|---|
| Proposed Framework | Key-Value + RDD | Geohash | Cassandra | Circle range query, KNN query, point query, attribute query | Point | R [3], REST |

Disk-based systems:

| SpatialHadoop [4] | MapReduce | Grid / R-tree / R+-tree | HDFS compatible input formats | Spatial analysis and aggregation functions, joins, filter, box range query, KNN, distance join (via spatial join) | Point, LineString, Polygon | Pigeon [5] |
| Hadoop-GIS [6] | MapReduce | Uniform grid index | HDFS compatible input formats | Range query, self-join, join, containment, aggregation | Point, LineString, Polygon | HiveQL |

Memory-based systems:

| GeoSpark [7] | RDD, SRDD | R-Tree, Quadtree | CSV, GeoJSON, shapefiles, and WKT | Box range query, circle range query, kNN, distance join | Point, Polygon, Rectangle | Scala, Java |
| Magellan [8] | RDD | Z-order curve (default precision 30) | ESRI, GeoJSON, OSM-XML, and WKT | Intersects, contains, within | Point, LineString, Polygon, MultiPoint, MultiPolygon | Scala, extended SparkSQL |
| SparkSpatial [9] | RDD | Grid / kd-tree | A form of WKT in Hadoop File System (HDFS) | Box range query, circle range query, KNN, distance join (point-to-polygon distance), point-in-polygon | Point, Polygon | Impala |
| LocationSpark [10] | RDD | R-tree, Quadtree, IR-tree | HDFS compatible input formats | Range, kNN, spatial join, distance join, kNN join | Point, Rectangle | Scala |
| SIMBA [11] | RDD | R-tree | HDFS compatible input formats | Range, kNN, distance join, kNN join | Point | Scala / extended SparkSQL |
3.1.1 Big Spatial Data Architectures
Generally, big data architectures are designed and developed to achieve a specific
goal. Many big data architectures such as Lambda [12], Kappa [13], Liquid [14],
BDAS [15], SMACK [16] and HPCC [17] have been developed on top of the
integrated infrastructure.
Very few platforms and architectures, such as IBM PARIS [18], SMASH
[19], and ORANGE [20], have been developed for spatial data management. They
provide very high-level designs and architectures, but no source code or design
implementation guidelines for future designers. In addition, the existing
architectures give little consideration to non-spatial attributes in spatial analytics. In
comparison with the existing big spatial data architectures, the proposed architecture
is in-memory, cost-effective, and open source.
A number of big spatial data frameworks have been developed on top of big data
stack composed of Spark and Cassandra. The Cassandra-solr-spark framework has
been developed by Datastax to enable spatial query processing on top of big data
stack: Spark and Cassandra. The framework provides an SQL-like query interface to
perform spatial operations, but it does not support join operations and has not been
evaluated on performance metrics. P. Shah et al. [21, 22] have developed a
big data analytics framework including geospatial data on top of big data stack: Spark
and Cassandra. The spatial analytics applications in the agriculture domain have been
developed using third-party Geospark libraries. The major drawback of the
framework is data duplication.
3.2 Existing Agricultural Information Systems and Applications
The thesis has reviewed the existing ICT (Information and Communication
Technology) based agricultural systems and applications developed using the state-
of-the-art technologies such as web, mobile, GIS, Big Data, Big Spatial Data, etc. The
web and mobile based applications and systems (footnotes 1-8) [23, 24, 25, 26] are
reviewed in the first phase of research.
The web-GIS based systems in agriculture are reviewed in the next phase. [27, 28, 29,
30, 31] have demonstrated and implemented web-GIS based information systems for
agriculture using traditional GIS tools and technologies. These technologies are often
insufficient to provide a complete picture of analytics in a geographic context. The
thesis has also reviewed the big data applications and systems [32, 34, 35, 36, 37, 38]
that emerged in the agriculture domain to manage complex data at scale.
3.3 Research Gap
The research has identified the existing research gap for big spatial data management.
The modern big data storage and processing frameworks are providing very limited
geo-functionality and OGC standard methods in comparison with the traditional
systems. The research has also found that the use of big data in the agriculture domain
is in the initial stage, and more research efforts are required to develop big data
applications in the agriculture domain.
3.3.1 Shortcomings of Existing Systems for Spatial Data Management
Very few research efforts have been made to develop open source big data
architectures and platforms for spatial data management. Big spatial data
platforms and architectures like IBM PARIS are made accessible only under a proprietary
license. There are very few open source and integrated architectures which support
interactive and complex (e.g. joins, aggregations) query processing on spatial data.

1 http://www.esagu.in
2 http://www.bhoomi.karnataka.gov.in/
3 http://www.chandigarh.gov.in/egov_esmpk.htm
4 http://www.tcs.com/offerings/technologyproducts/mKRISHI/Pages/default.aspx
5 http://agropedia.iitk.ac.in/
6 http://www.icrisat.org/newsroom/news-releases/icrisat-pr-2014-media40.htm
7 http://www.icrisat.org/what-wedo/satrends/SATrends2009/satrends_october09.htm
8 http://mkisan.gov.in/downloadmobileapps.aspx
Traditional spatial database management systems are less efficient to store and
process massive scale geospatial data. The set of geospatial operations supported by
the state-of-the-art big data storage and processing frameworks
(NoSQL/Hadoop/Spark based) are very limited compared to the standard GIS
products, such as ArcGIS. The existing extensions to the NoSQL databases for spatial
data management lack the support for spatial aggregation and join operations. Spark
has no native support for spatial data. The existing Spark/Hadoop based frameworks
can only execute spatial operations on datasets that are available in text based
file formats (CSV/GeoJSON/shapefiles and WKT) and stored in HDFS or on local disk.
There is no big data analytics framework available which reads data from a NoSQL
database and performs spatial analytics on those data.
3.3.2 Shortcomings of Existing Agricultural Information Systems
In spite of a huge revolution in ICT based technologies and their intervention in the
agriculture domain, there is a large technological gap between farmers and
information. Moreover, in the technical aspect, the major barriers for big data
application development in agriculture are lack of tools, infrastructures, data
standards, semantics, integrated data models, developer APIs, unified access points
for public and private data, technical expertise, and finally data.
The research papers and projects mainly provide algorithmic solutions for
various applications like crop yield prediction and weather forecasting, but very little
work has been done on novel infrastructures and architectures for agricultural
big data including spatial data. There is an urgent need to manage agricultural data at
scale with specialized systems, techniques, and algorithms. The challenge is how to
exploit the full potential of open source technologies and resources in order to create
a sophisticated and customizable working environment for end users to improve the
productivity in agricultural practices.
4 Big Data Analytics Architecture for Spatial Data
The thesis presents a big data analytics architecture for spatial data management. The
architecture is developed on top of an integrated infrastructure which solves a wide
range of data problems by offering batch, iterative, interactive, and streaming
computations within the same commodity cluster. The architecture is built and
implemented on big data open source technologies for the enrichment of large data
sets including geospatial data. The architecture is designed to provide scalable,
flexible, extendible, and cost-effective solutions with available infrastructures and
tools for agriculture.
4.1 Architecture Implementation
The big data analytics architecture shown in figure 1 is realized for the agriculture
domain. The implementation is mainly divided into three stages: 1) Data preparation,
2) Big data analytics framework, and 3) Data Visualization.
Figure 1 Big data analytics architecture for agriculture
There are four types of user interaction with the architecture: 1) system developer, 2)
Data scientist, 3) Domain expert, and 4) End users. The data workflow between
software components within the data pipeline is illustrated in figure 2.
Figure 2 Workflow diagram of big data pipeline

4.1.1 Data Preparation Layer
Large scale and massively distributed data including spatial data are available at
disparate data sources. These data are noisy, inconsistent, and heterogeneous. There is
an urgent need to collect and clean such data and prepare them for analytics. The data
preparation layer is designed to perform: 1) Fetch and collect data from different
sources via web services, 2) Store data collected via web services into a data
repository, and 3) Data pre-processing and integration to get consistent data. Finally,
the consistent data is transferred to the persistence layer. The data preparation layer
provides two layers of data abstraction. First, it hides all physical data sources from
the data repository. Second, it further unifies the data available in a data repository
using various complex tools and techniques such as data fusion algorithms, schema
mapping tools, and record linkage algorithms.

The data preparation layer is designed to collect and process complex data coming
from heterogeneous and distributed data sources as inputs to corresponding services
to perform relevant analytics. The data preparation services are implemented to fetch
consistent and clean data from disparate sources and store them into a persistent database.
This layer is implemented in three steps: 1) Data extraction and collection, 2) Data
repository, and 3) Data pre-processing and integration.
4.1.2 Big Data Analytics Framework
The main purpose of the big data analytics framework is to enable spatial data
management on large scale data. The consistent datasets generated by the data
preparation services are processed and analyzed using the data analytics framework. The
analytical results are exposed to end users through visualization and a REST interface.
The design and development of the big data analytics framework are discussed in section
5.
4.1.3 Data Visualization
Data visualization makes complex data more accessible, understandable and usable.
For example, in the agriculture domain, data visualization is useful to identify crop
patterns, crop future trends, market trends, etc. The dashboard applications and user
interfaces are developed to depict analytical results using open source tools like R,
D3, Google API, etc.
The architecture provides a web based user interface by developing analytical and
visualization services through Restful ad-hoc APIs and interactive maps. Web based
interactive maps are implemented using R libraries (leaflet, ggplot, and shiny) in the
form of choropleth map, pulse markers, and plots. Restful web services are
implemented to explore data and analytical results using SparkSQL/CQL interface on
top of analytics framework.
5 Big Data Analytics Framework for Spatial Data
A big spatial data analytics framework is developed to load, store, process, and query
large scale spatial and non-spatial data. It is an integrated infrastructure designed to
manage spatial data efficiently and effectively by exploiting the potential features
provided by standard storage and processing big data frameworks. It is realized on top
of big data stack with Spark as a core processing engine and Cassandra as data
storage. The implementation architecture is shown in figure 3. The implementation is
divided into four layers: 1) Spatial data storage layer, 2) Spark core layer, 3) Spatial
data processing layer, and 4) Application layer.
Figure 3 Big data analytics framework for spatial data

5.1 Spatial Data Storage Layer
The framework ingests data from data preparation services and loads them into the
persistent database. The spatial data storage framework is developed to load and store
spatial data at scale. It is built on top of a distributed database, i.e. Cassandra. The
implementation architecture shown in figure 4 includes five phases: 1) Data loading,
2) Data storage, 3) Integration of big data storage and processing engine, 4) Associate
spatial index, and 5) Store spatial data frame into persistence.

Figure 4 Implementation architecture of spatial data storage framework
5.1.1 Data Loading
Raw datasets including spatial attributes are pulled from disparate sources in CSV
format using data preparation services. These data are loaded into the Cassandra
database using bulk loading, which is the fastest way to load a large CSV file into
Cassandra. The pseudo-code given below presents how data
are loaded into Cassandra.
Pseudo-code 1 - Load data into Cassandra database
Input: datafile_in_csv, keyspace, schema
Output: Load data into Cassandra database
1: The raw data is written to both the CommitLog and a Memtable.
2: The Memtable is flushed out to SSTables sstable[0], sstable[1],
sstable[2], ...
3: Combine SSTables into a single SSTable[n] using compaction
(optional).
4: Load SSTables into Cassandra using the sstableloader utility.
5.1.2 Data Storage
The raw dataset with spatial components is stored in the Cassandra database based on
a data model. The following pseudo-code describes how each partition is stored in the
Cassandra database.
Pseudo-code 2 - Store data into Cassandra database
Input: datafile_in_csv, keyspace, schema
Output: store data into Cassandra database
1: Read data from the input file.
2: Determine the partitions based on the partition key.
3: Map each partition to a token value using murmur3partitioner
(default partitioner in Cassandra).
4: Allocate each partition of Cassandra to a particular node of a
cluster based on token range owned by that particular node.
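The token-based placement in steps 2-4 can be sketched in plain Python. This is a simplified illustration, not Cassandra's actual implementation: Python's built-in `hash` stands in for the Murmur3Partitioner, the ring is reduced to equal contiguous ranges, and all names are mine.

```python
BEGIN = -2**63  # Murmur3 tokens span [-2^63, 2^63 - 1]

def build_ring(nodes, num_tokens=2**64):
    """Assign each node an equal, contiguous slice of the token range."""
    slice_size = num_tokens // len(nodes)
    ring, start = [], BEGIN
    for node in nodes:
        ring.append((start, start + slice_size - 1, node))
        start += slice_size
    return ring

def token_for(partition_key):
    """Stand-in tokenizer: map a key into the signed 64-bit token space."""
    return (hash(partition_key) % 2**64) + BEGIN

def node_for(ring, partition_key):
    """Route a partition to the node that owns its token range."""
    token = token_for(partition_key)
    for lo, hi, node in ring:
        if lo <= token <= hi:
            return node
    return ring[-1][2]  # guard against rounding at the ring edge

ring = build_ring(["node1", "node2", "node3", "node4"])
owner = node_for(ring, "sensor-42")  # one of the four nodes
```

In Cassandra itself the partitioner hash is deterministic across nodes, which is what makes the same key always land on the same replica set.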
5.1.3 Integration of Big Data Storage and Processing Engine
Spark is used as a core processing engine which processes data stored in the
Cassandra database. The Spark-Cassandra connector [39] is the key component which
aligns the Spark-Cassandra data distribution. The Cassandra partitions are imported
into Spark memory in the form of Dataframe object using Spark-Cassandra
connector.
The following pseudo-code shows how Cassandra partitions are mapped into Spark
partitions based on data distribution performed by Spark-Cassandra connector.
Pseudo-code 3 - Align Cassandra partitions into Spark partitions
Input: Cassandra partitions, spark.cassandra.input.split.size_in_mb
Output: Spark partitions
1: Read Cassandra partitions from Cassandra cluster.
2: Divide token ranges owned by Cassandra nodes based on input
parameter spark.cassandra.input.split.size_in_mb which determines
the number of spark partitions.
3: Align Cassandra partitions to Spark partitions.
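The effect of `spark.cassandra.input.split.size_in_mb` on step 2 can be shown with a back-of-the-envelope calculation (a sketch only; the connector derives the table size from Cassandra's own size estimates, and the function name is mine):

```python
import math

def estimated_spark_partitions(table_size_mb, split_size_mb=512):
    """Approximate the number of Spark partitions the connector creates:
    the table's estimated size divided by the configured split size."""
    return max(1, math.ceil(table_size_mb / split_size_mb))

# e.g. a ~5 GB table with a 512 MB split size yields about 10 partitions
parts = estimated_spark_partitions(5 * 1024)  # -> 10
```

Smaller split sizes produce more, finer-grained Spark partitions and hence more parallelism, at the cost of more task overhead.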
5.1.4 Associate Spatial Index
Geohash9 is a hierarchical spatial data structure used for indexing spatial data. Each
record presented in Spark partitions is associated with Geohash character string using
one-to-one transformation. It creates a new Spark Dataframe with a spatial index. The
resultant Spark Dataframe is called spatial data frame. The spatial data frame is stored
back into the Cassandra database. The pseudo-code given below describes how
spatial index is associated with each record of Spark Dataframe.
Pseudo-code 4 - Associate spatial index to Spark partitions
Input: Spark object
Output: Spark object
1: Read Spark partitions from Spark object.
2: Map each record (.., latitude, longitude,…) of Spark partitions
into (Geohash, wkt,……) by applying custom functions geohash
(latitude, longitude, precision) and WKT(latitude, longitude).
3: Return Spark object.
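Step 2's one-to-one transformation can be sketched as a plain Python Geohash encoder (written here for illustration; in the framework this logic would run inside a Spark map over each Dataframe record):

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # Geohash alphabet

def geohash(latitude, longitude, precision=7):
    """Encode a (latitude, longitude) pair into a Geohash string by
    interleaving longitude/latitude bisection bits, 5 bits per character."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits, even = [], True  # even bit positions refine longitude
    while len(bits) < precision * 5:
        if even:
            mid = (lon_lo + lon_hi) / 2
            if longitude >= mid:
                bits.append(1); lon_lo = mid
            else:
                bits.append(0); lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if latitude >= mid:
                bits.append(1); lat_lo = mid
            else:
                bits.append(0); lat_hi = mid
        even = not even
    # pack every 5 bits into one base-32 character
    return "".join(BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
                   for i in range(0, precision * 5, 5))

code = geohash(57.64911, 10.40744, precision=11)  # -> "u4pruydqqvj"
```

Because nearby points share Geohash prefixes, using the Geohash as the partition key clusters spatially close records into the same Cassandra row.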
5.1.5 Store Spatial Data Frame into Persistence
Now, each record of a spatial data frame is associated with Geohash code. The spatial
data model is designed such that each rectangular area represented by the Geohash
code is mapped to a Cassandra row. The Geohash attribute is modeled as a partition
key in the spatial data model. The spatial data model is shown in figure 5 where each
row represents a rectangular area and the columns store the spatial objects that fall
within that rectangular area. The spatial attributes are stored in WKT format.
9 Geohash WG. Geohash. https://www.en.wikipedia.org/wiki/Geohash.
Figure 5 Spatial data model
Pseudo-code 5 describes how the spatial data frame is stored back into Cassandra
based on the spatial data model.
Pseudo-code 5 - Store spatial data into Cassandra database
Input: spark object, keyspace, schema,
cassandra.output.batch.size.bytes, cassandra.output.batch.size.rows
Output: store spatial data into Cassandra database
1: Identify Spark partitions in Spark object with spatial
components.
2: Design spatial data model.
3: Map each rectangular area represented by Geohash code to a
Cassandra row.
4: Create micro batches of Spark partitions based on input
parameters cassandra.output.batch.size.bytes,
cassandra.output.batch.size.rows.
5: Write micro batches into Cassandra database.
The write operation from Spark to Cassandra is challenging: it needs to collect
all Spark partitions into memory and store them back into Cassandra. The write
performance from Spark to Cassandra is optimized by creating micro batches of the
data available in Spark partitions. The number of batches and the batch size are determined
from the metadata available on each node of the Cassandra cluster. The connector
converts the data into a number of batches and then writes each batch into Cassandra
based on the partition key.
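The micro-batching described above can be sketched in Python (a toy sketch: the real batching is performed by the Spark-Cassandra connector using serialized row sizes, so `micro_batches` and its size stand-in are illustrative only):

```python
def micro_batches(rows, max_rows=32, max_bytes=1024):
    """Group rows into micro batches, closing a batch when either the
    row-count or byte-size limit is reached (mirroring the connector's
    cassandra.output.batch.size.rows / .bytes parameters)."""
    batch, size = [], 0
    for row in rows:
        row_size = len(repr(row).encode())  # stand-in for serialized size
        if batch and (len(batch) >= max_rows or size + row_size > max_bytes):
            yield batch
            batch, size = [], 0
        batch.append(row)
        size += row_size
    if batch:
        yield batch

rows = [{"geohash": f"tdr{i:04d}", "wkt": "POINT (72.5 23.0)"}
        for i in range(100)]
batches = list(micro_batches(rows, max_rows=32))
```

Tuning the two limits trades fewer round trips against larger, heavier writes on each Cassandra node.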
5.2 Spark Core Layer
Spark core layer provides user friendly APIs for machine learning, graph processing,
and structured data processing using Spark SQL. It performs fast data processing on
large datasets by exploiting data parallelism and partitioning techniques. However,
Spark doesn’t support native partitioning for spatial data. For example, proximity
based partitioning for spatial data is not supported. Developers would need to
write wrappers to perform spatial operations. The distributed APIs are developed to
perform spatial operations using standard Spark core APIs. The spatial APIs extend
the Spark core functionalities for spatial data.
5.3 Spatial Data Processing Layer
The spatial data processing layer creates an interface to query spatial and non-spatial
data available in the form of Spark Dataframe object. The distributed APIs are
implemented for spatial operations: 1) Proximity search, 2) KNN search, and 3)
Location search.
5.3.1 Proximity Search
Proximity search queries are widely used in many analytics applications. Circle range
queries are used to find the number of spatial objects within a specified circular range
from a given query point. A distributed API for proximity search is,
circle_range_query(spark_object, query_point, range)
The circular range query is implemented based on Algorithm 1. It returns the
approximate number of spatial objects within the given circular range.
Algorithm 1 - An algorithm to find spatial objects within a circular
range
Input: spark_object, query_point, range
Output: spatial objects within circular range
1: Find Geohash covered by circular search space using ProximityHash
algorithm
2: Filter the Spark partitions based on bounded Geohash
3: Collect the spatial objects from the filtered Spark partitions
4: Return spatial objects
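Algorithm 1 can be sketched in plain Python. The sketch replaces the distributed Spark filter with a list comprehension and, instead of the ProximityHash prefilter, applies an exact haversine check directly; the function names and the toy records are illustrative only:

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in km."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def circle_range_query(records, qlat, qlon, range_km):
    """Return the records within range_km of the query point.
    records: iterable of (lat, lon, payload) tuples."""
    return [r for r in records
            if haversine_km(r[0], r[1], qlat, qlon) <= range_km]

pickups = [(23.03, 72.58, "trip-1"),   # a few km from the query point
           (28.61, 77.21, "trip-2")]   # hundreds of km away
nearby = circle_range_query(pickups, 23.02, 72.57, range_km=50)
```

In the framework the Geohash prefilter first prunes whole partitions, so the exact distance check only runs over the candidates inside the bounding cells.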
5.3.2 KNN Search
K nearest neighbor search queries are used to find K nearest spatial objects from a
given query point. K nearest objects are found using selection and merge method. A
distributed API for KNN search is,
KNN_query(spark_object, query_point, K)
The KNN search is implemented based on Algorithm 2.
Algorithm 2 - An algorithm to find K nearest neighbors from a query
point
Input: spark_object, query_point, K
Output: K neighbors from query_point
1: Select a large enough circular search space defined by r.
2: Find Geohashes covered by the circular search space.
3: Sort subspaces based on the distance between the query point and
the Geohash.
4: for each subspace do
       Find k' nearest neighbor points from the query point
       if k' >= K then
           goto step 7
       endif
   end for
5: Merge points from all subspaces and find k' nearest neighbor
points.
6: if k' < K then
       Expand search space based on Geohash precision
       if search space out of bound
           return error
       else
           repeat steps 1-5
       endif
   endif
7: Return K spatial objects
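The selection-and-merge idea behind Algorithm 2 can be sketched as follows (a single-process illustration: the Geohash subspaces are faked as pre-grouped lists and the search-space expansion is omitted):

```python
import heapq

def knn_merge(subspaces, query, K, dist):
    """Select up to K nearest candidates from each subspace, then merge
    all candidate lists and keep the global K nearest."""
    candidates = []
    for points in subspaces:
        candidates.extend(
            heapq.nsmallest(K, points, key=lambda p: dist(p, query)))
    return heapq.nsmallest(K, candidates, key=lambda p: dist(p, query))

# 1-D toy example: subspaces are pre-grouped point lists, distance is |p - q|
subspaces = [[1.0, 5.0, 9.0], [2.0, 8.0], [7.0, 3.0]]
nearest = knn_merge(subspaces, query=2.5, K=3,
                    dist=lambda p, q: abs(p - q))
```

Selecting per subspace first keeps each partition's work local; only the small candidate lists are merged, which is what makes the approach suitable for distributed execution.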
5.3.3 Location Search
The location search query is used to find a particular location within a dataset along
with its attribute data. The distributed APIs to search a given location without and with
the Geohash index are shown below:

point_query(spark_object, query_point)
point_query_with_index(spark_object, query_point)

Location search without an index is very slow as it requires scanning each partition of
a data frame, whereas location search with the spatial index is fast: it finds a location
within a single partition defined by the Geohash index. The implementation of the
location search query without and with an index is described in pseudo-code 6 and 7
respectively.
Pseudo-code 6 - Find spatial attributes associated with a
particular location without a spatial index
Input: spark_object, query_point
Output: spatial attributes at a query_point
1: Read each Spark partition of a Spark object.
2: Search query_point within each partition, i.e. a full table scan.
3: Collect the spatial object.
4: Return spatial object, spatial attributes.
Pseudo-code 7 - Find spatial attributes associated with a
particular location with spatial index
Input: spark_object, query_point
Output: spatial attributes at a query_point
1: Find Geohash associated with query_point.
2: Filter Spark partition based on Geohash value.
3: Search query_point within the filtered Spark partition.
4: Collect the spatial object.
5: Return spatial object, spatial attributes.
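The contrast between pseudo-code 6 and 7 can be sketched with partitions modeled as a dict keyed by Geohash (hypothetical toy data; in the framework the filter runs over distributed Spark partitions):

```python
def point_query(partitions, query_point):
    """Full scan: visit every partition (pseudo-code 6)."""
    for records in partitions.values():
        for point, attrs in records:
            if point == query_point:
                return attrs
    return None

def point_query_with_index(partitions, geohash_of, query_point):
    """Indexed lookup: scan only the single partition whose Geohash
    matches the query point (pseudo-code 7)."""
    records = partitions.get(geohash_of(query_point), [])
    for point, attrs in records:
        if point == query_point:
            return attrs
    return None

# toy index: partitions keyed by a fake one-character "Geohash"
partitions = {
    "t": [((23.03, 72.58), {"crop": "cotton"})],
    "u": [((57.64, 10.40), {"crop": "wheat"})],
}
geohash_of = lambda p: "t" if p[0] < 50 else "u"
attrs = point_query_with_index(partitions, geohash_of, (23.03, 72.58))
```

The indexed variant touches one partition instead of all of them, which is where the latency gap between the two queries comes from.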
5.4 Application Layer
The framework provides a convenient web based REST interface to the end user.
Cassandra performs fast data retrieval based on partition key and clustering key
compared to Spark. The application architecture enables end users to execute
ad-hoc queries on a suitable framework, either Spark or Cassandra, via a common user
interface. The low latency queries are executed on Cassandra, whereas complex
queries (e.g. aggregated and spatial queries) are executed on the Spark framework.
The implementation architecture of the application layer is shown in figure 6.
Figure 6 Implementation architecture of application layer
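The routing rule described above can be reduced to a small dispatch function. The sketch below is illustrative only, and the operation names are hypothetical rather than the thesis API:

```python
# Hypothetical operation names; the rule is the one stated in the text:
# key-based low latency lookups go to Cassandra, aggregate/spatial queries to Spark.
LOW_LATENCY_OPS = {"lookup_by_partition_key", "lookup_by_partition_and_clustering_key"}
COMPLEX_OPS = {"aggregate", "group_by", "point_query", "proximity_search", "knn_search"}

def route_query(operation):
    """Return which framework the application layer should execute the query on."""
    if operation in LOW_LATENCY_OPS:
        return "cassandra"
    if operation in COMPLEX_OPS:
        return "spark"
    raise ValueError(f"unknown operation: {operation}")
```

A single entry point like this lets the common user interface stay unaware of which backend ultimately serves the query.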
5.5 Experiments and Results
The big data analytics framework is evaluated using a benchmark dataset. The
scalability test is performed in terms of latency against the variable size of data. The
performance of the framework is compared with the baseline technology, i.e.
Cassandra, for low latency queries.
5.5.1 Experimental Setup
All experiments are conducted on a cluster consisting of four nodes. Each node runs
Ubuntu 14.04 with Spark 2.1.0 and Cassandra 3.10. Each node is equipped with an
Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz (1 CPU, 4 physical cores, a total of 8
logical CPU units) with 16 GB RAM. The Spark cluster is deployed in standalone
mode. The Spark-Cassandra connector is used to integrate the Spark and Cassandra
frameworks. The big spatial analytics framework is implemented and deployed using
the Sparklyr framework. The web based RESTful ad-hoc APIs are built and
implemented to explore analytical results on top of the Shiny framework.
5.5.2 Description of Dataset
The NYC taxi dataset is an open database. It contains 167 million records in 30 GB.
Each record describes a taxi trip made by a particular driver at a particular date and
time. Each record has 16 attributes: hack_license, two 2D coordinates representing
the pickup and drop-off locations, pickup date_time, drop-off date_time, and nine
other attributes carrying other related information. All experiments are done on a
sample dataset containing about 30 million records in 5 GB of storage.
5.5.3 Performance Evaluation
Sample datasets of variable sizes (5 million to 30 million records; 800 MB to 5 GB)
are generated to evaluate the performance of the system. All experiments are
performed on a 2-dimensional vector dataset. The pickup location is considered a 2D
spatial object. Different test cases are designed to evaluate the performance of the
queries such as attribute search query, location search query, proximity search query,
KNN search query, and low latency query. The performance is evaluated in terms of
average latency against the variable size of datasets.
5.5.3.1 Load data into Cassandra
The performance of the loading phase is evaluated using load time vs. data size. The
performance of bulk loading is shown in figure 7.
10 “Shiny,” http://www.shiny.rstudio.com
11 “NYC Taxi Trips,” http://www.andresmh.com/nyctaxitrips/
5.5.3.2 Establish a big data pipeline
The big data analytics pipeline is established in three phases; 1) Read data from
Cassandra, 2) Associate Geohash index to each record of a dataset, and 3) Write
indexed data back into Cassandra. The total elapsed time is calculated by adding the
time required by each phase of an analytics pipeline. The performance in terms of
total elapse time against the variable size of datasets is shown in figure 8.
Figure 7 Bulk load performance (effect of data size)
Figure 8 Analytics pipeline performance (effect of data size)
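The three-phase pipeline and its summed elapsed time can be sketched generically in Python; the stage bodies below are toy placeholders, not the thesis code:

```python
import time

def run_pipeline(stages):
    """stages: list of (name, fn). Run each phase in order, timing it; as in the
    text, the total elapsed time is the sum of the per-phase times."""
    timings, result = {}, None
    for name, fn in stages:
        start = time.perf_counter()
        result = fn(result)            # each phase consumes the previous output
        timings[name] = time.perf_counter() - start
    return result, timings, sum(timings.values())

# The three phases of the indexing pipeline, with toy stand-ins for real work:
stages = [
    ("read_from_cassandra", lambda _: [(40.7130, -74.0060), (51.5000, -0.1000)]),
    ("attach_geohash_index", lambda rows: [(p, "gh") for p in rows]),  # placeholder index
    ("write_back_to_cassandra", lambda rows: len(rows)),
]
```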
5.5.3.3 Attribute search analysis
Non-spatial data associated with spatial data are equally important in spatial analytics.
The non-spatial data express the characteristics of spatial data. The attribute search
queries are used to perform various operations such as selection, filtration,
aggregation, and group_by on non-spatial data. Twenty-five test queries are executed
and then calculated the average latency. The attribute query performance in terms of
latency against the different size of datasets is shown in figure 9. Attribute search is
performed using the REST abstraction provided by the application layer.
5.5.3.4 Location search analysis
Location search is used to find the characteristics at a particular location. Twenty-five
test cases each are performed against pickup locations with and without the Geohash
index. The spatial index plays an important role in fast data retrieval. It is observed
that location search queries execute faster when they are referred through the Geohash
index. Location search without an index requires a full table scan, which degrades
query performance. The performance of the location search query is shown in figure 10.
Location search is performed via distributed API as well as a REST interface.
Figure 9 Attribute search performance (effect of data size)
Figure 10 Location search performance (effect of data size)
5.5.3.5 Proximity search analysis
Ten test cases are performed with the range varying from 500 meters to 3000 meters
from a given query point. The query point is selected using the sampling method. The
results show that proximity search using the Geohash index offers outstanding
performance. The performance of the circular range query is shown in figure 11.
Proximity search is performed via distributed API as well as a REST interface.
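The circular range query reduces to an exact-distance filter once Geohash pruning has produced candidate points. A plain-Python sketch of this refinement step (illustrative only; the thesis executes it over Spark partitions, and the Geohash pruning step is omitted here):

```python
import math

def haversine_m(p, q):
    """Great-circle distance between (lat, lon) pairs, in meters."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(a))

def proximity_search(candidates, query_point, radius_m):
    """Keep candidate points within radius_m of the query point. In the thesis
    pipeline, candidates are first pruned to Geohash cells overlapping the
    circle; this function is the exact-distance refinement over them."""
    return [(p, a) for p, a in candidates
            if haversine_m(p, query_point) <= radius_m]
```

The pruning step keeps the refinement cheap: only points in nearby Geohash cells ever reach the haversine test.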
5.5.3.6 KNN search analysis
Ten test queries are performed to find the K (K=5 to K=25) nearest pickup locations
from a given query point. The query points are chosen from high density areas using
the sampling method. The performance of the KNN search query is shown in figure 12. KNN
search is performed via distributed API as well as a REST interface.
Figure 11 Proximity search performance (effect of data size)
Figure 12 KNN search performance (effect of data size)
5.5.3.7 Low latency query analysis
Low latency queries are designed based on the partition key or a combination of the
partition key and clustering key of a Cassandra table. Twenty-five test queries are executed.
The result shows that Cassandra outperforms Spark for low latency queries. Figure 13
depicts the query performance. Low latency query is performed by using REST
abstraction provided by the application layer.
Figure 13 Low latency query performance (effect of data size)
6 Realization of Architecture in Application Domain
The prototype applications in the agriculture domain are developed on top of the big
data analytics architecture. As part of the implementation, an agricultural
information management system is developed to reduce the technological gap
between agro-users and information. The information system is implemented to
collect, query, analyze, and visualize heterogeneous and distributed data, including
geospatial data, at scale using open source technologies.
6.1 Data Collection, Pre-processing, and Integration
There are very few open datasets or digital data available in the agriculture domain,
especially in India. Hence, data collection is the prime challenge in developing big
data applications in the agriculture domain. Spatial and non-spatial data on weather,
crop, and market are collected from different sources such as meteorological
departments, agriculture universities, and web portals. The summary of data
collection is given in Table 3.
Table 3 Data collection
Datasets | Data source | Description | Data format
Weather data (1992 – 2007) | Archived data, www.Indiastat.com | Data collected for seven districts of Gujarat | Spreadsheet/document
Crop data for cotton crop (1960 – 2007) | www.Indiastat.com, http://apy.dacnet.nic.in/crop_fryr_toyr.aspx, archived data | Data collected for eighteen districts of Gujarat | Spreadsheet/document
Market data | http://agmarknet.gov.in/ | Data collected for 429 agro-markets in Gujarat | Document
Current weather data | Openweathermap API | Global API for weather data | JSON format
Spatial data for Gujarat | www.diva-gis.org | Spatial data | Shapefile
Data pre-processing and integration techniques such as handling missing values,
removing duplicate data, data validation, similarity search, and joins are applied to
obtain consistent datasets on weather, crop, and market. The big data analytics
applications including geospatial data are implemented on top of the big data
analytics framework.
6.2 Big Data Applications in Agriculture
The big data applications in agriculture are developed on top of the proposed
architecture. Web based analytics and visualization services are developed for the
cotton crop in Gujarat, India. Cotton is an important non-food crop which provides
lint to the textile industry, high protein feed to livestock, oil for human consumption,
and byproducts used as fertilizer and as raw material for paper and cardboard [40].
India is the second largest cotton producer and consumer [41], and Gujarat stands
second in cotton production after Maharashtra. Cotton is a major cash crop in
Gujarat. Effective advisory for the cotton crop may lead to productivity growth in
cotton, and thereby to economic growth in Gujarat.
6.2.1 Big Data Application for Crop Yield Prediction
A big data application is developed to predict cotton crop yield based on two weather
parameters: average temperature and rainfall. The Multiple Linear Regression (MLR)
algorithm [42], [43] is implemented using the MLlib library of Spark. The yield
prediction results are shown in Table 4.
Table 4 Experimental results
District | R-squared | Actual yield in 2007 (Bales/’00 Hectare) | Predicted yield in 2007 (Bales/’00 Hectare) | Difference (Bales/’00 Hectare)
Vadodara | 0.44 | 447 | 359 | 88
Bharuch | 0.53 | 285 | 176 | 109
Jamnagar | 0.83 | 695 | 682 | 13
Amreli | 0.68 | 643 | 476 | 167
Surat | 0.78 | 545 | 489 | 56
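The regression step can be illustrated independently of Spark. The thesis fits the model with Spark MLlib; the sketch below solves the same two-predictor regression (yield against average temperature and rainfall) by ordinary least squares in plain Python, and any numbers used with it are made up rather than drawn from the thesis dataset:

```python
def fit_mlr(rows):
    """rows: list of (avg_temp, rainfall, yield). Returns (intercept, b_temp,
    b_rain) by solving the normal equations (X^T X) beta = X^T y, where X has
    an intercept column of ones."""
    X = [(1.0, t, r) for t, r, _ in rows]
    y = [v for _, _, v in rows]
    n = len(X)
    XtX = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(3)]
           for i in range(3)]
    Xty = [sum(X[k][i] * y[k] for k in range(n)) for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting on the 3x3 augmented system.
    A = [XtX[i] + [Xty[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda row: abs(A[row][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for row in range(3):
            if row != col:
                factor = A[row][col] / A[col][col]
                A[row] = [a - factor * b for a, b in zip(A[row], A[col])]
    return [A[i][3] / A[i][i] for i in range(3)]

def predict_yield(beta, avg_temp, rainfall):
    """Apply the fitted coefficients to new weather parameters."""
    return beta[0] + beta[1] * avg_temp + beta[2] * rainfall
```

On Spark the same fit is distributed over partitions, but the fitted coefficients play the identical role in prediction.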
The resultant crop yield prediction map for the cotton crop is shown in figure 14. The
crop production, land usage, and yield trend from 1960-2007, as well as the actual vs.
predicted crop yield trend, is displayed on the user interface by the polygon click
event. Various recommendations based on the crop yield prediction model are
provided to agro-users. The recommendations will help farmers to have an idea of
yield estimates based on ongoing weather parameters. It will help farmers to make
decisions on whether to grow that particular crop or find an alternate crop in case
yield predictions are unfavorable.
Figure 14 Snapshot of crop yield prediction for cotton crop in Gujarat
6.2.2 Big Data Application for Weather Analytics
A big data application is developed to analyze the weather of various districts of
Gujarat. The current weather data is collected using the global weather service –
Openweathermap API. The historical weather data are stored in Cassandra DB. The
aggregation of data is performed on top of the Spark-Cassandra big data stack. The
Spark core APIs for aggregation are used to find the monthly average. The current
hottest location in Gujarat is determined using the SparkSQL window function. The
application integrates the current and historical weather data. The choropleth map
shown in figure 15 depicts the results via dynamic layouts. The map depicts daily and
monthly data for average temperature from 1992 to 2007 on the user interface by the
polygon click event. It also depicts the hottest location with a pulse marker.
Figure 15 Snapshot of weather analytics application
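The "current hottest location" computation uses a SparkSQL window function; the equivalent per-district latest-reading logic can be sketched in plain Python (district names and readings below are invented for illustration):

```python
def hottest_location(readings):
    """readings: list of (district, timestamp, temp_c). Keep each district's
    latest reading (the window-function step), then take the district with the
    maximum current temperature."""
    latest = {}
    for district, ts, temp in readings:
        if district not in latest or ts > latest[district][0]:
            latest[district] = (ts, temp)
    return max(latest.items(), key=lambda item: item[1][1])[0]
```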
6.2.3 Big Data Application for Agro-Market Data Analytics
A big data application is developed to search and analyze agro-market data for the
cotton crop. The market locations are geo-referenced using the Google map service.
Some of the market locations are geo-referenced manually due to the lack of data.
Spatial operations are performed on the market data. Figure 16 depicts agro-markets
and shows information at a particular location on the marker click event.
6.2.4 Big Data Application for Crop Data Aggregation
A big data application is developed to perform aggregation on cotton crop production,
land usage, and yield data. The crop data are averaged over the years from 1960 to
2007. The result of crop production data aggregation is shown in figure 17. The maps
depict the trend in a popup window on the polygon click event.
Figure 16 Snapshot of agro-market search analysis
Figure 17 Snapshot of crop production data aggregation
7 Conclusion
The overall research focus is to develop an open source and scalable big data
analytics architecture for massive scale data management including spatial data. The
architecture is developed to address two big data challenges: Variety and Volume.
The architecture has three major components: data preparation, data analytics, and
data visualization. The data preparation layer is designed and developed to fetch and
collect massive scale data from disparate sources. The data preparation services are
implemented with two levels of data abstraction. A big data analytics framework is
implemented to load, store, process, and query spatial and non-spatial data at scale.
The framework implements analytics operations such as selection, filtration,
aggregation, location search, proximity search, and KNN search. The results are
explored through distributed APIs and RESTful ad-hoc APIs. The
application layer is designed to accelerate ad-hoc query processing by diverting the
user queries to the suitable framework. The experimental results show that the
framework has achieved efficient storage management and high computational
processing through Spark and Cassandra integration. The data visualization layer is
developed to showcase the analytical results.
The architecture is realized for the agriculture domain. An agricultural information
system is developed as a proof of concept. The information system is implemented
by developing big data applications such as crop yield prediction, crop data
aggregation, and weather and agro-market analytics.
7.1 Implementation Status - Present and Future
The big data analytics architecture is designed and developed for massive scale data
management including spatial data. The data preparation layer is designed with two
levels of data abstraction. As part of the implementation, a REST interface is
designed and implemented to fetch and collect data from different data sources in
formats such as PDF, spreadsheets, documents, web pages, and online services. The
integration of data collected through the REST interface is the most critical module in
the data preparation layer. An algorithmic solution is to be devised to link a variety of
data from diverse sources in support of unified search, query, and analysis.
The core component of the big data analytics architecture, i.e. the big data analytics
framework, is implemented for spatial data management. The framework is to be
extended by developing complex spatial operations like spatial join and kNN join.
Spatial applications like spatial aggregation and spatial auto-correlation are to be
developed on top of the framework.
The complex applications in the agriculture domain are to be developed by
identifying new data sources, formats, and data types. The real-life datasets including
real-time and streaming data are to be collected and stored in a data repository to
perform further analytics. The near real-time data analytics and visualization
algorithms are to be devised to process real-time data like weather, disaster, etc. The
analytical services like rainfall prediction, crop recommendation, crop price
prediction, agro-inputs procurement, supply chain management, crop disease alerts,
fertilizer recommendations, etc., are to be implemented. These services are used to
generate customized and multilingual solutions in the form of weather based crop
calendar and alerts based on adverse events. The big spatial data applications are to
be developed. For example, find aggregated weather per agro-climatic zone, find the
number of regions having rainfall below the threshold value, and find the nearby
warehouse.
Publications by Author
1. Shah, Purnima, and Sanjay Chaudhary. "Big Data Analytics Framework for
Spatial Data." In Sixth International Conference on Big Data Analytics, pp. 250-
265. Springer, Cham, 2018.
2. Shah, Purnima and Sanjay Chaudhary. "Big Data Analytics and Integration
Platform for Agriculture”, In the Proceedings of Research Frontiers in Precision
Agriculture (Extended Abstract), AFITA/WCCA 2018 Conference, IIT Bombay,
India, October 24-26, 2018.
3. Shah, Purnima, Deepak Hiremath, and Sanjay Chaudhary. "Towards development
of spark based agricultural information system including geo-spatial data." In Big
Data (Big Data), 2017 IEEE International Conference on, pp. 3476-3481. IEEE,
2017. (h-index – 33, citation_count - 1)
4. Shah, Purnima, Deepak Hiremath, and Sanjay Chaudhary. "Big data analytics
architecture for agro advisory system." In High Performance Computing
Workshops (HiPCW), 2016 IEEE 23rd International Conference on, pp. 43-49.
IEEE, 2016. (h-index – 14, citation_count - 7)
5. Shah, Purnima, Deepak Hiremath, and Sanjay Chaudhary. “Big Data Analytics
for Crop Recommendation System”, 7th Intl. Workshop on Big Data
Benchmarking (WBDB 2015), New Delhi, 14-15 Dec. 2015, organized by ISI
Delhi center, and IIPH Hyderabad.
Bibliography
1. Website of MongoDB, http://www.mongodb.org
2. Lakshman, Avinash, and Prashant Malik. "Cassandra: a decentralized structured storage system."
ACM SIGOPS Operating Systems Review 44, no. 2 (2010): 35-40.
3. Team, R. Core. "R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria. 2013." (2014).
4. Eldawy, Ahmed, and Mohamed F. Mokbel. "Spatialhadoop: A mapreduce framework for spatial
data." In Data Engineering (ICDE), 2015 IEEE 31st International Conference on, pp. 1352-1363.
IEEE, 2015.
5. Eldawy, Ahmed, and Mohamed F. Mokbel. "Pigeon: A spatial mapreduce language." In 2014
IEEE 30th International Conference on Data Engineering (ICDE), pp. 1242-1245. IEEE, 2014.
6. Aji, Ablimit, Fusheng Wang, Hoang Vo, Rubao Lee, Qiaoling Liu, Xiaodong Zhang, and Joel
Saltz. "Hadoopgis: a high performance spatial data warehousing system over mapreduce."
Proceedings of the VLDB Endowment 6, no. 11 (2013): 1009-1020.
7. Yu, Jia, Jinxuan Wu, and Mohamed Sarwat. "Geospark: A cluster computing framework for
processing large-scale spatial data." In Proceedings of the 23rd SIGSPATIAL International
Conference on Advances in Geographic Information Systems, p. 70. ACM, 2015.
8. Website of Magellan, https://github.com/harsha2010/magellan;
https://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/
9. Web site of Spatialspark. http://simin.me/projects/spatialspark/
10. Tang, Mingjie, Yongyang Yu, Qutaibah M. Malluhi, Mourad Ouzzani, and Walid G. Aref.
"Locationspark: a distributed in-memory data management system for big spatial data."
Proceedings of the VLDB Endowment 9, no. 13 (2016): 1565-1568.
11. Xie, Dong, Feifei Li, Bin Yao, Gefei Li, Liang Zhou, and Minyi Guo. "Simba: Efficient in-
memory spatial analytics." In Proceedings of the 2016 International Conference on Management
of Data, pp. 1071-1085. ACM, 2016.
12. “Lambda Architecture,” http://lambda-architecture.net/, 2014.
13. “Kappa Architecture,” http://radar.oreilly.com/2014/07/ questioning-the-lambda-architecture.html,
2014.
14. Fernandez, Raul Castro, Peter R. Pietzuch, Jay Kreps, Neha Narkhede, Jun Rao, Joel Koshy, Dong
Lin, Chris Riccomini, and Guozhang Wang. "Liquid: Unifying Nearline and Offline Big Data
Integration." In CIDR. 2015.
15. Website of Berkeley Data Analysis Stack, https://amplab.cs.berkeley.edu/software/
16. Website of HPCC, https://hpccsystems.com/.
17. Estrada, Raul, and Isaac Ruiz. "Big Data SMACK." Apress, Berkeley, CA (2016).
18. Klein, Levente J., Fernando J. Marianno, Conrad M. Albrecht, Marcus Freitag, Siyuan Lu, Nigel
Hinds, Xiaoyan Shao, Sergio Bermudez Rodriguez, and Hendrik F. Hamann. "PAIRS: A scalable
geo-spatial data analytics platform." In Big Data (Big Data), 2015 IEEE International Conference
on, pp. 1290-1298. IEEE, 2015.
19. Sinnott, Richard O., Luca Morandini, and Siqi Wu. "SMASH: A cloud-based architecture for big
data processing and visualization of traffic data." In Data Science and Data Intensive Systems
(DSDIS), 2015 IEEE International Conference on, pp. 53-60. IEEE, 2015.
20. S. Cho, S. Hong and C. Lee, "ORANGE: Spatial big data analysis platform," In Big Data (Big
Data), 2016 IEEE International Conference on, pp. 3963-3965. IEEE, 2016.
21. Shah, Purnima, Deepak Hiremath, and Sanjay Chaudhary. "Big data analytics architecture for agro
advisory system." In High Performance Computing Workshops (HiPCW), 2016 IEEE 23rd
International Conference on, pp. 43-49. IEEE, 2016.
22. Shah, Purnima, Deepak Hiremath, and Sanjay Chaudhary. "Towards development of spark based
agricultural information system including geo-spatial data." In Big Data (Big Data), 2017 IEEE
International Conference on, pp. 3476-3481. IEEE, 2017.
23. Sahni, Sonali. "Ontology Based Agro Advisory System." Department of Computer Science and
Engineering, IIT Mumbai, M. Tech. Thesis (2012).
24. Chaudhary, Sanjay, Minal Bhise, Asim Banerjee, Aakash Goyal, and Chetan Moradiya. "Agro
advisory system for cotton crop." In Communication Systems and Networks (COMSNETS), 2015
7th International Conference on, pp. 1-6. IEEE, 2015.
25. Pappu, Nagaraju, Runa Sarkar, and T. V. Prabhakar. "Agropedia: Humanization of agricultural
knowledge." IEEE Internet Computing 14, no. 5 (2010): 57-59.
26. Sini, Margherita, Vimlesh Yadav, Jeetendra Singh, Vikas Awasthi, and Prabhakar TV.
"Knowledge models in agropedia indica." (2009).
27. de Oliveira, Tiago H. Moreira, Marco Painho, Vitor Santos, Otávio Sian, and André Barriguinha.
"Development of an agricultural management information system based on Open-Source
solutions." Procedia Technology 16 (2014): 342-354.
28. Kumar SK, Babu SDB (2016) A Web GIS Based Decision Support System for Agriculture Crop
Monitoring System-A Case Study from Part of Medak District. J Remote Sensing & GIS 5:177.
doi: 10.4172/2469-4134.1000177.
29. Han, Weiguo, Zhengwei Yang, Liping Di, and Richard Mueller. "CropScape: A Web service
based application for exploring and disseminating US conterminous geospatial cropland data
products for decision support." Computers and Electronics in Agriculture 84 (2012): 111-123.
30. Zhu, Zhiqing, Rongmei Zhang, and Jieli Sun. "Research on GIS-based agriculture expert system."
In Software Engineering, 2009. WCSE'09. WRI World Congress on, vol. 3, pp. 252-255. IEEE,
2009.
31. Zhang, Hao, Li Zhang, Yanna Ren, Juan Zhang, Xin Xu, Xinming Ma, and Zhongmin Lu. "Design
and implementation of crop recommendation fertilization decision system based on WEBGIS at
village scale." In International Conference on Computer and Computing Technologies in
Agriculture, pp. 357-364. Springer, Berlin, Heidelberg, 2010.
32. Garg, Raghu, and Himanshu Aggarwal. "Big data analytics recommendation solutions for crop
disease using Hive and Hadoop Platform." Indian Journal of Science and Technology 9, no. 32
(2016).
33. Lamrhari, Soumaya, Hamid Elghazi, Tayeb Sadiki, and Abdellatif El Faker. "A profile-based Big
data architecture for agricultural context." In Electrical and Information Technologies (ICEIT),
2016 International Conference on, pp. 22-27. IEEE, 2016.
34. Vitolo, Claudia, Yehia Elkhatib, Dominik Reusser, Christopher JA Macleod, and Wouter
Buytaert. "Web technologies for environmental Big Data." Environmental Modelling & Software
63 (2015): 185-198.
35. Peisker, Anu, and Soumya Dalai. "Data analytics for rural development." Indian Journal of
Science and Technology 8, no. S4 (2015): 50-60.
36. Xie, N. F., X. F. Zhang, W. Sun, and X. N. Hao. "Research on Big Data Technology-Based
Agricultural Information System." In International Conference on Computer Information Systems
and Industrial Applications. Atlantis Press. 2015.
37. Chalh, Ridouane, Zohra Bakkoury, Driss Ouazar, and Moulay Driss Hasnaoui. "Big data open
platform for water resources management." In Cloud Technologies and Applications (CloudTech),
2015 International Conference on, pp. 1-8. IEEE, 2015.
38. Sayad, Younes Oulad, Hajar Mousannif, and Michel Le Page. "Crop management using Big
Data." In Cloud Technologies and Applications (CloudTech), 2015 International Conference on,
pp. 1-6. IEEE, 2015.
39. Website of spark-cassandra-connector: https://github.com/datastax/spark-cassandra-connector
40. Freeland Jr., Thomas B., Pettigrew, Bill, Thaxton, Peggy, Andrew, Gordon L., "Agrometeorology
and Cotton Production," Guide of Agricultural Meteorological Practices (GAMP), 2010 edition
(WMO-No. 134), Chapter 13.
41. Osakwe, Emeka, “Cotton Fact Sheet India”, International Cotton Advisory Committee, May 19,
2009.
42. Fisher, R. A. "III. The influence of rainfall on the yield of wheat at Rothamsted." Phil. Trans.
R. Soc. Lond. B 213, no. 402-410 (1925): 89-142.
43. Agrawal, Ranjana, and S. C. Mehta. "Weather based forecasting of crop yields, pests and diseases-
IASRI models." J. Ind. Soc. Agril. Statist 61, no. 2 (2007): 255-263.