eTransactions & Data Science II: The Power of Big Data & Data Analytics

Decision Support Systems Laboratory, NTUA – Electronic Transactions 2020



Open Data


In a Nutshell…

Open Data is a philosophy and practice requiring that certain data are freely available to everyone, without restrictions from copyright, patents, or other mechanisms of control, in a timely and accessible way.




8 Principles of Open Data

1. Data Must Be Complete

2. Data Must Be Primary

3. Data Must Be Timely

4. Data Must Be Accessible

5. Data Must Be Machine Processable

6. Access Must Be Non-Discriminatory

7. Data Formats Must Be Non-Proprietary

8. Data Must Be License-Free


Open Data Publication

• Top-down approach:
  • A national plan for coordinating data publication is created by committees involving all stakeholders before public organizations actually release any data
  • Defining and reaching consensus on a consistent set of terms and their relations (an ontology)

• Bottom-up approach:
  • Data should be published by all public organizations
  • Any interested party can use the available data in raw formats
  • Coordination efforts to join them together should follow at a later stage


Metadata

Imagine a supermarket...
• With goods without labels?
• Without signs and directions?

Imagine a library...
• Without titles on the books?
• Without shelves organized by subject?

Now you may imagine information on the World Wide Web.

[Figure: a document annotated with its metadata fields – title, author, abstract, time period, sources, supplemental information, (file) size. Source: CSC Brands]


Who Benefits from Open Data?

1. A government org publishes data
2. Citizens & developers engage, providing feedback
3. That govt. org incorporates feedback, improving data
4. Demonstrable use inspires that govt. org to publish more
5. More data attracts more data consumers
6. Positive interaction inspires more governments to follow suit


The Open Data Publisher/Subscriber Equation

Transforming governments from data collectors → data producers → data publishers


Open Data Initiatives


Why Open Data?

• More information might lead to more informed and better decisions

• Higher degree of effectiveness and efficiency

• Strengthen trust

• Leverage benefits of peer production

• New business models

• “People's right to know”


Linked Data


Open Data… Is this enough?

★ make your stuff available on the Web (whatever format) under an open license
★★ make it available as structured data (e.g., Excel instead of image scan of a table)
★★★ make it available in a non-proprietary open format (e.g., CSV instead of Excel)
★★★★ use URIs to denote things, so that people can point at your stuff
★★★★★ link your data to other data to provide context

[Source: http://5stardata.info/en/]


Linked Data – the idea

• The main strength of the World Wide Web lies in the ability to link between different web pages.

• This way, a webpage may provide its reader with a link to another website in order to retrieve additional information about a topic.

• Could we apply the same principle on data?

• Linked data, much like websites, can live on different places, be maintained by different organizations, and still be used as a single system from the user’s perspective.


Linked Data: structured data which is interlinked with other data so it becomes more useful through semantic queries.


Linked Data – Examples


Linked Data Principles

1. Use URIs as names for things.
2. Use HTTP URIs so that people can look up those names.
3. When someone looks up a URI, provide useful information, using the standards (RDF and SPARQL queries).
4. Include links to other URIs, so that they can discover more things.

• A Uniform Resource Identifier (URI) is a string of characters used to identify a resource.
• The Resource Description Framework (RDF) is a standard model for data interchange on the Web.
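To make the principles concrete, here is a minimal sketch using the Python rdflib library (the library choice, URIs, and names are illustrative assumptions, not part of the slides): things are named with HTTP URIs, described as RDF triples, linked to other URIs, and queried with SPARQL.

from rdflib import Graph, URIRef, Literal
from rdflib.namespace import RDF, FOAF

g = Graph()
alice = URIRef("http://example.org/people/alice")   # an HTTP URI naming a thing

# RDF triples (subject, predicate, object) describing the resource
g.add((alice, RDF.type, FOAF.Person))
g.add((alice, FOAF.name, Literal("Alice")))
g.add((alice, FOAF.knows, URIRef("http://example.org/people/bob")))  # link to another URI

# A SPARQL query over the graph returns "useful information" about the resource
for row in g.query(
        "SELECT ?name WHERE { ?p a foaf:Person ; foaf:name ?name . }",
        initNs={"foaf": FOAF}):
    print(row.name)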


Towards a Linked Open Data Web

Source: Sören Auer. Linked Data Tutorial


From Web of Documents to Web of Data

• Web of Documents: designed for human consumption; data silos in the Web.

• Web of Data: designed for machine consumption; interconnecting available data.



RDFa

• RDFa → RDF in attributes

• A way to mark up data in a web page

• RDFa encodes triples in HTML

• Useful for agents and (relatively) easy for humans

https://www.slideshare.net/tuttogaz/a-semantic-data-model-for-web-applications
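For illustration, a hand-written fragment (the vocabulary and values are made-up assumptions) showing how RDFa attributes such as vocab, typeof, and property embed triples in ordinary HTML:

<!-- RDFa: the vocab/typeof/property attributes turn this markup into triples -->
<div vocab="http://xmlns.com/foaf/0.1/" typeof="Person">
  <span property="name">Alice</span>
  <a property="homepage" href="http://example.org/alice">Alice's homepage</a>
</div>

An RDFa-aware agent extracts from this the triples "…is a foaf:Person", "…has foaf:name 'Alice'", and "…has foaf:homepage <http://example.org/alice>", while a browser just renders it as normal HTML.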


Structure - Ontologies


Linked Open Data Cloud (a while ago…)



Linked Data – Characteristics

• Linked Data allows us to easily reference the same entity in different datasets.

• Using linked data we can refer to and extend data external to our organization.

• Linked data usage is ideal for data exchange between different systems, especially when each one of the systems only maintains part of the overall information regarding each entity.

• Linked data usage reduces the cost of data exchange and maintenance, while increasing the cost of data generation and usage.

• Benefits of linked data greatly depend on correct usage of the paradigm and well-designed datasets.


Big Data Analytics Technologies


Big Data Landscape (2012)


Big Data Landscape (2016)


Big Data Landscape (2018)


Big Data Technologies

There are six primary needs that Big Data technologies address:

1. Distributed storage and processing
2. Non-relational databases with low latency
3. Streams and complex event processing
4. Processing of special big-data data types
5. In-memory processing
6. Reporting


Hadoop and MapReduce


MapReduce & Hadoop

In 2004, Google published a paper on a process called MapReduce.

The MapReduce concept provides a parallel processing model, and an associated implementation was released to process huge amounts of data. With MapReduce, queries are split and distributed across parallel nodes and processed in parallel (the Map step). The results are then gathered and delivered (the Reduce step). The framework was very successful, so others wanted to replicate the algorithm. Therefore, an implementation of the MapReduce framework was adopted by an Apache open-source project named Hadoop.

Studies in 2012 showed that a multiple-layer architecture is one option to address the issues that big data presents. A distributed parallel architecture distributes data across multiple servers; these parallel execution environments can dramatically improve data processing speeds. This type of architecture inserts data into a parallel DBMS, which implements the use of the MapReduce and Hadoop frameworks. This type of framework looks to make the processing power transparent to the end user by using a front-end application server.


MapReduce phases

MapReduce is a distributed, fault-tolerant system used for parallel programming and processing of huge data. Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python, and C++.

MapReduce programs are parallel in nature, and thus are very useful for performing large-scale data analysis using multiple machines in the cluster.

The whole process goes through four phases of execution: splitting, mapping, shuffling, and reducing.

Input Splits: An input to a MapReduce job is divided into fixed-size pieces called input splits. An input split is a chunk of the input that is consumed by a single map.

Mapping: This is the very first phase in the execution of a MapReduce program. In this phase, the data in each split is passed to a mapping function to produce output values. In our example, the job of the mapping phase is to count the number of occurrences of each word in the input splits.

Shuffling: This phase consumes the output of the Mapping phase. Its task is to consolidate the relevant records from the Mapping phase output. In our example, the same words are clubbed together along with their respective frequencies.

Reducing: In this phase, output values from the Shuffling phase are aggregated. This phase combines values from the Shuffling phase and returns a single output value. In short, this phase summarizes the complete dataset.


MapReduce (word count example)
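A minimal word-count sketch in Python, in the style of Hadoop Streaming (an assumption — the slides don't prescribe a binding): the mapper emits (word, 1) pairs, the framework's shuffle sorts them by key, and the reducer sums adjacent counts for each word.

# mapper.py — Map phase: emit "word<TAB>1" for every word in the split
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py — Reduce phase: input arrives sorted by key (the shuffle/sort),
# so all counts for the same word are adjacent and can be summed in one pass
import sys
current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, 0
    count += int(n)
if current is not None:
    print(f"{current}\t{count}")

Locally, `cat input.txt | python mapper.py | sort | python reducer.py` simulates the same map–shuffle–reduce pipeline that the cluster runs in parallel.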


Executing jobs with MapReduce


Hadoop modules

The base Apache Hadoop framework is composed of the following modules:

• Hadoop Common – contains libraries and utilities needed by other Hadoop modules;

• Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;

• Hadoop YARN – introduced in 2012, a platform responsible for managing computing resources in clusters and using them for scheduling users' applications;

• Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing.

The term Hadoop is often used for both the base modules and sub-modules and also the ecosystem, or collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, and Apache Storm.


HDFS architecture

HDFS is a distributed, fault-tolerant file system used to back the computation of Big Data.


HDFS architecture


File types for Hadoop

• Text documents

• CSV/TSV files

• JSON records

• Avro files
  • Avro is quickly becoming the top choice for developers due to its multiple benefits. Avro stores metadata with the data itself and allows specification of an independent schema for reading the file.

• Parquet files
  • Parquet is a columnar file format. Parquet also enjoys features like compression and query-performance benefits, but is generally slower to write than non-columnar file formats.

• ORC files
  • ORC files are compressed columnar files that enable faster queries.
  • But ORC doesn't support schema evolution.
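A small PySpark sketch of the practical difference (file names and the column are hypothetical): the same data rewritten as columnar, compressed Parquet, where a query can then read only the column it needs.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("file-formats").getOrCreate()

df = spark.read.csv("events.csv", header=True, inferSchema=True)  # row-oriented text
df.write.mode("overwrite").parquet("events.parquet")              # columnar + compressed

# A columnar scan touches only the 'user_id' column instead of whole rows
spark.read.parquet("events.parquet").select("user_id").distinct().show()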


Hadoop


Challenges of using Hadoop

What are the challenges of using Hadoop?

• MapReduce programming is not a good match for all problems. It's good for simple information requests and problems that can be divided into independent units, but it's not efficient for iterative and interactive analytic tasks. MapReduce is file-intensive. Because the nodes don't intercommunicate except through sorts and shuffles, iterative algorithms require multiple map-shuffle/sort-reduce phases to complete. This creates multiple files between MapReduce phases and is inefficient for advanced analytic computing.

• There's a widely acknowledged talent gap. It can be difficult to find entry-level programmers who have sufficient Java skills to be productive with MapReduce. That's one reason distribution providers are racing to put relational (SQL) technology on top of Hadoop. It is much easier to find programmers with SQL skills than MapReduce skills. And Hadoop administration seems part art and part science, requiring low-level knowledge of operating systems, hardware, and Hadoop kernel settings.

• Data security. Another challenge centers around the fragmented data security issues, though new tools and technologies are surfacing. The Kerberos authentication protocol is a great step toward making Hadoop environments secure.

• Full-fledged data management and governance. Hadoop does not have easy-to-use, full-featured tools for data management, data cleansing, governance, and metadata. Especially lacking are tools for data quality and standardization.


Hadoop Main Distributions

Cloud Providers


NoSQL Databases


NoSQL (Non-Relational Databases)



Relational databases

RDBMS: Relational database management systems

ACID properties:
• Atomicity
• Consistency
• Isolation
• Durability

RDBMS strengths:
• Defined schema
• Transactions
• Limitless indexing
• A very strong language for dynamic, cross-table queries (SQL)

Issues with RDBMS – scalability:
• Issues with scaling up when the dataset is too big.
• Not (usually) designed to be distributed, because joins are expensive and hard to scale horizontally.
• Looking at multi-node DB solutions (horizontal scaling).


Relational databases


Need for NoSQL

• Explosion of social media sites (Facebook, Twitter, Google, YouTube etc.) with large data needs.

• Rise of cloud-based solutions (e.g., Amazon S3).

• Need to handle large amounts of data quickly (horizontal scaling).

• Diversity of the available information.

• High connectivity between web elements.


What is NoSQL

• Stands for “Not Only SQL”

• Do not require a fixed table schema

• Relaxation of one or more of the ACID properties

4 types of NoSQL databases:

• Key-value pair based

• Column based

• Document based

• Graph based


Key-value stores

• “One key, one value, no duplicates, and crazy fast”.

• Simplest NoSQL databases, use of hash table.

• Data has no required format; data may have any format.

• The value is a binary object, aka a “blob”, that the DB does not understand and does not want to understand.

• Basic operations: Insert(key, value), Fetch(key), Update(key), Delete(key)
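These four operations map directly onto a real key-value store; a minimal sketch with the Python redis client, assuming a local Redis server (the key and value are made up):

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

r.set("session:42", '{"user": "alice", "cart": [17, 23]}')  # Insert / Update
blob = r.get("session:42")                                   # Fetch: returns an opaque blob
print(blob)                                                  # interpreting it is the app's job
r.delete("session:42")                                       # Delete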


Column based

• Often referred to as “BigTable clones”

• The column is the lowest/smallest instance of data.

• They store data as Column families containing rows that have many columns associated with a row key. Column families are groups of related data that is accessed together.

Google BigTable

Statistics about Facebook Search (Cassandra)


Document based

• Stores and retrieves documents.

• Like key-value stores, but the value part holds a document with a complex data structure.

• Self-describing, hierarchical tree data structures. Documents can contain many different key-value pairs, or key-array pairs, or even nested documents.

Couch DB
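For illustration, a made-up document of the kind such stores hold — key-value pairs, an array, and a nested document in one self-describing record:

{
  "_id": "order-1001",
  "customer": { "name": "Alice", "email": "alice@example.org" },
  "items": [
    { "sku": "A17", "qty": 2 },
    { "sku": "B23", "qty": 1 }
  ],
  "paid": true
}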


Graph based

• “A relational database is a collection of loosely connected tables, while a graph database is a multi-relational graph.”

• Graph databases store entities and the relationships between them as the nodes and edges of a graph. Entities have properties.

• Traversing the relationships is very fast, as relationships between nodes are not calculated at query time but are persisted.


What do we need?

• We need a distributed database system with the following features:
  • Fault tolerance
  • High availability
  • Consistency
  • Scalability

Which is impossible!!

According to the CAP theorem....


CAP Theorem

Pick 2 from:
• Consistency
• Availability
• Partition tolerance

To scale out we have to partition. That leaves a choice between consistency and availability.

Everyone who builds big applications builds them on CAP: Yahoo, Google, Facebook, Amazon, eBay, etc.


Advantages of NoSQL

• Massive data stores.

• Data are replicated in multiple nodes.
  • When data are written, the latest version is on at least one node and is then replicated to others.

• No single point of failure.

• Easy to distribute.

• Scalability.

• Do not require a schema.

• Better when it is more important to have fast data than correct data.


What is NOT provided by NoSQL (…or is difficult)

• Joins

• Group by

• ACID Transactions

• SQL


Querying Big Data


• Hadoop, Spark, and NoSQL are great tools for their purpose, but they don't fit 100% of the audience.

• There is a basic skill that every programmer is familiar with: SQL.

• Solution: SQL on top of Hadoop.

• Very useful for interactive and exploratory analytics.


Apache Hive

Hive is an Apache-licensed, open-source query engine written in the Java programming language, used for summarizing, analyzing, and querying data stored on Hadoop. It was initially introduced by Facebook and later open-sourced.

Pros

• It is stable as it has been around for many years.

• Hive is also open-source with a great community should you need help using it.

• It uses HiveQL, a SQL-like querying language which can be easily understood by RDBMS experts.

• Supports Text File, RCFile, SequenceFile, ORC, Parquet, and Avro file formats.

Cons

• Hive relies on MapReduce to execute queries which makes it relatively slow compared to querying engines like Cloudera Impala, Spark or Presto.

• Hive only supports structured data. So if your data is largely unstructured, Hive isn’t an option.
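A short HiveQL sketch (table name, columns, and location are hypothetical) of the RDBMS-like feel: a table is declared over files already sitting in HDFS, and a plain SQL aggregate is then executed as MapReduce jobs under the hood.

-- Declare a table over existing files in HDFS (schema-on-read)
CREATE EXTERNAL TABLE page_views (user_id STRING, url STRING, ts TIMESTAMP)
STORED AS PARQUET
LOCATION '/data/page_views';

-- An ordinary SQL aggregate, compiled by Hive into MapReduce jobs
SELECT url, COUNT(*) AS hits
FROM page_views
GROUP BY url
ORDER BY hits DESC
LIMIT 10;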


Cloudera Impala

A real-time, Apache-licensed, open-source, massively parallel processing (MPP) SQL-on-Hadoop querying engine written in the C++ programming language and currently shipped by Cloudera, MapR, Amazon, and Oracle. With Impala, you can query data, whether stored in HDFS or Apache HBase – including SELECT, JOIN, and aggregate functions – in real time.

Pros

• Impala provides real time querying on data stored on Hadoop clusters.

• It’s fast. The fact that it doesn’t use MapReduce to execute its queries makes it faster than Hive.

• It uses HiveQL, making it easy for data analysts coming from an RDBMS background to understand and use.

• Enterprise installation is supported because it is backed by Cloudera.

Cons

• Impala only has support for the Parquet, RCFile, SequenceFile, and Avro file formats. So if your data is in ORC format, you will be faced with a tough job transitioning your data.

• Supports only Cloudera’s CDH, MapR, and AWS platforms.


Presto

Presto is another massively parallel processing (MPP), open-source, SQL-on-Hadoop querying engine, developed by Facebook to query databases on different sources at high speed irrespective of the volume, velocity, and variety of data they contain. It is currently backed by Teradata and has been employed by AirBnB, Dropbox, Netflix, and Uber.

Pros
• Presto supports the Text, ORC, Parquet, and RCFile file formats. This makes it a great query engine of choice without worrying about transforming your existing data into a new format.
• It works well with Amazon S3 storage and queries data from any source at the scale of petabytes simultaneously and in seconds.
• Great support from the open-source community will ensure Presto is around for much longer.
• Enterprise support is provided by Teradata — a big data analytics and marketing applications company.

Cons
• Being largely open source, it is not advisable to deploy Presto if you aren't capable of supporting and debugging issues with Presto yourself, unless you decide to work with a vendor like Teradata.
• It doesn't have its own storage layer, so queries involving inserts or writes to HDFS are not supported.


Apache Spark SQL

Apache Spark is a cluster computing framework that runs on Hadoop. It comes with a built-in module called Spark SQL, which enables the execution of pure SQL queries against multiple data stores.

Pros
• It is very fast. Spark SQL executes batch queries in the Spark framework 10–100 times faster than Hive with MapReduce.
• Spark provides full compatibility with Hive data, queries, and user-defined functions (UDFs).
• Spark provides APIs (Application Programming Interfaces) in various languages (Java, Scala, Python), which makes it possible for developers to write applications in those languages.
• Apache Spark and Spark SQL boast larger open-source community support than Presto.

Cons
• Apache Spark consumes lots of RAM, which makes it expensive in terms of cost.
• It is still maturing, and as such, it is not considered stable yet.
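A minimal Spark SQL sketch in PySpark (the input file and columns are hypothetical): a JSON file becomes a DataFrame, is registered as a view, and is queried with plain SQL.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql-demo").getOrCreate()

df = spark.read.json("purchases.json")        # hypothetical input file
df.createOrReplaceTempView("purchases")       # expose the DataFrame to SQL

spark.sql("""
    SELECT product, SUM(amount) AS total
    FROM purchases
    GROUP BY product
    ORDER BY total DESC
""").show()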


Google BigQuery

BigQuery is a cloud database solution provided by Google which executes queries on large amounts of data in seconds. Being a full database solution and not just another query engine means that it provides its own storage and a query engine, and it uses SQL-like commands to run queries against data stored in it.

Pros
• Google BigQuery is a plug-and-play solution for big data, in that you don't worry about server management: you import your data into its own storage and begin querying it, while BigQuery handles performance, memory allocation, and CPU optimization implicitly.
• It has strong backing from Google, making it a very stable product.
• BigQuery supports standard SQL syntax.
• Moving data from other cloud storage solutions like Amazon S3 into GCS (Google Cloud Storage) is easy and hassle-free using the transfer manager.
• Great support for enterprise users.

Cons
• It can become very expensive if you query your data a lot — Google also charges per data processed by a query.
• Queries with lots of joins are not that fast.
• You have to move your data into BigQuery's storage system before you can query it.


Benchmarking

Source: https://cdn2.hubspot.net/hubfs/488249/Asset%20PDFs/Benchmark_BI-on-Hadoop_Performance_Q4_2016.pdf


Big Data Analytics with Apache Spark


Apache Spark

• Apache Spark is a powerful unified analytics engine for large-scale distributed data processing and machine learning.

• It provides in-memory computation for increased speed, processing data faster than MapReduce.

• It runs on top of an existing Hadoop cluster, can access the Hadoop data store (HDFS), and can process data from databases and streaming data.

• Spark facilitates the implementation of iterative algorithms, which visit their data set multiple times in a loop, interactive/exploratory data analysis and graph processing.


Spark vs. Hadoop MapReduce

The key difference between them lies in the approach to processing: Spark can do it in-memory, while Hadoop MapReduce has to read from and write to disk. As a result, the speed of processing differs significantly – Spark may be up to 100 times faster. However, the volume of data processed also differs: Hadoop MapReduce is able to work with far larger data sets than Spark.

MapReduce is good for:
• Linear processing of huge data sets. In case the resulting dataset is larger than available RAM, Hadoop MapReduce may outperform Spark.
• An economical solution, if no immediate results are expected.

Spark is good for:
• Fast data processing. In-memory processing makes Spark faster than Hadoop MapReduce – up to 100 times for data in RAM and up to 10 times for data in storage.
• Iterative processing (machine learning algorithms).
• Near-real-time & graph processing.
• Joining datasets.


The Spark framework stack

Programming: Scala, Python, Java, R, SQL
Library: Spark SQL, MLlib, GraphX, Spark Streaming
Engine: Spark Core
Management: YARN, Mesos, Spark Scheduler
Storage: Local, HDFS, S3, RDBMS, NoSQL, Streams


Spark Core

Spark Core is the foundation of the overall project. It provides distributed task dispatching, scheduling, and basic I/O functionalities, exposed through an API centered on the RDD abstraction.

Parallel operations such as map, filter, reduce, joins, and others can be scheduled for execution in parallel on the cluster, taking RDDs as input and producing new RDDs.

RDDs are immutable and their operations are lazy;

fault-tolerance is achieved by keeping track of the "lineage" of each RDD (the sequence of operations that produced it) so that it can be reconstructed in the case of data loss.
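A small PySpark sketch of these ideas (the log file name is hypothetical): each transformation only extends the RDD's lineage, and nothing runs until the final action.

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

lines = sc.textFile("app.log")                       # RDD over a (hypothetical) file
errors = lines.filter(lambda l: "ERROR" in l)        # lazy transformation
pairs = errors.map(lambda l: (l.split()[0], 1))      # lazy transformation
counts = pairs.reduceByKey(lambda a, b: a + b)       # lazy transformation

print(counts.take(5))   # action: runs the whole lineage; the recorded lineage
                        # also lets Spark recompute lost partitions on failure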


Spark libraries

Spark SQL is a Spark module for structured and semi-structured data processing. It provides a programming abstraction called DataFrames and a domain-specific language to manipulate them, and it can also act as a distributed SQL query engine.

Spark MLlib is a distributed machine-learning framework on top of Spark Core. Many common machine learning and statistical algorithms have been implemented and are shipped with MLlib, which simplifies large-scale machine learning pipelines, including statistics, classification, transformations, regression, clustering, and more.

GraphX is a graph computation engine built on top of Spark that enables users to interactively build, transform, and reason about graph-structured data at scale.

Spark Streaming enables powerful interactive and analytical applications across both streaming and historical data. It ingests data in mini-batches and performs RDD transformations on those mini-batches of data. This design enables the same set of application code written for batch analytics to be used in streaming analytics.
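A minimal Spark Streaming sketch of the mini-batch model (host and port are placeholders; e.g., feed it text with `nc -lk 9999`): the word-count transformations have the same shape as batch RDD code, applied to each 5-second mini-batch.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "stream-demo")
ssc = StreamingContext(sc, batchDuration=5)          # 5-second mini-batches

lines = ssc.socketTextStream("localhost", 9999)      # text stream from a socket
counts = (lines.flatMap(lambda l: l.split())         # same code shape as batch RDDs
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                      # print each mini-batch's counts

ssc.start()
ssc.awaitTermination()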


Spark program execution


DAG (Directed Acyclic Graph)


Analytics with Spark


Streaming Data Analytics


What is Streaming Data?

Streaming data is data that is generated continuously by thousands of data sources, which typically send in the data records simultaneously and in small sizes (on the order of kilobytes). Streaming data includes a wide variety of data, such as log files generated by customers using your mobile or web applications, ecommerce purchases, in-game player activity, information from social networks, financial trading floors, or geospatial services, and telemetry from connected devices or instrumentation in data centers.

This data needs to be processed sequentially and incrementally, on a record-by-record basis or over sliding time windows, and used for a wide variety of analytics including correlations, aggregations, filtering, and sampling. Information derived from such analysis gives companies visibility into many aspects of their business and customer activity – such as service usage (for metering/billing), server activity, website clicks, and the geo-location of devices, people, and physical goods – and enables them to respond promptly to emerging situations. For example, businesses can track changes in public sentiment on their brands and products by continuously analyzing social media streams, and respond in a timely fashion as the necessity arises.


Streaming Data Examples

Sensors in transportation vehicles, industrial equipment, and farm machinery send data to a streaming application. The application monitors performance, detects any potential defects in advance, and places a spare part order automatically, preventing equipment downtime.

A financial institution tracks changes in the stock market in real time, computes value-at-risk, and automatically rebalances portfolios based on stock price movements.

A real-estate website tracks a subset of data from consumers' mobile devices and makes real-time recommendations of properties to visit based on their geo-location.

A solar power company has to maintain power throughput for its customers, or pay penalties. It implemented a streaming data application that monitors all of its panels in the field and schedules service in real time, thereby minimizing the periods of low throughput from each panel and the associated penalty payouts.

A media publisher streams billions of clickstream records from its online properties, aggregates and enriches the data with demographic information about users, and optimizes content placement on its site, delivering relevancy and better experience to its audience.

An online gaming company collects streaming data about player-game interactions, and feeds the data into its gaming platform. It then analyzes the data in real-time, offers incentives and dynamic experiences to engage its players.


Batch Processing vs. Stream Processing

Batch processing can be used to compute arbitrary queries over different sets of data. It usually computes results that are derived from all the data it encompasses, and enables deep analysis of big data sets. MapReduce-based systems are examples of platforms that support batch jobs.

Stream processing requires ingesting a sequence of data, and incrementally updating metrics, reports, and summary statistics in response to each arriving data record. It is better suited for real-time monitoring and response functions.

Many organizations are building a hybrid model by combining the two approaches, and maintain a real-time layer and a batch layer. Data is first processed by a streaming data platform to extract real-time insights, and then persisted into a store, where it can be transformed and loaded for a variety of batch processing use cases.


Challenges in Working with Streaming Data

Streaming data processing requires two layers:

o A Storage Layer: needs to support record ordering and strong consistency to enable fast, inexpensive, and replayable reads and writes of large streams of data.

o A Processing Layer: responsible for consuming data from the storage layer, running computations on that data, and then notifying the storage layer to delete data that is no longer needed.

You also have to plan for scalability, data durability, and fault tolerance in both the storage and processing layers.


Popular Stream Processing Tools and Platforms

Amazon Kinesis (https://aws.amazon.com/kinesis/)

Apache Kafka (https://kafka.apache.org/)

Apache Flume (https://flume.apache.org/)

Apache Storm (https://storm.apache.org/)

Apache Spark Streaming (https://spark.apache.org/streaming/)

Apache Flink (https://flink.apache.org/)


QUESTIONS

[email protected]

Tsapelas I. - [email protected]