eTransactions & Data Science II: The Power of Big Data & Data Analytics

Decision Support Systems Laboratory, NTUA – Electronic Transactions 2020



Open Data


In a Nutshell…

Open Data is a philosophy and practice requiring that certain data are freely available to everyone, without restrictions from copyright, patents, or other mechanisms of control, in a timely and accessible way.




8 Principles of Open Data

1. Data Must Be Complete

2. Data Must Be Primary

3. Data Must Be Timely

4. Data Must Be Accessible

5. Data Must Be Machine Processable

6. Access Must Be Non-Discriminatory

7. Data Formats Must Be Non-Proprietary

8. Data Must Be License-Free


Open Data Publication

• Top-down approach:
  • A national plan for coordinating data publication is created by committees involving all stakeholders before public organizations actually release any data
  • Defining and reaching consensus on a consistent set of terms and their relations (an ontology)

• Bottom-up approach:
  • Data should be published by all public organizations
  • Any interested party can use the available data in raw formats
  • Coordination efforts to join them together should follow at a later stage


Metadata

Imagine a supermarket...
• With goods without labels?
• Without signs and directions?

Imagine a library...
• Without titles on the books?
• Without shelves organized by subject?

Now you may imagine information on the World Wide Web.

[Figure: a document annotated with its metadata fields – title, author, abstract, time period, sources, supplemental information, (file) size. Source: CSC Brands]


Who Benefits from Open Data?

1. A government org publishes data
2. Citizens & developers engage, providing feedback
3. That govt. org incorporates feedback, improving data
4. Demonstrable use inspires that govt. org to publish more
5. More data attracts more data consumers
6. Positive interaction inspires more governments to follow suit


The Open Data Publisher/Subscriber Equation

Transforming governments from data collectors → data producers → data publishers


Open Data Initiatives


Why Open Data?

• More information might lead to more informed and better decisions

• Higher degree of effectiveness and efficiency

• Strengthen trust

• Leverage benefits of peer production

• New business models

• “People's right to know”


Linked Data


Open Data… Is this enough?

★ make your stuff available on the Web (whatever format) under an open license
★★ make it available as structured data (e.g., Excel instead of image scan of a table)
★★★ make it available in a non-proprietary open format (e.g., CSV instead of Excel)
★★★★ use URIs to denote things, so that people can point at your stuff
★★★★★ link your data to other data to provide context

[Source: http://5stardata.info/en/]


Linked Data – the idea

• The main strength of the World Wide Web lies in the ability to link between different web pages.

• This way, a webpage may provide its reader with a link to another website in order to retrieve additional information about a topic.

• Could we apply the same principle on data?

• Linked data, much like websites, can live on different places, be maintained by different organizations, and still be used as a single system from the user’s perspective.


Linked Data: structured data which is interlinked with other data so it becomes more useful through semantic queries.


Linked Data – Examples


Linked Data Principles

1. Use URIs as names for things.
2. Use HTTP URIs so that people can look up those names.
3. When someone looks up a URI, provide useful information, using the standards (RDF and SPARQL queries).
4. Include links to other URIs, so that they can discover more things.

• A Uniform Resource Identifier (URI) is a string of characters used to identify a resource.
• The Resource Description Framework (RDF) is a standard model for data interchange on the Web.
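To make the principles concrete, here is a minimal sketch using the Python rdflib library (the library choice, URIs, and names are illustrative assumptions, not part of the slides): things are named with HTTP URIs, described as RDF triples, linked to other URIs, and queried with SPARQL.

from rdflib import Graph, URIRef, Literal
from rdflib.namespace import RDF, FOAF

g = Graph()
alice = URIRef("http://example.org/people/alice")   # an HTTP URI naming a thing

# RDF triples (subject, predicate, object) describing the resource
g.add((alice, RDF.type, FOAF.Person))
g.add((alice, FOAF.name, Literal("Alice")))
g.add((alice, FOAF.knows, URIRef("http://example.org/people/bob")))  # link to another URI

# A SPARQL query over the graph returns "useful information" about the resource
for row in g.query(
        "SELECT ?name WHERE { ?p a foaf:Person ; foaf:name ?name . }",
        initNs={"foaf": FOAF}):
    print(row.name)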


Towards a Linked Open Data Web

Source: Sören Auer. Linked Data Tutorial


From Web of Documents to Web of Data

• Web of Documents: designed for human consumption; data silos in the Web.

• Web of Data: designed for machine consumption; interconnecting available data.



RDFa

• RDFa → RDF in attributes

• A way to mark up data in a web page

• RDFa encodes triples in HTML

• Useful for agents and (relatively) easy for humans

https://www.slideshare.net/tuttogaz/a-semantic-data-model-for-web-applications
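For illustration, a hand-written fragment (the vocabulary and values are made-up assumptions) showing how RDFa attributes such as vocab, typeof, and property embed triples in ordinary HTML:

<!-- RDFa: the vocab/typeof/property attributes turn this markup into triples -->
<div vocab="http://xmlns.com/foaf/0.1/" typeof="Person">
  <span property="name">Alice</span>
  <a property="homepage" href="http://example.org/alice">Alice's homepage</a>
</div>

An RDFa-aware agent extracts from this the triples "…is a foaf:Person", "…has foaf:name 'Alice'", and "…has foaf:homepage <http://example.org/alice>", while a browser just renders it as normal HTML.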


Structure - Ontologies


Linked Open Data Cloud (a while ago…)



Linked Data – Characteristics

• Linked Data allows us to easily reference the same entity in different datasets.

• Using linked data we can refer to and extend data external to our organization.

• Linked data usage is ideal for data exchange between different systems, especially when each one of the systems only maintains part of the overall information regarding each entity.

• Linked data usage reduces the cost of data exchange and maintenance, while increasing the cost of data generation and usage.

• Benefits of linked data greatly depend on correct usage of the paradigm and well-designed datasets.


Big Data Analytics Technologies


Big Data Landscape (2012)


Big Data Landscape (2016)


Big Data Landscape (2018)


Big Data Technologies

There are six primary needs that Big Data technologies address:

1. Distributed storage and processing
2. Non-relational databases with low latency
3. Streams and complex event processing
4. Processing of special big-data data types
5. In-memory processing
6. Reporting


Hadoop and MapReduce


MapReduce & Hadoop

In 2004, Google published a paper on a process called MapReduce.

The MapReduce concept provides a parallel processing model, and an associated implementation was released to process huge amounts of data. With MapReduce, queries are split and distributed across parallel nodes and processed in parallel (the Map step). The results are then gathered and delivered (the Reduce step). The framework was very successful, so others wanted to replicate the algorithm. Therefore, an implementation of the MapReduce framework was adopted by an Apache open-source project named Hadoop.

Studies in 2012 showed that a multiple-layer architecture is one option to address the issues that big data presents. A distributed parallel architecture distributes data across multiple servers; these parallel execution environments can dramatically improve data processing speeds. This type of architecture inserts data into a parallel DBMS, which implements the use of the MapReduce and Hadoop frameworks. This type of framework looks to make the processing power transparent to the end user by using a front-end application server.


MapReduce phases

MapReduce is a distributed, fault-tolerant system used for parallel programming and processing of huge data. Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python, and C++.

MapReduce programs are parallel in nature, and thus are very useful for performing large-scale data analysis using multiple machines in the cluster.

The whole process goes through four phases of execution: splitting, mapping, shuffling, and reducing.

Input Splits: An input to a MapReduce job is divided into fixed-size pieces called input splits. An input split is a chunk of the input that is consumed by a single map.

Mapping: This is the very first phase in the execution of a MapReduce program. In this phase, the data in each split is passed to a mapping function to produce output values. In our example, the job of the mapping phase is to count the number of occurrences of each word in the input splits.

Shuffling: This phase consumes the output of the Mapping phase. Its task is to consolidate the relevant records from the Mapping phase output. In our example, the same words are clubbed together along with their respective frequencies.

Reducing: In this phase, output values from the Shuffling phase are aggregated. This phase combines values from the Shuffling phase and returns a single output value. In short, this phase summarizes the complete dataset.


MapReduce (word count example)
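A minimal word-count sketch in Python, in the style of Hadoop Streaming (an assumption — the slides don't prescribe a binding): the mapper emits (word, 1) pairs, the framework's shuffle sorts them by key, and the reducer sums adjacent counts for each word.

# mapper.py — Map phase: emit "word<TAB>1" for every word in the split
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py — Reduce phase: input arrives sorted by key (the shuffle/sort),
# so all counts for the same word are adjacent and can be summed in one pass
import sys
current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, 0
    count += int(n)
if current is not None:
    print(f"{current}\t{count}")

Locally, `cat input.txt | python mapper.py | sort | python reducer.py` simulates the same map–shuffle–reduce pipeline that the cluster runs in parallel.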


Executing jobs with MapReduce


Hadoop modules

The base Apache Hadoop framework is composed of the following modules:

• Hadoop Common – contains libraries and utilities needed by other Hadoop modules;

• Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;

• Hadoop YARN – introduced in 2012, a platform responsible for managing computing resources in clusters and using them for scheduling users' applications;

• Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing.

The term Hadoop is often used for both the base modules and sub-modules and also the ecosystem, or collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, and Apache Storm.


HDFS architecture

HDFS is a distributed, fault-tolerant file system used to back the computation of Big Data.


HDFS architecture


File types for Hadoop

• Text documents

• CSV/TSV files

• JSON records

• Avro files
  • Avro is quickly becoming the top choice for developers due to its multiple benefits. Avro stores metadata with the data itself and allows specification of an independent schema for reading the file.

• Parquet files
  • Parquet is a columnar file format. Parquet also enjoys features like compression and query-performance benefits, but is generally slower to write than non-columnar file formats.

• ORC files
  • ORC files are compressed columnar files that enable faster queries.
  • But ORC doesn't support schema evolution.
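A small PySpark sketch of the practical difference (file names and the column are hypothetical): the same data rewritten as columnar, compressed Parquet, where a query can then read only the column it needs.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("file-formats").getOrCreate()

df = spark.read.csv("events.csv", header=True, inferSchema=True)  # row-oriented text
df.write.mode("overwrite").parquet("events.parquet")              # columnar + compressed

# A columnar scan touches only the 'user_id' column instead of whole rows
spark.read.parquet("events.parquet").select("user_id").distinct().show()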


Hadoop


Challenges of using Hadoop

What are the challenges of using Hadoop?

• MapReduce programming is not a good match for all problems. It's good for simple information requests and problems that can be divided into independent units, but it's not efficient for iterative and interactive analytic tasks. MapReduce is file-intensive. Because the nodes don't intercommunicate except through sorts and shuffles, iterative algorithms require multiple map-shuffle/sort-reduce phases to complete. This creates multiple files between MapReduce phases and is inefficient for advanced analytic computing.

• There's a widely acknowledged talent gap. It can be difficult to find entry-level programmers who have sufficient Java skills to be productive with MapReduce. That's one reason distribution providers are racing to put relational (SQL) technology on top of Hadoop. It is much easier to find programmers with SQL skills than MapReduce skills. And Hadoop administration seems part art and part science, requiring low-level knowledge of operating systems, hardware, and Hadoop kernel settings.

• Data security. Another challenge centers around the fragmented data security issues, though new tools and technologies are surfacing. The Kerberos authentication protocol is a great step toward making Hadoop environments secure.

• Full-fledged data management and governance. Hadoop does not have easy-to-use, full-featured tools for data management, data cleansing, governance, and metadata. Especially lacking are tools for data quality and standardization.


Hadoop Main Distributions

Cloud Providers


NoSQL Databases


NoSQL (Non-Relational Databases)



Relational databases

RDBMS: Relational database management systems

ACID properties:
• Atomicity
• Consistency
• Isolation
• Durability

RDBMS strengths:
• Defined schema
• Transactions
• Limitless indexing
• A very strong language for dynamic, cross-table queries (SQL)

Issues with RDBMS – scalability:
• Issues with scaling up when the dataset is too big.
• Not (usually) designed to be distributed, because joins are expensive and hard to scale horizontally.
• Looking at multi-node DB solutions (horizontal scaling).


Relational databases


Need for NoSQL

• Explosion of social media sites (Facebook, Twitter, Google, YouTube etc.) with large data needs.

• Rise of cloud-based solutions (e.g., Amazon S3).

• Need to handle large amounts of data quickly (horizontal scaling).

• Diversity of the available information.

• High connectivity between web elements.


What is NoSQL

• Stands for “Not Only SQL”

• Do not require a fixed table schema

• Relaxation of one or more of the ACID properties

4 types of NoSQL databases:

• Key-value pair based

• Column based

• Document based

• Graph based


Key-value stores

• “One key, one value, no duplicates, and crazy fast”.

• Simplest NoSQL databases, use of hash table.

• Data has no required format; data may have any format.

• The value is a binary object, aka a “blob”, that the DB does not understand and does not want to understand.

• Basic operations: Insert(key, value), Fetch(key), Update(key), Delete(key)
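These four operations map directly onto a real key-value store; a minimal sketch with the Python redis client, assuming a local Redis server (the key and value are made up):

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

r.set("session:42", '{"user": "alice", "cart": [17, 23]}')  # Insert / Update
blob = r.get("session:42")                                   # Fetch: returns an opaque blob
print(blob)                                                  # interpreting it is the app's job
r.delete("session:42")                                       # Delete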


Column based

• Often referred to as “BigTable clones”

• The column is the lowest/smallest instance of data.

• They store data as Column families containing rows that have many columns associated with a row key. Column families are groups of related data that is accessed together.

Google BigTable

Statistics about Facebook Search (Cassandra)


Document based

• Stores and retrieves documents.

• Like key-value stores, but the value part holds a document with a complex data structure.

• Self-describing, hierarchical tree data structures. Documents can contain many different key-value pairs, or key-array pairs, or even nested documents.

Couch DB
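For illustration, a made-up document of the kind such stores hold — key-value pairs, an array, and a nested document in one self-describing record:

{
  "_id": "order-1001",
  "customer": { "name": "Alice", "email": "alice@example.org" },
  "items": [
    { "sku": "A17", "qty": 2 },
    { "sku": "B23", "qty": 1 }
  ],
  "paid": true
}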


Graph based

• “A relational database is a collection of loosely connected tables, while a graph database is a multi-relational graph.”

• Graph databases store entities and the relationships between them as the nodes and edges of a graph. Entities have properties.

• Traversing the relationships is very fast, as relationships between nodes are not calculated at query time but are persisted.


What do we need?

• We need a distributed database system with the following features:
  • Fault tolerance
  • High availability
  • Consistency
  • Scalability

Which is impossible!!

According to the CAP theorem....


CAP Theorem

Pick 2 from:
• Consistency
• Availability
• Partition tolerance

To scale out we have to partition. That leaves a choice between consistency and availability.

Everyone who builds big applications builds them on CAP: Yahoo, Google, Facebook, Amazon, eBay, etc.


Advantages of NoSQL

• Massive data stores.

• Data are replicated in multiple nodes.
  • When data are written, the latest version is on at least one node and is then replicated to others.

• No single point of failure.

• Easy to distribute.

• Scalability.

• Do not require a schema.

• Better when it is more important to have fast data than correct data.


What is NOT provided by NoSQL (…or is difficult)

• Joins

• Group by

• ACID Transactions

• SQL


Querying Big Data


• Hadoop, Spark, and NoSQL are great tools for their purpose, but they don't fit 100% of the audience.

• There is a basic skill that every programmer is familiar with: SQL.

• Solution: SQL on top of Hadoop.

• Very useful for interactive and exploratory analytics.


Apache Hive

Hive is an Apache-licensed, open-source query engine written in the Java programming language, used for summarizing, analyzing, and querying data stored on Hadoop. It was initially introduced by Facebook and later open-sourced.

Pros

• It is stable as it has been around for many years.

• Hive is also open-source with a great community should you need help using it.

• It uses HiveQL, a SQL-like querying language which can be easily understood by RDBMS experts.

• Supports Text File, RCFile, SequenceFile, ORC, Parquet, and Avro file formats.

Cons

• Hive relies on MapReduce to execute queries which makes it relatively slow compared to querying engines like Cloudera Impala, Spark or Presto.

• Hive only supports structured data. So if your data is largely unstructured, Hive isn’t an option.
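A short HiveQL sketch (table name, columns, and location are hypothetical) of the RDBMS-like feel: a table is declared over files already sitting in HDFS, and a plain SQL aggregate is then executed as MapReduce jobs under the hood.

-- Declare a table over existing files in HDFS (schema-on-read)
CREATE EXTERNAL TABLE page_views (user_id STRING, url STRING, ts TIMESTAMP)
STORED AS PARQUET
LOCATION '/data/page_views';

-- An ordinary SQL aggregate, compiled by Hive into MapReduce jobs
SELECT url, COUNT(*) AS hits
FROM page_views
GROUP BY url
ORDER BY hits DESC
LIMIT 10;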


Cloudera Impala

A real-time, Apache-licensed, open-source, massively parallel processing (MPP) SQL-on-Hadoop querying engine written in the C++ programming language and currently shipped by Cloudera, MapR, Amazon, and Oracle. With Impala, you can query data, whether stored in HDFS or Apache HBase – including SELECT, JOIN, and aggregate functions – in real time.

Pros

• Impala provides real time querying on data stored on Hadoop clusters.

• It’s fast. The fact that it doesn’t use MapReduce to execute its queries makes it faster than Hive.

• It uses HiveQL, making it easy for data analysts coming from an RDBMS background to understand and use.

• Enterprise installation is supported because it is backed by Cloudera.

Cons

• Impala only has support for the Parquet, RCFile, SequenceFile, and Avro file formats. So if your data is in ORC format, you will be faced with a tough job transitioning your data.

• Supports only Cloudera’s CDH, MapR, and AWS platforms.


Presto

Presto is another massively parallel processing (MPP), open-source, SQL-on-Hadoop querying engine, developed by Facebook to query databases on different sources at high speed irrespective of the volume, velocity, and variety of data they contain. It is currently backed by Teradata and has been employed by AirBnB, Dropbox, Netflix, and Uber.

Pros
• Presto supports the Text, ORC, Parquet, and RCFile file formats. This makes it a great query engine of choice without worrying about transforming your existing data into a new format.
• It works well with Amazon S3 storage and queries data from any source at the scale of petabytes simultaneously and in seconds.
• Great support from the open-source community will ensure Presto is around for much longer.
• Enterprise support is provided by Teradata — a big data analytics and marketing applications company.

Cons
• Being largely open source, it is not advisable to deploy Presto if you aren't capable of supporting and debugging issues with Presto yourself, unless you decide to work with a vendor like Teradata.
• It doesn't have its own storage layer, so queries involving inserts or writes to HDFS are not supported.


Apache Spark SQL

Apache Spark is a cluster computing framework that runs on Hadoop. It comes with a built-in module called Spark SQL, which enables the execution of pure SQL queries against multiple data stores.

Pros
• It is very fast. Spark SQL executes batch queries in the Spark framework 10–100 times faster than Hive with MapReduce.
• Spark provides full compatibility with Hive data, queries, and user-defined functions (UDFs).
• Spark provides APIs (Application Programming Interfaces) in various languages (Java, Scala, Python), which makes it possible for developers to write applications in those languages.
• Apache Spark and Spark SQL boast larger open-source community support than Presto.

Cons
• Apache Spark consumes lots of RAM, which makes it expensive in terms of cost.
• It is still maturing, and as such, it is not considered stable yet.
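A minimal Spark SQL sketch in PySpark (the input file and columns are hypothetical): a JSON file becomes a DataFrame, is registered as a view, and is queried with plain SQL.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql-demo").getOrCreate()

df = spark.read.json("purchases.json")        # hypothetical input file
df.createOrReplaceTempView("purchases")       # expose the DataFrame to SQL

spark.sql("""
    SELECT product, SUM(amount) AS total
    FROM purchases
    GROUP BY product
    ORDER BY total DESC
""").show()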


Google BigQuery

BigQuery is a cloud database solution provided by Google which executes queries on large amounts of data in seconds. Being a full database solution and not just another query engine means that it provides its own storage and a query engine, and it uses SQL-like commands to run queries against data stored in it.

Pros
• Google BigQuery is a plug-and-play solution for big data, in that you don't worry about server management: you import your data into its own storage and begin querying it, while BigQuery handles performance, memory allocation, and CPU optimization implicitly.
• It has strong backing from Google, making it a very stable product.
• BigQuery supports standard SQL syntax.
• Moving data from other cloud storage solutions like Amazon S3 into GCS (Google Cloud Storage) is easy and hassle-free using the transfer manager.
• Great support for enterprise users.

Cons
• It can become very expensive if you query your data a lot — Google also charges per data processed by a query.
• Queries with lots of joins are not that fast.
• You have to move your data into BigQuery's storage system before you can query it.


Benchmarking

Source: https://cdn2.hubspot.net/hubfs/488249/Asset%20PDFs/Benchmark_BI-on-Hadoop_Performance_Q4_2016.pdf


Big Data Analytics with Apache Spark


Apache Spark

• Apache Spark is a powerful unified analytics engine for large-scale distributed data processing and machine learning.

• It provides in-memory computation for increased speed, processing data faster than MapReduce.

• It runs on top of an existing Hadoop cluster, can access the Hadoop data store (HDFS), and can process data from databases and streaming data.

• Spark facilitates the implementation of iterative algorithms, which visit their data set multiple times in a loop, interactive/exploratory data analysis and graph processing.


Spark vs. Hadoop MapReduce

The key difference between them lies in the approach to processing: Spark can do it in-memory, while Hadoop MapReduce has to read from and write to disk. As a result, the speed of processing differs significantly – Spark may be up to 100 times faster. However, the volume of data processed also differs: Hadoop MapReduce is able to work with far larger data sets than Spark.

MapReduce is good for:
• Linear processing of huge data sets. In case the resulting dataset is larger than available RAM, Hadoop MapReduce may outperform Spark.
• An economical solution, if no immediate results are expected.

Spark is good for:
• Fast data processing. In-memory processing makes Spark faster than Hadoop MapReduce – up to 100 times for data in RAM and up to 10 times for data in storage.
• Iterative processing (machine learning algorithms).
• Near-real-time & graph processing.
• Joining datasets.


The Spark framework stack

Programming: Scala, Python, Java, R, SQL
Library: Spark SQL, MLlib, GraphX, Spark Streaming
Engine: Spark Core
Management: YARN, Mesos, Spark Scheduler
Storage: Local, HDFS, S3, RDBMS, NoSQL, Streams


Spark Core

Spark Core is the foundation of the overall project. It provides distributed task dispatching, scheduling, and basic I/O functionalities, exposed through an API centered on the RDD abstraction.

Parallel operations such as map, filter, reduce, joins, and others can be scheduled for execution in parallel on the cluster, taking RDDs as input and producing new RDDs.

RDDs are immutable and their operations are lazy;

fault-tolerance is achieved by keeping track of the "lineage" of each RDD (the sequence of operations that produced it) so that it can be reconstructed in the case of data loss.
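A small PySpark sketch of these ideas (the log file name is hypothetical): each transformation only extends the RDD's lineage, and nothing runs until the final action.

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

lines = sc.textFile("app.log")                       # RDD over a (hypothetical) file
errors = lines.filter(lambda l: "ERROR" in l)        # lazy transformation
pairs = errors.map(lambda l: (l.split()[0], 1))      # lazy transformation
counts = pairs.reduceByKey(lambda a, b: a + b)       # lazy transformation

print(counts.take(5))   # action: runs the whole lineage; the recorded lineage
                        # also lets Spark recompute lost partitions on failure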


Spark libraries

Spark SQL is a Spark module for structured and semi-structured data processing. It provides a programming abstraction called DataFrames and a domain-specific language to manipulate them, and it can also act as a distributed SQL query engine.

Spark MLlib is a distributed machine-learning framework on top of Spark Core. Many common machine learning and statistical algorithms have been implemented and are shipped with MLlib, which simplifies large-scale machine learning pipelines, including statistics, classification, transformations, regression, clustering, and more.

GraphX is a graph computation engine built on top of Spark that enables users to interactively build, transform, and reason about graph-structured data at scale.

Spark Streaming enables powerful interactive and analytical applications across both streaming and historical data. It ingests data in mini-batches and performs RDD transformations on those mini-batches of data. This design enables the same set of application code written for batch analytics to be used in streaming analytics.
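A minimal Spark Streaming sketch of the mini-batch model (host and port are placeholders; e.g., feed it text with `nc -lk 9999`): the word-count transformations have the same shape as batch RDD code, applied to each 5-second mini-batch.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "stream-demo")
ssc = StreamingContext(sc, batchDuration=5)          # 5-second mini-batches

lines = ssc.socketTextStream("localhost", 9999)      # text stream from a socket
counts = (lines.flatMap(lambda l: l.split())         # same code shape as batch RDDs
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                      # print each mini-batch's counts

ssc.start()
ssc.awaitTermination()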


Spark program execution


DAG (Directed Acyclic Graph)


Analytics with Spark


Streaming Data Analytics


What is Streaming Data?

Streaming data is data that is generated continuously by thousands of data sources, which typically send in the data records simultaneously and in small sizes (on the order of kilobytes). Streaming data includes a wide variety of data, such as log files generated by customers using your mobile or web applications, ecommerce purchases, in-game player activity, information from social networks, financial trading floors, or geospatial services, and telemetry from connected devices or instrumentation in data centers.

This data needs to be processed sequentially and incrementally, on a record-by-record basis or over sliding time windows, and used for a wide variety of analytics including correlations, aggregations, filtering, and sampling. Information derived from such analysis gives companies visibility into many aspects of their business and customer activity – such as service usage (for metering/billing), server activity, website clicks, and the geo-location of devices, people, and physical goods – and enables them to respond promptly to emerging situations. For example, businesses can track changes in public sentiment on their brands and products by continuously analyzing social media streams, and respond in a timely fashion as the necessity arises.


Streaming Data Examples

Sensors in transportation vehicles, industrial equipment, and farm machinery send data to a streaming application. The application monitors performance, detects any potential defects in advance, and places a spare part order automatically, preventing equipment downtime.

A financial institution tracks changes in the stock market in real time, computes value-at-risk, and automatically rebalances portfolios based on stock price movements.

A real-estate website tracks a subset of data from consumers' mobile devices and makes real-time recommendations of properties to visit based on their geo-location.

A solar power company has to maintain power throughput for its customers, or pay penalties. It implemented a streaming data application that monitors all of its panels in the field and schedules service in real time, thereby minimizing the periods of low throughput from each panel and the associated penalty payouts.

A media publisher streams billions of clickstream records from its online properties, aggregates and enriches the data with demographic information about users, and optimizes content placement on its site, delivering relevancy and better experience to its audience.

An online gaming company collects streaming data about player-game interactions, and feeds the data into its gaming platform. It then analyzes the data in real-time, offers incentives and dynamic experiences to engage its players.


Batch Processing vs. Stream Processing

Batch processing can be used to compute arbitrary queries over different sets of data. It usually computes results that are derived from all the data it encompasses, and enables deep analysis of big data sets. MapReduce-based systems are examples of platforms that support batch jobs.

Stream processing requires ingesting a sequence of data, and incrementally updating metrics, reports, and summary statistics in response to each arriving data record. It is better suited for real-time monitoring and response functions.

Many organizations are building a hybrid model by combining the two approaches, and maintain a real-time layer and a batch layer. Data is first processed by a streaming data platform to extract real-time insights, and then persisted into a store, where it can be transformed and loaded for a variety of batch processing use cases.


Challenges in Working with Streaming Data

Streaming data processing requires two layers:

o A Storage Layer: needs to support record ordering and strong consistency to enable fast, inexpensive, and replayable reads and writes of large streams of data.

o A Processing Layer: responsible for consuming data from the storage layer, running computations on that data, and then notifying the storage layer to delete data that is no longer needed.

You also have to plan for scalability, data durability, and fault tolerance in both the storage and processing layers.


Popular Stream Processing Tools and Platforms

Amazon Kinesis (https://aws.amazon.com/kinesis/)

Apache Kafka (https://kafka.apache.org/)

Apache Flume (https://flume.apache.org/)

Apache Storm (https://storm.apache.org/)

Apache Spark Streaming (https://spark.apache.org/streaming/)

Apache Flink (https://flink.apache.org/)


QUESTIONS

[email protected]

Tsapelas I. - [email protected]