Planning

Time   Concept     Subject                          Speaker
9:05   Keynote 1   Big Data Processing              Sylvain LEFEBVRE
9:45   Keynote 2   Smart Cities                     Gilles BETIS
10:30  Break
11:00  Keynote 3   Distributed Intelligence in IoT  Philippe GAUTIER
11:45  Keynote 4   Data Protection                  Anne BARBIER GOLIRO
12:30  Lunch
14:30  Workshop W  Graph Databases                  Cedric FAUVET
17:30  Cocktail
Introduction to Big Data Processing
Raja Chiky – [email protected]
About the RDI Team
R. Chiky: Associate professor in Computer Science, LISITE-RDI
Research interests: data stream mining, scalability and resource optimization in distributed architectures (e.g. cloud architectures), recommender systems
Research field: large-scale data management
1. Real-time and distributed processing of various data sources
2. Use of semantic technologies to add a semantic layer
3. Recommender systems and collaborative data mining
4. Optimizing resources in large-scale systems
[Figure: heterogeneous and dynamic data streams (e.g. from sensors) and heterogeneous static data feeding these research themes]
Goal of this talk
• Recognise some of the main terminology
• Remember that there are many tools available
• Focus on Hadoop, the most popular open-source Big Data ecosystem
• Realise the potential of Big Data

CONTENT
• What is Big Data?
• Data Streaming
• NoSQL databases
• Distributed File System
• MapReduce paradigm
• Visualization
Big Data: Buzzword!

What is Big Data?
Volume of data created worldwide: about 5 EB from the dawn of time up to 2003, 2.7 ZB in 2012, and an estimated 10 ZB in 2015.

1 GB = 10^9 bytes · 1 TB = 10^12 bytes · 1 PB = 10^15 bytes · 1 EB = 10^18 bytes · 1 ZB = 10^21 bytes · 1 YB = 10^24 bytes
Big Data Elements: Volume, Variety, Velocity
+ Veracity (IBM): information uncertainty

Volume and velocity of data: Walmart handles 1M transactions per hour; Google processes 24 PB of data per day; AT&T transfers 30 PB of data per day; 90 trillion emails are sent per year; World of Warcraft uses 1.3 PB of storage; Facebook, with a user base of 900M users, had 25 PB of compressed data; 400M tweets per day in June '12; 72 hours of video are uploaded to YouTube every minute.

Variety of data: radio, TV, news, e-mails, Facebook posts, tweets, blogs, photos, videos (user and paid), RSS feeds, Wikipedia, GPS data, RFID, POS scanners, …

Source: Big Data & Analytics - Why Should We Care?, Vishwa Kolla
What is Big Data?

Gartner definition: "Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making."

McKinsey definition: "A dataset whose size is beyond the ability of typical database software tools to capture, store, manage and analyze."
Key factors
• Cheap storage: recording everything is not expensive anymore
• Cloud computing: cheap, on-demand computing resources, from anywhere in the world and for everyone
• Business reasons: new insights arise that give a competitive advantage
• Data in various forms everywhere: IoT and IoE, social networks, Open Data
• The way we interact with each other and with data / information
• …
World of data
Website logs, network monitoring, financial services, eCommerce, traffic control, power consumption, weather forecasting, …
Data may come from humans, sensors or machines.
Transforming our daily lives
Then: one size fits all → Now: personalization & targeted selling
Source: Big Data Trends by David Feinleib
Fitness
Then: manual tracking → Now: focus on the goal
Source: Big Data Trends by David Feinleib
Big Data workflow
1. Capture
2. Store
3. Analyze
4. Visualize
Challenges arise in all of these steps.
Challenges: Data Collection
• Heterogeneity of sources: company databases (silos); sensor networks and intelligent objects; data streams (social networks, financial information, etc.)
• Data velocity
• Data provenance and quality
• Security / privacy
Type of data used in Big Data initiatives
[Chart: shares of internal data, traditional sources, and "new data"]
Source: Big Data opportunities survey, Unisphere / SAP, May 2013.
Challenges: Data Collection (Velocity)
Source: http://practicalanalytics.co/2012/10/22/sizing-mobile-social-big-data-stats/
What is a data stream?
Golab & Özsu (2003): "A data stream is a real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items. It is impossible to control the order in which items arrive, nor is it feasible to locally store a stream in its entirety."
Massive volumes of data; items arrive at a high rate.
Data Stream Management Systems (DBMS vs. DSMS)
• Data model. DBMS: permanent updatable relations. DSMS: streams and permanent updatable relations.
• Storage. DBMS: data is stored on disk. DSMS: permanent relations are stored on disk; streams are processed on the fly.
• Query. DBMS: SQL (creating structures; inserting, updating and deleting data; retrieving data with one-time queries). DSMS: SQL-like query language (standard SQL on permanent relations, extended SQL on streams with windowing); continuous queries.
• Performance. DBMS: large volumes of data. DSMS: optimization of computing resources to deal with several streams and several queries; ability to face variations in arrival rates without crashing.
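To make windowing and continuous queries concrete, here is a minimal illustrative sketch in Java (not from the talk): a continuous query that reports the number of items seen in the last 60 seconds, processed on the fly without storing the whole stream.

import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative continuous query: how many items arrived in the last 60 seconds?
// The stream is processed on the fly; only timestamps inside the window are kept.
public class SlidingWindowCount {
    private static final long WINDOW_MS = 60_000;
    private final Deque<Long> arrivals = new ArrayDeque<>();

    // Called once per stream item; returns the current answer of the query.
    public int onItem(long timestampMs) {
        arrivals.addLast(timestampMs);
        // Expire items that have fallen out of the window.
        while (!arrivals.isEmpty() && arrivals.peekFirst() < timestampMs - WINDOW_MS) {
            arrivals.removeFirst();
        }
        return arrivals.size();
    }
}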
Too many data streams, but not enough knowledge!
Semantic Web technologies for data streams
• Annotate stream data with semantic metadata
• Apply Linked Data principles to publish streaming data
• Interlink streaming data with existing datasets
• Integrate data stream processing + reasoning
Objectives: interoperability, automation, enrichment
Challenges in data storage
• Large amounts of data: need for a highly distributed architecture
• Massive queries: avoid joins, since they are very time-consuming
• Evolutionary schema: flexibility and scalability
• Predictable and low latency
• High availability
• Elasticity: horizontal extensibility
• No need for transactions, strong consistency or complex queries
Limitation of RDBMS
"If the only tool you have is a hammer, you tend to see every problem as a nail."
Abraham Maslow
Limitations of RDBMS
Relational DBMS offer:
• join operators between tables, to build complex queries involving several entities
• integrity constraints
• transaction management with ACID properties
In a highly distributed environment, these mechanisms have a significant cost: with most RDBMS, the data live on a single machine (one node), and it is difficult to place the data on different nodes.
NoSQL: Not Only SQL, not only relational
NoSQL?
• NoSQL means "Not Only SQL": SQL must not die, but alternative storage solutions should be considered for specific applications (especially web applications)
• More exact name: non-relational databases
• The ACID model (Atomicity, Consistency, Isolation, Durability) does not allow scalability in a distributed environment, for example by limiting the write speed (the most expensive operation)
• 4 Nos: 1) NO SCHEMA (schema-free) 2) NO JOIN (extract data without joins) 3) NO single DATA FORMAT (graph, document, row, column) 4) NO ACID transactions
Scalability
• Scalability is the ability of a system, network, or process to handle a growing amount of work, or its ability to be enlarged to accommodate that growth.
• Two kinds:
– Horizontal scalability (scale out): add ordinary machines (commodity hardware); less expensive; more complex to set up (due to load-balancing concerns)
– Vertical scalability (scale up): add powerful servers; costly; easy to set up
Sharding (data partitioning)
• A shard (partition) is a logical division of a database into several independent parts. This allows us to obtain a storage capacity greater than the limited capacity of a single hard disk, or to run queries in parallel on multiple partitions.
• Two kinds (a minimal routing sketch follows below):
– Vertical sharding: each node stores one (or more) table(s) of a database
– Horizontal sharding: each node stores a subset (identified by a key range) of a data table
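As an illustration of horizontal sharding (a sketch, not from the talk), the following Java snippet routes each record to a node by key range; production systems typically use range tables or consistent hashing instead of this hard-coded mapping, and all node names here are hypothetical.

import java.util.TreeMap;

// Illustrative key-range router for horizontal sharding: each node owns
// the keys from its lower bound (inclusive) up to the next node's bound.
public class RangeShardRouter {
    private final TreeMap<String, String> lowerBoundToNode = new TreeMap<>();

    public void addShard(String lowerBoundKey, String node) {
        lowerBoundToNode.put(lowerBoundKey, node);
    }

    // Pick the shard whose range contains the key
    // (assumes the key is not below the lowest registered bound).
    public String nodeFor(String key) {
        return lowerBoundToNode.floorEntry(key).getValue();
    }

    public static void main(String[] args) {
        RangeShardRouter router = new RangeShardRouter();
        router.addShard("a", "node1");  // keys a..j
        router.addShard("k", "node2");  // keys k..s
        router.addShard("t", "node3");  // keys t..z
        System.out.println(router.nodeFor("mueller")); // -> node2
    }
}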
CAP theorem (conjectured by E. Brewer in 2000, proved by S. Gilbert and N. Lynch in 2002)
"CAP theorem": Consistency, Availability, Partition-tolerance: choose two.
Claim: every distributed system is on one side of the triangle.
• CA: available and consistent, unless there is a partition.
• AP: a reachable replica provides service even in a partition, but may be inconsistent.
• CP: always consistent, even in a partition, but a reachable replica may deny service without the agreement of the others.
NoSQL Taxonomy
Four data models:
• Key-value
• Document
• Column
• Graph
Presentation of HDFS
❑ HDFS is the Hadoop Distributed File System
❑ HDFS was inspired by the Google File System (GFS): "The Google File System", Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, Google, 2003.
❑ Historically, it is composed of 2 main node types:
➢ The NameNode is in charge of metadata management (one NameNode per cluster)
➢ The DataNode is in charge of data storage (one DataNode per machine)
❑ Each file in HDFS is split into blocks (the block size is 64 MB by default)
❑ Each block is replicated on different DataNodes (3 replicas by default): the replication mechanism is important for both fault tolerance and data availability
[Figure: File1 split into blocks 1-4; the NameNode holds the metadata while each block is stored on three different DataNodes]
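A minimal client-side sketch, assuming the standard HDFS Java API (the path and NameNode address are illustrative): the client asks the NameNode for block metadata, then reads the blocks from the DataNodes transparently through the returned stream.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // illustrative cluster address
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader in = new BufferedReader(new InputStreamReader(
                     fs.open(new Path("/data/visits")), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // blocks are fetched from DataNodes under the hood
            }
        }
    }
}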
Challenges in Data Analytics
Problems in large-scale analytics:
• Distributed computation efficiency: evaluating performance gains from distribution; bringing data to the processor; efficient parallel algorithms (statistics, summaries)
• Speed analytics: streaming computations; streaming languages and libraries; load balancing
Big Data: technological challenges
• Data infrastructure tools and platforms: data centers, cloud infrastructures, NoSQL databases, in-memory databases, the Hadoop/MapReduce ecosphere
• New generation of front-end tools for BI and analytic systems: data visualization and visual analytics, self-service BI, mobile BI
• Data processing: supercomputers, distributed or massively parallel computing
MapReduce
• Introduced by Google in 2004
• Aim: to parallelize processing (indexing, data mining, ...) by spreading the task of processing data over many machines
• MapReduce offers: easy parallelization and distribution of processing; fault tolerance; load balancing; an abstraction for programmers

Programming model
• Based on functional programming languages
• Developers implement two functions:
– Map: in the map phase, data is distributed to a number of machines and processed in the form of key/value pairs; the output is partitioned (and sorted) by key.
– Reduce: for each key group, the values are aggregated (reduced) to form the final result.
MapReduce (Global architecture)
Example: Word Count
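The slide names the canonical word-count example but does not show the code; below is a sketch in Java, essentially the standard Apache Hadoop tutorial program. The map function emits (word, 1) pairs and the reduce function sums the counts for each key.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word (key group).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}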
Moving computation to data
• Data is stored in distributed file systems (e.g. Google File System, Hadoop Distributed File System), in large blocks (a.k.a. chunks), usually of 64 MB; chunks are replicated and distributed over machines
• The map function runs on each of the chunks
• A master node knows the data locality: it receives jobs, computes the necessary map and reduce tasks, and selects and activates worker nodes (selected, if possible, close to the data)
• There are many MapReduce implementations: C#, C++, Erlang, Java, Python, etc. Apache Hadoop MapReduce is arguably the most prominent one.
Hadoop
• Open-source project (Apache Software Foundation)
• A software platform that lets one easily write and run applications that process vast amounts of data across a cluster of machines
• Written in Java; it includes MapReduce, HDFS and HBase
• Yahoo! is the biggest contributor
• Used by Amazon, Apple, eBay, IBM, Google, Microsoft, SAP, Twitter, etc.: http://wiki.apache.org/hadoop/PoweredBy
Hadoop Ecosystem
• Avro: remote procedure call and data serialization
• Flume: harvesting, aggregating and moving large amounts of log data in and out of Hadoop
• HBase: column-oriented database
• Hive: warehouse structure and SQL-like access for data in HDFS
• Pig: high-level scripting language (Pig Latin) for querying
• Sqoop: "SQL to Hadoop" (data import from RDBMS to HDFS)
• Oozie: job coordinator and workflow manager
• Hue: graphical interface for Hadoop
• Chukwa: large-scale log collection and analysis
• …
Problems with MapReduce
• The main appeal of MapReduce is its simplicity, as long as the application does not require complex SQL-style queries
• MapReduce is fairly low-level: one must think about keys, values, partitioning, etc.
• All data and intermediate data are written to disk!
• All standard database operations (join, selection, projection, etc.) must be coded by hand
• Solution: use high-level languages (Pig, Hive, etc.) that translate programs to MapReduce automatically
Alternative to MapReduce: Pig
• Implementation started at Yahoo! Research; executes more than 30% of Yahoo!'s jobs
• Features: expresses sequences of MapReduce jobs; provides relational (SQL-like) operators (JOIN, GROUP BY, etc.)
Pig example
Given two tables, Visits and UrlInfo, find the top 10 most visited pages in each category.

Visits:
User  Url         Time
Amy   cnn.com     8:00
Amy   bbc.com     10:00
Amy   flickr.com  10:05
Fred  cnn.com     12:00

UrlInfo:
Url         Category  PageRank
cnn.com     News      0.9
bbc.com     News      0.8
flickr.com  Photos    0.7
espn.com    Sports    0.9

Dataflow: Load Visits → Group by url → Foreach url, generate count; Load UrlInfo; Join on url → Group by category → Foreach category, generate top10 urls.

In MapReduce: [figure of the equivalent hand-written MapReduce code, not reproduced]
In Pig:

visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';

* top, count are user-defined functions
Alternative to MapReduce: Hive
• Data warehouse on top of Hadoop
• Open source, written in Java
• Metadata is stored in a relational database
• Provides an SQL-like query language: HiveQL
• Provides the possibility to create user-defined functions
• Indexing
Language (DDL)
Create a table:    hive> CREATE TABLE customer (age INT, address STRING);
Display tables:    hive> SHOW TABLES;
Describe a table:  hive> DESCRIBE customer;
Modify a table:    hive> ALTER TABLE customer ADD COLUMNS (zip INT);
Delete a table:    hive> DROP TABLE customer;
Language (DML)
Load a file:
hive> LOAD DATA LOCAL INPATH '/data/home/test.txt' OVERWRITE INTO TABLE customer;
HiveQL queries:
hive> SELECT c.age FROM customer c WHERE c.sdate = '2008-08-15';
hive> INSERT OVERWRITE DIRECTORY '/data/hdfs_file' SELECT c.* FROM customer c WHERE c.sdate = '2008-08-15';
Can use ODBC to connect to other external BI tools.
Next gen: Spark
• A richer set of operators
• Resilient Distributed Datasets (RDDs): immutable data tables, lazy transformations
• Some operators can be applied indifferently to streaming data
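For comparison with the Hadoop program above, here is a hedged sketch of the same word count against Spark's Java API (not from the slides; signatures as in Spark 2.x, paths illustrative). Note that the transformations are lazy: nothing runs until the saveAsTextFile action is invoked.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // Add .setMaster("local[*]") to run outside a cluster.
        SparkConf conf = new SparkConf().setAppName("spark word count");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile(args[0]); // RDD: immutable, partitioned
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(l -> Arrays.asList(l.split(" ")).iterator())
                .mapToPair(w -> new Tuple2<>(w, 1))
                .reduceByKey((a, b) -> a + b);        // transformations are lazy...
        counts.saveAsTextFile(args[1]);               // ...executed only at this action
        sc.stop();
    }
}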
Lambda architecture (by Nathan Marz)
A generic, scalable and fault-tolerant data processing architecture.
[Figure: incoming data flows into a batch layer (batch processing producing precomputed views, indexed by a serving layer) and a speed layer (real-time stream processing); queries consult both]
Source: Mathieu DESPRIEE (USI)
Lambda architecture
1. All data entering the system is dispatched to both the batch layer and the speed layer for processing.
2. The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing the batch views.
3. The serving layer indexes the batch views so that they can be queried in an ad-hoc way.
4. The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
5. Any incoming query can be answered by merging results from batch views and real-time views (see the sketch below).
http://lambda-architecture.net/
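A minimal sketch of point 5 in Java, under the assumption that both views are simple key-value maps; all names here are hypothetical (a real deployment would serve batch views from a key-value store and feed the real-time view from the speed layer, e.g. Storm).

import java.util.HashMap;
import java.util.Map;

// Illustrative query-time merge of a Lambda architecture.
public class LambdaQuery {
    // Rebuilt periodically by the batch layer from the master dataset.
    private final Map<String, Long> batchView = new HashMap<>();
    // Maintained by the speed layer; covers only data since the last batch run.
    private final Map<String, Long> realtimeView = new HashMap<>();

    // The batch view is authoritative for older data; the real-time view
    // compensates for the batch layer's latency on recent data.
    public long pageViews(String url) {
        return batchView.getOrDefault(url, 0L) + realtimeView.getOrDefault(url, 0L);
    }
}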
Big Data Stream Mining
Machine learning tools can be classified along two axes:
• Distributed
– Batch: Hadoop, Mahout
– Stream: S4, Storm, SAMOA
• Non-distributed
– Batch: R, WEKA, …
– Stream: MOA
Challenges in Data Access and Visualization
• The main goal of data visualization is to communicate information clearly and effectively through graphical means
• Provide the results of analytics workflows to faster systems such as real-time query interfaces

"Visualization is a form of knowledge compression"
- David McCandless
Conclusion: Big Data challenges
• Semantic information aggregation: "too much data to assimilate but not enough knowledge to act"
• Distributed and real-time processing: design of real-time and distributed algorithms for stream processing and information aggregation; distribution and parallelization of data mining algorithms
• Visual analytics and user modeling: dynamic user models; novel visualizations for very large datasets
• Data privacy: Big Data is often generated by people; obtaining consent is often impossible and anonymisation is very hard
Thanks to Marie-Aude Aufaure (ECP) and Sylvain Lefebvre (ISEP)