big data: motivation and overview - vub · big data: motivation and overview current trends in arti...

31
Big Data: motivation and overview Current Trends in Artificial Intelligence Guillaume Levasseur IRIDIA, Universit´ e libre de Bruxelles Friday, 2nd March 2018 (Short) Formal definition of machine learning Learning requires (more) data! More data? Big Data! Application Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 1 / 27

Upload: others

Post on 21-May-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Big Data: motivation and overview - VUB · Big Data: motivation and overview Current Trends in Arti cial Intelligence Guillaume Levasseur ... Distributed batch processing, e.g.with

Big Data: motivation and overviewCurrent Trends in Artificial Intelligence

Guillaume Levasseur

IRIDIA,Universite libre de Bruxelles

Friday, 2nd March 2018

(Short) Formal definition of machine learning

Learning requires (more) data!

More data? Big Data!

Application

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 1 / 27

Page 2: Big Data: motivation and overview - VUB · Big Data: motivation and overview Current Trends in Arti cial Intelligence Guillaume Levasseur ... Distributed batch processing, e.g.with

Machine learning

(Short) Formal definition of machine learning

f (X, ~ω) = ~y (1)

f Model

X Matrix of the features (input data)

~y Output vector

I ~y is continuous = regressionI ~y is discrete = classification (supervised) or clustering

(unsupervised)I ~y is symbolic = decision problem

~ω Weights of the model f

I Determining ~ω = train the model, learn

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 2 / 27

Page 3: Big Data: motivation and overview - VUB · Big Data: motivation and overview Current Trends in Arti cial Intelligence Guillaume Levasseur ... Distributed batch processing, e.g.with

Motivation

Learning requires (more) data!

Learning Determining the weights of the model using a trainingdataset.

Training dataset Set of features (and output values) representative of theproblem, as big as possible.

Points Attributes0 x1[0] x2[0] x3[0]1 x1[1] x2[1] x3[1]2 x1[2] x2[2] x3[2]...

......

...

I Problem How to store this data?

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 3 / 27

Page 4: Big Data: motivation and overview - VUB · Big Data: motivation and overview Current Trends in Arti cial Intelligence Guillaume Levasseur ... Distributed batch processing, e.g.with

Motivation Spreadsheet storage

Spreadsheet storage

Figure: Time series in a spreadsheet

I Limited to 1 million rows per file

I Cheesy

I Limited access to 1 user at atime, poor availability

I Sequential reading

⇒ Not suitable!

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 4 / 27

Page 5: Big Data: motivation and overview - VUB · Big Data: motivation and overview Current Trends in Arti cial Intelligence Guillaume Levasseur ... Distributed batch processing, e.g.with

Motivation RDBMS

Relational database management system

Figure: Relational model (neo4j.com)

I Greater capacity (> 1 M rows)

I Guaranteed consistency throughACID properties (atomicity,consistency, isolation,durability)

I Relations (”shortcuts”)between tables

I Structured query language(SQL) to retrieve data

I Better availability throughserving systems (MySQL,SQLServer, PostgreSQL)

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 5 / 27

Page 6: Big Data: motivation and overview - VUB · Big Data: motivation and overview Current Trends in Arti cial Intelligence Guillaume Levasseur ... Distributed batch processing, e.g.with

Motivation RDBMS

Relational database management system

I RDBMS limitations:I Fixed data schema: defined a priori, no missing data allowedI Server bound

I Problems:I What if there is missing data?

I Use custom symbols like ”NaN”.I Use flexible schema.

I What if the data varies in size, format, structure? How to combineheterogeneous data sources?

I Create separated tables and try to link them.I Use flexible schema.

I What if the workload become too heavy for the server (too manyrequests for the CPU, to much data for the HDD)?

I KIWI (kill it whit iron) = hardware upgrade.I Distribute the load on multiple machines.

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 6 / 27

Page 7: Big Data: motivation and overview - VUB · Big Data: motivation and overview Current Trends in Arti cial Intelligence Guillaume Levasseur ... Distributed batch processing, e.g.with

Motivation RDBMS

Relational database management system

I RDBMS limitations:I Fixed data schema: defined a priori, no missing data allowedI Server bound

I Problems:I What if there is missing data?

I Use custom symbols like ”NaN”.

I Use flexible schema.

I What if the data varies in size, format, structure? How to combineheterogeneous data sources?

I Create separated tables and try to link them.

I Use flexible schema.

I What if the workload become too heavy for the server (too manyrequests for the CPU, to much data for the HDD)?

I KIWI (kill it whit iron) = hardware upgrade.

I Distribute the load on multiple machines.

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 6 / 27

Page 8: Big Data: motivation and overview - VUB · Big Data: motivation and overview Current Trends in Arti cial Intelligence Guillaume Levasseur ... Distributed batch processing, e.g.with

Motivation RDBMS

Relational database management system

I RDBMS limitations:I Fixed data schema: defined a priori, no missing data allowedI Server bound

I Problems:I What if there is missing data?

I Use custom symbols like ”NaN”.I Use flexible schema.

I What if the data varies in size, format, structure? How to combineheterogeneous data sources?

I Create separated tables and try to link them.I Use flexible schema.

I What if the workload become too heavy for the server (too manyrequests for the CPU, to much data for the HDD)?

I KIWI (kill it whit iron) = hardware upgrade.I Distribute the load on multiple machines.

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 6 / 27

Page 9: Big Data: motivation and overview - VUB · Big Data: motivation and overview Current Trends in Arti cial Intelligence Guillaume Levasseur ... Distributed batch processing, e.g.with

Big Data

More data? Big Data!

Remember, back in 2006...

I More data, more variety, more requests.

I Google develops Bigtable, which later inspired

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 7 / 27

Page 10: Big Data: motivation and overview - VUB · Big Data: motivation and overview Current Trends in Arti cial Intelligence Guillaume Levasseur ... Distributed batch processing, e.g.with

Big Data

More data? Big Data!

Remember, back in 2006...

I More data, more variety, more requests.

I Google develops Bigtable, which later inspired

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 7 / 27

Page 11: Big Data: motivation and overview - VUB · Big Data: motivation and overview Current Trends in Arti cial Intelligence Guillaume Levasseur ... Distributed batch processing, e.g.with

Big Data

More data? Big Data!The 3 Vs of Big Data:

Variety Data of many different types are to be stored...

Velocity it’s coming fast...

Volume and there is a lot of it!

ScalabilityAbility to adapt when the order of magnitude changes.

Figure: Vertical and horizontal scalability (premaseem.files.wordpress.com).

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 8 / 27

Page 12: Big Data: motivation and overview - VUB · Big Data: motivation and overview Current Trends in Arti cial Intelligence Guillaume Levasseur ... Distributed batch processing, e.g.with

Big Data Consistency

Consistency: from ACID to BASE

I Problem ACID does not support partitioning across multiple nodes.

I BASE: basically available, soft state, eventually consistent

Figure: Consistency issue for distributed database systems (hbase.apache.org).

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 9 / 27

Page 13: Big Data: motivation and overview - VUB · Big Data: motivation and overview Current Trends in Arti cial Intelligence Guillaume Levasseur ... Distributed batch processing, e.g.with

Big Data CAP theorem

CAP theorem

Figure: The CAP theorem: any database system has to drop either theconsistency, the availability or the partition tolerance (w3resource.com).

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 10 / 27

Page 14: Big Data: motivation and overview - VUB · Big Data: motivation and overview Current Trends in Arti cial Intelligence Guillaume Levasseur ... Distributed batch processing, e.g.with

Big Data NoSQL

NoSQL technologies

I NoSQL = not only SQL, NewSQL = distributed relational database

I Shared-nothing (distributed) architecture ⇒ CP, AP

I Supported operations: CRUD (create, read, update, delete) and scan.

I Classification:

key-value unique ID value

columnarFamily 0 Family 1

unique ID column 0 column 1 column 2

document unique ID {attr0: value0, attr1: value1}

graph unique ID arcs vertices

search engines inverted ID plain text or document

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 11 / 27

Page 15: Big Data: motivation and overview - VUB · Big Data: motivation and overview Current Trends in Arti cial Intelligence Guillaume Levasseur ... Distributed batch processing, e.g.with

Big Data NoSQL

NoSQL technologiesMore than 200 different systems!

Figure: Logos of some NoSQL database systems.

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 12 / 27

Page 16: Big Data: motivation and overview - VUB · Big Data: motivation and overview Current Trends in Arti cial Intelligence Guillaume Levasseur ... Distributed batch processing, e.g.with

Big Data NoSQL

Hadoop

Figure: Architecture of the Hadoop distributed file system, the Namenode actsas master of the infrastructure (hadoop.apache.org).

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 13 / 27

Page 17: Big Data: motivation and overview - VUB · Big Data: motivation and overview Current Trends in Arti cial Intelligence Guillaume Levasseur ... Distributed batch processing, e.g.with

Big Data NoSQL

On top of Hadoop: HBase

Figure: HBase architecture (eureka.co).

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 14 / 27

Page 18: Big Data: motivation and overview - VUB · Big Data: motivation and overview Current Trends in Arti cial Intelligence Guillaume Levasseur ... Distributed batch processing, e.g.with

Big Data NoSQL

Cassandra

Figure: Cassandra’s ring strategy for splitting (docs.datastax.com).

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 15 / 27

Page 19: Big Data: motivation and overview - VUB · Big Data: motivation and overview Current Trends in Arti cial Intelligence Guillaume Levasseur ... Distributed batch processing, e.g.with

Big Data NoSQL

Cassandra

Figure: Cassandra’s writing process (docs.datastax.com).

I No master: nodes use gossiping to exchange metadata about thecurrent state

I Compaction: too many SSTables ⇒ merge in one bigger SSTable.

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 16 / 27

Page 20: Big Data: motivation and overview - VUB · Big Data: motivation and overview Current Trends in Arti cial Intelligence Guillaume Levasseur ... Distributed batch processing, e.g.with

Big Data NoSQL

Cassandra

Figure: Result of a SELECT query in CQL.

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 17 / 27

Page 21: Big Data: motivation and overview - VUB · Big Data: motivation and overview Current Trends in Arti cial Intelligence Guillaume Levasseur ... Distributed batch processing, e.g.with

Big Data NoSQL

Elasticsearch

Figure: Sharding and replication with elasticsearch (elastic.co).

2 types of nodes:

Master Allocates the shards and the replicas to the other nodes.

Datanode Indexes, stores and reads the documents.

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 18 / 27

Page 22: Big Data: motivation and overview - VUB · Big Data: motivation and overview Current Trends in Arti cial Intelligence Guillaume Levasseur ... Distributed batch processing, e.g.with

Big Data NoSQL

Elasticsearch

elastic.co

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 19 / 27

Page 23: Big Data: motivation and overview - VUB · Big Data: motivation and overview Current Trends in Arti cial Intelligence Guillaume Levasseur ... Distributed batch processing, e.g.with

Big Data NoSQL

Elasticsearch - Lucene backend

Figure: Lucene indexing process (4.bp.blogspot.com).

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 20 / 27

Page 24: Big Data: motivation and overview - VUB · Big Data: motivation and overview Current Trends in Arti cial Intelligence Guillaume Levasseur ... Distributed batch processing, e.g.with

Big Data Distributed processing

Distributed data processing

Figure: MapReduce: Distributed processing method for cluster computing (Deanet al. , 2004).

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 21 / 27

Page 25: Big Data: motivation and overview - VUB · Big Data: motivation and overview Current Trends in Arti cial Intelligence Guillaume Levasseur ... Distributed batch processing, e.g.with

Big Data Distributed processing

Batch processing

Figure: Distributed batch processing, e.g.with Apache Spark.

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 22 / 27

Page 26: Big Data: motivation and overview - VUB · Big Data: motivation and overview Current Trends in Arti cial Intelligence Guillaume Levasseur ... Distributed batch processing, e.g.with

Big Data Distributed processing

Streaming analytics

Figure: Distributed streaming analytics using e.g.Apache Storm or Flink.

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 23 / 27

Page 27: Big Data: motivation and overview - VUB · Big Data: motivation and overview Current Trends in Arti cial Intelligence Guillaume Levasseur ... Distributed batch processing, e.g.with

Big Data YCSB

Benchmarking with YCSB

0 20 40 60 80 100Read operations proportion [%]

0

500

1000

1500

2000

2500

3000

3500Th

roug

hput

[op/

s]CassandraElasticsearchHBase

Figure: Throughput for 48 YCSB clients against a 11-nodes cluster for a varyingread operations proportion with 1 M rows of 32×8 B.

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 24 / 27

Page 28: Big Data: motivation and overview - VUB · Big Data: motivation and overview Current Trends in Arti cial Intelligence Guillaume Levasseur ... Distributed batch processing, e.g.with

Application

Application

Figure: Dataflow for IoT application in household energy monitoring.

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 25 / 27

Page 29: Big Data: motivation and overview - VUB · Big Data: motivation and overview Current Trends in Arti cial Intelligence Guillaume Levasseur ... Distributed batch processing, e.g.with

Application

Application

Figure: Prototypes of home sensors.

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 26 / 27

Page 30: Big Data: motivation and overview - VUB · Big Data: motivation and overview Current Trends in Arti cial Intelligence Guillaume Levasseur ... Distributed batch processing, e.g.with

Application

Application

Wikimedia Foundation:https://grafana.wikimedia.org/

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 27 / 27

Page 31: Big Data: motivation and overview - VUB · Big Data: motivation and overview Current Trends in Arti cial Intelligence Guillaume Levasseur ... Distributed batch processing, e.g.with

Application

References

[1][2][3]

Alejandro Corbellini et al. “Persisting big-data: The NoSQLlandscape”. In: Information Systems 63 (2017), pp. 1–23.

Jeffrey Dean and Sanjay Ghemawat. “MapReduce: Simplified DataProcessing on Large Clusters”. In: OSDI 04. 2004.

Fay Chang et al. “Bigtable: A Distributed Storage System forStructured Data”. In: OSDI 06. 2006.

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 28 / 27