big data: motivation and overview - vub · big data: motivation and overview current trends in arti...

Post on 21-May-2020

2 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Big Data: motivation and overviewCurrent Trends in Artificial Intelligence

Guillaume Levasseur

IRIDIA,Universite libre de Bruxelles

Friday, 2nd March 2018

(Short) Formal definition of machine learning

Learning requires (more) data!

More data? Big Data!

Application

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 1 / 27

Machine learning

(Short) Formal definition of machine learning

f (X, ~ω) = ~y (1)

f Model

X Matrix of the features (input data)

~y Output vector

I ~y is continuous = regressionI ~y is discrete = classification (supervised) or clustering

(unsupervised)I ~y is symbolic = decision problem

~ω Weights of the model f

I Determining ~ω = train the model, learn

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 2 / 27

Motivation

Learning requires (more) data!

Learning Determining the weights of the model using a trainingdataset.

Training dataset Set of features (and output values) representative of theproblem, as big as possible.

Points Attributes0 x1[0] x2[0] x3[0]1 x1[1] x2[1] x3[1]2 x1[2] x2[2] x3[2]...

......

...

I Problem How to store this data?

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 3 / 27

Motivation Spreadsheet storage

Spreadsheet storage

Figure: Time series in a spreadsheet

I Limited to 1 million rows per file

I Cheesy

I Limited access to 1 user at atime, poor availability

I Sequential reading

⇒ Not suitable!

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 4 / 27

Motivation RDBMS

Relational database management system

Figure: Relational model (neo4j.com)

I Greater capacity (> 1 M rows)

I Guaranteed consistency throughACID properties (atomicity,consistency, isolation,durability)

I Relations (”shortcuts”)between tables

I Structured query language(SQL) to retrieve data

I Better availability throughserving systems (MySQL,SQLServer, PostgreSQL)

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 5 / 27

Motivation RDBMS

Relational database management system

I RDBMS limitations:I Fixed data schema: defined a priori, no missing data allowedI Server bound

I Problems:I What if there is missing data?

I Use custom symbols like ”NaN”.I Use flexible schema.

I What if the data varies in size, format, structure? How to combineheterogeneous data sources?

I Create separated tables and try to link them.I Use flexible schema.

I What if the workload become too heavy for the server (too manyrequests for the CPU, to much data for the HDD)?

I KIWI (kill it whit iron) = hardware upgrade.I Distribute the load on multiple machines.

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 6 / 27

Motivation RDBMS

Relational database management system

I RDBMS limitations:I Fixed data schema: defined a priori, no missing data allowedI Server bound

I Problems:I What if there is missing data?

I Use custom symbols like ”NaN”.

I Use flexible schema.

I What if the data varies in size, format, structure? How to combineheterogeneous data sources?

I Create separated tables and try to link them.

I Use flexible schema.

I What if the workload become too heavy for the server (too manyrequests for the CPU, to much data for the HDD)?

I KIWI (kill it whit iron) = hardware upgrade.

I Distribute the load on multiple machines.

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 6 / 27

Motivation RDBMS

Relational database management system

I RDBMS limitations:I Fixed data schema: defined a priori, no missing data allowedI Server bound

I Problems:I What if there is missing data?

I Use custom symbols like ”NaN”.I Use flexible schema.

I What if the data varies in size, format, structure? How to combineheterogeneous data sources?

I Create separated tables and try to link them.I Use flexible schema.

I What if the workload become too heavy for the server (too manyrequests for the CPU, to much data for the HDD)?

I KIWI (kill it whit iron) = hardware upgrade.I Distribute the load on multiple machines.

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 6 / 27

Big Data

More data? Big Data!

Remember, back in 2006...

I More data, more variety, more requests.

I Google develops Bigtable, which later inspired

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 7 / 27

Big Data

More data? Big Data!

Remember, back in 2006...

I More data, more variety, more requests.

I Google develops Bigtable, which later inspired

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 7 / 27

Big Data

More data? Big Data!The 3 Vs of Big Data:

Variety Data of many different types are to be stored...

Velocity it’s coming fast...

Volume and there is a lot of it!

ScalabilityAbility to adapt when the order of magnitude changes.

Figure: Vertical and horizontal scalability (premaseem.files.wordpress.com).

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 8 / 27

Big Data Consistency

Consistency: from ACID to BASE

I Problem ACID does not support partitioning across multiple nodes.

I BASE: basically available, soft state, eventually consistent

Figure: Consistency issue for distributed database systems (hbase.apache.org).

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 9 / 27

Big Data CAP theorem

CAP theorem

Figure: The CAP theorem: any database system has to drop either theconsistency, the availability or the partition tolerance (w3resource.com).

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 10 / 27

Big Data NoSQL

NoSQL technologies

I NoSQL = not only SQL, NewSQL = distributed relational database

I Shared-nothing (distributed) architecture ⇒ CP, AP

I Supported operations: CRUD (create, read, update, delete) and scan.

I Classification:

key-value unique ID value

columnarFamily 0 Family 1

unique ID column 0 column 1 column 2

document unique ID {attr0: value0, attr1: value1}

graph unique ID arcs vertices

search engines inverted ID plain text or document

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 11 / 27

Big Data NoSQL

NoSQL technologiesMore than 200 different systems!

Figure: Logos of some NoSQL database systems.

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 12 / 27

Big Data NoSQL

Hadoop

Figure: Architecture of the Hadoop distributed file system, the Namenode actsas master of the infrastructure (hadoop.apache.org).

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 13 / 27

Big Data NoSQL

On top of Hadoop: HBase

Figure: HBase architecture (eureka.co).

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 14 / 27

Big Data NoSQL

Cassandra

Figure: Cassandra’s ring strategy for splitting (docs.datastax.com).

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 15 / 27

Big Data NoSQL

Cassandra

Figure: Cassandra’s writing process (docs.datastax.com).

I No master: nodes use gossiping to exchange metadata about thecurrent state

I Compaction: too many SSTables ⇒ merge in one bigger SSTable.

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 16 / 27

Big Data NoSQL

Cassandra

Figure: Result of a SELECT query in CQL.

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 17 / 27

Big Data NoSQL

Elasticsearch

Figure: Sharding and replication with elasticsearch (elastic.co).

2 types of nodes:

Master Allocates the shards and the replicas to the other nodes.

Datanode Indexes, stores and reads the documents.

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 18 / 27

Big Data NoSQL

Elasticsearch

elastic.co

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 19 / 27

Big Data NoSQL

Elasticsearch - Lucene backend

Figure: Lucene indexing process (4.bp.blogspot.com).

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 20 / 27

Big Data Distributed processing

Distributed data processing

Figure: MapReduce: Distributed processing method for cluster computing (Deanet al. , 2004).

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 21 / 27

Big Data Distributed processing

Batch processing

Figure: Distributed batch processing, e.g.with Apache Spark.

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 22 / 27

Big Data Distributed processing

Streaming analytics

Figure: Distributed streaming analytics using e.g.Apache Storm or Flink.

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 23 / 27

Big Data YCSB

Benchmarking with YCSB

0 20 40 60 80 100Read operations proportion [%]

0

500

1000

1500

2000

2500

3000

3500Th

roug

hput

[op/

s]CassandraElasticsearchHBase

Figure: Throughput for 48 YCSB clients against a 11-nodes cluster for a varyingread operations proportion with 1 M rows of 32×8 B.

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 24 / 27

Application

Application

Figure: Dataflow for IoT application in household energy monitoring.

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 25 / 27

Application

Application

Figure: Prototypes of home sensors.

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 26 / 27

Application

Application

Wikimedia Foundation:https://grafana.wikimedia.org/

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 27 / 27

Application

References

[1][2][3]

Alejandro Corbellini et al. “Persisting big-data: The NoSQLlandscape”. In: Information Systems 63 (2017), pp. 1–23.

Jeffrey Dean and Sanjay Ghemawat. “MapReduce: Simplified DataProcessing on Large Clusters”. In: OSDI 04. 2004.

Fay Chang et al. “Bigtable: A Distributed Storage System forStructured Data”. In: OSDI 06. 2006.

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 28 / 27

top related