
Big Data: motivation and overview

Current Trends in Artificial Intelligence

Guillaume Levasseur

IRIDIA, Université libre de Bruxelles

Friday, 2nd March 2018

(Short) Formal definition of machine learning

Learning requires (more) data!

More data? Big Data!

Application

Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 1 / 27

Machine learning

(Short) Formal definition of machine learning

$f(X, \vec{\omega}) = \vec{y}$ (1)

$f$ Model

$X$ Matrix of the features (input data)

$\vec{y}$ Output vector

- $\vec{y}$ is continuous = regression
- $\vec{y}$ is discrete = classification (supervised) or clustering (unsupervised)
- $\vec{y}$ is symbolic = decision problem

$\vec{\omega}$ Weights of the model $f$

- Determining $\vec{\omega}$ = train the model, learn
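As a minimal sketch of equation (1), the snippet below assumes a linear model with a single attribute and determines the weights by closed-form least squares. The model form, the helper names `f` and `train`, and the toy data are illustrative assumptions, not part of the original slides.

```python
# Minimal sketch: f(X, w) = y for an assumed linear model y = w0 + w1 * x.

def f(X, w):
    """Apply the model: one output value per row of the feature matrix X."""
    return [w[0] + w[1] * row[0] for row in X]

def train(X, y):
    """Determine the weights from a training dataset (closed-form least squares)."""
    xs = [row[0] for row in X]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(y) / n
    cov = sum((x - mean_x) * (t - mean_y) for x, t in zip(xs, y))
    var = sum((x - mean_x) ** 2 for x in xs)
    w1 = cov / var
    w0 = mean_y - w1 * mean_x
    return [w0, w1]

X = [[0.0], [1.0], [2.0], [3.0]]  # feature matrix: 4 points, 1 attribute
y = [1.0, 3.0, 5.0, 7.0]          # continuous output, so this is regression
w = train(X, y)                   # learning = determining the weights
predictions = f(X, w)
```

Since the output is continuous here, this is a regression; a classifier would keep the same structure with a discrete output.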

Motivation

Learning requires (more) data!

Learning Determining the weights of the model using a training dataset.

Training dataset Set of features (and output values) representative of the problem, as big as possible.

Points  Attributes
0       x1[0]  x2[0]  x3[0]
1       x1[1]  x2[1]  x3[1]
2       x1[2]  x2[2]  x3[2]
...     ...

- Problem: How to store this data?

Motivation Spreadsheet storage

Spreadsheet storage

Figure: Time series in a spreadsheet

- Limited to 1 million rows per file
- Cheesy
- Limited access to 1 user at a time, poor availability
- Sequential reading

⇒ Not suitable!

Motivation RDBMS

Relational database management system

Figure: Relational model (neo4j.com)

- Greater capacity (> 1 M rows)
- Guaranteed consistency through ACID properties (atomicity, consistency, isolation, durability)
- Relations ("shortcuts") between tables
- Structured query language (SQL) to retrieve data
- Better availability through serving systems (MySQL, SQL Server, PostgreSQL)


- RDBMS limitations:
    - Fixed data schema: defined a priori, no missing data allowed
    - Server bound
- Problems:
    - What if there is missing data?
        - Use custom symbols like "NaN".
        - Use a flexible schema.
    - What if the data varies in size, format, or structure? How to combine heterogeneous data sources?
        - Create separate tables and try to link them.
        - Use a flexible schema.
    - What if the workload becomes too heavy for the server (too many requests for the CPU, too much data for the HDD)?
        - KIWI (kill it with iron) = hardware upgrade.
        - Distribute the load on multiple machines.
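The fixed-schema limitation can be sketched with Python's built-in `sqlite3` module. The table `readings`, its columns, and the sample records are hypothetical, and the columns are assumed to be declared `NOT NULL` so that a missing value is rejected; the list of dictionaries stands in for a flexible schema.

```python
import json
import sqlite3

# Fixed schema: both columns are declared up front and may not be missing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT NOT NULL, power_w REAL NOT NULL)")
conn.execute("INSERT INTO readings VALUES (?, ?)", ("kitchen", 120.5))

try:
    # A record with a missing measurement violates the schema.
    conn.execute("INSERT INTO readings VALUES (?, ?)", ("garage", None))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Flexible schema: each record carries only the attributes it actually has.
documents = [
    {"sensor": "kitchen", "power_w": 120.5},
    {"sensor": "garage"},                                        # missing value
    {"sensor": "roof", "power_w": 80.0, "temperature_c": 14.2},  # extra attribute
]
serialized = [json.dumps(doc) for doc in documents]
with_power = [doc["sensor"] for doc in documents if "power_w" in doc]
```

The price of the flexible form is that every reader has to check which attributes are present, as the last line does.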


Big Data

Remember, back in 2006...

- More data, more variety, more requests.
- Google develops Bigtable, which later inspired


More data? Big Data!

The 3 Vs of Big Data:

Variety Data of many different types are to be stored...

Velocity it’s coming fast...

Volume and there is a lot of it!

Scalability Ability to adapt when the order of magnitude changes.

Figure: Vertical and horizontal scalability (premaseem.files.wordpress.com).

Big Data Consistency

Consistency: from ACID to BASE

- Problem: ACID guarantees do not scale well when the data is partitioned across multiple nodes.
- BASE: basically available, soft state, eventually consistent
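A toy sketch of "eventually consistent": a write is acknowledged after reaching one replica and is propagated to the others later, so a read in between can return a stale value. The classes `Replica` and `EventuallyConsistentStore` are hypothetical and model no particular database.

```python
class Replica:
    """One node holding a full copy of the data."""
    def __init__(self):
        self.data = {}

class EventuallyConsistentStore:
    def __init__(self, n_replicas=3):
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.pending = []  # (replica_index, key, value) not yet propagated

    def write(self, key, value):
        # Acknowledge as soon as the first replica has the value (available),
        # and queue the update for the other replicas (soft state).
        self.replicas[0].data[key] = value
        for i in range(1, len(self.replicas)):
            self.pending.append((i, key, value))

    def read(self, key, replica_index):
        return self.replicas[replica_index].data.get(key)

    def propagate(self):
        # Background synchronization: replicas converge (eventually consistent).
        for i, key, value in self.pending:
            self.replicas[i].data[key] = value
        self.pending.clear()

store = EventuallyConsistentStore()
store.write("x", 1)
stale = store.read("x", replica_index=2)  # replica 2 has not seen the write yet
store.propagate()
fresh = store.read("x", replica_index=2)  # all replicas now agree
```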

Figure: Consistency issue for distributed database systems (hbase.apache.org).

Big Data CAP theorem

CAP theorem

Figure: The CAP theorem: any database system has to drop either the consistency, the availability or the partition tolerance (w3resource.com).

Big Data NoSQL

NoSQL technologies

- NoSQL = not only SQL, NewSQL = distributed relational database
- Shared-nothing (distributed) architecture ⇒ CP, AP
- Supported operations: CRUD (create, read, update, delete) and scan.
- Classification:

Type            Key          Value
key-value       unique ID    value
columnar        unique ID    Family 0, Family 1 (column 0, column 1, column 2)
document        unique ID    {attr0: value0, attr1: value1}
graph           unique ID    arcs, vertices
search engines  inverted ID  plain text or document
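The supported operations can be sketched on an in-memory key-value store. The class `KeyValueStore` is hypothetical, and the split of columns between the two families in the columnar example is an assumption, since the slide does not make it explicit.

```python
class KeyValueStore:
    """Toy store exposing CRUD (create, read, update, delete) and scan."""
    def __init__(self):
        self._rows = {}

    def create(self, key, value):
        if key in self._rows:
            raise KeyError(f"{key!r} already exists")
        self._rows[key] = value

    def read(self, key):
        return self._rows[key]

    def update(self, key, value):
        if key not in self._rows:
            raise KeyError(key)
        self._rows[key] = value

    def delete(self, key):
        del self._rows[key]

    def scan(self, start_key, count):
        """Return up to `count` rows in key order, starting at `start_key`."""
        keys = sorted(k for k in self._rows if k >= start_key)
        return [(k, self._rows[k]) for k in keys[:count]]

store = KeyValueStore()
store.create("id1", "value")                                  # key-value
store.create("id2", {"attr0": "value0", "attr1": "value1"})   # document
store.create("id3", {"family0": {"column0": 1},               # columnar (assumed split)
                     "family1": {"column1": 2, "column2": 3}})
store.update("id1", "new value")
store.delete("id2")
scanned = store.scan("id1", 10)
```

Only the shape of the value changes between the key-value, document and columnar models; the access pattern by unique ID stays the same.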


NoSQL technologies

More than 200 different systems!

Figure: Logos of some NoSQL database systems.


Hadoop

Figure: Architecture of the Hadoop distributed file system, the Namenode acts as master of the infrastructure (hadoop.apache.org).


On top of Hadoop: HBase

Figure: HBase architecture (eureka.co).


Cassandra

Figure: Cassandra’s ring strategy for splitting (docs.datastax.com).
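A ring strategy of this kind can be sketched as consistent hashing: each node sits at a token on a ring, and a row goes to the first node whose token is at or after the hash of its partition key. The class `Ring`, the node names, and the use of MD5 as a stand-in hash are assumptions for illustration; this is not Cassandra's actual partitioner.

```python
import bisect
import hashlib

def token(value):
    """Map a string to a position on the ring (MD5 is a stand-in hash)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Each node is placed on the ring at the token derived from its name.
        self._tokens = sorted((token(n), n) for n in nodes)
        self._positions = [t for t, _ in self._tokens]

    def replicas(self, partition_key, replication_factor=1):
        """Walk clockwise from the key's token and collect the next nodes."""
        start = bisect.bisect_left(self._positions, token(partition_key))
        return [self._tokens[(start + k) % len(self._tokens)][1]
                for k in range(replication_factor)]

    def owner(self, partition_key):
        return self.replicas(partition_key, 1)[0]

ring = Ring(["node-a", "node-b", "node-c", "node-d"])
primary = ring.owner("sensor-17")
copies = ring.replicas("sensor-17", replication_factor=3)
```

Adding a node only moves the keys that fall between the new token and its predecessor, which is what makes the scheme attractive for horizontal scaling.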


Figure: Cassandra’s writing process (docs.datastax.com).

- No master: nodes use gossiping to exchange metadata about the current state
- Compaction: too many SSTables ⇒ merge into one bigger SSTable.
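Compaction can be sketched as a merge that keeps, for each key, the value with the newest timestamp. SSTables are modelled here as plain dictionaries mapping a key to a `(timestamp, value)` pair; this representation and the function `compact` are simplifying assumptions, and deletions (tombstones) are left out.

```python
def compact(sstables):
    """Merge several SSTables into one, keeping the newest value per key."""
    merged = {}
    for table in sstables:
        for key, (timestamp, value) in table.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    # An SSTable is sorted by key, so the merged table is sorted as well.
    return dict(sorted(merged.items()))

older = {"a": (1, "v1"), "b": (1, "v1")}
newer = {"b": (2, "v2"), "c": (2, "v1")}
compacted = compact([older, newer])
```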


Figure: Result of a SELECT query in CQL.


Elasticsearch

Figure: Sharding and replication with Elasticsearch (elastic.co).

2 types of nodes:

Master Allocates the shards and the replicas to the other nodes.

Datanode Indexes, stores and reads the documents.
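The two roles can be sketched as follows: a modulo rule routes each document to a primary shard, and an allocation step spreads the primaries and their replicas over the data nodes so that two copies of a shard never share a node. The functions `shard_for` and `allocate`, the CRC32 stand-in hash, and the round-robin placement are assumptions for illustration, not Elasticsearch's actual allocator.

```python
import zlib

def shard_for(doc_id, n_primary_shards):
    """Route a document to a primary shard (CRC32 is a stand-in hash)."""
    return zlib.crc32(doc_id.encode("utf-8")) % n_primary_shards

def allocate(n_primary_shards, n_replicas, data_nodes):
    """Round-robin allocation keeping the copies of a shard on different nodes."""
    if n_replicas + 1 > len(data_nodes):
        raise ValueError("not enough data nodes to separate all copies")
    allocation = {}
    for shard in range(n_primary_shards):
        for copy in range(n_replicas + 1):
            node = data_nodes[(shard + copy) % len(data_nodes)]
            role = "primary" if copy == 0 else "replica"
            allocation[(shard, copy)] = (node, role)
    return allocation

nodes = ["node-1", "node-2", "node-3"]
allocation = allocate(n_primary_shards=3, n_replicas=1, data_nodes=nodes)
target_shard = shard_for("doc-42", n_primary_shards=3)
```

Because the shard number depends on the number of primary shards, that number cannot be changed without re-routing the documents.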



Elasticsearch - Lucene backend

Figure: Lucene indexing process (4.bp.blogspot.com).
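The core idea behind such an index is the inverted index: instead of mapping documents to their words, it maps each term to the documents that contain it. The sketch below is a simplified assumption (lowercasing and whitespace splitting only) and does not reproduce Lucene's analysis chain or file format.

```python
from collections import defaultdict

def tokenize(text):
    """Very simple analyzer: lowercase and split on whitespace."""
    return text.lower().split()

def build_index(documents):
    """Map each term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the documents containing every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

documents = {
    1: "Big Data needs storage",
    2: "Data streams arrive fast",
    3: "Big clusters",
}
index = build_index(documents)
hits = search(index, "big data")
```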

Big Data Distributed processing

Distributed data processing

Figure: MapReduce: Distributed processing method for cluster computing (Dean et al., 2004).

Batch processing

Figure: Distributed batch processing, e.g. with Apache Spark.

Streaming analytics

Figure: Distributed streaming analytics using e.g. Apache Storm or Flink.
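The MapReduce method named above can be sketched on a single machine with the classic word count: a map phase emits key-value pairs per input chunk, a shuffle groups the pairs by key, and a reduce phase aggregates each group. The function names and input chunks are illustrative; on a cluster, each chunk and each key group would be handled by a different worker.

```python
from collections import defaultdict

def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one input chunk."""
    return [(word, 1) for word in chunk.lower().split()]

def shuffle(pairs):
    """Group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Aggregate all the values emitted for one key."""
    return key, sum(values)

chunks = ["big data big", "data stream"]  # one chunk per (simulated) worker
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```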

Big Data YCSB

Benchmarking with YCSB

Figure: Throughput for 48 YCSB clients against an 11-node cluster for a varying read operations proportion with 1 M rows of 32×8 B. Horizontal axis: read operations proportion [%], from 0 to 100; vertical axis: throughput; series: Cassandra, Elasticsearch, HBase.
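The principle of such a benchmark can be sketched as a loop that mixes reads and writes according to a read proportion and reports the operations per second. The function `run_workload` is hypothetical and runs against an in-memory dictionary instead of a database cluster; it is not the YCSB tool itself, and its timings say nothing about the systems in the figure.

```python
import random
import time

def run_workload(store, n_operations, read_proportion, seed=0):
    """Run a read/write mix on `store`; return (reads, writes, ops per second)."""
    rng = random.Random(seed)
    keys = list(store)
    reads = writes = 0
    start = time.perf_counter()
    for _ in range(n_operations):
        key = rng.choice(keys)
        if rng.random() < read_proportion:
            _ = store[key]                      # read operation
            reads += 1
        else:
            store[key] = rng.getrandbits(64)    # write (update) operation
            writes += 1
    elapsed = time.perf_counter() - start
    throughput = (n_operations / elapsed) if elapsed > 0 else float("inf")
    return reads, writes, throughput

rows = {f"user{i}": 0 for i in range(1000)}
results = {p: run_workload(dict(rows), 10_000, p / 100) for p in (0, 50, 100)}
```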

Application

Figure: Dataflow for IoT application in household energy monitoring.


Figure: Prototypes of home sensors.


Wikimedia Foundation: https://grafana.wikimedia.org/


References

[1] Alejandro Corbellini et al. "Persisting big-data: The NoSQL landscape". In: Information Systems 63 (2017), pp. 1–23.

[2] Jeffrey Dean and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters". In: OSDI '04. 2004.

[3] Fay Chang et al. "Bigtable: A Distributed Storage System for Structured Data". In: OSDI '06. 2006.
