big data: motivation and overview - vub · big data: motivation and overview current trends in arti...
TRANSCRIPT
Big Data: motivation and overviewCurrent Trends in Artificial Intelligence
Guillaume Levasseur
IRIDIA,Universite libre de Bruxelles
Friday, 2nd March 2018
(Short) Formal definition of machine learning
Learning requires (more) data!
More data? Big Data!
Application
Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 1 / 27
Machine learning
(Short) Formal definition of machine learning
f (X, ~ω) = ~y (1)
f Model
X Matrix of the features (input data)
~y Output vector
I ~y is continuous = regressionI ~y is discrete = classification (supervised) or clustering
(unsupervised)I ~y is symbolic = decision problem
~ω Weights of the model f
I Determining ~ω = train the model, learn
Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 2 / 27
Motivation
Learning requires (more) data!
Learning Determining the weights of the model using a trainingdataset.
Training dataset Set of features (and output values) representative of theproblem, as big as possible.
Points Attributes0 x1[0] x2[0] x3[0]1 x1[1] x2[1] x3[1]2 x1[2] x2[2] x3[2]...
......
...
I Problem How to store this data?
Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 3 / 27
Motivation Spreadsheet storage
Spreadsheet storage
Figure: Time series in a spreadsheet
I Limited to 1 million rows per file
I Cheesy
I Limited access to 1 user at atime, poor availability
I Sequential reading
⇒ Not suitable!
Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 4 / 27
Motivation RDBMS
Relational database management system
Figure: Relational model (neo4j.com)
I Greater capacity (> 1 M rows)
I Guaranteed consistency throughACID properties (atomicity,consistency, isolation,durability)
I Relations (”shortcuts”)between tables
I Structured query language(SQL) to retrieve data
I Better availability throughserving systems (MySQL,SQLServer, PostgreSQL)
Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 5 / 27
Motivation RDBMS
Relational database management system
I RDBMS limitations:I Fixed data schema: defined a priori, no missing data allowedI Server bound
I Problems:I What if there is missing data?
I Use custom symbols like ”NaN”.I Use flexible schema.
I What if the data varies in size, format, structure? How to combineheterogeneous data sources?
I Create separated tables and try to link them.I Use flexible schema.
I What if the workload become too heavy for the server (too manyrequests for the CPU, to much data for the HDD)?
I KIWI (kill it whit iron) = hardware upgrade.I Distribute the load on multiple machines.
Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 6 / 27
Motivation RDBMS
Relational database management system
I RDBMS limitations:I Fixed data schema: defined a priori, no missing data allowedI Server bound
I Problems:I What if there is missing data?
I Use custom symbols like ”NaN”.
I Use flexible schema.
I What if the data varies in size, format, structure? How to combineheterogeneous data sources?
I Create separated tables and try to link them.
I Use flexible schema.
I What if the workload become too heavy for the server (too manyrequests for the CPU, to much data for the HDD)?
I KIWI (kill it whit iron) = hardware upgrade.
I Distribute the load on multiple machines.
Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 6 / 27
Motivation RDBMS
Relational database management system
I RDBMS limitations:I Fixed data schema: defined a priori, no missing data allowedI Server bound
I Problems:I What if there is missing data?
I Use custom symbols like ”NaN”.I Use flexible schema.
I What if the data varies in size, format, structure? How to combineheterogeneous data sources?
I Create separated tables and try to link them.I Use flexible schema.
I What if the workload become too heavy for the server (too manyrequests for the CPU, to much data for the HDD)?
I KIWI (kill it whit iron) = hardware upgrade.I Distribute the load on multiple machines.
Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 6 / 27
Big Data
More data? Big Data!
Remember, back in 2006...
I More data, more variety, more requests.
I Google develops Bigtable, which later inspired
Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 7 / 27
Big Data
More data? Big Data!
Remember, back in 2006...
I More data, more variety, more requests.
I Google develops Bigtable, which later inspired
Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 7 / 27
Big Data
More data? Big Data!The 3 Vs of Big Data:
Variety Data of many different types are to be stored...
Velocity it’s coming fast...
Volume and there is a lot of it!
ScalabilityAbility to adapt when the order of magnitude changes.
Figure: Vertical and horizontal scalability (premaseem.files.wordpress.com).
Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 8 / 27
Big Data Consistency
Consistency: from ACID to BASE
I Problem ACID does not support partitioning across multiple nodes.
I BASE: basically available, soft state, eventually consistent
Figure: Consistency issue for distributed database systems (hbase.apache.org).
Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 9 / 27
Big Data CAP theorem
CAP theorem
Figure: The CAP theorem: any database system has to drop either theconsistency, the availability or the partition tolerance (w3resource.com).
Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 10 / 27
Big Data NoSQL
NoSQL technologies
I NoSQL = not only SQL, NewSQL = distributed relational database
I Shared-nothing (distributed) architecture ⇒ CP, AP
I Supported operations: CRUD (create, read, update, delete) and scan.
I Classification:
key-value unique ID value
columnarFamily 0 Family 1
unique ID column 0 column 1 column 2
document unique ID {attr0: value0, attr1: value1}
graph unique ID arcs vertices
search engines inverted ID plain text or document
Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 11 / 27
Big Data NoSQL
NoSQL technologiesMore than 200 different systems!
Figure: Logos of some NoSQL database systems.
Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 12 / 27
Big Data NoSQL
Hadoop
Figure: Architecture of the Hadoop distributed file system, the Namenode actsas master of the infrastructure (hadoop.apache.org).
Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 13 / 27
Big Data NoSQL
On top of Hadoop: HBase
Figure: HBase architecture (eureka.co).
Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 14 / 27
Big Data NoSQL
Cassandra
Figure: Cassandra’s ring strategy for splitting (docs.datastax.com).
Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 15 / 27
Big Data NoSQL
Cassandra
Figure: Cassandra’s writing process (docs.datastax.com).
I No master: nodes use gossiping to exchange metadata about thecurrent state
I Compaction: too many SSTables ⇒ merge in one bigger SSTable.
Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 16 / 27
Big Data NoSQL
Cassandra
Figure: Result of a SELECT query in CQL.
Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 17 / 27
Big Data NoSQL
Elasticsearch
Figure: Sharding and replication with elasticsearch (elastic.co).
2 types of nodes:
Master Allocates the shards and the replicas to the other nodes.
Datanode Indexes, stores and reads the documents.
Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 18 / 27
Big Data NoSQL
Elasticsearch
elastic.co
Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 19 / 27
Big Data NoSQL
Elasticsearch - Lucene backend
Figure: Lucene indexing process (4.bp.blogspot.com).
Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 20 / 27
Big Data Distributed processing
Distributed data processing
Figure: MapReduce: Distributed processing method for cluster computing (Deanet al. , 2004).
Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 21 / 27
Big Data Distributed processing
Batch processing
Figure: Distributed batch processing, e.g.with Apache Spark.
Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 22 / 27
Big Data Distributed processing
Streaming analytics
Figure: Distributed streaming analytics using e.g.Apache Storm or Flink.
Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 23 / 27
Big Data YCSB
Benchmarking with YCSB
0 20 40 60 80 100Read operations proportion [%]
0
500
1000
1500
2000
2500
3000
3500Th
roug
hput
[op/
s]CassandraElasticsearchHBase
Figure: Throughput for 48 YCSB clients against a 11-nodes cluster for a varyingread operations proportion with 1 M rows of 32×8 B.
Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 24 / 27
Application
Application
Figure: Dataflow for IoT application in household energy monitoring.
Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 25 / 27
Application
Application
Figure: Prototypes of home sensors.
Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 26 / 27
Application
Application
Wikimedia Foundation:https://grafana.wikimedia.org/
Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 27 / 27
Application
References
[1][2][3]
Alejandro Corbellini et al. “Persisting big-data: The NoSQLlandscape”. In: Information Systems 63 (2017), pp. 1–23.
Jeffrey Dean and Sanjay Ghemawat. “MapReduce: Simplified DataProcessing on Large Clusters”. In: OSDI 04. 2004.
Fay Chang et al. “Bigtable: A Distributed Storage System forStructured Data”. In: OSDI 06. 2006.
Guillaume Levasseur (IRIDIA) Big Data 2018.03.02 28 / 27