Planning

Time   Concept     Subject                          Speaker
9:05   Keynote 1   Big Data Processing              Sylvain LEFEBVRE
9:45   Keynote 2   Smart Cities                     Gilles BETIS
10:30  Break
11:00  Keynote 3   Distributed Intelligence in IoT  Philippe GAUTIER
11:45  Keynote 4   Data Protection                  Anne BARBIER GOLIRO
12:30  Lunch
14:30  Workshop W  Graph Databases                  Cedric FAUVET
17:30  Cocktail
Introduction to Big Data Processing
Raja Chiky – [email protected]
About the RDI Team
R. Chiky: Associate professor in Computer Science, LISITE-RDI
Research interests: data stream mining, scalability and resource optimization in distributed architectures (e.g. cloud architectures), recommender systems
Research field: large-scale data management
1. Real-time and distributed processing of various data sources
2. Use of semantic technologies to add a semantic layer
3. Recommender systems and collaborative data mining
4. Optimizing resources in large-scale systems
[Figure: heterogeneous and dynamic data streams (e.g. from sensors) and heterogeneous static data feeding these research themes]
Goal of this talk
• Recognise some of the main terminology
• Remember that there are many tools available
• Focus on Hadoop, the most popular open-source Big Data ecosystem
• Realise the potential of Big Data

CONTENT
• What is Big Data?
• Data Streaming
• NoSQL databases
• Distributed File System
• MapReduce paradigm
• Visualization
Big Data: Buzzword!

What is Big Data?
Volume of data created worldwide: about 5 EB from the dawn of time up to 2003, 2.7 ZB in 2012, and an estimated 10 ZB in 2015.

1 GB = 10^9 bytes · 1 TB = 10^12 bytes · 1 PB = 10^15 bytes · 1 EB = 10^18 bytes · 1 ZB = 10^21 bytes · 1 YB = 10^24 bytes
Big Data Elements: Volume, Variety, Velocity
+ Veracity (IBM): information uncertainty

Volume and velocity of data: Walmart handles 1M transactions per hour; Google processes 24 PB of data per day; AT&T transfers 30 PB of data per day; 90 trillion emails are sent per year; World of Warcraft uses 1.3 PB of storage; Facebook, with a user base of 900M users, had 25 PB of compressed data; 400M tweets per day in June '12; 72 hours of video are uploaded to YouTube every minute.

Variety of data: radio, TV, news, e-mails, Facebook posts, tweets, blogs, photos, videos (user and paid), RSS feeds, Wikipedia, GPS data, RFID, POS scanners, …

Source: Big Data & Analytics - Why Should We Care?, Vishwa Kolla
What is Big Data?

Gartner definition: "Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making."

McKinsey definition: "A dataset whose size is beyond the ability of typical database software tools to capture, store, manage and analyze."
Key factors
• Cheap storage: recording everything is not expensive anymore
• Cloud computing: cheap, on-demand computing resources, from anywhere in the world and for everyone
• Business reasons: new insights arise that give a competitive advantage
• Data in various forms everywhere: IoT and IoE, social networks, Open Data
• The way we interact with each other and with data / information
• …
World of data
Website logs, network monitoring, financial services, eCommerce, traffic control, power consumption, weather forecasting, …
Data may come from humans, sensors or machines.
Transforming our daily lives
Then: one size fits all → Now: personalization & targeted selling
Source: Big Data Trends by David Feinleib
Fitness
Then: manual tracking → Now: focus on the goal
Source: Big Data Trends by David Feinleib
Big Data workflow
1. Capture
2. Store
3. Analyze
4. Visualize
Challenges arise in all of these steps.
Challenges: Data Collection
• Heterogeneity of sources: company databases (silos); sensor networks and intelligent objects; data streams (social networks, financial information, etc.)
• Data velocity
• Data provenance and quality
• Security / privacy
Type of data used in Big Data initiatives
[Chart: shares of internal data, traditional sources, and "new data"]
Source: Big Data opportunities survey, Unisphere / SAP, May 2013.
Challenges: Data Collection (Velocity)
Source: http://practicalanalytics.co/2012/10/22/sizing-mobile-social-big-data-stats/
What is a data stream?
Golab & Özsu (2003): "A data stream is a real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items. It is impossible to control the order in which items arrive, nor is it feasible to locally store a stream in its entirety."
Massive volumes of data; items arrive at a high rate.
Data Stream Management Systems (DBMS vs. DSMS)
• Data model. DBMS: permanent updatable relations. DSMS: streams and permanent updatable relations.
• Storage. DBMS: data is stored on disk. DSMS: permanent relations are stored on disk; streams are processed on the fly.
• Query. DBMS: SQL (creating structures; inserting, updating and deleting data; retrieving data with one-time queries). DSMS: SQL-like query language (standard SQL on permanent relations, extended SQL on streams with windowing); continuous queries.
• Performance. DBMS: large volumes of data. DSMS: optimization of computing resources to deal with several streams and several queries; ability to face variations in arrival rates without crashing.
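To make windowing and continuous queries concrete, here is a minimal illustrative sketch in Java (not from the talk): a continuous query that reports the number of items seen in the last 60 seconds, processed on the fly without storing the whole stream.

import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative continuous query: how many items arrived in the last 60 seconds?
// The stream is processed on the fly; only timestamps inside the window are kept.
public class SlidingWindowCount {
    private static final long WINDOW_MS = 60_000;
    private final Deque<Long> arrivals = new ArrayDeque<>();

    // Called once per stream item; returns the current answer of the query.
    public int onItem(long timestampMs) {
        arrivals.addLast(timestampMs);
        // Expire items that have fallen out of the window.
        while (!arrivals.isEmpty() && arrivals.peekFirst() < timestampMs - WINDOW_MS) {
            arrivals.removeFirst();
        }
        return arrivals.size();
    }
}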
Too many data streams, but not enough knowledge!
Semantic Web technologies for data streams
• Annotate stream data with semantic metadata
• Apply Linked Data principles to publish streaming data
• Interlink streaming data with existing datasets
• Integrate data stream processing + reasoning
Objectives: interoperability, automation, enrichment
Challenges in data storage
• Large amounts of data: need for a highly distributed architecture
• Massive queries: avoid joins, since they are very time-consuming
• Evolutionary schema: flexibility and scalability
• Predictable and low latency
• High availability
• Elasticity: horizontal extensibility
• No need for transactions, strong consistency or complex queries
Limitation of RDBMS
"If the only tool you have is a hammer, you tend to see every problem as a nail."
Abraham Maslow
Limitations of RDBMS
Relational DBMS offer:
• join operators between tables, to build complex queries involving several entities
• integrity constraints
• transaction management with ACID properties
In a highly distributed environment, these mechanisms have a significant cost: with most RDBMS, the data live on a single machine (one node), and it is difficult to place the data on different nodes.
NoSQL: Not Only SQL, not only relational
NoSQL?
• NoSQL means "Not Only SQL": SQL must not die, but alternative storage solutions should be considered for specific applications (especially web applications)
• More exact name: non-relational databases
• The ACID model (Atomicity, Consistency, Isolation, Durability) does not allow scalability in a distributed environment, for example by limiting the write speed (the most expensive operation)
• 4 Nos: 1) NO SCHEMA (schema-free) 2) NO JOIN (extract data without joins) 3) NO single DATA FORMAT (graph, document, row, column) 4) NO ACID transactions
Scalability
• Scalability is the ability of a system, network, or process to handle a growing amount of work, or its ability to be enlarged to accommodate that growth.
• Two kinds:
– Horizontal scalability (scale out): add ordinary machines (commodity hardware); less expensive; more complex to set up (due to load-balancing concerns)
– Vertical scalability (scale up): add powerful servers; costly; easy to set up
Sharding (data partitioning)
• A shard (partition) is a logical division of a database into several independent parts. This allows us to obtain a storage capacity greater than the limited capacity of a single hard disk, or to run queries in parallel on multiple partitions.
• Two kinds (a minimal routing sketch follows below):
– Vertical sharding: each node stores one (or more) table(s) of a database
– Horizontal sharding: each node stores a subset (identified by a key range) of a data table
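As an illustration of horizontal sharding (a sketch, not from the talk), the following Java snippet routes each record to a node by key range; production systems typically use range tables or consistent hashing instead of this hard-coded mapping, and all node names here are hypothetical.

import java.util.TreeMap;

// Illustrative key-range router for horizontal sharding: each node owns
// the keys from its lower bound (inclusive) up to the next node's bound.
public class RangeShardRouter {
    private final TreeMap<String, String> lowerBoundToNode = new TreeMap<>();

    public void addShard(String lowerBoundKey, String node) {
        lowerBoundToNode.put(lowerBoundKey, node);
    }

    // Pick the shard whose range contains the key
    // (assumes the key is not below the lowest registered bound).
    public String nodeFor(String key) {
        return lowerBoundToNode.floorEntry(key).getValue();
    }

    public static void main(String[] args) {
        RangeShardRouter router = new RangeShardRouter();
        router.addShard("a", "node1");  // keys a..j
        router.addShard("k", "node2");  // keys k..s
        router.addShard("t", "node3");  // keys t..z
        System.out.println(router.nodeFor("mueller")); // -> node2
    }
}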
CAP theorem (conjectured by E. Brewer in 2000, proved by S. Gilbert and N. Lynch in 2002)
"CAP theorem": Consistency, Availability, Partition-tolerance: choose two.
Claim: every distributed system is on one side of the triangle.
• CA: available and consistent, unless there is a partition.
• AP: a reachable replica provides service even in a partition, but may be inconsistent.
• CP: always consistent, even in a partition, but a reachable replica may deny service without the agreement of the others.
NoSQL Taxonomy
Four data models:
• Key-value
• Document
• Column
• Graph
Presentation of HDFS
❑ HDFS is the Hadoop Distributed File System
❑ HDFS was inspired by the Google File System (GFS): "The Google File System", Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, Google, 2003.
❑ Historically, it is composed of 2 main node types:
➢ The NameNode is in charge of metadata management (one NameNode per cluster)
➢ The DataNode is in charge of data storage (one DataNode per machine)
❑ Each file in HDFS is split into blocks (the block size is 64 MB by default)
❑ Each block is replicated on different DataNodes (3 replicas by default): the replication mechanism is important for both fault tolerance and data availability
[Figure: File1 split into blocks 1-4; the NameNode holds the metadata while each block is stored on three different DataNodes]
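A minimal client-side sketch, assuming the standard HDFS Java API (the path and NameNode address are illustrative): the client asks the NameNode for block metadata, then reads the blocks from the DataNodes transparently through the returned stream.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // illustrative cluster address
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader in = new BufferedReader(new InputStreamReader(
                     fs.open(new Path("/data/visits")), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // blocks are fetched from DataNodes under the hood
            }
        }
    }
}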
Challenges in Data Analytics
Problems in large-scale analytics:
• Distributed computation efficiency: evaluating performance gains from distribution; bringing data to the processor; efficient parallel algorithms (statistics, summaries)
• Speed analytics: streaming computations; streaming languages and libraries; load balancing
Big Data: technological challenges
• Data infrastructure tools and platforms: data centers, cloud infrastructures, NoSQL databases, in-memory databases, the Hadoop/MapReduce ecosphere
• New generation of front-end tools for BI and analytic systems: data visualization and visual analytics, self-service BI, mobile BI
• Data processing: supercomputers, distributed or massively parallel computing
MapReduce
• Introduced by Google in 2004
• Aim: to parallelize processing (indexing, data mining, ...) by spreading the task of processing data over many machines
• MapReduce offers: easy parallelization and distribution of processing; fault tolerance; load balancing; an abstraction for programmers

Programming model
• Based on functional programming languages
• Developers implement two functions:
– Map: in the map phase, data is distributed to a number of machines and processed in the form of key/value pairs; the output is partitioned (and sorted) by key.
– Reduce: for each key group, the values are aggregated (reduced) to form the final result.
MapReduce (Global architecture)
Example: Word Count
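The slide names the canonical word-count example but does not show the code; below is a sketch in Java, essentially the standard Apache Hadoop tutorial program. The map function emits (word, 1) pairs and the reduce function sums the counts for each key.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word (key group).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}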
Moving computation to data
• Data is stored in distributed file systems (e.g. Google File System, Hadoop Distributed File System), in large blocks (a.k.a. chunks), usually of 64 MB; chunks are replicated and distributed over machines
• The map function runs on each of the chunks
• A master node knows the data locality: it receives jobs, computes the necessary map and reduce tasks, and selects and activates worker nodes (selected, if possible, close to the data)
• There are many MapReduce implementations: C#, C++, Erlang, Java, Python, etc. Apache Hadoop MapReduce is arguably the most prominent one.
Hadoop
• Open-source project (Apache Software Foundation)
• A software platform that lets one easily write and run applications that process vast amounts of data across a cluster of machines
• Written in Java; it includes MapReduce, HDFS and HBase
• Yahoo! is the biggest contributor
• Used by Amazon, Apple, eBay, IBM, Google, Microsoft, SAP, Twitter, etc.: http://wiki.apache.org/hadoop/PoweredBy
Hadoop Ecosystem
• Avro: remote procedure call and data serialization
• Flume: harvesting, aggregating and moving large amounts of log data in and out of Hadoop
• HBase: column-oriented database
• Hive: warehouse structure and SQL-like access for data in HDFS
• Pig: high-level scripting language (Pig Latin) for querying
• Sqoop: "SQL to Hadoop" (data import from RDBMS to HDFS)
• Oozie: job coordinator and workflow manager
• Hue: graphical interface for Hadoop
• Chukwa: large-scale log collection and analysis
• …
Problems with MapReduce
• The main appeal of MapReduce is its simplicity, as long as the application does not require complex SQL-style queries
• MapReduce is fairly low-level: one must think about keys, values, partitioning, etc.
• All data and intermediate data are written to disk!
• All standard database operations (join, selection, projection, etc.) must be coded by hand
• Solution: use high-level languages (Pig, Hive, etc.) that translate programs to MapReduce automatically
Alternative to MapReduce: Pig
• Implementation started at Yahoo! Research; executes more than 30% of Yahoo!'s jobs
• Features: expresses sequences of MapReduce jobs; provides relational (SQL-like) operators (JOIN, GROUP BY, etc.)
Pig example
Given two tables, Visits and UrlInfo, find the top 10 most visited pages in each category.

Visits:
User  Url         Time
Amy   cnn.com     8:00
Amy   bbc.com     10:00
Amy   flickr.com  10:05
Fred  cnn.com     12:00

UrlInfo:
Url         Category  PageRank
cnn.com     News      0.9
bbc.com     News      0.8
flickr.com  Photos    0.7
espn.com    Sports    0.9

Dataflow: Load Visits → Group by url → Foreach url, generate count; Load UrlInfo; Join on url → Group by category → Foreach category, generate top10 urls.

In MapReduce: [figure of the equivalent hand-written MapReduce code, not reproduced]
In Pig:

visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';

* top, count are user-defined functions
Alternative to MapReduce: Hive
• Data warehouse on top of Hadoop
• Open source, written in Java
• Metadata is stored in a relational database
• Provides an SQL-like query language: HiveQL
• Provides the possibility to create user-defined functions
• Indexing
Language (DDL)
Create a table:    hive> CREATE TABLE customer (age INT, address STRING);
Display tables:    hive> SHOW TABLES;
Describe a table:  hive> DESCRIBE customer;
Modify a table:    hive> ALTER TABLE customer ADD COLUMNS (zip INT);
Delete a table:    hive> DROP TABLE customer;
Language (DML)
Load a file:
hive> LOAD DATA LOCAL INPATH '/data/home/test.txt' OVERWRITE INTO TABLE customer;
HiveQL queries:
hive> SELECT c.age FROM customer c WHERE c.sdate = '2008-08-15';
hive> INSERT OVERWRITE DIRECTORY '/data/hdfs_file' SELECT c.* FROM customer c WHERE c.sdate = '2008-08-15';
Can use ODBC to connect to other external BI tools.
Next gen: Spark
• A richer set of operators
• Resilient Distributed Datasets (RDDs): immutable data tables, lazy transformations
• Some operators can be applied indifferently to streaming data
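For comparison with the Hadoop program above, here is a hedged sketch of the same word count against Spark's Java API (not from the slides; signatures as in Spark 2.x, paths illustrative). Note that the transformations are lazy: nothing runs until the saveAsTextFile action is invoked.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // Add .setMaster("local[*]") to run outside a cluster.
        SparkConf conf = new SparkConf().setAppName("spark word count");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile(args[0]); // RDD: immutable, partitioned
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(l -> Arrays.asList(l.split(" ")).iterator())
                .mapToPair(w -> new Tuple2<>(w, 1))
                .reduceByKey((a, b) -> a + b);        // transformations are lazy...
        counts.saveAsTextFile(args[1]);               // ...executed only at this action
        sc.stop();
    }
}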
Lambda architecture (by Nathan Marz)
A generic, scalable and fault-tolerant data processing architecture.
[Figure: incoming data flows into a batch layer (batch processing producing precomputed views, indexed by a serving layer) and a speed layer (real-time stream processing); queries consult both]
Source: Mathieu DESPRIEE (USI)
Lambda architecture
1. All data entering the system is dispatched to both the batch layer and the speed layer for processing.
2. The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing the batch views.
3. The serving layer indexes the batch views so that they can be queried in an ad-hoc way.
4. The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
5. Any incoming query can be answered by merging results from batch views and real-time views (see the sketch below).
http://lambda-architecture.net/
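A minimal sketch of point 5 in Java, under the assumption that both views are simple key-value maps; all names here are hypothetical (a real deployment would serve batch views from a key-value store and feed the real-time view from the speed layer, e.g. Storm).

import java.util.HashMap;
import java.util.Map;

// Illustrative query-time merge of a Lambda architecture.
public class LambdaQuery {
    // Rebuilt periodically by the batch layer from the master dataset.
    private final Map<String, Long> batchView = new HashMap<>();
    // Maintained by the speed layer; covers only data since the last batch run.
    private final Map<String, Long> realtimeView = new HashMap<>();

    // The batch view is authoritative for older data; the real-time view
    // compensates for the batch layer's latency on recent data.
    public long pageViews(String url) {
        return batchView.getOrDefault(url, 0L) + realtimeView.getOrDefault(url, 0L);
    }
}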
Big Data Stream Mining
Machine learning tools can be classified along two axes:
• Distributed
– Batch: Hadoop, Mahout
– Stream: S4, Storm, SAMOA
• Non-distributed
– Batch: R, WEKA, …
– Stream: MOA
Challenges in Data Access and Visualization
• The main goal of data visualization is to communicate information clearly and effectively through graphical means
• Provide the results of analytics workflows to faster systems such as real-time query interfaces

"Visualization is a form of knowledge compression"
- David McCandless
Conclusion: Big Data challenges
• Semantic information aggregation: "too much data to assimilate but not enough knowledge to act"
• Distributed and real-time processing: design of real-time and distributed algorithms for stream processing and information aggregation; distribution and parallelization of data mining algorithms
• Visual analytics and user modeling: dynamic user models; novel visualizations for very large datasets
• Data privacy: Big Data is often generated by people; obtaining consent is often impossible and anonymisation is very hard
Thanks to Marie-Aude Aufaure (ECP) and Sylvain Lefebvre (ISEP)