Università degli Studi di Roma “Tor Vergata” Dipartimento di Ingegneria Civile e Ingegneria Informatica
Corso di Sistemi e Architetture per Big Data A.A. 2016/17
Valeria Cardellini
Introduction to Big Data
Big Data: fuzzy term, handle with care!
Valeria Cardellini - SABD 2016/17
1
Source: www.domo.com/blog/data-never-sleeps-4-0/
Why Big Data?
How much data?
• Every day in 2014 we created:
– 2.5 Exabytes (2.5×10^18 bytes ≈ 0.0025 Zettabytes) …
• How big is a Zettabyte? www.dailyinfographic.com/2016-the-year-of-the-zettabyte-infographic
– 2.5 Exabytes (2.5×10^18) …
– 2500 Petabytes (2500×10^15) …
– 2500000 Terabytes (2500000×10^12) …
– 2500000000000000000 bytes!
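The figures above are straightforward unit conversions; a quick sanity check in Python, assuming decimal SI prefixes (1 EB = 10^18 bytes, and so on):

```python
# Sanity check of the figures above, assuming decimal SI prefixes:
# 1 TB = 10^12 bytes, 1 PB = 10^15, 1 EB = 10^18, 1 ZB = 10^21.
TB, PB, EB, ZB = 10**12, 10**15, 10**18, 10**21

daily = 2.5 * EB                 # data created per day in 2014
print(daily / PB)                # → 2500.0 (Petabytes)
print(daily / TB)                # → 2500000.0 (Terabytes)
print(daily / ZB)                # → 0.0025 (Zettabytes)
```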
• 90% of all the data in the world has been generated over the last two years (as of 2013)
• 40 Zettabytes of data will be created by 2020
Source: www.ibmbigdatahub.com/infographic/four-vs-big-data
How much data?
• Some “older” statistics:
– Google: processes more than 20 PB a day (2008)
– Facebook: has 2.5 PB of user data (2009)
– eBay: has more than 6.5 PB of user data + 50 TB/day (5/2009)
– CERN’s LHC: generates 1 PB of data per second (2013)
Big data driving factors
• Big Data is growing fast – Mobile devices – Internet of Things
Smart world
Internet of Things (IoT)
How Big? IoT impact
• Internet of Things (IoT) will largely contribute to increasing Big Data challenges
• Proliferation of data sources
Big Data definitions
Different definitions
• “Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage, and process it within a tolerable elapsed time for its user population.” Teradata Magazine article, 2011
• “Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.” The McKinsey Global Institute, 2012
• “Big data is mostly about taking numbers and using those numbers to make predictions about the future. The bigger the data set you have, the more accurate the predictions about the future will be.” Anthony Goldbloom, Kaggle’s founder
• “Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them.” Wikipedia, 2017
… so, what is Big Data?
• “Big Data” is similar to “Small data”, but bigger
• …but bigger data requires different approaches (scale changes everything!)
– New methodologies, tools, architectures
• …with the aim of solving new problems
• …or old problems in a better way
Gartner’s Big data definition
• The most frequently used and, perhaps, somewhat abused definition (revised version by Gartner, 2012):
“Big data is high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.”
3V model for Big Data
1. Volume: data size challenging to store and process (how to index, retrieve)
2. Variety: data heterogeneity because of different data types (text, audio, video, record) and degree of structure (structured, semi-structured, unstructured data)
3. Velocity: data generation rate and analysis rate
• Defined in 2001 by D. Laney
The extended (3+n)V model
4. Value: Big data can generate huge competitive advantages – “Big data technologies describe a new generation of
technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis.” (IDC, 2011)
– “The bigger the data set you have, the more accurate the predictions about the future will be” (A. Goldbloom)
5. Veracity: uncertainty of accuracy and authenticity of data
6. Variability: data flows can be highly inconsistent with periodic peaks
7. Visualization
Big Data visualization
• Presentation of data in a pictorial and graphical format
• Motivation: our brain processes images 60,000x faster than text
• Some examples:
– Flight patterns www.aaronkoblin.com/work/flightpatterns/
– Hurricanes uxblog.idvsolutions.com/2012/08/hurricanes-since-1851.html
– Relationships between actors who have won Oscars, the directors they have worked with and all the other actors they have worked with www.pitchinteractive.com/infovis/abstract.html
Big Data on Google Trends
• Searched terms on Google Trends:
– Big Data
– Hadoop
– Data mining
– Business analytics
– Database management system
Gartner’s 2015 hype cycle for advanced analytics and data science
Big Data potential value
Source: McKinsey Global Institute, 2011
Why now?
• Because we have data
– Data born already in digital form
– 40% of data growth per year
• Because we can
– $400 for a drive in which to store all the music of the world
– More than 40 years of Moore’s law: we have large computational resources
– 76% of organizations have invested in Big Data in 2016
– $130 billion invested in Big Data in 2016
Some examples of Big Data applications
• Consumer product companies and retail organizations monitor social media like Facebook and Twitter to get an unprecedented view into customer behavior, preferences, and product perception
• Manufacturers monitor minute vibration data from their equipment to predict the optimal time to replace or maintain
• Manufacturers also monitor social networks, but with a different goal than marketers: to detect aftermarket support issues before a warranty failure becomes publicly detrimental
• Governments make data public for users to develop new applications that can generate public good (Open Data initiative)
… many other Big Data applications in very diverse sectors
• Crime prevention in Los Angeles
• Diagnosis and treatment of genetic diseases
• Investments in the financial sector
• Generation of personalized advertising
• Astronomical discoveries
See the BBC video www.bbc.co.uk/programmes/b01rt4c7
Examples of real-time analytics
• Real-time analytics over high volume sensor data: analysis of energy consumption measurements (DEBS 2014 Grand Challenge)
• Real-time analytics over high volume geospatial data streams: analysis of taxi trips based on a stream of trip reports from New York City (DEBS 2015 Grand Challenge)
www.debs2015.org/call-grand-challenge.html
• Real-time analytics metrics for a dynamic (evolving) social-network graph: identification of the posts that currently trigger the most activity in the social network, and identification of large communities that are currently involved in a topic (DEBS 2016 Grand Challenge) www.ics.uci.edu/~debs2016/call-grand-challenge.html
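The DEBS-style tasks above boil down to maintaining aggregates over a sliding window of recent events. A minimal sketch of the idea, assuming a simplified (post_id, timestamp) event format (the class and event shape are illustrative, not the actual challenge API):

```python
from collections import Counter, deque

class TrendingPosts:
    """Toy sliding-window counter in the spirit of the DEBS 2016 task:
    which posts triggered the most activity in the last `window` seconds?
    The (post_id, timestamp) event format is a simplifying assumption."""

    def __init__(self, window=3600):
        self.window = window
        self.events = deque()      # (timestamp, post_id), oldest first
        self.counts = Counter()

    def add(self, post_id, ts):
        self.events.append((ts, post_id))
        self.counts[post_id] += 1
        # Evict events that have fallen out of the window.
        while self.events and self.events[0][0] <= ts - self.window:
            _, old = self.events.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def top(self, k=3):
        return self.counts.most_common(k)

tp = TrendingPosts(window=10)
tp.add("p1", 0); tp.add("p1", 5); tp.add("p2", 6)
print(tp.top(2))                  # → [('p1', 2), ('p2', 1)]
```

A real deployment would run this logic inside a stream processing engine rather than a single process, but the windowing idea is the same.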
Examples of real-time analytics
• Finance: real-time forecasting of the stock market
• Medicine: epidemic tracking govdatadownload.com/2015/01/27/big-data-enables-epidemic-tracking/
• Security
– Fraud detection, DDoS attacks, behavioural pattern recognition
• Urban traffic management [Art14]
The Big Data process
The Big Data process
• Acquisition – Requires:
• Selecting data • Filtering data • Generating metadata • Managing data provenance
The Big Data process
• Extraction – Requires:
• Transformation • Normalization
– E.g., avoid duplication • Cleaning
– Detect and correct (or remove) corrupt or inaccurate data • Aggregation • Error handling
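The extraction sub-steps listed above (transformation, normalization, deduplication, cleaning, error handling) can be sketched in a few lines. The record fields and validation rule here are illustrative assumptions, not part of the course material:

```python
import re

def extract(records):
    """Toy extraction step: normalize, deduplicate, clean, handle errors.
    The 'email'/'age' fields and the regex are illustrative assumptions."""
    seen, clean = set(), []
    for r in records:
        email = r.get("email", "").strip().lower()       # normalization
        if not re.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", email):
            continue                                     # cleaning: drop invalid data
        if email in seen:
            continue                                     # avoid duplication
        seen.add(email)
        try:
            age = int(r["age"])                          # error handling
        except (KeyError, TypeError, ValueError):
            age = None
        clean.append({"email": email, "age": age})
    return clean

rows = [{"email": " Alice@Example.com ", "age": "31"},
        {"email": "alice@example.com", "age": "31"},    # duplicate
        {"email": "not-an-address", "age": "20"},       # invalid, dropped
        {"email": "bob@example.com", "age": None}]      # bad age field
print(extract(rows))
# → [{'email': 'alice@example.com', 'age': 31}, {'email': 'bob@example.com', 'age': None}]
```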
The Big Data process
• Integration – Requires:
• Standardization • Conflict management • Reconciliation • Mapping definition
The Big Data process
• Analysis – Requires:
• Exploration • Data mining • Machine learning • Visualization
The Big Data process
• Interpretation – Requires:
• Knowledge of the domain • Knowledge of the provenance • Identification of patterns of interest • Flexibility of the process
The Big Data process
• Decision – Requires:
• Managerial skills • Continuous improvement of
the process
Risks and challenges of Big Data
• Performance
– Data grows faster than energy on chip
– Efficiency
– Scalability and elasticity
• Goal: to scale linearly as workloads and data volumes grow
– Fault tolerance
• Effectiveness
• Heterogeneity
– Regarding data, processing environment, …
• Flexibility
• Privacy
• Costs
Effectiveness of Big Data analysis
• A famous example of inaccurate analysis: Google Flu Trends’ predictions
– Sometimes very inaccurate: over the interval 2011-2013 it consistently overestimated flu prevalence, and over one interval in the 2012-2013 flu season it predicted twice as many doctors’ visits as those recorded
Lazer et al., "The Parable of Google Flu: Traps in Big Data Analysis". Science. 343 (6176): 1203–1205. doi:10.1126/science.1248506
Taming performance: distribution and replication
• Distributed architecture – The common architectural solution for Big Data
processing: cluster of commodity hardware resources that work together for a common goal
– Scale out (or horizontally), not up (or vertically)! – But elastic scale support is still a challenge
• Distributed processing – Shared-nothing model – New programming paradigms
• Resource replication – The well-known solution to achieve fault tolerance – Eventual consistency (CAP theorem!)
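The shared-nothing, divide-and-merge style of processing can be sketched as a toy word count: each "node" works on its own data partition with no shared state, and partial results are merged afterwards. This is a single-process simulation of the idea, not an actual framework:

```python
from collections import defaultdict
from functools import reduce

# Shared-nothing sketch: each "node" counts words in its own partition
# independently (no shared state), then partial results are merged.

def map_partition(lines):
    counts = defaultdict(int)
    for line in lines:
        for word in line.split():
            counts[word] += 1
    return counts

def merge(a, b):
    for k, v in b.items():
        a[k] = a.get(k, 0) + v
    return a

partitions = [["big data big"], ["data never sleeps"]]   # one list per node
partials = [map_partition(p) for p in partitions]        # runs independently
total = reduce(merge, partials, {})
print(total)   # → {'big': 2, 'data': 2, 'never': 1, 'sleeps': 1}
```

In a real cluster the partitions live on different machines and the merge happens over the network; the programming paradigms covered later in the course (e.g., MapReduce) industrialize exactly this pattern.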
Shared nothing vs. other parallel architectures
D. DeWitt and J. Gray, “Parallel database systems: the future of high performance database systems”, ACM Communications, 1992
Big Data platforms
• Acquire, manage, process
• At large scales
• To meet QoS application requirements
Our Big Data stack
• Resource Management
• Data Storage
• Data Processing
• High-level Interfaces
• Support / Integration
The Big Data stack: BDAS
• BDAS: the Berkeley Data Analytics Stack
The Big Data stack: Cloudera platform
Data analysis paradigm shift
• From the “old” way: Structure -> Ingest -> Analyze
– Also known as ETL (Extract, Transform, and Load)
– Extract data from data sources
– Transform data for storing in the proper format or structure for the purposes of querying and analysis
– Load data into the target final system, i.e., database, operational data store, data mart, or data warehouse (DWH)
Data analysis paradigm shift
• … to the new way: Ingest -> Analyze -> Structure
– Also known as ELT (Extract, Load, and Transform) – Extract data from the sources – Load data into a data lake – Transform data
• Advantages: – No need for a separate transformation engine – Data transformation and loading happen in parallel – More effective
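A minimal sketch of the ELT flow described above: raw data is loaded into a "lake" untransformed, and the transformation runs only at analysis time. The lake directory, file layout, and record fields are illustrative assumptions:

```python
import json
import pathlib

# ELT sketch: Extract -> Load raw data into a "data lake" directory in its
# natural format; Transform only when a query needs it.
# Paths and field names ('u', 'amt') are illustrative assumptions.

LAKE = pathlib.Path("lake")

def load_raw(source_name, records):
    """E + L: dump source data as-is, with no upfront schema."""
    LAKE.mkdir(exist_ok=True)
    (LAKE / f"{source_name}.json").write_text(json.dumps(records))

def transform_on_read(source_name):
    """T happens at analysis time, tailored to this particular use case."""
    raw = json.loads((LAKE / f"{source_name}.json").read_text())
    return [{"user": r["u"].lower(), "amount": float(r["amt"])} for r in raw]

load_raw("sales", [{"u": "Alice", "amt": "10.5"}, {"u": "BOB", "amt": "3"}])
print(transform_on_read("sales"))
# → [{'user': 'alice', 'amount': 10.5}, {'user': 'bob', 'amount': 3.0}]
```

Note how loading needed no transformation engine at all, and a different analysis could apply a completely different transformation to the same raw files.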
Data lake
• “A data lake is a method of storing data within a system or repository, in its natural format, that facilitates the collocation of data in various schemata and structural forms, usually object blobs or files.” (Wikipedia)
Some techniques for Big Data analytics
• Data mining: anomaly detection, association rule learning, classification, clustering, regression, summarization
• Machine learning: supervised learning, unsupervised learning, reinforcement learning
• Crowdsourcing
– Outsourcing human-intelligence tasks to a large group of unspecified people via the Internet
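As a small taste of one of the techniques listed above, here is a minimal 1-D k-means clustering sketch. It is illustrative only; real analyses would use a proper library rather than this toy implementation:

```python
import random

def kmeans_1d(points, k=2, iters=20, seed=0):
    """Minimal 1-D k-means, one of the clustering techniques listed above.
    Illustrative sketch only, not a production implementation."""
    random.seed(seed)
    centers = random.sample(points, k)       # pick k initial centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                     # assign each point to nearest center
            i = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        centers = [sum(c) / len(c) if c else centers[i]   # recompute centers
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Two well-separated groups: centers converge near [1.0, 10.0].
print(kmeans_1d([1.0, 1.2, 0.8, 10.0, 10.4, 9.6]))
```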
We do not cover them in this course