Introduction to Big Data - uniroma2.it · 2017-03-09


Università degli Studi di Roma “Tor Vergata” Dipartimento di Ingegneria Civile e Ingegneria Informatica

Corso di Sistemi e Architetture per Big Data A.A. 2016/17

Valeria Cardellini

Introduction to Big Data

Big Data: fuzzy term, handle with care!

Valeria Cardellini - SABD 2016/17


Why Big Data?

Source: www.domo.com/blog/data-never-sleeps-4-0/

How much data?

•  Every day in 2014 we created:
–  2.5 Exabytes (2.5×10^18 bytes)
–  2500 Petabytes (2500×10^15 bytes)
–  2500000 Terabytes (2500000×10^12 bytes)
–  2500000000000000000 bytes!
–  2.5×10^18 bytes ≈ 0.0025×2^70 bytes, i.e., about 0.0025 Zettabytes

•  How big is a Zettabyte? www.dailyinfographic.com/2016-the-year-of-the-zettabyte-infographic

•  90% of all the data in the world has been generated over the last two years (as of 2013)

•  40 Zettabytes of data will be created by 2020
Source: www.ibmbigdatahub.com/infographic/four-vs-big-data
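The unit conversions on this slide can be checked with a few lines of Python. This is only a sketch: the names `UNIT_BYTES` and `to_unit` are made up for illustration, and decimal (SI) prefixes are assumed, so 1 Zettabyte is taken as 10^21 bytes rather than 2^70 bytes.

```python
# Decimal (SI) prefixes; the binary approximation 2**70 for a zettabyte
# is slightly larger and is not used here (assumption of this sketch).
UNIT_BYTES = {
    "TB": 10**12,
    "PB": 10**15,
    "EB": 10**18,
    "ZB": 10**21,
}


def to_unit(num_bytes, unit):
    """Express a byte count in the given unit."""
    return num_bytes / UNIT_BYTES[unit]


# Daily volume quoted for 2014: 2.5 exabytes, written as an exact integer.
daily_bytes = 2_500_000_000_000_000_000

for unit in ("EB", "PB", "TB", "ZB"):
    print(f"{to_unit(daily_bytes, unit):g} {unit}")
```

Keeping the byte count as an integer avoids rounding until the final division.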


How much data?

•  Some “older” statistics:
–  Google: processes more than 20 PB a day (2008)
–  Facebook: has 2.5 PB of user data (2009)
–  eBay: has more than 6.5 PB of user data + 50 TB/day (5/2009)
–  CERN’s LHC: generates 1 PB of data per second (2013)


Big data driving factors

•  Big Data is growing fast
–  Mobile devices
–  Internet of Things


Smart world


Internet of Things (IoT)


How Big? IoT impact

•  Internet of Things (IoT) will largely contribute to increase Big Data challenges

•  Proliferation of data sources


Big Data definitions

Different definitions

•  “Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage, and process it within a tolerable elapsed time for its user population.” Teradata Magazine article, 2011

•  “Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.” The McKinsey Global Institute, 2012

•  “Big data is mostly about taking numbers and using those numbers to make predictions about the future. The bigger the data set you have, the more accurate the predictions about the future will be.” Anthony Goldbloom, Kaggle’s founder

•  “Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them.” Wikipedia, 2017


… so, what is Big Data?

•  “Big Data” is similar to “Small data”, but bigger
•  …but bigger data requires different approaches (scale changes everything!)
–  New methodologies, tools, architectures
•  …with an aim to solve new problems
•  …or old problems in a better way


Gartner’s Big data definition

•  The most frequently used and, perhaps, somewhat abused definition (revised version by Gartner, 2012):
“Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.”


3V model for Big Data

1.  Volume: data size challenging to store and process (how to index, retrieve)

2.  Variety: data heterogeneity because of different data types (text, audio, video, record) and degree of structure (structured, semi-structured, unstructured data)

3.  Velocity: data generation rate and analysis rate

•  Defined in 2001 by D. Laney

The extended (3+n)V model

4.  Value: Big data can generate huge competitive advantages
–  “Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis.” (IDC, 2011)
–  “The bigger the data set you have, the more accurate the predictions about the future will be” (A. Goldbloom)

5.  Veracity: uncertainty of accuracy and authenticity of data

6.  Variability: data flows can be highly inconsistent with periodic peaks

7.  Visualization


Big Data visualization

•  Presentation of data in a pictorial and graphical format

•  Motivation: our brain processes images 60,000x faster than text

•  Some examples
–  Flight patterns
www.aaronkoblin.com/work/flightpatterns/
–  Hurricanes
uxblog.idvsolutions.com/2012/08/hurricanes-since-1851.html
–  Relationships between actors who have won Oscars, the directors they have worked with and all the other actors they have worked with
www.pitchinteractive.com/infovis/abstract.html


Big Data on Google Trends

•  Searched terms on Google Trends:
–  Big Data
–  Hadoop
–  Data mining
–  Business analytics
–  Database management system


Gartner’s 2015 hype cycle for advanced analytics and data science


Big Data potential value

Source: McKinsey Global Institute, 2011

Why now?

•  Because we have data
–  Data born already in digital form
–  40% data growth per year

•  Because we can
–  $400 for a drive that can store all the music in the world
–  More than 40 years of Moore's law: we have large computational resources
–  76% of organizations invested in Big Data in 2016
–  $130 billion invested in Big Data in 2016
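To get a feel for what 40% growth per year means, the short sketch below computes the implied doubling time and a multi-year projection. It assumes the growth compounds at a constant annual rate, which the slide does not state; the function names are illustrative only.

```python
import math


def doubling_time(annual_growth):
    """Years needed for the volume to double at a constant compound rate."""
    return math.log(2) / math.log(1 + annual_growth)


def projected_volume(initial, annual_growth, years):
    """Volume after `years` of compound growth, in the units of `initial`."""
    return initial * (1 + annual_growth) ** years


growth = 0.40  # 40% per year, the figure quoted on the slide
print(f"doubling time: {doubling_time(growth):.2f} years")
print(f"relative volume after 5 years: {projected_volume(1.0, growth, 5):.2f}x")
```

Under this assumption the data volume roughly doubles every two years.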


Some examples of Big Data applications

•  Consumer product companies and retail organizations monitor social media like Facebook and Twitter to get an unprecedented view into customer behavior, preferences, and product perception

•  Manufacturers monitor minute vibration data from their equipment to predict the optimal time to replace or maintain it

•  Manufacturers also monitor social networks, but with a different goal than marketers: to detect aftermarket support issues before a warranty failure becomes publicly detrimental

•  Governments make data public for users to develop new applications that can generate public good (Open Data initiative)


… many other Big Data applications in very diverse sectors

•  Crime prevention in Los Angeles
•  Diagnosis and treatment of genetic diseases
•  Investments in financial sector
•  Generation of personalized advertising
•  Astronomical discoveries

See the BBC video www.bbc.co.uk/programmes/b01rt4c7


Examples of real-time analytics

•  Real-time analytics over high volume sensor data: analysis of energy consumption measurements (DEBS 2014 Grand Challenge)

•  Real-time analytics over high volume geospatial data streams: analysis of taxi trips based on a stream of trip reports from New York City (DEBS 2015 Grand Challenge)
www.debs2015.org/call-grand-challenge.html

•  Real-time analytics metrics for a dynamic (evolving) social-network graph: identification of the posts that currently trigger the most activity in the social network, and identification of large communities that are currently involved in a topic (DEBS 2016 Grand Challenge)
www.ics.uci.edu/~debs2016/call-grand-challenge.html
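A common building block behind analytics of this kind is a time-based sliding window over the stream. The sketch below keeps the most frequent keys (for instance, taxi routes) seen in the last N seconds. It is not taken from any Grand Challenge solution: the class name, the route labels and the assumption that events arrive in timestamp order are all illustrative.

```python
from collections import Counter, deque


class SlidingWindowTopK:
    """Track the k most frequent keys seen in the last `window_seconds`.

    Assumes events arrive in non-decreasing timestamp order.
    """

    def __init__(self, window_seconds, k):
        self.window_seconds = window_seconds
        self.k = k
        self.events = deque()    # (timestamp, key) pairs in arrival order
        self.counts = Counter()  # key -> occurrences inside the window

    def add(self, timestamp, key):
        self.events.append((timestamp, key))
        self.counts[key] += 1
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that fell out of the window ending at `now`.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]

    def top(self):
        return self.counts.most_common(self.k)


# Hypothetical trip reports: (seconds since start, "pickup->dropoff" cell).
window = SlidingWindowTopK(window_seconds=1800, k=2)
for ts, route in [(0, "A->B"), (10, "A->B"), (20, "C->D")]:
    window.add(ts, route)
print(window.top())
```

Each event is appended once and evicted once, so the cost per event stays constant on average regardless of the window length.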


Examples of real-time analytics

•  Finance: real-time forecasting of stock market

•  Medicine: epidemic tracking
govdatadownload.com/2015/01/27/big-data-enables-epidemic-tracking/

•  Security
–  Fraud detection, DDoS attacks, behavioural pattern recognition

•  Urban traffic management [Art14]


The Big Data process


The Big Data process

•  Acquisition
–  Requires:
   •  Selecting data
   •  Filtering data
   •  Generating metadata
   •  Managing data provenance
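The four acquisition tasks listed above can be illustrated in miniature. The following is a sketch under assumed names: `acquire`, the record fields and the source label `meter-gateway-01` are hypothetical, and a real pipeline would read from files, queues or sensors instead of an in-memory list.

```python
import hashlib
import json
from datetime import datetime, timezone


def acquire(raw_records, source_name, wanted_fields, keep):
    """Select fields, filter records, and attach metadata and provenance."""
    acquired = []
    for record in raw_records:
        if not keep(record):  # filtering
            continue
        # Selection: keep only the fields of interest.
        selected = {f: record[f] for f in wanted_fields if f in record}
        payload = json.dumps(selected, sort_keys=True).encode("utf-8")
        acquired.append({
            "data": selected,
            "metadata": {
                "source": source_name,  # provenance: where the data came from
                "acquired_at": datetime.now(timezone.utc).isoformat(),
                "checksum": hashlib.sha256(payload).hexdigest(),
            },
        })
    return acquired


# Hypothetical sensor readings; the second one has no measurement.
readings = [
    {"sensor": "s1", "watts": 12.5, "debug": "x"},
    {"sensor": "s2", "watts": None},
]
result = acquire(
    readings,
    source_name="meter-gateway-01",
    wanted_fields=["sensor", "watts"],
    keep=lambda r: r.get("watts") is not None,
)
print(result[0]["data"], result[0]["metadata"]["source"])
```

Recording the source and a checksum at acquisition time is what later lets the interpretation step reason about where a value came from.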


The Big Data process

•  Extraction
–  Requires:
   •  Transformation
   •  Normalization
      –  E.g., avoid duplication
   •  Cleaning
      –  Detect and correct (or remove) corrupt or inaccurate data
   •  Aggregation
   •  Error handling
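As a rough illustration of how these extraction steps fit together, the sketch below transforms, de-duplicates, cleans and aggregates a handful of records, collecting rejected rows instead of failing. The record layout (`id`, `city`, `amount`) and the rule that negative amounts are invalid are assumptions made for the example.

```python
from collections import defaultdict


def extract(rows):
    """Return (total amount per city, list of rejected rows with a reason)."""
    totals = defaultdict(float)
    errors = []
    seen_ids = set()
    for row in rows:
        try:
            record_id = row["id"]
            city = row["city"].strip().title()  # transformation
            amount = float(row["amount"])       # type conversion
        except (KeyError, AttributeError, TypeError, ValueError) as exc:
            errors.append((row, type(exc).__name__))  # error handling
            continue
        if amount < 0:                          # cleaning: reject bad values
            errors.append((row, "negative amount"))
            continue
        if record_id in seen_ids:               # normalization: drop duplicates
            continue
        seen_ids.add(record_id)
        totals[city] += amount                  # aggregation
    return dict(totals), errors


rows = [
    {"id": 1, "city": " rome ", "amount": "10.5"},
    {"id": 1, "city": "Rome", "amount": "10.5"},   # duplicate of id 1
    {"id": 2, "city": "ROME", "amount": "4.5"},
    {"id": 3, "city": "Milan", "amount": "abc"},   # corrupt value
    {"id": 4, "city": "Milan", "amount": "-3"},    # inaccurate value
]
totals, errors = extract(rows)
print(totals, len(errors))
```

Returning the rejected rows alongside the result keeps the error handling explicit, so bad data can be inspected or corrected rather than silently lost.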


The Big Data process

•  Integration
–  Requires:
   •  Standardization
   •  Conflict management
   •  Reconciliation
   •  Mapping definition
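One possible way to picture these integration tasks is shown below: two sources describe the same customer with different field names, a mapping per source brings them to a common schema, and conflicts are resolved by keeping the most recently updated record. The source names, field names and the "latest update wins" policy are assumptions of this sketch, not something the slides prescribe.

```python
# Mapping definitions: source field name -> common schema field name.
CRM_MAPPING = {"cust_id": "customer_id", "mail": "email", "updated": "updated_at"}
SHOP_MAPPING = {"customerId": "customer_id", "emailAddress": "email",
                "lastSeen": "updated_at"}


def standardize(record, mapping):
    """Rename fields to the common schema and normalize the email format."""
    out = {target: record[source]
           for source, target in mapping.items() if source in record}
    if "email" in out:
        out["email"] = out["email"].strip().lower()
    return out


def integrate(sources):
    """Merge records from all sources, one entry per customer_id.

    Conflict policy (an assumption): the record with the latest
    `updated_at` wins; ISO dates compare correctly as strings.
    """
    merged = {}
    for records, mapping in sources:
        for record in records:
            std = standardize(record, mapping)
            current = merged.get(std["customer_id"])
            if current is None or std["updated_at"] > current["updated_at"]:
                merged[std["customer_id"]] = std
    return merged


crm = [{"cust_id": 7, "mail": "Ada@Example.org ", "updated": "2017-01-10"}]
shop = [{"customerId": 7, "emailAddress": "ada@new.example.org",
         "lastSeen": "2017-03-01"}]
customers = integrate([(crm, CRM_MAPPING), (shop, SHOP_MAPPING)])
print(customers[7]["email"])
```

Other reconciliation policies, such as preferring a trusted source or merging field by field, would replace only the comparison inside `integrate`.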


The Big Data process

•  Analysis
–  Requires:
   •  Exploration
   •  Data mining
   •  Machine learning
   •  Visualization


The Big Data process

•  Interpretation
–  Requires:
   •  Knowledge of the domain
   •  Knowledge of the provenance
   •  Identification of patterns of interest
   •  Flexibility of the process


The Big Data process

•  Decision
–  Requires:
   •  Managerial skills
   •  Continuous improvement of the process


Risks and challenges of Big Data

•  Performance
–  Data grows faster than energy on chip
–  Efficiency
–  Scalability and elasticity
   •  Goal: to scale linearly as workloads and data volumes grow
–  Fault tolerance
•  Effectiveness
•  Heterogeneity
–  Regarding data, processing environment, …
•  Flexibility
•  Privacy
•  Costs


Effectiveness of Big data analysis

•  A famous example of inaccurate analysis: Google Flu Trends’ predictions
–  Sometimes very inaccurate: it consistently overestimated flu prevalence over the interval 2011-2013 and, over one interval in the 2012-2013 flu season, predicted twice as many doctors' visits as those recorded


Lazer et al., "The Parable of Google Flu: Traps in Big Data Analysis". Science. 343 (6176): 1203–1205, 2014. doi:10.1126/science.1248506


Taming performance: distribution and replication

•  Distributed architecture
–  The common architectural solution for Big Data processing: cluster of commodity hardware resources that work together for a common goal
–  Scale out (or horizontally), not up (or vertically)!
–  But elastic scale support is still a challenge

•  Distributed processing
–  Shared-nothing model
–  New programming paradigms

•  Resource replication
–  The well-known solution to achieve fault tolerance
–  Eventual consistency (CAP theorem!)
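The combination of shared-nothing partitioning and replication can be sketched as a toy in-memory key-value cluster. Everything here is illustrative: the `Cluster` class, the hash-modulo placement and the "next node holds the replica" rule are assumptions, and writes go synchronously to all live replicas, so the sketch does not model eventual consistency.

```python
import hashlib


def stable_hash(key):
    """Hash that is identical across runs (unlike Python's built-in hash)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def replica_nodes(key, num_nodes, replication_factor):
    """Primary node chosen by hash; replicas on the following nodes."""
    primary = stable_hash(key) % num_nodes
    return [(primary + i) % num_nodes for i in range(replication_factor)]


class Cluster:
    """Each node owns a private dict: no memory or disk is shared."""

    def __init__(self, num_nodes, replication_factor=2):
        self.num_nodes = num_nodes
        self.replication_factor = replication_factor
        self.nodes = [dict() for _ in range(num_nodes)]
        self.alive = [True] * num_nodes

    def put(self, key, value):
        for n in replica_nodes(key, self.num_nodes, self.replication_factor):
            if self.alive[n]:
                self.nodes[n][key] = value

    def get(self, key):
        # Try the primary first, then fall back to the replicas.
        for n in replica_nodes(key, self.num_nodes, self.replication_factor):
            if self.alive[n] and key in self.nodes[n]:
                return self.nodes[n][key]
        raise KeyError(key)


cluster = Cluster(num_nodes=4, replication_factor=2)
cluster.put("trip-42", {"fare": 12.0})
primary = replica_nodes("trip-42", 4, 2)[0]
cluster.alive[primary] = False          # simulate a node failure
print(cluster.get("trip-42"))           # still served by the replica
```

A node that is down during a write misses the update and would serve stale data once it returns; handling that divergence is where the consistency trade-offs named on the slide come in.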


Shared nothing vs. other parallel architectures


D. DeWitt and J. Gray, “Parallel database systems: the future of high performance database systems”, Communications of the ACM, 1992


Big Data platforms

•  Acquire, manage, process

•  At large scales

•  To meet QoS application requirements


Our Big Data stack


Resource Management

Data Storage

Data Processing

High-level Interfaces

Support / Integration


The Big Data stack: BDAS

•  BDAS: the Berkeley Data Analytics Stack


The Big Data stack: Cloudera platform


Data analysis paradigm shift

•  From the “old” way: Structure -> Ingest -> Analyze
–  Also known as ETL (Extract, Transform, and Load)
–  Extract data from data sources
–  Transform data for storing in the proper format or structure for the purposes of querying and analysis
–  Load data into the final target system, i.e., database, operational data store, data mart, or data warehouse (DWH)


Data analysis paradigm shift

•  … to the new way: Ingest -> Analyze -> Structure
–  Also known as ELT (Extract, Load, and Transform)
–  Extract data from the sources
–  Load data into a data lake
–  Transform data

•  Advantages:
–  No need for a separate transformation engine
–  Data transformation and loading happen in parallel
–  More effective
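The difference between the two orderings can be seen on a tiny example. In the sketch below, ETL imposes a fixed schema before loading, while ELT stores the raw records and applies structure only when a question is asked. The event format, the `(user, action, ms)` schema and the function names are invented for illustration, and a plain Python list stands in for both the warehouse and the data lake.

```python
import json

# Hypothetical raw events, one JSON document per line.
RAW_EVENTS = [
    '{"user": "u1", "action": "click", "ms": 120}',
    '{"user": "u2", "action": "view"}',
    '{"user": "u1", "action": "click", "ms": 80, "campaign": "spring"}',
]


def etl(raw_lines):
    """ETL: transform to a fixed (user, action, ms) schema, then load.

    Records or fields that do not fit the schema are lost at load time.
    """
    warehouse = []
    for line in raw_lines:
        event = json.loads(line)
        if "ms" in event:
            warehouse.append((event["user"], event["action"], event["ms"]))
    return warehouse


def elt_load(raw_lines):
    """ELT, load step: keep every record in its native format."""
    return list(raw_lines)


def elt_transform(data_lake, field):
    """ELT, transform step: structure is applied when the data is read."""
    values = []
    for line in data_lake:
        event = json.loads(line)
        if field in event:
            values.append(event[field])
    return values


print(etl(RAW_EVENTS))
lake = elt_load(RAW_EVENTS)
print(elt_transform(lake, "campaign"))  # a field the ETL schema never kept
```

The `campaign` field survives only in the ELT path, which shows why loading first keeps later, unanticipated analyses possible.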


Data lake

•  “A data lake is a method of storing data within a system or repository, in its natural format, that facilitates the collocation of data in various schemata and structural forms, usually object blobs or files.” (Wikipedia)


Some techniques for Big Data analytics

•  Data mining: anomaly detection, association rule learning, classification, clustering, regression, summarization

•  Machine learning: supervised learning, unsupervised learning, reinforcement learning

•  Crowdsourcing
–  Outsourcing human-intelligence tasks to a large group of unspecified people via Internet


We do not cover them in this course