big data: hype or necessity?

Big Data Big Data: hype or necessity? Dr. ir. ing. Bart Vandewoestyne Sizing Servers Lab, Howest, Kortrijk Televic R&D meeting - April 25, 2014 1 / 69

Upload: bart-vandewoestyne

Post on 27-Jan-2015


DESCRIPTION

Presentation given at an internal Televic R&D meeting. The audience consisted of about 20 R&D engineers.

TRANSCRIPT

Page 1: Big Data: hype or necessity?

Big Data

Big Data: hype or necessity?

Dr. ir. ing. Bart Vandewoestyne

Sizing Servers Lab, Howest, Kortrijk

Televic R&D meeting - April 25, 2014

1 / 69

Page 2: Big Data: hype or necessity?

Big Data

Outline

1 Introduction
    Big Data?

2 Big Data Technology
    Hadoop
    Pig, Hive
    NoSQL

3 Big Data in my company?

4 Conclusions

2 / 69

Page 4: Big Data: hype or necessity?

Big Data

Introduction

Big Data?

Exponential growth of data

[Figure, © 2013 International Business Machines Corporation: “Big Data: This is just the beginning”. Axes: volume in exabytes (ticks 3000 to 9000) and percentage of uncertain data (0 to 100), over 2010 to 2015. Labeled data sources: Sensors & Devices, VoIP, Enterprise Data, Social Media. A “You are here” marker indicates the present.]

4 / 69

Page 5: Big Data: hype or necessity?

Big Data

Introduction

Big Data?

Examples

Facebook hosts ≈ 10 billion photos ≈ 1 petabyte

Large Hadron Collider: will produce ≈ 15 petabytes per year

5 / 69

Page 6: Big Data: hype or necessity?

Big Data

Introduction

Big Data?

Examples

RFID readers

Vehicle GPS traces

Smart energy meters

6 / 69

Page 7: Big Data: hype or necessity?

Big Data

Introduction

Big Data?

Examples relevant to Televic

Seattle Children’s Hospital

Google Now

Union Pacific

    Automatic rescheduling
    Sensors in rails, GPS, RFID in terminals, ...
    Weather forecast, ...

7 / 69

Page 8: Big Data: hype or necessity?

Big Data

Introduction

Big Data?

Big Data definition

Definition of Big Data depends on who you ask:

Big Data

“Multiple terabytes or petabytes.” (according to some professionals)

“I don’t know.” (today’s big may be tomorrow’s normal)

“Relative to its context.”

8 / 69

Page 9: Big Data: hype or necessity?

Big Data

Introduction

Big Data?

Quotes on Big Data

“Big data” is a subjective label attached to situations in which human and technical infrastructures are unable to keep pace with a company’s data needs.

It’s about recognizing that for some problems other storage solutions are better suited.

9 / 69

Page 10: Big Data: hype or necessity?

Big Data

Introduction

Big Data?

The Three V’s

Volume The amount of data is big.

Variety Different kinds of data:

    structured
    semi-structured
    unstructured

Velocity Speed issues to consider:

    How fast is the data available for analysis?
    How fast can we do something with it?

Other V’s: Veracity, Variability, Validity, Value, ...

10 / 69

Page 11: Big Data: hype or necessity?

Big Data

Introduction

Big Data?

Structured data

Structured data

Pre-defined schema imposed on the data

Highly structured

Usually stored in a relational database system

Example

numbers: 20, 3.1415, ...

dates: 21/03/1978

strings: “Hello World”

. . .

Roughly 20% of all data out there is structured.

11 / 69

Page 12: Big Data: hype or necessity?

Big Data

Introduction

Big Data?

Semi-structured data

Semi-structured data

Inconsistent structure.

Cannot be stored in rows and tables in a typical database.

Information is often self-describing (label/value pairs).

Example

XML, SGML,. . .

BibTeX files

logs

tweets

sensor feeds

. . .

12 / 69

Page 13: Big Data: hype or necessity?

Big Data

Introduction

Big Data?

Semi-structured data: examples

Example

<?xml version="1.0"?>

<catalog>

<book id="bk101">

<author>Gambardella, Matthew</author>

<title>XML Developer’s Guide</title>

<genre>Computer</genre>

<price>44.95</price>

</book>

</catalog>

13 / 69

Page 14: Big Data: hype or necessity?

Big Data

Introduction

Big Data?

Unstructured data

Definition (Unstructured data)

Lacks structure or parts of it lack structure.

Example

multimedia: videos, photos, audio files, ...

email messages

free-form text

word processing documents

presentations

reports

. . .

Experts estimate that 80 to 90% of the data in any organization is unstructured.

14 / 69

Page 15: Big Data: hype or necessity?

Big Data

Introduction

Big Data?

Data Storage and Analysis

Storage capacity of hard drives has increased massively over the years.

Access speeds have not kept up.

Example (Reading a whole disk)

Year   Storage capacity   Transfer speed   Time to read the whole disk
1990   1370 MB            4.4 MB/s         ≈ 5 minutes
2010   1 TB               100 MB/s         > 2.5 hours

Solution: work in parallel!

Using 100 drives (each holding 1/100th of the data), reading 1 TB takes less than 2 minutes.
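The arithmetic behind these read times can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name `read_time_seconds` is made up for illustration, 1 TB is assumed to be 1,000,000 MB, and the data is assumed to be spread evenly over drives that are read fully in parallel.

```python
def read_time_seconds(capacity_mb: float, speed_mb_per_s: float, drives: int = 1) -> float:
    """Seconds needed to read `capacity_mb` split evenly over `drives` disks.

    Assumes every drive sustains `speed_mb_per_s` and all drives are read
    in parallel, so the total time is the time for one drive's share.
    """
    return capacity_mb / drives / speed_mb_per_s


if __name__ == "__main__":
    one_tb_in_mb = 1_000_000  # assumption: decimal units, 1 TB = 10^6 MB

    print(read_time_seconds(1370, 4.4) / 60)                    # 1990 disk, in minutes
    print(read_time_seconds(one_tb_in_mb, 100) / 3600)          # 2010 disk, in hours
    print(read_time_seconds(one_tb_in_mb, 100, drives=100))     # 100 drives, in seconds
```

With these assumptions the 1990 disk takes a bit over 5 minutes, the 2010 disk about 2.8 hours, and the 100-drive setup 100 seconds, which agrees with the figures on the slide.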

15 / 69

Page 16: Big Data: hype or necessity?

Big Data

Introduction

Big Data?

Working in parallel

Problems

1 Hardware failure?

2 Combining data from different disks for analysis?

Solutions

1 HDFS: Hadoop Distributed Filesystem

2 MapReduce: programming model

16 / 69

Page 18: Big Data: hype or necessity?

Big Data

Big Data Technology

Big Data Landscape

18 / 69

Page 19: Big Data: hype or necessity?

Big Data

Big Data Technology

Hadoop

Hadoop

Hadoop is VMware, but the other way around.

19 / 69

Page 20: Big Data: hype or necessity?

Big Data

Big Data Technology

Hadoop

Hadoop as the opposite of a virtual machine

VMware

1 take one physical server

2 split it up

3 get many small virtual servers

Hadoop

1 take many physical servers

2 merge them all together

3 get one big, massive, virtual server

20 / 69

Page 21: Big Data: hype or necessity?

Big Data

Big Data Technology

Hadoop

Hadoop: core functionality

HDFS Self-healing, high-bandwidth, clustered storage.

MapReduce Distributed, fault-tolerant resource management, coupled with scalable data processing.
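To make the MapReduce programming model a little more concrete, here is a minimal single-process sketch of a word count in Python. It is not Hadoop code and uses no Hadoop API: the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are illustrative only, and the distribution over machines and the fault tolerance that Hadoop provides are left out entirely.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_words(line))
    return dict(reduce_counts(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big hype"]))
```

In a real cluster the map calls run on the nodes that hold the data blocks and the reduce calls run on other nodes; here everything runs in one process simply to show the shape of the two functions a programmer has to write.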

21 / 69

Page 22: Big Data: hype or necessity?

Big Data

Big Data Technology

Hadoop

HDFS architecture

22 / 69

Page 23: Big Data: hype or necessity?

Big Data

Big Data Technology

Hadoop

MapReduce

23 / 69

Page 24: Big Data: hype or necessity?

Big Data

Big Data Technology

Hadoop

MapReduce

24 / 69

Page 25: Big Data: hype or necessity?

Big Data

Big Data Technology

Hadoop

Hadoop: applications

Example Hadoop stack:

→ Hadoop distributions

25 / 69

Page 26: Big Data: hype or necessity?

Big Data

Big Data Technology

Hadoop

Example Hadoop distributions

26 / 69

Page 27: Big Data: hype or necessity?

Big Data

Big Data Technology

Hadoop

Hadoop vs RDBMS

Relational Database Management Systems (RDBMS):

Very fast to max speed!

some queries → msecs

other queries → hours, days

use when

    latency is important
    ACID transactions (banking, ...)
    100% SQL compliance

Unstructured data → BLOB :-(

27 / 69

Page 28: Big Data: hype or necessity?

Big Data

Big Data Technology

Hadoop

Hadoop vs RDBMS

Hadoop:

Slower to (higher) max speed...

some queries → seconds, minutes

other queries → seconds!!!

Use when:

    throughput important
    scalability of storage/compute
    (un|semi)structured data
    complex data processing (NoSQL, Java, C, Python, ...)

28 / 69

Page 29: Big Data: hype or necessity?

Big Data

Big Data Technology

Pig, Hive

Apache Hadoop essentials: technology stack

29 / 69

Page 30: Big Data: hype or necessity?

Big Data

Big Data Technology

Pig, Hive

Pig

MapReduce requires programmers who

    think in terms of map and reduce functions,
    more than likely use the Java language.

Pig provides a high-level language (Pig Latin) that can be used by

    Analysts
    Data Scientists
    Statisticians
    Etc.

30 / 69

Page 31: Big Data: hype or necessity?

Big Data

Big Data Technology

Pig, Hive

Pig Latin

Pig Latin

Originally from Yahoo! to allow analysts to access data.

Dataflow language.

Makes it simpler to write MapReduce programs.

Abstracts you from specific details → focus on data processing.

Has User Defined Functions (UDFs).

Compiles script into a set of MapReduce jobs.

31 / 69

Page 32: Big Data: hype or necessity?

Big Data

Big Data Technology

Pig, Hive

Pig example

Dataflow (diagram):

    Load users → Filter by age
    Load pages
    Join on name → Group on URL → Count clicks → Order by clicks → Take top 5

Input data

file with user data

file with website data

Your task

Find the top 5 most visited pages by users aged 18-25.

32 / 69

Page 33: Big Data: hype or necessity?

Big Data

Big Data Technology

Pig, Hive

In MapReduce

. . . 170 lines of Java MapReduce code . . .

33 / 69

Page 34: Big Data: hype or necessity?

Big Data

Big Data Technology

Pig, Hive

In Pig Latin

Example

Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';

Only 9 lines of Pig Latin.
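For readers who do not know Pig Latin, the same filter, join, group, count, order and limit steps can be sketched in plain Python on small in-memory lists. This is an illustration of the dataflow only, not of how Pig executes it: the function name `top_pages` and the sample data are hypothetical, and user names are assumed to be unique.

```python
from collections import Counter


def top_pages(users, pages, n=5, min_age=18, max_age=25):
    """Return the `n` most visited URLs among users in the given age range.

    `users` is an iterable of (name, age) and `pages` of (user, url),
    mirroring the two input files of the Pig example.
    """
    # Filter: keep the names of users within the age range.
    selected = {name for name, age in users if min_age <= age <= max_age}
    # Join, group and count: one click per page row whose user was selected.
    clicks = Counter(url for user, url in pages if user in selected)
    # Order by clicks (descending) and take the top n.
    return clicks.most_common(n)


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the presentation.
    users = [("ann", 20), ("bob", 40), ("cid", 25)]
    pages = [("ann", "a.com"), ("ann", "b.com"), ("cid", "a.com"), ("bob", "c.com")]
    print(top_pages(users, pages))
```

The Python version only works while everything fits in the memory of one machine; the point of Pig is that the nine-line script is compiled into MapReduce jobs that run over the whole cluster.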

34 / 69

Page 35: Big Data: hype or necessity?

Big Data

Big Data Technology

Pig, Hive

Hive

Originated at Facebook to analyze log data.

HiveQL: Hive Query Language, similar to standard SQL.

Queries are compiled into MapReduce jobs.

Has command-line shell, similar to e.g. MySQL shell.

35 / 69

Page 36: Big Data: hype or necessity?

Big Data

Big Data Technology

Pig, Hive

Hive: example

Example (Create table to hold weather data)

CREATE TABLE records (year STRING,
                      temperature INT,
                      quality INT)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t';

Example (Populate Hive with the data)

LOAD DATA LOCAL INPATH 'input/sample.txt'
OVERWRITE INTO TABLE records;

36 / 69

Page 37: Big Data: hype or necessity?

Big Data

Big Data Technology

Pig, Hive

Hive: example

Example (Run query)

hive> SELECT year, MAX(temperature)

> FROM records

> WHERE temperature != 9999

> AND (quality = 0 OR quality = 1)

> GROUP BY year;

1949 111

1950 22
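The logic of this query (drop the missing-value marker 9999, keep only quality codes 0 and 1, then take the maximum per year) can be sketched in Python as follows. The function name `max_temperature_by_year` and the sample rows are hypothetical; the contents of `input/sample.txt` are not shown in the presentation, so the rows below are merely chosen to be consistent with the output above.

```python
def max_temperature_by_year(records):
    """Maximum valid temperature per year.

    `records` is an iterable of (year, temperature, quality) tuples.
    Rows with temperature 9999 or a quality code other than 0 or 1 are skipped,
    mirroring the WHERE clause of the Hive query.
    """
    result = {}
    for year, temperature, quality in records:
        if temperature != 9999 and quality in (0, 1):
            result[year] = max(temperature, result.get(year, temperature))
    return result


if __name__ == "__main__":
    # Hypothetical rows, not the actual sample file.
    rows = [
        ("1950", 0, 1),
        ("1950", 22, 1),
        ("1950", 30, 4),    # skipped: quality code not 0 or 1
        ("1949", 111, 1),
        ("1949", 9999, 1),  # skipped: missing-value marker
    ]
    for year, temperature in sorted(max_temperature_by_year(rows).items()):
        print(year, temperature)
```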

37 / 69

Page 38: Big Data: hype or necessity?

Big Data

Big Data Technology

NoSQL

NoSQL

38 / 69

Page 39: Big Data: hype or necessity?

Big Data

Big Data Technology

NoSQL

RDBMS: Codd’s 12 rules

Codd’s 12 rules

A set of rules designed to define what is required from a database management system in order for it to be considered relational.

Rule 0 The Foundation rule

Rule 1 The Information rule

Rule 2 The guaranteed access rule

Rule 3 Systematic treatment of null values

Rule 4 Active online catalog based on the relational model

. . . . . .

39 / 69

Page 40: Big Data: hype or necessity?

Big Data

Big Data Technology

NoSQL

ACID

ACID

A set of properties that guarantee that database transactions are processed reliably.

Atomicity A transaction is all or nothing.

Consistency Only transactions with valid data.

Isolation Simultaneous transactions will not interfere.

Durability Written transaction data stays there “forever” (even in case of power loss, crashes, errors, ...).
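Atomicity and consistency are easy to demonstrate with Python's built-in `sqlite3` module. The sketch below is an assumption-laden toy, not part of the presentation: the `accounts` table, the helper names (`make_bank`, `transfer`, `balances`) and the `CHECK` constraint are invented for illustration. A transfer that would leave an account negative violates the constraint, and the whole transaction is rolled back, including the update that had already succeeded.

```python
import sqlite3


def make_bank():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER CHECK (balance >= 0))"  # consistency rule: no negative balances
    )
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst`; both updates apply or neither does."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


if __name__ == "__main__":
    conn = make_bank()
    transfer(conn, "alice", "bob", 30)
    print(balances(conn))
    try:
        transfer(conn, "bob", "alice", 500)  # bob cannot afford this
    except sqlite3.IntegrityError as error:
        print("rolled back:", error)
    print(balances(conn))  # unchanged by the failed transfer
```

The credit is deliberately executed before the debit so that the failing transfer has already changed one row when the constraint fires; the rollback then shows the "all or nothing" behaviour.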

40 / 69

Page 41: Big Data: hype or necessity?

Big Data

Big Data Technology

NoSQL

Scaling up

What if you need to scale up your RDBMS in terms of

dataset size,

read/write concurrency?

This usually involves

breaking Codd’s rules,

loosening ACID restrictions,

forgetting conventional DBA wisdom,

losing most of the desirable properties that made RDBMS so convenient in the first place.

NoSQL to the rescue!

41 / 69

Page 42: Big Data: hype or necessity?

Big Data

Big Data Technology

NoSQL

NoSQL

NoSQL

‘Invented’ by Carlo Strozzi in 1998 (for his file-based database)

“Not only SQL”

It’s NOT about

saying that SQL should never be used,

saying that SQL is dead.

42 / 69

Page 43: Big Data: hype or necessity?

Big Data

Big Data Technology

NoSQL

NoSQL databases

Four emerging NoSQL categories:

43 / 69

Page 44: Big Data: hype or necessity?

Big Data

Big Data Technology

NoSQL

Key-Value stores or ‘the big hash table’

Key     Value
13a1    Nexus 32 GB
13a2    Nexus 16 GB
13a3    Nexus 08 GB

Most basic type of NoSQL databases.

Aggregation of key-value pairs.

Typically only 4 operations:

    create(key, value)
    read(key)
    update(key, value)
    delete(key)

Fast, scalable, less complex.

Mainly used for systems with simple queries (caches etc.)
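The four operations map directly onto a hash table. The class below is a minimal in-memory sketch, assuming a single process and no persistence or distribution; the name `KeyValueStore` and the choice to raise `KeyError` on duplicate or missing keys are illustrative decisions, not the interface of any particular product.

```python
class KeyValueStore:
    """Toy key-value store backed by a Python dict (the 'big hash table')."""

    def __init__(self):
        self._data = {}

    def create(self, key, value):
        """Store a new pair; refuse to overwrite an existing key."""
        if key in self._data:
            raise KeyError(f"key already exists: {key!r}")
        self._data[key] = value

    def read(self, key):
        """Return the value for `key`; raises KeyError if it is absent."""
        return self._data[key]

    def update(self, key, value):
        """Replace the value of an existing key."""
        if key not in self._data:
            raise KeyError(f"no such key: {key!r}")
        self._data[key] = value

    def delete(self, key):
        """Remove the pair; raises KeyError if it is absent."""
        del self._data[key]


if __name__ == "__main__":
    store = KeyValueStore()
    store.create("13a1", "Nexus 32 GB")
    store.update("13a1", "Nexus 64 GB")
    print(store.read("13a1"))
    store.delete("13a1")
```

Note what is missing: there is no way to ask "which keys have a value containing 16 GB?" without scanning everything, which is why such stores suit simple lookups such as caches.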

44 / 69

Page 45: Big Data: hype or necessity?

Big Data

Big Data Technology

NoSQL

Key-Value stores or ’the big hash table’

45 / 69

Page 46: Big Data: hype or necessity?

Big Data

Big Data Technology

NoSQL

Column-oriented DBMS

Example

Id   LastName   FirstName   Salary
10   Smith      Joe         40000
12   Jones      Mary        50000
11   Johnson    Cathy       44000
22   Jones      Bob         55000

Row-based:
10,Smith,Joe,40000;12,Jones,Mary,50000;11,Johnson,Cathy,44000;22,Jones,Bob,55000

Column-based:
10,12,11,22;Smith,Jones,Johnson,Jones;Joe,Mary,Cathy,Bob;40000,50000,44000,55000
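The two layouts differ only in whether the table is serialized record by record or transposed first. A short Python sketch, with illustrative function names, reproduces both strings from the example table:

```python
ROWS = [
    (10, "Smith", "Joe", 40000),
    (12, "Jones", "Mary", 50000),
    (11, "Johnson", "Cathy", 44000),
    (22, "Jones", "Bob", 55000),
]


def serialize(records):
    """Join the fields of each record with ',' and the records with ';'."""
    return ";".join(",".join(str(field) for field in record) for record in records)


def row_based(rows):
    """Store the table one row after another."""
    return serialize(rows)


def column_based(rows):
    """Store the table one column after another (transpose, then serialize)."""
    return serialize(zip(*rows))


if __name__ == "__main__":
    print(row_based(ROWS))
    print(column_based(ROWS))
```

In the column-based string all salaries sit next to each other, so an aggregate such as the average salary only has to touch the last segment instead of every record.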

46 / 69

Page 47: Big Data: hype or necessity?

Big Data

Big Data Technology

NoSQL

Column family based databases

Like column-oriented DBMS, but with a twist

Columns and supercolumns ≈ RDBMS table columns

Family of columns ≈ RDBMS table

Keyspace ≈ RDBMS database

47 / 69

Page 48: Big Data: hype or necessity?

Big Data

Big Data Technology

NoSQL

Column family based databases

Most complex NoSQL database type.

Based on Google’s BigTable paper.

More flexibility than traditional RDBMS: adding (super)columns is always possible.

Excellent for analysis and mass treatment of data (via Map-Reduce type operations)

48 / 69

Page 49: Big Data: hype or necessity?

Big Data

Big Data Technology

NoSQL

Document databases

Data is stored as a collection of documents (JSON, XML, ... but also PDF, Excel, ...)

Documents → collection of key-value pairs

Values can be

    simple values
    arrays
    another document (collection of key-values)

Schemaless

Quite well queryable

49 / 69

Page 50: Big Data: hype or necessity?

Big Data

Big Data Technology

NoSQL

Document databases

Example (Document 1)

{

FirstName: "Bob",

Address: "5 Oak St.",

Hobby: "sailing"

}

Example (Document 2)

{

FirstName: "Jonathan",

Address: "15 Wanamassa Road",

Children: [

{Name: "Michael", Age: 10},

{Name: "Jennifer", Age: 8},

{Name: "Samantha", Age: 5},

{Name: "Elena", Age: 2}

]

}

Best suited for custom queries like the ones in RDBMS.

Quite popular for Content Management Systems.
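The two example documents can be represented as Python dictionaries, which also shows what "schemaless but queryable" means in practice. The helper names `find` and `children_younger_than` are hypothetical and do not correspond to the query API of any specific document database.

```python
DOCUMENTS = [
    {"FirstName": "Bob", "Address": "5 Oak St.", "Hobby": "sailing"},
    {
        "FirstName": "Jonathan",
        "Address": "15 Wanamassa Road",
        "Children": [
            {"Name": "Michael", "Age": 10},
            {"Name": "Jennifer", "Age": 8},
            {"Name": "Samantha", "Age": 5},
            {"Name": "Elena", "Age": 2},
        ],
    },
]


def find(documents, **criteria):
    """Return the documents whose top-level fields equal all given criteria."""
    return [
        doc for doc in documents
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


def children_younger_than(documents, age):
    """Names of all children below `age`; documents without children are skipped."""
    return [
        child["Name"]
        for doc in documents
        for child in doc.get("Children", [])
        if child["Age"] < age
    ]


if __name__ == "__main__":
    print(find(DOCUMENTS, Hobby="sailing"))
    print(children_younger_than(DOCUMENTS, 6))
```

Neither document had to declare in advance that a `Hobby` or `Children` field might exist; the queries simply treat a missing field as "no match".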

50 / 69

Page 51: Big Data: hype or necessity?

Big Data

Big Data Technology

NoSQL

Document databases: examples

51 / 69

Page 52: Big Data: hype or necessity?

Big Data

Big Data Technology

NoSQL

Graph databases

[Figure: example graph. Nodes: Julie, Steve, Rock Music, Bob, BMW, Fido, Jim, IBM. Edge labels: Sister-in-Law To, Listens To, Married To, Brother Of, Drives, Works For, Colleague Of, Has Pet.]

Based on graph theory.

Employ nodes (objects) and edges (relations between objects).
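A labeled, directed graph of this kind can be sketched in Python with adjacency lists. The class `Graph` and its methods are illustrative only, and the edges added in `build_example` are a hypothetical wiring: the node and relation names come from the slide, but which node is connected to which is an assumption.

```python
class Graph:
    """Directed graph with labeled edges, stored as adjacency lists."""

    def __init__(self):
        self._edges = {}  # node -> list of (relation, target)

    def add_edge(self, source, relation, target):
        """Add an edge `source` --relation--> `target`."""
        self._edges.setdefault(source, []).append((relation, target))

    def related(self, source, relation):
        """Targets reachable from `source` over edges labeled `relation`."""
        return [target for label, target in self._edges.get(source, []) if label == relation]


def build_example():
    """Build a small graph; the connections below are assumed, not from the slide."""
    graph = Graph()
    graph.add_edge("Bob", "Drives", "BMW")
    graph.add_edge("Bob", "Has Pet", "Fido")
    graph.add_edge("Bob", "Works For", "IBM")
    graph.add_edge("Jim", "Works For", "IBM")
    return graph


if __name__ == "__main__":
    graph = build_example()
    print(graph.related("Bob", "Works For"))
    print(graph.related("Bob", "Has Pet"))
```

Following a relation is a direct lookup on the node rather than a join over tables, which is the main reason graph databases suit network-structured problems.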

52 / 69

Page 53: Big Data: hype or necessity?

Big Data

Big Data Technology

NoSQL

Graph databases: examples

Well-suited for problems with network-structure:

    mine data from social media
    “customers who bought this also looked at...”
    relations between persons
    healthcare ontologies ???
    ...

53 / 69

Page 54: Big Data: hype or necessity?

Big Data

Big Data Technology

NoSQL

Use the right tool for the right job!

http://db-engines.com/

54 / 69

Page 56: Big Data: hype or necessity?

Big Data

Big Data in my company?

Typical RDBMS scaling story

1. Initial Public Launch

From local workstation → remotely hosted MySQL instance.

2. Service popularity ↑, too many reads hitting the database

Add memcached to cache common queries. Reads are now no longer strictly ACID; cached data must expire.

3. Popularity ↑↑, too many writes hitting the database

Scale MySQL vertically by buying a beefed-up server:

16 cores

128 GB of RAM

banks of 15 k RPM hard drives

Costly

56 / 69

Page 57: Big Data: hype or necessity?

Big Data

Big Data in my company?

Typical RDBMS scaling story

4. New features → query complexity ↑, now too many joins

Denormalize your data to reduce joins. (That’s not what they taught me in DBA school!)

5. Rising popularity swamps the server; things are too slow

Stop doing any server-side computations.

57 / 69

Page 58: Big Data: hype or necessity?

Big Data

Big Data in my company?

Typical RDBMS scaling story

6. Some queries are still too slow

Periodically prematerialize the most complex queries, and try to stop joining in most cases.

7. Reads are OK, writes are getting slower and slower. . .

Drop secondary indexes and triggers (no indexes?).

If you stay up at night worrying about your database (uptime, scale, or speed), you should seriously consider making a jump from the RDBMS world to HBase.

58 / 69

Page 59: Big Data: hype or necessity?

Big Data

Big Data in my company?

Two types of companies (personal view)

‘Core Big Data’ company

Core business = big data processing, crunching, analyzing,. . .

Example

Google, Facebook,. . .

Smart metering companies

Video/Image processing companies

Biotech companies with sequencing data

Lots of healthcare data???

. . .

59 / 69

Page 60: Big Data: hype or necessity?

Big Data

Big Data in my company?

Two types of companies (personal view)

‘General Big Data’ company

Some other core business.

Lots of useful data is available.

Desirable: business analytics, process optimization,. . .

Example

Supermarkets → customer cards

Transport firms → GPS-traces

Hospitals → patient and medical info???

. . .

60 / 69

Page 61: Big Data: hype or necessity?

Big Data

Big Data in my company?

Use-cases of Big Data

‘Core Big Data’ company

Big Data

crunching,

hacking,

processing,

analyzing,

. . .

‘General Big Data’ company

Business Analytics

improve decision-making,

gain operational insights,

increase overall performance,

track and analyze shopping patterns,

. . .

Both

Explore! Discover hidden gems!

61 / 69

Page 62: Big Data: hype or necessity?

Big Data

Big Data in my company?

Some examples

IBM: predict heart disease long before it strikes.

Predict and stop the spread of infectious disease

62 / 69

Page 63: Big Data: hype or necessity?

Big Data

Big Data in my company?

Some examples

How to predict wine quality?

Skip tasting! Use science!

Weather seems the key variable.

Correlate historical weather & wine data.

Reduce fuel cost and improve driver safety by analyzing geolocation data

63 / 69

Page 64: Big Data: hype or necessity?

Big Data

Big Data in my company?

Big Data in your company

Big data is typically a division of the IT-department.

Requires skilled people:

sysadmins

software developers

data-scientists

visualization experts

. . .

Advice, trend (Andrew McAfee)

Give geeks a seat at the decision-making table.

64 / 69

Page 65: Big Data: hype or necessity?

Big Data

Big Data in my company?

Big Data in your company

65 / 69

Page 66: Big Data: hype or necessity?

Big Data

Big Data in my company?

IWT TETRA project

Data mining: van relationele database naar Big Data (“Data mining: from relational database to Big Data”).

Dates

Submitted: 12/03/2014

Notification of acceptance: July, 2014

Runs from 01/10/2014 – 01/10/2016

People involved

Wannes De Smet (researcher)

Bart Vandewoestyne (researcher)

Johan De Gelas (project coordinator)

Thanks for being an interested project partner :-)

66 / 69

Page 68: Big Data: hype or necessity?

Big Data

Conclusions

Conclusions

“Big” can be small too.

The Big Data landscape is huge.

RDBMS and SQL are not dead.

The right tool for the right job!

Your company can benefit from Big Data technology.

We can help.

Be brave in your quest. . .

68 / 69

Page 69: Big Data: hype or necessity?

Big Data

Conclusions

Questions?

Questions?

[email protected]

69 / 69