Big Data Overview by Edgars



DESCRIPTION

Big Data" šodien ir viens no populārākajiem mārketinga saukļiem, kas tiek pamatoti un nepamatoti izmantots, runājot par (lielu?) datu uzglabāšanu un apstrādi. Prezentācijā es aplūkošu, kas tad patiesībā ir "big data" no tehnoloģijju viedokļa, kādi ir galvenie izmantošanas scenāriji un ieguvumi. Prezentācijā apskatīšu tādas tehnoloģijas kā Hadoop, HDFS, MapReduce, Impala, Sparc, Pig, Hive un citas. Tāpat tiks apskatīta integrācija ar tradicionālām DBVS un galvenie izmantošanas scenāriji.

TRANSCRIPT

Page 1: Big data overview by Edgars

Copyright © 2012, Oracle and/or its affiliates. All rights reserved.

“Big Data”

Edgars Ruņģis

Page 2: Big data overview by Edgars


What is Big Data?

VELOCITY – VOLUME – VARIETY

[Slide graphic: data-source icons – sensors, blogs, social media]

• Enormous volumes of real-time data streams from internal and external sources and historic data; hyper-volumes of structured, semi-structured and unstructured data
• Combine historic data with data streams and feeds
• Detect significant events from real-time data streams
• Respond automatically to detected events by raising alerts
• Examples: call data records, social media traffic, videos, audio, financial transactions, sensor-based data, border crossings, airline passenger records

Page 3: Big data overview by Edgars


Big Data ≈ Hadoop

Page 4: Big data overview by Edgars


Hadoop Can Be Confusing

Page 5: Big data overview by Edgars


What is Hadoop?

Page 6: Big data overview by Edgars


Hadoop

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. Hadoop is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Page 7: Big data overview by Edgars


What to Pay Attention To

• Distributed Storage – HDFS
• Parallel Processing Framework – MapReduce
• Higher-Level Languages – Hive, Pig, etc.

Page 8: Big data overview by Edgars


HDFS – The Distributed Filesystem

What is it?
• The petabyte-scale distributed file system at the core of Hadoop.

Benefits
• Linearly scalable on commodity hardware
• An order of magnitude cheaper per TB
• Designed around schema-on-read

Limitations
• Low security
• Write-once, read-many model

Page 9: Big data overview by Edgars


Interacting with HDFS

NameNodes and DataNodes

– The NameNode holds the filesystem namespace and edit log (the metadata)

– DataNodes store the actual data blocks

Command-line access resembles UNIX filesystems

– ls (list)

– cat, tail (concatenate or tail file)

– cp, mv (copy or move within HDFS)

– get, put (copy between local file system and HDFS)
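A minimal command-line session (the file and directory names are only illustrative) might look like this:

  # Copy a local file into HDFS
  hadoop fs -put weblog.txt /user/demo/logs/

  # List the directory and look at the file's contents
  hadoop fs -ls /user/demo/logs/
  hadoop fs -cat /user/demo/logs/weblog.txt

  # Copy the file back to the local file system
  hadoop fs -get /user/demo/logs/weblog.txt ./weblog.copy.txt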

Page 10: Big data overview by Edgars


HDFS Mechanics

[Slide graphic: a large file alongside six DataNodes]

• Suppose we have a large file
• And a set of DataNodes

Page 11: Big data overview by Edgars


HDFS Mechanics

[Slide graphic: the file's blocks spread across the six DataNodes]

• The file will be broken up into blocks

• Blocks are stored in multiple locations

• Allows for parallelism and fault-tolerance

• Nodes operate on their local data

Page 12: Big data overview by Edgars


MapReduce – The Parallel Processing Framework

What is it?
• The parallel processing framework that dominates the Big Data landscape.

Benefits
• Provides data-local computation
• Fault-tolerant
• Scales just like HDFS

Limitations
• You are the optimizer
• Batch-oriented

Page 13: Big data overview by Edgars


MapReduce Mechanics

Suppose 3 face cards are removed. How do we find which suits are short using MapReduce?

Page 14: Big data overview by Edgars


MapReduce Mechanics

Map Phase:

Each TaskTracker has some data local to it. Map tasks operate on this local data.

  if face_card: emit(suit, card)

[Slide graphic: four TaskTracker/DataNode pairs]

Page 15: Big data overview by Edgars


MapReduce Mechanics

Shuffle/Sort:

Intermediate data is shuffled and sorted for delivery to the reduce tasks

[Slide graphic: map outputs sorted and routed to the reducers]

Page 16: Big data overview by Edgars


MapReduce Mechanics

Reduce Phase:

Reducers operate on local data to produce final result

  emit(key, count(key))

[Slide graphic: four TaskTrackers emitting Spades: 3, Hearts: 2, Diamonds: 2, Clubs: 2]

Page 17: Big data overview by Edgars


Flow of key/value pairs

Map: input (K0, V0) → output (K1, V1)

Reduce: input (K1, list(V1)) → output (K2, V2)
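One rough way to see this flow end to end is Hadoop Streaming, which lets any executable that reads and writes key/value lines on stdin/stdout act as the map or reduce step. The sketch below is only illustrative – the jar and HDFS paths are assumptions, and cat/wc stand in for real mapper and reducer programs.

  # Submit a streaming job; cat acts as an identity mapper and wc as a
  # trivial reducer, just to show where the map and reduce steps plug in.
  hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/demo/cards \
    -output /user/demo/suit_counts \
    -mapper /bin/cat \
    -reducer /usr/bin/wc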

Page 18: Big data overview by Edgars


The default way to process data in Hadoop

Page 19: Big data overview by Edgars


[Slide graphic: the Big Data landscape – Hadoop (HDFS + MapReduce) and NoSQL databases, with an application using a NoSQL DB driver, and a "SQL style!" callout]

What should I do if I don’t know Java but still want to process data in HDFS?

Use Hive!

Page 20: Big data overview by Edgars


Hive – A Move Toward a Declarative Language

What is it?
• A SQL-like language for Hadoop.

Benefits
• Abstracts MapReduce code
• Schema-on-read via InputFormat and SerDe
• Provides and preserves metadata

Limitations
• Not ideal for ad hoc work (slow)
• Subset of SQL-92
• Immature optimizer

Page 21: Big data overview by Edgars


Storing a Clickstream

• Storing large amounts of clickstream data is a common use for HDFS
• Individual clicks aren’t valuable by themselves
• We’d like to write queries over all clicks

Page 22: Big data overview by Edgars


Defining Tables Over HDFS

• Hive allows us to define tables over HDFS directories
• The syntax is simple SQL
• SerDes allow Hive to deserialize data
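As a minimal sketch (the table name, columns, delimiter and HDFS path are all illustrative), defining such a table from the command line could look like this:

  # Define a Hive table over an existing HDFS directory; because the table
  # is EXTERNAL, dropping it later leaves the underlying data in place.
  hive -e "
  CREATE EXTERNAL TABLE clicks (
    click_time STRING,
    user_id    STRING,
    url        STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/user/demo/clickstream';"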

Page 23: Big data overview by Edgars


How Does It Work? Anatomy of a Hive Query

SELECT suit, COUNT(*)

FROM cards

WHERE face_value > 10

GROUP BY suit;

How does Hive execute this query?

Page 24: Big data overview by Edgars


Anatomy of a Hive Query

SELECT suit, COUNT(*)

FROM cards

WHERE face_value > 10

GROUP BY suit;

1. Hive optimizer builds a MapReduce job
2. Projections and predicates become Map code
3. Aggregations become Reduce code
4. Job is submitted to the MapReduce JobTracker

[Slide graphic: Map task "if face_card: emit(suit, card)" → shuffle → Reduce task "emit(suit, count(suit))"]

Page 25: Big data overview by Edgars


Hadoop Programming - Summary

• HDFS – Hadoop Distributed File System
  – Designed to achieve SCALE and aggregate THROUGHPUT on commodity hardware
  – Not a database; data remains in its original format on disk (the query engine does NOT own the data format)
• MapReduce
  – Simple programming model for large-scale data processing (NOT performance!!!)
  – Checkpoints to disk for fault tolerance
  – Leverages InputFormat, RecordReader and SerDe
  – Leverages multiple copies of data (speculative execution)
• Various higher-level languages
  – Hive (SQL implemented as MapReduce)
  – Pig (a scripting-style language, akin to Python)


Page 26: Big data overview by Edgars


Impala

Page 27: Big data overview by Edgars


What is Impala?

• Massively parallel processing (MPP) database engine, developed by Cloudera
• Integrated into the Hadoop stack at the same level as MapReduce, not above it (as Hive and Pig are)
• Impala processes data in the Hadoop cluster without using MapReduce

[Slide graphic: HDFS at the base; MapReduce and Impala sit directly on HDFS; Hive and Pig sit on top of MapReduce]
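For a feel of the interface, the deck's card-counting query could be run interactively through impala-shell; the host and table names below are illustrative.

  # Connect to an impalad and run a query (host and table are illustrative)
  impala-shell -i impala-host:21000 \
    -q "SELECT suit, COUNT(*) FROM cards WHERE face_value > 10 GROUP BY suit;"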

Page 28: Big data overview by Edgars


Impala Architecture

Page 29: Big data overview by Edgars


My query needs to go faster

• Impala, Spark, Stinger, Shark, Tajo, Presto and a whole bunch more
  – Remove MapReduce and checkpointing to disk
  – Sacrifice resilience and scale for performance
  – Limited SQL capability and access paths
  – Create a (new) SQL database on top of HDFS
  – Create and load data into optimized storage formats to improve performance

Page 30: Big data overview by Edgars


How to load data into Hadoop?

Page 31: Big data overview by Edgars


Hadoop command-line

• The source server needs to have the Hadoop client program installed.
• A file can be loaded with the following command:

  hadoop fs -put source.file /tmp/hadoop_dir

Page 32: Big data overview by Edgars


Load data into Hadoop from RDBMS (Sqoop)
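A hedged sketch of what this looks like in practice – a Sqoop import of one Oracle table into HDFS, where the connection string, credentials, table and target directory are all illustrative:

  # Import a single table from an Oracle database into HDFS
  sqoop import \
    --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
    --username scott -P \
    --table EMPLOYEES \
    --target-dir /user/demo/employees \
    --num-mappers 4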

Page 33: Big data overview by Edgars


Acquire stream data: Flume

Page 34: Big data overview by Edgars


Flume for collecting Twitter feeds


http://archive.cloudera.com/cdh4/cdh/4/flume-ng/FlumeUserGuide.html
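A rough sketch of starting such an agent; the configuration file twitter.conf is hypothetical and would define the Twitter source, a memory channel and an HDFS sink.

  # Start a Flume agent using a (hypothetical) twitter.conf that wires a
  # Twitter source -> memory channel -> HDFS sink
  flume-ng agent \
    --conf /etc/flume-ng/conf \
    --conf-file /etc/flume-ng/conf/twitter.conf \
    --name TwitterAgent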

Page 35: Big data overview by Edgars


Hadoop distributions

Page 36: Big data overview by Edgars


Cloudera


Page 37: Big data overview by Edgars


Hortonworks

Page 38: Big data overview by Edgars


Big Data and Oracle

Page 39: Big data overview by Edgars


Oracle Big Data Solution

Stream – Acquire – Organize – Analyze – Decide

[Slide graphic: the Oracle Big Data solution, with components including Oracle Event Processing, Apache Flume, Oracle GoldenGate, Oracle NoSQL Database, Cloudera Hadoop, Oracle Big Data Connectors, Oracle Data Integrator, Oracle Database, Oracle Advanced Analytics, Oracle Spatial & Graph, Oracle R Distribution, Oracle BI Foundation Suite, Oracle Real-Time Decisions and Endeca Information Discovery]

Page 40: Big data overview by Edgars


Oracle Big Data Connectors

• Oracle Loader for Hadoop

• Oracle SQL Connector for HDFS

• Oracle R Advanced Analytics for Hadoop

• Oracle XQuery for Hadoop

• Oracle Data Integrator Application Adapters for Hadoop

Licensed Together

Page 41: Big data overview by Edgars


Oracle Loader for Hadoop

Load data from Hadoop into Oracle Database.

[Slide graphic: MapReduce stages (map → shuffle/sort → reduce) over two inputs of unstructured data, with the reduce tasks loading into a local Oracle table in Oracle Database]
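The loader itself runs as a MapReduce job. A hedged sketch of a typical invocation follows; the OLH_HOME location and the configuration file are assumptions, and details vary by connector release.

  # Run Oracle Loader for Hadoop; the XML configuration names the input
  # data, the target table and the database connection details.
  hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
    -conf my_loader_conf.xml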

Page 42: Big data overview by Edgars


Oracle SQL Connector for HDFS – Use Oracle SQL to Access Data on HDFS

• Generate an external table in the database pointing to HDFS data
• Load into the database or query data in place on HDFS
• Fine-grained control over data type mapping
• Parallel load with automatic load balancing
• Kerberos authentication

[Slide graphic: a SQL query runs against an external table in Oracle Database; OSCH instances on the HDFS client access or load the Hadoop data in parallel through the external table mechanism]
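A hedged sketch of generating the external table with the connector's command-line tool; the OSCH_HOME location and the configuration file are assumptions, and the exact options depend on the connector release.

  # Create the external table definition that points at the HDFS data
  hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \
    -conf my_osch_conf.xml \
    -createTable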

Page 43: Big data overview by Edgars


New Data Sources for Oracle External Tables

CREATE TABLE movielog (click VARCHAR2(4000))
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY Dir1
  ACCESS PARAMETERS (
    com.oracle.bigdata.tablename logs
    com.oracle.bigdata.cluster   mycluster
  )
)
REJECT LIMIT UNLIMITED

• New set of properties
  – ORACLE_HIVE and ORACLE_HDFS access drivers
  – Identify a Hadoop cluster, data source, column mapping, error handling, overflow handling, logging
• New table metadata passed from Oracle DDL to Hadoop readers at query execution
• Architected for extensibility
  – StorageHandler capability enables future support for other data sources
  – Examples: MongoDB, HBase, Oracle NoSQL DB

Page 44: Big data overview by Edgars


Oracle SQL Connector for HDFS

• Load data from the external table with Oracle SQL
  – INSERT INTO <tablename> SELECT * FROM <external tablename>
• Access data in place on HDFS with Oracle SQL
  – Note: no indexes, no partitioning, so queries are a full table scan
• Read data in parallel
  – Example: if there are 96 data files and the database can support 96 PQ slaves, all 96 files can be read in parallel

Page 45: Big data overview by Edgars


Oracle SQL Connector for HDFS

Input Data Formats

• Text files
• Hive tables (text data)
• Oracle Data Pump files generated by Oracle Loader for Hadoop

Page 46: Big data overview by Edgars


Oracle SQL Connector for HDFS

Data Pump Files

• Oracle Data Pump: binary-format data file
• Loading Oracle Data Pump files is more efficient – uses about 50% less database CPU
  – Hadoop does more of the work, transforming text data into binary data optimized for Oracle
• Note: only Oracle Data Pump files generated by Oracle Loader for Hadoop

Page 47: Big data overview by Edgars


Using Hadoop To Optimize IT

Page 48: Big data overview by Edgars


Big Data and Optimized Operations

• Big Data can handle a lot of heavy lifting

– It’s a complement to the database

• Big Data allows access to more detailed data for less

• We can use Big Data to make the database do more

Page 49: Big data overview by Edgars


Optimizing ETL

[Slide graphic: mission-critical reporting and ad hoc analysis share the database with a long-running batch transformation on the base table (the Big Data problem); the base table is copied/moved to Hadoop, the long-running batch transformation runs there, and the result is loaded to Oracle]

Page 50: Big data overview by Edgars


Q&A

Page 51: Big data overview by Edgars


Page 52: Big data overview by Edgars
