
Big Data Introduction

Ralf Lange

Global ISV & OEM Sales


Conventional infrastructure


MapReduce


In Actuality


What is MapReduce?

[Diagram: an input list of records is transformed element by element (map) and the intermediate results are combined (reduce) into an output list; a toy Python version follows.]
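To make the picture concrete, here is a minimal, single-process sketch of the MapReduce idea in Python. This is plain Python, not Hadoop; the card records and the suit-count example are illustrative. Map each input record to (key, value) pairs, group the pairs by key ("shuffle"), then reduce each group to a result.

from collections import defaultdict

# Toy records: one card per string. map_fn emits (key, value) pairs,
# the "shuffle" groups values by key, and reduce_fn folds each group.
records = ["spades 7", "hearts K", "clubs 2", "hearts 9", "spades Q"]

def map_fn(record):
    suit, rank = record.split()
    yield suit, 1                      # emit one count per card

def reduce_fn(suit, counts):
    return suit, sum(counts)           # total per suit

grouped = defaultdict(list)
for record in records:                 # map phase
    for key, value in map_fn(record):
        grouped[key].append(value)     # shuffle/sort: group values by key

results = [reduce_fn(key, values) for key, values in sorted(grouped.items())]
print(results)                         # [('clubs', 1), ('hearts', 2), ('spades', 2)]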


Basics Of Hadoop

[Diagram: the NameNode keeps an in-memory map of File 1's pieces (1, 2, 3) to the DataNodes that hold them. A MapReduce job, packaged as a JAR, goes to the JobTracker, which hands Map and Reduce tasks to the TaskTrackers running alongside each DataNode.]


Data Loading


Programming Languages

Normal Hadoop

HCatalog

Pig

DataFu


Management

[Diagram: ZooKeeper coordinating multiple processes and threads (Process 1, Process 2, Thread 1, Thread 2).]


GUIs


Similar to Oracle


Big Data @ Oracle


Oracle Big Data Solution

[Architecture diagram, organized around the Acquire – Organize – Analyze – Decide flow:]

Decide: Oracle BI Foundation Suite, Oracle Real-Time Decisions, Endeca Information Discovery

Acquire – Organize – Analyze: Oracle Database, Oracle Advanced Analytics, Oracle Spatial & Graph, Oracle Big Data Connectors, Oracle Data Integrator

Stream: Oracle Event Processing, Apache Flume, Oracle GoldenGate

Oracle NoSQL Database – scalable key-value store
Cloudera Hadoop – scalable, low-cost data storage and processing engine
Oracle R Distribution – statistical analysis framework


Massive detail data: store more raw detail data for less cost, while keeping aggregates in the DB.

Big batch jobs: long-running batch jobs can run in Hadoop to make the most of the DB.

Unifying data sources: many data marts merged in Hadoop to provide unified views of data.

Big Data ≠ Unstructured Data


Big Data ≈ Hadoop


Hadoop Can Be Confusing


What is Hadoop?


Hadoop

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. Hadoop is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

Framework for distributed processing
Large data sets
Clusters of computers
Simple programming models
Highly available service


What to Pay Attention To

Distributed Storage

– HDFS

Parallel Processing Framework

– MapReduce

Higher-Level Languages

– Hive

– Pig

– Etc.


HDFS: The Distributed Filesystem

What is it? The petabyte-scale distributed file system at the core of Hadoop.

Benefits
– Linearly scalable on commodity hardware
– An order of magnitude cheaper per TB
– Designed around schema-on-read

Limitations
– Low security
– Write-once, read-many model


Interacting with HDFS

NameNodes and DataNodes

– NameNodes hold the filesystem metadata (the namespace and edit log)

– DataNodes store the data blocks

Command-line access resembles UNIX filesystems (a scripted sketch follows the list)

– ls (list)

– cat, tail (concatenate or tail a file)

– cp, mv (copy or move within HDFS)

– get, put (copy between the local file system and HDFS)
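A quick way to script these commands is to shell out to the HDFS client. The sketch below assumes a working Hadoop installation with hdfs on the PATH (older releases use "hadoop fs" instead of "hdfs dfs"); the /user/demo paths and clicks.log file are placeholders.

import subprocess

def hdfs(*args):
    """Run an `hdfs dfs` subcommand and return its stdout."""
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

hdfs("-put", "clicks.log", "/user/demo/")                     # put: copy a local file into HDFS
print(hdfs("-ls", "/user/demo"))                              # ls: list the directory
print(hdfs("-tail", "/user/demo/clicks.log"))                 # tail: last kilobyte of the file
hdfs("-get", "/user/demo/clicks.log", "clicks.copy")          # get: copy back to the local filesystem
hdfs("-mv", "/user/demo/clicks.log", "/user/demo/archive/")   # mv: move within HDFS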


HDFS Mechanics

Suppose we have a large file and a set of DataNodes.

[Diagram: the file alongside six DataNodes.]


HDFS Mechanics

[Diagram: the file's blocks spread across the DataNodes.]

• The file will be broken up into blocks (a rough sketch of the block math follows the list)
• Blocks are stored in multiple locations
• Allows for parallelism and fault-tolerance
• Nodes operate on their local data
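A back-of-the-envelope sketch of the splitting, assuming the common defaults of a 128 MB block size (64 MB on older releases) and 3 replicas; the round-robin placement below is only an illustration, not HDFS's actual rack-aware placement policy.

import math
from itertools import cycle, islice

BLOCK_SIZE = 128 * 1024 * 1024                     # assumed default block size, in bytes
REPLICATION = 3                                    # assumed default replication factor
datanodes = [f"datanode-{i}" for i in range(1, 7)]

def plan_blocks(file_size_bytes):
    """Split a file into blocks and pick REPLICATION nodes for each (illustrative only)."""
    node_ring = cycle(datanodes)
    n_blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    return [(block, list(islice(node_ring, REPLICATION))) for block in range(n_blocks)]

# A 1 GB file becomes 8 blocks, each stored on 3 different DataNodes,
# so a single node failure leaves every block readable.
for block, replicas in plan_blocks(1 * 1024**3):
    print(f"block {block}: {replicas}")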


MapReduce: The Parallel Processing Framework

What is it? The parallel processing framework that dominates the Big Data landscape.

Benefits
– Provides data-local computation
– Fault-tolerant
– Scales just like HDFS

Limitations
– You are the optimizer
– Quasi-functional model is counterintuitive
– Batch-oriented


MapReduce Mechanics

Suppose 3 face cards are removed. How do we find which suits are short using MapReduce?


MapReduce Mechanics

Map Phase: Each TaskTracker has some data local to it. Map tasks operate on this local data.

If face_card: emit(suit, card)

[Diagram: four TaskTracker/DataNode pairs, each running the map task over its local cards; a Python sketch of such a mapper follows.]
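As a hedged sketch, the map task above could be written as a Hadoop Streaming mapper in Python. The input format is an assumption made here (one card per line as "<rank> <suit>", e.g. "K hearts"); the slide's pseudocode only fixes the emit(suit, card) behaviour.

#!/usr/bin/env python3
# mapper.py: Hadoop Streaming mapper for the card example.
# Assumes each input line is "<rank> <suit>", e.g. "K hearts".
import sys

FACE_RANKS = {"J", "Q", "K"}

for line in sys.stdin:
    if not line.strip():
        continue
    rank, suit = line.split()
    if rank in FACE_RANKS:             # "if face_card:"
        print(f"{suit}\t{rank}")       # "emit(suit, card)" as a tab-separated key/value pair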


MapReduce Mechanics

Shuffle/Sort: Intermediate data is shuffled and sorted for delivery to the reduce tasks.

[Diagram: map output flowing through a sort step to the reducers.]


MapReduce Mechanics

Reduce Phase: Reducers operate on local data to produce the final result.

emit(key, count(key))

[Diagram: four TaskTrackers producing the final counts (Spades: 3, Hearts: 2, Diamonds: 2, Clubs: 2); a Python sketch of such a reducer follows.]
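The matching reducer, again as a hedged Hadoop Streaming sketch (the jar name and HDFS directories in the trailing comment are placeholders). Streaming delivers the mapper output sorted by key, so all cards for a suit arrive together and can simply be counted.

#!/usr/bin/env python3
# reducer.py: counts face cards per suit from mapper.py's sorted output.
import sys
from itertools import groupby

def pairs(stream):
    for line in stream:
        suit, rank = line.rstrip("\n").split("\t")
        yield suit, rank

for suit, group in groupby(pairs(sys.stdin), key=lambda kv: kv[0]):
    print(f"{suit}\t{sum(1 for _ in group)}")      # emit(suit, count(suit))

# One way to run the pair (paths are placeholders):
#   hadoop jar hadoop-streaming.jar \
#     -input /data/cards -output /data/suit_counts \
#     -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py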


Hive: A Move Toward Declarative Language

What is it? A SQL-like language for Hadoop.

Benefits
– Abstracts MapReduce code
– Schema-on-read via InputFormat and SerDe
– Provides and preserves metadata

Limitations
– Not ideal for ad hoc work (slow)
– Subset of SQL-92
– Immature optimizer


Storing a Clickstream

Storing large amounts of clickstream data is a common use for HDFS. Individual clicks aren't valuable by themselves, but we'd like to write queries over all clicks.


Defining Tables Over HDFS

Hive allows us to define tables over HDFS directories. The syntax is simple SQL, and SerDes allow Hive to deserialize the data; a sketch of such a table definition follows.
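Here is a hedged sketch of what such a definition might look like for the clickstream example, driven from Python with the third-party PyHive package. The HiveServer2 host, table name, columns, and HDFS location below are all assumptions made for illustration.

from pyhive import hive   # third-party package; assumes a running HiveServer2

conn = hive.Connection(host="hiveserver2.example.com", port=10000, username="analyst")
cur = conn.cursor()

# Define a table over files already sitting in an HDFS directory.
# ROW FORMAT DELIMITED is the simplest case; a custom SerDe clause can be
# used instead when the data is not plain delimited text.
cur.execute("""
  CREATE EXTERNAL TABLE IF NOT EXISTS clicks (
    click_time STRING,
    user_id    STRING,
    url        STRING,
    referrer   STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
  STORED AS TEXTFILE
  LOCATION '/data/clickstream'
""")

# The table is only metadata: dropping it leaves the files in place,
# and queries over it compile down to MapReduce jobs.
cur.execute("SELECT url, COUNT(*) AS hits FROM clicks GROUP BY url ORDER BY hits DESC LIMIT 10")
for url, hits in cur.fetchall():
    print(url, hits)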


How Does It Work? Anatomy of a Hive Query

SELECT suit, COUNT(*)
FROM cards
WHERE face_value > 10
GROUP BY suit;

How does Hive execute this query?


Anatomy of a Hive Query

SELECT suit, COUNT(*)
FROM cards
WHERE face_value > 10
GROUP BY suit;

1. The Hive optimizer builds a MapReduce job
2. Projections and predicates become Map code
3. Aggregations become Reduce code
4. The job is submitted to the MapReduce JobTracker

[Diagram: Map task (if face_card: emit(suit, card)) feeds the shuffle, which feeds the Reduce task (emit(suit, count(suit))).]


Using Hadoop To Optimize IT


Big Data and Optimized Operations

• Big Data can handle a lot of heavy lifting

– It’s a complement to the database

• Big Data allows access to more detail data for less cost

• We can use Big Data to make the database do more


Optimizing ETL, Saving SLAs

[Diagram] The Big Data problem: the database serves mission-critical reporting and ad hoc analysis while also running a long-running batch transformation against the base table. The offload: copy/move the base table to Hadoop, run the long-running batch transformation there, and load the results back into Oracle.


Store More Details For Less

[Diagram] Base table → aggregation → reporting table in the database. The raw detail is kept as an external table or aggregated on Hadoop (the Big Data problem), while the database keeps the aggregates.


Using Hadoop To Build New Datasets


What Does a Big Data World Look Like? Truck/Motor Manufacturer

Collections
– Internal sensors
– Miles per gallon, driving techniques
– Location information

Uses
– Better tailored servicing plans
– Better targeted marketing
– Offer better finance deals or related options
– More data for R&D
– Sell on to partners


Big Data and Analytics

Big Data does not make analytics easier

– There is no magic bullet

Some things work better in a database

Big Data allows the collection of new datasets

Big Data allows modeling on a more granular level


No Magic Bullets

Examples: food monitoring by RFID tags (a fridge that monitors food usage and sell-by dates); monitoring the complete car (better targeted marketing).

There is a gap between
– The available dataset
– The value proposition

Big Data helps bridge the gap


Some Things Work Better in RDBMS

• Time series analysis
• Spatial analysis
• Linear and nonlinear modeling
• Interaction with SAS and R

• Clustering on massive data
• Fine-grained classification
• Dataset construction
• Deploying models on many subgroups


Collecting New Datasets: The Complete Car

[Diagram] Raw feeds: minute-by-minute MPH, GPS readings, and on-board vehicle diagnostics. These are combined into trip records (location and speed) and a vehicle usage report (the Big Data problem), answering: How does the customer drive? Where does the customer drive? How do we maximize their value?


More Granular Modeling: Testing Trip Dynamics

[Diagram] An analyst proposes a new model for maintenance alerts. The model is tested and summarized on all engine readings (the Big Data problem), and the aggregated test results flow back to the analyst.


Fitting Fat Tails: Modeling “Outlying” Customers

[Diagram] Significant value may exist in the tails. A parallelized locally-weighted linear regression fits a model over all of the data (the Big Data problem), and the analyst works with the result. A single-machine sketch of the core fit follows.
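For concreteness, here is a minimal single-machine sketch of the core locally-weighted linear regression fit in Python with NumPy; the Gaussian kernel, the bandwidth tau, and the toy data are illustrative assumptions. The slide's point is that this per-query-point fit is embarrassingly parallel, so it can be spread across the cluster, for example one task per partition of query points.

import numpy as np

def lwlr(x_query, X, y, tau=0.5):
    """Fit a line weighted toward points near x_query and predict there."""
    A = np.column_stack([np.ones_like(X), X])              # design matrix [1, x]
    w = np.exp(-((X - x_query) ** 2) / (2.0 * tau ** 2))   # Gaussian kernel weights
    W = np.diag(w)
    # Weighted least squares: theta = (A^T W A)^{-1} A^T W y
    theta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
    return theta[0] + theta[1] * x_query

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 10.0, 200))
y = np.sin(X) + 0.1 * rng.standard_normal(200)             # toy, nonlinear data

preds = np.array([lwlr(x0, X, y) for x0 in X])             # one local fit per query point
print(float(np.mean((preds - y) ** 2)))                    # in-sample mean squared error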


Q&A

