Big Data Overview by Edgars



DESCRIPTION

Big Data" šodien ir viens no populārākajiem mārketinga saukļiem, kas tiek pamatoti un nepamatoti izmantots, runājot par (lielu?) datu uzglabāšanu un apstrādi. Prezentācijā es aplūkošu, kas tad patiesībā ir "big data" no tehnoloģijju viedokļa, kādi ir galvenie izmantošanas scenāriji un ieguvumi. Prezentācijā apskatīšu tādas tehnoloģijas kā Hadoop, HDFS, MapReduce, Impala, Sparc, Pig, Hive un citas. Tāpat tiks apskatīta integrācija ar tradicionālām DBVS un galvenie izmantošanas scenāriji.

TRANSCRIPT

Page 1: Big data overview by Edgars

Copyright © 2012, Oracle and/or its affiliates. All rights reserved.

“Big Data”

Edgars Ruņģis

Page 2: Big data overview by Edgars


What is Big Data?

VELOCITY – VOLUME – VARIETY

[Slide graphic: data-source icons – sensors, blogs, social media]

• Enormous volumes of real-time data streams from internal and external sources and historic data; hyper-volumes of structured, semi-structured and unstructured data
• Combine historic data with data streams and feeds
• Detect significant events from real-time data streams
• Respond automatically to detected events by raising alerts
• Examples: call data records, social media traffic, videos, audio, financial transactions, sensor-based data, border crossings, airline passenger records

Page 3: Big data overview by Edgars


Big Data ≈ Hadoop

Page 4: Big data overview by Edgars


Hadoop Can Be Confusing

Page 5: Big data overview by Edgars


What is Hadoop?

Page 6: Big data overview by Edgars


Hadoop

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. Hadoop is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Page 7: Big data overview by Edgars


What to Pay Attention To

• Distributed Storage – HDFS
• Parallel Processing Framework – MapReduce
• Higher-Level Languages – Hive, Pig, etc.

Page 8: Big data overview by Edgars


HDFS – The Distributed Filesystem

What is it?
• The petabyte-scale distributed file system at the core of Hadoop.

Benefits
• Linearly scalable on commodity hardware
• An order of magnitude cheaper per TB
• Designed around schema-on-read

Limitations
• Low security
• Write-once, read-many model

Page 9: Big data overview by Edgars


Interacting with HDFS

NameNodes and DataNodes

– The NameNode holds the filesystem namespace and edit log (the metadata)

– DataNodes store the actual data blocks

Command-line access resembles UNIX filesystems

– ls (list)

– cat, tail (concatenate or tail file)

– cp, mv (copy or move within HDFS)

– get, put (copy between local file system and HDFS)
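A minimal command-line session (the file and directory names are only illustrative) might look like this:

  # Copy a local file into HDFS
  hadoop fs -put weblog.txt /user/demo/logs/

  # List the directory and look at the file's contents
  hadoop fs -ls /user/demo/logs/
  hadoop fs -cat /user/demo/logs/weblog.txt

  # Copy the file back to the local file system
  hadoop fs -get /user/demo/logs/weblog.txt ./weblog.copy.txt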

Page 10: Big data overview by Edgars


HDFS Mechanics

[Slide graphic: a large file alongside six DataNodes]

• Suppose we have a large file
• And a set of DataNodes

Page 11: Big data overview by Edgars


HDFS Mechanics

[Slide graphic: the file's blocks spread across the six DataNodes]

• The file will be broken up into blocks

• Blocks are stored in multiple locations

• Allows for parallelism and fault-tolerance

• Nodes operate on their local data

Page 12: Big data overview by Edgars


MapReduce – The Parallel Processing Framework

What is it?
• The parallel processing framework that dominates the Big Data landscape.

Benefits
• Provides data-local computation
• Fault-tolerant
• Scales just like HDFS

Limitations
• You are the optimizer
• Batch-oriented

Page 13: Big data overview by Edgars


MapReduce Mechanics

Suppose 3 face cards are removed. How do we find which suits are short using MapReduce?

Page 14: Big data overview by Edgars


MapReduce Mechanics

Map Phase:

Each TaskTracker has some data local to it. Map tasks operate on this local data.

  if face_card: emit(suit, card)

[Slide graphic: four TaskTracker/DataNode pairs]

Page 15: Big data overview by Edgars


MapReduce Mechanics

Shuffle/Sort:

Intermediate data is shuffled and sorted for delivery to the reduce tasks

[Slide graphic: map outputs sorted and routed to the reducers]

Page 16: Big data overview by Edgars


MapReduce Mechanics

Reduce Phase:

Reducers operate on local data to produce final result

  emit(key, count(key))

[Slide graphic: four TaskTrackers emitting Spades: 3, Hearts: 2, Diamonds: 2, Clubs: 2]

Page 17: Big data overview by Edgars


Flow of key/value pairs

Map: input (K0, V0) → output (K1, V1)

Reduce: input (K1, list(V1)) → output (K2, V2)
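One rough way to see this flow end to end is Hadoop Streaming, which lets any executable that reads and writes key/value lines on stdin/stdout act as the map or reduce step. The sketch below is only illustrative – the jar and HDFS paths are assumptions, and cat/wc stand in for real mapper and reducer programs.

  # Submit a streaming job; cat acts as an identity mapper and wc as a
  # trivial reducer, just to show where the map and reduce steps plug in.
  hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/demo/cards \
    -output /user/demo/suit_counts \
    -mapper /bin/cat \
    -reducer /usr/bin/wc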

Page 18: Big data overview by Edgars


The default way to process data in Hadoop

Page 19: Big data overview by Edgars


[Slide graphic: the Big Data landscape – Hadoop (HDFS + MapReduce) and NoSQL databases, with an application using a NoSQL DB driver, and a "SQL style!" callout]

What should I do if I don’t know Java but still want to process data in HDFS?

Use Hive!

Page 20: Big data overview by Edgars


Hive – A Move Toward a Declarative Language

What is it?
• A SQL-like language for Hadoop.

Benefits
• Abstracts MapReduce code
• Schema-on-read via InputFormat and SerDe
• Provides and preserves metadata

Limitations
• Not ideal for ad hoc work (slow)
• Subset of SQL-92
• Immature optimizer

Page 21: Big data overview by Edgars


Storing a Clickstream

• Storing large amounts of clickstream data is a common use for HDFS
• Individual clicks aren’t valuable by themselves
• We’d like to write queries over all clicks

Page 22: Big data overview by Edgars


Defining Tables Over HDFS

• Hive allows us to define tables over HDFS directories
• The syntax is simple SQL
• SerDes allow Hive to deserialize data
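As a minimal sketch (the table name, columns, delimiter and HDFS path are all illustrative), defining such a table from the command line could look like this:

  # Define a Hive table over an existing HDFS directory; because the table
  # is EXTERNAL, dropping it later leaves the underlying data in place.
  hive -e "
  CREATE EXTERNAL TABLE clicks (
    click_time STRING,
    user_id    STRING,
    url        STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/user/demo/clickstream';"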

Page 23: Big data overview by Edgars


How Does It Work? Anatomy of a Hive Query

SELECT suit, COUNT(*)

FROM cards

WHERE face_value > 10

GROUP BY suit;

How does Hive execute this query?

Page 24: Big data overview by Edgars


Anatomy of a Hive Query

SELECT suit, COUNT(*)

FROM cards

WHERE face_value > 10

GROUP BY suit;

1. Hive optimizer builds a MapReduce job
2. Projections and predicates become Map code
3. Aggregations become Reduce code
4. Job is submitted to the MapReduce JobTracker

[Slide graphic: Map task "if face_card: emit(suit, card)" → shuffle → Reduce task "emit(suit, count(suit))"]

Page 25: Big data overview by Edgars


Hadoop Programming - Summary

• HDFS – Hadoop Distributed File System
  – Designed to achieve SCALE and aggregate THROUGHPUT on commodity hardware
  – Not a database; data remains in its original format on disk (the query engine does NOT own the data format)
• MapReduce
  – Simple programming model for large-scale data processing (NOT performance!!!)
  – Checkpoints to disk for fault tolerance
  – Leverages InputFormat, RecordReader and SerDe
  – Leverages multiple copies of data (speculative execution)
• Various higher-level languages
  – Hive (SQL implemented as MapReduce)
  – Pig (a scripting-style language, akin to Python)


Page 26: Big data overview by Edgars


Impala

Page 27: Big data overview by Edgars


What is Impala?

• Massively parallel processing (MPP) database engine, developed by Cloudera
• Integrated into the Hadoop stack at the same level as MapReduce, not above it (as Hive and Pig are)
• Impala processes data in the Hadoop cluster without using MapReduce

[Slide graphic: HDFS at the base; MapReduce and Impala sit directly on HDFS; Hive and Pig sit on top of MapReduce]
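For a feel of the interface, the deck's card-counting query could be run interactively through impala-shell; the host and table names below are illustrative.

  # Connect to an impalad and run a query (host and table are illustrative)
  impala-shell -i impala-host:21000 \
    -q "SELECT suit, COUNT(*) FROM cards WHERE face_value > 10 GROUP BY suit;"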

Page 28: Big data overview by Edgars


Impala Architecture

Page 29: Big data overview by Edgars


My query needs to go faster

• Impala, Spark, Stinger, Shark, Tajo, Presto and a whole bunch more
  – Remove MapReduce and checkpointing to disk
  – Sacrifice resilience and scale for performance
  – Limited SQL capability and access paths
  – Create a (new) SQL database on top of HDFS
  – Create and load data into optimized storage formats to improve performance

Page 30: Big data overview by Edgars


How to load data into Hadoop?

Page 31: Big data overview by Edgars


Hadoop command-line

• The source server needs to have the Hadoop client program installed.
• A file can be loaded with the following command:

  hadoop fs -put source.file /tmp/hadoop_dir

Page 32: Big data overview by Edgars


Load data into Hadoop from RDBMS (Sqoop)
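A hedged sketch of what this looks like in practice – a Sqoop import of one Oracle table into HDFS, where the connection string, credentials, table and target directory are all illustrative:

  # Import a single table from an Oracle database into HDFS
  sqoop import \
    --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
    --username scott -P \
    --table EMPLOYEES \
    --target-dir /user/demo/employees \
    --num-mappers 4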

Page 33: Big data overview by Edgars


Acquire stream data: Flume

Page 34: Big data overview by Edgars


Flume for collecting Twitter feeds


http://archive.cloudera.com/cdh4/cdh/4/flume-ng/FlumeUserGuide.html
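A rough sketch of starting such an agent; the configuration file twitter.conf is hypothetical and would define the Twitter source, a memory channel and an HDFS sink.

  # Start a Flume agent using a (hypothetical) twitter.conf that wires a
  # Twitter source -> memory channel -> HDFS sink
  flume-ng agent \
    --conf /etc/flume-ng/conf \
    --conf-file /etc/flume-ng/conf/twitter.conf \
    --name TwitterAgent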

Page 35: Big data overview by Edgars


Hadoop distributions

Page 36: Big data overview by Edgars


Cloudera


Page 37: Big data overview by Edgars


Hortonworks

Page 38: Big data overview by Edgars


Big Data and Oracle

Page 39: Big data overview by Edgars


Oracle Big Data Solution

Stream – Acquire – Organize – Analyze – Decide

[Slide graphic: the Oracle Big Data solution, with components including Oracle Event Processing, Apache Flume, Oracle GoldenGate, Oracle NoSQL Database, Cloudera Hadoop, Oracle Big Data Connectors, Oracle Data Integrator, Oracle Database, Oracle Advanced Analytics, Oracle Spatial & Graph, Oracle R Distribution, Oracle BI Foundation Suite, Oracle Real-Time Decisions and Endeca Information Discovery]

Page 40: Big data overview by Edgars


Oracle Big Data Connectors

• Oracle Loader for Hadoop

• Oracle SQL Connector for HDFS

• Oracle R Advanced Analytics for Hadoop

• Oracle XQuery for Hadoop

• Oracle Data Integrator Application Adapters for Hadoop

Licensed Together

Page 41: Big data overview by Edgars


Oracle Loader for Hadoop

Load data from Hadoop into Oracle Database.

[Slide graphic: MapReduce stages (map → shuffle/sort → reduce) over two inputs of unstructured data, with the reduce tasks loading into a local Oracle table in Oracle Database]
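The loader itself runs as a MapReduce job. A hedged sketch of a typical invocation follows; the OLH_HOME location and the configuration file are assumptions, and details vary by connector release.

  # Run Oracle Loader for Hadoop; the XML configuration names the input
  # data, the target table and the database connection details.
  hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
    -conf my_loader_conf.xml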

Page 42: Big data overview by Edgars


Oracle SQL Connector for HDFS – Use Oracle SQL to Access Data on HDFS

• Generate an external table in the database pointing to HDFS data
• Load into the database or query data in place on HDFS
• Fine-grained control over data type mapping
• Parallel load with automatic load balancing
• Kerberos authentication

[Slide graphic: a SQL query runs against an external table in Oracle Database; OSCH instances on the HDFS client access or load the Hadoop data in parallel through the external table mechanism]
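A hedged sketch of generating the external table with the connector's command-line tool; the OSCH_HOME location and the configuration file are assumptions, and the exact options depend on the connector release.

  # Create the external table definition that points at the HDFS data
  hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \
    -conf my_osch_conf.xml \
    -createTable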

Page 43: Big data overview by Edgars


New Data Sources for Oracle External Tables

CREATE TABLE movielog (click VARCHAR2(4000))
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY Dir1
  ACCESS PARAMETERS (
    com.oracle.bigdata.tablename logs
    com.oracle.bigdata.cluster   mycluster
  )
)
REJECT LIMIT UNLIMITED

• New set of properties
  – ORACLE_HIVE and ORACLE_HDFS access drivers
  – Identify a Hadoop cluster, data source, column mapping, error handling, overflow handling, logging
• New table metadata passed from Oracle DDL to Hadoop readers at query execution
• Architected for extensibility
  – StorageHandler capability enables future support for other data sources
  – Examples: MongoDB, HBase, Oracle NoSQL DB

Page 44: Big data overview by Edgars


Oracle SQL Connector for HDFS

• Load data from the external table with Oracle SQL
  – INSERT INTO <tablename> SELECT * FROM <external tablename>
• Access data in place on HDFS with Oracle SQL
  – Note: no indexes, no partitioning, so queries are a full table scan
• Read data in parallel
  – Example: if there are 96 data files and the database can support 96 PQ slaves, all 96 files can be read in parallel

Page 45: Big data overview by Edgars


Oracle SQL Connector for HDFS

Input Data Formats

• Text files
• Hive tables (text data)
• Oracle Data Pump files generated by Oracle Loader for Hadoop

Page 46: Big data overview by Edgars


Oracle SQL Connector for HDFS

Data Pump Files

• Oracle Data Pump: binary-format data file
• Loading Oracle Data Pump files is more efficient – uses about 50% less database CPU
  – Hadoop does more of the work, transforming text data into binary data optimized for Oracle
• Note: only Oracle Data Pump files generated by Oracle Loader for Hadoop

Page 47: Big data overview by Edgars


Using Hadoop To Optimize IT

Page 48: Big data overview by Edgars


Big Data and Optimized Operations

• Big Data can handle a lot of heavy lifting

– It’s a complement to the database

• Big Data allows access to more detailed data for less

• We can use Big Data to make the database do more

Page 49: Big data overview by Edgars


Optimizing ETL

[Slide graphic: mission-critical reporting and ad hoc analysis share the database with a long-running batch transformation on the base table (the Big Data problem); the base table is copied/moved to Hadoop, the long-running batch transformation runs there, and the result is loaded to Oracle]

Page 50: Big data overview by Edgars


Q&A

Page 51: Big data overview by Edgars


Page 52: Big data overview by Edgars
