Modern Big Data Analytics Tools: An Overview

Post on 26-Jan-2015

DESCRIPTION

Great Wide Open 2014 - Day 1 Milind Bhandarkar - Pivotal 3:30 PM - Operations 2 (Big Data)

TRANSCRIPT

Modern Big Data Analytics Tools: An Overview

Milind Bhandarkar Chief Scientist, Pivotal (Twitter : @techmilind)

(All Images Courtesy Flickr, Creative Commons Licensed)

About Me

• http://www.linkedin.com/in/milindb

• Founding member of Hadoop team at Yahoo! [2005-2010]

• Contributor to Apache Hadoop since v0.1

• Built and led Grid Solutions Team at Yahoo! [2007-2010]

• Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu)

• Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems (acquired by Oracle), Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn, and Pivotal (formerly Greenplum)

Hadoop Midwife :-)

Once upon a time, in a land far far away…

Fast forward 15 years..

What Happened?

And, then…

[Diagram: HDFS and MapReduce. Legend: ASF Projects, FLOSS Projects, Pivotal Products]

In a blink of an eye…

[Diagram: the Hadoop ecosystem. Components shown: HDFS, YARN, MapReduce, Tez, Pig, Hive, HBase, Phoenix, Sqoop, Flume, Zookeeper, Oozie, Giraph, Hue, SolrCloud, Crunch, Mahout, Spark, Shark, Streaming, MLlib, GraphX, Impala, HAWQ, GemFire XD, SpringXD, MADlib, Hamster, PivotalR, Command Center. Group labels: Coordination and workflow management; Hadoop UI. Legend: ASF Projects, FLOSS Projects, Pivotal Products]

History (2003-2010)

[Image equation: Google Papers + Yahoo! Search = (image)]

W-1-W

• WebMap : Graph processing for WWW

• Dreadnaught: Infrastructure for WebMap

• W-1-W: WebMap In One Week

• Juggernaut: Infrastructure for W-1-W

• JFS, JMR, Condor: Abandoned for Hadoop

Lucene, Nutch

Kryptonite

Major Step Backwards?

MapReduce is the Revenge of System Programmers on Database community. - Anonymous at XLDB, Stanford, 2010

O’Reilly Books 2013

Who Uses Hadoop? (From Hadoop Summit 2010)

Big Data Landscape - July 2012 http://www.forbes.com/sites/davefeinleib/2012/06/19/the-big-data-landscape/

Hadoop Ecosystem (Jan 2013) http://www.datameer.com/blog/perspectives/hadoop-ecosystem-as-of-january-2013-now-an-app.html

Game Changing Hadoop Economics

[Chart: Big Data Platform Price/TB, 2008 to 2013. Y-axis: $0 to $80,000. Series: Big Data DB, Hadoop]

Hadoop Maturity

• ETL Offload: Accommodate massive data growth with existing EDW investments

• Data Lakes: Unify Unstructured and Structured Data Access

• Big Data Apps: Build analytic-led applications impacting top line revenue

• Data-Driven Enterprise: App Dev and Operational Management on HDFS Data Architecture

• 70% of data generated by customers

• 80% of data being stored

• 3% being prepared for analysis

• 0.5% being analyzed

• <0.5% being operationalized

Average Enterprises

The Big Gap

Storage Options

• HDFS, MapR, Quantcast QFS

• EMC Isilon, NetApp, IBM GPFS, PanFS, PVFS, Lustre

• Amazon S3, EMC Atmos, OpenStack Swift

• GlusterFS, Ceph

• EMC ViPR

SQL-on-Hadoop

• Pivotal HAWQ

• Cloudera Impala, Facebook Presto, Apache Drill, Cascading Lingual, Optiq, Hortonworks Stinger

• Hadapt, Jethrodata, IBM BigSQL, Microsoft PolyBase

• More to come...

HAWQ & HDFS

[Diagram labels: Network Interconnect; Master Servers (Planning & dispatch); Segment Servers (Query execution); Storage (HDFS, HBase …)]

[Diagram labels: Namenode, replication, Rack1, Rack2, Datanode, Read/Write, Master host, Meta Ops, HAWQ Interconnect, Segment host, Segment]

HAWQ vs Hive

Lower is Better

Provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.

In-Database Analytics

MADlib Algorithms

MADlib Functions

• Linear Regression

• Logistic Regression

• Multinomial Logistic Regression

• K-Means

• Association Rules

• Latent Dirichlet Allocation

• Naïve Bayes

• Elastic Net Regression

• Decision Trees / Random Forest

• Support Vector Machines

• Cox Proportional Hazards Regression

• Descriptive Statistics

• ARIMA

k-Means Usage

SELECT * FROM madlib.kmeanspp (
    'customers',   -- name of the input table
    'features',    -- name of the feature array column
    2              -- k : number of clusters
);

 centroids                                                               | objective_fn     | frac_reassigned | …
-------------------------------------------------------------------------+------------------+-----------------+ …
 {{68.01668579784,48.9667382972952},{28.1452167573446,84.5992507653263}} | 586729.010675982 | 0.001           | …
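For readers who want to see what a k-means call of this kind computes, here is a minimal single-machine sketch in Python: k-means++ seeding followed by Lloyd iterations, returning the same three quantities as the output columns above (centroids, objective function, fraction of points reassigned in the last iteration). This is not MADlib's implementation, which runs data-parallel inside the database; the function names, the `max_iter` and `seed` parameters, and the sample points are all assumptions made for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seed(points, k, rng):
    """k-means++ seeding: each new centroid is drawn with probability
    proportional to its squared distance from the nearest chosen centroid."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


def kmeans(points, k, max_iter=20, seed=0):
    """Lloyd iterations starting from k-means++ seeds.

    Returns (centroids, objective, frac_reassigned), where frac_reassigned
    is the share of points that changed cluster in the last iteration run.
    """
    rng = random.Random(seed)
    centroids = kmeanspp_seed(points, k, rng)
    assignment = [None] * len(points)
    changed = len(points)
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        changed = sum(a != b for a, b in zip(assignment, new_assignment))
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
        if changed == 0:
            break
    objective = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))
    return centroids, objective, changed / len(points)


# Hypothetical data: two well-separated groups of four points each.
sample = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
print(kmeans(sample, k=2))
```

The in-database version differs mainly in where the loops run: the assignment and update steps become aggregates over the table, so the rows never leave the cluster.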

Accessing HAWQ Through R

Pivotal R

• Interface is R client

• Execution is in database

• Parallelism handled by PivotalR

• Supports a portion of R

R> x = db.data.frame("t1")
R> l = madlib.lm(interlocks ~ assets + nation, data = x)

A wrapper of MADlib

• Linear regression

• Logistic regression

• Elastic Net

• ARIMA

• Table summary

• Categorical variable as.factor()

• $ [ [[ $<- [<- [[<-

• is.na

• + - * / %% %/% ^

• & | !

• == != > < >= <=

• merge

• by

• db.data.frame

• as.db.data.frame

• preview

• sort

• c mean sum sd var min max length colMeans colSums

• db.connect db.disconnect db.list db.objects db.existsObject delete

• dim names

• content

And more ... (SQL wrapper)

• predict

In-Database Execution

• All data stays in DB: R objects merely point to DB objects

• All model estimation and heavy lifting done in DB by MADlib

• R → SQL translation done in the R client

• Only strings of SQL and model output transferred across ODBC/DBI
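The pattern in the bullets above (a client-side object that only points at a table, translates operations into SQL, and receives just the result) can be sketched in a few lines. The sketch below uses Python and the standard-library sqlite3 module as a stand-in for the database; it is not PivotalR's code, and the `TableProxy` class, its methods, and the sample table contents are hypothetical names and data chosen for illustration.

```python
import sqlite3


class TableProxy:
    """Client-side handle that stores a table name, never the rows."""

    def __init__(self, conn, table):
        self.conn = conn
        self.table = table

    def sql_for_mean(self, column):
        """Translate a client-side mean() request into a SQL string."""
        # Identifiers are interpolated into the SQL text, so reject
        # anything that is not a plain identifier.
        for ident in (self.table, column):
            if not ident.isidentifier():
                raise ValueError(f"unsafe identifier: {ident!r}")
        return f"SELECT AVG({column}) FROM {self.table}"

    def mean(self, column):
        """Send the SQL string to the database; only one number comes back."""
        return self.conn.execute(self.sql_for_mean(column)).fetchone()[0]


# Stand-in database with a small hypothetical table named t1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (assets REAL)")
conn.executemany("INSERT INTO t1 VALUES (?)", [(1.0,), (2.0,), (3.0,)])

x = TableProxy(conn, "t1")
print(x.sql_for_mean("assets"))
print(x.mean("assets"))
```

The design point is that the client stays small regardless of table size: what crosses the connection is a SQL string in one direction and a single aggregate in the other.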

Beyond MapReduce with YARN

[Diagram labels: five "Single App" stacks (BATCH, INTERACTIVE, BATCH, BATCH, ONLINE); HDFS]

Hadoop 1.0 (Image Courtesy Arun Murthy, Hortonworks)

MapReduce 1.0 (Image Courtesy Arun Murthy, Hortonworks)

Hadoop 2.0 (Image Courtesy Arun Murthy, Hortonworks)

HADOOP 1.0

• Pig (data flow), Hive (sql), Others (cascading)

• MapReduce (cluster resource management & data processing)

• HDFS (redundant, reliable storage)

HADOOP 2.0

• Pig (data flow), Hive (sql), Others (cascading)

• MR (batch), Tez (execution engine), RT Stream, Graph (Storm, Giraph), Services (HBase)

• YARN (cluster resource management)

• HDFS2 (redundant, reliable storage)

Applications Run Natively IN Hadoop

• BATCH (MapReduce)

• INTERACTIVE (Tez)

• STREAMING (Storm, S4, …)

• GRAPH (Giraph)

• IN-MEMORY (Spark)

• HPC MPI (OpenMPI)

• ONLINE (HBase)

• OTHER (Search) (Weave…)

• YARN (Cluster Resource Management)

• HDFS2 (Redundant, Reliable Storage)

YARN Platform (Image Courtesy Arun Murthy, Hortonworks)

[Diagram labels: ResourceManager, Scheduler, Client2, NodeManager (×12), AM 1 with Containers 1.1 to 1.3, AM2 with Containers 2.1 to 2.4]

YARN Architecture (Image Courtesy Arun Murthy, Hortonworks)

YARN

• Yet Another Resource Negotiator

• Resource Manager

• Node Managers

• Application Masters

• Specific to paradigm, e.g. MR Application master (aka JobTracker)
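To make the division of labor among these three roles concrete, here is a toy model in Python: a resource manager that tracks free memory on each node manager and hands out containers first-fit, and an application master that asks for containers on behalf of one application. It is a sketch of the idea only and uses none of the real YARN APIs; the class and method names, the first-fit policy, and the memory figures are assumptions. It also omits real-world details such as the application master itself running inside a container.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """One worker node: tracks its free memory and the containers it hosts."""
    name: str
    free_mb: int
    containers: list = field(default_factory=list)


class ResourceManager:
    """Cluster-wide allocator using a first-fit policy (an assumption)."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.next_index = {}  # per-application container counter

    def allocate(self, app_id, mb):
        """Grant one container of `mb` megabytes, or None if nothing fits."""
        for node in self.nodes:
            if node.free_mb >= mb:
                node.free_mb -= mb
                n = self.next_index.get(app_id, 0) + 1
                self.next_index[app_id] = n
                container_id = f"{app_id}.{n}"
                node.containers.append(container_id)
                return node.name, container_id
        return None


class ApplicationMaster:
    """Per-application agent that negotiates containers with the RM."""

    def __init__(self, app_id, rm):
        self.app_id = app_id
        self.rm = rm
        self.granted = []

    def request(self, count, mb):
        """Ask for `count` containers; stop early if the cluster is full."""
        for _ in range(count):
            grant = self.rm.allocate(self.app_id, mb)
            if grant is None:
                break
            self.granted.append(grant)
        return self.granted


# Hypothetical two-node cluster and one application.
rm = ResourceManager([NodeManager("nm1", 2048), NodeManager("nm2", 1024)])
am1 = ApplicationMaster("1", rm)
print(am1.request(count=4, mb=1024))
```

Because the application master is the only paradigm-specific piece, a different framework (MPI, graph processing, streaming) would swap in its own `ApplicationMaster` logic while the resource manager and node managers stay unchanged.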

Beyond MapReduce

• Apache Giraph - BSP & Graph Processing

• Storm on Yarn - Streaming Computation

• HOYA - HBase on Yarn

• Hamster - MPI on Hadoop

• More to come ...

Hamster

• Hadoop and MPI on the same cluster

• OpenMPI Runtime on Hadoop YARN

• Hadoop Provides: Resource Scheduling, Process monitoring, Distributed File System

• Open MPI Provides: Process launching, Communication, I/O forwarding

GraphLab + Hamster on Hadoop


About GraphLab

• Graph-based, High-Performance distributed computation framework

• Started by Prof. Carlos Guestrin at CMU in 2009

• Recently founded GraphLab Inc. to commercialize GraphLab.org

GraphLab Features

• Topic Modeling (e.g. LDA)

• Graph Analytics (PageRank, Triangle counting)

• Clustering (K-Means)

• Collaborative Filtering

• Linear Solvers

• etc...

Only Graphs are not Enough

• Full Data processing workflow requires ETL/Postprocessing, Visualization, Data Wrangling, Serving

• MapReduce excels at data wrangling

• OLTP/NoSQL Row-Based stores excel at Serving

• GraphLab should co-exist with other Hadoop frameworks

Data Platform of the Future?

Analytic Data Marts

SQL Services

Operational Intelligence

In-Memory Database

Run-Time Applications

Data Staging Platform

Data Mgmt. Services

Stream Ingestion

Streaming Services

Software-Defined Datacenter

New Data-fabrics

In-Memory Grid

...ETC

Questions?
