bay area hadoop user group

Accelerated Analytics for the Big Data Fabric Bay Area Hadoop User Group © 2012, Pentaho. All Rights Reserved. pentaho.com. Worldwide +1 (866) 660-7555

Upload: pentaho

Post on 20-Aug-2015



AGENDA

The Big Data Fabric

Big Data Preparation – An Everyday Challenge

Use-Case Scenario – Call Volume Analysis

Solution Requirements

Solution Workflow

Phase I - Data Preparation & Visualization

Phase II - Pentaho MapReduce & Orchestration

Summary

The Big Data Fabric

[Diagram: the Big Data Fabric]

Layer labels: Big Analytics, Big Data Mgmt, Data Integration

Component labels:

• Visualization, Interactive Analysis, Dashboards, Reports
• R, 3rd Party BI Tools, Applications
• Hadoop, NoSQL Databases, Analytic Databases
• Pentaho Business Analytics, 3rd Party Tools
• Data Integration: Job Orchestration, Workflow, Scheduling, High Performance, Visual IDE


Preparing Big Data for Analysis is an Everyday Challenge

• Very technical skills required
• Divide between M-R developers & analysts
• Beyond the reach of many organizations


Pentaho Visual MapReduce

Accessible by any ETL developer, business analyst or data scientist

Executes inside Hadoop as a native Java MapReduce task


Pentaho Reporting & Analytics

Hadoop NoSQL Hybrid

Data Visualization, Discovery and Analysis

Batch Reporting and Ad Hoc Query


Use Case Scenario – Call Volume Analysis

• VOIP service provider has excess capacity and is considering expansion to consumer markets

• Business Analyst: what are the top 10 states for inbound calls on Fridays, Saturdays and Sundays?

• Research data available:
– Call records – date/timestamp & destination phone #
– NANP (North American Numbering Plan) data – area code by country, state & time zone


Solution Requirements

• Data Preparation
– Access the call records in HDFS
– Extract the destination area code for each call
– Read the area code reference data
– Look up country, state and time zone by area code, append to each record
– Filter out records (non-U.S. calls, calls made Monday through Thursday)
– Load to a relational database
– Generate metadata

• Analysis
– Explore data multi-dimensionally
– Find the top-10 states by inbound call volume
– Navigate via a geospatial interface

• Deployment
– Deploy in MapReduce to handle larger data volumes
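The data preparation and top-10 steps listed above can be sketched in plain Python. This is only an illustration of the logic, not the Pentaho implementation: the record layout (`timestamp,phone` with 10-digit numbers), the `NANP` dictionary contents, and the function names `prepare` and `top_states` are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime

# Hypothetical NANP reference rows: area code -> (country, state, time zone).
NANP = {
    "415": ("US", "CA", "Pacific"),
    "212": ("US", "NY", "Eastern"),
    "416": ("CA", "ON", "Eastern"),
}

# datetime.weekday() numbering: Friday=4, Saturday=5, Sunday=6.
WEEKEND_DAYS = {4, 5, 6}


def prepare(lines):
    """Yield enriched records for U.S. calls placed Friday through Sunday."""
    for line in lines:
        stamp, phone = line.strip().split(",")
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        area_code = phone[:3]          # assumes 10-digit destination numbers
        ref = NANP.get(area_code)
        if ref is None:
            continue                   # area code missing from reference data
        country, state, time_zone = ref
        if country != "US" or when.weekday() not in WEEKEND_DAYS:
            continue                   # drop non-U.S. and Mon-Thu calls
        yield {"timestamp": when, "phone": phone,
               "state": state, "time_zone": time_zone}


def top_states(records, n=10):
    """Return the n states with the most calls as (state, count) pairs."""
    return Counter(r["state"] for r in records).most_common(n)


# Made-up sample records; 2012-06-01 was a Friday, 2012-06-04 a Monday.
sample = [
    "2012-06-01 10:00:00,4155550100",
    "2012-06-02 11:00:00,4155550101",
    "2012-06-03 09:30:00,2125550102",
    "2012-06-04 09:30:00,2125550103",
    "2012-06-02 12:00:00,4165550104",
]
ranking = top_states(prepare(sample))
```

In the deck the prepared records are loaded into a relational database and explored there; the in-memory `Counter` stands in for that step only to keep the sketch self-contained.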


Solution Workflow

• Phase I - Business Analysts
– Use a data extract to prepare and validate their analyses
– Iterate over requirements with executives and stakeholders

• Phase II - MapReduce Developers/Analysts
– Create production Pentaho MapReduce transformations
– Manage the deployment and orchestration between the Hadoop cluster and the production database


Data Preparation (Phase I)

• The data pipeline implements the data preparation logic
• Each component has a “personality” – access, calculate, join, filter …
• Free-form design
– As many or as few inputs, transformations and outputs as needed
• Schema contract exists only between connected components
• Pipelined, multi-threaded for performance
• 100% Java-based for deployment flexibility
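The component idea can be illustrated with a small generator pipeline in Python. This is a sketch of the concept only: the product described here is Java-based and multi-threaded, whereas this example is single-threaded, and every name in it (`source`, `calculate`, `lookup`, `keep`, `pipeline`) is hypothetical. Dropping rows with no reference match in `lookup` is a choice made for the example, not a statement about the product's behavior.

```python
from functools import reduce


def source(rows):
    """Access component: emit each input row as a dict."""
    for row in rows:
        yield dict(row)


def calculate(field, fn):
    """Calculate component: add a derived field to every row."""
    def stage(rows):
        for row in rows:
            yield {**row, field: fn(row)}
    return stage


def lookup(key, table, fields):
    """Join component: append reference fields; rows with no match are dropped."""
    def stage(rows):
        for row in rows:
            match = table.get(row[key])
            if match is not None:
                yield {**row, **dict(zip(fields, match))}
    return stage


def keep(predicate):
    """Filter component: pass only rows that satisfy the predicate."""
    def stage(rows):
        return (row for row in rows if predicate(row))
    return stage


def pipeline(rows, *stages):
    """Connect components in order; rows stream through lazily."""
    return reduce(lambda stream, stage: stage(stream), stages, source(rows))


# Made-up inputs for illustration.
calls = [{"phone": "4155550100"}, {"phone": "4165550104"}]
reference = {"415": ("US", "CA"), "416": ("CA", "ON")}

out = list(pipeline(
    calls,
    calculate("area_code", lambda r: r["phone"][:3]),
    lookup("area_code", reference, ("country", "state")),
    keep(lambda r: r["country"] == "US"),
))
```

Each stage depends only on the fields produced by the stage feeding it, which mirrors the point that the schema contract exists only between connected components.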


Data Pipeline – Input from HDFS


Data Pipeline - Calculator


Data Pipeline – Stream Lookup


Data Pipeline – Row Filter


Data Pipeline – Table Output


Visualization – Multi-Dimensional UX


Visualization – Geographic


Visualization - Heatmap


Deployment to Hadoop (Phase II)

• To process a larger set of data we can deploy the data pipeline via MapReduce
– Input and output streams are encoded in key-value pairs
– Two specialized components provide an interface
– A special job component deploys the data pipeline to the Hadoop cluster
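To make the key-value encoding concrete, here is a minimal single-process simulation of a map, shuffle and reduce pass that counts calls per state. It does not use Hadoop or any Pentaho component; the `state,weekday` value layout and the names `mapper`, `reducer` and `run_local` are assumptions chosen for the example.

```python
from collections import defaultdict


def mapper(key, value):
    """Map: value is an enriched record 'state,weekday'; emit (state, 1)."""
    state, _weekday = value.split(",")
    yield state, 1


def reducer(key, values):
    """Reduce: sum the counts collected for one state."""
    yield key, sum(values)


def run_local(records):
    """Simulate map, shuffle and reduce in one process for illustration."""
    groups = defaultdict(list)
    for offset, line in enumerate(records):
        for k, v in mapper(offset, line):
            groups[k].append(v)        # shuffle: group mapped values by key
    results = {}
    for k in sorted(groups):
        for out_key, out_value in reducer(k, groups[k]):
            results[out_key] = out_value
    return results


# Made-up records for illustration.
counts = run_local(["CA,Fri", "NY,Sat", "CA,Sun"])
```

On a real cluster the framework performs the grouping between the two functions; `run_local` exists only so the example can be run without one.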


Pentaho MapReduce – Inputs/Outputs


The core logic of the data pipeline is identical … only the ends change
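One way to picture "only the ends change" is a core function that is wrapped by two different input/output adapters. The sketch below is a hypothetical Python illustration, not how Pentaho MapReduce is built; the `state,country` line layout and the names `core`, `run_from_file` and `run_as_mapper` are invented for the example.

```python
import io

FIELDS = ("state", "country")


def core(rows):
    """Core logic shared by both deployments: keep U.S. rows only."""
    return (row for row in rows if row["country"] == "US")


def run_from_file(handle):
    """Phase I ends: read 'state,country' lines from a file-like object."""
    rows = (dict(zip(FIELDS, line.strip().split(","))) for line in handle)
    return [row["state"] for row in core(rows)]


def run_as_mapper(pairs):
    """Phase II ends: consume (key, value) pairs and emit (state, 1) pairs."""
    rows = (dict(zip(FIELDS, value.split(","))) for _key, value in pairs)
    return [(row["state"], 1) for row in core(rows)]


# Made-up inputs: the same two records presented through each pair of ends.
from_file = run_from_file(io.StringIO("CA,US\nON,CA\n"))
from_pairs = run_as_mapper([(0, "CA,US"), (1, "ON,CA")])
```

The filter in `core` is untouched between the two runs; only the code that reads the input and shapes the output differs.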


Pentaho MapReduce – Orchestration


Instant Analytics (Roadmap)

[Diagram: three numbered steps]

• Choose a Big Data Source, Answer a Few Questions, Publish to Pentaho
• Report, Explore and Analyze
• Customize Model (Optional)


SUMMARY

1. The Big Data Fabric encompasses a large collection of Hadoop distributions, NoSQL and analytical databases

2. A component-based approach to data access and integration can:

– Allow business analysts and data scientists to perform their own data preparation

– Result in more rapid validation of business requirements & metrics

– Be used to create data pipelines that can be deployed directly to a cluster, enabling analytics against much larger data sets

– Support orchestration across environments


Thank You

Join the conversation. You can find us on:

http://blog.pentaho.com

@Pentaho

Facebook.com/Pentaho

Pentaho Business Analytics