bay area hadoop user group
Accelerated Analytics for the Big Data Fabric
Bay Area Hadoop User Group
© 2012, Pentaho. All Rights Reserved. pentaho.com. Worldwide +1 (866) 660-7555
AGENDA
The Big Data Fabric
Big Data Preparation – An Everyday Challenge
Use-Case Scenario – Call Volume Analysis
Solution Requirements
Solution Workflow
Phase I - Data Preparation & Visualization
Phase II - Pentaho MapReduce & Orchestration
Summary
The Big Data Fabric
[Diagram: the Big Data Fabric, with layers labeled Big Analytics, Big Data Mgmt and Data Integration. Other labels in the diagram: Visualization, Interactive Analysis, Dashboards, Reports, R, 3rd Party BI Tools, Applications; Hadoop, NoSQL Databases, Analytic Databases; Pentaho Business Analytics, 3rd Party Tools; Data Integration, Job Orchestration, Workflow, Scheduling, High Performance, Visual IDE.]
Preparing Big Data for Analysis is an Everyday Challenge
• Very technical skills required
• Divide between M-R developers & analysts
• Beyond the reach of many organizations
Pentaho Visual MapReduce
Accessible by any ETL developer, business analyst or data scientist
Executes inside Hadoop as a native Java MapReduce task
Pentaho Reporting & Analytics
Hadoop NoSQL Hybrid
Data Visualization, Discovery and Analysis
Batch Reporting and Ad Hoc Query
Use Case Scenario – Call Volume Analysis
• VOIP service provider has excess capacity and is considering expansion to consumer markets
• Business Analyst: what are the top 10 states for inbound calls on Fridays, Saturdays and Sundays?
• Research data available:
– Call records – date/timestamp & destination phone #
– NANP (North American Numbering Plan) data – area code by country, state & time zone
Solution Requirements
• Data Preparation
– Access the call records in HDFS
– Extract the destination area code for each call
– Read the area code reference data
– Look up country, state and time zone by area code, append to each record
– Filter out records (non-U.S. calls, calls made on M-Tu-W-Th)
– Load to a relational database
– Generate metadata
• Analysis
– Explore data multi-dimensionally
– Find the top-10 states by inbound call volume
– Navigate via a geospatial interface
• Deployment
– Deploy in MapReduce to handle larger data volumes
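The data preparation and analysis steps listed above can be sketched in plain Python. This is an illustration under stated assumptions, not the Pentaho implementation: the record layout (an ISO timestamp plus a 10-digit destination number), the three-entry reference table and all function names are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Hypothetical slice of NANP reference data: area code -> (country, state, time zone).
AREA_CODES = {
    "415": ("US", "CA", "Pacific"),
    "212": ("US", "NY", "Eastern"),
    "416": ("CA", "ON", "Eastern"),
}

WEEKEND_DAYS = {4, 5, 6}  # datetime.weekday(): Friday=4, Saturday=5, Sunday=6


def prepare(records):
    """Yield enriched rows for U.S. calls placed Friday through Sunday.

    Each input record is assumed to be (iso_timestamp, ten_digit_number).
    """
    for timestamp, number in records:
        area_code = number[:3]                   # extract the destination area code
        ref = AREA_CODES.get(area_code)          # look up country, state, time zone
        if ref is None:
            continue                             # unknown area code: drop the row
        country, state, tz = ref
        day = datetime.fromisoformat(timestamp).weekday()
        if country != "US" or day not in WEEKEND_DAYS:
            continue                             # filter non-U.S. and Mon-Thu calls
        yield {"timestamp": timestamp, "state": state, "time_zone": tz}


def top_states(rows, n=10):
    """Return the n states with the most calls as (state, count) pairs."""
    return Counter(row["state"] for row in rows).most_common(n)


calls = [
    ("2012-06-01T09:15:00", "4155550100"),  # Friday, CA
    ("2012-06-02T18:30:00", "4155550101"),  # Saturday, CA
    ("2012-06-03T11:00:00", "2125550102"),  # Sunday, NY
    ("2012-06-04T08:00:00", "2125550103"),  # Monday, filtered out
    ("2012-06-02T12:00:00", "4165550104"),  # Saturday, Canada, filtered out
]
print(top_states(prepare(calls)))  # [('CA', 2), ('NY', 1)]
```

Loading to a relational database and generating metadata are left out of the sketch; in the deck those steps are handled by the Table Output component and the analysis tooling.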
Solution Workflow
• Phase I - Business Analysts
– Use a data extract to prepare and validate their analyses
– Iterate over requirements with executives and stakeholders
• Phase II - MapReduce Developers/Analysts
– Create production Pentaho MapReduce transformations
– Manage the deployment and orchestration between the Hadoop cluster and the production database
Data Preparation (Phase I)
• The data pipeline implements the data preparation logic
• Each component has a “personality”
– access, calculate, join, filter …
• Free-form design
– As many or as few inputs, transformations and outputs as needed
• Schema contract exists only between connected components
• Pipelined, multi-threaded for performance
• 100% Java-based for deployment flexibility
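One way to picture the component model described above is a chain of Python generators, where each step consumes rows from the previous step and only has to agree on fields with its direct neighbor. This is a conceptual sketch, not Pentaho's engine or API: it runs in a single thread, and the step names and row fields are hypothetical.

```python
def calculator(rows):
    """'Calculate' step: add a derived field to each row."""
    for row in rows:
        yield {**row, "area_code": row["phone"][:3]}


def row_filter(rows, allowed):
    """'Filter' step: keep only rows whose area code is in `allowed`."""
    for row in rows:
        if row["area_code"] in allowed:
            yield row


def run_pipeline(source, *steps):
    """Connect steps in order; rows stream through one at a time."""
    stream = iter(source)
    for step in steps:
        stream = step(stream)
    return list(stream)


rows = [{"phone": "4155550100"}, {"phone": "9995550101"}]
result = run_pipeline(
    rows,
    calculator,
    lambda stream: row_filter(stream, allowed={"415"}),
)
print(result)  # [{'phone': '4155550100', 'area_code': '415'}]
```

Here `row_filter` depends on the `area_code` field that `calculator` adds, which mirrors the idea that a schema contract exists only between connected components.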
Data Pipeline – Input from HDFS
Data Pipeline - Calculator
Data Pipeline – Stream Lookup
Data Pipeline – Row Filter
Data Pipeline – Table Output
Visualization – Multi-Dimensional UX
Visualization – Geographic
Visualization - Heatmap
Deployment to Hadoop (Phase II)
• To process a larger set of data we can deploy the data pipeline via MapReduce
– Input and output streams are encoded in key-value pairs
– Two specialized components provide an interface:
– A special job component deploys the data pipeline to the Hadoop cluster:
Pentaho MapReduce – Inputs/Outputs
The core logic of the data pipeline is identical … only the ends change
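The point that only the ends of the pipeline change can be illustrated with a local, in-memory imitation of map and reduce in Python. Nothing here uses Hadoop or Pentaho APIs; the function names, the tab-separated input format and the sample lines are assumptions made for the sketch.

```python
from collections import defaultdict


def core_logic(line):
    """Stand-in for the unchanged pipeline body: parse a line, return the state or None."""
    timestamp, state, country = line.split("\t")
    return state if country == "US" else None


def mapper(key, value):
    """Input end: receives a (key, value) pair and emits (state, 1) pairs."""
    state = core_logic(value)
    if state is not None:
        yield state, 1


def reducer(key, values):
    """Output end: sums the counts collected for one key."""
    yield key, sum(values)


def run_local(lines):
    """Simulate map, shuffle and reduce in a single process."""
    groups = defaultdict(list)
    for offset, line in enumerate(lines):   # key = line position, value = line text
        for k, v in mapper(offset, line):
            groups[k].append(v)             # shuffle: group values by key
    out = {}
    for k in sorted(groups):
        for rk, rv in reducer(k, groups[k]):
            out[rk] = rv
    return out


lines = [
    "2012-06-01T09:15:00\tCA\tUS",
    "2012-06-02T18:30:00\tCA\tUS",
    "2012-06-03T11:00:00\tNY\tUS",
    "2012-06-02T12:00:00\tON\tCA",
]
print(run_local(lines))  # {'CA': 2, 'NY': 1}
```

In this sketch `core_logic` never sees keys or values; only `mapper` and `reducer` deal with the key-value encoding, which is the role the slide assigns to the specialized input and output components.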
Pentaho MapReduce – Orchestration
Instant Analytics (Roadmap)
• Choose a Big Data Source, Answer a Few Questions, Publish to Pentaho
• Report, Explore and Analyze
• Customize Model (Optional)
SUMMARY
1. The Big Data Fabric encompasses a large collection of Hadoop distributions, NoSQL and analytical databases
2. A component-based approach to data access and integration can:
– Allow business analysts and data scientists to perform their own data preparation
– Result in more rapid validation of business requirements & metrics
– Be used to create data pipelines that can be deployed directly to a cluster, enabling analytics against much larger data sets
– Support orchestration across environments