solving big data problems using hortonworks


Page 1: Solving Big Data Problems using Hortonworks

Solving Big Data Problems using Hortonworks

© Hortonworks Inc. 2011 – 2015. All Rights Reserved

Page 2: Solving Big Data Problems using Hortonworks

Hortonworks Company Profile

• 100% open source Apache Hadoop data platform
• Founded in 2011
• 1st Hadoop provider to go public: IPO 4Q14 (NASDAQ: HDP)
• 800+ employees across 17 countries
• 1,350 technology partners
• Fastest company to reach $100M in revenue

Page 3: Solving Big Data Problems using Hortonworks

Let’s talk about Big Data

September 2014 survey of 100 CIOs from the US and Europe

Page 4: Solving Big Data Problems using Hortonworks

What problems and opportunities does Big Data create?

The Opportunity: Unlock transformational business value from a full fidelity of data and analytics, for all data.

New Data Sources (data that traditional platforms cannot handle): Geolocation, Server logs, Files & emails, Sensors and machines, Clickstream, Social media

Traditional Data Sources: ERP, CRM, SCM

Page 5: Solving Big Data Problems using Hortonworks

The Future of Data: Actionable Intelligence

[Diagram: Data in Motion flowing from the Internet of Anything into Data at Rest, stored across storage groups 1-4]

Page 6: Solving Big Data Problems using Hortonworks

Hortonworks Data Platform


Batch Interactive Search Streaming Machine Learning

YARN Resource Management System

Sources: Clickstream, Sensor, Social, Mobile, Geolocation, Server Log, Existing

Page 7: Solving Big Data Problems using Hortonworks

HDP is a collection of Apache Projects

HORTONWORKS DATA PLATFORM

Hadoop & YARN · Flume · Oozie · Pig · Hive · Tez · Sqoop · Cloudbreak · Ambari · Slider · Kafka · Knox · Solr · ZooKeeper · Spark · Falcon · Ranger · HBase · Atlas · Accumulo · Storm · Phoenix

Categories: DATA MGMT · DATA ACCESS · GOVERNANCE & INTEGRATION · OPERATIONS · SECURITY

Releases: HDP 2.0 (Oct 2013) · HDP 2.1 (April 2014) · HDP 2.2 (Dec 2014) · HDP 2.3 (July 2015)

Ongoing Innovation in Apache

[Per-release component version matrix not legible in transcript]

Page 8: Solving Big Data Problems using Hortonworks

Hortonworks Data Flow

Visual User Interface: Drag and drop for efficient, agile operations

Immediate Feedback: Start, stop, tune, replay dataflows in real-time

Adaptive to Volume and Bandwidth: Any data, big or small

Event Level Data Provenance: Governance, compliance & data evaluation

Secure Data Acquisition & Transport: Fine-grained encryption for controlled data sharing and selective data democratization

Powered by Apache NiFi

Page 9: Solving Big Data Problems using Hortonworks

HDF and HDP Deliver a Complete Big Data Solution

• HDF dynamically connects HDP to data at the edge

• HDF secures and encrypts the movement of data into HDP

• HDF includes mature IoAT data protocols that improve device extensibility

• HDF supports easily adjustable bi-directional IoAT dataflows

• HDF offers traceability of IoAT data with lineage and audit trails

• HDF brings a real-time, visual user interface to manipulate live dataflows

Page 10: Solving Big Data Problems using Hortonworks


Hortonworks Revenue Model

HDP and HDF are 100% free and open source; there is no license fee. Our customers subscribe to support, consulting experts and training programs.

Annual Subscriptions: align your success with ours

Expert Consulting & Training: help your team get to actionable intelligence as efficiently as possible

[Diagram: Architect & Develop, Deploy, Operate, Expand, repeated across Projects 1-6]

Page 11: Solving Big Data Problems using Hortonworks

Sales Plays

Page 12: Solving Big Data Problems using Hortonworks

Hadoop Driver: Cost optimization

Archive Data off EDW: Move rarely used data to Hadoop as an active archive; store more data longer

Offload costly ETL process: Free your EDW to perform high-value functions like analytics & operations, not ETL

Enrich the value of your EDW: Use Hadoop to refine new data sources, such as web and machine data, for new analytical context

Analytics: Data Marts, Business Analytics, Visualization & Dashboards

HDP helps you reduce costs and optimize the value associated with your EDW

[Diagram: an HDP 2.3 cluster (nodes 1-N) receives ELT from the sources below and holds cold data, a deeper archive & new sources alongside the Enterprise Data Warehouse (Hot, MPP, In-Memory)]

Sources: Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured; Existing Systems: ERP, CRM, SCM

Page 13: Solving Big Data Problems using Hortonworks

Single View: Improve acquisition and retention

Predictive Analytics: Identify your next best action

Data Discovery: Uncover new findings

Financial Services: New Account Risk Screens · Trading Risk · Insurance Underwriting · Improved Customer Service · Aggregate Banking Data as a Service · Cross-sell & Upsell of Financial Products · Risk Analysis for Usage-Based Car Insurance · Identify Claims Errors for Reimbursement

Telecom: Unified Household View of the Customer · Searchable Data for NPTB Recommendations · Protect Customer Data from Employee Misuse · Analyze Call Center Contact Records · Network Infrastructure Capacity Planning · Call Detail Records (CDR) Analysis · Inferred Demographics for Improved Targeting · Proactive Maintenance on Transmission Equipment · Tiered Service for High-Value Customers

Retail: 360° View of the Customer · Supply Chain Optimization · Website Optimization for Path to Purchase · Localized, Personalized Promotions · A/B Testing for Online Advertisements · Data-Driven Pricing and Improved Loyalty Programs · Customer Segmentation · Personalized, Real-time Offers · In-Store Shopper Behavior

Manufacturing: Supply Chain and Logistics · Optimize Warehouse Inventory Levels · Product Insight from Electronic Usage Data · Assembly Line Quality Assurance · Proactive Equipment Maintenance · Crowdsourced Quality Assurance · Single View of a Product Throughout Lifecycle · Connected Car Data for Ongoing Innovation · Improve Manufacturing Yields

Healthcare: Electronic Medical Records · Monitor Patient Vitals in Real-Time · Use Genomic Data in Medical Trials · Improving Lifelong Care for Epilepsy · Rapid Stroke Detection and Intervention · Monitor Medical Supply Chain to Reduce Waste · Reduce Patient Re-Admittance Rates · Video Analysis for Surgical Decision Support · Healthcare Analytics as a Service

Oil & Gas: Unify Exploration & Production Data · Monitor Rig Safety in Real-Time · Geographic Exploration · DCA to Slow Well Decline Curves · Proactive Maintenance for Oil Field Equipment · Define Operational Set Points for Wells

Government: Single View of Entity · CBM & Autonomic Logistic Analysis · Sentiment Analysis on Program Effectiveness · Prevent Fraud, Waste and Abuse · Proactive Maintenance for Public Infrastructure · Meet Deadlines for Government Reporting

Hadoop Driver: Advanced analytic applications

Page 14: Solving Big Data Problems using Hortonworks

NiFi and HDF Drivers

Optimize Splunk: Reduce costs by pre-filtering data so that only relevant content is forwarded into Splunk

Ingest Logs for Cyber Security: Integrated and secure log collection for real-time data analytics and threat detection

Feed Data to Streaming Analytics: Accelerate big data ROI by streaming data into analytics systems such as Apache Storm or Apache Spark Streaming

Move Data Internally: Optimize resource utilization by moving data between data centers or between on-premises infrastructure and cloud infrastructure

Capture IoT Data: Transport disparate and often remote IoT data in real time, despite any limitations in device footprint, power or connectivity, avoiding data loss

Page 15: Solving Big Data Problems using Hortonworks

Hadoop Driver: Enabling the data lake

Data Lake Definition
• Centralized Architecture: Multiple applications on a shared data set with consistent levels of service
• Any App, Any Data: Multiple applications accessing all data, affording new insights and opportunities
• Unlocks 'Systems of Insight': Advanced algorithms and applications used to derive new value and optimize existing value

Drivers: 1. Cost Optimization 2. Advanced Analytic Apps

Goal: a centralized architecture and a data-driven business

Journey to the Data Lake with Hadoop: scale and scope grow together toward Systems of Insight

Page 16: Solving Big Data Problems using Hortonworks

Case Study: 12-month Hadoop evolution at TrueCar

Data platform capabilities, 12-month execution plan:

June 2013: Begin Hadoop Execution
July 2013: Hortonworks Partnership
Aug 2013: Training & Dev Begins
Nov 2013: Production Cluster, 60 Nodes, 2 PB
Dec 2013: Three Production Apps (3 total)
Jan 2014: 40% of Dev Staff Proficient
Feb 2014: Three More Production Apps (6 total)
May 2014: IPO

12-Month Results at TrueCar
• Six Production Hadoop Applications
• Sixty nodes / 2 PB data
• Storage/compute costs from $19/GB to $0.12/GB

"We addressed our data platform capabilities strategically as a pre-cursor to IPO."

Page 17: Solving Big Data Problems using Hortonworks

Hortonworks Data Platform

Page 18: Solving Big Data Problems using Hortonworks

Hadoop emerged as foundation of new data architecture

Apache Hadoop is an open source data platform for managing large volumes of high-velocity, high-variety data.
• Built by Yahoo! to be the heartbeat of its ad & search business
• Donated to the Apache Software Foundation in 2005, with rapid adoption by large web properties & early-adopter enterprises
• Incredibly disruptive to current platform economics

Traditional Hadoop Advantages
• Manages the new data paradigm
• Handles data at scale
• Cost effective
• Open source

Traditional Hadoop Had Limitations
• Batch-only architecture
• Single-purpose clusters, specific data sets
• Difficult to integrate with existing investments
• Not enterprise-grade

Application: Batch Processing (MapReduce) over Storage (HDFS)

Page 19: Solving Big Data Problems using Hortonworks

2006-2009: Hadoop w/ MapReduce. HDFS (Hadoop Distributed File System) with MapReduce, largely batch processing. Silo'd clusters, largely a batch system, difficult to integrate.

Hadoop 2 & YARN based architecture (MR-279: YARN). YARN: Data Operating System over HDFS (nodes 1-N), enabling batch, interactive and real-time access.

Hortonworks architected & led development of YARN to enable the Modern Data Architecture (October 23, 2013).

Page 20: Solving Big Data Problems using Hortonworks

Apache Hadoop – Data Operating System

Shared Compute & Workload Management
• Common data platform, many applications
• Support multi-tenant access & processing
• Batch, interactive & real-time use cases

Common & Shared Scale-Out Storage
• Shared data assets
• Flexible schema
• Cross-workload access

YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System), nodes 1-N

Batch, interactive & real-time data access engines on YARN:
• Script: Pig (on Tez)
• SQL: Hive (on Tez)
• Java/Scala: Cascading (on Tez)
• Stream: Storm
• Search: Solr
• NoSQL: HBase, Accumulo (via Slider)
• In-Memory: Spark
• Others: ISV Engines

Enterprise Hadoop

Page 21: Solving Big Data Problems using Hortonworks

Core Capabilities of Enterprise Hadoop

DATA MANAGEMENT: Store and process all of your corporate data assets

DATA ACCESS: Access your data simultaneously in multiple ways (batch, interactive, real-time)

GOVERNANCE & INTEGRATION: Load data and manage it according to policy

OPERATIONS: Deploy and effectively manage the platform

SECURITY: Provide a layered approach to security through Authentication, Authorization, Accounting, and Data Protection

PRESENTATION & APPLICATION: Enable both existing and new applications to provide value to the organization

ENTERPRISE MGMT & SECURITY: Empower existing operations and security tools to manage Hadoop

DEPLOYMENT OPTIONS: Provide deployment choice across physical, virtual, cloud

Page 22: Solving Big Data Problems using Hortonworks

Hortonworks Data Platform 2.3

YARN: Data Operating System over HDFS (Hadoop Distributed File System), nodes 1-N

GOVERNANCE & INTEGRATION
• Data Workflow: Sqoop, Flume, Kafka, NFS, WebHDFS
• Data Lifecycle & Governance: Falcon, Atlas

DATA ACCESS (on YARN, via Tez and Slider where applicable)
• Batch: MapReduce
• Script: Pig
• SQL: Hive
• Search: Solr
• NoSQL: HBase, Accumulo, Phoenix
• Stream: Storm
• In-Memory: Spark
• Others: ISV Engines

SECURITY (Administration, Authentication, Authorization, Auditing, Data Protection): Ranger, Knox, Atlas, HDFS Encryption

OPERATIONS
• Provisioning, Managing & Monitoring: Ambari, Cloudbreak, ZooKeeper
• Scheduling: Oozie

DATA MANAGEMENT: HDFS

Deployment Choice: Linux, Windows, On-Premise, Cloud

Page 23: Solving Big Data Problems using Hortonworks

Architectures

Page 24: Solving Big Data Problems using Hortonworks

Basic EDW Cost Optimization Architecture

Flow (steps 1-4): (1) Sqoop batch-fetches from the EDW into raw HDFS; (2) Hive transforms raw data into processed external tables; (3) processed data is loaded back to the EDW; (4) existing analytics continue via HiveServer (interactive) and BI tools (reporting).

Page 25: Solving Big Data Problems using Hortonworks

More than saving cost: enrich with new data

Flow (steps 1-6): (1) Sqoop batch-fetches from the EDW into raw HDFS; (2) NiFi streams new sources into raw HDFS; (3) Hive transforms raw data into processed external tables; (4) processed data is loaded back to the EDW; (5) existing analytics continue via HiveServer and BI tools; (6) new analytics run directly on Hadoop.

Page 26: Solving Big Data Problems using Hortonworks

Streaming Solution Architecture

Real-time data feeds flow into Apache Kafka, then into the HDP 2.x data lake (YARN over HDFS) for:
• Real-Time Stream Processing: Storm
• Streaming Ingest to SQL: Hive
• Online Data Processing: HBase, Accumulo
• Search: Solr (via Slider)

Page 27: Solving Big Data Problems using Hortonworks

Key Tenets of Lambda Architecture

§ Batch Layer
• Manages master data: an immutable, append-only set of raw data
• Cleanses, normalizes & pre-computes batch views
• Advanced statistical calculations

§ Speed Layer
• Real-time event stream processing
• Computes real-time (incremental) views

§ Serving Layer
• Low-latency, ad-hoc query
• Reporting, BI & dashboards

New data streams into both the batch layer (store, pre-compute views) and the speed layer (process streams, incremental views); the serving layer merges business views to answer queries.

HDP and HDF

High Level Big Data IoT Architecture

Page 28: Solving Big Data Problems using Hortonworks

IoT on HDP

Problem Statement

Reference Architecture& Sizing

Solution Design& Customer Case Studies

Implementation Plan


Project Cost & ROI

Page 29: Solving Big Data Problems using Hortonworks

www.hortonworks.com

Ms. Brady knows that to get a handle on sky-rocketing premiums, she will need to better understand what is causing the incidents and how to prevent them. Ms. Brady sets the goal of reducing incidents by 5% within 90 days.

Incidents involving maintenance vehicles have continued to increase under COO Brady's watch.

[Chart: Insurance premiums rising 2012-2015, reaching $17.5M]

Ms. Brady tasks her Business Analyst, Tam, with gathering the necessary data to understand the cause of incidents and reduce them.

Business Analyst: Tam

Mega Corp has a problem

Page 30: Solving Big Data Problems using Hortonworks


Given the current premium cost of $3,500 per truck on 5,000 trucks, a 10% reduction in incidents will move the company from the high risk insurance category they are currently in and save the company $1000 on their insurance premium per truck per year or $5,000,000 annually.
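Tam's arithmetic checks out; a minimal verification using the slide's own figures:

```python
# Figures from the slide: 5,000 trucks, $3,500 premium per truck,
# $1,000 savings per truck after leaving the high-risk category.
TRUCKS = 5_000
CURRENT_PREMIUM = 3_500      # USD per truck per year
SAVINGS_PER_TRUCK = 1_000    # USD per truck per year

annual_savings = TRUCKS * SAVINGS_PER_TRUCK          # $5,000,000 per year
share_of_premium = SAVINGS_PER_TRUCK / CURRENT_PREMIUM   # ~29% reduction
```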

Business Analyst: Tam

Page 31: Solving Big Data Problems using Hortonworks

Tam considers four questions she must answer to better understand and mitigate incidents. They are:

1) Is there a correlation of driver training to incidents?

2) Is there a correlation of weather to incidents?

3) Is there a correlation between certain driving behavior and incidents?

4) Is it possible to predict incidents before they occur?

Business Analyst: Tam

Shift from Reactive to Proactive & Prescriptive:
• From break-then-fix to preventative maintenance
• From reaction to human activity to behavioral insight
• From static resource planning to resource optimization

Page 32: Solving Big Data Problems using Hortonworks

Initially, Tam's team is concerned that they may not be able to capture all the necessary data to answer the questions Tam has posed and help her mitigate incidents. They know that the data is not all structured, and some of it is created in real-time and transmitted over the Internet. In addition, some data will have to be captured from external sources.

Vehicle Data

Route Data

Weather Data

Structured Driver Data

Semi-Structured Maintenance Data

Sue, Varun and Jeff

Page 33: Solving Big Data Problems using Hortonworks

The Team Recognizes the Current Data Architecture Limits Predictive Capabilities

1. Data Silos: difficult to find predictive correlations
2. Data Volumes: cannot store enough data to find patterns
3. New Data Sources: unable to capture and use new data for real-time analysis

Data Systems: Enterprise Data Warehouse (Hot, MPP, In-Memory); Systems of Record (RDBMS, ERP, CRM); new sources (Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured)

Analytics: Data Marts, Business Analytics, Visualization & Dashboards

Page 34: Solving Big Data Problems using Hortonworks

The Team Leverages HDF & HDP to Expand the Capabilities of Their Existing Data Platform

Data Systems: Enterprise Data Warehouse (Hot, MPP, In-Memory); Systems of Record (RDBMS, ERP, CRM)

Analytics: Data Marts, Business Analytics, Visualization & Dashboards

Page 35: Solving Big Data Problems using Hortonworks

Training paths with Hortonworks University:
• Business Analyst (Tam) + HDP Data Analyst Training = HDP Data Analyst
• Developer (Varun) + Developer Training = HDP Developer
• System Admin (Sue) + HDP System Admin Training = HDP Sys Admin
• SME (Jeff) + Data Science Training = HDP Data Scientist

The team engages their favorite SI and attends Hortonworks University training to get the project under way.

Page 36: Solving Big Data Problems using Hortonworks


IoT on HDP

Problem Statement

Reference Architecture& Sizing

Solution Design& Customer Case Studies

Implementation Plan


Project Cost & ROI

Page 37: Solving Big Data Problems using Hortonworks

Solution Architecture: a single cluster with consistent security, governance & operations.

• Collect, Conduct & Curate: HDF (bidirectional data flow) from truck sensors and weather data; Sqoop from the EDW
• Distributed Storage: HDFS
• Many Workloads: YARN
• Stream Processing & Modeling: Kafka, Storm & Spark
• Real-time Serving & Searching: HBase, feeding alerts & events to a real-time web app
• Interactive Query: SQL via Hive on Tez

The chosen solution provides the company with the foundation to capture all the required data, analyze correlations, and ultimately create a model that allows them to predict and mitigate incidents before they happen.

Page 38: Solving Big Data Problems using Hortonworks

Tam and Varun build the application

Tam (HDP Analyst) and Varun (Developer)

Page 39: Solving Big Data Problems using Hortonworks

Ms. Brady is happy with the results. She is able to determine that a subset of drivers is responsible for the increased cost. But like most managers, she is not happy for long. Now she wants to be able to predict future incidents.

Jeff, the Data Scientist, points out that HDP has a tremendous statistical algorithm library for machine learning, and he can use these libraries to predict which drivers are likely to have an event before it occurs.

Page 40: Solving Big Data Problems using Hortonworks

Jeff implements the predicted-violations logic using HDP machine learning and is able to predict events before they happen.

Page 41: Solving Big Data Problems using Hortonworks

Ms. Brady is happy now that she can isolate where problems exist, identify causal events, and build models that help predict events before they occur.

Page 42: Solving Big Data Problems using Hortonworks


< TODO: Show St. Louis Case Study >

http://hortonworks.com/blog/st-louis-buses-run-with-lhp-telematics-and-hortonworks/

Page 43: Solving Big Data Problems using Hortonworks


IoT on HDP

Problem Statement

Reference Architecture& Sizing

Solution Design& Customer Case Studies

Implementation Plan


Project Cost & ROI

Page 44: Solving Big Data Problems using Hortonworks


Big Data Functional Architecture: Key Tenets of Lambda Architecture

§ Batch Layer
• Manages master data: an immutable, append-only set of raw data
• Cleanses, normalizes & pre-computes batch views
• Advanced statistical calculations

§ Speed Layer
• Real-time event stream processing
• Computes real-time (incremental) views

§ Serving Layer
• Low-latency, ad-hoc query
• Reporting, BI & dashboards

New data streams into both the batch layer (store, pre-compute views) and the speed layer (process streams, incremental views); the serving layer merges business views to answer queries.

HDP and HDF

High Level Big Data IoT Architecture

Page 45: Solving Big Data Problems using Hortonworks


Detailed Reference Architecture for IoT Applications

• High Speed Ingest: source data (server logs, application logs, firewall logs, CRM/ERP, sensors) streams to HDF/Flume and Kafka, which forward to Storm
• Real-Time: Storm (or Spark Streaming) enriches events via HBase lookup, raises JMS alerts, and bolts data to HDFS; real-time storage in HBase/Phoenix; dashboards via Silk
• Batch: Flume sinks to HDFS and Sqoop loads reference data; Pig transforms; iterative machine learning models are built with Spark-ML
• Interactive: Hive (HiveServer) and Spark-Thrift serve reporting and BI tools through a UI framework

Page 46: Solving Big Data Problems using Hortonworks


Sample Ingest: NiFi

Page 47: Solving Big Data Problems using Hortonworks


Apache Storm – Key Attributes

Open source, real-time event stream processing platform that provides fixed, continuous & low-latency processing for very high-frequency streaming data.

• Highly scalable: horizontally scalable like Hadoop; e.g. a 10-node cluster can process 1M tuples per second
• Fault-tolerant: automatically reassigns tasks on failed nodes
• Guarantees processing: supports at-least-once & exactly-once processing semantics
• Language agnostic: processing logic can be defined in any language
• Apache project: brand, governance & a large, active community

Page 48: Solving Big Data Problems using Hortonworks


Storm - Basic Concepts

• Spouts: generate streams
• Tuple: the most fundamental data structure; a named list of values of any datatype
• Streams: groups of tuples
• Bolts: contain data processing, persistence and alerting logic; can also emit tuples for downstream bolts
• Tuple Tree: the first spout tuple and all the tuples emitted by the bolts that processed it
• Topology: a group of spouts and bolts wired together into a workflow
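The spout/bolt/topology concepts can be mimicked with plain generators. This is a toy model for intuition, not the Storm API; the truck readings and names are made up.

```python
def sensor_spout():
    """Spout: generates a stream of tuples (named lists of values)."""
    for truck_id, speed in [("truck-1", 72), ("truck-2", 95), ("truck-1", 88)]:
        yield {"truck_id": truck_id, "speed": speed}

def speeding_bolt(stream, limit=80):
    """Bolt: processing logic; emits tuples for a downstream bolt."""
    for t in stream:
        if t["speed"] > limit:
            yield {**t, "violation": True}

def alert_bolt(stream):
    """Terminal bolt: alerting logic (in Storm this would persist or notify)."""
    return [f"ALERT {t['truck_id']} at {t['speed']}" for t in stream]

# Topology: spout -> speeding_bolt -> alert_bolt wired into a workflow.
alerts = alert_bolt(speeding_bolt(sensor_spout()))
```

In real Storm the topology runs continuously and in parallel; here the chain of generators simply drains once.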

Page 49: Solving Big Data Problems using Hortonworks


Distributed Database With Apache HBase

100% Open Source · Store and Process Petabytes of Data · Flexible Schema · Scale out on Commodity Servers · High Performance, High Availability · Integrated with YARN · SQL and NoSQL Interfaces

HBase RegionServers run on YARN (Data Operating System) across nodes 1-N, with HDFS as permanent data storage.

Dynamic schema · Scales horizontally to PB of data · Directly integrated with Hadoop (HDP)

Page 50: Solving Big Data Problems using Hortonworks


Apache Phoenix – Relational Database Layer Over HBase

A SQL Skin for HBase
• Provides a SQL interface for managing data in HBase
• Supports a large subset of the SQL:1999 mandatory feature set
• Create tables, insert and update data, and perform low-latency point lookups through JDBC
• The Phoenix JDBC driver is easily embeddable in any app that supports JDBC

Phoenix Makes HBase Better
• Oriented toward online / transactional apps
• If HBase is a good fit for your app, Phoenix makes it even better
• Phoenix gets you out of the "one table per query" model many other NoSQL stores force you into

Page 51: Solving Big Data Problems using Hortonworks


In-Memory With Spark

Libraries: Spark SQL, Spark Streaming, MLlib, GraphX

§ A data access engine for fast, large-scale data processing

§ Designed for iterative in-memory computations and interactive data mining

§ Provides expressive multi-language APIs for Scala, Java and Python

Page 52: Solving Big Data Problems using Hortonworks


Spark ML for machine learning

Democratizes Machine Learning

Unsupervised tasks
• Clustering: K-means
• Recommendation / Collaborative Filtering: alternating least squares
• Dimensionality reduction: PCA, SVD

Supervised tasks
• Classification: Naïve Bayes, Decision Tree, Random Forest, Gradient-boosted trees
• Regression: linear models (SVM, linear regression, logistic regression)

Page 53: Solving Big Data Problems using Hortonworks


Apache Hive: SQL in Hadoop

• Created by a team at Facebook
• Provides a standard SQL interface to data stored in Hadoop: quickly analyze data in raw data files; proven at petabyte scale
• Compatible with all major BI tools such as Tableau, Excel, MicroStrategy, Business Objects, etc.

[Diagram: SQL queries over sensor, mobile, weblog and operational/MPP data]

Page 54: Solving Big Data Problems using Hortonworks


Comparing SQL Options In HDP

Apache Hive
  Strengths: Most comprehensive SQL; scale; maturity
  Use cases: ETL offload; reporting; large-scale aggregations
  Unique capabilities: Robust cost-based optimizer; mature ecosystem (BI, backup, security and replication)

Spark SQL
  Strengths: In-memory; low latency
  Use cases: Exploratory analytics; dashboards
  Unique capabilities: Language-integrated query

Apache Phoenix
  Strengths: Real-time read/write; transactions; high concurrency
  Use cases: Dashboards; systems of engagement; drill-down/drill-up
  Unique capabilities: Real-time read/write

Page 55: Solving Big Data Problems using Hortonworks


Comparing Streaming Options In HDP

Apache Storm
  Processing: one tuple at a time (low latency)
  Operates on: tuple streams
  Delivery: at least once (Trident for exactly once)
  Multiple language support

Spark Streaming
  Processing: micro-batch, minimum batch latency 500 ms (higher throughput)
  Operates on: streams of tuple batches
  Delivery: exactly once
  Multiple language support

Page 56: Solving Big Data Problems using Hortonworks


Sizing

Page 57: Solving Big Data Problems using Hortonworks


HDF Sizing & Best Practices: Sustained Throughput

For sustained throughput of 50 MB/s and thousands of events per second:
• 1-2 nodes
• 8+ cores per node (more is better)
• 6+ disks per node (SSD or spinning)
• 2 GB of memory per node
• 1 Gb bonded NICs ideally

For sustained throughput of 100 MB/s and tens of thousands of events per second:
• 3-4 nodes
• 8+ cores per node (more is better)
• 6+ disks per node (SSD or spinning)
• 2 GB of memory per node
• 1 Gb bonded NICs ideally

For sustained throughput of 200 MB/s and hundreds of thousands of events per second:
• 5-7 nodes
• 24+ cores per node (effective CPUs)
• 12+ disks per node (SSD or spinning)
• 4 GB of memory per node
• 10 Gb bonded NICs

For sustained throughput of 400-500 MB/s and hundreds of thousands of events per second:
• 7-10 nodes
• 24+ cores per node (effective CPUs)
• 12+ disks per node (SSD or spinning)
• 6 GB of memory per node
• 10 Gb bonded NICs

Page 58: Solving Big Data Problems using Hortonworks


Kafka - Sizing & Best Practices

§ Cluster Sizing Rule of Thumb
• 10 MB/s/node or 100,000 events/s/node
• Higher throughput for large batch sizes

§ Configuration Best Practices
• Num of partitions = max(total producer throughput / throughput per partition, total consumer throughput / throughput per partition)
• Over-estimate the number of partitions per topic: you cannot increase the partition count without breaking message-ordering guarantees
• Collocate Kafka and Storm processes: Storm is CPU-bound while Kafka is throughput-bound
• In high-throughput scenarios, separate Kafka and Storm onto independent nodes
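The partition-count rule above, as a small helper. The throughput figures in the example are illustrative, not measurements:

```python
import math

def num_partitions(producer_total, producer_per_partition,
                   consumer_total, consumer_per_partition):
    """Max of the producer-side and consumer-side requirements, rounded up.
    Over-estimate: partitions cannot be added later without breaking
    per-partition ordering guarantees."""
    producer_side = math.ceil(producer_total / producer_per_partition)
    consumer_side = math.ceil(consumer_total / consumer_per_partition)
    return max(producer_side, consumer_side)

# E.g. 100 MB/s total, 10 MB/s per partition on the producer side but only
# 5 MB/s per partition on the consumer side: consumers dominate, 20 partitions.
```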

Page 59: Solving Big Data Problems using Hortonworks


Storm - Sizing & Best Practices

§ Cluster Sizing Rule of Thumb
• 100,000 events per second per supervisor node
• Predicated on the work being performed by the bolt's execute method; mileage will vary by project; testing is critical

§ Configuration Best Practices
• 1 worker / machine / topology
• 1 executor per CPU core
• Topology parallelism = num of machines x (num of cores per machine - 1)
• Distribute total parallelism among spouts and bolts to maximize topology throughput
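The parallelism rule of thumb above is a one-liner; the 5-machine example is illustrative:

```python
def topology_parallelism(machines, cores_per_machine):
    """One executor per core, reserving one core per machine
    (e.g. for the OS and Storm daemons), per the rule above."""
    return machines * (cores_per_machine - 1)

# E.g. 5 supervisor machines with 24 cores each -> 115 executors
# to distribute among spouts and bolts.
```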

Page 60: Solving Big Data Problems using Hortonworks


HBase - Sizing & Best Practices

§ Cluster Sizing Rule of Thumb
• 10 MB/s/node of write throughput
• 1-3 TB per node of compressed, non-replicated data (an HDFS volume of 6-12 TB)
• Sizing = max(required ingestion rate / write throughput per node, total data size / data per node)

§ Configuration Best Practices
• Region server size ~ 10 GB
• Number of regions per region server ~ 100-200
• Pre-split tables
• For IoT scenarios: consider using Hive to store raw data while using Phoenix to store aggregates; batch-insert data to Phoenix using MapReduce; tailor the batch interval to application SLAs

Page 61: Solving Big Data Problems using Hortonworks

Ms. Brady knows that to get a handle on sky-rocketing premiums, she will need to better understand what is causing the incidents and how to prevent them. Ms. Brady sets the goal of reducing incidents by 5% within 90 days.

Incidents involving maintenance vehicles have continued to increase under COO Brady's watch. The Department of Transportation has contacted Mega Corporation.

[Chart: Insurance premiums rising 2012-2015, reaching $17.5M]

Ms. Brady tasks her Business Analyst, Tam, with gathering the necessary data to understand the cause of incidents and reduce them.

Business Analyst: Tam

Problem statement recap

Page 62: Solving Big Data Problems using Hortonworks

www.hortonworks.com

Given the current premium cost of $3,500 per truck on 5,000 trucks, a 10% reduction in incidents will move the company from the high risk insurance category they are currently in and save the company $1000 on their insurance premium per truck per year or $5,000,000 annually.

Business Analyst: Tam

Problem statement recap

Page 63: Solving Big Data Problems using Hortonworks


Sizing - Cluster Storage Requirement

Cluster Storage Requirement = (Effective Capacity × Intermediate Size × Replication Count × Temp Space) / Compression Ratio

Rule of thumb:
§ Replication Count: 3
§ Temp Space: ×1.2

Vary greatly:
§ Intermediate/Materialized: 30–50%
§ Compression Ratio: 2–4
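The formula above in code. The 100 TB raw input is hypothetical; the 1.4× intermediate factor, 3× replication, 1.2× temp space, and 3:1 compression all sit in the ranges the slide quotes:

```python
def cluster_storage(raw_size, intermediate=1.4, replication=3,
                    temp=1.2, compression=3):
    """Cluster storage = raw size x intermediate x replication x temp / compression."""
    return raw_size * intermediate * replication * temp / compression

# Hypothetical: 100 TB raw data, 40% intermediate/materialized data,
# 3x replication, 1.2x temp space, 3:1 compression
print(round(cluster_storage(100), 1))  # 168.0 TB of physical cluster storage
```

Note how replication and temp space inflate the requirement while compression offsets it; the intermediate and compression factors are the ones worth measuring on a real workload.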

Page 64: Solving Big Data Problems using Hortonworks


Data Volume for Mega Corp
§ Number of trucks = 5,000
§ Events per second per truck = 10
§ Size of each event = 128 bytes

§ 1-year raw sensor data storage requirement: 5000 × 10 × 128 × 60 × 60 × 24 × 365 ≈ 200 TB
§ 5-year sensor data storage: 200 TB × 5 × 1.5 (processing overhead) = 1.5 PB

§ Q: How many nodes are needed for storing 1.5PB? (answered later)
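The arithmetic above can be double-checked in a few lines; all inputs are taken straight from the slide:

```python
trucks, events_per_sec, event_bytes = 5000, 10, 128
seconds_per_year = 60 * 60 * 24 * 365

# One year of raw sensor data, in TB (decimal)
raw_year_tb = trucks * events_per_sec * event_bytes * seconds_per_year / 1e12
print(round(raw_year_tb))  # 202 -> the slide's "~200 TB"

# Five years with the 1.5x processing overhead, in PB
five_year_pb = raw_year_tb * 5 * 1.5 / 1000
print(round(five_year_pb, 2))  # 1.51 -> the slide's "1.5 PB"
```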

Page 65: Solving Big Data Problems using Hortonworks


HBase, Kafka, Storm and NiFi Requirements
Ingest rate = 128 bytes × 5,000 trucks × 10 events/s = 6.4 MB/s

Q: For a 6.4 MB/s ingest rate, how many NiFi, Kafka and Storm nodes are needed?

We will store the last 15 days of data in HBase.

HBase storage needed: 5000 × 10 × 60 × 60 × 24 × 15 × 128 bytes ≈ 8.3 TB

Q: How many HBase nodes are needed for 8.3 TB of storage?
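The same back-of-the-envelope check for the ingest rate and the 15-day HBase retention window; all inputs are the slide's:

```python
trucks, events_per_sec, event_bytes = 5000, 10, 128

# Sustained ingest rate across the whole fleet
ingest_mb_per_s = trucks * events_per_sec * event_bytes / 1e6
print(ingest_mb_per_s)  # 6.4 -> megabytes per second, not kilobytes

# 15 days of events retained in HBase, in TB (decimal)
hbase_tb = trucks * events_per_sec * event_bytes * 60 * 60 * 24 * 15 / 1e12
print(round(hbase_tb, 1))  # 8.3 TB
```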

Page 66: Solving Big Data Problems using Hortonworks


Sizing - Number Of Worker Nodes for Sensor Data

§ # of Worker Nodes = Total Cluster Storage / Storage Per Server = 1.5 PB / 48 TB ≈ 32
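In code, using the slide's figures of 1.5 PB total and 48 TB of effective storage per server:

```python
import math

total_cluster_storage_tb = 1500   # 1.5 PB, from the 5-year sizing estimate
storage_per_server_tb = 48        # effective storage per worker node

worker_nodes = math.ceil(total_cluster_storage_tb / storage_per_server_tb)
print(worker_nodes)  # 32
```

Rounding up with `math.ceil` matters here: 31.25 servers means 32 physical machines.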

Page 67: Solving Big Data Problems using Hortonworks


Sizing – NiFi, Kafka, HBase and Storm Nodes

DataNodes & HBase: 32
NiFi: 2
Kafka & Storm ingest nodes: 3
Client nodes: 2
Master nodes: 5
Total: 44

§ Recall that:
§ NiFi can collect @ 50 MB/s/node
§ Kafka can ingest @ 10 MB/s/node, or 100,000 events/s/node
§ Storm can process @ 100,000 events/s/node
§ Each HBase Region Server can store 1 TB

§ So for a 6.4 MB/s ingest rate, 1 NiFi, 1 Kafka and 1 Storm node are sufficient.
§ We will use 2 NiFi and 3 Kafka nodes for HA.
§ HBase nodes needed = 8.3 TB / 1 TB ≈ 8 nodes
§ Co-locate Kafka and Storm.
§ Co-locate DataNode and HBase.
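A sketch that turns the per-node capacities listed above into node counts; the HA minimums of 2 NiFi and 3 Kafka nodes and all throughput figures come from the slide:

```python
import math

ingest_mb_per_s = 6.4        # 5,000 trucks x 10 events/s x 128 bytes
events_per_s = 50_000        # 5,000 trucks x 10 events/s
hbase_data_tb = 8.3          # 15-day retention window

nifi = max(2, math.ceil(ingest_mb_per_s / 50))    # 50 MB/s per NiFi node, min 2 for HA
kafka = max(3, math.ceil(ingest_mb_per_s / 10))   # 10 MB/s per Kafka node, min 3 for HA
storm = max(1, math.ceil(events_per_s / 100_000)) # 100,000 events/s per Storm node
hbase = round(hbase_data_tb / 1.0)                # ~1 TB per HBase Region Server

print(nifi, kafka, storm, hbase)  # 2 3 1 8
```

With Storm co-located on the Kafka nodes and HBase on the DataNodes, only the NiFi and ingest nodes are additional hardware.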

Page 68: Solving Big Data Problems using Hortonworks


[Diagram: Megacorp datacenter — Trucks 1–5000 stream events to 2 NiFi edge nodes (HDF); 3 ingest nodes co-locate Storm 1–3 with Kafka 1–3; 32 worker nodes run DataNodes, with HBase Region Servers 1–8 co-located on DataNodes 1–8; 5 master nodes and 2 client nodes complete the HDP cluster.]

Page 69: Solving Big Data Problems using Hortonworks


HDP Service Layout

[Diagram: service placement — 5 master nodes host NameNode 1–2, Resource Manager 1–2, HBase Master 1–2, ZooKeeper ×3, JournalNode ×3, HiveServer2, WebHCat, Oozie, Falcon, History Server, Timeline Server, Ambari, and Monitoring & Metrics; Worker Nodes 1–32 each run a NodeManager, DataNode and HBase Region Server; Ingest Nodes 1–3 co-locate Storm and Kafka; Edge Nodes 1–2 carry the client tools and Knox.]

Page 70: Solving Big Data Problems using Hortonworks


Master Node Specs

– 12+ cores
– 128–256 GB RAM
– 1 × 256 GB SSD drive for OS
– 2 × 1 TB drives
– 2 × 1–10 Gb switch

Approximate cost per node: $8,000 – $18,000

Page 71: Solving Big Data Problems using Hortonworks


NiFi Node Specs

– 8+ cores
– 16 GB RAM
– 1 × 256 GB SSD drive for OS
– 2 × 1 TB drives
– 2 × 1–10 Gb switch

Approximate cost per node: $5,000 – $8,000

Page 72: Solving Big Data Problems using Hortonworks


Slave Node Specs

– 12+ cores
– 32–64 GB RAM
– 12 × 1 TB SATA drives (processing/IOPS-optimized)
– 12 × 2 TB SATA drives (balanced)
– 12 × 4 TB SATA drives (storage-optimized)
– 1 × 1–10 Gb switch

Approximate cost per node: $5,000 – $12,000

Page 73: Solving Big Data Problems using Hortonworks


IoT on HDP

Problem Statement

Reference Architecture& Sizing

Solution Design& Customer Case Studies

Implementation Plan


Project Cost & ROI

Page 74: Solving Big Data Problems using Hortonworks


Project Plan

Strategy (10 days): Use Case Workshop
Training (10 days): Cluster Build-out
Design & Build (60 days): Solution Build-out
Test (30 days): Prove-out
Promote (10 days): Promote Solution

Tam puts together a quick project plan and estimates it will take 120 days to get Ms. Brady her solution.

Page 75: Solving Big Data Problems using Hortonworks


Resource Plan

– Data Scientist Consultant: Tam
– Data Flow Consultant: Varun
– Architect Consultant: Jeff
– Developer Consultant: Sue
– Project Manager: Jen
– Engagement Manager Consultant: Jim
– Enterprise Architect: Frank
– Business Analyst: Sue
– Developer: Jim

Page 76: Solving Big Data Problems using Hortonworks

IoT on HDP

Problem Statement

Reference Architecture& Sizing

Solution Design& Customer Case Studies

Implementation Plan


Project Cost & ROI

Page 77: Solving Big Data Problems using Hortonworks

Project Cost

Component                 Quantity     Unit Cost      Total Cost
Hardware                  44           $10,000        $440K
Software – HDP            11 SKUs      $18,000/SKU    $198K
Software – HDF            2 SKUs       $36,000/SKU    $72K
Dev and Test Consulting   3,040 hrs*   $300/hr        $912K
Engagement Consulting     360 hrs*     $300/hr        $108K
Training                  30**         $2,500         $75K
Travel & Expense                                      $100K
Total                                                 $1.905M

* 4 resources × 8 hrs × 95 days; engagement manager for 45 days
** Admin, Analyst & Data Science training for 30 associates
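Totaling the line items (they sum to $1.905M) and netting them against the $5M insurance saving:

```python
costs = {
    "Hardware": 44 * 10_000,
    "Software - HDP": 11 * 18_000,
    "Software - HDF": 2 * 36_000,
    "Dev and Test Consulting": 3040 * 300,
    "Engagement Consulting": 360 * 300,
    "Training": 30 * 2_500,
    "Travel & Expense": 100_000,
}

total = sum(costs.values())
print(f"${total:,}")  # $1,905,000

# First-year savings against the $5M insurance cost reduction
print(f"${5_000_000 - total:,}")  # $3,095,000, i.e. ~$3.1M
```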

Page 78: Solving Big Data Problems using Hortonworks


Project ROI
§ Insurance cost reduction – $5M

§ Project cost – $1.905M

§ First-year savings ≈ $3.1M

Page 79: Solving Big Data Problems using Hortonworks

Tweet: #hadooproadshow

Thank You